# Logistic Regression Lab

|                |   |
|:---------------|---|
| **Name**       | Isabela Torres-Septien Uribe |
| **Date**       | 12/10/2025 |
| **Student ID** | 730667 |

In machine learning, Support Vector Machines (SVMs) are supervised learning models, with associated learning algorithms, that analyze data for classification and regression. They are mostly used in classification problems. Each data item is plotted as a point in $p$-dimensional space (where $p$ is the number of features), with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyperplane that best separates the two classes (or more, in a multi-class problem):

$$ f(x) = w^T \varphi(x) + b $$

where $\varphi: X \rightarrow F$ is a function that maps each input point $x$ to a point in $F$, and $F$ is a Hilbert space.

In addition to performing linear classification, SVMs can efficiently perform non-linear classification by implicitly mapping their inputs into high-dimensional feature spaces (more specifically, by using the kernel trick with a kernel such as the RBF function).

[1]

OLS utilizes the squared residuals to fit the parameters. Large residuals caused by outliers may worsen the accuracy significantly.

Support vector methods counter this with a piecewise linear loss: a hyperparameter $\epsilon$, called the margin, sets any error $e$ with $|e| \leq \epsilon$ to 0 and penalizes any larger error by $|e| - \epsilon$.
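As a small illustration of the contrast described above, the sketch below compares the squared loss with an $\epsilon$-insensitive loss on a few residuals. The residual values, the choice $\epsilon = 0.1$, and the function names are assumptions made for this example only.

```python
def squared_loss(residual):
    """Squared residual, as used by OLS."""
    return residual ** 2


def epsilon_insensitive_loss(residual, epsilon):
    """Zero inside the epsilon band, linear in |residual| outside it."""
    return max(0.0, abs(residual) - epsilon)


# Hypothetical residuals; the last one plays the role of an outlier.
residuals = [0.05, -0.1, 0.5, 10.0]
epsilon = 0.1  # assumed band width

for r in residuals:
    print(r, squared_loss(r), epsilon_insensitive_loss(r, epsilon))
```

The outlier contributes 100 to the squared loss but only about 9.9 to the $\epsilon$-insensitive loss, which is why it pulls the fit far less.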

The problem to solve is:

$$
\begin{split}
        \min_{w, b, \xi} \mathcal{P}_\epsilon(w, b, \xi) &= \frac{1}{2} w^T w + c \sum_{k=1}^{N} \xi_k \\
        \text{s.t. } & y_k [w^T \varphi(x_k) + b] \geq 1- \xi_k,\ \ k = 1, ..., N \\
        & \xi_k \geq 0,\ \ k = 1, ..., N
\end{split}
$$
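To make the objective concrete, the following sketch evaluates it for a fixed, hand-picked $w$ and $b$, using $f(x) = w^T \varphi(x) + b$ as defined earlier. It assumes $\varphi$ is the identity map and takes the smallest feasible slack for each point, $\xi_k = \max(0, 1 - y_k f(x_k))$. The data, $w$, $b$, and $c$ are hypothetical; no optimization is performed.

```python
def decision(w, b, x):
    """f(x) = w^T x + b, with phi assumed to be the identity map."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def slacks(w, b, X, y):
    """Smallest slack satisfying y_k * f(x_k) >= 1 - xi_k and xi_k >= 0."""
    return [max(0.0, 1.0 - yk * decision(w, b, xk)) for xk, yk in zip(X, y)]


def primal_objective(w, b, X, y, c):
    """0.5 * w^T w + c * sum of slacks."""
    return 0.5 * sum(wi * wi for wi in w) + c * sum(slacks(w, b, X, y))


# Hypothetical toy data with labels in {-1, +1}.
X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0), (0.5, 0.0)]
y = [1, 1, -1, -1]
w, b, c = (1.0, 0.0), 0.0, 1.0

xi = slacks(w, b, X, y)
objective = primal_objective(w, b, X, y, c)
print(xi)         # slack per point
print(objective)  # value of the objective for this (w, b)
```

Points outside the margin get zero slack, a point inside the margin gets a slack between 0 and 1, and a misclassified point gets a slack greater than 1.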


The most important question that arises when using an SVM is how to choose the correct hyperplane. Consider the following scenarios:

### Scenario 1

In this scenario there are three hyperplanes called A, B, and C. Now, the problem is to identify the hyperplane which best differentiates the stars and the circles.

<center><img src="https://media.geeksforgeeks.org/wp-content/uploads/SVM_21-2.png" alt="Three candidate hyperplanes A, B, and C drawn over stars and circles"></center>

In this case, hyperplane B separates the stars and the circles best, so it is the correct hyperplane.


### Scenario 2

Now take another scenario where all three hyperplanes are segregating classes well. The question that arises is how to choose the best hyperplane in this situation.

<center><img src="https://media.geeksforgeeks.org/wp-content/uploads/SVM_4-2.png" alt="Three hyperplanes A, B, and C that all separate the two classes, with different margins"></center>

In such scenarios, we calculate the margin, which is the distance between the hyperplane and the nearest data point. The hyperplane with the largest margin is considered the correct hyperplane to classify the dataset.

Here C has the largest margin. Hence, it is considered as the best hyperplane.
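The margin comparison can be sketched numerically. The example below computes, for each candidate hyperplane $w^T x + b = 0$, the distance to the nearest point and picks the largest. The points and the three hyperplanes are hypothetical values chosen so that C wins, mirroring the scenario above; they are not read from the figure.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w^T x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, X, y):
    """Distance to the nearest point; negative if any point is misclassified."""
    return min(yk * signed_distance(w, b, xk) for xk, yk in zip(X, y))


# Hypothetical data: two points per class, labels in {-1, +1}.
X = [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0), (5.0, 5.0)]
y = [-1, -1, 1, 1]

# Hypothetical candidates, each given as (w, b).
candidates = {
    "A": ((1.0, 0.0), -2.5),  # the line x1 = 2.5
    "B": ((0.0, 1.0), -3.5),  # the line x2 = 3.5
    "C": ((1.0, 0.0), -3.0),  # the line x1 = 3.0
}

margins = {name: margin(w, b, X, y) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins)
print(best)
```

All three candidates separate the classes here, so the choice is made purely on the margin.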


### Kernels
Knowing 
$$ w = \sum_{k=1}^{N} \alpha_k y_k \varphi(x_k) $$

And
$$ y_{pred} = w^T \varphi(x) + b $$

Then 
$$ y_{pred} = (\sum_{k=1}^{N} \alpha_k y_k \varphi(x_k))^T \varphi(x) + b $$

Here $\varphi$ is a function that maps each input $x$ to a point in $F$ (a Hilbert space). This can be seen as processing and transforming the input features while keeping the model's convexity. [2]

This also allows us to transform the inputs into another space where they might be more easily classified.

<center><img src="https://miro.medium.com/max/838/1*gXvhD4IomaC9Jb37tzDUVg.png" alt="Inputs mapped into another space where the classes are easier to separate"></center>
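The expansion of $y_{pred}$ above only needs inner products $\varphi(x_k)^T \varphi(x)$, so these can be replaced by a kernel function $K(x_k, x)$ without ever computing $\varphi$. The sketch below checks this for a degree-2 polynomial kernel $K(x, z) = (x^T z)^2$ in two dimensions, whose explicit feature map is known. The values of $\alpha_k$, $b$, and the data points are hypothetical and not the result of any training.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map for K(x, z) = (x^T z)^2 in two dimensions."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly_kernel(x, z):
    """Same inner product as dot(phi(x), phi(z)), computed in input space."""
    return dot(x, z) ** 2


def predict_explicit(alphas, ys, xs, b, x):
    """Build w = sum_k alpha_k y_k phi(x_k), then return w^T phi(x) + b."""
    w = [sum(a * yk * phi(xk)[j] for a, yk, xk in zip(alphas, ys, xs))
         for j in range(3)]
    return dot(w, phi(x)) + b


def predict_kernel(alphas, ys, xs, b, x):
    """Return sum_k alpha_k y_k K(x_k, x) + b without computing phi."""
    return sum(a * yk * poly_kernel(xk, x)
               for a, yk, xk in zip(alphas, ys, xs)) + b


# Hypothetical dual coefficients, labels, points, and bias.
alphas = [0.5, 0.25, 0.75]
ys = [1, -1, 1]
xs = [(1.0, 2.0), (-1.0, 0.5), (0.0, -2.0)]
b = 0.1
x_new = (1.5, -0.5)

explicit = predict_explicit(alphas, ys, xs, b, x_new)
kernelized = predict_kernel(alphas, ys, xs, b, x_new)
print(explicit, kernelized)
```

Both routes give the same prediction up to floating-point rounding; the kernel route never builds the three-dimensional feature vectors.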

## ROC and AUC

A ROC (Receiver Operating Characteristic) curve is a graph that shows how a classification model performs across all classification thresholds.

ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis. This means that the top left corner of the plot is the “ideal” point - a false positive rate of zero, and a true positive rate of one. This is not very realistic, but it does mean that a larger area under the curve (AUC) is usually better. [3]

True Positive Rate is a synonym for Recall and is defined as:
$$ TPR = \frac{TP}{TP + FN} $$

False Positive Rate equals $1 - \text{Specificity}$ and is defined as:

$$ FPR = \frac{FP}{FP + TN} $$
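Both rates follow directly from the confusion counts. The sketch below computes them for a small set of hypothetical labels and hard predictions (1 = positive, 0 = negative); the function names are chosen for this example.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def rates(y_true, y_pred):
    """Return (TPR, FPR) using the definitions above."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical labels and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tpr, fpr = rates(y_true, y_pred)
print(tpr, fpr)
```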

ROC curves are typically used in binary classification to study the output of a classifier. In order to extend ROC curve and ROC area to multi-label classification, it is necessary to binarize the output. One ROC curve can be drawn per label, but one can also draw a ROC curve by considering each element of the label indicator matrix as a binary prediction (micro-averaging).

E.g. If you lower a classification threshold, more items would be classified as positive, increasing False Positives and True Positives.

AUC stands for Area Under the ROC Curve.
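A minimal sketch of how a ROC curve and its AUC can be computed by hand: sweep the threshold from the highest score downwards, record (FPR, TPR) at each step, and integrate with the trapezoidal rule. The labels and scores are hypothetical, and this is an illustration rather than the routine scikit-learn uses internally.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold step by step."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the piecewise-linear curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical binary labels and classifier scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
area = auc(points)
print(points)
print(area)
```

Each lower threshold can only add predicted positives, so both coordinates are non-decreasing along the curve, which matches the threshold example given earlier.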

## Exercise 1

- Use the `Iris` dataset, fit an SVC, and cross-validate different kernels ('linear', 'poly', 'rbf', 'sigmoid').
- Fit a LogisticRegression model.
- The cross-validation method is K-Fold with $k=10$.
- Use AUC as the cross-validation metric.
- Compare the results.

In [1]:
from sklearn.datasets import load_iris

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.svm import SVC

from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

import pandas as pd
import numpy as np



In [2]:
# Iris dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target



In [3]:

# Preprocessing: standardize every numerical feature
numerical_transformer = StandardScaler()
numerical_features = X.columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features)
    ])

# Cross-validation: stratified K-Fold with k = 10
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

# Models to compare: logistic regression plus one SVC per kernel
models = [
    ('Logistic Regression', LogisticRegression(max_iter=1000, multi_class='ovr'))]

# Add one SVC for each kernel under evaluation
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
for k in kernels:
    models.append((f"SVC {k}", SVC(kernel=k, probability=True)))


# Evaluate every model with AUC and collect the results
resultados = []
for nombre, modelo in models:
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', modelo)
    ])
    
    # One-vs-rest AUC for each of the 10 folds
    auc_scores = cross_val_score(pipeline, X, y, cv=cv, scoring='roc_auc_ovr')
    
    # Mean and standard deviation of the fold scores
    resultados.append({
        'Modelo': nombre,
        'AUC promedio': np.mean(auc_scores),
        'AUC std': np.std(auc_scores)
    })

df_resultados = pd.DataFrame(resultados)
df_resultados





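|   | Modelo              | AUC promedio | AUC std  |
|---|---------------------|--------------|----------|
| 0 | Logistic Regression | 0.982        | 0.02291  |
| 1 | SVC linear          | 1.0          | 0.0      |
| 2 | SVC poly            | 0.997333     | 0.008    |
| 3 | SVC rbf             | 0.998667     | 0.004    |
| 4 | SVC sigmoid         | 0.983333     | 0.029098 |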


This experiment compares the kernels on the Iris dataset. The linear SVC is undoubtedly the best fit, with perfect results (mean AUC of 1.0 and zero standard deviation), which makes me doubt a little whether the model is really that good or whether there is a problem somewhere. On the other hand, most of the other models also achieve near-perfect results, which is reassuring regarding our dataset. In my view all the models perform very well, with minimal variability (the largest standard deviation is 0.029) and a mean AUC above 98% in every case.

## Exercise 2
- Repeat Exercise 1 with the `Default` dataset. Use `default` as the target.

In [4]:
# Default dataset
data = pd.read_csv("Default.csv")
X = data.drop(columns=['default'])
X = pd.get_dummies(X, drop_first=True)

# Encode the target as a 1-D 0/1 Series; pd.get_dummies would return a
# one-column DataFrame, which triggers scikit-learn's column_or_1d warning
y = data["default"].map({'Yes': 1, 'No': 0})


In [5]:

# Preprocessing: standardize every feature (dummy columns included)
numerical_transformer = StandardScaler()
numerical_features = X.columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features)
    ])

# Cross-validation: stratified K-Fold with k = 10
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

# Models to compare: logistic regression plus one SVC per kernel
models = [
    ('Logistic Regression', LogisticRegression(max_iter=1000, multi_class='ovr'))]

# Add one SVC for each kernel under evaluation
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
for k in kernels:
    models.append((f"SVC {k}", SVC(kernel=k, probability=True)))


# Evaluate every model with AUC and collect the results
resultados = []
for nombre, modelo in models:
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', modelo)
    ])
    
    # AUC for each of the 10 folds
    auc_scores = cross_val_score(pipeline, X, y, cv=cv, scoring='roc_auc_ovr')
    
    # Mean and standard deviation of the fold scores
    resultados.append({
        'Modelo': nombre,
        'AUC promedio': np.mean(auc_scores),
        'AUC std': np.std(auc_scores)
    })

df_resultados = pd.DataFrame(resultados)
df_resultados




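|   | Modelo              | AUC promedio | AUC std  |
|---|---------------------|--------------|----------|
| 0 | Logistic Regression | 0.948675     | 0.014793 |
| 1 | SVC linear          | 0.906549     | 0.023973 |
| 2 | SVC poly            | 0.878235     | 0.025146 |
| 3 | SVC rbf             | 0.84151      | 0.027697 |
| 4 | SVC sigmoid         | 0.731859     | 0.026538 |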


Here we compare the same models on the Default dataset, looking for the best performance with the least variability. Logistic regression is the best by a clear margin, with the highest mean AUC (0.949) and the lowest standard deviation (0.015). Since none of the models shows a high standard deviation across folds, overfitting appears unlikely, and in general all the models achieve good results. The least favorable model, in my view, is the SVC with the sigmoid kernel.

# Addendum

Metrics available for classification:
- ‘accuracy’
- ‘balanced_accuracy’
- ‘top_k_accuracy’
- ‘average_precision’
- ‘neg_brier_score’
- ‘f1’
- ‘f1_micro’
- ‘f1_macro’
- ‘f1_weighted’
- ‘f1_samples’
- ‘neg_log_loss’
- ‘precision’ etc.
- ‘recall’ etc.
- ‘jaccard’ etc.
- ‘roc_auc’
- ‘roc_auc_ovr’
- ‘roc_auc_ovo’
- ‘roc_auc_ovr_weighted’
- ‘roc_auc_ovo_weighted’
- ‘d2_log_loss_score’

# References

[1] Shigeo Abe. Support Vector Machines for Pattern Classification, 2nd ed. Springer-Verlag London, 2010. ISBN 978-1-84996-097-7. URL https://www.springer.com/gp/book/9781849960977.

[2] Johan A. K. Suykens, Tony Van Gestel, Jos De Brabanter, Bart De Moor, and Joos Vandewalle. Least Squares Support Vector Machines. World Scientific, 2002. ISBN 9789812381514. URL https://www.worldscientific.com/worldscibooks/10.1142/5089.

[3] A. P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145-1159, 1997. URL https://www.researchgate.net/post/how_can_I_interpret_the_ROC_curve_result