# Logistic regression lab

|                |   |
|:---------------|---|
| **Name**       | André Esteban Vera |
| **Date**       | 12/10/2025 |
| **Student ID** | 745232 |

In machine learning, Support Vector Machines (SVMs) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. They are mostly used in classification problems. In this algorithm, each data item is plotted as a point in p-dimensional space (where p is the number of features), with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyperplane that best separates the two classes (or more, in a multi-class problem):

$$ f(x) = w^T \varphi(x) + b $$

where $\varphi: X \rightarrow F $ is a function that makes each input point $x$ correspond to a point in F, where F is a Hilbert space.

In addition to performing linear classification, SVMs can efficiently perform non-linear classification by implicitly mapping their inputs into high-dimensional feature spaces through the kernel trick, for example with the RBF kernel. [1]

OLS utilizes the squared residuals to fit the parameters. Large residuals caused by outliers may worsen the accuracy significantly.

Support vector regression counters this with a piecewise linear loss: a hyperparameter $\epsilon$, called the margin here, sets the loss to 0 for errors with $|e| \leq \epsilon$ and to $|e| - \epsilon$ for larger errors.
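To make the contrast with squared residuals concrete, the following minimal sketch compares both losses on a few residuals. The residual values and `eps = 0.1` are hypothetical and chosen only for illustration; the loss is assumed to act on the absolute residual.

```python
def squared_loss(e):
    """Squared residual, as used by OLS."""
    return e ** 2


def eps_insensitive_loss(e, eps):
    """Zero inside the tube |e| <= eps, linear in |e| outside it."""
    return max(0.0, abs(e) - eps)


# Hypothetical residuals; the last one plays the role of an outlier.
residuals = [0.05, -0.1, 0.5, 10.0]
eps = 0.1  # hypothetical tolerance

sq = [squared_loss(e) for e in residuals]
ei = [eps_insensitive_loss(e, eps) for e in residuals]

for e, s, l in zip(residuals, sq, ei):
    print(f"e={e:6.2f}  squared={s:9.4f}  eps-insensitive={l:7.4f}")
```

The outlier contributes 100 to the squared loss but only about 9.9 to the piecewise linear loss, which is why a single large residual pulls the fit much less.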

The problem to solve is:

$$
\begin{split}
        \min_{w, b, \xi} \mathcal{P}_\epsilon(w, b, \xi) &= \frac{1}{2} w^T w + c \sum_{k=1}^{N} \xi_k \\
        \text{s.t. } & y_k [w^T \varphi(x_k) + b] \geq 1- \xi_k,\ \ k = 1, \ldots, N \\
        & \xi_k \geq 0,\ \ k = 1, \ldots, N
\end{split}
$$
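As a rough illustration of this objective, the sketch below evaluates it for a fixed candidate $(w, b)$, using $f(x) = w^T \varphi(x) + b$ as defined at the start. It assumes $\varphi$ is the identity map and takes each slack as the smallest feasible value, $\xi_k = \max(0, 1 - y_k f(x_k))$. The toy data, `w`, `b`, and `c` are hypothetical.

```python
def decision(w, b, x):
    """f(x) = w^T x + b, with phi taken as the identity map (an assumption)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def primal_objective(w, b, X, y, c):
    """Return 0.5 * w^T w + c * sum(xi_k) and the slacks xi_k."""
    # Smallest slack that satisfies y_k * f(x_k) >= 1 - xi_k and xi_k >= 0.
    slacks = [max(0.0, 1.0 - yk * decision(w, b, xk)) for xk, yk in zip(X, y)]
    value = 0.5 * sum(wi * wi for wi in w) + c * sum(slacks)
    return value, slacks


# Hypothetical toy data with labels in {+1, -1}.
X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0), (0.5, 1.0)]
y = [1, 1, -1, -1]
w, b, c = (1.0, 0.0), 0.0, 1.0

value, slacks = primal_objective(w, b, X, y, c)
print(slacks)  # [0.0, 0.5, 0.0, 1.5]
print(value)   # 2.5
```

Points that lie beyond the margin on the correct side get zero slack; the second point sits inside the margin (slack 0.5) and the fourth is misclassified (slack above 1). A larger `c` penalizes such violations more heavily.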


The most important question that arises when using an SVM is how to choose the correct hyperplane. Consider the following scenarios:

### Scenario 1

In this scenario there are three hyperplanes called A, B, and C. Now, the problem is to identify the hyperplane which best differentiates the stars and the circles.

<center><img src="https://media.geeksforgeeks.org/wp-content/uploads/SVM_21-2.png" alt="what image shows"></center>

In this case, hyperplane B separates the stars and the circles best, so it is the correct hyperplane.


### Scenario 2

Now take another scenario where all three hyperplanes separate the classes well. The question is how to choose the best hyperplane in this situation.

<center><img src="https://media.geeksforgeeks.org/wp-content/uploads/SVM_4-2.png" alt="what image shows"></center>

In such scenarios, we calculate the margin, which is the distance between the hyperplane and the nearest data point. The hyperplane with the largest margin is considered the correct one to classify the dataset.

Here, C has the largest margin, so it is considered the best hyperplane.
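The margin comparison can be sketched numerically. The points and the three candidate hyperplanes below are hypothetical (they are not read from the figure); each hyperplane is given as a pair $(w, b)$ and the margin is the smallest distance $|w^T x + b| / \lVert w \rVert$ over all points.

```python
import math


def margin(w, b, points):
    """Smallest distance |w.x + b| / ||w|| from any point to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm for x in points
    )


# Hypothetical 2-D points: the first two belong to one class, the last two to the other.
points = [(1.0, 1.0), (2.0, 3.0), (5.0, 5.0), (6.0, 4.0)]

# Hypothetical separating lines, all of which split the two classes correctly.
candidates = {
    "A": ((1.0, 0.0), -2.5),  # vertical line x1 = 2.5
    "B": ((1.0, 0.0), -3.5),  # vertical line x1 = 3.5
    "C": ((1.0, 1.0), -7.5),  # line x1 + x2 = 7.5
}

margins = {name: margin(w, b, points) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)

for name, m in margins.items():
    print(f"{name}: margin = {m:.4f}")
print("largest margin:", best)
```

With these made-up values, all three lines classify the points correctly, but C keeps the nearest point farthest away, so it would be preferred.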


### Kernels
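Given
$$ w = \sum_{k=1}^{N} \alpha_k y_k \varphi(x_k) $$

and
$$ y_{pred} = w^T \varphi(x) + b $$

it follows that
$$ y_{pred} = \left(\sum_{k=1}^{N} \alpha_k y_k \varphi(x_k)\right)^T \varphi(x) + b $$

where $\varphi$ is a function that makes each input $x$ correspond to a point in $F$ (a Hilbert space). This can be seen as processing and transforming the input features while keeping the model's convexity. [2] Because the prediction depends on the data only through the inner products $\varphi(x_k)^T \varphi(x)$, these can be replaced by a kernel function $K(x_k, x)$ without computing $\varphi$ explicitly, which is the kernel trick.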

This also allows us to transform the inputs into another space where they might be more easily classified.

<center><img src="https://miro.medium.com/max/838/1*gXvhD4IomaC9Jb37tzDUVg.png" alt="what image shows"></center>
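The prediction formula above only needs inner products of mapped points. As a small sketch, the code below uses the degree-2 polynomial kernel $K(x, z) = (x^T z)^2$ on 2-D inputs, whose explicit feature map is $\varphi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)$, and checks that both routes give the same prediction. The support points, `alphas`, `labels`, and `b` are hypothetical values, not the result of training.

```python
import math


def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * c for a, c in zip(u, v))


def phi(x):
    """Explicit feature map for the kernel (x.z)^2 on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly2_kernel(x, z):
    """K(x, z) = (x.z)^2, computed without building phi."""
    return dot(x, z) ** 2


def explicit_kernel(x, z):
    """Same quantity, computed as phi(x).phi(z) in the feature space."""
    return dot(phi(x), phi(z))


def predict(x, support, alphas, labels, b, kernel):
    """y_pred = sum_k alpha_k * y_k * K(x_k, x) + b."""
    return sum(
        a * yk * kernel(xk, x) for a, yk, xk in zip(alphas, labels, support)
    ) + b


# Hypothetical support points and coefficients (not obtained by training).
support = [(1.0, 2.0), (3.0, 4.0)]
alphas = [0.5, 0.25]
labels = [1, -1]
b = 0.1
x_new = (1.0, 0.0)

pred_kernel = predict(x_new, support, alphas, labels, b, poly2_kernel)
pred_explicit = predict(x_new, support, alphas, labels, b, explicit_kernel)
print(pred_kernel, pred_explicit)
```

Both predictions agree up to floating-point rounding, while the kernel version never leaves the original 2-D space. For kernels such as RBF, the feature space is infinite-dimensional, so only the kernel route is practical.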

## ROC and AUC

A ROC (Receiver Operating Characteristic) curve is a graph that shows how a classification model performs across all classification thresholds.

ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis. This means that the top left corner of the plot is the “ideal” point - a false positive rate of zero, and a true positive rate of one. This is not very realistic, but it does mean that a larger area under the curve (AUC) is usually better. [3]

The True Positive Rate (TPR) is a synonym for recall and is defined as:
$$ TPR = \frac{TP}{TP + FN} $$

The False Positive Rate (FPR) equals $1 - \text{specificity}$ and is defined as:

$$ FPR = \frac{FP}{FP + TN} $$

ROC curves are typically used in binary classification to study the output of a classifier. In order to extend ROC curve and ROC area to multi-label classification, it is necessary to binarize the output. One ROC curve can be drawn per label, but one can also draw a ROC curve by considering each element of the label indicator matrix as a binary prediction (micro-averaging).

For example, lowering the classification threshold classifies more items as positive, which increases both false positives and true positives.

AUC stands for Area Under the ROC Curve.
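A minimal sketch of how the curve and its area arise from sweeping the threshold, written in plain Python so that every step is visible. The labels and scores are a hypothetical four-sample example; in practice `sklearn.metrics.roc_curve` and `roc_auc_score` compute the same quantities.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the threshold step by step."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    # Threshold above every score: nothing is predicted positive.
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the piecewise-linear curve, by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical labels (1 = positive) and classifier scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
area = auc(points)
print(points)  # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(area)    # 0.75
```

Each step down in the threshold can only add predicted positives, so both coordinates are non-decreasing along the curve.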

Metrics:

- Accuracy: Proportion of correct predictions (both true positives and true negatives) out of all predictions made.

- Precision: Measures what proportion of the examples the model predicted as positive are actually positive.

- Recall: Measures what proportion of the actual positive examples were correctly identified by the model.

- Specificity: Measures what proportion of the actual negatives were correctly classified.
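These four metrics follow directly from the confusion-matrix counts. The sketch below computes them for hypothetical counts (`tp=40, fp=10, tn=45, fn=5`); the function name is illustrative only.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the four metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # also the true positive rate
        "specificity": tn / (tn + fp),   # equals 1 - false positive rate
    }


# Hypothetical counts for a binary classifier evaluated on 100 samples.
m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Note that accuracy alone can look good on imbalanced data, which is one reason to report recall and specificity alongside it.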

## Exercise 1

- Use the `Iris` dataset, model it with SVC, and cross-validate different kernels ('linear', 'poly', 'rbf', 'sigmoid').
- Model it with LogisticRegression.
- The cross-validation method is K-Folds with $k=10$.
- Use AUC as the cross-validation metric.
- Compare the results.

In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

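# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Hold-out split (assumed: test_size=0.2, inferred from the shapes printed below)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print("Training set:", X_train.shape, y_train.shape)
print("Test set:", X_test.shape, y_test.shape)

Training set: (120, 4) (120,)
Test set: (30, 4) (30,)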


In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, roc_auc_score

# Multiclass AUC (3 classes) with the built-in one-vs-rest scorer;
# make_scorer(..., needs_proba=True) is deprecated in recent scikit-learn releases
auc_scorer = 'roc_auc_ovr'

# Stratified cross-validation
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Kernel linear
svc_linear = SVC(kernel='linear', probability=True, random_state=42)
scores_linear = cross_val_score(svc_linear, X, y, cv=cv, scoring=auc_scorer)
print(f"SVC (linear): Mean AUC = {scores_linear.mean():.4f}, Std = {scores_linear.std():.4f}")


# Kernel poly
svc_poly = SVC(kernel='poly', probability=True, random_state=42)
scores_poly = cross_val_score(svc_poly, X, y, cv=cv, scoring=auc_scorer)
print(f"SVC (poly): Mean AUC = {scores_poly.mean():.4f}, Std = {scores_poly.std():.4f}")


# Kernel rbf
svc_rbf = SVC(kernel='rbf', probability=True, random_state=42)
scores_rbf = cross_val_score(svc_rbf, X, y, cv=cv, scoring=auc_scorer)
print(f"SVC (rbf): Mean AUC = {scores_rbf.mean():.4f}, Std = {scores_rbf.std():.4f}")

# Kernel sigmoid
svc_sigmoid = SVC(kernel='sigmoid', probability=True, random_state=42)
scores_sigmoid = cross_val_score(svc_sigmoid, X, y, cv=cv, scoring=auc_scorer)
print(f"SVC (sigmoid): Mean AUC = {scores_sigmoid.mean():.4f}, Std = {scores_sigmoid.std():.4f}")




SVC (linear): Mean AUC = 0.9960, Std = 0.0085
SVC (poly): Mean AUC = 0.9960, Std = 0.0085
SVC (rbf): Mean AUC = 0.9973, Std = 0.0080
SVC (sigmoid): Mean AUC = 0.7467, Std = 0.0490


In [4]:
# Logistic regression
logreg = LogisticRegression(max_iter=1000, random_state=42)
scores_logreg = cross_val_score(logreg, X, y, cv=cv, scoring=auc_scorer)
print(f"Logistic Regression: Mean AUC = {scores_logreg.mean():.4f}, Std = {scores_logreg.std():.4f}")


Logistic Regression: Mean AUC = 0.9960, Std = 0.0085


In [5]:
import pandas as pd

resultados = pd.DataFrame({
    'Model': [
        'SVC (linear)', 
        'SVC (poly)', 
        'SVC (rbf)', 
        'SVC (sigmoid)', 
        'Logistic Regression'
    ],
    'Mean AUC': [
        scores_linear.mean(), 
        scores_poly.mean(), 
        scores_rbf.mean(), 
        scores_sigmoid.mean(), 
        scores_logreg.mean()
    ],
    'Std AUC': [
        scores_linear.std(), 
        scores_poly.std(), 
        scores_rbf.std(), 
        scores_sigmoid.std(), 
        scores_logreg.std()
    ]
})

resultados


|   | Model | Mean AUC | Std AUC |
|---|-------|----------|---------|
| 0 | SVC (linear) | 0.996 | 0.008537 |
| 1 | SVC (poly) | 0.996 | 0.008537 |
| 2 | SVC (rbf) | 0.997333 | 0.008 |
| 3 | SVC (sigmoid) | 0.746667 | 0.04899 |
| 4 | Logistic Regression | 0.996 | 0.008537 |


### Interpretation and conclusions:
- SVC (rbf) has the best performance (mean AUC 0.9973), although the differences from linear, poly, and LogisticRegression (0.9960 each) are minimal.

- Logistic Regression performs as well as the SVCs with linear and poly kernels, which is a good result considering that it is more interpretable and simpler to fit.

- SVC (sigmoid) is not competitive here (mean AUC 0.7467) and should be avoided without further tuning.

## Exercise 2
- Repeat Exercise 1 with the `Default` dataset. Use `default` as the target.

In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import make_scorer, roc_auc_score

# Load the dataset
data = pd.read_csv("Default.csv")
data.head()


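|   | default | student | balance | income |
|---|---------|---------|---------|--------|
| 0 | No | No | 729.526495 | 44361.625074 |
| 1 | No | Yes | 817.180407 | 12106.1347 |
| 2 | No | No | 1073.549164 | 31767.13895 |
| 3 | No | No | 529.250605 | 35704.49394 |
| 4 | No | No | 785.655883 | 38463.49588 |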


In [8]:
# Convert 'default' and 'student' to binary (Yes/No to 1/0)
data['default'] = data['default'].map({'Yes': 1, 'No': 0})

data['student'] = data['student'].map({'Yes': 1, 'No': 0})

# Separate the predictors from the target
X = data.drop('default', axis=1)
y = data['default']

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [9]:
# Stratified cross-validation
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# AUC metric (binary: 2 classes) with the built-in scorer;
# make_scorer(..., needs_proba=True) is deprecated in recent scikit-learn releases
auc_scorer = 'roc_auc'




In [10]:
# Kernel linear
svc_linear = SVC(kernel='linear', probability=True, random_state=42)
scores_linear = cross_val_score(svc_linear, X_scaled, y, cv=cv, scoring=auc_scorer)
print(f"SVC (linear): Mean AUC = {scores_linear.mean():.4f}, Std = {scores_linear.std():.4f}")

# Kernel poly
svc_poly = SVC(kernel='poly', probability=True, random_state=42)
scores_poly = cross_val_score(svc_poly, X_scaled, y, cv=cv, scoring=auc_scorer)
print(f"SVC (poly): Mean AUC = {scores_poly.mean():.4f}, Std = {scores_poly.std():.4f}")

# Kernel rbf
svc_rbf = SVC(kernel='rbf', probability=True, random_state=42)
scores_rbf = cross_val_score(svc_rbf, X_scaled, y, cv=cv, scoring=auc_scorer)
print(f"SVC (rbf): Mean AUC = {scores_rbf.mean():.4f}, Std = {scores_rbf.std():.4f}")

# Kernel sigmoid
svc_sigmoid = SVC(kernel='sigmoid', probability=True, random_state=42)
scores_sigmoid = cross_val_score(svc_sigmoid, X_scaled, y, cv=cv, scoring=auc_scorer)
print(f"SVC (sigmoid): Mean AUC = {scores_sigmoid.mean():.4f}, Std = {scores_sigmoid.std():.4f}")

# Logistic regression
logreg = LogisticRegression(max_iter=1000, random_state=42)
scores_logreg = cross_val_score(logreg, X_scaled, y, cv=cv, scoring=auc_scorer)
print(f"Logistic Regression: Mean AUC = {scores_logreg.mean():.4f}, Std = {scores_logreg.std():.4f}")



SVC (linear): Mean AUC = 0.9235, Std = 0.0294
SVC (poly): Mean AUC = 0.8754, Std = 0.0386
SVC (rbf): Mean AUC = 0.8420, Std = 0.0400
SVC (sigmoid): Mean AUC = 0.7293, Std = 0.0440
Logistic Regression: Mean AUC = 0.9485, Std = 0.0221


In [11]:
resultados = pd.DataFrame({
    'Model': [
        'SVC (linear)', 
        'SVC (poly)', 
        'SVC (rbf)', 
        'SVC (sigmoid)', 
        'Logistic Regression'
    ],
    'Mean AUC': [
        scores_linear.mean(), 
        scores_poly.mean(), 
        scores_rbf.mean(), 
        scores_sigmoid.mean(), 
        scores_logreg.mean()
    ],
    'Std AUC': [
        scores_linear.std(), 
        scores_poly.std(), 
        scores_rbf.std(), 
        scores_sigmoid.std(), 
        scores_logreg.std()
    ]
})

resultados


|   | Model | Mean AUC | Std AUC |
|---|-------|----------|---------|
| 0 | SVC (linear) | 0.923472 | 0.029396 |
| 1 | SVC (poly) | 0.875401 | 0.03862 |
| 2 | SVC (rbf) | 0.842002 | 0.039991 |
| 3 | SVC (sigmoid) | 0.72929 | 0.04403 |
| 4 | Logistic Regression | 0.948509 | 0.022115 |


### Interpretation and conclusions:
- The best-performing model was logistic regression, with a mean AUC of 0.9485, outperforming every SVM model both in mean AUC and in stability (lowest standard deviation, 0.0221). This indicates that it discriminates well between customers who will default and those who will not.

- The SVC with a linear kernel also performed well (AUC ≈ 0.9235) but did not surpass logistic regression. The more complex kernels (poly, rbf, sigmoid) performed worse, especially the sigmoid kernel, with a markedly lower AUC (≈ 0.729).

- Since the linear models performed best, the problem appears to be close to linear and the dataset relatively simple, so logistic regression is the best option here for its AUC, interpretability, and robustness.

# Addendum

Metrics available for classification:
- `accuracy`
- `balanced_accuracy`
- `top_k_accuracy`
- `average_precision`
- `neg_brier_score`
- `f1`
- `f1_micro`
- `f1_macro`
- `f1_weighted`
- `f1_samples`
- `neg_log_loss`
- `precision` etc.
- `recall` etc.
- `jaccard` etc.
- `roc_auc`
- `roc_auc_ovr`
- `roc_auc_ovo`
- `roc_auc_ovr_weighted`
- `roc_auc_ovo_weighted`
- `d2_log_loss_score`
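These names can be passed as strings to the `scoring` argument of scikit-learn's cross-validation helpers. The sketch below is one possible usage, evaluating several of them at once with `cross_validate`; the choice of estimator, `cv=5`, and the three metrics are assumptions made for illustration and are not part of the exercises above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)

# Illustrative subset of the scorer names listed above.
metric_names = ["accuracy", "f1_macro", "roc_auc_ovr"]

results = cross_validate(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=5,  # with a classifier, an integer cv uses stratified folds
    scoring=metric_names,
)

# cross_validate stores each metric under the key "test_<name>".
for name in metric_names:
    print(f"{name}: {results[f'test_{name}'].mean():.4f}")
```

Multiclass targets need the multiclass variants (for example `f1_macro` or `roc_auc_ovr`); the plain `f1` and `roc_auc` names are intended for binary problems.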

# References

[1] Shigeo Abe. Support Vector Machines for Pattern Classification, 2nd ed. Springer-Verlag London, 2010. ISBN 978-1-84996-097-7. URL https://www.springer.com/gp/book/9781849960977.

[2] Johan A. K. Suykens, Tony Van Gestel, Jos De Brabanter, Bart De Moor, and Joos Vandewalle. Least Squares Support Vector Machines. World Scientific, 2002. ISBN 9789812381514. URL https://www.worldscientific.com/worldscibooks/10.1142/5089.

[3] Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern recognition, 30(7), 1145-1159. URL https://www.researchgate.net/post/how_can_I_interpret_the_ROC_curve_result