# Class 3 - Regularization and cross-validation

In [None]:
# %%capture
# !pip install matplotlib numpy pandas seaborn scikit-learn tqdm

In [None]:
from functools import partial

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression as LR
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC
from tqdm import tqdm

In [None]:
tqdm = partial(tqdm, position=0, leave=True)

In [None]:
plt.style.use("bmh")

## Dataset - preparation and One-Hot Encoding

In [None]:
# Titanic dataset - information about passengers with an indication of survival
# Task: binary classification of target column Survived
dataset = pd.read_csv(
    "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv",
    sep=",",
    header=0,
)

**Dataset exploration**

In [None]:
dataset.isna().sum()

In [None]:
dataset.describe()

In [None]:
dataset.describe(include=["O"])

In [None]:
dataset["Siblings/Spouses Aboard"].value_counts().sort_index()

In [None]:
dataset['Parents/Children Aboard'].value_counts().sort_index()

In [None]:
dataset.Pclass.value_counts()

In [None]:
dataset.Survived.value_counts() / dataset.shape[0]

**Dataset preprocessing**

In [None]:
dataset.drop(columns="Name", inplace=True)

In [None]:
dataset.Pclass = dataset.Pclass.astype(str)

In [None]:
ohe = OneHotEncoder(sparse_output=False)
# ohe.fit(dataset.select_dtypes('O'))
# ohe.transform(dataset.select_dtypes('O'))
ohe_data = ohe.fit_transform(dataset.select_dtypes("O"))
ohe_df = pd.DataFrame(data=ohe_data, columns=ohe.get_feature_names_out())

In [None]:
dataset = pd.concat([dataset.select_dtypes(exclude="O"), ohe_df], axis=1)

In [None]:
X = dataset.drop(columns="Survived")
y = dataset.Survived
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=42
)

## Regularization

**Regularization** techniques are modifications introduced into models with the aim of **reducing overfitting**. The modification usually involves putting constraints on the coefficient estimates or altering the training process with additional steps. A good regularization technique decreases **variance** significantly while increasing **bias** only slightly or not at all.

Examples of regularization techniques are:
- L1 regularization (e.g. lasso regression)
- L2 regularization (e.g. ridge regression)
- Elasticnet
- cost/complexity pruning for decision trees
- cost parameter for support vector machines
- dropout for neural networks
- early stopping


L1:
$$\min_w \frac{1}{n}\lVert Xw-y\rVert^2+\lambda\lVert w\rVert_1 \quad \left(\lambda\sum_i|w_i|\right)$$
L2:
$$\min_w \frac{1}{n}\lVert Xw-y\rVert^2+\lambda\lVert w\rVert_2^2 \quad \left(\lambda\sum_i w_i^2\right)$$
Elasticnet:
$$\min_w \frac{1}{n}\lVert Xw-y\rVert^2+\lambda\left(\alpha\lVert w\rVert_1+(1-\alpha)\lVert w\rVert_2^2\right) \quad \left(\lambda\left(\alpha\sum_i|w_i|+(1-\alpha)\sum_i w_i^2\right)\right)$$
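
The penalty terms in the formulas above can be computed directly. The snippet below is a minimal sketch using only NumPy; the function names and the example weight vector are illustrative assumptions, not part of the original notebook.

```python
import numpy as np


def l1_penalty(w, lam):
    """L1 penalty: lambda * sum_i |w_i|."""
    return lam * np.sum(np.abs(w))


def l2_penalty(w, lam):
    """L2 penalty: lambda * sum_i w_i^2."""
    return lam * np.sum(w**2)


def elasticnet_penalty(w, lam, alpha):
    """Elasticnet penalty: lambda * (alpha * L1 part + (1 - alpha) * L2 part)."""
    return lam * (alpha * np.sum(np.abs(w)) + (1 - alpha) * np.sum(w**2))


# Hypothetical weight vector used only for illustration
w_example = np.array([0.5, -2.0, 0.0, 1.0])
lam_example = 0.1

print("L1 penalty:        ", l1_penalty(w_example, lam_example))
print("L2 penalty:        ", l2_penalty(w_example, lam_example))
print("Elasticnet penalty:", elasticnet_penalty(w_example, lam_example, alpha=0.5))
```

With `alpha=1` the elasticnet penalty reduces to the L1 penalty, and with `alpha=0` it reduces to the L2 penalty.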

Logistic regression without regularization

In [None]:
lr_titanic = LR(penalty=None, max_iter=1000)
lr_titanic.fit(X_train, y_train)
lr_no_reg_auc = roc_auc_score(y_test, lr_titanic.predict_proba(X_test)[:, 1])
print(f"AUC for logistic regression with no regularization: {lr_no_reg_auc:.3f}")

**Hyperparameter tuning**

Tuning elements:
- metric (F1-score, AUC)
- hyperparameter(s) (cutoff threshold, regularization strength, number of trees)
- technique (grid search, random search, Bayesian search)

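Task: Find the value of `C` and the type of regularization (L1/L2) that maximize AUC on the validation set. Note that in scikit-learn `C` is the **inverse** of the regularization strength, so smaller values of `C` mean stronger regularization.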

Logistic regression L1 regularization

In [None]:
def model_auc(model, X_train, X_test, y_train, y_test):
    trained_model = model.fit(X_train, y_train)
    return roc_auc_score(y_test, trained_model.predict_proba(X_test)[:, 1])

In [None]:
cs = np.linspace(
    0.001, 0.2, 100
)  # 100 values of C evenly distributed between 0.001 and 0.2

In [None]:
LR_L1 = partial(LR, penalty="l1", max_iter=1000, solver="liblinear")
aucs_l1 = [model_auc(LR_L1(C=c), X_train, X_test, y_train, y_test) for c in tqdm(cs)]

In [None]:
p = sns.lineplot(x=cs, y=aucs_l1)
p.set_xlabel("C")
p.set_ylabel("AUC")
p.set_title("Logistic regression with L1 penalty");

In [None]:
def lr_l1_coeffs_for_c(c):
    return dict(zip(X_train.columns, LR_L1(C=c).fit(X_train, y_train).coef_[0]))

In [None]:
lr_l1_coeffs_for_c(0.01)

In [None]:
lr_l1_coeffs_for_c(0.02)

**Important note:** the L1 penalty not only prevents overfitting but also serves as a **feature selection** method, because it can shrink some coefficients to exactly zero.
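
The feature-selection effect can be illustrated by counting non-zero coefficients for several values of `C`. This is a minimal sketch on synthetic data from `make_classification` (an assumption made so that the example is self-contained; it does not use the Titanic data), and `count_nonzero_coefs` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which only 3 are informative (assumption for the demo)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)


def count_nonzero_coefs(c):
    """Fit an L1-penalized logistic regression and count non-zero coefficients."""
    model = LogisticRegression(penalty="l1", solver="liblinear", C=c, max_iter=1000)
    model.fit(X_demo, y_demo)
    return int(np.count_nonzero(model.coef_[0]))


# Smaller C means stronger regularization, so fewer coefficients should survive
for c in [0.001, 0.01, 0.1, 1.0]:
    print(f"C={c}: {count_nonzero_coefs(c)} non-zero coefficients out of 10")
```

The exact counts depend on the data, but the number of non-zero coefficients typically grows as `C` increases.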

Logistic regression L2 regularization

In [None]:
aucs_l2 = [
    model_auc(LR(C=c, max_iter=1000), X_train, X_test, y_train, y_test)
    for c in tqdm(cs)
]

In [None]:
p = sns.lineplot(x=cs, y=aucs_l2)
p.set_xlabel("C")
p.set_ylabel("AUC")
p.set_title("Logistic regression with L2 penalty");

Support Vector Classifier

In [None]:
cs_svc = np.linspace(0.01, 300, 100)
aucs_svc = [
    model_auc(SVC(C=c, probability=True), X_train, X_test, y_train, y_test)
    for c in tqdm(cs_svc)
]

In [None]:
p = sns.lineplot(x=cs_svc, y=aucs_svc)
p.set_xlabel("C")
p.set_ylabel("AUC")
p.set_title("Support Vector Classifier");

**Summary for all models**

In [None]:
print(f"""Logistic regression (no penalty): {lr_no_reg_auc:.4f} AUC
Logistic regression (L1): {max(aucs_l1):.4f} AUC for C={cs[np.argmax(aucs_l1)]:.4f}
Logistic regression (L2): {max(aucs_l2):.4f} AUC for C={cs[np.argmax(aucs_l2)]:.4f}
SVC: {max(aucs_svc):.4f} AUC for C={cs_svc[np.argmax(aucs_svc)]:.4f}""")

In [None]:
# AUC of the unregularized model on the training set (compare with its test AUC above)
round(roc_auc_score(y_train, lr_titanic.predict_proba(X_train)[:, 1]), 4)

## Cross-validation

<img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png" width=75%>

Advantages:
- reduces the estimation error caused by a single random split of the dataset
- gives a more reliable score when the model overfits, since the score is averaged over several validation folds
- no need to split data into training and validation subsets explicitly

Disadvantages:
- expensive computationally (training _k_ models instead of 1)
- introduces another hyperparameter (_k_)
- more complex training and evaluation pipeline

More information is available on the [cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html) page of the scikit-learn documentation.
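
The splitting logic behind _k_-fold cross-validation can be sketched with NumPy alone. The helper `k_fold_indices` below is a hypothetical illustration, not scikit-learn's implementation: it uses consecutive folds without shuffling or stratification, whereas scikit-learn uses stratified folds by default for classifiers.

```python
import numpy as np


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k consecutive, non-shuffled folds."""
    indices = np.arange(n_samples)
    fold_list = np.array_split(indices, k)  # fold sizes differ by at most 1
    for i in range(k):
        val_idx = fold_list[i]
        train_idx = np.concatenate([fold_list[j] for j in range(k) if j != i])
        yield train_idx, val_idx


# Each observation lands in the validation fold exactly once
for train_idx, val_idx in k_fold_indices(n_samples=10, k=3):
    print("train:", train_idx, "validation:", val_idx)
```

A model is trained on each `train_idx` and scored on the matching `val_idx`; the _k_ scores are then averaged.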

In [None]:
def auc_scorer(model, X, y):
    return roc_auc_score(y, model.predict_proba(X)[:, 1])

In [None]:
folds = 3
scores = cross_val_score(
    LR(max_iter=1000, random_state=42), X, y, cv=folds, scoring=auc_scorer
)  # scoring='roc_auc'
print(scores)
print(f"Mean AUC score: {np.mean(scores):.3f}")
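
The commented-out `scoring='roc_auc'` hints that the built-in scorer can replace the custom function. The sketch below compares the two on synthetic data (an assumption to keep it self-contained); the scores are expected to agree because AUC depends only on the ranking of the predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (assumption for the demo)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)


def auc_scorer(model, X, y):
    """Custom scorer: AUC computed from predicted probabilities of class 1."""
    return roc_auc_score(y, model.predict_proba(X)[:, 1])


demo_model = LogisticRegression(max_iter=1000)
scores_callable = cross_val_score(demo_model, X_demo, y_demo, cv=3, scoring=auc_scorer)
scores_builtin = cross_val_score(demo_model, X_demo, y_demo, cv=3, scoring="roc_auc")

print("custom scorer:  ", scores_callable)
print("built-in scorer:", scores_builtin)
```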

In [None]:
aucs_mean = np.array([])
aucs_std = np.array([])
cs_svc_xval = range(1, 101)
for c in tqdm(cs_svc_xval):
    xval_arr = cross_val_score(
        SVC(C=c, probability=True, random_state=42), X, y, cv=folds, scoring=auc_scorer
    )
    aucs_mean = np.append(aucs_mean, np.mean(xval_arr))
    aucs_std = np.append(aucs_std, np.std(xval_arr))

In [None]:
plt.plot(cs_svc_xval, aucs_mean, "r")
plt.fill_between(
    cs_svc_xval, aucs_mean - aucs_std, aucs_mean + aucs_std, color="steelblue"
)
best_c = cs_svc_xval[np.argmax(aucs_mean)]
best_mean_auc = np.max(aucs_mean)
plt.plot(best_c, best_mean_auc, "bo")
plt.annotate(
    f"AUC: {best_mean_auc:.3f} \nC: {best_c}",
    (best_c, best_mean_auc * 0.98),
    weight="bold",
)
plt.xlabel("C")
plt.ylabel("Mean AUC ± 1 Std")
plt.title(f"Support Vector Classifier - {folds}-fold Xval");

Grid search hyperparameter tuning + cross-validation = [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
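
A minimal sketch of `GridSearchCV` usage is shown below. It assumes synthetic data and a small, arbitrary grid for `SVC`, chosen only to illustrate the API; the grid values are not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary classification data (assumption for the demo)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Every combination of the listed values is evaluated: 3 * 2 = 6 candidates
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

search = GridSearchCV(SVC(), param_grid, scoring="roc_auc", cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best mean AUC: {search.best_score_:.3f}")
```

After fitting, `search.cv_results_` holds the per-candidate mean and standard deviation of the score, which can be plotted in the same way as the manual loop above.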

### Exercises

Write your own "scorer" function returning the F1-score (with a 0.5 cutoff threshold).

Perform hyperparameter tuning for logistic regression with **elasticnet** regularization. Tune the value of the **L1 ratio** hyperparameter. Use the prepared Titanic data (`X` and `y`).

Tuning specification:
- set the regularization strength argument (`C`) to 0.1
- perform a grid search over 50 evenly spaced values covering the whole range of L1 ratio
- use F1-score as the target metric
- use 3-fold cross-validation to estimate the metric

Plot the results as a line plot with L1 ratio on the x-axis and mean F1 on the y-axis. What is the optimal value of L1 ratio, and what mean F1-score corresponds to it?

### Homework (5pts, time until laboratory exam)

Perform hyperparameter tuning on the prepared **Titanic dataset** using:
1. `GridSearchCV`
2. `RandomizedSearchCV`

Tune hyperparameters of `LogisticRegression` as follows:
- target metric: F1-score
- hyperparameters: `penalty` (either L1 or L2) and `C` between 0.01 and 10
- 8-fold CV

For both grid and randomized search, check 200 combinations of hyperparameters. Pick the right `solver` and `max_iter` parameters. Note that the boundaries for the `C` hyperparameter must be the same for both approaches, but the way of enforcing the number of combinations will differ.

Print the best hyperparameters (`C` and `penalty`) for both `GridSearchCV` and `RandomizedSearchCV`. Are they similar?

Send the Jupyter notebook (with output), exported in `.html` format, by email to lkrain@sgh.waw.pl.