Perform hyperparameter tuning on the prepared **Titanic dataset** using:
1. `GridSearchCV`
2. `RandomizedSearchCV`

Tune hyperparameters of `LogisticRegression` as follows:
- target metric: F1-score
- hyperparameters: `penalty` (either L1 or L2) and `C` between 0.01 and 10
- 8-fold CV

For both grid and randomized search, check 200 combinations of hyperparameters. Pick the right `solver` and `max_iter` parameters. Note that the boundaries for the `C` hyperparameter must be the same for both approaches, but the implementation that enforces 200 combinations will differ.

Print the best hyperparameters (`C` and `penalty`) for both `GridSearchCV` and `RandomizedSearchCV`. Are they similar?

Send the Jupyter notebook (with output), exported in `.html` format, by email to lkrain@sgh.waw.pl.

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import loguniform

# Data preprocessing

In [3]:
dataset = pd.read_csv(
    "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv",
    sep=",",
    header=0,
)

dataset.drop(columns="Name", inplace=True)

dataset.Pclass = dataset.Pclass.astype(str)

ohe = OneHotEncoder(sparse_output=False)
# ohe.fit(dataset.select_dtypes('O'))
# ohe.transform(dataset.select_dtypes('O'))
ohe_data = ohe.fit_transform(dataset.select_dtypes("O"))
ohe_df = pd.DataFrame(data=ohe_data, columns=ohe.get_feature_names_out())

dataset = pd.concat([dataset.select_dtypes(exclude="O"), ohe_df], axis=1)

X = dataset.drop(columns="Survived")
y = dataset.Survived
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=42
)

# Model and parameters

Solver -> liblinear (supports both regularization types)

Parameters:
1) C -> between 0.01 and 10
2) L1 and L2
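
As a quick check of the solver choice above, here is a minimal sketch showing that `liblinear` fits both the L1 and the L2 penalty. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the Titanic data), and the names `X_demo`, `y_demo`, and `fitted` are hypothetical. The `saga` solver also supports both penalties, so `liblinear` is one valid choice rather than the only one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: not the Titanic dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

fitted = {}
for penalty in ("l1", "l2"):
    # liblinear accepts both penalties, so one estimator setup covers the whole search.
    model = LogisticRegression(
        penalty=penalty, solver="liblinear", C=1.0, max_iter=1000, random_state=0
    )
    fitted[penalty] = model.fit(X_demo, y_demo)

for penalty, model in fitted.items():
    # L1 tends to drive some coefficients to exactly zero; L2 only shrinks them.
    n_zero = int((model.coef_ == 0).sum())
    print(f"{penalty}: {n_zero} of {model.coef_.size} coefficients are exactly zero")
```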

In [4]:
lr_model = LogisticRegression(max_iter=10000, solver='liblinear', random_state=42)

# for grid search: 100 values of C x 2 penalties = 200 combinations
param_grid = {
    'C': np.linspace(0.01, 10, 100),
    'penalty': ['l1', 'l2']
}

# distribution for random search
param_distributions = {
    'C': loguniform(0.01, 10),  # log-uniform sampling covers the wide range of values efficiently
    'penalty': ['l1', 'l2']
}
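
The two search spaces share the same bounds for `C` but cover the interval differently. The sketch below is only an illustration, assuming 200 independent draws from the same `loguniform(0.01, 10)` distribution (the variable names are hypothetical). It compares how much of each candidate set falls below `C = 1`.

```python
import numpy as np
from scipy.stats import loguniform

# Evenly spaced grid, as in param_grid.
grid_C = np.linspace(0.01, 10, 100)
# 200 draws from the log-uniform distribution, as in param_distributions.
sampled_C = loguniform(0.01, 10).rvs(size=200, random_state=42)

# Grid search enumerates every (C, penalty) pair: 100 x 2 = 200 candidates.
n_grid_candidates = len(grid_C) * 2

# Share of candidate C values below 1 under each scheme.
share_grid_below_1 = np.mean(grid_C < 1)
share_sampled_below_1 = np.mean(sampled_C < 1)

print("grid candidates:", n_grid_candidates)
print("share of grid C below 1:", share_grid_below_1)
print("share of sampled C below 1:", share_sampled_below_1)
```

The linear grid places only a tenth of its values below 1, whereas log-uniform sampling is expected to place about two thirds of its draws there, because it spreads probability evenly across orders of magnitude.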

# Grid search

As the task requires, cv=8 and scoring='f1' were chosen (https://scikit-learn.org/1.5/modules/model_evaluation.html#scoring-parameter)

In [5]:
grid_search = GridSearchCV(
    estimator=lr_model,
    param_grid=param_grid,
    scoring='f1',
    cv=8,
    verbose=1
)
grid_search.fit(X_train, y_train)

Fitting 8 folds for each of 200 candidates, totalling 1600 fits


# Random search

As the task requires, cv=8 and scoring='f1' were chosen (https://scikit-learn.org/1.5/modules/model_evaluation.html#scoring-parameter)

In [6]:
random_search = RandomizedSearchCV(
    estimator=lr_model,
    param_distributions=param_distributions,
    scoring='f1',
    cv=8,
    n_iter=200,
    random_state=42,
    verbose=1
)
random_search.fit(X_train, y_train)

Fitting 8 folds for each of 200 candidates, totalling 1600 fits


In [7]:
print("GridSearchCV best parameters:", grid_search.best_params_)
print("GridSearchCV best F1 score:", grid_search.best_score_)

print("RandomizedSearchCV best parameters:", random_search.best_params_)
print("RandomizedSearchCV best F1 score:", random_search.best_score_)

GridSearchCV best parameters: {'C': np.float64(0.41363636363636364), 'penalty': 'l1'}
GridSearchCV best F1 score: 0.783022279828099
RandomizedSearchCV best parameters: {'C': np.float64(0.6251373574521749), 'penalty': 'l1'}
RandomizedSearchCV best F1 score: 0.783022279828099


# Results

1) **Grid Search**  
*Grid search C*: 0.41363636363636364  
*Regularization*: L1 (LASSO)  
*Grid search F1*: 0.783022279828099  

2) **Random Search**  
*Random search C*: 0.6251373574521749  
*Regularization*: L1 (LASSO)  
*Random search F1*: 0.783022279828099  

**Comparison**:
1) The C found by random search (about 0.625) is larger than the C found by grid search (about 0.414).
2) The best mean cross-validated F1-score is identical in both cases: 0.7830.
3) Both GridSearchCV and RandomizedSearchCV selected L1 (LASSO) regularization.
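
An identical F1-score for two different values of C suggests that the score may be flat over a range of C. This is a hypothesis, not something the output above proves. One way to check is to list every candidate in `cv_results_` that ties with the best score. The sketch below shows the idea on synthetic data with a smaller grid and fewer folds (all assumptions chosen to keep it fast; `demo_search`, `results`, and `tied` are hypothetical names).

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the Titanic dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

demo_search = GridSearchCV(
    estimator=LogisticRegression(solver="liblinear", max_iter=1000, random_state=0),
    param_grid={"C": np.linspace(0.01, 10, 10), "penalty": ["l1", "l2"]},
    scoring="f1",
    cv=4,
)
demo_search.fit(X_demo, y_demo)

# One row per candidate; mean_test_score is the mean F1 across the folds.
results = pd.DataFrame(demo_search.cv_results_)
best = results["mean_test_score"].max()

# Candidates whose mean score is numerically equal to the best one.
tied = results[np.isclose(results["mean_test_score"], best)]
print(tied[["param_C", "param_penalty", "mean_test_score"]])
```

Applied to `grid_search.cv_results_` and `random_search.cv_results_` from the cells above, the same filter would show whether several values of C share the top score.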