Perform hyperparameter tuning on the prepared **Titanic dataset** using:
1. `GridSearchCV`
2. `RandomizedSearchCV`

Tune hyperparameters of `LogisticRegression` as follows:
- target metric: F1-score
- hyperparameters: `penalty` (either L1 or L2) and `C` between 0.01 and 10
- 8-fold CV
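
For reference, the F1-score is the harmonic mean of precision and recall for the positive class. A minimal pure-Python sketch on made-up labels follows; the function name `f1_from_labels` and the toy labels are illustrative and not part of the exercise.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1), computed from two label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # Without true positives, precision and recall are 0 or undefined
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels, only to exercise the function
toy_true = [1, 0, 1, 1, 0]
toy_pred = [1, 0, 0, 1, 1]
toy_f1 = f1_from_labels(toy_true, toy_pred)
print(toy_f1)
```

In the notebook itself, this metric is requested through `scoring='f1'`.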

For both grid and randomized search, check 200 combinations of hyperparameters. Pick the right `solver` and `max_iter` parameters. Note that the boundaries for the `C` hyperparameter must be the same for both approaches, but the implementation to enforce 200 combinations will be different.

Print the best hyperparameters (`C` and `penalty`) for both `GridSearchCV` and `RandomizedSearchCV`. Are they similar?
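
The two searches reach 200 candidates in different ways. The sketch below illustrates this with the standard library only. It assumes the `C` range from the task statement, and the constant names and the seed are illustrative.

```python
import itertools
import random

C_MIN, C_MAX = 0.01, 10.0  # range taken from the task statement
PENALTIES = ["l1", "l2"]
N_CANDIDATES = 200

# Grid search: the candidate count is the product of the list lengths,
# so 2 penalties require 200 / 2 = 100 distinct values of C.
n_c = N_CANDIDATES // len(PENALTIES)
step = (C_MAX - C_MIN) / (n_c - 1)
c_grid = [C_MIN + i * step for i in range(n_c)]
grid_candidates = list(itertools.product(PENALTIES, c_grid))

# Randomized search: the candidate count is simply the number of draws
# (the role played by `n_iter`), each draw sampling a penalty and a C.
rng = random.Random(42)
random_candidates = [
    (rng.choice(PENALTIES), rng.uniform(C_MIN, C_MAX))
    for _ in range(N_CANDIDATES)
]

print(len(grid_candidates), len(random_candidates))
```

The grid therefore fixes the number of `C` values, while the randomized search fixes the number of draws from a distribution over the same range.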

In [1]:
import seaborn as sns
import pandas as pd
import numpy as np
from scipy.stats import uniform
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression as LR
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV

In [2]:
# Load the CSV file
dataset = pd.read_csv(
    "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv",
    sep=',',
    header=0)
dataset

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000


In [3]:
dataset.describe()

Unnamed: 0,Survived,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
count,887.0,887.0,887.0,887.0,887.0,887.0
mean,0.385569,2.305524,29.471443,0.525366,0.383315,32.30542
std,0.487004,0.836662,14.121908,1.104669,0.807466,49.78204
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.25,0.0,0.0,7.925
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.1375
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [4]:
dataset.describe(include=['O']) 

Unnamed: 0,Name,Sex
count,887,887
unique,887,2
top,Mr. Owen Harris Braund,male
freq,1,573


In [5]:
dataset.drop(columns='Name', inplace=True)

In [6]:
dataset.Pclass = dataset.Pclass.astype(str)  # convert to string so the class is treated as categorical

In [8]:
ohe = OneHotEncoder(sparse_output=False)  # `sparse` was renamed to `sparse_output` in scikit-learn 1.2

In [10]:
# Fit and transform the selected columns with object (string) data type using OneHotEncoder
ohe_data = ohe.fit_transform(dataset.select_dtypes('O'))

# Create a DataFrame from the one-hot encoded data with appropriate column names
ohe_df = pd.DataFrame(data=ohe_data, columns=ohe.get_feature_names_out())

In [11]:
dataset_ohe_t = pd.concat([dataset.select_dtypes(exclude='O'), ohe_df], axis=1)  # combine the non-object columns with the one-hot encoded columns

In [12]:
dataset_ohe_t

Unnamed: 0,Survived,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male
0,0,22.0,1,0,7.2500,0.0,0.0,1.0,0.0,1.0
1,1,38.0,1,0,71.2833,1.0,0.0,0.0,1.0,0.0
2,1,26.0,0,0,7.9250,0.0,0.0,1.0,1.0,0.0
3,1,35.0,1,0,53.1000,1.0,0.0,0.0,1.0,0.0
4,0,35.0,0,0,8.0500,0.0,0.0,1.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...
882,0,27.0,0,0,13.0000,0.0,1.0,0.0,0.0,1.0
883,1,19.0,0,0,30.0000,1.0,0.0,0.0,1.0,0.0
884,0,7.0,1,2,23.4500,0.0,0.0,1.0,1.0,0.0
885,1,26.0,0,0,30.0000,1.0,0.0,0.0,0.0,1.0


In [13]:
X = dataset_ohe_t.drop('Survived', axis=1)
y = dataset_ohe_t['Survived']

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)

GridSearchCV

In [15]:
cs = np.linspace(0.001, 10, 100)  # 100 evenly spaced values of C
grid = {
    'penalty': ['l1', 'l2'],  # regularization type
    'C': cs  # inverse regularization strength
}

In [None]:
tuning_gs_lr = GridSearchCV(
    LR(random_state=42, solver='saga', max_iter=10000),
    param_grid=grid,
    scoring='f1',
    n_jobs=1,
    cv=8,
    verbose=2
)
tuning_gs_lr.fit(X_train, y_train)

Fitting 8 folds for each of 200 candidates, totalling 1600 fits
[CV] END ................................C=0.001, penalty=l1; total time=   0.7s
[CV] END ................................C=0.001, penalty=l1; total time=   0.6s
[CV] END ................................C=0.001, penalty=l1; total time=   0.7s
[CV] END ................................C=0.001, penalty=l1; total time=   0.7s
[CV] END ................................C=0.001, penalty=l1; total time=   0.6s
[CV] END ................................C=0.001, penalty=l1; total time=   0.9s
[CV] END ................................C=0.001, penalty=l1; total time=   0.9s
[CV] END ................................C=0.001, penalty=l1; total time=   0.7s
[CV] END ................................C=0.001, penalty=l2; total time=   0.1s
[CV] END ................................C=0.001, penalty=l2; total time=   0.1s
[CV] END ................................C=0.001, penalty=l2; total time=   0.1s
[CV] END ................................C=0.

In [None]:
print(tuning_gs_lr.best_params_) 
# 'C': 0.102, 'penalty': 'l2'

RandomizedSearchCV

In [None]:
c_distribution = uniform(loc=0.001, scale=10 - 0.001)  # uniform on [0.001, 10]
dist = {
    'penalty': ['l1', 'l2'],  # regularization type
    'C': c_distribution  # inverse regularization strength, sampled per candidate
}
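
One detail worth checking here: `scipy.stats.uniform(loc, scale)` covers the interval from `loc` to `loc + scale`, so `scale` is the width of the range rather than its upper bound. A small sketch of that check follows, with an arbitrary sample size and seed; the names `low`, `high` and `c_dist` are illustrative.

```python
from scipy.stats import uniform

low, high = 0.001, 10

# scale is the interval width, so the upper bound is loc + scale
c_dist = uniform(loc=low, scale=high - low)

# Draw a reproducible sample and inspect the observed range
samples = c_dist.rvs(size=1000, random_state=0)
print(c_dist.support())
print(samples.min(), samples.max())
```

Passing `scale=10` instead would stretch the range to roughly 10.001, which would no longer match the grid's boundaries.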

In [None]:
tuning_rs_lr = RandomizedSearchCV(    
    LR(random_state=42, solver='saga', max_iter=10000),
    param_distributions=dist,
    scoring='f1',
    n_iter=200,
    n_jobs=1,
    cv=8,
    verbose=2,
    random_state=42  # fixed random seed
)

tuning_rs_lr.fit(X_train, y_train)

In [None]:
print(tuning_rs_lr.best_params_) 
#'C': 0.181, 'penalty': 'l2'
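
The notebook creates `X_test` and `y_test` but only reports cross-validated results. As a possible follow-up, a fitted search object can be scored on the held-out split, because with the default `refit=True` it refits the best configuration on the full training set. The sketch below is self-contained and rests on assumptions made only to keep it quick: synthetic data from `make_classification`, the `liblinear` solver, and a four-value grid for `C`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared dataset (assumption, not Titanic)
X, y = make_classification(n_samples=300, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Small illustrative grid; liblinear supports both l1 and l2 penalties
search = GridSearchCV(
    LogisticRegression(solver="liblinear", random_state=42),
    param_grid={"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1",
    cv=4,
)
search.fit(X_train, y_train)

# predict() delegates to the refitted best estimator
test_f1 = f1_score(y_test, search.predict(X_test))
print(search.best_params_)
print("CV F1:", search.best_score_, "test F1:", test_f1)
```

Applied to the notebook, the same pattern would compare `tuning_gs_lr` and `tuning_rs_lr` on `X_test`, which shows whether the small difference in `C` matters in practice.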

SUMMARY

Under the given assumptions, the best hyperparameters (`C` and `penalty`) found by `GridSearchCV` and `RandomizedSearchCV` are similar: `C` is 0.10 vs. 0.18, and both methods select the `l2` penalty.