
In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [None]:
df = pd.read_csv('https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv')
df.head()

,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


In [None]:
# Encode the Sex column as integers; factorize() numbers the labels in order of first appearance (male -> 0, female -> 1)

df['Sex_nb'] = df['Sex'].factorize()[0]
df.head()

,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Sex_nb
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25,0
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833,1
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925,1
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1,1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05,0


In [None]:
# Define the features (X) and target (y) for the Decision Tree classification model

cols = ['Pclass', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare', 'Sex_nb']

X = df[cols]
y = df['Survived']

In [None]:
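# Split into train (75%) and test (25%) sets, then fit a baseline decision tree with default parameters
# No random_state is set, so the split and the scores below vary from run to run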

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75)

model = DecisionTreeClassifier().fit(X_train, y_train)

print(f"Accuracy score on the train dataset: {model.score(X_train, y_train)}")
print(f"Accuracy score on the test dataset: {model.score(X_test, y_test)}")


Accuracy score on the train dataset: 0.9894736842105263
Accuracy score on the test dataset: 0.7972972972972973


In [None]:
# Grid search: cross-validate every combination of the values listed in params

params = {
    'max_depth': range(1, 51),
    'min_samples_leaf': range(1,16),
    'min_samples_split': [2, 5, 7, 10, 15, 30]
}

grid = GridSearchCV(DecisionTreeClassifier(), params)
grid.fit(X, y)

GridSearchCV(estimator=DecisionTreeClassifier(),
             param_grid={'max_depth': range(1, 51),
                         'min_samples_leaf': range(1, 16),
                         'min_samples_split': [2, 5, 7, 10, 15, 30]})

In [None]:
# Best mean cross-validated score and the parameters that achieved it

print("best score:", grid.best_score_)
print("best parameters:", grid.best_params_)

best score: 0.830895702405891
best parameters: {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 7}


The grid search selects `max_depth=5`, `min_samples_leaf=1` and `min_samples_split=7` as the best parameters, with a mean cross-validated accuracy of about 0.831.
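The search above is fitted on the full `X` and `y`, so `best_score_` is a cross-validated estimate rather than a score on data the search never saw. A minimal sketch of tuning on the training split only and then scoring on the held-out split might look like this. The synthetic data from `make_classification`, the deliberately small grid and the `*_demo` names are assumptions made so the example runs on its own; they are not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): swap in the notebook's X and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=0
)

# A small illustrative grid (assumption) so the sketch runs in about a second.
small_params = {"max_depth": [2, 4, 6], "min_samples_split": [2, 10]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), small_params)
search.fit(X_tr, y_tr)  # tune on the training split only

# With the default refit=True, score() uses the best estimator refit on X_tr.
test_accuracy = search.score(X_te, y_te)
print("best parameters:", search.best_params_)
print("held-out accuracy:", test_accuracy)
```

Keeping the test split out of the search means the final accuracy is measured on rows that played no part in choosing the parameters.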

The grid search took a little over a minute to run, because it cross-validates every combination in the grid.
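To see where that time goes, the number of model fits can be counted directly from the parameter grid. This is a rough sketch that assumes the default 5-fold cross-validation (no `cv` argument is passed in the notebook) and ignores the final refit on the whole dataset.

```python
import math

# Same grid as in the notebook.
params = {
    "max_depth": range(1, 51),
    "min_samples_leaf": range(1, 16),
    "min_samples_split": [2, 5, 7, 10, 15, 30],
}

# One candidate per combination of values: 50 * 15 * 6.
n_candidates = math.prod(len(values) for values in params.values())

cv_folds = 5  # assumption: scikit-learn's default when cv is not specified
n_fits = n_candidates * cv_folds

print(n_candidates)  # 4500
print(n_fits)        # 22500
```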

In [None]:
# Randomized search: cross-validate only n_iter=200 combinations sampled at random from the same params

rando = RandomizedSearchCV(DecisionTreeClassifier(), params, n_iter=200)
rando.fit(X, y)

RandomizedSearchCV(estimator=DecisionTreeClassifier(), n_iter=200,
                   param_distributions={'max_depth': range(1, 51),
                                        'min_samples_leaf': range(1, 16),
                                        'min_samples_split': [2, 5, 7, 10, 15,
                                                              30]})

In [None]:
# Best mean cross-validated score and the parameters that achieved it

print("best score:", rando.best_score_)
print("best parameters:", rando.best_params_)

best score: 0.8252586808861805
best parameters: {'min_samples_split': 10, 'min_samples_leaf': 3, 'max_depth': 4}


The random search selects `max_depth=4`, `min_samples_leaf=3` and `min_samples_split=10` as the best parameters, with a mean cross-validated accuracy of about 0.825.

The parameters differ somewhat from those found by the grid search, but the score is very close (0.825 versus 0.831), and the computation took only a few seconds because just 200 of the 4,500 combinations were evaluated. On a much larger dataset, where an exhaustive grid search becomes too slow, a random search is likely to be the more practical choice while still reaching good enough accuracy.
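One way to see why 200 samples can get so close is to compute the chance that the random search draws at least one of the best-ranked combinations. The sketch below assumes candidates are drawn uniformly and without replacement from the 4,500 combinations; the helper `prob_hit_top` and the "top 1%" and "top 5%" thresholds are illustrative choices, not something the notebook measures.

```python
import math

n_candidates = 4500  # 50 * 15 * 6 combinations in the notebook's grid
n_iter = 200         # combinations sampled by the randomized search


def prob_hit_top(fraction):
    """Probability that n_iter draws without replacement include at least
    one of the top `fraction` of candidates (hypothetical helper)."""
    n_top = round(n_candidates * fraction)
    # Chance that every draw misses the top group (hypergeometric).
    p_miss = math.comb(n_candidates - n_top, n_iter) / math.comb(n_candidates, n_iter)
    return 1 - p_miss


coverage = n_iter / n_candidates
p_top_1pct = prob_hit_top(0.01)
p_top_5pct = prob_hit_top(0.05)

print(f"share of grid evaluated: {coverage:.1%}")
print(f"P(at least one top-1% candidate): {p_top_1pct:.3f}")
print(f"P(at least one top-5% candidate): {p_top_5pct:.5f}")
```

This only speaks to rank within the grid, not to how far apart the scores are, but it suggests that evaluating a small share of the grid is usually enough to land near the best combination.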