# MODEL SELECTION

In this exercise we are going to learn how to choose the proper model and parameters with a Random Forest case. We are going to use the Pumpkin seed dataset (https://www.kaggle.com/datasets/muratkokludataset/pumpkin-seeds-dataset) as an example. 

In [82]:
import pandas as pd
import os
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import f1_score

In [83]:
def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):
    strat = df[stratify] if stratify else None
    train_set, test_set = train_test_split(
        df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)
    strat = test_set[stratify] if stratify else None
    val_set, test_set = train_test_split(
        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)
    return (train_set, val_set, test_set)

In [84]:
def remove_labels(df, label_name):
    X = df.drop(label_name, axis=1)
    y = df[label_name].copy()
    return (X, y)

In [85]:
path = os.getcwd() + '\data\Pumpkin_Seeds_Dataset.xlsx'
df = pd.read_excel(path, header=0, names=None)
df.head()

Unnamed: 0,Area,Perimeter,Major_Axis_Length,Minor_Axis_Length,Convex_Area,Equiv_Diameter,Eccentricity,Solidity,Extent,Roundness,Aspect_Ration,Compactness,Class
0,56276,888.242,326.1485,220.2388,56831,267.6805,0.7376,0.9902,0.7453,0.8963,1.4809,0.8207,Çerçevelik
1,76631,1068.146,417.1932,234.2289,77280,312.3614,0.8275,0.9916,0.7151,0.844,1.7811,0.7487,Çerçevelik
2,71623,1082.987,435.8328,211.0457,72663,301.9822,0.8749,0.9857,0.74,0.7674,2.0651,0.6929,Çerçevelik
3,66458,992.051,381.5638,222.5322,67118,290.8899,0.8123,0.9902,0.7396,0.8486,1.7146,0.7624,Çerçevelik
4,66107,998.146,383.8883,220.4545,67117,290.1207,0.8187,0.985,0.6752,0.8338,1.7413,0.7557,Çerçevelik


## Model extraction

First of all, we are going to divide our dataset and apply Random Forest with the parameters that we consider. Then, we are going to apply some methods in order to decide which parameters are the most optimal ones and therefore what model to choose among all the possibilities. 

In [86]:
train_set, val_set, test_set = train_val_test_split(df)

In [88]:
X_train, y_train = remove_labels(train_set, 'Class')
X_val, y_val = remove_labels(val_set, 'Class')
X_test, y_test = remove_labels(test_set, 'Class')

In [89]:
from sklearn.ensemble import RandomForestClassifier

clf_rnd = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
clf_rnd.fit(X_train, y_train)

RandomForestClassifier(n_jobs=-1, random_state=42)

In [90]:
y_pred = clf_rnd.predict(X_val)

In [91]:
print("F1 Score:", f1_score(y_pred, y_val, average='weighted'))

F1 Score: 0.8782354211925192


In [92]:
# We are going to use Grid Search to select the better model obtained with Random Forest
from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 9 (3×3) combinations of hyperparameters
    {'n_estimators': [100, 500, 1000], 'max_leaf_nodes': [16, 24, 36]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [100, 500], 'max_features': [2, 3, 4]},
  ]

rnd_clf = RandomForestClassifier(n_jobs=-1, random_state=42)


grid_search = GridSearchCV(rnd_clf, param_grid, cv=5,
                           scoring='f1_weighted', return_train_score=True)

grid_search.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=RandomForestClassifier(n_jobs=-1, random_state=42),
             param_grid=[{'max_leaf_nodes': [16, 24, 36],
                          'n_estimators': [100, 500, 1000]},
                         {'bootstrap': [False], 'max_features': [2, 3, 4],
                          'n_estimators': [100, 500]}],
             return_train_score=True, scoring='f1_weighted')

In [93]:
grid_search.best_params_

{'max_leaf_nodes': 24, 'n_estimators': 100}

In [94]:
grid_search.best_estimator_

RandomForestClassifier(max_leaf_nodes=24, n_jobs=-1, random_state=42)

In [96]:
#Here we get all the different fscores for the diferent parameters
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print("F1 score:", mean_score, "-", "Parámetros:", params)

F1 score: 0.8885096221193105 - Parámetros: {'max_leaf_nodes': 16, 'n_estimators': 100}
F1 score: 0.8891937257695559 - Parámetros: {'max_leaf_nodes': 16, 'n_estimators': 500}
F1 score: 0.8905248075311661 - Parámetros: {'max_leaf_nodes': 16, 'n_estimators': 1000}
F1 score: 0.8918358744748494 - Parámetros: {'max_leaf_nodes': 24, 'n_estimators': 100}
F1 score: 0.8898567948622939 - Parámetros: {'max_leaf_nodes': 24, 'n_estimators': 500}
F1 score: 0.8891910389291462 - Parámetros: {'max_leaf_nodes': 24, 'n_estimators': 1000}
F1 score: 0.8851476520883266 - Parámetros: {'max_leaf_nodes': 36, 'n_estimators': 100}
F1 score: 0.8891960766360232 - Parámetros: {'max_leaf_nodes': 36, 'n_estimators': 500}
F1 score: 0.8885401253588008 - Parámetros: {'max_leaf_nodes': 36, 'n_estimators': 1000}
F1 score: 0.8751731930339689 - Parámetros: {'bootstrap': False, 'max_features': 2, 'n_estimators': 100}
F1 score: 0.8772177189061965 - Parámetros: {'bootstrap': False, 'max_features': 2, 'n_estimators': 500}
F1 sco

This way that we have seen to explore hyperparameters is fine if we do not have a large number of combinations and we are clear about the possible values. Otherwise, it is possibly more efficient to use **RandomizedSearchCV**, which works in a similar way to the previous case, but performing a search on randomized values.


In [97]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

#We decide a high range of parameters to try

param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_depth': randint(low=8, high=50),
    }

rnd_clf = RandomForestClassifier(n_jobs=-1)


rnd_search = RandomizedSearchCV(rnd_clf, param_distributions=param_distribs,
                                n_iter=5, cv=2, scoring='f1_weighted')

rnd_search.fit(X_train, y_train)

RandomizedSearchCV(cv=2, estimator=RandomForestClassifier(n_jobs=-1), n_iter=5,
                   param_distributions={'max_depth': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x000002E9EE3A98E0>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x000002E9EF4DC850>},
                   scoring='f1_weighted')

In [98]:
rnd_search.best_params_

{'max_depth': 21, 'n_estimators': 31}

In [99]:
rnd_search.best_estimator_

RandomForestClassifier(max_depth=21, n_estimators=31, n_jobs=-1)

In [100]:
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print("F1 score:", mean_score, "-", "Parámetros:", params)

F1 score: 0.8838493623452917 - Parámetros: {'max_depth': 16, 'n_estimators': 144}
F1 score: 0.8864711155221536 - Parámetros: {'max_depth': 21, 'n_estimators': 31}
F1 score: 0.8825023499259617 - Parámetros: {'max_depth': 9, 'n_estimators': 152}
F1 score: 0.8831629722839871 - Parámetros: {'max_depth': 42, 'n_estimators': 60}
F1 score: 0.8838764804045125 - Parámetros: {'max_depth': 40, 'n_estimators': 90}


Once the best hyperparameters have been selected using one of the previously used techniques, we can obtain the model from the **best_estimator_** attribute.


In [101]:
rnd_search.best_estimator_.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': 21,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 31,
 'n_jobs': -1,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [102]:
# We select the best model
clf_rnd = rnd_search.best_estimator_

In [104]:
# We predict with the train set
y_train_pred = clf_rnd.predict(X_train)
print("F1 score Train Set:", f1_score(y_train_pred, y_train, average='weighted'))

F1 score Train Set: 0.9986666666666667


In [105]:
# We predict with the validation set
y_val_pred = clf_rnd.predict(X_val)
print("F1 score Validation Set:", f1_score(y_val_pred, y_val, average='weighted'))

F1 score Validation Set: 0.8802018633540373
