# Model selection

## Splitting the data / Hold-out

In [1]:
# We need these libraries to manage our dataset
# NumPy: large multi-dimensional arrays and matrices, plus high-level mathematical functions
# pandas: data manipulation and analysis
# matplotlib: visualisation and plotting (graphs, images, etc.)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
RANDOM_SEED = 42

In [2]:
# Import the iris dataset from sklearn
# https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets
from sklearn.datasets import load_iris
# load the dataset
iris = load_iris()

For the sake of the example, we use several classification models. We don't need to know how they work, only that each one has a `fit` and a `predict` method, like every model in scikit-learn.

In [3]:
# import different classifiers
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

scikit-learn provides many functions for model selection and dataset management, the simplest being `train_test_split`.

In [4]:
# import the function for splitting from sklearn
from sklearn.model_selection import train_test_split
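# seed NumPy's global random state, which train_test_split uses when random_state is not given
np.random.seed(RANDOM_SEED)
# keep only the first two features (sepal length and sepal width)
X = iris.data[:, :2]
# by default, 25% of the samples go to the test set
X_train, X_test, y_train, y_test = train_test_split(X, iris.target)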
print(X_train.shape, X_test.shape, np.unique(y_train, return_counts=True))

(112, 2) (38, 2) (array([0, 1, 2]), array([35, 39, 38]))
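
The split above relies on the defaults of `train_test_split`. As a minimal sketch, the same call can be written with its main parameters spelled out. The use of `stratify` and of `random_state=42` is an assumption made for illustration and is not what the cell above does, so the class counts will differ from the output shown.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data[:, :2], iris.target

# test_size=0.25 is the default; it is written out here only for clarity.
# stratify=y keeps the class proportions (nearly) identical in both subsets.
# random_state makes the split reproducible without seeding NumPy globally.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# number of training samples per class
train_counts = np.bincount(y_train)
print(X_train.shape, X_test.shape, train_counts)
```

Stratifying is mostly useful on small or imbalanced datasets, where a purely random split can leave one class under-represented in the training set.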


We want a training set, a validation set and a testing set, so we need two splits: the second one divides the training part of the first split again.

In [5]:
X_training, X_vali, y_training, y_vali = train_test_split(X_train, y_train)
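
To make the resulting sizes explicit, here is a small self-contained sketch of the two successive splits. The `test_size=0.25` values only restate the default, and `random_state=42` is an assumption added for reproducibility, so the exact samples in each subset may differ from the cells above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data[:, :2], iris.target

# First split: set aside the testing set (25% of the 150 samples).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
# Second split: carve the validation set out of the remaining 112 samples.
X_training, X_vali, y_training, y_vali = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)
print(len(X_training), len(X_vali), len(X_test))  # 84 28 38
```

With these proportions, the models are trained on only 84 of the 150 samples, which is the price of keeping two independent evaluation sets.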

We create and train our models on the training set. Don't mind the warning; it only means that the neural network has not converged.

In [6]:
rf_clf = RandomForestClassifier().fit(X_training, y_training)
mlp_clf = MLPClassifier().fit(X_training, y_training)
svc_clf = SVC().fit(X_training, y_training)
knn_clf = KNeighborsClassifier().fit(X_training, y_training)
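
If the convergence warning matters, one possible way to look into it is to give the network more iterations and record whether the warning is still raised. This is only a sketch: `max_iter=2000` and `random_state=42` are illustrative values chosen here, not settings used in this notebook (the default `max_iter` of `MLPClassifier` is 200), and the network is fitted on the whole two-feature dataset to keep the example short.

```python
import warnings

from sklearn.datasets import load_iris
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

iris = load_iris()
X, y = iris.data[:, :2], iris.target

# max_iter=2000 is an illustrative value, not a recommendation.
mlp = MLPClassifier(max_iter=2000, random_state=42)

# Record warnings instead of printing them, so we can inspect them afterwards.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mlp.fit(X, y)

# True if the optimiser stopped because it ran out of iterations.
hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
print("iterations used:", mlp.n_iter_, "| hit the iteration limit:", hit_limit)
```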



After training, we run each model on the validation set and check which one is the most accurate.

In [7]:
rf_res = rf_clf.predict(X_vali)
mlp_res = mlp_clf.predict(X_vali)
svc_res = svc_clf.predict(X_vali)
knn_res = knn_clf.predict(X_vali)

In [8]:
# scikit-learn has many metric functions
from sklearn.metrics import accuracy_score
# accuracy_score expects the true labels first, then the predictions
print('RF ', accuracy_score(y_vali, rf_res))
print('MLP', accuracy_score(y_vali, mlp_res))
print('SVC', accuracy_score(y_vali, svc_res))
print('KNN', accuracy_score(y_vali, knn_res))

RF  0.75
MLP 0.6428571428571429
SVC 0.7857142857142857
KNN 0.8214285714285714


Since KNN has the best validation accuracy, we choose it. We now evaluate it on the testing set, which has not been used so far.

Usually, the training-validation phase is run many times with different models, hyperparameters, etc., and testing is the final step.

In [9]:
test_res = knn_clf.predict(X_test)
print('KNN', accuracy_score(y_test, test_res))

KNN 0.7631578947368421
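
The whole hold-out procedure of this section can be condensed into one loop. The sketch below is one possible way to write it, not the only one: the `candidates` dictionary and the fixed `random_state=42` are assumptions made for illustration, and the MLP is left out to keep the run short, so the scores will not match the outputs above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_training, X_vali, y_training, y_vali = train_test_split(
    X_train, y_train, random_state=42
)

# Hypothetical set of candidates; any estimator with fit/predict would work.
candidates = {
    "RF": RandomForestClassifier(random_state=42),
    "SVC": SVC(),
    "KNN": KNeighborsClassifier(),
}

# Score every candidate on the validation set only.
vali_scores = {}
for name, clf in candidates.items():
    clf.fit(X_training, y_training)
    vali_scores[name] = accuracy_score(y_vali, clf.predict(X_vali))

# Pick the best candidate, then use the testing set exactly once.
best_name = max(vali_scores, key=vali_scores.get)
test_score = accuracy_score(y_test, candidates[best_name].predict(X_test))
print(vali_scores, best_name, test_score)
```

Keeping the testing set out of the loop is the point of the exercise: the test score stays an unbiased estimate only if it plays no part in choosing the model.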


## Hyperparameter selection

scikit-learn provides many hyperparameter selection objects, the simplest and most common being the grid search.

In [10]:
# import the grid search from sklearn
from sklearn.model_selection import GridSearchCV
# create a classifier
rf_grid = RandomForestClassifier()
# define the hyperparameters we want to evaluate
# as well as the values to test for each one
parameters = {'n_estimators': [2, 5, 10, 15, 25, 30, 50], 'min_samples_split': range(2,7)}
# create the grid search object
# it behaves like a sklearn model, with fit and predict methods;
# each combination is scored with 5-fold cross-validation by default
grid = GridSearchCV(rf_grid, parameters)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)

{'min_samples_split': 4, 'n_estimators': 30} 0.758102766798419


Once the grid search object is fitted, calling `predict` automatically uses the best model, which is refit on the whole training set by default.

In [11]:
grid_res = grid.predict(X_test)
print('RF ', accuracy_score(y_test, grid_res))

RF  0.7894736842105263
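
Beyond `best_params_` and `best_score_`, a fitted grid search keeps the score of every combination in `cv_results_` and the refitted model in `best_estimator_`. The sketch below shows how these can be inspected. The grid is deliberately reduced to four combinations so that it runs quickly, and `random_state=42` is an assumption added for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

iris = load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A deliberately small grid: 2 x 2 = 4 combinations.
parameters = {"n_estimators": [5, 30], "min_samples_split": [2, 4]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), parameters, cv=5)
grid.fit(X_train, y_train)

# One entry per parameter combination, each averaged over the 5 folds.
results = grid.cv_results_
for params, score in zip(results["params"], results["mean_test_score"]):
    print(params, round(score, 3))

# best_estimator_ is the model refit on all of X_train with the best parameters.
print(grid.best_params_, grid.best_estimator_.n_estimators)
```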


## Cross-validation

Cross-validation in scikit-learn works differently from hold-out or grid search: you call a function (`cross_validate`) and pass it the model as an argument.

We only need a training set, which is split into folds during the CV. The testing set can be used after the best model has been selected.

In [12]:
from sklearn.model_selection import cross_validate
# a classifier
rf_cv = RandomForestClassifier(n_estimators=30, min_samples_split=4)
# cv is the type of cross-validation
# if you give an int as argument, it is the number of folds you want for
# k-fold cross-validation (stratified k-fold for a classifier)
cv_res = cross_validate(rf_cv, X_train, y_train, cv=4)
# the return value is a dict; its 'test_score' entry holds the score
# obtained on each fold during the cross-validation
print('RF ', cv_res['test_score'])
print(np.mean(cv_res['test_score']))

RF  [0.78571429 0.67857143 0.67857143 0.67857143]
0.705357142857143
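
`cross_validate` can also report several metrics at once and the scores on the training folds. The sketch below illustrates this. The choice of `f1_macro` as a second metric and of `random_state=42` are assumptions made for the example, and the full two-feature dataset is used instead of `X_train` to keep the block self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

iris = load_iris()
X, y = iris.data[:, :2], iris.target

rf = RandomForestClassifier(n_estimators=30, min_samples_split=4, random_state=42)
cv_res = cross_validate(
    rf, X, y, cv=4,
    scoring=["accuracy", "f1_macro"],  # one 'test_<name>' entry per metric
    return_train_score=True,           # also adds 'train_<name>' entries
)

print(sorted(cv_res.keys()))
print("mean test accuracy :", np.mean(cv_res["test_accuracy"]))
print("mean train accuracy:", np.mean(cv_res["train_accuracy"]))
```

A large gap between the train and test accuracies is a common sign of overfitting, which is why `return_train_score` can be worth its extra computation.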


Leave-one-out cross-validation is yet another way to split a dataset in scikit-learn. You create a splitter object, which gives you the indices of the elements in each group. Here, `LeavePOut(1)` leaves exactly one sample out for testing in each split. Other splitter objects in scikit-learn work the same way.

In [13]:
from sklearn.model_selection import LeavePOut
# a classifier
rf_lpo = RandomForestClassifier(n_estimators=30, min_samples_split=4)
# the splitter
lpo = LeavePOut(1)
# display how many splits we have
n_split = lpo.get_n_splits(X_train)
print('split', n_split)
lpo_res = []
# lpo.split yields one pair per split:
# the indices of the elements to use for training and
# the indices of the elements to use for testing
for j, (train_idx, test_idx) in enumerate(lpo.split(X_train), start=1):
    # progress counter, overwritten in place
    print(j, end='\r')
    rf_lpo.fit(X_train[train_idx], y_train[train_idx])
    lpo_res.append(accuracy_score(y_train[test_idx], rf_lpo.predict(X_train[test_idx])))
print(np.unique(lpo_res, return_counts=True), 'average', np.mean(lpo_res))

split 112
(array([0., 1.]), array([36, 76])) average 0.6785714285714286
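
The manual loop above can also be delegated to scikit-learn, because splitter objects are accepted as the `cv` argument of the cross-validation functions. The following sketch uses `LeaveOneOut`, which produces the same number of splits as `LeavePOut(1)`. KNN replaces the random forest here only because it is fast to refit once per sample; that swap is an assumption of this example, as is the use of the full two-feature dataset.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, LeavePOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data[:, :2], iris.target

loo = LeaveOneOut()
# Both splitters produce one split per sample.
n_loo = loo.get_n_splits(X)
n_lpo = LeavePOut(1).get_n_splits(X)
print(n_loo, n_lpo)

# Passing the splitter as cv replaces the explicit for loop.
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=loo)
print(len(scores), np.mean(scores))
```

Each score is either 0 or 1 because every test fold contains a single sample, so only the mean is informative.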
