# **Scikit-learn - Cross Validation Search (GridSearchCV) and Hyperparameter Optimisation Multiple Clf**

## Objectives

* Learn and use GridSearchCV for Hyperparameter Optimisation with multiple classifiers




---

# Change working directory

* We assume you store the notebooks in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
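
As a quick illustration of what `os.path.dirname()` returns, the sketch below applies it to a made-up path. The folder names are hypothetical and are not taken from this project; `posixpath` is used only so the result looks the same on every operating system.

```python
import posixpath  # POSIX flavour of os.path, so the result is the same on any OS

# Hypothetical notebook folder, used only for illustration
notebook_dir = "/workspace/my-project/jupyter_notebooks"

# dirname() strips the last path component, which gives the parent folder
parent_dir = posixpath.dirname(notebook_dir)
print(parent_dir)
```

Passing this parent folder to `os.chdir()` is what the cells above do with the real working directory.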

## Hyperparameter Optimisation with one algorithm

Import packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

### Multiclass classification

We are going to consider a similar workflow as with other ML operations:

* Split the data
* Define the pipeline and hyperparameters
* Fit the pipeline
* Evaluate the pipeline
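
The four steps above can be sketched end to end in a few lines. This is a simplified illustration rather than the notebook's exact code: it assumes `sklearn.datasets.load_iris` as the data source, drops the feature selection step, and scores on plain accuracy. The names `sketch_pipeline`, `sketch_grid` and `search` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# 1. Split the data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=101)

# 2. Define the pipeline and the hyperparameters to try
sketch_pipeline = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("model", RandomForestClassifier(random_state=101)),
])
sketch_grid = {"model__n_estimators": [10, 20]}

# 3. Fit the pipeline: one cross-validated fit per hyperparameter combination
search = GridSearchCV(sketch_pipeline, sketch_grid, cv=2, scoring="accuracy")
search.fit(X_tr, y_tr)

# 4. Evaluate the best pipeline on the held-out test set
test_accuracy = accuracy_score(y_te, search.predict(X_te))
print(search.best_params_, round(test_accuracy, 3))
```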

We load the iris dataset for this exercise. It contains records of three species (classes) of iris plants, with their petal and sepal measurements. We then split it into train and test sets.

In [2]:
#load iris dataset
df_clf = sns.load_dataset('iris')

print(df_clf.shape)
df_clf.head()

(150, 5)


| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [3]:
#split the data
from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_clf.drop(['species'],axis=1),
                                    df_clf['species'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:", X_test.shape, y_test.shape)

* Train set: (120, 4) (120,) 
* Test set: (30, 4) (30,)


We create a pipeline using three steps: feature scaling, feature selection and modelling with RandomForestClassifier.

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

def pipeline_clf():
  pipeline = Pipeline([
      ( "feat_scaling",StandardScaler() ),
      ( "feat_selection",SelectFromModel(RandomForestClassifier(random_state=101)) ),
      ( "model", RandomForestClassifier(random_state=101)),

    ])

  return pipeline
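
Each step in a pipeline has a name, and the parameters of a step are addressed as `<step name>__<parameter name>`. The sketch below shows this on a reduced two-step pipeline; `demo_pipeline` is a hypothetical name and the feature selection step is left out for brevity.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("model", RandomForestClassifier(random_state=101)),
])

# Parameters of a step are exposed as "<step name>__<parameter name>"
model_params = [name for name in demo_pipeline.get_params() if name.startswith("model__")]
print("model__n_estimators" in model_params)

# The same naming works with set_params(), which is how a grid search sets each candidate value
demo_pipeline.set_params(model__n_estimators=20)
print(demo_pipeline.named_steps["model"].n_estimators)
```

This is why the hyperparameter keys in a grid for this pipeline start with `model__`.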


We define our hyperparameter list based on the algorithm documentation.

* In this case, there will be two hyperparameter combinations.
* We keep the number of hyperparameter combinations small, purely for illustration, so the code runs faster.

In [5]:
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
from sklearn.model_selection import GridSearchCV

param_grid = {"model__n_estimators":[10,20],
              }
param_grid

{'model__n_estimators': [10, 20]}
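
To see how a grid expands into candidate combinations, scikit-learn's `ParameterGrid` can enumerate them. The sketch below is only an illustration: `larger_grid` and its `model__max_depth` values are hypothetical and are not used elsewhere in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# The grid used in this notebook: one hyperparameter with two values
small_grid = {"model__n_estimators": [10, 20]}
print(list(ParameterGrid(small_grid)))

# Hypothetical larger grid: a second hyperparameter multiplies the number of candidates
larger_grid = {"model__n_estimators": [10, 20], "model__max_depth": [3, None]}
n_candidates = len(ParameterGrid(larger_grid))

# Each candidate is fitted once per cross-validation fold
cv_folds = 2
print(n_candidates, "candidates,", n_candidates * cv_folds, "fits")
```

The number of fits grows multiplicatively with each added hyperparameter, which is the reason for keeping the grid small in a teaching example.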

For this project we are particularly interested in the Virginica species, and we need the predictions for this class to be precise. This is an arbitrary choice, but it is an example of the type of business requirement given to you by the product owner or business expert.

* In this case, your scoring parameter is precision_score for the class Virginica.
* In a multiclass classification, when your performance metric is accuracy, you just pass scoring='accuracy' as an argument, as done with a binary classifier.
* In our case, we need to pass arguments to the make_scorer() function to tune the model using precision on the Virginica species. make_scorer() forwards any extra keyword arguments to the metric:
  * The first argument is the metric we want: precision_score.
  * The next argument is labels, where you set the class you want to tune as a list. Note that in this dataset the species is not encoded as numbers but as categories. If it were numbers, you would pass the number related to the class you want to tune.
  * The last argument is average, and it should equal None since you compute the precision for one class only (in this case Virginica) and there is nothing to average.
* Finally, you fit the grid search to the training data.
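
Before fitting, a small hand-made example can make the scorer arguments concrete. The labels below are invented for illustration and are not rows from the iris dataset; `virginica_precision` is a hypothetical variable name.

```python
from sklearn.metrics import make_scorer, precision_score

# Invented labels: virginica is predicted twice, and only one of those predictions is correct
y_true = ["virginica", "virginica", "versicolor", "setosa", "versicolor"]
y_pred = ["virginica", "versicolor", "virginica", "setosa", "versicolor"]

# labels restricts the metric to one class; average=None returns the per-class value unaveraged
per_class = precision_score(y_true, y_pred, labels=["virginica"], average=None)
print(per_class)  # a one-element array holding the virginica precision

# make_scorer() stores the same keyword arguments and forwards them to precision_score
virginica_precision = make_scorer(precision_score, labels=["virginica"], average=None)
```

Here one of the two virginica predictions is correct, so the precision for that class is 1 / 2.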

In [None]:
# Define the grid search and fit it to the training data showing precision for the Virginica species


from sklearn.metrics import make_scorer, precision_score
grid = GridSearchCV(estimator=pipeline_clf(),
                    param_grid=param_grid,
                    cv=2,
                    n_jobs=-2,
                    verbose=3, # In the workplace we typically set verbose to 1, 
                    # to reduce the number of messages when fitting the models.
                    # For teaching purposes, we set it to 3 to see the score for each cross-validated model.
                    scoring=make_scorer(precision_score,
                                        labels=['virginica'],
                                        average=None)
                    )


grid.fit(X_train,y_train)

Fitting 2 folds for each of 2 candidates, totalling 4 fits


In [None]:
# Display the per-fold scores in the style of GridSearchCV's verbose output.
# With a single scorer, the per-fold columns in cv_results_ are named split<fold>_test_score.
for i, params in enumerate(grid.cv_results_['params']):
    for fold in range(grid.cv):
        score = grid.cv_results_[f'split{fold}_test_score'][i]
        print(f"[CV {fold+1}/{grid.cv}] END {', '.join([f'{k}={v}' for k, v in params.items()])}; score={score:.3f}")

[CV 1/2] END model__n_estimators=10; score=1.000
[CV 2/2] END model__n_estimators=10; score=0.833
[CV 1/2] END model__n_estimators=20; score=1.000
[CV 2/2] END model__n_estimators=20; score=0.833


Next, we check the mean cross-validated score for each of the two hyperparameter combinations with .cv_results_ .


In [14]:
(pd.DataFrame(grid.cv_results_)
.sort_values(by='mean_test_score',ascending=False)
.filter(['params','mean_test_score'])
.values
 )

array([[{'model__n_estimators': 10}, 0.9166666666666667],
       [{'model__n_estimators': 20}, 0.9166666666666667]], dtype=object)
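
The `mean_test_score` column is the average of the per-fold scores stored in `cv_results_`. The sketch below checks this on a separate, simplified search; it assumes `sklearn.datasets.load_iris` and plain accuracy instead of the notebook's pipeline and scorer, and `demo_search` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
demo_search = GridSearchCV(
    RandomForestClassifier(random_state=101),
    {"n_estimators": [10, 20]},
    cv=2,
    scoring="accuracy",
)
demo_search.fit(X, y)

results = demo_search.cv_results_
# One row per fold, one column per hyperparameter combination
fold_scores = np.vstack([results["split0_test_score"], results["split1_test_score"]])

# Averaging over the folds reproduces mean_test_score
print(fold_scores.mean(axis=0))
print(results["mean_test_score"])
```

Applied to the per-fold scores printed earlier (1.000 and 0.833 for both candidates), the same averaging gives the 0.9167 shown above.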

In [16]:
#check best hyperparameters
grid.best_params_

{'model__n_estimators': 10}
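
Both candidates have the same mean score here, yet `best_params_` reports the first one. The sketch below reproduces such a tie in a controlled way. It is an illustration under assumptions: it uses a `DummyClassifier` whose `random_state` has no effect on the predictions, so the two candidates tie by construction, and `tie_search` is a hypothetical name.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# strategy="most_frequent" ignores random_state, so both candidates score identically
tie_search = GridSearchCV(
    DummyClassifier(strategy="most_frequent"),
    {"random_state": [0, 1]},
    cv=2,
    scoring="accuracy",
)
tie_search.fit(X, y)

print(tie_search.cv_results_["mean_test_score"])  # two identical values
print(tie_search.cv_results_["rank_test_score"])  # tied candidates share the top rank
print(tie_search.best_index_, tie_search.best_params_)
```

In this sketch the candidate listed first in the grid is selected, which matches the behaviour seen above where `n_estimators=10` is reported as best.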

In [17]:
#check best pipeline
pipeline = grid.best_estimator_
#check best score
best_score = grid.best_score_
pipeline, best_score

(Pipeline(steps=[('feat_scaling', StandardScaler()),
                 ('feat_selection',
                  SelectFromModel(estimator=RandomForestClassifier(random_state=101))),
                 ('model',
                  RandomForestClassifier(n_estimators=10, random_state=101))]),
 0.9166666666666667)

Finally we evaluate the pipeline:

* The precision on Virginica is 95% on the train set (the 98% in the train report is the overall accuracy) and 100% on the test set. It is encouraging that precision reaches its maximum on the test set, since it suggests the pipeline generalises to unseen data, although with only 30 test rows this estimate is coarse.
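
Per-class precision and recall can be read directly off a confusion matrix. The sketch below uses the train-set confusion matrix printed in this section, where rows are predictions and columns are actual classes. It is a worked check, not part of the original notebook.

```python
import numpy as np

# Train-set confusion matrix from this section: rows = predictions, columns = actual classes
labels = ["setosa", "versicolor", "virginica"]
cm = np.array([
    [40, 0, 0],
    [0, 36, 0],
    [0, 2, 42],
])

# Precision: correct predictions of a class / all predictions of that class (row sum)
precision = cm.diagonal() / cm.sum(axis=1)
# Recall: correct predictions of a class / all actual members of that class (column sum)
recall = cm.diagonal() / cm.sum(axis=0)

for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
```

For Virginica, 42 of the 44 predictions are correct, which gives the 0.95 precision shown in the train classification report.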


In [18]:
from sklearn.metrics import classification_report, confusion_matrix

def confusion_matrix_and_report(X,y,pipeline,label_map):

  prediction = pipeline.predict(X)

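  print('---  Confusion Matrix  ---')
  # Note: y_true and y_pred are swapped here. confusion_matrix() puts y_true on the rows,
  # so rows are predictions and columns are actual values, matching the labels below.
  print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),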
        columns=[ ["Actual " + sub for sub in label_map] ], 
        index= [ ["Prediction " + sub for sub in label_map ]]
        ))
  print("\n")


  print('---  Classification Report  ---')
  print(classification_report(y, prediction),"\n")


def clf_performance(X_train,y_train,X_test,y_test,pipeline,label_map):
  print("#### Train Set #### \n")
  confusion_matrix_and_report(X_train,y_train,pipeline,label_map)

  print("#### Test Set ####\n")
  confusion_matrix_and_report(X_test,y_test,pipeline,label_map)
    

clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=pipeline,
                label_map= df_clf['species'].unique()
                )

#### Train Set #### 

---  Confusion Matrix  ---
                      Actual setosa Actual versicolor Actual virginica
Prediction setosa                40                 0                0
Prediction versicolor             0                36                0
Prediction virginica              0                 2               42


---  Classification Report  ---
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        40
  versicolor       1.00      0.95      0.97        38
   virginica       0.95      1.00      0.98        42

    accuracy                           0.98       120
   macro avg       0.98      0.98      0.98       120
weighted avg       0.98      0.98      0.98       120
 

#### Test Set ####

---  Confusion Matrix  ---
                      Actual setosa Actual versicolor Actual virginica
Prediction setosa                10                 0                0
Prediction versicolor             0                12        