# **Scikit-learn - Cross Validation Search (GridSearchCV) and Hyperparameter Optimisation Binary Clf**

## Objectives

* Learn and use GridSearchCV for Hyperparameter Optimisation (Binary)




---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

## Hyperparameter Optimisation with one algorithm

Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

#### Binary Classification

Using GridSearchCV for binary classification differs slightly from using it for a regression problem, though the workflow is similar:

* Split the data
* Define the pipeline and hyperparameters
* Fit the pipeline
* Evaluate the pipeline



We will be using the breast cancer dataset from sklearn, where 0 represents a malignant tumour and 1 a benign one. Once it is loaded, we split the data.

In [2]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
df_clf = pd.DataFrame(data.data,columns=data.feature_names)
df_clf['diagnostic'] = pd.Series(data.target)
df_clf = df_clf.sample(frac=0.5, random_state=101)


print(df_clf.shape)
df_clf.head()

(284, 31)


Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,diagnostic
107,12.36,18.54,79.01,466.7,0.08477,0.06815,0.02643,0.01921,0.1602,0.06066,...,27.49,85.56,544.1,0.1184,0.1963,0.1937,0.08442,0.2983,0.07185,1
437,14.04,15.98,89.78,611.2,0.08458,0.05895,0.03534,0.02944,0.1714,0.05898,...,21.58,101.2,750.0,0.1195,0.1252,0.1117,0.07453,0.2725,0.07234,1
195,12.91,16.33,82.53,516.4,0.07941,0.05366,0.03873,0.02377,0.1829,0.05667,...,22.0,90.81,600.6,0.1097,0.1506,0.1764,0.08235,0.3024,0.06949,1
141,16.11,18.05,105.1,813.0,0.09721,0.1137,0.09447,0.05943,0.1861,0.06248,...,25.27,129.0,1233.0,0.1314,0.2236,0.2802,0.1216,0.2792,0.08158,0
319,12.43,17.0,78.6,477.3,0.07557,0.03454,0.01342,0.01699,0.1472,0.05561,...,20.21,81.76,515.9,0.08409,0.04712,0.02237,0.02832,0.1901,0.05932,1


In [3]:
#split the data
from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df_clf.drop(['diagnostic'],axis=1),
                                    df_clf['diagnostic'],
                                    test_size=0.2,
                                    random_state=101
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:", X_test.shape, y_test.shape)

* Train set: (227, 30) (227,) 
* Test set: (57, 30) (57,)


We create a pipeline with three steps: feature scaling, feature selection and modelling using RandomForestClassifier.

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

def pipeline_clf():
  pipeline = Pipeline([
      ( "feat_scaling",StandardScaler() ),
      ( "feat_selection",SelectFromModel(RandomForestClassifier(random_state=101)) ),
      ( "model", RandomForestClassifier(random_state=101)),

    ])

  return pipeline
pipeline = pipeline_clf()
print(pipeline)


Pipeline(steps=[('feat_scaling', StandardScaler()),
                ('feat_selection',
                 SelectFromModel(estimator=RandomForestClassifier(random_state=101))),
                ('model', RandomForestClassifier(random_state=101))])
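The step names chosen in the pipeline become the prefixes used when naming hyperparameters for a grid search. As a minimal sketch (rebuilding the same pipeline so the snippet runs on its own), `get_params()` lists every name that can be tuned:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("feat_selection", SelectFromModel(RandomForestClassifier(random_state=101))),
    ("model", RandomForestClassifier(random_state=101)),
])

# Each tunable parameter is exposed as "<step name>__<parameter name>"
all_params = pipeline.get_params()
model_params = sorted(name for name in all_params if name.startswith("model__"))
print(model_params)

# Nested estimators add one more level of double underscores
print("feat_selection__estimator__n_estimators" in all_params)
```

Checking the names this way helps avoid a misspelt key in the hyperparameter dictionary, which would otherwise raise an error when the search is fitted.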


We define our hyperparameter list based on the algorithm documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). One method could be to consider the default parameter value and a set of values around the default value.

In this case, there are two possible combinations of hyperparameters.

In [4]:
# create hyperparameter list
from sklearn.model_selection import GridSearchCV

param_grid = {"model__n_estimators":[50,20],
              }

param_grid

{'model__n_estimators': [50, 20]}
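To confirm how many combinations a grid contains before fitting, one option is sklearn's `ParameterGrid`, which expands the dictionary in the same way a grid search does. The second grid below (`wider_grid`, with `model__max_depth`) is a hypothetical example and is not used elsewhere in this notebook:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {"model__n_estimators": [50, 20]}

# Expand the grid into the individual candidates
candidates = list(ParameterGrid(param_grid))
print(len(candidates))  # 2
for candidate in candidates:
    print(candidate)

# Hypothetical wider grid: the count is the product of the list lengths
wider_grid = {
    "model__n_estimators": [50, 20],
    "model__max_depth": [None, 5, 10],
}
print(len(ParameterGrid(wider_grid)))  # 2 * 3 = 6
```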

For Classification, there will be a different GridSearchCV scoring argument than for regression.

* In our classification projects, the potential performance metrics are accuracy, recall, precision, and F1 score.
* When the metric is recall, precision or F1 score, we need to state which class we want to tune for, and we use make_scorer() as an "auxiliary" function to define both the metric and the class. The documentation for make_scorer is found [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html)
* When your performance metric is recall, you need to import recall_score; if it is precision, [precision_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html); and if it is F1 score, f1_score. You then pass the metric to the make_scorer() function.
* When your performance metric is accuracy, you simply write "accuracy" for scoring: scoring='accuracy'
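The options listed above can be sketched as follows. The toy labels are made up purely for illustration (0 = malignant, 1 = benign), and the `scorers` dictionary is just a convenient way to show the four alternatives side by side:

```python
from sklearn.metrics import make_scorer, recall_score, precision_score, f1_score

# Toy labels, invented for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

# pos_label tells the metric which class counts as "positive"
print(recall_score(y_true, y_pred, pos_label=0))  # 2 of 3 malignant cases found
print(recall_score(y_true, y_pred, pos_label=1))  # 4 of 5 benign cases found

# Candidate values for the GridSearchCV scoring argument
scorers = {
    "recall": make_scorer(recall_score, pos_label=0),
    "precision": make_scorer(precision_score, pos_label=0),
    "f1": make_scorer(f1_score, pos_label=0),
    "accuracy": "accuracy",  # accuracy needs no class, so the string is enough
}
```

Any one of these values could be passed as `scoring`; which one is appropriate depends on the metric agreed in the business case.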

In this exercise, we have 0 and 1 as diagnostics for breast cancer.

* We assume that when defining the ML business case, it was agreed that the performance metric is recall on malignant (0) since the client needs to detect a malignant case.
* The client doesn't want to miss a malignant case, even if that comes with a cost where you misidentify a benign tumour and state it is malignant. For this client, this is not as bad as misidentifying a malignant tumour as benign. Therefore, the model is tuned on recall for malignant (0).
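The trade-off described above can be illustrated with a deliberately extreme, hypothetical classifier that labels every case as malignant. The data below is invented for the illustration; the point is only that recall on malignant can be perfect while precision on malignant is poor, which is the cost the client accepts:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score, precision_score

# Invented data: 3 malignant (0) and 7 benign (1) cases, features are irrelevant here
X = np.zeros((10, 1))
y = np.array([0] * 3 + [1] * 7)

# A classifier that always predicts malignant (class 0)
always_malignant = DummyClassifier(strategy="constant", constant=0).fit(X, y)
pred = always_malignant.predict(X)

recall_malignant = recall_score(y, pred, pos_label=0)        # no malignant case is missed
precision_malignant = precision_score(y, pred, pos_label=0)  # but most alarms are false
print(recall_malignant, precision_malignant)  # 1.0 0.3
```

In practice it is worth reading recall alongside precision in the classification report, even when only recall is used for tuning.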

In [5]:
#import metrics
from sklearn.metrics import make_scorer, recall_score
from sklearn.metrics import f1_score # in case your metric is f1 score, you would need this import
from sklearn.metrics import precision_score # in case your metric is precision; you would need this import

The arguments estimator, param_grid, cv, n_jobs, and verbose are similar to the regression example.

* When creating the grid search object, the focus is now on scoring. For this binary classifier, you need make_scorer() to tune on recall for class 0.
* Pass two arguments to make_scorer(): recall_score as the metric and pos_label to identify the class whose recall you want to tune. In this case, it is 0.
* Next, you fit the grid search with the train set (features and target) as usual.

Since cv=2, we will fit two models for each hyperparameter combination using k-fold cross-validation. Therefore, four models (two times two) are trained in the end.

* The same dynamic repeats: compute the performance for each cross-validated model and get the average performance for a given hyperparameter combination, then iterate for each hyperparameter combination.
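The loop described above can be approximated by hand. This is a simplified sketch rather than what GridSearchCV does internally: it uses the full dataset and a bare RandomForestClassifier instead of the notebook's pipeline and train set, so the scores will differ from the ones reported in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
scorer = make_scorer(recall_score, pos_label=0)

mean_scores = {}
for n_estimators in [50, 20]:  # one iteration per hyperparameter combination
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=101)
    # cv=2 trains two models; for a classifier, sklearn stratifies the folds
    fold_scores = cross_val_score(model, X, y, cv=2, scoring=scorer)
    mean_scores[n_estimators] = fold_scores.mean()
    print(n_estimators, fold_scores, mean_scores[n_estimators])

# The combination with the highest average score wins
best_n_estimators = max(mean_scores, key=mean_scores.get)
print("Best n_estimators:", best_n_estimators)
```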

In [10]:
grid = GridSearchCV(estimator=pipeline_clf(),
                    param_grid=param_grid,
                    cv=2,
                    n_jobs=-2,
                    verbose=3,
                    # In the workplace, we typically set verbose to 1, 
                    # to reduce the amount of messages when fitting the models
                    # For teaching purposes, we set it to 3 to see the score for each cross-validated model
                    scoring=make_scorer(recall_score, pos_label=0)
                    )


grid.fit(X_train,y_train)

Fitting 2 folds for each of 2 candidates, totalling 4 fits


Next, we check the results for both hyperparameter combinations with .cv_results_ and use the same code from the previous section. Each row shows the mean score across the two cross-validated models for one combination.



In [11]:
(pd.DataFrame(grid.cv_results_)
.sort_values(by='mean_test_score',ascending=False)
.filter(['params','mean_test_score'])
.values
 )

array([[{'model__n_estimators': 50}, 0.8604651162790697],
       [{'model__n_estimators': 20}, 0.8604651162790697]], dtype=object)
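When two candidates tie on the mean score, as in the output above, the per-fold columns of `cv_results_` can give more context. A minimal sketch, assuming a bare RandomForestClassifier fitted on the full dataset for brevity (so the numbers will not match this notebook's output):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=101),
    param_grid={"n_estimators": [50, 20]},
    cv=2,
    scoring=make_scorer(recall_score, pos_label=0),
)
grid.fit(X, y)

# One row per hyperparameter combination, one split column per fold
results = pd.DataFrame(grid.cv_results_)
columns = ["params", "split0_test_score", "split1_test_score",
           "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns])
```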

Next, we check the best parameters and the pipeline with the best estimator.

In [12]:
grid.best_params_

{'model__n_estimators': 50}

In [13]:
pipeline = grid.best_estimator_
pipeline
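The best estimator is a regular fitted Pipeline, so its steps can be inspected. The sketch below rebuilds the search on the full dataset so it runs on its own (an assumption made for brevity; the notebook uses the train set of a 50% sample) and shows which features the feature selection step kept:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

pipeline = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("feat_selection", SelectFromModel(RandomForestClassifier(random_state=101))),
    ("model", RandomForestClassifier(random_state=101)),
])
grid = GridSearchCV(pipeline, {"model__n_estimators": [50, 20]}, cv=2,
                    scoring=make_scorer(recall_score, pos_label=0))
grid.fit(X, y)

best_pipeline = grid.best_estimator_
print("Best cross-validated score:", grid.best_score_)

# Boolean mask over the input columns kept by the feature selection step
mask = best_pipeline.named_steps["feat_selection"].get_support()
selected = X.columns[mask].tolist()
print(len(selected), "features kept:", selected)
```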

We use a custom function to check our confusion matrix and performance scores

In [14]:
from sklearn.metrics import classification_report, confusion_matrix

def confusion_matrix_and_report(X,y,pipeline,label_map):

  prediction = pipeline.predict(X)

  print('---  Confusion Matrix  ---')
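  # confusion_matrix() puts y_true on the rows and y_pred on the columns.
  # The two arguments are swapped on purpose so that rows show predictions
  # and columns show actual values, matching the labels below.
  print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),
        columns=[ ["Actual " + sub for sub in label_map] ], 
        index= [ ["Prediction " + sub for sub in label_map ]]
        ))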
  print("\n")


  print('---  Classification Report  ---')
  print(classification_report(y, prediction, target_names=label_map),"\n")


def clf_performance(X_train,y_train,X_test,y_test,pipeline,label_map):
  print("#### Train Set #### \n")
  confusion_matrix_and_report(X_train,y_train,pipeline,label_map)

  print("#### Test Set ####\n")
  confusion_matrix_and_report(X_test,y_test,pipeline,label_map)

We pass the arguments as usual. Since class 0 is malignant and class 1 is benign, label_map receives an ordered list that matches each class value to its meaning: ['malignant', 'benign']

* Note that the recall on malignant is 100% on the train set and 90% on the test set. In a project, you agree in advance on the threshold you would accept.
* This pipeline is the solution if the threshold agreed with the client is 90%. If the threshold is 98%, you would still have to look for other algorithms or hyperparameter combinations, since the recall on malignant for the test set is only 90%.
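As a quick check of where the 90% comes from, the recall on malignant can be recomputed by hand from the test set confusion matrix printed in the output of the next cell (rows are predictions, columns are actual values):

```python
import numpy as np

# Test set confusion matrix copied from the notebook output
# rows: predicted malignant, predicted benign
# columns: actual malignant, actual benign
test_matrix = np.array([[19, 0],
                        [2, 36]])

# Recall on malignant = correctly predicted malignant / all actual malignant
actual_malignant = test_matrix[:, 0].sum()                 # 19 + 2 = 21
recall_malignant = float(test_matrix[0, 0] / actual_malignant)

# Accuracy = correct predictions / all predictions
accuracy = float(np.trace(test_matrix) / test_matrix.sum())  # 55 / 57

print(round(recall_malignant, 2))  # 0.9
print(round(accuracy, 2))          # 0.96
```

The two malignant cases predicted as benign are the false negatives the client most wants to avoid, which is why the recall on malignant is compared against the agreed threshold.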

In [15]:
clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=pipeline,
                label_map= ['malignant', 'benign'] 
                )

#### Train Set #### 

---  Confusion Matrix  ---
                     Actual malignant Actual benign
Prediction malignant               86             0
Prediction benign                   0           141


---  Classification Report  ---
              precision    recall  f1-score   support

   malignant       1.00      1.00      1.00        86
      benign       1.00      1.00      1.00       141

    accuracy                           1.00       227
   macro avg       1.00      1.00      1.00       227
weighted avg       1.00      1.00      1.00       227
 

#### Test Set ####

---  Confusion Matrix  ---
                     Actual malignant Actual benign
Prediction malignant               19             0
Prediction benign                   2            36


---  Classification Report  ---
              precision    recall  f1-score   support

   malignant       1.00      0.90      0.95        21
      benign       0.95      1.00      0.97        36

    accuracy                   