## Model Validation and Hyperparameter Tuning

This notebook covers model validation and hyperparameter tuning with Scikit-Learn. The topics covered are:

1. The different ways of validating an ML model.
2. How to validate models using holdout sets and cross-validation.
3. How to tune hyperparameters using GridSearchCV.

### 1. Validating on training data

In [1]:
##import the iris dataset from sklearn
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

X.shape

(150, 4)

In [2]:
##using KNN classifier to predict flower species
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)
y_pred = model.predict(X)


In [3]:
##checking accuracy
from sklearn.metrics import accuracy_score

##calculate accuracy score
accuracy_score(y, y_pred)

1.0

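An accuracy score of 1.0 here is a sign of overfitting, not of real predictive skill. With n_neighbors=1, every training point is its own nearest neighbor, so the model simply recalls the labels it has memorized. This validation method violates a fundamental rule: never evaluate a model on the same data it was trained on.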
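To make the memorization concrete, here is a minimal sketch that assumes the same iris data and 1-nearest-neighbor model as above. It asks the fitted model for the nearest stored neighbor of each training point and checks that the distance is zero. The names `distances`, `indices` and `train_accuracy` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)

# distance from every training point to its single nearest stored point
distances, indices = model.kneighbors(X, n_neighbors=1)

# each queried point is itself stored in the model, so the best match is at distance 0
print("largest nearest-neighbor distance:", distances.max())

# the model therefore just reproduces the training labels
train_accuracy = accuracy_score(y, model.predict(X))
print("training accuracy:", train_accuracy)
```

A score obtained this way says nothing about how the model behaves on flowers it has not seen.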

### 2. Validation on Holdout Sets

In [4]:
from sklearn.model_selection import train_test_split
##reserving 50% of data for testing/validation
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0,
                                 train_size=0.5)

# fit the model on training
model.fit(X_train, y_train)

# evaluate the model on unseen testing set
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

0.9066666666666666
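A single holdout score depends on which rows happen to land in the test half. The sketch below repeats the same 50/50 split with a few different seeds to show how much the score can move. The seed range and the name `holdout_scores` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

holdout_scores = []
for seed in range(5):  # illustrative seeds, not from the original notebook
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=seed, train_size=0.5)
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(X_train, y_train)
    # score only on rows the model did not see during fitting
    holdout_scores.append(accuracy_score(y_test, model.predict(X_test)))

holdout_scores = np.array(holdout_scores)
print(holdout_scores)
print("mean: %.4f, std: %.4f" % (holdout_scores.mean(), holdout_scores.std()))
```

The spread across seeds is one reason to prefer cross-validation over a single split, especially on a dataset this small.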

### 3. Cross-Validation

In [5]:
##fold 1: train the model on the training set and predict the test set
y_test_pred = model.fit(X_train, y_train).predict(X_test)

##fold 2: swap roles, train on the test set and predict the training set
y_train_pred = model.fit(X_test, y_test).predict(X_train)

##evaluate
accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)


(0.96, 0.9066666666666666)
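The cell above is two-fold cross-validation done by hand: each half of the data serves once as the training set and once as the test set. A minimal sketch of the same idea with scikit-learn's `KFold` splitter follows. The split here is generated by `KFold` with an assumed seed, so the two scores need not match the pair above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# shuffle=True matters here because the iris rows are sorted by species
kfold = KFold(n_splits=2, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kfold.split(X):
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(X[train_idx], y[train_idx])
    # each fold is scored on the half that was held out of fitting
    fold_scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

mean_score = np.mean(fold_scores)
print(fold_scores, mean_score)
```

Averaging the fold scores gives a single estimate that uses every row for testing exactly once.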

In [6]:
##use the cross_val_score method from sklearn
from sklearn.model_selection import cross_val_score

##create 10 folds
scores = cross_val_score(model, X, y, cv=10)
scores

array([1.        , 0.93333333, 1.        , 0.93333333, 0.86666667,
       1.        , 0.86666667, 1.        , 1.        , 1.        ])

In [7]:
##average the scores across the 10 folds
scores.mean()

0.96
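When `cv` is an integer and the estimator is a classifier, `cross_val_score` builds stratified folds, so each fold keeps roughly the same class proportions as the full dataset. The sketch below passes an explicit `StratifiedKFold` splitter and compares the result with `cv=10`. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1)

# integer cv: scikit-learn chooses the splitter
scores_int = cross_val_score(model, X, y, cv=10)

# explicit splitter: stratified, unshuffled folds
scores_explicit = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=10))

print("same fold scores:", np.allclose(scores_int, scores_explicit))
print("mean: %.4f, std: %.4f" % (scores_explicit.mean(), scores_explicit.std()))
```

Reporting the standard deviation next to the mean gives a rough sense of how stable the estimate is across folds.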

### 4. Hyperparameter Tuning with GridSearchCV

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

##define candidate values for the DecisionTreeClassifier hyperparameters
criteria = ["gini", "entropy"]
min_samples_split_range = [2, 10, 30]
max_depth_range = [None, 2, 5, 10]
min_samples_leaf_range = [1, 5, 10]
max_leaf_nodes_range = [None, 5, 10, 20]

In [9]:
##define the parameter grid to search over
param_grid = {"criterion": criteria,
              "min_samples_split": min_samples_split_range,
              "max_depth": max_depth_range,
              "min_samples_leaf": min_samples_leaf_range,
              "max_leaf_nodes": max_leaf_nodes_range}

##set grid with estimator and scoring method
grid_model = GridSearchCV(estimator=DecisionTreeClassifier(),
                         param_grid=param_grid,
                         cv=5,
                         scoring='accuracy',
                         )
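Before running the search, it helps to know how much work the grid implies. The sketch below rebuilds the same grid with literal values and counts the combinations with `ParameterGrid`. The names `n_candidates`, `n_folds` and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# same candidate values as the grid above, written out literally
param_grid = {"criterion": ["gini", "entropy"],
              "min_samples_split": [2, 10, 30],
              "max_depth": [None, 2, 5, 10],
              "min_samples_leaf": [1, 5, 10],
              "max_leaf_nodes": [None, 5, 10, 20]}

# every combination is tried: 2 * 3 * 4 * 3 * 4 candidates
n_candidates = len(ParameterGrid(param_grid))

# each candidate is fitted once per cross-validation fold
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # prints: 288 1440
```

With the default `refit=True`, the search also fits the best candidate once more on the full data after the 1440 cross-validation fits.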

In [10]:
##train the model
grid_model.fit(iris.data, iris.target)

##print best scores and best params
print("Accuracy Score of the fine-tuned model: %.4f"%grid_model.best_score_)

print(grid_model.best_params_)

Accuracy Score of the fine-tuned model: 0.9733
{'criterion': 'gini', 'max_depth': 5, 'max_leaf_nodes': 5, 'min_samples_leaf': 1, 'min_samples_split': 30}
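The `best_score_` reported by a grid search is a mean cross-validation score on the same data that was used to pick the hyperparameters, so it can be somewhat optimistic. One common safeguard is to keep a test set out of the search entirely and score the refitted best model on it. The following is a minimal sketch under assumptions not in the original notebook: a 70/30 stratified split, a reduced grid named `small_grid`, and a fixed `random_state`.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# hold out 30% of the rows; the search never sees them (assumed split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# reduced, illustrative grid to keep the example quick
small_grid = {"max_depth": [None, 2, 5],
              "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(estimator=DecisionTreeClassifier(random_state=0),
                      param_grid=small_grid,
                      cv=5,
                      scoring="accuracy")
search.fit(X_train, y_train)

# best_estimator_ is the best candidate refitted on all of X_train
test_accuracy = accuracy_score(y_test, search.best_estimator_.predict(X_test))

print("best CV accuracy: %.4f" % search.best_score_)
print("held-out accuracy: %.4f" % test_accuracy)
print(search.best_params_)
```

Fixing `random_state` on the tree also makes repeated runs reproducible, which is useful when several candidates score nearly the same.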
