## Model Validation and Hyperparameter Tuning

This notebook explains model validation and hyperparameter tuning with Scikit-Learn. It covers the following topics:

1. The different ways of validating an ML model.
2. How to validate models using holdout sets and cross-validation.
3. How to tune hyperparameters using GridSearchCV.

## Ways of Validating Models

1. Validating on training data - the wrong way
2. Validating on holdout sets - setting aside some training data
3. Validating via cross-validation


### 1. Validating on training data

In [None]:
##import the iris dataset from sklearn
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
X.shape

In [None]:
##use KNN classifier to predict flower species
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)  # training the model
y_pred = model.predict(X)   # making predictions


In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y, y_pred)

An accuracy score of 1.0 is a strong sign of overfitting. This validation method violates a fundamental rule: a model must never be evaluated on the same data it was trained on.
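
One way to see why the score above is perfect: with a single neighbor, every training point finds itself (or an exact duplicate) as its nearest neighbor, so the model only returns labels it has memorized. The following is a minimal sketch of that check, assuming the same iris data; the names `knn`, `distances`, and `indices` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

##fit a 1-nearest-neighbor model on the full dataset
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)

##distance from each training point to its single nearest neighbor
distances, indices = knn.kneighbors(X)

##the largest of these distances is zero (up to floating-point error)
print(np.max(distances))
```

Because no prediction ever relies on a different, unseen point, the training accuracy says nothing about how the model generalizes.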

### 2. Validation on Holdout Sets

In [None]:
from sklearn.model_selection import train_test_split
##reserve 50% of the data for testing/validation
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0,
                                                    train_size=0.5)

# fit the model on training
model.fit(X_train, y_train)

# evaluate the model on unseen testing set
y2_model = model.predict(X_test)
accuracy_score(y_test, y2_model)
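
A single holdout score depends on which rows happen to land in the test set. The sketch below repeats the same 50/50 split with a few different seeds to show that spread. The seed range and the names `holdout_scores`, `X_tr`, `X_te`, `y_tr`, and `y_te` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

holdout_scores = []
for seed in range(5):
    ##a different random 50/50 split for every seed
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.5,
                                              random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
    holdout_scores.append(accuracy_score(y_te, knn.predict(X_te)))

##one accuracy per split; the values usually differ slightly
print(holdout_scores)
```

This variability is one motivation for cross-validation, which averages over several splits instead of trusting one.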

### 3. Validation via Cross-Validation

In [None]:
##two-fold cross-validation: each half is used once for training and once for testing

##train the model on the training set and predict the test set
y_test_pred = model.fit(X_train, y_train).predict(X_test)

##train the model on the test set and predict the training set
y_train_pred = model.fit(X_test, y_test).predict(X_train)

##evaluate
accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)

Scikit-Learn’s K-fold cross-validation feature splits the data into K distinct subsets called folds, then trains and evaluates the model K times, holding out a different fold for evaluation each time and training on the other K-1 folds. When cv is an integer and the estimator is a classifier, cross_val_score uses stratified folds and does not shuffle the data by default.

The result is an array containing the K evaluation scores:

In [None]:
##use the cross_val_score function from sklearn
from sklearn.model_selection import cross_val_score

##create 10 folds
scores = cross_val_score(model, X, y, cv=10)
scores

In [None]:
##average the K fold scores
scores.mean()
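
To make the K-fold procedure concrete, the loop below reproduces by hand what `cross_val_score` does for a classifier with `cv=10`: it builds stratified folds, fits a fresh copy of the model on nine of them, and scores it on the remaining one. This is a sketch under the assumption that the default accuracy scorer is used. The names `knn`, `skf`, `manual_scores`, and `reference_scores` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=1)

##10 stratified folds, no shuffling
skf = StratifiedKFold(n_splits=10)

manual_scores = []
for train_idx, test_idx in skf.split(X, y):
    ##fit an unfitted copy of the model on the 9 training folds
    fold_model = clone(knn).fit(X[train_idx], y[train_idx])
    ##accuracy on the single held-out fold
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

##compare with the one-line helper
reference_scores = cross_val_score(knn, X, y, cv=10)
print(np.allclose(manual_scores, reference_scores))
print(manual_scores.mean())
```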

### Hyperparameter Tuning with GridSearchCV

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

##define candidate values for each DecisionTreeClassifier hyperparameter
criteria = ["gini", "entropy"]
min_samples_split_range = [2, 10, 30]
max_depth_range = [None, 2, 5, 10]
min_samples_leaf_range = [1, 5, 10]
max_leaf_nodes_range = [None, 5, 10, 20]


In [None]:
##define the parameter grid to search over
param_grid = {"criterion": criteria,
              "min_samples_split": min_samples_split_range,
              "max_depth": max_depth_range,
              "min_samples_leaf": min_samples_leaf_range,
              "max_leaf_nodes": max_leaf_nodes_range
              }

##set grid with estimator and scoring method
grid_model = GridSearchCV(estimator=DecisionTreeClassifier(),
                          param_grid=param_grid,
                          cv=5,
                          scoring='accuracy',
                          refit=True)
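
A grid search tries every combination of the listed values, so the cost grows multiplicatively with each added hyperparameter. The sketch below counts the candidates with `ParameterGrid`, using the same values as the grid above. The names `n_candidates` and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {"criterion": ["gini", "entropy"],
              "min_samples_split": [2, 10, 30],
              "max_depth": [None, 2, 5, 10],
              "min_samples_leaf": [1, 5, 10],
              "max_leaf_nodes": [None, 5, 10, 20]
              }

##number of distinct hyperparameter combinations: 2 * 3 * 4 * 3 * 4
n_candidates = len(ParameterGrid(param_grid))

##each candidate is fitted once per fold (cv=5)
n_fits = n_candidates * 5

print(n_candidates, n_fits)  # 288 1440
```

With `refit=True`, one more fit on the full data follows for the best combination.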



In [None]:
##train the model
grid_model.fit(iris.data, iris.target)

print("Best mean cross-validated accuracy: %.4f" % grid_model.best_score_)
print(grid_model.best_params_)
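
Because `best_score_` is computed on the same folds that were used to choose the hyperparameters, it tends to be an optimistic estimate. A common precaution is to keep a test set out of the search entirely and score the refitted model on it afterwards. The following is a minimal sketch of that pattern. The reduced grid `small_grid`, the 80/20 split, and the fixed `random_state` values are assumptions chosen to keep the example quick and repeatable, not settings from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

##hold out 20% of the data that the grid search never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

##a deliberately small grid (hypothetical values, for illustration only)
small_grid = {"max_depth": [None, 2, 5],
              "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(estimator=DecisionTreeClassifier(random_state=0),
                      param_grid=small_grid,
                      cv=5,
                      scoring="accuracy",
                      refit=True)
search.fit(X_train, y_train)

##mean cross-validated accuracy of the best candidate (training data only)
print("Mean CV accuracy: %.4f" % search.best_score_)

##accuracy of the refitted best model on the untouched test set
test_accuracy = search.score(X_test, y_test)
print("Held-out test accuracy: %.4f" % test_accuracy)
```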