Hyperparameter tuning for a K-Nearest Neighbors classifier to achieve better results on the Iris dataset.

# Import the libraries

In [20]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

In [21]:
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

In [22]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [23]:
# Baseline: fit a k-NN classifier with k=3 before any tuning
# (KNeighborsClassifier was already imported in the first cell)
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train, y_train)

In [24]:
y_pred = classifier.predict(X_test)

In [25]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
cr = classification_report(y_test, y_pred)
print("\nClassification Report:")
print(cr)
acc = accuracy_score(y_test, y_pred)
print("\nAccuracy:", acc)

Confusion Matrix:
[[18  0  0]
 [ 0 15  0]
 [ 0  1 11]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        18
           1       0.94      1.00      0.97        15
           2       1.00      0.92      0.96        12

    accuracy                           0.98        45
   macro avg       0.98      0.97      0.97        45
weighted avg       0.98      0.98      0.98        45


Accuracy: 0.9777777777777777


Before we move on to classification, let us see some basic information about our dataset.

k-Nearest Neighbors Classifier

# Hyperparameter Tuning

A hyperparameter is a parameter of the model that is set before the learning process starts, rather than learned from the data. Different machine learning models have different hyperparameters. You can find out more about the hyperparameters of k-NN <a href='https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html'>here</a>.

We will use the Exhaustive Grid Search technique for hyperparameter optimization. An exhaustive grid search takes a set of candidate values for each hyperparameter and evaluates every possible combination of them, scoring each combination with as many cross-validation folds as you specify. It is a reliable way to find the best hyperparameter values within the grid, but its cost grows multiplicatively with every additional parameter value and cross-validation fold that you add.
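To make the idea concrete, here is a minimal sketch of what a grid search does internally: loop over every combination and score each one with cross-validation. The small `demo_grid` and the `demo_*` names are assumptions made up for this illustration; they are not the grid tuned later in this notebook.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative grid (an assumption for this sketch only)
demo_grid = {"n_neighbors": [3, 5], "p": [1, 2]}

X_demo, y_demo = load_iris(return_X_y=True)

demo_results = []
names = list(demo_grid)
# product() yields every combination of the candidate values
for values in product(*(demo_grid[name] for name in names)):
    params = dict(zip(names, values))
    # Mean accuracy over 3 cross-validation folds for this combination
    scores = cross_val_score(KNeighborsClassifier(**params), X_demo, y_demo, cv=3)
    demo_results.append((scores.mean(), params))

# Keep the combination with the highest mean cross-validation accuracy
best_demo_score, best_demo_params = max(demo_results, key=lambda r: r[0])
print(len(demo_results), "combinations tried")
print("Best:", best_demo_params, best_demo_score)
```

GridSearchCV wraps this loop and additionally stores per-fold results and can refit the best model.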

We will tune two hyperparameters: n_neighbors and p.
1. n_neighbors: The number of nearest neighbors (k) that vote on the class of each sample.

2. p: The power parameter of the Minkowski distance metric (p=1 is the Manhattan distance, p=2 is the Euclidean distance).
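The effect of p can be seen by computing the Minkowski distance by hand. The helper `minkowski_distance` and the two sample points below are hypothetical names chosen for this sketch; scikit-learn computes the distance internally.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Sum of |a_i - b_i|^p, then the p-th root
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

point_a = [0.0, 0.0]
point_b = [3.0, 4.0]
for p in (1, 2, 3):
    print(f"p={p}: {minkowski_distance(point_a, point_b, p):.3f}")
```

For these two points, p=1 gives the Manhattan distance 7 and p=2 gives the Euclidean distance 5; larger p values weight the largest coordinate difference more heavily.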

In [26]:
# Define the hyperparameter grid
param_grid = {'n_neighbors': [3, 5, 7, 9, 11],
              'p': [1, 2, 3]}

In [27]:
gs = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)

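Since we set the number of cross-validation folds to 3 (cv=3), Grid Search will fit the model 5 x 3 x 3 = 45 times during the search: 5 values of n_neighbors times 3 values of p gives 15 hyperparameter combinations, and each combination is evaluated on 3 folds.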
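The number of fits can be checked with `ParameterGrid`, which enumerates the same combinations that the search evaluates. A minimal sketch; `n_folds`, `n_candidates` and `total_fits` are names introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'n_neighbors': [3, 5, 7, 9, 11],
              'p': [1, 2, 3]}
n_folds = 3  # matches cv=3 above

# Number of distinct hyperparameter combinations in the grid
n_candidates = len(ParameterGrid(param_grid))
total_fits = n_candidates * n_folds
print(n_candidates, "candidates x", n_folds, "folds =", total_fits, "fits")
```

With the default refit=True, GridSearchCV performs one additional fit at the end, training the best combination on the whole training set.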

In [28]:
# fit the model on our train set
g_res = gs.fit(X_train, y_train)

In [29]:
# find the best score
g_res.best_score_

0.9619047619047619

In [30]:
# get the hyperparameters with the best score
g_res.best_params_

{'n_neighbors': 5, 'p': 2}

In [31]:
# use the best hyperparameters found above (n_neighbors=5, p=2)
knn = KNeighborsClassifier(n_neighbors=5, p=2)
knn.fit(X_train, y_train)

In [32]:
print("Best Hyperparameters:", g_res.best_params_)
print("Best Accuracy:", g_res.best_score_)

# Train and evaluate the model with the best hyperparameters
best_knn = KNeighborsClassifier(n_neighbors=g_res.best_params_['n_neighbors'],
                                 p=g_res.best_params_['p'])
best_knn.fit(X_train, y_train)


Best Hyperparameters: {'n_neighbors': 5, 'p': 2}
Best Accuracy: 0.9619047619047619


# Model Evaluation

In [33]:
y_pred = best_knn.predict(X_test)
accuracy = np.mean(y_pred == y_test)
print("Accuracy:", accuracy)

Accuracy: 1.0
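As an alternative to retraining by hand, a fitted GridSearchCV object already holds a model refit with the best hyperparameters (refit=True is the default). The sketch below is self-contained and uses its own variable names; the fixed random_state=0 is an assumption added for reproducibility, so its numbers need not match the outputs above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_all, y_all = load_iris(return_X_y=True)
# random_state=0 is an assumption so that the sketch is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.3, random_state=0)

search = GridSearchCV(KNeighborsClassifier(),
                      {'n_neighbors': [3, 5, 7, 9, 11], 'p': [1, 2, 3]},
                      cv=3)
search.fit(X_tr, y_tr)

# best_estimator_ has already been refit on all of X_tr using best_params_
refit_model = search.best_estimator_
test_accuracy = refit_model.score(X_te, y_te)
print("Best params:", search.best_params_)
print("Test accuracy:", test_accuracy)
```

Calling `search.score(X_te, y_te)` gives the same value, since it delegates to the refit best estimator.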


<h2 style='color:purple'>Exercise</h2>

Use the Breast Cancer dataset from the sklearn library (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer) and perform hyperparameter tuning using random search (RandomizedSearchCV) for a Random Forest classifier.

In [34]:
from sklearn import datasets
bcd = datasets.load_breast_cancer()
X = bcd.data
y = bcd.target

In [35]:
from sklearn.ensemble import RandomForestClassifier

In [36]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3)

In [37]:
rf = RandomForestClassifier(n_estimators=40)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.9590643274853801

<h2 style='color:purple'>Hyperparameter Tuning</h2>

In [38]:
# hyperparameter search space 


In [39]:
# Create a RandomizedSearchCV object with the required parameters


In [40]:
#print the best score


In [41]:
# print the best hyperparameters

In [42]:
#Train Random Forest with best hyperparameters

In [43]:
#Compute and print accuracy on test set

<h3 style='color:purple'>Write your understanding about the effect of hyperparameter tuning on model performance in the next cell </h3>