In [1]:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import tree
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

# Read the data in
df = pd.read_csv("data/diabetes.csv")
X = df.drop(labels='Outcome', axis=1)  # independent variables (features)
y = df['Outcome'].values  # dependent variable (target)

# Standardize the features (zero mean, unit variance)
sc = StandardScaler()
sc.fit(X)
X = sc.transform(X)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
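
The cell above fits the scaler on the full dataset before the split, so statistics from the test rows also shape the scaled training data. A minimal sketch of one way to avoid that is to split first and let a pipeline fit the scaler on the training rows only. The synthetic data and the names `X_demo`, `y_demo`, and `pipe` are assumptions made for illustration; they are not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data (assumption: 8 numeric features, binary target)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# fit() learns the scaling statistics from X_tr only;
# score() reuses those statistics to transform X_te before predicting
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```

A pipeline like this can also be passed to GridSearchCV as the estimator, in which case the hyperparameter names take the step name as a prefix, for example `kneighborsclassifier__n_neighbors`.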



# Hyperparameter Optimization
Hyperparameters are a model’s configuration settings. They are chosen before training rather than learned from the data, and they need tuning to produce a better-performing model. They are model dependent and vary from model to model. For example, a k-nearest neighbors classifier has the following hyperparameters:<br>

In [2]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
kn = KNeighborsClassifier()
kn

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
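
The same list can also be read programmatically. As a small sketch, `get_params()` returns the hyperparameters of an estimator as a dictionary; the names `kn_demo` and `default_params` are only illustrative.

```python
from sklearn.neighbors import KNeighborsClassifier

kn_demo = KNeighborsClassifier()

# get_params() returns a dict mapping each hyperparameter name to its current value
default_params = kn_demo.get_params()
for name, value in sorted(default_params.items()):
    print(f"{name} = {value!r}")
```

The keys of this dictionary are the names you can use later when defining the values to search over.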

All of the parameters shown above are hyperparameters of the model. Changing them by hand and refitting the model on the training data every time is difficult because <br>
<ul>
    <li>it is time-consuming</li>
    <li>it is hard to keep track of which hyperparameters we have already tried and which we still have to try</li>
</ul>
To overcome this problem, GridSearchCV and RandomizedSearchCV are two widely used tools <b>for choosing good hyperparameters, for classification problems and other estimators alike.</b>
<br>
<h2>1) What is GridSearchCV?</h2>
    
GridSearchCV is a class in sklearn’s model_selection module. It is one of the most basic approaches to hyperparameter optimization: it loops through every combination of the predefined hyperparameter values and fits your estimator (model) on your training set for each one. In the end, you can select the best parameters from the listed hyperparameters.
<br>
In addition, you can specify the number of cross-validation folds used to evaluate each set of hyperparameters.<br>
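
To make the idea concrete, here is a rough sketch of what a grid search does when written by hand: build every combination, cross-validate each one, and keep the best. This is not how GridSearchCV is implemented internally; the synthetic data and the small `demo_grid` are assumptions chosen so the example runs on its own.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only so the sketch is self-contained
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid: 2 x 2 = 4 candidate combinations
demo_grid = {"n_neighbors": [3, 7], "weights": ["uniform", "distance"]}

# Expand the grid into one dict per combination
names = list(demo_grid)
candidates = [dict(zip(names, values)) for values in product(*demo_grid.values())]

best_score, best_params = -1.0, None
for candidate in candidates:
    model = KNeighborsClassifier(**candidate)
    # Mean accuracy over 5 cross-validation folds for this combination
    score = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    if score > best_score:
        best_score, best_params = score, candidate

print(best_params, round(best_score, 3))
```

With 4 candidates and 5 folds this loop performs 20 fits, which is the same kind of arithmetic GridSearchCV reports when verbose output is enabled.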

In [3]:
kn = KNeighborsClassifier()
# Hyperparameter values to try: 2 * 2 * 4 = 16 combinations
params = {
    'n_neighbors': [5, 25],
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
}
grid_kn = GridSearchCV(estimator=kn,
                       param_grid=params,
                       scoring='accuracy',
                       cv=5,
                       verbose=1,
                       n_jobs=-1)
# fit() returns the fitted search object itself
grid_search = grid_kn.fit(X_train, y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.6s
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed:    2.6s finished


Let’s break down the code block above. As usual, you need to import GridSearchCV and the estimator/model (in my example, KNeighborsClassifier) from the sklearn library.<br><br>

The next step is to define the hyperparameters you want to try out. These depend on the estimator you selected. All you need to do is create a dictionary (the variable params in my code) that has the hyperparameter names as keys and, for each key, an iterable that holds the values you want to try.<br><br>

Then all you have to do is create a GridSearchCV object. Here you need to define a few named arguments:<br>

    estimator: the estimator object you created
    param_grid: the dictionary that holds the hyperparameters you want to try
    scoring: the evaluation metric you want to use; you can pass a valid scoring string or a scorer object
    cv: the number of cross-validation folds used for each set of hyperparameters
    verbose: set it to 1 to get a detailed printout while GridSearchCV fits the data
    n_jobs: the number of processes to run in parallel for this task; if it is -1, all available processors are used

That is pretty much all you need to define. Then you fit the search object on your training data as you normally would.
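
The scoring argument accepts a scorer object as well as a string. The sketch below passes a scorer built with `make_scorer` from `f1_score`; the synthetic data, the choice of F1, and the name `search_f1` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data, used only so the sketch is self-contained
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Wrap a metric function into a scorer object that GridSearchCV can call
f1_scorer = make_scorer(f1_score)

search_f1 = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 7]},
    scoring=f1_scorer,
    cv=3,
)
search_f1.fit(X_demo, y_demo)
print(search_f1.best_params_, search_f1.best_score_)
```

For a binary target like this one, passing the string `'f1'` should select the same metric, so the scorer object is mostly useful when a metric needs extra arguments.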

In [4]:
print(grid_search)
print("The best score for our search is ",grid_search.best_score_)
print("The best parameters searched by GridSearchCV are ",grid_search.best_params_)

GridSearchCV(cv=5, error_score=nan,
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='deprecated', n_jobs=-1,
             param_grid={'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
                         'n_neighbors': [5, 25],
                         'weights': ['uniform', 'distance']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=1)
The best score for our search is  0.7568246716162192
The best parameters searched by GridSearchCV are  {'algorithm': 'auto', 'n_neighbors': 25, 'weights': 'distance'}
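
Beyond the single best score, a fitted search object keeps the results for every candidate in its `cv_results_` attribute. A minimal sketch of turning that into a ranked table follows; it uses synthetic data and a smaller grid than the notebook, so its numbers say nothing about the diabetes data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only so the sketch is self-contained
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 7], "weights": ["uniform", "distance"]},
    scoring="accuracy",
    cv=3,
)
search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays with one entry per candidate
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
summary = results[columns].sort_values("rank_test_score")
print(summary)
```

Looking at the whole table shows how close the runner-up candidates are to the winner, which a single best score cannot tell you.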


In [5]:
# Now we can initialize our algorithm with the tuned n_neighbors and algorithm values.
# Note that weights is set to 'uniform' here, whereas the search selected weights='distance'.
kn = KNeighborsClassifier(algorithm='auto', n_neighbors=25, weights='uniform')
kn.fit(X_train,y_train)
pred = kn.predict(X_test)
accuracy_score(y_test, pred)

0.7401574803149606
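
Because the search was created with refit=True (visible in the printed GridSearchCV above), the fitted search object already holds a model retrained on the whole training set with the best parameters. A small sketch of using it directly, which avoids copying parameter values by hand, is shown below. It runs on synthetic data with an illustrative grid, so its accuracy is not comparable to the number above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only so the sketch is self-contained
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": [5, 25], "weights": ["uniform", "distance"]},
    scoring="accuracy",
    cv=5,
)
search.fit(X_tr, y_tr)

# With refit=True (the default), best_estimator_ is already trained on all of X_tr
best_model = search.best_estimator_
pred_demo = best_model.predict(X_te)
test_accuracy = accuracy_score(y_te, pred_demo)
print(search.best_params_, test_accuracy)
```

An alternative with the same effect is to build the classifier with `KNeighborsClassifier(**search.best_params_)` and fit it yourself.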