<pre>Cross-validation is a technique in which the training dataset is divided into multiple equal-sized chunks, called folds.
In each iteration, one fold is held out of the training data and used as the validation set, while the model is trained on the remaining folds. Each fold serves as the validation set exactly once.
After cross-validation, the final model is evaluated on the separate test dataset.
Cross-validation helps guard against over-fitting when choosing a model, because every iteration trains and validates on a different split of the data.
</pre>
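
The fold rotation described above can be sketched on a toy array. This is only an illustration: the ten-sample array `samples` and the variable `folds` are made up for the example and are not part of the defaulter dataset.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples standing in for a training set (illustrative data only)
samples = np.arange(10).reshape(-1, 1)

# Five folds, no shuffling, so folds are taken in index order
k_fold = KFold(n_splits=5, shuffle=False)

folds = []
for fold_number, (train_idx, val_idx) in enumerate(k_fold.split(samples), start=1):
    # Each iteration holds out one fold for validation and trains on the rest
    folds.append((train_idx.tolist(), val_idx.tolist()))
    print(f"fold {fold_number}: train={train_idx.tolist()} validate={val_idx.tolist()}")
```

Without shuffling, the first fold holds out samples 0 and 1, the second holds out 2 and 3, and so on, so every sample is used for validation exactly once.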

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
#import warnings
#warnings.filterwarnings("ignore")

#### Read data
In this demo, we work with a defaulter dataset: based on a customer's income and outstanding loan balance, each customer is classified as a defaulter or not.

In [2]:
#read data from input csv file
defaulter = pd.read_csv("datasets/defaulter.csv")

### Feature Engineering

#### Normalizing the data using MinMaxScaler
<pre>
Normalizing feature 'A' using 'min_max' scaler:
    find the min and max values in feature 'A'
    new normalized value for field 'A'= (actual_value - min_value) / (max_value - min_value)
</pre>
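
As a small check of the formula above, the sketch below applies it by hand to three made-up values and compares the result with `MinMaxScaler`. The array `values` is hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Three made-up values for a single feature (illustrative only)
values = np.array([[10.0], [20.0], [40.0]])

# Manual min-max scaling: (actual_value - min_value) / (max_value - min_value)
min_value = values.min(axis=0)
max_value = values.max(axis=0)
manual = (values - min_value) / (max_value - min_value)

# The same transformation using scikit-learn
scaled = MinMaxScaler().fit_transform(values)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

The smallest value maps to 0, the largest to 1, and 20 maps to (20 - 10) / (40 - 10), which is one third.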

In [3]:
#### Normalizing the data using MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
features_to_scale = ["balance","income"]
scaled_values = scaler.fit_transform(defaulter[features_to_scale])
defaulter["norm_balance"] = scaled_values[:,0]
defaulter["norm_income"] = scaled_values[:,1]

#### Splitting the data into train and test set

In [4]:
from sklearn.model_selection import train_test_split
X=defaulter[["norm_balance","norm_income"]]
Y=defaulter['defaulter']
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2,random_state=100)
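
Note that in this notebook the scaler is fitted on the whole dataset before the split, so the test rows contribute to the min and max values. A common alternative, sketched below on synthetic data, is to split first and fit the scaler on the training portion only. The array `features` and the names `X_tr` and `X_te` are assumptions made for this example, not part of the original workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the balance and income columns (illustrative only)
rng = np.random.default_rng(0)
features = rng.uniform(0, 1000, size=(100, 2))

# Split first, then learn min and max from the training rows only
X_tr, X_te = train_test_split(features, test_size=0.2, random_state=100)
scaler = MinMaxScaler().fit(X_tr)

# Apply the training-set min and max to both portions
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```

With this approach the scaled test values may fall slightly outside the range 0 to 1, which is expected because the test rows were not used to compute the min and max.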

#### Finding best value of k for KNN

In [5]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
#create a new knn model
knn = KNeighborsClassifier()
#create a dictionary of all k neighbor values
param_grid = {'n_neighbors': np.arange(1, 15,2)}
#use gridsearch to test each value of k
knn_gscv = GridSearchCV(knn, param_grid, cv=5,return_train_score=True, verbose=1,scoring='accuracy')
#fit model to data
knn_gscv.fit(X_train,Y_train)

Fitting 5 folds for each of 7 candidates, totalling 35 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  35 out of  35 | elapsed:   16.6s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='warn', n_jobs=None,
             param_grid={'n_neighbors': array([ 1,  3,  5,  7,  9, 11, 13])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='accuracy', verbose=1)

In [6]:
#storing results to dataframe
#print(knn_gscv.cv_results_)
df=pd.DataFrame(knn_gscv.cv_results_)
#filtering out columns
df=df[['param_n_neighbors','mean_train_score','mean_test_score']]
df

| | param_n_neighbors | mean_train_score | mean_test_score |
|---|---|---|---|
| 0 | 1 | 1.0 | 0.95575 |
| 1 | 3 | 0.977688 | 0.967625 |
| 2 | 5 | 0.975406 | 0.970625 |
| 3 | 7 | 0.974563 | 0.971375 |
| 4 | 9 | 0.974781 | 0.97225 |
| 5 | 11 | 0.974469 | 0.9725 |
| 6 | 13 | 0.974313 | 0.972375 |


In [7]:
best_k = df["param_n_neighbors"][df["mean_test_score"]==
                                 df["mean_test_score"].max()]

print(best_k)

5    11
Name: param_n_neighbors, dtype: object
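
The best value can also be read directly from the fitted search object through its `best_params_` and `best_score_` attributes. The sketch below shows this on synthetic data, so it runs on its own. The dataset from `make_classification` and the names `X_demo`, `y_demo`, and `search` are assumptions for the example, and the result will differ from the defaulter dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-feature classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=2,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)

# Same kind of search as above: odd k values from 1 to 13, 5-fold CV
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": np.arange(1, 15, 2)},
                      cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

# Best k and its mean cross-validated accuracy
best_k_demo = search.best_params_["n_neighbors"]
best_score_demo = search.best_score_
print(best_k_demo, best_score_demo)
```

One difference from filtering the results dataframe: if several values of k tie for the best score, the boolean filter returns all of them, while `best_params_` reports a single candidate.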


In [8]:

from sklearn.model_selection import KFold
k_fold = KFold(n_splits=5,shuffle=False)
knn_k_vals = [1,3,5,7,9,11]
avg_train_accuracy = []
avg_val_accuracy = []
#Evaluate each candidate value of k from the list
for k in knn_k_vals:
    train_accuracy = []
    val_accuracy = []
    #Iterate over each fold from k_fold
    #Note: the folds are drawn from the full dataset (X, Y), not only from X_train as in the grid search above
    for train,val in k_fold.split(X,Y):
        #Train a model on the training folds for the selected value of k
        model = KNeighborsClassifier(n_neighbors=k,metric="euclidean")
        model.fit(X.iloc[train],Y.iloc[train])
        #Record accuracy on the training folds and on the held-out fold
        train_accuracy.append(model.score(X.iloc[train],Y.iloc[train]))
        val_accuracy.append(model.score(X.iloc[val],Y.iloc[val]))
    #Average the per-fold accuracies for this value of k
    avg_train_accuracy.append(np.mean(train_accuracy))
    avg_val_accuracy.append(np.mean(val_accuracy))

In [9]:
performance_scores = pd.DataFrame(np.array([knn_k_vals,
                                            avg_train_accuracy,
                                            avg_val_accuracy]).T,
                     columns=["k","avg_train_accuracy","avg_val_accuracy"])
performance_scores

| | k | avg_train_accuracy | avg_val_accuracy |
|---|---|---|---|
| 0 | 1.0 | 1.0 | 0.9564 |
| 1 | 3.0 | 0.97725 | 0.9668 |
| 2 | 5.0 | 0.974775 | 0.9699 |
| 3 | 7.0 | 0.974625 | 0.9718 |
| 4 | 9.0 | 0.9743 | 0.9722 |
| 5 | 11.0 | 0.97415 | 0.9724 |


In [10]:
best_k = performance_scores["k"][performance_scores["avg_val_accuracy"]==
                                 performance_scores["avg_val_accuracy"].max()]

best_k

5    11.0
Name: k, dtype: float64

#### Using the best k found to train a model

In [11]:
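#Note: both searches above selected k=11; this cell uses k=9, whose cross-validated accuracy is nearly identical
model = KNeighborsClassifier(n_neighbors = 9, metric="euclidean")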
model.fit(X_train,Y_train)
train_accuracy = model.score(X_train,Y_train)
test_accuracy = model.score(X_test,Y_test)
print(train_accuracy,test_accuracy)

0.974625 0.9725
