# 6. Nested Cross-Validation

**Instructions:**
* go through the notebook and complete the **tasks** .  
* Make sure you understand the examples given!
* When a question allows a free-form answer (e.g., ``what do you observe?``) create a new markdown cell below and answer the question in the notebook.
* ** Save your notebooks when you are done! **

In the previous lab, we looked at cross-validation when the parameters of our classifier (e.g., k-NN) where known.
In this lab, you will be extending the code for cross-validation in order to find the best parameters to use for each fold, by using a validation set.  Please have a look at the relevant lecture slides that demo how to apply nested cross-validation in order to remember the procedure.

**Note** You can always copy the code in a separate notebook (or, a plain text file .py that you can run with python from the command line) if you want.  After you are done, you can copy the code back in this notepad.

<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> Run the cell below to load our data. Note that besides adding noise, we also initialize the numpy random seed - this is in order to always get the same results regardless of how many times we run the code. Otherwise, this piece of code is the same as the previous lab.

In [2]:
%matplotlib inline


from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt

#import k-nn classifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()

#view a description of the dataset (uncomment next line to do so)
#print(iris.DESCR)

#Set X equal to features, Y equal to the targets

X=iris.data 
y=iris.target 


mySeed=1234567
#initialize random seed generator 
np.random.seed(mySeed)

#we add some random noise to our data to make the task more challenging
X=X+np.random.normal(0,0.5,X.shape)

<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> Your task is now to write your own nested cross-validation function.

You can assume that we want to run 5-fold cross-validation, and evaluate the number of neighbours (from 1 to 10 inclusive), along with the 'euclidean' and 'manhattan' distances.

Your function should split the data (using indexes) into appropriate bins, similarly to how this was done in the previous lab. 

For each fold, the testing set should consist of indices in one bin, the validation set should consist of indices in another bin, and the rest of the bins can be assigned to your training set.

Subsequently, we loop through all different parameters (one for loop for neighbours, one for loop for distances), train on the training set and test on the validation set.

Once we are done, we have the best performing set of parameters on our validation set.  We subsequently merge the training set with the validation set, and train on that set using best parameters.

Finally, we evaluate on our test set, and proceed to the next fold.

Your function should return the accuracies on the test set (with best parameters) over all five folds, e.g. ``[0.80000000000000004, 0.8666666666666667, 0.80000000000000004, 0.96666666666666667, 0.73333333333333328]``

The code below is commented so that you can work through developing the function - if you feel more comfortable, you can start working on this code in a different cell/ide and then copy the code here.

In [66]:
# nested cross validation function
# X - data / features
# y - outputs
# foldK - number of folds
# nns - list of number of neighbours parameter for validation
# dists - list of distances for validation
# mySeed - random seed
# returns: accuracy over 5 folds (list)

#n_neighbors=5, metric='euclidean'
def knnise(training,labels,test,neighbours,myMetric):
    knn=KNeighborsClassifier(n_neighbors=neighbours, metric=myMetric)
    #define training and testing data, fit the classifier
    knn.fit(training,labels)
    #predict values for test data based on training data
    k_non=knn.predict(test)
    return k_non;

def myAccuracy(testing,predicted):
    mistakes=0
    for i in range (len(testing)):
        if (testing[i]!=predicted[i]): mistakes+=1
    return 1-mistakes/len(testing);

def myNestedCrossVal(X,y,foldK,nns,dists,mySeed):
    np.random.seed(mySeed)
    accuracy_fold=np.zeros(foldK)
    
    #TASK: use the function np.random.permutation to generate a list of shuffled indices from in the range (0,number of data)
    #(you did this already in a task above)
    indices=np.random.permutation(np.arange(len(X)))
    #print(indices)
    
    #TASK: use the function array_split to split the indices to foldK different bins (here, 5)
    #uncomment line below
    bins=np.split(indices,foldK)
    #print(bins)
    
    #no need to worry about this, just checking that everything is OK
    assert(foldK==len(bins))
    
    #loop through folds
    for foldNum in range(0,foldK):
        foldTest=bins[foldNum%foldK]  # list to save current indices for testing
        foldVal=bins[(foldNum+1)%foldK]    # list to save current indices for validation
        #loop through all bins, take bin i for testing, the next bin for validation and the rest for testing
        foldTrain=np.delete(bins,[foldNum%foldK,(foldNum+1)%foldK],0).flatten()    # list to save current indices for training
        '''
        print (foldTrain)
        print (foldTest)
        print (foldVal)
        print (bins)
        '''
        

        '''   
        print('** Train', len(foldTrain), foldTrain)
        print('** Val', len(foldVal), foldVal)
        print('** Test', len(foldTest), foldTest)
        '''
        
        #no need to worry about this, just checking that everything is OK
        assert not np.intersect1d(foldTest,foldVal)
        assert not np.intersect1d(foldTrain,foldTest)
        assert not np.intersect1d(foldTrain,foldVal)
       
        #'''
        bestDistance='' #save the best distance metric here
        bestNN=-1 #save the best number of neighbours here
        bestAccuracy=-10 #save the best attained accuracy here (in terms of validation)      
        # loop through all parameters (one for loop for distances, one for loop for nn)
        for distLoop in range (0,len(dists)):
            #print (dists[distLoop])
            for neighLoop in range (0,len(nns)):
                # train the classifier on current number of neighbours/distance
                val_pred=knnise(X[foldTrain],y[foldTrain],X[foldVal],nns[neighLoop],dists[distLoop])
                # obtain results on validation 
                currentAccuracy=myAccuracy(y[foldVal],val_pred)
                '''
                print (dists[distLoop])
                print (nns[neighLoop])
                print (currentAccuracy)
                '''
                # save parameters if results are the best we had
                if (currentAccuracy>bestAccuracy): 
                    bestAccuracy=currentAccuracy
                    bestDistance=dists[distLoop] 
                    bestNN=nns[neighLoop]
        print('** End of val for this fold, best NN', bestNN, 'best Dist', bestDistance)
        #'''
        
        #evaluate on test data:
        #extend your training set by including the validation set             
        foldTrain=np.concatenate((foldTrain,foldVal),0)
        #train k-NN classifier on new training set and test on test set
        test_pred=knnise(X[foldTrain],y[foldTrain],X[foldTest],bestNN,bestDistance)
        #get performance on fold, save result in accuracy_fold array
        accuracy_fold[foldNum]=myAccuracy(y[foldTest],test_pred)
        
        print('==== Final Cross-val on test on this fold with NN', bestNN, 'dist', bestDistance, ' accuracy ',accuracy_score(y[foldTest],y_pred))
        
        #DEBUG: Using old KNN to see if I've messed up anything
        '''
        knn=KNeighborsClassifier(n_neighbors=5, metric='euclidean')
        #define training and testing data, fit the classifier
        knn.fit(X[foldTrain],y[foldTrain])
        #predict values for test data based on training data
        y_mypred=knn.predict(X[foldTest])
        
        
        #print(foldNum)
        #print(accuracy_fold)
        #accuracy_fold[foldNum]=1-mistakes/len(foldTest)
        accuracy_fold[foldNum]=myAccuracy(y[foldTest],y_mypred)
        '''
        
    return accuracy_fold;
    
#call your nested crossvalidation function:
 
accuracy_fold=myNestedCrossVal(X,y,5,list(range(1,11)),['euclidean','manhattan'],mySeed)

print(accuracy_fold)

euclidean
euclidean
1
0.8666666666666667
euclidean
2
0.8666666666666667
euclidean
3
0.8333333333333334
euclidean
4
0.8666666666666667
euclidean
5
0.8666666666666667
euclidean
6
0.8666666666666667
euclidean
7
0.8333333333333334
euclidean
8
0.8333333333333334
euclidean
9
0.8333333333333334
euclidean
10
0.8333333333333334
manhattan
manhattan
1
0.8333333333333334
manhattan
2
0.8666666666666667
manhattan
3
0.8333333333333334
manhattan
4
0.8666666666666667
manhattan
5
0.8666666666666667
manhattan
6
0.8666666666666667
manhattan
7
0.8333333333333334
manhattan
8
0.8333333333333334
manhattan
9
0.8333333333333334
manhattan
10
0.8333333333333334
** End of val for this fold, best NN 1 best Dist euclidean


NameError: name 'y_pred' is not defined