# 1. Nested cross-validation exercise

## Nested cross-validation for k-nearest neighbors
- Use Python 3 to implement nested leave-one-out cross-validation for the k-nearest neighbors (kNN) method so that the number of neighbors k is selected automatically from the set k = [3, 5, 7, 9, 11]. In other words, the base learning algorithm is kNN, but the actual learning algorithm, whose prediction performance is evaluated with nested CV, is kNN with automatic CV-based model selection (see the lectures and the pseudocode presented in them for more on this interpretation).
- Compare the C-index produced by nested leave-one-out CV with the C-index of ordinary leave-one-out CV at the best value of k.
- Use the provided kNN and C-index functions in your implementation.
- Run the CV implementations on the provided subsampled iris data (100 randomly drawn data points from iris) and report the resulting classification performance as a C-index. Hint: you can use the nested CV example in the sklearn documentation, https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html, as a starting point, but do NOT use the ready-made CV implementations of sklearn.

In summary, to complete this exercise, carry out the following steps:
_______________________________________________________________
#### 1. Use leave-one-out cross-validation to determine the optimal k-parameter for the data (X.csv, y.csv) from the set k = [3, 5, 7, 9, 11]. Once you have found the optimal k-parameter, save the corresponding C-index (call it loo_c_index) for this best value of k.
#### 2. Similarly, use nested leave-one-out cross-validation (leave-one-out in both the outer and inner folds) to determine the C-index (call it nloo_c_index) of the kNN + leave-one-out cross-validation based k selection approach.
#### 3. Submit both this notebook and a PDF file generated from it on the exercise submission page.
_______________________________________________________________

Remember to use the provided C-index and kNN functions in your implementation! 
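
The difference between the two steps is easiest to see from which samples each loop is allowed to touch. The following is a minimal sketch of the index bookkeeping only, on a toy size of 4 samples; the helper name `loo_splits` is hypothetical and is not part of the provided functions. The outer loop holds out one sample for testing, and the inner loop runs leave-one-out only on the remaining samples, so the outer test sample never takes part in selecting k.

```python
import numpy as np

def loo_splits(n):
    """Yield (train_positions, test_position) for leave-one-out over n items."""
    positions = np.arange(n)
    for i in range(n):
        yield np.delete(positions, i), i

n = 4  # toy sample count, chosen only for illustration
nested = []
for outer_train, outer_test in loo_splits(n):
    # The inner splits are positions within outer_train, so map them back
    for inner_train_pos, inner_val_pos in loo_splits(len(outer_train)):
        inner_train = outer_train[inner_train_pos]
        inner_val = outer_train[inner_val_pos]
        nested.append((outer_test, int(inner_val), inner_train.tolist()))

# Each of the 4 outer folds runs 3 inner folds
print(len(nested))
print(nested[0])
```

With n samples this gives n * (n - 1) inner folds per candidate k, which is why nested leave-one-out is noticeably slower than the plain version even on 100 data points.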

## Import libraries and data

In [41]:
#In this cell import all libraries you need. For example: 
import numpy as np
import pandas as pd
import sklearn.model_selection
from IPython.display import display

X = pd.read_csv('data/X.csv', header=None).to_numpy()
y = pd.read_csv('data/y.csv', header=None).to_numpy()
display(X[:5], len(X))
display(y[:5], len(y))

array([[6.5, 3. , 5.2, 2. ],
       [6.1, 2.9, 4.7, 1.4],
       [6.4, 3.1, 5.5, 1.8],
       [5.1, 3.7, 1.5, 0.4],
       [5.7, 2.8, 4.5, 1.3]])

100

array([[2.],
       [1.],
       [2.],
       [0.],
       [1.]])

100
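
As the output above shows, the labels load as a single column of shape (100, 1) rather than a flat vector. Below is a small sketch, using a made-up five-row array in place of the real file, of flattening such a column and dropping one sample from it, which is the pattern the leave-one-out loops later rely on.

```python
import numpy as np

# Made-up labels shaped like the loaded y: one column, one row per sample
y_col = np.array([[2.0], [1.0], [2.0], [0.0], [1.0]])

# Flatten to a 1-D vector so that y[i] is a scalar label
y_flat = y_col.reshape(-1)

# Leave out the sample at index 3, as a leave-one-out training split would
y_without_3 = np.delete(y_flat, 3)

print(y_col.shape, y_flat.shape, y_without_3)
```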

## Provided functions 

In [42]:
"""
C-index function: 
- INPUTS: 
'y' an array of the true output values
'yp' an array of predicted output values
- OUTPUT: 
The c-index value
"""
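def cindex(y, yp):
    n = 0       # number of comparable pairs (pairs with different true values)
    h_num = 0   # concordance credit accumulated over the comparable pairs
    for i in range(0, len(y)):
        t = y[i]
        p = yp[i]
        for j in range(i+1, len(y)):
            nt = y[j]
            p_next = yp[j]
            if (t != nt):
                n = n + 1
                # Full credit when the predictions are ordered like the true values
                if (p < p_next and t < nt) or (p > p_next and t > nt):
                    h_num += 1
                # Half credit when the predictions are tied
                elif (p == p_next):
                    h_num += 0.5
    return h_num/n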

"""
Self-contained k-nearest neighbor
- INPUTS: 
'X_train' a numpy matrix of the X-features of the train data points
'y_train' a numpy matrix of the output values of the train data points
'X_test' a numpy matrix of the X-features of the test data points
'k' the k-parameter integer value for kNN
- OUTPUT: 
'y_predictions' a list of the output value predictions
"""
def knn(X_train, y_train, X_test, k):
    y_train = np.array(y_train, dtype=int)
    y_predictions = []
    for test_ind in range(0, X_test.shape[0]):
        diff = X_test[test_ind, :].reshape(1, -1) - X_train
        distances = np.sqrt(np.sum(diff * diff, axis = 1))
        sort_inds = np.array(np.argsort(distances), dtype=int)
        counts = np.bincount(y_train[sort_inds[0:k]])
        y_predictions.append(np.argmax(counts))
    return y_predictions
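
To get a feel for the scale of the C-index before using it as a model selection score, here is a minimal sketch that restates the same pair-counting rule in compact form for flat sequences. The name `cindex_toy` is hypothetical and is used only for this illustration; the exercise itself should still call the provided `cindex`.

```python
from itertools import combinations

def cindex_toy(y_true, y_pred):
    """Pairwise concordance for flat sequences, following the rule of the provided cindex."""
    concordant = 0.0
    comparable = 0
    for (t1, p1), (t2, p2) in combinations(zip(y_true, y_pred), 2):
        if t1 == t2:
            continue  # pairs with equal true values are not comparable
        comparable += 1
        if (p1 < p2 and t1 < t2) or (p1 > p2 and t1 > t2):
            concordant += 1.0  # predictions ordered like the true values
        elif p1 == p2:
            concordant += 0.5  # tied predictions earn half credit
    return concordant / comparable

perfect = cindex_toy([0, 1, 2], [0, 1, 2])    # every pair ordered correctly
constant = cindex_toy([0, 1, 2], [1, 1, 1])   # every pair tied
reversed_ = cindex_toy([0, 1, 2], [2, 1, 0])  # every pair ordered wrongly
print(perfect, constant, reversed_)
```

A perfect ranking scores 1.0, a constant prediction scores 0.5, and a fully reversed ranking scores 0.0. Note also that the score is defined over pairs of samples, so it needs predictions for at least two samples with different true values.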

## Your implementation here

#### Determine the optimal k-parameter using LOOCV


In [46]:
# Find the best k-parameter for data X, using the C-index as the evaluation score
def calculate_best_k(X, y):
    # Flatten the labels so that both (n, 1) and (n,) inputs are handled
    y = np.asarray(y).reshape(-1)
    # Candidate k-parameters
    kSet = [3, 5, 7, 9, 11]
    # Best k and its score so far
    best_k = kSet[0]
    best_score = 0
    for k in kSet:
        # Leave-one-out predictions for this k
        preds = []
        for i in range(len(X)):
            # Use the ith row as the test set and all other rows as the training set
            X_test = X[i].reshape(1, -1)
            X_train = np.delete(X, i, axis=0)
            y_train = np.delete(y, i, axis=0)
            # knn returns a list with one prediction per test row; keep the single value
            pred = knn(X_train, y_train, X_test, k)
            preds.append(pred[0])
        # C-index of the pooled leave-one-out predictions
        score = cindex(y, preds)
        # Keep this k if it beats the best score so far (ties keep the smaller k)
        if score > best_score:
            best_score = score
            best_k = k
    # Return the best k-parameter and its C-index
    return best_k, best_score

best_k, loo_c_index = calculate_best_k(X, y)
print("Best k-parameter from kSet=[3, 5, 7, 9, 11] is", best_k, "with C-index score of", loo_c_index)

Best k-parameter from kSet=[3, 5, 7, 9, 11] is 5 with C-index score of 0.9746474647464747


#### Determine C-index of previous selection using NLOOCV

In [47]:
# Estimate the C-index of kNN with LOOCV-based k selection using nested leave-one-out cross-validation
def calculate_nested_c_index(X, y):
    # Flatten the labels so they line up with the list of predictions
    y = np.asarray(y).reshape(-1)
    # One prediction per outer fold, collected across the whole outer loop
    preds = []
    for i in range(len(X)):
        # Outer fold: the ith row is the test set, all other rows are the training set
        X_test = X[i].reshape(1, -1)
        X_train = np.delete(X, i, axis=0)
        y_train = np.delete(y, i, axis=0)
        # Inner LOOCV, run on the training set only, selects k for this fold
        best_k, _ = calculate_best_k(X_train, y_train)
        # Predict the held-out row with the selected k
        pred = knn(X_train, y_train, X_test, best_k)
        preds.append(pred[0])
    # The C-index compares pairs of samples, so compute it once over all outer-fold predictions
    return cindex(y, preds)

In [48]:
# C-index of the nested LOOCV
nloo_c_index = calculate_nested_c_index(X, y)
print("C-index of nested LOOCV:", nloo_c_index)