# Lab 04

## Cross Validation 

### 1-2

#### 1) Reuse the notebook from Lab 3 for the wine data. Make sure to
####        * Reuse the same random seed throughout.
####        * Use nearest neighbors
#### 2) Use KFold to produce the data splits and implement cross validation. Make sure to store the predictions on each test fold and print the classification_report after having looped over all folds.

In [None]:
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold
import numpy as np

x, y = load_wine(return_X_y=True)  # split into features x and labels y

In [None]:
# Fixed seed so the folds are reproducible (42 is arbitrary; use the same seed as in Lab 3)
kf = KFold(n_splits=3, random_state=42, shuffle=True)

result_array = []       # accuracy per fold
y_test_report = []      # true labels of all test folds
y_predict_report = []   # predictions on all test folds

for train_index, test_index in kf.split(x):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    y_test_report.extend(y_test)
    # Fit the scaler on the training fold only, then apply it to the test fold
    scaler = StandardScaler(copy=True)
    xTrain_scaled = scaler.fit_transform(x_train)
    xTest_scaled = scaler.transform(x_test)
    minDis = KNeighborsClassifier(n_neighbors=7)
    minDis.fit(xTrain_scaled, y_train)
    y_predict_report.extend(minDis.predict(xTest_scaled))
    result_array.append(minDis.score(xTest_scaled, y_test))

print('Average score: ', np.mean(result_array))

## print the test reports
print('The classification report:\n')
print(classification_report(y_test_report, y_predict_report))
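
The manual loop can also be written with scikit-learn's helper functions. The following is a minimal sketch, assuming an arbitrary seed of 42 and the same `n_neighbors=7` as above: wrapping the scaler and the classifier in a pipeline means the scaler is refit on every training fold, and `cross_val_predict` collects the predictions made on each test fold.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x, y = load_wine(return_X_y=True)

# Scaler + classifier as one estimator, so scaling is learned per training fold
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))

# Seed 42 is an assumption for illustration; reuse the seed from Lab 3 instead
kf = KFold(n_splits=3, shuffle=True, random_state=42)

# One out-of-fold prediction per sample, in the original sample order
y_pred = cross_val_predict(model, x, y, cv=kf)
print(classification_report(y, y_pred))
```

Because the predictions come back in the original sample order, they can be compared directly with `y` without collecting the test labels fold by fold.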

### 3-4

#### 3) Try with k=3 and k=10 folds.
#### 4) In order to interpret the results (and fix possible issues), take a close look at the KFold visualization from the User Guide (not based on the wine data!):

In [None]:
# Fixed seed so the folds are reproducible (42 is arbitrary; use the same seed as in Lab 3)
kf = KFold(n_splits=10, random_state=42, shuffle=True)

result_array = []
y_test_report = []
y_predict_report = []

for train_index, test_index in kf.split(x):
    x_train, x_test = x[train_index],x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    y_test_report.extend(y_test)
    scaler = StandardScaler(copy=True)
    xTrain_scaled = scaler.fit_transform(x_train, y_train)
    minDis = KNeighborsClassifier(n_neighbors=7)
    minDis.fit(xTrain_scaled, y_train)
    xTest_scaled = scaler.transform(x_test)
    y_predict_report.extend(minDis.predict(xTest_scaled))
    result_array.append(minDis.score(xTest_scaled, y_test))

print('Average score: ', np.mean(result_array))
## print the test reports
print('The classification report:\n')
print(classification_report(y_test_report, y_predict_report))

##### Setting the shuffle parameter is very important because the samples in the dataset are ordered by class; without shuffling, a test fold can miss entire classes.
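
The effect of the class ordering can be checked directly. This short sketch (variable names such as `fold_classes` are chosen for illustration) splits the wine data with `shuffle=False` and lists which classes appear in each training and test fold.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold

x, y = load_wine(return_X_y=True)

# The labels never decrease, i.e. the samples are grouped by class
print("labels sorted by class:", bool(np.all(np.diff(y) >= 0)))

fold_classes = []
for train_index, test_index in KFold(n_splits=3, shuffle=False).split(x):
    train_labels = np.unique(y[train_index]).tolist()
    test_labels = np.unique(y[test_index]).tolist()
    fold_classes.append(test_labels)
    print("train classes:", train_labels, "| test classes:", test_labels)
```

With contiguous folds, no test fold contains all three classes, so the per-fold scores say little about overall performance. Shuffling, or a stratified splitter such as `StratifiedKFold`, avoids this.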

## Grid Search

### 1-3

#### 1) Implement Grid Search in combination with cross validation.
####        * Use the following parameters from the KNeighborsClassifier for the grid: n_neighbors and p. Select reasonable values for both.
####        * Implement a for loop to iterate over all combinations of the grid:
#### 2) Run the Grid Search and print the classification report for each parameter combination.
#### 3) Which parameter combination performs best?
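
As a quick illustration of how `ParameterGrid` expands a grid, the sketch below enumerates all combinations for the values used in this lab. The keys have to match the argument names of `KNeighborsClassifier` so that each dictionary can be passed on as `KNeighborsClassifier(**params)`.

```python
from sklearn.model_selection import ParameterGrid

# p=1 is the Manhattan distance, p=2 the Euclidean distance
grid = ParameterGrid({"n_neighbors": [2, 10], "p": [1, 2]})

# Each element is a plain dict holding one parameter combination
combinations = list(grid)
for params in combinations:
    print(params)
```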

In [None]:
from sklearn.model_selection import ParameterGrid

# Grid of KNeighborsClassifier parameters (p=1: Manhattan, p=2: Euclidean distance)
param_grid = ParameterGrid({"n_neighbors": [2, 10], "p": [1, 2]})

result_acb = {}

for params in param_grid:
    # Fixed seed (42 is arbitrary): every combination is evaluated on the same folds
    kf = KFold(n_splits=10, random_state=42, shuffle=True)
    key = str(params["n_neighbors"]) + " / " + str(params["p"])
    result_acb[key] = {"Y_TEST": [], "Y_PREDICT": [], "Y_Score": []}

    for train_index, test_index in kf.split(x):
        x_train, x_test = x[train_index], x[test_index]
        y_train, y_test = y[train_index], y[test_index]
        result_acb[key]["Y_TEST"].extend(y_test)
        # Fit the scaler on the training fold only, then apply it to the test fold
        scaler = StandardScaler(copy=True)
        xTrain_scaled = scaler.fit_transform(x_train)
        xTest_scaled = scaler.transform(x_test)
        minDis = KNeighborsClassifier(**params)
        minDis.fit(xTrain_scaled, y_train)
        result_acb[key]["Y_PREDICT"].extend(minDis.predict(xTest_scaled))
        result_acb[key]["Y_Score"].append(minDis.score(xTest_scaled, y_test))


## print the test reports
for parameters_in in result_acb.keys():
    print('Grid parameters:')
    print(parameters_in)
    print('Average score: ', np.mean(result_acb[parameters_in]["Y_Score"]))
    print(classification_report(result_acb[parameters_in]["Y_TEST"], result_acb[parameters_in]["Y_PREDICT"]))
    print('---------------------------------------------------------------------------')

##### n_neighbors = 10 with the Manhattan distance (p = 1) gives the best result.

## Combining Grid Search and Cross Validation

### 1 - 4

#### 1) Carefully read the documentation of GridSearchCV, which combines the mechanisms of the grid search and the cross validation.
#### 2) Reuse the KNeighborsClassifier and the ParameterGrid (check for correct naming).
#### 3) Set the cross validation splitting strategy to k=10 folds.
#### 4) Evaluate the results using GridSearchCV's built-in methods.

In [None]:
from sklearn.model_selection import GridSearchCV
parameters = {"n_neighbors":[2,10], "p":[1,2]}
kn = KNeighborsClassifier()
clf = GridSearchCV(kn, parameters, cv = 10)

clf.fit(x,y)

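# best_params_ holds the winning combination, best_score_ its mean cross-validated accuracy
print("best parameters: ", clf.best_params_)
print("best score: ", clf.best_score_)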

##### As we can see, the best parameter combination is the same as in the previous task (the scores themselves can differ, because the data is not scaled here).
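
The manual grid search scaled every training fold, whereas a bare `KNeighborsClassifier` passed to `GridSearchCV` sees the raw features. One possible way to make the two comparable is sketched below, assuming a `Pipeline` with the step names `scaler` and `knn` (chosen here for illustration) and an explicit shuffled `KFold` with an arbitrary seed of 42. Inside a pipeline, grid keys are prefixed with the step name and a double underscore.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

x, y = load_wine(return_X_y=True)

# The scaler is refit on the training part of every fold
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# "<step name>__<parameter name>" addresses a parameter of a pipeline step
param_grid = {"knn__n_neighbors": [2, 10], "knn__p": [1, 2]}

# Explicit splitter: 10 shuffled folds with a fixed (arbitrary) seed
cv = KFold(n_splits=10, shuffle=True, random_state=42)

search = GridSearchCV(pipe, param_grid, cv=cv)
search.fit(x, y)
print("best parameters:", search.best_params_)
print("best mean accuracy:", search.best_score_)
```

Passing an integer such as `cv=10` for a classifier uses stratified folds without shuffling, so an explicit splitter is needed to reproduce the shuffled `KFold` setup from the earlier tasks.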

### 5-6 

#### 5) Change the parameter scoring to use the F1 score for evaluation.
#### 6) Find out how to store/access the best model parametrization.

In [None]:
from sklearn.model_selection import GridSearchCV
parameters = {"n_neighbors":[2,10], "p":[1,2]}
kn = KNeighborsClassifier()
# f1_weighted because the target has three classes; the plain "f1" scorer only supports binary targets
clf = GridSearchCV(kn, parameters, cv = 10, scoring = "f1_weighted")

clf.fit(x,y)
estimator = clf.best_estimator_  # model refitted on all data with the best parameters
print("best value for n_neighbors parameter: ", clf.best_params_["n_neighbors"])
print("best value for p parameter: ", clf.best_params_["p"])
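
For storing the best model, one option is Python's `pickle` module. The sketch below serializes the refitted best estimator in memory and restores it; the names `blob` and `restored` are illustrative. Writing to a file works the same way with `pickle.dump` and `pickle.load`.

```python
import pickle

from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

x, y = load_wine(return_X_y=True)

clf = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [2, 10], "p": [1, 2]},
    cv=10,
    scoring="f1_weighted",
)
clf.fit(x, y)

# The winning parameter combination as a plain dict
best_params = clf.best_params_
print("best parameters:", best_params)

# Serialize the refitted best model and load it back
blob = pickle.dumps(clf.best_estimator_)
restored = pickle.loads(blob)

# The restored model predicts exactly like the original one
same = bool((restored.predict(x) == clf.best_estimator_.predict(x)).all())
print("restored model matches:", same)
```

A pickled model should only be loaded with a compatible scikit-learn version and only from a trusted source.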

## Homework

#### Extend the grid with a parameter for switching the scaling of the data on/off. Then, for each test run made so far, enter the cross validation results in your table. Those values are more robust and reliable than those obtained from a single run.

In [None]:
x, y = load_wine(return_X_y=True)  # split into features x and labels y

from sklearn.preprocessing import StandardScaler

n_neighbours = [2, 10]
p = [1, 2]
sc = [True, False]

result_acb = {}

for n_nei in n_neighbours:
    for p_ in p:
        for s in sc:
            # Fixed seed (42 is arbitrary): every combination is evaluated on the same folds
            kf = KFold(n_splits=10, random_state=42, shuffle=True)
            index_string = str(n_nei) + " / " + str(p_) + " / " + str(s)
            result_acb[index_string] = {"Y_TEST": [], "Y_PREDICT": [], "Y_Score": []}

            for train_index, test_index in kf.split(x):
                x_train, x_test = x[train_index], x[test_index]
                y_train, y_test = y[train_index], y[test_index]
                # Scale only when the switch is on; otherwise use the raw features
                if s:
                    scaler = StandardScaler(copy=True)
                    x_train = scaler.fit_transform(x_train)
                    x_test = scaler.transform(x_test)

                result_acb[index_string]["Y_TEST"].extend(y_test)
                minDis = KNeighborsClassifier(n_neighbors=n_nei, p=p_)
                minDis.fit(x_train, y_train)
                result_acb[index_string]["Y_PREDICT"].extend(minDis.predict(x_test))
                result_acb[index_string]["Y_Score"].append(minDis.score(x_test, y_test))

## print the test reports
for parameters_in in result_acb.keys():
    print('Grid parameters:')
    print(parameters_in)
    print('Average score: ', np.mean(result_acb[parameters_in]["Y_Score"]))
    print(classification_report(result_acb[parameters_in]["Y_TEST"], result_acb[parameters_in]["Y_PREDICT"]))
    print('---------------------------------------------------------------------------')
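
The scaling switch can also be expressed inside `GridSearchCV`. In this sketch (step names `scaler` and `knn` and the seed 42 are assumptions for illustration) the `scaler` step of a pipeline is itself a grid parameter that is either a `StandardScaler` or the string `"passthrough"`, which tells the pipeline to skip that step. The mean and standard deviation of the fold scores are read from `cv_results_`.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

x, y = load_wine(return_X_y=True)

pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# "passthrough" disables the scaler step for that grid point
param_grid = {
    "scaler": [StandardScaler(), "passthrough"],
    "knn__n_neighbors": [2, 10],
    "knn__p": [1, 2],
}

cv = KFold(n_splits=10, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv)
search.fit(x, y)

# One row per parameter combination: mean and spread of the fold accuracies
results = search.cv_results_
rows = zip(results["params"], results["mean_test_score"], results["std_test_score"])
for params, mean, std in rows:
    scaling = "off" if isinstance(params["scaler"], str) else "on"
    print(
        "n_neighbors =", params["knn__n_neighbors"],
        "| p =", params["knn__p"],
        "| scaling", scaling,
        "| accuracy %.3f +/- %.3f" % (mean, std),
    )
```

The printed rows can be copied into the results table; comparing the "on" and "off" rows shows how much the distance-based classifier depends on feature scaling.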