# Part 1: Introduction and brief overview

This workshop is conducted by ***Milena Vujović$^{1}$***, ***Frederikke Isa Marin$^{1}$*** and ***Anna-Lisa Schaap-Johansen$^{1}$***.

You have already worked through data wrangling, visualisation, clustering and regression techniques, and practised looking up documentation. Today's exercise is therefore more independent: if you don't remember how to do a step, e.g. loading the data, look it up in the notebooks you have worked on before.


Today we will go over the K nearest neighbours (KNN) algorithm.

At the end of this day you should be able to:
1. analyse your data using KNN classification
2. confidently plot the results of your analysis
3. assess your KNN classifier


We will go over two datasets in order to fulfil our goals.

The first dataset, stored in "aa_frequency_location.tsv", has information on the N-terminus of proteins. The dataset consists of two classes: Secretory and Non-secretory proteins. The input consists of 20 features, which are the frequencies of each of the 20 amino acids among the first 30 residues of a protein (the N-terminal part). Our main question is whether we can see any differences in amino acid usage between secretory and non-secretory proteins, and how we can use these differences to classify the proteins.
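To make the feature definition concrete, the sketch below shows one way such frequencies could be computed. It is not the code that was used to build "aa_frequency_location.tsv"; the function name `n_terminal_frequencies` and the example sequence are assumptions made purely for illustration.

```python
from collections import Counter

# The 20 standard amino acids, in the same alphabetical order as the dataset columns
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def n_terminal_frequencies(sequence, n_residues=30):
    """Return the frequency of each amino acid in the first n_residues of a sequence."""
    n_terminal = sequence[:n_residues].upper()
    if not n_terminal:
        raise ValueError("sequence must not be empty")
    counts = Counter(n_terminal)
    # Divide by the actual window length so that shorter sequences still sum to 1
    return {aa: counts.get(aa, 0) / len(n_terminal) for aa in AMINO_ACIDS}


# Arbitrary sequence, used only for illustration (not taken from the dataset)
example_sequence = "MKLLSAVALLGTASAQEDNKRPWYFHIVCMGT"
freqs = n_terminal_frequencies(example_sequence)
print(freqs["L"], freqs["A"])
```

With a window of 30 residues every frequency is a multiple of 1/30, which is why values such as 0.033333 and 0.066667 recur throughout the table.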

The second dataset is stored in "tissue_expression.tsv". It contains gene expression levels for 189 samples from 7 tissues.


***
You can contact us at<br>
Milena Vujovic: milvu@dtu.dk (twitter: *@sciencisto*) <br>
Frederikke Isa Marin: frisa@dtu.dk (twitter: *@fimarin42*) <br>
Anna-Lisa Schaap-Johansen: alsj@dtu.dk (twitter: *@SchaapJohansen*) <br>
and also for the duration of this course on wechat :)

$^{1}$ Bioinformatics section, DTU Health Technology, Technical University of Denmark, Greater Copenhagen area, Denmark<br>


In [104]:
# Load packages
import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt
import matplotlib.colors as colors
import matplotlib.cm as cmx

from sklearn import decomposition
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

In [2]:
plt.rcParams['figure.figsize'] = [10, 10]

In [None]:
## Symbolic link to the data: 
%cd
%cd ml_data
!ln -s /exercises/ml_intro/ml_data/aa_frequency_location.tsv ./aa_frequency_location.tsv # command to make symbolic link
!ln -s /exercises/ml_intro/ml_data/aa_frequency_location_incomplete.tsv ./aa_frequency_location_incomplete.tsv # command to make symbolic link
!ln -s /exercises/ml_intro/ml_data/tissue_expression.tsv ./tissue_expression.tsv # command to make symbolic link
!pwd
!ls

# Q1 Load the data (1 point)

Load the data into a pandas dataframe called aa_freq_loc_df

In [52]:
aa_freq_loc_df = pd.read_csv("aa_frequency_location.tsv", sep = "\t")
aa_freq_loc_df

Unnamed: 0,location,A,C,D,E,F,G,H,I,K,...,M,N,P,Q,R,S,T,V,W,Y
0,Non-secretory,0.133333,0.000000,0.000000,0.000000,0.033333,0.033333,0.000000,0.033333,0.033333,...,0.066667,0.033333,0.033333,0.000000,0.066667,0.300000,0.066667,0.100000,0.000000,0.000000
1,Non-secretory,0.233333,0.033333,0.000000,0.000000,0.033333,0.066667,0.000000,0.100000,0.033333,...,0.033333,0.000000,0.066667,0.000000,0.066667,0.266667,0.000000,0.000000,0.000000,0.000000
2,Non-secretory,0.166667,0.033333,0.000000,0.000000,0.033333,0.000000,0.033333,0.000000,0.000000,...,0.033333,0.033333,0.033333,0.033333,0.100000,0.266667,0.000000,0.166667,0.000000,0.000000
3,Non-secretory,0.266667,0.033333,0.000000,0.000000,0.066667,0.000000,0.000000,0.033333,0.066667,...,0.033333,0.033333,0.033333,0.000000,0.066667,0.200000,0.000000,0.133333,0.000000,0.000000
4,Non-secretory,0.200000,0.066667,0.000000,0.000000,0.000000,0.066667,0.000000,0.066667,0.033333,...,0.033333,0.033333,0.033333,0.033333,0.033333,0.233333,0.033333,0.033333,0.000000,0.000000
5,Non-secretory,0.200000,0.000000,0.000000,0.000000,0.066667,0.066667,0.000000,0.066667,0.066667,...,0.033333,0.066667,0.033333,0.000000,0.066667,0.200000,0.033333,0.066667,0.000000,0.000000
6,Non-secretory,0.266667,0.033333,0.000000,0.000000,0.000000,0.033333,0.000000,0.000000,0.100000,...,0.100000,0.000000,0.066667,0.000000,0.000000,0.166667,0.000000,0.033333,0.000000,0.033333
7,Non-secretory,0.100000,0.000000,0.033333,0.033333,0.033333,0.033333,0.000000,0.033333,0.066667,...,0.033333,0.033333,0.100000,0.033333,0.033333,0.133333,0.066667,0.066667,0.000000,0.000000
8,Non-secretory,0.066667,0.066667,0.033333,0.066667,0.000000,0.033333,0.000000,0.066667,0.066667,...,0.033333,0.000000,0.066667,0.100000,0.133333,0.066667,0.033333,0.066667,0.033333,0.033333
9,Non-secretory,0.033333,0.000000,0.033333,0.033333,0.033333,0.033333,0.000000,0.033333,0.066667,...,0.033333,0.033333,0.033333,0.066667,0.100000,0.200000,0.000000,0.100000,0.033333,0.000000


# KNN 

## Q2 Using your own words, explain why it is important to split the data into a training and a testing set in supervised learning methods. (2 points)



# Q3 Extract the location column and save it as location_df (2 points)

*Hint: look in the visualisation notebook to see how we extracted all values from a column*

In [58]:
location_df = aa_freq_loc_df["location"]


# Encode the labels from string to numerical

In order to use the KNN algorithm we need to convert the labels from the string format "Secretory"/"Non-secretory" into numerical values. You have already coded your own version of a label encoder. Today we will use the LabelEncoder from the sklearn library. Use the code below to encode the labels y:

In [59]:
LE = preprocessing.LabelEncoder()

aa_freq_loc_df['location'] = LE.fit_transform(aa_freq_loc_df['location'])
aa_freq_loc_df

Unnamed: 0,location,A,C,D,E,F,G,H,I,K,...,M,N,P,Q,R,S,T,V,W,Y
0,0,0.133333,0.000000,0.000000,0.000000,0.033333,0.033333,0.000000,0.033333,0.033333,...,0.066667,0.033333,0.033333,0.000000,0.066667,0.300000,0.066667,0.100000,0.000000,0.000000
1,0,0.233333,0.033333,0.000000,0.000000,0.033333,0.066667,0.000000,0.100000,0.033333,...,0.033333,0.000000,0.066667,0.000000,0.066667,0.266667,0.000000,0.000000,0.000000,0.000000
2,0,0.166667,0.033333,0.000000,0.000000,0.033333,0.000000,0.033333,0.000000,0.000000,...,0.033333,0.033333,0.033333,0.033333,0.100000,0.266667,0.000000,0.166667,0.000000,0.000000
3,0,0.266667,0.033333,0.000000,0.000000,0.066667,0.000000,0.000000,0.033333,0.066667,...,0.033333,0.033333,0.033333,0.000000,0.066667,0.200000,0.000000,0.133333,0.000000,0.000000
4,0,0.200000,0.066667,0.000000,0.000000,0.000000,0.066667,0.000000,0.066667,0.033333,...,0.033333,0.033333,0.033333,0.033333,0.033333,0.233333,0.033333,0.033333,0.000000,0.000000
5,0,0.200000,0.000000,0.000000,0.000000,0.066667,0.066667,0.000000,0.066667,0.066667,...,0.033333,0.066667,0.033333,0.000000,0.066667,0.200000,0.033333,0.066667,0.000000,0.000000
6,0,0.266667,0.033333,0.000000,0.000000,0.000000,0.033333,0.000000,0.000000,0.100000,...,0.100000,0.000000,0.066667,0.000000,0.000000,0.166667,0.000000,0.033333,0.000000,0.033333
7,0,0.100000,0.000000,0.033333,0.033333,0.033333,0.033333,0.000000,0.033333,0.066667,...,0.033333,0.033333,0.100000,0.033333,0.033333,0.133333,0.066667,0.066667,0.000000,0.000000
8,0,0.066667,0.066667,0.033333,0.066667,0.000000,0.033333,0.000000,0.066667,0.066667,...,0.033333,0.000000,0.066667,0.100000,0.133333,0.066667,0.033333,0.066667,0.033333,0.033333
9,0,0.033333,0.000000,0.033333,0.033333,0.033333,0.033333,0.000000,0.033333,0.066667,...,0.033333,0.033333,0.033333,0.066667,0.100000,0.200000,0.000000,0.100000,0.033333,0.000000


## Q4 Compare the values in the location column of the aa_freq_loc_df and the values of the location column in the location_df. How is "Secretory"/"Non-secretory" encoded now? (2 points)

**Answer** Secretory is 1 and Non-secretory is 0. 
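Instead of comparing the two columns by eye, the fitted encoder can be asked for its mapping directly. The sketch below uses a small toy label list (an assumption, not the real dataset) and the `classes_` attribute and `inverse_transform` method of `LabelEncoder`; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the "location" column
labels = ["Secretory", "Non-secretory", "Secretory", "Non-secretory"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the sorted class names; the position of a name is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform converts the integer codes back into the original strings
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

Because the classes are sorted before codes are assigned, "Non-secretory" receives 0 and "Secretory" receives 1, in agreement with the answer above.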

## Q5 Separate the dataset into numerical observables and categorical variables. What is the shape of x and y? (2 points)

Leave out the [] brackets around the column name when you extract the categorical variable, so that y becomes a one-dimensional array.
Example: 

`y = df.loc[:,"categorical_variable"].values` 

instead of `y = df.loc[:,["categorical_variable"]].values` 


*Hint: Refer to the data visualisation notebook on how to do this. Do not convert **y** to a list*
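The effect of the brackets is easiest to see from the resulting shapes. The sketch below uses a made-up three-row dataframe (the column names are placeholders, as in the example above).

```python
import pandas as pd

# Hypothetical dataframe with one label column and one feature column
df = pd.DataFrame({"categorical_variable": [0, 1, 0], "feature": [0.1, 0.2, 0.3]})

y_1d = df.loc[:, "categorical_variable"].values    # a Series gives a 1-D array
y_2d = df.loc[:, ["categorical_variable"]].values  # a DataFrame gives a 2-D array

print(y_1d.shape)  # (3,)
print(y_2d.shape)  # (3, 1)
```

scikit-learn classifiers expect the target as a 1-D array, which is why the version without the extra brackets is the one to use for y.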


In [75]:
observables = list(aa_freq_loc_df)[1:]
# Separating out the features
x = aa_freq_loc_df.loc[:, observables].values
# Separating out the target
y = aa_freq_loc_df.loc[:,'location'].values


In [76]:
x.shape, y.shape

((3058, 20), (3058,))

## Q6 What is x and y? (1 point)

**Answer**: x contains the amino acid frequencies and y contains the numerical labels denoting the protein location (Secretory, Non-secretory).

## Holdout method 

We will partition the data into a training and a testing set according to the holdout method. The model is trained on the training set only and is then used to predict the labels of the testing set, which it has not seen during training. The quality of the predicted labels is evaluated with accuracy_score.

### Q7 Use the train_test_split function on your observables and categorical variables using the code below. What percentage of the data is in the x_train, and what percentage of the data is in the x_test? (2 points)

*Hint: use shape on the training and testing set to see how many proteins are in it and compare to the total number of proteins*

*Note: [Not for this exercise!] In your own research, if you ever need to split the data into partitions of a different size, you can do so with this same function by setting the arguments that control the partition sizes. You can always look in the documentation to see how it's done.*

In [77]:
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state = 42)

In [78]:
x_train.shape

(2293, 20)

In [79]:
x_test.shape

(765, 20)

**Answer**: x_train has 75% of the data, and x_test has 25% of the data
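The 75%/25% partition is simply the default of `train_test_split`. As a sketch of how a different partition could be requested, the example below uses synthetic data (an assumption, not the protein dataset) with the `test_size` argument, and additionally `stratify` so that both partitions keep the same class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 20 features, 70/30 class balance
rng = np.random.default_rng(0)
x_toy = rng.random((100, 20))
y_toy = np.array([0] * 70 + [1] * 30)

# Hold out 20% of the samples and preserve the class proportions in both parts
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(x_tr.shape, x_te.shape)
print("fraction of class 1 in test set:", y_te.mean())
```

Stratifying is optional; it is mostly useful when one class is much rarer than the other.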

### Fit the KNN model using the code below:

In [81]:
knn5 = KNeighborsClassifier(n_neighbors = 5)
knn5.fit(x_train, y_train)

### Predict target categories in the testing set and compare with actual values:

In [93]:
y_pred_knn5 = knn5.predict(x_test)

### Calculate the accuracy of your predicted labels against the true labels using the accuracy_score function.

In [94]:
print("accuracy: {}".format(accuracy_score(y_test, y_pred_knn5)))

accuracy: 0.8901960784313725
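Accuracy is the fraction of test samples whose predicted label matches the true label; the value above corresponds to 681 correctly classified proteins out of 765. The sketch below, on small made-up label vectors, shows that computing this fraction by hand gives the same number as `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true and predicted labels, for illustration only
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Fraction of positions where prediction and truth agree
manual_accuracy = np.mean(y_true == y_hat)

print(manual_accuracy)                # 0.75
print(accuracy_score(y_true, y_hat))  # 0.75
```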


### Q8 Look into the accuracy_score documentation. What is the accuracy score function equal to in the binary and multiclass classification? (1 point)

**Answer** It is equal to the jaccard_score function. 


### Q9 Fit the KNN model with 2 nearest neighbors now. What is the accuracy now? Can you explain why? (5 points)

In [98]:
knn2 = KNeighborsClassifier(n_neighbors = 2)
knn2.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=2, p=2,
                     weights='uniform')

In [99]:
y_pred_knn2 = knn2.predict(x_test)
print("accuracy: {}".format(accuracy_score(y_test, y_pred_knn2)))

accuracy: 0.8470588235294118


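**Answer** The accuracy goes down (0.847 compared with 0.890 for 5 neighbours) because each prediction is now based on only two neighbours. We have less information on the class than before, so a single atypical neighbour has more influence on the result.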

### Q10 Fit the KNN model with 10 nearest neighbors now. What is the accuracy now? Can you explain why? (5 points)

In [101]:
knn10 = KNeighborsClassifier(n_neighbors = 10)
knn10.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                     weights='uniform')

In [102]:
y_pred_knn10 = knn10.predict(x_test)
print("accuracy: {}".format(accuracy_score(y_test, y_pred_knn10)))

accuracy: 0.9058823529411765


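**Answer** The accuracy increases (0.906 compared with 0.890 for 5 neighbours) because each prediction is based on the classes of more neighbouring points than before, which makes the vote less sensitive to individual noisy points.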
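Rather than fitting one model per cell, several values of k can be compared in a loop. The sketch below does this on synthetic data generated with `make_classification` (an assumption standing in for the protein dataset), so the accuracies it prints will differ from the ones in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for x and y
x_syn, y_syn = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)
x_tr, x_te, y_tr, y_te = train_test_split(x_syn, y_syn, random_state=42)

# Fit one model per k and store its holdout accuracy
accuracy_by_k = {}
for k in (2, 5, 10, 30):
    model = KNeighborsClassifier(n_neighbors=k).fit(x_tr, y_tr)
    accuracy_by_k[k] = accuracy_score(y_te, model.predict(x_te))

for k, acc in accuracy_by_k.items():
    print(f"k={k:>2}: accuracy={acc:.3f}")
```

Accuracy does not keep rising with k indefinitely: very large neighbourhoods start to include points from other classes, so the best k has to be found empirically.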

## Cross-validation 

In order to allow each point in the data to be part of the training set, we perform cross-validation.

Fit the KNN model with 5 nearest neighbours using the code below. Notice that the cross-validation method splits the data into 5 folds: each fold is used once as the testing set while the model is trained on the remaining four.



In [108]:
knn5_cv = KNeighborsClassifier(n_neighbors=5)
#train model with cv of 5 (5 folds)
cv_scores = cross_val_score(knn5_cv, x, y, cv=5)
#print each cv score (accuracy) and average them
print(cv_scores)
print("cv_scores mean:{}".format(np.mean(cv_scores)))


[0.84991843 0.87908497 0.87397709 0.86415712 0.78559738]
cv_scores mean:0.8505469977626241
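To see what `cross_val_score` does internally, the folds can also be looped over by hand. The sketch below assumes synthetic data from `make_classification`; for a classifier and an integer `cv`, scikit-learn uses stratified folds without shuffling, so an explicit `StratifiedKFold` loop should reproduce the same five scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for x and y
x_syn, y_syn = make_classification(n_samples=300, n_features=20, random_state=0)

# Manual loop: train on four folds, score on the remaining one
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(x_syn, y_syn):
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(x_syn[train_idx], y_syn[train_idx])
    manual_scores.append(model.score(x_syn[test_idx], y_syn[test_idx]))

# The same procedure in a single call
library_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), x_syn, y_syn, cv=5)

print(np.round(manual_scores, 3))
print(np.round(library_scores, 3))
```

Every sample ends up in a testing fold exactly once, which is what distinguishes this from a single holdout split.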


### Q11 Compare the average accuracy of the cross-validated KNN model with 5 nearest neighbours to that of the holdout method. Which one of them has the highest accuracy and which one reflects better how the model will perform on unseen data? (3 points)

**Answer** KNN using the holdout method has higher accuracy (0.890) than the cross-validated one (mean 0.851). The cross-validated estimate reflects better how the model will perform on unseen data, because it averages over five different testing folds instead of relying on a single split.
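Cross-validation can also be used to choose k. `GridSearchCV` is imported at the top of this notebook but not used in the exercises; the sketch below shows one possible way to apply it, on synthetic data and with a candidate list of k values that is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for x and y
x_syn, y_syn = make_classification(n_samples=300, n_features=20, random_state=0)

# Evaluate each candidate k with 5-fold cross-validation
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [2, 5, 10, 30]},
    cv=5,
    scoring="accuracy",
)
search.fit(x_syn, y_syn)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated accuracy of the best candidate, so it is directly comparable to the `cv_scores` mean printed above.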

# Tissue dataset

## Q12 Load the tissue dataset (1 point)

In [110]:
tissue_expression_df = pd.read_csv("tissue_expression.tsv", sep = "\t")
tissue_expression_df 

Unnamed: 0,tissue,0,1,2,3,4,5,6,7,8,...,22205,22206,22207,22208,22209,22210,22211,22212,22213,22214
0,kidney,10.191267,6.040463,7.447409,12.025042,5.269269,8.535176,6.921690,5.718190,8.082076,...,8.108419,5.251074,7.098663,8.210405,7.736744,6.434851,5.700448,9.211163,8.339130,7.367797
1,kidney,10.509167,6.696075,7.775354,12.007817,5.180389,8.587241,6.962430,5.596042,7.568178,...,8.072807,5.409345,6.905827,8.322514,8.192083,7.676989,6.566479,9.415980,8.214426,7.917754
2,kidney,10.272027,6.144663,7.696235,11.633279,5.301714,8.277414,7.054633,5.576952,7.136474,...,7.809687,5.297679,6.718544,8.404708,7.961902,6.424996,5.641277,8.192909,8.456095,7.598461
3,kidney,10.252952,6.575153,8.478135,11.075286,5.372235,8.603650,7.115067,5.860551,8.605091,...,8.036512,6.025769,6.716618,8.797825,8.325583,6.354779,5.754815,8.522238,8.558297,7.799779
4,kidney,10.157605,6.606701,8.116336,10.832528,5.334905,8.303227,7.078587,5.728177,8.967108,...,8.205598,5.612748,6.581476,8.577977,8.064061,6.438092,6.053994,7.971105,8.421945,7.540570
5,kidney,9.966782,6.060069,7.644452,11.705062,5.253682,8.341625,7.120603,5.255161,7.300578,...,7.768980,5.284946,6.846413,8.380289,8.086254,6.483949,5.584975,8.299793,8.489801,7.720350
6,kidney,9.839348,6.186596,8.009581,11.706145,5.228794,8.947253,7.135402,5.785426,9.902376,...,8.718157,5.776072,7.170825,8.660116,8.344578,6.590414,5.927212,8.184667,8.560192,7.668692
7,kidney,9.945652,5.927861,7.847192,11.750370,5.155278,8.389109,7.092967,5.540472,7.634552,...,7.977619,5.263604,6.769122,8.360551,7.907557,6.542008,5.831587,8.181179,8.630993,7.563993
8,kidney,9.913031,6.337478,7.983850,10.706184,5.236442,8.513929,6.947729,5.705970,8.155148,...,8.077371,5.576664,6.947507,8.439523,7.928164,7.051343,6.438543,7.752406,8.275516,7.835893
9,kidney,10.170344,6.045789,7.544486,11.760161,5.405336,8.227427,7.001822,5.633313,7.379390,...,7.729364,5.556860,6.706641,8.103447,7.954726,6.403876,5.968543,8.304945,8.441026,7.568690


## Q13 Save the tissue column in tissue_df (1 point)

In [140]:
tissue_df = tissue_expression_df["tissue"]
tissue_df

0      4
1      4
2      4
3      4
4      4
5      4
6      4
7      4
8      4
9      4
10     4
11     4
12     4
13     4
14     4
15     4
16     3
17     3
18     3
19     3
20     3
21     3
22     3
23     3
24     3
25     3
26     3
27     3
28     3
29     3
      ..
159    2
160    2
161    2
162    2
163    5
164    5
165    5
166    5
167    5
168    5
169    5
170    5
171    5
172    5
173    5
174    5
175    0
176    0
177    0
178    0
179    0
180    0
181    0
182    0
183    6
184    6
185    6
186    6
187    6
188    6
Name: tissue, Length: 189, dtype: int64

## Q14 Encode the labels from string to numerical (1 point)

In [112]:
LE = preprocessing.LabelEncoder()

tissue_expression_df['tissue'] = LE.fit_transform(tissue_expression_df['tissue'])
tissue_expression_df

Unnamed: 0,tissue,0,1,2,3,4,5,6,7,8,...,22205,22206,22207,22208,22209,22210,22211,22212,22213,22214
0,4,10.191267,6.040463,7.447409,12.025042,5.269269,8.535176,6.921690,5.718190,8.082076,...,8.108419,5.251074,7.098663,8.210405,7.736744,6.434851,5.700448,9.211163,8.339130,7.367797
1,4,10.509167,6.696075,7.775354,12.007817,5.180389,8.587241,6.962430,5.596042,7.568178,...,8.072807,5.409345,6.905827,8.322514,8.192083,7.676989,6.566479,9.415980,8.214426,7.917754
2,4,10.272027,6.144663,7.696235,11.633279,5.301714,8.277414,7.054633,5.576952,7.136474,...,7.809687,5.297679,6.718544,8.404708,7.961902,6.424996,5.641277,8.192909,8.456095,7.598461
3,4,10.252952,6.575153,8.478135,11.075286,5.372235,8.603650,7.115067,5.860551,8.605091,...,8.036512,6.025769,6.716618,8.797825,8.325583,6.354779,5.754815,8.522238,8.558297,7.799779
4,4,10.157605,6.606701,8.116336,10.832528,5.334905,8.303227,7.078587,5.728177,8.967108,...,8.205598,5.612748,6.581476,8.577977,8.064061,6.438092,6.053994,7.971105,8.421945,7.540570
5,4,9.966782,6.060069,7.644452,11.705062,5.253682,8.341625,7.120603,5.255161,7.300578,...,7.768980,5.284946,6.846413,8.380289,8.086254,6.483949,5.584975,8.299793,8.489801,7.720350
6,4,9.839348,6.186596,8.009581,11.706145,5.228794,8.947253,7.135402,5.785426,9.902376,...,8.718157,5.776072,7.170825,8.660116,8.344578,6.590414,5.927212,8.184667,8.560192,7.668692
7,4,9.945652,5.927861,7.847192,11.750370,5.155278,8.389109,7.092967,5.540472,7.634552,...,7.977619,5.263604,6.769122,8.360551,7.907557,6.542008,5.831587,8.181179,8.630993,7.563993
8,4,9.913031,6.337478,7.983850,10.706184,5.236442,8.513929,6.947729,5.705970,8.155148,...,8.077371,5.576664,6.947507,8.439523,7.928164,7.051343,6.438543,7.752406,8.275516,7.835893
9,4,10.170344,6.045789,7.544486,11.760161,5.405336,8.227427,7.001822,5.633313,7.379390,...,7.729364,5.556860,6.706641,8.103447,7.954726,6.403876,5.968543,8.304945,8.441026,7.568690


## Q15 Compare the values in the tissue column of the tissue_expression_df and the values of the tissue column in the tissue_df. How are the tissues encoded now? (2 points)


In [118]:
for tissue_num, tissue in zip(list(tissue_expression_df['tissue']), list(tissue_df)): 
    print(tissue_num, tissue)


4 kidney
4 kidney
4 kidney
4 kidney
4 kidney
4 kidney
4 kidney
4 kidney
4 kidney
4 kidney
4 kidney
4 kidney
4 kidney
4 kidney
4 kidney
4 kidney
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
3 hippocampus
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
0 cerebellum
4 kidney
4 kidney
4 kidney
4 kidney
4 kidney
4 kidney
4 kidney
4 kidney
4 kidney
4 ki

**Answer**: 
- kidney: 4
- hippocampus: 3
- cerebellum: 0
- colon: 1 
- liver: 5 
- endometrium: 2
- placenta: 6

## Q16 Separate the dataset into numerical observables and categorical variables. What is the shape of x and y? (2 points)

Leave out the [] brackets around the column name when you extract the categorical variable, so that y becomes a one-dimensional array.
Example: 

`y = df.loc[:,"categorical_variable"].values` 

instead of `y = df.loc[:,["categorical_variable"]].values` 


*Hint: Refer to the data visualisation notebook on how to do this. Do not convert **y** to a list*


In [119]:
observables = list(tissue_expression_df)[1:]
# Separating out the features
x = tissue_expression_df.loc[:, observables].values
# Separating out the target
y = tissue_expression_df.loc[:,'tissue'].values


In [122]:
x.shape, y.shape

((189, 22215), (189,))

## Q17 What is x and y? (1 point)

**Answer**: x are gene expression values and y are the numerical labels denoting the tissue of origin. 

## Q18 Use the holdout method to partition your data and create a KNN model using 5 nearest neighbours. What is the accuracy of the model? Can you try to explain what it means to get this accuracy? (7 points)

*Hint:* Steps are: 

- split the data into train and test
- fit a knn model on the training data
- predict the labels of the testing set using the knn model 
- compare the predicted labels with the true labels of the testing set and find the accuracy
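The four steps above can be sketched as one small function. The function name `holdout_knn_accuracy` and the synthetic three-class data are assumptions for illustration; for the exercise you should apply the same steps to your own x and y.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def holdout_knn_accuracy(x, y, n_neighbors=5, random_state=42):
    """Split, fit, predict and score a KNN classifier with the holdout method."""
    # 1. split the data into train and test
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=random_state)
    # 2. fit a KNN model on the training data
    model = KNeighborsClassifier(n_neighbors=n_neighbors)
    model.fit(x_train, y_train)
    # 3. predict the labels of the testing set
    y_pred = model.predict(x_test)
    # 4. compare predicted and true labels
    return accuracy_score(y_test, y_pred)


# Synthetic multi-class data standing in for the expression matrix
x_syn, y_syn = make_classification(
    n_samples=200, n_features=50, n_informative=10, n_classes=3, random_state=0
)
holdout_accuracy = holdout_knn_accuracy(x_syn, y_syn, n_neighbors=5)
print(f"accuracy: {holdout_accuracy:.3f}")
```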

In [129]:
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state = 42)

In [130]:
x_train.shape

(141, 22215)

In [131]:
x_test.shape

(48, 22215)

In [132]:
knn5 = KNeighborsClassifier(n_neighbors = 5)
knn5.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [133]:
y_pred_knn5 = knn5.predict(x_test)

In [134]:
print("accuracy: {}".format(accuracy_score(y_test, y_pred_knn5)))

accuracy: 1.0


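**Answer** The accuracy is 1.0: all 48 test samples are assigned to their correct tissue. A perfect score on such a small testing set from a single split is probably an optimistic estimate, so it should be confirmed with cross-validation before concluding how well the model generalises.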

## Q19 Use the holdout method to partition your data and create a KNN model using 30 nearest neighbours. What is the accuracy of the model? Can you try and explain the difference in accuracy compared to 5 nearest neighbours? (4 points)


In [144]:
knn30 = KNeighborsClassifier(n_neighbors = 30)
knn30.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=30, p=2,
                     weights='uniform')

In [145]:
y_pred_knn30 = knn30.predict(x_test)

In [146]:
print("accuracy: {}".format(accuracy_score(y_test, y_pred_knn30)))

accuracy: 0.8333333333333334
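When k is large, it is worth checking how many training samples each class has: if a class has fewer than k training samples, some of the k neighbours of any test point must come from other classes. The sketch below illustrates this check on made-up labels (the class sizes are assumptions, not the tissue counts); in the exercise the same call could be applied to y_train.

```python
import numpy as np

# Hypothetical training labels for four classes of unequal size
y_train_toy = np.array([0] * 6 + [1] * 25 + [2] * 12 + [3] * 40)
k = 30

# np.bincount counts how often each integer label occurs
class_counts = np.bincount(y_train_toy)

for label, count in enumerate(class_counts):
    note = "fewer samples than k" if count < k else "at least k samples"
    print(f"class {label}: {count} training samples ({note})")
```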


**Answer**: 

## Q20 Use the cross-validation method (5 folds) to partition your data and create a KNN model using 30 nearest neighbours. What is the accuracy of the model? Compare the accuracy of the holdout method to the mean accuracy of the cross-validated KNN model. Explain what you can conclude from this (4 points)



In [147]:
knn30_cv = KNeighborsClassifier(n_neighbors=30)
#train model with cv of 5 (5 folds)
cv_scores = cross_val_score(knn30_cv, x, y, cv=5)
#print each cv score (accuracy) and average them
print(cv_scores)
print("cv_scores mean:{}".format(np.mean(cv_scores)))


[0.92682927 0.94736842 0.92105263 0.89189189 0.8       ]
cv_scores mean:0.8974284425632307


**Answer**: 
