# **KNN - K-Nearest Neighbors**

Welcome to your KNN exercise. Feel free to download the Jupyter Notebook and execute it on your own device.

**Note:** to solve the exercise, it is not necessary to download anything. 

In [1]:
import pandas as pd
import numpy as np

**1. Load Dataset**

In the following exercise we use the dataset from the National Institute of Diabetes and Digestive and Kidney Diseases. The purpose of the dataset is to predict whether a patient has diabetes or not. Before you continue with the next section, make sure that you understand the dataset.

Source: https://www.kaggle.com/uciml/pima-indians-diabetes-database

In [2]:
df = pd.read_csv("diabetes.csv")
df['Insulin'] = df['Insulin'].replace(0, np.nan)
df = df.dropna()

In [3]:
df.head(10)

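| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 1 | 89 | 66 | 23 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168.0 | 43.1 | 2.288 | 33 | 1 |
| 6 | 3 | 78 | 50 | 32 | 88.0 | 31.0 | 0.248 | 26 | 1 |
| 8 | 2 | 197 | 70 | 45 | 543.0 | 30.5 | 0.158 | 53 | 1 |
| 13 | 1 | 189 | 60 | 23 | 846.0 | 30.1 | 0.398 | 59 | 1 |
| 14 | 5 | 166 | 72 | 19 | 175.0 | 25.8 | 0.587 | 51 | 1 |
| 16 | 0 | 118 | 84 | 47 | 230.0 | 45.8 | 0.551 | 31 | 1 |
| 18 | 1 | 103 | 30 | 38 | 83.0 | 43.3 | 0.183 | 33 | 0 |
| 19 | 1 | 115 | 70 | 30 | 96.0 | 34.6 | 0.529 | 32 | 1 |
| 20 | 3 | 126 | 88 | 41 | 235.0 | 39.3 | 0.704 | 27 | 0 |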


In [4]:
X = df.iloc[:,:8].values
y = df.iloc[:,8].values

**2. Splitting Data**

To split the data into training and test sets, we use the train_test_split function from sklearn.

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
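
Because the split above is random, the accuracy reported later can change from run to run. The following is a minimal sketch, using synthetic stand-in data (an assumption, not the diabetes dataset), of how the `random_state` and `stratify` parameters of train_test_split make the split reproducible and keep the class proportions similar in both sets. The names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 100 samples, 8 features, 30% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = np.array([1] * 30 + [0] * 70)

# random_state fixes the shuffle; stratify keeps the class ratio in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # share of positives in each set
```

Whether stratification is needed depends on how imbalanced the classes are; for small datasets it usually makes the evaluation more stable.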

**3. Feature Scaling**

To prevent some features from having a bigger impact on the "model" than others just because of their large magnitude, we are going to scale our X data. The fit method computes the mean and standard deviation of each feature on the training set; the transform method then uses these values to standardize both the training and the test set.

In [6]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
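
To illustrate what fit and transform compute, here is a small sketch in plain NumPy on toy data (an assumption, not the diabetes dataset). It mirrors the idea of the cell above: the statistics come from the training data only and are reused for the test data. The variable names are hypothetical.

```python
import numpy as np

# Toy training and test matrices (assumption): 2 features on different scales.
X_train_demo = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test_demo = np.array([[2.0, 400.0]])

# "fit": learn per-feature statistics from the training data only.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": apply the same training statistics to both sets.
X_train_scaled = (X_train_demo - mean) / std
X_test_scaled = (X_test_demo - mean) / std

print(X_train_scaled.mean(axis=0))  # approximately 0 per feature
print(X_train_scaled.std(axis=0))   # approximately 1 per feature
print(X_test_scaled)
```

Fitting on the training set only avoids leaking information about the test set into the preprocessing step.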

**4. Training**

Now it's your turn to create and train a KNN classifier. Please make sure to use the imported KNeighborsClassifier class from the sklearn library. After creating the classifier, you have to fit it to X_train and y_train.

Create your classifier with k = 5 and the Euclidean distance (the default metric of KNeighborsClassifier, Minkowski with p = 2, is the Euclidean distance). Take a look at the documentation to get more details and examples.

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

Hint: In practice you would try different k values and take the one with the highest accuracy on validation data that is kept separate from the final test set.
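
The hint can be sketched as follows. This is one possible approach, not part of the exercise solution: it uses 5-fold cross-validation on synthetic stand-in data (an assumption) and a hypothetical list of candidate k values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption), not the diabetes dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

candidate_ks = [1, 3, 5, 7, 9, 11]  # hypothetical search range
mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds for this k.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    mean_scores[k] = scores.mean()

# Pick the k with the highest mean cross-validated accuracy.
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best k:", best_k)
```

On the real data you would run the search on the training set only and keep the test set for the final evaluation.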

In [7]:
from sklearn.neighbors import KNeighborsClassifier

### Start Code ### (≈ 2 lines of code)
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

### End Code ###

KNeighborsClassifier()
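
To get an intuition for what a KNN classifier does at prediction time, here is a minimal from-scratch sketch in NumPy. The helper `knn_predict` and the toy data are hypothetical; the sketch illustrates the idea (Euclidean distance plus majority vote) and is not sklearn's actual implementation, which is more efficient and may break ties differently.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5):
    """Predict labels by majority vote among the k nearest training points."""
    predictions = []
    for x in X_query:
        # Euclidean distance from the query point to every training point.
        distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # Indices of the k smallest distances.
        nearest = np.argsort(distances)[:k]
        # Majority vote over the neighbors' labels.
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Toy data (assumption): two well-separated clusters in 2D.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_pred = knn_predict(X_toy, y_toy, np.array([[0.05, 0.05], [5.0, 5.1]]), k=3)
print(toy_pred)
```

Because the prediction depends directly on distances, features on larger scales would dominate the vote, which is why the data was scaled in section 3.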

**5. Predict and Evaluate**

Let's make some predictions to evaluate the performance of our classifier. For that, we are using the predict method from our KNeighborsClassifier. 

In [265]:
y_pred = classifier.predict(X_test)

For the sake of simplicity we use the accuracy_score. Feel free to try out different evaluation metrics.

In [266]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.810126582278481
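
As a starting point for trying other evaluation metrics, the sketch below computes a confusion matrix, precision, recall and F1 score with sklearn.metrics. The labels are hand-made for illustration (an assumption) and are not the predictions from the cell above.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hand-made labels (assumption) for illustration only.
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
precision = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP)
recall = recall_score(y_true_demo, y_pred_demo)        # TP / (TP + FN)
f1 = f1_score(y_true_demo, y_pred_demo)                # harmonic mean of both

print(cm)
print(precision, recall, f1)
```

For a medical dataset, recall on the positive class is often of particular interest, since a missed diabetes case is usually more costly than a false alarm.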


**6. Optional**

As you might have noticed, our dataset has a lot of missing data. If you go back to the first part, you will see that we have a lot of zeros in columns like Insulin. There are multiple ways to solve that problem. For instance you could fill these cells with the mean of the column or just remove them.

To get a sense of how to do that, try to drop all rows whose Insulin value is 0. You could solve that by first replacing all 0 values with NaN and then removing the rows that contain NaN.

You may find these two approaches helpful:

- ... replace(0, np.nan)  -> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html
- ... dropna()            -> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

Go back to "1. Load Dataset" and insert your code.

Note: There are different solutions to handle missing values. Depending on your dataset you have to choose what works best. The example here is just to illustrate such a problem.
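
The two options mentioned in this section, dropping rows and filling with the column mean, can be compared on a toy DataFrame. This is a sketch under assumptions: the frame `df_demo` and its values are made up, and only the Insulin column is treated.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption) where 0 in "Insulin" stands for a missing measurement.
df_demo = pd.DataFrame({
    "Insulin": [0, 94, 168, 0, 88],
    "Outcome": [1, 0, 1, 0, 1],
})

# Mark zeros as missing.
marked = df_demo.copy()
marked["Insulin"] = marked["Insulin"].replace(0, np.nan)

# Option A: drop every row that contains a missing value.
dropped = marked.dropna()

# Option B: fill missing values with the column mean (mean() skips NaN).
filled = marked.copy()
filled["Insulin"] = filled["Insulin"].fillna(filled["Insulin"].mean())

print(len(dropped))
print(filled["Insulin"].tolist())
```

Dropping rows shrinks the dataset, while mean imputation keeps all rows but reduces the variance of the column. If you impute, computing the mean on the training set only keeps test information out of the preprocessing.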

_________________________________________________

Congratulations, you got it! As you will see, the implementation of other classifiers like SVMs is not that different. So feel free to take the code and change the classifier to try out different algorithms and see what works best. Good luck!
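
As a rough illustration of how little changes when you swap the classifier, the sketch below trains a KNN and an SVM (sklearn's SVC) with the same steps. It runs on synthetic stand-in data (an assumption), so the printed accuracies say nothing about the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), not the diabetes dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# Same scaling step as in section 3: fit on training data, apply to both sets.
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# Only the classifier object changes; fit and predict stay the same.
models = {"knn": KNeighborsClassifier(n_neighbors=5), "svm": SVC(kernel="rbf")}
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, model.predict(X_te))

print(results)
```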

Author: Franz Just