![Logo](Technikklein.png)

# Task 10: k-Nearest Neighbour
### Data Science


##### Lecturer: Prof. Dr. Stefan Edlich
##### Media Informatics
##### Hochschule Emden/Leer
##### Summer Semester 2021
   
_____


### Ex. 1 Implement KNN

Implement KNN by hand for just 2 dimensions with normalization.

This is easy because:

funct: You normalize your data in another table

funct: You code a simple Euclidean distance function

funct: You take a point and calculate the distance to all points

funct: You take the list from above and sort it

funct: You aggregate by target variable

funct: You take the max to determine the target class

You are finished!


#### We are going to use the Titanic dataset (from https://www.kaggle.com/c/titanic/overview) to predict via KNN whether a person would have survived, based on age and passenger class.
#### We load the training set first.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plot
from scipy import stats
from sklearn.model_selection import train_test_split

# read_csv already returns a DataFrame, so no extra conversion is needed
bigdf = pd.read_csv("train.csv")

print(bigdf)

     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
..           ...       ...     ...   
886          887         0       2   
887          888         1       1   
888          889         0       3   
889          890         1       1   
890          891         0       3   

                                                  Name     Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    male  22.0      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                               Heikkinen, Miss. Laina  female  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                             Allen, Mr. William Henry    male  35.0      0   
..                                                 ...     ...   ... 

#### Now we extract the columns we need and drop the rows with missing values.

In [2]:
df_train = bigdf.drop(['PassengerId', 'Name', 'Sex', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], axis=1)
df_train = df_train.dropna()

df_train

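| | Survived | Pclass | Age |
|---|---|---|---|
| 0 | 0 | 3 | 22.0 |
| 1 | 1 | 1 | 38.0 |
| 2 | 1 | 3 | 26.0 |
| 3 | 1 | 1 | 35.0 |
| 4 | 0 | 3 | 35.0 |
| ... | ... | ... | ... |
| 885 | 0 | 3 | 39.0 |
| 886 | 0 | 2 | 27.0 |
| 887 | 1 | 1 | 19.0 |
| 889 | 1 | 1 | 26.0 |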


#### Now we can min-max normalize age and class to the range [0, 1] and store the result in a new dataset.

In [46]:
def normalize(array):
    max_value = max(array)
    min_value = min(array)
    norm_data = []
    for value in array:
        norm_val = (value - min_value) / (max_value - min_value)
        norm_data.append(norm_val)
    return np.array(norm_data)

norm_age = normalize(df_train.Age)
norm_class = normalize(df_train.Pclass)

norm_data = {'NormClass' : norm_class,
             'NormAge' : norm_age,
             'Survival' : df_train.Survived}
norm_df = pd.DataFrame(norm_data)
norm_df

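| | NormClass | NormAge | Survival |
|---|---|---|---|
| 0 | 1.0 | 0.271174 | 0 |
| 1 | 0.0 | 0.472229 | 1 |
| 2 | 1.0 | 0.321438 | 1 |
| 3 | 0.0 | 0.434531 | 1 |
| 4 | 1.0 | 0.434531 | 0 |
| ... | ... | ... | ... |
| 885 | 1.0 | 0.484795 | 0 |
| 886 | 0.5 | 0.334004 | 0 |
| 887 | 0.0 | 0.233476 | 1 |
| 889 | 0.0 | 0.321438 | 1 |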
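
The loop in `normalize` could also be written without an explicit loop. The following is only a sketch of that idea; `normalize_vectorized` is a hypothetical name, and the handling of a constant column (everything mapped to 0) is an assumption, not something the exercise asks for.

```python
import numpy as np

def normalize_vectorized(values):
    """Min-max scale a 1-D sequence to [0, 1] without a Python loop."""
    arr = np.asarray(values, dtype=float)
    span = arr.max() - arr.min()
    if span == 0:
        # Assumption: a constant column carries no information, map it to 0.
        return np.zeros_like(arr)
    return (arr - arr.min()) / span

# The three passenger classes end up at both ends and the middle of [0, 1].
print(normalize_vectorized([1, 2, 3]))
```

Called as `normalize_vectorized(df_train.Age)`, it should give the same values as the loop version.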


#### Now we need a Euclidean distance function.

In [47]:
def eucdist(x1, y1, x2, y2):
    dist = (x1 - x2)**2 + (y1 - y2)**2
    return np.sqrt(dist)

eucdist(1, 0.5, 1, 0.2711)

0.2289

#### We also need a function that iterates over all points in the dataset and adds a new column with their distances to a test point (in this example P(1, 0.5): a third-class passenger whose age lies at the midpoint of the observed age range, which is not the same as the mean age).

In [48]:
def add_distcolumn(x, y, df):
    dist_array = []
    for index in range(len(df.index)):
        dist = eucdist(x, y, df.iloc[index, 0], df.iloc[index, 1])
        dist_array.append(dist)
    dist_df = df.assign(Distance = dist_array)
    return dist_df

dist_df = add_distcolumn(1, 0.5, norm_df)
dist_df

| | NormClass | NormAge | Survival | Distance |
|---|---|---|---|---|
| 0 | 1.0 | 0.271174 | 0 | 0.228826 |
| 1 | 0.0 | 0.472229 | 1 | 1.000386 |
| 2 | 1.0 | 0.321438 | 1 | 0.178562 |
| 3 | 0.0 | 0.434531 | 1 | 1.002141 |
| 4 | 1.0 | 0.434531 | 0 | 0.065469 |
| ... | ... | ... | ... | ... |
| 885 | 1.0 | 0.484795 | 0 | 0.015205 |
| 886 | 0.5 | 0.334004 | 0 | 0.526835 |
| 887 | 0.0 | 0.233476 | 1 | 1.034908 |
| 889 | 0.0 | 0.321438 | 1 | 1.015817 |
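
For larger datasets, the row-by-row loop can get slow. As a rough sketch, the distances to all points can be computed in one step with NumPy broadcasting. `distances_to_point` is a hypothetical helper; the three sample points are the first three rows of the table above.

```python
import numpy as np

def distances_to_point(points, query):
    """Euclidean distance from every row of `points` to `query`."""
    points = np.asarray(points, dtype=float)  # shape (n, d)
    query = np.asarray(query, dtype=float)    # shape (d,)
    # Broadcasting subtracts the query from each row at once.
    return np.sqrt(((points - query) ** 2).sum(axis=1))

# (NormClass, NormAge) of rows 0, 1 and 2
sample = [(1.0, 0.271174), (0.0, 0.472229), (1.0, 0.321438)]
print(distances_to_point(sample, (1.0, 0.5)))
```

In the notebook this could be used as `norm_df.assign(Distance=distances_to_point(norm_df[['NormClass', 'NormAge']], (1, 0.5)))`.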


#### Now we sort our data by distance.

In [49]:
sort_df = dist_df.sort_values('Distance')
sort_df

| | NormClass | NormAge | Survival | Distance |
|---|---|---|---|---|
| 661 | 1.0 | 0.497361 | 0 | 0.002639 |
| 40 | 1.0 | 0.497361 | 0 | 0.002639 |
| 188 | 1.0 | 0.497361 | 0 | 0.002639 |
| 561 | 1.0 | 0.497361 | 0 | 0.002639 |
| 360 | 1.0 | 0.497361 | 0 | 0.002639 |
| ... | ... | ... | ... | ... |
| 493 | 0.0 | 0.886906 | 0 | 1.072239 |
| 445 | 0.0 | 0.044986 | 1 | 1.098653 |
| 297 | 0.0 | 0.019854 | 0 | 1.109297 |
| 305 | 0.0 | 0.006283 | 1 | 1.115238 |


#### We now pick a k (here 26, roughly the square root of the number of instances) and take the k nearest rows as our neighbours. Note that an even k can lead to a tie in a two-class vote.

In [50]:
n_df = sort_df[:26]
n_df

| | NormClass | NormAge | Survival | Distance |
|---|---|---|---|---|
| 661 | 1.0 | 0.497361 | 0 | 0.002639 |
| 40 | 1.0 | 0.497361 | 0 | 0.002639 |
| 188 | 1.0 | 0.497361 | 0 | 0.002639 |
| 561 | 1.0 | 0.497361 | 0 | 0.002639 |
| 360 | 1.0 | 0.497361 | 0 | 0.002639 |
| 153 | 1.0 | 0.503644 | 0 | 0.003644 |
| 525 | 1.0 | 0.503644 | 0 | 0.003644 |
| 638 | 1.0 | 0.509927 | 0 | 0.009927 |
| 860 | 1.0 | 0.509927 | 0 | 0.009927 |
| 761 | 1.0 | 0.509927 | 0 | 0.009927 |
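
The square-root rule is only a heuristic. As a small sketch, it can be written as a helper that also nudges k to the next odd number, so that a two-class vote cannot end in a tie. `suggest_k` is a hypothetical name, and it will suggest a slightly different k than the one picked by hand in this notebook.

```python
import math

def suggest_k(n_instances):
    """Rule-of-thumb k: about sqrt(n), moved up to the next odd number."""
    k = max(1, round(math.sqrt(n_instances)))
    # An odd k avoids ties when there are exactly two classes.
    return k if k % 2 == 1 else k + 1

# Example with 150 instances; for the Titanic data one would pass len(norm_df).
print(suggest_k(150))
```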


#### Now we get a vote from the neighbours.
(Remember: 1 equals survival, 0 equals no survival; so our passenger would probably not survive.)

In [60]:
predicted_class = n_df.Survival.mode()
predicted_class

0    0
dtype: int64
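
`mode()` returns a Series, and in case of a tie it contains every tied class. As one possible way to get a single label, the sketch below counts the votes and breaks ties in favour of the class of the nearest neighbour. `majority_vote` is a hypothetical helper, and the tie-breaking rule is an assumption, not part of the exercise.

```python
from collections import Counter

def majority_vote(labels_by_distance):
    """Return the most common label; ties go to the nearest neighbour's side.

    `labels_by_distance` must be ordered from nearest to farthest neighbour.
    """
    if not labels_by_distance:
        raise ValueError("need at least one neighbour")
    counts = Counter(labels_by_distance)
    top = max(counts.values())
    tied = {label for label, count in counts.items() if count == top}
    # Walk through the neighbours in distance order and take the first tied label.
    for label in labels_by_distance:
        if label in tied:
            return label

print(majority_vote([0, 0, 1]))
print(majority_vote([1, 0, 0, 1]))
```

With the neighbours from above, this could be called as `majority_vote(n_df.Survival.tolist())`.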

#### We can also consolidate everything in one function.

In [64]:
def knn_titanic_survival_prediction(normclass, normage):
    dist_df = add_distcolumn(normclass, normage, norm_df)
    sort_df = dist_df.sort_values('Distance')
    n_df = sort_df[:26]
    predicted_class = n_df.Survival.mode()
    return predicted_class
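
The function above expects values that are already normalized. As a small sketch, a helper like the hypothetical `to_normalized` below could translate a raw age or class into that scale. The age bounds 0.42 and 80 are an assumption inferred from the normalized table (age 22 maps to 0.271174); in the notebook they would come from `df_train.Age.min()` and `df_train.Age.max()`.

```python
def to_normalized(value, min_value, max_value):
    """Map a raw value onto the min-max scale used for the training data."""
    if max_value == min_value:
        raise ValueError("min_value and max_value must differ")
    return (value - min_value) / (max_value - min_value)

# Assumed bounds, consistent with the normalized table shown earlier.
AGE_MIN, AGE_MAX = 0.42, 80.0
CLASS_MIN, CLASS_MAX = 1, 3

# A 22-year-old in third class, expressed on the normalized scale.
query_age = to_normalized(22, AGE_MIN, AGE_MAX)
query_class = to_normalized(3, CLASS_MIN, CLASS_MAX)
print(round(query_age, 6), query_class)
```

The result could then be passed on as `knn_titanic_survival_prediction(query_class, query_age)`.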

#### Our test passenger would probably not survive. We are going to check some more examples.
(In theory, higher class as well as younger age should give survival advantages, although class might have a bigger impact.)

In [75]:
# young and low class
knn_titanic_survival_prediction(1, 0.2)

0    0
dtype: int64

In [65]:
# young and middle class
knn_titanic_survival_prediction(0.5, 0.2)

0    0
dtype: int64

In [76]:
# middle-aged and middle class
knn_titanic_survival_prediction(0.5, 0.5)

0    0
dtype: int64

In [71]:
# old and high class
knn_titanic_survival_prediction(0, 0.85)

0    0
dtype: int64

In [77]:
# old and middle class
knn_titanic_survival_prediction(0.5, 0.8)

0    0
dtype: int64

In [69]:
# old and low class
knn_titanic_survival_prediction(1, 0.85)

0    0
dtype: int64

In [74]:
# middle-aged and high class (a survivor!)
knn_titanic_survival_prediction(0, 0.5)

0    1
dtype: int64

In [73]:
# young and high class (another survivor!)
knn_titanic_survival_prediction(0, 0.2)

0    1
dtype: int64

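#### In these examples, only first-class passengers who are young or middle-aged are predicted to survive. This is consistent with our theory that class has the bigger impact, while the age effect only shows up within the first class here.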


______________________________


### Ex. 2 Iris Data
In the logistic regression example, I gave you a new iris data point:

4.8,2.5,5.3,2.4

Please classify this flower using KNN.

#### First, we import the Iris dataset and define the test instance.

In [87]:
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(iris.data)
iris_df['Class'] = iris.target
test_inst = [4.8, 2.5, 5.3, 2.4]
iris_df

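| | 0 | 1 | 2 | 3 | Class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |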


#### There is no need to normalize this data, since all four features are measured in centimetres on comparable scales. We can go on by adapting the functions from above to four-dimensional instances.

In [109]:
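def eucdist2(instance1, instance2):
    # np.asarray lets plain lists and pandas rows be subtracted safely
    diff = np.asarray(instance1, dtype=float) - np.asarray(instance2, dtype=float)
    return np.linalg.norm(diff)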

def add_distcolumn2(instance1, df):
    dist_array = []
    for index in range(len(df.index)):
        dist = eucdist2(instance1, df.iloc[index, :4])
        dist_array.append(dist)
    dist_df = df.assign(Distance = dist_array)
    return dist_df

#### With our new functions, we can follow all the steps from above, this time with a new k (12, roughly the square root of the 150 instances).

In [113]:
dist_df = add_distcolumn2(test_inst, iris_df)
sort_df = dist_df.sort_values('Distance')
n_df = sort_df[:12]
n_df

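| | 0 | 1 | 2 | 3 | Class | Distance |
|---|---|---|---|---|---|---|
| 121 | 5.6 | 2.8 | 4.9 | 2.0 | 2 | 1.024695 |
| 113 | 5.7 | 2.5 | 5.0 | 2.0 | 2 | 1.029563 |
| 114 | 5.8 | 2.8 | 5.1 | 2.4 | 2 | 1.063015 |
| 106 | 4.9 | 2.5 | 4.5 | 1.7 | 2 | 1.067708 |
| 101 | 5.8 | 2.7 | 5.1 | 1.9 | 2 | 1.153256 |
| 142 | 5.8 | 2.7 | 5.1 | 1.9 | 2 | 1.153256 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | 2 | 1.363818 |
| 84 | 5.4 | 3.0 | 4.5 | 1.5 | 1 | 1.43527 |
| 83 | 6.0 | 2.7 | 5.1 | 1.6 | 1 | 1.469694 |
| 138 | 6.0 | 3.0 | 4.8 | 1.8 | 2 | 1.516575 |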


In [112]:
predicted_class = n_df.Class.mode()
predicted_class

0    2
dtype: int32

#### KNN predicts class 2 (Virginica), just like the logistic regression model!
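
The steps distance, sort, pick k and vote can also be condensed into one function that works for any number of dimensions. The following is only a sketch: `knn_predict` is a hypothetical name, and the toy training set consists of just the ten neighbour rows shown in the table above, not the full Iris dataset.

```python
from collections import Counter

import numpy as np

def knn_predict(features, labels, query, k):
    """Classify `query` by majority vote among its k nearest rows."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # Euclidean distance from the query to every row
    dists = np.linalg.norm(features - np.asarray(query, dtype=float), axis=1)
    # Stable sort keeps the original order among equal distances
    nearest = np.argsort(dists, kind="stable")[:k]
    return Counter(labels[nearest].tolist()).most_common(1)[0][0]

# Toy data: the ten rows from the neighbour table above
rows = [
    [5.6, 2.8, 4.9, 2.0], [5.7, 2.5, 5.0, 2.0], [5.8, 2.8, 5.1, 2.4],
    [4.9, 2.5, 4.5, 1.7], [5.8, 2.7, 5.1, 1.9], [5.8, 2.7, 5.1, 1.9],
    [5.9, 3.0, 5.1, 1.8], [5.4, 3.0, 4.5, 1.5], [6.0, 2.7, 5.1, 1.6],
    [6.0, 3.0, 4.8, 1.8],
]
classes = [2, 2, 2, 2, 2, 2, 2, 1, 1, 2]

print(knn_predict(rows, classes, [4.8, 2.5, 5.3, 2.4], k=5))
```

With the full dataset, the call would be `knn_predict(iris.data, iris.target, test_inst, k=12)`.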