# K-Nearest Neighbors (KNN)
## Language: Python
## Author: Daisy Nsibu
### Data 4319 - Statistical and Machine Learning

# Introduction

Nearest neighbors is one of the simplest predictive models there is. It makes no mathematical assumptions, and it doesn’t require any sort of heavy machinery.
The only things it requires are:

•	Some notion of distance

•	An assumption that points that are close to one another are similar

### What is KNN?

KNN is a simple supervised learning algorithm. It can be used for:
+ classification (the labels are qualitative, i.e. some type of class)
+ regression (the labels are continuous, real values)
+ search (recommendation) algorithms

### How does kNN work?

![](https://camo.githubusercontent.com/734e103045a2504d7ebf7190471a83deb33353a76a4baaff8993999312a5db0f/68747470733a2f2f7777772e636f72796a6d616b6c696e2e636f6d2f6d656469612f6d616368696e652d6c6561726e696e672d616c676f726974686d732d706172742d362d6b2d6e6561726573742d6e65696768626f72732d696e2d707974686f6e2d312e706e67)

Looking at the graph, we can determine the class of the new point (the star) by looking at its *k* closest neighbors. The way *k*-nearest neighbors works is loosely similar to a [greedy algorithm](https://en.wikipedia.org/wiki/Greedy_algorithm) approach: among the neighbors closest to the new point, we take the mode of their labels. Each neighbor is either class A (yellow) or class B (purple). At *k=3*, the purple points (class B) make up the majority of the 3 closest feature vectors, so we would predict that the new point is purple. At *k=6*, however, the yellow points (class A) make up the majority of the 6 closest feature vectors, so we would predict that the new point is yellow. The underlying assumption is that the class B data points lie close to each other and the class A data points likewise hang out together.
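
The vote described above can be sketched with a handful of made-up points. The coordinates, the `points` list, and the `vote` helper are all hypothetical; they are only chosen so that class B wins at *k=3* and class A wins at *k=6*, as in the figure.

```python
from collections import Counter
import math

# Hypothetical labelled points (x, y, label); the star sits at the origin.
points = [
    (1.0, 0.0, "B"), (0.0, 1.2, "B"),
    (1.4, 0.0, "A"), (0.0, -1.8, "A"), (-2.0, 0.0, "A"), (0.0, 2.2, "A"),
    (3.0, 3.0, "B"),
]
star = (0.0, 0.0)

def vote(k):
    """Return the most common label among the k points closest to the star."""
    by_distance = sorted(points, key=lambda p: math.dist(star, p[:2]))
    labels = [label for _, _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

print(vote(3))  # the 3 nearest are B, B, A -> B
print(vote(6))  # the 6 nearest are B, B, A, A, A, A -> A
```

The same point gets a different label depending on *k*, which is why the choice of *k* matters.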

In order to implement k-nearest neighbors we need some metric to determine distance. In this notebook, we're going to use the **standard Euclidean distance** between two feature vectors $x^{i}$ and $x^{j}$ with $l$ components each:

$$d(x^{i}, x^{j}) = \sqrt{\sum_{n=1}^{l}(x^{i}_{n}-x^{j}_{n})^{2}}$$
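
As a quick check of the formula, here is a minimal NumPy sketch. The function name `euclidean` and the two example vectors are made up for illustration; they are not part of the notebook's implementation further down.

```python
import numpy as np

def euclidean(x_i, x_j):
    """Square root of the summed squared coordinate differences."""
    x_i = np.asarray(x_i, dtype=float)
    x_j = np.asarray(x_j, dtype=float)
    return float(np.sqrt(np.sum((x_i - x_j) ** 2)))

print(euclidean([1, 2, 3], [4, 6, 3]))  # sqrt(9 + 16 + 0) = 5.0
```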

So, we choose a value for *k*, place the new point in the feature space, and calculate its distance to every training point. For classification, the mode of the labels of the *k* closest points gives us our label. This assumes that feature vectors of the same class tend to hang out together.

In classification we typically choose *k* to be odd, because with two classes that avoids ties in the vote. With an even *k*, say *k=4*, you might have two neighbors of class 1 and two neighbors of class 2, and then there is no majority.
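
To make the tie concrete, here is a small sketch. The list `sorted_labels` is a hypothetical list of neighbor labels, already ordered from closest to farthest.

```python
from collections import Counter

# Hypothetical neighbor labels, ordered from closest to farthest.
sorted_labels = ["class 1", "class 2", "class 2", "class 1", "class 1"]

def top_counts(k):
    """Count the labels among the k closest neighbors."""
    return Counter(sorted_labels[:k]).most_common()

print(top_counts(4))  # two votes each, so there is no majority
print(top_counts(5))  # class 1 wins 3 to 2
```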

For regression, you take the mean of the target values of the *k* closest points, and that gives you your real-number output.
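
The regression case can be sketched in a few lines of NumPy. The helper `knn_regress` and the toy data are assumptions for illustration only; the rest of this notebook implements classification.

```python
import numpy as np

def knn_regress(train_x, train_y, sample, k):
    """Predict the mean target value of the k training points closest to sample."""
    train_x = np.asarray(train_x, dtype=float)
    train_y = np.asarray(train_y, dtype=float)
    sample = np.asarray(sample, dtype=float)
    dists = np.sqrt(((train_x - sample) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]  # positions of the k smallest distances
    return float(train_y[nearest].mean())

# Toy data: one feature, and a target equal to ten times the feature.
toy_x = [[1.0], [2.0], [3.0], [10.0]]
toy_y = [10.0, 20.0, 30.0, 100.0]
print(knn_regress(toy_x, toy_y, [2.1], k=3))  # mean of 10, 20 and 30
```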

#### Advantages
+ simple, very easy to implement from scratch
+ no optimization of parameters. When we looked at the PLA (perceptron learning algorithm), we were iteratively updating the weights of our model, which amounts to optimizing them. In kNN there are no weights to optimize. We are admittedly taking a greedy approach: we look at the *k* nearest points to the point in question, and the majority rules, so that's the label it gets.
+ easy to do classification, regression, and recommendations

#### Disadvantages
+ slow: every prediction computes the distance to every training point
+ sensitive to high-dimensional feature vectors. When your feature vectors are very high dimensional, this notion of distance *d* might not capture how the points relate to each other.
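
One way to see the high-dimension problem is to compare the farthest and the nearest distance from a query point as the number of dimensions grows. This is only a sketch on uniformly random data, which is an assumption; real datasets can behave differently. The helper `distance_contrast` is made up for this illustration.

```python
import numpy as np

def distance_contrast(n_dims, n_points=500, seed=0):
    """Ratio of the farthest to the nearest distance from one random query point."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(0, 1, size=(n_points, n_dims))
    query = rng.uniform(0, 1, size=n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.max() / dists.min()

for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(distance_contrast(n_dims), 2))
```

On data like this the ratio shrinks toward 1 as the dimension grows, so the "nearest" neighbor is barely nearer than the farthest one.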

# Implementing k-Nearest Neighbors
### Load Packages

In [1]:
from math import sqrt
import pandas as pd
import numpy as np 
from collections import Counter

### Distance Function

In [2]:
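# standard Euclidean distance between two feature vectors of equal length
def euclidean_distance(row1, row2):
    distance = 0.0
    # zip pairs the values by position; works for lists, arrays and pandas Series
    for a, b in zip(row1, row2):
        distance += (a - b)**2
    # return only after every feature has been added to the sum
    return sqrt(distance)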

### Locating Nearest Neighbors
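The distance function is used to calculate the distance between the new sample and each row of the training set. The *k* rows with the smallest distances are the nearest neighbors, and the most common label among them is the prediction.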

In [3]:
def knn(train_x, train_y, dis_func, sample, k):
    # distance from the sample to every training row, keyed by row position
    distances = {}
    for i in range(len(train_x)):
        distances[i] = dis_func(sample, train_x.iloc[i])
    # sort by distance, breaking ties by row position
    sorted_dist = sorted(distances.items(), key=lambda x: (x[1], x[0]))

    # take the positions of the k nearest neighbors
    neighbors = [index for index, _ in sorted_dist[:k]]

    # convert positions into class labels
    groups = [train_y.iloc[c] for c in neighbors]

    # the most common label among the k neighbors is the prediction
    return Counter(groups).most_common(1)[0][0]
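
The `knn` function above loops over the training rows one at a time, which is part of why kNN is slow. A vectorized variant is sketched below, assuming the features are purely numeric and that Euclidean distance is used. The name `knn_vectorized` and the toy data are hypothetical, and this is not the function used in the rest of the notebook.

```python
from collections import Counter
import numpy as np

def knn_vectorized(train_x, train_y, sample, k):
    """Majority label among the k training rows closest to sample (Euclidean)."""
    train_x = np.asarray(train_x, dtype=float)
    train_y = np.asarray(train_y)
    sample = np.asarray(sample, dtype=float)
    # One distance per training row, computed in a single array operation.
    dists = np.sqrt(((train_x - sample) ** 2).sum(axis=1))
    # A stable sort keeps the lower row position first when distances tie.
    nearest = np.argsort(dists, kind="stable")[:k]
    labels = train_y[nearest].tolist()
    return Counter(labels).most_common(1)[0][0]

# Toy data: three points near (1, 1) labelled "a", two near (5, 5) labelled "b".
toy_x = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9], [0.9, 1.1]]
toy_y = ["a", "a", "b", "b", "a"]
print(knn_vectorized(toy_x, toy_y, [1.0, 0.9], k=3))  # a
```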

---
# Case Study: Penguin Species in Antarctica
---

![](https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/lter_penguins.png)

There are many different penguin species, but the 8 most iconic live in Antarctica, its nearby islands, and the sub-Antarctic archipelagos of South Georgia and the Falklands.
In this notebook I focus mainly on 3 types of penguins:
**Adelie, Gentoo & Chinstrap Penguins**

In [4]:
# k-nearest neighbors on the penguins Dataset
penguin = pd.read_csv('penguins.csv')

### Understanding the Data

In [5]:
penguin.shape
penguin.columns
penguin.head(6)

| | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | FEMALE |
| 3 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | FEMALE |
| 4 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | MALE |
| 5 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | FEMALE |


The dataset consists of 7 columns.

* species: penguin species (Chinstrap, Adélie, or Gentoo)
* island: island name (Dream, Torgersen, or Biscoe) in the Palmer Archipelago (Antarctica)
* culmen_length_mm: culmen length (mm)
* culmen_depth_mm: culmen depth (mm)
* flipper_length_mm: flipper length (mm)
* body_mass_g: body mass (g)
* sex: penguin sex

In [6]:
penguin.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 334 entries, 0 to 333
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            334 non-null    object 
 1   island             334 non-null    object 
 2   culmen_length_mm   334 non-null    float64
 3   culmen_depth_mm    334 non-null    float64
 4   flipper_length_mm  334 non-null    int64  
 5   body_mass_g        334 non-null    int64  
 6   sex                334 non-null    object 
dtypes: float64(2), int64(2), object(3)
memory usage: 18.4+ KB


In [7]:
print(penguin.shape)
# drop the row whose sex is recorded as '.' rather than MALE or FEMALE
penguin = penguin.loc[penguin.sex != '.', :].copy()
print(penguin.shape)

(334, 7)
(333, 7)


In [8]:
penguin.isna().sum() # missing values were removed before loading

species              0
island               0
culmen_length_mm     0
culmen_depth_mm      0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64

### Train / Test Split

In [9]:
# randomly assign roughly 75% of the rows to the training set
penguin['is_train'] = np.random.uniform(0, 1, len(penguin)) <= .75
train = penguin[penguin['is_train']]
test = penguin[~penguin['is_train']]

# features: every column except the label and the split flag
train_x = train.drop(columns=['species', 'is_train']) # training samples
train_y = train['species'] # corresponding labels

test_x = test.drop(columns=['species', 'is_train'])
test_y = test['species']
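
The split above draws new random numbers on every run, so the train and test sets (and the accuracy) change each time. If a repeatable split is wanted, one option is `DataFrame.sample` with a fixed `random_state`. This is a sketch on a small stand-in frame, not the split used for the results in this notebook; the frame `df` and the seed 42 are arbitrary.

```python
import pandas as pd

# Small stand-in frame; the notebook itself uses the penguin DataFrame.
df = pd.DataFrame({
    "species": list("AAAABBBBCCCC"),
    "body_mass_g": range(3000, 4200, 100),
})

# Exactly 75% of the rows, chosen reproducibly; the rest become the test set.
train = df.sample(frac=0.75, random_state=42)
test = df.drop(train.index)
print(len(train), len(test))  # 9 3
```

Unlike the uniform-mask approach, this gives exactly 75% of the rows rather than roughly 75%.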

In [10]:
# one-hot encode the categorical columns (island, sex) as 0/1 integers
train_x = pd.get_dummies(train_x, dtype=int)
test_x = pd.get_dummies(test_x, dtype=int)
print(train_x.shape)
train_x.head()

(247, 9)


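| | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | island_Biscoe | island_Dream | island_Torgersen | sex_FEMALE | sex_MALE |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 39.5 | 17.4 | 186 | 3800 | 0 | 0 | 1 | 1 | 0 |
| 2 | 40.3 | 18.0 | 195 | 3250 | 0 | 0 | 1 | 1 | 0 |
| 3 | 36.7 | 19.3 | 193 | 3450 | 0 | 0 | 1 | 1 | 0 |
| 5 | 38.9 | 17.8 | 181 | 3625 | 0 | 0 | 1 | 1 | 0 |
| 6 | 39.2 | 19.6 | 195 | 4675 | 0 | 0 | 1 | 0 | 1 |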
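
Because `get_dummies` is called separately on the training and test features, the two frames only end up with the same columns if every category shows up in both. That happens to work here, but a safer pattern is to reindex the test frame to the training columns. The frames `train_raw` and `test_raw` below are made-up stand-ins.

```python
import pandas as pd

# Stand-in frames: the test set happens to contain only one island.
train_raw = pd.DataFrame({"island": ["Biscoe", "Dream", "Torgersen"],
                          "body_mass_g": [5000, 3700, 3800]})
test_raw = pd.DataFrame({"island": ["Dream", "Dream"],
                         "body_mass_g": [3600, 3900]})

train_enc = pd.get_dummies(train_raw, dtype=int)
test_enc = pd.get_dummies(test_raw, dtype=int)

# Give the test frame exactly the training columns, filling absent categories with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```

Without the `reindex` step, the distance function would compare values from different columns position by position.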


In [16]:
# predict the species of one test sample and compare it with the true label
prediction = knn(train_x, train_y, euclidean_distance, test_x.iloc[13], k=5)
print(prediction)
print(test_y.iloc[13])

Adelie
Adelie


In [12]:
# Calculate the accuracy
def get_accuracy(test_x, test_y, train_x, train_y, k):
    correct = 0
    for i in range(len(test_x)):
        sample = test_x.iloc[i]
        true_label = test_y.iloc[i]
        predicted_label_euclidean = knn(train_x, train_y, euclidean_distance, sample, k)
        if predicted_label_euclidean == true_label:
            correct += 1
    
    accuracy_euclidean = (correct / len(test_x)) * 100
    
    print("Model accuracy with Euclidean Distance is %.2f" %(accuracy_euclidean), "%")


In [13]:
get_accuracy(test_x, test_y, train_x, train_y, k=5)

Model accuracy with Euclidean Distance is 72.09 %


# Conclusion

The accuracy of the kNN model shows that it classifies only 72.09% of the test samples correctly. That's not bad, but also not great. We know that the kNN algorithm is sensitive to high-dimensional feature vectors (and slow on large volumes of data), so perhaps I could have gotten better results with fewer features or with a distance function other than Euclidean.
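
As one example of a different distance function, here is a sketch of the Manhattan (L1) distance. It takes the same two arguments as `euclidean_distance`, so it could be passed to `knn` as the `dis_func` argument. Whether it actually improves accuracy on the penguin data has not been tested here.

```python
def manhattan_distance(row1, row2):
    """Sum of absolute coordinate differences (L1 distance)."""
    return float(sum(abs(a - b) for a, b in zip(row1, row2)))

print(manhattan_distance([1, 2, 3], [4, 6, 3]))  # |1-4| + |2-6| + |3-3| = 7.0
```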

# References

Géron, Aurélien. *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems*. 2nd ed., O’Reilly Media, 2019.

Penguin Data originally published in:

Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081