# KNN

### What is KNN?

- Supervised algorithm
- Makes predictions based on how close a new data point is to known data points
- Lazy (no model is built during training; computation is deferred until a prediction is requested)
- Sensitive to scaling

Link: [KNN Diagram](https://cambridgecoding.files.wordpress.com/2016/01/knn2.jpg)
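The idea in the bullets above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `knn_predict` and the toy points are made up for the example, and Euclidean distance with a simple majority vote is an assumption.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Euclidean distance from x_new to every stored training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two small clusters
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_toy = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(X_toy, y_toy, [1.1, 1.0], k=3))
print(knn_predict(X_toy, y_toy, [5.1, 5.0], k=3))
```

Note that nothing is "learned" ahead of time: all the work happens inside the prediction call, which is what makes the algorithm lazy.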

#### Pros:

1. Simple to implement 
2. Performs calculations "just in time"
3. Easy to keep up to date: new training data can be added without retraining a model

#### Cons:

1. The optimal number of neighbors (k) must be determined
2. Computation cost is high at prediction time (the distance from the new point to every stored training point must be calculated)
3. The training data must be stored and accessible to the model
4. Distances become less informative as the number of features (dimensions) grows
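Cons 2 and 4 can be illustrated with random data. The sketch below is an assumption-laden toy (uniform random points, a hypothetical helper `nearest_to_farthest_ratio`), not a benchmark. It computes one distance per stored point, then compares the nearest and farthest distances as the number of dimensions grows. When that ratio approaches 1, the "nearest" neighbor is barely nearer than any other point.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_points, n_dims):
    """Ratio of the nearest to the farthest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    # One distance per stored point: the per-prediction cost of a brute-force search
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

ratios = {d: nearest_to_farthest_ratio(1000, d) for d in (2, 10, 100, 1000)}
for d, r in ratios.items():
    print(f"{d:>4} dims: nearest/farthest = {r:.2f}")
```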

In [1]:
# DS Libraries
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# KNN classifier, data splitting, and evaluation metrics from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Data Acquisition
from pydataset import data

## Acquire data

- Use the `iris` dataset from `pydataset`

In [2]:
df = data('iris')

# Lowercase the column names and replace '.' with '_'
df.columns = [col.lower().replace('.', '_') for col in df]


#### Note: Inspect the units of the features

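KNN relies on distances, so a feature measured on a larger scale dominates the result. Scale the features whenever their units or ranges differ. All four iris features are measured in centimeters, so this notebook skips scaling.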
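When features do have different units, one common approach is to standardize them. The sketch below uses scikit-learn's `StandardScaler` on made-up data (the column names `height_m` and `weight_g` are hypothetical and not part of iris). The scaler is fit on the training split only and then applied to the other splits, so no information leaks from validate or test.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (not the iris data)
train_demo = pd.DataFrame({"height_m": [1.5, 1.6, 1.7, 1.8],
                           "weight_g": [50000, 60000, 70000, 80000]})
validate_demo = pd.DataFrame({"height_m": [1.65], "weight_g": [65000]})

scaler = StandardScaler()
# Learn the mean and standard deviation from the training split only
scaler.fit(train_demo)

# Apply the same transformation to every split
train_scaled = pd.DataFrame(scaler.transform(train_demo),
                            columns=train_demo.columns, index=train_demo.index)
validate_scaled = pd.DataFrame(scaler.transform(validate_demo),
                               columns=validate_demo.columns, index=validate_demo.index)

print(train_scaled.round(2))
print(validate_scaled.round(2))
```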

## Prepare/Preprocessing

[Train Test Split Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

### 1. Split into train, validate, test


In [3]:
train, test = train_test_split(df, 
                               stratify=df['species'], 
                               train_size=0.8, 
                               random_state=1729)

train, validate = train_test_split(train, 
                                   stratify=train['species'], 
                                   train_size=0.7, 
                                   random_state=1729)
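The two calls above first hold out 20% of the rows for test, then keep 70% of the remaining 80% for train, which works out to 0.8 × 0.7 = 56% train, 24% validate, and 20% test. The sketch below checks those proportions on a hypothetical stand-in frame with the same shape as iris (150 rows, three balanced classes); the frame `demo` and its `feature` column are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same shape as iris: 150 rows, 3 balanced classes
demo = pd.DataFrame({
    "feature": np.arange(150, dtype=float),
    "species": np.repeat(["setosa", "versicolor", "virginica"], 50),
})

train_val, test_demo = train_test_split(demo, stratify=demo["species"],
                                        train_size=0.8, random_state=1729)
train_demo, validate_demo = train_test_split(train_val, stratify=train_val["species"],
                                             train_size=0.7, random_state=1729)

# Report the size and share of each split
for name, part in [("train", train_demo), ("validate", validate_demo), ("test", test_demo)]:
    print(name, len(part), f"{len(part) / len(demo):.0%}")
```

Stratifying on `species` keeps the class proportions the same in every split.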

### 2. Split features from target

Create `X` and `y` for each split, where:

   - `X` holds the features (here `sepal_length` and `sepal_width`)
   
   - `y` is the target (`species`)

In [16]:
#Train
X_train = train.drop(columns=['species','petal_length', 'petal_width'])
y_train = train.species

In [17]:
#Validate
X_validate = validate.drop(columns=['species','petal_length', 'petal_width'])
y_validate = validate.species

In [18]:
#Test
X_test = test.drop(columns=['species','petal_length', 'petal_width'])
y_test = test.species
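The three cells above repeat the same two lines for each split. One way to avoid the repetition is a small helper; `split_xy` below is a hypothetical function written for this sketch, and the tiny frame only mimics the notebook's column names.

```python
import pandas as pd

def split_xy(df, target="species", drop=("petal_length", "petal_width")):
    """Return (X, y): X without the target and dropped columns, y as the target Series."""
    X = df.drop(columns=[target, *drop])
    y = df[target]
    return X, y

# Tiny hypothetical frame with the same column names as the notebook
demo = pd.DataFrame({
    "sepal_length": [5.1, 6.4], "sepal_width": [3.5, 3.2],
    "petal_length": [1.4, 4.5], "petal_width": [0.2, 1.5],
    "species": ["setosa", "versicolor"],
})
X_demo, y_demo = split_xy(demo)
print(list(X_demo.columns))
```

With the notebook's data this would be called as `X_train, y_train = split_xy(train)`, and likewise for `validate` and `test`.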

In [10]:
train.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

In [11]:
y_train

149     virginica
87     versicolor
48         setosa
61     versicolor
23         setosa
142     virginica
133     virginica
141     virginica
70     versicolor
5          setosa
98     versicolor
8          setosa
145     virginica
140     virginica
80     versicolor
42         setosa
138     virginica
19         setosa
129     virginica
135     virginica
81     versicolor
123     virginica
46         setosa
136     virginica
115     virginica
13         setosa
12         setosa
11         setosa
54     versicolor
45         setosa
66     versicolor
18         setosa
43         setosa
63     versicolor
44         setosa
52     versicolor
10         setosa
112     virginica
130     virginica
109     virginica
124     virginica
114     virginica
72     versicolor
62     versicolor
41         setosa
85     versicolor
134     virginica
35         setosa
100    versicolor
16         setosa
49         setosa
108     virginica
6          setosa
110     virginica
51     versicolor
17        

## k-nearest neighbors model


[Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)


In [None]:
sns.scatterplot(data=train, x='sepal_length', y='sepal_width', hue='species')


#### Create KNN Object

In [14]:
knn = KNeighborsClassifier(n_neighbors=1)

#### Fit the Model to the Training Data

In [19]:
knn.fit(X_train, y_train)

#### Make Predictions

In [21]:
y_pred = knn.predict(X_train)



In [22]:
y_pred

array(['virginica', 'versicolor', 'setosa', 'versicolor', 'setosa',
       'virginica', 'virginica', 'versicolor', 'versicolor', 'setosa',
       'versicolor', 'setosa', 'virginica', 'virginica', 'versicolor',
       'setosa', 'virginica', 'setosa', 'virginica', 'virginica',
       'versicolor', 'virginica', 'setosa', 'virginica', 'virginica',
       'setosa', 'setosa', 'setosa', 'versicolor', 'setosa', 'versicolor',
       'setosa', 'setosa', 'versicolor', 'setosa', 'virginica', 'setosa',
       'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
       'versicolor', 'versicolor', 'setosa', 'versicolor', 'virginica',
       'setosa', 'versicolor', 'setosa', 'setosa', 'virginica', 'setosa',
       'virginica', 'versicolor', 'setosa', 'versicolor', 'virginica',
       'setosa', 'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'virginica', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'virginica', 'versicolor', 'virginica', 'setosa',
       've

#### Estimate Probability of the Prediction

In [23]:
y_pred_proba = knn.predict_proba(X_train)

In [24]:
y_pred_proba

array([[0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0

In [26]:
knn.classes_

array(['setosa', 'versicolor', 'virginica'], dtype=object)
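The columns of `predict_proba` follow the order of `classes_`. With uniform weights, each entry is the fraction of the k neighbors that belong to that class, which is why a k = 1 model only ever returns 0 or 1. The sketch below shows this on made-up one-dimensional data (the names `X_toy`, `y_toy`, and `knn_toy` are hypothetical).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D toy data
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_toy = np.array(["a", "a", "b", "b", "b", "b"])

knn_toy = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

query = np.array([[1.2]])
# The three nearest points are 1.0 (a), 2.0 (b) and 0.0 (a)
proba = knn_toy.predict_proba(query)

print(knn_toy.classes_)        # column order of predict_proba
print(proba)                   # fraction of the 3 neighbors in each class
print(knn_toy.predict(query))  # class with the largest fraction
```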

## Evaluate Model

#### Compute the Accuracy

In [27]:
# Arguments are (actual, predicted); rows are actual classes, columns are predicted classes
confusion_matrix(y_train, y_pred)

array([[28,  0,  0],
       [ 0, 27,  1],
       [ 0,  3, 25]])

In [28]:
# Cross-tabulate actual (rows) against predicted (columns);
# the diagonal holds the correct predictions
pd.crosstab(y_train, y_pred)

col_0       setosa  versicolor  virginica
species                                  
setosa          28           0          0
versicolor       0          27          1
virginica        0           3         25
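Accuracy, precision, and recall can all be read off the confusion matrix. The sketch below recomputes them by hand from the k = 1 matrix shown above; the variable names are chosen for the example.

```python
import numpy as np

# Confusion matrix from the k = 1 model above (rows: actual, columns: predicted)
cm = np.array([[28, 0, 0],
               [0, 27, 1],
               [0, 3, 25]])
labels = ["setosa", "versicolor", "virginica"]

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)       # correct / row total (actual class)
precision = np.diag(cm) / cm.sum(axis=0)    # correct / column total (predicted class)

print(f"accuracy: {accuracy:.2f}")
for name, p, r in zip(labels, precision, recall):
    print(f"{name:<11} precision={p:.2f} recall={r:.2f}")
```

For example, 27 of the 30 points predicted as versicolor are correct, which gives a precision of 0.90.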


#### Create a Classification Report

In [29]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        28
  versicolor       0.90      0.96      0.93        28
   virginica       0.96      0.89      0.93        28

    accuracy                           0.95        84
   macro avg       0.95      0.95      0.95        84
weighted avg       0.95      0.95      0.95        84



## Changing the k value

In [31]:
# Set the number of neighbors (an odd k rules out tied votes only when there are two classes)
# weights='uniform' gives every neighbor an equal vote


knn5 = KNeighborsClassifier(n_neighbors = 5, weights= 'uniform')


## Fit to the training data

knn5.fit(X_train, y_train)

## Make predictions using the training features

y_pred5 = knn5.predict(X_train)



In [32]:
##Evaluate the new k = 5 model

print(classification_report(y_train, y_pred5))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        28
  versicolor       0.82      0.64      0.72        28
   virginica       0.71      0.86      0.77        28

    accuracy                           0.83        84
   macro avg       0.84      0.83      0.83        84
weighted avg       0.84      0.83      0.83        84



In [33]:
#use the score to determine mean accuracy
#for k = 5
knn5.score(X_train, y_train)



0.8333333333333334

In [34]:
# Classification report on the validate split for k = 1
print(classification_report(y_validate, knn.predict(X_validate)))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        12
  versicolor       0.75      0.75      0.75        12
   virginica       0.75      0.75      0.75        12

    accuracy                           0.83        36
   macro avg       0.83      0.83      0.83        36
weighted avg       0.83      0.83      0.83        36





In [35]:
# Classification report on the validate split for k = 5
print(classification_report(y_validate, knn5.predict(X_validate)))

              precision    recall  f1-score   support

      setosa       0.92      1.00      0.96        12
  versicolor       0.78      0.58      0.67        12
   virginica       0.64      0.75      0.69        12

    accuracy                           0.78        36
   macro avg       0.78      0.78      0.77        36
weighted avg       0.78      0.78      0.77        36





## Finding the best value for k

In [38]:
# Fit one model for each k from 1 to 9.
# model_list stores the fitted models;
# model_accuracies maps each k to its train and validate scores.


model_accuracies = {}
model_list = []

for i in range(1,10):
    nknn = KNeighborsClassifier(n_neighbors = i)
    nknn.fit(X_train, y_train)
    model_list.append(nknn)
    model_accuracies[f'{i} - Neighbors'] = {'Train Score:':round(nknn.score(X_train, y_train),2),
                                           'Validate Score:':round(nknn.score(X_validate, y_validate),2)}
    



In [39]:
model_accuracies

{'1 - Neighbors': {'Train Score:': 0.95, 'Validate Score:': 0.83},
 '2 - Neighbors': {'Train Score:': 0.86, 'Validate Score:': 0.72},
 '3 - Neighbors': {'Train Score:': 0.88, 'Validate Score:': 0.75},
 '4 - Neighbors': {'Train Score:': 0.83, 'Validate Score:': 0.67},
 '5 - Neighbors': {'Train Score:': 0.83, 'Validate Score:': 0.78},
 '6 - Neighbors': {'Train Score:': 0.81, 'Validate Score:': 0.86},
 '7 - Neighbors': {'Train Score:': 0.83, 'Validate Score:': 0.83},
 '8 - Neighbors': {'Train Score:': 0.82, 'Validate Score:': 0.75},
 '9 - Neighbors': {'Train Score:': 0.81, 'Validate Score:': 0.81}}
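The dictionary above is easier to compare as a table. The sketch below copies the scores from the output above into a DataFrame, adds the gap between train and validate, and looks up the k with the highest validate score. The `Difference` column and the variable names are additions for this example.

```python
import pandas as pd

# Scores copied from the model_accuracies output above
model_accuracies = {
    '1 - Neighbors': {'Train Score:': 0.95, 'Validate Score:': 0.83},
    '2 - Neighbors': {'Train Score:': 0.86, 'Validate Score:': 0.72},
    '3 - Neighbors': {'Train Score:': 0.88, 'Validate Score:': 0.75},
    '4 - Neighbors': {'Train Score:': 0.83, 'Validate Score:': 0.67},
    '5 - Neighbors': {'Train Score:': 0.83, 'Validate Score:': 0.78},
    '6 - Neighbors': {'Train Score:': 0.81, 'Validate Score:': 0.86},
    '7 - Neighbors': {'Train Score:': 0.83, 'Validate Score:': 0.83},
    '8 - Neighbors': {'Train Score:': 0.82, 'Validate Score:': 0.75},
    '9 - Neighbors': {'Train Score:': 0.81, 'Validate Score:': 0.81},
}

# One row per k, one column per score
scores = pd.DataFrame.from_dict(model_accuracies, orient='index')
# A large positive gap suggests the model fits the training data better than unseen data
scores['Difference'] = (scores['Train Score:'] - scores['Validate Score:']).round(2)

best = scores['Validate Score:'].idxmax()
print(scores)
print("Highest validate score:", best)
```

The validate split has only 36 rows, so a single misclassified flower moves the score by almost 3 percentage points. Small differences between neighboring values of k should be read with that in mind.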

In [40]:
model_list

[KNeighborsClassifier(n_neighbors=1),
 KNeighborsClassifier(n_neighbors=2),
 KNeighborsClassifier(n_neighbors=3),
 KNeighborsClassifier(n_neighbors=4),
 KNeighborsClassifier(),
 KNeighborsClassifier(n_neighbors=6),
 KNeighborsClassifier(n_neighbors=7),
 KNeighborsClassifier(n_neighbors=8),
 KNeighborsClassifier(n_neighbors=9)]
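Because a single small validate split gives noisy scores, one alternative worth considering is cross-validation. The sketch below is not part of the original workflow. It assumes scikit-learn's bundled copy of iris (`load_iris`) as a stand-in and uses only the two sepal features; in practice the cross-validation would be run on the training split so that test stays untouched.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris data stands in for the training split
X_all, y_all = load_iris(return_X_y=True)
X_sepal = X_all[:, :2]   # sepal length and sepal width only

cv_scores = {}
for k in range(1, 10):
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds
    cv_scores[k] = cross_val_score(model, X_sepal, y_all, cv=5).mean()

for k, score in cv_scores.items():
    print(f"k={k}: {score:.2f}")
```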

## Moving forward

- We selected `sepal_length` and `sepal_width` as features. 
     - Build new models with different and/or additional features. 


- Tuning hyperparameters

    `'weights'`: Uniform weighting is the default (every neighbor gets an equal vote). 
    Switching to `'distance'` gives nearer neighbors more weight in the vote.
    
    `'algorithm'`: Controls how the nearest neighbors are searched for (`'auto'`, `'ball_tree'`, `'kd_tree'`, or `'brute'`). Tree-based searches can reduce the computational cost on large datasets.
    
    `'metric'`: There is more than one way to measure distance. The default is Minkowski with `p=2` (Euclidean); Manhattan distance is another option.


- There are very similar models that we can try e.g. `RadiusNeighborsClassifier`
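The hyperparameters listed above can be compared with a short loop. The sketch below is only an illustration: it assumes scikit-learn's bundled iris data with the two sepal features, a single 70/30 split, and k = 5, and the list `settings` is a made-up selection rather than a recommended search grid.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris data, sepal features only
X_all, y_all = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_all[:, :2], y_all, stratify=y_all, train_size=0.7, random_state=1729)

settings = [
    {"weights": "uniform", "metric": "minkowski"},    # scikit-learn defaults
    {"weights": "distance", "metric": "minkowski"},   # closer neighbors count more
    {"weights": "uniform", "metric": "manhattan"},    # sum of absolute differences
    {"weights": "uniform", "algorithm": "kd_tree"},   # different neighbor search structure
]

results = []
for params in settings:
    model = KNeighborsClassifier(n_neighbors=5, **params).fit(X_tr, y_tr)
    results.append((params, model.score(X_val, y_val)))

for params, score in results:
    print(params, round(score, 2))
```

Note that with `weights='distance'` a model scored on its own training data will look nearly perfect, because each training point is its own nearest neighbor at distance zero. Compare such models on validate instead.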