# KNN

### What is KNN?

- Supervised algorithm
- Makes predictions based on how close a new data point is to known data points
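- Lazy (nothing is learned at fit time; the training data is stored and the distance computations happen when a prediction is requested)
- Sensitive to feature scaling, because predictions depend on distances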

Link: [KNN Diagram](https://cambridgecoding.files.wordpress.com/2016/01/knn2.jpg)
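
To make the idea concrete, here is a minimal from-scratch sketch of the prediction step, assuming Euclidean distance and a simple majority vote. The function name `knn_predict` and the toy points are hypothetical; scikit-learn's `KNeighborsClassifier` is what the rest of these notes use.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Label x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every stored training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array(['a', 'a', 'a', 'b', 'b', 'b'])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))
print(knn_predict(X_train, y_train, np.array([5.1, 5.0])))
```

Note that all of the work happens inside `knn_predict`: there is no separate training step, which is what "lazy" refers to.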

#### Pros:

1. Simple to implement 
2. Performs calculations "just in time"
3. New data can be added without retraining, so predictions are easy to keep current

#### Cons:

1. Need to determine the optimal number of neighbors (k)
2. Computation cost at prediction time is high (a brute-force search calculates the distance from the new point to every stored training point)
3. Data must be stored and accessible to the model
4. Performance degrades in higher dimensions (many features), where distances become less informative

In [38]:
# DS Libraries
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# knn submodules from scikit learn
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

# Data Acquisition
from pydataset import data

## Acquire data

- Use the `iris` dataset from `pydataset`

In [48]:
df = data('iris')

# Normalize column names: lowercase, with dots replaced by underscores
df.columns = [col.lower().replace('.', '_') for col in df.columns]


#### Note: Inspect the units of the features

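Scaling matters for a distance-based algorithm like KNN: a feature measured in larger units contributes more to the distance and can dominate the vote. All four iris measurements are in centimeters, so they are already on comparable scales.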
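
A small sketch of why units matter, using two hypothetical points with a height and a weight column (not iris data). Changing only the unit of one feature changes which point is nearest, while standardizing with the stored points' mean and standard deviation gives the same answer in either unit.

```python
import numpy as np


def nearest(points, query):
    """Return the index of the row in points closest to query (Euclidean)."""
    return int(np.argmin(np.linalg.norm(points - query, axis=1)))


def standardize(points, query):
    """Z-score both arrays using the mean and std of the stored points."""
    mean, std = points.mean(axis=0), points.std(axis=0)
    return (points - mean) / std, (query - mean) / std


# Hypothetical data: columns are [height_cm, weight_kg]
points_cm = np.array([[150.0, 50.0], [180.0, 58.0]])
query_cm = np.array([178.0, 51.0])

# Same measurements with height converted from centimeters to meters
to_m = np.array([0.01, 1.0])
points_m, query_m = points_cm * to_m, query_cm * to_m

print(nearest(points_cm, query_cm))  # 1: height differences dominate
print(nearest(points_m, query_m))    # 0: weight differences dominate

# After standardizing, the unit of height no longer changes the answer
print(nearest(*standardize(points_cm, query_cm)))  # 1
print(nearest(*standardize(points_m, query_m)))    # 1
```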

## Prepare/Preprocessing

[Train Test Split Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

### 1. Split into train, validate, test


In [44]:
train, test = train_test_split(df, 
                               stratify=df['species'], 
                               train_size=0.8, 
                               random_state=1729)

train, validate = train_test_split(train, 
                                   stratify=train['species'], 
                                   train_size=0.7, 
                                   random_state=1729)

### 2. Split into features and target

Create `X` and `y` for each split, where: 

   - `X` holds the feature columns
   
   - `y` holds the target column (`species`)
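
One way this step might look, assuming `sepal_length` and `sepal_width` as the features. The helper `split_xy` is a hypothetical name, and the three-row `train` frame below is only a stand-in so the snippet runs on its own; in the notebook the same call would be applied to `train`, `validate`, and `test`.

```python
import pandas as pd

features = ['sepal_length', 'sepal_width']
target = 'species'


def split_xy(frame, features, target):
    """Return (X, y): the feature columns and the target column of a frame."""
    return frame[features], frame[target]


# Stand-in for the notebook's train frame (one row per species)
train = pd.DataFrame({
    'sepal_length': [5.1, 7.0, 6.3],
    'sepal_width': [3.5, 3.2, 3.3],
    'petal_length': [1.4, 4.7, 6.0],
    'petal_width': [0.2, 1.4, 2.5],
    'species': ['setosa', 'versicolor', 'virginica'],
})

X_train, y_train = split_xy(train, features, target)
print(X_train.shape, y_train.shape)
```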

## k-nearest neighbors model


[Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)



#### Create KNN Object
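
A minimal sketch of this step. `n_neighbors=5` and `weights='uniform'` are assumed starting values (they are also scikit-learn's defaults), not tuned choices.

```python
from sklearn.neighbors import KNeighborsClassifier

# k=5 with uniform weights is an assumed starting point, not a tuned value
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')
print(knn.get_params()['n_neighbors'])
```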

#### Fit the Model to the Training Data
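
A sketch of fitting, assuming the two sepal features. So that the snippet runs on its own, it rebuilds stand-in arrays from scikit-learn's bundled copy of iris with a single 70/30 split; in the notebook, `X_train` and `y_train` would come from the split above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: columns 0 and 1 are sepal length and sepal width
X, y = load_iris(return_X_y=True)
X_train, X_validate, y_train, y_validate = train_test_split(
    X[:, :2], y, stratify=y, train_size=0.7, random_state=1729)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)  # stores the training data for later lookups

print(knn.classes_)        # class labels seen during fit
print(knn.n_samples_fit_)  # number of stored training rows
```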

#### Make Predictions
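
A sketch of predicting on held-out rows, again on stand-in arrays built from scikit-learn's bundled iris (an assumption made so the snippet runs alone). Labels here are the integers 0, 1, 2 rather than the species names in the notebook's `df`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: columns 0 and 1 are sepal length and sepal width
X, y = load_iris(return_X_y=True)
X_train, X_validate, y_train, y_validate = train_test_split(
    X[:, :2], y, stratify=y, train_size=0.7, random_state=1729)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# One predicted label per row, chosen by majority vote of the 5 neighbors
y_pred = knn.predict(X_validate)
print(y_pred[:10])
```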

#### Estimate Probability of the Prediction
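
A sketch using `predict_proba`, on the same kind of stand-in arrays (an assumption, not the notebook's data). With uniform weights, each probability is the share of the k neighbors that belong to that class, so with k=5 the values are multiples of 0.2.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: columns 0 and 1 are sepal length and sepal width
X, y = load_iris(return_X_y=True)
X_train, X_validate, y_train, y_validate = train_test_split(
    X[:, :2], y, stratify=y, train_size=0.7, random_state=1729)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# One row per observation, one column per class (ordered as knn.classes_)
y_proba = knn.predict_proba(X_validate)
print(y_proba[:5])
```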

## Evaluate Model

#### Compute the Accuracy
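
A sketch of two equivalent ways to get accuracy: the estimator's `score` method and `accuracy_score`. The arrays are stand-ins built from scikit-learn's bundled iris so the snippet runs alone; the numbers it prints are not the notebook's results.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: columns 0 and 1 are sepal length and sepal width
X, y = load_iris(return_X_y=True)
X_train, X_validate, y_train, y_validate = train_test_split(
    X[:, :2], y, stratify=y, train_size=0.7, random_state=1729)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# score() predicts and compares to the true labels in one call
train_acc = knn.score(X_train, y_train)
# accuracy_score() does the same comparison on predictions we already have
validate_acc = accuracy_score(y_validate, knn.predict(X_validate))

print(f'train: {train_acc:.2%}, validate: {validate_acc:.2%}')
```

Comparing the two numbers gives a first hint of overfitting: a large drop from train to validate suggests the model follows the training data too closely.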

#### Create a Classification Report
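
A sketch of the report and the confusion matrix behind it, on stand-in arrays from scikit-learn's bundled iris (an assumption so the snippet is self-contained). `target_names` only supplies readable row labels.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: columns 0 and 1 are sepal length and sepal width
iris = load_iris()
X_train, X_validate, y_train, y_validate = train_test_split(
    iris.data[:, :2], iris.target, stratify=iris.target,
    train_size=0.7, random_state=1729)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = knn.predict(X_validate)

# Rows are actual classes, columns are predicted classes
matrix = confusion_matrix(y_validate, y_pred)
print(matrix)

# Precision, recall, and f1-score per class
report = classification_report(y_validate, y_pred,
                               target_names=iris.target_names)
print(report)
```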

## Changing the k value
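
A sketch comparing a few values of k, chosen arbitrarily here (1, 5, and 20). A small k tends to follow the training data closely, while a large k smooths the decision boundary. The arrays are stand-ins from scikit-learn's bundled iris.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: columns 0 and 1 are sepal length and sepal width
X, y = load_iris(return_X_y=True)
X_train, X_validate, y_train, y_validate = train_test_split(
    X[:, :2], y, stratify=y, train_size=0.7, random_state=1729)

scores = {}
for k in (1, 5, 20):  # arbitrary values for illustration
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = (knn.score(X_train, y_train),
                 knn.score(X_validate, y_validate))
    print(f'k={k:>2}  train={scores[k][0]:.3f}  validate={scores[k][1]:.3f}')
```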

## Finding the best value for k
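
One possible approach, sketched under the assumption that "best" means highest validate accuracy: fit a model for each k in a range (1 to 20 here, an arbitrary choice) and keep the k that scores best on validate. The arrays are stand-ins from scikit-learn's bundled iris.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: columns 0 and 1 are sepal length and sepal width
X, y = load_iris(return_X_y=True)
X_train, X_validate, y_train, y_validate = train_test_split(
    X[:, :2], y, stratify=y, train_size=0.7, random_state=1729)

results = []
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    results.append((k,
                    knn.score(X_train, y_train),
                    knn.score(X_validate, y_validate)))

# max() keeps the first maximum, so ties go to the smallest k
best_k, best_train, best_validate = max(results, key=lambda r: r[2])
print(f'best k={best_k} (train={best_train:.3f}, validate={best_validate:.3f})')
```

The test split stays untouched during this search, so it can still give an honest estimate for the final model.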

## Moving forward

- We selected `sepal_length` and `sepal_width` as features. 
     - Build new models with different and/or additional features. 


- Tuning hyperparameters

    `'weights'`: The default is `'uniform'` (all neighbors get an equal vote). 
    Switch to `'distance'` to give nearer neighbors more weight in the voting process.
    
    `'algorithm'`: Controls how the nearest neighbors are searched for (`'auto'`, `'ball_tree'`, `'kd_tree'`, or `'brute'`). Tree-based searches can save computation on large datasets. We can try different options. 
    
    `'metric'`: There is more than one way to measure distance. The default is `'minkowski'` with `p=2`, which is Euclidean distance; `'manhattan'` is one alternative.


- There are very similar models that we can try, e.g. `RadiusNeighborsClassifier`, which votes among all neighbors within a fixed radius instead of the k nearest
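
A sketch of trying the hyperparameters listed above, one change at a time, with k held at an assumed 5. The arrays are stand-ins from scikit-learn's bundled iris, and the dictionary labels are hypothetical names for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: columns 0 and 1 are sepal length and sepal width
X, y = load_iris(return_X_y=True)
X_train, X_validate, y_train, y_validate = train_test_split(
    X[:, :2], y, stratify=y, train_size=0.7, random_state=1729)

candidates = {
    'default': KNeighborsClassifier(n_neighbors=5),
    'distance-weighted': KNeighborsClassifier(n_neighbors=5, weights='distance'),
    'kd_tree search': KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree'),
    'manhattan metric': KNeighborsClassifier(n_neighbors=5, metric='manhattan'),
}

# Fit each variant and record its validate accuracy
scores = {name: model.fit(X_train, y_train).score(X_validate, y_validate)
          for name, model in candidates.items()}

for name, score in scores.items():
    print(f'{name:<18} validate={score:.3f}')
```

The `algorithm` setting mainly affects speed; apart from how ties between equally distant neighbors are broken, it searches for the same neighbors.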