# KNN

### What is KNN?

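- Supervised algorithm
- Makes predictions based on how close a new data point is to known (labeled) data points.
- Lazy: no model is built during training; the data is stored and the work happens at prediction time
- Sensitive to scaling: features with larger ranges dominate the distance calculation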

Link: [KNN Diagram](https://cambridgecoding.files.wordpress.com/2016/01/knn2.jpg)
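To make the idea of "closeness" concrete, here is a minimal from-scratch sketch of a KNN prediction. The function name `knn_predict` and the toy data are made up for illustration; the lesson itself uses scikit-learn's `KNeighborsClassifier`.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every stored training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two small clusters
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))
print(knn_predict(X_train, y_train, np.array([4.9, 5.0]), k=3))
```

Note that nothing is "trained" here: every prediction recomputes distances against the stored data, which is what "lazy" means.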

#### Pros:
1. Simple
2. Fairly robust to noisy training data when k is large enough to outvote stray points
3. Can be effective when plenty of training data is available
4. Performs calculations "just in time"
5. New data is easy to add, since there is no model to retrain, which helps keep predictions accurate

#### Cons:
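1. Need to determine the optimal number of neighbors (k)
2. Computation cost at prediction time is high (a brute-force search calculates the distance from the new point to every training point)
3. Curse of dimensionality: the volume of the feature space increases exponentially as the number of features increases, so points become sparse and distances less informative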
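The curse of dimensionality can be illustrated with random data. The sketch below is an assumption-laden toy experiment (uniform random points, a hypothetical helper named `relative_contrast`), not part of the lesson: it measures how much farther the farthest point is than the nearest one as the number of features grows.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_features, n_points=500):
    """(farthest - nearest) / nearest distance from one random query to random points."""
    points = rng.random((n_points, n_features))
    query = rng.random(n_features)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


# As features are added, nearest and farthest points become almost equally far away
for n_features in (2, 10, 100, 1000):
    print(n_features, round(relative_contrast(n_features), 2))
```

When the contrast shrinks toward zero, "nearest" neighbors are barely nearer than anything else, which weakens the premise KNN relies on.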

In [None]:
# Quiet my warnings for the sake of the lesson:
import warnings
warnings.filterwarnings("ignore")

# Tabular data friends:
import pandas as pd
import numpy as np

# Data viz:
import matplotlib.pyplot as plt
import seaborn as sns

# Sklearn stuff:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

# Data acquisition
from pydataset import data

## Acquire (Iris Dataset)

In [None]:
# read Iris data from pydataset
df = data('iris')

# convert column names to lowercase, replace '.' in column names with '_'
df.columns = [col.lower().replace('.', '_') for col in df]

df.head()

## Prepare/Preprocessing

In [None]:
# split into train, validate, test
train, test = train_test_split(df, stratify=df['species'], train_size=0.8, random_state=1349)
train, validate = train_test_split(train, stratify=train['species'], train_size=0.7, random_state=1349)

# create X & y version of train/validate/test
# where X contains the features we want to use and y is a series with just the target variable

X_train = train.drop(columns=['species', 'petal_length', 'petal_width'])
y_train = train.species

X_validate = validate.drop(columns=['species', 'petal_length', 'petal_width'])
y_validate = validate.species

X_test = test.drop(columns=['species', 'petal_length', 'petal_width'])
y_test = test.species
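The split above does not scale the features. Because KNN is sensitive to scaling, one optional extra step is to standardize the features, fitting the scaler on the training split only. The sketch below uses scikit-learn's `StandardScaler` on small made-up stand-ins for `X_train` and `X_validate`; the values are illustrative, not taken from the dataset split.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up stand-ins for X_train / X_validate (illustration only)
X_train = pd.DataFrame({"sepal_length": [5.1, 6.3, 5.8, 7.1],
                        "sepal_width": [3.5, 2.9, 2.7, 3.0]})
X_validate = pd.DataFrame({"sepal_length": [4.9, 6.5],
                           "sepal_width": [3.0, 3.2]})

scaler = StandardScaler()

# Learn the mean and standard deviation from the training split only
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train),
                              columns=X_train.columns, index=X_train.index)

# Apply the same transformation to validate (no refitting)
X_validate_scaled = pd.DataFrame(scaler.transform(X_validate),
                                 columns=X_validate.columns, index=X_validate.index)

print(X_train_scaled)
print(X_validate_scaled)
```

Fitting on train only keeps information from validate and test out of the preprocessing step.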

## Train Model

#### Create KNN Object

#### Fit the Model to the Training Data

#### Make Predictions

#### Estimate Probability
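The four training steps above could look roughly like the sketch below. Two assumptions: scikit-learn's bundled iris data stands in for the pydataset copy so the block runs on its own (its target is coded 0/1/2 rather than species names), and `n_neighbors=1` is an illustrative choice inferred from the later comparison against k = 5.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: sklearn's iris data stands in for pydataset's data('iris')
X, y = load_iris(return_X_y=True, as_frame=True)
X = X.iloc[:, :2]  # keep only sepal length and sepal width

# Same two-step split as in the Prepare section
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, stratify=y, train_size=0.8, random_state=1349)
X_train, X_validate, y_train, y_validate = train_test_split(
    X_rest, y_rest, stratify=y_rest, train_size=0.7, random_state=1349)

# Create the KNN object (k = 1 is an illustrative choice)
knn = KNeighborsClassifier(n_neighbors=1)

# Fit: KNN is lazy, so this stores the training data (and may build a search index)
knn.fit(X_train, y_train)

# Make predictions: one class label per training observation
y_pred = knn.predict(X_train)

# Estimate probability: the share of neighbors voting for each class
y_pred_proba = knn.predict_proba(X_train)

print(y_pred[:5])
print(y_pred_proba[:5])
```

With uniform weights, each probability is simply the fraction of the k neighbors that belong to that class, so with k = 1 every row is all zeros and a single one.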

## Evaluate Model

#### Compute the Accuracy

#### Create a Classification Report
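A possible sketch of the evaluation steps, under the same assumptions as elsewhere (scikit-learn's bundled iris data in place of pydataset, and k = 1 as an illustrative value):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: sklearn's iris data stands in for pydataset's data('iris')
X, y = load_iris(return_X_y=True, as_frame=True)
X = X.iloc[:, :2]  # sepal length and sepal width only

X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, stratify=y, train_size=0.8, random_state=1349)
X_train, X_validate, y_train, y_validate = train_test_split(
    X_rest, y_rest, stratify=y_rest, train_size=0.7, random_state=1349)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
y_pred = knn.predict(X_train)

# Accuracy: the share of predictions that match the actual labels
accuracy = knn.score(X_train, y_train)
print(f"Accuracy of KNN classifier on training set: {accuracy:.2f}")

# Confusion matrix: rows are actual classes, columns are predicted classes
print(confusion_matrix(y_train, y_pred))

# Precision, recall, and f1-score per class
print(classification_report(y_train, y_pred))
```

Scores computed on the training split tend to be optimistic for small k, so the same calls should be repeated on `X_validate` and `y_validate` before drawing conclusions.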

## Let's Do It Again

In [None]:
# Create KNN Object

# Fit object to training data

# Make predictions on training data


How does the boundary map for **k = 5** compare to **k = 1**?
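One way to explore this question numerically is to predict a class for every cell of a grid over the two sepal features and see where the two models disagree. This is only a sketch: it assumes scikit-learn's bundled iris data and, for brevity, fits on all rows rather than the training split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Assumption: sklearn's iris data, first two columns = sepal length and width
X, y = load_iris(return_X_y=True)
X = X[:, :2]

# Grid covering the observed range of both features
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min(), X[:, 0].max(), 100),
    np.linspace(X[:, 1].min(), X[:, 1].max(), 100))
grid = np.c_[xx.ravel(), yy.ravel()]

# Predicted class for every grid cell, for each value of k
regions = {}
for k in (1, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    regions[k] = knn.predict(grid).reshape(xx.shape)

disagree = (regions[1] != regions[5]).mean()
print(f"Share of grid cells where k=1 and k=5 disagree: {disagree:.1%}")
```

Passing `xx`, `yy`, and `regions[k]` to matplotlib's `plt.contourf` would draw the boundary map for a given k. In general, a larger k averages over more neighbors and tends to produce smoother boundaries.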

In [None]:
# Evaluate the new model
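A sketch of what evaluating the second model next to the first could look like. It assumes scikit-learn's bundled iris data in place of pydataset, and that the first model used k = 1 and the new one k = 5.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: sklearn's iris data stands in for pydataset's data('iris')
X, y = load_iris(return_X_y=True, as_frame=True)
X = X.iloc[:, :2]  # sepal length and sepal width only

X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, stratify=y, train_size=0.8, random_state=1349)
X_train, X_validate, y_train, y_validate = train_test_split(
    X_rest, y_rest, stratify=y_rest, train_size=0.7, random_state=1349)

# Accuracy on train and validate for each model
results = {}
for k in (1, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    results[k] = {"train": knn.score(X_train, y_train),
                  "validate": knn.score(X_validate, y_validate)}
    print(f"k={k}: train={results[k]['train']:.2f}, "
          f"validate={results[k]['validate']:.2f}")
```

A large gap between train and validate accuracy is a sign of overfitting, which is more likely at small k.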



## Finding the Best Value for k
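A common approach is to loop over candidate values of k and compare train and validate accuracy for each. The sketch below assumes scikit-learn's bundled iris data in place of pydataset; the range 1 to 20 and the tie-breaking rule (prefer the smaller k) are arbitrary illustrative choices.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: sklearn's iris data stands in for pydataset's data('iris')
X, y = load_iris(return_X_y=True, as_frame=True)
X = X.iloc[:, :2]  # sepal length and sepal width only

X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, stratify=y, train_size=0.8, random_state=1349)
X_train, X_validate, y_train, y_validate = train_test_split(
    X_rest, y_rest, stratify=y_rest, train_size=0.7, random_state=1349)

# Fit one model per candidate k and record its accuracy on both splits
rows = []
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    rows.append({"k": k,
                 "train_accuracy": knn.score(X_train, y_train),
                 "validate_accuracy": knn.score(X_validate, y_validate)})

results = pd.DataFrame(rows)
results["difference"] = results["train_accuracy"] - results["validate_accuracy"]
print(results)

# Pick the k with the highest validate accuracy (ties go to the smaller k)
best = results.sort_values(["validate_accuracy", "k"],
                           ascending=[False, True]).iloc[0]
best_k = int(best["k"])
print("Best k on validate:", best_k)
```

The test split stays untouched during this search and is used only once, on the final chosen model.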

## Next Steps
- We only used sepal_length and sepal_width. We can try new models with different and/or additional features. 


- There are other hyperparameters we can tweak
    - 'weights': Uniform is the default (all points are treated equally), but we can switch to a 'distance' approach where nearer neighbors are given more weight in the voting process
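    - 'algorithm': How the nearest neighbors are searched for ('auto', 'ball_tree', 'kd_tree', or 'brute'). Tree-based searches can save on computational cost for large datasets without changing what is being computed.
    - 'metric': There is more than one way to measure distance. The default is Minkowski with p=2 (Euclidean distance); 'manhattan' is one alternative.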


- There are very similar models that we can try, such as RadiusNeighborsClassifier, which votes among all neighbors within a fixed radius instead of the k nearest
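The next steps above can be sketched as a quick side-by-side comparison. Everything here is illustrative: the data is scikit-learn's bundled iris set (an assumption, in place of pydataset), and the specific values (`n_neighbors=5`, `radius=0.5`) are arbitrary starting points rather than recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

# Assumption: sklearn's iris data stands in for pydataset's data('iris')
X, y = load_iris(return_X_y=True, as_frame=True)
X = X.iloc[:, :2]  # sepal length and sepal width only

X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, stratify=y, train_size=0.8, random_state=1349)
X_train, X_validate, y_train, y_validate = train_test_split(
    X_rest, y_rest, stratify=y_rest, train_size=0.7, random_state=1349)

candidates = {
    "uniform weights (default)": KNeighborsClassifier(n_neighbors=5),
    "distance weights": KNeighborsClassifier(n_neighbors=5, weights="distance"),
    "manhattan metric": KNeighborsClassifier(n_neighbors=5, metric="manhattan"),
    "kd_tree search": KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree"),
    # Votes among all points within the radius; points with no neighbors
    # in range get the most frequent training label
    "radius neighbors": RadiusNeighborsClassifier(
        radius=0.5, outlier_label="most_frequent"),
}

# Fit each candidate on train and score it on validate
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_validate, y_validate)
    print(f"{name}: validate accuracy = {scores[name]:.2f}")
```

Changing `algorithm` affects how fast neighbors are found, not which neighbors are found, so it matters for runtime rather than accuracy.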