Lab: k-nearest neighbors (k-NN) with scikit-learn
===

By The End Of This Session You Should Be Able To:
----

- Visually interpret data
- Fit a k-NN model with scikit-learn
- Explain how the choice of features affects modeling
- Explain how changing the number of neighbors affects modeling

In [None]:
%reset -fs

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn

import warnings
warnings.filterwarnings('ignore')

palette = "Dark2"
%matplotlib inline

Iris Data
-----

<center><img src="http://s5047.pcdn.co/wp-content/uploads/2015/04/iris_petal_sepal.png" width="75%"/></center>

In [None]:
# Load data
iris_sns = sns.load_dataset("iris")
iris_sns.head(n=3)

In [None]:
# Always a good habit to manually inspect the tail
iris_sns.tail(n=3)

In [None]:
# Pretty plot
sns.pairplot(iris_sns, hue='species', height=2.5, palette=palette);

Which two dimensions would provide the __best__ separation between the three classes?

<details><summary>
Click here for the solution…
</summary>
petal_length and petal_width
</details>

Which two dimensions would provide the __worst__ separation between the three classes?

<details><summary>
Click here for the solution…
</summary>
sepal_length and sepal_width
</details>

-----
Modeling
-----

We are going to take a model comparison approach in this class.

First, we are going to fit a baseline model. Then we will try to improve on it from there. We might also make things worse.

In [None]:
# Load data for modeling
from sklearn.datasets import load_iris

iris = load_iris()

In [None]:
# scikit-learn datasets are Bunch objects, which behave like dicts
iris.keys()

In [None]:
print(iris.DESCR)

In [None]:
def select_two_worst_features(iris: sklearn.utils.Bunch) -> np.ndarray:
    "Select the two WORST features/columns from iris and return as numpy.ndarray"
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
"""
1 point
Test code for the 'select_two_worst_features' function. 
This cell should NOT give any errors when it is run.
"""

assert select_two_worst_features(iris).shape == (150, 2)
assert list(select_two_worst_features(iris)[0]) == [5.1, 3.5]
assert list(select_two_worst_features(iris)[-1]) == [5.9, 3. ]

Let's check out the documentation for [KNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).
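
Before using the library class, it may help to see the idea behind k-NN in a few lines. The sketch below is a minimal from-scratch version, assuming Euclidean distance and an unweighted majority vote (the defaults of `KNeighborsClassifier`). The function name `knn_predict_one` and the toy data are hypothetical and not part of this lab.

```python
import numpy as np


def knn_predict_one(X_train: np.ndarray, y_train: np.ndarray,
                    x_new: np.ndarray, n_neighbors: int = 3) -> int:
    """Predict the label of x_new by majority vote among its nearest neighbors."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the n_neighbors closest training points
    nearest = np.argsort(distances)[:n_neighbors]
    # Unweighted majority vote; a tie goes to the smallest label
    votes = np.bincount(y_train[nearest])
    return int(votes.argmax())


# Hypothetical toy data: two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict_one(X_toy, y_toy, np.array([0.3, 0.3])))  # query near the first cluster
print(knn_predict_one(X_toy, y_toy, np.array([4.8, 5.0])))  # query near the second cluster
```

There is no training step beyond storing `X_train` and `y_train`; all of the work happens at prediction time.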

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# Create an instance of the model from the class
knn = KNeighborsClassifier(n_neighbors=3)

In [None]:
# Define the input data and labels
X = select_two_worst_features(iris)
y = iris.target

# Fit the model
knn.fit(X, y)

In [None]:
# Predict a within-sample data point
i = 100  # Index could also be a random integer
y_predicted = knn.predict(X[i].reshape(1, -1))
y_predicted

In [None]:
y_actual = y[i]
y_actual

In [None]:
# Pretty plot the selected data
scatter = sns.scatterplot(x=X[:, 0],
                          y=X[:, 1],
                          hue=y,  # 'hue' color-codes each class
                          palette=palette)

# Mark the selected data point with a red star
scatter.plot(X[i, 0],
             X[i, 1],
             color='red',
             marker='*');

You have just fit a model with scikit-learn 💥
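
To see which training points drove a prediction like the one above, the fitted model's `kneighbors` method can be queried directly. The sketch below is self-contained and, as an assumption made only for illustration, uses all four iris features as a stand-in for the `X` selected earlier.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X = iris.data    # stand-in input: all four features (an assumption for this sketch)
y = iris.target

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

i = 100
query = X[i].reshape(1, -1)

# kneighbors returns the distances to, and row indices of, the nearest training points
distances, indices = knn.kneighbors(query)
neighbor_labels = y[indices[0]]

print("distances:      ", distances[0])
print("neighbor labels:", iris.target_names[neighbor_labels])

# Unweighted majority vote over the neighbor labels
vote = np.bincount(neighbor_labels).argmax()
print("vote:           ", iris.target_names[vote])
```

Because the query point is itself in the training data, its closest neighbor sits at distance zero.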

In [None]:
# Predict an out-of-sample data point
noise = -.35  # Noise could also be a random float
new_data = X[i].reshape(1, -1) + noise
y_predicted = knn.predict(new_data)
y_predicted

In [None]:
# Pretty plot the selected data
scatter = sns.scatterplot(x=X[:, 0],
                          y=X[:, 1],
                          hue=y,  # 'hue' color-codes each class
                          palette=palette)

# Mark the new data point with a red star
scatter.plot(new_data[0][0],
             new_data[0][1],
             color='red',
             marker='*');

In [None]:
# Predict the response for every within-sample data point
# NOTE: We are not doing a train/test split here. k-NN simply memorizes the observed data,
# so the accuracy below is in-sample and tends to overstate performance on new data.
y_predicted = knn.predict(X)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy = accuracy_score(y, y_predicted)  # Argument order is (y_true, y_pred)
print(f'Model accuracy {accuracy:.2%}')

Why do you think the accuracy is so low? (ungraded)
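
One way to start investigating is to look at which classes the model confuses with each other. The sketch below uses `confusion_matrix` from scikit-learn. The helper name `in_sample_confusion` is hypothetical, and the demo call passes all four features only so the block runs on its own; in this notebook you would pass the two-feature `X` from above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier


def in_sample_confusion(X: np.ndarray, y: np.ndarray, n_neighbors: int = 3) -> np.ndarray:
    """Fit k-NN on (X, y) and return the confusion matrix of its in-sample predictions."""
    knn = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y)
    # Rows are actual classes, columns are predicted classes
    return confusion_matrix(y, knn.predict(X))


iris = load_iris()
cm = in_sample_confusion(iris.data, iris.target)  # stand-in input: all four features
print(iris.target_names)
print(cm)
```

Off-diagonal entries count the misclassified points, so large values there show which pair of species overlaps in the chosen features.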

In [None]:
def fit_knn(X: np.ndarray, y: np.ndarray, n_neighbors: int = 3) -> float:
    "Fit a KNN model, returning accuracy. Use the code above as an example."
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
"""
1 point
Test code for the 'fit_knn' function with 2 worst features.
This cell should NOT give any errors when it is run.
"""

assert round(fit_knn(X, y, n_neighbors=3), 4) == .8533

In [None]:
def select_two_best_features(iris: sklearn.utils.Bunch) -> np.array:
    "Select the two BEST features/columns from iris and return as numpy.ndarray"
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
"""
1 point
Test code for the 'select_two_best_features' function.
This cell should NOT give any errors when it is run.
"""

assert select_two_best_features(iris).shape == (150, 2)
assert list(select_two_best_features(iris)[0]) == [1.4, 0.2]
assert list(select_two_best_features(iris)[-1]) == [5.1, 1.8]

In [None]:
X = select_two_best_features(iris)

In [None]:
"""
1 point
Test code for the 'fit_knn' function with 2 best features.
This cell should NOT give any errors when it is run.
"""

assert round(fit_knn(X, y, n_neighbors=3), 4) == 0.98

In [None]:
X = iris.data  # Use all features

In [None]:
"""
1 point
Test code for the 'fit_knn' function with all features.
This cell should NOT give any errors when it is run.
"""

assert round(fit_knn(X, y, n_neighbors=3), 4) == 0.96

Why do you think the accuracy goes down when we have more features? (ungraded)

-----
Explore how the number of neighbors impacts modeling
-----

In [None]:
X = iris.data  # Use all features

print(f'{"# of Neighbors"} | {"Model accuracy":>8}')
print('-'*35)
for n_neighbors in range(1, 16):
    accuracy = fit_knn(X, y, n_neighbors)
    print(f'{n_neighbors:^14} | {accuracy:>10.2%}')

Why is accuracy 100% with 1 neighbor? (ungraded)

Why does accuracy go down and then back up as the number of neighbors increases? (ungraded)
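
One way to probe these questions is to compare accuracy on the data the model was fit on with accuracy on held-out data. The sketch below assumes a 70/30 stratified split with `random_state=0`; neither choice comes from this lab, and the exact numbers will shift with a different split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Assumed split: 70% train / 30% test, stratified by class, fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, stratify=iris.target, random_state=0)

results = []
print(f'{"# of Neighbors"} | {"Train":>8} | {"Test":>8}')
print('-'*36)
for n_neighbors in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train, y_train)
    train_acc = knn.score(X_train, y_train)  # accuracy on data the model has seen
    test_acc = knn.score(X_test, y_test)     # accuracy on data the model has not seen
    results.append((n_neighbors, train_acc, test_acc))
    print(f'{n_neighbors:^14} | {train_acc:>8.2%} | {test_acc:>8.2%}')
```

With one neighbor, every training point is its own nearest neighbor, so the "Train" column starts at 100% regardless of how well the model would handle new flowers.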

<br>
<br> 
<br>

----