<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

#  k-Nearest Neighbors with `scikit-learn`

<a id="learning-objectives"></a>
## Learning Objectives

1. Introduce the kNN classification model
2. Utilize the kNN model on the iris data set
3. Implement scikit-learn's kNN model
4. Assess the fit of a kNN Model using scikit-learn

<a id="home"></a>

### Lesson Guide
- [a) Loading the Iris Data Set](#overview-of-the-iris-dataset)
	- [i) Terminology](#terminology)
- [b) Guided Practice: "Human Learning" With Iris Data](#exercise-human-learning-with-iris-data)
- [c) Human Learning on the Iris Data Set](#human-learning-on-the-iris-dataset)
- [d) Guided Intro to KNN: NBA Position KNN Classifier](#knn-classification-nba)
	- [i) Using the Train/Test Split Procedure (K=1)](#using-the-traintest-split-procedure-k)
	- [ii) Comparing Testing Accuracy With Null Accuracy (the baseline)](#null-accuracy)
- [e) Tuning a KNN Model](#tuning-a-knn-model)
	- [i) What Happens If We View the Accuracy of our Training Data?](#what-happen-if-we-view-the-accuracy-of-our-training-data)
	- [ii) Training Error Versus Testing Error](#training-error-versus-testing-error)
- [f) Standardizing Features](#standardizing-features)
	- [i) Use `StandardScaler` to Standardize our Data](#use-standardscaler-to-standardize-our-data)
- [g) Comparing KNN With Other Models](#comparing-knn-with-other-models)

In this lesson, we will get an intuitive and practical feel for the **k-Nearest Neighbors** model. kNN is a **non-parametric model**, which means that the model is not represented as an equation with parameters (e.g. the $\beta$ values in linear regression).

First, we will make a model by hand to classify iris flower data. Next, we will build a model automatically using kNN.

> You may have heard of the clustering algorithm **k-Means Clustering**. These techniques have nothing in common, aside from both having a parameter k!

<img src="assets/iris_overview.png" style="width: 700px;">


The Iris dataset is a famous machine learning dataset, created in the 1930s by R. Fisher. It is complete (no missing values) and small (150 rows), which makes it a popular choice for learning about classification models.

The dataset comes from UCI (University of California Irvine), which is itself a great source of datasets:

https://archive.ics.uci.edu/ml/datasets/iris
<br />
https://archive.ics.uci.edu/ml/index.php

#### [Home](#home)

<a id="overview-of-the-iris-dataset"></a>
# <font style = 'color:blue'>a) Loading the Iris Data Set</font>
---

#### Read the iris data into a pandas DataFrame, including column names.

In [2]:
# Read the iris data into a DataFrame.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Display plots in-notebook
%matplotlib inline

# Increase default figure and font sizes for easier viewing.
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

data = './iris.data'
iris = pd.read_csv(data)

In [3]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [4]:
iris.head(10)

|     | sepal_length | sepal_width | petal_length | petal_width | species |
| --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |


<a id="terminology"></a>
### Terminology

- **150 observations** (n=150): Each observation is one iris flower.
- **Four features** (p=4): sepal length, sepal width, petal length, and petal width.
- **Response**: One of three possible iris species (setosa, versicolor, or virginica)
- **Classification problem** because response is categorical (i.e. a discrete value).

#### [Home](#home)

<a id="exercise-human-learning-with-iris-data"></a>
# <font style = 'color:blue'>b) Guided Practice: "Human Learning" With Iris Data</font>

**Question:** Can we predict the species of an iris using petal and sepal measurements? Together, we will:

1. Read the iris data into a pandas DataFrame, including column names.
2. Gather some basic information about the data.
3. Use sorting, split-apply-combine, and visualization to look for differences between species.
4. Write down a set of rules that could be used to predict species based on iris measurements.

**BONUS:** Define a function that accepts a row of data and returns a predicted species. Then, use that function to make predictions for all existing rows of data and check the accuracy of your predictions.

#### Gather some basic information about the data.

In [None]:
# 150 observations, 5 columns (the 4 features & response)
iris.shape

In [None]:
iris.dtypes

In [None]:
# Verify the basic stats look appropriate
iris.describe()

In [None]:
# Test for imbalanced classes
iris.species.value_counts()

In [None]:
# Verify we are not missing any data
iris.isnull().sum()

#### Use sorting, split-apply-combine, and/or visualization to look for differences between species.

In [None]:
iris.head()

In [None]:
# Sort the DataFrame by petal_width.
iris.sort_values(by='petal_width', ascending=True, inplace=True)
iris.head(30)

In [None]:
# Sort the DataFrame by petal_width and display the NumPy array.
iris.sort_values(by='petal_width', ascending=True).values[0:5]

#### Split-apply-combine: Explore the data while using a `groupby` on `'species'`.

In [None]:
# Mean of sepal_length, grouped by species.
iris.groupby('species').sepal_length.mean()

In [None]:
# Mean of all numeric columns, grouped by species.
iris.groupby('species').mean()

In [None]:
# describe() of all numeric columns, grouped by species.
iris.groupby('species').describe()

In [None]:
# Box plot of petal_width, grouped by species.
iris.boxplot(column='petal_width', by='species', figsize=(8,8));

In [None]:
# Box plot of all numeric columns, grouped by species.
iris.boxplot(by='species', rot=45, figsize=(8,8));

In [None]:
# Map species to a numeric value so that plots can be colored by species.
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})

# Alternative method (codes follow the order of first appearance, so they can differ from the explicit map above):
iris['species_num'] = iris.species.factorize()[0]

In [None]:
iris.head()

In [None]:
# Scatterplot of petal_length vs. petal_width, colored by species
iris.plot(kind='scatter', x='petal_length', y='petal_width', c='species_num', colormap='brg');

In [None]:
# Scatter matrix of all features, colored by species.
pd.plotting.scatter_matrix(iris.drop('species_num', axis=1), c=iris.species_num, figsize=(12, 10));

#### <font style='color:green'>Exercise: Using the graphs and data above, can you write down a set of rules that can accurately predict species based on iris measurements?</font>

In [None]:
# Feel free to do more analysis if needed to make good rules!

#### Bonus: If you have time during the class break or after class, try to implement these rules to make your own classifier!

Write a function that accepts a row of data and returns a predicted species. Then, use that function to make predictions for all existing rows of data and check the accuracy of your predictions.
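
One possible shape for such a function is sketched below. The thresholds (2.5 for petal length, 1.75 for petal width) are illustrative assumptions of the kind you might read off the plots above, not the lesson's official solution. The sketch also loads iris from `sklearn.datasets.load_iris` into a separate, hypothetical `iris_demo` DataFrame so that it runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Build a DataFrame shaped like the lesson's `iris` (column names are assumed to match).
raw = load_iris()
iris_demo = pd.DataFrame(
    raw.data,
    columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
)
iris_demo['species'] = ['Iris-' + raw.target_names[i] for i in raw.target]


def classify_iris(row):
    """Predict a species from hand-written rules (thresholds are illustrative guesses)."""
    if row['petal_length'] < 2.5:
        return 'Iris-setosa'
    if row['petal_width'] < 1.75:
        return 'Iris-versicolor'
    return 'Iris-virginica'


# Apply the rules to every row and measure how often they agree with the label.
iris_demo['prediction'] = iris_demo.apply(classify_iris, axis=1)
rule_accuracy = (iris_demo['prediction'] == iris_demo['species']).mean()
print(rule_accuracy)
```

Because the rules are checked against the same rows used to design them, this accuracy is likely to be optimistic compared with performance on unseen flowers.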

#### [Home](#home)

<a id="human-learning-on-the-iris-dataset"></a>
# <font style = 'color:blue'>c) Human Learning on the Iris Data Set</font>
---

How did we (as humans) predict the species of an iris?

1. We observed that the different species had (somewhat) dissimilar measurements.
2. We focused on features that seemed to correlate with the response.
3. We created a set of rules (using those features) to predict the species of an unknown iris.

We assumed that if an **unknown iris** had measurements similar to **previous irises**, then its species was most likely the same as those previous irises.
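
This "similar measurements imply the same species" assumption is the core idea behind kNN. A minimal from-scratch sketch follows. The function name `knn_predict` and the toy points are hypothetical, and the sketch assumes Euclidean distance with a simple majority vote.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)

    # Euclidean distance from x_new to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))

    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]

    # Most common label among those neighbors.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two features, two classes.
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_toy = ['small', 'small', 'small', 'large', 'large', 'large']

print(knn_predict(X_toy, y_toy, [1.1, 1.0], k=3))
print(knn_predict(X_toy, y_toy, [5.1, 5.0], k=3))
```

Note that there is no training step beyond storing the data: all of the work happens at prediction time.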

In [None]:
# Allow plots to appear in the notebook.
%matplotlib inline
import matplotlib.pyplot as plt

# Increase default figure and font sizes for easier viewing.
plt.rcParams['figure.figsize'] = (10, 8)
plt.rcParams['font.size'] = 14

# Create a custom color map.
from matplotlib.colors import ListedColormap
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

In [None]:
# Map each iris species to a number.
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})

In [None]:
# Box plot of all numeric columns, grouped by species.
iris.drop('species_num', axis=1).boxplot(by='species', rot=45);

In [None]:
# Create a scatterplot of PETAL LENGTH versus PETAL WIDTH and color by SPECIES.
iris.plot(kind='scatter', x='petal_length', y='petal_width', c='species_num', colormap=cmap_bold);

In [None]:
# This conversion relies on a 'prediction' column, which is created in the Bonus section of the Exercise in b) above.  Look at the Solution file if needed to get this code.
iris['pred_num'] = iris.prediction.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})

# Create a scatter plot of PETAL LENGTH versus PETAL WIDTH and color by PREDICTION.
iris.plot(kind='scatter', x='petal_length', y='petal_width', c='pred_num', colormap=cmap_bold);

---

#### [Home](#home)

<a id="knn-classification-nba"></a>
# <font style = 'color:blue'>d) Guided Intro to KNN: NBA Position KNN Classifier</font>

For the rest of the lesson, we will be using a dataset containing the 2015 season statistics for ~500 NBA players. This dataset leads to a nice choice of K, as we'll see below. The target we are trying to predict is a player's position (`pos`). The target and the feature columns we'll use are:


| Column | Meaning |
| ---    | ---     |
| pos | C: Center, F: Forward, G: Guard |
| ast | Assists per game | 
| stl | Steals per game | 
| blk | Blocks per game |
| tov | Turnovers per game | 
| pf  | Personal fouls per game | 

For information about the other columns, see [this glossary](https://www.basketball-reference.com/about/glossary.html).

In [None]:
# Read the NBA data into a DataFrame.
import pandas as pd

path = 'data/NBA_players_2015.csv'
nba = pd.read_csv(path, index_col=0)

In [None]:
nba.info()

In [None]:
nba.head()

In [None]:
# Map positions to numbers
nba['pos_num'] = nba.pos.map({'C':0, 'F':1, 'G':2})

In [None]:
# Create feature matrix (X).
feature_cols = ['ast', 'stl', 'blk', 'tov', 'pf']
X = nba[feature_cols]

In [None]:
# Create response vector (y).
y = nba.pos_num

<a id="using-the-traintest-split-procedure-k"></a>
### Using the Train/Test Split Procedure (k=1)

#### Step 1: Import the model class we will use

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

#### Step 2: Split X and y into training and testing sets (using `random_state` for reproducibility).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)

#### Step 3: Train the model on the training set (using k=1).

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

#### Step 4: Test the model on the testing set and check the accuracy.

In [None]:
y_pred_class = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_class))

<font style='color:green'>**Question:** If we had trained on the entire dataset and tested on the entire dataset, using 1-KNN what accuracy would we get? Why?</font>

#### Repeating for k=50.

In [None]:
knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(X_train, y_train)
y_pred_class = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_class))

<font style='color:green'>**Question:** Suppose we again train and test on the *entire* data set (not splitting it as above), but using 50-KNN. Would we expect the accuracy to be the same as compared to 1-KNN?  Why?</font>

<a id="null-accuracy"></a>
### Comparing Testing Accuracy With Null Accuracy (the baseline)

In classification tasks, null accuracy is the accuracy that can be achieved by **always predicting the most frequent class**. For example, if Center is the most common position, we would always predict Center.

The null accuracy is a benchmark against which you may want to measure every classification model. It is our baseline.

#### Examine the class distribution from the training set.

Remember that we are comparing KNN to this simpler model. So, we must find the most frequent class **of the training set**.

In [None]:
most_freq_class = y_train.value_counts().index[0]

print(y_train.value_counts())
most_freq_class

#### Compute the null accuracy / baseline.

In [None]:
y_test.value_counts()[most_freq_class] / len(y_test)
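
As a cross-check, scikit-learn's `DummyClassifier` with `strategy='most_frequent'` implements the same baseline. The sketch below uses small made-up label arrays (the `*_toy` names are hypothetical) so that it runs on its own.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels for illustration: class 2 is the most frequent in training.
y_train_toy = np.array([2, 2, 2, 1, 1, 0])
y_test_toy = np.array([2, 1, 2, 0])

# The features are ignored by this strategy, so zeros are enough.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_toy, y_train_toy)

# Accuracy of always predicting the training set's most frequent class.
null_accuracy = baseline.score(X_test_toy, y_test_toy)
print(null_accuracy)
```

With the lesson's data, the same pattern would presumably be `baseline.fit(X_train, y_train)` followed by `baseline.score(X_test, y_test)`.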

#### [Home](#home)

<a id="tuning-a-knn-model"></a>
# <font style = 'color:blue'>e) Tuning a KNN Model</font>
---

In [None]:
# Instantiate the model (using the value K=5).
knn = KNeighborsClassifier(n_neighbors=5)

# Fit the model with data.
knn.fit(X, y)

# Store the predicted response values.
y_pred_class = knn.predict(X)

print(metrics.accuracy_score(y, y_pred_class))

In [None]:
# Calculate predicted probabilities of class membership.
# Each row sums to one and contains the probabilities of the point being a 0-Center, 1-Forward, 2-Guard.
knn.predict_proba(X)
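
To see where these probabilities come from: with the default `weights='uniform'`, each value is simply the fraction of the K nearest neighbors that belong to that class. The toy data below is made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up one-feature data for illustration.
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
y_toy = np.array([0, 0, 1, 1, 1])

knn_toy = KNeighborsClassifier(n_neighbors=3)  # default weights='uniform'
knn_toy.fit(X_toy, y_toy)

# The 3 nearest neighbors of 1.4 are the points at 1.0, 2.0 and 0.0,
# with labels 0, 1 and 0, so two of three votes go to class 0.
proba = knn_toy.predict_proba(np.array([[1.4]]))
print(proba)
```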

In the next class, we'll discuss **model evaluation procedures**, which allow us to use our existing labeled data to estimate how well our models are likely to perform on out-of-sample data (spoiler alert: we'll be revisiting Type I and Type II errors). These procedures will help us to tune our models and choose between different types of models.


<a id="what-happen-if-we-view-the-accuracy-of-our-training-data"></a>
### What Happens If We View the Accuracy of our Training Data?

In [None]:
scores = []
for k in range(1,100):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    # Score on the same rows the model was fit on, so this is training accuracy.
    pred = knn.predict(X_train)
    score = (pred == y_train).mean()
    scores.append([k, score])

In [None]:
data = pd.DataFrame(scores,columns=['k','score'])
data.plot.line(x='k',y='score');

<font style='color:green'>**Question:** As K increases, why does the accuracy fall?</font>

#### Search for the 'best' value of K.

In [None]:
# Calculate TRAINING ERROR and TESTING ERROR for K=1 through 100.

k_range = list(range(1, 101))
training_error = []
testing_error = []

# Find test accuracy for all values of K between 1 and 100 (inclusive).
for k in k_range:

    # Instantiate the model with the current K value.
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    
    # Calculate training error (error = 1 - accuracy) on the rows used for fitting.
    y_pred_class = knn.predict(X_train)
    training_accuracy = metrics.accuracy_score(y_train, y_pred_class)
    training_error.append(1 - training_accuracy)
    
    # Calculate testing error.
    y_pred_class = knn.predict(X_test)
    testing_accuracy = metrics.accuracy_score(y_test, y_pred_class)
    testing_error.append(1 - testing_accuracy)

In [None]:
# Allow plots to appear in the notebook.
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

In [None]:
# Create a DataFrame of K, training error, and testing error.
column_dict = {'K': k_range, 'training error':training_error, 'testing error':testing_error}
df = pd.DataFrame(column_dict).set_index('K').sort_index(ascending=False)
df.head()

In [None]:
# Plot the relationship between K (HIGH TO LOW) and TESTING ERROR.
df.plot(y='testing error');
plt.xlabel('Value of K for KNN');
plt.ylabel('Error (lower is better)');

In [None]:
# Find the minimum testing error and the associated K value.
df.sort_values('testing error').head()

In [None]:
# Alternative method:
min(list(zip(testing_error, k_range)))

<a id="training-error-versus-testing-error"></a>
### Training Error Versus Testing Error

In [None]:
# Plot the relationship between K (HIGH TO LOW) and both TRAINING ERROR and TESTING ERROR.
df.plot();
plt.xlabel('Value of K for KNN');
plt.ylabel('Error (lower is better)');

- **Training error** decreases as model complexity increases (the lower the value of K, the higher the complexity).
- **Testing error** is minimized at the optimum model complexity.

Evaluating the training and testing error is important. For example:

- If the training error is much lower than the test error, then our model is likely overfitting. 
- If the test error starts increasing as we vary a hyperparameter, we may be overfitting.
- If both training and testing error are high and stop improving (plateau), our model is likely underfitting (not complex enough).

#### Making Predictions on Out-of-Sample Data

Given the statistics of a (truly) unknown NBA player, how do we predict his position?

In [None]:
import numpy as np

# Instantiate the model with the best-known parameters.
knn = KNeighborsClassifier(n_neighbors=14)

# Re-train the model with X and y (not X_train and y_train). Why?
knn.fit(X, y)

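# Make a prediction for an out-of-sample observation.
# Passing a DataFrame with the same column names avoids scikit-learn's feature-name warning.
new_player = pd.DataFrame([[2, 1, 0, 1, 2]], columns=feature_cols)
knn.predict(new_player)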

What could we conclude?

- When using KNN on this data set with these features, the **best value for K** is likely to be around 14.
- Given the statistics of an **unknown player**, we estimate that we would be able to correctly predict his position about 74% of the time.

#### [Home](#home)

<a id="standardizing-features"></a>
# <font style = 'color:blue'>f) Standardizing Features</font>
---

There is one major issue that applies to many machine learning models: they are sensitive to the scales of features (feature scale). 

> KNN in particular is sensitive to feature scale because it (by default) uses the Euclidean distance metric. To determine closeness, Euclidean distance sums the squared differences along each axis. So, if one axis has large differences and another has small differences, the former axis will contribute much more to the distance than the latter axis.

This means that it matters whether our features are centered around zero and have similar variance to each other.
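
A small worked example makes the effect concrete. The two "players" and their feature values below are hypothetical, chosen only so that one feature is on a scale of thousands and the other on a scale of units.

```python
import numpy as np

# Two hypothetical players described by (feature_a, feature_b).
# feature_a is on a scale of thousands; feature_b is on a scale of units.
player_1 = np.array([3000.0, 1.0])
player_2 = np.array([3100.0, 9.0])

squared_diffs = (player_1 - player_2) ** 2
distance = np.sqrt(squared_diffs.sum())

# Share of the squared distance contributed by each feature.
share = squared_diffs / squared_diffs.sum()
print(distance)
print(share)
```

Even though `feature_b` differs by a factor of nine between the two players, almost all of the distance comes from `feature_a`, purely because of its units.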

Unfortunately, most data does not naturally have a mean of zero and a shared variance. Other models are sensitive to scale as well; even linear regression is affected once you add regularization.

Fortunately, this is an easy fix.

<a id="use-standardscaler-to-standardize-our-data"></a>
### Use `StandardScaler` to Standardize our Data

`StandardScaler` standardizes our data by subtracting each feature's mean and dividing by its standard deviation, so each feature ends up with a mean of 0 and a standard deviation of 1. It can be used with regression models as well as classification models.
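
The sketch below checks this description against a manual calculation. The matrix `X_toy` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: two columns on very different scales.
X_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]])

# Manual standardization: subtract each column's mean, divide by its standard deviation.
# np.std defaults to the population standard deviation (ddof=0), which is what StandardScaler uses.
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

X_scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(X_manual, X_scaled))
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```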

#### Separate feature matrix and response for scikit-learn.

In [None]:
# Create feature matrix (X).
feature_cols = ['ast', 'stl', 'blk', 'tov', 'pf']

X = nba[feature_cols]
y = nba.pos_num  # Create response vector (y).

#### Create the train/test split.

Notice that we create the train/test split first. If we standardized before splitting, the test rows would influence the mean and standard deviation used for scaling, which leaks information about the testing data into training.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)

#### Instantiate and fit `StandardScaler`.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#### Fit a KNN model and look at the testing error.
Can you find a number of neighbors that improves our results from before?

In [None]:
# Calculate testing error.
knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(X_train, y_train)

y_pred_class = knn.predict(X_test)
testing_accuracy = metrics.accuracy_score(y_test, y_pred_class)
testing_error = 1 - testing_accuracy

print(testing_error)
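
An alternative worth knowing is to bundle the scaler and the model with `make_pipeline`, so that the scaler is always fit on the training data only. This is a sketch under stated assumptions: iris (via `load_iris`) stands in for the NBA data so the example runs on its own, and the `*_demo` names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Iris stands in for the NBA data here so that the sketch runs on its own.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=99)

# The pipeline fits the scaler on the training data only, then applies it before KNN.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=11))
pipe.fit(X_tr, y_tr)

# score() scales X_te with the training statistics before predicting.
testing_error_demo = 1 - pipe.score(X_te, y_te)
print(testing_error_demo)
```

The design benefit is that the "split first, then scale" rule is enforced by the object itself rather than by the order of notebook cells.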

#### [Home](#home)

<a id="comparing-knn-with-other-models"></a>
# <font style = 'color:blue'>g) Comparing KNN With Other Models</font>
---

**Advantages of KNN:**

- It's **simple to understand** and explain.
- **Model training is fast**.
- It can be used for classification and regression (for regression, take the average value of the K nearest points!).
- **Being a non-parametric method, it is often successful in classification situations where the decision boundary is very irregular.**
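
The regression use mentioned above can be sketched with scikit-learn's `KNeighborsRegressor`. The toy data is made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up one-feature data for illustration.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.0, 10.0, 20.0, 30.0])

reg = KNeighborsRegressor(n_neighbors=2)  # default weights='uniform' gives a plain average
reg.fit(X_toy, y_toy)

# The two nearest neighbors of 1.4 are the points at 1.0 and 2.0,
# so the prediction is the mean of their targets, 10.0 and 20.0.
pred = reg.predict(np.array([[1.4]]))
print(pred)
```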

**Disadvantages of KNN:**

- It **must store all of the training data**.
- Its **prediction phase can be slow when n is large**.
- It is sensitive to irrelevant features.
- It is **sensitive to the scale of the data**.
- Accuracy is (generally) not competitive with the best supervised learning methods.