#### K-Nearest Neighbors (KNN) Classification
---

K-nearest neighbors classification is (as its name implies) a classification model that uses the "K" most similar observations in order to make a prediction.

KNN is a supervised learning method; therefore, the training data must have known target values.

The process of prediction using KNN is fairly straightforward:

1. Pick a value for K.
2. Search for the K observations in the data that are "nearest" to the measurements of the unknown iris.
    - Euclidean distance is often used as the distance metric, but other metrics are allowed.
3. Use the most popular response value from the K "nearest neighbors" as the predicted response value for the unknown iris.
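
The three steps above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements KNN; the function name `knn_predict` and the toy measurements are made up for this example, and ties in the vote are broken arbitrarily.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the query to every training observation
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k observations with the smallest distance
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: the most common label among those neighbors is the prediction
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two measurements per flower
train_points = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
train_labels = ["setosa", "setosa", "setosa", "virginica", "virginica", "virginica"]

print(knn_predict(train_points, train_labels, query=(1.1, 1.0), k=3))  # setosa
print(knn_predict(train_points, train_labels, query=(3.0, 3.0), k=3))  # virginica
```

Step 1 (picking K) is simply the `k` argument here.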

The visualizations below show how a given area can change in its prediction as K changes.

- This is simulated data with two predictors
- Colored points represent true values and colored areas represent a **prediction space**. (For K=1, this partition corresponds to a Voronoi diagram of the training points.)
- Each prediction space is the region where the majority of the "K" nearest points have the color of the space.
- To predict the class of a new point, we guess the class corresponding to the color of the space it lies in.

##### KNN Classification Map for Iris (K=1)

![1NN classification map](iris_01nn_map.png)

##### KNN Classification Map for Iris (K=5)

![5NN classification map](iris_05nn_map.png)

##### KNN Classification Map for Iris (K=15)

![15NN classification map](iris_15nn_map.png)

##### KNN Classification Map for Iris (K=50)

![50NN classification map](iris_50nn_map.png)

We can see that, as K increases, the classification spaces' borders become more distinct. However, you can also see that the spaces are not perfectly pure when it comes to the known elements within them.

**How are outliers affected by K?** As K increases, outliers are "smoothed out". Look at the four plots above and notice how outliers strongly affect the prediction space when K=1. When K=50, outliers no longer affect region boundaries. This is a classic bias-variance tradeoff -- with increasing K, the bias increases but the variance decreases.

<div style="color:blue;font-size:125%">
- What happens when K $\rightarrow$ number of points in the sample?
</div>
<div style="color:blue;font-size:125%">
- What is the best value for K?
</div>

##### NBA Position KNN Classifier

This dataset contains the 2015 season statistics for ~500 NBA players. The columns we'll use as features (and the target 'pos') are:

| Column | Meaning |
| ---    | ---     |
| pos | C: Center. F: Forward. G: Guard |
| ast | Assists per game | 
| stl | Steals per game | 
| blk | Blocks per game |
| tov | Turnovers per game | 
| pf  | Personal fouls per game | 

**First look at the data file to see whether it fits that description**

In [None]:
import pandas as pd
nba = pd.read_csv('NBA_players_2015.csv', usecols=['pos', 'ast', 'stl', 'blk', 'tov', 'pf'])
print(nba.shape)
print(nba.columns)

In [None]:
nba.head(5)

In [None]:
nba.info()

In [None]:
# Count how many players there are at each position (the target classes)

nba.pos.value_counts()

In [None]:
# Notice that for a classifier we can have a Y that is not 
# numeric and do not need dummy variables.

y = nba.pos
X = nba.drop(columns=['pos'])

In [None]:
print(X.shape)
print(y.shape)

In [None]:
X.head(5)

##### Build a model
For KNN, the choice of K is crucial, but we will ignore that for now and just choose K=3.
The first job is just to get a classifier running and evaluate it, just like we did for linear regression.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

In [None]:
knn.predict(X)

In [None]:
from sklearn import metrics
metrics.accuracy_score(y, knn.predict(X))

In [None]:
#  Setting n_neighbors to 1.  What do you expect to happen here?

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
metrics.accuracy_score(y, knn.predict(X))

#### Side Note:  Remember That the Classifier is Calculating a Conditional Probability

The classifier chooses the class with the highest estimated probability, but knowing the underlying probabilities can be useful for debugging.

In [None]:
knn15 = KNeighborsClassifier(n_neighbors=15)
knn15.fit(X, y)
knn15.predict_proba(X)
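
To see where these probabilities come from, the sketch below recomputes them by hand. It assumes the default uniform weighting and uses the iris data bundled with scikit-learn instead of the NBA file so that it runs on its own; the `_demo` names are made up for this example. With uniform weights, each probability is simply the fraction of the K neighbors that belong to that class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
knn_demo = KNeighborsClassifier(n_neighbors=15).fit(X_demo, y_demo)

# One observation, kept as a 2-D array because scikit-learn expects rows
query = X_demo[[70]]
proba = knn_demo.predict_proba(query)[0]

# Recompute by hand: find the 15 neighbors and count their labels
_, neighbor_idx = knn_demo.kneighbors(query)
neighbor_labels = y_demo[neighbor_idx[0]]
vote_fractions = np.bincount(neighbor_labels, minlength=3) / 15

print(proba)
print(vote_fractions)
```

The two printed arrays should match, which is why the probabilities from a K=15 model are always multiples of 1/15.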

### Using the Train/Test Split Procedure

* Remember that so far we have been evaluating accuracy on the training data
* To estimate testing accuracy, we can split the data into a training set and a test set

#### Step 1: Split X and y into training and testing sets (using `random_state` for reproducibility).

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25, random_state = 333)

In [None]:
(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

#### Step 2: Train the model on the training set

In [None]:
knn3 = KNeighborsClassifier(n_neighbors=3)
knn3.fit(X_train, y_train)

**And evaluate it on the test set**

In [None]:
metrics.accuracy_score(y_test, knn3.predict(X_test))

In [None]:
#  Now maybe 1 neighbor won't work as well?
knn1 = KNeighborsClassifier(n_neighbors=1)
knn1.fit(X_train, y_train)
metrics.accuracy_score(y_test, knn1.predict(X_test))

#### Comparing Testing Accuracy With Null Accuracy (The Low Bar)

Null accuracy is the accuracy that can be achieved by **always predicting the most frequent class**. For example, if most players are Centers, we would always predict Center.

The null accuracy is a benchmark against which you may want to measure every classification model.

In [None]:
most_freq_class = y.value_counts().index[0]
print(y.value_counts())
print(most_freq_class)

#### Compute null accuracy.

In [None]:
y.value_counts()[most_freq_class] / len(y)
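
scikit-learn also provides `DummyClassifier`, which gives the same baseline through the usual fit/predict interface. The sketch below uses made-up labels (not the NBA data) so that it runs on its own; `X_toy` and `y_toy` are hypothetical names.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels for illustration: 6 guards, 3 forwards, 1 center
y_toy = np.array(["G"] * 6 + ["F"] * 3 + ["C"])
X_toy = np.zeros((len(y_toy), 1))  # the features are ignored by this strategy

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
null_accuracy = baseline.score(X_toy, y_toy)

print(baseline.predict(X_toy[:1]), null_accuracy)  # ['G'] 0.6
```

Because it behaves like any other estimator, the baseline can be passed to the same evaluation code as the real model.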

#### Cross-Validation

In [None]:
from sklearn.model_selection import cross_val_score
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
scores

In [None]:
(scores.mean(), scores.std())

In [None]:
# Mean +/- 2 standard deviations of the fold scores gives a rough (not rigorous) ~95% range for test accuracy
(scores.mean() - 2 * scores.std(), scores.mean() + 2 * scores.std())

In [None]:
from sklearn.model_selection import cross_val_score
knn = KNeighborsClassifier(n_neighbors=1)
scores = cross_val_score(knn, X, y, cv=10)
(scores.mean() - 2*scores.std(), scores.mean() + 2 * scores.std())

There is a tradeoff in choosing the number of cross-validation folds:
  * Fewer folds: faster, but each model trains on less of the data (`cv=2`, the minimum, trains each model on half of the data).
  * More folds: slower. `cv=n` (one fold per observation) is leave-one-out cross-validation.
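
For a concrete sense of the two ends of this tradeoff, the sketch below compares 5-fold cross-validation with leave-one-out, where the number of folds equals the number of observations. It uses the iris data bundled with scikit-learn so that it runs on its own; the `_demo` names are made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
knn_demo = KNeighborsClassifier(n_neighbors=10)

# 5 folds: five models, each tested on about 20% of the data
five_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores_5 = cross_val_score(knn_demo, X_demo, y_demo, cv=five_fold)

# Leave-one-out: one model per observation, each tested on a single point
scores_loo = cross_val_score(knn_demo, X_demo, y_demo, cv=LeaveOneOut())

print(len(scores_5), scores_5.mean())
print(len(scores_loo), scores_loo.mean())
```

Each leave-one-out score is either 0 or 1 because every fold tests exactly one observation, so only the mean is meaningful there.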

In [None]:
knn = KNeighborsClassifier(n_neighbors=10)
for v in (2, 10, 20, 50, 100, 200):
    cv = cross_val_score(knn, X, y, cv=v)
    print(v, cv.mean(), cv.std())


##### Hyperparameter Optimization

Hyperparameter optimization, or tuning, means finding an optimal value for K.

<span style="color:blue">What does optimal mean?</span>

In [None]:
scores = []
for k in range(1, 200, 4):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv = cross_val_score(knn, X, y, cv=10)
    scores.append([k, cv.mean()])

In [None]:
scores[0]

In [None]:
# This is a plot of cross-validated accuracy as a function of k

data = pd.DataFrame(scores,columns=['k','score'])
data.plot.line(x='k',y='score');

#####  Since accuracy for K > 100 looks less interesting, let's focus on K below 100

In [None]:
scores = []
for k in range(1, 100, 1):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv = cross_val_score(knn, X, y, cv=10)
    scores.append([k, cv.mean()])
data = pd.DataFrame(scores,columns=['k','score'])
data.plot.line(x='k',y='score');

**Question:** As K increases, why does the accuracy rise then fall?

**Answer:** ...

#### Search for the "best" value of K.

In [None]:
# Calculate TRAINING ACCURACY and TESTING ACCURACY for odd values of K from 1 through 99.

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25, random_state = 333)

k_range = list(range(1, 100, 2))

training_accuracies = []
testing_accuracies = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    
    # Calculate training accuracy
    y_pred_training = knn.predict(X_train)
    training_accuracy = metrics.accuracy_score(y_train, y_pred_training)
    training_accuracies.append(training_accuracy)
    
    # Calculate testing accuracy.
    y_pred_test = knn.predict(X_test)
    testing_accuracy = metrics.accuracy_score(y_test, y_pred_test)
    testing_accuracies.append(testing_accuracy)

In [None]:
# Create a DataFrame of K, training accuracy, and testing accuracy.
column_dict = {'k': k_range, 'training_accuracy':training_accuracies, 'testing_accuracy':testing_accuracies}
df = pd.DataFrame(column_dict).sort_values(by='k')

In [None]:
# Plot the relationship between k and training and test accuracy
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot( 'k', 'testing_accuracy', data=df, color='skyblue')
plt.plot( 'k', 'training_accuracy', data=df, color='olive')
plt.legend()
plt.xlabel('Value of K for KNN');
plt.ylabel('Accuracy');

<a id="training-error-versus-testing-error"></a>
### Training Accuracy Versus Testing Accuracy

- Remember that model complexity is greatest at $K=1$ -- as $K$ gets larger the model tends toward "predict at the mode"

- **Training accuracy** increases as model complexity increases (lower value of K).
- **Testing accuracy**, as K increases from 1, will tend to start low (overfitting, too much complexity), then increase, then decrease again (too little complexity).

Evaluating the training and testing accuracy is important. For example:

- If the training accuracy is much higher than the test accuracy, then our model is likely overfitting. 
- If the test accuracy starts decreasing as we increase model complexity (here, as K gets smaller), we may be overfitting.
- If both training and test accuracy are low and stay flat as we vary the hyperparameter, our model is likely underfitting (not complex enough).
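
These patterns can be checked numerically with scikit-learn's `validation_curve`, which returns training and validation scores for each value of a hyperparameter. The sketch below is an illustration on synthetic data from `make_classification`, not the NBA data; the `_demo` names and the particular K values are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data with 10% of the labels flipped at random
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=0,
)

k_values = [1, 5, 15, 45]
train_scores, valid_scores = validation_curve(
    KNeighborsClassifier(), X_demo, y_demo,
    param_name="n_neighbors", param_range=k_values, cv=5,
)

# One row per K, one column per fold; average across the folds
for k, tr, va in zip(k_values, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f"K={k:2d}  train={tr:.3f}  validation={va:.3f}")
```

At K=1 the training accuracy is exactly 1.0 because every training point is its own nearest neighbor, so a large gap to the validation score at small K is the overfitting signature described above.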

##### Grid Search:  

We have this pattern now
* Choose a hyperparameter
* Fit a model using cross validation
* Record accuracy for that model
* Choose the hyperparameter that maximizes cross-val accuracy

In [None]:
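# GridSearchCV lives in sklearn.model_selection (the old sklearn.grid_search module was removed)
from sklearn.model_selection import GridSearchCV
parameters = {'n_neighbors': list(range(1, 100))}
knn = KNeighborsClassifier()
clf = GridSearchCV(knn, parameters)
clf.fit(X, y)
# The value of K with the best mean cross-validated accuracy
clf.best_params_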

In [None]:
metrics.accuracy_score(y, clf.predict(X))

In [None]:
testing_accuracies[4]

### Summary

* Classification
* Train vs test error
* Hyperparameter optimization
  * Balancing *bias* versus *variance*
* This procedure is exactly the same in scikit-learn regardless of the algorithm!

**Advantages of KNN:**

- It's simple to understand and explain.
- Model training is fast.
- It can be used for classification and regression (for regression, take the average value of the K nearest points!).
- Being a non-parametric method, it is often successful in classification situations where the decision boundary is very irregular.
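
As a small illustration of the regression case, the sketch below uses scikit-learn's `KNeighborsRegressor` on made-up numbers (`X_toy` and `y_toy` are hypothetical). With the default uniform weights, the prediction is the plain average of the targets of the K nearest points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up one-feature data
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y_toy = np.array([10.0, 20.0, 30.0, 100.0, 110.0])

reg = KNeighborsRegressor(n_neighbors=3).fit(X_toy, y_toy)

# The 3 nearest points to 2.1 are x=2, x=3, and x=1,
# so the prediction is the mean of 20, 30, and 10
prediction = reg.predict([[2.1]])[0]
print(prediction)  # 20.0
```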

**Disadvantages of KNN:**

- It must store all of the training data.
- Its prediction phase can be slow when n is large.
- It is sensitive to irrelevant features.
- It is sensitive to the scale of the data.
- Accuracy is (generally) not competitive with the best supervised learning methods.
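
The sensitivity to scale can be seen by comparing KNN with and without standardized features. The sketch below is an illustration on the wine data bundled with scikit-learn (not the NBA data), whose features have very different ranges; placing `StandardScaler` in a pipeline is one common remedy, and the variable names are made up for this example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_wine(return_X_y=True)

# Same K for both models; only the preprocessing differs
raw_knn = KNeighborsClassifier(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

raw_scores = cross_val_score(raw_knn, X_demo, y_demo, cv=5)
scaled_scores = cross_val_score(scaled_knn, X_demo, y_demo, cv=5)

print("raw features:   ", raw_scores.mean())
print("scaled features:", scaled_scores.mean())
```

Putting the scaler inside the pipeline means it is refit on the training folds only, so no information from the held-out fold leaks into the distance calculation.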