# Classification based on k-nearest neighbors

## The data

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
data = pd.read_csv('gdp-vs-lifesatisfaction-classes.csv')

In [None]:
data

In [None]:
data.plot.scatter(x='GDP per capita', y='Life satisfied')
# or equally well
# data.plot(x='GDP per capita', y='Life satisfied', kind='scatter')

## Scikit-learn

<img src="images/scikit-learn.png" width=500>

https://scikit-learn.org/stable/index.html
<br>
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

## k-nearest neighbors (for classification now! not regression)

In [None]:
import sklearn.neighbors

In [None]:
# Before for regression:
# model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=1)

# Classifier
model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=3)

In [None]:
# A technical note, as before:
# sklearn expects x to be a 2D array-like,
# which in pandas means a DataFrame rather than a Series.
# Indexing the DataFrame with a list of column names returns a DataFrame.

x = data[['GDP per capita']]
y = data['Life satisfied']
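
To see the difference the note above describes, here is a small sketch on a made-up dataframe. The column names mirror this notebook, but the values are invented for illustration and are not taken from the CSV file.

```python
import pandas as pd

# Toy data: column names mirror the notebook, values are made up
toy = pd.DataFrame({
    'GDP per capita': [10000, 20000, 30000],
    'Life satisfied': ['Not Satisfied', 'Not Satisfied', 'Satisfied'],
})

x_series = toy['GDP per capita']    # single brackets -> Series (1D)
x_frame = toy[['GDP per capita']]   # list of column names -> DataFrame (2D)

print(type(x_series).__name__, x_series.shape)  # Series (3,)
print(type(x_frame).__name__, x_frame.shape)    # DataFrame (3, 1)
```

The second form, with one row per sample and one column per feature, is the shape sklearn expects for `x`.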

In [None]:
# Train the model
model.fit(x,y)

In [None]:
# Make a prediction
x_test = [[25000]]
model.predict(x_test)
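
It may help to see what the classifier does under the hood. The sketch below is a hand-rolled version of the majority vote among the 3 nearest neighbors, checked against sklearn. The training values, the class names `'low'`/`'high'`, and the helper `knn_predict_one` are all made up for illustration. It also ignores distance ties, which sklearn handles by its own rules.

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

# Toy 1D training data (made-up values, not the notebook's dataset)
x_train = np.array([[10.0], [12.0], [20.0], [30.0], [33.0], [35.0]])
y_train = np.array(['low', 'low', 'low', 'high', 'high', 'high'])

def knn_predict_one(x_query, k=3):
    """Classify one value by majority vote among its k nearest training points."""
    distances = np.abs(x_train[:, 0] - x_query)   # distance to every training point
    nearest = np.argsort(distances)[:k]           # indices of the k closest points
    votes = Counter(y_train[nearest])             # count the class labels among them
    return votes.most_common(1)[0][0]             # the most frequent label wins

manual = knn_predict_one(24.0)

# The same prediction using sklearn
demo_model = KNeighborsClassifier(n_neighbors=3).fit(x_train, y_train)
library = demo_model.predict([[24.0]])[0]

print(manual, library)
```

For the query value 24, the three closest training points are 20, 30, and 33, so two of the three votes go to `'high'`.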

In [None]:
# Visualize what the predictions are for this model

data.plot.scatter(x='GDP per capita', y='Life satisfied')

x_new = np.linspace(8000,58000,100000)
x_new = x_new.reshape(-1,1)
y_pred = model.predict(x_new)
plt.plot(x_new, y_pred)

plt.show()

## Ascertaining the "goodness" of the model fit

In [None]:
model.score(x, y)

In [None]:
# If the model correctly classifies i points and misclassifies j points out of k total
# the score should be i/k
28/29

What is the above termed?
1. accuracy
2. precision
3. recall
4. actually, none of these

In [None]:
# Note that when calculating the precision and recall here, if your classes are not 0/1
# you will need to specify which class is the positive one (the "pos_label")
import sklearn.metrics  # imported explicitly rather than relying on sklearn.neighbors to load it

print(f"Accuracy: {sklearn.metrics.accuracy_score(y, model.predict(x)):.2%}")
print(f"Precision: {sklearn.metrics.precision_score(y, model.predict(x), pos_label='Satisfied'):.2%}")
print(f"Recall: {sklearn.metrics.recall_score(y, model.predict(x), pos_label='Satisfied'):.2%}")
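
As a rough check on what these three numbers mean, the sketch below computes them by hand from counts of true/false positives and negatives, then compares with sklearn. The labels are made up for illustration and are not the notebook's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up true labels and predictions, for illustration only
y_true = np.array(['Satisfied'] * 4 + ['Not Satisfied'] * 4)
y_hat = np.array(['Satisfied', 'Satisfied', 'Satisfied', 'Not Satisfied',
                  'Satisfied', 'Satisfied', 'Not Satisfied', 'Not Satisfied'])

pos = 'Satisfied'  # the class we treat as "positive"
tp = np.sum((y_true == pos) & (y_hat == pos))   # positives predicted positive
fn = np.sum((y_true == pos) & (y_hat != pos))   # positives predicted negative
fp = np.sum((y_true != pos) & (y_hat == pos))   # negatives predicted positive
tn = np.sum((y_true != pos) & (y_hat != pos))   # negatives predicted negative

accuracy = (tp + tn) / len(y_true)   # fraction of all points classified correctly
precision = tp / (tp + fp)           # of the predicted positives, how many are right
recall = tp / (tp + fn)              # of the actual positives, how many were found

print(accuracy, precision, recall)

# These should agree with sklearn's functions
print(accuracy_score(y_true, y_hat),
      precision_score(y_true, y_hat, pos_label=pos),
      recall_score(y_true, y_hat, pos_label=pos))
```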

In [None]:
from sklearn.metrics import confusion_matrix

A confusion matrix gives more detail on where the model is right and wrong than a single score does.

In the case of binary classification, the confusion matrix shows the true negatives, true positives, false negatives, and false positives.
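
As a small sketch of how sklearn lays the matrix out, the example below uses made-up labels (not the notebook's data). Rows correspond to the actual classes and columns to the predicted classes, in the order given by `labels`. With the negative class listed first, flattening the 2x2 matrix gives the four counts in the order shown.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
y_true = ['Not Satisfied', 'Not Satisfied', 'Not Satisfied', 'Satisfied', 'Satisfied']
y_hat = ['Not Satisfied', 'Satisfied', 'Not Satisfied', 'Satisfied', 'Not Satisfied']

# Negative class first, positive class second
labels = ['Not Satisfied', 'Satisfied']
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)

# Row i = actual class labels[i], column j = predicted class labels[j]
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```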

In [None]:
confusion_matrix(y, model.predict(x))

## If we take "Not Satisfied" as the negative class, which entries of the matrix above are the
* true negatives? -- 
* true positives? -- 
* false negatives? -- 
* false positives? -- 

In [None]:
confmat = confusion_matrix(y, model.predict(x))

fig, ax = plt.subplots(figsize=(5, 5))
ax.imshow(confmat)

# the below just sets the axis labels, tick marks, and text inside the boxes
labels = model.classes_  # class names, in the same sorted order confusion_matrix uses
ax.xaxis.set(ticks=(0, 1), ticklabels=[f'Predicted {c}' for c in labels])
ax.yaxis.set(ticks=(0, 1), ticklabels=[f'Actual {c}' for c in labels])
for i in range(2):
    for j in range(2):
        ax.text(j, i, confmat[i,j], ha='center', va='center', color='red')

plt.show()

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y, model.predict(x)))

## Exercises

As an exercise, repeat the above, but include elements from last week:
* a test/train split
* cross-validation to find the optimum number of neighbors

Do this using the Wisconsin breast cancer dataset (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html)
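
As a reminder of the splitting step, here is a minimal sketch on synthetic data (random numbers, not the breast cancer dataset). The names `x_demo` and `y_demo` and the choices `test_size=0.25` and `random_state=0` are assumptions made for illustration. `stratify` keeps the class proportions the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic 1D data: two overlapping groups of 100 points each (made up)
rng = np.random.default_rng(0)
x_demo = np.vstack([rng.normal(0.0, 1.0, size=(100, 1)),
                    rng.normal(3.0, 1.0, size=(100, 1))])
y_demo = np.array([0] * 100 + [1] * 100)

# Hold out 25% of the points for testing, keeping the class balance
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

print(x_tr.shape, x_te.shape)  # (150, 1) (50, 1)
```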

### I will start you off:

Execute the following two cells to import the data:

In [None]:
import sklearn.datasets
import sklearn.model_selection

In [None]:
x,y = sklearn.datasets.load_breast_cancer(return_X_y=True, as_frame=True)

Execute the following two cells to see what's in `x` and `y`:

In [None]:
x

In [None]:
y

Make a dataframe of just the `mean radius` column of x and assign to a new variable.

Make a scatter plot of y vs mean radius.

Split your radius and y data into training and test sets.

Initialize your k-nearest neighbors classifier and start with n_neighbors = 3.

Train the model.

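Look at the plot above, choose a value for radius, and execute a command that shows which class your model predicts for that value.
* if you get a warning about feature names, ignore it (it appears because the model was trained on a dataframe with column names but is asked to predict from a plain list or array).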

Remake the plot from above, and on top of it, plot a curve showing the predictions of your model over the plotted horizontal range.
* you may find the following useful:
  * `np.linspace(a,b,c)` will make a numpy array with `c` evenly spaced elements starting at the value `a` and ending at `b`
  * remember that sklearn's predict method must have a 2D-like array input, so if you use the above, you may need to reshape the array with `.reshape(-1,1)`
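
A quick sketch of the two hints above, using arbitrary numbers chosen only for illustration:

```python
import numpy as np

grid = np.linspace(0.0, 1.0, 5)   # 5 evenly spaced values from 0 to 1, endpoints included
print(grid.shape)                 # (5,)

grid_2d = grid.reshape(-1, 1)     # -1 lets numpy work out the number of rows
print(grid_2d.shape)              # (5, 1)
```

The reshaped array has one row per value and a single column, which is the layout `predict` expects for a single feature.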

The following cell runs the cross-validation once you fill in the places marked with "???".

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
k_range = range(1, ???)
k_scores = []
for k in k_range:
    knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=???)
    acc = cross_val_score(???,
                           ???,
                           ???, 
                           cv=???, 
                           scoring='accuracy')
    k_scores.append(acc.mean())
plt.scatter(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
plt.show()
# np.argmax returns a position in the list, so look up the matching k in k_range
print('Best k = ', k_range[np.argmax(???)])
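
One thing to watch when reading off the best value: `np.argmax` returns the position of the largest score, not the value of k itself. Since the range of k values starts at 1, position and k differ by one. A small sketch with made-up scores (the names `k_values` and `demo_scores` are hypothetical):

```python
import numpy as np

k_values = range(1, 6)                        # k = 1, 2, 3, 4, 5
demo_scores = [0.80, 0.85, 0.90, 0.88, 0.86]  # made-up accuracies, one per k

best_index = np.argmax(demo_scores)   # position of the highest score
best_k = k_values[best_index]         # the k that produced it

print(best_index, best_k)  # 2 3
```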

Initialize another k-nearest neighbors classifier with the best n_neighbors.

Train the model.

Remake the plot that has the training points and the curve of your new trained model's predictions.

Print out the accuracy, precision, and recall assessed via the test set.

Print the confusion matrix.

Print the classification report.