# KNN for classification and imputation

In this lab you'll practice using KNN for classification first, then explore how it can be used for effective variable imputation.

---

### 1. Load packages

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import seaborn as sns


%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.neighbors import KNeighborsClassifier

In [2]:
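import importlib.util

# Load the plotting helper from a local file (adjust the path to your checkout).
spec = importlib.util.spec_from_file_location('plotter', '/Users/kiefer/github-repos/DSI-SF-2/utils/plotting/knn_plotter.py')
plotter = importlib.util.module_from_spec(spec)
spec.loader.exec_module(plotter)
KNNBoundaryPlotter = plotter.KNNBoundaryPlotter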

---

### 3. Load datasets


In [17]:
affair = pd.read_csv('/Users/kiefer/github-repos/DSI-SF-2/datasets/affairs/affair.csv')
churn = pd.read_csv('/Users/kiefer/github-repos/DSI-SF-2/datasets/cell_phone_churn/cell_phone_churn.csv')
coffee = pd.read_csv('/Users/kiefer/github-repos/DSI-SF-2/datasets/coffee_preferences/dat12-coffee-preferences.csv')

---

### 4. Encode affair vs. no affair in the affair dataset

This will be your binary target class variable.

In [18]:
affair.head()

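|   | Unnamed: 0 | sex | age | ym | child | religious | education | occupation | rate | nbaffairs |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | male | 37.0 | 10.0 | no | 3 | 18 | 7 | 4 | 0 |
| 1 | 2 | female | 27.0 | 4.0 | no | 4 | 14 | 6 | 4 | 0 |
| 2 | 3 | female | 32.0 | 15.0 | yes | 1 | 12 | 1 | 4 | 0 |
| 3 | 4 | male | 57.0 | 15.0 | yes | 5 | 18 | 6 | 5 | 0 |
| 4 | 5 | male | 22.0 | 0.75 | no | 2 | 17 | 6 | 3 | 0 |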


In [19]:
affair.nbaffairs.unique()

array([ 0,  3,  7, 12,  1,  2])

In [20]:
affair.drop(affair.columns[0], axis=1, inplace=True)

In [21]:
affair.dropna(inplace=True)

In [22]:
affair.nbaffairs = affair.nbaffairs.map(lambda x: 1 if x >= 1 else 0)

---

### 5. Clean and convert string variables

In [23]:
affair.head(1)

|   | sex | age | ym | child | religious | education | occupation | rate | nbaffairs |
|---|---|---|---|---|---|---|---|---|---|
| 0 | male | 37.0 | 10.0 | no | 3 | 18 | 7 | 4 | 0 |


In [24]:
affair.sex = affair.sex.map(lambda x: 1 if x == 'male' else 0)
affair.child = affair.child.map(lambda x: 0 if x == 'no' else 1)

---

### 6. Fit a `KNeighborsClassifier` with `weights='uniform'` and `n_neighbors=3`

Choose **2 predictor variables** to predict whether or not a person had an affair.

In [36]:
from sklearn.neighbors import KNeighborsClassifier

X = affair[['age','religious']]
y = affair.nbaffairs.values

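# Note: the heading asks for n_neighbors=3, but this cell was run with
# n_neighbors=50; the outputs below reflect 50 neighbors.
knn_uniform_n3 = KNeighborsClassifier(n_neighbors=50, weights='uniform')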
knn_uniform_n3.fit(X, y)



KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=50, p=2,
           weights='uniform')

In [37]:
# Double brackets keep person9 as a one-row DataFrame; predict() expects 2D input.
person9 = X.iloc[[9], :]

In [40]:
# Baseline accuracy: always predict the majority class (no affair).
print('baseline:', 1 - np.mean(y))

baseline: 0.750415973378


In [38]:
print(knn_uniform_n3.predict(person9))
print(knn_uniform_n3.predict_proba(person9))

[0]
[[ 0.94  0.06]]




---

### 7. Cross-validate the classifier with `StratifiedKFold`



In [50]:
from sklearn.model_selection import StratifiedKFold

# Stratified folds keep the class balance roughly equal in every split.
cv_folds = StratifiedKFold(n_splits=5)

scores = []
for i, (train, test) in enumerate(cv_folds.split(X, y)):
    print('fold, ', i)
    X_train, y_train = X.iloc[train, :], y[train]
    X_test, y_test = X.iloc[test, :], y[test]

    model = KNeighborsClassifier(n_neighbors=3, weights='uniform')
    model.fit(X_train, y_train)

    # score() returns the accuracy on the held-out fold.
    test_accuracy = model.score(X_test, y_test)
    scores.append(test_accuracy)

print(scores)
print(np.mean(scores))
    

fold,  0
fold,  1
fold,  2
fold,  3
fold,  4
[0.73553719008264462, 0.71666666666666667, 0.7416666666666667, 0.70833333333333337, 0.69999999999999996]
0.72044077135


---

### 8. Do the same but with `n_neighbors=11`

Use the same predictor variables and cv folds.
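
A minimal sketch of one way to approach this, assuming a scikit-learn version that provides `sklearn.model_selection`. The data is a synthetic stand-in generated with `make_classification`, not the affair dataset, and the names `X_demo`, `y_demo`, and `scores_by_k` are illustrative only. Because `StratifiedKFold` does not shuffle by default, passing the same splitter to each call scores both models on identical folds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for two chosen predictors and a binary target.
X_demo, y_demo = make_classification(n_samples=300, n_features=2,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)

# No shuffling, so the splits are the same each time the splitter is used.
cv_folds = StratifiedKFold(n_splits=5)

scores_by_k = {}
for k in (3, 11):
    model = KNeighborsClassifier(n_neighbors=k, weights='uniform')
    # One accuracy value per fold.
    scores_by_k[k] = cross_val_score(model, X_demo, y_demo, cv=cv_folds)
    print(k, scores_by_k[k].mean())
```

In the lab itself you would pass your own `X` and `y` in place of the synthetic arrays.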

---

### 9. Cross-validate a model with `n_neighbors=11` and `weights='distance'`
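
With `weights='distance'`, closer neighbors count for more in the vote than farther ones. The sketch below, again on synthetic stand-in data rather than the affair dataset, cross-validates both weightings with 11 neighbors. It also prints the training accuracy of the distance-weighted model as a reminder of why held-out folds matter: each training point is its own nearest neighbor at distance zero, so training accuracy tends to look far better than the cross-validated score. All variable names here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; replace with your own X and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=2,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)
cv_folds = StratifiedKFold(n_splits=5)

cv_accuracy = {}
for weights in ('uniform', 'distance'):
    model = KNeighborsClassifier(n_neighbors=11, weights=weights)
    cv_accuracy[weights] = cross_val_score(model, X_demo, y_demo,
                                           cv=cv_folds).mean()
    print(weights, cv_accuracy[weights])

# Accuracy on the data the model was fit on, for comparison only.
distance_model = KNeighborsClassifier(n_neighbors=11, weights='distance')
train_accuracy = distance_model.fit(X_demo, y_demo).score(X_demo, y_demo)
print('train accuracy (distance):', train_accuracy)
```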

---

### 10. [Optional] Explore the model visually with the `KNNBoundaryPlotter`

---

### 11. With the churn dataset, use grid search to find the optimal number of neighbors and weighting for predicting churn

Show the cross-validated accuracy of the model.
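
One possible shape for the search, sketched with scikit-learn's `GridSearchCV` on synthetic stand-in data (the churn columns are not assumed here). The candidate values in `param_grid` are arbitrary choices for illustration, not recommendations from the lab.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the churn predictors and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# Candidate settings to try; every combination is cross-validated.
param_grid = {
    'n_neighbors': [1, 3, 5, 11, 21, 51],
    'weights': ['uniform', 'distance'],
}

# For classifiers, an integer cv uses stratified folds.
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

print(search.best_params_)   # best combination found
print(search.best_score_)    # its mean cross-validated accuracy
```

`best_score_` is the mean accuracy across the folds for the winning combination, which is the cross-validated accuracy the exercise asks for.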

---

## Variable imputation with KNeighbors

KNN can do both classification _and_ regression, and its simplicity makes it quite flexible. One of its most useful applications is imputation: a missing value can be predicted from the rows that are most similar on the other columns.
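
As a small illustration of the classification and regression variants, the sketch below fits `KNeighborsClassifier` and `KNeighborsRegressor` on the same toy feature. The arrays are made up for the example. The classifier takes a majority vote over the nearest labels, while the regressor averages the nearest target values.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy one-dimensional feature with a class label and a numeric target.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

clf = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_class)
reg = KNeighborsRegressor(n_neighbors=3).fit(X_toy, y_value)

# The 3 nearest training points to 4.2 are 4.0, 5.0 and 3.0.
query = np.array([[4.2]])
print(clf.predict(query))  # majority vote over labels 1, 1, 1
print(reg.predict(query))  # mean of targets 3.0, 4.0, 5.0
```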

---

### 12. Look at the coffee data, count the missing values
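
Counting missing values per column is usually a matter of `isnull().sum()`. The sketch below uses a small made-up frame, since the coffee columns are not shown here; the column names `brand_a`, `brand_b`, and `brand_c` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the coffee ratings.
coffee_demo = pd.DataFrame({
    'brand_a': [5, 4, np.nan, 3, 5],
    'brand_b': [np.nan, 2, 3, np.nan, 4],
    'brand_c': [1, 2, 3, 4, 5],
})

# isnull() marks missing cells; summing counts them per column.
missing_per_column = coffee_demo.isnull().sum()
print(missing_per_column)
print('total missing:', missing_per_column.sum())
```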

---

### 13. For each column with missing values, build a `KNeighborsClassifier` to predict the rating for that column based on the other columns

Another great benefit of KNN is the ease with which it can do multi-class problems like this.

[Note: there is a more complicated way to do this, but I am doing it the simple way in the solutions.]
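
A rough sketch of one simple approach, not necessarily the one used in the solutions. It assumes ratings are integers from 1 to 5 and uses randomly generated stand-in data with hypothetical column names. For each column with gaps, a classifier is fit on the rows where that column is known and then predicts the rows where it is missing. Gaps in the predictor columns are temporarily filled with the column median so that distances can be computed, which is a simplifying assumption.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)

# Hypothetical ratings on a 1-5 scale, with gaps punched into two columns.
ratings = pd.DataFrame(rng.randint(1, 6, size=(60, 4)).astype(float),
                       columns=['brand_a', 'brand_b', 'brand_c', 'brand_d'])
ratings.loc[rng.choice(60, 8, replace=False), 'brand_a'] = np.nan
ratings.loc[rng.choice(60, 8, replace=False), 'brand_b'] = np.nan

imputed = ratings.copy()
for col in ratings.columns[ratings.isnull().any()]:
    # Use every other column as a predictor.
    predictors = ratings.drop(columns=col)
    # Simplification: median-fill gaps in the predictors.
    predictors = predictors.fillna(predictors.median())

    known = ratings[col].notnull()
    knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
    # Treat each rating level as a class label.
    knn.fit(predictors[known], ratings.loc[known, col].astype(int))

    # Fill only the rows where this column was missing.
    imputed.loc[~known, col] = knn.predict(predictors[~known])

print(imputed.isnull().sum())
```

Because the model is a classifier, every imputed value is one of the rating levels already present in the column, which keeps the filled-in data on the original scale.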