# Supervised Learning and K Nearest Neighbors Exercises

## Introduction

We will be using customer churn data from the telecom industry for this week's exercises. The data file is called `Orange_Telecom_Churn_Data.csv`. We will load this data together, do some preprocessing, and use K-nearest neighbors to predict customer churn based on account characteristics.

In [3]:
from __future__ import print_function
import os
# Path components are joined with os.sep later, so keep them separate (portable across OSes)
data_path = ['..', 'data']

## Question 1

* Begin by importing the data. Examine the columns and data.
* Notice that the data contains a state, area code, and phone number. Do you think these are good features to use when building a machine learning model? Why or why not? 

We will not be using them, so they can be dropped from the data.
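One way to approach the question is to count distinct values in each candidate column. The sketch below rebuilds only the first five rows of the data (the values displayed by `data.head(5).T` in this notebook) in a hypothetical frame named `toy`; the interpretation in the comments is a rule of thumb, not a formal test.

```python
import pandas as pd

# First five rows of three columns, copied from the notebook's preview.
# `toy` is an illustrative name, not part of the exercise.
toy = pd.DataFrame({
    "state": ["KS", "OH", "NJ", "OH", "OK"],
    "area_code": [415, 415, 415, 408, 415],
    "phone_number": ["382-4657", "371-7191", "358-1921", "375-9999", "330-6626"],
})

# Number of distinct values per column
distinct = toy.nunique()
print(distinct)

# A column with one distinct value per row behaves like an ID:
# it identifies the customer rather than describing behavior.
id_like = [col for col in toy.columns if distinct[col] == len(toy)]
print("Identifier-like columns:", id_like)
```

On the full file the same check can be run with `data.nunique()`. A phone number is unique per customer, so a distance-based model cannot generalize from it.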

In [4]:
import pandas as pd

# Import the data using the file path
filepath = os.sep.join(data_path + ['Orange_Telecom_Churn_Data.csv'])
data = pd.read_csv(filepath)

In [5]:
data.head(5).T

Unnamed: 0,0,1,2,3,4
state,KS,OH,NJ,OH,OK
account_length,128,107,137,84,75
area_code,415,415,415,408,415
phone_number,382-4657,371-7191,358-1921,375-9999,330-6626
intl_plan,no,no,no,yes,yes
voice_mail_plan,yes,yes,no,no,no
number_vmail_messages,25,26,0,0,0
total_day_minutes,265.1,161.6,243.4,299.4,166.7
total_day_calls,110,123,114,71,113
total_day_charge,45.07,27.47,41.38,50.9,28.34


In [10]:
# Remove extraneous columns
data.drop(['state', 'area_code', 'phone_number'], axis=1, inplace=True)

In [11]:
data.columns

Index(['account_length', 'intl_plan', 'voice_mail_plan',
       'number_vmail_messages', 'total_day_minutes', 'total_day_calls',
       'total_day_charge', 'total_eve_minutes', 'total_eve_calls',
       'total_eve_charge', 'total_night_minutes', 'total_night_calls',
       'total_night_charge', 'total_intl_minutes', 'total_intl_calls',
       'total_intl_charge', 'number_customer_service_calls', 'churned'],
      dtype='object')

## Question 2

* Notice that some of the columns are categorical data and some are floats. These features will need to be numerically encoded using one of the methods from the lecture.
* Finally, remember from the lecture that K-nearest neighbors requires scaled data. Scale the data using one of the scaling methods discussed in the lecture.
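As a small illustration of the two steps, the sketch below encodes a yes/no column and min-max scales a numeric column, using the first five values of `intl_plan` and `total_day_minutes` from the preview. The variable names (`plan`, `minutes`, `manual`) are assumptions made for this example. Min-max scaling maps each column to $[0, 1]$ via $(x - x_{min}) / (x_{max} - x_{min})$.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, MinMaxScaler

# First five values of two columns from the preview
plan = ["no", "no", "no", "yes", "yes"]
minutes = np.array([[265.1], [161.6], [243.4], [299.4], [166.7]])

# Binary encoding: classes are sorted, so 'no' -> 0 and 'yes' -> 1
encoded = LabelBinarizer().fit_transform(plan)
print(encoded.ravel())

# Min-max scaling by hand ...
manual = (minutes - minutes.min(axis=0)) / (minutes.max(axis=0) - minutes.min(axis=0))

# ... and with scikit-learn; the two should agree
scaled = MinMaxScaler().fit_transform(minutes)
print(np.allclose(manual, scaled))
```

Scaling matters for K-nearest neighbors because distances are otherwise dominated by whichever feature has the largest numeric range.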

In [12]:
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()

for col in ['intl_plan', 'voice_mail_plan', 'churned']:
    data[col] = lb.fit_transform(data[col])
       

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4990,4991,4992,4993,4994,4995,4996,4997,4998,4999
account_length,128.0,107.0,137.0,84.0,75.0,118.0,121.0,147.0,117.0,141.0,...,140.0,97.0,83.0,73.0,75.0,50.0,152.0,61.0,109.0,86.0
intl_plan,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
voice_mail_plan,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
number_vmail_messages,25.0,26.0,0.0,0.0,0.0,0.0,24.0,0.0,0.0,37.0,...,0.0,0.0,0.0,0.0,0.0,40.0,0.0,0.0,0.0,34.0
total_day_minutes,265.1,161.6,243.4,299.4,166.7,223.4,218.2,157.0,184.5,258.6,...,244.7,252.6,188.3,177.9,170.7,235.7,184.2,140.6,188.8,129.4


In [13]:
# Mute the sklearn warning
import warnings
warnings.filterwarnings('ignore', module='sklearn')

from sklearn.preprocessing import MinMaxScaler

msc = MinMaxScaler()

data = pd.DataFrame(msc.fit_transform(data),  # this is an np.array, not a dataframe.
                    columns=data.columns)

In [22]:
# just checking: after min-max scaling every column should span [0, 1]
print(data.min())
print(data.max())

account_length                   0.0
intl_plan                        0.0
voice_mail_plan                  0.0
number_vmail_messages            0.0
total_day_minutes                0.0
total_day_calls                  0.0
total_day_charge                 0.0
total_eve_minutes                0.0
total_eve_calls                  0.0
total_eve_charge                 0.0
total_night_minutes              0.0
total_night_calls                0.0
total_night_charge               0.0
total_intl_minutes               0.0
total_intl_calls                 0.0
total_intl_charge                0.0
number_customer_service_calls    0.0
churned                          0.0
dtype: float64
account_length                   1.0
intl_plan                        1.0
voice_mail_plan                  1.0
number_vmail_messages            1.0
total_day_minutes                1.0
total_day_calls                  1.0
total_day_charge                 1.0
total_eve_minutes                1.0
total_eve_calls        

## Question 3

* Separate the feature columns (everything except `churned`) from the label (`churned`). This will create two tables.
* Fit a K-nearest neighbors model with a value of `k=3` to this data and predict the outcome on the same data.

In [23]:
# Get a list of all the columns that don't contain the label
# x_cols = [x for x in data.columns if x != 'churned']

# Split the data into two dataframes
# X_data = data[x_cols]
# y_data = data['churned']

# # alternatively:
# X_data = data.copy()
# y_data = X_data.pop('churned')

y_data = data['churned']
X_data = data.copy()
X_data.drop(['churned'], axis=1, inplace=True)

In [24]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)

knn = knn.fit(X_data, y_data)

y_pred = knn.predict(X_data)

## Question 4

Ways to measure error haven't been discussed in class yet, but accuracy is an easy one to understand: it is simply the fraction of labels (whether true or false) that were predicted correctly.

* Write a function to calculate accuracy using the actual and predicted labels.
* Using the function, calculate the accuracy of this K-nearest neighbors model on the data.

In [51]:
# Function to calculate the % of values that were correctly predicted

def accuracy(real, predict):
    return sum(real == predict) / float(real.shape[0])

In [26]:
print(accuracy(y_data, y_pred))

0.9422
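As a sanity check on a hand-written accuracy function, the result can be compared with scikit-learn's `accuracy_score`. The sketch below uses a tiny made-up label vector (`y_true`, `y_hat` are illustrative names, not the churn data) and a NumPy variant of the function.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def accuracy(real, predict):
    """Fraction of positions where the predicted label equals the real one."""
    real = np.asarray(real)
    predict = np.asarray(predict)
    return np.mean(real == predict)

# Made-up labels: four of the five predictions match
y_true = np.array([0, 1, 1, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1])

manual = accuracy(y_true, y_hat)
library = accuracy_score(y_true, y_hat)
print(manual, library)
```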


## Question 5

* Fit the K-nearest neighbors model again with `n_neighbors=3` but this time use distance for the weights. Calculate the accuracy using the function you created above. 
* Fit another K-nearest neighbors model. This time use uniform weights but set the power parameter for the Minkowski distance metric to be 1 (`p=1`) i.e. Manhattan Distance.

When weighted distances are used for part 1 of this question, a value of 1.0 should be returned for the accuracy. Why do you think this is? *Hint:* we are predicting on the data and with KNN the model *is* the data. We will learn how to avoid this pitfall in the next lecture.
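The effect described in the hint can be reproduced on random data. In the sketch below the features and labels are pure noise (an assumption chosen so that there is nothing real to learn), yet distance weighting still scores perfectly on the data it was fit on: each point is its own nearest neighbor at distance zero, so its own label dominates the vote.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 4))            # random features, all rows distinct
y = rng.integers(0, 2, size=200)    # random labels, unrelated to X

scores = {}
for weights in ("uniform", "distance"):
    knn = KNeighborsClassifier(n_neighbors=3, weights=weights).fit(X, y)
    # Score on the same data used for fitting
    scores[weights] = knn.score(X, y)

print(scores)
```

The perfect score here says nothing about how the model would perform on new customers. It would only fail to reach 1.0 if identical feature rows carried conflicting labels.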

In [27]:
#Student writes code here
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3, weights = 'distance')
knn = knn.fit(X_data, y_data)
y_pred = knn.predict(X_data)
print(accuracy(y_data, y_pred))

1.0


In [28]:
#Student writes code here
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3, p=1)
knn = knn.fit(X_data, y_data)
y_pred = knn.predict(X_data)
print(accuracy(y_data, y_pred))

0.9456


## Question 6

* Fit a K-nearest neighbors model using values of `k` (`n_neighbors`) ranging from 1 to 20. Use uniform weights (the default). The power parameter for the Minkowski distance (`p`) can be set to either 1 or 2; just be consistent. Store the accuracy and the value of `k` used from each of these fits in a list or dictionary.
* Plot (or view the table of) the `accuracy` vs `k`. What do you notice happens when `k=1`? Why do you think this is? *Hint:* it's for the same reason discussed above.

In [29]:
#Student writes code here
from sklearn.neighbors import KNeighborsClassifier

k = list(range(1, 21))  # values of n_neighbors to try
acc = []                # accuracy on the fitting data for each k
for i in k:
    knn = KNeighborsClassifier(n_neighbors=i)
    knn = knn.fit(X_data, y_data)
    y_pred = knn.predict(X_data)
    acc.append(accuracy(y_data, y_pred))


In [30]:
# Collect the results in a table: one row per value of k
z = pd.DataFrame({'k': k, 'accuracy': acc})
z

Unnamed: 0,k,accuracy
0,1,1.0
1,2,0.9292
2,3,0.9422
3,4,0.9154
4,5,0.9284
5,6,0.9156
6,7,0.9254
7,8,0.9122
8,9,0.9224
9,10,0.9092


## Question 7

* Look up the documentation for `KNeighborsClassifier` on scikit-learn.org.
* What do the various parameters mean, in particular 'distance' and 'p'?
* What does accuracy = 1.0 at k = 1 tell you?
* How useful is our model for making predictions?
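To make the role of `p` concrete, here is a minimal sketch of the Minkowski distance, $d(a, b) = \left(\sum_i |a_i - b_i|^p\right)^{1/p}$. The helper `minkowski` and the two points are illustrative assumptions, not part of scikit-learn's API.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance between two vectors for a power parameter p >= 1."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

manhattan = minkowski(a, b, p=1)   # sum of absolute differences: 3 + 4
euclidean = minkowski(a, b, p=2)   # straight-line distance: sqrt(9 + 16)
print(manhattan, euclidean)
```

With `p=1` the neighbors are ranked by Manhattan distance and with `p=2` by Euclidean distance, so the two settings can select different neighbors for the same point.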

In [32]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2)

In [39]:
# print(train.head())
print(train.shape)
print(test.shape)

(4000, 18)
(1000, 18)


In [40]:
y_train = train['churned']
x_train = train.copy()
x_train.drop(['churned'], axis=1, inplace=True)
y_test = test['churned']
x_test = test.copy()
x_test.drop(['churned'], axis=1, inplace=True)

In [57]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, p=1)
knn = knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)


In [58]:
accuracy(y_test,y_pred)

0.926

In [64]:
k = list(range(1, 21))  # values of n_neighbors to try
acc = []                # test-set accuracy for each k
for i in k:
    knn = KNeighborsClassifier(n_neighbors=i)
    knn = knn.fit(x_train, y_train)
    y_pred = knn.predict(x_test)
    acc.append(accuracy(y_test, y_pred))

In [65]:
# Collect the results in a table: one row per value of k
result = pd.DataFrame({'k': k, 'accuracy': acc})
result

Unnamed: 0,k,accuracy
0,1,0.897
1,2,0.907
2,3,0.911
3,4,0.909
4,5,0.923
5,6,0.916
6,7,0.927
7,8,0.918
8,9,0.922
9,10,0.922


In [None]:
# write code to apply CV
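
One possible way to fill in the cell above is scikit-learn's `cross_val_score`. The sketch below is an assumption about what "apply CV" could look like, not the course's solution: it uses synthetic data from `make_classification` as a stand-in for the churn features and labels, and the choice of 5 folds is arbitrary. The names `X`, `y`, `cv_results`, and `best_k` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled churn features and labels
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

cv_results = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds; each fold is scored on data not used for fitting
    fold_scores = cross_val_score(knn, X, y, cv=5, scoring="accuracy")
    cv_results[k] = fold_scores.mean()

# Value of k with the highest mean cross-validated accuracy
best_k = max(cv_results, key=cv_results.get)
print(best_k, cv_results[best_k])
```

Unlike a single train/test split, every row is used for testing exactly once, so the estimate depends less on one particular random split.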