# Lesson: KNN

<hr style="border:2px solid gray">

### What is K Nearest Neighbors (KNN)?
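- Supervised algorithm: it learns from labeled examples
- Makes predictions based on how close a new data point is to known data points
- Lazy: there is no real training step; the model stores the training data and does the work at prediction time
- Sensitive to scaling: features with larger ranges dominate the distance calculation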

Link: [KNN Diagram](https://cambridgecoding.files.wordpress.com/2016/01/knn2.jpg)
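
To make the idea concrete, here is a minimal from-scratch sketch of the prediction step, assuming Euclidean distance and a simple majority vote. The function name `knn_predict` and the toy points are made up for illustration; the lesson itself uses `KNeighborsClassifier`.

```python
from collections import Counter

import numpy as np


def knn_predict(X_known, y_known, new_point, k=3):
    """Predict a label for new_point by majority vote of its k nearest known points."""
    # Euclidean distance from the new point to every known point
    distances = np.sqrt(((X_known - new_point) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_known[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two small clusters
X_known = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
y_known = np.array(['False', 'False', 'False', 'True', 'True', 'True'])

print(knn_predict(X_known, y_known, np.array([2.0, 2.0])))  # near the first cluster
print(knn_predict(X_known, y_known, np.array([8.0, 9.0])))  # near the second cluster
```

Note that nothing is "learned" ahead of time: all of the work happens when a prediction is requested, which is what "lazy" means here.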

### Pros:
1. Simple
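2. Fairly robust to noisy training data when k is large enough
3. Can be effective when plenty of training data is available
4. Performs calculations "just in time" (at prediction time, not during fitting)
5. Easy to keep up to date: adding new labeled observations keeps predictions current without an expensive retraining step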

### Cons:
1. Need to determine the optimal number of neighbors (k)
2. Computation cost is high at prediction time (has to calculate the distance from each new point to every training point)
3. The volume of the feature space grows exponentially as the number of features increases, so points become sparse and distances less meaningful (curse of dimensionality)
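
The curse of dimensionality can be seen in a small simulation. The sketch below uses random points in a unit cube (not the lesson's data) and compares the nearest and farthest distance from a random query point. The helper name `nearest_to_farthest_ratio` is made up for illustration. As the number of features grows, the ratio tends toward 1, meaning the "nearest" neighbor is barely nearer than any other point.

```python
import numpy as np

rng = np.random.default_rng(123)


def nearest_to_farthest_ratio(n_features, n_points=500):
    """Ratio of the nearest to the farthest distance from one random query point."""
    points = rng.random((n_points, n_features))
    query = rng.random(n_features)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()


ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
for d, ratio in ratios.items():
    print(f'{d:>4} features: nearest/farthest = {ratio:.2f}')
```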

In [1]:
# Quiet my warnings for the sake of the lesson:
import warnings
warnings.filterwarnings("ignore")

# Tabular data friends:
import pandas as pd
import numpy as np

# Data viz:
import matplotlib.pyplot as plt
import seaborn as sns

# Sklearn stuff:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

# Data acquisition
from pydataset import data

## Acquire:

In [2]:
#bring in our csv
st_df = pd.read_csv('~/Downloads/space_titanic.csv')

In [3]:
#take a look at the data
st_df.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [4]:
#what kind of columns and dtypes are we dealing with?
st_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


## Prepare:

In [5]:
#it looks like most columns have some nulls; let's drop the rows that contain them
st_df = st_df.dropna()
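
Before dropping, it can help to see how many rows the drop will cost. The split shapes later in the lesson total 3698 + 1586 + 1322 = 6606 rows, so `dropna()` removed 2087 of the original 8693 rows here. A minimal sketch of that check on a toy frame (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with nulls scattered across columns
toy = pd.DataFrame({
    'Age': [39.0, np.nan, 58.0, 33.0],
    'Spa': [0.0, 549.0, np.nan, 3329.0],
    'VIP': [False, False, True, False],
})

# Nulls per column
print(toy.isna().sum())

# Rows before and after dropping any row that has at least one null
rows_before = len(toy)
rows_after = len(toy.dropna())
print(f'dropna removes {rows_before - rows_after} of {rows_before} rows')
```

Because each null can sit in a different row, the number of rows lost can be much larger than the null count of any single column.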

In [6]:
#create list of columns that we want to drop
columns_to_drop = ['Cabin', 'Name']

In [7]:
#drop those columns and save changes using inplace kwarg
st_df.drop(columns=columns_to_drop, inplace=True)

In [8]:
#create dummy columns for homeplanet and destination
dummies = pd.get_dummies(st_df[['HomePlanet', 'Destination']],drop_first=True)

In [9]:
#assign combined df to st_df
st_df = pd.concat([st_df, dummies], axis=1)

In [10]:
#make sure we have all the data
st_df.head()

| | PassengerId | HomePlanet | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | HomePlanet_Europa | HomePlanet_Mars | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False | 1 | 0 | 0 | 1 |
| 1 | 0002_01 | Earth | False | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True | 0 | 0 | 0 | 1 |
| 2 | 0003_01 | Europa | False | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False | 1 | 0 | 0 | 1 |
| 3 | 0003_02 | Europa | False | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | False | 1 | 0 | 0 | 1 |
| 4 | 0004_01 | Earth | False | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | True | 0 | 0 | 0 | 1 |
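
Notice that there is no `HomePlanet_Earth` column. A small sketch of why, using a toy frame: with `drop_first=True`, `get_dummies` drops the first category in sorted order, and that category becomes the baseline. The `dtype=int` argument is only there so the toy output shows 0/1 regardless of pandas version.

```python
import pandas as pd

toy = pd.DataFrame({'HomePlanet': ['Europa', 'Earth', 'Mars', 'Earth']})

# drop_first=True drops the first category in sorted order ('Earth'),
# so a row with zeros in every dummy column means Earth
dummies = pd.get_dummies(toy[['HomePlanet']], drop_first=True, dtype=int)
print(dummies)
```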


In [11]:
#let's drop the original columns
st_df.drop(columns=['HomePlanet', 'Destination'], inplace=True)

In [12]:
#convert the boolean target to string labels
st_df['Transported'] = np.where(st_df['Transported'] == True, 'True', 'False')
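
At this point `PassengerId` is still in the frame, and `CryoSleep` and `VIP` are still `object` columns according to the `info()` output above. One possible cleanup, which is not part of the original lesson, is to drop the identifier and cast the flags to integers so that every feature is explicitly numeric. The sketch below runs on a toy frame with hypothetical rows; applying the same steps to `st_df` would change the scores reported later.

```python
import pandas as pd

# Toy frame (hypothetical rows) mimicking the remaining non-numeric columns
toy = pd.DataFrame({
    'PassengerId': ['0001_01', '0002_01', '0003_01'],
    'CryoSleep': [False, True, False],
    'VIP': [False, False, True],
    'Age': [39.0, 24.0, 58.0],
}).astype({'CryoSleep': object, 'VIP': object})

# An identifier is typically not a meaningful distance feature, so drop it
features = toy.drop(columns=['PassengerId'])

# Cast True/False flags stored as objects to 0/1 integers
for col in ['CryoSleep', 'VIP']:
    features[col] = features[col].astype(int)

print(features.dtypes)
```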

## Split Data:

In [13]:
def train_validate_test_split(df, target, seed=123):
    '''
    This function takes in a dataframe, the name of the target variable
    (for stratification purposes), and an integer for setting a seed
    and splits the data into train, validate and test. 
    Test is 20% of the original dataset, validate is .30*.80= 24% of the 
    original dataset, and train is .70*.80= 56% of the original dataset. 
    The function returns, in this order, train, validate and test dataframes. 
    '''
    train_validate, test = train_test_split(df, test_size=0.2, 
                                            random_state=seed, 
                                            stratify=df[target])
    train, validate = train_test_split(train_validate, test_size=0.3, 
                                       random_state=seed,
                                       stratify=train_validate[target])
    return train, validate, test

In [14]:
#split the data using our function above (stratify on our target variable)
train, validate, test = train_validate_test_split(st_df, 'Transported')

In [15]:
#check the shape of each split
train.shape, validate.shape, test.shape

((3698, 14), (1586, 14), (1322, 14))
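
A quick arithmetic check confirms that these shapes match the 56% / 24% / 20% proportions described in the function's docstring. The row counts are copied from the output above.

```python
# Row counts taken from the shapes printed above
n_train, n_validate, n_test = 3698, 1586, 1322
total = n_train + n_validate + n_test

for name, n in [('train', n_train), ('validate', n_validate), ('test', n_test)]:
    print(f'{name:>8}: {n / total:.1%}')
```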

## Isolate the target

In [16]:
#pull our target out!
X_train = train.drop(columns=['Transported'])
#assign y
y_train = train['Transported']

#pull our target out!
X_val = validate.drop(columns=['Transported'])
#assign y
y_val = validate['Transported']

#pull our target out!
X_test = test.drop(columns=['Transported'])
#assign y
y_test = test['Transported']

<div class="alert alert-block alert-info">
<b>Instructor Note:</b>
<br>
<br>
1. Create the model
    <br>
2. Fit the model
    <br>
3. Use the model
</div>

### Create the model

In [17]:
# weights = ['uniform', 'distance']
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')
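
The `weights` argument controls how neighbors vote: `'uniform'` gives every neighbor one vote, while `'distance'` weights each vote by the inverse of its distance. A minimal sketch on toy data (the points and labels are hypothetical) where the two settings disagree:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (hypothetical): one 'A' point very close to the query,
# two 'B' points farther away
X = np.array([[0.0], [1.0], [1.1]])
y = np.array(['A', 'B', 'B'])
query = np.array([[0.1]])

predictions = {}
for weights in ['uniform', 'distance']:
    model = KNeighborsClassifier(n_neighbors=3, weights=weights).fit(X, y)
    predictions[weights] = model.predict(query)[0]
    print(weights, predictions[weights])
```

With uniform weights the two `'B'` points outvote the single `'A'`; with distance weights the very close `'A'` point wins.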

### Fit the model

In [18]:
knn.fit(X_train, y_train)

<b>Make the predictions</b>

In [19]:
#assign our predicted values to a variable
y_pred = knn.predict(X_train)

#call that variable
y_pred

array(['True', 'True', 'True', ..., 'False', 'True', 'True'], dtype=object)

<b>Estimate Probability</b>

In [20]:
#assign our probability to a variable
y_pred_proba = knn.predict_proba(X_train)

#call that variable
y_pred_proba

array([[0.4, 0.6],
       [0.2, 0.8],
       [0.2, 0.8],
       ...,
       [1. , 0. ],
       [0.2, 0.8],
       [0.4, 0.6]])
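
Every probability above is a multiple of 0.2 because, with `n_neighbors=5` and uniform weights, `predict_proba` reports the fraction of the five neighbors that belong to each class. A small sketch on toy data (hypothetical points) that reproduces the calculation by hand:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (hypothetical)
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [10.0]])
y = np.array(['False', 'False', 'True', 'True', 'True', 'False'])

model = KNeighborsClassifier(n_neighbors=5, weights='uniform').fit(X, y)
query = np.array([[2.0]])

# Indices of the 5 nearest training points, then their labels
_, neighbor_idx = model.kneighbors(query)
neighbor_labels = y[neighbor_idx[0]]

# Fraction of neighbors in each class, in the order of model.classes_
manual = [(neighbor_labels == c).mean() for c in model.classes_]

print(model.classes_)                 # column order of predict_proba
print(manual)                         # vote fractions computed by hand
print(model.predict_proba(query)[0])  # what the model reports
```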

### Evaluate the model

<b>Compute Accuracy</b>

In [21]:
#find our accuracy score for the train set
train_acc= knn.score(X_train, y_train)

#let's see what score we get
train_acc

0.7214710654407788

In [22]:
#make it look nicer with a print statement
print('Accuracy of KNN classifier on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))

Accuracy of KNN classifier on training set: 0.72


<b>Create confusion matrix</b>

In [23]:
print(confusion_matrix(y_train, y_pred))

[[1154  682]
 [ 348 1514]]


In [24]:
#crosstab of our target in the train set vs the predicted values
pd.crosstab(y_train, y_pred)

| Actual Transported | Predicted False | Predicted True |
|---|---|---|
| False | 1154 | 682 |
| True | 348 | 1514 |
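
The accuracy score can be recovered directly from the confusion matrix, along with precision and recall. The sketch below copies the counts from the output above and assumes `'True'` is treated as the positive class.

```python
import numpy as np

# Confusion matrix printed above: rows are actual, columns are predicted,
# in the label order ['False', 'True']
cm = np.array([[1154, 682],
               [348, 1514]])

# Assumption: 'True' (transported) is the positive class
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted 'True', how many were right
recall = tp / (tp + fn)      # of the actual 'True', how many were found

print(f'accuracy:  {accuracy:.4f}')
print(f'precision: {precision:.4f}')
print(f'recall:    {recall:.4f}')
```

The accuracy computed this way agrees with the value returned by `knn.score(X_train, y_train)`.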


## Validate Model
Validate is our out-of-sample data: the model did not see it during fitting.

<div class="alert alert-block alert-info">
<b>Instructor Note:</b>
<br>
<br>
Be sure to remind the student that they are <b>not</b> to refit the model
</div>

In [25]:
#what is our validate score?
val_acc = knn.score(X_val, y_val)
val_acc

0.5674653215636822

In [26]:
#Make it look nicer with a print statement
print('Accuracy of KNN classifier on validate set: {:.2f}'
     .format(knn.score(X_val, y_val)))

Accuracy of KNN classifier on validate set: 0.57
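
Since KNN is sensitive to scaling, one thing worth trying when validate accuracy lags is scaling the features before fitting. The sketch below is not the lesson's data: it builds a synthetic set with one informative small-scale feature and one large-scale noise feature, then compares plain KNN against a pipeline with `StandardScaler`. It illustrates the effect only and says nothing about what the lesson's scores would become.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(123)
n = 600

# Synthetic data (hypothetical): one informative feature on a small scale,
# one pure-noise feature on a much larger scale
y = rng.integers(0, 2, size=n)
informative = rng.normal(loc=3.0 * y, scale=1.0)
noise = rng.normal(loc=0.0, scale=1000.0, size=n)
X = np.column_stack([informative, noise])

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3,
                                          random_state=123, stratify=y)

# Same model, with and without scaling; the scaler is fit on train only
unscaled = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
scaled = make_pipeline(StandardScaler(),
                       KNeighborsClassifier(n_neighbors=5)).fit(X_tr, y_tr)

unscaled_acc = unscaled.score(X_va, y_va)
scaled_acc = scaled.score(X_va, y_va)
print(f'unscaled validate accuracy: {unscaled_acc:.2f}')
print(f'scaled validate accuracy:   {scaled_acc:.2f}')
```

Putting the scaler inside a pipeline means it learns its mean and standard deviation from the training split only, in keeping with the rule of never fitting on validate.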


## Find the best value for K

In [27]:
# iteration:
model_set = []
model_accuracies = {}
for i in range(1,10):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    model_set.append(knn)
    model_accuracies[f'{i}_neighbors'] = {
        'train_score': round(knn.score(X_train, y_train), 2),
        'validate_score': round(knn.score(X_val, y_val), 2)}

In [28]:
#Let's look at the train and validate scores for each value of k
model_accuracies

{'1_neighbors': {'train_score': 1.0, 'validate_score': 0.61},
 '2_neighbors': {'train_score': 0.85, 'validate_score': 0.6},
 '3_neighbors': {'train_score': 0.8, 'validate_score': 0.59},
 '4_neighbors': {'train_score': 0.77, 'validate_score': 0.58},
 '5_neighbors': {'train_score': 0.72, 'validate_score': 0.57},
 '6_neighbors': {'train_score': 0.72, 'validate_score': 0.56},
 '7_neighbors': {'train_score': 0.7, 'validate_score': 0.55},
 '8_neighbors': {'train_score': 0.7, 'validate_score': 0.54},
 '9_neighbors': {'train_score': 0.67, 'validate_score': 0.54}}

In [29]:
#Same idea over a wider range of k, collected into a dataframe
#Create a loop to go through n_neighbors and compare train to validate
metrics = []

for i in range(1, 25):
    # Make the model
    knn = KNeighborsClassifier(n_neighbors=i)

    # Fit the model (on train and only train)
    knn = knn.fit(X_train, y_train)

    # Use the model
    # We'll evaluate the model's performance on train first (in sample)
    in_sample_accuracy = knn.score(X_train, y_train)

    # Then on validate (out of sample)
    out_of_sample_accuracy = knn.score(X_val, y_val)

    output = {
        "n_neighbors": i,
        "train_accuracy": in_sample_accuracy,
        "validate_accuracy": out_of_sample_accuracy
    }
    
    metrics.append(output)
    
df = pd.DataFrame(metrics)
df

| | n_neighbors | train_accuracy | validate_accuracy |
|---|---|---|---|
| 0 | 1 | 1.0 | 0.607188 |
| 1 | 2 | 0.846944 | 0.596469 |
| 2 | 3 | 0.799081 | 0.586381 |
| 3 | 4 | 0.765279 | 0.578815 |
| 4 | 5 | 0.721471 | 0.567465 |
| 5 | 6 | 0.723364 | 0.561791 |
| 6 | 7 | 0.702001 | 0.554224 |
| 7 | 8 | 0.695511 | 0.535939 |
| 8 | 9 | 0.673607 | 0.539723 |
| 9 | 10 | 0.661168 | 0.542875 |
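
One way to read a table like this is to add the gap between train and validate accuracy and then compare candidate values of k. The sketch below rebuilds only the ten rows shown above (the name `metrics_df` is made up for illustration) and reports two possible picks: the highest validate accuracy and the smallest gap. Which criterion to prefer is a judgment call.

```python
import pandas as pd

# The ten rows of the metrics table shown above
metrics_df = pd.DataFrame({
    'n_neighbors': range(1, 11),
    'train_accuracy': [1.0, 0.846944, 0.799081, 0.765279, 0.721471,
                       0.723364, 0.702001, 0.695511, 0.673607, 0.661168],
    'validate_accuracy': [0.607188, 0.596469, 0.586381, 0.578815, 0.567465,
                          0.561791, 0.554224, 0.535939, 0.539723, 0.542875],
})

# Gap between train and validate: a large gap suggests overfitting
metrics_df['difference'] = (metrics_df['train_accuracy']
                            - metrics_df['validate_accuracy'])

# Row with the highest validate accuracy
best = metrics_df.loc[metrics_df['validate_accuracy'].idxmax()]
print(best)

# Row with the smallest train/validate gap
smallest_gap = metrics_df.loc[metrics_df['difference'].idxmin()]
print(smallest_gap)
```

In these rows the two criteria disagree: k=1 has the highest validate accuracy, but its perfect train score means it has memorized the training set, while k=10 has the smallest gap at a lower validate accuracy.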
