# KNN with sklearn - Lab

## Introduction

In this lab, we'll learn how to use sklearn's implementation of a KNN classifier  on some real world datasets!

## Objectives

You will be able to:

* Use KNN to make classification predictions on a real-world dataset
* Perform a parameter search for 'k' to optimize model performance
* Evaluate model performance and interpret results

## Getting Started

In this lab, we'll make use of sklearn's implementation of the **_K-Nearest Neighbors_** algorithm. We'll use it to make predictions on the Titanic dataset. 

We'll start by importing the dataset, and then deal with preprocessing steps such as removing unnecessary columns and normalizing our dataset.

You'll find the titanic dataset stored in the `titanic.csv` file. In the cell below:

* Import pandas and set the standard alias.
* Read in the data from `titanic.csv` and store it in a pandas DataFrame. 
* Print the head of the DataFrame to ensure everything loaded correctly.

In [1]:
import pandas as pd

In [2]:
titanic = pd.read_csv('titanic.csv')

In [3]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Great! Now, we'll preprocess our data to get it ready for use with a KNN classifier.

## Preprocessing Our Data

This stage should be pretty familiar to you by now. Although it's not the fun part of machine learning, it's good practice to get used to it.  Although it isn't as fun or exciting as training machine learning algorithms, it's a very large, very important part of the Data Science Process. As a Data Scientist, you'll often spend the majority of your time wrangling and preprocessing, just to get it ready for use with supervised learning algorithms. 

Since you've done this before, you should be able to do this quite well yourself without much hand holding by now. 

In the cells below, complete the following steps:

1. Remove unnecessary columns (PassengerId, Name, Ticket, and Cabin).
2. Convert `Sex` to a binary encoding, where female is `0` and male is `1`.
3. Detect and deal with any null values in the dataset. 
    * For `Age`, replace null values with the median age for the dataset. 
    * For `Embarked`, drop the rows that contain null values
4. One-Hot Encode categorical columns such as `Embarked`.
5. Store our target column, `Survived`, in a separate variable and remove it from the DataFrame. 

#### Remove unnecessary columns (PassengerId, Name, Ticket, and Cabin).
- `PassengerId` - not needed because it's simply an identifier with no real predictive value.
- `Name` - Same reason as `PassengerId`
- `Ticket` - This looks to be the actual ticket number. Kind of similar to `PassengerId`. Might be able to feature engineer something out of it, but we'll follow directions :)
- `Cabin` - I'm guessing they're suggesting to drop this because, at first glance, there appears to be missing values.

In [4]:
# Drop the directed columns
drop_these = ['PassengerId', 'Name', 'Ticket', 'Cabin']
titanic.drop(drop_these, axis=1, inplace=True)

# Check
titanic.columns

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
       'Embarked'],
      dtype='object')

Looks good.

#### Convert `Sex` to a binary encoding, where female is `0` and male is `1`.

In [5]:
# First let's make sure of the values of Sex
titanic['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

Okay, we're sure there are no nulls. Let's proceed with making a dictionary map and replacing the values as directed.

In [6]:
gender_map = {
    'female': 0
    , 'male': 1
}

titanic['Sex'] = titanic['Sex'].map(gender_map)

In [7]:
# Check that changes took place
titanic['Sex'].value_counts()

1    577
0    314
Name: Sex, dtype: int64

**Detect and deal with any null values in the dataset.**  
**- For Age, replace null values with the median age for the dataset.**  
**- For Embarked, drop the rows that contain null values**  

In [8]:
# Let's take care of age. Let's get lay of the land first.
titanic['Age'].value_counts()

24.00    30
22.00    27
18.00    26
19.00    25
30.00    25
28.00    25
21.00    24
25.00    23
36.00    22
29.00    20
32.00    18
27.00    18
35.00    18
26.00    18
16.00    17
31.00    17
20.00    15
33.00    15
23.00    15
34.00    15
39.00    14
17.00    13
42.00    13
40.00    13
45.00    12
38.00    11
50.00    10
2.00     10
4.00     10
47.00     9
         ..
71.00     2
59.00     2
63.00     2
0.83      2
30.50     2
70.00     2
57.00     2
0.75      2
13.00     2
10.00     2
64.00     2
40.50     2
32.50     2
45.50     2
20.50     1
24.50     1
0.67      1
14.50     1
0.92      1
74.00     1
34.50     1
80.00     1
12.00     1
36.50     1
53.00     1
55.50     1
70.50     1
66.00     1
23.50     1
0.42      1
Name: Age, Length: 88, dtype: int64

In [9]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null int64
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Embarked    889 non-null object
dtypes: float64(2), int64(5), object(1)
memory usage: 55.8+ KB


In [10]:
import numpy as np


In [11]:
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())

In [12]:
# Let's drop NA Embarked
titanic.dropna(subset=['Embarked'], inplace=True)

In [13]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 8 columns):
Survived    889 non-null int64
Pclass      889 non-null int64
Sex         889 non-null int64
Age         889 non-null float64
SibSp       889 non-null int64
Parch       889 non-null int64
Fare        889 non-null float64
Embarked    889 non-null object
dtypes: float64(2), int64(5), object(1)
memory usage: 62.5+ KB


#### One-Hot Encode categorical columns such as Embarked.

In [14]:
titanic.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,22.0,1,0,7.25,S
1,1,1,0,38.0,1,0,71.2833,C
2,1,3,0,26.0,0,0,7.925,S
3,1,1,0,35.0,1,0,53.1,S
4,0,3,1,35.0,0,0,8.05,S


In [15]:
titanic['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [16]:
titanic_ohe = pd.get_dummies(titanic, drop_first=True)

In [17]:
titanic_ohe.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_Q,Embarked_S
0,0,3,1,22.0,1,0,7.25,0,1
1,1,1,0,38.0,1,0,71.2833,0,0
2,1,3,0,26.0,0,0,7.925,0,1
3,1,1,0,35.0,1,0,53.1,0,1
4,0,3,1,35.0,0,0,8.05,0,1


#### Store our target column, Survived, in a separate variable and remove it from the DataFrame.

In [18]:
y = titanic_ohe['Survived']

In [19]:
X = titanic_ohe.drop(['Survived'], axis=1)

In [20]:
X.columns

Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_Q',
       'Embarked_S'],
      dtype='object')

## Normalizing Our Data

Good job preprocessing our data! This can seem tedious, but its a very important foundational skill in any Data Science toolbox. The final step we we'll take in our preprocessing efforts is to **_Normalize_** our data. Recall that normalization (also sometimes called **_Standardization_** or **_Scaling_**) means making sure that all of our data is represented at the same scale.  The most common way to do this is to convert all numerical values to z-scores. 

Since KNN is a distance-based classifier, data on different scales can negatively affect the results of our model! Predictors on much larger scales will overwhelm data with much smaller scales, because euclidean distance is going to treat them as the same.

To scale our data, we'll make use of the `StandardScaler` object found inside the `sklearn.preprocessing` module. 

In the cell below:

* Import and instantiate a `StandardScaler` object. 
* Use the scaler's `.fit_transform()` method to create a scaled version of our dataset. 
* The result returned by the `fit_transform` call will be a numpy array, not a pandas DataFrame. Create a new pandas DataFrame out of this object called `scaled_df`. To set the column names back to their original state, set the `columns` parameter to `one_hot_df.columns`.
* Print out the head of `scaled_df` to ensure everything worked correctly.

#### Import and instantiate a StandardScaler object.

In [21]:
from sklearn.preprocessing import StandardScaler

In [22]:
# Dont forget to import!
scaler = StandardScaler()
scaled_data = scaler.fit_transform(X)

  return self.partial_fit(X, y)
  return self.fit(X, **fit_params).transform(X)


In [23]:
X.columns

Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_Q',
       'Embarked_S'],
      dtype='object')

In [24]:
scaled_df = pd.DataFrame(scaled_data, columns=X.columns)
scaled_df.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_Q,Embarked_S
0,0.825209,0.735342,-0.563674,0.43135,-0.474326,-0.50024,-0.307941,0.616794
1,-1.572211,-1.359911,0.669217,0.43135,-0.474326,0.788947,-0.307941,-1.621287
2,0.825209,-1.359911,-0.255451,-0.475199,-0.474326,-0.48665,-0.307941,0.616794
3,-1.572211,-1.359911,0.43805,0.43135,-0.474326,0.422861,-0.307941,0.616794
4,0.825209,0.735342,0.43805,-0.475199,-0.474326,-0.484133,-0.307941,0.616794


You may have noticed that the scaler also scaled our binary/one-hot encoded columns, too! Although it doesn't look as pretty, this has no negative effect on our model. Each 1 and 0 have been replaced with corresponding decimal values, but each binary column still only contains 2 values, meaning the overall information content of each column has not changed. 

#### Creating Training and Testing Sets

Now that we've preprocessed our data, the only step remaining is to split our data into training and testing sets. 

In the cell below:

* Import `train_test_split` from the `sklearn.model_selection` module
* Use `train_test_split` to split our data into training and testing sets, with a `test_size` of `0.25`.

In [25]:
from sklearn.model_selection import train_test_split

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

## Creating and Fitting our KNN Model

Now that we've preprocessed our data successfully, it's time for the fun stuff--let's create a KNN classifier and use it to make predictions on our dataset!  Since you've got some experience on this part from when we built our own model, we won't hold your hand through section. 

In the cells below:

* Import `KNeighborsClassifier` from the `sklearn.neighbors` module.
* Instantiate a classifier. For now, we'll just use the default parameters. 
* Fit the classifier to our training data/labels
* Use the classifier to generate predictions on our testing data. Store these predictions inside the variable `test_preds`.

In [27]:
from sklearn.neighbors import KNeighborsClassifier

In [28]:
knn = KNeighborsClassifier()

In [29]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

In [30]:
test_preds = knn.predict(X_test)

Now, in the cells below, import all the necessary evaluation metrics from `sklearn.metrics` abd then complete the following `print_metrics()` function so that it prints out **_Precision, Recall, Accuracy,_** and **_F1-Score_** when given a set of `labels` and `preds`. 

Then, use it to print out the evaluation metrics for our test predictions stored in `test_preds`, and the corresponding labels in `y_test`.

In [31]:
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

In [32]:
def print_metrics(labels, preds):
    print("Precision Score: {}".format(precision_score(y_test, test_preds)))
    print("Recall Score: {}".format(recall_score(y_test, test_preds)))
    print("Accuracy Score: {}".format(accuracy_score(y_test, test_preds)))
    print("F1 Score: {}".format(f1_score(y_test, test_preds)))
    
print_metrics(y_test, test_preds)

Precision Score: 0.6428571428571429
Recall Score: 0.5421686746987951
Accuracy Score: 0.7174887892376681
F1 Score: 0.5882352941176471


**_QUESTION:_** Interpret each of the metrics above, and explain what they tell us about our model's capabilities. If you had to pick one score to best describe the performance of the model, which would you choose? Explain your answer.

Write your answer below this line:
________________________________________________________________________________



## Improving Model Performance

Our overall model results are better than random chance, but not by a large margin. For the remainder of this notebook, we'll focus on improving model performance. This is also a big part of the Data Science Process--your first fit is almost never your best. Modeling is an **_iterative process_**, meaning that we should make small incremental changes to our model and use our intuition to see if we can improve the overall performance. 

First, we'll start off by trying to find the optimal number of neighbors to use for our classifier. To do this, we'll write a quick function that iterates over multiple values of k and finds the one that returns the best overall performance. 

In the cell below, complete the `find_best_k()` function.  This function should:

* take in six parameters:
    * `X_train`, `y_train`, `X_test`, and  `y_test`
    * `min_k` and `max_k`. Set these to `1` and `25`, by default
* Create two variables, `best_k` and `best_score`
* Iterate through every **_odd number_** between `min_k` and `max_k + 1`. 
* For each iteration:
    * Create a new KNN classifier, and set the `n_neighbors` parameter to the current value for k, as determined by our loop.
    * Fit this classifier to the training data.
    * Generate predictions for `X_test` using the fitted classifier.
    * Calculate the **_F1-score_** for these predictions.
    * Compare this F1-score to `best_score`. If better, update `best_score` and `best_k`.
* Once it has checked every value for `k`, print out the best value for k and the F1-score it achieved.

In [33]:
def find_best_k(X_train, y_train, X_test, y_test, min_k=1, max_k=25):

    best_score = 0
    best_k = 0
    
    for k in range(min_k, max_k+1):
        if k % 2 == 0: # Even number
            pass
        else: 
            knn = KNeighborsClassifier(n_neighbors=k)
            knn.fit(X_train, y_train)
            test_preds = knn.predict(X_test)
            f1 = f1_score(y_test, test_preds)
            if f1 > best_score:
                best_score = f1
                best_k = k  
            else:
                pass
            
    print('Best k: {}\nf1: {}'.format(best_k, best_score))

In [34]:
find_best_k(X_train, y_train, X_test, y_test)
# Expected Output:

# Best Value for k: 3
# F1-Score: 0.6444444444444444

Best k: 7
f1: 0.6274509803921569


We improved our model performance by over 4 percent just by finding an optimal value for k. Good job! There are other parameters in the model that you can also tune. In a later section, we'll cover how we can automate the parameter search process using a technique called **_Grid Search_**. For, try playing around with the different options for parameters, and seeing how it affects model performance. For a full list of model parameters, see the [sklearn documentation !](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

## (Optional) Level Up: Iterating on the Data

As an optional (but recommended!) exercise, think about the decisions we made during the preprocessing steps that could have affected our overall model performance. For instance, we replaced missing age values with the column median. Could this have affected ourn overall performance? How might the model have fared if we had just dropped those rows, instead of using the column median? What if we reduced dimensionality by ignoring some less important columns altogether?

In the cells below, revisit your preprocessing stage and see if you can improve the overall results of the classifier by doing things differently. Perhaps you should consider dropping certain columns, or dealing with null values differently, or even using a different sort of scaling (or none at all!). Try a few different iterations on the preprocessing and see how it affects the overall performance of the model. The `find_best_k` function handles all of the fitting--use this to iterate quickly as you try different strategies for dealing with data preprocessing! 


## Summary

Good job! This concludes today's section!