# KNN with scikit-learn - Lab

## Introduction

In this lab, I'll learn how to use scikit-learn's implementation of a KNN classifier on the classic Titanic dataset from Kaggle!
 

## Objectives

In this lab I will:

- Conduct a parameter search to find the optimal value for K 
- Use a KNN classifier to generate predictions on a real-world dataset 
- Evaluate the performance of a KNN model  


## Getting Started

Start by importing the dataset, stored in the `titanic.csv` file, and previewing it.

In [1]:
# Import pandas and set the standard alias 
import pandas as pd


# Import the data from 'titanic.csv' and store it in a pandas DataFrame 
raw_df = pd.read_csv('titanic.csv')

# Print the head of the DataFrame to ensure everything loaded correctly 
raw_df.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Great!  Next, let's perform some preprocessing steps such as removing unnecessary columns and normalizing features.

## Preprocessing the data

Preprocessing is an essential component in any data science pipeline. It's not always the most glamorous task as might be an engaging data visual or impressive neural network, but cleaning and normalizing raw datasets is very essential to produce useful and insightful datasets that form the backbone of all data powered projects. This can include changing column types, as in: 


In [2]:
# Drop the unnecessary columns
df = raw_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


In [3]:
# Convert Sex to binary encoding
df['Sex'] = df['Sex'].map({"male": 1, "female": 0})
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,22.0,1,0,7.25,S
1,1,1,0,38.0,1,0,71.2833,C
2,1,3,0,26.0,0,0,7.925,S
3,1,1,0,35.0,1,0,53.1,S
4,0,3,1,35.0,0,0,8.05,S


In [4]:
# Find the number of missing values in each column
missing = df.isnull().sum()
missing


Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

In [5]:
# Impute the missing values in 'Age'
df['Age'] = df['Age'].fillna(df['Age'].median())
df.isna().sum()

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    2
dtype: int64

In [6]:
# Drop the rows missing values in the 'Embarked' column
df = df.dropna(axis=0)
df.isna().sum()

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

In [7]:
df.dtypes

Survived      int64
Pclass        int64
Sex           int64
Age         float64
SibSp         int64
Parch         int64
Fare        float64
Embarked     object
dtype: object

In [8]:
# One-hot encode the categorical columns
one_hot_df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
one_hot_df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_Q,Embarked_S
0,0,3,1,22.0,1,0,7.25,0,1
1,1,1,0,38.0,1,0,71.2833,0,0
2,1,3,0,26.0,0,0,7.925,0,1
3,1,1,0,35.0,1,0,53.1,0,1
4,0,3,1,35.0,0,0,8.05,0,1


In [9]:
# Assign the 'Survived' column to labels
labels = one_hot_df['Survived']

# Drop the 'Survived' column from one_hot_df
one_hot_df = one_hot_df.drop('Survived', axis=1)
one_hot_df.head()


Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_Q,Embarked_S
0,3,1,22.0,1,0,7.25,0,1
1,1,0,38.0,1,0,71.2833,0,0
2,3,0,26.0,0,0,7.925,0,1
3,1,0,35.0,1,0,53.1,0,1
4,3,1,35.0,0,0,8.05,0,1


## Create training and test sets

Now that i've preprocessed the data, it's time to split it into training and test sets. 

In the cell below I will:

* Import `train_test_split` from the `sklearn.model_selection` module 
* Use `train_test_split()` to split the data into training and test sets, with a `test_size` of `0.25`. Set the `random_state` to 42 

In [10]:
# Import train_test_split 
from sklearn.model_selection import train_test_split


# Split the data
X_train, X_test, y_train, y_test = train_test_split(one_hot_df, labels, test_size=0.25, random_state =42)

In [11]:
y_train.head()

376    1
458    1
732    0
507    1
830    1
Name: Survived, dtype: int64

## Normalizing the data

The final step in your preprocessing efforts for this lab is to **_normalize_** the data. We normalize **after** splitting our data into training and test sets. This is to avoid information "leaking" from our test set into our training set. Remember that normalization (also sometimes called **_Standardization_** or **_Scaling_**) means making sure that all of your data is represented at the same scale. The most common way to do this is to convert all numerical values to z-scores. 

Since KNN is a distance-based classifier, if data is in different scales, then larger scaled features have a larger impact on the distance between points.

To scale your data, use `StandardScaler` found in the `sklearn.preprocessing` module. 

In the cell below I will:

* Import and instantiate `StandardScaler` 
* Use the scaler's `.fit_transform()` method to create a scaled version of the training dataset  
* Use the scaler's `.transform()` method to create a scaled version of the test dataset  
* The result returned by `.fit_transform()` and `.transform()` methods will be numpy arrays, not a pandas DataFrame. Create a new pandas DataFrame out of this object called `scaled_df`. To set the column names back to their original state, set the `columns` parameter to `one_hot_df.columns` 
* Print the head of `scaled_df` to ensure everything worked correctly 

In [12]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler


# Instantiate StandardScaler
scaler = StandardScaler()

# Transform the training and test sets
scaled_data_train = scaler.fit_transform(X_train)
scaled_data_test = scaler.fit_transform(X_test)

# Convert into a DataFrame
scaled_df_train = pd.DataFrame(scaled_data_train, columns=X_train.columns)
scaled_df_train.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_Q,Embarked_S
0,0.815528,-1.390655,-0.575676,-0.474917,-0.480663,-0.500108,-0.311768,0.620174
1,-0.386113,-1.390655,1.550175,-0.474917,-0.480663,-0.435393,-0.311768,0.620174
2,-0.386113,0.719086,-0.120137,-0.474917,-0.480663,-0.644473,-0.311768,0.620174
3,-1.587755,0.719086,-0.120137,-0.474917,-0.480663,-0.115799,-0.311768,0.620174
4,0.815528,-1.390655,-1.107139,0.413551,-0.480663,-0.356656,-0.311768,-1.612452


In [13]:
scaled_df_test = pd.DataFrame(scaled_data_test, columns=X_test.columns)
scaled_df_test.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_Q,Embarked_S
0,0.854553,0.78482,-0.041886,-0.477737,-0.455288,-0.488543,-0.296319,0.606711
1,-1.527513,-1.274178,-1.175751,0.490767,2.115072,1.84885,-0.296319,0.606711
2,0.854553,-1.274178,-1.175751,0.490767,-0.455288,-0.417939,-0.296319,-1.648231
3,-0.33648,0.78482,0.120095,-0.477737,-0.455288,-0.381292,-0.296319,0.606711
4,-1.527513,-1.274178,-0.85179,-0.477737,2.115072,1.007857,-0.296319,0.606711


You may have noticed that the scaler also scaled our binary/one-hot encoded columns, too! Although it doesn't look as pretty, this has no negative effect on the model. Each 1 and 0 have been replaced with corresponding decimal values, but each binary column still only contains 2 values, meaning the overall information content of each column has not changed.

## Fit a KNN model

Now that I've preprocessed the data it's time to train a KNN classifier and validate its accuracy. 

In the cells below:

* Import `KNeighborsClassifier` from the `sklearn.neighbors` module 
* Instantiate the classifier. For now, you can just use the default parameters  
* Fit the classifier to the training data/labels
* Use the classifier to generate predictions on the test data. Store these predictions inside the variable `test_preds` 

In [14]:
# Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier


# Instantiate KNeighborsClassifier
clf = KNeighborsClassifier()

# Fit the classifier
clf.fit(X_train, y_train)

# Predict on the test set
test_preds = clf.predict(X_test)


## Evaluate the model

Now, in the cells below, I will import all the necessary evaluation metrics from `sklearn.metrics` and complete the `print_metrics()` function so that it prints out **_Precision, Recall, Accuracy, and F1-Score_** when given a set of `labels` (the true values) and `preds` (the models predictions). 

Finally, use `print_metrics()` to print the evaluation metrics for the test predictions stored in `test_preds`, and the corresponding labels in `y_test`. 

In [15]:
# Import the necessary functions
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score



In [16]:
def print_metrics(labels, preds):
    precision = precision_score(y_test, test_preds)
    recall = recall_score(y_test, test_preds)
    accuracy = accuracy_score(y_test, test_preds)
    f1 = f1_score(y_test, test_preds)
    print("Precision Score: {:4f}".format(precision))
    print("Recall Score: {:4f}".format(recall))
    print("Accuracy Score: {:4f}".format(accuracy))
    print("F1 Score: {:4f}".format(f1))
    
print_metrics(y_test, test_preds)

Precision Score: 0.602740
Recall Score: 0.536585
Accuracy Score: 0.699552
F1 Score: 0.567742


## Improve model performance

While my overall model results should be better than random chance, they're probably mediocre at best given that you haven't tuned the model yet. For the remainder of this notebook, you'll focus on improving your model's performance. Remember that modeling is an **_iterative process_**, and developing a baseline out of the box model such as the one above is always a good start. 

First, try to find the optimal number of neighbors to use for the classifier. To do this, complete the `find_best_k()` function below to iterate over multiple values of K and find the value of K that returns the best overall performance. 

The function takes in six arguments:
* `X_train`
* `y_train`
* `X_test`
* `y_test`
* `min_k` (default is 1)
* `max_k` (default is 25)

In [17]:
def find_best_k(X_train, y_train, X_test, y_test, min_k=1, max_k=25):
    best_k = None
    best_score = 0.0

    for k in range(min_k, max_k+1, 2):
        # Create a new KNN classifier with current k value
        clf = KNeighborsClassifier(n_neighbors=k)

        # Fit the classifier to the training data
        clf.fit(X_train, y_train)

        # Generate predictions for X_test using the fitted classifier
        test_preds = clf.predict(X_test)

        # Calculate the F1-score for these predictions
        f1 = f1_score(y_test, test_preds)

        # Compare F1-score to best_score, update if better
        if f1 > best_score:
            best_score = f1
            best_k = k

    # Print the best value for k and the corresponding F1-score
    print("Best k:", best_k)
    print("Best F1-score:", best_score)

# Call the function with your data and desired range for k
find_best_k(X_train, y_train, X_test, y_test, min_k=1, max_k=25)


Best k: 11
Best F1-score: 0.6363636363636364


In [18]:
find_best_k(scaled_data_train, y_train, scaled_data_test, y_test)

Best k: 13
Best F1-score: 0.7407407407407408


## Summary

Well done to me! In this lab, I worked with the classic Titanic dataset and practiced fitting and tuning KNN classification models using scikit-learn! As always, this gave me another opportunity to continue practicing my data wrangling skills and model tuning skills using Pandas and scikit-learn!