# KNN with scikit-learn - Lab

## Introduction

In this lab, you'll learn how to use scikit-learn's implementation of a KNN classifier on the classic Titanic dataset from Kaggle!
 

## Objectives

In this lab you will:

- Conduct a parameter search to find the optimal value for K 
- Use a KNN classifier to generate predictions on a real-world dataset 
- Evaluate the performance of a KNN model  


## Getting Started

Start by importing the dataset, stored in the `titanic.csv` file, and previewing it.

In [1]:
# Import pandas for data manipulation
import pandas as pd

# Load the Titanic dataset
titanic_data_path = 'titanic.csv'  # Adjust the path if needed
raw_df = pd.read_csv(titanic_data_path)

# Display the first few rows of the dataset
print(raw_df.head())


   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


Great!  Next, you'll perform some preprocessing steps such as removing unnecessary columns and normalizing features.

## Preprocessing the data

Preprocessing is an essential component of any data science pipeline. It is rarely as glamorous as an engaging data visual or an impressive neural network, but cleaning and normalizing raw data is what produces the useful, reliable datasets that form the backbone of every data-powered project. This can include changing column types, as in:


```python
df['col_name'] = df['col_name'].astype('int')
```
Or extracting subsets of information, such as: 

```python
import re
df['street'] = df['address'].map(lambda x: re.findall('(.*)?\n', x)[0])
```

> **Note:** While outside the scope of this particular lesson, **regular expressions** (mentioned above) are powerful tools for pattern matching! See the [regular expressions official documentation here](https://docs.python.org/3.6/library/re.html). 

Since you've done this before, you should be able to work through these steps without much guidance. In the cells below, complete the following steps:

1. Remove unnecessary columns (`'PassengerId'`, `'Name'`, `'Ticket'`, and `'Cabin'`) 
2. Convert `'Sex'` to a binary encoding, where female is `0` and male is `1` 
3. Detect and deal with any missing values in the dataset:  
    * For `'Age'`, replace missing values with the median age for the dataset  
    * For `'Embarked'`, drop the rows that contain missing values
4. One-hot encode categorical columns such as `'Embarked'` 
5. Store the target column, `'Survived'`, in a separate variable and remove it from the DataFrame  

Data leakage is always a concern, which is why the train-test split typically comes before preprocessing. For this dataset, however, we'll do some of the preprocessing first. The reason is that some variables have values that appear in only a handful of rows, and we want to make sure we don't lose any of them.

In [2]:
# Drop the unnecessary columns
df = raw_df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])

# Display the first few rows of the modified DataFrame
print(df.head())


   Survived  Pclass     Sex   Age  SibSp  Parch     Fare Embarked
0         0       3    male  22.0      1      0   7.2500        S
1         1       1  female  38.0      1      0  71.2833        C
2         1       3  female  26.0      0      0   7.9250        S
3         1       1  female  35.0      1      0  53.1000        S
4         0       3    male  35.0      0      0   8.0500        S


In [3]:
# Convert 'Sex' to binary encoding
df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})

# Display the first few rows to confirm the changes
print(df.head())


   Survived  Pclass  Sex   Age  SibSp  Parch     Fare Embarked
0         0       3    1  22.0      1      0   7.2500        S
1         1       1    0  38.0      1      0  71.2833        C
2         1       3    0  26.0      0      0   7.9250        S
3         1       1    0  35.0      1      0  53.1000        S
4         0       3    1  35.0      0      0   8.0500        S


In [None]:
# Find the number of missing values in each column
missing_values = df.isnull().sum()

# Display the count of missing values for each column
print(missing_values)


Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64


In [7]:
# Impute the missing values in 'Age' with the median age
df['Age'] = df['Age'].fillna(df['Age'].median())

# Verify that there are no missing values in 'Age'
print(df.isna().sum())


Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    2
dtype: int64


In [8]:
# Drop rows with missing values in the 'Embarked' column
df = df.dropna(subset=['Embarked'])

# Verify that there are no missing values in 'Embarked'
print(df.isna().sum())


Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64


In [9]:
# One-hot encode the 'Embarked' column
one_hot_df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

# Display the first few rows of the resulting DataFrame
print(one_hot_df.head())


   Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked_Q  Embarked_S
0         0       3    1  22.0      1      0   7.2500       False        True
1         1       1    0  38.0      1      0  71.2833       False       False
2         1       3    0  26.0      0      0   7.9250       False        True
3         1       1    0  35.0      1      0  53.1000       False        True
4         0       3    1  35.0      0      0   8.0500       False        True


In [10]:
# Assign the 'Survived' column to labels
labels = one_hot_df['Survived']

# Drop the 'Survived' column from one_hot_df
one_hot_df = one_hot_df.drop(columns=['Survived'])

# Verify that the 'Survived' column has been separated
print("Labels:")
print(labels.head())
print("\nFeatures:")
print(one_hot_df.head())


Labels:
0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

Features:
   Pclass  Sex   Age  SibSp  Parch     Fare  Embarked_Q  Embarked_S
0       3    1  22.0      1      0   7.2500       False        True
1       1    0  38.0      1      0  71.2833       False       False
2       3    0  26.0      0      0   7.9250       False        True
3       1    0  35.0      1      0  53.1000       False        True
4       3    1  35.0      0      0   8.0500       False        True


## Create training and test sets

Now that you've preprocessed the data, it's time to split it into training and test sets. 

In the cell below:

* Import `train_test_split` from the `sklearn.model_selection` module 
* Use `train_test_split()` to split the data into training and test sets, with a `test_size` of `0.2`. Set the `random_state` to 42 

In [11]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    one_hot_df, labels, test_size=0.2, random_state=42
)

# Verify the shapes of the resulting datasets
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")


X_train shape: (711, 8)
X_test shape: (178, 8)
y_train shape: (711,)
y_test shape: (178,)


## Normalizing the data

The final step in your preprocessing efforts for this lab is to **_normalize_** the data. We normalize **after** splitting our data into training and test sets. This avoids information "leaking" from our test set into our training set (read more about data leakage [here](https://machinelearningmastery.com/data-leakage-machine-learning/)). Remember that normalization (also sometimes called **_Standardization_** or **_Scaling_**) means making sure that all of your features are represented on the same scale. The most common way to do this is to convert all numerical values to z-scores, that is, to subtract each column's mean and divide by its standard deviation.
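
As a minimal sketch of what a z-score is, the snippet below standardizes a toy column by hand and compares the result with `StandardScaler`. The numbers are illustrative values, not the lab's data, and the statistics come from the training rows only, which is the point of scaling after the split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and test columns (illustrative values, not from the lab)
toy_train = np.array([[22.0], [38.0], [26.0], [35.0]])
toy_test = np.array([[30.0]])

# Manual z-score using statistics computed from the training rows only
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)  # population standard deviation (ddof=0)
manual_train = (toy_train - mean) / std
manual_test = (toy_test - mean) / std

# StandardScaler learns the same mean and standard deviation in fit_transform
toy_scaler = StandardScaler()
scaled_train = toy_scaler.fit_transform(toy_train)
scaled_test = toy_scaler.transform(toy_test)  # reuses the training statistics

print(np.allclose(manual_train, scaled_train))
print(np.allclose(manual_test, scaled_test))
```

Both comparisons should agree, because `.transform()` applies the mean and standard deviation learned from the training data rather than recomputing them on the test data.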

Since KNN is a distance-based classifier, features measured on larger scales have a disproportionately large impact on the distance between points when the data is left unscaled.
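
To make this concrete, here is a small sketch with made-up `(Sex, Fare)` rows (the values and the `euclidean` helper are assumptions for illustration only). It compares the distances from passenger `a` to passengers `b` and `c` before and after standardizing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy (Sex, Fare) rows; values are illustrative, not taken from the lab's data
X_toy = np.array([
    [1, 10.0],   # a
    [0, 12.0],   # b: different sex, nearly the same fare
    [1, 40.0],   # c: same sex, noticeably higher fare
    [0, 80.0],
    [1, 8.0],
    [0, 100.0],
])

def euclidean(u, v):
    """Euclidean distance between two 1-D arrays."""
    return float(np.linalg.norm(u - v))

a, b, c = X_toy[0], X_toy[1], X_toy[2]
raw_ab, raw_ac = euclidean(a, b), euclidean(a, c)

# Standardize both columns, then measure the same distances again
X_toy_scaled = StandardScaler().fit_transform(X_toy)
sa, sb, sc = X_toy_scaled[0], X_toy_scaled[1], X_toy_scaled[2]
scaled_ab, scaled_ac = euclidean(sa, sb), euclidean(sa, sc)

print(f"raw:    d(a, b) = {raw_ab:.2f}, d(a, c) = {raw_ac:.2f}")
print(f"scaled: d(a, b) = {scaled_ab:.2f}, d(a, c) = {scaled_ac:.2f}")
```

On the raw values, `b` is the closer of the two because the fare gap swamps the difference in `Sex`. After standardizing, the `Sex` difference counts for much more and `c` becomes the closer one, so the choice of scale can change which neighbors KNN picks.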

To scale your data, use `StandardScaler` found in the `sklearn.preprocessing` module. 

In the cell below:

* Import and instantiate `StandardScaler` 
* Use the scaler's `.fit_transform()` method to create a scaled version of the training dataset  
* Use the scaler's `.transform()` method to create a scaled version of the test dataset  
* The results returned by the `.fit_transform()` and `.transform()` methods will be NumPy arrays, not pandas DataFrames. Create a new pandas DataFrame called `scaled_df_train` from the scaled training data. To set the column names back to their original state, set the `columns` parameter to `X_train.columns` 
* Print the head of `scaled_df_train` to ensure everything worked correctly 

In [12]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Instantiate StandardScaler
scaler = StandardScaler()

# Fit and transform the training set, and transform the test set
scaled_data_train = scaler.fit_transform(X_train)
scaled_data_test = scaler.transform(X_test)

# Convert the scaled training data into a DataFrame
scaled_df_train = pd.DataFrame(scaled_data_train, columns=X_train.columns)

# Display the first few rows of the scaled training DataFrame
print(scaled_df_train.head())


     Pclass       Sex       Age     SibSp     Parch      Fare  Embarked_Q  \
0 -1.584104 -1.405310 -0.571868 -0.474516 -0.475644  2.430597   -0.317205   
1  0.812275 -1.405310 -0.115088  0.381780 -0.475644 -0.358135   -0.317205   
2  0.812275  0.711587  0.189432 -0.474516 -0.475644 -0.490949   -0.317205   
3  0.812275 -1.405310 -0.115088  6.375852  2.010994  0.762595   -0.317205   
4  0.812275  0.711587 -1.180908  3.806964  2.010994  0.301860   -0.317205   

   Embarked_S  
0    0.619087  
1   -1.615282  
2    0.619087  
3    0.619087  
4    0.619087  


You may have noticed that the scaler also scaled our binary and one-hot encoded columns! Although it doesn't look as pretty, this does not discard anything the model needs. Each 1 and 0 has been replaced with a corresponding decimal value, but each binary column still contains only two distinct values, meaning the overall information content of each column has not changed.
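
A quick way to convince yourself of this is to standardize a toy binary column and count its distinct values. This is a minimal sketch with made-up data, not the lab's `Sex` column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy binary column (illustrative values only)
binary_col = np.array([[1], [0], [0], [0], [1], [1], [1], [0], [1], [1]], dtype=float)

scaled_col = StandardScaler().fit_transform(binary_col)

# Rounding guards against tiny floating-point differences before counting
unique_before = np.unique(binary_col)
unique_after = np.unique(np.round(scaled_col, 8))

print(unique_before)
print(unique_after)
```

The scaled column still takes exactly two values: one negative (the former 0s) and one positive (the former 1s).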

## Fit a KNN model

Now that you've preprocessed the data, it's time to train a KNN classifier and validate its accuracy. 

In the cells below:

* Import `KNeighborsClassifier` from the `sklearn.neighbors` module 
* Instantiate the classifier. For now, you can just use the default parameters  
* Fit the classifier to the training data/labels
* Use the classifier to generate predictions on the test data. Store these predictions inside the variable `test_preds` 

In [14]:
# Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier

# Instantiate the KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=5)  # n_neighbors=5 is the scikit-learn default

# Fit the classifier on the training data
clf.fit(scaled_data_train, y_train)

# Predict on the test set
test_preds = clf.predict(scaled_data_test)

# Display the first few predictions
print("Test Predictions:")
print(test_preds[:10])


Test Predictions:
[0 1 1 0 1 0 0 0 1 1]


## Evaluate the model

Now, in the cells below, import all the necessary evaluation metrics from `sklearn.metrics` and complete the `print_metrics()` function so that it prints out **_Precision, Recall, Accuracy, and F1-Score_** when given a set of `labels` (the true values) and `preds` (the model's predictions). 

Finally, use `print_metrics()` to print the evaluation metrics for the test predictions stored in `test_preds`, and the corresponding labels in `y_test`. 

In [15]:
# Import the necessary functions
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score


In [16]:
# Complete the function
def print_metrics(labels, preds):
    # The metric functions are imported in the previous cell
    print("Precision Score: {}".format(precision_score(labels, preds)))
    print("Recall Score: {}".format(recall_score(labels, preds)))
    print("Accuracy Score: {}".format(accuracy_score(labels, preds)))
    print("F1 Score: {}".format(f1_score(labels, preds)))

# Use the function to print metrics for the test predictions
print_metrics(y_test, test_preds)


Precision Score: 0.7222222222222222
Recall Score: 0.7536231884057971
Accuracy Score: 0.7921348314606742
F1 Score: 0.7375886524822695


> Interpret each of the metrics above, and explain what they tell you about your model's capabilities. If you had to pick one score to best describe the performance of the model, which would you choose? Explain your answer.

Write your answer below this line: 


________________________________________________________________________________

Let’s break down the metrics and interpret each one in the context of your KNN model's performance:

---

### **Metric Interpretations**:

1. **Precision Score: 0.722**
   - Precision measures the proportion of positive predictions that were actually correct. 
   - **Interpretation**: Out of all the passengers predicted to survive, 72.2% actually survived. 
   - **Significance**: This metric is important when the cost of false positives (e.g., predicting survival when the passenger didn’t survive) is high.

2. **Recall Score: 0.754**
   - Recall (also known as sensitivity) measures the proportion of actual positives correctly identified.
   - **Interpretation**: The model correctly identified 75.4% of the passengers who actually survived.
   - **Significance**: Recall is critical when the cost of false negatives (e.g., failing to predict survival when the passenger actually survived) is high.

3. **Accuracy Score: 0.792**
   - Accuracy measures the proportion of correct predictions (both positives and negatives) out of all predictions.
   - **Interpretation**: The model correctly predicted survival or non-survival for 79.2% of passengers.
   - **Significance**: Accuracy is a general metric but may not be reliable in datasets with class imbalance.

4. **F1 Score: 0.738**
   - The F1 Score is the harmonic mean of precision and recall, balancing both metrics.
   - **Interpretation**: The F1 Score of 73.8% reflects a balance between the precision and recall of the model.
   - **Significance**: F1 is particularly useful when the dataset has an imbalance between classes (e.g., more non-survivors than survivors).

---

### **Which Metric Best Describes the Model?**

- If **class imbalance** exists in the dataset (e.g., significantly more non-survivors than survivors), **F1 Score** is the most reliable metric. It balances precision and recall, providing a holistic view of the model's performance.
- If the **cost of false negatives** is high (e.g., missing survivors is critical), **Recall** might be more critical.
- If the **cost of false positives** is high (e.g., incorrectly identifying survivors is costly), **Precision** would be more important.

**Recommendation**: If I had to pick one score to best describe the performance of the model, I would choose the **F1 Score (0.738)** because it balances both precision and recall, providing a comprehensive measure of the model's ability to correctly identify survivors while minimizing false positives and false negatives, which is crucial in datasets with potential class imbalance.
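
The four scores can also be reproduced by hand from confusion-matrix counts. The counts below are inferred from the reported scores and the 178 test rows (an assumption, since the notebook does not print the confusion matrix itself); the sketch simply applies the standard formulas to them.

```python
# Confusion-matrix counts inferred from the reported scores (assumption:
# the notebook never prints the confusion matrix itself)
tp, fp, fn, tn = 52, 20, 17, 89

precision = tp / (tp + fp)                  # correct positives / predicted positives
recall = tp / (tp + fn)                     # correct positives / actual positives
accuracy = (tp + tn) / (tp + fp + fn + tn)  # all correct / all predictions
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"Accuracy:  {accuracy:.4f}")
print(f"F1:        {f1:.4f}")
```

With these counts, 69 of the 178 test passengers are actual survivors (about 39%), so the classes are moderately imbalanced, which is the situation where F1 is more informative than accuracy alone.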


## Improve model performance

While your overall model results should be better than random chance, they're probably mediocre at best given that you haven't tuned the model yet. For the remainder of this notebook, you'll focus on improving your model's performance. Remember that modeling is an **_iterative process_**, and developing an out-of-the-box baseline model such as the one above is always a good start. 

First, try to find the optimal number of neighbors to use for the classifier. To do this, complete the `find_best_k()` function below to iterate over multiple values of K and find the value of K that returns the best overall performance. 

The function takes in six arguments:
* `X_train`
* `y_train`
* `X_test`
* `y_test`
* `min_k` (default is 1)
* `max_k` (default is 25)
    
> **Pseudocode Hint**:
1. Create two variables, `best_k` and `best_score`
1. Iterate through every **_odd number_** between `min_k` and `max_k + 1`. 
    1. For each iteration:
        1. Create a new `KNN` classifier, and set the `n_neighbors` parameter to the current value for k, as determined by the loop 
        1. Fit this classifier to the training data 
        1. Generate predictions for `X_test` using the fitted classifier 
        1. Calculate the **_F1-score_** for these predictions 
        1. Compare this F1-score to `best_score`. If better, update `best_score` and `best_k` 
1. Once all iterations are complete, print the best value for k and the F1-score it achieved 

In [19]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

def find_best_k(X_train, y_train, X_test, y_test, min_k=1, max_k=25):
    # Initialize variables to track the best k and its corresponding F1 score
    best_k = None
    best_score = 0

    # Step through k from min_k to max_k in increments of 2
    for k in range(min_k, max_k + 1, 2):  # Odd values only when min_k is odd (default 1)
        # Create and fit the KNN classifier
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)

        # Generate predictions on the test set
        preds = knn.predict(X_test)

        # Calculate the F1 score
        current_score = f1_score(y_test, preds)

        # Update best_k and best_score if current_score is better
        if current_score > best_score:
            best_k = k
            best_score = current_score

    # Print the best k and its corresponding F1 score
    print(f"Best K: {best_k}")
    print(f"Best F1 Score: {best_score:.2f}")

    return best_k, best_score



In [20]:
find_best_k(scaled_data_train, y_train, scaled_data_test, y_test)
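# Reference output from the lab template (your values depend on the
# train-test split and preprocessing, so they may differ):

# Best Value for k: 17
# F1-Score: 0.7468354430379746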

Best K: 21
Best F1 Score: 0.76


(21, 0.7575757575757576)

If all went well, you'll notice that model performance has improved by finding a better value for k: in the run above, the F1 score rose from about 0.738 to about 0.758, roughly 2 percentage points. For further tuning, you can use scikit-learn's built-in `GridSearchCV` to perform a similar exhaustive check of hyperparameter combinations and fine-tune model performance. For a full list of model parameters, see the [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).
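
As a rough sketch of what such a search could look like, the snippet below runs `GridSearchCV` over `n_neighbors` and `weights` with 5-fold cross-validation. It uses a synthetic dataset from `make_classification` as a stand-in for the Titanic features, and the `_demo` variable names and the parameter grid are assumptions chosen for illustration. Unlike `find_best_k()`, it selects k using cross-validation on the training data, so the test set is only used once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic features (assumption: not the lab's data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scaling sits inside the pipeline so each CV fold is scaled with its own training part
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name
param_grid = {
    "kneighborsclassifier__n_neighbors": list(range(1, 26, 2)),
    "kneighborsclassifier__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X_tr_demo, y_tr_demo)

print(search.best_params_)
print(f"Cross-validated F1: {search.best_score_:.3f}")
# With scoring="f1", search.score() also reports F1
print(f"Test F1: {search.score(X_te_demo, y_te_demo):.3f}")
```

To try this on the lab's data, you would pass the unscaled `X_train` and `y_train` to `search.fit()`, since the pipeline handles the scaling itself.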



## (Optional) Level Up: Iterating on the data

As an optional (but recommended!) exercise, think about the decisions you made during the preprocessing steps that could have affected the overall model performance. For instance, you were asked to replace the missing age values with the column median. Could this have affected the overall performance? How might the model have fared if you had just dropped those rows, instead of using the column median? What if you reduced the data's dimensionality by ignoring some less important columns altogether?

In the cells below, revisit your preprocessing stage and see if you can improve the overall results of the classifier by doing things differently. Consider dropping certain columns, dealing with missing values differently, or using an alternative scaling function. Then see how these different preprocessing techniques affect the performance of the model. Remember that the `find_best_k()` function handles all of the fitting; use this to iterate quickly as you try different strategies for dealing with data preprocessing! 
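
One possible way to organize such experiments is sketched below. It is an alternative to calling `find_best_k()` directly: it wraps imputation, scaling, and a KNN classifier in a pipeline and scores each combination with cross-validation. The data is synthetic with values knocked out at random, and `compare_strategy` is a hypothetical helper, so treat it as a template rather than the lab's solution.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data with about 10% of values removed (assumption: not the Titanic data)
X_missing, y_missing = make_classification(n_samples=300, n_features=6, random_state=0)
rng = np.random.default_rng(0)
X_missing[rng.random(X_missing.shape) < 0.1] = np.nan

def compare_strategy(strategy, scaler):
    """Hypothetical helper: mean cross-validated F1 for one preprocessing choice."""
    pipe = make_pipeline(
        SimpleImputer(strategy=strategy),  # fill missing values per fold
        scaler,                            # rescale features
        KNeighborsClassifier(n_neighbors=5),
    )
    return cross_val_score(pipe, X_missing, y_missing, scoring="f1", cv=5).mean()

# Try every combination of imputation strategy and scaler
results = {}
for strategy in ("median", "mean"):
    for scaler in (StandardScaler(), MinMaxScaler()):
        results[(strategy, type(scaler).__name__)] = compare_strategy(strategy, scaler)

for combo, score in results.items():
    print(combo, round(score, 3))
```

Because the imputer and scaler are fitted inside each cross-validation fold, the comparison between strategies is not affected by leakage from held-out rows.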

## Summary

Well done! In this lab, you worked with the classic Titanic dataset and practiced fitting and tuning KNN classification models using scikit-learn! As always, this gave you another opportunity to practice your data wrangling and model tuning skills with pandas and scikit-learn!