# Predicting Titanic survivors with *k*-NN

In this notebook we're going to predict whether passengers on the Titanic survived, using the *k*-NN algorithm. This is a classic dataset and you can find it on [Kaggle](https://www.kaggle.com/c/titanic).

In [1]:
import seaborn as sns
import sklearn as sk
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split #We need this to split the data

## Data set

1. Let's first look at the dataset and see which variables we can use.

In [2]:
df = pd.read_csv("titanic.csv")
df.head(30) #show a bit more of the dataset

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Cabin
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,C85
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,C123
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,E46
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,


## Data cleaning

2. Create a subset with the selected variables and check for NaNs. Either drop the rows that contain NaNs, or drop one or more columns that contain NaNs. Think about this carefully.

In [3]:
df = df[["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch"]]
df.info()
df = df.dropna()
df.head(30)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       714 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
dtypes: float64(1), int64(4), object(1)
memory usage: 41.9+ KB


Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch
0,0,3,male,22.0,1,0
1,1,1,female,38.0,1,0
2,1,3,female,26.0,0,0
3,1,1,female,35.0,1,0
4,0,3,male,35.0,0,0
6,0,1,male,54.0,0,0
7,0,3,male,2.0,3,1
8,1,3,female,27.0,0,2
9,1,2,female,14.0,1,0
10,1,3,female,4.0,1,1
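The `info()` output above shows that only *Age* has missing values (714 of 891 non-null), so dropping rows keeps 714 passengers. A minimal sketch of how the two options could be compared, using a small hypothetical frame (`toy` and its values are made up for illustration, not taken from the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic subset.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Pclass": [3, 1, 3, 2],
})

# Number of missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Option A: drop every row that has a missing value (keeps all columns).
rows_dropped = toy.dropna()

# Option B: drop every column that has a missing value (keeps all rows).
cols_dropped = toy.dropna(axis=1)

print(rows_dropped.shape, cols_dropped.shape)
```

Which option is preferable depends on how much information the affected column carries compared to the rows that would be lost.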


3. How many passengers survived?

In [4]:
print(df["Survived"].value_counts())

0    424
1    290
Name: Survived, dtype: int64


4. Let's add dummy variables for the variable *Sex*. Remember that we can only add one of the variables *male* and *female*. They are perfectly (negatively) correlated in this dataset: knowing one tells you the other, so the second column would add no new information.

In [5]:
df = pd.get_dummies(df, columns=['Sex'], drop_first = True)
df.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Sex_male
0,0,3,22.0,1,0,1
1,1,1,38.0,1,0,0
2,1,3,26.0,0,0,0
3,1,1,35.0,1,0,0
4,0,3,35.0,0,0,1
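To see why one dummy column is enough, here is a small sketch on a hypothetical four-row frame (the frame `toy` is invented for illustration). It assumes a pandas version whose `get_dummies` accepts the `dtype` argument:

```python
import pandas as pd

# Hypothetical toy column, not the real Titanic data.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Without drop_first we get one column per category.
both = pd.get_dummies(toy, columns=["Sex"], dtype=int)

# With drop_first=True the first category (female) is dropped.
one = pd.get_dummies(toy, columns=["Sex"], drop_first=True, dtype=int)

# The two full dummy columns always sum to 1, so one determines the other.
redundant = ((both["Sex_female"] + both["Sex_male"]) == 1).all()

print(list(both.columns))
print(list(one.columns))
print(redundant)
```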


## Building the model

Let's build the model. Make sure to normalize the numeric data before splitting the dataset.

In [6]:
from sklearn.preprocessing import normalize

X = df.loc[:, ~df.columns.isin(['Survived'])] #create the X matrix
y = df['Survived'] #create the y-variable

X = normalize(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
pd.DataFrame(X_train).head()

Unnamed: 0,0,1,2,3,4
0,0.041631,0.999133,0.0,0.0,0.0
1,0.173494,0.983135,0.0,0.0,0.057831
2,0.087856,0.995703,0.0,0.0,0.029285
3,0.055132,0.992372,0.0,0.110264,0.0
4,0.019988,0.999401,0.019988,0.0,0.019988
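Note that `sklearn.preprocessing.normalize` rescales each *row* to unit length by default, which is why the *Age* column dominates the values above. If the goal is to bring each *column* onto a comparable scale, a per-feature scaler is one possible alternative. The sketch below uses hypothetical rows and `MinMaxScaler`, fitted on the training part only; this is an assumption about what one might prefer, not what the notebook does:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

# Hypothetical rows in the order [Pclass, Age, SibSp, Parch, Sex_male].
X_toy = np.array([
    [3, 22.0, 1, 0, 1],
    [1, 38.0, 1, 0, 0],
    [3, 26.0, 0, 0, 0],
    [1, 54.0, 0, 0, 1],
])

# normalize() rescales each row to unit Euclidean length.
X_rows = normalize(X_toy)
row_norms = np.linalg.norm(X_rows, axis=1)
print(row_norms)

# MinMaxScaler rescales each column to [0, 1] based on the data it was fitted on.
X_train_toy, X_test_toy = X_toy[:3], X_toy[3:]
scaler = MinMaxScaler().fit(X_train_toy)
X_train_scaled = scaler.transform(X_train_toy)
X_test_scaled = scaler.transform(X_test_toy)  # may fall outside [0, 1]
print(X_train_scaled)
```

Fitting the scaler on the training rows only keeps information about the test rows out of the preprocessing step.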


Let's use the *KNeighborsClassifier* class from sklearn:

In [7]:
from sklearn.neighbors import KNeighborsClassifier #the object class we need

knn = KNeighborsClassifier() #create a KNN-classifier with 5 neighbors (default)
knn = knn.fit(X_train, y_train) #this fits the k-nearest neighbor model with the train data
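To get a feeling for what the classifier does when it predicts, here is a simplified sketch of the idea: find the *k* closest training points and take a majority vote. The function `knn_predict` and the toy points are hypothetical, and sklearn's actual implementation differs in details such as tie handling and the search structure:

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two small clusters with labels 0 and 1.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([0.05, 0.05]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([1.0, 0.95]), k=3)
print(pred_a, pred_b)
```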



## Model evaluation

Let's start by calculating accuracy. As always, we do the evaluation on the test data.

In [8]:
knn.score(X_test, y_test) #calculate the fit on the *test* data

0.7813953488372093

Accuracy is 78.1%. A simple comparison is the best baseline guess: always predict "Not survived". That would give us 424 / (424 + 290) = 59.4% (see *value_counts* above). So the model is clearly better than the baseline guess.
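The baseline can also be checked in code. The sketch below rebuilds a label vector from the class counts printed by *value_counts* (424 and 290) and uses sklearn's `DummyClassifier`; the placeholder feature matrix is an assumption made only so the example runs on its own:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels with the class counts reported by value_counts: 424 zeros, 290 ones.
y_all = np.array([0] * 424 + [1] * 290)

# Placeholder features; the most_frequent strategy ignores them.
X_placeholder = np.zeros((len(y_all), 1))

# Always predict the most frequent class ("Not survived").
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_all)
baseline_accuracy = baseline.score(X_placeholder, y_all)
print(round(baseline_accuracy, 3))
```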

## Model evaluation: precision and recall

Aside from accuracy, we can also look at precision and recall. Let's create a *confusion matrix* for that.

In [9]:
from sklearn.metrics import confusion_matrix
y_test_pred = knn.predict(X_test) #the predicted values
cm = confusion_matrix(y_test, y_test_pred) #creates a "confusion matrix"
cm

array([[112,  22],
       [ 25,  56]], dtype=int64)

Let's pretty print that.

In [10]:
conf_matrix = pd.DataFrame(cm, index=['Not survived (actual)', 'Survived (actual)'], columns = ['Not survived (predicted)', 'Survived (predicted)']) 
conf_matrix

Unnamed: 0,Not survived (predicted),Survived (predicted)
Not survived (actual),112,22
Survived (actual),25,56
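In sklearn's confusion matrix the rows are the actual classes and the columns the predicted classes. For a binary problem the four cells can be unpacked with `ravel()`. A small sketch, with the counts copied by hand from the matrix above:

```python
import numpy as np

# Counts copied from the confusion matrix printed above.
cm_values = np.array([[112, 22],
                      [25, 56]])

# Row-major order: true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cm_values.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```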


5. Interpret these results. How many false negatives do we have, and what are the consequences compared to false negatives in COVID-19 testing?

#### Accuracy

6. Calculate the accuracy using the numbers from the confusion matrix. We already calculated it with the *.score* method, but it is a useful check.

In [11]:
(112+56)/(112+56+22+25)

0.7813953488372093

#### Precision (survived)

Remember precision and recall can only be calculated for an *outcome*, not for an entire variable (unlike accuracy). I'll calculate precision and recall for *survived* and leave the others up for the reader to try.

7. Let's start with calculating precision. This is the number of <b>correctly predicted survivors</b>, divided by the <b>total number of predicted survivors</b>. Remember: how "precise" am I in saying people survived?

In [12]:
56/(22+56)

0.717948717948718

#### Recall (survived)

8. Now calculate recall. This is the number of <b>correctly predicted survivors</b>, divided by the <b>total number of actual survivors</b>. Remember: how many of the people who survived do I "recall"?

In [13]:
56/(25+56)

0.691358024691358
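The same quantities can be obtained from sklearn's `precision_score` and `recall_score`. The sketch below rebuilds label vectors from the four cells of the confusion matrix printed earlier (112, 22, 25, 56), so it runs without the fitted model; in the notebook itself one would pass `y_test` and `y_test_pred` directly:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Cells of the confusion matrix printed earlier (rows: actual, columns: predicted).
tn, fp, fn, tp = 112, 22, 25, 56

# Rebuild label vectors that produce exactly these counts.
y_true = np.array([0] * tn + [0] * fp + [1] * fn + [1] * tp)
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

# pos_label=1 selects the outcome "survived".
precision_survived = precision_score(y_true, y_pred, pos_label=1)
recall_survived = recall_score(y_true, y_pred, pos_label=1)
print(precision_survived, recall_survived)
```

Setting `pos_label=0` gives the corresponding values for the other outcome.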

## Parameter setting

Finally, let's try out different settings for the most important parameter *k*. I'll use a for-loop to do a simple parameter grid search. I'll use a built-in function *classification_report* in sklearn to print out accuracy, precision and recall quickly.

9. Which k value would you choose?
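One way to make the comparison across *k* easier to read is to collect a single metric per *k* in a small table. The sketch below does this for accuracy on synthetic data from `make_classification`, which is only a stand-in so the example is self-contained; the numbers it prints say nothing about the Titanic data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the Titanic dataset).
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=1
)

# Fit one model per k and record its test accuracy.
rows = []
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    rows.append({"k": k, "accuracy": model.score(X_te, y_te)})

results = pd.DataFrame(rows).set_index("k")

# Show the three best-scoring values of k.
print(results.sort_values("accuracy", ascending=False).head(3))
```

Keep in mind that picking *k* on the test set makes the reported test score slightly optimistic; a separate validation set or cross-validation avoids that.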

In [14]:
from sklearn.metrics import classification_report

for i in range(1,11):
    knn_new = KNeighborsClassifier(n_neighbors = i) #make a new kNN model with i (1-10) neighbors
    knn_new = knn_new.fit(X_train, y_train) #fit new model on train data
    y_test_pred_new = knn_new.predict(X_test) #predict using new model, with test data
    print(f"With {i} neighbors the result is:")
    print(classification_report(y_test, y_test_pred_new)) #use a built-in function to print out accuracy, precision and recall


With 1 neighbors the result is:
              precision    recall  f1-score   support

           0       0.79      0.75      0.77       134
           1       0.61      0.67      0.64        81

    accuracy                           0.72       215
   macro avg       0.70      0.71      0.70       215
weighted avg       0.72      0.72      0.72       215

With 2 neighbors the result is:
              precision    recall  f1-score   support

           0       0.75      0.87      0.81       134
           1       0.72      0.53      0.61        81

    accuracy                           0.74       215
   macro avg       0.74      0.70      0.71       215
weighted avg       0.74      0.74      0.73       215

With 3 neighbors the result is:
              precision    recall  f1-score   support

           0       0.83      0.78      0.80       134
           1       0.67      0.73      0.70        81

    accuracy                           0.76       215
   macro avg       0.75      0.7