# Predicting Titanic survivors with *k*-NN

In this notebook we're going to predict whether passengers on the Titanic survived, using the *k*-NN algorithm. This is a classic dataset from [Kaggle](https://www.kaggle.com/c/titanic).

In [14]:
import seaborn as sns
import sklearn as sk
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split #We need this to split the data

## Data set

Let's first look at the dataset and see which variables we can use.

In [24]:
df1 = pd.read_csv("gender_submission.csv") #Kaggle's sample submission: PassengerId and Survived
print(df1)

df2 = pd.read_csv("test.csv")
print(df2)

df3 = pd.read_csv("train.csv")
print(df3)



     PassengerId  Survived
0            892         0
1            893         1
2            894         0
3            895         0
4            896         1
..           ...       ...
413         1305         0
414         1306         1
415         1307         0
416         1308         0
417         1309         0

[418 rows x 2 columns]
     PassengerId  Pclass                                          Name  \
0            892       3                              Kelly, Mr. James   
1            893       3              Wilkes, Mrs. James (Ellen Needs)   
2            894       2                     Myles, Mr. Thomas Francis   
3            895       3                              Wirz, Mr. Albert   
4            896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)   
..           ...     ...                                           ...   
413         1305       3                            Spector, Mr. Woolf   
414         1306       1                  Oliva y Ocana, Don

In [40]:
#pd.merge joins exactly two DataFrames, so pd.merge(df1, df2, df3) raises an error.
#train.csv (df3) is the file that holds both the features and the Survived label,
#so we use it as our working dataset.
df = df3.copy()

df.head(30) #shows the first 30 rows of the dataset
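A note on combining files: `pd.merge` joins two DataFrames on a shared key, while `pd.concat` stacks DataFrames with the same columns. The sketch below uses a few toy rows in the shape of the files loaded above; the frame names and the extra row are only illustrative.

```python
import pandas as pd

# Toy stand-ins for gender_submission.csv and test.csv (only a few rows).
labels = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
features = pd.DataFrame({"PassengerId": [892, 893, 894], "Pclass": [3, 3, 2]})

# pd.merge joins exactly two frames; `on` names the shared key column.
merged = pd.merge(labels, features, on="PassengerId")
print(merged)

# pd.concat stacks frames that have the same columns on top of each other.
extra = pd.DataFrame({"PassengerId": [1], "Survived": [0], "Pclass": [3]})  # made-up row
stacked = pd.concat([extra, merged], ignore_index=True)
print(stacked.shape)
```

To join three files you would chain two merges, or concatenate frames that already share their columns.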

* *PassengerId* is just an ID variable, we don't use it
* *Survived* is our dependent variable
* There are 5 variables that are easy to work with: *Pclass*, *Sex*, *Age* (though it contains some NaNs), *SibSp* (number of siblings and spouses), *Parch* (number of parents and children).


## Data cleaning

Let's select the variables. We also need to drop the rows with NaNs in them, because our *k*-NN algorithm can't handle missing values. Dealing with missing values is actually a very complicated topic within statistics, so for now we simply drop those rows. Then we count how many people survived.

In [36]:
df = df[['Survived','Pclass', 'Age', 'SibSp', 'Sex', 'Parch']]
df = df.dropna() #drop rows that contain empty cells
df.head()
df['Survived'].value_counts() #count survivors and non-survivors
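To see what dropping rows does, here is a minimal sketch on a made-up four-row table (the values are hypothetical, not taken from the Titanic data). It counts the missing values, drops the incomplete rows and then counts the classes, just like the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset; the NaN mimics a passenger whose age is unknown.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, 35.0],
})

missing_per_column = toy.isna().sum()      # number of missing values per column
clean = toy.dropna()                       # keep only rows without any NaN
counts = clean["Survived"].value_counts()  # class counts after cleaning

print(missing_per_column)
print(len(toy) - len(clean), "row(s) dropped")
print(counts)
```

Checking `isna().sum()` first shows how much data the `dropna` step is going to cost.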

Let's add dummy variables for the variable *Sex*.

In [16]:
dummies = pd.get_dummies(df['Sex'])
df = pd.concat([df, dummies], axis=1) #axis=1 appends the dummies as columns (axis=0 would append rows)
df.head()
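A small sketch of what `get_dummies` produces, using a made-up *Sex* column. It creates one 0/1 column per category, and with two categories the columns are each other's mirror image. The `dtype=int` and `drop_first=True` arguments are optional extras that the cell above doesn't use.

```python
import pandas as pd

# Hypothetical sample of the Sex column.
sex = pd.Series(["male", "female", "female", "male", "male"], name="Sex")

dummies_demo = pd.get_dummies(sex, dtype=int)  # one 0/1 column per category
print(dummies_demo)

# Every row has exactly one of the two dummies set, so they always sum to 1 ...
row_sums = (dummies_demo["female"] + dummies_demo["male"]).unique()
# ... which makes them perfectly negatively correlated.
correlation = dummies_demo["female"].corr(dummies_demo["male"])
print(row_sums, correlation)

# drop_first=True keeps a single column; "female" comes first alphabetically, so "male" remains.
single = pd.get_dummies(sex, drop_first=True, dtype=int)
print(single.columns.tolist())
```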

## Building the model

Let's build the model. Remember we can only add one of the variables *male* and *female*. In this dataset each is exactly the opposite of the other, so the second one adds no new information.

In [5]:
X = df[['Age', 'Pclass', 'SibSp', 'Parch', 'female']] #create the X matrix

y = df['Survived'] #create the y-variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #split the data, store it into different variables

X_train.head() #show the head of the training set

| | Age | Pclass | SibSp | Parch | female |
|---|---|---|---|---|---|
| 641 | 24.0 | 1 | 0 | 0 | 1 |
| 433 | 17.0 | 3 | 0 | 0 | 0 |
| 202 | 34.0 | 3 | 0 | 0 | 0 |
| 585 | 18.0 | 1 | 0 | 2 | 1 |
| 544 | 50.0 | 1 | 1 | 0 | 0 |
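To check how many rows end up in each part of the split, here is a sketch with placeholder data. It assumes 714 rows remain after dropping the NaNs (424 + 290, the class counts used in the evaluation further on); the feature values themselves are just row numbers.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 714 rows; only the sizes matter here.
X_demo = np.arange(714).reshape(-1, 1)
y_demo = np.array([0] * 424 + [1] * 290)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)
print(len(X_tr), len(X_te))  # 499 training rows and 215 test rows
```

With `test_size=0.3` sklearn rounds the test set up, so 30% of 714 becomes 215 rows and the remaining 499 are used for training. Fixing `random_state` makes the shuffle reproducible.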


Let's use the *KNeighborsClassifier* class from sklearn:

In [10]:
from sklearn.neighbors import KNeighborsClassifier #the object class we need

knn = KNeighborsClassifier(n_neighbors=5) #create a KNN-classifier with 5 neighbors
knn = knn.fit(X_train, y_train) #this fits the k-nearest neighbor model with the train data



In [11]:
knn.n_samples_fit_

499
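To make concrete what the fitted classifier does when it predicts, here is a minimal from-scratch sketch of the *k*-NN idea: measure the distance to every training point, take the *k* closest and let them vote. It assumes Euclidean distance and unweighted votes (sklearn's defaults); the function name `knn_predict` and the toy data are made up for illustration, and ties may be resolved differently than in sklearn.

```python
from collections import Counter

import numpy as np

def knn_predict(X_known, y_known, x_new, k=5):
    """Predict the label of x_new by majority vote among its k nearest known points."""
    distances = np.linalg.norm(X_known - x_new, axis=1)  # Euclidean distance to every row
    nearest = np.argsort(distances)[:k]                  # indices of the k closest rows
    votes = Counter(y_known[nearest].tolist())           # count the labels of those rows
    return votes.most_common(1)[0][0]                    # most frequent label wins

# Made-up training points: (Age, Pclass) with a 0/1 label.
X_toy = np.array([[22, 3], [24, 3], [30, 3], [35, 1], [38, 1], [40, 1]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([37.0, 1.0]), k=3))  # neighbours are all labelled 1
print(knn_predict(X_toy, y_toy, np.array([23.0, 3.0]), k=3))  # neighbours are all labelled 0
```

Note that "fitting" a *k*-NN model mostly means storing the training data, which is why `n_samples_fit_` simply reports the number of training rows.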

## Model evaluation

Let's start by calculating accuracy. As always, we do the evaluation on the test data.

In [12]:
knn.score(X_test, y_test) #calculate the fit on the *test* data

0.7953488372093023

Accuracy is 79.5%. A simple reference point is the best naive guess: always predict "Not survived". That would give us 424 / (424 + 290) = 59.4% (see *value_counts* above). So the model does considerably better than the baseline guess. Let's create a confusion matrix to evaluate precision and recall.
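Before building the confusion matrix, the baseline figure can be double-checked with sklearn's `DummyClassifier`, which always predicts the most frequent class. This is only a sketch: the labels are rebuilt from the two class counts quoted above and the feature column is a placeholder, since a dummy model ignores the features anyway.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the class counts: 424 not survived, 290 survived.
y_all = np.array([0] * 424 + [1] * 290)
X_placeholder = np.zeros((len(y_all), 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_all)
baseline_accuracy = baseline.score(X_placeholder, y_all)
print(f"{baseline_accuracy:.1%}")  # 59.4%
```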

In [9]:
from sklearn.metrics import confusion_matrix
y_test_pred = knn.predict(X_test) #the predicted values
cm = confusion_matrix(y_test, y_test_pred) #creates a "confusion matrix"
cm  

array([[111,  23],
       [ 21,  60]], dtype=int64)

How do I know which row is which (survived or not survived)? There's an attribute for that in our model:

In [24]:
knn.classes_

array([0, 1], dtype=int64)

This tells me that the first outcome is the "0" value (not survived) and the second outcome the "1" value (survived). If that column contained text instead of 0/1 numbers, it would show the text labels of the categories.
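Here is a small sketch of that last point, with made-up data and text labels ("died" and "survived" are hypothetical names). The classes are stored in sorted order, and passing them to `confusion_matrix` via `labels` fixes the order of the rows and columns.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

# Made-up one-feature dataset with text labels instead of 0/1.
X_small = np.array([[1.0], [2.0], [8.0], [9.0]])
y_text = np.array(["survived", "survived", "died", "died"])

knn_text = KNeighborsClassifier(n_neighbors=1).fit(X_small, y_text)
print(knn_text.classes_)  # sorted alphabetically: 'died' comes before 'survived'

# labels= makes the row/column order of the matrix explicit.
cm_text = confusion_matrix(y_text, knn_text.predict(X_small), labels=knn_text.classes_)
print(cm_text)
```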

Let's pretty print our table:

In [25]:
conf_matrix = pd.DataFrame(cm, index=['Not survived (actual)', 'Survived (actual)'], columns = ['Not survived (predicted)', 'Survived (predicted)']) #make a dataframe, put labels on rows (index) and columns 
conf_matrix

| | Not survived (predicted) | Survived (predicted) |
|---|---|---|
| Not survived (actual) | 111 | 23 |
| Survived (actual) | 21 | 60 |


#### Accuracy

Let's start with the accuracy. We already calculated it with the *.score* method, but let's check it by hand using the confusion matrix.

In [16]:
(111+60)/(111+23+21+60)


0.7953488372093023

Indeed, accuracy is 79.5%. 

#### Precision (survived)

In [17]:
60/(23+60)

0.7228915662650602

Precision is 72.3%.

#### Recall (survived)

In [18]:
60/(21+60)

0.7407407407407407

Recall is 74.1%, so again a bit worse than the overall accuracy, but not by much. How is it possible that both precision and recall are worse than accuracy, you might ask. Well, remember that *survived* is just one outcome. Apparently the model is better at predicting *not survived*. It's typical, but not always the case, that the more common outcome is predicted better.
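The hand calculations can be done for both outcomes at once. The sketch below re-enters the confusion matrix from above as a NumPy array and derives accuracy, precision and recall from it; it shows that *not survived* indeed scores higher on both measures.

```python
import numpy as np

# Confusion matrix from above: rows = actual, columns = predicted (order: 0, 1).
cm_values = np.array([[111, 23],
                      [21, 60]])

accuracy = np.trace(cm_values) / cm_values.sum()        # correct predictions / all predictions
precision = np.diag(cm_values) / cm_values.sum(axis=0)  # correct / column total (predicted)
recall = np.diag(cm_values) / cm_values.sum(axis=1)     # correct / row total (actual)

print(f"accuracy: {accuracy:.3f}")
for label, p, r in zip(["not survived", "survived"], precision, recall):
    print(f"{label}: precision {p:.3f}, recall {r:.3f}")
```

For *not survived* this gives a precision of 111 / 132 = 84.1% and a recall of 111 / 134 = 82.8%, both above the 79.5% accuracy.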

## Parameter setting

Finally, let's try out different settings for the most important parameter *k*. I'll use a for-loop to do a simple parameter grid search. I'll use a built-in function *classification_report* in sklearn to print out accuracy, precision and recall quickly.

In [41]:
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier #imported again so this cell also runs on its own

for i in range(1, 11): #k must be at least 1, so we try 1 to 10
    knn_new = KNeighborsClassifier(n_neighbors = i) #make a new kNN model with i (1-10) neighbors
    knn_new = knn_new.fit(X_train, y_train) #fit new model on train data
    y_test_pred_new = knn_new.predict(X_test) #predict using new model, with test data
    print(f"With {i} neighbors the result is:")
    print(classification_report(y_test, y_test_pred_new)) #use a built-in function to print out accuracy, precision and recall

The scores seem broadly similar, but 6 or 7 neighbors seem to give the best result. However, with such a small dataset, it could well be coincidence. We don't know if this result would generalize. To find out, we could evaluate each setting on several different train-test splits (a method called cross-validation).
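As a rough sketch of what that could look like, sklearn's `cross_val_score` fits and scores a model on several splits in one call. The data below is synthetic (generated with `make_classification`) so that the example runs on its own; in this notebook you would pass `X` and `y` instead, and the choice of 5 folds is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; replace with X and y from the notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=1)

mean_scores = {}
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X_demo, y_demo, cv=5)  # accuracy on 5 different splits
    mean_scores[k] = scores.mean()
    print(f"k={k}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")

best_k = max(mean_scores, key=mean_scores.get)  # k with the highest mean accuracy
print("best k:", best_k)
```

The standard deviation across the folds gives a feel for how much of the difference between two values of *k* is just noise.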