**The Exercise**

This exercise from kaggle.com (a website for machine learning and data science challenges) provides two files, train.csv and test.csv, that contain information about the passengers of the Titanic. train.csv also records whether each passenger survived. The goal is to predict which of the passengers listed in test.csv survived the Titanic disaster.

Further information about the data set and its features: https://www.kaggle.com/c/titanic/data

In [302]:
# to do list - until point 5.1 of https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy/data
#DONE - add a title feature (dataset['Title'] = dataset['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0])
#DONE - put every title that appears less than 10 times, to misc

#DONE - maybe add FamilySize feature (=SibSp + Parch+1)

#DONE - maybe add IsAlone feature (=1 for yes(which means family size=1), =2 for no) - redundant to familySize???

#DONE - implement feature scaling / data normalization

# - verify the feature engineering with plots (take a look at a plotting tutorial first (matplotlib, seaborn and optionally pandas))

#DONE - split train_df in training data and cv data

In [303]:
import pandas as pd 
import numpy as np
import math
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn.preprocessing import scale
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

import keras
from keras.models import Sequential
from keras.layers import Dense


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/titanic/train.csv
/kaggle/input/titanic/gender_submission.csv
/kaggle/input/titanic/test.csv


# Get data

In [304]:
#loading data
train_df = pd.read_csv("/kaggle/input/titanic/train.csv")
test_df = pd.read_csv("/kaggle/input/titanic/test.csv")

In [305]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [306]:
test_df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


# **Get to know the data**

In [307]:
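#take a look at the feature correlation (of the numeric features)
#numeric_only=True skips the text columns (pandas 2.0 and later raise an error without it)
corr = train_df.corr(numeric_only=True)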
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
PassengerId,1.0,-0.005007,-0.035144,0.036847,-0.057527,-0.001652,0.012658
Survived,-0.005007,1.0,-0.338481,-0.077221,-0.035322,0.081629,0.257307
Pclass,-0.035144,-0.338481,1.0,-0.369226,0.083081,0.018443,-0.5495
Age,0.036847,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067
SibSp,-0.057527,-0.035322,0.083081,-0.308247,1.0,0.414838,0.159651
Parch,-0.001652,0.081629,0.018443,-0.189119,0.414838,1.0,0.216225
Fare,0.012658,0.257307,-0.5495,0.096067,0.159651,0.216225,1.0


# Data pre-processing

The Name feature itself is unlikely to matter for survival, but it contains each person's title, which could be an indicator of a higher or lower priority. So we create a 'Title' feature for train_df and test_df.

In [308]:
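#insert a 'Title' feature from the name feature because it probably correlates with survival
#note: pandas treats the two-character pattern ". " as a regex ("any character followed by a space"),
#so a multi-word title such as "the Countess" is cut short; the fragment ends up in the 'else' group below
train_df['Title'] = train_df['Name'].str.split(", ", expand=True)[1].str.split(". ", expand=True)[0].astype(str)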

In [309]:
#take a look at the result
train_df['Title']

0        Mr
1       Mrs
2      Miss
3       Mrs
4        Mr
       ... 
886     Rev
887    Miss
888    Miss
889      Mr
890      Mr
Name: Title, Length: 891, dtype: object

In [310]:
#looks like there are many different Titles
#see which title appears how often
train_df['Title'].value_counts()

Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Major         2
Col           2
th            1
Ms            1
Don           1
Lady          1
Mme           1
Capt          1
Jonkheer      1
Sir           1
Name: Title, dtype: int64
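
The `th` entry in the counts above most likely comes from a name like "Rothes, the Countess. of (...)": when the pattern `". "` is treated as a regular expression, `.` matches any character, so the split already happens after "th". The following is a minimal sketch of that effect, not the notebook's own code. The three sample names and all variable names are only for illustration, and the `regex` keyword assumes pandas 1.4 or later.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# part of the name after the comma, e.g. "Mr. Owen Harris"
after_comma = names.str.split(", ", expand=True, regex=False)[1]

# ". " as a regex: any character followed by a space
as_regex = after_comma.str.split(". ", expand=True, regex=True)[0]

# ". " as a literal string: a real dot followed by a space
as_literal = after_comma.str.split(". ", expand=True, regex=False)[0]

print(as_regex.tolist())
print(as_literal.tolist())
```

Since every rare title is merged into 'else' afterwards, the difference should not change the final Title feature here.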

In [311]:
#because there are a lot of different titles, we can summarize the rare ones in 'else'
#We set the threshold to 10, so every title that appears fewer than 10 times is set to 'else'
train_df.loc[(train_df['Title'] != 'Mr') & (train_df['Title'] != 'Miss') & (train_df['Title'] != 'Mrs') & (train_df['Title'] != 'Master'), 'Title'] = 'else'
train_df['Title'].value_counts()

Mr        517
Miss      182
Mrs       125
Master     40
else       27
Name: Title, dtype: int64
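
The cell above hard-codes the four frequent titles. As a hedged alternative, the threshold of 10 could be applied directly to the counts. The sketch below uses a small made-up Series, and `min_count`, `frequent` and `grouped` are illustrative names.

```python
import pandas as pd

# made-up titles: two frequent ones and two rare ones
titles = pd.Series(["Mr"] * 12 + ["Miss"] * 10 + ["Dr"] * 3 + ["Rev"] * 2)

min_count = 10  # same threshold as in the comment above
counts = titles.value_counts()
frequent = counts[counts >= min_count].index

# keep frequent titles, replace everything else by 'else'
grouped = titles.where(titles.isin(frequent), "else")
print(grouped.value_counts())
```

For the test data it would probably make sense to reuse the `frequent` list derived from the training data, so that both frames end up with the same categories.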

In [312]:
#do the same for test_df
#get titles
test_df['Title'] = test_df['Name'].str.split(", ", expand=True)[1].str.split(". ", expand=True)[0]

#see which value appears how often
test_df['Title'].value_counts()

#set every title that appears less than 10 times to 'else'
test_df.loc[(test_df['Title'] != 'Mr') & (test_df['Title'] != 'Miss') & (test_df['Title'] != 'Mrs') & (test_df['Title'] != 'Master'), 'Title'] = 'else'
test_df['Title'].value_counts()

Mr        240
Miss       78
Mrs        72
Master     21
else        7
Name: Title, dtype: int64

In [313]:
#the features SibSp and Parch can be summarized as FamilySize (split testing showed that this feature indeed improves the model's performance)
#add a FamilySize feature:
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch']
test_df['FamilySize'] = test_df['SibSp'] + test_df['Parch']

In [314]:
#the fact that a person is traveling with/without family members could be correlated with survival (split testing showed that this feature indeed improves the model's performance)
#add an isAlone feature which is 1, when FamilySize is 0
train_df.loc[train_df['FamilySize'] == 0, 'isAlone'] = 1
test_df.loc[test_df['FamilySize'] == 0, 'isAlone'] = 1
train_df.loc[train_df['isAlone'] != 1, 'isAlone'] = 0
test_df.loc[test_df['isAlone'] != 1, 'isAlone'] = 0
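
The same flag can also be derived in a single step from a boolean comparison. This is only a sketch on a made-up frame; unlike the cell above it yields integers (0/1) instead of floats (0.0/1.0).

```python
import pandas as pd

toy = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]

# True where nobody travels along, converted to 1/0
toy["isAlone"] = (toy["FamilySize"] == 0).astype(int)
print(toy)
```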

In [315]:
#check result
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,FamilySize,isAlone
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr,1,0.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,1,0.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,0,1.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs,1,0.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr,0,1.0


In [316]:
#take another look at the feature correlation, including the new features (numeric columns only)
corr = train_df.corr(numeric_only=True)
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,FamilySize,isAlone
PassengerId,1.0,-0.005007,-0.035144,0.036847,-0.057527,-0.001652,0.012658,-0.040143,0.057462
Survived,-0.005007,1.0,-0.338481,-0.077221,-0.035322,0.081629,0.257307,0.016639,-0.203367
Pclass,-0.035144,-0.338481,1.0,-0.369226,0.083081,0.018443,-0.5495,0.065997,0.135207
Age,0.036847,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067,-0.301914,0.19827
SibSp,-0.057527,-0.035322,0.083081,-0.308247,1.0,0.414838,0.159651,0.890712,-0.584471
Parch,-0.001652,0.081629,0.018443,-0.189119,0.414838,1.0,0.216225,0.783111,-0.583398
Fare,0.012658,0.257307,-0.5495,0.096067,0.159651,0.216225,1.0,0.217138,-0.271832
FamilySize,-0.040143,0.016639,0.065997,-0.301914,0.890712,0.783111,0.217138,1.0,-0.690922
isAlone,0.057462,-0.203367,0.135207,0.19827,-0.584471,-0.583398,-0.271832,-0.690922,1.0


In [317]:
#drop columns that have mostly missing entries (like Cabin) and/or are irrelevant for survival
#PassengerId in test_df is still needed for the submission in the end
train_df = train_df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])
test_df = test_df.drop(columns=['Name', 'Ticket', 'Cabin'])

In [318]:
train_df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize,isAlone
0,0,3,male,22.0,1,0,7.25,S,Mr,1,0.0
1,1,1,female,38.0,1,0,71.2833,C,Mrs,1,0.0
2,1,3,female,26.0,0,0,7.925,S,Miss,0,1.0
3,1,1,female,35.0,1,0,53.1,S,Mrs,1,0.0
4,0,3,male,35.0,0,0,8.05,S,Mr,0,1.0


In [319]:
test_df.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize,isAlone
0,892,3,male,34.5,0,0,7.8292,Q,Mr,0,1.0
1,893,3,female,47.0,1,0,7.0,S,Mrs,1,0.0
2,894,2,male,62.0,0,0,9.6875,Q,Mr,0,1.0
3,895,3,male,27.0,0,0,8.6625,S,Mr,0,1.0
4,896,3,female,22.0,1,1,12.2875,S,Mrs,2,0.0


In [320]:
#convert sex-feature in categorical int. female=1, male=0
train_df = train_df.replace({'female':1,'male':0})
test_df = test_df.replace( {'female':1,'male':0})

In [321]:
#rename Sex column in Gender
train_df = train_df.rename(columns={'Sex' : 'Gender'})
test_df = test_df.rename(columns={'Sex' : 'Gender'})

In [322]:
#convert Title feature to categorical int:
train_df['Title'] = train_df['Title'].replace({'Mr':1, 'Miss':2, 'Mrs':3, 'Master':4, 'else':5 }).astype(int)
test_df['Title'] = test_df['Title'].replace({'Mr':1, 'Miss':2, 'Mrs':3, 'Master':4, 'else':5 }).astype(int)

In [323]:
train_df.head()

Unnamed: 0,Survived,Pclass,Gender,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize,isAlone
0,0,3,0,22.0,1,0,7.25,S,1,1,0.0
1,1,1,1,38.0,1,0,71.2833,C,3,1,0.0
2,1,3,1,26.0,0,0,7.925,S,2,0,1.0
3,1,1,1,35.0,1,0,53.1,S,3,1,0.0
4,0,3,0,35.0,0,0,8.05,S,1,0,1.0


In [324]:
#check correlation map again now that Gender and Title are numerical (Embarked is still text and is left out)
corr = train_df.corr(numeric_only=True)
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,Survived,Pclass,Gender,Age,SibSp,Parch,Fare,Title,FamilySize,isAlone
Survived,1.0,-0.338481,0.543351,-0.077221,-0.035322,0.081629,0.257307,0.414088,0.016639,-0.203367
Pclass,-0.338481,1.0,-0.1319,-0.369226,0.083081,0.018443,-0.5495,-0.184841,0.065997,0.135207
Gender,0.543351,-0.1319,1.0,-0.093254,0.114631,0.245489,0.182333,0.508099,0.200988,-0.303646
Age,-0.077221,-0.369226,-0.093254,1.0,-0.308247,-0.189119,0.096067,-0.106788,-0.301914,0.19827
SibSp,-0.035322,0.083081,0.114631,-0.308247,1.0,0.414838,0.159651,0.258403,0.890712,-0.584471
Parch,0.081629,0.018443,0.245489,-0.189119,0.414838,1.0,0.216225,0.303608,0.783111,-0.583398
Fare,0.257307,-0.5495,0.182333,0.096067,0.159651,0.216225,1.0,0.137318,0.217138,-0.271832
Title,0.414088,-0.184841,0.508099,-0.106788,0.258403,0.303608,0.137318,1.0,0.328287,-0.38778
FamilySize,0.016639,0.065997,0.200988,-0.301914,0.890712,0.783111,0.217138,0.328287,1.0,-0.690922
isAlone,-0.203367,0.135207,-0.303646,0.19827,-0.584471,-0.583398,-0.271832,-0.38778,-0.690922,1.0


It looks like Survived and Title correlate fairly strongly (0.41), while the other new features correlate only weakly with Survived (isAlone: -0.20, FamilySize: 0.02).
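
One caveat: the Title codes 1 to 5 are arbitrary labels, so the correlation coefficient depends on which number was assigned to which title. A view that does not depend on the numbering is the survival rate per title. The sketch below uses made-up values, not the Titanic data.

```python
import pandas as pd

# made-up example: Title codes and survival flags
toy = pd.DataFrame({
    "Title":    [1, 1, 1, 1, 2, 2, 3, 3, 4, 5],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0, 1, 0],
})

# mean of a 0/1 column = share of survivors in each group
rate_per_title = toy.groupby("Title")["Survived"].mean()
print(rate_per_title)
```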

# **Check training data**

In [325]:
#check NaN values in training data
train_df[train_df.isna().any(axis=1)]

Unnamed: 0,Survived,Pclass,Gender,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize,isAlone
5,0,3,0,,0,0,8.4583,Q,1,0,1.0
17,1,2,0,,0,0,13.0000,S,1,0,1.0
19,1,3,1,,0,0,7.2250,C,3,0,1.0
26,0,3,0,,0,0,7.2250,C,1,0,1.0
28,1,3,1,,0,0,7.8792,Q,2,0,1.0
...,...,...,...,...,...,...,...,...,...,...,...
859,0,3,0,,0,0,7.2292,C,1,0,1.0
863,0,3,1,,8,2,69.5500,S,2,10,0.0
868,0,3,0,,0,0,9.5000,S,1,0,1.0
878,0,3,0,,0,0,7.8958,S,1,0,1.0


It seems that mainly ages are missing, so I have to make assumptions.

The correlation map shows that Age depends mainly on Pclass, SibSp and Parch (and on FamilySize and isAlone, but these are derived from SibSp and Parch, so I won't consider them further).
So passengers are grouped by these three features, and each missing age is set to the mean age of its group.

In [326]:
#set NaN ages to the mean of the group they belong to
#note: the ranges cover Pclass 0-3, SibSp 0-8 and Parch 0-2 only
for i_class in range(0,4):
    for i_Sib in range(0,9):
        for i_Parch in range(0,3):
            #rows that belong to the current (Pclass, SibSp, Parch) group
            in_group = (train_df['Pclass'] == i_class) & (train_df['SibSp'] == i_Sib) & (train_df['Parch'] == i_Parch)
            mean_group_age = train_df.loc[in_group & train_df['Age'].notna(), 'Age'].mean()

            #skip groups without any known age (their mean is NaN)
            if not math.isnan(mean_group_age):
                #the group mean is truncated to a whole number of years
                mean_group_age = int(mean_group_age)
                train_df.loc[in_group & train_df['Age'].isna(), 'Age'] = mean_group_age

In [327]:
#check for NaN values again
train_df.loc[train_df['Age'].isna()]

Unnamed: 0,Survived,Pclass,Gender,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize,isAlone
159,0,3,0,,8,2,69.55,S,4,10,0.0
180,0,3,1,,8,2,69.55,S,2,10,0.0
201,0,3,0,,8,2,69.55,S,1,10,0.0
324,0,3,0,,8,2,69.55,S,1,10,0.0
792,0,3,1,,8,2,69.55,S,2,10,0.0
846,0,3,0,,8,2,69.55,S,1,10,0.0
863,0,3,1,,8,2,69.55,S,2,10,0.0


It seems that only members of one family are left. They did not get an age because there was no row with that combination of Pclass, SibSp and Parch and a valid age, so no mean age could be calculated for that group. I will just set their age to the mean age of their passenger class.

In [328]:
#set remaining NaN ages to the mean age of their Pclass (all remaining rows are Pclass 3)
train_df.loc[train_df['Age'].isna(), ['Age']] = train_df.loc[train_df['Pclass'] == 3]['Age'].mean()
train_df.loc[train_df['Age'].isna()]

Unnamed: 0,Survived,Pclass,Gender,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize,isAlone
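
The two imputation steps above (group mean first, class mean as a fallback) can also be sketched with `groupby` and `transform`. This is an alternative formulation on a made-up frame, not the code that produced the results in this notebook: it keeps the unrounded means, whereas the loop truncates them to whole years, and it computes the class mean before any age is filled in.

```python
import numpy as np
import pandas as pd

# made-up rows: one group with a known age, one group without any, one first-class group
toy = pd.DataFrame({
    "Pclass": [3, 3, 3, 3, 1, 1],
    "SibSp":  [0, 0, 8, 8, 0, 0],
    "Parch":  [0, 0, 2, 2, 0, 0],
    "Age":    [20.0, np.nan, np.nan, np.nan, 40.0, np.nan],
})

# mean age of each (Pclass, SibSp, Parch) group, aligned with the rows
group_mean = toy.groupby(["Pclass", "SibSp", "Parch"])["Age"].transform("mean")
# mean age of each passenger class, used where the group mean is NaN
class_mean = toy.groupby("Pclass")["Age"].transform("mean")

toy["Age"] = toy["Age"].fillna(group_mean).fillna(class_mean)
print(toy)
```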


In [329]:
#no missing ages left
#check for any NaN entries:
train_df[train_df.isna().any(axis=1)]

Unnamed: 0,Survived,Pclass,Gender,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize,isAlone
61,1,1,1,38.0,0,0,80.0,,2,0,1.0
829,1,1,1,62.0,0,0,80.0,,3,0,1.0


In [330]:
# 2 rows without "Embarked" / I will set them manually to the most likely value, which is the one that occurred most often
train_df.groupby('Embarked')['Age'].count() 

Embarked
C    168
Q     77
S    644
Name: Age, dtype: int64

In [331]:
#S is by far the most frequent entry, so I will set the NaNs to S too
train_df.loc[train_df['Embarked'].isna(), 'Embarked'] = 'S'
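
Instead of reading the most frequent port off the counts, it can be looked up with `mode()`. A small sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# mode() ignores NaN and returns the most frequent value(s); take the first one
most_frequent = embarked.mode()[0]
filled = embarked.fillna(most_frequent)
print(most_frequent, filled.tolist())
```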

In [332]:
#check for missing values again
train_df[train_df.isna().any(axis=1)]

Unnamed: 0,Survived,Pclass,Gender,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize,isAlone


No more NaN values! :)

In [333]:
#take another look at the training data
train_df.head()

Unnamed: 0,Survived,Pclass,Gender,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize,isAlone
0,0,3,0,22.0,1,0,7.25,S,1,1,0.0
1,1,1,1,38.0,1,0,71.2833,C,3,1,0.0
2,1,3,1,26.0,0,0,7.925,S,2,0,1.0
3,1,1,1,35.0,1,0,53.1,S,3,1,0.0
4,0,3,0,35.0,0,0,8.05,S,1,0,1.0


In [334]:
#make Embarked a categorical int feature
train_df['Embarked']=train_df['Embarked'].map({'S':1,'C':2,'Q':3})
train_df.head(50)

Unnamed: 0,Survived,Pclass,Gender,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize,isAlone
0,0,3,0,22.0,1,0,7.25,1,1,1,0.0
1,1,1,1,38.0,1,0,71.2833,2,3,1,0.0
2,1,3,1,26.0,0,0,7.925,1,2,0,1.0
3,1,1,1,35.0,1,0,53.1,1,3,1,0.0
4,0,3,0,35.0,0,0,8.05,1,1,0,1.0
5,0,3,0,28.0,0,0,8.4583,3,1,0,1.0
6,0,1,0,54.0,0,0,51.8625,1,1,0,1.0
7,0,3,0,2.0,3,1,21.075,1,4,4,0.0
8,1,3,1,27.0,0,2,11.1333,1,3,2,0.0
9,1,2,1,14.0,1,0,30.0708,2,3,1,0.0


looks like our training data is ready to go

# **Check Test data**

In [335]:
#check for NaNs
test_df[test_df.isna().any(axis=1)]

Unnamed: 0,PassengerId,Pclass,Gender,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize,isAlone
10,902,3,0,,0,0,7.8958,S,1,0,1.0
22,914,1,1,,0,0,31.6833,S,3,0,1.0
29,921,3,0,,2,0,21.6792,C,1,2,0.0
33,925,3,1,,1,2,23.4500,S,3,3,0.0
36,928,3,1,,0,0,8.0500,S,2,0,1.0
...,...,...,...,...,...,...,...,...,...,...,...
408,1300,3,1,,0,0,7.7208,Q,2,0,1.0
410,1302,3,1,,0,0,7.7500,Q,2,0,1.0
413,1305,3,0,,0,0,8.0500,S,1,0,1.0
416,1308,3,0,,0,0,8.0500,S,1,0,1.0


In [336]:
#seems like many ages are missing again, so we use the same code as before, this time on the test data
#set NaN ages to the mean of the group they belong to
#note: the ranges cover Pclass 0-3, SibSp 0-8 and Parch 0-2 only
for i_class in range(0,4):
    for i_Sib in range(0,9):
        for i_Parch in range(0,3):
            #rows that belong to the current (Pclass, SibSp, Parch) group
            in_group = (test_df['Pclass'] == i_class) & (test_df['SibSp'] == i_Sib) & (test_df['Parch'] == i_Parch)
            mean_group_age = test_df.loc[in_group & test_df['Age'].notna(), 'Age'].mean()

            #skip groups without any known age (their mean is NaN)
            if not math.isnan(mean_group_age):
                #the group mean is truncated to a whole number of years
                mean_group_age = int(mean_group_age)
                test_df.loc[in_group & test_df['Age'].isna(), 'Age'] = mean_group_age

In [337]:
#check for nans in Age again
test_df.loc[test_df['Age'].isna()]

Unnamed: 0,PassengerId,Pclass,Gender,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize,isAlone
132,1024,3,1,,0,4,25.4667,S,3,4,0.0
342,1234,3,0,,1,9,69.55,S,1,10,0.0
365,1257,3,1,,1,9,69.55,S,3,10,0.0


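As in train_df, a few rows are left without an age. Here their Parch values (4 and 9) lie outside the loop's range(0,3), so these groups were never processed. I will set the missing ages to the mean of their Pclass again.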

In [338]:
#set missing ages to the mean age of their Pclass (all remaining rows are Pclass 3)
test_df.loc[test_df['Age'].isna(), 'Age'] = test_df.loc[test_df['Pclass'] == 3, 'Age'].mean()

In [339]:
#check for nan again
test_df.loc[test_df['Age'].isna()]

Unnamed: 0,PassengerId,Pclass,Gender,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize,isAlone


In [340]:
#seems like there are no more nan ages
#now check for nan's in all columns
test_df[test_df.isna().any(axis=1)]

Unnamed: 0,PassengerId,Pclass,Gender,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize,isAlone
152,1044,3,0,60.5,0,0,,S,1,0,1.0


In [341]:
#seems like only one Fare value is missing
#if you take another look at the correlation map, you can see that Fare depends most heavily on Pclass, so I will simply set the missing Fare to the mean Fare of the passenger's Pclass (3)
test_df.loc[test_df['Fare'].isna(), 'Fare'] = test_df.loc[test_df['Pclass'] == 3, 'Fare'].mean()

In [342]:
#check for nan in whole df again
test_df[test_df.isna().any(axis=1)]

Unnamed: 0,PassengerId,Pclass,Gender,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize,isAlone


no more nan! :)

In [343]:
#take another look at the test data
test_df.head()

Unnamed: 0,PassengerId,Pclass,Gender,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize,isAlone
0,892,3,0,34.5,0,0,7.8292,Q,1,0,1.0
1,893,3,1,47.0,1,0,7.0,S,3,1,0.0
2,894,2,0,62.0,0,0,9.6875,Q,1,0,1.0
3,895,3,0,27.0,0,0,8.6625,S,1,0,1.0
4,896,3,1,22.0,1,1,12.2875,S,3,2,0.0


In [344]:
#convert Embarked feature to categorical int:
test_df['Embarked']=test_df['Embarked'].replace({'S':1,'C':2,'Q':3})
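
The integer codes S=1, C=2, Q=3 suggest an order between the ports that does not really exist. One-hot encoding is a common alternative; it is a different technique from the one used in this notebook and is sketched here on two made-up frames. The `reindex` call gives the test frame exactly the columns of the training frame.

```python
import pandas as pd

train_toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test_toy = pd.DataFrame({"Embarked": ["S", "S", "C"]})  # no 'Q' in this toy test set

train_onehot = pd.get_dummies(train_toy, columns=["Embarked"], dtype=int)
test_onehot = pd.get_dummies(test_toy, columns=["Embarked"], dtype=int)

# categories that are missing in the test frame become all-zero columns
test_onehot = test_onehot.reindex(columns=train_onehot.columns, fill_value=0)
print(test_onehot)
```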

In [345]:
#take a final look at training data
train_df.head(50)

Unnamed: 0,Survived,Pclass,Gender,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize,isAlone
0,0,3,0,22.0,1,0,7.25,1,1,1,0.0
1,1,1,1,38.0,1,0,71.2833,2,3,1,0.0
2,1,3,1,26.0,0,0,7.925,1,2,0,1.0
3,1,1,1,35.0,1,0,53.1,1,3,1,0.0
4,0,3,0,35.0,0,0,8.05,1,1,0,1.0
5,0,3,0,28.0,0,0,8.4583,3,1,0,1.0
6,0,1,0,54.0,0,0,51.8625,1,1,0,1.0
7,0,3,0,2.0,3,1,21.075,1,4,4,0.0
8,1,3,1,27.0,0,2,11.1333,1,3,2,0.0
9,1,2,1,14.0,1,0,30.0708,2,3,1,0.0


looks like the test data is ready to go

In [346]:
#split train_df into training and cv data

#separate the features from the target before splitting
x_all = train_df.drop(columns=['Survived'])
y_all = train_df['Survived']

x_train, x_cv, y_train, y_cv = train_test_split(x_all, y_all, test_size=0.2, random_state=1)

#the test features are everything except PassengerId (which stays in test_df for the submission)
x_test=test_df.drop(columns=['PassengerId'])
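
A possible refinement, not used in this notebook, is a stratified split that keeps the share of survivors the same in both parts. The sketch below uses made-up data with 20 % positives; `stratify` is a regular parameter of `train_test_split`, everything else is illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # 100 made-up samples with one feature
y = np.array([1] * 20 + [0] * 80)   # 20 % positives

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# both parts keep the 20 % share of positives
print(y_tr.mean(), y_val.mean())
```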

In [347]:
#x_train['Title']=x_train['Title'].astype(int)
#x_train['Embarked']=x_train['Embarked'].astype(int)

In [348]:
#feature scaling
x_train=scale(x_train)
x_cv=scale(x_cv)
x_test=scale(x_test)
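
Note that `scale` standardizes each set with its own mean and standard deviation, so the same raw value can end up with a different scaled value in x_train, x_cv and x_test. A commonly used alternative is to fit a `StandardScaler` on the training data and reuse it for the other sets. This is a sketch on made-up numbers, not the procedure that produced the results below.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_toy = np.array([[1.0], [2.0], [3.0]])
cv_toy = np.array([[2.0], [4.0]])

scaler = StandardScaler().fit(train_toy)   # mean and std come from the training data only
train_scaled = scaler.transform(train_toy)
cv_scaled = scaler.transform(cv_toy)       # reuses the training mean and std

print(scaler.mean_, cv_scaled.ravel())
```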

# Build models

**Logistic regression**

In [349]:
model_LG = LogisticRegression()
model_LG.fit(x_train, y_train)
y_hat_LG = model_LG.predict(x_cv)
perf_LG = mean_squared_error(y_hat_LG, y_cv)

In [350]:
"Mean squared error of Logistic regression: ",perf_LG

('Mean squared error of Logistic regression: ', 0.16201117318435754)
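
For hard 0/1 predictions the mean squared error is simply the share of wrong predictions, i.e. 1 - accuracy, so the value above corresponds to an accuracy of about 0.838 on the cv data. A short check on made-up labels:

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])

mse = np.mean((y_true - y_pred) ** 2)   # each wrong prediction contributes 1, each correct one 0
accuracy = np.mean(y_true == y_pred)

print(mse, accuracy)
```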

**Neural Network** 

I will run cv testing on 9 different models with different architectures (the number of nodes and the number of layers differ).
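
The nine architectures form a 3 x 3 grid: 5, 10 or 20 nodes per hidden layer, and 2, 4 or 6 hidden layers. To get a feeling for their size, the sketch below lists the grid in the order model_1 to model_9 and counts the trainable parameters of each fully connected network. `count_parameters` is a hypothetical helper written for this illustration; it does not use Keras.

```python
n_inputs = 10  # number of feature columns (train_df columns without 'Survived')

# (nodes per hidden layer, number of hidden layers) in the order model_1 ... model_9
architectures = [(width, depth) for depth in (2, 4, 6) for width in (5, 10, 20)]

def count_parameters(width, depth, n_inputs):
    """Weights plus biases of a fully connected net with one output node."""
    first = (n_inputs + 1) * width               # input -> first hidden layer
    hidden = (depth - 1) * (width + 1) * width   # hidden -> hidden connections
    output = width + 1                           # last hidden layer -> output node
    return first + hidden + output

param_counts = [count_parameters(w, d, n_inputs) for w, d in architectures]
for (w, d), n in zip(architectures, param_counts):
    print(f"{d} hidden layers x {w} nodes: {n} parameters")
```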

In [351]:
train_epochs=500

In [352]:
#build model_1
model_1=Sequential()
n_columns = train_df.columns.size -1

model_1.add(Dense(5, activation='relu', input_shape=(n_columns,)))
model_1.add(Dense(5, activation='relu'))
model_1.add(Dense(1))

model_1.compile(optimizer='adam', loss='mean_squared_error')

In [353]:
#train the model
model_1.fit(x_train, y_train, epochs=train_epochs, verbose=0)
y_hat_1=model_1.predict(x_cv)

In [354]:
print("Mean squared error NN-model: ", mean_squared_error(y_cv,y_hat_1))

Mean squared error NN-model:  0.1204703998440532


In [355]:
#y_hat_NN = mean_squared_error(y_cv,y_hat_1)

In [356]:
#build model_2
model_2=Sequential()
n_columns = train_df.columns.size -1

model_2.add(Dense(10, activation='relu', input_shape=(n_columns,)))
model_2.add(Dense(10, activation='relu'))
model_2.add(Dense(1))

model_2.compile(optimizer='adam', loss='mean_squared_error')

In [357]:
model_2.fit(x_train, y_train, epochs=train_epochs, verbose=0)
y_hat_2=model_2.predict(x_cv)

In [358]:
print("Mean squared error NN-model: ", mean_squared_error(y_cv,y_hat_2))

Mean squared error NN-model:  0.131862676847891


In [359]:
#build model_3
model_3=Sequential()
n_columns = train_df.columns.size -1

model_3.add(Dense(20, activation='relu', input_shape=(n_columns,)))
model_3.add(Dense(20, activation='relu'))
model_3.add(Dense(1))

model_3.compile(optimizer='adam', loss='mean_squared_error')

In [360]:
model_3.fit(x_train, y_train, epochs=train_epochs, verbose=0)
y_hat_3=model_3.predict(x_cv)

In [361]:
print("Mean squared error: ", mean_squared_error(y_hat_3, y_cv))

Mean squared error:  0.14204899360234144


In [362]:
#build model_4
model_4=Sequential()
n_columns = train_df.columns.size -1

model_4.add(Dense(5, activation='relu', input_shape=(n_columns,)))
model_4.add(Dense(5, activation='relu'))
model_4.add(Dense(5, activation='relu'))
model_4.add(Dense(5, activation='relu'))
model_4.add(Dense(1))

model_4.compile(optimizer='adam', loss='mean_squared_error')

In [363]:
model_4.fit(x_train, y_train, epochs=train_epochs, verbose=0)
y_hat_4=model_4.predict(x_cv)

In [364]:
print("MSE: ",mean_squared_error(y_hat_4, y_cv))

MSE:  0.12217744514793946


In [365]:
#build model_5
model_5=Sequential()
n_columns = train_df.columns.size -1

model_5.add(Dense(10, activation='relu', input_shape=(n_columns,)))
model_5.add(Dense(10, activation='relu'))
model_5.add(Dense(10, activation='relu'))
model_5.add(Dense(10, activation='relu'))
model_5.add(Dense(1))

model_5.compile(optimizer='adam', loss='mean_squared_error')

In [366]:
model_5.fit(x_train, y_train, epochs=train_epochs, verbose=0)
y_hat_5=model_5.predict(x_cv)

In [367]:
print("MSE: ",mean_squared_error(y_hat_5, y_cv))

MSE:  0.156209950665661


In [368]:
#build model_6
model_6=Sequential()
n_columns = train_df.columns.size -1

model_6.add(Dense(20, activation='relu', input_shape=(n_columns,)))
model_6.add(Dense(20, activation='relu'))
model_6.add(Dense(20, activation='relu'))
model_6.add(Dense(20, activation='relu'))
model_6.add(Dense(1))

model_6.compile(optimizer='adam', loss='mean_squared_error')

In [369]:
model_6.fit(x_train, y_train, epochs=train_epochs, verbose=0)
y_hat_6=model_6.predict(x_cv)

In [370]:
print("MSE: ",mean_squared_error(y_hat_6, y_cv))

MSE:  0.16043788216031335


In [371]:
#build model_7
model_7=Sequential()
n_columns = train_df.columns.size -1

model_7.add(Dense(5, activation='relu', input_shape=(n_columns,)))
model_7.add(Dense(5, activation='relu'))
model_7.add(Dense(5, activation='relu'))
model_7.add(Dense(5, activation='relu'))
model_7.add(Dense(5, activation='relu'))
model_7.add(Dense(5, activation='relu'))
model_7.add(Dense(1))

model_7.compile(optimizer='adam', loss='mean_squared_error')

In [372]:
model_7.fit(x_train, y_train, epochs=train_epochs, verbose=0)
y_hat_7=model_7.predict(x_cv)

In [373]:
print("MSE: ",mean_squared_error(y_hat_7, y_cv))

MSE:  0.11362395042795001


In [374]:
#build model_8
model_8=Sequential()
n_columns = train_df.columns.size -1

model_8.add(Dense(10, activation='relu', input_shape=(n_columns,)))
model_8.add(Dense(10, activation='relu'))
model_8.add(Dense(10, activation='relu'))
model_8.add(Dense(10, activation='relu'))
model_8.add(Dense(10, activation='relu'))
model_8.add(Dense(10, activation='relu'))
model_8.add(Dense(1))

model_8.compile(optimizer='adam', loss='mean_squared_error')

In [375]:
model_8.fit(x_train, y_train, epochs=train_epochs, verbose=0)
y_hat_8=model_8.predict(x_cv)

In [376]:
print("MSE: ",mean_squared_error(y_hat_8, y_cv))

MSE:  0.15808987909237152


In [377]:
#build model_9
model_9=Sequential()
n_columns = train_df.columns.size -1

model_9.add(Dense(20, activation='relu', input_shape=(n_columns,)))
model_9.add(Dense(20, activation='relu'))
model_9.add(Dense(20, activation='relu'))
model_9.add(Dense(20, activation='relu'))
model_9.add(Dense(20, activation='relu'))
model_9.add(Dense(20, activation='relu'))
model_9.add(Dense(1))

model_9.compile(optimizer='adam', loss='mean_squared_error')

In [378]:
model_9.fit(x_train, y_train, epochs=train_epochs, verbose=0)
y_hat_9=model_9.predict(x_cv)

In [379]:
print("MSE: ",mean_squared_error(y_hat_9, y_cv))

MSE:  0.15650178335688075


In [380]:
perf_NN=min(mean_squared_error(y_hat_1, y_cv),mean_squared_error(y_hat_2, y_cv),mean_squared_error(y_hat_3, y_cv),mean_squared_error(y_hat_4, y_cv),mean_squared_error(y_hat_5, y_cv),mean_squared_error(y_hat_6, y_cv),mean_squared_error(y_hat_7, y_cv),mean_squared_error(y_hat_8, y_cv),mean_squared_error(y_hat_9, y_cv))

In [381]:
#choose best NN model: the one with the lowest MSE on the cv data
nn_models = [model_1, model_2, model_3, model_4, model_5, model_6, model_7, model_8, model_9]
nn_preds = [y_hat_1, y_hat_2, y_hat_3, y_hat_4, y_hat_5, y_hat_6, y_hat_7, y_hat_8, y_hat_9]

#position of the smallest MSE in the list
best_nn = int(np.argmin([mean_squared_error(pred, y_cv) for pred in nn_preds]))
y_hat_NN = np.round(nn_preds[best_nn])
model_NN = nn_models[best_nn]

**Support Vector Machine**

In [382]:
#build model one
model_SVM_1 = SVC(kernel='rbf')
model_SVM_1.fit(x_train, y_train)
y_hat_SVM_1 = model_SVM_1.predict(x_cv)

In [383]:
print('Mean squared error: ', mean_squared_error(y_hat_SVM_1, y_cv))

Mean squared error:  0.15083798882681565


In [384]:
#build model two
model_SVM_2 = SVC(kernel='linear')
model_SVM_2.fit(x_train, y_train)
y_hat_SVM_2 = model_SVM_2.predict(x_cv)
print('Mean squared error: ', mean_squared_error(y_hat_SVM_2, y_cv))

Mean squared error:  0.15083798882681565


In [385]:
#build model three
model_SVM_3 = SVC(kernel='poly')
model_SVM_3.fit(x_train, y_train)
y_hat_SVM_3 = model_SVM_3.predict(x_cv)
print('Mean squared error: ', mean_squared_error(y_hat_SVM_3, y_cv))

Mean squared error:  0.17318435754189945


In [386]:
#build model four
model_SVM_4 = SVC(kernel='sigmoid')
model_SVM_4.fit(x_train, y_train)
y_hat_SVM_4 = model_SVM_4.predict(x_cv)
print('Mean squared error: ', mean_squared_error(y_hat_SVM_4, y_cv))

Mean squared error:  0.2737430167597765


In [387]:
perf_SVM = min(mean_squared_error(y_hat_SVM_1, y_cv), mean_squared_error(y_hat_SVM_2, y_cv), mean_squared_error(y_hat_SVM_3, y_cv), mean_squared_error(y_hat_SVM_4, y_cv))

In [388]:
#choose the best SVM model
if perf_SVM == mean_squared_error(y_hat_SVM_1, y_cv):
    model_SVM = model_SVM_1
if perf_SVM == mean_squared_error(y_hat_SVM_2, y_cv):
    model_SVM = model_SVM_2
if perf_SVM == mean_squared_error(y_hat_SVM_3, y_cv):
    model_SVM = model_SVM_3
if perf_SVM == mean_squared_error(y_hat_SVM_4, y_cv):
    model_SVM = model_SVM_4
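
The rbf and the linear kernel reach exactly the same MSE above. Because every matching `if` overwrites `model_SVM`, the chain keeps the last match, which is the linear model. A dictionary makes such ties visible; the sketch below uses the MSE values printed above, and `min` returns the first of the tied entries.

```python
# cv MSE values as printed in the cells above
svm_mse = {
    "rbf": 0.15083798882681565,
    "linear": 0.15083798882681565,
    "poly": 0.17318435754189945,
    "sigmoid": 0.2737430167597765,
}

best_kernel = min(svm_mse, key=svm_mse.get)   # first kernel with the lowest MSE
tied = [kernel for kernel, mse in svm_mse.items() if mse == svm_mse[best_kernel]]

print(best_kernel, tied)
```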

In [389]:
#y_hat_SVM = model_SVM.predict(y_test)

**KNN**

In [397]:
#try the values k = 1 to 99
for k in range(1,100):
    test_model_KNN = KNeighborsClassifier(n_neighbors=k).fit(x_train,y_train)
    y_hat_test_KNN = test_model_KNN.predict(x_cv)
    print("k: ",k, "MSE: ",mean_squared_error(y_hat_test_KNN, y_cv))

k:  1 MSE:  0.2346368715083799
k:  2 MSE:  0.18435754189944134
k:  3 MSE:  0.1787709497206704
k:  4 MSE:  0.18435754189944134
k:  5 MSE:  0.16201117318435754
k:  6 MSE:  0.16759776536312848
k:  7 MSE:  0.16759776536312848
k:  8 MSE:  0.17318435754189945
k:  9 MSE:  0.1787709497206704
k:  10 MSE:  0.16759776536312848
k:  11 MSE:  0.16201117318435754
k:  12 MSE:  0.16201117318435754
k:  13 MSE:  0.17318435754189945
k:  14 MSE:  0.16759776536312848
k:  15 MSE:  0.16201117318435754
k:  16 MSE:  0.15083798882681565
k:  17 MSE:  0.16201117318435754
k:  18 MSE:  0.16759776536312848
k:  19 MSE:  0.16201117318435754
k:  20 MSE:  0.15083798882681565
k:  21 MSE:  0.15083798882681565
k:  22 MSE:  0.13966480446927373
k:  23 MSE:  0.1452513966480447
k:  24 MSE:  0.13966480446927373
k:  25 MSE:  0.1452513966480447
k:  26 MSE:  0.13966480446927373
k:  27 MSE:  0.13966480446927373
k:  28 MSE:  0.13966480446927373
k:  29 MSE:  0.13966480446927373
k:  30 MSE:  0.13966480446927373
k:  31 MSE:  0.139664804
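
Rather than reading the best k off the printout, the results could be stored and searched. The sketch below uses a few of the values printed above (not the full list); within this excerpt several values of k share the lowest MSE.

```python
# part of the printed results above (k -> cv MSE)
knn_mse = {
    20: 0.15083798882681565,
    21: 0.15083798882681565,
    22: 0.13966480446927373,
    23: 0.1452513966480447,
    24: 0.13966480446927373,
    25: 0.1452513966480447,
    26: 0.13966480446927373,
    30: 0.13966480446927373,
}

lowest = min(knn_mse.values())
best_ks = [k for k, mse in knn_mse.items() if mse == lowest]
print(lowest, best_ks)
```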

In [399]:
#seems like around k=30 the MSE is not really improving any further, so we choose k=30
k=30
model_KNN = KNeighborsClassifier(n_neighbors=k).fit(x_train,y_train)
y_hat_KNN = model_KNN.predict(x_cv)
perf_KNN = mean_squared_error(y_hat_KNN, y_cv)
print("MSE:", perf_KNN)

MSE: 0.13966480446927373


# **Check performance**

In [400]:
#take a look at the different performances
data= {'Index':[1,2,3,4],'Model':['Logistic regression', 'Neural Network', 'Support Vector Machine','K nearest neighbors'], 'MSE':[perf_LG, perf_NN, perf_SVM, perf_KNN]}
performance_df=pd.DataFrame(data)

In [401]:
performance_df

Unnamed: 0,Index,Model,MSE
0,1,Logistic regression,0.162011
1,2,Neural Network,0.113624
2,3,Support Vector Machine,0.150838
3,4,K nearest neighbors,0.139665
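
A caveat for reading this table: the neural network MSE is computed from the raw, continuous outputs of the final Dense(1) layer, while the other three models predict hard 0/1 labels. The two kinds of MSE are not directly comparable, as this sketch with made-up numbers suggests; rounding the outputs first would put all models on the same scale.

```python
import numpy as np

y_true = np.array([0, 1, 1, 0])
raw_output = np.array([0.4, 0.6, 0.4, 0.1])   # made-up continuous network outputs

mse_raw = np.mean((y_true - raw_output) ** 2)                 # MSE of the raw outputs
mse_labels = np.mean((y_true - np.round(raw_output)) ** 2)    # MSE after rounding to 0/1

print(mse_raw, mse_labels)
```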


In [402]:
#choose the best model (idxmin returns the row label of the smallest MSE)
best_model_index = performance_df.loc[performance_df['MSE'].idxmin(), 'Index']
print(int(best_model_index))

2


In [403]:
#let the best model make a prediction for the test data
if int(best_model_index) == 1:
    y_hat=model_LG.predict(x_test)
if int(best_model_index) == 2:
    y_hat=model_NN.predict(x_test)
if int(best_model_index) == 3:
    y_hat=model_SVM.predict(x_test)
if int(best_model_index) == 4:
    y_hat=model_KNN.predict(x_test)

In [404]:
#formatting y_hat
y_hat=pd.DataFrame(data=y_hat, columns=['Survived'])
y_hat=round(y_hat).astype(int)
y_hat

Unnamed: 0,Survived
0,0
1,0
2,0
3,0
4,0
...,...
413,0
414,1
415,0
416,0


In [405]:
#merge y_hat and 'PassengerId' together for submission to kaggle.com
submission = pd.DataFrame({'PassengerId': test_df['PassengerId'], 'Survived' : y_hat['Survived']})

In [406]:
#check submission length (should be 418)
len(submission.index)

418

In [407]:
#check for invalid values
submission.loc[(submission['Survived']!=0) & (submission['Survived'] != 1)]

Unnamed: 0,PassengerId,Survived


In [408]:
#just in case: make values valid
submission.loc[submission['Survived']<0, 'Survived']=0
submission.loc[submission['Survived']>1, 'Survived']=1
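
The same safeguard can be written in one step with `clip`, shown here on a made-up Series:

```python
import pandas as pd

survived = pd.Series([-1, 0, 1, 2])

# values below 0 become 0, values above 1 become 1
clipped = survived.clip(lower=0, upper=1)
print(clipped.tolist())
```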

In [409]:
#take a final look at the submission
submission

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,0
...,...,...
413,1305,0
414,1306,1
415,1307,0
416,1308,0


In [410]:
#create csv
submission.to_csv("submission.csv", index=False)

The submission was uploaded to kaggle.com and scored an accuracy of 0.78947, which was good enough for 4685th place out of the 220006 participants who have completed this challenge.