**The Exercise**

This exercise from kaggle.com (a website for machine learning and data science challenges) provides two files, train.csv and test.csv, that contain information about the passengers of the Titanic. Only train.csv states whether a passenger survived. The goal is to predict which of the passengers listed in test.csv survived the Titanic disaster.

Further information about the data set and its features: https://www.kaggle.com/c/titanic/data

In [None]:
import pandas as pd 
import numpy as np
import math
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn.preprocessing import scale
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

import keras
from keras.models import Sequential
from keras.layers import Dense


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Get data

In [None]:
#loading data
train_df = pd.read_csv("/kaggle/input/titanic/train.csv")
test_df = pd.read_csv("/kaggle/input/titanic/test.csv")

In [None]:
train_df.head()

In [None]:
test_df.head()

# **Get to know the data**

In [None]:
#take a look at the feature correlation (numeric features only; text columns are excluded explicitly)
corr = train_df.select_dtypes(include='number').corr()
corr.style.background_gradient(cmap='coolwarm')

# Data pre-processing

The name itself is most likely irrelevant for survival, but it contains a person's title, which could be an indicator of a higher or lower priority during the evacuation. So we create a 'Title' feature for train_df and test_df.

In [None]:
#extract a 'Title' feature from the name feature because it probably correlates with survival
#names look like "Braund, Mr. Owen Harris": the title sits between the comma and the first period
train_df['Title'] = train_df['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

In [None]:
#take a look at the result
train_df['Title']

In [None]:
#looks like there are many different Titles
#see which title appears how often
train_df['Title'].value_counts()

In [None]:
#because there are a lot of different titles, we can group the rare ones as 'else'
#We set the threshold to 10, so every title that appears fewer than 10 times (everything except Mr, Miss, Mrs and Master) becomes 'else'
train_df.loc[~train_df['Title'].isin(['Mr', 'Miss', 'Mrs', 'Master']), 'Title'] = 'else'
train_df['Title'].value_counts()

In [None]:
#do the same for test_df
#get titles (the text between the comma and the first period of the name)
test_df['Title'] = test_df['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

#see which value appears how often
test_df['Title'].value_counts()

#set every title that appears fewer than 10 times to 'else'
test_df.loc[~test_df['Title'].isin(['Mr', 'Miss', 'Mrs', 'Master']), 'Title'] = 'else'
test_df['Title'].value_counts()
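
The title extraction and the grouping of rare titles are repeated for both data frames. One way to keep the two in sync is a small helper. The following is only a sketch under assumptions: the function name `add_title`, the regular expression and the four sample names are made up for illustration, and the threshold is lowered to 2 because the sample is tiny (the notebook uses 10 on the full training set).

```python
import pandas as pd

def add_title(df, common_titles):
    """Return a copy of df with a 'Title' column; titles outside common_titles become 'else'."""
    out = df.copy()
    # Capture the text between the comma and the first period, e.g. "Mr" in "Braund, Mr. Owen Harris"
    title = out['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
    # Keep frequent titles, replace everything else with 'else'
    out['Title'] = title.where(title.isin(common_titles), 'else')
    return out

# Hypothetical mini data set in the same "Last, Title. First" format as the Kaggle names
sample = pd.DataFrame({'Name': [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Allen, Mr. William Henry',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
]})

# Derive the frequent titles from the data instead of hard-coding them
counts = sample['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip().value_counts()
min_count = 2  # assumption for this tiny sample
common = counts[counts >= min_count].index

sample = add_title(sample, common)
print(sample['Title'].tolist())
```

In the notebook, the list of frequent titles would be derived once from train_df and then passed to the helper for both train_df and test_df, so both frames end up with the same categories.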

In [None]:
#the features SibSp and Parch can be combined into FamilySize (split testing showed that this feature improves the model's performance)
#add a FamilySize feature:
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch']
test_df['FamilySize'] = test_df['SibSp'] + test_df['Parch']

In [None]:
#whether a person travels with or without family members could be correlated with survival (split testing showed that this feature improves the model's performance)
#add an isAlone feature which is 1 when FamilySize is 0, and 0 otherwise
train_df['isAlone'] = (train_df['FamilySize'] == 0).astype(int)
test_df['isAlone'] = (test_df['FamilySize'] == 0).astype(int)

In [None]:
#check result
train_df.head()

In [None]:
#take another look at the feature correlation, including the new numeric features
corr = train_df.select_dtypes(include='number').corr()
corr.style.background_gradient(cmap='coolwarm')

In [None]:
#drop columns that have mostly missing entries (like Cabin) and/or are irrelevant for survival
#PassengerId in test_df is still needed for the submission in the end
train_df = train_df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])
test_df = test_df.drop(columns=['Name', 'Ticket', 'Cabin'])

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
#convert the Sex feature to a categorical int: female=1, male=0
train_df['Sex'] = train_df['Sex'].map({'female': 1, 'male': 0})
test_df['Sex'] = test_df['Sex'].map({'female': 1, 'male': 0})

In [None]:
#rename the Sex column to Gender
train_df = train_df.rename(columns={'Sex' : 'Gender'})
test_df = test_df.rename(columns={'Sex' : 'Gender'})

In [None]:
#convert the Title feature to a categorical int:
train_df['Title'] = train_df['Title'].map({'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'else': 5}).astype(int)
test_df['Title'] = test_df['Title'].map({'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'else': 5}).astype(int)

In [None]:
train_df.head()

In [None]:
#check the correlation map again (Embarked is still text at this point, so it is left out)
corr = train_df.select_dtypes(include='number').corr()
corr.style.background_gradient(cmap='coolwarm')

It looks like Title correlates fairly strongly with Survived, while the other new features correlate only slightly.

# **Check training data**

In [None]:
#check NaN values in training data
train_df[train_df.isna().any(axis=1)]

It seems that mainly ages are missing, so I have to make assumptions.

If you take a look at the correlation map, you can see that Age mainly depends on Pclass, SibSp and Parch (and on FamilySize and isAlone, but these are derived from SibSp and Parch, so I won't consider them further).
So passengers will be grouped by these features, and missing ages are set to the mean age of their group.

In [None]:
#set NaN ages to the mean of the group they belong to
for i_class in range(1, 4):
    for i_Sib in range(0, 9):
        for i_Parch in range(0, 3):
            in_group = (train_df['Pclass'] == i_class) & (train_df['SibSp'] == i_Sib) & (train_df['Parch'] == i_Parch)
            mean_group_age = train_df.loc[in_group & train_df['Age'].notna(), 'Age'].mean()

            #skip groups without any known age
            if not math.isnan(mean_group_age):
                train_df.loc[in_group & train_df['Age'].isna(), 'Age'] = int(mean_group_age)
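
The nested loops can also be expressed with a pandas group-by. This is a minimal sketch on a made-up toy frame (`toy` and its values are hypothetical), not a drop-in replacement: unlike the loops, it covers every Parch value that occurs in the data, so results on the real data may differ slightly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names match the Titanic data
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'SibSp':  [0, 0, 0, 1, 1, 8],
    'Parch':  [0, 0, 0, 2, 2, 2],
    'Age':    [30.0, 41.0, np.nan, 4.0, np.nan, np.nan],
})

# Mean age of each (Pclass, SibSp, Parch) group, aligned row by row with toy
group_mean = toy.groupby(['Pclass', 'SibSp', 'Parch'])['Age'].transform('mean')

# Round down like int() does for positive ages, then fill only the missing ages
toy['Age'] = toy['Age'].fillna(np.floor(group_mean))
print(toy['Age'].tolist())
```

A row whose group contains no known age at all (the last row of the toy frame) stays NaN and needs a fallback.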

In [None]:
#check for NaN values again
train_df.loc[train_df['Age'].isna()]

It seems only members of one family are left. They did not get an age because there was no row with that combination of Pclass, SibSp and Parch and a valid age, so no mean age could be calculated for that group. I will just set their age to the mean age of their passenger class.

In [None]:
#set remaining NaN ages to the mean age of their Pclass
train_df['Age'] = train_df['Age'].fillna(train_df.groupby('Pclass')['Age'].transform('mean'))
train_df.loc[train_df['Age'].isna()]

In [None]:
#no missing ages left
#check for any NaN entries:
train_df[train_df.isna().any(axis=1)]

In [None]:
#2 rows without "Embarked" / I will set them manually to the most likely value, which is the one that occurred most often
train_df['Embarked'].value_counts()

In [None]:
#S is by far the most frequent entry, so I will set the NaNs to S too
train_df.loc[train_df['Embarked'].isna(), 'Embarked'] = 'S'

In [None]:
#check for missing values again
train_df[train_df.isna().any(axis=1)]

No more NaN values! :)

In [None]:
#take another look at the training data
train_df.head()

In [None]:
#make Embarked a categorical int feature
train_df['Embarked']=train_df['Embarked'].map({'S':1,'C':2,'Q':3})
train_df.head(50)

looks like our training data is ready to go

# **Check Test data**

In [None]:
#check for NaNs
test_df[test_df.isna().any(axis=1)]

In [None]:
#seems like many ages are missing again, so we use the same code as before, this time on the test data
#set NaN ages to the mean of the group they belong to
for i_class in range(1, 4):
    for i_Sib in range(0, 9):
        for i_Parch in range(0, 3):
            in_group = (test_df['Pclass'] == i_class) & (test_df['SibSp'] == i_Sib) & (test_df['Parch'] == i_Parch)
            mean_group_age = test_df.loc[in_group & test_df['Age'].notna(), 'Age'].mean()

            #skip groups without any known age
            if not math.isnan(mean_group_age):
                test_df.loc[in_group & test_df['Age'].isna(), 'Age'] = int(mean_group_age)

In [None]:
#check for nans in Age again
test_df.loc[test_df['Age'].isna()]

Same issue as in train_df. I will set the missing ages to the mean age of their Pclass again.

In [None]:
#set missing ages to the mean age of their Pclass
test_df['Age'] = test_df['Age'].fillna(test_df.groupby('Pclass')['Age'].transform('mean'))

In [None]:
#check for NaN ages in the test data again
test_df.loc[test_df['Age'].isna()]

In [None]:
#seems like there are no more nan ages
#now check for nan's in all columns
test_df[test_df.isna().any(axis=1)]

In [None]:
#seems like only one Fare value is missing
#if you take another look at the correlation map, you can see that Fare depends most heavily on Pclass, so I will simply set the missing Fare to the mean Fare of its Pclass
test_df['Fare'] = test_df['Fare'].fillna(test_df.groupby('Pclass')['Fare'].transform('mean'))

In [None]:
#check for nan in whole df again
test_df[test_df.isna().any(axis=1)]

no more nan! :)

In [None]:
#take another look at the test data
test_df.head()

In [None]:
#convert the Embarked feature to a categorical int (same mapping as for train_df):
test_df['Embarked']=test_df['Embarked'].map({'S':1,'C':2,'Q':3})

In [None]:
#take a final look at the test data
test_df.head(50)

looks like the test data is ready to go

In [None]:
#split train_df into training and cross-validation (cv) data; test_df provides the test data

x_train, x_cv, y_train, y_cv = train_test_split(train_df.drop(columns=['Survived']), train_df['Survived'], test_size=0.2, random_state=1)

x_test=test_df.drop(columns=['PassengerId'])

In [None]:
#x_train['Title']=x_train['Title'].astype(int)
#x_train['Embarked']=x_train['Embarked'].astype(int)

In [None]:
#feature scaling
x_train=scale(x_train)
x_cv=scale(x_cv)
x_test=scale(x_test)
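
Note that `scale` standardizes each array with its own mean and standard deviation, so the three sets are transformed slightly differently. A common alternative is to learn the scaling from the training part only and reuse it for the other sets. The sketch below shows the idea with scikit-learn's `StandardScaler`; the arrays `demo_train` and `demo_cv` are hypothetical stand-ins for x_train and x_cv.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for x_train and x_cv (two features each)
demo_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
demo_cv = np.array([[2.0, 40.0]])

scaler = StandardScaler()
demo_train_scaled = scaler.fit_transform(demo_train)  # learns mean and std from the training rows
demo_cv_scaled = scaler.transform(demo_cv)            # reuses the training mean and std

print(demo_train_scaled.mean(axis=0))
print(demo_cv_scaled)
```

Whether this changes the leaderboard score of this notebook is not something I have tested.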

# Build models

**Logistic regression**

In [None]:
model_LG = LogisticRegression()
model_LG.fit(x_train, y_train)
y_hat_LG = model_LG.predict(x_cv)
perf_LG = mean_squared_error(y_hat_LG, y_cv)
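
A side note on the metric: for predictions that are exactly 0 or 1, the mean squared error is simply the share of wrong predictions, i.e. 1 minus the accuracy. The following minimal sketch checks this on made-up labels (`y_true` and `y_pred` are hypothetical and not taken from the Titanic data):

```python
import numpy as np

# Hypothetical labels and predictions, both restricted to 0 and 1
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

mse = np.mean((y_true - y_pred) ** 2)   # squared difference is 1 for a miss, 0 for a hit
accuracy = np.mean(y_true == y_pred)

print(mse, accuracy)  # 0.25 0.75
```

This equivalence only holds for hard 0/1 predictions; for continuous outputs the MSE is a different quantity.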

In [None]:
print("Mean squared error of logistic regression: ", perf_LG)

**Neural Network** 

I will evaluate 9 models with different architectures on the cv data (the number of nodes and the number of layers differ).

In [None]:
train_epochs=500

In [None]:
#build model_1
model_1=Sequential()
n_columns = train_df.columns.size -1

model_1.add(Dense(5, activation='relu', input_shape=(n_columns,)))
model_1.add(Dense(5, activation='relu'))
model_1.add(Dense(1))

model_1.compile(optimizer='adam', loss='mean_squared_error')

In [None]:
#train the model
model_1.fit(x_train, y_train, epochs=train_epochs, verbose=0)
y_hat_1=model_1.predict(x_cv)

In [None]:
print("Mean squared error NN-model: ", mean_squared_error(y_cv,y_hat_1))

In [None]:
#y_hat_NN = mean_squared_error(y_cv,y_hat_1)

In [None]:
#build model_2
model_2=Sequential()
n_columns = train_df.columns.size -1

model_2.add(Dense(10, activation='relu', input_shape=(n_columns,)))
model_2.add(Dense(10, activation='relu'))
model_2.add(Dense(1))

model_2.compile(optimizer='adam', loss='mean_squared_error')

In [None]:
model_2.fit(x_train, y_train, epochs=train_epochs, verbose=0)
y_hat_2=model_2.predict(x_cv)

In [None]:
print("Mean squared error NN-model: ", mean_squared_error(y_cv,y_hat_2))

In [None]:
#build model_3
model_3=Sequential()
n_columns = train_df.columns.size -1

model_3.add(Dense(20, activation='relu', input_shape=(n_columns,)))
model_3.add(Dense(20, activation='relu'))
model_3.add(Dense(1))

model_3.compile(optimizer='adam', loss='mean_squared_error')

In [None]:
model_3.fit(x_train, y_train, epochs=train_epochs, verbose=0)
y_hat_3=model_3.predict(x_cv)

In [None]:
print("Mean squared error: ", mean_squared_error(y_hat_3, y_cv))

In [None]:
#build model_4
model_4=Sequential()
n_columns = train_df.columns.size -1

model_4.add(Dense(5, activation='relu', input_shape=(n_columns,)))
model_4.add(Dense(5, activation='relu'))
model_4.add(Dense(5, activation='relu'))
model_4.add(Dense(5, activation='relu'))
model_4.add(Dense(1))

model_4.compile(optimizer='adam', loss='mean_squared_error')

In [None]:
model_4.fit(x_train, y_train, epochs=train_epochs, verbose=0)
y_hat_4=model_4.predict(x_cv)

In [None]:
print("MSE: ",mean_squared_error(y_hat_4, y_cv))

In [None]:
#build model_5
model_5=Sequential()
n_columns = train_df.columns.size -1

model_5.add(Dense(10, activation='relu', input_shape=(n_columns,)))
model_5.add(Dense(10, activation='relu'))
model_5.add(Dense(10, activation='relu'))
model_5.add(Dense(10, activation='relu'))
model_5.add(Dense(1))

model_5.compile(optimizer='adam', loss='mean_squared_error')

In [None]:
model_5.fit(x_train, y_train, epochs=train_epochs, verbose=0)
y_hat_5=model_5.predict(x_cv)

In [None]:
print("MSE: ",mean_squared_error(y_hat_5, y_cv))

In [None]:
#build model_6
model_6=Sequential()
n_columns = train_df.columns.size -1

model_6.add(Dense(20, activation='relu', input_shape=(n_columns,)))
model_6.add(Dense(20, activation='relu'))
model_6.add(Dense(20, activation='relu'))
model_6.add(Dense(20, activation='relu'))
model_6.add(Dense(1))

model_6.compile(optimizer='adam', loss='mean_squared_error')

In [None]:
model_6.fit(x_train, y_train, epochs=train_epochs, verbose=0)
y_hat_6=model_6.predict(x_cv)

In [None]:
print("MSE: ",mean_squared_error(y_hat_6, y_cv))

In [None]:
#build model_7
model_7=Sequential()
n_columns = train_df.columns.size -1

model_7.add(Dense(5, activation='relu', input_shape=(n_columns,)))
model_7.add(Dense(5, activation='relu'))
model_7.add(Dense(5, activation='relu'))
model_7.add(Dense(5, activation='relu'))
model_7.add(Dense(5, activation='relu'))
model_7.add(Dense(5, activation='relu'))
model_7.add(Dense(1))

model_7.compile(optimizer='adam', loss='mean_squared_error')

In [None]:
model_7.fit(x_train, y_train, epochs=train_epochs, verbose=0)
y_hat_7=model_7.predict(x_cv)

In [None]:
print("MSE: ",mean_squared_error(y_hat_7, y_cv))

In [None]:
#build model_8
model_8=Sequential()
n_columns = train_df.columns.size -1

model_8.add(Dense(10, activation='relu', input_shape=(n_columns,)))
model_8.add(Dense(10, activation='relu'))
model_8.add(Dense(10, activation='relu'))
model_8.add(Dense(10, activation='relu'))
model_8.add(Dense(10, activation='relu'))
model_8.add(Dense(10, activation='relu'))
model_8.add(Dense(1))

model_8.compile(optimizer='adam', loss='mean_squared_error')

In [None]:
model_8.fit(x_train, y_train, epochs=train_epochs, verbose=0)
y_hat_8=model_8.predict(x_cv)

In [None]:
print("MSE: ",mean_squared_error(y_hat_8, y_cv))

In [None]:
#build model_9
model_9=Sequential()
n_columns = train_df.columns.size -1

model_9.add(Dense(20, activation='relu', input_shape=(n_columns,)))
model_9.add(Dense(20, activation='relu'))
model_9.add(Dense(20, activation='relu'))
model_9.add(Dense(20, activation='relu'))
model_9.add(Dense(20, activation='relu'))
model_9.add(Dense(20, activation='relu'))
model_9.add(Dense(1))

model_9.compile(optimizer='adam', loss='mean_squared_error')

In [None]:
model_9.fit(x_train, y_train, epochs=train_epochs, verbose=0)
y_hat_9=model_9.predict(x_cv)

In [None]:
print("MSE: ",mean_squared_error(y_hat_9, y_cv))

In [None]:
perf_NN=min(mean_squared_error(y_hat_1, y_cv),mean_squared_error(y_hat_2, y_cv),mean_squared_error(y_hat_3, y_cv),mean_squared_error(y_hat_4, y_cv),mean_squared_error(y_hat_5, y_cv),mean_squared_error(y_hat_6, y_cv),mean_squared_error(y_hat_7, y_cv),mean_squared_error(y_hat_8, y_cv),mean_squared_error(y_hat_9, y_cv))

In [None]:
#choose best NN model
if perf_NN == mean_squared_error(y_hat_1, y_cv):
    y_hat_NN = np.round(y_hat_1)
    model_NN = model_1
    
if perf_NN == mean_squared_error(y_hat_2, y_cv):
    y_hat_NN = np.round(y_hat_2)
    model_NN = model_2
    
if perf_NN == mean_squared_error(y_hat_3, y_cv):
    y_hat_NN = np.round(y_hat_3)
    model_NN = model_3
    
if perf_NN == mean_squared_error(y_hat_4, y_cv):
    y_hat_NN = np.round(y_hat_4)
    model_NN = model_4
    
if perf_NN == mean_squared_error(y_hat_5, y_cv):
    y_hat_NN = np.round(y_hat_5)
    model_NN = model_5
    
if perf_NN == mean_squared_error(y_hat_6, y_cv):
    y_hat_NN = np.round(y_hat_6)
    model_NN = model_6
    
if perf_NN == mean_squared_error(y_hat_7, y_cv):
    y_hat_NN = np.round(y_hat_7)
    model_NN = model_7
    
if perf_NN == mean_squared_error(y_hat_8, y_cv):
    y_hat_NN = np.round(y_hat_8)
    model_NN = model_8
    
if perf_NN == mean_squared_error(y_hat_9, y_cv):
    y_hat_NN = np.round(y_hat_9)
    model_NN = model_9
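
The chain of `if` statements can be shortened by storing the validation errors in a dictionary and asking for the key with the smallest value. This is just a sketch: the numbers in `val_mse` are hypothetical placeholders, not results from this notebook.

```python
# Hypothetical validation errors standing in for the mean_squared_error results
val_mse = {
    'model_1': 0.141,
    'model_2': 0.137,
    'model_3': 0.152,
}

# min over the keys, ranked by their stored error
best_name = min(val_mse, key=val_mse.get)
best_mse = val_mse[best_name]

print(best_name, best_mse)  # model_2 0.137
```

The same pattern works if the dictionary values are tuples of (error, model object), which avoids comparing floats for equality.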

**Support Vector Machine**

In [None]:
#build model one
model_SVM_1 = SVC(kernel='rbf')
model_SVM_1.fit(x_train, y_train)
y_hat_SVM_1 = model_SVM_1.predict(x_cv)

In [None]:
print('Mean squared error: ', mean_squared_error(y_hat_SVM_1, y_cv))

In [None]:
#build model two
model_SVM_2 = SVC(kernel='linear')
model_SVM_2.fit(x_train, y_train)
y_hat_SVM_2 = model_SVM_2.predict(x_cv)
print('Mean squared error: ', mean_squared_error(y_hat_SVM_2, y_cv))

In [None]:
#build model three
model_SVM_3 = SVC(kernel='poly')
model_SVM_3.fit(x_train, y_train)
y_hat_SVM_3 = model_SVM_3.predict(x_cv)
print('Mean squared error: ', mean_squared_error(y_hat_SVM_3, y_cv))

In [None]:
#build model four
model_SVM_4 = SVC(kernel='sigmoid')
model_SVM_4.fit(x_train, y_train)
y_hat_SVM_4 = model_SVM_4.predict(x_cv)
print('Mean squared error: ', mean_squared_error(y_hat_SVM_4, y_cv))

In [None]:
perf_SVM = min(mean_squared_error(y_hat_SVM_1, y_cv), mean_squared_error(y_hat_SVM_2, y_cv), mean_squared_error(y_hat_SVM_3, y_cv), mean_squared_error(y_hat_SVM_4, y_cv))

In [None]:
#choose the best SVM model
if perf_SVM == mean_squared_error(y_hat_SVM_1, y_cv):
    model_SVM = model_SVM_1
if perf_SVM == mean_squared_error(y_hat_SVM_2, y_cv):
    model_SVM = model_SVM_2
if perf_SVM == mean_squared_error(y_hat_SVM_3, y_cv):
    model_SVM = model_SVM_3
if perf_SVM == mean_squared_error(y_hat_SVM_4, y_cv):
    model_SVM = model_SVM_4

In [None]:
#y_hat_SVM = model_SVM.predict(y_test)

**KNN**

In [None]:
#try all values for k from 1 to 99
for k in range(1,100):
    test_model_KNN = KNeighborsClassifier(n_neighbors=k).fit(x_train,y_train)
    y_hat_test_KNN = test_model_KNN.predict(x_cv)
    print("k: ",k, "MSE: ",mean_squared_error(y_hat_test_KNN, y_cv))

In [None]:
#seems like around k=30 the MSE is not really improving any further, so we choose k=30
k=30
model_KNN = KNeighborsClassifier(n_neighbors=k).fit(x_train,y_train)
y_hat_KNN = model_KNN.predict(x_cv)
perf_KNN = mean_squared_error(y_hat_KNN, y_cv)
print("MSE:", perf_KNN)
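
Reading k off a printed list depends on a single train/cv split. A possible alternative is to score each k with cross-validation and pick the best one automatically. The sketch below assumes scikit-learn's `cross_val_score` and uses synthetic data from `make_classification` as a stand-in for x_train and y_train; the range of k and the number of folds are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for x_train / y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)

k_values = range(1, 31)
# Mean 5-fold accuracy for every candidate k
mean_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5).mean()
    for k in k_values
]

# argmax returns the first k with the highest mean accuracy
best_k = k_values[int(np.argmax(mean_scores))]
print("best k:", best_k)
```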

# **Check performance**

In [None]:
#take a look at the different performances
data= {'Index':[1,2,3,4],'Model':['Logistic regression', 'Neural Network', 'Support Vector Machine','K nearest neighbors'], 'MSE':[perf_LG, perf_NN, perf_SVM, perf_KNN]}
performance_df=pd.DataFrame(data)

In [None]:
performance_df

In [None]:
#choose the best model (the first one in the table if several share the lowest MSE)
best_model_index = int(performance_df.loc[performance_df['MSE'].idxmin(), 'Index'])
print(best_model_index)

In [None]:
#let the best model make a prediction for the test data
if int(best_model_index) == 1:
    y_hat=model_LG.predict(x_test)
if int(best_model_index) == 2:
    y_hat=model_NN.predict(x_test)
if int(best_model_index) == 3:
    y_hat=model_SVM.predict(x_test)
if int(best_model_index) == 4:
    y_hat=model_KNN.predict(x_test)

In [None]:
#formatting y_hat
y_hat=pd.DataFrame(data=y_hat, columns=['Survived'])
y_hat=round(y_hat).astype(int)
y_hat

In [None]:
#merge y_hat and 'PassengerId' together for submission to kaggle.com
submission = pd.DataFrame({'PassengerId': test_df['PassengerId'], 'Survived' : y_hat['Survived']})

In [None]:
#check submission length (should be 418)
len(submission.index)

In [None]:
#check for invalid values
submission.loc[(submission['Survived']!=0) & (submission['Survived'] != 1)]

In [None]:
#just in case: make values valid
submission.loc[submission['Survived']<0, 'Survived']=0
submission.loc[submission['Survived']>1, 'Survived']=1

In [None]:
#take a final look at the submission
submission

In [None]:
#create csv
submission.to_csv("submission.csv", index=False)

The submission was uploaded to kaggle.com and scored an accuracy of 0.78947, which was good enough for 4685th place out of the 220006 participants who had completed this challenge.