# Titanic Survival Machine Learning

This is an older Kaggle Competition with an objective of providing a classic dataset to predict the survival of passengers on the Titanic. We have Train Dataset and Test Dataset for rus to apply our Machine Learning Models on.

* Here I would be applying several models on the same dataset in order to check and compare the different models
    1. Simple Logistic Regression

In [1]:
import pandas as pd
import numpy as np

# Importing the datasets
train_data=pd.read_csv("train.csv")
test_data=pd.read_csv("test.csv")


In [2]:
# Looking at the imported data to verify if the import was successful
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Running some Data Exploration to understand and visualize the data

#### Bar Graphs to understand the Predictor Variable

In [3]:
survived_count=train_data.groupby(["Survived"], as_index=False)["PassengerId"].count()
import plotly 
plotly.tools.set_credentials_file(username='atheros167', api_key='t6wQEzT7YVZUIy97HAka')

import plotly.plotly as py
import plotly.graph_objs as go

data = [go.Bar(
            x=survived_count["Survived"],
            y=survived_count["PassengerId"]
    )]

py.iplot(data, filename='basic-bar')

High five! You successfuly sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~atheros167/0 or inside your plot.ly account where it is named 'basic-bar'


In [4]:
#### Pie Charts for number of Missing Data

missing_count={}
missing_count["Missing"]=train_data.shape[0] - train_data.dropna().shape[0]
missing_count["Not_Missing"]=train_data.dropna().shape[0]


import plotly.plotly as py
import plotly.graph_objs as go

labels = list(missing_count.keys())
values = list(missing_count.values())

trace = go.Pie(labels=labels, values=values)

py.iplot([trace], filename='basic_pie_chart')

From the chart above we notice that majority of the rows (~80%) have some record missing. SO lets lets explore the data to see which columns have missing records


In [5]:
train_data.count()

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

The majority of the missing data comes from the Cabin field, which luckily for us is not a very essential column towards predicting our survival column. We can ignore this column since we will be dropping this column from our analysis.

Age columns however is intuitively an important column towards predicting survival. So we will be doing some manipulation to the data to fill the missing rows

The final columns with missing records is the Embarked column with 2 records missing. For the sake of convenience we will be dropping these two records as well, since it should not affect the model much by doing so

#### Survived column vs Gender column Stacked Bar Graph

Age and Gender are other few columns that intuitively makes sense to possible affect the outcome for Survival. So lets take a deeper look to understand these two columns

In [6]:
gender_count=train_data.groupby(["Sex","Survived"], as_index=False)["PassengerId"].count()
gender_count2=gender_count.pivot(index='Sex',columns='Survived', values='PassengerId').reset_index('Sex')
gender_count2.columns=['Sex','Survived_0','Survived_1']
gender_count2

Unnamed: 0,Sex,Survived_0,Survived_1
0,female,81,233
1,male,468,109


In [7]:
import plotly.plotly as py
import plotly.graph_objs as go

trace1 = go.Bar(
    x=gender_count2["Sex"],
    y=gender_count2["Survived_0"],
    name='Didnt_Survive'
)
trace2 = go.Bar(
    x=gender_count2["Sex"],
    y=gender_count2["Survived_1"],
    name='Survived'
)

data = [trace1, trace2]
layout = go.Layout(
    barmode='group'
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='grouped-bar')

##  Missing Data - Replacement

There are a number of missing records for Age which intuitively makes a good predictor for the survival. So exploring ways to replace the data.
* We would be parsing a string from the name like "master" or "mr" or "miss" to use that as a segmentation and replacing the missing age for a record with the mean age of those groups.
    - For example: If we have a missing record with Passenger name with "Master" or "master" in the name then we replace with the average age of the "Master" passengers
* If there are still records that cannot be replaced, then we will replace them with the mean age for the group of combination Pclass and Sex.
    - For example if we are not able to replace the age for a record using the method 1, then for the missing record of Pclass 1 and Sex="Female", we will insert the value for the mean age of the passengers who travelled in the same Pclass and were females

In [8]:
# Creating a list consisting of possible titles
title_list=["Mr.","mr.","Mrs.","mrs.","Miss.","miss.","Ms.","ms.","Master.","master."]

# defining a function to get the index for the elements in the title
def title_create(arg):
    for i in title_list:
        train_data[i]=arg.str.find(i)

# Defining a function to remove the created columns after associating hte correct salutation
def remove_columns(arg):
    for i in arg:
        del train_data[i]

title_create(train_data['Name'])
train_data['final_salutation']=train_data[["Mr.","mr.","Mrs.","mrs.","Miss.","miss.","Ms.","ms.","Master.","master."]].idxmax(axis=1)

remove_columns(title_list)

# these are the average values by the salutations.
# Checking if they make sense

age_mean=(train_data.dropna().groupby(["final_salutation"], as_index=False)["Age"].mean()).T.to_dict('List')
age_mean

{0: ['Master.', 3.988571428571429],
 1: ['Miss.', 27.738636363636363],
 2: ['Mr.', 40.712765957446805],
 3: ['Mrs.', 38.23684210526316]}

In [9]:
# Since the above make sense, lets replace the missing Age rows with above numbers
train_data["Age"] = train_data.groupby("final_salutation").transform(lambda x: x.fillna(x.mean()))

In [10]:
# Lets relook at the data to confirm all the records for the column Age look good now
train_data.count()

PassengerId         891
Survived            891
Pclass              891
Name                891
Sex                 891
Age                 891
SibSp               891
Parch               891
Ticket              891
Fare                891
Cabin               204
Embarked            889
final_salutation    891
dtype: int64

##  Dropping columns that wont be useful in our models

* Now that all our columns are ready, we can drop a few non useful columns like Name, final_salutation, Cabin & Ticket
* Also we can drop the two records withmissing embarked.
* Convert the Sex and Embarked columns in to dummy variables to make them numerical categorical variables

In [11]:
del train_data['Name']
del train_data['final_salutation']
del train_data['Cabin']
del train_data['Ticket']

# The function "f" here is to convert the column Sex into dummy variables 1 or 0 since we can input these variables 
# into our model
# We can use One Hot Encoder here, but since the dataset is small and simple we can use this to understand what is happenning
def f(row):
    if row['Sex'] == 'male':
        val = 1
    else:
        val = 0
    return val

train_data['Gender'] = train_data.apply(f, axis=1)
# Now that we have created a column called Gender we can drop the Sex column off
del train_data['Sex']


# Lets take a look at how our data looks like now
train_data.head()


Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Embarked,Gender
0,1,0,3,1.0,1,0,7.25,S,1
1,2,1,1,2.0,1,0,71.2833,C,0
2,3,1,3,3.0,0,0,7.925,S,0
3,4,1,1,4.0,1,0,53.1,S,0
4,5,0,3,5.0,0,0,8.05,S,1


Now we still have to drop the two records with Embarked column having "NAN" records and need to substitute the String values for numerical for the Embarked column before we can run our first model Logistic Regression on the data.

In [12]:
train_data['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [13]:
def embarked_replace(row):
    if row['Embarked']=='S':
        val=1
    elif row['Embarked']=='C':
        val=2
    elif row['Embarked']=='Q':
        val=3
    else:
        val=0
    return val

# As before, we are applying the function on our Embark column to make the column numerical.
# Then we delete the column
# Finally we drop the records where the column contained "NAN"
train_data['Embark_new']=train_data.apply(embarked_replace,axis=1)
del train_data['Embarked']

# Creating our final dataset that can be fed into various models

train_data=train_data[train_data.Embark_new != 0]

Final Look at the data before it we can start building different models off of the data called ==> output

In [14]:
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Gender,Embark_new
0,1,0,3,1.0,1,0,7.25,1,1
1,2,1,1,2.0,1,0,71.2833,0,2
2,3,1,3,3.0,0,0,7.925,0,1
3,4,1,1,4.0,1,0,53.1,0,1
4,5,0,3,5.0,0,0,8.05,1,1


In [15]:
train_data.count()

PassengerId    889
Survived       889
Pclass         889
Age            889
SibSp          889
Parch          889
Fare           889
Gender         889
Embark_new     889
dtype: int64

Everything Looks Set for our first Model. Lets build some models now!!!

### Logistic Regression

Logistic Regression is a classification technique which uses a bunch of independent varaible to predict a probabililty of occurance for a categorical dependent variable. It uses a sigmoid function to predict the probability of dependent variable to be between 0 and 1 using a y=1/(1+e^(-x)) function

In [16]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import metrics, model_selection
from sklearn.model_selection import train_test_split

from pylab import scatter, show, legend, xlabel, ylabel

#### Creating the X label and Y Label for our Models
* We need to convert the X and Y variables into a Array

In [17]:
X=train_data[["Pclass","Age","SibSp","Parch","Fare","Gender","Embark_new"]]
X=np.array(X)
Y=train_data[["Survived"]]
Y=np.array(Y)

#### Splitting the data into Train and Test.
* Using 30% of the data as Test Data 
* Remaining 70% is our Train Data

In [18]:
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.3)

In [19]:
clf = LogisticRegression()
clf.fit(X_train, Y_train)


A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [20]:
predicted=clf.predict(X_test)

#### Print the Accuracy Score

In [21]:
print(metrics.accuracy_score(Y_test, predicted)) 

0.797752808989


#### Print the Confusion Matrix

In [22]:
metrics.confusion_matrix(Y_test, predicted,)

array([[141,  18],
       [ 36,  72]], dtype=int64)

#### Modify the test dataset along the same lines as train and apply the model

In [23]:
test_data.head()


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1235,1,"Cardeza, Mrs. James Warburton Martinez (Charlo...",female,58.0,0,1,PC 17755,512.3292,B51 B53 B55,C
1,945,1,"Fortune, Miss. Ethel Flora",female,28.0,3,2,19950,263.0,C23 C25 C27,S
2,961,1,"Fortune, Mrs. Mark (Mary McDougald)",female,60.0,1,4,19950,263.0,C23 C25 C27,S
3,916,1,"Ryerson, Mrs. Arthur Larned (Emily Maria Borie)",female,48.0,1,3,PC 17608,262.375,B57 B59 B63 B66,C
4,951,1,"Chaudanson, Miss. Victorine",female,36.0,0,0,PC 17608,262.375,B61,C


In [24]:
del test_data['Cabin']
del test_data['Ticket']
del test_data['PassengerId']

# Creating a list of salutations based on reading the name string
def title_create(arg):
    for i in title_list:
        test_data[i]=arg.str.find(i)

title_create(test_data['Name'])

# From the above creating a column to pick the max value under the columns
test_data['final_salutation']=test_data[["Mr.","mr.","Mrs.","mrs.","Miss.","miss.","Ms.","ms.","Master.","master."]].idxmax(axis=1)

# For the records with missing Age, we are applying the average age based on the salutations.

test_data["Age"] = test_data.groupby("final_salutation").transform(lambda x: x.fillna(x.mean()))

# Now we do not need those columns anymore
def remove_columns(arg):
    for i in arg:
        del test_data[i]

remove_columns(title_list)

del test_data['Name']
del test_data['final_salutation']


#The function "f" here is to convert the column Sex into dummy variables 1 or 0 since we can input these variables into our model
#We can use One Hot Encoder here, but since the dataset is small and simple we can use this to understand what is happenning

def f(row):
    if row['Sex'] == 'male':
        val = 1
    else:
        val = 0
    return val

test_data['Gender'] = test_data.apply(f, axis=1)
# Now that we have created a column called Gender we can drop the Sex column off
del test_data['Sex']

test_data['Embark_new']=test_data.apply(embarked_replace,axis=1)
del test_data['Embarked']

# There are a few rows with Fare Information missing. FOr these records, we take the average fare per Pclass and Age

test_data["Fare"] = test_data.groupby(["Pclass","Age"]).transform(lambda x: x.fillna(x.mean()))

#Creating our final dataset that can be fed into various models

X_test_final=test_data[["Pclass","Age","SibSp","Parch","Fare","Gender","Embark_new"]]
X_test_final=np.array(X_test_final)

In [25]:
#Score the test data to predict the values for the Survived column in the test dataset
predicted_final=clf.predict(X_test_final)

In [26]:
# Exporting the data into a csv

test_data=pd.read_csv("test.csv")

submission = pd.DataFrame({
        "PassengerId": test_data["PassengerId"],
        "Survived": predicted_final
    })
submission.to_csv('titanic_output.csv', index=False)

### Support Vector Classifier

Now that we have applied the Logistic Regression Model, we can try some other models to see how they fare against the logistic models. One of the simple models to apply is a SVM Classifier model which basically tries to draw a line at the threshold of the classification (here 1) and tries to predict which side of the line would the observation be predicted

In [27]:
from sklearn import svm

clf = svm.SVC()
clf.fit(X_train,Y_train) 
predict_svr=clf.predict(X_test)



A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().



In [28]:
print(metrics.accuracy_score(Y_test, predict_svr)) 

0.588014981273


In [29]:
metrics.confusion_matrix(Y_test, predict_svr)

array([[153,   6],
       [104,   4]], dtype=int64)

In [30]:
predicted_final_svr=clf.predict(X_test_final)
test_data=pd.read_csv("test.csv")

submission = pd.DataFrame({
        "PassengerId": test_data["PassengerId"],
        "Survived": predicted_final_svr
    })
submission.to_csv('titanic_output_svc.csv', index=False)

Clearly the Logistic Regression model out performed the SVM model here. Next lets try an Artificial Neural Network

## Artificial Neural Network

Artificial Neural Network is a model that tries to create input nodes using the input variables and with the help of a few hidden layers tries to back propagates a correction to the weights of the input variables in order to make the output more closer to the true values. In theory, this should be more accurate that our earlier models.
Lets check it out if that true :)

In [31]:
# Standardizing the input variables
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)

In [32]:
# Importing the keras libraries to create out Neural Networks

import keras
from keras.models import Sequential
from keras.layers import Dense

Using TensorFlow backend.


In [33]:
# Defining the initial object as a sequential model with more than one layers
clf=Sequential()

# Adding the first hidden layer
# output_dim is the average of input dims=7 and output dims=1.
# init is the function to initialize the weights. uniform is the most simple way to add a close to 0 value for weights 
# Choosing the activation function to be relu which corresponds to rectifier function 
clf.add(Dense(units=4,kernel_initializer="uniform",activation='relu',input_dim=7))

#Adding the second hidden layer
clf.add(Dense(units=4,kernel_initializer="uniform",activation='relu'))

# Adding the final output layer
clf.add(Dense(units=1,kernel_initializer="uniform",activation='sigmoid'))

In [34]:
# Compiling using the ADAM optimizer for Stachastic Gradient Descent
clf.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

In [35]:
# Lets fit this for 100 epochs and a batch_size of 10
clf.fit(X_train,Y_train,batch_size=10,epochs=100)


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100


<keras.callbacks.History at 0x1bb11b89ba8>

In [36]:
# Applying the model on the split test data
Y_pred=clf.predict(X_test)
Y_pred=(Y_pred>0.5)

# Calculating the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(Y_test,Y_pred)
cm

array([[150,   9],
       [ 46,  62]], dtype=int64)

From the training dataset we get an average accuracy of 81%. From Test dataset we get 79.4 %, which is pretty much a good alignment to conclude that there is not much over fitting.
So, like we expected we are getting a slightly better result using the ANN as compared to Logistic Regression or SVM