# Predict Titanic Survival

The RMS Titanic set sail on its maiden voyage in 1912, bound from Southampton, England, to New York City. The ship never completed the voyage: it struck an iceberg and sank in the Atlantic Ocean, killing 1,502 of the 2,224 passengers and crew on board.

In this project we will create a Logistic Regression model that predicts which passengers survived the sinking of the Titanic, based on features like age and class.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [4]:
passengers = pd.read_csv('passengers.csv', index_col=[0])

In [8]:
passengers

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [7]:
passengers.shape

(891, 12)

##### The file 'passengers.csv' contains data on 891 passengers who were on board the Titanic when it sank that fateful day.

Given the saying “women and children first,” Sex and Age seem like good features for predicting survival. First, we map the text values in the Sex column to numbers: every female value becomes 1 and every male value becomes 0.

In [9]:
passengers['Sex'] = passengers['Sex'].map({'male':0,'female':1})
passengers

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",0,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",1,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",1,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",0,26.0,0,0,111369,30.0000,C148,C
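One thing worth checking after a `map` call is whether any value fell outside the mapping, because pandas silently turns unmapped values into NaN. The sketch below uses a small toy series (not the real `passengers` data) with a deliberately mis-capitalized entry to show the effect.

```python
import pandas as pd

# Toy column standing in for passengers['Sex'];
# 'Female' (capitalized) is a deliberate mismatch for illustration.
sex = pd.Series(['male', 'female', 'Female', 'male'])

mapping = {'male': 0, 'female': 1}
encoded = sex.map(mapping)

# Values that are not keys of the mapping become NaN, so count them
# before passing the column to a model.
n_unmapped = int(encoded.isna().sum())
print(encoded.tolist())
print("unmapped values:", n_unmapped)
```

If the count is not zero, the raw column probably needs cleaning (for example with `str.lower()`) before mapping.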


In [17]:
passengers['Age'].values

array([22.  , 38.  , 26.  , 35.  , 35.  ,   nan, 54.  ,  2.  , 27.  ,
       14.  ,  4.  , 58.  , 20.  , 39.  , 14.  , 55.  ,  2.  ,   nan,
       31.  ,   nan, 35.  , 34.  , 15.  , 28.  ,  8.  , 38.  ,   nan,
       19.  ,   nan,   nan, 40.  ,   nan,   nan, 66.  , 28.  , 42.  ,
         nan, 21.  , 18.  , 14.  , 40.  , 27.  ,   nan,  3.  , 19.  ,
         nan,   nan,   nan,   nan, 18.  ,  7.  , 21.  , 49.  , 29.  ,
       65.  ,   nan, 21.  , 28.5 ,  5.  , 11.  , 22.  , 38.  , 45.  ,
        4.  ,   nan,   nan, 29.  , 19.  , 17.  , 26.  , 32.  , 16.  ,
       21.  , 26.  , 32.  , 25.  ,   nan,   nan,  0.83, 30.  , 22.  ,
       29.  ,   nan, 28.  , 17.  , 33.  , 16.  ,   nan, 23.  , 24.  ,
       29.  , 20.  , 46.  , 26.  , 59.  ,   nan, 71.  , 23.  , 34.  ,
       34.  , 28.  ,   nan, 21.  , 33.  , 37.  , 28.  , 21.  ,   nan,
       38.  ,   nan, 47.  , 14.5 , 22.  , 20.  , 17.  , 21.  , 70.5 ,
       29.  , 24.  ,  2.  , 21.  ,   nan, 32.5 , 32.5 , 54.  , 12.  ,
         nan, 24.  ,

Looking at the Age column, we can see many missing values (NaN).
We will therefore fill every missing Age value in passengers with the mean age, rounded to the nearest whole year.

In [21]:
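# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions may not propagate to the DataFrame.
passengers['Age'] = passengers['Age'].fillna(round(passengers['Age'].mean()))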
passengers['Age'].isnull().any()

False
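Mean imputation is simple, but it pulls many rows onto a single value and so reduces the spread of the column. A minimal sketch on a toy series (the ages are illustrative, not taken from `passengers.csv`) shows how to check the share of missing values and the effect on the standard deviation:

```python
import numpy as np
import pandas as pd

# Toy ages with two missing entries (illustrative values only).
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 2.0])

n_missing = int(ages.isna().sum())
share_missing = n_missing / len(ages)

# Same strategy as above: fill gaps with the rounded mean of the known ages.
mean_age = round(ages.mean())
filled = ages.fillna(mean_age)

print(f"missing: {n_missing} of {len(ages)} ({share_missing:.0%})")
print(f"std before: {ages.std():.2f}, std after: {filled.std():.2f}")
```

When a large share of the column is missing, the median or a model-based imputation may be worth comparing against the mean.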

In [22]:
passengers['Age'].values

array([22.  , 38.  , 26.  , 35.  , 35.  , 30.  , 54.  ,  2.  , 27.  ,
       14.  ,  4.  , 58.  , 20.  , 39.  , 14.  , 55.  ,  2.  , 30.  ,
       31.  , 30.  , 35.  , 34.  , 15.  , 28.  ,  8.  , 38.  , 30.  ,
       19.  , 30.  , 30.  , 40.  , 30.  , 30.  , 66.  , 28.  , 42.  ,
       30.  , 21.  , 18.  , 14.  , 40.  , 27.  , 30.  ,  3.  , 19.  ,
       30.  , 30.  , 30.  , 30.  , 18.  ,  7.  , 21.  , 49.  , 29.  ,
       65.  , 30.  , 21.  , 28.5 ,  5.  , 11.  , 22.  , 38.  , 45.  ,
        4.  , 30.  , 30.  , 29.  , 19.  , 17.  , 26.  , 32.  , 16.  ,
       21.  , 26.  , 32.  , 25.  , 30.  , 30.  ,  0.83, 30.  , 22.  ,
       29.  , 30.  , 28.  , 17.  , 33.  , 16.  , 30.  , 23.  , 24.  ,
       29.  , 20.  , 46.  , 26.  , 59.  , 30.  , 71.  , 23.  , 34.  ,
       34.  , 28.  , 30.  , 21.  , 33.  , 37.  , 28.  , 21.  , 30.  ,
       38.  , 30.  , 47.  , 14.5 , 22.  , 20.  , 17.  , 21.  , 70.5 ,
       29.  , 24.  ,  2.  , 21.  , 30.  , 32.5 , 32.5 , 54.  , 12.  ,
       30.  , 24.  ,

Given the strict class system onboard the Titanic, let’s utilize the Pclass column, or the passenger class, as another feature to predict the survival.

In [24]:
passengers['Pclass'].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

As we can see above, there are three passenger classes: 1, 2 and 3.

First we create a new column named FirstClass that stores 1 for all passengers in first class and 0 for all other passengers.

In [25]:
passengers['FirstClass'] = passengers['Pclass'].apply(lambda x: 1 if x == 1 else 0)

Next we create a new column named SecondClass that stores 1 for all passengers in second class and 0 for all other passengers.

In [27]:
passengers['SecondClass'] = passengers['Pclass'].apply(lambda x: 1 if x == 2 else 0)

In [29]:
passengers.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FirstClass,SecondClass
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,0,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,1,0
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,0,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,1,0
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,0,0
5,6,0,3,"Moran, Mr. James",0,30.0,0,0,330877,8.4583,,Q,0,0
6,7,0,1,"McCarthy, Mr. Timothy J",0,54.0,0,0,17463,51.8625,E46,S,1,0
7,8,0,3,"Palsson, Master. Gosta Leonard",0,2.0,3,1,349909,21.075,,S,0,0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",1,27.0,0,2,347742,11.1333,,S,0,0
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",1,14.0,1,0,237736,30.0708,,C,0,1
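The two indicator columns above are a one-hot encoding of `Pclass` with third class left out as the reference level. As a sketch of an equivalent approach, `pd.get_dummies` can build the same columns; the column names `Pclass_1` and `Pclass_2` come from the `prefix` argument chosen here, and the toy series is illustrative only.

```python
import pandas as pd

# Toy passenger classes (illustrative values only).
pclass = pd.Series([3, 1, 3, 1, 2], name='Pclass')

# Manual indicator columns, equivalent to the cells above.
manual = pd.DataFrame({
    'FirstClass': (pclass == 1).astype(int),
    'SecondClass': (pclass == 2).astype(int),
})

# get_dummies creates one column per class; keep only classes 1 and 2
# so that third class remains the implicit baseline.
dummies = pd.get_dummies(pclass, prefix='Pclass').astype(int)
auto = dummies[['Pclass_1', 'Pclass_2']]

print(manual.values.tolist())
print(auto.values.tolist())
```

Dropping one level avoids a redundant column: a passenger with 0 in both `FirstClass` and `SecondClass` must be in third class.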


In [30]:
features = passengers[['Sex', 'Age', 'FirstClass', 'SecondClass']]
survival = passengers['Survived']

In [31]:
X_train, X_test, y_train, y_test = train_test_split(features, survival, test_size=0.25)
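Because `train_test_split` shuffles the rows randomly, every run of the cell above produces a different split and slightly different scores. A sketch of a reproducible variant on toy arrays follows; the seed value 42 and the use of `stratify` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 12 of class 0 and 8 of class 1.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 12 + [1] * 8)

# random_state fixes the shuffle; stratify keeps the class ratio
# the same in the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Repeating the call with the same seed gives the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(len(X_tr), len(X_te), int(y_te.sum()))
```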

In [32]:
scaler = StandardScaler()

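# Learn the mean and standard deviation from the training set only...
X_train = scaler.fit_transform(X_train)
# ...and apply those same statistics to the test set (do not re-fit on it).
X_test = scaler.transform(X_test)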
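Scaling statistics are normally learned from the training data alone and then reused for the test set and for any new samples, so that every input is measured on the same scale the model was trained on. The toy sketch below (the name `toy_scaler` and the numbers are illustrative) contrasts reusing the training statistics with re-fitting on the test data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (illustrative values only).
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[20.0], [40.0]])

# Fit on the training data, then reuse its mean and std for the test data.
toy_scaler = StandardScaler().fit(train)
test_scaled = toy_scaler.transform(test)

# Re-fitting on the test data uses the test set's own mean and std instead,
# so the same raw value maps to a different scaled value.
test_refit = StandardScaler().fit_transform(test)

print(test_scaled.ravel())
print(test_refit.ravel())
```

With the training statistics, the raw value 20 maps to 0 because it equals the training mean; after re-fitting on the test set it maps to -1, which no longer matches what the model saw during training.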

In [33]:
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()

In [34]:
model.score(X_train, y_train)

0.8083832335329342

In [35]:
model.score(X_test, y_test)

0.7802690582959642

The model's accuracy is about 81% on the training set and about 78% on the test set.
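An accuracy figure is easier to judge next to a baseline that always predicts the most frequent class. The sketch below uses made-up labels and predictions (not the Titanic results) together with scikit-learn's `DummyClassifier` to show the comparison:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels and hypothetical model predictions, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# Accuracy is the fraction of predictions that match the labels.
model_acc = accuracy_score(y_true, y_pred)

# The baseline ignores the features and always predicts the majority class.
X_dummy = np.zeros((len(y_true), 1))
baseline = DummyClassifier(strategy='most_frequent').fit(X_dummy, y_true)
baseline_acc = baseline.score(X_dummy, y_true)

print(f"model: {model_acc:.2f}, baseline: {baseline_acc:.2f}")
```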

In [38]:
print(list(zip(['Sex','Age','FirstClass','SecondClass'],model.coef_[0])))

[('Sex', 1.2251084426677075), ('Age', -0.46924986033609034), ('FirstClass', 0.9318377525142388), ('SecondClass', 0.48361565148352653)]


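Sex and FirstClass have the largest coefficients, which indicates that these two features are the most important
in predicting survival of the sinking of the Titanic. Because the features were standardized, the coefficient magnitudes can be compared directly; the negative Age coefficient means that older passengers were less likely to survive.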
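Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that can be easier to read. The sketch below reuses the coefficient values printed above; since the features were standardized, each ratio describes the change in the odds of survival for a one-standard-deviation increase in that feature.

```python
import numpy as np

# Coefficients copied from the printed output above.
names = ['Sex', 'Age', 'FirstClass', 'SecondClass']
coefs = np.array([
    1.2251084426677075,
    -0.46924986033609034,
    0.9318377525142388,
    0.48361565148352653,
])

# exp(coefficient) is the multiplicative change in the odds of survival.
odds_ratios = dict(zip(names, np.exp(coefs)))

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.2f}")
```

A ratio above 1 raises the odds of survival and a ratio below 1 lowers them, which matches the signs of the coefficients.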

Let’s use our model to predict the survival of a few sample passengers.

In [39]:
# Feature order must match the training columns: Sex, Age, FirstClass, SecondClass
Jack = np.array([0.0, 20.0, 0.0, 0.0])
Rose = np.array([1.0, 17.0, 1.0, 0.0])
Suho = np.array([0.0, 30.0, 1.0, 0.0])

In [40]:
sample_passengers = np.array([Jack, Rose, Suho])

In [41]:
sample_passengers = scaler.transform(sample_passengers)



In [45]:
sample_passengers

array([[-0.80737343, -0.71437467, -0.57907628, -0.52372294],
       [ 1.23858424, -0.93148616,  1.7268882 , -0.52372294],
       [-0.80737343,  0.00933028,  1.7268882 , -0.52372294]])

In [43]:
model.predict(sample_passengers)

array([0, 1, 0], dtype=int64)

In [44]:
model.predict_proba(sample_passengers)

array([[0.88445022, 0.11554978],
       [0.06169226, 0.93830774],
       [0.55628012, 0.44371988]])
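The second column of `predict_proba` is the sigmoid of a weighted sum of the scaled features plus the intercept. The sketch below checks this on a small synthetic data set; the data, the random seed and the name `clf` are assumptions for illustration and are unrelated to the Titanic model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# Manual probability of class 1: sigmoid(w . x + b).
x_new = np.array([[1.0, -0.5]])
z = x_new @ clf.coef_[0] + clf.intercept_[0]
manual = 1.0 / (1.0 + np.exp(-z))

# The library result for the same sample.
library = clf.predict_proba(x_new)[:, 1]

print(manual, library)
```

`predict` then returns class 1 whenever this probability exceeds 0.5, which is why a passenger with a probability just under 0.5 is still labelled 0.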

##### According to our model's predictions, Rose fares much better than Jack and Suho. <br> Rose has a high predicted probability of surviving the sinking of the Titanic (about 94%), while Jack (about 12%) and Suho (about 44%) are both predicted not to survive.