Predict Titanic Survival
The RMS Titanic set sail on its maiden voyage in 1912, crossing the Atlantic from Southampton, England to New York City. The ship never completed the voyage: it struck an iceberg and sank in the Atlantic Ocean, killing 1,502 of the 2,224 passengers and crew on board.

In this project you will create a Logistic Regression model that predicts which passengers survived the sinking of the Titanic, based on features like age and class.

The data we will be using for training our model is provided by Kaggle. Feel free to make the model better on your own and submit it to the Kaggle Titanic competition!

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


In [4]:
# Load the passenger data
passengers = pd.read_csv('train.csv')
print(passengers)

     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
5              6         0       3   
6              7         0       1   
7              8         0       3   
8              9         1       3   
9             10         1       2   
10            11         1       3   
11            12         1       1   
12            13         0       3   
13            14         0       3   
14            15         0       3   
15            16         1       2   
16            17         0       3   
17            18         1       2   
18            19         0       3   
19            20         1       3   
20            21         0       2   
21            22         1       2   
22            23         1       3   
23            24         1       1   
24            25         0       3   
25          

Given the saying "women and children first," Sex and Age seem like good features for predicting survival. Let's map the text values in the Sex column to numbers: update Sex so that every female value becomes 1 and every male value becomes 0.

In [5]:
# Convert the Sex column to numeric: female -> 1, male -> 0
passengers['Sex'] = np.where(passengers['Sex'] == 'female', 1, 0)
print(passengers)

     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
5              6         0       3   
6              7         0       1   
7              8         0       3   
8              9         1       3   
9             10         1       2   
10            11         1       3   
11            12         1       1   
12            13         0       3   
13            14         0       3   
14            15         0       3   
15            16         1       2   
16            17         0       3   
17            18         1       2   
18            19         0       3   
19            20         1       3   
20            21         0       2   
21            22         1       2   
22            23         1       3   
23            24         1       1   
24            25         0       3   
25          

In [6]:
# Fill the NaN values in the Age column with the rounded mean age
print(passengers['Age'].values)
passengers['Age'] = passengers['Age'].fillna(round(passengers['Age'].mean()))

[22.   38.   26.   35.   35.     nan 54.    2.   27.   14.    4.   58.
 20.   39.   14.   55.    2.     nan 31.     nan 35.   34.   15.   28.
  8.   38.     nan 19.     nan   nan 40.     nan   nan 66.   28.   42.
   nan 21.   18.   14.   40.   27.     nan  3.   19.     nan   nan   nan
   nan 18.    7.   21.   49.   29.   65.     nan 21.   28.5   5.   11.
 22.   38.   45.    4.     nan   nan 29.   19.   17.   26.   32.   16.
 21.   26.   32.   25.     nan   nan  0.83 30.   22.   29.     nan 28.
 17.   33.   16.     nan 23.   24.   29.   20.   46.   26.   59.     nan
 71.   23.   34.   34.   28.     nan 21.   33.   37.   28.   21.     nan
 38.     nan 47.   14.5  22.   20.   17.   21.   70.5  29.   24.    2.
 21.     nan 32.5  32.5  54.   12.     nan 24.     nan 45.   33.   20.
 47.   29.   25.   23.   19.   37.   16.   24.     nan 22.   24.   19.
 18.   19.   27.    9.   36.5  42.   51.   22.   55.5  40.5    nan 51.
 16.   30.     nan   nan 44.   40.   26.   17.    1.    9.     nan 45.
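
The imputation step above can be illustrated on a small example. This is a minimal sketch on a toy Series (the values are made up for illustration and are not taken from train.csv). It assigns the result of fillna back to the column rather than calling it with inplace=True on a column selection, which recent pandas versions warn about.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Age column (illustrative values, not from train.csv)
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0])

# Mean of the non-missing values, rounded to a whole year
fill_value = round(ages.mean())

# fillna returns a new Series; assign it back instead of using inplace=True
ages_filled = ages.fillna(fill_value)

print(fill_value)
print(ages_filled.isna().sum())
```

Mean imputation keeps every row but pulls the filled ages toward the center of the distribution, so it is a simple baseline rather than the only reasonable choice.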


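Given the strict class system onboard the Titanic, let's use the Pclass column to create indicator columns for first and second class. Third class then serves as the baseline, encoded as 0 in both columns.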

In [7]:
# Create a first class column
passengers['FirstClass'] = np.where(passengers['Pclass'] == 1, 1.0, 0.0)

# Create a second class column
passengers['SecondClass'] = np.where(passengers['Pclass'] == 2, 1.0, 0.0)
print(passengers)

     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
5              6         0       3   
6              7         0       1   
7              8         0       3   
8              9         1       3   
9             10         1       2   
10            11         1       3   
11            12         1       1   
12            13         0       3   
13            14         0       3   
14            15         0       3   
15            16         1       2   
16            17         0       3   
17            18         1       2   
18            19         0       3   
19            20         1       3   
20            21         0       2   
21            22         1       2   
22            23         1       3   
23            24         1       1   
24            25         0       3   
25          
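
As an alternative to building each indicator column by hand, pandas can generate them in one call. The sketch below uses a toy Pclass Series (illustrative values, not the real data) and the prefix Class, which is an arbitrary name chosen here. It keeps only the first and second class columns so that third class remains the baseline.

```python
import pandas as pd

# Toy stand-in for the Pclass column (illustrative values)
pclass = pd.Series([3, 1, 3, 1, 2], name='Pclass')

# One indicator column per class value: Class_1, Class_2, Class_3
dummies = pd.get_dummies(pclass, prefix='Class', dtype=float)

# Drop the third class column so it acts as the baseline category
encoded = dummies[['Class_1', 'Class_2']]
print(encoded)
```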


Now that we have cleaned our data, let's select the columns we want to build our model on. Select the columns Sex, Age, FirstClass, and SecondClass and store them in a variable named features. Select the column Survived and store it in a variable named Survived.

In [8]:
# Select the desired features and the target
features = passengers[['Sex', 'Age', 'FirstClass', 'SecondClass']]
Survived = passengers['Survived']

In [9]:
# Split into training and test sets (default: 75% train, 25% test)
f_train, f_test, s_train, s_test = train_test_split(features, Survived)
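
The split above is random, so the scores change from run to run. A minimal sketch of a reproducible split follows, on toy arrays rather than the Titanic data. The random_state value 42 is an arbitrary choice, and stratify is an optional extra that keeps the class ratio equal in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for features and Survived (not the Titanic data)
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 12 + [1] * 8)

# test_size=0.25 matches the scikit-learn default; random_state fixes the shuffle;
# stratify=y keeps the proportion of each class the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```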

In [10]:
# Scale the feature data so it has mean = 0 and standard deviation = 1
m = StandardScaler()
f_train = m.fit_transform(f_train)
f_test = m.transform(f_test)
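
Note that the scaler is fitted on the training data only and then reused on the test data, so no information from the test set leaks into the preprocessing. The sketch below, on toy numbers, shows that transform applies (x - mean) / std with the mean and standard deviation learned from the training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (illustrative values)
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[20.0], [40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training mean and std

# The same computation by hand, using the population std (ddof=0)
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, manual))
```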

In [11]:
# Create and train the model
model = LogisticRegression()
model.fit(f_train, s_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [12]:
# Score the model (accuracy) on the train data
print(model.score(f_train, s_train))

# Score the model (accuracy) on the test data
print(model.score(f_test, s_test))

0.7859281437125748
0.7713004484304933


In [13]:
# Analyze the coefficients. Which feature is most important in predicting survival on the sinking of the Titanic?
print(model.coef_)


[[ 1.27917974 -0.48157728  1.11964694  0.53794477]]
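
To read the coefficients more easily, they can be paired with the feature names in the order used to build features. The sketch below copies the values printed above and ranks them by absolute size. Because the features were standardized, each coefficient describes the change in log-odds of survival per one standard deviation of that feature, and exp(coef) is the corresponding odds ratio.

```python
import numpy as np

# Feature order matches the columns selected for features
feature_names = ['Sex', 'Age', 'FirstClass', 'SecondClass']
# Values copied from the printed model.coef_ output
coefficients = np.array([1.27917974, -0.48157728, 1.11964694, 0.53794477])

# Rank features from largest to smallest absolute coefficient
order = np.argsort(-np.abs(coefficients))
ranked = [feature_names[i] for i in order]

for i in order:
    odds_ratio = np.exp(coefficients[i])
    print(f"{feature_names[i]:<12} coef={coefficients[i]:+.3f}  odds ratio={odds_ratio:.2f}")
```

In this run Sex has the largest coefficient, followed by FirstClass, while the negative sign on Age means that older passengers were predicted as less likely to survive.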


In [14]:
# Sample passenger features, in the order [Sex, Age, FirstClass, SecondClass]
Jack = np.array([0.0, 20.0, 0.0, 0.0])
Rose = np.array([1.0, 17.0, 1.0, 0.0])
You = np.array([1.0, 23.0, 0.0, 0.0])

In [15]:
# Combine passenger arrays
sample_passengers = np.array([Jack, Rose, You])

# Scale the sample passenger features
sample_passengers = m.transform(sample_passengers)
print(sample_passengers)

[[-0.74396799 -0.75499366 -0.57043565 -0.50327259]
 [ 1.34414385 -0.981486    1.75304613 -0.50327259]
 [ 1.34414385 -0.52850133 -0.57043565 -0.50327259]]


In [16]:
# Make survival predictions!
print(model.predict(sample_passengers))
print(model.predict_proba(sample_passengers))


[0 1 1]
[[0.89943849 0.10056151]
 [0.03952016 0.96047984]
 [0.40830045 0.59169955]]
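
Each row of the predict_proba output is [P(did not survive), P(survived)], so the model gives Jack about a 10% chance of survival and Rose about 96%. For a binary model with default settings, the second column should equal the logistic (sigmoid) function applied to the linear score, and predict returns 1 when that probability exceeds 0.5. The sketch below checks this on synthetic data (randomly generated here, not the Titanic data).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data with four features (not the Titanic data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# Linear score z = w . x + b, then the sigmoid turns it into a probability
z = X[:3] @ clf.coef_.ravel() + clf.intercept_[0]
manual_p1 = 1.0 / (1.0 + np.exp(-z))

proba = clf.predict_proba(X[:3])
print(np.allclose(proba[:, 1], manual_p1))
print(np.array_equal(clf.predict(X[:3]), (manual_p1 > 0.5).astype(int)))
```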
