# Predict Titanic Survival

The RMS Titanic set sail on its maiden voyage in 1912, crossing the Atlantic from Southampton, England to New York City. The ship never completed the voyage: it struck an iceberg and sank to the bottom of the Atlantic Ocean, killing 1,502 of the 2,224 passengers and crew on board.
In this project you will create a Logistic Regression model that predicts which passengers survived the sinking of the Titanic, based on features like age and class. The data we will use to train our model is provided by Kaggle.

The file passengers.csv contains the data of 891 passengers onboard the Titanic when it sank that fateful day. Let’s begin by loading the data into a pandas DataFrame named passengers. Print passengers and inspect the columns. What features could we use to predict survival?

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


# Load the passenger data
passengers=pd.read_csv('passengers.csv')
print(passengers)
print(passengers.columns)

     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
..           ...       ...     ...   
886          887         0       2   
887          888         1       1   
888          889         0       3   
889          890         1       1   
890          891         0       3   

                                                  Name     Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    male  22.0      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                               Heikkinen, Miss. Laina  female  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                             Allen, Mr. William Henry    male  35.0      0   
..                                                 ...     ...   ... 

Given the saying, “women and children first,” Sex and Age seem like good features to predict survival. Let’s map the text values in the Sex column to numerical values. Update Sex such that all female values are replaced with 1 and all male values are replaced with 0.

In [3]:
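# Map female -> 1 and male -> 0; the guard keeps a re-run from turning every value into 0
if not pd.api.types.is_numeric_dtype(passengers['Sex']):
    passengers['Sex'] = np.where(passengers['Sex'] == 'female', 1, 0)
print(passengers['Sex'])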

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    1
889    0
890    0
Name: Sex, Length: 891, dtype: int32
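
As an aside, the same encoding can be sketched with Series.map on a small made-up frame. The toy DataFrame and the SexCode column are illustrative assumptions, not part of passengers.csv. Writing to a new column keeps the original text values intact, and any value missing from the dictionary becomes NaN, which makes typos easy to spot.

```python
import pandas as pd

# Toy stand-in for the Sex column (hypothetical data, not passengers.csv)
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})

# Write the encoding to a new column so the original text values survive
toy['SexCode'] = toy['Sex'].map({'female': 1, 'male': 0})

# Values that are not keys of the dictionary would show up as NaN here
unmapped = int(toy['SexCode'].isna().sum())
print(toy['SexCode'].tolist())
print('unmapped values:', unmapped)
```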


Let’s take a look at Age. Print passengers['Age'].values. You can see several missing values, printed as nan. Fill all the empty Age values in passengers with the mean age.

In [5]:
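# Fill the NaN values in the Age column with the mean age
print(passengers['Age'].values)
# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
passengers['Age'] = passengers['Age'].fillna(passengers['Age'].mean())
print(passengers['Age'])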

[22.   38.   26.   35.   35.     nan 54.    2.   27.   14.    4.   58.
 20.   39.   14.   55.    2.     nan 31.     nan 35.   34.   15.   28.
  8.   38.     nan 19.     nan   nan 40.     nan   nan 66.   28.   42.
   nan 21.   18.   14.   40.   27.     nan  3.   19.     nan   nan   nan
   nan 18.    7.   21.   49.   29.   65.     nan 21.   28.5   5.   11.
 22.   38.   45.    4.     nan   nan 29.   19.   17.   26.   32.   16.
 21.   26.   32.   25.     nan   nan  0.83 30.   22.   29.     nan 28.
 17.   33.   16.     nan 23.   24.   29.   20.   46.   26.   59.     nan
 71.   23.   34.   34.   28.     nan 21.   33.   37.   28.   21.     nan
 38.     nan 47.   14.5  22.   20.   17.   21.   70.5  29.   24.    2.
 21.     nan 32.5  32.5  54.   12.     nan 24.     nan 45.   33.   20.
 47.   29.   25.   23.   19.   37.   16.   24.     nan 22.   24.   19.
 18.   19.   27.    9.   36.5  42.   51.   22.   55.5  40.5    nan 51.
 16.   30.     nan   nan 44.   40.   26.   17.    1.    9.     nan 45.
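
To see what mean imputation does on a small scale, here is a minimal sketch on a made-up Series. The ages below are illustrative assumptions, not rows from passengers.csv. It counts the missing entries, computes the mean of the observed values, and fills the gaps with it.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries
ages = pd.Series([22.0, np.nan, 38.0, np.nan, 30.0], name='Age')

n_missing = int(ages.isna().sum())   # how many values are missing
mean_age = ages.mean()               # NaN values are skipped by default
filled = ages.fillna(mean_age)       # returns a new Series, ages is unchanged

print('missing before:', n_missing)
print('mean of observed ages:', mean_age)
print('missing after:', int(filled.isna().sum()))
```

Mean imputation keeps the column average unchanged but shrinks its spread, since every filled value sits exactly at the mean.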


Given the strict class system onboard the Titanic, let’s utilize the Pclass column, or the passenger class, as another feature. Create a new column named FirstClass that stores 1 for all passengers in first class and 0 for all other passengers.

Create a new column named SecondClass that stores 1 for all passengers in second class and 0 for all other passengers.

Print passengers and inspect the DataFrame to ensure all the updates have been made.

In [6]:
# Create a first class column
passengers['FirstClass'] = (passengers['Pclass'] == 1).astype(int)
# Create a second class column
passengers['SecondClass'] = (passengers['Pclass'] == 2).astype(int)
print(passengers)

     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
..           ...       ...     ...   
886          887         0       2   
887          888         1       1   
888          889         0       3   
889          890         1       1   
890          891         0       3   

                                                  Name  Sex        Age  SibSp  \
0                              Braund, Mr. Owen Harris    0  22.000000      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...    0  38.000000      1   
2                               Heikkinen, Miss. Laina    0  26.000000      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)    0  35.000000      1   
4                             Allen, Mr. William Henry    0  35.000000      0   
..                                                 ...  .
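
The two indicator columns are a hand-built one-hot encoding with third class as the baseline. As a hedged alternative, the sketch below uses pd.get_dummies on a made-up Pclass column. The toy frame and the Class_ prefix are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical passenger classes
toy = pd.DataFrame({'Pclass': [3, 1, 2, 1]})

# One indicator column per class value: Class_1, Class_2, Class_3
dummies = pd.get_dummies(toy['Pclass'], prefix='Class', dtype=int)

# Drop third class so that it acts as the baseline category
class_features = dummies.drop(columns='Class_3')
print(class_features)
```

Dropping one category avoids a redundant column: a passenger with 0 in both remaining columns must be in third class.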

Now that we have cleaned our data, let’s select the columns we want to build our model on. Select columns Sex, Age, FirstClass, and SecondClass and store them in a variable named features. Select column Survived and store it in a variable named survival.

Split the data into training and test sets using sklearn’s train_test_split() function. We’ll use the training set to train the model and the test set to evaluate the model.

In [8]:
# Select the desired features
features = passengers[['Sex', 'Age', 'FirstClass', 'SecondClass']]
survival = passengers['Survived']

# Split into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(features, survival, test_size=0.3)
print(len(X_train))
print(len(X_test))

623
268
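
The split above is random, so the scores further down change from run to run. A minimal sketch of a reproducible split follows, assuming made-up arrays X and y. The random_state value 42 is an arbitrary choice, and stratify=y keeps the class proportions the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 12 of class 0 and 8 of class 1
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 12 + [1] * 8)

# A fixed seed makes the split repeatable; stratify preserves the 60/40 class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(len(X_tr), len(X_te))
print('share of class 1 in test set:', y_te.mean())
```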


Since sklearn’s LogisticRegression applies regularization by default, we need to scale our feature data so that the penalty treats every feature on the same footing. Create a StandardScaler object, .fit_transform() it on the training features, and .transform() the test features.

In [9]:
# Scale the feature data so it has mean = 0 and standard deviation = 1
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)
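
The scaler is fitted on the training data only, and the test data is then transformed with the training mean and standard deviation. A small sketch with made-up numbers shows the effect (scaler_demo and the arrays are illustrative names, not part of the project):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train)  # learns mean and std from train
test_scaled = scaler_demo.transform(test)        # reuses the training statistics

print('learned mean:', scaler_demo.mean_)
print('learned std:', scaler_demo.scale_)
print('scaled training data:', train_scaled.ravel())
print('scaled test value:', test_scaled.ravel())
```

The scaled training data has mean 0 and standard deviation 1, while the test value does not have to, because its statistics were never used.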

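Create a LogisticRegression model with sklearn and .fit() it on the training data.

Fitting the model runs an iterative optimizer (sklearn’s default solver is lbfgs, a quasi-Newton method rather than plain gradient descent) to find the feature coefficients that minimize the regularized log-loss on the training data.

Score the model on the training data and on the test data.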

In [10]:
# Create and train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Score the model on the train data
print(model.score(X_train, y_train))

# Score the model on the test data
print(model.score(X_test, y_test))

0.7078651685393258
0.667910447761194
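
The scores above are accuracies, while the quantity minimized during fitting is the log-loss. As a rough illustration, the sketch below computes the log-loss by hand for a few made-up labels and predicted survival probabilities and compares it with sklearn.metrics.log_loss. The numbers are assumptions, not model output.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted probabilities of survival
y_true = np.array([1, 0, 1, 0])
p_survive = np.array([0.9, 0.2, 0.6, 0.4])

# Average negative log-likelihood of the true labels
manual = -np.mean(
    y_true * np.log(p_survive) + (1 - y_true) * np.log(1 - p_survive)
)
library = log_loss(y_true, p_survive)

print('manual log-loss:', manual)
print('sklearn log-loss:', library)
```

Confident wrong predictions are penalized heavily, which is why log-loss can worsen even when accuracy stays the same.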


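Print the feature coefficients determined by the model. Which feature is most important in predicting survival on the sinking of the Titanic?

Note that the Sex coefficient in the output below is exactly 0.0. The printout of passengers above shows Sex as 0 for every listed passenger, including the female ones, which suggests the mapping cell was run a second time and overwrote every value with 0. A constant column carries no information, so the coefficients and scores in this run reflect only Age and class.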

In [11]:
# Analyze the coefficients
print(model.coef_)
print(list(zip(['Sex','Age','FirstClass','SecondClass'],model.coef_[0])))

[[ 0.         -0.43218985  0.96944388  0.51756115]]
[('Sex', 0.0), ('Age', -0.43218985001281623), ('FirstClass', 0.9694438835782401), ('SecondClass', 0.5175611474036539)]
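
One hedged way to read these numbers is to rank them by absolute size and convert them to odds ratios with the exponential function. The sketch below copies the coefficients from the printout above. Because the features were standardized, each ratio describes the change in the odds of survival per one standard deviation of the feature, not per unit.

```python
import numpy as np

names = ['Sex', 'Age', 'FirstClass', 'SecondClass']
# Coefficients copied from the printout above
coefs = np.array([0.0, -0.43218985, 0.96944388, 0.51756115])

# exp(coefficient) is the multiplicative change in the odds of survival
odds_ratios = np.exp(coefs)

# Sort features from largest to smallest absolute coefficient
order = np.argsort(-np.abs(coefs))
ranked = [names[i] for i in order]

for i in order:
    print(f'{names[i]}: coefficient {coefs[i]:+.3f}, odds ratio {odds_ratios[i]:.2f}')
```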


Let’s use our model to make predictions on the survival of a few fateful passengers. Provided in the code editor is information for 3rd class passenger Jack and 1st class passenger Rose, stored in NumPy arrays. The arrays store 4 feature values, in the following order:

- Sex, represented by a 0 for male and 1 for female
- Age, in years
- FirstClass, with a 1 indicating the passenger is in first class
- SecondClass, with a 1 indicating the passenger is in second class

A third array, You, is also provided in the code editor with empty feature values. Uncomment the line containing You and update the array with your information, or the information for some fictitious passenger. Make sure to enter all values as floats with a decimal point!

Combine Jack, Rose, and You into a single NumPy array named sample_passengers.

In [12]:
# Sample passenger features: [Sex, Age, FirstClass, SecondClass]
Jack = np.array([0.0, 20.0, 0.0, 0.0])
Rose = np.array([1.0, 17.0, 1.0, 0.0])
You = np.array([0.0, 32.0, 1.0, 0.0])

# Combine passenger arrays into one 2D array with one row per passenger
sample_passengers = np.array([Jack, Rose, You])

Since our Logistic Regression model was trained on scaled feature data, we must also scale the feature data we are making predictions on. Using the StandardScaler object created earlier, apply its .transform() method to sample_passengers and save the result to sample_passengers.

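Print sample_passengers to view the scaled features. Run the cell only once: because the result overwrites sample_passengers, each re-run scales the already scaled values again. The Age values near -2.3 in the output below, for passengers aged 17 to 32, suggest that this happened in the run shown here.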

In [14]:
# Scale the sample passenger features
sample_passengers = scaler.transform(sample_passengers)
print(sample_passengers)

[[ 0.         -2.31464569 -1.82823697 -1.75200703]
 [ 1.         -2.33230455  3.9116181  -1.75200703]
 [ 0.         -2.24401028  3.9116181  -1.75200703]]
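
A small sketch of why the scaled result is best kept in a separate variable: applying the same fitted scaler twice gives a different, wrong value. The training ages and the names demo_scaler, raw, once and twice are assumptions made up for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training ages used to fit the scaler
ages_train = np.array([[20.0], [30.0], [40.0]])
demo_scaler = StandardScaler().fit(ages_train)

raw = np.array([[20.0]])            # unscaled age, kept untouched
once = demo_scaler.transform(raw)   # correct: scaled a single time
twice = demo_scaler.transform(once) # wrong: scaled values fed back in

print('raw:', raw.ravel())
print('scaled once:', once.ravel())
print('scaled twice:', twice.ravel())
```

Keeping raw and once under different names makes the cell safe to re-run, since the input to transform never changes.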


Who will survive, and who will sink? Use your model’s .predict() method on sample_passengers and print the result to find out.

Want to see the probabilities that led to these predictions? Call your model’s .predict_proba() method on sample_passengers and print the result. The 1st column is the probability of a passenger perishing on the Titanic, and the 2nd column is the probability of a passenger surviving the sinking (which was calculated by our model to make the final classification decision).

In [15]:
# Make survival predictions!
print(model.predict(sample_passengers))
print(model.predict_proba(sample_passengers))

[0 1 1]
[[0.90205497 0.09794503]
 [0.03383528 0.96616472]
 [0.03510518 0.96489482]]
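
To make the link between coefficients, probabilities and predictions concrete, the sketch below fits a model on synthetic data and rebuilds the survival probability by hand with the sigmoid function. The data, the seed and the name clf are assumptions for illustration only. For a binary model, .predict() returns 1 whenever this probability exceeds 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data (not the Titanic data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# Linear score z = w . x + b, then the sigmoid turns it into a probability
z = X[:3] @ clf.coef_[0] + clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Second column of predict_proba is the probability of class 1
p_model = clf.predict_proba(X[:3])[:, 1]

print('manual probabilities:', p_manual)
print('model probabilities: ', p_model)
print('predictions:', clf.predict(X[:3]))
```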


It appears that Rose and You have a better chance of survival than Jack.