### The Challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

### Project Goal
Use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

In [52]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the passenger data
passengers = pd.read_csv('train.csv')
# display(passengers)

# Update sex column to numerical
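passengers['Sex'] = passengers['Sex'].apply(lambda x: 1 if x == 'female' else 0)

# Fill the NaN values in the age column with the mean age
# (assign the result back: in-place fillna on a selected column is unreliable in recent pandas)
passengers['Age'] = passengers['Age'].fillna(passengers['Age'].mean())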
# print(passengers['Age'].values)
# print(passengers)

# Create a first class column
passengers['FirstClass'] = passengers['Pclass'].apply(lambda x: 1 if x == 1 else 0)

# Create a second class column
passengers['SecondClass'] = passengers['Pclass'].apply(lambda x: 1 if x == 2 else 0)

# Select the desired features
features = passengers[['Sex', 'Age', 'FirstClass', 'SecondClass']]
survival = passengers['Survived']
display(features)

# Perform train, test, split
features_train, features_test, survival_train, survival_test = train_test_split(
    features, survival, train_size=0.8, random_state=42
)


# Scale the feature data so it has mean = 0 and standard deviation = 1
scaler = StandardScaler()
train_features = scaler.fit_transform(features_train)
test_features = scaler.transform(features_test)

# Create and train the model
model = LogisticRegression()
model.fit(train_features, survival_train)

# Score the model on the train data
train_score = model.score(train_features, survival_train)
print("The model's score on train set:", train_score)

# Score the model on the test data
test_score = model.score(test_features, survival_test)
print("The model's prediction score on the split test set:", test_score)

# Analyze the coefficients
# print(model.coef_)
display(list(zip(['Sex','Age','FirstClass','SecondClass'],model.coef_[0])))

# Sample passenger features with numeric values depicting sex, age, firstclass, secondclass in that order
Jack = np.array([0.0,20.0,0.0,0.0])
Rose = np.array([1.0,17.0,1.0,0.0])
May = np.array([0.0,25.0,0.0,1.0])
Mo = np.array([0.0,33.0,1.0,0.0])
Ames = np.array([0.0,31.0,0.0,1.0])

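# Combine passenger arrays, keeping the column names the scaler was fitted on
sample_passengers = pd.DataFrame([Jack, Rose, May, Mo, Ames], columns=features.columns)

# Scale the sample passenger features (transform returns a NumPy array)
sample_passengers = scaler.transform(sample_passengers)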
print(sample_passengers)

# Make survival predictions!
survive = model.predict(sample_passengers)
survival_probability = model.predict_proba(sample_passengers)
display(survive)
display(survival_probability)

,Sex,Age,FirstClass,SecondClass
0,0,22.000000,0,0
1,1,38.000000,1,0
2,1,26.000000,0,0
3,1,35.000000,1,0
4,0,35.000000,0,0
...,...,...,...,...
886,0,27.000000,0,1
887,1,19.000000,1,0
888,1,29.699118,0,0
889,0,26.000000,1,0


The model's score on train set: 0.7949438202247191
The model's prediction score on the split test set: 0.8044692737430168


[('Sex', 1.2131895235075372),
 ('Age', -0.3269549109495998),
 ('FirstClass', 0.8611604341342556),
 ('SecondClass', 0.5112568947269701)]

[[-0.7243102  -0.73453348 -0.54488848 -0.51880845]
 [ 1.38062393 -0.96556183  1.8352379  -0.51880845]
 [-0.7243102  -0.34948624 -0.54488848  1.92749365]
 [-0.7243102   0.26658936  1.8352379  -0.51880845]
 [-0.7243102   0.11257046 -0.54488848  1.92749365]]
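
The scaling step can be illustrated in isolation. The sketch below standardizes a single column by subtracting its mean and dividing by its population standard deviation, which is what `StandardScaler` does per feature by default. The five ages are the first five rows of the feature table above, and the helper name `standardize` is hypothetical. The resulting values will not match the notebook's scaled output, because the notebook's scaler was fitted on the whole training split rather than on five rows.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population std (ddof=0), matching StandardScaler's default
    return [(x - mean) / std for x in values]


# First five ages shown in the feature table (illustrative subset only)
ages = [22.0, 38.0, 26.0, 35.0, 35.0]
scaled_ages = standardize(ages)

print([round(z, 3) for z in scaled_ages])
print("mean:", round(fmean(scaled_ages), 6), "std:", round(pstdev(scaled_ages), 6))
```

After scaling, ages below the column mean become negative and ages above it become positive, which is why Rose (17) and Jack (20) have negative scaled ages in the output above.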


array([0, 1, 0, 0, 0], dtype=int64)

array([[0.88643869, 0.11356131],
       [0.06760835, 0.93239165],
       [0.71709017, 0.28290983],
       [0.58237252, 0.41762748],
       [0.74671062, 0.25328938]])
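
To see how the coefficients and scaled features combine into these probabilities, the sketch below recomputes them by hand with the logistic function. The notebook never prints the model's intercept (in scikit-learn it is stored in `model.intercept_`), so as an assumption the intercept is inferred here from Jack's displayed probability and then applied to the other four passengers. The helper names `sigmoid` and `linear_score` are illustrative only.

```python
import math


def sigmoid(z):
    """Logistic function: maps a log-odds score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Coefficients as printed above, in the order Sex, Age, FirstClass, SecondClass
coefficients = [1.2131895235, -0.3269549109, 0.8611604341, 0.5112568947]

# Scaled sample passengers as printed above
scaled = {
    "Jack": [-0.7243102, -0.73453348, -0.54488848, -0.51880845],
    "Rose": [1.38062393, -0.96556183, 1.8352379, -0.51880845],
    "May": [-0.7243102, -0.34948624, -0.54488848, 1.92749365],
    "Mo": [-0.7243102, 0.26658936, 1.8352379, -0.51880845],
    "Ames": [-0.7243102, 0.11257046, -0.54488848, 1.92749365],
}


def linear_score(x):
    """Weighted sum of the scaled features (without the intercept)."""
    return sum(w * xi for w, xi in zip(coefficients, x))


# ASSUMPTION: the intercept is not shown in the notebook, so it is backed out
# from Jack's displayed survival probability (0.11356131).
jack_p = 0.11356131
intercept = math.log(jack_p / (1 - jack_p)) - linear_score(scaled["Jack"])

probabilities = {
    name: sigmoid(linear_score(x) + intercept) for name, x in scaled.items()
}
# A passenger is predicted to survive when the probability exceeds 0.5
predictions = {name: int(p > 0.5) for name, p in probabilities.items()}

for name in scaled:
    print(f"{name}: p(survive) = {probabilities[name]:.4f}, predicted = {predictions[name]}")
```

The recomputed probabilities should land close to the second column of the `predict_proba` output, and thresholding them at 0.5 gives the same labels as `model.predict`. Mo, for example, falls just under the threshold and is predicted not to survive.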

In [51]:
# Run the trained model on Kaggle's test set
test = pd.read_csv('test.csv')
display(test)

# update sex column to numerical
test['Sex'] = test['Sex'].apply(lambda x: 1 if x == 'female' else 0)

# fill the missing values in the age column with the column mean
# (assign the result back: in-place fillna on a selected column is unreliable in recent pandas)
test['Age'] = test['Age'].fillna(test['Age'].mean())

# create a first class column
test['FirstClass'] = test['Pclass'].apply(lambda x: 1 if x == 1 else 0)

# create a second class column
test['SecondClass'] = test['Pclass'].apply(lambda x: 1 if x == 2 else 0)

# display(test)

# selecting desired features
features = test[['Sex', 'Age', 'FirstClass', 'SecondClass']]

features_scaled = scaler.transform(features)
# print(features_scaled)
kaggle_predict = model.predict(features_scaled)
kaggle_proba = model.predict_proba(features_scaled)

print(kaggle_predict)
# print(kaggle_proba)

passengerid = test['PassengerId'].values
# print(passengerid)


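# Kaggle's submission format names these columns 'PassengerId' and 'Survived';
# rename the lowercase headers below before uploading if the grader rejects them
df = pd.DataFrame({'passengerid': passengerid, 'survived': kaggle_predict})
display(df)

df.to_csv('kaggle_test.csv', index=False)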

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


[0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 1
 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 1 0
 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0
 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0
 0 0 1 0 0 1 0 0 1 1 0 1 1 0 1 0 0 1 0 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 1 0 1
 0 1 0 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0 1 0 1 0 0 0 0 1 1 0 1 0 1 0 1 0
 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0
 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 1 0 1 1 0
 0 1 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 1 0 0 0
 0 1 1 1 1 1 0 1 0 0 0]


,passengerid,survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1
...,...,...
413,1305,0
414,1306,1
415,1307,0
416,1308,0
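
Both cells repeat the same four preprocessing steps, once for `train.csv` and once for `test.csv`. One possible way to keep the two in sync is a shared helper, sketched below. The function name `prepare_features` and its `age_fill` parameter are hypothetical and not part of the notebook. The tiny frame reuses four rows from the `test.csv` preview above so the example runs without the data files.

```python
import pandas as pd


def prepare_features(df, age_fill):
    """Build the four model features from raw Titanic columns (hypothetical helper)."""
    out = pd.DataFrame(index=df.index)
    out["Sex"] = (df["Sex"] == "female").astype(int)     # 1 = female, 0 = male
    out["Age"] = df["Age"].fillna(age_fill)               # replace missing ages
    out["FirstClass"] = (df["Pclass"] == 1).astype(int)   # one-hot for first class
    out["SecondClass"] = (df["Pclass"] == 2).astype(int)  # one-hot for second class
    return out


# Stand-in for test.csv: four rows taken from the preview above
raw = pd.DataFrame({
    "Pclass": [3, 3, 2, 3],
    "Sex": ["male", "female", "male", "male"],
    "Age": [34.5, 47.0, 62.0, None],
})

# The notebook fills each file with its own mean age; passing the value in
# would also allow reusing the training mean for the test set instead.
sample_features = prepare_features(raw, age_fill=raw["Age"].mean())
print(sample_features)
```

Passing the fill value as an argument is a design choice: it makes explicit which statistic replaces missing ages, so the training and test sets can be treated consistently with the scaler, which is already fitted on the training data only.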
