# Titanic - Machine Learning from Disaster

https://www.kaggle.com/competitions/titanic

https://www.kaggle.com/competitions/titanic/data

Sources of inspiration:
- https://www.kaggle.com/code/gunesevitan/titanic-advanced-feature-engineering-tutorial/notebook?scriptVersionId=27280410
- https://github.com/ageron/handson-ml3/blob/main/02_end_to_end_machine_learning_project.ipynb

In [47]:
import kagglehub
from kaggle.api.kaggle_api_extended import KaggleApi

import os

import pandas as pd

import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

## Download dataset

In [4]:
path = kagglehub.competition_download("titanic")
path

'/home/martin/.cache/kagglehub/competitions/titanic'

In [10]:
train_data = pd.read_csv(os.path.join(path, "train.csv"))
train_data.shape

(891, 12)

In [11]:
test_data = pd.read_csv(os.path.join(path, "test.csv"))
test_data.shape

(418, 11)

## Explore data

In [12]:
train_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [13]:
test_data.head()

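| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |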


In [34]:
def concat_df(train_data, test_data):
    # Returns a concatenated df of training and test set
    return pd.concat([train_data, test_data], sort=True).reset_index(drop=True)

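def divide_df(all_data):
    # Returns divided dfs of training and test set.
    # Rows 0-890 are the 891 training rows (.loc slicing is inclusive); the rest are test rows,
    # whose Survived column is all NaN and is therefore dropped.
    return all_data.loc[:890], all_data.loc[891:].drop(['Survived'], axis=1)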

# The goal of the competition is to predict survival for the passengers in test_data.
# In the real world we would not have that luxury: we would stay as far away from the test set as we can, so that we are prepared for production data.
# Here, however, we include test_data in exploration, augmentation and feature creation, so that we are as well prepared for the test data as possible.
all_data = concat_df(train_data, test_data)
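
The hard-coded `890` in `divide_df` only works for this exact train/test size. A minimal sketch of a size-independent split, assuming (as in this notebook) that `Survived` is missing exactly for the test rows. The toy frames and the name `divide_df_by_label` are hypothetical and not part of the original notebook:

```python
import pandas as pd

# Toy stand-ins for train_data / test_data (hypothetical values)
toy_train = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1], "Pclass": [3, 1, 3]})
toy_test = pd.DataFrame({"PassengerId": [4, 5], "Pclass": [2, 3]})

# Same concatenation as concat_df: test rows get NaN in Survived
toy_all = pd.concat([toy_train, toy_test], sort=True).reset_index(drop=True)

def divide_df_by_label(all_data, label="Survived"):
    # Rows that have a label are training rows; rows without one are test rows
    is_train = all_data[label].notna()
    return all_data[is_train], all_data[~is_train].drop(columns=[label])

toy_train_back, toy_test_back = divide_df_by_label(toy_all)
print(toy_train_back.shape, toy_test_back.shape)
```

This variant would break if the training set itself ever contained missing labels, so it is a trade-off rather than a strict improvement.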

### Survival rate based on Sex

In [14]:
women = train_data.loc[train_data.Sex == 'female']["Survived"]
rate_women = sum(women)/len(women)

print("Share of women who survived:", rate_women)

Share of women who survived: 0.7420382165605095


In [15]:
men = train_data.loc[train_data.Sex == 'male']["Survived"]
rate_men = sum(men)/len(men)

print("Share of men who survived:", rate_men)

Share of men who survived: 0.18890814558058924
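
Both rates can also be computed in one step by grouping on `Sex`, since the mean of a 0/1 column is the share of ones. A small sketch on hypothetical toy rows (not the competition data):

```python
import pandas as pd

# Hypothetical toy data: 2 women (both survived), 3 men (1 survived)
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 1],
})

# Mean of the 0/1 Survived column per group = survival rate per group
rates = toy.groupby("Sex")["Survived"].mean()
print(rates)
```

On the real data the equivalent call would be `train_data.groupby("Sex")["Survived"].mean()`.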


## Prepare the Data for Machine Learning Algorithms

### Feature Engineering

### Missing values

In [25]:
all_data.isnull().sum()

Age             263
Cabin          1014
Embarked          2
Fare              1
Name              0
Parch             0
PassengerId       0
Pclass            0
Sex               0
SibSp             0
Survived        418
Ticket            0
dtype: int64

#### Age

#### Cabin

**New Deck Feature**

The raw `Cabin` value doesn't give us much information on its own, but its first letter identifies the deck, and a `Deck` feature extracted from it correlates better with the other features.

**Missing Cabin values**

As noted in [Titanic - Advanced Feature Engineering Tutorial](https://www.kaggle.com/code/gunesevitan/titanic-advanced-feature-engineering-tutorial?scriptVersionId=27280410&cellId=22),
passengers with a missing `Cabin` value have a low survival rate, so the missingness itself can be a good feature. We do the same here and create a special category for them instead of replacing the missing values with the most common one.

In [35]:
# Creating Deck column from the first letter of the Cabin column (M stands for Missing)
all_data["Deck"] = all_data["Cabin"].apply(lambda s: s[0] if pd.notnull(s) else 'M')
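
Whether the missing-cabin group really survives less often can be checked by grouping on the new column. A minimal sketch on hypothetical toy rows (not the competition data), using the vectorised `.str[0]` accessor as an alternative way of taking the first letter:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing cabins
toy = pd.DataFrame({
    "Cabin": ["C85", np.nan, "C123", np.nan, "E46", np.nan],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# First letter of the cabin, with "M" for missing (same idea as the lambda above)
toy["Deck"] = toy["Cabin"].str[0].fillna("M")

# Survival rate and group size per deck; small groups deserve caution
deck_rates = toy.groupby("Deck")["Survived"].agg(["mean", "count"])
print(deck_rates)
```

On the real data only the training rows have a `Survived` value, so the test rows drop out of the mean automatically.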

**Deck T**

As noted in [Titanic - Advanced Feature Engineering Tutorial](https://www.kaggle.com/code/gunesevitan/titanic-advanced-feature-engineering-tutorial?scriptVersionId=27280410&cellId=22),
"There is one person on the boat deck in T cabin and he is a 1st class passenger. T cabin passenger has the closest resemblance to A deck passengers so he is grouped with A deck".

In [38]:
# Passenger in the T deck is changed to A
idx = all_data[all_data["Deck"] == 'T'].index
all_data.loc[idx, "Deck"] = 'A'

In [39]:
all_data_copy = all_data.copy()

all_data_copy["Deck_Num"] = LabelEncoder().fit_transform(all_data_copy["Deck"])
all_data_copy["Cabin_Num"] = LabelEncoder().fit_transform(all_data_copy["Cabin"])

corr_matrix = all_data_copy.corr(numeric_only=True)
corr_matrix["Survived"].sort_values(ascending=False)


Survived       1.000000
Fare           0.257307
Parch          0.081629
PassengerId   -0.005007
SibSp         -0.035322
Age           -0.077221
Cabin_Num     -0.253406
Deck_Num      -0.290485
Pclass        -0.338481
Name: Survived, dtype: float64

#### Embarked

Only two `Embarked` values are missing, so we simply replace them with the most common value.

In [50]:
all_data["Embarked"] = all_data["Embarked"].fillna(all_data["Embarked"].mode().iloc[0])

In [51]:
all_data.isnull().sum()

Age             263
Cabin          1014
Embarked          0
Fare              1
Name              0
Parch             0
PassengerId       0
Pclass            0
Sex               0
SibSp             0
Survived        418
Ticket            0
Deck              0
dtype: int64

## Train model

In [52]:
train_data, test_data = divide_df(all_data)

In [53]:
from sklearn.ensemble import RandomForestClassifier

y = train_data["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch", "Deck", "Embarked"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
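
Calling `pd.get_dummies` separately on the training and test frames only yields matching columns when both frames contain the same categories. A sketch of one way to guard against a mismatch, assuming hypothetical toy frames in which the test set lacks one deck:

```python
import pandas as pd

# Hypothetical toy data: deck "G" appears only in the training frame
toy_train = pd.DataFrame({"Pclass": [1, 3, 2], "Deck": ["A", "M", "G"]})
toy_test = pd.DataFrame({"Pclass": [3, 1], "Deck": ["M", "A"]})

X_toy = pd.get_dummies(toy_train)

# Reindex the test dummies to the training columns:
# categories absent from the test frame become all-zero columns,
# and any category unseen in training would be dropped
X_toy_test = pd.get_dummies(toy_test).reindex(columns=X_toy.columns, fill_value=0)

print(list(X_toy_test.columns))
```

An alternative with the same effect is to encode `all_data` once, before splitting it with `divide_df`.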


In [54]:
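# Caution: feature_importances_ has one entry per column of X (the one-hot encoded frame),
# not one per name in `features`, so this zip() stops after six pairs and mislabels them.
# Pair the importances with X.columns to get correct labels.
feature_importances = model.feature_importances_

sorted(zip(feature_importances,
           features),
           reverse=True)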

[(0.327005154615997, 'Deck'),
 (0.26061062765369747, 'Parch'),
 (0.13706077859531007, 'Pclass'),
 (0.05167849438169839, 'SibSp'),
 (0.04739282219656994, 'Sex'),
 (0.001839772641555666, 'Embarked')]
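
`feature_importances_` is aligned with the columns the model was fitted on, which after `pd.get_dummies` are the encoded columns rather than the original feature names. A minimal sketch of labelling the importances by column and then summing the dummy columns back to their source feature, on hypothetical toy data. The split on `"_"` is an assumption that holds only when the original feature names contain no underscore:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data (not the competition data)
toy = pd.DataFrame({
    "Pclass": [1, 3, 3, 2, 1, 3, 2, 3],
    "Sex": ["female", "male", "male", "female", "male", "female", "male", "male"],
    "Survived": [1, 0, 0, 1, 0, 1, 0, 0],
})
X_toy = pd.get_dummies(toy[["Pclass", "Sex"]])
toy_model = RandomForestClassifier(n_estimators=20, random_state=1).fit(X_toy, toy["Survived"])

# One importance per encoded column, labelled by that column's name
importances = pd.Series(toy_model.feature_importances_, index=X_toy.columns)

# Sum the dummy columns back to their source feature ("Sex_female" -> "Sex")
per_feature = importances.groupby(lambda col: col.split("_")[0]).sum()
print(per_feature.sort_values(ascending=False))
```

In the notebook the same idea would be `pd.Series(model.feature_importances_, index=X.columns)`.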

## Upload new submission

In [56]:
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions.astype('int64')})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!


| WARNING: Don't forget to change "changeId" so that we can track which version of the Jupyter notebook this submission came from! |
| --- |

In [None]:
%%script false --no-raise-error
# A cell magic must be the first line of the cell. This one hands the cell body to `false`,
# so running the cell by accident does nothing - comment out the line above to enable submission again.

changeId = "c5544276aa346801d355ce72bd6dfa6d823aae31"

api = KaggleApi()
api.authenticate()

# kaggle competitions submit -c titanic -f submission.csv -m "Message"
api.competition_submit(file_name="submission.csv", message=f"ChangeId: {changeId}", competition="titanic")

100%|██████████| 2.77k/2.77k [00:00<00:00, 5.70kB/s]


Successfully submitted to Titanic - Machine Learning from Disaster