# Titanic Prediction
The aim is to use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

## The Challenge
The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “What sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

<h3><center>Data Dictionary</center></h3>

| Variable | Definition | Key |
| :-: | :- | :- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex |  | 
| Age | Age in years | 	  | 
| sibsp | # of siblings/spouses aboard |  | 
| parch | # of parents/children aboard |  | 
| ticket | Ticket number |  | 
| fare | Passenger fare |  | 
| cabin | Cabin number |  | 
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton | 

# 1. Data Exploration

#### Import libraries

In [1]:
import numpy as np
import pandas as pd

# Data visualization
import plotly.express as px
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import style

# Algorithms
from sklearn import linear_model

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

#### Import data into dataframe

In [2]:
train_df = pd.read_csv("data/train.csv")
test_df = pd.read_csv("data/test.csv")

# View the top 5 rows in the dataset
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
# Summarize the data: number of rows and columns, data type of each column,
# and how many non-null values each column holds
train_df.info()

In [None]:
# Retrieve statistical summary
train_df.describe()

In [None]:
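# Explore the relationship between the numeric variables
# (numeric_only=True skips text columns such as Name and Sex; needs pandas >= 1.5)
train_df.corr(numeric_only=True)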

In [None]:
# Get null values count for each column
train_df.isnull().sum()

## Observations

The correlation table shows comparatively stronger relationships among:
- Survived, Pclass and Fare
- Parch, SibSp, and Age
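
One way to read a correlation table is to sort each column's correlation with `Survived` by absolute value. The sketch below does this on a toy frame built from the numeric columns of the five rows printed by `train_df.head()` above; `toy_df` and `rank_correlations` are illustrative names, and correlations computed from five rows say nothing about the full dataset.

```python
import pandas as pd

# Toy data: numeric columns of the five rows shown by train_df.head() above
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass":   [3, 1, 3, 1, 3],
    "Age":      [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp":    [1, 1, 0, 1, 0],
    "Parch":    [0, 0, 0, 0, 0],
    "Fare":     [7.25, 71.2833, 7.925, 53.1, 8.05],
})

def rank_correlations(df, target):
    """Return correlations with `target`, strongest (by absolute value) first."""
    # Keep numeric columns only so text columns do not break corr()
    corr = df.select_dtypes("number").corr()[target].drop(target)
    # Constant columns (Parch in this toy frame) have an undefined correlation (NaN)
    corr = corr.dropna()
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

ranked = rank_correlations(toy_df, "Survived")
print(ranked)
```

Calling `rank_correlations(train_df, "Survived")` on the real frame would give the same kind of ranking for the full data.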

## Hypothesis
Based on observations from the data and our own knowledge of the event, we hypothesize the following about an individual's survival probability:
- The survival rate increases with ticket class (1st class highest, 3rd class lowest).
- The survival rate increases as the number of siblings/spouses on board increases.
- The survival rate increases as the number of parent/children relationships on board increases.
- The survival rate is higher for middle-aged individuals than for children and the elderly.
- The survival rate increases if the gender is female.
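
Hypotheses of this kind can be checked by comparing the share of survivors per group. The following is a minimal sketch on a toy frame built from the five rows printed by `train_df.head()` above; `toy_df` and `survival_rate_by` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy data: the five rows shown by train_df.head() above (selected columns)
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass":   [3, 1, 3, 1, 3],
    "Sex":      ["male", "female", "female", "female", "male"],
})

def survival_rate_by(df, column):
    """Mean of the 0/1 Survived flag per group, i.e. the share of survivors."""
    return df.groupby(column)["Survived"].mean()

rate_by_sex = survival_rate_by(toy_df, "Sex")
rate_by_class = survival_rate_by(toy_df, "Pclass")
print(rate_by_sex)
print(rate_by_class)
```

Because `Survived` is coded as 0/1, the group mean is the survival rate directly. Applied to the full `train_df`, the same call works for `SibSp` and `Parch`; for `Age`, the values would first need to be binned (for example with `pd.cut`).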

# 2. Data Cleaning
The data needs to be cleaned up. In this section, we will make the following changes:

1. Drop the Cabin column: too many values are missing, with no discernible way to replace them, and the variable does not seem connected to survival at first glance. It may be reintroduced later if deemed necessary.
2. Replace missing Age values with the average age (mean).
3. Replace missing Embarked values with the most frequent value (mode).

In [3]:
# Dropping Cabin column
train_df.drop(['Cabin'], axis=1,inplace=True)
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


In [4]:
# Replacing missing ages with the average age
train_df['Age'] = train_df['Age'].replace(np.nan,round(train_df['Age'].mean(),1))

In [5]:
# Replacing missing Embarked values with the most frequently occurring value
em_mode = train_df['Embarked'].mode()[0]
train_df['Embarked'] = train_df['Embarked'].replace(np.nan,em_mode)

In [6]:
# Any null values left?
train_df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
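
The cleaning steps above are applied to `train_df` only. If `test_df` is to be cleaned the same way, one option is to wrap the steps in a helper and reuse the fill values computed from the training data. This is a sketch under that assumption; `clean`, `toy_train` and `toy_test` are hypothetical names, and the miniature frames stand in for the real CSV files.

```python
import numpy as np
import pandas as pd

def clean(df, age_fill, embarked_fill):
    """Apply the three cleaning steps without modifying the input frame."""
    out = df.drop(columns=["Cabin"], errors="ignore")
    out["Age"] = out["Age"].fillna(age_fill)
    out["Embarked"] = out["Embarked"].fillna(embarked_fill)
    return out

# Hypothetical miniature frames standing in for train.csv and test.csv
toy_train = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Embarked": ["S", "C", "S"],
    "Cabin": [np.nan, "C85", np.nan],
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 40.0],
    "Embarked": [np.nan, "Q"],
    "Cabin": [np.nan, np.nan],
})

# Fill values are computed from the training data only
age_fill = round(toy_train["Age"].mean(), 1)
embarked_fill = toy_train["Embarked"].mode()[0]

clean_train = clean(toy_train, age_fill, embarked_fill)
clean_test = clean(toy_test, age_fill, embarked_fill)
print(clean_test)
```

Reusing the training mean and mode for the test set keeps both frames on the same footing and avoids deriving fill values from data the model is later evaluated on.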

In [7]:
# Taking subset of data - variables that seem like they will have most impact based on the correlation table
train_df = train_df.loc[:,['Pclass', 'Sex', 'Age', 'SibSp','Parch','Embarked','Survived']]
train_df

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Embarked,Survived
0,3,male,22.0,1,0,S,0
1,1,female,38.0,1,0,C,1
2,3,female,26.0,0,0,S,1
3,1,female,35.0,1,0,S,1
4,3,male,35.0,0,0,S,0
...,...,...,...,...,...,...,...
886,2,male,27.0,0,0,S,0
887,1,female,19.0,0,0,S,1
888,3,female,29.7,1,2,S,0
889,1,male,26.0,0,0,C,1


## Save clean data into a csv file

In [8]:
train_df.to_csv('data/train_clean.csv', index=False)

# 3. Data Wrangling/Munging
1. Convert categorical variables into numerical values:
    - Sex: from male, female
    - Embarked: from C, Q, S
2. Ensure Pclass values (1, 2, 3) are treated as values without order/rank
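
As a compact point of comparison for the two steps listed above, `pd.get_dummies` can one-hot encode several columns in a single call. This is an alternative sketch, not the approach this notebook follows; `toy_df` holds the selected columns of the first five rows of the cleaned frame.

```python
import pandas as pd

# Toy data: first five rows of the cleaned frame (selected columns)
toy_df = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 3],
    "Sex":      ["male", "female", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Pclass is numeric, so it must be listed explicitly to be treated as a category
encoded = pd.get_dummies(toy_df, columns=["Pclass", "Sex", "Embarked"], dtype=float)
print(encoded.columns.tolist())
```

Note that `get_dummies` only creates columns for values present in the frame it is given (there is no `Pclass_2` or `Embarked_Q` in this toy example), whereas a fitted `OneHotEncoder` remembers its categories and can be reapplied to new data with the same column layout.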

## Converting Categorical Variables

Encoding Sex

In [9]:
train_df = pd.read_csv('data/train_clean.csv')
train_df.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Embarked,Survived
0,3,male,22.0,1,0,S,0
1,1,female,38.0,1,0,C,1
2,3,female,26.0,0,0,S,1
3,1,female,35.0,1,0,S,1
4,3,male,35.0,0,0,S,0


In [10]:
# creating instance of one-hot-encoder
# drop='if_binary' keeps a single 0/1 column for a two-category feature
enc = OneHotEncoder(drop='if_binary',handle_unknown='ignore')

# one-hot encode the Sex column (passed as a one-column DataFrame)
enc_df = pd.DataFrame(enc.fit_transform(train_df[['Sex']]).toarray())


In [11]:
# name the encoded column after the category that was kept ('male')
enc_df.rename(columns={0: enc.categories_[0][1]}, inplace=True)
# join the encoded column to the main frame on the row index
train_df = train_df.join(enc_df)
train_df 

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Embarked,Survived,male
0,3,male,22.0,1,0,S,0,1.0
1,1,female,38.0,1,0,C,1,0.0
2,3,female,26.0,0,0,S,1,0.0
3,1,female,35.0,1,0,S,1,0.0
4,3,male,35.0,0,0,S,0,1.0
...,...,...,...,...,...,...,...,...
886,2,male,27.0,0,0,S,0,1.0
887,1,female,19.0,0,0,S,1,0.0
888,3,female,29.7,1,2,S,0,0.0
889,1,male,26.0,0,0,C,1,1.0


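Preview the frame without the original Sex column

In [12]:
# drop() without inplace=True returns a new frame and leaves train_df unchanged;
# Sex is removed for good, together with Pclass, in a later cell
train_df.drop(columns=['Sex'])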

Unnamed: 0,Pclass,Age,SibSp,Parch,Embarked,Survived,male
0,3,22.0,1,0,S,0,1.0
1,1,38.0,1,0,C,1,0.0
2,3,26.0,0,0,S,1,0.0
3,1,35.0,1,0,S,1,0.0
4,3,35.0,0,0,S,0,1.0
...,...,...,...,...,...,...,...
886,2,27.0,0,0,S,0,1.0
887,1,19.0,0,0,S,1,0.0
888,3,29.7,1,2,S,0,0.0
889,1,26.0,0,0,C,1,1.0


Encoding Pclass

In [16]:
# creating instance of one-hot-encoder
enc_pclass = OneHotEncoder(handle_unknown='ignore')

# one-hot encode the Pclass column so that 1, 2, 3 carry no order/rank
enc_pclass_df = pd.DataFrame(enc_pclass.fit_transform(train_df[['Pclass']]).toarray())
enc_pclass_df.head()

Unnamed: 0,0,1,2
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,0.0,1.0
3,1.0,0.0,0.0
4,0.0,0.0,1.0


In [20]:
col_name = np.char.add('pclass_',enc_pclass.categories_[0].astype("str"))
enc_pclass_df.columns = col_name
enc_pclass_df.head()

Unnamed: 0,pclass_1,pclass_2,pclass_3
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,0.0,1.0
3,1.0,0.0,0.0
4,0.0,0.0,1.0


In [21]:
# join the encoded Pclass columns to the main frame on the row index
train_df = train_df.join(enc_pclass_df)
train_df.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Embarked,Survived,male,pclass_1,pclass_2,pclass_3
0,3,male,22.0,1,0,S,0,1.0,0.0,0.0,1.0
1,1,female,38.0,1,0,C,1,0.0,1.0,0.0,0.0
2,3,female,26.0,0,0,S,1,0.0,0.0,0.0,1.0
3,1,female,35.0,1,0,S,1,0.0,1.0,0.0,0.0
4,3,male,35.0,0,0,S,0,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...
886,2,male,27.0,0,0,S,0,1.0,0.0,1.0,0.0
887,1,female,19.0,0,0,S,1,0.0,1.0,0.0,0.0
888,3,female,29.7,1,2,S,0,0.0,0.0,0.0,1.0
889,1,male,26.0,0,0,C,1,1.0,1.0,0.0,0.0


In [22]:
train_df.drop(columns=['Sex','Pclass'],inplace=True)
train_df.head()

Unnamed: 0,Age,SibSp,Parch,Embarked,Survived,male,pclass_1,pclass_2,pclass_3
0,22.0,1,0,S,0,1.0,0.0,0.0,1.0
1,38.0,1,0,C,1,0.0,1.0,0.0,0.0
2,26.0,0,0,S,1,0.0,0.0,0.0,1.0
3,35.0,1,0,S,1,0.0,1.0,0.0,0.0
4,35.0,0,0,S,0,1.0,0.0,0.0,1.0


Encoding Embarked

In [23]:
# creating instance of one-hot-encoder
enc_em = OneHotEncoder(handle_unknown='ignore')

# one-hot encode the Embarked column (C, Q, S)
enc_em_df = pd.DataFrame(enc_em.fit_transform(train_df[['Embarked']]).toarray())
enc_em_df.head()

Unnamed: 0,0,1,2
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,0.0,1.0
3,0.0,0.0,1.0
4,0.0,0.0,1.0


In [24]:
col_name = np.char.add('embarked_',enc_em.categories_[0].astype("str"))
enc_em_df.columns = col_name
enc_em_df.head()

Unnamed: 0,embarked_C,embarked_Q,embarked_S
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,0.0,1.0
3,0.0,0.0,1.0
4,0.0,0.0,1.0


In [25]:
# join the encoded Embarked columns to the main frame on the row index
train_df = train_df.join(enc_em_df)
train_df.head()

Unnamed: 0,Age,SibSp,Parch,Embarked,Survived,male,pclass_1,pclass_2,pclass_3,embarked_C,embarked_Q,embarked_S
0,22.0,1,0,S,0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
1,38.0,1,0,C,1,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2,26.0,0,0,S,1,0.0,0.0,0.0,1.0,0.0,0.0,1.0
3,35.0,1,0,S,1,0.0,1.0,0.0,0.0,0.0,0.0,1.0
4,35.0,0,0,S,0,1.0,0.0,0.0,1.0,0.0,0.0,1.0


In [26]:
train_df.drop(columns=['Embarked'],inplace=True)
train_df.head()

Unnamed: 0,Age,SibSp,Parch,Survived,male,pclass_1,pclass_2,pclass_3,embarked_C,embarked_Q,embarked_S
0,22.0,1,0,0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
1,38.0,1,0,1,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2,26.0,0,0,1,0.0,0.0,0.0,1.0,0.0,0.0,1.0
3,35.0,1,0,1,0.0,1.0,0.0,0.0,0.0,0.0,1.0
4,35.0,0,0,0,1.0,0.0,0.0,1.0,0.0,0.0,1.0


Normalize Age column

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize only Age (zero mean, unit variance); scaling the whole frame
# would also rescale the Survived target and the 0/1 indicator columns
scaler = StandardScaler()
train_df[['Age']] = scaler.fit_transform(train_df[['Age']])
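
For reference, `StandardScaler` replaces each value x with (x - mean) / std, using the population standard deviation. The short check below compares it against a manual z-score on the ages of the five passengers listed at the top of the notebook; `ages`, `scaled` and `manual` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Ages of the five passengers shown by train_df.head(), as one feature column
ages = np.array([[22.0], [38.0], [26.0], [35.0], [35.0]])

scaled = StandardScaler().fit_transform(ages)

# Manual z-score with the population standard deviation (ddof=0, NumPy's default)
manual = (ages - ages.mean()) / ages.std()

print(np.allclose(scaled, manual))
```

After scaling, the column has zero mean and unit variance, which puts Age on a scale comparable to the 0/1 indicator columns.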
