# Titanic Prediction
The aim is to use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

## The Challenge
The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (e.g. name, age, gender, and socio-economic class).

<h3><center>Data Dictionary</center></h3>

| Variable | Definition | Key |
| :-: | :- | :- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex |  | 
| age | Age in years |  | 
| sibsp | # of siblings/spouses aboard |  | 
| parch | # of parents/children aboard |  | 
| ticket | Ticket number |  | 
| fare | Passenger fare |  | 
| cabin | Cabin number |  | 
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton | 

# 1. Data Exploration

#### Import libraries

In [1]:
import numpy as np
import pandas as pd

# Data Visualization
import plotly.express as px
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import style

# Algorithms
from sklearn import linear_model

#### Import data into dataframe

In [2]:
train_df = pd.read_csv("data/train.csv")
test_df = pd.read_csv("data/test.csv")

# View the top 5 rows in the dataset
train_df.head()

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| :-: | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [3]:
# Summarize the data: how many rows and columns are there,
# and what data type is each column?
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [6]:
# Retrieve statistical summary
train_df.describe()

|  | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
| :-: | :- | :- | :- | :- | :- | :- | :- |
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [4]:
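# Explore the pairwise correlation between the numeric variables
# (numeric_only=True skips text columns such as Name and Ticket,
# which recent pandas versions would otherwise raise an error on)
train_df.corr(numeric_only=True)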

|  | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
| :-: | :- | :- | :- | :- | :- | :- | :- |
| PassengerId | 1.0 | -0.005007 | -0.035144 | 0.036847 | -0.057527 | -0.001652 | 0.012658 |
| Survived | -0.005007 | 1.0 | -0.338481 | -0.077221 | -0.035322 | 0.081629 | 0.257307 |
| Pclass | -0.035144 | -0.338481 | 1.0 | -0.369226 | 0.083081 | 0.018443 | -0.5495 |
| Age | 0.036847 | -0.077221 | -0.369226 | 1.0 | -0.308247 | -0.189119 | 0.096067 |
| SibSp | -0.057527 | -0.035322 | 0.083081 | -0.308247 | 1.0 | 0.414838 | 0.159651 |
| Parch | -0.001652 | 0.081629 | 0.018443 | -0.189119 | 0.414838 | 1.0 | 0.216225 |
| Fare | 0.012658 | 0.257307 | -0.5495 | 0.096067 | 0.159651 | 0.216225 | 1.0 |


In [8]:
# Check for null values
train_df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
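
To put these counts in proportion, they can be converted into a share of the 891 rows. The snippet below is a minimal sketch that re-enters the three non-zero counts by hand so it runs on its own; the names `missing_counts` and `missing_pct` are illustrative only.

```python
import pandas as pd

# Row count and non-zero null counts copied from the outputs above
n_rows = 891
missing_counts = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})

# Share of rows with a missing value, as a percentage rounded to one decimal
missing_pct = (missing_counts / n_rows * 100).round(1)
print(missing_pct)
```

On the full dataframe, `train_df.isnull().mean() * 100` should give the same percentages without re-entering the counts.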

## Observations

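The strongest relationships in the correlation table are moderate rather than strong, and they fall into two groups:
- Survived, Pclass, and Fare: Pclass correlates negatively with both Survived (-0.34) and Fare (-0.55), while Fare correlates positively with Survived (0.26).
- Parch, SibSp, and Age: Parch and SibSp correlate positively (0.41), while Age correlates negatively with SibSp (-0.31) and Parch (-0.19).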
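
One way to read these pairs off the correlation table without scanning it by eye is to rank the off-diagonal entries by absolute value. The sketch below re-enters the correlation values shown earlier so it is self-contained; the variable names (`pairs`, `ranked`) are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

cols = ["PassengerId", "Survived", "Pclass", "Age", "SibSp", "Parch", "Fare"]
# Values copied from the correlation table above
values = [
    [1.0, -0.005007, -0.035144, 0.036847, -0.057527, -0.001652, 0.012658],
    [-0.005007, 1.0, -0.338481, -0.077221, -0.035322, 0.081629, 0.257307],
    [-0.035144, -0.338481, 1.0, -0.369226, 0.083081, 0.018443, -0.5495],
    [0.036847, -0.077221, -0.369226, 1.0, -0.308247, -0.189119, 0.096067],
    [-0.057527, -0.035322, 0.083081, -0.308247, 1.0, 0.414838, 0.159651],
    [-0.001652, 0.081629, 0.018443, -0.189119, 0.414838, 1.0, 0.216225],
    [0.012658, 0.257307, -0.5495, 0.096067, 0.159651, 0.216225, 1.0],
]
corr = pd.DataFrame(values, index=cols, columns=cols)

# Keep only the strictly lower triangle so each pair appears exactly once
lower = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
pairs = corr.where(lower).stack().dropna()

# Rank the pairs by the magnitude of the correlation, strongest first
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.head(5))
```

With the real data, `corr` would simply be the result of the `corr()` call above rather than a hand-built table.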

## Hypothesis
Based on the observations from the data and our own knowledge of the event, we hypothesize the following about an individual's survival probability:
- The survival rate is higher for higher ticket classes (1st class over 3rd class).
- The survival rate increases as the number of siblings/spouses on board increases.
- The survival rate increases as the number of parents/children on board increases.
- The survival rate is higher for middle-aged passengers than for children and the elderly.
- The survival rate is higher for female passengers.
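
Hypotheses like these can be checked by comparing the mean of `Survived` across groups, since the mean of a 0/1 column is the survival rate. The sketch below is a minimal illustration that uses only the five rows shown in the `head()` output, so its numbers say nothing about the full dataset; `survival_rate_by` is a hypothetical helper name.

```python
import pandas as pd

# The five rows from the head() output above (far too few to draw conclusions)
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
})

def survival_rate_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Return the share of survivors within each value of `column`."""
    return frame.groupby(column)["Survived"].mean()

rate_by_sex = survival_rate_by(sample, "Sex")
rate_by_class = survival_rate_by(sample, "Pclass")
print(rate_by_sex)
print(rate_by_class)
```

Calling `survival_rate_by(train_df, "Pclass")` on the full training data would give the rates needed to evaluate the first hypothesis.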

# 2. Data Wrangling
The data needs to be cleaned up. In this section, we will make the following changes:

1. Drop the Cabin column: too many values are missing (687 of 891), with no discernible way to replace them, and at first glance the variable doesn't seem to be connected to survival. It may be reintroduced later if deemed necessary.
2. Replace missing Age values with the average age.
3. Drop the rows with a missing Embarked value: only 2 rows are affected.
4. Convert categorical variables into numerical values, starting with the Sex column.
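
The Cabin, Age, and Sex steps above could also be gathered into one function so the same cleaning can later be applied to the test set. This is only a sketch under assumptions: `clean_passengers` is a hypothetical name, the sample reuses the five rows from the `head()` output with one age blanked out for illustration, and Sex is mapped by hand to 0/1.

```python
import numpy as np
import pandas as pd

def clean_passengers(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of `raw`; the input dataframe is left untouched."""
    cleaned = raw.drop(columns=["Cabin"])            # too many missing values
    mean_age = round(cleaned["Age"].mean(), 1)       # mean() ignores NaN
    cleaned["Age"] = cleaned["Age"].fillna(mean_age)
    cleaned["Sex"] = cleaned["Sex"].map({"female": 0, "male": 1})
    return cleaned

sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0, 35.0],  # third age blanked for illustration
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
})

cleaned = clean_passengers(sample)
print(cleaned)
```

Because `drop` returns a new dataframe, the original `sample` keeps its Cabin column and its missing age.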

In [None]:
# Drop the Cabin column; drop() returns a new dataframe, so train_df stays intact
df = train_df.drop(columns=['Cabin'])

In [None]:
df

In [None]:
# Replace missing ages with the average age (rounded to one decimal)
df['Age'] = df['Age'].fillna(round(df['Age'].mean(), 1))

In [None]:
# Drop the rows with a missing Embarked value
df.dropna(subset=['Embarked'], inplace=True)

In [None]:
# Any null values left?
df.isnull().sum()

In [None]:
df.columns

In [None]:
# Take a subset of the columns that seem most likely to affect survival,
# based on the correlation table and the hypotheses above
train = df.loc[:,['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp','Parch']]
train

In [None]:
# Split the data by sex to explore the demographics
train_men = train[train['Sex']=='male']
train_women = train[train['Sex']!='male']

In [None]:
#px.histogram(train_men, x='Age',color="Survived")
px.histogram(train_women, x='Age',color="Survived")

In [None]:
px.histogram(train, x='Age',color="Sex",nbins=10)

In [None]:
# Convert Sex into a numerical variable

# Why use LabelEncoder vs. dummy variables?

# Import LabelEncoder from scikit-learn's preprocessing module
from sklearn.preprocessing import LabelEncoder

# Create an instance of LabelEncoder
le = LabelEncoder()

# Fit the encoder on the Sex column and return the encoded labels
# (classes are sorted alphabetically: female = 0, male = 1)
label = le.fit_transform(df['Sex'])

# Display the encoded labels
label
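
On the question of LabelEncoder versus dummy variables, the sketch below encodes the Sex values from the five `head()` rows both ways. For a two-valued column the two encodings carry the same information. The difference matters more for columns with three or more categories, such as Embarked, where integer labels imply an ordering that dummy columns avoid. The variable names are illustrative, and note that scikit-learn documents `LabelEncoder` as intended for target labels rather than input features.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sex values from the five rows in the head() output
sex = pd.Series(["male", "female", "female", "female", "male"], name="Sex")

# Label encoding: one integer column, classes sorted alphabetically
le = LabelEncoder()
sex_codes = le.fit_transform(sex)

# Dummy (one-hot) encoding: one 0/1 column per category
sex_dummies = pd.get_dummies(sex, prefix="Sex", dtype=int)

print(list(le.classes_), sex_codes)
print(sex_dummies)
```

For a binary column, passing `drop_first=True` to `pd.get_dummies` would leave a single `Sex_male` column that matches the label-encoded values here.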

