# Titanic Challenge from Kaggle

### Challenge description

<div style="text-align: justify">The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, **killing 1502 out of 2224 passengers and crew**. This sensational tragedy shocked the international community and led to better safety regulations for ships.</div>

<div style="text-align: justify">One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as *women, children, and the upper-class.*</div>

<div style="text-align: justify">In this challenge, I try to complete the analysis of what sorts of people were likely to survive, applying **machine learning** tools to predict which passengers survived the tragedy.</div>

## 1. Exploratory Data Analysis

The main libraries involved in this notebook are the following:
- **Pandas** for getting and cleaning data and dataset manipulation.
- **Matplotlib** for data visualization.
- **Numpy** for multidimensional array computing.
- **Sklearn** for machine learning models.

In [28]:
import pandas as pd

from matplotlib import pyplot
import matplotlib.pyplot as plt
plt.style.use('ggplot')

import numpy as np

The data has been split into two groups:
- training set (`train.csv`)
- test set (`test.csv`)

<div style="text-align: justify">The training set is used to build the machine learning models. For each passenger in the training set, the outcome (also known as the “ground truth”) is provided.</div>

<div style="text-align: justify">We are going to use the test set to see how well the model performs on unseen data. The ground truth is not provided for the passengers in the test set; for each of them, we use the trained model to predict whether or not they survived the sinking of the Titanic.</div>
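
As a minimal sketch of that difference, the two toy frames below stand in for `train.csv` and `test.csv`. The rows and the names `toy_train`, `toy_test` and `label_columns` are made up for illustration; the only point is that the target column exists in the training data and is absent from the test data.

```python
import pandas as pd

# Hypothetical miniature stand-ins for train.csv and test.csv (not the real files).
toy_train = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],  # ground truth, present only in the training set
    "Sex": ["male", "female", "female"],
})
toy_test = pd.DataFrame({
    "PassengerId": [892, 893],
    "Sex": ["male", "female"],
})

# Columns that exist in the training data but not in the test data.
label_columns = [c for c in toy_train.columns if c not in toy_test.columns]
print(label_columns)
```

The same comparison can be run on the real frames once both files are loaded.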

Let's get started!!!

In [9]:
train = pd.read_csv("./data/train.csv")

In [10]:
train.shape

(891, 12)

In [14]:
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


The **Survived** feature is the target variable. If the passenger survived, then *Survived* = 1; otherwise, *Survived* = 0.

#### Data dictionary:
    - survival     Survival                                      0 = No, 1 = Yes
    - pclass       Ticket class                                  1 = 1st, 2 = 2nd, 3 = 3rd
    - sex          Sex
    - Age          Age in years
    - sibsp        # of siblings / spouses aboard the Titanic
    - parch        # of parents / children aboard the Titanic
    - ticket       Ticket number
    - fare         Passenger fare
    - cabin        Cabin number
    - embarked     Port of Embarkation                           C = Cherbourg, Q = Queenstown, S = Southampton

#### Variable Notes:
**pclass**: A proxy for socio-economic status (SES)<br />
1st = Upper<br />
2nd = Middle<br />
3rd = Lower<br />

**age**: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

**sibsp**: The dataset defines family relations in this way:<br />
*Sibling* = brother, sister, stepbrother, stepsister<br />
*Spouse* = husband, wife (mistresses and fiancés were ignored)<br />

**parch**: The dataset defines family relations in this way:<br />
*Parent* = mother, father<br />
*Child* = daughter, son, stepdaughter, stepson<br />
Some children travelled only with a nanny, therefore parch=0 for them.
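
The note on **age** can be turned into a quick check. The sketch below uses a handful of made-up ages (not rows from `train.csv`) and flags infants and estimated ages following the two rules stated above; the names `is_infant` and `is_estimated` are only illustrative.

```python
import pandas as pd

# Made-up ages for illustration; they are not rows from train.csv.
ages = pd.Series([0.42, 22.0, 28.5, 35.0, 40.5])

# Per the note: ages below 1 are fractional (infants), estimated ages end in .5.
is_infant = ages < 1
is_estimated = (ages >= 1) & (ages % 1 == 0.5)

print(is_infant.tolist())
print(is_estimated.tolist())
```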

In [19]:
train.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [49]:
train['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64
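
Because *Survived* is coded as 0/1, these counts also give the survival rate in the training set. The short sketch below rebuilds the counts by hand (so it runs without the CSV file) and shows that the rate matches the mean of *Survived* reported by `train.describe()`.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({0: 549, 1: 342}, name="Survived")

total = counts.sum()
survival_rate = counts.loc[1] / total

print(total)
print(round(survival_rate, 6))
```

On the real data, `train['Survived'].mean()` should give the same number directly.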

The *Survived* variable shows that 549 passengers in the training set died in the sinking and 342 survived, for a total of 891 passengers. The *Age* column, however, has a count of only 714 in the summary above, so it contains missing values. Let's check:

In [47]:
np.isnan(train['Age']).value_counts()

False    714
True     177
Name: Age, dtype: int64
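
The same kind of check can be run on every column at once. As a small sketch, the frame below holds only the five rows shown by `train.head()`, restricted to three columns, so the per-column counts describe that sample and not the full dataset; the last lines compute the share of missing ages from the counts above.

```python
import numpy as np
import pandas as pd

# The five rows shown by train.head(), restricted to three columns.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Number of missing entries per column in this five-row sample.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Share of missing ages in the full training set, from the counts above.
missing_age_share = 177 / (714 + 177)
print(round(missing_age_share, 3))
```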

One solution for dealing with missing values in the *Age* variable is to replace them with the mean or the median age. We select the median because it is more robust to outliers than the mean:

In [51]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
train['Age'] = train['Age'].fillna(train['Age'].median())
train.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.361582 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 13.019697 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 22.0 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 35.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
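
To illustrate why the median is the more robust choice for the imputation, here is a small sketch with hypothetical ages (not taken from the dataset): a single large value shifts the mean noticeably, while the median stays in the middle of the bulk of the values.

```python
import numpy as np

# Hypothetical ages: four passengers in their twenties and one 80-year-old.
ages = np.array([22, 24, 26, 28, 80])

mean_age = np.mean(ages)      # pulled upward by the single large value
median_age = np.median(ages)  # stays at the middle of the sorted values

print(mean_age, median_age)
```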


Let's check whether some groups of people, such as women or children, were more likely to survive, as the summary of the challenge suggests. For this, we are going to make some plots to help us visualize survival based on **gender** and **age**.