# Titanic Kaggle Project

### The Challenge

![1200px-RMS_Titanic_3%20%281%29.jpg](attachment:1200px-RMS_Titanic_3%20%281%29.jpg)

The sinking of the Titanic is one of the most infamous shipwrecks in history. 

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive/ML model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

This is a supervised learning problem, since the training data includes labels. More specifically, it is a binary classification problem with only two classes: survived (1) or did not survive (0).

### Exploratory Data Analysis (EDA)

#### Summary Statistics

In [4]:
# Import relevant python modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns 
import sklearn

In [7]:
# Load CSV data into notebook as a DataFrame using pandas 
df = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [9]:
# See first 5 rows of each DataFrame
display(df.head())
display(test.head())

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


#### Data Dictionary

- survival: Did the passenger survive? (1 = Yes, 0 = No)
- pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- sex: Gender
- age: Age in years
- sibsp: Number of siblings / spouses aboard the Titanic
- parch: Number of parents / children aboard the Titanic
- ticket: Ticket number
- fare: Passenger fare
- cabin: Cabin number
- embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Variable Notes

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
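
The age encoding described in the variable notes above can be checked with a short sketch. The `classify_age` helper and the sample values are hypothetical and only illustrate the stated convention (fractional ages below 1, estimated ages ending in .5); in the notebook, the same function could be applied to `df['Age']`.

```python
import pandas as pd

# Hypothetical sample of ages; in the notebook this would be df['Age'].
ages = pd.Series([0.42, 22.0, 28.5, 35.0, None, 0.83, 40.5], name="Age")

def classify_age(age):
    """Label an age value according to the encoding in the variable notes."""
    if pd.isna(age):
        return "missing"
    if age < 1:
        return "infant (fractional)"
    if age % 1 == 0.5:
        return "estimated (xx.5)"
    return "reported"

# Apply the rule to every value and count how often each label occurs
labels = ages.apply(classify_age)
print(labels.value_counts())
```

Checking for missing values first matters here, because comparisons against NaN are always false and would otherwise fall through to the "reported" label.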

In [19]:
# Info for the whole DataFrame
df.info()  # prints the summary itself and returns None, so display() is not needed
print(df.shape)
print(df.columns)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

(891, 12)
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


<font color='green'> <b> This dataset is relatively clean, since it has already been preprocessed for us. However, the 'Age' (714 non-null), 'Cabin' (204 non-null), and 'Embarked' (889 non-null) columns contain null values, which we'll have to deal with later on. We have 891 rows, one for each of the 891 passengers in the training data, including whether they survived. In total, we have 12 columns, including the 'Survived' label, from which we can build useful features. </b> </font>
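
The null counts can also be read off directly rather than inferred from the `df.info()` output. The following is a minimal sketch, assuming a hypothetical helper `missing_summary` and a small made-up frame standing in for `df`:

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Return the count and percentage of missing values for each column that has any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns with at least one missing value, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Small hypothetical frame standing in for df
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})
print(missing_summary(sample))
```

In the notebook, `missing_summary(df)` should list 'Cabin', 'Age', and 'Embarked', consistent with the non-null counts above (891 - 204 = 687, 891 - 714 = 177, and 891 - 889 = 2 missing values respectively).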

In [25]:
# Let's look at some descriptive statistics for the numerical columns 
display(df.describe())
print(pd.unique(df['Pclass']))
print(pd.unique(df['Embarked']))



Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


[3 1 2]
['S' 'C' 'Q' nan]


<font color='green'> <b> This shows us several insights:
    - Only 38.38% of passengers in the training data survived
    - At least half of the passengers were in 3rd class (the median ticket class is 3)
    - The passengers were fairly young: the median age was 28 (mean about 29.7), and 75% were 38 or younger. However, the oldest person on board was 80, and there were also babies only a few months old (minimum age 0.42).
    - Average # of siblings/spouses was 0.52, but it ranges from 0 to 8
    - Average # of parents/children was 0.38, but it ranges from 0 to 6
    - Large disparity in the fare paid by individuals: 75% paid 31 or less and the median fare was about 14.45, yet the maximum fare was about 512. The minimum fare is 0, so some people apparently boarded for free. </b> </font>
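
The survival and ticket-class observations above can be checked more directly than through `describe()`. This is a minimal sketch, assuming a hypothetical helper `survival_overview` and a small made-up frame in place of `df`; it relies on the fact that the mean of a 0/1 label equals the share of ones.

```python
import pandas as pd

def survival_overview(frame):
    """Summarize the 0/1 Survived label and the share of each ticket class."""
    # The mean of a 0/1 label is the share of survivors
    survival_rate = frame["Survived"].mean()
    # Accuracy of always predicting the more common class
    majority_baseline = max(survival_rate, 1 - survival_rate)
    # Proportion of passengers in each ticket class, ordered 1, 2, 3
    class_shares = frame["Pclass"].value_counts(normalize=True).sort_index()
    return survival_rate, majority_baseline, class_shares

# Small hypothetical frame standing in for df
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3],
})
rate, baseline, shares = survival_overview(sample)
print(f"survival rate: {rate:.2f}, majority baseline: {baseline:.2f}")
print(shares)
```

On the training data, where the mean of 'Survived' is 0.3838, always predicting "did not survive" would score about 61.6% accuracy on those same rows, which is a useful floor when judging any model built later.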