# Predicting the survivors of the Titanic

#### [Problem](#problem)
#### [Import Packages](#import)
#### [Load the data](#load)
#### [Information about the data](#data_info)
#### [Exploratory Data Analysis](#eda)
##### [General descriptives](#general_descriptives)
##### [Survival rate by Sex](#male_female_survival)
##### [Survival rate by social class](#class_survival)

## Problem <a id='problem' ></a>
* Predict which people survived the Titanic given a set of variables.

## Import packages <a id='import' ></a>

In [23]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

## Load the data <a id='load' ></a>

* Two _.csv_ files have been provided for this problem.
* _train.csv_ contains predictor and target data and will be used to train and assess the model(s) built.
* _test.csv_ contains only the features and is the dataset on which our final model will be truly tested! The accuracy of the chosen model's predictions will be assessed by uploading them to Kaggle (or by finding the answer sheet online!)

In [24]:
# load the training data
raw_data = pd.read_csv('train.csv')

# create a copy of the raw data so that we always have the raw data to go back to if need be
df = raw_data.copy()

# check out the first 5 rows of data
df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


* A quick view of the data tells me that I'll need to transform the _Sex_ and _Embarked_ variables into numerical data before exploring correlations.
* I'm not sure how useful the _Ticket_ column will be, so it may be dropped later on; the same goes for _Name_.
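
One way the _Sex_ and _Embarked_ transformation could look is sketched below. This is only an illustration under my own assumptions: the `toy` frame is a handful of made-up rows standing in for _train.csv_, and the `Sex_num` and `Embarked_*` column names are hypothetical, not something the dataset prescribes.

```python
import pandas as pd

# Toy rows standing in for train.csv (values are illustrative only)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Survived": [0, 1, 1, 0],
})

# Two categories: map them directly to 0/1
toy["Sex_num"] = toy["Sex"].map({"male": 0, "female": 1})

# More than two categories: one indicator column per port of embarkation
embarked_dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)

# Replace the text columns with their numeric versions
encoded = pd.concat([toy.drop(columns=["Sex", "Embarked"]), embarked_dummies], axis=1)
print(encoded)
```

A plain 0/1 mapping is enough for a two-level variable, whereas indicator columns avoid implying an order between the ports.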

### Information about the data <a id='data_info' ></a>

The following information has been provided about the variables:
* **Pclass**: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower
* **Age**: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
* **SibSp**: Number of Siblings/Spouses Aboard. The dataset defines family relations in this way: _Sibling_ = brother, sister, stepbrother, stepsister. _Spouse_ = husband, wife (mistresses and fiancés were ignored)
* **Parch**: Number of Parents/Children Aboard. The dataset defines family relations in this way: _Parent_ = mother, father. _Child_ = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore _parch_ = 0 for them.
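
The _Age_ conventions above can be turned into simple flags. The sketch below is one possible reading, not something the data description spells out: I assume an age of at least 1 ending in .5 marks an estimate, and the `toy_age` values and flag names are made up for illustration.

```python
import pandas as pd

# Made-up ages, including a missing value
toy_age = pd.Series([22.0, 0.42, 28.5, 35.0, 0.83, 40.5, None], name="Age")

# Infants: fractional ages below 1
is_infant = toy_age < 1

# Estimated ages (assumption): at least 1 and of the form xx.5
is_estimated = (toy_age >= 1) & (toy_age % 1 == 0.5)

print(pd.DataFrame({"Age": toy_age, "infant": is_infant, "estimated": is_estimated}))
```

Missing ages compare as `False` in both flags, so they would need to be handled separately.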

## Exploratory Data Analysis <a id='eda' ></a>

Before beginning with data cleaning, pre-processing, etc., I want to get a feel for the data. Specifically, I'm interested in:
1. **General descriptives**: to get an idea of counts, means, medians and modes.
2. **Frequency distributions**: to see whether any variables contain outliers.
3. **Missing values**: some machine learning algorithms don't work well with missing values, so if there are any I will need to decide how to handle them (delete the entire row, impute with the median, etc.).
4. **Correlations**: to assess whether some variables are very important, which may influence whether they are kept (for example, when a very important variable has a lot of missing values).
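
For the missing-values step, a minimal sketch of how the check and the two handling options mentioned above might look. The `toy` frame is made up for illustration; on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps in Age and Cabin
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [None, "C85", None, "C123", None],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# How many values are missing per column, and what share of rows that is
missing_counts = toy.isna().sum()
missing_share = toy.isna().mean() * 100
print(missing_counts)

# Option A: delete every row where Age is missing
dropped = toy.dropna(subset=["Age"])

# Option B: impute missing ages with the median of the observed ages
imputed = toy.assign(Age=toy["Age"].fillna(toy["Age"].median()))
```

Dropping rows loses the other columns' information for those passengers, while imputing keeps the rows at the cost of flattening the _Age_ distribution.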

### General descriptives and investigating variables <a id='general_descriptives' ></a>

In [36]:
# get shape of data
df.shape

(891, 12)

In [25]:
# get general descriptives for the numeric data
df.describe()

|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


### Survival rate by Sex <a id='male_female_survival' ></a>

In [26]:
# did more males or females survive?
df.groupby(['Sex', 'Survived']).count()

| Sex | Survived | PassengerId | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| female | 0 | 81 | 81 | 81 | 64 | 81 | 81 | 81 | 81 | 6 | 81 |
| female | 1 | 233 | 233 | 233 | 197 | 233 | 233 | 233 | 233 | 91 | 231 |
| male | 0 | 468 | 468 | 468 | 360 | 468 | 468 | 468 | 468 | 62 | 468 |
| male | 1 | 109 | 109 | 109 | 93 | 109 | 109 | 109 | 109 | 45 | 109 |


The above output tells us that more females survived than males (233 vs 109), even though there were many more males on board (577 vs 314), so I am interested to see the percentage breakdown...

In [116]:
# get the percentages breakdown of male and female survival
# this should be read as x% of females survived i.e. if 100 females and 10 survived then the % returned will be 10
print(f"Female survival %: {sum((df.Sex == 'female') & (df.Survived == 1)) / sum(df.Sex == 'female')*100}")
print(f"Male survival %: {sum((df.Sex == 'male') & (df.Survived == 1)) / sum(df.Sex == 'male')*100}")

Female survival %: 74.20382165605095
Male survival %: 18.890814558058924


So, it would seem that _Sex_ is a very important variable in determining survival.
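
Because _Survived_ is coded 0/1, its mean within a group equals that group's survival rate. A minimal sketch of the same calculation via `groupby`, using a small made-up frame (the `toy` values are illustrative, not the real data):

```python
import pandas as pd

# Made-up rows: 3 of 4 females and 1 of 4 males survive
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 column per group, scaled to a percentage
survival_pct = toy.groupby("Sex")["Survived"].mean() * 100
print(survival_pct)
```

This scales to any number of categories without writing one `print` per group.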

### Survival rate by class <a id='class_survival' ></a>

We know that the _Pclass_ variable is a proxy for socio-economic status (1st = Upper, 2nd = Middle, 3rd = Lower), so it would be interesting to see whether survival rates differed between the classes...

In [124]:
# check the unique values in Pclass to make sure we capture all classes
df['Pclass'].unique()

array([3, 1, 2])

As expected, the output is the numbers 1, 2, and 3, representing the upper, middle, and lower socio-economic status groups respectively.

In [130]:
df.groupby(['Pclass', 'Survived']).count()

| Pclass | Survived | PassengerId | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 80 | 80 | 80 | 64 | 80 | 80 | 80 | 80 | 59 | 80 |
| 1 | 1 | 136 | 136 | 136 | 122 | 136 | 136 | 136 | 136 | 117 | 134 |
| 2 | 0 | 97 | 97 | 97 | 90 | 97 | 97 | 97 | 97 | 3 | 97 |
| 2 | 1 | 87 | 87 | 87 | 83 | 87 | 87 | 87 | 87 | 13 | 87 |
| 3 | 0 | 372 | 372 | 372 | 270 | 372 | 372 | 372 | 372 | 6 | 372 |
| 3 | 1 | 119 | 119 | 119 | 85 | 119 | 119 | 119 | 119 | 6 | 119 |


In [143]:
# 372 people from the lower class died, what is this as a % of all who died
372/(80+97+372)

0.6775956284153005

In [146]:
# what % of all of those aboard the ship were classified as lower class
(372+119)/891

0.5510662177328844

The output shows us that the majority of those who died were in the lower class (about 68% of those who died). The lower class made up only 55% of the passengers aboard the ship, though, so I am interested to see what the survival % for each socio-economic status looks like, to assess whether proportionally more of one class survived than the others...
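
The two figures above answer different questions: what share of the deaths were lower class, and what share of the passengers were lower class. A sketch of how `pd.crosstab` and `value_counts` could produce these views in one place. The `toy` frame is rebuilt from the counts in the groupby output above rather than read from _train.csv_, and the variable names are my own.

```python
import pandas as pd

# (Pclass, Survived) -> count, copied from the groupby output above
counts = {(1, 0): 80, (1, 1): 136, (2, 0): 97, (2, 1): 87, (3, 0): 372, (3, 1): 119}

# Expand the counts back into one row per passenger
rows = [(pclass, survived) for (pclass, survived), n in counts.items() for _ in range(n)]
toy = pd.DataFrame(rows, columns=["Pclass", "Survived"])

# Share of each outcome that falls in each class (each column sums to 1)
share_of_outcome = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="columns")

# Share of all passengers in each class
class_share = toy["Pclass"].value_counts(normalize=True)

# Outcome split within each class (each row sums to 1)
rate_within_class = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")

print(share_of_outcome.round(3))
print(class_share.round(3))
print(rate_within_class.round(3))
```

Comparing `share_of_outcome` with `class_share` shows whether a class is over-represented among the deaths, while `rate_within_class` gives the per-class survival rate directly.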

In [126]:
# get the percentage breakdown of survival by socio-economic status
print(f"Upper class survival %: {sum((df.Pclass == 1) & (df.Survived == 1)) / sum(df.Pclass == 1)*100}")
print(f"Middle class survival %: {sum((df.Pclass == 2) & (df.Survived == 1)) / sum(df.Pclass == 2)*100}")
print(f"Lower class survival %: {sum((df.Pclass == 3) & (df.Survived == 1)) / sum(df.Pclass == 3)*100}")

Upper class survival %: 62.96296296296296
Middle class survival %: 47.28260869565217
Lower class survival %: 24.236252545824847


* Interestingly, the survival rate for the upper class was about 2.6 times that of the lower class (63% vs 24%).
* Middle class survival (47%) was roughly 2 times that of the lower class but lower than that of the upper class.
* This would suggest that socio-economic status also had an important bearing on who survived.
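
The ratios in these bullets follow directly from the per-class counts. A short sketch of the arithmetic, with the counts copied from the groupby output earlier in this section (the dictionary and variable names are hypothetical):

```python
# Counts per Pclass, copied from the groupby output above
died = {1: 80, 2: 97, 3: 372}
survived = {1: 136, 2: 87, 3: 119}

# Survival rate per class = survivors / everyone in that class
rates = {c: survived[c] / (survived[c] + died[c]) for c in died}

# How many times higher than the lower-class rate
upper_vs_lower = rates[1] / rates[3]
middle_vs_lower = rates[2] / rates[3]

print(f"Upper vs lower: {upper_vs_lower:.2f}x")
print(f"Middle vs lower: {middle_vs_lower:.2f}x")
```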

In [137]:
# quick challenge: try to do the above in a for loop
for i in df['Pclass'].unique():
    print(f"SES class {i} survival %: {sum((df.Pclass == i) & (df.Survived == 1)) / sum(df.Pclass == i)*100}")

SES class 3 survival %: 24.236252545824847
SES class 1 survival %: 62.96296296296296
SES class 2 survival %: 47.28260869565217
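
The loop prints the classes as 3, 1, 2 because `unique()` returns values in order of first appearance. A small variation, sketched on made-up rows (the `toy` frame and the one-decimal formatting are my own choices), sorts the classes first and uses the mean of the 0/1 column as the rate:

```python
import pandas as pd

# Made-up rows; class 3 appears first, as in the real data
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 2, 3, 3],
    "Survived": [0, 1, 1, 1, 0, 1, 0, 0],
})

lines = []
for pclass in sorted(toy["Pclass"].unique()):
    in_class = toy["Pclass"] == pclass
    # mean of a 0/1 column = share of survivors in this class
    pct = toy.loc[in_class, "Survived"].mean() * 100
    lines.append(f"SES class {pclass} survival %: {pct:.1f}")

print("\n".join(lines))
```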


## Correlations

Let's see if there are correlations between the variables...

In [118]:
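# restrict to numeric columns first; pandas 2.x raises an error if corr() meets text columns such as Name
df.select_dtypes(include='number').corr()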

|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| PassengerId | 1.0 | -0.005007 | -0.035144 | 0.036847 | -0.057527 | -0.001652 | 0.012658 |
| Survived | -0.005007 | 1.0 | -0.338481 | -0.077221 | -0.035322 | 0.081629 | 0.257307 |
| Pclass | -0.035144 | -0.338481 | 1.0 | -0.369226 | 0.083081 | 0.018443 | -0.5495 |
| Age | 0.036847 | -0.077221 | -0.369226 | 1.0 | -0.308247 | -0.189119 | 0.096067 |
| SibSp | -0.057527 | -0.035322 | 0.083081 | -0.308247 | 1.0 | 0.414838 | 0.159651 |
| Parch | -0.001652 | 0.081629 | 0.018443 | -0.189119 | 0.414838 | 1.0 | 0.216225 |
| Fare | 0.012658 | 0.257307 | -0.5495 | 0.096067 | 0.159651 | 0.216225 | 1.0 |
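
The matrix only covers numeric columns, so _Sex_ is absent even though it looked important earlier. A minimal sketch of how the correlations with _Survived_ could be ranked once _Sex_ is encoded. The `toy` rows are made up, and `Sex_num` is a hypothetical column name.

```python
import pandas as pd

# Made-up rows standing in for train.csv
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "Sex": ["male", "female", "female", "female", "male", "male", "male", "female"],
    "Pclass": [3, 1, 3, 1, 3, 2, 1, 3],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 13.00, 30.00, 7.75],
})

# Encode Sex so that it can enter the correlation matrix
toy["Sex_num"] = toy["Sex"].map({"male": 0, "female": 1})

# Correlation of every numeric column with the target, strongest first
corr_with_target = (
    toy.select_dtypes(include="number")
    .corr()["Survived"]
    .drop("Survived")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Sorting by absolute value keeps strong negative relationships, such as the one with _Pclass_, near the top of the list.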
