In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib notebook

# Loading the data

In [13]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [14]:
train_len = len(train)
print("Training dataset size = {}".format(train_len))
train.head()

Training dataset size = 891


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [15]:
test_len = len(test)
print("Test dataset size = {}".format(test_len))
test.head()

Test dataset size = 418


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


## Data analysis

We are going to start by analysing the features and how they relate to the passengers' survival. First, let's get a sense of the data types and ranges of values for each feature.

Numerical features:

In [16]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


Categorical features:

In [17]:
train.describe(include=object)

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Moussa, Mrs. (Mantoura Boulos)",male,347082,B96 B98,S
freq,1,577,7,4,644


Percentage of missing data by feature:

In [18]:
print('Training set:')
print(train.isna().sum() / len(train))

Training set:
PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64


In [19]:
print('Test set:')
print(test.isna().sum() / len(test))

Test set:
PassengerId    0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.205742
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.002392
Cabin          0.782297
Embarked       0.000000
dtype: float64
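To compare the two sets at a glance, the per-column missing fractions can be placed side by side. The following is a minimal sketch on made-up stand-in frames (`toy_train` and `toy_test` are hypothetical, not the real data); it relies on `isna().mean()`, which gives the same fraction as `isna().sum() / len(df)`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real train/test frames (values are made up).
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0],
                          "Fare": [7.25, 71.28, 7.93, 53.10],
                          "Survived": [0, 1, 1, 1]})
toy_test = pd.DataFrame({"Age": [34.5, 47.0, np.nan, np.nan],
                         "Fare": [7.83, np.nan, 9.69, 8.66]})

# Fraction of missing values per column, one column per dataset.
# Columns that exist in only one frame (here: Survived) show up as NaN.
missing = pd.concat(
    [toy_train.isna().mean(), toy_test.isna().mean()],
    axis=1, keys=["train", "test"], sort=False,
)
print(missing)
```

A `NaN` in this table means the column is absent from that dataset, not that all of its values are missing.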


Features with missing data:
- **Cabin**: ~77% of the values are missing in the training set and ~78% in the test set; drop it.
- **Age**: ~20% of the values are missing in both sets; decide whether to fill them after the analysis.
- **Fare**: Only about 0.2% of the test set is missing; fill it with the median. Any reasonable value will do, since it won't influence the result significantly.

Besides Cabin, drop PassengerId, which is only a row identifier.

In [20]:
train.drop(columns=['Cabin', 'PassengerId'], inplace=True)
test.drop(columns=['Cabin', 'PassengerId'], inplace=True)
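The median fill planned for Fare in the list above could look like the following sketch. It runs on made-up stand-in frames, and taking the median from the training set only is an assumption made here; the notes above do not say which data the median comes from.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real frames (values are made up).
toy_train = pd.DataFrame({"Fare": [7.25, 71.28, 7.93, 53.10]})
toy_test = pd.DataFrame({"Fare": [7.83, np.nan, 9.69]})

# Assumption: compute the median on the training set only, so that no
# statistic derived from the test set is used for the fill.
fare_median = toy_train["Fare"].median()

# Replace the missing fares in the test set with that median.
toy_test["Fare"] = toy_test["Fare"].fillna(fare_median)
print(toy_test)
```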

### Age

In [36]:
g = sns.FacetGrid(train, col='Survived')
_ = g.map(sns.distplot, 'Age')


In [41]:
bins = 10
age = train[['Survived', 'Age']].dropna()
age_bins = pd.cut(age['Age'], bins)

survival = (age['Survived'].groupby(age_bins)
                           .agg(['count', 'sum', 'mean'])
                           .reset_index()
                           .rename(columns={'count': 'Total', 'sum': 'Survived', 'mean': 'Survival Rate'}))

survival

Unnamed: 0,Age,Total,Survived,Survival Rate
0,"(0.34, 8.378]",54,36,0.666667
1,"(8.378, 16.336]",46,19,0.413043
2,"(16.336, 24.294]",177,63,0.355932
3,"(24.294, 32.252]",169,65,0.384615
4,"(32.252, 40.21]",118,52,0.440678
5,"(40.21, 48.168]",70,24,0.342857
6,"(48.168, 56.126]",45,21,0.466667
7,"(56.126, 64.084]",24,9,0.375
8,"(64.084, 72.042]",9,0,0.0
9,"(72.042, 80.0]",2,1,0.5


Children aged about 8 or younger have a much higher chance of survival (66.7%, against 38.4% for the training set as a whole). There are also smaller bumps in the survival rate for passengers in the following bands:
- (32.252, 40.21] - 44.1%
- (48.168, 56.126] - 46.7%
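The odd-looking band edges come from how `pd.cut` handles an integer `bins` argument: it splits the observed range into equal-width intervals and widens the lowest edge slightly so that the minimum value is included. A small sketch, using a made-up series that only shares the minimum (0.42) and maximum (80.0) of the Age column:

```python
import pandas as pd

# Made-up ages; only the min and max match the training set.
ages = pd.Series([0.42, 5.0, 18.0, 30.0, 45.0, 80.0])

# Ten equal-width bands over [min, max].
bands = pd.cut(ages, bins=10)
edges = bands.cat.categories

# Each band spans (max - min) / 10 years, i.e. 7.958 here.
width = (ages.max() - ages.min()) / 10

print(width)
print(edges[0])   # lowest band; its left edge sits just below the minimum
print(edges[-1])  # highest band; it closes at the maximum
```

Because the edges depend on the observed minimum and maximum, rounder bands (for example, explicit edges at multiples of 10) would need an explicit list passed as `bins`.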

### Filling missing Age values

Use Pclass and Title to fill the missing age values.

In [67]:
# Use both the training and test dataset to fill the missing Age values
fill_age_df = pd.concat([train, test], sort=False)[['Name', 'Age', 'Pclass']]

# Get the titles of the passengers, e.g. "Mr." from "Braund, Mr. Owen Harris"
# (+ 2 skips the comma and the space that follows it)
fill_age_df['Title'] = fill_age_df['Name'].apply(lambda x: x[x.find(', ') + 2:x.find('.') + 1])

# Now that we have the titles we can drop the names
fill_age_df.drop(columns='Name', inplace=True)

# Encode titles as integers
fill_age_df['Title'] = fill_age_df['Title'].astype('category').cat.codes
fill_age_df.head()


Unnamed: 0,Age,Pclass,Title
0,22.0,3,12
1,38.0,1,13
2,26.0,3,9
3,35.0,1,13
4,35.0,3,12
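As an alternative to slicing with `find`, the title can be pulled out with a vectorized regular expression. This is a sketch of a swapped-in technique, not the code used above: the pattern is an assumption about the name format ("Surname, Title. Given names"), and unlike the slice it drops the trailing period.

```python
import pandas as pd

# A few names in the format used by the dataset.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# Capture everything between the comma (plus optional whitespace)
# and the first period that follows it.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.tolist())
```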


Use linear regression to fit a model to predict Age from Pclass and Title.
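A minimal sketch of that idea, on a made-up frame with the same three columns. It uses ordinary least squares through `np.linalg.lstsq` as a stand-in for whichever regression library is preferred; the frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up frame with the same columns as fill_age_df.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 4.0, np.nan, 54.0, 27.0],
    "Pclass": [3, 1, 3, 1, 3, 2, 1, 2],
    "Title": [12, 13, 9, 13, 8, 12, 12, 13],
})

features = ["Pclass", "Title"]
known = toy["Age"].notna()

def design_matrix(frame):
    # Prepend a column of ones so the model has an intercept term.
    values = frame[features].to_numpy(dtype=float)
    return np.column_stack([np.ones(len(frame)), values])

# Fit Age ~ intercept + Pclass + Title on the rows where Age is known.
coef, *_ = np.linalg.lstsq(
    design_matrix(toy[known]), toy.loc[known, "Age"].to_numpy(), rcond=None
)

# Predict Age for the rows where it is missing and write it back.
toy.loc[~known, "Age"] = design_matrix(toy[~known]) @ coef
print(toy)
```

Note that feeding the integer title codes straight into a linear model treats them as an ordered quantity, although the codes only reflect alphabetical order; one-hot encoding the titles would avoid that assumption.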