# Data Preparation - Titanic Dataset

**Overview**:
The data has been split into two groups:

- training set (train.csv)
- test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

**Data Dictionary**

| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

**Variable Notes**

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
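
The overview mentions feature engineering, and the `sibsp` and `parch` definitions above suggest one common derived feature: the size of the family group a passenger travelled with. The following is a minimal sketch on a hand-made sample, not on real passenger rows. The column names `FamilySize` and `IsAlone` are assumptions chosen for illustration and are not part of the dataset.

```python
import pandas as pd

# Tiny hand-made sample using the Kaggle column names (not real passenger rows)
sample = pd.DataFrame({
    "SibSp": [1, 0, 3, 0],
    "Parch": [0, 0, 2, 0],
})

# FamilySize and IsAlone are hypothetical feature names, not part of the dataset
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1  # +1 counts the passenger
sample["IsAlone"] = (sample["FamilySize"] == 1).astype("int")

print(sample)
```

Note that children travelling only with a nanny have `parch=0`, so such a feature would mark them as travelling alone.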

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import pandas as pd

In [2]:
# Import the data
train_data = pd.read_csv('data/train.csv', index_col=False)

In [3]:
print(train_data.dtypes)
train_data.head()

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
# Remove PassengerId and Name to avoid overfitting on per-passenger identifiers
train_data = train_data.drop(['PassengerId', 'Name'], axis=1)

In [5]:
train_data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,3,male,22.0,1,0,A/5 21171,7.25,,S
1,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
2,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,1,female,35.0,1,0,113803,53.1,C123,S
4,0,3,male,35.0,0,0,373450,8.05,,S


Check the Dataset for missing Values

In [6]:
print('Data Types:\n',train_data.dtypes)
print('\nPercent Missing values:\n')
print(train_data.isnull().mean())
train_data.describe()

Data Types:
 Survived      int64
Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Ticket       object
Fare        float64
Cabin        object
Embarked     object
dtype: object

Percent Missing values:

Survived    0.000000
Pclass      0.000000
Sex         0.000000
Age         0.198653
SibSp       0.000000
Parch       0.000000
Ticket      0.000000
Fare        0.000000
Cabin       0.771044
Embarked    0.002245
dtype: float64


Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292
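
`isnull().mean()` returns the fraction of missing values per column (0.77 for Cabin means roughly 77 %). As a hedged sketch, the check could be wrapped in a small helper that reports both the absolute count and the percentage, sorted so the worst columns come first. The helper name `missing_summary` and the demo frame are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    summary = pd.DataFrame({
        "n_missing": df.isnull().sum(),
        "pct_missing": df.isnull().mean() * 100,
    })
    return summary.sort_values("pct_missing", ascending=False)

# Hand-made demo frame (not the real training data)
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})
report = missing_summary(demo)
print(report)
```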


Changing the Datatypes to work with conventional Training Algorithms

In [7]:
print(train_data.Fare.median())

14.4542


In [8]:
# Converting categorical labels from string to Numeric
train_data.Sex = (train_data.Sex=='male').astype('int')
train_data['Embarked_Cherbourg'] = (train_data.Embarked == 'C').astype('int')
train_data['Embarked_Queenstown'] = (train_data.Embarked == 'Q').astype('int')
train_data['Embarked_Southampton'] = (train_data.Embarked == 'S').astype('int')
train_data = train_data.drop('Embarked', axis=1)
# Converting class Labels
train_data.Pclass = train_data.Pclass.astype('category')
train_data.Sex = train_data.Sex.astype('category')
# train_data.Embarked = train_data.Embarked.astype('category')
train_data.Survived = train_data.Survived.astype('category')

# Could be converted, let's see ....
# train_data.SibSp = train_data.SibSp.astype('category')
# train_data.Parch = train_data.Parch.astype('category')

# Imputing the missing values (assign back instead of an inplace fillna on the column accessor)
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].mean())

In [9]:
print('\nPercent Missing values:\n')
print(train_data.isnull().mean())


Percent Missing values:

Survived                0.000000
Pclass                  0.000000
Sex                     0.000000
Age                     0.000000
SibSp                   0.000000
Parch                   0.000000
Ticket                  0.000000
Fare                    0.000000
Cabin                   0.771044
Embarked_Cherbourg      0.000000
Embarked_Queenstown     0.000000
Embarked_Southampton    0.000000
dtype: float64
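
The manual encoding above reports no missing values for the three Embarked columns, but only because a passenger with an unknown port gets 0 in all three of them. As an alternative sketch, `pd.get_dummies` with `dummy_na=True` keeps that information in a separate indicator column. The demo frame is hand-made, and the resulting column names (`Embarked_C`, ...) differ from the `Embarked_Cherbourg` style used in this notebook.

```python
import numpy as np
import pandas as pd

# Hand-made demo frame with one unknown port of embarkation
demo = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "Q"],
    "Fare": [7.25, 71.2833, 80.0, 8.05],
})

# One indicator column per port, plus one for missing values
encoded = pd.get_dummies(demo, columns=["Embarked"], dummy_na=True, dtype=int)

# Every row now has exactly one active Embarked_* indicator
indicator_sum = encoded.filter(like="Embarked_").sum(axis=1)
print(encoded)
```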


In [10]:
print(train_data.describe())
train_data.head()


              Age       SibSp       Parch        Fare  Embarked_Cherbourg  \
count  891.000000  891.000000  891.000000  891.000000          891.000000   
mean    29.699118    0.523008    0.381594   32.204208            0.188552   
std     13.002015    1.102743    0.806057   49.693429            0.391372   
min      0.420000    0.000000    0.000000    0.000000            0.000000   
25%     22.000000    0.000000    0.000000    7.910400            0.000000   
50%     29.699118    0.000000    0.000000   14.454200            0.000000   
75%     35.000000    1.000000    0.000000   31.000000            0.000000   
max     80.000000    8.000000    6.000000  512.329200            1.000000   

       Embarked_Queenstown  Embarked_Southampton  
count           891.000000            891.000000  
mean              0.086420              0.722783  
std               0.281141              0.447876  
min               0.000000              0.000000  
25%               0.000000              0.000000  


Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_Cherbourg,Embarked_Queenstown,Embarked_Southampton
0,0,3,1,22.0,1,0,A/5 21171,7.25,,0,0,1
1,1,1,0,38.0,1,0,PC 17599,71.2833,C85,1,0,0
2,1,3,0,26.0,0,0,STON/O2. 3101282,7.925,,0,0,1
3,1,1,0,35.0,1,0,113803,53.1,C123,0,0,1
4,0,3,1,35.0,0,0,373450,8.05,,0,0,1


### Create a Subset of the Data without the Combined Features (Ticket, Cabin)

In [11]:
train_min_cleaned = train_data.drop(['Ticket', 'Cabin'], axis=1)

In [12]:
train_min_cleaned.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_Cherbourg,Embarked_Queenstown,Embarked_Southampton
0,0,3,1,22.0,1,0,7.25,0,0,1
1,1,1,0,38.0,1,0,71.2833,1,0,0
2,1,3,0,26.0,0,0,7.925,0,0,1
3,1,1,0,35.0,1,0,53.1,0,0,1
4,0,3,1,35.0,0,0,8.05,0,0,1
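
Since the test set has to pass through the same preparation before a trained model can predict on it, the steps of this notebook could be collected in one function. The following is a minimal sketch under that assumption: the function name `prepare`, the `age_fill` parameter and the two small frames are hypothetical, and the category casts from the notebook are left out for brevity. The age fill value is taken from the training set so that both sets are imputed identically.

```python
import numpy as np
import pandas as pd

PORTS = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

def prepare(df: pd.DataFrame, age_fill: float) -> pd.DataFrame:
    """Apply the notebook's cleaning steps; age_fill should come from the training set."""
    # errors="ignore" lets the same call work if a column is already absent
    out = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"], errors="ignore")
    out["Sex"] = (out["Sex"] == "male").astype("int")
    for code, port in PORTS.items():
        out[f"Embarked_{port}"] = (out["Embarked"] == code).astype("int")
    out = out.drop(columns="Embarked")
    out["Age"] = out["Age"].fillna(age_fill)
    return out

# Hand-made stand-ins for train.csv and test.csv (not real passenger rows)
train = pd.DataFrame({
    "PassengerId": [1, 2, 3], "Survived": [0, 1, 1], "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "female"], "Age": [22.0, 38.0, np.nan],
    "Ticket": ["T1", "T2", "T3"], "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", "S"],
})
test = pd.DataFrame({
    "PassengerId": [4, 5], "Name": ["D", "E"], "Sex": ["male", "female"],
    "Age": [np.nan, 47.0], "Ticket": ["T4", "T5"], "Cabin": [None, None],
    "Embarked": ["Q", "S"],
})

age_mean = train["Age"].mean()          # statistic learned on the training set only
train_prepared = prepare(train, age_mean)
test_prepared = prepare(test, age_mean)
print(test_prepared)
```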


## Data Set Output

The preformatted data is stored in separate files to be loaded in the upcoming model training process.

1) 'train_cleaned': complete training set incl. all features prepared for processing

2) 'train_min_cleaned': training set without the Ticket and Cabin columns

3) 'train_subdata_[C,Q,S]': training set subset by port of embarkation (this export is currently commented out)


In [14]:
# Storing the complete, cleaned Dataset
train_data.to_csv('data/train_cleaned.csv')

# Storing the minimized Dataset
train_min_cleaned.to_csv('data/train_min_cleaned.csv')

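# Creating Datasets grouped by survival (Survived: 0 = No, 1 = Yes)
train_data[train_data['Survived']==1].to_csv('data/survived.csv')
# The original second export was left incomplete; 'data/not_survived.csv' is an assumed file name
train_data[train_data['Survived']==0].to_csv('data/not_survived.csv')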

# # Creating subsets of the data by embarkement Harbour
# for harbour in train_data.Embarked.cat.categories:
#     train_data[train_data.Embarked==harbour].to_csv('data/train_subdata_departure_{}.csv'.format(harbour))
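
Because `to_csv` is called with its defaults, the row index is written as an unnamed first column. The sketch below shows what that means when the files are loaded again for model training. It uses an in-memory buffer and a hand-made frame instead of the real files, so the variable names are illustrative only.

```python
import io
import pandas as pd

# Hand-made frame standing in for the cleaned training data
frame = pd.DataFrame({"Survived": [0, 1], "Fare": [7.25, 71.2833]})

# Write to an in-memory buffer instead of a file so the sketch has no side effects
buffer = io.StringIO()
frame.to_csv(buffer)  # default index=True writes the row index as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)                   # index comes back as a column "Unnamed: 0"
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)   # index restored, no extra column

print(naive.columns.tolist())
print(restored.columns.tolist())
```

CSV does not store pandas dtypes, so the `category` casts applied earlier would have to be repeated after loading.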

