# Cleaning and Pre-processing the Data
***

In the previous chapter, we became familiar with the data by exploring it. In this chapter, we will clean and pre-process it so that the different learning algorithms can be applied.

In [2]:
import numpy as np
import pandas as pd

In [66]:
# Import the data
original_data = pd.read_csv('./data/train.csv')
data = original_data
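
Plain assignment of a data frame binds a second name to the same object rather than copying it. The sketch below, on a toy frame with invented values, illustrates the difference: `drop()` returns a new frame and leaves the original intact, while an in-place column assignment through the alias also changes the original. The names `alias`, `snapshot` and `dropped` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy stand-in for the training frame (values invented for illustration).
original_data = pd.DataFrame({'PassengerId': [1, 2], 'Age': [22.0, None]})

alias = original_data            # second name for the same object, nothing is copied
snapshot = original_data.copy()  # independent copy

# drop() returns a new frame, so the original keeps all its columns.
dropped = alias.drop(['PassengerId'], axis=1)

# An in-place column assignment through the alias also changes the original.
alias['Age'] = alias['Age'].fillna(0.0)

print(alias is original_data)          # True
print(list(original_data.columns))     # ['PassengerId', 'Age']
print(original_data['Age'].tolist())   # [22.0, 0.0]
print(snapshot['Age'].isnull().sum())  # 1, the copy still has its missing value
```

If the untouched raw data is needed later, for instance to compare statistics before and after cleaning, working on `original_data.copy()` is the safer choice.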

In [4]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


***
## Step 1: Drop unnecessary features and samples

Not all the features/variables/attributes are necessary to create our predictive models. We will drop `PassengerId`, `Name`, `Ticket` and `Cabin`:

In [5]:
# Drop unnecessary features
data = data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis = 1)
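
As a hedged variant of the drop above, the following sketch uses the `columns=` keyword instead of `axis = 1` and passes `errors='ignore'` so that labels absent from the frame are skipped instead of raising. The toy frame and the names `df`, `to_drop` and `trimmed` are assumptions made for the illustration.

```python
import pandas as pd

# Toy frame holding only some of the columns we want to remove.
df = pd.DataFrame({
    'PassengerId': [1, 2],
    'Name': ['A', 'B'],
    'Survived': [0, 1],
    'Pclass': [3, 1],
})

to_drop = ['PassengerId', 'Name', 'Ticket', 'Cabin']

# columns= names the axis explicitly; errors='ignore' skips missing labels.
trimmed = df.drop(columns=to_drop, errors='ignore')

print(list(trimmed.columns))  # ['Survived', 'Pclass']
```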

In [6]:
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB
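
The non-null counts above imply that 891 - 714 = 177 `Age` values and 891 - 889 = 2 `Embarked` values are missing. A minimal sketch of how the same information can be obtained directly, on a toy frame with invented values (the name `missing_per_column` is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing entries (values invented for illustration).
df = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Embarked': ['S', 'C', None, 'S'],
})

# isnull() marks missing cells; summing the booleans counts them per column.
missing_per_column = df.isnull().sum()

# Show only the columns that actually have missing values.
print(missing_per_column[missing_per_column > 0])
```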


Concerning the samples, there are missing values for the `Age` and `Embarked` attributes. Missing `Age` values can be estimated, but unfortunately, missing `Embarked` values cannot:

In [8]:
data[data['Embarked'].isnull()]

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
61,1,1,female,38.0,0,0,80.0,
829,1,1,female,62.0,0,0,80.0,


As we can see, the missing `Embarked` values cannot be guessed or approximated from the other attributes, so it is better to drop these two samples.

In [9]:
data = data.drop(data[data['Embarked'].isnull()].index, axis = 0)

In [10]:
data.shape

(889, 8)

This confirms that the two samples without `Embarked` information were dropped (891 - 2 = 889). We will now reindex the data frame.

In [11]:
data = data.reset_index(drop = True)
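
The drop and the reindexing can also be written as a single chain with `dropna(subset=...)`. This is only a sketch of an equivalent formulation on a toy frame with invented values, not a change to the steps above:

```python
import numpy as np
import pandas as pd

# Toy frame where two rows lack the Embarked value.
df = pd.DataFrame({
    'Survived': [1, 0, 1, 1],
    'Embarked': ['S', np.nan, 'C', np.nan],
})

# Drop the rows with a missing Embarked value, then rebuild a 0..n-1 index.
cleaned = df.dropna(subset=['Embarked']).reset_index(drop=True)

print(cleaned.shape)        # (2, 2)
print(list(cleaned.index))  # [0, 1]
```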

***
## Step 2: Convert string features to integers

The `Sex` and `Embarked` attributes can't be handled by our predictive algorithms because they are strings. We will convert them to integers:

In [12]:
# Converting Sex to Gender with the following mapping: 0 = female, 1 = male
data['Gender'] = data['Sex'].map({ 'female': 0, 'male': 1 }).astype(int)

In [13]:
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Gender
0,0,3,male,22.0,1,0,7.25,S,1
1,1,1,female,38.0,1,0,71.2833,C,0
2,1,3,female,26.0,0,0,7.925,S,0
3,1,1,female,35.0,1,0,53.1,S,0
4,0,3,male,35.0,0,0,8.05,S,1


In [14]:
# Converting Embarked to Boarding with the following mapping: 0 = S, 1 = C, 2 = Q
data['Boarding'] = data['Embarked'].map({ 'S': 0, 'C': 1, 'Q': 2 }).astype(int)

In [15]:
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Gender,Boarding
0,0,3,male,22.0,1,0,7.25,S,1,0
1,1,1,female,38.0,1,0,71.2833,C,0,1
2,1,3,female,26.0,0,0,7.925,S,0,0
3,1,1,female,35.0,1,0,53.1,S,0,0
4,0,3,male,35.0,0,0,8.05,S,1,0
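
One thing to keep in mind with `map()` is that any value absent from the dictionary becomes `NaN`, which cannot then be converted to an integer. The sketch below, on toy series with invented values, checks that the mapping covers every value before converting; the names `mapping`, `unmapped` and `with_gap` are hypothetical.

```python
import pandas as pd

mapping = {'S': 0, 'C': 1, 'Q': 2}

# Toy series of boarding ports.
embarked = pd.Series(['S', 'C', 'Q', 'S'])

# Make sure every distinct value has an entry in the mapping.
unmapped = set(embarked.dropna().unique()) - set(mapping)
if unmapped:
    raise ValueError('No mapping for: {}'.format(sorted(unmapped)))

boarding = embarked.map(mapping).astype(int)
print(boarding.tolist())  # [0, 1, 2, 0]

# A value without a mapping silently turns into NaN.
with_gap = pd.Series(['S', 'X']).map(mapping)
print(with_gap.isnull().sum())  # 1
```

Note that the codes 0, 1 and 2 impose an order on the ports that has no real meaning; whether this matters depends on the learning algorithm used later.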


We can now drop the two string attributes `Sex` and `Embarked`:

In [16]:
data = data.drop(['Sex', 'Embarked'], axis = 1)

In [17]:
data.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Gender,Boarding
0,0,3,22.0,1,0,7.25,1,0
1,1,1,38.0,1,0,71.2833,0,1
2,1,3,26.0,0,0,7.925,0,0
3,1,1,35.0,1,0,53.1,0,0
4,0,3,35.0,0,0,8.05,1,0


In [18]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 8 columns):
Survived    889 non-null int64
Pclass      889 non-null int64
Age         712 non-null float64
SibSp       889 non-null int64
Parch       889 non-null int64
Fare        889 non-null float64
Gender      889 non-null int64
Boarding    889 non-null int64
dtypes: float64(2), int64(6)
memory usage: 55.6 KB


All attributes are now numerical (integer or float).

***
## Step 3: Filling missing data

As we have seen before, the only remaining missing values are in `Age`. They can be approximated by the mean age of similar passengers, that is, passengers of the same class and gender.

In [19]:
data_by_class_and_gender = data.groupby(['Pclass', 'Gender'])

In [21]:
# Show the key and the first rows of each (Pclass, Gender) group
for names, groups in data_by_class_and_gender:
    print(names)
    print(groups.head(3))

(1, 0)
    Survived  Pclass   Age  SibSp  Parch     Fare  Gender  Boarding
1          1       1  38.0      1      0  71.2833       0         1
3          1       1  35.0      1      0  53.1000       0         0
11         1       1  58.0      0      0  26.5500       0         0
(1, 1)
    Survived  Pclass   Age  SibSp  Parch      Fare  Gender  Boarding
6          0       1  54.0      0      0   51.8625       1         0
23         1       1  28.0      0      0   35.5000       1         0
27         0       1  19.0      3      2  263.0000       1         0
(2, 0)
    Survived  Pclass   Age  SibSp  Parch     Fare  Gender  Boarding
9          1       2  14.0      1      0  30.0708       0         1
15         1       2  55.0      0      0  16.0000       0         0
41         0       2  27.0      1      0  21.0000       0         0
(2, 1)
    Survived  Pclass   Age  SibSp  Parch  Fare  Gender  Boarding
17         1       2   NaN      0      0  13.0       1         0
20         0       2  

In [24]:
# Get the mean age in each group
mean_age_per_group = data_by_class_and_gender['Age'].mean()
mean_age_per_group

Pclass  Gender
1       0         34.240964
        1         41.281386
2       0         28.722973
        1         30.740707
3       0         21.750000
        1         26.507589
Name: Age, dtype: float64

In [26]:
# Example: look up the mean age of a group by its (Pclass, Gender) key
print(mean_age_per_group[1, 0])
print(mean_age_per_group[2, 1])

34.2409638554
30.7407070707
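
To make the group means concrete, here is a small worked sketch on a toy frame with invented values. It shows that `mean()` skips missing ages, so each group mean is computed from the known ages only, and that the result is indexed by the `(Pclass, Gender)` pair.

```python
import numpy as np
import pandas as pd

# Toy passengers; one age is missing in group (2, 1).
df = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2],
    'Gender': [0, 0, 1, 1, 1],
    'Age':    [30.0, 40.0, 50.0, np.nan, 20.0],
})

means = df.groupby(['Pclass', 'Gender'])['Age'].mean()

print(means.loc[(1, 0)])  # 35.0, the mean of 30 and 40
print(means.loc[(2, 1)])  # 20.0, the NaN is ignored
print(len(means))         # 3, only the combinations present in the data
```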


Now that we have the mean age for each class and gender, we will fill the missing values with these approximations:

In [30]:
# Create an alternative attribute to fill ages
data['EstimatedAge'] = data['Age']
data.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Gender,Boarding,EstimatedAge
0,0,3,22.0,1,0,7.25,1,0,22.0
1,1,1,38.0,1,0,71.2833,0,1,38.0
2,1,3,26.0,0,0,7.925,0,0,26.0
3,1,1,35.0,1,0,53.1,0,0,35.0
4,0,3,35.0,0,0,8.05,1,0,35.0


In [53]:
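# Rows with a missing Age. The execution counters show that this cell was run
# after the fill loop below, so EstimatedAge already holds the group means.
data[data['Age'].isnull()].head()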

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Gender,Boarding,EstimatedAge
5,0,3,,0,0,8.4583,1,2,26.507589
17,1,2,,0,0,13.0,1,0,30.740707
19,1,3,,0,0,7.225,0,1,21.75
26,0,3,,0,0,7.225,1,1,26.507589
28,1,3,,0,0,7.8792,0,2,21.75


In [43]:
for pclass in (1, 2, 3): # Passenger class
    for gender in (0, 1): # Passenger gender (0 = female, 1 = male)
        # Rows of this group whose Age is missing
        missing_in_group = data['Age'].isnull() & (data['Pclass'] == pclass) & (data['Gender'] == gender)
        data.loc[missing_in_group, 'EstimatedAge'] = mean_age_per_group[pclass, gender]
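
The nested loop over class and gender can also be expressed without explicit iteration. The following is a sketch of an alternative, assuming a toy frame with invented values: `transform('mean')` returns, for every row, the mean age of that row's group, and `fillna()` uses it only where the age is missing. The name `group_mean` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy passengers with one missing age in each group.
df = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Gender': [0, 0, 0, 1, 1, 1],
    'Age':    [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# One value per row: the mean age of the row's (Pclass, Gender) group.
group_mean = df.groupby(['Pclass', 'Gender'])['Age'].transform('mean')

# Keep known ages, fill missing ones with the group mean.
df['EstimatedAge'] = df['Age'].fillna(group_mean)

print(df['EstimatedAge'].tolist())  # [30.0, 40.0, 35.0, 20.0, 22.0, 24.0]
```

This version does not need to know the list of classes and genders in advance, which can help if the grouping attributes change.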

In [52]:
# Count the rows with a missing value before (Age) and after (EstimatedAge) the fill
print(len(data[data['Age'].isnull()]))
print(len(data[data['EstimatedAge'].isnull()]))

177
0


In [54]:
data[data['Age'].isnull()].head() # Missing Age values

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Gender,Boarding,EstimatedAge
5,0,3,,0,0,8.4583,1,2,26.507589
17,1,2,,0,0,13.0,1,0,30.740707
19,1,3,,0,0,7.225,0,1,21.75
26,0,3,,0,0,7.225,1,1,26.507589
28,1,3,,0,0,7.8792,0,2,21.75


In [55]:
data[~data['Age'].isnull()].head() # Not missing Age values

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Gender,Boarding,EstimatedAge
0,0,3,22.0,1,0,7.25,1,0,22.0
1,1,1,38.0,1,0,71.2833,0,1,38.0
2,1,3,26.0,0,0,7.925,0,0,26.0
3,1,1,35.0,1,0,53.1,0,0,35.0
4,0,3,35.0,0,0,8.05,1,0,35.0


And voilà! The missing age values are now filled with an estimate based on the passenger class and gender, and the known ages are kept unchanged.

We can now drop the original `Age` attribute and rename the `EstimatedAge` attribute to `Age`:

In [56]:
data = data.drop(['Age'], axis = 1)

In [59]:
data = data.rename(columns = {'EstimatedAge': 'Age'})

In [60]:
data.head()

Unnamed: 0,Survived,Pclass,SibSp,Parch,Fare,Gender,Boarding,Age
0,0,3,1,0,7.25,1,0,22.0
1,1,1,1,0,71.2833,0,1,38.0
2,1,3,0,0,7.925,0,0,26.0
3,1,1,1,0,53.1,0,0,35.0
4,0,3,0,0,8.05,1,0,35.0


In [61]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 8 columns):
Survived    889 non-null int64
Pclass      889 non-null int64
SibSp       889 non-null int64
Parch       889 non-null int64
Fare        889 non-null float64
Gender      889 non-null int64
Boarding    889 non-null int64
Age         889 non-null float64
dtypes: float64(2), int64(6)
memory usage: 55.6 KB


***
## Step 4: Saving the cleaned data into CSV

The resulting data frame is what the learning algorithms will load and process. We save it in a separate CSV file for future processing.

In [64]:
data.to_csv('./data/train_clean.csv')
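
By default, `to_csv()` also writes the row index as a first column without a header, which `read_csv()` then loads back as a column named `Unnamed: 0`. The sketch below illustrates this round trip with in-memory buffers instead of files, on a toy frame with invented values, and shows two ways of handling it: reading with `index_col=0` or writing with `index=False`.

```python
import io
import pandas as pd

df = pd.DataFrame({'Survived': [0, 1], 'Age': [22.0, 38.0]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)

with_index.seek(0)
reloaded = pd.read_csv(with_index)
print(list(reloaded.columns))      # ['Unnamed: 0', 'Survived', 'Age']

# Option 1: tell read_csv that the first column is the index.
with_index.seek(0)
reloaded_idx = pd.read_csv(with_index, index_col=0)
print(list(reloaded_idx.columns))  # ['Survived', 'Age']

# Option 2: do not write the index at all.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_plain = pd.read_csv(without_index)
print(list(reloaded_plain.columns))  # ['Survived', 'Age']
```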

***
## Misc

In [65]:
data.corr()

Unnamed: 0,Survived,Pclass,SibSp,Parch,Fare,Gender,Boarding,Age
Survived,1.0,-0.335549,-0.03404,0.083151,0.25529,-0.541585,0.108669,-0.072689
Pclass,-0.335549,1.0,0.081656,0.016824,-0.548193,0.127741,0.043835,-0.403713
SibSp,-0.03404,0.081656,1.0,0.414542,0.160887,-0.116348,-0.060606,-0.250693
Parch,0.083151,0.016824,0.414542,1.0,0.217532,-0.247508,-0.07932,-0.179671
Fare,0.25529,-0.548193,0.160887,0.217532,1.0,-0.179958,0.063462,0.115165
Gender,-0.541585,0.127741,-0.116348,-0.247508,-0.179958,1.0,-0.118593,0.118678
Boarding,0.108669,0.043835,-0.060606,-0.07932,0.063462,-0.118593,1.0,-0.047096
Age,-0.072689,-0.403713,-0.250693,-0.179671,0.115165,0.118678,-0.047096,1.0


In [67]:
original_data.corr()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
PassengerId,1.0,-0.005007,-0.035144,0.036847,-0.057527,-0.001652,0.012658
Survived,-0.005007,1.0,-0.338481,-0.077221,-0.035322,0.081629,0.257307
Pclass,-0.035144,-0.338481,1.0,-0.369226,0.083081,0.018443,-0.5495
Age,0.036847,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067
SibSp,-0.057527,-0.035322,0.083081,-0.308247,1.0,0.414838,0.159651
Parch,-0.001652,0.081629,0.018443,-0.189119,0.414838,1.0,0.216225
Fare,0.012658,0.257307,-0.5495,0.096067,0.159651,0.216225,1.0


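Comparing the two correlation matrices (before and after cleaning the data), we can notice that the dependence between survival and the other attributes barely changes: for the attributes present in both matrices, the correlation with `Survived` moves by less than 0.005 (for example, `Pclass` goes from -0.338 to -0.336 and `Age` from -0.077 to -0.073). We just added two more attributes to the study, `Gender` and `Boarding`, of which `Gender` shows the strongest correlation with survival (-0.54).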
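
Instead of comparing the two matrices by eye, the difference can be computed on the attributes they share. This is a minimal sketch on toy data with invented values, where the missing age is filled with the overall mean only to keep the example short; the names `before`, `after`, `shared` and `delta` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame before cleaning, with one missing age.
before = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1],
    'Fare':     [7.25, 71.28, 7.92, 8.05, 53.1],
    'Age':      [22.0, 38.0, np.nan, 35.0, 35.0],
})

# Toy frame after cleaning: the missing age is filled in.
after = before.copy()
after['Age'] = after['Age'].fillna(after['Age'].mean())

# Absolute change of every pairwise correlation on the shared columns.
shared = ['Survived', 'Fare', 'Age']
delta = (after[shared].corr() - before[shared].corr()).abs()

print(delta.round(3))
```

Pairs that do not involve the filled attribute, such as `Survived` and `Fare` here, are unaffected, so their entries in `delta` are zero.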