In this exercise we will be processing messy data from the popular "Titanic" dataset (taken from Kaggle), so that it is ready to use for building a model that can predict whether a given passenger survived or not. This process is based on the one found here: https://www.kaggle.com/samsonqian/titanic-guide-with-sklearn-and-eda

First we need to import the data, as well as the libraries we will be using. Numpy and Pandas will help us manipulate the data, and matplotlib and Seaborn will help us visualize it.

In [None]:
import numpy as np 
import pandas as pd 

import seaborn as sns
from matplotlib import pyplot as plt
sns.set_style("whitegrid")
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

training = pd.read_csv("../input/titanic/train.csv")
testing = pd.read_csv("../input/titanic/test.csv")

Using Pandas, we can take a look at some of the data we will be processing:

In [None]:
training.head()

In [None]:
testing.head()

We can see right away that there seem to be a lot of missing values in the data, especially in the Cabin column. We need to check the rest of the dataset to see just how much data is missing, and then decide how to replace it. That way the model can make a prediction for every row in the dataset.

In [None]:
def null_table(training, testing):
    print("---Training---")
    print(pd.isnull(training).sum()) 
    print(" ")
    print("---Testing---")
    print(pd.isnull(testing).sum())

null_table(training, testing)

Thankfully, most of the missing values are concentrated in just two columns, Age and Cabin; Embarked (in the training set) and Fare (in the testing set) are each missing only a few. Given how few values for Cabin there are in the dataset, it would be a good idea to just drop that column altogether. After all, it's probably highly correlated with the passenger's class or fare. While we're dropping attributes, we might as well do the same for Ticket, since the ticket number appears to be essentially arbitrary and would likely add noise to the model.
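
To make "how few values" concrete, the share of missing values per column can be computed directly. The sketch below uses a small made-up frame rather than the real data, and the 50% cut-off is only an illustrative assumption, not a rule from the original guide.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical cut-off: columns missing more than half of their values
# are candidates for dropping rather than filling.
to_drop = missing_share[missing_share > 0.5].index.tolist()
print(to_drop)
```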

For Age, we will need to take a look at its distribution to decide what to replace the missing values with:

In [None]:
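# Plot only the non-missing ages. Dropping every row with any missing value
# would also discard the many rows where only Cabin is missing.
# histplot replaces distplot, which is deprecated in recent seaborn releases.
sns.histplot(training["Age"].dropna(), kde=True)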

Since the distribution is a bit skewed, it would be best to replace the missing values with the median age. Having the values lean towards one end greatly impacts the mean, making it a less accurate replacement in this case.
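
A quick way to see why the median is the safer choice is to compare both statistics on a small skewed sample. The ages below are made up for illustration; a single large value pulls the mean well above the typical age while the median barely moves.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed sample: five young adults and one much older passenger.
ages = pd.Series([20, 22, 24, 26, 28, 80])

mean_age = ages.mean()
median_age = ages.median()
print(round(mean_age, 2), median_age)

# Filling a missing value with the median keeps it close to the typical age.
with_gap = pd.Series([20.0, np.nan, 28.0, 80.0])
filled = with_gap.fillna(with_gap.median())
print(filled.tolist())
```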

In [None]:
training.drop(labels = ["Cabin", "Ticket"], axis = 1, inplace = True)
testing.drop(labels = ["Cabin", "Ticket"], axis = 1, inplace = True)

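# Assign the filled column back instead of calling fillna(inplace=True) on a
# single column, which newer pandas versions may not apply to the DataFrame.
training["Age"] = training["Age"].fillna(training["Age"].median())
testing["Age"] = testing["Age"].fillna(testing["Age"].median())
training["Embarked"] = training["Embarked"].fillna("S")
testing["Fare"] = testing["Fare"].fillna(testing["Fare"].median())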

null_table(training, testing)

Before making any more modifications to our data, it would be a good idea to try and put it into charts so we can visualize it. Doing so might help us identify which attributes have the largest impact on whether the passenger survived or not. For example, let's take a look at the gender:

In [None]:
sns.barplot(x="Sex", y="Survived", data=training)
plt.title("Distribution of Survival based on Gender")
plt.show()

As expected, women survived at a much higher rate than men, likely due to the classic "women and children first" rule. This can also be seen in the distribution of survival based on age:

In [None]:
sns.stripplot(x="Survived", y="Age", data=training, jitter=True)

This one is closer, but we can still distinguish some clustering of passengers at the bottom of the right strip, meaning that younger passengers were more likely to survive.

Let's take a look at one more example with passengers' class:

In [None]:
sns.barplot(x="Pclass", y="Survived", data=training)
plt.ylabel("Survival Rate")
plt.title("Distribution of Survival Based on Class")
plt.show()

This one shows us that higher-class passengers were more likely to survive than lower-class passengers, which again, shouldn't be too surprising.
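
The bars in plots like these are per-group averages of Survived (the mean is the default estimator in `sns.barplot`), so the same survival rates can be read off as numbers with a groupby. A minimal sketch on a made-up toy frame, not the real data:

```python
import pandas as pd

# Made-up toy data: passenger class and whether the passenger survived.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 1],
})

# Mean of a 0/1 column per group is the survival rate of that group.
rates = toy.groupby("Pclass")["Survived"].mean()
print(rates)
```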

Next we need to make sure that all of our attributes are numerical, since most classification models require it. The attributes Name, Sex and Embarked are the only ones that are not numerical. Sex and Embarked are categorical, meaning that they can easily be transformed by assigning a number to each possible value they can have.
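
As a small illustration of what "assigning a number to each possible value" looks like, the sketch below runs scikit-learn's `LabelEncoder` on a short made-up list of port codes. The encoder sorts the distinct values and numbers them from zero.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of embarkation codes.
ports = ["S", "C", "Q", "S"]

encoder = LabelEncoder()
codes = encoder.fit_transform(ports)

# classes_ holds the distinct values in sorted order; a value's position
# in that array is the integer it is encoded as.
print(list(encoder.classes_))
print(codes.tolist())

# The mapping can be reversed to recover the original labels.
print(encoder.inverse_transform([0, 2]).tolist())
```

Note that the integer codes impose an ordering (here C < Q < S) that has no real meaning for these categories; whether that matters depends on the model used later.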

In [None]:
from sklearn.preprocessing import LabelEncoder

le_sex = LabelEncoder()
le_sex.fit(training["Sex"])

encoded_sex_training = le_sex.transform(training["Sex"])
training["Sex"] = encoded_sex_training
encoded_sex_testing = le_sex.transform(testing["Sex"])
testing["Sex"] = encoded_sex_testing

le_embarked = LabelEncoder()
le_embarked.fit(training["Embarked"])

encoded_embarked_training = le_embarked.transform(training["Embarked"])
training["Embarked"] = encoded_embarked_training
encoded_embarked_testing = le_embarked.transform(testing["Embarked"])
testing["Embarked"] = encoded_embarked_testing

training.head()

The Name column might seem like it could be dropped, but we can still take some useful information out of it. Many of the passengers have titles (like "Mr", "Miss", "Capt", etc.) which might help predict their survival. We can take these titles and turn them into a categorical attribute, and then into a numeric one, as with the previous two attributes.
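
As a rough sketch of how such an extraction could work, assuming names follow a "Surname, Title. Given names" layout, a regular expression can capture the first run of letters that is followed by a period. The three sample names below are only for illustration.

```python
import pandas as pd

# Sample names in the assumed "Surname, Title. Given names" layout.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Crosby, Capt. Edward Gifford",
])

# Capture the first run of letters immediately followed by a literal period.
# expand=False returns a Series instead of a one-column DataFrame.
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```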

In [None]:
# Extract the title: the first run of letters that is followed by a period.
# str.extract already works on the whole column, so no loop is needed.
training["Title"] = training["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)
testing["Title"] = testing["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)

# Count how often each title occurs in the training data.
title_dataframe = (
    training["Title"].value_counts()
    .rename_axis("Titles")
    .reset_index(name="Frequency")
)

print(title_dataframe)

Looks like there are 17 different titles in there. Now that we know what they all are, we can group the rare ones under a single "Other" label and then assign a number to each remaining title.
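
The cell that follows lists the titles to group by hand. As an alternative sketch (a different technique, not the one used in this notebook), rare titles could also be collapsed by a frequency threshold. The sample titles and the `min_count` value below are hypothetical.

```python
import pandas as pd

# Made-up sample of extracted titles.
titles = pd.Series(["Mr", "Mr", "Mr", "Miss", "Miss", "Capt", "Countess"])

# Hypothetical threshold: titles seen fewer than min_count times become "Other".
min_count = 2
counts = titles.value_counts()
rare = counts[counts < min_count].index

# Keep a title where it is not rare, otherwise replace it with "Other".
grouped = titles.where(~titles.isin(rare), "Other")
print(grouped.tolist())
```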

In [None]:
title_replacements = {"Mlle": "Other", "Major": "Other", "Col": "Other", "Sir": "Other", "Don": "Other", "Mme": "Other",
          "Jonkheer": "Other", "Lady": "Other", "Capt": "Other", "Countess": "Other", "Ms": "Other", "Dona": "Other"}

training.replace({"Title": title_replacements}, inplace=True)
testing.replace({"Title": title_replacements}, inplace=True)

le_title = LabelEncoder()
le_title.fit(training["Title"])

encoded_title_training = le_title.transform(training["Title"])
training["Title"] = encoded_title_training
encoded_title_testing = le_title.transform(testing["Title"])
testing["Title"] = encoded_title_testing

training.drop("Name", axis = 1, inplace = True)
testing.drop("Name", axis = 1, inplace = True)

training.head()

One more thing we can do before the final preprocessing step is to combine SibSp (number of siblings/spouses aboard) and Parch (number of parents/children aboard). These two attributes might not seem too impactful on their own, but together, plus one for the passenger themselves, they give the total size of each passenger's family group aboard, which might be a bit more significant.

In [None]:
training["FamSize"] = training["SibSp"] + training["Parch"] + 1
testing["FamSize"] = testing["SibSp"] + testing["Parch"] + 1

training.head()

The last preprocessing task we have left is scaling the data. We can see that the Age and Fare attributes above span much wider ranges than the values in the other columns. Some modelling techniques rely on calculating the distance between examples in order to classify them, and this difference in ranges would let those two attributes dominate the distance and hurt performance. It would be better to scale these values so that they are in line with the ranges of the other attributes.
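
To illustrate what standardization does, the sketch below applies `StandardScaler` to a few made-up fares. It assumes the common convention of fitting the scaler on the training values only and reusing the same mean and standard deviation for the testing values, so that both sets end up on the same scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up fares; scikit-learn expects a 2D array of shape (n_samples, n_features).
train_fares = np.array([[10.0], [20.0], [30.0]])
test_fares = np.array([[20.0], [40.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training values
# and returns (x - mean) / std for each of them.
train_scaled = scaler.fit_transform(train_fares)

# transform reuses the statistics learned from the training values.
test_scaled = scaler.transform(test_fares)

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel(), test_scaled.ravel())
```

After scaling, the training values have mean 0 and standard deviation 1, and a testing fare equal to the training mean maps to 0.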

In [None]:
from sklearn.preprocessing import StandardScaler

# Use one scaler per column, fitted on the training data only, and reuse it on
# the testing data so both sets share the same mean and standard deviation.
age_scaler = StandardScaler()
fare_scaler = StandardScaler()

# Double brackets select a 2D frame, which is the input shape the scaler expects.
training["Age"] = age_scaler.fit_transform(training[["Age"]]).ravel()
training["Fare"] = fare_scaler.fit_transform(training[["Fare"]]).ravel()
testing["Age"] = age_scaler.transform(testing[["Age"]]).ravel()
testing["Fare"] = fare_scaler.transform(testing[["Fare"]]).ravel()

training.head()

With all that done, the data is now finally ready for modelling.