# Exercise: Titanic Dataset - Types of Data and Handling Missing Data

To build better Machine Learning models we have to understand that there are different types of data used to describe both **features** and **labels**:

- Real Valued data: A number that describes a **quantitative** feature, such as "Age", "Salary" or "Number of Relatives".

- Categorical data: Describes a **qualitative property**, such as "Sex", "Occupation" or "Blood Type". It can be represented by numbers or text, but must be converted into a numerical format for processing.

- Identity data: Data that is used to uniquely identify a record, such as an "ID", "SSN" or "Name".

Incomplete data can also negatively affect a model's performance and even completely stop it from working, which is why it's important to identify and correct gaps in our datasets.

In this exercise we take a deeper look into the Titanic Dataset, then build and compare Machine Learning models with the original and "cleaned" data.

## Preparing data

Let's reload the Titanic Dataset and reacquaint ourselves with its data:


In [6]:
import pandas as pd

# Load data from our dataset file into a pandas dataframe
dataset = pd.read_csv('Data/titanic.csv', index_col=False, sep=",",header=0)

# Let's take a look at the data
dataset.head()


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


Taking a careful look at the columns and data, we can identify the **Real Valued** features, like "Age", "SibSp", "Parch" and "Fare", and the **Categorical** features, such as "Survived", "Sex", "Pclass" and "Embarked".


We can display a brief summary of the datatypes by using pandas' `info()` method:



In [7]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Notice that `dataset.info()` also shows us the number of non-null items per column, but does not indicate if a feature is Categorical or Real valued.
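
Since `info()` cannot make that distinction, counting the distinct values in each column can give a rough hint. The sketch below is only a heuristic on a made-up toy dataframe; the helper `guess_kind` and its `max_categories` threshold are hypothetical names chosen for illustration, not part of pandas.

```python
import pandas as pd

# Toy dataframe that mimics a few Titanic columns (values are made up)
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived": [0, 1, 1, 1, 0, 0],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.05],
})


def guess_kind(column, max_categories=3):
    """Hypothetical helper: make a rough guess at the kind of data in a column."""
    distinct = column.nunique()
    if distinct == len(column):
        # Every value is unique, which is typical of identifiers
        return "possibly identity"
    if not pd.api.types.is_numeric_dtype(column) or distinct <= max_categories:
        # Text columns, or numeric columns with very few distinct values
        return "categorical"
    return "real valued"


kinds = {name: guess_kind(toy[name]) for name in toy.columns}
print(kinds)
```

A heuristic like this can be wrong (a real valued column with no repeated values would be flagged as a possible identifier), so the final decision still needs a human look at what each column means.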



## Handling Missing Data

As seen in the last Unit, the Titanic Dataset is not 100% complete:


In [8]:
# Calculate the number of empty cells in each column
# and store it in a new dataframe
missing_data = dataset.isnull().sum().to_frame()

# Rename column holding the sums
missing_data = missing_data.rename(columns={0:'Empty Cells'})

# Print the results
print(missing_data)

             Empty Cells
PassengerId            0
Survived               0
Pclass                 0
Name                   0
Sex                    0
Age                  177
SibSp                  0
Parch                  0
Ticket                 0
Fare                   0
Cabin                687
Embarked               2




Some rows in the "Age", "Cabin" and "Embarked" columns have "empty" cells (or cells where the value is `null`).
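
Raw counts are easier to judge when expressed as a share of all rows. A minimal sketch on a made-up toy dataframe (not the Titanic data) shows one way to do this, relying on the fact that the mean of a boolean column is the fraction of `True` values:

```python
import numpy as np
import pandas as pd

# Toy dataframe with some gaps (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [None, "C85", None, "C123", None],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# isnull() yields booleans; their mean is the fraction of missing cells
missing_share = toy.isnull().mean().mul(100).round(1)
print(missing_share)
```

The same expression applied to `dataset` would give the percentage of empty cells per column, which helps when choosing between the options that follow.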

There are **many** ways to address this issue, each with pros and cons.

Let's take a look at the less complicated options:

### Option 1: Delete rows with missing data

The "Embarked" column is the perfect candidate for this option, because it has only two rows with missing data. By deleting them we don't lose too much information (in contrast, we would lose almost the entire dataset if we applied this to the "Cabin" column).

In [9]:
# Create a "clean" dataset where we cumulatively fix missing values
# Start by removing rows ONLY where "Embarked" has no values
clean_dataset = dataset.drop(dataset[dataset["Embarked"].isnull()].index)
# Renumber the rows so the index has no gaps after the deletion
clean_dataset = clean_dataset.reset_index(drop=True)

# We started with 891 rows in the dataset and deleted 2.
# How many rows do we have now?
print(f"The shape for the clean dataset is {clean_dataset.shape}")


The shape for the clean dataset is (889, 12)


We expected `clean_dataset` to have 889 rows since we deleted 2 from the original dataset.
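
pandas also offers `dropna()` with a `subset` argument, which expresses the same idea more compactly. The sketch below uses a made-up toy dataframe rather than the Titanic file:

```python
import pandas as pd

# Toy dataframe: one row is missing "Embarked", two rows are missing "Cabin"
toy = pd.DataFrame({
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", "B28", None],
})

# Drop rows only where "Embarked" is missing; gaps in "Cabin" are left alone
toy_clean = toy.dropna(subset=["Embarked"]).reset_index(drop=True)
print(toy_clean.shape)
```

Restricting the check to `subset` matters here: calling `dropna()` with no arguments would also delete every row with a missing "Cabin" value.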

### Option 2: Replace empty values with the mean or median for that data

We should use this option for the "Age" field. It holds **Real Valued** data and less than 20% of its rows are empty, so we should have enough data to make a decent estimate of the mean age:


In [10]:
import numpy as np
# Calculate the mean value for the Age column
mean_age = clean_dataset["Age"].mean()  # 29.6420...

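# Replace empty values in "Age" with the mean calculated above.
# Assign the result back to the column: calling fillna(inplace=True) on a
# column selection may not update the dataframe in newer pandas versions.
clean_dataset["Age"] = clean_dataset["Age"].fillna(mean_age)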

# Let's see what the clean dataset looks like now
print(clean_dataset.isnull().sum().to_frame().rename(columns={0:'Empty Cells'}))

             Empty Cells
PassengerId            0
Survived               0
Pclass                 0
Name                   0
Sex                    0
Age                    0
SibSp                  0
Parch                  0
Ticket                 0
Fare                   0
Cabin                687
Embarked               0


As you can see above, the "Age" field has no empty cells anymore.
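
The heading of this option also mentions the median, which the cell above does not use. As a small illustration on made-up ages (not the Titanic data), the sketch below fills the same gaps both ways; the median is less affected by extreme values such as a very young or very old passenger:

```python
import numpy as np
import pandas as pd

# Made-up ages with two gaps and two extreme values (2 and 80)
ages = pd.Series([2.0, 22.0, 24.0, 26.0, 80.0, np.nan, np.nan], name="Age")

# mean() and median() skip missing values by default
mean_filled = ages.fillna(ages.mean())      # gaps become 30.8
median_filled = ages.fillna(ages.median())  # gaps become 24.0

print(mean_filled.tolist())
print(median_filled.tolist())
```

Which of the two is the better estimate depends on how skewed the column is, so it is worth checking both before settling on one.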

### Option 3: Assign a new category to unknown categorical data

The "Cabin" field is a categorical field. There's a finite number of possible cabins on the Titanic. Unfortunately, we don't know which of them applies to a large percentage of our records.

For this exercise it makes perfect sense to create an "Unknown" category and assign it to the cases where the cabin is, well, not known:


In [11]:
# Assign "Unknown" to records where "Cabin" is empty
clean_dataset["Cabin"] = clean_dataset["Cabin"].fillna("Unknown")

# Let's see what the clean dataset looks like now
print(clean_dataset.isnull().sum().to_frame().rename(columns={0:'Empty Cells'}))

# Save the clean dataset for future use
clean_dataset.to_csv("Data/Cleaned_Titanic.csv")



             Empty Cells
PassengerId            0
Survived               0
Pclass                 0
Name                   0
Sex                    0
Age                    0
SibSp                  0
Parch                  0
Ticket                 0
Fare                   0
Cabin                  0
Embarked               0


That's it! No more missing data!

We only lost two records (where "Embarked" was empty). Although we had to make some approximations to fill the gaps in the "Age" and "Cabin" columns, and those will certainly influence the performance of our model, the dataset is ready to be used.

### A Note on Categorical Data

Most Machine Learning algorithms require that categorical data be converted into numbers before it is processed.

One way to accomplish that is a technique called "one-hot" encoding, which transforms category names into a numeric vector.

We will explain how that works in the next Unit, but for the sake of simplicity we will only use numerical data when building the models below.
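
As a brief preview only, one-hot encoding can be sketched with `pd.get_dummies` on a made-up column. This is one of several ways to do it and is not used in the rest of this exercise:

```python
import pandas as pd

# Made-up ports of embarkation
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One new 0/1 column per category; each row has exactly one 1
one_hot = pd.get_dummies(toy, columns=["Embarked"], dtype=int)
print(one_hot)
```

Each category becomes its own column (`Embarked_C`, `Embarked_Q`, `Embarked_S`), so no artificial ordering is implied between the categories.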

## Building a Model with Cleaned Data

The model we will build predicts whether a person would survive or perish in the sinking of the Titanic, so it can have only two possible outcomes: `0` if the person perished, `1` if the person survived.

We can use an algorithm called "Logistic Regression" to make this kind of prediction with the dataset we have:


In [12]:
import sklearn.model_selection as model_selection
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Let's remove categorical data and keep only the features that we deem relevant for this model
clean_dataset = clean_dataset.drop(["PassengerId","Name","Sex","Ticket","Cabin","Embarked", "Pclass"], axis=1)

# X is our feature matrix
X = clean_dataset[["Age", "SibSp", "Parch", "Fare"]]

# y is the label vector 
y = clean_dataset["Survived"]

# Create Train and test sets with a 70/30 split
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, train_size=0.70,test_size=0.30, random_state=101)

# train the model
model = LogisticRegression(random_state=0).fit(X_train, y_train)

# score is the mean accuracy on the data passed in; here that is the training set
score = model.score(X_train, y_train)

# calculate loss
probabilities = model.predict_proba(X_test)
loss = metrics.log_loss(y_test, probabilities)

# save results for comparison
clean_score = score
clean_loss = loss
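
To make the `loss` value less of a black box, the sketch below computes the log loss by hand for four made-up predictions and compares it with `metrics.log_loss`. The labels and probabilities are invented for illustration and have nothing to do with the model above:

```python
import numpy as np
from sklearn import metrics

# Made-up true labels and predicted probabilities of survival (class 1)
y_true = np.array([0, 1, 1, 0])
p_survived = np.array([0.1, 0.8, 0.6, 0.3])

# Log loss: average negative log of the probability given to the true class
manual_loss = -np.mean(
    y_true * np.log(p_survived) + (1 - y_true) * np.log(1 - p_survived)
)

# For binary labels, a 1-D array is read as the probability of the positive class
sklearn_loss = metrics.log_loss(y_true, p_survived)

print(manual_loss, sklearn_loss)
```

Confident predictions that turn out wrong are penalised heavily, which is why a smaller loss indicates better calibrated probabilities.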




## Building a Model with "Unclean" Data

Let's repeat the process with the original dataset:

In [13]:
# Let's remove categorical data and keep only the features that we deem relevant for this model
dataset = dataset.drop(["PassengerId","Name","Sex","Ticket","Cabin","Embarked", "Pclass"], axis=1)

# Fill empty Age cells with 0, otherwise Logistic Regression will not run.
dataset["Age"] = dataset["Age"].fillna(0)

# X is our feature matrix
X = dataset[["Age", "SibSp", "Parch", "Fare"]]

# y is the label vector 
y = dataset["Survived"]

# Create Train and test sets with a 70/30 split
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, train_size=0.70,test_size=0.30, random_state=101)

# train the model
model = LogisticRegression(random_state=0).fit(X_train, y_train)

# score is the mean accuracy on the data passed in; here that is the training set
score = model.score(X_train, y_train)

# calculate loss
probabilities = model.predict_proba(X_test)
loss = metrics.log_loss(y_test, probabilities)

# save results for comparison
unclean_score = score
unclean_loss = loss

## Comparing Models

We gathered `score` and `loss` metrics for each model and can use these metrics to make an informed comparison:

In [14]:
# Use a dataframe to create a comparison table of metrics
l = [["Clean", clean_score, clean_loss],
    ["Unclean", unclean_score, unclean_loss]]

pd.DataFrame(l, columns=["Dataset", "Score", "Loss"])

| | Dataset | Score | Loss |
|---|---|---|---|
| 0 | Clean | 0.696141 | 0.60963 |
| 1 | Unclean | 0.686998 | 0.645384 |


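The **clean** model yields both a better score (0.696 versus 0.687) and a smaller loss (0.610 versus 0.645), as expected. Note that `score` was computed on the training set, while `loss` was computed on the test set.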
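
Accuracy measured on data the model has already seen can be optimistic, so it is often reported for held-out data as well. The sketch below shows both numbers side by side on synthetic data generated with `make_classification`; the data, sizes and variable names are assumptions for illustration, not results from the Titanic models:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data with 4 numeric features
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.70, random_state=101
)

model = LogisticRegression(random_state=0).fit(X_train, y_train)

# Accuracy on data the model was fitted on versus data it has never seen
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f"train accuracy: {train_accuracy:.3f}, test accuracy: {test_accuracy:.3f}")
```

Swapping `X_train, y_train` for `X_test, y_test` in the `score` calls of the cells above would give the held-out accuracy for the Titanic models in the same way.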

## Summary

In this Unit you've learned about different types of data and the importance of dealing with missing values before training a model.

Recall that we discussed three options: deleting rows with missing data, replacing `null` values in a numeric column with the `mean` of its valid values, and creating a new category for unknown categorical data. There are other, more sophisticated ways to make these corrections.

Finally, we built models with the original dataset and with a "cleaned" dataset and compared them, with the latter model showing the best results.
