# Exercise: Titanic Dataset - Types of Data and Handling Missing Data

Briefly explain types of data and why we have to handle missing values.

## Preparing data

Reload the Titanic dataset:


In [8]:
import pandas as pd

# Load data from our dataset file into a pandas dataframe
dataset = pd.read_csv('Data/titanic.csv', index_col=False, sep=",", header=0)

# Let's take a look at the data
dataset.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Discuss different types of data above; show code to display data type (ex: dataset.info()).
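
A minimal sketch of inspecting column types. To stay self-contained, it rebuilds some columns of the five rows shown above as a small frame named `sample` (an illustrative name that is not part of the notebook). On the real data, the same calls would be made on `dataset`.

```python
import pandas as pd

# Illustrative frame built from the five rows printed by dataset.head() above
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Cabin": [None, "C85", None, "C123", None],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# dtypes lists the storage type of each column
print(sample.dtypes)

# info() adds the non-null count per column, which also hints at missing data
sample.info()

# Split the columns into numerical and non-numerical groups
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()
print("Numerical:", numeric_cols)
print("Non-numerical:", text_cols)
```

The storage type alone does not settle the type of data: "Survived" and "Pclass" are stored as integers but are categorical in meaning.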





## Handling Missing Data

- Recall what data is missing
- Show 3 different ways to deal with missing data (one for each column) - explain that there are many other ways
    - Embarked: Delete data
    - Cabin: create new category
    - Age: Use mean
- Save cleaned data to a "clean" dataset

In [9]:
# Calculate the number of empty cells in each column
# and store it in a new dataframe
missing_data = dataset.isnull().sum().to_frame()

# Rename column holding the sums
missing_data = missing_data.rename(columns={0:'Empty Cells'})

# Print the results
print(missing_data)

             Empty Cells
PassengerId            0
Survived               0
Pclass                 0
Name                   0
Sex                    0
Age                  177
SibSp                  0
Parch                  0
Ticket                 0
Fare                   0
Cabin                687
Embarked               2


Recalling the last Unit, we have three fields with missing data: "Age", "Cabin" and "Embarked".
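
To put these counts in proportion, the sketch below converts them to percentages. The counts are copied from the output above, and the total of 891 rows is an assumption inferred from the 889 rows that remain after two are deleted later in this exercise.

```python
import pandas as pd

# Counts copied from the printed output above
missing_counts = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})

# Assumed total: 889 rows remain after 2 rows are deleted later on
total_rows = 891

# Share of rows with a missing value, as a percentage
missing_pct = (missing_counts / total_rows * 100).round(1)
print(missing_pct)

# On the real frame, dataset.isnull().mean() * 100 computes the same
# percentages directly for every column.
```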

There are **many** ways to deal with this, each with pros and cons, but for now we are going to work with the simpler options.

### Option 1: Delete rows with missing data

The "Embarked" column is a good candidate for this option because only two rows are missing a value, so deleting them costs very little information. In contrast, applying this option to the "Cabin" column would remove 687 rows, which is most of the dataset.

In [10]:
# Create a "clean" dataset where we cumulatively fix missing values
# Start by removing rows ONLY where "Embarked" has no values
clean_dataset = dataset.drop(dataset[dataset["Embarked"].isnull()].index)
# Renumber the row index so it runs from 0 without gaps
clean_dataset = clean_dataset.reset_index(drop=True)

# How many rows and columns in the new dataset?
print(clean_dataset.shape)


(889, 12)


As expected, `clean_dataset` has 889 rows, since we deleted 2 from the original dataset.
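
The same row deletion can also be written with `dropna` and its `subset` argument. The sketch below uses a toy frame with made-up values (the name `toy` is illustrative) to show that only rows missing "Embarked" are removed, while rows missing "Age" are kept.

```python
import pandas as pd

# Toy frame with illustrative values: one missing "Age", one missing "Embarked"
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Drop rows only where "Embarked" is missing, then renumber the index
toy_clean = toy.dropna(subset=["Embarked"]).reset_index(drop=True)

print(toy_clean.shape)
print(toy_clean)
```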

### Option 2: Replace empty values with the mean or median for that column

This option suits the "Age" field: it holds **numerical** data and fewer than 20% of its rows are empty, so we should have enough data to make a decent estimate of the mean age:


In [11]:
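# Calculate the mean value for the Age column
mean_age = clean_dataset["Age"].mean()  # 29.6420...

# Replace empty values in "Age" with the mean calculated above.
# Assigning the result back avoids an in-place fill on a column selection,
# which newer pandas versions warn about and may not apply.
clean_dataset["Age"] = clean_dataset["Age"].fillna(mean_age)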

# Let's see what the clean dataset looks like now
print(clean_dataset.isnull().sum().to_frame().rename(columns={0:'Empty Cells'}))

             Empty Cells
PassengerId            0
Survived               0
Pclass                 0
Name                   0
Sex                    0
Age                    0
SibSp                  0
Parch                  0
Ticket                 0
Fare                   0
Cabin                687
Embarked               0


As you can see above, the "Age" field has no empty cells anymore.
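
The median is the other choice named in this option's heading. The sketch below uses a toy series with illustrative ages to show how one large value pulls the mean upward while the median stays closer to the bulk of the data.

```python
import pandas as pd

# Toy ages (illustrative values) with one missing entry and one large value
ages = pd.Series([22.0, 38.0, None, 35.0, 80.0], name="Age")

# mean() and median() skip missing values by default
mean_filled = ages.fillna(ages.mean())      # fills with 43.75
median_filled = ages.fillna(ages.median())  # fills with 36.5

print(mean_filled.tolist())
print(median_filled.tolist())
```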

### Option 3: Assign a new category to unknown categorical data

The "Cabin" field is a categorical field, meaning that there is a finite number of possible cabins on the Titanic. Unfortunately, we don't know which of these applies to a large percentage of our records.

For this exercise it makes perfect sense to create an "Unknown" category and assign it to the cases where the cabin is, well, unknown:


In [12]:
# Assign "Unknown" to records where "Cabin" is empty
# (assign the result back instead of filling in place on a column selection)
clean_dataset["Cabin"] = clean_dataset["Cabin"].fillna("Unknown")

# Let's see what the clean dataset looks like now
print(clean_dataset.isnull().sum().to_frame().rename(columns={0:'Empty Cells'}))



             Empty Cells
PassengerId            0
Survived               0
Pclass                 0
Name                   0
Sex                    0
Age                    0
SibSp                  0
Parch                  0
Ticket                 0
Fare                   0
Cabin                  0
Embarked               0
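
As a quick check of how the new category shows up, the sketch below fills a small series holding the five "Cabin" values from the `dataset.head()` output and counts each category. The name `cabins` is illustrative.

```python
import pandas as pd

# "Cabin" values from the five rows printed by dataset.head()
cabins = pd.Series([None, "C85", None, "C123", None], name="Cabin")

# Replace missing entries with the new "Unknown" category
cabins_filled = cabins.fillna("Unknown")

# Count how many records fall into each category
counts = cabins_filled.value_counts()
print(counts)
```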


That's it! No more missing data!

We lost only two records (those where "Embarked" was empty). The approximations used to fill the gaps in the "Age" and "Cabin" columns will certainly influence the performance of our model, but the dataset is now ready to be used.
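
A minimal sketch of saving the cleaned data, as planned in the outline above. The file name `titanic_clean.csv` and the toy frame are assumptions made for illustration. A temporary directory keeps the example self-contained; in the notebook you would write `clean_dataset` to a path of your choice.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for clean_dataset (illustrative values)
toy_clean = pd.DataFrame({
    "Age": [22.0, 29.5, 26.0],
    "Cabin": ["Unknown", "C85", "Unknown"],
    "Embarked": ["S", "C", "S"],
})

# Confirm that no missing values remain before saving
has_missing = bool(toy_clean.isnull().any().any())
print("Missing values left:", has_missing)

# Hypothetical file name, written to a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "titanic_clean.csv")
    # index=False avoids writing the row index as an extra column
    toy_clean.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
```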

## Building a Model with Cleaned Data

- Split model
- Train
- Calculate loss



## Building a Model with Uncleaned Data

- Split model
- Train
- Calculate loss
- Compare losses

## Summary
.....
Recall the types of data and the transformations used to handle missing data, and draw a conclusion from the model comparison.
