# Exercise: Titanic Dataset - One-Hot Vectors

In this Unit we explain how to use One-Hot Vectors to represent categorical values, so they can be processed by Machine Learning algorithms.

We will use the cleaned data from Unit 5 as the starting point, encode the categorical columns, and build a new model with the additional data.

Then we compare it to the previous models and evaluate the results.


## Preparing data

This time we start by using the "cleaned" dataset we saved in Unit 5:


In [1]:
import pandas as pd

# Load data from our dataset file into a pandas dataframe
dataset = pd.read_csv('Data/Cleaned_Titanic.csv', index_col=False, sep=",",header=0)

# Quickly check the data
dataset.head()


Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,Unknown,S
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,Unknown,S
3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,Unknown,S


## One-Hot Encoding
In the previous Unit we had to disregard all **Categorical** data because it wasn't ready to be processed by our Logistic Regression algorithm, which requires all input values to be numerical.

But that is a lot of valuable data that we can easily integrate into our model, if we can encode it (i.e. represent it) in a format that the algorithm understands.

One way to address that is to use a technique called _One-Hot Encoding_, where - for a categorical feature - we create a new column for each possible category option and set the value to `1` **only** where that category describes the entry (leaving it as zero otherwise).

Let's try to visualize it:


In [2]:
# Get all possible categories for the "Embarked" column
print(f"Possible values for Embarked: {dataset['Embarked'].unique()}")

Possible values for Embarked: ['S' 'C' 'Q']


We have three possible values for the port where the passenger embarked: 

- S = Southampton
- C = Cherbourg
- Q = Queenstown


To One-Hot encode the dataset:

1. For each possible value, create a column.

2. Assign "1" **only** to the column corresponding to the entry's category:


| PassengerId 	| Name                                              	| Embarked 	| Embarked_S 	| Embarked_Q 	| Embarked_C 	|
|-------------	|---------------------------------------------------	|:--------:	|:----------:	|:----------:	|:----------:	|
| 1           	| Braund, Mr. Owen Harris                           	|     S    	|      1     	|      0     	|      0     	|
| 2           	| Cumings, Mrs. John Bradley (Florence Briggs Th... 	|     C    	|      0     	|      0     	|      1     	|

Using the first row as an example, Mr. Braund embarked from port `S`, so only the `Embarked_S` column is set to `1`.
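The two steps above can be sketched by hand. This is only an illustration on a toy dataframe (the name `toy` is made up here), using the `Embarked` values of the first two passengers in the loaded data (`S` and `C`):

```python
import pandas as pd

# Toy data: the Embarked values of the first two passengers
toy = pd.DataFrame({
    "PassengerId": [1, 2],
    "Embarked": ["S", "C"],
})

# Step 1: create one column per possible value
# Step 2: put 1 where the row's category matches, 0 otherwise
for port in ["S", "Q", "C"]:
    toy[f"Embarked_{port}"] = (toy["Embarked"] == port).astype(int)

print(toy)
```

Each row ends up with exactly one `1` across the three new columns.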

We don't need to do this manually. Pandas, for example, offers us a way to do this automatically. Let's practice using this method now with the `Sex` category:

In [3]:
# Print all possible categories for the "Sex" column
print(f"Possible values for Sex: {dataset['Sex'].unique()}")

# Use Pandas to One-Hot encode the Sex category
pd.get_dummies(dataset, columns=["Sex"], drop_first=False).head()


Possible values for Sex: ['male' 'female']


Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_female,Sex_male
0,0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,Unknown,S,0,1
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,C,1,0
2,2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,Unknown,S,1,0
3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,S,1,0
4,4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,Unknown,S,0,1


Notice how each passenger now has a 1 in either the `Sex_female` or the `Sex_male` column, rather than a `Sex` value of `male` or `female`.
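Depending on the installed pandas version, the new columns may display as `True`/`False` instead of `1`/`0` (pandas 2.0 and later default to boolean columns here). A minimal sketch on a toy dataframe (not the Titanic data) that requests integers explicitly through the `dtype` parameter:

```python
import pandas as pd

# Toy data standing in for the Sex column
toy = pd.DataFrame({"Sex": ["male", "female", "female"]})

# dtype=int asks for 0/1 integers regardless of the version's default
encoded = pd.get_dummies(toy, columns=["Sex"], drop_first=False, dtype=int)
print(encoded)
```

Either representation carries the same information; the integer form simply matches the tables shown in this unit.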

### Numeric Categories

Some categories are stored as numbers but might be better represented as categorical. For example, the passenger's class is represented numerically in this dataset as 1, 2, or 3, but we may prefer to treat it as categorical data and One-Hot encode it as well:

In [4]:
# Get all possible categories for the "Pclass" column
print(f"Possible values for Pclass: {dataset['Pclass'].unique()}")

# Use Pandas to One-Hot encode the Pclass category
pd.get_dummies(dataset, columns=["Pclass"], drop_first=False).head()


Possible values for Pclass: [3 1 2]


Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Pclass_1,Pclass_2,Pclass_3
0,0,1,0,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,Unknown,S,0,0,1
1,1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1,0,0
2,2,3,1,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,Unknown,S,0,0,1
3,3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1,0,0
4,4,5,0,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,Unknown,S,0,0,1
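Naming the column in `columns` matters here: when `columns` is omitted, `pd.get_dummies` only encodes text and category columns and leaves numeric ones such as `Pclass` untouched. A small sketch on toy data (the dataframe below is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
})

# Without `columns`, only the text column is encoded; Pclass stays numeric
default_encoded = pd.get_dummies(toy)
print(list(default_encoded.columns))

# Naming Pclass explicitly encodes it as a category (Sex is then left as-is)
explicit_encoded = pd.get_dummies(toy, columns=["Pclass"])
print(list(explicit_encoded.columns))
```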


## Building a new model

Let's build a new model making sure we include the categorical columns this time.

We need to one-hot encode the following columns:

- Pclass
- Sex
- Cabin
- Embarked

In [5]:
import sklearn.model_selection as model_selection
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Generate One-Hot encodings for the categorical columns
complete_dataset = pd.get_dummies(dataset, columns=["Pclass", "Sex", "Cabin", "Embarked"], drop_first=False)

# Check resulting dataset
complete_dataset.head()


Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Name,Age,SibSp,Parch,Ticket,Fare,Pclass_1,...,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Cabin_Unknown,Embarked_C,Embarked_Q,Embarked_S
0,0,1,0,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,0,...,0,0,0,0,0,0,1,0,0,1
1,1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,1,...,0,0,0,0,0,0,0,1,0,0
2,2,3,1,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,0,...,0,0,0,0,0,0,1,0,0,1
3,3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,1,...,0,0,0,0,0,0,0,0,0,1
4,4,5,0,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,0,...,0,0,0,0,0,0,1,0,0,1


The One-Hot encoded dataset now has 162 columns (most of which come from the different categories for `Cabin`).
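One way to see where the columns come from is to count the encoded columns per original feature. The sketch below does this on a four-row toy dataframe; the same loop could be run against `complete_dataset`, though its output is not reproduced here:

```python
import pandas as pd

# Toy data: Cabin and Embarked values of the first four passengers
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Cabin": ["Unknown", "C85", "Unknown", "C123"],
    "Embarked": ["S", "C", "S", "S"],
})
encoded = pd.get_dummies(toy, columns=["Cabin", "Embarked"])

# Count how many encoded columns each original feature produced
counts = {}
for prefix in ["Cabin", "Embarked"]:
    counts[prefix] = sum(col.startswith(prefix + "_") for col in encoded.columns)
    print(f"{prefix}: {counts[prefix]} columns")

print(f"Total columns: {encoded.shape[1]}")
```

A feature with many distinct values, such as `Cabin`, produces one column per value, which is why it dominates the column count.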

We can now train and evaluate our new model using the "complete" dataset:


In [6]:
from module_0c_logistic_regression import prepare_test_set, train_logistic_regression

# Make a list of our features
# This will be all features except those listed explicitly below
features = list(complete_dataset.columns) 
features.remove("PassengerId")
features.remove("Survived")
features.remove("Name")
features.remove("Ticket")

# Extract a test dataset
# This is performed using code not shown in this notebook,
# as it is not necessary to understand the details of
# this step to complete this module
prepare_test_set(features)

# Train a model with the clean data and get its 
# * score (accuracy on the training dataset), and
# * loss (performance on the test data)
# This is also performed using code that is not shown
# in this notebook, as it is not necessary to understand
# its details in order to complete this module
complete_score, complete_loss = train_logistic_regression(complete_dataset)
print(f"Score: {complete_score}, Loss: {complete_loss}")

Score: 0.8325925925925926, Loss: 0.47384084907191887
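The helper functions used above are not shown, so their internals are unknown. For readers curious about what such a step could look like, here is a rough sketch using scikit-learn. It is not the helpers' implementation: the synthetic data, the 70/30 split, and the use of log loss as the loss metric are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Titanic dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.uniform(1, 80, size=200),
    "Sex": rng.choice(["male", "female"], size=200),
})
# Hypothetical label loosely tied to Sex so the model has something to learn
toy["Survived"] = ((toy["Sex"] == "female") ^ (rng.random(200) < 0.2)).astype(int)

# One-Hot encode the categorical column, then separate features and label
encoded = pd.get_dummies(toy, columns=["Sex"], dtype=int)
X = encoded.drop(columns=["Survived"])
y = encoded["Survived"]

# Assumed 70/30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Accuracy on the training split
score = model.score(X_train, y_train)
# Log loss on the held-out test split (assumed loss metric)
loss = log_loss(y_test, model.predict_proba(X_test), labels=[0, 1])
print(f"Score: {score}, Loss: {loss}")
```

The numbers this prints describe the synthetic data only and are not comparable to the Titanic results above.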


## Comparing Models

We can compare the `score` and `loss` for this model with the metrics we gathered in Unit 5:

In [7]:
# Use a dataframe to create a comparison table of metrics
# Copy metrics from previous Unit
l = [["Numeric Features Only (original)", 0.698, 0.681],
    ["Numeric Features Only (cleaned)", 0.710, 0.673],
    ["Numeric and Categorical (present Unit)", complete_score, complete_loss]]

pd.DataFrame(l, columns=["Dataset", "Score (High is better)", "Loss (Low is better)"])

Unnamed: 0,Dataset,Score (High is better),Loss (Low is better)
0,Numeric Features Only (original),0.698,0.681
1,Numeric Features Only (cleaned),0.71,0.673
2,Numeric and Categorical (present Unit),0.832593,0.473841



## Summary

In this unit you learned how to use One-Hot encoding to address categorical data.

As the comparison above shows, adding that data improved both metrics considerably: compared with the cleaned numeric-only model, the score rose from 0.710 to about 0.833 and the loss fell from 0.673 to about 0.474, with very little change to our code.