# Exercise: Titanic Dataset - One-Hot Vectors

In this Unit we will explain how to use One-Hot Vectors to represent categorical values so that they can be processed by Machine Learning algorithms.

We will use the cleaned data from Unit 5 as the starting point, encode the categorical columns and build a new model with the additional data.

Then we will compare the results with the previous models and see how much they improved.


## Preparing data

This time we start by using the "cleaned" dataset we saved in Unit 5:


In [893]:
import pandas as pd

# Load data from our dataset file into a pandas dataframe
dataset = pd.read_csv('Data/Cleaned_Titanic.csv', index_col=False, sep=",",header=0)

# Quickly check the data
dataset.head()


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | Unknown | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | Unknown | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | Unknown | S |


## One-Hot Encoding
In the previous Unit we had to disregard all **Categorical** data because it wasn't ready to be processed by our **Logistic Regression** algorithm, since it requires all input values to be numerical.

But that is a lot of valuable data that we can easily integrate into our model, if we can encode it (in other words, "represent it") in a format that the algorithm understands.

One way to address that is to use a technique called "One-Hot Encoding", where - for a categorical feature - we create a new column for each possible category option and set the value to "1" **only** where that category describes the entry (leaving it as zero otherwise).

Let's try to visualize it:


In [894]:
# Get all possible categories for the "Embarked" column
print(f"Possible values for Embarked: {dataset['Embarked'].unique()}")

Possible values for Embarked: ['S' 'C' 'Q']


We have three possible values for the port where the passenger embarked: 

- S = Southampton
- C = Cherbourg
- Q = Queenstown


To One-Hot encode the dataset:

1. For each possible value, create a column.

2. Assign "1" **only** to the column corresponding to the entry's category:


| PassengerId | Name                                              | Embarked | Embarked_S | Embarked_Q | Embarked_C |
|-------------|---------------------------------------------------|:--------:|:----------:|:----------:|:----------:|
| 1           | Braund, Mr. Owen Harris                           |     S    |      1     |      0     |      0     |
| 2           | Cumings, Mrs. John Bradley (Florence Briggs Th... |     C    |      0     |      0     |      1     |

Above, Mr. Braund embarked at port "S", therefore only that column is marked as "1". Mrs. Cumings embarked at port "C", so only `Embarked_C` is marked as "1".
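
The two steps above can also be written out by hand. The following is a minimal sketch on a hypothetical two-row frame (`toy` is an assumption made for illustration, not the real dataset), showing that One-Hot encoding is nothing more than one equality test per category:

```python
import pandas as pd

# Hypothetical two-row frame used only for illustration
toy = pd.DataFrame({"PassengerId": [1, 2], "Embarked": ["S", "C"]})

# Step 1: create one column per possible value
# Step 2: mark it with 1 only where the entry has that value
for port in ["S", "Q", "C"]:
    toy[f"Embarked_{port}"] = (toy["Embarked"] == port).astype(int)

print(toy)
```

In practice a library helper such as `pd.get_dummies` does this for every category in one call.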

We can use One-Hot encoding for the "Sex" category as well.





In [895]:
# Get all possible categories for the "Sex" column
print(f"Possible values for Sex: {dataset['Sex'].unique()}")

Possible values for Sex: ['male' 'female']


| PassengerId | Name                                              | Sex    | Sex_male | Sex_female |
|-------------|---------------------------------------------------|--------|:--------:|:----------:|
| 1           | Braund, Mr. Owen Harris                           | male   |     1    |      0     |
| 2           | Cumings, Mrs. John Bradley (Florence Briggs Th... | female |     0    |      1     |

"Passenger class" is represented numerically in this dataset, and although we could possibly use it "as is", we will treat it as categorical data and "One-Hot" encode it as well:





In [896]:
# Get all possible categories for the "Pclass" column
print(f"Possible values for Pclass: {dataset['Pclass'].unique()}")

Possible values for Pclass: [3 1 2]


| PassengerId 	| Name                                              	| Pclass 	| Pclass_1 	| Pclass_2 	| Pclass_3 	|
|-------------	|---------------------------------------------------	|:------:	|:--------:	|:--------:	|:--------:	|
| 1           	| Braund, Mr. Owen Harris                           	|    3   	|     0    	|     0    	|     1    	|
| 2           	| Cumings, Mrs. John Bradley (Florence Briggs Th... 	|    1   	|     1    	|     0    	|     0    	|
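
Because `Pclass` is stored as integers, pandas does not encode it unless we ask for it: by default `pd.get_dummies` only converts string-like or categorical columns, so numeric columns must be named explicitly through `columns=`. A small sketch on a hypothetical three-row frame (`toy_classes` is made up for illustration):

```python
import pandas as pd

# Hypothetical frame; Pclass is an integer column, as in the Titanic data
toy_classes = pd.DataFrame({"PassengerId": [1, 2, 3], "Pclass": [3, 1, 2]})

# Naming the column explicitly forces pandas to treat it as categorical
encoded_classes = pd.get_dummies(toy_classes, columns=["Pclass"], dtype=int)

print(encoded_classes)
```

Each passenger ends up with a "1" in exactly one of `Pclass_1`, `Pclass_2` or `Pclass_3`, so the model no longer assumes that class 3 is "three times" class 1.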




In [897]:
# Just out of curiosity, how many options do we have for the "Cabin" category?
print(f"Possible options for Cabin: {dataset['Cabin'].unique().shape}")

Possible options for Cabin: (147,)


Using numerical `vectors` to describe categories allows us to use that information in most Machine Learning algorithms.
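
To make the word "vector" concrete, here is a short sketch (again on hypothetical data, with `toy_ports` as an assumed name) that turns a categorical column into a NumPy array where every row is a One-Hot vector, i.e. it contains a single 1:

```python
import pandas as pd

# Hypothetical column of embarkation ports
toy_ports = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# get_dummies orders the new columns alphabetically: C, Q, S
encoded_ports = pd.get_dummies(toy_ports, columns=["Embarked"], dtype=int)

# Each row of this array is the One-Hot vector for one passenger
vectors = encoded_ports.to_numpy()
print(vectors)

# Sanity check: every vector has exactly one "hot" position
print((vectors.sum(axis=1) == 1).all())
```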

## Building a new model

Let's build a new model, using the cleaned dataset from the previous Unit (where we addressed missing values), but making sure we include the categorical columns this time.

We need to one-hot encode the following columns:

- Pclass
- Sex
- Cabin
- Embarked

In [898]:
import sklearn.model_selection as model_selection
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Let's remove some fields that are not needed right now
dataset = dataset.drop(["PassengerId","Name","Ticket"], axis=1)

# Generate One-Hot encodings for the categorical columns
complete_dataset = pd.get_dummies(dataset, columns=["Pclass", "Sex", "Cabin", "Embarked"], drop_first=False)

# Check resulting dataset
complete_dataset.head()


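| | Survived | Age | SibSp | Parch | Fare | Pclass_1 | Pclass_2 | Pclass_3 | Sex_female | Sex_male | ... | Cabin_F2 | Cabin_F33 | Cabin_F38 | Cabin_F4 | Cabin_G6 | Cabin_T | Cabin_Unknown | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1 | 26.0 | 0 | 0 | 7.925 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 35.0 | 0 | 0 | 8.05 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |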


In [899]:
dataset.shape

(889, 9)

After dropping the unused fields, `dataset` has 9 columns. The One-Hot encoded `complete_dataset` has 160 columns: the 5 columns we left untouched, plus 3 for "Pclass", 2 for "Sex", 147 for "Cabin" and 3 for "Embarked".
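
The width of an encoded dataset can be predicted before encoding it: every untouched column is kept, and every categorical column is replaced by as many columns as it has distinct values. The sketch below checks this on a hypothetical four-row frame (`toy_passengers` and its values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical miniature of the Titanic data
toy_passengers = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
})
categorical = ["Sex", "Embarked"]

# Untouched columns + one new column per distinct category value
expected_width = (toy_passengers.shape[1] - len(categorical)) + sum(
    toy_passengers[col].nunique() for col in categorical
)

encoded_passengers = pd.get_dummies(toy_passengers, columns=categorical, dtype=int)
print(expected_width, encoded_passengers.shape[1])
```

A count like this is a quick way to spot high-cardinality fields (such as "Cabin") that will dominate the feature matrix.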

We can now train and evaluate our new model using the "complete" dataset:


In [900]:
# X is our feature matrix
X = complete_dataset.drop(["Survived"], axis=1)

# y is the label vector 
y = complete_dataset["Survived"]

# Create Train and test sets with a 70/30 split
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, train_size=0.70,test_size=0.30, random_state=101)

# Train the model (increase the maximum number of iterations for the new dataset)
model = LogisticRegression(random_state=0, max_iter=2000).fit(X_train, y_train)

# score is the mean accuracy; note that here it is computed on the training data
# (use model.score(X_test, y_test) to measure accuracy on the held-out test set)
score = model.score(X_train, y_train)

# calculate loss
probabilities = model.predict_proba(X_test)
loss = metrics.log_loss(y_test, probabilities)

# save results for comparison
complete_score = score
complete_loss = loss
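
To give an intuition for the loss value computed above, here is a minimal pure-Python sketch of binary log loss, the average negative log-likelihood of the true labels. `binary_log_loss` is a hypothetical helper written for illustration, not the scikit-learn implementation; it takes the predicted probability of the positive class (the second column returned by `predict_proba`):

```python
import math

def binary_log_loss(y_true, p_positive, eps=1e-15):
    """Average negative log-likelihood for 0/1 labels (illustrative helper)."""
    total = 0.0
    for y, p in zip(y_true, p_positive):
        # Clip probabilities so that log(0) can never occur
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Same labels, but the first model is more confident in the right answers
confident = binary_log_loss([1, 0, 1], [0.9, 0.1, 0.8])
hesitant = binary_log_loss([1, 0, 1], [0.6, 0.4, 0.6])
print(confident, hesitant)
```

A lower loss means the model assigns higher probability to the correct outcome, which is why loss can improve even when accuracy stays the same.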


## Comparing Models

We can compare the `score` and `loss` for this model with the metrics we gathered in Unit 5:

In [901]:
# Use a dataframe to create a comparison table of metrics
# Copy metrics from previous Unit
l = [["Numeric Features Only (original)", 0.686998, 0.645384],
    ["Numeric Features Only (cleaned)", 0.696141, 0.609630],
    ["Numeric and Categorical (present Unit)", complete_score, complete_loss]]

pd.DataFrame(l, columns=["Dataset", "Score", "Loss"])

| | Dataset | Score | Loss |
|---|---|---|---|
| 0 | Numeric Features Only (original) | 0.686998 | 0.645384 |
| 1 | Numeric Features Only (cleaned) | 0.696141 | 0.60963 |
| 2 | Numeric and Categorical (present Unit) | 0.826367 | 0.425014 |



## Summary

In this Unit you learned how to use One-Hot encoding to represent categorical data numerically.

As the metrics comparison above shows, adding that data improved both metrics compared with the cleaned numeric-only model: **Score** rose from 0.696 to 0.826 and **Loss** fell from 0.610 to 0.425, while requiring very little change in our code.