# Exercise: Titanic Dataset - One-Hot Vectors

- Load cleaned dataset
- Explain 1-hot vectors
- 1-hot a categorical value
- Build a model, compare to previous results


## Preparing data

This time we start by using the "cleaned" dataset we saved in Unit 5:


In [839]:
import pandas as pd

# Load data from our dataset file into a pandas dataframe
dataset = pd.read_csv('Data/Cleaned_Titanic.csv', index_col=False, sep=",",header=0)

# Make sure we don't have missing values
dataset.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  889 non-null    int64  
 1   Survived     889 non-null    int64  
 2   Pclass       889 non-null    int64  
 3   Name         889 non-null    object 
 4   Sex          889 non-null    object 
 5   Age          889 non-null    float64
 6   SibSp        889 non-null    int64  
 7   Parch        889 non-null    int64  
 8   Ticket       889 non-null    object 
 9   Fare         889 non-null    float64
 10  Cabin        889 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.5+ KB


The dataset has 889 rows and 12 columns, and every column has 889 non-null (not empty) values. We are good to go!

## One-Hot Encoding
In the previous Unit we had to disregard all **Categorical** data because it wasn't ready to be processed by our `Logistic Regression` algorithm, as it requires all input values to be numerical.

But that is a lot of valuable data that we can easily integrate into our model, if we can represent it (or encode it) in a format that the algorithm understands.

One way to address that is to use a technique called "One-Hot Encoding", where we create a new column for each possible category and set the value to "1" **only** where that category describes the entry (leaving it as zero otherwise).

Let's try to visualize it:




In [840]:
# Get all possible categories for the "Embarked" column
print(f"Possible values for Embarked: {dataset['Embarked'].unique()}")

Possible values for Embarked: ['S' 'C' 'Q']


We have three possible values for the port where the passenger embarked:

- S = Southampton
- C = Cherbourg
- Q = Queenstown


The code above conveniently arranged those options as a three-valued "ports vector":


`['S' 'C' 'Q']`

To One-Hot encode the dataset:

1. For each possible value, create a column.

2. Assign "1" **only** to the column corresponding to the entry's category:


| PassengerId 	| Name                                              	| Embarked 	| Embarked_S 	| Embarked_Q 	| Embarked_C 	|
|-------------	|---------------------------------------------------	|:--------:	|:----------:	|:----------:	|:----------:	|
| 1           	| Braund, Mr. Owen Harris                           	|     S    	|      1     	|      0     	|      0     	|
| 2           	| Cumings, Mrs. John Bradley (Florence Briggs Th... 	|     C    	|      0     	|      0     	|      1     	|

Above, Mr. Braund embarked at port "S", so only the `Embarked_S` column is marked as "1".
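The two steps above can be sketched by hand on a few made-up rows. This is only an illustration: the `toy` dataframe is an assumption and is not read from the Titanic file.

```python
import pandas as pd

# Toy rows for illustration only (not the real Titanic file)
toy = pd.DataFrame({"PassengerId": [1, 2, 3], "Embarked": ["S", "C", "Q"]})

# Step 1: create one column per possible value
# Step 2: put 1 where the row's category matches that column, 0 elsewhere
for port in toy["Embarked"].unique():
    toy[f"Embarked_{port}"] = (toy["Embarked"] == port).astype(int)

print(toy)
```

Each row ends up with exactly one "1" across the three new columns. Later in this exercise `pd.get_dummies` does the same job in a single call.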

We can use One-Hot encoding for the "Sex" category as well.





In [841]:
# Get all possible categories for the "Sex" column
print(f"Possible values for Sex: {dataset['Sex'].unique()}")

Possible values for Sex: ['male' 'female']


| PassengerId 	| Name                                              	| Sex    	| Sex_male 	| Sex_female 	|
|-------------	|---------------------------------------------------	|--------	|:--------:	|:----------:	|
| 1           	| Braund, Mr. Owen Harris                           	| male   	|     1    	|      0     	|
| 2           	| Cumings, Mrs. John Bradley (Florence Briggs Th... 	| female 	|     0    	|      1     	|

"Passenger class" is represented numerically in this dataset, and although we could possibly use it "as is", we will treat it as categorical data and "One-Hot" encode it as well:





In [842]:
# Get all possible categories for the "Pclass" column
print(f"Possible values for Pclass: {dataset['Pclass'].unique()}")

Possible values for Pclass: [3 1 2]


| PassengerId 	| Name                                              	| Pclass 	| Pclass_1 	| Pclass_2 	| Pclass_3 	|
|-------------	|---------------------------------------------------	|:------:	|:--------:	|:--------:	|:--------:	|
| 1           	| Braund, Mr. Owen Harris                           	|    3   	|     0    	|     0    	|     1    	|
| 2           	| Cumings, Mrs. John Bradley (Florence Briggs Th... 	|    1   	|     1    	|     0    	|     0    	|
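Because `Pclass` is stored as integers, it has to be named explicitly when encoding. The sketch below uses a made-up three-row dataframe (an assumption, not the real data) to show the difference. `dtype=int` is passed only so the output shows 0/1 instead of booleans on newer pandas versions.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({"Pclass": [3, 1, 2], "Fare": [7.25, 71.28, 13.0]})

# Without `columns=`, get_dummies only converts object/category columns,
# so the integer Pclass column is left unchanged
untouched = pd.get_dummies(toy)

# Naming the column forces it to be treated as categorical
encoded = pd.get_dummies(toy, columns=["Pclass"], dtype=int)

print(untouched.columns.tolist())
print(encoded)
```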




In [843]:
# Just out of curiosity, how many options do we have for the "Cabin" category?
print(f"Possible options for Cabin: {dataset['Cabin'].unique().shape}")

Possible options for Cabin: (147,)


Using numerical `vectors` to describe categories allows us to use that information in most Machine Learning algorithms.

## Building a new model

Let's build a new model, using the cleaned dataset from the previous Unit (where we addressed missing values), but making sure we include the categorical columns this time.

We need to one-hot encode the following columns:

- Pclass
- Sex
- Cabin
- Embarked

In [844]:
import sklearn.model_selection as model_selection
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Let's remove some fields that are not needed right now
dataset = dataset.drop(["PassengerId","Name","Ticket"], axis=1)

# Generate One-Hot encodings for the categorical columns
complete_dataset = pd.get_dummies(dataset, columns=["Pclass", "Sex", "Cabin", "Embarked"], drop_first=False)

# Check resulting dataset
complete_dataset.head()


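|   | Survived | Age | SibSp | Parch | Fare | Pclass_1 | Pclass_2 | Pclass_3 | Sex_female | Sex_male | ... | Cabin_F2 | Cabin_F33 | Cabin_F38 | Cabin_F4 | Cabin_G6 | Cabin_T | Cabin_Unknown | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1 | 26.0 | 0 | 0 | 7.925 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 35.0 | 0 | 0 | 8.05 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |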


In [845]:
# Check the shape of the One-Hot encoded dataset
complete_dataset.shape

(889, 160)

The One-Hot encoded dataset now has 160 columns (remember that we had 147 categories for the field "Cabin" alone, plus one column for each category of the other encoded fields).
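The count can be checked by hand: 5 columns are left untouched (`Survived`, `Age`, `SibSp`, `Parch`, `Fare`), and encoding adds 3 (`Pclass`) + 2 (`Sex`) + 147 (`Cabin`) + 3 (`Embarked`) columns, for a total of 160. The sketch below applies the same arithmetic to a small made-up dataframe. The `toy` data and the `categorical` list are assumptions used only for illustration.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
})
categorical = ["Sex", "Embarked"]

# Expected width: untouched columns + one new column per distinct category
expected = (toy.shape[1] - len(categorical)) + sum(
    toy[col].nunique() for col in categorical
)

encoded = pd.get_dummies(toy, columns=categorical)
print(expected, encoded.shape[1])
```

A field with many distinct values, such as `Cabin`, dominates the width of the encoded dataset.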

We can now train and evaluate our new model using the "complete" dataset:


In [846]:
# X is our feature matrix
X = complete_dataset.drop(["Survived"], axis=1)



# y is the label vector 
y = complete_dataset["Survived"]

# Create Train and test sets with a 70/30 split
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, train_size=0.70,test_size=0.30, random_state=101)

# train the model (increase the maximum number of iterations for the new dataset)
model = LogisticRegression(random_state=0, max_iter=2000).fit(X_train, y_train)

# score is the mean accuracy; note that here it is computed on the training data
score = model.score(X_train, y_train)

# calculate loss
probabilities = model.predict_proba(X_test)
loss = metrics.log_loss(y_test, probabilities)

# save results for comparison
complete_score = score
complete_loss = loss
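To give some intuition for the loss value, here is a minimal sketch of the binary log loss formula, the average of -(y·log(p) + (1-y)·log(1-p)) over all entries. The helper `binary_log_loss` and the sample probabilities are hypothetical and written for illustration. The exercise itself relies on `metrics.log_loss`.

```python
import math

def binary_log_loss(y_true, p_positive):
    """Average negative log-likelihood for 0/1 labels.

    p_positive holds the predicted probability of class 1 for each entry.
    """
    total = 0.0
    for y, p in zip(y_true, p_positive):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Made-up labels and predicted probabilities
labels = [1, 0, 1]
confident = [0.9, 0.1, 0.8]  # close to the true labels
unsure = [0.6, 0.4, 0.5]     # closer to a coin flip

print(binary_log_loss(labels, confident))
print(binary_log_loss(labels, unsure))
```

Predictions that put more probability on the correct class give a lower loss, which is why a lower value is better when comparing models.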



## Comparing Models

We can compare the `score` and `loss` metrics for this model with the metrics we gathered in Unit 5:

In [847]:
# Use a dataframe to create a comparison table of metrics
# The metrics in the first two rows are copied from the previous Unit
results = [["Numeric Features Only (original)", 0.686998, 0.609630],
           ["Numeric Features Only (cleaned)", 0.696141, 0.609630],
           ["Numeric and categorical", complete_score, complete_loss]]

pd.DataFrame(results, columns=["Dataset", "Score", "Loss"])

|   | Dataset | Score | Loss |
|---|---|---|---|
| 0 | Numeric Features Only | 0.696141 | 0.60963 |
| 1 | Numeric and categorical | 0.826367 | 0.425014 |


-----

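## Summary

In this exercise we loaded the cleaned Titanic dataset, One-Hot encoded the categorical columns `Pclass`, `Sex`, `Cabin` and `Embarked`, and trained a new `Logistic Regression` model on the resulting 160 columns. Compared with the numeric-only model on the cleaned data, the score rose from 0.696141 to 0.826367 and the loss dropped from 0.609630 to 0.425014.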

-----