<img align="right" src="images/nuked_crop.jpg" width="180px">

## Categorical Encoding Dangers: Silent but Deadly!

First, make sure you read our [Categorical Encoding Guide](https://nbviewer.jupyter.org/github/SuperCowPowers/scp-labs/blob/master/notebooks/Categorical_Data_Guide.ipynb) for a full overview of Pandas categorical support, memory optimization and performance improvements.

Now that you've read the notebook above and are familiar with categorical data, let's dive into the real dangers of using categorical data for machine learning. These dangers are **silent AND deadly**: the ML model will simply start giving incorrect predictions, and pinning down the cause can be extremely hard.

### Categorical Data Review
<img align="right" src="images/data_types.png" width="450px">

For a review of the different data types we can look at the diagram on the right. Categorical features are named values and include things like:

    Music Genre = 'jazz', 'rock', 'pop', ...
    DNS Record = 'A', 'PTR', 'TXT', 'SRV', ...
  
These types of features can be important discriminators when classifying or organizing data. Since machine learning libraries like Scikit-Learn or PyTorch need numerical data we need to transform our categorical data into a usable form.

The standard approach for this is called **one hot encoding** or **dummy-encoding**. This technique converts categorical variables into numeric data. In the notebook below we'll describe the process, go through some code and then describe the often **silent dangers** and how to avoid them.

<img align="right" src="images/dynamic.jpg" width="350px">

## Synthetic Data
Since the dangers are subtle we're going to construct a small synthetic dataset that illustrates the issues that can happen when doing categorical encoding for model training and evaluation.

## Resources for Categorical Encoding
- [Categorical Encoding Guide](https://nbviewer.jupyter.org/github/SuperCowPowers/scp-labs/blob/master/notebooks/Categorical_Data_Guide.ipynb)
<img align="right" src="images/pandas.png" width="280px">
- [Pandas Categorical Docs](https://pandas.pydata.org/pandas-docs/stable/categorical.html)
- [Pandas Github Issue](https://github.com/pandas-dev/pandas/issues/8918)
- [Get Smarties](https://github.com/joeddav/get_smarties)
- [Tom Augspurger PyData Chicago 2016](https://youtu.be/KLPtEBokqQ0)
- [Categorical Handling for Python](https://www.datacamp.com/community/tutorials/categorical-data)

## General Resources for Pandas
- [Python for Data Analysis (Great Book!)](http://shop.oreilly.com/product/0636920050896.do)
- [Data School on YouTube](https://www.youtube.com/channel/UCnVzApLJE2ljPZSeQylSEyg)

In [37]:
# Note: Good idea to print out library versions
import pandas as pd
import sklearn
print('Pandas: {:s}'.format(pd.__version__))
print('Scikit-Learn: {:s}'.format(sklearn.__version__))

# Plotting defaults
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 14.0
plt.rcParams['figure.figsize'] = 12.0, 5.0

Pandas: 0.23.4
Scikit-Learn: 0.20.0


In [103]:
# Synthetic Data (Survey of Potential Customers)
names = ['bob', 'sue', 'jane', 'joe', 'bill', 'sally', 'cindy', 'hank']
fav_food = ['pizza', 'ham', 'pizza', 'tacos', 'pizza', 'ham', 'pizza', 'tacos']
fav_genre = ['sci-fi', 'action', 'sci-fi', 'comedy', 'sci-fi', 'comedy', 'sci-fi', 'action']
like_app = [True, False, True, False, True, False, True, False]
df = pd.DataFrame({'name':names, 'food':fav_food, 'genre':fav_genre, 'like_app':like_app})
df.head(10)

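|   | name | food | genre | like_app |
|---|------|------|-------|----------|
| 0 | bob | pizza | sci-fi | True |
| 1 | sue | ham | action | False |
| 2 | jane | pizza | sci-fi | True |
| 3 | joe | tacos | comedy | False |
| 4 | bill | pizza | sci-fi | True |
| 5 | sally | ham | comedy | False |
| 6 | cindy | pizza | sci-fi | True |
| 7 | hank | tacos | action | False |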


<img align="right" src="images/dynamic.jpg" width="400px">

## Machine Learning Use Case
We're a hot new startup and we're trying to determine which customer demographics to focus on when marketing our new mobile app. We surveyed potential customers, asked if they liked the mock-up of our app **(crayon drawing taped to a phone...cough...)** and collected data on favorite foods and movie genres. 

Now we're using the data to train our tensorflow, deep-mind, hypertronic, gpu enabled blockchain, generalized AI model...

In [104]:
# First let's divide our DataFrame into input features and output variables
# We'll use the standard naming: 'X' for feature matrix and 'y' for output array
X = df[['food', 'genre']]
y = df['like_app']

<img align="right" src="images/one_hot.png" width="450px">

## What is One Hot Encoding?
Machine learning libraries like scikit-learn, PyTorch, and TensorFlow operate on numerical data in the form of arrays or matrices. Looking at the diagram on the right, we see that **one hot encoding** converts categorical data into a set of binary columns: there will be **N new columns** based on the **N possible values** of the categorical data. **Dummy encoding** does the same thing but drops one of the columns (so **N-1** columns). Pandas provides both:

    pd.get_dummies(df)                  # One Hot Encoding
    pd.get_dummies(df, drop_first=True) # Dummy Encoding
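
To make the N versus N-1 difference concrete, here is a minimal sketch on a toy column (the `genres` DataFrame is made up for illustration and is not part of the survey data):

```python
import pandas as pd

# Toy column with N = 3 distinct values (illustrative data only)
genres = pd.DataFrame({'genre': ['jazz', 'rock', 'pop', 'jazz']})

one_hot = pd.get_dummies(genres)                 # N columns
dummy = pd.get_dummies(genres, drop_first=True)  # N-1 columns

print(list(one_hot.columns))  # ['genre_jazz', 'genre_pop', 'genre_rock']
print(list(dummy.columns))    # ['genre_pop', 'genre_rock']
```

With `drop_first=True` the first level (`jazz`) is implied by all of the remaining columns being zero.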

In [105]:
# Convert Pandas DataFrame columns into numerical arrays.
X = pd.get_dummies(X)
X

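|   | food_ham | food_pizza | food_tacos | genre_action | genre_comedy | genre_sci-fi |
|---|----------|------------|------------|--------------|--------------|--------------|
| 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 1 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 | 1 |
| 5 | 1 | 0 | 0 | 0 | 1 | 0 |
| 6 | 0 | 1 | 0 | 0 | 0 | 1 |
| 7 | 0 | 0 | 1 | 1 | 0 | 0 |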


In [55]:
# Now we're ready for our next-gen, deep learning, AI, blah...blah...
# Actually, before we do that, let's use RandomForest as a baseline and
# we'll leave the <bingo words> for another notebook. :)
from sklearn.ensemble import RandomForestClassifier
awesome_model = RandomForestClassifier(n_estimators=100)

# Train the model using cross validation (K-fold)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(awesome_model, X, y, cv=4)
print(scores)
print('Accuracy: {:.2f} (STD = {:.2f})'.format(scores.mean(), scores.std()))

[1. 1. 1. 1.]
Accuracy: 1.00 (STD = 0.00)


In [56]:
# Sweet, our model is 100% accurate! Now we train on all the data
# and investigate which features the model thought were most important
awesome_model.fit(X, y)
feature_scores = [(f, s) for f, s in zip(X.columns, awesome_model.feature_importances_)]
feature_scores.sort(key=lambda x: x[1], reverse=True)
feature_scores

[('genre_sci-fi', 0.3584190476190476),
 ('food_pizza', 0.3438666666666667),
 ('genre_action', 0.08297777777777778),
 ('food_tacos', 0.0787111111111111),
 ('genre_comedy', 0.07531428571428572),
 ('food_ham', 0.060711111111111105)]

<img align="left" src="images/faberge_egg.gif" width="180px">

# Nice! We're SO going to be rich!
Alright, our **extensive** feature analysis shows that if you like Science Fiction and you eat Pizza then you're going to love our new app (which Carl made a mock-up of by taping a crayon drawing to his phone). 

But before we start spending all our new, well-earned money, let's try out our model on some new data, because we want to make sure it works before buying that set of Fabergé eggs!

In [83]:
# We grab some new data (that wasn't part of the training set)
names = ['ted', 'bob', 'flo', 'june']
fav_food = ['steak','pizza', 'tacos', 'pizza']
fav_genre = ['thriller', 'sci-fi', 'comedy', 'sci-fi']
like_app = [False, True, False, True]
new_df = pd.DataFrame({'name':names, 'food':fav_food, 'genre':fav_genre, 'ground_truth_like':like_app})
new_df.head(10)

|   | name | food | genre | ground_truth_like |
|---|------|------|-------|-------------------|
| 0 | ted | steak | thriller | False |
| 1 | bob | pizza | sci-fi | True |
| 2 | flo | tacos | comedy | False |
| 3 | june | pizza | sci-fi | True |


<img align="right" src="images/sci_fi.png" width="280px">
<img align="left" src="images/pizza.jpg" width="240px">

## So close we can smell it!
Our feature analysis told us that people who like pizza and sci-fi will like our app. In fact, we can **SEE** that in the small dataset above. So let's run our model prediction, claim our victory and go buy some eggs!

In [78]:
# We'll do the same conversions we did above when we trained the model
X = new_df[['food', 'genre']]
X = pd.get_dummies(X)
X

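|   | food_pizza | food_steak | food_tacos | genre_comedy | genre_sci-fi | genre_thriller |
|---|------------|------------|------------|--------------|--------------|----------------|
| 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 1 | 0 |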


In [102]:
# Alright, that looks good, let's get predictions from our awesome model!
predictions = []
for index, row in X.iterrows():
    pred = awesome_model.predict([row])[0]
    predictions.append(pred)
new_df['predicted_like'] = predictions
new_df.head()

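|   | name | food | genre | ground_truth_like | predicted_like |
|---|------|------|-------|-------------------|----------------|
| 0 | ted | steak | thriller | False | True |
| 1 | bob | pizza | sci-fi | True | False |
| 2 | flo | tacos | comedy | False | False |
| 3 | june | pizza | sci-fi | True | False |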


<img align="left" src="images/confused.jpg" width="230px">
<img align="right" src="images/nuked.jpg" width="170px">

## I am SO confused right now... 
Our model is giving incorrect predictions! Even worse, it didn't crash or complain or do anything. It just **silently** went about its business and gave us complete **junk** for output. 

Now imagine this model is part of a long data pipeline, where predictions go into a database and get displayed in a web interface a week later. In this scenario you may have **no idea** what happened.

This small exercise is the simplest example we could contrive. The **category values** in our evaluation data were **slightly different** than those in the training data. But this **silent and deadly** issue can happen in more subtle ways, like having the same category values but in a different order. Similar issues can occur with Scikit-Learn's [One Hot Encoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) if you're not careful.
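
To see why the predictions went wrong, it helps to line up the encoded column names from training and evaluation side by side. The sketch below rebuilds just the category values from the two datasets; the `check_columns` helper is a hypothetical guard we made up for illustration, not something Pandas or Scikit-Learn provides.

```python
import pandas as pd

# Category values seen at training time vs. evaluation time
train = pd.DataFrame({'food': ['pizza', 'ham', 'tacos'],
                      'genre': ['sci-fi', 'action', 'comedy']})
new = pd.DataFrame({'food': ['steak', 'pizza', 'tacos'],
                    'genre': ['thriller', 'sci-fi', 'comedy']})

train_cols = list(pd.get_dummies(train).columns)
new_cols = list(pd.get_dummies(new).columns)

# Same number of columns, so nothing looks wrong from the array shape alone
print(len(train_cols), len(new_cols))

# ...but the same position now means a different feature
for position, (t, n) in enumerate(zip(train_cols, new_cols)):
    flag = '' if t == n else '   <-- mismatch'
    print(position, t, n, flag)


def check_columns(expected, actual):
    """Hypothetical guard: fail loudly rather than predict on misaligned columns."""
    if list(expected) != list(actual):
        raise ValueError('Encoded columns differ from training: {} vs {}'.format(
            list(expected), list(actual)))


try:
    check_columns(train_cols, new_cols)
except ValueError as err:
    print('Caught:', err)
```

In the evaluation frame `food_steak` sits in the position the model learned as `food_pizza`, and `genre_thriller` sits where `genre_sci-fi` used to be. So a steak-and-thriller fan looks exactly like a pizza-and-sci-fi fan to the model.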

<img align="right" src="images/fix_it.jpg" width="200px">

## Pandas Categorical Types to the Rescue 
Fortunately, Pandas already has the functionality to fix this issue in a robust way. If you remember from our [Categorical Encoding Guide](https://nbviewer.jupyter.org/github/SuperCowPowers/scp-labs/blob/master/notebooks/Categorical_Data_Guide.ipynb), we can explicitly convert these columns to a categorical dtype. As discussed in the notebook, this conversion gives us a **substantial decrease in memory** usage, an **increase in performance**, and as we'll see now, **saves us** from several silent but deadly issues! 
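
As a minimal sketch of the idea (the category list here is our assumption about the full set of possible foods), a column with an explicit categorical dtype encodes to the same columns, in the same order, no matter which values happen to be present:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Assumed full list of possible values, defined up front rather than inferred from data
food_dtype = CategoricalDtype(categories=['ham', 'pizza', 'tacos', 'steak'])

train = pd.DataFrame({'food': ['pizza', 'ham', 'tacos']})
new = pd.DataFrame({'food': ['steak', 'pizza']})
train['food'] = train['food'].astype(food_dtype)
new['food'] = new['food'].astype(food_dtype)

train_X = pd.get_dummies(train)
new_X = pd.get_dummies(new)

# Both frames get one column per declared category, in the declared order
print(list(train_X.columns))
print(list(new_X.columns))
```

One thing to watch for: a value that is not in the category list becomes `NaN` during the conversion and encodes as an all-zero row, so checking for `NaN` after `astype` is a sensible extra guard.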

<img align="right" src="images/rewind.png" width="200px">

## Rewind and Recap
Let's rewind a bit and codify exactly what we should do during both the training and prediction phases. Once you've done these steps a few times they will start to feel more natural and intuitive.

**Training:**
1. Define ALL possible categorical values as a **list**
1. Use **list** as part of your category dtype definition
1. Train model
1. Save the model and category list (serialize them to disk/database)

**Prediction:**
1. Retrieve the model and category **list** (deserialize them from disk/database)
1. Use **list** as part of your category dtype definition
1. Predict from model
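
The steps above can be sketched end to end as follows. This is one possible arrangement, not the only one: the `encode` helper, the file names, and the choice of `pickle` and `json` for serialization are all our assumptions for illustration.

```python
import json
import os
import pickle
import tempfile

import pandas as pd
from pandas.api.types import CategoricalDtype
from sklearn.ensemble import RandomForestClassifier


def encode(frame, categories):
    """Hypothetical helper: one hot encode using a fixed {column: [values]} mapping."""
    frame = frame[list(categories)].copy()
    for column, values in categories.items():
        frame[column] = frame[column].astype(CategoricalDtype(categories=values))
    return pd.get_dummies(frame)


# --- Training: define the lists, build the dtypes, train, save ---
categories = {'food': ['ham', 'pizza', 'tacos', 'steak'],
              'genre': ['action', 'comedy', 'sci-fi', 'thriller']}
train = pd.DataFrame({
    'food': ['pizza', 'ham', 'pizza', 'tacos', 'pizza', 'ham', 'pizza', 'tacos'],
    'genre': ['sci-fi', 'action', 'sci-fi', 'comedy', 'sci-fi', 'comedy', 'sci-fi', 'action'],
    'like_app': [True, False, True, False, True, False, True, False]})
train_X = encode(train, categories)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train_X, train['like_app'])

workdir = tempfile.mkdtemp()  # stand-in for your real disk/database location
with open(os.path.join(workdir, 'model.pkl'), 'wb') as fp:
    pickle.dump(model, fp)
with open(os.path.join(workdir, 'categories.json'), 'w') as fp:
    json.dump(categories, fp)

# --- Prediction: load the model and lists, rebuild the dtypes, predict ---
with open(os.path.join(workdir, 'model.pkl'), 'rb') as fp:
    loaded_model = pickle.load(fp)
with open(os.path.join(workdir, 'categories.json')) as fp:
    loaded_categories = json.load(fp)

new = pd.DataFrame({'food': ['steak', 'pizza', 'tacos', 'pizza'],
                    'genre': ['thriller', 'sci-fi', 'comedy', 'sci-fi']})
new_X = encode(new, loaded_categories)
predictions = loaded_model.predict(new_X)
print(list(new_X.columns) == list(train_X.columns))
print(predictions)
```

The encoded columns now line up between training and prediction, so the pizza and sci-fi fans are recognized again. Keep in mind that steak and thriller never appeared in the training data, so the model has no real evidence about them; the fixed category list only guarantees that they land in their own columns instead of someone else's.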


In [None]:
# Going back to our original data, setting up the 
# categorical types, and then training like before.
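food_types = ['ham', 'pizza', 'tacos', 'steak']
genre_types = ['action', 'comedy', 'sci-fi', 'thriller']
df['food'] = df['food'].astype(pd.api.types.CategoricalDtype(categories=food_types))
df['genre'] = df['genre'].astype(pd.api.types.CategoricalDtype(categories=genre_types))
X = pd.get_dummies(df[['food', 'genre']])
awesome_model.fit(X, df['like_app'])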

<img align="left" src="images/deep_dive.jpg" width="280px">

## Deeper Dive
We covered a simple approach to avoiding Categorical Encoding Dangers. For a production system with large data pipelines there are better, more formal approaches. In particular, you may want to use a TransformerMixin class. Tom Augspurger goes through this approach in a great talk, so if you're interested in a deeper dive, check out [Tom Augspurger PyData Chicago 2016](https://youtu.be/KLPtEBokqQ0).
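
To give a flavor of the TransformerMixin approach, here is a minimal sketch. It is not the implementation from the talk: the `CategoricalDummyEncoder` class name and its behavior (learning the category lists from the training data at fit time) are our own assumptions.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype
from sklearn.base import BaseEstimator, TransformerMixin


class CategoricalDummyEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: remembers the category lists at fit time and
    reuses them at transform time so the encoded columns never shift."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        columns = self.columns if self.columns is not None else list(X.columns)
        # Record the sorted unique values seen in the training data
        self.categories_ = {c: sorted(X[c].dropna().unique()) for c in columns}
        return self

    def transform(self, X):
        X = X[list(self.categories_)].copy()
        for column, values in self.categories_.items():
            X[column] = X[column].astype(CategoricalDtype(categories=values))
        return pd.get_dummies(X)


train = pd.DataFrame({'food': ['pizza', 'ham', 'tacos'],
                      'genre': ['sci-fi', 'action', 'comedy']})
new = pd.DataFrame({'food': ['steak', 'pizza'],
                    'genre': ['thriller', 'sci-fi']})

encoder = CategoricalDummyEncoder()
train_X = encoder.fit_transform(train)
new_X = encoder.transform(new)
print(list(train_X.columns) == list(new_X.columns))
```

Because the categories are learned from the training data here, values that were never seen (like `steak`) encode as all zeros rather than shifting the columns. If you know the full list of values up front, passing it in explicitly is the more robust choice.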

<img align="left" src="images/SCP_med.png" width="180px">
<img align="right" src="images/nuked.jpg" width="190px">


## Wrap Up
Well, this notebook certainly was an **exciting and dangerous adventure!** We're now prepared to use categorical data in our machine learning models without fear of the hidden dangers.

If you liked this notebook please visit [SCP Labs](https://github.com/SuperCowPowers/scp-labs) for more notebooks and examples, or visit our company page for consulting, development and products: [SuperCowPowers](https://www.supercowpowers.com).

### Feedback
We welcome feedback on errors, improvements, or alternative approaches. Please send suggestions to <feedback@supercowpowers.com>

In [12]:
# This cell is simply for adding some CSS (Ignore it :)
from IPython.display import HTML
def css_styling():
    with open("styles/custom.css", "r") as css_file:
        styles = css_file.read()
    return HTML(styles)
css_styling()