# Preprocessing Categorical Variables

In this section, we'll cover some techniques for preprocessing categorical variables.

## UCI Mushroom dataset (mushroom.data)

* 22 categorical features and one response variable (label)
* Objective is to classify whether a mushroom is edible ("e") or poisonous ("p")
* Full information about the dataset is available [here](https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names)

In [2]:
import pandas as pd

prefix = "../datasets/"
df = pd.read_csv(prefix + "mushroom.data", sep=",")

In [3]:
df.head()

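|   | p | x | s | n | t | p.1 | f | c | n.1 | k | ... | s.2 | w | w.1 | p.2 | w.2 | o | p.3 | k.1 | s.3 | u |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 1 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 2 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 3 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |
| 4 | e | x | y | y | t | a | f | c | b | n | ... | s | w | w | p | w | o | p | k | n | g |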


In [4]:
columns = ["edible", "cap-shape", "cap-surface", "cap-color", "bruises?",
        "odor", "gill-attachment", "gill-spacing", "gill-size", "gill-color",
        "stalk-shape", "stalk-root", "stalk-surface-above-ring",
        "stalk-surface-below-ring", "stalk-color-above-ring",
        "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
        "ring-type", "spore-print-color", "population", "habitat"
        ]

df = pd.read_csv(prefix + "mushroom.data", sep=",", names=columns)

In [5]:
df.head()

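|   | edible | cap-shape | cap-surface | cap-color | bruises? | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |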


## Problem with categorical variables

* We can't plug categorical variables straight into a classifier; we need to turn them into numeric feature vectors first

    **NOTE**. Machine learning algorithms often involve taking dot products with the input vector. For example, SVMs classify an object based on the following criterion.
    
    If $\vec{w}$ is the weight vector and $b$ the bias of an
    SVM, and $\vec{x}$ is the input feature vector, then the label given to $\vec{x}$, $f(\vec{x})$, is

    $$f(\vec{x}) = \text{sign}(\vec{w}\cdot \vec{x} + b) \quad (1)$$

    where 
    
    $$\text{sign}(x) = \left\{
        \begin{array}{ll}
        -1 &  \text{ if } x < 0 \\
        1  &  \text{ otherwise}
        \end{array}\right.$$
    
    Thus, $\vec{x}\in \mathbb{R}^n$ has to be a vector of $n$ real numbers, so we need a technique for encoding categories as numbers.
    
* One common way to do it: **one hot encodings** (i.e., "dummy encodings")
    * Scikit-Learn implementation: [sklearn.preprocessing.OneHotEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
        * Not perfect: older releases (before 0.20) can only encode integer features, while newer releases also accept string categories
        * See [here](http://stackoverflow.com/q/35107559/2014591) for a discussion of the integer-only limitation
    * Pandas implementation: [pandas.get_dummies](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html)
        * **Note**: Possible bug in releases below 0.17 (I think)
        * Does not necessarily preserve encodings across a training / test set split; the training and test sets might contain different category values, which yields different columns
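
The train/test mismatch mentioned in the last bullet can be seen on a toy example. The sketch below uses made-up `train` and `test` frames (they are not part of the mushroom data) and aligns the test columns to the training columns with `DataFrame.reindex`. This is one possible workaround, not the only one.

```python
import pandas as pd

# Hypothetical toy data: "green" and "blue" only appear in the training set,
# "yellow" only appears in the test set.
train = pd.DataFrame({"color": ["red", "blue", "green"]})
test = pd.DataFrame({"color": ["red", "yellow"]})

train_encoded = pd.get_dummies(train)
test_encoded = pd.get_dummies(test)

# Encoded independently, the two sets end up with different columns.
print(list(train_encoded.columns))
print(list(test_encoded.columns))

# Align the test set to the training columns: categories missing from the
# test set are filled with 0, and categories unseen in training are dropped.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(list(test_aligned.columns))
```

After the alignment, the "yellow" row is encoded as all zeros, so the classifier sees it as "none of the known colors".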

## Part 1: One Hot Encodings

(Part 1 of 1)

In [None]:
# Example:

features = [["red"],
            ["blue"],
            ["green"],
            ["red"]]

sample = pd.DataFrame(data=features, columns=["color"])

In [None]:
sample.head()

In [None]:
onehot = pd.get_dummies(sample)
onehot

Now suppose that we want to train an SVM on a dataset with a single feature ("color") with three possible values: "blue", "green", and "red". You perform a one hot encoding on your training set and train your SVM on it, and the SVM comes up with the following weights and bias after training.

$$\vec{w} = \begin{pmatrix} 3 \\ -2 \\ 3 \\ \end{pmatrix} \qquad b = -1$$

($\vec{w}$ is length 3, because there are 3 categories for this one feature dataset.)

Your one hot encoding above would yield

$$
    \begin{pmatrix} 
        \text{"red"} \\ 
        \text{"blue"} \\ 
        \text{"green"} \\ 
        \text{"red"} 
    \end{pmatrix} 
    \implies 
    \begin{pmatrix} 
    \text{"blue"} & \text{"green"} & \text{"red"} \\ 
        0 & 0 & 1 \\ 
        1 & 0 & 0 \\ 
        0 & 1 & 0 \\ 
        0 & 0 & 1 
    \end{pmatrix}$$

and so for each row, you would apply equation (1) from above. For example, for the first sample, you will have

$$
\begin{align*}
f(\vec{x}) 
&= \text{sign}(\vec{w}\cdot \vec{x} + b) & \text{by equation (1)}\\ 
&= \text{sign}\Bigg( 
    \begin{pmatrix} 
    3 \\ -2 \\ 3 
    \end{pmatrix} \cdot 
    \begin{pmatrix} 
    0 \\ 0 \\ 1 
    \end{pmatrix} - 1 \Bigg) \\
&= \text{sign}(3 - 1) \\
&= 1.
\end{align*}
$$

If you're fluent in matrix operations, you'll notice that you can apply equation (1) to every sample in one sweep. If $X$ is the matrix of all your samples, with each sample as a row of $X$, the following computes the labels for all samples in $X$:

$$f(X) = \text{sign}\left( X \vec{w} + \begin{pmatrix} b \\ \vdots \\ b \end{pmatrix} \right)$$

where $\text{sign}$ is applied element-wise and $f(X) \in \{-1, 1\}^4$.
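
A minimal NumPy sketch of this one-sweep computation for the four-sample example, using the weights and bias given above. The `sign` convention follows the definition earlier in this section (0 maps to 1), which is why `np.where` is used here rather than `np.sign`.

```python
import numpy as np

# One hot encoded samples; columns are ("blue", "green", "red").
X = np.array([[0, 0, 1],   # red
              [1, 0, 0],   # blue
              [0, 1, 0],   # green
              [0, 0, 1]])  # red

w = np.array([3, -2, 3])   # weight vector from the example
b = -1                     # bias from the example

# The scalar b is broadcast over all rows, so no explicit bias vector is needed.
scores = X @ w + b
labels = np.where(scores < 0, -1, 1)

print(scores)  # [ 2  2 -3  2]
print(labels)  # [ 1  1 -1  1]
```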

Notes about one hot encodings

* Assumes that the categories are independent of one another
* If a variable has $k$ unique values, the encoding replaces it with $k$ binary features, a net increase of $k - 1$ features in your feature vector
* No effort to understand semantic meaning of categories
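
To make the feature count concrete, the sketch below encodes a small made-up `color` column. By default `pd.get_dummies` replaces the column with one column per category. Its `drop_first=True` option drops the first of them, since that column can be inferred from the others.

```python
import pandas as pd

# Toy single-feature dataset with k = 3 unique values.
sample = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

full = pd.get_dummies(sample)                      # one column per category
reduced = pd.get_dummies(sample, drop_first=True)  # first category dropped

print(list(full.columns))
print(list(reduced.columns))
```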

In [None]:
X = df.drop("edible", axis=1)
y = df["edible"]

In [None]:
X.head()

## Applying a one hot encoding for classification

For this, we'll simplify the steps and encode the training and test data at the same time. You wouldn't be able to do this in a production environment, where the test data is not yet available when the encoding is built.

In [None]:
X_encoded = pd.get_dummies(X)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

# Hold out part of the encoded data for testing (default split: 75% / 25%)
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y)

# Fit a random forest on the training split and predict the held-out labels
predictions = RandomForestClassifier().fit(X_train, y_train).predict(X_test)
print("Accuracy of random forest:", accuracy_score(y_test, predictions))

## Last words about one hot encodings

* Certainly not the best way to encode categorical variables, but it's easy and widely used
    * Makes feature vectors very sparse
    * Can't enforce dependence between any two categories
    * If a new category appears in your testing set which was not present in the training set, you're screwed!
        * The way we encoded the dataset above was "not kosher". Whoops.
* For some models, R will do the encodings for you under the hood
    * "Categorical" variables in R are called [factor variables](http://www.ats.ucla.edu/stat/r/modules/factor_variables.htm). 
    * Models in R will detect whether a variable is a factor type before doing anything
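
Back in Python, here is one way to cope with a category that only shows up in the test set, sketched under the assumption of scikit-learn 0.20 or later (where `OneHotEncoder` accepts string categories). The encoder is fit on the training set only and given `handle_unknown="ignore"`, so that an unseen category is encoded as an all-zero row instead of raising an error. The toy `train` and `test` arrays are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "yellow" never appears in the training set.
train = np.array([["red"], ["blue"], ["green"]])
test = np.array([["red"], ["yellow"]])

# Learn the categories from the training data only.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The output is a sparse matrix; convert it to a dense array for display.
encoded_test = encoder.transform(test).toarray()

print(encoder.categories_)
print(encoded_test)
```

Because the encoder remembers the training categories, the test set always gets the same columns in the same order as the training set.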