## One-hot encoding

One-hot encoding expands a single categorical feature into an n-length unit vector, where n is the cardinality of the categorical feature (the number of distinct values it can take).  For example, say we are incorporating day of the week as a categorical variable.

*Without* one-hot encoding, you might assign an integer to each day of the week, like so:
    
    Monday = 0,
    Tuesday = 1,
    Wednesday = 2,
    Thursday = 3,
    Friday = 4,
    Saturday = 5,
    Sunday = 6

In a practical sense, many algorithms will treat this **categorical** feature as an **ordinal** one.  Ordinal features are generally useful when encoding binned or comparative data, such as:

- a 5-star rating 
- 10th, 20th, 30th, 40th, ... percentile
- below 60% versus above 60% viewability
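As a minimal sketch of how binned data becomes an ordinal feature, the snippet below bins some made-up viewability rates at the 60% threshold with `pd.cut` and reads off the integer codes. The values, the label names, and the choice to put exactly 60% in the lower bin are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical viewability rates, for illustration only
viewability = pd.Series([0.20, 0.75, 0.60, 0.95])

# Two ordered bins: [0, 0.6] and (0.6, 1.0]; a value of exactly 0.6 lands in the lower bin
bins = pd.cut(
    viewability,
    bins=[0.0, 0.6, 1.0],
    labels=["below_60", "above_60"],
    include_lowest=True,
)

# Ordered categoricals expose integer codes that preserve the ranking of the bins
ordinal_codes = bins.cat.codes
print(ordinal_codes.tolist())
```

Here the integer codes are meaningful because a larger code really does mean "more viewable".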

Encoding days of the week in this manner becomes problematic when using distance-based algorithms, since they assume that the order of the categories encodes useful information about how similar those categories are. For example, the difference between Sunday and Monday is 6 - 0 = 6, even though the two days are adjacent in the calendar. Instead, one-hot encoding represents each day of the week as a unit vector, like so:

$$ \text{Tuesday} = \left[
    \begin{array}{c}
      0\\
      1\\
      0\\
      0\\
      0\\
      0\\
      0\\
    \end{array}
\right]  $$

This way, all categories are orthogonal and equidistant in the feature space.  One-hot encoding is strongly recommended for SVM, kNN, logistic regression, and linear models.  It is also useful for decision trees and other algorithms, especially when limiting the depth of the trees to avoid overfitting.
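The orthogonality and equidistance claims can be checked numerically. The sketch below uses NumPy with the Monday = 0 through Sunday = 6 mapping from above; the variable names are illustrative.

```python
import numpy as np

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Integer encoding: Monday = 0, ..., Sunday = 6
int_codes = np.arange(len(days))
# One-hot encoding: row i of the identity matrix is the unit vector for day i
one_hot = np.eye(len(days))

mon, tue, sun = 0, 1, 6

# Integer encoding: Sunday looks six times farther from Monday than Tuesday does
int_mon_tue = abs(int_codes[mon] - int_codes[tue])
int_mon_sun = abs(int_codes[mon] - int_codes[sun])

# One-hot encoding: every pair of distinct days is the same Euclidean distance apart
oh_mon_tue = np.linalg.norm(one_hot[mon] - one_hot[tue])
oh_mon_sun = np.linalg.norm(one_hot[mon] - one_hot[sun])

# Distinct one-hot vectors are orthogonal, so their dot product is zero
dot = one_hot[mon] @ one_hot[sun]

print(int_mon_tue, int_mon_sun)
print(oh_mon_tue, oh_mon_sun, dot)
```

Every pair of distinct days ends up sqrt(2) apart under one-hot encoding, so no ordering is implied.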



In [1]:
import pandas as pd
from sklearn import tree
import numpy as np
from sklearn.feature_extraction import DictVectorizer
import sys

In [11]:
df = pd.read_csv('../data/sample_data.csv', sep=',', header=0, index_col=0)

# define the feature columns so we can separate those from the target label
feature_cols = [x for x in df.columns if x != 'target_variable']
df[feature_cols].head()

Unnamed: 0,site_id,strategy_id,list_type,line_id,adv_id,adv_vertical,name,goal,price,limit,...,stdev_imps_site,win_rate_site,win_rate_strat,cvr_strat,cvr,line_cvr,hist_zscore,overlap,win_rate_site_table,win_rate_strat_table
0,82932,313729,testing,20049,206.0,Travel,Nicole,0.0,3.95,10000.0,...,1.716631,0.423778,0.111431,0.0,0.001197,0.0,2.708366,0.001066,0.450094,0.249479
1,90474,313729,testing,20049,206.0,Travel,Nicole,0.0,3.95,10000.0,...,1.304388,0.16301,0.111431,0.0,0.001239,0.0,1.188635,0.000703,0.15805,0.249479
2,92345,313729,testing,20049,206.0,Travel,Nicole,0.0,3.95,10000.0,...,2.310355,0.318358,0.111431,0.0,0.000729,0.0,1.503285,0.000873,0.360591,0.249479
3,92415,313729,testing,20049,206.0,Travel,Nicole,0.0,3.95,10000.0,...,1.110002,0.133199,0.111431,0.0,0.005894,0.0,35.153628,0.004614,0.113717,0.249479
4,92425,313729,testing,20049,206.0,Travel,Nicole,0.0,3.95,10000.0,...,0.0,0.37931,0.111431,0.0,0.0,0.0,-0.091378,0.000344,0.019308,0.249479


### How to do one-hot encoding with pandas and scikit-learn

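You can use sklearn's [OneHotEncoder](http://scikit-learn.org/0.15/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) class, but in older releases such as the 0.15 version linked here, this requires you to change string-based categorical values into integers first. (Since scikit-learn 0.20, `OneHotEncoder` accepts string categories directly.)

For that reason, I like to use the more flexible [DictVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) class, which one-hot encodes string values as-is and passes numeric features through unchanged.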

#### First, initialize a DictVectorizer


In [12]:
vec = DictVectorizer(sparse=False)

#### Then convert your dataframe to a list of dictionaries, one per row

In [14]:
# orient='records' yields one {column: value} dict per row, with no transpose needed
raw_features = df[feature_cols].to_dict(orient='records')
features = vec.fit_transform(raw_features)

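#### Finally, convert back to a dataframe, using the get_feature_names_out method to get the new column names

In [16]:
# get_feature_names() was removed in scikit-learn 1.2; fall back to it
# only on releases older than 1.0, which lack get_feature_names_out()
df_encoded = pd.DataFrame(features, columns=vec.get_feature_names_out())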
df_encoded.head()

Unnamed: 0,adv_id,adv_vertical=Automotive,adv_vertical=Finance,adv_vertical=Home & Garden,adv_vertical=Philanthropy,adv_vertical=Travel,adv_vertical=Utilities,avg_bid,avg_imps_site,conversions,...,name=Travis,overlap,price,site_id,stdev_imps_site,strategy_id,win_rate_site,win_rate_site_table,win_rate_strat,win_rate_strat_table
0,206.0,0.0,0.0,0.0,0.0,1.0,0.0,2.75,0.0,0.0,...,0.0,0.001066,3.95,82932.0,1.716631,313729.0,0.423778,0.450094,0.111431,0.249479
1,206.0,0.0,0.0,0.0,0.0,1.0,0.0,2.75,0.0,0.0,...,0.0,0.000703,3.95,90474.0,1.304388,313729.0,0.16301,0.15805,0.111431,0.249479
2,206.0,0.0,0.0,0.0,0.0,1.0,0.0,2.75,0.0,0.0,...,0.0,0.000873,3.95,92345.0,2.310355,313729.0,0.318358,0.360591,0.111431,0.249479
3,206.0,0.0,0.0,0.0,0.0,1.0,0.0,2.75,0.0,0.0,...,0.0,0.004614,3.95,92415.0,1.110002,313729.0,0.133199,0.113717,0.111431,0.249479
4,206.0,0.0,0.0,0.0,0.0,1.0,0.0,2.75,0.0,0.0,...,0.0,0.000344,3.95,92425.0,0.0,313729.0,0.37931,0.019308,0.111431,0.249479
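To try the same three steps without the sample data file, here is a minimal sketch on a made-up two-column frame. The data is hypothetical (the column names are borrowed from the sample only for readability), and the `hasattr` check is an assumption added so the snippet also runs on scikit-learn releases older than 1.0.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "adv_vertical": ["Travel", "Finance", "Travel"],
    "price": [3.95, 2.50, 4.10],
})

# Step 1: initialize the vectorizer (dense output for easy inspection)
vec = DictVectorizer(sparse=False)

# Step 2: one dict per row, then fit and transform
records = toy.to_dict(orient="records")
features = vec.fit_transform(records)

# Step 3: rebuild a dataframe with the generated column names
if hasattr(vec, "get_feature_names_out"):
    names = list(vec.get_feature_names_out())
else:
    names = vec.get_feature_names()  # scikit-learn < 1.0
toy_encoded = pd.DataFrame(features, columns=names)
print(toy_encoded)
```

The string column is expanded into one `adv_vertical=<value>` column per category, while the numeric `price` column passes through unchanged.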


#### Now you're (almost!) ready to create a model

We have created n features in the expanded feature space for a categorical variable with cardinality of n. Because those n columns always sum to 1, they are perfectly collinear with the intercept term. In order to avoid collinearity in linear models, do one of the following:

- Dummy-code the variables so that there are n-1 features, with an all-0 vector indicating one of the categories

- Set `fit_intercept=False` when fitting the model

For more information, [see this Cross Validated question](http://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn).
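Both options can be sketched on a toy example. The data below is made up, and using `pd.get_dummies(..., drop_first=True)` for the dummy coding is an assumption about tooling rather than something the steps above prescribe; `LinearRegression` stands in for any linear model that exposes `fit_intercept`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one categorical feature and a numeric target
toy = pd.DataFrame({
    "adv_vertical": ["Travel", "Finance", "Utilities", "Travel", "Finance", "Utilities"],
    "target": [1.0, 2.0, 3.0, 1.2, 2.2, 3.2],
})

# Full one-hot encoding: n columns that sum to 1 in every row
one_hot = pd.get_dummies(toy["adv_vertical"]).astype(float)

# Option 1: dummy coding with n-1 columns; the dropped category is the all-0 baseline
dummies = pd.get_dummies(toy["adv_vertical"], drop_first=True).astype(float)
model_dummy = LinearRegression().fit(dummies, toy["target"])

# Option 2: keep all n columns but fit without an intercept
model_onehot = LinearRegression(fit_intercept=False).fit(one_hot, toy["target"])

# Both parameterizations describe the same fit, so their predictions agree
same_fit = np.allclose(model_dummy.predict(dummies), model_onehot.predict(one_hot))
print(same_fit)
```

With dummy coding, each coefficient is an offset from the baseline category; without an intercept, each coefficient is that category's own level.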