In this tutorial, you will learn what a **categorical variable** is, along with the most common approach for handling this type of data.


# Introduction

**Categorical data** is data that takes only a limited number of values.  

For example, if people responded to a survey about which brand of car they owned, the result would be categorical (because the answers would be things like _Honda_, _Toyota_, _Ford_, _None_, etc.). Responses fall into a fixed set of categories.

You will get an error if you try to plug these variables into most machine learning models in Python without preprocessing them first.  In this tutorial, we'll explore the most popular method for handling categorical variables.
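
As a rough illustration of why preprocessing is needed, the sketch below tries to convert a text column to numbers, which is roughly what many numeric models attempt internally. The DataFrame `survey` and its column `car_brand` are made-up names for this example, not part of the housing data used later.

```python
import pandas as pd

# Hypothetical survey responses (illustrative values only)
survey = pd.DataFrame({"car_brand": ["Honda", "Toyota", "Ford", "None"]})

try:
    # Text categories cannot be interpreted as floating-point numbers
    survey["car_brand"].astype(float)
    conversion_error = None
except ValueError as err:
    conversion_error = str(err)

print("Conversion failed:", conversion_error)
```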

# One-Hot Encoding: The Standard Approach for Categorical Data

**One-hot encoding** is the most widespread approach for categorical data.  It creates new (binary) columns, indicating the presence of each possible value in the original data.  To understand this, we'll work through an example.

![One-hot encoding simple example](https://i.imgur.com/mtimFxh.png)

The values in the original data are _Red_, _Yellow_, and _Green_.  The corresponding one-hot encoding contains separate columns for each possible value.  Wherever the original value was _Red_, we put a 1 in the _Red_ column; if the original value was _Yellow_, we put a 1 in the _Yellow_ column, and so on.  
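
The same idea can be sketched in a few lines of pandas. The toy `colors` DataFrame below is an assumption made for illustration (loosely modeled on the figure), and `dtype=int` is passed so the new columns hold 0/1 values rather than booleans.

```python
import pandas as pd

# Hypothetical toy column loosely modeled on the figure
colors = pd.DataFrame({"Color": ["Red", "Red", "Yellow", "Green", "Yellow"]})

# One new column per distinct value; dtype=int yields 0/1 instead of True/False
one_hot = pd.get_dummies(colors, columns=["Color"], dtype=int)

print(one_hot)
```

Each row contains exactly one 1, sitting in the column that matches the original value.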

One-hot encoding works very well unless your categorical variable takes on a large number of values. As a rule of thumb, you generally won't use it for variables taking more than 15 different values.
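
One way to check this before encoding is to count the distinct values in each column. The sketch below uses a made-up DataFrame `df`, and the constant `MAX_CARDINALITY` is a hypothetical name for the rule-of-thumb threshold mentioned above, not a hard limit.

```python
import pandas as pd

# Hypothetical toy data: one column with few distinct values, one with many
df = pd.DataFrame({
    "Color": ["Red", "Green", "Red", "Yellow"] * 5,
    "OwnerName": [f"owner_{i}" for i in range(20)],
})

MAX_CARDINALITY = 15  # rule-of-thumb threshold, adjust to taste

# Number of unique values per column
cardinality = df.nunique()

cols_to_one_hot = cardinality[cardinality <= MAX_CARDINALITY].index.tolist()
cols_to_skip = cardinality[cardinality > MAX_CARDINALITY].index.tolist()

print(cardinality)
print("One-hot encode:", cols_to_one_hot)
print("Too many values:", cols_to_skip)
```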

# Calculating One-Hot Encodings

Let's see this in code.  We'll work with a dataset containing housing characteristics and skip the basic data set-up code.  The end result is:
- The housing characteristics for the training data are stored in a DataFrame `train_predictors`.  We'll use it to predict home prices in a Series called `target`.  
- The housing characteristics for the test data are stored in a DataFrame `test_predictors`.

In [None]:
import pandas as pd

# read the data
train_data = pd.read_csv('../input/train.csv')
test_data = pd.read_csv('../input/test.csv')

# separate predictors from target
train_data.dropna(axis=0, subset=['SalePrice'], inplace=True)
target = train_data.SalePrice

# drop columns with missing values (simplest approach)
cols_with_missing = [col for col in train_data.columns 
                                 if train_data[col].isnull().any()]                                  
candidate_train_predictors = train_data.drop(['Id', 'SalePrice'] + cols_with_missing, axis=1)
candidate_test_predictors = test_data.drop(['Id'] + cols_with_missing, axis=1)

# "cardinality" means the number of unique values in a column.
# select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in candidate_train_predictors.columns if 
                                candidate_train_predictors[cname].nunique() < 10 and
                                candidate_train_predictors[cname].dtype == "object"]

# select numeric columns
numeric_cols = [cname for cname in candidate_train_predictors.columns if 
                                candidate_train_predictors[cname].dtype in ['int64', 'float64']]

# define train and test predictors with selected columns
my_cols = low_cardinality_cols + numeric_cols
train_predictors = candidate_train_predictors[my_cols]
test_predictors = candidate_test_predictors[my_cols]

Pandas assigns a data type (called a dtype) to each column or Series.  Let's see a random sample of dtypes from our prediction data:

In [None]:
train_predictors.dtypes.sample(10)

**Object** indicates a column has text (there are other things it could theoretically be, but that's unimportant for our purposes). It's most common to one-hot encode these "object" columns, since they can't be plugged directly into most models.  

Pandas offers a convenient function called [`get_dummies`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) to get one-hot encodings. Call it like this:

In [None]:
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)

# print the first five rows of the one-hot encoding
one_hot_encoded_training_predictors.head()

# Comparing Approaches

Alternatively, we could have dropped the columns with categorical data. To see how the approaches compare, we can calculate the mean absolute error (MAE) of models built with two alternative sets of predictors:
1. One-hot encoded categoricals as well as numeric predictors
2. Numerical predictors, where we drop categoricals.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

def get_mae(X, y):
    # cross_val_score returns negative MAE (sklearn's "higher is better" convention),
    # so multiply by -1 to report a positive MAE
    return -1 * cross_val_score(RandomForestRegressor(n_estimators=50),
                                X, y,
                                scoring='neg_mean_absolute_error').mean()

predictors_without_categoricals = train_predictors.select_dtypes(exclude=['object'])

mae_without_categoricals = get_mae(predictors_without_categoricals, target)

mae_one_hot_encoded = get_mae(one_hot_encoded_training_predictors, target)

print('Mean Absolute Error when Dropping Categoricals: ' + str(int(mae_without_categoricals)))
print('Mean Absolute Error with One-Hot Encoding: ' + str(int(mae_one_hot_encoded)))

One-hot encoding usually helps, but it varies on a case-by-case basis.  In this case, there doesn't appear to be any meaningful benefit from using the one-hot encoded variables.

## Applying to Multiple Files

So far, you've one-hot-encoded your training data.  What about when you have multiple files (e.g., a test dataset, or some other data that you'd like to use to make predictions)?  

Scikit-learn is sensitive to the ordering of columns, so if the training and test datasets get misaligned, your results will be nonsense.  This could happen if a categorical variable takes different values in the training data than in the test data, because `get_dummies` would then create a different set of columns for each.  Thankfully, we can quickly ensure the test data is encoded in the same manner as the training data with the `align` command:

In [None]:
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
one_hot_encoded_test_predictors = pd.get_dummies(test_predictors)
final_train, final_test = one_hot_encoded_training_predictors.align(one_hot_encoded_test_predictors,
                                                                    join='left', 
                                                                    axis=1)

The `align` command makes sure the columns show up in the same order in both datasets: it uses column names to identify which columns line up in each dataset.  
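- The argument `join='left'` specifies that we will do the equivalent of SQL's _left join_.  In this case, if there are columns that show up in one dataset and not the other, we will keep exactly the columns from our training data.  Any training column that is absent from the test data is added to the test data and filled with missing values (`NaN`).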
- The argument `join='inner'` would do the equivalent of SQL's _inner join_, keeping only the columns that appear in both datasets.  That's also a sensible choice.
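
To make the behavior of `join='left'` concrete, here is a minimal sketch on two made-up DataFrames, `train` and `test`, whose `Color` columns contain different values. Passing `fill_value=0` is an optional choice added in this sketch (it is not used in the cell above); without it, the columns added to the test data would contain `NaN`.

```python
import pandas as pd

# Hypothetical toy data: "Blue" appears only in test, "Green"/"Yellow" only in train
train = pd.DataFrame({"Size": [1, 2, 3], "Color": ["Red", "Green", "Yellow"]})
test = pd.DataFrame({"Size": [4, 5], "Color": ["Red", "Blue"]})

train_enc = pd.get_dummies(train, columns=["Color"], dtype=int)
test_enc = pd.get_dummies(test, columns=["Color"], dtype=int)

# join='left' keeps exactly the training columns, in the training order.
# fill_value=0 fills columns that are missing from test with 0 instead of NaN.
final_train, final_test = train_enc.align(test_enc, join="left", axis=1, fill_value=0)

print(final_train.columns.tolist())
print(final_test)
```

The test-only column `Color_Blue` is dropped, while `Color_Green` and `Color_Yellow` are added to the test data as all-zero columns, so both frames end up with identical columns in identical order.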

# Conclusion

The world is filled with categorical data. You will be a much more effective data scientist if you know how to use this common data type!

# Your Turn
