In this tutorial, you will learn what a **categorical variable** is, along with three approaches for handling this type of data.


# Introduction

A **categorical variable** takes only a limited number of values.  

- Consider a survey that asks how often you eat breakfast and provides four options: "Never", "Rarely", "Most days", or "Every day".  In this case, the data is categorical, because responses fall into a fixed set of categories.
- If people responded to a survey about which brand of car they owned, the responses would fall into categories like "Honda", "Toyota", and "Ford".  In this case, the data is also categorical.

You will get an error if you try to plug these variables into most machine learning models in Python without preprocessing them first.  In this tutorial, we'll compare three approaches that you can use to prepare your categorical data.

# Three Approaches

### 1) Drop Categorical Variables

The easiest approach to dealing with categorical variables is to simply remove them from the dataset.  This approach will only work well if the columns do not contain useful information.

### 2) Label Encoding

**Label encoding** assigns each unique value to a different integer.

![label encoding simple example](./images/tut2_labelencode.png)

This approach assumes an ordering of the categories: "Never" (0) < "Rarely" (1) < "Most days" (2) < "Every day" (3).

This assumption makes sense in this example, because there is an indisputable ranking to the categories.  Not all categorical variables have a clear ordering in the values, but we refer to those that do as **ordinal variables**.  For tree-based models (like decision trees and random forests), you can expect label encoding to work well with ordinal variables. 
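
To make the ordering explicit, the integers can be assigned from a hand-written category list. The sketch below is illustrative only: the `breakfast` responses and the `order` list are made-up stand-ins for the survey example and are not part of the housing data used later.

```python
import pandas as pd

# Hypothetical survey responses (illustrative data only)
breakfast = pd.Series(["Never", "Every day", "Rarely", "Most days", "Never"])

# The ranking we want the integers to respect, from lowest to highest
order = ["Never", "Rarely", "Most days", "Every day"]

# An ordered Categorical stores each value as its position in `order`
codes = pd.Categorical(breakfast, categories=order, ordered=True).codes
encoded = pd.Series(codes, index=breakfast.index)

print(encoded.tolist())  # [0, 3, 1, 2, 0]
```

Writing the category list by hand is what guarantees that "Never" maps to 0 and "Every day" maps to 3.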

### 3) One-Hot Encoding

**One-hot encoding** creates new columns indicating the presence (or absence) of each possible value in the original data.  To understand this, we'll work through an example.

![one-hot encoding simple example](./images/tut2_onehot.png)

In the original dataset, "Color" is a categorical variable with three categories: "Red", "Yellow", and "Green".  The corresponding one-hot encoding contains one column for each possible value, and one row for each row in the original dataset.  Wherever the original value was "Red", we put a 1 in the "Red" column; if the original value was "Yellow", we put a 1 in the "Yellow" column, and so on.  
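
A minimal sketch of the same idea in pandas, assuming a made-up `Color` column (the rows are illustrative and are not taken from the figure):

```python
import pandas as pd

# Hypothetical data with one categorical column
colors = pd.DataFrame({"Color": ["Red", "Red", "Yellow", "Green", "Yellow"]})

# One new 0/1 column per distinct value in the original column
one_hot = pd.get_dummies(colors["Color"], dtype=int)

# get_dummies sorts the new columns; reorder them to match the text
one_hot = one_hot[["Red", "Yellow", "Green"]]

print(one_hot)
```

Each row of `one_hot` contains exactly one 1, in the column named after that row's original value.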

In contrast to label encoding, one-hot encoding *does not* assume an ordering of the categories.  Thus, you can expect this approach to work particularly well if there is no clear ordering in the categorical data (e.g., "Red" is neither _more_ nor _less_ than "Yellow").  We refer to categorical variables without an intrinsic ranking as **nominal variables**.

One-hot encoding generally does not perform well if the categorical variable takes on a large number of values (i.e., you generally won't use it for variables taking more than 15 different values). 
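
One way to see why large numbers of values are a problem is to count the distinct values in each column before encoding. The sketch below uses a made-up DataFrame; the `ZipCode` column is a hypothetical stand-in for a variable that takes many different values.

```python
import pandas as pd

# Hypothetical data: "Color" repeats values, "ZipCode" is different in every row
df = pd.DataFrame({
    "Color": ["Red", "Yellow", "Green", "Red"],
    "ZipCode": ["98101", "10001", "60601", "94105"],
})

# Number of distinct values per column
cardinality = df.nunique()
print(cardinality)

# One-hot encoding adds one column per distinct value
one_hot = pd.get_dummies(df, columns=["Color", "ZipCode"])
print(one_hot.shape[1])  # 7 columns: 3 for Color + 4 for ZipCode
```

With real data, a column like this could add hundreds of mostly empty columns, which is why a check of this kind is often done first.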

# Example

We'll work with a dataset containing housing characteristics and skip the basic data set-up code.  The end result is:
- The housing characteristics for the training data are stored in a DataFrame `train_predictors`.  We'll use them to predict home prices, which are stored in a Series called `target`.  
- The housing characteristics for the test data are stored in a DataFrame `test_predictors`.

In [None]:
import pandas as pd

# read the data
train_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

# separate predictors from target
train_data.dropna(axis=0, subset=['SalePrice'], inplace=True)
target = train_data.SalePrice

# drop columns with missing values (simplest approach)
cols_with_missing = [col for col in train_data.columns 
                                 if train_data[col].isnull().any()]                                  
candidate_train_predictors = train_data.drop(['Id', 'SalePrice'] + cols_with_missing, axis=1)
candidate_test_predictors = test_data.drop(['Id'] + cols_with_missing, axis=1)

# "cardinality" means the number of unique values in a column.
# select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in candidate_train_predictors.columns if 
                                candidate_train_predictors[cname].nunique() < 10 and
                                candidate_train_predictors[cname].dtype == "object"]

# select numeric columns
numeric_cols = [cname for cname in candidate_train_predictors.columns if 
                                candidate_train_predictors[cname].dtype in ['int64', 'float64']]

# define train and test predictors with selected columns
my_cols = low_cardinality_cols + numeric_cols
train_predictors = candidate_train_predictors[my_cols]
test_predictors = candidate_test_predictors[my_cols]

We take a peek at the training predictors with the `head` method below.  Many of the first several columns in the DataFrame are categorical. 

In [None]:
train_predictors.head()

Next, we obtain a list of all of the categorical variables in the training data.

We do this by checking the data type (or **dtype**) of each column.  The `object` dtype indicates a column has text (there are other things it could theoretically be, but that's unimportant for our purposes).  For this dataset, the columns with text indicate categorical variables.

In [None]:
# get list of categorical variables
s = (train_predictors.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)
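
The `s[s].index` idiom in the cell above can be hard to read at first. Here is a small sketch of how it works on a made-up DataFrame; the column names and values are illustrative, and the text column is given the `object` dtype explicitly so the example does not depend on how a particular pandas version infers string columns.

```python
import pandas as pd

# Hypothetical toy data: one text column and two numeric columns
toy = pd.DataFrame({
    "Street": pd.Series(["Pave", "Grvl"], dtype="object"),
    "LotArea": [8450, 9600],
    "Price": [1.5, 2.5],
})

# Boolean Series indexed by column name: True where the dtype is object
is_object = (toy.dtypes == "object")

# Indexing a boolean Series by itself keeps only the True entries,
# so the remaining index holds the names of the object columns
object_cols = list(is_object[is_object].index)

print(object_cols)  # ['Street']
```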

### Define Function to Measure Quality of Each Approach

We define a function `score_dataset_cv` to compare the three different approaches to dealing with categorical variables. This function reports the [mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error) (MAE) from a random forest model.  Better approaches will have lower MAE!

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

def score_dataset_cv(X, y):
    # sklearn scorers follow a "higher is better" convention, so MAE is returned
    # as a negative number; multiply by -1 to report a positive MAE
    # (you'll learn about cross-validation in a later tutorial)
    return -1 * cross_val_score(RandomForestRegressor(n_estimators=50, random_state=0), 
                                X, y, 
                                scoring = 'neg_mean_absolute_error').mean()
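
For intuition about what the score means, MAE is the average of the absolute differences between predicted and actual values. The sketch below computes it by hand; the prices and predictions are made-up numbers, not results from the housing model.

```python
import numpy as np

# Hypothetical actual and predicted home prices
y_true = np.array([200_000, 150_000, 320_000])
y_pred = np.array([210_000, 140_000, 300_000])

# Absolute error for each home: 10,000, 10,000 and 20,000
abs_errors = np.abs(y_true - y_pred)

# MAE is the mean of those absolute errors
mae = abs_errors.mean()
print(round(mae, 2))  # 13333.33
```

An MAE of about 13,333 here means the predictions are off by roughly that many dollars on average.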

### Score from Approach 1 (Drop Categorical Variables)

We drop the `object` columns with the [`select_dtypes`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html) method. 

In [None]:
drop_train_predictors = train_predictors.select_dtypes(exclude=['object'])

print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset_cv(drop_train_predictors, target))

### Score from Approach 2 (Label Encoding)

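Scikit-learn has a [`LabelEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) class that can be used to get label encodings.  We loop over the categorical variables and apply the label encoder separately to each column.  Note that `LabelEncoder` numbers the values in sorted (alphabetical) order, which need not match any meaningful ranking of the categories.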

In [None]:
from sklearn.preprocessing import LabelEncoder

# make copy to avoid changing original data 
label_train_predictors = train_predictors.copy()

# apply label encoder to each column with categorical data
for col in object_cols:
    label_train_predictors[col] = LabelEncoder().fit_transform(train_predictors[col])

print("MAE from Approach 2 (Label Encoding):") 
print(score_dataset_cv(label_train_predictors, target))
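
To see which integers `LabelEncoder` actually assigns, it can help to try it on the breakfast example from earlier. The sketch below also shows scikit-learn's `OrdinalEncoder` with a hand-written category list as one possible way to impose a specific order. The `answers` data is made up, and this is an illustration rather than the approach used for the housing columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical survey responses, listed from lowest to highest frequency
answers = ["Never", "Rarely", "Most days", "Every day"]

# LabelEncoder sorts the distinct values, so the integers follow
# alphabetical order: "Every day" < "Most days" < "Never" < "Rarely"
alphabetical = LabelEncoder().fit_transform(answers)
print(alphabetical.tolist())  # [2, 3, 1, 0]

# OrdinalEncoder accepts an explicit ordering for each column
order = ["Never", "Rarely", "Most days", "Every day"]
encoder = OrdinalEncoder(categories=[order])
ranked = encoder.fit_transform(pd.DataFrame({"Breakfast": answers})).ravel()
print(ranked.tolist())  # [0.0, 1.0, 2.0, 3.0]
```

For tree-based models the difference is often small, but when the ranking carries meaning, spelling out the order keeps the integers interpretable.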

### Score from Approach 3 (One-Hot Encoding)

Pandas offers a convenient function called [`get_dummies`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) to get one-hot encodings.  Note that we can pass the entire `train_predictors` DataFrame to the function, which automatically detects and encodes the categorical variables.

In [None]:
# one-hot encode categorical data
OH_train_predictors = pd.get_dummies(train_predictors)

print("MAE from Approach 3 (One-Hot Encoding):") 
print(score_dataset_cv(OH_train_predictors, target))

In this case, since the returned MAE scores are close in value, there doesn't appear to be any meaningful benefit to one approach over the others.

One-hot encoding (**Approach 3**) typically performs best, and dropping the categorical columns (**Approach 1**) typically performs worst, but results vary from case to case. 

# Final Note: Applying to Multiple Files

So far, you've one-hot-encoded your training data.  What about when you have multiple files (e.g., a test dataset, or some other data that you'd like to use to make predictions)?  

Scikit-learn is sensitive to the ordering of columns, so if the training and test datasets get misaligned, your results will be nonsense.  This could happen if a categorical variable takes a different set of values in the training data than in the test data.  Thankfully, we can quickly ensure the test data is encoded in the same manner as the training data with the `align` method:

In [None]:
OH_test_predictors = pd.get_dummies(test_predictors)
final_train, final_test = OH_train_predictors.align(OH_test_predictors,
                                                    join='left', 
                                                    axis=1)

The `align` method makes sure the columns show up in the same order in both datasets: it uses column names to identify which columns line up in each dataset.  
- The argument `join='left'` ensures that if there are columns that show up in one dataset and not the other, we will keep exactly the columns from our training data.  Any training column that is absent from the test data is added to the test data and filled with missing values (`NaN`).
- The argument `join='inner'` keeps only the columns that appear in both datasets. That's also a sensible choice.
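
The difference between the two `join` options is easiest to see on a tiny example. The sketch below uses two made-up DataFrames in which the test data lacks some training categories and contains one the training data never saw. Filling the resulting gaps with 0 is one reasonable choice, not the only one.

```python
import pandas as pd

# Hypothetical train/test data with different sets of colors
train = pd.DataFrame({"Color": ["Red", "Yellow", "Green"], "Size": [1, 2, 3]})
test = pd.DataFrame({"Color": ["Red", "Blue"], "Size": [4, 5]})

oh_train = pd.get_dummies(train, columns=["Color"], dtype=int)
oh_test = pd.get_dummies(test, columns=["Color"], dtype=int)

# join='left': keep exactly the training columns, in training order.
# "Color_Blue" is dropped from the test data; "Color_Green" and
# "Color_Yellow" are added to the test data as NaN.
left_train, left_test = oh_train.align(oh_test, join="left", axis=1)
print(left_test)

# A category that never occurs in the test data is simply "absent",
# so 0 is a natural fill value for the new columns
left_test_filled = left_test.fillna(0)

# join='inner': keep only the columns present in both datasets
inner_train, inner_test = oh_train.align(oh_test, join="inner", axis=1)
print(list(inner_train.columns))
```

With `join='left'` the model sees every column it was trained on, while `join='inner'` discards any category that is missing from either dataset.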

If you're familiar with [SQL](https://en.wikipedia.org/wiki/SQL), you may have noticed that `join='left'` is the equivalent of SQL's left join, and `join='inner'` is the equivalent of SQL's inner join.  If you are not familiar with SQL, you can learn more through [the SQL mini-course](https://www.kaggle.com/learn/sql) **_after completing this mini-course_**!

# Conclusion

The world is filled with categorical data. You will be a much more effective data scientist if you know how to use this common data type!

# Your Turn
