# How to Encode Categorical Data

While some machine learning algorithms can work with categorical data directly (e.g., decision trees), many other models require input and/or output variables to be numeric. This means that if your data contains categorical variables, you must encode them as numbers before you can fit and evaluate a model. The two most popular techniques are:

- **Ordinal encoding**, for categorical features that have a natural rank order
- **One Hot encoding**, for categorical features that do not have a natural rank order

In this tutorial, you will discover how to use both encoding schemes for categorical machine learning data.

## Encoding Categorical Data

### Ordinal Encoding

Ordinal Encoding should be used for categorical data that has a natural rank order. For instance:
    
- Place: first, second, third
- Stage: A, B, C
- Win: Bronze, Silver, Gold
    
An integer ordinal encoding is a natural encoding for ordinal variables. This ordinal encoding transform is available in the scikit-learn Python machine learning library via the `OrdinalEncoder` class.

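**By default, it sorts the unique labels found in each column and assigns integers in that sorted order, regardless of the order in which the labels appear in the data. If a specific order is desired, it can be specified via the `categories` argument as a list with the rank order of all expected labels.**

In other words, the categories are sorted first and then numbers are applied. **For strings, this means the labels are sorted alphabetically, so first=0, second=1, and third=2.** Here the alphabetical order happens to coincide with the rank order; for labels such as Bronze, Silver, Gold it would not.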

In [4]:
# Example
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder 


data = asarray([['first'], ['second'], ['third']])

data

array([['first'],
       ['second'],
       ['third']], dtype='<U6')

In [5]:
# define ordinal encoding
encoder = OrdinalEncoder()

# transform data
result = encoder.fit_transform(data)

In [6]:
result

array([[0.],
       [1.],
       [2.]])
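The effect of the `categories` argument can be sketched with the medal labels listed earlier. This is a minimal illustration, assuming a small made-up array named `medals`; it compares the default alphabetical codes with codes that follow an explicit rank order.

```python
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample data, not part of the tutorial dataset
medals = asarray([['Gold'], ['Bronze'], ['Silver'], ['Gold']])

# Default behaviour: labels are sorted alphabetically,
# so Bronze=0, Gold=1, Silver=2 (not the real rank order)
default_codes = OrdinalEncoder().fit_transform(medals)

# Explicit rank order: one list of categories per input column
ranked_encoder = OrdinalEncoder(categories=[['Bronze', 'Silver', 'Gold']])
ranked_codes = ranked_encoder.fit_transform(medals)

print(default_codes.ravel())  # [1. 0. 2. 1.]
print(ranked_codes.ravel())   # [2. 0. 1. 2.]
```

With the explicit list, larger integers correspond to better medals, which is the property an ordinal encoding is meant to preserve.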

**Note:** This OrdinalEncoder class is intended for input variables that are organized into rows and columns, e.g. a matrix. If a categorical target variable needs to be encoded for a classification predictive modeling problem, then the `LabelEncoder` class can be used. It does the same thing as the OrdinalEncoder, although it expects a one-dimensional input for the single target variable.
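As a small sketch of the note above, the following encodes a one-dimensional target with `LabelEncoder`. The list `labels` is a made-up sample that reuses the two class names from the dataset introduced later in this tutorial.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical one-dimensional target, for illustration only
labels = ['recurrence-events', 'no-recurrence-events', 'recurrence-events']

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(labels)

# Classes are stored in sorted order; the position is the integer code
print(label_encoder.classes_)
print(encoded)  # [1 0 1]

# The mapping can be reversed to recover the original labels
decoded = label_encoder.inverse_transform(encoded)
```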

### One Hot Encoding

For categorical variables where no ordinal relationship exists, an integer encoding may not be enough, or may even mislead the model by implying an order that does not exist. In this case, a one hot encoding can be applied. This is where **the integer encoded variable is removed and one new binary variable is added for each unique integer value in the variable**.
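To make the idea concrete, here is a minimal NumPy-only sketch of that transformation, assuming a made-up array of color labels. It is not how scikit-learn implements the encoder; it only shows the integer codes being replaced by one binary column per category.

```python
import numpy as np

# Hypothetical sample labels
colors = np.array(['red', 'green', 'blue'])

# Step 1: integer encode (categories come back sorted alphabetically)
categories, codes = np.unique(colors, return_inverse=True)

# Step 2: replace each integer with a row containing a single 1
onehot = np.eye(len(categories))[codes]

print(categories)  # ['blue' 'green' 'red']
print(codes)       # [2 1 0]
print(onehot)
```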

This one hot encoding transform is available in the scikit-learn Python machine learning library via the `OneHotEncoder` class. We can demonstrate the usage of the OneHotEncoder on the color categories: 

- Color: 'red', 'green', 'blue'

First the categories are sorted, in this case alphabetically because they are strings: 

['blue', 'green', 'red']

Then binary variables are created for each category in turn. This means **blue will be represented as [1, 0, 0], with a 1 in the first binary variable, green as [0, 1, 0], and red as [0, 0, 1]**:

In [8]:
data = asarray([['red'], ['green'], ['blue']])

In [9]:
data

array([['red'],
       ['green'],
       ['blue']], dtype='<U5')

In [13]:
from sklearn.preprocessing import OneHotEncoder

# define one hot encoding
encoder = OneHotEncoder(sparse_output=False)

# transform data
onehot = encoder.fit_transform(data)

In [14]:
onehot

array([[0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.]])

## Breast Cancer Dataset

We will use the Breast Cancer dataset in this tutorial. This dataset classifies breast cancer patient data as either a recurrence or no recurrence of cancer. There are 286 examples and nine input variables. It is a binary classification problem.

In [16]:
import pandas as pd

breast_data = pd.read_csv('data/breast_cancer.csv')

In [17]:
breast_data

Unnamed: 0,age,menopause,tumour-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat,class
0,40-49,premeno,15-19,0-2,yes,3,right,left_up,no,recurrence-events
1,50-59,ge40,15-19,0-2,no,1,right,central,no,no-recurrence-events
2,50-59,ge40,35-39,0-2,no,2,left,left_low,no,recurrence-events
3,40-49,premeno,35-39,0-2,yes,3,right,left_low,yes,no-recurrence-events
4,40-49,premeno,30-34,3-5,yes,2,left,right_up,no,recurrence-events
...,...,...,...,...,...,...,...,...,...,...
281,50-59,ge40,30-34,6-8,yes,2,left,left_low,no,no-recurrence-events
282,50-59,premeno,25-29,3-5,yes,2,left,left_low,yes,no-recurrence-events
283,30-39,premeno,30-34,6-8,yes,2,right,right_up,no,no-recurrence-events
284,50-59,premeno,15-19,0-2,no,2,right,left_low,no,no-recurrence-events


In [21]:
X = breast_data.iloc[:,:-1]

In [22]:
X

Unnamed: 0,age,menopause,tumour-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat
0,40-49,premeno,15-19,0-2,yes,3,right,left_up,no
1,50-59,ge40,15-19,0-2,no,1,right,central,no
2,50-59,ge40,35-39,0-2,no,2,left,left_low,no
3,40-49,premeno,35-39,0-2,yes,3,right,left_low,yes
4,40-49,premeno,30-34,3-5,yes,2,left,right_up,no
...,...,...,...,...,...,...,...,...,...
281,50-59,ge40,30-34,6-8,yes,2,left,left_low,no
282,50-59,premeno,25-29,3-5,yes,2,left,left_low,yes
283,30-39,premeno,30-34,6-8,yes,2,right,right_up,no
284,50-59,premeno,15-19,0-2,no,2,right,left_low,no


In [24]:
y = breast_data.iloc[:,-1]

In [25]:
y


0         recurrence-events
1      no-recurrence-events
2         recurrence-events
3      no-recurrence-events
4         recurrence-events
               ...         
281    no-recurrence-events
282    no-recurrence-events
283    no-recurrence-events
284    no-recurrence-events
285    no-recurrence-events
Name: class, Length: 286, dtype: object

In [26]:
X.shape

(286, 9)

### Encoding features with the OrdinalEncoder


An ordinal encoding involves mapping each unique label to an integer value. This type of encoding is really only appropriate if there is a known relationship between the categories. 

**This relationship does exist for some of the variables in our dataset, and ideally, this should be harnessed when preparing the data. In this case, we will ignore any existing ordinal relationship, treat all variables as plain categorical, and let the encoder assign integers in its default sorted order.**

We can use the `OrdinalEncoder` from scikit-learn to encode each variable to integers. This is a flexible class that allows the order of the categories to be specified as an argument if any such order is known. **As an exercise, you can update the example below to try specifying the order for those variables that have a natural ordering.**

In [30]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder

# Load the data
dataset = pd.read_csv('data/breast_cancer.csv')

# separate into input and output columns
X = dataset.iloc[:,:-1]
y = dataset.iloc[:,-1]

# Ordinal encode input variables
ordinal_encoder = OrdinalEncoder()
X_trans = ordinal_encoder.fit_transform(X)

# Label encode the target variable
label_encoder = LabelEncoder()
y_trans = label_encoder.fit_transform(y)

In [31]:
X_trans

array([[2., 2., 2., ..., 1., 2., 0.],
       [3., 0., 2., ..., 1., 0., 0.],
       [3., 0., 6., ..., 0., 1., 0.],
       ...,
       [1., 2., 5., ..., 1., 4., 0.],
       [3., 2., 2., ..., 1., 1., 0.],
       [3., 0., 7., ..., 0., 4., 0.]])

In [32]:
X_trans = pd.DataFrame(X_trans, columns=dataset.columns[:-1])

In [33]:
X_trans

Unnamed: 0,age,menopause,tumour-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat
0,2.0,2.0,2.0,0.0,1.0,2.0,1.0,2.0,0.0
1,3.0,0.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0
2,3.0,0.0,6.0,0.0,0.0,1.0,0.0,1.0,0.0
3,2.0,2.0,6.0,0.0,1.0,2.0,1.0,1.0,1.0
4,2.0,2.0,5.0,4.0,1.0,1.0,0.0,4.0,0.0
...,...,...,...,...,...,...,...,...,...
281,3.0,0.0,5.0,5.0,1.0,1.0,0.0,1.0,0.0
282,3.0,2.0,4.0,4.0,1.0,1.0,0.0,1.0,1.0
283,1.0,2.0,5.0,5.0,1.0,1.0,1.0,4.0,0.0
284,3.0,2.0,2.0,0.0,0.0,1.0,1.0,1.0,0.0


In [41]:
y_trans = pd.DataFrame(y_trans, columns=[dataset.columns[-1]])

In [42]:
y_trans

Unnamed: 0,class
0,1
1,0
2,1
3,0
4,1
...,...
281,0
282,0
283,0
284,0


As expected, the number of variables is unchanged, but all values are now ordinal encoded integers (the encoder stores them as floats by default).
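For the exercise suggested above, one possible approach is sketched here. It assumes a small made-up frame named `sample` with two of the dataset's column names, and it uses `ColumnTransformer` (a swapped-in helper, not used elsewhere in this tutorial) so that one column gets an explicit rank order while the other keeps the default. The size bands and the `size_order` list are illustrative, not the full set of values in the real data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows, for illustration only
sample = pd.DataFrame({
    'tumour-size': ['5-9', '15-19', '35-39', '30-34'],
    'breast': ['left', 'right', 'right', 'left'],
})

# Default alphabetical sort puts '5-9' last, which breaks the numeric order
default_sizes = OrdinalEncoder().fit_transform(sample[['tumour-size']])
print(default_sizes.ravel())  # [3. 0. 2. 1.]

# Assumed rank order for the size bands present in this sample
size_order = ['5-9', '15-19', '30-34', '35-39']

transformer = ColumnTransformer([
    ('ranked', OrdinalEncoder(categories=[size_order]), ['tumour-size']),
    ('default', OrdinalEncoder(), ['breast']),
])
encoded = transformer.fit_transform(sample)
print(encoded)
```

In the output, the first column follows the numeric order of the size bands, and the second column uses the default codes (left=0, right=1).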

### Encoding features with the OneHotEncoder

A one hot encoding is appropriate for categorical data where no relationship exists between categories. The scikit-learn library provides the `OneHotEncoder` class to automatically one hot encode one or more variables. By default the `OneHotEncoder` will output data with a sparse representation, which is efficient given that most values are 0 in the encoded representation. We will disable this feature by setting the `sparse_output` argument to `False` (named `sparse` in scikit-learn versions before 1.2) so that we can review the effect of the encoding.
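The difference between the two output types can be sketched on the small color example, assuming SciPy is installed alongside scikit-learn. With default settings the result is a SciPy sparse matrix, which can be converted to a dense array when needed.

```python
from numpy import asarray
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

data = asarray([['red'], ['green'], ['blue']])

# Default settings: the result is a SciPy sparse matrix
sparse_result = OneHotEncoder().fit_transform(data)
print(sparse.issparse(sparse_result))  # True

# Convert to a dense NumPy array for inspection
dense_result = sparse_result.toarray()
print(dense_result)
```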

In [44]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# Load the data
dataset = pd.read_csv('data/breast_cancer.csv')

# separate into input and output columns
X = dataset.iloc[:,:-1]
y = dataset.iloc[:,-1]

# Onehot encode input variables
onehot_encoder = OneHotEncoder(sparse_output=False)
X_trans = onehot_encoder.fit_transform(X)

# Label encode the target variable
label_encoder = LabelEncoder()
y_trans = label_encoder.fit_transform(y)

In [48]:
X_trans = pd.DataFrame(X_trans)

In [49]:
X_trans

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,33,34,35,36,37,38,39,40,41,42
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
281,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
282,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
283,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
284,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0


In [50]:
y_trans = pd.DataFrame(y_trans)

In [51]:
y_trans

Unnamed: 0,0
0,1
1,0
2,1
3,0
4,1
...,...
281,0
282,0
283,0
284,0


We would expect the number of rows to remain the same, but the number of columns to increase dramatically. As expected, the number of variables has jumped from 9 to 43, and all values are now binary (0 or 1).

## Summary

In this tutorial, we discovered how to use encoding schemes for categorical machine learning data:

- Encoding is a required pre-processing step when working with categorical data for machine learning algorithms that need numeric input.
- Ordinal encoding suits categorical variables that have a natural rank ordering.
- One hot encoding suits categorical variables that do not have a natural rank ordering.

