# Categorical Data

Before looking at the code, read this article on one-hot encoding for categorical data: https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/

In this code demonstration, we look at categorical data. 

Machine learning models generally require their input data to be represented numerically.
To transform categorical data into numerical data, we can use:

1. Ordinal encoding, when there is an ordering in the data. Each category is mapped to an integer.
2. One-hot encoding, when there is no ordering in the data. Each category gets its own binary variable.
3. Dummy variable encoding, which is like one-hot encoding but represents C categories with C-1 binary variables. Dummy variable encoding should be used with a linear regression model, because the full set of one-hot columns is redundant with the model's intercept.
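
To see why dummy variable encoding suits linear regression, here is a minimal sketch using only NumPy. The small `one_hot` matrix and the names `design_full` and `design_dummy` are made up for illustration and are not part of the demonstration below. The idea is that the one-hot columns always sum to one, so together with an intercept column they are linearly dependent.

```python
import numpy as np

# Hypothetical one-hot encoding of four samples (columns: blue, green, red).
one_hot = np.array([[0., 0., 1.],
                    [0., 1., 0.],
                    [1., 0., 0.],
                    [0., 0., 1.]])

# A linear regression with an intercept adds a column of ones to the design matrix.
intercept = np.ones((one_hot.shape[0], 1))
design_full = np.hstack([intercept, one_hot])

# Dummy encoding drops one category column (here the first one, "blue").
design_dummy = np.hstack([intercept, one_hot[:, 1:]])

# The full design matrix has 4 columns but only rank 3: one column is redundant.
rank_full = np.linalg.matrix_rank(design_full)
# The dummy-encoded design matrix has 3 columns and rank 3: no redundancy.
rank_dummy = np.linalg.matrix_rank(design_dummy)

print(design_full.shape[1], rank_full)
print(design_dummy.shape[1], rank_dummy)
```

Dropping one column removes the redundancy without losing information, since the dropped category is implied when all remaining columns are zero.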

In [1]:
from numpy import asarray

from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

## Ordinal encoding

In [2]:
data_ordinal = asarray([['red'], ['green'], ['blue']])
print(data_ordinal)

[['red']
 ['green']
 ['blue']]


In [3]:
ordinal_encoder = OrdinalEncoder()
# Uncomment the line below to specify the order of the categories explicitly, then rerun the cell to test it.
# ordinal_encoder = OrdinalEncoder(categories=[['red', 'green', 'blue']])
result_ordinal = ordinal_encoder.fit_transform(data_ordinal)
print(result_ordinal)

[[2.]
 [1.]
 [0.]]


In [4]:
ordinal_encoder.categories_

[array(['blue', 'green', 'red'], dtype='<U5')]
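
By default the categories are sorted alphabetically, which is why `blue` is encoded as 0. The following sketch shows what changes when the order is given explicitly, and how `inverse_transform` maps the integers back to labels. The variable names `explicit_encoder`, `encoded` and `decoded` are chosen here for illustration only.

```python
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

data = asarray([['red'], ['green'], ['blue']])

# Pass the desired order explicitly: red -> 0, green -> 1, blue -> 2.
explicit_encoder = OrdinalEncoder(categories=[['red', 'green', 'blue']])
encoded = explicit_encoder.fit_transform(data)
print(encoded)

# Map the integer codes back to the original labels.
decoded = explicit_encoder.inverse_transform(encoded)
print(decoded)
```

Specifying the order matters whenever the integers are meant to carry meaning, for example `['low', 'medium', 'high']`, where alphabetical sorting would give the wrong ranking.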

## Categorical data: one-hot encoding

In [5]:
data_categorical = asarray([['red'], ['green'], ['blue']])
print(data_categorical)

[['red']
 ['green']
 ['blue']]


In [6]:
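# Note: scikit-learn >= 1.2 names this parameter sparse_output; older versions use sparse=False instead.
onehot_encoder = OneHotEncoder(sparse_output=False)
# Uncomment the line below to specify the order of the categories explicitly, then rerun the cell to test it.
# onehot_encoder = OneHotEncoder(sparse_output=False, categories=[['red', 'green', 'blue']])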

result_onehot = onehot_encoder.fit_transform(data_categorical)
print(result_onehot)

[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]


In [7]:
onehot_encoder.categories_

[array(['blue', 'green', 'red'], dtype='<U5')]

## Categorical data: dummy encoding

In [8]:
data_categorical = asarray([['red'], ['green'], ['blue']])
print(data_categorical)

[['red']
 ['green']
 ['blue']]


In [9]:
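# drop='first' removes the column of the first category, leaving C-1 binary variables.
# Note: scikit-learn >= 1.2 names the parameter sparse_output; older versions use sparse=False instead.
dummy_encoder = OneHotEncoder(sparse_output=False, drop='first')
# Uncomment the line below to specify the order of the categories explicitly, then rerun the cell to test it.
# dummy_encoder = OneHotEncoder(sparse_output=False, drop='first', categories=[['red', 'green', 'blue']])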

result_dummy = dummy_encoder.fit_transform(data_categorical)
print(result_dummy)

[[0. 1.]
 [1. 0.]
 [0. 0.]]


In [10]:
dummy_encoder.categories_

[array(['blue', 'green', 'red'], dtype='<U5')]
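
Note that `categories_` still lists all three categories, even though the result has only two columns. The sketch below shows one way to check which category was dropped, using the encoder's `drop_idx_` attribute, and that the all-zero row is still decoded correctly. The variable names are chosen for illustration only.

```python
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder

data = asarray([['red'], ['green'], ['blue']])

encoder = OneHotEncoder(drop='first')
encoded = encoder.fit_transform(data).toarray()  # dense copy of the sparse result
print(encoded)

# drop_idx_ holds, per feature, the index into categories_ of the dropped category.
dropped = encoder.categories_[0][encoder.drop_idx_[0]]
print(dropped)

# The all-zero row is mapped back to the dropped category.
decoded = encoder.inverse_transform(encoded)
print(decoded)
```

Because the dropped category is represented by all zeros, no information is lost: C-1 columns are enough to tell C categories apart.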