# Machine Learning Preprocessing: Encoding Categorical Data

In this notebook, we explore how to encode categorical data.  Machine learning algorithms often require numerical data, so converting categories such as "<em>small, medium, large</em>" into numbers is an important step in putting data into a format that algorithms can understand.

Sources:
1. <a href='https://www.udemy.com/course/machinelearning/'>Machine Learning A-Z™: Hands-On Python & R In Data Science</a>

In [1]:
# Import support libraries
import os

# Import analytical libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import machine learning support
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

## Load & Preview Data

In this notebook we will preprocess purchase data.  Our dataset contains information on which clients purchased a product based on a set of features.

In [2]:
# Define data file path
purchases_file_path = os.path.join('Data', 'Data.csv')

# Load data
purchases = pd.read_csv(purchases_file_path)

In [3]:
# Preview data
display(purchases.head())

display(purchases.describe())

display(purchases.isna().sum())

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


Unnamed: 0,Age,Salary
count,9.0,9.0
mean,38.777778,63777.777778
std,7.693793,12265.579662
min,27.0,48000.0
25%,35.0,54000.0
50%,38.0,61000.0
75%,44.0,72000.0
max,50.0,83000.0


Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

In [4]:
# Define features & labels
X = purchases.drop(columns='Purchased').values
y = purchases['Purchased'].values

## Address Missing Values

In [5]:
# Create imputer object
imputer = SimpleImputer(missing_values = np.nan, strategy='mean')

In [6]:
# Fit imputer object to the age & salary features, which are in the columns at index 1 and 2
imputer.fit(X[:, 1:3])

# Apply the imputer transformation based on fitted data
X[:, 1:3] = imputer.transform(X[:, 1:3])
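As a quick check of what mean imputation does, the sketch below uses a small made-up age column (not the notebook's data) and confirms that the value SimpleImputer fills in is the mean of the observed entries.  The names ages and toy_imputer are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing age (illustration only)
ages = np.array([[44.0], [27.0], [np.nan], [38.0]])

toy_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
ages_filled = toy_imputer.fit_transform(ages)

# The statistic learned during fit is the column mean, computed while ignoring NaN
learned_mean = toy_imputer.statistics_[0]
print(learned_mean, np.nanmean(ages))
print(ages_filled.ravel())
```

Because the mean is learned in fit and applied in transform, the same fitted imputer can later fill new data with the mean of the original data.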

## Encoding Categorical Data

Let's start by previewing our data again.

In [7]:
display(purchases.head())

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


One of our features is the Country, with the options France, Spain, and Germany.  Under the hood, many machine learning algorithms will not understand these strings, and will instead require numbers.  One idea, therefore, may be to simply convert "France" to the number 1, "Spain" to 2, and "Germany" to 3.  In certain models, however, this could lead the algorithm to treat the countries as numerically ranked, that is, to treat Germany, being 3, as the "best," and France, being 1, as the "worst," with Spain in the middle.  In reality there is no ranking to these countries, so this type of encoding can bias our machine learning model.
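To make the ranking problem concrete, here is a minimal sketch in plain Python.  The mapping country_codes is a hypothetical choice made for illustration; any other assignment of integers would have the same issue.

```python
# Hypothetical integer codes for illustration; any assignment has the same problem
country_codes = {'France': 1, 'Spain': 2, 'Germany': 3}

countries = ['France', 'Spain', 'Germany', 'Spain']
encoded = [country_codes[c] for c in countries]
print(encoded)

# Arithmetic on the codes is valid for a model but meaningless for countries:
# the "average" of France (1) and Germany (3) comes out as the code for Spain (2).
average_code = (country_codes['France'] + country_codes['Germany']) / 2
print(average_code == country_codes['Spain'])
```

A model that treats its inputs as quantities can pick up on relationships like this one, even though they are artifacts of the encoding.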

An alternative is to use "one-hot encoding," which creates binary vectors (True or False vectors) for each category, or in other words, turns a column with $n$ categories into $n$ distinct columns.  In our case, our Country column has 3 categories, and one-hot encoding will transform this into 3 distinct columns.  The result will essentially be an "Is France?" column, an "Is Germany?" column, and an "Is Spain?" column; for each row, exactly one of these columns is populated with 1, or "True," and the others with 0, or "False."  France will receive the vector [1,0,0], Germany will receive the vector [0,1,0], and Spain will receive the vector [0,0,1].  Note that scikit-learn's OneHotEncoder orders the categories alphabetically by default, not in the order in which they appear in the data.
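The vectors described above can be reproduced by hand.  The following is a minimal NumPy sketch of the idea, not the way scikit-learn implements it; the array countries is made up for the illustration.

```python
import numpy as np

# Hypothetical toy column (illustration only)
countries = np.array(['France', 'Spain', 'Germany', 'Spain'])

# np.unique returns the distinct values in sorted (alphabetical) order
categories = np.unique(countries)

# Compare every row against every category: one column per category,
# with a 1 where the row matches that category and 0 elsewhere
one_hot = (countries[:, None] == categories[None, :]).astype(int)

print(categories)
print(one_hot)
```

Each row contains exactly one 1, and the column order follows the sorted categories rather than the order of first appearance.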

Next we will create a transformer object.  We will pass the transformer object two arguments:
<ul>
    <li>The transformer configurations, each given as a tuple of three items</li>
    <ul>
        <li>A name for the transformation step (e.g. 'encoder')</li>
        <li>The specific kind of transformation (e.g. <em>one-hot</em> encoding)</li>
        <ul>
            <li>We must specify the specific kind of transformation, as there are other types of encoding besides one-hot encoding.  An encoder object is not always required here: the strings 'passthrough' and 'drop' are also accepted.</li>
        </ul>
        <li>The indices of the columns to which the transformation is applied</li>
    </ul>
    <li>The remainder argument</li>
      <ul>
          <li>By default, a transformer drops the columns we aren't encoding.  We often want the transformer to pass them through unchanged instead.</li>
      </ul>
</ul>

In [8]:
# Create ColumnTransformer object
transformer = ColumnTransformer(
    transformers=[
        ('encoder', OneHotEncoder(), [0])  # the Country is in index position 0
    ],
    remainder='passthrough'  # Leave the remaining columns unchanged
)

We will now examine what this transformation does to our features.  First, let's preview our features pre-transformation.

In [9]:
# View features
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


Now we transform the features and view them again.

In [10]:
# Fit transformer to data
transformer.fit(X)

# Transform data
X = transformer.transform(X)
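The fit and transform calls above can also be combined into a single fit_transform call.  The sketch below checks this on a small made-up feature array; toy_X and the two transformer names are hypothetical, and sparse_threshold=0 is set only so that this toy example always returns a dense array.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy features: Country, Age, Salary (illustration only)
toy_X = np.array([['France', 44.0, 72000.0],
                  ['Spain', 27.0, 48000.0],
                  ['Germany', 30.0, 54000.0]], dtype=object)

def make_transformer():
    # Same configuration as the notebook, plus sparse_threshold=0 for dense output
    return ColumnTransformer(
        transformers=[('encoder', OneHotEncoder(), [0])],
        remainder='passthrough',
        sparse_threshold=0,
    )

# Two steps: learn the categories, then apply the encoding
two_step = make_transformer()
two_step.fit(toy_X)
result_two_step = two_step.transform(toy_X)

# One step: fit_transform does both at once
result_one_step = make_transformer().fit_transform(toy_X)

print(result_one_step.shape)
print(np.array_equal(result_two_step, result_one_step))
```

Keeping fit and transform separate remains useful when the fitted transformer must later be applied to new data.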

In [11]:
# Preview features
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


Above we see that each category received a binary vector.  Each row with France now has 3 columns in place of the single Country column, with only the "Is France?" column set to 1, and so on for the other categories.  Note too that even though Spain appears before Germany in the data, Spain received the vector [0,0,1], and Germany received [0,1,0], as Germany comes before Spain alphabetically.
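Rather than inferring the column order from the output, we can ask the fitted encoder for it.  The sketch below refits a transformer on a small made-up array (toy_X and toy_transformer are hypothetical names) and reads the learned categories from the fitted OneHotEncoder.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy features: Country, Age (illustration only)
toy_X = np.array([['France', 44.0],
                  ['Spain', 27.0],
                  ['Germany', 30.0],
                  ['Spain', 38.0]], dtype=object)

toy_transformer = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])],
    remainder='passthrough',
)
toy_transformer.fit(toy_X)

# Look up the fitted encoder by the name given in the transformers list
fitted_encoder = toy_transformer.named_transformers_['encoder']

# categories_ holds one array per encoded column, in output column order
country_categories = list(fitted_encoder.categories_[0])
print(country_categories)
```

The position of each country in this list is the position of its "Is ...?" column in the transformed features.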

Let's also stop to consider whether we should address missing data first, or encode first.  When we filled in missing numerical data earlier, we did so by computing the mean of the column.  We could not do the same with categorical data.  Our "Is France?" column has been populated with either 1 (True) or 0 (False).  Suppose, for example, that 5 out of 10 rows have the value 1.  If we encoded before handling missing data and then imputed with the mean, we could end up filling a missing entry with 0.5, which is nonsensical for binary vectors (True or False vectors).  When addressing missing categorical data, one method is to simply fill in missing fields with the most common value in that column.
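As a minimal sketch of the most-common-value approach, SimpleImputer also offers a 'most_frequent' strategy that works on string columns.  The column below is made up for illustration, since our dataset has no missing countries.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical Country column with one missing entry (illustration only)
toy_countries = np.array([['France'],
                          ['Spain'],
                          [np.nan],
                          ['France'],
                          ['Germany']], dtype=object)

# 'most_frequent' fills each missing entry with the column's most common value
mode_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
countries_filled = mode_imputer.fit_transform(toy_countries)

print(countries_filled.ravel())
```

Here France appears twice and every other country once, so the missing entry is filled with France before any encoding takes place.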

## Label Encoding Categorical Data

Label Encoding is another type of encoding, used more often with columns of only 2 categories.

We used one-hot encoding above, because we did not want to apply a numerical ranking to our categories.  Our label though is "Yes" or "No," and we can afford to simply convert "No" to 0, and "Yes" to 1, without biasing a machine learning model.

In [12]:
# Initialize label encoder object
label_encoder = LabelEncoder()

In general, label encoding is more straightforward than one-hot encoding.  As we are only concerned with a single column, we do not need to specify column indices or a remainder argument as we did with one-hot encoding.

In [13]:
# Fit label encoder to labels
label_encoder.fit(y)

LabelEncoder()

We will now view our labels before and after applying label encoding.

In [14]:
# View labels
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

In [15]:
# Transform labels
y = label_encoder.transform(y)

In [16]:
# View labels
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
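To confirm which integer corresponds to which label, we can inspect the fitted encoder and reverse the transformation.  The sketch below uses a short made-up label array (toy_labels and toy_encoder are hypothetical names) so that it runs on its own.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels (illustration only)
toy_labels = np.array(['No', 'Yes', 'No', 'Yes'], dtype=object)

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_labels)

# classes_ is sorted; each label's position in it is its integer code
print(toy_encoder.classes_)
print(toy_codes)

# inverse_transform maps the integers back to the original labels
decoded = toy_encoder.inverse_transform(toy_codes)
print(decoded)
```

Because the classes are sorted, "No" maps to 0 and "Yes" maps to 1, matching the output above.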