# Machine Learning Preprocessing: Encoding Categorical Data

In this notebook, we explore how to encode categorical data.  Machine learning algorithms often require numerical data, and encoding categories such as "<em>small, medium, large</em>" is an important step in converting data into a format that algorithms can understand.

Sources:
1. <a href='https://www.udemy.com/course/machinelearning/'>Machine Learning A-Z™: Hands-On Python & R In Data Science</a>

## Load & Preview Data

In [1]:
# Define file path to data
purchases_file_path  <- file.path('Data','Data.csv')

# Load data
purchases  <- read.csv(purchases_file_path)

In [2]:
cat('Shape:')
dim(purchases)
cat('Figure 1.')

cat('\n\nPreview:')
head(purchases, 5)
cat('Figure 2.')

cat('\n\nStructure:\n')
str(purchases)
cat('Figure 3.')

cat('\n\nSummary:')
summary(purchases)
cat('Figure 4.')

Shape:

Figure 1.

Preview:

| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| | `<chr>` | `<int>` | `<int>` | `<chr>` |
| 1 | France | 44 | 72000 | No |
| 2 | Spain | 27 | 48000 | Yes |
| 3 | Germany | 30 | 54000 | No |
| 4 | Spain | 38 | 61000 | No |
| 5 | Germany | 40 | NA | Yes |


Figure 2.

Structure:
'data.frame':	10 obs. of  4 variables:
 $ Country  : chr  "France" "Spain" "Germany" "Spain" ...
 $ Age      : int  44 27 30 38 40 35 NA 48 50 37
 $ Salary   : int  72000 48000 54000 61000 NA 58000 52000 79000 83000 67000
 $ Purchased: chr  "No" "Yes" "No" "No" ...
Figure 3.

Summary:

   Country               Age            Salary       Purchased        
 Length:10          Min.   :27.00   Min.   :48000   Length:10         
 Class :character   1st Qu.:35.00   1st Qu.:54000   Class :character  
 Mode  :character   Median :38.00   Median :61000   Mode  :character  
                    Mean   :38.78   Mean   :63778                     
                    3rd Qu.:44.00   3rd Qu.:72000                     
                    Max.   :50.00   Max.   :83000                     
                    NA's   :1       NA's   :1                         

Figure 4.

## Address Missing Values

In [3]:
# Replace missing (NA) salary values with the mean salary
purchases$Salary <- ifelse(is.na(purchases$Salary) # If the salary is missing
                          ,ave(purchases$Salary, FUN = function(x) mean(x, na.rm = TRUE)) # then return the mean salary
                          ,purchases$Salary) # else return the current record's salary

# Replace missing (NA) age values with the mean age
purchases$Age <- ifelse(is.na(purchases$Age) # If the age is missing
                          ,ave(purchases$Age, FUN = function(x) mean(x, na.rm = TRUE)) # then return the mean age
                          ,purchases$Age) # else return the current record's age

## Encode Categorical Data

Let's start by previewing our data again.

In [4]:
head(purchases, 5)

| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` | `<chr>` |
| 1 | France | 44 | 72000.0 | No |
| 2 | Spain | 27 | 48000.0 | Yes |
| 3 | Germany | 30 | 54000.0 | No |
| 4 | Spain | 38 | 61000.0 | No |
| 5 | Germany | 40 | 63777.78 | Yes |


One of our features is the Country, with the options France, Spain, and Germany.  Under the hood, many machine learning algorithms will not understand these strings, and will instead require numbers.  We therefore want to convert these textual categories to numbers.  Encoding is the process of converting the text "France" to the number 1, the text "Spain" to the number 2, and the text "Germany" to the number 3.  Linear regression algorithms can work with the numbers 1, 2, and 3 and compute their mean, standard deviation, etc., but they cannot do the same with the text "France," "Spain," and "Germany."

To apply encoding we use the factor() function.  We pass it three arguments:
<ul>
    <li><b>x:</b> The data to transform into factors</li>
    <li><b>levels:</b> The name of each category, as it appears in the data</li>
    <li><b>labels:</b> The label (here, a number) to give to each category, in the same order as <b>levels</b></li>
</ul>
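
As a minimal sketch of how these arguments fit together, the snippet below encodes a small toy vector.  The vector `country` and its values are hypothetical stand-ins for `purchases$Country`, not the notebook's data.

```r
# Toy vector standing in for purchases$Country (hypothetical values)
country <- c("France", "Spain", "Germany", "Spain")

# levels lists the categories found in the data;
# labels gives the replacement for each level, in the same order
country_encoded <- factor(country,
                          levels = c("France", "Spain", "Germany"),
                          labels = c(1, 2, 3))

print(country_encoded)              # a factor, not a numeric vector
print(levels(country_encoded))      # the labels are stored as character strings
print(as.integer(country_encoded))  # the underlying integer codes
```

Note that the result is still a factor: the labels 1, 2, and 3 are stored as level names, and any value of `x` that does not appear in `levels` becomes `NA`.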

In [5]:
# View categorical variables before encoding
purchases$Country
purchases$Purchased

In [6]:
# Encode country features
purchases$Country  <- factor(purchases$Country
                            ,levels=c('France', 'Spain', 'Germany')
                            ,labels=c(1,2,3))

# Encode purchased label
purchases$Purchased  <- factor(purchases$Purchased
                            ,levels=c('Yes', 'No')
                            ,labels=c(1,0))

In [8]:
# View categorical variables after encoding
purchases$Country
purchases$Purchased

In [11]:
# Preview purchases data
head(purchases)

| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| | `<fct>` | `<dbl>` | `<dbl>` | `<fct>` |
| 1 | 1 | 44 | 72000.0 | 0 |
| 2 | 2 | 27 | 48000.0 | 1 |
| 3 | 3 | 30 | 54000.0 | 0 |
| 4 | 2 | 38 | 61000.0 | 0 |
| 5 | 3 | 40 | 63777.78 | 1 |
| 6 | 1 | 35 | 58000.0 | 1 |


## Parking Lot Notes

Earlier we discussed what the encoding process is.  The idea of converting "France" to 1, "Spain" to 2, and "Germany" to 3 is called "label encoding."  In certain models, this could confuse the algorithm into believing that there is a numerical ranking to the countries, that is, that Germany, being 3, is the "best," and France, being 1, is the "worst," with Spain in the middle.  In reality there is no ranking to these countries, so this type of encoding could bias certain machine learning classifiers.
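
To make the ranking concern concrete, the sketch below treats the label-encoded countries as plain numbers.  The named vector `country_code` is a hypothetical illustration, not part of the notebook's data.

```r
# Hypothetical label encoding: France = 1, Spain = 2, Germany = 3
country_code <- c(France = 1, Spain = 2, Germany = 3)

# Treated as numbers, the codes allow arithmetic that has no real meaning:
# the "average" of France and Germany equals the code for Spain
midpoint <- mean(country_code[c("France", "Germany")])
print(midpoint == country_code[["Spain"]])

# ...and Germany appears to be "three times" France
ratio <- country_code[["Germany"]] / country_code[["France"]]
print(ratio)
```

A model that consumes these codes as ordinary numbers would inherit the same artificial relationships between countries.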

An alternative is to use "one-hot encoding," which creates binary vectors (essentially TRUE or FALSE vectors) for each category.  In other words, it turns a column with $n$ categories into $n$ distinct columns.  In our case, the Country column has 3 categories, and one-hot encoding would transform this into 3 distinct columns.  The result would essentially be an "Is France?" column, an "Is Spain?" column, and an "Is Germany?" column; we would then populate one of these columns with 1, or "True," and the others with 0, or "False."  France would receive the vector [1,0,0]; Spain would receive the vector [0,1,0], and Germany would receive the vector [0,0,1].
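
As a minimal sketch of these vectors, base R's `model.matrix()` can build the indicator columns without extra packages.  The data frame `toy` and its values are hypothetical stand-ins for the Country column.

```r
# Toy data standing in for the Country column (hypothetical values)
toy <- data.frame(
  Country = factor(c("France", "Spain", "Germany", "Spain"),
                   levels = c("France", "Spain", "Germany"))
)

# "- 1" drops the intercept so that every level gets its own indicator column
country_one_hot <- model.matrix(~ Country - 1, data = toy)

print(country_one_hot)
```

Each row contains a single 1 in the column for that row's country and 0 elsewhere, matching the [1,0,0], [0,1,0], and [0,0,1] vectors described above.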

Because R's modeling functions, such as lm() for linear regression, automatically expand factor columns into indicator variables, manual one-hot encoding is not necessary, and we may keep the factor encoding on features with more than two categories.  For reference, however, the following shows how you could perform one-hot encoding in R.
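
As a minimal sketch of how `lm()` treats a factor column, the snippet below fits a linear model on toy data and inspects the coefficient names.  The data frame `toy` and its values are hypothetical and serve only to illustrate the behavior.

```r
# Toy data (hypothetical values) with Country stored as a factor
toy <- data.frame(
  Country = factor(c("France", "Spain", "Germany", "Spain", "France", "Germany")),
  Salary  = c(72000, 48000, 54000, 61000, 58000, 52000)
)

# lm() expands the factor into indicator columns on its own
fit <- lm(Salary ~ Country, data = toy)

# With the default contrasts, the first level ("France") is absorbed into
# the intercept and each remaining level gets its own coefficient
print(names(coef(fit)))
```

The model never sees the countries as the numbers 1, 2, and 3, so no artificial ranking is introduced.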

In [3]:
# install.packages("caret", dependencies=TRUE, type="win.binary")
# R.version.string

In [4]:
library(caret)

Loading required package: ggplot2

Loading required package: lattice



In [None]:
# Build a one-hot encoder for every column in purchases
dummy <- dummyVars(~ ., data = purchases)

In [None]:
predict(dummy, newdata=purchases)

In [None]:
data.frame(predict(dummy, newdata=purchases))

In [None]:
library(mltools)
library(data.table)

In [None]:
# one_hot() expects a data.table; as.dataframe() does not exist in R
one_hot(as.data.table(purchases$Country))

In [None]:
# cols takes the names of the columns to encode, not the column values
one_hot(as.data.table(purchases), cols = "Country")

In [None]:
# Replace the Country column with its one-hot columns
purchases <- one_hot(as.data.table(purchases), cols = "Country")