# Machine Learning Preprocessing: Encoding Categorical Data

In this notebook, we explore how to encode categorical data.  Machine learning algorithms often require numerical input, so encoding categories such as "<em>small, medium, large</em>" is an important step in converting data into a format that algorithms can understand.

Sources:
1. <a href='https://www.udemy.com/course/machinelearning/'>Machine Learning A-Z™: Hands-On Python & R In Data Science</a>

## Load & Preview Data

In [1]:
# Define file path to data
purchases_file_path  <- file.path('Data','Data.csv')

# Load data
purchases  <- read.csv(purchases_file_path)

In [2]:
cat('Shape:')
dim(purchases)
cat('Figure 1.')

cat('\n\nPreview:')
head(purchases, 5)
cat('Figure 2.')

cat('\n\nStructure:\n')
str(purchases)
cat('Figure 3.')

cat('\n\nSummary:')
summary(purchases)
cat('Figure 4.')

cat('\n\nNulls:')
data.frame(colSums(is.na(purchases)))
cat('Figure 5.')

Shape:

Figure 1.

Preview:

Country,Age,Salary,Purchased
France,44,72000.0,No
Spain,27,48000.0,Yes
Germany,30,54000.0,No
Spain,38,61000.0,No
Germany,40,,Yes


Figure 2.

Structure:
'data.frame':	10 obs. of  4 variables:
 $ Country  : Factor w/ 3 levels "France","Germany",..: 1 3 2 3 2 1 3 1 2 1
 $ Age      : int  44 27 30 38 40 35 NA 48 50 37
 $ Salary   : int  72000 48000 54000 61000 NA 58000 52000 79000 83000 67000
 $ Purchased: Factor w/ 2 levels "No","Yes": 1 2 1 1 2 2 1 2 1 2
Figure 3.

Summary:

    Country       Age            Salary      Purchased
 France :4   Min.   :27.00   Min.   :48000   No :5    
 Germany:3   1st Qu.:35.00   1st Qu.:54000   Yes:5    
 Spain  :3   Median :38.00   Median :61000            
             Mean   :38.78   Mean   :63778            
             3rd Qu.:44.00   3rd Qu.:72000            
             Max.   :50.00   Max.   :83000            
             NA's   :1       NA's   :1                

Figure 4.

Nulls:

,colSums.is.na.purchases..
Country,0
Age,1
Salary,1
Purchased,0


Figure 5.

## Address Missing Values

In [3]:
# Replace salary values with averages if they are null
purchases$Salary <- ifelse(is.na(purchases$Salary) # If the salary is null
                          ,ave(purchases$Salary, FUN = function(x) mean(x, na.rm = TRUE)) # then return the mean salary
                          ,purchases$Salary) # else return the current record's salary

# Replace age values with averages if they are null
purchases$Age <- ifelse(is.na(purchases$Age) # If the age is null
                          ,ave(purchases$Age, FUN = function(x) mean(x, na.rm = TRUE)) # then return the mean age
                          ,purchases$Age) # else return the current record's age                        
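
The same mean imputation can be written more compactly with a small helper. The sketch below is an alternative to the `ifelse()`/`ave()` pattern above, not a replacement for it: `impute_mean` is a hypothetical name introduced here for illustration, and the vectors are copied from the Age and Salary values shown in the structure output so that the block runs without `Data.csv`.

```r
# Hypothetical helper (not part of the original notebook): replace NA entries
# of a numeric vector with the mean of the observed (non-missing) values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Age and Salary values copied from the structure output (Figure 3).
age    <- c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37)
salary <- c(72000, 48000, 54000, 61000, NA, 58000, 52000, 79000, 83000, 67000)

age_imputed    <- impute_mean(age)
salary_imputed <- impute_mean(salary)

# Confirm that no missing values remain in either vector.
anyNA(age_imputed)
anyNA(salary_imputed)

# Inspect the two filled-in entries (row 7 for Age, row 5 for Salary).
age_imputed[7]
salary_imputed[5]
```

The filled-in entries should agree with the imputed values visible in the data preview that follows (38.77778 for Age and 63777.78 for Salary).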

## Encoding Categorical Data

Let's start by previewing our data again.

In [4]:
head(purchases, 5)

Country,Age,Salary,Purchased
France,44,72000.0,No
Spain,27,48000.0,Yes
Germany,30,54000.0,No
Spain,38,61000.0,No
Germany,40,63777.78,Yes


One of our features is Country, with the values France, Spain, and Germany.  Under the hood, many machine learning algorithms cannot work with these strings and require numbers instead.  One idea is therefore to simply convert "France" to the number 1, "Spain" to 2, and "Germany" to 3.  In certain models, however, this could lead the algorithm to treat the countries as numerically ranked, that is, to treat Germany, being 3, as the "best," France, being 1, as the "worst," and Spain as somewhere in the middle.  In reality there is no ranking among these countries, so this type of encoding can bias our machine learning model.
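
To make the problem concrete, here is a minimal sketch of the naive integer encoding described above, using a toy vector rather than the notebook's data. The variable names `country` and `country_code` are introduced here for illustration only.

```r
# Toy vector of country names (illustrative only; not read from Data.csv).
country <- c("France", "Spain", "Germany", "Spain", "Germany")

# Naive integer encoding: France = 1, Spain = 2, Germany = 3.
country_code <- match(country, c("France", "Spain", "Germany"))
country_code
# [1] 1 2 3 2 3

# Arithmetic on the codes is well defined, but the result has no meaning:
# there is no such thing as an "average country" of 2.2.
mean(country_code)
# [1] 2.2
```

A model that consumes `country_code` as a numeric feature sees exactly these numbers, so it can treat Germany as "three times" France even though no such relationship exists.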

An alternative is to use "one-hot encoding," which creates a binary (True or False) vector for each category; in other words, it turns a column with $n$ categories into $n$ distinct columns.  In our case, the Country column has 3 categories, so one-hot encoding will transform it into 3 distinct columns.  The result will essentially be an "Is France?" column, an "Is Germany?" column, and an "Is Spain?" column; for each row, exactly one of these columns is populated with 1, or "True," and the others with 0, or "False."  France will receive the vector [1,0,0], Germany will receive the vector [0,1,0], and Spain will receive the vector [0,0,1].  Note that the columns are ordered alphabetically by default (R sorts factor levels alphabetically), not in the order in which the categories appear in the data.
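
As a minimal sketch of this idea, the block below builds the three indicator columns with base R's `model.matrix()`. This is a swapped-in technique chosen so the example runs without extra packages; it is not the `mltools` approach attempted later in this notebook, and the toy vector is illustrative only.

```r
# Toy factor of countries (illustrative only; not read from Data.csv).
country <- factor(c("France", "Spain", "Germany", "Spain", "Germany"))

# Factor levels are sorted alphabetically by default.
levels(country)

# "- 1" drops the intercept so that every level gets its own indicator column.
one_hot_matrix <- model.matrix(~ country - 1)

# One column per category, named after the variable and the level.
colnames(one_hot_matrix)

# The first observation is France, so only the France column is set to 1.
one_hot_matrix[1, ]

# Every row contains exactly one 1.
rowSums(one_hot_matrix)
```

Note that `model.matrix()` prefixes each column with the variable name, so the column names differ from what a dedicated one-hot encoding function may produce.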

Next we will use the factor() function.  We will pass it 3 arguments:
<ul>
    <li>The data to transform into factors</li>
    <li><code>levels</code>: the name of each category</li>
    <li><code>labels</code>: the number to give to each category</li>
</ul>
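
A small sketch of these three arguments in action, using the "small, medium, large" example from the introduction. The vector `sizes` is hypothetical and exists only for this illustration.

```r
# Hypothetical toy data (not part of Data.csv).
sizes <- c("small", "large", "medium", "small")

sizes_encoded <- factor(sizes,
                        levels = c("small", "medium", "large"),  # category names
                        labels = c(1, 2, 3))                     # value given to each category

# Each original value is replaced by the label of its level.
sizes_encoded

# The labels become the new level names; they are stored as character strings.
levels(sizes_encoded)
```

The order of `labels` follows the order of `levels`, so "small" maps to 1, "medium" to 2, and "large" to 3 regardless of the order in which the values appear in the data.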

In [5]:
purchases$Country
purchases$Purchased

In [6]:
library(mltools)
library(data.table)

"package 'data.table' was built under R version 3.6.3"

In [6]:
purchases

Country,Age,Salary,Purchased
France,44.0,72000.0,No
Spain,27.0,48000.0,Yes
Germany,30.0,54000.0,No
Spain,38.0,61000.0,No
Germany,40.0,63777.78,Yes
France,35.0,58000.0,Yes
Spain,38.77778,52000.0,No
France,48.0,79000.0,Yes
Germany,50.0,83000.0,No
France,37.0,67000.0,Yes


In [None]:
one_hot(as.data.table(purchases$Country))

In [None]:
purchases$Country <- as.factor(purchases$Country)

In [18]:
# one_hot() expects a data.table, and cols takes column names as strings
one_hot(as.data.table(purchases), cols = 'Country')


In [None]:
# Keep the one-hot columns in a separate object so purchases$Country stays intact
country_one_hot <- one_hot(as.data.table(purchases$Country))

In [None]:
country_one_hot

In [13]:
purchases

Country,Age,Salary,Purchased
France,44.0,72000.0,No
Spain,27.0,48000.0,Yes
Germany,30.0,54000.0,No
Spain,38.0,61000.0,No
Germany,40.0,63777.78,Yes
France,35.0,58000.0,Yes
Spain,38.77778,52000.0,No
France,48.0,79000.0,Yes
Germany,50.0,83000.0,No
France,37.0,67000.0,Yes


In [28]:
install.packages('caret', dependencies = TRUE)

"dependencies 'fastICA', 'randomForest' are not available"
also installing the dependencies 'Rcpp', 'glue', 'R.methodsS3', 'R.oo', 'R.utils', 'bitops', 'httpuv', 'xtable', 'fontawesome', 'sourcetools', 'later', 'promises', 'commonmark', 'cachem', 'cli', 'R.cache', 'caTools', 'TH.data', 'profileModel', 'boot', 'minqa', 'nloptr', 'RcppEigen', 'lazyeval', 'plotrix', 'shiny', 'miniUI', 'styler', 'classInt', 'labelled', 'gplots', 'libcoin', 'matrixStats', 'multcomp', 'htmltools', 'ipred', 'brglm', 'gtools', 'lme4', 'qvcalc', 'rex', 'Formula', 'plotmo', 'TeachingDemos', 'combinat', 'questionr', 'xfun', 'ROCR', 'cluster', 'mvtnorm', 'modeltools', 'strucchange', 'coin', 'sandwich', 'bslib', 'tinytex', 'ISwR', 'corpcor', 'ROSE', 'recipes', 'BradleyTerry2', 'covr', 'Cubist', 'earth', 'ellipse', 'gam', 'kernlab', 'klaR', 'knitr', 'mda', 'mlbench', 'MLmetrics', 'pamr', 'party', 'pls', 'RANN', 'rmarkdown', 'spls', 'subselect', 'superpc', 'themis'





installing the source packages 'R.utils', 'fontawesome', 'TH.data', 'plotrix', 'shiny', 'styler', 'labelled', 'gplots', 'multcomp', 'rex', 'plotmo', 'questionr', 'sandwich', 'bslib', 'tinytex', 'corpcor', 'ROSE', 'recipes', 'ellipse', 'klaR', 'knitr', 'pls', 'rmarkdown', 'themis'

"installation of package 'klaR' had non-zero exit status"

In [7]:
library(caret)

"package 'caret' was built under R version 3.6.3"
Loading required package: lattice
"package 'lattice' was built under R version 3.6.3"
Loading required package: ggplot2
"replacing previous import 'vctrs::data_frame' by 'tibble::data_frame' when loading 'dplyr'"

ERROR: Error: package or namespace load failed for 'caret' in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]):
 there is no package called 'recipes'


In [27]:
dummyVars(" ~ .", data = purchases)

ERROR: Error in dummyVars(" ~ .", data = purchases): could not find function "dummyVars"


In [None]:
# Encode country features
purchases$Country  <- factor(purchases$Country
                            ,levels=c('France', 'Spain', 'Germany')
                            ,labels=c(1,2,3)
                            )
# Encode purchased label
purchases$Purchased  <- factor(purchases$Purchased
                            ,levels=c('Yes', 'No')
                            ,labels=c(1,0)
                            )
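
One caveat with the encoding above: the labels of a factor are stored as character strings, and the factor also carries internal integer codes that follow the order of `levels`. The sketch below, on a hypothetical three-element vector, shows how to recover the numeric labels rather than the internal codes.

```r
# Hypothetical toy vector, encoded the same way as the Purchased column above.
purchased <- factor(c("No", "Yes", "No"),
                    levels = c("Yes", "No"),
                    labels = c(1, 0))

# as.integer() returns the internal codes (position in the levels),
# not the labels: "Yes" is level 1 and "No" is level 2.
as.integer(purchased)
# [1] 2 1 2

# To get the labels as numbers, convert to character first.
as.numeric(as.character(purchased))
# [1] 0 1 0
```

If a later step needs a numeric 0/1 column, the second conversion is the one that matches the labels chosen in `factor()`.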

In [None]:
purchases$Country
purchases$Purchased

In [8]:
R.version.string