# Machine Learning Preprocessing: Encoding Categorical Data

In this notebook, we explore how to encode categorical data.  Machine learning algorithms often require numerical input, so encoding categories such as "<em>small, medium, large</em>" is an important step in converting data into a format that algorithms can understand.

Sources:
1. <a href='https://www.udemy.com/course/machinelearning/'>Machine Learning A-Z™: Hands-On Python & R In Data Science</a>

## Load & Preview Data

In [2]:
install.packages('recipes')

also installing the dependency 'ipred'




  There are binary versions available but the source versions are later:
        binary source needs_compilation
ipred   0.9-11 0.9-12              TRUE
recipes 0.1.16  0.2.0             FALSE

  Binaries will be installed
package 'ipred' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\tyler\AppData\Local\Temp\RtmpWmIL5S\downloaded_packages


installing the source package 'recipes'

"installation of package 'recipes' had non-zero exit status"

In [3]:
library(caret)

"package 'caret' was built under R version 3.6.3"

ERROR: Error: package or namespace load failed for 'caret' in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]):
 there is no package called 'recipes'


In [None]:
# Define file path to data
purchases_file_path  <- file.path('Data','Data.csv')

# Load data
purchases  <- read.csv(purchases_file_path)

In [None]:
cat('Shape:')
dim(purchases)
cat('Figure 1.')

cat('\n\nPreview:')
head(purchases, 5)
cat('Figure 2.')

cat('\n\nStructure:\n')
str(purchases)
cat('Figure 3.')

cat('\n\nSummary:')
summary(purchases)
cat('Figure 4.')

cat('\n\nNulls:')
data.frame(colSums(is.na(purchases)))
cat('Figure 5.')

## Address Missing Values

In [None]:
# Replace missing (NA) salary values with the mean salary
purchases$Salary <- ifelse(is.na(purchases$Salary) # If the salary is NA
                          ,ave(purchases$Salary, FUN = function(x) mean(x, na.rm = TRUE)) # then return the mean of the non-missing salaries
                          ,purchases$Salary) # else return the current record's salary

# Replace missing (NA) age values with the mean age
purchases$Age <- ifelse(is.na(purchases$Age) # If the age is NA
                          ,ave(purchases$Age, FUN = function(x) mean(x, na.rm = TRUE)) # then return the mean of the non-missing ages
                          ,purchases$Age) # else return the current record's age
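
The imputation pattern above can be checked on a small example. The following is a minimal sketch, assuming a made-up `salary` vector (not taken from Data.csv); it uses `mean()` directly, which gives the same result as the `ave()` call when there is no grouping variable.

```r
# Illustrative values only; not from the purchases data
salary <- c(50000, NA, 70000)

# Replace each NA with the mean of the non-missing values
salary_filled <- ifelse(is.na(salary), mean(salary, na.rm = TRUE), salary)

print(salary_filled)

# Confirm that no missing values remain
stopifnot(!anyNA(salary_filled))
```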

## Encoding Categorical Data

Let's start by previewing our data again.

In [None]:
head(purchases, 5)

One of our features is the Country, with the options France, Spain, and Germany.  Under the hood, many machine learning algorithms will not understand these strings, and will instead require numbers.  One idea, therefore, may be to simply convert "France" to the number 1, "Spain" to 2, and "Germany" to 3.  In certain models, however, this could confuse the algorithm into believing that there is a numerical ranking to the countries, that is, that Germany, being 3, is the "best," and France, being 1, is the "worst," with Spain in the middle.  In reality there is no ranking to these countries, so this type of encoding can bias our machine learning model.
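
To make the problem concrete, here is a minimal sketch of the naive integer mapping, assuming a toy `country` vector and a hypothetical `codes` lookup table (neither comes from the dataset).

```r
# Toy vector standing in for the Country column (illustrative data only)
country <- c("France", "Spain", "Germany", "Spain")

# Hypothetical lookup table assigning an arbitrary integer to each country
codes <- c(France = 1, Spain = 2, Germany = 3)
country_int <- unname(codes[country])

print(country_int)

# Treated as plain numbers, the codes imply an order and distances that
# do not exist: Germany minus France is twice Spain minus France
print(country_int[3] - country_int[1])
print(country_int[2] - country_int[1])
```

A model that consumes `country_int` as a numeric feature will use these differences as if they were meaningful.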

An alternative is to use "one-hot encoding," which creates binary vectors (True or False vectors) for each category, or in other words, turns a column with $n$ categories into $n$ distinct columns.  In our case, our Country column has 3 categories, and one-hot encoding will transform this into 3 distinct columns.  The result will essentially be an "Is France?" column, an "Is Germany?" column, and an "Is Spain?" column; for each record we then populate one of these columns with 1, or "True," and the others with 0, or "False."  France will receive the vector [1,0,0], Germany will receive the vector [0,1,0], and Spain will receive the vector [0,0,1].  Note that one-hot encoding orders the categories alphabetically by default, not in the order in which each category appears in the data.
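
As a rough illustration, one-hot columns can be produced in base R with `model.matrix()`. This is a sketch on a toy data frame named `toy` (an assumption for illustration, not the purchases data); removing the intercept with `- 1` keeps one indicator column per category.

```r
# Toy data frame (illustrative only); factor() sorts levels alphabetically by default
toy <- data.frame(Country = factor(c("France", "Spain", "Germany", "France")))

# One indicator column per level; "- 1" drops the intercept so no level is omitted
one_hot <- model.matrix(~ Country - 1, data = toy)

print(one_hot)

# Each row has exactly one column set to 1
print(rowSums(one_hot))
```

The columns come out in level order (France, Germany, Spain), which matches the alphabetical ordering described above.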

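Next we will use the factor() function, which marks a column as categorical.  A factor is not itself one-hot encoded, but R's modeling functions such as lm() and glm() expand factors into indicator columns automatically when fitting.  We will pass factor() 3 arguments:
<ul>
    <li>x: the data to transform into factors</li>
    <li>levels: the name of each category as it appears in the data</li>
    <li>labels: the label to give to each category, in the same order as the levels</li>
</ul>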
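
The behavior of these arguments can be seen on a small vector. This is a minimal sketch, assuming a made-up `sizes` vector; it also shows that the labels are stored as text and that values missing from `levels` become NA.

```r
# Illustrative data only
sizes <- c("small", "large", "medium", "small")

# Map each level to a label, in the order given
sizes_f <- factor(sizes,
                  levels = c("small", "medium", "large"),
                  labels = c(1, 2, 3))

print(sizes_f)

# Labels are stored as character strings, even when given as numbers
print(levels(sizes_f))

# The underlying integer codes follow the order of the levels
print(as.integer(sizes_f))

# A value that is not listed in levels is converted to NA
unknown <- factor(c("small", "huge"), levels = c("small", "medium", "large"))
print(is.na(unknown))
```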

In [None]:
purchases$Country
purchases$Purchased

In [None]:
# Encode country features
purchases$Country  <- factor(purchases$Country
                            ,levels=c('France', 'Spain', 'Germany')
                            ,labels=c(1,2,3)
                            )
# Encode purchased label
purchases$Purchased  <- factor(purchases$Purchased
                            ,levels=c('Yes', 'No')
                            ,labels=c(1,0)
                            )

In [None]:
purchases$Country
purchases$Purchased