# Machine Learning Preprocessing: Train-Test Splitting

In this notebook, we explore train-test splitting. We almost always split our data into two parts: a training set, used to fit a model, and a test set, used to measure how well the model performs on data it has not seen.

Sources:
1. <a href='https://www.udemy.com/course/machinelearning/'>Machine Learning A-Z™: Hands-On Python & R In Data Science</a>

## Load & Preview Data

In [1]:
# Define file path to data
purchases_file_path  <- file.path('Data','Data.csv')

# Load data
purchases  <- read.csv(purchases_file_path)

In [2]:
cat('Shape:')
dim(purchases)
cat('Figure 1.')

cat('\n\nPreview:')
head(purchases, 5)
cat('Figure 2.')

cat('\n\nStructure:\n')
str(purchases)
cat('Figure 3.')

cat('\n\nSummary:')
summary(purchases)
cat('Figure 4.')

Shape:

Figure 1.

Preview:

Row,Country,Age,Salary,Purchased
,<chr>,<int>,<int>,<chr>
1,France,44,72000.0,No
2,Spain,27,48000.0,Yes
3,Germany,30,54000.0,No
4,Spain,38,61000.0,No
5,Germany,40,,Yes


Figure 2.

Structure:
'data.frame':	10 obs. of  4 variables:
 $ Country  : chr  "France" "Spain" "Germany" "Spain" ...
 $ Age      : int  44 27 30 38 40 35 NA 48 50 37
 $ Salary   : int  72000 48000 54000 61000 NA 58000 52000 79000 83000 67000
 $ Purchased: chr  "No" "Yes" "No" "No" ...
Figure 3.

Summary:

   Country               Age            Salary       Purchased        
 Length:10          Min.   :27.00   Min.   :48000   Length:10         
 Class :character   1st Qu.:35.00   1st Qu.:54000   Class :character  
 Mode  :character   Median :38.00   Median :61000   Mode  :character  
                    Mean   :38.78   Mean   :63778                     
                    3rd Qu.:44.00   3rd Qu.:72000                     
                    Max.   :50.00   Max.   :83000                     
                    NA's   :1       NA's   :1                         

Figure 4.

## Address Missing Values

In [3]:
# Replace missing (NA) salaries with the mean of the non-missing salaries
purchases$Salary <- ifelse(is.na(purchases$Salary) # If the salary is missing
                          ,ave(purchases$Salary, FUN = function(x) mean(x, na.rm = TRUE)) # then use the mean salary
                          ,purchases$Salary) # else keep the record's own salary

# Replace missing (NA) ages with the mean of the non-missing ages
purchases$Age <- ifelse(is.na(purchases$Age) # If the age is missing
                          ,ave(purchases$Age, FUN = function(x) mean(x, na.rm = TRUE)) # then use the mean age
                          ,purchases$Age) # else keep the record's own age
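
The same mean imputation can be written more compactly and then checked. The sketch below is only an illustration: `impute_mean` is a hypothetical helper, not part of the notebook, and the vector repeats the `Age` values shown in Figure 3.

```r
# Hypothetical helper: replace NA entries with the mean of the non-missing values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Age values as shown in the structure output (one missing value)
ages <- c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37)
ages_imputed <- impute_mean(ages)

# Confirm that no missing values remain
anyNA(ages_imputed)
```

Checking `anyNA()` after imputation is a cheap way to catch a column that was missed.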

## Encode Categorical Data

In [4]:
# Encode country features
purchases$Country  <- factor(purchases$Country
                            ,levels=c('France', 'Spain', 'Germany')
                            ,labels=c(1,2,3))

# Encode purchased label
purchases$Purchased  <- factor(purchases$Purchased
                            ,levels=c('Yes', 'No')
                            ,labels=c(1,0))
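
To see what this encoding does, it can help to run `factor()` on a small vector. The sketch below uses illustrative values; `'Italy'` is a made-up input included only to show that a value missing from `levels` becomes `NA`.

```r
# Small illustrative vector of country names
countries <- c('France', 'Spain', 'Germany', 'Spain')

# Same levels and labels as the notebook's Country encoding
encoded <- factor(countries
                 ,levels = c('France', 'Spain', 'Germany')
                 ,labels = c(1, 2, 3))

# The labels are stored as text, so the factor's levels are "1", "2", "3"
levels(encoded)
as.character(encoded)

# A value that is not listed in `levels` is encoded as NA
unknown <- factor('Italy', levels = c('France', 'Spain', 'Germany'))
is.na(unknown)
```

Because unlisted values silently turn into `NA`, it is worth confirming that `levels` covers every value in the column before encoding.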

## Train-Test Splitting

When building machine learning models, we typically split the data into a training set and a test set. The model learns patterns from the training set that it can then apply to new data. We then run the model on the test set, which it has never seen, and compare its predictions to the actual values. This gives an estimate of how well the model generalizes.
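
Conceptually, a split is a random partition of the row indices. The base R sketch below illustrates that idea, assuming 10 rows and an 80% training share. The variable names are illustrative, and this simple version ignores the labels, so it is not equivalent to the caTools function used later in this notebook.

```r
# Fix the random seed so the partition is reproducible
set.seed(123)

n_rows <- 10        # assumed number of records
train_ratio <- 0.8  # assumed share of records used for training

# Randomly pick the training row indices; the remaining rows form the test set
train_idx <- sample(seq_len(n_rows), size = round(train_ratio * n_rows))
test_idx <- setdiff(seq_len(n_rows), train_idx)

length(train_idx)
length(test_idx)
```

With a data frame `df`, the two sets would then be `df[train_idx, ]` and `df[test_idx, ]`.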

In [5]:
# Load dependencies
library(caTools)

In [6]:
# Set seed to keep splitting results consistent
set.seed(123)

In [7]:
# Build a logical vector: TRUE marks a training record, FALSE a test record
split <- sample.split(Y = purchases$Purchased, SplitRatio = 0.8)

# View the split vector
split

Before proceeding, let's examine the split. We set the split ratio, the fraction of records assigned to training, to 0.8. Therefore, of our 10 records, 8 go to training and 2 go to testing. The split vector holds 2 FALSE values, at index positions 6 and 9, which mark the test records. Let's view the entire data frame now.

In [8]:
# View data before splitting
purchases

Country,Age,Salary,Purchased
<fct>,<dbl>,<dbl>,<fct>
1,44.0,72000.0,0
2,27.0,48000.0,1
3,30.0,54000.0,0
2,38.0,61000.0,0
3,40.0,63777.78,1
1,35.0,58000.0,1
2,38.77778,52000.0,0
1,48.0,79000.0,1
3,50.0,83000.0,0
1,37.0,67000.0,1


Viewing the above, we can see that the age and salary are 35 and 58k at index 6, and 50 and 83k at index 9. When we split our data, these two records will form the test set.

In [9]:
# Split data into training and testing sets
purchases_train <- subset(purchases, split == TRUE)
purchases_test <- subset(purchases, split == FALSE)

In [10]:
# View training records
purchases_train

# View testing records
purchases_test

Row,Country,Age,Salary,Purchased
,<fct>,<dbl>,<dbl>,<fct>
1,1,44.0,72000.0,0
2,2,27.0,48000.0,1
3,3,30.0,54000.0,0
4,2,38.0,61000.0,0
5,3,40.0,63777.78,1
7,2,38.77778,52000.0,0
8,1,48.0,79000.0,1
10,1,37.0,67000.0,1


Row,Country,Age,Salary,Purchased
,<fct>,<dbl>,<dbl>,<fct>
6,1,35,58000,1
9,3,50,83000,0
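
Note that the test set contains one record of each `Purchased` label. The caTools documentation describes `sample.split` as preserving the relative ratios of the labels in `Y`. The base R sketch below illustrates that stratified idea. `stratified_split` is a hypothetical helper written for this example, not the caTools implementation, and the label vector is copied from the encoded data frame above.

```r
# Hypothetical helper: sample training rows separately within each label
stratified_split <- function(y, ratio) {
  in_train <- logical(length(y))            # start with every row marked FALSE
  for (lvl in unique(y)) {
    idx <- which(y == lvl)                  # positions of this label
    n_train <- round(ratio * length(idx))   # training rows to draw for this label
    in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  in_train
}

set.seed(123)

# Purchased labels in row order, as shown in the encoded data frame
purchased <- c(0, 1, 0, 0, 1, 1, 0, 1, 0, 1)
in_train <- stratified_split(purchased, ratio = 0.8)

# Cross-tabulate labels against training membership
table(purchased, in_train)
```

With 5 records per label and a ratio of 0.8, each label contributes 4 training rows and 1 test row. The specific rows chosen depend on the seed and need not match the caTools result.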
