# Machine Learning Preprocessing: Feature Scaling

In this notebook, we discuss feature scaling.  When working with multiple numerical variables, it is sometimes necessary to bring each feature onto the same scale.

Sources:
1. <a href='https://www.udemy.com/course/machinelearning/'>Machine Learning A-Z™: Hands-On Python & R In Data Science</a>

## Load & Preview Data

In [1]:
# Define file path to data
purchases_file_path  <- file.path('Data','Data.csv')

# Load data
purchases  <- read.csv(purchases_file_path)

In [2]:
cat('Shape:')
dim(purchases)
cat('Figure 1.')

cat('\n\nPreview:')
head(purchases, 5)
cat('Figure 2.')

cat('\n\nStructure:\n')
str(purchases)
cat('Figure 3.')

cat('\n\nSummary:')
summary(purchases)
cat('Figure 4.')

Shape:

Figure 1.

Preview:

| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| | `<chr>` | `<int>` | `<int>` | `<chr>` |
| 1 | France | 44 | 72000 | No |
| 2 | Spain | 27 | 48000 | Yes |
| 3 | Germany | 30 | 54000 | No |
| 4 | Spain | 38 | 61000 | No |
| 5 | Germany | 40 | NA | Yes |


Figure 2.

Structure:
'data.frame':	10 obs. of  4 variables:
 $ Country  : chr  "France" "Spain" "Germany" "Spain" ...
 $ Age      : int  44 27 30 38 40 35 NA 48 50 37
 $ Salary   : int  72000 48000 54000 61000 NA 58000 52000 79000 83000 67000
 $ Purchased: chr  "No" "Yes" "No" "No" ...
Figure 3.

Summary:

   Country               Age            Salary       Purchased        
 Length:10          Min.   :27.00   Min.   :48000   Length:10         
 Class :character   1st Qu.:35.00   1st Qu.:54000   Class :character  
 Mode  :character   Median :38.00   Median :61000   Mode  :character  
                    Mean   :38.78   Mean   :63778                     
                    3rd Qu.:44.00   3rd Qu.:72000                     
                    Max.   :50.00   Max.   :83000                     
                    NA's   :1       NA's   :1                         

Figure 4.

## Address Missing Values

In [3]:
# Replace salary values with averages if they are null
purchases$Salary <- ifelse(is.na(purchases$Salary) # If the salary is null
                          ,ave(purchases$Salary, FUN = function(x) mean(x, na.rm = TRUE)) # then return the mean salary
                          ,purchases$Salary) # else return the current record's salary

# Replace age values with averages if they are null
purchases$Age <- ifelse(is.na(purchases$Age) # If the age is null
                          ,ave(purchases$Age, FUN = function(x) mean(x, na.rm = TRUE)) # then return the mean age
                          ,purchases$Age) # else return the current record's age                        
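The same mean imputation can be written more compactly with logical indexing.  The following is a minimal sketch, not the notebook's own method; `impute_mean` is a hypothetical helper name, and the ages are copied from the structure output above.

```r
# Hypothetical helper: replace NA entries of a numeric vector with the mean
# of its non-missing values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Age column as shown in the structure output (one missing value)
ages <- c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37)

ages_imputed <- impute_mean(ages)
ages_imputed
```

Applied to a data frame, this would look like `purchases$Age <- impute_mean(purchases$Age)`.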

## Encode Categorical Data

In [4]:
# Encode country features
purchases$Country  <- factor(purchases$Country
                            ,levels=c('France', 'Spain', 'Germany')
                            ,labels=c(1,2,3))

# Encode purchased label
purchases$Purchased  <- factor(purchases$Purchased
                            ,levels=c('Yes', 'No')
                            ,labels=c(1,0))

## Train-Test Splitting

When building machine learning models, we often split the data into a training set and a test set.  The model uses the training set to look for patterns which it can apply to new data in the future.  We then run the model on the test set, which it has never seen, and compare its predictions to the actual values.  In this way, we can measure how well the model works.

In [5]:
# Load dependencies
library(caTools)

In [6]:
# Set seed to keep splitting results consistent
set.seed(123)

In [7]:
# Create a logical vector: TRUE marks training rows, FALSE marks testing rows
split <- sample.split(Y=purchases$Purchased, SplitRatio = 0.8)

# View splitting
split

Before proceeding, let's examine our splitting.  We set the split ratio (the fraction of records assigned to training) to 0.8.  Therefore, out of our 10 records, 8 will be for training and 2 will be for testing.  Our splitting assigned FALSE to 2 of our purchase records, at index positions 6 and 9.  Let's view our entire dataframe now.
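The counts described here can be checked directly on a logical mask.  The sketch below rebuilds a mask by hand with FALSE at positions 6 and 9 (`split_mask` is a hypothetical stand-in for the vector returned by `sample.split`), so it runs without caTools.

```r
# Hand-built mask mirroring the split described above:
# TRUE = training row, FALSE = testing row
split_mask <- rep(TRUE, 10)
split_mask[c(6, 9)] <- FALSE

table(split_mask)     # counts of testing (FALSE) and training (TRUE) rows
which(!split_mask)    # row positions held out for testing
mean(split_mask)      # fraction of rows used for training
```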

In [8]:
# View data before splitting
purchases

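| Country | Age | Salary | Purchased |
|---|---|---|---|
| `<fct>` | `<dbl>` | `<dbl>` | `<fct>` |
| 1 | 44.0 | 72000.0 | 0 |
| 2 | 27.0 | 48000.0 | 1 |
| 3 | 30.0 | 54000.0 | 0 |
| 2 | 38.0 | 61000.0 | 0 |
| 3 | 40.0 | 63777.78 | 1 |
| 1 | 35.0 | 58000.0 | 1 |
| 2 | 38.77778 | 52000.0 | 0 |
| 1 | 48.0 | 79000.0 | 1 |
| 3 | 50.0 | 83000.0 | 0 |
| 1 | 37.0 | 67000.0 | 1 |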


Viewing the above, we can see that the ages/salaries at indices 6 and 9 are 35/58k and 50/83k, respectively.  When we split our data, these will be our two testing records.

In [9]:
# Split data into training and testing sets
X_train <- subset(purchases, split == TRUE)
X_test <- subset(purchases, split == FALSE)

In [10]:
# View training records
X_train

# View testing records
X_test

| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| | `<fct>` | `<dbl>` | `<dbl>` | `<fct>` |
| 1 | 1 | 44.0 | 72000.0 | 0 |
| 2 | 2 | 27.0 | 48000.0 | 1 |
| 3 | 3 | 30.0 | 54000.0 | 0 |
| 4 | 2 | 38.0 | 61000.0 | 0 |
| 5 | 3 | 40.0 | 63777.78 | 1 |
| 7 | 2 | 38.77778 | 52000.0 | 0 |
| 8 | 1 | 48.0 | 79000.0 | 1 |
| 10 | 1 | 37.0 | 67000.0 | 1 |


| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| | `<fct>` | `<dbl>` | `<dbl>` | `<fct>` |
| 6 | 1 | 35 | 58000 | 1 |
| 9 | 3 | 50 | 83000 | 0 |


## Feature Scaling

In a basic sense, feature scaling uses some measure (e.g. mean, standard deviation) to transform features onto the same scale.  While the ages in our dataset range from roughly 30 to 50, our salaries range from roughly 50,000 to 80,000.  A difference of 20 is huge for age, but nothing for salary.  Machine learning models sensitive to Euclidean distance will therefore be dominated by salary and led to believe that age makes almost no difference in predicting an outcome.  We will visualize this soon.
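To make the Euclidean distance point concrete, the sketch below measures how much of the squared distance between two records comes from age, before and after standardizing.  It uses the first four Age/Salary pairs from the data preview; the variable names are illustrative only.

```r
# First four Age/Salary pairs from the data preview
records <- data.frame(Age = c(44, 27, 30, 38),
                      Salary = c(72000, 48000, 54000, 61000))

# Squared per-feature differences between the first two records (raw units)
sq_diff_raw <- (unlist(records[1, ]) - unlist(records[2, ]))^2
age_share_raw <- sq_diff_raw['Age'] / sum(sq_diff_raw)

# Same calculation after standardizing each column with scale()
scaled <- scale(records)
sq_diff_scaled <- (scaled[1, ] - scaled[2, ])^2
age_share_scaled <- sq_diff_scaled['Age'] / sum(sq_diff_scaled)

age_share_raw      # close to zero: salary dominates the raw distance
age_share_scaled   # age and salary now contribute on a comparable footing
```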

Feature scaling should be performed after splitting the data into training and testing sets.  You only want data being used to train the model to contribute to this scaling.  The test set is meant to mimic future data that you haven't seen yet; therefore, you don't want this data contributing to the measure you scale on (e.g. mean and standard deviation).  Consider the effect if an extremely impactful outlier were to contribute to these measures, then end up in the test set; this is technically a future data point we haven't seen, yet it drastically increases or decreases our scaling measure.  By the same reasoning, you may often want to split before even imputing data.  In this notebook, we imputed first, so as not to disrupt the flow of learning.
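One way to follow this advice is to compute the scaling measures on the training rows only, then reuse them for the test rows.  The sketch below does this for age, using the training and testing ages shown earlier; `349 / 9` is the imputed mean age, and all variable names are illustrative.

```r
# Ages from the training and testing rows shown earlier in this notebook
train_age <- c(44, 27, 30, 38, 40, 349 / 9, 48, 37)
test_age  <- c(35, 50)

# The scaling measures come from the training set only
age_mean <- mean(train_age)
age_sd   <- sd(train_age)

# The same measures are applied to both sets
train_age_scaled <- (train_age - age_mean) / age_sd
test_age_scaled  <- (test_age - age_mean) / age_sd

test_age_scaled
```

The scaled test ages are not expected to have mean 0 and standard deviation 1 themselves, since they are measured against the training distribution.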

As previously mentioned, feature scaling essentially uses some measure (e.g. mean and standard deviation) to bring all features onto a similar scale.  While the ages in our dataset range from roughly 30 to 50, our salaries range from roughly 50,000 to 80,000.  A difference of 20 is huge for age, but nothing for salary.  In relative terms, however, 50 is a 66.67% change from 30, whereas 80,000 is only a 60% change from 50,000.  It is, however, not always necessary to scale features; consider a multiple linear regression model for example:

$$
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n
$$

If any variable takes significantly larger values than the rest, its coefficient may naturally compensate by taking a smaller value.  We will revisit this in regression.
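This compensation can be seen with ordinary least squares via `lm()`.  In the sketch below, age is predicted from salary purely for illustration, using the eight rows of the dataset without missing values; expressing salary in thousands changes the coefficient by a factor of 1,000 but leaves the fitted values unchanged.

```r
# Rows of the dataset without missing values (illustrative use only)
age    <- c(44, 27, 30, 38, 35, 48, 50, 37)
salary <- c(72000, 48000, 54000, 61000, 58000, 79000, 83000, 67000)

# Same predictor on two different scales
salary_k <- salary / 1000

fit_dollars <- lm(age ~ salary)
fit_k       <- lm(age ~ salary_k)

# The slope on the larger-valued predictor is 1,000 times smaller
coef(fit_dollars)
coef(fit_k)

# The fitted values agree up to floating-point tolerance
all.equal(fitted(fit_dollars), fitted(fit_k))
```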

The two main feature scaling techniques are <em>normalization</em> and <em>standardization</em> (or $z$-score normalization).  They are achieved by applying one of the following formulas to every value of a feature:

<center><b>Normalization:</b>
$$
x_{norm} = \frac{x - x_{min}}{x_{max}-x_{min}}
$$
</center>
<br>

<center><b>Standardization:</b>
$$
x_{stand} = z = \frac{x - \mu}{\sigma}
$$
Where $x$ is a particular observation, $\mu$ is the mean of the feature, and $\sigma$ is the standard deviation of the feature; the result of this formula is known as the $z$-score.</center>


Conceptually, normalization puts all features on a scale from 0 to 1.  Assuming that the datapoints are not all the same value, any datapoint $x$ is greater than or equal to the minimum $x$ value, so the numerator is zero or positive, while the maximum $x$ value minus the minimum $x$ value is always positive, so the denominator is positive.  Lastly, the difference between the maximum and the minimum is at least as large as the difference between any given $x$ value and the minimum, so the denominator is always greater than or equal to the numerator.  Therefore, zero or some positive number divided by the same or a larger positive number will always lie between 0 and 1.
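The normalization formula above translates almost directly into R.  This is a minimal sketch; `normalize` is a hypothetical helper name, and it assumes the values are not all identical (otherwise the denominator would be zero).  The salaries are the nine non-missing values from the structure output.

```r
# Hypothetical helper implementing min-max normalization
# Assumes max(x) > min(x); a constant vector would divide by zero
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Non-missing salaries from the structure output
salary <- c(72000, 48000, 54000, 61000, 58000, 52000, 79000, 83000, 67000)

salary_norm <- normalize(salary)
salary_norm
range(salary_norm)   # the minimum maps to 0 and the maximum maps to 1
```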

Standardization (or $z$-score normalization), on the other hand, calculates the $z$-score of every value of a feature, and roughly puts every datapoint on a range from -3 to 3.  Recall from the Empirical Rule that, for approximately normally distributed data, 99.7% of data falls within 3 standard deviations of the mean.$^2$  The next question then is when to use which method.  Normalization is typically recommended when you have a normal distribution in most features.  Standardization, on the other hand, is generally "good all the time."
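Standardization can be sketched the same way and compared against R's built-in `scale()`.  `standardize` is a hypothetical helper name; note that `sd()` computes the sample standard deviation (dividing by $n - 1$), which is also what `scale()` uses by default.  The ages are the nine non-missing values from the structure output.

```r
# Hypothetical helper implementing z-score standardization
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

# Non-missing ages from the structure output
age <- c(44, 27, 30, 38, 40, 35, 48, 50, 37)

age_z <- standardize(age)

# Compare with scale(), dropping its matrix shape and attributes
all.equal(age_z, as.numeric(scale(age)))

range(age_z)   # lies within -3 and 3 for this sample
```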

Before proceeding, it is important to note not to scale encoded values.  Recall that we transformed our country features into numbers representing each country; we do not want to scale these meaningful numbers.  First, the whole point of scaling is to put all features into the same range.  Standardization puts features onto a scale of roughly -3 to 3, while normalization puts features onto a scale from 0 to 1.  Either way, with only three categories, the numbers in our encoded Country column fall between 1 and 3 and are already close to either scale.  Furthermore, the exact values of 1, 2, and 3 have meaning, namely, "This is France," "This is Spain," and "This is Germany."  We would not want to scale these down to values such as [0.3, 0.3, 0.4], which have no meaning with respect to answering "yes" or "no" to what a country is.  Therefore, we do not scale encoded values.
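One way to leave encoded columns untouched is to select only the numeric columns before scaling.  The sketch below relies on the fact that `is.numeric()` returns FALSE for factors; `df` is a small stand-in built from the first three rows of the encoded data, not the notebook's own data frame.

```r
# Small stand-in for the encoded data frame (first three rows)
df <- data.frame(Country = factor(c(1, 2, 3)),
                 Age = c(44, 27, 30),
                 Salary = c(72000, 48000, 54000),
                 Purchased = factor(c(0, 1, 0)))

# Factors are not numeric, so only Age and Salary are selected
num_cols <- names(df)[sapply(df, is.numeric)]
num_cols

# Scale the numeric columns; the factor columns are left as they were
df[, num_cols] <- scale(df[, num_cols])
df
```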

Next, we will create a copy of our features, then scale them, to visually examine what happens when we scale data.  We will also calculate the first Z-score ourselves to show what the scale() function does to our data.

In [11]:
# Create copy of training features
X_train_original <- X_train

In [12]:
# View pre-scaled features
X_train_original

| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| | `<fct>` | `<dbl>` | `<dbl>` | `<fct>` |
| 1 | 1 | 44.0 | 72000.0 | 0 |
| 2 | 2 | 27.0 | 48000.0 | 1 |
| 3 | 3 | 30.0 | 54000.0 | 0 |
| 4 | 2 | 38.0 | 61000.0 | 0 |
| 5 | 3 | 40.0 | 63777.78 | 1 |
| 7 | 2 | 38.77778 | 52000.0 | 0 |
| 8 | 1 | 48.0 | 79000.0 | 1 |
| 10 | 1 | 37.0 | 67000.0 | 1 |


In [13]:
# Scale Age and Salary by name, using the training set's mean and standard deviation for both sets
num_cols <- c('Age', 'Salary')
train_scaled <- scale(X_train[, num_cols])
X_test[, num_cols] <- scale(X_test[, num_cols]
                           ,center = attr(train_scaled, 'scaled:center')
                           ,scale = attr(train_scaled, 'scaled:scale'))
X_train[, num_cols] <- train_scaled

Examining the first training age (44), the training-set age mean (about 37.85), and the training-set age standard deviation (about 6.83), we can calculate the Z-score ourselves.

$$
\displaystyle Z = \frac{44 - 37.85}{6.83} \approx 0.90
$$
<br><center>Figure 5. Z-Score Of First Age</center>


Now let's view our scaled data and examine the first scaled age.

In [15]:
X_train

| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| | `<fct>` | `<dbl>` | `<dbl>` | `<fct>` |
| 1 | 1 | 0.90101716 | 0.9392746 | 0 |
| 2 | 2 | -1.58847494 | -1.337116 | 1 |
| 3 | 3 | -1.14915281 | -0.7680183 | 0 |
| 4 | 2 | 0.02237289 | -0.1040711 | 0 |
| 5 | 3 | 0.31525431 | 0.1594 | 1 |
| 7 | 2 | 0.13627122 | -0.9577176 | 0 |
| 8 | 1 | 1.48678 | 1.6032218 | 1 |
| 10 | 1 | -0.12406783 | 0.4650265 | 1 |
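The first scaled age in the table above can be reproduced by hand from the eight training ages.  The sketch below reads the mean and standard deviation that `scale()` used from the attributes it attaches to its result; `349 / 9` is the imputed mean age, and the variable names are illustrative.

```r
# Training ages before scaling (row 7 holds the imputed mean age)
train_age <- c(44, 27, 30, 38, 40, 349 / 9, 48, 37)

scaled <- scale(train_age)

# scale() stores the measures it used as attributes on its result
center <- attr(scaled, 'scaled:center')   # training mean
spread <- attr(scaled, 'scaled:scale')    # training standard deviation

# Z-score of the first training age, computed by hand
z_first <- (train_age[1] - center) / spread

round(c(mean = center, sd = spread, z = z_first), 2)
```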
