# Importing dataset

In [1]:
dataset = read.csv("Data.csv")

In [2]:
dataset

Country,Age,Salary,Purchased
France,44.0,72000.0,No
Spain,27.0,48000.0,Yes
Germany,30.0,54000.0,No
Spain,38.0,61000.0,No
Germany,40.0,,Yes
France,35.0,58000.0,Yes
Spain,,52000.0,No
France,48.0,79000.0,Yes
Germany,50.0,83000.0,No
France,37.0,67000.0,Yes


# Handling Missing Data

ifelse(condition, returnIfTrue, returnIfFalse)

R is built around vectors, matrices and data frames, so most operations on these types are applied to every element automatically, without an explicit loop.

Here, ifelse evaluates the condition element by element on the data frame column, as if you had written it in a for loop:

In [3]:
dataset$Age = ifelse(
    is.na(dataset$Age),
    ave(dataset$Age, FUN = function(x) mean(x, na.rm=TRUE)),
    dataset$Age
)

In [4]:
dataset$Salary = ifelse(
    is.na(dataset$Salary),
    ave(dataset$Salary, FUN = function(x) mean(x, na.rm=TRUE)),
    dataset$Salary
)
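
As a minimal sketch of what "item by item" means, the toy vector below (illustrative values, not taken from Data.csv) is imputed once with ifelse and once with an explicit for loop. The names age, filled_vec and filled_loop are made up for this example. Since there is no grouping variable, a plain mean(age, na.rm = TRUE) is used instead of ave, which should give the same fill value in this situation.

```r
# Toy vector with one missing value (illustrative, not from Data.csv)
age = c(44, 27, NA, 38)

# Vectorised form: ifelse() tests every element in a single call
filled_vec = ifelse(is.na(age), mean(age, na.rm = TRUE), age)

# Equivalent explicit loop over the same vector
filled_loop = age
for (i in seq_along(age)) {
  if (is.na(age[i])) {
    filled_loop[i] = mean(age, na.rm = TRUE)
  }
}

# Both approaches should fill the NA with the mean of the observed values
print(filled_vec)
print(all.equal(filled_vec, filled_loop))
```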

In [5]:
dataset

Country,Age,Salary,Purchased
France,44.0,72000.0,No
Spain,27.0,48000.0,Yes
Germany,30.0,54000.0,No
Spain,38.0,61000.0,No
Germany,40.0,63777.78,Yes
France,35.0,58000.0,Yes
Spain,38.77778,52000.0,No
France,48.0,79000.0,Yes
Germany,50.0,83000.0,No
France,37.0,67000.0,Yes


# Encoding Categorical Data

In [6]:
dataset$Country = factor(
    dataset$Country,
    levels = c("France","Spain","Germany"),
    labels = c(1, 2, 3)
)

In [7]:
dataset

Country,Age,Salary,Purchased
1,44.0,72000.0,No
2,27.0,48000.0,Yes
3,30.0,54000.0,No
2,38.0,61000.0,No
3,40.0,63777.78,Yes
1,35.0,58000.0,Yes
2,38.77778,52000.0,No
1,48.0,79000.0,Yes
3,50.0,83000.0,No
1,37.0,67000.0,Yes


In [8]:
dataset$Purchased = factor(
    dataset$Purchased,
    levels = c("No","Yes"),
    labels = c(0, 1)
) 

In [9]:
dataset

Country,Age,Salary,Purchased
1,44.0,72000.0,0
2,27.0,48000.0,1
3,30.0,54000.0,0
2,38.0,61000.0,0
3,40.0,63777.78,1
1,35.0,58000.0,1
2,38.77778,52000.0,0
1,48.0,79000.0,1
3,50.0,83000.0,0
1,37.0,67000.0,1
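
One detail worth checking after this step: even though the labels look like numbers, the columns are still factors, and the labels are stored as character strings. The sketch below uses a short made-up vector (country is a hypothetical name, not a column of dataset) to inspect what factor produces.

```r
# Small illustrative vector encoded the same way as dataset$Country
country = factor(
    c("France", "Spain", "Germany", "Spain"),
    levels = c("France", "Spain", "Germany"),
    labels = c(1, 2, 3)
)

# The labels become the factor's levels, stored as character strings
print(levels(country))

# The underlying integer codes follow the order given in `levels`
print(as.integer(country))

# A factor is not numeric, even when its labels look like numbers
print(is.numeric(country))
```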


# Splitting Data into training and testing

In [10]:
library('caTools')

"package 'caTools' was built under R version 3.6.3"

In [11]:
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio=.8)

In [12]:
split

In [13]:
training_set = subset(dataset, split==TRUE)
test_set = subset(dataset, split==FALSE)
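
sample.split returns a logical vector (TRUE for rows that go to the training set) and aims to keep the class proportions of the label roughly equal in both subsets. The base R sketch below only approximates that idea so it can run without caTools; it is not the package's actual implementation, and y, ratio and in_train are names invented for this example.

```r
set.seed(123)

# Illustrative label vector with five rows per class
y = factor(c(0, 1, 0, 0, 1, 1, 0, 1, 0, 1))
ratio = 0.8

# Start with every row assigned to the test set
in_train = logical(length(y))

# Within each class, move about `ratio` of the rows to the training set
for (lvl in levels(y)) {
  idx = which(y == lvl)
  n_train = round(ratio * length(idx))
  in_train[idx[sample.int(length(idx), n_train)]] = TRUE
}

# Cross-tabulate class against training membership
print(table(y, in_train))
```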

In [14]:
training_set

Row,Country,Age,Salary,Purchased
1,1,44.0,72000.0,0
2,2,27.0,48000.0,1
3,3,30.0,54000.0,0
4,2,38.0,61000.0,0
5,3,40.0,63777.78,1
7,2,38.77778,52000.0,0
8,1,48.0,79000.0,1
10,1,37.0,67000.0,1


In [15]:
test_set

Row,Country,Age,Salary,Purchased
6,1,35,58000,1
9,3,50,83000,0


# Feature Scaling

We put all numeric variables on the same scale so that a variable with a large range (such as Salary) won't "dominate" one with a smaller range (such as Age). Two common ways to do this:
## Standardisation
$$x_{\text{stand}}=\frac{x-\operatorname{mean}(x)}{\operatorname{sd}(x)}$$
## Normalisation
$$x_{\text{norm}}=\frac{x-\min(x)}{\max(x)-\min(x)}$$
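
The two formulas can be written out by hand on a small vector, as in the sketch below. The values in x are illustrative, and x_stand and x_norm are names chosen for this example. The standardised result is compared with base R's scale, which by default centres on the mean and divides by the sample standard deviation.

```r
# Illustrative values
x = c(44, 27, 30, 38)

# Standardisation: subtract the mean, divide by the sample standard deviation
x_stand = (x - mean(x)) / sd(x)

# Normalisation: map the smallest value to 0 and the largest to 1
x_norm = (x - min(x)) / (max(x) - min(x))

# scale() returns a one-column matrix, so flatten it before comparing
print(all.equal(x_stand, as.numeric(scale(x))))
print(range(x_norm))
```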

In [16]:
# Scaling the whole data frame raises an error because the first and
# last columns (Country and Purchased) are factors, not numeric.

# training_set = scale(training_set)

In [17]:
training_set[, 2:3]

Row,Age,Salary
1,44.0,72000.0
2,27.0,48000.0
3,30.0,54000.0
4,38.0,61000.0
5,40.0,63777.78
7,38.77778,52000.0
8,48.0,79000.0
10,37.0,67000.0


In [18]:
training_set[, 2:3] = scale(training_set[, 2:3])

In [19]:
training_set

Row,Country,Age,Salary,Purchased
1,1,0.90101716,0.9392746,0
2,2,-1.58847494,-1.337116,1
3,3,-1.14915281,-0.7680183,0
4,2,0.02237289,-0.1040711,0
5,3,0.31525431,0.1594,1
7,2,0.13627122,-0.9577176,0
8,1,1.48678,1.6032218,1
10,1,-0.12406783,0.4650265,1


In [20]:
test_set[, 2:3] = scale(test_set[, 2:3])

In [21]:
test_set

Row,Country,Age,Salary,Purchased
6,1,-0.7071068,-0.7071068,1
9,3,0.7071068,0.7071068,0
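
Note that with only two rows, scaling the test set by its own mean and standard deviation always gives ±0.7071068 (that is, ±1/√2), whatever the original values were. A commonly used alternative is to scale the test set with the mean and standard deviation computed on the training set. The sketch below shows one way to do that with the attributes that scale attaches to its result; train and test are small made-up data frames (a few unscaled rows copied from the tables above), not the notebook's actual training_set and test_set.

```r
# Illustrative subsets of the unscaled data
train = data.frame(Age = c(44, 27, 30, 38),
                   Salary = c(72000, 48000, 54000, 61000))
test = data.frame(Age = c(35, 50),
                  Salary = c(58000, 83000))

# Scale the training data and keep the parameters that were used
train_scaled = scale(train)
centers = attr(train_scaled, "scaled:center")
scales = attr(train_scaled, "scaled:scale")

# Apply the training mean and standard deviation to the test data
test_scaled = scale(test, center = centers, scale = scales)

print(test_scaled)
```

With this approach the test values are no longer forced to ±0.7071068, because they are measured against the training distribution rather than against each other.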


# Comparing feature scaling before splitting vs. after splitting

In [22]:
dataset = read.csv("Data.csv")

dataset$Age = ifelse(
    is.na(dataset$Age),
    ave(dataset$Age, FUN = function(x) mean(x, na.rm=TRUE)),
    dataset$Age
)
dataset$Salary = ifelse(
    is.na(dataset$Salary),
    ave(dataset$Salary, FUN = function(x) mean(x, na.rm=TRUE)),
    dataset$Salary
)

dataset$Country = factor(
    dataset$Country,
    levels = c("France","Spain","Germany"),
    labels = c(1, 2, 3)
)
dataset$Purchased = factor(
    dataset$Purchased,
    levels = c("No","Yes"),
    labels = c(0, 1)
) 
dataset

Country,Age,Salary,Purchased
1,44.0,72000.0,0
2,27.0,48000.0,1
3,30.0,54000.0,0
2,38.0,61000.0,0
3,40.0,63777.78,1
1,35.0,58000.0,1
2,38.77778,52000.0,0
1,48.0,79000.0,1
3,50.0,83000.0,0
1,37.0,67000.0,1


In [23]:
dataset[, 2:3] = scale(dataset[, 2:3])
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio=.8)
training_set = subset(dataset, split==TRUE)
test_set = subset(dataset, split==FALSE)

In [24]:
training_set

Row,Country,Age,Salary,Purchased
1,1,0.7199314,0.7110128,0
2,2,-1.6236751,-1.3643758,1
3,3,-1.2100975,-0.8455287,0
4,2,-0.1072238,-0.240207,0
5,3,0.1684946,0.0,1
7,2,0.0,-1.0184777,0
8,1,1.2713683,1.3163344,1
10,1,-0.245083,0.2786401,1


In [25]:
test_set

Row,Country,Age,Salary,Purchased
6,1,-0.5208015,-0.4996306,1
9,3,1.5470867,1.6622325,0


Both the training and test sets end up with very different values depending on whether you scale before or after splitting.

I guess this shouldn't matter too much on a large enough dataset (the sample mean and standard deviation should approach the population ones), but it makes a big difference in this small one.
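
The guess about larger datasets can be probed with a quick simulation. The sketch below is only an illustration under assumed conditions: the data are normally distributed random numbers (mean 40, sd 8, chosen arbitrarily), the first 80% of rows play the role of the training set, and gap is a helper invented for this example. It measures the largest difference between training values scaled before the split and the same values scaled after it.

```r
set.seed(1)

# Largest absolute difference between "scale, then split" and
# "split, then scale" for the training rows of a simulated column
gap = function(n) {
  x = rnorm(n, mean = 40, sd = 8)
  in_train = seq_len(n) <= 0.8 * n        # first 80% of rows as training set
  before = scale(x)[in_train]             # scaled using all rows
  after = as.numeric(scale(x[in_train]))  # scaled using training rows only
  max(abs(before - after))
}

print(gap(10))      # small dataset
print(gap(100000))  # large dataset
```

If the guess holds, the second number should be much smaller than the first.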