# Applied Machine Learning
## Preparing machine learning data for training
- Author: Lorien Pratt
- Copyright: Quantellia LLC 2019.  All Rights Reserved

The purpose of this notebook is to prepare data for machine learning by splitting it into train, test, and backtest sets. We start with a single data file, split it into these three pieces, and write them back to disk.

See https://archive.ics.uci.edu/ml/datasets/iris+mpg for information about the data set used in this demonstration. Each row holds four flower measurements (sepal length, sepal width, petal length, petal width) and a species label, so the task is to predict the species from the measurements. A good attribute description can be found at https://vincentarelbundock.github.io/Rdatasets/doc/ISLR/iris.html

### Setup

In [1]:
# Create a variable to distinguish my data and model files from others'
# Note that this is done inside of R, not Python, using the %R construct, which allows me to place
# R code on a single line
my_initials <- "jing"

### Get the input file with all of the training data in it 
This is the iris data, obtained from the UCI repository. I made a couple of small changes: (a) rearranged the columns so the target is on the right and the label is on the left, and (b) changed ? (the missing-data indicator) to empty fields.

Note that this uses the *%%R* construct, which supports multi-line R code. Also note that *%%R* must be on the first line of the cell; you can't put a comment or anything else there.

In [2]:
iris <- read.csv("data/iris.csv", header=TRUE)
nrow(iris)
ncol(iris)
head(iris)

“cannot open file 'data/iris.csv': No such file or directory”

ERROR: Error in file(file, "rt"): cannot open the connection
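
The error above means `data/iris.csv` was not found relative to the working directory. In that case the assignment never happens, and later references to `iris` resolve to R's built-in data frame of the same name, whose columns are named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. One way to make this explicit is to check for the file first. The sketch below uses a hypothetical helper, `load_iris`, and falls back to the built-in `datasets::iris`; the renamed columns are an assumption chosen to match the names used later in this notebook.

```r
# Hypothetical helper (not part of the original notebook):
# read the CSV if it exists, otherwise fall back to R's built-in iris data.
load_iris <- function(path = "data/iris.csv") {
  if (file.exists(path)) {
    return(read.csv(path, header = TRUE))
  }
  message("File not found: ", path, "; using datasets::iris instead")
  fallback <- datasets::iris
  # Assumed column names, chosen to match the ones this notebook refers to
  names(fallback) <- c("sepal_length", "sepal_width",
                       "petal_length", "petal_width", "target")
  fallback
}

iris_data <- load_iris()
nrow(iris_data)
ncol(iris_data)
```

Running `getwd()` and `list.files("data")` before the read is a quick way to see why a relative path does not resolve.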


The data is in this format: predictor columns on the left and the target column on the far right.

<img src="img/target_column.png" style="width:200px;float:left;">

Examine various columns of the data to determine their values and range. (This was used to build the UI.)

In [3]:
print(summary(iris$target))
# Summaries of the predictor columns, used to choose slider ranges for inference input:
print(summary(iris$sepal_length))
print(summary(iris$sepal_width))
print(summary(iris$petal_length))
print(summary(iris$petal_width))

Length  Class   Mode 
     0   NULL   NULL 
Length  Class   Mode 
     0   NULL   NULL 
Length  Class   Mode 
     0   NULL   NULL 
Length  Class   Mode 
     0   NULL   NULL 
Length  Class   Mode 
     0   NULL   NULL 
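
Each summary above prints Length 0, Class NULL, Mode NULL. That is what `summary()` returns for `NULL`, and `$` returns `NULL` when a data frame has no column of that name, so this output indicates that the requested columns were not present in the data frame in use. A quick check is sketched here, with R's built-in `datasets::iris` as stand-in data and the expected names taken from the cell above:

```r
# Stand-in data (an assumption for illustration): R's built-in iris data frame
df <- datasets::iris

# Column names the summary cell expects to find
expected_cols <- c("sepal_length", "sepal_width",
                   "petal_length", "petal_width", "target")

# Any expected name that is absent would make summary(df$name) print NULL
missing_cols <- setdiff(expected_cols, names(df))
if (length(missing_cols) > 0) {
  message("Missing columns: ", paste(missing_cols, collapse = ", "))
}

is.null(df$target)  # TRUE: the built-in data frame has no "target" column
```

Comparing `names(df)` against the expected names before summarizing catches this kind of mismatch early.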


## Split the dataset in preparation for machine learning
We need to train the machine learning model and then test it on data it has never seen; this final check is called backtesting. Training itself also needs two files: a training file and a testing file.

So we need to split the input file into three files.

<img src="img/split_files.png" style="width:500px;float:left;">

First shuffle the rows of the dataset so that any ordering in the original file does not carry over into the train/test/backtest split.

In [4]:
iris_random <- iris[sample(nrow(iris)),]
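
The shuffle above draws random numbers before any seed has been set, so it produces a different row order on each run. A minimal sketch of a reproducible shuffle follows, assuming a seed value of 7 and using the built-in `datasets::iris` as stand-in data; the helper name `shuffle_rows` is hypothetical.

```r
# Stand-in data (an assumption for illustration): R's built-in iris data frame
df <- datasets::iris

# Hypothetical helper: seed the generator, then permute the row indices
shuffle_rows <- function(data, seed) {
  set.seed(seed)
  data[sample(nrow(data)), ]
}

shuffled_a <- shuffle_rows(df, seed = 7)
shuffled_b <- shuffle_rows(df, seed = 7)
identical(shuffled_a, shuffled_b)  # TRUE: same seed, same row order
```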

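Split the file into a training set (about 50% of the total), a test set (about 20%), and a backtest set (about 30%). Use the R function *sample* to assign each row to one of the three splits with these probabilities. Because each row is assigned independently, the resulting sizes are close to, but not exactly, these proportions.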

In [5]:
set.seed(7) # Do this so we can reproduce these results by using the same seed next time if we wish
sample_set <- sample(1:3,size=nrow(iris_random),replace=TRUE,prob=c(0.5,0.2,0.3))
train <- iris_random[sample_set==1,]
test <- iris_random[sample_set==2,]
backtest <- iris_random[sample_set==3,]
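
Assigning each row independently gives split sizes that only approximate 50/20/30. If exact sizes matter, one alternative (a sketch, not the method used in this notebook) is to cut a shuffled index vector at fixed positions. The helper name `split_exact` is hypothetical, and `datasets::iris` stands in for the data.

```r
# Hypothetical helper: split a data frame into exact-size pieces
split_exact <- function(data,
                        fractions = c(train = 0.5, test = 0.2, backtest = 0.3),
                        seed = 7) {
  stopifnot(abs(sum(fractions) - 1) < 1e-8)
  set.seed(seed)
  n <- nrow(data)
  idx <- sample(n)                              # shuffled row indices
  n_train <- round(fractions[["train"]] * n)
  n_test <- round(fractions[["test"]] * n)
  list(
    train = data[idx[seq_len(n_train)], ],
    test = data[idx[n_train + seq_len(n_test)], ],
    # Whatever remains after train and test goes to backtest
    backtest = data[tail(idx, n - n_train - n_test), ]
  )
}

parts <- split_exact(datasets::iris)
sapply(parts, nrow)  # 75, 30, and 45 rows for the 150-row stand-in data
```

The trade-off is that the probabilistic version needs no bookkeeping, while this version guarantees the sizes.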

Look at each file's number of rows and first 3 rows to check that it looks right

In [6]:
print(paste('Number of rows in the training file:', nrow(train)))  # paste() separates its arguments with a space; paste0() would not
print(head(train,3))

[1] "Number of rows in the training file: 72"
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
129          6.4         2.8          5.6         2.1  virginica
94           5.0         2.3          3.3         1.0 versicolor
96           5.7         3.0          4.2         1.2 versicolor
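
The label above is built with `paste`, which joins its arguments with a single space by default, whereas `paste0` joins them with nothing in between. A short illustration with a made-up label and count:

```r
# paste() inserts sep = " " between arguments; paste0() uses no separator
with_space <- paste("Number of rows:", 72)
without_space <- paste0("Number of rows:", 72)

print(with_space)     # "Number of rows: 72"
print(without_space)  # "Number of rows:72"
```

This is why `paste` suits the printed messages here, while `paste0` suits file names, where stray spaces are unwanted.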


In [7]:
print(paste('Number of rows in the testing file:', nrow(test)))  # paste() separates its arguments with a space; paste0() would not
print(head(test,3))

[1] "Number of rows in the testing file: 28"
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
109          6.7         2.5          5.8         1.8  virginica
87           6.7         3.1          4.7         1.5 versicolor
60           5.2         2.7          3.9         1.4 versicolor


In [8]:
print(paste('Number of rows in the backtesting file:', nrow(backtest)))  
print(head(backtest,3))

[1] "Number of rows in the backtesting file: 50"
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
72          6.1         2.8          4.0         1.3 versicolor
43          4.4         3.2          1.3         0.2     setosa
77          6.8         2.8          4.8         1.4 versicolor
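
The three row counts above (72, 28, and 50) add up to 150, so every row landed in exactly one split. The same check can be automated. A minimal sketch, re-creating the split on the built-in `datasets::iris` as stand-in data (the exact counts may differ between R versions, so none are asserted here):

```r
# Stand-in data (an assumption for illustration): R's built-in iris data frame
df <- datasets::iris

set.seed(7)
sample_set <- sample(1:3, size = nrow(df), replace = TRUE,
                     prob = c(0.5, 0.2, 0.3))
train <- df[sample_set == 1, ]
test <- df[sample_set == 2, ]
backtest <- df[sample_set == 3, ]

# The splits should cover every row exactly once
stopifnot(nrow(train) + nrow(test) + nrow(backtest) == nrow(df))
all_ids <- c(rownames(train), rownames(test), rownames(backtest))
stopifnot(!anyDuplicated(all_ids))

# Observed share of rows in each split, for comparison with 0.5 / 0.2 / 0.3
round(prop.table(table(sample_set)), 2)
```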


Build the output file names in the data directory, using your initials to distinguish your files from others'.

In [9]:
train_filename<-paste0("data/",my_initials,"_train_iris.csv"); print( train_filename )
test_filename<-paste0("data/",my_initials,"_test_iris.csv"); print( test_filename )
backtest_filename<-paste0("data/",my_initials,"_backtest_iris.csv"); print( backtest_filename )

[1] "data/jing_train_iris.csv"
[1] "data/jing_test_iris.csv"
[1] "data/jing_backtest_iris.csv"


Save the three files in the data directory

In [10]:
write.csv(train, file = train_filename, row.names=FALSE)
write.csv(test, file = test_filename, row.names=FALSE)
write.csv(backtest, file = backtest_filename, row.names=FALSE)
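
`write.csv` fails if the target directory does not exist, and without `row.names=FALSE` it writes the row names as an extra unnamed first column. The sketch below creates the directory first and reads the file back to confirm the round trip. It writes to a temporary directory rather than `data/`, and the file name and the five-row sample are assumptions for illustration.

```r
# Write to a temporary location so this sketch does not touch real data files
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Small stand-in for one of the splits
train_example <- head(datasets::iris, 5)
example_path <- file.path(out_dir, "xx_train_iris.csv")  # hypothetical name

write.csv(train_example, file = example_path, row.names = FALSE)

# Read the file back and confirm shape and column names survived
round_trip <- read.csv(example_path, header = TRUE)
stopifnot(nrow(round_trip) == nrow(train_example))
stopifnot(identical(names(round_trip), names(train_example)))
```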