# Applied Machine Learning
## Preparing machine learning data for training
- Author: Lorien Pratt
- Copyright: Quantellia LLC 2019.  All Rights Reserved

The purpose of this notebook is to prepare data for machine learning by splitting it into train, test, and backtest sets. We start with a single data file, split it into these three pieces, and write them back to disk.

See https://archive.ics.uci.edu/ml/datasets/auto+mpg for information about the data set used in this demonstration.  It is built for a regression task, to estimate the miles per gallon of a car from its characteristics.  A good attribute description can be found at https://vincentarelbundock.github.io/Rdatasets/doc/ISLR/Auto.html

### Setup

In [1]:
# Create a variable to distinguish my data and model files from others'
# Note that this is done inside of R, not Python, using the %R construct, which allows me to place
# R code on a single line
my_initials<-"nm"

### Get the input file with all of the training data in it 
This is the Auto MPG data, obtained from the UCI repository. I made a couple of small changes: a) rearranged the columns so that the target (mpg) is on the right and the label (car.name) is on the left, and b) changed ? (the missing-data indicator) to empty fields.
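As a quick illustration of why change (b) is convenient, here is a minimal sketch of how `read.csv` treats empty fields. The rows are made up for the example and are not taken from the real file. It also shows the `na.strings` argument, which would be one way to read `?` as missing without editing the file; that is an alternative, not what was done for this notebook.

```r
# Made-up rows in the notebook's column layout (label first, target last).
csv_empty <- "car.name,cylinders,horsepower,mpg
car a,4,95,24.0
car b,6,,19.0
car c,8,150,14.0"

demo_empty <- read.csv(text = csv_empty, header = TRUE)

# An empty numeric field is read as NA, and the column stays numeric.
print(is.na(demo_empty$horsepower))
print(is.numeric(demo_empty$horsepower))

# Alternative: keep the "?" marker in the file and tell read.csv to treat it as missing.
csv_qmark <- "car.name,cylinders,horsepower,mpg
car a,4,95,24.0
car b,6,?,19.0
car c,8,150,14.0"

demo_qmark <- read.csv(text = csv_qmark, header = TRUE, na.strings = "?")
print(is.na(demo_qmark$horsepower))
```

Without `na.strings = "?"`, the `?` would force the whole horsepower column to be read as text rather than numbers.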

Note that the cell below uses the *%%R* construct, which supports multi-line R code. Also note that *%%R* must be on the first line of the cell; you can't have a comment or anything else there.

In [2]:
auto <- read.csv("data/auto-mpg.csv", header=T)
nrow(auto)
ncol(auto)
head(auto)



The data is in this format, with the predictor columns on the left and the target column on the far right.

<img src="img/target_column.png" style="width:200px;float:left;">

Examine various columns of the data to determine their values and range. (This was used to build the UI.)

In [3]:
print(summary(auto$mpg))
# Analysis for purpose of sliders for inference input:
print(summary(auto$cylinders)) # we'll call this 3-8 on an integer range
print(summary(auto$displacement)) # we'll call this 0 - 500 on an integer range
print(summary(auto$horsepower)) # we'll call this 0 - 500 on an integer range
print(summary(auto$weight)) # we'll call this 1000 - 10000 on an integer range
print(summary(auto$acceleration)) # we'll call this 0-100 on a float range
print(summary(auto$model.year)) # we'll call this 60-90 on an integer range
print(summary(auto$origin)) # we'll call this 1,2,3 on an integer range

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9.00   17.50   23.00   23.51   29.00   46.60 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.000   4.000   4.000   5.455   8.000   8.000 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   68.0   104.2   148.5   193.4   262.0   455.0 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   46.0    75.0    93.5   104.5   126.0   230.0       6 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1613    2224    2804    2970    3608    5140 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   8.00   13.82   15.50   15.57   17.18   24.80 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  70.00   73.00   76.00   76.01   79.00   82.00 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   1.000   1.573   2.000   3.000 
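The slider ranges in the comments above were picked by eye from the summaries and are deliberately wider than the observed values. A minimal sketch of pulling the observed minimum, maximum, and missing-value count for every column at once is shown here. The data frame `demo` holds made-up stand-in values, not the real data.

```r
# Made-up values standing in for two columns of the auto data frame.
demo <- data.frame(
  horsepower = c(46, 93, NA, 230),
  weight     = c(1613, 2804, 2970, 5140)
)

# Observed minimum and maximum per column, ignoring missing values.
observed <- sapply(demo, range, na.rm = TRUE)
rownames(observed) <- c("min", "max")
print(observed)

# Number of missing values per column.
missing_counts <- colSums(is.na(demo))
print(missing_counts)
```

The `na.rm = TRUE` argument matters here: without it, any column with a missing value would report `NA` for both ends of its range.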


## Split the dataset in preparation for machine learning
We need to train the machine learning model and then evaluate it on data it has never seen. This final evaluation on held-out data is called backtesting. Separately, during training, ML needs a training file and a testing file.

So we need to split the input file into three files.

<img src="img/split_files.png" style="width:500px;float:left;">

First shuffle the rows of the dataset so that any ordering in the original file is not carried into the train/test/backtest split.

In [4]:
auto_random <- auto[sample(nrow(auto)),]
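A shuffle like this gives a different row order on every run unless the random number generator is seeded first. The sketch below uses a small made-up data frame and an arbitrary seed to show that seeding before `sample` makes the permutation repeatable, and that shuffling only reorders rows without dropping or duplicating any.

```r
demo <- data.frame(id = 1:10, value = letters[1:10])

set.seed(42)                                # arbitrary seed, chosen for illustration
shuffled_a <- demo[sample(nrow(demo)), ]

set.seed(42)                                # same seed, so the same permutation
shuffled_b <- demo[sample(nrow(demo)), ]

print(identical(shuffled_a, shuffled_b))

# Every original row still appears exactly once after shuffling.
print(sort(shuffled_a$id))
```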

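Split the data into a training set (about 50% of the total), a test set (about 20%), and a backtest set (about 30%). Use R's *sample* function to assign each row to one of the three splits with these probabilities. Because each row is assigned independently, the resulting proportions are approximate rather than exact.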

In [5]:
# Seed the random number generator so the assignment below can be reproduced next time.
# Note: the shuffle in the previous cell ran before this call, so it is not covered by the seed.
set.seed(7)
sample_set <- sample(1:3,size=nrow(auto_random),replace=TRUE,prob=c(0.5,0.2,0.3))
train <- auto_random[sample_set==1,]
test <- auto_random[sample_set==2,]
backtest <- auto_random[sample_set==3,]
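To see how close a probabilistic assignment like this comes to the target proportions, the realised shares can be tabulated. The sketch below does that, and also shows one possible alternative (not the method used in this notebook) that fixes the group sizes up front. The row count of 398 is an assumption, taken from the sum of the three split sizes printed further down.

```r
set.seed(7)
n <- 398   # assumed row count: the sum of the three split sizes reported in this notebook
assignment <- sample(1:3, size = n, replace = TRUE, prob = c(0.5, 0.2, 0.3))

# Realised share of rows in each split; close to, but not exactly, 0.5 / 0.2 / 0.3.
print(round(prop.table(table(assignment)), 3))

# Alternative: fixed group sizes. Applied to rows that have already been shuffled,
# consecutive blocks of these sizes give the target proportions up to rounding.
sizes <- c(train = floor(0.5 * n), test = floor(0.2 * n))
sizes["backtest"] <- n - sum(sizes)         # remainder goes to the backtest set
exact_assignment <- rep(1:3, times = sizes)
print(table(exact_assignment))
```

The probabilistic version is simpler to write; the fixed-size version is useful when exact split sizes matter, for example on very small data sets.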

Look at each file's number of rows and first 3 rows to check that it looks right

In [6]:
print(paste('Number of rows in the training file:', nrow(train)))  # paste() inserts a space between its arguments; paste0() would not
print(head(train,3))

[1] "Number of rows in the training file: 193"
                             car.name cylinders displacement horsepower weight
134 chevrolet chevelle malibu classic         6          250        100   3781
374              ford fairmont futura         4          140         92   2865
290           buick estate wagon (sw)         8          350        155   4360
    acceleration model.year origin  mpg
134         17.0         74      1 16.0
374         16.4         82      1 24.0
290         14.9         79      1 16.9


In [7]:
print(paste('Number of rows in the testing file:', nrow(test)))  # paste() inserts a space between its arguments; paste0() would not
print(head(test,3))

[1] "Number of rows in the testing file: 78"
                car.name cylinders displacement horsepower weight acceleration
90  dodge coronet custom         8          318        150   3777         12.5
357       toyota corolla         4          108         75   2350         16.8
82      datsun\t510 (sw)         4           97         92   2288         17.0
    model.year origin  mpg
90          73      1 15.0
357         81      3 32.4
82          72      3 28.0


In [8]:
print(paste('Number of rows in the backtesting file:', nrow(backtest)))  
print(head(backtest,3))

[1] "Number of rows in the backtesting file: 127"
                           car.name cylinders displacement horsepower weight
179                    peugeot\t504         4          120         88   2957
264 buick regal sport coupe (turbo)         6          231        165   3445
142                        audi fox         4           98         83   2219
    acceleration model.year origin  mpg
179         17.0         75      2 23.0
264         13.4         78      1 17.7
142         16.5         74      2 29.0
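Beyond eyeballing the first rows, two cheap checks can confirm that the three pieces really partition the data: the row counts should add up to the original total, and no row should appear in more than one piece. The sketch below runs both checks on a small made-up data frame. The names ending in `_demo` are hypothetical and only used here.

```r
demo <- data.frame(id = 1:20)

set.seed(1)                                 # arbitrary seed for the illustration
groups <- sample(1:3, size = nrow(demo), replace = TRUE, prob = c(0.5, 0.2, 0.3))
train_demo    <- demo[groups == 1, , drop = FALSE]
test_demo     <- demo[groups == 2, , drop = FALSE]
backtest_demo <- demo[groups == 3, , drop = FALSE]

# Check 1: the pieces add up to the full data set.
total <- nrow(train_demo) + nrow(test_demo) + nrow(backtest_demo)
print(total == nrow(demo))

# Check 2: no row lands in more than one piece (subsetting keeps the original row names).
all_names <- c(rownames(train_demo), rownames(test_demo), rownames(backtest_demo))
print(anyDuplicated(all_names) == 0)
```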


Build the names of the three output files in the data directory, using your initials to distinguish them from other people's files.

In [9]:
train_filename<-paste0("data/",my_initials,"_train_auto.csv"); print( train_filename )
test_filename<-paste0("data/",my_initials,"_test_auto.csv"); print( test_filename )
backtest_filename<-paste0("data/",my_initials,"_backtest_auto.csv"); print( backtest_filename )

[1] "data/nm_train_auto.csv"
[1] "data/nm_test_auto.csv"
[1] "data/nm_backtest_auto.csv"


Save the three files in the data directory

In [10]:
write.csv(train, file = train_filename, row.names=FALSE)
write.csv(test, file = test_filename, row.names=FALSE)
write.csv(backtest, file = backtest_filename, row.names=FALSE)
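One detail worth knowing about `write.csv`: by default it writes missing values as the text `NA`, whereas the input file for this notebook used empty fields. If whatever reads these files expects empty fields, the `na` argument can be set accordingly. The sketch below writes a small made-up data frame to a temporary directory (a stand-in for the data directory, with a hypothetical file name) and reads it back to confirm the round trip.

```r
demo <- data.frame(car.name   = c("car a", "car b"),
                   horsepower = c(95, NA),
                   mpg        = c(24.0, 19.0))

out_dir <- file.path(tempdir(), "data")     # temporary stand-in for the notebook's data directory
dir.create(out_dir, showWarnings = FALSE)   # write.csv does not create missing directories
out_file <- file.path(out_dir, "xx_train_demo.csv")   # hypothetical file name

# na = "" writes missing values as empty fields instead of the default "NA".
write.csv(demo, file = out_file, row.names = FALSE, na = "")

# Inspect the raw file, then read it back and confirm the missing value survived.
written_lines <- readLines(out_file)
print(written_lines)
round_trip <- read.csv(out_file, header = TRUE)
print(is.na(round_trip$horsepower))
```

Either convention reads back correctly with `read.csv`, since it treats both `NA` and empty numeric fields as missing; the choice only matters for other tools that consume the files.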