# Applied Machine Learning
## Train your first machine learning model
- Author: Lorien Pratt
- Copyright: Quantellia LLC 2019.  All Rights Reserved

## Setup

In [1]:
my_initials<-"nm" # Set your initials to use for model and data files

##### Install and initialize the H2O library, which we will use to do the grid search
Note that this step prints many messages. They are expected: they are notifications, not errors.

In [2]:
require(h2o)
h2o.init()

Loading required package: h2o

----------------------------------------------------------------------

Your next step is to start H2O:
    > h2o.init()

For H2O package documentation, ask for help:
    > ??h2o

After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit http://docs.h2o.ai

----------------------------------------------------------------------


Attaching package: ‘h2o’

The following objects are masked from ‘package:stats’:

    cor, sd, var

The following objects are masked from ‘package:base’:

    &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
    colnames<-, ifelse, is.character, is.factor, is.numeric, log,
    log10, log1p, log2, round, signif, trunc




H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /tmp/RtmpGzmaTN/h2o_jupyter_started_from_r.out
    /tmp/RtmpGzmaTN/h2o_jupyter_started_from_r.err


Starting H2O JVM and connecting: . Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         1 seconds 908 milliseconds 
    H2O cluster timezone:       Etc/UTC 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.26.0.10 
    H2O cluster version age:    3 days  
    H2O cluster name:           H2O_started_from_R_jupyter_nlj368 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   3.27 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, C

Install a number of R packages we'll need to display the results

In [8]:
# This will generate what look like warnings, but are really notifications
# uncomment install.packages if you are on a new instance without these packages pre-installed
#install.packages("plyr")
#install.packages("pROC")
#install.packages("SDMTools")
#install.packages("RColorBrewer")
#install.packages("gplots")
require(plyr)
require(pROC)
require(SDMTools)
require(RColorBrewer)
require(gplots)

Loading required package: gplots

Attaching package: ‘gplots’

The following object is masked from ‘package:stats’:

    lowess



Create the file names from your initials, just as we did when building the files in the first place

In [9]:
train_filename<-paste0("data/",my_initials,"_train_auto.csv"); print( train_filename )
test_filename<-paste0("data/",my_initials,"_test_auto.csv"); print( test_filename )
backtest_filename<-paste0("data/",my_initials,"_backtest_auto.csv"); print( backtest_filename )

[1] "data/nm_train_auto.csv"
[1] "data/nm_test_auto.csv"
[1] "data/nm_backtest_auto.csv"


Read in the training, test, and backtest files you created in the previous step. `h2o.importFile` loads each one into the H2O cluster as an H2O frame.

In [10]:
train_hex <- h2o.importFile(train_filename, parse = TRUE, header = TRUE, 
                            sep = "", col.names = NULL, col.types = NULL, na.strings = NULL)
test_hex <- h2o.importFile(test_filename, parse = TRUE, header = TRUE, 
                           sep = "", col.names = NULL, col.types = NULL, na.strings = NULL)
backtest_hex <- h2o.importFile(backtest_filename, parse = TRUE, header = TRUE, 
                           sep = "", col.names = NULL, col.types = NULL, na.strings = NULL)


ERROR: Unexpected HTTP Status code: 404 Not Found (url = http://localhost:54321/3/ImportFiles?path=data%2Fnm_train_auto.csv&pattern=)

water.exceptions.H2ONotFoundArgumentException
 [1] "water.exceptions.H2ONotFoundArgumentException: File data/nm_train_auto.csv does not exist"                          
 [2] "    water.persist.PersistNFS.importFiles(PersistNFS.java:136)"                                                      
 [3] "    water.persist.PersistManager.importFiles(PersistManager.java:374)"                                              
 [4] "    water.api.ImportFilesHandler.importFiles(ImportFilesHandler.java:25)"                                           
 [5] "    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)"                                                    
 [6] "    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)"                                  
 [7] "    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccess

ERROR: Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, : 

ERROR MESSAGE:

File data/nm_train_auto.csv does not exist
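
The 404 above means the H2O server could not find the file at the given path. Checking for the files before importing makes this failure easier to diagnose. The sketch below uses only base R. `missing_files` is a hypothetical helper name, and the check assumes the H2O server runs on the same machine and in the same working directory as the R session.

```r
# Hypothetical helper: return the subset of paths that do not exist on disk.
missing_files <- function(paths) {
  paths[!file.exists(paths)]
}

# Same naming scheme as above; "nm" stands in for your initials.
my_initials <- "nm"
data_files <- paste0("data/", my_initials, "_",
                     c("train", "test", "backtest"), "_auto.csv")

absent <- missing_files(data_files)
if (length(absent) > 0) {
  # Show where R is looking, then list what it could not find.
  message("Working directory: ", getwd())
  message("Missing files: ", paste(absent, collapse = ", "))
}
```

If files are reported missing, rerun the previous notebook with the same initials, or adjust the working directory so that the `data/` folder is visible from it.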




Set a number of configuration parameters for model training

In [None]:
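config_epochs <- 10000              # Maximum number of passes over the training data
config_hidden <- c(5)               # One hidden layer with 5 neurons
config_input_dropout_ratio <- 0.0   # No dropout on the input layer
config_l1 <- 1.0E-5                 # L1 regularization strength
config_l2 <- 0.001                  # L2 regularization strength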

**Tell the model training which of the columns are predictors.**
First, let's look at the top of the dataset again to remind us of its structure.

In [None]:
head(train_hex)

Set the predictor columns and check that they're the right ones

In [None]:
predictors <- c(2,3,4,5,6,7,8)
names(train_hex)[predictors]

Tell the model training which of the columns is the target column (in this case, the very last column, mpg)

In [None]:
targetcol<-ncol(train_hex)
names(train_hex)[targetcol]
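
Hard-coded column positions break silently if the column order changes. One alternative, sketched here on a plain data frame with made-up column names (the real names in `train_hex` may differ), is to derive the predictor set from the column names. `h2o.deeplearning` accepts column names as well as indices for `x` and `y`.

```r
# Toy data frame standing in for the training data; column names are hypothetical.
toy <- data.frame(id = 1:3,
                  cylinders = c(4, 6, 8),
                  weight = c(2200, 3100, 4000),
                  mpg = c(31, 22, 15))

# The target is the last column, as in the notebook.
target_name <- names(toy)[ncol(toy)]

# Predictors are every column except the row id and the target.
predictor_names <- setdiff(names(toy), c("id", target_name))

print(target_name)
print(predictor_names)
```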

Train the model

In [None]:
model  <- h2o.deeplearning(
    x = predictors,
    y = targetcol,
    variable_importances=TRUE,
    model_id='model_1',
    training_frame=train_hex,
    validation_frame = test_hex,
    quiet_mode=FALSE,
    export_weights_and_biases=TRUE,
    activation="Tanh",              # Tanh activation in the hidden layer
    autoencoder=FALSE,
    ignore_const_cols=FALSE,
    train_samples_per_iteration=0,
    stopping_tolerance = 1e-5,
    classification_stop = -1,       # Disable the classification-error stopping criterion
    adaptive_rate=FALSE,            # Disable the adaptive learning rate (ADADELTA)
    reproducible=TRUE,
    epochs=config_epochs,
    hidden=config_hidden,
    input_dropout_ratio = config_input_dropout_ratio, 
    l1=config_l1,
    l2=config_l2
  )

##### Plot how the error changed for the train and test set during learning. H2O calls the test set "Validation".

In [None]:
plot(model, timestep="epochs", metric="RMSE")

## Model evaluation

Generate predictions for every row in the backtest set, and create two vectors: one with the truth and one with the prediction

In [None]:
predictions <- h2o.predict(model, backtest_hex)
#actual_column <- as.logical(as.vector(as.numeric(backtest_hex[ ,ncol(backtest_hex)])))
actual_column <- as.vector(as.numeric(backtest_hex[targetcol]))
predict_column <- as.vector(predictions[ ,'predict'])
str(actual_column)
str(predict_column)

Plot actuals versus predictions to get a visual sense of how well the model did

In [None]:
plot(actual_column,predict_column)
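
A scatter plot gives a visual impression; a few summary numbers make it easier to compare models. The sketch below computes RMSE, MAE, and R² from two numeric vectors using base R. `regression_metrics` is a hypothetical helper name, and the example values are made up for illustration rather than taken from this notebook.

```r
# Hypothetical helper: standard regression metrics from two numeric vectors.
regression_metrics <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  residuals <- actual - predicted
  c(
    rmse = sqrt(mean(residuals^2)),   # root mean squared error
    mae  = mean(abs(residuals)),      # mean absolute error
    r2   = 1 - sum(residuals^2) / sum((actual - mean(actual))^2)
  )
}

# Made-up values for illustration only; not results from this notebook.
example_actual    <- c(18, 25, 30, 22)
example_predicted <- c(20, 24, 28, 22)
print(regression_metrics(example_actual, example_predicted))
```

With the vectors built above, `regression_metrics(actual_column, predict_column)` would report the same metrics for the backtest set.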

Extract the variable importances from the model and show them in a graph from most important to least important

In [None]:
x<-model@model
vi<-x$variable_importances
par(mar=c(5, 12, 5, 5))
plotSize<-15
cols <- colorRampPalette(brewer.pal(4,"Dark2"))(plotSize)
barplot(rev(vi$percentage),las=2,main="Variable Importances for auto mpg dataset",
                    names.arg=rev(vi$variable),
                    horiz=TRUE,cex.names=0.75,col=cols)

##### Create the ROC graph along with its AUC

In [None]:
# Note this is a regression, not a classification problem, so ROC doesn't make sense
# Saving the code here so you can use this for a classification model later
# 
#rocp1 <- roc(actual_column, predict_column,
#        percent=TRUE,
#        plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
#        print.auc=TRUE, show.thres=TRUE, main="ROC Graph: Actuals vs. Predictions")

##### Save your trained model. We'll use it later for inference. 

In [None]:
model_filename<-paste0("models/",my_initials,"_auto_model")
h2o.saveModel(model, model_filename, force=TRUE)  # model_filename is the directory the model is written into