Licensed under the MIT License.
# Train ML models with Azure Machine Learning R SDK (Preview)

Important: The Azure Machine Learning R SDK is currently in public preview. The preview version is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities.

In this lab you'll use the Azure Machine Learning R SDK (preview) to create a logistic regression model that predicts the likelihood of a fatality in a car accident. You'll see how the Azure Machine Learning cloud resources work with R to provide a scalable environment for training and deploying a model.

In this lab, you perform the following tasks:

- Load data and prepare for training
- Upload data to a datastore so it is available for remote training
- Use a compute resource to train the model remotely
- Train a caret model to predict probability of fatality

## Change notebook Kernel to R

Kernel -> Change Kernel -> R


## Install required packages

The stable release of the Azure ML SDK can be installed from CRAN, or the development version can be installed from GitHub. You will need **remotes** to install the **azuremlsdk** package.

`R code:`

In [None]:
install.packages('remotes')

In [None]:
# Use the `install_cran` function to install the package.

If you are using R installed from CRAN, which comes with 32-bit and 64-bit binaries, you may need to specify the parameter `INSTALL_opts=c("--no-multiarch")` to only build for the current 64-bit architecture.

In [None]:
remotes::install_cran('azuremlsdk', repos = 'https://cloud.r-project.org/', INSTALL_opts=c("--no-multiarch"))

Install the Azure ML Python SDK

By default, `install_azureml()` creates a conda environment called 'r-reticulate', installs the Python SDK in that environment, and restarts the R session after installation (if running in RStudio).

In [None]:
azuremlsdk::install_azureml()

Test installation: You can confirm your installation worked by loading the library and successfully retrieving a run.

In [None]:
library(azuremlsdk)
get_current_run()

## Set the working directory

In [None]:
setwd("C:\\Azure ML Labs\\Lab-04")

## Load your workspace

Instantiate a workspace object from your existing workspace. The following code will load the workspace details from the config.json file.
<br><br>
When you run the code to instantiate the workspace, a pop-up will appear asking you to log in to Azure ML and authenticate.

`R code:`

In [None]:
library(azuremlsdk)
ws <- load_workspace_from_config()

## Create an experiment
An Azure ML experiment tracks a grouping of runs, typically from the same training script. Create an experiment to track the runs for training the caret model on the accidents data.

`R code:`

In [None]:
experiment_name <- "Lab-04-accident-logreg"
exp <- experiment(ws, experiment_name)

## Prepare data for training

`R code:`

Change the path in setwd() to the folder that contains your datasets.

In [None]:
setwd("C:\\Azure ML Labs\\Datasets")

nassCDS <- read.csv("nassCDS.csv", 
                     colClasses=c("factor","numeric","factor",
                                  "factor","factor","numeric",
                                  "factor","numeric","numeric",
                                  "numeric","character","character",
                                  "numeric","numeric","character"))
accidents <- na.omit(nassCDS[,c("dead","dvcat","seatbelt","frontal","sex","ageOFocc","yearVeh","airbag","occRole")])
accidents$frontal <- factor(accidents$frontal, labels=c("notfrontal","frontal"))
accidents$occRole <- factor(accidents$occRole)
accidents$dvcat <- ordered(accidents$dvcat, 
                          levels=c("1-9km/h","10-24","25-39","40-54","55+"))

saveRDS(accidents, file="accidents.Rd")
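
To make the two factor conversions above concrete, the following sketch applies them to small made-up vectors. It assumes `frontal` is coded 0/1 in the CSV (as its numeric column class suggests); the toy values and the names `frontal_raw` and `dvcat_raw` are illustrative and are not taken from nassCDS.csv.

```r
# Toy stand-ins for two nassCDS columns (values are made up for illustration).
frontal_raw <- c(0, 1, 1, 0)
dvcat_raw <- c("55+", "1-9km/h", "25-39")

# factor() with labels renames the sorted unique values:
# 0 becomes "notfrontal" and 1 becomes "frontal".
frontal <- factor(frontal_raw, labels = c("notfrontal", "frontal"))

# ordered() records that the speed bands have a natural order,
# so comparisons between bands are meaningful.
dvcat <- ordered(dvcat_raw,
                 levels = c("1-9km/h", "10-24", "25-39", "40-54", "55+"))

print(table(frontal))
print(dvcat[1] > dvcat[3])  # is "55+" a higher band than "25-39"?
```

Because `dvcat` is ordered, R treats it as an ordinal predictor when the model is fitted, rather than as an unordered set of categories.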

## Upload data to the datastore

`R code:`

Change the path in the final setwd() call to your Lab-04 folder.

In [None]:
ds <- get_default_datastore(ws)

target_path <- "accidentdata"
upload_files_to_datastore(ds,
                          list("./accidents.Rd"),
                          target_path = target_path,
                          overwrite = TRUE)
                          
setwd("C:\\Azure ML Labs\\Lab-04")

## Train a model

For this lab, fit a logistic regression model on your uploaded data using your remote compute cluster. To submit a job, you need to:

- Prepare the training script
- Create an estimator
- Submit the job

## Prepare the training script

Save the training script in a new file named <b>accidents_train.r</b> in the Lab-04 directory.

Notice the following details inside the training script that have been done to leverage Azure Machine Learning for training:

- The training script takes an argument -d (passed as --data_folder in the estimator's script_params) to find the directory that contains the training data. When you define and submit your job later, you point to the datastore for this argument. Azure ML will mount the storage folder to the remote cluster for the training job.
<br><br>
- The training script logs the final accuracy as a metric to the run record in Azure ML using log_metric_to_run(). The Azure ML SDK provides a set of logging APIs for logging various metrics during training runs. These metrics are recorded and persisted in the experiment run record. The metrics can then be accessed at any time or viewed in the run details page in studio. See the reference for the full set of logging methods log_*().
<br><br>
- The training script saves your model into a directory named outputs. The ./outputs folder receives special treatment by Azure ML. During training, files written to ./outputs are automatically uploaded to your run record by Azure ML and persisted as artifacts. By saving the trained model to ./outputs, you'll be able to access and retrieve your model file even after the run is over and you no longer have access to your remote training environment.
<br>

`R code:`
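
A minimal sketch of a training script that matches the three points above could look like the following. It is a stand-in under stated assumptions, not the lab's original script: it parses `-d`/`--data_folder` with base R rather than optparse, computes the training accuracy directly rather than with caret, uses a 0.5 probability cut-off, and reads the accidents.Rd file uploaded earlier. The helper names `get_data_folder` and `train_accident_model` are hypothetical.

```r
# accidents_train.r (sketch; see the assumptions listed above)

# Return the value that follows -d or --data_folder in the argument vector.
get_data_folder <- function(args) {
  pos <- which(args %in% c("-d", "--data_folder"))
  if (length(pos) == 0 || pos[1] == length(args)) {
    stop("expected -d <folder> or --data_folder <folder>")
  }
  args[pos[1] + 1]
}

# Fit the logistic regression on all predictors, save the model under
# output_dir, and return the accuracy on the training data.
train_accident_model <- function(accidents, output_dir = "outputs") {
  model <- glm(dead ~ ., family = binomial, data = accidents)
  outcome_levels <- levels(accidents$dead)  # first level is the reference
  predicted <- ifelse(predict(model, type = "response") > 0.5,
                      outcome_levels[2], outcome_levels[1])
  accuracy <- mean(predicted == accidents$dead)
  dir.create(output_dir, showWarnings = FALSE)
  saveRDS(model, file = file.path(output_dir, "model.rds"))
  accuracy
}

# Only train when the script is started with a data folder argument.
args <- commandArgs(trailingOnly = TRUE)
if (any(args %in% c("-d", "--data_folder"))) {
  accidents <- readRDS(file.path(get_data_folder(args), "accidents.Rd"))
  accuracy <- train_accident_model(accidents)
  # Record the metric in the Azure ML run record.
  azuremlsdk::log_metric_to_run("Accuracy", accuracy)
}
```

Keeping the training logic in a function makes it possible to try it locally on a small data frame before submitting the job; the Azure ML logging call only runs when the script receives a data folder argument.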

## Retrieve the existing compute target
A compute cluster is already provisioned for use in all labs. Update the value of cluster_name in the code below.


`R code:`

In [None]:
compute_target <- get_compute(ws, cluster_name = 'CPU-Cluster-XX')

## Create environment

`R code:`

In [None]:
r_env <- r_environment(name = 'myr_env',
                       cran_packages = list(cran_package("caret"),
                                            cran_package("e1071"),
                                            cran_package("optparse")))

## Create an estimator
An Azure ML estimator encapsulates the run configuration information needed for executing a training script on the compute target. Azure ML runs execute as containerized jobs on the specified compute target. By default, the Docker image built for your training job will include R, the Azure ML SDK, and a set of commonly used R packages. See the SDK documentation for the full list of default packages.

To create the estimator, define:

- The directory that contains your scripts needed for training (source_directory). All the files in this directory are uploaded to the cluster node(s) for execution. The directory must contain your training script and any additional scripts required.
<br><br>
- The training script that will be executed (entry_script).
<br><br>
- The compute target (compute_target), in this case the AmlCompute cluster you retrieved earlier.
<br><br>
- The parameters required from the training script (script_params). Azure ML will run your training script as a command-line script with Rscript. In this tutorial you specify one argument to the script, the data directory mounting point, which you can access with ds$path(target_path).
<br><br>
- If you are using R packages that are not included by default, use the estimator's cran_packages parameter to add additional CRAN packages, or pass an environment that lists them, as this lab does with r_env.
<br>

`R code:`

In [None]:
est <- estimator(source_directory = ".",
                 entry_script = "accidents_train.r",
                 script_params = list("--data_folder" = ds$path(target_path)),
                 compute_target = compute_target,
                 environment = r_env)

## Submit the job on the remote cluster

Finally, submit the job to run on your cluster. submit_experiment() returns a Run object that you then use to interface with the run. The first run takes about 10 minutes in total. For later runs, the same Docker image is reused as long as the script dependencies don't change; in that case the image is cached and the container starts much faster.

`R code:`

In [None]:
run <- submit_experiment(exp, est)

You can view the run's details in RStudio Viewer. Clicking the "Web View" link provided will bring you to Azure Machine Learning studio, where you can monitor the run in the UI.
<br><br>
After you submit the experiment, you can view the image build log by signing in to https://ml.azure.com and selecting Experiments -> Lab-04-accident-logreg -> the latest run (with status Preparing) -> Outputs + logs -> azureml-logs -> XX_image_build_log.txt.

In [None]:
get_run_details(run)

<b>Note:</b> Model training happens in the background. Wait until the model has finished training before you run more code. 

You and colleagues with access to the workspace can submit multiple experiments in parallel, and Azure ML will take care of scheduling the tasks on the compute cluster. You can even configure the cluster to automatically scale up to multiple nodes, and scale back down when there are no more compute tasks in the queue. This configuration is a cost-effective way for teams to share compute resources.

## Get the logged metrics
Once your model has finished training, you can access the artifacts of your job that were persisted to the run record, including any metrics logged and the final trained model.

In the training script accidents_train.r, you logged a metric from your model: the accuracy of the predictions on the training data. You can see metrics in the studio, or extract them to the local session as an R list as follows:

`R code:`

In [None]:
metrics <- get_run_metrics(run)
metrics

If you have run multiple experiments (say, using differing variables, algorithms, or hyperparameters), you can use the metrics from each run to compare and choose the model you'll use in production.
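
As a rough sketch of such a comparison, the following picks the run with the highest accuracy from a set of metric lists. The run names and accuracy values are made up for illustration; in practice each inner list would come from a call to get_run_metrics() on a finished run, assuming each run logged a metric named "Accuracy".

```r
# Hypothetical metrics for two runs, shaped like the named lists
# returned by get_run_metrics(). The values are invented for illustration.
run_metrics <- list(
  all_predictors = list(Accuracy = 0.91),
  without_airbag = list(Accuracy = 0.89)
)

# Pull the Accuracy metric out of each run into a named numeric vector.
accuracy_by_run <- sapply(run_metrics, function(m) m$Accuracy)

# Name of the run with the highest accuracy.
best_run <- names(which.max(accuracy_by_run))
print(accuracy_by_run)
print(best_run)
```

Accuracy on the training data is only one criterion; the same pattern works for any other metric the training script logs.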

## Get the trained model

You can retrieve the trained model and look at the results in your local R session. The following code will download the contents of the ./outputs directory, which includes the model file.

`R code:`

In [None]:
download_files_from_run(run, prefix="outputs/")
accident_model <- readRDS("outputs/model.rds")
summary(accident_model)

You see some factors that contribute to an increase in the estimated probability of death:

- higher impact speed
- male driver
- older occupant
- passenger
<br>

You see lower probabilities of death with:
- presence of airbags
- presence of seatbelts
- frontal collision
<br>

The vehicle year of manufacture does not have a significant effect.
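
One common way to read effects like these off a logistic regression is to exponentiate the coefficients into odds ratios: a value above 1 means the term raises the odds of the outcome, and a value below 1 means it lowers them. The sketch below illustrates this on simulated data, not on the accident data; the effect sizes and the names `toy` and `toy_model` are assumptions chosen only for the example.

```r
set.seed(42)
n <- 500

# Simulated predictors that mimic two columns of the accident data.
toy <- data.frame(
  seatbelt = factor(sample(c("none", "belted"), n, replace = TRUE),
                    levels = c("none", "belted")),
  ageOFocc = sample(16:97, n, replace = TRUE)
)

# Made-up effects: seatbelts lower the log-odds of death, age raises them.
log_odds <- -2 - 1.5 * (toy$seatbelt == "belted") + 0.03 * toy$ageOFocc
toy$dead <- factor(ifelse(runif(n) < plogis(log_odds), "dead", "alive"),
                   levels = c("alive", "dead"))

toy_model <- glm(dead ~ seatbelt + ageOFocc, family = binomial, data = toy)

# Odds ratios: exp() of each coefficient.
odds_ratios <- exp(coef(toy_model))
print(odds_ratios)
```

Applied to the downloaded model, `exp(coef(accident_model))` would give the corresponding odds ratios, and the p-values in `summary(accident_model)` indicate which terms are statistically significant.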

You can use this model to make new predictions:

`R code:`

In [None]:
newdata <- data.frame( # valid values shown below
 dvcat="10-24",        # "1-9km/h" "10-24"   "25-39"   "40-54"   "55+"  
 seatbelt="none",      # "none"   "belted"  
 frontal="frontal",    # "notfrontal" "frontal"
 sex="f",              # "f" "m"
 ageOFocc=16,          # age in years, 16-97
 yearVeh=2002,         # year of vehicle, 1955-2003
 airbag="none",        # "none"   "airbag"   
 occRole="pass"        # "driver" "pass"
 )

## predicted probability of death for these variables, as a percentage
as.numeric(predict(accident_model,newdata, type="response")*100)
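
The `type="response"` argument is what turns the model's output into a probability. As a small illustration, the sketch below fits a logistic regression on R's built-in mtcars data (a stand-in chosen only because it ships with R, not part of this lab) and shows that the response-scale prediction is the inverse logit of the link-scale prediction.

```r
# mtcars ships with R; am (0/1 transmission type) plays the role of the
# binary outcome and wt (weight) is the single predictor.
toy_model <- glm(am ~ wt, family = binomial, data = mtcars)
new_car <- data.frame(wt = 3)

# Linear predictor (log-odds) and probability for the same new row.
log_odds <- predict(toy_model, new_car, type = "link")
probability <- predict(toy_model, new_car, type = "response")

# plogis() is the inverse logit, so both routes give the same probability.
print(all.equal(as.numeric(probability), as.numeric(plogis(log_odds))))
print(as.numeric(probability * 100))  # as a percentage
```

Without `type="response"`, predict() on a glm returns log-odds, which would not be meaningful when multiplied by 100.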

### --- End ---
