Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Remote training on BatchAI

## Introduction
This tutorial shows how to train a simple deep neural network using the CIFAR-10 dataset and Keras on Azure Machine Learning. 

Let's get started. First let's import some Python libraries.

In [None]:
%matplotlib inline
import numpy as np
import os
import matplotlib
import matplotlib.pyplot as plt

In [None]:
import azureml
from azureml.core import Workspace, Run

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

## Initialize workspace
Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`.

In [None]:
from azureml.core.workspace import Workspace

ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')
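
For reference, `config.json` is a small JSON file that identifies the workspace. The sketch below writes one with placeholder values so the expected shape is visible. The values and the temporary location are assumptions made for illustration, and the folder in which `Workspace.from_config()` looks for the file may differ between SDK versions.

```python
import json
import os
import tempfile

# Placeholder values (assumptions): replace with your own subscription,
# resource group and workspace name.
config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Written to a temporary folder so this sketch never overwrites a real config file.
config_dir = tempfile.mkdtemp()
config_path = os.path.join(config_dir, "config.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=4)

# Show what was written.
with open(config_path) as f:
    print(f.read())
```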

## Create an Azure ML experiment
Let's create an experiment and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure.

In [None]:
from azureml.core import Experiment

script_folder = './keras-cifar10'

exp = Experiment(workspace=ws, name='keras-cifar10')
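
The cell above only names the script folder; it does not create it. A minimal sketch of creating the folder and copying the training script into it might look like the following. It assumes the entry script `cifar10_azureml.py` (the name passed to the estimator later in this notebook) sits next to the notebook, which may not match your layout.

```python
import os
import shutil

script_folder = './keras-cifar10'
entry_script = 'cifar10_azureml.py'  # assumed to sit next to this notebook

# create the folder whose contents will be uploaded to the compute target
os.makedirs(script_folder, exist_ok=True)

# copy the training script into the folder if it is available locally
if os.path.isfile(entry_script):
    shutil.copy(entry_script, script_folder)
else:
    print('{} not found; copy it into {} before submitting.'.format(entry_script, script_folder))
```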

## Create Batch AI cluster as compute target
[Batch AI](https://docs.microsoft.com/en-us/azure/batch-ai/overview) is a service for provisioning and managing clusters of Azure virtual machines for running machine learning workloads. Let's create a new Batch AI cluster in the current workspace, if it doesn't already exist. We will then run the training script on this compute target.

The cell below first looks for a cluster with the given name. If none is found, it creates a new Batch AI cluster of `STANDARD_NC6s_v2` GPU VMs. Creation is broken down into 3 steps:
1. create the configuration (this step is local and only takes a second)
2. create the Batch AI cluster (this step will take about **20 seconds**)
3. provision the VMs to bring the cluster to its initial size (4 nodes in this case, as set by `cluster_min_nodes`). This step will take about **3-5 minutes** and provides only sparse output while it runs. Please wait until the call returns before moving to the next cell.

In [None]:
from azureml.core.compute import ComputeTarget, BatchAiCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
cluster_name = "batchai-cluster"

try:
    # look for the existing cluster by name
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    if isinstance(compute_target, BatchAiCompute):
        print('Found existing compute target {}.'.format(cluster_name))
    else:
        print('{} exists but it is not a Batch AI cluster. Please choose a different name.'.format(cluster_name))
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = BatchAiCompute.provisioning_configuration(vm_size="STANDARD_NC6s_v2", # GPU-based VM
                                                                #vm_priority='lowpriority', # optional
                                                                autoscale_enabled=False,
                                                                cluster_min_nodes=4, 
                                                                cluster_max_nodes=4)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it uses the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # Use the 'status' property to get a detailed status for the current cluster. 
    print(compute_target.status.serialize())

Now that you have created the compute target, let's see what the workspace's `compute_targets` property returns. You should now see one entry named 'batchai-cluster' of type BatchAI.

In [None]:
compute_targets = ws.compute_targets
compute_targets()[0].provisioning_state

## Create TensorFlow estimator
Next, we construct an `azureml.train.dnn.TensorFlow` estimator object, use the Batch AI cluster as the compute target, and pass the mount point of the datastore to the training code as a parameter.
The TensorFlow estimator provides a simple way of launching a TensorFlow training job on a compute target. It automatically provides a Docker image that has TensorFlow installed -- if additional pip or conda packages are required, their names can be passed in via the `pip_packages` and `conda_packages` arguments and they will be included in the resulting Docker image.

In [None]:
from azureml.train.dnn import TensorFlow

script_params = {
    '--data-folder': ws.get_default_datastore().as_mount(),
    '--batch-size': 50,
    '--learning-rate': 0.0001,
    '--decay': 1e-6,
    '--epochs': 25
}

est = TensorFlow(source_directory=script_folder,
                 script_params=script_params,
                 compute_target=compute_target,
                 entry_script='cifar10_azureml.py',    
                 conda_packages=['keras', 'h5py'],
                 use_gpu=True)
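
The entries in `script_params` reach the entry script as command-line arguments. The contents of `cifar10_azureml.py` are not shown in this notebook, so the following is only a sketch of how such a script could read them with `argparse`. The function name, `dest` names, defaults and the `/mnt/data` path are assumptions.

```python
import argparse

def parse_args(argv=None):
    """Parse the arguments that the estimator passes to the entry script."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-folder', type=str, dest='data_folder',
                        help='mount point of the datastore')
    parser.add_argument('--batch-size', type=int, dest='batch_size', default=50)
    parser.add_argument('--learning-rate', type=float, dest='learning_rate', default=0.0001)
    parser.add_argument('--decay', type=float, default=1e-6)
    parser.add_argument('--epochs', type=int, default=25)
    return parser.parse_args(argv)

# simulate the command line built from script_params; '/mnt/data' is a made-up mount point
args = parse_args(['--data-folder', '/mnt/data', '--batch-size', '50',
                   '--learning-rate', '0.0001', '--decay', '1e-6', '--epochs', '25'])
print(args)
```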

## Submit job to run
Calling the `submit` function on the experiment, with the estimator as its `config` argument, submits the job to Azure ML for execution. Submitting the job should only take a few seconds.

In [None]:
run = exp.submit(config=est)

### Monitor the Run
As the Run is executed, it will go through the following stages:
1. Preparing: A Docker image matching the Python environment specified by the TensorFlow estimator is created and uploaded to the workspace's Azure Container Registry. This step happens only once for each Python environment -- the image is then cached for subsequent runs. Creating and uploading the image takes about **5 minutes**. While the job is preparing, logs are streamed to the run history and can be viewed to monitor the progress of the image creation.

2. Scaling: If the compute needs to be scaled up (i.e. the Batch AI cluster requires more nodes to execute the run than are currently available), the Batch AI cluster will attempt to scale up to make the required number of nodes available. Scaling typically takes about **5 minutes**.

3. Running: All scripts in the script folder are uploaded to the compute target, data stores are mounted/copied and the `entry_script` is executed. While the job is running, stdout and the `./logs` folder are streamed to the run history and can be viewed to monitor the progress of the run.

4. Post-Processing: The `./outputs` folder of the run is copied over to the run history.

There are multiple ways to check the progress of a running job. We can use a Jupyter notebook widget. 

**Note: The widget will automatically update every 10-15 seconds, always showing you the most up-to-date information about the run.**

In [None]:
from azureml.train.widgets import RunDetails
RunDetails(run).show()

We can also periodically check the status of the run object, or navigate to the Azure portal to monitor the run.

In [None]:
run
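
Periodic checking can also be scripted. The helper below is a minimal sketch that polls any status function until a terminal state is reached. The helper name, the polling interval and the set of terminal state names are assumptions, and a stand-in status source is used so the sketch runs without a live job.

```python
import time

# assumed names of the states in which a run has stopped
TERMINAL_STATES = {'Completed', 'Failed', 'Canceled'}

def wait_until_done(get_status, poll_seconds=30, timeout_seconds=3600):
    """Poll get_status() until it returns a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        print('Run status: {}'.format(status))
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError('run did not finish within {} seconds'.format(timeout_seconds))
        time.sleep(poll_seconds)

# stand-in for a live run: yields a fixed sequence of made-up statuses
fake_statuses = iter(['Preparing', 'Running', 'Completed'])
final_status = wait_until_done(lambda: next(fake_statuses), poll_seconds=0)
```

With a submitted job, `run.get_status` could be passed in place of the lambda.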

In [None]:
run.wait_for_completion(show_output = True)

### The Run object
The Run object provides the interface to the run history -- both to the job and to the control plane (this notebook), and both while the job is running and after it has completed. It provides a number of useful features, for instance:
* `run.get_details()`: Provides a rich set of properties of the run
* `run.get_metrics()`: Provides a dictionary with all the metrics that were reported for the Run
* `run.get_file_names()`: Lists all the files that were uploaded to the run history for this Run. This includes the `outputs` and `logs` folders, azureml-logs and other logs, as well as files that were explicitly uploaded to the run using `run.upload_file()`

Below are some examples -- please run through them and inspect their output. 

In [None]:
run.get_details()

In [None]:
run.get_metrics()

In [None]:
run.get_file_names()

## Plot accuracy over epochs
Since we can retrieve the metrics from the run, we can easily make plots using `matplotlib` in the notebook. Then we can add the plotted image to the run using `run.log_image()`, so all information about the run is kept together.

In [None]:
import os
os.makedirs('./imgs', exist_ok = True)
metrics = run.get_metrics()

plt.figure(figsize = (13,5))
plt.plot(metrics['validation_acc'], 'r-', lw = 4, alpha = .6)
plt.plot(metrics['training_acc'], 'b--', alpha = 0.5)
plt.legend(['Full evaluation set', 'Training set mini-batch'])
plt.xlabel('epochs', fontsize = 14)
plt.ylabel('accuracy', fontsize = 14)
plt.title('Accuracy over Epochs', fontsize = 16)
run.log_image(name = 'acc_over_epochs.png', plot = plt)
plt.show()
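
The same metrics dictionary can be summarized without plotting. The sketch below finds the epoch with the highest validation accuracy. It assumes each metric is a list with one value per epoch, and it uses made-up numbers in place of the output of `run.get_metrics()`.

```python
def best_epoch(values):
    """Return (1-based epoch, value) for the highest entry in a list of per-epoch values."""
    index = max(range(len(values)), key=lambda i: values[i])
    return index + 1, values[index]

# made-up numbers standing in for run.get_metrics()
example_metrics = {
    'training_acc': [0.31, 0.45, 0.52, 0.58],
    'validation_acc': [0.35, 0.47, 0.51, 0.50],
}

epoch, acc = best_epoch(example_metrics['validation_acc'])
print('Best validation accuracy {:.2f} at epoch {}'.format(acc, epoch))
```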

## Download the saved model

In the training script, a TensorFlow `saver` object is used to persist the model in a folder local to the compute target. The model was saved to the `./outputs` folder on the disk of the Batch AI cluster node where the job ran. Azure ML automatically uploads anything written to the `./outputs` folder into the run history file store. We can therefore use the `Run` object to download the model files that the `saver` object saved. They are under the `outputs/model` folder in the run history file store, and are downloaded into a local folder named `model`. Note that the TensorFlow model consists of four binary files, which are not human-readable.

In [None]:
# create a model folder in the current directory
os.makedirs('./model', exist_ok = True)

for f in run.get_file_names():
    if f.startswith('outputs/model'):
        output_file_path = os.path.join('./model', f.split('/')[-1])
        print('Downloading from {} to {} ...'.format(f, output_file_path))
        run.download_file(name = f, output_file_path = output_file_path)
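
To preview which files the loop above would download, and where they would land, the name filtering can be separated from the download itself. This is a sketch only: the helper name and the example file names are made up, standing in for the output of `run.get_file_names()`.

```python
import os

def plan_downloads(file_names, prefix='outputs/model', local_dir='./model'):
    """Map each run-history file under `prefix` to a flattened local path."""
    plan = {}
    for name in file_names:
        if name.startswith(prefix):
            # keep only the base name, as the download loop does
            plan[name] = os.path.join(local_dir, name.split('/')[-1])
    return plan

# made-up file names standing in for run.get_file_names()
example_files = [
    'azureml-logs/60_control_log.txt',
    'outputs/model/checkpoint',
    'outputs/model/model.h5',
]

plan = plan_downloads(example_files)
for remote, local in plan.items():
    print('{} -> {}'.format(remote, local))
```

Because only the base name is kept, two files with the same name in different subfolders of `outputs/model` would overwrite each other locally.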