# Train and hyperparameter tune with RAPIDS

## Prerequisites

- Create an Azure ML Workspace and set up the environment on your local computer by following the steps in [Azure README.md](https://gitlab-master.nvidia.com/drobison/aws-sagemaker-gtc-2020/tree/master/azure/README.md)

In [1]:
# verify installation and check Azure ML SDK version
import azureml.core

print('SDK version:', azureml.core.VERSION)

SDK version: 1.3.0


- Install [AzCopy](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10) to download the dataset from [Azure Blob storage](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-overview) to your local computer (or another storage account).

In this example, we will use 20 million rows (samples) of the [airline dataset](http://kt.ijs.si/elena_ikonomovska/data.html):

In [3]:
!./azcopy cp 'https://airlinedataset.blob.core.windows.net/airline-10years/*' '/Users/nanthini/OneDrive - NVIDIA Corporation/csp/AzureML-RAPIDS/notebooks/data/par/'

## Initialize workspace

Load and initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`.
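For reference, `config.json` is a small JSON file holding the same three identifiers that the `Workspace` constructor takes. The sketch below writes and reads such a file using only the standard library; the placeholder values and the temporary directory are assumptions for illustration, not values from this workspace.

```python
import json
import os
import tempfile

# Placeholder values (assumptions); substitute your own identifiers.
workspace_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Written to a temporary directory here so the sketch has no side effects.
config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w") as f:
    json.dump(workspace_config, f, indent=4)

# Read the file back to confirm it holds the three expected keys.
with open(config_path) as f:
    loaded = json.load(f)

print(sorted(loaded))
```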

In [4]:
from azureml.core.workspace import Workspace

# if a locally-saved configuration file for the workspace is not available, use the following to load workspace
# ws = Workspace(subscription_id=subscription_id, resource_group=resource_group, workspace_name=workspace_name)

ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

datastore = ws.get_default_datastore()
print("Default datastore's name: {}".format(datastore.name))

Workspace name: HPO-Workspace-Nanthini
Azure region: eastus
Subscription id: 73612009-b37b-413f-a3f7-ec02f12498cf
Resource group: RAPIDS-HPO-Nanthini
Default datastore's name: workspaceblobstore


## Upload data

Upload the dataset to the workspace's default datastore:

In [5]:
path_on_datastore = '10_year_data'
datastore.upload(src_dir='/Users/nanthini/OneDrive - NVIDIA Corporation/csp/AzureML-RAPIDS/notebooks/data/par', target_path=path_on_datastore, overwrite=False, show_progress=True)

Uploading an estimated of 122 files
Target already exists. Skipping upload for 10_year_data/part.33.parquet
Target already exists. Skipping upload for 10_year_data/part.23.parquet
Target already exists. Skipping upload for 10_year_data/part.51.parquet
Target already exists. Skipping upload for 10_year_data/part.41.parquet
Target already exists. Skipping upload for 10_year_data/part.77.parquet
Target already exists. Skipping upload for 10_year_data/part.67.parquet
Target already exists. Skipping upload for 10_year_data/_common_metadata
Target already exists. Skipping upload for 10_year_data/part.15.parquet
Target already exists. Skipping upload for 10_year_data/part.4.parquet
Target already exists. Skipping upload for 10_year_data/part.48.parquet
Target already exists. Skipping upload for 10_year_data/part.58.parquet
Target already exists. Skipping upload for 10_year_data/part.105.parquet
Target already exists. Skipping upload for 10_year_data/part.115.parquet
Target already exists. Ski

$AZUREML_DATAREFERENCE_6b8d3ec2a7f344ce89daaa2a73b72713

In [6]:
ds_data = datastore.path(path_on_datastore)
print(ds_data)

$AZUREML_DATAREFERENCE_028c9b659c724e5bbfbc3b71ff840262


## Create AML compute

You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this notebook, we will use Azure ML managed compute ([AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)) for our remote training using a dynamically scalable pool of compute resources.

This notebook uses a cluster of up to 5 nodes for hyperparameter optimization; you can modify `max_nodes` based on the available quota in the desired region. Similar to other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. [This article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) includes details on the default limits and how to request more quota.

`vm_size` describes the virtual machine type and size that will be used in the cluster. RAPIDS requires NVIDIA Pascal or newer architecture, so you will need to specify compute targets from one of the `NC_v2`, `NC_v3`, `ND` or `ND_v2` [GPU virtual machines in Azure](https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-gpu); these are VMs provisioned with P100, P40 and V100 GPUs. Let's create an `AmlCompute` cluster of `Standard_NC12s_v3` GPU VMs:

In [7]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
gpu_cluster_name = 'gpu-cluster'

if gpu_cluster_name in ws.compute_targets:
    gpu_cluster = ws.compute_targets[gpu_cluster_name]
    if gpu_cluster and type(gpu_cluster) is AmlCompute:
        print('Found compute target. Will use {0} '.format(gpu_cluster_name))
else:
    print('creating new cluster')
    # vm_size parameter below could be modified to one of the RAPIDS-supported VM types
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = 'Standard_NC12s_v3', max_nodes = 5, idle_seconds_before_scaledown = 300)

    # create the cluster
    gpu_cluster = ComputeTarget.create(ws, gpu_cluster_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout 
    # if no min node count is provided it uses the scale settings for the cluster
    gpu_cluster.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
# use get_status() to get a detailed status for the current cluster 
print(gpu_cluster.get_status().serialize())

Found compute target. Will use gpu-cluster 
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2020-05-07T05:58:50.345000+00:00', 'errors': None, 'creationTime': '2020-05-06T20:26:47.977000+00:00', 'modifiedTime': '2020-05-06T20:27:04.784279+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 5, 'nodeIdleTimeBeforeScaleDown': 'PT300S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC12S_V3'}
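In the status output, `nodeIdleTimeBeforeScaleDown` is reported as an ISO 8601 duration (`PT300S`), which corresponds to the `idle_seconds_before_scaledown = 300` passed when the cluster was created. A minimal sketch for converting such a value back to seconds, assuming only the seconds-only form seen here (`parse_seconds_duration` is a hypothetical helper, not part of the SDK):

```python
import re

def parse_seconds_duration(value):
    """Convert a seconds-only ISO 8601 duration such as 'PT300S' to an int."""
    match = re.fullmatch(r"PT(\d+)S", value)
    if match is None:
        raise ValueError("unsupported duration format: {!r}".format(value))
    return int(match.group(1))

# Scale settings copied from the status output above.
scale_settings = {'minNodeCount': 0, 'maxNodeCount': 5, 'nodeIdleTimeBeforeScaleDown': 'PT300S'}
idle_seconds = parse_seconds_duration(scale_settings['nodeIdleTimeBeforeScaleDown'])
print(idle_seconds)
```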


## Prepare training script

Create a project directory that will contain code from your local machine that you will need access to on the remote resource. This includes the training script and additional files your training script depends on. In this example, the training script is provided: 
<br>
`train_rapids_RF.py` - entry script for RAPIDS Estimator that includes loading dataset into cuDF data frame, training with Random Forest and inference using cuML.

In [8]:
import os

project_folder = './train_rapids'
os.makedirs(project_folder, exist_ok=True)

We will log some metrics by using the `Run` object within the training script:

```python
from azureml.core.run import Run
run = Run.get_context()
```
 
We will also log the parameters and highest accuracy the model achieves:

```python
run.log('Accuracy', float(accuracy))
```

These run metrics will become particularly important when we begin hyperparameter tuning our model in the 'Tune model hyperparameters' section.

Copy the training script `train_rapids_RF.py`, along with `rapids_csp_azure.py`, into your project directory:

In [9]:
import shutil

shutil.copy('../code/train_rapids_RF.py', project_folder)
shutil.copy('../code/rapids_csp_azure.py', project_folder)

'./train_rapids/rapids_csp_azure.py'

## Train model on the remote compute

Now that you have your data and training script prepared, you are ready to train on your remote compute.

### Create experiment

Create an [Experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track all the runs in your workspace.

In [10]:
from azureml.core import Experiment

experiment_name = 'train_rapids'
experiment = Experiment(ws, name=experiment_name)

### Create environment

The [Environment class](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.environment.environment?view=azure-ml-py) allows you to build a Docker image and customize the system that you will use for training. We will build a container image using a RAPIDS container as base image and install necessary packages. This build is necessary only the first time and will take about 15 minutes. The image will be added to your Azure Container Registry and the environment will be cached after the first run, as long as the environment definition remains the same.

In [11]:
from azureml.core import Environment

# create the environment
rapids_env = Environment('rapids_env')

# create the environment inside a Docker container
rapids_env.docker.enabled = True

# specify docker steps as a string. Alternatively, load the string from a file
dockerfile = """
FROM rapidsai/rapidsai-nightly:0.14-cuda10.0-runtime-ubuntu18.04-py3.7
RUN source activate rapids && \
pip install azureml-sdk && \
pip install azureml-widgets
"""
#FROM nvcr.io/nvidia/rapidsai/rapidsai:0.12-cuda10.0-runtime-ubuntu18.04

# set base image to None since the image is defined by dockerfile
rapids_env.docker.base_image = None
rapids_env.docker.base_dockerfile = dockerfile

# use rapids environment in the container
rapids_env.python.user_managed_dependencies = True

In [12]:
# from azureml.core.container_registry import ContainerRegistry

# # this is an image available on Docker Hub
# image_name = 'zronaghi/rapidsai-nightly:0.13-cuda10.0-runtime-ubuntu18.04-py3.7-azuresdk-030920'

# # use rapids environment, don't build a new conda environment
# user_managed_dependencies = True

### Create a RAPIDS Estimator

The [Estimator](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.estimator.estimator?view=azure-ml-py) class can be used with machine learning frameworks that do not have a pre-configured estimator.

`script_params` is a dictionary of command-line arguments to pass to the training script.

In [13]:

ds_data.as_mount()

$AZUREML_DATAREFERENCE_028c9b659c724e5bbfbc3b71ff840262

In [14]:
from azureml.train.estimator import Estimator

script_params = {
    '--data_dir': ds_data.as_mount(),
    '--n_bins': 32,
}

estimator = Estimator(source_directory=project_folder,
                      script_params=script_params,
                      compute_target=gpu_cluster, 
                      entry_script='train_rapids_RF.py',
                      environment_definition=rapids_env)
#                       custom_docker_image=image_name,
#                       user_managed=user_managed_dependencies

## Tune model hyperparameters

We can optimize our model's hyperparameters and improve the accuracy using Azure Machine Learning's hyperparameter tuning capabilities.

### Start a hyperparameter sweep

Let's define the hyperparameter space to sweep over. We will tune `n_estimators`, `max_depth` and `max_features` parameters. In this example we will use random sampling to try different configuration sets of hyperparameters and maximize `Accuracy`.
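To make the search space concrete, the sketch below draws a few configurations locally with Python's `random` module, using the ranges of this sweep. It only illustrates what the ranges mean and is not how HyperDrive samples internally; `sample_config` is a hypothetical helper.

```python
import random

def sample_config(rng):
    """Draw one hyperparameter configuration from the sweep's ranges."""
    return {
        '--n_estimators': rng.choice(range(50, 500)),  # integer in [50, 499]
        '--max_depth': rng.choice(range(5, 19)),       # integer in [5, 18]
        '--max_features': rng.uniform(0.2, 1.0),       # float between 0.2 and 1.0
    }

rng = random.Random(0)  # fixed seed so the draws are reproducible
samples = [sample_config(rng) for _ in range(5)]
for config in samples:
    print(config)
```

Note that `range(5, 19)` excludes its upper bound, so the deepest tree the sweep can try has `max_depth` 18.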

In [15]:
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.parameter_expressions import choice, uniform

param_sampling = RandomParameterSampling( {
    '--n_estimators': choice(range(50, 500)),
    '--max_depth': choice(range(5, 19)),
    '--max_features': uniform(0.2, 1.0)
    }
)   

hyperdrive_run_config = HyperDriveConfig(estimator=estimator,
                                         hyperparameter_sampling=param_sampling, 
                                         primary_metric_name='Accuracy',
                                         primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                         max_total_runs=50,
                                         max_concurrent_runs=5)

This will launch the RAPIDS training script with parameters that were specified in the cell above.

In [16]:
# start the HyperDrive run
hyperdrive_run = experiment.submit(hyperdrive_run_config)
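As a rough planning aid, the number of runs that can execute at once is bounded by both `max_concurrent_runs` and the cluster's `max_nodes`. The sketch below assumes each child run occupies one node (an assumption, since no node count is set on the estimator above) and that runs proceed in full rounds, which is a simplification: in practice a new run may start as soon as a slot frees up.

```python
import math

max_total_runs = 50
max_concurrent_runs = 5
max_nodes = 5  # cluster size configured when the compute target was created

# Assumption: one node per child run.
effective_concurrency = min(max_concurrent_runs, max_nodes)

# Lower bound on the number of sequential rounds needed to finish the sweep.
min_rounds = math.ceil(max_total_runs / effective_concurrency)
print(effective_concurrency, min_rounds)
```

If `max_concurrent_runs` were raised above `max_nodes`, the extra runs would simply wait for a free node under this assumption.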

## Monitor HyperDrive runs

Monitor and view the progress of the machine learning training run with a [Jupyter widget](https://docs.microsoft.com/en-us/python/api/azureml-widgets/azureml.widgets?view=azure-ml-py). The widget is asynchronous and provides live updates every 10-15 seconds until the job completes.

In [17]:
from azureml.widgets import RunDetails

RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

In [18]:
hyperdrive_run.wait_for_completion(show_output=True)

RunId: HD_696d3b9d-839f-4409-9d4d-7c024e7be2e2
Web View: https://ml.azure.com/experiments/train_rapids/runs/HD_696d3b9d-839f-4409-9d4d-7c024e7be2e2?wsid=/subscriptions/73612009-b37b-413f-a3f7-ec02f12498cf/resourcegroups/RAPIDS-HPO-Nanthini/workspaces/HPO-Workspace-Nanthini

Streaming azureml-logs/hyperdrive.txt

"<START>[2020-05-07T03:28:52.730279][API][INFO]Experiment created<END>\n""<START>[2020-05-07T03:28:53.390289][GENERATOR][INFO]Trying to sample '5' jobs from the hyperparameter space<END>\n""<START>[2020-05-07T03:28:54.330083][GENERATOR][INFO]Successfully sampled '5' jobs, they will soon be submitted to the execution target.<END>\n"<START>[2020-05-07T03:28:55.5823785Z][SCHEDULER][INFO]The execution environment is being prepared. Please be patient as it can take a few minutes.<END>

Execution Summary
RunId: HD_696d3b9d-839f-4409-9d4d-7c024e7be2e2
Web View: https://ml.azure.com/experiments/train_rapids/runs/HD_696d3b9d-839f-4409-9d4d-7c024e7be2e2?wsid=/subscriptions/73612009-b37b-

{'runId': 'HD_696d3b9d-839f-4409-9d4d-7c024e7be2e2',
 'target': 'gpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2020-05-07T03:28:52.505906Z',
 'endTimeUtc': '2020-05-07T05:48:20.549323Z',
 'properties': {'primary_metric_config': '{"name": "Accuracy", "goal": "maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '599d2031-4d05-47da-836b-6d92ffb157ca',
  'score': '0.9670767188072205',
  'best_child_run_id': 'HD_696d3b9d-839f-4409-9d4d-7c024e7be2e2_0',
  'best_metric_status': 'Succeeded'},
 'inputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://hpoworkspacena5334546303.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_696d3b9d-839f-4409-9d4d-7c024e7be2e2/azureml-logs/hyperdrive.txt?sv=2019-02-02&sr=b&sig=jWaiCIC2kscq0gVGTySHnPvo0zINEXYHmJIsSVBIKtI%3D&st=2020-05-07T05%3A38%3A26Z&se=2020-05-07T13%3A48%3A26Z&sp=r'}}
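The `startTimeUtc` and `endTimeUtc` fields of the summary give the wall-clock time of the whole sweep. A small sketch, using only the standard library, for computing it from the two timestamps shown above:

```python
from datetime import datetime

# Format of the UTC timestamps in the run summary.
TIMESTAMP_FORMAT = '%Y-%m-%dT%H:%M:%S.%fZ'

start = datetime.strptime('2020-05-07T03:28:52.505906Z', TIMESTAMP_FORMAT)
end = datetime.strptime('2020-05-07T05:48:20.549323Z', TIMESTAMP_FORMAT)

elapsed = end - start
print(elapsed)
```

For this run the difference comes to roughly 2 hours and 19 minutes.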

In [19]:
# hyperdrive_run.cancel()

### Find and register best model

In [20]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()
print(best_run.get_details()['runDefinition']['arguments'])

['--data_dir', '$AZUREML_DATAREFERENCE_0be6a9d7a78346a29d252b707be571e5', '--n_bins', '32', '--max_depth', '18', '--max_features', '0.631046680467573', '--n_estimators', '117']
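These arguments reach the entry script as ordinary command-line flags. The actual `train_rapids_RF.py` is not shown here, so the following is only a sketch of how a script might parse them with `argparse`; the argument types and the `build_parser` helper are assumptions.

```python
import argparse

def build_parser():
    """Build a parser for the flags seen in the best run's arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_dir', type=str)
    parser.add_argument('--n_bins', type=int)
    parser.add_argument('--n_estimators', type=int)
    parser.add_argument('--max_depth', type=int)
    parser.add_argument('--max_features', type=float)
    return parser

# Argument list copied from the best run's details above.
arguments = ['--data_dir', '$AZUREML_DATAREFERENCE_0be6a9d7a78346a29d252b707be571e5',
             '--n_bins', '32', '--max_depth', '18',
             '--max_features', '0.631046680467573', '--n_estimators', '117']

args = build_parser().parse_args(arguments)
print(args.n_estimators, args.max_depth, args.max_features)
```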


List the model files uploaded during the run:

In [21]:
print(best_run.get_file_names())

['azureml-logs/55_azureml-execution-tvmps_01e287d63e76dd6832da52a63724415fa599e07ccb5d0814c8137e08f5e9a4dc_d.txt', 'azureml-logs/65_job_prep-tvmps_01e287d63e76dd6832da52a63724415fa599e07ccb5d0814c8137e08f5e9a4dc_d.txt', 'azureml-logs/70_driver_log.txt', 'azureml-logs/75_job_post-tvmps_01e287d63e76dd6832da52a63724415fa599e07ccb5d0814c8137e08f5e9a4dc_d.txt', 'azureml-logs/process_info.json', 'azureml-logs/process_status.json', 'logs/azureml/209_azureml.log', 'logs/azureml/job_prep_azureml.log', 'logs/azureml/job_release_azureml.log']


Register the model file saved by the best run as a model named `train-rapids` in the workspace, so that it can be deployed:

In [22]:
# model = best_run.register_model(model_name='train-rapids', model_path='outputs/model-rapids.joblib')
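Note that the file list printed above contains only log files and no `outputs/model-rapids.joblib`, so it is worth checking for the file before registering it. A guard along these lines can do that; this is a sketch, `has_run_file` is a hypothetical helper, and the list is abbreviated from the output above.

```python
def has_run_file(file_names, path):
    """Return True if `path` is among the files uploaded by a run."""
    return path in set(file_names)

# Abbreviated from the output of best_run.get_file_names() above.
file_names = [
    'azureml-logs/70_driver_log.txt',
    'azureml-logs/process_info.json',
    'azureml-logs/process_status.json',
    'logs/azureml/209_azureml.log',
]

model_path = 'outputs/model-rapids.joblib'
if has_run_file(file_names, model_path):
    print('Model file found; it can be registered.')
else:
    print('No {} in this run; check that the training script saves it.'.format(model_path))
```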

## Delete cluster

In [23]:
# delete the cluster
# gpu_cluster.delete()