Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# NVIDIA RAPIDS in Azure Machine Learning

The [RAPIDS](https://www.developer.nvidia.com/rapids) suite of software libraries from NVIDIA enables the execution of end-to-end data science and analytics pipelines entirely on GPUs. In many machine learning projects, a significant portion of the model training time is spent preparing the data; this stage of the process is known as Extraction, Transformation and Loading, or ETL. By using the DataFrame API for ETL and the GPU-capable ML algorithms in RAPIDS, data preparation and model training can run in GPU-accelerated end-to-end pipelines without incurring serialization costs between the pipeline stages. This notebook demonstrates how to use NVIDIA RAPIDS to prepare data and train a model in Azure.
 
In this notebook, we will do the following:
 
* Create an Azure Machine Learning Workspace
* Create an AMLCompute target
* Use a script to process our data and train a model
* Obtain the data required to run this sample
* Create an AML run configuration to launch a machine learning job
* Run the script to prepare data for training and train the model
 
Prerequisites:
* An Azure subscription to create a Machine Learning Workspace
* Familiarity with the Azure ML SDK (refer to [notebook samples](https://github.com/Azure/MachineLearningNotebooks))
* A Jupyter notebook environment with the Azure Machine Learning SDK installed. Refer to the instructions to [set up the environment](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-environment#local)

### Verify that the Azure ML SDK is installed

In [None]:
import azureml.core
print("SDK version:", azureml.core.VERSION)
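
If the import above fails, it can help to see which SDK modules are missing before going further. The following is a minimal sketch that uses only the Python standard library; the helper name `is_installed` is hypothetical and not part of the Azure ML SDK.

```python
import importlib.util

def is_installed(module_name):
    """Return True if module_name can be located.

    The module itself is not imported, although parent packages of a
    dotted name are.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # Raised for dotted names when a parent package is missing.
        return False

# Modules imported later in this notebook
for name in ("azureml.core", "azureml.widgets"):
    status = "found" if is_installed(name) else "MISSING"
    print(name + ": " + status)
```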

In [None]:
import os
from azureml.core import Workspace, Experiment
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.data.data_reference import DataReference
from azureml.core.runconfig import RunConfiguration
from azureml.core import ScriptRunConfig
from azureml.widgets import RunDetails

### Create Azure ML Workspace

The following step is optional if you already have a workspace. If you want to use an existing workspace, then
skip this workspace creation step and move on to the next step to load the workspace.
 
<font color='red'>Important</font>: in the code cell below, be sure to set the correct values for subscription_id, 
resource_group, workspace_name and workspace_region before executing the cell.
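
A quick way to catch values that were left unset is to look for the angle-bracket placeholders before calling `Workspace.create`. This is a sketch only; `find_placeholders` is a hypothetical helper, and it assumes the same environment variable names and placeholder defaults as the cell below.

```python
import os

def find_placeholders(settings):
    """Return the names whose value is empty or still an angle-bracket placeholder."""
    return [name for name, value in settings.items()
            if not value or (value.startswith("<") and value.endswith(">"))]

# Same lookups and defaults as the workspace creation cell
settings = {
    "subscription_id": os.environ.get("SUBSCRIPTION_ID", "<subscription_id>"),
    "resource_group": os.environ.get("RESOURCE_GROUP", "<resource_group>"),
    "workspace_name": os.environ.get("WORKSPACE_NAME", "<workspace_name>"),
    "workspace_region": os.environ.get("WORKSPACE_REGION", "<region>"),
}

unresolved = find_placeholders(settings)
if unresolved:
    print("Set these values before creating the workspace: " + ", ".join(unresolved))
```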

In [None]:
subscription_id = os.environ.get("SUBSCRIPTION_ID", "<subscription_id>")
resource_group = os.environ.get("RESOURCE_GROUP", "<resource_group>")
workspace_name = os.environ.get("WORKSPACE_NAME", "<workspace_name>")
workspace_region = os.environ.get("WORKSPACE_REGION", "<region>")

ws = Workspace.create(workspace_name, subscription_id=subscription_id, resource_group=resource_group, location=workspace_region)

# write config to a local directory for future use
ws.write_config()

### Load existing Workspace

In [None]:
ws = Workspace.from_config()
# if a locally-saved configuration file for the workspace is not available, use the following to load workspace
# ws = Workspace(subscription_id=subscription_id, resource_group=resource_group, workspace_name=workspace_name)
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

scripts_folder = "scripts_folder"

if not os.path.isdir(scripts_folder):
    os.mkdir(scripts_folder)

### Create AML Compute Target

Because NVIDIA RAPIDS requires GPUs of the Pascal architecture or newer (for example P40, P100 or V100), the compute target must use one of the [NC_v3](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu#ncv3-series), [NC_v2](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu#ncv2-series), [ND](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu#nd-series) or [ND_v2](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu#ndv2-series-preview) virtual machine types in Azure; these are the families of virtual machines in Azure that are provisioned with these GPUs.
 
Pick one of the supported VM SKUs based on the number of GPUs you want to use for ETL and training in RAPIDS.
 
The script in this notebook is implemented for single-machine scenarios. An example supporting multiple nodes will be published later.
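
Choosing a SKU by GPU count can be sketched as a small lookup. The table below is an assumption transcribed from the Azure VM size documentation at the time of writing (ND_v2 and the RDMA variants are left out), so verify it against the current documentation and your regional availability; `smallest_vm_size` is a hypothetical helper, not an SDK function.

```python
# GPU count per VM size, grouped by family (assumed values, verify before use)
VM_SIZES_BY_FAMILY = {
    "NC_v2": {"Standard_NC6s_v2": 1, "Standard_NC12s_v2": 2, "Standard_NC24s_v2": 4},
    "NC_v3": {"Standard_NC6s_v3": 1, "Standard_NC12s_v3": 2, "Standard_NC24s_v3": 4},
    "ND": {"Standard_ND6s": 1, "Standard_ND12s": 2, "Standard_ND24s": 4},
}

def smallest_vm_size(family, gpu_count):
    """Return the smallest VM size in the family with at least gpu_count GPUs."""
    candidates = [(gpus, size)
                  for size, gpus in VM_SIZES_BY_FAMILY[family].items()
                  if gpus >= gpu_count]
    if not candidates:
        raise ValueError("No VM size in {0} offers {1} GPUs".format(family, gpu_count))
    # Tuples sort by GPU count first, so min() picks the smallest adequate size
    return min(candidates)[1]

print(smallest_vm_size("NC_v2", 1))
```

The returned name can be passed as the `vm_size` argument when the cluster is provisioned.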

In [None]:
gpu_cluster_name = "gpucluster"

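if gpu_cluster_name in ws.compute_targets:
    gpu_cluster = ws.compute_targets[gpu_cluster_name]
    if isinstance(gpu_cluster, AmlCompute):
        print('Found existing compute target, reusing it: ' + gpu_cluster_name)
    else:
        # a target with this name exists but cannot be used as an AmlCompute cluster
        raise TypeError(gpu_cluster_name + ' exists but is not an AmlCompute target')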
else:
    print("creating new cluster")
    # vm_size parameter below could be modified to one of the RAPIDS-supported VM types
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = "Standard_NC6s_v2", min_nodes=1, max_nodes = 1)

    # create the cluster
    gpu_cluster = ComputeTarget.create(ws, gpu_cluster_name, provisioning_config)
    gpu_cluster.wait_for_completion(show_output=True)

### Script to process data and train model

The _process&#95;data.py_ script used in the step below is a slightly modified implementation of the [RAPIDS E2E example](https://github.com/rapidsai/notebooks/blob/master/mortgage/E2E.ipynb).

In [None]:
# copy process_data.py into the script folder
import shutil
shutil.copy('./process_data.py', os.path.join(scripts_folder, 'process_data.py'))

with open(os.path.join(scripts_folder, 'process_data.py'), 'r') as process_data_script:
    print(process_data_script.read())

### Data required to run this sample

This sample uses [Fannie Mae's Single-Family Loan Performance Data](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html). Refer to the 'Available mortgage datasets' section in [instructions](https://rapidsai.github.io/demos/datasets/mortgage-data) to get sample data.

Once you obtain access to the data, you will need to make this data available in an [Azure Machine Learning Datastore](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data), for use in this sample.

<font color='red'>Important</font>: The following step assumes the data is uploaded to the Workspace's default data store under a folder named 'mortgagedata2000_01'. Uploading to the default data store is not required: the data can be referenced from any datastore, e.g., from Azure Blob or File service, once it is added as a datastore to the workspace. Update the path_on_datastore parameter to match where the data is available. The process_data.py script expects the data directory to have the following structure:
* _&lt;data directory>_/acq
* _&lt;data directory>_/perf
* _&lt;data directory>_/names.csv

'acq' and 'perf' are directories containing data files. The _&lt;data directory>_ is the path specified in the _path&#95;on&#95;datastore_ parameter in the step below.
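
Before uploading, the local copy of the data can be checked against this layout. The sketch below uses only the standard library; `missing_entries` is a hypothetical helper, and it assumes names.csv sits directly inside the data directory, as the comment in the upload step suggests.

```python
import os
import tempfile

def missing_entries(data_dir):
    """Return the expected entries that are absent from data_dir."""
    missing = []
    for sub in ("acq", "perf"):
        if not os.path.isdir(os.path.join(data_dir, sub)):
            missing.append(sub + "/")
    if not os.path.isfile(os.path.join(data_dir, "names.csv")):
        missing.append("names.csv")
    return missing

# Demonstration on a throwaway directory that only contains acq/
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "acq"))
    print(missing_entries(tmp))  # reports perf/ and names.csv
```

An empty list means the directory is ready to be passed as `src_dir` to the upload call.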

In [None]:
ds = ws.get_default_datastore()

# download and uncompress data in a local directory before uploading to data store
# directory specified in src_dir parameter below should have the acq, perf directories with data and names.csv file
# ds.upload(src_dir='<local directory that has data>', target_path='mortgagedata2000_01', overwrite=True, show_progress=True)

# data already uploaded to the datastore
data_ref = DataReference(data_reference_name='data', datastore=ds, path_on_datastore='mortgagedata2000_01')

### Create AML run configuration to launch a machine learning job

AML allows the option of using existing Docker images with prebuilt conda environments. The following step uses an existing image from [Docker Hub](https://hub.docker.com/r/rapidsai/rapidsai/).

In [None]:
run_config = RunConfiguration()
run_config.framework = 'python'
run_config.environment.python.user_managed_dependencies = True
# use conda environment named 'rapids' available in the Docker image
# this conda environment does not include azureml-defaults package that is required for using AML functionality like metrics tracking, model management etc.
run_config.environment.python.interpreter_path = '/conda/envs/rapids/bin/python'
run_config.target = gpu_cluster_name
run_config.environment.docker.enabled = True
run_config.environment.docker.gpu_support = True
# if registry is not mentioned the image is pulled from Docker Hub
run_config.environment.docker.base_image = "rapidsai/rapidsai:cuda9.2_ubuntu16.04_root"
run_config.environment.spark.precache_packages = False
run_config.data_references={'data':data_ref.to_config()}

### Wrapper function to submit Azure Machine Learning experiment

In [None]:
# cpu_training: if True, training is done on the CPU and GPUs are used *only* for ETL, *not* for training
# gpu_count: number of the VM's GPUs to use for ETL and, if cpu_training is False, for training as well
def run_rapids_experiment(cpu_training, gpu_count):
    # any value from 1 to 4 is allowed here, depending on the type of VMs available in gpu_cluster
    if gpu_count not in [1, 2, 3, 4]:
        raise ValueError('Value specified for the number of GPUs to use ({0}) is invalid'.format(gpu_count))

    # following data partition mapping is empirical (specific to GPUs used and current data partitioning scheme) and may need to be tweaked
    gpu_count_data_partition_mapping = {1: 2, 2: 4, 3: 5, 4: 7}
    part_count = gpu_count_data_partition_mapping[gpu_count]

    end_year = 2000
    if gpu_count > 2:
        end_year = 2001 # use more data with more GPUs

    src = ScriptRunConfig(source_directory=scripts_folder, 
                          script='process_data.py', 
                          arguments = ['--num_gpu', gpu_count, '--data_dir', str(data_ref),
                                      '--part_count', part_count, '--end_year', end_year,
                                      '--cpu_predictor', cpu_training
                                      ],
                          run_config=run_config
                         )

    exp = Experiment(ws, 'rapidstest')
    run = exp.submit(config=src)
    RunDetails(run).show()

### Submit experiment (ETL & training on GPU)

In [None]:
cpu_predictor = False
# the value for num_gpu should be less than or equal to the number of GPUs available in the VM
num_gpu = 1 
# use GPU for both ETL and training
run_rapids_experiment(cpu_predictor, num_gpu)

### Submit experiment (ETL on GPU, training on CPU)

To observe the performance difference between GPU-accelerated RAPIDS-based training and CPU-only training, set 'cpu_predictor' to 'True' and rerun the experiment.
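
Because the flag is passed to the script as a command-line argument, it arrives as text rather than as a Python boolean. How process_data.py parses it is not shown in this notebook; the sketch below assumes argparse and uses the argument names from the wrapper function, with a hypothetical `str_to_bool` converter and illustrative defaults.

```python
import argparse

def str_to_bool(text):
    """Convert common spellings of true/false to a bool.

    argparse's type=bool would treat any non-empty string,
    including 'False', as True.
    """
    value = text.strip().lower()
    if value in ("true", "t", "yes", "y", "1"):
        return True
    if value in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % text)

def parse_args(argv):
    # Argument names mirror those passed by run_rapids_experiment
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_gpu", type=int, default=1)
    parser.add_argument("--data_dir", type=str, default=".")
    parser.add_argument("--part_count", type=int, default=1)
    parser.add_argument("--end_year", type=int, default=2000)
    parser.add_argument("--cpu_predictor", type=str_to_bool, default=False)
    return parser.parse_args(argv)

args = parse_args(["--num_gpu", "1", "--cpu_predictor", "False"])
print(args.cpu_predictor)  # False
```

An explicit converter avoids the case where the string 'False' is silently interpreted as true.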

In [None]:
cpu_predictor = True
# the value for num_gpu should be less than or equal to the number of GPUs available in the VM
num_gpu = 1
# train using CPU, use GPU for ETL
run_rapids_experiment(cpu_predictor, num_gpu)

### Delete cluster

In [None]:
# delete the cluster
# gpu_cluster.delete()