# TABLE OF CONTENTS:
---
* [Notebook Summary](#Notebook-Summary)
* [Setup](#Setup)
    * [Connect to Workspace](#Connect-to-Workspace)
* [Data](#Data)
    * [Retrieve AML Dataset](#Retrieve-AML-Dataset)
* [Compute Target](#Compute-Target)
* [Training Artifacts](#Training-Artifacts)
* [Training Environment](#Training-Environment)
* [Experiment & Run Configuration](#Experiment-&-Run-Configuration)
    * [Option 1: Normal Script Run](#Option-1:-Normal-Script-Run)
    * [Option 2: Hyperdrive Run](#Option-2:-Hyperdrive-Run)
* [Run Monitoring](#Run-Monitoring)
* [Model Registration](#Model-Registration)
    * [Model Download](#Model-Download)
* [Resource Clean Up](#Resource-Clean-Up)
---

# Notebook Summary

In this notebook, a PyTorch model is trained on the Stanford Dogs dataset using transfer learning with a pretrained ResNet-18. A normal script run trains a single model on the Azure Machine Learning (AML) compute cluster. A hyperdrive run trains multiple hyperparameter configurations in parallel on multiple nodes of the AML compute cluster and is therefore useful for hyperparameter tuning. The compute cluster autoscales and is only spun up while an experiment is running. In general, model training should happen on the AML compute cluster for cost and performance reasons (e.g. powerful GPU clusters can be provisioned).

# Setup

Append the parent directory to `sys.path` so that modules from the `src` directory can be imported.

In [1]:
import os
import sys
sys.path.append(os.path.dirname(os.path.abspath("")))

Automatically reload modules when changes are made.

In [2]:
%load_ext autoreload
%autoreload 2

Import libraries.

In [3]:
# Import libraries
import azureml.core
import shutil
from azureml.core import Dataset, Environment, Experiment, Keyvault, Model, ScriptRunConfig, Workspace
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.core.model import InferenceConfig 
from azureml.train.hyperdrive import BanditPolicy, HyperDriveConfig, PrimaryMetricGoal, RandomParameterSampling
from azureml.train.hyperdrive import choice, uniform
from azureml.widgets import RunDetails
from torchvision import datasets

print(f"azureml.core version: {azureml.core.VERSION}")

azureml.core version: 1.20.0


### Connect to Workspace

In order to connect and communicate with the AML workspace, a workspace object needs to be instantiated using the AML SDK.
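
`Workspace.from_config()` reads the workspace coordinates from a `config.json` file, by default searching from the current directory. The following is a minimal sketch of such a file being written with the standard library. The three values are placeholders, and the helper `write_workspace_config` is a hypothetical name introduced only for this illustration.

```python
import json
import os
import tempfile

# Placeholder values (assumptions): replace with your own workspace details.
workspace_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}


def write_workspace_config(directory, config):
    """Write config.json into `directory` and return the file path."""
    os.makedirs(directory, exist_ok=True)
    config_path = os.path.join(directory, "config.json")
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
    return config_path


# A temporary folder is used here so that no existing config.json is overwritten.
config_path = write_workspace_config(tempfile.mkdtemp(), workspace_config)
print(config_path)
```

In practice the file is usually downloaded from the Azure portal and placed next to the notebook or in a parent folder.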

In [4]:
# Connect to the AML workspace using interactive authentication
ws = Workspace.from_config()

# Data

### Retrieve AML Dataset

Retrieve the dataset from the AML workspace. The dataset has been registered as part of the `01_dataset_setup` notebook.

In [5]:
dataset_name = "stanford-dogs-dataset"
dataset = Dataset.get_by_name(ws, name=dataset_name)

# Compute Target

Retrieve a remote compute target to run experiments on. The code below first checks whether a compute target named **cluster_name** already exists and, if so, retrieves it. Otherwise, it creates a new compute cluster.

In [6]:
# Choose a name for the cluster
cluster_name = "gpu-cluster"

# Reuse the cluster if it already exists, otherwise create a new one
try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing cluster, use it.")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(#vm_size="STANDARD_D2_V2", # CPU
                                                           vm_size='STANDARD_NC6', # GPU
                                                           max_nodes=4,
                                                           idle_seconds_before_scaledown=2400)
    
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

# Use get_status() to get a detailed status for the current cluster
print(compute_target.get_status().serialize())

Creating
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2021-02-28T09:03:31.177000+00:00', 'errors': None, 'creationTime': '2021-02-28T09:03:28.414676+00:00', 'modifiedTime': '2021-02-28T09:03:44.147067+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT2400S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC6'}


# Training Artifacts

A training script has been created in the `../src/training` folder. This script will be executed by the remote compute. The training script uses transfer learning to train a pretrained ResNet-18 model on the Stanford Dogs dataset.
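
The training script is driven by command-line arguments. As a rough sketch, and assuming it uses `argparse`, the command-line interface could look like the following. The argument names are the ones passed to `train.py` in this notebook, while the types, defaults and the `parse_args` helper are assumptions made for illustration and may differ from the actual script.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical sketch of the train.py command-line interface."""
    parser = argparse.ArgumentParser(description="Fine-tune a pretrained model")
    parser.add_argument("--data_path", type=str, required=True,
                        help="Folder (or mounted dataset) containing the images")
    parser.add_argument("--num_epochs", type=int, default=2,
                        help="Number of training epochs (default is an assumption)")
    parser.add_argument("--output_dir", type=str, default="outputs",
                        help="Folder the trained model is written to")
    parser.add_argument("--learning_rate", type=float, default=0.1,
                        help="Initial learning rate (default is an assumption)")
    parser.add_argument("--momentum", type=float, default=0.9,
                        help="SGD momentum (default is an assumption)")
    return parser.parse_args(argv)


# Same arguments as the local debugging run in this notebook.
args = parse_args(["--data_path", "../data", "--num_epochs=2",
                   "--output_dir=../outputs",
                   "--learning_rate", "0.1", "--momentum", "0.9"])
print(args)
```

Because every value arrives as a string on the command line, declaring `type=int` and `type=float` lets the same script accept the numeric arguments submitted through `ScriptRunConfig`.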

Run the training script locally for 2 epochs for debugging purposes.

In [14]:
!python ../src/training/train.py --data_path ../data --num_epochs=2 --output_dir="../outputs" --learning_rate 0.1 --momentum 0.9

Torch version: 1.6.0
--------------------
LOAD DATA
--------------------
Data has been load successfully.
--------------------
START TRAINING
--------------------
Attempted to log scalar metric lr:
0.1
Attempted to log scalar metric momentum:
0.9
0
Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
1
BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
2
ReLU(inplace=True)
3
MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
4
Sequential(
  (0): Bottleneck(
    (conv1): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=32, bias=False)
    (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv3): Conv2d(128, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn3): BatchNorm

--------------------
Epoch 1/2
--------------------
^C
Traceback (most recent call last):
  File "../src/training/train.py", line 257, in <module>
    main()
  File "../src/training/train.py", line 247, in main
    momentum=args.momentum)
  File "../src/training/train.py", line 214, in fine_tune_model
    dataset_sizes)
  File "../src/training/train.py", line 83, in train_model
    outputs = model(inputs)
  File "/anaconda/envs/azureml_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/anaconda/envs/azureml_py36/lib/python3.6/site-packages/torchvision/models/resnet.py", line 220, in forward
    return self._forward_impl(x)
  File "/anaconda/envs/azureml_py36/lib/python3.6/site-packages/torchvision/models/resnet.py", line 208, in _forward_impl
    x = self.layer1(x)
  File "/anaconda/envs/azureml_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.f

# Training Environment

Load the training environment that has been registered as part of the `00_environment_setup` notebook and use it for remote training on the AML compute cluster.

In [7]:
env_name = "stanford-dogs-train-env"
env = Environment.get(workspace=ws, name=env_name)

# Experiment & Run Configuration

Now that the training artifacts are prepared, a model can be trained on the remote compute cluster. Running on Azure GPU compute can cut down training time.

In [8]:
# Create the experiment
experiment = Experiment(workspace=ws, 
                        name="stanford-dogs-classifier-pytorch")

experiment.tag("model_architecture", "transfer-learning with resnet18")

### Option 1: Normal Script Run

In [9]:
# Set variable to identify run type for logic later in the notebook
run_type = "script_run"

# Create the script run configuration
src_config = ScriptRunConfig(source_directory="../src",
                             script="training/train.py",
                             compute_target=compute_target,
                             arguments=[
                                 "--data_path", dataset.as_named_input("input").as_mount(),
                                 "--num_epochs", 30,
                                 "--output_dir", "outputs",
                                 "--learning_rate", 0.1,
                                 "--momentum", 0.9])

src_config.run_config.environment = env

# Start the Script Run
run = experiment.submit(src_config)

### Option 2: Hyperdrive Run

Hyperparameters can be tuned using AML's hyperdrive capability.

The initial learning rate and the momentum are tuned. The training script can contain a learning rate (LR) schedule that decays the learning rate every few epochs, starting from that initial learning rate.
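
A step-decay schedule of this kind can be sketched in plain Python. The `step_size` and `gamma` values below are illustrative assumptions, not values taken from `train.py`. The formula mirrors what PyTorch's `torch.optim.lr_scheduler.StepLR` computes.

```python
def step_decay_lr(initial_lr, epoch, step_size=7, gamma=0.1):
    """Return the learning rate for `epoch` (0-based).

    The rate is multiplied by `gamma` once every `step_size` epochs.
    step_size and gamma are assumed values for illustration only.
    """
    return initial_lr * gamma ** (epoch // step_size)


# Print the schedule at a few epochs for an initial learning rate of 0.1.
for epoch in (0, 6, 7, 14):
    print(epoch, step_decay_lr(0.1, epoch))
```

Since only the initial value is tuned, the shape of the decay stays the same across hyperdrive runs and each sampled learning rate simply scales the whole schedule.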

Random sampling is used to try different hyperparameter configurations in order to maximize the primary metric, the best validation accuracy (`best_val_acc`).
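
Conceptually, random sampling draws each hyperparameter independently from its range. The sketch below imitates this with Python's `random` module, using the ranges and the run count from this notebook's hyperdrive configuration. It is an illustration of the idea, not the hyperdrive implementation, and `sample_configuration` is a hypothetical helper.

```python
import random

# Search space as (low, high) bounds of a uniform distribution.
search_space = {
    "learning_rate": (0.0005, 0.005),
    "momentum": (0.9, 0.99),
}


def sample_configuration(space, rng):
    """Draw one value per hyperparameter, uniformly within its bounds."""
    return {name: rng.uniform(low, high) for name, (low, high) in space.items()}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
configs = [sample_configuration(search_space, rng) for _ in range(4)]
for config in configs:
    print(config)
```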

An early termination policy is specified to stop poorly performing runs early. The BanditPolicy is used, which terminates any run whose primary metric does not fall within the slack factor of the best-performing run's metric. In this template, the policy is applied every epoch (since the `best_val_acc` metric is reported every epoch and `evaluation_interval=1`). The first policy evaluation is delayed until after the first 10 epochs (`delay_evaluation=10`).
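
The slack-factor comparison for a metric that is maximized can be sketched as follows. This is a simplified illustration of the rule only: the evaluation interval and the evaluation delay are not modeled, and `bandit_should_terminate` is a hypothetical helper, not part of the AML SDK.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor):
    """Return True if a run falls outside the allowed slack (maximize goal).

    A run is cancelled when its metric, scaled up by (1 + slack_factor),
    is still below the metric of the best-performing run.
    """
    return run_metric * (1 + slack_factor) < best_metric


# With slack_factor=0.15 and a best validation accuracy of 0.85, runs below
# 0.85 / 1.15 (roughly 0.739) would be terminated.
print(bandit_should_terminate(0.70, 0.85, 0.15))
print(bandit_should_terminate(0.75, 0.85, 0.15))
```

A larger slack factor tolerates runs that lag further behind the best run, at the cost of spending more compute on them.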

In [None]:
# Set variable to identify run type for logic later in the notebook
run_type = "hyperdrive_run"

param_sampling = RandomParameterSampling({
    "learning_rate": uniform(0.0005, 0.005),
    "momentum": uniform(0.9, 0.99)}
)

early_termination_policy = BanditPolicy(slack_factor=0.15, evaluation_interval=1, delay_evaluation=10)

# Create the script run configuration
src_config = ScriptRunConfig(source_directory="../src",
                             script="training/train.py",
                             compute_target=compute_target,
                             arguments=[
                                 "--data_path", dataset.as_named_input("input").as_mount(),
                                 "--num_epochs", 20,
                                 "--output_dir", "outputs",
                                 "--learning_rate", 0.01,
                                 "--momentum", 0.9])

src_config.run_config.environment = env

hyperdrive_config = HyperDriveConfig(run_config=src_config,
                                     hyperparameter_sampling=param_sampling, 
                                     policy=early_termination_policy,
                                     primary_metric_name="best_val_acc",
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=4,
                                     max_concurrent_runs=2)

# Start the Hyperdrive Run
run = experiment.submit(hyperdrive_config)

# Run Monitoring

In [10]:
RunDetails(run).show()

In [11]:
# Get portal URL
run.get_portal_url()

'https://ml.azure.com/experiments/stanford-dogs-classifier-pytorch/runs/stanford-dogs-classifier-pytorch_1614503127_610c8c13?wsid=/subscriptions/e58a23da-421e-4b52-99d5-e615f2f8be41/resourcegroups/sbirkamlrg/workspaces/sbirkamlws'

In [12]:
run.wait_for_completion(show_output=False)

{'runId': 'stanford-dogs-classifier-pytorch_1614503127_610c8c13',
 'target': 'gpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-02-28T09:32:10.035634Z',
 'endTimeUtc': '2021-02-28T11:05:10.829328Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '46de1d6e-fe78-4051-8cf7-1ae2767cd2a1',
  'azureml.git.repository_uri': 'https://github.com/sebastianbirk/pytorch-use-cases-azure-ml',
  'mlflow.source.git.repoURL': 'https://github.com/sebastianbirk/pytorch-use-cases-azure-ml',
  'azureml.git.branch': 'master',
  'mlflow.source.git.branch': 'master',
  'azureml.git.commit': '79cc13fbfdaa456bde30c928c2945d9a98097cbf',
  'mlflow.source.git.commit': '79cc13fbfdaa456bde30c928c2945d9a98097cbf',
  'azureml.git.dirty': 'True',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [{'dataset': {'id': '556e3f48-48d0-4702-9854-2332f739eaab'}, 'consumptionDetails': {'type': 'RunIn

In [13]:
# Retrieve best child run
if run_type == "hyperdrive_run":
    best_child_run = run.get_best_run_by_primary_metric()
elif run_type == "script_run":
    best_child_run = run

In [14]:
# Check run metrics, details and file names
best_child_run_metrics = best_child_run.get_metrics()
best_child_run_details = best_child_run.get_details()
best_child_run_file_names = best_child_run.get_file_names()


print(best_child_run_metrics)
print("==========================")
print(best_child_run_details)
print("==========================")
print(best_child_run_file_names)

{'lr': 0.1, 'momentum': 0.9, 'best_val_acc': [0.0, 0.5404166666666667, 0.5404166666666667, 0.6883333333333334, 0.6883333333333334, 0.7233333333333334, 0.7233333333333334, 0.7233333333333334, 0.7233333333333334, 0.74875, 0.74875, 0.7758333333333334, 0.7758333333333334, 0.7758333333333334, 0.7758333333333334, 0.7908333333333334, 0.7908333333333334, 0.7908333333333334, 0.7908333333333334, 0.7908333333333334, 0.7908333333333334, 0.8545833333333334, 0.8545833333333334, 0.8545833333333334, 0.8545833333333334, 0.8545833333333334, 0.8545833333333334, 0.8545833333333334, 0.8545833333333334, 0.8545833333333334, 0.8545833333333334, 0.8545833333333334, 0.8545833333333334, 0.8545833333333334, 0.8545833333333334, 0.8545833333333334, 0.8545833333333334, 0.8545833333333334, 0.8545833333333334, 0.8545833333333334, 0.8545833333333334, 0.87375, 0.87375, 0.87375, 0.87375, 0.87375, 0.87375, 0.87375, 0.87375, 0.87375, 0.87375, 0.87375, 0.87375, 0.87375, 0.87375, 0.87375, 0.87375, 0.87375, 0.87375, 0.87375]}

In [15]:
print("Best Run is:")
print(f"Validation accuracy: {best_child_run_metrics['best_val_acc'][-1]}")
print(f"Learning rate: {best_child_run_metrics['lr']}")
print(f"Momentum: {best_child_run_metrics['momentum']}")

Best Run is:
Validation accuracy: 0.87375
Learning rate: 0.1
Momentum: 0.9


# Model Registration

Register the model to the AML workspace for subsequent deployment.

In [16]:
model_path = "outputs/model.pt"

model = best_child_run.register_model(model_name="dog-classification-model",
                                      model_path=model_path,
                                      model_framework=Model.Framework.PYTORCH,
                                      description="dog classification model")

print(model.name, model.id, model.version, sep="\t")

dog-classification-model	dog-classification-model:1	1


### Model Download

If required, the model can be downloaded as follows (e.g. for local testing):

In [17]:
download_model = True

if download_model:

    # Create directory
    outputs_folder = os.path.join(os.getcwd(), "../outputs")
    os.makedirs(outputs_folder, exist_ok=True)
    print(f"Outputs folder {outputs_folder} has been created.")

    # Download model artifact into the outputs folder
    best_child_run.download_file(name=model_path, output_file_path=os.path.join(outputs_folder, "model.pt"))

Outputs folder /mnt/batch/tasks/shared/LS_root/mounts/clusters/sbirkamlci/code/pytorch-use-cases-azure-ml/stanford_dogs/notebooks/../outputs has been created.


# Resource Clean Up

In [None]:
compute_target.delete()