# TABLE OF CONTENTS:
---
* [Notebook Summary](#Notebook-Summary)
* [Setup](#Setup)
    * [Connect to Workspace](#Connect-to-Workspace)
* [Data](#Data)
    * [Retrieve AML Dataset](#Retrieve-AML-Dataset)
* [Compute Target](#Compute-Target)
* [Training Artifacts](#Training-Artifacts)
* [Training Environment](#Training-Environment)
* [Experiment & Run Configuration](#Experiment-&-Run-Configuration)
    * [Option 1: Normal Script Run](#Option-1:-Normal-Script-Run)
    * [Option 2: Hyperdrive Run](#Option-2:-Hyperdrive-Run)
* [Run Monitoring](#Run-Monitoring)
* [Model Registration](#Model-Registration)
    * [Model Download](#Model-Download)
* [Resource Clean Up](#Resource-Clean-Up)
---

# Notebook Summary

In this notebook, a PyTorch model is trained on the Stanford Dogs dataset using transfer learning with a pretrained ResNet-18. A normal script run trains a single model on the Azure Machine Learning (AML) compute cluster. A hyperdrive run trains several hyperparameter configurations in parallel on multiple nodes of the AML compute cluster and is therefore useful for hyperparameter tuning. The compute cluster autoscales and is only spun up while an experiment is running. In general, model training should happen on the AML compute cluster for cost and performance reasons (e.g. powerful GPU clusters can be provisioned).

# Setup

Append the parent directory to `sys.path` so that the modules in the `src` directory can be imported.

In [4]:
import os
import sys
# Make the parent directory importable so that modules under src can be found
sys.path.append(os.path.dirname(os.path.abspath("")))

Automatically reload modules when changes are made.

In [5]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Import libraries.

In [67]:
# Import libraries
import azureml.core
import shutil
import torch
from azureml.core import Dataset, Environment, Experiment, Keyvault, Model, ScriptRunConfig, Workspace
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.core.model import InferenceConfig 
from azureml.train.hyperdrive import BanditPolicy, HyperDriveConfig, PrimaryMetricGoal, RandomParameterSampling
from azureml.train.hyperdrive import choice, uniform
from azureml.widgets import RunDetails
from torchvision import datasets

print(f"azureml.core version: {azureml.core.VERSION}")

azureml.core version: 1.20.0


### Connect to Workspace

In order to connect and communicate with the AML workspace, a workspace object needs to be instantiated using the AML SDK.

In [7]:
# Connect to the AML workspace using interactive authentication
ws = Workspace.from_config()
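
`Workspace.from_config()` reads the workspace coordinates from a `config.json` file, which the SDK looks for in the current directory, an `.azureml` subfolder, or a parent directory. As a rough sketch, the content of such a file could be produced as follows. The three values are placeholders (assumptions), not the ones used in this notebook.

```python
import json

# Placeholder values (assumptions); replace them with your own workspace details.
workspace_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Serialize to the JSON layout expected in config.json
config_text = json.dumps(workspace_config, indent=4)
print(config_text)
```

The same file can also be downloaded from the workspace overview page in the Azure portal.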

# Data

### Retrieve AML Dataset

Retrieve the dataset from the AML workspace. The dataset has been registered as part of the `01_dataset_setup` notebook.

In [8]:
dataset_name = "stanford-dogs-dataset"
dataset = Dataset.get_by_name(ws, name=dataset_name)

# Compute Target

Retrieve a remote compute target to run experiments on. The code below first checks whether a compute target named `cluster_name` already exists and, if so, retrieves it. Otherwise, it creates a new compute cluster.

In [9]:
# Choose a name for the cluster
cluster_name = "gpu-cluster"

# Reuse the cluster if it already exists; otherwise create it
try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing cluster, use it.")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_NC6",  # GPU; use e.g. "STANDARD_D2_V2" for CPU
        max_nodes=4,
        idle_seconds_before_scaledown=2400)  # scale down after 40 idle minutes

    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

# Use get_status() to get a detailed status for the current cluster
print(compute_target.get_status().serialize())

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2021-03-13T11:21:20.882000+00:00', 'errors': None, 'creationTime': '2021-02-28T09:03:28.414676+00:00', 'modifiedTime': '2021-02-28T09:03:44.147067+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT2400S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC6'}


# Training Artifacts

A training script has been created in the `../src/training` folder. This script will be executed on the remote compute. It uses transfer learning to train a pretrained ResNet-18 model on the Stanford Dogs dataset.

Run the training script locally for 2 epochs for debugging purposes.

In [43]:
!{sys.executable} ../src/training/train.py --data_path ../data --num_epochs=2 --output_dir="../outputs" --learning_rate 0.1 --momentum 0.9

Torch version: 1.7.0
--------------------
CREATE DATALOADERS
--------------------
Dataloaders have been created successfully.
--------------------
START MODEL TRAINING
--------------------
Hyperparameter batch size: 4
Hyperparameter learning rate: 0.1
Hyperparameter momentum: 0.9
Hyperparameter number of frozen layers: 7
Hyperparameter number of neurons fc layer: 512
Hyperparameter dropout probability fc layer: 0
Hyperparameter lr scheduler step size: 10
Attempted to log scalar metric batch_size:
4
Attempted to log scalar metric lr:
0.1
Attempted to log scalar metric momentum:
0.9
Attempted to log scalar metric num_frozen_layers:
7
Attempted to log scalar metric num_neurons_fc_layer:
512
Attempted to log scalar metric dropout_prob_fc_layer:
0
Attempted to log scalar metric lr_scheduler_step_size:
10
--------------------
Epoch 1/2
--------------------
Iteration 100
Preds: tensor([ 84,  33,  16, 117])
Labels: tensor([84, 71, 44, 84])
Cross Entropy: 4.162421703338623
Iteration 200
Preds: 

Iteration 400
Preds: tensor([87, 72, 30, 32])
Labels: tensor([87, 72, 30, 32])
Cross Entropy: 0.0
Iteration 500
Preds: tensor([ 96, 116,  70,  69])
Labels: tensor([ 96, 116,  70,  69])
Cross Entropy: 0.0
Iteration 600
Preds: tensor([ 60, 115,   8,  24])
Labels: tensor([ 60, 115,   8,  24])
Cross Entropy: 0.0
Epoch loss: 57.47384825970899
Val Loss: 57.4738 Val Acc: 0.6658
Attempted to log scalar metric val_loss:
57.47384825970899
Attempted to log scalar metric val_acc:
0.6658333333333334
Attempted to log scalar metric best_val_acc:
0.6658333333333334
--------------------
Attempted to log scalar metric training_duration:
2861.7707707881927
Training completed in 47m 42s.
Best Val Acc: 0.665833
--------------------
Model saved in ../outputs.
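
The training script itself is not shown in this notebook. As a hedged sketch, the command-line flags passed to `train.py` above could be parsed with `argparse` along these lines. The function name `build_parser` is hypothetical. The defaults for batch size, frozen layers, fc-layer neurons, dropout and scheduler step size are inferred from the hyperparameters logged in the local run, which did not set them explicitly. The remaining defaults are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags passed to train.py in this notebook."""
    parser = argparse.ArgumentParser(description="Train a dog breed classifier")
    parser.add_argument("--data_path", type=str, required=True)
    # Defaults for the next four flags are assumptions
    parser.add_argument("--num_epochs", type=int, default=2)
    parser.add_argument("--output_dir", type=str, default="outputs")
    parser.add_argument("--learning_rate", type=float, default=0.1)
    parser.add_argument("--momentum", type=float, default=0.9)
    # Defaults below are inferred from the values logged in the local run
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--num_frozen_layers", type=int, default=7)
    parser.add_argument("--num_neurons_fc_layer", type=int, default=512)
    parser.add_argument("--dropout_prob_fc_layer", type=float, default=0.0)
    parser.add_argument("--lr_scheduler_step_size", type=int, default=10)
    return parser


# Same flags as the local debug run
args = build_parser().parse_args([
    "--data_path", "../data", "--num_epochs=2", "--output_dir=../outputs",
    "--learning_rate", "0.1", "--momentum", "0.9"])
print(vars(args))
```

The `arguments` list of a `ScriptRunConfig` is passed to the script in the same flag/value form, so one parser can serve both local and remote runs.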


# Training Environment

Load the training environment that has been registered as part of the `00_environment_setup` notebook and use it for remote training on the AML compute cluster.

In [23]:
env_name = "stanford-dogs-train-env"
env = Environment.get(workspace=ws, name=env_name)

# Experiment & Run Configuration

Now that the training artifacts are prepared, a model can be trained on the remote compute cluster. Running on Azure compute lets you use GPUs to cut down training time.

In [24]:
# Create the experiment
experiment = Experiment(workspace=ws, 
                        name="stanford-dogs-classifier-train")

experiment.tag("model_architecture", "transfer-learning with resnet18")

### Option 1: Normal Script Run

In [77]:
# Set variable to identify run type for logic later in the notebook
run_type = "script_run"

# Create the script run configuration
src_config = ScriptRunConfig(source_directory="../src",
                             script="training/train.py",
                             compute_target=compute_target,
                             arguments=[
                                 "--data_path", dataset.as_named_input("input").as_mount(),
                                 "--num_epochs", 36,
                                 "--output_dir", "outputs",
                                 "--batch_size", 8,
                                 "--learning_rate", 0.1,
                                 "--momentum", 0.9,
                                 "--num_frozen_layers", 7,
                                 "--num_neurons_fc_layer", 1024,
                                 "--dropout_prob_fc_layer", 0.0,
                                 "--lr_scheduler_step_size", 8])

src_config.run_config.environment = env

# Start the Script Run
run = experiment.submit(src_config)

### Option 2: Hyperdrive Run

Hyperparameters can be tuned using AML's hyperdrive capability.

The initial learning rate is tuned. The training script can contain an LR schedule that decays the learning rate every few epochs, starting from that initial learning rate.
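
A minimal sketch of such a step-wise decay is shown here, assuming the learning rate is multiplied by a fixed factor `gamma` every `step_size` epochs. This is the rule implemented by PyTorch's `StepLR` scheduler. The value `gamma=0.1` is an assumption, since the training script is not shown in this notebook.

```python
def step_decay_lr(initial_lr, epoch, step_size, gamma=0.1):
    """Learning rate in effect after `epoch` completed epochs under step decay."""
    # The rate is multiplied by gamma once for every full block of step_size epochs
    return initial_lr * gamma ** (epoch // step_size)


# Initial learning rate 0.1, decayed every 7 epochs
for epoch in (0, 6, 7, 14, 21):
    print(epoch, step_decay_lr(0.1, epoch, step_size=7))
```

Because every later rate is a multiple of the initial one, tuning the initial learning rate shifts the whole schedule up or down.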

Random sampling is used to try different configuration sets of hyperparameters to maximize the primary metric, the best validation accuracy (best_val_acc).
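
To illustrate what random sampling does conceptually, the following standard-library sketch draws configurations from a search space like the one used in this notebook. It is not how hyperdrive is implemented. The tuple-based space description and the function `sample_configuration` are hypothetical helpers for illustration only.

```python
import random

# Search space mirroring the one used in this notebook:
# ("choice", options) picks one option, ("uniform", low, high) draws a float.
search_space = {
    "batch_size": ("choice", [8, 16]),
    "learning_rate": ("choice", [0.1, 0.03, 0.01]),
    "momentum": ("uniform", 0.9, 0.99),
    "num_frozen_layers": ("choice", [6, 7]),
}


def sample_configuration(space, rng):
    """Draw one hyperparameter configuration, sampling each parameter independently."""
    config = {}
    for name, spec in space.items():
        if spec[0] == "choice":
            config[name] = rng.choice(spec[1])
        elif spec[0] == "uniform":
            config[name] = rng.uniform(spec[1], spec[2])
        else:
            raise ValueError(f"Unknown distribution: {spec[0]}")
    return config


rng = random.Random(0)  # fixed seed, only so that the sketch is reproducible
configs = [sample_configuration(search_space, rng) for _ in range(4)]  # 4 runs in total
for config in configs:
    print(config)
```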

An early termination policy is specified to stop poorly performing runs early. The `BanditPolicy` is used, which terminates any run whose primary metric does not fall within the slack factor of the best-performing run. In this template, the policy is applied every epoch (since the `best_val_acc` metric is reported every epoch and `evaluation_interval=1`). The first policy evaluation is delayed until after the first 5 epochs (`delay_evaluation=5`).
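
The slack-factor rule can be sketched in plain Python as follows, assuming a positive metric that is maximized. A run is cancelled when its metric falls below the best metric divided by `1 + slack_factor`. The function name and the accuracy values are hypothetical, and the handling of the delay is a simplification of the service's behavior.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor,
                            interval, delay_evaluation=0):
    """Return True if a run would be cancelled under a slack-factor bandit rule.

    Assumes the primary metric is positive and is being maximized.
    """
    # Assumption: the policy is not applied before `delay_evaluation` intervals
    if interval < delay_evaluation:
        return False
    # Runs below this threshold are outside the allowed slack of the best run
    threshold = best_metric / (1 + slack_factor)
    return run_metric < threshold


# Hypothetical accuracies; the best run so far reports 0.80 and slack_factor is 0.15
print(bandit_should_terminate(0.65, 0.80, 0.15, interval=6, delay_evaluation=5))
print(bandit_should_terminate(0.70, 0.80, 0.15, interval=6, delay_evaluation=5))
print(bandit_should_terminate(0.65, 0.80, 0.15, interval=3, delay_evaluation=5))
```

With `slack_factor=0.15`, a run needs to reach roughly 87% of the best run's metric to keep going.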

In [52]:
# Set variable to identify run type for logic later in the notebook
run_type = "hyperdrive_run"

param_sampling = RandomParameterSampling({
    "batch_size": choice(8, 16),
    "learning_rate": choice(0.1, 0.03, 0.01),
    "momentum": uniform(0.9, 0.99),
    "num_frozen_layers": choice(6, 7)})

early_termination_policy = BanditPolicy(slack_factor=0.15, evaluation_interval=1, delay_evaluation=5)

# Create the script run configuration
src_config = ScriptRunConfig(source_directory="../src",
                             script="training/train.py",
                             compute_target=compute_target,
                             arguments=[
                                 "--data_path", dataset.as_named_input("input").as_mount(),
                                 "--num_epochs", 35,
                                 "--output_dir", "outputs",
                                 "--batch_size", 16,
                                 "--learning_rate", 0.1,
                                 "--momentum", 0.9,
                                 "--num_frozen_layers", 7,
                                 "--num_neurons_fc_layer", 1024,
                                 "--dropout_prob_fc_layer", 0.0,
                                 "--lr_scheduler_step_size", 7])

src_config.run_config.environment = env

hyperdrive_config = HyperDriveConfig(run_config=src_config,
                                     hyperparameter_sampling=param_sampling, 
                                     policy=early_termination_policy,
                                     primary_metric_name="best_val_acc",
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=4,
                                     max_concurrent_runs=2)

# Start the Hyperdrive Run
run = experiment.submit(hyperdrive_config)


# Run Monitoring

In [54]:
RunDetails(run).show()


In [55]:
# Get portal URL
run.get_portal_url()

'https://ml.azure.com/experiments/stanford-dogs-classifier-train/runs/stanford-dogs-classifier-train_1615659912_35ad8df1?wsid=/subscriptions/e58a23da-421e-4b52-99d5-e615f2f8be41/resourcegroups/sbirkamlrg/workspaces/sbirkamlws'

In [56]:
run.wait_for_completion(show_output=False)

{'runId': 'stanford-dogs-classifier-train_1615659912_35ad8df1',
 'target': 'gpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-03-13T18:29:42.438169Z',
 'endTimeUtc': '2021-03-13T20:02:29.41692Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '0688f074-6bda-40f6-ba58-3b5e0d70c0a2',
  'azureml.git.repository_uri': 'https://github.com/sebastianbirk/pytorch-use-cases-azure-ml.git',
  'mlflow.source.git.repoURL': 'https://github.com/sebastianbirk/pytorch-use-cases-azure-ml.git',
  'azureml.git.branch': 'master',
  'mlflow.source.git.branch': 'master',
  'azureml.git.commit': '0826c60250bd5f7117a6e9a0fb487f8addc057ab',
  'mlflow.source.git.commit': '0826c60250bd5f7117a6e9a0fb487f8addc057ab',
  'azureml.git.dirty': 'True',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [{'dataset': {'id': 'eaf8e856-4dce-4c8d-9ec8-d9211cfa03e1'}, 'consumptionDetails': {'type': '

In [57]:
# Retrieve best child run
if run_type == "hyperdrive_run":
    best_child_run = run.get_best_run_by_primary_metric()
elif run_type == "script_run":
    best_child_run = run

In [58]:
# Check run metrics, details and file names
best_child_run_metrics = best_child_run.get_metrics()
best_child_run_details = best_child_run.get_details()
best_child_run_file_names = best_child_run.get_file_names()


print(best_child_run_metrics)
print("==========================")
print(best_child_run_details)
print("==========================")
print(best_child_run_file_names)

{'num_epochs': 30, 'batch_size': 8, 'lr': 0.1, 'momentum': 0.9, 'num_frozen_layers': 7, 'num_neurons_fc_layer': 512, 'dropout_prob_fc_layer': 0, 'lr_scheduler_step_size': 7, 'train_loss': [79.32024400116816, 99.7234441648841, 106.25103312307714, 111.59790226948589, 116.81815944839039, 121.67214546163248, 124.67642483834743, 43.987113985798956, 13.284877053333567, 10.645608805397652, 10.239459109386193, 9.823940230748157, 9.284297396588876, 9.260509682019766, 4.6363822694907535, 1.8916586798374662, 1.303240922502567, 1.2565790720846175, 1.294267868238676, 1.2707453783234328, 1.2908153624428087, 0.9892540154611925, 0.87474680591103, 0.8895387642832066, 0.8883564145669031, 0.8875383848939479, 0.8751734884655646, 0.8811545228222774, 0.8471578999143093, 0.8708809402035937], 'train_acc': [0.2315625, 0.39437500000000003, 0.4426041666666667, 0.4761458333333334, 0.49468750000000006, 0.5039583333333334, 0.5238541666666667, 0.6494791666666667, 0.6095833333333334, 0.6019791666666667, 0.60770833333

In [59]:
print("Best Run is:")
print(f"Validation accuracy: {best_child_run_metrics['best_val_acc'][-1]}")
print(f"Learning rate: {best_child_run_metrics['lr']}")
print(f"Momentum: {best_child_run_metrics['momentum']}")

Best Run is:
Validation accuracy: 0.8891666666666668
Learning rate: 0.1
Momentum: 0.9
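
The `[-1]` index above takes the last logged value of `best_val_acc`. The training script is not shown here, so the following is an assumption: if the script logs this metric once per epoch as a running maximum of the validation accuracy, the last entry equals the overall best accuracy. The sketch below uses made-up accuracy values, not the ones from the run above.

```python
from itertools import accumulate

# Hypothetical per-epoch validation accuracies (not taken from the run above)
val_acc = [0.62, 0.71, 0.69, 0.74, 0.73]

# Running maximum: entry i is the best accuracy seen up to and including epoch i
best_val_acc = list(accumulate(val_acc, max))

print(best_val_acc)                      # [0.62, 0.71, 0.71, 0.74, 0.74]
print(best_val_acc[-1] == max(val_acc))  # True
```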


# Model Registration

Register the model to the AML workspace for subsequent deployment.

In [75]:
model_path = "outputs/dog_clf_model.pt"

model = best_child_run.register_model(model_name="dog-classification-model",
                                      model_path=model_path,
                                      tags={"type": "multiclass classification"},
                                      properties={"val_acc": best_child_run_metrics['best_val_acc'][-1]},
                                      model_framework=Model.Framework.PYTORCH,
                                      model_framework_version=torch.__version__,
                                      description="dog classification model")

print(model.name, model.id, model.version, sep="\t")

dog-classification-model	dog-classification-model:10	10


Add the input dataset to the registered model.

In [76]:
model.add_dataset_references([("input dataset", dataset)])

### Model Download

If required, the model can be downloaded as follows (e.g. for local testing):

In [63]:
import os

download_model = True

if download_model:
    # Create the outputs directory if it does not exist yet
    outputs_folder = os.path.join(os.getcwd(), "../outputs")
    os.makedirs(outputs_folder, exist_ok=True)
    print(f"Outputs folder {outputs_folder} has been created.")

    # Download the model artifact
    best_child_run.download_file(name=model_path,
                                 output_file_path=os.path.join(outputs_folder, "dog_clf_model.pt"))

Outputs folder /mnt/batch/tasks/shared/LS_root/mounts/clusters/sbirkamlci/code/pytorch-use-cases-azure-ml/stanford_dogs/notebooks/../outputs has been created.


# Resource Clean Up

In [None]:
compute_target.delete()