# TABLE OF CONTENTS:
---
* [Notebook Summary](#Notebook-Summary)
* [Setup](#Setup)
    * [Notebook Parameters](#Notebook-Parameters)
    * [Connect to Workspace](#Connect-to-Workspace)
* [Data](#Data)
    * [Retrieve AML Dataset](#Retrieve-AML-Dataset)
* [Compute Target](#Compute-Target)
* [Training Artifacts](#Training-Artifacts)
* [Training Environment](#Training-Environment)
* [Experiment & Run Configuration](#Experiment-&-Run-Configuration)
    * [Option 1: Normal Script Run](#Option-1:-Normal-Script-Run)
    * [Option 2: Hyperdrive Run](#Option-2:-Hyperdrive-Run)
* [Run Monitoring](#Run-Monitoring)
* [Model Registration](#Model-Registration)
    * [Model Download](#Model-Download)
* [Resource Clean Up](#Resource-Clean-Up)
---

# Notebook Summary

In this notebook, a PyTorch model is trained on the Stanford Dogs Dataset using transfer learning from a pretrained ResNeXt-50. A normal script run trains a single model on the Azure Machine Learning (AML) compute cluster. A hyperdrive run trains several hyperparameter configurations in parallel on multiple nodes of the AML compute cluster and is therefore useful for hyperparameter tuning.

The compute cluster autoscales and is only spun up while an experiment is running. In general, model training should be done on the AML compute cluster for cost and performance reasons (powerful GPU clusters can be provisioned on demand and shut down automatically).

# Setup

Append the parent directory to `sys.path` so that the modules in the `src` directory can be imported.

In [1]:
import os
import sys

# Add the parent directory of the notebook folder to the module search path
sys.path.append(os.path.dirname(os.path.abspath("")))

Automatically reload modules when changes are made.

In [2]:
%load_ext autoreload
%autoreload 2

Import libraries.

In [3]:
# Import libraries
import azureml.core
import shutil
import torch
from azureml.core import Dataset, Environment, Experiment, Keyvault, Model, ScriptRunConfig, Workspace
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.core.model import InferenceConfig 
from azureml.train.hyperdrive import BanditPolicy, HyperDriveConfig, PrimaryMetricGoal, RandomParameterSampling
from azureml.train.hyperdrive import choice, uniform
from azureml.widgets import RunDetails
from torchvision import datasets

print(f"azureml.core version: {azureml.core.VERSION}")

azureml.core version: 1.20.0


### Notebook Parameters

In [4]:
# Specify the dataset name
dataset_name = "stanford-dogs-dataset"

# Specify the name of the remote compute target (AML compute cluster)
cluster_name = "gpu-cluster"

# Specify the training environment name
env_name = "stanford-dogs-train-env"

# Specify the training experiment name
experiment_name = "stanford-dogs-classifier-train"

# Specify the src directory path
src_dir = "../src"

# Specify the model name
model_name = "dog-classification-model"

# Specify the model description
model_description = "PyTorch dog classification model trained on 120 dog breeds from the Stanford Dogs Dataset"

# Specify the remote run model path
remote_model_path = "outputs/dog_clf_model.pt"

# Specify the model tags
model_tags = {"type": "multiclass classification"}

# Specify the directory path for outputs
outputs_dir = "../outputs"

# Specify the local model path
local_model_path = "../outputs/dog_clf_model.pt"

# Specify whether to download the model
download_model = False

In addition to the notebook parameters, the hyperparameters in the run configurations further down the notebook have to be adjusted as needed.

### Connect to Workspace

In order to connect and communicate with the AML workspace, a workspace object needs to be instantiated using the AML SDK.

In [5]:
# Connect to the AML workspace using interactive authentication
ws = Workspace.from_config()
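
`Workspace.from_config()` reads a `config.json` file that identifies the workspace; it looks in the current directory, in a `.azureml` subdirectory, and in parent directories. The following is a minimal sketch of how such a file could be generated. `write_workspace_config` is a hypothetical helper (not part of the AML SDK), and the three values passed to it are placeholders that have to be replaced with the details of your own workspace.

```python
import json
from pathlib import Path


def write_workspace_config(target_dir, subscription_id, resource_group, workspace_name):
    """Write a config.json with the three keys that Workspace.from_config() reads."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    # Store the file in a .azureml subdirectory of target_dir
    config_dir = Path(target_dir) / ".azureml"
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "config.json"
    config_path.write_text(json.dumps(config, indent=4))
    return config_path


# Example call with placeholder values (hypothetical, replace before use):
# write_workspace_config("..", "<SUBSCRIPTION_ID>", "<RESOURCE_GROUP>", "<WORKSPACE_NAME>")
```

Alternatively, the same file can be downloaded from the workspace overview page in the Azure portal.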

# Data

### Retrieve AML Dataset

Retrieve the dataset from the AML workspace. The dataset has been registered as part of the `01_dataset_setup` notebook.

In [6]:
dataset = Dataset.get_by_name(ws, name=dataset_name)

# Compute Target

Retrieve a remote compute target to run experiments on. The code below first checks whether a compute target named **cluster_name** (specified in the Notebook Parameters section) already exists and, if so, retrieves it. Otherwise, it creates a new compute cluster.

In [7]:
# Reuse the cluster if it already exists, otherwise create a new one
try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing cluster, use it.")
except ComputeTargetException:
    # For a CPU cluster, use vm_size="STANDARD_D2_V2" instead
    compute_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_NC6",  # GPU
                                                           max_nodes=4,
                                                           idle_seconds_before_scaledown=2400)
    
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

# Use get_status() to get a detailed status for the current cluster
print(compute_target.get_status().serialize())

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
{'currentNodeCount': 4, 'targetNodeCount': 4, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 4, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2021-03-14T07:57:15.668000+00:00', 'errors': None, 'creationTime': '2021-02-28T09:03:28.414676+00:00', 'modifiedTime': '2021-02-28T09:03:44.147067+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT2400S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC6'}


# Training Artifacts

A training script has been created in the `<PROJECT_ROOT>/src/training` folder. This script will be executed on the remote compute. It uses transfer learning to fine-tune a pretrained ResNeXt-50 model on the Stanford Dogs Dataset.

Run the training script locally for 2 epochs for debugging purposes.

In [29]:
!{sys.executable} ../src/training/train.py --data_path ../data --num_epochs=2 --output_dir="../outputs" --learning_rate 0.1 --momentum 0.9

Torch version: 1.7.0
--------------------
CREATE DATALOADERS
--------------------
Dataloaders have been created successfully.
--------------------
START MODEL TRAINING
--------------------
Hyperparameter number of epochs: 2
Hyperparameter batch size: 4
Hyperparameter learning rate: 0.1
Hyperparameter momentum: 0.9
Hyperparameter number of frozen layers: 7
Hyperparameter number of neurons fc layer: 512
Hyperparameter dropout probability fc layer: 0
Hyperparameter lr scheduler step size: 10
Attempted to log scalar metric num_epochs:
2
Attempted to log scalar metric batch_size:
4
Attempted to log scalar metric lr:
0.1
Attempted to log scalar metric momentum:
0.9
Attempted to log scalar metric num_frozen_layers:
7
Attempted to log scalar metric num_neurons_fc_layer:
512
Attempted to log scalar metric dropout_prob_fc_layer:
0
Attempted to log scalar metric lr_scheduler_step_size:
10
--------------------
Epoch 1/2
--------------------
Train Loss: 81.5035 Train Acc: 0.2345
Cross Entropy: 1956

# Training Environment

Load the training environment that has been registered as part of the `00_environment_setup` notebook and use it for remote training on the AML compute cluster.

In [9]:
env = Environment.get(workspace=ws, name=env_name)

# Experiment & Run Configuration

Now that the training artifacts are prepared, a model can be trained on the remote compute cluster. You can take advantage of GPUs to cut down the training time. 

In [10]:
# Create the experiment
experiment = Experiment(workspace=ws, 
                        name=experiment_name)

### Option 1: Normal Script Run

In [39]:
# Set variable to identify run type for logic later in the notebook
run_type = "script_run"

# Create the script run configuration
src_config = ScriptRunConfig(source_directory=src_dir,
                             script="training/train.py",
                             compute_target=compute_target,
                             arguments=[
                                 "--data_path", dataset.as_named_input("input").as_mount(),
                                 "--num_epochs", 54,
                                 "--output_dir", "outputs",
                                 "--batch_size", 8,
                                 "--learning_rate", 0.2,
                                 "--momentum", 0.9,
                                 "--num_frozen_layers", 7,
                                 "--num_neurons_fc_layer", 1024,
                                 "--dropout_prob_fc_layer", 0.0,
                                 "--lr_scheduler_step_size", 12])

src_config.run_config.environment = env

# Start the Script Run
run = experiment.submit(src_config)

# Tag the run trigger and model architecture
run.tag("trigger", "jupyter_notebook")
run.tag("model_architecture", "transfer_learning_resnext-50")

### Option 2: Hyperdrive Run

Hyperparameters can be tuned using AML's hyperdrive capability.

The initial learning rate is tuned. The training script can contain a learning rate (LR) schedule that decays the learning rate every few epochs, starting from that initial learning rate.
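
The effect of such a schedule can be sketched in plain Python. This assumes a step schedule in which the learning rate is multiplied by a factor `gamma` every `step_size` epochs, as `torch.optim.lr_scheduler.StepLR` does; whether the training script uses exactly this schedule is an assumption, as is `gamma=0.1` (the `StepLR` default). `step_decay_lr` is a hypothetical helper for illustration only.

```python
def step_decay_lr(initial_lr, epoch, step_size, gamma=0.1):
    """Learning rate at a 0-based epoch when decaying by gamma every step_size epochs."""
    return initial_lr * gamma ** (epoch // step_size)


# Initial learning rate 0.1 and step size 20, as in the run configuration of this section
for epoch in (0, 19, 20, 40):
    print(f"epoch {epoch}: lr = {step_decay_lr(0.1, epoch, step_size=20):.4f}")
```

With these values, the learning rate stays at its initial value for the first 20 epochs and is then reduced by a factor of 10 every 20 epochs.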

Random sampling is used to try different configuration sets of hyperparameters to maximize the primary metric, the best validation accuracy (best_val_acc).
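
To illustrate what random sampling means here, the following sketch draws configurations from the same search space using only the Python standard library. It is not how Hyperdrive is implemented; Hyperdrive performs the sampling itself, and `sample_configuration` is a hypothetical helper.

```python
import random


def sample_configuration(rng):
    """Draw one hyperparameter configuration from the search space of this section."""
    return {
        "batch_size": rng.choice([8, 16]),                     # discrete choice
        "learning_rate": rng.uniform(0.1, 0.3),                # continuous uniform range
        "num_frozen_layers": rng.choice([6]),                  # single value, effectively fixed
        "num_neurons_fc_layer": rng.choice([512, 768, 1024]),  # discrete choice
    }


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
configs = [sample_configuration(rng) for _ in range(8)]  # 8 draws, matching max_total_runs=8
for config in configs:
    print(config)
```

Each draw is independent of the others, so the same discrete values can be sampled more than once.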

An early termination policy is specified to stop poorly performing runs early. The BanditPolicy is used, which terminates any run whose primary metric does not fall within the slack factor of the best performing run. In this template, the policy is applied every epoch (since the best_val_acc metric is reported every epoch and evaluation_interval=1). The first policy evaluation is delayed until after the first 20 epochs (delay_evaluation=20).
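
The slack factor rule can be made concrete with a small calculation. For a metric that is maximized, a run is terminated when its metric is below `best_metric / (1 + slack_factor)`. The sketch below applies this rule to a few made-up validation accuracies; `bandit_should_terminate` is a hypothetical helper, and the accuracy values are illustrative rather than results from this notebook.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor):
    """Return True if a run falls outside the allowed slack of the best run (maximize goal)."""
    threshold = best_metric / (1 + slack_factor)
    return run_metric < threshold


best_val_acc = 0.80  # illustrative value for the best performing run
for val_acc in (0.75, 0.70, 0.65):
    terminate = bandit_should_terminate(val_acc, best_val_acc, slack_factor=0.15)
    print(f"val_acc={val_acc:.2f} terminate={terminate}")
```

With a slack factor of 0.15 and a best validation accuracy of 0.80, the threshold is 0.80 / 1.15, or roughly 0.696, so only the run at 0.65 would be terminated.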

In [23]:
# Set variable to identify run type for logic later in the notebook
run_type = "hyperdrive_run"

# Define the random parameter sampling
param_sampling = RandomParameterSampling({
    "batch_size": choice(8, 16),
    "learning_rate": uniform(0.1, 0.3),
    "num_frozen_layers": choice(6),
    "num_neurons_fc_layer": choice(512, 768, 1024)})

# Check https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters#bandit-policy
# for details on BanditPolicy
early_termination_policy = BanditPolicy(slack_factor=0.15, evaluation_interval=1, delay_evaluation=20)

# Create the script run configuration
src_config = ScriptRunConfig(source_directory=src_dir,
                             script="training/train.py",
                             compute_target=compute_target,
                             arguments=[
                                 "--data_path", dataset.as_named_input("input").as_mount(),
                                 "--num_epochs", 60,
                                 "--output_dir", "outputs",
                                 "--batch_size", 16,
                                 "--learning_rate", 0.1,
                                 "--momentum", 0.9,
                                 "--num_frozen_layers", 6,
                                 "--num_neurons_fc_layer", 1024,
                                 "--dropout_prob_fc_layer", 0.0,
                                 "--lr_scheduler_step_size", 20])

src_config.run_config.environment = env

hyperdrive_config = HyperDriveConfig(run_config=src_config,
                                     hyperparameter_sampling=param_sampling, 
                                     policy=early_termination_policy,
                                     primary_metric_name="best_val_acc",
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=8,
                                     max_concurrent_runs=4)

# Start the Hyperdrive Run
run = experiment.submit(hyperdrive_config)

# Tag the run trigger and model architecture
run.tag("trigger", "jupyter_notebook")
run.tag("model_architecture", "transfer_learning_resnext-50")

# Run Monitoring

In [35]:
RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

In [36]:
# Get portal URL
run.get_portal_url()

'https://ml.azure.com/experiments/stanford-dogs-classifier-train/runs/stanford-dogs-classifier-train_1615735185_a4856f6c?wsid=/subscriptions/e58a23da-421e-4b52-99d5-e615f2f8be41/resourcegroups/sbirkamlrg/workspaces/sbirkamlws'

In [37]:
run.wait_for_completion(show_output=False)

KeyboardInterrupt: 

In [None]:
# Retrieve best child run
if run_type == "hyperdrive_run":
    best_child_run = run.get_best_run_by_primary_metric()
elif run_type == "script_run":
    best_child_run = run

In [None]:
# Check run metrics, details and file names
best_child_run_metrics = best_child_run.get_metrics()
best_child_run_details = best_child_run.get_details()
best_child_run_file_names = best_child_run.get_file_names()


print(best_child_run_metrics)
print("==========================")
print(best_child_run_details)
print("==========================")
print(best_child_run_file_names)

In [None]:
print("Best run is:")
print(f"Validation accuracy: {best_child_run_metrics['best_val_acc'][-1]}")
print(f"Learning rate: {best_child_run_metrics['lr']}")
print(f"Momentum: {best_child_run_metrics['momentum']}")

# Model Registration

Register the model to the AML workspace for subsequent deployment.

In [None]:
model = best_child_run.register_model(model_name=model_name,
                                      description=model_description,
                                      model_path=remote_model_path,
                                      tags=model_tags,
                                      properties={"val_acc": best_child_run_metrics["best_val_acc"][-1]},
                                      model_framework=Model.Framework.PYTORCH,
                                      model_framework_version=torch.__version__)

print(model.name, model.id, model.version, sep="\t")

Add the input dataset to the registered model.

In [None]:
model.add_dataset_references([("input dataset", dataset)])

### Model Download

If required, the model can be downloaded as follows (e.g. for local testing):

In [21]:
if download_model:
    
    # Create directory
    outputs_folder = os.path.join(os.getcwd(), outputs_dir)
    os.makedirs(outputs_folder, exist_ok=True)
    print(f"Outputs folder {outputs_folder} has been created.")
    
    # Download model artifact
    best_child_run.download_file(name=remote_model_path, output_file_path=local_model_path)

# Resource Clean Up

In [None]:
compute_target.delete()