# How to train a model on Azure ML

This notebook takes you through the steps of training a model on Azure ML for The Ocean Cleanup. We train the models through Azure ML so that every training and test run is recorded, which lets us trace why and how a model was created.

When the result of a training run is satisfactory, we can register a model from that run and then deploy it.

There are a few concepts to know about first:

- Workspace: The entire AzureML environment you are working in. The Workspace contains all the other elements.
- Experiment: A collection of Runs (see below). A logical container for training a model with different parameters to determine which performs best.
- Run: A single train/test run of a model. Each run is tied to an experiment. If you want to train the same model with different parameters so you can compare the results, each training is a separate run under the same experiment.
- Environment: The code environment used by your code. This contains things like the required Python packages. Multiple options exist here - from just using your local environment to completely curated environments directly from Azure.
- Dataset: A single dataset as registered in the AzureML workspace.
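
To make the relationship between these concepts concrete, here is a toy model in plain Python. It is only an illustration: the class names mirror the concepts above, but none of this is the AzureML SDK, and the `lr` parameter and `accuracy` values are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One train/test pass with a fixed set of parameters."""
    parameters: dict
    metrics: dict = field(default_factory=dict)


@dataclass
class Experiment:
    """A named container that groups comparable runs."""
    name: str
    runs: list = field(default_factory=list)

    def best_run(self, metric):
        """Return the run with the highest value for the given metric."""
        return max(self.runs, key=lambda run: run.metrics[metric])


@dataclass
class Workspace:
    """Top-level container holding experiments (and, in reality, much more)."""
    name: str
    experiments: dict = field(default_factory=dict)


# Two runs of the same model with different (illustrative) learning rates.
toy_experiment = Experiment("model-yolov5-v-1-0")
toy_experiment.runs.append(Run({"lr": 0.01}, {"accuracy": 0.81}))
toy_experiment.runs.append(Run({"lr": 0.001}, {"accuracy": 0.87}))

toy_workspace = Workspace("RiverImageAnalysis")
toy_workspace.experiments[toy_experiment.name] = toy_experiment

best = toy_experiment.best_run("accuracy")
print(best.parameters)
```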

With that out of the way, let's dive right in. Looking at these components, our first step will be to get the correct Workspace:

In [1]:
from toc_azurewrapper.workspace import get_workspace

subscription_id = "a00eaec6-b320-4e7c-ae61-60a30aec1cfc"
resource_group = "MachineLearning"
workspace_name = "RiverImageAnalysis"
tenant_id = "86f9fea7-9eb0-4325-8b58-7ed0db623956"

workspace = get_workspace(subscription_id, resource_group, workspace_name, tenant_id=tenant_id)
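
Hard-coding subscription and tenant IDs in a notebook makes them easy to leak and easy to forget when they change. One option is to read them from environment variables instead. The sketch below assumes variable names such as `AZURE_SUBSCRIPTION_ID`; these names are chosen for this example only and are not required by `toc_azurewrapper` or by Azure.

```python
import os

# Hypothetical environment variable names, chosen for this example only.
REQUIRED_SETTINGS = {
    "subscription_id": "AZURE_SUBSCRIPTION_ID",
    "resource_group": "AZURE_RESOURCE_GROUP",
    "workspace_name": "AZURE_WORKSPACE_NAME",
    "tenant_id": "AZURE_TENANT_ID",
}


def load_workspace_settings(environ=None):
    """Collect the workspace settings from the environment.

    Raises KeyError naming every missing variable, so a misconfigured
    shell is reported in one go.
    """
    environ = os.environ if environ is None else environ
    missing = [var for var in REQUIRED_SETTINGS.values() if not environ.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: environ[var] for key, var in REQUIRED_SETTINGS.items()}
```

The resulting dictionary can then be handed to the wrapper in the same way as the literals above, for example `get_workspace(settings["subscription_id"], settings["resource_group"], settings["workspace_name"], tenant_id=settings["tenant_id"])`.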

## Create experiment

Now that we have a workspace available, we need to create an experiment. As described above, an experiment is the container for multiple runs, in which we can train and compare the model using different parameters.

The experiment needs a name. Use something that is descriptive and clear to anyone seeing this.

In [2]:
from toc_azurewrapper.train import create_experiment
experiment = create_experiment(workspace, "model-yolov5-v-1-0")
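
The name used above appears to follow a `model-<name>-v-<version>` pattern with the dots in the version replaced by dashes. That pattern is inferred from this single example, so treat the helper below as one possible convention rather than a rule of the project.

```python
import re


def experiment_name(model, version):
    """Build a name such as 'model-yolov5-v-1-0' from a model name and version."""
    raw = f"model-{model}-v-{version}".lower()
    # Replace every run of characters other than letters, digits, '_' or '-'
    # with a single dash, so '1.0' becomes '1-0'.
    return re.sub(r"[^a-z0-9_-]+", "-", raw).strip("-")


print(experiment_name("yolov5", "1.0"))  # model-yolov5-v-1-0
```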

## Create or select compute target

We want to train our model on a GPU cluster on AzureML. Let's create one (or load an existing one).

In [3]:
from toc_azurewrapper.compute import get_compute

compute_target = get_compute(workspace, "gpu-cluster", vm_size='STANDARD_NC6', max_nodes=4)

Found existing compute target
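
`get_compute` is part of our own wrapper, and its implementation is not shown in this notebook. Judging from the `Found existing compute target` message, it likely follows the common get-or-create pattern. A minimal sketch of that pattern with the `azureml-core` (SDK v1) classes could look as follows; the function name `get_or_create_compute` is made up for this example, and the real wrapper may differ.

```python
def get_or_create_compute(workspace, name, vm_size="STANDARD_NC6", max_nodes=4):
    """Return the named compute target, creating an AmlCompute cluster if needed."""
    # Imported lazily so this module can be loaded without azureml-core installed.
    from azureml.core.compute import AmlCompute, ComputeTarget
    from azureml.core.compute_target import ComputeTargetException

    try:
        # Looking up a target that does not exist raises ComputeTargetException.
        target = ComputeTarget(workspace=workspace, name=name)
        print("Found existing compute target")
    except ComputeTargetException:
        config = AmlCompute.provisioning_configuration(
            vm_size=vm_size,
            max_nodes=max_nodes,
        )
        target = ComputeTarget.create(workspace, name, config)
        # Block until the cluster has been provisioned.
        target.wait_for_completion(show_output=True)
    return target
```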


## Create the environment

We now need to create an environment. In this case, we build the environment based on the Azure Docker TensorFlow image, but with our own pip requirements.

In [4]:
from toc_azurewrapper.environment import get_environment


environment = get_environment(
    workspace,
    "custom-pytorch",
    pip_requirements="./examples/yolov5/yolov5/requirements.txt",
    docker_file="./examples/yolov5/Dockerfile"
)

No environment with that name found, creating new one
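
The implementation of `get_environment` is not shown here either. As a rough idea of what such a helper might do with the `azureml-core` (SDK v1) `Environment` class, consider the sketch below. The function name is hypothetical, and catching a broad `Exception` for a failed lookup is an assumption made for this example, not documented wrapper behaviour.

```python
def get_or_create_environment(workspace, name, pip_requirements, docker_file):
    """Fetch a registered environment, or build one from a Dockerfile plus pip packages."""
    # Imported lazily so this module can be loaded without azureml-core installed.
    from azureml.core import Environment

    try:
        return Environment.get(workspace=workspace, name=name)
    except Exception:
        # Assumption: a failed lookup means the environment does not exist yet.
        print("No environment with that name found, creating new one")

    environment = Environment.from_pip_requirements(
        name=name,
        file_path=pip_requirements,
    )
    # Build on our own Dockerfile instead of the default base image.
    environment.docker.base_image = None
    environment.docker.base_dockerfile = docker_file
    return environment.register(workspace=workspace)
```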


## Prepare model wrapper

Now it's time to perform our first Run of the experiment. However, before we can do this, we will need a wrapper around our model. This wrapper needs to do a few things:

- Initialize and train the model with:
  - The desired parameters
  - The desired data
- Evaluate the performance of the trained model
- Register the parameters and the performance in the Run object
- Add the generated model artifacts to the Run object

Skeleton code for this is available in `skeleton_files/train.py`. In this file you declare the parameters you expect, create, train, and evaluate the model using these parameters and the loaded dataset(s), and register the results and the created artifacts with the Run.

For this how-to, we will use the example provided in `examples/yolov5/train.py`, which is an implementation of the skeleton file mentioned above. It requires the YOLOv5 weights to be registered as a dataset.
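
The contents of the skeleton file are not reproduced in this notebook. As a rough, hypothetical outline of the responsibilities listed above, a wrapper could look like the sketch below. `Run.get_context()` and `run.log()` are part of `azureml-core`; the parameter names, `train_and_evaluate`, and the `LocalRun` fallback are placeholders invented for this example.

```python
import argparse
import json
import os


class LocalRun:
    """Hypothetical stand-in used when azureml-core is not installed."""

    def log(self, name, value):
        print(f"{name} = {value}")


def get_run():
    """Return the active AzureML Run, or a local stand-in without the SDK."""
    try:
        from azureml.core import Run
    except ImportError:
        return LocalRun()
    return Run.get_context()


def train_and_evaluate(params):
    """Placeholder for the real model code; returns metrics and an artifact."""
    # A real wrapper would build, train, and score the model here.
    return {"epochs_run": params["epochs"]}, {"note": "dummy weights"}


def main(argv=None, output_dir="outputs"):
    # 1. Declare the parameters this script expects (names are illustrative).
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch-size", type=int, default=16)
    args = parser.parse_args(argv)
    params = {"epochs": args.epochs, "batch_size": args.batch_size}

    # 2. Register the parameters, then train and register the performance.
    run = get_run()
    for name, value in params.items():
        run.log(name, value)
    metrics, artifact = train_and_evaluate(params)
    for name, value in metrics.items():
        run.log(name, value)

    # 3. Write artifacts to the output folder so they end up with the run.
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, "model.json")
    with open(path, "w") as handle:
        json.dump(artifact, handle)
    return metrics, path
```

AzureML uploads files written to the `outputs` folder to the run automatically, which is why the artifact is saved there rather than attached by hand.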

### Changes to yolo code

- Added AzureML files to `yolov5/requirements.txt`

## Run the experiment

Now it is time to run the experiment. First, we fetch the desired datasets and combine them into train and test sets. Note that we can provide multiple sets for both training and testing. Also note that each set is a pair of a label dataset and an image dataset, in that order.

In [5]:
from azureml.core import Dataset

train_images = Dataset.get_by_name(workspace, name="campaign-26-10-2020_images")
train_labels = Dataset.get_by_name(workspace, name="campaign-26-10-2020_labels")
test_images = Dataset.get_by_name(workspace, name="campaign-22-10-2020_images")
test_labels = Dataset.get_by_name(workspace, name="campaign-22-10-2020_labels")
# Each entry is a (labels, images) pair; several pairs can be combined.
trainsets = [
    (train_labels, train_images),
    (test_labels, test_images)
]
testsets = [
    (test_labels, test_images)
]
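
Because the same dataset objects can be reused in several lists, it is easy to end up evaluating on data the model has already seen. In the cell above, for example, the 22-10-2020 campaign appears in both `trainsets` and `testsets`. If that is not intended, a small check like the following can flag it before a run is submitted. This is plain Python, with strings standing in for the dataset objects.

```python
def overlapping_sets(trainsets, testsets):
    """Return the (labels, images) pairs that occur in both lists."""
    return [pair for pair in testsets if pair in trainsets]


# Strings stand in for the registered dataset objects.
example_train = [
    ("campaign-26-10-2020_labels", "campaign-26-10-2020_images"),
    ("campaign-22-10-2020_labels", "campaign-22-10-2020_images"),
]
example_test = [
    ("campaign-22-10-2020_labels", "campaign-22-10-2020_images"),
]

print(overlapping_sets(example_train, example_test))
```

With the real objects this works as long as the same Python objects are reused in both lists, since the membership test matches identical objects.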

We now have everything we need to perform the run. Let's do so!

In [6]:
from toc_azurewrapper.train import perform_run

weights = Dataset.get_by_name(workspace, name="YOLOV5")

run = perform_run(experiment, 'train.py', 'examples/yolov5', environment=environment,
                  trainsets=trainsets, testsets=testsets, compute_target=compute_target,
                  parameters={
                      'weights': weights.as_named_input('weights').as_mount(),
                  })
run.wait_for_completion(show_output=True)
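
`perform_run` is another function from our wrapper whose implementation is not shown. For orientation, submitting a script with the plain `azureml-core` (SDK v1) API usually goes through `ScriptRunConfig` and `Experiment.submit`. The sketch below shows that route; the function names are made up, passing each parameter as a `--name value` argument is an assumption, and the handling of train and test sets is left out entirely.

```python
def build_arguments(parameters):
    """Flatten {'weights': value} into ['--weights', value] for the script."""
    arguments = []
    for name, value in parameters.items():
        arguments.extend([f"--{name}", value])
    return arguments


def submit_script(experiment, script, source_directory, environment,
                  compute_target, parameters):
    """Submit a training script to the given experiment and return the Run."""
    # Imported lazily so this module can be loaded without azureml-core installed.
    from azureml.core import ScriptRunConfig

    config = ScriptRunConfig(
        source_directory=source_directory,
        script=script,
        arguments=build_arguments(parameters),
        compute_target=compute_target,
        environment=environment,
    )
    return experiment.submit(config)
```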

RunId: model-yolov5-v-1-0_1604922581_e46bc08d
Web View: https://ml.azure.com/experiments/model-yolov5-v-1-0/runs/model-yolov5-v-1-0_1604922581_e46bc08d?wsid=/subscriptions/29d66431-a7ce-4709-93f7-3bdb01a243b3/resourcegroups/ExperimentationJayke/workspaces/ExperimentationJayke

Streaming azureml-logs/55_azureml-execution-tvmps_f45f29388ac2735b1df89ca1ab46fe083c80506bcb0e20aaf9f5f8e036d6e7a5_d.txt

2020-11-09T11:53:15Z Starting output-watcher...
2020-11-09T11:53:15Z IsDedicatedCompute == True, won't poll for Low Pri Preemption
2020-11-09T11:53:16Z Executing 'Copy ACR Details file' on 10.0.0.5
2020-11-09T11:53:16Z Copy ACR Details file succeeded on 10.0.0.5. Output: 
>>>   
>>>   
Login Succeeded
Using default tag: latest
latest: Pulling from azureml/azureml_4a19f9495f68b1b64d3ab073b622096f
171857c49d0f: Pulling fs layer
419640447d26: Pulling fs layer
61e52f862619: Pulling fs layer
c118dad7e37a: Pulling fs layer
2e36372995f9: Pulling fs layer
0b8e00a4ba4e: Pulling fs layer
b3026b4f2581: P

DEBUG - b'Overriding model.yaml nc=80 with nc=3\n\n\nAnalyzing anchors... anchors/target = 3.20, Best Possible Recall (BPR) = 1.0000\n                 all           4           0           0           0           0           0\n                 all           4           0           0           0           0           0\n                 all           4           0           0           0           0           0\n                 all           4           0           0           0           0           0\n                 all           4           0           0           0           0           0\n                 all           4           0           0           0           0           0\n                 all           4           0           0           0           0           0\n                 all           4           0           0           0           0           0\n                 all           4           0           0           0           0           0\n                 all


Streaming azureml-logs/75_job_post-tvmps_f45f29388ac2735b1df89ca1ab46fe083c80506bcb0e20aaf9f5f8e036d6e7a5_d.txt

Entering job release. Current time:2020-11-09T11:58:20.779049
Starting job release. Current time:2020-11-09T11:58:21.685783
Logging experiment finalizing status in history service.
Starting the daemon thread to refresh tokens in background for process with pid = 1099
[2020-11-09T11:58:21.686654] job release stage : upload_datastore starting...
[{}] job release stage : start importing azureml.history._tracking in run_history_release.
[2020-11-09T11:58:21.687136] job release stage : execute_job_release starting...
[2020-11-09T11:58:21.689058] job release stage : copy_batchai_cached_logs starting...
[2020-11-09T11:58:21.689284] job release stage : copy_batchai_cached_logs completed...
[2020-11-09T11:58:21.695230] Entering context manager injector.
[2020-11-09T11:58:21.697052] job release stage : upload_datastore completed...
[2020-11-09T11:58:21.895609] job release stage : sen

{'runId': 'model-yolov5-v-1-0_1604922581_e46bc08d',
 'target': 'gpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2020-11-09T11:53:07.732089Z',
 'endTimeUtc': '2020-11-09T11:58:34.225819Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': 'ea01c0fa-2e06-45c4-9796-40199beb3890',
  'azureml.git.repository_uri': 'git@github.com:TheOceanCleanup/AIDataPipeLine.git',
  'mlflow.source.git.repoURL': 'git@github.com:TheOceanCleanup/AIDataPipeLine.git',
  'azureml.git.branch': 'main',
  'mlflow.source.git.branch': 'main',
  'azureml.git.commit': '1609c196a49515964b4c1022801125134f17dbf6',
  'mlflow.source.git.commit': '1609c196a49515964b4c1022801125134f17dbf6',
  'azureml.git.dirty': 'True',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [{'dataset': {'id': '49fac13f-e0d8-4622-a4ac-f3634ec5e35b'}, 'consumptionDetails': {'type': 'RunInput', 'inputName': 'test_images_0', 'm
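
The dictionary printed at the end of the output is the return value of `wait_for_completion`. Its fields can be used directly, for instance to compute how long the run took. The sketch below copies the relevant fields from the output above into a plain dictionary so that it can be run without a workspace. It assumes timestamps always have the format shown here.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def run_duration_seconds(details):
    """Return the wall-clock duration of a finished run in seconds."""
    start = datetime.strptime(details["startTimeUtc"], TIMESTAMP_FORMAT)
    end = datetime.strptime(details["endTimeUtc"], TIMESTAMP_FORMAT)
    return (end - start).total_seconds()


# Values copied from the run output above.
run_details = {
    "status": "Completed",
    "startTimeUtc": "2020-11-09T11:53:07.732089Z",
    "endTimeUtc": "2020-11-09T11:58:34.225819Z",
}
print(f"{run_details['status']} in {run_duration_seconds(run_details):.0f} s")
```

For this run the result is about 326 seconds, or roughly five and a half minutes.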

Finally, if the result is to our liking, we register the model so that it can be used for deployment. Provide a name for the model, the path to either a single artifact or the folder containing all required artifacts, and optionally a description, properties, and tags.

In [7]:
run.register_model(
    "yolov5",
    model_path="outputs/weights/",
    description="Yolo V5 implementation on Pytorch",
    properties={
        "location": "here",
        "time": "morning"
    }
)

Model(workspace=Workspace.create(name='ExperimentationJayke', subscription_id='29d66431-a7ce-4709-93f7-3bdb01a243b3', resource_group='ExperimentationJayke'), name=yolov5, id=yolov5:1, version=1, tags={}, properties={'location': 'here', 'time': 'morning'})
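
The `properties` in the cell above are placeholders. More useful values can be taken from the run itself, for example the git commit recorded in the run details shown earlier. The helper below is a sketch with a made-up name (`model_properties_from_run`), and it only uses keys that appear in that output.

```python
def model_properties_from_run(details):
    """Pick traceability fields from a run-details dictionary."""
    source = details.get("properties", {})
    wanted = {
        "git_commit": "azureml.git.commit",
        "git_branch": "azureml.git.branch",
        "git_dirty": "azureml.git.dirty",
    }
    properties = {"run_id": details["runId"]}
    for key, source_key in wanted.items():
        # Skip fields the run did not record (e.g. when not run from a git repo).
        if source_key in source:
            properties[key] = source[source_key]
    return properties


# Values copied from the run output earlier in this notebook.
example_details = {
    "runId": "model-yolov5-v-1-0_1604922581_e46bc08d",
    "properties": {
        "azureml.git.branch": "main",
        "azureml.git.commit": "1609c196a49515964b4c1022801125134f17dbf6",
        "azureml.git.dirty": "True",
    },
}
print(model_properties_from_run(example_details))
```

With a live run, the same helper could be fed from `run.get_details()` and its result passed as the `properties` argument of `run.register_model`.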