# How to train a model on Azure ML

This notebook takes you through the steps of training a model on Azure ML for The Ocean Cleanup. We train the models through Azure ML so that every training run is recorded, and we can later see why and how a model was created.

When the result of a training run is satisfactory, the model can be registered from that run, after which we can deploy it.

There are a few concepts to know about first:

- Workspace: The entire AzureML environment you are working in. The Workspace contains all the other elements.
- Experiment: A collection of Runs (see below). It is a logical container for training a model with different parameters to determine which works best.
- Run: A single train/test run of a model. Runs are tied to an Experiment. If you want to train the same model with different parameters so you can compare them, these are different Runs under the same Experiment.
- Environment: The software environment your code runs in. This contains things like the required Python packages. Multiple options exist here, from just using your local environment to fully curated environments provided by Azure.
- Dataset: A single dataset as registered in the AzureML workspace.

With that out of the way, let's dive right in. Looking at these components, our first step is to get the correct Workspace:

In [1]:
from azurewrapper.workspace import get_workspace

subscription_id = "29d66431-a7ce-4709-93f7-3bdb01a243b3"
resource_group = "ExperimentationJayke"
workspace_name = "ExperimentationJayke"

workspace = get_workspace(subscription_id, resource_group, workspace_name)
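The `get_workspace` helper comes from this project's `azurewrapper` package, and its implementation is not shown in this notebook. As a rough sketch of what such a helper could do with the Azure ML Python SDK (v1), assuming it simply wraps the SDK's `Workspace` constructor (the function name `connect_to_workspace` is hypothetical):

```python
def connect_to_workspace(subscription_id, resource_group, workspace_name):
    """Return a handle to an existing Azure ML workspace (hypothetical helper)."""
    # Imported lazily so this sketch can be defined without the SDK installed.
    from azureml.core import Workspace

    # The Workspace constructor loads an existing workspace; it does not create one.
    return Workspace(
        subscription_id=subscription_id,
        resource_group=resource_group,
        workspace_name=workspace_name,
    )
```

The real helper may do more than this, for example handle authentication explicitly.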

## Create experiment

Now that we have a workspace available, we need to create an experiment. As described above, an experiment is the container for multiple runs, in which we can train and compare the model using different parameters.

The experiment needs a name. Use something that is descriptive and clear to anyone seeing this.

In [2]:
from azurewrapper.train import create_experiment
experiment = create_experiment(workspace, "model-yolo-v-1-0")
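Experiment names are also subject to some formal constraints. The sketch below checks a name locally before it is sent to Azure ML. The rule it encodes (3 to 36 characters, starting with a letter or digit, followed by letters, digits, underscores or dashes) is my reading of the SDK v1 documentation, so treat it as an assumption; the function name `is_valid_experiment_name` is hypothetical.

```python
import re

# Assumed rule: 3-36 characters, first character a letter or digit,
# remaining characters letters, digits, "_" or "-".
_EXPERIMENT_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]{2,35}")


def is_valid_experiment_name(name):
    """Return True if the name satisfies the assumed naming rule."""
    return bool(_EXPERIMENT_NAME.fullmatch(name))


print(is_valid_experiment_name("model-yolo-v-1-0"))  # True
print(is_valid_experiment_name("yolo v1.0"))         # False: space and dot are rejected
```

The service itself remains the final judge of whether a name is accepted.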

## Create or select compute target

We want to train our model on a GPU cluster on AzureML. Let's create one (or load an existing one).

In [3]:
from azurewrapper.compute import get_compute

compute_target = get_compute(workspace, "gpu-cluster", vm_size='STANDARD_NC6', max_nodes=4)

Found existing compute target
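The "create or load" behaviour is a common pattern with the Azure ML SDK (v1): try to look up the compute target by name, and provision a new cluster only if the lookup fails. The sketch below shows that pattern. It is an assumption about what `get_compute` does internally, not its actual code, and the function name `get_or_create_compute` is hypothetical.

```python
def get_or_create_compute(workspace, name, vm_size="STANDARD_NC6", max_nodes=4):
    """Return the named compute target, provisioning it if it does not exist."""
    # Imported lazily so this sketch can be defined without the SDK installed.
    from azureml.core.compute import AmlCompute, ComputeTarget
    from azureml.core.compute_target import ComputeTargetException

    try:
        # Looking up a target that does not exist raises ComputeTargetException.
        target = ComputeTarget(workspace=workspace, name=name)
        print("Found existing compute target")
    except ComputeTargetException:
        # Describe the cluster, then ask Azure ML to provision it.
        config = AmlCompute.provisioning_configuration(
            vm_size=vm_size, max_nodes=max_nodes
        )
        target = ComputeTarget.create(workspace, name, config)
        target.wait_for_completion(show_output=True)
    return target
```

With this shape, running the cell a second time reuses the cluster instead of creating a new one.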


## Create the environment

We now need to create an environment. In this case, we build the environment on top of an Azure ML GPU (CUDA) Docker base image, adding our own pip requirements.

In [4]:
from azureml.core import Environment
from azurewrapper.environment import get_environment


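# Alternative (disabled): start from a curated Azure ML environment and save its definition locally.
#environment = get_environment(workspace, 'AzureML-TensorFlow-2.2-GPU')
#environment.save_to_directory(path='deps.yml')

# Build the environment from the pip requirements of the YOLO example.
environment = Environment.from_pip_requirements(name="tf-exp", file_path="./examples/yolo/Tensorflow_YOLO/requirements.txt")
# Run the training inside a Docker container, using a CUDA 10.1 / cuDNN 7 base image on Ubuntu 18.04.
environment.docker.enabled = True
environment.docker.base_image = "mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04"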

## Prepare model wrapper

Now it's time to perform our first Run of the experiment. However, before we can do this, we will need a wrapper around our model. This wrapper needs to do a few things:

- Initialize and train the model with:
  - The desired parameters
  - The desired data
- Evaluate the performance of the trained model
- Register the parameters and the performance in the Run object
- Add the generated model artifacts to the Run object

Skeleton code for this is available in `skeleton_files/train.py`. In this file you declare the parameters you expect, create, train and evaluate the model using these parameters and the loaded dataset(s), and register the results and the created artifacts with the Run.

For this how-to, we will use the example provided in `examples/yolo/train.py`. This is an implementation of the file mentioned above. It accepts four parameters: `LR_INIT`, `LR_END`, `WARMUP_EPOCHS` and `EPOCHS`, and requires `yolov4 weights` as a dataset.
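To make the wrapper's responsibilities more concrete, here is a minimal sketch of the parameter handling and result logging. It is not the content of `skeleton_files/train.py` or `examples/yolo/train.py`. It assumes the parameters reach the script as `--NAME value` command-line arguments, and the helper names `parse_args` and `log_results` are hypothetical. The only SDK call it relies on is `run.log(name, value)`, which records a metric on an Azure ML `Run`.

```python
import argparse


def parse_args(argv=None):
    """Parse the four training parameters; defaults mirror the values used in this how-to."""
    parser = argparse.ArgumentParser(description="Training wrapper sketch")
    parser.add_argument("--LR_INIT", type=float, default=1e-4)
    parser.add_argument("--LR_END", type=float, default=1e-6)
    parser.add_argument("--WARMUP_EPOCHS", type=int, default=2)
    parser.add_argument("--EPOCHS", type=int, default=100)
    # Ignore arguments this sketch does not know about (e.g. dataset mounts).
    args, _unknown = parser.parse_known_args(argv)
    return args


def log_results(run, args, metrics):
    """Record parameters and evaluation metrics on an Azure ML Run-like object."""
    for name, value in vars(args).items():
        run.log(name, value)
    for name, value in metrics.items():
        run.log(name, value)


# In a submitted script the run handle would come from the SDK:
#     from azureml.core import Run
#     run = Run.get_context()
# Files written to ./outputs are uploaded with the run automatically.
args = parse_args(["--LR_INIT", "1e-3", "--EPOCHS", "5"])
print(args.LR_INIT, args.LR_END, args.WARMUP_EPOCHS, args.EPOCHS)
```

After training and evaluation, a call such as `log_results(run, args, {"mAP": mAP, "FPS": fps})` would attach both the parameters and the scores to the run, so that runs can be compared in the experiment view.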

### Changes to the TensorFlow YOLO code

I made some changes to the TensorFlow YOLO code as provided by Mats/from GitHub:

- Changed the file paths in `Tensorflow_YOLO/yolov3/configs.py`, in addition to other changes there as indicated by Mats
- Made the imports from subdirectories in `Tensorflow_YOLO/train.py`, `Tensorflow_YOLO/evaluate_mAP.py`, and all files in `Tensorflow_YOLO/yolov3` relative instead of absolute, so that we can import them from our training wrapper
- `Tensorflow_YOLO/evaluate_mAP.py` now also returns the FPS
- `Tensorflow_YOLO/train.py:main()` now returns the evaluated mAP and FPS, and accepts the provided parameters as overrides of the config file
- Added AzureML modules to `Tensorflow_YOLO/requirements.txt`

## Run the experiment

Now we can set up and submit a run of the experiment. First, we fetch the desired datasets and combine them into training and test sets. Note that we can provide multiple sets for both training and testing, and that each set consists of both a label dataset and an image dataset.

In [5]:
from azureml.core import Dataset

train_images = Dataset.get_by_name(workspace, name="campaign-26-10-2020_images")
train_labels = Dataset.get_by_name(workspace, name="campaign-26-10-2020_labels")
test_images = Dataset.get_by_name(workspace, name="campaign-22-10-2020_images")
test_labels = Dataset.get_by_name(workspace, name="campaign-22-10-2020_labels")
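# Each entry is a (labels, images) pair; several pairs can be listed per split.
trainsets = [
    (train_labels, train_images),
    (test_labels, test_images)  # note: this campaign is also used as the test set below
]
testsets = [
    (test_labels, test_images)
]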
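The registered dataset names used here follow the pattern `<campaign>_labels` and `<campaign>_images`. Assuming that convention holds for other campaigns too, a small helper can build the (labels, images) pair from the campaign name alone. The name `load_pair` and its `fetch` parameter are hypothetical.

```python
def load_pair(campaign, fetch):
    """Return a (labels, images) tuple for one campaign.

    `fetch` maps a registered dataset name to a dataset object, for example
    lambda name: Dataset.get_by_name(workspace, name=name)
    """
    return fetch(f"{campaign}_labels"), fetch(f"{campaign}_images")


# Demonstration with a stand-in fetch function that simply echoes the name.
demo_sets = [load_pair("campaign-26-10-2020", fetch=str)]
print(demo_sets)
```

Keeping the lookup in one place makes it harder to pair the labels of one campaign with the images of another by accident.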

We now have everything we need to perform the run. Let's do so!
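As background for the call that follows: `perform_run` is part of `azurewrapper` and its code is not shown here. With the Azure ML SDK (v1), submitting a script typically comes down to building a `ScriptRunConfig` and passing it to `experiment.submit`. The sketch below shows that core under the assumption that parameters are passed as `--NAME value` arguments. The names `build_arguments` and `submit_training` are hypothetical, and the handling of train and test sets is left out.

```python
def build_arguments(parameters):
    """Flatten {'LR_INIT': 1e-4} into ['--LR_INIT', 1e-4] for a script's command line."""
    arguments = []
    for name, value in parameters.items():
        arguments.extend([f"--{name}", value])
    return arguments


def submit_training(experiment, script, source_directory, environment,
                    compute_target, parameters):
    """Submit one run (hypothetical stand-in for what perform_run may do)."""
    # Imported lazily so this sketch can be defined without the SDK installed.
    from azureml.core import ScriptRunConfig

    config = ScriptRunConfig(
        source_directory=source_directory,
        script=script,
        arguments=build_arguments(parameters),
        compute_target=compute_target,
        environment=environment,
    )
    # submit() uploads a snapshot of source_directory and returns a Run object.
    return experiment.submit(config)


print(build_arguments({"LR_INIT": 1e-4, "EPOCHS": 100}))
```

Because the whole source directory is snapshotted and uploaded on submission, keeping large files out of it shortens the start of each run.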

In [30]:
from azurewrapper.train import perform_run
from azureml.core.runconfig import TensorflowConfiguration

weights = Dataset.get_by_name(workspace, name="YOLO")

run = perform_run(experiment, 'train.py', 'examples/yolo', environment=environment,
                  trainsets=trainsets, testsets=testsets, compute_target=compute_target,
                  parameters={
                      'weights': weights.as_named_input('weights').as_mount(),
                      'LR_INIT': 1e-4,
                      'LR_END': 1e-6,
                      'WARMUP_EPOCHS': 2,
                      'EPOCHS': 100
                  })
run.wait_for_completion(show_output=True)

Submitting /home/jayke/dev/xomnia/TOC/AIDataPipeLine/ModelTraining/examples/yolo directory for run. The size of the directory >= 25 MB, so it can take a few minutes.


RunId: model-yolo-v-1-0_1604505069_64e89e86
Web View: https://ml.azure.com/experiments/model-yolo-v-1-0/runs/model-yolo-v-1-0_1604505069_64e89e86?wsid=/subscriptions/29d66431-a7ce-4709-93f7-3bdb01a243b3/resourcegroups/ExperimentationJayke/workspaces/ExperimentationJayke

Streaming azureml-logs/55_azureml-execution-tvmps_b371b84fb16d416741da287e0bd8ddde07419db5d4ffc3061efc84b2ec958458_d.txt

2020-11-04T15:51:27Z Starting output-watcher...
2020-11-04T15:51:27Z IsDedicatedCompute == True, won't poll for Low Pri Preemption
2020-11-04T15:51:28Z Executing 'Copy ACR Details file' on 10.0.0.4
2020-11-04T15:51:28Z Copy ACR Details file succeeded on 10.0.0.4. Output: 
>>>   
>>>   
Login Succeeded
Using default tag: latest
latest: Pulling from azureml/azureml_ea71d6d6fbafdaeb54907902a21c77dc
Digest: sha256:075622f50755675d4ad7d3f01a70dbbcc16d68d1efdf6299dd7898a02a47cd0a
Status: Image is up to date for 4974f70cd2934b4299204f2bf3475cda.azurecr.io/azureml/azureml_ea71d6d6fbafdaeb54907902a21c77dc:la

INFO - Starting training
GPUs [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2020-11-04 15:53:00.190062: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: e5a1:00:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-11-04 15:53:00.190154: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-11-04 15:53:00.190213: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-11-04 15:53:00.190268: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-11-04 15:53:00.190337: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-11-04 15:53:00.190389: I tensorflow/stream_e


Execution Summary
RunId: model-yolo-v-1-0_1604505069_64e89e86
Web View: https://ml.azure.com/experiments/model-yolo-v-1-0/runs/model-yolo-v-1-0_1604505069_64e89e86?wsid=/subscriptions/29d66431-a7ce-4709-93f7-3bdb01a243b3/resourcegroups/ExperimentationJayke/workspaces/ExperimentationJayke



{'runId': 'model-yolo-v-1-0_1604505069_64e89e86',
 'target': 'gpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2020-11-04T15:51:30.462372Z',
 'endTimeUtc': '2020-11-04T15:56:15.311258Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '1a23a4b6-7183-4ad8-a24c-aac1b3a2ff32',
  'azureml.git.repository_uri': 'git@github.com:TheOceanCleanup/AIDataPipeLine.git',
  'mlflow.source.git.repoURL': 'git@github.com:TheOceanCleanup/AIDataPipeLine.git',
  'azureml.git.branch': 'main',
  'mlflow.source.git.branch': 'main',
  'azureml.git.commit': 'ab69c2ccf82457e67532be841beb555b3456466e',
  'mlflow.source.git.commit': 'ab69c2ccf82457e67532be841beb555b3456466e',
  'azureml.git.dirty': 'True',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [{'dataset': {'id': '0046c288-9c27-407a-afb5-d0439c374cbe'}, 'consumptionDetails': {'type': 'RunInput', 'inputName': 'train_labels_0', 'me