# How to train a model on Azure ML

This notebook walks through the steps of training a model on Azure ML for The Ocean Cleanup. Training through Azure ML gives us a record of every run we perform, so that we can later see how and why a model was created.

When the result of a training run is satisfactory, the model can be registered from that run, after which it can be deployed.

There are a few concepts to know about first:

- Workspace: The entire Azure ML environment you are working in. The Workspace contains all the other elements.
- Experiment: A collection of Runs (see below). A logical container for training a model with different parameters to determine which works best.
- Run: A single train/test run of a model. Each Run is tied to an Experiment. If you want to train the same model with different parameters so that you can compare them, these are different Runs under the same Experiment.
- Environment: The environment your code runs in. This contains things like the required Python packages. Multiple options exist here, from just using your local environment to fully curated environments provided by Azure.
- Dataset: A single dataset as registered in the Azure ML workspace.

With that out of the way, let's dive right in. Looking at these components, our first step is to get the correct Workspace:

In [1]:
from azurewrapper.workspace import get_workspace

subscription_id = "29d66431-a7ce-4709-93f7-3bdb01a243b3"
resource_group = "ExperimentationJayke"
workspace_name = "ExperimentationJayke"

workspace = get_workspace(subscription_id, resource_group, workspace_name)
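
The cell above hard-codes the workspace identifiers. As a minimal sketch, assuming `get_workspace` is a thin wrapper around `Workspace.get` from `azureml.core` (the wrapper's internals are not shown in this notebook), the same identifiers could be read from environment variables instead. The variable names `AZURE_SUBSCRIPTION_ID`, `AZURE_RESOURCE_GROUP`, and `AZUREML_WORKSPACE_NAME`, and both helper functions, are hypothetical:

```python
import os


def workspace_settings_from_env(environ=os.environ):
    """Collect workspace identifiers from (hypothetical) environment variables."""
    keys = {
        "subscription_id": "AZURE_SUBSCRIPTION_ID",
        "resource_group": "AZURE_RESOURCE_GROUP",
        "workspace_name": "AZUREML_WORKSPACE_NAME",
    }
    # Fail early with a clear message if any identifier is absent or empty.
    missing = [var for var in keys.values() if not environ.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {field: environ[var] for field, var in keys.items()}


def connect_to_workspace(settings):
    """Assumed equivalent of get_workspace, built on azureml.core.Workspace.get."""
    # Imported lazily so the settings helper works without the Azure ML SDK.
    from azureml.core import Workspace

    return Workspace.get(
        name=settings["workspace_name"],
        subscription_id=settings["subscription_id"],
        resource_group=settings["resource_group"],
    )
```

Calling `connect_to_workspace(workspace_settings_from_env())` would then play the role of the `get_workspace(...)` call, without identifiers appearing in the notebook itself.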

## Create experiment

Now that we have a workspace available, we need to create an experiment. As described above, an experiment is the container for multiple runs, in which we can train and compare the model using different parameters.

The experiment needs a name. Use something that is descriptive and clear to anyone seeing this.

In [2]:
from azurewrapper.train import create_experiment
experiment = create_experiment(workspace, "model-frcnn-v-1-0")
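
Besides being descriptive, the name also has to be accepted by Azure ML. The sketch below assumes the naming rule given in the SDK documentation for `Experiment` (3 to 36 characters, starting with a letter or digit, and otherwise only letters, digits, underscores, and dashes); treat that rule as an assumption and check it against your SDK version. Both helper functions are hypothetical and not part of `azurewrapper`:

```python
import re

# Assumed naming rule: 3-36 characters, starting with a letter or digit,
# followed only by letters, digits, underscores, and dashes.
_EXPERIMENT_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]{2,35}")


def is_valid_experiment_name(name):
    """Return True if the name satisfies the assumed Azure ML naming rule."""
    return _EXPERIMENT_NAME.fullmatch(name) is not None


def experiment_name(model, version):
    """Build a name such as 'model-frcnn-v-1-0' from a model name and a version."""
    # Dots are replaced because the assumed rule does not allow them.
    name = f"model-{model}-v-{version.replace('.', '-')}"
    if not is_valid_experiment_name(name):
        raise ValueError(f"Invalid experiment name: {name!r}")
    return name


print(experiment_name("frcnn", "1.0"))  # prints model-frcnn-v-1-0
```

Deriving the name from the model and its version keeps the naming consistent across experiments and catches an invalid name before anything is submitted.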

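## Create or select compute target

Training runs on a compute target. The cell below reuses the cluster named `gpu-cluster` if it already exists in the workspace, and otherwise provisions a new one with up to four `STANDARD_NC6` nodes.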

In [3]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
cluster_name = "gpu-cluster"

try:
    compute_target = ComputeTarget(workspace=workspace, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', 
                                                           max_nodes=4)

    compute_target = ComputeTarget.create(workspace, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

Creating a new compute target...
Creating
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Create the environment

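We now need to create an environment. One option is the curated TensorFlow 2.2 environment (commented out below). In this how-to, we instead build a custom environment from a conda specification and our own Dockerfile.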

In [4]:
from azureml.core import Environment
from azurewrapper.environment import get_environment
# environment = get_environment(
#     workspace,
#     "AzureML-TensorFlow-2.2-GPU"
# )

In [5]:
environment = Environment.from_conda_specification(
    "custom_tensorflow_object_detection",
    file_path='examples/frcnn/conda_dependencies.yml'
)
environment.docker.enabled = True


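# Build the image from our own Dockerfile instead of a base image
environment.docker.base_image = None
environment.docker.base_dockerfile = "./examples/frcnn/Dockerfile"
# Use the Python environment already present in the image; Azure ML will not manage it
environment.python.user_managed_dependencies = True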

In [6]:

# tf_env = Environment.from_conda_specification(
#     name='tensorflow-2.2-gpu',
#     file_path='./examples/frcnn/conda_dependencies.yml'
# )

# # Specify a GPU base image
# tf_env.docker.enabled = True
# tf_env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04'
# environment = tf_env

## Prepare model wrapper

Now it's time to perform our first Run of the experiment. However, before we can do this, we will need a wrapper around our model. This wrapper needs to do a few things:

- Initialize and train the model with:
  - The desired parameters
  - The desired data
- Evaluate the performance of the trained model
- Register the parameters and the performance in the Run object
- Add the generated model artifacts to the Run object

Skeleton code for this is available in `skeleton_files/train.py`. In this file you declare the parameters you expect, create, train, and evaluate the model using these parameters and the loaded dataset(s), and register the results and the created artifacts with the Run.

For this how-to, we will use the example provided in `examples/frcnn/train.py`. This is an implementation of the file mentioned above. It expects two parameters: `num_train_steps` and `sample_1_of_n_eval_examples`.
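
To make the shape of such a wrapper concrete, here is a minimal sketch. It is not the contents of `skeleton_files/train.py` or `examples/frcnn/train.py`. It assumes the parameters reach the script as command-line flags (how `perform_run` passes them is not shown in this notebook), and `train_and_evaluate` is a hypothetical placeholder for the real model code. The only Azure ML call it relies on is `run.log(name, value)`:

```python
import argparse


def parse_args(argv=None):
    """Parse the two parameters the FRCNN example expects."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_train_steps", type=int, required=True)
    parser.add_argument("--sample_1_of_n_eval_examples", type=int, default=1)
    return parser.parse_args(argv)


def train_and_evaluate(num_train_steps, sample_1_of_n_eval_examples):
    """Hypothetical placeholder: a real script trains and evaluates the model here."""
    return {"steps_completed": num_train_steps}


def main(run, argv=None):
    args = parse_args(argv)
    # Register the parameters with the Run so they appear next to the results.
    run.log("num_train_steps", args.num_train_steps)
    run.log("sample_1_of_n_eval_examples", args.sample_1_of_n_eval_examples)
    metrics = train_and_evaluate(args.num_train_steps, args.sample_1_of_n_eval_examples)
    # Register the performance of the trained model with the Run.
    for name, value in metrics.items():
        run.log(name, value)
    # Generated model artifacts would be attached here, e.g. with run.upload_file.
    return metrics
```

Inside an actual run, the `run` argument would come from `Run.get_context()` in `azureml.core`. Passing it in as a parameter keeps the wrapper testable with a simple stand-in object that only implements `log`.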

## Run the experiment

Now we need to create and run the experiment. First, we fetch the desired datasets and combine them into training and test sets. Note that we can provide multiple sets for both training and testing. Also note that each set consists of both a label dataset and an image dataset.

In [7]:
from azureml.core import Dataset

train_images = Dataset.get_by_name(workspace, name="campaign-26-10-2020_images")
train_labels = Dataset.get_by_name(workspace, name="campaign-26-10-2020_labels")
test_images = Dataset.get_by_name(workspace, name="campaign-22-10-2020_images")
test_labels = Dataset.get_by_name(workspace, name="campaign-22-10-2020_labels")
trainsets = [
    (train_labels, train_images),
    (test_labels, test_images)
]
testsets = [
    (test_labels, test_images)
]
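
Note that the cell above lists the `campaign-22-10-2020` pair in both `trainsets` and `testsets`, so the evaluation data is also used for training. Whether that is intended depends on the experiment. As a sketch, a small hypothetical helper can flag such overlap before a run is submitted. It only assumes that each dataset object exposes a `name` attribute, as registered Azure ML datasets do. Stand-in objects are used here so that the example runs without a workspace:

```python
from types import SimpleNamespace


def overlapping_dataset_names(trainsets, testsets):
    """Return the sorted names of datasets that occur in both train and test sets."""
    def names(sets):
        # Each set is a (labels, images) pair; collect the name of every dataset.
        return {dataset.name for pair in sets for dataset in pair}

    return sorted(names(trainsets) & names(testsets))


# Stand-ins with only a `name` attribute, mimicking registered datasets.
a_labels, a_images = SimpleNamespace(name="a_labels"), SimpleNamespace(name="a_images")
b_labels, b_images = SimpleNamespace(name="b_labels"), SimpleNamespace(name="b_images")

overlap = overlapping_dataset_names(
    [(a_labels, a_images), (b_labels, b_images)],
    [(b_labels, b_images)],
)
print(overlap)  # prints ['b_images', 'b_labels']
```

An empty list means the training and test sets share no registered dataset.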

We now have everything we need to perform the run on the compute target. Let's do so!

In [9]:
from azurewrapper.train import perform_run

# Checkpoint files, mounted into the run as the input named "checkpoint"
checkpoint_files = Dataset.get_by_name(workspace, name="FRCNN")

# Submit 'train.py' from 'examples/frcnn' to the compute target
run = perform_run(experiment, 'train.py', 'examples/frcnn', environment=environment,
                  trainsets=trainsets, testsets=testsets, compute_target=compute_target,
                  parameters={
                      'num_train_steps': 10000,
                      'sample_1_of_n_eval_examples': 1,
                      'checkpoint_dataset': checkpoint_files.as_named_input('checkpoint').as_mount()
                  })
# Block until the run finishes, streaming its logs into the notebook
run.wait_for_completion(show_output=True)

RunId: model-frcnn-v-1-0_1604312174_b7a3065b
Web View: https://ml.azure.com/experiments/model-frcnn-v-1-0/runs/model-frcnn-v-1-0_1604312174_b7a3065b?wsid=/subscriptions/29d66431-a7ce-4709-93f7-3bdb01a243b3/resourcegroups/ExperimentationJayke/workspaces/ExperimentationJayke

Streaming azureml-logs/20_image_build_log.txt

2020/11/02 10:16:20 Downloading source code...
2020/11/02 10:16:21 Finished downloading source code
2020/11/02 10:16:23 Creating Docker network: acb_default_network, driver: 'bridge'
2020/11/02 10:16:24 Successfully set up Docker network: acb_default_network
2020/11/02 10:16:24 Setting up Docker configuration...
2020/11/02 10:16:24 Successfully set up Docker configuration
2020/11/02 10:16:24 Logging in to registry: 4974f70cd2934b4299204f2bf3475cda.azurecr.io
2020/11/02 10:16:26 Successfully logged into 4974f70cd2934b4299204f2bf3475cda.azurecr.io
2020/11/02 10:16:26 Executing step ID: acb_step_0. Timeout(sec): 5400, Working directory: '', Network: 'acb_default_network'
2

  Downloading https://files.pythonhosted.org/packages/26/1f/f5939ce897f3d223596a8f922329b868731b24b913c0b2dfff9d21f877e4/azureml_dataprep-2.4.2-py3-none-any.whl (28.2MB)
Collecting fusepy<4.0.0,>=3.0.1; extra == "fuse"
  Downloading https://files.pythonhosted.org/packages/04/0b/4506cb2e831cea4b0214d3625430e921faaa05a7fb520458c75a2dbd2152/fusepy-3.0.1.tar.gz
Collecting azure-mgmt-keyvault<7.0.0,>=0.40.0
  Downloading https://files.pythonhosted.org/packages/f1/af/1ba15e7176bcf6b1531b453e410ae41a983c09f834d8700dfce739451b53/azure_mgmt_keyvault-2.2.0-py2.py3-none-any.whl (89kB)
Collecting ruamel.yaml>=0.15.35
  Downloading https://files.pythonhosted.org/packages/7e/39/186f14f3836ac5d2a6a042c8de69988770e8b9abb537610edc429e4914aa/ruamel.yaml-0.16.12-py2.py3-none-any.whl (111kB)
Collecting msrestazure>=0.4.33
  Downloading https://files.pythonhosted.org/packages/5e/3a/7adb08fd2f0ee6fdfd03685fac38477b64f184943dcf6ea0cbffb205f22d/msrestazure-0.6.4-py2.py3-none-any.whl (40kB)
Collecting PyJWT
  

Removing intermediate container c66931db7676
 ---> f8036a506128
Step 3/26 : RUN mkdir /install
 ---> Running in 500483aa58cc
Removing intermediate container 500483aa58cc
 ---> ba3628e6f014
Step 4/26 : RUN mkdir /install/TensorFlow
 ---> Running in fdbdebf868cf
Removing intermediate container fdbdebf868cf
 ---> 4dcfc2ac242b
Step 5/26 : WORKDIR /install/TensorFlow
 ---> Running in afe8390f7d27
Removing intermediate container afe8390f7d27
 ---> 25555120e012
Step 6/26 : RUN wget https://github.com/tensorflow/models/archive/master.zip
 ---> Running in aa4e878399e0
--2020-11-02 10:22:26--  https://github.com/tensorflow/models/archive/master.zip
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/tensorflow/models/zip/master [following]
--2020-11-02 10:22:26--  https://codeload.github.com/t

  Downloading https://files.pythonhosted.org/packages/6c/72/0f85032391a690a04170c05ec899669e99419182d1a68bbd512503faef60/apache_beam-2.25.0-cp37-cp37m-manylinux2010_x86_64.whl (8.7MB)
Collecting lxml
  Downloading https://files.pythonhosted.org/packages/5e/87/9f887cbf975349e7917e6d6b76e66ec42b5a27defd4d8b041c1c532779ea/lxml-4.6.1-cp37-cp37m-manylinux1_x86_64.whl (5.5MB)
Collecting matplotlib
  Downloading https://files.pythonhosted.org/packages/87/a6/8d7d06f6b69236a3c1818157875ceb1259ba0d9df4194f4fe138ffdc0f41/matplotlib-3.3.2-cp37-cp37m-manylinux1_x86_64.whl (11.6MB)
Collecting Cython
  Downloading https://files.pythonhosted.org/packages/6b/36/d6c18632a339dafa54fd128b0dd2c36c6dc4bc86b8e0d366ccd9f22b480a/Cython-0.29.21-cp37-cp37m-manylinux1_x86_64.whl (2.0MB)
Collecting tf-slim
  Downloading https://files.pythonhosted.org/packages/02/97/b0f4a64df018ca018cc035d44f2ef08f91e2e8aa67271f6f19633a015ff7/tf_slim-1.1.0-py2.py3-none-any.whl (352kB)
Collecting pycocotools
  Downloading https://fi

  Downloading https://files.pythonhosted.org/packages/c5/1f/ec86d2a5c48ac6490d4471b297885603cf0e8da89d5ffbf0bce6e57f4d64/importlib_resources-3.3.0-py2.py3-none-any.whl
Collecting promise
  Downloading https://files.pythonhosted.org/packages/cf/9c/fb5d48abfe5d791cd496e4242ebcf87a4bb2e0c3dcd6e0ae68c11426a528/promise-2.3.tar.gz
Collecting tensorflow-metadata
  Downloading https://files.pythonhosted.org/packages/72/c4/1ff6a8afaac19250780a82bc05907586f6c23f45a5983df8921040e2b04c/tensorflow_metadata-0.24.0-py3-none-any.whl (44kB)
Collecting attrs>=18.1.0
  Downloading https://files.pythonhosted.org/packages/14/df/479736ae1ef59842f512548bacefad1abed705e400212acba43f9b0fa556/attrs-20.2.0-py2.py3-none-any.whl (48kB)
Collecting google-crc32c<2.0dev,>=1.0; python_version >= "3.5"
  Downloading https://files.pythonhosted.org/packages/d4/16/22b8eac72954d853a0541651f7634067d6ccdbe26940b0b6e79eb6b182a0/google_crc32c-1.0.0-cp37-cp37m-manylinux2010_x86_64.whl
Collecting googleapis-common-protos<2.0dev,

Removing intermediate container 752509ac6172
 ---> 33678cafc065
Step 18/26 : WORKDIR /
 ---> Running in 4ebd28f0b089
Removing intermediate container 4ebd28f0b089
 ---> 2614ff679fd7
Step 19/26 : USER root
 ---> Running in fce9da0b2485
Removing intermediate container fce9da0b2485
 ---> 905d604b2603
Step 20/26 : RUN mkdir -p $HOME/.cache
 ---> Running in 087dbc689125
Removing intermediate container 087dbc689125
 ---> 270b1ffa80e0
Step 21/26 : WORKDIR /
 ---> Running in c0305a715624
Removing intermediate container c0305a715624
 ---> aa2162b6b1dc
Step 22/26 : COPY azureml-environment-setup/99brokenproxy /etc/apt/apt.conf.d/
 ---> 7f488fcf7a14
Step 23/26 : COPY azureml-environment-setup/spark_cache.py azureml-environment-setup/log4j.properties /azureml-environment-setup/
 ---> bd50a1ef8ada
Step 24/26 : RUN if [ $SPARK_HOME ]; then /bin/bash -c '$SPARK_HOME/bin/spark-submit  /azureml-environment-setup/spark_cache.py'; fi
 ---> Running in edae68e82903
Removing intermediate container edae68e829

ee06e59a4314: Pull complete
b0ac690bb2f2: Pull complete
5f11ace0944e: Pull complete
d4e18c29e104: Pull complete
09ba06861fca: Pull complete
83213e0b5131: Pull complete
7b0feb8d64db: Pull complete
7b0b507bb1e9: Pull complete
94e0f0e2d7da: Pull complete
45b4bc5400c1: Pull complete
30f9a1497794: Pull complete
9a7ac860589c: Pull complete
0f38927fa8e0: Pull complete
1d7ed14a884d: Pull complete
c01c1fe8898f: Pull complete
81dc248a70ed: Pull complete
cf550e0b0dc2: Pull complete
cf23e4a43291: Pull complete
71780e028e5f: Pull complete
2c16f1517937: Pull complete
142c5579ef76: Pull complete
feed4966b399: Pull complete
80d2d9c434de: Pull complete
36cd24390d80: Pull complete
6e6ad0ed3b63: Pull complete
Digest: sha256:fdbdb69fdfa131fbb95dd0d701ab5c2627908fc8fd5dc081e26f02d584a1e90a
Status: Downloaded newer image for 4974f70cd2934b4299204f2bf3475cda.azurecr.io/azureml/azureml_ac7d441ea8eec91eff9e12f161d95946:latest
c1697d9da5f2ca2d548d3589866b7cfbcdc65d64d1a3678a27b90e029aadacca
2020/11/02 10:40:40 


Streaming azureml-logs/75_job_post-tvmps_83e953c61b6c30dfa0876ed2be63bdd023297f0007de806b2d24eb05dbbf971c_d.txt

Entering job release. Current time:2020-11-02T10:41:48.596568
Starting job release. Current time:2020-11-02T10:41:49.338818
Logging experiment finalizing status in history service.
[2020-11-02T10:41:49.339436] job release stage : upload_datastore starting...
Starting the daemon thread to refresh tokens in background for process with pid = 924
[{}] job release stage : start importing azureml.history._tracking in run_history_release.
[2020-11-02T10:41:49.341756] job release stage : copy_batchai_cached_logs starting...[2020-11-02T10:41:49.341794] job release stage : execute_job_release starting...
[2020-11-02T10:41:49.341924] job release stage : copy_batchai_cached_logs completed...

[2020-11-02T10:41:49.342377] Entering context manager injector.
[2020-11-02T10:41:49.349479] job release stage : upload_datastore completed...
[2020-11-02T10:41:49.553361] job release stage : send

{'runId': 'model-frcnn-v-1-0_1604312174_b7a3065b',
 'target': 'gpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2020-11-02T10:37:49.423019Z',
 'endTimeUtc': '2020-11-02T10:42:00.573015Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': 'f67dc28a-b753-4683-b8da-f70e73c644db',
  'azureml.git.repository_uri': 'git@github.com:TheOceanCleanup/AIDataPipeLine.git',
  'mlflow.source.git.repoURL': 'git@github.com:TheOceanCleanup/AIDataPipeLine.git',
  'azureml.git.branch': 'main',
  'mlflow.source.git.branch': 'main',
  'azureml.git.commit': '7b1ff496025f18b292f735b470fb28827a3bb878',
  'mlflow.source.git.commit': '7b1ff496025f18b292f735b470fb28827a3bb878',
  'azureml.git.dirty': 'True',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [{'dataset': {'id': '95c95241-ca72-45cd-aeb6-106bb9c2fd40'}, 'consumptionDetails': {'type': 'RunInput', 'inputName': 'checkpoint', 'mecha