# How to train a model on Azure ML

This notebook walks you through the steps of training a model on Azure ML for The Ocean Cleanup. We train models through Azure ML so that every training run is recorded, which lets us trace why and how a model was created.

When the result of a training run is satisfactory, the model can be registered directly from that run, after which it can be deployed.

There are a few concepts to know about first:

- Workspace: The entire Azure ML environment you are working in. The Workspace contains all the other elements.
- Experiment: A collection of Runs (see below). It is a logical container for training a model with different parameters to determine which set works best.
- Run: A single train/test run of a model. Runs are tied to an Experiment. If you want to train the same model with different parameters so you can compare the results, each training is a separate Run under the same Experiment.
- Environment: The code environment used by your code. This contains things like the required Python packages. Multiple options exist here, from just using your local environment to fully curated environments provided by Azure.
- Dataset: A single dataset as registered in the Azure ML workspace.
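
To make the relationship between an Experiment and its Runs concrete, the minimal sketch below enumerates parameter sets that would each become one Run under the same Experiment. It uses only the standard library. The helper name `parameter_grid` is hypothetical, and the parameter names are borrowed from the example used later in this notebook.

```python
from itertools import product


def parameter_grid(options):
    """Return one parameter dict per combination of the given option lists."""
    names = sorted(options)
    combos = product(*(options[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Each dict below would be submitted as a separate Run of one Experiment.
grid = parameter_grid({
    "num_train_steps": [5000, 10000],
    "sample_1_of_n_eval_examples": [1, 10],
})
for params in grid:
    print(params)
```

Comparing the metrics logged by these Runs in the Experiment view is then what tells you which parameter set to keep.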

With that out of the way, let's dive right in. Looking at these components, our first step is to get the correct Workspace:

In [1]:
from azurewrapper.workspace import get_workspace

subscription_id = "29d66431-a7ce-4709-93f7-3bdb01a243b3"
resource_group = "ExperimentationJayke"
workspace_name = "ExperimentationJayke"

workspace = get_workspace(subscription_id, resource_group, workspace_name)
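
Hard-coding the subscription ID in a notebook makes it easy to share it by accident or to point at the wrong workspace. As one possible alternative, the sketch below reads the three identifiers from environment variables. The variable names (`AML_SUBSCRIPTION_ID` and so on) and the helper `workspace_settings_from_env` are assumptions made for illustration; `azurewrapper` does not define them.

```python
import os


def workspace_settings_from_env(env=None):
    """Collect the three workspace identifiers from environment variables."""
    env = os.environ if env is None else env
    # Hypothetical variable names; pick whatever fits your own setup.
    keys = {
        "subscription_id": "AML_SUBSCRIPTION_ID",
        "resource_group": "AML_RESOURCE_GROUP",
        "workspace_name": "AML_WORKSPACE_NAME",
    }
    missing = [var for var in keys.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[var] for name, var in keys.items()}


# Example with an explicit mapping instead of the real environment.
settings = workspace_settings_from_env({
    "AML_SUBSCRIPTION_ID": "00000000-0000-0000-0000-000000000000",
    "AML_RESOURCE_GROUP": "my-resource-group",
    "AML_WORKSPACE_NAME": "my-workspace",
})
print(settings["workspace_name"])
```

The values could then be passed on as `get_workspace(settings["subscription_id"], settings["resource_group"], settings["workspace_name"])`.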

## Create experiment

Now that we have a workspace available, we need to create an experiment. As described above, an experiment is the container for multiple runs, in which we can train and compare the model using different parameters.

The experiment needs a name. Use something that is descriptive and clear to anyone seeing this.

In [2]:
from azurewrapper.train import create_experiment
experiment = create_experiment(workspace, "model-frcnn-v-1-0")
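
A consistent naming scheme keeps experiments easy to find. The sketch below builds names in the `model-frcnn-v-1-0` style used above and checks them against a naming rule. The helper `experiment_name` is hypothetical, and the rule itself (3 to 36 characters, starting with a letter or digit, containing only letters, digits, underscores, and dashes) is an assumption based on our reading of the Azure ML SDK documentation, so verify it against the SDK version you use.

```python
import re

# Assumed naming rule: 3-36 characters, starting with a letter or digit,
# containing only letters, digits, underscores, and dashes.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]{2,35}")


def experiment_name(model, major, minor):
    """Build a name such as 'model-frcnn-v-1-0' and check it against the rule."""
    name = f"model-{model}-v-{major}-{minor}"
    if not _NAME_PATTERN.fullmatch(name):
        raise ValueError(f"Invalid experiment name: {name!r}")
    return name


print(experiment_name("frcnn", 1, 0))
```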

## Create or select compute target

In [3]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
cluster_name = "gpu-cluster"

try:
    compute_target = ComputeTarget(workspace=workspace, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', 
                                                           max_nodes=4)

    compute_target = ComputeTarget.create(workspace, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

Found existing compute target


## Create the environment

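Next, we need an environment. The commented-out cell below shows how to load the curated TensorFlow 2.2 GPU environment. In this example we instead build a custom environment from a conda specification and a Dockerfile.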

In [4]:
from azureml.core import Environment
from azurewrapper.environment import get_environment
# environment = get_environment(
#     workspace,
#     "AzureML-TensorFlow-2.2-GPU"
# )

In [5]:
environment = Environment.from_conda_specification(
    "custom_tensorflow_object_detection",
    file_path='examples/frcnn/conda_dependencies.yml'
)
environment.docker.enabled = True


# Load from dockerfile
environment.docker.base_image = None
environment.docker.base_dockerfile = "./examples/frcnn/Dockerfile"


In [6]:

# tf_env = Environment.from_conda_specification(
#     name='tensorflow-2.2-gpu',
#     file_path='./examples/frcnn/conda_dependencies.yml'
# )

# # Specify a GPU base image
# tf_env.docker.enabled = True
# tf_env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04'
# environment = tf_env

## Prepare model wrapper

Now it's time to perform our first Run of the experiment. However, before we can do this, we will need a wrapper around our model. This wrapper needs to do a few things:

- Initialize and train the model with:
  - The desired parameters
  - The desired data
- Evaluate the performance of the trained model
- Register the parameters and the performance in the Run object
- Add the generated model artifacts to the Run object

Skeleton code for this is available in `skeleton_files/train.py`. In this file you declare the parameters you expect, then create, train, and evaluate the model using these parameters and the loaded dataset(s), and finally register the results and the created artifacts with the Run.

For this how-to, we will use the example provided in `examples/frcnn/train.py`, which is an implementation of that skeleton. It expects two parameters: `num_train_steps` and `sample_1_of_n_eval_examples`.
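
The overall shape of such a wrapper can be sketched as follows. This is not the content of `skeleton_files/train.py` or `examples/frcnn/train.py`; it is a minimal illustration under two assumptions: that the parameters arrive as `--name value` command-line arguments, and that results are recorded through the Run's `log(name, value)` method. `PrintingRun`, `train_and_evaluate`, and the metric names are hypothetical placeholders so the sketch runs without Azure ML.

```python
import argparse


def parse_args(argv=None):
    """Parse the two parameters that the example training script expects."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_train_steps", type=int, required=True)
    parser.add_argument("--sample_1_of_n_eval_examples", type=int, default=1)
    return parser.parse_args(argv)


def train_and_evaluate(num_train_steps, sample_1_of_n_eval_examples):
    """Placeholder for the real training and evaluation code."""
    # Hypothetical metric names and values, for illustration only.
    return {"steps_completed": num_train_steps, "example_metric": 0.0}


class PrintingRun:
    """Stand-in for an Azure ML Run; only mimics the log(name, value) call."""

    def __init__(self):
        self.logged = {}

    def log(self, name, value):
        self.logged[name] = value
        print(f"{name} = {value}")


def main(argv=None, run=None):
    args = parse_args(argv)
    # Inside Azure ML you would pass run=Run.get_context() instead.
    run = PrintingRun() if run is None else run
    # Register the parameters with the run.
    run.log("num_train_steps", args.num_train_steps)
    run.log("sample_1_of_n_eval_examples", args.sample_1_of_n_eval_examples)
    # Train, evaluate, and register the resulting metrics.
    metrics = train_and_evaluate(args.num_train_steps,
                                 args.sample_1_of_n_eval_examples)
    for name, value in metrics.items():
        run.log(name, value)
    return run


example_run = main(["--num_train_steps", "100"])
```

In a real script, the generated model artifacts would additionally be attached to the Run, for example with `run.upload_file(name, path)`.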

## Run the experiment

Now we need to create and run the experiment. First, we fetch the desired datasets and combine them into training and test sets. Note that we can provide multiple sets for both training and testing, and that each set consists of both a label dataset and an image dataset.

In [7]:
from azureml.core import Dataset

train_images = Dataset.get_by_name(workspace, name="campaign-26-10-2020_images")
train_labels = Dataset.get_by_name(workspace, name="campaign-26-10-2020_labels")
test_images = Dataset.get_by_name(workspace, name="campaign-22-10-2020_images")
test_labels = Dataset.get_by_name(workspace, name="campaign-22-10-2020_labels")
trainsets = [
    (train_labels, train_images),
    (test_labels, test_images)
]
testsets = [
    (test_labels, test_images)
]
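
Note that `trainsets` above also contains the 22-10-2020 campaign, which is the only entry in `testsets`. If that is not intentional, evaluation scores may look better than they would on unseen data. A quick way to spot such overlap is sketched below; the helper `overlapping_pairs` is hypothetical, and plain dataset names stand in for the dataset objects so the snippet runs on its own.

```python
def overlapping_pairs(trainsets, testsets):
    """Return the (labels, images) pairs that occur in both lists."""
    return [pair for pair in trainsets if pair in testsets]


# Dataset names stand in for the dataset objects used above.
train_example = [
    ("campaign-26-10-2020_labels", "campaign-26-10-2020_images"),
    ("campaign-22-10-2020_labels", "campaign-22-10-2020_images"),
]
test_example = [
    ("campaign-22-10-2020_labels", "campaign-22-10-2020_images"),
]
print(overlapping_pairs(train_example, test_example))
```

With real dataset objects, comparing by name is likely the more robust choice, since equality between two separately fetched objects is not guaranteed.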

We now have everything we need to submit the run to the compute target. Let's do so!

In [None]:
from azurewrapper.train import perform_run

# Checkpoint files, registered in the workspace as the dataset "FRCNN".
checkpoint_files = Dataset.get_by_name(workspace, name="FRCNN")

# Submit train.py from examples/frcnn to the compute target with the given parameters.
run = perform_run(experiment, 'train.py', 'examples/frcnn', environment=environment,
                  trainsets=trainsets, testsets=testsets, compute_target=compute_target,
                  parameters={
                      'num_train_steps': 10000,
                      'sample_1_of_n_eval_examples': 1,
                      # Mount the checkpoint dataset on the compute target.
                      'checkpoint_dataset': checkpoint_files.as_named_input('checkpoint').as_mount()
                  })
run.wait_for_completion(show_output=True)

RunId: model-frcnn-v-1-0_1603991311_a67d76b8
Web View: https://ml.azure.com/experiments/model-frcnn-v-1-0/runs/model-frcnn-v-1-0_1603991311_a67d76b8?wsid=/subscriptions/29d66431-a7ce-4709-93f7-3bdb01a243b3/resourcegroups/ExperimentationJayke/workspaces/ExperimentationJayke

Streaming azureml-logs/20_image_build_log.txt

2020/10/29 17:08:36 Downloading source code...
2020/10/29 17:08:38 Finished downloading source code
2020/10/29 17:08:38 Creating Docker network: acb_default_network, driver: 'bridge'
2020/10/29 17:08:39 Successfully set up Docker network: acb_default_network
2020/10/29 17:08:39 Setting up Docker configuration...
2020/10/29 17:08:39 Successfully set up Docker configuration
2020/10/29 17:08:39 Logging in to registry: 4974f70cd2934b4299204f2bf3475cda.azurecr.io
2020/10/29 17:08:41 Successfully logged into 4974f70cd2934b4299204f2bf3475cda.azurecr.io
2020/10/29 17:08:41 Executing step ID: acb_step_0. Timeout(sec): 5400, Working directory: '', Network: 'acb_default_network'

 ---> Running in d837ce996ea6
Removing intermediate container d837ce996ea6
 ---> b8cb85031200
Step 8/31 : RUN mkdir /install/proto
 ---> Running in 033eee736d52
Removing intermediate container 033eee736d52
 ---> 8625f555ff83
Step 9/31 : WORKDIR /install/proto
 ---> Running in e9ff89bbca3d
Removing intermediate container e9ff89bbca3d
 ---> 2a1df1fa4f3c
Step 10/31 : RUN wget https://github.com/protocolbuffers/protobuf/releases/download/v3.13.0/protoc-3.13.0-linux-x86_64.zip
 ---> Running in fcdadff333ce
--2020-10-29 17:10:46--  https://github.com/protocolbuffers/protobuf/releases/download/v3.13.0/protoc-3.13.0-linux-x86_64.zip
Resolving github.com (github.com)... 140.82.118.4
Connecting to github.com (github.com)|140.82.118.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/23357588/a595da00-de51-11ea-9242-968f4fc5b907?X-Amz-Algorithm=AWS4-HMAC-S

  Downloading https://files.pythonhosted.org/packages/bb/2b/f0204cd2a78665248ef03344cae156a5a5f5af72c9146628bcf5fac58c91/kiwisolver-1.3.0-cp37-cp37m-manylinux2010_x86_64.whl (1.6MB)
Collecting certifi>=2020.06.20
  Downloading https://files.pythonhosted.org/packages/5e/c4/6c4fe722df5343c33226f0b4e0bb042e4dc13483228b4718baf286f86d87/certifi-2020.6.20-py2.py3-none-any.whl (156kB)
Collecting cycler>=0.10
  Downloading https://files.pythonhosted.org/packages/f7/d2/e07d3ebb2bd7af696440ce7e754c59dd546ffe1bbe732c8ab68b9c834e61/cycler-0.10.0-py2.py3-none-any.whl
Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3
  Downloading https://files.pythonhosted.org/packages/8a/bb/488841f56197b13700afd5658fc279a2025a39e22449b7cf29864669b15d/pyparsing-2.4.7-py2.py3-none-any.whl (67kB)
Collecting absl-py>=0.2.2
  Downloading https://files.pythonhosted.org/packages/bc/58/0aa6fb779dc69cfc811df3398fcbeaeefbf18561b6e36b185df0782781cc/absl_py-0.11.0-py3-none-any.whl (127kB)
Collecting opencv-python>=4.1.0.25


    Uninstalling certifi-2019.9.11:
      Successfully uninstalled certifi-2019.9.11
  Found existing installation: requests 2.22.0
    Uninstalling requests-2.22.0:
      Successfully uninstalled requests-2.22.0

Successfully installed Cython-0.29.21 absl-py-0.11.0 apache-beam-2.25.0 astunparse-1.6.3 attrs-20.2.0 avro-python3-1.10.0 cachetools-4.1.1 certifi-2020.6.20 contextlib2-0.6.0.post1 crcmod-1.7 cycler-0.10.0 dataclasses-0.6 dill-0.3.1.1 dm-tree-0.1.5 docopt-0.6.2 fastavro-1.0.0.post1 future-0.18.2 gast-0.3.3 gin-config-0.3.0 google-api-core-1.23.0 google-api-python-client-1.12.5 google-auth-1.22.1 google-auth-httplib2-0.0.4 google-auth-oauthlib-0.4.2 google-cloud-bigquery-2.2.0 google-cloud-core-1.4.3 google-crc32c-1.0.0 google-pasta-0.2.0 google-resumable-media-1.1.0 googleapis-common-protos-1.52.0 grpcio-1.33.2 h5py-2.10.0 hdfs-2.5.8 httplib2-0.17.4 importlib-metadata-2.0.0 importlib-resources-3.3.0 kaggle-1.5.9 keras-preprocessing-1.1.2 kiwisolver-1.3.0 lvis-0.5.3 lxml-4.6.1

Removing intermediate container 8b6ff74e5120
 ---> 5098eab160bc
Step 25/31 : ENV PATH /azureml-envs/azureml_82c327390cbd63886e2f791b598dd9e9/bin:$PATH
 ---> Running in 4bee347e95ab
Removing intermediate container 4bee347e95ab
 ---> 145ed2978fbd
Step 26/31 : ENV AZUREML_CONDA_ENVIRONMENT_PATH /azureml-envs/azureml_82c327390cbd63886e2f791b598dd9e9
 ---> Running in ae03bbda2965
Removing intermediate container ae03bbda2965
 ---> cc6a54c543ca
Step 27/31 : ENV LD_LIBRARY_PATH /azureml-envs/azureml_82c327390cbd63886e2f791b598dd9e9/lib:$LD_LIBRARY_PATH
 ---> Running in c5795e33371b
Removing intermediate container c5795e33371b
 ---> b715005275d6
Step 28/31 : COPY azureml-environment-setup/spark_cache.py azureml-environment-setup/log4j.properties /azureml-environment-setup/
 ---> 048b1009e849
Step 29/31 : RUN if [ $SPARK_HOME ]; then /bin/bash -c '$SPARK_HOME/bin/spark-submit  /azureml-environment-setup/spark_cache.py'; fi
 ---> Running in c5788dd95ed0
Removing intermediate container c5788dd95ed