# How to train a model on Azure ML

This notebook walks through the steps of training a model on Azure ML for The Ocean Cleanup. We train models through Azure ML so that every training run is recorded, which lets us trace why and how a model was created.

When a training run gives a satisfactory result, the model can be registered directly from that run and then deployed.

There are a few concepts to know about first:

- Workspace: The entire Azure ML environment you are working in. The Workspace contains all the other elements.
- Experiment: A collection of Runs (see below). It is a logical container for training a model with different parameters to determine the best configuration.
- Run: A single train/test run of a model. Each Run belongs to an Experiment. To compare the same model trained with different parameters, create a separate Run for each parameter set under the same Experiment.
- Environment: The software environment your code runs in, including things like the required Python packages. Options range from your local environment to fully curated environments provided by Azure.
- Dataset: A single dataset as registered in the Azure ML workspace.
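
To make the Experiment/Run relationship concrete, here is a minimal sketch of how runs under one experiment could be compared. The helper `best_run` and the metric name `"mAP"` are hypothetical and not part of `azurewrapper`; the sketch only assumes that each run object offers `get_metrics()` and an `id`, as `azureml.core.Run` does.

```python
def best_run(runs, metric, higher_is_better=True):
    """Return the run with the best value for `metric`, or None if no run logged it.

    `runs` can be any iterable of objects with a `get_metrics()` method that
    returns a dict, for example the result of `experiment.get_runs()`.
    """
    scored = []
    for run in runs:
        metrics = run.get_metrics()  # fetch once per run
        if metric in metrics:
            scored.append((metrics[metric], run))
    if not scored:
        return None
    pick = max if higher_is_better else min
    return pick(scored, key=lambda pair: pair[0])[1]


# Possible usage once an experiment holds several finished runs
# ("mAP" is an assumed metric name, use whatever your script logs):
# best = best_run(experiment.get_runs(), "mAP")
# print(best.id)
```

Runs that never logged the metric are skipped rather than treated as errors, so failed or still-running runs do not break the comparison.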

With that out of the way, let's dive right in. Looking at these components, our first step will be to get the correct Workspace:

In [1]:
from azurewrapper.workspace import get_workspace

subscription_id = "29d66431-a7ce-4709-93f7-3bdb01a243b3"
resource_group = "ExperimentationJayke"
workspace_name = "ExperimentationJayke"

workspace = get_workspace(subscription_id, resource_group, workspace_name)

## Create experiment

Now that we have a workspace available, we need to create an experiment. As described above, an experiment is the container for multiple runs, in which we can train and compare the model using different parameters.

The experiment needs a name. Use a name that is descriptive and clear to anyone who sees it.
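
One way to keep names consistent is to derive them from the model name and version. The helper below is a hypothetical sketch, not part of `azurewrapper`. The naming rule it enforces (3 to 36 characters, starting with a letter or digit, using only letters, digits, underscores, and dashes) is an assumption about what Azure ML accepts, so check it against the SDK version you use.

```python
import re

# Assumed naming rule: 3-36 characters, starting with a letter or digit,
# containing only letters, digits, underscores and dashes.
_VALID_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_-]{2,35}$")


def experiment_name(model: str, version: str) -> str:
    """Build a name such as 'model-frcnn-v-1-0' from a model name and version."""
    raw = f"model-{model}-v-{version}".lower()
    # Replace every character that is not allowed (for example '.') with a dash.
    name = re.sub(r"[^a-z0-9_-]", "-", raw)
    if not _VALID_NAME.match(name):
        raise ValueError(f"Invalid experiment name: {name!r}")
    return name


# experiment_name("frcnn", "1.0") gives "model-frcnn-v-1-0",
# the name used in the cell below.
```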

In [2]:
from azurewrapper.train import create_experiment
experiment = create_experiment(workspace, "model-frcnn-v-1-0")

## Create or select compute target

In [3]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "gpu-cluster"

try:
    # Reuse the cluster if it already exists in the workspace.
    compute_target = ComputeTarget(workspace=workspace, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    # Otherwise provision a new GPU cluster that scales up to 4 nodes.
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',
                                                           max_nodes=4)

    compute_target = ComputeTarget.create(workspace, cluster_name, compute_config)

    # Block until provisioning finishes or the 20-minute timeout is reached.
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

Found existing compute target


## Create the environment

We now need to create an environment. The curated TensorFlow 2.2 GPU environment is one option (shown commented out below); in this case, we instead build a custom environment from a Dockerfile.

In [4]:
from azureml.core import Environment
from azurewrapper.environment import get_environment
# environment = get_environment(
#     workspace,
#     "AzureML-TensorFlow-2.2-GPU"
# )

In [5]:
environment = Environment("custom_tensorflow_object_detection")
environment.docker.enabled = True

# Build the image from our own Dockerfile instead of a base image.
# base_dockerfile is given as a path here, so the Dockerfile is loaded from that file.
environment.docker.base_image = None
environment.docker.base_dockerfile = "./examples/frcnn/Dockerfile"


In [6]:

# tf_env = Environment.from_conda_specification(
#     name='tensorflow-2.2-gpu',
#     file_path='./examples/frcnn/conda_dependencies.yml'
# )

# # Specify a GPU base image
# tf_env.docker.enabled = True
# tf_env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04'
# environment = tf_env

## Prepare model wrapper

Now it's time to perform our first Run of the experiment. However, before we can do this, we will need a wrapper around our model. This wrapper needs to do a few things:

- Initialize and train the model with:
  - The desired parameters
  - The desired data
- Evaluate the performance of the trained model
- Register the parameters and the performance in the Run object
- Add the generated model artifacts to the Run object

Skeleton code for this is available in `skeleton_files/train.py`. In this file you declare the parameters you expect, create, train, and evaluate the model using these parameters and the loaded dataset(s), and register the results and the created artifacts with the Run.

For this how-to, we will use the example provided in `examples/frcnn/train.py`. This is an implementation of the file mentioned above. It expects two parameters: `num_train_steps` and `sample_1_of_n_eval_examples`.
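
As a rough illustration of what such a wrapper does, here is a minimal sketch. It is not the content of `skeleton_files/train.py` or `examples/frcnn/train.py`. It assumes the parameters arrive as `--name value` command-line arguments, and it takes the training function and the run object as arguments so that it can be tried without Azure. `LocalRun` and `main` are hypothetical names; the only Azure ML behaviour relied on is that a run offers `log(name, value)`.

```python
import argparse


def parse_args(argv=None):
    """Parse the two parameters this how-to passes to the training script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_train_steps", type=int, default=10000)
    parser.add_argument("--sample_1_of_n_eval_examples", type=int, default=1)
    return parser.parse_args(argv)


class LocalRun:
    """Stand-in for azureml.core.Run so the sketch works without Azure."""

    def __init__(self):
        self.logged = {}

    def log(self, name, value):
        self.logged[name] = value


def main(run, train_and_evaluate, argv=None):
    """Train, evaluate, and register parameters and metrics with `run`.

    `train_and_evaluate(num_train_steps, sample_1_of_n_eval_examples)` is
    supplied by the caller and must return a dict of metric name -> value.
    """
    args = parse_args(argv)
    # Register the parameters so runs can be compared later.
    run.log("num_train_steps", args.num_train_steps)
    run.log("sample_1_of_n_eval_examples", args.sample_1_of_n_eval_examples)
    # Train and evaluate the model, then register the performance.
    metrics = train_and_evaluate(args.num_train_steps,
                                 args.sample_1_of_n_eval_examples)
    for name, value in metrics.items():
        run.log(name, value)
    # A real script would also attach the model artifacts here, for example
    # with run.upload_file(name, path_or_stream) on an azureml.core.Run.
    return metrics


# Inside an Azure ML job the run object would come from the SDK:
# from azureml.core import Run
# main(Run.get_context(), my_train_function)
```

Passing the run in as an argument keeps the training logic testable locally, which is a design choice of this sketch rather than something the skeleton file is known to do.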

## Run the experiment

Now we need to create and run the experiment. First, we fetch the desired datasets and combine them into training and test sets. Note that we can provide multiple sets for both training and testing. Also note that each set consists of both a label dataset and an image dataset.

In [7]:
from azureml.core import Dataset

train_images = Dataset.get_by_name(workspace, name="campaign-26-10-2020_images")
train_labels = Dataset.get_by_name(workspace, name="campaign-26-10-2020_labels")
test_images = Dataset.get_by_name(workspace, name="campaign-22-10-2020_images")
test_labels = Dataset.get_by_name(workspace, name="campaign-22-10-2020_labels")
trainsets = [
    (train_labels, train_images),
    (test_labels, test_images)
]
testsets = [
    (test_labels, test_images)
]
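
Because each set is a (labels, images) pair whose dataset names share a campaign prefix, the names can be derived from the campaign name. The sketch below uses two hypothetical helpers that are not part of `azurewrapper` or the Azure ML SDK. The second one is an optional sanity check for campaigns that appear in both the training and the test list; the cell above lists the 22-10-2020 campaign in both, so the check would report it.

```python
def campaign_dataset_names(campaign):
    """Return the (labels, images) dataset names for one campaign."""
    return (f"{campaign}_labels", f"{campaign}_images")


def overlapping_campaigns(train_campaigns, test_campaigns):
    """Return campaigns that appear in both lists, sorted for stable output."""
    return sorted(set(train_campaigns) & set(test_campaigns))


# With a workspace available, a pair could then be fetched like this:
# labels_name, images_name = campaign_dataset_names("campaign-26-10-2020")
# pair = (Dataset.get_by_name(workspace, name=labels_name),
#         Dataset.get_by_name(workspace, name=images_name))
```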

We now have everything we need to perform the run on the compute target. Let's do so!

In [None]:
from azurewrapper.train import perform_run

# Checkpoint files for the FRCNN model, mounted into the run as a named input.
checkpoint_files = Dataset.get_by_name(workspace, name="FRCNN")

# perform_run(experiment, script, source_directory, environment=None,
#             compute_target=None, datasets=[], parameters={})
# run = perform_run(experiment, 'train.py', 'examples/example_model', environment=environment,
#                   datasets=datasets, parameters={'param_a': 30, 'param_b': 12.0})
run = perform_run(experiment, 'train.py', 'examples/frcnn', environment=environment,
                  trainsets=trainsets, testsets=testsets, compute_target=compute_target,
                  parameters={
                      'num_train_steps': 10000,
                      'sample_1_of_n_eval_examples': 1,
                      'checkpoint_dataset': checkpoint_files.as_named_input('checkpoint').as_mount()
                  })
# Stream the logs and block until the run has finished.
run.wait_for_completion(show_output=True)

RunId: model-frcnn-v-1-0_1603988066_042bef5e
Web View: https://ml.azure.com/experiments/model-frcnn-v-1-0/runs/model-frcnn-v-1-0_1603988066_042bef5e?wsid=/subscriptions/29d66431-a7ce-4709-93f7-3bdb01a243b3/resourcegroups/ExperimentationJayke/workspaces/ExperimentationJayke

Streaming azureml-logs/20_image_build_log.txt

2020/10/29 16:14:34 Downloading source code...
2020/10/29 16:14:35 Finished downloading source code
2020/10/29 16:14:36 Creating Docker network: acb_default_network, driver: 'bridge'
2020/10/29 16:14:36 Successfully set up Docker network: acb_default_network
2020/10/29 16:14:36 Setting up Docker configuration...
2020/10/29 16:14:37 Successfully set up Docker configuration
2020/10/29 16:14:37 Logging in to registry: 4974f70cd2934b4299204f2bf3475cda.azurecr.io
2020/10/29 16:14:39 Successfully logged into 4974f70cd2934b4299204f2bf3475cda.azurecr.io
2020/10/29 16:14:39 Executing step ID: acb_step_0. Timeout(sec): 5400, Working directory: '', Network: 'acb_default_network'
2

Connecting to github.com (github.com)|140.82.118.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/23357588/a595da00-de51-11ea-9242-968f4fc5b907?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20201029%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20201029T161653Z&X-Amz-Expires=300&X-Amz-Signature=1a481faf1a3dbc1363672b34069c4c8f141de269a5d9e93ca6b6e011b013b49f&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=23357588&response-content-disposition=attachment%3B%20filename%3Dprotoc-3.13.0-linux-x86_64.zip&response-content-type=application%2Foctet-stream [following]
--2020-10-29 16:16:53--  https://github-production-release-asset-2e65be.s3.amazonaws.com/23357588/a595da00-de51-11ea-9242-968f4fc5b907?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20201029%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20201029T1616

  Downloading https://files.pythonhosted.org/packages/64/c2/b80047c7ac2478f9501676c988a5411ed5572f35d1beff9cae07d321512c/PyYAML-5.3.1.tar.gz (269kB)
Collecting tensorflow-datasets
  Downloading https://files.pythonhosted.org/packages/e8/3c/9563b456dedef4a5d9652e50544bc4930b09f24c3d6dee533f89c0f3189b/tensorflow_datasets-4.0.1-py3-none-any.whl (3.5MB)
Collecting dataclasses
  Downloading https://files.pythonhosted.org/packages/26/2f/1095cdc2868052dd1e64520f7c0d5c8c550ad297e944e641dbf1ffbb9a5d/dataclasses-0.6-py3-none-any.whl
Collecting docopt
  Downloading https://files.pythonhosted.org/packages/a2/55/8f8cab2afd404cf578136ef2cc5dfb50baa1761b68c9da1fb1e4eed343c9/docopt-0.6.2.tar.gz
Collecting pbr>=0.11
  Downloading https://files.pythonhosted.org/packages/fb/48/69046506f6ac61c1eaa9a0d42d22d54673b69e176d30ca98e3f61513e980/pbr-5.5.1-py2.py3-none-any.whl (106kB)
Collecting pyasn1-modules>=0.0.5
  Downloading https://files.pythonhosted.org/packages/95/de/214830a981892a3e286c3794f41ae67a4495df

Removing intermediate container 0639ea0825c2
 ---> 6358f45069be
Step 17/32 : RUN python object_detection/builders/model_builder_tf2_test.py
 ---> Running in 7ce60440a50a
2020-10-29 16:19:56.408775: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Running tests under Python 3.7.4: /opt/miniconda/bin/python
[ RUN      ] ModelBuilderTF2Test.test_create_center_net_model
2020-10-29 16:19:59.496824: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
2020-10-29 16:19:59.496901: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2020-10-29 16:19:59.496954: I tensorflow/stream_executor/cuda/cu

I1029 16:20:20.828056 140031828502336 efficientnet_model.py:148] round_filter input=112 output=136
I1029 16:20:20.828417 140031828502336 efficientnet_model.py:148] round_filter input=192 output=232
I1029 16:20:21.842417 140031828502336 efficientnet_model.py:148] round_filter input=192 output=232
I1029 16:20:21.842665 140031828502336 efficientnet_model.py:148] round_filter input=320 output=384
I1029 16:20:22.227562 140031828502336 efficientnet_model.py:148] round_filter input=1280 output=1536
I1029 16:20:22.329020 140031828502336 efficientnet_model.py:462] Building model efficientnet with params ModelConfig(width_coefficient=1.2, depth_coefficient=1.4, resolution=300, dropout_rate=0.3, blocks=(BlockConfig(input_filters=32, output_filters=16, kernel_size=3, num_repeat=1, expand_ratio=1, strides=(1, 1), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=16, output_filters=24, kernel_size=3, num_repe

 ---> Running in 70dfe752f1f1
Removing intermediate container 70dfe752f1f1
 ---> 5992b0e38300
Step 24/32 : COPY azureml-environment-setup/mutated_conda_dependencies.yml azureml-environment-setup/mutated_conda_dependencies.yml
 ---> 89ce827fee81
Step 25/32 : RUN ldconfig /usr/local/cuda/lib64/stubs && conda env create -p /azureml-envs/azureml_da3e97fcb51801118b8e80207f3e01ad -f azureml-environment-setup/mutated_conda_dependencies.yml && rm -rf "$HOME/.cache/pip" && conda clean -aqy && CONDA_ROOT_DIR=$(conda info --root) && rm -rf "$CONDA_ROOT_DIR/pkgs" && find "$CONDA_ROOT_DIR" -type d -name __pycache__ -exec rm -rf {} + && ldconfig
 ---> Running in d39d7aafe7ad
Collecting package metadata (repodata.json): ...working... 
done
Solving environment: ...working... done

Downloading and Extracting Packages

tk-8.6.10            | 3.2 MB    |            |   0% 
tk-8.6.10            | 3.2 MB    |            |   0% 
tk-8.6.10            | 3.2 MB    | #########4 |  94% 
tk-8.6.10            | 3.

239f2986ab6a: Pushed
db91aca116f0: Pushed
492031137608: Pushed
2518a9a5c5fd: Pushed
44e2d0d1aa02: Pushed
15aff59a9f72: Pushed
9327904150fd: Pushed
4ca8d9f9eed1: Pushed
4ec277bdf61d: Pushed
7b9520c9f3d7: Pushed
7d85c7820671: Pushed
7a694df0ad6c: Pushed
3fd9df553184: Pushed
7bbcc92dda8d: Pushed
805802706667: Pushed
7aa13e53a32f: Pushed
e3aa3988e482: Pushed