# How to train a model on Azure ML

This notebook takes you through the steps of training a model on Azure ML for The Ocean Cleanup. We train the models through Azure ML so that every training run is recorded, which lets us trace why and how a model was created.

When the result of a training run is satisfactory, we can register the model from that run and then deploy it.

There are a few concepts to know about first:

- Workspace: The entire Azure ML environment you are working in. The Workspace contains all the other elements.
- Experiment: A collection of Runs (see below). A logical container for training a model with different parameters to determine the best combination.
- Run: A single train/test run of a model. Each Run is tied to an Experiment. If you train the same model with different parameters in order to compare them, these are different Runs under the same Experiment.
- Environment: The software environment your code runs in, including the required Python packages. Options range from your local environment to fully curated environments provided by Azure.
- Dataset: A single dataset as registered in the Azure ML workspace.

With that out of the way, let's dive right in. Looking at these components, our first step is to get the correct Workspace:

In [2]:
from toc_azurewrapper.workspace import get_workspace

subscription_id = "a00eaec6-b320-4e7c-ae61-60a30aec1cfc"
resource_group = "MachineLearning"
workspace_name = "tutorial_azureml_mats"
tenant_id = "86f9fea7-9eb0-4325-8b58-7ed0db623956"

workspace = get_workspace(subscription_id, resource_group, workspace_name, tenant_id=tenant_id)

## Create experiment

Now that we have a workspace available, we need to create an experiment. As described above, an experiment is the container for multiple runs, in which we can train and compare the model using different parameters.

The experiment needs a name. Use something that is descriptive and clear to anyone seeing this.

In [3]:
from toc_azurewrapper.train import create_experiment
experiment = create_experiment(workspace, "model-tf2od-jayke")

## Create or select compute target

We want to train our model on a GPU cluster on Azure ML. Let's create one (or load an existing one).

In [4]:
from toc_azurewrapper.compute import get_compute

compute_target = get_compute(workspace, "gpu-cluster-ai", vm_size='STANDARD_NC6', max_nodes=1)
# compute_target = get_compute(workspace, "gpu-cluster-v100", vm_size='STANDARD_NC6S_V3', max_nodes=1)
# compute_target = get_compute(workspace, "gpu-cluster", vm_size='STANDARD_NC6', max_nodes=4)

Found existing compute target
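
The `get_compute` helper comes from the project's own `toc_azurewrapper` package, so its internals are not shown in this notebook. As a rough sketch of the get-or-create pattern it presumably follows, the Azure ML SDK (v1) calls below look up a cluster by name and only provision a new one when the lookup fails. The function name `get_or_create_compute` and its defaults are hypothetical, and the SDK is imported inside the function so that the definition itself does not need the SDK installed.

```python
def get_or_create_compute(workspace, name, vm_size="STANDARD_NC6", max_nodes=1):
    """Return the compute target called `name`, creating it if it does not exist.

    Hypothetical helper for illustration; not the toc_azurewrapper implementation.
    """
    # Imported lazily so that defining this function does not require the SDK.
    from azureml.core.compute import AmlCompute, ComputeTarget
    from azureml.core.compute_target import ComputeTargetException

    try:
        # Raises ComputeTargetException when no target with this name exists.
        target = ComputeTarget(workspace=workspace, name=name)
        print("Found existing compute target")
    except ComputeTargetException:
        # Describe the cluster we want and ask Azure ML to provision it.
        config = AmlCompute.provisioning_configuration(
            vm_size=vm_size, max_nodes=max_nodes
        )
        target = ComputeTarget.create(workspace, name, config)
        target.wait_for_completion(show_output=True)
    return target
```

Calling it as `get_or_create_compute(workspace, "gpu-cluster-ai")` would mirror the cell above, under the assumption that the wrapper behaves the same way.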


## Create the environment

Next, we need to create an environment. In this case, we build our own from a Dockerfile.

In [5]:
from toc_azurewrapper.environment import get_environment


environment = get_environment(
    workspace,
    "tensorflow-objectdetection",
    docker_file="./examples/tensorflow/Dockerfile"
)

No environment with that name found, creating new one


## Prepare model wrapper

Now it's time to perform our first Run of the experiment. However, before we can do this, we will need a wrapper around our model. This wrapper needs to do a few things:

- Initialize and train the model with:
  - The desired parameters
  - The desired data
- Evaluate the performance of the trained model
- Register the parameters and the performance in the Run object
- Add the generated model artifacts to the Run object

Skeleton code for this is available in `skeleton_files/train.py`. In that file you declare the parameters you expect, create, train and evaluate the model using those parameters and the loaded dataset(s), and register the results and the generated artifacts with the Run.

For this how-to, we will use the example provided in `examples/frcnn/train.py`. This is an implementation of the file mentioned above. It expects two parameters: `num_train_steps` and `sample_1_of_n_eval_examples`.
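
To make the wrapper's responsibilities concrete, here is a minimal sketch of what such a training script could look like. It is not the content of `skeleton_files/train.py` or `examples/frcnn/train.py`: the function names and the placeholder metric are hypothetical, and only the two parameter names come from the text above. It relies on `Run.get_context()` and `Run.log()` from the Azure ML SDK (v1), and on the SDK's behaviour of uploading files written to `./outputs` to the run.

```python
import argparse


def parse_args(argv=None):
    """Parse the two parameters the example script expects."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_train_steps", type=int, default=1000)
    parser.add_argument("--sample_1_of_n_eval_examples", type=int, default=1)
    # Ignore arguments this sketch does not model (datasets, model name, ...).
    args, _unknown = parser.parse_known_args(argv)
    return args


def train_and_evaluate(args):
    """Hypothetical placeholder for building, training and evaluating the model."""
    # Real code would train for args.num_train_steps steps, evaluate the model,
    # and write the model artifacts to the ./outputs folder.
    return {"example_metric": 0.0}  # placeholder value, not a real result


def record_results(run, args, metrics):
    """Register the parameters and the performance on the Run object."""
    run.log("num_train_steps", args.num_train_steps)
    run.log("sample_1_of_n_eval_examples", args.sample_1_of_n_eval_examples)
    for name, value in metrics.items():
        run.log(name, value)


def main(argv=None):
    # Imported here so the rest of the sketch works without the SDK installed.
    from azureml.core import Run

    run = Run.get_context()  # the Run this script is executing in
    args = parse_args(argv)
    metrics = train_and_evaluate(args)
    record_results(run, args, metrics)
```

Keeping the argument parsing and the logging in separate functions makes each piece easy to try out locally before submitting a run.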

## Run the experiment

Now we need to create and run the experiment. First, we fetch the desired datasets and combine them into training and test sets. Note that we can provide multiple sets for both training and testing, and that each set consists of both a label dataset and an image dataset.

In [6]:
# Load datasets

from azureml.core import Dataset

train_images = Dataset.get_by_name(workspace, name="Grogol_train_images")
train_labels = Dataset.get_by_name(workspace, name="Grogol_train_labels")
test_images = Dataset.get_by_name(workspace, name="Grogol_test_images")
test_labels = Dataset.get_by_name(workspace, name="Grogol_test_labels")

trainsets = [
    (train_labels, train_images),
]
testsets = [
    (test_labels, test_images)
]

checkpoint_files = Dataset.get_by_name(workspace, name="tf2od_checkpoints")

We now have everything we need to perform the run on the compute target. Let's do so!

In [6]:
#checkpoint_files = Dataset.get_by_name(workspace, name="EfficientDetD0")

In [13]:
import os
# Models that cannot be trained because even a batch size of 1 does not fit into memory
impossible_to_train_models = [
    'efficientdet_d5', 
    'efficientdet_d6',
    'efficientdet_d7',
    'faster_rcnn_inception_resnet_v2_1024x1024',
    'faster_rcnn_resnet101_v1_800x1333',
    'faster_rcnn_resnet152_v1_800x1333',
    'faster_rcnn_resnet152_v1_1024x1024'    
]


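config_dir = 'examples/tensorflow/tf2od/configs'
suffix = '.config'
# Model name = config file name without the '.config' suffix
model_names = sorted(
    f[:-len(suffix)] for f in os.listdir(config_dir)
    if f.endswith(suffix) and f[:-len(suffix)] not in impossible_to_train_models
)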
print(len(model_names))
model_names

31


['centernet_hg104_1024x1024',
 'centernet_hg104_512x512',
 'centernet_resnet101_v1_fpn_512x512',
 'centernet_resnet50_v1_fpn_512x512',
 'centernet_resnet50_v2_512x512',
 'efficientdet_d0',
 'efficientdet_d0_0',
 'efficientdet_d0_1',
 'efficientdet_d0_2',
 'efficientdet_d0_3',
 'efficientdet_d0_4',
 'efficientdet_d0_5',
 'efficientdet_d0_6',
 'efficientdet_d1',
 'efficientdet_d2',
 'efficientdet_d3',
 'efficientdet_d4',
 'faster_rcnn_inception_resnet_v2_640x640',
 'faster_rcnn_resnet101_v1_1024x1024',
 'faster_rcnn_resnet101_v1_640x640',
 'faster_rcnn_resnet152_v1_640x640',
 'faster_rcnn_resnet50_v1_1024x1024',
 'faster_rcnn_resnet50_v1_640x640',
 'faster_rcnn_resnet50_v1_640x640 copy',
 'faster_rcnn_resnet50_v1_800x1333',
 'ssd_resnet101_v1_fpn_1024x1024',
 'ssd_resnet101_v1_fpn_640x640',
 'ssd_resnet152_v1_fpn_1024x1024',
 'ssd_resnet152_v1_fpn_640x640',
 'ssd_resnet50_v1_fpn_1024x1024',
 'ssd_resnet50_v1_fpn_640x640']

In [14]:
from toc_azurewrapper.train import perform_run
import time

# Note, 1 step is 1 batch. With batch size of 1 and trainset of size 473, 1 epoch is 473 steps.

#model_name = 'efficientdet_d5'
model_name = 'faster_rcnn_resnet50_v1_640x640'
assert (model_name + '.config') in os.listdir('examples/tensorflow/tf2od/configs/')

# for config_num in range(0,1):
#     cuda_error_out_of_memory = True
#     while cuda_error_out_of_memory:

        
#         print(run._run_number)
#         tag = f"Initial tests {config_num}"
#         print(tag)
#         run.tag(tag)
#         display(run)
#         run.wait_for_completion()
        
#         time.sleep(10)
#         x = run.get_all_logs()
#         time.sleep(10)
#         with open([xx for xx in x if xx.endswith('70_driver_log.txt')][0], 'r') as f:
#             s = f.read()
#         f.close()
#         cuda_error_out_of_memory = ('CUDA_ERROR_OUT_OF_MEMORY' in s) or ('Resource exhausted: OOM when allocating tensor with shape' in s)
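
The note at the top of this cell relates steps to epochs: one step processes one batch, so the number of steps per epoch is the training set size divided by the batch size, rounded up. The short sketch below works through that arithmetic for the 473-example training set with a batch size of 1, and for the 25000 steps passed as `num_train_steps` in this notebook. The helper names are made up for illustration.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every training example once."""
    return math.ceil(num_examples / batch_size)


def epochs_for_steps(num_train_steps, num_examples, batch_size):
    """How many passes over the training set a given step count amounts to."""
    return num_train_steps / steps_per_epoch(num_examples, batch_size)


print(steps_per_epoch(473, 1))                     # 473
print(round(epochs_for_steps(25000, 473, 1), 1))   # 52.9
```

So with these settings, 25000 training steps correspond to roughly 53 passes over the training data.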


In [None]:
from toc_azurewrapper.train import perform_run
# Only needed for the commented-out distributed configuration below
from azureml.core.runconfig import TensorflowConfiguration

run = perform_run(
    experiment, 'train.py', 'examples/tensorflow',
    environment=environment,
    trainsets=trainsets, testsets=testsets,
    compute_target=compute_target,
    parameters={
        'model_name': model_name,
        'sample_1_of_n_eval_examples': 1,
        'num_train_steps': 25000,
        # Mount the checkpoint dataset on the compute target
        'checkpoint_files': checkpoint_files.as_named_input('checkpoint_files').as_mount(),
        # 'config_num': '_' + str(config_num)
    },
    # distributed_job_config=TensorflowConfiguration(worker_count=2, parameter_server_count=1)
)
# Block until the run finishes, streaming its logs into the notebook
run.wait_for_completion(show_output=True)

RunId: model-tf2od-jayke_1610629406_cefb7331
Web View: https://ml.azure.com/experiments/model-tf2od-jayke/runs/model-tf2od-jayke_1610629406_cefb7331?wsid=/subscriptions/a00eaec6-b320-4e7c-ae61-60a30aec1cfc/resourcegroups/MachineLearning/workspaces/tutorial_azureml_mats

Streaming azureml-logs/55_azureml-execution-tvmps_df64ad678115a5be4045276b2300ac5f8e1c56e7bcfe028be595a6f554b5bd34_d.txt

2021-01-14T13:03:44Z Starting output-watcher...
2021-01-14T13:03:44Z IsDedicatedCompute == True, won't poll for Low Pri Preemption
2021-01-14T13:03:45Z Executing 'Copy ACR Details file' on 10.0.0.5
2021-01-14T13:03:45Z Copy ACR Details file succeeded on 10.0.0.5. Output: 
>>>   
>>>   
Login Succeeded
Using default tag: latest
latest: Pulling from azureml/azureml_9d4fa30783fc98f2c7c7f19c6a312f30
Digest: sha256:96a223a2d683aab4b4f91719ba3f705a79883c430ca39e73845fd2ba36704f14
Status: Image is up to date for viennaglobal.azurecr.io/azureml/azureml_9d4fa30783fc98f2c7c7f19c6a312f30:latest
viennaglobal.azu


Streaming azureml-logs/70_driver_log.txt

2021/01/14 13:04:22 Attempt 1 of http call to http://10.0.0.5:16384/sendlogstoartifacts/info
2021/01/14 13:04:22 Attempt 1 of http call to http://10.0.0.5:16384/sendlogstoartifacts/status
[2021-01-14T13:04:24.632187] Entering context manager injector.
[context_manager_injector.py] Command line Options: Namespace(inject=['ProjectPythonPath:context_managers.ProjectPythonPath', 'Dataset:context_managers.Datasets', 'RunHistory:context_managers.RunHistory', 'TrackUserError:context_managers.TrackUserError'], invocation=['train.py', '--train_sets', '8bc66dc5-50c1-4ca5-8917-d8f06fcb1e30', 'DatasetConsumptionConfig:train_images_0', '--test_sets', '955d4e3b-50ec-4fcf-9e6e-e46477ed330f', 'DatasetConsumptionConfig:test_images_0', '--model_name', 'faster_rcnn_resnet50_v1_640x640', '--sample_1_of_n_eval_examples', '1', '--num_train_steps', '1000', '--checkpoint_files', 'DatasetConsumptionConfig:checkpoint_files'])
Script type = None
Starting the daemon thre

Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
W0114 13:05:11.112099 140111140951872 deprecation.py:323] From /opt/miniconda/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py:201: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
Instructions for updating:
Use `tf.cast` instead.
W0114 13:05:16.576279 140111140951872 deprecation.py:323] From /opt/miniconda/lib/python3.7/site-packages/object_detection/inputs.py:281: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
Instructions for updating:
Simply pass a True/False value to the `training` argument of the `__call__` method of your layer or model.
W0114 13:05:21.117694 140105557718784 deprecation.py:323] From /opt/min

Instructions for updating:
Use fn_output_signature instead
W0114 13:05:57.888493 140105566111488 deprecation.py:506] From /opt/miniconda/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py:574: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
INFO:tensorflow:Step 100 per-step time 0.355s loss=0.931
I0114 13:06:47.962735 140111140951872 model_lib_v2.py:652] Step 100 per-step time 0.355s loss=0.931
INFO:tensorflow:Step 200 per-step time 0.366s loss=0.612
I0114 13:07:24.597121 140111140951872 model_lib_v2.py:652] Step 200 per-step time 0.366s loss=0.612
INFO:tensorflow:Step 300 per-step time 0.361s loss=1.494
I0114 13:08:01.481145 140111140951872 model_lib_v2.py:652] Step 300 per-step time 0.361s loss=1.494
INFO:tensorflow:Step 400 per-step time 0.375s loss=0.125
I0114 13:08:38.548666 140111140951872 model_lib_v2.py:652] Step 400 per-step time 0.3

2021-01-14 13:13:05.377261: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-01-14 13:13:05.532728: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
INFO:tensorflow:Finished eval step 0
I0114 13:13:08.311958 140473687459648 model_lib_v2.py:805] Finished eval step 0
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, there are two
options available in V2.
- tf.py_function takes a python function which manipulates tf eager
tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
an ndarray (just call tensor.numpy()) but having access to eager tensors
means `tf.py_function`s can use accelerators such as GPUs as well as
being differentiable using a gradient tape.
- tf.numpy_function maintains the semantics of the deprecated tf.py_func
(it is not differentiable, and manipulates numpy arrays). It drops the
stateful argum
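
Larger models can fail with a GPU out-of-memory error, which only shows up in the driver log. The commented-out loop earlier in this notebook downloads the logs with `run.get_all_logs()` and searches `70_driver_log.txt` for two marker strings. A minimal sketch of that check, using the same two markers, could look as follows. The function names are hypothetical, and `log_paths` is assumed to be a list of local file paths such as the one `run.get_all_logs()` returns.

```python
# Marker strings taken from the commented-out retry loop in this notebook.
OOM_MARKERS = (
    "CUDA_ERROR_OUT_OF_MEMORY",
    "Resource exhausted: OOM when allocating tensor with shape",
)


def contains_oom_error(log_text):
    """Return True if the log text contains one of the known OOM markers."""
    return any(marker in log_text for marker in OOM_MARKERS)


def driver_log_ran_out_of_memory(log_paths):
    """Check the driver log among the downloaded log files for OOM errors."""
    driver_logs = [p for p in log_paths if p.endswith("70_driver_log.txt")]
    if not driver_logs:
        return False  # no driver log found, nothing to inspect
    with open(driver_logs[0], "r") as f:
        return contains_oom_error(f.read())
```

A caller could use the result to decide whether to resubmit the run with a smaller configuration, which is what the commented-out loop appears to have been built for.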

Finally, if the result is to our liking, we register the model so that we can use it for deployment. Provide a name for the model, the path to either a single artifact or the folder containing all required artifacts, and optionally a description, properties and tags.

In [None]:
run.register_model(
    "frcnn",
    model_path="outputs/",
    description="FRCNN implementation on Tensorflow + Object detection API",
    properties={
        "location": "there",
        "time": "noon"
    }
)

In [None]:
# Shell command: the leading "!" lets it run from a notebook cell
!python model_main_tf2.py --model_dir=models/my_EfficientDetD0/ --pipeline_config_path=models/my_EfficientDetD0/pipeline.config