# Distributed PyTorch Image Classification

**Learning Objectives** - By the end of this tutorial you should be able to use Azure Machine Learning (AzureML) to:
- quickly implement basic commands for data preparation
- assemble a pipeline with custom data preparation (python) scripts
- test and run a multi-node multi-gpu pytorch job
- use mlflow to analyze your metrics

**Requirements** - In order to benefit from this tutorial, you need:
- to have provisioned an AzureML workspace
- to have permissions to provision a minimal cpu and gpu cluster

**Motivations** - Let's consider the following scenario: we want to explore training different image classifiers on distinct kinds of problems, based on a large public dataset that is available at a given url. This ML pipeline will be future-looking, in particular we want:
- **genericity**: to be fairly independent from the data we're ingesting (so that we could switch to internal proprietary data in the future),
- **configurability**: to run different versions of that training with simple configuration changes,
- **scalability**: to iterate on the pipeline on a small sample, then smoothly transition to running at scale.

### Connect to AzureML

Before we dive into the code, we'll need to create an instance of MLClient to connect to Azure ML. Please provide the references to your workspace below.

In [None]:
# handle to the workspace
from azure.ml import MLClient

# authentication package
from azure.identity import InteractiveBrowserCredential

# get a handle to the workspace
ml_client = MLClient(
    InteractiveBrowserCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)

### Provision the required resources for this notebook

We'll need 2 clusters for this notebook, a CPU cluster and a GPU cluster. First, let's create a minimal cpu cluster.

In [None]:
from azure.ml.entities import AmlCompute

# Let's create the Azure ML compute object with the intended parameters
cpu_cluster = AmlCompute(
    # Name assigned to the compute cluster
    name="cpu-cluster",
    # Azure ML Compute is the on-demand VM service
    type="amlcompute",
    # VM Family
    size="STANDARD_DS3_V2",
    # Minimum running nodes when there is no job running
    min_instances=0,
    # Nodes in cluster
    max_instances=4,
    # How many seconds an idle node keeps running after the job terminates
    idle_time_before_scale_down=180,
    # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination
    tier="Dedicated",
)

# Now, we pass the object to MLClient's create_or_update method
cpu_cluster = ml_client.begin_create_or_update(cpu_cluster)

print(
    f"AMLCompute with name {cpu_cluster.name} is created, the compute size is {cpu_cluster.size}"
)

For GPUs, we're creating the cluster below with the smallest VM family.

In [None]:
from azure.ml.entities import AmlCompute

gpu_cluster = AmlCompute(
    name="gpu-cluster",
    type="amlcompute",
    size="STANDARD_NC6",  # 1 x NVIDIA Tesla K80
    min_instances=0,
    max_instances=4,
    idle_time_before_scale_down=180,
    tier="Dedicated",
)

gpu_cluster = ml_client.begin_create_or_update(gpu_cluster)

print(
    f"AMLCompute with name {gpu_cluster.name} is created, the compute size is {gpu_cluster.size}"
)

# 1. Implement a reusable data preparation pipeline

To develop our data preparation pipeline, there are a couple of constraints that we're setting for ourselves:
- we want to minimize the effort to ingest public data, as it is used only as a learning opportunity,
- we do not want to manipulate large data locally (ex: downloading/uploading that data could take multiple hours).

In this section, we'll achieve just that, by implementing the following:
- a data ingestion and processing pipeline with simple shell commands (curl, unzip) using minimal boilerplate code.

## 1.1. Unzip archives with a simple command (no code)

To train our classifier, we'll consume the [Common Objects in COntext (COCO) dataset](https://cocodataset.org/). If we were to use this locally, the sequence would be very basic: download 3 zip files, unzip each of them in a distinct folder for train/val/test, use python to extract annotations into a format we can use. We'll do just that, but in the cloud, without too much pain.

The Azure ML SDK provides `entities` to implement any step of a workflow. In the example below, we create a `CommandComponent` with just a shell command. We parameterize this command by using a string template syntax provided by the SDK:

> ```
> curl -o local_archive.zip ${{inputs.url}} && unzip local_archive.zip -d ${{outputs.extracted_data}}
> ```

Creating the component just consists of declaring the names of the inputs and outputs, and specifying an environment. For this simple job we'll use a curated environment from AzureML. After that, we'll be able to reuse that component multiple times in our pipeline design.

In [None]:
from azure.ml.entities import CommandComponent, JobInput, JobOutput

download_unzip_component = CommandComponent(
    name="download_and_unzip",  # optional: this will show in the UI
    # this component has no code, just a simple unzip command
    command="curl -o local_archive.zip ${{inputs.url}} && unzip local_archive.zip -d ${{outputs.extracted_data}}",
    # I/O specifications, each using a specific key and type
    inputs={
        # 'url' is the key of this input string
        "url": {"type": "string"}
    },
    outputs={
        # 'extracted_data' will be the key to link this output to other steps in the pipeline
        "extracted_data": {"type": "path"}
    },
    # we're using a curated environment
    environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:9",
)
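
To build intuition for the `${{inputs.url}}` and `${{outputs.extracted_data}}` placeholders, here is a minimal sketch of how such a template could be resolved into a concrete shell command. It is only an illustration in plain Python, not the SDK's actual implementation: the helper `render_command` and the local path `/tmp/extracted_data` are hypothetical names chosen for this example.

```python
import re


def render_command(template, inputs, outputs):
    """Resolve ${{inputs.X}} / ${{outputs.Y}} placeholders (illustration only)."""
    values = {"inputs": inputs, "outputs": outputs}

    def substitute(match):
        # group(1) is "inputs" or "outputs", group(2) is the key name
        section, key = match.group(1), match.group(2)
        return str(values[section][key])

    pattern = r"\$\{\{\s*(inputs|outputs)\.(\w+)\s*\}\}"
    return re.sub(pattern, substitute, template)


template = (
    "curl -o local_archive.zip ${{inputs.url}} "
    "&& unzip local_archive.zip -d ${{outputs.extracted_data}}"
)

rendered = render_command(
    template,
    inputs={"url": "http://images.cocodataset.org/zips/val2017.zip"},
    # hypothetical location: at run time the service decides where outputs live
    outputs={"extracted_data": "/tmp/extracted_data"},
)
print(rendered)
```

The point of the indirection is that the command never hardcodes a url or a path: the same component can be instantiated several times with different values.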

The component class we just created can now be loaded as a function in python. This function will be used to write a reusable step in a pipeline.

In [None]:
from azure.ml import dsl

# we'll package this unzip command as a component to use within a pipeline
download_unzip_component_func = dsl.load_component(component=download_unzip_component)

?download_unzip_component_func

## 1.2. Add a python script

The next step in our pipeline is to implement a simple script to extract the annotations and format them for us. We've written that script in this repository, and it can be loaded as a component from its yaml specification.

In [None]:
from azure.ml import dsl

parse_annotations_func = dsl.load_component(
    yaml_file="./src/coco_extract_annotations/spec.yaml"
)

?parse_annotations_func
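
The script itself lives in `./src/coco_extract_annotations/` and is not reproduced here. As a rough idea of what such a step might do, the sketch below turns COCO-style annotations into a binary "does this image contain the category" CSV. It assumes the usual COCO layout (an `images` list with `id` and `file_name`, an `annotations` list with `image_id` and `category_id`); the function names, the sample data, and the CSV column layout are hypothetical and may differ from the repository script.

```python
import csv
import io


def extract_binary_labels(coco, category_id):
    """Return (file_name, label) pairs, label=1 if the image has the category."""
    positive_image_ids = {
        annotation["image_id"]
        for annotation in coco["annotations"]
        if annotation["category_id"] == category_id
    }
    return [
        (image["file_name"], int(image["id"] in positive_image_ids))
        for image in coco["images"]
    ]


def to_csv(rows, category_name):
    """Serialize the pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["file_name", category_name])
    writer.writerows(rows)
    return buffer.getvalue()


# tiny made-up sample in COCO layout (not real dataset content)
sample = {
    "images": [{"id": 10, "file_name": "a.jpg"}, {"id": 11, "file_name": "b.jpg"}],
    "annotations": [
        {"image_id": 10, "category_id": 1},
        {"image_id": 11, "category_id": 3},
    ],
}

rows = extract_binary_labels(sample, category_id=1)
csv_text = to_csv(rows, category_name="contains_person")
print(csv_text)
```

Keeping `category_id` and `category_name` as arguments mirrors the parameters exposed by the component, so the same step could target another object category.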

## 1.3. Write a reusable pipeline

We use the decorator `@dsl.pipeline` to construct an AzureML pipeline assembling the two components above.

In [None]:
from azure.ml import dsl

# the dsl decorator tells the sdk that we are defining an AML pipeline
@dsl.pipeline(
    # this will be the default compute on which each step below will run
    compute="cpu-cluster",
    # a readable description
    description="e2e images preparation",
)
def coco_preparation_pipeline(
    train_archive_url,
    valid_archive_url,
    test_archive_url,
    annotations_archive_url,
    category_id,
    category_name,
):
    # 1st instance using the command component above on the training data
    train_unzip_step = download_unzip_component_func(url=train_archive_url)

    # 2nd instance for validation data
    valid_unzip_step = download_unzip_component_func(url=valid_archive_url)

    # 3rd instance for testing data
    test_unzip_step = download_unzip_component_func(url=test_archive_url)

    # 4th instance for the annotations
    annotations_unzip_step = download_unzip_component_func(url=annotations_archive_url)

    # add the annotations processing after the unzip command
    parse_annotations_step = parse_annotations_func(
        # here we consume the output of the unzip step
        annotations_dir=annotations_unzip_step.outputs.extracted_data,
        # parameters for this step are given as pipeline parameters
        # to allow for genericity (no hardcoded value)
        category_id=category_id,
        category_name=category_name,
    )

    # outputs of this pipeline are coded as a dictionary
    # keys can be used to assemble and link this pipeline with other pipelines
    return {
        "train_images": train_unzip_step.outputs.extracted_data,
        "valid_images": valid_unzip_step.outputs.extracted_data,
        "test_images": test_unzip_step.outputs.extracted_data,
        "train_annotations": parse_annotations_step.outputs.train_annotations,
        "valid_annotations": parse_annotations_step.outputs.valid_annotations,
    }

The pipeline we just created, decorated by `@dsl.pipeline`, can also be called from python as a sub-pipeline within another pipeline, creating more complex workflows (as we'll see in the next section).

In [None]:
?coco_preparation_pipeline

## 1.4. Run an instance of this pipeline in AzureML

When calling the pipeline function decorated with `@dsl.pipeline`, we will create an instance of this pipeline with the given arguments. In this scenario, we just provide the urls to the zip files we want to process, and the category of the objects we plan to train on.

In [None]:
pipeline_instance = coco_preparation_pipeline(
    train_archive_url="http://images.cocodataset.org/zips/train2017.zip",
    valid_archive_url="http://images.cocodataset.org/zips/val2017.zip",
    test_archive_url="http://images.cocodataset.org/zips/test2017.zip",
    annotations_archive_url="http://images.cocodataset.org/annotations/annotations_trainval2017.zip",
    category_id=1,
    category_name="contains_person",
)

That instance can be submitted to AzureML and run there. Use the `MLClient` to create the job:

In [None]:
# submit the pipeline job
returned_job = ml_client.jobs.create_or_update(
    pipeline_instance,
    # Project's name
    experiment_name="e2e_image_sample",
    # If there is no dependency, pipeline run will continue even after the failure of one component
    continue_run_on_step_failure=True,
)

# get a URL for the status of the job
print("The url to see your live job running is returned by the sdk:")
print(returned_job.services["Studio"].endpoint)

Considering the size of the dataset, this job will take a couple of hours to complete. The validation and annotations datasets are smaller, and should take only a couple of minutes to unzip. So while we wait for the training dataset (110k+ images) to finalize, you can already go into AzureML and register the outputs of the pipeline as datasets (see below).

<span style="color:red">IMPORTANT</span> - To move forward with the next section, we'll need you to:
- register the output of "train_unzip_step" as dataset "coco_train2017"
- register the output of "valid_unzip_step" as dataset "coco_val2017"
- (when available) register the 1st output of "Extract Coco Annotations" as dataset "coco_train2017_annotations"
- register the 2nd output of "Extract Coco Annotations" as dataset "coco_val2017_annotations"

![](media/image-prep-pipeline.png)

# 2. Training a distributed gpu job

Implementing a distributed pytorch training is complex. Of course in this tutorial we've written one for you, but the point is: it takes time, it takes several iterations, each requiring you to try your code locally, then in the cloud, then try it at scale, until satisfied and then run a full blown production model training. This trial/error process can be made easier if we can create reusable code we can iterate on quickly, and that can be configured to run from small to large scale.

So, to develop our training pipeline, we set a couple constraints for ourselves:
- we want to minimize the effort to iterate on the pipeline code when porting it in the cloud,
- we want to use the same code for small scale and large scale testing, and we do not want to manipulate large data locally (ex: downloading/uploading that data could take multiple hours).

## 2.1. Implement training and test on a sample

We've implemented a distributed pytorch training script that we can load as a component using its yaml specification, like we did for other components before.

Writing a pipeline to run it will be greatly simplified by the Azure ML SDK `@dsl.pipeline` decorator. For this, we've decided to parameterize this pipeline with relevant arguments (see below):

In [None]:
from azure.ml import dsl

# load the component from its yaml specifications
training_func = dsl.load_component(yaml_file="./src/pytorch_dl_train/spec.yaml")

# the dsl decorator tells the sdk that we are defining an Azure ML pipeline
@dsl.pipeline(
    description="e2e images classification",  # TODO: document
)
def coco_model_training(
    train_images,  # dataset containing training images
    valid_images,  # dataset containing validation images
    train_annotations,  # annotations in CSV (see coco_extract_annotations/)
    valid_annotations,  # annotations in CSV (see coco_extract_annotations/)
    model_name,  # a name to register the model after training
    epochs,  # the number of epochs
    enable_profiling,  # bonus: we've implemented pytorch profiling in our script
):
    # the training step is calling our training component with the right arguments
    training_step = training_func(
        # inputs
        train_images=train_images,
        valid_images=valid_images,
        train_annotations=train_annotations,
        valid_annotations=valid_annotations,
        # params (some hardcoded, some given by pipeline parameters)
        num_epochs=epochs,
        register_model_as=model_name,
        num_workers=-1,  # use all cpus (see train.py)
        enable_profiling=enable_profiling,  # turns on profiler (see train.py)
    )
    # we set the name of the compute target for this training job
    training_step.compute = "gpu-cluster"

    # use process_count_per_instance to parallelize on multiple gpus
    training_step.distribution.process_count_per_instance = (
        1  # set to number of gpus on instance
    )

    # use instance_count to increase the number of nodes (machines)
    training_step.resources.instance_count = 1

    # outputs of this pipeline are coded as a dictionary
    # keys can be used to assemble and link this pipeline with other pipelines
    return {"model": training_step.outputs.trained_model}


# TODO: document
?coco_model_training
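
To make the two scaling knobs above concrete, here is a small sketch of how `instance_count` (nodes) and `process_count_per_instance` (processes per node, typically one per GPU) combine into the total number of training processes. The helper `world_layout` is hypothetical, and the contiguous rank numbering is an assumption that follows the common convention of distributed launchers; the actual assignment is decided at run time.

```python
def world_layout(instance_count, process_count_per_instance):
    """List (node_rank, local_rank, global_rank) for every training process."""
    layout = []
    for node_rank in range(instance_count):
        for local_rank in range(process_count_per_instance):
            # assumed convention: ranks are numbered contiguously node by node
            global_rank = node_rank * process_count_per_instance + local_rank
            layout.append((node_rank, local_rank, global_rank))
    return layout


# the settings used in this notebook: 1 node, 1 process
small = world_layout(instance_count=1, process_count_per_instance=1)
print(f"world size: {len(small)}")

# a hypothetical larger setup: 2 nodes with 4 gpus each
large = world_layout(instance_count=2, process_count_per_instance=4)
print(f"world size: {len(large)}")
for node_rank, local_rank, global_rank in large:
    print(f"node={node_rank} local_rank={local_rank} global_rank={global_rank}")
```

Scaling out is then a matter of changing two numbers in the pipeline definition, while the training script stays the same.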

We can now test this code by running it on a smaller dataset in Azure ML. Here, we will use the validation set for training. Of course, the model will not be valid. But training will be short (8 mins on STANDARD_NC6 for 1 epoch) to allow us to iterate if needed.

In [None]:
pipeline_instance = coco_model_training(
    # inputs: using validation set for training makes model invalid
    train_images=ml_client.datasets.get("coco_val2017", version=1),
    valid_images=ml_client.datasets.get("coco_val2017", version=1),
    train_annotations=ml_client.datasets.get("coco_val2017_annotations", version=1),
    valid_annotations=ml_client.datasets.get("coco_val2017_annotations", version=1),
    # training parameters surfaced in the pipeline definition
    epochs=1,  # 1 epoch only for testing, model isn't valid anyway
    model_name="coco_model_person_dev",
    enable_profiling=False,  # turns off profiler (see train.py)
)

Once we create that pipeline instance, we submit it through `MLClient`.

In [None]:
import webbrowser

# submit the pipeline job
returned_job = ml_client.jobs.create_or_update(
    pipeline_instance,
    # Project's name
    experiment_name="e2e_image_sample",
    # If there is no dependency, pipeline run will continue even after the failure of one component
    continue_run_on_step_failure=True,
)

# get a URL for the status of the job
print("The url to see your live job running is returned by the sdk:")
print(returned_job.services["Studio"].endpoint)
# open the browser with this url
webbrowser.open(returned_job.services["Studio"].endpoint)

# print the pipeline run id
print(
    f"The pipeline details can be accessed programmatically using identifier: {returned_job.name}"
)
# saving it for later in this notebook
small_scale_pipeline_id = returned_job.name

You can iterate on this design as much as you'd like, updating the local code of the component and re-submitting the pipeline. You would have to go back to the beginning of this section (from the `dsl.load_component()` call) in order to refresh the pipeline definition with a new version of the code.

## 2.2. Reuse the same pipeline on full scale data

Once the pipeline is satisfactory, running it at full scale is just a matter of adjusting the inputs and parameters. We're creating a new instance below where we use the training images as the training set (as we should). If running on a minimal STANDARD_NC6, it should take less than approximately 2h30 to complete 10 epochs.

In [None]:
pipeline_instance = coco_model_training(
    train_images=ml_client.datasets.get("coco_train2017", version=1),
    valid_images=ml_client.datasets.get("coco_val2017", version=1),
    train_annotations=ml_client.datasets.get("coco_train2017_annotations", version=1),
    valid_annotations=ml_client.datasets.get("coco_val2017_annotations", version=1),
    epochs=10,
    model_name="coco_model_person_full",
    enable_profiling=False,  # turns off profiler (see train.py)
)

In [None]:
import webbrowser

# submit the pipeline job
returned_job = ml_client.jobs.create_or_update(
    pipeline_instance,
    # Project's name
    experiment_name="e2e_image_sample",
    # If there is no dependency, pipeline run will continue even after the failure of one component
    continue_run_on_step_failure=True,
)

# get a URL for the status of the job
print("The url to see your live job running is returned by the sdk:")
print(returned_job.services["Studio"].endpoint)
# open the browser with this url
webbrowser.open(returned_job.services["Studio"].endpoint)

# print the pipeline run id
print(
    f"The pipeline details can be accessed programmatically using identifier: {returned_job.name}"
)
# saving it for later in this notebook
large_scale_pipeline_id = returned_job.name

# 3. Analyze experiments using MLFlow

Azure ML natively integrates with MLFlow so that if your code already supports MLFlow logging, you will not have to modify it to report your metrics within Azure ML. The component above is using MLFlow internally to report relevant metrics, logs and artifacts. Look for `mlflow` calls within the script `train.py`.

To access this data in the Azure ML Studio, click on the component in the pipeline to open the Details panel, then choose the **Metrics** panel.

You can also access those metrics programmatically using mlflow. We'll demo a couple examples below.

## 3.1. Connect to Azure ML using MLFlow client

In [None]:
import mlflow
from mlflow.tracking import MlflowClient
import matplotlib.pyplot as plt

mlflow.set_tracking_uri(ml_client.workspaces.get().mlflow_tracking_uri)

# search for the training step within the pipeline
mlflow.set_experiment("e2e_image_sample")  # the experiment our pipeline jobs were submitted to

## 3.2. Analyze the metrics of a specific job

Using MLFlow, you can retrieve all the metrics produced by a given run. You can then leverage any usual tool to draw the analysis that is relevant for you. In the example below, we're plotting accuracy per epoch.

![plot training and validation accuracy over epochs](./media/pytorch_train_mlflow_plot.png)

In [None]:
# here we're using the small scale training on validation data
training_pipeline_id = small_scale_pipeline_id
# feel free to adapt with a pipeline id of yours
# training_pipeline_id = "..."

# use this to get the id of the training step within the pipeline
training_step_run_id = mlflow.search_runs(
    filter_string=f"tags.mlflow.parentRunId = '{training_pipeline_id}'"
)["run_id"][0]

# alternatively, you can directly use a known training step id
# training_step_run_id = "..."

print(f"Within pipeline run id: {training_pipeline_id}")
print(f"Training step has run id: {training_step_run_id}")

# open a client to get metric history
client = MlflowClient()

# create a plot
plt.rcdefaults()
fig, ax = plt.subplots()
ax.set_xlabel("epoch")

for metric in ["epoch_train_acc", "epoch_valid_acc"]:
    # get all values taken by the metric
    metric_history = client.get_metric_history(training_step_run_id, metric)

    epochs = [metric_entry.step for metric_entry in metric_history]
    metric_array = [metric_entry.value for metric_entry in metric_history]
    ax.plot(epochs, metric_array, label=metric)

plt.legend()
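
Beyond plotting, the same metric history can be reduced to a single summary, for instance the epoch with the best validation accuracy. The sketch below works on any sequence of entries exposing `.step` and `.value`, like the ones used in the cell above. The `MetricEntry` tuple, the helper `best_epoch`, and the numbers are made-up stand-ins so the example runs without a workspace.

```python
from collections import namedtuple

# stand-in for the entries of a metric history (only the fields we need)
MetricEntry = namedtuple("MetricEntry", ["step", "value"])


def best_epoch(metric_history):
    """Return (step, value) of the entry with the highest value."""
    best = max(metric_history, key=lambda entry: entry.value)
    return best.step, best.value


# made-up values for illustration only, not actual training results
history = [
    MetricEntry(step=0, value=0.61),
    MetricEntry(step=1, value=0.72),
    MetricEntry(step=2, value=0.70),
]

step, value = best_epoch(history)
print(f"best validation accuracy {value} reached at epoch {step}")
```

With a live run, `history` would be replaced by the list obtained for `"epoch_valid_acc"` in the loop above.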

## 3.2. Retrieve artifacts for local analysis (ex: tensorboard)

MLFlow also allows you to record artifacts during training. The script `train.py` leverages the [PyTorch profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html) to produce logs for analyzing GPU performance. It uses mlflow to record those logs as artifacts.

In the following, we'll download those locally to inspect with other tools such as tensorboard.

In [None]:
# here we're using the small scale training on validation data
training_pipeline_id = small_scale_pipeline_id
# feel free to adapt with a pipeline id of yours
# training_pipeline_id = "..."

# use this to get the id of the training step within the pipeline
training_step_run_id = mlflow.search_runs(
    filter_string=f"tags.mlflow.parentRunId = '{training_pipeline_id}'"
)["run_id"][0]

# alternatively, you can directly use a known training step id
# training_step_run_id = "..."

print(f"Within pipeline run id: {training_pipeline_id}")
print(f"Training step has run id: {training_step_run_id}")

import os

# open a client to get run artifacts
client = MlflowClient()

# create local directory to store artifacts
os.makedirs("./logs/", exist_ok=True)

# download the artifacts found under each profiler folder
for artifact_dir in ["profiler/markdown/", "profiler/tensorboard_logs/"]:
    artifacts = client.list_artifacts(training_step_run_id, path=artifact_dir)

    # report explicitly when nothing was logged under this folder
    if not artifacts:
        print(
            f"No artifacts were found for {artifact_dir} in run id {training_step_run_id}"
        )

    for artifact in artifacts:
        print(f"Downloading artifact {artifact.path}")
        client.download_artifacts(
            training_step_run_id, path=artifact.path, dst_path="./logs"
        )

We can now run tensorboard locally with the downloaded artifacts to run some analysis of GPU performance (see example snapshot below).

```
tensorboard --logdir="./logs/profiler/tensorboard_logs/"
```

![tensorboard logs generated by pytorch profiler](./media/pytorch_train_tensorboard_logs.png)

## 3.3. Analyze metrics across multiple jobs

You can also use mlflow to search all your runs, filter by some specific properties and get the results as a pandas dataframe. Once you get that dataframe, you can implement any analysis on top of it.

Below, we're extracting all runs and showing the effect of profiling on the epoch training time.

![mlflow runs in a pandas dataframe](./media/pytorch_train_mlflow_runs.png)

In [None]:
runs = mlflow.search_runs(
    # we're using mlflow syntax to restrict to a specific parameter
    filter_string="params.model_arch = 'resnet18'"
)

# we're keeping only some relevant columns
columns = [
    "run_id",
    "status",
    "end_time",
    "metrics.epoch_train_time",
    "metrics.epoch_train_acc",
    "metrics.epoch_valid_acc",
    "params.enable_profiling",
]

# showing the raw results in notebook
runs[columns].dropna()

![](media/mlflow_plot.png)
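
One way to quantify the effect of profiling from the runs table is to average `metrics.epoch_train_time` per value of `params.enable_profiling`. The sketch below does this in plain Python on a list of records, the shape you would get from `runs[columns].dropna().to_dict("records")`. The helper `mean_epoch_time_by_profiling` is hypothetical and the sample records are made up for illustration; note that MLflow stores parameters as strings, hence the `"True"`/`"False"` keys.

```python
from collections import defaultdict
from statistics import mean


def mean_epoch_time_by_profiling(records):
    """Average epoch training time for each value of enable_profiling."""
    grouped = defaultdict(list)
    for record in records:
        flag = record["params.enable_profiling"]
        grouped[flag].append(record["metrics.epoch_train_time"])
    return {flag: mean(times) for flag, times in grouped.items()}


# made-up records for illustration only, not actual run results
records = [
    {"params.enable_profiling": "False", "metrics.epoch_train_time": 100.0},
    {"params.enable_profiling": "False", "metrics.epoch_train_time": 110.0},
    {"params.enable_profiling": "True", "metrics.epoch_train_time": 130.0},
]

summary = mean_epoch_time_by_profiling(records)
print(summary)
```

The same aggregation could equally be written with a pandas `groupby` directly on the dataframe.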