TODO: does this work locally AND remotely?
TODO: "This is a N part series of tutorials"


# PyTorch image classification from prep to deployment (part I)

**Learning Objectives** - By the end of this tutorial you should be able to use Azure Machine Learning (AzureML) to:
- ingest a large dataset from a simple URL
- quickly implement basic commands for data preparation
- assemble a pipeline with custom data preparation (Python) scripts

**Requirements** - In order to benefit from this tutorial, you need:
- to have provisioned an AzureML workspace
- to have permissions to create simple SKUs in your resource group
- a Python environment

**Motivations** - TODO

# End-to-end scenario

TODO

# 1. Set up the pipeline resources

In [1]:
# handle to the workspace
from azure.ml import MLClient

# Authentication package
from azure.identity import InteractiveBrowserCredential

In [2]:
# get a handle to the workspace
ml_client = MLClient(
    InteractiveBrowserCredential(), 
    # subscription_id = '<SUBSCRIPTION_ID>', 
    # resource_group = '<RESOURCE_GROUP>', 
    # workspace = '<AML_WORKSPACE_NAME>'
)

# 2. Implement a reusable data preparation pipeline

motivation...

## 2.1. Unzip archives with a simple command (no code component)

In our use case, we want to avoid downloading the data locally and then uploading it into AzureML. So first, we'll have to download and unzip the archives provided at the COCO web URLs from within the pipeline itself. Unzipping is not a complex operation, but running it in a reusable manner in a cloud pipeline job can require a lot of boilerplate code. Instead of writing a Python script for that purpose, we just want to run the command itself.
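For comparison, here is a minimal sketch of what the equivalent Python script could look like, using only the standard library. The helper name `download_and_unzip` is hypothetical and not part of AzureML; it is also a convenient way to try the download-and-extract logic locally on a small archive.

```python
import tempfile
import urllib.request
import zipfile
from pathlib import Path


def download_and_unzip(url, output_dir):
    """Download a zip archive from `url` and extract it into `output_dir`.

    Returns the list of member names found in the archive.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # keep the downloaded archive in a temporary folder so that only
    # the extracted files end up in output_dir
    with tempfile.TemporaryDirectory() as tmp:
        archive_path = Path(tmp) / "local_archive.zip"

        # rough equivalent of: curl -o local_archive.zip <url>
        urllib.request.urlretrieve(url, str(archive_path))

        # rough equivalent of: unzip local_archive.zip -d <output_dir>
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(out)
            return archive.namelist()
```

For example, `download_and_unzip("http://images.cocodataset.org/annotations/annotations_trainval2017.zip", "./data/annotations")` would fetch and extract the annotations archive used later in this tutorial.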

Here, we will embed the unzip call in a [CommandComponent](TODO) that specifies inputs, outputs and environment of that command.

In [None]:
from azure.ml import dsl
from azure.ml.entities import CommandComponent, JobInput, JobOutput

download_unzip_component = CommandComponent(
    name="download_and_unzip", # optional: this will show in the UI

    # this component has no code, just a simple unzip command
    # TODO: command = "wgetls -lr ${{inputs.archive_path}}; unzip ${{inputs.archive_path}} -d ${{outputs.extracted_data}}",
    command = "curl -o local_archive.zip ${{inputs.url}} && unzip local_archive.zip -d ${{outputs.extracted_data}}",

    # inputs and outputs need to match with the command
    inputs = {
        'url': { 'type': 'string' }
    },
    outputs = {
        'extracted_data': { 'type': 'path' }
    },

    # we're using a curated environment
    environment = 'AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:9',
)
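Because the `${{inputs...}}` and `${{outputs...}}` placeholders in the command have to match the declared `inputs` and `outputs`, a quick consistency check can catch typos before submitting anything. The sketch below is a plain-Python helper written for this tutorial (`check_command_placeholders` is a hypothetical name, not an SDK function); it only inspects the command string and the two dictionaries.

```python
import re

# matches placeholders such as ${{inputs.url}} or ${{outputs.extracted_data}}
PLACEHOLDER = re.compile(r"\$\{\{\s*(inputs|outputs)\.(\w+)\s*\}\}")


def check_command_placeholders(command, inputs, outputs):
    """Return a list of mismatches between the command and the declared inputs/outputs."""
    used = {"inputs": set(), "outputs": set()}
    for kind, name in PLACEHOLDER.findall(command):
        used[kind].add(name)

    declared = {"inputs": set(inputs), "outputs": set(outputs)}
    problems = []
    for kind in ("inputs", "outputs"):
        for name in sorted(used[kind] - declared[kind]):
            problems.append(f"{kind}.{name} is used in the command but not declared")
        for name in sorted(declared[kind] - used[kind]):
            problems.append(f"{kind}.{name} is declared but never used in the command")
    return problems


command = "curl -o local_archive.zip ${{inputs.url}} && unzip local_archive.zip -d ${{outputs.extracted_data}}"
inputs = {"url": {"type": "string"}}
outputs = {"extracted_data": {"type": "path"}}

print(check_command_placeholders(command, inputs, outputs))  # prints []
```

A declared but unused input is not necessarily an error, so treat the returned messages as things worth a second look rather than hard failures.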

## 2.2. Write a reusable pipeline in python

The `CommandComponent` we just created can now be loaded as a [component](TODO) function: a reusable step in a pipeline.

In [None]:
# we'll package this unzip command as a component to use within a pipeline
download_unzip_component_func = dsl.load_component(component=download_unzip_component)

This step can be used as a Python function with arguments and parameters. The `inputs` of the command component are passed as Python arguments, and its `outputs` are available as attributes of the returned step. We use the decorator `@dsl.pipeline` to construct an AzureML pipeline that assembles components.

In [None]:
from azure.ml import dsl

# the dsl decorator tells the sdk that we are defining an AML pipeline
@dsl.pipeline(
    compute="cpu-d14-v2",  # default compute target for the steps; use the name of a cluster in your workspace (e.g. "cpu-cluster")
    description="e2e images preparation",  # human-readable description of the pipeline
)
def coco_preparation_pipeline(annotations_archive_url, train_archive_url, valid_archive_url, category_id, category_name):
    # note: category_id and category_name are not used by the unzip steps below

    # download and extract the annotations archive
    annotations_unzip_step = download_unzip_component_func(
        url=annotations_archive_url
    )

    # download and extract the training images archive
    train_unzip_step = download_unzip_component_func(
        url=train_archive_url
    )

    # download and extract the validation images archive
    valid_unzip_step = download_unzip_component_func(
        url=valid_archive_url
    )

    # expose each extracted folder as a named output of the pipeline
    return {
        "train_images": train_unzip_step.outputs.extracted_data,
        "valid_images": valid_unzip_step.outputs.extracted_data,
        "trainval_annotations": annotations_unzip_step.outputs.extracted_data
    }

The pipeline we just created, decorated with `@dsl.pipeline`, can also be called from Python as a sub-pipeline within another pipeline to build more complex workflows (as we'll see in the next section).

In [None]:
# show the signature of the pipeline function
help(coco_preparation_pipeline)

## 2.3. Run an instance of this pipeline in AzureML

Calling the pipeline function decorated with `@dsl.pipeline` creates an instance of this pipeline with the given arguments.

In [None]:
from azure.ml.entities import Dataset
from azure.ml.entities import JobInput, JobOutput

pipeline_instance = coco_preparation_pipeline(
    annotations_archive_url="http://images.cocodataset.org/annotations/annotations_trainval2017.zip",
    train_archive_url="http://images.cocodataset.org/zips/train2017.zip",
    valid_archive_url="http://images.cocodataset.org/zips/val2017.zip",
    category_id=1,
    category_name="contains_person"
)

That instance can be submitted to AzureML and run as an experiment there. Use the [`MLClient`](TODO) to create this experiment.

In [None]:
# submit the pipeline job
returned_job = ml_client.jobs.create_or_update(
    pipeline_instance,
    
    # name of the experiment under which the job is recorded
    experiment_name="e2e_image_sample",
    
    # if True, steps that do not depend on a failed step keep running after that failure
    continue_run_on_step_failure=True,
)

# get a URL for the status of the job
print("The url to see your live job running is returned by the sdk:")
print(returned_job.services["Studio"].endpoint)

![](media/image-prep-pipeline.png)
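To make the effect of `continue_run_on_step_failure` more concrete, here is a toy model of the behavior described in the comment above. It is a plain-Python simulation written for this tutorial, not the AzureML scheduler: the function `simulate_run`, the status names, and the assumption that steps are processed one after another in the order listed are all hypothetical simplifications.

```python
def simulate_run(steps, failing, continue_on_failure):
    """Toy simulation of a pipeline run.

    steps: dict mapping step name -> list of upstream step names,
           listed in a valid execution order.
    failing: set of step names that fail when they run.
    """
    status = {}
    halted = False
    for name, upstream in steps.items():
        if halted:
            # the run stopped after an earlier failure
            status[name] = "not_started"
        elif any(status[u] != "succeeded" for u in upstream):
            # an upstream step did not succeed, so this step cannot run
            status[name] = "skipped"
        elif name in failing:
            status[name] = "failed"
            halted = not continue_on_failure
        else:
            status[name] = "succeeded"
    return status


# the three unzip steps of the preparation pipeline are independent of each other
steps = {
    "annotations_unzip_step": [],
    "train_unzip_step": [],
    "valid_unzip_step": [],
}

print(simulate_run(steps, failing={"train_unzip_step"}, continue_on_failure=True))
# {'annotations_unzip_step': 'succeeded', 'train_unzip_step': 'failed', 'valid_unzip_step': 'succeeded'}
```

In this model, a failed download of the training images does not prevent the validation images from being extracted, which is the behavior we want for three independent archives.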

# 3. Train a distributed GPU job

To run a PyTorch training job on multiple GPUs, you have multiple options.

In the following, we'll use the `pytorch` distribution setting to run a job using [`DistributedDataParallel`](TODO): each node runs multiple instances of the training script, one per GPU on that node.
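As a worked example of "one process per GPU on each node", the sketch below enumerates the processes for 2 nodes with 4 GPUs each (the values used in the pipeline later in this section). It assumes the usual convention that the global rank is `node_rank * process_count_per_instance + local_rank`; the helper `rank_layout` is hypothetical and only does the arithmetic.

```python
def rank_layout(instance_count, process_count_per_instance):
    """List (node_rank, local_rank, global_rank) for every training process."""
    layout = []
    for node_rank in range(instance_count):
        for local_rank in range(process_count_per_instance):
            # assumed convention: processes are numbered node by node
            global_rank = node_rank * process_count_per_instance + local_rank
            layout.append((node_rank, local_rank, global_rank))
    return layout


layout = rank_layout(instance_count=2, process_count_per_instance=4)
print(len(layout))  # total number of processes (world size): 8
print(layout[5])    # second process on the second node: (1, 1, 5)
```

The world size is simply `instance_count * process_count_per_instance`, so doubling either setting doubles the number of script instances.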

In [3]:
from azure.ml import dsl

parse_annotations_func = dsl.load_component(yaml_file="./components/coco_extract_annotations/spec.yaml")
training_func = dsl.load_component(yaml_file="./components/pytorch_dl_train/spec.yaml")

In [4]:
from azure.ml import dsl

# the dsl decorator tells the sdk that we are defining an AML pipeline
@dsl.pipeline(
    description="e2e images classification",  # human-readable description of the pipeline
)
def coco_model_training_devel(validation_images, trainval_annotations, category_id, category_name, epochs):
    # parse the COCO annotations for the given category
    parse_annotations_step = parse_annotations_func(
        annotations_dir=trainval_annotations,
        category_id=category_id,
        category_name=category_name
    )
    parse_annotations_step.compute = "cpu-cluster"

    # train the model (in this development pipeline, the validation set is also used for training)
    # note: the error at the end of this section reports that this component also
    # requires `simulated_latency_in_ms`, which is not passed here
    training_step = training_func(
        train_images=validation_images, # use validation as training
        valid_images=validation_images,
        train_annotations=parse_annotations_step.outputs.valid_annotations, # use validation as training
        valid_annotations=parse_annotations_step.outputs.valid_annotations,
        num_epochs=epochs,
        register_model_as="coco_model"
    )
    training_step.compute = "gpu-cluster"

    # use process_count_per_instance to parallelize on multiple gpus
    training_step.distribution.process_count_per_instance = 4 # set to number of gpus on instance

    # use instance_count to increase the number of nodes (machines)
    training_step.resources.instance_count = 2

    # expose the trained model as the output of the pipeline
    return {
        "model": training_step.outputs.trained_model
    }

# show the signature of the pipeline function
help(coco_model_training_devel)

Help on function coco_model_training_devel in module __main__:

coco_model_training_devel(validation_images, trainval_annotations, category_id, category_name, epochs)
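Since several instances of the training script run at once, each instance needs to know which one it is. The sketch below assumes the launcher exposes the standard PyTorch environment variables `WORLD_SIZE`, `RANK` and `LOCAL_RANK`; the helper `read_distributed_env` is a hypothetical name and is not taken from the training component used here.

```python
import os


def read_distributed_env(env=None):
    """Read the distributed settings of the current process from environment variables."""
    env = os.environ if env is None else env

    # fall back to single-process values when the variables are not set
    world_size = int(env.get("WORLD_SIZE", "1"))
    rank = int(env.get("RANK", "0"))
    local_rank = int(env.get("LOCAL_RANK", "0"))

    return {
        "world_size": world_size,             # total number of processes
        "rank": rank,                         # index of this process across all nodes
        "local_rank": local_rank,             # index of this process on its node (typically the GPU index)
        "is_distributed": world_size > 1,
        "is_main_process": rank == 0,         # e.g. the only process that should save the model
    }


# example: the values one process would see with 2 nodes x 4 GPUs
print(read_distributed_env({"WORLD_SIZE": "8", "RANK": "5", "LOCAL_RANK": "1"}))
```

Falling back to single-process defaults keeps the same script usable when it is run locally without any launcher.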



In [5]:
from azure.ml.entities import Dataset
from azure.ml.entities import JobInput, JobOutput

pipeline_instance = coco_model_training_devel(
    validation_images=ml_client.datasets.get("coco_val2017", version=1),
    trainval_annotations=ml_client.datasets.get("coco_trainval2017_annotations", version=1),
    epochs=1,
    category_id=1,
    category_name="person"
)

In [6]:
# submit the pipeline job
returned_job = ml_client.jobs.create_or_update(
    pipeline_instance,
    
    # name of the experiment under which the job is recorded
    experiment_name="e2e_image_sample",
    
    # if True, steps that do not depend on a failed step keep running after that failure
    continue_run_on_step_failure=True,
)

# get a URL for the status of the job
print("The url to see your live job running is returned by the sdk:")
print(returned_job.services["Studio"].endpoint)

UserErrorException: Required input simulated_latency_in_ms for component training_step not provided.
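An error like this one can be caught before submission by comparing the inputs a component declares with the arguments actually passed to it. The sketch below is a plain-Python illustration under stated assumptions: the declared inputs are written by hand as a dictionary (the types and the `default` value are made up for the example, only the input names come from this tutorial), and an input is treated as required unless it is marked `optional` or has a `default`.

```python
def missing_required_inputs(declared_inputs, provided):
    """Return the names of required inputs that were not provided."""
    missing = []
    for name, spec in declared_inputs.items():
        has_default = spec.get("default") is not None
        if spec.get("optional", False) or has_default:
            # optional inputs and inputs with a default do not have to be passed
            continue
        if provided.get(name) is None:
            missing.append(name)
    return missing


# hypothetical excerpt of the inputs declared by the training component
declared_inputs = {
    "train_images": {"type": "path"},
    "num_epochs": {"type": "integer", "default": 1},
    "simulated_latency_in_ms": {"type": "integer"},
}

# arguments passed in the pipeline definition (values shortened for the example)
provided = {"train_images": "<validation_images>", "num_epochs": 1}

print(missing_required_inputs(declared_inputs, provided))  # ['simulated_latency_in_ms']
```

Here the fix would be to pass a value for `simulated_latency_in_ms` when calling `training_func`, or to mark that input as optional in the component's spec.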