TODO: does this work locally AND remotely?
TODO: "This is a N part series of tutorials"


# PyTorch image classification from prep to deployment (Part I: data preparation)

**Learning Objectives** - By the end of this tutorial you should be able to use Azure Machine Learning (AzureML) to:
- ingest a large dataset directly from a URL
- quickly implement basic commands for data preparation
- assemble a pipeline with custom data preparation (Python) scripts

**Requirements** - In order to benefit from this tutorial, you need:
- to have provisioned an AzureML workspace
- to have permissions to create simple SKUs in your resource group
- a Python environment

**Motivations** - TODO

# 1. Introduction to the end-to-end scenario

TODO

# 2. Set up the pipeline resources

## 2.1. Connect to your AzureML workspace

In [None]:
# handle to the workspace
from azure.ml import MLClient

# Authentication package
from azure.identity import InteractiveBrowserCredential

In [None]:
# get a handle to the workspace
ml_client = MLClient(
    InteractiveBrowserCredential(), 
    subscription_id = '<SUBSCRIPTION_ID>', 
    resource_group = '<RESOURCE_GROUP>', 
    workspace = '<AML_WORKSPACE_NAME>'
)
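
If you would rather not paste identifiers into the notebook, one option is to read them from environment variables. The sketch below is only an illustration: the variable names (`AZURE_SUBSCRIPTION_ID`, `AZURE_RESOURCE_GROUP`, `AZURE_ML_WORKSPACE`) and the helper `workspace_settings_from_env` are assumptions, not part of the SDK. The returned keys mirror the keyword arguments used in the cell above.

```python
import os

# Hypothetical variable names; use whatever convention your team already has.
ENV_TO_ARG = {
    "AZURE_SUBSCRIPTION_ID": "subscription_id",
    "AZURE_RESOURCE_GROUP": "resource_group",
    "AZURE_ML_WORKSPACE": "workspace",
}


def workspace_settings_from_env(environ=None):
    """Collect the keyword arguments used for MLClient above from environment variables."""
    environ = os.environ if environ is None else environ
    # fail early, listing every variable that is unset or empty
    missing = [name for name in ENV_TO_ARG if not environ.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(sorted(missing))}")
    return {arg: environ[name] for name, arg in ENV_TO_ARG.items()}


# Usage (not executed here, it needs the SDK and valid credentials):
# ml_client = MLClient(InteractiveBrowserCredential(), **workspace_settings_from_env())
```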

# 3. Implement a reusable data preparation pipeline

motivation...

## 3.1. Unzip archives with a simple command (no code component)

In our use case, we want to avoid downloading the data locally and then uploading it into AzureML. So the first step is to download and unzip the archives directly from the COCO URLs. Unzipping is a simple operation, but running it in a reusable way in a cloud pipeline job can require a lot of boilerplate code. Instead of writing a Python script for that purpose, we just want to run the command itself.
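
To try the download-and-unzip logic on a small archive before running it in the cloud, a rough local equivalent can be written with the Python standard library. This is a sketch for experimentation only, not what the pipeline runs; the function name `download_and_unzip` is made up for this example.

```python
import shutil
import tempfile
import urllib.request
import zipfile
from pathlib import Path


def download_and_unzip(url, output_dir):
    """Download the zip archive at `url` and extract it into `output_dir`.

    Returns the sorted relative paths of the extracted files.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as scratch:
        archive_path = Path(scratch) / "local_archive.zip"
        # stream the response to disk so a large archive never sits fully in memory
        with urllib.request.urlopen(url) as response, open(archive_path, "wb") as archive:
            shutil.copyfileobj(response, archive)
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(output_dir)
    return sorted(
        path.relative_to(output_dir).as_posix()
        for path in output_dir.rglob("*")
        if path.is_file()
    )
```

Because `urllib` also accepts `file://` URLs, the function can be exercised against a tiny local zip file without any network access.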

Here, we will embed the unzip call in a [CommandComponent](TODO) that specifies the inputs, outputs and environment of that command.

In [None]:
from azure.ml import dsl
from azure.ml.entities import CommandComponent, JobInput, JobOutput

download_unzip_component = CommandComponent(
    name="download_and_unzip", # optional: this will show in the UI

    # this component has no code, just a shell command that downloads the archive and unzips it
    # TODO: command = "wgetls -lr ${{inputs.archive_path}}; unzip ${{inputs.archive_path}} -d ${{outputs.extracted_data}}",
    command = "curl -o local_archive.zip ${{inputs.url}} && unzip local_archive.zip -d ${{outputs.extracted_data}}",

    # inputs and outputs need to match with the command
    inputs = {
        'url': { 'type': 'string' }
    },
    outputs = {
        'extracted_data': { 'type': 'path' }
    },

    # we're using a curated environment
    environment = 'AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:9',
)

# we'll package this unzip command as a component to use within a pipeline
download_unzip_component_func = dsl.load_component(component=download_unzip_component)
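
The comment in the cell above notes that inputs and outputs need to match the command. A small check like the one below can catch a mismatch before the job is submitted. It is a hypothetical helper, not an SDK feature, and it only understands the `${{inputs.name}}` and `${{outputs.name}}` placeholder forms that appear in this notebook.

```python
import re

# matches ${{inputs.name}} and ${{outputs.name}}
PLACEHOLDER = re.compile(r"\$\{\{\s*(inputs|outputs)\.(\w+)\s*\}\}")


def check_command_placeholders(command, inputs, outputs):
    """Return a list of mismatches between the command and the declared inputs/outputs."""
    declared = {"inputs": set(inputs), "outputs": set(outputs)}
    used = {"inputs": set(), "outputs": set()}
    for kind, name in PLACEHOLDER.findall(command):
        used[kind].add(name)

    problems = []
    for kind in ("inputs", "outputs"):
        for name in sorted(used[kind] - declared[kind]):
            problems.append(f"{kind}.{name} is used in the command but not declared")
        for name in sorted(declared[kind] - used[kind]):
            problems.append(f"{kind}.{name} is declared but never used in the command")
    return problems


command = (
    "curl -o local_archive.zip ${{inputs.url}} "
    "&& unzip local_archive.zip -d ${{outputs.extracted_data}}"
)
problems = check_command_placeholders(
    command,
    inputs={"url": {"type": "string"}},
    outputs={"extracted_data": {"type": "path"}},
)
print(problems)  # [] because every placeholder is declared and every declaration is used
```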

## 3.2. Custom python script for extracting annotations

In [None]:
parse_annotations_func = dsl.load_component(yaml_file="./components/coco_extract_annotations/spec.yaml")

In [None]:
help(parse_annotations_func)
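
The script behind `./components/coco_extract_annotations` is not reproduced in this notebook. As a rough idea of what extracting annotations for one category can look like, here is a minimal sketch that assumes the standard COCO annotation layout (an `images` list with `id` and `file_name`, and an `annotations` list with `image_id` and `category_id`). The function name and the binary-label output are assumptions made for illustration; the real component may behave differently.

```python
def label_images_for_category(coco, category_id):
    """Map each image file name to 1 if it has an annotation of `category_id`, else 0."""
    positive_ids = {
        annotation["image_id"]
        for annotation in coco["annotations"]
        if annotation["category_id"] == category_id
    }
    return {
        image["file_name"]: int(image["id"] in positive_ids)
        for image in coco["images"]
    }


# tiny hand-written sample in the COCO layout (category 1 is "person", 18 is "dog")
sample = {
    "images": [
        {"id": 1, "file_name": "a.jpg"},
        {"id": 2, "file_name": "b.jpg"},
    ],
    "annotations": [
        {"image_id": 1, "category_id": 1},
        {"image_id": 2, "category_id": 18},
    ],
}
labels = label_images_for_category(sample, category_id=1)
print(labels)  # {'a.jpg': 1, 'b.jpg': 0}
```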

## 3.3. Assemble as a pipeline in python

In [None]:
from azure.ml import dsl

# the dsl decorator tells the sdk that we are defining an AML pipeline
@dsl.pipeline(
    compute="cpu-d14-v2", # name of the compute target that runs the steps (e.g. "cpu-cluster")
    description="e2e images preparation", # free-text description of the pipeline
)
def coco_preparation_pipeline(annotations_archive_url, train_archive_url, valid_archive_url, category_id, category_name):
    # download and unzip the annotations archive
    annotations_unzip_step = download_unzip_component_func(
        url=annotations_archive_url
    )

    # download and unzip the training images
    train_unzip_step = download_unzip_component_func(
        url=train_archive_url
    )

    # download and unzip the validation images
    valid_unzip_step = download_unzip_component_func(
        url=valid_archive_url
    )

    # parse the unzipped annotations for the requested category;
    # consuming the output of annotations_unzip_step makes this step depend on it
    parse_annotations_step = parse_annotations_func(
        annotations_dir=annotations_unzip_step.outputs.extracted_data,
        category_id=category_id,
        category_name=category_name
    )

    # the returned dictionary declares the outputs of the pipeline
    return {
        "train_images": train_unzip_step.outputs.extracted_data,
        "valid_images": valid_unzip_step.outputs.extracted_data,
        "train_annotations": parse_annotations_step.outputs.train_annotations,
        "valid_annotations": parse_annotations_step.outputs.valid_annotations,
    }

# print the help for the pipeline function, including its arguments
help(coco_preparation_pipeline)

In [None]:
from azure.ml.entities import Dataset
from azure.ml.entities import JobInput, JobOutput

pipeline_instance = coco_preparation_pipeline(
    annotations_archive_url="http://images.cocodataset.org/annotations/annotations_trainval2017.zip",
    train_archive_url="http://images.cocodataset.org/zips/train2017.zip",
    valid_archive_url="http://images.cocodataset.org/zips/val2017.zip",
    category_id=1,
    category_name="contains_person"
)

In [None]:
# submit the pipeline job
returned_job = ml_client.jobs.create_or_update(
    pipeline_instance,
    
    # Project's name
    experiment_name="e2e_image_sample",
    
    # If True, steps that do not depend on a failed step keep running after the failure
    continue_run_on_step_failure=True,
)

# get a URL for the status of the job
returned_job.services["Studio"].endpoint
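
To make the effect of `continue_run_on_step_failure` concrete, the sketch below works out which steps cannot run when one step fails, using the dependency graph of the preparation pipeline. It is a plain-Python illustration of the behavior described in the comment above, not the scheduler the service actually uses; `DEPENDS_ON` and `steps_blocked_by` are names invented for this example.

```python
# each step maps to the set of steps whose outputs it consumes
DEPENDS_ON = {
    "annotations_unzip_step": set(),
    "train_unzip_step": set(),
    "valid_unzip_step": set(),
    "parse_annotations_step": {"annotations_unzip_step"},
}


def steps_blocked_by(failed_step, depends_on):
    """Return the steps that depend, directly or transitively, on `failed_step`."""
    blocked = set()
    changed = True
    while changed:
        changed = False
        for step, upstream in depends_on.items():
            # a step is blocked if any of its upstream steps failed or is itself blocked
            if step not in blocked and upstream & (blocked | {failed_step}):
                blocked.add(step)
                changed = True
    return blocked


print(steps_blocked_by("annotations_unzip_step", DEPENDS_ON))  # {'parse_annotations_step'}
print(steps_blocked_by("train_unzip_step", DEPENDS_ON))        # set()
```

In this graph, a failure while unzipping the training images blocks nothing else, so the remaining steps can still complete.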

# 4. Train a distributed GPU job

In [None]:
from azure.ml import dsl

training_func = dsl.load_component(yaml_file="./components/pytorch_dl_train/spec.yaml")

In [None]:
help(training_func)

In [None]:
from azure.ml import dsl

# the dsl decorator tells the sdk that we are defining an AML pipeline
@dsl.pipeline(
    compute="gpu-cluster", # name of the GPU compute target that runs the training step
    description="e2e images classification", # free-text description of the pipeline
)
def coco_model_training(training_images, validation_images, training_annotations, validation_annotations, epochs=1, nodes=1):
    # train the model on the prepared images and annotations
    training_step = training_func(
        train_images=training_images,
        valid_images=validation_images,
        train_annotations=training_annotations,
        valid_annotations=validation_annotations,
        epochs=epochs
    )
    # number of nodes (instances) the training step is distributed over
    training_step.distribution.instance_count = nodes

    # the returned dictionary declares the outputs of the pipeline
    return {
        "model": training_step.outputs.trained_model
    }

# print the help for the pipeline function, including its arguments
help(coco_model_training)
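
The training script in `./components/pytorch_dl_train` is not shown here, so how it splits work across nodes is not documented in this notebook. A common pattern in distributed training is for each process to read its own slice of the dataset. The sketch below shows that idea in plain Python; `shard_indices` is a hypothetical helper, and real samplers such as PyTorch's `DistributedSampler` additionally shuffle and pad the indices.

```python
def shard_indices(num_samples, rank, world_size):
    """Return the sample indices handled by process `rank` out of `world_size` processes."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # round-robin assignment: rank, rank + world_size, rank + 2 * world_size, ...
    return list(range(rank, num_samples, world_size))


# 10 samples split over 4 processes
for rank in range(4):
    print(rank, shard_indices(10, rank, 4))
# 0 [0, 4, 8]
# 1 [1, 5, 9]
# 2 [2, 6]
# 3 [3, 7]
```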

In [None]:
from azure.ml.entities import Dataset
from azure.ml.entities import JobInput, JobOutput

pipeline_instance = coco_model_training(
    # note: the training and validation inputs below point at the same registered datasets
    training_images=ml_client.datasets.get("coco_val2017", version=1),
    validation_images=ml_client.datasets.get("coco_val2017", version=1),
    training_annotations=ml_client.datasets.get("coco_val2017_annotations", version=2),
    validation_annotations=ml_client.datasets.get("coco_val2017_annotations", version=2),
    epochs=10,
    nodes=1
)

In [None]:
# submit the pipeline job
returned_job = ml_client.jobs.create_or_update(
    pipeline_instance,
    
    # Project's name
    experiment_name="e2e_image_sample",
    
    # If True, steps that do not depend on a failed step keep running after the failure
    continue_run_on_step_failure=True,
)

# get a URL for the status of the job
returned_job.services["Studio"].endpoint