TODO: does this work locally AND remotely?
TODO: "This is a N part series of tutorials"


# PyTorch image classification from prep to deployment (part I: data preparation)

**Learning Objectives** - By the end of this tutorial you should be able to use Azure Machine Learning (AzureML) to:
- ingest a large dataset from a simple URL
- quickly implement basic commands for data preparation
- assemble a pipeline with custom data preparation (Python) scripts

**Requirements** - In order to benefit from this tutorial, you need:
- to have provisioned an AzureML workspace
- to have permissions to create simple SKUs in your resource group
- a Python environment

**Motivations** - TODO

# 1. Introduction to the end-to-end scenario

TODO

# 2. Set up the pipeline resources

## 2.1. Connect to your AzureML workspace

In [1]:
# handle to the workspace
from azure.ml import MLClient

# Authentication package
from azure.identity import InteractiveBrowserCredential

In [2]:
# get a handle to the workspace
ml_client = MLClient(
    InteractiveBrowserCredential(), 
    subscription_id = '48bbc269-ce89-4f6f-9a12-c6f91fcb772d',
    resource_group_name = 'aml1p-rg',
    workspace_name = 'aml1p-ml-eus2'
    #subscription_id = '<SUBSCRIPTION_ID>',
    #resource_group_name = '<RESOURCE_GROUP>',
    #workspace_name = '<AML_WORKSPACE_NAME>'
)

## 2.2. Create a compute resource

Each step of an AML pipeline can use a different compute resource to run its job. The resource can be a single-node or multi-node machine running Linux or Windows, or a specific compute fabric such as Spark.

In this section, we provision a Linux compute cluster for the tasks in this tutorial. Let's start by listing the VM sizes available for use. Check the docs for the [full list of VM sizes and prices](https://azure.microsoft.com/en-ca/pricing/details/machine-learning/).

In [None]:
from azure.ml.entities import AmlCompute, Compute
import pandas as pd

# Let's have a peek at the most important properties
VM_dict = {
    vm.name: {
        "family": vm.family,
        "hdd_size": vm.os_vhd_size_mb,
        "memory": vm.memory_gb,
        "cpus": vm.v_cp_us,
        "gpus": vm.gpus,
    }
    for vm in ml_client.compute.list_sizes()
}
VM_df = pd.DataFrame.from_dict(VM_dict, orient='index')

# Let's take a look at one VM family; you can change the code to explore more
VM_df[VM_df['family']=='standardDSv2Family']

In [None]:
for vm in ml_client.compute.list_sizes():
    print(vm)
    for value in vm.estimated_vm_prices.values:
        print(value)
    break
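
Once the sizes are listed, picking one can be made a little more systematic. The sketch below is only an illustration: it filters a hand-written list of sample rows (not live data from `list_sizes()`) by minimum CPU and memory requirements and returns the smallest match. The helper name `smallest_matching_size` and the sample list are hypothetical and not part of the AzureML SDK.

```python
# Hand-written sample rows standing in for the contents of VM_df.
# They are NOT fetched from the workspace; adapt them to your own listing.
SAMPLE_SIZES = [
    {"name": "Standard_DS1_v2", "cpus": 1, "memory": 3.5, "gpus": 0},
    {"name": "Standard_DS2_v2", "cpus": 2, "memory": 7.0, "gpus": 0},
    {"name": "Standard_DS3_v2", "cpus": 4, "memory": 14.0, "gpus": 0},
]


def smallest_matching_size(sizes, min_cpus, min_memory_gb):
    """Return the name of the smallest size meeting both minimums, or None."""
    candidates = [
        s for s in sizes
        if s["cpus"] >= min_cpus and s["memory"] >= min_memory_gb
    ]
    if not candidates:
        return None
    # "Smallest" here means fewest CPUs, then least memory (an arbitrary choice).
    return min(candidates, key=lambda s: (s["cpus"], s["memory"]))["name"]


print(smallest_matching_size(SAMPLE_SIZES, min_cpus=2, min_memory_gb=4))
```

With a real listing, the same filter could be applied to the rows of `VM_df` instead of the sample list.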

For this tutorial we only need a basic cluster, so let's pick `Standard_DS2_v2` and create an AML compute cluster.

In [None]:
# Let's create the AML compute object with the intended parameters
cluster_basic = AmlCompute(
    # Name assigned to the compute cluster
    name="cpu-cluster",

    # AML Compute is AML's on-demand VM service
    type="amlcompute",

    # VM size
    size="Standard_DS2_v2",

    # Minimum number of nodes kept running when no job is running
    min_instances=0,

    # Maximum number of nodes in the cluster
    max_instances=2,

    # How many seconds a node keeps running after its job terminates
    idle_time_before_scale_down=120,

    # Dedicated or LowPriority. The latter is cheaper, but there is a chance of job termination
    tier='dedicated'
)

# Now, we pass the object to the client's begin_create_or_update method
ml_client.begin_create_or_update(cluster_basic)

# 3. Implement a reusable data preparation pipeline

motivation...

## 3.1. Unzip archives with a simple command (no code component)

In our use case, we want to avoid downloading the data locally and then uploading it into AzureML. So first, we need to unzip the archives provided at the COCO web URLs. Unzipping is not a complex operation, but running it in a reusable manner in a cloud pipeline job can require a lot of boilerplate code. Instead of writing a Python script for that purpose, we just want to run the command itself.

Here, we embed the unzip call in a [CommandComponent](TODO) that specifies the inputs, outputs and environment of that command.
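
For comparison, here is a minimal sketch of the kind of Python script the command component lets us avoid. It uses only the standard library and extracts every `.zip` file found in a folder. The function name `unzip_all` and the folder layout are assumptions made for illustration, and the demo runs on a tiny archive created on the fly rather than on the COCO data.

```python
import tempfile
import zipfile
from pathlib import Path


def unzip_all(archive_dir, output_dir):
    """Extract every .zip file found in archive_dir into output_dir.

    Returns the names of the archives that were extracted.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(archive_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(output_dir)
        extracted.append(archive.name)
    return extracted


# Tiny self-contained demo on a throwaway archive (not the COCO data).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "archives"
    dst = Path(tmp) / "extracted"
    src.mkdir()
    with zipfile.ZipFile(src / "sample.zip", "w") as zf:
        zf.writestr("annotations/readme.txt", "hello")
    print(unzip_all(src, dst))
    print((dst / "annotations" / "readme.txt").read_text())
```

Even this short script would still need argument parsing and an environment definition before it could run as a pipeline step, which is the boilerplate a single command sidesteps.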

In [3]:
from azure.ml import dsl
from azure.ml.entities import CommandComponent, JobInput, JobOutput

unzip_component = CommandComponent(
    name="Unzip", # optional: this will show in the UI
    
    # this component has no code, just a simple unzip command
    # TODO: command = "ls -lr ${{inputs.archive_path}}; unzip ${{inputs.archive_path}} -d ${{outputs.extracted_data}}",
    command = "ls -lr ${{inputs.archive_path}}; unzip ${{inputs.archive_path}}/*.zip -d ${{outputs.extracted_data}}",

    # input and output names must match the placeholders used in the command
    inputs = {
        'archive_path': { 'type': 'path' }
    },
    outputs = {
        'extracted_data': { 'type': 'path' }
    },
    
    # we're using a curated environment
    environment = 'AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:9',
)

# we'll package this unzip command as a component to use within a pipeline
unzip_component_func = dsl.load_component(component=unzip_component)
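
To see why the input and output names have to match the command, it can help to picture how the `${{inputs.…}}` and `${{outputs.…}}` placeholders get replaced by concrete paths at run time. The sketch below mimics that substitution with a regular expression. It is an assumption about the general idea, not AzureML's actual implementation, and `render_command` and the `/mnt/...` paths are hypothetical.

```python
import re

# Matches placeholders such as ${{inputs.archive_path}} or ${{outputs.extracted_data}}
PLACEHOLDER = re.compile(r"\$\{\{(inputs|outputs)\.(\w+)\}\}")


def render_command(template, inputs, outputs):
    """Replace every placeholder in template with the matching path.

    Raises KeyError when the command refers to a name that was not declared.
    """
    values = {"inputs": inputs, "outputs": outputs}

    def substitute(match):
        kind, name = match.group(1), match.group(2)
        return values[kind][name]

    return PLACEHOLDER.sub(substitute, template)


command = (
    "ls -lr ${{inputs.archive_path}}; "
    "unzip ${{inputs.archive_path}}/*.zip -d ${{outputs.extracted_data}}"
)
print(render_command(
    command,
    inputs={"archive_path": "/mnt/archive"},         # hypothetical mount point
    outputs={"extracted_data": "/mnt/extracted"},    # hypothetical mount point
))
```

A placeholder whose name is missing from the declared inputs or outputs raises a `KeyError` in this sketch, which mirrors the requirement that the two stay in sync.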

## 3.2. Custom python script for extracting annotations

In [4]:
parse_annotations_func = dsl.load_component(yaml_file="./components/coco_extract_annotations/spec.yaml")

In [None]:
help(parse_annotations_func)
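
The script behind `spec.yaml` is not shown in this notebook. As a rough idea of what extracting annotations for a single category could involve, the sketch below turns a COCO-style annotation dictionary into one binary label per image: 1 if the image has at least one annotation of the requested category, 0 otherwise. The function name `label_images` and the tiny sample are illustrative assumptions; the actual component may work differently.

```python
def label_images(coco, category_id):
    """Map each image file name to 1 if it has an annotation of category_id, else 0."""
    positive_ids = {
        ann["image_id"]
        for ann in coco["annotations"]
        if ann["category_id"] == category_id
    }
    return {
        img["file_name"]: int(img["id"] in positive_ids)
        for img in coco["images"]
    }


# Tiny hand-written sample following the COCO instances layout
# ("images" and "annotations" lists); not taken from the real dataset.
sample = {
    "images": [
        {"id": 1, "file_name": "a.jpg"},
        {"id": 2, "file_name": "b.jpg"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1},   # category 1 is "person" in COCO
        {"id": 11, "image_id": 2, "category_id": 18},  # category 18 is "dog" in COCO
    ],
}

print(label_images(sample, category_id=1))
```

With a real annotation file, the dictionary would come from `json.load` on one of the extracted JSON files.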

## 3.3. Assemble as a pipeline in python

In [None]:
from azure.ml import dsl

# load both components as functions that can be called within a pipeline
unzip_component_func = dsl.load_component(component=unzip_component)
parse_annotations_func = dsl.load_component(yaml_file="./components/coco_extract_annotations/spec.yaml")

# the dsl decorator tells the sdk that we are defining an AML pipeline
@dsl.pipeline(
    compute="cpu-cluster", # the cluster created in section 2.2
    description="e2e images preparation",
)
def coco_preparation_pipeline(annotations_archive, train_archive, valid_archive, category_id, category_name, use_val_as_train=False):
    # collect the pipeline outputs by name
    custom_outputs = {}

    annotations_unzip_step = unzip_component_func(
        archive_path=annotations_archive
    )

    if use_val_as_train:
        train_unzip_step = unzip_component_func(
            archive_path=valid_archive
        )
    else:
        train_unzip_step = unzip_component_func(
            archive_path=train_archive
        )
    custom_outputs['train_images'] = train_unzip_step.outputs.extracted_data

    valid_unzip_step = unzip_component_func(
        archive_path=valid_archive
    )
    custom_outputs['valid_images'] = valid_unzip_step.outputs.extracted_data

    parse_annotations_step = parse_annotations_func(
        annotations_dir=annotations_unzip_step.outputs.extracted_data,
        category_id=category_id,
        category_name=category_name
    )
    if use_val_as_train:
        # the validation set doubles as the training set, so use its annotations
        custom_outputs["train_annotations"] = parse_annotations_step.outputs.valid_annotations
    else:
        custom_outputs["train_annotations"] = parse_annotations_step.outputs.train_annotations
    custom_outputs["valid_annotations"] = parse_annotations_step.outputs.valid_annotations

    return custom_outputs

help(coco_preparation_pipeline)
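
The `use_val_as_train` switch is easy to get wrong, so here is a small, SDK-free sketch of the routing as we read its intent: when the flag is set, both the training images and the training annotations come from the validation split. The function `route_sources` and the string labels it returns are purely illustrative.

```python
def route_sources(use_val_as_train=False):
    """Return which split feeds each pipeline output."""
    train_source = "valid" if use_val_as_train else "train"
    return {
        "train_images": f"{train_source}_archive",
        "valid_images": "valid_archive",
        "train_annotations": f"{train_source}_annotations",
        "valid_annotations": "valid_annotations",
    }


print(route_sources(use_val_as_train=False))
print(route_sources(use_val_as_train=True))
```

Using the much smaller validation archive as the training set is a convenient way to shorten test runs of the pipeline.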

In [None]:
from azure.ml.entities import Dataset
from azure.ml.entities import JobInput, JobOutput

pipeline_instance = coco_preparation_pipeline(
    # TODO: annotations_archive=JobInput(file="http://images.cocodataset.org/annotations/annotations_trainval2017.zip"),
    annotations_archive=ml_client.datasets.get("coco_annotatons_trainval2017_archive", version="1"),
    # TODO: train_archive=JobInput(file="http://images.cocodataset.org/zips/train2017.zip"),
    train_archive=ml_client.datasets.get("coco_train2017_archive", version="1"),
    # TODO: valid_archive=JobInput(file="http://images.cocodataset.org/zips/val2017.zip"),
    valid_archive=ml_client.datasets.get("coco_val2017_archive", version="1"),
    category_id=1,
    category_name="contains_person"
)

In [None]:
# submit the pipeline job
returned_job = ml_client.jobs.create_or_update(
    pipeline_instance,
    
    # Name of the experiment under which this run is tracked
    experiment_name="e2e_image_preparation",
    
    # Steps that do not depend on a failed component keep running after that failure
    continue_run_on_step_failure=True,
)

# get a URL for the status of the job
returned_job.services["Studio"].endpoint