# AutoML Image Classification Multilabel in pipeline

**Requirements** - In order to benefit from this tutorial, you will need:
- A basic understanding of Machine Learning
- An Azure account with an active subscription - [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace with a compute cluster - [Configure workspace](../../configuration.ipynb)
- A Python environment
- The Azure Machine Learning Python SDK v2 installed - [install instructions](../../../README.md) - check the getting started section

**Learning Objectives** - By the end of this tutorial, you should be able to:
- Create a pipeline with an AutoML Image Classification Multilabel task.

**Motivations** - This notebook explains how to use an AutoML Image Classification Multilabel task inside a pipeline.

# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

## 1.1 Import the required libraries

In [None]:
# import required libraries
from azure.identity import DefaultAzureCredential

from azure.ai.ml import MLClient, Input, command, Output
from azure.ai.ml.automl import (
    image_classification_multilabel,
    SearchSpace,
    ClassificationMultilabelPrimaryMetrics,
)
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.sweep import BanditPolicy, Choice, Uniform

## 1.2 Configure workspace details and get a handle to the workspace

To connect to a workspace, we need identifier parameters: a subscription ID, resource group and workspace name. We pass these details to `MLClient` from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace. This tutorial uses [default Azure authentication](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python). Check the [configuration notebook](../../configuration.ipynb) for more details on how to configure credentials and connect to a workspace.

In [None]:
credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Enter details of your AML workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AML_WORKSPACE_NAME>"
    ml_client = MLClient(credential, subscription_id, resource_group, workspace)
print(ml_client)
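
`MLClient.from_config` looks for a `config.json` file that describes the workspace; when none is found, the cell above falls back to the placeholder values. As a minimal sketch, assuming you prefer the config-file route, the helper below writes such a file with the three keys the SDK reads. The helper name `write_workspace_config` is hypothetical and the values passed to it are your own workspace details.

```python
import json
from pathlib import Path


def write_workspace_config(folder, subscription_id, resource_group, workspace_name):
    """Write a config.json holding the workspace identifiers (hypothetical helper)."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    config_path = Path(folder) / "config.json"
    # Create the target folder if needed, then write the file
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2))
    return config_path
```

Calling `write_workspace_config(".", "<SUBSCRIPTION_ID>", "<RESOURCE_GROUP>", "<AML_WORKSPACE_NAME>")` with real values places the file next to the notebook, where `from_config` can find it.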

## 1.3 Download Data

Load the 'fridge items' dataset from a JSON file and MLTable definition.

In order to generate models for computer vision, you will need to bring in labeled image data as input for model training in the form of an Azure Machine Learning MLTable. 

In this notebook, we use a toy dataset called Fridge Objects, which consists of 134 images covering 4 classes of beverage container {can, carton, milk bottle, water bottle}, photographed on different backgrounds.

All images in this notebook are hosted in [this repository](https://github.com/microsoft/computervision-recipes) and are made available under the [MIT license](https://github.com/microsoft/computervision-recipes/blob/master/LICENSE).

**NOTE:** In this PRIVATE PREVIEW we're defining the MLTable in a separate folder and .YAML file.
In later versions, you'll be able to do it all in Python APIs.

In [None]:
import os
import urllib.request
from zipfile import ZipFile

# create the data folder if it does not exist yet
os.makedirs("./data", exist_ok=True)

# download data
download_url = "https://cvbp-secondary.z19.web.core.windows.net/datasets/image_classification/multilabelFridgeObjects.zip"
data_file = "./data/multilabelFridgeObjects.zip"
urllib.request.urlretrieve(download_url, filename=data_file)

# extract files
with ZipFile(data_file, "r") as zip_file:
    print("extracting files...")
    zip_file.extractall(path="./data")
    print("done")
# delete zip file
os.remove(data_file)

This is a sample image from this dataset:

In [None]:
from IPython.display import Image

sample_image = "./data/multilabelFridgeObjects/images/56.jpg"
Image(filename=sample_image)

### Upload the images to Datastore through an AML Data asset (URI Folder)

In order to use the data for training in Azure ML, we upload it to the default Azure Blob Storage of our Azure ML workspace.

For further details, refer to the URI FOLDER data asset example: https://github.com/Azure/azureml-examples/blob/samuel100/data-samples/sdk/assets/data/data.ipynb

In [None]:
# Uploading image files by creating a 'data asset URI FOLDER':

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

my_data = Data(
    path="./data/multilabelFridgeObjects",
    type=AssetTypes.URI_FOLDER,
    description="Fridge-items images multilabel",
    name="fridge-items-images-multilabel",
)

uri_folder_data_asset = ml_client.data.create_or_update(my_data)

print(uri_folder_data_asset)
print("")
print("Path to folder in Blob Storage:")
print(uri_folder_data_asset.path)
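
The printed path is reused in the next step as the prefix of every image URL. That step concatenates strings directly, which assumes the path ends with a slash. As a hedged sketch, the small helper below makes the join independent of a trailing slash. The name `join_asset_url` is hypothetical (it is not part of the SDK), and the base URL is a placeholder rather than a real datastore path.

```python
def join_asset_url(base_url, relative_path):
    """Join a folder URL and a relative path with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + relative_path.lstrip("/")


# Placeholder base URL; the real value is printed by the cell above.
base = "azureml://<datastore-path>/multilabelFridgeObjects"

# Both calls produce the same URL, with or without the trailing slash.
print(join_asset_url(base, "images/56.jpg"))
print(join_asset_url(base + "/", "images/56.jpg"))
```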

### Convert the downloaded data to JSON metadata

In this example, the fridge object dataset is annotated in a CSV file rather than through folder names. The dataset folder contains:

/images
/labels.csv

Each line of `labels.csv` after the header maps an image filename to its labels, separated by a comma. Because this is a multilabel classification problem, an image can carry several labels, which are separated by spaces.

In order to use this data to create an AzureML MLTable, we first need to convert it to the required JSONL format. 
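
Each line of a JSONL file is one standalone JSON object. As a rough sketch of the target format, assuming the two fields used in this notebook (`image_url` and a list-valued `label`), one record could be built as follows. The annotation line, the base URL, and the helper name `csv_line_to_record` are illustrative and not taken from the dataset.

```python
import json


def csv_line_to_record(line, base_url):
    """Convert one 'filename,label1 label2' line into a JSONL record."""
    filename, labels = line.strip().split(",")
    return {
        "image_url": f"{base_url}images/{filename}",
        "label": labels.strip().split(" "),
    }


# Hypothetical annotation line and placeholder base URL
record = csv_line_to_record("56.jpg,can carton\n", "azureml://<datastore-path>/")

# One JSONL line is the JSON dump of one record
print(json.dumps(record))
```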

The following script creates two .jsonl files (one for training and one for validation), each inside its own MLTable folder. Every fifth annotation goes to the validation file, so about 20% of the data is used for validation.

In [None]:
import json
import os

src_images = "./data/multilabelFridgeObjects/"

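# Each JSONL file is written into its related MLTable folder
training_mltable_path = "./data/training-mltable-folder/"
validation_mltable_path = "./data/validation-mltable-folder/"
# Create the folders if they do not exist yet
os.makedirs(training_mltable_path, exist_ok=True)
os.makedirs(validation_mltable_path, exist_ok=True)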

train_validation_ratio = 5

# Path to the training and validation files
train_annotations_file = os.path.join(training_mltable_path, "train_annotations.jsonl")
validation_annotations_file = os.path.join(
    validation_mltable_path, "validation_annotations.jsonl"
)

# Baseline of json line dictionary
json_line_sample = {
    "image_url": uri_folder_data_asset.path,
    "label": [],
}

# Path to the labels file.
labelFile = os.path.join(src_images, "labels.csv")

# Read each annotation and convert it to jsonl line
with open(train_annotations_file, "w") as train_f:
    with open(validation_annotations_file, "w") as validation_f:
        with open(labelFile, "r") as labels:
            for i, line in enumerate(labels):
                # Skipping the title line and any empty lines.
                if i == 0 or len(line.strip()) == 0:
                    continue
                line_split = line.strip().split(",")
                if len(line_split) != 2:
                    print("Skipping the invalid line: {}".format(line))
                    continue
                json_line = dict(json_line_sample)
                json_line["image_url"] += f"images/{line_split[0]}"
                json_line["label"] = line_split[1].strip().split(" ")

                if i % train_validation_ratio == 0:
                    # validation annotation
                    validation_f.write(json.dumps(json_line) + "\n")
                else:
                    # train annotation
                    train_f.write(json.dumps(json_line) + "\n")
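
The pipeline in the next section reads each folder as an `mltable` input, so every folder needs an `MLTable` definition file next to its JSONL file. This notebook expects those files to exist already. As a sketch only, assuming the YAML layout commonly used for AutoML image tasks (read the JSON lines, then treat `image_url` as a stream), such a file could be generated as below. The template contents and the helper name `write_mltable` are assumptions; check them against the MLTable documentation before relying on them.

```python
from pathlib import Path

# Assumed MLTable layout for a JSONL annotation file (verify against the docs)
MLTABLE_TEMPLATE = """\
paths:
  - file: ./{jsonl_name}
transformations:
  - read_json_lines:
      encoding: utf8
      invalid_lines: error
      include_path_column: false
  - convert_column_types:
      - columns: image_url
        column_type: stream_info
"""


def write_mltable(folder, jsonl_name):
    """Write an MLTable definition into the folder unless one already exists."""
    mltable_path = Path(folder) / "MLTable"
    if not mltable_path.exists():
        mltable_path.parent.mkdir(parents=True, exist_ok=True)
        mltable_path.write_text(MLTABLE_TEMPLATE.format(jsonl_name=jsonl_name))
    return mltable_path
```

For the folders used above, this would be `write_mltable("./data/training-mltable-folder", "train_annotations.jsonl")` and the equivalent call for the validation folder. Existing files are left untouched.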

# 2. Basic pipeline job with Image Classification Multilabel task

## 2.1 Build pipeline

In [None]:
# Note: the Docker image used here does not suit every GPU compute size. If the experiment fails, create a GPU compute with the following command:
# !az ml compute create -n gpu-cluster --type amlcompute --min-instances 0 --max-instances 4 --size Standard_NC12

In [None]:
# Define pipeline
@pipeline(
    description="AutoML Image Classification Multilabel Pipeline",
)
def automl_image_classification_multilabel(
    image_classification_multilabel_train_data,
    image_classification_multilabel_validation_data,
):
    # define the automl image_classification_multilabel task with automl function
    image_classification_multilabel_node = image_classification_multilabel(
        training_data=image_classification_multilabel_train_data,
        validation_data=image_classification_multilabel_validation_data,
        target_column_name="label",
        primary_metric=ClassificationMultilabelPrimaryMetrics.IOU,
        # currently the "mlflow_model" output must be specified explicitly to reference it in following nodes
        outputs={"best_model": Output(type="mlflow_model")},
    )
    image_classification_multilabel_node.set_limits(
        max_trials=10, max_concurrent_trials=2, timeout_minutes=180
    )

    image_classification_multilabel_node.extend_search_space(
        [
            SearchSpace(
                model_name=Choice(["vitb16r224"]),
                learning_rate=Uniform(0.005, 0.05),
                number_of_epochs=Choice([15, 30]),
                gradient_accumulation_step=Choice([1, 2]),
            ),
            SearchSpace(
                model_name=Choice(["seresnext"]),
                learning_rate=Uniform(0.005, 0.05),
                # model-specific; validation_resize_size should be greater than or equal to validation_crop_size
                validation_resize_size=Choice([288, 320, 352]),
                validation_crop_size=Choice([224, 256]),  # model-specific
                training_crop_size=Choice([224, 256]),  # model-specific
            ),
        ]
    )

    image_classification_multilabel_node.set_sweep(
        sampling_algorithm="Random",
        early_termination=BanditPolicy(
            evaluation_interval=2, slack_factor=0.2, delay_evaluation=6
        ),
    )

    # define command function for registering the model
    command_func = command(
        inputs=dict(
            model_input_path=Input(type="mlflow_model"),
            model_base_name="image_classification_multilabel_example_model",
        ),
        outputs=dict(model_id_path=Output(type="uri_folder")),
        code="./register.py",
        command="python register.py "
        + "--model_input_path ${{inputs.model_input_path}} "
        + "--model_base_name ${{inputs.model_base_name}} "
        + "--model_id_path ${{outputs.model_id_path}}",
        environment="AzureML-ACPT-pytorch-1.11-py38-cuda11.3-gpu:4",
    )
    register_model = command_func(
        model_input_path=image_classification_multilabel_node.outputs.best_model
    )

    # define command function for the deployment
    deploy_command_func = command(
        inputs=dict(
            model_id_path=Input(type="uri_folder"),
            endpoint_name="imgmllabel-endpoint",
            deployment_name="imgmllabel-deployment",
        ),
        code="./deploy.py",
        command="python deploy.py "
        + "--model_id_path ${{inputs.model_id_path}} "
        + "--endpoint_name ${{inputs.endpoint_name}} "
        + "--deployment_name ${{inputs.deployment_name}}",
        environment="AzureML-ACPT-pytorch-1.11-py38-cuda11.3-gpu:4",
    )
    deploy_model = deploy_command_func(
        model_id_path=register_model.outputs.model_id_path
    )

data_folder = "./data"
pipeline = automl_image_classification_multilabel(
    image_classification_multilabel_train_data=Input(
        path=f"{data_folder}/training-mltable-folder/", type="mltable"
    ),
    image_classification_multilabel_validation_data=Input(
        path=f"{data_folder}/validation-mltable-folder/", type="mltable"
    ),
)

# set pipeline-level compute (the MSI should have access to this compute)
pipeline.settings.default_compute = "b-dedicated-Standard-NC12"
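
The `BanditPolicy` in the sweep settings stops trials that lag too far behind the best one. As a simplified sketch of the slack-factor rule for a maximized metric such as IoU, a trial is stopped when its metric falls below the best metric divided by `1 + slack_factor`. The sketch ignores `evaluation_interval` and `delay_evaluation`, the function name and metric values are hypothetical, and the real policy is applied by the service.

```python
def bandit_should_stop(trial_metric, best_metric, slack_factor):
    """Return True if a trial falls outside the allowed slack (maximized metric)."""
    return trial_metric < best_metric / (1 + slack_factor)


# Hypothetical metric values with the slack_factor used in this notebook
best_iou = 0.90
slack_factor = 0.2

for trial_iou in (0.70, 0.80):
    decision = "stop" if bandit_should_stop(trial_iou, best_iou, slack_factor) else "continue"
    print(f"trial IoU {trial_iou:.2f}: {decision}")
```

With a best IoU of 0.90 and a slack factor of 0.2, the cut-off is 0.90 / 1.2 = 0.75, so the trial at 0.70 is stopped and the trial at 0.80 keeps running.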

## 2.2 Submit pipeline job

In [None]:
# submit the pipeline job
pipeline_job = ml_client.jobs.create_or_update(
    pipeline, experiment_name="pipeline_samples"
)
pipeline_job

In [None]:
# Wait until the job completes
ml_client.jobs.stream(pipeline_job.name)
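
When the streamed job reaches the registration step, it runs `register.py` with the three flags set in the `command` definition in section 2.1. The script itself is not shown in this notebook. As a sketch of its command-line contract only, assuming a plain `argparse` interface, the flags could be parsed as below. The function name `parse_register_args` and the example paths are hypothetical, and the registration logic is deliberately left out.

```python
import argparse


def parse_register_args(argv=None):
    """Parse the three flags the pipeline passes to register.py."""
    parser = argparse.ArgumentParser(description="Register an MLflow model")
    parser.add_argument("--model_input_path", type=str, required=True)
    parser.add_argument("--model_base_name", type=str, required=True)
    parser.add_argument("--model_id_path", type=str, required=True)
    return parser.parse_args(argv)


# Example invocation with hypothetical local paths
args = parse_register_args(
    [
        "--model_input_path", "./model",
        "--model_base_name", "image_classification_multilabel_example_model",
        "--model_id_path", "./model_id",
    ]
)
print(args.model_base_name)
```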

# Next Steps
You can see further examples of running a pipeline job [here](../).