## Multi-label Multimodal Classification using pipeline component

This sample shows how to use the `multimodal_classification_pipeline` component from the `azureml` system registry to fine-tune a model for a multi-label multimodal classification task on a Chest X-ray dataset. We then deploy the fine-tuned model to an online endpoint for real-time inference.

### Training data
We will use the [ChXray](https://automlresources-prod.azureedge.net/datasets/ChXray.zip) dataset.  <br />
Original source of dataset: https://nihcc.app.box.com/v/ChestXray-NIHCC/file/220660789610 <br />
[arXiv:1705.02315](https://arxiv.org/abs/1705.02315v5) [cs.CV]

### Model
We will use the `mmeft` model in this notebook.

### Outline
1. Install dependencies
2. Set up prerequisites such as compute
3. Pick a model to fine tune
4. Prepare dataset for finetuning the model
5. Submit the fine tuning job using the `multimodal_classification_pipeline` component
6. Review training and evaluation metrics
7. Register the fine tuned model
8. Deploy the fine tuned model for real time inference
9. Test the deployed endpoint
10. Clean up resources

### 1. Install dependencies
Before starting, if you are running the notebook in Azure Machine Learning studio or locally for the first time, install the following packages:

In [None]:
! pip install azure-ai-ml==1.12.0
! pip install azure-identity==1.13.0
! pip install scikit-learn==1.3.2

### 2. Set up prerequisites

#### 2.1 Connect to Azure Machine Learning workspace

Before we dive into the code, you'll need to connect to your workspace. The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning.

We use `DefaultAzureCredential` to get access to the workspace. `DefaultAzureCredential` should be capable of handling most scenarios. To learn more about other available credentials, see the [set up authentication doc](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-setup-authentication?tabs=sdk) and the [azure-identity reference doc](https://learn.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python).

Replace `AML_WORKSPACE_NAME`, `RESOURCE_GROUP` and `SUBSCRIPTION_ID` with their respective values in the cell below.

In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# can rename to any valid name
experiment_name = "AzureML-Train-Finetune-Multimodal-MultiLabel-Samples"

credential = DefaultAzureCredential()
workspace_ml_client = None
try:
    workspace_ml_client = MLClient.from_config(credential)
    subscription_id = workspace_ml_client.subscription_id
    resource_group = workspace_ml_client.resource_group_name
    workspace_name = workspace_ml_client.workspace_name
except Exception as ex:
    print(ex)
    # Enter details of your AML workspace
    subscription_id = "SUBSCRIPTION_ID"
    resource_group = "RESOURCE_GROUP"
    workspace_name = "AML_WORKSPACE_NAME"

workspace_ml_client = MLClient(
    credential, subscription_id, resource_group, workspace_name
)
registry_ml_client = MLClient(
    credential,
    subscription_id,
    resource_group,
    registry_name="azureml",
)

#### 2.2 Create compute

In order to finetune a model on Azure Machine Learning studio, you will need to create a compute resource first. **Creating a compute may take 3-4 minutes.** 

For additional references, see [Azure Machine Learning in a Day](https://github.com/Azure/azureml-examples/blob/main/tutorials/azureml-in-a-day/azureml-in-a-day.ipynb). 

##### Create CPU compute for model selection and data preprocess component

In [None]:
from azure.ai.ml.entities import AmlCompute
from azure.core.exceptions import ResourceNotFoundError

model_import_cluster_name = "sample-cpu-cluster"
data_preprocess_cluster_name = model_import_cluster_name
try:
    _ = workspace_ml_client.compute.get(model_import_cluster_name)
    print("Found existing compute target.")
except ResourceNotFoundError:
    print("Creating a new compute target...")
    compute_config = AmlCompute(
        name=model_import_cluster_name,
        type="amlcompute",
        size="Standard_D12_v2",
        idle_time_before_scale_down=120,
        min_instances=0,
        max_instances=4,
    )
    workspace_ml_client.begin_create_or_update(compute_config).result()

##### Create GPU compute for finetune component

The list of GPU machines can be found [here](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu).

In [None]:
finetune_cluster_name = "sample-finetune-cluster-gpu"

try:
    _ = workspace_ml_client.compute.get(finetune_cluster_name)
    print("Found existing compute target.")
except ResourceNotFoundError:
    print("Creating a new compute target...")
    compute_config = AmlCompute(
        name=finetune_cluster_name,
        type="amlcompute",
        size="Standard_NC6s_v3",
        idle_time_before_scale_down=120,
        min_instances=0,
        max_instances=4,
    )
    workspace_ml_client.begin_create_or_update(compute_config).result()

### 3. Pick a foundation model to fine tune

We will use the `mmeft` model in this notebook. It is available in `azureml` system registry.


In [None]:
aml_registry_model_name = "mmeft"
use_model_name = aml_registry_model_name
foundation_models = registry_ml_client.models.list(aml_registry_model_name)
foundation_model = max(foundation_models, key=lambda x: x.version)
print(
    f"\n\nUsing model name: {foundation_model.name}, version: {foundation_model.version}, id: {foundation_model.id} for fine tuning"
)
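
One caveat with the `max(..., key=lambda x: x.version)` selection above: if versions are returned as strings, they compare lexicographically, so `"9"` ranks above `"10"`. The sketch below assumes integer-like version strings; the `latest_version` helper and the version values are hypothetical and only illustrate the difference.

```python
def latest_version(versions):
    """Return the highest version, comparing integer-like strings numerically."""
    return max(versions, key=int)


# Hypothetical version strings, for illustration only
versions = ["2", "9", "10"]

print(max(versions))             # string comparison picks "9"
print(latest_version(versions))  # numeric comparison picks "10"
```

If the registry ever returns non-integer versions, a numeric key like this would raise a `ValueError`, so treat it as an option to consider rather than a drop-in replacement.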

### 4. Prepare the dataset for fine-tuning the model

We will use the [ChXray](https://automlresources-prod.azureedge.net/datasets/ChXray.zip) dataset. It has a `.csv` file with features and labels, and the images are stored separately in the `images` folder. The label column is `Finding_Labels`.

#### 4.1 Download the Data
We first download and unzip the data locally. By default, the data is downloaded to the `./data` folder in the current directory.
To download the data to a different location, update `dataset_parent_dir = ...` in the following cell.

In [None]:
import os
import urllib.request
from zipfile import ZipFile

# Change to a different location if you prefer
dataset_parent_dir = "./data"

# Create data folder if it doesn't exist.
os.makedirs(dataset_parent_dir, exist_ok=True)

# Download data
download_url = "https://automlresources-prod.azureedge.net/datasets/ChXray.zip"

# Extract current dataset name from dataset url
dataset_name = os.path.split(download_url)[-1].split(".")[0]

# Get the data zip file path
data_file = os.path.join(dataset_parent_dir, f"{dataset_name}.zip")

# Download the dataset
urllib.request.urlretrieve(download_url, filename=data_file)

# Extract files
with ZipFile(data_file, "r") as zip_file:
    print("extracting files...")
    zip_file.extractall(path=dataset_parent_dir)
    print("done")
# Delete zip file
os.remove(data_file)

In [None]:
# Initialize dataset specific fields

dataset_dir = os.path.join(dataset_parent_dir, dataset_name)
input_csv_file_path = os.path.join(dataset_dir, "chxray_multilabel_dataset.csv")
output_csv_file_path = os.path.join(dataset_dir, "mmeft_chxray_multilabel_dataset.csv")
# Directory in which we have our images
images_dir = os.path.join(dataset_dir, "images")

image_column_name = "Image_Index"
label_column_name = "Finding_Labels"

# columns to be ignored while training
columns_to_drop = "Patient_Id,OriginalImage_Width,OriginalImage_Height,OriginalImagePixelSpacing_x,OriginalImagePixelSpacing_y"

In [None]:
import pandas as pd

# Read a sample row from dataset
df = pd.read_csv(input_csv_file_path)
print(f"rows = {df.shape[0]}, columns = {df.shape[1]} \n")
print("Sample row\n")
print(df.head(1))

#### 4.2 Upload the images to Datastore through an AML Data asset (URI Folder)

In order to use the data for training in Azure ML, we upload it to the default Azure Blob Storage of our Azure ML workspace.

In [None]:
# Uploading image files by creating a 'data asset URI FOLDER':

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

my_data = Data(
    path=images_dir,
    type=AssetTypes.URI_FOLDER,
    description="Chest X-ray images",
    name="chxray-multimodal-multilabel-classif",
)

uri_folder_data_asset = workspace_ml_client.data.create_or_update(my_data)

print(uri_folder_data_asset)
print("Path to folder in Blob Storage:")
print(uri_folder_data_asset.path)

#### 4.3 Update image URL in dataset

The [csv_processor.py](../utils/csv_processor.py) script updates the image paths in the `.csv` file, from the local path to the path in the AML datastore where we uploaded the images in Section 4.2.

In [None]:
!python ../utils/csv_processor.py \
    --img_col_name {image_column_name} \
    --image_url_prefix {uri_folder_data_asset.path} \
    --input_file_name {input_csv_file_path} \
    --output_file_name {output_csv_file_path}
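
The actual logic lives in `csv_processor.py`, which is not reproduced here. As a rough illustration of the idea only, the sketch below rewrites an image column by attaching each file name to a URL prefix. The helper `prefix_image_column`, the sample row, the placeholder prefix, and the choice to keep only the file name are all assumptions, not the script's confirmed behavior.

```python
import csv
import io


def prefix_image_column(csv_text, img_col_name, image_url_prefix):
    """Return CSV text where each value in img_col_name is replaced by prefix + file name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Assumption: keep only the file name and attach it to the remote folder prefix
        file_name = row[img_col_name].replace("\\", "/").split("/")[-1]
        row[img_col_name] = image_url_prefix.rstrip("/") + "/" + file_name
        writer.writerow(row)
    return out.getvalue()


# Hypothetical input and placeholder prefix, for illustration only
sample_csv = "Image_Index,Finding_Labels\nimages/0001.png,label_a\n"
updated_csv = prefix_image_column(sample_csv, "Image_Index", "azureml://example-prefix/")
print(updated_csv)
```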

#### 4.4 Split the downloaded data into Train/Validation dataset

For documentation on preparing the datasets beyond this notebook, refer to the [documentation on how to prepare datasets](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-prepare-datasets-for-automl-images).

To create an AzureML MLTable from this data, it must first be in `.csv` or `.jsonl` format. The following script creates two `.jsonl` files (one for training and one for validation), each in its own MLTable folder. In this example, 20% of the data is kept for validation.

The split reads the `.csv` file produced in the previous step, so the image paths already point to the datastore.

In [None]:
import os
from sklearn.model_selection import train_test_split

# We will copy each JSONL file within its related MLTable folder
training_mltable_path = os.path.join(dataset_parent_dir, "training-mltable-folder")
validation_mltable_path = os.path.join(dataset_parent_dir, "validation-mltable-folder")

# Create the folders if they don't exist
os.makedirs(training_mltable_path, exist_ok=True)
os.makedirs(validation_mltable_path, exist_ok=True)

# Path to the training and validation files
train_annotations_file = os.path.join(training_mltable_path, "train_annotations.jsonl")
validation_annotations_file = os.path.join(validation_mltable_path, "validation_annotations.jsonl")

train_validation_ratio = 0.2

# Read the processed dataset and split it into training and validation sets
df = pd.read_csv(output_csv_file_path)
train_df, val_df = train_test_split(
    df,
    test_size=train_validation_ratio,
    random_state=0,
    stratify=df[[label_column_name]],
)

# Save the DataFrame to a JSON Lines file
train_df.to_json(train_annotations_file, orient="records", lines=True)
val_df.to_json(validation_annotations_file, orient="records", lines=True)
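
For reference, a JSON Lines file holds one JSON object per line, which is the layout that `to_json(..., orient="records", lines=True)` writes. The standard-library sketch below writes and reads such a file; the record values and the file name are hypothetical placeholders.

```python
import json
import os
import tempfile

# Hypothetical records, for illustration only
records = [
    {"Image_Index": "azureml://example-prefix/0001.png", "Finding_Labels": "label_a"},
    {"Image_Index": "azureml://example-prefix/0002.png", "Finding_Labels": "label_b"},
]

jsonl_path = os.path.join(tempfile.mkdtemp(), "annotations.jsonl")

# Write one JSON object per line
with open(jsonl_path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back line by line, skipping empty lines
with open(jsonl_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), "records read")
```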

#### 4.5 Create MLTable data input

Create MLTable data input using the jsonl files created above.

For documentation on creating your own MLTable assets for jobs beyond this notebook, refer to the resources below:
- [MLTable YAML Schema](https://learn.microsoft.com/en-us/azure/machine-learning/reference-yaml-mltable) - covers how to write MLTable YAML, which is required for each MLTable asset.
- [Create MLTable data asset](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-data-assets?tabs=Python-SDK#create-a-mltable-data-asset) - covers how to create MLTable data asset. 

In [None]:
def create_ml_table_file(filename) -> str:
    """Create ML Table definition

    :param filename: filename
    :type filename: str
    :return: ML Table definition
    :rtype: str
    """

    return (
        "paths:\n"
        "  - file: ./{0}\n"
        "transformations:\n"
        "  - read_json_lines:\n"
        "        encoding: utf8\n"
        "        invalid_lines: error\n"
        "        include_path_column: false\n"
        "  - convert_column_types:\n"
        "      - columns: image_url\n"
        "        column_type: stream_info"
    ).format(filename)


def save_ml_table_file(output_path: str, mltable_file_contents: str) -> None:
    """Save ML Table definition to file

    :param output_path: output path
    :type output_path: str
    :param mltable_file_contents: ML Table definition
    :type mltable_file_contents: str
    """
    with open(os.path.join(output_path, "MLTable"), "w") as f:
        f.write(mltable_file_contents)


# Create and save train mltable
train_mltable_file_contents = create_ml_table_file(
    os.path.basename(train_annotations_file)
)
save_ml_table_file(training_mltable_path, train_mltable_file_contents)

# Create and save validation mltable
validation_mltable_file_contents = create_ml_table_file(
    os.path.basename(validation_annotations_file)
)
save_ml_table_file(validation_mltable_path, validation_mltable_file_contents)
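
Each MLTable folder should now contain an `MLTable` file next to the JSONL file it references. The sketch below is one way to sanity-check that layout before submitting a job. `mltable_folder_is_complete` is a hypothetical helper, not part of the Azure ML SDK, and the demo uses a temporary folder with made-up file names rather than the folders created above.

```python
import os
import tempfile


def mltable_folder_is_complete(folder, data_file_name):
    """Return True if folder holds the data file and an MLTable file that mentions it."""
    mltable_path = os.path.join(folder, "MLTable")
    data_path = os.path.join(folder, data_file_name)
    if not (os.path.isfile(mltable_path) and os.path.isfile(data_path)):
        return False
    with open(mltable_path, encoding="utf-8") as f:
        return data_file_name in f.read()


# Demo on a temporary folder with hypothetical file names
demo_folder = tempfile.mkdtemp()
with open(os.path.join(demo_folder, "demo.jsonl"), "w", encoding="utf-8") as f:
    f.write("{}\n")
print(mltable_folder_is_complete(demo_folder, "demo.jsonl"))  # MLTable file not written yet

with open(os.path.join(demo_folder, "MLTable"), "w", encoding="utf-8") as f:
    f.write("paths:\n  - file: ./demo.jsonl\n")
print(mltable_folder_is_complete(demo_folder, "demo.jsonl"))  # both files present
```

In the notebook, the same check could be called with `training_mltable_path` and `os.path.basename(train_annotations_file)`.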

### 5. Submit the fine tuning job using `multimodal_classification_pipeline` component
 
Create the job that uses the `multimodal_classification_pipeline` component for the multi-label multimodal classification task. Section 5.2 lists the parameters supported for fine tuning.

#### 5.1 Create component

In [None]:
FINETUNE_PIPELINE_COMPONENT_NAME = "multimodal_classification_pipeline"
pipeline_component_transformers_func = registry_ml_client.components.get(
    name=FINETUNE_PIPELINE_COMPONENT_NAME, label="latest"
)

#### 5.2 Create arguments to be passed to `multimodal_classification_pipeline` component

The `multimodal_classification_pipeline` component consists of model selection and finetuning components.

In [None]:
deepspeed_config_path = "./deepspeed_configs/zero1.json"
if not os.path.exists(deepspeed_config_path):
    print("DeepSpeed config file not found")
    deepspeed_config_path = None

pipeline_component_args = {
    ## Model selector component args
    "data_modalities": "text-image-tabular",
    ## Data preprocessing args
    "problem_type": "multimodal-classification-multilabel",
    "label_column": label_column_name,
    "image_column": image_column_name,
    "drop_columns": columns_to_drop,
    # The component tries to auto-detect the data type of the values in each column.
    # To explicitly specify a data type (categorical, numerical or textual), provide comma-separated column names in the fields below.
    # "numerical_columns_overrides":
    # "categorical_columns_overrides":
    # "text_columns_overrides":
    ## Finetune_args
    "deepspeed_config": deepspeed_config_path,
    "number_of_epochs": 15,
    "max_steps": -1,
    "training_batch_size": 8,
    "validation_batch_size": 8,
    "auto_find_batch_size": "false",
    "optimizer": "adamw_hf",
    "learning_rate": 2e-05,
    "warmup_steps": 0,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "gradient_accumulation_steps": 64,
    "learning_rate_scheduler": "linear",
    "precision": 32,
    "random_seed": 42,
    "evaluation_strategy": "epoch",
    "evaluation_steps_interval": 0.0,
    "evaluation_steps": 500,
    "logging_strategy": "epoch",
    "logging_steps": 500,
    "primary_metric": "loss",
    "resume_from_checkpoint": "false",
    "save_total_limit": -1,
    "apply_early_stopping": "false",
    "early_stopping_patience": 1,
    "early_stopping_threshold": 0.0,
    "apply_deepspeed": "false",
    "apply_ort": "false",
    "save_as_mlflow_model": "true",
}
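
With gradient accumulation, more samples contribute to each optimizer step than `training_batch_size` alone suggests. The arithmetic below assumes the usual data-parallel convention (per-process batch size × accumulation steps × number of processes), that `training_batch_size` is a per-process value, and a single instance running a single process as configured in section 5.3.

```python
# Values copied from the arguments above
training_batch_size = 8
gradient_accumulation_steps = 64

# Assumed here: a single instance running a single process
instance_count = 1
process_count_per_instance = 1

effective_batch_size = (
    training_batch_size
    * gradient_accumulation_steps
    * instance_count
    * process_count_per_instance
)
print(effective_batch_size)  # 512
```

If out-of-memory errors force a smaller `training_batch_size`, raising `gradient_accumulation_steps` by the same factor keeps this product unchanged.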

#### 5.3 Utility function to create pipeline using `multimodal_classification_pipeline` component

In [None]:
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.entities import PipelineComponent
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes


process_count_per_instance = 1
instance_count = 1


@pipeline()
def create_pipeline():
    """Create pipeline."""

    pipeline_component: PipelineComponent = pipeline_component_transformers_func(
        compute_model_import=model_import_cluster_name,
        compute_preprocess=data_preprocess_cluster_name,
        compute_finetune=finetune_cluster_name,
        training_data=Input(type=AssetTypes.MLTABLE, path=training_mltable_path),
        validation_data=Input(type=AssetTypes.MLTABLE, path=validation_mltable_path),
        mlflow_model_path=Input(type=AssetTypes.MLFLOW_MODEL, path=foundation_model.id),
        instance_count=instance_count,
        process_count_per_instance=process_count_per_instance,
        **pipeline_component_args,
    )
    return {
        # Map the output of the fine tuning job to the output of pipeline job so that we can easily register the fine tuned model. Registering the model is required to deploy the model to an online or batch endpoint.
        "mlflow_model_folder": pipeline_component.outputs.mlflow_model_folder,
    }

#### 5.4 Run the fine tuning job using `multimodal_classification_pipeline` component

In [None]:
pipeline_object = create_pipeline()

pipeline_object.display_name = "mmeft_multimodal_multilabel_pipeline_component_run"
# Don't use cached results from previous jobs
pipeline_object.settings.force_rerun = True

print("Submitting pipeline")

pipeline_run = workspace_ml_client.jobs.create_or_update(
    pipeline_object, experiment_name=experiment_name
)

print(f"Pipeline created. URL: {pipeline_run.studio_url}")

In [None]:
workspace_ml_client.jobs.stream(pipeline_run.name)

### 6. Get metrics from finetune component

The model training happens as part of the finetune component. Please follow below steps to extract validation metrics from the run.

#### 6.1 Initialize MLFlow Client

The models and artifacts produced by the fine-tuning job can be accessed via the MLFlow interface.
Initialize the MLFlow client here and set Azure ML as the tracking backend.

IMPORTANT - You need to have installed the latest MLFlow packages with:

    pip install azureml-mlflow
    pip install mlflow

In [None]:
import mlflow

# Obtain the tracking URL from MLClient
MLFLOW_TRACKING_URI = workspace_ml_client.workspaces.get(
    name=workspace_ml_client.workspace_name
).mlflow_tracking_uri

print(MLFLOW_TRACKING_URI)

In [None]:
# Set the MLFLOW TRACKING URI
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
print(f"\nCurrent tracking uri: {mlflow.get_tracking_uri()}")

In [None]:
from mlflow.tracking.client import MlflowClient

# Initialize MLFlow client
mlflow_client = MlflowClient()

#### 6.2 Get the training and evaluation run

In [None]:
# Build a filter on 'tags.mlflow.rootRunId' using pipeline_run.name in single quotes
filter_string = "tags.mlflow.rootRunId='" + pipeline_run.name + "'"
runs = mlflow.search_runs(
    experiment_names=[experiment_name],
    filter_string=filter_string,
    output_format="list",
)

# Pick the run that logged training metrics.
for run in runs:
    # The training run is the one that has an "epoch" metric
    if "epoch" in run.data.metrics:
        training_run = run

#### 6.3 Get training metrics

Access the results (such as Models, Artifacts, Metrics) of a previously completed run.

In [None]:
import pandas as pd

pd.DataFrame(training_run.data.metrics, index=[0]).T

### 7. Register the fine tuned model with the workspace

We will register the model from the output of the fine tuning job. This will track lineage between the fine tuned model and the fine tuning job. The fine tuning job, further, tracks lineage to the foundation model, data and training code.

In [None]:
import time

# Generating a unique timestamp that can be used for names and versions that need to be unique
timestamp = str(int(time.time()))

In [None]:
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

# Check if the `mlflow_model_folder` output is available
print(
    f"Pipeline job outputs: {workspace_ml_client.jobs.get(pipeline_run.name).outputs}"
)

# Path to the fine-tuned model in the pipeline job output
model_path_from_job = f"azureml://jobs/{pipeline_run.name}/outputs/mlflow_model_folder"
print(f"Path to register model: {model_path_from_job}")

finetuned_model_name = (
    f"{use_model_name.replace('/', '-')}-chxray-multilabel-classification"
)
finetuned_model_description = f"{use_model_name.replace('/', '-')} fine tuned model for chxray multilabel classification"
prepare_to_register_model = Model(
    path=model_path_from_job,
    type=AssetTypes.MLFLOW_MODEL,
    name=finetuned_model_name,
    version=timestamp,  # use timestamp as version to avoid version conflict
    description=finetuned_model_description,
)
print(f"Prepare to register model: \n{prepare_to_register_model}")

# Register the model from pipeline job output
registered_model = workspace_ml_client.models.create_or_update(
    prepare_to_register_model
)
print(f"Registered model: {registered_model}")

### 8. Deploy the fine tuned model to an online endpoint
Online endpoints give a durable REST API that can be used to integrate with applications that need to use the model.

In [None]:
import datetime
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment

# Endpoint names need to be unique in a region, hence using timestamp to create unique endpoint name
online_endpoint_name = "mm-ml-chxray-" + datetime.datetime.now().strftime("%m%d%H%M")
online_endpoint_description = f"Online endpoint for {registered_model.name}, finetuned for predicting disease from Chest Xray multilabel classification dataset."
# Create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description=online_endpoint_description,
    auth_mode="key",
    tags={"foo": "bar"},
)
workspace_ml_client.begin_create_or_update(endpoint).result()

In [None]:
from azure.ai.ml.entities import OnlineRequestSettings, ProbeSettings

# deployment_name must be lowercase
deployment_name = "mm-ml-chxray-mlflow-deploy"
print(registered_model.id)
print(online_endpoint_name)
print(deployment_name)

# Create a deployment
demo_deployment = ManagedOnlineDeployment(
    name=deployment_name,
    endpoint_name=online_endpoint_name,
    model=registered_model.id,
    instance_type="Standard_DS3_V2",  # Use a GPU instance type like STANDARD_NC6s_v3 for faster inference
    instance_count=1,
    request_settings=OnlineRequestSettings(
        max_concurrent_requests_per_instance=1,
        request_timeout_ms=90000,
        max_queue_wait_ms=500,
    ),
    liveness_probe=ProbeSettings(
        failure_threshold=49,
        success_threshold=1,
        timeout=299,
        period=180,
        initial_delay=180,
    ),
    readiness_probe=ProbeSettings(
        failure_threshold=10,
        success_threshold=1,
        timeout=10,
        period=10,
        initial_delay=2000,
    ),
)
workspace_ml_client.online_deployments.begin_create_or_update(demo_deployment).wait()
endpoint.traffic = {deployment_name: 100}
workspace_ml_client.begin_create_or_update(endpoint).result()

### 9. Test the endpoint with sample data

We will fetch some sample data from the dataset and submit it to the online endpoint for inference. We will then display the response returned by the endpoint.

In [None]:
demo_deployment = workspace_ml_client.online_deployments.get(
    name=deployment_name,
    endpoint_name=online_endpoint_name,
)

# Get the details for online endpoint
endpoint = workspace_ml_client.online_endpoints.get(name=online_endpoint_name)

# Existing traffic details
print(endpoint.traffic)

# Get the scoring URI
print(endpoint.scoring_uri)
print(demo_deployment)

In [None]:
import base64
import json


def image_to_str(img_path) -> str:
    with open(os.path.join(dataset_dir, img_path), "rb") as f:
        encoded_image = base64.encodebytes(f.read()).decode("utf-8")
        return encoded_image


df_sample = pd.read_csv(input_csv_file_path, nrows=2)

# We can pass image either as azureml url on data asset or as a base64 encoded string.
# Here, we will be passing base64 encoded string.
df_sample[image_column_name] = df_sample.apply(
    lambda x: image_to_str(x[image_column_name]), axis=1
)

request_json = {
    "input_data": {
        "columns": df_sample.columns.values.tolist(),
        "data": df_sample.values.tolist(),
    }
}

# Create request json
request_file_name = "sample_request_data.json"
with open(request_file_name, "w") as request_file:
    json.dump(request_json, request_file)
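
The request body pairs a `columns` list with row-wise `data`, where the image cell holds base64 text. The sketch below builds the same shape from in-memory bytes and checks that the encoded image decodes back unchanged. The column names come from this dataset, while the byte string and the label value are hypothetical stand-ins.

```python
import base64
import json

# Stand-in bytes instead of a real image file (hypothetical)
fake_image_bytes = bytes(range(256)) * 2

# encodebytes inserts a newline after every 76 characters of output
encoded = base64.encodebytes(fake_image_bytes).decode("utf-8")

payload = {
    "input_data": {
        "columns": ["Image_Index", "Finding_Labels"],
        "data": [[encoded, "label_a"]],
    }
}

# Round-trip through JSON, then decode; b64decode discards the newlines by default
body = json.dumps(payload)
restored = json.loads(body)["input_data"]["data"][0][0]
decoded = base64.b64decode(restored)
print(decoded == fake_image_bytes)
```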

In [None]:
resp = workspace_ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    deployment_name=demo_deployment.name,
    request_file=request_file_name,
)

In [None]:
resp

### 10. Clean up resources - delete the online endpoint
Don't forget to delete the online endpoint, else you will leave the billing meter running for the compute used by the endpoint.

In [None]:
workspace_ml_client.online_endpoints.begin_delete(name=online_endpoint_name).wait()