Microsoft AI Rangers Demo
# Large Scale Image Classification
__X-Ray Classification using CheXpert__

<img src="images/dp-chexpert.png" width=800 />

### Goal
The goal of this notebook is to demonstrate the feasibility of large-scale multi-label image classification of radiographs using the [Automated Machine Learning](https://learn.microsoft.com/en-us/azure/machine-learning/concept-automated-ml) (AutoML) feature of Azure Machine Learning. For this, we use [CheXpert](https://stanfordmlgroup.github.io/competitions/chexpert/), a large chest X-ray dataset. Because the original CheXpert dataset is ~400GB, we use the small version (resized images), which is ~11GB. To explore model performance, we then draw a small sample from it (~250 images).

### Steps

1. Upload data (small dataset) to the Datastore through an AML Data asset (URI Folder)
2. Create an MLTable from labeled training data in JSONL format
3. Set AutoML Run
4. Configure parameters and run

At the end, we show results using CheXpert standard image size (no resizing).

This notebook was developed and tested using an Azure ML STANDARD_D13_V2 CPU [compute instance](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-manage-compute-instance?tabs=python) and the Azure ML [Python SDK v2](https://learn.microsoft.com/en-us/azure/machine-learning/concept-v2).

As with previous examples, having an Azure account is a prerequisite. Once you have one, set up an Azure ML Workspace and either create a new notebook or import this one there. You can follow the official documentation for more details: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-run-jupyter-notebooks

# 1. Upload data (small dataset) to the Datastore through an AML Data asset (URI Folder)

Since AutoML relies on AzureML experiment management infrastructure, we need to stage our data in the cloud first. 

## Download and extract CheXpert
The dataset is available for download [here](https://stanfordmlgroup.github.io/competitions/chexpert/). After you fill out the form, you will receive an automated email from Stanford with links to both the full and the downsampled dataset. We downloaded the dataset from the URL above and unzipped the image files into the following structure:

<img src="images/dp-chexpert-file-structure.png" width=200 />

Run the following cell (replacing LINK_TO_FILE with the link you received in the email) if you want to download and unzip the dataset automatically.

> `NOTICE`: 
> The CheXpert dataset contains images encoded as grayscale JPEGs. Since you are here, you probably know that real-world medical images come in DICOM format, which supports a much higher intensity range than the 8-bit grayscale that JPEG provides. At the time of writing, however, AutoML only supports common 2D RGB image formats such as BMP, JPEG, or PNG. You would have to convert your DICOMs into one of those, applying appropriate window width/window level transforms for best results. Unfortunately, if you want to work with 3D or 4D data such as CT, MR, fMRI, etc., you would either need to cast your task so that it can be inferred from independent 2D slices or use more advanced specialized frameworks such as Microsoft Research's [InnerEye Deep Learning SDK](https://github.com/microsoft/InnerEye-DeepLearning).
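
As a rough illustration of the window width/window level transform mentioned in the notice, the sketch below maps raw intensities to the 8-bit range that JPEG or PNG can store. It is dependency-free and works on nested lists; the function name `apply_window` and the sample values are hypothetical, and in practice you would read the pixel data with a DICOM library and use array operations instead.

```python
def apply_window(pixels, center, width):
    """Map raw intensities to 0..255 using a window center (level) and width."""
    low = center - width / 2.0
    high = center + width / 2.0
    out = []
    for row in pixels:
        out_row = []
        for value in row:
            # Clip to the window, then rescale linearly to the 8-bit range
            clipped = min(max(value, low), high)
            out_row.append(int(round((clipped - low) / (high - low) * 255)))
        out.append(out_row)
    return out


# Hypothetical 12-bit pixel values (0..4095), for illustration only
raw = [[0, 500, 1500], [2000, 3000, 4095]]
windowed = apply_window(raw, center=1500, width=2000)
print(windowed)
```

Values below the window collapse to 0 and values above it to 255, so the choice of center and width decides which part of the intensity range keeps its contrast.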

In [None]:
# remove files from /tmp to avoid potential overlap from previous runs
!rm -f /tmp/CheXpert-v1.0-small.zip
!rm -rf /tmp/CheXpert-v1.0-small

# Download the small version of the dataset
!wget "LINK_TO_FILE" -P /tmp
!unzip -q /tmp/CheXpert-v1.0-small.zip -d /tmp

Extract a small sample dataset by drawing `n_train` training and `n_valid` validation images per pathology:

In [None]:
import pandas as pd
import os, shutil

def chexpert_sampler(base_dir, chexpert_original_dir, chexpert_new_dir, n_train, n_valid):
    """Copy a per-pathology sample of frontal CheXpert images into a new folder."""
    pathologies = ['Cardiomegaly', 'Edema', 'Consolidation', 'Atelectasis', 'Pleural Effusion']

    for split, n in [('train', n_train), ('valid', n_valid)]:
        df = pd.read_csv(os.path.join(base_dir, chexpert_original_dir, split + '.csv'))
        # Keep frontal views only
        df = df[df['Frontal/Lateral'] == 'Frontal']

        # Draw n positive cases per pathology; an image that is positive for
        # several pathologies can be drawn more than once
        df_sample = pd.concat([df[df[p] == 1].sample(n=n) for p in pathologies])
        df_sample.reset_index(drop=True, inplace=True)

        for index, row in df_sample.iterrows():
            source = os.path.join(base_dir, row['Path'])
            # Point the CSV entry at the new folder and copy the image there
            new_path = row['Path'].replace(chexpert_original_dir, chexpert_new_dir)
            df_sample.loc[index, 'Path'] = new_path
            target = os.path.join(base_dir, new_path)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            shutil.copyfile(source, target)

        df_sample.to_csv(os.path.join(base_dir, chexpert_new_dir, split + '.csv'), index=False)


base_dir                = "/tmp/"
chexpert_original_dir   = "CheXpert-v1.0-small"
chexpert_new_dir        = "chexpert_small_demo"
chexpert_sampler(base_dir,chexpert_original_dir,chexpert_new_dir,n_train=100,n_valid=10)
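
Because the sampler draws positives separately for each pathology, an image that is positive for several pathologies can appear more than once in the sample. The toy sketch below shows the effect using only the standard library; the records, the helper name `sample_per_label`, and the fixed seed are hypothetical and not part of the CheXpert data.

```python
import random


def sample_per_label(records, labels, n, seed=0):
    """Draw n positive records per label; the draws are independent per label."""
    rng = random.Random(seed)
    sample = []
    for label in labels:
        positives = [r for r in records if r.get(label) == 1]
        sample.extend(rng.sample(positives, n))
    return sample


# Made-up records: some images are positive for both labels
records = [
    {"Path": f"img{i}.jpg", "Edema": int(i % 2 == 0), "Cardiomegaly": int(i % 3 == 0)}
    for i in range(30)
]
sample = sample_per_label(records, ["Edema", "Cardiomegaly"], n=5)
unique_paths = {r["Path"] for r in sample}
print(len(sample), "rows,", len(unique_paths), "unique images")
```

The row count is always `n` times the number of labels, while the number of unique images can be smaller.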

## Upload data

You will be using Blob storage as the datastore in this example. Every Azure ML workspace will have a default datastore which the code below uses. You can pick a different datastore for your real-world scenario.

[Check this notebook for AML data asset example.](https://github.com/Azure/azureml-examples/blob/a7f2b1894769736a2bbfcfb5c4e1d00269f8c6a6/sdk/python/assets/data/data.ipynb)

In [2]:
# Import required libraries
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient, Input

from azure.ai.ml.automl import SearchSpace, ClassificationMultilabelPrimaryMetrics
from azure.ai.ml.sweep import (
    Choice,
    Uniform,
    BanditPolicy,
)

from azure.ai.ml import automl
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes, InputOutputModes

In [3]:
credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Enter details of your AML workspace
    subscription_id = ""
    resource_group = ""
    workspace = ""
    ml_client = MLClient(credential, subscription_id, resource_group, workspace)


Found the config file in: /config.json


In [4]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.ai.ml import Input

chexpert_local_path = base_dir + chexpert_new_dir 

my_data = Data(
    path=chexpert_local_path,
    type=AssetTypes.URI_FOLDER,
    description="CheXpert dataset small version for demo",
    name="chexpert_small_demo_asset",
)

uri_folder_data_asset = ml_client.data.create_or_update(my_data)

print(uri_folder_data_asset)
print("")
print("Path to folder in Blob Storage")
print(uri_folder_data_asset.path)

Data({'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': None, 'type': 'uri_folder', 'is_anonymous': False, 'auto_increment_version': False, 'name': 'chexpert_small_demo_asset', 'description': 'CheXpert dataset small version for demo', 'tags': {}, 'properties': {}, 'id': '/subscriptions/MY_SUBSCRIPTION/resourceGroups/MY_RESOURCEGROUP/providers/Microsoft.MachineLearningServices/workspaces/MY_WORKSPACE/data/chexpert_small_demo_asset/versions/10', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/demo6/code/Users/medical-imaging/notebooks', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7f5bbc2dcb50>, 'serialize': <msrest.serialization.Serializer object at 0x7f5bbc2dc550>, 'version': '10', 'latest_version': None, 'path': 'azureml://subscriptions/MY_SUBSCRIPTION/resourcegroups/MY_RESOURCEGROUP/workspaces/MY_WORKSPACE/datastores/workspaceblobstore/paths/LocalUpload/bf6fb7b1e88207436d2a346c6bcf7de

# 2. Create an MLTable from labeled training data in JSONL format

AutoML expects the mapping between data points and labels to be in [JSONL format](https://jsonlines.org/). The following code takes the training and validation CSVs provided by CheXpert and turns them into JSONL.
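
For orientation, here is a minimal sketch of what a single JSONL record for a multi-label image looks like: one JSON object per line, with an `image_url` and a list of labels. The URL below is a placeholder and the labels are made up for illustration.

```python
import json

# Placeholder URL and made-up labels, for illustration only
record = {
    "image_url": "azureml://datastores/<datastore>/paths/<folder>/train/patient00001/study1/view1_frontal.jpg",
    "label": ["Cardiomegaly", "Edema"],
}

# Each record is serialized onto exactly one line of the .jsonl file
line = json.dumps(record)
print(line)

# Reading it back gives the same structure, with the labels as a list
parsed = json.loads(line)
```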

In [5]:
import pandas as pd
import json, os

df_train = pd.read_csv( chexpert_local_path + '/train.csv')
df_train = df_train[df_train['Frontal/Lateral'] == 'Frontal']  

df_valid = pd.read_csv(chexpert_local_path + '/valid.csv')
df_valid = df_valid[df_valid['Frontal/Lateral'] == 'Frontal']  

print(df_train.shape)
df_train.head(4)

(500, 19)


| | Path | Sex | Age | Frontal/Lateral | AP/PA | No Finding | Enlarged Cardiomediastinum | Cardiomegaly | Lung Opacity | Lung Lesion | Edema | Consolidation | Pneumonia | Atelectasis | Pneumothorax | Pleural Effusion | Pleural Other | Fracture | Support Devices |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chexpert_small_demo/train/patient39463/study2/... | Female | 66 | Frontal | AP |  |  | 1.0 |  |  | 1.0 | 1.0 | -1.0 | -1.0 | 0.0 | 1.0 |  |  | 1.0 |
| 1 | chexpert_small_demo/train/patient17454/study8/... | Male | 73 | Frontal | AP |  |  | 1.0 |  |  | 1.0 |  |  |  |  |  |  |  | 1.0 |
| 2 | chexpert_small_demo/train/patient25099/study2/... | Male | 64 | Frontal | AP |  |  | 1.0 |  |  |  | 1.0 |  | 1.0 | 0.0 | 1.0 |  |  | 1.0 |
| 3 | chexpert_small_demo/train/patient28255/study1/... | Female | 49 | Frontal | PA |  |  | 1.0 | 1.0 |  |  | 0.0 |  | 1.0 | 0.0 | 0.0 |  |  | 1.0 |


In [6]:
pathologies = ['Cardiomegaly', 'Edema', 'Consolidation', 'Atelectasis', 'Pleural Effusion']
# sample json line dictionary
json_line_sample = {
    "image_url": uri_folder_data_asset.path.replace(chexpert_new_dir+'/',""),
    "label": ""
    }

MLTable =  \
'''paths:
  - file: ./MY_FILE
transformations:
  - read_json_lines:
        encoding: utf8
        invalid_lines: error
        include_path_column: false
  - convert_column_types:
      - columns: image_url
        column_type: stream_info
'''

chexpert_train_jsonl_file_name      = "chexpert_train.jsonl"
chexpert_validation_jsonl_file_name = "chexpert_validation.jsonl"

labels_chexpert_local_path          = chexpert_local_path + '/label_files/'

chexpert_train_jsonl_dir_name       = chexpert_train_jsonl_file_name.replace(".jsonl","_mltable_folder")
chexpert_validation_jsonl_dir_name  = chexpert_validation_jsonl_file_name.replace(".jsonl","_mltable_folder")

training_mltable_path               = labels_chexpert_local_path + os.sep + chexpert_train_jsonl_dir_name
validation_mltable_path             = labels_chexpert_local_path + os.sep + chexpert_validation_jsonl_dir_name


for path in [labels_chexpert_local_path, training_mltable_path, validation_mltable_path]:
    os.makedirs(path, exist_ok=True)

# Path to the training and validation files
train_annotations_file      = labels_chexpert_local_path + os.sep + chexpert_train_jsonl_dir_name       + os.sep + chexpert_train_jsonl_file_name
validation_annotations_file = labels_chexpert_local_path + os.sep + chexpert_validation_jsonl_dir_name  + os.sep + chexpert_validation_jsonl_file_name

with open(train_annotations_file, "w") as train_f:
    for idx, row in df_train.iterrows():    
        pathology_list = [pathology for pathology in pathologies if row[pathology] == 1]
        if len(pathology_list) == 0:
            pathology_list.append("X_other")
        json_line = dict(json_line_sample)
        json_line["image_url"] += row.Path
        json_line["label"] = pathology_list  
        train_f.write(json.dumps(json_line) + "\n")

with open(train_annotations_file.replace(chexpert_train_jsonl_file_name,"MLTable"), "w") as mltable_f:
    mltable_f.write( MLTable.replace("MY_FILE",chexpert_train_jsonl_file_name) + "\n")

with open(validation_annotations_file, "w") as validation_f:
    for idx, row in df_valid.iterrows():
        pathology_list = [pathology for pathology in pathologies if row[pathology] == 1]
        if len(pathology_list) == 0:
            pathology_list.append("X_other")
        json_line = dict(json_line_sample)
        json_line["image_url"] += row.Path
        json_line["label"] = pathology_list  
        validation_f.write(json.dumps(json_line) + "\n")

with open(validation_annotations_file.replace(chexpert_validation_jsonl_file_name,"MLTable"), "w") as mltable_f:
    mltable_f.write( MLTable.replace("MY_FILE",chexpert_validation_jsonl_file_name) + "\n")
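
Before submitting a job, it can help to check how often each label occurs in the generated annotation files. The helper below is a small sketch that uses only the standard library; `count_labels` is a hypothetical name, and the demo runs on a temporary file with made-up records rather than on the real annotations.

```python
import json
import os
import tempfile
from collections import Counter


def count_labels(jsonl_path):
    """Count label occurrences in a JSONL annotation file."""
    counts = Counter()
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            counts.update(json.loads(line)["label"])
    return counts


# Self-contained demo on a temporary file with made-up records
demo_records = [
    {"image_url": "a.jpg", "label": ["Edema"]},
    {"image_url": "b.jpg", "label": ["Edema", "Atelectasis"]},
    {"image_url": "c.jpg", "label": ["X_other"]},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w") as f:
        for rec in demo_records:
            f.write(json.dumps(rec) + "\n")
    demo_counts = count_labels(demo_path)

print(demo_counts)
```

In this notebook you could call `count_labels(train_annotations_file)` to see the label distribution of the training set.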



In [7]:
# Training MLTable defined locally, with local data to be uploaded
my_training_data_input = Input(type=AssetTypes.MLTABLE, path=training_mltable_path)

# Validation MLTable defined locally, with local data to be uploaded
my_validation_data_input = Input(type=AssetTypes.MLTABLE, path=validation_mltable_path)

# WITH REMOTE PATH: If available already in the cloud/workspace-blob-store
# my_training_data_input = Input(type=AssetTypes.MLTABLE, path="azureml://datastores/workspaceblobstore/paths/vision-classification/train")
# my_validation_data_input = Input(type=AssetTypes.MLTABLE, path="azureml://datastores/workspaceblobstore/paths/vision-classification/valid")

print(training_mltable_path)

/home/azureuser/cloudfiles/code/Users/demo/chexpert_small_demo/label_files//chexpert_train_mltable_folder


# 3. Set AutoML Run

## Compute target setup
You need to provide a [Compute Target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) that will be used for your AutoML model training. AutoML models for image tasks require GPU SKUs and support the NC and ND families. We recommend using the NCsv3-series (with V100 GPUs) for faster training. A compute target with a multi-GPU VM SKU uses the multiple GPUs to speed up training. Additionally, a compute target with multiple nodes allows faster model training by running trials in parallel when tuning hyperparameters for your model. See more on compute targets in the official documentation: https://learn.microsoft.com/en-us/azure/machine-learning/concept-compute-target

In [8]:
from azure.ai.ml.entities import AmlCompute
from azure.core.exceptions import ResourceNotFoundError

compute_name = "gpu-cluster"

try:
    _ = ml_client.compute.get(compute_name)
    print("Found existing compute target.")
except ResourceNotFoundError:
    print("Creating a new compute target...")
    compute_config = AmlCompute(
        name=compute_name,
        type="amlcompute",
        size="Standard_NC6",
        idle_time_before_scale_down=120,
        min_instances=0,
        max_instances=4,
    )
    ml_client.begin_create_or_update(compute_config).result()

Found existing compute target.


In [9]:
# general job parameters
exp_name = "chexpert-demo"

# 4. Configure parameters and run

The key feature of AutoML is that it can sweep across a set of parameters selecting the combination that works best for your task. 

The configuration below sweeps over a Vision Transformer (ViT, `vitb16r224`) and SE-ResNeXt (`seresnext`), a variant of ResNeXt ([more info here](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-image-models?tabs=cli#supported-model-algorithms)). For SE-ResNeXt, the sweep also covers the validation resize size (288, 320, or 352) and the training and validation crop sizes (224 or 256). To stop poorly performing trials early, we use a Bandit [early termination policy](https://learn.microsoft.com/en-us/azure/machine-learning/v1/how-to-auto-train-image-models-v1#early-termination-policies). For larger datasets, and depending on your compute budget, you can choose a different early termination policy such as [Median stopping](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters#median-stopping-policy).


You can modify this configuration by adding more models and parameters to try out. See reference here: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-image-models
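
To give a feel for what a Bandit early termination policy does, here is a minimal sketch of its decision rule as we understand it from the hyperparameter tuning documentation: at each evaluated interval, a trial is stopped if its metric falls below the best metric divided by `1 + slack_factor`. The function name is hypothetical, the metric is assumed to be maximized, and the defaults mirror the values used in this notebook.

```python
def bandit_should_terminate(best_metric, trial_metric, interval,
                            slack_factor=0.2, evaluation_interval=2, delay_evaluation=6):
    """Return True if a trial would be stopped at the given interval (sketch)."""
    # The policy only applies at multiples of evaluation_interval that are
    # greater than or equal to delay_evaluation
    if interval < delay_evaluation or interval % evaluation_interval != 0:
        return False
    # For a maximized metric, the allowed floor is best / (1 + slack_factor)
    return trial_metric < best_metric / (1 + slack_factor)


# With a best metric of 0.8 and slack_factor=0.2, the floor is 0.8 / 1.2 (about 0.667)
print(bandit_should_terminate(0.8, 0.60, interval=6))  # below the floor
print(bandit_should_terminate(0.8, 0.70, interval=6))  # above the floor
print(bandit_should_terminate(0.8, 0.10, interval=4))  # before delay_evaluation
```

A larger `slack_factor` tolerates trials that lag further behind the best one, while `delay_evaluation` gives every trial a minimum number of intervals before it can be stopped.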

In [10]:
# Create the AutoML job with the related factory-function.

image_classification_multilabel_job = automl.image_classification_multilabel(
    compute=compute_name,
    experiment_name=exp_name,
    training_data=my_training_data_input,
    validation_data=my_validation_data_input,
    target_column_name="label",
    primary_metric=ClassificationMultilabelPrimaryMetrics.IOU,
    tags={"my_custom_tag": "My custom value"},
)

image_classification_multilabel_job.set_limits(
    timeout_minutes=60,
    max_trials=10,
    max_concurrent_trials=2,
)

image_classification_multilabel_job.extend_search_space(
    [
        SearchSpace(
            model_name=Choice(["vitb16r224"]),
            learning_rate=Uniform(0.005, 0.05),
            number_of_epochs=Choice([15, 30]),
            gradient_accumulation_step=Choice([1, 2]),
        ),
        SearchSpace(
            model_name=Choice(["seresnext"]),
            learning_rate=Uniform(0.005, 0.05),
            # model-specific, validation_resize_size should be greater than or equal to validation_crop_size
            validation_resize_size=Choice([288, 320, 352]),
            validation_crop_size=Choice([224, 256]),  # model-specific
            training_crop_size=Choice([224, 256]),  # model-specific
        ),
    ]
)

image_classification_multilabel_job.set_sweep(
    sampling_algorithm="Random",
    early_termination=BanditPolicy(
        evaluation_interval=2, slack_factor=0.2, delay_evaluation=6
    ),
)
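
The job above uses `ClassificationMultilabelPrimaryMetrics.IOU` as its primary metric. As a rough illustration of what intersection over union means for multi-label predictions, the sketch below averages the per-image Jaccard index. This is one common definition and an assumption here; the exact averaging AutoML applies is not stated in this notebook. The function name and sample labels are hypothetical.

```python
def mean_iou(true_labels, predicted_labels):
    """Average per-image intersection over union of label sets."""
    scores = []
    for truth, pred in zip(true_labels, predicted_labels):
        truth, pred = set(truth), set(pred)
        union = truth | pred
        # Convention assumed here: two empty label sets count as a perfect match
        scores.append(len(truth & pred) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Made-up ground truth and predictions for two images
y_true = [["Edema", "Cardiomegaly"], ["Atelectasis"]]
y_pred = [["Edema"], ["Atelectasis"]]
print(mean_iou(y_true, y_pred))  # (1/2 + 1/1) / 2 = 0.75
```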

In [11]:
# Submit the AutoML job
returned_job = ml_client.jobs.create_or_update(
    image_classification_multilabel_job
)  # submit the job to the backend

print(f"Created job: {returned_job}")

Uploading chexpert_validation_mltable_folder (0.02 MBs): 100%|██████████| 16947/16947 [00:00<00:00, 585283.04it/s]



Created job: ImageClassificationMultilabelJob({'log_verbosity': <LogVerbosity.INFO: 'Info'>, 'target_column_name': 'label', 'validation_data_size': None, 'task_type': <TaskType.IMAGE_CLASSIFICATION_MULTILABEL: 'ImageClassificationMultilabel'>, 'training_data': {'type': 'mltable', 'path': 'azureml://datastores/workspaceblobstore/paths/LocalUpload/bf6fb7b1e88207436d2a346c6bcf7ded/chexpert_train_mltable_folder'}, 'validation_data': {'type': 'mltable', 'path': 'azureml://datastores/workspaceblobstore/paths/LocalUpload/efe13e0cf86a8b49b6fc8b98f6585b04/chexpert_validation_mltable_folder'}, 'test_data': None, 'environment_id': None, 'environment_variables': None, 'outputs': {}, 'type': 'automl', 'status': 'NotStarted', 'log_files': None, 'name': 'gentle_carpet_grpm1rfbw3', 'description': None, 'tags': {'my_custom_tag': 'My custom value'}, 'properties': {}, 'id': '/subscriptions/MY_SUBSCRIPTION/resourceGroups/MY_RESOURCEGROUP/providers/Microsoft.MachineLearningServices/workspaces/MY_WORKSPACE/

In [12]:
ml_client.jobs.stream(returned_job.name)

RunId: gentle_carpet_grpm1rfbw3
Web View: https://ml.azure.com/runs/gentle_carpet_grpm1rfbw3?wsid=/subscriptions/MY_SUBSCRIPTION/resourcegroups/MY_RESOURCEGROUP/workspaces/MY_WORKSPACE

Execution Summary
RunId: gentle_carpet_grpm1rfbw3
Web View: https://ml.azure.com/runs/gentle_carpet_grpm1rfbw3?wsid=/subscriptions/MY_SUBSCRIPTION/resourcegroups/MY_RESOURCEGROUP/workspaces/MY_WORKSPACE



In [13]:
hd_job = ml_client.jobs.get(returned_job.name + "_HD")
hd_job

| Experiment | Name | Type | Status | Details Page |
|---|---|---|---|---|
| chexpert-demo | gentle_carpet_grpm1rfbw3_HD | sweep | Completed | Link to Azure Machine Learning studio |


# Results using 'small CheXpert' dataset

These are some of the charts from the experiments we have run when preparing this tutorial. You should get something similar.

<img src="images/dp-chexpert_small-runs.png" width=1200 />


# Bonus

### Results using standard (non-resized) CheXpert dataset

We show results of training on CheXpert using the standard image size. The two graphs below show overall performance using a 'seresnext' model. In this case, we performed an exhaustive evaluation by disabling the early termination policy (configuration parameters are below).

You can reproduce these results by downloading and unzipping the standard CheXpert dataset using the steps above and applying the configuration settings below.

<img src="images/dp-chexpert-runs.png" width=800 >

<img src="images/dp-chexpert-runs_parallel_coordinates_chart.png" width=800 >

### Effect of [hyperparameters](https://learn.microsoft.com/en-us/azure/machine-learning/reference-automl-images-hyperparameters#image-classification-multi-class-and-multi-label-specific-hyperparameters)

We can easily visualize the effects of hyperparameters. The two graphs below show the effect of learning rate and weighted loss on the AUC macro metric. Based on these graphs, we observe that performance (AUC macro) increases for:

1. Larger learning rates (greater than 0.001)
2. Weighted loss with sqrt(class_weights), which corresponds to value 1 (value 2 corresponds to weighted loss with class_weights).

<img src="images/dp-chexpert-scatter-AUC_lr.png" width=800 >

<img src="images/dp-chexpert-scatter-AUC_wl.png" width=800 >
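
To make the two weighted-loss settings compared above more concrete, the sketch below computes per-class weights and their square roots from label counts. The inverse-frequency formula is an assumption for illustration, since this notebook does not state how AutoML derives `class_weights`; the counts are made up and the function name is hypothetical.

```python
import math


def inverse_frequency_weights(label_counts):
    """Assumed weighting: total / (number of classes * class count)."""
    total = sum(label_counts.values())
    n_classes = len(label_counts)
    return {label: total / (n_classes * count) for label, count in label_counts.items()}


# Made-up label counts: one common class and one rare class
counts = {"Edema": 400, "Consolidation": 100}
weights = inverse_frequency_weights(counts)
sqrt_weights = {label: math.sqrt(w) for label, w in weights.items()}

print(weights)
print(sqrt_weights)
```

Taking the square root shrinks the gap between rare and common classes: here the rare class is weighted 4 times as much as the common one with the plain weights, but only 2 times as much with the square-rooted weights.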



# Conclusion

This tutorial has demonstrated the feasibility of using AutoML, a low-code solution, to achieve state-of-the-art performance on a radiological image classification task. Good luck with your experiments!