Microsoft AI Rangers Demo
# Large Scale Image Classification
__X-Ray Classification using CheXpert__

<img src="images/dp-chexpert.png" width=800 />

### Goal
The goal of this notebook is to demonstrate the feasibility of large-scale multi-label image classification of radiographs using the [Automated Machine Learning](https://learn.microsoft.com/en-us/azure/machine-learning/concept-automated-ml) (AutoML) feature of Azure Machine Learning. For this, we use [CheXpert](https://stanfordmlgroup.github.io/competitions/chexpert/), a large chest X-ray dataset. Because the original CheXpert dataset is ~400GB, we use the small version (with downsampled images), which is ~11GB.


### Steps
1. Upload data (small dataset) to the cloud
2. Convert the data to JSONL
3. Set up the AutoML run
4. Configure parameters and run

At the end, we show results obtained with the standard CheXpert image size (no resizing).

This notebook was developed and tested using an Azure ML STANDARD_D13_V2 CPU [compute instance](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-manage-compute-instance?tabs=python) and the Azure ML [Python SDK v1](https://learn.microsoft.com/en-us/python/api/overview/azure/ml/?view=azure-ml-py&preserve-view=true).

As with previous examples, an Azure account is a prerequisite. Once you have one, set up an Azure ML Workspace and either create a new notebook or import this one into it. See the official documentation for details: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-run-jupyter-notebooks

# 1. Upload data to the cloud

Since AutoML relies on AzureML experiment management infrastructure, we need to stage our data in the cloud first. 

## Download and extract CheXpert
The dataset is available for download [here](https://stanfordmlgroup.github.io/competitions/chexpert/). After you fill out the form, you will receive an automated email from Stanford with links to both the full and the downsampled dataset. We downloaded the downsampled dataset from that link and unzipped the image files into the following structure:

<img src="images/dp-chexpert-file-structure.png" width=200 />

Run the following cell (replacing LINK_TO_FILE with the link you received in the email) if you want to download and unzip the dataset automatically.

> `NOTICE`: 
> The CheXpert dataset contains images encoded as grayscale JPEGs. Real-world medical images usually come in DICOM format, which supports a much wider intensity range than the 8-bit grayscale that JPEG provides. At the time of writing, however, AutoML only supports common 2D RGB image formats such as BMP, JPEG, or PNG. You would have to convert your DICOMs into one of those formats, applying appropriate window width/window level transforms for best results. If you are working with 3D or 4D data such as CT, MR, or fMRI, you would need to either recast your task so that it can be inferred from independent 2D slices or use a more specialized framework such as Microsoft Research's [InnerEye Deep Learning SDK](https://github.com/microsoft/InnerEye-DeepLearning).
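
To make the window width/window level step in the notice above more concrete, here is a minimal sketch of a simplified linear window that maps raw intensities to 8-bit values. It is not part of the original notebook: the function names, the sample pixel values, and the window settings are hypothetical, and the DICOM standard defines the exact formula slightly differently. A real pipeline would typically read pixels with a DICOM library and vectorize this with NumPy.

```python
def window_to_uint8(value, center, width):
    """Map one raw intensity to 0..255 using a window center (level) and width."""
    if width <= 0:
        raise ValueError("width must be positive")
    low = center - width / 2.0
    high = center + width / 2.0
    clipped = min(max(value, low), high)  # clamp the value to the window
    return round((clipped - low) / (high - low) * 255)


def window_image(pixels, center, width):
    """Apply the window to a 2D image given as a list of rows."""
    return [[window_to_uint8(v, center, width) for v in row] for row in pixels]


# Hypothetical 12-bit pixel values and window settings, for illustration only
raw = [[0, 500, 1000], [1500, 2000, 4095]]
windowed = window_image(raw, center=1000, width=2000)
print(windowed)
```

Values below the window map to 0 and values above it saturate at 255, which is why the choice of window matters for what the model can see.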

In [None]:
# Remove leftovers from previous runs (-f: do not fail if they do not exist)
!rm -f /tmp/CheXpert-v1.0-small.zip
!rm -rf /tmp/CheXpert-v1.0-small

# Download the small version of the dataset
!wget "LINK_TO_FILE" -P /tmp
!unzip -q /tmp/CheXpert-v1.0-small.zip -d /tmp

## Upload to blob storage

This example uses Blob storage as the datastore. Every Azure ML workspace has a default datastore, which the code below uses. You can pick a different datastore for your real-world scenario.
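
Before starting a long upload, it can help to check how many files and how many bytes are about to be transferred. The sketch below is an optional addition, not part of the original notebook; `summarize_directory` is a hypothetical helper that uses only the standard library, and the path is assumed to match the extraction location used above.

```python
import os


def summarize_directory(root):
    """Return (file_count, total_bytes) for all files under root."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_count += 1
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return file_count, total_bytes


# Assumed to be the same folder the dataset was unzipped into
local_path = "/tmp/CheXpert-v1.0-small/"
if os.path.isdir(local_path):
    count, size = summarize_directory(local_path)
    print(f"{count} files, {size / 1e9:.2f} GB")
```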

In [29]:
from azureml.core.workspace import Workspace

ws = Workspace.from_config()
ds = ws.get_default_datastore()

chexpert_local_path = '/tmp/CheXpert-v1.0-small/'
blob_chexpert_target_name = "chexpert_v1_small/"

# Upload image files to chexpert folder in AML datastore
ds.upload(src_dir=chexpert_local_path, target_path=blob_chexpert_target_name, overwrite=True, show_progress=False)


# 2. Convert the data to JSONL

AutoML expects the mapping between data points and labels to be in [JSON Lines (JSONL) format](https://jsonlines.org/). The following code takes the training and validation CSVs provided by CheXpert and turns them into JSONL.
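
As an illustration of the target format, the sketch below builds a single JSONL record with the two fields this notebook writes, `image_url` and `label`. The helper name `make_jsonl_line` and the file name `example.jpg` are hypothetical; the datastore name and folder follow the values used elsewhere in this notebook.

```python
import json


def make_jsonl_line(datastore_name, blob_path, labels):
    """Serialize one image and its list of labels as a single JSON line."""
    record = {
        "image_url": f"AmlDatastore://{datastore_name}/{blob_path}",
        "label": list(labels),
    }
    return json.dumps(record)


# Hypothetical file name, for illustration only
line = make_jsonl_line(
    "workspaceblobstore",
    "chexpert_v1_small/train/patient00001/study1/example.jpg",
    ["Cardiomegaly", "Edema"],
)
print(line)
```

Each image becomes exactly one line in the file, and for multi-label classification the `label` field holds a list rather than a single string.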

In [32]:
import pandas as pd
import json, os

df_train = pd.read_csv(chexpert_local_path + 'train.csv')
# Keep frontal views only
df_train = df_train[df_train['Frontal/Lateral'] == 'Frontal']

df_valid = pd.read_csv(chexpert_local_path + 'valid.csv')
df_valid = df_valid[df_valid['Frontal/Lateral'] == 'Frontal']

print(df_train.shape)
df_train.head(4)

(191027, 19)


Unnamed: 0,Path,Sex,Age,Frontal/Lateral,AP/PA,No Finding,Enlarged Cardiomediastinum,Cardiomegaly,Lung Opacity,Lung Lesion,Edema,Consolidation,Pneumonia,Atelectasis,Pneumothorax,Pleural Effusion,Pleural Other,Fracture,Support Devices
0,CheXpert-v1.0-small/train/patient00001/study1/...,Female,68,Frontal,AP,1.0,,,,,,,,,0.0,,,,1.0
1,CheXpert-v1.0-small/train/patient00002/study2/...,Female,87,Frontal,AP,,,-1.0,1.0,,-1.0,-1.0,,-1.0,,-1.0,,1.0,
2,CheXpert-v1.0-small/train/patient00002/study1/...,Female,83,Frontal,AP,,,,1.0,,,-1.0,,,,,,1.0,
4,CheXpert-v1.0-small/train/patient00003/study1/...,Male,41,Frontal,AP,,,,,,1.0,,,,0.0,,,,


In [33]:
pathologies = ['Cardiomegaly', 'Edema', 'Consolidation', 'Atelectasis', 'Pleural Effusion']
# sample json line dictionary
json_line_sample = {
    "image_url": "AmlDatastore://",
    "label": ""
    }

chexpert_dataset_name = "CheXpert-v1.0-small"
# To process standard CheXpert, just remove the '-small' characters:
# chexpert_dataset_name = "CheXpert-v1.0"

label_dir = 'label_files/'
labels_chexpert_local_path = chexpert_local_path + label_dir

# Create the label folder if it does not exist yet
os.makedirs(labels_chexpert_local_path, exist_ok=True)

chexpert_train_jsonl_file_name = "chexpert_train.jsonl"
chexpert_validation_jsonl_file_name = "chexpert_validation.jsonl"

# Path to the training and validation files
train_annotations_file = labels_chexpert_local_path + chexpert_train_jsonl_file_name
validation_annotations_file = labels_chexpert_local_path + chexpert_validation_jsonl_file_name

def write_jsonl(df, output_path):
    """Write one JSON line per image: its datastore URL and its list of positive labels."""
    with open(output_path, "w") as f:
        for _, row in df.iterrows():
            # Only labels marked 1.0 count as positive; blank, 0.0 and -1.0 (uncertain) do not
            pathology_list = [pathology for pathology in pathologies if row[pathology] == 1]
            if len(pathology_list) == 0:
                pathology_list.append("X_other")
            json_line = dict(json_line_sample)
            # Swap the local dataset folder (including its trailing slash) for the blob folder,
            # so the resulting URL does not contain a double slash
            json_line["image_url"] += "workspaceblobstore/" + row.Path.replace(
                chexpert_dataset_name + "/", blob_chexpert_target_name)
            json_line["label"] = pathology_list
            f.write(json.dumps(json_line) + "\n")

write_jsonl(df_train, train_annotations_file)
write_jsonl(df_valid, validation_annotations_file)
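
The conversion above counts a pathology as present only when its value is exactly 1.0, so blank entries, 0.0 and -1.0 (which CheXpert uses for uncertain findings) are all treated as absent. The sketch below isolates that rule so it is easier to reason about. The `uncertain_as_positive` switch is a hypothetical extension that is not in the original notebook; it is included only to show where a different policy for uncertain labels would plug in.

```python
import math

PATHOLOGIES = ["Cardiomegaly", "Edema", "Consolidation", "Atelectasis", "Pleural Effusion"]


def labels_for_row(row, uncertain_as_positive=False):
    """Return the label list for one CSV row (a dict or pandas Series)."""
    positive_values = {1.0, -1.0} if uncertain_as_positive else {1.0}
    labels = [p for p in PATHOLOGIES if row.get(p) in positive_values]
    # Rows without any positive target pathology fall into the catch-all class
    return labels or ["X_other"]


# Illustrative row in which all five target pathologies are marked uncertain
row = {
    "Cardiomegaly": -1.0,
    "Lung Opacity": 1.0,
    "Edema": -1.0,
    "Consolidation": -1.0,
    "Pneumonia": math.nan,
    "Atelectasis": -1.0,
    "Pleural Effusion": -1.0,
}
print(labels_for_row(row))
print(labels_for_row(row, uncertain_as_positive=True))
```

Under the notebook's rule such a row ends up in `X_other`, so the share of uncertain labels in the training set directly affects how large that class becomes.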


In [35]:
ds = ws.get_default_datastore()

labels_chexpert_blob_path = blob_chexpert_target_name + label_dir

# Upload json files to the label folder in AML datastore
ds.upload(src_dir=labels_chexpert_local_path, target_path=labels_chexpert_blob_path, overwrite=True, show_progress=True)

Uploading an estimated of 2 files
Uploading /tmp/CheXpert-v1.0-small/label_files/chexpert_validation.jsonl
Uploaded /tmp/CheXpert-v1.0-small/label_files/chexpert_validation.jsonl, 1 files out of an estimated total of 2
Uploading /tmp/CheXpert-v1.0-small/label_files/chexpert_train.jsonl
Uploaded /tmp/CheXpert-v1.0-small/label_files/chexpert_train.jsonl, 2 files out of an estimated total of 2
Uploaded 2 files


$AZUREML_DATAREFERENCE_fe2c2f936a684cb6b4cd658e0e6acaf3

The code snippet below registers the JSONL files created above as Azure ML labeled datasets, which map the image files to their labels.

In [37]:
from azureml.contrib.dataset.labeled_dataset import _LabeledDatasetFactory, LabeledDatasetTask
from azureml.core import Dataset

# Path to the training and validation files
train_dataset_name = "chexpert_train"
valid_dataset_name = "chexpert_valid"

# create training dataset
training_dataset = _LabeledDatasetFactory.from_json_lines(
    task=LabeledDatasetTask.IMAGE_MULTI_LABEL_CLASSIFICATION, path=ds.path(labels_chexpert_blob_path + chexpert_train_jsonl_file_name))
training_dataset = training_dataset.register(workspace=ws, name=train_dataset_name)

# create validation dataset
validation_dataset = _LabeledDatasetFactory.from_json_lines(
    task=LabeledDatasetTask.IMAGE_MULTI_LABEL_CLASSIFICATION, path=ds.path(labels_chexpert_blob_path + chexpert_validation_jsonl_file_name))
validation_dataset = validation_dataset.register(workspace=ws, name=valid_dataset_name)

print("Training dataset name: " + training_dataset.name)
print("Validation dataset name: " + validation_dataset.name)

Training dataset name: chexpert_train
Validation dataset name: chexpert_valid


In [40]:
validation_dataset.to_pandas_dataframe().head(10)

Unnamed: 0,image_url,label
0,StreamInfo(AmlDatastore://workspaceblobstore/c...,[Cardiomegaly]
1,StreamInfo(AmlDatastore://workspaceblobstore/c...,[X_other]
2,StreamInfo(AmlDatastore://workspaceblobstore/c...,[Edema]
3,StreamInfo(AmlDatastore://workspaceblobstore/c...,[X_other]
4,StreamInfo(AmlDatastore://workspaceblobstore/c...,"[Atelectasis, Pleural Effusion]"
5,StreamInfo(AmlDatastore://workspaceblobstore/c...,"[Cardiomegaly, Atelectasis]"
6,StreamInfo(AmlDatastore://workspaceblobstore/c...,[X_other]
7,StreamInfo(AmlDatastore://workspaceblobstore/c...,[X_other]
8,StreamInfo(AmlDatastore://workspaceblobstore/c...,"[Cardiomegaly, Consolidation, Atelectasis, Ple..."
9,StreamInfo(AmlDatastore://workspaceblobstore/c...,[Cardiomegaly]


## Compute target setup
You need to provide a [Compute Target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your AutoML model training. AutoML models for image tasks require GPU SKUs and support the NC and ND families. We recommend the NCsv3-series (with V100 GPUs) for faster training. A compute target with a multi-GPU VM SKU uses all of its GPUs to speed up training. Additionally, a compute target with multiple nodes speeds up hyperparameter tuning by running several configurations in parallel. See more on compute targets in the official documentation: https://learn.microsoft.com/en-us/azure/machine-learning/concept-compute-target

The code sample below creates a [low priority compute target](https://azure.microsoft.com/en-us/blog/low-priority-scale-sets/), which means that the Azure ML backend looks for underutilized resources across the datacenters and tries to run your job on one of them. Because the resource is low priority, your job may be pre-empted by another job, in which case the backend has to restart it on another resource. The upside is that a low priority resource comes at a fraction of the cost of a dedicated one. In any case, with `min_nodes=0` the cluster scales down to zero nodes when idle, so you do not pay for compute you are not using.

In [29]:
from azureml.core.compute import AmlCompute, ComputeTarget

cluster_name = "gpu-clu-nv24v3"

try:
    compute_target = ws.compute_targets[cluster_name]
    print('Found existing compute target.')
except KeyError:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_NC6', 
                                                           vm_priority='lowpriority', # or 'dedicated'
                                                           idle_seconds_before_scaledown=1800,
                                                           min_nodes=0, 
                                                           max_nodes=4)

    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    
# Can poll for a minimum number of nodes and for a specific timeout.
# If no min_node_count is provided, it will use the scale settings for the cluster.
compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

Creating a new compute target...
InProgress..
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [23]:
from azureml.core import Experiment

experiment_name = "automl-chexpert-classification-multilabel"
experiment = Experiment(ws, name=experiment_name)

The key feature of AutoML is that it can sweep across a set of parameters selecting the combination that works best for your task. 

The configuration below uses the Vision Transformer (ViT) model [vitb16r224](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-image-models?tabs=cli#supported-model-algorithms) with the image resize size set to 256 and the center crop size set to 224. To visualize performance for all epochs, we don't use an [early termination policy](https://learn.microsoft.com/en-us/azure/machine-learning/v1/how-to-auto-train-image-models-v1#early-termination-policies). For larger datasets, and depending on your compute budget, you can specify an early termination policy such as [Median stopping](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters#median-stopping-policy).

You can modify this configuration by adding more models and parameters to try out. See reference here: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-image-models
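
To illustrate why the resize size has to be at least as large as the crop size, the sketch below computes the pixel box of a centered square crop. It assumes the image has already been resized to 256 x 256; how AutoML handles aspect ratio during resizing is not covered in this notebook, and `center_crop_box` is a hypothetical helper.

```python
def center_crop_box(width, height, crop_size):
    """Return (left, top, right, bottom) of a centered square crop."""
    if crop_size > min(width, height):
        raise ValueError("crop_size must not exceed the image dimensions")
    left = (width - crop_size) // 2
    top = (height - crop_size) // 2
    return left, top, left + crop_size, top + crop_size


# A 224-pixel crop from a 256 x 256 image trims a 16-pixel border on each side
print(center_crop_box(256, 256, 224))  # (16, 16, 240, 240)
```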

In [31]:
from azureml.automl.core.shared.constants import ImageTask
from azureml.train.automl import AutoMLImageConfig
from azureml.train.hyperdrive import BanditPolicy, RandomParameterSampling,GridParameterSampling
from azureml.train.hyperdrive import choice, uniform

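parameter_space = {
    "learning_rate": choice(0.0001, 0.0003, 0.0005),
    "early_stopping": choice(0),
    "weighted_loss": choice(1, 2),
    # Fixed value wrapped in choice() like the other entries (grid sampling supports only choice)
    "number_of_epochs": choice(10),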
    "model": choice(
        {
            # model-specific, valid_resize_size should be larger or equal than valid_crop_size
            "model_name": choice("vitb16r224"),            
            "valid_resize_size": choice(256),
            "valid_crop_size": choice(224),  # model-specific
            "train_crop_size": choice(224),  # model-specific
        }
    ),
}

tuning_settings = {
    "iterations": 6,
    "max_concurrent_iterations": 4,
    "hyperparameter_sampling": GridParameterSampling(parameter_space),
    "early_termination_policy":None,
}

automl_image_config = AutoMLImageConfig(
    task=ImageTask.IMAGE_CLASSIFICATION_MULTILABEL,
    compute_target=compute_target,
    training_data=training_dataset,
    validation_data=validation_dataset,
    **tuning_settings,
    
)
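
The grid in this configuration has two parameters with more than one value: three learning rates and two weighted loss settings. The following sketch, which uses only the standard library and is independent of the Azure ML SDK, enumerates the combinations to show why `iterations` is set to 6.

```python
from itertools import product

# Values copied from the parameter space above; all other entries have a single value
learning_rates = [0.0001, 0.0003, 0.0005]
weighted_losses = [1, 2]

grid = [
    {"learning_rate": lr, "weighted_loss": wl}
    for lr, wl in product(learning_rates, weighted_losses)
]

for i, combo in enumerate(grid, start=1):
    print(i, combo)
print("total configurations:", len(grid))
```

With `max_concurrent_iterations` set to 4, the six configurations run in two waves when four nodes are available.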

In [None]:
# Kick off the experiment and put those GPUs to work!
automl_image_run = experiment.submit(automl_image_config)

Visualize the different configurations that were tried using the HyperDrive UI.

In [24]:
from azureml.core import Run
hyperdrive_run = Run(experiment=experiment, run_id=automl_image_run.id + '_HD')
hyperdrive_run

Experiment,Id,Type,Status,Details Page,Docs Page
automl-chexpert-classification-multilabel,AutoML_b7ae46a4-9972-468c-9aeb-18709b09e7f7_HD,hyperdrive,Completed,Link to Azure Machine Learning studio,Link to Documentation


Register the best model

In [27]:
# Pick the best child run of the AutoML run submitted above
best_child_run = automl_image_run.get_best_child()
model_name = best_child_run.properties['model_name']
model = best_child_run.register_model(model_name = model_name, model_path='outputs/model.pt')

# Results using 'small CheXpert' dataset

These are some of the charts from the experiments we have run when preparing this tutorial. You should get something similar.

<img src="images/dp-chexpert_small-runs.png" width=800 />

<img src="images/dp-chexpert_small-runs_parallel_coordinates_chart.png" width=800 />

# Bonus

### Results using standard (non-resized) CheXpert dataset

We also show results of training on CheXpert at the standard image size. The two graphs below show overall performance using a 'seresnext' model. In this case, we performed an exhaustive evaluation by disabling the early termination policy (configuration parameters are below).

You can reproduce these results by downloading and unzipping the standard CheXpert dataset using the steps above and applying the configuration settings below.

<img src="images/dp-chexpert-runs.png" width=800 >

<img src="images/dp-chexpert-runs_parallel_coordinates_chart.png" width=800 >

### Effect of [hyperparameters](https://learn.microsoft.com/en-us/azure/machine-learning/reference-automl-images-hyperparameters#image-classification-multi-class-and-multi-label-specific-hyperparameters)

We can easily visualize the effects of hyperparameters. The two graphs below show the effect of learning rate and weighted loss on the AUC macro metric. Based on these graphs, we observe that performance (AUC macro) increases for:

1. Larger learning rates (greater than 0.001)
2. Weighted loss with sqrt(class_weights), which corresponds to value 1 (value 2 corresponds to weighted loss with class_weights).

<img src="images/dp-chexpert-scatter-AUC_lr.png" width=800 >

<img src="images/dp-chexpert-scatter-AUC_wl.png" width=800 >
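
To give some intuition for the difference between the two weighted loss settings, the sketch below compares a set of class weights with their square roots. How AutoML derives `class_weights` is not described in this notebook, so the inverse-frequency weights used here are an assumption made purely for illustration, as are the helper name and the toy label lists.

```python
import math
from collections import Counter


def inverse_frequency_weights(label_lists):
    """Assumed weighting scheme: total label count divided by per-class count."""
    counts = Counter(label for labels in label_lists for label in labels)
    total = sum(counts.values())
    return {label: total / count for label, count in counts.items()}


# Toy data: "Edema" appears 9 times, "Cardiomegaly" once
samples = [["Edema"]] * 8 + [["Edema", "Cardiomegaly"]]

weights = inverse_frequency_weights(samples)
sqrt_weights = {label: math.sqrt(w) for label, w in weights.items()}

for label in weights:
    print(f"{label}: weight={weights[label]:.2f}, sqrt={sqrt_weights[label]:.2f}")
```

In this toy case the rare class is weighted 9 times more than the common one with plain weights, but only 3 times more with square-root weights, so the square root is the gentler correction for class imbalance.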



## The parameter settings for the runs above:

In [None]:
parameter_space = {
    "learning_rate": uniform(0.0005, 0.002),
    "early_stopping": choice(0),
    "weighted_loss": choice(1, 2),
    # Fixed value wrapped in choice() like the other entries
    "number_of_epochs": choice(15),
    "model": choice(

        {
            # model-specific, valid_resize_size should be larger or equal than valid_crop_size
            "model_name": choice("seresnext"),            
            "valid_resize_size": choice(352),
            "valid_crop_size": choice(256),  # model-specific
            "train_crop_size": choice(256),  # model-specific
            'training_batch_size': choice(48), 
            'validation_batch_size': choice(48),
        }
    ),
}

tuning_settings = {
    "iterations": 20,
    "max_concurrent_iterations": 8,
    "hyperparameter_sampling": RandomParameterSampling(parameter_space),
    "early_termination_policy":BanditPolicy(slack_factor = 0.4,
                                         evaluation_interval = 1,
                                         delay_evaluation = 5),
}

automl_image_config = AutoMLImageConfig(
    task=ImageTask.IMAGE_CLASSIFICATION_MULTILABEL,
    compute_target=compute_target,
    training_data=training_dataset,
    validation_data=validation_dataset,
    **tuning_settings,
    
)
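
The `BanditPolicy` in the settings above stops runs that fall too far behind the best run so far. The sketch below is a simplified reading of the documented behavior for a metric that is maximized, not the SDK's implementation: a run is stopped when its metric is below the best metric divided by `1 + slack_factor`, and the check only applies from `delay_evaluation` onwards, at multiples of `evaluation_interval`. The function name and the sample metric values are hypothetical.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor,
                            interval, evaluation_interval=1, delay_evaluation=0):
    """Return True if a run would be stopped at this interval (maximized metric)."""
    if interval < delay_evaluation:
        return False  # the policy is not applied yet
    if interval % evaluation_interval != 0:
        return False  # the policy is only applied at multiples of the interval
    threshold = best_metric / (1 + slack_factor)
    return run_metric < threshold


# Settings from the configuration above; the metric values are made up
settings = dict(slack_factor=0.4, evaluation_interval=1, delay_evaluation=5)

print(bandit_should_terminate(0.55, 0.84, interval=6, **settings))  # below 0.84 / 1.4
print(bandit_should_terminate(0.70, 0.84, interval=6, **settings))  # within the slack
print(bandit_should_terminate(0.10, 0.84, interval=3, **settings))  # before the delay
```

A slack factor of 0.4 is fairly permissive: with a best AUC of 0.84, a run is only stopped once it drops below roughly 0.60.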


# Conclusion

This tutorial has demonstrated the feasibility of using AutoML, a low-code solution, to achieve state-of-the-art performance on a radiological image classification task. Good luck with your experiments!