AI Ranger Team Demo Development
# Deep Learning for Medical Image Analysis leveraging the AML Platform
### Pneumonia detection using Remote Experiment Runs and Azure ML Pipelines

<img src="images/medicalimage.jpg" width=1000 />

**In this notebook**  
This notebook demonstrates three approaches to training a pneumonia detection model with Azure ML. The first part trains a baseline model on a remote cluster of GPU machines. The second part shows an approach for tuning hyperparameters. The third and last part turns the Remote Experiment Runs into an Azure ML Pipeline that can retrain the model repeatably when new data becomes available. The Pipeline will also include a step for deployment.

# Pneumonia Detection Use Case
A relatively small public dataset of medical images for detecting viral or bacterial pneumonia has been chosen to keep the scenario straightforward and reproducible with limited computing resources. The dataset contains 5,218 x-ray images with two classes of diagnostic outcomes: 3,876 cases with (viral or bacterial) pneumonia and 1,342 cases without findings ("Normal").
The dataset is split into training, validation and test sets. Since some images represent radiographs from the same patient, it has been ensured that there is no overlap of patients between the training, validation and test sets.

<img src="images/pneumonia.png" width=1000 />
You can find the dataset under this location: https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia. 
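
Because the two classes are not balanced, it can help to know what accuracy a trivial classifier would reach. The short sketch below only uses the class counts quoted above and computes the accuracy of always predicting the majority class. It is an illustrative calculation, not part of the training code.

```python
# Class counts quoted in the dataset description above.
counts = {"PNEUMONIA": 3876, "NORMAL": 1342}
total = sum(counts.values())

# Accuracy of a trivial classifier that always predicts the majority class.
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(f"total images: {total}")
print(f"majority class: {majority_class}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

A model whose validation accuracy stays close to this value may have learned little more than predicting the majority class.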


# Neural Network architecture

The following neural network architecture is used:

<img src="images/cnnframe.png" width=1200 />

Though a detailed discussion of the architecture and functionality of Convnets is outside the scope of this demo, the following summary provides a brief overview of the design:
- The x-ray images are resized to a 224 x 224 pixel resolution before being fed into the Convnet. Other medical imaging use cases will most likely require higher resolutions. However, for the selected dataset, high accuracy results can be achieved with this small image size.
- During the data flow through the Convnet, relevant properties for the classification task (features) are extracted in a hierarchical way. The lower layers of the network detect low-level features like edges or surfaces. More complex features (for detecting pneumonia in this case) are extracted at higher layers. The three convolutional layers perform the detection of features at different abstraction levels in the network, where the images are scanned by a small moving window (kernel).
- To reduce computational effort while focusing on the most dominant features, the image size is reduced further as the data flows through the three max pooling layers.
- Two dropout layers are included to reduce the risk of overfitting to the training data.
- The final layer consists of two neurons for representing the classes "pneumonia" and "normal".
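
To make the size reduction described above concrete, the following sketch traces the spatial size of the feature maps through three convolution and max pooling stages. The kernel sizes, padding, strides and channel counts are assumptions chosen for illustration (3 x 3 convolutions with padding 1, 2 x 2 pooling with stride 2, and 32/64/128 channels). The actual values used in the training script may differ.

```python
def conv2d_out(size, kernel=3, stride=1, padding=1):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def maxpool_out(size, kernel=2, stride=2):
    """Output size of a max pooling layer along one spatial dimension."""
    return (size - kernel) // stride + 1


def trace_feature_maps(input_size=224, channels=(32, 64, 128)):
    """Return (channels, height, width) after each conv + max pool stage."""
    size = input_size
    shapes = []
    for out_channels in channels:  # hypothetical channel counts
        size = conv2d_out(size)    # padding 1 keeps the spatial size
        size = maxpool_out(size)   # pooling halves the spatial size
        shapes.append((out_channels, size, size))
    return shapes


shapes = trace_feature_maps()
for stage, shape in enumerate(shapes, start=1):
    print(f"after stage {stage}: {shape}")

# Number of inputs to the first fully connected layer after flattening.
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(f"flattened features: {flattened}")
```

Under these assumptions, each pooling layer halves the width and height, so a 224 x 224 input shrinks to 28 x 28 before the fully connected layers.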

# Setup


## Installs and imports

In [None]:
# Download Kaggle pip package and split-folders
%pip install kaggle --upgrade split-folders

In [2]:
import matplotlib.pyplot as plt

import json
from azureml.core import Workspace, Dataset, Experiment

workspace = Workspace.from_config()

## Retrieve and upload data

When you use this notebook for the first time, the pneumonia dataset should be uploaded to the default AzureML datastore and registered as a managed file dataset.

The commands below download the dataset using the Kaggle API (https://github.com/Kaggle/kaggle-api). Follow the instructions there to generate your own API token, then fill in your Kaggle user name and token in the code cell below.

In [None]:
# Export Kaggle configuration variables
%env KAGGLE_USERNAME=[Kaggle user name]
%env KAGGLE_KEY=[API token]

In [8]:
# Remove folders and zip file left over from previous runs of this cell
# (-f suppresses errors when they do not exist yet, e.g. on the first run)
!rm -f /tmp/chest-xray-pneumonia.zip
!rm -rf /tmp/chest_xray
!rm -rf /tmp/chest_xray_tvt

# Download the Pneumonia dataset. ISSUE: Requires Python 3.6 AzureML kernel which is not available in newer Compute Instances
!kaggle datasets download -d paultimothymooney/chest-xray-pneumonia -p /tmp

!unzip -q /tmp/chest-xray-pneumonia.zip -d /tmp

Downloading chest-xray-pneumonia.zip to /tmp
100%|██████████████████████████████████████▉| 2.29G/2.29G [00:15<00:00, 181MB/s]
100%|███████████████████████████████████████| 2.29G/2.29G [00:15<00:00, 159MB/s]


In [9]:
import splitfolders

download_root = '/tmp/chest_xray/train' 
train_val_test_root = '/tmp/chest_xray_tvt/'

train_val_test_split = (0.8, 0.1, 0.1)
random_seed = 33

splitfolders.ratio(download_root, output=train_val_test_root, seed=random_seed, ratio=train_val_test_split)

Copying files: 5216 files [00:01, 2796.29 files/s]
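
The idea behind a seeded ratio split can be sketched in a few lines of plain Python. This is a simplified illustration and not the split-folders implementation: the function name, the use of integer percentages and the way the boundaries are rounded down are assumptions. The per-class totals (1,341 NORMAL and 3,875 PNEUMONIA images) are the sums of the counts reported by the split check in the next cell.

```python
import random


def split_indices(n_files, percentages=(80, 10, 10), seed=33):
    """Shuffle file indices with a fixed seed and cut them into train/val/test."""
    assert sum(percentages) == 100
    indices = list(range(n_files))
    random.Random(seed).shuffle(indices)  # deterministic for a given seed

    # Cumulative boundaries, rounded down; the remainder goes to the test set.
    train_end = n_files * percentages[0] // 100
    val_end = n_files * (percentages[0] + percentages[1]) // 100
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]


for label, n_files in [("NORMAL", 1341), ("PNEUMONIA", 3875)]:
    train, val, test = split_indices(n_files)
    print(f"{label}: train={len(train)} val={len(val)} test={len(test)}")
```

Splitting each class folder separately, as above, keeps the class proportions roughly the same in all three sets.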


In [10]:
import os

# Check dataset splits: count the images per split and class
for split in os.listdir(train_val_test_root):
    for label in ['NORMAL', 'PNEUMONIA']:
        files = os.listdir(os.path.join(train_val_test_root, split, label))
        print(f'{split}-{label}: ', len(files))

test-NORMAL:  135
test-PNEUMONIA:  388
train-NORMAL:  1072
train-PNEUMONIA:  3100
val-NORMAL:  134
val-PNEUMONIA:  387


In [11]:
from azureml.core import Workspace, Datastore, Dataset
from azureml.data.datapath import DataPath

# Upload data to the default AzureML datastore
datastore = workspace.get_default_datastore()
ds = Dataset.File.upload_directory(src_dir=train_val_test_root,
            target=DataPath(datastore, 'chest-xray'),
            show_progress=False, overwrite=False)

# Register file dataset with AzureML
ds = ds.register(workspace=workspace, name="pneumonia", description="Pneumonia train / val / test folders with 2 classes", create_new_version=True)

print(f'Dataset {ds.name} registered.')

Dataset pneumonia registered.


# I. Run baseline experiment

## Create/retrieve Compute Cluster

In [12]:
from azureml.core.compute import AmlCompute, ComputeTarget

cluster_name = "gpu-cluster"

try:
    compute_target = workspace.compute_targets[cluster_name]
    print('Found existing compute target.')
except KeyError:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_NC6', 
                                                           idle_seconds_before_scaledown=1800,
                                                           min_nodes=0, 
                                                           max_nodes=4)

    compute_target = ComputeTarget.create(workspace, cluster_name, compute_config)
    
# Can poll for a minimum number of nodes and for a specific timeout.
# If no min_node_count is provided, it will use the scale settings for the cluster.
compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

Found existing compute target.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Define ScriptRunConfig object

Define the Environment that you will use to run your experiment, retrieve the dataset by name and define the ScriptRunConfig object.

In [13]:
from azureml.core import ScriptRunConfig, Environment
from azureml.core.compute import ComputeTarget

experiment = Experiment(workspace, 'pneumonia')

pytorch_env = Environment.from_conda_specification(name = 'pytorch-1.6-gpu', file_path = './training/conda_dependencies.yml')

dataset = Dataset.get_by_name(workspace, name='pneumonia', version='latest')

src = ScriptRunConfig(source_directory='./training',
                      script='train.py',
                      arguments=['--epochs', 15, '--data-folder', dataset.as_mount()],
                      compute_target= ComputeTarget(workspace, 'gpu-cluster'),
                      environment=pytorch_env)
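
The `arguments` list is handed to `train.py` as command-line arguments. The training script itself is not shown in this notebook, so the sketch below is only an assumption of how such a script might parse them with `argparse`. The default values are hypothetical, and the hyperparameter flags mirror the names that are sampled in part II, on the assumption that HyperDrive appends each sampled value as a `--name value` pair.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line arguments a script like train.py might receive."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py arguments")
    parser.add_argument("--epochs", type=int, default=10)
    # With dataset.as_mount(), the script receives the mount path as a string.
    parser.add_argument("--data-folder", type=str, dest="data_folder", required=True)

    # Hyperparameters (defaults are illustrative assumptions).
    parser.add_argument("--learning_rate", type=float, default=0.0007)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--conv_dropout", type=float, default=0.2)
    parser.add_argument("--optimizer", type=str, default="Adam",
                        choices=["SGD", "Adam", "RMSprop"])
    return parser.parse_args(argv)


# Example invocation with a placeholder data path.
args = parse_args(["--epochs", "15", "--data-folder", "/tmp/chest_xray_tvt"])
print(args.epochs, args.data_folder, args.optimizer)
```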

## Submit baseline experiment

In [14]:
from azureml.widgets import RunDetails

script_run = experiment.submit(src)
RunDetails(script_run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

In [16]:
_ = script_run.wait_for_completion()

# II. Hyperparameter tuning using Random Parameter Sampling
Hyperparameter tuning, also called hyperparameter optimization, is the process of finding the configuration of hyperparameters that results in the best performance. The process is typically computationally expensive and manual.

Azure Machine Learning lets you automate hyperparameter tuning and run experiments in parallel to efficiently optimize hyperparameters.

In random sampling, hyperparameter values are randomly selected from the defined search space. Random sampling supports both discrete and continuous hyperparameters, as well as early termination of low-performance runs. Some users do an initial search with random sampling and then refine the search space to improve results.

Selected hyperparameters affect various stages of the experiment:

- Data: Training and validation loader: batch size
- CNN Architecture: Dropout
- Choice of optimizer
- Training loop: learning rate
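
What random sampling does can be sketched with the standard library alone. The example below draws eight configurations from the same search space that is defined in the next cell. The dictionary layout and the function name are assumptions made for this illustration and are unrelated to how HyperDrive represents the search space internally.

```python
import random

# Search space mirroring the one used for HyperDrive in the next cell.
search_space = {
    "learning_rate": ("choice", [0.00007, 0.0007, 0.07]),
    "batch_size": ("choice", [16, 32, 64, 128]),
    "conv_dropout": ("uniform", (0.0, 0.5)),
    "optimizer": ("choice", ["SGD", "Adam", "RMSprop"]),
}


def sample_configuration(space, rng):
    """Draw one value per hyperparameter, independently of the others."""
    config = {}
    for name, (kind, spec) in space.items():
        if kind == "choice":       # discrete: pick one of the listed values
            config[name] = rng.choice(spec)
        elif kind == "uniform":    # continuous: draw from [low, high]
            low, high = spec
            config[name] = rng.uniform(low, high)
        else:
            raise ValueError(f"unknown distribution: {kind}")
    return config


rng = random.Random(0)  # fixed seed so the sketch is reproducible
configs = [sample_configuration(search_space, rng) for _ in range(8)]
for config in configs:
    print(config)
```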

In [17]:
from azureml.train.hyperdrive import RandomParameterSampling, BanditPolicy, HyperDriveConfig, uniform, choice, PrimaryMetricGoal

param_sampling = RandomParameterSampling( {
        'learning_rate': choice(0.00007, 0.0007, 0.07),
        'batch_size': choice(16, 32, 64, 128), 
        'conv_dropout' : uniform(0.0, 0.5), 
        'optimizer': choice('SGD', 'Adam', 'RMSprop')
    }
)

early_termination_policy = BanditPolicy(slack_factor=0.15, evaluation_interval=1, delay_evaluation=5)

hyperdrive_config = HyperDriveConfig(run_config=src,
                                     hyperparameter_sampling=param_sampling, 
                                     policy=early_termination_policy,
                                     primary_metric_name='best_val_acc',
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=8,
                                     max_concurrent_runs=4)
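
The early termination rule configured above can be approximated in a few lines. The sketch below is a simplified model of a bandit policy for a metric that is maximized: a run is stopped when its best metric falls below the best metric of all runs divided by `1 + slack_factor`. The function name, the exact handling of the delay boundary and the example numbers are assumptions for illustration.

```python
def bandit_should_terminate(run_best, overall_best, interval,
                            slack_factor=0.15, evaluation_interval=1,
                            delay_evaluation=5):
    """Simplified bandit rule for a metric that should be maximized."""
    if interval <= delay_evaluation:
        return False  # assumed: no evaluation during the first intervals
    if interval % evaluation_interval != 0:
        return False  # the policy is only applied every evaluation_interval
    threshold = overall_best / (1 + slack_factor)
    return run_best < threshold


# Illustrative numbers: the best run so far reports 0.98 validation accuracy,
# so the threshold is 0.98 / 1.15, roughly 0.852.
print(bandit_should_terminate(run_best=0.74, overall_best=0.98, interval=6))
print(bandit_should_terminate(run_best=0.96, overall_best=0.98, interval=6))
print(bandit_should_terminate(run_best=0.74, overall_best=0.98, interval=3))
```

With a slack factor of 0.15, a run has to stay within roughly 87% of the best run's metric to keep running.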

## Submit hyperdrive run

In [18]:
from azureml.widgets import RunDetails

# start the HyperDrive run
hyperdrive_run = experiment.submit(hyperdrive_config)

RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

In [19]:
_ = hyperdrive_run.wait_for_completion()

# III. Define AML pipeline with HyperDriveStep

The third part shows how to use the Azure ML Pipeline capability to build a pipeline from the same training script that we have been using. The cell below creates the first step of the pipeline and defines the pipeline data that this step outputs: the training metrics and the saved model.
The HyperDrive config defined in the previous part is reused.

In [21]:
from azureml.pipeline.steps import HyperDriveStep, HyperDriveStepRun, PythonScriptStep
from azureml.pipeline.core import Pipeline, PipelineData, TrainingOutput

metrics_output_name = 'metrics_output'
metrics_data = PipelineData(name='metrics_data',
                            datastore=workspace.get_default_datastore(),
                            pipeline_output_name=metrics_output_name,
                            training_output=TrainingOutput("Metrics"))

model_output_name = 'model_output'
saved_model = PipelineData(name='saved_model',
                            datastore=workspace.get_default_datastore(),
                            pipeline_output_name=model_output_name,
                            training_output=TrainingOutput("Model",
                                                           model_file="outputs/model/pneumonia.pt"))

hd_step_name='hyperdrive_step'
hd_step = HyperDriveStep(
    name=hd_step_name,
    hyperdrive_config=hyperdrive_config,
    inputs=[dataset.as_mount()],
    outputs=[metrics_data, saved_model])

## Find and register best model

We add a step to the pipeline that registers the best model produced by the HyperDriveStep.

In [22]:
%%writefile training/register_model.py

import argparse
import json
import os
from azureml.core import Workspace, Experiment, Model
from azureml.core import Run
from shutil import copy2

parser = argparse.ArgumentParser()
parser.add_argument('--saved-model', type=str, dest='saved_model', help='path to saved model file')
args = parser.parse_args()

model_output_dir = './model/'

os.makedirs(model_output_dir, exist_ok=True)
copy2(args.saved_model, model_output_dir)

ws = Run.get_context().experiment.workspace

model = Model.register(workspace=ws, model_name='pneunomia-pytorch', model_path=model_output_dir)

Overwriting training/register_model.py


In [23]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

conda_dep = CondaDependencies()
conda_dep.add_pip_package("azureml-sdk")

rcfg = RunConfiguration(conda_dependencies=conda_dep)

register_model_step = PythonScriptStep(source_directory='./training',
                                       script_name='register_model.py',
                                       name="register_model_step01",
                                       inputs=[saved_model],
                                       compute_target=ComputeTarget(workspace, 'gpu-cluster'),
                                       arguments=["--saved-model", saved_model],
                                       allow_reuse=True,
                                       runconfig=rcfg)

register_model_step.run_after(hd_step)

## Submit pipeline including model registration

In [24]:
pipeline = Pipeline(workspace=workspace, steps=[hd_step, register_model_step])
pipeline_run = experiment.submit(pipeline)

Created step hyperdrive_step [21656851][abc7a277-3c29-4933-9969-0b080a506579], (This step will run and generate new outputs)
Created step register_model_step01 [87357745][9d2e8a1e-8a54-414c-8c01-cb88f9eb61c4], (This step is eligible to reuse a previous run's output)
Submitted PipelineRun 56a40e12-ea58-45f7-8ca9-0615722f28c8
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/56a40e12-ea58-45f7-8ca9-0615722f28c8?wsid=/subscriptions/4eeedd72-d937-4243-86d1-c3982a84d924/resourcegroups/livecell/workspaces/livecell&tid=72f988bf-86f1-41af-91ab-2d7cd011db47


In [25]:
_ = pipeline_run.wait_for_completion()

PipelineRunId: 56a40e12-ea58-45f7-8ca9-0615722f28c8
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/56a40e12-ea58-45f7-8ca9-0615722f28c8?wsid=/subscriptions/4eeedd72-d937-4243-86d1-c3982a84d924/resourcegroups/livecell/workspaces/livecell&tid=72f988bf-86f1-41af-91ab-2d7cd011db47
PipelineRun Status: Running


StepRunId: c3ca0f48-0f97-43cd-9c1e-441eb4f1e08d
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/c3ca0f48-0f97-43cd-9c1e-441eb4f1e08d?wsid=/subscriptions/4eeedd72-d937-4243-86d1-c3982a84d924/resourcegroups/livecell/workspaces/livecell&tid=72f988bf-86f1-41af-91ab-2d7cd011db47
StepRun( hyperdrive_step ) Status: Running

StepRun(hyperdrive_step) Execution Summary
StepRun( hyperdrive_step ) Status: Finished
{'runId': 'c3ca0f48-0f97-43cd-9c1e-441eb4f1e08d', 'status': 'Completed', 'startTimeUtc': '2022-04-14T14:16:31.925382Z', 'endTimeUtc': '2022-04-14T15:12:50.410201Z', 'services': {}, 'properties': {'ContentSnapshotId': '2ba0b739-1103-42ef-b04c-a29dcb7f2

## Download training metrics

In [26]:
metrics_output = pipeline_run.get_pipeline_output(metrics_output_name)
num_file_downloaded = metrics_output.download('.', show_progress=True)

Downloading azureml/c3ca0f48-0f97-43cd-9c1e-441eb4f1e08d/metrics_data
Downloaded azureml/c3ca0f48-0f97-43cd-9c1e-441eb4f1e08d/metrics_data, 1 files out of an estimated total of 1


## Visualize training metrics

In [27]:
import pandas as pd
import json
with open(metrics_output._path_on_datastore) as f:  
    metrics_output_result = f.read()
    
deserialized_metrics_output = json.loads(metrics_output_result)
df = pd.DataFrame(deserialized_metrics_output)
df

| Metric | HD_9ffa8646-d059-4eda-b136-6c06a8c9e5ff_0 | HD_9ffa8646-d059-4eda-b136-6c06a8c9e5ff_1 | HD_9ffa8646-d059-4eda-b136-6c06a8c9e5ff_2 | HD_9ffa8646-d059-4eda-b136-6c06a8c9e5ff_4 | HD_9ffa8646-d059-4eda-b136-6c06a8c9e5ff_3 | HD_9ffa8646-d059-4eda-b136-6c06a8c9e5ff_5 | HD_9ffa8646-d059-4eda-b136-6c06a8c9e5ff_6 | HD_9ffa8646-d059-4eda-b136-6c06a8c9e5ff_7 |
|---|---|---|---|---|---|---|---|---|
| best_val_acc | [0.73828125, 0.73828125, 0.73828125, 0.7402343... | [0.9375, 0.958984375, 0.958984375, 0.978515625... | [0.73828125, 0.73828125, 0.73828125, 0.7382812... | [0.951171875, 0.97265625, 0.9765625, 0.9765625... | [0.73828125, 0.73828125, 0.73828125, 0.7382812... | [0.96484375, 0.96484375, 0.966796875, 0.972656... | [0.73828125, 0.73828125, 0.73828125, 0.7382812... | [0.96484375, 0.96484375, 0.966796875, 0.974609... |
| validation accuracy | [0.73828125, 0.73828125, 0.73828125, 0.7402343... | [0.9375, 0.958984375, 0.943359375, 0.978515625... | [0.73828125, 0.73828125, 0.73828125, 0.7382812... | [0.951171875, 0.97265625, 0.9765625, 0.9570312... | [0.73828125, 0.73828125, 0.73828125, 0.7382812... | [0.96484375, 0.96484375, 0.966796875, 0.972656... | [0.73828125, 0.73828125, 0.73828125, 0.7382812... | [0.96484375, 0.958984375, 0.966796875, 0.97460... |
| validation loss | [0.571142793388147, 0.5503021174337493, 0.5244... | [0.15307974952653822, 0.14249444785822835, 0.1... | [0.6616888650319398, 0.5675458853166987, 0.578... | [0.1399190309180408, 0.09493237424949309, 0.10... | [0.6032876986688478, 0.5876256097072374, 0.581... | [0.09238113511546789, 0.09778779058676078, 0.0... | [0.7011677376825841, 0.576986137141193, 0.5906... | [0.09617587429204967, 0.1186367193819694, 0.09... |
| Train imgs | [4172.0] | [4172.0] | [4172.0] | [4172.0] | [4172.0] | [4172.0] | [4172.0] | [4172.0] |
| training accuracy | [0.7403846153846154, 0.7430288461538461, 0.743... | [0.8322115384615385, 0.9536057692307692, 0.962... | [0.623291015625, 0.66015625, 0.65625, 0.674804... | [0.9129807692307692, 0.9588942307692307, 0.967... | [0.7425480769230769, 0.7430288461538461, 0.743... | [0.9139423076923077, 0.9625, 0.971634615384615... | [0.6427884615384616, 0.7052884615384616, 0.684... | [0.8831730769230769, 0.9572115384615385, 0.967... |
| training loss | [0.5877237754265848, 0.5524920790901806, 0.522... | [6.387939022520016, 0.1332745910090888, 0.1033... | [2122011624.351042, 381.15741803616373, 283.84... | [0.26221456522280967, 0.10199810359501998, 0.0... | [0.6261790975971167, 0.5855979583407408, 0.576... | [1.2016393507822762, 0.09463848423972647, 0.08... | [105454933.47803673, 34.846322106492146, 17.12... | [1.960142974117601, 0.11514388852728637, 0.093... |
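
The metrics file maps each child run to its logged metric series, so the best run can be picked programmatically. The sketch below assumes that layout (run id, then metric name, then a list of values) and uses hypothetical run names with only the first three `best_val_acc` values of three of the columns above, since the full series are truncated in the output.

```python
import json

# Excerpt in the assumed layout {run_id: {metric_name: [values per epoch]}}.
# Run names are placeholders; values are the first three entries shown above.
metrics_json = json.dumps({
    "run_a": {"best_val_acc": [0.73828125, 0.73828125, 0.73828125]},
    "run_b": {"best_val_acc": [0.9375, 0.958984375, 0.958984375]},
    "run_c": {"best_val_acc": [0.951171875, 0.97265625, 0.9765625]},
})


def best_run(metrics, metric_name="best_val_acc"):
    """Return the run id with the highest value of the given metric."""
    scores = {run_id: max(values[metric_name])
              for run_id, values in metrics.items()}
    winner = max(scores, key=scores.get)
    return winner, scores[winner]


winner, score = best_run(json.loads(metrics_json))
print(f"best run: {winner} with best_val_acc={score}")
```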


## Publish the training pipeline

Publishing the training pipeline creates a REST endpoint that external services can use to trigger the pipeline.

In [28]:
published_pipeline1 = pipeline_run.publish_pipeline(
     name="Training_pneumonia",
     description="Pipeline to train a classification model to detect pneumonia.",
     version="1.0")

print(published_pipeline1.id)

5d6a4e0a-647e-427e-aa51-56466cd4a2a0
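
An external service would trigger the published pipeline with an authenticated HTTP POST to its REST endpoint. The sketch below only builds such a request with the standard library and does not send it. The endpoint URL and token are placeholders. In a notebook session the URL would typically come from `published_pipeline1.endpoint` and the header from an Azure ML authentication object such as `InteractiveLoginAuthentication().get_authentication_header()`. The helper name is hypothetical.

```python
import json
import urllib.request


def build_trigger_request(rest_endpoint, auth_header, experiment_name):
    """Build (but do not send) a POST request that would start a pipeline run."""
    body = json.dumps({"ExperimentName": experiment_name}).encode("utf-8")
    headers = {"Content-Type": "application/json", **auth_header}
    return urllib.request.Request(rest_endpoint, data=body,
                                  headers=headers, method="POST")


# Placeholder values, to be replaced with the real endpoint and token.
request = build_trigger_request(
    rest_endpoint="https://example.invalid/pipelines/placeholder-id",
    auth_header={"Authorization": "Bearer <token>"},
    experiment_name="pneumonia",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending the request, for example with `urllib.request.urlopen(request)`, requires a valid endpoint and token.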


## Create a schedule based on file change
One advantage of defining and publishing your training script as an Azure ML Pipeline is that a schedule can trigger retraining of your model whenever files change in the source dataset.

In [2]:
from azureml.pipeline.core.schedule import Schedule

schedule = Schedule.list(workspace) 
print(schedule)

# To disable a schedule, for example the first one in the list:
# sch = Schedule.list(workspace)[0]
# sch.disable()

[Pipeline(Name: MyReactiveSchedule,
Id: f95649d0-1a91-410a-a647-da823538be60,
Status: Active,
Pipeline Id: f2ad64dc-ef8a-4f3d-b199-35c14c8d8b97,
Pipeline Endpoint Id: None,
Datastore: workspaceblobstore,
Path on Datastore: chest-xray/train/PNEUMONIA)]


In [18]:
from azureml.pipeline.core import PipelineEndpoint

PipelineEndpoint.list(workspace)

# PipelineEndpoint.get(workspace=workspace, name="Training_pneumonia")

[]

In [29]:
from azureml.pipeline.core.schedule import Schedule
from azureml.pipeline.core import PipelineEndpoint

datastore = workspace.get_default_datastore()

reactive_schedule = Schedule.create(workspace, name="MyReactiveSchedule", 
                                    description="Based on input file change.",
                                    pipeline_id=published_pipeline1.id,
                                    experiment_name=experiment.name,
                                    datastore=datastore, 
                                    path_on_datastore="chest-xray/train/PNEUMONIA")
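
Conceptually, a schedule that reacts to file changes compares the monitored path at regular intervals with what it saw before and starts a run when files were added or modified. The sketch below illustrates only this comparison. It is not Azure ML's implementation, and the function name, file names and timestamps are hypothetical.

```python
def detect_changes(previous, current):
    """Compare two {path: last_modified} snapshots of a monitored folder."""
    added = sorted(path for path in current if path not in previous)
    modified = sorted(path for path in current
                      if path in previous and current[path] > previous[path])
    return added, modified


# Hypothetical snapshots taken at two consecutive polls.
before = {"chest-xray/train/PNEUMONIA/existing.jpeg": 100}
after = {
    "chest-xray/train/PNEUMONIA/existing.jpeg": 100,
    "chest-xray/train/PNEUMONIA/new_sample.jpeg": 160,
}

added, modified = detect_changes(before, after)
should_trigger = bool(added or modified)
print(f"added={added} modified={modified} trigger={should_trigger}")
```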

The next cell uploads a few sample files to the monitored path in the Datastore to trigger a pipeline run. You can check the status under Pipelines > Pipeline endpoints > Training_pneumonia > Runs.

In [30]:
from azureml.data.datapath import DataPath

samples = './chest-xray-retrain'

datastore = workspace.get_default_datastore()
ds = Dataset.File.upload_directory(src_dir=samples,
            target=DataPath(datastore, 'chest-xray/train'),
            show_progress=True, overwrite=False)

Validating arguments.
Arguments validated.
Uploading file to chest-xray/train
Uploading an estimated of 10 files
Uploading ./chest-xray-retrain/NORMAL/add_IM-0077-0001.jpeg
Uploaded ./chest-xray-retrain/NORMAL/add_IM-0077-0001.jpeg, 1 files out of an estimated total of 10
Uploading ./chest-xray-retrain/NORMAL/add_NORMAL2-IM-0146-0001.jpeg
Uploaded ./chest-xray-retrain/NORMAL/add_NORMAL2-IM-0146-0001.jpeg, 2 files out of an estimated total of 10
Uploading ./chest-xray-retrain/NORMAL/add_NORMAL2-IM-0198-0001.jpeg
Uploaded ./chest-xray-retrain/NORMAL/add_NORMAL2-IM-0198-0001.jpeg, 3 files out of an estimated total of 10
Uploading ./chest-xray-retrain/NORMAL/add_NORMAL2-IM-0241-0001.jpeg
Uploaded ./chest-xray-retrain/NORMAL/add_NORMAL2-IM-0241-0001.jpeg, 4 files out of an estimated total of 10
Uploading ./chest-xray-retrain/NORMAL/add_NORMAL2-IM-0276-0001.jpeg
Uploaded ./chest-xray-retrain/NORMAL/add_NORMAL2-IM-0276-0001.jpeg, 5 files out of an estimated total of 10
Uploading ./chest-xray-