Copyright 2018 Google Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


# Training/Inference on Breast Density Classification Model on Cloud ML Engine


The goal of this tutorial is to train, deploy and run inference on a breast density classification model. Breast density is thought to be a risk factor for breast cancer. The tutorial emphasizes using the [Cloud Healthcare API](https://cloud.google.com/healthcare/) to store, retrieve and transcode medical images (in DICOM format) in a managed and scalable way, and using [Cloud Machine Learning Engine](https://cloud.google.com/ml-engine/) to train and serve the model at scale.

**Note: This is the Cloud ML Engine version of the AutoML Codelab found [here](./breast_density_auto_ml).**

## Requirements
- A Google Cloud project.
- Project has [Cloud Healthcare API](https://cloud.google.com/healthcare/docs/quickstart) enabled.
- Project has [Cloud Machine Learning API](https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction) enabled.
- Project has [Cloud Dataflow API](https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python) enabled.
- Project has [Cloud Build API](https://cloud.google.com/cloud-build/docs/quickstart-docker) enabled.
- Project has [Kubernetes Engine API](https://console.developers.google.com/apis/api/container.googleapis.com/overview?project=) enabled.
- Project has [Cloud Resource Manager API](https://console.cloud.google.com/cloud-resource-manager) enabled.


## Input Dataset

The dataset used for training is the [TCIA CBIS-DDSM](https://wiki.cancerimagingarchive.net/display/Public/CBIS-DDSM) dataset. It contains ~2500 mammography images in DICOM format. Each image is given a [BI-RADS breast density](https://breast-cancer.ca/densitbi-rads/) score from 1 to 4. In this tutorial, we will build a binary classifier that distinguishes between breast density "2" (*scattered density*) and "3" (*heterogeneously dense*). These are the two most common scores and the ones assigned most variably. In the literature, this distinction is said to be [particularly difficult for radiologists to make consistently](https://aapm.onlinelibrary.wiley.com/doi/pdf/10.1002/mp.12683).

In [None]:
project_id = "MY_PROJECT" # @param
location = "us-central1"
dataset_id = "MY_DATASET" # @param
dicom_store_id = "MY_DICOM_STORE" # @param

# Input data used by Cloud ML must be in a bucket with the following format.
cloud_bucket_name = "gs://" + project_id + "-vcm"

In [None]:
%%bash -s {project_id} {location} {cloud_bucket_name}
# Create bucket.
gsutil -q mb -c regional -l $2 $3

# Allow Cloud Healthcare API to write to bucket.
PROJECT_NUMBER=`gcloud projects describe $1 | grep projectNumber | sed 's/[^0-9]//g'`
SERVICE_ACCOUNT="service-${PROJECT_NUMBER}@gcp-sa-healthcare.iam.gserviceaccount.com"
COMPUTE_ENGINE_SERVICE_ACCOUNT="${PROJECT_NUMBER}-compute@developer.gserviceaccount.com"

gsutil -q iam ch serviceAccount:${SERVICE_ACCOUNT}:objectCreator $3
gsutil -q iam ch serviceAccount:${COMPUTE_ENGINE_SERVICE_ACCOUNT}:objectCreator $3
gcloud projects add-iam-policy-binding $1 --member=serviceAccount:${SERVICE_ACCOUNT} --role=roles/pubsub.publisher
gcloud projects add-iam-policy-binding $1 --member=serviceAccount:${COMPUTE_ENGINE_SERVICE_ACCOUNT} --role roles/pubsub.admin

In [None]:
import json
import os
import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, project = google.auth.default()
authed_session = AuthorizedSession(credentials)
# Path to Cloud Healthcare API.
HEALTHCARE_API_URL = 'https://healthcare.googleapis.com/v1beta1'

# Create Cloud Healthcare API dataset.
path = os.path.join(HEALTHCARE_API_URL, 'projects', project_id, 'locations', location, 'datasets?dataset_id=' + dataset_id)
headers = {'Content-Type': 'application/json'}
resp = authed_session.post(path, headers=headers)

assert resp.status_code == 200, 'error creating Dataset, code: {0}, response: {1}'.format(resp.status_code, resp.text)
print('Full response:\n{0}'.format(resp.text))

# Create Cloud Healthcare API DICOM store.
path = os.path.join(HEALTHCARE_API_URL, 'projects', project_id, 'locations', location, 'datasets', dataset_id, 'dicomStores?dicom_store_id=' + dicom_store_id)
resp = authed_session.post(path, headers=headers)
assert resp.status_code == 200, 'error creating DICOM store, code: {0}, response: {1}'.format(resp.status_code, resp.text)
print('Full response:\n{0}'.format(resp.text))
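
The requests above build URLs with `os.path.join`, which produces forward slashes on POSIX systems but would insert backslashes on Windows. A minimal sketch of a platform-independent alternative follows. The helper name `healthcare_url` is hypothetical and not part of any Google library.

```python
from urllib.parse import quote

HEALTHCARE_API_URL = "https://healthcare.googleapis.com/v1beta1"


def healthcare_url(*segments, base=HEALTHCARE_API_URL):
    """Join URL path segments with '/' regardless of the host OS.

    Hypothetical helper for illustration only.
    """
    # Percent-encode each segment, keeping '/' and ':' (used by custom
    # methods such as ':import') intact.
    cleaned = [quote(str(s).strip("/"), safe="/:") for s in segments]
    return "/".join([base.rstrip("/")] + cleaned)


url = healthcare_url("projects", "MY_PROJECT", "locations", "us-central1", "datasets")
print(url)
```

With a helper like this, query parameters such as `dataset_id` would be passed separately, for example through the `params` argument of `authed_session.post`, rather than appended to the last path segment.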

Next, we are going to transfer the DICOM instances to the Cloud Healthcare API.

Note: We are transferring more than 100 GB of data, so this will take some time to complete.

In [None]:
# Store DICOM instances in Cloud Healthcare API.
path = "https://healthcare.googleapis.com/v1beta1/projects/{}/locations/{}/datasets/{}/dicomStores/{}:import".format(
      project_id, location, dataset_id, dicom_store_id)
headers = {'Content-Type': 'application/json'}
body = { 
      'gcsSource': {
        'uri': 'gs://gcs-public-data--healthcare-tcia-cbis-ddsm/dicom/**'
      }
}
resp = authed_session.post(path, headers=headers, json=body)
assert resp.status_code == 200, 'error importing DICOM instances, code: {0}, response: {1}'.format(resp.status_code, resp.text)
print('Full response:\n{0}'.format(resp.text))
response = json.loads(resp.text)
operation_name = response['name']

In [None]:
import time

def wait_for_operation_completion(path, timeout, sleep_time=30):
  """Polls the operation at `path` until it is done or `timeout` (epoch seconds) passes."""
  success = False
  response = None
  while time.time() < timeout:
    print('Waiting for operation completion...')
    resp = authed_session.get(path)
    assert resp.status_code == 200, 'error polling for Operation results, code: {0}, response: {1}'.format(resp.status_code, resp.text)
    response = json.loads(resp.text)
    if response.get('done'):
      # A finished operation succeeded only if it carries no error.
      success = 'error' not in response
      break
    time.sleep(sleep_time)

  print('Full response:\n{0}'.format(json.dumps(response, indent=2)))
  assert success, "operation did not complete successfully in time limit"
  print('Success!')
  return response

In [None]:
path = os.path.join(HEALTHCARE_API_URL, operation_name)
timeout = time.time() + 40*60 # Wait up to 40 minutes.
_ = wait_for_operation_completion(path, timeout)

### Explore the Cloud Healthcare DICOM dataset (optional)

This optional section explores the Cloud Healthcare DICOM dataset. The following code lists the studies that we have loaded into the Cloud Healthcare API. You can modify the *num_of_studies_to_print* parameter to print as many studies as desired.

In [None]:
num_of_studies_to_print = 2 # @param


path = os.path.join(HEALTHCARE_API_URL, 'projects', project_id, 'locations', location, 'datasets', dataset_id, 'dicomStores', dicom_store_id, 'dicomWeb', 'studies')
resp = authed_session.get(path)
assert resp.status_code == 200, 'error querying Dataset, code: {0}, response: {1}'.format(resp.status_code, resp.text)
response = json.loads(resp.text)

print(json.dumps(response[:num_of_studies_to_print], indent=2))
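
The study listing is returned in the DICOM JSON model, where each attribute is keyed by its eight-character tag. As a small sketch of how such a response can be read, the function below collects the StudyInstanceUID (tag `0020000D`) of each study. The function name and the sample data are made up for illustration.

```python
def study_instance_uids(qido_response):
    """Return the StudyInstanceUID (tag 0020000D) of each study in a QIDO-RS JSON result."""
    uids = []
    for study in qido_response:
        values = study.get("0020000D", {}).get("Value", [])
        if values:
            uids.append(values[0])
    return uids


# Hand-written sample in the DICOM JSON model (not real API output).
sample = [
    {"0020000D": {"vr": "UI", "Value": ["1.2.3"]}},
    {"0020000D": {"vr": "UI", "Value": ["1.2.4"]}},
]
print(study_instance_uids(sample))
```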

## Convert DICOM to JPEG

The ML model that we will build requires that the dataset be in JPEG. We will leverage the Cloud Healthcare API to transcode DICOM to JPEG.

First we will pick a [Google Cloud Storage](https://cloud.google.com/storage/) path, inside the bucket created earlier, to hold the output JPEG files. Then we will use the ExportDicomData API to transform the DICOMs to JPEGs.

In [None]:
jpeg_bucket = cloud_bucket_name + "/images/"

Next we will convert the DICOMs to JPEGs using the [ExportDicomData](https://cloud.google.com/sdk/gcloud/reference/beta/healthcare/dicom-stores/export/gcs) command.

In [None]:
%%bash -s {jpeg_bucket} {project_id} {location} {dataset_id} {dicom_store_id}
gcloud beta healthcare --project $2  dicom-stores export gcs $5 --location=$3 --dataset=$4 --mime-type="image/jpeg; transfer-syntax=1.2.840.10008.1.2.4.50" --gcs-uri-prefix=$1

We will use the Operation name returned from the previous command to poll the status of ExportDicomData until it completes, which should take a few minutes. When the operation is complete, its *done* field will be set to true.

Meanwhile, you should be able to observe the JPEG images being added to your Google Cloud Storage bucket.
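
The polling described above can be sketched as follows. This is a minimal illustration, not the notebook's own helper: `poll_until_done` and its parameters are hypothetical names, and the operation is fetched through a caller-supplied function so the loop can be exercised without network access.

```python
import time


def poll_until_done(fetch_operation, timeout_seconds=600, sleep_seconds=30, sleep=time.sleep):
    """Call fetch_operation() until the returned dict has done == True.

    fetch_operation is a caller-supplied function returning the Operation as a dict.
    Raises RuntimeError if the operation reports an error and TimeoutError if the
    deadline passes first.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        operation = fetch_operation()
        if operation.get("done"):
            if "error" in operation:
                raise RuntimeError("operation failed: {}".format(operation["error"]))
            return operation
        if time.monotonic() >= deadline:
            raise TimeoutError("operation did not finish within the time limit")
        sleep(sleep_seconds)


# Stand-in for a real GET on the operation URL: reports done on the third call.
_responses = iter([{"name": "op"}, {"name": "op"}, {"name": "op", "done": True}])
result = poll_until_done(lambda: next(_responses), sleep=lambda _: None)
print(result)
```

In the notebook, `fetch_operation` could be something like `lambda: authed_session.get(HEALTHCARE_API_URL + "/" + name).json()`, where `name` is the Operation name printed by the export command.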

## Training

We will use [Transfer Learning](https://en.wikipedia.org/wiki/Transfer_learning) to retrain a generically trained model to perform breast density classification. Specifically, we will use an [Inception V3](https://github.com/tensorflow/models/tree/master/research/inception) checkpoint as the starting point.

The neural network we will use can roughly be split into two parts: "feature extraction" and "classification". In transfer learning, we take advantage of a pre-trained (checkpoint) model to do the "feature extraction", and add a few layers to perform the "classification" relevant to the specific problem. In this case, we are adding a [dense](https://www.tensorflow.org/api_docs/python/tf/layers/dense) layer with two neurons to do the classification and a [softmax](https://www.tensorflow.org/api_docs/python/tf/nn/softmax) layer to normalize the classification score. The mammography images will be classified as either "2" (scattered density) or "3" (heterogeneously dense). See below for a diagram of the training process:

![Inception V3](images/cloud_ml_training_pipeline.png)


The "feature extraction" and the "classification" part will be done in the following steps, respectively.

### Preprocess Raw Images using Cloud Dataflow

In this step, we will resize images to 300x300 (required for Inception V3) and will run each image through the checkpoint Inception V3 model to calculate the *bottleneck values*. This is the feature vector for the output of the feature extraction part of the model (the part that is already pre-trained). Since this process is resource intensive, we will utilize [Cloud Dataflow](https://cloud.google.com/dataflow/) in order to do this scalably. We extract the features and calculate the bottleneck values here for performance reasons - so that we don't have to recalculate them during training.

The output of this process will be a collection of [TFRecords](https://www.tensorflow.org/guide/datasets) storing the bottleneck value for each image in the input dataset. The TFRecord format is commonly used to store Tensors in binary form.

Finally, in this step, we will also split the input dataset into *training*, *validation* and *testing* sets. The percentage of each can be modified using the parameters below.

In [None]:
# GCS Bucket to store output TFRecords.
bottleneck_bucket = cloud_bucket_name + "/bottleneck/" # @param

# Percentage of dataset to allocate for validation and testing.
validation_percentage = 10 # @param
testing_percentage = 10 # @param

# Number of Dataflow workers. This can be increased to improve throughput.
dataflow_num_workers = 5 # @param

# Staging bucket for training.
staging_bucket = cloud_bucket_name # @param
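
To illustrate how a percentage-based split can be made reproducible, the sketch below assigns each image to a set from a stable hash of its name. This is an assumption for illustration and not necessarily how the preprocessing script performs the split; `assign_split` is a hypothetical name.

```python
import hashlib


def assign_split(image_name, validation_percentage=10, testing_percentage=10):
    """Deterministically map an image name to 'training', 'validation' or 'testing'."""
    digest = hashlib.sha1(image_name.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable value in [0, 100)
    if bucket < validation_percentage:
        return "validation"
    if bucket < validation_percentage + testing_percentage:
        return "testing"
    return "training"


names = ["image_{}.jpg".format(i) for i in range(1000)]
counts = {"training": 0, "validation": 0, "testing": 0}
for name in names:
    counts[assign_split(name)] += 1
print(counts)
```

Because the assignment depends only on the file name, an image stays in the same set across runs even when more images are added later.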

The following command will kick off a Cloud Dataflow pipeline that runs preprocessing. The script that has the relevant code is [preprocess.py](./scripts/trainer/preprocess.py). ***You can check out how the pipeline is progressing [here](https://console.cloud.google.com/dataflow)***.

When the operation is done, we will begin training the classification layers.

In [None]:
%%bash -s {project_id} {jpeg_bucket} {bottleneck_bucket} {validation_percentage} {testing_percentage} {dataflow_num_workers} {staging_bucket}

# Install Python library dependencies.
sudo pip3 install  tensorflow==1.15.0 google-apitools apache_beam[gcp]==2.18.0 --ignore-installed
# Start job in Cloud Dataflow and wait for completion.
python3 -m scripts.preprocess.preprocess \
    --project $1 \
    --input_path $2 \
    --output_path "$3/record" \
    --num_workers $6 \
    --temp_location "$7/temp" \
    --staging_location "$7/staging" \
    --validation_percentage $4 \
    --testing_percentage $5

### Train the Classification Layers of Model using Cloud ML Engine

In this step, we will train the classification layers of the model. These consist of just a [dense](https://www.tensorflow.org/api_docs/python/tf/layers/dense) and a [softmax](https://www.tensorflow.org/api_docs/python/tf/nn/softmax) layer. We will use the bottleneck values calculated in the previous step as the input to these layers. We will use Cloud ML Engine to train the model. The output of this stage will be a trained model exported to GCS, which can be used for inference.
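
As a rough illustration of what a two-neuron dense layer followed by a softmax computes, here is a plain-Python sketch. The weights and the four-element "bottleneck" vector are made-up toy values; the real training script uses TensorFlow and much larger feature vectors.

```python
import math


def dense(features, weights, biases):
    """Fully connected layer: one output (logit) per row of weights."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Normalize logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy values for illustration only.
bottleneck = [0.5, 0.1, 0.0, 0.9]
weights = [[0.2, -0.1, 0.4, 0.3],   # neuron for class "2"
           [-0.3, 0.5, 0.1, 0.6]]   # neuron for class "3"
biases = [0.0, 0.1]

probabilities = softmax(dense(bottleneck, weights, biases))
labels = ["2", "3"]
prediction = labels[probabilities.index(max(probabilities))]
print(probabilities, prediction)
```

Training adjusts only `weights` and `biases`; the bottleneck values are fixed outputs of the pre-trained feature extractor.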


There are various training parameters below that can be tuned. 

In [None]:
training_steps = 1000 # @param
learning_rate = 0.01 # @param

# Location of exported model.
exported_model_bucket = cloud_bucket_name + "/models" # @param


# Inference requires the exported model to be versioned (by default we choose version 1).
exported_model_versioned_uri = exported_model_bucket + "/1"

We'll invoke Cloud ML Engine with the above parameters. We use a GPU for training to speed up operations. The script that does the training is [model.py](./scripts/trainer/model.py).

In [None]:
%%bash -s {location} {bottleneck_bucket} {staging_bucket} {training_steps} {learning_rate} {exported_model_versioned_uri}

# Start training on CMLE.
gcloud ai-platform jobs submit training breast_density \
    --python-version 3.5 \
    --runtime-version 1.14 \
    --scale-tier BASIC_GPU \
    --module-name "scripts.trainer.model" \
    --package-path scripts \
    --staging-bucket $3 \
    --region $1 \
    -- \
    --bottleneck_dir "$2/record" \
    --training_steps $4 \
    --learning_rate $5 \
    --export_model_path $6

You can monitor the status of the training job by running the following command. The job can take a few minutes to start up.

In [None]:
!gcloud ai-platform jobs describe breast_density

When the job has started, you can observe the logs for the training job by executing the command below (it will poll for new logs every 30 seconds).

As training progresses, the logs will output the accuracy on the training set, validation set, as well as the [cross entropy](http://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html). You'll generally see that the accuracy goes up, while the cross entropy goes down as the number of training iterations increases.


Finally, when the training is complete, the accuracy of the model on the held-out test set will be output to the console. The job can take a few minutes to shut down.

In [None]:
!gcloud ai-platform jobs stream-logs breast_density --polling-interval=30
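
For readers unfamiliar with the two metrics reported in the logs, the following sketch computes cross entropy and accuracy for a handful of made-up predictions. It is a plain-Python illustration and not the code used by the training script.

```python
import math


def cross_entropy(probabilities, true_index):
    """Cross entropy of one prediction: -log of the probability given to the true class."""
    return -math.log(probabilities[true_index])


def accuracy(predicted_probabilities, true_indices):
    """Fraction of examples whose most probable class matches the true class."""
    correct = sum(1 for probs, truth in zip(predicted_probabilities, true_indices)
                  if probs.index(max(probs)) == truth)
    return correct / len(true_indices)


# A confident correct prediction has a lower loss than a hesitant one.
confident = cross_entropy([0.1, 0.9], 1)
unsure = cross_entropy([0.45, 0.55], 1)
print(confident, unsure)

# Made-up batch: the first prediction is right, the second is wrong.
print(accuracy([[0.1, 0.9], [0.8, 0.2]], [1, 1]))
```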

### Export Trained Model for Inference in Cloud ML Engine

Cloud ML Engine can also be used to serve the model for inference.  The inference model is composed of the pre-trained Inception V3 checkpoint, along with the classification layers we trained above for breast density. First we set the inference model name and version.

In [None]:
model_name = "breast_density" # @param
version = "v1" # @param

# The full name of the model.
full_model_name = "projects/" + project_id + "/models/" + model_name + "/versions/" + version

In [None]:
!gcloud ai-platform models create $model_name --regions $location
!gcloud ai-platform versions create $version --model $model_name --origin $exported_model_versioned_uri --runtime-version 1.14 --python-version 3.5
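
Once a version is deployed, online prediction requests carry binary inputs as base64 strings wrapped in a `{"b64": ...}` object. The sketch below builds such a request body for JPEG bytes. The input key `image_bytes` is an assumption; the actual key depends on the serving signature of the exported model.

```python
import base64
import json


def build_prediction_body(jpeg_bytes, input_key="image_bytes"):
    """Build an online-prediction JSON body for one binary image.

    input_key is a placeholder; use the input name from your model's signature.
    """
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    return {"instances": [{input_key: {"b64": encoded}}]}


# Fake bytes standing in for a real JPEG file.
body = build_prediction_body(b"\xff\xd8\xff\xe0 fake jpeg bytes")
print(json.dumps(body))
```

A body like this would be posted to `https://ml.googleapis.com/v1/` followed by the full model name and `:predict`, using an authorized session.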

## Inference

To allow medical imaging ML models to be easily integrated into clinical workflows, an *inference module* can be used. A standalone modality, a PACS system or a DICOM router can push DICOM instances into Cloud Healthcare [DICOM stores](https://cloud.google.com/healthcare/docs/introduction), allowing ML models to be triggered for inference. The inference results can then be structured into various DICOM formats (e.g. DICOM [structured reports](http://dicom.nema.org/MEDICAL/Dicom/2014b/output/chtml/part20/sect_A.3.html)) and stored in the Cloud Healthcare API, from which they can be retrieved by the customer.

The inference module is built as a [Docker](https://www.docker.com/) container and deployed using [Kubernetes](https://kubernetes.io/), allowing you to easily scale your deployment. The dataflow for inference can look as follows (see corresponding diagram below):

1. Client application uses [STOW-RS](ftp://dicom.nema.org/medical/Dicom/2013/output/chtml/part18/sect_6.6.html) to push a new DICOM instance to the Cloud Healthcare DICOMWeb API.

2. The insertion of the DICOM instance triggers a [Cloud Pubsub](https://cloud.google.com/pubsub/) message to be published. The *inference module* will pull incoming Pubsub messages and will receive a message for the previously inserted DICOM instance.

3. The *inference module* will retrieve the instance in JPEG format from the Cloud Healthcare API using [WADO-RS](ftp://dicom.nema.org/medical/Dicom/2013/output/chtml/part18/sect_6.5.html).

4. The *inference module* will send the JPEG bytes to the model hosted on Cloud ML Engine.

5. Cloud ML Engine will return the prediction back to the *inference module*.

6. The *inference module* will package the prediction into a DICOM instance. This can potentially be a DICOM structured report, [presentation state](ftp://dicom.nema.org/MEDICAL/dicom/2014b/output/chtml/part03/sect_A.33.html), or even burnt text on the image. In this codelab, we will focus on just DICOM structured reports. The structured report is then stored back in the Cloud Healthcare API using STOW-RS.

7. The client application can query for (or retrieve) the structured report by using [QIDO-RS](http://dicom.nema.org/dicom/2013/output/chtml/part18/sect_6.7.html) or WADO-RS. Pubsub can also be used by the client application to poll for the newly created DICOM structured report instance.

![Inference data flow](images/cloud_ml_inference_pipeline.png)
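
As a sketch of steps 2 and 3, the function below splits a DICOM instance resource path into the DICOM store and the study, series and instance UIDs needed for a WADO-RS request. It assumes the Pubsub message data is the resource path of the inserted instance; the function name and the sample path are hypothetical.

```python
import re

_INSTANCE_PATH = re.compile(
    r"^(?P<store>projects/[^/]+/locations/[^/]+/datasets/[^/]+/dicomStores/[^/]+)"
    r"/dicomWeb/studies/(?P<study>[^/]+)/series/(?P<series>[^/]+)"
    r"/instances/(?P<instance>[^/]+)$")


def parse_instance_path(message_data):
    """Split a DICOM instance resource path into store, study, series and instance."""
    if isinstance(message_data, bytes):
        message_data = message_data.decode("utf-8")
    match = _INSTANCE_PATH.match(message_data.strip())
    if match is None:
        raise ValueError("not a DICOM instance path: {!r}".format(message_data))
    return match.groupdict()


# Made-up path for illustration.
sample_path = (b"projects/p/locations/us-central1/datasets/d/dicomStores/s"
               b"/dicomWeb/studies/1.2/series/3.4/instances/5.6")
print(parse_instance_path(sample_path))
```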


To begin, we will create a new DICOM store that will store our inference source (DICOM mammography instance) and results (DICOM structured report). In order to enable Pubsub notifications to be triggered on inserted instances, we will give the DICOM store a Pubsub channel to publish on.

In [None]:
# Pubsub config.
pubsub_topic_id = "MY_PUBSUB_TOPIC_ID" # @param
pubsub_subscription_id = "MY_PUBSUB_SUBSCRIPTION_ID" # @param

# DICOM store that holds the DICOM instances used for inference.
inference_dicom_store_id = "MY_INFERENCE_DICOM_STORE" # @param

pubsub_subscription_name = "projects/" + project_id + "/subscriptions/" + pubsub_subscription_id
inference_dicom_store_name = "projects/" + project_id + "/locations/" + location + "/datasets/" + dataset_id + "/dicomStores/" + inference_dicom_store_id

In [None]:
%%bash -s {pubsub_topic_id} {pubsub_subscription_id} {project_id} {location} {dataset_id} {inference_dicom_store_id}

# Create Pubsub channel.
gcloud beta pubsub topics create $1
gcloud beta pubsub subscriptions create $2 --topic $1

# Create a Cloud Healthcare DICOM store that publishes to the given Pubsub topic.
TOKEN=`gcloud beta auth application-default print-access-token`
NOTIFICATION_CONFIG="{notification_config: {pubsub_topic: \"projects/$3/topics/$1\"}}"
curl -s -X POST -H "Content-Type: application/json" -d "${NOTIFICATION_CONFIG}" https://healthcare.googleapis.com/v1beta1/projects/$3/locations/$4/datasets/$5/dicomStores?access_token=${TOKEN}\&dicom_store_id=$6

# Enable Cloud Healthcare API to publish on given Pubsub topic.
PROJECT_NUMBER=`gcloud projects describe $3 | grep projectNumber | sed 's/[^0-9]//g'`
SERVICE_ACCOUNT="service-${PROJECT_NUMBER}@gcp-sa-healthcare.iam.gserviceaccount.com"
gcloud beta pubsub topics add-iam-policy-binding $1 --member="serviceAccount:${SERVICE_ACCOUNT}" --role="roles/pubsub.publisher"

Next, we will build the *inference module* using the [Cloud Build API](https://cloud.google.com/cloud-build/docs/api/reference/rest/). This will create a Docker container that will be stored in [Google Container Registry](https://cloud.google.com/container-registry/). The inference module code is found in *[inference.py](./scripts/inference/inference.py)*. The build script used to build the Docker container for this module is *[cloudbuild.yaml](./scripts/inference/cloudbuild.yaml)*. The progress of the build can be followed on the [Cloud Build dashboard](https://console.cloud.google.com/cloud-build/builds?project=).

In [None]:
%%bash -s {project_id}
PROJECT_ID=$1

gcloud builds submit --config scripts/inference/cloudbuild.yaml --timeout 1h scripts/inference

Next, we will deploy the *inference module* to Kubernetes by creating a Kubernetes cluster and a Deployment for the *inference module*.

In [None]:
%%bash -s {project_id} {location} {pubsub_subscription_name} {full_model_name} {inference_dicom_store_name}
gcloud container clusters create inference-module --region=$2 --scopes https://www.googleapis.com/auth/cloud-platform --num-nodes=1

PROJECT_ID=$1
SUBSCRIPTION_PATH=$3
MODEL_PATH=$4
INFERENCE_DICOM_STORE_NAME=$5

cat <<EOF | kubectl create -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-module
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inference-module
  template:
    metadata:
      labels:
        app: inference-module
    spec:
      containers:
        - name: inference-module
          image: gcr.io/${PROJECT_ID}/inference-module:latest
          command:
            - "/opt/inference_module/bin/inference_module"
            - "--subscription_path=${SUBSCRIPTION_PATH}"
            - "--model_path=${MODEL_PATH}"
            - "--dicom_store_path=${INFERENCE_DICOM_STORE_NAME}"
            - "--prediction_service=CMLE"
EOF

Next, we will store a mammography DICOM instance from the TCIA dataset to the DICOM store. This is the image that we will request inference for. Pushing this instance to the DICOM store will result in a Pubsub message, which will trigger the *inference module*.

In [None]:
# DICOM Study/Series/Instance UIDs of the input mammography image that we'll push for inference.
input_mammo_study_uid = "1.3.6.1.4.1.9590.100.1.2.85935434310203356712688695661986996009"
input_mammo_series_uid = "1.3.6.1.4.1.9590.100.1.2.374115997511889073021386151921807063992"
input_mammo_instance_uid = "1.3.6.1.4.1.9590.100.1.2.289923739312470966435676008311959891294"

In [None]:
from google.cloud import storage


client = storage.Client()
bucket = client.bucket('gcs-public-data--healthcare-tcia-cbis-ddsm', user_project=project_id)
blob = bucket.blob("dicom/{}/{}/{}.dcm".format(input_mammo_study_uid,input_mammo_series_uid,input_mammo_instance_uid))
blob.download_to_filename("example.dcm")
with open("example.dcm", 'rb') as dcm:
  dcm_content = dcm.read()
path = os.path.join(HEALTHCARE_API_URL, inference_dicom_store_name, 'dicomWeb', 'studies')
headers = {'Content-Type': 'application/dicom'}
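resp = authed_session.post(path, headers=headers, data=dcm_content)
assert resp.status_code == 200, 'error storing DICOM instance, code: {0}, response: {1}'.format(resp.status_code, resp.text)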

You should be able to observe the *inference module*'s logs by running the following command. In the logs, you should observe that the inference module successfully received the Pubsub message and ran inference on the DICOM instance. The logs should also include the inference results. It can take a few minutes to start up the Kubernetes deployment, so you may have to run this a few times.

In [None]:
!kubectl logs -l app=inference-module

You can also query the Cloud Healthcare DICOMWeb API (using QIDO-RS) to see that the DICOM structured report has been inserted for the study. The structured report contents can be found under tag **"0040A730"**. 

You can optionally also use WADO-RS to retrieve the instance (e.g. for viewing).

In [None]:
%%bash -s {project_id} {location} {dataset_id} {inference_dicom_store_id} {input_mammo_study_uid}

TOKEN=`gcloud beta auth application-default print-access-token`

# QIDO-RS should return two results in JSON response. One for the original DICOM
# instance, and one for the Structured Report containing the inference results.
curl -s https://healthcare.googleapis.com/v1beta1/projects/$1/locations/$2/datasets/$3/dicomStores/$4/dicomWeb/studies/$5/instances?includefield=all\&access_token=${TOKEN} | python -m json.tool
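
If you load the QIDO-RS JSON in Python instead, the structured report can be picked out by its Modality (tag `00080060`), which is `SR` for DICOM structured reports. The sketch below runs on a hand-written sample rather than real output, and the function name is hypothetical.

```python
def find_structured_reports(instances):
    """Return the instances whose Modality (tag 00080060) is 'SR'."""
    reports = []
    for instance in instances:
        modality = instance.get("00080060", {}).get("Value", [])
        if "SR" in modality:
            reports.append(instance)
    return reports


# Hand-written sample in the DICOM JSON model (not real API output).
sample_instances = [
    {"00080060": {"vr": "CS", "Value": ["MG"]}},
    {"00080060": {"vr": "CS", "Value": ["SR"]},
     "0040A730": {"vr": "SQ", "Value": [{}]}},
]
reports = find_structured_reports(sample_instances)
print(len(reports), "0040A730" in reports[0])
```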