<center><h1>Supporting RSNA Screening Mammography Breast Cancer Detection with PyTorch Image Classification on Amazon SageMaker</h1></center>

![Find breast cancers in screening mammograms](https://storage.googleapis.com/kaggle-competitions/kaggle/39272/logos/header.png?t=2022-11-28-17-29-35)
    
Data Source: https://www.kaggle.com/competitions/rsna-breast-cancer-detection/data?select=train.csv

In [None]:
# !pip install kaggle

In [None]:
!pip install sagemaker ipywidgets pydicom --upgrade --quiet

In [None]:
%pip install -U pylibjpeg pylibjpeg-openjpeg pylibjpeg-libjpeg

In [None]:
%pip install split-folders

In [None]:
# import kaggle
# !kaggle competitions download -c rsna-breast-cancer-detection

In [None]:
# !unzip rsna-breast-cancer-detection.zip

In [None]:
%%time
import boto3
import re
import os, sys, glob
import sagemaker
from sagemaker import get_execution_role
from sagemaker import image_uris, model_uris, script_uris
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner


role = get_execution_role()
sess = sagemaker.Session()

bucket = sess.default_bucket()
prefix = 'rsna-breast-cancer-detection'

In [None]:
aws_role = get_execution_role()
aws_region = boto3.Session().region_name

In [None]:
import matplotlib.pyplot as plt
import pydicom
from pydicom.data import get_testdata_files

In [None]:
import pandas as pd
import numpy as np
from PIL import Image

In [None]:
import splitfolders
import json
import io

--- 
## Quick Data inspection


---

In [None]:
ds = pydicom.dcmread('train_images/10006/1864590858.dcm')
plt.imshow(ds.pixel_array, cmap=plt.cm.bone)

---

### Data and Metadata Exploration

#### Metadata file column descriptions

- `site_id` - ID code for the source hospital.
- `patient_id` - ID code for the patient.
- `image_id` - ID code for the image.
- `laterality` - Whether the image is of the left or right breast.
- `view` - The orientation of the image. The default for a screening exam is to capture two views per breast.
- `age` - The patient's age in years.
- `implant` - Whether or not the patient had breast implants. Site 1 only provides breast implant information at the patient level, not at the breast level.
- `density` - A rating for how dense the breast tissue is, with A being the least dense and D being the most dense. Extremely dense tissue can make diagnosis more difficult. Only provided for train.
- `machine_id` - An ID code for the imaging device.
- `cancer` - Whether or not the breast was positive for malignant cancer. The target value. Only provided for train.
- `biopsy` - Whether or not a follow-up biopsy was performed on the breast. Only provided for train.
- `invasive` - If the breast is positive for cancer, whether or not the cancer proved to be invasive. Only provided for train.
- `BIRADS` - 0 if the breast required follow-up, 1 if the breast was rated as negative for cancer, and 2 if the breast was rated as normal. Only provided for train.
- `difficult_negative_case` - True if the case was unusually difficult. Only provided for train.

In [None]:
train_metadata_df = pd.read_csv('train.csv')

In [None]:
train_metadata_df.shape

In [None]:
train_metadata_df.head()

In [None]:
train_metadata_df.describe()

In [None]:
train_metadata_df.density.value_counts()

---
## Data Preparation 
---

### Handling Dataset Imbalance

Exploring `train.csv` shows that only about 2% of the images carry a positive cancer label. A model trained on the raw data would therefore likely produce many false negatives. To mitigate this, we prepare a dataset that is balanced across the following features:
1. density (optional)
2. biopsy
3. invasive
4. BIRADS

Reason: The problem is originally a binary classification problem, i.e., cancer is 0 or 1. We also want to leverage the ML model's capability to detect whether a biopsy is needed, whether the cancer is invasive, and whether follow-up is needed.
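
The balancing idea can be illustrated on a small made-up table before applying it to the real metadata. This is only a sketch: the `toy_df` values and the two grouping columns are assumptions chosen for illustration, and every label combination is downsampled to the size of the rarest one.

```python
import pandas as pd

# Toy metadata standing in for train.csv (values are made up for illustration).
toy_df = pd.DataFrame({
    "image_id": range(10),
    "cancer": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
    "biopsy": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
})

group_columns = ["cancer", "biopsy"]

# Size of the rarest combination of labels.
min_group_size = int(toy_df.groupby(group_columns).size().min())

# Draw the same number of rows from every label combination.
toy_balanced = (
    toy_df.groupby(group_columns)
    .sample(n=min_group_size, random_state=0)
    .reset_index(drop=True)
)

print(toy_balanced.groupby(group_columns).size())
```

Note that balancing this way discards most of the majority class, which is why the balanced dataset ends up much smaller than the original.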

In [None]:
train_metadata_df.cancer.value_counts()

In [None]:
# Preview only: fillna returns a new Series, so train_metadata_df itself is not modified here
train_metadata_df.BIRADS.fillna(0)

In [None]:
balancing_columns = ['density', 'invasive', 'biopsy', 'BIRADS', 'cancer']

In [None]:
# Downsample every label combination to the size of the rarest one
df_grouped_by = train_metadata_df.groupby(balancing_columns)
min_group_size = df_grouped_by.size().min()
df_balanced = df_grouped_by.apply(lambda x: x.sample(min_group_size).reset_index(drop=True))

In [None]:
df_balanced = df_balanced.droplevel(balancing_columns)
# df_balanced

In [None]:
df_balanced.describe()

In [None]:
# !wget https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/im2rec.py

### Convert dcm files to png files

### Data Manipulation

Each selected DICOM (`.dcm`) file is converted to a PNG image and saved into one folder per class: `biopsy`, `cancer`, `invasive`, and `density_A` through `density_D`. An image that carries several labels is saved into every matching folder. We start by building the path to each source DICOM file.

In [None]:
df_balanced['file_path'] = 'train_images/' + df_balanced['patient_id'].astype(str) + '/' + df_balanced['image_id'].astype(str) + '.dcm'

In [None]:
df_balanced.head()

In [None]:
df_balanced.density.value_counts()

In [None]:
!mkdir -p model_images/train model_images/val

In [None]:
for column in balancing_columns:
    !mkdir -p model_images/train/$column

In [None]:
balancing_columns

In [None]:
def convert_dcm_to_png_with_path(dcm_file_path):
    """
    Read a DICOM file, rescale its pixel data to 8-bit grayscale,
    and return it as a 1024x1024 PIL image.
    """
    # Load the DCM file using pydicom
    dcm_data = pydicom.dcmread(dcm_file_path)

    # Get the pixel data from the DCM file as a numpy array
    pixel_data = dcm_data.pixel_array

    # Rescale the pixel data to 0-255 and convert it to uint8 data type
    pixel_data = ((pixel_data - np.min(pixel_data)) / np.ptp(pixel_data) * 255.0).astype(np.uint8)

    # Resize the image to 1024x1024 using PIL
    image = Image.fromarray(pixel_data).resize((1024, 1024))
    return image
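
The min-max rescaling step divides by the pixel value range, which is zero for a constant image. A minimal sketch of the same rescaling with a guard for that case is shown below. The helper name `rescale_to_uint8` and the sample values are hypothetical and only serve as an illustration.

```python
import numpy as np

def rescale_to_uint8(pixel_data):
    """Min-max rescale an array to the 0-255 range and return it as uint8."""
    pixel_data = pixel_data.astype(np.float64)
    value_range = pixel_data.max() - pixel_data.min()
    if value_range == 0:
        # Constant image: avoid dividing by zero and return all zeros.
        return np.zeros(pixel_data.shape, dtype=np.uint8)
    scaled = (pixel_data - pixel_data.min()) / value_range * 255.0
    return scaled.astype(np.uint8)

# Hypothetical 12-bit pixel values, a bit depth that DICOM files often use.
sample = np.array([[0, 1024], [2048, 4095]], dtype=np.uint16)
print(rescale_to_uint8(sample))
```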

In [None]:
def generate_png_path(s):
    """
    Convert the DICOM file of one metadata row to PNG and save a copy into
    the folder of every class the row belongs to. Returns the PNG paths written.
    """
    png_paths = []
    if s['biopsy'] == 1:
        png_paths.append('model_images/train/biopsy/{}.png'.format(s['image_id']))
    if s['cancer'] == 1:
        png_paths.append('model_images/train/cancer/{}.png'.format(s['image_id']))
    if s['invasive'] == 1:
        png_paths.append('model_images/train/invasive/{}.png'.format(s['image_id'])) 
    if s['density'] == 'A':
        png_paths.append('model_images/train/density_A/{}.png'.format(s['image_id']))
    if s['density'] == 'B':
        png_paths.append('model_images/train/density_B/{}.png'.format(s['image_id']))
    if s['density'] == 'C':
        png_paths.append('model_images/train/density_C/{}.png'.format(s['image_id']))
    if s['density'] == 'D':
        png_paths.append('model_images/train/density_D/{}.png'.format(s['image_id']))
    
    # convert dcm file to png
    image = convert_dcm_to_png_with_path(s['file_path'])
    
    # save to new png paths
    for png_path in png_paths:
        os.makedirs(os.path.dirname(png_path), exist_ok=True)
        image.save(png_path)
    return png_paths

In [None]:
# generate_png_path()
df_balanced.apply(generate_png_path, axis=1)

In [None]:
# convert_dcm_to_png_with_path('train_images/52566/202476234.dcm')

### Split images into train and validation sets

In [None]:
splitfolders.ratio("model_images/train/", output="model_images/upload_to_s3/",
    seed=1337, ratio=(.8, .2), group_prefix=None, move=True)
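
To make the 80/20 ratio concrete, the sketch below performs a seeded shuffle-and-split on a list of file names for a single class. It is a simplified stand-in written for illustration, not the `split-folders` implementation; the helper `split_filenames` and the file names are hypothetical.

```python
import random

def split_filenames(filenames, train_ratio=0.8, seed=1337):
    """Shuffle file names reproducibly and split them into train and validation lists."""
    shuffled = sorted(filenames)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical image names for one class folder.
names = [f"{i}.png" for i in range(10)]
train_names, val_names = split_filenames(names)
print(len(train_names), len(val_names))
```

Using a fixed seed keeps the split reproducible across runs, which matters when comparing models trained at different times.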

### Data Augmentation (Coming Soon)

Rebalancing reduced the dataset from 54706 samples to 288. Training a good ML model on so few images is likely to be extremely difficult. Hence, we plan to employ image data augmentation techniques to increase the size of our training dataset.
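
As one possible starting point, the sketch below triples the number of images by adding flipped copies. It is an assumption about what the augmentation could look like, not the notebook's planned method, and whether flips are appropriate for mammograms would need to be validated. The helper `augment_with_flips` is hypothetical.

```python
import numpy as np

def augment_with_flips(image_array):
    """Return the original image plus horizontally and vertically flipped copies."""
    return [
        image_array,
        np.fliplr(image_array),  # mirror left-right
        np.flipud(image_array),  # mirror top-bottom
    ]

# Small synthetic grayscale image standing in for a converted mammogram.
toy_image = np.arange(12, dtype=np.uint8).reshape(3, 4)
augmented = augment_with_flips(toy_image)
print(len(augmented))
```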

### Upload the image files to train and validation channels

`restart from here`

In [None]:
prefix

### Fine-tune PyTorch pre-trained model on our custom breast cancer dataset

Once we have the data available in the correct format for training, the next step is to actually train the model using the data. Before training the model, we need to set up the training parameters. The next section will explain the parameters in detail.




#### Retrieve JumpStart Training artifacts
---
Here, for the selected model, we retrieve the training docker container, the training algorithm source, the pre-trained base model, and a python dictionary of the training hyper-parameters that the algorithm accepts with their default values. Note that model_version="*" fetches the latest model. We also need to specify the training_instance_type to fetch train_image_uri.

---


In [None]:
model_id, model_version, = (
    "pytorch-ic-mobilenet-v2",
    "*",
)

In [None]:
import IPython
from ipywidgets import Dropdown

# download JumpStart model_manifest file.
boto3.client("s3").download_file(
    f"jumpstart-cache-prod-{aws_region}", "models_manifest.json", "models_manifest.json"
)
with open("models_manifest.json", "rb") as json_file:
    model_list = json.load(json_file)

# filter-out all the Image Classification models from the manifest list,
# dropping duplicate model ids (one entry per version) while keeping their order.
ic_models = list(
    dict.fromkeys(model["model_id"] for model in model_list if "-ic-" in model["model_id"])
)

# display the model-ids in a dropdown, for user to select a model.
dropdown = Dropdown(
    options=ic_models,
    value=model_id,
    description="JumpStart Image Classification Models:",
    style={"description_width": "initial"},
    layout={"width": "max-content"},
)
display(IPython.display.Markdown("## Select a JumpStart pre-trained model from the dropdown below"))
display(dropdown)

In [None]:
from sagemaker import image_uris, model_uris, script_uris, hyperparameters

model_id, model_version = dropdown.value, "*"
training_instance_type = "ml.p3.2xlarge"

# Retrieve the docker image
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
)
# Retrieve the training script
train_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training"
)
# Retrieve the pre-trained model tarball to further fine-tune
train_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="training"
)

In [None]:
train_model_uri

#### Set Training parameters
Now that we are done with all the setup that is needed, we are ready to fine-tune our Image Classification model. To begin, let us create a `sagemaker.estimator.Estimator` object. This estimator will launch the training job.

There are two kinds of parameters that need to be set for training.

The first kind are the parameters for the training job. These include:
- (i) Training data path: the S3 folder in which the input data is stored.
- (ii) Output path: the S3 folder in which the training output is stored.
- (iii) Training instance type: the type of machine on which to run the training. Typically, we use GPU instances for training. We defined the training instance type above to fetch the correct train_image_uri.
- (iv) Training instance count: the number of instances on which to run the training. When the number of instances is greater than one, the image classification algorithm runs in a distributed setting.

The second kind are the algorithm-specific training hyper-parameters.

In [None]:
# Two channels: training and validation
s3_train = f's3://{bucket}/{prefix}/train/'
s3_validation = f's3://{bucket}/{prefix}/val/'

In [None]:
s3_output_location = f"s3://{bucket}/{prefix}/output"

In [None]:
!aws s3 cp model_images/upload_to_s3/train/ $s3_train --recursive --quiet
!aws s3 cp model_images/upload_to_s3/val/ $s3_validation --recursive --quiet

---

For algorithm-specific hyper-parameters, we start by fetching a python dictionary of the training hyper-parameters that the algorithm accepts, with their default values. These can then be overridden with custom values.

---

In [None]:
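from sagemaker import hyperparameters as jumpstart_hyperparameters

# Retrieve the default hyper-parameters for fine-tuning the model.
# The module is imported under an alias so that the `hyperparameters` dict
# does not shadow it, which keeps this cell safe to re-run.
hyperparameters = jumpstart_hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)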

# [Optional] Override default hyperparameters with custom values
hyperparameters["epochs"] = "20"
print(hyperparameters)

### Train with Automatic Model Tuning ([HPO](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html))

---
Amazon SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose. We will use a [HyperparameterTuner](https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html) object to interact with Amazon SageMaker hyperparameter tuning APIs.

---

In [None]:
from sagemaker.tuner import ContinuousParameter

# Use AMT for tuning and selecting the best model
use_amt = False

# Define objective metric per framework, based on which the best model will be selected.
metric_definitions_per_model = {
    "tensorflow": {
        "metrics": [
            {"Name": "val_accuracy", "Regex": "val_accuracy: ([0-9\\.]+)"},
            {"Name": "val_top_5_accuracy", "Regex": "val_top_5_accuracy: ([0-9\\.]+)"}            
        ],
        "type": "Maximize",
    },
    "pytorch": {
        "metrics": [
            {"Name": "val_accuracy", "Regex": "val Acc: ([0-9\\.]+)"},
            {"Name": "val_top_5_accuracy", "Regex": "val_top_5_accuracy: ([0-9\\.]+)"}    
        ],
        "type": "Maximize",
    },
}

# You can select from the hyperparameters supported by the model, and configure ranges of values to be searched for training the optimal model.(https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-ranges.html)
hyperparameter_ranges = {
    "adam-learning-rate": ContinuousParameter(0.0001, 0.1, scaling_type="Logarithmic")
}

# Increase the total number of training jobs run by AMT, for increased accuracy (and training time).
max_jobs = 6
# Change parallel training jobs run by AMT to reduce total training time, constrained by your account limits.
# if max_jobs=max_parallel_jobs then Bayesian search turns to Random.
max_parallel_jobs = 2
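
Each metric definition pairs a metric name with a regular expression whose capture group extracts the value from the training logs. The sketch below applies the `pytorch` `val_accuracy` pattern from the cell above to a hypothetical log line; the exact log format produced by the training script is an assumption.

```python
import re

# Same pattern as the "pytorch" val_accuracy metric definition.
val_accuracy_pattern = "val Acc: ([0-9\\.]+)"

# Hypothetical log line; the real training script's output may differ.
log_line = "Epoch 3/20 val Loss: 1.2345 val Acc: 0.2500"

match = re.search(val_accuracy_pattern, log_line)
val_accuracy = float(match.group(1)) if match else None
print(val_accuracy)
```

If a pattern never matches the logs, the tuner has no objective value to compare, so it is worth checking the regex against a real log line before launching a tuning job.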

### Start Training
---
We start by creating the estimator object with all the required assets and then launch the training job (with spot instances).

---

In [None]:
train_use_spot_instances = False
train_max_run = 1300
train_max_wait = 2400 if train_use_spot_instances else None

In [None]:
from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base
from sagemaker.tuner import HyperparameterTuner

training_job_name = name_from_base(f"bc-detection-{model_id}-transfer-learning")



In [None]:
training_job_name

In [None]:
# Create SageMaker Estimator instance
ic_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    max_wait=640000,
    hyperparameters=hyperparameters,
    use_spot_instances=True,
    output_path=s3_output_location,
    base_job_name=training_job_name,
)


In [None]:
if use_amt:
    metric_definitions = next(
        value for key, value in metric_definitions_per_model.items() if model_id.startswith(key)
    )

    hp_tuner = HyperparameterTuner(
        ic_estimator,
        metric_definitions["metrics"][0]["Name"],
        hyperparameter_ranges,
        metric_definitions["metrics"],
        max_jobs=max_jobs,
        max_parallel_jobs=max_parallel_jobs,
        objective_type=metric_definitions["type"],
        base_tuning_job_name=training_job_name,
    )

    # Launch a SageMaker Tuning job to search for the best hyperparameters
    hp_tuner.fit({"training": s3_train, "validation": s3_validation})
else:
    # Launch a SageMaker Training job by passing s3 path of the training data
    ic_estimator.fit({"training": s3_train, "validation": s3_validation}, logs=True)

### Model Assessment

The model accuracy on the validation dataset is ~25% in predicting a specific class. On the other hand, the top_5_accuracy on the validation dataset is ~93%. Therefore, we can assume that our model performs very well at ranking the correct class among its top five guesses, which are likely indicators of breast cancer.
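
To make the two metrics concrete, the sketch below computes top-k accuracy from per-class scores. The scores and labels are made up for illustration and are not results from this notebook; the helper `top_k_accuracy` is hypothetical.

```python
def top_k_accuracy(probabilities, true_labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for scores, true_label in zip(probabilities, true_labels):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if true_label in ranked[:k]:
            hits += 1
    return hits / len(true_labels)

# Made-up scores for 3 samples over 4 classes.
probabilities = [
    [0.1, 0.6, 0.2, 0.1],
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
]
true_labels = [1, 1, 0]

print(top_k_accuracy(probabilities, true_labels, k=1))
print(top_k_accuracy(probabilities, true_labels, k=2))
```

As a point of comparison, a uniformly random ranking over C classes reaches a top-5 accuracy of 5/C. Assuming the training set contains only the seven class folders written by `generate_png_path`, that chance level would be roughly 0.71.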

### Deploy & run Inference on the fine-tuned model
---
A trained model does nothing on its own. We now want to use the model to perform inference. For this example, that means predicting the class label of an image. We start by retrieving the JumpStart artifacts for deploying an endpoint, and then deploy the `ic_estimator` that we fine-tuned (or the tuner's best model when `use_amt` is enabled).

---

In [None]:
inference_instance_type = "ml.m5.xlarge"

# Retrieve the inference docker container uri
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    image_scope="inference",
    model_id=model_id,
    model_version=model_version,
    instance_type=inference_instance_type,
)
# Retrieve the inference script uri
deploy_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="inference"
)

endpoint_name = name_from_base(f"jumpstart-example-FT-{model_id}-")

# Use the estimator from the previous step to deploy to a SageMaker endpoint
finetuned_predictor = (hp_tuner if use_amt else ic_estimator).deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    entry_point="inference.py",
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
    endpoint_name=endpoint_name,
)

---
Next, we query the fine-tuned model, parse the response and display the predictions.

For this, we will make use of images excluded from the balanced dataset and select 10 images at random.

---

In [None]:
# train_metadata_df[~df_balanced]

raw_df = train_metadata_df[~train_metadata_df.image_id.isin(df_balanced.image_id)]
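
Drawing the 10 random images mentioned above could look like the following sketch. The toy frames stand in for `train_metadata_df` and `df_balanced`, and the fixed `random_state` is an assumption added for reproducibility.

```python
import pandas as pd

# Toy frames standing in for train_metadata_df and df_balanced.
toy_all = pd.DataFrame({"image_id": range(100, 130)})
toy_used = pd.DataFrame({"image_id": [100, 105, 110]})

# Keep only images that were not used for training, then draw 10 at random.
toy_held_out = toy_all[~toy_all.image_id.isin(toy_used.image_id)]
toy_sample = toy_held_out.sample(n=10, random_state=0)

print(toy_sample.image_id.tolist())
```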


In [None]:
# Note: openai.Completion (used below) is only available in the legacy openai<1.0 SDK
import openai
openai.api_key = "YOUR_API_KEY"

model_engine = "text-davinci-002"

In [None]:
def show_predictions(s):
    """
    Run a prediction for one image using the SageMaker inference endpoint.

    `s` is a one-row DataFrame slice of the metadata. Returns a dict that
    maps each class label to its predicted probability.
    """
    dcm_file_path = ('train_images/' + s['patient_id'].astype(str) + '/' + s['image_id'].astype(str) + '.dcm').values[0]
    img_object = convert_dcm_to_png_with_path(dcm_file_path)
    # Serialize the image to JPEG bytes in memory for the request payload
    buf = io.BytesIO()
    img_object.save(buf, format='JPEG')
    byte_im = buf.getvalue()
    query_response = finetuned_predictor.predict(
        byte_im, {"ContentType": "application/x-image", "Accept": "application/json;verbose"}
    )
    model_predictions = json.loads(query_response)
    return dict(zip(model_predictions['labels'], model_predictions['probabilities']))
    
    

In [None]:
def generate_explanation(prediction_dict):
    input_text = "Explain the results of my cancer prediction model: " + str(prediction_dict)
    response = openai.Completion.create(
        engine=model_engine,
        prompt=input_text,
        temperature=0.5,
        max_tokens=1024,
        n=1,
        stop=None,
        timeout=30,
    )
    explanation = response.choices[0].text.strip()
    return explanation

In [None]:
prediction_dict = show_predictions(raw_df.dropna().iloc[19:20])
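
The dictionary returned by `show_predictions` maps labels to probabilities, so the classes can be ranked before they are passed on. The sketch below does this on a hypothetical response payload; its keys mirror those read in `show_predictions`, but the labels and values are made up.

```python
import json

# Hypothetical verbose response; keys mirror those read in show_predictions.
query_response = json.dumps({
    "labels": ["biopsy", "cancer", "invasive"],
    "probabilities": [0.2, 0.7, 0.1],
    "predicted_label": "cancer",
})

model_predictions = json.loads(query_response)
scores = dict(zip(model_predictions["labels"], model_predictions["probabilities"]))

# Rank classes from most to least likely.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranked[0])
```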

In [None]:
explanation = generate_explanation(prediction_dict)

print(explanation)