<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/Google-Health/imaging-research/blob/master/cxr-foundation/MIMIC_Embeddings_Demo.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/Google-Health/imaging-research/blob/master/cxr-foundation/MIMIC_Embeddings_Demo.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

# MIMIC CXR Embeddings Demo

## Overview

This notebook demonstrates how to train a simple neural network for a supervised classification task, using a set of chest X-ray (CXR) image embeddings.

Both datasets used in this notebook are derived from the [MIMIC-CXR Dataset](https://physionet.org/content/mimic-cxr/2.0.0/), which contains over 300,000 DICOMs and radiology reports:
1. [The MIMIC-CXR JPG Dataset](https://physionet.org/content/mimic-cxr-jpg/2.0.0/) - contains JPG files derived from the DICOM images and structured labels derived from the free-text reports.
2. [The MIMIC-CXR Image Embeddings Dataset](https://physionet.org/content/image-embeddings-mimic-cxr/1.0/) - contains image embeddings generated from MIMIC-CXR using the Google Health [CXR Foundation tool](https://github.com/Google-Health/imaging-research/blob/master/cxr-foundation/README.md).

## Prerequisites

1. **Data access** - the MIMIC datasets are access-controlled. Follow the instructions in the [files](https://physionet.org/content/image-embeddings-mimic-cxr/1.0/#files) section of the dataset page to get access to the data. Overall, you must:
   - Be a credentialed PhysioNet user
   - Complete the appropriate institutional research training and get it verified by PhysioNet
   - Ensure the email you use to access Google Cloud is [selected](https://physionet.org/settings/cloud/) in your PhysioNet profile
   - Sign the data use agreement for each dataset
   - Request access to the dataset's GCS bucket
2. **Billing** - this notebook downloads data directly from PhysioNet's GCS buckets, which are set to [requester pays](https://cloud.google.com/storage/docs/requester-pays). You therefore need a Google Cloud project with an associated billing account. The downloads in this notebook should cost less than $1.

Note: PhysioNet hosts its data on its own on-premises servers, from which it can be downloaded free of charge. Some of its databases are also copied to GCS buckets, which offer much faster download speeds.

# Install Packages

In [None]:
# Run this cell if you are running the notebook in Colab
!git clone https://github.com/Google-Health/imaging-research.git
!mv imaging-research/cxr-foundation/cxr_foundation .

In [None]:
!pip install google-cloud-storage==2.8.0 \
    pandas==1.5.3 \
    tensorflow==2.12.0 \
    tf-models-official==2.12.0

**IMPORTANT**: If you are using Google Colab, you must restart the runtime after installing new packages.

# Authenticate to Access Data

In [None]:
from google.colab import auth

# Authenticate the user. A popup will ask you to sign in with your Google account and approve access.
auth.authenticate_user()

# Download and Process Metadata

In [None]:
import os

from google.cloud import storage
import pandas as pd

from cxr_foundation.gcs import download_blob
from cxr_foundation.mimic import parse_embedding_file_pattern


DATA_DIR = "data"
EMBEDDINGS_DATA_DIR = os.path.abspath(os.path.join(DATA_DIR, "mimic-embeddings-files"))

storage_client = storage.Client()

# Make the download directories (EMBEDDINGS_DATA_DIR is nested inside DATA_DIR)
os.makedirs(EMBEDDINGS_DATA_DIR, exist_ok=True)

## Embeddings Metadata

Data source:
- https://physionet.org/content/image-embeddings-mimic-cxr/1.0/
- https://console.cloud.google.com/storage/browser/image-embeddings-mimic-cxr-1.0.physionet.org

Download the checksums file, which lists the embeddings files, and extract the subject, study, and DICOM IDs from the file names.

In [None]:
embeddings_bucket = storage_client.bucket(
    'image-embeddings-mimic-cxr-1.0.physionet.org')    

# Download the checksums file which contains a records list
download_blob(embeddings_bucket, "SHA256SUMS.txt", "data/SHA256SUMS.txt")

In [None]:
df_embeddings = pd.read_csv("data/SHA256SUMS.txt", delimiter=" ", header=None, skiprows=[0])  # Skip the license file entry
display(df_embeddings.head())

In [None]:
SOURCE_COL_NAME = "embeddings_file"  # Remote bucket embedding file location
DL_COL_NAME = "local_embeddings_file"  # Download file to this location

# Keep only the file path column, then derive additional columns from its components.
# rename() without inplace returns a new frame, which avoids SettingWithCopyWarning below.
df_embeddings = df_embeddings[[1]].rename(columns={1: SOURCE_COL_NAME})
df_embeddings[["subject_id", "study_id", "dicom_id"]] = df_embeddings.apply(
    lambda x: parse_embedding_file_pattern(x[SOURCE_COL_NAME]), axis=1, result_type="expand")
# Local destination path for each file
df_embeddings[DL_COL_NAME] = df_embeddings[SOURCE_COL_NAME].apply(
    lambda x: os.path.join(EMBEDDINGS_DATA_DIR, os.path.basename(x)))

display(df_embeddings)
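
To make the file-name parsing concrete, here is a minimal sketch. It is not the implementation of `parse_embedding_file_pattern`. It assumes that the embeddings bucket mirrors the MIMIC-CXR directory layout (`files/pXX/p<subject_id>/s<study_id>/<dicom_id>.tfrecord`); the helper name `parse_embedding_path` and the sample path are hypothetical.

```python
import re

# Assumed layout, mirroring MIMIC-CXR:
#   files/p10/p10000032/s50414267/<dicom_id>.tfrecord
_PATH_RE = re.compile(
    r"p(?P<subject_id>\d+)/s(?P<study_id>\d+)/(?P<dicom_id>[0-9a-f-]+)\.tfrecord$")


def parse_embedding_path(path):
    """Hypothetical helper: return (subject_id, study_id, dicom_id) for a path."""
    match = _PATH_RE.search(path)
    if match is None:
        raise ValueError(f"Unexpected embeddings path: {path}")
    return int(match["subject_id"]), int(match["study_id"]), match["dicom_id"]


# Invented sample path that follows the assumed layout
sample = ("files/p10/p10000032/s50414267/"
          "02aa804e-bde0afdd-112c0b34-7bc16630-4e384014.tfrecord")
print(parse_embedding_path(sample))
```

The library function may return the IDs as different types, so check its output before relying on this sketch for joins.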

## CXR Metadata

Data source:
- https://physionet.org/content/mimic-cxr-jpg/2.0.0/
- https://console.cloud.google.com/storage/browser/mimic-cxr-jpg-2.0.0.physionet.org

Download and visualize three metadata files:
1. `mimic-cxr-2.0.0-metadata.csv`: Metadata derived from the original DICOM files
2. `mimic-cxr-2.0.0-split.csv`: A reference dataset split for studies using MIMIC-CXR-JPG
3. `mimic-cxr-2.0.0-chexpert.csv`: Lists all studies with labels generated by the CheXpert labeler

The first two files were used to generate the embeddings database. Embeddings files were only generated for the frontal view CXRs, so there are fewer embeddings files than there are original DICOMs/JPGs.
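
The frontal-view restriction can be illustrated with the `ViewPosition` column of the metadata file. This is only a sketch on invented rows. Treating `AP` and `PA` as the frontal views is an assumption here, not a statement of how the embeddings dataset was actually filtered.

```python
import pandas as pd

# Toy stand-in for mimic-cxr-2.0.0-metadata.csv; the rows are invented.
toy_metadata = pd.DataFrame({
    "dicom_id": ["a", "b", "c", "d"],
    "ViewPosition": ["PA", "LATERAL", "AP", "LL"],
})

# Assumption: "frontal" means the PA and AP view positions.
FRONTAL_VIEWS = ("AP", "PA")
frontal = toy_metadata[toy_metadata["ViewPosition"].isin(FRONTAL_VIEWS)]
print(f"{len(frontal)} of {len(toy_metadata)} images are frontal")
```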


In [None]:
cxr_jpg_bucket = storage_client.bucket(
    'mimic-cxr-jpg-2.0.0.physionet.org')

CXR_JPG_METADATA_FILES = (
    "mimic-cxr-2.0.0-metadata.csv.gz",
    "mimic-cxr-2.0.0-split.csv.gz",
    "mimic-cxr-2.0.0-chexpert.csv.gz")

for fname in CXR_JPG_METADATA_FILES:
  download_blob(cxr_jpg_bucket, fname, f"{DATA_DIR}/{fname}")

In [None]:
# CXR_JPG_METADATA_FILES is defined in the previous cell
df_metadata = pd.read_csv(f"{DATA_DIR}/{CXR_JPG_METADATA_FILES[0]}", compression="gzip")
df_split = pd.read_csv(f"{DATA_DIR}/{CXR_JPG_METADATA_FILES[1]}", compression="gzip")
df_labels_chexpert = pd.read_csv(f"{DATA_DIR}/{CXR_JPG_METADATA_FILES[2]}", compression="gzip")

display(df_metadata.head())
display(df_split.head())
display(df_labels_chexpert.head())

## Create the full labels file

Join the embeddings list with the CheXpert labels and the other metadata files.

In [None]:
# Each study contains one or more DICOMs.
# The CheXpert labels df has no DICOM ID, so it must be joined on (subject_id, study_id).
df_labels_all = df_split.merge(df_labels_chexpert, on=['subject_id', 'study_id'])
df_labels_all = df_labels_all.merge(df_metadata, on=['dicom_id'])
df_labels_all = df_embeddings.merge(df_labels_all, on=['dicom_id'], how='left')

display(df_labels_all)
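
The effect of these joins is easier to see on a toy example. The sketch below uses invented rows and hypothetical `toy_*` names: a study-level label is copied to every DICOM of that study, and the left join keeps embeddings that have no matching metadata, leaving their label columns empty.

```python
import pandas as pd

# Invented rows: study 100 has two DICOMs, study 200 has one.
toy_split = pd.DataFrame({
    "dicom_id": ["d1", "d2", "d3"],
    "study_id": [100, 100, 200],
    "subject_id": [1, 1, 2],
    "split": ["train", "train", "validate"],
})
toy_chexpert = pd.DataFrame({
    "subject_id": [1, 2],
    "study_id": [100, 200],
    "Cardiomegaly": [1.0, 0.0],
})
# Embeddings exist only for some DICOMs; "d9" has no metadata at all.
toy_embeddings = pd.DataFrame({"dicom_id": ["d1", "d3", "d9"]})

# Study-level labels fan out to every DICOM in the study.
per_dicom = toy_split.merge(toy_chexpert, on=["subject_id", "study_id"])
# how="left" keeps every embeddings row, matched or not.
joined = toy_embeddings.merge(per_dicom, on="dicom_id", how="left")
print(joined)
```

Rows such as `d9` end up with missing labels, which is one reason a later step filters on definite label values.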

## Make Labels files for Individual Diagnoses

In [None]:
# Dict of data frames for individual diagnoses
diagnoses_dataframes = {}

# Choose some of the CheXpert-generated diagnoses
for diagnosis in ('Atelectasis', 'Cardiomegaly', 'Consolidation', 'Edema', 'Fracture'):
  # Keep only definite labels (0 = negative, 1 = positive); drop missing and uncertain ones
  df_diagnosis = df_labels_all[df_labels_all[diagnosis].isin((0, 1))]
  # Only extract required columns for the ML model
  df_diagnosis = df_diagnosis[[diagnosis, SOURCE_COL_NAME, DL_COL_NAME, 'split']]
  
  diagnoses_dataframes[diagnosis] = df_diagnosis
  df_diagnosis.to_csv(f'data/{diagnosis}.csv', index=False)
  print(f"Created {diagnosis}.csv with {len(df_diagnosis)} rows")
  display(df_diagnosis.nunique())
  
  # Show label and split value distributions
  display(df_diagnosis[diagnosis].value_counts())
  display(df_diagnosis['split'].value_counts())
  print("\n")
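
As a small illustration of the label filter, the sketch below applies the same `isin((0, 1))` test to invented rows. It relies on the CheXpert label convention used by MIMIC-CXR-JPG (1.0 positive, 0.0 negative, -1.0 uncertain, blank for not mentioned); the `toy_labels` frame is hypothetical.

```python
import pandas as pd

# Invented labels covering the four possible cases.
toy_labels = pd.DataFrame({
    "dicom_id": ["d1", "d2", "d3", "d4"],
    "Edema": [1.0, 0.0, -1.0, float("nan")],
})

# Uncertain (-1.0) and missing (NaN) rows do not match 0 or 1 and are dropped.
kept = toy_labels[toy_labels["Edema"].isin((0, 1))]
print(kept)
```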

# Download Embeddings Files for Model Training

There are many labels for Cardiomegaly, so we train the model on the embeddings labeled for this diagnosis. To keep the download small, only the first 500 training and 200 validation samples are used.

In [None]:
DIAGNOSIS = 'Cardiomegaly'
LABELS_CSV = f"data/{DIAGNOSIS}.csv"
MAX_TRAINING_SAMPLES = 500
MAX_VALIDATION_SAMPLES = 200

df_diagnosis = pd.read_csv(LABELS_CSV)

df_train = df_diagnosis[df_diagnosis["split"] == "train"][:MAX_TRAINING_SAMPLES]
df_validate = df_diagnosis[df_diagnosis["split"] == "validate"][:MAX_VALIDATION_SAMPLES]
     

display(df_train)
display(df_validate)

In [None]:
# Takes about 2 minutes
for i, row in df_train.iterrows():
    download_blob(embeddings_bucket, row[SOURCE_COL_NAME], row[DL_COL_NAME], print_name="dest")

for i, row in df_validate.iterrows():
    download_blob(embeddings_bucket, row[SOURCE_COL_NAME], row[DL_COL_NAME], print_name="dest")
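
Re-running the download cell may fetch files that are already on disk. Below is a minimal sketch of a guard against that, assuming `download_blob` does not skip existing files itself (not verified here). The helper `download_missing` and its `download_fn` parameter are hypothetical and not part of `cxr_foundation`.

```python
import os


def download_missing(pairs, download_fn):
    """Call download_fn(source, dest) only for destinations not yet on disk.

    `pairs` is an iterable of (source, dest) tuples and `download_fn` is any
    callable taking those two arguments. Returns the number of files that
    were actually downloaded.
    """
    downloaded = 0
    for source, dest in pairs:
        if os.path.exists(dest):
            continue  # Already present from an earlier run
        download_fn(source, dest)
        downloaded += 1
    return downloaded
```

In this notebook it could be called as `download_missing(zip(df_train[SOURCE_COL_NAME], df_train[DL_COL_NAME]), lambda src, dst: download_blob(embeddings_bucket, src, dst))`.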

In [None]:
# Inspect an embeddings file. A single file is only about 5.6 KB.
from cxr_foundation import embeddings_data

# Use positional indexing: after filtering, the row label 0 may not exist in df_train
filename = df_train[DL_COL_NAME].iloc[0]

# Read the tf.train.Example object from the first tfrecord file
example = embeddings_data.read_record_example(filename)
print(example)

# If you don't care about the structure of the .tfrecord file, or if
# you don't use TensorFlow, use the following function to read
# the values directly into a NumPy array.
values = embeddings_data.read_record_values(filename)
print(values)

# Create and Train Model


In [None]:
import tensorflow as tf
import tensorflow_models as tfm


def create_model(heads,
                 embeddings_size=1376,
                 learning_rate=0.1,
                 end_lr_factor=1.0,
                 dropout=0.0,
                 decay_steps=1000,
                 loss_weights=None,
                 hidden_layer_sizes=(512, 256),
                 weight_decay=0.0,
                 seed=None):
  """
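  Creates a linear probe or multilayer perceptron, compiled with LARS + cosine decay.

  Args:
    heads: Names of the binary labels to predict, one sigmoid output per name.
    hidden_layer_sizes: Dense layer widths; pass an empty sequence for a linear probe.
    end_lr_factor: Final learning rate as a fraction of `learning_rate`.

  Returns:
    A compiled `tf.keras.Model` whose outputs are keyed by head name.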
  """
  inputs = tf.keras.Input(shape=(embeddings_size,))
  hidden = inputs
  # If no hidden_layer_sizes are provided, model will be a linear probe.
  for size in hidden_layer_sizes:
    hidden = tf.keras.layers.Dense(
        size,
        activation='relu',
        kernel_initializer=tf.keras.initializers.HeUniform(seed=seed),
        kernel_regularizer=tf.keras.regularizers.l2(l2=weight_decay),
        bias_regularizer=tf.keras.regularizers.l2(l2=weight_decay))(
            hidden)
    hidden = tf.keras.layers.BatchNormalization()(hidden)
    hidden = tf.keras.layers.Dropout(dropout, seed=seed)(hidden)
  output = tf.keras.layers.Dense(
      units=len(heads),
      activation='sigmoid',
      kernel_initializer=tf.keras.initializers.HeUniform(seed=seed))(
          hidden)

  outputs = {}
  for i, head in enumerate(heads):
    # Bind i as a default argument so each Lambda keeps its own slice index
    outputs[head] = tf.keras.layers.Lambda(
        lambda x, i=i: x[..., i:i + 1], name=head.lower())(
            output)

  model = tf.keras.Model(inputs, outputs)

  # tf.keras.optimizers.schedules.CosineDecay is the non-experimental name of this schedule
  learning_rate_fn = tf.keras.optimizers.schedules.CosineDecay(
      tf.cast(learning_rate, tf.float32),
      tf.cast(decay_steps, tf.float32),
      alpha=tf.cast(end_lr_factor, tf.float32))
      
  model.compile(
      optimizer=tfm.optimization.lars_optimizer.LARS(
          learning_rate=learning_rate_fn),
      loss={head: 'binary_crossentropy' for head in heads},
      loss_weights=loss_weights or {head: 1. for head in heads},
      weighted_metrics=['AUC'])
  return model
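
The `end_lr_factor` argument is passed to the schedule as `alpha`. As a rough illustration, the sketch below re-implements the decay formula documented for Keras' `CosineDecay` (without warmup) in plain Python; `cosine_decay` is a hypothetical helper, not part of TensorFlow. With the default `end_lr_factor=1.0` the formula yields a constant learning rate, so the schedule only decays when a smaller factor is passed.

```python
import math


def cosine_decay(step, initial_lr=0.1, decay_steps=1000, alpha=1.0):
    """Learning rate at `step` under cosine decay with floor alpha * initial_lr."""
    step = min(step, decay_steps)  # The rate stays at its floor after decay_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / decay_steps))
    return initial_lr * ((1.0 - alpha) * cosine + alpha)


# Compare the notebook default (alpha=1.0) with a schedule that decays to zero
for step in (0, 500, 1000):
    print(step, cosine_decay(step), cosine_decay(step, alpha=0.0))
```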

In [None]:
# Create training and validation Datasets
training_data = embeddings_data.get_dataset(filenames=df_train[DL_COL_NAME].values,
                        labels=df_train[DIAGNOSIS].values)


validation_data = embeddings_data.get_dataset(filenames=df_validate[DL_COL_NAME].values,
                        labels=df_validate[DIAGNOSIS].values)

# Create and train the model
model = create_model([DIAGNOSIS])

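# Cache before prefetching so later epochs read from memory. The validation set
# is batched as well, since the model expects batched inputs.
model.fit(
    x=training_data.batch(512).cache().prefetch(tf.data.AUTOTUNE),
    validation_data=validation_data.batch(512).cache(),
    epochs=300,
)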
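
The AUC values reported during training can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes that quantity exactly on invented scores; `roc_auc` is a hypothetical helper. Keras approximates AUC with a fixed set of thresholds, so its numbers may differ slightly from this pairwise definition.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    correct = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                correct += 1.0
            elif pos_score == neg_score:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Invented labels and model scores
toy_targets = [1, 0, 1, 0]
toy_scores = [0.9, 0.2, 0.4, 0.6]
print(roc_auc(toy_targets, toy_scores))  # 3 of 4 pairs are ordered correctly: 0.75
```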

In [None]:
model.summary()

In [None]:
# Optional: serialize model for later use
# model.save("embeddings_model", include_optimizer=False)