<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/Google-Health/imaging-research/blob/mimic-demo/cxr-foundation/MIMIC_Embeddings_Demo.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/Google-Health/imaging-research/blob/mimic-demo/cxr-foundation/MIMIC_Embeddings_Demo.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

# MIMIC CXR Embeddings Demo

## Overview

This notebook demonstrates how to train a simple neural network for a supervised classification task using a set of chest X-ray (CXR) image embeddings.

Both datasets used in this notebook are derived from the [MIMIC-CXR Dataset](https://physionet.org/content/mimic-cxr/2.0.0/), which contains over 300,000 DICOMs and radiology reports:
1. [The MIMIC-CXR JPG Dataset](https://physionet.org/content/mimic-cxr-jpg/2.0.0/) - contains JPG files derived from the DICOM images and structured labels derived from the free-text reports.
2. [The MIMIC-CXR Image Embeddings Dataset](https://physionet.org/content/image-embeddings-mimic-cxr/1.0/) - contains image embeddings generated from MIMIC-CXR using the Google Health [CXR Foundation tool](https://github.com/Google-Health/imaging-research/blob/master/cxr-foundation/README.md).

## Prerequisites

1. **Data access** - the MIMIC datasets are access-controlled. Follow the instructions in the [files](https://physionet.org/content/image-embeddings-mimic-cxr/1.0/#files) section of the dataset page to get access to the data. Overall, you must:
   - Be a credentialed PhysioNet user
   - Complete the appropriate institutional research training and get it verified by PhysioNet
   - Ensure the email you use to access Google Cloud is [selected](https://physionet.org/settings/cloud/) in your PhysioNet profile
   - Sign the data use agreement for each dataset
   - Request access to each dataset's GCS bucket
2. **Billing** - this notebook downloads data directly from PhysioNet's GCS buckets, which are set to [requester pays](https://cloud.google.com/storage/docs/requester-pays). You must therefore have a Google Cloud project with an associated billing account. (The download cost for this notebook should be less than $1.)

Note: PhysioNet hosts its data on its own on-premises servers, from which it can be downloaded free of charge. Some of its databases are also copied to GCS buckets, which offer much faster download speeds.

# Install Packages

In [None]:
# Run this cell if running notebook from Colab
!git clone -b mimic-demo https://github.com/Google-Health/imaging-research.git
!mv imaging-research/cxr-foundation/cxr_foundation .

In [None]:
!pip install google-cloud-storage==1.42.3 \
    pandas==1.3.5 \
    tensorflow==2.10.0 \
    tf-models-official==2.10.0

**IMPORTANT**: If you are using Google Colab, you must restart the runtime after installing new packages.

# Authenticate to Access Data

In [None]:
from google.colab import auth

# Authenticate the user. A popup will ask you to sign in with your Google account and approve access.
auth.authenticate_user()

# Required: set a project ID for the requester pays GCS downloads
PROJECT_ID = '[your Cloud Platform project ID]'

# Download and Process Metadata

In [None]:
import os

from google.cloud import storage
import pandas as pd

def download_blob(bucket, source_blob_name: str, destination_file_name: str):
    """
    Downloads a blob from the bucket.
    
    https://cloud.google.com/storage/docs/downloading-objects
    """
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
    print(f"Downloaded {source_blob_name}")


storage_client = storage.Client(project=PROJECT_ID)

# Make a directory to download the data into
os.makedirs('data', exist_ok=True)

## Embeddings Metadata

Data source:
- https://physionet.org/content/image-embeddings-mimic-cxr/1.0/
- https://console.cloud.google.com/storage/browser/image-embeddings-mimic-cxr-1.0.physionet.org

Download the checksums file, which lists every embeddings file, then extract the subject, study, and DICOM IDs from each file path.

In [None]:
embeddings_bucket = storage_client.bucket(
    'image-embeddings-mimic-cxr-1.0.physionet.org', user_project=PROJECT_ID)    

# Download the checksums file, which lists every embeddings file
download_blob(embeddings_bucket, "SHA256SUMS.txt", "data/SHA256SUMS.txt")

In [None]:
df_embeddings = pd.read_csv("data/SHA256SUMS.txt", delimiter=" ", header=None, skiprows=[0])  # Skip the license file entry
display(df_embeddings.head())

In [None]:
import re

# Example: 'files/p19/p19692222/s59566639/965b6053-a2c70d67-c0467ca6-02372346-fb7c6224.tfrecord'
FILE_PATTERN = re.compile(r"files/(?:\w+)/p(?P<subject_id>\w+)/s(?P<study_id>\w+)/(?P<dicom_id>[\w-]+)\.tfrecord")

def parse_file_pattern(file_path: str):
    """
    Extracts the subject_id, study_id, and dicom_id
    from the full file path string.
    """
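    match = FILE_PATTERN.fullmatch(file_path)
    if not match:
        raise ValueError(f"Failed to match file path: {file_path}")
    # Read the groups by name so the result does not depend on their order in the pattern
    return (int(match["subject_id"]), int(match["study_id"]), match["dicom_id"])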

# Keep only the file path column, then create additional columns from the path components.
# rename() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning below.
df_embeddings = df_embeddings[[1]].rename(columns={1: "embedding_file"})
df_embeddings[["subject_id","study_id", "dicom_id"]] = df_embeddings.apply(
    lambda x: parse_file_pattern(x["embedding_file"]), axis=1, result_type="expand")

display(df_embeddings)
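
To sanity-check the pattern without downloading anything, the sketch below runs the same regular expression against the example path from the comment above. It uses only the standard library; `parse_embedding_path` is a hypothetical stand-in for `parse_file_pattern` that reads the groups by name.

```python
import re

# Same pattern as in the cell above.
FILE_PATTERN = re.compile(
    r"files/(?:\w+)/p(?P<subject_id>\w+)/s(?P<study_id>\w+)/(?P<dicom_id>[\w-]+)\.tfrecord")


def parse_embedding_path(file_path: str):
    """Return (subject_id, study_id, dicom_id) parsed from an embeddings path."""
    match = FILE_PATTERN.fullmatch(file_path)
    if match is None:
        raise ValueError(f"Failed to match file path: {file_path}")
    return (int(match["subject_id"]), int(match["study_id"]), match["dicom_id"])


example_path = "files/p19/p19692222/s59566639/965b6053-a2c70d67-c0467ca6-02372346-fb7c6224.tfrecord"
parsed = parse_embedding_path(example_path)
print(parsed)
```

Any path that does not end in `.tfrecord`, or that lacks one of the `p.../s...` components, raises a `ValueError` instead of silently producing a bad row.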

## CXR Metadata

Data source:
- https://physionet.org/content/mimic-cxr-jpg/2.0.0/
- https://console.cloud.google.com/storage/browser/mimic-cxr-jpg-2.0.0.physionet.org

Download and visualize three metadata files:
1. `mimic-cxr-2.0.0-metadata.csv`: Metadata derived from the original DICOM files.
2. `mimic-cxr-2.0.0-split.csv`: A reference dataset split for studies using MIMIC-CXR-JPG.
3. `mimic-cxr-2.0.0-chexpert.csv`: Lists all studies with labels generated by the CheXpert labeler.

The first two files were used to generate the embeddings database. Embeddings files were generated only for frontal-view CXRs, so there are fewer embeddings files than original DICOMs/JPGs.


In [None]:
cxr_jpg_bucket = storage_client.bucket(
    'mimic-cxr-jpg-2.0.0.physionet.org', user_project=PROJECT_ID)

CXR_JPG_METADATA_FILES = (
    "mimic-cxr-2.0.0-metadata.csv.gz",
    "mimic-cxr-2.0.0-split.csv.gz",
    "mimic-cxr-2.0.0-chexpert.csv.gz")

for fname in CXR_JPG_METADATA_FILES:
  download_blob(cxr_jpg_bucket, fname, f"data/{fname}")

In [None]:
df_metadata = pd.read_csv(f"data/{CXR_JPG_METADATA_FILES[0]}", compression="gzip")
df_split = pd.read_csv(f"data/{CXR_JPG_METADATA_FILES[1]}", compression="gzip")
df_labels_chexpert = pd.read_csv(f"data/{CXR_JPG_METADATA_FILES[2]}", compression="gzip")

display(df_metadata.head())
display(df_split.head())
display(df_labels_chexpert.head())

## Create the full labels file

Join the embeddings list with the CheXpert labels, the dataset split, and the DICOM metadata.

In [None]:
# Each study contains one or more DICOMs.
# The CheXpert labels table has no DICOM ID, so it must be joined on (subject_id, study_id).
df_labels = df_split.merge(df_labels_chexpert, on=['subject_id', 'study_id'])
df_labels = df_labels.merge(df_metadata, on=['dicom_id'])
df_labels = df_embeddings.merge(df_labels, on=['dicom_id'], how='left')

display(df_labels)
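
Because the final merge is a left join, any embeddings file without a matching label row is kept, with `NaN` in the label columns. A minimal sketch on toy data (all IDs and values below are invented for illustration) shows how `indicator=True` can be used to count such rows:

```python
import pandas as pd

# Toy stand-ins for df_embeddings and the joined labels table; all values are invented.
toy_embeddings = pd.DataFrame({
    "dicom_id": ["a", "b", "c"],
    "embedding_file": ["files/a.tfrecord", "files/b.tfrecord", "files/c.tfrecord"],
})
toy_labels = pd.DataFrame({
    "dicom_id": ["a", "b"],
    "split": ["train", "validate"],
    "Cardiomegaly": [1.0, 0.0],
})

# how="left" keeps every embeddings row; indicator=True adds a "_merge" column
# that records whether a matching labels row was found.
toy_joined = toy_embeddings.merge(toy_labels, on="dicom_id", how="left", indicator=True)
unmatched = toy_joined[toy_joined["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(toy_joined)} embeddings have no label row")
```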

## Make Labels files for Individual Diagnoses

In [None]:
# Choose some of the CheXpert-generated diagnoses
for diagnosis in ('Atelectasis', 'Cardiomegaly', 'Consolidation', 'Edema', 'Fracture'):
  # Remove missing/uncertain labels
  df = df_labels[df_labels[diagnosis].isin((0, 1))]
  # Only need the diagnosis, dicom_id, embeddings path, and train/validate/test split for the ML model.
  # Copy to avoid pandas' SettingWithCopyWarning when adding a column below.
  df = df[[diagnosis, 'dicom_id', 'embedding_file', 'split']].copy()
  # Workaround for: https://github.com/Google-Health/imaging-research/issues/7
  # You don't need to do this if not using train_lib.py
  df['image_id'] = df['embedding_file'].apply(lambda x: f"gs://superrad/inputs/mimic-cxr/{x.replace('tfrecord', 'dcm')}")
  df.to_csv(f'data/{diagnosis}.csv', index=False)
  print(f"Created {diagnosis}.csv with {len(df)} rows")
  display(df.nunique())
  # Show label and split value distributions
  display(df[diagnosis].value_counts())
  display(df['split'].value_counts())
  print("\n")
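
The `isin((0, 1))` filter relies on how the CheXpert labels are encoded in MIMIC-CXR-JPG: `1.0` for positive, `0.0` for negative, `-1.0` for uncertain, and blank when the finding is not mentioned. A small sketch on invented rows shows which ones survive the filter:

```python
import pandas as pd

# Invented rows covering the four possible label states:
# positive, negative, uncertain, and not mentioned (NaN).
toy = pd.DataFrame({
    "dicom_id": ["a", "b", "c", "d"],
    "Cardiomegaly": [1.0, 0.0, -1.0, float("nan")],
})

# Same filter as in the cell above: keep only definite positives and negatives.
kept = toy[toy["Cardiomegaly"].isin((0, 1))]
print(kept)
```

Only rows `a` and `b` remain, so the per-diagnosis CSV files contain fewer rows than `df_labels`.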

# Download Embeddings Files for Model Training

Cardiomegaly has many definite (0 or 1) labels, so we train the model on the embeddings that carry this label.

In [None]:
DIAGNOSIS = 'Cardiomegaly'
LABELS_CSV = f"data/{DIAGNOSIS}.csv"
MAX_TRAINING_SAMPLES = 500
MAX_VALIDATION_SAMPLES = 200

# Local directory for the downloaded embeddings files
EMBEDDINGS_DIR = 'data/mimic-embeddings-files'

os.makedirs(EMBEDDINGS_DIR, exist_ok=True)

df = pd.read_csv(LABELS_CSV)
df.head()

In [None]:
# Download training files
for i, row in df[df["split"] == "train"][:MAX_TRAINING_SAMPLES].iterrows():
    download_blob(embeddings_bucket, row["embedding_file"], f"{EMBEDDINGS_DIR}/{row['dicom_id']}.tfrecord")
    
# Download validation files
for i, row in df[df["split"] == "validate"][:MAX_VALIDATION_SAMPLES].iterrows():
    download_blob(embeddings_bucket, row["embedding_file"], f"{EMBEDDINGS_DIR}/{row['dicom_id']}.tfrecord")
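
Each call above transfers the file again even if it is already on disk, and the bucket is requester pays. A minimal sketch of a guard that skips existing files follows. `download_if_missing` is a hypothetical helper, not part of the notebook or of `google-cloud-storage`, and its `fetch` argument stands in for a call such as `download_blob(embeddings_bucket, ...)`; the demonstration uses a fake fetch function so that it runs offline.

```python
import os
import tempfile
from typing import Callable


def download_if_missing(fetch: Callable[[str, str], None], source: str, destination: str) -> bool:
    """Call fetch(source, destination) unless destination already exists.

    Returns True if a download was performed, False if it was skipped.
    """
    if os.path.exists(destination):
        return False
    fetch(source, destination)
    return True


# Demonstration with a fake fetch function that just writes a placeholder file.
calls = []


def fake_fetch(source: str, destination: str) -> None:
    calls.append(source)
    with open(destination, "wb") as f:
        f.write(b"placeholder")


with tempfile.TemporaryDirectory() as tmp_dir:
    dest = os.path.join(tmp_dir, "example.tfrecord")
    first = download_if_missing(fake_fetch, "files/example.tfrecord", dest)
    second = download_if_missing(fake_fetch, "files/example.tfrecord", dest)

print(first, second, len(calls))
```

In the notebook, `fetch` could be `lambda src, dst: download_blob(embeddings_bucket, src, dst)`, so that rerunning the download cell only fetches files that are not yet in `EMBEDDINGS_DIR`.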

In [None]:
# Inspect an embeddings file. A single file is only about 5.6 KB
import tensorflow as tf
import glob

raw_dataset = tf.data.TFRecordDataset(glob.glob(f"{EMBEDDINGS_DIR}/*.tfrecord"))
for raw_record in raw_dataset.take(1):
  example = tf.train.Example()
  example.ParseFromString(raw_record.numpy())
  print(example)

# Train Model


In [None]:
BATCH_SIZE = 512
NUM_EPOCHS = 20
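
With the sample caps used earlier (500 training and 200 validation files), a batch size of 512 is larger than either split. Assuming every downloaded file is used and the final partial batch is kept (both are assumptions about the training script, not something confirmed here), each epoch amounts to a single step per split:

```python
import math

MAX_TRAINING_SAMPLES = 500
MAX_VALIDATION_SAMPLES = 200
BATCH_SIZE = 512

# Number of batches needed to cover each split once, keeping the last partial batch.
train_steps = math.ceil(MAX_TRAINING_SAMPLES / BATCH_SIZE)
validation_steps = math.ceil(MAX_VALIDATION_SAMPLES / BATCH_SIZE)
print(train_steps, validation_steps)
```

If you raise the sample caps, the number of steps per epoch grows accordingly; a smaller `BATCH_SIZE` has the same effect.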

In [None]:
!python -m cxr_foundation.train \
  --train_split_name train \
  --tune_split_name validate \
  --labels_csv {LABELS_CSV} \
  --head_name {DIAGNOSIS} \
  --data_dir {EMBEDDINGS_DIR} \
  --batch_size {BATCH_SIZE} \
  --num_epochs {NUM_EPOCHS}