<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.sandbox.google.com/github/Google-Health/imaging-research/blob/master/path-foundation/linear-classifier-demo.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/Google-Health/imaging-research/tree/master/path-foundation"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>


~~~
Copyright 2024 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
~~~
# Train a Digital Pathology Linear Classifier From Images Stored on DICOM
This notebook demonstrates how to generate embeddings with the Path Foundation API and use them to train a linear classifier. The API computes embeddings for histopathology images. The notebook shows how to build an API request that generates embeddings from stored patches and how to train a linear model on those embeddings. Note: This notebook is for API demonstration purposes only. As with all machine-learning use cases, it is critical to choose training and evaluation datasets that reflect the expected distribution of the intended use case.

**Additional details**: For this demo, whole slide images (WSIs) from the dataset below were split into train and evaluation sets. A subset of patches was sampled randomly from across all available slides, and embeddings were generated via the Path Foundation model.

**Dataset**: This notebook uses the [CAMELYON16](https://camelyon16.grand-challenge.org/) dataset, which contains WSIs from lymph node specimens with and without metastatic breast cancer. Any work that uses this dataset should consider additional details along with usage and citation requirements listed on [their website](https://camelyon17.grand-challenge.org/Data/).

**Dataset citation**: Babak Ehteshami Bejnordi; Mitko Veta; Paul Johannes van Diest; Bram van Ginneken; Nico Karssemeijer; Geert Litjens; Jeroen A. W. M. van der Laak; and the CAMELYON16 Consortium. Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer. JAMA. 2017;318(22):2199–2210. DOI: 10.1001/jama.2017.14585
# Prerequisites
You must have access to the Pathology Foundation Tool. See the project's [README](https://github.com/Google-Health/imaging-research/blob/master/path-foundation/README.md) for details.




## Imports and constants


In [None]:
# @title Pip install EZ-WSI DICOMweb
%%capture
# Quote the requirement so the shell does not treat ">=" as a redirect.
!pip install --upgrade "ez_wsi_dicomweb>=6.0.8"

In [None]:
# @title Authenticate Colab User.
from google.colab import auth
# There will be a popup asking you to sign in with your user account and approve
# access.
auth.authenticate_user()

In [None]:
from collections import defaultdict
import concurrent.futures
from dataclasses import dataclass
import functools
import json
import random
from typing import Iterator, List, Mapping, Sequence, Tuple
import warnings
from ez_wsi_dicomweb import credential_factory
from ez_wsi_dicomweb import dicom_slide
from ez_wsi_dicomweb import dicom_web_interface
from ez_wsi_dicomweb import patch_embedding
from ez_wsi_dicomweb import patch_embedding_endpoints
from ez_wsi_dicomweb import patch_embedding_types
from ez_wsi_dicomweb import pixel_spacing
from ez_wsi_dicomweb.ml_toolkit import dicom_path
import google.cloud.storage
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing
import tensorflow as tf

In [None]:
# Constants
PROJECT_ID = 'hai-cd3-foundations'
BUCKET_NAME = 'hai-cd3-foundations-pathology-vault-entry'
DATASET_PROJECT_ID = 'hai-cd3-foundations'
DATASET_LOCATION = 'us-west1'
DATASET_ID = 'pathology'
STORE_ID = 'camelyon'
PATCHES_DIR_NAME = 'patches/'
EMBEDDINGS_DIR_NAME = 'embeddings/'
CANCER_FILE = 'all_cancer_patches.json'
BENIGN_FILE = 'all_non_cancer_patches.json'
TRAINING_CANCER_PATCH_COUNT = 250  # @param {type: 'integer'}
TRAINING_BENIGN_PATCH_COUNT = 250  # @param {type: 'integer'}
EVAL_CANCER_PATCH_COUNT = 50  # @param {type: 'integer'}
EVAL_BENIGN_PATCH_COUNT = 50  # @param {type: 'integer'}
PATCH_SIZE = 224
TARGET_PIXEL_SPACING = pixel_spacing.PixelSpacing.FromMagnificationString('20X')
EVAL_RESERVED_SLIDES = (
    EVAL_CANCER_PATCH_COUNT + 15
)  # slides reserved for the eval set. Add some buffer in case patch count is much higher than the reserved slide count.
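
As a rough sanity check on the constants above, the physical footprint of a patch follows from its size in pixels and the pixel spacing. The sketch below assumes the common convention that 40X corresponds to roughly 0.25 µm per pixel, so 20X is roughly 0.5 µm per pixel. That mapping is an assumption for illustration, and `approx_microns_per_pixel` is a hypothetical helper that is not part of EZ-WSI.

```python
def approx_microns_per_pixel(magnification: float) -> float:
  """Approximate pixel spacing, ASSUMING 40X is about 0.25 um per pixel."""
  return 10.0 / magnification


patch_size_px = 224  # Same value as PATCH_SIZE in the notebook.
spacing_um = approx_microns_per_pixel(20)
patch_width_um = patch_size_px * spacing_um
print(f'~{spacing_um} um/pixel, patch width ~{patch_width_um} um')
```

Under that assumption a 224 x 224 patch at 20X covers a little over a tenth of a millimeter on each side.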

## Additional setup

In [None]:
# Helper function used to display example patches.


def render_patch_from_embedding(
    patch: dicom_slide.DicomPatch, plot_name: str = ''
) -> None:
  """Retrieves the pixels of a patch returned with an embedding and plots them."""
  patch_pixels = patch.image_bytes()
  plt.figure(figsize=(2, 2))
  plt.imshow(patch_pixels)
  plt.title(plot_name)
  plt.axis('off')
  plt.show()

## Download & Organize Patches Into Train and Eval Lists

In [None]:
# @title Downloads pre-defined patch coordinates to sample
client = google.cloud.storage.Client()
bucket = google.cloud.storage.Bucket(client, name=BUCKET_NAME)


@dataclass
class Patch:
  """Patch metadata stored on GCS."""
  slide_id: str
  study_instance_uid: str
  series_instance_uid: str
  x_origin: int
  y_origin: int


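def download_and_convert_patches(blob_path: str) -> List[Patch]:
  """Downloads a blob and converts its JSON content to Patch dataclasses."""
  # download_as_bytes() replaces the deprecated download_as_string().
  json_bytes = client.bucket(BUCKET_NAME).get_blob(blob_path).download_as_bytes()
  # Avoid naming the loop variable `pd`, which would shadow the pandas alias.
  return [Patch(**fields) for fields in json.loads(json_bytes)['patches']]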


# Download patch coordinates
cancer_patch_coordiantes = download_and_convert_patches(
    PATCHES_DIR_NAME + CANCER_FILE
)
benign_patch_coordiantes = download_and_convert_patches(
    PATCHES_DIR_NAME + BENIGN_FILE
)
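
The download step above expects a JSON document with a top-level `patches` list whose entries match the `Patch` fields. The following is a minimal, self-contained sketch of that conversion on a made-up payload. The slide IDs, UIDs, and coordinates are placeholders, and `parse_patches` is a hypothetical stand-in that skips the Cloud Storage download.

```python
from dataclasses import dataclass
import json
from typing import List


@dataclass
class Patch:
  """Same fields as the Patch dataclass used in the notebook."""
  slide_id: str
  study_instance_uid: str
  series_instance_uid: str
  x_origin: int
  y_origin: int


def parse_patches(json_str: str) -> List[Patch]:
  """Converts a JSON string with a 'patches' list into Patch objects."""
  return [Patch(**fields) for fields in json.loads(json_str)['patches']]


# Placeholder payload for illustration only; values are not real UIDs.
example_json = json.dumps({
    'patches': [
        {'slide_id': 'slide_a', 'study_instance_uid': '1.2.3',
         'series_instance_uid': '1.2.3.1', 'x_origin': 0, 'y_origin': 224},
        {'slide_id': 'slide_a', 'study_instance_uid': '1.2.3',
         'series_instance_uid': '1.2.3.1', 'x_origin': 448, 'y_origin': 0},
    ]
})
parsed = parse_patches(example_json)
print(parsed[0])
```

Unpacking with `Patch(**fields)` raises a `TypeError` if an entry has a missing or unexpected key, which surfaces malformed files early.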

In [None]:
# @title Split into Training and Eval lists
# Split by slide for eval and separate patches into training and eval lists
# according to patch labels.


# Bucket patches by slide_id
def build_patches_by_slide_id(
    patch_collection: Sequence[Patch],
) -> Mapping[str, List[Patch]]:
  patches_by_slide = defaultdict(list)  # Create a defaultdict of lists
  for patch in patch_collection:
    patches_by_slide[patch.slide_id].append(patch)  # Directly append
  return patches_by_slide


def select_random_slide_ids(
    patches_by_slide: Mapping[str, Sequence[Patch]], num_slides: int
) -> List[str]:
  slide_ids = list(patches_by_slide)  # Get all slide IDs
  random.shuffle(slide_ids)  # Shuffle for randomness
  return slide_ids[:num_slides]  # Select the first num_slides elements


def filter_patches(
    slide_id: str, selected_slide_ids: List[str], include_selected: bool
) -> bool:
  return (
      slide_id in selected_slide_ids
      if include_selected
      else slide_id not in selected_slide_ids
  )


def get_patches_from_slide_ids(
    patches_by_slide: Mapping[str, List[Patch]],
    selected_slide_ids: List[str],
    include_selected: bool = True,
) -> List[Patch]:
  patches = []
  for slide_id in patches_by_slide:
    if filter_patches(slide_id, selected_slide_ids, include_selected):
      patches.extend(patches_by_slide[slide_id])
  return patches


cancer_slide_patches = build_patches_by_slide_id(cancer_patch_coordiantes)
bengin_slide_patches = build_patches_by_slide_id(benign_patch_coordiantes)

eval_reserved_slides = select_random_slide_ids(
    cancer_slide_patches, EVAL_RESERVED_SLIDES
)

training_cancer_patches = get_patches_from_slide_ids(
    cancer_slide_patches, eval_reserved_slides, include_selected=False
)
training_benign_patches = get_patches_from_slide_ids(
    bengin_slide_patches, eval_reserved_slides, include_selected=False
)

eval_cancer_patches = get_patches_from_slide_ids(
    cancer_slide_patches, eval_reserved_slides, include_selected=True
)
eval_bengin_patches = get_patches_from_slide_ids(
    bengin_slide_patches, eval_reserved_slides, include_selected=True
)

print(f'Total training benign patches: {len(training_benign_patches)}')
print(f'Total training cancer patches: {len(training_cancer_patches)}')
print(f'Total eval benign patches: {len(eval_bengin_patches)}')
print(f'Total eval cancer patches: {len(eval_cancer_patches)}')
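
Splitting by slide rather than by patch keeps patches from the same slide out of both sets, which would otherwise inflate evaluation metrics. The toy sketch below illustrates the idea with made-up slide IDs and checks that the two sets share no slide. The names and the fixed seed are assumptions for illustration and are not taken from the notebook.

```python
from collections import defaultdict
import random

# Toy data: 24 patches spread evenly over 6 made-up slides.
toy_patches = [(f'slide_{i % 6}', i) for i in range(24)]

by_slide = defaultdict(list)
for slide_id, patch_index in toy_patches:
  by_slide[slide_id].append(patch_index)

rng = random.Random(0)  # Fixed seed so the split is reproducible.
slide_ids = sorted(by_slide)
rng.shuffle(slide_ids)
eval_slides = set(slide_ids[:2])  # A set gives constant-time membership tests.

train_patches = [p for p in toy_patches if p[0] not in eval_slides]
eval_patches = [p for p in toy_patches if p[0] in eval_slides]

train_slides = {slide_id for slide_id, _ in train_patches}
print(len(train_patches), len(eval_patches))
print('No shared slides:', train_slides.isdisjoint(eval_slides))
```

Seeding the random generator is optional, but it makes the train and eval lists repeatable between runs.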

In [None]:
# @title Initial Helper Functions and Setup

dwi = dicom_web_interface.DicomWebInterface(
    credential_factory.DefaultCredentialFactory()
)


def _group_patches_by_series(patches: List[Patch]) -> Iterator[List[Patch]]:
  patches_by_series = defaultdict(list)
  for patch in patches:
    patches_by_series[patch.series_instance_uid].append(patch)
  return patches_by_series.values()


def generate_embeddings_payload(
    patch_count: int, input_patches: List[Patch]
) -> Iterator[dicom_slide.DicomPatch]:
  selected_patches = random.sample(input_patches, patch_count)
  # Group patches by series for efficient processing
  for series_patches in _group_patches_by_series(selected_patches):
    first_patch = series_patches[0]
    path = dicom_path.FromString(
        f'https://healthcare.googleapis.com/v1beta1/projects/{DATASET_PROJECT_ID}/locations/{DATASET_LOCATION}/datasets/{DATASET_ID}/dicomStores/{STORE_ID}/dicomWeb/studies/{first_patch.study_instance_uid}/series/{first_patch.series_instance_uid}'
    )
    slide = dicom_slide.DicomSlide(dwi=dwi, path=path)
    level = slide.get_level_by_pixel_spacing(TARGET_PIXEL_SPACING)
    for patch in series_patches:
      yield slide.get_patch(
          level,
          patch.x_origin,
          patch.y_origin,
          width=PATCH_SIZE,
          height=PATCH_SIZE,
      )
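
Note that `random.sample` raises a `ValueError` when the requested count exceeds the number of available items, so `generate_embeddings_payload` fails if a patch count constant is larger than the corresponding patch list. One possible guard is sketched below; `sample_up_to` is a hypothetical helper and is not part of the notebook.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar('T')


def sample_up_to(items: Sequence[T], count: int, seed: int = 0) -> List[T]:
  """Samples min(count, len(items)) items without replacement."""
  rng = random.Random(seed)
  return rng.sample(list(items), min(count, len(items)))


available = ['p0', 'p1', 'p2']

try:
  random.sample(available, 5)  # More than the population size.
except ValueError as error:
  print(f'random.sample failed: {error}')

subset = sample_up_to(available, 5)
print(sorted(subset))
```

Capping the count silently changes the class balance, so logging the effective sample size may be worthwhile in practice.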

## Using the API on Google DICOM store images

In [None]:
# @title Define Cloud Endpoint used to Generate Embeddings.

endpoint = patch_embedding_endpoints.V2PatchEmbeddingEndpoint()

In [None]:
# @title Generate Embeddings for the patches in the Training and Eval sets
# Note: May take approximately 5 Minutes

list_of_patch_iterators = [
    generate_embeddings_payload(
        patch_count=EVAL_CANCER_PATCH_COUNT, input_patches=eval_cancer_patches
    ),
    generate_embeddings_payload(
        patch_count=EVAL_BENIGN_PATCH_COUNT,
        input_patches=eval_bengin_patches,
    ),
    generate_embeddings_payload(
        patch_count=TRAINING_CANCER_PATCH_COUNT,
        input_patches=training_cancer_patches,
    ),
    generate_embeddings_payload(
        patch_count=TRAINING_BENIGN_PATCH_COUNT,
        input_patches=training_benign_patches,
    ),
]


def _get_patch_embeddings(
    patches: Iterator[dicom_slide.DicomPatch],
) -> List[patch_embedding_types.EmbeddingResult]:
  return list(patch_embedding.generate_patch_embeddings(endpoint, patches))


with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
  results = list(executor.map(_get_patch_embeddings, list_of_patch_iterators))
eval_cancer_embeddings = results[0]
eval_begnin_embeddings = results[1]
training_cancer_embeddings = results[2]
training_begnin_embeddings = results[3]
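
Indexing `results` by position works because `Executor.map` returns results in the order of its inputs, regardless of which task finishes first. The small sketch below illustrates this with a made-up `slow_square` task whose delays make later inputs finish earlier.

```python
import concurrent.futures
import time


def slow_square(task):
  """Sleeps for the given delay, then returns the square of the value."""
  delay, value = task
  time.sleep(delay)
  return value * value


# The first task is the slowest, the last one finishes immediately.
tasks = [(0.2, 1), (0.05, 2), (0.1, 3), (0.0, 4)]
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
  ordered_results = list(executor.map(slow_square, tasks))

print(ordered_results)  # Matches input order, not completion order.
```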

## Train and Evaluate Linear Probe

In [None]:
# @title Organize embeddings for ML training


def get_embeddings(
    embedding_results: Sequence[patch_embedding_types.EmbeddingResult],
) -> np.ndarray:
  """Returns numpy array of embeddings returned in embedding results list."""
  return np.array([e.embedding for e in embedding_results])


def concatenate_series_ids(
    embedding_results: List[patch_embedding_types.EmbeddingResult],
) -> np.ndarray:
  """Returns the series UID of each embedding's source patch as a NumPy array."""
  # The series UID is used as the group ID for cross-validation.
  # Assumes there is one slide per series.
  return np.asarray([e.patch.source.path.series_uid for e in embedding_results])


def concatenate_training_data_and_build_training_labels(
    cancer: Sequence[patch_embedding_types.EmbeddingResult],
    benign: Sequence[patch_embedding_types.EmbeddingResult],
) -> Tuple[np.ndarray, np.ndarray]:
  """Concatenates cancer and benign embeddings and builds matching labels.

  Cancer examples are labeled 1 and benign examples are labeled 0.
  """
  data = np.concatenate([get_embeddings(cancer), get_embeddings(benign)])
  labels = np.concatenate((np.ones(len(cancer)), np.zeros(len(benign))))
  return data, labels


# Generate training embeddings and labels
training_embeddings, training_labels = (
    concatenate_training_data_and_build_training_labels(
        training_cancer_embeddings, training_begnin_embeddings
    )
)
training_ids = np.concatenate([
    concatenate_series_ids(training_cancer_embeddings),
    concatenate_series_ids(training_begnin_embeddings),
])

# Generate evaluation embeddings and labels
eval_embeddings, eval_labels = (
    concatenate_training_data_and_build_training_labels(
        eval_cancer_embeddings, eval_begnin_embeddings
    )
)
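
The organization step above stacks the cancer embeddings on top of the benign ones and builds a label vector in the same order. A minimal sketch with random toy arrays shows the resulting shapes. The embedding dimension of 4 and the example counts are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_cancer = rng.normal(size=(3, 4))  # 3 toy "cancer" embeddings.
toy_benign = rng.normal(size=(2, 4))  # 2 toy "benign" embeddings.

# Rows are stacked in the same order as the labels: cancer first, then benign.
toy_data = np.concatenate([toy_cancer, toy_benign])
toy_labels = np.concatenate((np.ones(len(toy_cancer)), np.zeros(len(toy_benign))))

print(toy_data.shape, toy_labels)
```

Keeping this ordering consistent matters later, because rows of the results table are matched to embedding objects by position.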


In [None]:
# Train a linear classifier using the embeddings


with warnings.catch_warnings():
  warnings.simplefilter('ignore')
  clf_pipeline = sklearn.pipeline.Pipeline([
      ('scaler', sklearn.preprocessing.StandardScaler()),
      (
          'logreg',
          sklearn.model_selection.GridSearchCV(
              sklearn.linear_model.LogisticRegression(
                  random_state=0,
                  multi_class='ovr',
                  verbose=False,
              ),
              cv=sklearn.model_selection.StratifiedGroupKFold(n_splits=5).split(
                  training_embeddings, y=training_labels, groups=training_ids
              ),
              param_grid={'C': np.logspace(start=-4, stop=4, num=10, base=10)},
              scoring='roc_auc_ovr',
              refit=True,
          ),
      ),
  ]).fit(training_embeddings, training_labels)

  test_predictions = clf_pipeline.predict_proba(eval_embeddings)[:, 1]

In [None]:
# Evaluate the linear classifier's performance using the eval patches

sklearn.metrics.roc_auc_score(eval_labels, test_predictions)
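
The ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way on toy data. `pairwise_auc` is a hypothetical helper written for illustration and is not how scikit-learn implements the metric.

```python
import numpy as np


def pairwise_auc(labels, scores) -> float:
  """AUC as the fraction of (positive, negative) pairs ranked correctly."""
  labels = np.asarray(labels)
  scores = np.asarray(scores, dtype=float)
  pos = scores[labels == 1]
  neg = scores[labels == 0]
  diff = pos[:, None] - neg[None, :]  # All positive-minus-negative differences.
  return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(toy_labels, toy_scores)
print(auc)
```

Here three of the four positive and negative pairs are ordered correctly, so the toy AUC is 0.75.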

In [None]:
# @title Plot the ROC Curve

display = sklearn.metrics.RocCurveDisplay.from_predictions(
    eval_labels, test_predictions, name="Tumor Classifier"
)
display.ax_.set_title("ROC of Tumor Classifier")

In [None]:
# @title Find Youden's index for threshold selection

thresholds = np.linspace(0, 1, 100)
sensitivities = []
specificities = []
for threshold in thresholds:
  predictions = test_predictions > threshold
  sensitivities.append(sklearn.metrics.recall_score(eval_labels, predictions))
  specificities.append(
      sklearn.metrics.recall_score(eval_labels == 0, predictions == 0)
  )
index = np.argmax(np.array(sensitivities) + np.array(specificities))
best_threshold = thresholds[index]
sens = sensitivities[index]
spec = specificities[index]
print(
    f"Best threshold: {round(best_threshold,2)}. Sensitivity is"
    f" {round(sens*100,2)}% and Specificity is {round(spec*100,2)}% "
)
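
Youden's index is J = sensitivity + specificity - 1, so maximizing the sum of sensitivity and specificity, as done above, selects the same threshold. An alternative sketch uses `sklearn.metrics.roc_curve`, which evaluates only the thresholds that occur in the scores. The toy labels and scores are made up. Note that `roc_curve` treats `score >= threshold` as positive, whereas the cell above uses a strict `>`.

```python
import numpy as np
from sklearn.metrics import roc_curve

toy_labels = np.array([0, 0, 0, 0, 1, 1, 1])
toy_scores = np.array([0.1, 0.2, 0.3, 0.6, 0.5, 0.7, 0.9])

fpr, tpr, roc_thresholds = roc_curve(
    toy_labels, toy_scores, drop_intermediate=False
)
youden_j = tpr - fpr  # Equals sensitivity + specificity - 1.
best = int(np.argmax(youden_j))

print(
    f'Best threshold: {roc_thresholds[best]}, '
    f'sensitivity: {tpr[best]}, specificity: {1 - fpr[best]}'
)
```

With these toy values the best operating point classifies every positive correctly and misclassifies one of the four negatives.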

In [None]:
# @title Show the results in a table
eval_embeddings_obj = eval_cancer_embeddings + eval_begnin_embeddings

df = pd.DataFrame(
    {'ground_truth': eval_labels, 'model_score': test_predictions}
)
df['tumor_prediction'] = df['model_score'] > best_threshold
df['embeddings'] = [e.embedding for e in eval_embeddings_obj]

df

In [None]:
# @title Visualize True Positives
def display_results(
    tumor_prediction: bool, ground_truth: int, title: str
) -> None:
  df_tp = (
      df[
          (df['tumor_prediction'] == tumor_prediction)
          & (df['ground_truth'] == ground_truth)
      ]
      .sort_values('model_score', ascending=False)
      .head(5)
  )
  for index, row in df_tp.iterrows():
    print(index)
    print(f'model score is {row.model_score}')
    render_patch_from_embedding(eval_embeddings_obj[index].patch, title)


display_results(True, 1, 'True Positive')

In [None]:
# @title Visualize True Negatives
display_results(False, 0, 'True Negative')

In [None]:
# @title Visualize False Positives
display_results(True, 0, 'False Positive')

In [None]:
# @title Visualize False Negatives
display_results(False, 1, 'False Negative')
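
The four visualizations above correspond to the four cells of a confusion matrix at the chosen threshold. As a compact complement, the counts per category can be computed directly, as sketched below on made-up labels and scores. The threshold of 0.5 is an arbitrary example value.

```python
import numpy as np

toy_truth = np.array([1, 1, 1, 0, 0, 0])
toy_scores = np.array([0.9, 0.6, 0.2, 0.7, 0.3, 0.1])
toy_threshold = 0.5  # Arbitrary example value.

predicted_tumor = toy_scores > toy_threshold
is_tumor = toy_truth == 1

counts = {
    'true_positive': int(np.sum(predicted_tumor & is_tumor)),
    'false_negative': int(np.sum(~predicted_tumor & is_tumor)),
    'false_positive': int(np.sum(predicted_tumor & ~is_tumor)),
    'true_negative': int(np.sum(~predicted_tumor & ~is_tumor)),
}
print(counts)
```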