# Demonstration of histomics_stream

Click to open in [[GitHub](https://github.com/DigitalSlideArchive/histomics_stream/tree/master/example/tensorflow_stream.ipynb)] [[Google Colab](https://colab.research.google.com/github/DigitalSlideArchive/histomics_stream/blob/master/example/tensorflow_stream.ipynb)]

The `histomics_stream` Python package sits at the start of a machine learning workflow built on TensorFlow.  It provides efficient access to the input image data, whether that data is used to fit a new model or to predict regions of interest in novel inputs with an already trained model.

## Installation

If you are running this notebook on Google Colab or another system where `histomics_stream` and its dependencies are not yet installed, they can be installed with the following commands.

In [None]:
!apt update
!apt install -y python3-openslide openslide-tools

!pip install histomics_stream 'large_image[openslide,ometiff,openjpeg,bioformats]' pooch --find-links https://girder.github.io/large_image_wheels

import sys
!pip install /tf/notebooks/histomics_detect
sys.path.append("/tf/notebooks/histomics_detect/")

print(
    "\nNOTE!: On Google Colab you may need to choose 'Runtime->Restart runtime' for these updates to take effect."
)

## Fetching and creating the test data
This notebook has demonstrations that use the files `TCGA-AN-A0G0-01Z-00-DX1.svs` (365 MB) and `TCGA-AN-A0G0-01Z-00-DX1.mask.png` (4 kB).  The pooch commands below fetch them if they are not already available.

In [None]:
import os
import pooch

# download whole slide image
wsi_path = pooch.retrieve(
    fname="TCGA-AN-A0G0-01Z-00-DX1.svs",
    url="https://northwestern.box.com/shared/static/qelyzb45bigg6sqyumtj8kt2vwxztpzm",
    known_hash="d046f952759ff6987374786768fc588740eef1e54e4e295a684f3bd356c8528f",
    path=str(pooch.os_cache("pooch")) + os.sep + "wsi",
)
print(f"Have {wsi_path}")

# download binary mask image
mask_path = pooch.retrieve(
    fname="TCGA-AN-A0G0-01Z-00-DX1.mask.png",
    url="https://northwestern.box.com/shared/static/2q13q2r83avqjz9glrpt3s3nop6uhi2i",
    known_hash="bb657ead9fd3b8284db6ecc1ca8a1efa57a0e9fd73d2ea63ce6053fbd3d65171",
    path=str(pooch.os_cache("pooch")) + os.sep + "wsi",
)
print(f"Have {mask_path}")
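
The `known_hash` values above carry no algorithm prefix, which pooch interprets as SHA-256.  As a minimal sketch of what that check amounts to, the helper below hashes a file in chunks so that a large slide never has to fit in memory.  The name `sha256_of_file` is hypothetical; it is not part of pooch or `histomics_stream`.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For a file fetched as above, `sha256_of_file(mask_path)` should equal the `known_hash` string passed to `pooch.retrieve`.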

## Creating a study for use with histomics_stream

We describe the input and desired parameters using standard Python lists and dictionaries.  Here we give a high-level configuration; selection of tiles is done subsequently. 

In [None]:
import histomics_stream as hs
import tensorflow as tf

# Create a study and insert study-wide information
my_study0 = {"version": "version-1"}
my_study0["number_pixel_rows_for_tile"] = 256
my_study0["number_pixel_columns_for_tile"] = 256
my_slides = my_study0["slides"] = {}

# Add a slide to the study, including slide-wide information with it.
my_slide0 = my_slides["Slide_0"] = {}
my_slide0["filename"] = wsi_path
my_slide0["slide_name"] = "TCGA-AN-A0G0-01Z-00-DX1"
my_slide0["slide_group"] = "Group 3"
my_slide0["number_pixel_rows_for_chunk"] = 2048
my_slide0["number_pixel_columns_for_chunk"] = 2048

# For each slide, find the appropriate resolution given the target_magnification and
# magnification_tolerance.  In this example, we use the same parameters for each slide,
# but this is not required generally.
find_resolution_for_slide = hs.configure.FindResolutionForSlide(
    my_study0, target_magnification=20, magnification_source="native"
)
for slide in my_study0["slides"].values():
    find_resolution_for_slide(slide)
print(f"my_study0 = {my_study0}")
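
Because a study is plain nested dictionaries, a misspelled key is easy to miss.  The sketch below checks a study for the keys that this notebook sets by hand.  The helper `missing_study_keys` and the two key tuples are hypothetical, and they are not an authoritative list of what `histomics_stream` requires.

```python
# Keys that this notebook sets; not an authoritative list of what the library requires.
STUDY_KEYS = (
    "version",
    "number_pixel_rows_for_tile",
    "number_pixel_columns_for_tile",
    "slides",
)
SLIDE_KEYS = (
    "filename",
    "slide_name",
    "slide_group",
    "number_pixel_rows_for_chunk",
    "number_pixel_columns_for_chunk",
)


def missing_study_keys(study):
    """List the keys from STUDY_KEYS and SLIDE_KEYS that `study` does not contain."""
    missing = [f"study: {key}" for key in STUDY_KEYS if key not in study]
    for slide_id, slide in study.get("slides", {}).items():
        missing += [f"{slide_id}: {key}" for key in SLIDE_KEYS if key not in slide]
    return missing


# A hypothetical study with the same layout as my_study0; "example.svs" is a placeholder.
example_study = {
    "version": "version-1",
    "number_pixel_rows_for_tile": 256,
    "number_pixel_columns_for_tile": 256,
    "slides": {
        "Slide_0": {
            "filename": "example.svs",
            "slide_name": "example",
            "slide_group": "Group 3",
            "number_pixel_rows_for_chunk": 2048,
            "number_pixel_columns_for_chunk": 2048,
        }
    },
}
print(missing_study_keys(example_study))  # prints []
```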

## Tile selection

We are going to demonstrate several approaches to choosing tiles.  Each approach will start with its own copy of the `my_study0` dictionary that we have built so far.

In [None]:
import copy

In [None]:
# Demonstrate TilesByGridAndMask without a mask
my_study_tiles_by_grid = copy.deepcopy(my_study0)
tiles_by_grid = hs.configure.TilesByGridAndMask(
    my_study_tiles_by_grid,
    number_pixel_overlap_rows_for_tile=32,
    number_pixel_overlap_columns_for_tile=32,
    randomly_select=5,
)
# We could apply this to a subset of the slides, but we will apply it to all slides in
# this example.
for slide in my_study_tiles_by_grid["slides"].values():
    tiles_by_grid(slide)
print(f"my_study_tiles_by_grid = {my_study_tiles_by_grid}")
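
To make the overlap and `randomly_select` parameters concrete, here is a rough sketch of grid tiling along one axis followed by random sampling.  It assumes that an overlap of `k` pixels means consecutive tiles start `tile - k` pixels apart and that only tiles lying fully inside the image are kept; the library may handle edges differently.  The function `grid_offsets` and the region size are hypothetical.

```python
import random


def grid_offsets(extent, tile, overlap):
    """Top-left offsets of full tiles along one axis, stepping by tile - overlap."""
    stride = tile - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than the tile size")
    return list(range(0, extent - tile + 1, stride))


# Hypothetical 1000 x 1500 pixel region, 256-pixel tiles, 32 pixels of overlap.
row_offsets = grid_offsets(1000, 256, 32)  # [0, 224, 448, 672]
column_offsets = grid_offsets(1500, 256, 32)
all_tiles = [(top, left) for top in row_offsets for left in column_offsets]

# Analogue of randomly_select=5: keep five grid tiles chosen without replacement.
chosen = random.Random(0).sample(all_tiles, 5)
```

With an overlap of 0 the stride equals the tile size, which is the non-overlapping grid used in the masked examples.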

In [None]:
# Demonstrate TilesByGridAndMask with a mask
my_study_tiles_by_grid_and_mask = copy.deepcopy(my_study0)
tiles_by_grid_and_mask = hs.configure.TilesByGridAndMask(
    my_study_tiles_by_grid_and_mask,
    number_pixel_overlap_rows_for_tile=0,
    number_pixel_overlap_columns_for_tile=0,
    mask_filename=mask_path,
    randomly_select=10,
)
# We could apply this to a subset of the slides, but we will apply it to all slides in
# this example.
for slide in my_study_tiles_by_grid_and_mask["slides"].values():
    tiles_by_grid_and_mask(slide)
print(f"my_study_tiles_by_grid_and_mask = {my_study_tiles_by_grid_and_mask}")

In [None]:
# Demonstrate TilesByList
my_study_tiles_by_list = copy.deepcopy(my_study0)
tiles_by_list = hs.configure.TilesByList(
    my_study_tiles_by_list,
    randomly_select=5,
    tiles_dictionary=my_study_tiles_by_grid["slides"]["Slide_0"]["tiles"],
)
# We could apply this to a subset of the slides, but we will apply it to all slides in
# this example.
for slide in my_study_tiles_by_list["slides"].values():
    tiles_by_list(slide)
print(f"my_study_tiles_by_list = {my_study_tiles_by_list}")

In [None]:
# Demonstrate TilesRandomly
my_study_tiles_randomly = copy.deepcopy(my_study0)
tiles_randomly = hs.configure.TilesRandomly(my_study_tiles_randomly, randomly_select=10)
# We could apply this to a subset of the slides, but we will apply it to all slides in
# this example.
for slide in my_study_tiles_randomly["slides"].values():
    tiles_randomly(slide)
print(f"my_study_tiles_randomly = {my_study_tiles_randomly}")

## Creating a TensorFlow Dataset

We request tiles indicated by the mask and create a TensorFlow `Dataset` that holds the image data for these tiles together with associated parameters for each tile, such as its location.

In [None]:
# Demonstrate TilesByGridAndMask with a mask
my_study_of_tiles = copy.deepcopy(my_study0)
tiles_by_grid_and_mask = hs.configure.TilesByGridAndMask(
    my_study_of_tiles,
    number_pixel_overlap_rows_for_tile=0,
    number_pixel_overlap_columns_for_tile=0,
    mask_filename=mask_path,
    mask_threshold=0.5,
    randomly_select=100,
)
for slide in my_study_of_tiles["slides"].values():
    tiles_by_grid_and_mask(slide)
print("Finished selecting tiles.")

create_tensorflow_dataset = hs.tensorflow.CreateTensorFlowDataset()
tiles = create_tensorflow_dataset(my_study_of_tiles)
print("Finished with CreateTensorFlowDataset")
print(f"... with tile shape = {tiles.take(1).get_single_element()[0][0].shape}")
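
The `mask_threshold=0.5` argument above can be pictured with a small pure-Python sketch.  It assumes that a tile is kept when the fraction of its pixels covered by non-zero mask values is at least the threshold, and it simplifies by using a mask at the same resolution as the tiles.  The names `mask_fraction` and `keep_tile` are hypothetical and are not the library's implementation.

```python
def mask_fraction(mask, top, left, tile_rows, tile_cols):
    """Fraction of the tile's pixels that fall on non-zero mask values."""
    covered = sum(
        1
        for row in mask[top : top + tile_rows]
        for value in row[left : left + tile_cols]
        if value
    )
    return covered / (tile_rows * tile_cols)


def keep_tile(mask, top, left, tile_rows, tile_cols, threshold=0.5):
    """Keep the tile when its covered fraction reaches the threshold."""
    return mask_fraction(mask, top, left, tile_rows, tile_cols) >= threshold


# Toy 4 x 4 mask split into four non-overlapping 2 x 2 tiles.
mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
kept = [
    (top, left)
    for top in (0, 2)
    for left in (0, 2)
    if keep_tile(mask, top, left, 2, 2, threshold=0.5)
]
print(kept)  # prints [(0, 0), (2, 0)]
```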

## Fetch a model for prediction

We fetch a model (840 MB compressed, 1.3 GB decompressed) that we will use to make predictions.

Because each element of our Dataset is a tuple `(rgb_image_data, dictionary_of_annotation)`, a typical model that accepts only the former as its input needs to be wrapped.

Note that this model expects unbatched tiles.  If there is enough memory for batching, choose a larger tile size instead.

In [None]:
# download trained model.
model_path = pooch.retrieve(
    fname="tcga_brca_model",
    url="https://northwestern.box.com/shared/static/4g6idrqlpvgxnsktz8pym5386njyvyb6",
    known_hash="b5b5444cc8874d17811a89261abeafd9b9603e7891a8b2a98d8f13e2846a6689",
    path=str(pooch.os_cache("pooch")) + os.sep + "model",
    processor=pooch.Unzip(),
)
model_path = os.path.split(model_path[0])[0]
print(f"Have {model_path}.")

# restore keras model
from histomics_detect.models import FasterRCNN

model = tf.keras.models.load_model(
    model_path, custom_objects={"FasterRCNN": FasterRCNN}
)

# Each element of the `tiles` Dataset is a (rgb_image_data, dictionary_of_annotation)
# pair.  Wrap the model so that it is applied to the image only and the annotation is
# passed through unchanged.
class WrappedModel(tf.keras.Model):
    def __init__(self, model, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.model = model

    def call(self, element):
        return (self.model(element[0]), element[1])


unwrapped_model = model
model = WrappedModel(unwrapped_model)
print("Model built and wrapped.")
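
The same pass-through idea can be shown without TensorFlow.  In the sketch below, a toy function stands in for the model, and the wrapper applies it to the image while returning the annotation untouched, so that each prediction stays paired with the tile it came from.  The names `wrap_for_elements` and `mean_intensity`, and the annotation keys, are hypothetical.

```python
def wrap_for_elements(model_fn):
    """Adapt a function of the image alone to accept (image, annotation) pairs."""

    def wrapped(element):
        image, annotation = element
        return model_fn(image), annotation

    return wrapped


def mean_intensity(image):
    """Toy stand-in for a model: the mean of a flat list of pixel values."""
    return sum(image) / len(image)


# Hypothetical element: four pixel values plus a location dictionary.
element = ([0, 64, 128, 192], {"tile_top": 0, "tile_left": 256})
prediction, annotation = wrap_for_elements(mean_intensity)(element)
print(prediction, annotation)  # prints 96.0 {'tile_top': 0, 'tile_left': 256}
```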

## Make predictions

In [None]:
import time

print("Starting predictions")
start_time = time.time()
# This model assumes that the tiles are not batched.  Do not use, e.g., tiles.batch(32).
predictions = model.predict(tiles)
end_time = time.time()
# Count the tiles by iterating over the Dataset once more.
number_of_inputs = sum(1 for _ in tiles)
number_of_predictions = predictions[0].shape[0]
print(
    f"Made {number_of_predictions} predictions for {number_of_inputs} tiles in {end_time - start_time} s."
)
print(f"Average of {(end_time - start_time) / number_of_inputs} s per tile.")