# Demonstration of histomics_stream

Click to open in [[GitHub](https://github.com/DigitalSlideArchive/HistomicsStream/tree/master/example/tensorflow_stream.ipynb)] [[Google Colab](https://colab.research.google.com/github/DigitalSlideArchive/HistomicsStream/blob/master/example/tensorflow_stream.ipynb)]

The `histomics_stream` Python package sits at the start of a machine learning workflow that is built on the TensorFlow library.  The package is responsible for efficient access to the input image data, whether that data is used to fit a new machine learning model or to predict regions of interest in novel inputs with an already trained model.

## Installation

If you are running this notebook on Google Colab or another system where `histomics_stream` and its dependencies are not yet installed, they can be installed with the following commands.  Image readers other than openslide are also supported; for example, use `large_image[bioformats,ometiff,openjpeg,openslide,tiff]` in the `pip install` command below.

In [None]:
# Get histomics_stream and its dependencies
!apt update
!apt install -y python3-openslide openslide-tools
!pip install 'large_image[openslide,tiff]' --find-links https://girder.github.io/large_image_wheels
!pip install histomics_stream

# Get other packages used in this notebook
# N.B. itkwidgets works with jupyter<=3.0.0
!apt install -y libcudnn8 libcudnn8-dev
!pip install histomics_detect pooch itkwidgets
!jupyter labextension install @jupyter-widgets/jupyterlab-manager jupyter-matplotlib jupyterlab-datawidgets itkwidgets

print(
    "\nNOTE!: On Google Colab you may need to choose 'Runtime->Restart runtime' for these updates to take effect."
)

## Fetching and creating the test data
This notebook has demonstrations that use the files `TCGA-AN-A0G0-01Z-00-DX1.svs` (365 MB) and `TCGA-AN-A0G0-01Z-00-DX1.mask.png` (4 kB).  The pooch commands will fetch them if they are not already available.
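
Each `pooch.retrieve` call in the next cell passes a `known_hash`, which pooch treats as a SHA-256 digest when no algorithm prefix is given.  As a minimal sketch of what such a check amounts to, the following computes a SHA-256 digest with the standard library.  The helper name `sha256_of_file` and the small temporary file are illustrative assumptions, not part of pooch or `histomics_stream`.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, block_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        # Read fixed-size blocks so that large slides need not fit in memory.
        for block in iter(lambda: stream.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Demonstrate on a small temporary file rather than on the 365 MB slide.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"example bytes")
    demo_path = handle.name
demo_hash = sha256_of_file(demo_path)
os.remove(demo_path)
print(f"SHA-256 of the demo file: {demo_hash}")
```

Applying the same helper to a downloaded file and comparing the result with the `known_hash` string is one way to confirm by hand that a cached copy is intact.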

In [None]:
import os
import pooch

# download whole slide image
wsi_path = pooch.retrieve(
    fname="TCGA-AN-A0G0-01Z-00-DX1.svs",
    url="https://drive.google.com/uc?export=download&id=19agE_0cWY582szhOVxp9h3kozRfB4CvV&confirm=t&uuid=6f2d51e7-9366-4e98-abc7-4f77427dd02c&at=ALgDtswlqJJw1KU7P3Z1tZNcE01I:1679111148632",
    known_hash="d046f952759ff6987374786768fc588740eef1e54e4e295a684f3bd356c8528f",
    path=str(pooch.os_cache("pooch")) + os.sep + "wsi",
)
print(f"Have {wsi_path}")

# download binary mask image
mask_path = pooch.retrieve(
    fname="TCGA-AN-A0G0-01Z-00-DX1.mask.png",
    url="https://drive.google.com/uc?export=download&id=17GOOHbL8Bo3933rdIui82akr7stbRfta",
    known_hash="bb657ead9fd3b8284db6ecc1ca8a1efa57a0e9fd73d2ea63ce6053fbd3d65171",
    path=str(pooch.os_cache("pooch")) + os.sep + "wsi",
)
print(f"Have {mask_path}")

## Creating a study for use with histomics_stream

We describe the input and desired parameters using standard Python lists and dictionaries.  Here we give a high-level configuration; selection of tiles is done subsequently.

N.B.: __*all*__ values that are numbers of pixels are based upon the `target_magnification` that is supplied to `FindResolutionForSlide`.  This includes the pixel sizes of a slide, chunk, or tile, and it includes the pixel coordinates of a chunk or tile.  It applies whether the numbers are supplied to histomics_stream or returned by histomics_stream.  However, if the `magnification_source` is not `exact`, the `returned_magnification` may not equal the `target_magnification`.  To get the number of pixels that is relevant for the `returned_magnification`, one typically multiplies these numbers of pixels by the ratio `returned_magnification / target_magnification`.  In particular, the *pixel size of the returned tiles* will be the requested size times this ratio.
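
As a worked example of the ratio described above, the sketch below converts a pixel count expressed at the `target_magnification` into the count at the `returned_magnification`.  The function name and the 20x/40x values are hypothetical and chosen only for illustration; the function is not part of `histomics_stream`.

```python
def pixels_at_returned_magnification(
    num_pixels, target_magnification, returned_magnification
):
    """Scale a pixel count from the target to the returned magnification."""
    return num_pixels * returned_magnification / target_magnification


# Hypothetical case: 256-pixel tiles are requested at 20x, but the slide is read
# at 40x, so each returned tile spans twice as many pixels per side.
scaled_tile_size = pixels_at_returned_magnification(256, 20, 40)
print(f"scaled_tile_size = {scaled_tile_size}")

# When the returned magnification equals the target, nothing changes.
unscaled_tile_size = pixels_at_returned_magnification(256, 20, 20)
print(f"unscaled_tile_size = {unscaled_tile_size}")
```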

In [None]:
import histomics_stream as hs
import histomics_stream.tensorflow
import tensorflow as tf

In [None]:
# Create a study and insert study-wide information.
# Add a slide to the study, including slide-wide information with it.
my_study0 = dict(
    version="version-1",
    tile_height=256,
    tile_width=256,
    overlap_height=0,
    overlap_width=0,
    slides=dict(
        Slide_0=dict(
            filename=wsi_path,
            slide_name=os.path.splitext(os.path.split(wsi_path)[1])[0],
            slide_group="Group 3",
            chunk_height=2048,
            chunk_width=2048,
        )
    ),
)

# For each slide, find the appropriate resolution given the target_magnification and
# magnification_source.  In this example, we use the same parameters for each slide,
# but this is not required generally.
find_slide_resolution = hs.configure.FindResolutionForSlide(
    my_study0, target_magnification=20, magnification_source="native"
)
for slide in my_study0["slides"].values():
    find_slide_resolution(slide)
print(f"my_study0 = {my_study0}")

## Tile selection

We are going to demonstrate several approaches to choosing tiles.  Each approach will start with its own copy of the `my_study0` dictionary that we have built so far.
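
The copies are made with `copy.deepcopy` because a study is a nested dictionary.  The minimal sketch below uses a toy dictionary, `base_study`, that is a stand-in for `my_study0`.  It shows that a shallow copy still shares the inner slide dictionaries, whereas a deep copy does not.

```python
import copy

# A toy stand-in for a study; the real my_study0 has more fields.
base_study = {"tile_height": 256, "slides": {"Slide_0": {"chunk_height": 2048}}}

shallow = copy.copy(base_study)
deep = copy.deepcopy(base_study)

# Adding tiles through the shallow copy also changes the original, because both
# share the same inner "slides" dictionary.
shallow["slides"]["Slide_0"]["tiles"] = {"tile_0": {}}
shallow_leaks = "tiles" in base_study["slides"]["Slide_0"]
del base_study["slides"]["Slide_0"]["tiles"]

# Adding tiles through the deep copy leaves the original untouched.
deep["slides"]["Slide_0"]["tiles"] = {"tile_0": {}}
deep_leaks = "tiles" in base_study["slides"]["Slide_0"]

print(f"shallow copy leaks into the original: {shallow_leaks}")
print(f"deep copy leaks into the original: {deep_leaks}")
```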

In [None]:
import copy

In [None]:
# Demonstrate TilesByGridAndMask without a mask
my_study_by_grid = copy.deepcopy(my_study0)
tiles_by_grid = hs.configure.TilesByGridAndMask(
    my_study_by_grid, overlap_height=32, overlap_width=32, randomly_select=5
)
# We could apply this to a subset of the slides, but we will apply it to all slides in
# this example.
for slide in my_study_by_grid["slides"].values():
    tiles_by_grid(slide)
# Take a look at what we have made
print(f"==== The entire dictionary is now ==== \nmy_study_by_grid = {my_study_by_grid}")
just_tiles = tiles_by_grid.get_tiles(my_study_by_grid)
print(f"==== A quick look at just the tiles is now ====\njust_tiles = {just_tiles}")

In [None]:
# Demonstrate TilesByGridAndMask with a mask
my_study_by_grid_and_mask = copy.deepcopy(my_study0)
tiles_by_grid_and_mask = hs.configure.TilesByGridAndMask(
    my_study_by_grid_and_mask, mask_filename=mask_path, randomly_select=10
)
# We could apply this to a subset of the slides, but we will apply it to all slides in
# this example.
for slide in my_study_by_grid_and_mask["slides"].values():
    tiles_by_grid_and_mask(slide)
# Take a look at what we have made
print(
    f"==== The entire dictionary is now ==== \nmy_study_by_grid_and_mask = {my_study_by_grid_and_mask}"
)
just_tiles = tiles_by_grid_and_mask.get_tiles(my_study_by_grid_and_mask)
print(f"==== A quick look at just the tiles is now ====\njust_tiles = {just_tiles}")

In [None]:
# Demonstrate TilesByList
my_study_by_list = copy.deepcopy(my_study0)
tiles_by_list = hs.configure.TilesByList(
    my_study_by_list,
    randomly_select=5,
    tiles_dictionary=my_study_by_grid["slides"]["Slide_0"]["tiles"],
)
# We could apply this to a subset of the slides, but we will apply it to all slides in
# this example.
for slide in my_study_by_list["slides"].values():
    tiles_by_list(slide)
# Take a look at what we have made
print(f"==== The entire dictionary is now ==== \nmy_study_by_list = {my_study_by_list}")
just_tiles = tiles_by_list.get_tiles(my_study_by_list)
print(f"==== A quick look at just the tiles is now ====\njust_tiles = {just_tiles}")

In [None]:
# Demonstrate TilesRandomly
my_study_randomly = copy.deepcopy(my_study0)
tiles_randomly = hs.configure.TilesRandomly(my_study_randomly, randomly_select=10)
# We could apply this to a subset of the slides, but we will apply it to all slides in
# this example.
for slide in my_study_randomly["slides"].values():
    tiles_randomly(slide)
# Take a look at what we have made
print(
    f"==== The entire dictionary is now ==== \nmy_study_randomly = {my_study_randomly}"
)
just_tiles = tiles_randomly.get_tiles(my_study_randomly)
print(f"==== A quick look at just the tiles is now ====\njust_tiles = {just_tiles}")

## Creating a TensorFlow Dataset

We request tiles indicated by the mask and create a TensorFlow Dataset that has the image data for these tiles as well as associated parameters for each tile, such as its location.

In [None]:
# Demonstrate TilesByGridAndMask with a mask
my_study = copy.deepcopy(my_study0)
tiles_by_grid_and_mask = hs.configure.TilesByGridAndMask(
    my_study, mask_filename=mask_path, mask_threshold=0.5, randomly_select=100
)
for slide in my_study["slides"].values():
    tiles_by_grid_and_mask(slide)
print("Finished selecting tiles.")

create_tensorflow_dataset = hs.tensorflow.CreateTensorFlowDataset()
tiles = create_tensorflow_dataset(my_study)
print("Finished with CreateTensorFlowDataset")
print(f"... with tile shape = {tiles.take(1).get_single_element()[0][0].shape}")

## Fetch a model for prediction

We fetch a model (840 MB compressed, 1.3 GB decompressed) that we will use to make predictions.

Each element of our Dataset presents its input as a tuple `(rgb_image_data, dictionary_of_annotation)`, so a typical model that accepts only the former as its input needs to be wrapped.

Note that this model assumes that the tiles/images are not batched, with the understanding that if there is enough memory to do batching then one should instead choose a larger tile size. 
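
The wrapping idea can be sketched without TensorFlow.  In the illustration below, `wrap_for_pairs`, `toy_model`, and the annotation keys `tile_top` and `tile_left` are all hypothetical stand-ins; the actual Keras wrapper used in this notebook appears in the next cell.

```python
def wrap_for_pairs(image_only_model):
    """Adapt a model that takes an image so that it takes an (image, annotation) pair."""

    def wrapped(pair):
        image, annotation = pair
        # Apply the model to the image only; pass the annotation through unchanged.
        return image_only_model(image), annotation

    return wrapped


def toy_model(image):
    # Stand-in for a real model: simply report the number of rows in the image.
    return len(image)


toy_pair = ([[0, 0], [0, 0], [0, 0]], {"tile_top": 512, "tile_left": 1024})
toy_output = wrap_for_pairs(toy_model)(toy_pair)
print(f"toy_output = {toy_output}")
```

Keeping the annotation alongside the prediction means that each result can later be matched to the tile it came from.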

In [None]:
# download trained model.
model_path = pooch.retrieve(
    fname="tcga_brca_model",
    url="https://drive.google.com/uc?export=download&id=1KxB6iAn9j2Wp7oyFlV4T1Kli-mR8-35G&confirm=t&uuid=c5df8dfd-ed48-4cef-81a0-19df97677fe5&at=ALgDtswWzs0BEdkVNgFrp83p9NDO:1679111246793",
    known_hash="b5b5444cc8874d17811a89261abeafd9b9603e7891a8b2a98d8f13e2846a6689",
    path=str(pooch.os_cache("pooch")) + os.sep + "model",
    processor=pooch.Unzip(),
)
model_path = os.path.split(model_path[0])[0]
print(f"Have {model_path}.")

# restore keras model
from histomics_detect.models import FasterRCNN

model = tf.keras.models.load_model(
    model_path, custom_objects={"FasterRCNN": FasterRCNN}
)


# The input part of each element of the `tiles` TensorFlow Dataset is a
# (rgb_image_data, dictionary_of_annotation) pair.  Wrap the loaded model so that it
# is applied to the image only; the annotation is passed through unchanged.
class WrappedModel(tf.keras.Model):
    def __init__(self, model, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.model = model

    def call(self, element):
        return (self.model(element[0]), element[1])


unwrapped_model = model
model = WrappedModel(unwrapped_model)
print("Model built and wrapped.")

## Make predictions

In [None]:
import time

print("Starting predictions")
start_time = time.time()
# This model assumes that the tiles are not batched.  Do not use, e.g., tiles.batch(32).
predictions = model.predict(tiles)
end_time = time.time()
# Count the tiles by iterating over the Dataset once more.
num_inputs = sum(1 for _ in tiles)
num_predictions = predictions[0].shape[0]
print(
    f"Made {num_predictions} predictions for {num_inputs} tiles in {end_time - start_time} s."
)
print(f"Average of {(end_time - start_time) / num_inputs} s per tile.")

## Look at internals

In [None]:
my_element = tiles.take(1).get_single_element()
my_pair = my_element[0]
my_target = my_element[1]
my_weight = my_element[2]
my_image = my_pair[0]
my_annotation = my_pair[1]

print(f"   type(my_element) = {type(my_element)}")
print(f"    len(my_element) = {len(my_element)}")
print(f"      type(my_pair) = {type(my_pair)}")
print(f"       len(my_pair) = {len(my_pair)}")
print(f"    type(my_target) = {type(my_target)}")
print(f"    type(my_weight) = {type(my_weight)}")
print(f"     type(my_image) = {type(my_image)}")
print(f"     my_image.shape = {my_image.shape}")
print(f"type(my_annotation) = {type(my_annotation)}")

## Display a tile

In [None]:
import itk
import itkwidgets

itkwidgets.view(itk.image_from_array(my_image.numpy(), is_vector=True))