# Advanced H&E Workflow

In the basic H&E workflow, each whole-slide image was loaded into memory at full resolution, and preprocessing was then applied to the entire image at once. This is not always a good option. For example, some images are simply too big to load in their entirety at full resolution. In addition, some steps can be run at a lower resolution level, which makes them more efficient.

In this notebook, we will show how to use `PathML` to develop a custom preprocessing pipeline that takes these factors into account. We will still be performing the same task as in the basic example notebook, i.e. detecting regions of tissue and extracting tiles. However, in this pipeline we will perform tissue detection using a low-resolution image of the slide, and extract tiles by processing the original image in chunks.

For each slide:

1. Detect regions of tissue, using a low-resolution image of the whole slide
2. Load the full-resolution slide in chunks
3. Divide each chunk into 224px tiles

The ability to define our own custom `TileExtractor` class is what makes this pipeline possible. The four pipeline components are:

1. `SlideLoader` loads low-resolution image into `SlideData.image`
2. `SlidePreprocessor` performs tissue detection on the low-res image, and puts the mask into `SlideData.mask`
3. `TileExtractor`:
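    - dimensions of full-resolution image are divided into square chunks, whose size is `chunk_size_low_res` scaled up by the downsample factor of the low-resolution level
    - for each chunk:
        - use `SlideData.wsi.slide.read_region()` to read full-resolution image for chunk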
        - get corresponding mask region from tissue detection
        - upsample the mask to match full-resolution image
        - divide into tiles
4. `TilePreprocessor`
    - filters out any whitespace tiles
    - applies Macenko stain normalization
    - writes tiles to disk
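
The chunking logic in step 3 can be sketched without `PathML` or OpenSlide. The following is a minimal illustration using only NumPy, not the implementation used later in this notebook; the function name `iter_tissue_chunks` and the toy mask are hypothetical. It walks a low-resolution mask in square chunks, skips chunks with no tissue, and reports where each remaining chunk starts in full-resolution coordinates.

```python
import numpy as np


def iter_tissue_chunks(mask_low_res, scale, chunk_size_low_res):
    """Yield (i, j, mask_chunk) for every complete chunk that contains tissue.

    i and j are the row and column offsets of the chunk in full-resolution
    pixels; mask_chunk is the matching slice of the low-resolution mask.
    """
    c = chunk_size_low_res
    # only complete chunks are visited (floor division)
    n_chunk_i = mask_low_res.shape[0] // c
    n_chunk_j = mask_low_res.shape[1] // c
    for ix_i in range(n_chunk_i):
        for ix_j in range(n_chunk_j):
            mask_chunk = mask_low_res[ix_i * c:(ix_i + 1) * c,
                                      ix_j * c:(ix_j + 1) * c]
            if not mask_chunk.any():
                # no tissue in this chunk, so it never needs to be read
                continue
            yield ix_i * c * scale, ix_j * c * scale, mask_chunk


# toy 4x6 low-resolution mask with a single tissue pixel (hypothetical data)
toy_mask = np.zeros((4, 6), dtype=bool)
toy_mask[2, 5] = True

chunks = list(iter_tissue_chunks(toy_mask, scale=4, chunk_size_low_res=2))
for i, j, chunk in chunks:
    print(f"chunk at full-res (row={i}, col={j}), low-res shape {chunk.shape}")
```

Only one of the six chunks in the toy mask contains tissue, so only that chunk would be read at full resolution.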

In [8]:
import os
import numpy as np
import matplotlib.pyplot as plt

from pathml.preprocessing.tiling import extract_tiles_with_mask
from pathml.preprocessing.utils import pil_to_rgb, upsample_array, plot_mask, label_whitespace_HE
from pathml.preprocessing.pipeline import Pipeline
from pathml.preprocessing.base_preprocessor import (BaseSlideLoader,
                                                    BaseSlidePreprocessor,
                                                    BaseTileExtractor,
                                                    BaseTilePreprocessor)
from pathml.preprocessing.transforms_HandE import TissueDetectionHE
from pathml.preprocessing.wsi import HESlide
from pathml.preprocessing.transforms import ForegroundDetection
from pathml.preprocessing.stains import StainNormalizationHE

In [2]:
# step 1
class MySlideLoader(BaseSlideLoader):
    def __init__(self, level):
        self.level = level
    
    def apply(self, path):
        data = HESlide(path).load_data(level=self.level)
        # add the level as an attribute to the SlideData object so we can access it later
        data.level = self.level
        return data

# step 2
class MySlidePreprocessor(BaseSlidePreprocessor):
    """slide-level preprocessor which detects regions of tissue"""
    def apply(self, data):
        # using downsampled image, so need to lower min_region_size for tissue detection
        tissue_detector = TissueDetectionHE(
            foreground_detection = ForegroundDetection(min_region_size=1000, max_hole_size=1000)
        )
        tissue_mask = tissue_detector.apply(data.image)
        data.mask = tissue_mask
        return data
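
The comment about lowering `min_region_size` follows from how areas scale: if each side of the image shrinks by the downsample factor, the pixel area of a region shrinks by that factor squared. The helper below is a hypothetical illustration (it is not part of `PathML`), and it assumes that the threshold is measured in pixels of whichever image is passed to the detector.

```python
def area_threshold_at_level(full_res_area, downsample):
    """Convert a pixel-area threshold from full resolution to a downsampled level."""
    # areas scale with the square of the linear downsample factor
    return max(1, round(full_res_area / downsample ** 2))


# downsample=16 is an assumed value for illustration
print(area_threshold_at_level(256_000, downsample=16))
```

Under these assumptions, a threshold of 1000 pixels on a 16x downsampled image corresponds to a region of about 256,000 pixels at full resolution.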

Now we get to the trickier part, which is tile extraction.

In [3]:
# Step 3
class MyTileExtractor(BaseTileExtractor):
    
    def __init__(self, tile_size=224, chunk_size_low_res=1000):
        self.tile_size = tile_size
        # size of each chunk, at low-resolution
        self.chunk_size_low_res = chunk_size_low_res
    
    def apply(self, data):
        """
        Use the downsampled data.mask to get full-resolution tiles. 
        Process full-resolution image in chunks.
        """
        # get scale for upscaling mask to full-res
        scale = data.wsi.slide.level_downsamples[data.level]
        scale = int(scale)
        # size of each chunk, at low-resolution
        chunk_size_low_res = self.chunk_size_low_res
        # size of each chunk, at full-resolution
        chunk_size = chunk_size_low_res * scale
        # note that openslide uses (width, height) format
        full_res_j, full_res_i = data.wsi.slide.level_dimensions[0]
        # how many complete chunks fit in each full-res dimension
        # (floor division: partial chunks along the bottom and right edges are skipped)
        n_chunk_i = full_res_i // chunk_size
        n_chunk_j = full_res_j // chunk_size
        
        for ix_i in range(n_chunk_i):
            for ix_j in range(n_chunk_j):
                # get mask
                mask = data.mask[ix_i*chunk_size_low_res:(ix_i + 1)*chunk_size_low_res, 
                                 ix_j*chunk_size_low_res:(ix_j + 1)*chunk_size_low_res]
                
                if not mask.any():
                    # empty chunk, no need to continue processing
                    continue
                # upscale mask to match full-res image
                mask_upsampled = upsample_array(mask, scale)
                # get full-res image
                region = data.wsi.slide.read_region(
                    location = (ix_j*chunk_size, ix_i*chunk_size),
                    level = 0, size = (chunk_size, chunk_size)
                )
                region_rgb = pil_to_rgb(region)
                
                # divide into tiles
                good_tiles = extract_tiles_with_mask(
                    im = region_rgb, 
                    tile_size = self.tile_size,
                    mask = mask_upsampled
                )
                
                for tile in good_tiles:
                    # adjust i and j coordinates for each tile to account for the chunk offset
                    tile.i += ix_i*chunk_size
                    tile.j += ix_j*chunk_size
                
                # add extracted tiles to data.tiles
                data.tiles = good_tiles if data.tiles is None else data.tiles + good_tiles
        
        return data
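
To make the mask upscaling step concrete, here is a small NumPy sketch of nearest-neighbor upsampling, in which each low-resolution mask pixel becomes a `factor` x `factor` block. This is an assumption about what an upsampling helper such as `upsample_array` does, not its actual implementation; `upsample_nearest` is a hypothetical name.

```python
import numpy as np


def upsample_nearest(arr, factor):
    """Repeat every element of a 2D array `factor` times along both axes."""
    return np.repeat(np.repeat(arr, factor, axis=0), factor, axis=1)


low_res = np.array([[1, 0],
                    [0, 1]], dtype=bool)
full_res = upsample_nearest(low_res, factor=3)

print(full_res.shape)
print(full_res.astype(int))
```

A mask chunk with `chunk_size_low_res` pixels per side upsampled by `scale` has `chunk_size_low_res * scale` pixels per side, which matches the size of the region read from the slide. If the downsample factor reported for a level is not an exact integer, the `int(scale)` cast may let the mask and image drift slightly out of alignment across chunks.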

We will use the exact same code for the tile-level preprocessor as we used in the Basic H&E example:

In [4]:
# Step 4
class MyTilePreprocessor(BaseTilePreprocessor):
    """
    Simple tile preprocessor which applies color normalizations, 
    filters out whitespace tiles, and writes tiles to disk
    """
    def apply(self, data):
        normalizer = StainNormalizationHE(stain_estimation_method='macenko')
        # save the processed tiles to a new directory in same location as original wsi
        out_dir = os.path.join(
            os.path.dirname(data.wsi.path), 
            os.path.splitext(os.path.basename(data.wsi.path))[0] + "_tiled"
        )
        os.makedirs(out_dir, exist_ok=True)
        # extra step to filter out whitespace tiles
        data.tiles[:] = [tile for tile in data.tiles if not label_whitespace_HE(tile.array)]
        # now loop through tiles, normalize the color, and save to disk
        for tile in data.tiles:            
            tile.array = normalizer.apply(tile.array)
            tile.save(out_dir = out_dir, filename = f"{data.wsi.name}_{tile.i}_{tile.j}.jpeg")
        return data
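
Since the tile coordinates are encoded in each filename, they can be recovered later when the tiles are loaded for downstream analysis. The sketch below assumes the `{name}_{i}_{j}.jpeg` naming pattern used above; `parse_tile_filename` is a hypothetical helper, not part of `PathML`.

```python
import os


def parse_tile_filename(filename):
    """Split '<slide name>_<i>_<j>.jpeg' into (slide name, i, j)."""
    stem, _ext = os.path.splitext(os.path.basename(filename))
    # split from the right so that slide names containing underscores survive
    name, i, j = stem.rsplit("_", 2)
    return name, int(i), int(j)


# the coordinates in this filename are made up for illustration
print(parse_tile_filename("CMU-2_14336_28672.jpeg"))
```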

Now we string everything together into a `Pipeline`:

In [5]:
# compose into pipeline
my_pipeline = Pipeline(
    slide_loader = MySlideLoader(level = 2),
    slide_preprocessor = MySlidePreprocessor(),
    tile_extractor = MyTileExtractor(tile_size=224, chunk_size_low_res = 224*4),
    tile_preprocessor = MyTilePreprocessor()
)
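
The choice of `chunk_size_low_res = 224*4` interacts with the tile size. The sketch below works through the arithmetic, assuming a downsample factor of 16 at level 2 (an assumption for illustration; the real value comes from `level_downsamples` for the slide being processed). `chunk_geometry` is a hypothetical helper.

```python
def chunk_geometry(tile_size, chunk_size_low_res, downsample):
    """Return full-res chunk side, tiles per side, leftover pixels, and RGB bytes."""
    chunk_size = chunk_size_low_res * downsample
    tiles_per_side, leftover = divmod(chunk_size, tile_size)
    # memory for one chunk held as a uint8 RGB array
    rgb_bytes = chunk_size * chunk_size * 3
    return chunk_size, tiles_per_side, leftover, rgb_bytes


# downsample=16 is assumed here, not read from a slide
chunk_size, tiles_per_side, leftover, rgb_bytes = chunk_geometry(224, 224 * 4, 16)
print(f"{chunk_size}px chunk, {tiles_per_side} tiles per side, {leftover}px leftover")
print(f"about {rgb_bytes / 1e6:.0f} MB per chunk as uint8 RGB")
```

When the low-resolution chunk size is a multiple of the tile size and the downsample factor is an integer, the full-resolution chunk divides evenly into tiles, so no pixels are left over at chunk borders. Larger chunks mean fewer reads but more memory per chunk.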

## Running pipeline

Now we are ready to try out the pipeline.

**OpenSlide Data**  
This example notebook uses publicly available images from OpenSlide. Download them [here](http://openslide.cs.cmu.edu/download/openslide-testdata/Aperio/) if you want to run this notebook locally, or change the filepaths to any whole-slide images that you have locally.

In [6]:
example_image_path = "../data/CMU-2.svs"

In [9]:
%%time
my_pipeline.run(example_image_path)

CPU times: user 14min 1s, sys: 46.8 s, total: 14min 48s
Wall time: 3min 50s


SlideData(wsi=HESlide(path=../data/CMU-2.svs, name=CMU-2), image shape: (1903, 4875, 3), mask shape: (1903, 4875), number of tiles: 10786)

Compare the run time of this pipeline with that of the Basic H&E example notebook (both pipelines were run on the same machine).
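
One way to read the output above is to convert it into throughput and coverage figures. This sketch uses only the numbers reported in that output; the variable names are illustrative, and the coverage estimate is approximated from the low-resolution mask shape rather than from the full-resolution dimensions that the extractor actually uses.

```python
# numbers copied from the output above
cpu_total_s = 14 * 60 + 48      # CPU total: 14min 48s
wall_s = 3 * 60 + 50            # wall time: 3min 50s
n_tiles = 10786
mask_rows, mask_cols = 1903, 4875
chunk_low_res = 224 * 4

print(f"tiles per second (wall time): {n_tiles / wall_s:.1f}")
print(f"CPU time / wall time: {cpu_total_s / wall_s:.2f}")

# complete chunks along each axis, mirroring the floor division in the extractor
n_i, n_j = mask_rows // chunk_low_res, mask_cols // chunk_low_res
covered = n_i * n_j * chunk_low_res ** 2 / (mask_rows * mask_cols)
print(f"{n_i} x {n_j} chunks cover {covered:.1%} of the low-res image area")
```

A CPU-to-wall ratio above 1 suggests that part of the work ran on more than one core. The coverage figure shows the cost of skipping partial chunks: with these numbers, roughly 13% of the image area along the bottom and right edges is never tiled.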