## About

This starter code shows how to read slides and tumor masks from the [CAMELYON16](https://camelyon17.grand-challenge.org/Data/) dataset. It will install [OpenSlide](https://openslide.org/) in Colab (the only non-Python dependency). Note that OpenSlide also includes a [DeepZoom viewer](https://github.com/openslide/openslide-python/tree/master/examples/deepzoom), shown in class. To use that, you'll need to install and run OpenSlide locally on your computer.

### Training data

The original slides and annotations are in an unusual format. I converted a bunch of them for you, so you can read them with OpenSlide as shown in this notebook. This [folder](https://drive.google.com/drive/folders/1rwWL8zU9v0M27BtQKI52bF6bVLW82RL5?usp=sharing) contains all the slides and tumor masks I converted (and these should be *plenty* for your project). If you'd like more, you'll need to use ASAP, as described on the competition website, to convert them into an appropriate format.

Note that even with the starter code, it will take some effort to understand how to work with this data (the various zoom levels, and the coordinate system). Happy to help if you're stuck (please catch me in office hours, or right after class).
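To make the coordinate system a little more concrete: OpenSlide's `read_region` takes the top-left corner in level-0 pixels, while the width and height are given in pixels at the level you request. Below is a minimal sketch of the conversion, assuming a known downsample factor for the level. The helper names `level_to_level0` and `level0_to_level` are hypothetical and not part of OpenSlide.

```python
def level_to_level0(x, y, downsample):
    """Convert pixel coordinates at some level to level-0 coordinates."""
    return int(x * downsample), int(y * downsample)


def level0_to_level(x0, y0, downsample):
    """Convert level-0 pixel coordinates to the (floored) coordinates at a level."""
    return int(x0 // downsample), int(y0 // downsample)


# Example: a level with downsample factor 128 (an assumed value for illustration).
downsample = 128
origin_level0 = level_to_level0(10, 20, downsample)
print(origin_level0)
print(level0_to_level(*origin_level0, downsample))
```

In practice the factor would come from `slide.level_downsamples[level]`, which OpenSlide reports as a float, hence the `int(...)` casts.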

### Goals and grading

The goal for your project is to build a thoughtful, end-to-end prototype - not to match the accuracy from the [paper](https://arxiv.org/abs/1703.02442), and not necessarily to use all the available data. To receive an A on this work, your project should (for example):
- Use multiple zoom levels
- Use high-magnification images
- Include several visualizations of your results (both heatmaps showing predictions on individual slides, and other metrics/diagrams you design that are appropriate to communicate how well your model performs).

You are also welcome to propose a custom project of similar scope. I'm happy to chat with you about your ideas anytime.
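For the heatmap visualizations mentioned in the list above, one possible starting point is to place per-patch predictions back onto the grid the patches were cut from. This is only a sketch under assumptions: `build_heatmap`, the grid positions, and the probabilities are all hypothetical, and your model's outputs may be organized differently.

```python
import numpy as np


def build_heatmap(grid_shape, patch_positions, patch_probs):
    """Place per-patch tumor probabilities on a (rows, cols) grid.

    Grid cells without a prediction (e.g. skipped background patches) stay NaN.
    """
    heatmap = np.full(grid_shape, np.nan, dtype=np.float32)
    for (row, col), prob in zip(patch_positions, patch_probs):
        heatmap[row, col] = prob
    return heatmap


# Hypothetical predictions for three patches on a 2x3 grid.
positions = [(0, 0), (0, 2), (1, 1)]
probs = [0.1, 0.9, 0.5]
heatmap = build_heatmap((2, 3), positions, probs)
print(heatmap)
```

The resulting array can be shown with `plt.imshow(heatmap, cmap='jet', vmin=0, vmax=1)`, next to a thumbnail of the slide at a low zoom level.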

In [None]:
# Install the OpenSlide C library and Python bindings
!apt-get install openslide-tools
!apt-get install python3-openslide

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libopenslide0
Suggested packages:
  libtiff-tools
The following NEW packages will be installed:
  libopenslide0 openslide-tools
0 upgraded, 2 newly installed, 0 to remove and 29 not upgraded.
Need to get 92.5 kB of archives.
After this operation, 268 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libopenslide0 amd64 3.4.1+dfsg-2 [79.8 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 openslide-tools amd64 3.4.1+dfsg-2 [12.7 kB]
Fetched 92.5 kB in 1s (110 kB/s)
Selecting previously unselected package libopenslide0.
(Reading database ... 144429 files and directories currently installed.)
Preparing to unpack .../libopenslide0_3.4.1+dfsg-2_amd64.deb ...
Unpacking libopenslide0 (3.4.1+dfsg-2) ...
Selecting previously unselected package openslide-tools.
Preparing to unpack 

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from openslide import open_slide, __library_version__ as openslide_version
import os
from PIL import Image
from skimage.color import rgb2gray
import tensorflow as tf
import copy

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# See https://openslide.org/api/python/#openslide.OpenSlide.read_region
# Note: x,y coords are with respect to level 0.
# Read a region from the slide
# Return a numpy RGB array
def read_slide(slide, x, y, level, width, height, as_float=False):
    im = slide.read_region((x,y), level, (width, height))
    im = im.convert('RGB') # drop the alpha channel
    if as_float:
        im = np.asarray(im, dtype=np.float32)
    else:
        im = np.asarray(im)
    assert im.shape == (height, width, 3)
    return im

In [None]:
# We can improve efficiency by ignoring non-tissue areas 
# of the slide. We'll find these by looking for all gray regions.
def find_tissue_pixels(image, intensity=0.8):
    im_gray = rgb2gray(image)
    assert im_gray.shape == (image.shape[0], image.shape[1])
    indices = np.where(im_gray <= intensity)
    return list(zip(indices[0], indices[1]))

def apply_mask(im, mask, color=(255,0,0)):
    masked = np.copy(im)
    for x,y in mask: masked[x][y] = color
    return masked
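`find_tissue_pixels` returns a Python list of coordinate pairs, and `apply_mask` loops over it, which can get slow for larger regions. A vectorized alternative is sketched below using a boolean mask. It assumes uint8 RGB input, and the luminance weights are meant to approximate what `rgb2gray` does rather than reproduce it exactly. The names `tissue_mask` and `overlay_mask` are hypothetical.

```python
import numpy as np


def tissue_mask(image, intensity=0.8):
    """Return a boolean (H, W) array that is True where a pixel looks like tissue.

    `image` is a uint8 RGB array; grayscale values are scaled to [0, 1].
    """
    weights = np.array([0.2125, 0.7154, 0.0721])  # assumed luminance weights
    gray = (image.astype(np.float64) / 255.0) @ weights
    return gray <= intensity


def overlay_mask(image, mask, color=(255, 0, 0)):
    """Return a copy of `image` with the masked pixels painted in `color`."""
    out = image.copy()
    out[mask] = color
    return out


# Synthetic 2x2 image: one darker ("tissue") pixel on a white background.
image = np.full((2, 2, 3), 255, dtype=np.uint8)
image[0, 1] = (120, 60, 130)
mask = tissue_mask(image)
print(mask.mean())  # fraction of pixels classified as tissue
```

With a boolean mask, the tissue fraction of a patch is simply `mask.mean()`, so no list of coordinates has to be built.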

In [None]:
root_path = 'drive/My Drive/slides'

In [None]:
import os
files = os.listdir(root_path)
print(len(files))
print(files[0])

67
tumor_091_mask.tif


In [None]:
slides = []
masks = []
for i in range(len(files)):
  if(files[i].find('xml', 0)!=-1):
    continue
  elif(files[i].find('mask', 0)!=-1):
    masks.append(files[i])
  else:
    slides.append(files[i])
slides.remove('tumor_038.tif')
slides.sort()
masks.sort()
print("Number of Slides", len(slides))
print("Number of Masks", len(masks))

Number of Slides 21
Number of Masks 21
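The cell above relies on the two sorted lists lining up, so that `slides[i]` and `masks[i]` belong together. A more defensive option is to pair files by name. The sketch below assumes the `<name>.tif` / `<name>_mask.tif` naming convention visible in the directory listing; `pair_slides_and_masks` and the example file list are hypothetical.

```python
def pair_slides_and_masks(filenames):
    """Return sorted (slide, mask) filename pairs; slides without a mask are skipped."""
    names = set(filenames)
    pairs = []
    for name in sorted(names):
        # Skip annotation files and the masks themselves.
        if not name.endswith('.tif') or name.endswith('_mask.tif'):
            continue
        mask_name = name[:-len('.tif')] + '_mask.tif'
        if mask_name in names:
            pairs.append((name, mask_name))
    return pairs


# Hypothetical directory listing.
example_files = ['tumor_091_mask.tif', 'tumor_091.tif', 'tumor_091.xml', 'tumor_038.tif']
print(pair_slides_and_masks(example_files))
```

Keeping each slide and its mask in one tuple also makes it harder to shuffle or split one list without the other.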


In [None]:
def make_open_slide(slide_path, tumor_mask_path):
  slide = open_slide(slide_path)
  print ("Read WSI from %s with width: %d, height: %d" % (slide_path, 
                                                          slide.level_dimensions[0][0], 
                                                          slide.level_dimensions[0][1]))

  tumor_mask = open_slide(tumor_mask_path)
  print ("Read tumor mask from %s" % (tumor_mask_path))

  print("Slide includes %d levels"% (len(slide.level_dimensions)))
  levels = min(len(slide.level_dimensions), len(tumor_mask.level_dimensions))
  for i in range(levels):
      print("Level %d, dimensions: %s downsample factor %d" % (i, 
                                                              slide.level_dimensions[i], 
                                                              slide.level_downsamples[i]))
      assert tumor_mask.level_dimensions[i][0] == slide.level_dimensions[i][0]
      assert tumor_mask.level_dimensions[i][1] == slide.level_dimensions[i][1]

  width, height = slide.level_dimensions[7]
  assert width * slide.level_downsamples[7] == slide.level_dimensions[0][0]
  assert height * slide.level_downsamples[7] == slide.level_dimensions[0][1]
  return slide, tumor_mask

In [None]:
def read_chips(slide, tumor_mask, level, x_required, y_required, downsample_rate):
  x_dim = slide.level_dimensions[level][0] # X dimension of slide at level
  y_dim = slide.level_dimensions[level][1] # Y dimension of slide at level

  row_iters = int(x_dim/x_required) # Number of cuts in X dimension
  col_iters = int(y_dim/y_required) # Number of cuts in Y dimension

  slide_images = [] # array of all the cuts of slides
  tumor_mask_images = [] #array of all the cuts of tumor masks

  for i in range(0, row_iters):
    x = i*x_required*downsample_rate # Effective x from where cut is made with level 0 as reference
    for j in range(0, col_iters):
      y = j*y_required*downsample_rate # Effective y from where cut is made with level 0 as reference

      slide_image = read_slide(slide, 
                         x=x, 
                         y=y, 
                         level=level, 
                         width=x_required, 
                         height=y_required)
      slide_images.append(slide_image)

      tumor_mask_image = read_slide(tumor_mask, 
                         x=x, 
                         y=y, 
                         level=level, 
                         width=x_required, 
                         height=y_required)[:,:,0]
      tumor_mask_images.append(tumor_mask_image)

  return slide_images, tumor_mask_images, row_iters, col_iters
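`read_chips` tiles the image at the chosen level with non-overlapping windows and converts each window's corner to level-0 coordinates before calling `read_slide`. The same arithmetic can be checked without reading any pixels. This is a sketch with a hypothetical helper, `tile_origins`, using the level-7 dimensions of `tumor_059.tif` reported in the output further down.

```python
def tile_origins(level_width, level_height, tile_size, downsample):
    """Return level-0 (x, y) origins of non-overlapping square tiles.

    Partial tiles at the right and bottom edges are dropped, mirroring
    the integer division in read_chips.
    """
    n_x = level_width // tile_size
    n_y = level_height // tile_size
    return [(i * tile_size * downsample, j * tile_size * downsample)
            for i in range(n_x)
            for j in range(n_y)]


# Level 7 of tumor_059.tif is 760 x 1728 pixels with downsample factor 128.
origins = tile_origins(760, 1728, 299, 128)
print(len(origins), origins[:3])
```

Note that with 299-pixel tiles a noticeable strip at the right and bottom edges of the level-7 image is never read, since only whole tiles are kept.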

In [None]:
def plot_subplots(slide_images, tumor_mask_images, rows, cols):
  print("plotting")
  # squeeze=False always returns a 2D array of axes, so ax[i, j] also works
  # when there is only a single row or a single column
  fig, ax = plt.subplots(rows, cols, figsize = (20, 20), squeeze=False)

  count = 0
  for i in range(0, rows):
    for j in range(0, cols):
        ax[i, j].imshow(slide_images[count])
        if(tumor_mask_images is not None):
          ax[i, j].imshow(tumor_mask_images[count], cmap='jet', alpha=0.5)
        count+=1
  plt.show()

In [None]:
def get_concentrated_slides(slide_images, tumor_mask_images):
  concentrated_slides = []
  concentrated_tissue_mask= []
  count = 0
  for slide_index in range(len(slide_images)):
    slide_image = slide_images[slide_index]
    tissue_pixels = find_tissue_pixels(slide_image)
    # Fraction of the chip that is tissue: tissue pixels / (height * width)
    percent_tissue = len(tissue_pixels) / float(slide_image.shape[0] * slide_image.shape[1])

    if(percent_tissue>=discard_threshold):
      concentrated_slides.append(slide_image)
      concentrated_tissue_mask.append(tumor_mask_images[slide_index])
      count +=1
  return concentrated_slides, concentrated_tissue_mask, count

In [None]:
from tensorflow import image as tfi
def image_standardization(concentrated_slides):
  standardized_slides = []
  for slide in concentrated_slides:
    # Chain the color perturbations so that each one is applied to the result of the previous one
    standardized_slide = tfi.random_brightness(slide, max_delta=64/255, seed=None)
    standardized_slide = tfi.random_saturation(standardized_slide, lower=0, upper=0.25)
    standardized_slide = tfi.random_hue(standardized_slide, max_delta=0.04)
    standardized_slide = tfi.random_contrast(standardized_slide, lower=0, upper=0.75)
    standardized_slides.append(standardized_slide)
  return standardized_slides

In [None]:
def pixel_normalization(standardized_slides, x_required, y_required):
  normalized_slides = []
  for slide in standardized_slides:
    slide = tf.dtypes.cast(slide, tf.float32)/127.5-1
    normalized_slides.append(slide)
  return normalized_slides
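`pixel_normalization` maps uint8 pixel values from [0, 255] to floats in [-1, 1] by dividing by 127.5 and subtracting 1. The NumPy sketch below shows the same mapping together with its inverse, which can be handy when a normalized patch needs to be displayed again. Both helper names are hypothetical.

```python
import numpy as np


def to_unit_range(image_uint8):
    """Map uint8 pixel values [0, 255] to float32 values in [-1, 1]."""
    return image_uint8.astype(np.float32) / 127.5 - 1.0


def to_uint8(image_unit):
    """Invert to_unit_range, e.g. before plotting a normalized patch."""
    return np.clip(np.rint((image_unit + 1.0) * 127.5), 0, 255).astype(np.uint8)


pixels = np.array([0, 128, 255], dtype=np.uint8)
scaled = to_unit_range(pixels)
print(scaled.min(), scaled.max())
```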

In [None]:
def augment(normalized_slides, concentrated_tissue_mask):
  augmented_slides = copy.deepcopy(normalized_slides)
  augmented_tissue_masks = copy.deepcopy(concentrated_tissue_mask)
  for index in range(len(normalized_slides)):
    # TensorFlow's image rotation functions expect inputs with a channel axis.
    # The tumor masks are 2D, so first add a third axis of size 1.
    tissue_mask_3d = tf.expand_dims(concentrated_tissue_mask[index], 2)

    augmented_slides.append(tfi.rot90([normalized_slides[index]], k=1, name=None)[0])
    # Rotate the 3D mask, then use tf.squeeze to drop the size-1 axis and get a 2D mask back.
    augmented_tissue_mask = tf.squeeze(tfi.rot90([tissue_mask_3d], k=1, name=None)[0])
    augmented_tissue_masks.append(augmented_tissue_mask)

    augmented_slides.append(tfi.rot90([normalized_slides[index]], k=2, name=None)[0])
    augmented_tissue_mask = tf.squeeze(tfi.rot90([tissue_mask_3d], k=2, name=None)[0])
    augmented_tissue_masks.append(augmented_tissue_mask)

    augmented_slides.append(tfi.rot90([normalized_slides[index]], k=3, name=None)[0])
    augmented_tissue_mask = tf.squeeze(tfi.rot90([tissue_mask_3d], k=3, name=None)[0])
    augmented_tissue_masks.append(augmented_tissue_mask)

  return augmented_slides, augmented_tissue_masks
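The key property of this augmentation is that each image and its mask are rotated by the same amount, so the labels stay aligned with the pixels. Here is a small NumPy sketch of the same idea, swapping in `np.rot90` for `tf.image.rot90` so that it runs without TensorFlow; the `rotations` helper is hypothetical.

```python
import numpy as np


def rotations(image, mask):
    """Return the original plus the 90, 180 and 270 degree rotations of an
    (H, W, C) image and its (H, W) mask, rotated together."""
    return [(np.rot90(image, k, axes=(0, 1)), np.rot90(mask, k)) for k in range(4)]


# Tiny example: a single "tumor" pixel in the top-left corner.
image = np.zeros((2, 2, 3), dtype=np.uint8)
image[0, 0] = 255
mask = np.zeros((2, 2), dtype=np.uint8)
mask[0, 0] = 1
for rotated_image, rotated_mask in rotations(image, mask):
    # The bright pixel and the mask pixel always end up at the same position.
    assert np.array_equal(rotated_image[..., 0] > 0, rotated_mask > 0)
print(len(rotations(image, mask)))
```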

In [None]:
def write_to_file(data, file_path, type):
  for i in range(len(data)):
    if type=="mask":
      tf.keras.preprocessing.image.save_img(file_path+str(i)+".png", tf.expand_dims(data[i], 2))
    else:
      tf.keras.preprocessing.image.save_img(file_path+str(i)+".png", data[i])

In [None]:
def find_tumor_masks(image, intensity=0.8):
    # True if at least one pixel of the mask is at or above the intensity threshold
    return bool(np.any(image >= intensity))

In [None]:
def get_tumor_slides(input_slides, input_masks):
  tumor_slides = []
  tumor_masks = []
  non_tumor_slides = []
  non_tumor_masks =[]
  for i in range(len(input_masks)):
    if(find_tumor_masks(input_masks[i])):
      tumor_slides.append(input_slides[i])
      tumor_masks.append(input_masks[i])
    else:
      non_tumor_slides.append(input_slides[i])
      non_tumor_masks.append(input_masks[i])
  return tumor_slides, tumor_masks, non_tumor_slides, non_tumor_masks



In [None]:
def preprocess(input_slides, masks, level, x_required, y_required, discard_threshold, flag_plot_slides, flag_plot_concentrated_slides, flag_plot_normalized_slides, file_path):
  for i in range(len(input_slides)):
    slide_path = os.path.join(root_path, input_slides[i])
    tumor_mask_path = os.path.join(root_path, masks[i])
    slide, tumor_mask = make_open_slide(slide_path, tumor_mask_path)

    downsampling_rate = int(slide.level_dimensions[0][0]/slide.level_dimensions[level][0])
    
    slide_images, tumor_mask_images, row_iters, col_iters = read_chips(slide, tumor_mask, level, x_required, y_required, downsampling_rate)

    concentrated_slides, concentrated_tissue_mask, count = get_concentrated_slides(slide_images, tumor_mask_images)

    tumor_slides, tumor_masks, non_tumor_slides, non_tumor_masks = get_tumor_slides(concentrated_slides, concentrated_tissue_mask) 
 
    standardized_tumour_slides = image_standardization(tumor_slides)
    standardized_non_tumour_slides = image_standardization(non_tumor_slides)

    normalized_tumour_slides = pixel_normalization(standardized_tumour_slides, x_required, y_required)
    normalized_non_tumour_slides = pixel_normalization(standardized_non_tumour_slides, x_required, y_required)

    augmented_tumour_slides, augmented_tumour_tissue_mask = augment(normalized_tumour_slides, tumor_masks)
    
    sli_tumor = "tumor/slides/" + str(i) +"_"
    mas_tumor = "tumor/masks/" + str(i) +"_"
    sli_non_tumor = "non_tumor/slides/" + str(i) +"_"
    mas_non_tumor = "non_tumor/masks/" + str(i) +"_"

    write_to_file(augmented_tumour_slides, os.path.join(file_path, sli_tumor), "slide")
    write_to_file(augmented_tumour_tissue_mask, os.path.join(file_path, mas_tumor), "mask")
    write_to_file(normalized_non_tumour_slides, os.path.join(file_path, sli_non_tumor), "slide")
    write_to_file(non_tumor_masks, os.path.join(file_path, mas_non_tumor), "mask")

    if(flag_plot_slides):
      plot_subplots(slide_images, tumor_mask_images, row_iters, col_iters)
    if(flag_plot_concentrated_slides):
      plot_subplots(concentrated_slides, concentrated_tissue_mask, int(count/2), 2)
    if(flag_plot_normalized_slides):
      # Plot the augmented tumor slides and masks computed above
      plot_subplots(augmented_tumour_slides, augmented_tumour_tissue_mask, int(len(augmented_tumour_slides)/2), 2)

In [None]:
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
def make_train_test_split(slides, masks):
  slides, masks = shuffle(slides, masks, random_state=0)
  train_slides, test_slides, train_masks, test_masks = train_test_split(slides, masks, test_size=0.2, random_state=42)
  return train_slides, test_slides, train_masks, test_masks
train_slides, test_slides, train_masks, test_masks = make_train_test_split(slides, masks)


In [None]:
level = 7
IMG_SIZE = 299
discard_threshold = 0.2
flag_plot_slides = False
flag_plot_concentrated_slides= False
flag_plot_normalized_slides = False
file_path_train = "drive/My Drive/Preprocessed_data_ADL/level7/train"
file_path_test = "drive/My Drive/Preprocessed_data_ADL/level7/test"
preprocess(train_slides, train_masks, level, IMG_SIZE, IMG_SIZE, discard_threshold, flag_plot_slides, flag_plot_concentrated_slides, flag_plot_normalized_slides, file_path_train)
preprocess(test_slides, test_masks, level, IMG_SIZE, IMG_SIZE, discard_threshold, flag_plot_slides, flag_plot_concentrated_slides, flag_plot_normalized_slides, file_path_test)


Read WSI from drive/My Drive/slides/tumor_059.tif with width: 97280, height: 221184
Read tumor mask from drive/My Drive/slides/tumor_059_mask.tif
Slide includes 10 levels
Level 0, dimensions: (97280, 221184) downsample factor 1
Level 1, dimensions: (48640, 110592) downsample factor 2
Level 2, dimensions: (24320, 55296) downsample factor 4
Level 3, dimensions: (12160, 27648) downsample factor 8
Level 4, dimensions: (6080, 13824) downsample factor 16
Level 5, dimensions: (3040, 6912) downsample factor 32
Level 6, dimensions: (1520, 3456) downsample factor 64
Level 7, dimensions: (760, 1728) downsample factor 128
Level 8, dimensions: (380, 864) downsample factor 256
Read WSI from drive/My Drive/slides/tumor_005.tif with width: 97792, height: 219648
Read tumor mask from drive/My Drive/slides/tumor_005_mask.tif
Slide includes 10 levels
Level 0, dimensions: (97792, 219648) downsample factor 1
Level 1, dimensions: (48896, 109824) downsample factor 2
Level 2, dimensions: (24448, 54912) downsam

In [None]:
print(len(test_m))
print(len(test_s))

224
224
