## New TIFF-to-.jpeg notebook, this time using tiling

This notebook contains a lot of code copied from one of the public notebooks: [https://www.kaggle.com/rftexas/better-image-tiles-removing-white-spaces]. We are including it in the GitHub repo so that the complete pipeline is visible, but all credit for this work belongs to Kaggle user PAB97.

In [1]:
import os
import cv2
import PIL
import random
import openslide
import skimage.io
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image, display

In [2]:
train_df = pd.read_csv('E:/prostate-cancer-grade-assessment (1)/train.csv')
images = list(train_df['image_id'])
labels = list(train_df['isup_grade'])

In [3]:
data_dir = 'E:/prostate-cancer-grade-assessment (1)/train_images/'

## Compute statistics

First we need a function that computes the proportion of white pixels in a region, together with green and blue statistics that are used later to filter regions.

In [4]:
def compute_statistics(image):
    """
    Args:
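        image                  numpy.array   multi-dimensional array of the form HxWxC (RGB)
    
    Returns:
        ratio_white_pixels     float         ratio of white pixels over total pixels in the image
        green_concentration    float         mean value of the green channel
        blue_concentration     float         mean value of the blue channel
    """
    # NumPy images are indexed (row, column, channel), i.e. height first
    height, width = image.shape[0], image.shape[1]
    num_pixels = height * width
    
    summed_matrix = np.sum(image, axis=-1)
    # Note: A 3-channel white pixel has RGB (255, 255, 255), which sums to 765.
    # Any pixel whose channel sum exceeds 620 is counted as (near-)white.
    num_white_pixels = np.count_nonzero(summed_matrix > 620)
    ratio_white_pixels = num_white_pixels / num_pixels
    
    # Index the channel axis: image[1] would select a row of pixels, not the green channel
    green_concentration = np.mean(image[:, :, 1])
    blue_concentration = np.mean(image[:, :, 2])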
    
    return ratio_white_pixels, green_concentration, blue_concentration
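
As a quick sanity check of the white-pixel threshold, here is a minimal sketch on a synthetic patch. The helper name `white_ratio` and the patch colors are illustrative assumptions, not part of the original notebook; only the threshold of 620 comes from the code above.

```python
import numpy as np

def white_ratio(patch, threshold=620):
    """Fraction of pixels whose R+G+B sum exceeds `threshold` (hypothetical helper)."""
    # Sum in a wide integer type so that uint8 values cannot overflow
    channel_sum = patch.sum(axis=-1, dtype=np.int64)
    return np.count_nonzero(channel_sum > threshold) / (patch.shape[0] * patch.shape[1])

# Synthetic 4x4 RGB patch: top half white, bottom half a pink, tissue-like color
patch = np.zeros((4, 4, 3), dtype=np.uint8)
patch[:2] = (255, 255, 255)   # channel sum 765 -> counted as white
patch[2:] = (200, 120, 180)   # channel sum 500 -> not counted

ratio = white_ratio(patch)
green_mean = patch[:, :, 1].mean()   # mean over the green channel
print(ratio, green_mean)
```

Half of the pixels are white, so the ratio comes out as 0.5, and the green-channel mean is the average of 255 and 120.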

## Select k-best regions

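Next we need a function that sorts a list of tuples, where one component of each tuple is the proportion of white pixels in the region. Regions whose green or blue statistic is 180 or below are discarded first; the rest are sorted in ascending order, so the regions with the least white space come first.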

In [5]:
def select_k_best_regions(regions, k=20):
    """
    Args:
        regions               list           list of 5-component tuples: (x, y) of the top-left pixel,
                                             ratio of white pixels, green concentration,
                                             blue concentration
                                             
        k                     int            number of regions to select
    """
    regions = [x for x in regions if x[3] > 180 and x[4] > 180]
    k_best_regions = sorted(regions, key=lambda tup: tup[2])[:k]
    return k_best_regions
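
To make the filter-then-sort behaviour concrete, here is a small sketch on hand-made tuples. The function name `pick_least_white` and the sample values are assumptions for illustration; the tuple layout and the threshold of 180 follow the function above.

```python
def pick_least_white(regions, k, min_green=180, min_blue=180):
    """Drop regions failing the colour thresholds, then keep the k least-white ones."""
    kept = [r for r in regions if r[3] > min_green and r[4] > min_blue]
    return sorted(kept, key=lambda r: r[2])[:k]

# (x, y, ratio_white_pixels, green_concentration, blue_concentration)
regions = [
    (0,   0,   0.90, 240.0, 240.0),
    (0,   200, 0.10, 200.0, 210.0),
    (200, 0,   0.05, 150.0, 220.0),   # dropped: green is not above 180
    (200, 200, 0.40, 190.0, 195.0),
]

best = pick_least_white(regions, k=2)
for region in best:
    print(region)
```

The region with the lowest white ratio (0.05) does not appear in the result because it fails the green threshold, which shows that filtering happens before sorting.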

Since we only store the coordinates of the top-left pixel, we need a way to retrieve the k best regions themselves, hence the following function.

In [6]:
def get_k_best_regions(coordinates, image, window_size=1024):  # 512 was the original default
    regions = {}
    for i, tup in enumerate(coordinates):
        x, y = tup[0], tup[1]
        regions[i] = image[x : x+window_size, y : y+window_size, :]
    
    return regions

## Slide over the image

The main function: the two while loops slide a window over the image (the first one from top to bottom, the second from left to right). The order does not actually matter.
For each window position you extract the region and compute its statistics; once all positions are visited, you sort the list and select the k best regions.

In [7]:
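def generate_patches(slide_path, window_size=200, stride=256, k=20):  # alternative: stride=128
    
    # Take the second-to-last image stored in the TIFF file
    image = skimage.io.MultiImage(slide_path)[-2]
    image = np.array(image)
    
    # Note: axis 0 of the array is called "width" here and axis 1 "height";
    # the loops and the slicing below use these names consistently.
    max_width, max_height = image.shape[0], image.shape[1]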
    regions_container = []
    i = 0
    
    while window_size + stride*i <= max_height:
        j = 0
        
        while window_size + stride*j <= max_width:            
            x_top_left_pixel = j * stride
            y_top_left_pixel = i * stride
            
            patch = image[
                x_top_left_pixel : x_top_left_pixel + window_size,
                y_top_left_pixel : y_top_left_pixel + window_size,
                :
            ]
            
            ratio_white_pixels, green_concentration, blue_concentration = compute_statistics(patch)
            
            region_tuple = (x_top_left_pixel, y_top_left_pixel, ratio_white_pixels, green_concentration, blue_concentration)
            regions_container.append(region_tuple)
            
            j += 1
        
        i += 1
    
    k_best_region_coordinates = select_k_best_regions(regions_container, k=k)
    k_best_regions = get_k_best_regions(k_best_region_coordinates, image, window_size)
    
    return image, k_best_region_coordinates, k_best_regions
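
The loop condition `window_size + stride*i <= max_height` determines how many window positions fit along one axis. The sketch below isolates that logic; the helper name `window_origins` and the axis lengths are assumptions chosen for illustration.

```python
def window_origins(length, window_size, stride):
    """Top-left offsets of all windows that fit entirely within `length` pixels."""
    origins = []
    i = 0
    while window_size + stride * i <= length:
        origins.append(stride * i)
        i += 1
    return origins

# Hypothetical axis lengths of 500 and 450 pixels
rows = window_origins(500, window_size=200, stride=100)
cols = window_origins(450, window_size=200, stride=100)
print(rows, cols, len(rows) * len(cols))
```

When `length >= window_size`, the number of positions along an axis equals `(length - window_size) // stride + 1`, so pixels beyond the last full window are never visited. A stride larger than the window size also leaves unvisited gaps between consecutive windows.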

## Show the results

In [8]:
def display_images(regions, title):
    fig, ax = plt.subplots(5, 4, figsize=(15, 15))
    
    for i, region in regions.items():
        ax[i//4, i%4].imshow(region)
    
    fig.suptitle(title)

Now we will show some results. 

Please note:
1. The smaller the window size, the more precise the selection, but the longer it takes to run.
2. I would say that a window size of around 200 is a good choice. It is a good trade-off between capturing enough of the biopsy structure and keeping enough detail.
3. A window size that is too small might harm the performance of the model, since you might select only a tiny portion of the biopsy. (To counter this, introducing a random choice might be worth trying.)
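
One possible reading of the "random choice" idea in point 3 is to draw the k regions at random from a larger pool of low-white candidates instead of always taking the top k. This is only a sketch of that interpretation; the function `sample_from_top`, the `pool_size` parameter, and the seed are assumptions and do not appear in the notebook.

```python
import random

def sample_from_top(regions, k, pool_size, seed=None):
    """Randomly pick k regions among the `pool_size` least-white candidates."""
    pool = sorted(regions, key=lambda r: r[2])[:pool_size]
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return rng.sample(pool, min(k, len(pool)))

# Ten hypothetical regions: (x, y, ratio_white_pixels)
regions = [(i * 200, 0, i / 10) for i in range(10)]

picked = sample_from_top(regions, k=3, pool_size=5, seed=0)
print(picked)
```

With a pool larger than k, repeated runs with different seeds cover more of the biopsy while still avoiding the whitest regions.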

### Window size: 200, stride: 100

In [9]:
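def glue_to_one_picture(image_patches, window_size=200, k=32):
    # Use the smallest square grid that can hold k patches; unused cells stay black
    side = int(np.ceil(np.sqrt(k)))
    # JPEG encoding in OpenCV works on 8-bit images, so build the canvas as uint8
    image = np.zeros((side*window_size, side*window_size, 3), dtype=np.uint8)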
        
    for i, patch in image_patches.items():
        x = i // side
        y = i % side
        image[
            x * window_size : (x+1) * window_size,
            y * window_size : (y+1) * window_size,
            :
        ] = patch
    
    return image
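
The grid placement can be checked on tiny synthetic patches. The sketch below is a standalone variant, not the notebook's function: the name `glue_patches`, the rounding-up of the grid side, and the 2x2 patch values are all assumptions made for illustration.

```python
import numpy as np

def glue_patches(patches, window_size, k):
    """Place patches row by row on a square canvas large enough for k patches."""
    side = int(np.ceil(np.sqrt(k)))
    canvas = np.zeros((side * window_size, side * window_size, 3), dtype=np.uint8)
    for i, patch in patches.items():
        row, col = divmod(i, side)  # patch i goes to grid cell (row, col)
        canvas[
            row * window_size : (row + 1) * window_size,
            col * window_size : (col + 1) * window_size,
        ] = patch
    return canvas

# Four 2x2 patches filled with the constant values 10, 20, 30 and 40
patches = {i: np.full((2, 2, 3), 10 * (i + 1), dtype=np.uint8) for i in range(4)}
mosaic = glue_patches(patches, window_size=2, k=4)
print(mosaic.shape)
print(mosaic[:, :, 0])
```

Patch 0 lands in the top-left cell, patch 1 to its right, and patches 2 and 3 fill the second row.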

In [10]:
WINDOW_SIZE = 200
STRIDE = 100
K = 25

In [11]:
%%time
for i in train_df['image_id'].head(2):
    path = data_dir + i + '.tiff'
    image, best_coordinates, best_regions = generate_patches(path, window_size=WINDOW_SIZE, stride=STRIDE, k=K)
    glued_image = glue_to_one_picture(best_regions, window_size=WINDOW_SIZE, k=K)
    # skimage loads images as RGB, while cv2.imwrite expects 8-bit BGR data
    cv2.imwrite(i + ".jpeg", cv2.cvtColor(glued_image.astype(np.uint8), cv2.COLOR_RGB2BGR))


ValueError: cannot decompress jpeg