# Prostate cANcer graDe Assessment (PANDA) Challenge: Creating the Training Dataset

# Introduction

## Prostate Gland and Prostate Cancer

The prostate is part of the male reproductive system.  The prostate gland secretes substances into the urethra that nourish and transport sperm.  Prostate cancer is diagnosed from samples taken during a prostate biopsy (rmicrobe).  Each sample is first assigned a Gleason score, which is then converted to an ISUP grade of 0-5.  A grade of 0 is negative, and a grade of 5 is the most severe form of cancer (Prostate cANcer graDe Assessment (PANDA) Challenge).

The purpose of this notebook is to train the model and save it to a .pkl file that can be used to predict the test data.  Because training the model and predicting in the same notebook would take longer than six hours, the .pkl file is loaded in the competition notebook instead.

## Image Processing, Training, and Creating the .pkl File 

1. Use the training dataset from the notebook [Creating Training Dataset](http://www.kaggle.com/rmicrobe/creating-training-dataset) (rmicrobe).
2. Remove the gray area surrounding the biopsy (Zenify).
3. Create a 4x4 patched image. Take the 16 patches with the lowest proportion of white pixels, so that the patches are most likely to show the relevant part of the sample (i.e., glands) (PAB97).
4. Use fast.ai and ResNet34 to train the model.
5. Create the .pkl file.

![image.png](attachment:image.png)

# Import fast.ai and Dependencies

## Install fast.ai without Internet

Internet access is not allowed in this competition, so the fast.ai packages have to be installed from the wheel files in the fastai2 dataset.

In [None]:
!pip install ../input/fastai2/fastprogress-0.2.3-py3-none-any.whl
!pip install ../input/fastai2/fastcore-0.1.18-py3-none-any.whl
!pip install ../input/fastai2/fastai2-0.0.17-py3-none-any.whl

## Import fast.ai

In [None]:
from fastai2.basics import *
from fastai2.callback.all import *
from fastai2.vision.all import *

## Load Dependencies

In [None]:
import os
import cv2
import PIL
from PIL import Image as Img
from PIL import ImageTk
import random
import openslide
import skimage.io
import skimage.color
from skimage.color import rgb2hsv
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image, display

# Set Random Seed

Setting a random seed ensures that randomly generated sequences come out in the same order on every run.  The seed must stay the same every time to reproduce the same sequence.

In [None]:
np.random.seed(2)
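
As a small illustration of the point above, the sketch below reseeds NumPy's global generator twice with the same value and checks that both draws match. The helper name `draw_sample` is hypothetical and not part of this notebook. It only covers NumPy's generator; other libraries keep their own random state.

```python
import numpy as np

def draw_sample(seed, n=5):
    """Hypothetical helper: reseed NumPy's global generator and draw n integers."""
    np.random.seed(seed)
    return np.random.randint(0, 100, size=n)

first = draw_sample(2)
second = draw_sample(2)

# The same seed reproduces the same sequence.
print(np.array_equal(first, second))
```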

# Import the Training Dataframe

Import the three .csv files from the prostate train dataset directory. Together they form the training dataframe for this part of the process.

In [None]:
train_df1 = pd.read_csv('../input/prostate-train-dataset/train_df1.csv', index_col=False)

train_df2 = pd.read_csv('../input/prostate-train-dataset/train_df2.csv', index_col=False)

train_df3 = pd.read_csv('../input/prostate-train-dataset/train_df3.csv', index_col=False)

data_dir = '../input/prostate-cancer-grade-assessment/train_images/'

Combine the three dataframes and get rid of the 'Unnamed: 0' column.

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
train_df = pd.concat([train_df1, train_df2, train_df3], ignore_index=True)

In [None]:
train_df = train_df.drop(columns=['Unnamed: 0'])
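
The combine-and-drop step can be tried on toy data first. The sketch below uses two small, made-up dataframes (`part_a` and `part_b` are hypothetical stand-ins for the CSV chunks) and `pd.concat`, which gives the same result as chained `append` calls.

```python
import pandas as pd

# Hypothetical stand-ins for two of the CSV chunks, each with a leftover index column.
part_a = pd.DataFrame({'Unnamed: 0': [0, 1], 'image_id': ['a', 'b'], 'isup_grade': [0, 3]})
part_b = pd.DataFrame({'Unnamed: 0': [0], 'image_id': ['c'], 'isup_grade': [5]})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each chunk's index.
combined = pd.concat([part_a, part_b], ignore_index=True)
combined = combined.drop(columns=['Unnamed: 0'])

print(combined)
```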

Get a preview of the dataframe.

In [None]:
train_df.head()

The next step is to limit the training set and create two lists for training. Keep each list to at most 4000 items, because a longer list may result in an error. This notebook uses the first 4000 rows.

In [None]:
train_df = train_df[0:4000]

In [None]:
image = list(train_df['image_id'])
labels = list(train_df['isup_grade'])

# Functions

The enhance_image function removes the gray portion from around the prostate biopsy (Zenify).

In [None]:
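def enhance_image(slide_path, contrast=1, brightness=15):
    # Read the second-to-last image level stored in the multi-level TIFF.
    image = skimage.io.MultiImage(slide_path)[-2]
    image = np.array(image)
    # With a weight of 0 on the second input, this computes
    # image * contrast + brightness, saturated to the uint8 range.
    img_enhanced = cv2.addWeighted(image, contrast, image, 0, brightness)
    return img_enhanced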
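
To see what the `cv2.addWeighted` call does to individual pixels, here is a minimal NumPy sketch of the same arithmetic. The function name `enhance_array` is hypothetical, and the sketch is an approximation of the OpenCV call for illustration, not a replacement for it. With the defaults, every channel value of 240 or above saturates to 255, which is how light gray background turns white.

```python
import numpy as np

def enhance_array(image, contrast=1, brightness=15):
    """Hypothetical NumPy stand-in for cv2.addWeighted(image, contrast, image, 0, brightness)."""
    # src1*alpha + src2*beta + gamma with beta = 0 reduces to image*contrast + brightness.
    scaled = image.astype(np.float64) * contrast + brightness
    # Round and saturate to the valid uint8 range.
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)

# One made-up pixel with a dark, a mid, and a near-white channel value.
pixels = np.array([[[0, 100, 250]]], dtype=np.uint8)
print(enhance_array(pixels))
```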

The function compute_statistics calculates the proportion of white pixels in the region (PAB97).

In [None]:
def compute_statistics(image):
    # NumPy image arrays are indexed (rows, columns, channels).
    height, width = image.shape[0], image.shape[1]
    num_pixels = width * height

    summed_matrix = np.sum(image, axis=-1)
    # A pure white pixel sums to 765 (255 * 3); sums above 620 count as white.
    num_white_pixels = np.count_nonzero(summed_matrix > 620)
    ratio_white_pixels = num_white_pixels / num_pixels

    # Note: image[1] and image[2] select pixel rows 1 and 2 (all channels).
    # The green and blue channels would be image[:, :, 1] and image[:, :, 2].
    green_concentration = np.mean(image[1])
    blue_concentration = np.mean(image[2])

    return ratio_white_pixels, green_concentration, blue_concentration
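
The white-pixel ratio can be checked by hand on a tiny example. The sketch below builds a made-up 2x2 RGB patch and applies the same channel-sum threshold of 620 used above; the pixel values are illustrative only.

```python
import numpy as np

# Hypothetical 2x2 RGB patch: two white pixels, one near-white, one stained.
patch = np.array([
    [[255, 255, 255], [255, 255, 255]],
    [[210, 205, 206], [180, 60, 140]],
], dtype=np.uint8)

# Sum the three channels per pixel: 765, 765, 621 and 380.
channel_sum = np.sum(patch, axis=-1)

# Three of the four sums exceed 620, so three pixels count as white.
white_ratio = np.count_nonzero(channel_sum > 620) / channel_sum.size
print(white_ratio)
```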


The function select_k_best_regions keeps the k regions with the lowest proportion of white pixels, and get_k_best_regions extracts the pixel data for those regions (PAB97).

In [None]:
def select_k_best_regions(regions, k=20):

    regions = [x for x in regions if x[3] > 180 and x[4] > 180]
    k_best_regions = sorted(regions, key=lambda tup: tup[2])[:k]
    return k_best_regions

In [None]:
def get_k_best_regions(coordinates, image, window_size=512):
    regions = {}
    for i, tup in enumerate(coordinates):
        x, y = tup[0], tup[1]
        regions[i] = image[x : x+window_size, y : y+window_size, :]
    
    return regions
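
The filter-then-sort selection can be traced on a few made-up region tuples. In this sketch, `pick_k_least_white` is a hypothetical name that repeats the logic of select_k_best_regions so the example runs on its own; the tuple values are invented.

```python
# Hypothetical region tuples:
# (x, y, ratio_white_pixels, green_concentration, blue_concentration)
regions = [
    (0, 0, 0.90, 240.0, 241.0),
    (0, 64, 0.10, 200.0, 205.0),
    (64, 0, 0.05, 150.0, 210.0),  # dropped by the filter: 150 is not above 180
    (64, 64, 0.30, 190.0, 195.0),
]

def pick_k_least_white(regions, k):
    """Same filter-then-sort logic as select_k_best_regions."""
    kept = [r for r in regions if r[3] > 180 and r[4] > 180]
    # Sort ascending by white ratio and keep the first k.
    return sorted(kept, key=lambda r: r[2])[:k]

best = pick_k_least_white(regions, k=2)
print([(r[0], r[1]) for r in best])
```

Note that the region with the lowest white ratio is not selected here, because the filter runs before the sort.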

The function generate_patches slides a window over the image, computes the statistics for each window, and then selects the k regions with the lowest proportion of white pixels (PAB97).

In [None]:
def generate_patches(image, window_size=200, stride=128, k=20):
        
    max_width, max_height = image.shape[0], image.shape[1]
    regions_container = []
    i = 0
    
    while window_size + stride*i <= max_height:
        j = 0
        
        while window_size + stride*j <= max_width:            
            x_top_left_pixel = j * stride
            y_top_left_pixel = i * stride
            
            patch = image[
                x_top_left_pixel : x_top_left_pixel + window_size,
                y_top_left_pixel : y_top_left_pixel + window_size,
                :
            ]
            
            ratio_white_pixels, green_concentration, blue_concentration = compute_statistics(patch)
            
            region_tuple = (x_top_left_pixel, y_top_left_pixel, ratio_white_pixels, green_concentration, blue_concentration)
            regions_container.append(region_tuple)
            
            j += 1
        
        i += 1
    
    k_best_region_coordinates = select_k_best_regions(regions_container, k=k)
    k_best_regions = get_k_best_regions(k_best_region_coordinates, image, window_size)
    
    return image, k_best_region_coordinates, k_best_regions
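
The number of windows the two loops visit follows directly from the loop condition `window_size + stride*i <= length`. The sketch below works this out for a hypothetical 1024 x 768 image with a window size of 128 and a stride of 64; `count_positions` is an illustrative helper, not part of the notebook.

```python
def count_positions(length, window_size, stride):
    """Number of window offsets along one axis under the condition
    window_size + stride*i <= length."""
    if length < window_size:
        return 0
    return (length - window_size) // stride + 1

# Hypothetical image size; window size 128 and stride 64.
along_axis_0 = count_positions(1024, 128, 64)
along_axis_1 = count_positions(768, 128, 64)
total_windows = along_axis_0 * along_axis_1

print(along_axis_0, along_axis_1, total_windows)
```

Because the stride is half the window size in this case, neighboring windows overlap by 50 percent.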

The function glue_to_one_picture glues the 16 patches into one 4x4 image (PAB97).

In [None]:
def glue_to_one_picture(image_patches, window_size=200, k=16):
    side = int(np.sqrt(k))
    image = np.zeros((side*window_size, side*window_size, 3), dtype=np.int16)
        
    for i, patch in image_patches.items():
        x = i // side
        y = i % side
        image[
            x * window_size : (x+1) * window_size,
            y * window_size : (y+1) * window_size,
            :
        ] = patch
    
    return image
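
The placement rule is row-major: patch `i` goes to grid row `i // side` and grid column `i % side`. The sketch below shows this on a deliberately tiny case of four 2x2 patches, with each patch filled with its own number so the layout is visible; `tile_origin` is a hypothetical helper.

```python
import numpy as np

def tile_origin(i, side, window_size):
    """Top-left (row, col) of patch i in the glued grid, in row-major order."""
    return (i // side) * window_size, (i % side) * window_size

# Hypothetical tiny case: 4 patches of 2x2 pixels glued into a 4x4 image.
side, window_size = 2, 2
canvas = np.zeros((side * window_size, side * window_size), dtype=np.uint8)
for i in range(side * side):
    r, c = tile_origin(i, side, window_size)
    # Fill the patch area with the patch number (1-based) to show where it lands.
    canvas[r:r + window_size, c:c + window_size] = i + 1

print(canvas)
```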

In [None]:
WINDOW_SIZE = 128
STRIDE = 64
K = 16

The get_i function takes a row of the training dataframe, loads the corresponding biopsy image, and prepares it for training.

In [None]:
def get_i(row):
    # The DataBlock passes one dataframe row; look up its slide by image_id.
    slide_path = data_dir + row['image_id'] + '.tiff'
    enhanced_image = enhance_image(slide_path)
    _, best_coordinates, best_regions = generate_patches(enhanced_image, window_size=WINDOW_SIZE, stride=STRIDE, k=K)
    glued_image = glue_to_one_picture(best_regions, window_size=WINDOW_SIZE, k=K)
    glued_image = np.uint8(glued_image)
    return tensor(glued_image)

# Preparing for Training

This part of the code prepares the images for training.  Each image is paired with its ISUP grade label, and a batch of images is shown to confirm that everything works properly.

In [None]:
blocks = (
          ImageBlock,
          CategoryBlock
          )    
getters = [
           get_i,
           ColReader('isup_grade')
          ]
trends = DataBlock(blocks=blocks,
              splitter=RandomSplitter(),
              getters=getters,
              item_tfms=Resize(512),
              )

In [None]:
dls = trends.dataloaders(train_df, bs=16)

In [None]:
dls.show_batch()

# Checkpoints Directory

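Because internet access is unavailable, the pretrained ResNet34 weights cannot be downloaded. This creates a checkpoint directory for ResNet34 and copies the weights into it under the file name that the pretrained model loader looks for.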

In [None]:
Path('/root/.cache/torch/checkpoints/').mkdir(exist_ok=True, parents=True)
!cp '../input/resnet34/resnet34.pth' '/root/.cache/torch/checkpoints/resnet34-333f7ec4.pth'
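
The same staging step can also be written without the shell escape. The sketch below uses `pathlib` and `shutil` and runs against throwaway files in a temporary directory; `stage_weights` and the placeholder file contents are hypothetical, and only the target file name is taken from the cell above.

```python
import shutil
import tempfile
from pathlib import Path

def stage_weights(src, cache_dir, target_name):
    """Copy a weights file into cache_dir under target_name, creating the directory first."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(src, cache_dir / target_name))

# Demo with placeholder files, not the real weights.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'resnet34.pth'
    src.write_bytes(b'placeholder')
    staged = stage_weights(src, Path(tmp) / 'checkpoints', 'resnet34-333f7ec4.pth')
    staged_ok = staged.exists() and staged.read_bytes() == b'placeholder'

print(staged_ok)
```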

# Train the Model using ResNet34 and fast.ai

In [None]:
learn = cnn_learner(dls, resnet34, metrics=error_rate)

In [None]:
torch.cuda.is_available()

In [None]:
learn.unfreeze()

In [None]:
learn.fit_one_cycle(3)

# Create .pkl File

In [None]:
learn.export('test.pkl')

# Works Cited

PAB97. “Better image tiles - Removing white spaces.” Kaggle, 22 May 2020, www.kaggle.com/rftexas/better-image-tiles-removing-white-spaces.

“Prostate cANcer graDe Assessment (PANDA) Challenge.” Kaggle, www.kaggle.com/c/prostate-cancer-grade-assessment/overview/description.

rmicrobe. “Creating Training Dataset.” Kaggle, 3 March 2021, www.kaggle.com/rmicrobe/creating-training-dataset.

rmicrobe. “Microanatomy of the Prostate.” Kaggle, 11 June 2020, www.kaggle.com/rmicrobe/microanatomy-of-the-prostate.

Zenify. “Let's Enhance the Images!” Kaggle, 3 May 2020, www.kaggle.com/debanga/let-s-enhance-the-images.