<img src="https://www.ml.cmu.edu/news/news-archive/2018/september/research-scientists-will-help-build-3d-cellular-map-of-human-body-machine-learning.jpg" alt="HuBMAP" width="800" height="200">

[image source](https://www.ml.cmu.edu/news/news-archive/2018/september/research-scientists-will-help-build-3d-cellular-map-of-human-body-machine-learning.html)
<br>
# <center>HuBMAP: Hacking the Kidney</center>
## <center>Identify glomeruli in human kidney tissue images</center>

# Table of contents <a id='0.1'></a>

1. [Version Notes](#1)
2. [Summary](#2)
3. [Introduction](#3)
4. [Import Packages](#4)
5. [Utility](#5)
   * 5.1 [Augmentation Utilities](#5.1)
   * 5.2 [Data Preprocessing Utilities](#5.2)
   * 5.3 [Training Utilities](#5.3)
6. [Data Overview](#6)
   * 6.1 [Train Data](#6.1)
   * 6.2 [HuBMAP-Metadata](#6.2)
   * 6.3 [Image + Segmentation Mask](#6.3)
       * 6.3.1 [Train Tiff Images](#6.3.1)
       * 6.3.2 [Test Tiff Images](#6.3.2)
7. [EDA](#7)
   * 7.1 [Individual Features](#7.1)
   * 7.2 [Multiple Features](#7.2)
       * 7.2.1 [Image Shape Distribution](#7.2.1)
       * 7.2.2 [Metadata Heatmap](#7.2.2)
8. [Pandas MetaData Profiling](#8)
9. [Tensorflow-Keras Training Pipeline](#9)
    * 9.1 [Configuration](#9.1)
       * 9.1.1 [Hardware Configuration (TPU/GPU/CPU)](#9.1.1)
       * 9.1.2 [Weights and Biases Configuration](#9.1.2)
    * 9.2 [Data Preprocessing](#9.2)
    * 9.3 [Visualize Train Examples](#9.3)
    * 9.4 [Metric: Dice Coefficient](#9.4)
    * 9.5 [Loss](#9.5)
    * 9.6 [Model: Unet](#9.6)
    * 9.7 [Callbacks](#9.7)
    * 9.8 [Training](#9.8)
10. [Reference](#10)

# 1. <a id='1'>Version Notes</a>
[Table of contents](#0.1)

* Version 29: Completed EDA. 
* Version 31: Added a training part using Keras-Tensorflow for training without tfrecords on TPU. Using 4 folds. 
* Version 33: Added augmentations. Using tf.image for augmentation inside `decode_image_and_mask` function. 
* Version 37
    * Added logging abilities using weights and biases.
    * Using qubvel/segmentation models for training.
    * Added advanced augmentations.
    * Training using FCN.
    * Using LR Scheduler for faster training. 
* Version 40: 
    * Updated narration.
    * Using efficientnetb4 encoder.
    * Cosine Annealing Scheduler.
    * Added Coarse Dropout
* Version 41:
    * Added BCE Dice Loss

# 2. <a id='2'>Summary</a>
[Table of contents](#0.1)

* We are given high-resolution kidney images (.tiff) and their corresponding masks, provided both as RLE-encoded strings and as unencoded JSON files. Using these, we need to develop a segmentation model that identifies glomeruli in the PAS stained microscopy data.
* We have 8 train tiff images and 5 test tiff images.
* The **private test set** is larger than the **public test set**.
* A **Glomerulus** is a tiny ball-shaped structure composed of capillary blood vessels actively involved in the filtration of the blood to form urine. The glomerulus is one of the key structures that make up the nephron, the functional unit of the kidney. [source](https://www.medicinenet.com/glomerulus/definition.htm)
* A **Functional Tissue Unit** is defined as a “three-dimensional block of cells centered around a capillary, such that each cell in this block is within diffusion distance from any other cell in the same block”. 
* The images are huge tiff files, so we need to preprocess the data before training. The excellent notebook [here](https://www.kaggle.com/iafoss/256x256-images) will help.
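
To make the preprocessing idea concrete, here is a minimal sketch of cutting a large image into fixed-size tiles with NumPy. It is an illustration under our own assumptions, not the method of the linked notebook: the function name `tile_image`, the 256-pixel tile size, the zero padding, and the random toy array are all hypothetical choices.

```python
import numpy as np

def tile_image(image, tile_size=256):
    """Split an (H, W, C) array into non-overlapping square tiles.

    The image is zero-padded at the bottom and right so that both sides
    become multiples of `tile_size` (an assumption made for this sketch).
    """
    height, width, channels = image.shape
    pad_h = (-height) % tile_size
    pad_w = (-width) % tile_size
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), mode="constant")

    rows = padded.shape[0] // tile_size
    cols = padded.shape[1] // tile_size
    # Split both spatial axes into (tile index, offset inside the tile).
    tiles = padded.reshape(rows, tile_size, cols, tile_size, channels)
    # Bring the two tile indices to the front, then flatten them.
    tiles = tiles.transpose(0, 2, 1, 3, 4)
    return tiles.reshape(rows * cols, tile_size, tile_size, channels)

# Toy stand-in for a slide; the real tiff files are far larger.
rng = np.random.default_rng(0)
toy_slide = rng.integers(0, 256, size=(600, 700, 3), dtype=np.uint8)
tiles = tile_image(toy_slide, tile_size=256)
print(tiles.shape)
```

The same function can be applied to the mask (with a trailing channel axis of size 1) so that every image tile keeps its matching mask tile.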

# 3. <a id='3'>Introduction📔</a>
[Table of contents](#0.1)

Welcome to this new Kaggle competition. The [Human BioMolecular Atlas Program (HuBMAP)](https://hubmapconsortium.org/) is sponsored by the [National Institutes of Health (NIH)](https://www.nih.gov/). The primary task of HuBMAP is to catalyze the development of a framework for mapping the human body at the level of **glomeruli functional tissue units** for the first time in history. Hoping to become one of the world’s largest collaborative biological projects, HuBMAP aims to be an open map of the human body at the cellular level. **This competition, “Hacking the Kidney,” starts by mapping the human kidney at single-cell resolution.**

"**Your challenge is to detect functional tissue units (FTUs) across different tissue preparation pipelines.**"

Successful submissions will construct the tools, resources, and cell atlases needed to determine how the relationships between cells can affect the health of an individual.

## What is HuBMAP?

The focus of HuBMAP is understanding the intrinsic intra-, inter-, and extra- cellular biomolecular distribution in human tissue. HuBMAP will focus on fresh, fixed, or frozen healthy human tissue using in situ and dissociative techniques that have high-spatial resolution.

The Human BioMolecular Atlas Program is a consortium composed of diverse research teams funded by the [Common Fund at the National Institutes of Health](https://commonfund.nih.gov/HuBMAP) . HuBMAP values secure, open sharing, and collaboration with other consortia and the wider research community.

HuBMAP is developing the tools to create an open, global atlas of the human body at the cellular level. These tools and maps will be openly available, to accelerate understanding of the relationships between cell and tissue organization and function and human health.

In [None]:
from IPython.display import HTML

HTML('<center><iframe width="950" height="450" src="https://www.youtube.com/embed/yCh4XnD7rEE" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></center>')

## What is FTU?

An FTU is defined as a “three-dimensional block of cells centered around a capillary, such that each cell in this block is within diffusion distance from any other cell in the same block” (de Bono, 2013). 

The glomerulus (plural glomeruli) is a network of small blood vessels (capillaries) known as a tuft, located at the beginning of a nephron in the kidney. The tuft is structurally supported by the mesangium (the space between the blood vessels), composed of intraglomerular mesangial cells. The blood is filtered across the capillary walls of this tuft through the glomerular filtration barrier, which yields its filtrate of water and soluble substances to a cup-like sac known as Bowman's capsule. 

<br>

<div style="clear:both;display:table">
<img src="https://ohiostate.pressbooks.pub/app/uploads/sites/36/h5p/content/37/images/file-599206597bdbc.jpg" style="width:45%;float:left"/>
<img src="https://cdn.kastatic.org/ka-perseus-images/0e7bfc98302c3e45dc7ec73ab142566a57513ec3.svg" style="width:45%;float:left"/>
</div>

<br>

## Competition Goal (Brief)

* The goal of this competition is the implementation of a successful and robust glomeruli FTU detector. Develop segmentation algorithms that identify **"Glomerulus"** in the PAS stained microscopy data. Detect functional tissue units (FTUs) across different tissue preparation pipelines.

* For each image we are given annotations in a separate JSON file; the same annotations are also RLE encoded in train.csv.

* We are segmenting **glomeruli FTU** in each image.

* Since this is a segmentation task, our evaluation metric is the Dice Coefficient. The Dice coefficient can be used to compare the pixel-wise agreement between a predicted segmentation and its corresponding ground truth.

## About Competition Data

The data is huge **(24.5 GB)**. The HuBMAP data used in this hackathon includes 11 fresh frozen and 9 Formalin Fixed Paraffin Embedded (FFPE) PAS kidney images. Stained microscopy employs histological stains such as H&E or PAS to improve resolution and contrast for visualization of anatomical structures such as tubules or glomeruli. Glomeruli FTU annotations exist for all 20 tissue samples. Some of these will be shared for training, and others will be used to judge submissions.

* The dataset is comprised of very large (>500MB - 5GB) TIFF files. 
* **"The training set"** has 8 tiff files, and the public test set has 5.
* **"The private test set"** is larger than the public test set.
* The training set includes annotations in both RLE-encoded and unencoded (JSON) forms. The annotations denote segmentations of glomeruli.

* **Both the training and public test sets also include anatomical structure segmentations. They are intended to help you identify the various parts of the tissue.**

We are provided with following files:

* For each training image we have been provided with a JSON file. Each JSON file has:
   * A type (Feature) and object type id (PathAnnotationObject). Note that these fields are the same between all files and do not offer signal.
   * A geometry containing a Polygon with coordinates for the feature's enclosing volume
   * Additional properties, including the name and color of the feature in the image.
   * The IsLocked field is the same across file types (locked for glomerulus, unlocked for anatomical structure) and is not signal-bearing.

* train.csv contains the unique IDs for each image, as well as an RLE-encoded representation of the mask for the objects in the image. See the evaluation tab for details of the RLE encoding scheme. Note that we are also given the annotations in a JSON file for each image.

* HuBMAP-20-dataset_information.csv contains additional information (including anonymized patient data) about each image.
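
The JSON layout described above can be explored with the standard library alone. The snippet below is a sketch on a hand-written sample: the coordinates, the `classification` nesting, and the helper name `polygon_bounding_boxes` are assumptions made for illustration, so check the real files before relying on exact key names.

```python
import json

# Hypothetical annotation mimicking the fields described above.
# Coordinates and property values are invented for illustration.
sample_json = """
[
  {
    "type": "Feature",
    "id": "PathAnnotationObject",
    "geometry": {
      "type": "Polygon",
      "coordinates": [[[10, 20], [40, 20], [40, 60], [10, 60], [10, 20]]]
    },
    "properties": {
      "classification": {"name": "glomerulus"},
      "isLocked": true
    }
  }
]
"""

def polygon_bounding_boxes(features):
    """Return (x_min, y_min, x_max, y_max) for every Polygon feature."""
    boxes = []
    for feature in features:
        geometry = feature["geometry"]
        if geometry["type"] != "Polygon":
            continue
        outer_ring = geometry["coordinates"][0]  # first ring is the outline
        xs = [point[0] for point in outer_ring]
        ys = [point[1] for point in outer_ring]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes

features = json.loads(sample_json)
boxes = polygon_bounding_boxes(features)
print(boxes)
```

Bounding boxes like these are a cheap way to sanity-check polygon annotations against the decoded RLE mask.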

## What is RLE?

Run-length encoding (RLE) is a form of lossless data compression in which runs of data (sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run.
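
As a small illustration, the sketch below encodes a flat binary array into (start, length) pairs and decodes it again. The 1-based start positions follow the convention used by the `rle2mask` helper later in this notebook; the function names `rle_encode` and `rle_decode` and the toy array are our own.

```python
import numpy as np

def rle_encode(flat):
    """Encode a flat 0/1 array as (start, length) pairs with 1-based starts."""
    padded = np.concatenate([[0], flat, [0]])
    # Positions where the value changes mark the start and end of each run of 1s.
    changes = np.where(padded[1:] != padded[:-1])[0] + 1
    starts = changes[0::2]
    lengths = changes[1::2] - starts
    return list(zip(starts.tolist(), lengths.tolist()))

def rle_decode(pairs, size):
    """Rebuild the flat 0/1 array from (start, length) pairs."""
    flat = np.zeros(size, dtype=np.uint8)
    for start, length in pairs:
        flat[start - 1:start - 1 + length] = 1
    return flat

toy = np.array([0, 1, 1, 1, 0, 0, 1, 1], dtype=np.uint8)
pairs = rle_encode(toy)
print(pairs)  # [(2, 3), (7, 2)]
restored = rle_decode(pairs, toy.size)
```

Because only the runs of foreground pixels are stored, a sparse mask over a very large image shrinks to a short list of numbers.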

## What are we predicting?

Participants will develop segmentation algorithms that identify **"glomeruli"** in the PAS stained microscopy data, detecting functional tissue units (FTUs) across different tissue preparation pipelines. Participants are welcome to use other external data and/or pre-trained machine learning models in support of FTU segmentation. 

**We need to segment glomeruli in very high-resolution kidney images, using annotations that are available both as RLE-encoded strings and in JSON format.**

## Evaluation Metric: Dice Coefficient

The Dice coefficient is a common metric for **segmentation** tasks. It measures the pixel-wise agreement between a predicted segmentation and its corresponding ground truth. The Dice similarity coefficient for two sets X and Y is defined as:

$$\text{DC}(X, Y) = \frac{2 \times |X \cap Y|}{|X| + |Y|}.$$

where X is the predicted set of pixels and Y is the ground truth.
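
The formula can be checked on a tiny example. This is a NumPy sketch, not the competition's scoring code; the function name `dice_coefficient` and the choice to return 1.0 when both masks are empty are assumptions.

```python
import numpy as np

def dice_coefficient(pred, truth):
    """Dice = 2 * |X ∩ Y| / (|X| + |Y|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    total = pred.sum() + truth.sum()
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    intersection = np.logical_and(pred, truth).sum()
    return float(2.0 * intersection / total)

pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
truth = np.array([[1, 0, 0],
                  [0, 1, 1]])

# |X| = 3, |Y| = 3, |X ∩ Y| = 2, so Dice = 4 / 6
score = dice_coefficient(pred, truth)
print(round(score, 4))
```

A score of 1 means the predicted and ground-truth masks overlap exactly, and 0 means they share no pixels.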

# 4. <a id='4'>Import Packages📚</a>
[Table of contents](#0.1)

In [None]:
!pip install wandb -q --upgrade
!pip install git+https://github.com/qubvel/segmentation_models

In [None]:
%env SM_FRAMEWORK=tf.keras

In [None]:
# basic
import os
import cv2
import sys, gc
import warnings
import time, math
import numpy as np
import pandas as pd
from glob import glob
from pathlib import Path
import pandas_profiling as pp
from tqdm.notebook import tqdm

# visualize
import seaborn as sns
import matplotlib.pyplot as plt
 
# image preprocessing 
import rasterio
import tifffile as tiff
from rasterio.windows import Window
from IPython.display import Image

# kaggle datasets
from kaggle_datasets import KaggleDatasets

# deep learning
import tensorflow as tf
import segmentation_models as sm
from tensorflow.keras.layers import *
from tensorflow.keras import backend as K
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.utils import get_custom_objects
from tensorflow.keras.losses import binary_crossentropy
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, Callback, LearningRateScheduler

# cross validation
from sklearn.model_selection import KFold

# logging
import wandb
from wandb.keras import WandbCallback
from kaggle_secrets import UserSecretsClient

%matplotlib inline
warnings.filterwarnings('ignore')
print(f'Wandb Version: {wandb.__version__}')
print(f'Seaborn Version: {sns.__version__}')
print(f'Tensorflow Version: {tf.__version__}')

In [None]:
# directory
print('Competition Data/Files')
ROOT = '../input/hubmap-kidney-segmentation/'
os.listdir(ROOT)

# 5. <a id='5'>Utility</a>
[Table of contents](#0.1)

We will write some utility functions for use in our data pipeline, following TensorFlow best practices to keep the pipeline efficient. We will also use the tf.image API for data augmentation and apply it through the tf.data API when training on GPU/TPU.
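
One principle worth keeping in mind for segmentation: any geometric augmentation has to be applied identically to the image and its mask, otherwise the labels no longer line up with the pixels. The sketch below shows this with plain NumPy rather than tf.image, purely so it runs anywhere; the function name `paired_flip_rotate` and the toy arrays are hypothetical.

```python
import numpy as np

def paired_flip_rotate(image, mask, rng):
    """Apply the same random flips and 90-degree rotation to image and mask."""
    if rng.random() < 0.5:
        image, mask = np.fliplr(image), np.fliplr(mask)
    if rng.random() < 0.5:
        image, mask = np.flipud(image), np.flipud(mask)
    k = int(rng.integers(0, 4))  # number of quarter turns, drawn once for both
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    return image, mask

# Toy pair: the image is bright exactly where the mask is 1.
toy_mask = np.zeros((8, 8, 1), dtype=np.uint8)
toy_mask[1:3, 2:5] = 1
toy_image = np.repeat(toy_mask * 200, 3, axis=-1)

rng = np.random.default_rng(42)
aug_image, aug_mask = paired_flip_rotate(toy_image, toy_mask, rng)

# The bright region of the image still coincides with the mask.
aligned = np.array_equal(aug_image[..., 0] > 0, aug_mask[..., 0] > 0)
print(aligned)
```

Pixel-level changes such as brightness or contrast are different: they are applied to the image only, since the mask holds labels rather than intensities.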

## 5.1 <a id='5.1'>Augmentation Utilities</a>
[Table of contents](#0.1)

In [None]:
def make_mask(num_holes,side_length,rows, cols, num_channels):
    
    """Builds the mask for all sprinkles."""
    
    row_range = tf.tile(tf.range(rows)[..., tf.newaxis], [1, num_holes])
    col_range = tf.tile(tf.range(cols)[..., tf.newaxis], [1, num_holes])
    r_idx = tf.random.uniform([num_holes], minval=0, maxval=rows-1,
                              dtype=tf.int32)
    c_idx = tf.random.uniform([num_holes], minval=0, maxval=cols-1,
                              dtype=tf.int32)
    r1 = tf.clip_by_value(r_idx - side_length // 2, 0, rows)
    r2 = tf.clip_by_value(r_idx + side_length // 2, 0, rows)
    c1 = tf.clip_by_value(c_idx - side_length // 2, 0, cols)
    c2 = tf.clip_by_value(c_idx + side_length // 2, 0, cols)
    row_mask = (row_range > r1) & (row_range < r2)
    col_mask = (col_range > c1) & (col_range < c2)

    # Combine masks into one layer and duplicate over channels.
    mask = row_mask[:, tf.newaxis] & col_mask
    mask = tf.reduce_any(mask, axis=-1)
    mask = mask[..., tf.newaxis]
    mask = tf.tile(mask, [1, 1, num_channels])
    return mask
    
def sprinkles(image): 
    num_holes = 20
    side_length = 15
    mode = 'normal'
    PROBABILITY = 1
    
    RandProb = tf.cast( tf.random.uniform([],0,1) < PROBABILITY, tf.int32)
    if (RandProb == 0)|(num_holes == 0): return image
    
    img_shape = tf.shape(image)
    # compare strings with ==, not `is` (identity comparison is unreliable for str)
    if mode == 'normal':
        rejected = tf.zeros_like(image)
    elif mode == 'salt_pepper':
        num_holes = num_holes // 2
        rejected_high = tf.ones_like(image)
        rejected_low = tf.zeros_like(image)
    elif mode == 'gaussian':
        rejected = tf.random.normal(img_shape, dtype=tf.float32)
    else:
        raise ValueError(f'Unknown mode "{mode}" given.')
        
    rows = img_shape[0]
    cols = img_shape[1]
    num_channels = img_shape[-1]
    if mode == 'salt_pepper':
        mask1 = make_mask(num_holes,side_length,rows, cols, num_channels)
        mask2 = make_mask(num_holes,side_length,rows, cols, num_channels)
        filtered_image = tf.where(mask1, rejected_high, image)
        filtered_image = tf.where(mask2, rejected_low, filtered_image)
    else:
        mask = make_mask(num_holes,side_length,rows, cols, num_channels)
        filtered_image = tf.where(mask, rejected, image)
    return filtered_image

def transform_shear(image, height, shear, mask=False):
    
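    '''
    shear augmentation on an image
    or a mask.
    --------------------------------
    
    Arguments:
    image -- input image or mask tensor
    height -- side length of the square input
    shear -- maximum shear angle in degrees
    mask -- True if the input is a single-channel mask
    
    Return:
    sheared tensor of shape [height, height, 3],
    or [height, height, 1] if mask is True
    '''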
    
    DIM = height
    XDIM = DIM%2 #fix for size 331
    
    shear = shear * tf.random.uniform([1],dtype='float32')
    shear = math.pi * shear / 180.
        
    # SHEAR MATRIX
    one = tf.constant([1],dtype='float32')
    zero = tf.constant([0],dtype='float32')
    c2 = tf.math.cos(shear)
    s2 = tf.math.sin(shear)
    shear_matrix = tf.reshape(tf.concat([one,s2,zero, zero,c2,zero, zero,zero,one],axis=0),[3,3])    

    # LIST DESTINATION PIXEL INDICES
    x = tf.repeat( tf.range(DIM//2,-DIM//2,-1), DIM )
    y = tf.tile( tf.range(-DIM//2,DIM//2),[DIM] )
    z = tf.ones([DIM*DIM],dtype='int32')
    idx = tf.stack( [x,y,z] )
    
    # ROTATE DESTINATION PIXELS ONTO ORIGIN PIXELS
    idx2 = K.dot(shear_matrix,tf.cast(idx,dtype='float32'))
    idx2 = K.cast(idx2,dtype='int32')
    idx2 = K.clip(idx2,-DIM//2+XDIM+1,DIM//2)
    
    # FIND ORIGIN PIXEL VALUES 
    idx3 = tf.stack([DIM//2-idx2[0,], DIM//2-1+idx2[1,]] )
    d = tf.gather_nd(image, tf.transpose(idx3))
        
    if mask:
        return tf.reshape(d, [DIM,DIM,1])
    
    return tf.reshape(d, [DIM,DIM,3])

def transform_shift(image, height, h_shift, w_shift, mask=False):
    
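    '''
    shift augmentation on an image
    or a mask.
    --------------------------------
    
    Arguments:
    image -- input image or mask tensor
    height -- side length of the square input
    h_shift -- maximum height shift in pixels
    w_shift -- maximum width shift in pixels
    mask -- True if the input is a single-channel mask
    
    Return:
    shifted tensor of shape [height, height, 3],
    or [height, height, 1] if mask is True
    '''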
    
    DIM = height
    XDIM = DIM%2 #fix for size 331
    
    height_shift = h_shift * tf.random.uniform([1],dtype='float32') 
    width_shift = w_shift * tf.random.uniform([1],dtype='float32') 
    one = tf.constant([1],dtype='float32')
    zero = tf.constant([0],dtype='float32')
        
    # SHIFT MATRIX
    shift_matrix = tf.reshape(tf.concat([one,zero,height_shift, zero,one,width_shift, zero,zero,one],axis=0),[3,3])

    # LIST DESTINATION PIXEL INDICES
    x = tf.repeat( tf.range(DIM//2,-DIM//2,-1), DIM )
    y = tf.tile( tf.range(-DIM//2,DIM//2),[DIM] )
    z = tf.ones([DIM*DIM],dtype='int32')
    idx = tf.stack( [x,y,z] )
    
    # ROTATE DESTINATION PIXELS ONTO ORIGIN PIXELS
    idx2 = K.dot(shift_matrix,tf.cast(idx,dtype='float32'))
    idx2 = K.cast(idx2,dtype='int32')
    idx2 = K.clip(idx2,-DIM//2+XDIM+1,DIM//2)
    
    # FIND ORIGIN PIXEL VALUES 
    idx3 = tf.stack([DIM//2-idx2[0,], DIM//2-1+idx2[1,]] )
    d = tf.gather_nd(image, tf.transpose(idx3))
        
    if mask:
        return tf.reshape(d, [DIM,DIM,1])
    
    return tf.reshape(d, [DIM,DIM,3])

def augmentations(image, mask):
    
    '''
    Apply different augmentations on 
    image and mask.
    --------------------------------
    
    Arguments:
    image -- input image
    mask -- input mask
    
    Return:
    image -- augmented image 
    mask -- augmented mask
    '''
    
    spatial = tf.random.uniform([], 0, 1.0, dtype=tf.float32)
    rotate = tf.random.uniform([], 0, 1.0, dtype=tf.float32)
    shear = tf.random.uniform([], 0, 1.0, dtype=tf.float32)
    shift = tf.random.uniform([], 0, 1.0, dtype=tf.float32)
    pixel = tf.random.uniform([], 0, 1.0, dtype=tf.float32)
    drop_coarse = tf.random.uniform([], 0, 1.0, dtype=tf.float32)
    
    # SPATIAL-LEVEL TRANSFORMATIONS
    ## FLIP LEFT-RIGHT
    if spatial >= .2:
        image = tf.image.flip_left_right(image)
        mask = tf.image.flip_left_right(mask)
    
    ## FLIP UP-DOWN
    if spatial >= .3:   
        image = tf.image.flip_up_down(image)
        mask = tf.image.flip_up_down(mask)
        
    ## ROTATIONS
    if rotate > .75:
        image = tf.image.rot90(image, k=3) # rotate 270º
        mask = tf.image.rot90(mask, k=3) # rotate 270º
    elif rotate > .5:
        image = tf.image.rot90(image, k=2) # rotate 180º
        mask = tf.image.rot90(mask, k=2) # rotate 180º
    elif rotate > .25:
        image = tf.image.rot90(image, k=1) # rotate 90º
        mask = tf.image.rot90(mask, k=1) # rotate 90º
    
    ## SHEAR 
    if shear >= .3:
        image = transform_shear(image, height=config.IMAGE_DIM, shear=20.)
        mask = transform_shear(mask, height=config.IMAGE_DIM, shear=20., mask=True)
    
#     ## SHIFT
#     if shift >= .3:
#         image = transform_shift(image, height=config.IMAGE_DIM, h_shift=15., w_shift=15.)
#         mask = transform_shift(mask, height=config.IMAGE_DIM, h_shift=15., w_shift=15., mask=True)

    ## COARSE-DROPOUT
    if drop_coarse >= .3:
        image = sprinkles(image)
        mask = sprinkles(mask)
    
    # PIXEL-LEVEL TRANSFORMATION
    if pixel >= .2:
        
        if pixel >= .7:
            image = tf.image.random_brightness(image, .2)
        elif pixel >= .6:
            image = tf.image.random_hue(image, .2)
        elif pixel >= .5:
            image = tf.image.random_contrast(image, 0.8, 1.2)
        elif pixel >= .4:
            image = tf.image.random_saturation(image, 0.7, 1.3)
        
    return image, mask

## 5.2 <a id='5.2'>Data Preprocessing Utilities</a>
[Table of contents](#0.1)

In [None]:
# https://www.kaggle.com/paulorzp/rle-functions-run-lenght-encode-decode
def mask2rle(img):
    
    '''
    img: numpy array, 1 - mask, 0 - background
    Returns run length as a formatted string
    '''
    pixels= img.T.flatten()
    pixels = np.concatenate([[0], pixels, [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    return ' '.join(str(x) for x in runs)
 
def rle2mask(mask_rle, shape):
    
    '''
    mask_rle: run-length as a formatted string (start length)
    shape: (width,height) of array to return 
    Returns numpy array, 1 - mask, 0 - background

    '''
    s = mask_rle.split()
    starts, lengths = [np.asarray(x, dtype=int) for x in (s[0:][::2], s[1:][::2])]
    starts -= 1
    ends = starts + lengths
    img = np.zeros(shape[0]*shape[1], dtype=np.uint8)
    for lo, hi in zip(starts, ends):
        img[lo:hi] = 1
    return img.reshape(shape).T

def read_tiff(image, encoding_index, resize=None):
    
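    '''
    read tiff images and mask.
    ----------------------------
    
    Arguments:
    image -- tiff image id (file name without the .tiff extension)
    encoding_index -- row index of the matching RLE encoding in train.csv
    resize -- optional integer factor by which to downscale image and mask
    
    Returns:
    tiff_image -- tiff image
    tiff_mask -- segmentation mask
    '''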
    
    tiff_image = tiff.imread(os.path.join(ROOT, f'train/{image}.tiff'))
    
    if len(tiff_image.shape) == 5:
        tiff_image = np.transpose(tiff_image.squeeze(), (1,2,0))
        
    tiff_mask = rle2mask(train['encoding'][encoding_index],
                         (tiff_image.shape[1], tiff_image.shape[0]))
    
    print(f'Image Shape: {tiff_image.shape}')
    print(f'Mask Shape: {tiff_mask.shape}')
    
    if resize:
        rescaled = (tiff_image.shape[1] // resize, tiff_image.shape[0] // resize)
        tiff_image = cv2.resize(tiff_image, rescaled)
        tiff_mask = cv2.resize(tiff_mask, rescaled)

    return tiff_image, tiff_mask

def read_test_tiff(image, resize=None):
    
    '''
    read tiff images.
    ----------------------------
    
    Arguments:
    image -- tiff image id (file name without the .tiff extension)
    resize -- optional integer factor by which to downscale the image
    
    Returns:
    tiff_image -- tiff image
    '''
    
    tiff_image = tiff.imread(os.path.join(ROOT, f'test/{image}.tiff'))
    
    if len(tiff_image.shape) == 5:
        tiff_image = np.transpose(tiff_image.squeeze(), (1,2,0))
    
    if resize:
        rescaled = (tiff_image.shape[1] // resize, tiff_image.shape[0] // resize)
        tiff_image = cv2.resize(tiff_image, rescaled)

    return tiff_image

def plot(image, mask):
    
    '''
    plot image and mask
    ---------------------
    
    Arguments:
    image -- tiff image 
    mask -- segmentation mask
    
    Returns:
    matplotlib plot
    '''
    plt.figure(figsize=(15, 15))

    # Image
    plt.subplot(1, 3, 1)
    plt.imshow(image)
    plt.title("Image", fontsize=16)

    # Mask
    plt.subplot(1, 3, 2)
    plt.imshow(mask)
    plt.title("Image Mask", fontsize=16)

    # Image + Mask
    plt.subplot(1, 3, 3)
    plt.imshow(image)
    plt.imshow(mask, alpha=0.5, cmap='plasma')
    plt.title("Image + Mask", fontsize=16);

def plot_subset(image, mask, start_rh, end_rh, start_cw, end_cw):
    
    '''
    plot image and mask
    ---------------------
    
    Arguments:
    image -- tiff image 
    mask -- segmentation mask
    start_rh -- height start
    end_rh -- height end
    start_cw -- width start 
    end_cw -- width end
    
    Returns:
    matplotlib plot
    '''

    # Figure size
    plt.figure(figsize=(15, 15))

    # subset image and mask
    subset_image = image[start_rh:end_rh, start_cw:end_cw, :]
    subset_mask = mask[start_rh:end_rh, start_cw:end_cw]

    # Image
    plt.subplot(1, 3, 1)
    plt.imshow(subset_image)
    plt.title("Zoomed Image", fontsize=16)

    # Mask
    plt.subplot(1, 3, 2)
    plt.imshow(subset_mask)
    plt.title("Zoomed Mask", fontsize=16)

    # Image + Mask
    plt.subplot(1, 3, 3)
    plt.imshow(subset_image)
    plt.imshow(subset_mask, alpha=0.5, cmap='plasma')
    plt.title("Zoomed Image + Mask", fontsize=16);
    
def countplot(column, plot_type='multiple', gridstyle='whitegrid', gs=None,
              palette='Accent', xlab=None, ylab=None, title=None, fontsize=12):
    
    '''
    Make countplots
    -----------------
    
    Arguments:
    column -- column with categorical values
    plot_type -- multiple grid ('multiple/single')
    gridstyle -- seaborn gridstyle
    gs -- gridspec (if using subplots)
    palette -- color palette
    xlab -- x-axis label
    ylab -- y-axis label
    title -- plot title
    fontsize -- fontsize
    
    Returns:
    sns.countplot()
    '''
    if plot_type=='multiple':
        with sns.axes_style(gridstyle):
            ax = f.add_subplot(gs)  # `f` is the enclosing matplotlib figure (global)
            aa = sns.countplot(x=column, palette=palette)  # keyword form works across seaborn versions
            for p in ax.patches:
                height = p.get_height()
                aa.text(p.get_x()+p.get_width()/2.,
                        height,
                        '{:1.2f}%'.format(height/len(column)*100),
                        ha="center", fontsize=fontsize)
            plt.xlabel(xlab,fontsize=fontsize)
            plt.ylabel(ylab,fontsize=fontsize)
            plt.title(title)
            
    elif plot_type=='single':
        with sns.axes_style("whitegrid"):
            aa = sns.countplot(x=column, palette=palette)  # keyword form works across seaborn versions
            for p in aa.patches:
                height = p.get_height()
                aa.text(p.get_x()+p.get_width()/2.,
                        height,
                        '{:1.2f}%'.format(height/len(column)*100),
                        ha="center", fontsize=fontsize)
            plt.xlabel(xlab,fontsize=fontsize)
            plt.ylabel(ylab,fontsize=fontsize)
            plt.title(title)
            
def distplot(column, gridstyle='whitegrid', gs=None, stats=False, 
             color='yellow', xlab=None, ylab=None, title=None, fontsize=12):
    
    '''
    Make distplots
    -----------------
    
    Arguments:
    column -- column with numeric values
    gridstyle -- seaborn gridstyle
    gs -- gridspec (if using subplots)
    stats -- mean, median, mode.
    color -- matplotlib color
    xlab -- x-axis label
    title -- plot title
    fontsize -- fontsize
    
    Returns:
    sns.distplot()
    '''
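    with sns.axes_style(gridstyle):
        if gs:
            ax = f.add_subplot(gs)  # `f` is the enclosing matplotlib figure (global)
            
        # distplot draws on the current axes and returns them
        aa = sns.distplot(column, color=color)
        
        if stats:
            mean = column.mean()
            median = column.median()
            mode = column.mode()[0] 
            # use the returned axes so this also works when gs is None,
            # and label each line directly so legend entries match the right lines
            aa.axvline(int(mean), color='r', linestyle='--', label=f'Mean: {mean:.2f}')
            aa.axvline(int(median), color='g', linestyle='-', label=f'Median: {median:.2f}')
            aa.axvline(mode, color='b', linestyle='-', label=f'Mode: {mode}')
            plt.legend()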
            
        plt.xlabel(xlab,fontsize=fontsize)
        plt.title(title)
        
def decode_image_and_mask(image, mask, augment=True):
    
    '''
    decode image and mask in order to
    feed data to TPU.
    --------------------------------
    
    Arguments:
    image -- patches of huge tiff file in png format.
    mask -- patches of mask in png format.
    augment -- apply augmentations on images and masks.
    
    Return:
    image 
    mask
    '''
    
    # load raw data as string
    image = tf.io.read_file(image)
    mask = tf.io.read_file(mask)
    
    image = tf.io.decode_png(image, channels=3)  # convert compressed string to 3D uint8 tensor
    mask = tf.io.decode_png(mask)  # convert compressed string to uint8 tensor
    
    if augment:
        
        if tf.random.uniform(()) > 0.5:
            image = tf.image.flip_left_right(image)
            mask = tf.image.flip_left_right(mask)
            
        if tf.random.uniform(()) > 0.4:
            image = tf.image.flip_up_down(image)
            mask = tf.image.flip_up_down(mask)
            
        if tf.random.uniform(()) > 0.5:
            image = tf.image.rot90(image, k=1)
            mask = tf.image.rot90(mask, k=1)
            
        if tf.random.uniform(()) > 0.45:
            image = tf.image.random_saturation(image, 0.7, 1.3)
            
        if tf.random.uniform(()) > 0.45:
            image = tf.image.random_contrast(image, 0.8, 1.2)
    
    image = tf.image.convert_image_dtype(image, tf.float32) # convert to floats in the [0,1] range
    mask = tf.cast(mask, tf.float32)  # convert to floats 1. and 0.

    image = tf.reshape(image, [*IMAGE_DIM, 3])  # reshaping image tensor
    mask = tf.reshape(mask, [*IMAGE_DIM]) # reshaping mask tensor
    
    return image, mask

def generate_data(tiff, masks, batch_size=16, shuffle=True):
    
    '''
    generate batches of tf.Dataset
    object
    --------------------------------
    
    Arguments:
    tiff -- tf.data.Dataset object (tf.Tensor)
    mask -- tf.data.Dataset object (tf.Tensor)
    batch_size -- batches of image, mask pair
    shuffle -- generate train if True or validation data if False
    
    Return:
    ds - tf.data.Dataset dataset 
    '''
    
    
    ds = tf.data.Dataset.zip((tiff, masks)) # create dataset by zipping (image, mask) into pair
    ds = ds.map(decode_image_and_mask, num_parallel_calls=AUTOTUNE) # decode raw data coming from GCS bucket to valid image, mask pair 
    ds = ds.cache() # cache the decoded pairs in memory so they are not re-read on later passes
    ds = ds.repeat() 
    
    # shuffle while training else set to False
    if shuffle:
        ds = ds.shuffle(buffer_size=1000)
        
    ds = ds.batch(batch_size) # generate batches of data
    ds = ds.prefetch(buffer_size=AUTOTUNE) # fetch dataset while model is training
    return ds

## 5.3 <a id='5.3'>Training Utilities</a>
[Table of contents](#0.1)

In [None]:
def decode_image_and_mask(image, mask):
    
    '''
    decode and normalize image and
    mask.
    --------------------------------
    
    Arguments:
    image -- input image (str)
    mask -- input mask (str)
    
    Return:
    image -- normalized image
    mask -- normalized mask
    '''
    
    # load raw data as string
    image = tf.io.read_file(image)
    mask = tf.io.read_file(mask)
    
    image = tf.io.decode_png(image, channels=3)             # convert compressed string to 3D uint8 tensor
    mask = tf.io.decode_png(mask)                           # convert compressed string to uint8 tensor
    
    image = tf.image.convert_image_dtype(image, tf.float32) # convert to floats in the [0,1] range
    mask = tf.cast(mask, tf.float32)                        # convert to floats 1. and 0.

    image = tf.reshape(image, (config.IMAGE_DIM, config.IMAGE_DIM, 3))  # reshaping image tensor
    mask = tf.reshape(mask, (config.IMAGE_DIM, config.IMAGE_DIM, 1))    # reshaping mask tensor
    
    return image, mask

def generate_data(tiff, masks, batch_size=16, shuffle=True, augment=False):
    
    '''
    generate batches of tf.Dataset
    object
    --------------------------------
    
    Arguments:
    tiff -- tf.data.Dataset object (tf.Tensor)
    mask -- tf.data.Dataset object (tf.Tensor)
    batch_size -- batches of image, mask pair
    shuffle -- shuffle data 
    augment -- apply augmentations
    
    Return:
    ds - tf.data.Dataset dataset 
    '''
    
    # create dataset by zipping (image, mask) into pair
    ds = tf.data.Dataset.zip((tiff, masks))
    
    # decode raw data coming from GCS bucket to valid image, mask pair
    ds = ds.map(decode_image_and_mask, num_parallel_calls=AUTOTUNE) 
    
    # cache the decoded (un-augmented) pairs so files are read and decoded only once
    ds = ds.cache()
    
    # apply advanced augmentations after caching so they are re-sampled on every pass
    if augment:
        ds = ds.map(augmentations, num_parallel_calls=AUTOTUNE)
    
    # repeat forever
    ds = ds.repeat() 
    
    # shuffle while training else set to False
    if shuffle:
        ds = ds.shuffle(buffer_size=1000)
        
    # generate batches of data
    # NOTE: config.BATCH_SIZE is used here; the batch_size argument is ignored
    ds = ds.batch(config.BATCH_SIZE, drop_remainder=True)
    ds = ds.prefetch(buffer_size=AUTOTUNE) # fetch dataset while model is training
    return ds

# 6. <a id='6'>Data Overview🔍</a>
[Table of contents](#0.1)

We can see that we have **train and test folders with .tiff images and annotations in JSON**. **train.csv** holds the image ids with their encoded masks, and **HuBMAP-20-dataset_information.csv** holds the metadata.

## 6.1 <a id='6.1'>Train Data</a>

We will start with a glimpse of our train data.

In [None]:
train = pd.read_csv(os.path.join(ROOT, 'train.csv'))
train

**📌 Observations**

**train.csv has the following features:**
* id - unique id for each image.
* encoding - RLE encoded representation of the glomeruli mask in the image. Run-length encoding (RLE) is a form of lossless data compression in which runs of data (sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count.
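
To make the RLE idea concrete, here is a minimal sketch that decodes and re-encodes a toy mask. It assumes the common Kaggle convention of space-separated `start length` pairs, with 1-indexed starts over a column-major flattened mask. `rle_decode` and `rle_encode` are illustrative helpers, not the utilities defined earlier in this notebook, so please check the competition's data description before relying on the pixel ordering.

```python
import numpy as np

def rle_decode(rle, shape):
    """Decode 'start length start length ...' into a binary mask of the given (height, width)."""
    tokens = np.asarray(rle.split(), dtype=int)
    starts, lengths = tokens[0::2] - 1, tokens[1::2]   # assumption: starts are 1-indexed
    flat = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for start, length in zip(starts, lengths):
        flat[start:start + length] = 1
    return flat.reshape(shape, order='F')              # assumption: column-major pixel order

def rle_encode(mask):
    """Encode a binary mask back into the same 'start length' string format."""
    flat = mask.flatten(order='F')
    padded = np.concatenate([[0], flat, [0]])
    # positions where the value changes mark the start and the end of each run
    changes = np.where(padded[1:] != padded[:-1])[0] + 1
    changes[1::2] -= changes[0::2]                     # turn run ends into run lengths
    return ' '.join(str(x) for x in changes)

toy_mask = rle_decode('2 2 7 3', shape=(3, 4))
print(toy_mask)
print(rle_encode(toy_mask))
```

Two runs (5 pixels) are enough to describe the whole toy mask, which is why RLE is a compact way to store the large, mostly empty glomeruli masks.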

In [None]:
print(f'We have {train.shape[0]} rows and {train.shape[1]} columns in our train.csv.')

### Train Dataframe Information

We have two columns in train.csv. 

In [None]:
train.info()

### Missing Values

We have no missing values.

In [None]:
print(f'Missing values in train.csv in each columns:\n{train.isnull().sum()}')

### Unique Values

There are only 8 images, and every id is unique.

In [None]:
print('Unique Values in each column of train.csv')
print('##########################################')
for col in train:
    print(f'{col}: {train[col].nunique()}')

## 6.2 <a id='6.2'>HuBMAP-Metadata</a>
[Table of contents](#0.1)

In [None]:
metadata = pd.read_csv(os.path.join(ROOT, 'HuBMAP-20-dataset_information.csv'))
metadata.head()

**📌 Observations**

* image_file - unique image id.
* width_pixels - image width in pixels.
* height_pixels - image height in pixels.
* anatomical_structures_segmention_file - JSON file with the anatomical structure annotations for each image.
* glomerulus_segmentation_file - glomeruli segmentations for each image.
* patient_number - patient id.
* race - patient's race.
* ethnicity - patient's ethnic group.
* sex - patient's sex.
* age - patient's age.
* weight_kilograms - patient's weight (in kg).
* height_centimeters - patient's height (in cm).
* bmi_kg/m^2 - body mass index.
* laterality - side of the kidney (left / right).
* percent_cortex - percentage of cortex. The outer part of the kidney is called the **cortex**.
* percent_medulla - percentage of medulla. The inner part of the kidney is called the **medulla**.

In [None]:
print(f'We have {metadata.shape[0]} rows and {metadata.shape[1]} columns in our metadata.csv.')

### Metadata Information

First we will convert the categorical columns to the category dtype.

In [None]:
for col in ['race', 'ethnicity', 'sex', 'laterality']:
    metadata[col] = metadata[col].astype('category')

In [None]:
metadata.info()

### Missing Values

In [None]:
print(f'Missing values in metadata.csv in each columns:\n{metadata.isnull().sum()}')

### Unique Values

In [None]:
print('Unique Values in each column of metadata.csv')
print('##########################################')
for col in metadata:
    print(f'{col}: {metadata[col].nunique()}')

## 6.3 <a id='6.3'>Image + Segmentation Mask</a>
### 6.3.1 <a id='6.3.1'>Train Tiff Images</a>
[Table of contents](#0.1)

There are 8 tiff images available for training. 

In [None]:
print('We have following train images.')
for index, name in enumerate(train.id.values):
    print(name)

In [None]:
image, mask = read_tiff('2f6ecfcdf', 0)

In [None]:
plot(image, mask)

In [None]:
gc.collect()

In [None]:
plot_subset(image, mask, 5000, 10000, 10000, 15000)

In [None]:
gc.collect()

In [None]:
image, mask = read_tiff('aaa6a05cc', 1)

In [None]:
plot(image, mask)

In [None]:
gc.collect()

In [None]:
plot_subset(image, mask, 10000, 12500, 2000, 4000)

In [None]:
gc.collect()

In [None]:
image, mask = read_tiff('cb2d976f4', 2, 3)

In [None]:
plot(image, mask)

In [None]:
gc.collect()

In [None]:
plot_subset(image, mask, 2000, 5000, 1000, 4000)

In [None]:
gc.collect()

In [None]:
image, mask = read_tiff('0486052bb', 3, 3)

In [None]:
plot(image, mask)

In [None]:
gc.collect()

In [None]:
plot_subset(image, mask, 1000, 4000, 1000, 4000)

In [None]:
gc.collect()

# 7. <a id='7'>EDA</a>
## 7.1 <a id='7.1'>Individual Features</a>
[Table of contents](#0.1)

In [None]:
f = plt.figure(figsize=(16, 8))
gs = f.add_gridspec(1, 2)

distplot(metadata['patient_number'], gs=gs[0,0], stats=True, xlab='Patient Number', title='Patient Number Distribution',
        color = 'darkseagreen')
distplot(metadata['bmi_kg/m^2'], gs=gs[0,1], stats=True, xlab='BMI Index (kg/m^2)', title='BMI Index Distribution',
         color = 'teal')

**📌 Observations**
* Patient numbers range from 64000 to 67000. 
* BMI ranges from 22 to 36 with a mean of **27 kg/m^2**.

In [None]:
f = plt.figure(figsize=(16, 8))
gs = f.add_gridspec(1, 2)

distplot(metadata['weight_kilograms'], gs=gs[0,0], stats=True, xlab='Weight (kgs)', title='Patient Weight Distribution')
distplot(metadata['height_centimeters'], gs=gs[0,1], stats=True, xlab='Height (cms)', title='Patient Height Distribution',
         color = 'purple')

**📌 Observations**
* Both **Weight & Height** seem to be normally distributed. 
* Mean weight and height are **81 kg and 170 cm** respectively.
* Most patients have a height between **160 cm and 175 cm.**

In [None]:
f = plt.figure(figsize=(16, 8))
gs = f.add_gridspec(1, 2)

distplot(metadata['width_pixels'], gs=gs[0,0], stats=True, xlab='Image Width (px)', title='Image Width Distribution',
        color = 'cyan')
distplot(metadata['height_pixels'], gs=gs[0,1], stats=True, xlab='Image Height (px)', title='Image Height Distribution',
         color = 'green')

**📌 Observations**
* Image width distribution ranges from 10000 to 50000. Mean and median at 32000. 
* Image height distribution ranges from 15000 to 38000. Mean and median at 28000 and 30500 respectively.

In [None]:
f = plt.figure(figsize=(16, 8))
gs = f.add_gridspec(1, 2)

distplot(metadata['percent_cortex'], gs=gs[0,0], stats=True, xlab='Cortex (%)', title='Cortex Distribution',
         color = 'coral')
distplot(metadata['percent_medulla'], gs=gs[0,1], stats=True, xlab='Medulla (%)', title='Medulla Distribution',
         color = 'red')

**📌 Observations**
* Cortex and medulla distributions mirror each other, over different ranges.
* Cortex % ranges from 53 to 80 percent.
* Medulla % ranges from 20 to 45 percent.

In [None]:
f = plt.figure(figsize=(16, 8))
gs = f.add_gridspec(1, 3)

countplot(metadata['race'], gs=gs[0,0], xlab='race', ylab='count', title='Race Distribution')
countplot(metadata['sex'], gs=gs[0,1], palette='Blues', xlab='sex', ylab='count', title='Gender Distribution')
countplot(metadata['laterality'], gs=gs[0,2], palette='CMRmap', xlab='laterality', ylab='count',
          title='Laterality Distribution')

**📌 Observations**
* **White** and **Black or African American** patients make up 69% and 31% respectively.
* There are **54% males and 46% females**.
* The third plot shows the reverse of the **Gender Distribution** pattern: **54% of the kidneys are Right and 46% are Left.**

## 7.2 <a id='7.2'>Multiple Features</a>
### 7.2.1 <a id='7.2.1'>Image Shape Distribution</a>
[Table of contents](#0.1)

In [None]:
f = plt.figure(figsize=(16, 8))
gs = f.add_gridspec(1, 2)

with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 0])
    sns.scatterplot(x=metadata['width_pixels'], y=metadata['height_pixels'], hue=metadata['age'])
    plt.xlabel('width',fontsize=12)
    plt.ylabel('height',fontsize=12)
    plt.title('Image Resolution Distribution Vs Age', fontsize=14)
    
with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 1])
    sns.scatterplot(x=metadata['width_pixels'], y=metadata['height_pixels'], hue=metadata['race'], size=metadata['sex'])
    plt.xlabel('width',fontsize=12)
    plt.ylabel('height',fontsize=12)
    plt.title('Image Resolution Distribution Vs Race/Sex', fontsize=14)

In [None]:
gc.collect()

**📌 Observations**
* Most of the **White** patients are **Female**.
* It seems all **Black or African American** patients are male.
* Patients between ages 30 and 60 are more prominent. 

In [None]:
f = plt.figure(figsize=(16, 8))
gs = f.add_gridspec(1, 2)

with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 0])
    sns.scatterplot(x=metadata['percent_cortex'], y=metadata['percent_medulla'], hue=metadata['race'],
                    style=metadata['race'], palette='hot')
    plt.xlabel('Cortex (%)',fontsize=12)
    plt.ylabel('Medulla (%)',fontsize=12)
    plt.title('Cortex & Medulla Vs Race', fontsize=14)
    
with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 1])
    sns.scatterplot(x=metadata['percent_cortex'], y=metadata['percent_medulla'], hue=metadata['sex'],
                    size=metadata['sex'], palette='cividis')
    plt.xlabel('Cortex (%)',fontsize=12)
    plt.ylabel('Medulla (%)',fontsize=12)
    plt.title('Percent Cortex Vs Percent Medulla ', fontsize=14)

**📌 Observations**
* We see a negative relationship between **percent_cortex and percent_medulla**.
* The distribution of **percent_cortex and percent_medulla** by **race** seems equal for both races, i.e. **White and Black or African American**.
* The distribution of **percent_cortex and percent_medulla** by **sex** seems equal too for both **male and female**. 

In [None]:
gc.collect()

### 7.2.2 <a id='7.2.2'>Metadata Heatmap</a>
[Table of contents](#0.1)

In [None]:
f = plt.figure(figsize=(16, 10))

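# correlations are only defined for numeric columns, so drop the categorical ones first
corr = metadata.select_dtypes(include='number').corr()
mask = np.triu(np.ones_like(corr, dtype=bool))

with sns.axes_style("white"):
    sns.heatmap(corr, mask=mask, square=True, cmap = 'YlOrBr', annot=True);
    plt.title("Meta Data Feature Correlation", fontsize=14)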

# 8. <a id='8'>Pandas MetaData Profiling</a>
[Table of contents](#0.1)

In [None]:
metadata_profile = pp.ProfileReport(metadata)

In [None]:
metadata_profile

## THE TRAINING PART OF THIS NOTEBOOK IS MEANT TO HELP PEOPLE WITH TENSORFLOW-KERAS TRAINING. YOU CAN TRY SEVERAL DIFFERENT ARCHITECTURES AVAILABLE IN THE SEGMENTATION MODELS LIBRARY.

# 9. <a id='9'>Tensorflow-Keras Training Pipeline</a>
[Table of contents](#0.1)

In this section we will see how to **train a segmentation model using tensorflow-keras on TPU when the data is not in TFRecord format** (the code below builds an FPN from `segmentation_models`). We will use the [tf.data](https://www.tensorflow.org/api_docs/python/tf/data) API to build our data pipeline. You can also train the model using TFRecords; [Marcos Novaes](https://www.kaggle.com/marcosnovaes) and [Geir Drange](https://www.kaggle.com/mistag) have published great notebooks on that approach.

I am using Weights and Biases to log my results. I have included the setup steps in case anyone wants to get started with it.

First we need to install [Pavel Yakubovskiy](https://github.com/qubvel)'s amazing library [segmentation_models](https://github.com/qubvel/segmentation_models), which makes experimentation really quick. We are also going to install [Weights & Biases](https://www.wandb.com/) to log our results and visualize our training interactively.

<center><img src='https://miro.medium.com/max/875/1*2hZsom9OR2luUM1nGnOQQg.png'></center>

## 9.1 <a id='9.1'>Configuration🔨</a>
### 9.1.1 <a id='9.1.1'>Hardware Configuration (TPU/GPU/CPU)🔨</a>
[Table of contents](#0.1)

We will configure the hardware accelerator here. For more information on how **TPUs work on Kaggle** please go [here](https://www.kaggle.com/docs/tpu).

In [None]:
try:
    # detect and initialize tpu
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Connecting to tpu...')
    print('device running at:', tpu.master())
except ValueError:
    tpu = None

if tpu:
    print('Initializing TPU...')
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    # instantiate a distribution strategy
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
    print("TPU initialized")
else:
    print('Using default strategy...')
    strategy = tf.distribute.get_strategy()

REPLICAS = strategy.num_replicas_in_sync
print(f"REPLICAS:  {REPLICAS}")

AUTOTUNE = tf.data.experimental.AUTOTUNE

### 9.1.2 <a id='9.1.2'>Weights and Biases Configuration</a>
[Table of contents](#0.1)

Here I am using **Kaggle's Add-ons** to hide my secret wandb login key. You can create your key after creating an account on Weights and Biases and initializing your wandb project; the steps below show how. For an amazing step-by-step guide please refer [here](https://www.kaggle.com/imeintanis/cnn-track-your-experiments-weights-biases/notebook).

* In the top header of your notebook workspace, click on **Add-ons**.

* Now click on **secrets**.

* Running *wandb.init(project="project-folder-on-W&B", name= 'project_name')* in a code cell (it appears later in this notebook) prints a link in the output. Click on it, copy the key, and paste it in the **Value** field inside **secret**.

* See the image below to get some idea.

In [None]:
Image('../input/lyftl5googlecolab/wandb.png')

* Now copy the code shown in the image, or use the code cell below, to load the login key. This way you can hide your credentials.

In [None]:
user_secrets = UserSecretsClient()
secret_value = user_secrets.get_secret("WANDB_KEY")

* Run the cell below to log in with your wandb API key.

In [None]:
!wandb login $secret_value

Here we define the hyperparameters that we will track using Weights and Biases.

In [None]:
Params = dict(
    DEVICE = 'tpu',
    RUN = 3,                       # Successful version number to be tracked by W&B
    SEED = 0,                      # seed for reproducibility
    BATCH_SIZE = 32 * REPLICAS,    # batch size
    IMAGE_DIM = 256,               # image dimension
    ENCODER = 'seresnext50',       # segmentation encoder
    WEIGHT = 'imagenet',           # imagenet weights
    VERBOSE = 0,                   # interactive/silent training
    DISPLAY_PLOT = True,           # display plot at end of each fold training
    EPOCHS = 70,                   # epoch
    LR = 1e-5,                     # learning rate
    FOLDS = 4,                     # number of folds
)

In [None]:
DIM = Params['IMAGE_DIM']  # image dimension
RUN = Params['RUN']        # successful wandb run number
MODEL = Params['ENCODER']  # segmentation encoder 

Finally, I initialize my run. Please keep in mind that each run is a single execution of the training script. After running the cell below you will get some links in the output, including a link to your project page, which you need to create first inside your W&B profile.

Here you can see -

* project -- your project directory in your W&B profile.
* name -- name of the run (a single training script/notebook execution). You can leave it at the default if you prefer.
* config -- saves all your hyperparameters in a config object.

In [None]:
wandb.init(project="hubmap-hacking-the-kidney",
           name= f'TPU-No-Tfrec-Public-{DIM}-{MODEL}-V{RUN}',
           config=Params)

config = wandb.config

In [None]:
config.keys()

## 9.2 <a id='9.2'>Data Preprocessing🔬</a>
[Table of contents](#0.1)

We will write our data pipeline using the [tf.data](https://www.tensorflow.org/tutorials/load_data/images#using_tfdata_for_finer_control) API. Please check the [utility](#5) section for the implemented utilities. The TPU will read our data from **GCS buckets**.

In [None]:
# obtain the GCS path of a Kaggle dataset
GCS_PATH = KaggleDatasets().get_gcs_path('hubmap-256x256')

# appending GCS PATH for train images and masks
TIFF = tf.io.gfile.glob(str(GCS_PATH + '/train/*'))
MASK = tf.io.gfile.glob(str(GCS_PATH + '/masks/*'))

In [None]:
TRAIN_TIFF = tf.data.Dataset.from_tensor_slices(TIFF)
TRAIN_MASK = tf.data.Dataset.from_tensor_slices(MASK)

TIFF_COUNT = tf.data.experimental.cardinality(TRAIN_TIFF).numpy()
MASK_COUNT = tf.data.experimental.cardinality(TRAIN_MASK).numpy()

print('Training Data')
print(f'Total Tiff Images: {TIFF_COUNT}')
print(f'Total Masks: {MASK_COUNT}')

Let's print our files from GCS bucket and check them.

In [None]:
for files in TRAIN_TIFF.take(5):
    print(files.numpy())
print('\n')
for files in TRAIN_MASK.take(5):
    print(files.numpy())

In [None]:
train_ds = generate_data(TRAIN_TIFF, TRAIN_MASK, batch_size=config.BATCH_SIZE,
                         shuffle=False)

In [None]:
for image, mask in train_ds.take(1):
    image_batch, mask_batch = image, mask
    print("Image shape: ", image_batch.numpy().shape)
    print("Mask shape: ", mask_batch.numpy().shape)
    
del train_ds

In [None]:
plt.figure(figsize=(16,16))
for i,(img,mask) in enumerate(zip(image_batch[:64], mask_batch[:64])):
    plt.subplot(8,8,i+1)
    plt.imshow(img,vmin=0,vmax=255)
    plt.imshow(np.squeeze(mask), alpha=0.4, cmap='plasma')
    plt.axis('off')
    plt.subplots_adjust(wspace=None, hspace=None)

## 9.3 <a id='9.3'>Data Augmentations</a>
[Table of contents](#0.1)

We will perform augmentations on GPU/TPU using the tf.data API. For more information check these amazing notebooks by [Chris Deotte](https://www.kaggle.com/cdeotte) and [Dimitre Oliveira](https://www.kaggle.com/dimitreoliveira), which can be found [here](https://www.kaggle.com/cdeotte/rotation-augmentation-gpu-tpu-0-96#Data-Augmentation-using-GPU/TPU-for-Maximum-Speed!) and [here](https://www.kaggle.com/dimitreoliveira/flower-with-tpus-advanced-augmentations#Advanced-augmentations).
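
One detail worth keeping in mind for segmentation: any geometric augmentation has to be applied identically to the image and its mask, otherwise the labels no longer line up with the pixels. The NumPy sketch below illustrates this with random flips. `random_flip_pair` is a hypothetical helper written for illustration; it is not the `augmentations` function used by this notebook.

```python
import numpy as np

def random_flip_pair(image, mask, rng):
    """Apply the same random flips to an (H, W, C) image and its (H, W, 1) mask."""
    if rng.random() < 0.5:                      # left-right flip, shared by image and mask
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:                      # up-down flip, shared by image and mask
        image, mask = image[::-1], mask[::-1]
    return image, mask

rng = np.random.default_rng(0)
image = rng.random((4, 4, 3))
# toy mask derived from the image so alignment can be checked after augmentation
mask = (image[..., :1] > 0.5).astype(np.float32)

aug_image, aug_mask = random_flip_pair(image, mask, rng)
print(np.array_equal(aug_mask, (aug_image[..., :1] > 0.5).astype(np.float32)))
```

Because the toy mask is a function of the image, recomputing it from the augmented image and comparing is a cheap way to confirm the pair stayed aligned.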

In [None]:
train_ds = generate_data(TRAIN_TIFF, TRAIN_MASK, batch_size=config.BATCH_SIZE,
                         shuffle=False, augment=True)

In [None]:
for image, mask in train_ds.take(1):
    image_batch, mask_batch = image, mask
    print("Image shape: ", image_batch.numpy().shape)
    print("Mask shape: ", mask_batch.numpy().shape)
    
del train_ds

In [None]:
plt.figure(figsize=(16,16))
for i,(img,mask) in enumerate(zip(image_batch[:64], mask_batch[:64])):
    plt.subplot(8,8,i+1)
    plt.imshow(img,vmin=0,vmax=255)
    plt.imshow(np.squeeze(mask), alpha=0.4, cmap='plasma')
    plt.axis('off')
    plt.subplots_adjust(wspace=None, hspace=None)

## 9.4 <a id='9.4'>Metric: Dice Coefficient 🎲</a>
[Table of contents](#0.1)

The Dice similarity coefficient is well suited to segmentation tasks. It is a measure of how well two contours overlap. The dice index ranges from **0 (no overlap) to 1 (perfect match)**.

<center><img src="https://miro.medium.com/max/536/1*yUd5ckecHjWZf6hGrdlwzA.png"></center>

The dice coefficient is given as follows -

$$\text{DSC}(A, B) = \frac{2 \times |A \cap B|}{|A| + |B|}.$$

Here,
* A = predicted mask.
* B = ground truth mask.

$$\text{DSC}(f, x, y) = \frac{2 \times \sum_{i, j} f(x)_{ij} \times y_{ij} + \epsilon}{\sum_{i,j} f(x)_{ij} + \sum_{i, j} y_{ij} + \epsilon}$$

Here,
* x = input image.
* f(x) = predicted output mask by model.
* y = ground truth mask.
* epsilon = small number to avoid divide by zero.
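
As a quick worked example of the formula above, the NumPy sketch below computes the coefficient on two tiny masks. `dice_np` is an illustrative stand-in that mirrors the second formula; the Keras version used for training is defined in the next cell.

```python
import numpy as np

def dice_np(y_true, y_pred, epsilon=1e-10):
    """Dice coefficient on flattened arrays, following the smoothed formula above."""
    y_true = np.asarray(y_true, dtype=np.float64).ravel()
    y_pred = np.asarray(y_pred, dtype=np.float64).ravel()
    intersection = np.sum(y_true * y_pred)
    return (2.0 * intersection + epsilon) / (y_true.sum() + y_pred.sum() + epsilon)

truth = np.array([[1, 1, 0],
                  [0, 1, 0]])
pred = np.array([[1, 0, 0],
                 [0, 1, 1]])

# overlap is 2 pixels, each mask has 3 foreground pixels: 2 * 2 / (3 + 3)
print(round(dice_np(truth, pred), 4))
# identical masks give a perfect score
print(round(dice_np(truth, truth), 4))
```

Note that with two empty masks the epsilon terms make the score 1 instead of 0/0, which is the behaviour you usually want for tiles without glomeruli.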

In [None]:
# dice coefficient
def diceCoefficient(y_true, y_pred, epsilon = 1e-10):

    '''
    Dice Coefficient in Tensorflow
    ------------------------------

    Arguments: 
    y_true (Tensorflow tensor) -- tensor of ground truth values.
    y_pred (Tensorflow tensor) -- tensor of predicted values.
    epsilon -- constant to avoid divide by 0 errors.

    Returns:
    dice_coefficient
    '''
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2. * intersection + epsilon) / (K.sum(y_true_f) + K.sum(y_pred_f) + epsilon)

## 9.5 <a id='9.5'>Loss🎲</a>
### Dice Loss
[Table of contents](#0.1)

In [None]:
# soft dice loss
def dice_loss(y_true, y_pred):
    loss = 1 - diceCoefficient(y_true, y_pred)
    return loss

### Tversky Loss

In [None]:
# tversky index
def tversky(y_true, y_pred, alpha=0.7, beta=0.3, smooth=1):

    '''
    Tversky index in Keras
    ------------------------------

    Arguments: 
    y_true (Tensorflow tensor) -- tensor of ground truth values.
    y_pred (Tensorflow tensor) -- tensor of predicted values.
    smooth -- constant to avoid divide by 0 errors.
    alpha -- constant to control penalties for false negatives. 
    beta -- constant to control penalties for false positives.

    Returns:
    tversky index
    '''

    y_true_pos = K.flatten(y_true)
    y_pred_pos = K.flatten(y_pred)
    true_pos = K.sum(y_true_pos * y_pred_pos)
    false_neg = K.sum(y_true_pos * (1 - y_pred_pos))
    false_pos = K.sum((1 - y_true_pos) * y_pred_pos)
    return (true_pos + smooth) / (true_pos + alpha * false_neg + beta * false_pos + smooth)

# tversky loss
def tversky_loss(y_true, y_pred):
    return 1 - tversky(y_true, y_pred)
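
To see how `alpha` and `beta` shift the balance, the NumPy sketch below re-implements the same formula on toy masks. `tversky_np` is an illustrative stand-in for the Keras function above, with `alpha` multiplying the false negatives and `beta` the false positives, exactly as in the code.

```python
import numpy as np

def tversky_np(y_true, y_pred, alpha=0.7, beta=0.3, smooth=1):
    """Tversky index with the same weighting as the Keras version above."""
    y_true = np.asarray(y_true, dtype=np.float64).ravel()
    y_pred = np.asarray(y_pred, dtype=np.float64).ravel()
    true_pos = np.sum(y_true * y_pred)
    false_neg = np.sum(y_true * (1 - y_pred))
    false_pos = np.sum((1 - y_true) * y_pred)
    return (true_pos + smooth) / (true_pos + alpha * false_neg + beta * false_pos + smooth)

# two missed foreground pixels (false negatives only)
missed = tversky_np([1, 1, 1, 1], [1, 1, 0, 0])
# two extra foreground pixels (false positives only)
extra = tversky_np([1, 1, 0, 0], [1, 1, 1, 1])
print(f'missed: {missed:.3f}, extra: {extra:.3f}')
```

With the default weights, missing two foreground pixels lowers the index more than predicting two extra ones, which suits a task where the foreground class is rare.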

### Focal Tversky Loss

In [None]:
# focal tversky loss
def focal_tversky_loss(y_true, y_pred, gamma=0.75):
    tv = tversky(y_true, y_pred)
    return K.pow((1 - tv), gamma)
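
The exponent is what makes this loss "focal". A quick pure-Python check (illustrative only, `focal` is not part of the notebook) shows that with `gamma=0.75` every loss value strictly between 0 and 1 is raised, and small losses are raised proportionally more, so tiles that are already well segmented still contribute a noticeable loss.

```python
def focal(tversky_loss_value, gamma=0.75):
    """Apply the focal exponent to a plain Tversky loss value (1 - tversky index)."""
    return tversky_loss_value ** gamma

for value in (0.1, 0.5, 0.9):
    boosted = focal(value)
    print(f'1 - tv = {value:.1f} -> focal loss = {boosted:.3f} (x{boosted / value:.2f})')
```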

### Lovasz Loss

In [None]:
# """
# Lovasz-Softmax and Jaccard hinge loss in Tensorflow
# Maxim Berman 2018 ESAT-PSI KU Leuven (MIT License)
# """
def lovasz_loss(y_true, y_pred):
    y_true, y_pred = K.cast(K.squeeze(y_true, -1), 'int32'), K.cast(K.squeeze(y_pred, -1), 'float32')
    logits = K.log(y_pred / (1. - y_pred))
    loss = lovasz_hinge(logits, y_true, per_image=True, ignore=None)
    return loss


def lovasz_grad(gt_sorted):
    """
    Computes gradient of the Lovasz extension w.r.t sorted errors
    See Alg. 1 in paper
    """
    gts = tf.reduce_sum(gt_sorted)
    intersection = gts - tf.cumsum(gt_sorted)
    union = gts + tf.cumsum(1. - gt_sorted)
    jaccard = 1. - intersection / union
    jaccard = tf.concat((jaccard[0:1], jaccard[1:] - jaccard[:-1]), 0)
    return jaccard


def lovasz_hinge(logits, labels, per_image=True, ignore=None):
    """
    Binary Lovasz hinge loss
      logits: [B, H, W] Variable, logits at each pixel (between -infinity and +infinity)
      labels: [B, H, W] Tensor, binary ground truth masks (0 or 1)
      per_image: compute the loss per image instead of per batch
      ignore: void class id
    """
    if per_image:
        def treat_image(log_lab):
            log, lab = log_lab
            log, lab = tf.expand_dims(log, 0), tf.expand_dims(lab, 0)
            log, lab = flatten_binary_scores(log, lab, ignore)
            return lovasz_hinge_flat(log, lab)

        losses = tf.map_fn(treat_image, (logits, labels), dtype=tf.float32)

        # Fixed python3
        losses.set_shape((None,))

        loss = tf.reduce_mean(losses)
    else:
        loss = lovasz_hinge_flat(*flatten_binary_scores(logits, labels, ignore))
    return loss


def lovasz_hinge_flat(logits, labels):
    """
    Binary Lovasz hinge loss
      logits: [P] Variable, logits at each prediction (between -infinity and +infinity)
      labels: [P] Tensor, binary ground truth labels (0 or 1)
    """

    def compute_loss():
        labelsf = tf.cast(labels, logits.dtype)
        signs = 2. * labelsf - 1.
        errors = 1. - logits * tf.stop_gradient(signs)
        errors_sorted, perm = tf.nn.top_k(errors, k=tf.shape(errors)[0], name="descending_sort")
        gt_sorted = tf.gather(labelsf, perm)
        grad = lovasz_grad(gt_sorted)
        # loss = tf.tensordot(tf.nn.relu(errors_sorted), tf.stop_gradient(grad), 1, name="loss_non_void")
        # ELU + 1
        loss = tf.tensordot(tf.nn.elu(errors_sorted) + 1., tf.stop_gradient(grad), 1, name="loss_non_void")
        return loss

    # deal with the void prediction case (only void pixels)
    loss = tf.cond(tf.equal(tf.shape(logits)[0], 0),
                   lambda: tf.reduce_sum(logits) * 0.,
                   compute_loss,
#                    strict=True,
                   name="loss"
                   )
    return loss

def flatten_binary_scores(scores, labels, ignore=None):
    """
    Flattens predictions in the batch (binary case)
    Remove labels equal to 'ignore'
    """
    scores = tf.reshape(scores, (-1,))
    labels = tf.reshape(labels, (-1,))
    if ignore is None:
        return scores, labels
    valid = tf.not_equal(labels, ignore)
    vscores = tf.boolean_mask(scores, valid, name='valid_scores')
    vlabels = tf.boolean_mask(labels, valid, name='valid_labels')
    return vscores, vlabels

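# symmetric lovasz loss
def symmetric_lovasz(y_true, y_pred):
    # lovasz_hinge expects logits, but the model ends in a sigmoid,
    # so map the probabilities back to logits first (same idea as in lovasz_loss)
    y_pred = K.clip(y_pred, K.epsilon(), 1. - K.epsilon())
    logits = K.log(y_pred / (1. - y_pred))
    return 0.5*(lovasz_hinge(logits, y_true) + lovasz_hinge(-logits, 1.0 - y_true))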

### BCE Dice Loss

In [None]:
def bce_dice_loss(y_true, y_pred):
    loss = 0.5*binary_crossentropy(y_true, y_pred) + 0.5*dice_loss(y_true, y_pred)
    return loss

In [None]:
get_custom_objects().update({"dice": dice_loss})
get_custom_objects().update({"bce_dice": bce_dice_loss})
get_custom_objects().update({"lovasz": symmetric_lovasz})
get_custom_objects().update({"tversky": tversky_loss})
get_custom_objects().update({"focal_tversky": focal_tversky_loss})

## 9.6 <a id='9.6'>Model🚀</a>
[Table of contents](#0.1)

In [None]:
model = sm.FPN(config.ENCODER,
               classes=1,
               activation='sigmoid',
               encoder_weights=None)

model.summary()

In [None]:
del model
gc.collect()

## 9.7 <a id='9.7'>Callbacks</a>
## Learning Rate Schedulers
[Table of contents](#0.1) 

In [None]:
###############################
#OneCycleLearningRateScheduler#
###############################

LR_START = 0.00001
LR_MAX = 0.00005
LR_MIN = 0.00001
LR_RAMPUP_EPOCHS = 5
LR_SUSTAIN_EPOCHS = 0
LR_EXP_DECAY = .8

def lrfn(epoch):
    if epoch < LR_RAMPUP_EPOCHS:
        lr = (LR_MAX - LR_START) / LR_RAMPUP_EPOCHS * epoch + LR_START
    elif epoch < LR_RAMPUP_EPOCHS + LR_SUSTAIN_EPOCHS:
        lr = LR_MAX
    else:
        lr = (LR_MAX - LR_MIN) * LR_EXP_DECAY**(epoch - LR_RAMPUP_EPOCHS - LR_SUSTAIN_EPOCHS) + LR_MIN
    return lr

lr_one_cycle = LearningRateScheduler(lrfn, verbose=config.VERBOSE)

##########################
#CosineAnnealingScheduler#
##########################

class CosineAnnealingScheduler(Callback):
    """Cosine annealing scheduler.
    """

    def __init__(self, T_max, eta_max, eta_min=0, verbose=1):
        super(CosineAnnealingScheduler, self).__init__()
        self.T_max = T_max
        self.eta_max = eta_max
        self.eta_min = eta_min
        self.verbose = verbose

    def on_epoch_begin(self, epoch, logs=None):
        if not hasattr(self.model.optimizer, 'lr'):
            raise ValueError('Optimizer must have a "lr" attribute.')
        lr = self.eta_min + (self.eta_max - self.eta_min) * (1 + math.cos(math.pi * epoch / self.T_max)) / 2
        K.set_value(self.model.optimizer.lr, lr)
        # only report the new learning rate when verbose output is requested
        if self.verbose > 0:
            print('\nEpoch %05d: CosineAnnealingScheduler setting learning rate to %s.' % (epoch + 1, lr))

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        logs['lr'] = K.get_value(self.model.optimizer.lr)


cosine_annealer = CosineAnnealingScheduler(T_max=10, eta_max=1e-5, eta_min=1e-6, verbose=config.VERBOSE)
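
Before training it can help to print what the two schedules actually produce. The pure-Python sketch below copies the constants and formulas from the cell above into standalone functions (`ramp_decay_lr` and `cosine_lr` are illustrative names, not notebook objects) so they can be evaluated without Keras.

```python
import math

LR_START, LR_MAX, LR_MIN = 1e-5, 5e-5, 1e-5
LR_RAMPUP_EPOCHS, LR_SUSTAIN_EPOCHS, LR_EXP_DECAY = 5, 0, 0.8

def ramp_decay_lr(epoch):
    """Linear ramp-up, optional plateau, then exponential decay (same logic as lrfn)."""
    if epoch < LR_RAMPUP_EPOCHS:
        return (LR_MAX - LR_START) / LR_RAMPUP_EPOCHS * epoch + LR_START
    if epoch < LR_RAMPUP_EPOCHS + LR_SUSTAIN_EPOCHS:
        return LR_MAX
    return (LR_MAX - LR_MIN) * LR_EXP_DECAY ** (epoch - LR_RAMPUP_EPOCHS - LR_SUSTAIN_EPOCHS) + LR_MIN

def cosine_lr(epoch, T_max=10, eta_max=1e-5, eta_min=1e-6):
    """Cosine annealing formula used by CosineAnnealingScheduler."""
    return eta_min + (eta_max - eta_min) * (1 + math.cos(math.pi * epoch / T_max)) / 2

for epoch in (0, 4, 5, 10, 20):
    print(f'epoch {epoch:2d}: ramp/decay {ramp_decay_lr(epoch):.2e}, cosine {cosine_lr(epoch):.2e}')
```

One thing this makes visible: the cosine formula is not clamped at `T_max`, so with `T_max=10` the learning rate reaches `eta_min` at epoch 10 and then climbs back to `eta_max` by epoch 20, repeating with a period of `2 * T_max` epochs.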

## Early Stopping

In [None]:
early_stop = EarlyStopping(monitor='val_diceCoefficient', mode = 'max',
                           patience=10, restore_best_weights=True)

## Reduce On Plateau

In [None]:
reduce = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=8, min_lr=0.00001)

## WandbCallback

In [None]:
# wandb = WandbCallback(monitor='val_diceCoefficient')

## 9.8 <a id='9.8'>Training</a>
[Table of contents](#0.1) 

We are going to train our model for the number of FOLDS defined above in the configuration. We will save the model with the best `val_diceCoefficient`. Set `VERBOSE` to see the train and val loss and dice coefficient as training progresses, and set `DISPLAY_PLOT` to plot the results at the end of each fold.

We are using 4-fold cross validation here. To avoid possible leakage, all tiles from a particular image are kept together in either the train set or the val set.
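
The grouping idea can be sketched in a few lines of plain Python. The file names and the `<image id>_<tile index>.png` pattern below are made up for illustration, and `image_id` and `split_by_image` are hypothetical helpers; the training cell uses its own path parsing on the real GCS paths.

```python
import posixpath

def image_id(path):
    # assumption: tiles are named '<image id>_<tile index>.png'
    return posixpath.basename(path).split('_')[0]

def split_by_image(paths, val_ids):
    """Send every tile of a validation image to val and everything else to train."""
    val_ids = set(val_ids)
    train = [p for p in paths if image_id(p) not in val_ids]
    val = [p for p in paths if image_id(p) in val_ids]
    return train, val

# made-up paths for illustration only
paths = [f'gs://bucket/train/{img}_{i:04d}.png'
         for img in ('aaa', 'bbb', 'ccc', 'ddd') for i in range(3)]

train_paths, val_paths = split_by_image(paths, val_ids=['ccc'])
train_ids = {image_id(p) for p in train_paths}
held_out_ids = {image_id(p) for p in val_paths}
print(len(train_paths), len(val_paths), train_ids.isdisjoint(held_out_ids))
```

Splitting on image ids rather than on individual tiles is what keeps neighbouring tiles of the same slide from ending up on both sides of the split.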

In [None]:
avg_val_folds = []

ids = train.id.values
kfold = KFold(n_splits=config.FOLDS, shuffle=True, random_state=config.SEED)
for fold,(idxT,idxV) in enumerate(kfold.split(ids)):
        
    print('#'*16); print(f'#### FOLD {fold+1} ####'); print('#'*16)
    print(f'Image Size: {config.IMAGE_DIM}, Batch Size: {config.BATCH_SIZE}, Epochs: {config.EPOCHS}')
    
    tr = set(ids[idxT])
    val = set(ids[idxV])
    
    # CREATE TRAIN AND VALIDATION SUBSETS
    print('preparing data...')
    train_images = [fname for fname in TIFF if fname.split('/')[4].split('_')[0] in tr]
    train_mask = [fname for fname in MASK if fname.split('/')[4].split('_')[0] in tr]
    valid_images = [fname for fname in TIFF if fname.split('/')[4].split('_')[0] in val]
    valid_mask = [fname for fname in MASK if fname.split('/')[4].split('_')[0] in val]
    
    
    # BUILD MODEL
    print('initializing model...')
    K.clear_session()
    with strategy.scope():   
        
        model = sm.FPN(config.ENCODER,
                       classes=1,
                       activation='sigmoid',
                       encoder_weights='imagenet')
        
        model.compile(optimizer=Adam(learning_rate = config.LR),
                      loss='bce_dice',
                      metrics=[diceCoefficient])
        
    # CALLBACKS
    checkpoint = ModelCheckpoint(f'/kaggle/working/hubmap-tf-keras-{config.DEVICE}-fold-%i.h5'%fold,
                                 verbose=config.VERBOSE,
                                 monitor='val_diceCoefficient',
                                 mode='max',
                                 save_best_only=True)
        
    # TRAIN-VAL DATA
    print('generating folds...')
    TRAIN_TIFF = tf.data.Dataset.from_tensor_slices(train_images)
    TRAIN_MASK = tf.data.Dataset.from_tensor_slices(train_mask)
    VAL_TIFF = tf.data.Dataset.from_tensor_slices(valid_images)
    VAL_MASK = tf.data.Dataset.from_tensor_slices(valid_mask)
    
    print('Training Model...')
    history = model.fit(
        generate_data(TRAIN_TIFF, TRAIN_MASK, batch_size=config.BATCH_SIZE, augment=True),
        epochs = config.EPOCHS,
        steps_per_epoch = tf.data.experimental.cardinality(TRAIN_TIFF).numpy() // config.BATCH_SIZE,
        callbacks = [checkpoint, early_stop, cosine_annealer],
        validation_data = generate_data(VAL_TIFF, VAL_MASK, batch_size=config.BATCH_SIZE, shuffle=False),
        validation_steps = tf.data.experimental.cardinality(VAL_TIFF).numpy() // config.BATCH_SIZE,
        verbose=config.VERBOSE
    )
    
    del TRAIN_TIFF, TRAIN_MASK, VAL_TIFF, VAL_MASK, model
    gc.collect()
    
    # PLOT TRAINING
    # https://www.kaggle.com/cdeotte/triple-stratified-kfold-with-tfrecords
    if config.DISPLAY_PLOT:
        
        plt.figure(figsize=(15,5))
        TOTAL = np.arange(len(history.history['diceCoefficient']))
        plt.plot(TOTAL,history.history['diceCoefficient'],'-o',label='Train Dice Coefficient',color='#ff7f0e')
        plt.plot(TOTAL,history.history['val_diceCoefficient'],'-o',label='Val Dice Coefficient',color='#1f77b4')
        
        x = np.argmax( history.history['val_diceCoefficient'] ); dy = np.max( history.history['val_diceCoefficient'] )
        xdist = plt.xlim()[1] - plt.xlim()[0]; ydist = plt.ylim()[1] - plt.ylim()[0]
        plt.scatter(x, dy,s=200,color='#1f77b4'); plt.text(x-0.03*xdist, dy-0.13*ydist,'max dice_coe\n%.2f'%dy,size=14)
        
        plt.ylabel('Dice Coefficient',size=14); plt.xlabel('Epoch',size=14)
        plt.legend(loc=2)
        
        plt2 = plt.gca().twinx()
        
        plt2.plot(TOTAL,history.history['loss'],'-o',label='Train Loss',color='#2ca02c')
        plt2.plot(TOTAL,history.history['val_loss'],'-o',label='Val Loss',color='#d62728')
        
        x = np.argmin( history.history['val_loss'] ); y = np.min( history.history['val_loss'] )
        ydist = plt.ylim()[1] - plt.ylim()[0]
        plt.scatter(x,y,s=200,color='#d62728'); plt.text(x-0.03*xdist,y+0.05*ydist,'min loss\n%.2f'%y,size=14)
        
        plt.ylabel('Loss',size=14)
        plt.legend(loc=3)
        plt.show()
        
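    # record the best validation dice of this fold (computed here so it also works when DISPLAY_PLOT is False)
    avg_val_folds.append(np.max(history.history['val_diceCoefficient']))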

## Let's have a look at some example images of what your interactive training results will look like in Weights and Biases. Weights and Biases can help you significantly in keeping track of your experiments.

In [None]:
Image('../input/lyftl5googlecolab/public_run.png')

# 10. <a id='10'>Resources</a>
[Table of contents](#0.1)

* [What is HuBMAP?](https://hubmapconsortium.org/about/)
* https://cdn.kastatic.org/ka-perseus-images/0e7bfc98302c3e45dc7ec73ab142566a57513ec3.svg
* https://ohiostate.pressbooks.pub/vethisto/chapter/11-the-glomerulus/
* https://www.kaggle.com/paulorzp/rle-functions-run-lenght-encode-decode
* [what is RLE?](https://en.wikipedia.org/wiki/Run-length_encoding)
* [Image Preprocessing](https://www.kaggle.com/c/hubmap-kidney-segmentation/discussion/197887)
* https://www.tensorflow.org/tutorials/load_data/images#using_tfdata_for_finer_control
* https://www.tensorflow.org/api_docs/python/tf
* [dice loss](https://www.kaggle.com/marcosnovaes/hubmap-3-unet-models-with-keras-cpu-gpu/notebook)
* [Plot from chris deotte's notebook](https://www.kaggle.com/cdeotte/triple-stratified-kfold-with-tfrecords)
* https://www.kaggle.com/dimitreoliveira/flower-with-tpus-advanced-augmentations#Advanced-augmentations
* [HuBMAP fast.ai starter](https://www.kaggle.com/iafoss/hubmap-pytorch-fast-ai-starter/notebook)