# <center>Notebook for Digital Pathology on Clara Train SDK 

## Description
Clara Train SDK comes with many models for different tasks. 
This notebook walks through the use case of automated detection of metastases in histopathology whole slide images (WSIs).
It covers:
1. Downloading the data
2. Preprocessing the labels
3. Introducing [cuCIM](https://github.com/rapidsai/cucim/), an extensible toolkit designed to provide GPU accelerated I/O to load WSI images.
4. Training a model 
5. Inferring 


## Method Settings
All the data used to train, validate, and test this model is from 
[Camelyon-16 Challenge](https://camelyon16.grand-challenge.org/).

The detection task is formulated as classification: determining whether an arbitrary 224x224x3 RGB patch sampled from the WSI is tumor or normal.

We adopted the patch sampling approach of [NCRF](https://github.com/baidu-research/NCRF); please refer to the linked repository for more information.

Training is performed on patch-label pairs, which are sampled from WSI with tumor delineations.  <br>
<left><img src="screenShots/workflow.png" width="600"/></left>

The prediction map is generated in a sliding-window manner.  <br>
<left><img src="screenShots/prediction.png" width="300"/> <img src="screenShots/image.png" width="300" align="left"/></left>


## Disclaimer 
This notebook uses only four WSI files to illustrate the training process.
This is **ONLY** intended to show the user how to get started. 
For a usable model, please train on the full data (the download can take several days), or download the trained model from the NGC release. 


## Prerequisites
- Familiarity with Clara Train main concepts. See [Getting Started Notebook](../../GettingStarted/GettingStarted.ipynb)
- Familiarity with Bring Your Own Component (BYOC). See [Bring your own component notebook](../../GettingStarted/BYOC.ipynb)

## <center>Now Let's Get Started with Data Preparation and Clara Pathology Detection MMAR

## 1. Download Data
First, let's set up directories for the data. The cell below creates the data root folder (`DataDirRoot`) with all necessary subfolders.

In [None]:
import os
root=os.getcwd()
# DataDirRoot=root+"/Data/"
DataDirRoot="/claraDevDay/Data/DP_CAMELYON16/"
print("Data dir root: ", DataDirRoot)

DataDirJson=DataDirRoot+"jsons/"
DataDirCoordRaw=DataDirRoot+"coordsRaw/"
DataDirCoord=DataDirRoot+"coords/"
DataDirWSI=DataDirRoot+"WSI/"
DataDirLoc=DataDirRoot+"LocLabel/"

os.makedirs(DataDirJson, exist_ok=True)
os.makedirs(DataDirCoordRaw, exist_ok=True)
os.makedirs(DataDirCoord, exist_ok=True)
os.makedirs(DataDirWSI, exist_ok=True)
os.makedirs(DataDirLoc, exist_ok=True)

Specifically, the following folders will be populated by downloads:
- /WSI stores the histopathology images
- /jsons stores the ground truth annotations in json format
- /coordsRaw stores all patch sample locations
<br>
The following folders will be generated:
- /coords will be generated from /coordsRaw by keeping only the locations for the downloaded WSIs (4 in this notebook).
- /LocLabel contains the full sample info, and will be generated based on /coords and /jsons  
<br>
<left><img src="screenShots/folders.png" width="200"/></left>

### 1.1 Download WSIs  
You can download all the images of the CAMELYON16 data set from the various sources listed 
[here](https://camelyon17.grand-challenge.org/Data/). Downloading all the data can take several days.

Due to time constraints and for simplicity, this notebook only downloads 4 WSIs: one tumor and one normal slide each for training and validation. Let's download them from the FTP server below. <br>
**Please note: This download could take some time, in total 3.3 GB**

In [None]:
! pip install progressbar2

In [None]:
from progressbar import ProgressBar, Percentage, Bar, ETA, FileTransferSpeed
def download_file_with_progressbar(data):
    f.write(data) 
    global bar
    bar += len(data)

import ftplib
import os
def download_camelyon16_image(filename):
    filename = filename.lower()
    if os.path.exists(filename):
        print(f"The image [{filename}] already exists locally.")
    else:
        print(f"Downloading '{filename}'...")
        prefix = filename.split("_")[0].lower()
        if prefix == "test":
            folder_name = "testing/images"
        elif prefix in ["normal", "tumor"]:
            folder_name = f"training/{prefix}"
        else:
            raise ValueError(
                f"'{filename}' not found on the server."
                " File name should be like 'test_001.tif', 'tumor_001.tif', or 'normal_001.tif'"
            )
        path = f"gigadb/pub/10.5524/100001_101000/100439/CAMELYON16/{folder_name}/"
        ftp = ftplib.FTP("parrot.genomics.cn")
        ftp.login("anonymous", "")
        filepath=path+filename
        print("Downloading ",filepath)
        size = ftp.size(filepath)
        global bar
        bar = ProgressBar(widgets=['Downloading: ', Percentage(), ' ',
                        Bar(marker='#',left='[',right=']'),
                        ' ', ETA(), ' ', FileTransferSpeed()], maxval=size)
        bar.start()    
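        global f
        f = open(filename, 'wb')  
        #ftp.cwd(path)
        #ftp.retrbinary("RETR " + filename, open(filename, "wb").write)
        try:
            ftp.retrbinary("RETR " + filepath, download_file_with_progressbar)
        finally:
            # Close the file so buffered data is flushed to disk, then release the connection
            f.close()
            ftp.quit()
        bar.finish()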

In [None]:
%cd $DataDirWSI
download_camelyon16_image("tumor_091.tif")
download_camelyon16_image("tumor_107.tif")
download_camelyon16_image("normal_042.tif")
download_camelyon16_image("normal_150.tif")

<br>
Check that the files were downloaded in the correct folder 

In [None]:
!ls $DataDirWSI
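If you prefer a programmatic check over `ls`, the following minimal sketch reports the size of each expected file, or `None` when it is missing. The helper name `check_wsi_files` is hypothetical and not part of the Clara Train SDK; the file names are the four slides downloaded above.

```python
import os

def check_wsi_files(folder, expected):
    """Return {file name: size in bytes, or None if the file is missing}."""
    report = {}
    for name in expected:
        path = os.path.join(folder, name)
        report[name] = os.path.getsize(path) if os.path.isfile(path) else None
    return report

# The four slides this notebook downloads
expected_wsis = ["tumor_091.tif", "tumor_107.tif", "normal_042.tif", "normal_150.tif"]

# Example usage in this notebook (DataDirWSI is defined in section 1):
# for name, size in check_wsi_files(DataDirWSI, expected_wsis).items():
#     print(name, "MISSING" if size is None else f"{size / 1e9:.2f} GB")
```

A missing or unexpectedly small file usually means the download was interrupted and should be repeated.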

### 1.2. Download annotation json file
Annotation information is adopted from 
[NCRF/jsons](https://github.com/baidu-research/NCRF/tree/master/jsons).
The cell below downloads the needed files.

In [None]:
jsonURL="https://raw.githubusercontent.com/baidu-research/NCRF/master/jsons/"

wget_URL=jsonURL+"train/Tumor_091.json"
!wget $wget_URL -P $DataDirJson --no-check-certificate
wget_URL=jsonURL+"valid/Tumor_107.json"
!wget $wget_URL -P $DataDirJson --no-check-certificate
wget_URL=jsonURL+"train/Normal_042.json"
!wget $wget_URL -P $DataDirJson --no-check-certificate
wget_URL=jsonURL+"valid/Normal_150.json"
!wget $wget_URL -P $DataDirJson --no-check-certificate


### 1.3. Download patch coords
Location information for the training/validation patches is adopted from 
[NCRF/coords](https://github.com/baidu-research/NCRF/tree/master/coords).
The cell below downloads the needed files.

In [None]:
coordsURL="https://raw.githubusercontent.com/baidu-research/NCRF/master/coords/"

wget_URL=coordsURL+"tumor_train.txt"
!wget $wget_URL -P $DataDirCoordRaw --no-check-certificate
wget_URL=coordsURL+"tumor_valid.txt"
!wget $wget_URL -P $DataDirCoordRaw --no-check-certificate
wget_URL=coordsURL+"normal_train.txt"
!wget $wget_URL -P $DataDirCoordRaw --no-check-certificate
wget_URL=coordsURL+"normal_valid.txt"
!wget $wget_URL -P $DataDirCoordRaw --no-check-certificate

Let's only keep the location info for the WSIs we downloaded 

In [None]:
cmd="grep Tumor_091 "+DataDirCoordRaw+"tumor_train.txt"+" > "+DataDirCoord+"tumor_train.txt"
! $cmd
cmd="grep Tumor_107 "+DataDirCoordRaw+"tumor_valid.txt"+" > "+DataDirCoord+"tumor_valid.txt"
! $cmd
cmd="grep Normal_042 "+DataDirCoordRaw+"normal_train.txt"+" > "+DataDirCoord+"normal_train.txt"
! $cmd
cmd="grep Normal_150 "+DataDirCoordRaw+"normal_valid.txt"+" > "+DataDirCoord+"normal_valid.txt"
! $cmd
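The same filtering can be done without shelling out to `grep`. The sketch below is a plain-Python equivalent under the assumption that each line of a coords file mentions the slide name it belongs to; `filter_coords` is a hypothetical helper name.

```python
def filter_coords(src_path, dst_path, slide_name):
    """Copy only the lines that mention slide_name; return how many were kept."""
    kept = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            # Substring match, mirroring the behaviour of `grep <slide_name>`
            if slide_name in line:
                dst.write(line)
                kept += 1
    return kept

# Example usage in this notebook:
# filter_coords(DataDirCoordRaw + "tumor_train.txt",
#               DataDirCoord + "tumor_train.txt", "Tumor_091")
```

Returning the number of kept lines makes it easy to spot an empty result, for example when the slide name is misspelled.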

## 2. Data Preparation for MMAR Training

The current sample location files, e.g. /coords/tumor_train.txt, contain the information shown below:  <br>
<left><img src="screenShots/coords.png" height="200"/></left>

At each location, a 728x728 patch is sampled and then decomposed into a 3x3 grid of 224x224 patches. 
Therefore, we need to convert the downloaded patch coords to a json that works with Clara MMAR, in the following two steps:
1. Read the NCRF coords and annotation jsons and output the full index/label information: `prepare_train_data.sh` generates the LocLabel files needed for training and validation from the /coords and /jsons folders listed above. It appends the labels after each filename and coordinate pair. 

In [None]:
MMAR_ROOT=root+"/MMAR_DP/"
print ("setting MMAR_ROOT =",MMAR_ROOT)
%ls $MMAR_ROOT
!chmod 777 $MMAR_ROOT/commands/*

In [None]:
! $MMAR_ROOT/commands/prepare_train_data.sh

It will identify all 9 labels at each sample location <br>
<left><img src="screenShots/loclabel.png" height="200"/></left>

The cell below displays the head of each file.

In [None]:
!head -5 $DataDirLoc/tumor_train.txt
!head -5 $DataDirLoc/normal_train.txt
!head -5 $DataDirLoc/tumor_valid.txt
!head -5 $DataDirLoc/normal_valid.txt
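For intuition on where the 9 labels per location can come from, here is a minimal sketch. It assumes, in the spirit of NCRF's approach, that each label records whether the center of one cell of the 3x3 grid falls inside a tumor annotation polygon. The function names, the `cell_size` spacing, and treating the sampled location as the grid center are illustrative assumptions, not the actual logic of `prepare_train_data.sh`.

```python
def point_in_polygon(x, y, vertices):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does this edge cross the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def grid_labels(center_x, center_y, polygons, grid=3, cell_size=256):
    """Label each grid cell center 1 (tumor) or 0 (normal), row by row.

    cell_size is a hypothetical spacing between neighbouring cell centers.
    """
    offset = (grid - 1) / 2
    labels = []
    for row in range(grid):
        for col in range(grid):
            px = center_x + (col - offset) * cell_size
            py = center_y + (row - offset) * cell_size
            is_tumor = any(point_in_polygon(px, py, p) for p in polygons)
            labels.append(int(is_tumor))
    return labels

# A square tumor region that covers only the bottom-right cell center
tumor = [(100, 100), (400, 100), (400, 400), (100, 400)]
print(grid_labels(0, 0, [tumor]))  # [0, 0, 0, 0, 0, 0, 0, 0, 1]
```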


2. Then we combine the txt files into a single json, which will be used by Clara MMAR, using `prepare_json.sh`

In [None]:
! $MMAR_ROOT/commands/prepare_json.sh

This will produce a single json file /Data/datalist.json that will be used by Clara MMAR 

In [None]:
! head -30 $DataDirRoot/datalist.json

<br>
<left><img src="screenShots/clara_json.png" width="400"/></left>

As shown above, each sample Clara accepts for training/validation has the information on WSI path, sample location, and the 9 corresponding labels. Now we have the necessary input for the training/validation pipeline: 
1. path to the folder containing all WSIs
2. json file listing the location and label information for training patches.
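To get a quick overview of the generated datalist rather than reading its head, a small sketch like the following counts the entries in each top-level section. It assumes only that the file is a JSON object whose sections are lists of samples; `summarize_datalist` is a hypothetical helper name.

```python
import json

def summarize_datalist(path):
    """Return {section name: number of entries} for every list-valued section."""
    with open(path) as f:
        datalist = json.load(f)
    return {key: len(value) for key, value in datalist.items()
            if isinstance(value, list)}

# Example usage in this notebook:
# print(summarize_datalist(DataDirRoot + "datalist.json"))
```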

The paths are set in /config/environment.json <br>


## 3. Clara MMAR Training 
### Model Overview
The model is based on ResNet18 with the last fully connected layer replaced by a 1x1 convolution layer.
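One way to see why a 1x1 convolution can stand in for a fully connected layer: a 1x1 convolution applies the same linear map independently at every spatial position, so the classifier head can score a whole grid of positions in one pass. The plain-Python sketch below illustrates this equivalence on toy numbers; it is not the network code of the MMAR, and the function names are hypothetical.

```python
def linear(features, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def conv1x1(feature_map, weights, bias):
    """1x1 convolution: apply the same linear layer at each (y, x) position.

    feature_map[y][x] is the list of channel values at that position.
    """
    return [[linear(pixel, weights, bias) for pixel in row]
            for row in feature_map]

weights = [[1.0, -1.0]]  # one output channel, two input channels
bias = [0.5]
feature_map = [[[1.0, 2.0], [3.0, 1.0]],
               [[0.0, 0.0], [2.0, 2.0]]]

print(conv1x1(feature_map, weights, bias))
# [[[-0.5], [2.5]], [[0.5], [0.5]]]
```

With a 1x1 spatial feature map, the result is exactly what the fully connected layer would produce; with a larger map, the output becomes a grid of scores.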

### WSI Reader <span style="color:red">(New in V4)</span>
We recommend loading WSI images with [cuCIM](https://github.com/rapidsai/cucim/), 
an extensible toolkit designed to provide GPU accelerated I/O, 
computer vision & image processing primitives for N-Dimensional images with a focus on biomedical imaging. 
[OpenSlide](https://openslide.org/), the popular WSI reader, is also provided for convenience. 
Users can choose between the two with an option in the config files.<br>

<left><img src="screenShots/wsireader.png" width="700"/></left>

### 3.1 Training
Training can be performed regularly, or with the [Smart Cache](https://docs.nvidia.com/clara/tlt-mi/nvmidl/additional_features/smart_cache.html) mechanism.

Before we get started, let's check that an NVIDIA GPU is available in the docker container by running the cell below.

In [None]:
!nvidia-smi

#### 3.1.1 Regular Training
Then we can start regular training (without the smart cache mechanism). For this example, we will train for 4 epochs with /commands/train.sh. <br>
<left><img src="screenShots/train.png" width="300"/></left>    

In [None]:
! $MMAR_ROOT/commands/train.sh 

#### 3.1.2 Smart Cache Training
We can also use smart cache training, which can be especially helpful for pathology applications due to the massive number of patches used during training. For this example, we will cache 2000 samples and train for 40 smart cache epochs with /commands/train_smartcache.sh, which will save the output models to models_sc/ <br>
<left><img src="screenShots/train_sc.png" width="300"/></left>    

In [None]:
! $MMAR_ROOT/commands/train_smartcache.sh 
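As a rough mental model of smart caching, the toy simulation below keeps a fixed-size cache of sample indices and swaps part of it out after every epoch, so each epoch trains on a mix of already cached and newly loaded samples. This is only an illustration of the idea, not Clara's implementation; the function name and the `replace_rate` value are hypothetical.

```python
import random

def smart_cache_epochs(num_samples, cache_size, replace_rate, epochs, seed=0):
    """Yield the sample indices held in the cache for each epoch."""
    rng = random.Random(seed)
    order = list(range(num_samples))
    rng.shuffle(order)
    cache = order[:cache_size]
    next_idx = cache_size
    n_replace = int(cache_size * replace_rate)
    for _ in range(epochs):
        yield list(cache)
        # Drop the oldest items and append the next unseen ones (wrapping around)
        incoming = [order[(next_idx + i) % num_samples] for i in range(n_replace)]
        next_idx = (next_idx + n_replace) % num_samples
        cache = cache[n_replace:] + incoming

# Toy numbers: 10 samples, cache of 4, half of the cache replaced per epoch
for epoch, cached in enumerate(smart_cache_epochs(10, 4, 0.5, 3)):
    print(epoch, cached)
```

Because only a fraction of the cache changes between epochs, loading new patches can overlap with training on the ones already in memory.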

### 3.2 Scores and Results
The example shown here uses the minimum number of WSIs for simplicity and to illustrate how the pipeline works. 
Therefore, the resulting model is a dummy one that is not useful. 
You can either download and train on all the data, or use our model from NGC.

### 3.3 Model Export
The model will be exported to TorchScript format.

In [None]:
! $MMAR_ROOT/commands/export.sh 

### 3.4 Validation
To run validation on patches, simply run cell below.

In [None]:
! $MMAR_ROOT/commands/validate.sh 

### 3.5 Inference on a WSI
The output of the network itself is the probability map of the input patch.

Inference can then be performed on a WSI in a sliding-window manner with a specified stride. 

A foreground mask is needed to specify the region where inference will be performed, since background regions containing no tissue at all can occupy a significant portion of a WSI. 

Otsu thresholding in `prepare_inference_data.sh` is used to generate the foreground masks, which reduce the computational burden during inference. The input is the test image, and the output is its foreground mask.

In [None]:
! $MMAR_ROOT/commands/prepare_inference_data.sh
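For readers unfamiliar with Otsu thresholding, here is a minimal plain-Python sketch of the technique: it picks the gray level that maximizes the between-class variance of the two resulting pixel groups. It assumes 8-bit grayscale input and that tissue is darker than the bright slide background; the script may work on a different channel or resolution, so treat this as an illustration rather than a description of `prepare_inference_data.sh`.

```python
def otsu_threshold(values, bins=256):
    """Return the threshold t maximizing between-class variance.

    Class 0 holds values <= t, class 1 holds values > t.
    """
    hist = [0] * bins
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(bins):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def foreground_mask(gray_rows):
    """Mark pixels at or below the Otsu threshold as foreground (assumed tissue)."""
    t = otsu_threshold([v for row in gray_rows for v in row])
    return [[v <= t for v in row] for row in gray_rows]

# Tiny example: dark pixels (10) are tissue, bright pixels (200) are background
print(foreground_mask([[10, 200], [200, 10]]))  # [[True, False], [False, True]]
```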

With the foreground mask, inference can be performed on the WSI. The output of the inference pipeline is a probability map whose size is 1/stride of the original WSI size.
In datalist.json, inference on full WSIs is specified under "testing" <br>
<left><img src="screenShots/testing.png" width="300"/></left> 

Inference is performed by /commands/infer.sh, and the results are saved to /eval.

In [None]:
! $MMAR_ROOT/commands/infer.sh 
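Since the probability map is 1/stride of the original WSI size, its shape and the WSI position behind each map pixel can be estimated as in the sketch below. The floor division and the top-left convention are assumptions made for illustration, as are the function names and the example numbers.

```python
def probability_map_shape(wsi_width, wsi_height, stride):
    """Approximate (width, height) of the map produced with the given stride."""
    return wsi_width // stride, wsi_height // stride

def map_to_wsi(col, row, stride):
    """Top-left WSI coordinate (x, y) covered by map pixel (col, row)."""
    return col * stride, row * stride

# Hypothetical slide of 1000 x 500 pixels scanned with a stride of 100
print(probability_map_shape(1000, 500, 100))  # (10, 5)
print(map_to_wsi(3, 2, 100))                  # (300, 200)
```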

## 4. Clara MMAR Performance on Full Camelyon16 Data 
For reference, we list the performance of the Clara Pathology Detection MMAR on the full Camelyon16 data. The experiments were performed using a single 32 GB V100 GPU on an [NVIDIA DGX-2 System](https://www.nvidia.com/en-us/data-center/dgx-2/) 

We benchmarked against [NCRF](https://github.com/baidu-research/NCRF) (ResNet18 baseline, without CRF), which is implemented with PyTorch and OpenSlide, using the same patch locations and the same 20-epoch experiments.  

Summary of all experiments (time in hours). Note that NCRF needs to generate PNG files for patches, while Clara loads the patches on the fly:
 
 
 Experiment | Uses pre-trained <br/>model | AMP | Library | 2D Patch <br/>Generation | Training <br/>(20 epochs) | Total <br/>Training Time | Speedup | Best <br/>Model | FROC on 48 <br/>Test Cases | Training Speedup <br/>at Best Model
 :--- | :---: | :---: | :---: | :---: | :---:| :----:| :----:| :----:| :----:| :---:
 NCRF | No | No | - | 10 | 70.5 | 80.5 | 1x | 35.5 | 0.69 | 1x
 Clara | Yes | Yes | OpenSlide | N/A | 48.5 | 48.5 | 1.66x | 5.5 | 0.71 | 8x
 Clara | Yes | Yes | **cuCIM** | N/A | **39.5** | **39.5** | **2x** | **4.5** | **0.72** | **10x**

<br>

**For inference**
 
Experiment | Library | Inference (hours) | Speedup  
 --- | --- |  ---:| ---:
 NCRF | - | 26.5 | 1x
 Clara | OpenSlide | 2 | 13x
 Clara | cuCIM | 2 | 13x


Inference speedup is due to 2 factors:
1. Optimized inference pipeline by Clara 
2. Larger batch size used (80 vs. 20)
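The speedup figures in the two tables above can be reproduced from the reported hours. The sketch below does the arithmetic; note that the "at best model" speedups only match the tables if NCRF's 10 hours of patch generation are counted, which is an inference from the numbers rather than something the tables state.

```python
# Hours taken from the tables above
ncrf = {"patch_gen": 10.0, "train": 70.5, "best": 35.5, "infer": 26.5}
clara = {
    "OpenSlide": {"train": 48.5, "best": 5.5, "infer": 2.0},
    "cuCIM": {"train": 39.5, "best": 4.5, "infer": 2.0},
}

def speedups(ncrf, run):
    """Speedup of a Clara run over NCRF for total training, best model, inference."""
    ncrf_total = ncrf["patch_gen"] + ncrf["train"]
    ncrf_best = ncrf["patch_gen"] + ncrf["best"]  # assumption: includes patch generation
    return {
        "total": ncrf_total / run["train"],
        "best": ncrf_best / run["best"],
        "infer": ncrf["infer"] / run["infer"],
    }

for name, run in clara.items():
    print(name, {k: round(v, 2) for k, v in speedups(ncrf, run).items()})
# OpenSlide {'total': 1.66, 'best': 8.27, 'infer': 13.25}
# cuCIM {'total': 2.04, 'best': 10.11, 'infer': 13.25}
```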


Training curve of NCRF:<br>
<left><img src="screenShots/ncrf.png" width="600"/></left> 

Training curve of Clara regular training:<br>
<left><img src="screenShots/reg_train_loss.png" width="600"/></left> <br>
<left><img src="screenShots/reg_val_acc.png" width="600"/></left> 

Training curve of Clara training with smart cache:<br>
<left><img src="screenShots/sc_train_loss.png" width="600"/></left> <br>
<left><img src="screenShots/sc_val_acc.png" width="600"/></left> 

# Exercise 
### 1. Train using OpenSlide 
You can compare performance against OpenSlide.
To do so, you need to install the OpenSlide packages by running the cell below 


In [None]:
# Install openslide packages  
!apt-get -y install openslide-tools
!apt-get -y install python-openslide
!python3 -m pip install --upgrade pip
!pip install openslide-python

 
You can change the data loader to use OpenSlide, 
for this you would need to change the config_train.json or the 

you simply need to change 
```
        "image_reader_name": "cuclaraimage"
``` 
to 
```
        "image_reader_name": "openslide"
``` 
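If you would rather script this change than edit the file by hand, the following sketch walks a loaded JSON config and rewrites every `image_reader_name` entry. It makes no assumption about where the key sits in the config; the helper name `set_image_reader` and the config path in the usage comment are hypothetical.

```python
def set_image_reader(node, reader_name):
    """Recursively set every "image_reader_name" value; return the number changed."""
    changed = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "image_reader_name":
                node[key] = reader_name
                changed += 1
            else:
                changed += set_image_reader(value, reader_name)
    elif isinstance(node, list):
        for item in node:
            changed += set_image_reader(item, reader_name)
    return changed

# Example usage (the path is an assumption about the MMAR layout):
# import json
# config_path = MMAR_ROOT + "config/config_train.json"
# with open(config_path) as f:
#     config = json.load(f)
# print(set_image_reader(config, "openslide"), "entries updated")
# with open(config_path, "w") as f:
#     json.dump(config, f, indent=2)
```

Keeping a copy of the original config makes it easy to switch back to cuCIM for the comparison.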
