# Notebook for the Digital Pathology Use Case



# Description

A pre-trained model for automated detection of metastases in whole-slide histopathology images. 
The prediction map is generated in a sliding-window manner by classifying local 224x224x3 RGB patches as either tumor or normal.
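
To make the sliding-window idea concrete, here is a minimal sketch that enumerates the top-left corners of the 224x224 patches covering an image region. The function name, the region size, and the stride value are assumptions chosen for illustration; the actual pipeline may tile slides differently.

```python
from typing import Iterator, Tuple

PATCH_SIZE = 224  # patch edge length in pixels, as stated above


def sliding_window_origins(width: int, height: int, stride: int,
                           patch_size: int = PATCH_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield the (x, y) top-left corner of every patch that fits inside the region."""
    for y in range(0, height - patch_size + 1, stride):
        for x in range(0, width - patch_size + 1, stride):
            yield x, y


# Hypothetical 1000x600 pixel region, scanned with a stride equal to the patch size.
origins = list(sliding_window_origins(1000, 600, stride=224))
print(len(origins), "patches, first three origins:", origins[:3])
```

Each origin would correspond to one patch that the classifier labels as tumor or normal; a smaller stride gives overlapping patches and a denser prediction map.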

# Disclaimer
This notebook uses a single file for training.
It is **ONLY** intended to show how to get started easily.
For the best model, please download the final model from NGC.

## Prerequisites
- Before running any code, please install `openslide-python` and the OpenSlide libraries.

### (Temporary, until GA)
We noticed some torch version issues that will be fixed in the GA release.
Until then, the cell below works around the issue.

In [None]:
!pip install --no-deps torchvision==0.8.0

### (Optional) Install OpenSlide
If you would like to compare performance against OpenSlide,
you need to install the OpenSlide packages.
Later on, you can change the data loader to use OpenSlide.

In [None]:
!apt-get -y install openslide-tools
!apt-get -y install python-openslide
!python3 -m pip install --upgrade pip
!pip install openslide-python


# 1. Download Data
All the data used to train, validate, and test this model comes from the
[Camelyon-16 Challenge](https://camelyon16.grand-challenge.org/).
First, let's set up the directories for the data.

In [None]:
import os
DataDirRoot="/claraDevDay/Data/DP_CAMELYON16/"

DataDirJson=DataDirRoot+"jsons/train/"
DataDirCoord=DataDirRoot+"coords/"
os.makedirs(DataDirCoord, exist_ok=True)
os.makedirs(DataDirJson, exist_ok=True)
os.makedirs(DataDirRoot+"tif/", exist_ok=True)
os.makedirs(DataDirRoot+"LocLabel/", exist_ok=True)

### 1.1 Download the TIFF image manually
You can download all the images of the CAMELYON16 data set from the sources listed
[here](https://camelyon17.grand-challenge.org/Data/).
For simplicity, you only need the smallest file, tumor_091.tif (about 500 MB). <br>
**Please note: this download takes 15+ minutes.**

In [None]:
%cd $DataDirRoot/tif

In [None]:
import ftplib
import os

def download_camelyon16_image(filename):
    filename = filename.lower()
    if os.path.exists(filename):
        print(f"The image [{filename}] already exists locally.")
    else:
        print(f"Downloading '{filename}'...")
        prefix = filename.split("_")[0].lower()
        if prefix == "test":
            folder_name = "testing/images"
        elif prefix in ["normal", "tumor"]:
            folder_name = f"training/{prefix}"
        else:
            raise ValueError(
                f"'{filename}' not found on the server."
                " File name should be like 'test_001.tif', 'tumor_001.tif', or 'normal_001.tif'"
            )
        path = f"gigadb/pub/10.5524/100001_101000/100439/CAMELYON16/{folder_name}/"
        ftp = ftplib.FTP("parrot.genomics.cn")
        ftp.login("anonymous", "")
        ftp.cwd(path)
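        # Use a context manager so the local file is flushed and closed after the transfer.
        with open(filename, "wb") as local_file:
            ftp.retrbinary("RETR " + filename, local_file.write)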
        ftp.quit()

download_camelyon16_image("tumor_091.tif")


Check that the file was downloaded into the `tif` folder.

In [None]:
!ls $DataDirRoot/tif    

### 1.2. Download JSON files
Annotation information is adopted from
[NCRF/jsons](https://github.com/baidu-research/NCRF/tree/master/jsons).
The cell below downloads the needed file.

In [None]:
DataDirJson=DataDirRoot+"jsons/train/"
blobURL="https://raw.githubusercontent.com/baidu-research/NCRF/master/jsons/train/"

FileName="Tumor_091.json"
wget_URL=blobURL+FileName
!wget $wget_URL -P $DataDirJson

### 1.3. Download coords
Location information for training/validation patches is adopted from
[NCRF/coords](https://github.com/baidu-research/NCRF/tree/master/coords).
The cell below downloads the needed file.

In [None]:
coordsURL="https://raw.githubusercontent.com/baidu-research/NCRF/master/coords/"
FileName="tumor_train.txt"
wget_URL=coordsURL+FileName
!wget $wget_URL -P $DataDirRoot

Let's keep only the locations for the tumor slide we downloaded.

In [None]:
cmd="grep Tumor_091 "+DataDirRoot+FileName+" > "+DataDirCoord+FileName
! $cmd

# Let's Get Started


In [None]:
MMAR_ROOT="/claraDevDay/DomainExamples/DP_detection/"
print ("setting MMAR_ROOT=",MMAR_ROOT)
%ls $MMAR_ROOT
!chmod 777 $MMAR_ROOT/commands/*

# 2. Data Preparation

#### Input and output formats

Input for the training pipeline includes: 
1. folder containing all WSIs
2. txt files listing the location and label information for training patches.

The output of the network itself is a probability for each 224x224x3 patch (tumor vs. normal).

- For training / validation: `prepare_train_data.sh` generates the LocLabel files needed for training and validation from the /coords and /jsons files listed above. It appends the labels after each filename and coordinate pair. Together with the training images, these files are passed to the training/validation pipeline.
- For inference: `prepare_inference_data.sh` generates the foreground masks that are used to reduce the computational burden during inference. The input is the test images, and the output is the foreground masks.
- For FROC: refer to the "Annotation" section of the [Camelyon challenge](https://camelyon17.grand-challenge.org/Data/) to prepare the ground truth images, which are needed for FROC computation.


In [None]:
! $MMAR_ROOT/commands/prepare_train_data.sh

# 3. Training 
### Model Overview
The model is based on ResNet18, with the option of replacing the last fully connected layer with a 1x1 convolution layer.
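
To see why such a replacement is possible, the following plain-Python sketch (not the MMAR's actual implementation) shows that a 1x1 convolution applies the same weights as a fully connected layer at every spatial position. All names and the toy weights below are hypothetical. On a 1x1 feature map the two layers give the same result, while the convolutional form also accepts larger feature maps and returns one score per position.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[Vector]]  # indexed as [row][col][channel]


def fully_connected(features: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Dense layer: one output value per row of `weights`."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def conv1x1(feature_map: FeatureMap, weights: List[Vector], bias: Vector) -> FeatureMap:
    """1x1 convolution: the same dense weights applied at every spatial position."""
    return [[fully_connected(pixel, weights, bias) for pixel in row]
            for row in feature_map]


# Toy numbers (hypothetical): 3 input channels, 1 output logit.
weights = [[0.5, -1.0, 2.0]]
bias = [0.25]

single = [[[1.0, 2.0, 3.0]]]                     # a 1x1 feature map
grid = [[[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]],
        [[2.0, 0.0, 1.0], [1.0, 1.0, 1.0]]]      # a 2x2 feature map

fc_out = fully_connected(single[0][0], weights, bias)
conv_single = conv1x1(single, weights, bias)
conv_grid = conv1x1(grid, weights, bias)

print("fully connected:", fc_out)
print("1x1 conv on 1x1 map:", conv_single)
print("1x1 conv on 2x2 map:", conv_grid)
```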


### 3.1 Normal Training
Let's start training with the basic configuration.

In [None]:
! $MMAR_ROOT/commands/train.sh 

### 3.2 Training using smart cache
Now let's take advantage of the smart cache data pipeline.

In [None]:
! $MMAR_ROOT/commands/train_smartcache.sh 
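
For intuition only, here is a toy sketch of one way a smart cache can work, assuming a partial-replacement scheme: a fixed number of items stays in memory, and a few of them are swapped for waiting items after each epoch. The class name, parameters, and patch names are hypothetical; this is not the code that `train_smartcache.sh` runs.

```python
from collections import deque
from typing import Deque, List, Sequence


class ToySmartCache:
    """Hypothetical illustration: keeps `cache_size` items and swaps a few per epoch."""

    def __init__(self, items: Sequence[str], cache_size: int, replace_count: int) -> None:
        self._cache: Deque[str] = deque(items[:cache_size])
        self._pending: Deque[str] = deque(items[cache_size:])
        self._replace_count = replace_count

    def epoch_items(self) -> List[str]:
        """Items a trainer would iterate over in the current epoch."""
        return list(self._cache)

    def update(self) -> None:
        """Retire the oldest cached items and pull in the same number of waiting ones."""
        for _ in range(min(self._replace_count, len(self._pending))):
            self._pending.append(self._cache.popleft())
            self._cache.append(self._pending.popleft())


# Six hypothetical patches, four cached at a time, two replaced after each epoch.
cache = ToySmartCache(["p0", "p1", "p2", "p3", "p4", "p5"], cache_size=4, replace_count=2)
epoch_1 = cache.epoch_items()
cache.update()
epoch_2 = cache.epoch_items()
print(epoch_1)
print(epoch_2)
```

The point of the design is that every epoch reads from memory, while the cached content still rotates through the whole data set over time.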

## Scores and Results
The example shown here uses a single tumor image for simplicity.
Therefore, the resulting model is a dummy model that is not useful.
You can either download all the data and retrain, or use our model from NGC.

Our trained model on NGC achieves ~0.92 accuracy on validation patches
and an FROC of ~0.72 on the 48 Camelyon test images that have ground truth annotations available.

# 4. Running Validation 

In [None]:
! $MMAR_ROOT/commands/validate.sh 


## Inference on a WSI

Inference is performed on a WSI in a sliding-window manner with a specified stride.
A foreground mask is needed to specify the region where inference will be performed,
because background regions that contain no tissue at all can occupy a significant portion of a WSI.
The output of the inference pipeline is a probability map whose size is 1/stride of the original WSI size.
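
As a rough illustration of the sizes involved, the sketch below computes the probability map shape for a given stride and lists the map cells that a foreground mask selects for inference. The WSI dimensions, the stride, the floor-division rounding, and the assumption that the mask has the same resolution as the probability map are all hypothetical; the actual pipeline may handle borders and mask resolution differently.

```python
from typing import List, Tuple


def probability_map_shape(wsi_width: int, wsi_height: int, stride: int) -> Tuple[int, int]:
    """Return (width, height) of a map with one cell per stride-sized step."""
    return wsi_width // stride, wsi_height // stride


def positions_to_infer(mask: List[List[int]]) -> List[Tuple[int, int]]:
    """List the (col, row) map cells whose foreground mask value is non-zero."""
    return [(col, row)
            for row, mask_row in enumerate(mask)
            for col, value in enumerate(mask_row)
            if value]


# Hypothetical slide of 100000x50000 pixels, scanned with a stride of 224.
map_shape = probability_map_shape(100000, 50000, stride=224)

# Tiny hypothetical foreground mask: 1 marks tissue, 0 marks background.
toy_mask = [[0, 1, 1],
            [0, 0, 1]]
tissue_cells = positions_to_infer(toy_mask)

print("probability map shape:", map_shape)
print("cells to run inference on:", tissue_cells)
```

In this toy mask only three of the six cells contain tissue, so half of the patch classifications would be skipped.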


### Running Inference  

In [None]:
! $MMAR_ROOT/commands/infer.sh 

