# Clara Train SDK Performance Comparisons  

This is part one of the performance comparisons, in which we use sample test data to compare different configurations. 
We will use Jupyter notebook plugins to monitor GPU utilization. 
Part two is the [Spleen Performance Notebook](PerformanceSpleen.ipynb), which measures performance on a representative dataset (spleen segmentation from the Decathlon challenge).



By the end of this notebook you will understand different model training acceleration techniques available in Clara Train SDK:
1. Batching by transform
2. Smart Caching
3. Smart caching + batching by transform


## Prerequisites
- Familiarity with Clara Train main concepts. See [Getting Started Notebook](../GettingStarted/GettingStarted.ipynb)
- 4 NVIDIA GPUs are recommended to see the comparison side by side. 
If you only have a single GPU, you can either run the experiments sequentially or reduce the CNN model size to fit all experiments onto a single GPU.


### Resources
You may watch the GTC Digital 2020 talks on Clara Train SDK:
- [S22563](https://developer.nvidia.com/gtc/2020/video/S22563)
Clara Train getting started: covers basics, BYOC, AIAA, AutoML 
- [S22717](https://developer.nvidia.com/gtc/2020/video/S22717)
Clara Train performance: different aspects of acceleration in Train V3


## GPU Dashboard

This notebook comes with an extension called NVDashboard, which displays GPU utilization metrics inside Jupyter notebooks. 
For more information, see https://github.com/rapidsai/jupyterlab-nvdashboard. This extension is already installed. 
From the left sidebar, click the third tab (System dashboards), then click GPU Utilization and GPU Memory. 
You can then drag each tab to the right side of the screen to display it alongside the training performance notebook.
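
If the dashboard is not available, the same numbers can be read from `nvidia-smi` directly. The sketch below is one possible way to do that, not part of the notebook: it assumes `nvidia-smi` is on the path and that every queried field returns a plain integer (some GPUs report `[N/A]`, which this parser does not handle). The helper names `parse_gpu_csv` and `query_gpus` are made up for illustration.

```python
import subprocess

QUERY_FIELDS = "index,utilization.gpu,memory.used,memory.total"

def parse_gpu_csv(text):
    """Parse CSV rows printed by nvidia-smi (noheader, nounits) into dicts."""
    rows = []
    for line in text.strip().splitlines():
        index, util, mem_used, mem_total = (int(v.strip()) for v in line.split(","))
        rows.append({"index": index, "util_pct": util,
                     "mem_used_mib": mem_used, "mem_total_mib": mem_total})
    return rows

def query_gpus():
    """Run nvidia-smi once and return one dict per GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)

# Parsing is demonstrated on made-up sample text, so no GPU is needed here.
sample = "0, 35, 1024, 32510\n1, 0, 3, 32510\n"
gpu_rows = parse_gpu_csv(sample)
print(gpu_rows)
```

Calling `query_gpus()` in a loop while a training cell runs gives a rough text-only view of the utilization that the dashboard plots.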

## Dataset 
This notebook uses a sample dataset (i.e., a single image volume from the spleen dataset) provided in the package to train a small neural network for a few epochs. 
This single file is duplicated 32 times for the training set and 9 times for the validation set to mimic the full spleen dataset. 
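
As a rough illustration of this duplication, the sketch below builds a data list that repeats one image/label pair 32 times for training and 9 times for validation. The `training` key matches the `data_list_key` used in the configurations in this notebook; the `validation` key, the `image`/`label` field names, and the file paths are assumptions made for the example, and the dataset file shipped with the package may differ.

```python
import json

def make_duplicated_datalist(image, label, n_train=32, n_val=9):
    """Repeat one image/label pair to mimic a larger dataset."""
    entry = {"image": image, "label": label}
    return {
        "training": [dict(entry) for _ in range(n_train)],
        "validation": [dict(entry) for _ in range(n_val)],
    }

# Hypothetical relative paths; only the duplication pattern matters here.
datalist = make_duplicated_datalist("imagesTr/spleen_8.nii.gz",
                                    "labelsTr/spleen_8.nii.gz")
print(len(datalist["training"]), len(datalist["validation"]))
print(json.dumps(datalist["training"][:2], indent=2))
```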


# Let's get started
It is helpful to check the NVIDIA GPU resources available in the Docker container by running the cell below.

In [None]:
# The following command should list all available GPUs
!nvidia-smi

In [None]:
# Run some imports and functions used in the notebook

import time
import matplotlib.pyplot as plt; plt.rcdefaults()
import numpy as np
def printFile(filePath,lnSt,lnOffset):
    print ("showing ",str(lnOffset)," lines from file ",filePath, "starting at line",str(lnSt))
    lnOffset=lnSt+lnOffset
    !< $filePath head -n "$lnOffset" | tail -n +"$lnSt"

In [None]:
# setting up MMAR root path 
MMAR_ROOT="/claraDevDay/Performance/"
print ("MMAR_ROOT is set to ",MMAR_ROOT)
!ls $MMAR_ROOT
!chmod 777 $MMAR_ROOT/commands/*

### Problem Setup
To compare the acceleration techniques, we use a single image file and duplicate it to form our dataset. 
Let us check how this is done in the dataset JSON file:

In [None]:
dataSetFile=MMAR_ROOT+"../../sampleData/dataset_28GB.json"
printFile(dataSetFile,2,13)
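
The `printFile` helper depends on IPython shell magic (`!`), so it only works inside a notebook. A plain Python alternative might look like the sketch below; `print_file_lines` is a hypothetical name, and unlike the shell pipeline it prints exactly `num_lines` lines.

```python
from itertools import islice

def print_file_lines(file_path, start_line, num_lines):
    """Print num_lines lines of file_path, starting at 1-indexed start_line."""
    with open(file_path) as f:
        # islice skips the first start_line - 1 lines without loading the whole file
        selected = list(islice(f, start_line - 1, start_line - 1 + num_lines))
    print("".join(selected), end="")
    return selected
```

For example, `print_file_lines(dataSetFile, 2, 13)` would show 13 lines of the dataset file starting at line 2.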


We will be comparing the following configurations:
1. Baseline described in [trn1_base.json](./configDemo/trn1_base.json)
2. Batching by transform (BT) described in [trn1_BT.json](./configDemo/trn1_BT.json)
3. Smart caching described in [trn1_cache.json](./configDemo/trn1_cache.json)
4. Smart cache + batching by transform described in [trn1_BT_cache.json](./configDemo/trn1_BT_cache.json)
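
To see at a glance how these configurations differ, one option is to load each JSON file and pull out the pipeline settings. The sketch below searches for the `image_pipeline` key recursively, so it does not depend on where that section sits in the file. The `train` wrapper in the inline example is an assumption, and `find_key` and `summarize_pipeline` are illustrative helpers, not part of Clara Train.

```python
import json

def find_key(node, key):
    """Depth-first search for the first value stored under `key` in nested JSON."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None

def summarize_pipeline(config):
    """Collect the pipeline name and the two acceleration-related settings."""
    pipeline = find_key(config, "image_pipeline") or {}
    args = pipeline.get("args", {})
    return {
        "name": pipeline.get("name"),
        "batched_by_transforms": args.get("batched_by_transforms"),
        "num_cache_objects": args.get("num_cache_objects"),
    }

# Inline stand-in for a loaded file; with a real file, pass json.load(f) instead.
example = {"train": {"image_pipeline": {
    "name": "SegmentationImagePipelineWithCache",
    "args": {"batched_by_transforms": True, "num_cache_objects": 32}}}}
print(summarize_pipeline(example))
```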


## Experiment details

#### Common parameters for all experiments:
- The training dataset is the same for all experiments, with training dataset size = 32
- Batch size = 4
- Crop size = 128x128x128
 
 **Note**: Since pipeline acceleration changes the number of iterations per epoch, the goal is to keep 
 the product of epochs and iterations per epoch constant across all experiments.   
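
The arithmetic behind this note can be sketched as follows: given a fixed budget of 160 total iterations, the number of epochs for each experiment is the budget divided by its iterations per epoch. The iterations-per-epoch values are the ones listed in the summary table in this section; `epochs_for_budget` is just an illustrative helper.

```python
def epochs_for_budget(total_iterations, iterations_per_epoch):
    """Number of epochs needed to reach a fixed total iteration count."""
    if total_iterations % iterations_per_epoch:
        raise ValueError("budget is not a whole number of epochs")
    return total_iterations // iterations_per_epoch

TOTAL_ITERATIONS = 160
iterations_per_epoch = {"baseline": 8, "batch by transform": 32,
                        "cache": 4, "BT + cache": 16}
epochs = {name: epochs_for_budget(TOTAL_ITERATIONS, n)
          for name, n in iterations_per_epoch.items()}
print(epochs)
```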
 
#### Summary for all experiments:
 
 Parameter | Baseline | Batch by transform | Cache | BT + Cache 
 --- | ---:| ---:| ----:| ---:
 Batch size | 4 | 4 | 4 | 2x2=4
 Cache size / replacement | - | - | 32/0 | 32/0
 Iterations/epoch | 32/4=8 | 32 | 32/4=4 | 32/2=16 
 Epochs | 20 | 5 | 40 | 10
 Epochs x iterations | 20x8=160 | 5x32=160 | 40x4=160 | 10x16=160
 Time for 10MB file size (speedup) | 1,778 (-) | 463 (3.8x) | 344 (5.2x) | 311 (5.7x) 
 Time for 30MB file size (speedup) | 3,972 (-) | 970 (4x) | 574 (7x) | 403 (10x) 
 
_Note:_ 
- Times and speedups will vary depending on the dataset file size. Here we have spleen_8.nii.gz, which is 10MB in size.
- The times and speedups above were obtained using a V100 GPU with 32 GB of memory, with spleen_45.nii.gz as the dataset file.
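
The speedup figures are simply the baseline time divided by each configuration's time. The short sketch below recomputes them from the times in the table; the table rounds the 30MB results to whole numbers, while this version keeps one decimal place.

```python
times_10mb = {"baseline": 1778, "batch by transform": 463, "cache": 344, "BT + cache": 311}
times_30mb = {"baseline": 3972, "batch by transform": 970, "cache": 574, "BT + cache": 403}

def speedups(times, reference="baseline"):
    """Speedup of each configuration relative to the reference run."""
    return {name: round(times[reference] / t, 1) for name, t in times.items()}

print(speedups(times_10mb))
print(speedups(times_30mb))
```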

# Examining each configuration
In the following cells, we examine how each configuration is set up and run them one at a time.  

## Baseline
You may view the contents of the full JSON configuration file `./configDemo/trn1_base.json` using `!cat $MMAR_ROOT/configDemo/trn1_base.json`. 
Below, we focus on the pipeline section and the transformations.

In [None]:
confFile=MMAR_ROOT+"/configDemo/trn1_base.json"
# To show the pipeline
printFile(confFile,109,10)

In [None]:
# showing the transformations
printFile(confFile,25,70)

Running the training cell for this configuration should result in low GPU utilization, as in this screenshot:
<br> ![a](screenShots/DemoBaseLine.png)

In [None]:
# This cell trains the baseline configuration.
! $MMAR_ROOT/commands/train_FakeDS.sh trn1_base.json 0

## Batch By Transform 

Typically, data is moved from disk to memory, a crop is selected and augmented, and then the data is thrown away. 
With batch by transform, we take multiple crops from the same loaded volume, as shown below

<br> ![a](screenShots/BT.png)
 
Here, the important things to change are:
- The pipeline:
    - Name `SegmentationImagePipeline` 
    - The pipeline parameter `batched_by_transforms: true`, as in:
```
"image_pipeline": {
    "name": "SegmentationImagePipeline",
    "args": {
      "data_list_file_path": "{DATASET_JSON}",
      "data_file_base_dir": "{DATA_ROOT}",
      "data_list_key": "training",
      "output_crop_size": [96, 96, 96],
      "output_batch_size": 0,
      "batched_by_transforms": true,
      "num_workers": 2,
      "prefetch_size": 0
    }
```    
- Selecting one of the batching transforms (`CropByPosNegRatio`, `FastCropByPosNegRatio`, `CropByPosNegRatioLabelOnly`) 
and setting its `batch_size`, as in:
```
  {
    "name": "FastCropByPosNegRatio",
    "args": {
      "size": [96,96,96],
      "fields": "image",
      "label_field": "label",
      "pos": 1,"neg": 1,
      "batch_size": 12,
      "batches_to_gen_at_once": 48
    }
  },
```
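
Conceptually, batching by transform means one loaded volume yields several crops instead of one. The NumPy sketch below shows only that idea: it picks crop positions uniformly at random rather than by a positive/negative ratio, and it is not how Clara Train implements `FastCropByPosNegRatio`. The volume and crop sizes are arbitrary values chosen for the example.

```python
import numpy as np

def random_crops(volume, crop_size, num_crops, rng):
    """Cut num_crops random sub-volumes out of one already-loaded 3D volume."""
    crops = []
    for _ in range(num_crops):
        # choose a start index per axis so the crop stays inside the volume
        start = [rng.integers(0, dim - c + 1) for dim, c in zip(volume.shape, crop_size)]
        region = tuple(slice(s, s + c) for s, c in zip(start, crop_size))
        crops.append(volume[region])
    return np.stack(crops)

rng = np.random.default_rng(0)
volume = rng.random((64, 64, 48), dtype=np.float32)  # stand-in for a loaded image
batch = random_crops(volume, crop_size=(32, 32, 32), num_crops=4, rng=rng)
print(batch.shape)
```

The expensive step (reading and decoding the volume) happens once, while the cheap step (cropping) happens `num_crops` times.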

In [None]:
confFile=MMAR_ROOT+"/configDemo/trn1_BT.json"
# To show the pipeline
printFile(confFile,106,12)
# showing the batch by transform 
printFile(confFile,74,10)


Running the cell below should result in better utilization.

In [None]:
! $MMAR_ROOT/commands/train_FakeDS.sh trn1_BT.json 0

## Smart Cache 
Typically, data is moved from disk to memory, processed once, and then thrown away. With Smart Cache, data is cached in memory for subsequent cycles, as shown below
<br> ![a](screenShots/Cache.png) <br>
To minimize the overhead of repeated loading and preprocessing, one idea is to cache the result of the transformation chain and use it for training instead. 
However, we have to be careful to cache only results that are deterministic. 
Non-deterministic transforms like RandomRotate still need to be applied on each pass. 
Note: An epoch is now one pass over all data in the cache, not over all available data. You should increase the number of epochs to compensate.
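
The caching idea can be sketched in a few lines of plain Python. This is a conceptual illustration only, not Clara Train's implementation: the deterministic and random steps are hypothetical stand-ins, and replacing part of the cache between epochs (`replace_percent`) is not modeled.

```python
import random

def build_cache(items, deterministic_fn, num_cache_objects):
    """Apply the deterministic transforms once and keep the results in memory."""
    return [deterministic_fn(item) for item in items[:num_cache_objects]]

def run_epoch(cache, random_fn, rng):
    """One epoch visits every cached object and applies only the random transforms."""
    return [random_fn(obj, rng) for obj in cache]

calls = {"deterministic": 0}

def load_and_normalize(x):
    # hypothetical deterministic step (stands in for loading and scaling)
    calls["deterministic"] += 1
    return x / 10.0

def random_flip(x, rng):
    # hypothetical random augmentation, re-applied on every pass
    return -x if rng.random() < 0.5 else x

rng = random.Random(0)
cache = build_cache(list(range(8)), load_and_normalize, num_cache_objects=4)
for _ in range(3):
    epoch_out = run_epoch(cache, random_flip, rng)
print(calls["deterministic"], len(epoch_out))
```

After three epochs the deterministic step has run only once per cached object, and each epoch covers the 4 cached objects rather than all 8 items, which is why the number of epochs has to grow to keep the total work comparable.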

Here, the important things to change are:
- The pipeline:
    - Name `SegmentationImagePipelineWithCache` 
    - Setting parameter `batched_by_transforms: false`
    - Setting parameters `num_cache_objects` and `replace_percent`, as in:
```
"image_pipeline": {
    "name": "SegmentationImagePipelineWithCache",
    "args": {
      "data_list_file_path": "{DATASET_JSON}",
      "data_file_base_dir": "{DATA_ROOT}",
      "data_list_key": "training",
      "output_crop_size": [96, 96, 96],
      "output_batch_size": 12,
      "batched_by_transforms": false,
      "num_workers": 4,
      "prefetch_size": 24,
      "num_cache_objects": 32,
      "replace_percent": 0      
    }
```      
- Selecting a **non-batching** transform for cropping: `FastPosNegRatioCropROI` or `randomcrop`

For more info on how smart cache works and these parameters, 
please see the [smart cache documentation](https://docs.nvidia.com/clara/tlt-mi/clara-train-sdk-v3.0/nvmidl/additional_features/smart_cache.html).

In [None]:
confFile=MMAR_ROOT+"/configDemo/trn1_cache.json"
# To show the pipeline
printFile(confFile,108,12)
# showing the cropping transform
printFile(confFile,74,10)

Running the cell below should result in better utilization

In [None]:
! $MMAR_ROOT/commands/train_FakeDS.sh trn1_cache.json 0

## Smart Cache + Batch by transform 

A limitation of batch by transform is that it takes all crops from the same volume. Here we configure it to take crops from different volumes, as shown below
<br> ![a](screenShots/Cache_BT.png) 

Here, the important things to change are:
- The pipeline:
    - Name `SegmentationImagePipelineWithCache` 
    - Setting parameter **`batched_by_transforms: true`**
    - Setting parameters `num_cache_objects` and `replace_percent`, as in:
```
"image_pipeline": {
    "name": "SegmentationImagePipelineWithCache",
    "args": {
      "data_list_file_path": "{DATASET_JSON}",
      "data_file_base_dir": "{DATA_ROOT}",
      "data_list_key": "training",
      "output_crop_size": [96, 96, 96],
      "output_batch_size": 0,
      "batched_by_transforms": true,
      "num_workers": 4,
      "prefetch_size": 0,
      "num_cache_objects": 32,
      "replace_percent": 0      
    }
```      
- Selecting a batching transform for cropping 
(`CropByPosNegRatio`, `FastCropByPosNegRatio`, `CropByPosNegRatioLabelOnly`), as in:
```
  {
    "name": "FastCropByPosNegRatio",
    "args": {
      "size": [96,96,96],
      "fields": "image",
      "label_field": "label",
      "pos": 1,"neg": 1,
      "batch_size": 12,
      "batches_to_gen_at_once": 48
    }
  },
```  
- Adding a `batch_transforms` section:
```
  "batch_transforms": [
    {
      "name": "MergeBatchDims",
      "args": {
        "fields": ["image", "label"]
      }
    }
  ],
``` 
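
The "2x2=4" batch size in the summary table suggests why a merge step is needed: crops taken from several volumes arrive with two leading axes (volumes, crops per volume) that must be folded into a single batch axis. The NumPy sketch below illustrates that reshaping under the assumption that this is what `MergeBatchDims` does; the channel-first layout and the array sizes are also assumptions chosen for the example.

```python
import numpy as np

def merge_batch_dims(batch):
    """Fold the first two axes (volumes, crops per volume) into one batch axis."""
    return batch.reshape((-1,) + batch.shape[2:])

# 2 volumes x 2 crops each, 1 channel, 8x8x8 voxels
stacked = np.zeros((2, 2, 1, 8, 8, 8), dtype=np.float32)
merged = merge_batch_dims(stacked)
print(merged.shape)
```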

In [None]:
confFile=MMAR_ROOT+"/configDemo/trn1_BT_cache.json"
# To show batch_transforms section
printFile(confFile,106,8)
# showing the batch by transform 
printFile(confFile,73,10)
# To show the pipeline
printFile(confFile,114,12)

Running the cell below should result in the best utilization

In [None]:
! $MMAR_ROOT/commands/train_FakeDS.sh trn1_BT_cache.json 0

## Running all configurations together
 
The image below shows how to monitor multiple concurrent jobs. 
This lets you run multiple training configurations simultaneously and compare their GPU utilization. 
We assume you have 4 GPUs and run each configuration on a different GPU.  


Running the cells below should result in utilization similar to 
<br> ![a](screenShots/DemoAll4Running.png)

The cell below defines helper functions to run each shell script

In [None]:
def f1():
    # performance is shown in black
    ! $MMAR_ROOT/commands/train_FakeDS.sh trn1_BT_cache.json 3
def f2():
    # performance is shown in green
    ! $MMAR_ROOT/commands/train_FakeDS.sh trn1_cache.json 2
def f3():
    # performance is shown in red
    ! $MMAR_ROOT/commands/train_FakeDS.sh trn1_BT.json 1
def f4():
    # performance is shown in blue
    ! $MMAR_ROOT/commands/train_FakeDS.sh trn1_base.json 0

The cell below will run all configurations.

In [None]:

from multiprocessing import Process
p1 = Process(target=f1)
p2 = Process(target=f2)
p3 = Process(target=f3)
p4 = Process(target=f4)

p1.start()
p2.start() 
p3.start()
p4.start()

If you wish to suppress the output, you can run the following cell instead.

In [None]:
import subprocess
trainscript=MMAR_ROOT+"/commands/train_FakeDS.sh"
d = subprocess.Popen([trainscript,"trn1_BT_cache.json","3"]) # Black 
c = subprocess.Popen([trainscript,"trn1_cache.json","2"]) # Green
b = subprocess.Popen([trainscript,"trn1_BT.json","1"]) #Red
a = subprocess.Popen([trainscript,"trn1_base.json","0"]) # Blue
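
`subprocess.Popen` returns immediately, so the cell finishes while training is still running. One way to block until every job is done and to check the exit codes is sketched below. `wait_all` is an illustrative helper, and the demo uses trivial stand-in commands so it runs anywhere; in the notebook the dictionary values would be the `Popen` objects `a`, `b`, `c`, and `d`.

```python
import subprocess
import sys

def wait_all(procs):
    """Block until every process exits; return {label: exit code}."""
    return {label: proc.wait() for label, proc in procs.items()}

# Stand-in commands: one succeeds, one exits with code 3.
demo = {
    "ok": subprocess.Popen([sys.executable, "-c", "pass"]),
    "fails": subprocess.Popen([sys.executable, "-c", "raise SystemExit(3)"]),
}
codes = wait_all(demo)
print(codes)
```

A non-zero exit code indicates that the corresponding run did not finish cleanly. Passing `stdout=subprocess.DEVNULL` to `Popen` is another option if the output should be discarded entirely.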

# Exercise
1. You can test the old BT + cache implementation from Clara Train V2 by running 
[trn1_BT_cache_TrnV2.json](configDemo/trn1_BT_cache_TrnV2.json), 
which has the following important changes:
    - Use pipeline `SegmentationImagePipelineWithCache`
    - Set `batch_by_transform = true`
    - Use one of the cropping transforms as FastCropByPosNegRatio
    - No batch merge section 
2. You may redo this comparison on a real dataset such as the spleen segmentation problem, 
or move on to even more optimizations in the [Spleen Performance Notebook](PerformanceSpleen.ipynb).
3. You may change the cache size and replacement percentage and observe the effects. You can follow the table below.

**Summary for all experiments:**
 
 Parameter | Baseline | Batch by transform | Cache | BT + Cache 
 --- | ---:| ---:| ----:| ---:
 Batch size | 4 | 4 | 4 | 2x2=4
 Cache size / replacement | - | - | 16/0.1 | 16/0.1
 Iterations/epoch | 32/4=8 | 32 | 16/4=4 | 16/2=8
 Epochs | 20 | 5 | 40 | 20
 Epochs x iterations | 20x8=160 | 5x32=160 | 40x4=160 | 20x8=160 


  