# Getting started with Clara Train SDK V4.0 PyTorch using MONAI 
Clara Train SDK consists of different modules as shown below 
<br>![side_bar](screenShots/TrainBlock.png)

Clara Train SDK lets researchers train AI models using configuration files. 
It is simple to use, modular, and flexible, allowing researchers to focus on innovation 
while leaving acceleration and performance issues to NVIDIA's engineers. 
   
By the end of this notebook you will:
1. Understand components of [Medical Model ARchive (MMAR)](https://docs.nvidia.com/clara/tlt-mi/clara-train-sdk-v4.0/nvmidl/mmar.html)
2. Know how to configure the training config JSON to train a CNN
3. Train a CNN with single and multiple GPUs
4. Fine tune a model
5. Export a model 
6. Perform inference on testing dataset 


## Prerequisites
- NVIDIA GPU with 8 GB of memory   

### Resources
You can watch the free GTC 2020 talk covering Clara Train SDK: 
- [S22563](https://developer.nvidia.com/gtc/2020/video/S22563)
Clara Train getting started: covers the basics, BYOC, AIAA, and AutoML 


# Background

Clara Train is built on a component-based architecture that uses components from [MONAI](https://monai.io/). 
MONAI’s [training workflows](https://docs.monai.io/en/latest/highlights.html#workflows) 
are based on [PyTorch Ignite’s engine](https://pytorch.org/ignite/engine.html). 
Below is a list of the components used:
- Training Data Pipeline
- Validation Data Pipeline
- [Applications](https://docs.monai.io/en/latest/apps.html)
- [Transforms](https://docs.monai.io/en/latest/transforms.html)
- [Data](https://docs.monai.io/en/latest/data.html)
- [Engines](https://docs.monai.io/en/latest/engines.html)
- [Inference methods](https://docs.monai.io/en/latest/inferers.html)
- [Event handlers](https://docs.monai.io/en/latest/handlers.html)
- [Network architectures](https://docs.monai.io/en/latest/networks.html)
- [Loss functions](https://docs.monai.io/en/latest/losses.html)
- [Optimizers](https://docs.monai.io/en/latest/optimizers.html)
- [Metrics](https://docs.monai.io/en/latest/metrics.html)
- [Visualizations](https://docs.monai.io/en/latest/visualize.html)
- [Utilities](https://docs.monai.io/en/latest/utils.html)   


# Let's get started
Before we get started, let's check that an NVIDIA GPU is available inside the Docker container by running the cell below.

In [None]:
# following command should show all gpus available 
!nvidia-smi

The next cell defines helper functions that we will use throughout the notebook.

In [None]:
MMAR_ROOT="/claraDevDay/GettingStarted/"
print ("setting MMAR_ROOT=",MMAR_ROOT)
%ls $MMAR_ROOT

!chmod 777 $MMAR_ROOT/commands/*
def printFile(filePath,lnSt,lnEnd):
    # Print lines lnSt through lnEnd (inclusive) of the given file
    print ("showing lines",str(lnSt),"to",str(lnEnd),"from file",filePath)
    !< $filePath head -n "$lnEnd" | tail -n +"$lnSt"
 

## Medical Model ARchive (MMAR)
Clara Train SDK uses the [Medical Model ARchive (MMAR)](https://docs.nvidia.com/clara/tlt-mi/clara-train-sdk-v4.0/nvmidl/mmar.html). 
The MMAR defines a standard structure for organizing all artifacts produced during the model development life cycle. 
The basic idea of Clara Train SDK is to train using a config file, as shown below.
<br>![side_bar](screenShots/MMAR.png)


You can download sample models for different problems from [NGC](https://ngc.nvidia.com/catalog/models?orderBy=modifiedDESC&pageNumber=0&query=clara&quickFilter=&filters=) <br> 
All MMARs follow the structure provided in this notebook. If you navigate to the parent folder, it should contain the following subdirectories:
```
./GettingStarted 
├── commands
├── config
├── docs
├── eval
├── models
└── resources
```

* `commands` contains a number of ready-to-run scripts for:
    - training
    - training with multiple GPUs
    - validation
    - inference (testing)
    - exporting models in TensorRT Inference Server format
* `config` contains configuration files (in JSON format) for training, 
validation, and deployment for [AI-assisted annotation](https://docs.nvidia.com/clara/tlt-mi/clara-train-sdk-v4.0/aiaa/index.html) 
(_Note:_ these configuration files are used by the scripts in the `commands` folder)
* `docs` contains local documentation for the model, but for a more complete view it is recommended you visit the NGC model page
* `eval` is used as the output directory for model evaluation (by default)
* `models` is where the PyTorch checkpoint and the corresponding graph definition files are stored
* `resources` currently contains the logger configuration in the `log.config` file

Some of the most important files you will need to understand to configure and use Clara Train SDK are

1. `environment.json`, which sets important common parameters:
    * `DATA_ROOT`: the root folder containing the data we would like to train, validate, or test with
    * `DATASET_JSON`: the path to the JSON-formatted dataset file 
    * `MMAR_CKPT_DIR`: the path to where the PyTorch checkpoint files reside
    * `MMAR_EVAL_OUTPUT_PATH`: the path where evaluation metrics for the neural network are written during training, validation, and inference
    * `PROCESSING_TASK`: the type of processing task the neural net is intended to perform (currently limited to `annotation`, `segmentation`, `classification`)
    * `PRETRAIN_WEIGHTS_FILE` (_optional_): the location of the pre-trained weights file; if the file does not exist and is needed, 
    the training program will download it from a predefined URL
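
As a quick sanity check before training, the keys listed above can be verified with a few lines of Python. This is a minimal sketch, not part of the SDK: the helper name `check_environment` and the sample values are hypothetical, and the list of required keys is only taken from the description above.

```python
import json
import os
import tempfile

# Keys described above; PRETRAIN_WEIGHTS_FILE is optional, so it is not required here.
REQUIRED_KEYS = [
    "DATA_ROOT",
    "DATASET_JSON",
    "MMAR_CKPT_DIR",
    "MMAR_EVAL_OUTPUT_PATH",
    "PROCESSING_TASK",
]

def check_environment(env_path):
    """Load an environment.json file and return the required keys it is missing."""
    with open(env_path) as f:
        env = json.load(f)
    return [key for key in REQUIRED_KEYS if key not in env]

# Demo with a hypothetical file; the values are placeholders, not the SDK defaults.
sample_env = {
    "DATA_ROOT": "/path/to/data",
    "DATASET_JSON": "/path/to/dataset.json",
    "PROCESSING_TASK": "segmentation",
}
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "environment.json")
    with open(sample_path, "w") as f:
        json.dump(sample_env, f)
    missing = check_environment(sample_path)

print("missing keys:", missing)
```

In the notebook, the same helper could be pointed at `MMAR_ROOT + "/config/environment.json"` instead of the temporary file.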


In [None]:
printFile(MMAR_ROOT+"/config/environment.json",0,30)


# Config.json Main Concepts 


`config_train.json` contains all the parameters necessary to define the neural net, 
how it is trained (training hyper-parameters, loss, etc.), 
and the pre- and post-transformation functions needed to modify and/or augment the data before it is input to the neural net. 
The complete documentation on the training configuration is laid out 
[here](https://docs.nvidia.com/clara/tlt-mi/clara-train-sdk-v4.0/nvmidl/appendix/configuration.html#training-configuration).
This configuration file defines all training-related settings, 
and it is where researchers will spend most of their time.

<br>![s](screenShots/MMARParts.png)<br> 

Let's take some time to examine each part of it.  


### 1. Global configurations 

In [8]:
confFile=MMAR_ROOT+"/config/config_train_Unet.json"
printFile(confFile,0,9)


    "model": {
      "name": "Unet",
      "args": {
        "num_classes": 6,
        "nf_enc":"32,64,64,64",
        "nf_dec":"64,64,64,64,64,32,32"
      }
    },
    "pre_transforms": [
      {
        "name": "LoadNifti",


### 2. Training configurations section 
This section includes:
1. Loss functions:

In [None]:
printFile(confFile,9,17)


2. Optimizer

In [None]:
printFile(confFile,17,23)


3. Learning rate scheduler

In [None]:
printFile(confFile,23,30)


4. Network architecture

In [None]:
printFile(confFile,30,42)


5. Pre-transforms
    1. Loading transformations:
    2. Resample Transformation
    3. Cropping transformations
    4. Deformable transformations
    5. Intensity Transforms
    6. Augmentation Transforms
    7. Special transforms 

In [None]:
printFile(confFile,42,118)


6. Dataset to use 

In [None]:
printFile(confFile,118,129)


7. DataLoader

In [None]:
printFile(confFile,129,137)


8. Inferer

In [None]:
printFile(confFile,137,140)


9. Handlers

In [None]:
printFile(confFile,140,182)


10. Post transforms
    1. Loading transformations:
    2. Resample Transformation
    3. Cropping transformations
    4. Deformable transformations
    5. Intensity Transforms
    6. Augmentation Transforms
    7. Special transforms 

In [None]:
printFile(confFile,182,200)

11. Metric

In [None]:
printFile(confFile,200,210)


### 3. Validation config 
This contains subsections very similar to the ones in the training section, including:
1. Metric 
2. Pre-transforms. Since these transforms are usually a subset of the pre-transforms in the training section, 
we can point to them by name using a reference, as in ` "ref": "LoadNifti"`. 
If we use two transforms with the same name, such as `ScaleByResolution`, 
we can give each one an alias, as in `"name": "ScaleByResolution#ScaleImg"`, 
and then refer to it in the validation section as `ScaleImg`. 
3. Image pipeline
4. Inference
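
The naming rule described in item 2 can be illustrated on plain dictionaries. This is only a sketch of the convention, not the SDK's implementation: `build_alias_table` and `resolve_refs` are hypothetical helpers, and the `args` values are made up for the example.

```python
def build_alias_table(transforms):
    """Map each transform's reference key to its definition.

    A name written as "Transform#Alias" is registered under "Alias";
    a plain name is registered under the name itself.
    """
    table = {}
    for transform in transforms:
        _, _, alias = transform["name"].partition("#")
        key = alias if alias else transform["name"]
        table[key] = transform
    return table

def resolve_refs(entries, table):
    """Replace {"ref": key} entries with the transform registered under key."""
    return [table[e["ref"]] if "ref" in e else e for e in entries]

# Hypothetical training transforms; the args are illustrative only.
train_transforms = [
    {"name": "LoadNifti", "args": {"fields": ["image", "label"]}},
    {"name": "ScaleByResolution#ScaleImg", "args": {"fields": "image"}},
    {"name": "ScaleByResolution#ScaleLabel", "args": {"fields": "label"}},
]
# The validation section reuses two of them by reference.
val_transforms = [{"ref": "LoadNifti"}, {"ref": "ScaleImg"}]

resolved = resolve_refs(val_transforms, build_alias_table(train_transforms))
print([t["name"] for t in resolved])
```

Without the `#ScaleImg` and `#ScaleLabel` aliases, both `ScaleByResolution` entries would be registered under the same key and a reference could not tell them apart.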

In [None]:
printFile(confFile,214,250)


# Training your first Network

### Start TensorBoard 
Before training starts or while the network is training, 
you can monitor its accuracy using TensorBoard inside JupyterLab, as shown below. 
 <br><img src="screenShots/TensorBoard.png" alt="Drawing" style="height: 300px;"/><br>


### Training script 
We have renamed `train.sh` to `train_W_Config.sh` and modified it to accept the config file to use as a parameter.

Let's take a look at `train_W_Config.sh` by executing the following cell.

In [None]:
printFile(MMAR_ROOT+"/commands/train_W_Config.sh",30,30)

### Start training
Now that we have our training configuration, to start training simply run `train_W_Config.sh` as below. 
Please keep in mind that we have set up dummy data with one file to train a dummy network quickly (we only train for 2 epochs). 
Please see the exercises on how to easily switch data and train a real segmentation network.


In [None]:
! $MMAR_ROOT/commands/train_W_Config.sh config_train_Unet.json

Now let's look at the `models` directory, which includes our models and the TensorBoard files. 

In [None]:
! ls -la $MMAR_ROOT/models/config_train_Unet


# Export Model

To export the model we simply run `export.sh`, which will: 
- Remove back propagation information from checkpoint files
- Generate two frozen graphs in the models folder

This optimized model will be used by the TRTIS server in Clara Deploy.


In [None]:
! $MMAR_ROOT/commands/export.sh



Let's check what was created in the folder. 
After running the cell below you should see `model.ts`.


In [None]:
!ls -la $MMAR_ROOT/models/config_train_Unet/*.ts


# Validation 
Now that we have trained our model, we would like to run evaluation to get some statistics and also run inference to see the resulting output.


#### Validate with single GPU 
To run evaluation on your validation dataset, run `validate.sh`. 
This will run evaluation on the validation dataset and place the results in the `MMAR_EVAL_OUTPUT_PATH` as configured in the [environment.json](config/environment.json) 
file (default is the `eval` folder). 
The evaluation reports the min, max, and mean of the metric specified in the config_validation file.
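
The summary statistics are simple aggregates over the per-file scores. Here is a minimal sketch with made-up Dice values; the case names and numbers are illustrative only and are not results from this notebook.

```python
from statistics import mean

# Hypothetical per-file Dice scores; real values come from the raw results file.
dice_per_file = {
    "case_a": 0.90,
    "case_b": 0.80,
    "case_c": 0.70,
}

scores = list(dice_per_file.values())
summary = {"min": min(scores), "max": max(scores), "mean": mean(scores)}
print(summary)
```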


In [None]:
! $MMAR_ROOT/commands/validate.sh


#### Validate with multiple GPUs 
You can also leverage multiple GPUs for validation using `validate_multi_gpu.sh`. 

In [None]:
!$MMAR_ROOT/commands/validate_multi_gpu.sh 

Now let's look at the results in the folder by running the cells below. 
You should see summary statistics and the Dice score per file in the validation dataset.

In [None]:
! ls -la $MMAR_ROOT/eval/

In [None]:
# statistic summary
!cat $MMAR_ROOT/eval/mean_dice_class1_summary_results.txt

In [None]:
!cat $MMAR_ROOT/eval/mean_dice_class1_raw_results.txt

# Inference  

To run inference on the validation dataset or test dataset, run `infer.sh`. 
This will run prediction on the validation dataset and place the results in the `MMAR_EVAL_OUTPUT_PATH` as configured in the 
[environment.json](config/environment.json) file (default is the `eval` folder).


In [None]:
! $MMAR_ROOT/commands/infer.sh

Now let's look at the results in the folder.

In [None]:
! ls -la $MMAR_ROOT/eval/

In [None]:
! ls -la $MMAR_ROOT/eval/spleen_8


## Multi-GPU Training
Clara Train aims to simplify scaling across all available GPUs. 
Using the same config we already used for training, we can simply invoke `train_multi_gpu.sh` to train on multiple GPUs. 

Let's examine the `train_multi_gpu.sh` script by running the cell below. 
You can see that the learning rate is changed because the batch size has doubled.
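
A common heuristic when the effective batch size grows with the number of GPUs is to scale the learning rate linearly. The sketch below illustrates that heuristic only. It is an assumption, not a description of what `train_multi_gpu.sh` does, and the base learning rate and batch size are hypothetical.

```python
def scaled_learning_rate(base_lr, num_gpus):
    """Linear scaling rule: the learning rate grows with the effective batch size."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return base_lr * num_gpus

base_lr = 1e-4          # hypothetical single-GPU learning rate
per_gpu_batch_size = 2  # hypothetical batch size on each GPU

for num_gpus in (1, 2, 4):
    effective_batch = per_gpu_batch_size * num_gpus
    print(num_gpus, "GPU(s): effective batch", effective_batch,
          "learning rate", scaled_learning_rate(base_lr, num_gpus))
```

Whether linear scaling is appropriate depends on the optimizer and the dataset, so treat the scaled value as a starting point to tune rather than a fixed rule.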

In [None]:
printFile(MMAR_ROOT+"/commands/train_multi_gpu.sh",0,50)

Let's give it a try and run the cell below to train on 2 GPUs.

In [None]:
! $MMAR_ROOT/commands/train_multi_gpu.sh


## Training Vs FineTune
`train.sh` and `finetune.sh` are identical except that they train using different configuration files. 

_Note_: The only difference between the two configs `config_train_Unet.json` and `config_finetune.json` 
is that `config_finetune.json` specifies a `ckpt` file in the section below, 
while `config_train_Unet.json` does not, since it trains from scratch.
```
      {
        "name": "CheckpointLoader",
        "args": {
          "load_path": "{MMAR_CKPT}",
          "load_dict": ["model"]
        }
      },
```
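
To check programmatically whether a config fine-tunes from a checkpoint, one could search its handlers for a `CheckpointLoader` entry. The sketch below is illustrative: the helper `find_checkpoint_loaders` is hypothetical, the configs are heavily trimmed stand-ins, and placing handlers under `train.handlers` is an assumption about the file layout.

```python
def find_checkpoint_loaders(config):
    """Return all handler entries named CheckpointLoader in the train section."""
    handlers = config.get("train", {}).get("handlers", [])
    return [h for h in handlers if h.get("name") == "CheckpointLoader"]

# Hypothetical, heavily trimmed configs; real files contain many more sections.
finetune_config = {
    "train": {
        "handlers": [
            {
                "name": "CheckpointLoader",
                "args": {"load_path": "{MMAR_CKPT}", "load_dict": ["model"]},
            },
            {"name": "StatsHandler"},
        ]
    }
}
scratch_config = {"train": {"handlers": [{"name": "StatsHandler"}]}}

print("finetune loads a checkpoint:", bool(find_checkpoint_loaders(finetune_config)))
print("scratch loads a checkpoint:", bool(find_checkpoint_loaders(scratch_config)))
```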



# Exercise:
Now that you are familiar with Clara Train, you can try to: 
1. Explore different options of Clara Train by changing or creating a config file and running training: 
    1. Model architecture: Ahnet, Unet, Segresnet 
    2. Losses
    3. Transformation 

Hint: to train Segresnet you can use the configuration `config_train_segresnet.json`, which only changes the network section. 
You can train by running the cell below.     

In [None]:
!$MMAR_ROOT/commands/train_W_Config.sh config_train_segresnet.json


2. Train on real spleen data. For this you should:
    1. Download the spleen dataset by running the [download](../Data/DownloadDecathlonDataSet.ipynb) notebook
    2. Switch the dataset file in the [environment.json](config/environment.json)
    3. Rerun `train.sh`
3. Experiment with multi-GPU training by changing the number of GPUs to train on from 2 to 3 or 4. 
You should edit [train_multi_gpu.sh](commands/train_multi_gpu.sh) and then rerun the script. 


4. Use the DLProf tool for debugging 

In [None]:
!$MMAR_ROOT/commands/debug_dlprof.sh config_train_Unet_NoAMP.json

You then need to run TensorBoard manually (not through JupyterLab) using: 
```
cd /claraDevDay/GettingStarted/models/config_train_Unet_debug
tensorboard --logdir ./dlprof --port 80
```
Remember that `startdocker.sh` maps container port 80 to host port 5000 by default for AIAA; we simply reuse that mapping here. 
If you now navigate to `<yourip:5000>` you should see the DLProf tool as shown below. 
This analysis shows the GPUs you have, along with improvements you can make to train faster. 
For example, this run shows multiple operations that would be accelerated by AMP.
To test this, you can run the cell below with AMP enabled in the configuration. 

<br>![dlprof](screenShots/Dlprof.png)<br>


In [None]:
!$MMAR_ROOT/commands/debug_dlprof.sh config_train_Unet.json
