# Getting started with Clara Train SDK V4.0 PyTorch using MONAI 
Clara Train SDK lets researchers train AI models using configuration files. 
It is simple to use, modular, and flexible, allowing researchers to focus on innovation 
while leaving acceleration and performance issues to NVIDIA's engineers. 

Clara Train SDK consists of the modules shown below:
<br><img src="screenShots/TrainBlock.png" alt="Drawing" style="height: 600px;"/><br>
   
By the end of this notebook you will:
1. Understand components of [Medical Model ARchive (MMAR)](https://docs.nvidia.com/clara/clara-train-sdk/pt/mmar.html)
2. Know how to set up the training configuration JSON file to train a CNN
3. Train a CNN with single and multiple GPUs
4. Fine tune a model
5. Export a model 
6. Perform inference on testing dataset 


## Prerequisites
- NVIDIA GPU with 8 GB of memory


## Resources
You can watch the free GTC 2021 talks covering Clara Train SDK:
- [Clara Train 4.0 - 101 Getting Started [SE2688]](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-se2688/)
- [Clara Train 4.0 - 201 Federated Learning [SE3208]](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-se3208/)
- [Take Medical AI from Concept to Production using Clara Imaging [S32482]](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s32482/)
- [Federated Learning for Medical AI [S32530]](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s32530/) 
- [Get Started Now on Medical Imaging AI with Clara Train on Google Cloud Platform [S32518]](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s32518/)
- [Automate 3D Medical Imaging Segmentation with AutoML and Neural Architecture Search [S32083]](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s32083/)
- [A Platform for Rapid Development and Clinical Translation of ML Models for Applications in Radiology at UCSF [S31619]](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31619/)


# 1. Background

Clara Train uses a component-based architecture built on components from [MONAI](https://monai.io/). 
MONAI’s [training workflows](https://docs.monai.io/en/latest/highlights.html#workflows) 
are based on [PyTorch Ignite’s engine](https://pytorch.org/ignite/engine.html). 
The components used are listed below:
- Training Data Pipeline
- Validation Data Pipeline
- [Applications](https://docs.monai.io/en/latest/apps.html)
- [Transforms](https://docs.monai.io/en/latest/transforms.html)
- [Data](https://docs.monai.io/en/latest/data.html)
- [Engines](https://docs.monai.io/en/latest/engines.html)
- [Inference methods](https://docs.monai.io/en/latest/inferers.html)
- [Event handlers](https://docs.monai.io/en/latest/handlers.html)
- [Network architectures](https://docs.monai.io/en/latest/networks.html)
- [Loss functions](https://docs.monai.io/en/latest/losses.html)
- [Optimizers](https://docs.monai.io/en/latest/optimizers.html)
- [Metrics](https://docs.monai.io/en/latest/metrics.html)
- [Visualizations](https://docs.monai.io/en/latest/visualize.html)
- [Utilities](https://docs.monai.io/en/latest/utils.html)   


# Let's get started
Before we start, let's check that an NVIDIA GPU is available in the Docker container by running the cell below.

In [None]:
# The following command should list all available GPUs
!nvidia-smi
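
If you prefer to do the same check from Python, the sketch below queries `nvidia-smi` through the standard library and returns an empty list when the tool is not found. The helper name `list_gpus` is hypothetical and not part of Clara Train SDK; the sketch only assumes that `nvidia-smi` supports its usual `--query-gpu` and `--format` options.

```python
import shutil
import subprocess


def list_gpus():
    """Return a list of (name, total_memory_MiB) tuples, or [] if no GPU is visible.

    Hypothetical helper for this notebook; it is not part of Clara Train SDK.
    """
    if shutil.which("nvidia-smi") is None:
        return []  # nvidia-smi is not installed or not on PATH
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []  # the driver could not be queried
    gpus = []
    for line in result.stdout.strip().splitlines():
        name, _, memory = line.rpartition(",")
        try:
            gpus.append((name.strip(), int(memory.strip())))
        except ValueError:
            gpus.append((name.strip(), None))  # memory was not reported as a number
    return gpus


for gpu_name, gpu_memory in list_gpus():
    print(gpu_name, "-", gpu_memory, "MiB")
```

Compare the reported memory with the 8 GB prerequisite listed at the top of this notebook.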

The next cell defines functions that we will use throughout the notebook.

In [None]:
MMAR_ROOT="/claraDevDay/GettingStarted/"
print("setting MMAR_ROOT=", MMAR_ROOT)
%ls $MMAR_ROOT

# Make the scripts in the commands folder executable
!chmod 777 $MMAR_ROOT/commands/*

def printFile(filePath, lnSt, lnEnd):
    """Print lines lnSt through lnEnd of filePath using head and tail."""
    print("showing lines", lnSt, "to", lnEnd, "of file", filePath)
    !< $filePath head -n "$lnEnd" | tail -n +"$lnSt"
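
The `printFile` helper above depends on IPython shell escapes, so it only works inside a notebook. As a rough alternative, the sketch below does the same job in plain Python. The name `print_file_lines` is hypothetical, and line numbers are assumed to be 1-based and inclusive.

```python
from itertools import islice


def print_file_lines(file_path, start, end):
    """Print lines start..end (1-based, inclusive) of a text file and return them."""
    start = max(start, 1)  # treat 0 or negative values as "from the first line"
    with open(file_path, "r", encoding="utf-8") as handle:
        lines = [line.rstrip("\n") for line in islice(handle, start - 1, end)]
    for number, line in enumerate(lines, start=start):
        print(f"{number:4d}: {line}")
    return lines
```

For example, `print_file_lines(MMAR_ROOT + "config/environment.json", 1, 30)` would show the first 30 lines of the environment file.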
 

# 2. Medical Model ARchive (MMAR)
Clara Train SDK uses the [Medical Model ARchive (MMAR)](https://docs.nvidia.com/clara/clara-train-sdk/pt/mmar.html). 
The MMAR defines a standard structure for organizing all artifacts produced during the model development life cycle. 
The basic idea of Clara Train SDK is to train using a configuration file.
 

**We recommend opening [config_train_Unet.json](config/config_train_Unet.json) and configuring your screen as shown below**
<br><img src="screenShots/MMAR.png" alt="Drawing" style="height: 400px;"/><br>


You can download sample models for different problems from [NGC](https://ngc.nvidia.com/catalog/models?orderBy=modifiedDESC&pageNumber=0&query=clara&quickFilter=&filters=) <br> 
All MMARs follow the structure used in this notebook. The parent folder should contain the following subdirectories:
```
./GettingStarted 
├── commands
├── config
├── docs
├── eval
├── models
└── resources
```

* `commands` contains a number of ready-to-run scripts for:
    - training
    - training with multiple GPUs
    - fine-tuning
    - fine-tuning with multiple GPUs
    - validation
    - validation with multiple GPUs
    - inference (testing)
    - exporting models in TensorRT Inference Server format
* `config` contains configuration files (in JSON format) for training, 
validation, and deployment for [AI-assisted annotation](https://docs.nvidia.com/clara/clara-train-sdk/aiaa/index.html) 
(_Note:_ these configuration files are used by the scripts in the `commands` folder)
* `docs` contains local documentation for the model; for a more complete view, visit the NGC model page
* `eval` is the default output directory for model evaluation
* `models` is where the PyTorch checkpoint model and the corresponding graph definition files are stored
* `resources` currently contains the logger configuration in the `log.config` file
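
As a quick sanity check, the sketch below reports which of the six subdirectories listed above are missing from a folder. The function name `missing_mmar_dirs` is hypothetical; it only checks that the folders exist and does not validate their contents.

```python
from pathlib import Path

# Subdirectory names taken from the MMAR layout described above
EXPECTED_SUBDIRS = ("commands", "config", "docs", "eval", "models", "resources")


def missing_mmar_dirs(mmar_root):
    """Return the expected MMAR subdirectories that do not exist under mmar_root."""
    root = Path(mmar_root)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


# An empty list means the folder has the expected layout
print(missing_mmar_dirs("/claraDevDay/GettingStarted/"))
```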

Some of the most important files you will need to understand to configure and use Clara Train SDK are:

1. `environment.json`, which sets common parameters, most of them paths:
    * `DATA_ROOT`: the root folder containing the data used for training, validation, or testing
    * `DATASET_JSON`: the path to the JSON-formatted dataset file 
    * `MMAR_CKPT_DIR`: the path where the PyTorch checkpoint files reside
    * `MMAR_EVAL_OUTPUT_PATH`: the path where evaluation metrics for the neural network are written during training, validation, and inference
    * `PROCESSING_TASK`: the type of processing task the neural network is intended to perform (currently limited to `annotation`, `segmentation`, `classification`)
    * `PRETRAIN_WEIGHTS_FILE` (_optional_): the location of the pre-trained weights file; if the file does not exist and is needed, 
    the training program will download it from a predefined URL
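
To make the key list concrete, here is a minimal sketch that checks a parsed `environment.json` for the keys described above. Treating every key except `PRETRAIN_WEIGHTS_FILE` as required is an assumption made for illustration, and the values in the example dictionary are hypothetical placeholders, not the contents of the file shipped with this notebook.

```python
REQUIRED_KEYS = (
    "DATA_ROOT",
    "DATASET_JSON",
    "MMAR_CKPT_DIR",
    "MMAR_EVAL_OUTPUT_PATH",
    "PROCESSING_TASK",
)
KNOWN_TASKS = ("annotation", "segmentation", "classification")


def check_environment(env):
    """Return a list of human-readable problems found in an environment dict."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in env]
    task = env.get("PROCESSING_TASK")
    if task is not None and task not in KNOWN_TASKS:
        problems.append(f"unexpected PROCESSING_TASK: {task}")
    return problems


# Hypothetical values, for illustration only
example_env = {
    "DATA_ROOT": "/path/to/data",
    "DATASET_JSON": "/path/to/dataset.json",
    "MMAR_CKPT_DIR": "models",
    "MMAR_EVAL_OUTPUT_PATH": "eval",
    "PROCESSING_TASK": "segmentation",
}
print(check_environment(example_env))
```

In the notebook you could pass `json.load(open(MMAR_ROOT + "config/environment.json"))` instead of the example dictionary.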


In [None]:
printFile(MMAR_ROOT+"/config/environment.json",0,30)


# 3. Config.json Main Concepts 


`config_train.json` contains all the parameters needed to define the neural network, 
how it is trained (training hyperparameters, loss, etc.), 
and the pre- and post-transformation functions that modify and/or augment the data before it is fed to the network. 
This file defines all training-related settings, and it is where researchers will spend most of their time.

For a detailed explanation, see the [training configuration](https://docs.nvidia.com/clara/clara-train-sdk/pt/appendix/configuration.html#training-configuration) documentation.
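
One way to get an overview of a training config is to list every component it names. The sketch below walks any parsed JSON structure and collects entries that carry a `name` field, which is the pattern used by the `CheckpointLoader` entry shown later in this notebook. The function `list_components` and the miniature config are hypothetical; the real `config_train_Unet.json` may be organized differently.

```python
def list_components(node, path=""):
    """Recursively collect (location, name) pairs for dicts that have a string "name"."""
    found = []
    if isinstance(node, dict):
        if isinstance(node.get("name"), str):
            found.append((path or "/", node["name"]))
        for key, value in node.items():
            found.extend(list_components(value, f"{path}/{key}"))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.extend(list_components(item, f"{path}[{index}]"))
    return found


# Hypothetical miniature config, for illustration only
mini_config = {
    "epochs": 2,
    "train": {
        "loss": {"name": "DiceLoss"},
        "handlers": [
            {"name": "CheckpointLoader", "args": {"load_dict": ["model"]}},
        ],
    },
}
for location, component in list_components(mini_config):
    print(location, "->", component)
```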


# 4. Training your first Network


## 4.1 Training script 
We have renamed `train.sh` to `train_W_Config.sh` and modified it to accept the config file to use as a parameter.

Let's take a look at `train_W_Config.sh` by executing the following cell.

In [None]:
printFile(MMAR_ROOT+"/commands/train_W_Config.sh",0,30)

## 4.2 Start training
Now that we have our training configuration, simply run `train_W_Config.sh` as shown below to start training. 
Keep in mind that we have set up dummy data with a single file so that a dummy network trains quickly (we only train for 2 epochs). 
See the exercises for how to switch the data and train a real segmentation network.


In [None]:
! $MMAR_ROOT/commands/train_W_Config.sh config_train_Unet.json

Now let's look at the `models` directory, which includes our models and the TensorBoard files.

In [None]:
! ls -la $MMAR_ROOT/models/config_train_Unet
!echo ---------------------------------------
!echo Display content of train_stats.json
! cat $MMAR_ROOT/models/config_train_Unet/train_stats.json


# 5. Export Model

To export the model, simply run `export.sh`, which will: 
- Create a TorchScript (`.ts`) file

This optimized model will be used by the Triton server in AIAA and Clara Deploy.


In [None]:
! $MMAR_ROOT/commands/export.sh



Let's check what was created in the folder. 
After running the cell below you should see `model.ts`.


In [None]:
!ls -la $MMAR_ROOT/models/config_train_Unet/*.ts


# 6. Validation 
Now that we have trained our model, we would like to run evaluation to get some statistics, and run inference to see the resulting output.


## 6.1 Validate with single GPU 
To evaluate the model on your validation dataset, run `validate.sh`. 
This runs evaluation on the validation dataset and places the results in `MMAR_EVAL_OUTPUT_PATH`, as configured in the [environment.json](config/environment.json) 
file (the default is the `eval` folder). 
The evaluation reports the min, max, and mean of the metric specified in the `config_validation` file.


In [None]:
! $MMAR_ROOT/commands/validate.sh


You can also run `validate_ckpt.sh`, which loads the model from the checkpoint instead of the `.ts` file.

In [None]:
! $MMAR_ROOT/commands/validate_ckpt.sh


## 6.2 Validate with multiple GPUs 
You can also use multiple GPUs for validation with `validate_multi_gpu.sh`.

In [None]:
!$MMAR_ROOT/commands/validate_multi_gpu.sh 

Similarly, you can run `validate_multi_gpu_ckpt.sh`, which loads the model from the checkpoint instead of the `.ts` file.

In [None]:
! $MMAR_ROOT/commands/validate_multi_gpu_ckpt.sh


## 6.3 Check Validation results 
Now let's look at the results in the folder by running the cell below.

In [None]:
!ls -la $MMAR_ROOT/eval
for fName in ["metrics.csv","val_mean_dice_raw.csv","val_mean_dice_summary.csv"]:
    print("---------------------------------------")
    print("Display content of ",fName)
    ! cat $MMAR_ROOT/eval/$fName
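
To recompute the min, max, and mean yourself from per-case scores, a small standard-library sketch is shown below. The column names and the comma delimiter in the sample text are hypothetical; check the header printed by the cell above and adjust `column` and `delimiter` to match the actual file.

```python
import csv
import io
import math


def summarize_column(csv_text, column, delimiter=","):
    """Return count, min, max, and mean of a numeric column, or None if it has no values."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    values = []
    for row in reader:
        try:
            value = float(row[column])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows where the column is missing or not numeric
        if math.isfinite(value):
            values.append(value)
    if not values:
        return None
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


# Hypothetical sample data, not real validation output
sample = "filename,mean\ncase_a,0.5\ncase_b,0.75\ncase_c,1.0\n"
print(summarize_column(sample, "mean"))
```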


# 7. Inference  

To run inference on the validation or test dataset, run `infer.sh`. 
This runs prediction on the validation dataset and places the output in `MMAR_EVAL_OUTPUT_PATH`, as configured in the 
[environment.json](config/environment.json) file (the default is the `eval` folder).


In [None]:
! $MMAR_ROOT/commands/infer.sh

Now let's look at the results in the folder.

In [None]:
! ls -la $MMAR_ROOT/eval/

In [None]:
! ls -la $MMAR_ROOT/eval/spleen_8


# 8. Multi-GPU Training
Clara Train aims to make it simple to scale training across all available GPUs. 
Using the same config we already used for training, we can simply invoke `train_multi_gpu.sh` to train on multiple GPUs. 
The main difference between `train.sh` and `train_multi_gpu.sh` is a handful of changed parameters:

train.sh | train_multi_gpu.sh  
 --- | --- 
python3 -u -m medl.apps.train \\<br>-m MMAR_ROOT \\<br>-c CONFIG_FILE \\<br>-e ENVIRONMENT_FILE \\<br>--write_train_stats \\<br>--set \\<br> print_conf=True | python -m torch.distributed.launch\\<br> --nproc_per_node=2 --nnodes=1 --node_rank=0 \\<br> --master_addr="localhost" --master_port=1234 \\<br>-m medl.apps.train \\<br>-m MMAR_ROOT \\<br>-c CONFIG_FILE \\<br>-e ENVIRONMENT_FILE \\<br> --write_train_stats \\<br> --set \\<br> print_conf=True \\<br> multi_gpu=True \\<br> learning_rate= 2e-4
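
To see how the right-hand column changes with the number of GPUs, the sketch below assembles the same argument list in Python. It mirrors the table above rather than the actual contents of `train_multi_gpu.sh`; the function name `build_multi_gpu_command` is hypothetical, and `MMAR_ROOT`, `CONFIG_FILE`, and `ENVIRONMENT_FILE` are the same placeholders used in the table.

```python
import shlex


def build_multi_gpu_command(mmar_root, config_file, environment_file,
                            num_gpus=2, learning_rate="2e-4", master_port=1234):
    """Assemble the launch arguments from the table above as a list of strings."""
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={num_gpus}", "--nnodes=1", "--node_rank=0",
        "--master_addr=localhost", f"--master_port={master_port}",
        "-m", "medl.apps.train",
        "-m", mmar_root,
        "-c", config_file,
        "-e", environment_file,
        "--write_train_stats",
        "--set",
        "print_conf=True",
        "multi_gpu=True",
        f"learning_rate={learning_rate}",
    ]


command = build_multi_gpu_command("MMAR_ROOT", "CONFIG_FILE", "ENVIRONMENT_FILE", num_gpus=4)
print(" ".join(shlex.quote(part) for part in command))
```

Only `--nproc_per_node` changes when you switch from 2 to 4 GPUs on a single node; the other launcher arguments stay the same.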
 
Let's examine the `train_multi_gpu.sh` script by running the cell below.

In [None]:
printFile(MMAR_ROOT+"/commands/train_multi_gpu.sh",0,50)

Let's give it a try: run the cell below to train on 2 GPUs.

In [None]:
! $MMAR_ROOT/commands/train_multi_gpu.sh


# 9. Training Vs FineTune
`train.sh` and `finetune.sh` are identical except for the configuration they train with: 
`finetune.sh` enables checkpoint loading through the `disabled` argument of the `CheckpointLoader` shown below. 

_Note_: The only difference between the two configs, `config_train_Unet.json` and `config_finetune.json`, 
is that `config_finetune.json` specifies a `ckpt` file in the section below, 
while `config_train_Unet.json` does not, since it trains from scratch.
```
      {
        "name": "CheckpointLoader",
        "args": {
          "disabled": "{dont_load_ckpt_model}",
          "load_path": "{MMAR_CKPT}",
          "load_dict": ["model"]
        }
      },
```
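
The `{dont_load_ckpt_model}` and `{MMAR_CKPT}` entries above are placeholders that are filled in at run time. The sketch below illustrates the idea with a simple substitution. It is not how Clara Train SDK resolves variables internally; the function `resolve_placeholders` and the checkpoint path are hypothetical.

```python
import json


def resolve_placeholders(node, variables):
    """Return a copy of node with whole-string "{name}" placeholders replaced."""
    if isinstance(node, dict):
        return {key: resolve_placeholders(value, variables) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_placeholders(item, variables) for item in node]
    if isinstance(node, str) and len(node) > 2 and node[0] == "{" and node[-1] == "}":
        name = node[1:-1]
        if name in variables:
            return variables[name]  # keeps the variable's type, e.g. a boolean
    return node


handler = {
    "name": "CheckpointLoader",
    "args": {
        "disabled": "{dont_load_ckpt_model}",
        "load_path": "{MMAR_CKPT}",
        "load_dict": ["model"],
    },
}
# Hypothetical values: fine-tuning would set dont_load_ckpt_model to False
resolved = resolve_placeholders(
    handler, {"dont_load_ckpt_model": False, "MMAR_CKPT": "models/model.pt"}
)
print(json.dumps(resolved, indent=2))
```

With `dont_load_ckpt_model` set to `False`, `disabled` becomes `false` and the checkpoint is loaded; with `True`, the loader is skipped and training starts from scratch.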

# Next:

### 1. Load model into AIAA
We will show here how you can quickly load up the model we trained above into AIAA. 
First, you should run [AIAA Notebook](../AIAA/AIAA.ipynb) to start the server.
Section 3.1 in the AIAA notebook shows how to load trained model into AIAA server. 


### 2. Bring your own Components
To take full advantage of Clara Train SDK, you should write your own components. 
See the [BYOC notebook](BYOC.ipynb) for examples.


# Exercise:
Now that you are familiar with Clara Train, you can try to: 
1. Explore different options of Clara Train by changing or creating a config file and running training: 
    1. Model architecture: AHNet, UNet, SegResNet 
    2. Losses
    3. Transformation 

Hint: to train SegResNet, you can use the configuration `config_train_segresnet.json`, which only changes the network section.
You can train it by running the cell below.

In [None]:
!$MMAR_ROOT/commands/train_W_Config.sh config_train_segresnet.json


2. Train on real spleen data. To do this:
    1. Download the spleen dataset by running the [download](../Data/DownloadDecathlonDataSet.ipynb) notebook
    2. Switch the dataset file in the [environment.json](config/environment.json)
    3. Rerun `train.sh`


3. Experiment with multi-GPU training by changing the number of GPUs from 2 to 3 or 4. 
Edit [train_multi_gpu.sh](commands/train_multi_gpu.sh), then rerun the script.
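
For exercise 2, you can edit `environment.json` by hand or script the change. The sketch below is one possible way to do it; the function `switch_dataset` is hypothetical, and the paths you pass depend on where the download notebook placed the spleen data.

```python
import json
from pathlib import Path


def switch_dataset(env_path, dataset_json, data_root=None):
    """Point DATASET_JSON (and optionally DATA_ROOT) in an environment file at new data."""
    env_file = Path(env_path)
    env = json.loads(env_file.read_text(encoding="utf-8"))
    env["DATASET_JSON"] = dataset_json
    if data_root is not None:
        env["DATA_ROOT"] = data_root
    # Write the file back with readable indentation; all other keys are preserved
    env_file.write_text(json.dumps(env, indent=2) + "\n", encoding="utf-8")
    return env
```

Keep a copy of the original file so that you can switch back to the dummy dataset.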
