# Example : Optimized TensorFlow workflow

## Summary
This example shows an optimized TensorFlow training workflow.

In this example, we train a handwritten digit classification model on the [MNIST dataset](https://en.wikipedia.org/wiki/MNIST_database).

We use several techniques to solve the problems of a common training workflow.

#### Problems of an unoptimized workflow
* Uses too much VRAM (even when it does not need it)
* Slow training speed
* Uses only one GPU

- - -
### Import packages

In [None]:
import tensorflow as tf
from nvidia.dali import pipeline_def, fn, types
import nvidia.dali.plugin.tf as dali_tf

import os
import glob
import math

#### What do these packages do?
* [TensorFlow](https://www.tensorflow.org/) : Defines and trains the model.
* [nvidia.dali](https://developer.nvidia.com/dali/) : Loads and preprocesses data with GPU acceleration.
* [os](https://docs.python.org/3/library/os.html) : Joins path components into a single path.
* [glob](https://docs.python.org/3/library/glob.html) : Finds the paths of all image files.
* [math](https://docs.python.org/3/library/math.html) : Computes the number of iterations per epoch with `ceil`.

---
## Optimization methods
* GPU-accelerated dataloader - [Nvidia DALI](https://developer.nvidia.com/dali/)
    * Reduces the RAM - CPU - GPU memory bottleneck with [GPU Direct Storage](https://docs.nvidia.com/gpudirect-storage/overview-guide/index.html)
    * Data augmentation with GPU acceleration
* Fast forward/backward computation - [Mixed Precision Training](https://arxiv.org/abs/1710.03740)
    * Efficient [MMA (Matrix Multiply-Accumulate)](https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation) computation on Nvidia Ampere GPUs
* Optimized GPU job scheduling - [XLA](https://www.tensorflow.org/xla)
    * Optimizes [SM (Streaming Multiprocessor)](https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf#page=22) internal job scheduling by compiling and fusing operations
* Change the TensorFlow GPU memory strategy
    * Reduces the GPU memory consumption of the TensorFlow process
* [Multi-GPU training](https://www.tensorflow.org/api_docs/python/tf/distribute/MultiWorkerMirroredStrategy)
    * Uses more than one GPU for training
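
To illustrate why mixed precision training needs some care, here is a minimal sketch (standard library only) of how a very small gradient underflows in float16 and how scaling the loss avoids that. The gradient value and the scale factor are hypothetical numbers chosen for illustration; as far as we know, Keras applies loss scaling automatically in `model.fit` when the `mixed_float16` policy is active, so this code is not needed in the notebook itself.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_gradient = 1e-8   # hypothetical gradient, smaller than float16 can represent
loss_scale = 1024.0    # hypothetical fixed scale factor (a power of two)

# Stored directly in float16, the gradient underflows to zero.
unscaled = to_fp16(tiny_gradient)

# Scaling before the float16 round trip keeps the value representable;
# dividing by the scale afterwards (in higher precision) recovers it.
recovered = to_fp16(tiny_gradient * loss_scale) / loss_scale

print(unscaled)
print(recovered)
```

The first value printed is zero, while the second stays close to the original gradient, which is the effect loss scaling is meant to achieve.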

### Set TensorFlow runtime settings
Run this code block to enable mixed precision training and change the GPU memory strategy.

In [None]:
gpu_ids = [0]
# Replace 0 with the ID of each GPU you want to use, e.g. [0, 1].

# Get available GPUs
gpus = tf.config.list_physical_devices('GPU')
target_gpus = [gpus[gpu_id] for gpu_id in gpu_ids]

# Make only the selected GPUs visible to TensorFlow
tf.config.set_visible_devices(target_gpus, 'GPU')

# Memory strategy change: allocate as much as possible -> allocate as needed
for target_gpu in target_gpus:
    tf.config.experimental.set_memory_growth(target_gpu, True)

# Make TensorFlow use mixed precision training
tf.keras.mixed_precision.set_global_policy('mixed_float16')

### Set the multi-GPU training strategy
To use multiple GPUs, a training strategy must be selected first.
In this example, we use [MirroredStrategy](https://www.tensorflow.org/guide/distributed_training#mirroredstrategy).

In [None]:
strategy = tf.distribute.MirroredStrategy()

### Define the dataset and multi-worker dataloader
We assume that the dataset is infinite, or that only part of it can be stored in [RAM](https://en.wikipedia.org/wiki/Random-access_memory).\
So we use `DALI` to load all decoded data into GPU memory with [DMA (Direct Memory Access)](https://en.wikipedia.org/wiki/Direct_memory_access) and augment it there.

![](https://developer-blogs.nvidia.com/wp-content/uploads/2019/01/figure1_blogpost_dali_whitebg-625x177.png)

In [None]:
# Define batch size for dataloader
batch_size = 2560
image_dir = r'./mnist_png/training/'

@pipeline_def(batch_size=batch_size)
def mnist_pipeline(image_dir, shard_id):
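    # Shuffle the files: they are read one label directory after another,
    # so unshuffled batches would contain a single class.
    images, labels = fn.readers.file(file_root=image_dir, shard_id=shard_id, num_shards=len(target_gpus), random_shuffle=True)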
    images = fn.decoders.image(images, device='mixed', output_type=types.GRAY)
    images = fn.crop_mirror_normalize(images, device="gpu", dtype=types.FLOAT, std=[255.], output_layout="CHW")
    labels = labels.gpu()
    return (images, labels)

shapes = (
    (batch_size, 1, 28, 28),
    (batch_size,))

dtypes = (
    tf.float32,
    tf.int32)

input_options = tf.distribute.InputOptions(
    experimental_place_dataset_on_device = True,
    experimental_fetch_to_device = False,
    experimental_replication_mode = tf.distribute.InputReplicationMode.PER_REPLICA)

def dataloader_fn(input_context):
    with tf.device("/gpu:{}".format(input_context.input_pipeline_id)):
        device_id = input_context.input_pipeline_id
        dataset = dali_tf.DALIDataset(
                pipeline=mnist_pipeline(image_dir, device_id=device_id, shard_id=device_id),
                batch_size=batch_size,
                output_shapes=shapes,
                output_dtypes=dtypes,
                device_id=device_id
                )
        return dataset

dataloader = strategy.distribute_datasets_from_function(dataloader_fn, input_options)
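
The `shard_id` and `num_shards` arguments in the pipeline above give each GPU its own slice of the files. The following is a minimal sketch of one way such contiguous sharding can be computed. `shard_bounds` is a hypothetical helper written for this illustration, not a DALI function, and the boundaries DALI actually uses may differ.

```python
from typing import Tuple


def shard_bounds(dataset_size: int, shard_id: int, num_shards: int) -> Tuple[int, int]:
    """Return the [start, end) sample range assigned to one shard."""
    start = dataset_size * shard_id // num_shards
    end = dataset_size * (shard_id + 1) // num_shards
    return start, end


dataset_size = 60000  # number of images in the MNIST training set
num_shards = 2        # hypothetical setup with two GPUs

for shard_id in range(num_shards):
    print(shard_id, shard_bounds(dataset_size, shard_id, num_shards))
```

With this scheme every sample belongs to exactly one shard, even when the dataset size is not divisible by the number of shards.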

#### How DALI works
![](https://developer-blogs.nvidia.com/wp-content/uploads/2019/01/fig5_final.png)

In this dataloader, `DALI` loads each batch as follows:
1. Decode the image file on the CPU to turn it into an array.
2. Send the raw array directly to the GPU.
3. Preprocess the data with GPU acceleration.
4. Prefetch the next batch and hand it to the training process when the current batch ends.

Batches are prefetched as shown in the image below.

![](imgs/prefetch.png)
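
The prefetching idea can be sketched in plain Python with a background thread and a bounded queue. This is only an illustration of the concept under simplified assumptions: DALI implements prefetching internally, and `prefetching_loader` and `prefetch_depth` are hypothetical names that do not appear in its API.

```python
import queue
import threading


def prefetching_loader(batches, prefetch_depth=2):
    """Yield items from `batches` while a background thread prepares the next ones."""
    buffer = queue.Queue(maxsize=prefetch_depth)
    sentinel = object()  # marks the end of the data

    def producer():
        for batch in batches:
            buffer.put(batch)  # blocks while the buffer is full
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        batch = buffer.get()  # blocks until a batch is ready
        if batch is sentinel:
            return
        yield batch


# Stand-in for real batches: five integers.
consumed = list(prefetching_loader(range(5)))
print(consumed)
```

While the consumer works on one batch, the producer is already filling the queue with the next ones, so loading time overlaps with training time.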

The data flow looks like the picture below:

<img src="./imgs/gpudirect_storage.png" width='425px' height='450px'>

---

### Define model, optimizer, loss function

This example's task is multi-class classification, so the model is set up as follows.

* The model is a simple one based on [convolutional layers](https://arxiv.org/abs/1511.08458).
* The loss function is [sparse categorical crossentropy](https://datascience.stackexchange.com/questions/41921/sparse-categorical-crossentropy-vs-categorical-crossentropy-keras-accuracy).
* The optimizer is [AdamW](https://arxiv.org/abs/1711.05101).

For convenience, the model's performance is measured only by training set accuracy.

#### Model architecture
<img src="./imgs/model_architecture.png" width="300px" height="500px">
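
As a sanity check on the architecture, the sketch below traces the feature map size through the layers used in this section. It assumes 3x3 convolutions with the Keras default `'valid'` padding and stride 1, a 2x2 max pooling, and a channels-first layout in every layer. `conv_out` is a hypothetical helper introduced only for this illustration.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


h = w = 28                              # MNIST images are 28x28
h, w = conv_out(h, 3), conv_out(w, 3)   # Conv2D(32, 3x3) -> 26x26
h, w = conv_out(h, 3), conv_out(w, 3)   # Conv2D(64, 3x3) -> 24x24
h, w = h // 2, w // 2                   # MaxPool2D(2x2)  -> 12x12

flat = 64 * h * w                       # Flatten over 64 channels
dense_params = flat * 512 + 512         # weights + biases of Dense(512)

print((h, w), flat, dense_params)
```

Most of the model's parameters sit in the first dense layer, which follows from the size of the flattened feature map.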

#### Compile the model for multi-GPU training
For multi-GPU training, the model must be defined and compiled inside `strategy.scope()`.
To use XLA, the `jit_compile` flag must be `True` when compiling the model.
```
model.compile(..., jit_compile=True)
```

In [None]:
with strategy.scope():
    model = tf.keras.Sequential([
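        tf.keras.layers.Conv2D(32, kernel_size=(3,3), input_shape=(1, 28, 28), activation='relu', data_format='channels_first'),
        # Keep the channels-first layout in every spatial layer to match the "CHW" output of DALI
        tf.keras.layers.Conv2D(64, kernel_size=(3,3), activation='relu', data_format='channels_first'),
        tf.keras.layers.MaxPool2D(pool_size=(2,2), data_format='channels_first'),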
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dense(128)
    ])

    optimizer = tf.keras.optimizers.AdamW(learning_rate=0.001)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    metrics = ['accuracy']

    model.compile(optimizer=optimizer,
                loss=loss_fn,
                metrics=metrics,
                jit_compile=True)

### Start training

The current training environment is as follows:

|Precision|Batch preprocessing|Batch caching|GPU selection|GPU memory strategy|
|---|---|---|---|---|
|FP16|Inline<br>Computed on GPU|Prefetched on batch demand<br>Stored in GPU memory|Selectable by user|Grows as needed|


In [None]:
epochs = 100
iteration_per_epoch = math.ceil(len(glob.glob(os.path.join(image_dir, '*/*.png')))/batch_size)

model.fit(dataloader, epochs=epochs, steps_per_epoch=iteration_per_epoch)
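
The iteration count above can be checked by hand: the MNIST training set has 60,000 images, so one GPU with a batch size of 2560 needs 24 steps per epoch. The sketch below also covers the multi-GPU case under the assumption that every replica consumes its own batch of `batch_size` samples in each step, so the global batch grows with the number of GPUs. `steps_per_epoch` is a hypothetical helper, not part of TensorFlow.

```python
import math


def steps_per_epoch(num_samples: int, per_replica_batch: int, num_replicas: int = 1) -> int:
    """Number of training steps needed to see every sample once."""
    global_batch = per_replica_batch * num_replicas
    return math.ceil(num_samples / global_batch)


single_gpu_steps = steps_per_epoch(60000, 2560)
two_gpu_steps = steps_per_epoch(60000, 2560, num_replicas=2)

print(single_gpu_steps)
print(two_gpu_steps)
```

If that assumption holds, dividing by the number of replicas keeps one epoch equal to one pass over the data when more GPUs are added.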


### After Training
TensorFlow has a [critical bug](https://github.com/tensorflow/tensorflow/issues/1727#issuecomment-225665915): it does not release GPU memory after a model has been used (for either training or evaluation).\
So we need to free the GPU memory for other users.

#### Steps
1. [Save the trained model](https://www.tensorflow.org/guide/keras/save_and_serialize)
2. Kill the TensorFlow process

In [None]:
model_save_path = r'./latest.h5'

# Save the model to a file, then terminate the process so that its GPU memory is released
model.save(model_save_path)
exit(0)
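
An alternative to killing the notebook process is to run the training in a separate Python process, because the memory held by a process is released when that process exits. The sketch below shows the idea with the standard library only. `run_isolated` is a hypothetical helper, and the code string is a placeholder for the real training and saving code.

```python
import subprocess
import sys


def run_isolated(python_code: str) -> int:
    """Run `python_code` in a fresh interpreter and return its exit code."""
    completed = subprocess.run([sys.executable, "-c", python_code])
    return completed.returncode


# Placeholder workload; in practice this would train and save the model.
exit_code = run_isolated("print('training finished')")
print(exit_code)
```

The parent process stays alive and can check the exit code to decide whether the saved model is usable.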


### Comparison before and after optimization


||Before|After|
|---|---|---|
|**Precision**|TF32|FP16|
|**Dataloader**|TensorFlow|Nvidia DALI|
|**Batch caching**|Next batch only<br>RAM|Auto-adjusted by DALI<br>GPU memory|
|**Batch preprocessing**|OpenCV/NumPy<br>CPU|DALI<br>GPU|
|**GPU usage**|Training|Training<br>Preprocessing|
|**GPU selection**|Automatically selected by TensorFlow|Selectable by user|
|**GPU memory strategy**|As much as possible<br>([automatically selected by TensorFlow](https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth))|Grows as needed|