In [1]:
# REF : https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html

#### Enable asynchronous data loading and augmentation

- torch.utils.data.DataLoader supports asynchronous data loading and data augmentation in separate worker subprocesses.
- The default, num_workers=0, loads data synchronously in the main process. Setting num_workers > 0 enables asynchronous data loading, so that data loading overlaps with training.
- Setting pin_memory=True instructs DataLoader to use pinned memory, which enables faster, asynchronous memory copies from the host to the GPU.
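
A minimal sketch of the settings above, assuming a toy TensorDataset in place of a real dataset. The helper names build_loader and run_epoch, the batch size, and the tensor shapes are illustrative and not part of the tuning guide.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def build_loader(dataset, num_workers=2, use_cuda=None):
    """Illustrative helper: a DataLoader with worker subprocesses and pinned memory."""
    if use_cuda is None:
        use_cuda = torch.cuda.is_available()
    return DataLoader(
        dataset,
        batch_size=16,            # arbitrary value for the sketch
        shuffle=True,
        num_workers=num_workers,  # > 0: batches are prepared in worker subprocesses
        pin_memory=use_cuda,      # pinned host memory only matters when copying to a GPU
    )


def run_epoch(loader, device):
    """Move every batch to the device and return the number of samples seen."""
    seen = 0
    for images, labels in loader:
        # non_blocking=True lets the copy from pinned memory run asynchronously.
        images = images.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        seen += images.shape[0]
    return seen


if __name__ == "__main__":
    # Toy data standing in for a real image dataset (hypothetical shapes).
    dataset = TensorDataset(torch.randn(64, 3, 8, 8), torch.randint(0, 10, (64,)))
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(run_epoch(build_loader(dataset), device))
```

The best num_workers value depends on the machine and the cost of the per-sample transforms, so it is worth timing a few values rather than assuming one.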

#### Disable gradient calculation for validation or inference

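- Wrap validation and inference code in the torch.no_grad() context manager. Autograd then does not record operations or keep intermediate tensors for the backward pass, which reduces memory use and speeds up execution.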
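
As a small illustration, assuming a toy nn.Linear model in place of a real network, the output computed inside the context manager carries no autograd history:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # toy model for illustration only
model.eval()             # switch layers such as dropout/batch norm to eval behaviour
x = torch.randn(8, 4)

with torch.no_grad():
    out = model(x)       # no graph is built for this forward pass

print(out.requires_grad)  # False
```

Note that model.eval() and torch.no_grad() do different things: the first changes layer behaviour, the second disables gradient tracking, so validation code usually wants both.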

#### Disable bias for convolutions directly followed by a batch norm

- If an nn.Conv2d layer is directly followed by an nn.BatchNorm2d layer, the bias in the convolution is not needed; use nn.Conv2d(..., bias=False, ...) instead. In its first step, BatchNorm subtracts the per-channel mean, which cancels out any constant per-channel offset such as the convolution bias.
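
The cancellation can be checked numerically. The sketch below assumes arbitrary layer sizes and a BatchNorm2d layer in training mode (so it normalizes with batch statistics); it copies the weights from a biased convolution into a bias-free one and compares the normalized outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two convolutions with identical weights; only one has a bias term.
conv_bias = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=True)
conv_nobias = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
with torch.no_grad():
    conv_nobias.weight.copy_(conv_bias.weight)

bn = nn.BatchNorm2d(8)  # training mode by default: uses per-batch mean and variance
x = torch.randn(4, 3, 16, 16)

y_bias = bn(conv_bias(x))
y_nobias = bn(conv_nobias(x))

# The bias shifts each channel by a constant, which the mean subtraction removes.
same = torch.allclose(y_bias, y_nobias, atol=1e-4)
print(same)
```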

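#### Zero out gradients with set_to_none=True
- Call optimizer.zero_grad(set_to_none=True) to reset gradients. This sets each gradient to None instead of filling it with zeros, which avoids a memset per parameter; the next backward pass then writes the gradients by assignment rather than accumulating into them. In PyTorch 2.0 and later, set_to_none=True is the default.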
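
A short sketch of the effect, assuming a toy nn.Linear model and an SGD optimizer chosen only for illustration:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # toy model for illustration only
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One backward pass populates .grad on every parameter.
loss = model(torch.randn(8, 4)).sum()
loss.backward()
had_grads = all(p.grad is not None for p in model.parameters())

# Reset gradients by dropping the tensors instead of zero-filling them.
optimizer.zero_grad(set_to_none=True)
grads_are_none = all(p.grad is None for p in model.parameters())

print(had_grads, grads_are_none)  # True True
```

Code that reads p.grad directly after zero_grad needs to handle None rather than a zero tensor.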

#### Fuse pointwise operations
Pointwise operations (elementwise addition, multiplication, and math functions such as sin(), cos(), and sigmoid()) can be fused into a single kernel to amortize memory access time and kernel launch time.

In [8]:
import torch

In [11]:
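@torch.jit.script
def fused_gelu(x):
    # GELU built only from pointwise ops; 1.41421 approximates sqrt(2).
    # Scripting the function allows these ops to be fused where supported.
    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))
# Note: the timed statement also includes creating the random input tensor.
%timeit fused_gelu(torch.rand(512,512))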

1.27 ms ± 9.39 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [12]:
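@torch.compile
def fused_gelu_compile(x):
    # Same GELU expression, compiled with torch.compile instead of TorchScript.
    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))
# Note: the timed statement also includes creating the random input tensor.
%timeit fused_gelu_compile(torch.rand(512,512))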

994 µs ± 118 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### Enable channels_last memory format for computer vision models
PyTorch 1.5 introduced support for the channels_last memory format for convolutional networks. This format is meant to be used in conjunction with AMP (automatic mixed precision) to further accelerate convolutional neural networks with Tensor Cores.
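
A minimal sketch of converting a model and its input to channels_last, assuming a single toy convolution in place of a full vision model. The shapes are arbitrary, and the example runs on CPU only to show the conversion; the speedup described above applies to GPUs with Tensor Cores.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, kernel_size=3, padding=1)  # toy model for illustration
x = torch.randn(2, 3, 16, 16)                      # logical shape stays NCHW

# Convert both the model parameters and the input to channels_last.
model = model.to(memory_format=torch.channels_last)
x = x.to(memory_format=torch.channels_last)

out = model(x)

# The shape is unchanged; only the strides (memory layout) differ.
print(x.shape, x.stride())
print(x.is_contiguous(memory_format=torch.channels_last))  # True
```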

#### Checkpoint intermediate buffers
- Buffer checkpointing reduces training memory by storing the inputs of only a few layers and recomputing the others during the backward pass. Select checkpointing targets carefully: the best candidates are layers whose outputs are large but cheap to recompute, such as activation functions (e.g. ReLU, Sigmoid, Tanh), up/down sampling, and matrix-vector operations with small accumulation depth.
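
One way to apply this is torch.utils.checkpoint.checkpoint. The sketch below assumes a small nn.Sequential block with arbitrary sizes as the checkpointing target and passes use_reentrant=False, which is available in recent PyTorch releases; it is an illustration rather than a recommendation of which layers to checkpoint.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Toy block whose intermediate activations will be recomputed, not stored.
block = nn.Sequential(
    nn.Linear(16, 16),
    nn.ReLU(),
    nn.Linear(16, 16),
    nn.ReLU(),
)
x = torch.randn(4, 16, requires_grad=True)

# Forward pass keeps only the block input; activations inside the block
# are recomputed when backward() reaches it.
out = checkpoint(block, x, use_reentrant=False)
out.sum().backward()

print(x.grad.shape)
```

The trade-off is extra compute in the backward pass in exchange for lower peak memory, which is why cheap-to-recompute layers make the best targets.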