# ThreadBuffer Performance

This notebook demonstrates the use of `ThreadBuffer` to generate batches of data asynchronously from the training thread.

During training the main thread is often busy interacting with GPU memory and invoking CUDA operations, work that is independent of batch generation. If generating a batch takes a significant amount of time compared to one training iteration, and the two can actually run in parallel given the limitations of the GIL and other factors, overlapping them should speed up the whole training process. The gain depends on the ratio of these two times: if batch generation is lengthy but training is very fast, very little of the work can overlap.
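
To make the idea concrete, here is a minimal sketch of a thread-backed buffer built from the standard library alone. It is not MONAI's implementation: `buffered` is a hypothetical helper written for illustration, and it omits the timeouts and error propagation a real buffer would need.

```python
import queue
import threading


def buffered(source, buffer_size=1):
    """Yield the items of `source`, which a background thread produces ahead of time.

    Hypothetical helper for illustration only, not part of MONAI.
    """
    items = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the source

    def produce():
        try:
            for item in source:
                items.put(item)  # blocks while the buffer is full
        finally:
            # always signal the end so the consumer never waits forever
            items.put(done)

    # daemon thread: it will not keep the process alive if the consumer stops early
    threading.Thread(target=produce, daemon=True).start()

    while True:
        item = items.get()  # blocks until the producer has an item ready
        if item is done:
            return
        yield item


# the consumer sees the same items, in the same order, as the source
squares = list(buffered((i * i for i in range(5)), buffer_size=1))
print(squares)  # [0, 1, 4, 9, 16]
```

While the consumer is busy with one item, the producer thread is free to prepare the next one, which is where any overlap comes from. How much actually runs in parallel still depends on whether the producing code releases the GIL.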

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Project-MONAI/tutorials/blob/master/acceleration/threadbuffer_performance.ipynb)

## Setup Environment

The current MONAI master branch must be installed for this feature (as of release 0.3.0). Skip this step if it is already installed:

In [None]:
!python -c "import monai" || pip install -q "monai-weekly[tqdm]"

In [1]:
import monai
import numpy as np
import torch
from monai.data import DataLoader, Dataset, ThreadBuffer, create_test_image_2d
from monai.losses import Dice
from monai.networks.nets import UNet
from monai.transforms import AddChanneld, Compose, MapTransform, EnsureTyped

monai.utils.set_determinism(seed=0)

monai.config.print_config()

MONAI version: 0.4.0
Numpy version: 1.19.1
Pytorch version: 1.7.0a0+7036e91
MONAI flags: HAS_EXT = False, USE_COMPILED = False
MONAI rev id: 0563a4467fa602feca92d91c7f47261868d171a1

Optional dependencies:
Pytorch Ignite version: 0.4.2
Nibabel version: 3.2.1
scikit-image version: 0.15.0
Pillow version: 8.0.1
Tensorboard version: 2.2.0
gdown version: 3.12.2
TorchVision version: 0.8.0a0
ITK version: 5.1.2
tqdm version: 4.54.1
lmdb version: 1.0.0
psutil version: 5.7.2

For details about installing the optional dependencies, please visit:
    https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies



The data pipeline below creates random 2D segmentation training pairs. It is artificially slowed by setting the number of worker processes to 0 (often necessary under Windows).

In [2]:
class RandomGenerator(MapTransform):
    """Generates a dictionary containing image and
    segmentation images from a given seed value."""

    def __call__(self, seed):
        rs = np.random.RandomState(seed)
        im, seg = create_test_image_2d(
            256, 256, num_seg_classes=1, random_state=rs
        )

        return {self.keys[0]: im, self.keys[1]: seg}


data = np.random.randint(0, monai.utils.MAX_SEED, 1000)

trans = Compose(
    [
        RandomGenerator(keys=("im", "seg")),
        AddChanneld(keys=("im", "seg")),
        EnsureTyped(keys=("im", "seg")),
    ]
)

train_ds = Dataset(data, trans)
train_loader = DataLoader(train_ds, batch_size=20, shuffle=True, num_workers=0)

The network, loss function, and optimizer are defined as normal:

In [3]:
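device = torch.device("cuda:0")
# 2D UNet with 1 input channel and 1 output channel
net = UNet(2, 1, 1, (8, 16, 32), (2, 2, 2), num_res_units=2).to(device)
# sigmoid=True applies a sigmoid to the raw network output inside the loss
loss_function = Dice(sigmoid=True)
optimizer = torch.optim.Adam(net.parameters(), 1e-5)
max_epochs = 10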

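A simple training step is defined which only optimizes the network for one batch. The `train` function runs it over every batch for `max_epochs` epochs, optionally drawing the batches through a `ThreadBuffer`: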

In [4]:
def train_step(batch):
    inputs, labels = batch["im"].to(device), batch["seg"].to(device)

    optimizer.zero_grad()
    outputs = net(inputs)
    loss = loss_function(outputs, labels)
    loss.backward()
    optimizer.step()


def train(use_buffer):
    # wrap the loader in the ThreadBuffer if selected
    src = ThreadBuffer(train_loader, 1) if use_buffer else train_loader

    for _ in range(max_epochs):
        for batch in src:
            train_step(batch)

Timing how long it takes to generate a single batch versus how long it takes to optimize the network for one step shows how each full training iteration is split between the two:

In [5]:
it = iter(train_loader)
batch = next(it)

%timeit -n 1 next(it)
%timeit -n 1 train_step(batch)

52.9 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
36.6 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Without using an asynchronous buffer for batch generation these operations must be sequential:

In [6]:
%timeit -n 1 train(False)

50.7 s ± 2.35 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


With batch generation overlapped with training we see a significant speedup, from 50.7 s to 31.1 s in this run:

In [7]:
%timeit -n 1 train(True)

31.1 s ± 833 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
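
As a rough sanity check, the per-batch and per-step timings above can be turned into estimates of the total run time. This is a back-of-the-envelope sketch using only the figures printed in this notebook; it assumes every iteration costs the same and, for the buffered case, that the two operations overlap perfectly.

```python
# Figures copied from the timing outputs above.
batch_ms = 52.9        # time to generate one batch
step_ms = 36.6         # time for one optimization step
measured_seq_s = 50.7  # train(False)
measured_buf_s = 31.1  # train(True)

# 1000 items in batches of 20, repeated for 10 epochs.
iterations = (1000 // 20) * 10

# Sequential: each iteration pays for both operations.
est_seq_s = iterations * (batch_ms + step_ms) / 1000
# Perfect overlap: each iteration costs only the slower operation.
est_buf_s = iterations * max(batch_ms, step_ms) / 1000

print(f"estimated sequential: {est_seq_s:.2f} s")
print(f"estimated overlapped: {est_buf_s:.2f} s")
print(f"ideal speedup:    {est_seq_s / est_buf_s:.2f}x")
print(f"measured speedup: {measured_seq_s / measured_buf_s:.2f}x")
```

The estimates come out at roughly 45 s and 26 s, a little below the measured totals, presumably because of overheads the single-call timings do not capture. The measured speedup of about 1.6x is nonetheless close to the ideal ratio, which suggests that batch generation, the slower of the two operations, is what bounds the buffered run.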
