# Video Libraries Tests

This notebook tests video loading libraries. Many tools can load video, so we focus on the functionality that matters most for ML/DL applications.

ML/DL training needs random, efficient, and fast access to data. Our goal is therefore a **seek** functionality that is as quick and lightweight as possible, so that we load only the frames needed for each batch. We would also like to build batches easily.

There are some common tools we'll look at:
 - cv2
 - pyAV
 - decord
 - DALI
 - pyNvVideoCodec

### Note on time measurements

Please note that the GPU is a natively parallel and asynchronous chip, so operations that run on it should be timed with CUDA Events. In this notebook we use PyTorch to interface with CUDA Events. You need to create start and end events, record the start event when the timed section begins, record the end event when it is done and, *very important*, synchronize before computing the elapsed time. Another good practice is to run your function/script for a few warmup iterations before timing, so that initialization overhead is not measured.
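
The sequence described above can be sketched as a small helper. This is a minimal sketch, not the code used later in the notebook: `time_fn`, `warmup` and `iters` are hypothetical names, and the wall-clock fallback for machines without CUDA (or without PyTorch) is an assumption added so the snippet runs anywhere.

```python
import time

try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # PyTorch not installed: fall back to wall-clock timing
    torch = None
    _HAS_CUDA = False


def time_fn(fn, warmup=3, iters=10):
    """Return the mean time in seconds of one call to fn(), after warmup calls."""
    for _ in range(warmup):
        fn()  # warmup iterations are not timed
    if _HAS_CUDA:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()  # wait for queued GPU work before reading the timer
        return start.elapsed_time(end) / 1000.0 / iters  # elapsed_time is in ms
    tic = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - tic) / iters


seconds_per_call = time_fn(lambda: sum(range(10_000)))
print(f"{seconds_per_call:.6f} s per call")
```

Keeping the warmup inside the helper makes it harder to forget, and returning a per-call mean makes runs with different iteration counts comparable.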

### Disclaimer on Implementations

The video library tests are a best effort and should not be considered a professional benchmark. If you find errors or possible optimizations and improvements, please open an issue and we will address them.

## Base Implementations

Let's take a look at a simple implementation of the seek function before we move on to the **real business** of building batches.

First, for a single video there are five toy examples below, based on opencv, decord-cpu, decord-gpu, pyav, and pyNvVideoCodec, implemented in [toy_cv2.py](assets/toy_cv2.py), [toy_decord.py](assets/toy_decord.py), [toy_decord_gpu.py](assets/toy_decord_gpu.py), [toy_pyav.py](assets/toy_pyav.py), [toy_pynvvideocodec.py](assets/toy_pynvvideocodec.py) respectively.
Looking at them in detail, opencv, pyav, and pyNvVideoCodec go through the video sequentially, with more or less compact Python implementations. Decord instead is much more pythonic and treats the video as a list of frames that can be accessed randomly.
As for loading time, decord is on par with opencv, while pyav is slower. pyNvVideoCodec uses a GPU-accelerated approach, which makes it faster than pyav but slower than opencv on this test.

Second, moving on to the *real* task of building batches, things get more interesting. As before we have five examples, based on opencv, decord-cpu, decord-gpu, pyav, and pyNvVideoCodec, implemented in [toy_batch_cv2.py](assets/toy_batch_cv2.py), [toy_batch_decord.py](assets/toy_batch_decord.py), [toy_batch_decord_gpu.py](assets/toy_batch_decord_gpu.py), [toy_batch_pyav.py](assets/toy_batch_pyav.py), [toy_batch_pynvvideocodec.py](assets/toy_batch_pynvvideocodec.py) respectively.
Opencv, pyav, and pyNvVideoCodec need to loop over the video again and again to find the frames to load, which makes the code less compact and more verbose. Decord instead can load frames directly from the video with a `get_batch` function that takes the frame indexes.
These differences translate into a small processing time advantage for decord. The second fastest is pyNvVideoCodec, which is highly optimized for hardware-based video decoding on the GPU. This is potentially a huge advantage because it allows passing data with *zero-copy* (*i.e.* the data is already in GPU memory and ready to use, so no host-to-device copy is needed). Looking at decord in more detail, on a single video the CPU version is faster than the GPU version; we will check later on a real use case whether this changes.

Please keep in mind that the exact times of these runs vary with the machine you are running on.
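
To see why a seek function matters for batch building, here is a toy cost model. It is only a sketch and not a measurement of any of the libraries above: the function names are hypothetical, and it assumes a keyframe every `gop_size` frames, with a seek decoding from the preceding keyframe up to the requested frame.

```python
def sequential_cost(wanted):
    """Frames decoded by a forward-only reader that scans from frame 0."""
    return max(wanted) + 1


def seek_cost(wanted, gop_size):
    """Frames decoded when every request seeks to the preceding keyframe.

    Assumes keyframes at indices 0, gop_size, 2 * gop_size, ...
    """
    return sum(index % gop_size + 1 for index in wanted)


wanted = [10, 250, 900, 1500]
print(sequential_cost(wanted))         # 1501
print(seek_cost(wanted, gop_size=12))  # 24
```

Under these assumptions the gap grows with the distance between the requested frames, which matches the sparse, random access pattern of training batches.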

In [None]:
! python3 assets/toy_cv2.py
! python3 assets/toy_batch_cv2.py

In [None]:
! python3 assets/toy_decord.py
! python3 assets/toy_batch_decord.py

In [None]:
! python3 assets/toy_decord_gpu.py
! python3 assets/toy_batch_decord_gpu.py

In [None]:
! python3 assets/toy_pyav.py
! python3 assets/toy_batch_pyav.py

In [None]:
! python3 assets/toy_pynvvideocodec.py

In [None]:
! python3 assets/toy_batch_pynvvideocodec.py

In [None]:
import matplotlib.pyplot as plt
import numpy as np

libraries = ("cv2", "pyAV", "decord CPU", "decord GPU", "PyNvVideoCodec")
values = {
    'Single frame': (0, 0, 0, 0, 0),
    'Batched frames': (0, 0, 0, 0, 0),
}

x = np.arange(len(libraries))  # the label locations
width = 0.25  # the width of the bars
multiplier = 0

fig, ax = plt.subplots(layout='constrained')

# use a separate loop variable so the `values` dict is not overwritten
for attribute, measurements in values.items():
    offset = width * multiplier
    rects = ax.bar(x + offset, measurements, width, label=attribute)
    ax.bar_label(rects, padding=3)
    multiplier += 1

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('time (s)')
ax.set_title('Time by libraries')
ax.set_xticks(x + width / 2, libraries)  # center each tick between the two bars of a group
ax.legend(loc='upper left', ncols=3)
ax.set_ylim(0, 2.5)

plt.show()

The plot above reports the run times obtained from the toy runs. These runs help us understand how the different libraries behave on a single video when loading frames sequentially at fixed intervals (blue bars) and when loading random batches of random frames (orange bars). In these experiments the batch size is 8.

The first takeaway is that without a highly efficient seek function that can read at random locations in the video container, it is quite challenging to achieve great dataloading performance.

Second, take the time to look at the code in the [assets](assets) folder: you will see that some libraries are much more verbose than others.

Finally, keep in mind that ML training needs the data on the GPU, so libraries that provide hardware (HW) decoding and zero-copy access to the frames on the GPU are nice to have because they reduce data transfers.

## Dataloaders

Let's now move on to more practical implementations. Assuming you use PyTorch in your day-to-day experiments, we continue with some (not all) of the previous libraries and see how they work in a more practical setting.

We will see the following:
 - PyTorch native dataloader (based on pyAV)
 - DALI (NVIDIA accelerated dataloading library)
 - decord (CPU and GPU implementations)
 - pyNvVideoCodec

The dataset used for this example is [UCF101](https://www.crcv.ucf.edu/data/UCF101.php), which is small but not tiny, allowing us to create a meaningful experiment.

### PyTorch

The torchvision library implements many useful functions for image and video processing. Its developers chose to use PyAV for video loading under the hood.

*Note* we learned the hard way that the dataloader breaks if it finds unexpected files in the dataset folder. For example, we had a data preprocessing script and some preprocessed videos in the dataset folder. Remember to save such files somewhere else!
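
One way to catch this before building the dataset is to scan the folder for files that are not videos. The helper below is a minimal sketch: `find_unexpected_files` is a hypothetical name, and treating `.avi` as the only expected extension is an assumption based on the UCF101 files used here.

```python
from pathlib import Path


def find_unexpected_files(root, allowed_exts=(".avi",)):
    """Return the files under root whose extension is not in allowed_exts."""
    allowed = {ext.lower() for ext in allowed_exts}
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() not in allowed
    )


# Example call (path taken from the cell below):
# find_unexpected_files('/workspace/playbooks/video/assets/UCF101')
```

An empty result means the folder only contains files with the expected extensions.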

In [2]:
import av
import torch
import torchvision

In [None]:
# there is a bug in this dataloader apparently
# https://github.com/pytorch/vision/issues/2265
def custom_collate(batch):
    filtered_batch = []
    for video, _, label in batch:
        filtered_batch.append((video, label))
    return torch.utils.data.dataloader.default_collate(filtered_batch)

UCF101_train = torchvision.datasets.UCF101(root='/workspace/playbooks/video/assets/UCF101',
                                     annotation_path='/workspace/playbooks/video/assets/UCF101TrainTestSplits-RecognitionTask/ucfTrainTestlist',
                                     frames_per_clip=8,
                                     train=True)

dataloaderUFC101 = torch.utils.data.DataLoader(UCF101_train, batch_size=4, shuffle=True, drop_last=True, collate_fn=custom_collate)

In [None]:
len(dataloaderUFC101)

**Note on time measurement (part 2)**
Below we do a few things to measure time. First, we use CUDA Events so that we time the asynchronous GPU execution and not just the CPU side. Second, we follow a precise sequence: we create a start and an end event, record the start, record the end, and synchronize to make sure the CUDA kernels have finished executing. Before timing we also run a very short warmup loop, so that initial setup and other transients are not measured.

In [None]:
device = 'cuda:0'
# init
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
# warmup the GPU
for i, (video, label) in enumerate(dataloaderUFC101):
    video.to(device)
    label.to(device)
    if i == 9:
        break
# start
start.record()
# do your things
for i, (video, label) in enumerate(dataloaderUFC101):
    video.to(device)
    label.to(device)
    if i == 499:
        break
    #print(video.size())
# end + sync
end.record()
torch.cuda.synchronize(device)

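pytorch_time = start.elapsed_time(end) / 1000  # elapsed_time is in ms
num_batches = i + 1  # the loop breaks after processing batch index i
pytorch_time_batch = pytorch_time / num_batches
print('It took %f seconds for %d batches, or %f seconds per batch' % (pytorch_time, num_batches, pytorch_time_batch))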


To keep the experiment fast, the cell above stops after 500 batches have been loaded and moved to the GPU. On the full dataset the time per batch is slightly lower, because the initialization cost is diluted over more iterations.
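
Capping the number of batches with a manual `break` makes it easy to miscount by one. A possible alternative, shown here as a sketch with hypothetical names (`consume`, `on_batch`), is to let `itertools.islice` do the capping and to count the batches explicitly.

```python
from itertools import islice


def consume(loader, max_batches, on_batch=lambda batch: None):
    """Iterate over at most max_batches items of loader and return how many were seen."""
    count = 0
    for batch in islice(loader, max_batches):
        on_batch(batch)  # e.g. move the tensors to the GPU
        count += 1
    return count


# Any iterable works, so a plain range stands in for a dataloader here.
print(consume(range(1000), 500))  # 500
print(consume(range(3), 500))     # 3
```

The returned count is the right divisor for a per-batch time, including when the loader runs out before the cap is reached.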

The advantage of the native dataloader is that, if your dataset is supported, you can get away with one line of code: choose the parameters you care about and pass it the dataset path. The drawbacks are less flexibility and potentially lower speed compared to DALI, for example.

*Note* there is some overhead in creating clips from videos when the Dataset is built, so training starts only after this step is complete.

### DALI

[DALI](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html) is the NVIDIA accelerated DAta loading LIbrary and implements a lot of useful dataloading pipelines for images, video, and audio.

DALI does not handle variable frame rates or variable frame sizes, so please preprocess the dataset with [preprocess.sh](assets/preprocessassets/preprocess.sh).
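
As a rough illustration of what such a preprocessing step could look like (this is not the content of `preprocess.sh`, which is not shown here), the sketch below builds an `ffmpeg` command that re-encodes a clip to a fixed frame size and a constant frame rate. The function name, the 320x240 size, the 25 fps rate and the `libx264` encoder are assumptions chosen for illustration.

```python
def build_ffmpeg_cmd(src, dst, width=320, height=240, fps=25):
    """Build an ffmpeg command that rescales a clip and forces a constant frame rate."""
    return [
        "ffmpeg", "-y",                    # overwrite dst if it already exists
        "-i", src,
        "-vf", f"scale={width}:{height}",  # fixed frame size
        "-r", str(fps),                    # constant output frame rate
        "-c:v", "libx264",                 # assumed encoder
        "-an",                             # drop the audio track
        dst,
    ]


cmd = build_ffmpeg_cmd("input.avi", "output.mp4")
print(" ".join(cmd))
# To run it (requires ffmpeg on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems with file names that contain spaces.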

If you need to install DALI (it comes preinstalled in NVIDIA NGC docker images), use the following pip command and remember to choose the CUDA version that matches your environment.
```bash
pip install --extra-index-url https://pypi.nvidia.com --upgrade nvidia-dali-cuda120
```

In [7]:
import os
import numpy as np

from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali.plugin.pytorch import DALIGenericIterator
from nvidia.dali.plugin.base_iterator import LastBatchPolicy

In [None]:
batch_size = 4
sequence_length = 8
stride = 1
initial_prefetch_size = 16
video_directory = "/workspace/playbooks/video/assets/output/"

video_files = []
split_dir = "/workspace/playbooks/video/assets/UCF101TrainTestSplits-RecognitionTask/ucfTrainTestlist/"
for split_name in ['trainlist01', 'trainlist02', 'trainlist03']:
    # each line starts with "<class>/<video>.avi"; the preprocessed copies are .mp4
    with open(split_dir + split_name + ".txt", "r") as split_file:
        video_files += [os.path.join(video_directory, line.split()[0].split('.')[0] + ".mp4") for line in split_file]

print(video_files[100])
print(len(video_files))
n_iter = 100

In [9]:
@pipeline_def
def video_pipe(filenames):
    videos = fn.readers.video(
        device="gpu",
        name="VideoReader",
        filenames=filenames,
        sequence_length=sequence_length,
        stride=stride,
        image_type=types.RGB,
        pad_sequences=True,
        pad_last_batch=True,
        shard_id=0,
        num_shards=1,
        random_shuffle=True,
        initial_fill=initial_prefetch_size,
        dtype=types.UINT8,
        skip_vfr_check=False,
    )
    return videos

In [None]:
from time import time

pipe = video_pipe(
    batch_size=batch_size,
    num_threads=2,
    device_id=0,
    filenames=video_files,
    seed=123456,
)
pipe.build()

dali_iter = DALIGenericIterator(pipe, ["data"],
                                reader_name="VideoReader",
                                dynamic_shape=True,
                                last_batch_policy=LastBatchPolicy.DROP)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

for i, data in enumerate(dali_iter):
    if i == 9:
        break
    #print("%d, %s" % (i, data[0]["data"].size()))
    video = data[0]["data"]

start.record()
for i, data in enumerate(dali_iter):
    if i == 2000:
        break
    #print("%d, %s" % (i, data[0]["data"].size()))
    video = data[0]["data"]
    #sequences_out = pipe_out[0]#.as_cpu().as_array()
    #print(sequences_out.shape)
end.record()
torch.cuda.synchronize()


dali_time = start.elapsed_time(end) / 1000  # elapsed_time is in ms
dali_time_batch = dali_time / i
print('It took %f seconds for %d batches, or %f seconds per batch' % (dali_time, i, dali_time_batch))

DALI is used in much the same way as the native PyTorch dataloaders. It has similar features and accelerates video loading directly on the GPU. Thanks to GPU acceleration it might be faster than PyTorch, which makes it a very valid alternative with similarly low coding effort.

### Decord CPU

It is common practice in decord to load a frame buffer from each video separately, but decord also provides a `VideoLoader` class. We give an example of `VideoLoader` usage before moving on to the more commonly used dataloader implementation.

In [100]:
import torch
import decord
import os
import numpy as np

In [None]:
from decord import VideoLoader, VideoReader
from decord import cpu, gpu

In [102]:
batch_size = 4
sequence_length = 8
stride = 1
initial_prefetch_size = 16
video_directory = "/workspace/playbooks/video/assets/UCF101"

video_files = []
split_dir = "/workspace/playbooks/video/assets/UCF101TrainTestSplits-RecognitionTask/ucfTrainTestlist/"
for split_name in ['trainlist01', 'trainlist02', 'trainlist03']:
    # each line starts with "<class>/<video>.avi"
    with open(split_dir + split_name + ".txt", "r") as split_file:
        video_files += [os.path.join(video_directory, line.split()[0]) for line in split_file]

In [None]:
import time

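# shape is (batch, height, width, channels) according to the decord docs
vl = VideoLoader([video_files[0]], ctx=[cpu(0)], shape=(batch_size, 320, 240, 3), interval=stride, skip=5, shuffle=1)
decord.bridge.set_bridge('torch')
device = "cuda:0"

print('Total batches:', len(vl))
tic = time.time()
for batch in vl:
    frames = batch[0].to(device)
    # batch[1] holds (file index, frame index) pairs, not class labels
    indices = batch[1].to(device)

torch.cuda.synchronize()  # make sure pending GPU work is done before stopping the clock
toc = time.time() - tic
print('Iterated over the video batches in %f sec, %f sec per iter' % (toc, toc / len(vl)))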

The more common practice is to open a reader for each video inside a PyTorch `Dataset`, so let's define our dataset!

In [11]:
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

class DecordVideoDataset(Dataset):
    def __init__(self, root_dir, transform):
        """
        Args:
            root_dir (string): Directory with all the videos.
            transform (callable, optional): Optional transform to be applied on a sample.
        """
        self.root_dir = root_dir
        self.transform = transform
        self.video_paths = []
        self.labels = []
        
        # Load video paths and labels from directory structure
        for label_idx, action in enumerate(sorted(os.listdir(root_dir))):
            action_dir = os.path.join(root_dir, action)
            if os.path.isdir(action_dir):
                for video_file in os.listdir(action_dir):
                    if video_file.endswith('.avi'):  # Assuming videos are in .avi format
                        self.video_paths.append(os.path.join(action_dir, video_file))
                        self.labels.append(label_idx)
                        
    def __len__(self):
        return len(self.video_paths)
    
    def __getitem__(self, idx):
        video_path = self.video_paths[idx]
        label = self.labels[idx]
        
        # Load video using Decord on the CPU
        vr = VideoReader(video_path, ctx=cpu(0))  # for GPU decoding use ctx=gpu(0)

        # Sample random frame indices (unordered, possibly repeated)
        frames_idx = np.random.randint(0, len(vr), sequence_length)
        frames = vr.get_batch(frames_idx)  # returns only the requested frames
        
        # Apply any transformations (e.g., resizing, normalization)
        if self.transform:
            frames = self.transform(frames)
        
        return frames, label    

In [12]:
ucf101_dataset = DecordVideoDataset(root_dir=video_directory, transform=None)
dataloader = DataLoader(ucf101_dataset, batch_size=batch_size, shuffle=True)

In [None]:
device = 'cuda:0'
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

for i, data in enumerate(dataloader):
    video = data[0].to(device)
    label = data[1].to(device)
    #print(video.size())
    #print(label.size())
    if i == 9:
        break

start.record()
for i, data in enumerate(dataloader):
    if i == 1000:
        break
    video = data[0].to(device)
    label = data[1].to(device)
    #print(video.size())
    #print(label.size())

end.record()
torch.cuda.synchronize()

decord_time = start.elapsed_time(end) / 1000  # elapsed_time is in ms
decord_time_batch = decord_time / i
print('It took %f seconds for %d batches, or %f seconds per batch' % (decord_time, i, decord_time_batch))

Researchers like decord for its flexibility and for how easy it is to customize the dataloader when non-standard operations need to be performed on the video clips.
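
As an example of such a customization, the dataset above draws unordered random frame indices, while many models expect a temporally ordered clip. The sketch below computes the indices of a random contiguous clip with a given stride. `clip_indices` is a hypothetical helper, not part of decord, and clamping to the last frame for short videos is an assumption. Its result could be passed to `vr.get_batch` in place of `frames_idx`.

```python
import random


def clip_indices(num_frames, sequence_length, stride=1, rng=random):
    """Return ordered frame indices of a random clip, clamped to the last frame."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    span = (sequence_length - 1) * stride + 1       # frames covered by the clip
    start = rng.randrange(max(num_frames - span, 0) + 1)
    return [min(start + k * stride, num_frames - 1) for k in range(sequence_length)]


print(clip_indices(100, 8, stride=2, rng=random.Random(0)))
print(clip_indices(3, 8))  # [0, 1, 2, 2, 2, 2, 2, 2]
```

Passing a seeded `random.Random` instance makes the sampling reproducible, which helps when debugging a dataloader.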

### Decord GPU

To use the GPU implementation there is only one simple change to make: pass the GPU context to decord.
In the `__getitem__` function you can see that `VideoReader` gets `ctx=gpu(0)` instead of `ctx=cpu(0)`.

**WARNING** we encountered some silent failures when using decord GPU, so we don't encourage you to use it at the moment. Stay tuned, we might find the bug and fix it later!

In [97]:
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

class DecordVideoDataset(Dataset):
    def __init__(self, root_dir, transform):
        """
        Args:
            root_dir (string): Directory with all the videos.
            transform (callable, optional): Optional transform to be applied on a sample.
        """
        self.root_dir = root_dir
        self.transform = transform
        self.video_paths = []
        self.labels = []
        
        # Load video paths and labels from directory structure
        for label_idx, action in enumerate(sorted(os.listdir(root_dir))):
            action_dir = os.path.join(root_dir, action)
            if os.path.isdir(action_dir):
                for video_file in os.listdir(action_dir):
                    if video_file.endswith('.avi'):  # Assuming videos are in .avi format
                        self.video_paths.append(os.path.join(action_dir, video_file))
                        self.labels.append(label_idx)
                        
    def __len__(self):
        return len(self.video_paths)
    
    def __getitem__(self, idx):
        video_path = self.video_paths[idx]
        label = self.labels[idx]
        
        # Load video using Decord on the GPU
        vr = VideoReader(video_path, ctx=gpu(0))  # for CPU decoding use ctx=cpu(0)

        # Sample random frame indices (unordered, possibly repeated)
        frames_idx = np.random.randint(0, len(vr), sequence_length)
        frames = vr.get_batch(frames_idx)  # returns only the requested frames
        
        # Apply any transformations (e.g., resizing, normalization)
        if self.transform:
            frames = self.transform(frames)
        
        return frames, label 

In [98]:
ucf101_dataset = DecordVideoDataset(root_dir=video_directory, transform=None)
dataloader = DataLoader(ucf101_dataset, batch_size=batch_size, shuffle=True)

In [None]:
device = 'cuda:0'
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

for i, data in enumerate(dataloader):
    video = data[0].to(device)
    label = data[1].to(device)
    #print(video.size())
    #print(label.size())
    if i == 2:
        break

start.record()
for i, data in enumerate(dataloader):
    if i == 5:
        break
    video = data[0].to(device)
    label = data[1].to(device)
    #print(video.size())
    #print(label.size())

end.record()
torch.cuda.synchronize()

decord_gpu_time = start.elapsed_time(end) / 1000  # elapsed_time is in ms
decord_gpu_time_batch = decord_gpu_time / i
print('It took %f seconds for %d batches, or %f seconds per batch' % (decord_gpu_time, i, decord_gpu_time_batch))

### PyNvVideoCodec

What is pyNvVideoCodec? As the [docs](https://docs.nvidia.com/video-technologies/pynvvideocodec/pynvc-api-prog-guide/index.html) explain: PyNvVideoCodec is a library that provides Python bindings over C++ APIs for hardware-accelerated video encoding and decoding. Internally, it utilizes core APIs of NVIDIA Video Codec SDK and provides the ease-of-use inherent to Python. It relies on an external FFmpeg library for demuxing media files.

In brief, this library exploits hardware-accelerated video decoding to speed up dataloading. It is very easy to install with `pip install pycuda pynvvideocodec`, and it will soon be even easier to use with release 2.0, which implements a built-in *seek* function.

In [74]:
import os

batch_size = 4
sequence_length = 8
stride = 1
initial_prefetch_size = 16
video_directory = "/workspace/playbooks/video/assets/UCF101"

In [79]:
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import PyNvVideoCodec as nvc
import pycuda.driver as cuda
import torch 
import numpy as np

class PyNvCVideoDataset(Dataset):
    def __init__(self, root_dir, transform):
        """
        Args:
            root_dir (string): Directory with all the videos.
            transform (callable, optional): Optional transform to be applied on a sample.
        """
        self.root_dir = root_dir
        self.transform = transform
        self.video_paths = []
        self.labels = []
        self.enable_async_allocations = False
        
        # Load video paths and labels from directory structure
        for label_idx, action in enumerate(sorted(os.listdir(root_dir))):
            action_dir = os.path.join(root_dir, action)
            if os.path.isdir(action_dir):
                for video_file in os.listdir(action_dir):
                    if video_file.endswith('.mp4'):  # Assuming videos are in .mp4 format
                        self.video_paths.append(os.path.join(action_dir, video_file))
                        self.labels.append(label_idx)
                        
    def __len__(self):
        return len(self.video_paths)

    def load_batch(self, path_to_video, batch_indices):
        # NOTE: cuda_ctx and cuda_stream_nv_dec are not created in this cell;
        # a PyCUDA context and stream with these names must exist at module level.
        start = cuda.Event()
        end = cuda.Event()

        start.record(stream=cuda_stream_nv_dec)
        demuxer = nvc.CreateDemuxer(path_to_video)
        decoder = nvc.CreateDecoder(
            gpuid=0,
            codec=demuxer.GetNvCodecId(),
            cudacontext=cuda_ctx.handle,
            cudastream=cuda_stream_nv_dec.handle,
            usedevicememory=True,
            enableasyncallocations=self.enable_async_allocations,
        )

        # Positions are 1-based because frame_count is incremented before the check
        wanted = {int(i) for i in batch_indices}
        frame_count = 0
        frames = []
        for packet in demuxer:
            for frame in decoder.Decode(packet):
                frame_count += 1
                if frame_count in wanted:
                    # from_dlpack wraps the decoded frame without a host copy;
                    # clone() keeps the data in case the decoder reuses its buffer
                    frames.append(torch.from_dlpack(frame).clone())

        # record the end event on the same stream as the start event
        end.record(stream=cuda_stream_nv_dec)
        end.synchronize()
        return torch.stack(frames) if frames else None

    def __getitem__(self, idx):
        video_path = self.video_paths[idx]
        label = self.labels[idx]
        
        # Open a demuxer to estimate the video length
        demuxer = nvc.CreateDemuxer(video_path)
        # Count packets; frame_count ends up as the index of the last packet
        for frame_count, packet in enumerate(demuxer):
            continue

        # Pick evenly spaced frame positions and decode only those
        frames_idx = np.floor(np.linspace(1, frame_count, sequence_length))
        frames = self.load_batch(video_path, frames_idx)  # returns the selected frames
        # Apply any transformations (e.g., resizing, normalization)
        if self.transform:
            frames = self.transform(frames)
        
        return frames, label

The dataloader built on this dataset is run from a script:

In [None]:
! python3 assets/pynvc_dataloader.py

Thanks to its HW + SW acceleration, CUDA zero-copy handoff to PyTorch, and flexibility, it is a very solid alternative to decord.
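
The dataset above selects evenly spaced, 1-based frame positions with `np.floor(np.linspace(1, frame_count, sequence_length))` and then scans the stream once to collect them. The sketch below reproduces that selection with integer arithmetic only, which makes the chosen positions easy to inspect. `evenly_spaced_positions` is a hypothetical name introduced for illustration.

```python
def evenly_spaced_positions(last, n):
    """Return n evenly spaced 1-based positions between 1 and last (inclusive)."""
    if n < 1 or last < 1:
        raise ValueError("n and last must be at least 1")
    if n == 1:
        return [1]
    # integer form of floor(1 + k * (last - 1) / (n - 1))
    return [1 + (k * (last - 1)) // (n - 1) for k in range(n)]


print(evenly_spaced_positions(100, 8))  # [1, 15, 29, 43, 57, 71, 85, 100]
print(evenly_spaced_positions(3, 8))    # [1, 1, 1, 1, 2, 2, 2, 3]
```

Note that a video shorter than the requested sequence length produces repeated positions. Since the decode loop visits each frame only once, such a clip would come back with fewer frames than requested, so it may be worth filtering out very short videos or padding the result.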


## Conclusions

To conclude, some libraries deliver much better throughput than others and therefore help maximize GPU utilization. In multimodal training this is crucial to deliver results in a reasonable time.

Going in order, the pair that requires the least coding is DALI and the PyTorch dataloader.

For maximum customization, decord and pyNvVideoCodec are more interesting. However, we currently see silent failures with decord-GPU, so (for now) prefer pyNvVideoCodec or decord-CPU.