# Video File Formats

> A video file format is a type of file format for storing digital video data on a computer system. Video is almost always stored using lossy compression to reduce the file size. *wikipedia*

Normally a video is stored in a **container** that holds the visual data and the audio data. The container can also carry other useful information such as metadata, titles, and subtitles.

In most cases, to save storage space, the visual and audio data are stored with lossy compression.

More details and comparisons:
 1. [Video File Format](https://en.wikipedia.org/wiki/Video_file_format)
 2. [Comparison of Container Formats](https://en.wikipedia.org/wiki/Comparison_of_video_container_formats)

# Video Data Handling

There are two aspects to consider for accelerated video data handling in the context of machine learning: the availability of hardware-accelerated decoding and efficient video frame indexing.

## Hardware Accelerated Decoding

We intentionally focus on decoding here. To train a VLM we need to read terabytes of data, so accelerated data reading is essential to complete training runs in a reasonable time.

The GPU is well known as hardware that crunches large parallel workloads very quickly. What is less well known is that the GPU also contains specialized units separate from the CUDA cores, including video decoding engines (NVDEC) and, in many GPUs, video encoding engines (NVENC).

![image](assets/nvcodec.jpg)

A typical setup is to use FFmpeg to decode the video data on the CPU before sending it to the GPU for model training. This can be a bottleneck; to improve it, we can compile FFmpeg with NVIDIA GPU acceleration so that FFmpeg performs hardware-accelerated video decoding.

More details on [NVCODEC Docs](https://docs.nvidia.com/video-technologies/video-codec-sdk/12.2/index.html#ffmpeg-with-nvidia-gpu).
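
Before relying on GPU decoding it helps to check what the local FFmpeg build supports. The sketch below is a minimal example, assuming `ffmpeg` is on the `PATH`; it runs `ffmpeg -hwaccels` and looks for `cuda` in the output. The helper names `parse_hwaccels` and `available_hwaccels` are illustrative, not part of any library.

```python
import shutil
import subprocess


def parse_hwaccels(output: str) -> list[str]:
    """Parse the text printed by `ffmpeg -hwaccels` into a list of method names."""
    lines = [line.strip() for line in output.splitlines()]
    # Skip blank lines and the "Hardware acceleration methods:" header.
    return [line for line in lines if line and not line.endswith(":")]


def available_hwaccels(ffmpeg: str = "ffmpeg") -> list[str]:
    """Return the hardware acceleration methods of the local FFmpeg build."""
    if shutil.which(ffmpeg) is None:
        return []  # FFmpeg is not installed or not on PATH
    result = subprocess.run(
        [ffmpeg, "-hide_banner", "-hwaccels"],
        capture_output=True, text=True, check=True,
    )
    return parse_hwaccels(result.stdout)


if __name__ == "__main__":
    methods = available_hwaccels()
    print("cuda available:", "cuda" in methods)
```

A `cuda` entry only says that the build was compiled with the hardware acceleration method; whether decoding actually works still depends on the driver and the GPU.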

## Indexing-Friendly Video Formats

Formats that use keyframes and support random access tend to be more indexing-friendly. These allow for quicker seeking to specific points in the video without having to decode all preceding frames. The efficiency of indexing can also depend on how the video was encoded within the container. Videos with regular keyframe intervals are generally easier to index and seek through.

Some container formats known for good indexing support include:
 - MP4 (MPEG-4 Part 14)
 - MKV (Matroska)
 - AVI (Audio Video Interleave)
 - MOV (QuickTime File Format)

Side note: the efficiency of indexing also depends on the video processing library or tool being used, not just on the properties of the video format itself.
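
To see why regular keyframe intervals matter, the following sketch models the cost of a seek in plain Python. It is a simplification that ignores frame reordering and assumes the decoder restarts at the nearest keyframe at or before the target frame; the function names are made up for this illustration.

```python
from bisect import bisect_right


def keyframe_positions(num_frames: int, gop_size: int) -> list[int]:
    """Frame indices of keyframes for a fixed keyframe interval (GOP size)."""
    return list(range(0, num_frames, gop_size))


def frames_to_decode(target: int, keyframes: list[int]) -> int:
    """Number of frames decoded in order to display `target`.

    Decoding restarts at the nearest keyframe at or before `target`
    and proceeds forward until `target` is reached.
    """
    idx = bisect_right(keyframes, target) - 1
    if idx < 0:
        raise ValueError("no keyframe at or before the target frame")
    return target - keyframes[idx] + 1


keyframes = keyframe_positions(num_frames=300, gop_size=30)
print(frames_to_decode(95, keyframes))  # decoding restarts at keyframe 90
print(frames_to_decode(95, [0]))        # single keyframe: decode from frame 0
```

With a keyframe every 30 frames the seek to frame 95 decodes 6 frames; with a single keyframe at the start it decodes 96.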


## Video Preprocessing

In many datasets we find videos with different container formats, resolutions, and frame rates. In addition, our machine learning training usually expects a fixed input resolution (*e.g.* 256x128).

Assuming you have already overcome the challenge of downloading or transferring your dataset to the cluster, we can move on to the video preprocessing stage.

To get detailed information about a video we can use
```bash
ffmpeg -i input.mp4
```
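
For scripted use, the same information can be read in machine-readable form. The sketch below is one possible approach, assuming `ffprobe` (shipped with FFmpeg) is on the `PATH`; `probe` and `summarize` are hypothetical helper names.

```python
import json
import subprocess


def probe(path: str) -> dict:
    """Run ffprobe on `path` and return its JSON report as a dict."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_format", "-show_streams",
         "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def summarize(report: dict) -> dict:
    """Extract the container name and the codec, size and frame rate
    of the first video stream from an ffprobe JSON report."""
    video = next(s for s in report["streams"] if s["codec_type"] == "video")
    num, den = video["avg_frame_rate"].split("/")
    return {
        "container": report["format"]["format_name"],
        "codec": video["codec_name"],
        "width": video["width"],
        "height": video["height"],
        "fps": int(num) / int(den) if int(den) else None,
    }
```

Calling `summarize(probe("input.mp4"))` on each file gives a quick overview of which resolutions and frame rates a dataset contains.
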
We can preprocess our videos with

```bash
ffmpeg -y -vsync 0 -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 -vf scale_cuda=1280:720 -c:a copy -c:v h264_nvenc -b:v 5M output.mp4
```
where we tell ffmpeg to
 - ```-y```: Overwrite the output file without asking for confirmation.
 - ```-vsync 0```: Pass frames through with their original timestamps, without duplicating or dropping frames.
 - ```-hwaccel cuda```: Use CUDA hardware acceleration for decoding.
 - ```-hwaccel_output_format cuda```: Set the output format of the hardware-accelerated decoder to CUDA, keeping the frames on the GPU.
 - ```-i input.mp4```: Specify the input file.
 - ```-vf scale_cuda=1280:720```: Resize the video to 1280x720 on the GPU. The CPU filter `-vf scale=1280:720` would not work here because `-hwaccel_output_format cuda` keeps the decoded frames in GPU memory. This assumes an FFmpeg build that includes the `scale_cuda` filter.
 - ```-c:a copy```: Copy the audio stream without re-encoding.
 - ```-c:v h264_nvenc```: Use the NVIDIA H.264 encoder for video encoding.
 - ```-b:v 5M```: Set the video bitrate to 5 Mbps.
 - ```output.mp4```: Specify the output file name.

We need a script to preprocess the whole dataset rather than processing one video at a time by hand.
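
A minimal sketch of such a script is shown here. It assumes an FFmpeg build with CUDA support and the `scale_cuda` filter, and it assumes the dataset lives in a directory tree of `.mp4`, `.mkv`, `.avi` and `.mov` files; the function names and the extension list are illustrative choices, not a fixed recipe.

```python
import subprocess
from pathlib import Path

VIDEO_EXTENSIONS = {".mp4", ".mkv", ".avi", ".mov"}


def build_command(src: Path, dst: Path, width: int = 1280, height: int = 720) -> list[str]:
    """Build the FFmpeg command that transcodes `src` to `dst` on the GPU."""
    return [
        "ffmpeg", "-y", "-vsync", "0",
        "-hwaccel", "cuda", "-hwaccel_output_format", "cuda",
        "-i", str(src),
        "-vf", f"scale_cuda={width}:{height}",  # assumption: build includes scale_cuda
        "-c:a", "copy", "-c:v", "h264_nvenc", "-b:v", "5M",
        str(dst),
    ]


def find_videos(input_dir: Path) -> list[Path]:
    """Recursively list video files under `input_dir`, sorted for reproducibility."""
    return sorted(p for p in input_dir.rglob("*")
                  if p.is_file() and p.suffix.lower() in VIDEO_EXTENSIONS)


def preprocess_dataset(input_dir: Path, output_dir: Path) -> None:
    """Transcode every video under `input_dir` into `output_dir` as MP4."""
    for src in find_videos(input_dir):
        dst = (output_dir / src.relative_to(input_dir)).with_suffix(".mp4")
        dst.parent.mkdir(parents=True, exist_ok=True)
        result = subprocess.run(build_command(src, dst), capture_output=True, text=True)
        if result.returncode != 0:
            # Report the failure and continue with the remaining files.
            print(f"failed: {src}\n{result.stderr[-500:]}")
```

A call such as `preprocess_dataset(Path("raw"), Path("processed"))` mirrors the input directory layout in the output directory. Copying the audio stream can fail when the source codec is not allowed in MP4, so failed files are reported rather than aborting the run.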

NVIDIA DALI can read video files and extract specific frame sequences from them. The key points on how to use it are:

## Video Reading and Seeking

DALI provides video reading capabilities through the `readers.video` operator. This operator allows you to load and decode video files, with options to control which frames are extracted.

### Key Parameters for Frame Seeking

1. **sequence_length**: Specifies the number of consecutive frames to extract in each sequence[1][2].

2. **stride**: Controls the distance between consecutive frames within a sequence. For example, setting `stride=2` takes every other frame[4].

3. **step**: Determines the distance between the first frame of each consecutive sequence[2].

4. **initial_fill**: Sets the size of the buffer used for shuffling when `random_shuffle` is enabled[1].
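
The interplay of `sequence_length`, `stride` and `step` can be sketched in plain Python. This is a model of the parameter semantics as described in the documentation, not DALI's implementation; it assumes sequences start at frame 0 and that a negative `step` falls back to `sequence_length`.

```python
def sequence_frame_indices(num_frames: int, sequence_length: int,
                           stride: int = 1, step: int = -1) -> list[list[int]]:
    """Enumerate the frame indices of each sequence cut from one video.

    stride: distance between consecutive frames inside a sequence.
    step:   distance between the first frames of consecutive sequences
            (a negative value falls back to sequence_length here).
    """
    if sequence_length < 1 or stride < 1 or step == 0:
        raise ValueError("sequence_length and stride must be >= 1, step must be non-zero")
    if step < 0:
        step = sequence_length
    span = (sequence_length - 1) * stride + 1  # frames covered by one sequence
    sequences = []
    start = 0
    while start + span <= num_frames:
        sequences.append([start + i * stride for i in range(sequence_length)])
        start += step
    return sequences


for seq in sequence_frame_indices(num_frames=20, sequence_length=4, stride=2, step=5):
    print(seq)
```

For a 20-frame video this prints `[0, 2, 4, 6]`, `[5, 7, 9, 11]` and `[10, 12, 14, 16]`: `stride` spaces the frames inside a sequence, while `step` spaces the sequence starts.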

### Example Usage

Here's a basic example of how to use the `readers.video` operator to extract frames:

```python
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

@pipeline_def
def video_pipe(filenames):
    # With `filenames` and no `labels` argument the reader yields only the frames.
    videos = fn.readers.video(
        device="gpu",
        filenames=filenames,
        sequence_length=16,  # Extract 16 frames per sequence
        stride=2,  # Take every other frame within a sequence
        step=32,   # Start of each sequence is 32 frames apart
        random_shuffle=True,
        initial_fill=1024
    )
    return videos
```

## Seeking to Specific Frames

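While DALI doesn't have a direct "seek to frame" function, you can achieve similar results by combining the reader's parameters:

1. To start from a specific frame, list the video in a `file_list` with a start and end frame and set `file_list_frame_num=True`[2]. The `step` parameter only sets the spacing between sequence starts; it does not move the first sequence.
2. To control the spacing between extracted frames, use the `stride` parameter.
3. To extract a certain number of frames, set the `sequence_length` parameter.

For example, to start at frame 100 and extract every 5th frame for a total of 10 frames (frames 100, 105, ..., 145), assuming a hypothetical `file_list.txt` that contains the line `input.mp4 0 100 150` (file, label, start frame, end frame):

```python
videos, labels = fn.readers.video(
    device="gpu",
    file_list="file_list.txt",  # hypothetical file: "input.mp4 0 100 150"
    file_list_frame_num=True,   # Interpret 100 and 150 as frame numbers
    sequence_length=10,  # Extract 10 frames
    stride=5,            # Take every 5th frame
)
```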

## Additional Considerations

- DALI uses FFmpeg for video parsing, which allows it to handle most common video container formats[1][2].
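- For containers that do not support indexing (such as MPEG), the experimental reader builds its own keyframe index so that it can seek effectively, even for variable frame rate videos[1].
- The `fn.readers.video` operator supports only the GPU backend and decodes with NVDEC, while the experimental `fn.experimental.readers.video` operator supports both CPU and GPU backends[1][2].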

By adjusting these parameters, you can effectively control which frames are extracted from your video files, allowing for flexible frame seeking and extraction in your DALI pipeline.

Citations:
[1] https://docs.nvidia.com/deeplearning/dali/user-guide/docs/operations/nvidia.dali.fn.experimental.readers.video.html
[2] https://docs.nvidia.com/deeplearning/dali/user-guide/docs/operations/nvidia.dali.fn.readers.video.html
[3] https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/sequence_processing/sequence_reader_simple_example.html
[4] https://stackoverflow.com/questions/70058109/how-to-extract-frames-from-videos-at-a-certain-fps-ex-15fps-using-nvidia-dali
[5] https://github.com/NVIDIA/VideoProcessingFramework/issues/218
[6] https://www.run.ai/guides/ai-open-source-projects/nvidia-dali
[7] https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/sequence_processing/index.html
[8] https://github.com/NVIDIA/DALI/issues/2604

## Useful Resources

https://github.com/NVIDIA/flownet2-pytorch

https://github.com/NVIDIA/DALI/blob/040b354f8d3bad417f36a973e2249e4d46174273/docs/examples/sequence_processing/video/video_processing_per_frame_arguments.ipynb

https://github.com/webdataset/webdataset

https://developer.nvidia.com/blog/av1-encoding-and-fruc-video-performance-boosts-and-higher-fidelity-on-the-nvidia-ada-architecture/

https://docs.nvidia.com/video-technologies/pynvvideocodec/pynvc-api-prog-guide/index.html#interoperability

https://developer.nvidia.com/pynvvideocodec

https://gitlab-master.nvidia.com/DeepStreamSDK/deepstream_libraries/-/blob/main/segmentation/main.py?ref_type=heads

# DALI discussion

Could FFmpeg become a bottleneck for DALI because it runs on the CPU?

FFmpeg running on the CPU can potentially be a bottleneck for NVIDIA DALI's video processing pipeline. The key points are:

1. DALI's video reader uses FFmpeg for parsing video containers and extracting encoded video frames[1][2]. This initial parsing step typically occurs on the CPU.

2. While DALI can leverage GPU acceleration for decoding and processing video frames, the initial FFmpeg parsing step may still rely on CPU resources[1][2].

3. For certain video container formats that don't support efficient indexing (like MPEG), DALI may need to use FFmpeg to seek through the video when extracting specific sequences, which can be CPU-intensive[1][2].

4. The impact of FFmpeg running on CPU can vary depending on factors like video format, container type, and seeking requirements. For some workflows, it may not be a significant bottleneck, while for others it could limit overall performance.

5. To mitigate potential CPU bottlenecks:

   - Use video formats and containers that support efficient indexing when possible.
   - Adjust DALI pipeline parameters like `prefetch_queue_depth` to balance CPU and GPU workloads[1][2].
   - Consider pre-processing videos to more GPU-friendly formats if the same videos will be used repeatedly in training.

6. DALI does use hardware-accelerated decoding (NVDEC) when available, which can significantly reduce CPU load for the actual frame decoding step[2].

While FFmpeg's CPU usage can be a consideration, DALI's overall design aims to optimize the video processing pipeline by leveraging GPU acceleration where possible. The extent to which FFmpeg on CPU becomes a bottleneck will depend on your specific use case and video characteristics.
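
One way to find out whether data loading limits a given setup is to time the loader on its own, without any training step. The sketch below is framework-agnostic and uses a dummy generator in place of a real DALI iterator; `measure_throughput` and `slow_loader` are hypothetical names for this illustration.

```python
import time
from typing import Iterable


def measure_throughput(batches: Iterable, max_batches: int = 100) -> float:
    """Return batches per second when iterating `batches` with no training step."""
    start = time.perf_counter()
    count = 0
    for _ in batches:
        count += 1
        if count >= max_batches:
            break
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


def slow_loader(n: int, delay_s: float):
    """Stand-in for a data loader: yields n dummy batches with a fixed delay."""
    for i in range(n):
        time.sleep(delay_s)
        yield i


rate = measure_throughput(slow_loader(20, 0.01))
print(f"{rate:.1f} batches/s")
```

If the loader alone runs at roughly the same rate as the full training loop, the input pipeline is likely the limiting stage; if it is much faster, the bottleneck is elsewhere.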

Citations:
[1] https://docs.nvidia.com/deeplearning/dali/user-guide/docs/operations/nvidia.dali.fn.experimental.readers.video.html
[2] https://docs.nvidia.com/deeplearning/dali/user-guide/docs/operations/nvidia.dali.fn.readers.video.html
[3] https://stackoverflow.com/questions/78646875/ffmpeg-cpu-usage
[4] https://github.com/NVIDIA/DALI/issues/1012
[5] https://docs.nvidia.com/video-technologies/video-codec-sdk/12.0/ffmpeg-with-nvidia-gpu/index.html
[6] https://stackoverflow.com/questions/70058109/how-to-extract-frames-from-videos-at-a-certain-fps-ex-15fps-using-nvidia-dali
[7] https://www.reddit.com/r/ffmpeg/comments/1734nf2/why_is_my_cpu_used_a_ton_when_encoding_with/
[8] https://www.run.ai/guides/ai-open-source-projects/nvidia-dali
