Best way to dynamically letterbox video frames with longest dim as parameter? #2832

Closed
dwrodri opened this issue Mar 30, 2021 · 5 comments
Labels: question (Further information is requested)


dwrodri commented Mar 30, 2021

I am trying to implement a DALI pipeline which will feed frames of video into a single-shot object detector written in PyTorch. Specifically, I am trying to replicate the LoadImages class from YOLOv5 using only GPU operations. The preprocessing pipeline I'm referencing performs two main operations in the following order:

  1. Resize the largest dimension to 640px while preserving the aspect ratio of the footage.
  2. Letterbox every frame of the footage so that its width and height are multiples of a stride. For example, footage scaled down to 360x640 with an input shape of 640x640 and a stride of 32 gives a final height of ((640 - 360) % 32) + 360 = 384. This isn't the most straightforward approach, I know, but for the purposes of this question (and my own sanity) I am not questioning this formula.
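
To double-check the arithmetic, here it is as plain Python (my own illustration; the function name is made up):

def letterboxed_dim(resized: int, target: int = 640, stride: int = 32) -> int:
    # since target is a multiple of stride, (target - resized) % stride is
    # exactly the padding needed to reach the next multiple of stride
    return (target - resized) % stride + resized

print(letterboxed_dim(360))  # 384, matching the example above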

So, operation 1 is pretty well-documented and can be done by calling fn.readers.video_resize with the right arguments. I'm struggling to figure out how to perform operation 2.

My first attempt involved capturing the frame dimensions using fn.shapes, but I couldn't find the "obvious" way of padding opposite sides of the frame. Here was my first attempt:

from typing import List

import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def


@pipeline_def(batch_size=1, seed=42, device_id=0, num_threads=1)
def tagger_pipeline(
    source: List[str], num_frames: int, new_size: int = 640, fill_color: int = 114
):
    """
    Decode, resize, and letterbox frames from videos using only GPU operations.

    By default, YOLOv5 assumes all dimensions of a frame are multiples of 32, so
    we letterbox to guarantee the image boundaries fall on the edges of a filter step.
    """
    # input shape of the model
    new_shape = types.Constant([new_size, new_size], dtype=types.INT64, device="gpu")
    # stride of the first layer in the feature encoder
    stride = types.ScalarConstant(32, dtype=types.INT64)  # type: ignore

    ###########
    # Operation 1
    ###########
    frames = fn.readers.video_resize(  # type: ignore
        sequence_length=num_frames,
        filenames=source,
        skip_vfr_check=True,
        size=640,
        mode="not_larger",
        dtype=types.FLOAT,
        device="gpu",
    )
    
    ###########
    # Operation 2
    ###########
    shape = fn.shapes(frames, dtype=types.FLOAT)  # type: ignore
    new_unpad = fn.slice(  # type: ignore
        shape,
        axes=[0],
        start=1,
        end=3,
    )

    height_width = fn.slice(shape, axes=0, start=1, end=3)
    height = fn.slice(height_width, axes=0, start=0, end=1)
    width = fn.slice(height_width, axes=0, start=1, end=2)

    # DALI doesn't support modulo operations, so I do this instead
    pre_modulus = new_shape - height_width
    dh_dw = pre_modulus - stride * fn.cast(  # type: ignore
        pre_modulus // stride - 0.5, dtype=types.INT64  # type: ignore
    )

    # cut padding in half because we're doing half on each side
    dh = fn.slice(dh_dw, axes=0, start=0, end=1) // 2  # type: ignore
    dw = fn.slice(dh_dw, axes=0, start=1, end=2) // 2  # type: ignore

    # BUG: dh and width are DataNodes, but types.Constant needs concrete
    # integers in its shape; this is what triggers the TypeError below
    height_half_pad = types.Constant(
        value=fill_color, shape=[num_frames, dh, width, 3], dtype=types.FLOAT, device="gpu"  # type: ignore
    )
    width_half_pad = types.Constant(
        value=fill_color,
        shape=[num_frames, height+dh, dw, 3],  # type: ignore
        dtype=types.FLOAT, # type:ignore
        device="gpu",
    )
    frames = fn.cat(height_half_pad, frames, fn.copy(height_half_pad), axis=1)
    frames = fn.cat(width_half_pad, frames, fn.copy(width_half_pad), axis=2)
    
    # put channels at the front because that is what the model expects
    frames = fn.transpose(frames, perm=(0, 3, 1, 2))

    return frames

This code doesn't run because I'm constructing a dali.types.Constant whose shape argument contains DataNodes, which results in the following error:

TypeError: int() argument must be a string, a bytes-like object or a number, not 'DataNode'

So far, I can come up with five potential alternatives that would let me keep doing this without any CPU-only operations:

  • Write my own letterboxing operation using CuPy
  • Try to combine paste and slice
  • Use some other library to extract the frame dimensions prior to building the pipeline, and pass in everything needed to make the half pads proper types.Constants
  • Have one pipeline for each operation and see if I can feed the outputs from the pipeline for operation 1 into operation 2
  • Use DALI for operation 1 and then letterbox with PyTorch ops (sketched below)
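
For the last option, here's a minimal sketch of the PyTorch-side letterboxing (my own illustration, assuming the frames arrive as an FCHW float tensor and that dh/dw are the per-side pad amounts from the formula above):

import torch
import torch.nn.functional as F

def letterbox(frames: torch.Tensor, dh: int, dw: int, fill: float = 114.0) -> torch.Tensor:
    # F.pad pads the trailing dimensions first: (left, right, top, bottom)
    return F.pad(frames, (dw, dw, dh, dh), mode="constant", value=fill)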

I don't see how I can use fn.pad on its own because it only places fill values at the end of each axis, and fn.transforms.translate is CPU-only, according to the documentation.

I'd really like to be able to pass in a folder of videos where each video has a different aspect ratio, and have all the preprocessing happen on the GPU, but getting this working with just one aspect ratio that isn't known ahead of time is the first step.

Environment

  • Python 3.8.6, compiled with GCC 10.2
  • DALI version 1.0.0 install via pip
  • CUDA Toolkit version 11.1

mzient (Contributor) commented Mar 31, 2021

Unfortunately, DALI executes GPU operators strictly after CPU operators. This means that fn.shapes on a video will produce the shape on the GPU, which, admittedly, is not very useful: we avoid transferring tensors from the GPU back to the CPU, so the parameters which affect output shapes must be CPU tensors. Allowing a GPU operator to produce a CPU output (which would only be usable by GPU operators due to ordering, but still usable as a named argument) would require a significant rework of how our pipelines are built and validated.

Having said all that, there's one workaround: you can pad your data to a given alignment (which produces a letterbox at the bottom and to the right) and then use warp_affine with a specially crafted matrix that shifts the images by just half of the padding:

    videos = fn.readers.video_resize(  # type: ignore
        sequence_length=num_frames,
        filenames=source,
        skip_vfr_check=True,
        size=640,
        mode="not_larger",
        dtype=types.FLOAT,
        device="gpu",
    )

    # pad H and W up to a multiple of 32 (use your preferred alignment here);
    # fn.pad puts the fill values at the bottom and to the right
    padded = fn.pad(videos, axis_names="HW", shape=[1, 1], align=32)
    # half of the added padding, as (negative) float offsets
    shift = fn.cast((fn.shapes(videos) - fn.shapes(padded)) // 2, dtype=types.FLOAT)
    dx = fn.slice(shift, 2, 1, axes=[0])  # horizontal offset (W axis)
    dy = fn.slice(shift, 1, 1, axes=[0])  # vertical offset (H axis)
    shift = fn.stack(dx, dy, 0.0, axis=0)
    # 3x4 affine matrix: identity plus the translation column (np is numpy)
    matrix = fn.cat(np.identity(3, dtype=np.float32), shift, axis=1)
    # warp_affine doesn't handle the frame dimension, so treat frames (F) as depth (D)
    as_volume = fn.reshape(padded, layout="DHWC")
    warped = fn.warp_affine(as_volume, matrix, fill_value=0, interp_type=types.INTERP_NN)
    letterboxed = fn.reshape(warped, layout="FHWC")
    

Note that warp_affine doesn't natively support video sequences; that's why the frame dimension is reinterpreted as depth and a 3D warp is used (the depth/frame dimension is left untouched).
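
For reference, a hypothetical way to run the snippet above once it's wrapped in a @pipeline_def-decorated function (letterbox_pipeline is a placeholder name):

pipe = letterbox_pipeline(batch_size=1, device_id=0, num_threads=1)
pipe.build()
(frames,) = pipe.run()  # letterboxed FHWC frames, still on the GPU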

@JanuszL added the question label on Mar 31, 2021

dwrodri (Author) commented Mar 31, 2021

Thank you for the detailed response! Since I was in a pinch, I currently have an implementation where I collect frame dimensions using python-ffmpeg and create types.Constant objects at the start of the pipeline constructor. I will probably add this approach to my code when I get the chance.
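
In case it's useful, the probing step is roughly this (a sketch assuming the ffmpeg-python package; the helper name is mine):

import ffmpeg

def probe_dims(path: str):
    """Return (height, width) of the first video stream in the file."""
    info = ffmpeg.probe(path)
    video = next(s for s in info["streams"] if s["codec_type"] == "video")
    return int(video["height"]), int(video["width"])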

Does the ordering requirement affect whether the fn.transforms module will ever get GPU support? This is a really great project, so if I can find a justification to allocate my professional dev time, I definitely wouldn't mind submitting a PR to make this corner of the API a bit better. Keeping GPU ops last makes sense to me. I haven't tested this, but I assume you can just throw an error when someone attempts to violate the ordering at the time Pipeline.build() is called.

Anyways, keep up the great work!


JanuszL (Contributor) commented Mar 31, 2021

Hi @dwrodri,

> fn.transforms module will ever get GPU support

We don't have that on our roadmap right now. Performance-wise it wouldn't yield much benefit, as the operations are pretty simple; the only gain would be accepting data that is already on the GPU. If you think the community would benefit from it, feel free to create a PR that adds such functionality.

> I haven't tested this, but I assume that you can just throw an error when someone attempts to violate the ordering at the time that Pipeline.build()

DALI would yield ValueError: An operator with device='cpu' cannot accept GPU inputs.


dwrodri (Author) commented Mar 31, 2021

I'll go ahead and close this issue as it has been answered. Thanks again for the fast and thorough responses!


romanmaznikov1 commented Oct 14, 2022

> I'll go ahead and close this issue as it has been answered. Thanks again for the fast and thorough responses!

@dwrodri Hi! Please tell me how you solved the problem with the TypeError: int() argument must be a string, a bytes-like object or a number, not 'DataNode'. What does your tagger_pipeline function look like in the end?
