Best way to dynamically letterbox video frames with longest dim as parameter? #2832

Closed
dwrodri opened this issue Mar 30, 2021 · 5 comments
Labels: question (Further information is requested)


dwrodri commented Mar 30, 2021

I am trying to implement a DALI pipeline which will feed frames of video into a single-shot object detector written in PyTorch. Specifically, I am trying to replicate the LoadImages class from YOLOv5 using only GPU operations. The preprocessing pipeline I'm referencing performs two main operations in the following order:

  1. Resize the largest dimension to 640px while preserving the aspect ratio of the footage.
  2. Letterbox every frame of the footage so that its width and height are multiples of a stride. For example, footage scaled down to 360x640 with an input shape of 640x640 and a stride of 32 gives a final height of ((640 - 360) % 32) + 360 = 384. This isn't the most straightforward approach, I know, but for the purposes of this question (and my own sanity) I am not questioning this formula.
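
To double-check the arithmetic, here it is as plain Python (my own illustration; the function name is made up):

def letterboxed_dim(resized: int, target: int = 640, stride: int = 32) -> int:
    # since target is a multiple of stride, (target - resized) % stride is
    # exactly the padding needed to reach the next multiple of stride
    return (target - resized) % stride + resized

print(letterboxed_dim(360))  # 384, matching the example above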

So, operation 1 is pretty well-documented and can be done by calling fn.readers.video_resize with the right arguments. I'm struggling to figure out how to perform operation 2.

My first attempt involved capturing the frame dimensions using fn.shapes, but I couldn't find the "obvious" way of padding opposite sides of the frame. Here was my first attempt:

from typing import List

import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def


@pipeline_def(batch_size=1, seed=42, device_id=0, num_threads=1)
def tagger_pipeline(
    source: List[str], num_frames: int, new_size: int = 640, fill_color: int = 114
):
    """
    Decode, resize, and letterbox frames from videos using only GPU operations.

    By default, YOLOv5 assumes all dimensions of a frame are multiples of 32, so
    we letterbox to guarantee the image boundaries fall on the edges of a filter step.
    """
    # input shape of the model
    new_shape = types.Constant([new_size, new_size], dtype=types.INT64, device="gpu")
    # stride of the first layer in the feature encoder
    stride = types.ScalarConstant(32, dtype=types.INT64)  # type: ignore

    ###########
    # Operation 1
    ###########
    frames = fn.readers.video_resize(  # type: ignore
        sequence_length=num_frames,
        filenames=source,
        skip_vfr_check=True,
        size=640,
        mode="not_larger",
        dtype=types.FLOAT,
        device="gpu",
    )
    
    ###########
    # Operation 2
    ###########
    shape = fn.shapes(frames, dtype=types.FLOAT)  # type: ignore
    new_unpad = fn.slice(  # type: ignore
        shape,
        axes=[0],
        start=1,
        end=3,
    )

    height_width = fn.slice(shape, axes=0, start=1, end=3)
    height = fn.slice(height_width, axes=0, start=0, end=1)
    width = fn.slice(height_width, axes=0, start=1, end=2)

    # DALI doesn't support modulo operations, so I do this instead
    pre_modulus = new_shape - height_width
    dh_dw = pre_modulus - stride * fn.cast(  # type: ignore
        pre_modulus // stride - 0.5, dtype=types.INT64  # type: ignore
    )

    # cut padding in half because we're doing half on each side
    dh = fn.slice(dh_dw, axes=0, start=0, end=1) // 2  # type: ignore
    dw = fn.slice(dh_dw, axes=0, start=1, end=2) // 2  # type: ignore

    # BUG: dh and width are DataNodes, but types.Constant needs concrete
    # integers in its shape; this is what triggers the TypeError below
    height_half_pad = types.Constant(
        value=fill_color, shape=[num_frames, dh, width, 3], dtype=types.FLOAT, device="gpu"  # type: ignore
    )
    width_half_pad = types.Constant(
        value=fill_color,
        shape=[num_frames, height+dh, dw, 3],  # type: ignore
        dtype=types.FLOAT, # type:ignore
        device="gpu",
    )
    frames = fn.cat(height_half_pad, frames, fn.copy(height_half_pad), axis=1)
    frames = fn.cat(width_half_pad, frames, fn.copy(width_half_pad), axis=2)
    
    # put channels at the front because that is what the model expects
    frames = fn.transpose(frames, perm=(0, 3, 1, 2))

    return frames

This code doesn't run because I'm constructing a dali.types.Constant whose shape argument contains DataNodes, which results in the following error:

TypeError: int() argument must be a string, a bytes-like object or a number, not 'DataNode'

So far, I can come up with five potential alternatives that would let me keep doing this without any CPU-only operations:

  • Write my own letterboxing operation using CuPy
  • Try to combine paste and slice
  • Use some other library to extract the frame dimensions prior to building the pipeline, and pass in everything needed to make the half pads proper types.Constants
  • Have one pipeline for each operation and see if I can feed the outputs from the pipeline for operation 1 into operation 2
  • Use DALI for operation 1 and then letterbox with PyTorch ops (sketched below)
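
For the last option, here's a minimal sketch of the PyTorch-side letterboxing (my own illustration, assuming the frames arrive as an FCHW float tensor and that dh/dw are the per-side pad amounts from the formula above):

import torch
import torch.nn.functional as F

def letterbox(frames: torch.Tensor, dh: int, dw: int, fill: float = 114.0) -> torch.Tensor:
    # F.pad pads the trailing dimensions first: (left, right, top, bottom)
    return F.pad(frames, (dw, dw, dh, dh), mode="constant", value=fill)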

I don't see how I can use fn.pad on its own because it only places fill values at the end of each axis, and fn.transforms.translate is CPU-only, according to the documentation.

I'd really like to be able to pass in a folder of videos where each video has a different aspect ratio, and have all the preprocessing happen on the GPU, but getting this working with just one aspect ratio that isn't known ahead of time is the first step.

Environment

  • Python 3.8.6, compiled with GCC 10.2
  • DALI version 1.0.0 install via pip
  • CUDA Toolkit version 11.1

mzient (Contributor) commented Mar 31, 2021

Unfortunately, DALI executes GPU operators strictly after CPU operators. This means that fn.shapes on a video will produce the shape on the GPU, which, admittedly, is not very useful: we avoid transferring tensors from the GPU back to the CPU, so the parameters which affect output shapes must be CPU tensors. Allowing a GPU operator to produce a CPU output (which would only be usable by GPU operators due to ordering, but still usable as a named argument) would require a significant rework of how our pipelines are built and validated.

Having said all that, there's one workaround: you can pad your data to a given alignment (which produces a letterbox at the bottom and to the right) and then use warp_affine with a specially crafted matrix that shifts the images by just half of the padding:

    videos = fn.readers.video_resize(  # type: ignore
        sequence_length=num_frames,
        filenames=source,
        skip_vfr_check=True,
        size=640,
        mode="not_larger",
        dtype=types.FLOAT,
        device="gpu",
    )

    # pad H and W up to a multiple of 32 (use your preferred alignment here);
    # fn.pad puts the fill values at the bottom and to the right
    padded = fn.pad(videos, axis_names="HW", shape=[1, 1], align=32)
    # half of the added padding, as (negative) float offsets
    shift = fn.cast((fn.shapes(videos) - fn.shapes(padded)) // 2, dtype=types.FLOAT)
    dx = fn.slice(shift, 2, 1, axes=[0])  # horizontal offset (W axis)
    dy = fn.slice(shift, 1, 1, axes=[0])  # vertical offset (H axis)
    shift = fn.stack(dx, dy, 0.0, axis=0)
    # 3x4 affine matrix: identity plus the translation column (np is numpy)
    matrix = fn.cat(np.identity(3, dtype=np.float32), shift, axis=1)
    # warp_affine doesn't handle the frame dimension, so treat frames (F) as depth (D)
    as_volume = fn.reshape(padded, layout="DHWC")
    warped = fn.warp_affine(as_volume, matrix, fill_value=0, interp_type=types.INTERP_NN)
    letterboxed = fn.reshape(warped, layout="FHWC")
    

Note that warp_affine doesn't natively support video sequences; that's why the frame dimension is reinterpreted as depth and a 3D warp is used (the depth/frame dimension is left untouched).
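
For reference, a hypothetical way to run the snippet above once it's wrapped in a @pipeline_def-decorated function (letterbox_pipeline is a placeholder name):

pipe = letterbox_pipeline(batch_size=1, device_id=0, num_threads=1)
pipe.build()
(frames,) = pipe.run()  # letterboxed FHWC frames, still on the GPU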

@JanuszL added the question label on Mar 31, 2021

dwrodri (Author) commented Mar 31, 2021

Thank you for the detailed response! Since I was in a pinch, I currently have an implementation where I collect frame dimensions using python-ffmpeg and create types.Constant objects at the start of the pipeline constructor. I will probably add this approach to my code when I get the chance.
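
In case it's useful, the probing step is roughly this (a sketch assuming the ffmpeg-python package; the helper name is mine):

import ffmpeg

def probe_dims(path: str):
    """Return (height, width) of the first video stream in the file."""
    info = ffmpeg.probe(path)
    video = next(s for s in info["streams"] if s["codec_type"] == "video")
    return int(video["height"]), int(video["width"])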

Does the ordering requirement affect whether the fn.transforms module will ever get GPU support? This is a really great project, so if I can find a justification to allocate my professional dev time, I definitely wouldn't mind submitting a PR to make this corner of the API a bit better. Keeping GPU ops last makes sense to me. I haven't tested this, but I assume you can just throw an error when someone attempts to violate the ordering at the time Pipeline.build() is called.

Anyways, keep up the great work!


JanuszL (Contributor) commented Mar 31, 2021

Hi @dwrodri,

> fn.transforms module will ever get GPU support

We don't have that on our roadmap right now. Performance-wise it wouldn't yield much benefit, as the operations are pretty simple; the only gain would be accepting data that is already on the GPU. If you think the community would benefit from it, feel free to create a PR that adds such functionality.

> I haven't tested this, but I assume that you can just throw an error when someone attempts to violate the ordering at the time that Pipeline.build()

DALI would yield ValueError: An operator with device='cpu' cannot accept GPU inputs.


dwrodri (Author) commented Mar 31, 2021

I'll go ahead and close this issue as it has been answered. Thanks again for the fast and thorough responses!


romanmaznikov1 commented Oct 14, 2022

> I'll go ahead and close this issue as it has been answered. Thanks again for the fast and thorough responses!

@dwrodri Hi! Please tell me how you solved the problem with the TypeError: int() argument must be a string, a bytes-like object or a number, not 'DataNode'. What does your tagger_pipeline function look like in the end?
