
Is VideoReader compatible with Resize? #1247

Closed
predrag12 opened this issue Sep 10, 2019 · 42 comments
Labels
enhancement New feature or request external contribution welcome A good place to start contributing to the project IBM Power architecture questions and issues Video Video related feature/question

Comments

@predrag12

Hello,

Trying to use Resize after VideoReader on a GPU for inferencing:

self.input = ops.VideoReader(device="gpu", filenames="sintel_trailer-720p.mp4", ...)
self.resize = ops.Resize(device="gpu", resize_x=..., resize_y=...)
...
output = self.resize(self.input())

Results in an error:

Assert on "input.GetLayout() == DALI_NHWC" failed: Resize expects interleaved channel layout (NHWC)

VideoReader has scale param but it doesn't offer all the config options like Resize, and neither of the samples like video_label_example or optical flow have resizing.

Alternatively is DALI compatible with NVDECODER GPU->GPU in any form, if yes, is there an example?

Thank you.

@awolant
Contributor

awolant commented Sep 11, 2019

Hi, thanks for the question.
For now, our Resize op does not support video files.

Could you explain what you mean by:

Alternatively is DALI compatible with NVDECODER GPU->GPU in any form, if yes, is there an example?

@awolant awolant added the question Further information is requested label Sep 11, 2019
@predrag12
Author

predrag12 commented Sep 11, 2019

Hi,

I was hoping to avoid GPU->CPU->GPU transitions between the video decoder and resizing and other transform operations. If VideoReader is not compatible with DALI Resize, is there any intermediate repackaging GPU operator that can be used between the two, or is going through CPU memory the only way? If not, is there a sample (C/C++, Python) where the underlying nvdec decoder can directly pass GPU frames to Resize and other DALI operators?

Thank you.

@awolant
Contributor

awolant commented Sep 11, 2019

Ok, so to copy DALI outputs directly to the GPU you can use copy_to_external. I made a small Python example of how to use it with PyCUDA here. This should allow you to use some custom Python code to resize the frames outside the DALI pipeline.
Another option would be to write your own custom resize op for videos using our plugin system. This is definitely more work, but it would allow you to resize the videos inside the pipeline, if that is what you want.

@predrag12
Author

predrag12 commented Sep 12, 2019

The intent is to keep all the video-related operations on the GPU; resizing outside the DALI pipeline in Python on the CPU would likely be too slow. Maybe an inverse (copy_from_external) where an external source can share its GPU data directly with the DALI pipeline, in this case video file or stream -> resize and the rest of the pipeline. As far as the custom operator options go:

  • write a C/C++/CUDA custom plugin operator that transforms the VideoReader NFHWC format to the NHWC image format for Resize, as you mentioned, but from the simple sample it is not clear how to reformat the input->output data; are there any more complex samples?

  • write a GPU ExternalSource; it is not clear from the external source sample how to produce binary GPU data that the pipeline can consume

  • write a C/C++/CUDA custom plugin operator that is a source operator and uses libnvcuvid/nvdec directly and generates the same data as option 2; it is not clear from the plugin sample how to create a source operator. Are there any good examples in the DALI source?

@awolant could you comment on the options above? By the way, it seems a little odd that Resize would not be made compatible with VideoReader, since resizing is a very common operation, and setting sequence_length=1 (F=1) doesn't help.

Thank you.

@dave-andersen

DALI folks, for those of us who don't care about sequences, is there a possibility of a Reshape that can be used to go from NFHWC to NHWC, and thus become compatible with the non-sequence ops?

@awolant
Contributor

awolant commented Sep 19, 2019

My intention with the above copy_to_external was to resize it outside of the DALI pipeline in Python, but on the GPU using some kind of library or custom native code.

About your proposed solutions:

  • I'm not sure if I understand correctly what you want to do there. If you just drop F you will lose the frame dimension and will be left with only one image.
    You could try to write an op that works on NFHWC and resizes each frame. You could possibly reuse parts of the existing Resize operator for that.
  • Currently ExternalSource does not support GPU input, so this will not work.
  • I would not recommend trying to write custom reader as plugin - this requires familiarity with loader-parser dynamics in DALI.

As you pointed out, extending existing Resize op is the best option by far.

In general, DALI video support is in its rather early phase and we are working on improving it. We will take requests like this into consideration.

@awolant awolant added enhancement New feature or request external contribution welcome A good place to start contributing to the project labels Sep 19, 2019
@dave-andersen

dave-andersen commented Sep 19, 2019

Thanks, Albert. My comment on Reshape to drop F is that I'm happy with just having a batch dimension, as I'm processing each frame independently -- basically just using DALI as a very efficient loader to get data into pytorch/tensorflow. And it's not necessarily the case that the F dimension would be gone forever -- one could reshape from NFHWC to NHWC, do per-frame transforms, and then reshape back to NFHWC. There's only a small subset of processing ops that require sequences (optical flow, etc.).
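The round trip described above can be sketched with NumPy on the host. This is purely illustrative (the shapes and the flip transform are toy stand-ins, and DALI's GPU buffers would need equivalent view logic), but it shows that the F dimension need not be gone forever:

```python
import numpy as np

# Toy NFHWC batch: 2 videos x 8 frames x 4x6 RGB (stand-in for real frames)
batch = np.arange(2 * 8 * 4 * 6 * 3, dtype=np.uint8).reshape(2, 8, 4, 6, 3)

# Collapse N and F into a single batch-of-frames dimension (NHWC-like)
frames = batch.reshape(-1, 4, 6, 3)   # shape (16, 4, 6, 3), zero-copy view

# Apply a per-frame transform; a vertical flip stands in for resize here
flipped = frames[:, ::-1, :, :]

# Restore the sequence dimension afterwards
restored = flipped.reshape(2, 8, 4, 6, 3)
```

Because the data is contiguous, the first reshape is only a metadata change; a real resize would change H and W, so only the final reshape back to NFHWC would need the new spatial extents.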

@predrag12
Author

Hi @awolant,

The motive here is to use DALI within a GPU-bound video inference pipeline other than optical flow, whether the input is a video file or a stream. For a video file it would be natural to use VideoReader and Resize and treat consecutive frames as a batch of images if sequence is set to 1. Similarly, treating a video stream as a batch of images would require replacing the source with something other than VideoReader, like (nv)decoder, but still being able to pass GPU memory from the source to the rest of the DALI pipeline.

For VideoReader, if F=1, wouldn't N>1 contain consecutive frames and formally be compatible with Resize, which expects NHWC although it is receiving N1HWC? Is the data packed the same, and is just the descriptor confusing the resize?

For ExternalSource, it is listed under https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#externalsource as both CPU and GPU; are there any GPU samples?

So the question is, going any of the routes above, either modifying or creating a new Resize or creating a new ExternalSource, are there any more advanced custom operator C/C++ samples that exemplify some of the data operations between input and output rather than just a copy?

Thank you.

@aalugore
Contributor

Hi @awolant,

I was wondering why the decision was made to add the 'F' dimension to video files? Having this restriction makes the filter difficult to use in the general case. In theory, and in my own naïveté, each frame of a video could be treated as a separate image, making the 'F' dimension feel a bit redundant. If there is a specific reason for it, could you point us to the DALI source code where that dimension is being used? We are toying with the option of altering the source code of the Resize filter to handle the transformation between NFHWC <-> NHWC, so maybe you could help point us in the right direction.

Additionally, I noticed you mentioned ExternalSource does not accept input on the GPU, despite the documentation @predrag12 cited above saying it can be run on the GPU. Is this an error in the documentation? Or are "accepting input on the GPU" and "being able to be scheduled on the GPU", as the documentation states, two different things?

Lastly, I was hoping you could point us to more interesting examples of ExternalSource and/or Resize being used with GPUs if that exists? Or perhaps just the area of DALI source code where we would need to look into if we were indeed going to extend video reading functionality.

Thanks in advance!

@JanuszL
Contributor

JanuszL commented Sep 30, 2019

Hi,
ExternalSource accepts input only on the CPU (via a numpy array), but the output can be on the GPU or the CPU - that is what the documentation says (maybe not 100% self-explanatory). One of the examples is in https://github.com/NVIDIA/DALI/blob/master/dali/pipeline/operators/util/external_source_test.cc. However, it should not be difficult to extend it to accept GPU inputs using a cupy array, cuda_array or the DLPack interface.
Currently, there aren't many operators that support sequences - but you can check ElementExtract.
Regarding sequences - yes, from the processing point of view there are just F * batch size images. On the other hand, some transformations could benefit from grouping F consecutive images into one bag of processing kernels - e.g. some coefficients (during resize) could be reused for the whole sequence, saving storage and compute time.

@predrag12
Author

Hi @JanuszL,

Thanks for the link. Are there any tutorials on how to make either a GPU (output) ExternalSource or a source operator in C/C++? https://github.com/NVIDIA/DALI/blob/master/dali/pipeline/operators/util/external_source_test.cc could use some comments in that regard. The custom operator sample https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/examples/extend/create_a_custom_operator.html is also rather oversimplified for learning any real data transformations between input and output.

The problem with sequences is that there are very few operators that support them - OpticalFlow is the one I am aware of - and the rest expect batches, so it breaks compatibility between operators, as with VideoReader->Resize when the video is treated as a continuous stream of frames. Does applying ElementExtract transform NFHWC -> NHWC?

Regarding Resize, there is the (input) data itself and the data descriptor (shape). Would setting F=1 on VideoReader produce N1HWC as the same packed data as NHWC? If yes, then only the line of Resize code that asserts the shape would need to be changed to accept N1HWC. If not, then N1HWC needs to be repackaged to NHWC; is there some information on how NFHWC is packed in GPU memory?

Thank you.

@JanuszL
Contributor

JanuszL commented Oct 1, 2019

Hi,
To make the output of ExternalSource available on the GPU just write something like this:

  self.input = ops.ExternalSource(device='gpu')

Then the rest is the same as in https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/examples/external_input.html.
Regarding breaking the compatibility - yes and no. A sequence of frames is by definition not compatible with single images; even if you can treat them as a flat sequence it won't work, as it is F * batch size images instead of batch size. In the case of ElementExtract, it can return more than one frame per element and it returns sequences - so it won't work either.

Regarding the Resize there is the (input) data itself and data descriptor (shape). Would setting the F=1 on VideoReader produce N1HWC as the same packed data as NHWC? If yes then only the line of Resize code that asserts the shape would need to be changed to accept N1HWC. If not then N1HWC needs to be repackaged to NHWC, is there some information about how NFHWC is being packed in GPU memory?

Theoretically, it could work - frames should be placed contiguously in memory, but it is rather a hack to make it work with the number of frames set to 1. Readjusting resize to work with sequences should not be that difficult, I guess.
@mzient could provide guidance if you want to jump into the code.
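To make the "frames placed contiguously in memory" point concrete, here is a host-side NumPy analogy (the shapes are made up, and DALI would still need its layout metadata fixed, so this only illustrates the packing question, not a DALI fix): for a contiguous buffer, dropping a singleton F axis changes only the descriptor, not the bytes.

```python
import numpy as np

# Stand-in for a VideoReader output with sequence_length=1: N1HWC
n1hwc = np.zeros((4, 1, 72, 128, 3), dtype=np.uint8)

# Squeezing the F axis touches only the shape/strides, not the data
nhwc = n1hwc.squeeze(axis=1)

print(nhwc.shape)                     # (4, 72, 128, 3)
print(np.shares_memory(nhwc, n1hwc))  # True: same underlying buffer
```

With F > 1 the same collapse yields F * batch size frames, which is exactly the batch-size mismatch discussed later in this thread.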

@predrag12
Author

Hi @JanuszL,

The result of ElementExtract is still not clear in terms of the current Resize absorbing it. A frame is technically a sequence of one, and multiplied by batch that is NHWC. So when the VideoReader sequence is set to 1, or the ElementExtract element is set to 1, what do they produce in terms of the N F H W C descriptor, and what do they produce in terms of GPU memory tensor packing? Is the descriptor the problem for Resize, or the data memory packing? And is there a good C/C++ sample that exemplifies DALI GPU data access or transformations?

Thank you.

@JanuszL
Contributor

JanuszL commented Oct 7, 2019

@predrag12 - it is a problem that resize doesn't support sequences for now, even if there is only one frame inside - @mzient, am I correct?
Regarding C++, you can look into our tests, but we don't officially support the native C++ API, which is still under heavy reworking. So we see no point in providing and maintaining C++ examples that go obsolete release to release.
What we want to do is to finish the reworking that is still in progress, so we can think about the C++ API we want to commit to supporting in the long term.

@JanuszL JanuszL removed the question Further information is requested label Jan 21, 2020
@syb0rg

syb0rg commented Feb 17, 2020

@JanuszL You mentioned above that "Readjusting resize to work with sequences should not be that difficult". Has any further work been done on this recently? I've been attempting to create a custom operator for this and am currently stuck re-implementing ResizeBase::RunGPU(). I am just wondering if I should continue independently or work with you to implement it officially?

@JanuszL
Contributor

JanuszL commented Feb 17, 2020

@syb0rg - we haven't made any progress as we are pursuing other goals now. If you have any specific questions about how to proceed with your implementation, feel free to ask.

@syb0rg

syb0rg commented Feb 17, 2020

@JanuszL Sure, my C++ days are a bit rusty and I'm not super familiar with the library's structure yet, so that will be super helpful!

Right now I can get my custom operator to hit the ResizeBase::RunGPU method, but then it runs into some issues with the number of dimensions. I can't merely change all the 3s to 4s in the file, as I'll get compilation errors. I'm thinking I change the in_view to handle 4 dimensions, allow SubdivideInput() to run to get the minibatches, then split the videos up by their frames to reduce the dimension cardinality to 3 and process the resize like normal. Afterwards I would regroup the frames into the original videos and return the output. Thoughts? Recommendations? I can attach what work I have right now too, if that would help you understand.

@mzient
Contributor

mzient commented Feb 18, 2020

My idea would be to change the meaning of minibatches to allow them to represent a bunch of frames (possibly from different videos in the batch) - TensorListView can contain images from multiple samples (tensors).
As far as the CPU variant goes, you could just resize the kernel manager or interleave setup/run.
We'll probably do something similar on our own, but the timing depends on many factors.

@predrag12
Author

predrag12 commented Mar 13, 2020

Are there any tangible changes planned to consolidate the unfortunate discrepancy/inconsistency between the treatment of a series of image files vs. a series of video frames (technically still images), either in the official VideoReader or the Resize operator? If affirmative, can you please comment on the timeline? If negative - since this was raised 7 months ago and it was mentioned that "Readjusting resize to work with sequences should not be that difficult" - could you please provide a C/C++ snippet that handles F in NFHWC and directly illustrates your point? Referring to the GPU end-to-end inference case.

@JanuszL
Contributor

JanuszL commented Mar 13, 2020

Are there any tangible changes planned to consolidate the unfortunate discrepancy/inconsistency between treatment of series of image files vs. series of video frames (technically still images) either in official VideoReader or Resize operator?

I don't think there is any discrepancy/inconsistency. For a sequence of frames, you need to resize the set of consecutive frames within a sample with the same scale, but the scale may vary sample to sample in the batch. In the case of single images, each image may have a different scale. It looks similar but the details differ.
We plan to implement it in the near future, but now we are focusing on the audio operators. Again, I cannot commit to any particular date as our priorities depend on many factors.
As said, if you are in a rush you can try to do it on your own and we would be more than happy to accept and review any PR that would implement this functionality.

@predrag12
Author

It is inconsistent in the sense that, whether the image in memory comes from a file or from video, the resize filter - or in your case operator - should treat the width and height dimensions in the same way, and treat other dimensions like N and F as just a matter of repetition, since there is no dependence between consecutive frames, particularly for video where the resolution is basically fixed. One would hardly call a 7-month wait for this "rushing" it. Since you mentioned above that this is not a difficult adjustment, one would expect that it can or should be done organically. If you need external help, that would be predicated on having a good tutorial on how to write more complex operators beyond the simplest one mentioned in the documentation. Or, as repeatedly requested earlier, a short C/C++ snippet that exemplifies how to approach unpacking and packing dimensions in the resize source code - I am sure people would be more than happy to at least start with that.

@JanuszL
Contributor

JanuszL commented Mar 14, 2020

We really appreciate that the community around DALI is getting more and more engaged. We will bump up the priority of this request and we will keep you posted.

@predrag12
Author

predrag12 commented Mar 16, 2020

There are most likely two avenues here, since N and F are temporal dimensions and CHW are spatial: one, Resize handling F; and two, VideoReader handling F=0 or -1 by eliminating the F dimension altogether, so that Resize could handle NHWC without modification. I would personally prefer the VideoReader change; not sure if you would want to make it official eventually or not, but it would certainly work even unofficially for our purposes. If you could point to the lines of VideoReader code that would need modification, and how to handle multiple dimensions both in terms of manipulating the data blob and emitting the data format, that would be very helpful. Thanks.

@JanuszL
Contributor

JanuszL commented Mar 16, 2020

One Resize handling F

So you want to interpolate frames. I don't know if that is going to work. In the end, you would end up with ghosting and some strange temporal artifacts. If you want to interpolate frames you need more than a simple resize algorithm.

second VideoReader handling F 0 or -1 by eliminating F dimension altogether and then resize could handle NHWC without modification

You want to return a single frame instead of sequence and work with that?
If so you need to look into https://github.com/NVIDIA/DALI/blob/master/dali/operators/reader/loader/video_loader.cc#L710 and https://github.com/NVIDIA/DALI/blob/master/dali/operators/reader/video_reader_op.h#L86 for the beginning.

@predrag12
Author

predrag12 commented Mar 16, 2020

Didn't mean to interpolate frames, but to have only one temporal dimension emitted by VideoReader that Resize (as is) can understand as a batch of frames. The word "sequence" is overloaded here, since colloquially it means a series of samples but in the documentation it means F, not N. Resize can currently handle one sequencing/temporal dimension, N (a batch of N samples/frames); it expects F to be implicitly 1, or rather nonexistent as a dimension, in the input data blob and data descriptor. VideoReader, on the other hand, has a batch of sequences, N x F, so to make its output compatible with Resize, I mentioned passing F=0 or -1 as a VideoReader param as a means for it to know to produce not NFHWC or N1HWC but just NHWC. N and F would be collapsed into one, and Resize would read it as N, something like numpy squeeze, which makes sense in batch inference (with output size N). I am not sure, though, how the frame data is packed: does it equate to a contiguous blob over all dimensions so that just the descriptor needs to change, or does the data need to be rearranged too?
For example changing

tl_sequence_output.SetLayout("FHWC");
to "NHWC"

@JanuszL
Contributor

JanuszL commented Mar 16, 2020

Squeezing the dimension won't work, as resize expects batch size number of samples. If you squeeze, you will get F * batch size.
With the current architecture, you need to adjust resize.

@JanuszL
Contributor

JanuszL commented Mar 17, 2020

You can always take a look at #1740 (comment) - someone managed to get a good result.

@predrag12
Author

predrag12 commented Mar 17, 2020

I don't think F x N would be a problem if one cares only about a batch of images to feed to inference (not training) and doesn't care about sequence, since there is no correlation between image frames; for example, a batch of 10 frames would be 10HWC (or rather 10CHW before being fed to inference). ExternalSource is able to produce the NHWC that is compatible with Resize, so why would VideoReader not be able to produce the same, with mods? Similar questions related to video sampling, both spatial and temporal, were brought up in #1183, #1356, #1405, #1478, #1770, #1825, #2069.

So the question is, if one wants to send 10HWC from VideoReader that Resize would receive, is it doable to:

  • internally set F to 10 and externally report format as "NHWC" and N as 10?
  • do any descriptor values N H W C reported downstream need to change; if yes, is there any documentation or are there code lines?
  • do any internal frame packing data blobs need to change; if yes, is there documentation on how frames are packed together before being sent to the output?

@JanuszL
Contributor

JanuszL commented Mar 17, 2020

I don't think FxN would be problem if one cares only about a batch of images to feed to inference (not training) and doesn't care about sequence since there is no correlation btw image frames, for example batch of 10 frames would be 10HWC or rather 10CHW before fed to inference. ExternalSource is able to produce the NHWC that is compatible with Resize, why would VideoReader not be able to produce the same, with mods? Similar questions related to video sampling both spatial and temporal were brought up in #1183, #1356, #1405, #1478, #1770

As I understand correctly you want the VideoReader to produce a batch of frames instead of sequences. Is that correct?
If so, then the references I shared above are still valid.

@JanuszL JanuszL added the Video Video related feature/question label Mar 17, 2020
@predrag12
Author

predrag12 commented Mar 18, 2020

Correct, a batch of frames that Resize and the inference engine understand. Looking at the code links, there is not much in terms of code comments; do you have any guidance as to which lines of code to change, or any comments on the approach suggested above (below)?

  • internally set F to 10 and externally report format as "NHWC" and N as 10?
  • do any descriptor values N H W C reported downstream or any frame data blobs need to change, and is there any documentation?

@JanuszL
Contributor

JanuszL commented Mar 18, 2020

do you have any guidance as to which lines of code to change, any comments on the approach suggested above (below)?

Nothing in particular, only the references on where to start that I have shared. We would need to dig into the code to figure out how to do it properly.

@bartwojcik

When is this expected to land? I must say that the lack of image operation support for videos (especially resize) is rather disappointing. As others have pointed out, conceptually this is simple for per-frame operations and should be possible to implement with a simple reshape-like operation, shouldn't it?

@JanuszL
Contributor

JanuszL commented Apr 7, 2020

When is this expected to land?

A couple of weeks. Still, we encourage the community to take a stab at it and help with the development.

@aalugore
Contributor

Hello, I was reading through this again and see that it hasn't been updated in a while. I understand time frames may have changed with the state of the world, but I was wondering if any internal progress has been made, or if we have a new outlook on when this could be addressed?

@JanuszL JanuszL added the IBM Power architecture questions and issues label May 14, 2020
@JanuszL
Contributor

JanuszL commented May 14, 2020

Hi,
We are still pursuing other requests and priorities. We hope to get to it soon.

@predrag12
Author

predrag12 commented May 16, 2020

Hi,

Could you please be a little more specific about the 'soon' ETA? Just for reference, this issue is entering its ninth month open; a month and a half back the estimate was 'a couple of weeks', and it was indicated that this 'should not be that difficult'. There are a number of participants on this thread, and it is linked by a number of issues with the same or similar request, i.e. compatibility between the video and resize operators. Asking the community to contribute to this would be predicated on comprehensive documentation, samples and recommendations closely related to how the data blob is packed, the descriptor, etc., none of which are available; and if it is simple but would take more time to document, maybe it could be done internally. Can you please comment on, or recommend, which VideoReader code lines would be subject to change and how, in order to have VideoReader be compatible with Resize (as is) and inference downstream?

Thanks.

@JanuszL
Contributor

JanuszL commented May 18, 2020

Hi,
DALI is an open-source project; however, we have many drivers and stakeholders, not necessarily visible on GitHub. We are doing our best to meet the expectations of as many customers and users as possible, but we cannot make everyone happy. Right now we are fully focused on providing comprehensive audio support.
I have already seen that someone managed, with a little help from our side, to implement this on his own: #1740.

@predrag12
Author

predrag12 commented May 18, 2020

So there is no commitment to any timeline at all, or to providing sufficient documentation, samples or recommendations specifically related to the subject of the issue? As a summary/reminder, the ask here is not a custom resize operator but how to fix VideoReader to be compatible with the existing Resize and the rest of the inference pipeline.

@awolant
Contributor

awolant commented Jul 21, 2020

Hi @predrag12 ,
recently we introduced the VideoReaderResize op, which can resize videos. It supports all Resize parameters for videos.
It is available in our nightly build.
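For readers landing here later, a minimal sketch of using the new op in the legacy ops API. This assumes a DALI nightly that ships VideoReaderResize; the file path, resolutions and pipeline class are placeholders, and it is untested here since it requires a GPU build of DALI:

```python
import nvidia.dali.ops as ops
from nvidia.dali.pipeline import Pipeline

class VideoResizePipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id):
        super().__init__(batch_size, num_threads, device_id)
        # Decode and resize in a single GPU op; the output layout is still NFHWC
        self.reader = ops.VideoReaderResize(
            device="gpu",
            filenames=["sintel_trailer-720p.mp4"],  # placeholder path
            sequence_length=8,
            resize_x=320, resize_y=180)

    def define_graph(self):
        return self.reader()
```

Note that the output still carries the F dimension; as discussed above, feeding per-frame inference would additionally require folding the sequence dimension into the batch.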

@JanuszL JanuszL added this to the Release_0.25.0 milestone Jul 21, 2020
@JanuszL
Contributor

JanuszL commented Sep 1, 2020

0.25 was released. The requested feature is implemented there.

@JanuszL JanuszL closed this as completed Sep 1, 2020
@predrag12
Author

Unfortunately this does not meet the requirement that was (repeatedly) requested in this issue during the entire year: the ask was to optionally, via params, eliminate the video sequence dimension (F) and leave the batch (N), not to combine two operators and still have a mandatory sequence F on the output of VideoReader, which breaks inference. Also, the ask was to have some documentation or comments in the code so that VideoReader could be modified more easily.

@JanuszL
Contributor

JanuszL commented Sep 2, 2020

the requirement that was (repeatedly) requested in the issue during entire year, the ask was to optionally via params eliminate video sequence (F) and leave batch (N)

I think it was stated that [Reshape](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/supported_ops.html#nvidia.dali.ops.Reshape) with a sequence length of 1, or [ElementExtract](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/supported_ops.html#nvidia.dali.ops.ElementExtract), should do what you want.

8 participants