Is VideoReader compatible with Resize? #1247
Hi, thanks for the question. Could you explain what you mean by:
Hi, I was hoping to avoid GPU->CPU->GPU transitions between the video decoder and resizing and other transform operations. If VideoReader is not compatible with DALI Resize, is there any intermediate repackaging GPU operator that can be used between the two, or is going through CPU memory the only way? And if not, is there a sample (C/C++, Python) where the underlying nvdec decoder can directly pass GPU frames to Resize and other DALI operators? Thank you.
Ok, so to copy DALI outputs directly to the GPU you can use
The intent is to keep all the video-related operations on the GPU; resizing outside the DALI pipeline in Python on the CPU would likely be too slow. Maybe an inverse (copy_from_external) where an external source can share its GPU data directly with the DALI pipeline, in this case video file or stream -> resize and the rest of the pipeline. As far as the custom operator options go:
@awolant could you comment on the options above? By the way, it seems a little odd that Resize would not be made compatible with VideoReader, since resizing is a very common operation, and setting sequence_length=1 (F=1) doesn't help. Thank you.
DALI folks, for those of us who don't care about sequences, is there a possibility of a Reshape that can be used to go from NFHWC to NHWC, and thus become compatible with the non-sequence ops?
My intention with the above ... About your proposed solutions:
As you pointed out, extending the existing Resize op is the best option by far. In general, DALI video support is in its rather early phase and we are working on improving it. We will take requests like this into consideration.
Thanks, Albert. My comment on Reshape to drop F is that I'm happy with just having a batch dimension, as I'm processing each frame independently -- basically just using DALI as a very efficient loader to get data into PyTorch/TensorFlow. And it's not necessarily the case that the F dimension would be gone forever -- one could reshape from NFHWC to NHWC, do per-frame transforms, and then reshape back to NFHWC. There's only a small subset of processing ops that require sequences (optical flow, etc.).
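The reshape round trip described above can be illustrated with NumPy (a sketch of the memory-layout argument only, not DALI code; the sizes are made up for the example):

```python
import numpy as np

# Hypothetical sizes: a batch of 2 sequences, 4 frames each, 8x8 RGB frames.
N, F, H, W, C = 2, 4, 8, 8, 3
seq = np.arange(N * F * H * W * C, dtype=np.uint8).reshape(N, F, H, W, C)

# Collapse N and F into a single batch dimension. Because the array is
# contiguous, no data is moved -- only the shape descriptor changes.
frames = seq.reshape(N * F, H, W, C)
assert np.shares_memory(seq, frames)

# ... per-frame transforms on `frames` would go here ...

# Restore the sequence layout afterwards.
back = frames.reshape(N, F, H, W, C)
assert np.array_equal(seq, back)
```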
Hi @awolant, The motive here is to use DALI within a GPU-bound video inference pipeline other than optical flow, whether from a video file or from streaming. For a video file it would be natural to use VideoReader and Resize and treat consecutive frames as a batch of images if the sequence length is set to 1. Video streaming as a batch of images would similarly require replacing the source with something other than VideoReader, like an (nv)decoder, while still being able to pass GPU memory from the source to the rest of the DALI pipeline. For VideoReader, if F=1, wouldn't N>1 contain consecutive frames and formally be compatible with Resize, which expects NHWC although it's receiving N1HWC? Is the data packed the same, with just the descriptor confusing the resize? ExternalSource is listed under https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#externalsource as both CPU and GPU; are there any GPU samples? So the question is, going any of the routes above -- either modifying or creating a new Resize, or creating a new ExternalSource -- are there any more advanced custom operator C/C++ samples that exemplify some data operations between input and output rather than just a copy? Thank you.
Hi @awolant, I was wondering why the decision was made to add the 'F' dimension to video files? Having this restriction makes the filter difficult to use in the general case. In theory, and in my own naïveté, each frame of a video could be treated as a separate image, making the 'F' dimension feel a bit redundant. If there is a specific reason for it, could you point us to the DALI source code where that dimension is being used? We are toying with the option of altering the source code of the ... Additionally, I noticed you mentioned ... Lastly, I was hoping you could point us to more interesting examples of ... Thanks in advance!
Hi,
Hi @JanuszL, Thanks for the link. Are there any tutorials on how to make either a GPU (output) ExternalSource or a source operator in C/C++? https://github.com/NVIDIA/DALI/blob/master/dali/pipeline/operators/util/external_source_test.cc could use some comments in that regard. The custom operator sample https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/examples/extend/create_a_custom_operator.html is also rather too oversimplified to learn any real data transformations between input and output. The problem with sequences is that very few operators support them -- OpticalFlow is the only one I am aware of -- and the rest expect batches, so it breaks compatibility between operators, as with VideoReader->Resize when the video is treated as a continuous stream of frames. Does applying ElementExtract transform NFHWC -> NHWC? Regarding Resize, there is the (input) data itself and the data descriptor (shape). Would setting F=1 on VideoReader produce N1HWC packed the same as NHWC? If yes, then only the line of Resize code that asserts the shape would need to be changed to accept N1HWC. If not, then N1HWC needs to be repackaged to NHWC; is there some information on how NFHWC is packed in GPU memory? Thank you.
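For shape intuition only, the effect ElementExtract is expected to have can be mimicked with NumPy indexing (an illustration of the shapes, not the DALI operator itself; sizes are made up):

```python
import numpy as np

# Hypothetical NFHWC batch: 2 sequences of 4 frames, each 8x8 RGB.
N, F, H, W, C = 2, 4, 8, 8, 3
seq = np.zeros((N, F, H, W, C), dtype=np.uint8)

# Extracting one element (frame index 0) from every sequence drops the
# F dimension, leaving a plain NHWC batch.
frame0 = seq[:, 0]
assert frame0.shape == (N, H, W, C)
```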
Hi,
Then the rest is the same as in https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/examples/external_input.html.
Theoretically, it could work -- frames should be placed contiguously in memory -- but it is rather a hack to make it work with a number of frames equal to 1. Readjusting Resize to work with sequences should not be that difficult, I guess.
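The "frames are contiguous" point can be checked with NumPy: with F=1, an N1HWC array and its NHWC squeeze describe the very same bytes (an illustration of the layout argument, not DALI code):

```python
import numpy as np

# Hypothetical N1HWC batch: 4 single-frame sequences of 8x8 RGB frames.
N, H, W, C = 4, 8, 8, 3
n1hwc = np.arange(N * 1 * H * W * C, dtype=np.uint8).reshape(N, 1, H, W, C)

# Removing the singleton F dimension changes only the descriptor;
# the underlying buffer is untouched.
nhwc = np.squeeze(n1hwc, axis=1)
assert nhwc.shape == (N, H, W, C)
assert np.shares_memory(n1hwc, nhwc)
```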
Hi @JanuszL, The result of ElementExtract is still not clear in terms of the current Resize absorbing it. A frame is technically a sequence of one, and multiplied by the batch it is NHWC. So when the VideoReader sequence length is set to 1, or the ElementExtract element is set to 1, what do they produce in terms of the N F H W C descriptor, and what do they produce in terms of GPU memory tensor packing? Is the descriptor the problem for Resize, or the data memory packing? And is there a good C/C++ sample that exemplifies DALI GPU data access or transformations? Thank you.
@predrag12 - it is a problem that Resize doesn't support sequences for now, even if there is only one frame inside - @mzient, am I correct?
@JanuszL You mentioned above that "Readjusting resize to work with sequences should not be that difficult". Has any further work been done on this recently? I've been attempting to create a custom operator for this and am currently stuck re-implementing
@syb0rg - we haven't made any progress, as we are pursuing other goals now. If you have any specific questions about how to proceed with your implementation, feel free to ask.
@JanuszL Sure, my C++ days are a bit rusty and I'm not super familiar with the library's structure yet, so that will be super helpful! Right now I can get my custom operator to hit the
My idea would be to change the meaning of minibatches to allow them to represent a bunch of frames (possibly from different videos in the batch) - TensorListView can contain images from multiple samples (tensors). |
Are there any tangible changes planned to consolidate the unfortunate discrepancy/inconsistency between the treatment of a series of image files vs. a series of video frames (technically still images), either in the official VideoReader or in the Resize operator? If affirmative, can you please comment on the timeline? If negative -- since this was raised 7 months ago and it was mentioned that "Readjusting resize to work with sequences should not be that difficult" -- could you please provide a C/C++ snippet that handles F in NFHWC and directly illustrates your point? Referring to the GPU end-to-end inference case.
I don't think there is any discrepancy/inconsistency. For a sequence of frames, you need to resize a set of consecutive frames within a sample with the same scale, but the scale may vary from sample to sample in the batch. In the case of single images, each image may have a different scale. It looks similar, but the details differ.
It is inconsistent in the sense that whether the image in memory comes from a file or from video, the resize filter, or in your case operator, should treat the width and height dimensions in the same way, and treat other dimensions like N and F as just a matter of repetition, since there is no dependence between consecutive frames, particularly for video where the resolution is basically fixed. One would hardly call a 7-month wait for this "rushing" it. Since you mentioned above that this is not a difficult adjustment, one would expect that it can or should be done organically. If you need external help, that would be predicated on having a good tutorial on how to write more complex operators beyond the simplest one mentioned in the documentation. Or, as repeatedly requested earlier, a short C/C++ snippet that would exemplify how to approach unpacking and packing dimensions in the Resize source code -- I am sure people would be more than happy to at least start with that.
We really appreciate that the community around DALI is getting more and more engaged. We will bump up the priority of this request and we will keep you posted. |
There are most likely two avenues here, since N and F are temporal dimensions and CHW are spatial: one, Resize handling F; and two, VideoReader handling F=0 or -1 by eliminating the F dimension altogether, so that Resize could handle NCHW without modification. I would personally prefer the VideoReader change -- not sure whether you would want to make it official eventually, but it would certainly work, even unofficially, for our purposes. If you could point to the lines of VideoReader code that would need modification, and how to handle multiple dimensions both in terms of manipulating the data blob and emitting the data format, that would be very helpful. Thanks.
So you want to interpolate frames? I don't know if that is going to work. In the end, you would end up with ghosting and some strange temporal artifacts. If you want to interpolate frames, you need more than a simple resize algorithm.
You want to return a single frame instead of a sequence and work with that?
I didn't mean to interpolate frames, but to have only one temporal dimension emitted by VideoReader, which Resize (as is) can understand as a batch of frames. The word "sequence" is overloaded here: colloquially it means a series of samples, but in the documentation it means F, not N. Resize can currently handle one sequencing/temporal dimension, N (a batch of N samples/frames); it expects F to be implicitly 1, or rather nonexistent as a dimension in the input data blob and data descriptor. VideoReader, on the other hand, has a batch of sequences, N x F. So to make its output compatible with Resize, I mentioned passing F=0 or -1 as a VideoReader parameter as a means for it to know not to produce NFCHW or N1CHW but just NCHW. N and F would be collapsed into one, and Resize would read it as N -- something like numpy squeeze -- which makes sense in batch inference (with output size N). I am not sure, though, how the frame data is packed: does it equate to a contiguous blob over all dimensions, so that just the descriptor needs to change, or does the data need to be rearranged too? (See DALI/dali/operators/reader/video_reader_op.h, line 104 at b152caa.)
Squeezing the dimension won't work, as Resize expects a batch-size number of samples. If you squeeze, you will get F * batch_size.
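The batch-size point can be made concrete with NumPy (made-up sizes): collapsing F into the batch dimension yields F * N samples, so an operator expecting exactly N samples per batch would see a different batch size:

```python
import numpy as np

# Hypothetical pipeline batch: N=2 sequences of F=8 frames, 4x4 RGB.
N, F, H, W, C = 2, 8, 4, 4, 3
seq = np.zeros((N, F, H, W, C), dtype=np.uint8)

# Flattening the temporal dimension into the batch multiplies the
# sample count: the nominal batch size is N=2, but 16 samples come out.
flat = seq.reshape(-1, H, W, C)
assert flat.shape[0] == F * N == 16
```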
You can always take a look at #1740 (comment) - someone managed to get a good result. |
I don't think FxN would be a problem if one cares only about a batch of images to feed to inference (not training) and doesn't care about sequences, since there is no correlation between image frames; for example, a batch of 10 frames would be 10HWC, or rather 10CHW, before being fed to inference. ExternalSource is able to produce NHWC that is compatible with Resize; why would VideoReader not be able to produce the same, with modifications? Similar questions related to video sampling, both spatial and temporal, were brought up in #1183, #1356, #1405, #1478, #1770, #1825, #2069. So the question is, if one wants to send 10HWC from VideoReader that Resize would receive, is it doable to
If I understand correctly, you want the VideoReader to produce a batch of frames instead of sequences. Is that correct?
Correct, a batch of frames that Resize and the inference engine understand. Looking at the code links, there is not much in terms of code comments; do you have any guidance as to which lines of code to change, or any comments on the approach suggested above?
Nothing in particular, only the references I have shared on where to start. We would need to dig into the code to figure out how to do it properly.
When is this expected to land? I must say that the lack of image-operation support for videos (especially resize) is rather disappointing. As others have pointed out, this is conceptually simple for per-frame operations and should be possible to implement with a simple reshape-like operation, shouldn't it?
A couple of weeks. Still, we encourage the community to take a stab at it and help with the development.
Hello, I was reading through this again and see that it hasn't been updated in a while. I understand time frames may have changed with the state of the world, but I was wondering if any internal progress has been made, or if we have a new outlook on when this could be addressed?
Hi, |
Hi, Could you please be a little more specific as to the 'soon' ETA? Just for reference, this issue is entering nine months open now; six weeks back it was indicated as 'a couple of weeks', and it was also indicated that this 'should not be that difficult'. There are a number of participants on this thread, and it is linked by a number of issues with the same or similar request, i.e. compatibility between the video and resize operators. Asking the community to contribute to this would be predicated on comprehensive documentation, samples, and recommendations closely related to how the data blob is packed, the descriptor, etc., none of which are available; and if it is simple but would take more time to document, maybe it could be done internally. Can you please comment on or recommend which VideoReader code lines would be subject to change, and how, in order to make VideoReader compatible with Resize (as is) and inference downstream? Thanks.
Hi, |
So there is no commitment to any timeline at all, or to providing sufficient documentation, samples, or recommendations specifically related to the subject of this issue? As a summary/reminder, the ask here is not a custom resize operator, but how to fix VideoReader to be compatible with the existing Resize and the rest of the inference pipeline.
Hi @predrag12 , |
0.25 was released. The requested feature is implemented there. |
Unfortunately this does not meet the requirement that was (repeatedly) requested in this issue over an entire year. The ask was to optionally, via parameters, eliminate the video sequence dimension (F) and leave the batch (N) -- not to combine two operators and still have a mandatory sequence F on the output of VideoReader, which breaks inference. Also, the ask was to have some documentation or comments in the code so that VideoReader could be modified more easily.
I think it was stated that [Reshape](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/supported_ops.html#nvidia.dali.ops.Reshape) with a sequence length of 1, or [ElementExtract](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/supported_ops.html#nvidia.dali.ops.ElementExtract), should do what you want.
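Putting those two suggestions together, a pipeline along these lines might work (a sketch only -- untested; whether ElementExtract can consume the reader's GPU output directly, and the exact argument names, depend on the DALI version):

```python
import nvidia.dali.ops as ops
from nvidia.dali.pipeline import Pipeline

class VideoFramePipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id):
        super().__init__(batch_size, num_threads, device_id)
        # sequence_length=1 makes each sample a one-frame sequence (1HWC).
        self.reader = ops.VideoReader(device="gpu",
                                      filenames=["sintel_trailer-720p.mp4"],
                                      sequence_length=1)
        # ElementExtract takes element 0 of each sequence, dropping F
        # and leaving plain HWC samples that Resize accepts.
        self.extract = ops.ElementExtract(element_map=[0])
        self.resize = ops.Resize(device="gpu", resize_x=224.0, resize_y=224.0)

    def define_graph(self):
        sequences = self.reader(name="Reader")
        frames = self.extract(sequences)
        return self.resize(frames)
```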
Hello,
Trying to use Resize after VideoReader on a GPU for inferencing:
```python
self.input = ops.VideoReader(device="gpu", filenames="sintel_trailer-720p.mp4", ...)
self.resize = ops.Resize(device="gpu", resize_x=..., resize_y=...)
...
output = self.resize(self.input())
```
Results in an error:
Assert on "input.GetLayout() == DALI_NHWC" failed: Resize expects interleaved channel layout (NHWC)
VideoReader has a scale param, but it doesn't offer all the config options Resize does, and neither of the samples (video_label_example, optical flow) includes resizing.
Alternatively, is DALI compatible with NVDEC GPU->GPU in any form? If yes, is there an example?
Thank you.