Importing torchvision before DALI hangs pipeline indefinitely #2872
Comments
Hi, I checked both scripts on …
Hi @JanuszL, thanks for the quick response. The problem is certainly not with the video files: I can reproduce the same issue using the Sintel videos from DALI_extra, as you suggested. Furthermore, the videos I tested are not VFR, as I created them manually with ffmpeg at a fixed frame rate. I also tested on a VFR video to see whether that would change things; in that case DALI prints a useful error message saying VFR is not supported, but only when torchvision is imported afterwards. No such message shows up in this issue, so I think VFR is completely orthogonal to it. Moreover, even if VFR were a problem, this would still not be the expected behavior: importing torchvision before DALI leads to a hang without any error message at all. If VFR were the issue, the expected behavior would be for DALI to display the error message it normally throws in that case, right? Could you please review my post and make sure you are using the same library versions? If that is troublesome, I could create a Dockerfile to ensure the repro.
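As an aside for anyone else checking their files, one way to sanity-check for VFR outside DALI is to compare the declared and averaged frame rates that ffprobe reports (e.g. `ffprobe -v error -select_streams v:0 -show_entries stream=r_frame_rate,avg_frame_rate -of default=noprint_wrappers=1 input.mp4`). A minimal sketch of that heuristic, assuming the two fields have already been read from ffprobe's output (the helper name is mine, not a DALI or ffmpeg API):

```python
from fractions import Fraction

def is_probably_vfr(r_frame_rate: str, avg_frame_rate: str) -> bool:
    """Heuristic VFR check: ffprobe reports rates as fractions like
    "30000/1001"; if the stream's declared rate and its averaged rate
    disagree, the file is likely variable frame rate."""
    return Fraction(r_frame_rate) != Fraction(avg_frame_rate)

# A CFR file reports matching rates; a VFR file usually does not:
print(is_probably_vfr("25/1", "25/1"))        # False
print(is_probably_vfr("30/1", "30000/1001"))  # True
```

Note this is only a heuristic, consistent with the point above that you cannot be certain without parsing the whole video.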
Hi @ndalton12,
DALI uses a heuristic. You can't be sure whether a video is VFR or not until you parse the whole file, and DALI doesn't do that. I tried running this inside a clean Docker environment and it still works:
The HW and driver are:
Hi @JanuszL,
I see. I was able to run through every video to confirm that none are VFR. I was also able to run without issues using your setup. I could only reproduce the issue inside Docker when using one specific environment. Any idea why any of these libraries would conflict with DALI? Strangely enough, the issue is still tied to importing torchvision.
Here is a working environment in which I removed some of the unnecessary libraries from above:
The only hint I have (besides going through the removals one by one) is this output from running the code with the working environment:
Do you also encounter this warning? It looks like an ffmpeg issue, but both the working and non-working environments have the same version of ffmpeg. Notably, this warning does not show up in the non-working version.
Hi, it may be some ABI incompatibility.
… works fine.
@JanuszL Sorry for the late reply, here you go:
I am able to reproduce with this setup.
Hi, it seems that importing torchvision first brings the system FFmpeg binaries into the process, and DALI then uses symbols from them even though we ship our own FFmpeg build. The upstream FFmpeg build has a different configuration and doesn't work well with DALI.
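For anyone wanting to verify which FFmpeg libraries actually ended up in their process, a minimal Linux-only sketch is to scan /proc/self/maps after the imports and look for the libav* shared objects (the helper below is my own illustration, not part of DALI):

```python
import re

def loaded_ffmpeg_libs():
    """Return the paths of FFmpeg shared objects (libavcodec, libavformat,
    libavutil, libavfilter) currently mapped into this process. Linux only."""
    pattern = re.compile(r"\S*libav(codec|format|util|filter)\S*")
    libs = set()
    with open("/proc/self/maps") as maps:
        for line in maps:
            match = pattern.search(line)
            if match:
                libs.add(match.group(0))
    return sorted(libs)

# Run this right after your imports: if it shows a system libavcodec
# (e.g. under /usr/lib) alongside DALI's bundled copy, two different
# FFmpeg builds are colliding in the same process.
print(loaded_ffmpeg_libs())
```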
@JanuszL Okay, thanks for the info. On a slightly different note, I am getting large differences in model performance when using DALI-loaded videos vs. using torchvision as my video reader. I tried to eliminate all differences between the two methods, but the DALI version generally has worse accuracy (a 15-20% difference) on the train and test splits. After training, though, the two methods tend to perform similarly against an untouched dataset (about 5% difference at most). Any idea why this could happen and/or whether it's related to this issue? The datasets are original_clips, pre-cut data (a subset of original_clips), and phases_clips (which is totally separate). The metric measured is accuracy.
Hi @ndalton12, it is hard to tell what the reason is. It would be best to compare side by side how the outputs from DALI and torchvision look. Maybe there is a difference in the configuration of the two pipelines (even though the same operations are used, the default options, and thus the behavior, may differ).
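A simple way to make that side-by-side comparison numerical, assuming you can pull one decoded batch out of each pipeline as uint8 arrays of shape (frames, H, W, C) (the function name here is mine, for illustration):

```python
import numpy as np

def frame_mad(frames_a, frames_b):
    """Mean absolute difference per frame between two aligned frame
    batches; large values point at a decoding/configuration mismatch."""
    a = np.asarray(frames_a, dtype=np.float32)
    b = np.asarray(frames_b, dtype=np.float32)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    return np.abs(a - b).mean(axis=(1, 2, 3))

# Identical batches give all zeros; a constant offset of 2 gives 2.0:
a = np.zeros((2, 4, 4, 3), dtype=np.uint8)
b = np.full((2, 4, 4, 3), 2, dtype=np.uint8)
print(frame_mad(a, a))  # [0. 0.]
print(frame_mad(a, b))  # [2. 2.]
```

This only works if both pipelines are fed the same frames in the same order, so any shuffling should be disabled for the comparison.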
@ndalton12 - we have a preliminary fix for this particular problem: #2911 and NVIDIA/DALI_deps#6. Let us validate whether this is the way to go for us, but thumbs up.
Hi @JanuszL, I can give the new fix a try when a new release is pushed. About the differences in performance: I have minimized the differences between the pipelines as much as possible, so that only the frame loading differs (all the transformations, specifically the conversion to float, the resizing, and the normalization, are done the exact same way). The only other difference is the random ordering of the frames. Upon visual inspection, the images are basically the same, except that the torchvision version has a bit more artifacting (random colored pixels here and there). Despite this, the DALI version still fails to generalize and performs poorly on the test set and the separate dataset. I am a bit lost as to what could be causing this difference; any advice? Attribution methods show that the torchvision version learns important geometry in the frames, while the DALI version does not seem to learn to distinguish classes based on that geometry. The only theory I have is that the artifacting from torchvision provides an important regularization. This seems unlikely, though, as the DALI version still has minor artifacting due to using the same transformation pipeline after frame loading.
The only thing that comes to mind is the frame/sequence distribution. Have you turned shuffling on? Are you sure you load all the videos, exactly as torchvision does?
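One easy check for the "same videos" part is to build the file list once and feed the identical list to both readers. A small sketch, where the helper name and extension list are my own illustration:

```python
import os

def list_videos(root, exts=(".mp4", ".avi", ".mkv")):
    """Collect all video paths under root, sorted deterministically, so the
    DALI reader and the torchvision reader see exactly the same files."""
    paths = []
    for dirpath, _, files in os.walk(root):
        for name in files:
            if name.lower().endswith(exts):
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)
```

Passing this explicit list to both pipelines, rather than letting each reader glob a directory on its own, rules out one loader silently skipping files.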
#2911 has been merged. Please check the nightly build that follows it to see if that resolves your problem. |
Hi @JanuszL, it looks like the problem has been resolved, so I will close the issue. Thanks for your help. As for the difference in performance, I wasn't able to figure it out. I used the same pre-processing after loading, but the performance difference is still there. I'm not really sure what would cause it, as only the video loading differs.
Hi all,
I was recently trying to implement a DALI video reader to increase the speed at which I can provide video data to the GPU for a PyTorch task. However, I was getting weird behavior where sometimes one or two batches could be loaded before the entire loading pipeline deadlocked indefinitely (no output). Ctrl+C does not stop the program at this point either, so the only way to stop it is to suspend the program (Ctrl+Z) and/or issue a
kill -9
to the job. After much painful debugging, I was able to narrow down the problem (or at least one way it can occur; there may be more!). Basically, running
import torchvision
(or most, but not all, submodules of torchvision; e.g. importing just the torchvision MNIST dataloader does not cause the issue) BEFORE any
import nvidia.dali...
will cause this issue. Importing torchvision AFTER DALI means the issue does not show up! Here is a minimal example of a working script: https://gist.github.com/ndalton12/0f1900a411150f1dfb9b1ac6384d9889.
Here is a minimal example of a non-working script: https://gist.github.com/ndalton12/b888395646cebe319f78006faa0b6f6a. The script never gets past printing "doing the thing".
Note that the only meaningful difference between the two scripts is that the non-working one imports torchvision at the top. However, if you change the non-working script to import torchvision after the DALI imports instead, it works as expected.
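A cheap defensive measure against this is to assert the import order at the top of the training script. The guard below is my own workaround sketch, not a DALI API, and it only catches the ordering problem described above:

```python
import sys

def assert_dali_importable():
    """Fail fast if torchvision is already in sys.modules, since importing
    it before nvidia.dali can deadlock the video pipeline (this issue)."""
    if "torchvision" in sys.modules:
        raise RuntimeError(
            "torchvision was imported before nvidia.dali; "
            "import DALI first to avoid the pipeline hang."
        )

assert_dali_importable()         # call this, then import nvidia.dali,
# import nvidia.dali.fn as fn   # ...and only afterwards import torchvision.
```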
GDB does not show anything useful, just a bunch of threads starting, then nothing.
Version info:
torchvision-0.9.1-py38_cu111
nvidia-dali-cuda110 1.1.0
pytorch-lightning 1.3.0rc1
cudatoolkit 11.1.1
pytorch 1.8.1 py3.8_cuda11.1_cudnn8.0.5_0
Also, the
"/data/critical_view_clips/"
directory looks as such:

All the files are valid, since they work in the working example I provided, and I have already cleaned out the invalid video files.
TL;DR: Title