Importing torchvision before DALI hangs pipeline indefinitely #2872

Closed
ndalton12 opened this issue Apr 16, 2021 · 14 comments
Labels
bug (Something isn't working), lack_of_repro (Needs clear reproduction steps and script)

Comments

@ndalton12

ndalton12 commented Apr 16, 2021

Hi all,

I was recently trying to implement a DALI video reader to speed up feeding video data to the GPU for a PyTorch task. However, I ran into odd behavior: sometimes one or two batches load, and then the entire loading pipeline hangs indefinitely with no output. Ctrl+C does not stop the program at that point either, so the only way out is to suspend it (Ctrl+Z) and/or issue a kill -9 to the job.

After much painful debugging, I was able to narrow down the problem (or at least one way it can occur; there may be more!). Importing torchvision (or most, but not all, of its submodules; importing just the torchvision MNIST dataloader, for example, does not trigger it) BEFORE any nvidia.dali import causes the hang. Importing torchvision AFTER DALI means the issue does not show up!

Here is a minimal example of a working script: https://gist.github.com/ndalton12/0f1900a411150f1dfb9b1ac6384d9889.

Here is a minimal example of a non-working script: https://gist.github.com/ndalton12/b888395646cebe319f78006faa0b6f6a. The script never gets past printing "doing the thing".

Note that the only meaningful difference between these two scripts is that the non-working one imports torchvision at the top. If you change the non-working script to import torchvision after the DALI imports instead, it works as expected.
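A small guard can make the import order described above visible at runtime. This is a minimal sketch, not part of either gist: the helper name `imported_before` is hypothetical, and it relies only on the fact that `sys.modules` keeps insertion order on Python 3.7+, which usually (but not strictly always) matches first-import order.

```python
import sys
from typing import Iterable, Optional


def imported_before(first: str, second: str,
                    loaded: Optional[Iterable[str]] = None) -> bool:
    """Return True if module `first` was loaded before module `second`.

    `loaded` is an ordered iterable of module names; it defaults to the keys
    of sys.modules. A module that is not loaded at all counts as
    "not yet imported".
    """
    names = list(sys.modules) if loaded is None else list(loaded)
    if first not in names:
        return False          # `first` was never imported
    if second not in names:
        return True           # `first` is loaded, `second` not yet
    return names.index(first) < names.index(second)


if __name__ == "__main__":
    # Place this check next to the DALI imports of a training script.
    if imported_before("torchvision", "nvidia.dali"):
        print("warning: torchvision was imported before nvidia.dali",
              file=sys.stderr)
```

The check only reports the order; it does not explain or fix the hang.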

GDB does not show anything useful, just a bunch of threads starting, then nothing.
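Because the process ignores Ctrl+C once it hangs, one option for collecting at least the Python-level stacks is the standard library's `faulthandler` watchdog. The sketch below is an assumption about how one might instrument the script, not something taken from the gists; the helper names and the default timeout are hypothetical.

```python
import faulthandler
import sys
from typing import IO, Optional


def arm_hang_watchdog(timeout_s: float = 120.0,
                      file: Optional[IO] = None) -> None:
    """Dump all Python thread stacks and exit if still running after timeout_s."""
    faulthandler.dump_traceback_later(
        timeout_s,
        exit=True,  # terminate the process after the dump
        file=file if file is not None else sys.stderr,
    )


def disarm_hang_watchdog() -> None:
    """Cancel the pending dump once the risky section has finished."""
    faulthandler.cancel_dump_traceback_later()
```

Arm the watchdog just before requesting the first batch and disarm it after the loop completes. The dump shows only Python frames, so it can tell which Python call never returned but not what the native threads are waiting on.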

Version info:
torchvision-0.9.1-py38_cu111
nvidia-dali-cuda110 1.1.0
pytorch-lightning 1.3.0rc1
cudatoolkit 11.1.1
pytorch 1.8.1 py3.8_cuda11.1_cudnn8.0.5_0

Also, the "/data/critical_view_clips/" directory looks as such:

/data/critical_view_clips:
- class1
  - file1.mp4
  - file2.mp4
  - etc
- class2
  - file1.mp4
  - file2.mp4
  - etc

All the files are valid: they load in the working example I provided, and I have already removed the invalid video files.

TL;DR: Title

@ndalton12 ndalton12 changed the title from "Importing torchvision before DALI deadlocks pipeline indefinitely" to "Importing torchvision before DALI hangs pipeline indefinitely" Apr 17, 2021
@JanuszL JanuszL added the bug (Something isn't working) and lack_of_repro (Needs clear reproduction steps and script) labels Apr 19, 2021
@JanuszL
Contributor

JanuszL commented Apr 19, 2021

Hi,

I checked both scripts on DALI_extra/db/video/sintel/ and they work fine.
I suspect the problem lies in the files you are trying to load. One possibility is a VFR (variable frame rate) video; another is that some frames in the input video are simply missing. The DALI video reader calculates the number of frames in each file and generates the allowed sequences for the provided step, stride, and length. It then randomly selects one sequence and tries to obtain it from the video. In the cases above, DALI may wait for the decoder to produce frame N (based on the FPS and the video length) while no frame with that timestamp exists in the stream, so DALI waits for it indefinitely.
We are working on enabling such a use case but we are not there yet.
To be 100% sure that this is the case, you would need to share a sample video file that causes the problem.
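The sequence enumeration described above can be illustrated with a short sketch. This is not DALI's implementation: the function name `enumerate_sequences` and the exact indexing convention are assumptions made for illustration, following only the description of step, stride, and length given in this comment.

```python
from typing import List


def enumerate_sequences(frame_count: int, sequence_length: int,
                        stride: int = 1, step: int = 1) -> List[List[int]]:
    """List the frame indices of every sequence that fits inside a video.

    stride: distance between consecutive frames within one sequence.
    step:   distance between the first frames of consecutive sequences.
    """
    if frame_count < 0 or min(sequence_length, stride, step) < 1:
        raise ValueError("frame_count must be >= 0; other arguments >= 1")
    # Number of frames covered from the first to the last frame of a sequence.
    span = (sequence_length - 1) * stride + 1
    return [
        [start + i * stride for i in range(sequence_length)]
        for start in range(0, frame_count - span + 1, step)
    ]
```

If the frame count derived from the FPS and the video length is larger than the number of frames actually present in the stream, some enumerated sequence refers to a frame that the decoder never produces, which is the waiting scenario described above.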

@JanuszL JanuszL added the enhancement (New feature or request) label and removed the bug (Something isn't working) label Apr 19, 2021
@ndalton12
Author

ndalton12 commented Apr 19, 2021

Hi @JanuszL, thanks for the quick response.

The problem is certainly not with the video files - I can reproduce the same issue using the sintel videos from DALI_extra, as you suggest. Furthermore, the videos I tested on are not VFR, as I created them manually with ffmpeg at a set frame rate. I also tested on a VFR video to see if that would change things. In that case DALI reports a useful error message saying that VFR is not supported, but only when torchvision is imported after DALI. No such message shows up in this issue, so I think VFR is completely orthogonal to it.

Moreover, even if VFR were the problem, a silent hang would still not be the expected behavior. The issue is that importing torchvision before DALI leads to a hang with no error message at all. The expected behavior when VFR is the issue is for DALI to display the error message it normally throws for such a case, right?

Could you please review my post and ensure you are using the same library versions? If this is troublesome, I could create a Dockerfile to ensure a reliable repro.

@JanuszL
Contributor

JanuszL commented Apr 20, 2021

Hi @ndalton12,

> DALI provides a useful error message in that case that VFR is not supported

DALI uses a heuristic. You can't be sure if it is VFR or not until you parse the whole video and DALI doesn't do that.
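To illustrate why a definitive answer needs the whole stream: a check for a constant frame rate has to look at every frame timestamp. The sketch below is an assumption for illustration, not DALI's heuristic; it expects the presentation timestamps (in stream time-base units, in presentation order) to have been extracted beforehand with a tool such as ffprobe, and the name `is_constant_frame_rate` is hypothetical.

```python
from typing import Sequence


def is_constant_frame_rate(pts: Sequence[int], tolerance: int = 0) -> bool:
    """Return True if all gaps between consecutive timestamps are equal.

    pts:       presentation timestamps of every frame, in presentation order.
    tolerance: allowed difference between the largest and smallest gap.
    """
    if len(pts) < 3:
        return True  # fewer than two gaps: nothing to compare
    deltas = [later - earlier for earlier, later in zip(pts, pts[1:])]
    return max(deltas) - min(deltas) <= tolerance
```

A single irregular gap anywhere in the file changes the result, so sampling only the first few frames can only ever be a guess.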

I tried running this inside a clean docker env and it still works:

docker run --rm -ti --gpus '"capabilities=compute,utility,video"' nvidia/cuda:11.1.1-runtime-ubuntu18.04

apt update && apt install -y nano python3 python3-distutils python3-pip git git-lfs wget &&\
git lfs install &&\
pip3 install --upgrade pip && \
pip3 install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html &&\
pip3 install pytorch-lightning==1.3.0rc1 &&\
pip3 install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda110 &&\
git clone https://github.com/NVIDIA/DALI_extra &&\
wget https://gist.githubusercontent.com/ndalton12/b888395646cebe319f78006faa0b6f6a/raw/9e13d5415aed90e237703096a5cf580adfbac3fc/not_working_dali.py && \
mkdir /data && \
ln -s /DALI_extra/db/video/sintel/ /data/critical_view_clips && \
python3 not_working_dali.py

The HW and driver are:

root@090ec2ca2941:/# nvidia-smi
Tue Apr 20 15:10:17 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 00000000:01:00.0 Off |                  N/A |
| 29%   44C    P0    55W / 250W |      0MiB / 12194MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@ndalton12
Author

Hi @JanuszL,

> DALI uses a heuristic. You can't be sure if it is VFR or not until you parse the whole video and DALI doesn't do that.

I see. I was able to run through every video to confirm none are VFR.

I was also able to run without issues using your setup. Inside docker, I could only reproduce the issue when using my own specific environment, listed below. Any idea why any of these libraries would conflict with DALI? Strangely enough, the issue is still tied to importing torchvision.

_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                      1_llvm    conda-forge
absl-py                   0.11.0                   pypi_0    pypi
aiohttp                   3.7.3                    pypi_0    pypi
async-timeout             3.0.1                    pypi_0    pypi
attrs                     20.3.0                   pypi_0    pypi
av                        8.0.3                    pypi_0    pypi
blas                      1.0                         mkl    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
ca-certificates           2020.12.5            ha878542_0    conda-forge
cachetools                4.2.1                    pypi_0    pypi
captum                    0.3.1                    pypi_0    pypi
certifi                   2020.12.5        py38h578d9bd_1    conda-forge
chardet                   3.0.4                    pypi_0    pypi
clearml                   0.17.4                   pypi_0    pypi
cloudpickle               1.6.0                    pypi_0    pypi
cudatoolkit               11.1.1               h6406543_8    conda-forge
cxxfilt                   0.2.2                    pypi_0    pypi
cycler                    0.10.0                     py_2    conda-forge
dbus                      1.13.18              hb2f20db_0  
dill                      0.3.3                    pypi_0    pypi
einops                    0.3.0                    pypi_0    pypi
entmax                    1.0                      pypi_0    pypi
expat                     2.3.0                h9c3ff4c_0    conda-forge
ffmpeg                    4.3                  hf484d3e_0    pytorch
fontconfig                2.13.1               h6c09931_0  
freetype                  2.10.4               h0708190_1    conda-forge
fsspec                    0.8.5                    pypi_0    pypi
furl                      2.1.0                    pypi_0    pypi
future                    0.18.2                   pypi_0    pypi
gettext                   0.19.8.1          h0b5b191_1005    conda-forge
glib                      2.68.0               h9c3ff4c_2    conda-forge
glib-tools                2.68.0               h9c3ff4c_2    conda-forge
gmp                       6.2.1                h58526e2_0    conda-forge
gnutls                    3.6.13               h85f3911_1    conda-forge
google-auth               1.26.1                   pypi_0    pypi
google-auth-oauthlib      0.4.2                    pypi_0    pypi
grpcio                    1.35.0                   pypi_0    pypi
gst-plugins-base          1.14.0               h8213a91_2  
gstreamer                 1.14.0               h28cd5cc_2  
humanfriendly             9.1                      pypi_0    pypi
icu                       58.2              hf484d3e_1000    conda-forge
idna                      2.10                     pypi_0    pypi
joblib                    1.0.1                    pypi_0    pypi
jpeg                      9b                   h024ee3a_2  
jsonschema                3.2.0                    pypi_0    pypi
kiwisolver                1.3.1            py38h1fd1430_1    conda-forge
kornia                    0.5.0                    pypi_0    pypi
lame                      3.100             h7f98852_1001    conda-forge
lcms2                     2.12                 h3be6417_0  
ld_impl_linux-64          2.35.1               hea4e1c9_2    conda-forge
libffi                    3.3                  h58526e2_2    conda-forge
libgcc-ng                 9.3.0               h2828fa1_19    conda-forge
libglib                   2.68.0               h3e27bee_2    conda-forge
libiconv                  1.16                 h516909a_0    conda-forge
libpng                    1.6.37               h21135ba_2    conda-forge
libstdcxx-ng              9.3.0               h6de172a_19    conda-forge
libtiff                   4.1.0                h2733197_1  
libuuid                   1.0.3                h1bed415_2  
libuv                     1.41.0               h7f98852_0    conda-forge
libxcb                    1.14                 h7b6447c_0  
libxml2                   2.9.10               hb55368b_3  
llvm-openmp               11.0.1               h4bd325d_0    conda-forge
llvmlite                  0.36.0                   pypi_0    pypi
lmdb                      1.1.1                    pypi_0    pypi
lz4-c                     1.9.3                h9c3ff4c_0    conda-forge
markdown                  3.3.3                    pypi_0    pypi
matplotlib                3.3.4            py38h578d9bd_0    conda-forge
matplotlib-base           3.3.4            py38h0efea84_0    conda-forge
mkl                       2020.4             h726a3e6_304    conda-forge
mkl-service               2.3.0            py38h1e0a361_2    conda-forge
mkl_fft                   1.3.0            py38h5c078b8_1    conda-forge
mkl_random                1.2.0            py38hc5bc63f_1    conda-forge
monai                     0.4.0                    pypi_0    pypi
multidict                 5.1.0                    pypi_0    pypi
ncurses                   6.2                  h58526e2_4    conda-forge
nettle                    3.6                  he412f7d_0    conda-forge
ninja                     1.10.2               h4bd325d_0    conda-forge
numba                     0.53.1                   pypi_0    pypi
numpy                     1.19.2           py38h54aff64_0  
numpy-base                1.19.2           py38hfa32c7d_0  
nvidia-dali-cuda110       1.1.0                    pypi_0    pypi
oauthlib                  3.1.0                    pypi_0    pypi
olefile                   0.46               pyh9f0ad1d_1    conda-forge
opencv-python             4.5.1.48                 pypi_0    pypi
openh264                  2.1.1                h780b84a_0    conda-forge
openssl                   1.1.1k               h7f98852_0    conda-forge
orderedmultidict          1.0.1                    pypi_0    pypi
pandas                    1.2.3                    pypi_0    pypi
pathlib2                  2.3.5                    pypi_0    pypi
pcre                      8.44                 he1b5a44_0    conda-forge
pillow                    8.2.0            py38he98fc37_0  
pip                       21.0.1             pyhd8ed1ab_0    conda-forge
protobuf                  3.14.0                   pypi_0    pypi
psutil                    5.8.0                    pypi_0    pypi
pyasn1                    0.4.8                    pypi_0    pypi
pyasn1-modules            0.2.8                    pypi_0    pypi
pyjwt                     2.0.1                    pypi_0    pypi
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
pyqt                      5.9.2            py38h05f1152_4  
pyrsistent                0.17.3                   pypi_0    pypi
python                    3.8.5                h7579374_1  
python-dateutil           2.8.1                      py_0    conda-forge
python_abi                3.8                      1_cp38    conda-forge
pytorch                   1.8.1           py3.8_cuda11.1_cudnn8.0.5_0    pytorch
pytorch-lightning         1.2.7                    pypi_0    pypi
pytz                      2021.1                   pypi_0    pypi
pyyaml                    5.3.1                    pypi_0    pypi
qt                        5.9.7                h5867ecd_1  
readline                  8.1                  h46c0cb4_0    conda-forge
requests                  2.25.1                   pypi_0    pypi
requests-file             1.5.1                    pypi_0    pypi
requests-oauthlib         1.3.0                    pypi_0    pypi
rsa                       4.7.1                    pypi_0    pypi
scikit-learn              0.24.1                   pypi_0    pypi
scipy                     1.6.0                    pypi_0    pypi
setuptools                52.0.0           py38h06a4308_0  
sip                       4.19.13          py38he6710b0_0  
six                       1.15.0             pyh9f0ad1d_0    conda-forge
slicer                    0.0.7                    pypi_0    pypi
sqlite                    3.35.4               h74cdb3f_0    conda-forge
tensorboard               2.4.1                    pypi_0    pypi
tensorboard-plugin-wit    1.8.0                    pypi_0    pypi
threadpoolctl             2.1.0                    pypi_0    pypi
timm                      0.3.4                    pypi_0    pypi
tk                        8.6.10               h21135ba_1    conda-forge
torchmetrics              0.3.0rc0                 pypi_0    pypi
torchvision               0.9.1                py38_cu111    pytorch
tornado                   6.1              py38h497a2fe_1    conda-forge
tqdm                      4.56.2                   pypi_0    pypi
typing_extensions         3.7.4.3                    py_0    conda-forge
urllib3                   1.26.3                   pypi_0    pypi
vit-pytorch               0.6.7                    pypi_0    pypi
werkzeug                  1.0.1                    pypi_0    pypi
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
x-transformers            0.8.3                    pypi_0    pypi
xz                        5.2.5                h516909a_1    conda-forge
yarl                      1.6.3                    pypi_0    pypi
zlib                      1.2.11            h516909a_1010    conda-forge
zstd                      1.4.9                ha95c52a_0    conda-forge

Here is a working environment in which I removed some of the unnecessary libraries from above:

_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                      1_llvm    conda-forge
absl-py                   0.12.0                   pypi_0    pypi
aiohttp                   3.7.4.post0              pypi_0    pypi
async-timeout             3.0.1                    pypi_0    pypi
attrs                     20.3.0                   pypi_0    pypi
blas                      1.0                         mkl  
bzip2                     1.0.8                h7f98852_4    conda-forge
ca-certificates           2020.12.5            ha878542_0    conda-forge
cachetools                4.2.1                    pypi_0    pypi
captum                    0.3.1                    pypi_0    pypi
certifi                   2020.12.5        py38h578d9bd_1    conda-forge
chardet                   4.0.0                    pypi_0    pypi
clearml                   0.17.4                   pypi_0    pypi
cudatoolkit               11.1.1               h6406543_8    conda-forge
cycler                    0.10.0                   pypi_0    pypi
einops                    0.3.0                    pypi_0    pypi
ffmpeg                    4.3                  hf484d3e_0    pytorch
freetype                  2.10.4               h0708190_1    conda-forge
fsspec                    2021.4.0                 pypi_0    pypi
furl                      2.1.2                    pypi_0    pypi
future                    0.18.2                   pypi_0    pypi
gmp                       6.2.1                h58526e2_0    conda-forge
gnutls                    3.6.13               h85f3911_1    conda-forge
google-auth               1.29.0                   pypi_0    pypi
google-auth-oauthlib      0.4.4                    pypi_0    pypi
grpcio                    1.37.0                   pypi_0    pypi
humanfriendly             9.1                      pypi_0    pypi
idna                      2.10                     pypi_0    pypi
joblib                    1.0.1                    pypi_0    pypi
jpeg                      9b                   h024ee3a_2  
jsonschema                3.2.0                    pypi_0    pypi
kiwisolver                1.3.1                    pypi_0    pypi
kornia                    0.5.0                    pypi_0    pypi
lame                      3.100             h7f98852_1001    conda-forge
lcms2                     2.12                 h3be6417_0  
ld_impl_linux-64          2.35.1               hea4e1c9_2    conda-forge
libffi                    3.2.1             he1b5a44_1007    conda-forge
libgcc-ng                 9.3.0               h2828fa1_19    conda-forge
libiconv                  1.16                 h516909a_0    conda-forge
libpng                    1.6.37               h21135ba_2    conda-forge
libstdcxx-ng              9.3.0               h6de172a_19    conda-forge
libtiff                   4.1.0                h2733197_1  
libuv                     1.41.0               h7f98852_0    conda-forge
llvm-openmp               11.1.0               h4bd325d_1    conda-forge
lz4-c                     1.9.3                h9c3ff4c_0    conda-forge
markdown                  3.3.4                    pypi_0    pypi
matplotlib                3.4.1                    pypi_0    pypi
mkl                       2020.4             h726a3e6_304    conda-forge
mkl-service               2.3.0            py38h1e0a361_2    conda-forge
mkl_fft                   1.3.0            py38h5c078b8_1    conda-forge
mkl_random                1.2.0            py38hc5bc63f_1    conda-forge
monai                     0.4.0                    pypi_0    pypi
multidict                 5.1.0                    pypi_0    pypi
ncurses                   6.2                  h58526e2_4    conda-forge
nettle                    3.6                  he412f7d_0    conda-forge
ninja                     1.10.2               h4bd325d_0    conda-forge
numpy                     1.19.2           py38h54aff64_0  
numpy-base                1.19.2           py38hfa32c7d_0  
nvidia-dali-cuda110       1.1.0                    pypi_0    pypi
oauthlib                  3.1.0                    pypi_0    pypi
olefile                   0.46               pyh9f0ad1d_1    conda-forge
opencv-python             4.5.1.48                 pypi_0    pypi
openh264                  2.1.1                h780b84a_0    conda-forge
openssl                   1.1.1k               h7f98852_0    conda-forge
orderedmultidict          1.0.1                    pypi_0    pypi
pathlib2                  2.3.5                    pypi_0    pypi
pillow                    8.2.0            py38he98fc37_0  
pip                       21.0.1             pyhd8ed1ab_0    conda-forge
protobuf                  3.15.8                   pypi_0    pypi
psutil                    5.8.0                    pypi_0    pypi
pyasn1                    0.4.8                    pypi_0    pypi
pyasn1-modules            0.2.8                    pypi_0    pypi
pyjwt                     2.0.1                    pypi_0    pypi
pyparsing                 2.4.7                    pypi_0    pypi
pyrsistent                0.17.3                   pypi_0    pypi
python                    3.8.5           h1103e12_9_cpython    conda-forge
python-dateutil           2.8.1                    pypi_0    pypi
python_abi                3.8                      1_cp38    conda-forge
pytorch                   1.8.1           py3.8_cuda11.1_cudnn8.0.5_0    pytorch
pytorch-lightning         1.2.7                    pypi_0    pypi
pyyaml                    5.3.1                    pypi_0    pypi
readline                  8.1                  h46c0cb4_0    conda-forge
requests                  2.25.1                   pypi_0    pypi
requests-file             1.5.1                    pypi_0    pypi
requests-oauthlib         1.3.0                    pypi_0    pypi
rsa                       4.7.2                    pypi_0    pypi
setuptools                49.6.0           py38h578d9bd_3    conda-forge
six                       1.15.0             pyh9f0ad1d_0    conda-forge
sqlite                    3.35.4               h74cdb3f_0    conda-forge
tensorboard               2.5.0                    pypi_0    pypi
tensorboard-data-server   0.6.0                    pypi_0    pypi
tensorboard-plugin-wit    1.8.0                    pypi_0    pypi
tk                        8.6.10               h21135ba_1    conda-forge
torchmetrics              0.2.0                    pypi_0    pypi
torchvision               0.9.1                py38_cu111    pytorch
tqdm                      4.60.0                   pypi_0    pypi
typing_extensions         3.7.4.3                    py_0    conda-forge
urllib3                   1.26.4                   pypi_0    pypi
werkzeug                  1.0.1                    pypi_0    pypi
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
yarl                      1.6.3                    pypi_0    pypi
zlib                      1.2.11            h516909a_1010    conda-forge
zstd                      1.4.9                ha95c52a_0    conda-forge

The only hint I have (besides going through the removals one by one) is this output from running the code with the working environment:

[mov,mp4,m4a,3gp,3g2,mj2 @ 0x55928e8125c0] Could not find codec parameters for stream 1 (Audio: aac (mp4a / 0x6134706D), 48000 Hz, 2 channels, 127 kb/s): unspecified sample format
Consider increasing the value for the 'analyzeduration' and 'probesize' options
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x55928e80f440] Could not find codec parameters for stream 1 (Audio: aac (mp4a / 0x6134706D), 48000 Hz, 2 channels, 127 kb/s): unspecified sample format
Consider increasing the value for the 'analyzeduration' and 'probesize' options

Do you also encounter this warning? It looks like an ffmpeg issue, but both the working and non-working environments have the same version of ffmpeg, and the warning does not show up in the non-working one.
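Rather than going through the removals one by one, the two package lists can be compared mechanically. This is a minimal sketch assuming each list is saved as the whitespace-separated text that `conda list` prints (name, version, build, optional channel); the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Package = Tuple[str, str]  # (version, build)


def parse_conda_list(text: str) -> Dict[str, Package]:
    """Map package name to (version, build) for each row of `conda list` output."""
    packages: Dict[str, Package] = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the '#' header lines that conda prints.
        if len(fields) < 3 or fields[0].startswith("#"):
            continue
        name, version, build = fields[:3]
        packages[name] = (version, build)
    return packages


def diff_environments(
    broken: Dict[str, Package], working: Dict[str, Package]
) -> Tuple[List[str], List[str], Dict[str, Tuple[Package, Package]]]:
    """Return (only in broken, only in working, present in both but different)."""
    only_broken = sorted(set(broken) - set(working))
    only_working = sorted(set(working) - set(broken))
    changed = {
        name: (broken[name], working[name])
        for name in sorted(set(broken) & set(working))
        if broken[name] != working[name]
    }
    return only_broken, only_working, changed
```

Applied to the two lists above, this would surface, for example, that av 8.0.3 appears only in the non-working environment and that libffi differs (3.3 versus 3.2.1). The diff narrows down the candidates; it does not by itself identify which package is responsible.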

@JanuszL
Contributor

JanuszL commented Apr 21, 2021

Hi,

It may be some ABI incompatibility.
Could you provide a set of commands to run to recreate your environment?
The one I have:

_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
absl-py                   0.12.0                   pypi_0    pypi
aiohttp                   3.7.4.post0              pypi_0    pypi
async-timeout             3.0.1                    pypi_0    pypi
attrs                     20.3.0                   pypi_0    pypi
blas                      1.0                         mkl
bzip2                     1.0.8                h7f98852_4    conda-forge
ca-certificates           2020.12.5            ha878542_0    conda-forge
cachetools                4.2.1                    pypi_0    pypi
certifi                   2020.12.5        py38h578d9bd_1    conda-forge
chardet                   4.0.0                    pypi_0    pypi
cudatoolkit               11.1.1               h6406543_8    conda-forge
cycler                    0.10.0                     py_2    conda-forge
expat                     2.3.0                h9c3ff4c_0    conda-forge
ffmpeg                    4.3                  hf484d3e_0    pytorch
freetype                  2.10.4               h0708190_1    conda-forge
fsspec                    2021.4.0                 pypi_0    pypi
future                    0.18.2                   pypi_0    pypi
gettext                   0.19.8.1          h0b5b191_1005    conda-forge
glib                      2.68.1               h9c3ff4c_0    conda-forge
glib-tools                2.68.1               h9c3ff4c_0    conda-forge
gmp                       6.2.1                h58526e2_0    conda-forge
gnutls                    3.6.13               h85f3911_1    conda-forge
google-auth               1.29.0                   pypi_0    pypi
google-auth-oauthlib      0.4.4                    pypi_0    pypi
grpcio                    1.37.0                   pypi_0    pypi
icu                       67.1                 he1b5a44_0    conda-forge
idna                      2.10                     pypi_0    pypi
intel-openmp              2020.2                      254
jpeg                      9b                   h024ee3a_2
kiwisolver                1.3.1            py38h1fd1430_1    conda-forge
lame                      3.100             h7f98852_1001    conda-forge
lcms2                     2.12                 h3be6417_0
ld_impl_linux-64          2.35.1               hea4e1c9_2    conda-forge
libffi                    3.3                  h58526e2_2    conda-forge
libgcc-ng                 9.3.0               h2828fa1_19    conda-forge
libglib                   2.68.1               h3e27bee_0    conda-forge
libgomp                   9.3.0               h2828fa1_19    conda-forge
libiconv                  1.16                 h516909a_0    conda-forge
libpng                    1.6.37               h21135ba_2    conda-forge
libstdcxx-ng              9.3.0               h6de172a_19    conda-forge
libtiff                   4.1.0                h2733197_1
libuv                     1.41.0               h7f98852_0    conda-forge
llvm-openmp               11.1.0               h4bd325d_1    conda-forge
lz4-c                     1.9.3                h9c3ff4c_0    conda-forge
markdown                  3.3.4                    pypi_0    pypi
matplotlib                3.2.2                         1    conda-forge
matplotlib-base           3.2.2            py38h5d868c9_1    conda-forge
mkl                       2020.2                      256
mkl-service               2.3.0            py38h1e0a361_2    conda-forge
mkl_fft                   1.3.0            py38h54f3939_0
mkl_random                1.2.0            py38hc5bc63f_1    conda-forge
multidict                 5.1.0                    pypi_0    pypi
ncurses                   6.2                  h58526e2_4    conda-forge
nettle                    3.6                  he412f7d_0    conda-forge
ninja                     1.10.2               h4bd325d_0    conda-forge
numpy                     1.19.2           py38h54aff64_0
numpy-base                1.19.2           py38hfa32c7d_0
nvidia-dali-cuda110       1.1.0                    pypi_0    pypi
oauthlib                  3.1.0                    pypi_0    pypi
olefile                   0.46               pyh9f0ad1d_1    conda-forge
openh264                  2.1.1                h780b84a_0    conda-forge
openssl                   1.1.1k               h7f98852_0    conda-forge
packaging                 20.9                     pypi_0    pypi
pcre                      8.44                 he1b5a44_0    conda-forge
pillow                    8.2.0            py38he98fc37_0
pip                       21.0.1             pyhd8ed1ab_0    conda-forge
protobuf                  3.15.8                   pypi_0    pypi
pyasn1                    0.4.8                    pypi_0    pypi
pyasn1-modules            0.2.8                    pypi_0    pypi
pydeprecate               0.2.0                    pypi_0    pypi
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
python                    3.8.8           hffdb5ce_0_cpython    conda-forge
python-dateutil           2.8.1                      py_0    conda-forge
python_abi                3.8                      1_cp38    conda-forge
pytorch                   1.8.1           py3.8_cuda11.1_cudnn8.0.5_0    pytorch
pytorch-lightning         1.3.0rc1                 pypi_0    pypi
pyyaml                    5.3.1                    pypi_0    pypi
readline                  8.1                  h46c0cb4_0    conda-forge
requests                  2.25.1                   pypi_0    pypi
requests-oauthlib         1.3.0                    pypi_0    pypi
rsa                       4.7.2                    pypi_0    pypi
setuptools                49.6.0           py38h578d9bd_3    conda-forge
six                       1.15.0             pyh9f0ad1d_0    conda-forge
sqlite                    3.35.4               h74cdb3f_0    conda-forge
tensorboard               2.5.0                    pypi_0    pypi
tensorboard-data-server   0.6.0                    pypi_0    pypi
tensorboard-plugin-wit    1.8.0                    pypi_0    pypi
tk                        8.6.10               h21135ba_1    conda-forge
torchaudio                0.8.1                      py38    pytorch
torchmetrics              0.3.0                    pypi_0    pypi
torchvision               0.9.1                py38_cu111    pytorch
tornado                   6.1              py38h497a2fe_1    conda-forge
tqdm                      4.60.0                   pypi_0    pypi
typing_extensions         3.7.4.3                    py_0    conda-forge
urllib3                   1.26.4                   pypi_0    pypi
werkzeug                  1.0.1                    pypi_0    pypi
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
yarl                      1.6.3                    pypi_0    pypi
zlib                      1.2.11            h516909a_1010    conda-forge
zstd                      1.4.9                ha95c52a_0    conda-forge

works fine.

@ndalton12
Author

@JanuszL Sorry for the late reply, here you go:

docker run --rm -ti --gpus '"capabilities=compute,utility,video"' nvidia/cuda:11.1.1-runtime-ubuntu18.04

apt update && apt install -y wget git git-lfs && git lfs install
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh  # follow default install
bash
conda install pytorch cudatoolkit=11.1 -c pytorch -c conda-forge  # follow defaults
git clone https://gist.github.com/7011e9c7963d6167ffa0c7fbcff98285.git
conda env create -f 7011e9c7963d6167ffa0c7fbcff98285/dali_env.yaml
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda110
git clone https://github.com/NVIDIA/DALI_extra
git clone https://gist.github.com/b888395646cebe319f78006faa0b6f6a.git
mkdir /data && ln -s /DALI_extra/db/video/sintel/ /data/critical_view_clips
python b888395646cebe319f78006faa0b6f6a/not_working_dali.py

I am able to reproduce with this setup.

@JanuszL added the bug (Something isn't working) label and removed the enhancement (New feature or request) label on Apr 27, 2021
@JanuszL
Contributor

JanuszL commented Apr 27, 2021

Hi,

It seems that importing Torchvision first brings the system FFmpeg binaries into the process, and DALI then uses symbols from them even though we ship our own FFmpeg build. The upstream FFmpeg build has a different configuration and doesn't work well with DALI.
We need to discuss how to tackle that issue.
As a workaround, please import DALI first if possible.
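
One way to make the import-order workaround harder to break by accident is a small guard at the top of the script. The sketch below is only an illustration under stated assumptions: `import_order_ok` is a hypothetical helper (not part of DALI or torchvision), and it relies on `sys.modules` being a dict that preserves insertion order (Python 3.7+), so the position of a module name reflects when its import started.

```python
import sys


def import_order_ok(loaded=None):
    """Return True if DALI was (or still can be) imported before torchvision.

    Hypothetical helper, not part of DALI or torchvision.
    `loaded` defaults to sys.modules; a plain dict can be passed for testing.
    """
    names = list(sys.modules if loaded is None else loaded)
    if "torchvision" not in names:
        # torchvision has not been imported yet, so DALI can still go first.
        return True
    if "nvidia.dali" not in names:
        # torchvision is already loaded and DALI is not: the risky order.
        return False
    # Both are loaded: compare the order in which they were registered.
    return names.index("nvidia.dali") < names.index("torchvision")
```

Calling `import_order_ok()` just before the first `import nvidia.dali` line and raising or warning when it returns `False` would flag the ordering described in this issue early, instead of leaving the pipeline to hang later.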

@ndalton12
Author

ndalton12 commented Apr 29, 2021

@JanuszL Okay, thanks for the info.

On a slightly different note, I am getting large differences in model performance when using DALI loaded videos vs. using torchvision as my video reader. I tried to eliminate any differences between the two methods, but the DALI version generally has worse accuracy (15-20% difference) on the train and test split. After training and against an untouched dataset, though, the two methods tend to perform similarly (about 5% max difference). Any ideas why this could happen and/or if it's related to this issue?

The datasets are original_clips, pre cut data (which is a subset of original clips), and phases_clips (which is totally separate). The statistic measured is accuracy.
image

@JanuszL
Contributor

JanuszL commented Apr 29, 2021

Hi @ndalton12,

It is hard to tell what the reason is. It would be best to compare side by side what the outputs from DALI and torchvision look like. Maybe the two pipelines are configured differently (even when the same operations are used, the default options, and thus the behavior, may differ).
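
A side-by-side comparison like this can be made numeric as well as visual. The following is a minimal sketch, assuming the same frame has been decoded by both loaders and flattened to equally long sequences of pixel values (for PyTorch tensors, something like `frame.flatten().tolist()`). The name `frame_diff_stats` is hypothetical and the code is not taken from either library.

```python
def frame_diff_stats(pixels_a, pixels_b):
    """Return (mean_abs_diff, max_abs_diff) for two flattened frames.

    Hypothetical helper: both arguments are assumed to be flat sequences
    of numeric pixel values for the same frame, one per loader.
    """
    a = list(pixels_a)
    b = list(pixels_b)
    if len(a) != len(b):
        # A size mismatch already points at differing resize/crop defaults.
        raise ValueError(f"frame sizes differ: {len(a)} vs {len(b)}")
    if not a:
        raise ValueError("frames are empty")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    return sum(diffs) / len(diffs), max(diffs)
```

A small mean difference with a few large outliers would be consistent with isolated decoding artifacts, while a large mean difference would suggest a systematic mismatch such as a different color space, value range, or normalization.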

@JanuszL
Contributor

JanuszL commented Apr 29, 2021

@ndalton12 - we have a preliminary fix for this particular problem in #2911 and NVIDIA/DALI_deps#6. We still need to validate that this is the way to go for us, but it looks promising.

@ndalton12
Author

Hi @JanuszL, I can give the new fix a try when a new release is pushed.

About the differences in performance: I have minimized the differences in the pipeline as much as possible, so that only the frame loading is different (all the transformations, specifically converting to float and the resizing and normalization, are done the exact same way). The only other difference is the random ordering of the frames. Upon visual inspection, the images are basically the same with the difference being that the torchvision solution has a bit more artifacting (random colored pixels here and there) on the image. Despite this, the dali version still fails to generalize and performs poorly on the test set and separate dataset.

I am a bit lost as to what could be causing this difference - any advice? Attribution methods show that the torchvision version learns important geometry in the frames, while the dali version does not seem to learn to distinguish classes based on the geometry. The only theory I have is that the artifacting from torchvision provides an important regularization. This seems unlikely, though, as the dali version still has minor artifacting due to using the same transformation pipeline after frame loading.

@JanuszL
Contributor

JanuszL commented May 5, 2021

@ndalton12,

The only thing that comes to my mind is the frame/sequence distribution. Have you turned on shuffling? Are you sure you load all videos as in Torchvision?
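
A quick way to check the frame/sequence distribution is to log an identifier (for example a video file name or label) for every sample each loader yields during one epoch, then compare the two logs. The sketch below assumes such identifier lists have already been collected. `compare_epoch_coverage` is a hypothetical helper, not a DALI or torchvision API.

```python
from collections import Counter


def compare_epoch_coverage(ids_a, ids_b):
    """Compare which sample identifiers two loaders produced in one epoch.

    Hypothetical helper. Returns (only_in_a, only_in_b, mismatched), where
    `mismatched` maps an identifier to its (count_a, count_b) pair when
    both loaders saw it but a different number of times.
    """
    count_a, count_b = Counter(ids_a), Counter(ids_b)
    only_in_a = sorted(set(count_a) - set(count_b))
    only_in_b = sorted(set(count_b) - set(count_a))
    mismatched = {
        key: (count_a[key], count_b[key])
        for key in set(count_a) & set(count_b)
        if count_a[key] != count_b[key]
    }
    return only_in_a, only_in_b, mismatched
```

If all three results are empty, both loaders covered the same videos equally often in that epoch, and the remaining difference would have to come from ordering or from the decoded pixel values.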

@JanuszL added this to the Release_1.2.0 milestone on May 7, 2021
@JanuszL
Contributor

JanuszL commented May 7, 2021

#2911 has been merged. Please check the nightly build that follows it to see if that resolves your problem.

@ndalton12
Author

Hi @JanuszL, it looks like the problem has been resolved so I will close the issue. Thanks for your help.

As for the difference in performance, I wasn't able to figure it out. I used the same pre-processing after loading, but the performance difference is still there. I'm not really sure what would cause it, since only the video loading is different.
