NVIDIA-DALI Capabilities issue #3390

Closed · gulzainali98 opened this issue Sep 29, 2021 · 26 comments
Labels: help wanted (Extra attention is needed)

@gulzainali98

I am running an enroot container on a Slurm cluster and I am getting an error.

This is the whole error: https://pastebin.com/96CYv9fs
I am trying to run training for this repo: https://github.com/m-tassano/fastdvdnet

The error mentioned in the Pastebin occurs at the following line: https://github.com/m-tassano/fastdvdnet/blob/master/dataloaders.py#L102

The code works fine on my local machine; the error occurs only on the Slurm cluster. I searched a bit and came across #2229, which describes a similar issue to mine.

After going through the solutions in that issue, I found out that when running a video reader pipeline in a container, you need to explicitly enable all the driver capabilities. For plain Docker images, this can be done using the syntax shown in NVIDIA/nvidia-docker#1128 (comment).

However, I am not sure how to achieve this with our enroot containers.

@JanuszL (Contributor) commented Sep 29, 2021

Hi @gulzainali98,

I'm sorry but I haven't run DALI using enroot with the video reader.
Have you tried adding the following line to the container definition?

ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility,video,graphics,compat32,utility

In the case of the docker runtime, this should be equivalent to adding the capabilities to the docker invocation command. Maybe it works with enroot as well.

If that doesn't help, I would wait for the answer to NVIDIA/enroot#100.

JanuszL added the help wanted label on Sep 29, 2021
@gulzainali98 (Author)

Hello,
I was able to resolve that error by exporting NVIDIA_DRIVER_CAPABILITIES=compute,utility,video,graphics,compat32,utility as an environment variable. But now I am getting the following error: nvml error: 13 Local version of NVML doesn't implement this function

at the same line: https://github.com/m-tassano/fastdvdnet/blob/master/dataloaders.py#L102

@gulzainali98 (Author)

complete error:

File "train_fastdvdnet.py", line 277, in <module>
    main(**vars(argspar))
File "train_fastdvdnet.py", line 57, in main
    temp_stride=3)
File "/home/mkhan/generic_loss/fastdvdnet/dataloaders.py", line 118, in __init__
    self.pipeline.build()
File "/opt/conda/lib/python3.6/site-packages/nvidia/dali/pipeline.py", line 660, in build
    self._pipe.Build(self._names_and_devices)
RuntimeError: nvml error: 13 Local version of NVML doesn't implement this function

@gulzainali98 (Author)

I have also tried updating ops.VideoReader to the new ops.readers.Video, but I am still getting the same error.

@JanuszL (Contributor) commented Sep 29, 2021

Hi @gulzainali98,

Can you try a different DALI pipeline, like one from here?
This seems to be a driver incompatibility. Are you sure you have the driver version corresponding to the CUDA version your DALI installation is built with?
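If it helps, a pipeline as small as this should show whether pipeline construction itself hits the NVML error (a minimal sketch with no file readers or video decoding involved; the batch size and shape are arbitrary placeholders):

# Minimal DALI pipeline sketch; no files or video decoding involved.
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn

@pipeline_def(batch_size=2, num_threads=2, device_id=0)
def noop_pipe():
    # Just generate random data and move it to the GPU, so this isolates
    # pipeline construction from video decoding.
    data = fn.random.uniform(range=(0.0, 1.0), shape=[3, 64, 64])
    return data.gpu()

pipe = noop_pipe()
pipe.build()   # the nvml error in your traceback was raised from this call
print(pipe.run())

If this minimal pipeline builds fine, the problem is specific to the video reader; if it fails the same way, it points at the NVML/driver setup in the container.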

@gulzainali98 (Author)

Yes, the CUDA version is correct (11.0), and I am installing DALI with the cu110 suffix.
I will try another pipeline and see if the error persists.

@gulzainali98 (Author)

same error even with the simple pipeline:

File "train_fastdvdnet.py", line 277, in
main(**vars(argspar))
File "train_fastdvdnet.py", line 57, in main
temp_stride=3)
File "/home/mkhan/generic_loss/fastdvdnet/dataloaders.py", line 119, in init
pipe.build()
File "/opt/conda/lib/python3.6/site-packages/nvidia/dali/pipeline.py", line 660, in build
self._pipe.Build(self._names_and_devices)
RuntimeError: nvml error: 13 Local version of NVML doesn't implement this function
srun: error: serv-9216: task 0: Exited with exit code 1

@gulzainali98 (Author)

I have tried it on RTX 3090, A100, and plain NVIDIA GTX 1080 GPUs as well.

@JanuszL (Contributor) commented Sep 29, 2021

@gulzainali98 can you also check if nvidia-smi works in your cluster env?
As I understand it, DALI works fine in your local environment but not on the cluster.
Maybe there is something wrong with NVIDIA_DRIVER_CAPABILITIES?

@gulzainali98 (Author) commented Sep 29, 2021 via email

@JanuszL (Contributor) commented Sep 30, 2021

Hi @gulzainali98,

nvidia-smi uses the same underlying library that DALI has a problem with, NVML. So if nvidia-smi doesn't work, the problem is not related to DALI.
I would check with the cluster administrators why this doesn't work for you. Maybe you can start with a minimal image, like a bare Ubuntu one, see if nvidia-smi works there, and then take baby steps to figure out what breaks your config.
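If it helps to narrow things down, NVML can also be queried directly from Python inside the container (a quick sketch, assuming the pynvml bindings, e.g. the nvidia-ml-py package, are installed):

# Minimal NVML sanity check; nvidia-smi uses the same library under the hood.
import pynvml

pynvml.nvmlInit()
print("driver:", pynvml.nvmlSystemGetDriverVersion())
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print("gpu 0:", pynvml.nvmlDeviceGetName(handle))
pynvml.nvmlShutdown()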

@gulzainali98 (Author)

nvidia-smi does not work when I use --export="NVIDIA_DRIVER_CAPABILITIES=compute,utility,video,graphics,compat32,utility" while starting my Slurm job. I am confirming with the admins now.

@JanuszL (Contributor) commented Sep 30, 2021

You can also check if the suggestion from NVIDIA/enroot#100 (comment) helps with that problem.

@gulzainali98 (Author)

Okay, so I was able to run the basic pipeline; I was making a mistake in the Slurm command. However, I am not sure how to convert the current pipeline here: https://github.com/m-tassano/fastdvdnet/blob/master/dataloaders.py. With the old pipeline present in the original code I am getting the error "Current pipeline object is no longer valid."

Should I simply use crop_mirror_normalize and uniform from nvidia.dali.fn now, so that all the previous operations are turned into their functional counterparts?

@JanuszL (Contributor) commented Sep 30, 2021

Hi @gulzainali98,

The pipeline there should work fine as it is. It uses the old API, but DALI still supports it even though the functional API is the recommended way.
Do you have any problem with the pipeline defined there, or is it related to the rework you want to do?
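For reference, if you do decide to rework it, the mapping to the functional API is mostly mechanical. A rough sketch of what the video-reading part might look like with nvidia.dali.fn (the batch size, crop size, and reader arguments below are placeholders, not the values from the fastdvdnet repo):

# Rough functional-API sketch; argument values are placeholders.
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types

@pipeline_def(batch_size=4, num_threads=2, device_id=0)
def video_train_pipe(filenames, crop_size=96):
    seq = fn.readers.video(device="gpu", filenames=filenames,
                           sequence_length=5, random_shuffle=True)
    # Random crop position, playing the role of the old ops.Uniform +
    # ops.CropMirrorNormalize pair.
    u = fn.random.uniform(range=(0.0, 1.0))
    v = fn.random.uniform(range=(0.0, 1.0))
    return fn.crop_mirror_normalize(seq,
                                    crop=(crop_size, crop_size),
                                    crop_pos_x=u, crop_pos_y=v,
                                    dtype=types.FLOAT,
                                    output_layout="FCHW")

But as said above, the existing ops-based pipeline should keep working as it is.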

@gulzainali98 (Author)

I am getting the following error with the old pipeline: https://pastebin.com/199pc4qw
At the end of the error it says "Current pipeline object is no longer valid".

@JanuszL (Contributor) commented Sep 30, 2021

In the log, I see that DALI cannot open libnvcuvid.so.
Again, can you check if this works in your case (as suggested in NVIDIA/enroot#100 (comment))?

$ NVIDIA_DRIVER_CAPABILITIES=compute,utility,video enroot start nvidia+cuda+11.4.0-base ldconfig -p | grep nvcuvid
        libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1

@gulzainali98 (Author) commented Sep 30, 2021

I cannot execute enroot directly, but I am running the Slurm command after setting NVIDIA_DRIVER_CAPABILITIES=compute,utility,video and then using the --export option to import NVIDIA_DRIVER_CAPABILITIES into the Slurm environment.

Here is the command and the output of my Slurm job:
NVIDIA_DRIVER_CAPABILITIES=compute,utility,video,graphics,compat32,utility srun --ntasks=1 --gpus-per-task=1 --container-image=nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'echo $NVIDIA_DRIVER_CAPABILITIES'

srun: job xxxx queued and waiting for resources
srun: job xxxx has been allocated resources
pyxis: creating container filesystem ...
pyxis: starting container ...
compute,utility,video

@JanuszL (Contributor) commented Sep 30, 2021

It still doesn't answer the question of whether the ldconfig -p | grep nvcuvid command shows what we need (libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1), i.e. whether libnvcuvid.so is available.
There is also one more thing: the video part of the driver may not be installed on your cluster. The GPU driver has several optional parts that can be left out. A typical user usually installs all of them, but on clusters some of them may not be there.
I guess the cluster has nvidia-headless-4XX and nvidia-utils-4XX, while you probably need libnvidia-decode-4XX as well.
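As a quick check from inside the container, you can also test whether the decode library resolves at all (a small sketch; it only verifies that the shared object can be loaded):

# Check that the NVDEC user-space library (libnvcuvid) is loadable in the container.
import ctypes

try:
    ctypes.CDLL("libnvcuvid.so.1")
    print("libnvcuvid.so.1 loaded OK")
except OSError as err:
    print("libnvcuvid.so.1 not loadable:", err)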

@gulzainali98 (Author)

I had an admin run the commands; here is the output: https://pastebin.com/vCbkgE3D

@JanuszL (Contributor) commented Sep 30, 2021

Mhh,

The most recent log shows that libnvcuvid.so is there, but DALI cannot access it, as the previous error states. Can you rerun the test with the nvcr.io_nvidia_pytorch_21.08-py3.sqsh image? (I'm sorry, but I'm slowly running out of ideas about what is wrong.)

@gulzainali98 (Author)

It's the same error again

@joernhees

hi, "an admin" here...

to me this seems like a conflated issue between how the NVIDIA_DRIVER_CAPABILITIES env var is treated (used / set / ignored) by srun/pyxis/enroot, running the right base image (some images seem to mess with the NVIDIA_DRIVER_CAPABILITIES defaults), and running on the right hardware (some GPUs don't seem to like libnvcuvid?!?)...

anyhow, here is a way that at least gets us to the expected container state:

# slurm / pyxis / enroot / srun seem to not really respect the NVIDIA_DRIVER_CAPABILITIES env var,
# so pass it in via an enroot user config file:
mkdir -p ~/.config/enroot/environ.d
echo 'NVIDIA_DRIVER_CAPABILITIES=all' > ~/.config/enroot/environ.d/nvidia_caps.env

# then run srun:
LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'nvidia-smi ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
srun: job 157732 queued and waiting for resources
srun: job 157732 has been allocated resources
pyxis: creating container filesystem ...
pyxis: starting container ...
Thu Sep 30 16:41:08 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   34C    P0    59W / 300W |      0MiB / 16160MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
	libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
all

in case there are further issues, my next bet would be on whether the right conda/virtualenv is actually used by the code...

@joernhees

hmm, apparently just setting the env var inside the container with a simple early export NVIDIA_DRIVER_CAPABILITIES=all also seems to work and lets us get away without the config file workaround 🤷‍♂️

@JanuszL (Contributor) commented Sep 30, 2021

Hi @joernhees,

Thank you for the advice. @gulzainali98 can you check these hints and see if they help?

@joernhees

yeah, thanks a lot; with your pointers we were able to solve it... it essentially boiled down to this: NVIDIA/enroot#100 (comment)

i think this can be closed
