NVIDIA-DALI Capabilities issue #3390

Closed · gulzainali98 opened this issue Sep 29, 2021 · 26 comments
Labels: help wanted (Extra attention is needed)

@gulzainali98

I am running an enroot container on a Slurm cluster and I am getting an error.

This is the whole error: https://pastebin.com/96CYv9fs
I am trying to run training for this repo: https://github.com/m-tassano/fastdvdnet

The error mentioned in the Pastebin occurs at the following line: https://github.com/m-tassano/fastdvdnet/blob/master/dataloaders.py#L102

The code works fine on my local machine; the error occurs only on the Slurm cluster. I searched a bit and came across #2229, which describes a similar issue to mine.

After going through the solutions in that issue, I found out that when running a video reader pipeline in a container, you need to explicitly enable all the driver capabilities. For plain Docker images, this can be done using the syntax shown in NVIDIA/nvidia-docker#1128 (comment).

However, I am not sure how to achieve this with our enroot containers.

@JanuszL (Contributor) commented Sep 29, 2021

Hi @gulzainali98,

I'm sorry but I haven't run DALI using enroot with the video reader.
Have you tried adding the following line to the container definition?

ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility,video,graphics,compat32,utility

In the case of the docker runtime, this should be equivalent to adding the capabilities to the docker invocation command. Maybe it works with enroot as well.

If that doesn't help, I would wait for the answer to NVIDIA/enroot#100.

JanuszL added the help wanted label on Sep 29, 2021
@gulzainali98 (Author)

Hello,
I was able to resolve that error by exporting NVIDIA_DRIVER_CAPABILITIES=compute,utility,video,graphics,compat32,utility as an environment variable. But now I am getting the following error: nvml error: 13 Local version of NVML doesn't implement this function

at the same line: https://github.com/m-tassano/fastdvdnet/blob/master/dataloaders.py#L102

@gulzainali98 (Author)

complete error:

File "train_fastdvdnet.py", line 277, in <module>
    main(**vars(argspar))
File "train_fastdvdnet.py", line 57, in main
    temp_stride=3)
File "/home/mkhan/generic_loss/fastdvdnet/dataloaders.py", line 118, in __init__
    self.pipeline.build()
File "/opt/conda/lib/python3.6/site-packages/nvidia/dali/pipeline.py", line 660, in build
    self._pipe.Build(self._names_and_devices)
RuntimeError: nvml error: 13 Local version of NVML doesn't implement this function

@gulzainali98 (Author)

I have also tried updating ops.VideoReader to the new ops.readers.Video, but I am still getting the same error.

@JanuszL (Contributor) commented Sep 29, 2021

Hi @gulzainali98,

Can you try a different DALI pipeline, like one from here?
This seems to be a driver incompatibility. Are you sure you have the driver version corresponding to the CUDA version your DALI installation is built with?
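If it helps, a pipeline as small as this should show whether pipeline construction itself hits the NVML error (a minimal sketch with no file readers or video decoding involved; the batch size and shape are arbitrary placeholders):

# Minimal DALI pipeline sketch; no files or video decoding involved.
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn

@pipeline_def(batch_size=2, num_threads=2, device_id=0)
def noop_pipe():
    # Just generate random data and move it to the GPU, so this isolates
    # pipeline construction from video decoding.
    data = fn.random.uniform(range=(0.0, 1.0), shape=[3, 64, 64])
    return data.gpu()

pipe = noop_pipe()
pipe.build()   # the nvml error in your traceback was raised from this call
print(pipe.run())

If this minimal pipeline builds fine, the problem is specific to the video reader; if it fails the same way, it points at the NVML/driver setup in the container.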

@gulzainali98 (Author)

Yes, the CUDA version is correct (11.0), and I am installing DALI with the cu110 suffix.
I will try another pipeline and see if the error persists.

@gulzainali98 (Author)

same error even with the simple pipeline:

File "train_fastdvdnet.py", line 277, in
main(**vars(argspar))
File "train_fastdvdnet.py", line 57, in main
temp_stride=3)
File "/home/mkhan/generic_loss/fastdvdnet/dataloaders.py", line 119, in init
pipe.build()
File "/opt/conda/lib/python3.6/site-packages/nvidia/dali/pipeline.py", line 660, in build
self._pipe.Build(self._names_and_devices)
RuntimeError: nvml error: 13 Local version of NVML doesn't implement this function
srun: error: serv-9216: task 0: Exited with exit code 1

@gulzainali98 (Author)

I have tried it on RTX 3090, A100, and plain NVIDIA GTX 1080 GPUs as well.

@JanuszL (Contributor) commented Sep 29, 2021

@gulzainali98 can you also check if nvidia-smi works in your cluster env?
As I understand it, DALI works fine in your local environment but not on the cluster.
Maybe there is something wrong with NVIDIA_DRIVER_CAPABILITIES?

@gulzainali98 (Author) commented Sep 29, 2021 via email

@JanuszL (Contributor) commented Sep 30, 2021

Hi @gulzainali98,

nvidia-smi uses the same underlying library that DALI has a problem with, NVML. So if nvidia-smi doesn't work, the problem is not related to DALI.
I would check with the cluster administrators why this doesn't work for you. Maybe you can start with a minimal image, like a bare Ubuntu one, see if nvidia-smi works there, and then take baby steps to figure out what breaks your config.
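If it helps to narrow things down, NVML can also be queried directly from Python inside the container (a quick sketch, assuming the pynvml bindings, e.g. the nvidia-ml-py package, are installed):

# Minimal NVML sanity check; nvidia-smi uses the same library under the hood.
import pynvml

pynvml.nvmlInit()
print("driver:", pynvml.nvmlSystemGetDriverVersion())
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print("gpu 0:", pynvml.nvmlDeviceGetName(handle))
pynvml.nvmlShutdown()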

@gulzainali98 (Author)

nvidia-smi does not work when I use --export="NVIDIA_DRIVER_CAPABILITIES=compute,utility,video,graphics,compat32,utility" while starting my Slurm job. I am confirming with the admins now.

@JanuszL (Contributor) commented Sep 30, 2021

You can also check if the suggestion from NVIDIA/enroot#100 (comment) helps with that problem.

@gulzainali98 (Author)

Okay, so I was able to run the basic pipeline; I was making a mistake in the Slurm command. However, I am not sure how to convert the current pipeline here: https://github.com/m-tassano/fastdvdnet/blob/master/dataloaders.py. With the old pipeline present in the original code I am getting the error "Current pipeline object is no longer valid."

Should I simply use crop_mirror_normalize and uniform from nvidia.dali.fn now, so that all the previous operations are turned into their functional counterparts?

@JanuszL (Contributor) commented Sep 30, 2021

Hi @gulzainali98,

The pipeline there should work fine as it is. It uses the old API, but DALI still supports it even though the functional API is the recommended way.
Do you have any problem with the pipeline defined there, or is it related to the rework you want to do?
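For reference, if you do decide to rework it, the mapping to the functional API is mostly mechanical. A rough sketch of what the video-reading part might look like with nvidia.dali.fn (the batch size, crop size, and reader arguments below are placeholders, not the values from the fastdvdnet repo):

# Rough functional-API sketch; argument values are placeholders.
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types

@pipeline_def(batch_size=4, num_threads=2, device_id=0)
def video_train_pipe(filenames, crop_size=96):
    seq = fn.readers.video(device="gpu", filenames=filenames,
                           sequence_length=5, random_shuffle=True)
    # Random crop position, playing the role of the old ops.Uniform +
    # ops.CropMirrorNormalize pair.
    u = fn.random.uniform(range=(0.0, 1.0))
    v = fn.random.uniform(range=(0.0, 1.0))
    return fn.crop_mirror_normalize(seq,
                                    crop=(crop_size, crop_size),
                                    crop_pos_x=u, crop_pos_y=v,
                                    dtype=types.FLOAT,
                                    output_layout="FCHW")

But as said above, the existing ops-based pipeline should keep working as it is.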

@gulzainali98 (Author)

I am getting the following error with the old pipeline: https://pastebin.com/199pc4qw
At the end of the error it says "Current pipeline object is no longer valid".

@JanuszL (Contributor) commented Sep 30, 2021

In the log, I see that DALI cannot open libnvcuvid.so.
Again, can you check if this works in your case (as suggested in NVIDIA/enroot#100 (comment))?

$ NVIDIA_DRIVER_CAPABILITIES=compute,utility,video enroot start nvidia+cuda+11.4.0-base ldconfig -p | grep nvcuvid
        libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1

@gulzainali98 (Author) commented Sep 30, 2021

I cannot execute enroot directly, but I am running the Slurm command after setting NVIDIA_DRIVER_CAPABILITIES=compute,utility,video and then using the --export option to import NVIDIA_DRIVER_CAPABILITIES into the Slurm environment.

Here is the command and the output of my Slurm job:
NVIDIA_DRIVER_CAPABILITIES=compute,utility,video,graphics,compat32,utility srun --ntasks=1 --gpus-per-task=1 --container-image=nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'echo $NVIDIA_DRIVER_CAPABILITIES'

srun: job xxxx queued and waiting for resources
srun: job xxxx has been allocated resources
pyxis: creating container filesystem ...
pyxis: starting container ...
compute,utility,video

@JanuszL (Contributor) commented Sep 30, 2021

It still doesn't answer the question of whether the ldconfig -p | grep nvcuvid command shows what we need (libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1), i.e. whether libnvcuvid.so is available.
There is also one more thing: the video part of the driver may not be installed on your cluster. The GPU driver has several optional parts that can be left out. A typical user usually installs all of them, but on clusters some of them may not be there.
I guess the cluster has nvidia-headless-4XX and nvidia-utils-4XX, while you probably need libnvidia-decode-4XX as well.
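As a quick check from inside the container, you can also test whether the decode library resolves at all (a small sketch; it only verifies that the shared object can be loaded):

# Check that the NVDEC user-space library (libnvcuvid) is loadable in the container.
import ctypes

try:
    ctypes.CDLL("libnvcuvid.so.1")
    print("libnvcuvid.so.1 loaded OK")
except OSError as err:
    print("libnvcuvid.so.1 not loadable:", err)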

@gulzainali98 (Author)

I had an admin run the commands; here is the output: https://pastebin.com/vCbkgE3D

@JanuszL (Contributor) commented Sep 30, 2021

Mhh,

The most recent log shows that libnvcuvid.so is there, but DALI cannot access it, as the previous error states. Can you rerun the test with the nvcr.io_nvidia_pytorch_21.08-py3.sqsh image? (I'm sorry, but I'm slowly running out of ideas about what is wrong.)

@gulzainali98 (Author)

It's the same error again

@joernhees

hi, "an admin" here...

to me this seems like a conflated issue between how the NVIDIA_DRIVER_CAPABILITIES env var is treated (used / set / ignored) by srun/pyxis/enroot, running the right base image (some images seem to mess with the NVIDIA_DRIVER_CAPABILITIES defaults), and running on the right hardware (some GPUs don't seem to like libnvcuvid?!?)...

anyhow, here is a way that at least gets us to the expected container state:

# slurm / pyxis / enroot / srun seem to not really respect the NVIDIA_DRIVER_CAPABILITIES env var,
# so pass it in via an enroot user config file:
mkdir -p ~/.config/enroot/environ.d
echo 'NVIDIA_DRIVER_CAPABILITIES=all' > ~/.config/enroot/environ.d/nvidia_caps.env

# then run srun:
LC_ALL=C srun -p V100-16GB --ntasks=1 --gpus-per-task=1 --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_21.08-py3.sqsh bash -c 'nvidia-smi ; ldconfig -p | grep nvcuvid ; echo $NVIDIA_DRIVER_CAPABILITIES'
srun: job 157732 queued and waiting for resources
srun: job 157732 has been allocated resources
pyxis: creating container filesystem ...
pyxis: starting container ...
Thu Sep 30 16:41:08 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   34C    P0    59W / 300W |      0MiB / 16160MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
	libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
all

in case there are further issues, my next bet would be on whether the right conda/virtualenv is actually used by the code...

@joernhees

hmm, apparently just setting the env var inside the container with a simple early export NVIDIA_DRIVER_CAPABILITIES=all also seems to work and lets us get away without the config file workaround 🤷‍♂️

@JanuszL (Contributor) commented Sep 30, 2021

Hi @joernhees,

Thank you for the advice. @gulzainali98 can you check these hints and see if they help?

@joernhees

yeah, thanks a lot; with your pointers we were able to solve it... it essentially boiled down to this: NVIDIA/enroot#100 (comment)

i think this can be closed
