NVIDIA-DALI Capabilities issue #3390
Comments
Hi @gulzainali98, I'm sorry but I haven't run DALI using enroot with the video reader.
If that doesn't help, I would wait for the answer to NVIDIA/enroot#100.
Hello, the error occurs at the same line: https://github.com/m-tassano/fastdvdnet/blob/master/dataloaders.py#L102
Complete error:
I have also tried updating ops.VideoReader to ops.readers.Video, but I am still getting the same error.
Hi @gulzainali98, can you try a different DALI pipeline, like one from here?
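For anyone following along, a minimal sanity-check pipeline of the kind being suggested might look like the sketch below. This is my own illustration using the functional API, not the exact example that was linked; it avoids the video reader entirely, so it only tests that DALI itself runs on the GPU.

```python
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

# Minimal smoke test: generate random tensors and move them to the GPU,
# exercising DALI without touching the video reader / libnvcuvid.
@pipeline_def(batch_size=2, num_threads=1, device_id=0)
def smoke_test_pipeline():
    data = fn.random.uniform(range=(0.0, 1.0), shape=[3, 64, 64])
    return data.gpu()

pipe = smoke_test_pipeline()
pipe.build()
(out,) = pipe.run()
print(out.as_cpu().as_array().shape)  # expected: (2, 3, 64, 64)
```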
Yes, the CUDA version is correct, which is 11.0, and I am installing DALI with the cu110 suffix.
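For reference, the cu110 build mentioned here is the one installed from NVIDIA's pip index (the standard install command from the DALI docs; the exact package version used in this thread is not recorded):

```bash
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda110
```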
Same error even with the simple pipeline: File "train_fastdvdnet.py", line 277, in
I have tried it on RTX 3090, A100, and plain NVIDIA GTX 1080 GPUs as well.
@gulzainali98 can you also check if nvidia-smi works in your cluster env? As I understand, DALI works fine in your local env but not in the cluster. Maybe there is something wrong with NVIDIA_DRIVER_CAPABILITIES?
nvidia-smi does not work in my enroot container. I am working on a SLURM cluster. I am not sure how this connects with the problem; can you please explain a bit?
Hi @gulzainali98,
nvidia-smi does not work when I use --export "NVIDIA_DRIVER_CAPABILITIES=compute,utility,video,graphics,compat32,utility" while starting my SLURM job. I am confirming with the admins now.
You can also check if the suggestion from NVIDIA/enroot#100 (comment) helps with that problem.
Okay, so I was able to run the basic pipeline; I was making a mistake with the slurm command. However, I am not sure how to convert the current pipeline here: https://github.com/m-tassano/fastdvdnet/blob/master/dataloaders.py. With the old pipeline present in the original code I am getting the error "Current pipeline object is no longer valid." Should I now simply use crop_mirror_normalize and uniform from nvidia.dali.fn, so that all the previous operations are turned into functional operations?
Hi @gulzainali98, the pipeline there should work fine as it is. It uses the old API, but DALI still supports it even though the functional API is the recommended way.
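For completeness, a rough sketch of what a functional-API equivalent could look like (this is not the fastdvdnet pipeline itself; the filenames, sequence length, and crop size are placeholders):

```python
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def

@pipeline_def(batch_size=4, num_threads=2, device_id=0)
def video_pipeline(filenames, sequence_length, crop_size):
    # fn.readers.video replaces ops.VideoReader; it still needs libnvcuvid
    # from the driver, so the capability problem is unchanged.
    frames = fn.readers.video(
        device="gpu",
        filenames=filenames,
        sequence_length=sequence_length,
        random_shuffle=True,
    )
    # fn.random.uniform replaces ops.Uniform for the random crop position.
    pos_x = fn.random.uniform(range=(0.0, 1.0))
    pos_y = fn.random.uniform(range=(0.0, 1.0))
    # fn.crop_mirror_normalize replaces ops.CropMirrorNormalize.
    frames = fn.crop_mirror_normalize(
        frames,
        crop=crop_size,
        crop_pos_x=pos_x,
        crop_pos_y=pos_y,
        mirror=fn.random.coin_flip(),
        dtype=types.FLOAT,
    )
    return frames

# pipe = video_pipeline(filenames=["example.mp4"], sequence_length=5, crop_size=(96, 96))
# pipe.build()
```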
https://pastebin.com/199pc4qw |
In the log, I see that DALI cannot open libnvcuvid.so.
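A couple of quick checks (my suggestion, not taken from the linked log) to see whether the driver's video-decode library is actually visible inside the container:

```bash
# libnvcuvid.so is injected by the NVIDIA container runtime only when
# NVIDIA_DRIVER_CAPABILITIES includes "video" (or "all").
ldconfig -p | grep libnvcuvid
# Typical location on Ubuntu-based images; the path may differ on other distros.
ls -l /usr/lib/x86_64-linux-gnu/libnvcuvid.so* 2>/dev/null
```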
I cannot directly execute enroot, but I am running the slurm command by setting "NVIDIA_DRIVER_CAPABILITIES=compute,utility,video" and then using the --export option to import NVIDIA_DRIVER_CAPABILITIES into the slurm environment. Here is the command and output of my slurm job: srun: job xxxx queued and waiting for resources
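For readers, the general shape of such an invocation would be roughly the following. The image path, mounts, and script name are placeholders, and, as the rest of the thread shows, whether this alone is enough depends on how enroot/pyxis handles the variable.

```bash
# Set the capability list in the submitting shell, then propagate it into the job.
export NVIDIA_DRIVER_CAPABILITIES=compute,utility,video

srun --export=ALL,NVIDIA_DRIVER_CAPABILITIES \
     --container-image=/path/to/image.sqsh \
     --container-mounts=/data:/data \
     python train_fastdvdnet.py
```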
It still doesn't answer the question of whether nvidia-smi works inside your cluster environment.
I had an admin run the commands; here is the output: https://pastebin.com/vCbkgE3D
Mhh, the most recent log shows that
It's the same error again |
hi, "an admin" here... to me this seems like some conflated issue with how the NVIDIA_DRIVER_CAPABILITIES env var is treated (used / set / ignored) by srun/pyxis/enroot and running the right base image (some images seem to mess with the NVIDIA_DRIVER_CAPABILITIES defaults) on the right hardware (some GPUs don't seem to like libnvcuvid?!?)... anyhow, here seems to be a way that at least gets us to the expected container state:
In case there are further issues, my next bet would be on whether the right conda/virtualenv is actually used by the code...
Hmm, apparently just setting the env var inside the container by a simple early export does not do the trick.
Hi @joernhees, thank you for the advice. @gulzainali98, can you check these hints and see if they help?
Yeah, thanks a lot; with your pointers we were able to solve it. It essentially boiled down to this: NVIDIA/enroot#100 (comment). I think this can be closed.
I am running an enroot container on a SLURM cluster and I am getting the following error:
This is the whole error: https://pastebin.com/96CYv9fs
I am trying to run training for this repo: https://github.com/m-tassano/fastdvdnet
The error mentioned in the Pastebin log occurs at the following line: https://github.com/m-tassano/fastdvdnet/blob/master/dataloaders.py#L102
The code works fine on my local machine; this error occurs only on the SLURM cluster. I searched a bit and came across this post: #2229, which is a similar issue to mine.
After going through the solutions in that issue, I found out that when running a video reader pipeline in a container, you need to explicitly enable all the required capabilities. In the case of plain docker images, it can be done using the syntax described here: NVIDIA/nvidia-docker#1128 (comment)
However, I am not sure how to achieve this with our enroot containers.
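For reference, the docker-side approach referred to above is roughly the following (a sketch based on the linked nvidia-docker discussion, with placeholder image and script names):

```bash
docker run --rm --gpus all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility,video \
  my-dali-image python train_fastdvdnet.py
```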