Update the test checking for cooperative kernels in conditional nodes. #9869

Open. galv wants to merge 2 commits into main from galv/better-cooperative-kernel-error-message

galv (Collaborator) commented Jul 24, 2024

The test is now marked xfail conditionally, only when the installed CUDA driver version is lower than 12.6. CUDA 12.6 fixes the underlying issue; before that release, cooperative kernels could not be used within the body of a CUDA Graph conditional node.

We also provide a clearer error message that tells users the fix is to upgrade to CUDA 12.6.
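
For readers following along, here is a minimal sketch of the two ideas described above: gating the xfail on the driver version and translating the opaque capture error into actionable advice. It is not the code in this PR. The names `decode_driver_version`, `needs_xfail`, and `explain_capture_failure` are hypothetical, and the sketch assumes the driver version arrives as the integer reported by the CUDA driver API's `cuDriverGetVersion` (1000 * major + 10 * minor, so 12.6 is 12060).

```python
from contextlib import contextmanager
from typing import Tuple

# Assumption: the driver version uses the cuDriverGetVersion encoding,
# 1000 * major + 10 * minor, so CUDA 12.6 is reported as 12060.
MIN_DRIVER_VERSION = 12060


def decode_driver_version(raw: int) -> Tuple[int, int]:
    """Split the integer driver version into (major, minor)."""
    return raw // 1000, (raw % 1000) // 10


def needs_xfail(raw_driver_version: int) -> bool:
    """Return True when the driver predates CUDA 12.6."""
    return raw_driver_version < MIN_DRIVER_VERSION


@contextmanager
def explain_capture_failure():
    """Hypothetical helper: re-raise the capture error with upgrade advice."""
    try:
        yield
    except RuntimeError as err:
        if "CUDA error: invalid argument" in str(err):
            # Chain the original error so the low-level details stay visible.
            raise RuntimeError(
                "CUDA Graph capture failed, possibly because a cooperative "
                "kernel ran inside a conditional node. Try upgrading to CUDA 12.6."
            ) from err
        # Unrelated runtime errors pass through unchanged.
        raise
```

In a test module, `needs_xfail(driver_version)` could be passed as the condition to `pytest.mark.xfail(condition, reason=...)`, so the test is expected to fail only on older drivers and must pass on 12.6 and later.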

github-actions bot added the core and ASR labels Jul 24, 2024
galv added and removed the Run CICD label Jul 24, 2024
galv requested a review from nithinraok July 25, 2024 06:34
Now we conditionally xfail only when a cuda driver version less than
12.6 is installed. CUDA 12.6 fixes this issue. Before it, cooperative
kernels could not be used within the body of a conditional node.

We also provide a better error message for users to know that the fix
is to upgrade to CUDA 12.6.

Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
galv force-pushed the galv/better-cooperative-kernel-error-message branch from 87ee9ed to ce7477f July 25, 2024 06:36
galv added and removed the Run CICD label Jul 25, 2024
Signed-off-by: galv <galv@users.noreply.github.com>
except RuntimeError as err:
    if "CUDA error: invalid argument" in str(err):
        raise RuntimeError(
            "CUDA Graph capture failed. It is likely that you are calling a cooperative kernel in your RNN-T or TDT prediction network. Cooperative kernels are not allowed inside the bodies of CUDA Graph conditional nodes until CUDA 12.6. Please update to CUDA 12.6. File an issue if that still does not work."
Collaborator

We only support CUDA 12.5 with the published PyTorch containers; see the support matrix here: https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html.

Make sure decoding runs without cuda_graphs by default. In a later version we can turn cuda_graphs on by default, once the containers support it.
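
One way to read this suggestion is as an opt-in flag with a safe default. The sketch below is illustrative only and is not NeMo's API: `decode_with_optional_graphs`, `graph_decode`, `eager_decode`, and `use_cuda_graphs` are hypothetical names. The fallback on failure goes beyond what the comment asks for and is included as an assumption about how an opt-in path might degrade gracefully.

```python
import warnings
from typing import Callable, TypeVar

T = TypeVar("T")


def decode_with_optional_graphs(
    graph_decode: Callable[[], T],
    eager_decode: Callable[[], T],
    use_cuda_graphs: bool = False,  # off by default, as the review requests
) -> T:
    """Run the CUDA Graph decoder only when the caller opts in."""
    if not use_cuda_graphs:
        return eager_decode()
    try:
        return graph_decode()
    except RuntimeError as err:
        # Assumption: a failed graph capture surfaces as RuntimeError, as in
        # the hunk under review. Fall back instead of aborting decoding.
        warnings.warn(
            f"CUDA Graph decoding failed ({err}); using the non-graph path."
        )
        return eager_decode()
```

With this shape, users on CUDA 12.5 containers get working decoding without changing any setting, and flipping the default later is a one-line change.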

Labels: ASR, core, Run CICD
2 participants