Update GPU detection in merlin.core.utils for Distributed class #98
Click to view CI ResultsGitHub pull request #98 of commit 5386077a55d023523a735ba2d21a9a3be18685ed, no merge conflicts. Running as SYSTEM Setting status of 5386077a55d023523a735ba2d21a9a3be18685ed to PENDING with url https://10.20.13.93:8080/job/merlin_core/65/console and message: 'Pending' Using context: Jenkins Building on master in workspace /var/jenkins_home/workspace/merlin_core using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5 > git rev-parse --is-inside-work-tree # timeout=10 Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/core > git --version # timeout=10 using GIT_ASKPASS to set credentials login for merlin-systems username and pass > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/98/*:refs/remotes/origin/pr/98/* # timeout=10 > git rev-parse 5386077a55d023523a735ba2d21a9a3be18685ed^{commit} # timeout=10 Checking out Revision 5386077a55d023523a735ba2d21a9a3be18685ed (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 5386077a55d023523a735ba2d21a9a3be18685ed # timeout=10 Commit message: "Check cuda.gpus.lst for available GPUs" > git rev-list --no-walk f336ca3ff96810efbded64e0559ebb880ee06364 # timeout=10 [merlin_core] $ /bin/bash /tmp/jenkins14859643852832795028.sh Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com Requirement already up-to-date: setuptools in /var/jenkins_home/.local/lib/python3.8/site-packages (62.3.2) /usr/share/python-wheels/urllib3-1.25.8-py2.py3-none-any.whl/urllib3/connectionpool.py:999: InsecureRequestWarning: Unverified HTTPS request is being made to host 'pypi.ngc.nvidia.com'. Adding certificate verification is strongly advised. 
See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0 rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-3.0.0 collected 342 items / 1 skipped |
5386077 to 21fbb71
Click to view CI ResultsGitHub pull request #98 of commit 21fbb71aa1808f19a1357651992b0c2f9bb60239, no merge conflicts. Running as SYSTEM Setting status of 21fbb71aa1808f19a1357651992b0c2f9bb60239 to PENDING with url https://10.20.13.93:8080/job/merlin_core/66/console and message: 'Pending' Using context: Jenkins Building on master in workspace /var/jenkins_home/workspace/merlin_core using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5 > git rev-parse --is-inside-work-tree # timeout=10 Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/core > git --version # timeout=10 using GIT_ASKPASS to set credentials login for merlin-systems username and pass > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/98/*:refs/remotes/origin/pr/98/* # timeout=10 > git rev-parse 21fbb71aa1808f19a1357651992b0c2f9bb60239^{commit} # timeout=10 Checking out Revision 21fbb71aa1808f19a1357651992b0c2f9bb60239 (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 21fbb71aa1808f19a1357651992b0c2f9bb60239 # timeout=10 Commit message: "Check cuda.gpus.lst for available GPUs" > git rev-list --no-walk 5386077a55d023523a735ba2d21a9a3be18685ed # timeout=10 [merlin_core] $ /bin/bash /tmp/jenkins15247958366670095269.sh Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com Requirement already up-to-date: setuptools in /var/jenkins_home/.local/lib/python3.8/site-packages (62.3.2) /usr/share/python-wheels/urllib3-1.25.8-py2.py3-none-any.whl/urllib3/connectionpool.py:999: InsecureRequestWarning: Unverified HTTPS request is being made to host 'pypi.ngc.nvidia.com'. Adding certificate verification is strongly advised. 
See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0 rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-3.0.0 collected 342 items / 1 skipped |
Click to view CI ResultsGitHub pull request #98 of commit 0e6f300caea67715418756e1f77b3990d8010caf, no merge conflicts. Running as SYSTEM Setting status of 0e6f300caea67715418756e1f77b3990d8010caf to PENDING with url https://10.20.13.93:8080/job/merlin_core/67/console and message: 'Pending' Using context: Jenkins Building on master in workspace /var/jenkins_home/workspace/merlin_core using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5 > git rev-parse --is-inside-work-tree # timeout=10 Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/core > git --version # timeout=10 using GIT_ASKPASS to set credentials login for merlin-systems username and pass > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/98/*:refs/remotes/origin/pr/98/* # timeout=10 > git rev-parse 0e6f300caea67715418756e1f77b3990d8010caf^{commit} # timeout=10 Checking out Revision 0e6f300caea67715418756e1f77b3990d8010caf (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 0e6f300caea67715418756e1f77b3990d8010caf # timeout=10 Commit message: "Move gpu check to compat module" > git rev-list --no-walk 21fbb71aa1808f19a1357651992b0c2f9bb60239 # timeout=10 [merlin_core] $ /bin/bash /tmp/jenkins13254012742949887097.sh Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com Requirement already up-to-date: setuptools in /var/jenkins_home/.local/lib/python3.8/site-packages (62.3.2) /usr/share/python-wheels/urllib3-1.25.8-py2.py3-none-any.whl/urllib3/connectionpool.py:999: InsecureRequestWarning: Unverified HTTPS request is being made to host 'pypi.ngc.nvidia.com'. Adding certificate verification is strongly advised. 
See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0 rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-3.0.0 collected 342 items / 1 skipped |
Updating the automatic GPU detection in `merlin.core.utils` so that the `Distributed` class works on both GPU and CPU automatically, depending on availability.

Motivation. While adding the XGBoost + Dask integration in Merlin Models, I added an example that uses the `merlin.core.utils.Distributed` helper and encountered this issue with the default behaviour. NVIDIA-Merlin/models#466

Implementation Details 🚧
- A successful import of `numba.cuda` doesn't necessarily indicate that we have GPUs available.
- Check `numba.cuda.gpus.lst` instead, handling a potential `CudaSupportError` exception.
- `merlin.core.dispatch` has a `HAS_GPU` variable; however, this raises a `RuntimeError` when the GPU is unavailable in some configurations (and a lazy runtime error still occurs even if you try to catch a `RuntimeError` on importing `dask_cudf`).

Testing Details
Unsure how best to automate tests for this in CI.
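One possible direction, sketched here rather than implemented in this PR: CUDA reads `CUDA_VISIBLE_DEVICES` when the driver is initialised, so the variable has to be set before the process that performs the detection starts. A test could therefore launch a fresh interpreter with the variable overridden and inspect what it prints. The helper name `run_python_with_env` and the `PROBE` snippet below are hypothetical; the probe only echoes the variable so that the sketch runs without Merlin or a GPU.

```python
import os
import subprocess
import sys


def run_python_with_env(code, **env_overrides):
    """Run `code` in a fresh interpreter with extra environment variables set."""
    env = dict(os.environ)
    env.update(env_overrides)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,  # fail loudly if the child process errors out
    )
    return result.stdout.strip()


# Stand-in probe (hypothetical): prints the value the child process sees.
PROBE = "import os; print(repr(os.environ.get('CUDA_VISIBLE_DEVICES')))"


def test_child_sees_hidden_gpus():
    # An empty value hides all GPUs from CUDA in the child process.
    assert run_python_with_env(PROBE, CUDA_VISIBLE_DEVICES="") == "''"
```

In CI, the probe string would be swapped for one that imports and prints `HAS_GPU`, with the expected value asserted for each environment. The import path is left out here because it depends on where the check ends up living.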
Manual tests conducted:
- `CUDA_VISIBLE_DEVICES` unset -> `HAS_GPU = True`
- `CUDA_VISIBLE_DEVICES=""` -> `HAS_GPU = False`
- `--gpu` setting -> `HAS_GPU = False`
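
For illustration, the check described under Implementation Details could look roughly like the sketch below. This is not the code from this PR: the function name `detect_gpu` is hypothetical, and falling back to CPU-only when numba is not installed is an assumption. It relies on `numba.cuda.gpus.lst` and `numba.cuda.cudadrv.error.CudaSupportError`, both mentioned above.

```python
def detect_gpu():
    """Return True only if numba can enumerate at least one CUDA device."""
    try:
        from numba import cuda
        from numba.cuda.cudadrv.error import CudaSupportError
    except ImportError:
        # Assumption: without numba we treat the environment as CPU-only.
        return False

    try:
        # Listing devices forces driver initialisation, which a bare
        # `import numba.cuda` does not do.
        return len(cuda.gpus.lst) > 0
    except CudaSupportError:
        # The CUDA driver is missing or could not be initialised.
        return False


# Hypothetical module-level flag, mirroring the `HAS_GPU` name used above.
HAS_GPU = detect_gpu()
```

Evaluating the flag through a function keeps the import-time side effects in one place and makes the result easy to stub out in tests.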