Description
- cuda-python 11.6.1
- CUDA Toolkit 11.2
- Ubuntu Linux
If you run something like the following on a multi-GPU machine:

```python
from cuda import cuda, cudart

device_num = 5

err, = cuda.cuInit(0)                            # initialize the driver API
err, device = cuda.cuDeviceGet(device_num)       # get a handle for device 5
err, cuda_context = cuda.cuCtxCreate(0, device)  # create a context on device 5
err, = cudart.cudaSetDevice(device_num)          # runtime API: select device 5
```
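To make the stray allocation visible, here is a minimal sketch using NVML (the `pynvml` / nvidia-ml-py package, which is not part of the original report). NVML queries do not create CUDA contexts of their own, so the measurement does not perturb the readings:

```python
# Sketch: watch GPU 0's used memory before and after cudaSetDevice.
# Assumes the nvidia-ml-py (pynvml) package is installed.
import pynvml
from cuda import cuda, cudart

pynvml.nvmlInit()
dev0 = pynvml.nvmlDeviceGetHandleByIndex(0)

def used_mb():
    # Total used memory on GPU 0, across all processes
    return pynvml.nvmlDeviceGetMemoryInfo(dev0).used / 1024**2

device_num = 5
err, = cuda.cuInit(0)
err, device = cuda.cuDeviceGet(device_num)
err, ctx = cuda.cuCtxCreate(0, device)
print(f"before cudaSetDevice: {used_mb():.0f} MB used on GPU 0")

err, = cudart.cudaSetDevice(device_num)
print(f"after  cudaSetDevice: {used_mb():.0f} MB used on GPU 0")  # per the report, jumps by ~305 MB
```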
The call to cudart.cudaSetDevice correctly sets the current device to 5, but it also allocates ~305 MB of memory on device 0 (or whichever device is first in the list given by CUDA_VISIBLE_DEVICES). I suspect this behavior (possibly in the underlying C CUDA runtime?) may be the root of many downstream issues in libraries like TensorFlow and PyTorch, where a user selects a device but still gets large allocations on other devices.

305 MB may not sound like much, but I am running a program on an NVIDIA DGX with 16 GPUs and 64 worker processes, so 64 × 305 MB ≈ 19 GB of unusable memory gets allocated on GPU 0, which crashes the program. I cannot simply set CUDA_VISIBLE_DEVICES to work around this, because the workers communicate with their parent process via shared GPU memory (cuIPCMemHandle), and the parent process needs access to all GPUs. In addition, each worker performs data augmentation on one GPU while writing its output to another GPU with a different device ID.
As a workaround, I am investigating not calling cudart.cudaSetDevice at all, but when it is not called I cannot use the pointer returned by cuda.cuMemAlloc to create a PyTorch tensor. When I do call cudart.cudaSetDevice, the pointer works as expected.
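One direction worth probing: cudaSetDevice binds the runtime to the device's primary context, and runtime-based libraries such as PyTorch operate in that context, which would explain why a cuMemAlloc pointer from a freshly created driver context is not usable without it. Below is a minimal, untested sketch that stays entirely in the driver API by retaining and activating the primary context instead; whether this also sidesteps the device-0 allocation is exactly what would need verifying.

```python
# Sketch of a possible workaround: replace cudart.cudaSetDevice with a
# driver-API retain of the device's primary context. This relies on the
# documented runtime/driver interop (the runtime uses the primary context);
# it is an assumption, not verified behavior.
from cuda import cuda

device_num = 5
err, = cuda.cuInit(0)
err, device = cuda.cuDeviceGet(device_num)

# Retain and activate the primary context on device 5, i.e. the context
# cudaSetDevice would bind the runtime to, without touching the runtime API.
err, primary_ctx = cuda.cuDevicePrimaryCtxRetain(device)
err, = cuda.cuCtxSetCurrent(primary_ctx)

# Allocate in the primary context; if the interop assumption holds, this
# pointer should be valid for runtime-API consumers (e.g. PyTorch) on device 5.
err, dptr = cuda.cuMemAlloc(1 << 20)

# ... hand int(dptr) to the consumer here ...

err, = cuda.cuMemFree(dptr)
err, = cuda.cuDevicePrimaryCtxRelease(device)
```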