nvidia-smi killed after a while #271
Hi @sandrich. Thanks for reporting this. With regard to the toolkit version: it is independent of the CUDA version, which is determined by the driver installed on the system (in the case of the GPU Operator, most likely by the driver container). @klueska, I recall that this was due to the following: … Update: The …
The heavy-duty workaround is to update to a version of Kubernetes that contains this patch: … The lighter-weight workaround would be to make sure that your pod requests a set of exclusive CPUs, as described here (even just one exclusive CPU would be sufficient): …
@klueska that is, adding a requests section with at least 1 full core, like so?
The following resources were set in the test deployment:
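For illustration only (the actual values from the test deployment were not preserved in this thread), a resources section that keeps the pod in the Guaranteed QoS class with one exclusive CPU could look like the sketch below; the memory and GPU entries are placeholders:

```yaml
# Illustrative only: requests must equal limits for every resource, and the CPU
# value must be a whole number, for the static CPU manager to grant exclusive CPUs.
resources:
  requests:
    cpu: "1"
    memory: 8Gi
    nvidia.com/gpu: 1   # or a MIG resource such as nvidia.com/mig-2g.10gb, depending on the MIG strategy
  limits:
    cpu: "1"
    memory: 8Gi
    nvidia.com/gpu: 1
```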
Yes, that is what I was suggesting. So you are seeing this error even with the CPU/memory settings above? Is this the only container in the pod (no init containers or anything)?
Exactly. The node has cpuManagerPolicy set to static.
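For reference, a minimal sketch of what enabling the static policy looks like in a kubelet configuration (not taken from this cluster; on OpenShift this is normally applied through a KubeletConfig custom resource, and the reserved-CPU values here are assumptions):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Static policy lets Guaranteed pods with integer CPU requests pin exclusive CPUs.
cpuManagerPolicy: static
# Some CPU must be reserved for system and kubelet daemons when the static policy is used.
systemReserved:
  cpu: 500m
kubeReserved:
  cpu: 500m
```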
And here are the pod details:
OK. Yeah, everything looks good from the perspective of the pod specs, etc. I’m guessing you must be running into the runc bug then: … The only way to avoid that is to update to a version of runc that has a fix for it, or to update to a kubelet with this patch: kubernetes/kubernetes#101771. I was thinking before that ensuring your pod is in the Guaranteed QoS class was enough to bypass this bug, but looking into it more, it’s not.
Hi, doesn’t OpenShift use CRI-O rather than runc?
Also, we see the following in the node logs:
I wonder if 16 GB of memory is not enough for the node serving the A100 card. It is a VM on VMware with direct passthrough. We are not using vGPU.
@sandrich did you try it with increased memory mapped to the VM?
@shivamerla I did, which did not change anything. What did help was adding more memory to the container.
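For example (values are illustrative, not the exact numbers used here), raising the container’s memory while keeping requests equal to limits so the pod remains in the Guaranteed class:

```yaml
resources:
  requests:
    cpu: "1"
    memory: 32Gi   # assumed value, increased from the original allocation
  limits:
    cpu: "1"
    memory: 32Gi
```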
@sandrich can you check if the settings below are enabled on your VM:
…
I run a rapidsai container with jupyter notebook.
When I freshly start the container all is fine. I can run some GPU workload inside the notebook.
Then, randomly, the notebook kernel gets killed. When I check with nvidia-smi, it crashes.
I am not sure how to debug this further or where the issue comes from.
Environment:
OpenShift 4.7
GPU: NVIDIA A100, MIG mode using the MIG Manager
Operator: 1.7.1
ClusterPolicy
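The ClusterPolicy itself was not captured here; as a rough sketch only (field names are based on the GPU Operator ClusterPolicy CRD and may differ slightly in 1.7.1, and the values are assumptions), the MIG-related portion might look like:

```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  mig:
    strategy: single      # assumption; could also be "mixed"
  migManager:
    enabled: true         # the MIG Manager mentioned above
```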
Any idea how to debug where this issue comes from?
Also, we need CUDA 11.2 support; I suppose we cannot go with a newer toolkit image?
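For context on the toolkit question: the container toolkit image is configured separately in the ClusterPolicy and, per the first reply above, is independent of the CUDA version provided by the driver. A rough sketch with placeholder values (repository and image are assumed defaults, the version tag is a placeholder):

```yaml
spec:
  toolkit:
    enabled: true
    repository: nvcr.io/nvidia/k8s   # assumed default repository
    image: container-toolkit
    version: <toolkit-tag>           # placeholder; chosen independently of the CUDA 11.2 requirement
```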