nvidia-smi killed after a while #271

Open · sandrich opened this issue Oct 14, 2021 · 12 comments
@sandrich commented Oct 14, 2021

I run a rapidsai container with a Jupyter notebook.
When I freshly start the container, all is fine and I can run GPU workloads inside the notebook.

Thu Oct 14 09:58:37 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      On   | 00000000:13:00.0 Off |                   On |
| N/A   37C    P0    65W / 250W |                  N/A |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    7   0   0  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Then, at some random point, the notebook kernel gets killed. When I check nvidia-smi, it crashes:

nvidia-smi
Thu Oct 14 09:59:49 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
Killed

I am not sure how to debug this further or where the issue comes from.

Environment:
OpenShift 4.7
GPU: NVIDIA A100, MIG mode using the MIG manager
Operator: 1.7.1

ClusterPolicy

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  migManager:
    nodeSelector:
      nvidia.com/gpu.deploy.mig-manager: 'true'
    enabled: true
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia/cloud-native
    env:
      - name: WITH_REBOOT
        value: 'true'
    securityContext: {}
    version: 'sha256:495ed3b42e0541590c537ab1b33bda772aad530d3ef6a4f9384d3741a59e2bf8'
    image: k8s-mig-manager
    tolerations: []
    priorityClassName: system-node-critical
  operator:
    defaultRuntime: crio
    initContainer:
      image: cuda
      imagePullSecrets: []
      repository: nexus.bisinfo.org:8088/nvidia
      version: 'sha256:ba39801ba34370d6444689a860790787ca89e38794a11952d89a379d2e9c87b5'
    deployGFD: true
  gfd:
    nodeSelector:
      nvidia.com/gpu.deploy.gpu-feature-discovery: 'true'
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia
    env:
      - name: GFD_SLEEP_INTERVAL
        value: 60s
      - name: FAIL_ON_INIT_ERROR
        value: 'true'
    securityContext: {}
    version: 'sha256:bfc39d23568458dfd50c0c5323b6d42bdcd038c420fb2a2becd513a3ed3be27f'
    image: gpu-feature-discovery
    tolerations: []
    priorityClassName: system-node-critical
  dcgmExporter:
    nodeSelector:
      nvidia.com/gpu.deploy.dcgm-exporter: 'true'
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia/k8s
    securityContext: {}
    version: 'sha256:8af02463a8b60b21202d0bf69bc1ee0bb12f684fa367f903d138df6cacc2d0ac'
    image: dcgm-exporter
    tolerations: []
    priorityClassName: system-node-critical
  driver:
    licensingConfig:
      configMapName: 'licensing-config'
    nodeSelector:
      nvidia.com/gpu.deploy.driver: 'true'
    enabled: true
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia
    securityContext: {}
    repoConfig:
      configMapName: repo-config
      destinationDir: "/etc/yum.repos.d"
    version: 'sha256:09ba3eca64a80fab010a9fcd647a2675260272a8c3eb515dfed6dc38a2d31ead'
    image: driver
    tolerations: []
    priorityClassName: system-node-critical
  devicePlugin:
    nodeSelector:
      nvidia.com/gpu.deploy.device-plugin: 'true'
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia
    env:
      - name: PASS_DEVICE_SPECS
        value: 'true'
      - name: FAIL_ON_INIT_ERROR
        value: 'true'
      - name: DEVICE_LIST_STRATEGY
        value: envvar
      - name: DEVICE_ID_STRATEGY
        value: uuid
      - name: NVIDIA_VISIBLE_DEVICES
        value: all
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: all
    securityContext: {}
    version: 'sha256:85def0197f388e5e336b1ab0dbec350816c40108a58af946baa1315f4c96ee05'
    image: k8s-device-plugin
    tolerations: []
    args: []
    priorityClassName: system-node-critical
  mig:
    strategy: single
  validator:
    nodeSelector:
      nvidia.com/gpu.deploy.operator-validator: 'true'
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia/cloud-native
    env:
      - name: WITH_WORKLOAD
        value: 'true'
    securityContext: {}
    version: 'sha256:2bb62b9ca89bf9ae26399eeeeaf920d7752e617fa070c1120bf800253f624a10'
    image: gpu-operator-validator
    tolerations: []
    priorityClassName: system-node-critical
  toolkit:
    nodeSelector:
      nvidia.com/gpu.deploy.container-toolkit: 'true'
    enabled: true
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia/k8s
    securityContext: {}
    version: 1.5.0-ubi8
    image: container-toolkit
    tolerations: []
    priorityClassName: system-node-critical
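For reference, a hedged sketch of how this policy can be applied and its operand pods checked; the file name and the gpu-operator-resources namespace are assumptions based on operator 1.7.x defaults, not something confirmed above:

# Apply the ClusterPolicy above (hypothetical file name).
oc apply -f gpu-cluster-policy.yaml

# The operand pods (driver, container toolkit, device plugin, MIG manager,
# DCGM exporter, GFD, validator) should all reach Running/Completed.
oc get pods -n gpu-operator-resources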

Any idea how to debug where this issue comes from?
Also, we need CUDA 11.2 support, so I suppose we cannot go with a newer toolkit image?

@elezar (Member) commented Oct 14, 2021

Hi @sandrich. Thanks for reporting this. With regards to the toolkit version: it is independent of the CUDA version, which is determined by the driver installed on the system (in the case of the GPU Operator, most likely by the driver container).

@klueska I recall that, due to a runc bug, we saw long-running containers lose access to devices. Do you recall what our workaround was?

Update: the runc bug was triggered by the CPUManager issuing an update command for the container's CPU set every 10s, irrespective of whether any changes were required. Our workaround was to patch the CPUManager to only issue an update if something had changed. The changes have been merged into upstream 1.22, but I am uncertain of the backport status.
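A quick way to see whether a container has been hit by this (a sketch, assuming cgroup v1 and that the commands are run from inside the affected container; not an official diagnostic):

# If the /dev/nvidia* entries disappear from the device cgroup allow-list
# after a CPUManager reconcile, the container has lost device access.
cat /sys/fs/cgroup/devices/devices.list
ls -l /dev/nvidia*   # the device nodes are still visible, but opening them fails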

@klueska (Contributor) commented Oct 14, 2021

The heavy-duty workaround is to update to a version of Kubernetes that contains this patch:
kubernetes/kubernetes#101771

The lighter-weight workaround would be to make sure that your pod requests a set of exclusive CPUs as described here (even just one exclusive CPU would be sufficient):
https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/
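A minimal sketch of such a pod spec (assuming the node runs the static CPU manager policy; the pod name and image are placeholders, and requests must equal limits with an integer CPU count for the CPUs to be exclusive):

apiVersion: v1
kind: Pod
metadata:
  name: rapidsai-exclusive-cpu      # hypothetical name
spec:
  containers:
    - name: rapidsai
      image: <your-rapidsai-image>  # placeholder
      resources:
        requests:
          cpu: "1"                  # integer CPU -> exclusive CPU under the static policy
          memory: 1000Mi
          nvidia.com/gpu: "1"
        limits:
          cpu: "1"
          memory: 1000Mi
          nvidia.com/gpu: "1"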

@sandrich (Author) commented Oct 14, 2021

@klueska so that means adding a requests section with at least 1 full core, like so?

resources:
  requests:
    cpu: 1

The following resources were set in the test deployment:

resources:
  limits:
    cpu: "1"
    memory: 1000Mi
    nvidia.com/gpu: "1"
  requests:
    cpu: "1"
    memory: 1000Mi
    nvidia.com/gpu: "1"

@klueska (Contributor) commented Oct 14, 2021

Yes, that is what I was suggesting. So you are seeing this error even with the setting above for CPU/memory? Is this the only container in the pod (no init containers or anything)?

@sandrich (Author)

> Yes, that is what I was suggesting. So you are seeing this error even with the setting above for CPU/memory? Is this the only container in the pod (no init containers or anything)?

Exactly. The node has cpuManagerPolicy set to static:

cat /etc/kubernetes/kubelet.conf | grep cpu
  "cpuManagerPolicy": "static",
  "cpuManagerReconcilePeriod": "5s",

And here are the pod details:

oc describe pod rapidsai-998589866-dkltb
Name:         rapidsai-998589866-dkltb
Namespace:    med-gpu-python-dev
Priority:     0
Node:         adchio1011.ocp-dev.opz.bisinfo.org/10.20.12.21
Start Time:   Fri, 15 Oct 2021 14:48:40 +0200
Labels:       app=rapidsai
              deployment=rapidsai
              pod-template-hash=998589866
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "100.70.4.26"
                    ],
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "100.70.4.26"
                    ],
                    "default": true,
                    "dns": {}
                }]
              openshift.io/scc: restricted
Status:       Running
IP:           100.70.4.26
IPs:
  IP:           100.70.4.26
Controlled By:  ReplicaSet/rapidsai-998589866
Containers:
  rapidsai:
    Container ID:  cri-o://bbf668d97da94e3a8de9b8df79a6c65ce7fa0c61026e060ce56afbcfc08b862d
    Image:         quay.bisinfo.org/by003457/r2106_cuda112_base_cent8-py37:latest
    Image ID:      quay.bisinfo.org/by003457/r2106_cuda112_base_cent8-py37@sha256:10cc2b92ae96a6f402c0b9ad6901c00cd9b3d37b5040fd2ba8e6fc8b279bb06c
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/conda/envs/rapids/bin/jupyter-lab
      --allow-root
      --notebook-dir=/var/jupyter/notebook
      --ip=0.0.0.0
      --no-browser
      --NotebookApp.token=''
      --NotebookApp.allow_origin="*"
    State:          Running
      Started:      Fri, 15 Oct 2021 14:48:44 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:             1
      memory:          1000Mi
      nvidia.com/gpu:  1
    Requests:
      cpu:             1
      memory:          1000Mi
      nvidia.com/gpu:  1
    Environment:
      HOME:  /tmp
    Mounts:
      /var/jupyter/notebook from jupyter-notebook (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6g9vj (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  jupyter-notebook:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  notebook
    ReadOnly:   false
  default-token-6g9vj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-6g9vj
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s

@klueska (Contributor) commented Oct 15, 2021

OK. Yeah, everything looks good from the perspective of the pod specs, etc.

I’m guessing you must be running into the runc bug then:
opencontainers/runc#2366 (comment)

And the only way to avoid that is to update to a version of runc that has a fix for this or update to a kubelet with this patch: kubernetes/kubernetes#101771

I was thinking before that ensuring you were a guaranteed pod was enough to bypass this bug, but looking into it more, it’s not.

@sandrich (Author)

> OK. Yeah, everything looks good from the perspective of the pod specs, etc.
>
> I’m guessing you must be running into the runc bug then: opencontainers/runc#2366 (comment)
>
> And the only way to avoid that is to update to a version of runc that has a fix for this or update to a kubelet with this patch: kubernetes/kubernetes#101771
>
> I was thinking before that ensuring you were a guaranteed pod was enough to bypass this bug, but looking into it more, it’s not.

Hi, doesn't OpenShift use CRI-O rather than runc?

@sandrich (Author)

Also, we see the following in the node logs:

[14136.622417] cuda-EvtHandlr invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), order=0, oom_score_adj=-997
[14136.622588] CPU: 1 PID: 711806 Comm: cuda-EvtHandlr Tainted: P           OE    --------- -  - 4.18.0-305.19.1.el8_4.x86_64 #1
[14136.622781] Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.17369862.B64.2012240522 12/24/2020
[14136.622987] Call Trace:
[14136.623038]  dump_stack+0x5c/0x80
[14136.623103]  dump_header+0x4a/0x1db
[14136.623168]  oom_kill_process.cold.32+0xb/0x10
[14136.623252]  out_of_memory+0x1ab/0x4a0
[14136.623322]  mem_cgroup_out_of_memory+0xe8/0x100
[14136.623406]  try_charge+0x65a/0x690
[14136.623470]  mem_cgroup_charge+0xca/0x220
[14136.623543]  __add_to_page_cache_locked+0x368/0x3d0
[14136.623632]  ? scan_shadow_nodes+0x30/0x30
[14136.623706]  add_to_page_cache_lru+0x4a/0xc0
[14136.623784]  iomap_readpages_actor+0x103/0x230
[14136.623865]  iomap_apply+0xfb/0x330
[14136.623930]  ? iomap_ioend_try_merge+0xe0/0xe0
[14136.624010]  ? __blk_mq_run_hw_queue+0x51/0xd0
[14136.624092]  ? iomap_ioend_try_merge+0xe0/0xe0
[14136.624172]  iomap_readpages+0xa8/0x1e0
[14136.624242]  ? iomap_ioend_try_merge+0xe0/0xe0
[14136.624322]  read_pages+0x6b/0x190
[14136.624385]  __do_page_cache_readahead+0x1c1/0x1e0
[14136.624470]  filemap_fault+0x783/0xa20
[14136.624538]  ? __mod_memcg_lruvec_state+0x21/0x100
[14136.624625]  ? page_add_file_rmap+0xef/0x130
[14136.624702]  ? alloc_set_pte+0x21c/0x440
[14136.624779]  ? _cond_resched+0x15/0x30
[14136.624885]  __xfs_filemap_fault+0x6d/0x200 [xfs]
[14136.624971]  __do_fault+0x36/0xd0
[14136.625033]  __handle_mm_fault+0xa7a/0xca0
[14136.625108]  handle_mm_fault+0xc2/0x1d0
[14136.625178]  __do_page_fault+0x1ed/0x4c0
[14136.625249]  do_page_fault+0x37/0x130
[14136.625316]  ? page_fault+0x8/0x30
[14136.625379]  page_fault+0x1e/0x30
[14136.625440] RIP: 0033:0x7fbd5b2b00e0
[14136.625508] Code: Unable to access opcode bytes at RIP 0x7fbd5b2b00b6.

I wonder if 16 GB of memory is not enough for the node serving the A100 card. It is a VM on VMware with direct passthrough; we are not using vGPU.
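For what it's worth, the mem_cgroup_out_of_memory frame in the trace suggests the container's memory cgroup limit (1000Mi in the spec above) may be the one being hit rather than node memory. A sketch for comparing limit and peak usage from inside the container, assuming cgroup v1:

# cgroup v1: the container's memory limit and peak usage, in bytes.
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/memory.max_usage_in_bytes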

@shivamerla (Contributor)

@sandrich did you try it with increased memory mapped to the VM?

@sandrich (Author)

@shivamerla I did, and it did not change anything. What did make a difference was adding more memory to the container.

@shivamerla (Contributor)

@sandrich can you check whether the settings below are enabled on your VM:

pciPassthru.use64bitMMIO="TRUE"
pciPassthru.64bitMMIOSizeGB=128

@sandrich (Author)

Yes, this one is set:

[screenshot of the VM settings]
