DCM toleration docs: wrong namespace for OpenShift, destructive patch, and unsafe cleanup #534

@leo8a

Summary

The DCM partition profile documentation (docs/dcm/applying-partition-profiles.rst) has three issues in the toleration step that can cause control plane disruptions on OpenShift clusters.

Bug 1 — Wrong namespace for OpenShift

The OpenShift commands target kube-system, which is nearly empty on OCP. Critical control plane components run in openshift-* namespaces (e.g. openshift-dns, openshift-ovn-kubernetes, openshift-multus, openshift-ingress).

Current (line ~49):

oc get deployments -n kube-system -o json | jq -r '.items[] | .metadata.name' | xargs -I {} oc patch deployment {} -n kube-system ...

Expected: Patch deployments/daemonsets across the relevant openshift-* namespaces where pods are scheduled on GPU nodes. The existing note (line ~33) mentions this but the actual commands don't implement it.
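To find which openshift-* namespaces actually need patching, one option is to list the pods scheduled on the GPU node and collect their namespaces. The following is a minimal sketch, not the documented procedure: it assumes the input is the parsed JSON from something like `oc get pods -A -o json`, and the function name `namespaces_on_node` and the node name `gpu-node-1` are hypothetical.

```python
import json


def namespaces_on_node(pod_list, node_name, prefix="openshift-"):
    """Return the sorted namespaces (matching prefix) that have pods on node_name.

    pod_list is assumed to be a Kubernetes PodList object parsed from JSON.
    """
    found = set()
    for pod in pod_list.get("items", []):
        namespace = pod.get("metadata", {}).get("namespace", "")
        scheduled_on = pod.get("spec", {}).get("nodeName")
        if scheduled_on == node_name and namespace.startswith(prefix):
            found.add(namespace)
    return sorted(found)


if __name__ == "__main__":
    # Hypothetical sample data standing in for `oc get pods -A -o json` output.
    sample = {
        "items": [
            {"metadata": {"namespace": "openshift-dns"}, "spec": {"nodeName": "gpu-node-1"}},
            {"metadata": {"namespace": "openshift-multus"}, "spec": {"nodeName": "gpu-node-1"}},
            {"metadata": {"namespace": "openshift-ingress"}, "spec": {"nodeName": "worker-2"}},
            {"metadata": {"namespace": "kube-system"}, "spec": {"nodeName": "gpu-node-1"}},
        ]
    }
    print(json.dumps(namespaces_on_node(sample, "gpu-node-1")))
```

The resulting namespace list could then drive the patch loop in place of the hard-coded kube-system.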

Bug 2 — "op": "add" replaces existing tolerations

The JSON patch uses "op": "add" on /spec/template/spec/tolerations. Under JSON Patch (RFC 6902), "add" on an object member that already exists replaces its value, so the entire tolerations array is overwritten with only the DCM toleration. Any pre-existing tolerations (e.g. node-role.kubernetes.io/master, node.kubernetes.io/unreachable) are silently dropped.

Current:

[{"op": "add", "path": "/spec/template/spec/tolerations", "value": [{"key": "amd-dcm", ...}]}]

Expected: Read existing tolerations, check if amd-dcm already exists, append if not, then write back the merged list.
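The read, check, append, write-back flow could look roughly like the sketch below. It is an illustration under assumptions, not the fix the docs must adopt: the helper names are hypothetical, and because the issue elides the full DCM toleration, the `operator` value in the example is a placeholder (only the `amd-dcm` key and the NoExecute effect come from this issue).

```python
import json


def merge_toleration(existing, new):
    """Return existing tolerations plus new, unless a toleration with that key is present."""
    merged = list(existing or [])
    if any(t.get("key") == new["key"] for t in merged):
        return merged  # already tolerated; leave the list untouched
    merged.append(new)
    return merged


def build_add_patch(pod_spec, new):
    """Build a JSON patch that writes back the merged list.

    "add" on the whole array is safe here because the value is the merged
    list, and it also works when the tolerations field does not exist yet.
    """
    merged = merge_toleration(pod_spec.get("tolerations"), new)
    return [{"op": "add", "path": "/spec/template/spec/tolerations", "value": merged}]


if __name__ == "__main__":
    # Placeholder toleration: the operator field is an assumption.
    dcm = {"key": "amd-dcm", "operator": "Exists", "effect": "NoExecute"}
    spec = {"tolerations": [{"key": "node-role.kubernetes.io/master",
                             "operator": "Exists", "effect": "NoSchedule"}]}
    print(json.dumps(build_add_patch(spec, dcm)))
```

Here `pod_spec` stands for the `.spec.template.spec` object read from the workload just before patching. The read and the write are separate steps, so a concurrent change to the tolerations in between would still be overwritten.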

Bug 3 — Cleanup blindly removes first toleration

The cleanup step (line ~355) uses "op": "remove", "path": "/spec/template/spec/tolerations/0" which removes the toleration at index 0 regardless of what it is. If the DCM toleration isn't first in the array, this deletes a legitimate toleration instead.

Current:

[{"op": "remove", "path": "/spec/template/spec/tolerations/0"}]

Expected: Filter out only the amd-dcm toleration by key, preserving all others.
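Filtering by key rather than by index might be sketched as follows. This is one possible approach under assumptions: the helper name `build_remove_patch` is hypothetical, and the sample tolerations are illustrative rather than taken from the docs.

```python
import json

TOLERATIONS_PATH = "/spec/template/spec/tolerations"


def build_remove_patch(pod_spec, key="amd-dcm"):
    """Build a JSON patch that drops only tolerations whose key matches.

    Returns an empty patch when the key is absent, so cleanup is a no-op
    instead of deleting an unrelated toleration.
    """
    existing = pod_spec.get("tolerations") or []
    kept = [t for t in existing if t.get("key") != key]
    if len(kept) == len(existing):
        return []  # nothing to clean up
    if not kept:
        # The DCM toleration was the only entry; remove the field entirely.
        return [{"op": "remove", "path": TOLERATIONS_PATH}]
    return [{"op": "replace", "path": TOLERATIONS_PATH, "value": kept}]


if __name__ == "__main__":
    spec = {"tolerations": [
        {"key": "node.kubernetes.io/unreachable", "operator": "Exists", "effect": "NoExecute"},
        {"key": "amd-dcm", "operator": "Exists", "effect": "NoExecute"},
    ]}
    print(json.dumps(build_remove_patch(spec)))
```

In this example the amd-dcm entry sits at index 1, which is exactly the case where removing index 0 would delete the wrong toleration.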

Impact

  • Bug 1: On OpenShift, the toleration step patches only kube-system, so it is effectively a no-op and leaves openshift-* DaemonSets unprotected. Applying the NoExecute taint then evicts DNS, networking, and ingress pods from the GPU node.
  • Bug 2: Drops existing tolerations, which can prevent pods from scheduling on tainted nodes (e.g. control-plane nodes with NoSchedule).
  • Bug 3: Silently removes the wrong toleration during cleanup whenever amd-dcm is not at index 0, potentially breaking pod scheduling.

Affected file

docs/dcm/applying-partition-profiles.rst — lines 43-65 (add tolerations) and lines 344-361 (remove tolerations).
