
kubelet fails to restart with CPUManager policy static when node's some CPU is offline #124066

Open
mtawaken opened this issue Mar 27, 2024 · 16 comments · May be fixed by #124561
Labels
kind/bug: Categorizes issue or PR as related to a bug.
kind/feature: Categorizes issue or PR as related to a new feature.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@mtawaken

What happened?

I have a node with the following configuration:
a. 96 CPUs, with one CPU (ID=95) disabled.
b. kubelet CPUManager static policy enabled.
c. kubeReserved: cpu: 2, systemReserved: cpu: 2

After starting kubelet, the detected topology shows:
"Detected CPU topology" topology=&{NumCPUs:95 NumCores:48 NumSockets:2 NumNUMANodes:2 CPUDetails:......
Note: Core 25 contains CPUs 25 and 73. Core 47 contains CPUs 47 and 95, but 95 is disabled and does not appear in the log.

The static policy's CPU reservation logs are below:
"takeFullCores: claiming core" core=47
"takeFullCores: claiming core" core=24
"takeRemainingCPUs: claiming CPU" cpu=25
"Reserved CPUs not available for exclusive assignment" reservedSize=4 reserved="24-25,47,72"
"Updated default CPUSet" cpuSet="0-94"

The weird thing is that when the first Pod is admitted, the reserved CPU (ID=25) is reused:
"Topology Affinity" pod="xxx" containerName="xxx" affinity={NUMANodeAffinity:<nil> Preferred:false}
"AllocateCPUs" numCPUs=20 socket=<nil>
"takeFullCores: claiming core" core=25
"takeRemainingCPUs: claiming CPU" cpu=26
"takeRemainingCPUs: claiming CPU" cpu=74
....
"Updated default CPUSet" cpuSet="0-24,35-72,83-94"
"AllocateCPUs" result="25-34,73-82"

This conflict in turn causes kubelet restarts to fail because of the static policy validation:
"Static policy invalid state, please drain node and remove policy state file" err="not all reserved cpus: \"24-25,47,72\" are present in defaultCpuSet: \"0-24,35-72,83-94\""

What did you expect to happen?

Pod CPU set allocation should not share CPUs with the reserved CPUs.

How can we reproduce it (as minimally and precisely as possible)?

  1. Use chcpu -d to disable one CPU on the node. Check /sys/devices/system/cpu/online to make sure the CPU is offline.
  2. Change the kubelet configuration:
    a. cpuManagerPolicy: static
    b. kubeReserved: cpu: 2, systemReserved: cpu: 2
  3. Restart kubelet.
  4. Create a guaranteed Pod requesting some exclusive CPUs.
  5. Restart kubelet; it is expected to fail to start up.

Anything else we need to know?

It seems that the kubelet CPUManager static policy has not taken the offline-CPU scenario into account.

  1. In the specific case above, this method returns 1 (95/48), not 2 (the sketch after this list illustrates the arithmetic):

    func (topo *CPUTopology) CPUsPerCore() int {
        if topo.NumCores == 0 {
            return 0
        }
        return topo.NumCPUs / topo.NumCores
    }

  2. This in turn makes a partially available core look free (1 == 1):

    func (a *cpuAccumulator) isCoreFree(coreID int) bool {
        return a.details.CPUsInCores(coreID).Size() == a.topo.CPUsPerCore()
    }

  3. All logic related to CPUsPerCore and CPUsPerSocket is affected.
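
A minimal, self-contained sketch of the arithmetic (the numbers come from the topology in this issue; the functions are simplified stand-ins, not the kubelet's actual code), showing how the integer division misreports a half-reserved core as free:

    package main

    import "fmt"

    // cpusPerCore mirrors the shape of CPUTopology.CPUsPerCore: a plain integer division.
    func cpusPerCore(numCPUs, numCores int) int {
        if numCores == 0 {
            return 0
        }
        return numCPUs / numCores
    }

    // isCoreFree mirrors the shape of cpuAccumulator.isCoreFree: a core counts as "free"
    // when the number of still-assignable CPUs on it equals the global CPUs-per-core value.
    func isCoreFree(assignableOnCore, perCore int) bool {
        return assignableOnCore == perCore
    }

    func main() {
        // 95 online CPUs on 48 physical cores: 95/48 == 1, although real cores have 2 threads.
        perCore := cpusPerCore(95, 48)
        fmt.Println("CPUsPerCore:", perCore) // 1

        // Core 25 holds CPUs 25 and 73; CPU 25 is reserved, so only CPU 73 is assignable.
        // With perCore == 1 the core is reported as free, so claiming the "full" core
        // pulls the reserved CPU 25 into the Pod's exclusive CPU set.
        fmt.Println("core 25 considered free:", isCoreFree(1, perCore)) // true
    }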

Kubernetes version

Tested on Kubernetes v1.25; other versions behave the same.

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@mtawaken mtawaken added the kind/bug Categorizes issue or PR as related to a bug. label Mar 27, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 27, 2024
@vaibhav2107
Member

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 27, 2024
@SergeyKanzhelev SergeyKanzhelev added this to Triage in SIG Node Bugs Mar 27, 2024
@ffromani
Contributor

ffromani commented Apr 1, 2024

/cc

2 similar comments
@swatisehgal
Contributor

/cc

@esotsal
Contributor

esotsal commented Apr 3, 2024

/cc

@AnishShah
Contributor

AnishShah commented Apr 3, 2024

/kind feature
/remove-kind bug

The kubelet does not respond well to dynamic changes to resources on the node. There is an open KEP-3953 to improve how the kubelet handles dynamic resource changes.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Apr 3, 2024
@AnishShah
Contributor

/cc @haircommander

@haircommander
Contributor

A bit more context on @AnishShah's comment:

In the SIG Node CI meeting today we talked about this and concluded that it is a feature request, in part because of the structured error that clearly catches this problem, and in part because this is ultimately a hotplugged-resource problem, even though the kubelet restarts in between. How the kubelet responds to CPUs being taken offline underneath it would be solved the same way whether the kubelet was continuously running or was restarted before and after.

I thought there was an open KEP about hotplugging memory/cpu and having the kubelet respond, but I can't find it.

We also talked a bit about what the kubelet should do in this case. We decided the kubelet should probably evict Pods in this case, but that means making the eviction manager aware of the needs of the CPU manager.

All of this together informed the decision to mark this as a feature.

@mtawaken
Author

mtawaken commented Apr 7, 2024

Thanks for the comment.
I went through the webinar recording and agree that the hotplugged-resource feature requires a lot of effort and there is no need to rush it.
But I think an aspect is missing from the discussion. In this case the node configuration is static; no dynamic change is involved. The kubelet is wrong from the very beginning (please refer to the code I pasted above).
So maybe it is still a bug, unless we declare that an odd number of CPUs with hyper-threading is not supported?

PS: one of the node's CPUs is always offline because of an HPC issue; keeping it offline is necessary for tuning on some AMD architectures.

@haircommander
Contributor

I think "very beginning" is questionable, since you have running containers and an existing state file. I get the point that it's the beginning of the kubelet's process execution, but clearly this node has not just been rebooted or started from scratch. That's why it feels like a hotplug case to me.

@haircommander
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 10, 2024
@ffromani
Contributor

I think this is what @haircommander mentioned before: kubernetes/enhancements#3953

@haircommander haircommander moved this from Triage to Triaged in SIG Node Bugs Apr 10, 2024
@mtawaken
Author

mtawaken commented Apr 11, 2024

I think "very beginning" is questionable, since you have running containers and an existing state file

Understood for the case where resources change dynamically and the kubelet is then restarted. For this case, though, we can reproduce it with no running containers and after clearing any existing state file before starting the kubelet; it is effectively a brand-new node.

To clarify:

  • Dynamic node resize feature: the topology is not up to date, and we need to update the topology and make it account for existing allocations as well as further allocations.
  • This issue: the topology is already up to date and no offline CPU is used, but a subsequent container allocation shares CPUs with the reserved CPUs (kube-reserved/system-reserved). This violates the primary design.

@ffromani
Contributor

After another careful review, I think @mtawaken has a point here. The internal logic in cpumanager (e.g. topology.go) makes implicit assumptions about how CPUs (hyperthreads) are distributed across cores (physical entities), and this seems to break in corner cases like this one.

In other words, the internal logic makes too many assumptions, exemplified by the code snippet shared in the issue description: it trivially divides the CPU count (virtual) by the core count (physical) instead of checking the actual topology and alignment.

The easiest way to break this logic is indeed to disable a single CPU, as in this case.

A workaround COULD be using the full-pcpus-only cpumanager policy option, but this has its own set of drawbacks, so it is not a drop-in solution. Still, perhaps @mtawaken you can evaluate this approach?

I'm inclined to re-triage this as a bug, but with lower priority because it seems to be a very narrow corner case.

@mtawaken
Author

@ffromani sorry for the late reply. I evaluated the full-pcpus-only option; it only works as an option that extends "distribute-cpus-across-numa" and still relies on the result of CPUsPerCore, which is already wrong.

I submitted a PR as the workaround we use inside our environment, modifying the isCoreFree/isSocketFree methods. It is not a full solution for the offline-CPU scenario (maybe that should be covered by the dynamic node resize feature?), but it solves the reuse problem of reserved CPUs and lets kubelet restart normally again.
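
A minimal sketch of the idea (a hypothetical illustration of the approach described above, not the actual change in the PR): compare a core's assignable CPUs against that core's own online CPU count in the detected topology, instead of against the global NumCPUs/NumCores average.

    package main

    import "fmt"

    // topology is a stand-in for the kubelet's CPU topology details:
    // it maps a physical core ID to the online CPU IDs on that core.
    type topology struct {
        cpusByCore map[int][]int
    }

    // isCoreFree reports whether every online CPU of the core is still assignable,
    // using the core's actual CPU list rather than a global CPUs-per-core average.
    func isCoreFree(topo topology, assignable map[int]bool, coreID int) bool {
        free := 0
        for _, cpu := range topo.cpusByCore[coreID] {
            if assignable[cpu] {
                free++
            }
        }
        return free == len(topo.cpusByCore[coreID])
    }

    func main() {
        topo := topology{cpusByCore: map[int][]int{
            25: {25, 73}, // both siblings online, CPU 25 reserved
            47: {47},     // sibling CPU 95 is offline, CPU 47 reserved
        }}
        assignable := map[int]bool{73: true}

        fmt.Println("core 25 free:", isCoreFree(topo, assignable, 25)) // false: reserved sibling 25 is not assignable
        fmt.Println("core 47 free:", isCoreFree(topo, assignable, 47)) // false: CPU 47 itself is reserved
    }

With this kind of per-core check, core 25 is no longer treated as a fully free core, so an exclusive allocation cannot pull in the reserved CPU 25.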

Would you mind taking a look, @ffromani?

@ffromani
Contributor

@ffromani sorry for the late reply. I evaluated the full-pcpus-only option; it only works as an option that extends "distribute-cpus-across-numa" and still relies on the result of CPUsPerCore, which is already wrong.

Right. The critical flaw is that we do trivial divisions in the CPU topology code (and in the surrounding code) rather than enumerating the CPUs and taking their relationships (and online/offline status) into account.

I submitted a PR as the workaround we use inside our environment, modifying the isCoreFree/isSocketFree methods. It is not a full solution for the offline-CPU scenario (maybe that should be covered by the dynamic node resize feature?), but it solves the reuse problem of reserved CPUs and lets kubelet restart normally again.

This seems promising; I'll have a look.

@ffromani
Contributor

/kind bug
/priority important-longterm

I believe this is a legit bug, albeit one I expect to be pretty rare in practice. Retriaging as such.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Apr 29, 2024