
kubelet fails to restart with CPUManager policy static when node's some CPU is offline #124066

Open
mtawaken opened this issue Mar 27, 2024 · 16 comments · May be fixed by #124561
Labels
kind/bug: Categorizes issue or PR as related to a bug.
kind/feature: Categorizes issue or PR as related to a new feature.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@mtawaken

What happened?

I have a node with the following configuration:
a. 96 CPUs, with one CPU (ID=95) disabled.
b. kubelet CPUManager static policy enabled.
c. kubeReserved: cpu: 2, systemReserved: cpu: 2

After starting kubelet, the detected topology shows:
"Detected CPU topology" topology=&{NumCPUs:95 NumCores:48 NumSockets:2 NumNUMANodes:2 CPUDetails:......
Note: Core 25 contains CPUs 25 and 73. Core 47 contains CPUs 47 and 95, but 95 is disabled and does not appear in the log.

The static policy's CPU reservation logs are below:
"takeFullCores: claiming core" core=47
"takeFullCores: claiming core" core=24
"takeRemainingCPUs: claiming CPU" cpu=25
"Reserved CPUs not available for exclusive assignment" reservedSize=4 reserved="24-25,47,72"
"Updated default CPUSet" cpuSet="0-94"

The weird thing is that when the first Pod is admitted, the reserved CPU (ID=25) is reused:
"Topology Affinity" pod="xxx" containerName="xxx" affinity={NUMANodeAffinity:<nil> Preferred:false}
"AllocateCPUs" numCPUs=20 socket=<nil>
"takeFullCores: claiming core" core=25
"takeRemainingCPUs: claiming CPU" cpu=26
"takeRemainingCPUs: claiming CPU" cpu=74
....
"Updated default CPUSet" cpuSet="0-24,35-72,83-94"
"AllocateCPUs" result="25-34,73-82"

This conflict in turn causes kubelet restarts to fail because of the static policy validation:
"Static policy invalid state, please drain node and remove policy state file" err="not all reserved cpus: \"24-25,47,72\" are present in defaultCpuSet: \"0-24,35-72,83-94\""

What did you expect to happen?

Pod CPU set allocation should not share CPUs with the reserved CPUs.

How can we reproduce it (as minimally and precisely as possible)?

  1. Use chcpu -d to disable one CPU on the node. Check /sys/devices/system/cpu/online to make sure the CPU is offline.
  2. Change the kubelet configuration:
    a. cpuManagerPolicy: static
    b. kubeReserved: cpu: 2, systemReserved: cpu: 2
  3. Restart kubelet.
  4. Create a guaranteed Pod requesting some exclusive CPUs.
  5. Restart kubelet; it is expected to fail to start up.

Anything else we need to know?

It seems that the kubelet CPUManager static policy has not taken the offline-CPU scenario into account.

  1. In the specific case above, this method returns 1 (95/48), not 2 (the sketch after this list illustrates the arithmetic):

    func (topo *CPUTopology) CPUsPerCore() int {
        if topo.NumCores == 0 {
            return 0
        }
        return topo.NumCPUs / topo.NumCores
    }

  2. This in turn makes a partially available core look free (1 == 1):

    func (a *cpuAccumulator) isCoreFree(coreID int) bool {
        return a.details.CPUsInCores(coreID).Size() == a.topo.CPUsPerCore()
    }

  3. All logic related to CPUsPerCore and CPUsPerSocket is affected.
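
A minimal, self-contained sketch of the arithmetic (the numbers come from the topology in this issue; the functions are simplified stand-ins, not the kubelet's actual code), showing how the integer division misreports a half-reserved core as free:

    package main

    import "fmt"

    // cpusPerCore mirrors the shape of CPUTopology.CPUsPerCore: a plain integer division.
    func cpusPerCore(numCPUs, numCores int) int {
        if numCores == 0 {
            return 0
        }
        return numCPUs / numCores
    }

    // isCoreFree mirrors the shape of cpuAccumulator.isCoreFree: a core counts as "free"
    // when the number of still-assignable CPUs on it equals the global CPUs-per-core value.
    func isCoreFree(assignableOnCore, perCore int) bool {
        return assignableOnCore == perCore
    }

    func main() {
        // 95 online CPUs on 48 physical cores: 95/48 == 1, although real cores have 2 threads.
        perCore := cpusPerCore(95, 48)
        fmt.Println("CPUsPerCore:", perCore) // 1

        // Core 25 holds CPUs 25 and 73; CPU 25 is reserved, so only CPU 73 is assignable.
        // With perCore == 1 the core is reported as free, so claiming the "full" core
        // pulls the reserved CPU 25 into the Pod's exclusive CPU set.
        fmt.Println("core 25 considered free:", isCoreFree(1, perCore)) // true
    }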

Kubernetes version

Tested on Kubernetes v1.25; other versions behave the same.

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@mtawaken mtawaken added the kind/bug Categorizes issue or PR as related to a bug. label Mar 27, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 27, 2024
@vaibhav2107
Member

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 27, 2024
@SergeyKanzhelev SergeyKanzhelev added this to Triage in SIG Node Bugs Mar 27, 2024
@ffromani
Contributor

ffromani commented Apr 1, 2024

/cc

2 similar comments
@swatisehgal
Contributor

/cc

@esotsal
Contributor

esotsal commented Apr 3, 2024

/cc

@AnishShah
Contributor

AnishShah commented Apr 3, 2024

/kind feature
/remove-kind bug

The kubelet does not respond well to dynamic changes to resources on the node. There is an open KEP-3953 to improve how the kubelet handles dynamic resource changes.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Apr 3, 2024
@AnishShah
Contributor

/cc @haircommander

@haircommander
Contributor

A bit more context on @AnishShah's comment:

In the SIG Node CI meeting today we talked about this and concluded that it is a feature request, in part because of the structured error that clearly catches this problem, and in part because this is ultimately a hotplugged-resource problem, even though the kubelet restarts in between. How the kubelet responds to CPUs being taken offline underneath it would be solved the same way whether the kubelet was continuously running or was restarted before and after.

I thought there was an open KEP about hotplugging memory/cpu and having the kubelet respond, but I can't find it.

We also talked a bit about what the kubelet should do in this case. We decided the kubelet should probably evict Pods in this case, but that means making the eviction manager aware of the needs of the CPU manager.

All of this together informed the decision to mark this as a feature.

@mtawaken
Author

mtawaken commented Apr 7, 2024

Thanks for the comment.
I went through the webinar recording and agree that the hotplugged-resource feature requires a lot of effort and there is no need to rush it.
But I think an aspect is missing from the discussion. In this case the node configuration is static; no dynamic change is involved. The kubelet is wrong from the very beginning (please refer to the code I pasted above).
So maybe it is still a bug, unless we declare that an odd number of CPUs with hyper-threading is not supported?

PS: one of the node's CPUs is always offline because of an HPC issue; keeping it offline is necessary for tuning on some AMD architectures.

@haircommander
Contributor

I think "very beginning" is questionable, since you have running containers and an existing state file. I get the point that it's the beginning of the kubelet's process execution, but clearly this node has not just been rebooted or started from scratch. That's why it feels like a hotplug case to me.

@haircommander
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 10, 2024
@ffromani
Contributor

I think this is what @haircommander mentioned before: kubernetes/enhancements#3953

@haircommander haircommander moved this from Triage to Triaged in SIG Node Bugs Apr 10, 2024
@mtawaken
Author

mtawaken commented Apr 11, 2024

I think "very beginning" is questionable, since you have running containers and an existing state file

Understood for the case where resources change dynamically and the kubelet is then restarted. For this case, though, we can reproduce it with no running containers and after clearing any existing state file before starting the kubelet; it is effectively a brand-new node.

To clarify:

  • Dynamic node resize feature: the topology is not up to date, and we need to update the topology and make it account for existing allocations as well as further allocations.
  • This issue: the topology is already up to date and no offline CPU is used, but a subsequent container allocation shares CPUs with the reserved CPUs (kube-reserved/system-reserved). This violates the primary design.

@ffromani
Contributor

After another careful review, I think @mtawaken has a point here. The internal logic in cpumanager (e.g. topology.go) makes implicit assumptions about how CPUs (hyperthreads) are distributed across cores (physical entities), and this seems to break in corner cases like this one.

In other words, the internal logic makes too many assumptions, exemplified by the code snippet shared in the issue description: it trivially divides the CPU count (virtual) by the core count (physical) instead of checking the actual topology and alignment.

The easiest way to break this logic is indeed to disable a single CPU, as in this case.

A workaround COULD be using the full-pcpus-only cpumanager policy option, but this has its own set of drawbacks, so it is not a drop-in solution. Still, perhaps @mtawaken you can evaluate this approach?

I'm inclined to re-triage this as a bug, but with lower priority because it seems to be a very narrow corner case.

@mtawaken
Author

@ffromani sorry for the late reply. I evaluated the full-pcpus-only option; it only works as an option that extends "distribute-cpus-across-numa" and still relies on the result of CPUsPerCore, which is already wrong.

I submitted a PR as the workaround we use inside our environment, modifying the isCoreFree/isSocketFree methods. It is not a full solution for the offline-CPU scenario (maybe that should be covered by the dynamic node resize feature?), but it solves the reuse problem of reserved CPUs and lets kubelet restart normally again.
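
A minimal sketch of the idea (a hypothetical illustration of the approach described above, not the actual change in the PR): compare a core's assignable CPUs against that core's own online CPU count in the detected topology, instead of against the global NumCPUs/NumCores average.

    package main

    import "fmt"

    // topology is a stand-in for the kubelet's CPU topology details:
    // it maps a physical core ID to the online CPU IDs on that core.
    type topology struct {
        cpusByCore map[int][]int
    }

    // isCoreFree reports whether every online CPU of the core is still assignable,
    // using the core's actual CPU list rather than a global CPUs-per-core average.
    func isCoreFree(topo topology, assignable map[int]bool, coreID int) bool {
        free := 0
        for _, cpu := range topo.cpusByCore[coreID] {
            if assignable[cpu] {
                free++
            }
        }
        return free == len(topo.cpusByCore[coreID])
    }

    func main() {
        topo := topology{cpusByCore: map[int][]int{
            25: {25, 73}, // both siblings online, CPU 25 reserved
            47: {47},     // sibling CPU 95 is offline, CPU 47 reserved
        }}
        assignable := map[int]bool{73: true}

        fmt.Println("core 25 free:", isCoreFree(topo, assignable, 25)) // false: reserved sibling 25 is not assignable
        fmt.Println("core 47 free:", isCoreFree(topo, assignable, 47)) // false: CPU 47 itself is reserved
    }

With this kind of per-core check, core 25 is no longer treated as a fully free core, so an exclusive allocation cannot pull in the reserved CPU 25.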

Would you mind taking a look, @ffromani?

@ffromani
Contributor

@ffromani sorry for the late reply. I evaluated the full-pcpus-only option; it only works as an option that extends "distribute-cpus-across-numa" and still relies on the result of CPUsPerCore, which is already wrong.

Right. The critical flaw is that we do trivial divisions in the CPU topology code (and in the surrounding code) rather than enumerating the CPUs and taking their relationships (and online/offline status) into account.

I submitted a PR as the workaround we use inside our environment, modifying the isCoreFree/isSocketFree methods. It is not a full solution for the offline-CPU scenario (maybe that should be covered by the dynamic node resize feature?), but it solves the reuse problem of reserved CPUs and lets kubelet restart normally again.

This seems promising; I'll have a look.

@ffromani
Contributor

/kind bug
/priority important-longterm

I believe this is a legit bug, albeit one I expect to be pretty rare in practice. Retriaging as such.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Apr 29, 2024