Skip to content

Bump min required memory to 3GB and Fix KVM driver (tests) timeouts #20852

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jun 3, 2025

Conversation

prezha
Copy link
Contributor

@prezha prezha commented May 27, 2025

Refactor KVM driver waiting logic for domain start, getting ip address and shutting domain down. Add more config/state outputs to aid future debugging.

Bump go/libvirt to v1.11002.0 and set the minimum memory required for running all tests to 3GB to avoid some really weird system behaviour.

fixes #20818
closes #20770

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 27, 2025
@k8s-ci-robot k8s-ci-robot requested a review from medyagh May 27, 2025 20:10
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: prezha

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from spowelljr May 27, 2025 20:10
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 27, 2025
Rewrite KVM driver waiting logic for domain start, getting ip address
and shutting domain down. Add more config/state outputs to aid future
debugging.

Bump go/libvirt to v1.11002.0 and set the minimum memory required for
running all tests to 3GB to avoid some really weird system behaviour.
@prezha
Copy link
Contributor Author

prezha commented May 27, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label May 27, 2025
@prezha prezha changed the title Fix KVM driver tests timeouts Fix KVM driver (tests) timeouts May 27, 2025
@minikube-pr-bot

This comment has been minimized.

sha256 6924efde5de86fe277676e929dc9917d466efa02fb934197bc2eba35d5680971 go1.23.4.linux-amd64.tar.gz
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you intending to build go ISO in this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, i also wanted to rebuild the iso and use that one to also avoid the "spam" tests failures and filter/rule them out

@@ -1098,8 +1098,8 @@ func suggestMemoryAllocation(sysLimit, containerLimit, nodes int) int {
return mem
}

const fallback = 2200
maximum := 6000
const fallback = 3072
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this means, we are bumping the mininum required memory for minikube ? if yes and if needed how about be done in an explicit separate PR that can get attention (since affects all drivers)

all other refactors seems to be doing good by unifying the error message wording

Copy link
Contributor Author

@prezha prezha Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think that bumping minimum required memory now, after we bumped several components starting from buildroot, became mandatory and also core to fix this issue, but i don't mind opening a separate pr for just memory bump if we think that would help, although it's coupled with few other improvements here on how we wait machine to get an ip

btw, we're also seeing other examples of failures with less than 3gb mem - like this one

i was able to replicate the issue with at least these ten kvm driver tests consistently failing (whilst all having in common originally using 2gb mem):

  • TestCertExpiration
  • TestCertOptions
  • TestDockerFlags
  • TestForceSystemdEnv
  • TestForceSystemdFlag
  • TestMountStart/serial/StartWithMountFirst
  • TestNoKubernetes/serial/Start
  • TestOffline
  • TestPause/serial/Start
  • TestScheduledStopUnix

here's couple console capture examples of kernel panic due to oom:
Screenshot-20250601 172738
Screenshot-20250601 182853

i've also added the ability to capture and log the vm console outputs in this pr - that should help future debugging as well, eg:

...
I0601 21:44:01.038071 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | unable to find current IP address of domain mount-start-1-457392 in network mk-mount-start-1-457392 (interfaces detected: [])
I0601 21:44:01.038113 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | console log:
I0601 21:44:01.038120 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.962861] Kernel panic - not syncing: System is deadlocked on memory
I0601 21:44:01.038126 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.963493] CPU: 0 PID: 1 Comm: init Not tainted 5.10.207 #1
I0601 21:44:01.038134 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.963911] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-2-gc13ff2cd-prebuilt.qemu.org 04/01/2014
I0601 21:44:01.038140 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.964838] Call Trace:
I0601 21:44:01.038146 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.965022]  dump_stack+0x57/0x6e
I0601 21:44:01.038154 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.965340]  panic+0x11a/0x2d9
I0601 21:44:01.038160 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.965565]  out_of_memory.cold+0x2f/0x7e
I0601 21:44:01.038185 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.965886]  __alloc_pages_slowpath.constprop.0+0xae1/0xc40
I0601 21:44:01.038192 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.966343]  __alloc_pages_nodemask+0x2cb/0x300
I0601 21:44:01.038196 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.966678]  alloc_pages_vma+0x78/0x260
I0601 21:44:01.038200 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.966964]  handle_mm_fault+0xa06/0x14c0
I0601 21:44:01.038205 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.967283]  do_user_addr_fault+0x183/0x3b0
I0601 21:44:01.038209 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.967585]  exc_page_fault+0x56/0x110
I0601 21:44:01.038214 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.967877]  ? asm_exc_page_fault+0x8/0x30
I0601 21:44:01.038218 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.968241]  asm_exc_page_fault+0x1e/0x30
I0601 21:44:01.038226 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.968573] RIP: 0033:0x7f6cd017dd54
I0601 21:44:01.038234 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.968837] Code: 8d 34 19 49 39 d0 49 89 70 60 0f 95 c2 48 29 d8 48 83 c1 10 0f b6 d2 48 83 c8 01 48 c1 e2 02 48 09 da 48 83 ca 01 48 89 51 f8 <48> 89 46 08 48 89 cf 4c 89 e6 48 89 4c 24 08 e8 18 c7 ff ff 48 8b
I0601 21:44:01.038239 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.970354] RSP: 002b:00007ffff1d812d0 EFLAGS: 00010206
I0601 21:44:01.038245 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.970720] RAX: 000000000000e581 RBX: 0000000000001010 RCX: 000055b57abe9a80
I0601 21:44:01.038251 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.971262] RDX: 0000000000001011 RSI: 000055b57abeaa80 RDI: 0000000000000000
I0601 21:44:01.038257 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.971777] RBP: ffffffffffffffc0 R08: 00007f6cd02b6ac0 R09: 0000000000000000
I0601 21:44:01.038263 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.972372] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000001000
I0601 21:44:01.038269 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.972887] R13: 0000000000000000 R14: 00000000000000ff R15: 00007f6cd02b6b20
I0601 21:44:01.038275 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.973726] Kernel Offset: 0x32e00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
I0601 21:44:01.038280 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [    4.975132] ---[ end Kernel panic - not syncing: System is deadlocked on memory ]---
I0601 21:44:01.038286 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | 
I0601 21:44:01.038463 2477073 client.go:171] duration metric: took 1m37.283163428s to LocalClient.Create
...

</features>
<cpu mode='host-passthrough'>
{{if gt .NUMANodeCount 1}}
{{- if gt .NUMANodeCount 1}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this not working in past ? seems to be a new syntax ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're using that syntax to avoid printing empty lines where conditions are not met - we use that in a number of other templates already - eg: v1beta4.go

@minikube-pr-bot

This comment has been minimized.

@minikube-pr-bot

This comment has been minimized.

@minikube-pr-bot

This comment has been minimized.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 1, 2025
@minikube-pr-bot

This comment has been minimized.

@prezha
Copy link
Contributor Author

prezha commented Jun 2, 2025

ok-to-build-iso

@minikube-bot
Copy link
Collaborator

Hi @prezha, we have updated your PR with the reference to newly built ISO. Pull the changes locally if you want to test with them or update your PR further.

@minikube-pr-bot
Copy link

kvm2 driver with docker runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 20852) |
+----------------+----------+---------------------+
| minikube start | 52.4s    | 51.8s               |
| enable ingress | 15.0s    | 15.1s               |
+----------------+----------+---------------------+

Times for minikube start: 53.4s 54.4s 53.0s 51.8s 49.5s
Times for minikube (PR 20852) start: 54.0s 47.3s 50.0s 55.4s 52.1s

Times for minikube ingress: 15.5s 15.0s 14.6s 15.1s 15.0s
Times for minikube (PR 20852) ingress: 14.6s 14.5s 15.1s 16.2s 15.0s

docker driver with docker runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 20852) |
+----------------+----------+---------------------+
| minikube start | 25.6s    | 25.0s               |
| enable ingress | 13.0s    | 12.8s               |
+----------------+----------+---------------------+

Times for minikube ingress: 13.3s 12.3s 12.8s 13.8s 12.8s
Times for minikube (PR 20852) ingress: 12.8s 11.3s 13.4s 13.3s 13.3s

Times for minikube (PR 20852) start: 23.0s 26.1s 25.3s 24.0s 26.7s
Times for minikube start: 26.3s 25.9s 23.5s 25.0s 27.3s

docker driver with containerd runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 20852) |
+----------------+----------+---------------------+
| minikube start | 24.1s    | 22.8s               |
| enable ingress | 37.4s    | 37.4s               |
+----------------+----------+---------------------+

Times for minikube start: 22.6s 25.9s 26.4s 24.3s 21.1s
Times for minikube (PR 20852) start: 22.7s 20.5s 25.3s 21.7s 23.9s

Times for minikube ingress: 39.3s 39.3s 39.8s 29.8s 38.8s
Times for minikube (PR 20852) ingress: 38.8s 38.8s 29.8s 39.8s 39.9s

@@ -145,7 +145,7 @@ func getLabels(containerRuntime string) *stackdriver.Labels {
func minikubeStartTime(ctx context.Context, projectID, minikubePath, containerRuntime string) (float64, error) {
defer deleteMinikube(ctx, minikubePath)

cmd := exec.CommandContext(ctx, minikubePath, "start", "--driver=docker", "-p", profile, "--memory=2048", "--trace=gcp", fmt.Sprintf("--container-runtime=%s", containerRuntime))
cmd := exec.CommandContext(ctx, minikubePath, "start", "--driver=docker", "-p", profile, "--memory=3072", "--trace=gcp", fmt.Sprintf("--container-runtime=%s", containerRuntime))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw this is for the Docker Driver, do we even need this ? (this is benchmarking)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's definitelly not "critical", but i think that now that we know we can expect issues with less than 3gb mem, my thinking is trying to avoid potential flakiness, including here

@medyagh medyagh merged commit 4da3ced into kubernetes:master Jun 3, 2025
26 of 34 checks passed
@medyagh medyagh changed the title Fix KVM driver (tests) timeouts Bump min required memory to 3GB and Fix KVM driver (tests) timeouts Jun 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
5 participants