-
Notifications
You must be signed in to change notification settings - Fork 5k
Bump min required memory to 3GB and Fix KVM driver (tests) timeouts #20852
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: prezha The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Rewrite KVM driver waiting logic for domain start, getting ip address and shutting domain down. Add more config/state outputs to aid future debugging. Bump go/libvirt to v1.11002.0 and set the minimum memory required for running all tests to 3GB to avoid some really weird system behaviour.
/ok-to-test |
This comment has been minimized.
This comment has been minimized.
sha256 6924efde5de86fe277676e929dc9917d466efa02fb934197bc2eba35d5680971 go1.23.4.linux-amd64.tar.gz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you intending to build go ISO in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, i also wanted to rebuild the iso and use that one to also avoid the "spam" tests failures and filter/rule them out
@@ -1098,8 +1098,8 @@ func suggestMemoryAllocation(sysLimit, containerLimit, nodes int) int { | |||
return mem | |||
} | |||
|
|||
const fallback = 2200 | |||
maximum := 6000 | |||
const fallback = 3072 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this means, we are bumping the mininum required memory for minikube ? if yes and if needed how about be done in an explicit separate PR that can get attention (since affects all drivers)
all other refactors seems to be doing good by unifying the error message wording
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think that bumping minimum required memory now, after we bumped several components starting from buildroot, became mandatory and also core to fix this issue, but i don't mind opening a separate pr for just memory bump if we think that would help, although it's coupled with few other improvements here on how we wait machine to get an ip
btw, we're also seeing other examples of failures with less than 3gb mem - like this one
i was able to replicate the issue with at least these ten kvm driver tests consistently failing (whilst all having in common originally using 2gb mem):
- TestCertExpiration
- TestCertOptions
- TestDockerFlags
- TestForceSystemdEnv
- TestForceSystemdFlag
- TestMountStart/serial/StartWithMountFirst
- TestNoKubernetes/serial/Start
- TestOffline
- TestPause/serial/Start
- TestScheduledStopUnix
here's couple console capture examples of kernel panic due to oom:
i've also added the ability to capture and log the vm console outputs in this pr - that should help future debugging as well, eg:
...
I0601 21:44:01.038071 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | unable to find current IP address of domain mount-start-1-457392 in network mk-mount-start-1-457392 (interfaces detected: [])
I0601 21:44:01.038113 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | console log:
I0601 21:44:01.038120 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.962861] Kernel panic - not syncing: System is deadlocked on memory
I0601 21:44:01.038126 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.963493] CPU: 0 PID: 1 Comm: init Not tainted 5.10.207 #1
I0601 21:44:01.038134 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.963911] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-2-gc13ff2cd-prebuilt.qemu.org 04/01/2014
I0601 21:44:01.038140 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.964838] Call Trace:
I0601 21:44:01.038146 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.965022] dump_stack+0x57/0x6e
I0601 21:44:01.038154 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.965340] panic+0x11a/0x2d9
I0601 21:44:01.038160 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.965565] out_of_memory.cold+0x2f/0x7e
I0601 21:44:01.038185 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.965886] __alloc_pages_slowpath.constprop.0+0xae1/0xc40
I0601 21:44:01.038192 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.966343] __alloc_pages_nodemask+0x2cb/0x300
I0601 21:44:01.038196 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.966678] alloc_pages_vma+0x78/0x260
I0601 21:44:01.038200 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.966964] handle_mm_fault+0xa06/0x14c0
I0601 21:44:01.038205 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.967283] do_user_addr_fault+0x183/0x3b0
I0601 21:44:01.038209 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.967585] exc_page_fault+0x56/0x110
I0601 21:44:01.038214 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.967877] ? asm_exc_page_fault+0x8/0x30
I0601 21:44:01.038218 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.968241] asm_exc_page_fault+0x1e/0x30
I0601 21:44:01.038226 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.968573] RIP: 0033:0x7f6cd017dd54
I0601 21:44:01.038234 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.968837] Code: 8d 34 19 49 39 d0 49 89 70 60 0f 95 c2 48 29 d8 48 83 c1 10 0f b6 d2 48 83 c8 01 48 c1 e2 02 48 09 da 48 83 ca 01 48 89 51 f8 <48> 89 46 08 48 89 cf 4c 89 e6 48 89 4c 24 08 e8 18 c7 ff ff 48 8b
I0601 21:44:01.038239 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.970354] RSP: 002b:00007ffff1d812d0 EFLAGS: 00010206
I0601 21:44:01.038245 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.970720] RAX: 000000000000e581 RBX: 0000000000001010 RCX: 000055b57abe9a80
I0601 21:44:01.038251 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.971262] RDX: 0000000000001011 RSI: 000055b57abeaa80 RDI: 0000000000000000
I0601 21:44:01.038257 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.971777] RBP: ffffffffffffffc0 R08: 00007f6cd02b6ac0 R09: 0000000000000000
I0601 21:44:01.038263 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.972372] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000001000
I0601 21:44:01.038269 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.972887] R13: 0000000000000000 R14: 00000000000000ff R15: 00007f6cd02b6b20
I0601 21:44:01.038275 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.973726] Kernel Offset: 0x32e00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
I0601 21:44:01.038280 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG | [ 4.975132] ---[ end Kernel panic - not syncing: System is deadlocked on memory ]---
I0601 21:44:01.038286 2477073 main.go:141] libmachine: (mount-start-1-457392) DBG |
I0601 21:44:01.038463 2477073 client.go:171] duration metric: took 1m37.283163428s to LocalClient.Create
...
</features> | ||
<cpu mode='host-passthrough'> | ||
{{if gt .NUMANodeCount 1}} | ||
{{- if gt .NUMANodeCount 1}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was this not working in past ? seems to be a new syntax ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're using that syntax to avoid printing empty lines where conditions are not met - we use that in a number of other templates already - eg: v1beta4.go
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
ok-to-build-iso |
Hi @prezha, we have updated your PR with the reference to newly built ISO. Pull the changes locally if you want to test with them or update your PR further. |
kvm2 driver with docker runtime
Times for minikube start: 53.4s 54.4s 53.0s 51.8s 49.5s Times for minikube ingress: 15.5s 15.0s 14.6s 15.1s 15.0s docker driver with docker runtime
Times for minikube ingress: 13.3s 12.3s 12.8s 13.8s 12.8s Times for minikube (PR 20852) start: 23.0s 26.1s 25.3s 24.0s 26.7s docker driver with containerd runtime
Times for minikube start: 22.6s 25.9s 26.4s 24.3s 21.1s Times for minikube ingress: 39.3s 39.3s 39.8s 29.8s 38.8s |
@@ -145,7 +145,7 @@ func getLabels(containerRuntime string) *stackdriver.Labels { | |||
func minikubeStartTime(ctx context.Context, projectID, minikubePath, containerRuntime string) (float64, error) { | |||
defer deleteMinikube(ctx, minikubePath) | |||
|
|||
cmd := exec.CommandContext(ctx, minikubePath, "start", "--driver=docker", "-p", profile, "--memory=2048", "--trace=gcp", fmt.Sprintf("--container-runtime=%s", containerRuntime)) | |||
cmd := exec.CommandContext(ctx, minikubePath, "start", "--driver=docker", "-p", profile, "--memory=3072", "--trace=gcp", fmt.Sprintf("--container-runtime=%s", containerRuntime)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw this is for the Docker Driver, do we even need this ? (this is benchmarking)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's definitelly not "critical", but i think that now that we know we can expect issues with less than 3gb mem, my thinking is trying to avoid potential flakiness, including here
Refactor KVM driver waiting logic for domain start, getting ip address and shutting domain down. Add more config/state outputs to aid future debugging.
Bump go/libvirt to v1.11002.0 and set the minimum memory required for running all tests to 3GB to avoid some really weird system behaviour.
fixes #20818
closes #20770