Conversation

@def- (Contributor) commented Apr 10, 2025

Seen failing in https://buildkite.com/materialize/nightly/builds/11783#01961ce8-bbdc-41da-b266-10397ff883e9

  Warning  FailedScheduling  82s (x18 over 9m49s)  default-scheduler  0/7 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 5 Insufficient memory. preemption: 0/7 nodes are available: 2 Preemption is not helpful for scheduling, 5 No preemption victims found for incoming pod.
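The "5 Insufficient memory" part of the warning means the scheduler's memory fit check failed on five of the seven nodes. As a rough sketch (the real kube-scheduler logic is more involved; the function and numbers below are illustrative only), a pod fits a node only if its memory request does not exceed the node's remaining allocatable memory:

```python
# Illustrative sketch of the scheduler's memory fit check that produces
# "Insufficient memory". Not the actual kube-scheduler code.
def fits(node_allocatable_kib: int, node_requested_kib: int, pod_request_kib: int) -> bool:
    """Return True if the pod's memory request fits on the node."""
    return pod_request_kib <= node_allocatable_kib - node_requested_kib

GIB = 1024 * 1024  # KiB per GiB

# A node with 16 GiB allocatable and 10 GiB already requested cannot
# host a pod asking for 8 GiB, so it is reported as "Insufficient memory".
print(fits(16 * GIB, 10 * GIB, 8 * GIB))  # False
```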

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@def- def- force-pushed the pr-cloudtest-upgrade branch from 384e0ae to 8e44431 Compare April 10, 2025 06:51
@def- def- requested a review from a team as a code owner April 10, 2025 06:51
@def- (Contributor, Author) commented Apr 10, 2025

@doy-materialize I think this is related to license keys, this looks suspiciously like it's failing after 24 GB:

Apr 10 07:27:00 ip-10-61-114-81.ec2.internal kernel: Memory cgroup out of memory: Killed process 211652 (clusterd) total-vm:24047444kB, anon-rss:4062840kB, file-rss:102776kB, shmem-rss:0kB, UID:999 pgtables:32032kB oom_score_adj:979

https://buildkite.com/materialize/nightly/builds/11792
Is it possible that the license key passing in misc/python/materialize/cloudtest/k8s/environmentd.py is broken?
I see no output from environmentd about license keys though.
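The "suspiciously like 24 GB" observation is just a unit conversion of the `total-vm` figure in the OOM log above:

```python
import re

# Extract total-vm from the kernel OOM line quoted above and convert
# kB to GB to confirm the ~24 GB reading.
line = ("Memory cgroup out of memory: Killed process 211652 (clusterd) "
        "total-vm:24047444kB, anon-rss:4062840kB")
total_vm_kb = int(re.search(r"total-vm:(\d+)kB", line).group(1))
print(round(total_vm_kb / 1_000_000, 1))  # 24.0 GB
```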

@def- def- requested a review from a team as a code owner April 10, 2025 07:57
@def- def- force-pushed the pr-cloudtest-upgrade branch from e905308 to 8e44431 Compare April 10, 2025 08:25
@doy-materialize (Contributor) commented
hmmm, i'm not 100% sure here - we do set memory limits based on the cluster size that is provided, but the only thing the license key work changed was whether we prevent clusters from being launched if their declared memory usage for the cluster size would be higher than 24G. that said, it's also expected that environmentd won't complain about a missing license key if the process orchestrator is being used, so it is possible that something isn't being passed through correctly. i'd check the cluster size definitions first though?
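The gate described above can be sketched as follows. This is a hypothetical illustration, not Materialize's actual implementation: `may_launch`, the size table, and the 24 GiB constant are all stand-ins for whatever the license-key work actually checks against the cluster size definitions.

```python
# Hypothetical sketch of the described gate: without a valid license key,
# refuse to launch a cluster whose size declares more than 24 GiB of memory.
MAX_UNLICENSED_MEMORY_GIB = 24

# Illustrative cluster-size -> declared memory (GiB) table; the real
# cluster size definitions live in Materialize's configuration.
CLUSTER_SIZE_MEMORY_GIB = {"small": 8, "medium": 16, "large": 32}

def may_launch(cluster_size: str, has_valid_license: bool) -> bool:
    declared = CLUSTER_SIZE_MEMORY_GIB[cluster_size]
    return has_valid_license or declared <= MAX_UNLICENSED_MEMORY_GIB

print(may_launch("large", has_valid_license=False))  # False
print(may_launch("large", has_valid_license=True))   # True
```

If the failure mode matched this sketch, a 32 GiB size would be rejected only when the key was missing or not passed through, which is why checking the cluster size definitions first makes sense.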

@def- def- force-pushed the pr-cloudtest-upgrade branch from 8e44431 to 056b508 Compare April 14, 2025 05:48
@def- def- changed the title cloudtest: Bump number of workers for upgrade test cloudtest: Parallelize upgrade test Apr 14, 2025
@def- (Contributor, Author) commented Apr 14, 2025

@def- def- requested a review from aljoscha April 14, 2025 07:08
@def- def- enabled auto-merge April 14, 2025 07:08
@def- def- merged commit ec2fbab into MaterializeInc:main Apr 14, 2025
88 checks passed
@def- def- deleted the pr-cloudtest-upgrade branch April 14, 2025 07:42