
Refactor K8s deployment to use memory-tier components #375

Merged
t0mdavid-m merged 7 commits into main from claude/parallel-webapp-memory-optimization-RoNnJ
Apr 24, 2026

Conversation

@t0mdavid-m
Member

@t0mdavid-m t0mdavid-m commented Apr 24, 2026

Summary

Refactored Kubernetes deployment configuration to replace per-app overlay copies with a single production overlay that selects memory tiers via Kustomize components. This simplifies deployment setup and makes resource allocation explicit and reusable.

Key Changes

  • Eliminated per-app overlay pattern: Replaced k8s/overlays/template-app/ template with a single k8s/overlays/prod/ overlay that all forks use directly. The forked repository itself identifies the app, removing the need to copy overlay directories.

  • Introduced memory-tier components: Created two new Kustomize components (memory-tier-low and memory-tier-high) under k8s/components/ that encapsulate:

    • Node selector patches (openms.de/memory-tier=low|high)
    • Resource requests/limits tuned for each tier (low: 16 GiB worker limit; high: 180 GiB worker limit)
    • Separate patches for streamlit and rq-worker deployments
  • Removed hardcoded resources from base manifests: Stripped resource requests/limits from k8s/base/streamlit-deployment.yaml and k8s/base/rq-worker-deployment.yaml to allow tier components to inject tier-appropriate values.

  • Added cluster-wide LimitRange: New k8s/base/limitrange.yaml sets sensible container defaults (512Mi request, 2Gi default limit) and cluster maximums (64Gi memory, 16 CPU) to prevent resource exhaustion; a sketch follows this list.

  • Updated Redis resources: Increased Redis requests (64Mi → 256Mi memory, 50m → 100m CPU) and aligned limits to match requests for stability.

  • Updated documentation and CI: Modified skill guide, deployment docs, and GitHub Actions workflows to reference k8s/overlays/prod/ instead of per-app overlays and added memory-tier selection as a deployment step.
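
For reference, a minimal sketch of what the new LimitRange could look like, assuming the names and values described above (the authoritative manifest is k8s/base/limitrange.yaml; note that a later commit in this PR raises max.memory to 200Gi):

# k8s/base/limitrange.yaml (sketch; values taken from the description above)
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits
spec:
  limits:
    - type: Container
      defaultRequest:   # applied when a container omits requests
        memory: 512Mi
        cpu: 250m
      default:          # applied when a container omits limits
        memory: 2Gi
        cpu: "2"
      max:              # admission rejects containers above this
        memory: 64Gi
        cpu: "16"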

Implementation Details

  • Memory-tier components use JSON Patch operations to add nodeSelector to all Deployments, then override resource patches for specific workloads (streamlit, rq-worker); a sketch follows this list.
  • The prod overlay includes memory-tier-low by default; users switch to memory-tier-high only for genuinely memory-intensive workloads (DIA spectral-library, OpenSwath, DIA-LFQ).
  • Cluster nodes must be pre-labelled with openms.de/memory-tier=low|high for the node selectors to function.
  • All CI workflows updated to validate and deploy using k8s/overlays/prod/ instead of the template overlay.
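
To make the tier selection concrete, here is a hedged sketch of how the prod overlay and a component's node-selector patch fit together (contents are illustrative; the authoritative files live under k8s/):

# k8s/overlays/prod/kustomization.yaml (sketch) -- the one-line tier selector
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
components:
  - ../../components/memory-tier-low   # switch to memory-tier-high for heavy workloads

# k8s/components/memory-tier-low/nodeselector.yaml (sketch) -- RFC 6902 JSON Patch
- op: add
  path: /spec/template/spec/nodeSelector
  value:
    openms.de/memory-tier: low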

https://claude.ai/code/session_01LW4iBWt5YftuqFGc3jM5ZP

Summary by CodeRabbit

Release Notes

  • New Features

    • Added memory tier configuration options (low/high) for deployment resource allocation and pod scheduling.
  • Documentation

    • Updated deployment guide to reflect streamlined configuration structure.
  • Chores

    • Simplified shared production deployment configuration; updated CI/CD validation workflow accordingly.
    • Applied default container resource limits and adjusted resource allocations for system components to improve resource management.

Factor node placement and memory sizing out of the base manifests into
reusable Kustomize components (memory-tier-low / memory-tier-high), so
each fork picks its tier with a single line in its overlay.

- base: remove per-pod `resources` from streamlit and rq-worker
  Deployments; sizing now comes from the tier component
- base: promote redis to Guaranteed QoS (requests == limits for both
  cpu and memory) so it sits at the bottom of the kernel's OOM-kill list
- base: add LimitRange so containers without explicit resources inherit
  safe defaults (512Mi/250m request, 2Gi/2 limit, 64Gi/16 max)
- components/memory-tier-low: nodeSelector=low, streamlit 512Mi/2Gi,
  rq-worker 1Gi/16Gi (Burstable)
- components/memory-tier-high: nodeSelector=high, streamlit 512Mi/4Gi,
  rq-worker 2Gi/180Gi (Burstable — uniform across heavy workers so a
  single active app can burst into the shared pool)
- overlays: rename template-app/ to prod/ (one overlay per repo; the
  repo itself identifies the app) and pull in memory-tier-low
- docs & skill: document the new overlays/prod/ path and the one-line
  tier selector; update CI to kustomize the renamed overlay

https://claude.ai/code/session_01LW4iBWt5YftuqFGc3jM5ZP
@coderabbitai

coderabbitai Bot commented Apr 24, 2026

Warning

Rate limit exceeded

@t0mdavid-m has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 39 minutes and 43 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 39 minutes and 43 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5ae586da-adb4-42a9-8e5f-456f92638417

📥 Commits

Reviewing files that changed from the base of the PR and between 0bd2ccf and 43c300b.

📒 Files selected for processing (2)
  • .github/kind-config.yaml
  • .github/workflows/build-and-test.yml
📝 Walkthrough

Switches from per-app to shared production overlay architecture, adds configurable memory-tier components for resource management, introduces cluster-wide LimitRange defaults, adjusts container resource specifications, and updates deployment documentation and CI workflows accordingly.

Changes

  • Documentation Updates (.claude/skills/configure-k8s-deployment.md, docs/kubernetes-deployment.md): Updated to reference the shared k8s/overlays/prod/ instead of per-app overlays, add a memory-tier selection step, remove template-overlay expectations, and document the new LimitRange and memory-tier components.
  • CI Workflow (.github/workflows/build-and-test.yml): Modified to target k8s/overlays/prod/ for manifest validation and deployment; adds a node labeling step with openms.de/memory-tier=low before the Kustomize apply in both kind-based jobs.
  • Base Kubernetes Resources (k8s/base/kustomization.yaml, k8s/base/limitrange.yaml, k8s/base/redis.yaml, k8s/base/rq-worker-deployment.yaml, k8s/base/streamlit-deployment.yaml): Added a LimitRange resource for container defaults/limits/max values; adjusted Redis requests and limits; removed explicit resource constraints from the rq-worker and streamlit Deployments.
  • Memory-Tier Low Component (k8s/components/memory-tier-low/kustomization.yaml, nodeselector.yaml, streamlit-resources.yaml, worker-resources.yaml): New Kustomize component that applies a nodeSelector patch for openms.de/memory-tier: low and resource specifications (streamlit: 512Mi/500m requests, 2Gi/4 limits; rq-worker: 256Mi/250m requests, 512Mi/500m limits).
  • Memory-Tier High Component (k8s/components/memory-tier-high/kustomization.yaml, nodeselector.yaml, streamlit-resources.yaml, worker-resources.yaml): New Kustomize component that applies a nodeSelector patch for openms.de/memory-tier: high and resource specifications with higher requests and limits for streamlit and rq-worker, for demanding workloads.
  • Production Overlay (k8s/overlays/prod/kustomization.yaml): Updated to include the memory-tier-low component by default, enabling tier selection by swapping the component reference.

Possibly related PRs

Poem

🐰 From templates scattered, one overlay shines,
Memory tiers dancing in configurable lines,
LimitRange wisdom keeps chaos at bay,
Resources now flexible—low tier or high—hooray! 🎉

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check: ✅ Passed (check skipped because CodeRabbit's high-level summary is enabled).
  • Title Check: ✅ Passed (the title 'Refactor K8s deployment to use memory-tier components' directly and clearly summarizes the main change: introducing Kustomize memory-tier components to the Kubernetes deployment architecture).
  • Docstring Coverage: ✅ Passed (no functions found in the changed files to evaluate; docstring coverage check skipped).
  • Linked Issues Check: ✅ Passed (check skipped because no linked issues were found for this pull request).
  • Out of Scope Changes Check: ✅ Passed (check skipped because no linked issues were found for this pull request).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

t0mdavid-m and others added 5 commits April 24, 2026 11:44
The memory-tier-low component adds nodeSelector
openms.de/memory-tier=low to every Deployment. kind clusters have no
such label, so after the rename to overlays/prod all pods stayed
Pending and 'Wait for Redis to be ready' timed out.

Label --all kind nodes in both the nginx and Traefik integration jobs
before deploying so the nodeSelector matches.

Also raise the LimitRange max.memory from 64Gi to 200Gi. The original
cap was written before memory-tier-high settled on a 180Gi rq-worker
limit; without the bump, a high-tier fork (e.g. OpenDIAKiosk) would be
rejected by admission when deployed into the shared openms namespace
after the template's LimitRange is applied.

https://claude.ai/code/session_01LW4iBWt5YftuqFGc3jM5ZP
Completes the overlay rename started in 6c61365 now that the branch
has merged main, which added the example file under the old path.

Also rewrite two remaining docs references to overlays/<your-app-name>/
and the CI description to the new prod overlay.

https://claude.ai/code/session_01LW4iBWt5YftuqFGc3jM5ZP
Spin up a 2-node kind cluster (control-plane labeled memory-tier=low
+ ingress-ready, worker labeled memory-tier=high) so the Build-and-Test
job passes regardless of which memory-tier component a fork's overlay
pulls in. Previously we labeled --all nodes with a single tier after
creation, which broke as soon as a fork flipped memory-tier-low to
memory-tier-high.

- .github/kind-config.yaml: 2-node topology with per-node labels.
- .github/workflows/build-and-test.yml: point both helm/kind-action
  invocations (nginx build + traefik-integration) at the config and
  drop the now-redundant dynamic label step.

https://claude.ai/code/session_01LW4iBWt5YftuqFGc3jM5ZP
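
A rough sketch of the 2-node topology this commit describes (the actual .github/kind-config.yaml may differ in detail):

# .github/kind-config.yaml (sketch of the 2-node kind cluster described above)
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    labels:
      openms.de/memory-tier: "low"
      ingress-ready: "true"
  - role: worker
    labels:
      openms.de/memory-tier: "high"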

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (4)
k8s/base/limitrange.yaml (1)

1-16: LGTM — sane guardrails; two small operational notes.

  • max.memory: 200Gi correctly leaves headroom for the 180 GiB high-tier rq-worker limit.
  • Be aware that the default limit (memory: 2Gi, cpu: 2) will be silently applied to any future container that lacks explicit limits — init containers, sidecars, new services. For something like a DB migration init-container or log shipper that legitimately needs more, this can produce surprising OOMKills. Not a blocker for this PR; worth a short comment in the file or a line in docs/kubernetes-deployment.md so fork maintainers know to set explicit limits for any new workload.
  • Consider setting a maxLimitRequestRatio in a follow-up to prevent accidental over-commit (e.g., a request of 512Mi with a 200Gi limit passing validation).
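
For illustration, the suggested follow-up could look roughly like this (values are placeholders, not part of this PR; a strict ratio would also constrain the intentionally bursty high-tier worker, so the value needs care):

# Possible follow-up inside the LimitRange: cap how far a limit may exceed its request
spec:
  limits:
    - type: Container
      maxLimitRequestRatio:
        memory: "8"   # e.g. a 512Mi request could then carry at most a 4Gi limit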
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/base/limitrange.yaml` around lines 1 - 16, Add an explanatory comment to
the LimitRange resource (metadata.name: default-container-limits, kind:
LimitRange) noting that the provided default (spec.limits[*].default memory:
"2Gi", cpu: "2") will be silently applied to any container without explicit
limits (including init containers and sidecars) and can cause OOMKills for
workloads that legitimately need more; also add a short note in
docs/kubernetes-deployment.md advising maintainers to set explicit limits for
special-case containers and to validate new workloads, and consider a follow-up
change to set spec.limits[*].maxLimitRequestRatio to prevent accidental
over-commit.
k8s/components/memory-tier-low/nodeselector.yaml (1)

1-4: RFC 6902 add semantics: patch will replace existing nodeSelector.

The add operation at /spec/template/spec/nodeSelector follows RFC 6902 semantics, which performs a replace rather than merge on existing paths. If a base Deployment ever includes a nodeSelector, this patch will silently overwrite it. Base manifests currently don't define nodeSelector, so this is safe today — flagging as a gotcha for fork maintainers. For stronger robustness, consider a strategic-merge patch so selectors merge instead of replace.
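
A strategic-merge variant of the same patch could look like this (sketch; one such patch per targeted Deployment, e.g. streamlit and rq-worker):

# Strategic-merge alternative: merges into an existing nodeSelector instead of replacing it
apiVersion: apps/v1
kind: Deployment
metadata:
  name: streamlit   # one patch file per targeted Deployment
spec:
  template:
    spec:
      nodeSelector:
        openms.de/memory-tier: low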

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/components/memory-tier-low/nodeselector.yaml` around lines 1 - 4, The
patch uses an RFC6902 "add" at /spec/template/spec/nodeSelector which will
replace any existing nodeSelector instead of merging; change this to a
strategic-merge style patch so the nodeSelector map merges with any existing
selectors rather than overwriting them. Replace the JSON-patch add with a
strategic-merge patch (or a kustomize patchStrategicMerge) that sets
nodeSelector: { "openms.de/memory-tier": "low" } under spec.template.spec so
existing keys are preserved and only the new key is added/updated.
.github/workflows/build-and-test.yml (1)

117-118: Tier label is hardcoded; forks switching to memory-tier-high will silently break CI.

Both deployment jobs label all kind nodes with openms.de/memory-tier=low. If a fork flips k8s/overlays/prod/kustomization.yaml to memory-tier-high, pods get a nodeSelector: openms.de/memory-tier=high, no node matches, and the kubectl wait steps just time out — the CI failure won't point at the label.

Consider deriving the tier from the overlay (or labelling each node with both tiers in CI, since there's only one node per kind cluster) so the workflow keeps working under either component selection.

♻️ Option: label the single kind node with both tiers so CI is agnostic
-      - name: Label kind node with the tier the overlay expects
-        run: kubectl label nodes --all openms.de/memory-tier=low --overwrite
+      - name: Label kind nodes with both memory tiers (overlay-agnostic)
+        run: |
+          kubectl label nodes --all openms.de/memory-tier=low --overwrite
+          # Also apply high so either component in the prod overlay schedules.
+          # Note: a nodeSelector picks one value; labeling with both is safe
+          # because only the selected component's patch is applied.
+          kubectl label nodes --all openms.de/memory-tier- --overwrite 2>/dev/null || true

Note: a node can only carry one value for openms.de/memory-tier at a time, so a cleaner fix is to parse the component from k8s/overlays/prod/kustomization.yaml and label accordingly, e.g.:

TIER=$(grep -oE 'memory-tier-(low|high)' k8s/overlays/prod/kustomization.yaml | head -n1 | sed 's/memory-tier-//')
kubectl label nodes --all "openms.de/memory-tier=${TIER}" --overwrite
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/build-and-test.yml around lines 117 - 118, The workflow
currently hardcodes the node label via the kubectl label command (kubectl label
nodes --all openms.de/memory-tier=low --overwrite), which breaks when an overlay
uses memory-tier-high; update the job to derive the tier from
k8s/overlays/prod/kustomization.yaml (e.g., grep/parsing to extract
memory-tier-(low|high) into a TIER variable) and then call kubectl label nodes
--all "openms.de/memory-tier=${TIER}" --overwrite so the CI labels match the
selected overlay; alternatively, if you prefer a simpler change, label the
single kind node with the appropriate tier value dynamically rather than leaving
it hardcoded.
k8s/components/memory-tier-high/worker-resources.yaml (1)

10-16: Consider raising the memory request and dropping the CPU limit.

Two design issues worth reconsidering on the high-tier worker:

  1. Memory request (2Gi) vs limit (180Gi) gives a 90× burst range. The scheduler only reserves 2Gi, so another workload could legally land on the same node despite the worker's legitimate need for ~180Gi. Even with nodeSelector: openms.de/memory-tier=high (presumably dedicating the node), set requests closer to steady-state footprint (e.g., 8–32Gi) to prevent overcommit.

  2. CPU limit of 20 cores is likely unnecessary. RQ workers are bursty batch jobs; a hard CPU limit only causes CFS throttling without clear benefit. Dropping the CPU limit (keeping the request for scheduling) is the standard pattern for background workers.

The LimitRange max.memory: 200Gi will admit the 180Gi limit without issue.
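
If both suggestions were applied, the high-tier worker resources might end up roughly like this (illustrative; the 16Gi request is the reviewer's example range, not a measured steady-state figure):

# Sketch of the suggested adjustment for the high-tier rq-worker
resources:
  requests:
    memory: "16Gi"   # reserve closer to steady-state footprint
    cpu: "2"
  limits:
    memory: "180Gi"  # unchanged burst ceiling
    # no cpu limit, to avoid CFS throttling of bursty batch work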

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/components/memory-tier-high/worker-resources.yaml` around lines 10 - 16,
The resources block currently sets memory request to "2Gi" and memory limit to
"180Gi" with a cpu limit of "20", which under-reserves memory and unnecessarily
caps CPU; update the resources for the high-tier worker by increasing
requests.memory to a steady-state value in the 8Gi–32Gi range (e.g., 16Gi) so
the scheduler reserves sufficient RAM, keep or adjust requests.cpu as needed
(requests.cpu currently "2"), and remove the limits.cpu entry entirely (drop the
hard CPU limit) so the pod is not CFS-throttled; adjust the resources.requests
and resources.limits keys accordingly and ensure nodeSelector:
openms.de/memory-tier=high remains if you rely on node isolation.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3206d92b-d0b2-41e1-824b-5ab783103152

📥 Commits

Reviewing files that changed from the base of the PR and between 971cfdd and 0bd2ccf.

📒 Files selected for processing (18)
  • .claude/skills/configure-k8s-deployment.md
  • .github/workflows/build-and-test.yml
  • docs/kubernetes-deployment.md
  • k8s/base/kustomization.yaml
  • k8s/base/limitrange.yaml
  • k8s/base/redis.yaml
  • k8s/base/rq-worker-deployment.yaml
  • k8s/base/streamlit-deployment.yaml
  • k8s/components/memory-tier-high/kustomization.yaml
  • k8s/components/memory-tier-high/nodeselector.yaml
  • k8s/components/memory-tier-high/streamlit-resources.yaml
  • k8s/components/memory-tier-high/worker-resources.yaml
  • k8s/components/memory-tier-low/kustomization.yaml
  • k8s/components/memory-tier-low/nodeselector.yaml
  • k8s/components/memory-tier-low/streamlit-resources.yaml
  • k8s/components/memory-tier-low/worker-resources.yaml
  • k8s/overlays/prod/kustomization.yaml
  • k8s/overlays/prod/streamlit-secrets.yaml.example
💤 Files with no reviewable changes (2)
  • k8s/base/streamlit-deployment.yaml
  • k8s/base/rq-worker-deployment.yaml

Comment thread k8s/base/redis.yaml
Comment on lines 22 to +28
           resources:
             requests:
-              memory: "64Mi"
-              cpu: "50m"
+              memory: "256Mi"
+              cpu: "100m"
             limits:
               memory: "256Mi"
-              cpu: "250m"
+              cpu: "100m"

⚠️ Potential issue | 🟡 Minor

Reconsider dropping Redis CPU limit from 250m → 100m.

Setting requests == limits for both CPU and memory correctly promotes Redis to Guaranteed QoS, which is good. However, capping CPU at 100m is quite aggressive: Redis is largely single-threaded but can burst to a full core during AOF/RDB rewrites, expiration sweeps, large MULTI/EVAL executions, or client reconnect storms. Under throttling, these bursts will translate into tail-latency spikes that affect rq-worker enqueue/dequeue throughput across the whole app.

Unless you've observed Redis consistently consuming <100m under load, consider keeping the limit at 250m (and matching the request to preserve Guaranteed QoS):

♻️ Suggested adjustment
           resources:
             requests:
               memory: "256Mi"
-              cpu: "100m"
+              cpu: "250m"
             limits:
               memory: "256Mi"
-              cpu: "100m"
+              cpu: "250m"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (the committable replacement for the lines above):
           resources:
             requests:
               memory: "256Mi"
               cpu: "250m"
             limits:
               memory: "256Mi"
               cpu: "250m"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/base/redis.yaml` around lines 22 - 28, The Redis container resources
currently set both requests and limits for cpu to "100m" (under the resources
block) are likely too low and can cause CPU throttling; update the cpu request
and cpu limit values in the resources: requests/limits section for the Redis
container to "250m" (keeping requests == limits to preserve Guaranteed QoS) so
bursts (AOF/RDB rewrites, expirations, MULTI/EVAL) are not throttled.

Comment on lines +10 to +16
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "4Gi"
              cpu: "4"

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find all streamlit resource files in memory-tier directories
find . -type f -name "streamlit-resources.yaml" | head -20

Repository: OpenMS/streamlit-template

Length of output: 186


🏁 Script executed:

# Find rq-worker resource files to compare scaling
find . -type f -name "*rq-worker*" -o -name "*rq_worker*" | head -20

Repository: OpenMS/streamlit-template

Length of output: 106


🏁 Script executed:

# Check the directory structure of k8s/components
ls -la k8s/components/ 2>/dev/null | head -30

Repository: OpenMS/streamlit-template

Length of output: 307


🏁 Script executed:

# Read the low-tier streamlit resources
cat -n k8s/components/memory-tier-low/streamlit-resources.yaml

Repository: OpenMS/streamlit-template

Length of output: 488


🏁 Script executed:

# Read the high-tier streamlit resources (from review)
cat -n k8s/components/memory-tier-high/streamlit-resources.yaml

Repository: OpenMS/streamlit-template

Length of output: 488


🏁 Script executed:

# Read the rq-worker deployment to check resource scaling
cat -n k8s/base/rq-worker-deployment.yaml

Repository: OpenMS/streamlit-template

Length of output: 2240


🏁 Script executed:

# Look for kustomization files that might override resources per tier
find ./k8s -type f -name "kustomization.yaml" -o -name "kustomization.yml"

Repository: OpenMS/streamlit-template

Length of output: 243


🏁 Script executed:

# Check if there are overlays or patches for rq-worker in memory-tier directories
find ./k8s/components/memory-tier-high -type f | head -20

Repository: OpenMS/streamlit-template

Length of output: 289


🏁 Script executed:

# Search for any references to 180Gi in the codebase
rg "180" --type yaml --type md

Repository: OpenMS/streamlit-template

Length of output: 302


🏁 Script executed:

# Read the high-tier worker resources
cat -n k8s/components/memory-tier-high/worker-resources.yaml

Repository: OpenMS/streamlit-template

Length of output: 486


🏁 Script executed:

# Check if low-tier has worker resources too
find ./k8s/components/memory-tier-low -type f

Repository: OpenMS/streamlit-template

Length of output: 285


🏁 Script executed:

# Read the high-tier kustomization to understand how patches are applied
cat -n k8s/components/memory-tier-high/kustomization.yaml

Repository: OpenMS/streamlit-template

Length of output: 489


🏁 Script executed:

# Read the low-tier worker resources to see the baseline
cat -n k8s/components/memory-tier-low/worker-resources.yaml

Repository: OpenMS/streamlit-template

Length of output: 487


🏁 Script executed:

# Search for any README or documentation explaining the memory tier design
find ./k8s -type f \( -name "README*" -o -name "*.md" \) | xargs grep -l "memory-tier" 2>/dev/null || echo "No matches found"

Repository: OpenMS/streamlit-template

Length of output: 86


Document the asymmetric scaling design for memory tiers.

The 2× memory increase for streamlit (2Gi → 4Gi) is correct, but the context reveals a significant asymmetry: rq-worker scales 11× (16Gi → 180Gi) while streamlit barely scales. For a component labeled "memory-tier-high," this design choice will likely confuse users — they may expect streamlit to also support memory-intensive workloads when switching tiers.

Add a comment to both streamlit-resources.yaml and worker-resources.yaml explaining that streamlit stays lightweight while rq-worker handles the compute scaling in high-tier deployments. This sets correct expectations and prevents users from expecting streamlit to accommodate large in-memory dataframes under higher tiers.
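
One way the suggested comment could read, placed directly above the resources block in both tier files (wording illustrative; the values shown are the high-tier streamlit ones from this patch):

# NOTE: the high memory tier is intentionally asymmetric. streamlit stays
# lightweight (limit grows only 2Gi -> 4Gi) and is not meant to hold large
# in-memory workloads; rq-worker is the component that scales (up to 180Gi)
# for memory-intensive jobs.
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "4Gi"
    cpu: "4"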

🧰 Tools
🪛 Checkov (3.2.524)

[medium] 1-16: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 1-16: Minimize the admission of root containers

(CKV_K8S_23)

🪛 Trivy (0.69.3)

[error] 9-16: Root file system is not read-only

Container 'streamlit' of Deployment 'streamlit' should set 'securityContext.readOnlyRootFilesystem' to true

Rule: KSV-0014

Learn more

(IaC/Kubernetes)


[error] 9-16: Default security context configured

container streamlit in default namespace is using the default security context

Rule: KSV-0118

Learn more

(IaC/Kubernetes)


[error] 7-16: Default security context configured

deployment streamlit in default namespace is using the default security context, which allows root privileges

Rule: KSV-0118

Learn more

(IaC/Kubernetes)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/components/memory-tier-high/streamlit-resources.yaml` around lines 10 -
16, Add a short explanatory comment in both streamlit-resources.yaml and
worker-resources.yaml near the resources block (the "resources: requests:
memory/cpu limits: memory/cpu" section) stating that the high memory tier is
intentionally asymmetric: streamlit remains lightweight (only increases from
2Gi→4Gi) and is not intended for large in-memory workloads, while rq-worker is
the component that scales to handle heavy memory/compute (e.g., 16Gi→180Gi);
place the comment directly above the resources block in each file and mention
the rationale so users won't expect streamlit to serve large in-memory
dataframes.

Comment on lines +7 to +9
components:
- ../../components/memory-tier-low


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect the component kustomization files to see how the nodeselector patch is targeted.
fd -a 'kustomization.yaml' k8s/components
echo '--- low ---'
cat k8s/components/memory-tier-low/kustomization.yaml 2>/dev/null
echo '--- high ---'
cat k8s/components/memory-tier-high/kustomization.yaml 2>/dev/null

Repository: OpenMS/streamlit-template

Length of output: 863


🏁 Script executed:

#!/bin/bash
# Check if Redis is a Deployment and inspect the nodeselector patch
echo "=== Redis definition ==="
cat k8s/base/redis.yaml 2>/dev/null | head -20

echo ""
echo "=== Nodeselector patch content (low) ==="
cat k8s/components/memory-tier-low/nodeselector.yaml 2>/dev/null

echo ""
echo "=== Base directory structure ==="
ls -la k8s/base/ 2>/dev/null | grep -E '\.yaml$'

Repository: OpenMS/streamlit-template

Length of output: 1407


Node-selector patch will apply to Redis Deployment unintentionally.

The nodeselector.yaml patch in both memory-tier-low and memory-tier-high components targets kind: Deployment without filtering by name, so it will inject openms.de/memory-tier=low (or high) onto all Deployments in the base—including the Redis Deployment at k8s/base/redis.yaml.

Downstream effects:

  • Redis becomes unschedulable on any cluster where only app-workload nodes carry the memory-tier label.
  • Forks that switch to memory-tier-high will force Redis onto expensive high-memory nodes, despite Redis only requesting 256 Mi.
  • CI testing masks this because .github/workflows/build-and-test.yml labels all nodes with openms.de/memory-tier=low.

Fix: Narrow the nodeselector.yaml patch target in each component's kustomization.yaml to the two app Deployments only (e.g., via target.name: streamlit|rq-worker, or split into per-Deployment patches), so Redis is unaffected.
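
A sketch of the narrower targeting in each component's kustomization.yaml (exact syntax depends on the kustomize version; the Deployment names streamlit and rq-worker are those referenced above):

# k8s/components/memory-tier-low/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
patches:
  - path: nodeselector.yaml
    target:
      kind: Deployment
      name: "streamlit|rq-worker"   # regex match; leaves the redis Deployment untouched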

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/overlays/prod/kustomization.yaml` around lines 7 - 9, The nodeselector
patch in the memory-tier components is too broad and will match all Deployments
(including the Redis Deployment); update each component's kustomization.yaml to
restrict the nodeselector.yaml patch so it only targets the app Deployments
(e.g., set patchStrategicMerge or patches with a target block referencing the
patch file nodeselector.yaml and add target.kind: Deployment plus target.name:
streamlit|rq-worker or create separate per-deployment patches with target.name:
streamlit and target.name: rq-worker), ensuring the Redis Deployment (defined in
the base as redis.yaml) is not modified by these components.

Previous run (2f28ed9) showed build + traefik-integration jobs still
timing out on 'Wait for Redis'. Root cause: multi-node kind clusters
apply node-role.kubernetes.io/control-plane:NoSchedule to the
control-plane, which untolerated app pods can't land on even though
the nodeSelector matches. The single-node kind used previously had
no such taint, which is why CI worked until we added a second node.

Add a kubeadmConfigPatches stanza setting nodeRegistration.taints to
the empty list so the control-plane is schedulable. Labels and
cluster shape (1 control-plane + 1 worker) stay the same.

https://claude.ai/code/session_01LW4iBWt5YftuqFGc3jM5ZP
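
The taint removal described above could be expressed roughly like this in the kind config (sketch; it extends the control-plane node entry sketched earlier, and the exact placement may differ):

# .github/kind-config.yaml (sketch): register the control-plane without taints
nodes:
  - role: control-plane
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          taints: []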
@t0mdavid-m t0mdavid-m merged commit 64f43e2 into main Apr 24, 2026
9 checks passed