Refactor K8s deployment to use memory-tier components (#375)
Conversation
Factor node placement and memory sizing out of the base manifests into reusable Kustomize components (`memory-tier-low` / `memory-tier-high`), so each fork picks its tier with a single line in its overlay.

- base: remove per-pod `resources` from streamlit and rq-worker Deployments; sizing now comes from the tier component
- base: promote redis to Guaranteed QoS (requests == limits for both cpu and memory) so it bottoms the kernel OOM list
- base: add LimitRange so containers without explicit resources inherit safe defaults (512Mi/250m request, 2Gi/2 limit, 64Gi/16 max)
- components/memory-tier-low: nodeSelector=low, streamlit 512Mi/2Gi, rq-worker 1Gi/16Gi (Burstable)
- components/memory-tier-high: nodeSelector=high, streamlit 512Mi/4Gi, rq-worker 2Gi/180Gi (Burstable — uniform across heavy workers so a single active app can burst into the shared pool)
- overlays: rename template-app/ to prod/ (one overlay per repo; the repo itself identifies the app) and pull in memory-tier-low
- docs & skill: document the new overlays/prod/ path and the one-line tier selector; update CI to kustomize the renamed overlay

https://claude.ai/code/session_01LW4iBWt5YftuqFGc3jM5ZP
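Based on the description above, a fork's overlay then reduces to something like the following sketch; only the `components:` entry is confirmed by this PR, the `resources:` path is assumed:

```yaml
# k8s/overlays/prod/kustomization.yaml (illustrative sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                        # assumed base path
components:
  - ../../components/memory-tier-low  # the one-line tier selector;
                                      # switch to memory-tier-high for heavy workloads
```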
📝 Walkthrough

Switches from per-app to shared production overlay architecture, adds configurable memory-tier components for resource management, introduces cluster-wide LimitRange defaults, adjusts container resource specifications, and updates deployment documentation and CI workflows accordingly.
🚥 Pre-merge checks: ✅ 5 passed
The memory-tier-low component adds nodeSelector `openms.de/memory-tier=low` to every Deployment. kind clusters have no such label, so after the rename to overlays/prod all pods stayed Pending and 'Wait for Redis to be ready' timed out. Label all kind nodes (`--all`) in both the nginx and Traefik integration jobs before deploying so the nodeSelector matches.

Also raise the LimitRange max.memory from 64Gi to 200Gi. The original cap was written before memory-tier-high settled on a 180Gi rq-worker limit; without the bump, a high-tier fork (e.g. OpenDIAKiosk) would be rejected by admission when deployed into the shared openms namespace after the template's LimitRange is applied.

https://claude.ai/code/session_01LW4iBWt5YftuqFGc3jM5ZP
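Putting the commit's numbers together, the resulting LimitRange would look roughly like this (a sketch assembled from the values quoted in this PR, not copied from the repo):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container sets no requests
        memory: "512Mi"
        cpu: "250m"
      default:               # applied when a container sets no limits
        memory: "2Gi"
        cpu: "2"
      max:
        memory: "200Gi"      # raised from 64Gi to admit the 180Gi high-tier worker limit
        cpu: "16"
```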
…tp://127.0.0.1:34609/git/OpenMS/streamlit-template into claude/parallel-webapp-memory-optimization-RoNnJ
Completes the overlay rename started in 6c61365 now that the branch has merged main, which added the example file under the old path. Also rewrites two remaining docs references to overlays/<your-app-name>/ and the CI description to the new prod overlay. https://claude.ai/code/session_01LW4iBWt5YftuqFGc3jM5ZP
Spin up a 2-node kind cluster (control-plane labeled memory-tier=low + ingress-ready, worker labeled memory-tier=high) so the Build-and-Test job passes regardless of which memory-tier component a fork's overlay pulls in. Previously we labeled `--all` nodes with a single tier after creation, which broke as soon as a fork flipped memory-tier-low to memory-tier-high.

- .github/kind-config.yaml: 2-node topology with per-node labels.
- .github/workflows/build-and-test.yml: point both helm/kind-action invocations (nginx build + traefik-integration) at the config and drop the now-redundant dynamic label step.

https://claude.ai/code/session_01LW4iBWt5YftuqFGc3jM5ZP
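Under the description above, the kind config could be sketched as follows; the exact file contents aren't reproduced in this thread, and per-node `labels` require a reasonably recent kind release:

```yaml
# .github/kind-config.yaml (illustrative sketch)
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    labels:
      openms.de/memory-tier: low
      ingress-ready: "true"     # lets the ingress controller schedule here
  - role: worker
    labels:
      openms.de/memory-tier: high
```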
Actionable comments posted: 3
🧹 Nitpick comments (4)
k8s/base/limitrange.yaml (1)
1-16: LGTM — sane guardrails; two small operational notes.
- `max.memory: 200Gi` correctly leaves headroom for the 180 GiB high-tier rq-worker limit.
- Be aware that the `default` limit (`memory: 2Gi`, `cpu: 2`) will be silently applied to any future container that lacks explicit limits — init containers, sidecars, new services. For something like a DB-migration init container or log shipper that legitimately needs more, this can produce surprising OOMKills. Not a blocker for this PR; worth a short comment in the file or a line in `docs/kubernetes-deployment.md` so fork maintainers know to set explicit limits for any new workload.
- Consider setting a `maxLimitRequestRatio` in a follow-up to prevent accidental over-commit (e.g., a request of 512Mi with a 200Gi limit passing validation).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `k8s/base/limitrange.yaml` around lines 1-16: Add an explanatory comment to the LimitRange resource (metadata.name: default-container-limits, kind: LimitRange) noting that the provided default (memory: "2Gi", cpu: "2") will be silently applied to any container without explicit limits (including init containers and sidecars) and can cause OOMKills for workloads that legitimately need more; also add a short note in docs/kubernetes-deployment.md advising maintainers to set explicit limits for special-case containers and to validate new workloads, and consider a follow-up change to set maxLimitRequestRatio to prevent accidental over-commit.

k8s/components/memory-tier-low/nodeselector.yaml (1)
1-4: RFC 6902 `add` semantics: patch will replace existing nodeSelector.

The `add` operation at `/spec/template/spec/nodeSelector` follows RFC 6902 semantics, which performs a replace rather than a merge on existing paths. If a base Deployment ever includes a `nodeSelector`, this patch will silently overwrite it. Base manifests currently don't define `nodeSelector`, so this is safe today — flagging as a gotcha for fork maintainers. For stronger robustness, consider a strategic-merge patch so selectors merge instead of replace.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `k8s/components/memory-tier-low/nodeselector.yaml` around lines 1-4: The patch uses an RFC 6902 "add" at /spec/template/spec/nodeSelector which will replace any existing nodeSelector instead of merging; change this to a strategic-merge patch that sets nodeSelector: { "openms.de/memory-tier": "low" } under spec.template.spec so existing keys are preserved and only the new key is added/updated.

.github/workflows/build-and-test.yml (1)
117-118: Tier label is hardcoded; forks switching to `memory-tier-high` will silently break CI.

Both deployment jobs label all kind nodes with `openms.de/memory-tier=low`. If a fork flips `k8s/overlays/prod/kustomization.yaml` to `memory-tier-high`, pods get a `nodeSelector: openms.de/memory-tier=high`, no node matches, and the `kubectl wait` steps just time out — the CI failure won't point at the label.

Consider deriving the tier from the overlay (or labelling each node with both tiers in CI, since there's only one node per kind cluster) so the workflow keeps working under either component selection.
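The overlay-derived approach can be made concrete as a self-contained sketch; the heredoc stands in for the real `k8s/overlays/prod/kustomization.yaml`, and the final `kubectl` step is shown only as a comment since it needs a live cluster:

```shell
# Stand-in for k8s/overlays/prod/kustomization.yaml (hypothetical content).
cat > /tmp/kustomization.yaml <<'EOF'
components:
  - ../../components/memory-tier-low
EOF

# Extract "low" or "high" from whichever memory-tier component the overlay pulls in.
TIER=$(grep -oE 'memory-tier-(low|high)' /tmp/kustomization.yaml | head -n1 | sed 's/memory-tier-//')
echo "resolved tier: $TIER"

# CI would then run:
#   kubectl label nodes --all "openms.de/memory-tier=${TIER}" --overwrite
```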
♻️ Option: label the single kind node with both tiers so CI is agnostic

```diff
-      - name: Label kind node with the tier the overlay expects
-        run: kubectl label nodes --all openms.de/memory-tier=low --overwrite
+      - name: Label kind nodes with both memory tiers (overlay-agnostic)
+        run: |
+          kubectl label nodes --all openms.de/memory-tier=low --overwrite
+          # Also apply high so either component in the prod overlay schedules.
+          # Note: a nodeSelector picks one value; labeling with both is safe
+          # because only the selected component's patch is applied.
+          kubectl label nodes --all openms.de/memory-tier- --overwrite 2>/dev/null || true
```

Note: a node can only carry one value for `openms.de/memory-tier` at a time, so a cleaner fix is to parse the component from `k8s/overlays/prod/kustomization.yaml` and label accordingly, e.g.:

```shell
TIER=$(grep -oE 'memory-tier-(low|high)' k8s/overlays/prod/kustomization.yaml | head -n1 | sed 's/memory-tier-//')
kubectl label nodes --all "openms.de/memory-tier=${TIER}" --overwrite
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `.github/workflows/build-and-test.yml` around lines 117-118: The workflow currently hardcodes the node label via `kubectl label nodes --all openms.de/memory-tier=low --overwrite`, which breaks when an overlay uses memory-tier-high; update the job to derive the tier from k8s/overlays/prod/kustomization.yaml (e.g., grep/parse to extract memory-tier-(low|high) into a TIER variable) and then call `kubectl label nodes --all "openms.de/memory-tier=${TIER}" --overwrite` so the CI labels match the selected overlay; alternatively, label the single kind node with the appropriate tier value dynamically rather than leaving it hardcoded.

k8s/components/memory-tier-high/worker-resources.yaml (1)
10-16: Consider raising the memory request and dropping the CPU limit.

Two design issues worth reconsidering on the high-tier worker:

- Memory request (2Gi) vs limit (180Gi) gives a 90× burst range. The scheduler only reserves 2Gi, so another workload could legally land on the same node despite the worker's legitimate need for ~180Gi. Even with `nodeSelector: openms.de/memory-tier=high` (presumably dedicating the node), set requests closer to the steady-state footprint (e.g., 8–32Gi) to prevent overcommit.
- CPU limit of 20 cores is likely unnecessary. RQ workers are bursty batch jobs; a hard CPU limit only causes CFS throttling without clear benefit. Dropping the CPU limit (keeping the request for scheduling) is the standard pattern for background workers.

The LimitRange `max.memory: 200Gi` will admit the 180Gi limit without issue.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `k8s/components/memory-tier-high/worker-resources.yaml` around lines 10-16: The resources block currently sets memory request to "2Gi" and memory limit to "180Gi" with a cpu limit of "20", which under-reserves memory and unnecessarily caps CPU; increase requests.memory to a steady-state value in the 8Gi–32Gi range (e.g., 16Gi) so the scheduler reserves sufficient RAM, keep or adjust requests.cpu (currently "2") as needed, and remove the limits.cpu entry entirely (drop the hard CPU limit) so the pod is not CFS-throttled; adjust resources.requests and resources.limits accordingly and ensure nodeSelector openms.de/memory-tier=high remains if you rely on node isolation.
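Adopting the suggestion above, the high-tier worker's resources block might end up as the following sketch; 16Gi is one illustrative point in the suggested 8–32Gi range, and the surrounding Kustomize patch boilerplate is omitted:

```yaml
# sketch of the suggested high-tier rq-worker resources (values illustrative)
resources:
  requests:
    memory: "16Gi"   # reserve steady-state footprint instead of 2Gi
    cpu: "2"
  limits:
    memory: "180Gi"  # burst ceiling; admitted by the 200Gi LimitRange max
    # no cpu limit: avoids CFS throttling of bursty RQ batch jobs
```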
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3206d92b-d0b2-41e1-824b-5ab783103152
📒 Files selected for processing (18)
- .claude/skills/configure-k8s-deployment.md
- .github/workflows/build-and-test.yml
- docs/kubernetes-deployment.md
- k8s/base/kustomization.yaml
- k8s/base/limitrange.yaml
- k8s/base/redis.yaml
- k8s/base/rq-worker-deployment.yaml
- k8s/base/streamlit-deployment.yaml
- k8s/components/memory-tier-high/kustomization.yaml
- k8s/components/memory-tier-high/nodeselector.yaml
- k8s/components/memory-tier-high/streamlit-resources.yaml
- k8s/components/memory-tier-high/worker-resources.yaml
- k8s/components/memory-tier-low/kustomization.yaml
- k8s/components/memory-tier-low/nodeselector.yaml
- k8s/components/memory-tier-low/streamlit-resources.yaml
- k8s/components/memory-tier-low/worker-resources.yaml
- k8s/overlays/prod/kustomization.yaml
- k8s/overlays/prod/streamlit-secrets.yaml.example
💤 Files with no reviewable changes (2)
- k8s/base/streamlit-deployment.yaml
- k8s/base/rq-worker-deployment.yaml
```diff
         resources:
           requests:
-            memory: "64Mi"
-            cpu: "50m"
+            memory: "256Mi"
+            cpu: "100m"
           limits:
             memory: "256Mi"
-            cpu: "250m"
+            cpu: "100m"
```
Reconsider dropping Redis CPU limit from 250m → 100m.
Setting requests == limits for both CPU and memory correctly promotes Redis to Guaranteed QoS, which is good. However, capping CPU at 100m is quite aggressive: Redis is largely single-threaded but can burst to a full core during AOF/RDB rewrites, expiration sweeps, large MULTI/EVAL executions, or client reconnect storms. Under throttling, these bursts will translate into tail-latency spikes that affect rq-worker enqueue/dequeue throughput across the whole app.
Unless you've observed Redis consistently consuming <100m under load, consider keeping the limit at 250m (and matching the request to preserve Guaranteed QoS):
♻️ Suggested adjustment
```diff
         resources:
           requests:
             memory: "256Mi"
-            cpu: "100m"
+            cpu: "250m"
           limits:
             memory: "256Mi"
-            cpu: "100m"
+            cpu: "250m"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "256Mi"
            cpu: "250m"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@k8s/base/redis.yaml` around lines 22 - 28, The Redis container resources
currently set both requests and limits for cpu to "100m" (under the resources
block) are likely too low and can cause CPU throttling; update the cpu request
and cpu limit values in the resources: requests/limits section for the Redis
container to "250m" (keeping requests == limits to preserve Guaranteed QoS) so
bursts (AOF/RDB rewrites, expirations, MULTI/EVAL) are not throttled.
```yaml
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "4Gi"
            cpu: "4"
```
🧩 Analysis chain

🏁 Scripts executed against repository OpenMS/streamlit-template (outputs omitted):

```shell
# Find all streamlit resource files in memory-tier directories
find . -type f -name "streamlit-resources.yaml" | head -20
# Find rq-worker resource files to compare scaling
find . -type f -name "*rq-worker*" -o -name "*rq_worker*" | head -20
# Check the directory structure of k8s/components
ls -la k8s/components/ 2>/dev/null | head -30
# Read the low-tier streamlit resources
cat -n k8s/components/memory-tier-low/streamlit-resources.yaml
# Read the high-tier streamlit resources (from review)
cat -n k8s/components/memory-tier-high/streamlit-resources.yaml
# Read the rq-worker deployment to check resource scaling
cat -n k8s/base/rq-worker-deployment.yaml
# Look for kustomization files that might override resources per tier
find ./k8s -type f -name "kustomization.yaml" -o -name "kustomization.yml"
# Check if there are overlays or patches for rq-worker in memory-tier directories
find ./k8s/components/memory-tier-high -type f | head -20
# Search for any references to 180Gi in the codebase
rg "180" --type yaml --type md
# Read the high-tier worker resources
cat -n k8s/components/memory-tier-high/worker-resources.yaml
# Check if low-tier has worker resources too
find ./k8s/components/memory-tier-low -type f
# Read the high-tier kustomization to understand how patches are applied
cat -n k8s/components/memory-tier-high/kustomization.yaml
# Read the low-tier worker resources to see the baseline
cat -n k8s/components/memory-tier-low/worker-resources.yaml
# Search for any README or documentation explaining the memory tier design
find ./k8s -type f \( -name "README*" -o -name "*.md" \) | xargs grep -l "memory-tier" 2>/dev/null || echo "No matches found"
```
Document the asymmetric scaling design for memory tiers.
The 2× memory increase for streamlit (2Gi → 4Gi) is correct, but the context reveals a significant asymmetry: rq-worker scales 11× (16Gi → 180Gi) while streamlit barely scales. For a component labeled "memory-tier-high," this design choice will likely confuse users — they may expect streamlit to also support memory-intensive workloads when switching tiers.
Add a comment to both streamlit-resources.yaml and worker-resources.yaml explaining that streamlit stays lightweight while rq-worker handles the compute scaling in high-tier deployments. This sets correct expectations and prevents users from expecting streamlit to accommodate large in-memory dataframes under higher tiers.
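One way to implement the requested comment, sketched against the resource values quoted in this review:

```yaml
# k8s/components/memory-tier-high/streamlit-resources.yaml (sketch)
# NOTE: the high tier is intentionally asymmetric. streamlit stays lightweight
# (limit only 2Gi -> 4Gi) and is not meant for large in-memory workloads;
# rq-worker is the component that scales (16Gi -> 180Gi) for heavy memory/compute.
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "4Gi"
    cpu: "4"
```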
🧰 Tools
🪛 Checkov (3.2.524)
[medium] 1-16: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 1-16: Minimize the admission of root containers
(CKV_K8S_23)
🪛 Trivy (0.69.3)
[error] 9-16: Root file system is not read-only
Container 'streamlit' of Deployment 'streamlit' should set 'securityContext.readOnlyRootFilesystem' to true
Rule: KSV-0014
(IaC/Kubernetes)
[error] 9-16: Default security context configured
container streamlit in default namespace is using the default security context
Rule: KSV-0118
(IaC/Kubernetes)
[error] 7-16: Default security context configured
deployment streamlit in default namespace is using the default security context, which allows root privileges
Rule: KSV-0118
(IaC/Kubernetes)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@k8s/components/memory-tier-high/streamlit-resources.yaml` around lines 10 -
16, Add a short explanatory comment in both streamlit-resources.yaml and
worker-resources.yaml near the resources block (the "resources: requests:
memory/cpu limits: memory/cpu" section) stating that the high memory tier is
intentionally asymmetric: streamlit remains lightweight (only increases from
2Gi→4Gi) and is not intended for large in-memory workloads, while rq-worker is
the component that scales to handle heavy memory/compute (e.g., 16Gi→180Gi);
place the comment directly above the resources block in each file and mention
the rationale so users won't expect streamlit to serve large in-memory
dataframes.
```yaml
components:
  - ../../components/memory-tier-low
```
🧩 Analysis chain

🏁 Scripts executed against repository OpenMS/streamlit-template (outputs omitted):

```shell
#!/bin/bash
# Inspect the component kustomization files to see how the nodeselector patch is targeted.
fd -a 'kustomization.yaml' k8s/components
echo '--- low ---'
cat k8s/components/memory-tier-low/kustomization.yaml 2>/dev/null
echo '--- high ---'
cat k8s/components/memory-tier-high/kustomization.yaml 2>/dev/null

# Check if Redis is a Deployment and inspect the nodeselector patch
echo "=== Redis definition ==="
cat k8s/base/redis.yaml 2>/dev/null | head -20
echo "=== Nodeselector patch content (low) ==="
cat k8s/components/memory-tier-low/nodeselector.yaml 2>/dev/null
echo "=== Base directory structure ==="
ls -la k8s/base/ 2>/dev/null | grep -E '\.yaml$'
```
Node-selector patch will apply to Redis Deployment unintentionally.
The `nodeselector.yaml` patch in both memory-tier-low and memory-tier-high components targets `kind: Deployment` without filtering by name, so it will inject `openms.de/memory-tier=low` (or `high`) onto all Deployments in the base — including the Redis Deployment at `k8s/base/redis.yaml`.

Downstream effects:

- Redis becomes unschedulable on any cluster where only app-workload nodes carry the memory-tier label.
- Forks that switch to `memory-tier-high` will force Redis onto expensive high-memory nodes, despite Redis only requesting 256 Mi.
- CI testing masks this because `.github/workflows/build-and-test.yml` labels all nodes with `openms.de/memory-tier=low`.

Fix: Narrow the `nodeselector.yaml` patch target in each component's `kustomization.yaml` to the two app Deployments only (e.g., via `target.name: streamlit|rq-worker`, or split into per-Deployment patches), so Redis is unaffected.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@k8s/overlays/prod/kustomization.yaml` around lines 7 - 9, The nodeselector
patch in the memory-tier components is too broad and will match all Deployments
(including the Redis Deployment); update each component's kustomization.yaml to
restrict the nodeselector.yaml patch so it only targets the app Deployments
(e.g., set patchStrategicMerge or patches with a target block referencing the
patch file nodeselector.yaml and add target.kind: Deployment plus target.name:
streamlit|rq-worker or create separate per-deployment patches with target.name:
streamlit and target.name: rq-worker), ensuring the Redis Deployment (defined in
the base as redis.yaml) is not modified by these components.
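The narrowing described in the prompt could look like this sketch of the component's kustomization; the `apiVersion` and any sibling patch entries for the resources files are assumed, since the full file isn't reproduced here:

```yaml
# k8s/components/memory-tier-low/kustomization.yaml (sketch of the suggested fix)
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
patches:
  - path: nodeselector.yaml
    target:
      kind: Deployment
      name: streamlit|rq-worker   # anchored regex: matches the app Deployments, not redis
```

Kustomize treats `target.name` as an anchored regular expression, so `streamlit|rq-worker` selects exactly those two Deployments and leaves the Redis Deployment untouched.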
Previous run (2f28ed9) showed build + traefik-integration jobs still timing out on 'Wait for Redis'. Root cause: multi-node kind clusters apply node-role.kubernetes.io/control-plane:NoSchedule to the control-plane, which untolerated app pods can't land on even though the nodeSelector matches. The single-node kind used previously had no such taint, which is why CI worked until we added a second node.

Add a kubeadmConfigPatches stanza setting nodeRegistration.taints to the empty list so the control-plane is schedulable. Labels and cluster shape (1 control-plane + 1 worker) stay the same.

https://claude.ai/code/session_01LW4iBWt5YftuqFGc3jM5ZP
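The taint removal described above maps onto kind's `kubeadmConfigPatches` mechanism roughly like this (a sketch; the rest of the two-node config is elided):

```yaml
# .github/kind-config.yaml, control-plane entry (sketch)
nodes:
  - role: control-plane
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          taints: []   # drop node-role.kubernetes.io/control-plane:NoSchedule
```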
Summary
Refactored Kubernetes deployment configuration to replace per-app overlay copies with a single production overlay that selects memory tiers via Kustomize components. This simplifies deployment setup and makes resource allocation explicit and reusable.
Key Changes
- Eliminated per-app overlay pattern: Replaced the `k8s/overlays/template-app/` template with a single `k8s/overlays/prod/` overlay that all forks use directly. The forked repository itself identifies the app, removing the need to copy overlay directories.
- Introduced memory-tier components: Created two new Kustomize components (`memory-tier-low` and `memory-tier-high`) under `k8s/components/` that encapsulate node selection (`openms.de/memory-tier=low|high`) and tier-appropriate resource sizing.
- Removed hardcoded resources from base manifests: Stripped resource requests/limits from `k8s/base/streamlit-deployment.yaml` and `k8s/base/rq-worker-deployment.yaml` to allow tier components to inject tier-appropriate values.
- Added cluster-wide LimitRange: New `k8s/base/limitrange.yaml` sets sensible container defaults (512Mi request, 2Gi default limit) and cluster maximums (64Gi memory, 16 CPU) to prevent resource exhaustion.
- Updated Redis resources: Increased Redis requests (64Mi → 256Mi memory, 50m → 100m CPU) and aligned limits to match requests for stability.
- Updated documentation and CI: Modified skill guide, deployment docs, and GitHub Actions workflows to reference `k8s/overlays/prod/` instead of per-app overlays and added memory-tier selection as a deployment step.

Implementation Details

- Tier components apply a `nodeSelector` to all Deployments, then apply resource patches for specific workloads (streamlit, rq-worker).
- The `prod` overlay includes `memory-tier-low` by default; users switch to `memory-tier-high` only for genuinely memory-intensive workloads (DIA spectral-library, OpenSwath, DIA-LFQ).
- Cluster nodes must be labeled `openms.de/memory-tier=low|high` for the node selectors to function.
- CI deploys `k8s/overlays/prod/` instead of the template overlay.

https://claude.ai/code/session_01LW4iBWt5YftuqFGc3jM5ZP