Update Documentation for Vault Cluster Setup with Security Considerations#76
Conversation
PR Analysis
PR Feedback💡 General suggestions: The PR is well-structured and provides valuable information on the architecture and security considerations for setting up a Vault cluster. It would be beneficial to provide more context or explanation for some of the security measures mentioned, especially for readers who may not be familiar with these concepts. 🤖 Code feedback:
✨ Usage guide:Overview: With a configuration file, use the following template:
See the review usage page for a comprehensive guide on using this tool. |
|
/describe |
|
PR Description updated to latest commit (a623eab) |
…L composition, per-claim CNP)
Four bugs surfaced once OpenWebUI was wired into the AI gateway
end-to-end (commit 556e7e1b) — each individually small but the
combination was needed to make the chat UI return real completions
instead of timeouts and 4xx:
1. **CEC backend Service port collision.** SR exposes 50051
(ext_proc gRPC) and 8080 (HTTP API). With
`backendServices: [{name: vllm-semantic-router}]` Cilium pushed
BOTH endpoints into the EDS cluster, so cilium-envoy would
intermittently dial 8080 for ext_proc and get back HTTP/1.1
instead of an ext_proc gRPC stream — surfaced as
`upstream connect error or disconnect/reset before headers.
reset reason: protocol error`. `failure_mode_allow: true` did
not save the request: the ext_proc passthrough returned 403.
Fix: pin the cluster to 50051 with `number: ["50051"]`. Pinned
the four xplane-* clusters to 8000 too for symmetry / Hubble
clarity.
2. **`llm-router` Service had no endpoints.** Selector targeted
`app.kubernetes.io/name: vllm-semantic-router`, but the chart
actually labels the SR pod `semantic-router`. CEC service-
redirect itself doesn't need endpoints, but the egress-policy
plane on the source side does — clients with restrictive egress
CNPs (default-deny + `toEndpoints`) couldn't resolve a
destination identity and Cilium's L7 filter returned plain
`Access denied` 403. Fix: switch the selector to
`semantic-router` so endpoints exist and the policy decision
becomes deterministic.
3. **OpenWebUI egress was scoped to a single label.** With CEC
service-redirect the actual upstream is whichever vLLM Service
SR picks (any of 4 today, future-N tomorrow), not SR. Listing
each label by hand re-couples the UI to the model fleet. Fix:
widen to `entities: cluster` on TCP 80 / 8000 / 8080 —
acceptable for a chat UI workload (restricted PSS, scoped IAM,
`world` egress already capped at 443).
4. **`_defaultIngress` cross-namespace match in the
InferenceService composition.** Cilium CNP
`fromEndpoints.matchLabels` defaults to the policy's own
namespace when no `io.kubernetes.pod.namespace` key is set.
The default-deny + `xplane-openwebui` allow only matched pods
in the llm namespace (where SR + promptfoo live). Adding the
`apps` namespace label makes the rule actually match the real
OpenWebUI pod and unblocks chat completions through the
gateway.
Plus the workaround for the still-using-the-old-composition issue
(task #76): the four InferenceService claim manifests
(phi4-mini, qwen3-8b, deepseek-r1-distill-qwen3-8b, llamaguard3-1b)
get a `spec.networkPolicies.ingress` override that mirrors the new
KCL composition's `_defaultIngress`. Drop the per-claim blocks
once the next composition version is published.
Verified live with a UI prompt:
POST /v1/chat/completions {"model":"auto","messages":[...]}
-> HTTP 200, x-vsr-selected-model: xplane-phi4-mini,
body: "2+2 equals 4."
Three documentation artifacts shipped together (all from the same 2026-05-04 brainstorming session): **docs/superpowers/specs/2026-05-04-coding-llm-fleet-design.md.** Design doc for swapping DeepSeek-R1-Distill-Qwen-7B out for Qwen2.5-Coder-7B-Instruct and adding Qwen2.5-Coder-1.5B-Base as an always-warm FIM model for inline tab-complete. Covers fleet specification, two-path routing (client-deterministic for OpenCode + Continue, SR cascade for OpenWebUI MoM), GPU budget vs the 4-GPU NodePool cap, ~$700-900/mo cost envelope, and the operational concerns (Karpenter consolidation, ext_proc cold-connect, CNP overrides). **docs/superpowers/specs/2026-05-04-coding-llm-fleet-plan.md.** Companion implementation plan with concrete ordered steps: rename DeepSeek claim to Qwen2.5-Coder, add Qwen2.5-Coder-1.5B-Base FIM claim, rewrite SR decisions[] to route code prompts to xplane-qwen-coder, expose individual models in /v1/models, write client config docs for OpenCode and Continue, smoke-test cascade vs direct-pick, drop old DeepSeek weights from S3. Status table at the top tracks which phases shipped pre-redeploy vs which are deferred (smoke tests + S3 cleanup need a deployed cluster; composition republish is task #76). **clusters/mycluster-0-llm-platform/README.md.** Adds the new fleet shape table (5 claims) + whole-cluster `terramate script run --reverse destroy` procedure (one y/n prompt, walks every stack in reverse) + explicit list of data buckets preserved by the Orphan managementPolicies on the Bucket MRs.
Phased TDD-shaped plan to deliver the design at 2026-05-05-ai-gateway-redesign-design.md (commit 060c02e). 5 phases, each independently mergeable: P1 smoke (dedicated Envoy + qwen3-8b route), P2 SR ext_proc via EnvoyExtensionPolicy with filter-ordering verification (Lua fallback documented), P3 full fleet routing (Service-backed), P4 InferencePool + EPP per claim (folds task #76), P5 demolition (delete CEC + llm-router-proxy + GHCR workflow, repoint Tailscale, close #78). Cross-cutting verification table + per-phase rollback playbook + open items for implementation-time discovery.
Bumps the InferenceService composition from 0.3.2 → 0.3.3 with two additions to _defaultIngress so the post-P4 traffic flow lands cleanly on vLLM pods without per-claim CNP overrides: 1. AI Gateway data plane → vLLM TCP 8000. The dedicated Envoy AI Gateway data plane (envoy-ai-gateway-system, gateway.envoyproxy.io/owning-gateway-name=ai-gateway) routes traffic to vLLM via InferencePool selection — EPP returns a pod IP, gateway connects directly. Without this allow, the gateway would silently drop on the first inference request. 2. Endpoint Picker Plugin (EPP) → vLLM TCP 8000. Each EPP scrapes /metrics for queue depth + KV-cache pressure to score endpoints. All 5 EPP pods (one per InferencePool) share the `inferencepool` pod-template label set by the upstream chart; matched via matchExpressions Exists scoped to the llm namespace. Composition source URL bumped to 0.3.3-pr1434 (CI will publish via the crossplane-modules.yml workflow on this PR's next run, per the kcl.mod version-rewrite path). Drops the per-claim networkPolicies overrides on all 5 model claims (phi4-mini, qwen3-8b, llamaguard3-1b, qwen-coder, qwen-coder-fim). Each was a verbatim copy of the composition's _defaultIngress, added when an earlier composition version omitted the apps namespace label on the OpenWebUI selector. The default now covers everything those overrides did, plus the new AI Gateway + EPP sources. Closes task #76.
…L composition, per-claim CNP)
Four bugs surfaced once OpenWebUI was wired into the AI gateway
end-to-end (commit 556e7e1b) — each individually small but the
combination was needed to make the chat UI return real completions
instead of timeouts and 4xx:
1. **CEC backend Service port collision.** SR exposes 50051
(ext_proc gRPC) and 8080 (HTTP API). With
`backendServices: [{name: vllm-semantic-router}]` Cilium pushed
BOTH endpoints into the EDS cluster, so cilium-envoy would
intermittently dial 8080 for ext_proc and get back HTTP/1.1
instead of an ext_proc gRPC stream — surfaced as
`upstream connect error or disconnect/reset before headers.
reset reason: protocol error`. `failure_mode_allow: true` did
not save the request: the ext_proc passthrough returned 403.
Fix: pin the cluster to 50051 with `number: ["50051"]`. Pinned
the four xplane-* clusters to 8000 too for symmetry / Hubble
clarity.
2. **`llm-router` Service had no endpoints.** Selector targeted
`app.kubernetes.io/name: vllm-semantic-router`, but the chart
actually labels the SR pod `semantic-router`. CEC service-
redirect itself doesn't need endpoints, but the egress-policy
plane on the source side does — clients with restrictive egress
CNPs (default-deny + `toEndpoints`) couldn't resolve a
destination identity and Cilium's L7 filter returned plain
`Access denied` 403. Fix: switch the selector to
`semantic-router` so endpoints exist and the policy decision
becomes deterministic.
3. **OpenWebUI egress was scoped to a single label.** With CEC
service-redirect the actual upstream is whichever vLLM Service
SR picks (any of 4 today, future-N tomorrow), not SR. Listing
each label by hand re-couples the UI to the model fleet. Fix:
widen to `entities: cluster` on TCP 80 / 8000 / 8080 —
acceptable for a chat UI workload (restricted PSS, scoped IAM,
`world` egress already capped at 443).
4. **`_defaultIngress` cross-namespace match in the
InferenceService composition.** Cilium CNP
`fromEndpoints.matchLabels` defaults to the policy's own
namespace when no `io.kubernetes.pod.namespace` key is set.
The default-deny + `xplane-openwebui` allow only matched pods
in the llm namespace (where SR + promptfoo live). Adding the
`apps` namespace label makes the rule actually match the real
OpenWebUI pod and unblocks chat completions through the
gateway.
Plus the workaround for the still-using-the-old-composition issue
(task #76): the four InferenceService claim manifests
(phi4-mini, qwen3-8b, deepseek-r1-distill-qwen3-8b, llamaguard3-1b)
get a `spec.networkPolicies.ingress` override that mirrors the new
KCL composition's `_defaultIngress`. Drop the per-claim blocks
once the next composition version is published.
Verified live with a UI prompt:
POST /v1/chat/completions {"model":"auto","messages":[...]}
-> HTTP 200, x-vsr-selected-model: xplane-phi4-mini,
body: "2+2 equals 4."
Three documentation artifacts shipped together (all from the same 2026-05-04 brainstorming session): **docs/superpowers/specs/2026-05-04-coding-llm-fleet-design.md.** Design doc for swapping DeepSeek-R1-Distill-Qwen-7B out for Qwen2.5-Coder-7B-Instruct and adding Qwen2.5-Coder-1.5B-Base as an always-warm FIM model for inline tab-complete. Covers fleet specification, two-path routing (client-deterministic for OpenCode + Continue, SR cascade for OpenWebUI MoM), GPU budget vs the 4-GPU NodePool cap, ~$700-900/mo cost envelope, and the operational concerns (Karpenter consolidation, ext_proc cold-connect, CNP overrides). **docs/superpowers/specs/2026-05-04-coding-llm-fleet-plan.md.** Companion implementation plan with concrete ordered steps: rename DeepSeek claim to Qwen2.5-Coder, add Qwen2.5-Coder-1.5B-Base FIM claim, rewrite SR decisions[] to route code prompts to xplane-qwen-coder, expose individual models in /v1/models, write client config docs for OpenCode and Continue, smoke-test cascade vs direct-pick, drop old DeepSeek weights from S3. Status table at the top tracks which phases shipped pre-redeploy vs which are deferred (smoke tests + S3 cleanup need a deployed cluster; composition republish is task #76). **clusters/mycluster-0-llm-platform/README.md.** Adds the new fleet shape table (5 claims) + whole-cluster `terramate script run --reverse destroy` procedure (one y/n prompt, walks every stack in reverse) + explicit list of data buckets preserved by the Orphan managementPolicies on the Bucket MRs.
Type
Documentation
Description
This PR primarily updates the documentation for the Vault cluster setup. The most significant changes include:
Changes walkthrough
README.md
terraform/vault/cluster/README.md
The changes in this file include the removal of the
'Architecture' section and the addition of a 'Security
Considerations' section. The 'High Availability' section has
been expanded with more detailed explanations of the
architectural decisions. The 'Getting Started' section has
also been slightly modified.
README.md
terraform/vault/management/README.md
A minor change has been made to the instructions for
building and importing the full chain bundle. The user is
now instructed to navigate to the
'terraform/vault/management' directory before executing the
commands.
✨ Usage guide:
Overview:
The
describetool scans the PR code changes, and generates a description for the PR - title, type, summary, walkthrough and labels. The tool can be triggered automatically every time a new PR is opened, or can be invoked manually by commenting on a PR.When commenting, to edit configurations related to the describe tool (
pr_descriptionsection), use the following template:With a configuration file, use the following template:
Enabling\disabling automation
meaning the
describetool will run automatically on every PR, will keep the original title, and will add the original user description above the generated description.the tool will replace every marker of the form
pr_agent:marker_namein the PR description with the relevant content, wheremarker_nameis one of the following:type: the PR type.summary: the PR summary.walkthrough: the PR walkthrough.Note that when markers are enabled, if the original PR description does not contain any markers, the tool will not alter the description at all.
Custom labels
The default labels of the
describetool are quite generic: [Bug fix,Tests,Enhancement,Documentation,Other].If you specify custom labels in the repo's labels page or via configuration file, you can get tailored labels for your use cases.
Examples for custom labels:
Main topic:performance- pr_agent:The main topic of this PR is performanceNew endpoint- pr_agent:A new endpoint was added in this PRSQL query- pr_agent:A new SQL query was added in this PRDockerfile changes- pr_agent:The PR contains changes in the DockerfileThe list above is eclectic, and aims to give an idea of different possibilities. Define custom labels that are relevant for your repo and use cases.
Note that Labels are not mutually exclusive, so you can add multiple label categories.
Make sure to provide proper title, and a detailed and well-phrased description for each label, so the tool will know when to suggest it.
More PR-Agent commands
See the describe usage page for a comprehensive guide on using this tool.