feat(litellm): in-pod codex device-auth for cluster-bound sessions #365
Merged
ChatGPT binds OAuth sessions to the IP/device that completed device-auth.
Laptop-device-auth'd tokens uploaded to the cluster get `token_invalidated`
on first use (confirmed via direct probe today). The cloud-codex-cody pod
already proved the fix: device-auth FROM inside the cluster produces
sessions ChatGPT keeps alive across cluster usage.
This brings that fix one layer up so Nova/Pixel and any future codex
agent share the same auth surface (LiteLLM), rather than each agent
needing its own pod with its own codex login.
What changes:
1. New `codex-cli` sidecar on the LiteLLM pod. Installs codex CLI on
first boot, idles. Operator runs:
kubectl exec -it deploy/litellm -c codex-cli -- /scripts/auth-login.sh 1
Completes device-auth in browser; resulting auth.json lands on the
shared chatgpt-auth volume as /chatgpt-auth/auth-1.json. Repeat for
accounts 2 and 3.
2. codex-auth-rotator now PREFERS pod-side /chatgpt-auth/auth-N.json
files when present, and only falls back to env-var-fed tokens
(laptop-bound, dead) when no pod-side files exist. Keeps the existing
rotation cadence + 429 signal handling unchanged.
3. chatgpt-auth volume can be a PVC (values key:
   `litellm.chatgptAuth.persistence.enabled`). Required for the
   cluster-bound flow — emptyDir loses tokens on every pod restart. Dev
   opts in; defaults stay off so OSS deployments aren't surprised.
4. Adds `strategy.type: Recreate` to the LiteLLM Deployment when the
PVC is enabled — RWO single-writer can't hand off cleanly with
RollingUpdate.
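Items 3 and 4 together might look like the following values fragment. Only the `litellm.chatgptAuth.persistence.enabled` key is named by this PR; the sub-keys and the rendered Deployment shape are illustrative assumptions:

```yaml
# values-dev.yaml — sketch; only the persistence.enabled path is from this PR
litellm:
  chatgptAuth:
    persistence:
      enabled: true          # dev opts in; OSS default stays false
      size: 1Gi              # illustrative, not from the PR
      storageClassName: standard  # illustrative, not from the PR

# Rendered Deployment effect when the PVC is enabled (RWO single-writer
# can't hand off between old and new pods under RollingUpdate):
# spec:
#   strategy:
#     type: Recreate
```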
After this lands and the operator completes device-auth × N from inside the
codex-cli sidecar, all dev LLM traffic (openclaw moltbot via the LiteLLM
chatgpt/ bridge, and any future codex CLI agents pointed at LiteLLM)
uses cluster-bound sessions. Nova/Pixel come back to life without
another laptop device-auth round.
Follow-up: switch cloud-codex-cody to point codex CLI at LiteLLM
(model_provider override + virtual key) so Cody routes through the
same auth surface instead of needing her own /state/.codex/auth.json.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
samxu01 added a commit that referenced this pull request on May 15, 2026:
…iner (#366)

* feat(litellm): in-pod codex device-auth for cluster-IP-bound sessions

  (The commit body repeats the PR description above verbatim.)

* fix(litellm): codex-cli is a sidecar (containers:), not an init container

  In PR #365 the codex-cli block landed in the initContainers list by
  mistake, which left the pod stuck at Init:1/2 — codex-cli's sleep loop
  never exits, so the pod never progressed to Running, and helm-upgrade hit
  the 10m timeout.

  Move codex-cli into containers: (sidecar position, after
  codex-auth-rotator). The LiteLLM main container can now reach Ready while
  codex-cli idles in parallel waiting for operator exec.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
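The placement fix above can be illustrated with a minimal Deployment fragment. Container names come from the commit; images and the idle command are placeholders:

```yaml
# Before (#365, broken): codex-cli sat under initContainers:, whose members
# must exit before the pod runs — its sleep loop never does, so the pod
# stalls at Init:1/2.
# After (#366): codex-cli sits in containers: as a long-lived sidecar.
spec:
  template:
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main   # placeholder image
        - name: codex-auth-rotator
          image: alpine:3                        # placeholder image
        - name: codex-cli                        # sidecar: installs CLI, idles,
          image: node:20                         # waits for operator exec
          command: ["sh", "-c", "while true; do sleep 3600; done"]
```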
samxu01 added a commit that referenced this pull request on May 15, 2026:
…pt.com (#369)

Multi-runtime ≠ multi-auth-surface. Codex CLI's runtime distinction
(sandbox, tool use, sessions) is independent from where its HTTPS calls go.
Point codex CLI at LiteLLM instead of chatgpt.com so:

- single auth surface across openclaw and codex runtimes
- one rotator, one cluster-bound auth.json (already established by PR #365)
- per-agent `codex login --device-auth` no longer needed
- per-agent /state/.codex/auth.json no longer needed
- shared quota pool across all agents
- LiteLLM observability captures all model traffic regardless of runtime

What changes:

- Boot script seeds ~/.codex/config.toml with model_provider=litellm,
  base_url pointing at the LiteLLM service, wire_api=responses (matches the
  chatgpt/ bridge's Responses-API shape), env_key=LITELLM_API_KEY.
- LITELLM_API_KEY is exported from a k8s Secret
  (cloud-codex-<name>-litellm-key, optional so the pod can boot before the
  key exists; a warning is logged if missing).
- Drops the "wait for /state/.codex/auth.json" gate — no longer needed since
  codex CLI no longer holds its own auth.

Operator setup (per agent):

1. POST /api/registry/install (cloud-codex/<name>)
2. Mint AgentInstallation runtime token → secret cloud-codex-<name>-token
3. Mint LiteLLM virtual key → secret cloud-codex-<name>-litellm-key
4. helm upgrade — pod boots, no device-auth needed

The cloud-codex pod's PVC still holds /state/.commonly/tokens/<name>.json
(the commonly agent run loop's CAP token); only the codex auth.json went
away.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
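The seeded `~/.codex/config.toml` might look like this. The keys are the ones named in the commit message; the provider table shape follows codex CLI's `model_providers` convention, and the service URL is an assumption:

```toml
# ~/.codex/config.toml — sketch, not the literal file the boot script writes
model_provider = "litellm"

[model_providers.litellm]
name = "LiteLLM"
base_url = "http://litellm.dev.svc.cluster.local:4000/v1"  # assumed service DNS/port
wire_api = "responses"        # matches the chatgpt/ bridge's Responses-API shape
env_key = "LITELLM_API_KEY"   # read from the k8s Secret at boot
```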
Summary
Moves the cluster-bound device-auth pattern from per-agent pods (cloud-codex) up to the LiteLLM pod so Nova/Pixel and any future codex-CLI agent share one auth surface.
Root cause we keep hitting
ChatGPT binds OAuth sessions to the IP/device that ran device-auth. Tokens device-auth'd on a laptop and uploaded to the cluster get `token_invalidated` on first cluster use. A direct probe today of freshly-uploaded account-1/2 tokens returned 401 INVALIDATED within seconds for both.
Fix
Run device-auth from INSIDE the LiteLLM pod. The resulting auth.json is cluster-IP-bound natively. Three changes:

1. `codex-cli` sidecar on the LiteLLM pod. Installs codex CLI, idles. Operator runs:

   ```
   kubectl exec -it deploy/litellm -c codex-cli -- /scripts/auth-login.sh 1
   ```

   Completes device-auth in the browser; auth.json lands on the shared chatgpt-auth volume as `/chatgpt-auth/auth-1.json`. Repeat for accounts 2 and 3.

2. Rotator prefers pod-side files `/chatgpt-auth/auth-{1,2,3}.json` when present; falls back to env-var-fed tokens otherwise. Existing rotation cadence + 429 signal handling unchanged.

3. chatgpt-auth can be a PVC via `litellm.chatgptAuth.persistence.enabled`. Required for the cluster-bound flow (emptyDir wipes on every helm-upgrade). Dev enables it.

Plus `strategy.type: Recreate` when the PVC is enabled (RWO single-writer constraint).
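The rotator's source-preference step can be sketched in shell. The function and directory names here are assumptions; the real codex-auth-rotator script lives in the chart:

```shell
#!/bin/sh
# Sketch: prefer cluster-bound pod-side auth files, fall back to env tokens.
AUTH_DIR="${AUTH_DIR:-/chatgpt-auth}"

pick_auth_source() {
  # Any auth-N.json on the shared volume means in-pod device-auth happened.
  for n in 1 2 3; do
    if [ -f "$AUTH_DIR/auth-$n.json" ]; then
      echo "pod"   # cluster-bound files present: use these
      return 0
    fi
  done
  echo "env"       # laptop-bound env-var tokens: last resort
}
```

The rotation cadence and 429 handling sit around this selection step unchanged; only the token source flips when pod-side files appear.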
Test plan
Follow-up (not in this PR)
🤖 Generated with Claude Code