fix(litellm): codex-cli is a sidecar (containers:), not an init container #366

Merged
samxu01 merged 2 commits into main from sprint/litellm-codex-cli-sidecar-fix
May 15, 2026

Conversation

samxu01 commented May 15, 2026

Summary

Hotfix for #365. The codex-cli block landed in `initContainers:` instead of `containers:`. Init containers must run to completion before the main containers start, and codex-cli's sleep loop never exits, so the pod was stuck at `Init:1/2` for 15 min and helm-upgrade timed out.

Move codex-cli into the sidecar position so the LiteLLM main container can reach Ready while codex-cli idles in parallel.

Test plan

  • After deploy: pod reaches `3/3 Running` quickly (litellm, codex-auth-rotator, codex-cli)
  • `kubectl describe pod` shows codex-cli in containers, not initContainers
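
The second check can also be scripted against the output of `kubectl get pod <name> -o json`. The following is a minimal sketch, assuming the standard Pod JSON layout (`spec.containers`, `spec.initContainers`); the helper name `container_placement` and the sample pod are illustrative and not part of the chart.

```python
import json
from typing import List


def container_placement(pod: dict, name: str) -> List[str]:
    """Return which spec lists ("containers", "initContainers") hold `name`."""
    spec = pod.get("spec", {})
    return [
        key
        for key in ("containers", "initContainers")
        if any(c.get("name") == name for c in (spec.get(key) or []))
    ]


# Hand-written sample mirroring the intended layout (not captured from a cluster).
pod = json.loads("""
{"spec": {
  "initContainers": [{"name": "codex-auth-seed"}],
  "containers": [
    {"name": "litellm"},
    {"name": "codex-auth-rotator"},
    {"name": "codex-cli"}
  ]
}}
""")

print(container_placement(pod, "codex-cli"))  # ['containers']
```

A result of `['containers']` matches the intended sidecar placement; `['initContainers']` or both keys would indicate the misplacement this PR addresses.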

🤖 Generated with Claude Code

samxu01 and others added 2 commits May 14, 2026 18:36
ChatGPT binds OAuth sessions to the IP/device that completed device-auth.
Laptop-device-auth'd tokens uploaded to the cluster get token_invalidated
on first use (confirmed via direct probe today). The cloud-codex-cody pod
already proved the fix: device-auth FROM inside the cluster produces
sessions ChatGPT keeps alive across cluster usage.

This brings that fix one layer up so Nova/Pixel and any future codex
agent share the same auth surface (LiteLLM), rather than each agent
needing its own pod with its own codex login.

What changes:

1. New `codex-cli` sidecar on the LiteLLM pod. Installs codex CLI on
   first boot, idles. Operator runs:
     kubectl exec -it deploy/litellm -c codex-cli -- /scripts/auth-login.sh 1
   Completes device-auth in browser; resulting auth.json lands on the
   shared chatgpt-auth volume as /chatgpt-auth/auth-1.json. Repeat for
   accounts 2 and 3.

2. codex-auth-rotator now PREFERS pod-side /chatgpt-auth/auth-N.json
   files when present, and only falls back to env-var-fed tokens
   (laptop-bound, dead) when no pod-side files exist. Keeps the existing
   rotation cadence + 429 signal handling unchanged.

3. chatgpt-auth volume can be a PVC (values:
   litellm.chatgptAuth.persistence.enabled). Required for the
   cluster-bound flow: emptyDir loses tokens on every pod restart.
   Dev opts in; defaults stay off so OSS deployments aren't surprised.

4. Adds `strategy.type: Recreate` to the LiteLLM Deployment when the
   PVC is enabled — RWO single-writer can't hand off cleanly with
   RollingUpdate.
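
The preference order in item 2 can be sketched as follows. This is an illustration under stated assumptions, not the rotator's actual code: the function name `pick_auth_sources` and the `env_tokens` parameter are hypothetical, and the only detail taken from the description above is the `/chatgpt-auth/auth-N.json` naming.

```python
import glob
import os
from typing import List, Tuple


def pick_auth_sources(auth_dir: str, env_tokens: List[str]) -> Tuple[str, List[str]]:
    """Prefer pod-side auth-N.json files; use env-var tokens only if none exist."""
    # Lexicographic sort is enough for single-digit account numbers.
    pod_files = sorted(glob.glob(os.path.join(auth_dir, "auth-*.json")))
    if pod_files:
        return "pod", pod_files
    # Fallback: tokens fed through environment variables (laptop-bound).
    return "env", list(env_tokens)
```

In the pod, `auth_dir` would be `/chatgpt-auth`. Whether the real rotator also validates each file before preferring it is not stated here, so the sketch only checks for presence.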

After this lands + operator does device-auth × N from inside the
codex-cli sidecar, all dev LLM traffic (openclaw moltbot via LiteLLM
chatgpt/ bridge, and any future codex CLI agents pointed at LiteLLM)
uses cluster-bound sessions. Nova/Pixel come back to life without
another laptop device-auth round.

Follow-up: switch cloud-codex-cody to point codex CLI at LiteLLM
(model_provider override + virtual key) so Cody routes through the
same auth surface instead of needing her own /state/.codex/auth.json.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iner

In PR #365 the codex-cli block landed in the initContainers list by
mistake, which left the pod stuck at Init:1/2: codex-cli's sleep loop
never exits, so the pod never progressed to Running, and helm-upgrade
hit the 10m timeout.

Move codex-cli into containers: (sidecar position, after the
codex-auth-rotator). LiteLLM main container can now reach Ready
while codex-cli idles in parallel waiting for operator exec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@samxu01 samxu01 merged commit 8a43fcc into main May 15, 2026
8 checks passed
samxu01 added a commit that referenced this pull request May 15, 2026
…ners (#367)

PR #366 was supposed to move codex-cli from initContainers to containers,
but the awk move only added the new entry and didn't delete the old one.
Result: spec had codex-cli in both lists, k8s rejected with
"spec.template.spec.initContainers[1].name: Duplicate value".

Strip the leftover container + its comment block. Final structure:
containers = [litellm, codex-auth-rotator, codex-cli],
initContainers = [codex-auth-seed].
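
A duplicate like this can be caught before deploy by checking the rendered pod spec for names that appear more than once across both lists, since Kubernetes validates container names as unique across init and regular containers. A minimal sketch, assuming the spec has already been parsed into a dict; the helper name `duplicate_container_names` is illustrative, and the `broken` example is reconstructed from the description above rather than copied from the chart.

```python
from collections import Counter
from typing import List


def duplicate_container_names(pod_spec: dict) -> List[str]:
    """Return names used more than once across initContainers and containers."""
    names = [
        c["name"]
        for key in ("initContainers", "containers")
        for c in (pod_spec.get(key) or [])
    ]
    return sorted(n for n, count in Counter(names).items() if count > 1)


# State after #366: the old entry was still present in initContainers.
broken = {
    "initContainers": [{"name": "codex-auth-seed"}, {"name": "codex-cli"}],
    "containers": [
        {"name": "litellm"},
        {"name": "codex-auth-rotator"},
        {"name": "codex-cli"},
    ],
}
# Final structure described above.
fixed = {
    "initContainers": [{"name": "codex-auth-seed"}],
    "containers": broken["containers"],
}

print(duplicate_container_names(broken))  # ['codex-cli']
print(duplicate_container_names(fixed))   # []
```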

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
samxu01 added a commit that referenced this pull request May 15, 2026
* fix(litellm): remove duplicate codex-cli container left in initContainers

PR #366 was supposed to move codex-cli from initContainers to containers,
but the awk move only added the new entry and didn't delete the old one.
Result: spec had codex-cli in both lists, k8s rejected with
"spec.template.spec.initContainers[1].name: Duplicate value".

Strip the leftover container + its comment block. Final structure:
containers = [litellm, codex-auth-rotator, codex-cli],
initContainers = [codex-auth-seed].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rotator): read codex CLI's nested auth.json shape

codex CLI 0.125 writes auth.json as {tokens: {access_token, refresh_token,
id_token}, auth_mode, OPENAI_API_KEY, last_refresh}. The rotator's
_read_pod_auth_file only looked at top-level access_token, missed the
nested shape, returned None, and fell back to env-var candidates (which
are the laptop-bound dead tokens we're trying to escape).

Read either shape — nested wins, flat is the legacy rotator-written
fallback.
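
Reading either shape can be sketched as below. This is not the rotator's `_read_pod_auth_file`; the function name `read_access_token` is hypothetical, and the only assumptions about the file are the two shapes described above (nested `tokens.access_token`, or a flat top-level `access_token`).

```python
import json
from typing import Optional


def read_access_token(raw: str) -> Optional[str]:
    """Extract an access token from auth.json text; nested shape wins over flat."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        return None
    nested = data.get("tokens")
    if isinstance(nested, dict) and nested.get("access_token"):
        return nested["access_token"]
    # Legacy flat shape written by the rotator itself.
    return data.get("access_token") or None


nested_file = '{"tokens": {"access_token": "nested-tok"}, "auth_mode": "chatgpt"}'
flat_file = '{"access_token": "flat-tok"}'

print(read_access_token(nested_file))  # nested-tok
print(read_access_token(flat_file))    # flat-tok
```

Returning `None` when neither shape is present keeps the caller's existing fallback path (env-var candidates) intact.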

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>