Skip to content

Fix #196: SDK bootstrap honors TJ_CONFIG for config discovery#199

Merged
anilmurty merged 1 commit into
mainfrom
fix/196-sdk-honor-tj-config
Jun 22, 2026
Merged

Fix #196: SDK bootstrap honors TJ_CONFIG for config discovery#199
anilmurty merged 1 commit into
mainfrom
fix/196-sdk-honor-tj-config

Conversation

@anilmurty

Copy link
Copy Markdown
Contributor

The Python SDK bootstrap ignored TJ_CONFIG, so a process pointing it at a project/custom config could silently initialize against the global config and write spans into the global DuckDB — a data-isolation hazard observed in an instrumentation pilot.

Summary

  • load_config() now honors TJ_CONFIG (then the documented search-path order: tokenjam.toml.tj/config.toml~/.config/tj/config.toml) when no explicit path is passed.
  • Fixes the SDK bootstrap (ensure_initialised()) and, by living in the shared discovery function, the other bare load_config() SDK callers (nemoclaw, openai_agents_sdk, llamaindex) — same bug class.
  • CLI behavior unchanged; explicit paths still win; out-of-scope litellm.py untouched.

Root cause

ensure_initialised() (reached by @watch() / any patch_*()) called load_config() with no path. The CLI resolves TJ_CONFIG via Click's envvar="TJ_CONFIG" and passes the path into load_config(path); the SDK path passed nothing, so find_config_file(None) fell straight through to the search paths and picked the global config.

Fix — in load_config(), not ensure_initialised()

if path is None:
    path = os.environ.get("TJ_CONFIG") or None
config_path = find_config_file(path)

I chose load_config() (the single discovery function the SDK and CLI share) over patching ensure_initialised():

  • It keeps CLI and SDK consistent — the env var is honored in one place, not re-read separately in the SDK.
  • It also fixes the other SDK integrations that call load_config() bare (nemoclaw / openai_agents_sdk / llamaindex), which had the identical leak.
  • An explicit path argument still wins over the env var (so CLI --config beats TJ_CONFIG, via Click precedence → load_config(path) → env-read skipped).
  • A TJ_CONFIG pointing at a missing file now raises FileNotFoundError instead of silently falling back to the global config — matching the CLI and closing the exact silent-fallback hazard.
  • Bare find_config_file() CLI callers (onboard / doctor / budget / mcp) are deliberately untouched to keep the blast radius narrow.

Tests / Verification (tests/integration/test_sdk_config_discovery.py)

  • load_config() honors TJ_CONFIG; an explicit path beats it; a missing TJ_CONFIG file raises.
  • End-to-end SUBPROCESS test (@watch() with TJ_CONFIG set, run in a fresh process so the OTel global TracerProvider + bootstrap singleton start clean — the real "a process sets TJ_CONFIG" scenario, with PYTHONPATH pinned to this checkout): asserts spans land in the TJ_CONFIG-pointed DB and zero spans reach a decoy global DB.
  • Confirmed the suite fails without the fix (3/4 fail; the intended DB is never created) and passes with it.
  • Full suite: pytest tests/unit/ tests/synthetic/ tests/agents/ tests/integration/827 passed. ruff check tokenjam/ + mypy tokenjam/core/config.py clean. CLI config tests (test_cli.py, test_config.py) unaffected.

What's NOT in this PR

Closes #196

🤖 Generated with Claude Code

The Python SDK bootstrap (`ensure_initialised()`, reached by `@watch()` or any
`patch_*()`) called `load_config()` with no path. The CLI resolves `TJ_CONFIG`
via Click's `envvar` and passes the path in, but the SDK path did not — so a
process that set `TJ_CONFIG` to point at a project/custom config was silently
ignored and the SDK initialized against the GLOBAL config, writing spans into
the global DuckDB. Observed in an instrumentation pilot: spans briefly landed
in the global DB despite `TJ_CONFIG` being set — a data-isolation hazard.

Fix in `load_config()` rather than `ensure_initialised()`: when no explicit
`path` is passed, honor `TJ_CONFIG` before the search-path discovery order.
This is the single discovery function the SDK and CLI share, so the fix also
covers the other bare `load_config()` SDK callers (`nemoclaw`,
`openai_agents_sdk`, `llamaindex`) — the same bug class — and keeps CLI/SDK
consistent without re-reading the env in two places. An explicit `path`
argument still wins (CLI `--config` beats `TJ_CONFIG`, preserved), and a
`TJ_CONFIG` pointing at a missing file fails loudly instead of silently
falling back to the global config — matching the CLI and closing the exact
silent-fallback hazard. Bare `find_config_file()` CLI callers (onboard,
doctor, budget, mcp) are deliberately untouched to keep the change narrow.

Tests (`tests/integration/test_sdk_config_discovery.py`): `load_config()`
honors `TJ_CONFIG`, an explicit path beats it, and a missing `TJ_CONFIG` file
raises; plus an end-to-end SUBPROCESS test (fresh OTel global provider +
bootstrap singleton — the real "a process sets TJ_CONFIG" scenario) that runs
`@watch()` with `TJ_CONFIG` set and asserts spans land in the intended DB and
not the decoy global one. Verified the suite fails without the fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@anilmurty anilmurty left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: ✅ APPROVE — ready to merge

Reviewed the diff independently. CI all 4 green (827 tests), MERGEABLE, tight scope (core/config.py + one integration test, 146/0).

Per-criterion (#196)

  1. ensure_initialised() honors TJ_CONFIG — fixed at the root: load_config(path=None) now reads TJ_CONFIG before search-path discovery. This also fixes the other bare-load_config() SDK callers (nemoclaw / openai_agents_sdk / llamaindex) — same bug class, one place. Correct choice over patching just the one call site. ✅
  2. Spans land in the intended DB — the subprocess end-to-end test (@watch() in a fresh process, clean OTel global + bootstrap singleton, decoy global DB, asserts zero spans reach it) is the right design for this, and the agent verified it fails without the fix (3/4) and passes with it. ✅

Verified the safety properties

  • Explicit path wins — env is read only when path is None. No double-read; CLI (which passes the resolved path) is unaffected. Empty TJ_CONFIGor None → discovery, handled.
  • Missing TJ_CONFIG file now raises (via find_config_file's existing override check) instead of silently falling back to the global DB — this is the #196 hazard, and raising matches the CLI's --config behavior. Correct, intentional behavior change.
  • No CLI regression — green CLI integration tests confirm; bare find_config_file() CLI callers (onboard/doctor/budget/mcp) correctly left untouched (narrowest blast radius; onboard's global-config behavior is intentional).

One note for release notes (not a blocker)

The "missing TJ_CONFIG → raise" is a (correct, minor) behavior change worth a one-line mention when this ships — anyone with a stale TJ_CONFIG pointing at a deleted file will now get a clear error instead of silent global fallback.

Ready to merge. Independent of #198 (different files), so it can land anytime.

@anilmurty anilmurty merged commit 249791c into main Jun 22, 2026
4 checks passed
@anilmurty anilmurty deleted the fix/196-sdk-honor-tj-config branch June 23, 2026 04:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug] SDK ensure_initialised() ignores TJ_CONFIG → can write spans to the wrong (global) DB

1 participant