Fix #196: SDK bootstrap honors TJ_CONFIG for config discovery #199
Merged
The Python SDK bootstrap (`ensure_initialised()`, reached by `@watch()` or any `patch_*()`) called `load_config()` with no path. The CLI resolves `TJ_CONFIG` via Click's `envvar` and passes the path in, but the SDK path did not, so a process that set `TJ_CONFIG` to point at a project/custom config was silently ignored and the SDK initialized against the GLOBAL config, writing spans into the global DuckDB. Observed in an instrumentation pilot: spans briefly landed in the global DB despite `TJ_CONFIG` being set, a data-isolation hazard.

Fix in `load_config()` rather than `ensure_initialised()`: when no explicit `path` is passed, honor `TJ_CONFIG` before the search-path discovery order. This is the single discovery function the SDK and CLI share, so the fix also covers the other bare `load_config()` SDK callers (`nemoclaw`, `openai_agents_sdk`, `llamaindex`), which have the same bug class, and keeps CLI/SDK consistent without re-reading the env in two places. An explicit `path` argument still wins (CLI `--config` beats `TJ_CONFIG`, preserved), and a `TJ_CONFIG` pointing at a missing file fails loudly instead of silently falling back to the global config, matching the CLI and closing the exact silent-fallback hazard. Bare `find_config_file()` CLI callers (onboard, doctor, budget, mcp) are deliberately untouched to keep the change narrow.

Tests (`tests/integration/test_sdk_config_discovery.py`): `load_config()` honors `TJ_CONFIG`, an explicit path beats it, and a missing `TJ_CONFIG` file raises; plus an end-to-end SUBPROCESS test (fresh OTel global provider + bootstrap singleton, i.e. the real "a process sets TJ_CONFIG" scenario) that runs `@watch()` with `TJ_CONFIG` set and asserts spans land in the intended DB and not the decoy global one. Verified the suite fails without the fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
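The precedence described above can be pictured as one resolution function. The following is a minimal sketch under stated assumptions, not the actual `tokenjam/core/config.py` code: the name `resolve_config_path`, its injected `environ` and `search_paths` parameters, and the error message are hypothetical. Only the order (explicit `path`, then `TJ_CONFIG`, then `tokenjam.toml` → `.tj/config.toml` → `~/.config/tj/config.toml`) and the raise-on-missing behavior come from this PR.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Sequence

# Search-path order documented in the PR (relative entries resolve against the CWD).
# The "~" entry is expanded lazily inside the function.
DEFAULT_SEARCH_PATHS: Sequence[Path] = (
    Path("tokenjam.toml"),
    Path(".tj/config.toml"),
    Path("~/.config/tj/config.toml"),
)


def resolve_config_path(
    path: Optional[str] = None,
    environ: Mapping[str, str] = os.environ,
    search_paths: Sequence[Path] = DEFAULT_SEARCH_PATHS,
) -> Optional[Path]:
    """Hypothetical sketch of the discovery order; not the real load_config()."""
    if path is None:
        # Read the env var only when no explicit path was passed, so a path the
        # CLI already resolved (from --config, or TJ_CONFIG via Click) still wins.
        # `or None` treats an empty TJ_CONFIG as unset and lets discovery run.
        path = environ.get("TJ_CONFIG") or None
    if path is not None:
        candidate = Path(path).expanduser()
        if not candidate.is_file():
            # Fail loudly instead of silently falling back to the global config.
            raise FileNotFoundError(f"config file not found: {candidate}")
        return candidate
    for entry in search_paths:
        candidate = entry.expanduser()
        if candidate.is_file():
            return candidate
    return None  # Nothing found; this sketch leaves that case to the caller.
```

Passing `environ` and `search_paths` in, rather than reading globals, is a choice of this sketch: it lets each precedence rule be checked with a plain dict and temporary files, without mutating the real process environment.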
anilmurty commented Jun 22, 2026
anilmurty left a comment (Contributor, Author)
Verdict: ✅ APPROVE — ready to merge
Reviewed the diff independently. CI: all 4 checks green (827 tests), MERGEABLE, tight scope (`core/config.py` + one integration test, 146/0).
Per-criterion (#196)
- `ensure_initialised()` honors `TJ_CONFIG`: fixed at the root. `load_config(path=None)` now reads `TJ_CONFIG` before search-path discovery. This also fixes the other bare-`load_config()` SDK callers (nemoclaw / openai_agents_sdk / llamaindex): same bug class, one place. Correct choice over patching just the one call site. ✅
- Spans land in the intended DB: the subprocess end-to-end test (`@watch()` in a fresh process, clean OTel global + bootstrap singleton, decoy global DB, asserts zero spans reach it) is the right design for this, and the agent verified it fails without the fix (3/4) and passes with it. ✅
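The subprocess design praised above can be illustrated with only the standard library. This is a hypothetical stand-in, not the PR's `test_sdk_config_discovery.py`: the child script replaces the SDK, OTel, and DuckDB with a few lines that pick a config file, read the DB path stored in it, and append one "span" line. `FAKE_GLOBAL_CONFIG`, `run_isolation_check`, and the `honor`/`ignore` switch are invented for illustration. What carries over is the structure: a fresh interpreter per run so no process-global state leaks in, `TJ_CONFIG` set only in the child's environment, and assertions on both the intended and the decoy destination.

```python
import os
import subprocess
import sys
import textwrap
from pathlib import Path
from typing import Tuple

# Hypothetical stand-in for "import the SDK and call a @watch()-decorated function".
# argv[1] == "honor" models the fixed bootstrap; "ignore" models the old bug,
# where TJ_CONFIG was set but never read.
CHILD = textwrap.dedent(
    """
    import os, pathlib, sys
    env_cfg = os.environ.get("TJ_CONFIG") if sys.argv[1] == "honor" else None
    cfg = env_cfg or os.environ["FAKE_GLOBAL_CONFIG"]
    db = pathlib.Path(pathlib.Path(cfg).read_text().strip())
    with db.open("a") as fh:
        print("span", file=fh)
    """
)


def count_spans(db: Path) -> int:
    return len(db.read_text().splitlines()) if db.exists() else 0


def run_isolation_check(tmp: Path, honor_tj_config: bool = True) -> Tuple[int, int]:
    """Return (spans in intended DB, spans in decoy global DB) after one child run."""
    intended_db, decoy_db = tmp / "project.db", tmp / "global.db"
    project_cfg, global_cfg = tmp / "project.cfg", tmp / "global.cfg"
    project_cfg.write_text(str(intended_db))  # each fake config just names its DB
    global_cfg.write_text(str(decoy_db))

    # The variables are set only in the child's environment; the parent is untouched.
    env = {**os.environ, "TJ_CONFIG": str(project_cfg), "FAKE_GLOBAL_CONFIG": str(global_cfg)}
    mode = "honor" if honor_tj_config else "ignore"
    # A brand-new interpreter per run, so no tracer provider or singleton carries over.
    subprocess.run([sys.executable, "-c", CHILD, mode], env=env, check=True)
    return count_spans(intended_db), count_spans(decoy_db)
```

With `honor_tj_config=False` the child mimics the pre-fix bootstrap, which saw `TJ_CONFIG` but never read it, and the span lands in the decoy; that is the failure a test of this shape exists to catch. A real version would also pin `PYTHONPATH` to the checkout, as the PR does, so the child imports the code under test.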
Verified the safety properties
- Explicit path wins: the env var is read only when `path is None`. No double-read; the CLI (which passes the resolved path) is unaffected. Empty `TJ_CONFIG` → `or None` → discovery, handled.
- Missing `TJ_CONFIG` file now raises (via `find_config_file`'s existing override check) instead of silently falling back to the global DB. This is the #196 hazard, and raising matches the CLI's `--config` behavior. Correct, intentional behavior change.
- No CLI regression: green CLI integration tests confirm; bare `find_config_file()` CLI callers (onboard/doctor/budget/mcp) correctly left untouched (narrowest blast radius; onboard's global-config behavior is intentional).
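The empty-`TJ_CONFIG` case noted in the review hinges on the `or None` idiom. A minimal sketch with a hypothetical helper name (`tj_config_override` does not come from the PR): `os.environ.get` returns `""` for a variable exported with no value, and `pathlib.Path("")` is `Path(".")`, the current directory, which is not a file. Without the normalization, an empty variable could be mistaken for an explicit path and, under the new raise-on-missing rule, fail instead of falling through to discovery.

```python
import os
from typing import Mapping, Optional


def tj_config_override(environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Hypothetical helper: the TJ_CONFIG value to use, or None to run discovery."""
    # `TJ_CONFIG=` (set but empty) yields "" here; `or None` folds it into "unset".
    return environ.get("TJ_CONFIG") or None
```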
One note for release notes (not a blocker)
The "missing `TJ_CONFIG` → raise" change is a (correct, minor) behavior change worth a one-line mention when this ships: anyone with a stale `TJ_CONFIG` pointing at a deleted file will now get a clear error instead of a silent global fallback.
Ready to merge. Independent of #198 (different files), so it can land anytime.
The Python SDK bootstrap ignored `TJ_CONFIG`, so a process pointing it at a project/custom config could silently initialize against the global config and write spans into the global DuckDB, a data-isolation hazard observed in an instrumentation pilot.

Summary

- `load_config()` now honors `TJ_CONFIG` (then the documented search-path order: `tokenjam.toml` → `.tj/config.toml` → `~/.config/tj/config.toml`) when no explicit `path` is passed.
- Fixes the reported call site (`ensure_initialised()`) and, by living in the shared discovery function, the other bare `load_config()` SDK callers (`nemoclaw`, `openai_agents_sdk`, `llamaindex`), which share the same bug class.
- `litellm.py` untouched.

Root cause

`ensure_initialised()` (reached by `@watch()` / any `patch_*()`) called `load_config()` with no path. The CLI resolves `TJ_CONFIG` via Click's `envvar="TJ_CONFIG"` and passes the path into `load_config(path)`; the SDK path passed nothing, so `find_config_file(None)` fell straight through to the search paths and picked the global config.

Fix: in `load_config()`, not `ensure_initialised()`

I chose `load_config()` (the single discovery function the SDK and CLI share) over patching `ensure_initialised()`:

- It also covers the other SDK callers that call `load_config()` bare (`nemoclaw`/`openai_agents_sdk`/`llamaindex`), which had the identical leak.
- An explicit `path` argument still wins over the env var (so CLI `--config` beats `TJ_CONFIG`, via Click precedence → `load_config(path)` → env-read skipped).
- A `TJ_CONFIG` pointing at a missing file now raises `FileNotFoundError` instead of silently falling back to the global config, matching the CLI and closing the exact silent-fallback hazard.
- Bare `find_config_file()` CLI callers (onboard/doctor/budget/mcp) are deliberately untouched to keep the blast radius narrow.

Tests / Verification (`tests/integration/test_sdk_config_discovery.py`)

- `load_config()` honors `TJ_CONFIG`; an explicit path beats it; a missing `TJ_CONFIG` file raises.
- End-to-end subprocess test (`@watch()` with `TJ_CONFIG` set, run in a fresh process so the OTel global TracerProvider + bootstrap singleton start clean, i.e. the real "a process sets TJ_CONFIG" scenario, with `PYTHONPATH` pinned to this checkout): asserts spans land in the `TJ_CONFIG`-pointed DB and zero spans reach a decoy global DB.
- `pytest tests/unit/ tests/synthetic/ tests/agents/ tests/integration/` → 827 passed.
- `ruff check tokenjam/` + `mypy tokenjam/core/config.py` clean. CLI config tests (`test_cli.py`, `test_config.py`) unaffected.

What's NOT in this PR

- `litellm.py` untouched (parallel agent owns it).
- Bare `find_config_file()` CLI callers: left as-is (the SDK fix doesn't need them, and changing them would alter onboard/doctor/budget/mcp discovery).
- Changes to `find_config_file()` itself.

Closes #196
🤖 Generated with Claude Code