Skip to content

propose: uniform YAML path expansion for embedding.model + README YAML reference#87

Merged
HumanBean17 merged 1 commit into
masterfrom
propose/yaml-path-expansion
May 12, 2026
Merged

propose: uniform YAML path expansion for embedding.model + README YAML reference#87
HumanBean17 merged 1 commit into
masterfrom
propose/yaml-path-expansion

Conversation

@HumanBean17
Copy link
Copy Markdown
Owner

What

Two coupled changes:

  1. propose/YAML-PATH-EXPANSION-PROPOSE.md — a surgical propose to apply os.path.expanduser + os.path.expandvars to embedding.model values from .java-codebase-rag.yml, but only when the value is path-shaped (starts with /, ./, ../, ~, or contains $). Hub ids (org/name) are passed through unchanged. 12 locked decisions, 15-case use-case re-walk, 1 PR of implementation work.

  2. README §2 gains a new "Project YAML reference" subsection: a single annotated .java-codebase-rag.yml example covering every key the project consumes (index_dir, embedding.{model,device}, microservice_roots, cross_service_resolution, role_overrides, route_overrides, http_client_overrides, async_producer_overrides), a path-expansion compatibility table, and a pointer to java-codebase-rag meta for diagnosing which source supplied each value.

Why

While answering "how do I specify a path to a local embedding model via YAML?", the only honest answer turned out to be: you can't use ~/... because YAML resolution doesn't expand it. CLI and env (SBERT_MODEL) both apply full expansion; index_dir from YAML applies expanduser (but not expandvars); embedding.model from YAML applies neither. That's an inconsistency between resolution paths, not a design call.

Separately, the README had YAML examples scattered across §2 (env-var precedence), §7.1 (role/route overrides), §7.2 (cross_service_resolution), and §7.4 (http_client/async_producer overrides) — with no single block showing all keys together. The new §2 sub-section is that single block.

Scope

Out of scope (deliberately):

  • Other YAML keys — none are filesystem paths. role_overrides, route_overrides, http_client_overrides, async_producer_overrides contain URL paths, Kafka topic names, FQNs, role names, framework names — all namespace strings, not paths.
  • microservice_roots — entries are directory names relative to source_root, not arbitrary paths. Expansion would let one project's YAML reach outside its tree.
  • embedding.device — devices (cpu, cuda, mps) aren't paths.
  • index_dir adding expandvars — already does expanduser; the $VAR gap is smaller and tracked as a separate follow-up (noted in §5 of the propose).

Status

Status: draft — not lock yet. Open for review on:

  • Heuristic boundary (especially UC8: plain models/x without leading ./ is treated as a hub id, not a path)
  • Decision PR-A1: Route schema + literal extractor (B2a) #5 (stderr hint on unresolved $VAR, vs. raising)
  • README §2.1 layout and depth (annotated single block vs. per-key tables)

After review, status moves to under review, then locked, then the doc is referenced from a follow-up implementation PR.

@HumanBean17
Copy link
Copy Markdown
Owner Author

Review (docs + propose)

Direction: Strong approve on scope and framing — YAML/CLI/env should agree for embedding.model, conservative path-shape heuristic + UC8 (./ disambiguation) is the right trade-off, and the README single-block reference fills a real gap.

Before merge / lock:

  1. README vs implementation — The new §2 text describes embedding.model tilde/$VAR expansion as current behaviour. If this PR lands before the code change, the README will be wrong for a while. Options: ship README with the impl PR, add a one-line "planned / requires …" callout, or use future tense until the behaviour exists.

  2. "Silently fails on ~/" — Worth tightening: the cocoindex subprocess imports java_index_v1_common, which applies expanduser/expandvars to SBERT_MODEL at import, so indexing may already work for YAML ~/... even while resolve_operator_config().embedding_model stays literal. The sharper claim is inconsistency (meta / payloads vs child process, and entrypoints like MCP search that read SBERT_MODEL without a second expand). That keeps the motivate-the-fix argument accurate under scrutiny.

  3. Pseudocode vs table — §3.1 mentions os.sep + _looks_like_hub_id but §3.2 does not; either extend the table or drop that branch from the sketch until the helper is specified.

  4. Unresolved $VAR hint — If you regex-scan for leftovers after expandvars, remember ${VAR} forms may not match $\\w+.

  5. CLI parity — One sentence on whether --embedding-model gets the same path-shaped expansion as YAML would close a small ambiguity for implementers.

Overall: good draft; address (1)–(2) before treating README as contract on master.

…L reference

- propose/YAML-PATH-EXPANSION-PROPOSE.md: surgical fix to apply
  os.path.expanduser + os.path.expandvars to .java-codebase-rag.yml
  embedding.model values that look like filesystem paths. Hub ids
  (org/name shape) are passed through unchanged.

- README §2 ('Environment variables') gains a new 'Project YAML
  reference' subsection: a single annotated .java-codebase-rag.yml
  example covering every supported key (index_dir, embedding.{model,
  device}, microservice_roots, cross_service_resolution, role_overrides,
  route_overrides, http_client_overrides, async_producer_overrides),
  a path-expansion compatibility table, and tips for diagnosing what
  the resolver picked up via 'java-codebase-rag meta'.

Scope is intentionally minimal: only embedding.model gets the expansion
fix. Other YAML string fields (URL paths, topic names, FQNs, role
names, framework names) are not filesystem paths and are explicitly
out of scope. index_dir is documented as expanding only '~' (not
$VAR) -- a smaller existing gap noted as a separate follow-up.
@HumanBean17 HumanBean17 force-pushed the propose/yaml-path-expansion branch from 15a0e74 to 77cd606 Compare May 12, 2026 14:45
@HumanBean17
Copy link
Copy Markdown
Owner Author

Amendments force-pushed (15a0e7477cd606)

All five points addressed. Status bumped draft → under review. Doc grew 204 → 252 lines.

Per-point response

1. README vs implementation. Two changes:

  • README YAML-block comment now opens with PLANNED (PR-YAML-EXPAND-1, see propose/YAML-PATH-EXPANSION-PROPOSE.md): and uses future tense (will be expanded). The "until the impl PR lands" sentence names the exact current behaviour (indexer subprocess expands, meta + MCP search show literal) and gives the workaround (absolute paths).
  • §2 path-expansion table row now reads yes (PLANNED, PR-YAML-EXPAND-1) and links to the propose.
  • Locked as decision feat: B6 cross-service matcher + MCP surface (PR-D3) #15: README §2 + code ship in the same PR. When the impl PR merges, the PLANNED markers come out of the README in that same commit, so master never carries a README that promises behaviour the code doesn't deliver.

2. "Silently fails on ~/" was technically wrong. Rewrote the framing across TL;DR and §1. The accurate claim is inconsistency, not blanket failure:

  • index_common.py:8-9 and java_index_v1_common.py:12 apply expanduser/expandvars to os.environ.get("SBERT_MODEL", ...) at module-import time.
  • So the cocoindex indexer subprocess that inherits a YAML ~/... (via apply_to_os_environ() setting SBERT_MODEL) re-imports and re-expands. Indexing works.
  • But resolve_operator_config().embedding_model itself stays literal. meta output (cli.py:266), MCP search (mcp_v2.py:382, server.py:177 — both os.environ.get("SBERT_MODEL", SBERT_MODEL) without re-expand), and JSON payloads all show ~/models/x literally. That's the bug.
  • §1 Frame is now "one canonical value visible to every consumer," not "make YAML match env."

This also forced a scope shift: the fix has to be at the resolution layer (post-_pick_str), not the YAML branch, otherwise CLI and env still flow literal through _pick_str to ResolvedOperatorConfig. Decision #3 locks this.

3. Pseudocode vs table. Dropped the os.sep branch and the _looks_like_hub_id helper from §3.1. The leading-prefix check + $ presence is sufficient, and the §3.2 table is now the single source of truth. Decision #5 explicitly documents the drop.

4. ${VAR} form not matched by \$\w+. Widened the regex to \$(\w+|\{[^}]+\}). Lifted to a module-level constant _UNRESOLVED_VAR_RE in the pseudocode and called out in principle #6 + decision #6.

5. CLI parity. This was the most useful point. Locking the helper at the resolution layer (decision #3) means CLI-parity comes for free: the helper runs on the value returned by _pick_str regardless of which path won. Two implications captured:

  • Decision Add per-PR Cursor task prompts for Tier 1 completion #4: shell-expanded forms (--embedding-model ~/x) and quoted forms (--embedding-model "~/x") both produce the same canonical value; the helper is idempotent on already-expanded input.
  • New UC10b in the use-case table: shell-quoted CLI ~ (which bypasses shell expansion) gets canonicalised exactly like YAML.
  • New test in §6 and Appendix A: test_embedding_model_cli_quoted_tilde_expanded.

Consistency pass

Doc grew 204 → 252 lines.

@HumanBean17 HumanBean17 marked this pull request as ready for review May 12, 2026 15:49
@HumanBean17 HumanBean17 merged commit d1027a3 into master May 12, 2026
@HumanBean17 HumanBean17 deleted the propose/yaml-path-expansion branch May 23, 2026 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant