Cloud SLayer mode end-to-end (mirror local) + consolidate the OTF artifact lifecycle (DEV-1468)#2
Conversation
…lifecycle (DEV-1468) Make all three SLayer combos run in the cloud runner by mirroring the local path, and consolidate the on-the-fly artifact lifecycle to presence-gated reuse + explicit force-wipe (remove fingerprint *gating*; keep the fingerprint as provenance). "Local == cloud by construction": upload an artifact → it's present → it's reused, with no cloud-specific OTF code. OTF lifecycle (cache.py / reference_build.py): - Cache is a single authoritative dir per DB (drop the <fp> level); reuse is gated on the _cache_fp.txt completeness marker. ensure_db_cache gains force; a markerless/old-layout target is rmtree'd then atomic-renamed (migration). - ensure_db_reference checks the marker BEFORE ensure_db_cache (load-bearing for cloud combo 3, which downloads only the reference). _load_reference_entry is self-contained (kb_rows from disk; raise on missing/corrupt, accept []); reuse emits a provenance WARNING. - --otf-rebuild (renamed from --otf-rebuild-reference, kept as a hidden alias) force-wipes BOTH layers for both on-the-fly frameworks via _maybe_force_wipe_otf. paths.py: slayer_models_root / slayer_otf_cache_root (new) / slayer_models_otf_root gain BIRD_*_ROOT env overrides; both agents drop their duplicated _otf_cache_root. Cloud (gcs/cli/driver/ray_app/prereqs/image + Dockerfile.cloud + cluster.yaml.j2): - gcs: upload_dir_prefix + download_prefix (concurrent_download_prefix is now a thin wrapper) + slayer_artifact_name. - cli: accept slayer; --slayer-setup / --slayer-storage-root; validate the tuple via run._validate_slayer_setup. - driver: per-selected-DB upload (pre-encoded from the worktree, OTF from the main-checkout paths), fail-fast presence check before build/cluster, OPENAI key delivery for slayer, manifest + resubmit plumbing, no re-upload on resubmit. - ray_app: delete the ephemeral-server stub; download_slayer_setup once per worker (atomic, root-level marker); thread slayer_setup through run_one_task. - prereqs: require OPENAI_API_KEY in slayer mode. - image/Dockerfile: stop baking slayer_models (uploaded at submit); drop it + the slayer-ingest version from data_hash and the dirty-input set. - cluster template: export the three BIRD_*_ROOT env vars. Full non-integration suite + integration tier green; merged origin/main (DEV-1466). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughRemoves baked SLayer artifacts from images and implements run-time per-run GCS upload/download of SLayer setups, adds env-overridable slayer path helpers, refactors OTF cache/reference reuse to marker-based layouts, wires CLI/driver slayer submit options and validation, and updates Ray actor initialization and tests. ChangesSLayer Setup Delivery: Docker-less Runtime Models with Marker-Based OTF Cache
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/bird_interact_agents/cloud/image.py (1)
218-226:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRe-include
slayer_models/in dirty-worktree checksThe dirty-path filter no longer checks
slayer_models/, so cloud submits can bypassDirtyWorktreeErroreven with uncommitted SLayer changes.Suggested fix
input_prefixes = ( "src/", + "slayer_models/", "audited_gold/", "uv.lock", "pyproject.toml", "Dockerfile.cloud", )As per coding guidelines, "Ensure worktree is clean (only
src/,slayer_models/,audited_gold/,uv.lock,pyproject.toml,Dockerfile.cloudare checked) before submitting cloud jobs to avoidDirtyWorktreeError".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/bird_interact_agents/cloud/image.py` around lines 218 - 226, The dirty-path filter tuple input_prefixes (in src/bird_interact_agents/cloud/image.py) was changed and no longer includes "slayer_models/", allowing uncommitted SLayer changes to slip past DirtyWorktreeError; restore "slayer_models/" into the input_prefixes tuple (alongside "src/", "audited_gold/", "uv.lock", "pyproject.toml", "Dockerfile.cloud") so the worktree cleanliness check includes that directory again.src/bird_interact_agents/cloud/gcs.py (1)
190-205:⚠️ Potential issue | 🟠 Major | ⚡ Quick winEnforce exact-prefix matching and block path traversal in
download_prefix.
list_blobs(prefix=prefix)+rel = blob.name[len(prefix):]can incorrectly include sibling objects when callers pass a prefix without trailing/(e.g.,.../dbalso matches.../db2). Also,relis not sanitized for.., so a crafted blob key can write outsidedest.Suggested fix
-from pathlib import Path +from pathlib import Path, PurePosixPath @@ def download_prefix( @@ - def _one(blob: Any) -> None: - # Strip the prefix AND any leading slash, so a prefix passed without a - # trailing slash (e.g. ".../db") doesn't yield an absolute "/x.yaml" - # that would escape `dest`. - rel = blob.name[len(prefix):].lstrip("/") + normalized_prefix = prefix.rstrip("/") + list_prefix = f"{normalized_prefix}/" + dest_root = dest.resolve() + + def _one(blob: Any) -> None: + rel = blob.name[len(list_prefix):] if not rel: return - target = dest / rel + rel_path = PurePosixPath(rel) + if any(part == ".." for part in rel_path.parts): + raise ValueError(f"unsafe blob name outside prefix: {blob.name}") + target = (dest_root / Path(*rel_path.parts)).resolve() + target.relative_to(dest_root) target.parent.mkdir(parents=True, exist_ok=True) data = blob.download_as_bytes() target.write_bytes(data) - blobs = list(bucket.list_blobs(prefix=prefix)) + blobs = list(bucket.list_blobs(prefix=list_prefix))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/bird_interact_agents/cloud/gcs.py` around lines 190 - 205, The download_prefix logic must enforce exact-directory matches and block path traversal: when iterating bucket.list_blobs(prefix=prefix) only accept blobs whose blob.name equals prefix or startswith prefix + "/" (so a prefix "x/db" doesn't match "x/db2"), compute rel as the slice after the validated separator, then sanitize rel by rejecting or normalizing any path components that contain ".." or absolute paths (e.g., use pathlib.Path(rel).parts and ensure no ".." and no leading "/" before joining to dest). Update the _one handler and the blob-filtering loop to apply these checks (references: prefix, blob.name, rel, dest, _one, bucket.list_blobs) and return or raise on invalid/suspicious blob names.src/bird_interact_agents/cloud/ray_app.py (1)
313-324:⚠️ Potential issue | 🟠 Major | ⚡ Quick winEnsure per-task temp dirs are cleaned even when GCS writes fail.
Right now cleanup runs only after
write_row/write_logsucceed. If either call raises,cloud_log_*leaks. Wrap GCS writes in atry/finallywith unconditionalrmtree.Proposed fix
- _gcs.write_row(run_id, iid, attempt, row, client=gcs_client) - try: - log_bytes = log_tmp.read_bytes() if log_tmp.exists() else b"" - except OSError: - log_bytes = b"" - if log_bytes: - _gcs.write_log(run_id, iid, attempt, log_bytes, client=gcs_client) - # Now safe to drop the log tmp dir. - shutil.rmtree(log_dir, ignore_errors=True) + try: + _gcs.write_row(run_id, iid, attempt, row, client=gcs_client) + try: + log_bytes = log_tmp.read_bytes() if log_tmp.exists() else b"" + except OSError: + log_bytes = b"" + if log_bytes: + _gcs.write_log(run_id, iid, attempt, log_bytes, client=gcs_client) + finally: + # Always drop the per-task tmp dir, even on write failures. + shutil.rmtree(log_dir, ignore_errors=True) return iid🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/bird_interact_agents/cloud/ray_app.py` around lines 313 - 324, The temp log directory (log_dir / log_tmp) is only removed after successful _gcs.write_row/_gcs.write_log, causing leaks on GCS failures; wrap the GCS write logic in a try/finally: perform _gcs.write_row(...) and the conditional read/_gcs.write_log(...) inside the try, and in the finally unconditionally call shutil.rmtree(log_dir, ignore_errors=True) so the per-task temp dirs are always cleaned even if write_row or write_log raises; reference the existing symbols _gcs.write_row, _gcs.write_log, log_tmp, log_dir, gcs_client, run_id, iid, attempt, and row when locating the change.
🧹 Nitpick comments (2)
tests/cloud/test_driver.py (2)
673-673: ⚡ Quick winTighten fail-fast assertions to avoid false positives.
Using
pytest.raises(Exception)here can pass on unrelated failures. Please assert a specific exception type (or at least a stable message fragment) to keep these tests pinned to the intended fail-fast contract.Also applies to: 692-692, 710-710, 727-727
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/cloud/test_driver.py` at line 673, Replace the broad pytest.raises(Exception) usages in tests/cloud/test_driver.py (the raises call at the shown locations) with a tighter assertion: either specify the concrete exception class that the submit-time failure should raise (e.g., SubmitError or the actual exception class thrown by the submit code) or keep pytest.raises(Exception) but add a stable match="..." argument that asserts a distinctive substring of the expected error message; update all occurrences (the raises at lines indicated: 673, 692, 710, 727) so each uses the specific exception type or pytest.raises(..., match="expected fragment") to avoid false positives.
840-845: ⚡ Quick winAssert the
--slayer-storage-rootvalue on resubmit, not just presence.This test currently validates flag presence but not the propagated value, leaving a gap in manifest→job-args verification.
Suggested assertion addition
job_args = mocks["cluster"].submit_job.call_args.kwargs["args"] assert "--slayer-setup" in job_args assert job_args[job_args.index("--slayer-setup") + 1] == "on-the-fly" assert "--slayer-storage-root" in job_args + assert job_args[job_args.index("--slayer-storage-root") + 1] == "/data/slayer_models" # The crux: resubmit must NOT re-upload the setup. mocks["gcs"].upload_dir_prefix.assert_not_called()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/cloud/test_driver.py` around lines 840 - 845, Assert the actual value passed for "--slayer-storage-root" on resubmit instead of only checking presence: find its index in job_args (job_args = mocks["cluster"].submit_job.call_args.kwargs["args"]) and assert job_args[job_args.index("--slayer-storage-root") + 1] == expected_storage_root, where expected_storage_root is the storage-root value set earlier in the test/manifest (or the variable used to construct the job args); keep the existing check that mocks["gcs"].upload_dir_prefix.assert_not_called() to ensure no re-upload.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/bird_interact_agents/cloud/driver.py`:
- Around line 82-96: The function read_api_keys_from_local_env builds a set of
required keys via prereqs._required_api_keys and then silently returns only
those present in os.environ; instead, compute the set of missing keys (needed
minus keys in os.environ), and if any are missing raise an explicit error (e.g.,
RuntimeError or ValueError) listing the missing key names so callers fail fast;
keep the existing logic for adding "OPENAI_API_KEY" when query_mode == "slayer",
but after that validate presence and only then return the dict of env values for
the required keys.
In `@src/bird_interact_agents/slayer_otf/cache.py`:
- Around line 291-305: When os.rename(tmp_dir, target) fails but
marker.is_file() indicates a peer won the race, don't return the local fp and
kb_rows; instead read the on-disk metadata from the committed target and return
that. In the except OSError branch (around tmp_dir/target/marker handling) after
confirming marker.is_file(), load the fingerprint and kb_rows from the existing
target directory and return CacheEntry(cache_dir=target, fingerprint=<disk_fp>,
kb_rows=<disk_kb_rows>) rather than returning the local fp/kb_rows; keep the
existing cleanup of tmp_dir (shutil.rmtree(tmp_dir, ignore_errors=True)).
In `@src/bird_interact_agents/slayer_otf/reference_build.py`:
- Around line 476-484: The code in ensure_db_reference currently holds
_get_lock(db) while calling ensure_db_cache (which itself acquires the same
lock), causing a self-deadlock; fix by not calling ensure_db_cache while inside
async with _get_lock(db): instead, under the lock only check marker.is_file()
and return _reuse_reference if set, otherwise exit the lock, call
ensure_db_cache (outside the lock), then re-acquire _get_lock(db) and re-check
marker.is_file() before proceeding with the build; update ensure_db_reference to
follow this release-call-reacquire pattern so ensure_db_cache never attempts to
obtain the same lock recursively.
---
Outside diff comments:
In `@src/bird_interact_agents/cloud/gcs.py`:
- Around line 190-205: The download_prefix logic must enforce exact-directory
matches and block path traversal: when iterating
bucket.list_blobs(prefix=prefix) only accept blobs whose blob.name equals prefix
or startswith prefix + "/" (so a prefix "x/db" doesn't match "x/db2"), compute
rel as the slice after the validated separator, then sanitize rel by rejecting
or normalizing any path components that contain ".." or absolute paths (e.g.,
use pathlib.Path(rel).parts and ensure no ".." and no leading "/" before joining
to dest). Update the _one handler and the blob-filtering loop to apply these
checks (references: prefix, blob.name, rel, dest, _one, bucket.list_blobs) and
return or raise on invalid/suspicious blob names.
In `@src/bird_interact_agents/cloud/image.py`:
- Around line 218-226: The dirty-path filter tuple input_prefixes (in
src/bird_interact_agents/cloud/image.py) was changed and no longer includes
"slayer_models/", allowing uncommitted SLayer changes to slip past
DirtyWorktreeError; restore "slayer_models/" into the input_prefixes tuple
(alongside "src/", "audited_gold/", "uv.lock", "pyproject.toml",
"Dockerfile.cloud") so the worktree cleanliness check includes that directory
again.
In `@src/bird_interact_agents/cloud/ray_app.py`:
- Around line 313-324: The temp log directory (log_dir / log_tmp) is only
removed after successful _gcs.write_row/_gcs.write_log, causing leaks on GCS
failures; wrap the GCS write logic in a try/finally: perform _gcs.write_row(...)
and the conditional read/_gcs.write_log(...) inside the try, and in the finally
unconditionally call shutil.rmtree(log_dir, ignore_errors=True) so the per-task
temp dirs are always cleaned even if write_row or write_log raises; reference
the existing symbols _gcs.write_row, _gcs.write_log, log_tmp, log_dir,
gcs_client, run_id, iid, attempt, and row when locating the change.
---
Nitpick comments:
In `@tests/cloud/test_driver.py`:
- Line 673: Replace the broad pytest.raises(Exception) usages in
tests/cloud/test_driver.py (the raises call at the shown locations) with a
tighter assertion: either specify the concrete exception class that the
submit-time failure should raise (e.g., SubmitError or the actual exception
class thrown by the submit code) or keep pytest.raises(Exception) but add a
stable match="..." argument that asserts a distinctive substring of the expected
error message; update all occurrences (the raises at lines indicated: 673, 692,
710, 727) so each uses the specific exception type or pytest.raises(...,
match="expected fragment") to avoid false positives.
- Around line 840-845: Assert the actual value passed for
"--slayer-storage-root" on resubmit instead of only checking presence: find its
index in job_args (job_args =
mocks["cluster"].submit_job.call_args.kwargs["args"]) and assert
job_args[job_args.index("--slayer-storage-root") + 1] == expected_storage_root,
where expected_storage_root is the storage-root value set earlier in the
test/manifest (or the variable used to construct the job args); keep the
existing check that mocks["gcs"].upload_dir_prefix.assert_not_called() to ensure
no re-upload.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0a34dea4-460e-4b62-ae6b-f8695c34f1a1
📒 Files selected for processing (27)
Dockerfile.cloudsrc/bird_interact_agents/agents/pydantic_ai_otf_encode/agent.pysrc/bird_interact_agents/agents/pydantic_ai_recursive/agent.pysrc/bird_interact_agents/cloud/cli.pysrc/bird_interact_agents/cloud/cluster.yaml.j2src/bird_interact_agents/cloud/driver.pysrc/bird_interact_agents/cloud/gcs.pysrc/bird_interact_agents/cloud/image.pysrc/bird_interact_agents/cloud/prereqs.pysrc/bird_interact_agents/cloud/ray_app.pysrc/bird_interact_agents/paths.pysrc/bird_interact_agents/run.pysrc/bird_interact_agents/slayer_otf/cache.pysrc/bird_interact_agents/slayer_otf/reference_build.pytests/cloud/conftest.pytests/cloud/test_cli.pytests/cloud/test_cluster.pytests/cloud/test_driver.pytests/cloud/test_gcs.pytests/cloud/test_image.pytests/cloud/test_prereqs.pytests/cloud/test_ray_app.pytests/test_otf_rebuild_wiring.pytests/test_paths.pytests/test_slayer_otf_cache.pytests/test_slayer_otf_reference_build.pytests/test_slayer_setup_flag.py
…ss) + GCE run-id cap (DEV-1468) Process-reviews (CodeRabbit + Codex) on PR #2, plus a GC-smoke-discovered bug: - CRITICAL self-deadlock (CodeRabbit): ensure_db_reference held _get_lock(db) then called ensure_db_cache, which re-acquires the SAME non-reentrant lock — hanging on any fresh/--otf-rebuild reference build. Moved ensure_db_cache OUTSIDE the lock (kept the top reuse check so cloud combo-3 still never calls it). Added a regression test that wraps the fake cache to take the real lock (would hang under the old code; the unit tests masked it by mocking the cache). - cache.py rename-race (CodeRabbit): the OSError race-success branch now returns _load_cache_entry(target) — the winner's on-disk fp/kb_rows — so CacheEntry matches cache_dir instead of our local values. - ray_app.download_slayer_setup (Codex): a missing/empty GCS prefix is no longer cached as .download_complete — raise instead, so a failed/partial upload surfaces and a retry can still run. - driver.read_api_keys_from_local_env (CodeRabbit): fail fast (PrereqError) on a missing required key instead of silently dropping it — resubmit skips prereq checks, so this otherwise surfaced as opaque per-actor auth failures. - driver.mint_run_id: cap the framework slug so Ray's GCE node name ray-<run_id>-worker-<uuid> stays under the 63-char instance-name limit. The long pydantic_ai_otf_encode / pydantic_ai_recursive slugs (19 chars) tripped Ray's assertion at `ray up` once DEV-1468 made them cloud-submittable in slayer mode (a real GC-smoke failure). + regression test. Codex CX[A] (root-level .download_complete not run-scoped) accepted as-is: fresh-VM-per-run + per-combo-fixed roots mean it never persists across runs. Full non-integration suite (1030) + integration tier (19) green; ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/bird_interact_agents/cloud/driver.py (1)
20-22: ⚡ Quick winBind
_required_api_keysoutside the monkeypatcheddriver.prereqsmodule.Only
PrereqErroris pinned to the real module right now.read_api_keys_from_local_env()still resolves_required_api_keysthroughdriver.prereqs, so the submit/resubmit tests that replacedriver.prereqswith a mock stop exercising provider-key selection and can miss silently dropped env vars.♻️ Suggested change
-from bird_interact_agents.cloud.prereqs import PrereqError +from bird_interact_agents.cloud.prereqs import PrereqError, _required_api_keys @@ - needed.update(prereqs._required_api_keys(model)) + needed.update(_required_api_keys(model))Also applies to: 97-119
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/bird_interact_agents/cloud/driver.py` around lines 20 - 22, Bind the required keys list from the real prereqs module instead of resolving it through driver.prereqs at call time: import or read _required_api_keys from bird_interact_agents.cloud.prereqs into the top-level of driver.py (alongside the already pinned PrereqError) and update read_api_keys_from_local_env to reference that top-level _required_api_keys rather than driver.prereqs._required_api_keys so tests that monkeypatch driver.prereqs still exercise provider-key selection and env-var validation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/bird_interact_agents/cloud/driver.py`:
- Around line 20-22: Bind the required keys list from the real prereqs module
instead of resolving it through driver.prereqs at call time: import or read
_required_api_keys from bird_interact_agents.cloud.prereqs into the top-level of
driver.py (alongside the already pinned PrereqError) and update
read_api_keys_from_local_env to reference that top-level _required_api_keys
rather than driver.prereqs._required_api_keys so tests that monkeypatch
driver.prereqs still exercise provider-key selection and env-var validation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5ba38dba-8ef7-4022-9301-1a2d5d8c1b9a
📒 Files selected for processing (8)
src/bird_interact_agents/cloud/driver.pysrc/bird_interact_agents/cloud/ray_app.pysrc/bird_interact_agents/slayer_otf/cache.pysrc/bird_interact_agents/slayer_otf/reference_build.pytests/cloud/test_driver.pytests/cloud/test_ray_app.pytests/test_slayer_otf_cache.pytests/test_slayer_otf_reference_build.py
…not Ray's 55)
The previous cap was computed from Ray's internal `name_label <= 55`
assertion, but Ray composes the actual GCE name as
`ray-<run_id>-worker-<uuid8>-compute` which adds 17 more chars — and
GCP's compute-instance regex caps the FULL name at 63
(`[a-z]([-a-z0-9]{0,61}[a-z0-9])?`). Ray's 55 passed for a 50-char label
while GCP rejected the resulting 66-char name with HTTP 400, blocking
`ray up`. Compute the slug cap from GCP's real 63, and test the FULL
worst-case name (worker + compute suffix).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/bird_interact_agents/cloud/driver.py (1)
170-188:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFail fast on unmapped
instance_ids before deriving DB uploads.
_dbs_for_instances()silently drops requested IDs that are missing frommini_interact.jsonl. In slayer mode that means_check_slayer_setup_present()and_upload_slayer_setup()both run against an incomplete DB set, so a submit can proceed without uploading setup for some requested instances and only fail later in-cluster.💡 Suggested fix
def _dbs_for_instances(instance_ids) -> list[str]: @@ - wanted = set(instance_ids) + wanted = set(instance_ids) + matched: set[str] = set() dbs: set[str] = set() with paths.mini_interact_data_file().open() as f: for line in f: @@ - td = _json.loads(line) - if td.get("instance_id") in wanted: + td = _json.loads(line) + iid = td.get("instance_id") + if iid in wanted: + matched.add(iid) db = td.get("selected_database") if db: dbs.add(db) + missing = sorted(wanted - matched) + if missing: + raise ValueError( + f"cloud slayer: instance_ids not found in mini_interact.jsonl: {missing}" + ) return sorted(dbs)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/bird_interact_agents/cloud/driver.py` around lines 170 - 188, _dbs_for_instances currently drops requested instance_ids that aren't present in mini_interact.jsonl; modify _dbs_for_instances to track found instance_ids while reading paths.mini_interact_data_file(), compute missing = wanted - found_ids after the loop, and if any missing raise a clear exception (e.g., ValueError) listing the missing instance IDs and referring to the mini_interact.jsonl source; keep returning the sorted, deduplicated db list when no IDs are missing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/bird_interact_agents/cloud/driver.py`:
- Around line 170-188: _dbs_for_instances currently drops requested instance_ids
that aren't present in mini_interact.jsonl; modify _dbs_for_instances to track
found instance_ids while reading paths.mini_interact_data_file(), compute
missing = wanted - found_ids after the loop, and if any missing raise a clear
exception (e.g., ValueError) listing the missing instance IDs and referring to
the mini_interact.jsonl source; keep returning the sorted, deduplicated db list
when no IDs are missing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d012e4fb-83f8-466c-a46d-9c93867273a2
📒 Files selected for processing (2)
src/bird_interact_agents/cloud/driver.pytests/cloud/test_driver.py
- ensure_db_reference now forwards `force` to ensure_db_cache (Codex): a programmatic ensure_db_reference(force=True) caller would otherwise reuse a stale phase-1-3 cache and encode it into the "fresh" reference. The CLI --otf-rebuild was already safe (it purges both layers separately via _maybe_force_wipe_otf), but the direct-API contract now matches. Regression test asserts ensure_db_cache is called with force=True. - driver imports `_required_api_keys` by name (alongside PrereqError) so the submit/resubmit tests that monkeypatch `driver.prereqs` still exercise the real provider→key mapping (CodeRabbit). Otherwise the mock returns an empty iterable and key-selection silently no-ops in tests. CodeRabbit's stale broad-pytest.raises nitpick is from its 2026-05-24 review of the pre-fix code (already addressed in 40d4fec — all 4 fail-fast tests use pytest.raises(FileNotFoundError)). Full suite 1031 + integration 19 green; ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/bird_interact_agents/cloud/driver.py (1)
180-192:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't silently ignore
instance_ids that never resolve to a DB.If the dataset lookup misses any requested
instance_id, this helper returns a partial DB list and the new fail-fast SLayer presence/upload path proceeds as if everything is valid. That turns a typo or stale manifest entry into a late in-cluster missing-artifact failure after the image build / cluster bring-up has already started.Suggested fix
def _dbs_for_instances(instance_ids) -> list[str]: @@ - wanted = set(instance_ids) + wanted = set(instance_ids) + found: set[str] = set() dbs: set[str] = set() with paths.mini_interact_data_file().open() as f: for line in f: @@ td = _json.loads(line) if td.get("instance_id") in wanted: + found.add(td["instance_id"]) db = td.get("selected_database") if db: dbs.add(db) + missing = sorted(wanted - found) + if missing: + raise ValueError( + f"instance_ids not found in {paths.mini_interact_data_file()}: {missing}" + ) return sorted(dbs)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/bird_interact_agents/cloud/driver.py` around lines 180 - 192, The helper that reads paths.mini_interact_data_file() currently returns a partial list when some requested instance_ids are not found; modify the code to detect missing ids by tracking found_ids (e.g., build a set while iterating using the existing wanted = set(instance_ids) and when td.get("instance_id") matches add to found_ids), then compare wanted - found_ids after the loop and raise an explicit exception (ValueError or a custom error) listing the missing instance_ids so the caller (e.g., the SLayer presence/upload path) fails fast instead of proceeding with a partial db list; ensure you still return sorted(dbs) only when no ids are missing.src/bird_interact_agents/slayer_otf/reference_build.py (1)
472-497:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSerialize the fast-path reuse check with
force=Truerebuilds.The new lock-free reuse path lets a non-force caller return
reference_dirwhile a concurrentforce=Truecaller is still on the rebuild path. Once that force caller reaches_commit_reference(..., force=True), it canrmtree(target)underneath the reuser, so in-process callers can see missing or partially replaced reference files.Suggested fix
- # Reuse path FIRST — never builds/touches the cache (load-bearing for cloud - # combo 3, which downloads only the reference, never the cache). - if not force and marker.is_file(): - return _reuse_reference(target, db) + # Serialize reuse with force rebuilds, but release the lock before touching + # the cache because ensure_db_cache() takes the same per-db lock. + async with _get_lock(db): + if not force and marker.is_file(): + return _reuse_reference(target, db) # Materialise (or reuse) the phases-1-3 cache OUTSIDE the reference lock. @@ async with _get_lock(db): # Double-check under the lock — a peer may have built the reference # while we waited (or while ensure_db_cache ran).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/bird_interact_agents/slayer_otf/reference_build.py` around lines 472 - 497, The early lock-free fast-path (the "if not force and marker.is_file(): return _reuse_reference(...)" check) must be removed so reuse is serialized with concurrent force=True rebuilds; delete that pre-ensure_db_cache return and let ensure_db_cache run and the later async with _get_lock(db) double-check (which calls _reuse_reference) decide reuse, so use marker, _reuse_reference, ensure_db_cache and _get_lock(db) as the referenced symbols to locate and remove the initial early-return branch.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/bird_interact_agents/cloud/driver.py`:
- Around line 180-192: The helper that reads paths.mini_interact_data_file()
currently returns a partial list when some requested instance_ids are not found;
modify the code to detect missing ids by tracking found_ids (e.g., build a set
while iterating using the existing wanted = set(instance_ids) and when
td.get("instance_id") matches add to found_ids), then compare wanted - found_ids
after the loop and raise an explicit exception (ValueError or a custom error)
listing the missing instance_ids so the caller (e.g., the SLayer presence/upload
path) fails fast instead of proceeding with a partial db list; ensure you still
return sorted(dbs) only when no ids are missing.
In `@src/bird_interact_agents/slayer_otf/reference_build.py`:
- Around line 472-497: The early lock-free fast-path (the "if not force and
marker.is_file(): return _reuse_reference(...)" check) must be removed so reuse
is serialized with concurrent force=True rebuilds; delete that
pre-ensure_db_cache return and let ensure_db_cache run and the later async with
_get_lock(db) double-check (which calls _reuse_reference) decide reuse, so use
marker, _reuse_reference, ensure_db_cache and _get_lock(db) as the referenced
symbols to locate and remove the initial early-return branch.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 235f56bb-cf8b-421c-a1ea-6132556b446d
📒 Files selected for processing (3)
src/bird_interact_agents/cloud/driver.pysrc/bird_interact_agents/slayer_otf/reference_build.pytests/test_slayer_otf_reference_build.py
Closes DEV-1468. Extends/supersedes DEV-1467 with the design agreed in a /spec session, plus a consolidation of the OTF artifact lifecycle. Two Codex review rounds (plan + tests) folded in.
What & why
Make all three SLayer combos run in the cloud runner (
bird-interact-cloud) by mirroring the local path, and consolidate the over-engineered OTF artifact lifecycle to presence-gated reuse + explicit force-wipe. The unifying lever is removing fingerprint gating (the fingerprint stays as provenance). Then local == cloud by construction: upload an artifact → it's present → it's reused, with no cloud-specific OTF code.The three combos, all working:
slayer_models/<db>/.slayer_otf_cache/<db>/.slayer_models_otf/<db>/.OTF lifecycle consolidation (
slayer_otf/)<fp>path level); reuse gated on the_cache_fp.txtmarker.ensure_db_cachegainsforce; a markerless/old-layout target isrmtree'd then atomic-renamed (handles migration from the old<db>/<fp>/layout with no crash).ensure_db_reference, beforeensure_db_cache— load-bearing for cloud combo 3, which downloads only the reference (never the cache)._load_reference_entryis self-contained (loadskb_rowsfrom disk; raises on missing/corrupt, accepts[]); reuse emits a one-line provenance WARNING.--otf-rebuild(renamed from--otf-rebuild-reference, kept as a hidden alias) force-wipes both layers for both on-the-fly frameworks viarun._maybe_force_wipe_otf.paths + agents
slayer_models_root/ newslayer_otf_cache_root/slayer_models_otf_rooteach gain aBIRD_*_ROOTenv override (harmless locally; the cloud actor uses them to point at/data/...). Both agents drop their duplicated_otf_cache_root.Cloud (
cloud/*+ Dockerfile.cloud + cluster.yaml.j2)upload_dir_prefix+download_prefix(concurrent_download_prefixis now a thin wrapper) +slayer_artifact_name.--slayer-setup/--slayer-storage-root; validate the(framework, query_mode, mode, slayer_setup)tuple viarun._validate_slayer_setup.paths.*), a fail-fast presence check before build/push/cluster,OPENAI_API_KEYdelivery for slayer, manifest + resubmit plumbing, and no re-upload on resubmit.download_slayer_setuponce per worker process (atomic, root-level.download_completemarker); threadslayer_setupthrough torun_one_task.OPENAI_API_KEYin slayer mode.slayer_models/(uploaded at submit); drop it + the slayer-ingest version fromdata_hashand the dirty-input set. mini-interact + audited_gold stay baked.BIRD_*_ROOTenv vars.Decisions (confirmed in the /spec session)
data_hash.Testing
-m integration): 19 passed.origin/main(DEV-1466) — fast-forward + clean stash-pop, no hand-resolved conflicts.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Tests