Skip to content

fix(modal): surface repo image build failure details#652

Open
hreiten wants to merge 1 commit into
ColeMurray:mainfrom
watchdog-no:fix/repo-image-build-error-details
Open

fix(modal): surface repo image build failure details#652
hreiten wants to merge 1 commit into
ColeMurray:mainfrom
watchdog-no:fix/repo-image-build-error-details

Conversation

@hreiten
Copy link
Copy Markdown
Contributor

@hreiten hreiten commented May 18, 2026

Summary

  • Preserve structured setup/supervisor failure details when a repo image build exits before image_build.complete
  • Redact the clone token and repo environment variable values before storing build failure messages
  • Fix supervisor.fatal logging so it does not collide with the reserved LogRecord.message field

Why

Repo image builds currently fall back to Build sandbox exited without completing (exit_code=...) even when the build sandbox already emitted a more useful structured failure such as setup.failed or supervisor.error. That makes failed image builds hard to diagnose from the settings UI.

There is also a separate fatal-error logging path that passes message= through logging.extra, which raises KeyError: "Attempt to overwrite 'message' in LogRecord" before the fatal error can be reported.

Validation

  • cd packages/modal-infra && ./.venv/bin/ruff check . && ./.venv/bin/ruff format --check . && ./.venv/bin/pytest tests/ -q
  • cd packages/sandbox-runtime && ./.venv/bin/ruff check . && ./.venv/bin/ruff format --check . && ./.venv/bin/pytest tests/ -q
  • npm run lint
  • npm run typecheck

npm test currently fails in unrelated control-plane model tests for GPT 5.5 default reasoning expectations; this PR does not touch TypeScript model metadata or tests.

Summary by CodeRabbit

  • Bug Fixes
    • Improved image build failure reporting: captures concise error messages from build logs, trims long messages, and redacts sensitive values; sandbox exits now report extracted errors instead of generic messages.
    • Fixed supervisor fatal-error reporting to use the correct structured field for the error message.
  • Tests
    • Updated and added tests to validate failure reporting, truncation, and redaction behavior.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 78a41982-c4ff-47ab-92cf-00b2a1809423

📥 Commits

Reviewing files that changed from the base of the PR and between d3a4ce4 and be9088d.

📒 Files selected for processing (4)
  • packages/modal-infra/src/scheduler/image_builder.py
  • packages/modal-infra/tests/test_image_builder_v2.py
  • packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py
  • packages/sandbox-runtime/tests/test_supervisor_monitor.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • packages/sandbox-runtime/tests/test_supervisor_monitor.py
  • packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py
  • packages/modal-infra/tests/test_image_builder_v2.py
  • packages/modal-infra/src/scheduler/image_builder.py

📝 Walkthrough

Walkthrough

Streams structured build logs, extracts a concise redacted/truncated failure message when present, returns (sha, complete, error) from the log streamer, and raises BuildError with the extracted message if the sandbox exits without completion; aligns supervisor fatal logging field and updates tests.

Changes

Image Build Error Reporting Enhancement

Layer / File(s) Summary
Failure message extraction utilities
packages/modal-infra/src/scheduler/image_builder.py
Adds Iterable import, constants BUILD_FAILURE_MESSAGE_MAX_CHARS, BUILD_FAILURE_REDACT_MIN_CHARS, _BUILD_LOG_EVENTS, _SETUP_FAILURE_EVENTS, and helpers _format_build_failure_event and _trim_build_failure_message to extract and redact concise failure messages from structured JSON log events.
Stream build logs refactor
packages/modal-infra/src/scheduler/image_builder.py
Rewrites _stream_build_logs(sandbox, redact_values) to parse structured JSON stdout, capture head_sha from git.sync_complete, treat image_build.complete as completion, select setup/supervisor payloads for concise error text, and return (head_sha, build_complete, error_message).
Build pipeline integration
packages/modal-infra/src/scheduler/image_builder.py
build_repo_image now collects redact_values (clone token and injected env values), passes them into _stream_build_logs, and raises BuildError with the extracted build_error if the sandbox exits without completion.
Test coverage for stream logs contract
packages/modal-infra/tests/test_image_builder_v2.py
Tests updated to await the new 3-tuple and assert error is None on success; _build_handle expanded to accept stdout_lines and returncode; new test test_reports_build_log_failure_when_stream_never_completes verifies redaction/truncation and failure callback payload.
Fatal error reporting field alignment
packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py, packages/sandbox-runtime/tests/test_supervisor_monitor.py
SandboxSupervisor._report_fatal_error now emits the fatal message under error_message (not message); new test TestFatalErrorReporting ensures reporting does not collide with reserved fields.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • ColeMurray/background-agents#636: Modifies build_repo_image and sandbox termination in image build failure handling; closely related to this PR's error extraction and reporting changes.
  • ColeMurray/background-agents#503: Related changes to how error_message is extracted/redacted from build-log events and surfaced to UI.

Suggested reviewers

  • open-inspect

Poem

🐇 I stitched the logs with careful art,

hid the secrets, trimmed the part,
masked the tokens, kept the clue,
now builds can tell us what they do —
hop in, reviewers, take a cart!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: surfacing detailed error information from failed repo image builds instead of using generic messages.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@hreiten hreiten changed the title [codex] Surface repo image build failure details fix(modal): surface repo image build failure details May 18, 2026
@hreiten hreiten marked this pull request as ready for review May 18, 2026 13:32
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/modal-infra/src/scheduler/image_builder.py`:
- Around line 53-63: The failure events list is missing "supervisor.fatal", so
update the constants: add "supervisor.fatal" to the _BUILD_LOG_EVENTS tuple and
include it in the _BUILD_FAILURE_EVENTS set (which is _SETUP_FAILURE_EVENTS |
{...}) so supervisor.fatal is treated as a build/fatal error; this will allow
_stream_build_logs and the error-extraction logic that inspects supervisor
events to pick up the structured fatal messages instead of falling back to
generic failures.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 018208cd-7e99-4263-9a04-6fc34f4838bf

📥 Commits

Reviewing files that changed from the base of the PR and between a8a14c4 and 3f5574b.

📒 Files selected for processing (4)
  • packages/modal-infra/src/scheduler/image_builder.py
  • packages/modal-infra/tests/test_image_builder_v2.py
  • packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py
  • packages/sandbox-runtime/tests/test_supervisor_monitor.py

Comment thread packages/modal-infra/src/scheduler/image_builder.py
@hreiten hreiten force-pushed the fix/repo-image-build-error-details branch from 3f5574b to d3a4ce4 Compare May 18, 2026 17:39
@hreiten hreiten force-pushed the fix/repo-image-build-error-details branch from d3a4ce4 to be9088d Compare May 18, 2026 17:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant