Search: skip oversize and binary files (closes #27) #51

Merged
willwashburn merged 2 commits into main from claude/fix-issue-27-nzfxv
May 25, 2026

@willwashburn
Member

Closes #27.

Summary

Search now has the same blast-radius guards ripgrep ships with by default:

  • Binary skip. SearcherBuilder::binary_detection(BinaryDetection::quit(b'\x00')) — stop scanning a file on the first NUL byte, matching ripgrep's default.
  • Per-file size cap. New maxFileBytes arg on relaywash__Search (default ~10MB, set 0 to disable). Files over the limit are detected via std::fs::metadata and skipped before they're handed to the searcher, so a 500MB log isn't streamed line-by-line just to be discarded.
  • skipped array in the response. Each skipped file is reported with { path, reason: "size" | "binary", bytes? } so the agent sees what was dropped instead of silently missing hits.
  • Profile knob. tools.search.maxFileBytes added to SearchDefaults — per-repo profiles can tune the cap without touching the schema literal (cache-safe).
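
The size pre-check in the second bullet can be sketched with the standard library alone. This is a minimal illustration, not the code from `crates/wash/src/search.rs`: the helper name `oversize_len` is hypothetical, and `None` is assumed to model the "cap disabled" case. The `len > limit` comparison follows the guard quoted later in the review threads.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns `Ok(Some(len))` when the file at `path` is larger than `limit`
/// bytes, and `Ok(None)` when it is within the cap or no cap is set.
/// `oversize_len` is a hypothetical name used only for this sketch.
pub fn oversize_len(path: &Path, limit: Option<u64>) -> io::Result<Option<u64>> {
    let limit = match limit {
        Some(limit) => limit,
        // `None` models the disabled cap (maxFileBytes = 0 in the tool args).
        None => return Ok(None),
    };
    // `metadata` only stats the file; its contents are never read.
    let len = fs::metadata(path)?.len();
    if len > limit {
        Ok(Some(len))
    } else {
        Ok(None)
    }
}
```

A caller would run this before handing the path to the searcher and record a `{ path, reason: "size", bytes }` entry whenever it returns `Some(len)`.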

Files

  • crates/wash/src/search.rs: BinaryDetection::quit, size pre-check, new SearchOutput { hits, skipped } and SkippedFile types, binary_data Sink callback to flag binary files.
  • crates/wash/src/tools/search.rs: new maxFileBytes schema field, 0 semantics, response includes skipped.
  • crates/wash/src/profile.rs: SearchDefaults::max_file_bytes (optional).

Test plan

  • cargo test --lib — 91 pass, including 4 new tests (skips_binary_file_on_nul, skips_oversize_file, no_size_limit_when_none, plus the existing two updated for the new return shape).
  • cargo test --release: tools_list_is_byte_stable_across_profiles still passes (schema literal unchanged across profile values; cache invariant holds).
  • cargo build --release clean.

Out of scope

The issue suggests a bonus total-response-bytes cap. Deferring — maxResults × the size cap already bounds output to a predictable order of magnitude, and a separate byte budget interacts with ranking/truncation in a way I didn't want to bolt on without a follow-up think.


Generated by Claude Code

Add a per-file size cap (default ~10MB, configurable via the `maxFileBytes`
arg) and binary-file detection (NUL-quit, matching ripgrep) to the search
pipeline. Skipped files are surfaced in a `skipped` array so the agent
isn't surprised by missing hits.

- `SearcherBuilder::binary_detection(BinaryDetection::quit(b'\x00'))`.
- Size pre-check on `std::fs::metadata` before handing the path to the
  searcher — avoids streaming a 500MB log just to discard it.
- `SearchOutput { hits, skipped }` replaces the bare `Vec<SearchHit>`
  return so callers see both halves of the result.
- New `maxFileBytes` field on the tool schema (literal — cache-safe) and
  a matching `maxFileBytes` knob on the per-repo profile.
- Tests for binary-skip, size-skip, and the disable-cap path.
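
To illustrate what "NUL-quit" means in practice, here is a standard-library sketch of the idea: scan a stream and stop at the first `0x00` byte. It is not how `grep_searcher` implements `BinaryDetection::quit`, and the function name `first_nul_offset` is an assumption made for this example.

```rust
use std::io::{self, BufRead};

/// Scans `reader` and returns the byte offset of the first NUL byte,
/// or `None` if the stream ends without one. Reading stops as soon as
/// a NUL is seen, so the rest of a binary file is never consumed.
pub fn first_nul_offset<R: BufRead>(mut reader: R) -> io::Result<Option<u64>> {
    let mut offset: u64 = 0;
    loop {
        let (found, consumed) = {
            let buf = reader.fill_buf()?;
            if buf.is_empty() {
                return Ok(None); // end of stream, no NUL byte found
            }
            (buf.iter().position(|&b| b == 0), buf.len())
        };
        if let Some(pos) = found {
            return Ok(Some(offset + pos as u64));
        }
        // No NUL in this chunk: mark it consumed and keep going.
        reader.consume(consumed);
        offset += consumed as u64;
    }
}
```

In the PR itself this decision is delegated to the searcher, which reports it through the `binary_data` Sink callback; the sketch only shows the stopping rule.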
@coderabbitai

coderabbitai Bot commented May 15, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 5f69181d-443f-4a5e-88ca-d315e05ceba8

📥 Commits

Reviewing files that changed from the base of the PR and between 8546ff8 and 2b06797.

📒 Files selected for processing (1)
  • crates/wash/src/tools/search.rs

📝 Walkthrough

Adds an optional per-file size cap and NUL-byte binary detection to search. Search now returns a SearchOutput containing hits and skipped files; profile and tool schemas expose/forward the max_file_bytes option and the tool reports skipped metadata in its JSON response.

Changes

Search file-size and binary-file caps

| Layer / File(s) | Summary |
| --- | --- |
| Profile configuration for max file size (crates/wash/src/profile.rs) | SearchDefaults gains an optional max_file_bytes field serialized as maxFileBytes, omitted when None. |
| Search API types and output contract (crates/wash/src/search.rs) | Adds DEFAULT_MAX_FILE_BYTES, SkippedFile, and SearchOutput; extends SearchOpts with max_file_bytes; run() now returns SearchOutput instead of Vec<SearchHit>. |
| Search execution with binary detection and size checks (crates/wash/src/search.rs) | run() configures ripgrep-style binary detection (quit on NUL), performs pre-scan metadata size checks, tracks binary state in HitSink, and populates skipped for binary/oversize files while returning hits for scanned files. |
| Search tests for binary and size skipping (crates/wash/src/search.rs) | Tests updated to consume SearchOutput; new tests cover binary-file skipping (reason: "binary"), oversize skipping (reason: "size" with byte count), and disabling size limits (max_file_bytes: None). |
| Tool schema and output integration (crates/wash/src/tools/search.rs) | Tool input schema adds maxFileBytes (0 → disable); run() derives max_file_bytes from args/profile/default, forwards it to search::run, and includes skipped, skippedTotal, and skippedTruncated in the response (skipped entries capped to maxResults). |

Sequence Diagram(s)

sequenceDiagram
  participant Tool as Search Tool (relaywash__Search)
  participant Search as crate::search::run
  participant Searcher as grep_searcher
  participant FS as FileSystem
  Tool->>Search: call run(SearchOpts { max_file_bytes, ... })
  Search->>FS: stat(file) [check size vs max_file_bytes]
  alt file exceeds size
    FS-->>Search: metadata (size)
    Search-->>Tool: record SkippedFile(reason: "size", bytes)
  else file within size
    Search->>Searcher: scan file (binary_detection = quit on NUL)
    Searcher->>Search: binary_data or matches
    alt binary detected
      Search-->>Tool: record SkippedFile(reason: "binary")
    else matches found
      Search-->>Tool: return SearchHit snippets
    end
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I thumped and sniffed through bytes and nul,
Now large logs stop where they pull,
Skipped are noted, hits still stay,
A small cap lights the searching way,
Hop, hop—safer scans from dusk to dull.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main changes: adding skip logic for oversize and binary files in search functionality, and references the closed issue. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, covering binary detection, per-file size caps, skipped reporting, and profile configuration. |
| Linked Issues check | ✅ Passed | The PR successfully implements all core requirements from issue #27: binary detection via BinaryDetection::quit, configurable per-file maxFileBytes with defaults, skipped file reporting with reasons, and comprehensive test coverage. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to issue #27 requirements. The PR explicitly defers the bonus total-response-bytes cap to future work, remaining focused on binary skipping and per-file size limits. |

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8546ff8210

Comment thread crates/wash/src/tools/search.rs Outdated
let max_file_bytes: Option<u64> = match args.get("maxFileBytes").and_then(|v| v.as_u64()) {
    Some(0) => None,
    Some(n) => Some(n),
    None => Some(prof.max_file_bytes.unwrap_or(DEFAULT_MAX_FILE_BYTES)),
};

P2 Badge Treat profile maxFileBytes=0 as cap-disabled

When maxFileBytes is omitted in the request, this fallback wraps the profile value directly in Some(...), so a profile value of 0 becomes Some(0) instead of disabling the limit. In search::run, that makes meta.len() > 0 true for nearly every non-empty file, so searches silently return no hits unless the caller overrides the arg each time. This conflicts with the documented 0-means-disabled behavior and makes profile-based tuning brittle.

Comment thread crates/wash/src/tools/search.rs Outdated
let value = json!({
    "results": results,
    "truncated": truncated,
    "skipped": output.skipped,

P2 Badge Bound skipped entries to avoid oversized Search responses

The handler truncates results via maxResults but always returns the full output.skipped list, which can grow to thousands of entries on repos with many binary or oversize files. That reintroduces a large response surface (payload/token size and latency) even when hit output is capped, so the new guardrails can still produce oversized tool responses in common monorepo layouts.

Two fixes from Codex review on #51:

- Profile `maxFileBytes: 0` was being wrapped in `Some(0)` and turned
  into a "skip everything > 0 bytes" guard, silently nuking all hits.
  Treat 0 from either source as cap-disabled.
- Truncate `skipped` to `maxResults` and surface `skippedTotal` /
  `skippedTruncated` so a monorepo with thousands of vendored bundles
  can't reintroduce the oversized-response problem the cap exists to
  prevent.
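
The second fix can be sketched as follows. The field names mirror the `skipped`, `skippedTotal`, and `skippedTruncated` keys named in the commit message, but the struct layouts and the helper `cap_skipped` are assumptions for illustration, not the types defined in the PR.

```rust
/// One skipped file, shaped after the `{ path, reason, bytes? }` entries
/// described in the PR summary. The concrete field types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct SkippedFile {
    pub path: String,
    pub reason: &'static str, // "size" or "binary"
    pub bytes: Option<u64>,   // only present for oversize files
}

/// Hypothetical container for the three response fields.
#[derive(Debug)]
pub struct SkippedReport {
    pub skipped: Vec<SkippedFile>,
    pub skipped_total: usize,
    pub skipped_truncated: bool,
}

/// Keeps at most `max_results` entries while preserving the full count,
/// so the caller can tell that more files were skipped than are listed.
pub fn cap_skipped(mut all: Vec<SkippedFile>, max_results: usize) -> SkippedReport {
    let skipped_total = all.len();
    let skipped_truncated = skipped_total > max_results;
    all.truncate(max_results);
    SkippedReport {
        skipped: all,
        skipped_total,
        skipped_truncated,
    }
}
```

Reporting the total alongside the truncated list keeps the response bounded without hiding how many files were actually dropped.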

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 potential issue.

// profile. Omitted everywhere → static default (~10MB).
let max_file_bytes: Option<u64> = {
    let raw = args
        .get("maxFileBytes")

🟡 Profile maxFileBytes: 0 silently skips all non-empty files instead of disabling the cap

When the agent omits maxFileBytes, the code falls through to Some(prof.max_file_bytes.unwrap_or(DEFAULT_MAX_FILE_BYTES)) at crates/wash/src/tools/search.rs:68. If the profile has max_file_bytes: Some(0), this produces Some(0). In the search loop (crates/wash/src/search.rs:87), the guard meta.len() > limit with limit=0 is true for every non-empty file, so all files are skipped and zero search hits are returned. The tool API documents 0 as "disables the cap" (Some(0) => None at line 66), but that same "0 means disable" semantics isn't applied to the profile value, creating an inconsistency where a profile setting of 0 breaks search entirely rather than removing the size limit.

Suggested change
.get("maxFileBytes")
None => match prof.max_file_bytes {
Some(0) | None => Some(DEFAULT_MAX_FILE_BYTES),
Some(n) => Some(n),
},

Member Author

Addressed in 2b06797 — I chose the symmetric interpretation: 0 means "disabled" regardless of whether it comes from the request arg or the profile. The suggested patch makes profile 0 fall back to DEFAULT_MAX_FILE_BYTES, but that conflicts with the explicit-arg semantics where 0 means "no cap". Treating 0 as disabled everywhere keeps the contract uniform and lets a profile turn the cap off for repos that need it.


Generated by Claude Code
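
The symmetric "0 = disabled" rule described in the reply above can be written as a small pure function. This is a sketch under stated assumptions rather than the code merged in 2b06797: the function name `resolve_max_file_bytes` is hypothetical, and the exact value of `DEFAULT_MAX_FILE_BYTES` is assumed here because the PR only says "~10MB".

```rust
/// Assumed value for illustration; the PR describes the default as "~10MB".
pub const DEFAULT_MAX_FILE_BYTES: u64 = 10 * 1024 * 1024;

/// Resolves the effective per-file cap.
///
/// Precedence: request arg, then profile value, then the static default.
/// A value of 0 from either source disables the cap (`None`).
pub fn resolve_max_file_bytes(arg: Option<u64>, profile: Option<u64>) -> Option<u64> {
    match arg.or(profile) {
        Some(0) => None,                      // explicit "no cap"
        Some(n) => Some(n),                   // explicit cap
        None => Some(DEFAULT_MAX_FILE_BYTES), // nothing configured anywhere
    }
}
```

Because the zero check runs after the arg/profile merge, a request-level 0 still disables the cap even when the profile sets a non-zero limit, and a profile-level 0 no longer turns into a "skip everything larger than 0 bytes" guard.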

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 3 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/wash/src/tools/search.rs">

<violation number="1" location="crates/wash/src/tools/search.rs:68">
P2: Profile value `maxFileBytes: 0` is not treated as “disable”, causing nearly all non-empty files to be skipped.</violation>
</file>


Comment thread crates/wash/src/tools/search.rs Outdated
let max_file_bytes: Option<u64> = match args.get("maxFileBytes").and_then(|v| v.as_u64()) {
    Some(0) => None,
    Some(n) => Some(n),
    None => Some(prof.max_file_bytes.unwrap_or(DEFAULT_MAX_FILE_BYTES)),
};

@cubic-dev-ai cubic-dev-ai Bot May 15, 2026

P2: Profile value maxFileBytes: 0 is not treated as “disable”, causing nearly all non-empty files to be skipped.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/wash/src/tools/search.rs, line 68:

<comment>Profile value `maxFileBytes: 0` is not treated as “disable”, causing nearly all non-empty files to be skipped.</comment>

<file context>
@@ -60,6 +61,12 @@ fn run(args: &Value) -> Result<ToolResult> {
+    let max_file_bytes: Option<u64> = match args.get("maxFileBytes").and_then(|v| v.as_u64()) {
+        Some(0) => None,
+        Some(n) => Some(n),
+        None => Some(prof.max_file_bytes.unwrap_or(DEFAULT_MAX_FILE_BYTES)),
+    };
     let rank = args
</file context>
Suggested change
-        None => Some(prof.max_file_bytes.unwrap_or(DEFAULT_MAX_FILE_BYTES)),
+        None => match prof.max_file_bytes {
+            Some(0) => None,
+            Some(n) => Some(n),
+            None => Some(DEFAULT_MAX_FILE_BYTES),
+        },

Member Author

Same issue as the Devin and Codex threads above — already fixed in 2b06797. I went with the symmetric "0 = disabled" interpretation rather than the suggested "profile 0 → DEFAULT" so the contract is uniform across arg and profile (and a profile can actually turn the cap off when needed).


Generated by Claude Code

Thanks for the feedback! I've saved this as a new learning to improve future reviews.

@willwashburn willwashburn merged commit ef52707 into main May 25, 2026
4 checks passed

Development

Successfully merging this pull request may close these issues.

Search has no file-size or binary-file cap

2 participants