Search: skip oversize and binary files (closes #27)#51
Conversation
Add a per-file size cap (default ~10MB, configurable via the `maxFileBytes`
arg) and binary-file detection (NUL-quit, matching ripgrep) to the search
pipeline. Skipped files are surfaced in a `skipped` array so the agent
isn't surprised by missing hits.
- `SearcherBuilder::binary_detection(BinaryDetection::quit(b'\x00'))`.
- Size pre-check on `std::fs::metadata` before handing the path to the
searcher — avoids streaming a 500MB log just to discard it.
- `SearchOutput { hits, skipped }` replaces the bare `Vec<SearchHit>`
return so callers see both halves of the result.
- New `maxFileBytes` field on the tool schema (literal — cache-safe) and
a matching `maxFileBytes` knob on the per-repo profile.
- Tests for binary-skip, size-skip, and the disable-cap path.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds an optional per-file size cap and NUL-byte binary detection to search. Search now returns a SearchOutput containing hits and skipped files; profile and tool schemas expose/forward the max_file_bytes option and the tool reports skipped metadata in its JSON response. ChangesSearch file-size and binary-file caps
Sequence Diagram(s)sequenceDiagram
participant Tool as Search Tool (relaywash__Search)
participant Search as crate::search::run
participant Searcher as grep_searcher
participant FS as FileSystem
Tool->>Search: call run(SearchOpts { max_file_bytes, ... })
Search->>FS: stat(file) [check size vs max_file_bytes]
alt file exceeds size
FS-->>Search: metadata (size)
Search-->>Tool: record SkippedFile(reason: "size", bytes)
else file within size
Search->>Searcher: scan file (binary_detection = quit on NUL)
Searcher->>Search: binary_data or matches
alt binary detected
Search-->>Tool: record SkippedFile(reason: "binary")
else matches found
Search-->>Tool: return SearchHit snippets
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8546ff8210
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let max_file_bytes: Option<u64> = match args.get("maxFileBytes").and_then(|v| v.as_u64()) { | ||
| Some(0) => None, | ||
| Some(n) => Some(n), | ||
| None => Some(prof.max_file_bytes.unwrap_or(DEFAULT_MAX_FILE_BYTES)), |
There was a problem hiding this comment.
Treat profile maxFileBytes=0 as cap-disabled
When maxFileBytes is omitted in the request, this fallback wraps the profile value directly in Some(...), so a profile value of 0 becomes Some(0) instead of disabling the limit. In search::run, that makes meta.len() > 0 true for nearly every non-empty file, so searches silently return no hits unless the caller overrides the arg each time. This conflicts with the documented 0-means-disabled behavior and makes profile-based tuning brittle.
Useful? React with 👍 / 👎.
| let value = json!({ | ||
| "results": results, | ||
| "truncated": truncated, | ||
| "skipped": output.skipped, |
There was a problem hiding this comment.
Bound skipped entries to avoid oversized Search responses
The handler truncates results via maxResults but always returns the full output.skipped list, which can grow to thousands of entries on repos with many binary or oversize files. That reintroduces a large response surface (payload/token size and latency) even when hit output is capped, so the new guardrails can still produce oversized tool responses in common monorepo layouts.
Useful? React with 👍 / 👎.
Two fixes from Codex review on #51: - Profile `maxFileBytes: 0` was being wrapped in `Some(0)` and turned into a "skip everything > 0 bytes" guard, silently nuking all hits. Treat 0 from either source as cap-disabled. - Truncate `skipped` to `maxResults` and surface `skippedTotal` / `skippedTruncated` so a monorepo with thousands of vendored bundles can't reintroduce the oversized-response problem the cap exists to prevent.
| // profile. Omitted everywhere → static default (~10MB). | ||
| let max_file_bytes: Option<u64> = { | ||
| let raw = args | ||
| .get("maxFileBytes") |
There was a problem hiding this comment.
🟡 Profile maxFileBytes: 0 silently skips all non-empty files instead of disabling the cap
When the agent omits maxFileBytes, the code falls through to Some(prof.max_file_bytes.unwrap_or(DEFAULT_MAX_FILE_BYTES)) at crates/wash/src/tools/search.rs:68. If the profile has max_file_bytes: Some(0), this produces Some(0). In the search loop (crates/wash/src/search.rs:87), the guard meta.len() > limit with limit=0 is true for every non-empty file, so all files are skipped and zero search hits are returned. The tool API documents 0 as "disables the cap" (Some(0) => None at line 66), but that same "0 means disable" semantics isn't applied to the profile value, creating an inconsistency where a profile setting of 0 breaks search entirely rather than removing the size limit.
| .get("maxFileBytes") | |
| None => match prof.max_file_bytes { | |
| Some(0) | None => Some(DEFAULT_MAX_FILE_BYTES), | |
| Some(n) => Some(n), | |
| }, |
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
Addressed in 2b06797 — I chose the symmetric interpretation: 0 means "disabled" regardless of whether it comes from the request arg or the profile. The suggested patch makes profile 0 fall back to DEFAULT_MAX_FILE_BYTES, but that conflicts with the explicit-arg semantics where 0 means "no cap". Treating 0 as disabled everywhere keeps the contract uniform and lets a profile turn the cap off for repos that need it.
Generated by Claude Code
There was a problem hiding this comment.
1 issue found across 3 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/wash/src/tools/search.rs">
<violation number="1" location="crates/wash/src/tools/search.rs:68">
P2: Profile value `maxFileBytes: 0` is not treated as “disable”, causing nearly all non-empty files to be skipped.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.
Re-trigger cubic
| let max_file_bytes: Option<u64> = match args.get("maxFileBytes").and_then(|v| v.as_u64()) { | ||
| Some(0) => None, | ||
| Some(n) => Some(n), | ||
| None => Some(prof.max_file_bytes.unwrap_or(DEFAULT_MAX_FILE_BYTES)), |
There was a problem hiding this comment.
P2: Profile value maxFileBytes: 0 is not treated as “disable”, causing nearly all non-empty files to be skipped.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/wash/src/tools/search.rs, line 68:
<comment>Profile value `maxFileBytes: 0` is not treated as “disable”, causing nearly all non-empty files to be skipped.</comment>
<file context>
@@ -60,6 +61,12 @@ fn run(args: &Value) -> Result<ToolResult> {
+ let max_file_bytes: Option<u64> = match args.get("maxFileBytes").and_then(|v| v.as_u64()) {
+ Some(0) => None,
+ Some(n) => Some(n),
+ None => Some(prof.max_file_bytes.unwrap_or(DEFAULT_MAX_FILE_BYTES)),
+ };
let rank = args
</file context>
| None => Some(prof.max_file_bytes.unwrap_or(DEFAULT_MAX_FILE_BYTES)), | |
| None => match prof.max_file_bytes { | |
| Some(0) => None, | |
| Some(n) => Some(n), | |
| None => Some(DEFAULT_MAX_FILE_BYTES), | |
| }, |
There was a problem hiding this comment.
Same issue as the Devin and Codex threads above — already fixed in 2b06797. I went with the symmetric "0 = disabled" interpretation rather than the suggested "profile 0 → DEFAULT" so the contract is uniform across arg and profile (and a profile can actually turn the cap off when needed).
Generated by Claude Code
There was a problem hiding this comment.
Thanks for the feedback! I've saved this as a new learning to improve future reviews.
Closes #27.
Summary
Search now has the same blast-radius guards ripgrep ships with by default:
SearcherBuilder::binary_detection(BinaryDetection::quit(b'\x00'))— stop scanning a file on the first NUL byte, matching ripgrep's default.maxFileBytesarg onrelaywash__Search(default ~10MB, set0to disable). Files over the limit are detected viastd::fs::metadataand skipped before they're handed to the searcher, so a 500MB log isn't streamed line-by-line just to be discarded.skippedarray in the response. Each skipped file is reported with{ path, reason: "size" | "binary", bytes? }so the agent sees what was dropped instead of silently missing hits.tools.search.maxFileBytesadded toSearchDefaults— per-repo profiles can tune the cap without touching the schema literal (cache-safe).Files
crates/wash/src/search.rs—BinaryDetection::quit, size pre-check, newSearchOutput { hits, skipped }andSkippedFiletypes,binary_dataSink callback to flag binary files.crates/wash/src/tools/search.rs— newmaxFileBytesschema field,0semantics, response includesskipped.crates/wash/src/profile.rs—SearchDefaults::max_file_bytes(optional).Test plan
cargo test --lib— 91 pass, including 4 new tests (skips_binary_file_on_nul,skips_oversize_file,no_size_limit_when_none, plus the existing two updated for the new return shape).cargo test --release—tools_list_is_byte_stable_across_profilesstill passes (schema literal unchanged across profile values; cache invariant holds).cargo build --releaseclean.Out of scope
The issue suggests a bonus total-response-bytes cap. Deferring —
maxResults× the size cap already bounds output to a predictable order of magnitude, and a separate byte budget interacts with ranking/truncation in a way I didn't want to bolt on without a follow-up think.Generated by Claude Code