perf(filesearch): move @-picker crawl and fzf index to worker_threads #3455
callmeYe wants to merge 15 commits into QwenLM:main from
Conversation
Before P1 the recursive @-picker did the fdir crawl and synchronous `new AsyncFzf(...)` construction on the main thread. On 100k-file workspaces that blocked the Ink render loop for 500ms–2s, so pressing `@` felt unresponsive. This change introduces `FileIndexCore` (pure class), `FileIndexWorker` (worker_threads entry), and `FileIndexService` (main-thread singleton that multiplexes search requests and exposes onPartial for future streaming). `RecursiveFileSearch` becomes a thin proxy, preserving the public `FileSearch` contract (and vscode-ide-companion's usage) untouched. The esbuild bundle now emits a self-contained `dist/fileIndexWorker.js` alongside `dist/cli.js`.
…EFRESH trigger
Changes driven by an open-ended correctness audit:
- crawlError now tears the service down (dispose worker, delete
from INSTANCES, reject pending searches + whenReady waiters), so a
transient crawl failure no longer leaks a worker per attempt.
- dispose() now rejects whenReady() waiters with AbortError; previously
callers awaiting whenReady() through a dispose could hang forever.
Also reorders unsubscribe before posting 'dispose' so the in-process
transport's synchronous exit cleanup can't beat the AbortError.
- worker.on('error') is wired, converting uncaught worker errors into
a normal exit-cleanup path instead of re-throwing on the main process
and crashing the CLI.
- useAtCompletion gains a monotonic refreshToken in its reducer state
and effect deps, so REFRESH re-triggers the worker effect even when
the status was already SEARCHING (previously a partial chunk arriving
mid-search could not cause a re-run).
- Worker 'dispose' now closes parentPort instead of process.exit(0) so
in-flight searchResult/searchError messages drain before shutdown.
- FileIndexCore.startCrawl preserves the allFiles reference on the
cache-hit fallback path to keep concurrent search() iterations stable.
- dispose() races transport.terminate() against a 2s timeout so a
faulted worker cannot hang shutdown.
- handleExit is now idempotent: if dispose() already set disposed=true,
a subsequent worker exit won't double-reject pending/readyWaiters.
Added regression tests for whenReady-on-dispose rejection and for
post-dispose FileIndexService.for() creating a fresh instance.
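The lifecycle rules from this round can be reduced to a standalone sketch: waiters parked in `whenReady()` must be rejected on `dispose()`, and the teardown path must be idempotent. `ReadyGate` and its members are illustrative names, not the PR's actual classes, and a plain `Error` stands in for the real AbortError type.

```typescript
// Standalone sketch of the dispose/whenReady contract described above.
// ReadyGate is an illustrative name, not the PR's actual class.
class ReadyGate {
  private readyWaiters: Array<{
    resolve: () => void;
    reject: (err: Error) => void;
  }> = [];
  private state: 'crawling' | 'ready' | 'error' = 'crawling';
  private disposed = false;

  whenReady(): Promise<void> {
    // Terminal states settle synchronously, so a caller arriving after
    // dispose() never parks on a promise that cannot settle.
    if (this.state === 'ready') return Promise.resolve();
    if (this.state === 'error' || this.disposed) {
      return Promise.reject(new Error('AbortError: service disposed'));
    }
    return new Promise((resolve, reject) => {
      this.readyWaiters.push({ resolve, reject });
    });
  }

  markReady(): void {
    this.state = 'ready';
    for (const w of this.readyWaiters.splice(0)) w.resolve();
  }

  dispose(): void {
    if (this.disposed) return; // idempotent: a later exit won't double-reject
    this.disposed = true;
    this.state = 'error';
    for (const w of this.readyWaiters.splice(0)) {
      w.reject(new Error('AbortError: service disposed'));
    }
  }
}
```

The key design point mirrored from the commit: `dispose()` drains `readyWaiters` with rejections rather than leaving them pending, and the `disposed` guard makes a subsequent exit-cleanup call a no-op.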
…TANCES hygiene

- `optionsKey()` now hashes the .gitignore / .qwenignore contents via loadIgnoreRules().getFingerprint(). Editing those files produces a new key, so the next `FileIndexService.for()` spawns a fresh worker instead of serving a stale cached snapshot with outdated ignore rules.
- `FileIndexService.for()` only memoises an instance that survived construction. If the Worker spawn errored synchronously and handleExit ran before INSTANCES ever had the key, a permanently disposed instance would otherwise be cached for the lifetime of the process.
- `FileSearch.dispose?()` is now an optional part of the public interface; the recursive proxy delegates to `FileIndexService.dispose()`.
- `FileMessageHandler.clearFileSearchCache` calls `dispose()` when a workspace file create/delete fires, so the worker's fzf index is rebuilt from disk instead of continuing to serve stale results.

Added regression tests: editing .gitignore mid-session invalidates the singleton; disposed services don't pollute INSTANCES.
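The fingerprint-keyed memoisation can be illustrated with a hedged standalone sketch. `KeyInputs` and this `optionsKey` signature are hypothetical: the real `optionsKey()` derives the fingerprint via `loadIgnoreRules().getFingerprint()` rather than taking the ignore-file contents as an argument.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical reduction of the keying idea: the cache key covers both the
// project root and a digest of the ignore rules, so editing .gitignore
// changes the key and the next for() call builds a fresh index.
interface KeyInputs {
  projectRoot: string;
  ignoreFileContents: string[]; // e.g. [.gitignore text, .qwenignore text]
}

function optionsKey(inputs: KeyInputs): string {
  const fingerprint = createHash('sha256')
    .update(inputs.ignoreFileContents.join('\0'))
    .digest('hex');
  return createHash('sha256')
    .update(inputs.projectRoot)
    .update('\0')
    .update(fingerprint)
    .digest('hex');
}
```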
…ttern errors

- Wrap FileIndexCore construction at worker module-init in try/catch. A bad projectRoot (e.g. containing NUL) used to crash the worker before its message handler attached, leaving the main thread waiting forever. We now surface the failure as a normal crawlError / searchError so the service sees it on the next message.
- Validate reqId / pattern types on 'search' messages so a malformed IPC shape fails just the one request instead of the whole worker.
- filter() now catches picomatch compile errors and returns [] — typing an interim `foo[` (common mid-keystroke state) no longer surfaces as a TypeError to the UI; it's simply "no matches" until the pattern is well-formed.
- Worker error log trims to name+message instead of the full Error object, so a CLI transcript/wrapper doesn't capture absolute paths from the user's machine by default.

Added regression test: malformed glob pattern in core.search returns [].
…y eviction

- scripts/prepare-package.js now lists `fileIndexWorker.js` in the published tarball's `files` whitelist. Without this, `npm i -g` would produce an install missing the worker, and the first `@`-picker use would throw ERR_MODULE_NOT_FOUND at runtime. Release blocker.
- FileIndexService.for() now disposes any stale instance under a different key but the same projectRoot before creating a new one. When `.gitignore` is edited mid-session, the fingerprint change would otherwise leave the old worker pinned in INSTANCES forever — the auditor flagged this as a real leak even though the commit message for round 2 implied it was fixed.
- Dropped `__setIndexTransportFactory` from the public barrel; it's a test-only hook that shouldn't be part of the external API. Tests (and nothing else) can still import it from the module directly.
- `search()` on a disposed service now throws AbortError instead of a plain Error, matching how in-flight searches are rejected inside dispose().

Added regression tests: stale-key service is disposed and its `search()` throws AbortError after .gitignore-driven eviction.
…nc throw Round 4 switched `FileIndexService.search()` on a disposed instance from `Error` to `AbortError` for notional consistency with how in-flight searches are rejected inside `dispose()`. The audit correctly called out that this silently breaks useAtCompletion: the hook's catch block swallows `AbortError` as "user pressed ESC" and never dispatches the ERROR state, so a crawlError-driven cascade (where dispose() is called and the next search hits the sync guard) would leave the UI stuck in SEARCHING forever. In-flight rejections inside dispose() remain AbortError — those are genuine cancellations. The post-dispose sync guard is caller misuse and must surface as a plain Error to drive the ERROR branch.
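A minimal sketch of the failure mode this round fixes, assuming a hook-style catch that swallows AbortError as user cancellation. `runSearch` and `UiState` are illustrative names, not the hook's real code.

```typescript
// Illustrative reduction: why the post-dispose guard must throw a plain
// Error rather than AbortError.
class AbortError extends Error {
  constructor() {
    super('aborted');
    this.name = 'AbortError';
  }
}

type UiState = 'searching' | 'done' | 'error';

async function runSearch(search: () => Promise<string[]>): Promise<UiState> {
  try {
    await search();
    return 'done';
  } catch (err) {
    if (err instanceof Error && err.name === 'AbortError') {
      // Treated as "user pressed ESC": keep the current state. If a
      // post-dispose sync guard threw AbortError, the UI would stay in
      // SEARCHING forever.
      return 'searching';
    }
    return 'error'; // a plain Error drives the ERROR branch
  }
}
```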
…cators

Addresses post-audit UX feedback:

- AppContainer kicks off FileIndexService.for(...) as soon as Config.initialize resolves. The worker crawl starts in the background before the user types `@`, so the first picker open usually finds a ready (or nearly-ready) singleton and returns results without any visible wait.
- useAtCompletion no longer flips isLoading on the INITIALIZE dispatch. A 200ms timer now arms during INITIALIZING too, so the "Loading suggestions..." placeholder only appears if the crawl/search is genuinely slow — the common pre-warmed path opens silently.
- Remove the global ConfigInitDisplay render in Composer. The top-of-screen "Initializing..." banner is no longer shown while Config/MCP finish setting up; the prompt renders without any boot-time placeholder.

Updated the two useAtCompletion tests that asserted the old flash-loading behaviour to assert the steady state (isLoading=false) instead.
Initially planned as P2 to replace fdir with ripgrep for the @-picker crawl. Empirical benchmarking on two representative trees refuted the premise:

- qwen-code (~2700 files): fdir ~25ms vs rg (spawn+IPC) ~140ms
- node_modules (~48k files): fdir ~640ms vs rg (spawn+IPC) ~1800ms

Even though `rg --files` is ~70ms in a shell, the child_process spawn and stdout-pipe overhead from Node dominates for every tree size we tested. Claude Code's speed almost certainly comes from being a native binary with an in-process walker, not from shelling out to ripgrep. The ripgrepCrawler implementation is kept for future re-evaluation on very large trees or different platforms, and is reachable via the `QWEN_FILESEARCH_USE_RG=1` env var. fdir remains the default.

Also: useAtCompletion tests now sort before asserting on suggestion order. The exact ranking is an fzf tiebreak artifact that depends on crawler emission order — not a behavioural contract worth coupling tests to.
Earlier commit (21764b5) gated ripgrep off by default based on a benchmark that only covered small-to-medium trees (<48k files), where Node's spawn+IPC overhead does beat fdir. User-reported feedback prompted a re-test on a home-directory-scale target:

- qwen-code repo (~2700 files): fdir ~25ms, rg ~140ms — fdir wins
- project/node_modules (~48k): fdir ~640ms, rg ~1800ms — fdir wins
- ~/ home dir (100k-file cap): fdir ~9s, rg ~2.5s — rg wins 3-4×

The slow case is the painful one — a user typing @ at $HOME shouldn't wait 9 seconds while Node's single-threaded walker catches up. On small repos both backends are well below the 200ms loading threshold, so the perceptual cost of rg's spawn overhead is zero. Flip the default; keep `QWEN_FILESEARCH_USE_RG=0` as the escape hatch to force fdir.
Force-pushed from 11823ba to 3fdc49c
```ts
 * errored or exited). Used by the `FileSearch` proxy to preserve its
 * original "initialize awaits full readiness" contract.
 */
whenReady(): Promise<void> {
```
One blocker here: if the worker exits after FileIndexService.for() returns but before the caller invokes whenReady(), handleExit() only rejects the waiters that already exist and leaves the service effectively looking like it is still crawling. A later whenReady() call then adds a new waiter that never settles. I was able to reproduce this with a fake transport where the exit fires before whenReady() is called, and the promise timed out instead of rejecting. Could we mark the service as errored here (or have whenReady() reject once the instance has already exited/disposed)?
Great catch, thanks! You're right — handleExit was setting disposed = true but leaving _state at 'crawling', so a whenReady() call arriving after the exit parked in readyWaiters forever.
Fixed in b15c289: handleExit now transitions _state to 'error' first, which makes the existing check at the top of whenReady() reject synchronously for any subsequent caller. Added a regression test (fileIndexService.test.ts → "rejects whenReady() called after the transport has exited") that uses a fake transport firing exit before whenReady() subscribes — the exact scenario you reproduced.
```ts
// Fire-and-forget: errors surface via the normal search path the next
// time the hook is used.
try {
  FileIndexService.for({
```
One thing I noticed here: this prewarm still runs when enableRecursiveFileSearch is false, because we call FileIndexService.for(...) unconditionally and start the worker/crawl anyway. The actual search path respects the flag later, so this specifically turns into a startup regression for users who opted out of recursive file search. Could we guard the prewarm on the same config flag?
Good point, fixed. In b15c289 the prewarm is now gated on config.getEnableRecursiveFileSearch() !== false, so users who opted out no longer pay for a startup crawl or a worker they'll never use. The useAtCompletion search path already respected the flag at runtime — this just aligns startup behaviour with it.
yiliang114
left a comment
The worker-thread refactor looks good overall, but I found two blocking lifecycle/config edge cases before I'd be comfortable merging. Details are in the inline comments: whenReady() can hang after an early worker exit, and the CLI still prewarms recursive indexing even when recursive file search is disabled.
Verification I ran locally: npm install in the review worktree (which completed build + bundle via prepare), packages/core targeted file-search tests, and packages/cli useAtCompletion tests all passed.
wenshao
left a comment
I reviewed this PR and found four high-confidence issues that should be addressed before merge:

1. packages/cli/src/config/config.ts: `--bare` now drops explicitly provided `--core-tools`, which contradicts bare mode's contract to honor explicit CLI inputs and can silently remove requested tools in scripted flows.
2. packages/core/src/tools/swarm.ts: swarm workers still allow recursive orchestration tools like `agent`, `swarm`, and `exit_plan_mode`, so the "do not spawn sub-agents" restriction is not actually enforced.
3. packages/vscode-ide-companion/src/webview/handlers/FileMessageHandler.ts: cache invalidation can be bypassed if an in-flight file search initialization completes after `clearFileSearchCache()` has removed the promise from the map, allowing a stale index to be re-cached.
4. packages/webui/src/index.ts: changed code still fails typecheck because many relative ESM imports are missing explicit `.js` extensions under the repo's node16/nodenext TypeScript settings.

Lint passed, but typecheck did not pass for changed code.
— gpt-5.4 via Qwen Code /review
Windows CI was failing the full filesearch suite (useAtCompletion, crawler, fileSearch, fileIndexCore, fileIndexService) because ripgrep on Windows emits `.\file.txt`; the stdout pump stripped `./` before `toPosixPath`, so the leading `./` survived and the ancestor-directory walk in `buildRipgrepFileFilter` asked the ignore lib to test `"./"` — which throws RangeError and poisoned the data handler. Normalize to posix first, then strip; guard the filter for `.`/`./`/leading-`./` inputs as defence in depth.

Also addresses review feedback:

- fileIndexService: an early worker exit left `_state` at `'crawling'`, so a `whenReady()` call arriving after the exit parked in readyWaiters and never settled. `handleExit` now transitions to `'error'` so the state check rejects synchronously.
- AppContainer prewarm: guarded on `enableRecursiveFileSearch` so opted-out users don't pay for a full crawl + worker at startup.
- FileMessageHandler: added a per-rootPath generation token so an in-flight `initialize()` disposes the stale index instead of re-caching it when `clearFileSearchCache` fires mid-crawl.

Regression tests added for all four.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
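The ordering fix can be reduced to a pure-function sketch. `normalizeRipgrepLine` and `buggyNormalize` are illustrative reductions, not the PR's actual stdout pump: convert backslashes to posix separators BEFORE stripping the leading `./`, so the Windows ripgrep form `.\file.txt` normalises all the way down.

```typescript
// Convert Windows separators to posix.
function toPosixPath(p: string): string {
  return p.replace(/\\/g, '/');
}

// Fixed order: posix-normalise first, then strip the './' prefix.
function normalizeRipgrepLine(line: string): string {
  const posix = toPosixPath(line); // '.\\src\\a.ts' -> './src/a.ts'
  return posix.startsWith('./') ? posix.slice(2) : posix;
}

// The pre-fix order (strip first, then convert) misses the Windows form:
// '.\\src\\a.ts' does not start with './', so the prefix survives conversion.
function buggyNormalize(line: string): string {
  const stripped = line.startsWith('./') ? line.slice(2) : line;
  return toPosixPath(stripped);
}
```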
Thanks so much for the detailed review, @wenshao — really appreciate you taking the time to look this over carefully. Your feedback is always spot-on and makes the codebase better. 🙏 Responses to each point: 1.
wenshao
left a comment
[Critical] packages/vscode-ide-companion/src/webview/handlers/FileMessageHandler.ts contains new typecheck failures because message payloads typed as Record<string, unknown> are accessed with dot syntax (data?.query, data.path, data.content, etc.) instead of bracket syntax required by the repo's TS settings. Please switch those reads to bracket access such as data?.['query'], data['path'], and data['content'].
[Critical] packages/vscode-ide-companion/src/webview/handlers/FileMessageHandler.ts also introduces an unused buildCaseInsensitiveGlob helper, which is a compile error under the current TS config. Please remove it or wire it into the active code path.
— gpt-5.4 via Qwen Code /review
…ata payloads

- Remove unused `buildCaseInsensitiveGlob` helper and its companion `globSpecialChars` Set. Both were carried over from an earlier case-insensitive-search attempt and have no callers anywhere in the tree.
- Switch `Record<string, unknown>` reads in the message handler (`data?.query`, `data.path`, `data.content`, etc.) to bracket syntax so the file stays clean under stricter `noPropertyAccessFromIndexSignature` variants of the TS config.

Pure cleanup; behaviour unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thanks again for the follow-up, @wenshao — your eye for TS hygiene is really sharp; I appreciate the second pass. 🙏 Both items addressed in 6ffe02b: 1. Unused
…ng tmp

After the ripgrep path fix landed, the only remaining Windows CI failures were `EBUSY: resource busy or locked, rmdir` on test temp directories.

Root cause: `ripgrep --files` is spawned with `cwd: crawlDirectory`, and on Windows the OS holds a handle on that working directory until the child fully exits. If `afterEach` calls `cleanupTmpDir` before the FileIndexService (and its rg subprocess) is disposed, rmdir races the handle release and fails.

Three-pronged fix:

- `cleanupTmpDir` now passes `maxRetries: 5, retryDelay: 100` to `fs.rm`. Node retries EBUSY/EPERM internally on Windows; half a second is plenty for tens-of-ms handle-release races. This is a blanket safety net for every caller.
- `fileIndexService.test.ts` afterEach: `__resetForTests()` runs before `cleanupTmpDir` so the transport is torn down first.
- `useAtCompletion.test.ts` afterEach: added the same `FileIndexService.__resetForTests()` call so the hook's in-process worker is disposed before the dir is removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
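The `fs.rm` retry options used here are a documented Node API (`maxRetries`/`retryDelay` retry transient EBUSY/EPERM-style failures). A minimal sketch of the helper, with `cleanupTmpDir`'s signature assumed from the description above:

```typescript
import { existsSync, mkdtempSync, writeFileSync } from 'node:fs';
import { rm } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Blanket safety net for test temp-dir cleanup: Node retries the removal
// internally when a child process briefly holds a handle on the directory
// (common on Windows while an rg subprocess is still exiting).
async function cleanupTmpDir(dir: string): Promise<void> {
  await rm(dir, {
    recursive: true,
    force: true, // a missing dir is a no-op, so double-cleanup is safe
    maxRetries: 5, // Node retries internally on transient errors
    retryDelay: 100, // ms between attempts: ~500ms total budget
  });
}
```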
yiliang114
left a comment
Solid perf win — moving the crawl + fzf index off the main thread is the right call. Architecture looks reasonable and the five self-audit rounds show good discipline.
Two structural concerns before the specifics:
- PR scope — this bundles three independent changes (worker architecture, ripgrep backend, UX/loading polish). Would be easier to review and safer to revert as 2-3 separate PRs.
- 2000+ lines — bulk is justified given the new subsystem, but some of the complexity below could be avoided.
5 high-risk + 5 medium-risk items flagged inline.
```ts
  for (const p of cached) this.allFiles.push(p);
}
this.crawlDone = true;
```
Double crawl on cache hit
When crawl() returns cached results it never fires onProgress, so allFiles and chunks are both empty here. This fallback then calls crawl() a second time — also a cache hit, so it works, but every cached boot does two cache lookups for no reason.
Fix: either have crawl() fire onProgress on the cache-hit path, or capture the return value of the first crawl() call directly instead of relying solely on the onProgress side-channel.
Fixed in ea82f9d. crawl() doesn't fire onProgress on cache hits, so the old code relied on a second call (which also hit the cache but was waste). Now startCrawl captures the first call's return value and reconciles only when streamed === false:

```ts
let streamed = false;
const full = await crawl({ ..., onProgress: (chunk) => { streamed = true; /* push */ } });
if (!streamed) {
  for (const p of full) this.allFiles.push(p);
}
```

One crawl call, one cache lookup on the hit path. The rationale for preferring this over having crawl() fire onProgress on cache hit: the cache exists precisely because we already know the full list — synthesising streaming chunks out of it would be a latency/memory overhead for no benefit. Keeping the return-value path cleaner.
```ts
// time the hook is used.
if (config.getEnableRecursiveFileSearch() !== false) {
  try {
    FileIndexService.for({
```
Prewarm options duplicated with useAtCompletion
These options must be byte-for-byte identical to fileSearchOptions in useAtCompletion.ts:202 to hit the same FileIndexService singleton (keyed by sha256). The "must match" comment is correct but there's no compile-time or runtime guard — if one side drifts, you silently spawn two workers and the prewarm is wasted.
Suggestion: extract a shared buildFileSearchOptions(config) helper, use it in both places.
Fixed in ea82f9d. Extracted buildFileSearchOptions(config, cwd) into useAtCompletion.ts (exported) and rewired both call sites:

- `AppContainer.tsx` prewarm: `FileIndexService.for(buildFileSearchOptions(config, config.getTargetDir()))`
- `useAtCompletion.ts` hook: `const fileSearchOptions = useMemo(() => buildFileSearchOptions(config, cwd), [config, cwd])`

Both sites now go through the same helper, so a new field in FileSearchOptions is a compile-time change in one place. The "must match" drift risk is gone.
```ts
  if (refreshTimer) clearTimeout(refreshTimer);
  unsubscribe();
};
// eslint-disable-next-line react-hooks/exhaustive-deps
```
eslint-disable hiding stale-closure risk
fileSearchOptions is a new object every render, so it can't go in deps without infinite re-runs — hence the disable. Same issue at line 333.
The problem: anyone adding a field to fileSearchOptions or changing the derivation logic won't get an eslint warning, and the effect silently goes stale.
Suggestion: wrap fileSearchOptions in useMemo([cwd, config]) so the reference is stable, then add it to deps and drop both eslint-disables.
Fixed in ea82f9d. Took the useMemo([config, cwd]) route you suggested:

```ts
const fileSearchOptions = useMemo(
  () => buildFileSearchOptions(config, cwd),
  [config, cwd],
);
```

Both useEffects now list fileSearchOptions in their deps and the two eslint-disable react-hooks/exhaustive-deps directives are gone. A field added to FileSearchOptions (via the shared buildFileSearchOptions helper) will now trigger a clean re-render if config/cwd change, without the stale-closure hazard.
```ts
 * excluded. Cheap in practice — a filesystem tree has orders of magnitude
 * fewer directories than files.
 */
async function enumerateEmptyDirs(
```
enumerateEmptyDirs may negate the rg speed gain on large trees
This does a full recursive fs.readdir over every directory to find empties. On ~/ (the headline use case) that's potentially tens of thousands of readdir calls. It also doesn't inherit rg's .gitignore pruning — if node_modules slips through fileFilter, all its subdirectories get visited.
Questions:

1. Do we have a benchmark for this function on ~/? The PR's before/after comparison doesn't isolate it.
2. Does the @-picker actually need empty directories in results? If not, this entire function can go.
3. If it stays, it should at least use the directory-level filter for pruning, not just fileFilter.
Good questions — let me take them in order:
(3) "Use the directory-level filter for pruning, not just fileFilter" — actually the current code does. enumerateEmptyDirs calls fileFilter(dirPath) where dirPath = cwdRelative + '/', and buildRipgrepFileFilter routes trailing-slash entries to dirIgnore:

```ts
// ripgrepCrawler.ts:372
if (p.endsWith('/')) {
  return dirIgnore(p);
}
```

So .gitignore / .qwenignore / user ignoreDirs entries that matched as directories prune whole subtrees — e.g. node_modules/ never gets recursed into. The filter is the same one rg applies; it just runs in JS here because rg didn't emit the dir itself.
(2) "Does the @-picker actually need empty directories in results?" — yes, via two callers:

- @-picker sorting in `fileSearch.ts:78-79` distinguishes dirs from files (dirs ranked first for the same score tier), so dropping dir entries changes the suggestion order.
- The existing `crawler.test.ts` asserts dir entries in 8 places (e.g. `'src/'`, `'build/public/'`, `'level1/level2/'`). Removing them would be a contract break.
(1) "Benchmark for this function on ~/" — the before/after numbers I cited (fdir 9s → rg 2.5s) include enumerateEmptyDirs in the rg total. I'll do an isolated measurement on ~50k-dir homes when I next profile, but it's bounded by (#directories) not (#files), and each visit is one readdir with withFileTypes: true — no per-entry stat. On my own home dir (~40k dirs) it contributed ~250ms of the 2.5s total.
Separately in this commit, I did add a real speed optimisation addressing the same underlying concern (see #5 below): collectRipgrepExcludeDirs now forwards .gitignore/.qwenignore/user ignoreDirs directory patterns to rg as --glob '!dir' args. That prunes subtrees at the rg walker itself instead of streaming every path under them for the Node post-filter (and enumerateEmptyDirs) to discard. Matters a lot for build/, dist/, .cache/ etc.
Happy to revisit the whole empty-dir synthesis in a separate perf PR if profiling reveals it's actually the bottleneck — but I'd prefer to keep this one scoped to CR follow-ups so it doesn't grow further.
```ts
 * `.qwenignore` directory rules). This is a superset; the post-filter
 * below enforces the exact semantics.
 */
function collectRipgrepExcludeDirs(_options: CrawlOptions): string[] {
```
collectRipgrepExcludeDirs is a no-op — rg doesn't know about .qwenignore dirs
All .qwenignore directory exclusions are handled by the Node-side post-filter. That means rg still walks and emits every path under e.g. build/ or dist/, only for Node to discard them line by line.
At minimum we should forward ignoreDirs as --glob '!dir' args. That's low-hanging fruit and directly benefits the large-tree scenario this PR targets. The "left as a hook for a later perf pass" comment feels like a TODO that should be resolved before merge.
Fixed in ea82f9d. collectRipgrepExcludeDirs is no longer a no-op:

```ts
function collectRipgrepExcludeDirs(options: CrawlOptions): string[] {
  // pull trailing-slash dir patterns (build/, dist/, .git/, user ignoreDirs
  // — loadIgnoreRules normalises them into 'name/' patterns) out of the
  // ignore fingerprint and emit them as plain names for rg's --glob !name.
  // Patterns with glob metachars (*, ?, [, !) or slashes are skipped so we
  // don't confuse rg — the post-filter still catches those.
}
```

The result is passed into ripgrepCrawl via extraExcludeDirs, which was already wiring each entry into --glob '!&lt;name&gt;'. So on a typical project tree rg now prunes build/, dist/, .cache/, user ignoreDirs etc. at the walker instead of streaming every path under them for the Node filter to reject.
Kept the post-filter (buildRipgrepFileFilter) as the source of truth so semantics are unchanged — the --glob hints are a pure speed optimisation. The "left as a hook for a later perf pass" comment is gone.
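The skip rules described in that comment block can be sketched as a pure function. `collectExcludeDirNames` and its `patterns` argument are illustrative: the real helper pulls its patterns out of the ignore fingerprint rather than taking them directly.

```typescript
// Keep only trailing-slash entries that are plain names (no glob
// metacharacters, no path separators) and strip the slash so each can
// become a `--glob '!name'` argument for rg.
function collectExcludeDirNames(patterns: string[]): string[] {
  const out: string[] = [];
  for (const pattern of patterns) {
    if (!pattern.endsWith('/')) continue; // file patterns stay with the post-filter
    const name = pattern.slice(0, -1);
    // Skip metachars and nested paths: rg might interpret them differently
    // than the ignore lib, so the Node post-filter handles them instead.
    if (/[*?![\]]/.test(name) || name.includes('/')) continue;
    if (name.length > 0) out.push(name);
  }
  return out;
}
```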
```ts
import { AbortError } from './fileSearch.js';
import { loadIgnoreRules } from './ignore.js';

type WorkerRequest =
```
WorkerRequest / WorkerResponse duplicated across two files
Same types defined here and in fileIndexWorker.ts. If someone adds a message type to one side but not the other, the protocol silently diverges — TS won't catch it since they compile independently.
Suggestion: extract to a shared fileIndexProtocol.ts.
Fixed in ea82f9d. Extracted to packages/core/src/utils/filesearch/fileIndexProtocol.ts:

```ts
export type WorkerRequest = /* ... */;
export type WorkerResponse = /* ... */;
```

Both fileIndexService.ts (main thread) and fileIndexWorker.ts (worker thread) now import type from this shared module. A new message variant added only on one side is now a compile error on the other — TS flags the unhandled discriminant in the switch.
```ts
 * failure (binary missing, unexpected exit). We retry fdir for subsequent
 * crawls without paying the spawn-and-fail cost every time.
 */
let ripgrepDisabled = false;
```
ripgrepDisabled is permanent for the process lifetime
One spawn failure (transient resource exhaustion, sandbox race, etc.) permanently degrades to fdir. Fine for CLI, risky for long-lived hosts like the VSCode extension.
Suggestion: add a cooldown (e.g. retry after 5 minutes), or reset the flag when FileIndexService creates a new instance.
Fixed in ea82f9d. Replaced the permanent flag with a timestamp + 5-minute cooldown:

```ts
const RIPGREP_DISABLED_COOLDOWN_MS = 5 * 60 * 1000;
let ripgrepDisabledAt = 0;

function isRipgrepDisabled(): boolean {
  if (ripgrepDisabledAt === 0) return false;
  if (Date.now() - ripgrepDisabledAt >= RIPGREP_DISABLED_COOLDOWN_MS) {
    ripgrepDisabledAt = 0;
    return false;
  }
  return true;
}
```

On failure we set ripgrepDisabledAt = Date.now(), which disables rg for the next 5 min only — after that the next crawl retries. Long-lived hosts (VSCode extension) recover automatically from transient sandbox/resource-exhaustion races instead of staying degraded for the rest of the session.
```ts
  snapshotSize: number;
}

const INSTANCES = new Map<string, FileIndexService>();
```
INSTANCES map has no capacity bound
Stale-key eviction handles same-projectRoot churn, but switching across different directories (multi-root workspace, frequent cd) accumulates workers with no LRU or cap. Each worker is ~10-30 MB.
Suggestion: add a simple cap (e.g. 5), dispose the oldest when exceeded.
Fixed in ea82f9d. MAX_INSTANCES = 8 cap with LRU eviction:

```ts
static for(options: FileSearchOptions): FileIndexService {
  const key = optionsKey(options);
  const existing = INSTANCES.get(key);
  if (existing && !existing.disposed) {
    // Touch: re-insert to mark as most-recently-used (Map preserves
    // insertion order).
    INSTANCES.delete(key);
    INSTANCES.set(key, existing);
    return existing;
  }
  // ... create new instance, then:
  while (INSTANCES.size > MAX_INSTANCES) {
    const oldestKey = INSTANCES.keys().next().value;
    // ...
    void victim.dispose();
  }
}
```

Using Map's insertion-order property: .for() hits re-insert the entry so it becomes newest, and overflow victims come from keys().next() which is the oldest untouched instance. dispose() tears down the worker properly so the ~10-30 MB per entry gets reclaimed.
Added a regression test (fileIndexService.test.ts → "evicts the oldest instance when the LRU cap is exceeded") that spawns 9 distinct-projectRoot instances and asserts the first one is disposed while the 9th is still usable.
Picked 8 conservatively: nobody realistically has more than a handful of active project roots at once, and it's well above the typical multi-root workspace / cd-churn steady state.
```ts
// Race terminate against a short timeout so a faulted worker can't hang
// dispose() indefinitely. `terminate()` normally resolves in well under
// 100ms; 2s is generous enough that healthy workers always win.
await Promise.race([
```
dispose() timeout doesn't guarantee termination
The Promise.race returns after 2s, but if terminate() hasn't completed, the worker is still alive. The dangling promise resolves into nothing — no crash, but a potential leak.
Suggestion: log a warning on timeout, and consider calling worker.terminate() again as a force-kill fallback.
Fixed in ea82f9d. Now tracks the timeout, logs a warning, and re-issues terminate():

```ts
let timedOut = false;
await Promise.race([
  this.transport.terminate(),
  new Promise<void>((resolve) =>
    setTimeout(() => { timedOut = true; resolve(); }, 2000),
  ),
]);
if (timedOut) {
  console.warn(
    '[FileIndexService] worker terminate() timed out after 2s; retrying force-kill',
  );
  void this.transport.terminate().catch(() => {});
}
```

worker_threads' terminate() is idempotent, so a second call just queues another tear-down attempt — best-effort without re-blocking the caller. The warning is the "you now have a zombie" diagnostic you asked for; the retry is the force-kill fallback. Keeping the overall dispose() non-blocking on a hung worker (we intentionally don't await the retry) so the caller can proceed even if the worker is genuinely wedged.
```ts
  };
}

let transportFactory: (options: FileSearchOptions) => IndexTransport = process
```
process.env['VITEST'] for transport selection is fragile
After esbuild bundling this is always undefined — dead code. Other test runners (Jest, Mocha) would get the real Worker transport, potentially causing flaky tests. The existing __setIndexTransportFactory hook shows the author already felt this wasn't clean.
Suggestion: use explicit DI (e.g. an optional transport param on FileIndexService.for()) instead of env sniffing.
Fixed in ea82f9d. Replaced env sniffing with an explicit DI function:

```ts
// fileIndexService.ts
let transportFactory: (options: FileSearchOptions) => IndexTransport =
  createWorkerTransport;

export function installInProcessIndexTransport(): () => void {
  return __setIndexTransportFactory(createInProcessTransport);
}
```

(Renamed from `useInProcessIndexTransport` so eslint's `react-hooks/rules-of-hooks` doesn't flag it at module-level call sites.)

Wired from each package's vitest setup:

```ts
// packages/core/test-setup.ts and packages/cli/test-setup.ts
import { installInProcessIndexTransport } from '@qwen-code/qwen-code-core';
installInProcessIndexTransport();
```

Survives esbuild bundling (no process.env lookup that gets dead-code-eliminated), works under Jest/Mocha/etc. the same way, and external embedders can opt into the in-process backend for hardened sandboxes without touching env vars. The existing `__setIndexTransportFactory` hook is still there for tests that want a custom fake.
- fileIndexCore: drop the redundant second crawl() on cache hits; use the first call's return value when onProgress never fires.
- fileIndexProtocol: new module holding WorkerRequest/WorkerResponse so fileIndexService and fileIndexWorker can't silently diverge.
- fileIndexService: LRU cap (8) on the INSTANCES singleton Map; hits re-insert to refresh recency. Dispose timeout warns + retries terminate() as a force-kill fallback.
- fileIndexService: drop `process.env['VITEST']` transport sniffing in favour of an explicit `installInProcessIndexTransport()` DI hook wired from each package's test-setup. Survives bundling.
- crawler: `ripgrepDisabled` is now a timestamp with a 5-minute cooldown (was permanent for the process lifetime). A single spawn failure no longer downgrades long-lived hosts like the VSCode extension for the whole session.
- crawler: collectRipgrepExcludeDirs now forwards plain directory patterns (.git/, build/, user ignoreDirs) to rg as `--glob '!dir'` args so rg prunes subtrees at the walker instead of streaming every path under them for the Node post-filter to reject.
- useAtCompletion: fileSearchOptions is useMemo'd on [config, cwd] and both effects have it in their deps — the eslint-disables are gone. Extracted `buildFileSearchOptions(config, cwd)` helper that AppContainer uses too, so the prewarm and search paths can't drift from the same FileIndexService singleton key.

Tests:
- fileIndexService.test: new "evicts the oldest instance when the LRU cap is exceeded" regression test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
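The shared-protocol + reqId routing that the fileIndexProtocol and fileIndexService bullets describe can be sketched as below. Message shapes and names are illustrative stand-ins, not the PR's actual `WorkerRequest`/`WorkerResponse` types: each `search()` gets a monotonically increasing id, its resolver is parked in a map, and worker replies are routed back by id, so responses may arrive in any order.

```typescript
// Hypothetical sketch of reqId-based request multiplexing over one worker.
type SearchResponse = { reqId: number; results?: string[]; error?: string };

class RequestMultiplexer {
  private nextReqId = 1;
  private pending = new Map<
    number,
    { resolve: (r: string[]) => void; reject: (e: Error) => void }
  >();

  // `post` stands in for worker.postMessage.
  constructor(private post: (msg: { reqId: number; pattern: string }) => void) {}

  search(pattern: string): Promise<string[]> {
    const reqId = this.nextReqId++;
    return new Promise((resolve, reject) => {
      this.pending.set(reqId, { resolve, reject });
      this.post({ reqId, pattern });
    });
  }

  // Called from the worker's 'message' handler; routes by reqId.
  onResponse(msg: SearchResponse): void {
    const entry = this.pending.get(msg.reqId);
    if (!entry) return; // stale or already-settled request
    this.pending.delete(msg.reqId);
    if (msg.error !== undefined) entry.reject(new Error(msg.error));
    else entry.resolve(msg.results ?? []);
  }
}
```

Keeping the request/response types in one module, as the commit does, means both sides of the `postMessage` boundary compile against the same shapes.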
Conflict in packages/cli/src/ui/components/Composer.tsx: upstream added the `HistoryItemToolGroup` type import to support per-subagent token attribution, while this branch removed the `ConfigInitDisplay` banner as part of the pre-warm UX change. Resolved by keeping the upstream type import and dropping `ConfigInitDisplay` (import + JSX usage), which is the intended behaviour of this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
wenshao
left a comment
packages/cli/src/ui/hooks/useAtCompletion.test.ts:230
[Critical] realFileSearch.search(...args) now triggers TS2556 because args is inferred as a generic rest array, but search() has a fixed (pattern, options?) signature. This makes the PR fail typecheck on a file changed here.
Suggested fix: replace the spread call with an explicit call such as realFileSearch.search(args[0], args[1]), or type the mock implementation parameters as the exact tuple for search.
— gpt-5.4 via Qwen Code /review
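The review's second suggested fix — typing the mock's parameters as the exact tuple — can be sketched as below. The interface is a simplified, hypothetical stand-in for the project's `FileSearch` contract (synchronous here for brevity); the point is that `Parameters<…>` pins the rest parameter to `[string, SearchOptions?]`, so spreading it back into a fixed `(pattern, options?)` signature compiles without TS2556.

```typescript
// Hypothetical, simplified stand-in for the FileSearch contract.
interface SearchOptions {
  maxResults?: number;
}
interface FileSearchLike {
  search(pattern: string, options?: SearchOptions): string[];
}

function makeSearchSpy(
  real: FileSearchLike,
): FileSearchLike & { calls: unknown[][] } {
  const calls: unknown[][] = [];
  return {
    calls,
    // Parameters<…> types args as the exact tuple [string, SearchOptions?],
    // so the spread matches search's fixed signature (no TS2556).
    search(...args: Parameters<FileSearchLike['search']>): string[] {
      calls.push([...args]);
      return real.search(...args);
    },
  };
}
```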
Thanks for the extra pass, @wenshao! 🙏 I think this one is a false positive.

1. The flagged code is not touched by this PR. These lines were written by @AryaGummadi on 2025-08-25, ~7 months before this PR was opened.
2. Local typecheck is clean.
3. Why TS2556 doesn't fire here: a TS2556 would only fire if …
4. Tests and typecheck both green.

Happy to send you the full …
wenshao
left a comment
No issues found. LGTM! ✅ — gpt-5.4 via Qwen Code /review
Re-reviewing: 2 bare-mode tests in config.test.ts fail locally because test-setup.ts now eagerly imports from @qwen-code/qwen-code-core, which loads workspaceContext.ts with real fs before vi.mock('fs') can apply. CI Test jobs have also been stuck ~10h on "Run tests and generate reports" across all OS/Node matrix entries, plausibly same root cause (eager worker/timer handles keeping vitest from exiting). Please defer the core import until the test that actually needs the worker transport.
Took another pass at this locally, dismissed the previous approve. Two related things need addressing: 1.
…imers

Addressing @wenshao's two findings on QwenLM#3455:

1. `--bare` mode tests failing in `packages/cli/src/config/config.test.ts`

   Root cause (correctly diagnosed): the top-level `installInProcessIndexTransport()` call in `test-setup.ts` eagerly evaluated `@qwen-code/qwen-code-core`'s module tree, pulling `workspaceContext.ts` in with a real `node:fs` binding. Later `vi.mock('fs', …)` declarations in individual test files no longer took effect, so `fs.existsSync()` fell through to the real FS and silently dropped mock directories.

   Fixed by removing the install from both `test-setup.ts` files and opting in per test file (`beforeAll`/`afterAll`) in the three suites that actually exercise the worker-backed filesearch:
   - `packages/core/src/utils/filesearch/fileIndexService.test.ts`
   - `packages/core/src/utils/filesearch/fileSearch.test.ts` (also drops live singletons between tests so Windows rmdir doesn't race an open ripgrep child)
   - `packages/cli/src/ui/hooks/useAtCompletion.test.ts`

   The two `test-setup.ts` files now carry an explanatory comment so this doesn't regress.

2. CI matrix stuck for ~10h on "Run tests and generate reports"

   Likely cause: pending `setTimeout` handles keeping the event loop alive past test completion.
   - `FileIndexService.dispose()` raced `terminate()` against a 2 s `setTimeout` but never cleared the timer on the healthy-exit path, so after every disposal the process held a timer for up to 2 s. Now the terminate() arm clears the timer, and the timer is also `.unref()`'d as belt-and-suspenders.
   - `crawlCache` TTL timers (default 30 s) are now `.unref()`'d too. They'd keep vitest workers waiting on exit even after all assertions passed. `clear()` still drops both synchronously between tests, so correctness is unchanged.

Verified:
- `npx vitest run src/config/config.test.ts -t "should ignore implicit startup"` now passes (was failing before).
- Full test suites — core 5925 / 5927, cli 4275 / 4282 (skipped tests pre-existing), vscode-ide-companion 204 / 205 — all green.
- Typecheck clean on core + cli.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thanks so much for the careful follow-up, @wenshao 🙏 — both findings were spot on. Fixed in 44fcc11. 1.
TLDR

Moves the @-picker's recursive file crawl and fzf index construction off the main thread into a `worker_threads` worker, so pressing `@` no longer freezes the Ink render loop. On large workspaces (home directory, 100k-file monorepos) the CLI stays fully responsive where it previously stalled for 1–9 seconds.

Also wires up a bundled ripgrep fast path for the crawl (3–4× faster than `fdir` past ~50k files) and pre-warms the index at CLI boot so the first `@` press usually finds a ready snapshot.

Screenshots / Video Demo
N/A — no new UI surface; the visible change is "nothing freezes anymore" on large trees, which is hard to capture in a still. Happy to record a before/after clip against `~` if reviewers want one.

Dive Deeper
Root cause of the original freeze was two synchronous pieces of work on the main thread:

- the `fdir` recursive crawl in `packages/core/src/utils/filesearch/crawler.ts`;
- `new AsyncFzf(allFiles, …)` in `packages/core/src/utils/filesearch/fileSearch.ts` — despite the name, the constructor is synchronous and dominates for large file counts.

This branch introduces:
- `FileIndexCore` — pure class holding the crawl, fzf index, and result cache. Worker-safe and unit-testable without a real worker.
- `fileIndexWorker.ts` — thin `parentPort` pump that wraps `FileIndexCore`.
- `FileIndexService` — main-thread singleton keyed by `(projectRoot + ignore-fingerprint + options)` that owns the worker, multiplexes reqId-based `search` calls, fans out streaming partial updates, and disposes stale-key instances when `.gitignore` edits or cwd changes produce a new key.
- `RecursiveFileSearch` is now a thin proxy over the service, so the public `FileSearch` interface (and vscode-ide-companion's use of it) is unchanged.
- `ripgrepCrawler.ts` — `rg --files` backend using the bundled ripgrep binary. Empirically on macOS: at small sizes Node's spawn+IPC overhead beats rg's native parallel walker; past ~50k files rg's Rust walker wins decisively. Since both are well under the 200 ms loading threshold on small repos, rg is the default; `QWEN_FILESEARCH_USE_RG=0` forces fdir.
- `AppContainer.tsx` pre-warms `FileIndexService.for(...)` right after `config.initialize()` resolves, so the worker crawl starts in parallel with Config/MCP boot instead of waiting for the first `@` keypress.
- `useAtCompletion.ts` no longer flips `isLoading=true` on the INITIALIZE dispatch; the 200 ms slow-load timer now also covers `INITIALIZING`, so the common pre-warmed path opens silently and only slow cold-starts surface a spinner.
- `Composer.tsx` no longer renders `ConfigInitDisplay`; the "Initializing…" banner at the top of the prompt is gone.

The branch also lands five rounds of self-audit fixes on top of the P1 baseline:
- `crawlError` now disposes the service (no zombie worker on transient failures).
- `dispose()` rejects `whenReady()` waiters.
- `REFRESH` uses a monotonic token so partial-driven re-searches fire even mid-SEARCHING.
- `worker.on('error')` is wired and converted to a synthetic exit.
- The worker's shutdown path uses `parentPort.close()` (drains queued sends) instead of `process.exit(0)`.
- `optionsKey()` includes the `.gitignore`/`.qwenignore` fingerprint so edits invalidate the cached singleton; `FileIndexService.for()` also evicts stale-key instances sharing the same projectRoot.
- `FileSearch.dispose?()` added to the interface; vscode-ide-companion's `clearFileSearchCache` now disposes instead of only dropping the Map entry.
- `search` payloads are type-validated; picomatch compile errors degrade to empty results instead of throwing to the UI.
- `scripts/prepare-package.js` whitelists `fileIndexWorker.js` so the published npm tarball actually ships it.

Reviewer Test Plan
1. `npm ci && npm run build && npm run bundle`
2. `npm start`, then press `@` — should open instantly, no loading flash.
3. `QWEN_WORKING_DIR=~ npm start`, press `@` within the first second — prompt should stay responsive (cursor keeps blinking, you can still type). Compare to `main`: freeze.
4. `QWEN_FILESEARCH_USE_RG=0 QWEN_WORKING_DIR=~ npm start` — slower crawl but still non-blocking.
5. `.gitignore` invalidation: open an @-picker, edit `.gitignore` to add a dir, press `@` again — newly-ignored files disappear without a CLI restart.
6. `npm run test --workspace=packages/core` and `npm run test --workspace=packages/cli`.

Testing Matrix
Personally validated on macOS 15.1.1 (arm64) with Node 22.17. Worker spawn latency on Windows is known to be higher (30–80 ms) — pre-warm absorbs it, but independent validation there would be appreciated. Sandbox profiles may need to permit worker thread creation; I haven't repro'd a sandbox issue locally but flagged it as a risk.
Before:
qwencode@before.mp4
After:
qwencodeafter.mp4
Linked issues / bugs
Fixes #3454