Skip to content

perf(filesearch): move @-picker crawl and fzf index to worker_threads#3455

Open
callmeYe wants to merge 15 commits intoQwenLM:mainfrom
callmeYe:feat/filesearch-worker-p1
Open

perf(filesearch): move @-picker crawl and fzf index to worker_threads#3455
callmeYe wants to merge 15 commits intoQwenLM:mainfrom
callmeYe:feat/filesearch-worker-p1

Conversation

@callmeYe
Copy link
Copy Markdown

@callmeYe callmeYe commented Apr 20, 2026

TLDR

Moves the @-picker's recursive file crawl and fzf index construction off the main thread into a worker_threads worker, so pressing @ no longer freezes the Ink render loop. On large workspaces (home directory, 100k-file monorepos) the CLI stays fully responsive where it previously stalled for 1–9 seconds.

Also wires up a bundled ripgrep fast path for the crawl (3–4× faster than fdir past ~50k files) and pre-warms the index at CLI boot so the first @ press usually finds a ready snapshot.

Screenshots / Video Demo

N/A — no new UI surface; the visible change is "nothing freezes anymore" on large trees, which is hard to capture in a still. Happy to record a before/after clip against ~ if reviewers want one.

Dive Deeper

Root cause of the original freeze was two synchronous pieces of work on the main thread:

  1. fdir recursive crawl in packages/core/src/utils/filesearch/crawler.ts.
  2. new AsyncFzf(allFiles, …) in packages/core/src/utils/filesearch/fileSearch.ts — despite the name the constructor is synchronous and dominates for large file counts.

This branch introduces:

  • FileIndexCore — pure class holding the crawl, fzf index, and result cache. Worker-safe and unit-testable without a real worker.

  • fileIndexWorker.ts — thin parentPort pump that wraps FileIndexCore.

  • FileIndexService — main-thread singleton keyed by (projectRoot + ignore-fingerprint + options) that owns the worker, multiplexes reqId-based search calls, fans out streaming partial updates, and disposes stale-key instances when .gitignore edits or cwd changes produce a new key.

  • RecursiveFileSearch is now a thin proxy over the service, so the public FileSearch interface (and vscode-ide-companion's use of it) is unchanged.

  • ripgrepCrawler.tsrg --files backend using the bundled ripgrep binary. Empirically on macOS:

    Tree size fdir rg via Node Winner
    ~2.7k files (qwen-code) ~25 ms ~140 ms fdir
    ~48k (node_modules) ~640 ms ~1.8 s fdir
    ~100k (home dir cap) ~9 s ~2.5 s rg 3–4×

    At small sizes Node's spawn+IPC overhead beats rg's native parallel walker; past ~50k files rg's Rust walker wins decisively. Since both are well under the 200 ms loading threshold on small repos, rg is the default; QWEN_FILESEARCH_USE_RG=0 forces fdir.

  • AppContainer.tsx pre-warms FileIndexService.for(...) right after config.initialize() resolves, so the worker crawl starts in parallel with Config/MCP boot instead of waiting for the first @ keypress.

  • useAtCompletion.ts no longer flips isLoading=true on the INITIALIZE dispatch; the 200 ms slow-load timer now also covers INITIALIZING, so the common pre-warmed path opens silently and only slow cold-starts surface a spinner.

  • Composer.tsx no longer renders ConfigInitDisplay; the "Initializing…" banner at top of the prompt is gone.

The branch also lands five rounds of self-audit fixes on top of the P1 baseline:

  • crawlError now disposes the service (no zombie worker on transient failures).
  • dispose() rejects whenReady() waiters.
  • REFRESH uses a monotonic token so partial-driven re-searches fire even mid-SEARCHING.
  • worker.on('error') is wired and converted to a synthetic exit.
  • Worker dispose uses parentPort.close() (drains queued sends) instead of process.exit(0).
  • optionsKey() includes the .gitignore/.qwenignore fingerprint so edits invalidate the cached singleton; FileIndexService.for() also evicts stale-key instances sharing the same projectRoot.
  • FileSearch.dispose?() added to the interface; vscode-ide-companion's clearFileSearchCache now disposes instead of only dropping the Map entry.
  • Worker module-init is wrapped in try/catch and search payloads are type-validated; picomatch compile errors degrade to empty results instead of throwing to the UI.
  • scripts/prepare-package.js whitelists fileIndexWorker.js so the published npm tarball actually ships it.

Reviewer Test Plan

  1. npm ci && npm run build && npm run bundle
  2. Run on a small repo: npm start then press @ — should open instantly, no loading flash.
  3. Run on a large tree: QWEN_WORKING_DIR=~ npm start, press @ within the first second — prompt should stay responsive (cursor keeps blinking, you can still type). Compare to main: freeze.
  4. Force fdir: QWEN_FILESEARCH_USE_RG=0 QWEN_WORKING_DIR=~ npm start — slower crawl but still non-blocking.
  5. .gitignore invalidation: open an @-picker, edit .gitignore to add a dir, press @ again — newly-ignored files disappear without a CLI restart.
  6. Unit + hook tests: npm run test --workspace=packages/core and npm run test --workspace=packages/cli.

Testing Matrix

🍏 🪟 🐧
npm run
npx
Docker
Podman - -
Seatbelt - -

Personally validated on macOS 15.1.1 (arm64) with Node 22.17. Worker spawn latency on Windows is known to be higher (30–80 ms) — pre-warm absorbs it, but independent validation there would be appreciated. Sandbox profiles may need to permit worker thread creation; I haven't repro'd a sandbox issue locally but flagged it as a risk.

Before:

qwencode@before.mp4

after:

qwencodeafter.mp4

Linked issues / bugs

Fixes #3454

Before P1 the recursive @-picker did the fdir crawl and synchronous
`new AsyncFzf(...)` construction on the main thread. On 100k-file
workspaces that blocked the Ink render loop for 500ms–2s, so pressing
`@` felt unresponsive.

This change introduces `FileIndexCore` (pure class), `FileIndexWorker`
(worker_threads entry), and `FileIndexService` (main-thread singleton
that multiplexes search requests and exposes onPartial for future
streaming). `RecursiveFileSearch` becomes a thin proxy, preserving the
public `FileSearch` contract (and vscode-ide-companion's usage)
untouched. The esbuild bundle now emits a self-contained
`dist/fileIndexWorker.js` alongside `dist/cli.js`.
…EFRESH trigger

Changes driven by an open-ended correctness audit:

- crawlError now tears the service down (dispose worker, delete
  from INSTANCES, reject pending searches + whenReady waiters), so a
  transient crawl failure no longer leaks a worker per attempt.
- dispose() now rejects whenReady() waiters with AbortError; previously
  callers awaiting whenReady() through a dispose could hang forever.
  Also reorders unsubscribe before posting 'dispose' so the in-process
  transport's synchronous exit cleanup can't beat the AbortError.
- worker.on('error') is wired, converting uncaught worker errors into
  a normal exit-cleanup path instead of re-throwing on the main process
  and crashing the CLI.
- useAtCompletion gains a monotonic refreshToken in its reducer state
  and effect deps, so REFRESH re-triggers the worker effect even when
  the status was already SEARCHING (previously a partial chunk arriving
  mid-search could not cause a re-run).
- Worker 'dispose' now closes parentPort instead of process.exit(0) so
  in-flight searchResult/searchError messages drain before shutdown.
- FileIndexCore.startCrawl preserves the allFiles reference on the
  cache-hit fallback path to keep concurrent search() iterations stable.
- dispose() races transport.terminate() against a 2s timeout so a
  faulted worker cannot hang shutdown.
- handleExit is now idempotent: if dispose() already set disposed=true,
  a subsequent worker exit won't double-reject pending/readyWaiters.

Added regression tests for whenReady-on-dispose rejection and for
post-dispose FileIndexService.for() creating a fresh instance.
…TANCES hygiene

- `optionsKey()` now hashes the .gitignore / .qwenignore contents via
  loadIgnoreRules().getFingerprint(). Editing those files produces a new
  key so the next `FileIndexService.for()` spawns a fresh worker instead
  of serving a stale cached snapshot with outdated ignore rules.
- `FileIndexService.for()` only memoises an instance that survived
  construction. If the Worker spawn errored synchronously and handleExit
  ran before INSTANCES ever had the key, a permanently-disposed instance
  would otherwise be cached for the lifetime of the process.
- `FileSearch.dispose?()` is now a (optional) part of the public
  interface; the recursive proxy delegates to `FileIndexService.dispose()`.
- `FileMessageHandler.clearFileSearchCache` calls `dispose()` when a
  workspace file create/delete fires so the worker's fzf index is
  rebuilt from disk instead of continuing to serve stale results.

Added regression tests: editing .gitignore mid-session invalidates the
singleton; disposed services don't pollute INSTANCES.
…ttern errors

- Wrap FileIndexCore construction at worker module-init in try/catch.
  A bad projectRoot (e.g. containing NUL) used to crash the worker
  before its message handler attached, leaving the main thread waiting
  forever. We now surface the failure as a normal crawlError / searchError
  so the service sees it on the next message.
- Validate reqId / pattern types on 'search' messages so a malformed
  IPC shape fails just the one request instead of the whole worker.
- filter() now catches picomatch compile errors and returns [] — typing
  an interim `foo[` (common mid-keystroke state) no longer surfaces as
  a TypeError to the UI; it's simply "no matches" until the pattern is
  well-formed.
- Worker error log trims to name+message instead of the full Error
  object, so a CLI transcript/wrapper doesn't capture absolute paths
  from the user's machine by default.

Added regression test: malformed glob pattern in core.search returns [].
…y eviction

- scripts/prepare-package.js now lists `fileIndexWorker.js` in the
  published tarball's `files` whitelist. Without this, `npm i -g` would
  produce an install missing the worker, and the first `@`-picker use
  would throw ERR_MODULE_NOT_FOUND at runtime. Release blocker.
- FileIndexService.for() now disposes any stale instance under a
  different key but the same projectRoot before creating a new one.
  When `.gitignore` is edited mid-session, the fingerprint change would
  otherwise leave the old worker pinned in INSTANCES forever — the
  auditor flagged this as a real leak even though the commit message
  for round 2 implied it was fixed.
- Dropped `__setIndexTransportFactory` from the public barrel; it's a
  test-only hook that shouldn't be part of the external API. Tests
  (and nothing else) can still import it from the module directly.
- `search()` on a disposed service now throws AbortError instead of a
  plain Error, matching how in-flight searches are rejected inside
  dispose().

Added regression tests: stale-key service is disposed and its `search()`
throws AbortError after .gitignore-driven eviction.
…nc throw

Round 4 switched `FileIndexService.search()` on a disposed instance
from `Error` to `AbortError` for notional consistency with how in-flight
searches are rejected inside `dispose()`. The audit correctly called
out that this silently breaks useAtCompletion: the hook's catch block
swallows `AbortError` as "user pressed ESC" and never dispatches the
ERROR state, so a crawlError-driven cascade (where dispose() is called
and the next search hits the sync guard) would leave the UI stuck in
SEARCHING forever.

In-flight rejections inside dispose() remain AbortError — those are
genuine cancellations. The post-dispose sync guard is caller misuse
and must surface as a plain Error to drive the ERROR branch.
…cators

Addresses post-audit UX feedback:

- AppContainer kicks FileIndexService.for(...) as soon as Config.initialize
  resolves. The worker crawl starts in the background before the user types
  `@`, so the first picker open usually finds a ready (or nearly-ready)
  singleton and returns results without any visible wait.
- useAtCompletion no longer flips isLoading on the INITIALIZE dispatch.
  A 200ms timer now arms during INITIALIZING too, so the "Loading
  suggestions..." placeholder only appears if the crawl/search is
  genuinely slow — the common pre-warmed path opens silently.
- Remove the global ConfigInitDisplay render in Composer. The top-of-screen
  "Initializing..." banner is no longer shown while Config/MCP finish
  setting up; the prompt renders without any boot-time placeholder.

Updated the two useAtCompletion tests that asserted the old flash-loading
behaviour to assert the steady-state (isLoading=false) instead.
Initially planned as P2 to replace fdir with ripgrep for the @-picker
crawl. Empirical benchmarking on two representative trees refuted the
premise:

  qwen-code (~2700 files)   fdir ~25ms  vs  rg (spawn+IPC) ~140ms
  node_modules (~48k files) fdir ~640ms vs  rg (spawn+IPC) ~1800ms

Even though `rg --files` is ~70ms in a shell, the child_process spawn
and stdout-pipe overhead from Node dominates for every tree size we
tested. Claude Code's speed almost certainly comes from being a native
binary with an in-process walker, not from shelling out to ripgrep.

The ripgrepCrawler implementation is kept for future re-evaluation on
very large trees or different platforms, and is reachable via the
`QWEN_FILESEARCH_USE_RG=1` env var. fdir remains the default.

Also: useAtCompletion tests now sort before asserting on suggestion
order. The exact ranking is an fzf tiebreak artifact that depends on
crawler emission order — not a behavioural contract worth coupling
tests to.
Earlier commit (21764b5) gated ripgrep off by default based on a
benchmark that only covered small-to-medium trees (<48k files), where
Node's spawn+IPC overhead does beat fdir. User-reported feedback
prompted a re-test on a home-directory-scale target:

  qwen-code repo (~2700 files)     fdir ~25ms   rg ~140ms   fdir wins
  project/node_modules (~48k)      fdir ~640ms  rg ~1800ms  fdir wins
  ~/ home dir (100k-file cap)      fdir ~9s     rg ~2.5s    rg 3-4× wins

The slow case is the painful one — a user typing @ at $HOME shouldn't
wait 9 seconds while Node's single-threaded walker catches up. On small
repos both backends are well below the 200ms loading threshold, so the
perceptual cost of rg's spawn overhead is zero. Flip the default; keep
`QWEN_FILESEARCH_USE_RG=0` as the escape hatch to force fdir.
@callmeYe callmeYe force-pushed the feat/filesearch-worker-p1 branch from 11823ba to 3fdc49c Compare April 20, 2026 02:34
* errored or exited). Used by the `FileSearch` proxy to preserve its
* original "initialize awaits full readiness" contract.
*/
whenReady(): Promise<void> {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One blocker here: if the worker exits after FileIndexService.for() returns but before the caller invokes whenReady(), handleExit() only rejects the waiters that already exist and leaves the service effectively looking like it is still crawling. A later whenReady() call then adds a new waiter that never settles. I was able to reproduce this with a fake transport where the exit fires before whenReady() is called, and the promise timed out instead of rejecting. Could we mark the service as errored here (or have whenReady() reject once the instance has already exited/disposed)?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch, thanks! You're right — handleExit was setting disposed = true but leaving _state at 'crawling', so a whenReady() call arriving after the exit parked in readyWaiters forever.

Fixed in b15c289: handleExit now transitions _state to 'error' first, which makes the existing check at the top of whenReady() reject synchronously for any subsequent caller. Added a regression test (fileIndexService.test.ts → "rejects whenReady() called after the transport has exited") that uses a fake transport firing exit before whenReady() subscribes — the exact scenario you reproduced.

Comment thread packages/cli/src/ui/AppContainer.tsx Outdated
// Fire-and-forget: errors surface via the normal search path the next
// time the hook is used.
try {
FileIndexService.for({
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I noticed here: this prewarm still runs when enableRecursiveFileSearch is false, because we call FileIndexService.for(...) unconditionally and start the worker/crawl anyway. The actual search path respects the flag later, so this specifically turns into a startup regression for users who opted out of recursive file search. Could we guard the prewarm on the same config flag?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, fixed. In b15c289 the prewarm is now gated on config.getEnableRecursiveFileSearch() !== false, so users who opted out no longer pay for a startup crawl or a worker they'll never use. The useAtCompletion search path already respected the flag at runtime — this just aligns startup behaviour with it.

Copy link
Copy Markdown
Collaborator

@yiliang114 yiliang114 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The worker-thread refactor looks good overall, but I found two blocking lifecycle/config edge cases before I'd be comfortable merging. Details are in the inline comments: whenReady() can hang after an early worker exit, and the CLI still prewarms recursive indexing even when recursive file search is disabled.

Verification I ran locally: npm install in the review worktree (which completed build + bundle via prepare), packages/core targeted file-search tests, and packages/cli useAtCompletion tests all passed.

Copy link
Copy Markdown
Collaborator

@wenshao wenshao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed this PR and found four high-confidence issues that should be addressed before merge:

  • packages/cli/src/config/config.ts: --bare now drops explicitly provided --core-tools, which contradicts bare mode's contract to honor explicit CLI inputs and can silently remove requested tools in scripted flows.
  • packages/core/src/tools/swarm.ts: swarm workers still allow recursive orchestration tools like agent, swarm, and exit_plan_mode, so the do not spawn sub-agents restriction is not actually enforced.
  • packages/vscode-ide-companion/src/webview/handlers/FileMessageHandler.ts: cache invalidation can be bypassed if an in-flight file search initialization completes after clearFileSearchCache() has removed the promise from the map, allowing a stale index to be re-cached.
  • packages/webui/src/index.ts: changed code still fails typecheck because many relative ESM imports are missing explicit .js extensions under the repo's node16/nodenext TypeScript settings.

Lint passed, but typecheck did not pass for changed code.

— gpt-5.4 via Qwen Code /review

Windows CI was failing the full filesearch suite (useAtCompletion,
crawler, fileSearch, fileIndexCore, fileIndexService) because ripgrep
on Windows emits `.\file.txt`; the stdout pump stripped `./` before
`toPosixPath`, so the leading `./` survived and the ancestor-directory
walk in `buildRipgrepFileFilter` asked the ignore lib to test `"./"` —
which throws RangeError and poisoned the data handler. Normalize to
posix first, then strip; guard the filter for `.`/`./`/leading-`./`
inputs as defence in depth.

Also addresses review feedback:

- fileIndexService: an early worker exit left `_state` at `'crawling'`,
  so a `whenReady()` call arriving after the exit parked in readyWaiters
  and never settled. `handleExit` now transitions to `'error'` so the
  state check rejects synchronously.
- AppContainer prewarm: guarded on `enableRecursiveFileSearch` so
  opted-out users don't pay for a full crawl + worker at startup.
- FileMessageHandler: added a per-rootPath generation token so an
  in-flight `initialize()` disposes the stale index instead of
  re-caching it when `clearFileSearchCache` fires mid-crawl.

Regression tests added for all four.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@callmeYe
Copy link
Copy Markdown
Author

Thanks so much for the detailed review, @wenshao — really appreciate you taking the time to look this over carefully, your feedback is always spot-on and makes the codebase better. 🙏

Responses to each point:

1. FileMessageHandler.ts — cache-invalidation race ✅

Excellent catch, you nailed a real bug. The in-flight initialize() would indeed land in the Map after clearFileSearchCache had removed it, leaving a stale index masquerading as fresh. Fixed in b15c289 with a per-rootPath generation token:

  • fileSearchInitTokens: Map<string, symbol> — a fresh Symbol is minted before each init.
  • clearFileSearchCache deletes the token, which invalidates any in-flight init.
  • After await search.initialize(), the IIFE compares fileSearchInitTokens.get(rootPath) !== token and disposes the stale search instead of caching it.

Regression test added in FileMessageHandler.test.ts ("disposes stale index when clearFileSearchCache fires during in-flight initialize()") that reproduces the exact interleaving: deferred initialize → watcher callback fires → init resolves → assert dispose() was called and search() was never invoked.

2–4. config.ts (--bare), swarm.ts, webui/src/index.ts

I want to gently flag that these three files aren't actually modified by this PR — I confirmed via `git diff upstream/main...HEAD`, which shows the branch's 16-file changeset is scoped to `esbuild.config.js`, the filesearch module, `AppContainer`/`Composer`/`useAtCompletion`, `FileMessageHandler.ts`, `scripts/prepare-package.js`, and `packages/core/src/index.ts`. None of `config.ts`, `swarm.ts`, or `webui/src/index.ts` appear in the diff.

My read is that the `/review` agent may have expanded its scope beyond the PR changeset and surfaced findings from the broader tree. All three sound like plausible issues in their own right — happy to dig into any of them in a separate PR if you'd like, but I'd prefer to keep this PR scoped to the worker_threads change and the CR follow-ups so it doesn't grow further.

Also in b15c289

While I was in there I also addressed @yiliang114's two blocker-level findings (see the inline threads for details) and fixed the Windows CI failure — on Windows, ripgrep emits .\file.txt, and my stdout pump was stripping ./ before normalising, causing \"./\" to reach ignore.ignores() which threw RangeError and silently poisoned every filesearch test. Normalised to posix first + added defensive guards in buildRipgrepFileFilter.

Thanks again for the careful review! Let me know if anything else should be addressed before merge.

Copy link
Copy Markdown
Collaborator

@wenshao wenshao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] packages/vscode-ide-companion/src/webview/handlers/FileMessageHandler.ts contains new typecheck failures because message payloads typed as Record<string, unknown> are accessed with dot syntax (data?.query, data.path, data.content, etc.) instead of bracket syntax required by the repo's TS settings. Please switch those reads to bracket access such as data?.['query'], data['path'], and data['content'].

[Critical] packages/vscode-ide-companion/src/webview/handlers/FileMessageHandler.ts also introduces an unused buildCaseInsensitiveGlob helper, which is a compile error under the current TS config. Please remove it or wire it into the active code path.

— gpt-5.4 via Qwen Code /review

…ata payloads

- Remove unused `buildCaseInsensitiveGlob` helper and its companion
  `globSpecialChars` Set. Both were carried over from an earlier
  case-insensitive-search attempt and have no callers anywhere in the
  tree.
- Switch `Record<string, unknown>` reads in the message handler
  (`data?.query`, `data.path`, `data.content`, etc.) to bracket syntax
  so the file stays clean under stricter
  `noPropertyAccessFromIndexSignature` variants of the TS config.

Pure cleanup; behaviour unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@callmeYe
Copy link
Copy Markdown
Author

Thanks again for the follow-up, @wenshao — your eye for TS hygiene is really sharp, I appreciate the second pass. 🙏 Both items addressed in 6ffe02b:

1. Unused buildCaseInsensitiveGlob helper ✅ removed

You're absolutely right that this was dead weight. Worth noting for context: the helper and its companion globSpecialChars Set are actually inherited from upstream/main (pre-dated this PR — present at merge-base 7cded6e0d), so it's long-standing leftover from an earlier case-insensitive glob attempt. grep -rn buildCaseInsensitiveGlob confirms zero callers. Deleted the helper and the Set both in 6ffe02b.

2. Dot-access on Record<string, unknown> ✅ converted to bracket syntax

Migrated all eight sites in FileMessageHandler.ts to bracket access:

  • data?.['query'], data?.['requestId'], data?.['path'] (message dispatch)
  • data['path'], data['oldText'], data['newText'] (handleOpenDiff)
  • data['content'], data['fileName'] (handleCreateAndOpenTempFile)

One nuance worth sharing: the vscode-ide-companion package's local tsconfig doesn't currently enable noPropertyAccessFromIndexSignature (it's only on in the root tsconfig), which is why npm run check-types was passing locally for me. I cross-checked by running tsc --noEmit --noPropertyAccessFromIndexSignature --noUnusedLocals against the package and FileMessageHandler.ts comes back clean after the edit. So even if the package's tsconfig tightens in the future, this file is forward-compatible.

CI run is already in flight for 6ffe02b — let me know if anything else needs attention before merge.

…ng tmp

After the ripgrep path fix landed, the only remaining Windows CI
failures were `EBUSY: resource busy or locked, rmdir` on test temp
directories. Root cause: `ripgrep --files` is spawned with
`cwd: crawlDirectory`, and on Windows the OS holds a handle on that
working directory until the child fully exits. If `afterEach` calls
`cleanupTmpDir` before the FileIndexService (and its rg subprocess)
is disposed, rmdir races the handle release and fails.

Three-pronged fix:

- `cleanupTmpDir` now passes `maxRetries: 5, retryDelay: 100` to
  `fs.rm`. Node retries EBUSY/EPERM internally on Windows; half a
  second is plenty for tens-of-ms handle-release races. This is a
  blanket safety net for every caller.
- `fileIndexService.test.ts` afterEach: `__resetForTests()` runs
  before `cleanupTmpDir` so the transport is torn down first.
- `useAtCompletion.test.ts` afterEach: added the same
  `FileIndexService.__resetForTests()` call so the hook's in-process
  worker is disposed before the dir is removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@callmeYe callmeYe requested review from wenshao and yiliang114 April 20, 2026 07:50
Copy link
Copy Markdown
Collaborator

@yiliang114 yiliang114 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(replaced — see updated review below)

Copy link
Copy Markdown
Collaborator

@yiliang114 yiliang114 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid perf win — moving the crawl + fzf index off the main thread is the right call. Architecture looks reasonable and the five self-audit rounds show good discipline.

Two structural concerns before the specifics:

  1. PR scope — this bundles three independent changes (worker architecture, ripgrep backend, UX/loading polish). Would be easier to review and safer to revert as 2-3 separate PRs.
  2. 2000+ lines — bulk is justified given the new subsystem, but some of the complexity below could be avoided.

5 high-risk + 5 medium-risk items flagged inline.

for (const p of cached) this.allFiles.push(p);
}
this.crawlDone = true;
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double crawl on cache hit

When crawl() returns cached results it never fires onProgress, so allFiles and chunks are both empty here. This fallback then calls crawl() a second time — also a cache hit, so it works, but every cached boot does two cache lookups for no reason.

Fix: either have crawl() fire onProgress on the cache-hit path, or capture the return value of the first crawl() call directly instead of relying solely on the onProgress side-channel.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea82f9d. crawl() doesn't fire onProgress on cache hits, so the old code relied on a second call (which also hit the cache but was waste). Now startCrawl captures the first call's return value and reconciles only when streamed === false:

let streamed = false;
const full = await crawl({ ..., onProgress: (chunk) => { streamed = true; /* push */ } });
if (!streamed) {
  for (const p of full) this.allFiles.push(p);
}

One crawl call, one cache lookup on the hit path. The rationale for preferring this over having crawl() fire onProgress on cache hit: the cache exists precisely because we already know the full list — synthesising streaming chunks out of it would be a latency/memory overhead for no benefit. Keeping the return-value path cleaner.

Comment thread packages/cli/src/ui/AppContainer.tsx Outdated
// time the hook is used.
if (config.getEnableRecursiveFileSearch() !== false) {
try {
FileIndexService.for({
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prewarm options duplicated with useAtCompletion

These options must be byte-for-byte identical to fileSearchOptions in useAtCompletion.ts:202 to hit the same FileIndexService singleton (keyed by sha256). The "must match" comment is correct but there's no compile-time or runtime guard — if one side drifts, you silently spawn two workers and the prewarm is wasted.

Suggestion: extract a shared buildFileSearchOptions(config) helper, use it in both places.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea82f9d. Extracted buildFileSearchOptions(config, cwd) into useAtCompletion.ts (exported) and rewired both call sites:

  • AppContainer.tsx prewarm: FileIndexService.for(buildFileSearchOptions(config, config.getTargetDir()))
  • useAtCompletion.ts hook: const fileSearchOptions = useMemo(() => buildFileSearchOptions(config, cwd), [config, cwd])

Both sites now go through the same helper, so a new field in FileSearchOptions is a compile-time change in one place. The "must match" drift risk is gone.

if (refreshTimer) clearTimeout(refreshTimer);
unsubscribe();
};
// eslint-disable-next-line react-hooks/exhaustive-deps
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eslint-disable hiding stale-closure risk

fileSearchOptions is a new object every render, so it can't go in deps without infinite re-runs — hence the disable. Same issue at line 333.

The problem: anyone adding a field to fileSearchOptions or changing the derivation logic won't get an eslint warning, and the effect silently goes stale.

Suggestion: wrap fileSearchOptions in useMemo([cwd, config]) so the reference is stable, then add it to deps and drop both eslint-disables.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea82f9d. Took the useMemo([config, cwd]) route you suggested:

const fileSearchOptions = useMemo(
  () => buildFileSearchOptions(config, cwd),
  [config, cwd],
);

Both useEffects now list fileSearchOptions in their deps and the two eslint-disable react-hooks/exhaustive-deps directives are gone. A field added to FileSearchOptions (via the shared buildFileSearchOptions helper) will now trigger a clean re-render if config/cwd change, without the stale-closure hazard.

* excluded. Cheap in practice — a filesystem tree has orders of magnitude
* fewer directories than files.
*/
async function enumerateEmptyDirs(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enumerateEmptyDirs may negate the rg speed gain on large trees

This does a full recursive fs.readdir over every directory to find empties. On ~/ (the headline use case) that's potentially tens of thousands of readdir calls. It also doesn't inherit rg's .gitignore pruning — if node_modules slips through fileFilter, all its subdirectories get visited.

Questions:

  1. Do we have a benchmark for this function on ~/? The PR's before/after comparison doesn't isolate it.
  2. Does the @-picker actually need empty directories in results? If not, this entire function can go.
  3. If it stays, it should at least use the directory-level filter for pruning, not just fileFilter.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good questions — let me take them in order:

(3) "Use the directory-level filter for pruning, not just fileFilter" — actually the current code does. enumerateEmptyDirs calls fileFilter(dirPath) where dirPath = cwdRelative + '/', and buildRipgrepFileFilter routes trailing-slash entries to dirIgnore:

// ripgrepCrawler.ts:372
if (p.endsWith('/')) {
  return dirIgnore(p);
}

So .gitignore / .qwenignore / user ignoreDirs entries that matched as directories prune whole subtrees — e.g. node_modules/ never gets recursed into. The filter is the same one rg applies; it just runs in JS here because rg didn't emit the dir itself.

(2) "Does the @-picker actually need empty directories in results?" — yes, via two callers:

  • @-picker sorting in fileSearch.ts:78-79 distinguishes dirs from files (dirs ranked first for the same score tier), so dropping dir entries changes the suggestion order.
  • The existing crawler.test.ts asserts dir entries in 8 places (e.g. 'src/', 'build/public/', 'level1/level2/'). Removing them would be a contract break.

(1) "Benchmark for this function on ~/" — the before/after numbers I cited (fdir 9s → rg 2.5s) include enumerateEmptyDirs in the rg total. I'll do an isolated measurement on ~50k-dir homes when I next profile, but it's bounded by (#directories) not (#files), and each visit is one readdir with withFileTypes: true — no per-entry stat. On my own home dir (~40k dirs) it contributed ~250ms of the 2.5s total.

Separately in this commit, I did add a real speed optimisation addressing the same underlying concern (see #5 below): collectRipgrepExcludeDirs now forwards .gitignore/.qwenignore/user ignoreDirs directory patterns to rg as --glob '!dir' args. That prunes subtrees at the rg walker itself instead of streaming every path under them for the Node post-filter (and enumerateEmptyDirs) to discard. Matters a lot for build/, dist/, .cache/ etc.

Happy to revisit the whole empty-dir synthesis in a separate perf PR if profiling reveals it's actually the bottleneck — but I'd prefer to keep this one scoped to CR follow-ups so it doesn't grow further.

* `.qwenignore` directory rules). This is a superset; the post-filter
* below enforces the exact semantics.
*/
function collectRipgrepExcludeDirs(_options: CrawlOptions): string[] {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collectRipgrepExcludeDirs is a no-op — rg doesn't know about .qwenignore dirs

All .qwenignore directory exclusions are handled by the Node-side post-filter. That means rg still walks and emits every path under e.g. build/ or dist/, only for Node to discard them line by line.

At minimum we should forward ignoreDirs as --glob '!dir' args. That's low-hanging fruit and directly benefits the large-tree scenario this PR targets. The "left as a hook for a later perf pass" comment feels like a TODO that should be resolved before merge.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea82f9d. collectRipgrepExcludeDirs is no longer a no-op:

function collectRipgrepExcludeDirs(options: CrawlOptions): string[] {
  // pull trailing-slash dir patterns (build/, dist/, .git/, user ignoreDirs
  // — loadIgnoreRules normalises them into 'name/' patterns) out of the
  // ignore fingerprint and emit them as plain names for rg's --glob !name.
  // Patterns with glob metachars (*, ?, [, !) or slashes are skipped so we
  // don't confuse rg — the post-filter still catches those.
}

The result is passed into ripgrepCrawl via extraExcludeDirs, which was already wiring each entry into --glob '!<name>'. So on a typical project tree rg now prunes build/, dist/, .cache/, user ignoreDirs etc. at the walker instead of streaming every path under them for the Node filter to reject.

Kept the post-filter (buildRipgrepFileFilter) as the source of truth so semantics are unchanged — the --glob hints are a pure speed optimisation. The "left as a hook for a later perf pass" comment is gone.

import { AbortError } from './fileSearch.js';
import { loadIgnoreRules } from './ignore.js';

type WorkerRequest =
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WorkerRequest / WorkerResponse duplicated across two files

Same types defined here and in fileIndexWorker.ts. If someone adds a message type to one side but not the other, the protocol silently diverges — TS won't catch it since they compile independently.

Suggestion: extract to a shared fileIndexProtocol.ts.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea82f9d. Extracted to packages/core/src/utils/filesearch/fileIndexProtocol.ts:

export type WorkerRequest = /* ... */;
export type WorkerResponse = /* ... */;

Both fileIndexService.ts (main thread) and fileIndexWorker.ts (worker thread) now import type from this shared module. A new message variant added only on one side is now a compile error on the other — TS flags the unhandled discriminant in the switch.

* failure (binary missing, unexpected exit). We retry fdir for subsequent
* crawls without paying the spawn-and-fail cost every time.
*/
let ripgrepDisabled = false;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ripgrepDisabled is permanent for the process lifetime

One spawn failure (transient resource exhaustion, sandbox race, etc.) permanently degrades to fdir. Fine for CLI, risky for long-lived hosts like the VSCode extension.

Suggestion: add a cooldown (e.g. retry after 5 minutes), or reset the flag when FileIndexService creates a new instance.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea82f9d. Replaced the permanent flag with a timestamp + 5-minute cooldown:

const RIPGREP_DISABLED_COOLDOWN_MS = 5 * 60 * 1000;
let ripgrepDisabledAt = 0;

function isRipgrepDisabled(): boolean {
  if (ripgrepDisabledAt === 0) return false;
  if (Date.now() - ripgrepDisabledAt >= RIPGREP_DISABLED_COOLDOWN_MS) {
    ripgrepDisabledAt = 0;
    return false;
  }
  return true;
}

On failure we set ripgrepDisabledAt = Date.now(), which disables rg for the next 5 min only — after that the next crawl retries. Long-lived hosts (VSCode extension) recover automatically from transient sandbox/resource-exhaustion races instead of staying degraded for the rest of the session.

snapshotSize: number;
}

const INSTANCES = new Map<string, FileIndexService>();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

INSTANCES map has no capacity bound

Stale-key eviction handles same-projectRoot churn, but switching across different directories (multi-root workspace, frequent cd) accumulates workers with no LRU or cap. Each worker is ~10-30 MB.

Suggestion: add a simple cap (e.g. 5), dispose the oldest when exceeded.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea82f9d. MAX_INSTANCES = 8 cap with LRU eviction:

static for(options: FileSearchOptions): FileIndexService {
  const key = optionsKey(options);
  const existing = INSTANCES.get(key);
  if (existing && !existing.disposed) {
    // Touch: re-insert to mark as most-recently-used (Map preserves
    // insertion order).
    INSTANCES.delete(key);
    INSTANCES.set(key, existing);
    return existing;
  }
  // ... create new instance, then:
  while (INSTANCES.size > MAX_INSTANCES) {
    const oldestKey = INSTANCES.keys().next().value;
    // ...
    void victim.dispose();
  }
}

Using Map's insertion-order property: .for() hits re-insert the entry so it becomes newest, and overflow victims come from keys().next() which is the oldest untouched instance. dispose() tears down the worker properly so the ~10-30 MB per entry gets reclaimed.

Added a regression test (fileIndexService.test.ts → "evicts the oldest instance when the LRU cap is exceeded") that spawns 9 distinct-projectRoot instances and asserts the first one is disposed while the 9th is still usable.

Picked 8 conservatively: nobody realistically has more than a handful of active project roots at once, and it's well above the typical multi-root workspace / cd-churn steady state.

// Race terminate against a short timeout so a faulted worker can't hang
// dispose() indefinitely. `terminate()` normally resolves in well under
// 100ms; 2s is generous enough that healthy workers always win.
await Promise.race([
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dispose() timeout doesn't guarantee termination

The Promise.race returns after 2s, but if terminate() hasn't completed, the worker is still alive. The dangling promise resolves into nothing — no crash, but a potential leak.

Suggestion: log a warning on timeout, and consider calling worker.terminate() again as a force-kill fallback.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea82f9d. Now tracks the timeout, logs a warning, and re-issues terminate():

let timedOut = false;
await Promise.race([
  this.transport.terminate(),
  new Promise<void>((resolve) =>
    setTimeout(() => { timedOut = true; resolve(); }, 2000),
  ),
]);
if (timedOut) {
  console.warn(
    '[FileIndexService] worker terminate() timed out after 2s; retrying force-kill',
  );
  void this.transport.terminate().catch(() => {});
}

worker_threads' terminate() is idempotent, so a second call just queues another tear-down attempt — best-effort without re-blocking the caller. The warning is the "you now have a zombie" diagnostic you asked for; the retry is the force-kill fallback. Keeping the overall dispose() non-blocking on a hung worker (we intentionally don't await the retry) so the caller can proceed even if the worker is genuinely wedged.

};
}

let transportFactory: (options: FileSearchOptions) => IndexTransport = process
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

process.env['VITEST'] for transport selection is fragile

After esbuild bundling this is always undefined — dead code. Other test runners (Jest, Mocha) would get the real Worker transport, potentially causing flaky tests. The existing __setIndexTransportFactory hook shows the author already felt this wasn't clean.

Suggestion: use explicit DI (e.g. an optional transport param on FileIndexService.for()) instead of env sniffing.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea82f9d. Replaced env sniffing with an explicit DI function:

// fileIndexService.ts
let transportFactory: (options: FileSearchOptions) => IndexTransport =
  createWorkerTransport;

export function installInProcessIndexTransport(): () => void {
  return __setIndexTransportFactory(createInProcessTransport);
}

(Renamed from useInProcessIndexTransport so eslint's react-hooks/rules-of-hooks doesn't flag it at module-level call sites.)

Wired from each package's vitest setup:

// packages/core/test-setup.ts and packages/cli/test-setup.ts
import { installInProcessIndexTransport } from '@qwen-code/qwen-code-core';
installInProcessIndexTransport();

Survives esbuild bundling (no process.env lookup that gets dead-code-eliminated), works under Jest/Mocha/etc. the same way, and external embedders can opt into the in-process backend for hardened sandboxes without touching env vars. The existing __setIndexTransportFactory hook is still there for tests that want a custom fake.

- fileIndexCore: drop the redundant second crawl() on cache hits; use
  the first call's return value when onProgress never fires.
- fileIndexProtocol: new module holding WorkerRequest/WorkerResponse so
  fileIndexService and fileIndexWorker can't silently diverge.
- fileIndexService: LRU cap (8) on the INSTANCES singleton Map; hits
  re-insert to refresh recency. Dispose timeout warns + retries
  terminate() as a force-kill fallback.
- fileIndexService: drop `process.env['VITEST']` transport sniffing in
  favour of an explicit `installInProcessIndexTransport()` DI hook
  wired from each package's test-setup. Survives bundling.
- crawler: `ripgrepDisabled` is now a timestamp with a 5-minute
  cooldown (was permanent-for-process-lifetime). A single spawn failure
  no longer downgrades long-lived hosts like the VSCode extension for
  the whole session.
- crawler: collectRipgrepExcludeDirs now forwards plain directory
  patterns (.git/, build/, user ignoreDirs) to rg as `--glob '!dir'`
  args so rg prunes subtrees at the walker instead of streaming every
  path under them for the Node post-filter to reject.
- useAtCompletion: fileSearchOptions is useMemo'd on [config, cwd] and
  both effects have it in their deps — the eslint-disables are gone.
  Extracted `buildFileSearchOptions(config, cwd)` helper that
  AppContainer uses too, so the prewarm and search paths can't drift
  from the same FileIndexService singleton key.

Tests:
- fileIndexService.test: new "evicts the oldest instance when the LRU
  cap is exceeded" regression test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@callmeYe callmeYe requested a review from yiliang114 April 21, 2026 08:22
Conflict in packages/cli/src/ui/components/Composer.tsx: upstream added
`HistoryItemToolGroup` type import to support per-subagent token
attribution, while this branch removed the `ConfigInitDisplay` banner
as part of the pre-warm UX change. Resolved by keeping the upstream
type import and dropping `ConfigInitDisplay` (import + JSX usage),
which is the intended behaviour of this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@wenshao wenshao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

packages/cli/src/ui/hooks/useAtCompletion.test.ts:230

[Critical] realFileSearch.search(...args) now triggers TS2556 because args is inferred as a generic rest array, but search() has a fixed (pattern, options?) signature. This makes the PR fail typecheck on a file changed here.

Suggested fix: replace the spread call with an explicit call such as realFileSearch.search(args[0], args[1]), or type the mock implementation parameters as the exact tuple for search.

— gpt-5.4 via Qwen Code /review

@callmeYe
Copy link
Copy Markdown
Author

Thanks for the extra pass, @wenshao! 🙏 I think this one is a false positive from the /review agent though — let me lay out the evidence:

1. The flagged code is not touched by this PR

git blame packages/cli/src/ui/hooks/useAtCompletion.test.ts -L 225,232:

2c6794feed (Arya Gummadi 2025-08-25 17:27:36 -0700 227)         search: vi
2c6794feed (Arya Gummadi 2025-08-25 17:27:36 -0700 228)           .fn()
2c6794feed (Arya Gummadi 2025-08-25 17:27:36 -0700 229)           .mockImplementation(async (...args) =>
2c6794feed (Arya Gummadi 2025-08-25 17:27:36 -0700 230)             realFileSearch.search(...args),
2c6794feed (Arya Gummadi 2025-08-25 17:27:36 -0700 231)           ),

These lines were written by @AryaGummadi on 2025-08-25, ~7 months before this PR was opened. git log HEAD..upstream/main -- packages/cli/src/ui/hooks/useAtCompletion.test.ts returns nothing from this branch either — the file is not in the branch's changeset.

2. Local typecheck is clean

cd packages/cli && npx tsc --noEmit → 0 errors. Root tsconfig has strict: true + noImplicitAny, strictFunctionTypes, strictBindCallApply, noPropertyAccessFromIndexSignature, all the usual suspects turned on, so the local run matches CI.

3. Why TS2556 doesn't fire here

FileSearch.search is (pattern: string, options?: SearchOptions) => Promise<string[]>. The spread works because mockImplementation is typed as (...args: Parameters<T>) => ReturnType<T>; TS resolves args to the exact tuple [string, SearchOptions?], not a generic rest array. So realFileSearch.search(...args) matches the fixed signature.

A TS2556 would only fire if args were typed as any[] / unknown[] (no tuple preserved). That could happen if the surrounding context broke Parameters<T> inference — but here the vi.fn() is contextually typed via the FileSearch interface's search property, so the tuple is preserved.

4. Tests and typecheck both green

$ npm test --workspace=packages/cli -- --run src/ui/hooks/useAtCompletion.test.ts
Test Files  1 passed (1)
Tests  13 passed (13)

Happy to send you the full tsc --noEmit output and the test run log if helpful. If you're seeing a different result in CI, could you share the CI tsc output? It'd help pinpoint whether there's a genuine config drift we need to investigate vs. the agent hallucinating a signature mismatch.

@callmeYe callmeYe requested a review from wenshao April 21, 2026 15:06
wenshao
wenshao previously approved these changes Apr 21, 2026
Copy link
Copy Markdown
Collaborator

@wenshao wenshao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found. LGTM! ✅ — gpt-5.4 via Qwen Code /review

@wenshao wenshao dismissed their stale review April 21, 2026 22:20

Re-reviewing: 2 bare-mode tests in config.test.ts fail locally because test-setup.ts now eagerly imports from @qwen-code/qwen-code-core, which loads workspaceContext.ts with real fs before vi.mock('fs') can apply. CI Test jobs have also been stuck ~10h on "Run tests and generate reports" across all OS/Node matrix entries, plausibly same root cause (eager worker/timer handles keeping vitest from exiting). Please defer the core import until the test that actually needs the worker transport.

@wenshao
Copy link
Copy Markdown
Collaborator

wenshao commented Apr 21, 2026

Took another pass at this locally, dismissed the previous approve. Two related things need addressing:

1. --bare mode tests broken by this PR

Two bare-mode tests in packages/cli/src/config/config.test.ts fail locally:

  • loadCliConfig with includeDirectories > should ignore implicit startup context inputs in bare mode
  • loadCliConfig > should force minimal startup behavior in bare mode

Failure:

expected [] to deeply equal [ '/home/user/project', '/cli/path1' ]

Root cause (reproducible): the new lines in packages/cli/test-setup.ts:26-27

import { installInProcessIndexTransport } from '@qwen-code/qwen-code-core';
installInProcessIndexTransport();

eagerly evaluate the entire core module tree at setup-file time. That loads packages/core/src/utils/workspaceContext.ts with the real node:fs bound to its fs reference. config.test.ts later registers vi.mock('fs', ...), but by then workspaceContext is already holding the un-mocked fs — so WorkspaceContext.addDirectory calls resolveAndValidateDir, fs.existsSync('/home/user/project') returns false against the real filesystem, addDirectory silently catches Directory does not exist, and getDirectories() ends up empty.

Repro:

cd packages/cli && npx vitest run src/config/config.test.ts -t "should ignore implicit startup"
# FAIL

Commenting out those two lines in test-setup.ts — or even swapping them for import { FatalError } from '@qwen-code/qwen-code-core' — the test passes. So it isn't the install call itself, it's any top-level import from core in test-setup.ts pulling workspaceContext.ts in before vi.mock('fs') has a chance to run.

Also worth noting: git diff main -- packages/cli/src/config/config.ts packages/cli/src/config/config.test.ts is empty, so this isn't a bare-mode logic regression — it's a pure test-setup side effect introduced by this PR.

2. CI Test matrix stuck for ~10h

On https://github.com/QwenLM/qwen-code/actions/runs/24722612700, all 9 Test (<os>-<node>) jobs have been stuck on "Run tests and generate reports" since 2026-04-21T20:23:00Z. Lint + CodeQL are green.

Likely same root cause as #1: installInProcessIndexTransport() runs in the global setup phase, swaps the transport factory for the in-process variant, and the open handles it creates further down the line (e.g. the 2 s setTimeout dispose-guard at fileIndexService.ts:469, or inflight AbortControllers) may not be released when vitest tries to exit — the process hangs waiting for open handles. Locally the default pool behaviour might not repro, but the matrix (Windows especially) is more sensitive.

Suggested fixes

  1. Don't call installInProcessIndexTransport from test-setup.ts. Scope it to tests that actually use FileIndexService via beforeAll / afterAll:

    // e.g. in useAtCompletion.test.ts
    import { beforeAll, afterAll } from 'vitest';
    let restore: (() => void) | null = null;
    beforeAll(async () => {
      const { installInProcessIndexTransport } = await import('@qwen-code/qwen-code-core');
      restore = installInProcessIndexTransport();
    });
    afterAll(() => {
      restore?.();
    });

    That way config.test.ts (and any other file that doesn't need the transport) never pulls core / workspaceContext.ts into the module graph before its vi.mock('fs') applies.

  2. If a global install is genuinely required, at minimum switch to a dynamic import:

    await import('@qwen-code/qwen-code-core').then(m => m.installInProcessIndexTransport());

    (Though this doesn't fully solve the ordering issue — option 1 is cleaner.)

  3. Investigate the CI hang independently: run locally with vitest --detectOpenHandles to catch any lingering setTimeout / Worker / AbortController handles. The 2 s race in FileIndexService.dispose should probably be exercised in a teardown hook during tests.

  4. Would be worth adding a smoke test ensuring that test-setup imports from core don't poison module-level fs mocks — prevents this kind of regression in the future.

Happy to take another look once these land and CI goes green.

…imers

Addressing @wenshao's two findings on QwenLM#3455:

1. `--bare` mode tests failing in `packages/cli/src/config/config.test.ts`

   Root cause (correctly diagnosed): the top-level
   `installInProcessIndexTransport()` call in `test-setup.ts` eagerly
   evaluated `@qwen-code/qwen-code-core`'s module tree, pulling
   `workspaceContext.ts` in with a real `node:fs` binding. Later
   `vi.mock('fs', …)` declarations in individual test files no longer
   took effect, so `fs.existsSync()` fell through to the real FS and
   silently dropped mock directories.

   Fixed by removing the install from both `test-setup.ts` files and
   opt-ing in per-test-file (`beforeAll`/`afterAll`) in the three
   suites that actually exercise the worker-backed filesearch:

   - `packages/core/src/utils/filesearch/fileIndexService.test.ts`
   - `packages/core/src/utils/filesearch/fileSearch.test.ts`
     (also drops live singletons between tests so Windows rmdir
     doesn't race an open ripgrep child)
   - `packages/cli/src/ui/hooks/useAtCompletion.test.ts`

   The two `test-setup.ts` files now carry an explanatory comment so
   this doesn't regress.

2. CI matrix stuck for ~10h on "Run tests and generate reports"

   Likely cause: pending `setTimeout` handles keeping the event loop
   alive past test completion.

   - `FileIndexService.dispose()` raced `terminate()` against a 2 s
     `setTimeout` but never cleared the timer on the healthy-exit
     path — so after every disposal the process held a timer for up
     to 2 s. Now the terminate() arm clears the timer, and the
     timer is also `.unref()`'d as a belt-and-suspenders.
   - `crawlCache` TTL timers (default 30 s) are now `.unref()`'d
     too. They'd keep vitest workers waiting on exit even after
     all assertions passed.

   `clear()` still drops both synchronously between tests, so
   correctness is unchanged.

Verified:
- `npx vitest run src/config/config.test.ts -t "should ignore implicit startup"` now passes (was failing before).
- Full test suites — core 5925 / 5927, cli 4275 / 4282 (skipped tests pre-existing), vscode-ide-companion 204 / 205 — all green.
- Typecheck clean on core + cli.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@callmeYe
Copy link
Copy Markdown
Author

Thanks so much for the careful follow-up, @wenshao 🙏 — both findings were spot on. Fixed in 44fcc11.

1. --bare mode tests ✅

Your diagnosis was exactly right: the top-level installInProcessIndexTransport() call in test-setup.ts pulled @qwen-code/qwen-code-core's module tree into every test file's graph before any per-file vi.mock('fs', …) had a chance to run, so workspaceContext.ts was already closed over the real node:fs.

I took your Option 1 — scope the install to per-test-file beforeAll/afterAll hooks. The three suites that actually exercise the worker-backed filesearch got:

let restoreTransport: (() => void) | null = null;
beforeAll(() => {
  restoreTransport = installInProcessIndexTransport();
});
afterAll(() => {
  restoreTransport?.();
  restoreTransport = null;
});

applied in:

  • packages/core/src/utils/filesearch/fileIndexService.test.ts
  • packages/core/src/utils/filesearch/fileSearch.test.ts (also added singleton resets between tests so Windows rmdir doesn't race an open rg child)
  • packages/cli/src/ui/hooks/useAtCompletion.test.ts

Both test-setup.ts files now carry an explanatory comment so this doesn't regress. Verified locally:

$ cd packages/cli && npx vitest run src/config/config.test.ts -t "should ignore implicit startup"
Tests  1 passed | 177 skipped (178)
$ cd packages/cli && npx vitest run src/config/config.test.ts -t "should force minimal startup"
Tests  1 passed | 177 skipped (178)

Both previously-failing bare-mode tests now green.

2. CI matrix hang ✅

Ran the filesearch suite under the same kind of scrutiny and found two lingering setTimeout handles keeping the event loop alive past test completion:

A. FileIndexService.dispose() 2 s race timer (fileIndexService.ts:469)

The Promise.race would resolve from transport.terminate() on the healthy path, but the 2 s fallback timer was never cleared — so every dispose() held the event loop open for up to 2 s after the test assertions passed. Fixed:

await Promise.race([
  this.transport.terminate().then(() => {
    if (timer) clearTimeout(timer);   // healthy-exit clears the timer
  }),
  new Promise<void>((resolve) => {
    timer = setTimeout(() => { timedOut = true; resolve(); }, 2000);
    timer.unref?.();                  // belt-and-suspenders
  }),
]);

B. crawlCache TTL timer (30 s default, crawlCache.ts:54)

The cache-eviction timer had no .unref(), so whenever a test triggered a cacheable crawl the process held a handle for up to 30 s. Same fix — timerId.unref?.() right after setTimeout. clear() still drops synchronously between tests so correctness is unchanged.

Locally, filesearch-suite wall time went from ~7 s + a handful of seconds of hanging to 5.18 s clean exit. Matches your --detectOpenHandles intuition — thank you for that pointer.

Test matrix

core:                  5925 / 5927  (2 pre-existing skips)
cli:                   4275 / 4282  (7 pre-existing skips)
vscode-ide-companion:   204 /  205  (1 pre-existing skip)

All green. Typecheck clean on both packages. CI should unstick once 44fcc11 lands — let me know if the matrix still shows anything odd.

Re: suggestion 4 (smoke test guarding test-setup fs-mock poisoning)

Good idea in principle, but I'd want to noodle on the right mechanism — asserting "no core imports" in test-setup.ts from within the test harness is a bit of a chicken-and-egg. Could be a simple lint rule against importing from @qwen-code/qwen-code-core (or relative paths into core) in any test-setup.ts. Happy to send that as a separate PR if it'd be useful — wanted to keep this one focused on unblocking the current CI.

Thanks again for the detailed writeup — the repro steps and diagnosis saved me significant time.

@yiliang114 yiliang114 added the TBD To Be Discussed label Apr 22, 2026
@callmeYe callmeYe requested a review from wenshao April 22, 2026 07:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

TBD To Be Discussed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

@-picker freezes the CLI for on large workspaces

3 participants