Skip to content

feat: persistent hed-lsp client (closes #144)#146

Merged
neuromechanist merged 9 commits into
mainfrom
feature/issue-144-persistent-lsp
May 21, 2026
Merged

feat: persistent hed-lsp client (closes #144)#146
neuromechanist merged 9 commits into
mainfrom
feature/issue-144-persistent-lsp

Conversation

@neuromechanist
Copy link
Copy Markdown
Member

Summary

Eliminates the dominant per-request latency on the prod backend: the multi-second "Initializing annotation workflow..." gap before annotation starts. The cause was _semantic_preprocess_node calling subprocess.run(["hed-suggest", ...]) once per extracted keyword in a sequential loop. Each spawn cold-booted the LSP server, loaded ~600 MB embeddings, answered one query, and exited (~6 s warm, 91 s very-cold). With 10 keywords that stacked to 60+ s.

This PR uses the hed-lsp Node server as designed: one persistent JSON-RPC connection held for the lifetime of the FastAPI process, batched hed/suggest queries answered in ~40 ms.

Closes #144.

Companion change

hed-lsp branch feat/hed-suggest-request adds the hed/suggest JSON-RPC handler. The Dockerfile pins to that branch via ARG HED_LSP_REF; will switch back to a tag-based pin once that PR merges.

What's in this PR

  • src/lsp/ — async LSP/JSON-RPC client over stdio and Unix sockets, plus hedit-lspd supervisor daemon.
  • FastAPI lifespan spawns one node server.js --stdio child on startup and shuts it down on teardown.
  • HedAnnotationWorkflow._semantic_preprocess_node replaces the per-keyword subprocess.run loop with one batched await client.suggest(*keywords).
  • ValidationAgent uses the same persistent client for invalid-tag replacement suggestions.
  • hedit lsp start|stop|status for the opt-in standalone CLI daemon. hedit annotate auto-detects the socket; HEDIT_LSP=0 disables.
  • src/validation/hed_lsp.py reduced to a thin re-export shim; the synchronous subprocess.run path is deleted (no backwards-compat shims per project policy).
  • Real-LSP integration tests in tests/lsp/ (7 tests, no mocks): client stdio, batched suggest, concurrent demux, idempotent shutdown, missing-server-js error, full daemon lifecycle.
  • Old mock-heavy tests/test_hed_lsp.py + tests/test_keyword_extraction.py deleted.
  • Version bump 0.7.9.dev4 -> 0.7.10a0.

Verification

Local against the hed-lsp feat/hed-suggest-request build:

  • LSP initialize handshake: 163 ms
  • First batched hed/suggest (5 queries, includes schema load): 37 ms
  • Second batched call (3 queries, schema cached): 84 ms

Existing test suite: 450 passed, 1 skipped, 43 deselected (-m "not integration").
LSP-specific suite: 7 passed in 1.27 s (real LSP, no mocks).

Out of scope

  • Auto-spawn-with-idle-timeout daemon for CLI (explicitly rejected during planning — too much hidden state on user machines).
  • Bundling Node.js into the wheel (docs require Node >= 20 instead).

Test plan

  • uv run pytest -m "not integration" -- 450 passed.
  • uv run pytest tests/lsp/ against a local hed-lsp build -- 7 passed.
  • uv run ruff check clean on all touched files.
  • Smoke-tested LSP initialize + batched suggest against the live server.
  • After merge + deploy on hedtools: verify docker exec hedit pgrep -fa "server.js --stdio" shows the persistent child, and a fresh /annotate request shows preprocess wall time <500 ms.

src/lsp/ adds a persistent client for the hed-lsp Node server.
HedLspClient owns either a spawned 'node server.js --stdio' child
(spawn_stdio) or a Unix-socket connection to a running daemon
(connect_unix). Both transports speak the same LSP wire protocol
(Content-Length framed JSON-RPC); the client demultiplexes responses
by id so concurrent suggest() calls work.

src/lsp/daemon.py is the 'hedit-lspd' supervisor entry point. It
owns one Node child and proxies LSP messages bidirectionally to
multiple Unix-socket peers, so a single warm server can serve many
'hedit annotate' CLI invocations. Per-peer 'shutdown' and 'exit'
are intercepted (a peer disconnecting must not kill the shared
child).

Real-LSP integration tests cover stdio spawn, batched suggest,
concurrent demuxing, empty-query failure, idempotent shutdown,
missing-server-js error, and the full daemon start -> connect ->
suggest -> SIGTERM lifecycle. No mocks; tests skip cleanly when
node or hed-lsp isn't installed.
HedAnnotationWorkflow now takes an injected lsp_client and uses
one batched suggest(*keywords) call from _semantic_preprocess_node
instead of spawning hed-suggest per keyword in a sequential loop.
Same change in ValidationAgent for tag-replacement suggestions.

The FastAPI lifespan spawns the persistent client at startup
(HED_LSP_SERVER_JS env or /app/hed-lsp/server/out/server.js by
default) and shuts it down on teardown. A module-level sentinel
keeps the 8 per-request workflow construction sites unchanged.
HED_LSP_DISABLE=1 falls back to keyword-only preprocessing.

For the standalone CLI:
- 'hedit lsp start|stop|status' manage an opt-in 'hedit-lspd'
  daemon under the per-user runtime dir.
- 'hedit annotate' picks up the daemon's socket if present;
  otherwise runs without LSP (HEDIT_LSP=0 to force-skip).
- No auto-spawn, no idle-timeout magic; users opt in explicitly.

src/validation/hed_lsp.py is now a thin re-export shim over
src.lsp plus find_lsp_server_js() / is_hed_lsp_available()
helpers. The synchronous subprocess.run path is deleted; no
backwards-compat shims.
Dockerfile pins the hed-lsp clone to the feat/hed-suggest-request
branch via the new HED_LSP_REF build arg so the runtime image picks
up the hed/suggest request handler. Adds HED_LSP_SERVER_JS to the
container ENV so the FastAPI lifespan can find the entrypoint
without configuration.

.env.example documents the new env vars: HED_LSP_SERVER_JS,
HED_LSP_DISABLE, HEDIT_LSP_RUNTIME_DIR, HEDIT_LSP. Drops the stale
docs for the now-removed HED_LSP_USE_SEMANTIC and HED_LSP_MAX_RESULTS
knobs.

Patch bump with alpha prerelease per the develop -> main release
workflow.
@cloudflare-workers-and-pages
Copy link
Copy Markdown

cloudflare-workers-and-pages Bot commented May 21, 2026

Deploying hedit with  Cloudflare Pages  Cloudflare Pages

Latest commit: 3250f94
Status: ✅  Deploy successful!
Preview URL: https://c3bcda98.hedit.pages.dev
Branch Preview URL: https://feature-issue-144-persistent.hedit.pages.dev

View logs

Resolves trivial conflict on version files. Keeping the feature
branch's 0.7.10a0 (patch+alpha bump) over main's 0.7.9a2 since
this PR is the next alpha promotion off develop.
Code-review and silent-failure audit raised a few real defects and
logging-quality issues. Fixes, paired with the test that catches
the most important one:

- src/lsp/client.py: when the read loop exits (clean EOF, protocol
  error, or unexpected crash) the client now sets _shutdown_started
  so subsequent suggest() calls fail fast instead of hanging on a
  future that will never be resolved. Added a regression test
  (test_suggest_after_shutdown_fails_fast) that exercises this.
- src/lsp/client.py: _request_unshutdownable now pops the pending
  future on write failure, matching _request's contract, so a
  broken pipe during shutdown doesn't strand a future for
  _fail_pending to clean up by side effect.
- src/lsp/client.py: shutdown() now cancels the reader task before
  closing the writer, so the reader doesn't observe a half-closed
  stream and log a spurious 'connection closed' before shutdown
  completes. Writer close failures are logged at debug level
  instead of being silently swallowed.
- src/lsp/client.py: _dispatch logs unknown-id responses and
  late-arriving responses at warning level, and unhandled server
  notifications at debug, so protocol-level anomalies aren't
  invisible.
- src/lsp/daemon.py: _NodeChild error sentinel now echoes the
  pending request id so HedLspClient._dispatch can resolve the
  caller's future instead of silently dropping the message.
- src/lsp/daemon.py: _NodeChild._read_loop catches unexpected
  exceptions with logger.exception so a reader crash isn't an
  asyncio.Task ghost. Server task crashes in serve_until_signal
  now log at error with traceback. Peer writer close errors get
  a debug log.
- src/api/main.py: LSP spawn / shutdown failures now go through
  logging.getLogger('hedit.lsp') with exc_info, not print(), so
  log aggregation in prod can detect a silently-degraded server.
  The exception clause around spawn_stdio narrows to (RuntimeError,
  OSError, TimeoutError) so programming errors propagate.
- src/cli/local_executor.py: connect-to-daemon failure is logged
  at debug with the socket path and the error; previously it was
  a bare except: pass.
- src/agents/workflow.py and validation_agent.py: the outer except
  around lsp_client.suggest() now passes exc_info=True so an
  unexpected exception (one that suggest() didn't already convert
  to a failure result) leaves a usable traceback.
- tests/lsp/test_daemon_lifecycle.py: replace deprecated
  asyncio.get_event_loop().time() with time.monotonic().
Comment thread src/cli/commands/lsp.py
for path in (pid_file_path(rt_dir), socket_file_path(rt_dir), meta_file_path(rt_dir)):
try:
path.unlink()
except FileNotFoundError:
Comment thread src/lsp/client.py
task.cancel()
try:
await task
except (asyncio.CancelledError, Exception):
Comment thread src/lsp/daemon.py
def remove_broadcast_queue(self, queue: asyncio.Queue[dict]) -> None:
try:
self._broadcast.remove(queue)
except ValueError:
Comment thread src/lsp/daemon.py
if self._process.returncode is None:
try:
self._stdin.close()
except Exception:
Comment thread src/lsp/daemon.py
task.cancel()
try:
await task
except (asyncio.CancelledError, Exception):
Comment thread src/lsp/daemon.py
for task in list(self._connections):
try:
await task
except (asyncio.CancelledError, Exception):
Comment thread src/lsp/daemon.py
for path in (self._socket_path, self._pid_path, self._meta_path):
try:
path.unlink()
except FileNotFoundError:
Comment thread src/lsp/daemon.py
forward_task.cancel()
try:
await forward_task
except (asyncio.CancelledError, Exception):
Comment thread src/lsp/daemon.py

try:
asyncio.run(_run())
except KeyboardInterrupt:
@codecov
Copy link
Copy Markdown

codecov Bot commented May 21, 2026

hed-lsp migrated from npm to pnpm workspaces (hed-standard/hed-lsp#28),
so 'npm install' at the repo root no longer hydrates server/node_modules
and the subsequent tsc -b fails with TS2307 'Cannot find module
vscode-languageserver' etc.

Switch to pnpm via corepack, pinned to the same version the upstream
CI uses (10.33.4), and drop the now-redundant 'npm link' for the
hed-suggest CLI binary — the persistent LSP path spawns server.js
directly via HED_LSP_SERVER_JS, so the CLI shim doesn't ship in the
image anymore.
The CI build uses deploy/Dockerfile (see docker-build.yml + security-scan.yml),
not the root Dockerfile, so the previous npm->pnpm migration commit
missed the file CI actually runs against.

Mirror the same change: pin hed-lsp to feat/hed-suggest-request,
switch the build to pnpm@10.33.4 via corepack, drop the obsolete
'npm link' of hed-suggest, and export HED_LSP_SERVER_JS so the
FastAPI lifespan finds the server entrypoint without configuration.
The src/lsp/ module is the bulk of this PR but was gated behind
pytest.mark.integration, so the unit-test CI job skipped it and
codecov/patch dropped below threshold.

The hed_lsp_server_js conftest fixture already skips cleanly when
node or hed-lsp aren't available, so the integration marker isn't
actually needed for these tests — they're 'integration' with a Node
subprocess, not with a paid LLM API. Drop the marker so the tests
run in the default lane.

CI then needs hed-lsp available: setup-node@v6 + a build step that
clones the pinned hed-lsp branch, runs pnpm install + compile, and
exports HED_LSP_SERVER_JS so the conftest finds the entrypoint.
This makes src/lsp/* execute under coverage in unit-tests, lifting
patch coverage above the 50% codecov threshold without weakening
any policy.
hed-lsp PR #31 merged as 36447afb5e48bfecc45af0464b652614ba571414 on
main (v0.4.0). Pin both deploy/Dockerfile and .github/workflows/test.yml
to that commit SHA so the build is reproducible and survives any
branch cleanup upstream.

tests/lsp/test_daemon_internals.py: in-process exercise of _NodeChild
and _Daemon. The existing end-to-end test launches the daemon in a
subprocess, which means coverage.py never sees the daemon code. These
five new tests instantiate _NodeChild and _Daemon directly within the
test process so coverage on src/lsp/daemon.py jumps from ~14% (just
the entry-point main()) to most of the proxy and lifecycle logic.

Still no mocks: tests use the real Node hed-lsp child and a real Unix
socket; the hed_lsp_server_js fixture skips when neither is available.
@neuromechanist neuromechanist merged commit 08a6d2b into main May 21, 2026
23 checks passed
neuromechanist added a commit that referenced this pull request May 21, 2026
Per CLAUDE.md 'Develop Branch Sync Rule': after each alpha release on
main (0.7.10a2 here), develop bumps the patch and resets to .dev0 so
the two branches share a clean version lineage and dev builds publish
to TestPyPI under the next patch series.

Fast-forwarded merge from main (no divergence: develop had nothing
ahead). All #146 (persistent hed-lsp) and #151 (#148+#150 latency)
work is now on develop.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use persistent hed-lsp instead of per-keyword CLI spawn

2 participants