perf: fast keyword extraction + disable reasoning on lightweight LLMs by neuromechanist · Pull Request #151 · Annotation-Garden/HEDit

neuromechanist · 2026-05-21T17:53:38Z

Closes #148, #150. Sub-issues of #147.

Summary

After #146 landed the persistent hed-lsp client, the per-request "Initializing annotation workflow..." gap dropped from 20-60 s to ~10-12 s. Direct measurement showed the LSP call itself is 0.5 s; the remaining time is one LLM call inside _extract_keywords, which in prod ran on the evaluation model (qwen3.6-35b-a3b) with extended reasoning enabled by default.

Two coupled fixes:

1. #148, "Use fast LLM with reasoning disabled for keyword extraction (and eval/feedback/assess)": HedAnnotationWorkflow takes an optional keyword_llm, which defaults to the annotation LLM (claude-haiku-4.5 in prod), a model well-suited to a "list 5 keywords" task. create_openrouter_workflow and the standalone CLI build a dedicated keyword_llm with the annotation model, max_tokens=200, and reasoning disabled.
2. #150, "Disable reasoning on non-annotation workflow LLMs (keyword, eval, feedback, assess)": create_openrouter_llm gains a disable_reasoning flag. When True, it sets model_kwargs["reasoning"] = {"enabled": False}, OpenRouter's portable cross-provider flag that turns off extended thinking on Anthropic, Qwen, and OpenAI in one shot. The flag is passed for evaluation_llm, assessment_llm, feedback_llm, and keyword_llm. The annotation LLM keeps reasoning enabled.
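As a rough illustration of the two changes, the sketch below shows how a disable_reasoning flag could translate into model_kwargs and how an optional keyword_llm could fall back to the annotation LLM. The names build_openrouter_llm_kwargs and WorkflowSketch are hypothetical stand-ins, not the repo's create_openrouter_llm or HedAnnotationWorkflow, and the model slug is illustrative.

```python
from typing import Any, Dict, Optional


def build_openrouter_llm_kwargs(
    model: str,
    max_tokens: Optional[int] = None,
    disable_reasoning: bool = False,
) -> Dict[str, Any]:
    """Assemble constructor kwargs for an OpenRouter-backed chat model (hypothetical helper)."""
    kwargs: Dict[str, Any] = {"model": model, "model_kwargs": {}}
    if max_tokens is not None:
        kwargs["max_tokens"] = max_tokens
    if disable_reasoning:
        # The reasoning flag described in this PR; only set when explicitly requested.
        kwargs["model_kwargs"]["reasoning"] = {"enabled": False}
    return kwargs


class WorkflowSketch:
    """Hypothetical stand-in for the workflow, showing only the keyword_llm fallback."""

    def __init__(self, annotation_llm: Any, keyword_llm: Optional[Any] = None) -> None:
        self.annotation_llm = annotation_llm
        # Fall back to the annotation LLM when no dedicated keyword LLM is supplied.
        self.keyword_llm = keyword_llm if keyword_llm is not None else annotation_llm


# Illustrative slug; the real model identifiers live in the repo's configuration.
annotation_kwargs = build_openrouter_llm_kwargs("anthropic/claude-haiku-4.5")
keyword_kwargs = build_openrouter_llm_kwargs(
    "anthropic/claude-haiku-4.5", max_tokens=200, disable_reasoning=True
)

default_workflow = WorkflowSketch(annotation_llm="annotation-llm")
dedicated_workflow = WorkflowSketch(annotation_llm="annotation-llm", keyword_llm="keyword-llm")
```

Keeping the flag opt-in means the annotation LLM's kwargs are unchanged unless a caller asks for reasoning to be disabled.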

Measurement (prod container, real OpenRouter calls)

| Setup | Wall time | Output |
| --- | --- | --- |
| claude-haiku-4.5 (default, reasoning on) | 7–9 s | thinking blocks |
| claude-haiku-4.5 with `reasoning.enabled=false`, `max_tokens=200` | ~1 s | clean comma-separated text |
| qwen3.6-35b-a3b (reasoning on) | 1.5 s | thinking blocks |
| qwen3.6-35b-a3b with `reasoning.enabled=false` | 0.5 s | clean text |

End-to-end expected effect: the pre-annotate window goes from ~10-12 s to ~1-2 s. Evaluation, feedback, and assessment calls should be 2x or more faster.
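Wall times like these can be collected with a small harness along the following lines. This is a minimal sketch, not the script used for the numbers in this PR: time_calls is a hypothetical helper, and fake_keyword_call stands in for a real OpenRouter request.

```python
import time
from statistics import median
from typing import Callable, List


def time_calls(fn: Callable[[], object], repeats: int = 3) -> List[float]:
    """Return wall-clock durations in seconds for `repeats` invocations of fn."""
    durations: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def fake_keyword_call() -> str:
    # Stand-in for a real LLM call; replace with the actual client invocation.
    time.sleep(0.01)
    return "keyword1, keyword2"


durations = time_calls(fake_keyword_call, repeats=3)
print(f"median wall time: {median(durations):.3f} s")
```

Reporting the median over a few repeats damps the effect of a single slow network round trip.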

Test plan

- `uv run pytest -m "not integration"`: 465 passed, 1 skipped.
- `uv run pytest tests/test_openrouter_llm.py`: 18 passed, including the two new flag-passthrough tests.
- `uv run pytest tests/lsp/ tests/test_validation_agent.py`: 39 passed (real LSP, no mocks).
- Local empirical test against OpenRouter confirms the timing numbers above.
- After merge + deploy: time a real /annotate request and confirm the "Initializing annotation workflow..." window dropped to ~1-2 s.

Out of scope

#149 (run semantic_preprocess in parallel with the first annotate call) is the follow-up that takes the window to ~0 s perceived latency. It will be a separate PR after this lands.
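For orientation only, running two independent coroutines concurrently could look like the sketch below. How #149 will actually wire this into the workflow is not specified here; both coroutines are hypothetical stand-ins with simulated latency.

```python
import asyncio
from typing import List, Tuple


async def semantic_preprocess(description: str) -> List[str]:
    # Stand-in for keyword extraction plus the LSP lookup.
    await asyncio.sleep(0.01)
    return description.split()[:5]


async def first_annotate(description: str) -> str:
    # Stand-in for the first annotation LLM call.
    await asyncio.sleep(0.01)
    return f"draft annotation for: {description}"


async def run_both(description: str) -> Tuple[List[str], str]:
    # gather schedules both coroutines concurrently and returns results in argument order.
    keywords, draft = await asyncio.gather(
        semantic_preprocess(description),
        first_annotate(description),
    )
    return keywords, draft


keywords, draft = asyncio.run(run_both("a red circle appears on screen"))
```

With this shape, the preprocessing time overlaps with the annotate call instead of preceding it, which is where the ~0 s perceived latency would come from.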

Closes #148, #150. Sub-issues of #147.

After the persistent hed-lsp client landed (#146), the per-request 'Initializing annotation workflow...' window shrank from 20-60 s to ~10-12 s. The remaining cost is the LLM call inside _semantic_preprocess_node._extract_keywords, which in prod was routed to the evaluation model (qwen3.6-35b-a3b) with extended reasoning enabled by default.

Two coupled changes:

1. (#148) HedAnnotationWorkflow now takes an optional keyword_llm parameter. Default falls back to the annotation LLM (claude-haiku in prod), which is fast and well-suited to a 5-keyword extraction task. create_openrouter_workflow and the standalone CLI's local_executor build a dedicated keyword_llm with the annotation model, max_tokens=200, and reasoning disabled.
2. (#150) create_openrouter_llm gains a disable_reasoning flag. When True it sets model_kwargs['reasoning'] = {'enabled': False} -- the OpenRouter portable flag that turns off extended thinking across Anthropic, Qwen, and OpenAI providers in one shot. The flag is passed when building evaluation_llm, assessment_llm, feedback_llm, and keyword_llm; the annotation LLM keeps reasoning on since that model is doing the real HED tag synthesis where thinking helps first-attempt quality.

Measured in the prod container against real OpenRouter calls:

- claude-haiku-4.5 (reasoning on, default): 7-9 s for keyword extraction, response contains thinking blocks.
- claude-haiku-4.5 with reasoning.enabled=false, max_tokens=200: ~1 s, clean comma-separated text.
- qwen3.6-35b-a3b (reasoning on): 1.5 s, thinking blocks.
- qwen3.6-35b-a3b with reasoning.enabled=false: 0.5 s, clean text.

Bumps the API to 0.7.10a2.

cloudflare-workers-and-pages · 2026-05-21T17:53:39Z

Deploying hedit with Cloudflare Pages

Latest commit:	`498c0f2`
Status:	⚡️ Build in progress...


codecov · 2026-05-21T17:55:36Z

Codecov Report

❌ Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/cli/local_executor.py | 0.00% | 2 Missing ⚠️ |


Per CLAUDE.md 'Develop Branch Sync Rule': after each alpha release on main (0.7.10a2 here), develop bumps the patch and resets to .dev0 so the two branches share a clean version lineage and dev builds publish to TestPyPI under the next patch series. Fast-forwarded merge from main (no divergence: develop had nothing ahead). All #146 (persistent hed-lsp) and #151 (#148+#150 latency) work is now on develop.

neuromechanist merged commit fe49ece into main May 21, 2026
22 of 23 checks passed

neuromechanist merged 1 commit into main from feature/issue-148-fast-keyword-extraction
