perf: fast keyword extraction + disable reasoning on lightweight LLMs #151
Merged
Closes #148, #150. Sub-issues of #147.

After the persistent hed-lsp client landed (#146), the per-request 'Initializing annotation workflow...' window shrank from 20-60 s to ~10-12 s. The remaining cost is the LLM call inside _semantic_preprocess_node._extract_keywords, which in prod was routed to the evaluation model (qwen3.6-35b-a3b) with extended reasoning enabled by default.

Two coupled changes:

1. (#148) HedAnnotationWorkflow now takes an optional keyword_llm parameter. The default falls back to the annotation LLM (claude-haiku in prod), which is fast and well-suited to a 5-keyword extraction task. create_openrouter_workflow and the standalone CLI's local_executor build a dedicated keyword_llm with the annotation model, max_tokens=200, and reasoning disabled.
2. (#150) create_openrouter_llm gains a disable_reasoning flag. When True, it sets model_kwargs['reasoning'] = {'enabled': False}, the OpenRouter portable flag that turns off extended thinking across Anthropic, Qwen, and OpenAI providers in one shot. The flag is passed when building evaluation_llm, assessment_llm, feedback_llm, and keyword_llm; the annotation LLM keeps reasoning on, since that model does the real HED tag synthesis, where thinking helps first-attempt quality.

Measured in the prod container against real OpenRouter calls:

- claude-haiku-4.5 (reasoning on, default): 7-9 s for keyword extraction, response contains thinking blocks.
- claude-haiku-4.5 with reasoning.enabled=false, max_tokens=200: ~1 s, clean comma-separated text.
- qwen3.6-35b-a3b (reasoning on): 1.5 s, thinking blocks.
- qwen3.6-35b-a3b with reasoning.enabled=false: 0.5 s, clean text.

Bumps the API to 0.7.10a2.
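The keyword_llm fallback described in change 1 can be sketched as follows. This is a minimal illustration, not the project's code: `WorkflowSketch` is a hypothetical stand-in for HedAnnotationWorkflow, and the constructor signature is an assumption; only the "optional keyword_llm, defaulting to the annotation LLM" behavior comes from the description above.

```python
from typing import Any, Optional


class WorkflowSketch:
    """Hypothetical stand-in for HedAnnotationWorkflow; only the LLM wiring is shown."""

    def __init__(self, annotation_llm: Any, keyword_llm: Optional[Any] = None) -> None:
        self.annotation_llm = annotation_llm
        # Fall back to the annotation LLM when no dedicated keyword LLM is supplied.
        self.keyword_llm = keyword_llm if keyword_llm is not None else annotation_llm


# Placeholder objects stand in for real LLM clients.
annotation = object()
dedicated = object()

default_wf = WorkflowSketch(annotation)
custom_wf = WorkflowSketch(annotation, keyword_llm=dedicated)
```

With this wiring, callers that never pass keyword_llm keep working unchanged, while factory functions can hand in a cheaper, reasoning-disabled client.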
neuromechanist added a commit that referenced this pull request May 21, 2026
Per CLAUDE.md 'Develop Branch Sync Rule': after each alpha release on main (0.7.10a2 here), develop bumps the patch and resets to .dev0 so the two branches share a clean version lineage and dev builds publish to TestPyPI under the next patch series. Fast-forwarded merge from main (no divergence: develop had nothing ahead). All #146 (persistent hed-lsp) and #151 (#148+#150 latency) work is now on develop.
Closes #148, #150. Sub-issues of #147.
Summary
After #146 landed the persistent hed-lsp client, the per-request "Initializing annotation workflow..." gap dropped from 20-60 s to ~10-12 s. Direct measurement showed the LSP call itself takes 0.5 s; the remaining time is one LLM call inside `_extract_keywords`, which in prod ran on the evaluation model (qwen3.6-35b-a3b) with extended reasoning enabled by default.

Two coupled fixes:
- #148, "Use fast LLM with reasoning disabled for keyword extraction (and eval/feedback/assess)": `HedAnnotationWorkflow` takes an optional `keyword_llm`; it defaults to the annotation LLM (claude-haiku-4.5 in prod), which is well-suited to a "list 5 keywords" task. `create_openrouter_workflow` and the standalone CLI build a dedicated `keyword_llm` with the annotation model, `max_tokens=200`, and reasoning disabled.
- #150, "Disable reasoning on non-annotation workflow LLMs (keyword, eval, feedback, assess)": `create_openrouter_llm` gains a `disable_reasoning` flag. When `True`, it sets `model_kwargs["reasoning"] = {"enabled": False}`, OpenRouter's portable cross-provider flag that turns off extended thinking on Anthropic, Qwen, and OpenAI in one shot. The flag is passed for `evaluation_llm`, `assessment_llm`, `feedback_llm`, and `keyword_llm`. The annotation LLM keeps reasoning enabled.

Measurement (prod container, real OpenRouter calls)
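| Model | Config | Latency | Output |
| --- | --- | --- | --- |
| claude-haiku-4.5 | reasoning on (default) | 7-9 s | contains thinking blocks |
| claude-haiku-4.5 | `reasoning.enabled=false`, `max_tokens=200` | ~1 s | clean comma-separated text |
| qwen3.6-35b-a3b | reasoning on | 1.5 s | contains thinking blocks |
| qwen3.6-35b-a3b | `reasoning.enabled=false` | 0.5 s | clean text |

End-to-end expected effect: the pre-annotate window goes from ~10-12 s to ~1-2 s. Evaluation / feedback / assessment calls become 2x+ faster.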
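The `disable_reasoning` flag from #150 can be sketched as below. This is an illustration under stated assumptions, not the body of `create_openrouter_llm`: `build_model_kwargs` and its `base_kwargs` parameter are hypothetical names, and only the `model_kwargs["reasoning"] = {"enabled": False}` assignment is taken from the description.

```python
from typing import Any, Dict, Optional


def build_model_kwargs(
    disable_reasoning: bool = False,
    base_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble the extra request kwargs for one LLM."""
    # Copy so the caller's dict is never mutated.
    model_kwargs: Dict[str, Any] = dict(base_kwargs or {})
    if disable_reasoning:
        # The reasoning-off payload named in the PR description.
        model_kwargs["reasoning"] = {"enabled": False}
    return model_kwargs


# Lightweight LLMs (keyword, evaluation, assessment, feedback) opt out of reasoning.
keyword_kwargs = build_model_kwargs(disable_reasoning=True)
# The annotation LLM leaves the provider default untouched.
annotation_kwargs = build_model_kwargs()
```

Leaving the key absent for the annotation LLM, rather than sending `{"enabled": True}`, keeps that model on whatever default the provider applies.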
Test plan
- `uv run pytest -m "not integration"`: 465 passed, 1 skipped.
- `uv run pytest tests/test_openrouter_llm.py`: 18 passed, including the two new flag-passthrough tests.
- `uv run pytest tests/lsp/ tests/test_validation_agent.py`: 39 passed (real LSP, no mocks).

Out of scope