perf: fast keyword extraction + disable reasoning on lightweight LLMs #151

Merged
neuromechanist merged 1 commit into main from feature/issue-148-fast-keyword-extraction
May 21, 2026
@neuromechanist
Member

Closes #148, #150. Sub-issues of #147.

Summary

After #146 landed the persistent hed-lsp client, the per-request "Initializing annotation workflow..." gap dropped from 20-60 s to ~10-12 s. Direct measurement showed the LSP call itself is 0.5 s; the remaining time is one LLM call inside _extract_keywords, which in prod ran on the evaluation model (qwen3.6-35b-a3b) with extended reasoning enabled by default.

Two coupled fixes:

  1. Use fast LLM with reasoning disabled for keyword extraction (and eval/feedback/assess) #148: HedAnnotationWorkflow takes an optional keyword_llm; it defaults to the annotation LLM (claude-haiku-4.5 in prod), which is well suited to a "list 5 keywords" task. create_openrouter_workflow and the standalone CLI build a dedicated keyword_llm with the annotation model, max_tokens=200, and reasoning disabled.

  2. Disable reasoning on non-annotation workflow LLMs (keyword, eval, feedback, assess) #150: create_openrouter_llm gains a disable_reasoning flag. When True, it sets model_kwargs["reasoning"] = {"enabled": False}, OpenRouter's portable cross-provider flag that turns off extended thinking on Anthropic, Qwen, and OpenAI in one shot. The flag is passed for evaluation_llm, assessment_llm, feedback_llm, and keyword_llm. The annotation LLM keeps reasoning enabled.
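
As a rough illustration of fix 2, the flag handling might look something like the sketch below. This is not the actual create_openrouter_llm code: build_llm_kwargs is a hypothetical helper that only assembles keyword arguments, and the model slug is a placeholder. The only detail taken from the description above is that the flag sets model_kwargs["reasoning"] = {"enabled": False}.

```python
from typing import Any, Optional


def build_llm_kwargs(
    model: str,
    max_tokens: Optional[int] = None,
    disable_reasoning: bool = False,
) -> dict[str, Any]:
    """Hypothetical helper: assemble constructor kwargs for an OpenRouter-backed LLM."""
    kwargs: dict[str, Any] = {"model": model, "model_kwargs": {}}
    if max_tokens is not None:
        kwargs["max_tokens"] = max_tokens
    if disable_reasoning:
        # Reasoning flag as described in the PR; forwarded to OpenRouter unchanged.
        kwargs["model_kwargs"]["reasoning"] = {"enabled": False}
    return kwargs


# Placeholder model slug (assumption); the keyword LLM caps output and disables reasoning.
keyword_kwargs = build_llm_kwargs(
    "anthropic/claude-haiku-4.5", max_tokens=200, disable_reasoning=True
)
# The annotation LLM leaves the flag unset, so no "reasoning" key is sent.
annotation_kwargs = build_llm_kwargs("anthropic/claude-haiku-4.5")
```

Leaving the key out entirely when the flag is False, rather than sending {"enabled": True}, keeps the provider default behavior for the annotation LLM.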

Measurement (prod container, real OpenRouter calls)

| Setup | Wall time | Output |
|---|---|---|
| claude-haiku-4.5 (default, reasoning on) | 7–9 s | thinking blocks |
| claude-haiku-4.5 with reasoning.enabled=false, max_tokens=200 | ~1 s | clean comma-separated text |
| qwen3.6-35b-a3b (reasoning on) | 1.5 s | thinking blocks |
| qwen3.6-35b-a3b with reasoning.enabled=false | 0.5 s | clean text |

End-to-end expected effect: pre-annotate window goes from ~10-12 s to ~1-2 s. Evaluation / feedback / assessment calls 2x+ faster.
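
For anyone wanting to reproduce wall-time numbers like these, a minimal timing harness could look like the sketch below. The timed helper and fake_keyword_call are hypothetical; the fake call just sleeps briefly in place of a real OpenRouter request, so the printed time is not a measurement of any model.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn once and return (result, elapsed wall time in seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def fake_keyword_call() -> str:
    """Stand-in for an LLM call (assumption): sleeps instead of hitting the network."""
    time.sleep(0.01)
    return "red, circle, button press"


text, seconds = timed(fake_keyword_call)
print(f"{seconds:.3f} s -> {text}")
```

Swapping fake_keyword_call for a closure around the real keyword-extraction call would time one request. Averaging several runs would smooth out network jitter.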

Test plan

  • uv run pytest -m "not integration" -- 465 passed, 1 skipped.
  • uv run pytest tests/test_openrouter_llm.py -- 18 passed including the two new flag-passthrough tests.
  • uv run pytest tests/lsp/ tests/test_validation_agent.py -- 39 passed (real LSP, no mocks).
  • Local empirical test against OpenRouter confirms the timing numbers above.
  • After merge + deploy: time a real /annotate request and confirm the "Initializing annotation workflow..." window dropped to ~1-2 s.

Out of scope

Closes #148, #150. Sub-issues of #147.

After the persistent hed-lsp client landed (#146), the per-request
'Initializing annotation workflow...' window shrank from 20-60 s to
~10-12 s. The remaining cost is the LLM call inside
_semantic_preprocess_node._extract_keywords, which in prod was routed
to the evaluation model (qwen3.6-35b-a3b) with extended reasoning
enabled by default.

Two coupled changes:

1. (#148) HedAnnotationWorkflow now takes an optional keyword_llm
   parameter. Default falls back to the annotation LLM (claude-haiku
   in prod), which is fast and well-suited to a 5-keyword extraction
   task. create_openrouter_workflow and the standalone CLI's
   local_executor build a dedicated keyword_llm with the annotation
   model, max_tokens=200, and reasoning disabled.

2. (#150) create_openrouter_llm gains a disable_reasoning flag. When
   True it sets model_kwargs['reasoning'] = {'enabled': False} -- the
   OpenRouter portable flag that turns off extended thinking across
   Anthropic, Qwen, and OpenAI providers in one shot. The flag is
   passed when building evaluation_llm, assessment_llm, feedback_llm,
   and keyword_llm; the annotation LLM keeps reasoning on since that
   model is doing the real HED tag synthesis where thinking helps
   first-attempt quality.
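
The keyword_llm fallback in change 1 and the role split in change 2 can be sketched roughly as follows. WorkflowSketch and reasoning_disabled_for are hypothetical stand-ins, not the real HedAnnotationWorkflow or factory code; they only mirror the wiring the commit message describes.

```python
from typing import Any, Optional


class WorkflowSketch:
    """Hypothetical stand-in for HedAnnotationWorkflow; only the LLM wiring is shown."""

    def __init__(self, annotation_llm: Any, keyword_llm: Optional[Any] = None) -> None:
        self.annotation_llm = annotation_llm
        # Fall back to the annotation LLM when no dedicated keyword LLM is supplied.
        self.keyword_llm = keyword_llm if keyword_llm is not None else annotation_llm


# Roles built with reasoning disabled, per the commit message; annotation is excluded.
REASONING_OFF_ROLES = frozenset({"evaluation", "assessment", "feedback", "keyword"})


def reasoning_disabled_for(role: str) -> bool:
    """Return True when the LLM for this role should have reasoning turned off."""
    return role in REASONING_OFF_ROLES


default_wf = WorkflowSketch(annotation_llm="annotation-llm")
dedicated_wf = WorkflowSketch(annotation_llm="annotation-llm", keyword_llm="keyword-llm")
```

Here default_wf.keyword_llm is the annotation LLM itself, while dedicated_wf uses the separately built keyword LLM.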

Measured in the prod container against real OpenRouter calls:
- claude-haiku-4.5 (reasoning on, default): 7-9 s for keyword
  extraction, response contains thinking blocks.
- claude-haiku-4.5 with reasoning.enabled=false, max_tokens=200:
  ~1 s, clean comma-separated text.
- qwen3.6-35b-a3b (reasoning on): 1.5 s, thinking blocks.
- qwen3.6-35b-a3b with reasoning.enabled=false: 0.5 s, clean text.

Bumps the API to 0.7.10a2.
@cloudflare-workers-and-pages

Deploying hedit with Cloudflare Pages

Latest commit: 498c0f2
Status:⚡️  Build in progress...

@codecov

codecov Bot commented May 21, 2026

Codecov Report

❌ Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/cli/local_executor.py | 0.00% | 2 Missing ⚠️ |

@neuromechanist neuromechanist merged commit fe49ece into main May 21, 2026
22 of 23 checks passed
neuromechanist added a commit that referenced this pull request May 21, 2026
Per CLAUDE.md 'Develop Branch Sync Rule': after each alpha release on
main (0.7.10a2 here), develop bumps the patch and resets to .dev0 so
the two branches share a clean version lineage and dev builds publish
to TestPyPI under the next patch series.

Fast-forwarded merge from main (no divergence: develop had nothing
ahead). All #146 (persistent hed-lsp) and #151 (#148+#150 latency)
work is now on develop.