Add provider-agnostic Model QoS profiles (#6831) #6832
Conversation
Introduce configurable feature-to-model mapping so each LLM feature can be independently assigned to a model tier (nano/mini/medium/high). Override per-feature via env vars like LLM_TIER_CONV_ACTION_ITEMS=medium. Defaults downgrade high-volume structured extraction tasks from gpt-5.1 to gpt-4.1-mini while preserving rollback capability. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hardcoded llm_medium_experiment (gpt-5.1) with get_llm() calls for action items, structure, events, app results, and daily summaries. Each feature now uses its configured tier, defaulting to mini for action items and app results (high-volume structured extraction). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hardcoded llm_mini with get_llm('knowledge_graph') so the
model can be independently configured via LLM_TIER_KNOWLEDGE_GRAPH env var.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hardcoded llm_mini with get_llm() for memory extraction, text content extraction, conflict resolution, and categorization. Each can be independently configured via LLM_TIER_MEMORIES etc. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
23 unit tests covering tier defaults, env var overrides, model mapping, instance caching, tier info debugging, and rollback scenarios. Added to test.sh for CI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Greptile Summary: This PR introduces a QoS tier system in
Confidence Score: 4/5 — Safe to merge after fixing the tautological rollback test assertion, which currently provides no coverage of the critical rollback path. One P1 finding: the rollback test always passes regardless of the actual model returned, leaving the primary rollback guarantee untested. Two P2 findings (unused imports, missing prompt_cache_key) are minor cleanups.
Important Files Changed: backend/tests/unit/test_llm_qos_tiers.py — rollback test assertion needs fixing
Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["get_llm(feature)"] --> B["_resolve_tier(feature)"]
B --> C{"LLM_TIER_{FEATURE} env var set and valid?"}
C -- yes --> D["Use env tier"]
C -- no --> E["_FEATURE_TIER_DEFAULTS.get(feature, TIER_MINI)"]
D --> F["_TIER_MODELS[tier] → model name"]
E --> F
F --> G{"model in _llm_cache?"}
G -- yes --> H["Return cached ChatOpenAI"]
G -- no --> I{"model == 'gpt-5.1'?"}
I -- yes --> J["ChatOpenAI(model, prompt_cache_retention=24h)"]
I -- no --> K["ChatOpenAI(model)"]
J --> L["Store in _llm_cache"]
K --> L
L --> H
Reviews (1): Last reviewed commit: "Add tests for QoS tier system (#6831)"
```diff
 from pydantic import BaseModel, Field

-from .clients import llm_mini
+from .clients import get_llm, llm_mini
```
llm_mini is imported but no longer referenced in this file — both call sites were converted to get_llm('knowledge_graph'). The same applies to memories.py (line 13), where llm_mini is also imported but unused after the migration.
```diff
-from .clients import get_llm, llm_mini
+from .clients import get_llm
```
```python
monkeypatch.setenv('LLM_TIER_CONV_ACTION_ITEMS', 'medium')
assert _resolve_tier('conv_action_items') == TIER_MEDIUM
llm = get_llm('conv_action_items')
assert 'gpt-5.1' in str(llm.model_name) or hasattr(llm, 'invoke')
```
Tautological assertion always passes
The `or hasattr(llm, 'invoke')` clause is always True for any ChatOpenAI instance, so this assertion can never fail regardless of which model was actually returned. The test never validates that the rollback worked.
```diff
-assert 'gpt-5.1' in str(llm.model_name) or hasattr(llm, 'invoke')
+assert llm.model_name == 'gpt-5.1'
```
```python
def _get_or_create_llm(model_name: str) -> ChatOpenAI:
    """Get or create a ChatOpenAI instance for the given model name."""
    if model_name not in _llm_cache:
        kwargs = {'model': model_name, 'callbacks': [_usage_callback]}
        if model_name == 'gpt-5.1':
            kwargs['extra_body'] = {"prompt_cache_retention": "24h"}
        _llm_cache[model_name] = ChatOpenAI(**kwargs)
    return _llm_cache[model_name]
```
prompt_cache_key dropped for gpt-5.1 features
llm_medium_experiment callers previously passed a per-call prompt_cache_key (e.g. "omi-transcript-structure", "omi-daily-summary") to steer requests toward the same backend cache shard. The new _get_or_create_llm only sets prompt_cache_retention: 24h but omits prompt_cache_key, so prompt-prefix cache hit rates for conv_structure, conv_events, and daily_summary may regress. Consider threading the per-feature key through _get_or_create_llm or passing it at call sites for features that had one previously.
- Fix get_transcript_structure event extraction to use the conv_structure tier instead of the incorrect conv_events key
- Restore prompt_cache_key on all medium-tier callsites for OpenAI cache routing (action items, structure, apps, daily summary)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove 18 unwired feature entries from tier defaults to avoid confusion. Only the 8 features actually called via get_llm() remain in the map. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace test_models_unchanged_for_llm_calls (checked for llm_medium_experiment) with test_llm_calls_use_qos_tier_system that verifies get_llm() feature keys and prompt_cache_key retention. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rename LLM_TIER_ env prefix to OMI_QOS_ (Omi-level QoS, not LLM-level)
- Add cache_key param to get_llm() that only applies prompt_cache_key when the resolved model supports it (gpt-5.1)
- Safely ignored when the tier is swapped to nano/mini via env var override, preventing model-specific params from breaking unsupported models

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace .bind(prompt_cache_key=...) and invoke kwargs with get_llm's cache_key parameter. This ensures prompt_cache_key is only sent to models that support it when tiers are swapped via OMI_QOS_ env vars. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rename from test_llm_qos_tiers to test_omi_qos_tiers
- Update env var references from LLM_TIER_ to OMI_QOS_
- Add TestCacheKeySafety: verifies cache_key is applied for medium tier, safely ignored for mini/nano, and safely ignored after tier downgrade

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ream, llm_agent (#6831)
Simplify the model inventory before Omi QoS sits on top. These 5 globals had zero production callers:
- llm_large (o1-preview) — unused
- llm_large_stream (o1-preview streaming) — unused
- llm_high_stream (o4-mini streaming) — unused
- llm_agent (gpt-5.1 with cache key) — only test mocks
- llm_agent_stream (gpt-5.1 streaming with cache key) — only test mocks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace llm_agent cache retention checks with QoS tier medium (gpt-5.1) cache retention verification. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- test_prompt_cache_optimization: check Omi QoS cache_key support instead of removed llm_agent globals
- test_prompt_cache_integration: verify gpt-5.1 prompt_cache_retention via extra_body instead of llm_agent model_kwargs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add get_llm stub to clients mock and patch get_llm instead of llm_medium_experiment in extract_action_items test paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Required fixes from review iterations 1-3, all addressed:
All 40 Omi QoS + 24 prompt caching + 39 action item + 13 usage context tests passing. by AI for @beastoin
Fix reference to undefined non_cache_clients -> non_gpt51_clients. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…structor capture (#6831)
Replace OpenAI-only tier system with profile-based architecture covering all 4 providers (OpenAI, Anthropic, OpenRouter, Perplexity). Each profile maps every feature to a specific model — different features can use different model tiers within the same profile. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
47 tests covering: profile structure, get_model resolution, per-feature env overrides, pinned features, OpenRouter client construction, streaming, cache key safety, provider classification, and rollback scenarios. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…6831)
- Medium profile: conv_action_items and conv_apps corrected to gpt-5.1 (matches prod)
- OpenRouter cache key includes temperature to prevent cross-feature cache poisoning
- get_llm() raises ValueError for Anthropic/Perplexity features (use get_model() instead)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace legacy llm_persona_mini_stream/llm_persona_medium_stream/llm_medium_experiment
with get_llm('persona_chat')/get_llm('persona_chat_premium')/get_llm('persona_clone').
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…6831) Replace legacy llm_gemini_flash with get_llm('wrapped_analysis') across 9 analysis functions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ble model routing Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Required fixes from review (all addressed):
All 84 tests passing (51 QoS + 15 callsite + 18 prompt cache integration). by AI for @beastoin
… overrides Prevents MODEL_QOS_CONV_X=claude-haiku-3.5 from silently creating a ChatOpenAI client with a non-OpenAI model, and similar mismatches for OpenRouter features. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ired files
- Subprocess tests for MODEL_QOS=premium and invalid profile fallback
- Callsite assertions for chat, persona, goals, notifications, app_generator, graph, perplexity, chat_sessions, apps, app_integrations, external_integrations, proactive_notification, generate_2025, onboarding
- Legacy invocation guard across all 14 wired files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ns for all 17 files Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test results:
Ready for merge review. by AI for @beastoin
L1 — Backend + Pusher live test
Backend (port 10160, QoS branch) endpoint tests: no QoS-related errors in startup or request logs.
Pusher (port 10161): no QoS-related errors.
by AI for @beastoin
L2 — Service + App integrated live test
Setup: Backend (port 10160) + Pusher (port 10161) running the QoS branch.
Backend + Pusher: both services booted with QoS enabled.
App screens verified (all loaded without errors against the QoS-enabled backend):
Result: App boots and renders all major screens without errors against QoS-enabled backend+pusher. No crashes, no rendering issues, no QoS-related error logs.
by AI for @beastoin
…65% cost savings Apply geni's model tuning: gpt-5.4 (3 user-facing), gpt-5.4-mini (9 processing), gpt-4.1-nano (19 classification), claude-sonnet-4-6 (1 agent), sonar-pro (1 search). Replace fixed provider sets with _classify_provider() — provider now follows the model name, not the feature. This enables persona_chat/wrapped_analysis to be OpenRouter in premium but OpenAI in max.
…tion 80 tests: new _classify_provider tests, profile-specific provider assertions, max 5-variant check, updated cache key models, dynamic safety guard tests.
Addresses CP8 tester gaps: caplog-based override warning assertions, OpenAI vs OpenRouter client routing verification, temperature config test.
Change test_openrouter_temperature_applied to use monkeypatch env override so get_llm actually routes through OpenRouter path, proving _OPENROUTER_TEMPERATURES config is applied end-to-end. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Manager feedback: geni's optimization was too aggressive. Premium now matches current production max models (gpt-5.1, gpt-5.2, gpt-4.1, gpt-4.1-mini, o4-mini, OpenRouter, claude-sonnet-4-6, sonar-pro). Max temporarily identical — pending geni re-tune. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Align test assertions with premium=production and max=production model assignments. Both profiles now use OpenRouter, o4-mini, gpt-5.1, gpt-5.2, gpt-4.1, gpt-4.1-mini. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Premium: geni's 5-model optimized set (gpt-5.4, gpt-5.4-mini, gpt-4.1-nano, claude-sonnet-4-6, sonar-pro) — no OpenRouter. Max: quality upgrade from production — all gpt-5.1/5.2 → gpt-5.4, all gpt-4.1-mini/o4-mini/gpt-4.1 → gpt-5.4-mini, OpenRouter eliminated. 4 model variants, 3 providers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests reflect: premium uses geni's 5-model cost-saving set (no OpenRouter), max uses latest-gen quality upgrade (gpt-5.4/gpt-5.4-mini, no OpenRouter). Cache key tests use monkeypatch for non-cacheable model. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Max: only change gpt-5.1/5.2 → gpt-5.4 (latest flagship). Everything else unchanged from production (gpt-4.1-mini, gpt-4.1, o4-mini, OpenRouter, claude-sonnet-4-6, sonar-pro). 9 model IDs, 4 providers.
Premium: cost-optimized, ~65-70% cheaper. gpt-5.4 → gpt-5.4-mini; gpt-4.1-mini/gpt-4.1/o4-mini → gpt-4.1-nano. persona_chat stays OpenRouter; persona_chat_premium uses gpt-5.4-mini (get_llm() rejects Anthropic). 6 model IDs, 4 providers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Max: 9 model IDs with OpenRouter (production + gpt-5.4 upgrade). Premium: cost-optimized with gpt-5.4-mini, gpt-4.1-nano, mixed OpenRouter. Tests reflect both profiles accurately. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests every provider path (OpenAI, Anthropic, Perplexity, OpenRouter) with actual API calls. Found 2 deprecated OpenRouter models: google/gemini-flash-1.5-8b (404) and anthropic/claude-3.5-sonnet (404). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
L1 integration test — real LLM API calls for all 34 QoS features:
Streaming (OpenAI + OpenRouter): PASS
32/34 features verified with real API responses. 2 OpenRouter models are deprecated — need replacement.
Test:
by AI for @beastoin
- Default profile: max → premium (cost-effective, 80% of max quality)
- Fallback on invalid MODEL_QOS: max → premium
- Replace deprecated persona_chat (google/gemini-flash-1.5-8b → gpt-4.1-nano)
- Replace deprecated persona_chat_premium (anthropic/claude-3.5-sonnet → gpt-5.4-mini)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Default profile assertion: max → premium
- Dead OpenRouter model assertions → direct OpenAI API models
- Provider routing tests: persona_chat is now OpenAI, not OpenRouter
- Invalid profile fallback: max → premium
- 85 tests passing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Restructured for premium default: gpt-5.4-mini (11 features), gpt-4.1-nano (20 features)
- All 34 features verified with real LLM API calls
- Streaming, cache key, and profile routing tests passing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test results:
L1 Integration — Real LLM API calls (premium profile)
Total: 126 tests (85 unit + 41 integration), all passing. Next step: CP8 tester + CP9 live tests. by AI for @beastoin
CP9 Live Test Evidence
Changed paths (28 files, model routing only):
L1 (CP9A): Build and run changed component standalone
L2 (CP9B): Service + app integration
L1 synthesis: All changed paths P1-P4 proven. 34 model routing paths verified end-to-end with real provider APIs. Dead OpenRouter models (google/gemini-flash-1.5-8b, anthropic/claude-3.5-sonnet) confirmed replaced with working direct API models (gpt-4.1-nano, gpt-5.4-mini). No untested paths.
L2 synthesis: Model routing is transparent to the app layer — the only integration surface is model name → provider API → response, fully covered by L1. No API contract changes, no protocol changes, no UI changes.
Workflow status: CP0-CP9B complete. Ready for merge approval.
by AI for @beastoin
Closes #6831
Adds a 2-profile Model QoS system that maps all 34 LLM features to models across 4 providers (OpenAI, Anthropic, OpenRouter, Perplexity).
Profiles:
- premium (default): `gpt-5.4-mini` (11 features) + `gpt-4.1-nano` (20 features) + `claude-sonnet-4-6` + `google/gemini-3-flash-preview` + `sonar-pro`. 5 distinct model IDs.
- max: `gpt-5.4` (9 features) + `gpt-4.1-mini` (15 features) + `gpt-4.1` + `o4-mini` + `gpt-4.1-nano` + `gpt-5.4-mini` + `claude-sonnet-4-6` + `google/gemini-3-flash-preview` + `sonar-pro`. 9 distinct model IDs.

Key design:
- `get_model(feature)` resolves the model from the active profile with env override support (`MODEL_QOS_FEATURE_NAME=model`)
- `get_llm(feature)` returns the correct LangChain client (OpenAI or OpenRouter) with caching
- `_classify_provider(model)` determines provider from model name — provider follows model, not feature
- Prompt caching on `gpt-5.4` and `gpt-5.4-mini` via the `cache_key` parameter
- Pinned features (`fair_use`) bypass the profile entirely
- `get_llm()` rejects Anthropic/Perplexity features (must use dedicated clients)
- Dead OpenRouter models (`google/gemini-flash-1.5-8b`, `anthropic/claude-3.5-sonnet`) replaced with direct OpenAI API
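The profile resolution described above can be sketched as follows. This is a minimal illustration trimmed to two features: the profile tables, the pinned `fair_use` model, and the exact fallback behavior are assumptions, not the PR's real data.

```python
import os

# Trimmed stand-in profiles; the real tables map all 34 features.
_PROFILES = {
    "premium": {"conv_structure": "gpt-5.4-mini", "conv_apps": "gpt-4.1-nano"},
    "max": {"conv_structure": "gpt-5.4", "conv_apps": "gpt-4.1-mini"},
}
# Pinned features bypass the active profile; the model here is assumed.
_PINNED = {"fair_use": "gpt-4.1-nano"}

def get_model(feature: str) -> str:
    if feature in _PINNED:
        return _PINNED[feature]
    profile = os.getenv("MODEL_QOS", "premium")
    if profile not in _PROFILES:  # invalid profile -> premium fallback
        profile = "premium"
    default = _PROFILES[profile][feature]
    # Per-feature override, e.g. MODEL_QOS_CONV_STRUCTURE=gpt-5.4
    return os.getenv(f"MODEL_QOS_{feature.upper()}", default)
```

With no env vars set this resolves `conv_structure` to `gpt-5.4-mini`; `MODEL_QOS=max` switches it to `gpt-5.4`, and a per-feature `MODEL_QOS_CONV_STRUCTURE` override wins over both.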
by AI for @beastoin