Backend LLM costs span multiple providers (OpenAI, Anthropic, Google/Gemini, Perplexity) across 60+ callsites. There is no unified way to control model selection per-feature — models are hardcoded throughout the codebase, making cost optimization and A/B testing impossible without code changes.
## Current Behavior
- 60+ LLM callsites across 4 providers with hardcoded model instances
- OpenAI: 15+ features using `llm_mini`, `llm_medium`, and `llm_medium_experiment` directly
- Anthropic: chat agent hardcoded to `claude-sonnet-4-6`
- OpenRouter: persona chat and wrapped analysis hardcoded to specific Gemini/Claude models
- Perplexity: web search hardcoded to `sonar-pro`
- No mechanism to downgrade/upgrade models per-feature without code changes
- No way to switch cost profiles (e.g., "run everything on cheapest acceptable models")
## Expected Behavior
A provider-agnostic QoS profile system where each profile (mini/medium/high) maps every feature to a specific model — potentially different model tiers within the same profile, since some features need more quality than others even in a cost-optimized profile.
## Solution
QoS Profiles — each profile is a complete feature→model mapping across all providers:
```
MODEL_QOS_MINI:
  conv_action_items: gpt-4.1-nano         # cheapest, structured extraction
  conv_structure:    gpt-4.1-mini         # needs more quality
  chat_agent:        claude-haiku-3.5     # cost-optimized chat
  persona_chat:      gemini-flash-1.5-8b
  ...

MODEL_QOS_MEDIUM:
  conv_action_items: gpt-4.1-mini
  conv_structure:    gpt-5.1
  chat_agent:        claude-sonnet-4-6
  persona_chat:      claude-3.5-sonnet
  ...

MODEL_QOS_HIGH:
  conv_action_items: gpt-5.1
  conv_structure:    o4-mini
  chat_agent:        claude-sonnet-4-6
  persona_chat:      gemini-3-flash-preview
  ...
```
Global switch: `MODEL_QOS=mini` selects an entire profile.
Per-feature override: `MODEL_QOS_CONV_STRUCTURE=gpt-5.1` overrides a single feature.
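The two switches might be combined in a deployment config like this sketch (a hypothetical fragment, not taken from the repo): run everything on the cheapest profile, but keep conversation structure on a stronger model.

```shell
# Cost-optimized rollout: mini profile everywhere,
# except conv_structure, which is overridden to a higher-tier model.
export MODEL_QOS=mini
export MODEL_QOS_CONV_STRUCTURE=gpt-5.1
```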
21 features across 4 providers:
- OpenAI (16): conv_action_items, conv_structure, conv_apps, daily_summary, memories, memory_conflict, memory_category, knowledge_graph, chat_responses, chat_extraction, session_titles, goals, notifications, followup, smart_glasses, onboarding
- Anthropic (1): chat_agent
- OpenRouter (3): persona_chat, persona_clone, wrapped_analysis
- Perplexity (1): web_search
Pinned features: the `fair_use` classifier is pinned to a specific model regardless of the active profile (accuracy-critical).
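The resolution logic for `utils/llm/clients.py` could be sketched as follows. This is a minimal illustration, not the actual implementation: the profile contents mirror the examples above, the `get_model()` name comes from the Affected Areas table, and the pinned model for `fair_use` is a placeholder since the issue does not name it.

```python
import os

# Illustrative subset of the feature→model mappings shown above.
MODEL_QOS_PROFILES = {
    "mini": {
        "conv_action_items": "gpt-4.1-nano",
        "conv_structure": "gpt-4.1-mini",
        "chat_agent": "claude-haiku-3.5",
        "persona_chat": "gemini-flash-1.5-8b",
    },
    "medium": {
        "conv_action_items": "gpt-4.1-mini",
        "conv_structure": "gpt-5.1",
        "chat_agent": "claude-sonnet-4-6",
        "persona_chat": "claude-3.5-sonnet",
    },
    "high": {
        "conv_action_items": "gpt-5.1",
        "conv_structure": "o4-mini",
        "chat_agent": "claude-sonnet-4-6",
        "persona_chat": "gemini-3-flash-preview",
    },
}

# Accuracy-critical features bypass the profile entirely.
# The actual pinned model is not specified in this issue; placeholder below.
PINNED_MODELS = {
    "fair_use": "gpt-4.1",
}

def get_model(feature: str) -> str:
    """Resolve a feature name to a model ID.

    Precedence: pinned model > per-feature env override > active profile.
    """
    if feature in PINNED_MODELS:
        return PINNED_MODELS[feature]
    override = os.getenv(f"MODEL_QOS_{feature.upper()}")
    if override:
        return override
    profile = os.getenv("MODEL_QOS", "medium")
    return MODEL_QOS_PROFILES[profile][feature]
```

Callsites would then replace a hardcoded client with `get_model("conv_structure")` and build the provider client from the returned model ID, so both the global switch and per-feature overrides take effect without code changes.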
## Affected Areas
| Area | Files | Callsites |
|---|---|---|
| QoS core | `utils/llm/clients.py` | Profile definitions, `get_model()`, client factories |
| Conversation processing | `utils/llm/conversation_processing.py` | 5 callsites |
| Memories | `utils/llm/memories.py` | 4 callsites |
| Knowledge graph | `utils/llm/knowledge_graph.py` | 2 callsites |
| Chat | `utils/llm/chat.py` | 10+ callsites |
| Persona | `utils/llm/persona.py` | 5 callsites |
| Goals | `utils/llm/goals.py` | 3 callsites |
| Notifications | `utils/llm/notifications.py` | 2 callsites |
| Agentic chat | `utils/retrieval/agentic.py` | 1 callsite (Anthropic) |
| Wrapped | `utils/wrapped/generate_2025.py` | 9 callsites (Gemini) |
| Other | Various routers/utils | 10+ callsites |
## Impact
Unified cost control across all LLM providers. One env var (`MODEL_QOS=mini`) switches the entire backend to cost-optimized models. Per-feature overrides enable A/B testing. No user-facing changes.
by AI for @beastoin