LLMs suffer from stylistic inertia in long roleplay sessions. Once a tone, pacing, or prose style is established over several turns, the model tends to perpetuate it regardless of narrative shifts. A lighthearted conversation that turns tragic will often retain the cadence and vocabulary of the earlier tone because the weight of prior context anchors the model's generation.
Static system prompts cannot solve this. The system prompt is written once and does not adapt to evolving scenes.
An agentic middleware layer sits between the user and the model. It intercepts each user message, runs a short analytical pass to "read the room," then dynamically assembles prompt directives that shape the model's writing before the actual roleplay generation happens.
The user never sees the agentic layer. The writer model doesn't know it's being directed. The result is a roleplay session that naturally adapts its style, tone, and pacing as the narrative evolves.
The system uses a three-pass architecture for each user message:
- Director Pass - Tool-calling phase where the LLM selects moods, plot direction, and potentially rewrites user prompts
- Writer Pass - Story generation phase where the LLM writes the actual roleplay response
- Editor Pass - A ReAct loop that self-audits for slop and optimizes response length. This pass is surgical: errors are detected programmatically, so the model only needs to write replacements for the targeted sentences
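The three passes above can be sketched as a single per-turn function. This is a minimal illustration, not the actual implementation: `llm` stands in for any OpenAI-compatible chat-completion callable, and all names (`run_turn`, `detect_slop`, `"director_tools"`) are hypothetical.

```python
# Minimal sketch of the Director -> Writer -> Editor loop for one turn.
# `llm(messages, tools=None) -> str` is a stand-in for a real client call.

def run_turn(llm, history, user_msg, detect_slop):
    """Run all three passes for one user message and return the final reply."""
    base = history + [{"role": "user", "content": user_msg}]

    # Director pass: read the room via tool calls, emit hidden style/plot
    # directives. The user never sees this output.
    directives = llm(base, tools="director_tools")

    # Writer pass: same prefix (history + user message) to preserve the KV
    # cache, with the Director's directives appended as a hidden note.
    draft = llm(base + [{"role": "system", "content": directives}])

    # Editor pass: only runs when programmatic checks flag problems, and
    # only asks the model to rewrite the flagged sentences.
    flagged = detect_slop(draft)
    if flagged:
        note = f"Rewrite only these sentences: {flagged}"
        draft = llm(base + [{"role": "assistant", "content": draft},
                            {"role": "system", "content": note}])
    return draft
```

Note that when nothing is flagged, the Editor pass is skipped entirely, so a clean draft costs only two LLM calls.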
For optimal KV cache reuse, the following will remain consistent across passes:
- The system prompt (character card, instructions, etc.) is identical across all passes
  - Built once and reused for the whole session
  - Includes character description, scenario, example dialogue, and additional instructions
- The conversation history (previous messages) is identical across all passes
  - Maintains the exact same message content and ordering
- The same tool definitions are sent in every LLM call
  - Tool schemas affect the model's internal representation
  - Inconsistent tool schemas break KV cache alignment
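The consistency rules above amount to building one immutable message prefix per session and reusing it verbatim in every pass. A minimal sketch, with illustrative names rather than a real API:

```python
# Illustrative sketch of cache-friendly message assembly. The prefix is
# built once per session; every pass appends only its own suffix.

def build_prefix(character_card, scenario, examples, instructions, history):
    """Assemble the immutable prefix: system prompt + conversation history."""
    system = "\n\n".join([character_card, scenario, examples, instructions])
    return [{"role": "system", "content": system}] + list(history)

def pass_messages(prefix, suffix):
    """Each pass supplies its own suffix; the shared prefix stays
    byte-identical, so a backend with prompt caching can reuse the KV
    prefix across all three passes."""
    return prefix + suffix
```

Because `prefix` is never mutated, the Director, Writer, and Editor requests differ only in their trailing messages, which is exactly what prompt-caching backends need.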
- Clear direction for Writer: Grounding the story + actively steering the writing style = better output
- Customizability: User-defined prompt injections are picked up automatically by the Director model
- Anti-slop: Get rid of overused words, phrases, and patterns often seen in LLM outputs
- Length Guard: Actively or passively protects against length degradation as context grows
- Speed: Multiple passes inevitably increase time to final response
- Cost: Multiple passes naturally increase cost, though the KV cache reuse strategy somewhat alleviates this
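The anti-slop check can be as simple as scanning each sentence of a draft against a banlist of overused phrases, so the Editor pass knows exactly which sentences to rewrite. A hedged sketch, assuming a configurable banlist (the phrases below are common examples, not a canonical list from this project):

```python
import re

# Hypothetical banlist of overused LLM phrases; in practice this would be
# user-configurable.
BANNED = ["shivers down", "barely above a whisper", "couldn't help but"]

def detect_slop(text):
    """Return the sentences of `text` that contain a banned phrase."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    pattern = re.compile("|".join(map(re.escape, BANNED)), re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]
```

Because the detector returns whole sentences, the Editor prompt can quote them verbatim and ask only for drop-in replacements, keeping the rest of the draft (and its KV cache prefix) untouched.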
- A model with solid tool/function calling capabilities (recommended: Gemma 4)
- OpenAI-compatible LLM inference backend API that supports prompt-caching
