Problem
Currently, each agent makes an individual LLM call during AgentRuntime.process_tick(). While agents are processed concurrently via ThreadPoolExecutor, their LLM requests are not batched together, which underutilizes vLLM's continuous batching capabilities.
Current Flow
# python/agent_runtime/runtime.py:87
for agent_id, agent in self.agents.items():
    task = asyncio.create_task(self._agent_decide(agent))
    # Each agent makes individual LLM call
Proposed Solution
Implement batch LLM generation in AgentRuntime.process_tick():
- Collect all agent contexts first:
contexts = {}
for agent_id, agent in self.agents.items():
    contexts[agent_id] = agent._build_context()
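- Send all prompts to vLLM together:
agent_ids = list(contexts)
prompts = [contexts[agent_id] for agent_id in agent_ids]
if isinstance(self.backend, VLLMBackend):
    # Assumes generate_batch() returns one output per prompt, in order
    outputs = await self.backend.generate_batch(prompts)
    results = dict(zip(agent_ids, outputs))
else:
    # Fallback to concurrent individual calls
    results = await self._concurrent_llm_calls(contexts)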
- Parse results into actions:
actions = {}
for agent_id, result in results.items():
    # Look up the agent by id; the loop variable from step 1 is stale here
    actions[agent_id] = self.agents[agent_id]._parse_action(result)
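The three steps above can be sketched end to end as follows. This is a minimal illustration, not the actual runtime: `FakeBatchBackend`, the stripped-down `Agent` class, and the free function `process_tick()` are hypothetical stand-ins, and the sketch assumes `generate_batch()` returns one output per prompt in input order.

```python
import asyncio


class FakeBatchBackend:
    """Hypothetical stand-in for a backend exposing generate_batch()."""

    async def generate_batch(self, prompts):
        # Assumption: one output per prompt, returned in the same order.
        return [f"action for: {p}" for p in prompts]


class Agent:
    """Hypothetical minimal agent; the real class lives in the runtime."""

    def __init__(self, name):
        self.name = name

    def _build_context(self):
        return f"{self.name} observes the world"

    def _parse_action(self, text):
        return text.upper()


async def process_tick(agents, backend):
    # 1. Collect contexts, remembering the agent order.
    agent_ids = list(agents)
    prompts = [agents[a]._build_context() for a in agent_ids]
    # 2. One batched call instead of one call per agent.
    outputs = await backend.generate_batch(prompts)
    # 3. Map each output back to the agent whose prompt produced it.
    return {a: agents[a]._parse_action(o) for a, o in zip(agent_ids, outputs)}


agents = {"a1": Agent("alice"), "a2": Agent("bob")}
actions = asyncio.run(process_tick(agents, FakeBatchBackend()))
```

Keeping an explicit `agent_ids` list is what ties each output back to its agent, so the mapping does not depend on anything beyond the backend preserving prompt order.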
Expected Impact
- 50-70% lower LLM inference time per tick with 4+ agents
- Better utilization of vLLM's continuous batching and PagedAttention
- With 4 agents: ~1.5x the time of a single agent instead of 4x
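As a rough consistency check on these figures, the snippet below works through the arithmetic. It assumes the 4x baseline means four single-agent calls costing one time unit each; that reading is an interpretation, not something stated above.

```python
single_agent_time = 1.0  # arbitrary unit: one agent's LLM call
unbatched_time = 4 * single_agent_time   # assumed baseline for 4 agents
batched_time = 1.5 * single_agent_time   # estimate quoted for 4 agents

reduction = 1 - batched_time / unbatched_time
speedup = unbatched_time / batched_time
print(f"time reduction: {reduction:.1%}, speedup: {speedup:.2f}x")
# prints "time reduction: 62.5%, speedup: 2.67x"
```

Under that assumption the 4-agent estimate falls inside the range given in the first bullet.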
Files to Modify
python/agent_runtime/runtime.py - Implement batch processing in process_tick()
python/backends/vllm_backend.py - Add generate_batch() method
python/backends/base.py - Add abstract generate_batch() interface
python/backends/llama_cpp_backend.py - Implement fallback (no native batching)
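One possible shape for the backend changes is sketched below. All class names here are hypothetical, and the sketch departs slightly from the list above: `generate_batch()` is given a concrete default (concurrent individual calls) rather than being strictly abstract, so backends without native batching inherit the fallback. The batching backend assumes an injected `llm` object that behaves like vLLM's offline `LLM` class, whose `generate()` accepts a list of prompts and returns one result per prompt with completions under `.outputs[i].text`; a test double stands in for it here.

```python
import asyncio
from abc import ABC, abstractmethod
from types import SimpleNamespace


class LLMBackend(ABC):
    """Hypothetical base class; the real one lives in python/backends/base.py."""

    @abstractmethod
    async def generate(self, prompt: str) -> str:
        """Return the completion for a single prompt."""

    async def generate_batch(self, prompts: list[str]) -> list[str]:
        # Default fallback for backends without native batching:
        # issue the individual calls concurrently. asyncio.gather
        # returns results in the order the awaitables were passed in.
        return list(await asyncio.gather(*(self.generate(p) for p in prompts)))


class BatchingBackend(LLMBackend):
    """Sketch of a vLLM-style backend; `llm` is assumed to act like vllm.LLM."""

    def __init__(self, llm, sampling_params=None):
        self.llm = llm
        self.sampling_params = sampling_params

    async def generate(self, prompt: str) -> str:
        return (await self.generate_batch([prompt]))[0]

    async def generate_batch(self, prompts: list[str]) -> list[str]:
        # llm.generate() is a blocking call, so run it in a worker thread.
        request_outputs = await asyncio.to_thread(
            self.llm.generate, prompts, self.sampling_params
        )
        # Each result carries its completions in .outputs; take the first.
        return [r.outputs[0].text for r in request_outputs]


class FakeLLM:
    """Test double mimicking the result shape assumed above."""

    def generate(self, prompts, sampling_params=None):
        return [
            SimpleNamespace(outputs=[SimpleNamespace(text=p.upper())])
            for p in prompts
        ]


class EchoBackend(LLMBackend):
    """Toy backend with no native batching, standing in for llama.cpp."""

    async def generate(self, prompt: str) -> str:
        return prompt[::-1]


batched = asyncio.run(BatchingBackend(FakeLLM()).generate_batch(["hi", "yo"]))
fallback = asyncio.run(EchoBackend().generate_batch(["abc", "xyz"]))
```

With a default in the base class, `process_tick()` could call `generate_batch()` unconditionally and drop the `isinstance` check, at the cost of hiding which backends actually batch natively.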
References
python/agent_runtime/runtime.py:66-100
python/backends/base.py:68-86
Priority
HIGH - This is the primary bottleneck in multi-agent scenarios