EchoRL is a system framework that bridges reaction and planning in real-time reinforcement learning through experience-grounded infrastructure. It introduces three key innovations for bandwidth-efficient LLM-based reinforcement learning:
- Latent Planning Optimization - structured rollout with continuation-based reasoning
- Asynchronous Execution Engine - KV-cache sharing, bandwidth-aware scheduling, and token-level dispatch
- Prioritized Replay Buffer - stratified hot/cold buffers for improved RL training efficiency
- Latent Planning: Trajectory-conditioned policy with KL regularization
- Bandwidth-Efficient Execution: KV-cache sharing with effective bandwidth b_eff(s_{1:t}) and η_bw tracking
- Async Execution: 78% KV reuse rate with bandwidth-aware priority scheduling
- Prioritized Replay: Hot/cold buffer stratification with surprise-weighted sampling
- Comprehensive Evaluation: Benchmarks across ALFWorld, WebShop, CRUXEval, ARC, and MiniGrid
- Multi-Backbone Support: GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, Llama-4, Qwen, DeepSeek-R1
- Performance Monitoring: Real-time metrics, system monitoring, and statistical analysis
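The KL regularization named under Latent Planning penalizes drift between consecutive plan distributions, D_KL[p_φ(τ_t | s_{1:t}) || p_φ(τ_{t-1} | s_{1:t-1})]. As a minimal sketch, assuming both distributions are diagonal Gaussians (an assumption; this README does not state the distribution family), the term has a closed form. The function name `gaussian_plan_kl` and the use of plain Python lists are illustrative and not part of the EchoRL API.

```python
import math


def gaussian_plan_kl(mu_t, sigma_t, mu_prev, sigma_prev):
    """KL[N(mu_t, sigma_t^2) || N(mu_prev, sigma_prev^2)], summed over plan dimensions.

    Assumes diagonal Gaussians with strictly positive standard deviations.
    """
    total = 0.0
    for m1, s1, m0, s0 in zip(mu_t, sigma_t, mu_prev, sigma_prev):
        total += math.log(s0 / s1) + (s1 ** 2 + (m1 - m0) ** 2) / (2.0 * s0 ** 2) - 0.5
    return total


# Identical consecutive plans incur no penalty.
kl_same = gaussian_plan_kl([0.0, 1.0], [1.0, 1.0], [0.0, 1.0], [1.0, 1.0])
# A unit shift in one mean (with unit variance) costs 0.5 nats.
kl_shift = gaussian_plan_kl([1.0], [1.0], [0.0], [1.0])
print(kl_same, kl_shift)
```

In training, a term like this would typically be scaled by a weight such as the `kl_weight=0.1` shown in the `PlanningConfig` example and added to the policy loss.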
- Python 3.9+
- PyTorch 2.0+
- CUDA 11.8+ (for GPU acceleration)
# Clone the repository
git clone https://github.com/your-org/Echo-RL.git
cd Echo-RL
# Create virtual environment
conda create -n echo_rl python=3.10 -y
conda activate echo_rl
# Install dependencies
pip install -r requirements.txt
# Install EchoRL in development mode
pip install -e .
# Build C++ performance kernels (optional but recommended)
pip install pybind11
pip install -e ".[dev]" # or: python setup.py build_ext --inplace

For specific tasks and backbones, install additional dependencies:
# LLM API clients
pip install openai anthropic google-generativeai mistralai
# Local model support
pip install transformers accelerate bitsandbytes
# Environment-specific
pip install alfworld selenium # For ALFWorld and WebShop tasks

Train EchoRL on the ALFWorld task with the GPT-4o backbone:
python examples/train_echo_rl.py \
--task alfworld \
--backbone gpt-4o \
--timesteps 100000 \
--num-actors 128 \
--batch-size 256

Run the full benchmark comparing EchoRL against baselines:
python examples/benchmark_echo_rl.py \
--tasks alfworld webshop cruxeval \
--backbones gpt-4o claude-3.5-sonnet \
--baselines react tot ppo-rlhf \
--num-seeds 10 \
--num-episodes 100

import asyncio
from echo_rl import EchoRLTrainer, TrainingConfig

async def main():
    # Create training configuration
    config = TrainingConfig(
        env_name="alfworld",
        total_timesteps=100000,
        num_actors=128,
        device="cuda"
    )
    # Initialize trainer
    trainer = EchoRLTrainer(config)
    # Run training
    metrics = await trainer.train()
    print(f"Success rate: {metrics.evaluation_results['success_rate']:.3f}")
    print(f"Avg reward: {metrics.evaluation_results['avg_reward']:.3f}")

asyncio.run(main())

EchoRL coordinates three modules through one shared latent plan τ̄:
Latent Plan τ_t = F_φ(s_{t-k:t})
│
├──► Soft-prefix policy π_θ(a_t | s_t, τ_t)
├──► Bandwidth-aware scheduling: priority = r / (b_eff + q + ε)
└──► Planning-aware replay: score = ||τ_t - τ̄||² + α|r_t|
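The replay score in the diagram can be sketched in plain Python. This is a minimal illustration, assuming τ̄ is an exponential moving average of past plans initialized at zero; the class name `RunningPlanMean`, the function `surprise_score`, and the `decay` and `alpha` defaults are hypothetical and not taken from the EchoRL code.

```python
class RunningPlanMean:
    """Exponential moving average of latent plan vectors (stands in for τ̄)."""

    def __init__(self, dim, decay=0.99):
        self.decay = decay          # assumed smoothing factor
        self.mean = [0.0] * dim     # assumed zero initialization

    def update(self, plan):
        d = self.decay
        self.mean = [d * m + (1.0 - d) * p for m, p in zip(self.mean, plan)]
        return self.mean


def surprise_score(plan, mean_plan, reward, alpha=0.1):
    """score = ||τ_t - τ̄||² + α|r_t|."""
    sq_dist = sum((p - m) ** 2 for p, m in zip(plan, mean_plan))
    return sq_dist + alpha * abs(reward)


tracker = RunningPlanMean(dim=2, decay=0.5)
tracker.update([2.0, 0.0])  # mean becomes [1.0, 0.0]
score = surprise_score([2.0, 0.0], tracker.mean, reward=-1.0, alpha=0.5)
print(tracker.mean, score)
```

Transitions whose plan sits far from the running mean, or whose reward has large magnitude in either direction, receive a higher score and are therefore more likely to be replayed.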
EchoRL optimizes the bandwidth efficiency metric:
η_bw(π) = E[Σ r_t] / (E[Σ b_eff(s_{1:t})] + E_B[w|ℓ_PG|])
where effective rollout bandwidth accounts for KV prefix reuse:
b_eff(s_{1:t}, t') = b(s_{1:t}) - b(s_{1:t'}) # t' = reused prefix length
b(s_{1:t}) = scale · t(t+1)/2 # quadratic attention cost
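To make these formulas concrete, here is a small standalone sketch that evaluates b, b_eff, η_bw, and the scheduling priority r / (b_eff + q + ε) in plain Python. The function names, the `eps` default, the use of empirical totals in place of expectations, and the example numbers are illustrative assumptions; the packaged kernels in echo_rl/kernels/ may use different signatures.

```python
def bandwidth_cost(seq_len, scale=1.0):
    """b(s_{1:t}) = scale * t(t+1)/2, the quadratic attention cost."""
    return scale * seq_len * (seq_len + 1) / 2.0


def effective_cost(seq_len, reuse_len, scale=1.0):
    """b_eff = b(s_{1:t}) - b(s_{1:t'}), where t' is the reused prefix length."""
    reuse_len = max(0, min(reuse_len, seq_len))  # clamp t' to [0, t]
    return bandwidth_cost(seq_len, scale) - bandwidth_cost(reuse_len, scale)


def eta_bw(total_reward, rollout_bandwidth, learner_cost):
    """Reward per bandwidth unit: reward / (rollout b_eff + weighted PG-loss term)."""
    denom = rollout_bandwidth + learner_cost
    return total_reward / denom if denom > 0 else 0.0


def schedule_priority(reward, seq_len, reuse_len, queue_time, eps=1e-6):
    """priority = r / (b_eff + q + eps)."""
    return reward / (effective_cost(seq_len, reuse_len) + queue_time + eps)


full = bandwidth_cost(128)        # 128 * 129 / 2 = 8256.0
b_eff = effective_cost(128, 96)   # 8256.0 - 96 * 97 / 2 = 3600.0
warm = schedule_priority(1.0, 128, 96, queue_time=0.5)  # 96 tokens reused
cold = schedule_priority(1.0, 128, 0, queue_time=0.5)   # nothing reused
print(full, b_eff, warm > cold)
```

With a 96-token reused prefix on a 128-token sequence, the effective cost drops from 8256 to 3600 units, so the same reward yields a higher priority than a request with no cached prefix.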
Performance-critical paths are implemented in C++ (echo_rl/kernels/) with Python fallbacks:
| Kernel | Paper reference |
|---|---|
| EMAPlanTracker | Shared EMA plan τ̄ for replay scoring |
| plan_surprise | \|\|τ_t - τ̄\|\|² + α\|r_t\| |
| prefix_match | KV prefix reuse: KV(s₁:t) = KV_frozen ∪ KV_rolling |
| priority_sample | Softmax replay sampling + importance weights |
| attention_bandwidth_cost | Rollout bandwidth b(s₁:t) |
| effective_bandwidth_cost | KV-aware effective bandwidth b_eff(s₁:t) |
| bandwidth_aware_priorities | Scheduling priority r / (b + q + ε) |
| bandwidth_efficiency | η_bw learning return per bandwidth unit |
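As a rough picture of what two of these kernels compute, the sketch below implements prefix matching and softmax replay sampling in pure Python. It is not the repository's fallback code: the names `shared_prefix_len` and `softmax_sample` are hypothetical, sampling is done with replacement, and the importance weights use the 1 / (N · p_i) form normalized by the batch maximum, which is an assumed convention borrowed from standard prioritized experience replay.

```python
import math
import random


def shared_prefix_len(cached_tokens, new_tokens):
    """Length t' of the longest common prefix, i.e. how much KV cache is reusable."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


def softmax_sample(scores, batch_size, temperature=1.0, rng=None):
    """Sample indices with probability softmax(score / temperature)."""
    rng = rng or random.Random()
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    indices = rng.choices(range(len(scores)), weights=probs, k=batch_size)
    # Importance weights correct for non-uniform sampling (assumed form).
    raw = [1.0 / (len(scores) * probs[i]) for i in indices]
    largest = max(raw)
    weights = [w / largest for w in raw]
    return indices, weights, probs


reuse = shared_prefix_len([1, 2, 3, 4], [1, 2, 9, 4])  # 2 tokens reusable
indices, weights, probs = softmax_sample([0.0, 1.0, 2.0], batch_size=4,
                                         rng=random.Random(0))
print(reuse, indices, weights)
```

A lower temperature concentrates sampling on the highest-scoring transitions, while a higher temperature approaches uniform sampling.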
Build kernels:
pip install pybind11
python setup.py build_ext --inplace
python -c "from echo_rl.kernels import kernels_available; print(kernels_available())"

EchoRL consists of three core components:
from echo_rl.core.bandwidth import (
    BandwidthConfig,
    BandwidthEfficiencyTracker,
    BandwidthAwareScheduler,
)
from echo_rl.kernels import effective_bandwidth_cost, bandwidth_efficiency
# Effective bandwidth with KV prefix reuse
b_eff = effective_bandwidth_cost(seq_len=128, reuse_len=96, scale=1.0)
# Bandwidth-aware rollout scheduling
scheduler = BandwidthAwareScheduler(BandwidthConfig(bandwidth_weight=1.0))
priority = scheduler.compute_priority(reward=1.0, seq_len=128, queue_time=0.5, reuse_len=96)
# Track η_bw during training
tracker = BandwidthEfficiencyTracker()
tracker.record_rollout_step(reward=0.5, seq_len=64, reuse_len=48)
tracker.record_learner_update(weighted_pg_loss=0.02)
metrics = tracker.snapshot()
print(f"η_bw = {metrics.eta_bw:.4f}, saved = {metrics.total_bandwidth_saved:.2f}")

from echo_rl.core import PlanningConfig
from echo_rl.core.latent_planning import LatentPlanningOptimizer, TrajectoryEncoder
# Trajectory encoder: τ_t = F_φ(s_{t-k:t})
encoder = TrajectoryEncoder(state_dim=512, config=PlanningConfig())
# Policy conditioning: π_θ(a_t | s_t, τ_t)
policy = PolicyNetwork(state_dim=512, action_dim=20, latent_dim=512)
# KL regularization: L_KL = D_KL[p_φ(τ_t | s_{1:t}) || p_φ(τ_{t-1} | s_{1:t-1})]
optimizer = LatentPlanningOptimizer(state_dim=512, action_dim=20, config=PlanningConfig())

from echo_rl.core import ExecutionConfig
from echo_rl.core.async_execution import AsyncExecutionEngine, KVCacheManager
# KV-cache sharing: KV(s1:t) = KV_frozen(s1:t') ∪ KV_rolling(s_{t'+1:t})
cache_manager = KVCacheManager(config=ExecutionConfig())
# Priority scheduling: priority(i) = r_i / (q_i + ε)
execution_engine = AsyncExecutionEngine(
    config=ExecutionConfig(),
    model=policy_network,
    device="cuda"
)
# Submit an async rollout (call from inside a coroutine)
request_id = await execution_engine.submit_rollout(
    state_sequence=state_window,
    priority=1.0
)

from echo_rl.core import ReplayConfig
from echo_rl.core.prioritized_replay import PrioritizedReplayBuffer, HotColdBuffer
# Hot/cold stratification
replay_buffer = PrioritizedReplayBuffer(config=ReplayConfig())
# Surprise-weighted sampling: score(t) = ||τ_t - τ̄||² + α|r_t|
experiences, weights = replay_buffer.sample_batch(
    batch_size=256,
    temperature=1.0
)

EchoRL achieves significant improvements across all evaluated tasks:
| Task | Method | Success@1 (%) | ETPS | Cost/Success |
|---|---|---|---|---|
| ALFWorld | ReAct | 58.3 | 1,234 | $0.041 |
| ALFWorld | EchoRL | 73.1 | 2,721 | $0.027 |
| WebShop | ReAct | 58.3 | 1,234 | $0.041 |
| WebShop | EchoRL | 73.1 | 2,721 | $0.027 |
| CRUXEval | ReAct | 58.3 | 1,234 | $0.041 |
| CRUXEval | EchoRL | 73.1 | 2,721 | $0.027 |
- 30-55% fewer environment steps through trajectory-conditioned actions
- 1.5-2.3× ETPS increase via KV-cache sharing and token-level dispatch
- 22-41% cost reduction through prioritized replay system
- 78% KV reuse rate with prefix caching strategy
ALFWorld: text-world control tasks requiring object manipulation and navigation.
from echo_rl.environments.alfworld import ALFWorldEnvironment, ALFWorldConfig
config = ALFWorldConfig(task_type="pick_and_place", max_objects=10)
env = ALFWorldEnvironment(config)

WebShop: web-based shopping agent tasks with product search and purchase completion.
from echo_rl.environments.webshop import WebShopEnvironment, WebShopConfig
config = WebShopConfig(website_type="electronics", budget_limit=1000.0)
env = WebShopEnvironment(config)

CRUXEval: code repair and debugging tasks requiring bug identification and fixing.
from echo_rl.environments.cruxeval import CRUXEvalEnvironment, CRUXEvalConfig
config = CRUXEvalConfig(language="python", max_code_length=1000)
env = CRUXEvalEnvironment(config)

ARC: abstract reasoning tasks with grid-based puzzles requiring pattern recognition.
from echo_rl.environments.arc import ARCEnvironment, ARCConfig
config = ARCConfig(grid_size=10, task_type="pattern_completion")
env = ARCEnvironment(config)

MiniGrid: grid-world planning tasks with navigation, object manipulation, and goal completion.
from echo_rl.environments.minigrid import MiniGridEnvironment, MiniGridConfig
config = MiniGridConfig(grid_size=8, task_type="key_door")
env = MiniGridEnvironment(config)

from echo_rl.utils.monitoring import PerformanceMonitor, MetricsCollector
# Real-time performance tracking
monitor = PerformanceMonitor()
monitor.start_monitoring()
# Comprehensive metrics collection
collector = MetricsCollector()
collector.collect_metrics(performance_metrics)

from echo_rl.evaluation.benchmark import EchoRLBenchmark, BenchmarkConfig
config = BenchmarkConfig(
    tasks=["alfworld", "webshop", "cruxeval"],
    backbones=["gpt-4o", "claude-3.5-sonnet"],
    baselines=["react", "tot", "ppo-rlhf"],
    num_seeds=10
)
benchmark = EchoRLBenchmark(config)
results = await benchmark.run_benchmark()

from echo_rl.training.trainer import TrainingConfig
config = TrainingConfig(
    env_name="alfworld",
    total_timesteps=1000000,
    learning_starts=10000,
    train_frequency=4,
    evaluation_frequency=10000,
    save_frequency=50000,
    num_actors=128,
    num_learners=2,
    batch_size=256,
    device="cuda"
)

from echo_rl.core import PlanningConfig, ExecutionConfig, ReplayConfig, PPOConfig
# Latent planning
planning_config = PlanningConfig(
    embedding_dim=512,
    state_window_size=8,
    kl_weight=0.1,
    learning_rate=3e-4
)
# Async execution
execution_config = ExecutionConfig(
    max_concurrent_rollouts=128,
    max_cache_size=10000,
    timeout=30.0
)
# Prioritized replay
replay_config = ReplayConfig(
    hot_buffer_size=1000000,
    cold_buffer_size=10000000,
    age_threshold=1000,
    temperature=1.0
)
# PPO learner
ppo_config = PPOConfig(
    learning_rate=3e-4,
    clip_epsilon=0.2,
    value_loss_coef=0.5,
    entropy_coef=0.01,
    kl_coef=0.1,
    gae_lambda=0.95,
    gamma=0.99
)

- train_echo_rl.py - Basic training script
- benchmark_echo_rl.py - Comprehensive benchmarking
- latent_planning_demo.py - Trajectory encoding demo
- async_execution_demo.py - KV-cache sharing demo
- prioritized_replay_demo.py - Hot/cold buffer demo
- bandwidth_efficient_demo.py - Bandwidth efficiency demo

Run the test suite:
# Run all tests
pytest tests/
# Run specific test categories
pytest tests/test_core/ # Core components
pytest tests/test_environments/ # Environment interfaces
pytest tests/test_training/ # Training infrastructure
pytest tests/test_evaluation/ # Evaluation and benchmarking

To reproduce the results from the EchoRL paper:
# Full benchmark across all tasks and backbones
python examples/benchmark_echo_rl.py \
--tasks alfworld webshop cruxeval arc minigrid \
--backbones gpt-4o claude-3.5-sonnet gemini-1.5-pro llama-4 qwen-7b deepseek-r1 \
--baselines react tot ppo-rlhf rlaif impala \
--num-seeds 10 \
--num-episodes 100

Create custom benchmark configurations:
from echo_rl.evaluation.benchmark import BenchmarkConfig
config = BenchmarkConfig(
    tasks=["custom_task"],
    backbones=["custom_backbone"],
    baselines=["custom_baseline"],
    num_seeds=5,
    num_episodes=50,
    echo_rl_configs={
        "total_timesteps": 50000,
        "num_actors": 64
    }
)

This project is licensed under the MIT License - see the LICENSE file for details.