A Vision-Language-Action agent demo that plays Pokemon Red using LiquidAI's LFM2-VL-450M model, trained with TRL's GRPO algorithm and optimized via Unsloth.
- 🎮 Vision-Language Agent: Uses LFM2-VL-450M to analyze game screens and decide actions
- 🚀 Memory Efficient: 4-bit quantization and LoRA adapters via Unsloth
- 🏋️ GRPO Training: Group Relative Policy Optimization with custom game rewards
- 📊 Interactive Demo: Watch the agent play with real-time statistics
- GPU: NVIDIA GPU with 8GB+ VRAM (16GB+ recommended for training)
- Environment Server: The Pokemon Red OpenEnv server must be running
```bash
cd Pokemon_Red_OpenEnv/Agent_Demo

# Create virtual environment and install dependencies
uv sync

# Activate environment (optional, uv run handles this)
source .venv/bin/activate
```

In a separate terminal, start the environment server:

```bash
cd Pokemon_Red_OpenEnv/pokemonred_env
uv sync
uv run python -m server.app
```

The server will start at http://localhost:8000.
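To confirm the server is actually reachable before launching the agent, a quick connectivity check like the one below works. This is a convenience sketch, not part of the demo; any HTTP response from port 8000 means the server process is up.

```python
# Quick connectivity check for the Pokemon Red OpenEnv server (sketch, not part of the demo).
import urllib.error
import urllib.request

def server_is_up(url: str = "http://localhost:8000") -> bool:
    """Return True if anything is answering on the server port."""
    try:
        urllib.request.urlopen(url, timeout=2)
        return True
    except urllib.error.HTTPError:
        # The server answered with an error status (e.g. 404 on "/") -- it is running.
        return True
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    print("Server is up" if server_is_up() else "Server not reachable -- start it first")
```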
```bash
# Run with base model (no training required)
uv run python demo.py --max-steps 100

# Run in headless mode for faster testing
uv run python demo.py --headless --max-steps 50

# Use a trained checkpoint
uv run python demo.py --checkpoint outputs/final
```

```bash
# Start training
uv run python train.py --max-steps 500

# Use custom config
uv run python train.py --config configs/train_config.yaml

# Resume from checkpoint
uv run python train.py --resume outputs/checkpoint-200
```

```
Agent_Demo/
├── agent/
│   ├── __init__.py
│   └── vla_agent.py        # VLA agent with LFM2-VL-450M
├── env/
│   ├── __init__.py
│   └── env_wrapper.py      # Environment wrapper for training
├── training/
│   ├── __init__.py
│   └── trainer.py          # GRPO trainer configuration
├── configs/
│   ├── train_config.yaml   # Training hyperparameters
│   └── demo_config.yaml    # Demo settings
├── demo.py                 # Interactive demo script
├── train.py                # Training entry point
├── pyproject.toml          # Dependencies
└── README.md
```
| Parameter | Default | Description |
|---|---|---|
| `model_id` | `LiquidAI/LFM2-VL-450M` | HuggingFace model ID |
| `lora_rank` | 16 | LoRA rank for fine-tuning |
| `batch_size` | 1 | Per-device batch size |
| `gradient_accumulation` | 8 | Effective batch = 1 × 8 = 8 |
| `learning_rate` | 5e-6 | Learning rate |
| `num_generations` | 4 | GRPO completions per prompt |
| `max_steps` | 1000 | Total training steps |
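The defaults above live in `configs/train_config.yaml`. If you want to tweak a run without editing the file by hand, a small override script works; this sketch assumes the keys sit at the top level of the YAML file (check the file for the actual layout) and requires PyYAML.

```python
# Sketch: load the training config, override two values, and write a new file.
# Assumes top-level keys matching the table above; verify against configs/train_config.yaml.
import yaml

with open("configs/train_config.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["num_generations"] = 2   # fewer GRPO completions per prompt to save VRAM
cfg["learning_rate"] = 1e-5  # example override

with open("configs/train_config_custom.yaml", "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)

# Then launch: uv run python train.py --config configs/train_config_custom.yaml
```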
| Parameter | Default | Description |
|---|---|---|
| `checkpoint` | `null` | Path to trained LoRA checkpoint |
| `max_steps` | 1000 | Steps to run demo |
| `temperature` | 0.1 | Sampling temperature (lower = more deterministic) |
| `delay` | 0.1 | Seconds between steps |
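For a feel of how `temperature` and `delay` are consumed, here is a minimal demo-loop sketch. The `agent.generate_action` and `env.step` calls are hypothetical stand-ins for whatever `demo.py` actually does, not its real API.

```python
# Minimal demo-loop sketch showing where temperature and delay plug in.
# `agent` and `env` are hypothetical stand-ins for the objects built by demo.py.
import time

def run_demo(agent, env, max_steps: int = 1000, temperature: float = 0.1, delay: float = 0.1):
    obs = env.reset()
    for _ in range(max_steps):
        # Low temperature -> near-greedy decoding, so the agent sticks to its
        # highest-probability button; raise it for more exploratory play.
        action = agent.generate_action(obs, temperature=temperature)
        obs = env.step(action)
        time.sleep(delay)  # throttle so the gameplay is watchable
```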
- Game Screen → LFM2-VL-450M processes the 144×160 pixel Game Boy screen
- Context Prompt → Agent receives HP, position, and battle status
- Action Prediction → Model outputs one of 7 actions: Down, Left, Right, Up, A, B, Start
- Environment Step → Action is executed, reward is returned
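In terms of the Hugging Face API, one decision step looks roughly like the sketch below; the prompt wording and answer parsing are illustrative assumptions, and the real versions live in `agent/vla_agent.py`.

```python
# One inference step: Game Boy screen + text context -> one of the 7 buttons.
# Illustrative sketch; assumes `model` and `processor` were loaded with
# AutoModelForImageTextToText / AutoProcessor for LiquidAI/LFM2-VL-450M.
from PIL import Image

ACTIONS = ["Down", "Left", "Right", "Up", "A", "B", "Start"]

def choose_action(model, processor, screen: Image.Image, context: str) -> str:
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": screen},
            {"type": "text", "text": f"{context}\nAnswer with exactly one action: {', '.join(ACTIONS)}."},
        ],
    }]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=8, do_sample=True, temperature=0.1)
    reply = processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    # Fall back to "A" if the reply doesn't name a valid button.
    return next((a for a in ACTIONS if a.lower() in reply.lower()), "A")
```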
The agent is trained using Group Relative Policy Optimization:
- Prompt Generation: Collect game states by playing randomly
- Multiple Completions: Generate 4 action predictions per state
- Reward Evaluation: Execute each action and get game reward
- Policy Update: Optimize model to favor higher-reward actions
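The loop above is driven by TRL's `GRPOTrainer`. A stripped-down configuration matching the defaults from the training table might look like the following; the project's actual setup (Unsloth 4-bit loading, LoRA adapters, vision inputs, environment-backed rewards) lives in `training/trainer.py`.

```python
# GRPO configuration sketch with TRL, mirroring the documented defaults.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch = 1 x 8 = 8
    learning_rate=5e-6,
    num_generations=4,               # GRPO completions per prompt
    max_steps=1000,
    bf16=True,
    logging_steps=10,
)

# The trainer then ties the model, the collected game-state prompts, and the
# reward functions together (names below are placeholders for the real objects):
# trainer = GRPOTrainer(
#     model=model,                    # LFM2-VL-450M with LoRA adapters
#     args=config,
#     train_dataset=prompt_dataset,   # game states collected by random play
#     reward_funcs=[game_reward, action_validity_reward, brevity_reward],
# )
# trainer.train()
```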
| Reward | Weight | Description |
|---|---|---|
| Game Reward | Primary | From environment (exploration, badges, levels) |
| Action Validity | 0.1 | Bonus for valid action format |
| Brevity | 0.05 | Bonus for concise responses |
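Putting the table together, the shaped reward for one completion is roughly `game_reward + 0.1 * validity + 0.05 * brevity`. Here is a sketch; the validity check and brevity cutoff are assumptions, and only the 0.1 / 0.05 weights come from the table.

```python
# Composite reward sketch matching the weights above; the exact validity and
# brevity criteria are assumptions, not the project's reward code.
VALID_ACTIONS = {"down", "left", "right", "up", "a", "b", "start"}

def shaped_reward(completion: str, game_reward: float) -> float:
    text = completion.strip().lower()
    validity_bonus = 0.1 if text in VALID_ACTIONS else 0.0
    brevity_bonus = 0.05 if len(text.split()) <= 2 else 0.0
    return game_reward + validity_bonus + brevity_bonus
```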
| Mode | VRAM | Notes |
|---|---|---|
| Demo (4-bit) | ~4GB | Base model inference |
| Demo (bf16) | ~2GB | Uses LFM2-VL-450M's small size |
| Training (4-bit + LoRA) | ~8-12GB | With gradient checkpointing |
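The 4-bit rows correspond to loading the model with a bitsandbytes quantization config. Below is a loading sketch (requires a CUDA GPU plus `transformers` and `bitsandbytes`); the demo itself loads the model through Unsloth, so treat this as an approximation of the memory footprint rather than the demo's loading path.

```python
# 4-bit loading sketch for LFM2-VL-450M via transformers + bitsandbytes.
# The demo loads the model through Unsloth; this only approximates the
# "Demo (4-bit)" row above.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

model_id = "LiquidAI/LFM2-VL-450M"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```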
Make sure the Pokemon Red environment server is running:

```bash
cd ../pokemonred_env && uv run python -m server.app
```

If you run out of memory during training:

- Reduce `batch_size` to 1
- Reduce `num_generations` to 2
- Ensure `load_in_4bit: true` in the config
- Enable GSPO: `use_gspo: true`
- Use the 8-bit optimizer: `optim: adamw_8bit`
- Enable gradient checkpointing (enabled by default)
This demo is part of the Pokemon Red OpenEnv project for The OpenEnv Challenge hackathon.