Guangyu Zhao* | Kewei Lian* | [Haoxuan Ru*] | Borong Zhang* | Haowei Lin | Zhancun Mu | [Haobo Fu] | [Qiang Fu] | Shaofei Cai | Zihao Wang | Yitao Liang
Goal-conditioned policies enable decision-making models to execute diverse behaviors based on specified goals. We formulate post-training adaptation as a latent control problem, where the goal embedding serves as a continuous control variable to modulate the behavior of a frozen policy.
This repository provides the infrastructure to:
- Run the GROOT foundation policy in Minecraft.
- Generate trajectory data (videos, actions, and rewards) for preference learning.
- Train and optimize the latent goal (
$z$ ) using a DPO-like loss based on preference pairs. - Test the optimized latent goal in the environment.
- Python 3.9+
- Java 8 (for Minecraft environment
MCP-Reborn) - Conda (recommended for environment management)
Ensure you have the required dependencies installed. You can use the provided pyproject.toml or install the necessary packages manually (e.g., torch, gymnasium, av, opencv-python, hydra-core, omegaconf, einops, rich, torchmetrics).
# Example using conda
conda create -n pgt python=3.10
conda activate pgt
pip install -r requirements.txt # Or install based on pyproject.tomlNote: The Minecraft environment (MCP-Reborn) requires specific Java setup and Xvfb/vglrun for headless rendering. Please ensure launchClient.sh is configured correctly for your system.
The PGT pipeline consists of three main steps: Data Generation, Training, and Testing.
To generate rollouts for a specific task using the original GROOT policy (this serves as both testing the original policy and generating data for PGT), use the run.py script.
This script runs multiple episodes in parallel across available GPUs.
python run.py \
--env_name diverses/collect_wood \
--ref_video reference_videos/collect_wood.mp4 \
--num_episodes 50 \
--max_steps 128 \
--output_dir outputsArguments:
--env_name: The environment configuration path (e.g.,diverses/collect_wood). Configurations are located injarvis/global_configs/envs/.--ref_video: The path to the reference video for the task, used to extract the initial latent goal.--num_episodes: Total number of episodes to generate.--max_steps: Maximum number of steps per episode.--output_dir: Directory to save the generated trajectories.
Outputs:
The generated episodes will be saved in the outputs/<task>/ directory:
.mp4video files for each episode..pkltrajectory files containing detailed observations and actions.rewards.jsoncontaining the cumulative reward for each episode.
Once the data is generated, run the PGT training script to optimize the latent goal based on the preference pairs formed from the generated trajectories.
python train_pgt.py \
--task collect_wood \
--output_dir outputs/collect_wood \
--ref_video reference_videos/collect_wood.mp4 \
--beta 0.1 \
--lr 1e-2 \
--epochs 50How it works:
- Loads the generated
.pkltrajectories andrewards.json. - Forms preference pairs
$(\tau_w, \tau_l)$ where$R(\tau_w) > R(\tau_l)$ . - Splits the pairs into training and evaluation sets.
- Optimizes the latent goal
$z$ using Adam optimizer to maximize the likelihood of preferred trajectories and minimize the likelihood of non-preferred ones (DPO loss). - Selects the best
$z$ based on the lowest evaluation loss. - Saves the optimized latent goal to
outputs/<task>/optimized_z.pkl.
Finally, you can test the policy using the newly optimized latent goal by passing the --given_latent argument to run.py:
python run.py \
--env_name diverses/collect_wood \
--ref_video reference_videos/collect_wood.mp4 \
--num_episodes 20 \
--max_steps 128 \
--given_latent outputs/collect_wood/optimized_z.pkl \
--output_dir outputs_testCompare the average rewards in outputs_test/<task>/rewards.json with the original generation step to evaluate the performance improvement.
run.py: Main script for running environment rollouts and data collection.train_pgt.py: Script for training and optimizing the latent goal using PGT.checkpoints/: Directory containing pretrained policy weights (e.g., GROOT).reference_videos/: Directory containing reference videos for various tasks.jarvis/: Core library containing environment wrappers, model architectures, and configurations.stark_tech/: Minecraft environment interface (MinecraftWrapper) and backend (MCP-Reborn).arm/: Model architectures (ConditionedAgent,GrootPolicy, etc.).global_configs/: Environment and task YAML configurations.assets/: Minecraft static assets, recipes, and spawn configurations.
@misc{zhao2026pgt,
title={Preference Goal Tuning: Post-Training as Latent Control for Frozen Policies},
author={Guangyu Zhao and Kewei Lian and Haoxuan Ru and Borong Zhang and Haowei Lin and Zhancun Mu and Haobo Fu and Qiang Fu and Shaofei Cai and Zihao Wang and Yitao Liang},
year={2026},
eprint={2412.02125},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2412.02125}
}