Preference Goal Tuning: Post-Training as Latent Control for Frozen Policies

Overview

Goal-conditioned policies enable decision-making models to execute diverse behaviors based on specified goals. We formulate post-training adaptation as a latent control problem, where the goal embedding serves as a continuous control variable to modulate the behavior of a frozen policy.

This repository provides the infrastructure to:

Run the GROOT foundation policy in Minecraft.
Generate trajectory data (videos, actions, and rewards) for preference learning.
Optimize the latent goal ($z$) with a DPO-like loss computed over preference pairs.
Test the optimized latent goal in the environment.

Installation

Prerequisites

Python 3.9+
Java 8 (for Minecraft environment MCP-Reborn)
Conda (recommended for environment management)

Setup Environment

Ensure you have the required dependencies installed. You can use the provided pyproject.toml or install the necessary packages manually (e.g., torch, gymnasium, av, opencv-python, hydra-core, omegaconf, einops, rich, torchmetrics).

# Example using conda
conda create -n pgt python=3.10
conda activate pgt
pip install -e . # Reads dependencies from pyproject.toml; alternatively, install the packages listed above manually

Note: The Minecraft environment (MCP-Reborn) requires a specific Java setup, plus Xvfb or vglrun for headless rendering. Make sure launchClient.sh is configured correctly for your system.

Usage Pipeline

The PGT pipeline consists of three main steps: Data Generation, Training, and Testing.

1. Data Generation (Rollout)

Use the run.py script to generate rollouts for a specific task with the original GROOT policy. This step serves two purposes: it evaluates the original policy, and it produces the trajectory data used for PGT.

This script runs multiple episodes in parallel across available GPUs.

python run.py \
    --env_name diverses/collect_wood \
    --ref_video reference_videos/collect_wood.mp4 \
    --num_episodes 50 \
    --max_steps 128 \
    --output_dir outputs

Arguments:

--env_name: The environment configuration path (e.g., diverses/collect_wood). Configurations are located in jarvis/global_configs/envs/.
--ref_video: The path to the reference video for the task, used to extract the initial latent goal.
--num_episodes: Total number of episodes to generate.
--max_steps: Maximum number of steps per episode.
--output_dir: Directory to save the generated trajectories.

Outputs: The generated episodes will be saved in the outputs/<task>/ directory:

.mp4 video files for each episode.
.pkl trajectory files containing detailed observations and actions.
rewards.json containing the cumulative reward for each episode.

2. PGT Training

Once the data is generated, run the PGT training script to optimize the latent goal based on the preference pairs formed from the generated trajectories.

python train_pgt.py \
    --task collect_wood \
    --output_dir outputs/collect_wood \
    --ref_video reference_videos/collect_wood.mp4 \
    --beta 0.1 \
    --lr 1e-2 \
    --epochs 50

How it works:

Loads the generated .pkl trajectories and rewards.json.
Forms preference pairs $(\tau_w, \tau_l)$ where $R(\tau_w) > R(\tau_l)$.
Splits the pairs into training and evaluation sets.
Optimizes the latent goal $z$ with the Adam optimizer under a DPO loss, which raises the likelihood of preferred trajectories and lowers the likelihood of non-preferred ones.
Selects the best $z$ based on the lowest evaluation loss.
Saves the optimized latent goal to outputs/<task>/optimized_z.pkl.
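
The pair-formation and loss steps above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the code in train_pgt.py: the function names, the all-pairs construction, the 20% evaluation split, and the use of a reference log-probability (assumed here to come from the frozen policy conditioned on the initial latent extracted from the reference video) are all assumptions.

```python
import itertools
import math
import random


def build_preference_pairs(rewards):
    """Return (winner, loser) episode-id pairs where the winner has strictly higher reward.

    `rewards` is assumed to map an episode id to its cumulative reward.
    """
    pairs = []
    for a, b in itertools.combinations(sorted(rewards), 2):
        if rewards[a] > rewards[b]:
            pairs.append((a, b))
        elif rewards[b] > rewards[a]:
            pairs.append((b, a))
        # Ties carry no preference signal, so they are skipped.
    return pairs


def split_pairs(pairs, eval_fraction=0.2, seed=0):
    """Shuffle the pairs and hold out a fraction for evaluation (hypothetical ratio)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


def dpo_like_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Compute -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]).

    logp_* are trajectory log-likelihoods under the current latent z;
    ref_logp_* are the same quantities under the assumed reference latent.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # softplus(-margin) equals -log(sigmoid(margin)); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy rewards standing in for the contents of rewards.json.
rewards = {"ep0": 3.0, "ep1": 1.0, "ep2": 3.0, "ep3": 0.0}
pairs = build_preference_pairs(rewards)
train_pairs, eval_pairs = split_pairs(pairs)
```

In an actual training loop the log-likelihoods would be differentiable functions of $z$ computed by the frozen policy, and the loss would be averaged over a batch of pairs before each Adam step; the scalar version here only shows the shape of the objective.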

3. Testing with Optimized Latent Goal

Finally, you can test the policy using the newly optimized latent goal by passing the --given_latent argument to run.py:

python run.py \
    --env_name diverses/collect_wood \
    --ref_video reference_videos/collect_wood.mp4 \
    --num_episodes 20 \
    --max_steps 128 \
    --given_latent outputs/collect_wood/optimized_z.pkl \
    --output_dir outputs_test

To evaluate the performance improvement, compare the average reward in outputs_test/<task>/rewards.json with the average reward from the data generation step in outputs/<task>/rewards.json.
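
A small helper along the following lines could automate that comparison. It is a sketch under an assumption about the file format: rewards.json is taken to be either a JSON object mapping episode names to cumulative rewards or a JSON list of rewards, which may not match what run.py actually writes. The names mean_reward and compare_runs are hypothetical.

```python
import json
from pathlib import Path
from statistics import mean


def mean_reward(path):
    """Average the per-episode rewards stored in a rewards.json file.

    Assumes the file holds either {"episode": reward, ...} or [reward, ...].
    """
    data = json.loads(Path(path).read_text())
    values = data.values() if isinstance(data, dict) else data
    return mean(float(v) for v in values)


def compare_runs(baseline_path, tuned_path):
    """Return the mean reward of each run and the difference (tuned minus baseline)."""
    baseline = mean_reward(baseline_path)
    tuned = mean_reward(tuned_path)
    return {"baseline": baseline, "tuned": tuned, "improvement": tuned - baseline}


# Example call, using the paths from the commands above:
# compare_runs("outputs/collect_wood/rewards.json",
#              "outputs_test/collect_wood/rewards.json")
```

Because the two runs use different episode counts (50 and 20 in the commands above), comparing means rather than totals keeps the numbers on the same scale.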

Repository Structure

run.py: Main script for running environment rollouts and data collection.
train_pgt.py: Script for training and optimizing the latent goal using PGT.
checkpoints/: Directory containing pretrained policy weights (e.g., GROOT).
reference_videos/: Directory containing reference videos for various tasks.
jarvis/: Core library containing environment wrappers, model architectures, and configurations.
- stark_tech/: Minecraft environment interface (MinecraftWrapper) and backend (MCP-Reborn).
- arm/: Model architectures (ConditionedAgent, GrootPolicy, etc.).
- global_configs/: Environment and task YAML configurations.
- assets/: Minecraft static assets, recipes, and spawn configurations.

Citing PGT

@misc{zhao2026pgt,
      title={Preference Goal Tuning: Post-Training as Latent Control for Frozen Policies},
      author={Guangyu Zhao and Kewei Lian and Haoxuan Ru and Borong Zhang and Haowei Lin and Zhancun Mu and Haobo Fu and Qiang Fu and Shaofei Cai and Zihao Wang and Yitao Liang},
      year={2026},
      eprint={2412.02125},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2412.02125}
}
