Turning frozen visual foundation embeddings into a compact, task-centric latent space for reward-free offline planning and control.
Minghao Fu · Fan Feng · Nicklas Hansen · Biwei Huang | UC San Diego
TC-WM treats a pretrained visual backbone (e.g. DINOv2) as a semantic scaffold, not the final state space. A linear projection compresses its embedding into a compact latent; a designated subspace is aligned with proprioception via InfoNCE; a ViT predicts latent dynamics; a linear decoder reconstructs the embedding to prevent collapse. The task-centric block is identifiable up to a simple transformation.
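The alignment step can be illustrated with a minimal InfoNCE sketch. This is not the repository's implementation: the function name `info_nce`, the temperature value, and the use of plain Python lists instead of tensors are assumptions made for illustration. It also assumes the aligned latent subspace and the proprioceptive signal have already been mapped to vectors of the same length. Each latent row is a query whose positive is the proprioceptive vector at the same batch index; every other row in the batch acts as a negative.

```python
import math

def _normalize(v):
    """Scale a vector to unit L2 norm (guards against zero vectors)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(latents, proprio, temperature=0.1):
    """Mean InfoNCE loss for paired batches of equal-length vectors.

    latents[i] and proprio[i] form the positive pair; proprio[j] with
    j != i are the negatives for query i. Hypothetical helper, not TC-WM code.
    """
    z = [_normalize(v) for v in latents]
    p = [_normalize(v) for v in proprio]
    total = 0.0
    for i, zi in enumerate(z):
        # Cosine similarity between query i and every candidate key.
        logits = [sum(a * b for a, b in zip(zi, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        lse = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the positive located at index i.
        total += lse - logits[i]
    return total / len(z)

# Matched pairs should score a lower loss than mismatched pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

Minimizing this loss pulls each latent toward its own proprioceptive reading and away from the readings of other samples in the batch, which is the sense in which a subspace is "aligned with proprioception".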
Empirically, TC-WM enables zero-shot test-time planning across nine offline visual-control tasks — Maze, Wall, Push-T, Lift, Can, Square, Reacher, Cheetah, Hopper — beating DINO-WM on every LDP task and matching strong model-based baselines.
conda env create -f environment.yml
conda activate tcwm
bash install_mujoco.sh # MuJoCo 210 + LD_LIBRARY_PATH setup
pip install d4rl "pymunk==6.*" shapely scikit-image pygame # env extras
Set the data root once:
export TCWM_DATA_ROOT=/path/to/data # configs read this via ${oc.env:TCWM_DATA_ROOT,./data}
# Single task
python train.py --config-name=train_tcwm env=wall
# All TC-WM tasks (one GPU each, picks free GPUs automatically)
bash run_tcwm.sh
Configs live in conf/ (Hydra). Key knobs: env=, encoder= (dino / vjepa / dinov3), projected_dim, alignment_dim, training.epochs. Weights & Biases logging runs offline by default (WANDB_MODE=offline).
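For readers unfamiliar with the `${oc.env:TCWM_DATA_ROOT,./data}` interpolation used for the data root, its lookup-with-default behavior can be approximated in plain Python. The helpers `resolve_data_root` and `checkpoint_dir` below are hypothetical and only mimic the fallback rule and the checkpoint path layout that appears in the commands of this README; the actual resolution is performed by OmegaConf inside Hydra.

```python
import os

def resolve_data_root(env=None, var="TCWM_DATA_ROOT", default="./data"):
    """Return the value of `var` if it is set, otherwise `default`.

    Mirrors the fallback rule of ${oc.env:TCWM_DATA_ROOT,./data}.
    `env` defaults to the process environment; pass a dict for testing.
    """
    env = os.environ if env is None else env
    return env.get(var, default)

def checkpoint_dir(data_root, run_name):
    """Build <data_root>/checkpoints/<run_name> with forward slashes."""
    return f"{data_root.rstrip('/')}/checkpoints/{run_name}"

# With the variable unset, the default "./data" applies.
root = resolve_data_root({})
print(checkpoint_dir(root, "wall_proj256"))
```

Under this rule, forgetting to export the variable does not fail loudly; paths silently fall back to ./data relative to the working directory.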
python plan.py --config-name=plan_wall \
ckpt_base_path=$TCWM_DATA_ROOT/checkpoints/wall_proj256 \
n_evals=50
Available planners: CEM (conf/planner/cem.yaml), LDP (conf/planner/ldp.yaml), GD. Rollout videos are saved automatically as plan{batch}_{trial}_{success|failure}.mp4 in the Hydra run directory.
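As background on the first planner listed above, here is a minimal, generic cross-entropy method (CEM) loop on a toy one-dimensional problem. It is a sketch under assumptions, not the code in planning/: the function `cem_plan`, its hyperparameters, and the toy point-mass dynamics and cost are all invented for illustration. In TC-WM the cost would instead be evaluated on latents rolled out by the learned predictor.

```python
import random
import statistics

def cem_plan(cost_fn, horizon, iters=20, pop=64, n_elite=8, seed=0):
    """Generic CEM: sample action sequences, refit a Gaussian to the elites."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iters):
        # Sample a population of action sequences from the current Gaussian.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(pop)]
        # Keep the lowest-cost sequences.
        elites = sorted(samples, key=cost_fn)[:n_elite]
        # Refit the per-timestep mean and standard deviation to the elites.
        mean = [statistics.fmean(e[t] for e in elites) for t in range(horizon)]
        std = [statistics.pstdev([e[t] for e in elites]) + 1e-6
               for t in range(horizon)]
    return mean

def toy_cost(actions, start=0.0, goal=3.0):
    """Toy dynamics x_{t+1} = x_t + a_t; squared goal distance plus effort."""
    x = start
    for a in actions:
        x += a
    return (x - goal) ** 2 + 0.01 * sum(a * a for a in actions)

plan = cem_plan(toy_cost, horizon=5)
print(round(sum(plan), 2), round(toy_cost(plan), 4))
```

The returned mean is the planned action sequence; in a receding-horizon setup only its first action would be executed before replanning.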
python rollout.py --config-name=train_tcwm \
env=wall resume_folder=$TCWM_DATA_ROOT/checkpoints/wall_proj256 \
+traj_idx=0 +output_dir=rollouts/wall_idx0 +fps=8
Saves three GIFs per trajectory: {env}_orig.gif (original observation), {env}_recon.gif (autoencoder reconstruction), and {env}_pred.gif (open-loop TC-WM prediction).
TC-WM/
├── train.py · plan.py · rollout.py · utils.py · preprocessor.py · custom_resolvers.py
├── models/ # VWorldModel + encoder · projector · predictor · decoder
├── env/ # Gym wrappers for Maze, Wall, Push-T, Robomimic, DMC
├── datasets/ # Trajectory loaders
├── planning/ # CEM, LDP, MPC, evaluator
├── conf/ # Hydra configs (env / encoder / method / planner / ...)
├── metrics/ distributed_fn/ gpu_utils/
└── assets/
Pretrained checkpoints will be released at MinghaoFu/TC-WM-checkpoints on HuggingFace Hub. Each task ships with a single best-seed checkpoint; load with resume_folder=<path>.
@article{fu2026tcwm,
title = {Back to Parsimonious Latents: Learning Task-Centric World Models from Visual Foundations},
author = {Fu, Minghao and Feng, Fan and Hansen, Nicklas and Huang, Biwei},
journal = {arXiv preprint arXiv:2605.25620},
year = {2026}
}
Contact: m9fu [at] ucsd [dot] edu