CVPR 2026
Xinxuan Lu, Charless Fowlkes, Alexander C. Berg
University of California, Irvine
Current text-to-image models struggle to provide precise camera control using natural language alone. In this work, we present a framework for precise camera control with global scene understanding in text-to-image generation by learning parametric camera tokens. We fine-tune image generation models for viewpoint-conditioned text-to-image generation on a curated dataset that combines 3D-rendered images for geometric supervision and photorealistic augmentations for appearance and background diversity. Our method achieves state-of-the-art accuracy while preserving image quality and prompt fidelity. Unlike prior methods that overfit to object-specific appearance correlations, our viewpoint tokens learn factorized geometric representations that transfer to unseen object categories.
conda env create -f environment.yml
conda activate viewtoken

This pins Python 3.12 and installs pinned pip dependencies from requirements.txt. Key dependencies: PyTorch 2.10 (CUDA 12.8), xtuner, transformers 4.48, flash-attn 2.8.3, mmengine, deepspeed. The flash_attn wheel is pinned to cu12 / torch2.10 / cp312 / linux_x86_64, so reproduction requires a CUDA 12.8-capable host (e.g., NVIDIA Blackwell).
conda env create -f stable_diffusion/environment.yml
conda activate viewtoken_sd

This pins Python 3.12 and installs pinned pip dependencies from stable_diffusion/requirements.txt. Key dependencies: PyTorch 2.10 (CUDA 12.8), diffusers 0.37, accelerate, peft, transformers 5.3.
Download all checkpoints from Google Drive and extract to checkpoints/.
| Model | Backbone | File |
|---|---|---|
| Harmon Base | Harmon 1.5B pretrained | harmon/Harmon-1.5b-ReAlign-plus.pth |
| VAE | KL-16 | harmon/kl16.ckpt |
| Viewpoint Token (Harmon) | Harmon 1.5B | harmon/harmon_viewpoint.pth |
| Viewpoint Token (SD2) | Stable Diffusion 2.1 | sd2/training_state_30000.pth |
| Viewpoint Token (SD3) | Stable Diffusion 3.5 Medium | sd3/training_state_30000.pth |
| Viewpoint Token (SD3, AE/RPY split) † | Stable Diffusion 3.5 Medium | sd3_split_aerpy/training_state_50000.pth |
| Viewpoint Predictor | ResNet-50 | viewpoint_predictor/best_model.pth |
† sd3_split_aerpy/ is a post-paper checkpoint: a follow-up encoder variant with two parallel MLPs (azimuth+elevation, radius+pitch+yaw) trained for 50k steps. It scores substantially better on azimuth and yaw/pitch on our predictor benchmark. Use sd3/ for paper reproduction.
The SD2/SD3 args.pkl files ship alongside their training_state_*.pth and are required by the eval loader. SD2/SD3 base models (Manojb/stable-diffusion-2-1-base, stabilityai/stable-diffusion-3.5-medium) are pulled from HuggingFace on first use.
Note on --radius: The --radius argument takes the raw (unnormalized) camera distance in the units used for the training renders (Blender world units). This is roughly 3× the normalized radius reported in the paper; for example, the paper's normalized radius of ~1.67 corresponds to --radius 5.0 in the code. Separately from the paper's convention, the scripts apply radius_normalized = (radius - 5.5) / 2.5 before passing the value to the viewpoint encoder, so --radius 5.0 reaches the encoder as -0.2.
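The two radius conventions in the note are easy to mix up, so here is a minimal sketch of the conversions. The function and constant names are illustrative and do not come from the repository; the factor of 3 is only "roughly" right per the note and is treated as exact here for the sake of the example.

```python
# Illustrative helpers; names are hypothetical, not part of the released scripts.
PAPER_TO_RAW_SCALE = 3.0  # "roughly 3x" per the note; assumed exact in this sketch
RADIUS_OFFSET = 5.5       # constants from radius_normalized = (radius - 5.5) / 2.5
RADIUS_SCALE = 2.5


def paper_radius_to_raw(paper_radius: float) -> float:
    """Approximate --radius value for a normalized radius quoted in the paper."""
    return paper_radius * PAPER_TO_RAW_SCALE


def raw_radius_to_encoder_input(raw_radius: float) -> float:
    """Normalization the scripts are described as applying before the encoder."""
    return (raw_radius - RADIUS_OFFSET) / RADIUS_SCALE


if __name__ == "__main__":
    raw = paper_radius_to_raw(5.0 / 3.0)
    print(round(raw, 2), round(raw_radius_to_encoder_input(raw), 2))
```

Only the raw value is ever passed on the command line; the encoder-side normalization happens inside the scripts.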
All three scripts accept --azimuth, --elevation, --radius, --pitch, --yaw as one or more space-separated values; the cartesian product of the five lists determines how many images get generated (one PNG per viewpoint tuple, named {slug}_az{az}_el{el}_r{r}_p{p}_y{y}.png under --output_dir). With no viewpoint flags, the defaults sweep azimuth [45, 135, 225, 315] at elevation 30 and radius 5.0 — four images.
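As a rough illustration of how the image count follows from the flag lists, the sketch below enumerates the cartesian product and builds file names from the pattern above. The helper names are hypothetical, and the exact number formatting inside the file names is an assumption; the actual scripts may format values differently.

```python
from itertools import product


def viewpoint_sweep(azimuth=(45, 135, 225, 315), elevation=(30,), radius=(5.0,),
                    pitch=(0.0,), yaw=(0.0,)):
    """Return every (az, el, r, p, y) tuple; defaults mirror the documented ones."""
    return list(product(azimuth, elevation, radius, pitch, yaw))


def output_name(slug, az, el, r, p, y):
    """File name following the documented pattern (number formatting is assumed)."""
    return f"{slug}_az{az}_el{el}_r{r}_p{p}_y{y}.png"


if __name__ == "__main__":
    # Four azimuths times two elevations gives eight viewpoint tuples.
    for view in viewpoint_sweep(azimuth=(0, 90, 180, 270), elevation=(0, 30)):
        print(output_name("red_car", *view))
```

Because the lists multiply, adding a second value to any one flag doubles the number of generated images.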
python scripts/viewpoint2image.py \
--config configs/examples/viewpoint_training_1_5b.py \
--checkpoint checkpoints/harmon/harmon_viewpoint.pth \
--object_desc "a red sports car on a grass land" \
--azimuth 0 90 180 270 \
--output_dir output/harmon_red_car

python stable_diffusion/viewpoint2image_sd2.py \
--ckpt_base_path checkpoints/sd2/ \
--step 30000 \
--object_desc "a red sports car on a grass land" \
--azimuth 0 90 180 270 \
--output_dir output/sd2_red_car

# Paper checkpoint (shared-backbone encoder, step 30000)
python stable_diffusion/viewpoint2image_sd3.py \
--ckpt_base_path checkpoints/sd3/ \
--step 30000 \
--object_desc "a red sports car on a grass land" \
--azimuth 0 90 180 270 \
--output_dir output/sd3_red_car
# Post-paper AE/RPY split checkpoint (better azimuth, supports pitch/yaw too)
python stable_diffusion/viewpoint2image_sd3.py \
--ckpt_base_path checkpoints/sd3_split_aerpy/ \
--step 50000 \
--object_desc "a red sports car on a grass land" \
--azimuth 0 90 180 270 \
--pitch 0.0 \
--yaw 0.0 \
--output_dir output/sd3_split_red_car

--pitch and --yaw are optional (default 0.0) and are expressed on the same scale as the training metadata's angular_offset / 7.0. The encoder internally multiplies them by 5, so small values (e.g. ±0.2) produce noticeable camera tilt.
All three inference scripts also accept --metadata_json + --image_key to load a viewpoint from a metadata file (e.g., evaluation/metadata/eval_metadata.json) instead of passing --azimuth/--elevation/--radius directly; in that case a single image is produced.
The TexVerse viewpoint dataset (~101 GB compressed) is available from Google Drive (same folder as the checkpoints).
After downloading, extract with:
mkdir -p /your/data/root
tar --use-compress-program=unzstd -xvf viewpoint_dataset.tar.zst -C /your/data/root

This will create viewpoint_dataset/ under your chosen DATA_ROOT.
Set DATA_ROOT in the config files to point to your data directory.
DATA_ROOT/
└── viewpoint_dataset/
├── all_final_captions.json # Global captions file (used by all subfolders)
├── texverse_vehicles_randomview_best_nanobanana/
│ ├── metadata.json # Per-folder camera/pose metadata
│ ├── <uid>/000.png ... 009.png # Rendered frames per object
│ └── ...
├── texverse_vehicles_randomview_best_nanobanana_2/
├── texverse_animals_randomview_best_nanobanana/
├── texverse_animals_randomview_best_nanobanana_2/
├── texverse_people_randomview_best_nanobanana/
├── texverse_people_randomview_best_nanobanana_2/
├── texverse_furniture_randomview_best_nanobanana/
├── texverse_vehicles_randomview_all/
├── texverse_animals_randomview_all/
├── texverse_people_randomview_all/
└── texverse_furniture_randomview_all/
Each subdirectory contains:
- metadata.json: per-frame camera parameters and per-object descriptions (required)
- <uid>/<frame_index>.png: rendered images, one folder per object UID
The root-level all_final_captions.json provides global caption overrides shared across all subfolders.
Each dataset directory contains a metadata.json keyed by <uid>/<frame>.png:
{
"0290a9bf3100495b87029541363f2b24/000.png": {
"uid": "0290a9bf3100495b87029541363f2b24",
"image_path": "0290a9bf3100495b87029541363f2b24/000.png",
"matrix": [
[r00, r01, r02, tx],
[r10, r11, r12, ty],
[r20, r21, r22, tz]
],
"azimuth_degrees": -9.96,
"elevation_degrees": 8.05,
"q": [qx, qy, qz, qw],
"t": [tx, ty, tz],
"object_type": "container ship",
"background": "a dark, moonlit sea with a subtle glow",
"descriptions": ["with blue containers", "with deck machinery"],
"name": "container ship"
}
}

Required fields:
- uid: Object UID (folder name)
- image_path: Relative path to image (<uid>/<frame>.png)
- matrix: 3×4 camera extrinsic [R | t]
- name: Object class name (used for prompt generation)

Optional fields:
- azimuth_degrees, elevation_degrees: Pre-computed camera angles
- q, t: Quaternion and translation (alternative to matrix)
- object_type, background, descriptions: Used for caption template generation
- filter_result: Set to "reject" to exclude a sample from training
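When preparing a custom dataset, it can help to check a metadata.json against the fields listed above before training. The following is a minimal sketch, not the repository's loader: the function names are hypothetical, and it treats matrix as mandatory exactly as the required list states, so entries that carry only q and t would need a different check.

```python
import json

REQUIRED_FIELDS = ("uid", "image_path", "matrix", "name")


def filter_samples(metadata):
    """Return the entries that have all required fields and are not rejected."""
    kept = {}
    for key, entry in metadata.items():
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        if missing:
            raise KeyError(f"{key} is missing required fields: {missing}")
        matrix = entry["matrix"]
        if len(matrix) != 3 or any(len(row) != 4 for row in matrix):
            raise ValueError(f"{key}: matrix must be 3x4 ([R | t])")
        # filter_result is optional; only the value "reject" excludes a sample.
        if entry.get("filter_result") == "reject":
            continue
        kept[key] = entry
    return kept


def load_training_samples(metadata_path):
    """Read a per-folder metadata.json and apply filter_samples to it."""
    with open(metadata_path, "r", encoding="utf-8") as f:
        return filter_samples(json.load(f))


if __name__ == "__main__":
    identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 5]]
    demo = {
        "a/000.png": {"uid": "a", "image_path": "a/000.png",
                      "matrix": identity, "name": "container ship"},
        "a/001.png": {"uid": "a", "image_path": "a/001.png",
                      "matrix": identity, "name": "container ship",
                      "filter_result": "reject"},
    }
    print(sorted(filter_samples(demo)))
```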
- Edit DATA_ROOT in configs/examples/viewpoint_training_1_5b.py to point at the directory that contains viewpoint_dataset/ (see Dataset above).
- Place pretrained checkpoints at checkpoints/harmon/Harmon-1.5b-ReAlign-plus.pth and checkpoints/harmon/kl16.ckpt (see Pretrained Checkpoints).
- (Optional) wandb: the config enables WandbVisBackend with project=camera-control and name=viewtoken_1_5b_8gpu. Run wandb login once (or set WANDB_API_KEY), or export WANDB_MODE=disabled to opt out. Edit init_kwargs in the config's visualizer block to change project/entity/run name.
- Launch training:
# Multi-GPU — the default config is tuned for 8 GPUs
torchrun --nproc_per_node=8 scripts/train.py \
configs/examples/viewpoint_training_1_5b.py \
--launcher pytorch \
--work-dir work_dirs/viewtoken_1_5b
# Single GPU — also set multi_gpu=False in the config
python scripts/train.py configs/examples/viewpoint_training_1_5b.py \
--work-dir work_dirs/viewtoken_1_5b

Checkpoints land at <work-dir>/iter_<N>.pth every save_steps iterations (10k by default).
A few config knobs worth knowing:
- multi_gpu: set True for torchrun launches (wires up DDP with find_unused_parameters=True for the multi-task sampler), False for single-GPU python scripts/train.py.
- batch_size is per-GPU. Effective global batch = batch_size × num_gpus × accumulative_counts. Default 6 × 8 × 4 = 192. Scale batch_size inversely with GPU count to hold this constant.
- gradient_checkpointing = False by default. Enabling it requires also setting static_graph=True in strategy.model_wrapper, otherwise DDP raises a reentrant-backward error.
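The batch-size arithmetic above can be sketched as follows. The helper names are illustrative only; the config itself just exposes batch_size and accumulative_counts.

```python
def effective_batch(batch_size: int, num_gpus: int, accumulative_counts: int) -> int:
    """Global batch = per-GPU batch x number of GPUs x gradient accumulation steps."""
    return batch_size * num_gpus * accumulative_counts


def per_gpu_batch_for(target_global: int, num_gpus: int, accumulative_counts: int) -> int:
    """Per-GPU batch_size that keeps the global batch at target_global."""
    samples_per_unit = num_gpus * accumulative_counts
    if target_global % samples_per_unit != 0:
        raise ValueError("target_global is not divisible by num_gpus * accumulative_counts")
    return target_global // samples_per_unit


if __name__ == "__main__":
    print(effective_batch(6, 8, 4))      # default config
    print(per_gpu_batch_for(192, 4, 4))  # same global batch on 4 GPUs
```

If the larger per-GPU batch does not fit in memory, raising accumulative_counts instead of batch_size keeps the same product.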
cd stable_diffusion
# Train and evaluate (all three stages, default)
DATA_ROOT=/path/to/data ./train_and_eval_sd2.sh
# Custom run name
DATA_ROOT=/path/to/data ./train_and_eval_sd2.sh my_experiment
# Training only
RUN_EVAL=0 RUN_PREDICTOR=0 DATA_ROOT=/path/to/data ./train_and_eval_sd2.sh
# Eval-only on an already-trained checkpoint (skip Phase 1)
RUN_TRAIN=0 STEP=30000 ./train_and_eval_sd2.sh
# Generation only, no predictor scoring
RUN_PREDICTOR=0 DATA_ROOT=/path/to/data ./train_and_eval_sd2.sh

SD2.1 shares the same three-stage RUN_TRAIN / RUN_EVAL / RUN_PREDICTOR toggles documented below for SD3.5.
cd stable_diffusion
# Train and evaluate (single GPU) — runs all three stages
./train_and_eval_sd3.sh
# Multi-GPU
NUM_GPUS=4 ./train_and_eval_sd3.sh
# Custom config
CONFIG=configs/viewpoint_sd3.yaml ./train_and_eval_sd3.sh
# AE/RPY split encoder variant (best azimuth)
CONFIG=configs/viewpoint_sd3_50k_bs6_split_aerpy.yaml ./train_and_eval_sd3.sh

The launcher has three independent stage toggles (all default to ON):
| Flag | Stage |
|---|---|
| RUN_TRAIN=1 | Phase 1: training |
| RUN_EVAL=1 | Phase 2a: 5550-image generation |
| RUN_PREDICTOR=1 | Phase 2b: ResNet predictor scoring (requires PREDICTOR_CONFIG + PREDICTOR_CKPT) |
# Training only
RUN_EVAL=0 RUN_PREDICTOR=0 ./train_and_eval_sd3.sh
# Eval-only on an already-trained checkpoint (skip Phase 1)
RUN_TRAIN=0 STEP=50000 ./train_and_eval_sd3.sh
# Generation only, no predictor scoring
RUN_PREDICTOR=0 ./train_and_eval_sd3.sh

Train a standalone ResNet-based viewpoint predictor:
python scripts/train_viewpoint_predictor.py \
--config configs/examples/viewpoint_predictor_resnet.py

Evaluation metadata is provided in evaluation/metadata/:
| File | Description |
|---|---|
| eval_combinations.json | Standard viewpoint evaluation combinations |
| eval_metadata.json | Per-image metadata with camera parameters |
| eval_combinations_challenging.json | Challenging viewpoints (extreme angles) |
| eval_metadata_challenging.json | Metadata for challenging viewpoints |
Two-step pipeline: generate images conditioned on (prompt, viewpoint) combinations, then score them with a pretrained viewpoint predictor.
# Step 1: generate images (single GPU, ~3–4 h on a Blackwell PRO 6000 for 5550 combinations)
python evaluation/generate_viewpoint_eval_harmon.py \
--config configs/examples/viewpoint_training_1_5b.py \
--checkpoint work_dirs/viewtoken_1_5b/iter_50000.pth \
--vae_checkpoint checkpoints/harmon/kl16.ckpt \
--eval_json evaluation/metadata/eval_combinations.json \
--metadata_json evaluation/metadata/eval_metadata.json \
--output_dir output/harmon_eval \
--batch_size 16
# Step 2: score generated images against ground-truth viewpoints (few minutes)
python evaluation/evaluate_viewpoint_predictor.py \
--config configs/examples/viewpoint_predictor_resnet.py \
--checkpoint checkpoints/viewpoint_predictor/best_model.pth \
--dataset_path output/harmon_eval \
--output_dir output/harmon_predictor

For SD3.5 (swap sd3 → sd2 and the script to generate_viewpoint_eval_sd2.py for SD2.1):
python evaluation/generate_viewpoint_eval_sd3.py \
--model_name viewpoint_sd3_lora \
--step 30000 \
--ckpt_base_path checkpoints/sd3/ \
--base_folder evaluation/results/sd3 \
--eval_combinations_path evaluation/metadata/eval_combinations.json \
--metadata_path evaluation/metadata/eval_metadata.json
python evaluation/evaluate_viewpoint_predictor.py \
--config configs/examples/viewpoint_predictor_resnet.py \
--checkpoint checkpoints/viewpoint_predictor/best_model.pth \
--dataset_path evaluation/results/sd3_results \
--output_dir evaluation/predictor_results

The generation script appends _results to --base_folder automatically. For the challenging benchmark, swap both metadata paths to the _challenging.json variants.
Inside --output_dir the predictor writes:
| File | Contents |
|---|---|
| metrics.json | Aggregate azimuth / elevation / radius / rotation / translation error statistics |
| grouped_metrics.json | Same metrics bucketed by viewpoint bin |
| per_object_metrics.csv | Per-class table (also printed to stdout) |
| per_sample_errors.csv | Per-image errors for all samples |
| worst_10_percent.json | Hardest samples |
| plots/ | Error visualizations |
The predictor checkpoint (checkpoints/viewpoint_predictor/best_model.pth) ships in the Google Drive release alongside the other weights. To train your own instead, use scripts/train_viewpoint_predictor.py with configs/examples/viewpoint_predictor_resnet.py (see Training → Viewpoint Predictor).
python evaluation/calculate_clip_similarity_best.py \
--results_folder evaluation/results/sd3_results

viewtoken_control/
├── configs/examples/ # Harmon training + predictor configs
├── scripts/ # Harmon training and inference entry points
├── src/ # Harmon source: models, datasets, hooks, runners
├── stable_diffusion/ # SD2 and SD3 training, inference, configs
│ ├── train_and_eval_sd{2,3}.sh # Three-stage train + eval launchers
│ ├── viewpoint2image_sd{2,3}.py # Single-prompt sweep inference
│ └── configs/ # SD3 baseline + split_aerpy YAMLs
├── evaluation/ # Eval scripts + per-image metadata
├── checkpoints/ # Downloaded pretrained weights (gitignored)
├── environment.yml # Harmon conda env
├── stable_diffusion/environment.yml # SD2/SD3 conda env
├── requirements.txt
└── LICENSE
@inproceedings{lu2026viewtoken,
title={Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens},
author={Lu, Xinxuan and Fowlkes, Charless and Berg, Alexander C.},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}

This project is licensed under CC BY 4.0. See LICENSE for details.
Note: The Harmon backbone code (src/models/harmon.py, src/models/harmon_dev.py, src/models/mar/) is derived from Harmon, which is released under the S-Lab License 1.0 (non-commercial use). Use of these files is subject to the S-Lab License terms.
This project builds upon Harmon.