Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens

Project Page | Paper | Checkpoints & Dataset

CVPR 2026

Xinxuan Lu, Charless Fowlkes, Alexander C. Berg

University of California, Irvine

Current text-to-image models struggle to provide precise camera control using natural language alone. In this work, we present a framework for precise camera control with global scene understanding in text-to-image generation by learning parametric camera tokens. We fine-tune image generation models for viewpoint-conditioned text-to-image generation on a curated dataset that combines 3D-rendered images for geometric supervision and photorealistic augmentations for appearance and background diversity. Our method achieves state-of-the-art accuracy while preserving image quality and prompt fidelity. Unlike prior methods that overfit to object-specific appearance correlations, our viewpoint tokens learn factorized geometric representations that transfer to unseen object categories.

Installation

Harmon Backbone

conda env create -f environment.yml
conda activate viewtoken

This pins Python 3.12 and installs pinned pip dependencies from requirements.txt. Key dependencies: PyTorch 2.10 (CUDA 12.8), xtuner, transformers 4.48, flash-attn 2.8.3, mmengine, deepspeed. The flash_attn wheel is pinned to cu12 / torch2.10 / cp312 / linux_x86_64, so reproduction requires a CUDA 12.8-capable host (e.g., NVIDIA Blackwell).
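Because the flash_attn wheel is tied to one exact Python/PyTorch/CUDA combination, it can help to confirm the pinned versions after activating the environment. The following is a minimal sketch, not part of the repository; it assumes the pip distribution names are torch, transformers, and flash-attn, and it only compares version prefixes against the pins quoted above.

```python
import sys
from importlib import metadata

# Pins quoted from the text above; the distribution names are assumptions.
EXPECTED = {"torch": "2.10", "transformers": "4.48", "flash-attn": "2.8.3"}


def matches_pin(installed: str, pin: str) -> bool:
    """True if `installed` equals `pin` or extends it at a version boundary."""
    return installed == pin or installed.startswith((pin + ".", pin + "+"))


def check_environment(expected=EXPECTED):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, pin in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        ok = installed is not None and matches_pin(installed, pin)
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    print("python", sys.version_info[:2], "(the environment pins 3.12)")
    for name, (installed, ok) in check_environment().items():
        print(f"{name:14s} installed={installed} ok={ok}")
```

A package reported as `installed=None` is simply absent from the active environment; the check does not import it, so it is safe to run on a machine without a GPU.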

Stable Diffusion Backbone (SD2 / SD3)

conda env create -f stable_diffusion/environment.yml
conda activate viewtoken_sd

This pins Python 3.12 and installs pinned pip dependencies from stable_diffusion/requirements.txt. Key dependencies: PyTorch 2.10 (CUDA 12.8), diffusers 0.37, accelerate, peft, transformers 5.3.

Pretrained Checkpoints

Download all checkpoints from Google Drive and extract to checkpoints/.

| Model | Backbone | File |
| --- | --- | --- |
| Harmon Base | Harmon 1.5B pretrained | `harmon/Harmon-1.5b-ReAlign-plus.pth` |
| VAE | KL-16 | `harmon/kl16.ckpt` |
| Viewpoint Token (Harmon) | Harmon 1.5B | `harmon/harmon_viewpoint.pth` |
| Viewpoint Token (SD2) | Stable Diffusion 2.1 | `sd2/training_state_30000.pth` |
| Viewpoint Token (SD3) | Stable Diffusion 3.5 Medium | `sd3/training_state_30000.pth` |
| Viewpoint Token (SD3, AE/RPY split) † | Stable Diffusion 3.5 Medium | `sd3_split_aerpy/training_state_50000.pth` |
| Viewpoint Predictor | ResNet-50 | `viewpoint_predictor/best_model.pth` |

† sd3_split_aerpy/ is a post-paper checkpoint: a follow-up encoder variant with two parallel MLPs (azimuth+elevation, radius+pitch+yaw) trained for 50k steps. It scores substantially better on azimuth and yaw/pitch on our predictor benchmark. Use sd3/ for paper reproduction.

The SD2/SD3 args.pkl files ship alongside their training_state_*.pth and are required by the eval loader. The SD2/SD3 base models (Manojb/stable-diffusion-2-1-base, stabilityai/stable-diffusion-3.5-medium) are pulled from Hugging Face on first use.

Inference

Note on --radius: The --radius argument takes the raw (unnormalized) camera distance in the units of the training renders (Blender world units). This is roughly 3× the normalized radius reported in the paper; for example, the paper's normalized radius of ~1.67 corresponds to --radius 5.0 in the code. This is separate from the encoder's own input scaling: the scripts internally apply radius_normalized = (radius - 5.5) / 2.5 before feeding the value into the viewpoint encoder, so --radius 5.0 reaches the encoder as -0.2.
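The two radius conventions are easy to mix up, so here is a small sketch of the arithmetic. The function names are hypothetical and not taken from the repository; the factor of 3 is only the approximate relation stated above, while the (radius - 5.5) / 2.5 formula is quoted from the note.

```python
def paper_to_cli_radius(paper_radius: float) -> float:
    """Approximate conversion: the raw radius is roughly 3x the paper's value."""
    return 3.0 * paper_radius


def encoder_radius(cli_radius: float) -> float:
    """Scaling the scripts apply before the encoder: (radius - 5.5) / 2.5."""
    return (cli_radius - 5.5) / 2.5


if __name__ == "__main__":
    for r in (3.0, 5.0, 5.5, 8.0):
        print(f"--radius {r:.1f} -> encoder input {encoder_radius(r):+.2f}")
```

Under this formula, raw radii of 3.0 and 8.0 map to -1 and +1 respectively, with 5.5 at the center.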

All three scripts accept --azimuth, --elevation, --radius, --pitch, and --yaw, each as one or more space-separated values. One image is generated per element of the Cartesian product of the five lists, saved as {slug}_az{az}_el{el}_r{r}_p{p}_y{y}.png under --output_dir. With no viewpoint flags, the defaults sweep azimuth [45, 135, 225, 315] at elevation 30 and radius 5.0, producing four images.
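To estimate how many images a sweep will produce before launching it, the Cartesian product can be enumerated directly. This is a sketch, not the scripts' own code: the helper names are hypothetical, and the exact number formatting inside the filename is an assumption (only the template itself comes from the text).

```python
from itertools import product

# Defaults quoted from the text; pitch and yaw default to 0.0.
DEFAULTS = {
    "azimuth": [45, 135, 225, 315],
    "elevation": [30],
    "radius": [5.0],
    "pitch": [0.0],
    "yaw": [0.0],
}


def viewpoint_grid(**overrides):
    """Return every (az, el, r, p, y) tuple for the given value lists."""
    lists = {**DEFAULTS, **overrides}
    return list(product(lists["azimuth"], lists["elevation"], lists["radius"],
                        lists["pitch"], lists["yaw"]))


def output_name(slug, az, el, r, p, y):
    """Fill the documented template; the number formatting is an assumption."""
    return f"{slug}_az{az}_el{el}_r{r}_p{p}_y{y}.png"


if __name__ == "__main__":
    grid = viewpoint_grid(azimuth=[0, 90, 180, 270], elevation=[0, 30])
    print(len(grid), "images")  # 4 azimuths x 2 elevations x 1 x 1 x 1 = 8
    print(output_name("red_car", *grid[0]))
```

The image count grows multiplicatively, so adding a second value to each of the five flags already yields 32 images.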

Harmon

python scripts/viewpoint2image.py \
    --config configs/examples/viewpoint_training_1_5b.py \
    --checkpoint checkpoints/harmon/harmon_viewpoint.pth \
    --object_desc "a red sports car on a grass land" \
    --azimuth 0 90 180 270 \
    --output_dir output/harmon_red_car

Stable Diffusion 2.1

python stable_diffusion/viewpoint2image_sd2.py \
    --ckpt_base_path checkpoints/sd2/ \
    --step 30000 \
    --object_desc "a red sports car on a grass land" \
    --azimuth 0 90 180 270 \
    --output_dir output/sd2_red_car

Stable Diffusion 3.5

# Paper checkpoint (shared-backbone encoder, step 30000)
python stable_diffusion/viewpoint2image_sd3.py \
    --ckpt_base_path checkpoints/sd3/ \
    --step 30000 \
    --object_desc "a red sports car on a grass land" \
    --azimuth 0 90 180 270 \
    --output_dir output/sd3_red_car

# Post-paper AE/RPY split checkpoint (better azimuth, supports pitch/yaw too)
python stable_diffusion/viewpoint2image_sd3.py \
    --ckpt_base_path checkpoints/sd3_split_aerpy/ \
    --step 50000 \
    --object_desc "a red sports car on a grass land" \
    --azimuth 0 90 180 270 \
    --pitch 0.0 \
    --yaw 0.0 \
    --output_dir output/sd3_split_red_car

--pitch and --yaw are optional (default 0.0) and use the same scale as the training metadata's angular_offset / 7.0. The encoder internally multiplies them by 5, so even small values (e.g. ±0.2) produce noticeable camera tilt.
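The two scale factors for pitch and yaw can be written out as follows. This is only a sketch of the arithmetic described above, with hypothetical function names; the units of angular_offset are not stated here, so none are assumed.

```python
def offset_to_flag(angular_offset: float) -> float:
    """Convert a metadata angular_offset to the --pitch/--yaw scale (divide by 7.0)."""
    return angular_offset / 7.0


def encoder_tilt_input(flag_value: float) -> float:
    """Scaling the encoder is described as applying internally (multiply by 5)."""
    return 5.0 * flag_value


if __name__ == "__main__":
    for flag in (-0.2, 0.0, 0.2):
        print(f"--pitch {flag:+.1f} -> encoder input {encoder_tilt_input(flag):+.1f}")
```

A flag value of ±0.2 therefore reaches the encoder as ±1.0, which is consistent with small flag values having a visible effect.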

All three inference scripts also accept --metadata_json + --image_key to load a viewpoint from a metadata file (e.g., evaluation/metadata/eval_metadata.json) instead of passing --azimuth/--elevation/--radius directly; in that case a single image is produced.

Dataset

Download

The TexVerse viewpoint dataset (~101 GB compressed) is available from Google Drive (same folder as the checkpoints).

After downloading, extract with:

mkdir -p /your/data/root
tar --use-compress-program=unzstd -xvf viewpoint_dataset.tar.zst -C /your/data/root

This will create viewpoint_dataset/ under your chosen DATA_ROOT.

Expected Structure

Set DATA_ROOT in the config files to point to your data directory.

DATA_ROOT/
└── viewpoint_dataset/
    ├── all_final_captions.json          # Global captions file (used by all subfolders)
    ├── texverse_vehicles_randomview_best_nanobanana/
    │   ├── metadata.json                 # Per-folder camera/pose metadata
    │   ├── <uid>/000.png ... 009.png     # Rendered frames per object
    │   └── ...
    ├── texverse_vehicles_randomview_best_nanobanana_2/
    ├── texverse_animals_randomview_best_nanobanana/
    ├── texverse_animals_randomview_best_nanobanana_2/
    ├── texverse_people_randomview_best_nanobanana/
    ├── texverse_people_randomview_best_nanobanana_2/
    ├── texverse_furniture_randomview_best_nanobanana/
    ├── texverse_vehicles_randomview_all/
    ├── texverse_animals_randomview_all/
    ├── texverse_people_randomview_all/
    └── texverse_furniture_randomview_all/

Each subdirectory contains:

metadata.json — per-frame camera parameters and per-object descriptions (required)
<uid>/<frame_index>.png — rendered images, one folder per object UID

The root-level all_final_captions.json provides global caption overrides shared across all subfolders.
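After extraction, a quick structural check can catch a wrong DATA_ROOT or an incomplete download before training starts. The sketch below is not part of the repository; it assumes only the layout described above, plus the convention (described under Custom Dataset Format) that metadata.json keys are `<uid>/<frame>.png` paths relative to their subfolder.

```python
import json
import tempfile
from pathlib import Path


def check_dataset_root(data_root):
    """Return a list of human-readable problems found under data_root."""
    root = Path(data_root) / "viewpoint_dataset"
    if not root.is_dir():
        return [f"missing directory: {root}"]
    problems = []
    if not (root / "all_final_captions.json").is_file():
        problems.append("missing all_final_captions.json")
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        meta_path = sub / "metadata.json"
        if not meta_path.is_file():
            problems.append(f"{sub.name}: missing metadata.json")
            continue
        with open(meta_path) as f:
            metadata = json.load(f)
        # Keys are "<uid>/<frame>.png", relative to the subfolder.
        missing = [key for key in metadata if not (sub / key).is_file()]
        if missing:
            problems.append(f"{sub.name}: {len(missing)} image(s) listed but not on disk")
    return problems


if __name__ == "__main__":
    # Build a tiny throwaway tree to show the checker in action.
    with tempfile.TemporaryDirectory() as tmp:
        sub = Path(tmp) / "viewpoint_dataset" / "demo_subset"
        (sub / "uid0").mkdir(parents=True)
        (sub / "metadata.json").write_text(json.dumps({"uid0/000.png": {}}))
        for problem in check_dataset_root(tmp):
            print(problem)
```

On the full dataset the per-image existence check touches every listed file, so it may take a while on network storage.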

Custom Dataset Format

Each dataset directory contains a metadata.json keyed by <uid>/<frame>.png:

{
  "0290a9bf3100495b87029541363f2b24/000.png": {
    "uid": "0290a9bf3100495b87029541363f2b24",
    "image_path": "0290a9bf3100495b87029541363f2b24/000.png",
    "matrix": [
      [r00, r01, r02, tx],
      [r10, r11, r12, ty],
      [r20, r21, r22, tz]
    ],
    "azimuth_degrees": -9.96,
    "elevation_degrees": 8.05,
    "q": [qx, qy, qz, qw],
    "t": [tx, ty, tz],
    "object_type": "container ship",
    "background": "a dark, moonlit sea with a subtle glow",
    "descriptions": ["with blue containers", "with deck machinery"],
    "name": "container ship"
  }
}

Required fields:

uid — Object UID (folder name)
image_path — Relative path to image (<uid>/<frame>.png)
matrix — 3×4 camera extrinsic [R | t]
name — Object class name (used for prompt generation)

Optional fields:

azimuth_degrees, elevation_degrees — Pre-computed camera angles
q, t — Quaternion and translation (alternative to matrix)
object_type, background, descriptions — Used for caption template generation
filter_result — Set to "reject" to exclude a sample from training
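When preparing a custom dataset, the field rules above can be checked mechanically before training. The following is a hedged sketch rather than the loader's actual validation: the function names are hypothetical, and the check that image_path equals the entry key is an assumption based on the example entry.

```python
REQUIRED_FIELDS = ("uid", "image_path", "matrix", "name")


def validate_entry(key, entry):
    """Return a list of problems for one metadata.json entry (empty if valid)."""
    problems = [f"missing required field: {f}" for f in REQUIRED_FIELDS if f not in entry]
    matrix = entry.get("matrix")
    if matrix is not None:
        if len(matrix) != 3 or any(len(row) != 4 for row in matrix):
            problems.append("matrix must be 3x4 ([R | t])")
    # Assumption: the key and image_path coincide, as in the example entry.
    if "image_path" in entry and entry["image_path"] != key:
        problems.append("image_path does not match the entry key")
    if "uid" in entry and not key.startswith(entry["uid"] + "/"):
        problems.append("key is not under the uid folder")
    return problems


def training_samples(metadata):
    """Yield valid (key, entry) pairs not marked filter_result == "reject"."""
    for key, entry in metadata.items():
        if entry.get("filter_result") == "reject":
            continue
        if not validate_entry(key, entry):
            yield key, entry


if __name__ == "__main__":
    # Illustrative values only: identity rotation, camera offset along z.
    example = {
        "abc/000.png": {
            "uid": "abc",
            "image_path": "abc/000.png",
            "matrix": [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 5]],
            "name": "container ship",
        },
        "abc/001.png": {"uid": "abc", "image_path": "abc/001.png", "name": "container ship"},
    }
    for key, entry in example.items():
        print(key, validate_entry(key, entry) or "ok")
```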

Training

Harmon Backbone

Edit DATA_ROOT in configs/examples/viewpoint_training_1_5b.py to point at the directory that contains viewpoint_dataset/ (see Dataset above).
Place pretrained checkpoints under checkpoints/harmon/Harmon-1.5b-ReAlign-plus.pth and checkpoints/harmon/kl16.ckpt (see Pretrained Checkpoints).
(Optional) wandb: the config enables WandbVisBackend with project=camera-control and name=viewtoken_1_5b_8gpu. Run wandb login once (or set WANDB_API_KEY), or export WANDB_MODE=disabled to opt out. Edit init_kwargs in the config's visualizer block to change project/entity/run name.
Launch training:

# Multi-GPU — the default config is tuned for 8 GPUs
torchrun --nproc_per_node=8 scripts/train.py \
    configs/examples/viewpoint_training_1_5b.py \
    --launcher pytorch \
    --work-dir work_dirs/viewtoken_1_5b

# Single GPU — also set multi_gpu=False in the config
python scripts/train.py configs/examples/viewpoint_training_1_5b.py \
    --work-dir work_dirs/viewtoken_1_5b

Checkpoints land at <work-dir>/iter_<N>.pth every save_steps iterations (10k by default).
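Since checkpoints are named by iteration, picking the most recent one for evaluation needs a numeric rather than alphabetical comparison. A minimal sketch, assuming only the `iter_<N>.pth` naming stated above (the helper name is hypothetical):

```python
import re
import tempfile
from pathlib import Path

ITER_PATTERN = re.compile(r"^iter_(\d+)\.pth$")


def latest_checkpoint(work_dir):
    """Return the iter_<N>.pth path with the largest N, or None if there is none."""
    best_step, best_path = -1, None
    for path in Path(work_dir).glob("iter_*.pth"):
        match = ITER_PATTERN.match(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for step in (10000, 20000, 100000):
            (Path(tmp) / f"iter_{step}.pth").touch()
        # Numeric ordering picks iter_100000.pth, not iter_20000.pth.
        print(latest_checkpoint(tmp).name)
```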

A few config knobs worth knowing:

multi_gpu — set True for torchrun launches (wires up DDP with find_unused_parameters=True for the multi-task sampler), False for single-GPU python scripts/train.py.
batch_size is per-GPU. Effective global batch = batch_size × num_gpus × accumulative_counts. Default 6 × 8 × 4 = 192. Scale batch_size inversely with GPU count to hold this constant.
gradient_checkpointing = False by default. Enabling it also requires setting static_graph=True in strategy.model_wrapper; otherwise DDP raises a reentrant-backward error.
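The batch-size rule above is simple to get wrong when changing the GPU count, so here is the arithmetic as a sketch. The helper names are hypothetical; the formula and the default of 6 × 8 × 4 = 192 come from the text.

```python
def effective_batch(batch_size, num_gpus, accumulative_counts):
    """Global batch = per-GPU batch x number of GPUs x gradient accumulation steps."""
    return batch_size * num_gpus * accumulative_counts


def per_gpu_batch_for(target, num_gpus, accumulative_counts):
    """Per-GPU batch_size that reproduces `target`, or None if it does not divide evenly."""
    denom = num_gpus * accumulative_counts
    return target // denom if target % denom == 0 else None


if __name__ == "__main__":
    target = effective_batch(6, 8, 4)  # the documented default, 192
    for gpus in (8, 4, 2, 1):
        print(f"{gpus} GPU(s), accumulative_counts=4 -> batch_size={per_gpu_batch_for(target, gpus, 4)}")
```

If the larger per-GPU batch does not fit in memory, raising accumulative_counts instead keeps the same product; for example, a single GPU with batch_size 6 would need accumulative_counts = 32.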

Stable Diffusion 2.1

cd stable_diffusion

# Train and evaluate (all three stages, default)
DATA_ROOT=/path/to/data ./train_and_eval_sd2.sh

# Custom run name
DATA_ROOT=/path/to/data ./train_and_eval_sd2.sh my_experiment

# Training only
RUN_EVAL=0 RUN_PREDICTOR=0 DATA_ROOT=/path/to/data ./train_and_eval_sd2.sh

# Eval-only on an already-trained checkpoint (skip Phase 1)
RUN_TRAIN=0 STEP=30000 ./train_and_eval_sd2.sh

# Train and generate, but skip predictor scoring
RUN_PREDICTOR=0 DATA_ROOT=/path/to/data ./train_and_eval_sd2.sh

SD2.1 shares the same three-stage RUN_TRAIN / RUN_EVAL / RUN_PREDICTOR toggles documented below for SD3.5.

Stable Diffusion 3.5

cd stable_diffusion

# Train and evaluate (single GPU) — runs all three stages
./train_and_eval_sd3.sh

# Multi-GPU
NUM_GPUS=4 ./train_and_eval_sd3.sh

# Custom config
CONFIG=configs/viewpoint_sd3.yaml ./train_and_eval_sd3.sh

# AE/RPY split encoder variant (best azimuth)
CONFIG=configs/viewpoint_sd3_50k_bs6_split_aerpy.yaml ./train_and_eval_sd3.sh

The launcher has three independent stage toggles (all default to ON):

| Flag | Stage |
| --- | --- |
| `RUN_TRAIN=1` | Phase 1: training |
| `RUN_EVAL=1` | Phase 2a: 5550-image generation |
| `RUN_PREDICTOR=1` | Phase 2b: ResNet predictor scoring (requires `PREDICTOR_CONFIG` + `PREDICTOR_CKPT`) |

# Training only
RUN_EVAL=0 RUN_PREDICTOR=0 ./train_and_eval_sd3.sh

# Eval-only on an already-trained checkpoint (skip Phase 1)
RUN_TRAIN=0 STEP=50000 ./train_and_eval_sd3.sh

# Train and generate, but skip predictor scoring
RUN_PREDICTOR=0 ./train_and_eval_sd3.sh
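For scripted sweeps, the stage toggles can also be set from Python instead of the shell. This is a sketch under the assumption that the launcher reads the toggles as the environment variables shown above with "1"/"0" values; the helper names are hypothetical, and nothing is executed unless launch() is called.

```python
import os
import subprocess


def stage_env(run_train=True, run_eval=True, run_predictor=True, step=None, **extra):
    """Build the environment for a launcher call; "1"/"0" mirror the documented toggles."""
    env = dict(os.environ)
    env.update(
        RUN_TRAIN=str(int(run_train)),
        RUN_EVAL=str(int(run_eval)),
        RUN_PREDICTOR=str(int(run_predictor)),
    )
    if step is not None:
        env["STEP"] = str(step)
    # Any other documented variable (e.g. NUM_GPUS, CONFIG) can be passed through.
    env.update({key: str(value) for key, value in extra.items()})
    return env


def launch(script="./train_and_eval_sd3.sh", cwd="stable_diffusion", **kwargs):
    """Run the launcher with the chosen toggles; raises if it exits non-zero."""
    return subprocess.run([script], cwd=cwd, env=stage_env(**kwargs), check=True)


if __name__ == "__main__":
    env = stage_env(run_train=False, step=50000, NUM_GPUS=4)
    print({k: env[k] for k in ("RUN_TRAIN", "RUN_EVAL", "RUN_PREDICTOR", "STEP", "NUM_GPUS")})
```

Calling `launch(run_train=False, step=50000)` from the repository root would correspond to the eval-only command shown above.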

Viewpoint Predictor

Train a standalone ResNet-based viewpoint predictor:

python scripts/train_viewpoint_predictor.py \
    --config configs/examples/viewpoint_predictor_resnet.py

Evaluation

Evaluation metadata is provided in evaluation/metadata/:

| File | Description |
| --- | --- |
| `eval_combinations.json` | Standard viewpoint evaluation combinations |
| `eval_metadata.json` | Per-image metadata with camera parameters |
| `eval_combinations_challenging.json` | Challenging viewpoints (extreme angles) |
| `eval_metadata_challenging.json` | Metadata for challenging viewpoints |

Viewpoint Evaluation

Two-step pipeline: generate images conditioned on (prompt, viewpoint) combinations, then score them with a pretrained viewpoint predictor.

Harmon

# Step 1: generate images (single GPU, ~3–4 h on a Blackwell PRO 6000 for 5550 combinations)
python evaluation/generate_viewpoint_eval_harmon.py \
    --config configs/examples/viewpoint_training_1_5b.py \
    --checkpoint work_dirs/viewtoken_1_5b/iter_50000.pth \
    --vae_checkpoint checkpoints/harmon/kl16.ckpt \
    --eval_json evaluation/metadata/eval_combinations.json \
    --metadata_json evaluation/metadata/eval_metadata.json \
    --output_dir output/harmon_eval \
    --batch_size 16

# Step 2: score generated images against ground-truth viewpoints (few minutes)
python evaluation/evaluate_viewpoint_predictor.py \
    --config configs/examples/viewpoint_predictor_resnet.py \
    --checkpoint checkpoints/viewpoint_predictor/best_model.pth \
    --dataset_path output/harmon_eval \
    --output_dir output/harmon_predictor

Stable Diffusion

For SD3.5 (for SD2.1, swap sd3 for sd2 and use generate_viewpoint_eval_sd2.py):

python evaluation/generate_viewpoint_eval_sd3.py \
    --model_name viewpoint_sd3_lora \
    --step 30000 \
    --ckpt_base_path checkpoints/sd3/ \
    --base_folder evaluation/results/sd3 \
    --eval_combinations_path evaluation/metadata/eval_combinations.json \
    --metadata_path evaluation/metadata/eval_metadata.json

python evaluation/evaluate_viewpoint_predictor.py \
    --config configs/examples/viewpoint_predictor_resnet.py \
    --checkpoint checkpoints/viewpoint_predictor/best_model.pth \
    --dataset_path evaluation/results/sd3_results \
    --output_dir evaluation/predictor_results

The generation script appends _results to --base_folder automatically. For the challenging benchmark, swap both metadata paths to the _challenging.json variants.
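The two path conventions in this step (the `_results` suffix and the `_challenging.json` variants) can be derived programmatically when scripting both benchmarks. A minimal sketch with hypothetical helper names; whether the generation script strips a trailing slash before appending `_results` is an assumption.

```python
from pathlib import PurePosixPath


def results_dir(base_folder):
    """Folder the generation script writes to: --base_folder with "_results" appended."""
    return str(base_folder).rstrip("/") + "_results"


def challenging_variant(metadata_path):
    """Map e.g. eval_metadata.json to eval_metadata_challenging.json."""
    path = PurePosixPath(metadata_path)
    return str(path.with_name(path.stem + "_challenging" + path.suffix))


if __name__ == "__main__":
    print(results_dir("evaluation/results/sd3"))
    for name in ("eval_combinations.json", "eval_metadata.json"):
        print(challenging_variant(f"evaluation/metadata/{name}"))
```

The value returned by `results_dir` is what should be passed as --dataset_path to the predictor script.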

Predictor outputs

Inside --output_dir the predictor writes:

| File | Contents |
| --- | --- |
| `metrics.json` | Aggregate azimuth / elevation / radius / rotation / translation error statistics |
| `grouped_metrics.json` | Same metrics bucketed by viewpoint bin |
| `per_object_metrics.csv` | Per-class table (also printed to stdout) |
| `per_sample_errors.csv` | Per-image errors for all samples |
| `worst_10_percent.json` | Hardest samples |
| `plots/` | Error visualizations |
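To confirm that a scoring run completed, one can check for the files listed above and load the aggregate metrics. The sketch below assumes only the file names from the table; it makes no assumption about the schema inside metrics.json and simply returns whatever JSON the file contains. The helper names are hypothetical.

```python
import json
import tempfile
from pathlib import Path

EXPECTED_OUTPUTS = [
    "metrics.json",
    "grouped_metrics.json",
    "per_object_metrics.csv",
    "per_sample_errors.csv",
    "worst_10_percent.json",
    "plots",
]


def missing_outputs(output_dir):
    """Return the expected files or folders that are absent from output_dir."""
    out = Path(output_dir)
    return [name for name in EXPECTED_OUTPUTS if not (out / name).exists()]


def load_metrics(output_dir):
    """Load metrics.json as-is; no particular schema is assumed."""
    with open(Path(output_dir) / "metrics.json") as f:
        return json.load(f)


if __name__ == "__main__":
    # Demonstrate on a throwaway folder containing only a placeholder metrics.json.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "metrics.json").write_text(json.dumps({"placeholder": 0}))
        print("missing:", missing_outputs(tmp))
        print("top-level keys:", list(load_metrics(tmp)))
```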

The predictor checkpoint (checkpoints/viewpoint_predictor/best_model.pth) ships in the Google Drive release alongside the other weights. To train your own instead, use scripts/train_viewpoint_predictor.py with configs/examples/viewpoint_predictor_resnet.py (see Training → Viewpoint Predictor).

CLIP Similarity

python evaluation/calculate_clip_similarity_best.py \
    --results_folder evaluation/results/sd3_results

Project Structure

viewtoken_control/
├── configs/examples/              # Harmon training + predictor configs
├── scripts/                       # Harmon training and inference entry points
├── src/                           # Harmon source: models, datasets, hooks, runners
├── stable_diffusion/              # SD2 and SD3 training, inference, configs
│   ├── train_and_eval_sd{2,3}.sh  # Three-stage train + eval launchers
│   ├── viewpoint2image_sd{2,3}.py # Single-prompt sweep inference
│   └── configs/                   # SD3 baseline + split_aerpy YAMLs
├── evaluation/                    # Eval scripts + per-image metadata
├── checkpoints/                   # Downloaded pretrained weights (gitignored)
├── environment.yml                # Harmon conda env
├── stable_diffusion/environment.yml  # SD2/SD3 conda env
├── requirements.txt
└── LICENSE

Citation

@inproceedings{lu2026viewtoken,
  title={Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens},
  author={Lu, Xinxuan and Fowlkes, Charless and Berg, Alexander C.},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}

License

This project is licensed under CC BY 4.0. See LICENSE for details.

Note: The Harmon backbone code (src/models/harmon.py, src/models/harmon_dev.py, src/models/mar/) is derived from Harmon, which is released under the S-Lab License 1.0 (non-commercial use). Use of these files is subject to the S-Lab License terms.

Acknowledgements

This project builds upon Harmon.
