Randdl/viewtoken_control

Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens

CVPR 2026

Xinxuan Lu, Charless Fowlkes, Alexander C. Berg

University of California, Irvine

Current text-to-image models struggle to provide precise camera control using natural language alone. In this work, we present a framework for precise camera control with global scene understanding in text-to-image generation by learning parametric camera tokens. We fine-tune image generation models for viewpoint-conditioned text-to-image generation on a curated dataset that combines 3D-rendered images for geometric supervision and photorealistic augmentations for appearance and background diversity. Our method achieves state-of-the-art accuracy while preserving image quality and prompt fidelity. Unlike prior methods that overfit to object-specific appearance correlations, our viewpoint tokens learn factorized geometric representations that transfer to unseen object categories.

Installation

Harmon Backbone

conda env create -f environment.yml
conda activate viewtoken

This pins Python 3.12 and installs pinned pip dependencies from requirements.txt. Key dependencies: PyTorch 2.10 (CUDA 12.8), xtuner, transformers 4.48, flash-attn 2.8.3, mmengine, deepspeed. The flash_attn wheel is pinned to cu12 / torch2.10 / cp312 / linux_x86_64, so reproduction requires a CUDA 12.8-capable host (e.g., NVIDIA Blackwell).

Stable Diffusion Backbone (SD2 / SD3)

conda env create -f stable_diffusion/environment.yml
conda activate viewtoken_sd

This pins Python 3.12 and installs pinned pip dependencies from stable_diffusion/requirements.txt. Key dependencies: PyTorch 2.10 (CUDA 12.8), diffusers 0.37, accelerate, peft, transformers 5.3.

Pretrained Checkpoints

Download all checkpoints from Google Drive and extract to checkpoints/.

| Model | Backbone | File |
|---|---|---|
| Harmon Base | Harmon 1.5B pretrained | harmon/Harmon-1.5b-ReAlign-plus.pth |
| VAE | KL-16 | harmon/kl16.ckpt |
| Viewpoint Token (Harmon) | Harmon 1.5B | harmon/harmon_viewpoint.pth |
| Viewpoint Token (SD2) | Stable Diffusion 2.1 | sd2/training_state_30000.pth |
| Viewpoint Token (SD3) | Stable Diffusion 3.5 Medium | sd3/training_state_30000.pth |
| Viewpoint Token (SD3, AE/RPY split) † | Stable Diffusion 3.5 Medium | sd3_split_aerpy/training_state_50000.pth |
| Viewpoint Predictor | ResNet-50 | viewpoint_predictor/best_model.pth |

† sd3_split_aerpy/ is post-paper: a follow-up encoder variant with two parallel MLPs (azimuth+elevation, radius+pitch+yaw) trained for 50k steps. It scores substantially better on azimuth and yaw/pitch on our predictor benchmark. Use sd3/ for paper reproduction.

The SD2/SD3 args.pkl files ship alongside their training_state_*.pth and are required by the eval loader. SD2/SD3 base models (Manojb/stable-diffusion-2-1-base, stabilityai/stable-diffusion-3.5-medium) are pulled from HuggingFace on first use.
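
Since the eval loader needs both files, a pre-flight check before a long run may be useful. The following is a minimal sketch, not part of the repository: the helper name `missing_sd_files` is hypothetical, and only the two file names are taken from the text above.

```python
from pathlib import Path


def missing_sd_files(ckpt_dir, step):
    """Return the expected SD2/SD3 checkpoint files that are absent from ckpt_dir.

    Hypothetical helper: it only checks that the files exist, not their contents.
    """
    ckpt_dir = Path(ckpt_dir)
    expected = [f"training_state_{step}.pth", "args.pkl"]
    return [name for name in expected if not (ckpt_dir / name).is_file()]


if __name__ == "__main__":
    # An empty list means both files were found.
    print(missing_sd_files("checkpoints/sd3", 30000))
```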

Inference

Note on --radius: The --radius argument takes the raw (unnormalized) camera distance, in the same units as the training renders (Blender world units). This is roughly 3× the normalized radius reported in the paper; for example, the paper's normalized radius of ~1.67 corresponds to --radius 5.0 in the code. Separately, the scripts apply radius_normalized = (radius - 5.5) / 2.5 before feeding the value into the viewpoint encoder, so --radius 5.0 reaches the encoder as -0.2.
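
The two radius conventions in this note can be sketched in a few lines. This is an illustration under stated assumptions, not the scripts' actual code: the function names are hypothetical, and the factor of 3 is only the approximate ratio quoted above.

```python
RADIUS_OFFSET = 5.5  # offset in radius_normalized = (radius - 5.5) / 2.5
RADIUS_SCALE = 2.5   # divisor in the same formula
PAPER_FACTOR = 3.0   # "roughly 3x" from the note; approximate, not exact


def encoder_radius(raw_radius):
    """Value the viewpoint encoder receives for a given --radius."""
    return (raw_radius - RADIUS_OFFSET) / RADIUS_SCALE


def paper_to_raw(paper_radius):
    """Approximate --radius for a normalized radius reported in the paper."""
    return paper_radius * PAPER_FACTOR


if __name__ == "__main__":
    print(encoder_radius(5.0))   # encoder input for the default --radius
    print(paper_to_raw(1.67))    # close to 5.0, since the factor is approximate
```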

All three scripts accept --azimuth, --elevation, --radius, --pitch, --yaw as one or more space-separated values; the cartesian product of the five lists determines how many images get generated (one PNG per viewpoint tuple, named {slug}_az{az}_el{el}_r{r}_p{p}_y{y}.png under --output_dir). With no viewpoint flags, the defaults sweep azimuth [45, 135, 225, 315] at elevation 30 and radius 5.0 — four images.
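
To estimate how many images a sweep will produce, the cartesian product can be enumerated directly. The sketch below is illustrative: `viewpoint_filenames` is a hypothetical helper, the defaults mirror those stated above, and the exact number formatting and slug derivation used by the real scripts are assumptions.

```python
from itertools import product


def viewpoint_filenames(slug, azimuth=(45, 135, 225, 315), elevation=(30,),
                        radius=(5.0,), pitch=(0.0,), yaw=(0.0,)):
    """Return one file name per tuple in the product of the five lists.

    The name pattern follows the README; how the scripts format each
    number (e.g. 5.0 vs 5) is an assumption here.
    """
    return [
        f"{slug}_az{az}_el{el}_r{r}_p{p}_y{y}.png"
        for az, el, r, p, y in product(azimuth, elevation, radius, pitch, yaw)
    ]


if __name__ == "__main__":
    names = viewpoint_filenames("red_car", azimuth=(0, 90, 180, 270),
                                elevation=(0, 30))
    print(len(names))  # 4 azimuths x 2 elevations
```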

Harmon

python scripts/viewpoint2image.py \
    --config configs/examples/viewpoint_training_1_5b.py \
    --checkpoint checkpoints/harmon/harmon_viewpoint.pth \
    --object_desc "a red sports car on a grass land" \
    --azimuth 0 90 180 270 \
    --output_dir output/harmon_red_car

Stable Diffusion 2.1

python stable_diffusion/viewpoint2image_sd2.py \
    --ckpt_base_path checkpoints/sd2/ \
    --step 30000 \
    --object_desc "a red sports car on a grass land" \
    --azimuth 0 90 180 270 \
    --output_dir output/sd2_red_car

Stable Diffusion 3.5

# Paper checkpoint (shared-backbone encoder, step 30000)
python stable_diffusion/viewpoint2image_sd3.py \
    --ckpt_base_path checkpoints/sd3/ \
    --step 30000 \
    --object_desc "a red sports car on a grass land" \
    --azimuth 0 90 180 270 \
    --output_dir output/sd3_red_car

# Post-paper AE/RPY split checkpoint (better azimuth, supports pitch/yaw too)
python stable_diffusion/viewpoint2image_sd3.py \
    --ckpt_base_path checkpoints/sd3_split_aerpy/ \
    --step 50000 \
    --object_desc "a red sports car on a grass land" \
    --azimuth 0 90 180 270 \
    --pitch 0.0 \
    --yaw 0.0 \
    --output_dir output/sd3_split_red_car

--pitch and --yaw are optional (default 0.0) and use the same scale as the training metadata's angular_offset / 7.0. The encoder internally multiplies them by 5 — small values (e.g. ±0.2) produce noticeable camera tilt.
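
The scaling described above can be written out as follows. This is a sketch only: the function names are hypothetical, and the units of angular_offset are not stated in this README, so they are left unspecified.

```python
OFFSET_DIVISOR = 7.0  # CLI value = metadata angular_offset / 7.0 (from the text)
ENCODER_GAIN = 5.0    # the encoder multiplies the CLI value by 5 (from the text)


def cli_from_angular_offset(angular_offset):
    """Convert a training-metadata angular_offset to a --pitch / --yaw value."""
    return angular_offset / OFFSET_DIVISOR


def encoder_input(cli_value):
    """Value the encoder works with after its internal scaling."""
    return cli_value * ENCODER_GAIN


if __name__ == "__main__":
    # A CLI value of 0.2 is scaled up fivefold inside the encoder.
    print(encoder_input(0.2))
```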

All three inference scripts also accept --metadata_json + --image_key to load a viewpoint from a metadata file (e.g., evaluation/metadata/eval_metadata.json) instead of passing --azimuth/--elevation/--radius directly; in that case a single image is produced.

Dataset

Download

The TexVerse viewpoint dataset (~101 GB compressed) is available from Google Drive (same folder as the checkpoints).

After downloading, extract with:

mkdir -p /your/data/root
tar --use-compress-program=unzstd -xvf viewpoint_dataset.tar.zst -C /your/data/root

This will create viewpoint_dataset/ under your chosen DATA_ROOT.

Expected Structure

Set DATA_ROOT in the config files to point to your data directory.

DATA_ROOT/
└── viewpoint_dataset/
    ├── all_final_captions.json          # Global captions file (used by all subfolders)
    ├── texverse_vehicles_randomview_best_nanobanana/
    │   ├── metadata.json                 # Per-folder camera/pose metadata
    │   ├── <uid>/000.png ... 009.png     # Rendered frames per object
    │   └── ...
    ├── texverse_vehicles_randomview_best_nanobanana_2/
    ├── texverse_animals_randomview_best_nanobanana/
    ├── texverse_animals_randomview_best_nanobanana_2/
    ├── texverse_people_randomview_best_nanobanana/
    ├── texverse_people_randomview_best_nanobanana_2/
    ├── texverse_furniture_randomview_best_nanobanana/
    ├── texverse_vehicles_randomview_all/
    ├── texverse_animals_randomview_all/
    ├── texverse_people_randomview_all/
    └── texverse_furniture_randomview_all/

Each subdirectory contains:

  • metadata.json — per-frame camera parameters and per-object descriptions (required)
  • <uid>/<frame_index>.png — rendered images, one folder per object UID

The root-level all_final_captions.json provides global caption overrides shared across all subfolders.
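
After extraction, it can help to confirm that every subfolder carries its required metadata.json. A minimal sketch, assuming the layout shown above; `subfolders_missing_metadata` is a hypothetical helper and not part of the repository.

```python
from pathlib import Path


def subfolders_missing_metadata(data_root):
    """List subfolders of viewpoint_dataset/ that have no metadata.json."""
    root = Path(data_root) / "viewpoint_dataset"
    return sorted(
        p.name for p in root.iterdir()
        if p.is_dir() and not (p / "metadata.json").is_file()
    )
```

An empty list means every subfolder has its metadata file; a missing viewpoint_dataset/ directory raises FileNotFoundError.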

Custom Dataset Format

Each dataset directory contains a metadata.json keyed by <uid>/<frame>.png:

{
  "0290a9bf3100495b87029541363f2b24/000.png": {
    "uid": "0290a9bf3100495b87029541363f2b24",
    "image_path": "0290a9bf3100495b87029541363f2b24/000.png",
    "matrix": [
      [r00, r01, r02, tx],
      [r10, r11, r12, ty],
      [r20, r21, r22, tz]
    ],
    "azimuth_degrees": -9.96,
    "elevation_degrees": 8.05,
    "q": [qx, qy, qz, qw],
    "t": [tx, ty, tz],
    "object_type": "container ship",
    "background": "a dark, moonlit sea with a subtle glow",
    "descriptions": ["with blue containers", "with deck machinery"],
    "name": "container ship"
  }
}

Required fields:

  • uid — Object UID (folder name)
  • image_path — Relative path to image (<uid>/<frame>.png)
  • matrix — 3×4 camera extrinsic [R | t]
  • name — Object class name (used for prompt generation)

Optional fields:

  • azimuth_degrees, elevation_degrees — Pre-computed camera angles
  • q, t — Quaternion and translation (alternative to matrix)
  • object_type, background, descriptions — Used for caption template generation
  • filter_result — Set to "reject" to exclude a sample from training
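
When building a custom metadata.json, a small validator can catch missing fields early. The sketch below checks only what the lists above state (required fields, 3×4 matrix shape, and the "reject" filter); the function names are hypothetical and the real dataset loader may apply different or additional checks.

```python
REQUIRED_FIELDS = ("uid", "image_path", "matrix", "name")


def validate_entry(entry):
    """Return a list of problems found in one metadata entry (empty if none)."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in entry]
    matrix = entry.get("matrix")
    if matrix is not None:
        # The README describes matrix as a 3x4 extrinsic [R | t].
        if len(matrix) != 3 or any(len(row) != 4 for row in matrix):
            problems.append("matrix is not 3x4")
    return problems


def usable_entries(metadata):
    """Keep entries that are valid and not marked filter_result == "reject"."""
    return {
        key: entry for key, entry in metadata.items()
        if entry.get("filter_result") != "reject" and not validate_entry(entry)
    }
```

Typical use would be `usable_entries(json.load(open(path)))` on a real metadata.json, where the placeholder matrix values are replaced by numbers.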

Training

Harmon Backbone

  1. Edit DATA_ROOT in configs/examples/viewpoint_training_1_5b.py to point at the directory that contains viewpoint_dataset/ (see Dataset above).
  2. Place pretrained checkpoints under checkpoints/harmon/Harmon-1.5b-ReAlign-plus.pth and checkpoints/harmon/kl16.ckpt (see Pretrained Checkpoints).
  3. (Optional) wandb: the config enables WandbVisBackend with project=camera-control and name=viewtoken_1_5b_8gpu. Run wandb login once (or set WANDB_API_KEY), or export WANDB_MODE=disabled to opt out. Edit init_kwargs in the config's visualizer block to change project/entity/run name.
  4. Launch training:
# Multi-GPU — the default config is tuned for 8 GPUs
torchrun --nproc_per_node=8 scripts/train.py \
    configs/examples/viewpoint_training_1_5b.py \
    --launcher pytorch \
    --work-dir work_dirs/viewtoken_1_5b

# Single GPU — also set multi_gpu=False in the config
python scripts/train.py configs/examples/viewpoint_training_1_5b.py \
    --work-dir work_dirs/viewtoken_1_5b

Checkpoints land at <work-dir>/iter_<N>.pth every save_steps iterations (10k by default).
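
Given that naming scheme, the most recent checkpoint in a work directory can be located as sketched below. `latest_checkpoint` is a hypothetical helper; it relies only on the iter_<N>.pth pattern stated above.

```python
import re
from pathlib import Path


def latest_checkpoint(work_dir):
    """Return the iter_<N>.pth path with the largest N, or None if there is none."""
    best = None
    for path in Path(work_dir).glob("iter_*.pth"):
        match = re.fullmatch(r"iter_(\d+)\.pth", path.name)
        # Compare numerically so iter_9000 does not sort after iter_50000.
        if match and (best is None or int(match.group(1)) > best[0]):
            best = (int(match.group(1)), path)
    return best[1] if best else None
```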

A few config knobs worth knowing:

  • multi_gpu — set True for torchrun launches (wires up DDP with find_unused_parameters=True for the multi-task sampler), False for single-GPU python scripts/train.py.
  • batch_size is per-GPU. Effective global batch = batch_size × num_gpus × accumulative_counts. Default 6 × 8 × 4 = 192. Scale batch_size inversely with GPU count to hold this constant.
  • gradient_checkpointing = False by default. Enabling it requires also setting static_graph=True in strategy.model_wrapper, otherwise DDP raises a reentrant-backward error.
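
The batch-size arithmetic above can be checked with a short sketch. The function names are hypothetical; the formula and the default values (6 × 8 × 4) come from the list above.

```python
def effective_batch(batch_size, num_gpus, accumulative_counts):
    """Global batch = per-GPU batch x number of GPUs x accumulation steps."""
    return batch_size * num_gpus * accumulative_counts


def per_gpu_batch(target_global, num_gpus, accumulative_counts):
    """Per-GPU batch_size that keeps the global batch at target_global."""
    per_gpu, remainder = divmod(target_global, num_gpus * accumulative_counts)
    if remainder:
        raise ValueError("target_global is not divisible by num_gpus * accumulative_counts")
    return per_gpu


if __name__ == "__main__":
    print(effective_batch(6, 8, 4))   # default configuration
    print(per_gpu_batch(192, 4, 4))   # batch_size to use on 4 GPUs
```

Whether the larger per-GPU batch fits in memory is a separate question; raising accumulative_counts instead is another way to hold the product constant.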

Stable Diffusion 2.1

cd stable_diffusion

# Train and evaluate (all three stages, default)
DATA_ROOT=/path/to/data ./train_and_eval_sd2.sh

# Custom run name
DATA_ROOT=/path/to/data ./train_and_eval_sd2.sh my_experiment

# Training only
RUN_EVAL=0 RUN_PREDICTOR=0 DATA_ROOT=/path/to/data ./train_and_eval_sd2.sh

# Eval-only on an already-trained checkpoint (skip Phase 1)
RUN_TRAIN=0 STEP=30000 ./train_and_eval_sd2.sh

# Train + generate, skip predictor scoring (RUN_TRAIN and RUN_EVAL stay ON)
RUN_PREDICTOR=0 DATA_ROOT=/path/to/data ./train_and_eval_sd2.sh

SD2.1 shares the same three-stage RUN_TRAIN / RUN_EVAL / RUN_PREDICTOR toggles documented below for SD3.5.

Stable Diffusion 3.5

cd stable_diffusion

# Train and evaluate (single GPU) — runs all three stages
./train_and_eval_sd3.sh

# Multi-GPU
NUM_GPUS=4 ./train_and_eval_sd3.sh

# Custom config
CONFIG=configs/viewpoint_sd3.yaml ./train_and_eval_sd3.sh

# AE/RPY split encoder variant (best azimuth)
CONFIG=configs/viewpoint_sd3_50k_bs6_split_aerpy.yaml ./train_and_eval_sd3.sh

The launcher has three independent stage toggles (all default to ON):

| Flag | Stage |
|---|---|
| RUN_TRAIN=1 | Phase 1: training |
| RUN_EVAL=1 | Phase 2a: 5550-image generation |
| RUN_PREDICTOR=1 | Phase 2b: ResNet predictor scoring (requires PREDICTOR_CONFIG + PREDICTOR_CKPT) |

# Training only
RUN_EVAL=0 RUN_PREDICTOR=0 ./train_and_eval_sd3.sh

# Eval-only on an already-trained checkpoint (skip Phase 1)
RUN_TRAIN=0 STEP=50000 ./train_and_eval_sd3.sh

# Train + generate, skip predictor scoring (RUN_TRAIN and RUN_EVAL stay ON)
RUN_PREDICTOR=0 ./train_and_eval_sd3.sh
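
When driving the launcher from Python rather than a shell, the same toggles can be passed through the environment. This is a sketch under assumptions: `stage_env` is a hypothetical helper, and it only builds the environment; the variable names are the ones documented above.

```python
import os


def stage_env(train=True, evaluate=True, predictor=True, **extra):
    """Build an environment dict with the three stage toggles set to 1 or 0.

    Extra keyword arguments (e.g. STEP=50000, NUM_GPUS=4) are added as strings.
    """
    env = dict(os.environ)
    env.update({
        "RUN_TRAIN": "1" if train else "0",
        "RUN_EVAL": "1" if evaluate else "0",
        "RUN_PREDICTOR": "1" if predictor else "0",
    })
    env.update({key: str(value) for key, value in extra.items()})
    return env


# Example (not executed here): eval-only on an existing step-50000 checkpoint.
# import subprocess
# subprocess.run(["./train_and_eval_sd3.sh"],
#                env=stage_env(train=False, STEP=50000), check=True)
```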

Viewpoint Predictor

Train a standalone ResNet-based viewpoint predictor:

python scripts/train_viewpoint_predictor.py \
    --config configs/examples/viewpoint_predictor_resnet.py

Evaluation

Evaluation metadata is provided in evaluation/metadata/:

| File | Description |
|---|---|
| eval_combinations.json | Standard viewpoint evaluation combinations |
| eval_metadata.json | Per-image metadata with camera parameters |
| eval_combinations_challenging.json | Challenging viewpoints (extreme angles) |
| eval_metadata_challenging.json | Metadata for challenging viewpoints |

Viewpoint Evaluation

Two-step pipeline: generate images conditioned on (prompt, viewpoint) combinations, then score them with a pretrained viewpoint predictor.

Harmon

# Step 1: generate images (single GPU, ~3–4 h on a Blackwell PRO 6000 for 5550 combinations)
python evaluation/generate_viewpoint_eval_harmon.py \
    --config configs/examples/viewpoint_training_1_5b.py \
    --checkpoint work_dirs/viewtoken_1_5b/iter_50000.pth \
    --vae_checkpoint checkpoints/harmon/kl16.ckpt \
    --eval_json evaluation/metadata/eval_combinations.json \
    --metadata_json evaluation/metadata/eval_metadata.json \
    --output_dir output/harmon_eval \
    --batch_size 16

# Step 2: score generated images against ground-truth viewpoints (few minutes)
python evaluation/evaluate_viewpoint_predictor.py \
    --config configs/examples/viewpoint_predictor_resnet.py \
    --checkpoint checkpoints/viewpoint_predictor/best_model.pth \
    --dataset_path output/harmon_eval \
    --output_dir output/harmon_predictor

Stable Diffusion

For SD3.5 (for SD2.1, replace sd3 with sd2 throughout and use the script generate_viewpoint_eval_sd2.py):

python evaluation/generate_viewpoint_eval_sd3.py \
    --model_name viewpoint_sd3_lora \
    --step 30000 \
    --ckpt_base_path checkpoints/sd3/ \
    --base_folder evaluation/results/sd3 \
    --eval_combinations_path evaluation/metadata/eval_combinations.json \
    --metadata_path evaluation/metadata/eval_metadata.json

python evaluation/evaluate_viewpoint_predictor.py \
    --config configs/examples/viewpoint_predictor_resnet.py \
    --checkpoint checkpoints/viewpoint_predictor/best_model.pth \
    --dataset_path evaluation/results/sd3_results \
    --output_dir evaluation/predictor_results

The generation script appends _results to --base_folder automatically. For the challenging benchmark, swap both metadata paths to the _challenging.json variants.
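
The two path conventions above are easy to get wrong by hand, so a small sketch may help. Both helper names are hypothetical, and stripping a trailing slash before appending _results is an assumption about how the script behaves.

```python
from pathlib import PurePosixPath


def results_folder(base_folder):
    """Folder to pass as --dataset_path, given the --base_folder used for generation."""
    # Assumption: a trailing slash on base_folder is ignored.
    return base_folder.rstrip("/") + "_results"


def challenging_variant(metadata_path):
    """Map eval_*.json to its eval_*_challenging.json counterpart."""
    path = PurePosixPath(metadata_path)
    return str(path.with_name(path.stem + "_challenging" + path.suffix))


if __name__ == "__main__":
    print(results_folder("evaluation/results/sd3"))
    print(challenging_variant("evaluation/metadata/eval_combinations.json"))
```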

Predictor outputs

Inside --output_dir the predictor writes:

| File | Contents |
|---|---|
| metrics.json | Aggregate azimuth / elevation / radius / rotation / translation error statistics |
| grouped_metrics.json | Same metrics bucketed by viewpoint bin |
| per_object_metrics.csv | Per-class table (also printed to stdout) |
| per_sample_errors.csv | Per-image errors for all samples |
| worst_10_percent.json | Hardest samples |
| plots/ | Error visualizations |

The predictor checkpoint (checkpoints/viewpoint_predictor/best_model.pth) ships in the Google Drive release alongside the other weights. To train your own instead, use scripts/train_viewpoint_predictor.py with configs/examples/viewpoint_predictor_resnet.py (see Training → Viewpoint Predictor).

CLIP Similarity

python evaluation/calculate_clip_similarity_best.py \
    --results_folder evaluation/results/sd3_results

Project Structure

viewtoken_control/
├── configs/examples/              # Harmon training + predictor configs
├── scripts/                       # Harmon training and inference entry points
├── src/                           # Harmon source: models, datasets, hooks, runners
├── stable_diffusion/              # SD2 and SD3 training, inference, configs
│   ├── train_and_eval_sd{2,3}.sh  # Three-stage train + eval launchers
│   ├── viewpoint2image_sd{2,3}.py # Single-prompt sweep inference
│   └── configs/                   # SD3 baseline + split_aerpy YAMLs
├── evaluation/                    # Eval scripts + per-image metadata
├── checkpoints/                   # Downloaded pretrained weights (gitignored)
├── environment.yml                # Harmon conda env
├── stable_diffusion/environment.yml  # SD2/SD3 conda env
├── requirements.txt
└── LICENSE

Citation

@inproceedings{lu2026viewtoken,
  title={Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens},
  author={Lu, Xinxuan and Fowlkes, Charless and Berg, Alexander C.},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}

License

This project is licensed under CC BY 4.0. See LICENSE for details.

Note: The Harmon backbone code (src/models/harmon.py, src/models/harmon_dev.py, src/models/mar/) is derived from Harmon, which is released under the S-Lab License 1.0 (non-commercial use). Use of these files is subject to the S-Lab License terms.

Acknowledgements

This project builds upon Harmon.
