Saurabh Pathak · Elahe Arani · Mykola Pechenizkiy · Bahram Zonooz
Eindhoven University of Technology
Generative video models achieve high visual fidelity but often violate basic physical principles, limiting reliability in real‑world settings. Prior attempts to inject physics rely on conditioning: frame‑level signals are domain‑specific and short‑horizon, while global text prompts are coarse and noisy, missing fine‑grained dynamics. We present PhysVid, a physics‑aware local conditioning scheme that operates over temporally contiguous chunks of frames. Each chunk is annotated with physics‑grounded descriptions of states, interactions, and constraints, which are fused with the global prompt via chunk‑aware cross‑attention during training. At inference, we introduce negative physics prompts (descriptions of locally relevant law violations) to steer generation away from implausible trajectories. On VideoPhy, PhysVid improves physical commonsense scores by
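The negative physics prompts mentioned above steer sampling away from described law violations. As a toy illustration of the general mechanism only (our sketch in the style of classifier-free guidance — `guided_prediction` and its arguments are illustrative names, not the repository's actual sampler):

```python
# Toy sketch of negative-prompt guidance in the style of classifier-free
# guidance. NOT PhysVid's actual sampler: the model output conditioned on a
# negative physics prompt (e.g. "the ball passes through the table") plays
# the role of the unconditional branch, so samples are pushed away from it.

def guided_prediction(cond, neg, scale=5.0):
    """Elementwise: neg + scale * (cond - neg); scale > 1 moves past cond."""
    return [n + scale * (c - n) for c, n in zip(cond, neg)]

# Where cond and neg agree, guidance leaves the prediction unchanged;
# where they differ, the difference is amplified away from neg.
print(guided_prediction([1.0, 2.0, 3.0], [1.0, 0.0, 0.0], scale=2.0))
# prints [1.0, 4.0, 6.0]
```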
```bash
# Option A: conda
# conda create -n physvid python=3.12 -y
# conda activate physvid

# Option B: venv
python -m venv physvid
source physvid/bin/activate

pip install torch==2.7.1+cu128 --index-url https://download.pytorch.org/whl/cu128
pip install torchvision psutil
pip install flash-attn --no-build-isolation
pip install -r requirements.txt
python -m setup install
```

Also download the Wan T2V base model from here and save it to `weights/wan_models/Wan2.1-T2V-1.3B`. If you also intend to evaluate Wan14B T2V, download it as well.
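Before running inference or evaluation, it may help to confirm the weight directories named in this README are in place. This is an optional, illustrative helper of ours (`missing_weights` is not part of the repo):

```python
# Optional sanity check (illustrative, not part of the repo): verify the
# expected weight directories from this README exist under the repo root.
import os

EXPECTED = [
    "weights/wan_models/Wan2.1-T2V-1.3B",
    "weights/videophy",
    "weights/videophy2",
]

def missing_weights(root=".", expected=EXPECTED):
    """Return the subset of expected weight directories that are absent."""
    return [p for p in expected if not os.path.isdir(os.path.join(root, p))]

if __name__ == "__main__":
    for path in missing_weights():
        print(f"missing: {path}")
```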
First download the checkpoint: PhysVid Model
```bash
python tests/test_inference_single.py
```

First, generate the samples for the VideoPhy/VideoPhy2 dataset. Please modify the `generate_config.yaml` file as needed.
```bash
torchrun --nproc_per_node 8 physvid/evaluation/generate_videophy_samples.py --config_path configs/generate_videophy_samples.yaml
```

Once the samples are generated, you can evaluate their physical plausibility on the corresponding benchmark using the provided evaluation script. Please modify the `eval_config.yaml` file as needed.
Before running the command below, download the VideoPhy and VideoPhy2 evaluation models with the standard `huggingface-cli` and save them to `weights/videophy` and `weights/videophy2`, respectively.
```bash
torchrun --nproc_per_node 8 physvid/evaluation/eval.py --config_path configs/eval_config.yaml
```

The results from our own evaluation are reported in the paper.
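The physical commonsense numbers come from the benchmark's own evaluator models. As a rough illustration only (simple averaging is our assumption here, not VideoPhy's exact protocol), such a score can be aggregated from per-video plausibility judgments like so:

```python
# Illustrative only: aggregate per-video physical-commonsense judgments into
# one score by simple averaging over a pass threshold. This mimics the flavor
# of benchmark scoring but is NOT VideoPhy's exact protocol.

def physical_commonsense_score(judgments, threshold=0.5):
    """Fraction of videos whose plausibility judgment passes the threshold."""
    if not judgments:
        return 0.0
    passed = sum(1 for j in judgments if j >= threshold)
    return passed / len(judgments)

print(physical_commonsense_score([0.9, 0.2, 0.7, 0.4]))  # prints 0.5
```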
We use the WISA80k Dataset (80K videos) for training.
First, download VideoLLama3-7B from its huggingface repo. To prepare the dataset, follow the steps below. Alternatively, you can skip them and download the final LMDB dataset and local annotations from here.
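The annotation step below pairs each temporally contiguous chunk of frames with physics-grounded descriptions. To make the data layout concrete, here is a hypothetical record of ours — field names, prompts, and the `chunk_bounds` helper are all illustrative; the real schema is defined by `filter_annotate_dataset.py` and its config:

```python
import json

# Hypothetical chunk-level annotation record (the real schema is defined by
# the repo's filter_annotate_dataset.py). Each contiguous frame chunk carries
# physics-grounded states, interactions, and constraints alongside the
# global prompt.

def chunk_bounds(num_frames, chunk_size):
    """Contiguous (start, end) frame ranges covering the whole clip."""
    return [(s, min(s + chunk_size, num_frames))
            for s in range(0, num_frames, chunk_size)]

record = {
    "video": "example.mp4",                       # illustrative path
    "global_prompt": "a ball rolls off a table",  # illustrative prompt
    "chunks": [
        {"frames": list(bounds),
         "states": "ball at rest on the table" if i == 0 else "ball in free fall",
         "interactions": "ball-table contact" if i == 0 else "none",
         "constraints": "gravity; rigid table"}
        for i, bounds in enumerate(chunk_bounds(num_frames=32, chunk_size=16))
    ],
}
print(json.dumps(record, indent=2))
```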
```bash
# download and extract videos from the WISA dataset
python data_processing/download_extract_hf_dataset.py --local_dir XXX --revision 8fbd4a1d1a83bdd9e1f58187d1974c3fbb3a0d37

# precompute the VAE latents
torchrun --nproc_per_node 8 data_processing/create_vae_latent_lmdb.py --config_path configs/create_vae_latent_lmdb.yaml

# generate the local annotations for each video chunk and save them to a JSON file
torchrun --nproc_per_node 8 data_processing/filter_annotate_dataset.py --config_path configs/vlm_annotate_config.yaml
```

Please modify the wandb account information in `train.yaml`, then launch the training using the command below.
```bash
torchrun --nnodes 8 --nproc_per_node=8 --rdzv_id=5235 \
    --rdzv_backend=c10d \
    --rdzv_endpoint $MASTER_ADDR physvid/train.py \
    --config_path configs/wan_causal_dmd.yaml --no_visualize
```

If you find PhysVid useful or relevant to your research, please cite our paper:
```bibtex
@inproceedings{pathak2026physvid,
  title     = {PhysVid: Physics Aware Local Conditioning for Generative Video Models},
  author    = {Pathak, Saurabh and Arani, Elahe and Pechenizkiy, Mykola and Zonooz, Bahram},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```

- The current implementation is optimized for the WISA80k dataset and may require adjustments to work with other datasets.
- The VLM is hardcoded to use the VideoLLama3-7B model. Adapting to other VLMs may require modifications to the codebase.
- The training process is computationally intensive and may require access to high-performance computing resources.
- The evaluation is currently limited to the VideoPhy and VideoPhy2 benchmarks. Evaluating on additional benchmarks may require further code modifications.
We will document known bugs in this section as we identify and fix them. Please report any issues you encounter while using the code; we will do our best to address them in a timely manner.
This work is supported by the EU-funded SYNERGIES project (Grant Agreement No. 101146542). We also gratefully acknowledge the TUE supercomputing team for providing the SPIKE-1 compute infrastructure used to carry out the experiments reported in this paper.
A portion of the code related to data processing, training and flow matching is adapted from CausVid.
Wan Model related code is adapted from Wan Video.
VideoPhy/VideoPhy2 code is adapted from videophy.
We thank the respective authors for open-sourcing their code.





