[WACV 2026] UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning
The official implementation of the WACV 2026 paper "UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning".
If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:
```bibtex
@inproceedings{le2026uno,
  title={UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning},
  author={Le, Huy and Chung, Nhat and Kieu, Tung and Yang, Jingkang and Le, Ngan},
  booktitle={WACV},
  year={2026},
}
```
Video Scene Graph Generation (VidSGG) aims to represent dynamic visual content by detecting objects and modeling their temporal interactions as structured graphs. Prior studies typically target either coarse-grained box-level or fine-grained panoptic pixel-level VidSGG, often requiring task-specific architectures and multi-stage training pipelines. In this paper, we present UNO (UNified Object-centric VidSGG), a single-stage, unified framework that jointly addresses both tasks within an end-to-end architecture. UNO is designed to minimize task-specific modifications and maximize parameter sharing, enabling generalization across different levels of visual granularity. The core of UNO is an extended slot attention mechanism that decomposes visual features into object and relation slots. To ensure robust temporal modeling, we introduce object temporal consistency learning, which enforces consistent object representations across frames without relying on explicit tracking modules. Additionally, a dynamic triplet prediction module links relation slots to corresponding object pairs, capturing evolving interactions over time. We evaluate UNO on standard box-level and pixel-level VidSGG benchmarks. Results demonstrate that UNO not only achieves competitive performance across both tasks but also offers improved efficiency through a unified, object-centric design.
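The extended slot attention used in UNO is described in the paper. As a rough, generic illustration of the idea of decomposing visual features into object and relation slots, the sketch below shows a minimal vanilla slot attention step (Locatello et al., 2020) whose slot set is split into object and relation slots. This is not the UNO module; all names, dimensions, and hyperparameters (e.g. `ToySlotAttention`, `num_obj_slots`) are illustrative assumptions.
```python
# Minimal, generic slot-attention sketch (after Locatello et al., 2020), shown only
# to illustrate decomposing per-frame features into object and relation slots.
# NOT the UNO implementation; all names and sizes are illustrative.
import torch
import torch.nn as nn


class ToySlotAttention(nn.Module):
    def __init__(self, num_obj_slots=8, num_rel_slots=8, dim=128, iters=3):
        super().__init__()
        self.num_slots = num_obj_slots + num_rel_slots
        self.num_obj_slots = num_obj_slots
        self.iters = iters
        self.scale = dim ** -0.5

        # Learned Gaussian initialization for the slots.
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))

        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm_inputs = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)

    def forward(self, feats):  # feats: (B, N, dim) per-frame visual tokens
        b, n, d = feats.shape
        feats = self.norm_inputs(feats)
        k, v = self.to_k(feats), self.to_v(feats)

        # Sample initial slots from the learned Gaussian.
        mu = self.slots_mu.expand(b, self.num_slots, -1)
        sigma = self.slots_logsigma.exp().expand(b, self.num_slots, -1)
        slots = mu + sigma * torch.randn_like(mu)

        for _ in range(self.iters):
            slots_prev = slots
            q = self.to_q(self.norm_slots(slots))
            # Slots compete for input tokens: softmax over the slot axis.
            attn = torch.softmax(torch.einsum("bsd,bnd->bsn", q, k) * self.scale, dim=1)
            attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-8)
            updates = torch.einsum("bsn,bnd->bsd", attn, v)
            slots = self.gru(updates.reshape(-1, d), slots_prev.reshape(-1, d)).view(b, -1, d)
            slots = slots + self.mlp(self.norm_mlp(slots))

        # Split the slot set into object slots and relation slots.
        return slots[:, : self.num_obj_slots], slots[:, self.num_obj_slots:]


if __name__ == "__main__":
    feats = torch.randn(2, 196, 128)            # e.g. 14x14 frame tokens
    obj_slots, rel_slots = ToySlotAttention()(feats)
    print(obj_slots.shape, rel_slots.shape)     # (2, 8, 128) (2, 8, 128)
```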
Set up the environment:
```bash
conda create -n uno python=3.9
conda activate uno
pip install -r requirements.txt
```
We use two datasets, Action Genome and PVSG, to train and evaluate our method.
- For the Action Genome dataset, please process the downloaded dataset with the Toolkit and put the processed annotation files (COCO style) into the `annotations` folder (a quick sanity check for the processed annotations is sketched after this dataset list). The directories of the dataset should look like:
```
|-- action-genome
    |-- annotations                  # gt annotations
        |-- ag_train_coco_style.json
        |-- ag_test_coco_style.json
        |-- ...
    |-- frames                       # sampled frames
    |-- videos                       # original videos
```
- For the PVSG dataset, please follow the PVSG repo to download and pre-process the dataset. The directories of the dataset should look like:
```
|-- pvsg
    |-- pvsg.json                    # gt annotations
    |-- ego4d/epic_kitchen/vidor     # video sources
        |-- masks                    # sampled masks
        |-- frames                   # sampled frames
        |-- videos                   # original videos
```
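After preparing both datasets, a quick look at the annotation files can catch path or pre-processing issues early. The snippet below is a minimal sketch: the paths follow the layouts above, the COCO-style key names for Action Genome are an assumption based on the format name, and only the top-level structure of `pvsg.json` is inspected since its exact schema is defined by the PVSG toolkit.
```python
# Minimal sanity check for the two annotation files, assuming the layouts above.
# Action Genome key names assume standard COCO structure ("images",
# "annotations", "categories"); the PVSG schema is only peeked at.
import json

# Action Genome (COCO-style annotations)
with open("action-genome/annotations/ag_train_coco_style.json") as f:
    ag = json.load(f)
print("AG images:", len(ag.get("images", [])))
print("AG annotations:", len(ag.get("annotations", [])))
print("AG categories:", [c.get("name") for c in ag.get("categories", [])][:10])

# PVSG
with open("pvsg/pvsg.json") as f:
    pvsg = json.load(f)
if isinstance(pvsg, dict):
    print("PVSG top-level keys:", list(pvsg.keys()))
else:
    print("PVSG entries:", len(pvsg))
```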
You can follow the scripts below to train UNO:
Note that the learning rate (LR) may need to be tuned manually to obtain the best performance.
- For the SGDET task:
```bash
bash scripts/train_sgdet.sh
```
Please download the checkpoints used in the paper and put them into the `exps/dsgg` folder.
You can use the scripts below to evaluate the performance of UNO.
- For the SGDET task:
```bash
bash scripts/eval_sgdet.sh
```
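Before running evaluation, it can be useful to confirm that a checkpoint was downloaded and placed correctly. The snippet below is a minimal sketch: the filename `uno_sgdet.pth` is a placeholder, and standard PyTorch serialization is assumed for the released checkpoints.
```python
# Optional: inspect a downloaded checkpoint before evaluation.
# "uno_sgdet.pth" is a placeholder filename; replace it with the actual
# checkpoint name from the release. Standard PyTorch .pth format is assumed.
import torch

ckpt = torch.load("exps/dsgg/uno_sgdet.pth", map_location="cpu")
state = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
if isinstance(ckpt, dict):
    print("checkpoint keys:", list(ckpt.keys()))
print("parameter tensors:", len(state))
```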
The code is still under development and more updates will come soon!
We thank the authors of the following repositories for the excellent code they have released. Our framework is built upon these repos: