[WACV 2026] UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning
The official implementation of the WACV 2026 paper "UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning".
If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:
```bibtex
@inproceedings{le2026uno,
  title={UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning},
  author={Le, Huy and Chung, Nhat and Kieu, Tung and Yang, Jingkang and Le, Ngan},
  booktitle={WACV},
  year={2026},
}
```
Video Scene Graph Generation (VidSGG) aims to represent dynamic visual content by detecting objects and modeling their temporal interactions as structured graphs. Prior studies typically target either coarse-grained box-level or fine-grained panoptic pixel-level VidSGG, often requiring task-specific architectures and multi-stage training pipelines. In this paper, we present UNO (UNified Object-centric VidSGG), a single-stage, unified framework that jointly addresses both tasks within an end-to-end architecture. UNO is designed to minimize task-specific modifications and maximize parameter sharing, enabling generalization across different levels of visual granularity. The core of UNO is an extended slot attention mechanism that decomposes visual features into object and relation slots. To ensure robust temporal modeling, we introduce object temporal consistency learning, which enforces consistent object representations across frames without relying on explicit tracking modules. Additionally, a dynamic triplet prediction module links relation slots to corresponding object pairs, capturing evolving interactions over time. We evaluate UNO on standard box-level and pixel-level VidSGG benchmarks. Results demonstrate that UNO not only achieves competitive performance across both tasks but also offers improved efficiency through a unified, object-centric design.
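The extended slot attention used in UNO is described in the paper. As a rough, generic illustration of the idea of decomposing visual features into object and relation slots, the sketch below shows a minimal vanilla slot attention step (Locatello et al., 2020) whose slot set is split into object and relation slots. This is not the UNO module; all names, dimensions, and hyperparameters (e.g. `ToySlotAttention`, `num_obj_slots`) are illustrative assumptions.
```python
# Minimal, generic slot-attention sketch (after Locatello et al., 2020), shown only
# to illustrate decomposing per-frame features into object and relation slots.
# NOT the UNO implementation; all names and sizes are illustrative.
import torch
import torch.nn as nn


class ToySlotAttention(nn.Module):
    def __init__(self, num_obj_slots=8, num_rel_slots=8, dim=128, iters=3):
        super().__init__()
        self.num_slots = num_obj_slots + num_rel_slots
        self.num_obj_slots = num_obj_slots
        self.iters = iters
        self.scale = dim ** -0.5

        # Learned Gaussian initialization for the slots.
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))

        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm_inputs = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)

    def forward(self, feats):  # feats: (B, N, dim) per-frame visual tokens
        b, n, d = feats.shape
        feats = self.norm_inputs(feats)
        k, v = self.to_k(feats), self.to_v(feats)

        # Sample initial slots from the learned Gaussian.
        mu = self.slots_mu.expand(b, self.num_slots, -1)
        sigma = self.slots_logsigma.exp().expand(b, self.num_slots, -1)
        slots = mu + sigma * torch.randn_like(mu)

        for _ in range(self.iters):
            slots_prev = slots
            q = self.to_q(self.norm_slots(slots))
            # Slots compete for input tokens: softmax over the slot axis.
            attn = torch.softmax(torch.einsum("bsd,bnd->bsn", q, k) * self.scale, dim=1)
            attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-8)
            updates = torch.einsum("bsn,bnd->bsd", attn, v)
            slots = self.gru(updates.reshape(-1, d), slots_prev.reshape(-1, d)).view(b, -1, d)
            slots = slots + self.mlp(self.norm_mlp(slots))

        # Split the slot set into object slots and relation slots.
        return slots[:, : self.num_obj_slots], slots[:, self.num_obj_slots:]


if __name__ == "__main__":
    feats = torch.randn(2, 196, 128)            # e.g. 14x14 frame tokens
    obj_slots, rel_slots = ToySlotAttention()(feats)
    print(obj_slots.shape, rel_slots.shape)     # (2, 8, 128) (2, 8, 128)
```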
Set up the environment:
```bash
conda create -n uno python=3.9
conda activate uno
pip install -r requirements.txt
```
We use two datasets, Action Genome and PVSG, to train and evaluate our method.
- For the Action Genome dataset, please process the downloaded dataset with the Toolkit and put the processed annotation files (COCO style) into the `annotations` folder (a quick sanity check for the processed annotations is sketched after this dataset list). The directories of the dataset should look like:
```
|-- action-genome
    |-- annotations                  # gt annotations
        |-- ag_train_coco_style.json
        |-- ag_test_coco_style.json
        |-- ...
    |-- frames                       # sampled frames
    |-- videos                       # original videos
```
- For the PVSG dataset, please follow the PVSG repo to download and pre-process the dataset. The directories of the dataset should look like:
```
|-- pvsg
    |-- pvsg.json                    # gt annotations
    |-- ego4d/epic_kitchen/vidor     # video sources
        |-- masks                    # sampled masks
        |-- frames                   # sampled frames
        |-- videos                   # original videos
```
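After preparing both datasets, a quick look at the annotation files can catch path or pre-processing issues early. The snippet below is a minimal sketch: the paths follow the layouts above, the COCO-style key names for Action Genome are an assumption based on the format name, and only the top-level structure of `pvsg.json` is inspected since its exact schema is defined by the PVSG toolkit.
```python
# Minimal sanity check for the two annotation files, assuming the layouts above.
# Action Genome key names assume standard COCO structure ("images",
# "annotations", "categories"); the PVSG schema is only peeked at.
import json

# Action Genome (COCO-style annotations)
with open("action-genome/annotations/ag_train_coco_style.json") as f:
    ag = json.load(f)
print("AG images:", len(ag.get("images", [])))
print("AG annotations:", len(ag.get("annotations", [])))
print("AG categories:", [c.get("name") for c in ag.get("categories", [])][:10])

# PVSG
with open("pvsg/pvsg.json") as f:
    pvsg = json.load(f)
if isinstance(pvsg, dict):
    print("PVSG top-level keys:", list(pvsg.keys()))
else:
    print("PVSG entries:", len(pvsg))
```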
You can follow the scripts below to train UNO:
Note that the learning rate (LR) may need to be tuned manually to obtain the best performance.
- For the SGDET task:
```bash
bash scripts/train_sgdet.sh
```
Please download the checkpoints used in the paper and put them into the `exps/dsgg` folder.
You can use the scripts below to evaluate the performance of UNO.
- For the SGDET task:
```bash
bash scripts/eval_sgdet.sh
```
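Before running evaluation, it can be useful to confirm that a checkpoint was downloaded and placed correctly. The snippet below is a minimal sketch: the filename `uno_sgdet.pth` is a placeholder, and standard PyTorch serialization is assumed for the released checkpoints.
```python
# Optional: inspect a downloaded checkpoint before evaluation.
# "uno_sgdet.pth" is a placeholder filename; replace it with the actual
# checkpoint name from the release. Standard PyTorch .pth format is assumed.
import torch

ckpt = torch.load("exps/dsgg/uno_sgdet.pth", map_location="cpu")
state = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
if isinstance(ckpt, dict):
    print("checkpoint keys:", list(ckpt.keys()))
print("parameter tensors:", len(state))
```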
The code is still under development and more updates will come soon!
We thank the authors of the following repositories for the excellent code they have released. Our framework is built upon these repos: