by Zhengxuan Wei*, Jiajin Tang*, Sibei Yang†
*Equal contribution; †Corresponding Author
Existing Moment Retrieval methods face three critical bottlenecks: (1) data scarcity, which forces models into shallow keyword-feature associations; (2) boundary ambiguity in transition regions between adjacent events; and (3) insufficient discrimination of fine-grained semantics (e.g., distinguishing "kicking" vs. "throwing" a ball). In this paper, we propose AMR, a zero-external-dependency Augmented Moment Retrieval framework designed to overcome the local optima caused by insufficient data annotations and the lack of robust boundary and semantic discrimination. AMR is built upon two key insights: (1) it resolves ambiguous boundary information and semantic confusion in existing annotations without additional data, avoiding costly manual labeling; and (2) it preserves the boundary and semantic discrimination gained during training while generalizing to real-world scenarios, significantly improving performance. Furthermore, we propose a two-stage training framework with cold-start and distillation adaptation. The cold-start stage employs curriculum learning on augmented data to build foundational boundary/semantic awareness. The distillation stage introduces dual query sets: Original Queries maintain DETR-based localization using frozen Base Queries from the cold-start model, while Active Queries dynamically adapt to real-data distributions. A cross-stage distillation loss enforces consistency between the Original and Base Queries, preventing knowledge forgetting while enabling real-world generalization.
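To make the cross-stage distillation concrete, below is a minimal sketch of a consistency term between the trainable Original Queries (Stage 2) and the frozen Base Queries from the cold-start model. The function name, the MSE form of the penalty, and the query sizes are our assumptions for illustration, not the exact loss used in AMR.

import torch
import torch.nn.functional as F

def cross_stage_distillation_loss(original_queries: torch.Tensor,
                                  base_queries: torch.Tensor) -> torch.Tensor:
    # Base Queries come from the frozen cold-start (Stage 1) checkpoint,
    # so gradients must not flow into them.
    return F.mse_loss(original_queries, base_queries.detach())

# Toy usage: 10 DETR-style queries of dimension 256 (sizes chosen arbitrarily).
base = torch.randn(10, 256)                     # frozen Stage-1 Base Queries
original = torch.nn.Parameter(base.clone())     # trainable Stage-2 Original Queries
loss = cross_stage_distillation_loss(original, base)
loss.backward()                                 # gradients reach `original` only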
git clone https://github.com/SooLab/AMR.git
cd AMR
We use video features (CLIP and SlowFast) and text features (CLIP) as inputs. For CLIP, we utilize the features extracted by R2-Tuning (from the last four layers), but we retain only the [CLS] token per frame to ensure efficiency. You can download our prepared feature files from qvhighlights_features and unzip them to your data root directory.
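As a rough illustration of how such features are usually combined in Moment-DETR-style pipelines, the sketch below L2-normalizes the per-frame CLIP [CLS] features and the SlowFast features and concatenates them along the feature dimension. The file layout, the "features" array key, and the dimensions in the comments are assumptions; adapt them to the downloaded feature files.

import numpy as np

def load_video_feature(clip_npz: str, slowfast_npz: str) -> np.ndarray:
    # Assumed layout: one .npz per video containing a "features" array of
    # shape (num_frames, dim); the released files may differ.
    clip_feat = np.load(clip_npz)["features"]     # e.g., (T, 512) CLIP [CLS] per frame
    sf_feat = np.load(slowfast_npz)["features"]   # e.g., (T, 2304) SlowFast
    t = min(len(clip_feat), len(sf_feat))         # align the two streams in time
    clip_feat = clip_feat[:t] / np.linalg.norm(clip_feat[:t], axis=-1, keepdims=True)
    sf_feat = sf_feat[:t] / np.linalg.norm(sf_feat[:t], axis=-1, keepdims=True)
    return np.concatenate([clip_feat, sf_feat], axis=-1)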
For Anaconda setup, refer to the official Moment-DETR GitHub.
Our data augmentation is performed directly at the feature level, so there is no need to re-extract video or text features. We provide an offline data augmentation script, utils/augment_data.py. Before running it, make sure to update the data_root variable to your data root directory path. Run the following command to generate the augmented training data:
python utils/augment_data.py
Update feat_root in amr/scripts/train_stage1.sh to the path where you saved the features, then run:
bash amr/scripts/train_stage1.sh
Update feat_root in amr/scripts/train_stage2.sh to the path where you saved the features. Also, set the resume parameter to point to the checkpoint saved from Stage 1 (e.g., test_amr/{direc}/model_e0039.ckpt). Then run:
bash amr/scripts/train_stage2.sh
After training, you can generate hl_val_submission.jsonl and hl_test_submission.jsonl for the validation and test sets by running:
bash amr/scripts/inference.sh results/{direc}/model_best.ckpt 'val'
bash amr/scripts/inference.sh results/{direc}/model_best.ckpt 'test'
Replace {direc} with the directory containing your saved checkpoint. For more details on submission, see standalone_eval/README.md.
If you find this repository useful, please cite our work:
@inproceedings{wei2025augmenting,
title={Augmenting Moment Retrieval: Zero-Dependency Two-Stage Learning},
author={Wei, Zhengxuan and Tang, Jiajin and Yang, Sibei},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={3401--3412},
year={2025}
}
The annotation files and parts of the implementation are borrowed from Moment-DETR and TR-DETR. Consequently, our code is also released under the MIT License.
