Soldelli/Awesome-Temporal-Language-Grounding-in-Videos

Awesome-Temporal-Sentence-Grounding-in-Videos

A curated list of Temporal Sentence Grounding in Videos papers and benchmarks.
The task is also commonly referred to as:

  • Single Video Moment Retrieval (SVMR)
  • Temporal Activity Localization via Language Query (TALL)
  • Natural Language Grounding in Videos

Task definition: given an untrimmed video and a language query, the video grounding task aims to localize a temporal moment (ts, te) in the video that matches the query. (Panels b-d of the original overview figure illustrate the common multi-modality interaction schemes investigated in the literature.)
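Agreement between a predicted moment and the ground-truth moment (ts, te) is scored with temporal IoU. A minimal sketch (the function name is illustrative, not taken from any particular codebase):

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two moments, each a (start, end) pair in seconds."""
    start_p, end_p = pred
    start_g, end_g = gt
    # Overlap length, clamped at zero for disjoint moments.
    intersection = max(0.0, min(end_p, end_g) - max(start_p, start_g))
    union = max(end_p, end_g) - min(start_p, start_g)
    return intersection / union if union > 0 else 0.0
```

For example, a prediction (0 s, 10 s) against a ground truth (5 s, 15 s) overlaps for 5 s over a 15 s span, giving an IoU of 1/3.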


00 - Table of Contents

  • 01 - Datasets
  • 02 - Benchmark Results
  • 03 - Papers


01 - Datasets


Videos Statistics

| Dataset | Features (Download) | Train Videos | Val Videos | Test Videos | Avg. Duration (Minutes) | Total Duration (Hours) |
|---|---|---|---|---|---|---|
| TACoS | C3D | 75 | 27 | 25 | 4.78 | 10.1 |
| Charades-STA | VGG16, I3D (LGI), I3D (DRN) | 5336 | 0 | 1334 | 0.50 | 57.1 |
| DiDeMo | VGG16 | 8511 | 1094 | 1037 | 0.50 | 88.7 |
| ActivityNet Captions | C3D | 10009 | 4917 (val1) / 4885 (val2) | 5044 | 1.96 | 487.6 |
| MAD | CLIP | 488 | 50 | 112 | 110.77 | 1207.3 |

Sentences Statistics

| Dataset | Features (Download) | Train Queries | Val Queries | Test Queries | Avg. Tokens | Total Tokens (Millions) |
|---|---|---|---|---|---|---|
| TACoS | | 10146 | 4589 | 4083 | 10.5 | 0.2 |
| Charades-STA | | 12404 | 0 | 3720 | 7.2 | 0.1 |
| DiDeMo | | 33005 | 4180 | 4021 | 8.0 | 0.3 |
| ActivityNet Captions | | 37421 | 17505 (val1) / 17031 (val2) | ? | 14.8 | 1.0 |
| MAD | CLIP | 280183 | 32064 | 72044 | 12.7 | 5.0 |


Language Statistics - (Unique tokens)

| Dataset | Adjectives | Nouns | Verbs | Vocabulary |
|---|---|---|---|---|
| TACoS | 0.2 K | 0.9 K | 0.6 K | 2.3 K |
| Charades-STA | 0.1 K | 0.6 K | 0.4 K | 1.3 K |
| DiDeMo | 0.6 K | 4.1 K | 1.9 K | 7.5 K |
| ActivityNet Captions | 1.1 K | 7.4 K | 3.7 K | 15.4 K |
| MAD | 5.3 K | 35.5 K | 13.1 K | 61.4 K |



02 - Benchmark Results

  • Evaluation metric: Recall@k for IoU=m (link).

  • NOTE: For ActivityNet Captions, val1, val2, or a combination of the two splits is used for evaluation. The most common choice is to use val1 as a validation set and val2 as a test set. This is necessary because the official test set is withheld for competition purposes.
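The Recall@k for IoU=m metric can be sketched as follows (a hedged illustration, not code from any benchmark toolkit): a query counts as a hit if any of its top-k ranked moments reaches temporal IoU >= m against the ground truth.

```python
def recall_at_k(predictions, ground_truths, k, m):
    """R@k for IoU=m: fraction of queries whose top-k ranked moments contain
    at least one prediction with temporal IoU >= m against the ground truth."""
    def tiou(a, b):
        # Temporal IoU between two (start, end) moments.
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0

    hits = sum(
        any(tiou(pred, gt) >= m for pred in preds[:k])
        for preds, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)
```

For instance, with two queries where only one has a moment in its top-2 list overlapping the ground truth at IoU 0.8, R@2 for IoU=0.5 is 0.50.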

Methods are classified as:

  • FS: Fully supervised
  • WS: Weakly supervised
  • RL: Reinforcement Learning

Format

* `Model` [ID](link) | `Features` |  R@k IoU=m |...| R@k IoU=m | Method |

Click the paper ID to jump to the paper details (link to the PDF, venue, year, authors, and link to the GitHub repo when available).


ActivityNet Captions (val1)

| Model | Features | R@1 IoU=0.3 | R@1 IoU=0.5 | R@1 IoU=0.7 | R@5 IoU=0.3 | R@5 IoU=0.5 | R@5 IoU=0.7 | Method |
|---|---|---|---|---|---|---|---|---|
| ACRN [12] | C3D | 31.29 | 16.17 | - | - | - | - | FS |
| A2C [19] | C3D | - | 36.90 | - | - | - | - | RL |
| DEBUG [27] | C3D | 55.91 | 39.72 | - | - | - | - | FS |
| ExCL [28] | I3D | 63.00 | 43.60 | 23.60 | - | - | - | FS |
| TSP-PRL [37] | C3D | 56.08 | 38.76 | - | - | - | - | RL |
| GDP [40] | C3D | 56.17 | 39.27 | - | - | - | - | FS |
| DRN [41] | C3D | - | 42.49 | 22.25 | - | 71.85 | 45.96 | FS |
| VSLNet [48] | I3D | 63.16 | 43.22 | 26.16 | - | - | - | FS |

ActivityNet Captions (val2)

| Model | Features | R@1 IoU=0.3 | R@1 IoU=0.5 | R@1 IoU=0.7 | R@5 IoU=0.3 | R@5 IoU=0.5 | R@5 IoU=0.7 | Method |
|---|---|---|---|---|---|---|---|---|
| CTRL [6] | C3D | 47.43 | 29.01 | - | 75.32 | 59.17 | - | FS |
| TGN [10] | C3D | 43.81 | 27.93 | 11.86 | 54.56 | 44.20 | 24.84 | FS |
| TGN [10] | VGG16 | 42.24 | 23.90 | - | 51.82 | 40.17 | - | FS |
| TGN [10] | Inception-V4 | 45.51 | 28.47 | - | 57.32 | 43.33 | - | FS |
| QSPN [17] | C3D | 52.12 | 33.26 | - | 77.72 | 62.39 | - | FS |
| WSDEC-W [26] | -- | 62.7 | 42.00 | 23.3 | - | - | - | WS |
| WSLLN [26] | -- | 75.4 | 42.80 | 22.7 | - | - | - | WS |
| CMIN [29] | C3D | 64.41 | 44.62 | 24.48 | 82.39 | 69.66 | 52.96 | FS |
| 2D-TAN (pool) [38] | C3D | 59.45 | 44.51 | 26.54 | 85.53 | 77.13 | 61.96 | FS |
| 2D-TAN (conv) [38] | C3D | 58.75 | 44.05 | 27.38 | 85.65 | 76.65 | 62.26 | FS |
| SCN [39] | C3D | 47.23 | 29.22 | - | 71.45 | 55.69 | - | WS |
| DRN [41] | C3D | - | 45.45 | 24.36 | - | 77.97 | 50.30 | FS |
| HVTG [45] | OBJ | 57.60 | 40.15 | 18.27 | - | - | - | FS |
| PMI [46] | C3D | 59.69 | 38.28 | 17.83 | - | - | - | FS |
| DPIN [54] | C3D | 62.40 | 47.27 | 28.31 | 87.52 | 77.45 | 60.03 | FS |
| FIAN [55] | C3D | 64.10 | 47.90 | 29.81 | 87.59 | 77.64 | 59.66 | FS |
| CSMGAN [56] | C3D | 68.52 | 49.11 | 29.15 | 87.68 | 77.43 | 59.63 | FS |
| SMRN [58] | C3D | - | 42.97 | 26.79 | - | 76.46 | 60.51 | FS |
| VLG-Net [67] | C3D | - | 46.32 | 29.82 | - | 77.15 | 63.33 | FS |

ActivityNet Captions (val1 + val2)

| Model | Features | R@1 IoU=0.3 | R@1 IoU=0.5 | R@1 IoU=0.7 | R@5 IoU=0.3 | R@5 IoU=0.5 | R@5 IoU=0.7 | Method |
|---|---|---|---|---|---|---|---|---|
| QSPN [17] | C3D | 45.30 | 27.70 | 13.60 | 75.70 | 59.20 | 38.30 | FS |
| ABLR [20] | C3D | 55.67 | 36.79 | - | - | - | - | RL |
| SCDM [25] | C3D | 54.80 | 36.75 | 19.86 | 77.29 | 64.99 | 41.53 | FS |
| CBP [36] | C3D | 54.30 | 35.76 | 17.80 | 77.63 | 65.89 | 46.20 | FS |
| LGI [43] | C3D | 58.52 | 41.51 | 23.07 | - | - | - | FS |
| TripNet [47] | C3D | 48.42 | 32.19 | 13.93 | - | - | - | RL |
| TMLGA [49] | I3D | 51.28 | 33.04 | 19.26 | - | - | - | FS |

TACoS (test)

| Model | Features | R@1 IoU=0.1 | R@1 IoU=0.3 | R@1 IoU=0.5 | R@1 IoU=0.7 | R@5 IoU=0.1 | R@5 IoU=0.3 | R@5 IoU=0.5 | R@5 IoU=0.7 | Method |
|---|---|---|---|---|---|---|---|---|---|---|
| CTRL [6] | C3D | 24.32 | 18.32 | 13.30 | - | 48.73 | 36.69 | 25.42 | - | FS |
| TGN [10] | C3D | 41.87 | 21.77 | 18.90 | 11.88 | 53.40 | 39.06 | 31.02 | 15.26 | FS |
| ACRN [12] | C3D | 24.22 | 19.52 | 14.62 | - | 47.42 | 34.97 | 24.88 | - | FS |
| MCF [13] | C3D | 25.84 | 18.64 | 12.53 | - | 52.96 | 37.13 | 24.73 | - | FS |
| ROLE [14] | C3D | 20.37 | 15.38 | 9.94 | - | 45.45 | 31.17 | 20.13 | - | FS |
| VAL [15] | C3D | 25.74 | 19.76 | 14.74 | - | 51.87 | 38.55 | 26.52 | - | FS |
| QSPN [17] | C3D | 25.31 | 20.15 | 15.23 | - | 53.21 | 36.72 | 25.30 | - | FS |
| ABLR [20] | C3D | 34.70 | 19.50 | 9.40 | - | - | - | - | - | FS |
| SAP [21] | VGG16 | 31.15 | - | 18.24 | - | 53.51 | - | 28.11 | - | FS |
| SMRL [24] | VGG16 | 26.51 | 20.25 | 15.95 | - | 50.01 | 38.47 | 27.84 | - | RL |
| SCDM [25] | C3D | - | 26.11 | 21.17 | - | - | 40.16 | 32.18 | - | FS |
| DEBUG [27] | C3D | 41.15 | 23.45 | 11.72 | - | - | - | - | - | FS |
| ExCL [28] | I3D | - | 45.50 | 28.00 | 13.80 | - | - | - | - | FS |
| CMIN [29] | C3D | 36.88 | 27.33 | 19.57 | - | 64.93 | 43.35 | 28.53 | - | FS |
| CMIN [29] | I3D | 41.73 | 32.35 | 22.54 | - | 69.15 | 50.75 | 32.11 | - | FS |
| SLTA [31] | C3D+FRCNN | 23.13 | 17.07 | 11.92 | - | 46.52 | 32.90 | 20.86 | - | FS |
| ACL-K [32] | C3D | 31.64 | 24.17 | 20.01 | - | 57.85 | 42.15 | 30.66 | - | FS |
| CBP [36] | C3D | - | 27.31 | 24.79 | 19.10 | - | 43.64 | 37.40 | 25.59 | FS |
| 2D-TAN (pool) [38] | C3D | 47.59 | 37.29 | 25.32 | - | 70.31 | 57.81 | 45.04 | - | FS |
| 2D-TAN (conv) [38] | C3D | 46.44 | 35.22 | 25.19 | - | 74.43 | 56.94 | 44.21 | - | FS |
| GDP [40] | C3D | 39.68 | 24.14 | 13.50 | - | - | - | - | - | FS |
| DRN [41] | C3D | - | - | 23.17 | - | - | - | 33.36 | - | FS |
| TripNet [47] | C3D | - | 23.95 | 19.17 | 9.52 | - | - | - | - | RL |
| VSLNet [48] | I3D | 29.61 | 24.27 | 20.03 | - | - | - | - | - | FS |
| TMLGA [49] | I3D | - | 24.54 | 21.65 | 16.46 | - | - | - | - | FS |
| DPIN [54] | C3D | 59.04 | 46.74 | 32.92 | - | 75.78 | 62.16 | 50.26 | - | FS |
| FIAN [55] | C3D | 39.55 | 33.87 | 28.58 | - | 56.14 | 47.76 | 39.16 | - | FS |
| CSMGAN [56] | C3D | 42.74 | 33.90 | 27.09 | - | 68.97 | 53.98 | 41.22 | - | FS |
| SMRN [58] | C3D | 50.44 | 42.49 | 32.07 | - | 77.28 | 66.63 | 52.84 | - | FS |
| LGN [64] | C3D | 52.46 | 41.71 | 30.57 | - | 76.86 | 63.06 | 50.76 | - | FS |
| VLG-Net [67] | C3D | 57.21 | 45.46 | 34.19 | - | 81.80 | 70.38 | 56.56 | - | FS |

DiDeMo (test)

| Model | Features | R@1 IoU=0.5 | R@1 IoU=0.7 | R@1 IoU=1.0 | R@5 IoU=0.5 | R@5 IoU=0.7 | R@5 IoU=1.0 | Method |
|---|---|---|---|---|---|---|---|---|
| MCN [5] | VGG16 | - | - | 13.10 | - | - | 44.82 | FS |
| MCN [5] | Flow | - | - | 18.35 | - | - | 56.25 | FS |
| MCN [5] | VGG16+Flow | - | - | 19.88 | - | - | 62.39 | FS |
| MCN [5] | VGG16+Flow+TEF | - | - | 28.10 | - | - | 78.21 | FS |
| TMN [9] | VGG16 | - | - | 18.71 | - | - | 72.97 | FS |
| TMN [9] | Flow | - | - | 19.90 | - | - | 75.14 | FS |
| TMN [9] | VGG16+Flow | - | - | 22.92 | - | - | 76.08 | FS |
| TGN [10] | VGG16 | - | - | 24.28 | - | - | 71.43 | FS |
| TGN [10] | Flow | - | - | 27.52 | - | - | 76.94 | FS |
| TGN [10] | VGG16+Flow | - | - | 28.23 | - | - | 79.26 | FS |
| ACRN [12] | VGG16 | 27.44 | 16.65 | - | 69.43 | 29.45 | - | FS |
| ROLE [14] | VGG16 | 29.40 | 15.68 | - | 70.72 | 33.08 | - | FS |
| MAN [22] | TAN | - | - | 27.02 | - | - | 81.70 | FS |
| TGA [23] | VGG16+Flow | - | - | 12.19 | - | - | 39.74 | WS |
| SMRL [24] | VGG16+FRCNN | - | - | 31.06 | - | - | 80.45 | RL |
| WSLLN [26] | VGG16 | - | - | 19.40 | - | - | 53.10 | WS |
| WSLLN [26] | Flow | - | - | 18.40 | - | - | 54.40 | WS |
| SLTA [31] | VGG16+FRCNN | 30.92 | 17.16 | - | 70.18 | 33.87 | - | FS |
| VLANet [44] | VGG16 | - | - | 19.32 | - | - | 65.68 | WS |
| RTBPN [51] | VGG16 | - | - | 20.38 | - | - | 55.88 | WS |
| RTBPN [51] | Flow | - | - | 20.52 | - | - | 57.72 | WS |
| RTBPN [51] | VGG16+Flow | - | - | 20.79 | - | - | 60.26 | WS |
| VLG-Net [67] | VGG16 | 33.35 | 25.57 | 25.57 | 88.86 | 71.72 | 71.65 | FS |
| LoGAN [69] | VGG16+Flow | - | - | 39.20 | - | - | 64.04 | WS |

Charades-STA (test)

| Model | Features | R@1 IoU=0.3 | R@1 IoU=0.5 | R@1 IoU=0.7 | R@5 IoU=0.3 | R@5 IoU=0.5 | R@5 IoU=0.7 | Method |
|---|---|---|---|---|---|---|---|---|
| CTRL [6] | C3D | - | 23.63 | 8.89 | - | 58.92 | 29.52 | FS |
| ACRN [12] | C3D | - | 20.26 | 7.64 | - | 71.99 | 27.79 | FS |
| ROLE [14] | C3D | - | 21.74 | 7.82 | - | 70.37 | 30.06 | FS |
| VAL [15] | C3D | - | 23.12 | 9.16 | - | 61.26 | 27.98 | FS |
| ASST [16] | C3D | - | 42.72 | 24.06 | - | 71.32 | 43.98 | FS |
| QSPN [17] | C3D | 54.70 | 35.60 | 15.80 | 95.80 | 79.40 | 45.40 | FS |
| ABLR [20] | C3D | - | 24.36 | 9.01 | - | - | - | FS |
| SAP [21] | VGG16 | - | 27.42 | 13.36 | - | 66.37 | 38.15 | FS |
| MAN [22] | VGG16 | - | 41.24 | 20.54 | - | 83.21 | 51.85 | FS |
| MAN [22] | I3D | - | 46.53 | 22.72 | - | 86.23 | 53.72 | FS |
| TGA [23] | -- | 32.14 | 19.94 | 8.84 | 56.58 | 65.52 | 33.51 | WS |
| SMRL [24] | VGG16 | - | 24.36 | 11.17 | - | 61.25 | 32.08 | RL |
| SCDM [25] | I3D | - | 54.44 | 33.43 | - | 74.43 | 58.08 | FS |
| DEBUG [27] | C3D | - | 37.39 | 17.69 | - | - | - | FS |
| ExCL [28] | I3D | 65.10 | 44.10 | 22.40 | - | - | - | FS |
| SLTA [31] | C3D+FRCNN | - | 22.81 | 8.25 | - | 72.39 | 31.46 | FS |
| ACL [32] | C3D | - | 26.47 | 11.23 | - | 61.51 | 33.23 | FS |
| ACL-K [32] | C3D | - | 30.48 | 12.20 | - | 64.84 | 35.13 | FS |
| CBP [36] | C3D | - | 36.80 | 18.87 | - | 70.94 | 50.19 | FS |
| TSP-PRL [37] | C3D | - | 37.39 | 17.69 | - | - | - | RL |
| TSP-PRL [37] | Two Streams | - | 45.30 | 24.73 | - | - | - | RL |
| 2D-TAN (pool) [38] | VGG16 | - | 39.70 | 23.31 | - | 80.32 | 51.26 | FS |
| 2D-TAN (conv) [38] | VGG16 | - | 39.81 | 23.25 | - | 79.33 | 52.15 | FS |
| SCN [39] | C3D | 42.96 | 23.58 | 9.97 | 95.56 | 71.80 | 38.87 | WS |
| GDP [40] | C3D | - | 39.47 | 18.49 | - | - | - | FS |
| DRN [41] | VGG16 | - | 42.90 | 23.68 | - | 87.80 | 54.87 | FS |
| DRN [41] | C3D | - | 45.40 | 26.40 | - | 88.01 | 55.38 | FS |
| DRN [41] | I3D | - | 53.09 | 31.75 | - | 89.06 | 60.05 | FS |
| LGI [43] | I3D | - | 59.46 | 35.48 | - | - | - | FS |
| VLANet [44] | C3D | - | 31.83 | 14.17 | - | 82.85 | 33.09 | WS |
| HVTG [45] | FRCNN | - | 47.27 | 23.30 | - | - | - | FS |
| PMI [46] | C3D | - | 39.73 | 19.27 | - | - | - | FS |
| TripNet [47] | C3D | 51.33 | 38.29 | 16.07 | - | - | - | RL |
| VSLNet [48] | I3D | - | 54.19 | 35.22 | - | - | - | FS |
| TMLGA [49] | I3D | 67.53 | 52.02 | 33.74 | - | - | - | FS |
| RTBPN [51] | C3D | 60.04 | 32.36 | 13.24 | 97.48 | 71.85 | 41.18 | WS |
| DPIN [54] | VGG16 | - | 47.98 | 26.96 | - | 85.53 | 55.00 | FS |
| FIAN [55] | I3D | - | 58.55 | 37.72 | - | 87.80 | 63.52 | FS |
| WSTG [61] | -- | 39.80 | 27.30 | 12.90 | - | - | - | WS |
| LGN [64] | VGG16 | - | 48.15 | 26.67 | - | 86.80 | 53.01 | FS |
| LoGAN [69] | C3D | - | 34.68 | 14.54 | - | 74.30 | 39.11 | WS |



03 - Papers

Markdown format:

* `ID` | `Model Acronym` | `Conference` | [Paper Name](link) | Author 1 et al |  [GitHub](link)

Analysis and Surveys

| ID | Model | Venue | Title | Authors | Code |
|---|---|---|---|---|---|
| - | -- | BMVC 2020 | Uncovering Hidden Challenges in Query-Based Video Moment Retrieval | Otani et al | |
| - | -- | AAAI 2022 | A Closer Look at Temporal Sentence Grounding in Videos: Datasets and Metrics | Yuan et al | GitHub |
| - | -- | ArXiv 2021 | A Survey on Temporal Sentence Grounding in Videos | Lan et al | |
| - | -- | ArXiv 2022 | The Elements of Temporal Sentence Grounding in Videos: A Survey and Future Directions | Zhang et al | |
| - | -- | ArXiv | A Survey on Natural Language Video Localization | Liu et al | |

Early works

| ID | Model | Venue | Title | Authors | Code |
|---|---|---|---|---|---|
| 1 | -- | ACL 2013 | Grounded Language Learning from Video Described with Sentences | Yu et al | |
| 2 | -- | CVPR 2014 | Visual Semantic Search: Retrieving Videos via Complex Textual Queries | Lin et al | |
| 3 | -- | AAAI 2015 | Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework | Xu et al | |
| 4 | -- | IJCAI 2016 | Unsupervised Alignment of Actions in Video with Text Descriptions | Song et al | |

2017

| ID | Model | Venue | Title | Authors | Code |
|---|---|---|---|---|---|
| 5 | MCN | ICCV | Localizing Moments in Video with Natural Language | Hendricks et al | GitHub |
| 6 | CTRL | ICCV | TALL: Temporal Activity Localization via Language Query | Gao et al | GitHub |
| 7 | -- | ArXiv | Where to Play: Retrieval of Video Segments using Natural-Language Queries | Lee et al | |

2018

| ID | Model | Venue | Title | Authors | Code |
|---|---|---|---|---|---|
| 8 | FIFO | ECCV | Find and Focus: Retrieve and Localize Video Events with Natural Language Queries | Shao et al | |
| 9 | TMN | ECCV | Temporal Modular Networks for Retrieving Complex Compositional Activities in Videos | Liu et al | |
| 10 | TGN | EMNLP | Temporally Grounding Natural Sentence in Video | Chen et al | GitHub |
| 11 | TEMPO | EMNLP | Localizing Moments in Video with Temporal Language | Hendricks et al | GitHub |
| 12 | ACRN | SIGIR | Attentive Moment Retrieval in Videos | Liu et al | GitHub |
| 13 | MCF | IJCAI | Multi-modal Circulant Fusion for Video-to-Language and Backward | Wu et al | GitHub |
| 14 | ROLE | ACM MM | Cross-modal Moment Localization in Videos | Liu et al | GitHub |
| 15 | VAL | PRCM | VAL: Visual-attention action localizer | Song et al | |
| 16 | ASST | ArXiv | Attentive Sequence to Sequence Translation for Localizing Clips of Interest by Natural Language Descriptions | Ning et al | |

2019

| ID | Model | Venue | Title | Authors | Code |
|---|---|---|---|---|---|
| 17 | QSPN | AAAI | Multilevel Language and Vision Integration for Text-to-Clip Retrieval | Xu et al | GitHub |
| 18 | LNet | AAAI | Localizing Natural Language in Videos | Chen et al | |
| 19 | A2C | AAAI | Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos | He et al | GitHub |
| 20 | ABLR | AAAI | To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression | Yuan et al | GitHub |
| 21 | SAP | AAAI | Semantic Proposal for Activity Localization in Videos via Sentence Query | Chen et al | |
| 22 | MAN | CVPR | MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment | Zhang et al | GitHub |
| 23 | TGA | CVPR | Weakly Supervised Video Moment Retrieval From Text Queries | Mithun et al | GitHub |
| 24 | SMRL | CVPR | Language-Driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model | Wang et al | |
| 25 | SCDM | NIPS | Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos | Yuan et al | GitHub |
| 26 | WSLLN | EMNLP | WSLLN: Weakly Supervised Natural Language Localization Networks | Gao et al | |
| 27 | DEBUG | EMNLP | DEBUG: A Dense Bottom-Up Grounding Approach for Natural Language Video Localization | Lu et al | |
| 28 | ExCL | NAACL | ExCL: Extractive Clip Localization Using Natural Language Descriptions | Ghosh et al | |
| 29 | CMIN | SIGIR | Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos | Zhang et al | GitHub |
| 30 | CMIN | IEEE | Moment Retrieval via Cross-Modal Interaction Networks With Query Reconstruction | Zhang et al | GitHub |
| 31 | SLTA | ICMR | Cross-Modal Video Moment Retrieval with Spatial and Language-Temporal Attention | Jiang et al | GitHub |
| 32 | ACL | WACV | MAC: Mining Activity Concepts for Language-based Temporal Localization | Ge et al | GitHub |
| 33 | WSSTG | ACL | Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video | Chen et al | GitHub |
| 34 | TCMN | ACM | Exploiting Temporal Relationships in Video Moment Localization with Natural Language | Zhang et al | GitHub |
| 35 | CAL | ArXiv | Temporal Localization of Moments in Video Collections with Natural Language | Escorcia et al | GitHub |

2020

| ID | Model | Venue | Title | Authors | Code |
|---|---|---|---|---|---|
| 36 | CBP | AAAI | Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction | Wang et al | GitHub |
| 37 | TSP-PRL | AAAI | Tree-Structured Policy based Progressive Reinforcement Learning for Temporally Language Grounding in Video | Wu et al | GitHub |
| 38 | 2D-TAN | AAAI | Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language | Zhang et al | GitHub1, GitHub2 |
| 39 | SCN | AAAI | Weakly-Supervised Video Moment Retrieval via Semantic Completion Network | Lin et al | |
| 40 | GDP | AAAI | Rethinking the Bottom-Up Framework for Query-based Video Localization | Chen et al | |
| 41 | DRN | CVPR | Dense Regression Network for Video Grounding | Zeng et al | GitHub |
| 42 | STGRN | CVPR | Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences | Zhang et al | GitHub |
| 43 | LGI | CVPR | Local-Global Video-Text Interactions for Temporal Grounding | Mun et al | GitHub |
| 44 | VLANet | ECCV | VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval | Ma et al | |
| 45 | HVTG | ECCV | Hierarchical Visual-Textual Graph for Temporal Activity Localization via Language | Chen et al | GitHub |
| 46 | PMI | ECCV | Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos | Chen et al | |
| 47 | TripNet | BMVC | Tripping through Time: Efficient Localization of Activities in Videos | Hahn et al | |
| 48 | VSLNet | ACL | Span-based Localizing Network for Natural Language Video Localization | Zhang et al | GitHub |
| 49 | TMLGA | WACV | Proposal-free Temporal Moment Localization of a Natural-Language Query in Video using Guided Attention | Rodriguez-Opazo et al | GitHub |
| 50 | -- | NIPS | Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding | Zhang et al | |
| 51 | RTBPN | ACM | Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment Retrieval in Videos | Zhang et al | |
| 52 | STRONG | ACM | STRONG: Spatio-Temporal Reinforcement Learning for Cross-Modal Video Moment Localization | Cao et al | |
| 53 | AVMR | ACM | Adversarial Video Moment Retrieval by Jointly Modeling Ranking and Localization | Cao et al | |
| 54 | DPIN | ACM | Dual Path Interaction Network for Video Moment Localization | Wang et al | |
| 55 | FIAN | ACM | Fine-grained Iterative Attention Network for Temporal Language Localization in Videos | Qu et al | |
| 56 | CSMGAN | ACM | Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization | Liu et al | GitHub |
| 57 | -- | DAVU | Cross-Modality Video Segment Retrieval with Ensemble Learning | Yu et al | |
| 58 | SMRN | ISNN | Semantic Modulation Based Residual Network for Temporal Language Queries Grounding in Video | Chen et al | |
| 59 | -- | Journal | Cross-modal video moment retrieval based on visual-textual relationship alignment | Chen et al | |
| 60 | -- | ArXiv | Video Moment Retrieval via Natural Language Queries | Yu et al | |
| 61 | WSTG | ArXiv | Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video | Chen et al | |
| 62 | MARN | ArXiv | Weakly-Supervised Multi-Level Attentional Reconstruction Network for Grounding Textual Queries in Videos | Song et al | |
| 63 | LGN | ArXiv | Language Guided Networks for Cross-modal Moment Retrieval | Liu et al | |
| 64 | ACRM | ArXiv | Frame-wise Cross-modal Match for Video Moment Retrieval | Tang et al | |
| 65 | CMA | ArXiv | A Simple Yet Effective Method for Video Temporal Grounding with Cross-Modality Attention | Zhang et al | |
| 66 | -- | ArXiv | Natural Language Video Localization: A Revisit in Span-based Question Answering Framework | Zhang et al | |

2021

| ID | Model | Venue | Title | Authors | Code |
|---|---|---|---|---|---|
| 67 | VLG-Net | ICCVW | VLG-Net: Video-Language Graph Matching Network for Video Grounding | Soldan et al | GitHub |
| 68 | LoGAN | WACV | LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval | Tan et al | |
| 69 | CBLN | CVPR | Context-aware Biaffine Localizing Network for Temporal Sentence Grounding | Liu et al | GitHub |
| 70 | DeNet | CVPR | Embracing Uncertainty: Decoupling and De-bias for Robust Temporal Grounding | Zhou et al | |
| 70 | DORi | WACV | DORi: Discovering Object Relationships for Moment Localization of a Natural Language Query in a Video | Rodriguez-Opazo et al | GitHub |
| 71 | PEARL | WACV | Natural Language Video Moment Localization Through Query-Controlled Temporal Convolution | Zhang et al | |
| 72 | IVG-DCL | CVPR | Interventional Video Grounding With Dual Contrastive Learning | Nan et al | GitHub |
| 73 | SMIN | CVPR | Structured Multi-Level Interaction Network for Video Moment Localization via Language Query | Wang et al | |
| 74 | -- | CVPR | Multi-Stage Aggregated Transformer Network for Temporal Language Localization in Videos | Zhang et al | |
| 75 | MMRG | CVPR | Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval | Zeng et al | |
| 76 | CPN | CVPR | Cascaded Prediction Network via Segment Tree for Temporal Video Grounding | Zhao et al | |
| 77 | CRM | CVPR | Cross-Sentence Temporal and Semantic Relations in Video Activity Localisation | Huang et al | |
| 78 | FVMR | CVPR | Fast Video Moment Retrieval | Gao et al | |
| 79 | RMN | ACL | Reasoning Step-by-Step: Temporal Sentence Localization in Videos via Deep Rectification-Modulation Network | Liu et al | |
| 80 | -- | ACL | Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding | Wang et al | |
| 81 | VCA | ACM | Visual Co-Occurrence Alignment Learning for Weakly-Supervised Video Moment Retrieval | Wang et al | |
| 82 | CI-MHA | ACM | Cross Interaction Network for Natural Language Guided Video Moment Retrieval | Yu et al | |
| 83 | MABAN | Journal | MABAN: Multi-Agent Boundary-Aware Network for Natural Language Moment Retrieval | Yu et al | |
| 84 | CFSTRI | Journal | Coarse-to-Fine Spatial-Temporal Relationship Inference for Temporal Sentence Grounding | Qi et al | |
| 85 | -- | Journal | Regularized Two Granularity Loss Function for Weakly Supervised Video Moment Retrieval | Teng et al | |
| 86 | ACRM | Journal | Frame-wise Cross-modal Matching for Video Moment Retrieval | Tang et al | |
| 87 | DCT-net | Journal | DCT-net: A deep co-interactive transformer network for video temporal grounding | Qi et al | |
| 88 | SV-VMR | Journal | Diving Into The Relations: Leveraging Semantic and Visual Structures For Video Moment Retrieval | Wu et al | |
| 89 | CAN | Journal | Context-aware network with foreground recalibration for grounding natural language in video | Chen et al | |
| 90 | -- | Journal | Multi-scale 2D Representation Learning for weakly-supervised moment retrieval | Li et al | |
| 91 | LCNet | Journal | Local Correspondence Network for Weakly Supervised Temporal Sentence Grounding | Yang et al | |
| 92 | CLEAR | Journal | Coarse-to-Fine Semantic Alignment for Cross-Modal Moment Localization | Hu et al | |
| 93 | VSLNet | Journal | Natural Language Video Localization: A Revisit in Span-based Question Answering Framework | Zhang et al | |
| 94 | VSRNet | Journal | VSRNet: End-to-end video segment retrieval with text query | Sun et al | |
| 95 | MS-2D-TAN | Journal | Multi-Scale 2D Temporal Adjacency Networks for Moment Localization with Natural Language | Zhang et al | GitHub |
| 96 | U-VMR | Journal | Learning Video Moment Retrieval Without a Single Annotated Video | Gao et al | |
| 97 | CPNet | AAAI | Proposal-Free Video Grounding with Contextual Pyramid Network | Li et al | |
| 98 | DepNet | AAAI | Dense Events Grounding in Video | Bao et al | |
| 99 | BPNet | AAAI | Boundary Proposal Network for Two-Stage Natural Language Video Localization | Xiao et al | |
| 100 | STVGBert | ICCV | STVGBert: A Visual-Linguistic Transformer Based Framework for Spatio-Temporal Video Grounding | Su et al | |
| 101 | BSP | ICCV | Boundary-sensitive Pre-training for Temporal Localization in Videos | Xu et al | GitHub |
| 102 | SSCS | ICCV | Support-Set Based Cross-Supervision for Video Grounding | Ding et al | |
| 103 | DCM | SIGIR | Deconfounded Video Moment Retrieval with Causal Intervention | Yang et al | GitHub |
| 104 | -- | ArXiv | Video Moment Retrieval with Text Query Considering Many-to-Many Correspondence Using Potentially Relevant Pair | Maeoki et al | |
| 105 | HDRR | ArXiv | Hierarchical Deep Residual Reasoning for Temporal Moment Localization | Ma et al | GitHub |
| 106 | RaNet | EMNLP | Relation-aware Video Reading Comprehension for Temporal Language Grounding | Gao et al | GitHub |
| 107 | GTR | ArXiv | On Pursuit of Designing Multi-modal Transformer for Video Grounding | Cao et al | |
| 108 | SeqPAN | ArXiv | Parallel Attention Network with Sequence Matching for Video Grounding | Zhang et al | |
| 109 | S^4TLG | ArXiv | Self-supervised Learning for Semi-supervised Temporal Language Grounding | Luo et al | |
| 110 | IA-Net | EMNLP | Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding | Liu et al | |
| 111 | LPNet | ArXiv | Natural Language Video Localization with Learnable Moment Proposals | Xiao et al | |
| 112 | PLN | ArXiv | Progressive Localization Networks for Language-based Moment Localization | Zheng et al | |
| 113 | SNEAK | ArXiv | SNEAK: Synonymous Sentences-Aware Adversarial Attack on Natural Language Video Localization | Gou et al | |
| 113 | MGSL-Net | ArXiv | Memory-Guided Semantic Learning Network for Temporal Sentence Grounding | Liu et al | |
| 114 | MMFA-CF | IWACIII | A Multi-modal Fusion Algorithm for Cross-modal Video Moment Retrieval | Jia et al | |

2022

| ID | Model | Venue | Title | Authors | Code |
|---|---|---|---|---|---|
| 115 | MARN | ArXiv | Exploring Motion and Appearance Information for Temporal Sentence Grounding | Liu et al | |
| 116 | DebiasTLL | ArXiv | Learning Sample Importance for Cross-Scenario Video Temporal Grounding | Bao et al | |
| 117 | DebiasTLL | Journal | Video Moment Retrieval With Cross-Modal Neural Architecture Search | Yang et al | GitHub |
| 118 | CDN | Journal | Cross-modal Dynamic Networks for Video Moment Retrieval with Text Query | Yang et al | GitHub |
| 119 | CDN | AAAI | Unsupervised Temporal Video Grounding with Deep Semantic Clustering | Liu et al | |
| 120 | APGN | ACL | Adaptive Proposal Generation Network for Temporal Sentence Localization in Videos | Liu et al | |
| 121 | PLRN | AVSS | Position-aware Location Regression Network for Temporal Video Grounding | Kim et al | |
| 122 | PRVG | ArXiv | End-to-End Dense Video Grounding via Parallel Regression | Shi et al | |
| 123 | MQEI | Journal | Multi-Level Query Interaction for Temporal Language Grounding | Tang et al | |
| 124 | LocFormer | ArXiv | LocFormer: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach | Rodriguez-Opazo et al | |
| 125 | STCM-Net | Journal | STCM-Net: A symmetrical one-stage network for temporal language localization in videos | Jia et al | |
| 126 | TACI | Journal | Learning to combine the modalities of language and video for temporal moment localization | Shin et al | |
| 127 | -- | ArXiv | Multi-Scale Self-Contrastive Learning with Hard Negative Mining for Weakly-Supervised Query-based Video Grounding | Mo et al | |
| 128 | MA3SRN | ArXiv | Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding | Liu et al | |
| 129 | -- | AAAI | Explore Inter-Contrast Between Videos via Composition for Weakly Supervised Temporal Sentence Grounding | Chen et al | |
| 130 | -- | CVPR | MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions | Soldan et al | GitHub |
| 131 | -- | ArXiv | Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning | Li et al | GitHub |



Licenses

CC0

To the extent possible under law, muketong has waived all copyright and related or neighboring rights to this work.
