## Problem Definition 

- Emotion-Specialized Text-to-Video Retrieval Task
  - The user enters a text description of the video they want to retrieve. At inference time, our emotion-specialized text-to-video retrieval model computes the similarity between the query and each video in the user's directory, which yields a cosine similarity matrix.
  - Concretely, the task at inference is to compare the embedding features of the text query with the video vectors by cosine similarity, sort the videos in descending order of similarity, and return the top n most relevant videos.
  - We used CLIP-ViP, a text-to-video / video-to-text retrieval model that has learned the association between the text and video modalities well.
  - Our additional goal was to build a model that performs the retrieval task better on emotion-related queries.
  - The reason is that people live in their memories, so we expected that a user looking for a particular video would enter a query that describes emotions.
  - We therefore expected that extracting 8 emotions from the text and embedding them would let us improve the model so that text and video cluster well per emotion in the embedding space.
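
The inference step described in the list above (ranking videos by cosine similarity to the text query and keeping the top n) can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `cosine_similarity` and `retrieve_top_n` and the toy 3-dimensional embeddings are assumptions made for this example, and real embeddings would come from the text and video encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_n(text_embedding, video_embeddings, n):
    """Return (video_index, similarity) pairs for the n most similar videos."""
    scores = [
        (index, cosine_similarity(text_embedding, video))
        for index, video in enumerate(video_embeddings)
    ]
    # Highest similarity first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]


# Hypothetical toy embeddings (assumed values, for illustration only).
query = [1.0, 0.0, 1.0]
videos = [
    [0.9, 0.1, 1.1],   # nearly parallel to the query
    [-1.0, 0.0, 0.0],  # points away from the query
    [0.0, 1.0, 0.0],   # orthogonal to the query
]
top2 = retrieve_top_n(query, videos, n=2)
```

Here `top2` lists video 0 first and video 2 second, since the video that points away from the query has a negative similarity.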

![fig1](problem.png)

Here's how it works: A user asks, "Could you find a video of me having a great time at my birthday party with my family?"
Our pipeline analyzes the text, understanding the emotional cues and content context. It then sifts through a sea of video data, pinpointing those precious moments that align with the query. The output is a collection of video clips ranked by relevance, ready to transport the user back to those joyous times.


## Objectives

People remember moments through emotions. In the digital age, we can be said to collect emotions rather than videos. When someone wants to find a particular memory, the text query is more likely to involve emotions than simple actions alone. However, finding that one special moment in an ocean of digital content is challenging. We aim to implement an Emotion-Specialized Text-to-Video retrieval model suited to this task.

## Baseline Model 

![fig2](baseline.png)

Fig. The baseline model used for our emotion-specialized text-to-video retrieval model

The figure above shows the baseline model (Xue et al., 2023) that we used for our emotion-specialized text-to-video retrieval model. It adapts image-text pre-trained models to video-text pre-training (i.e. post-pretraining). The authors add video proxy tokens to an existing CLIP-based image-text pre-trained model, generate data suited to text-to-video, and train on it, and on that basis propose an Omnisource Cross-modal Learning method. Below we briefly describe the baseline model's video proxy mechanism, omnisource cross-modal learning, and loss.


< baseline model > 
  - Each (video, caption of that video) pair is mapped into the same embedding space.
  - The model is trained with contrastive learning so that positive pairs lie close together. Contrastive learning is a self-supervised learning technique that places the embedding feature vectors of positive pairs close together and those of negative pairs far apart.
  - Through experiments, the authors judged that in large-scale video-text pre-training data the domain gap between "subtitles" and videos is large. To reduce this gap, they used an image-captioning model to generate an auxiliary caption for the middle frame of each video. This generated caption is denoted C.
    - For the input type of the image/frame in ViT, the authors use linear interpolation to get a middle temporal positional embedding, then treat the image/frame as a special single-frame video.
  - In addition, to build an encoder module that processes both images and videos with the existing Vision Transformer (ViT), they devised a proxy-guided video attention mechanism. In the attention computation of each block, separate video proxy tokens are added and interact with all other tokens, while patch tokens interact only with the video proxy tokens and with patch tokens within the same frame. (Each frame is split into patch tokens.)


To summarize, the notation is as follows:
  - V: Video
  - S: Subtitles for a video
  - F: Middle frame of each video
  - C: Caption for the corresponding middle frame

We thus have (V, S) pairs and their corresponding (F, C) pairs.

This method enables joint training on both videos and images in the same batch, as the proxy-guided attention mechanism reduces the difference in calculations between video and image.
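
The attention rule described above (proxy tokens interact with every token, patch tokens only with proxy tokens and patches of the same frame) can be pictured as a boolean attention mask. The sketch below is a simplified illustration under stated assumptions: the function name `proxy_guided_mask`, the token ordering (proxy tokens first, then patches frame by frame), and the omission of any other special tokens are choices made for this example, not details taken from the baseline's implementation.

```python
def proxy_guided_mask(num_proxy, num_frames, patches_per_frame):
    """Build mask[q][k], True when query token q may attend to key token k.

    Assumed token order: proxy tokens first, then patch tokens frame by frame.
    """
    total = num_proxy + num_frames * patches_per_frame

    def frame_of(token):
        # None marks a proxy token; otherwise the frame index of a patch token.
        if token < num_proxy:
            return None
        return (token - num_proxy) // patches_per_frame

    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            q_frame, k_frame = frame_of(q), frame_of(k)
            # Proxy tokens see and are seen by everything;
            # patch tokens otherwise stay within their own frame.
            row.append(q_frame is None or k_frame is None or q_frame == k_frame)
        mask.append(row)
    return mask


# One proxy token, two frames of two patches each: tokens 0 | 1 2 | 3 4.
mask = proxy_guided_mask(num_proxy=1, num_frames=2, patches_per_frame=2)

# A single image treated as a one-frame video: nothing is masked out.
image_mask = proxy_guided_mask(num_proxy=1, num_frames=1, patches_per_frame=3)
```

With a single frame the mask is all `True`, so an image goes through the same computation as ordinary full attention, which is one way to see why videos and images can share a batch.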

## Loss function - OMNISOURCE CROSS-MODAL LEARNING 

- The info-NCE loss is used.
- The visual sources of the data are videos and frames, and the text sources are subtitles and captions; the model is trained with a source-wise info-NCE loss.

$\mathcal{L}_{v 2 t}=-\frac{1}{B} \sum_{i=1}^B \log \frac{e^{v_i^{\top} t_i / \tau}}{\sum_{j=1}^B e^{v_i^{\top} t_j / \tau}}, \quad \mathcal{L}_{t 2 v}=-\frac{1}{B} \sum_{i=1}^B \log \frac{e^{t_i^{\top} v_i / \tau}}{\sum_{j=1}^B e^{t_i^{\top} v_j / \tau}}$
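
The two losses above can be computed directly from a batch of embeddings. The following is a minimal sketch in plain Python, assuming a fixed temperature `tau=0.07` (in the paper the temperature is learnable) and hypothetical helper names (`dot`, `l2_normalize`, `info_nce`, `alignment_loss`); averaging the two directions follows the paper's description of the overall alignment loss.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(anchors, targets, tau):
    """Mean over i of -log( exp(a_i.t_i / tau) / sum_j exp(a_i.t_j / tau) )."""
    batch_size = len(anchors)
    total = 0.0
    for i in range(batch_size):
        logits = [dot(anchors[i], targets[j]) / tau for j in range(batch_size)]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denominator - logits[i]
    return total / batch_size


def alignment_loss(visual, text, tau=0.07):
    """Average of L_v2t and L_t2v on L2-normalized embeddings (tau is assumed fixed here)."""
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in text]
    return 0.5 * (info_nce(v, t, tau) + info_nce(t, v, tau))


# Toy batch of size 2: aligned pairs versus deliberately swapped pairs.
visual = [[1.0, 0.0], [0.0, 1.0]]
aligned = alignment_loss(visual, [[1.0, 0.0], [0.0, 1.0]])
swapped = alignment_loss(visual, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch `aligned` is close to zero while `swapped` is large, which matches the intent of pulling matched pairs together and pushing mismatched pairs apart.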


Taken from the paper 

"""

- To learn rich video-language alignment from video-subtitle pairs and reduce the language domain gap with downstream data by corresponding auxiliary frame-caption pairs, we study joint Cross-Modal Learning on the omnisource input.

- Following most works of learning multimodal alignment on dual encoders (Radford et al., 2021; Xue et al., 2022; Li et al., 2021; Luo et al., 2020; Xu et al., 2021b; Luo et al., 2021), we use **info-NCE loss** to perform **contrastive learning**.
 
- There are two formats of visual source: video sequences and single frames, and two types of text source: subtitles and captions in our work.
  
- We denote them by V, F, S, and C respectively 
  
- We define a source-wise info-NCE loss by: 

$\mathcal{L}_{v 2 t}=-\frac{1}{B} \sum_{i=1}^B \log \frac{e^{v_i^{\top} t_i / \tau}}{\sum_{j=1}^B e^{v_i^{\top} t_j / \tau}}, \quad \mathcal{L}_{t 2 v}=-\frac{1}{B} \sum_{i=1}^B \log \frac{e^{t_i^{\top} v_i / \tau}}{\sum_{j=1}^B e^{t_i^{\top} v_j / \tau}}$


where $v_i$ and $t_j$ are the normalized embeddings of the $i$-th visual feature in $X \in\{V, F\}$ and the $j$-th text feature in $Y \in\{S, C\}$ in a batch of size $B$. $\tau$ is a learnable temperature. The overall alignment loss $\mathcal{L}_{X \leftrightarrow Y}$ is the average of $\mathcal{L}_{v 2 t}$ and $\mathcal{L}_{t 2 v}$. For example, $\mathcal{L}_{V \leftrightarrow S}$ represents info-NCE loss within video-subtitle pairs in a batch, which pulls aligned pairs together in embedding space while pushing apart misaligned pairs.
We study the reasonable variants of OCL: (a) $\mathcal{L}_{V \leftrightarrow S}+\mathcal{L}_{F \leftrightarrow C}$ : Simple combination of two source-wise losses on video-subtitle and frame-caption pairs; (b) $\mathcal{L}_{V \leftrightarrow S}+\mathcal{L}_{V \leftrightarrow C}$ : As there is also content correlation between videos and their middle-frame captions, we explore adding a loss on video-caption pairs to the baseline loss $\mathcal{L}_{V \leftrightarrow S}$; (c) $\mathcal{L}_{V \leftrightarrow S}+\mathcal{L}_{V \leftrightarrow C}+\mathcal{L}_{F \leftrightarrow C}$ : Combination of (a) and (b); (d) $\mathcal{L}_{V \leftrightarrow S, C}+\mathcal{L}_{F \leftrightarrow C}$ : A video corresponds to both a subtitle and an auxiliary caption. Compared to (c), the number of negative pairs in $\mathcal{L}_{v 2 t}$ can be expanded. The $\mathcal{L}_{v 2 t}$ in $\mathcal{L}_{V \leftrightarrow S, C}$ is rewritten as:
$$
\mathcal{L}_{v 2 t}=-\frac{1}{2 B} \sum_{i=1}^B\left(\log \frac{e^{v_i^{\top} s_i / \tau}}{\sum_{j=1}^B e^{v_i^{\top} s_j / \tau}+e^{v_i^{\top} c_{j \neq i} / \tau}}+\log \frac{e^{v_i^{\top} c_i / \tau}}{\sum_{j=1}^B e^{v_i^{\top} c_j / \tau}+e^{v_i^{\top} s_{j \neq i} / \tau}}\right),
$$
where $s_i \in S$ and $c_i \in C$. The $\mathcal{L}_{t 2 v}$ in $\mathcal{L}_{V \leftrightarrow S, C}$ is equal to (c). We compare all variants with the baseline $\mathcal{L}_{V \leftrightarrow S}$ and report results in Section 5.


"""

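The rewritten $\mathcal{L}_{v 2 t}$ for variant (d) in the excerpt above can be sketched as below. This is one reading of the formula, stated as an assumption: the extra term $e^{v_i^{\top} c_{j \neq i} / \tau}$ is interpreted as a sum over the captions (or subtitles) of the other samples in the batch, so that they act as additional negatives. The function name `v2t_loss_variant_d` and the fixed `tau=0.07` are likewise choices made for this example; the inputs are assumed to be L2-normalized already.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def v2t_loss_variant_d(v, s, c, tau=0.07):
    """v, s, c: lists of B L2-normalized video, subtitle, and caption embeddings."""
    batch_size = len(v)
    total = 0.0
    for i in range(batch_size):
        sub = [math.exp(dot(v[i], s[j]) / tau) for j in range(batch_size)]
        cap = [math.exp(dot(v[i], c[j]) / tau) for j in range(batch_size)]
        # Subtitle as positive: negatives are the other subtitles
        # plus the captions of the other samples (assumed reading of j != i).
        denom_s = sum(sub) + sum(cap[j] for j in range(batch_size) if j != i)
        # Caption as positive: negatives are the other captions
        # plus the subtitles of the other samples.
        denom_c = sum(cap) + sum(sub[j] for j in range(batch_size) if j != i)
        total += math.log(sub[i] / denom_s) + math.log(cap[i] / denom_c)
    return -total / (2 * batch_size)


# Toy batch of size 2 in which subtitles and captions coincide with the videos.
videos = [[1.0, 0.0], [0.0, 1.0]]
matched = v2t_loss_variant_d(videos, videos, videos)
mismatched = v2t_loss_variant_d(videos, videos[::-1], videos[::-1])
```

Because each positive is contrasted against both text sources, every video sees roughly twice as many negatives as in the plain source-wise loss, which is the expansion the excerpt refers to.
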
## References

Xue, Hongwei, et al. "CLIP-ViP: Adapting pre-trained image-text model to video-language representation alignment." arXiv preprint arXiv:2209.06430 (2022).

Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International conference on machine learning. PMLR, 2021.

Xue, Hongwei, et al. "Advancing high-resolution video-language representation with large-scale video transcriptions." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.