# Project Objective 

1. 

# Baseline

<img src="./baseline.png" alt="nn" width="1000" height="800">

# Our Model

<img src="./ourEmbeddingModel_2.png" alt="nn" width="1000" height="800">

## Baseline
- Task: video-language alignment (Video-to-Text, Text-to-Video)
- Goal:
    - Adapt image-text pre-trained models to video-text pre-training (i.e., post-training)
    - Build a Video Proxy mechanism on top of CLIP (CLIP-ViP)
- Approach:
    - In-domain auxiliary data generation:
        - Bridges the language domain gap between image and video datasets
        - Introduces auxiliary captions into large-scale video-subtitle data to reduce the language domain gap between pre-training and downstream data
            - pre-training: Image-Text learning
            - downstream data: Video-Text learning
    - Video Proxy Mechanism:
        - Enables the Vision Transformer (ViT) model to encode both images and videos
        - Before feeding into CLIP, we concatenate patch tokens with a set of learnable parameters called video proxy tokens
        - The output of the first video proxy token is regarded as the video's representation
- Loss function:
    - InfoNCE loss
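
The two mechanical pieces of the baseline, prepending video proxy tokens and the InfoNCE loss, can be sketched roughly as follows. This is a minimal illustration and not the CLIP-ViP implementation: the class name `ProxyTokenPrepender`, the function `info_nce_loss`, the number of proxy tokens, the initialization scale, and the temperature of 0.07 are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProxyTokenPrepender(nn.Module):
    """Prepends learnable video proxy tokens to a video's patch tokens (hypothetical helper)."""

    def __init__(self, num_proxy_tokens: int, embed_dim: int):
        super().__init__()
        # Learnable parameters shared by all videos; the 0.02 scale is an assumption.
        self.proxy_tokens = nn.Parameter(torch.randn(num_proxy_tokens, embed_dim) * 0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: [batch_size, num_patches, embed_dim]
        batch_size = patch_tokens.shape[0]
        proxies = self.proxy_tokens.unsqueeze(0).expand(batch_size, -1, -1)
        # Result: [batch_size, num_proxy_tokens + num_patches, embed_dim]
        return torch.cat([proxies, patch_tokens], dim=1)


def info_nce_loss(video_embeds: torch.Tensor, text_embeds: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch in which row i of each input is a matched pair."""
    video_embeds = F.normalize(video_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = video_embeds @ text_embeds.t() / temperature  # [batch_size, batch_size]
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)      # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)  # text -> video direction
    return (loss_v2t + loss_t2v) / 2


torch.manual_seed(0)
prepender = ProxyTokenPrepender(num_proxy_tokens=4, embed_dim=512)
tokens = prepender(torch.randn(2, 50, 512))  # random stand-in for patch tokens

# Random stand-ins for the encoder outputs of two matched (video, text) pairs.
loss = info_nce_loss(torch.randn(2, 512), torch.randn(2, 512))
print(tokens.shape, loss.item())
```

In the real model, the video representation would be read from the encoder output at the position of the first proxy token, not from the prepended tokens themselves.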

## Our Model with emotion embedding 
    - 
        

# Model

In [14]:
import torch.nn as nn
import torch
import numpy as np

In [7]:
data = [{
"video": "video7112.mp4",
"video_id": "video7112",
"caption": "while other friends too try and hitting the basket another is eager to achieve his fourth successful basket in basketball",
"duration": 18.35,
"emotion": 1,
"positive": 4.0,
"negative": 0.0,
"neutral": 0.0,
"joy": 4.0,
"trust": 3.0,
"surprise": 1.0,
"anticipation": 3.0,
"fear": 0.0,
"sadness": 0.0,
"disgust": 0.0,
"anger": 0.0
}]
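
A record like the one above can be turned into a fixed-order list of the eight emotion scores before it is converted to a tensor. The sketch below is only an illustration: the helper `emotion_scores` and the constant `EMOTION_KEYS` are hypothetical names, and the ordering is an assumption taken from the key order of the record.

```python
# Assumed ordering of the eight emotion categories (taken from the record's key order).
EMOTION_KEYS = ["joy", "trust", "surprise", "anticipation",
                "fear", "sadness", "disgust", "anger"]


def emotion_scores(record: dict) -> list:
    """Collects the eight emotion scores of a record in a fixed order (hypothetical helper)."""
    # Missing keys are treated as 0.0, i.e. the emotion is considered absent.
    return [float(record.get(key, 0.0)) for key in EMOTION_KEYS]


record = {
    "video_id": "video7112",
    "joy": 4.0, "trust": 3.0, "surprise": 1.0, "anticipation": 3.0,
    "fear": 0.0, "sadness": 0.0, "disgust": 0.0, "anger": 0.0,
}
scores = emotion_scores(record)
print(scores)
```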

In [5]:
embed_dim = 512
emotion_embedding = nn.Embedding(8, embed_dim)

In [6]:
emotion_embedding

Embedding(8, 512)

In [34]:
# Change non-zero values to 1, effectively binarizing the input
emotions = torch.tensor(np.array([4.0, 3.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0])).unsqueeze(0) 
print(emotions.shape) # (bs, 8)
if emotions is not None:
    emotions = torch.where(emotions > 0, torch.ones_like(emotions), torch.zeros_like(emotions))
    print(f"emotions: {emotions}\n") 

torch.Size([1, 8])
emotions: tensor([[1., 1., 1., 1., 0., 0., 0., 0.]], dtype=torch.float64)



In [35]:
# Retrieve all emotion embeddings
batch_size = 1 
all_emotion_embeds = emotion_embedding.weight.unsqueeze(0).repeat(batch_size, 1, 1)  # [batch_size, 8, embed_dim]

In [36]:
all_emotion_embeds.shape

torch.Size([1, 8, 512])

In [40]:
if emotions is not None:
    emotion_mask = emotions.unsqueeze(-1).type_as(all_emotion_embeds)  # [batch_size, 8, 1]
    print(f"emotion_mask: {emotion_mask}, shape = {emotion_mask.shape}\n")
    selected_emotion_embeds = all_emotion_embeds * emotion_mask  # [batch_size, 8, embed_dim]
    print(f"selected_emotion_embeds: {selected_emotion_embeds.shape}\n")
    emotion_embeds = selected_emotion_embeds.sum(1) / (emotion_mask.sum(1) + 1e-8)  # [batch_size, embed_dim]

    print(f"emotion_embeds.shape: {emotion_embeds.shape}")

emotion_mask: tensor([[[1.],
         [1.],
         [1.],
         [1.],
         [0.],
         [0.],
         [0.],
         [0.]]]), shape = torch.Size([1, 8, 1])

selected_emotion_embeds: torch.Size([1, 8, 512])

emotion_embeds.shape: torch.Size([1, 512])
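
The binarize, mask, and average steps from the cells above can be collected into one function. The following is a sketch under the same assumptions as the notebook (8 emotions, `embed_dim = 512`); the function name `masked_mean_emotion_embedding` is hypothetical. It also shows what happens when a caption has no emotion at all: the result is a zero vector, because the `eps` term keeps the division finite.

```python
import torch
import torch.nn as nn


def masked_mean_emotion_embedding(emotion_embedding: nn.Embedding,
                                  emotion_scores: torch.Tensor,
                                  eps: float = 1e-8) -> torch.Tensor:
    """Averages the embeddings of all emotions whose score is positive."""
    # emotion_scores: [batch_size, num_emotions]; any positive score counts as "present".
    mask = (emotion_scores > 0).to(emotion_embedding.weight.dtype).unsqueeze(-1)  # [bs, n, 1]
    all_embeds = emotion_embedding.weight.unsqueeze(0)  # [1, n, dim], broadcasts over the batch
    summed = (all_embeds * mask).sum(dim=1)             # [bs, dim]
    count = mask.sum(dim=1)                             # [bs, 1]
    return summed / (count + eps)


torch.manual_seed(0)
emotion_embedding = nn.Embedding(8, 512)
scores = torch.tensor([
    [4.0, 3.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0],  # four emotions present
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # no emotion present
])
pooled = masked_mean_emotion_embedding(emotion_embedding, scores)
print(pooled.shape)
```

Broadcasting the weight matrix avoids the explicit `repeat(batch_size, 1, 1)` and gives the same result.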


In [None]:
seq_length = ? 

emotion_embeds = emotion_embeds.unsqueeze(1).expand(-1, seq_length, -1)  # [batch_size, seq_length, embed_dim]

position_embeddings = self.position_embedding(position_ids)

# Add the emotion embedding to the token and position embeddings
embeddings = inputs_embeds + position_embeddings + emotion_embeds
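
The fragment above depends on a surrounding module (`self`, `position_ids`, `inputs_embeds`) and on a sequence length that has not been fixed yet. One way it could fit together is sketched below. Everything here is an assumption made for illustration: the class name `TextEmbeddingsWithEmotion`, the vocabulary size, the maximum number of positions, and the choice to read `seq_length` from the shape of `input_ids`.

```python
import torch
import torch.nn as nn


class TextEmbeddingsWithEmotion(nn.Module):
    """Token + position embeddings plus one emotion vector added at every position."""

    def __init__(self, vocab_size: int, max_positions: int, embed_dim: int,
                 num_emotions: int = 8):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        self.position_embedding = nn.Embedding(max_positions, embed_dim)
        self.emotion_embedding = nn.Embedding(num_emotions, embed_dim)

    def forward(self, input_ids: torch.Tensor, emotions: torch.Tensor = None) -> torch.Tensor:
        # Assumption: the sequence length is simply that of the tokenized input.
        batch_size, seq_length = input_ids.shape
        position_ids = torch.arange(seq_length, device=input_ids.device).unsqueeze(0)
        embeddings = self.token_embedding(input_ids) + self.position_embedding(position_ids)
        if emotions is not None:
            # Masked mean over the emotions that are present, as in the cells above.
            mask = (emotions > 0).to(embeddings.dtype).unsqueeze(-1)  # [bs, n, 1]
            weight = self.emotion_embedding.weight.unsqueeze(0)       # [1, n, dim]
            pooled = (weight * mask).sum(dim=1) / (mask.sum(dim=1) + 1e-8)
            # Broadcast the per-caption emotion vector over all positions.
            embeddings = embeddings + pooled.unsqueeze(1).expand(-1, seq_length, -1)
        return embeddings


torch.manual_seed(0)
module = TextEmbeddingsWithEmotion(vocab_size=1000, max_positions=77, embed_dim=512)
input_ids = torch.randint(0, 1000, (2, 16))  # random stand-in for tokenized captions
emotions = torch.tensor([[4.0, 3.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0],
                         [0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 1.0]])
with_emotion = module(input_ids, emotions)
without_emotion = module(input_ids)
print(with_emotion.shape)
```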

# Future Work 

1. Test Data Collection:
    - Use the image-captioning model OFA-Caption (Wang et al., 2022b) to generate one caption for the middle frame of each video in HD-VILA-100M
        - with a maximum length of 16 words
2. Model development
    - Our current model simply aggregates (adds) the nn.Embedding output for each caption. Instead, we could concatenate the emotion embedding and pass the result through an FC layer.
    - We could build a new emotion encoder and try a new cross-attention over video, text, and emotion features. (1)
    - Extract emotion information from the video as well and add it to the video encoder embedding, so that the video side also reflects emotion.
3. Data Processing
    - Lack of training and test data:
        - Step 1 of Data Preprocessing (Sentiment Analysis) filters out a large number of videos, which leaves too little data for training and testing. We will therefore skip the video-level filtering and run only step 2, the emotion assignment for each caption, so that the entire original dataset can be used.
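
The concat-then-FC idea from the model development item could look roughly like the sketch below. This is an untested design idea, not something the current model does: the class name `ConcatEmotionFusion` and the single linear projection back to `embed_dim` are assumptions.

```python
import torch
import torch.nn as nn


class ConcatEmotionFusion(nn.Module):
    """Concatenates the emotion vector to every token embedding, then projects back."""

    def __init__(self, embed_dim: int):
        super().__init__()
        # Maps [token ; emotion] of size 2 * embed_dim back to embed_dim.
        self.fc = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, token_embeds: torch.Tensor, emotion_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: [batch_size, seq_length, embed_dim]
        # emotion_embeds: [batch_size, embed_dim]
        seq_length = token_embeds.shape[1]
        expanded = emotion_embeds.unsqueeze(1).expand(-1, seq_length, -1)
        return self.fc(torch.cat([token_embeds, expanded], dim=-1))


torch.manual_seed(0)
fusion = ConcatEmotionFusion(embed_dim=512)
fused = fusion(torch.randn(2, 16, 512), torch.randn(2, 512))  # random stand-in inputs
print(fused.shape)
```

Compared with plain addition, the projection lets the model learn how strongly the emotion signal should influence each dimension, at the cost of extra parameters.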


<reference>
(1) X. Zhang, M. Li, S. Lin, H. Xu and G. Xiao, "Transformer-based Multimodal Emotional Perception for Dynamic Facial Expression Recognition in the Wild," in IEEE Transactions on Circuits and Systems for Video Technology, doi: 10.1109/TCSVT.2023.3312858.


# TODO 

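- Visualize how the change in model architecture changes the embedding space, using a visualization approach such as t-SNE.
- The goal of our model is to recommend the top few most plausible videos for previously unseen text. We therefore feed in new emotion-specific text that does not appear in the existing test dataset (e.g., "a video of an angry person") and track how the ranking of the videos judged to contain that emotion changes.
- pipeline / application
    - The current model only reports metrics such as recall for v2t and t2v, so we ultimately need to implement an application that actually recommends videos.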
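
The ranking comparison described above could be tracked with a small helper like the one below. It is a sketch on toy 2-dimensional embeddings, not on real model outputs: `rank_videos` and `rank_of` are hypothetical names, and the two text vectors merely stand in for the same query encoded by the baseline and by the emotion-aware model.

```python
import torch
import torch.nn.functional as F


def rank_videos(text_embed: torch.Tensor, video_embeds: torch.Tensor) -> torch.Tensor:
    """Returns video indices sorted from most to least similar to the text (cosine)."""
    sims = F.normalize(video_embeds, dim=-1) @ F.normalize(text_embed, dim=-1)
    return torch.argsort(sims, descending=True)


def rank_of(video_index: int, ranking: torch.Tensor) -> int:
    """1-based position of a video within a ranking."""
    return int((ranking == video_index).nonzero(as_tuple=True)[0].item()) + 1


# Toy stand-ins: three videos and one query encoded by two different models.
video_embeds = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
text_baseline = torch.tensor([1.0, 0.1])
text_emotion = torch.tensor([0.1, 1.0])

ranking_before = rank_videos(text_baseline, video_embeds)
ranking_after = rank_videos(text_emotion, video_embeds)

target_video = 1  # index of the video judged to contain the queried emotion
print(rank_of(target_video, ranking_before), "->", rank_of(target_video, ranking_after))
top_k = ranking_after[:2]  # the top-2 videos that would be recommended
```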