# Emotion Specialized Text to Video Retrieval
## Final Report (5 Apr 2024)
Team: 14 

Members: 

__Daye Lee__ 

__Wonseon Lim__ 

__Hyejin Oh__

__Paul Hyunbin Cho__ 

## Project Objective 
We aim to create an image text retrieval model that can better find videos taking emotions into account in their queries. In the age of big data, most people now have countless photos and vidoes in their possession, saved. With the immensely large amount of data that they now have, it is becoming increasingly difficult to efficiently find the videos or photos that someone is looking for. In this regard, the field of video-text retrieval has been receiving more attention and their models' capabilities have improved tremendously over the years. However, we believe that performance of these models can be further improved by focusing on the emotional aspects inherent in both the text and the videos. This is especially more important considering growing demand for personalized and emotionally resonant experiences in digital media. __Thus, we hope to develop a tool for users to easily access emotional content in videos, by modifying video-text retrieval models to incorporate emotion data.__ 

In this final report, we provide a step-by-step walkthrough of how to achieve the same results as we did. The report is structured as follows: 
1. Problem Formulation and Outline
2. Data Collection and Preprocessing
3. Proposed Model
4. Model Performance Evaluation
5. Demonstration
6. Disussion and Possible Future Directions

As the experiments are not run on Jupyter Notebooks, we provide the link to our Github, which contains all of the code used for our experiments and implementations: https://github.com/Hyejin3194/MIE1517_Project_Emotion-Text-to-Video-Retrieval.git

In [2]:
! pip install nrclex



In [1]:
import os
import json
import pandas as pd
import nltk

from tqdm import tqdm
from sklearn.model_selection import train_test_split
from nltk.sentiment import SentimentIntensityAnalyzer
from nrclex import NRCLex
from IPython.display import Video

nltk.download('vader_lexicon')
nltk.download('punkt')

# 
os.environ['remote_url'] = "https://github.com/Hyejin3194/MIE1517_Project_Emotion-Text-to-Video-Retrieval.git"
os.environ['repo_name'] = "MIE1517_Project_Emotion-Text-to-Video-Retrieval"
os.environ['pretrained_model'] = "./pretrained_model"

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/ones/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /Users/ones/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
! git clone $remote_url
%cd MIE1517_Project_Emotion-Text-to-Video-Retrieval
! pip install -r requirements.txt

## 1. Problem Formulation and Outline
Let's talk about something that deeply resonates with us all: emotions and memories. In this digital era, we don't just collect videos; we collect emotions. However, finding that one special moment in an ocean of digital content is challenging. That's where our Emotion-Specialized Text-to-Video Retrieval Task comes in. 

The Emotion-Specialized Text-to-Video Retrieval Task:
- The user inputs a video description text as an input to extract the desired video. In the inference stage, to create a cosine similarity matrix, our emotion-specialized text-to-video retrieval model calculates the similarity between the relevant query and the videos in the user's directory.
- During inference, the task is to extract the top-n most relevant videos by listing the video vectors that have close cosine similarity to the embedding features based on the text query in the data, in order of the highest similarity.
- We utilized the CLIP-ViP model, a text-to-video / video-to-text retrieval model that has learned the relationship between the two modalities of text and video well.
- Here, our additional goal was to create a model that performs the retrieval task better for emotion-related queries.
- The reason for this is that we anticipated that if a person needs to find a specific video, they would input a query describing emotions, as people live in their memories.
- Therefore, we thought that after extracting 8 emotions from the text and performing embedding, we could develop the model so that the text and video cluster well for each emotion in the embedding space.

![problem](https://github.com/Hyejin3194/MIE1517_Project_Emotion-Text-to-Video-Retrieval/blob/main/images/problem.png?raw=true)

Here’s how it works: A user inputs a description of the video they are looking for. For example, they could give as input: 'Could you find a video of me having a great time at my birthday party with my family?’ 

Our pipeline analyzes the text, understanding the emotional cues and content context. It then sifts through a sea of video data, pinpointing those precious moments that align most accurately with the query. 

The output? A collection of video clips ranked by relevance, ready to transport the user back to those joyous times.

 It's not just smart tech; it's emotionally intelligent, delivering not just any video, but the right one.


## 2. Data Collection and Preparation
We performed our experiments on the dataset most widely used for Video-Text Retrieval Tasks: MSR-VTT. Most implementations of models for this task usually use the entire dataset(10K YouTube videos with 200K descriptions) but here, we filter the dataset to use only the videos that contain emotions, leaving only 6,000 videos from the entire MSR-VTT dataset for training. Each video is approximately 10-20 seconds long and consists of 20 corresponding captions. We explain the process in detail below. 

### Data Download 
You can simply download the MSR-VTT dataset using this command:
```
bash download_msrvtt.bash $PATH_TO_DOWNLOAD
```
Or you can download the dataset from this link: [MSR-VTT Download](https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip)

In [2]:
os.environ['PATH_TO_DOWNLOAD'] = './data'

In [None]:
! bash download_msrvtt.bash $PATH_TO_DOWNLOAD

### Data (Video) Processing
To enhance the speed of the training, we downscale the frames per seconds to 6. This can be done easily by running our processing script using this command: 
```
bash compress_msrvtt.bash
```

Then, the processed videos are saved into the inner directory of the $PATH_TO_DOWNLOAD as {$PATH_TO_DOWNLOAD}/videos_6fps.


We referred to this link for our processing as following: [MSR-VTT Processing](https://github.com/open-mmlab/mmaction2/tree/main/tools/data/msrvtt)


In [None]:
! bash compress_msrvtt.bash $PATH_TO_DOWNLOAD

* Example video

In [15]:
video_file = './data/msrvtt/videos_6fps/video0.mp4'
Video(video_file, embed=True)

### Data Preprocessing
Next, to extract the emotions in our dataset for our main task, Emotion-Specialized Text-to-Video Retrieval, we perform a two-step preprocessing process as follows.

Step (1): We conduct sentiment analysis on each video caption to determine the presence of sentimental information. The sentiment analysis library calculates neutral, positive, and negative scores for each video caption and decides whether it is positive or negative based on the compound score, which is an overall score derived from these individual scores. For example, if the composite score is above 0.6, we classify the video as positive, and if it is below -0.6, we classify it as negative. We abitrarily set -0.6 and 0.6 as the thresholds for classifying the compound scores representing videos with emotion. We filtered out any videos with compound scores in between -0.6 and 0.6 as videos without sufficient positive or negative sentiments, representative of the existence of emotions. This leaves us with about 6K videos in the training set.
$\rightarrow$ [nltk.sentiment.SentimentIntensityAnalyzer](https://www.nltk.org/api/nltk.sentiment.SentimentIntensityAnalyzer.html?highlight=sentimentintensity)

Step (2): After completing the video selection and filtration process in the first step, we determine the emotional information present in each caption. For this purpose, we use NRCLex, which uses a scoring method based on a predefined lexicon dictionary to calculate eight emotions present in a sentence: joy, trust, fear, surprise, sadness, disgust, anger, and anticipation. We also include the calculation of positive, negative and neutral emotions from the previous step. Consequently, the resulting data for each caption contains information on eight emotions along with positive, negative, and neutral sentiments.
$\rightarrow$ [nrclex.NRCLex](https://pypi.org/project/NRCLex/)


This emotion extraction process is applied to the three data splits we use in our experminets: the refined 6K training data from the first step, the entire 1K test data, and a combination of 34 sentimental and 34 non-sentimental data used to analyze our model's results.



#### Load Data's Annotations

In [None]:
mrsvtt_train9k_dir = "./data/msrvtt/annotations/msrvtt_ret_train9k.json"
mrsvtt_test1k_dir = "./data/msrvtt/annotations/msrvtt_ret_test1k.json"

with open(mrsvtt_train9k_dir, 'r') as f:
    mrsvtt_train9k_json = json.load(f)

with open(mrsvtt_test1k_dir, 'r') as f:
    mrsvtt_test1k_json = json.load(f)

# Extract video_id
train9k = pd.DataFrame(mrsvtt_train9k_json)
test1k = pd.DataFrame(mrsvtt_test1k_json)
train9k['video_id'] = train9k['video'].apply(lambda x: x.split('.')[0])
test1k['video_id'] = test1k['video'].apply(lambda x: x.split('.')[0])
print(train9k.shape, test1k.shape)

#  Group captions by each video_id
merge_caption = lambda x: '. '.join(x)
train_video_text = train9k.groupby('video_id').aggregate('caption').apply(merge_caption)
test_video_text = test1k.groupby('video_id').aggregate('caption').apply(merge_caption)
train_video_text = pd.DataFrame(train_video_text).reset_index()
test_video_text = pd.DataFrame(test_video_text).reset_index()
print(train_video_text.shape, test_video_text.shape)

#### Step (1): Extract Sentimental Video using Sentiment Analysis

We used two methods to extract videos with emotional information: one using SentimentIntensityAnalyzer() and the other involving manual extraction of videos with well-defined emotional information. The extract_emotional_data function uses SentimentIntensityAnalyzer(), while the extract_emotion_manually function uses manual extraction. Out of 7,010 videos in the MSRVTT dataset, 6,137 videos in the training set contain emotional information, and out of 1,000 videos in the test set, 34 videos contain emotional information.

#### Functions Implementation

In [None]:
def extract_emotion_manually(df, text=None):
    # assign the default text
    if text is None:
        text = "happy|sad|afraid|fear|surprise|joy|disgust|annoy| anger|angry|" \
                "excite|excited|exciting|scare|scared|scary|fright|frighten|frightened|frightening" \
                "|fearful|fearless|fearfully"
    # extract the data that contains the text
    manual_df = df[df['caption'].str.contains(text)]
    print("Number of manually selecting data:", len(manual_df))
    return manual_df


def extract_emotional_data(df, sent_bound=0.6, manual=False):
    # Initialize SentimentIntensityAnalyzer
    sia = SentimentIntensityAnalyzer()
    emotion_df = pd.DataFrame()

    # extract the emotional data using the sentiment analysis
    for idx in tqdm(range(len(df))):
        sentence = df.caption[idx]
        # calculate the sentiment score
        sentiment_score = sia.polarity_scores(sentence)
        # determine the emotional data: compound score > 0.6 (positive) or < -0.6 (negative)
        if sentiment_score['compound'] > sent_bound or sentiment_score['compound'] < -sent_bound:
            emotion_df = pd.concat([emotion_df, df.iloc[idx:idx+1, :]])

    print("Number of emotional data:", len(emotion_df))

    # extract the data that contains predefined emotion words
    if manual:
        manual_df  = extract_emotion_manually(df)
        # extract the data that is not in the emotion_df
        if "sen_id" in manual_df.columns:
            only_manual_df = pd.merge(manual_df, emotion_df, on='sen_id', how='outer', indicator=True).query('_merge=="left_only"')
            only_manual_df = only_manual_df.drop(columns=["caption_y", "video_id_y", "_merge"]).rename(columns={"caption_x": "caption", "video_id_x": "video_id"})
        else:
            only_manual_df = pd.merge(manual_df, emotion_df, on='video_id', how='outer', indicator=True).query('_merge=="left_only"')
            only_manual_df = only_manual_df.drop(columns=["caption_y", "_merge"]).rename(columns={"caption_x": "caption"})
        print("Number of data only in manual data:", len(only_manual_df))
        emotion_df = pd.concat([emotion_df, only_manual_df])
        print("Toal number of emotional data:", len(emotion_df))
        return emotion_df

    return emotion_df

In [None]:
# extract videos that contain emotional
emotion_train_row_df = extract_emotional_data(train_video_text, sent_bound=0.6, manual=True)
emotion_test_row_df = extract_emotional_data(test_video_text, sent_bound=0.6, manual=True)

  0%|          | 0/9000 [00:00<?, ?it/s]

100%|██████████| 9000/9000 [00:17<00:00, 526.53it/s]


Number of emotional data: 5984
Number of manually selecting data: 1019
Number of data only in manual data: 153
Toal number of emotional data: 6137


100%|██████████| 1000/1000 [00:00<00:00, 5879.03it/s]

Number of emotional data: 29
Number of manually selecting data: 7
Number of data only in manual data: 5
Toal number of emotional data: 34





In [None]:
# training set
no_emotion_train_row_df = train_video_text[~train_video_text.video_id.isin(emotion_train_row_df.video_id)]
no_emotion_train_row_df.loc[:, 'emotion'] = 0
emotion_train_row_df.loc[:, 'emotion'] = 1
print('Train shape:', emotion_train_row_df.shape, no_emotion_train_row_df.shape)
train_df = pd.concat([emotion_train_row_df, no_emotion_train_row_df])

# test set
num_emo_test = emotion_test_row_df.shape[0]
no_emotion_test_row_df = test_video_text[~test_video_text.video_id.isin(emotion_test_row_df.video_id)].sample(num_emo_test)
no_emotion_test_row_df['emotion'] = 0
emotion_test_row_df['emotion'] = 1
print('Test shape:', emotion_test_row_df.shape, no_emotion_test_row_df.shape)
test_row_df = pd.concat([emotion_test_row_df, no_emotion_test_row_df])
# total test set
no_emotion_test_row_df = test_video_text[~test_video_text.video_id.isin(emotion_test_row_df.video_id)]
no_emotion_test_row_df['emotion'] = 0
emotion_test_row_df['emotion'] = 1
print('Total Test shape:', emotion_test_row_df.shape, no_emotion_test_row_df.shape)
total_test_row_df = pd.concat([emotion_test_row_df, no_emotion_test_row_df])

#### Step (2): Assign Specific Emotion

Using the lexicon provided by the NRC Lexicon Library, we assigned emotion scores to each caption for the eight emotion types defined by Robert Plutchik: joy, trust, fear, surprise, sadness, disgust, anger, and anticipation. This results in a total of 11 columns in the original dataset.

Remember, we apply this process to three datasets in the second step of assigning emotional information: the refined 6K training data from the first step, a combination of 34 sentimental and 34 non-sentimental data for a total of 68 test data, and the entire 1K test data.

In [None]:
# Emotions that NRC Lexicon library has
emostion_list = ["joy", "trust", "fear", "surprise",
                 "sadness", "disgust", "anger", "anticipation"]

#### Functions Implementation

In [None]:
def create_sentiment_columns(df):
    """ Create sentiment columns using NRC lexicon
        : positive, negative, neutral
    """
    df = df.reset_index(drop=True)
    for idx in tqdm(range(df.shape[0])):
        emotion_counts = NRCLex(df.caption.iloc[idx]).raw_emotion_scores
        # positive case
        if 'positive' in emotion_counts.keys():
            df.loc[idx, 'positive'] = emotion_counts['positive']
        # negative case
        if 'negative' in emotion_counts.keys():
            df.loc[idx, 'negative'] = emotion_counts['negative']
        # neutral case
        if 'positive' not in emotion_counts.keys() and 'negative' not in emotion_counts.keys():
            df.loc[idx, 'neutral'] = 1

    df.fillna({'positive': 0, 'negative': 0, 'neutral': 0}, inplace=True)
    return df

def create_emotion_columns(df):
    """ Create emotion columns using NRC lexicon
        : anger, anticipation, disgust, fear, joy, sadness, surprise, trust
    """
    df = df.reset_index(drop=True)
    for idx in tqdm(range(df.shape[0])):
        emotion_counts = NRCLex(df.caption.iloc[idx]).raw_emotion_scores
        # create the emotion columns
        for emo in emostion_list:
            if emo in emotion_counts.keys():
                df.loc[idx, emo] = emotion_counts[emo]

    df.fillna({emo: 0 for emo in emostion_list}, inplace=True)
    return df

In [None]:
# Train
# create the sentiment columns
emotion_train_row_df = create_sentiment_columns(emotion_train_row_df)
# create the emotion columns
emotion_train_row_df = create_emotion_columns(emotion_train_row_df)

# Test
# create the sentiment columns
test_row_df = create_sentiment_columns(test_row_df)
# create the emotion columns
test_row_df = create_emotion_columns(test_row_df)

# Total Test
# create the sentiment columns
total_test_row_df = create_sentiment_columns(total_test_row_df)
# create the emotion columns
total_test_row_df = create_emotion_columns(total_test_row_df)

### Save Data's Annotations

In [None]:
# train
emotion_train_row_df.to_json('./data/msrvtt/msrvtt_train6k.json', orient='records', lines=False)
# test
test_row_df.to_json('./data/msrvtt/msrvtt_test68.json', orient='records', lines=False)
# total test
total_test_row_df.to_json('./data/msrvtt/msrvtt_test1k.json', orient='records', lines=False)

Finally, our training set consists of approximately 6K videos with emotion, and our train, test and experimental test set all have the emotion data from their captions extracted in score form as well. Our data is now ready to be input into our model, which is explained below. 

## 3. Our Proposed Model

### Quick Rundown of Our Baseline, CLIP-ViP

![baseline](https://github.com/Hyejin3194/MIE1517_Project_Emotion-Text-to-Video-Retrieval/blob/main/images/baseline.png?raw=true)

Fig. A Baseline model was used for our emotion-specialized text-to-retrieval model 

For our baseline, we use the CLIP-ViP model, which is a state-of-the-art model in the domain of Video Text Retrieval. The above figure shows the baseline model (Xue et al., 2023) used for our emotion-specialized text-to-retrieval model. The model adapts image-text pre-trained models and performs video-text pre-training (i.e., post-pretraining). By adding a Video Proxy token to the existing CLIP-based image-to-text pre-trained models and generating data suitable for text-to-video, they proposed an Omnisource Cross-modal learning method. Below, we briefly explain the Video Proxy, Omnisource Cross-modal learning, and loss of the baseline model.

< Baseline Model >

(Video, Caption) pairs are mapped into the same embedding space.
- Through contrastive learning, positive pairs are trained to be close to each other. Contrastive learning is a technique in self-supervised learning where embedding feature vectors of positive pairs are placed close together, while embedding feature vectors of negative pairs are placed far apart.
- Through experiments, they determined that there is a large domain gap between "subtitles" and videos in large-scale video-text pre-training data. To reduce this gap, they used an image-captioning model to generate an auxiliary caption for the middle frame of each video. The generated caption is referred to as C.
- For the input type of the image/frame in ViT, linear interpolation is used to get a middle temporal positional embedding, and the image/frame is treated as a special single-frame video.
- Additionally, to create an encoder module that processes both images and videos using the existing Vision Transformer (ViT), they devised a proxy-guided video attention mechanism. In each block, when calculating attention, a separate video proxy token is added and interacts with other tokens, while patch tokens only interact with video proxy tokens and patch tokens within the same frame. (frame -> divided into patch tokens)

In summary, it becomes as follows:

- V: Video
- S: Subtitles for a Video
- F: Middle frame for each video
- C: A caption for the corresponding middle frame
- (V, S) pairs and their corresponding (F, C) pairs are obtained.

This method enables joint training on both videos and images in the same batch, as their proxy-guided attention mechanism reduces the difference in calculations between video and image.

## Loss function - OMNISOURCE CROSS-MODAL LEARNING 

- They used the info-NCE loss.
- The data's visual sources include video and frame, while the text sources include subtitles and captions. They created source-wise info-NCE loss to train the model.

$\mathcal{L}_{v 2 t}=-\frac{1}{B} \sum_{i=1}^B \log \frac{e^{v_i^{\top} t_i / \tau}}{\sum_{j=1}^B e^{v_i^{\top} t_j / \tau}}, \quad \mathcal{L}_{t 2 v}=-\frac{1}{B} \sum_{i=1}^B \log \frac{e^{t_i^{\top} v_i / \tau}}{\sum_{j=1}^B e^{t_i^{\top} v_j / \tau}}$

## Our Model:
We modify this model to additionally take the emotion data we extracted earlier as input. The outline of our model is as follows:

![overview](https://github.com/Hyejin3194/MIE1517_Project_Emotion-Text-to-Video-Retrieval/blob/main/images/overview.png?raw=true)

For each of the eight emotions, we initialize an embedding of the same dimensions as the token and positional embeddings. Then for each input sequence and its corresponding emotion scores, we aggregate the embeddings for each emotion by adding them together to create the final emotion embedding. This is then added to each token embedding alongside the positional encodings. This will be shown in the code later on. 

### Pretrained Model Download Link
To use CLIP-ViP, we must first download the pretrained CLIP model. 

CLIP Model for training: [Download Link](https://hdvila.blob.core.windows.net/dataset/pretrain_clipvip_base_32.pt)

The CLIP-ViP-B/32 model under two settings for test or demo:

Baseline Model trained on MSR-VTT 7K: [Download Link](https://docs.google.com/uc?export=download&id=1zAyNr37N5KZAZ1b5tahGtTZzzvShLpDY)

Our Model trained on emotion MSR-VTT 6K: [Download Link](https://docs.google.com/uc?export=download&id=1PuJd5-Qto4CGu4K2adjnUn0EqwoAsM4z)

You need to put these models into the pretrained_model directory.


### CLIP-ViP Modified Code
We now outline the parts of the CLIP-ViP model that we modify to take the emotion data as input. The entire model code is uploaded on the Github Repository, for refer to that for details. The following is not to be run, and is only for showing what parts of the code we modified. 

#### Dataset and Dataloader
In order for the model to utilize the emotion data that we extracted from the text data, we need to modify the Dataset class that is used to call this newly added data. Below is the modified version of the HDVILAVideoRetrievalDataset class. We mainly update the \_\_getitem\_\_ method to additionally read the emotion data as well. 

In [None]:
import torch
from torch.utils.data import Dataset
import random
import os
import json
from torch.utils.data.dataloader import default_collate
from src.utils.logger import LOGGER
from src.utils.basic_utils import flat_list_of_lists
from src.datasets.data_utils import mask_batch_text_tokens, img_collate
from src.datasets.dataloader import init_transform_dict, init_transform_dict_simple
import decord # video loader 
from decord import VideoReader
from decord import cpu, gpu
decord.bridge.set_bridge("torch")
import math
import torch.nn.functional as F
import numpy as np
import cv2
import lmdb
import glob
import src.utils.stop_words as stop_words
from PIL import Image
from src.datasets.sample_frames import SampleFrames

In [None]:
class HDVILAVideoRetrievalDataset(Dataset):
    """
    datalist
    """

    def __init__(self, cfg, vis_dir, anno_path, vis_format='video', mode="train"):
        assert vis_format in ["video", "frame"]
        self.cfg = cfg
        self.vis_dir = vis_dir
        self.anno_path = anno_path
        self.mode = mode
        self.vis_format = vis_format
        self.n_clips = cfg.train_n_clips if mode == "train" else cfg.test_n_clips
        self.num_frm = cfg.train_num_frms if mode == "train" else cfg.test_num_frms
        self.sample_rate = cfg.sample_rate
        if hasattr(cfg, "text_pos_num"):
            self.pos_num = cfg.pos_num
        else:
            self.pos_num = 1
        self.transform = init_transform_dict_simple(video_res=cfg.video_res,
                                             input_res=cfg.input_res)[mode]
        self.frame_sampler = SampleFrames(clip_len=self.num_frm, 
                                          frame_interval=self.sample_rate, 
                                          num_clips=self.n_clips, 
                                          temporal_jitter=True)
        self.init_dataset_process()


    def init_dataset_process(self):
        json_type = os.path.splitext(self.anno_path)[-1]
        assert json_type in ['.json', '.jsonl']

        if json_type == '.jsonl':
            data = []
            with open(self.anno_path) as f:
                for line in f:
                    data.append(json.loads(line))
        else:
            data = json.load(open(self.anno_path))
        self.datalist = data
        if self.cfg.is_demo:
            self.dir_list = os.listdir(self.vis_dir)

    def id2path(self, id):
        clip_name = id
        if self.vis_format == 'video':
            name = os.path.join(self.vis_dir, clip_name.split('/')[-1]+".mp4")
            if "lsmdc" in self.vis_dir:
                name = os.path.join(self.vis_dir, clip_name + ".avi")
        else:
            name = os.path.join(self.vis_dir, clip_name)
        return name

    def __len__(self):
        if self.cfg.is_demo:
            return len(self.dir_list)
        else:
            return len(self.datalist)

    def get_sample_idx(self, total_frame_num):
        """
        sample rate > 0: use SampleFrames, loop default
        sample rate = 0: uniform sampling, temporal jittering
        """
        if self.sample_rate > 0:
            results = {"total_frames": total_frame_num,
                    "start_index": 0}
            results = self.frame_sampler(results)
            return results["frame_inds"]
        elif self.sample_rate == 0:
            if hasattr(self.cfg, "sample_jitter") and self.cfg.sample_jitter and self.mode == "train":
                interval = int(total_frame_num / (self.n_clips*self.num_frm - 1))
                start = np.random.randint(0, interval+1)
                end = np.random.randint(total_frame_num-1-interval, total_frame_num)
                return np.linspace(start, end, self.n_clips*self.num_frm).astype(int)
            else:
                return np.linspace(0, total_frame_num-1, self.n_clips*self.num_frm).astype(int)

    def load_video(self, vis_path):
        vr = VideoReader(vis_path, ctx=cpu(0))
        total_frame_num = len(vr)

        frame_idx = self.get_sample_idx(total_frame_num)
        img_array = vr.get_batch(frame_idx) # (n_clips*num_frm, H, W, 3)

        img_array = img_array.permute(0, 3, 1, 2).float() / 255.
        img_array = self.transform(img_array)

        return img_array

    def load_frames(self, vis_path, total_frame_num):
        frame_idx = self.get_sample_idx(total_frame_num)

        img_array = []
        for i in frame_idx:
            img = Image.open(os.path.join(vis_path, \
                    vis_path.split('/')[-1] + '_{0:03d}.jpg'.format(i))).convert("RGB")
            img_array.append(np.array(img))
        img_array = torch.from_numpy(np.array(img_array))  # (n_clips*num_frm, H, W, 3)

        img_array = img_array.permute(0, 3, 1, 2).float() / 255.
        img_array = self.transform(img_array)

        return img_array

    # This is where we modify the code to include the emotion data
    def __getitem__(self, index):
        if self.cfg.dummy_data:
            return dict(
            video = torch.randn(self.n_clips*self.num_frm, 3, self.cfg.input_res[0], self.cfg.input_res[1]),  # [clips, num_frm, C, H_crop, W_crop]
            texts = ["This is a dummy sentence, which contains nothing meaningful."]
        )

        if self.cfg.is_demo:
            # Get the list of all files and directories 
            # path = self.vis_dir
            video = self.dir_list[index]
            video_id, _ = os.path.splitext(video)
            vis_id = video_id
            texts = [self.cfg.query]  # for testing
            emotions = self.cfg.emotion
            
        else:
            if not ("video_id" in self.datalist[index].keys()):
                video = self.datalist[index]["video"]
                video_id, _ = os.path.splitext(video)
                vis_id = video_id
                texts = self.datalist[index]['caption']
            else:
                vis_id = self.datalist[index]['video_id']
                texts = self.datalist[index]['caption']

            if isinstance(texts, list):
                texts = random.sample(self.datalist[index]['caption'], self.pos_num)
                if 'didemo' in self.anno_path:
                    texts = [' '.join(self.datalist[index]['caption'])]
            else:
                texts = [texts]

            # We get the emotions from the datalist
            emotions = [self.datalist[index][emotion] for emotion in ["joy", "trust", "surprise", "anticipation", "fear", "sadness", "disgust", "anger"]]
            
        vis_path = self.id2path(vis_id)
        video = self.load_video(vis_path) if self.vis_format=='video' else self.load_frames(vis_path, self.datalist[index]['num_frame'])     

        return dict(
            video = video,  # [clips*num_frm, C, H_crop, W_crop]
            texts = texts,
            emotions = emotions,
            vis_id = vis_id
        )

### Collator for creating batches

A custom collator is used to create the batches of:
* sequences of tokens 
* the attention masks corresponding to each sequence 
* videos
* and now the emotions in each sequence. 

In [None]:
class VideoRetrievalCollator(object):
    def __init__(self, tokenizer, max_length=40, is_train=True):
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.is_train = is_train

    def collate_batch(self, batch):
        if isinstance(batch[0]["video"], torch.Tensor):
            v_collate = default_collate
        else:
            v_collate = img_collate
        video = v_collate([d["video"] for d in batch])

        text_examples = flat_list_of_lists([d["texts"] for d in batch])
        # Add emotion data
        emotions = torch.LongTensor([(d["emotions"]) for d in batch])
        
        # for vis_id collation
        vid_collate = default_collate
        vis_id = vid_collate([d["vis_id"] for d in batch])
        
        text_str_list = [d for d in text_examples]  # (B, )

        batch_enc = self.tokenizer.batch_encode_plus(
            text_str_list,
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt"
        )
        text_input_ids = batch_enc.input_ids  # (B, L)
        text_input_mask = batch_enc.attention_mask  # (B, L)

        # Add emotion data to the final returned batch
        collated_batch = dict(
            video=video,   # [B, clips, num_frm, C, H_crop, W_crop]
            text_input_ids=text_input_ids,
            text_input_mask=text_input_mask,
            emotions=emotions,
            vis_id=vis_id
        )

        return collated_batch

### Text Embeddings
The following code shows our main contributions and the implementation of our ideas. The CLIP model is complicated and consists of a hierarchy of many classes. Largely, it consists of two transformers, each for learning the video and text data together. These transformers are further divided into smaller components like encoder and embedding classes. We start with the CLIPTextEmbeddings class, which we modify to incorporate the emotion data in the creation of the text embeddings. 

We do this by initializing an embedding for each emotion, resulting in a total of 8 emotions. Here, the dimensions of the embeddings are the same as the token and positional embeddings. For each sequence, which has a set of corresponding emotions, we call the embeddings for each emotion and then average them to create a single aggregated emotion embedding. This embedding is then added to each token embedding in the sequence alongside the positional embeddings. This allows the model to incorporate the emotion information extracted from each sequence (caption) when learning their representations. 

In [None]:
class CLIPTextEmbeddings(nn.Module):
    def __init__(self, config: CLIPTextConfig):
        super().__init__()
        embed_dim = config.hidden_size

        self.token_embedding = nn.Embedding(config.vocab_size, embed_dim)
        self.position_embedding = nn.Embedding(config.max_position_embeddings, embed_dim)
        self.emotion_embedding = nn.Embedding(8, embed_dim)

        # position_ids (1, len position emb) is contiguous in memory and exported when serialized
        self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        emotions: Optional[torch.LongTensor] = None,
    ) -> torch.Tensor:
        
        seq_length = input_ids.shape[-1] if input_ids is not None else inputs_embeds.shape[-2]
        batch_size = input_ids.shape[0] if input_ids is not None else inputs_embeds.shape[0]

        if emotions is not None:
            # Change non-zero values to 1, effectively binarizing the input
            emotions = torch.where(emotions > 0, torch.ones_like(emotions), torch.zeros_like(emotions))
        
        # Retrieve all emotion embeddings
        all_emotion_embeds = self.emotion_embedding.weight.unsqueeze(0).repeat(batch_size, 1, 1)  # [batch_size, 8, embed_dim]

        if emotions is not None:
            emotion_mask = emotions.unsqueeze(-1).type_as(all_emotion_embeds)  # [batch_size, 8, 1]
            selected_emotion_embeds = all_emotion_embeds * emotion_mask  # [batch_size, 8, embed_dim]
            emotion_embeds = selected_emotion_embeds.sum(1) / (emotion_mask.sum(1) + 1e-8)  # [batch_size, embed_dim]
        else:
            emotion_embeds = torch.zeros(batch_size, self.token_embedding.embedding_dim, device=input_ids.device if input_ids is not None else inputs_embeds.device)

        emotion_embeds = emotion_embeds.unsqueeze(1).expand(-1, seq_length, -1)  # [batch_size, seq_length, embed_dim]

        if position_ids is None:
            position_ids = self.position_ids[:, :seq_length]

        if inputs_embeds is None:
            inputs_embeds = self.token_embedding(input_ids)

        position_embeddings = self.position_embedding(position_ids)
        embeddings = inputs_embeds + position_embeddings + emotion_embeds

        return embeddings

### CLIPTextTransformer
Here we update the CLIPTextTransformer class to accept the emotion data 

In [None]:
class CLIPTextTransformer(nn.Module):
    def __init__(self, config: CLIPTextConfig):
        super().__init__()
        self.config = config
        embed_dim = config.hidden_size
        self.embeddings = CLIPTextEmbeddings(config)
        self.encoder = CLIPEncoder(config)
        self.final_layer_norm = nn.LayerNorm(embed_dim)

    @add_start_docstrings_to_model_forward(CLIP_TEXT_INPUTS_DOCSTRING)
    @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=CLIPTextConfig)
    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        emotions: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutputWithPooling]:
        r"""
        Returns:
        """
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if input_ids is None:
            raise ValueError("You have to specify either input_ids")

        input_shape = input_ids.size()
        input_ids = input_ids.view(-1, input_shape[-1])

        hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids, emotions=emotions)

        bsz, seq_len = input_shape
        # CLIP's text model uses causal mask, prepare it here.
        # https://github.com/openai/CLIP/blob/cfcffb90e69f37bf2ff1e988237a0fbe41f33c04/clip/model.py#L324
        if_fp16 = hidden_states.dtype == torch.float16
        causal_attention_mask = self._build_causal_attention_mask(bsz, seq_len, fp16=if_fp16).to(hidden_states.device)
        # expand attention_mask
        if attention_mask is not None:
            # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
            attention_mask = _expand_mask(attention_mask, hidden_states.dtype)

        encoder_outputs = self.encoder(
            inputs_embeds=hidden_states,
            attention_mask=attention_mask,
            causal_attention_mask=causal_attention_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        last_hidden_state = encoder_outputs[0]
        last_hidden_state = self.final_layer_norm(last_hidden_state)

        # text_embeds.shape = [batch_size, sequence_length, transformer.width]
        # take features from the eot embedding (eot_token is the highest number in each sequence)
        pooled_output = last_hidden_state[torch.arange(last_hidden_state.shape[0]), input_ids.argmax(dim=-1)]

        if not return_dict:
            return (last_hidden_state, pooled_output) + encoder_outputs[1:]

        return BaseModelOutputWithPooling(
            last_hidden_state=last_hidden_state,
            pooler_output=pooled_output,
            hidden_states=encoder_outputs.hidden_states,
            attentions=encoder_outputs.attentions,
        )

    def _build_causal_attention_mask(self, bsz, seq_len, fp16=False):
        # lazily create causal attention mask, with full attention between the vision tokens
        # pytorch uses additive attention mask; fill with -inf
        mask = torch.empty(bsz, seq_len, seq_len)
        mask.fill_(float("-inf"))
        mask.triu_(1)  # zero out the lower diagonal
        mask = mask.unsqueeze(1)  # expand mask
        if fp16:
            mask = mask.half()
        return mask

#### Model Training
To run model training, you have to add the --is_train argument.

In the config file, insert pretrained clip model path to "e2e_weights_path".

For the dataset, insert the text json file path to the "txt" in "training_dataset", "val_datasets", "inference_dataset".

The argument --blob_mount_dir is for set the directory that can save the training log and checkpoint of the model.

Also, you can adjust the learning rate, batch size, epochs, etc in the config file.


For the emotion embeddin model training, add the --is_embed argument to the command line.

Example of command for training:
- Baseline model
```
python run_video_retrieval.py --config ./src/configs/msrvtt_retrieval/msrvtt_retrieval_vip_base_32.json --blob_mount_dir ./ --is_train
```
- emotion embedding model
```
python run_video_retrieval.py --config ./src/configs/msrvtt_retrieval/msrvtt_retrieval_vip_base_32.json --blob_mount_dir ./ --is_train --is_embed
```

#### Model Test
To execute the code only for test, remove the --is_train argument and replace the "e2e_weights_path" with pretrained clip-vip model. And remove the --is_train argument in the command line.


Example of command for test:
- Baseline model
```
python run_video_retrieval.py --config ./src/configs/msrvtt_retrieval/msrvtt_retrieval_vip_base_32.json --blob_mount_dir ./
```
- emotion embedding model
```
python run_video_retrieval.py --config ./src/configs/msrvtt_retrieval/msrvtt_retrieval_vip_base_32.json --blob_mount_dir ./ --is_embed
```


#### Base model's training curves


![base model results](https://github.com/Hyejin3194/MIE1517_Project_Emotion-Text-to-Video-Retrieval/blob/main/images/base_model_results.png?raw=true)

#### Our model's training curves
  
![our model results](https://github.com/Hyejin3194/MIE1517_Project_Emotion-Text-to-Video-Retrieval/blob/main/images/our_model_results.png?raw=true)

## 4. Model Performance Evaluation

### Evaluation Metric

**Recall@k**: Recall $(\frac{\text{TP}}{\text{TP} + \text{FN}})$ of Top $ k $ samples
- Among the top  $ k $ samples, the number of samples that are actually similar to the query  $ \text{TP} $ is divided by the total number of query samples $\text{ TP + FN}$.


#### Example of Recall@k

For the quantitative evaluation metric, we used the Recall@k metric, which is commonly used for the recommendation task.
Recall@k is to compute the recall between the Top k samples based on the ranking. Therefore, among the top k samples, the number of samples that actually correspond to the query(or video), so called True Positive, is divided by the total number of query(or video) samples.
For example of Recall@5, you can see in the figure of similarity matrix which is ranked by cosine similarity, that consists of 8 text and video pairs. If you look at the first row of the matrix, as we evaluate top 5 ranking, there is the corresponding video at the rank 3 which means true positive of that row becomes 1. Likewise compute R@5 for all the rows in the matrix, we can get the 6 TP divided by 8 of total samples, results in 75% of R@5 for the example matrix.



![recall@5](https://github.com/Hyejin3194/MIE1517_Project_Emotion-Text-to-Video-Retrieval/blob/main/images/recall@5.png?raw=true)


Likewise the Recall@1, you can compute the socre with the top 1 ranking between all samples. There is only one True positive sample among 8 samples, so we can get 1 over 8 R@1 score that is 12.5%.


![recall@1](https://github.com/Hyejin3194/MIE1517_Project_Emotion-Text-to-Video-Retrieval/blob/main/images/recall@1.png?raw=true)



### Quantitative Results (i.e. learning curves, accuracies, etc.)



To first evaluate the effects of training on different size datasplits, we trained the baseline model on the conventional 7k and 9k training sets and validated on the 1k validation set. In addition, we also conducted training on the 6k-emotion dataset we previously created. To evaluate the overall performance of the models, we first evaluate their performance on the entire test set, labeled "Emotion+Neutral" in the tables. On the other hand, __to evaluate the performance of these models on the retrieval of the caption-video pairs with emotions, which we previously identified as 34 videos out of the 1000 videos in the test dataset, we calculate the recall values for only these 34 queries, instead of the total 1000.__ These results are listed in the columns labeled "Emotion" in the tables.

#### MSR-VTT 7k: Baseline Model


| Test Data                 | Emotion+Neutral |  Emotion+Neutral          |  Emotion  | Emotion          |
|:---------------------------:|:--------------:|:--------------:|:--------------:|:--------------:|
| Metric                    | T2V          | V2T          | T2V          | V2T          |
| Recall@1               |   49.4000%   |  47.8044%    |  32.3529%  |  41.1765%  |
| Recall@5               |   73.0000%   |  74.9501%    |  61.7647%  |  67.6471%  |
| Recall@10              |   83.4000%   |   84.4311%   |  73.5294%  |  82.3529%  |
| Recall Median          |     2.0         |   2.0        |    3.5     |    3.0     |
| Recall Mean            |       14.5       |   10.3       |   18.1     |   14.6     |

#### MSR-VTT 9k: Baseline Model


| Test Data                 | Emotion+Neutral |  Emotion+Neutral          |  Emotion  | Emotion          |
|:---------------------------:|:--------------:|:--------------:|:--------------:|:--------------:|
| Metric                    | T2V          | V2T          | T2V          | V2T          |
| Recall@1               |  49.5000%    | 49.3028%     |  35.2941%  |  35.2941%  |
| Recall@5               |   74.7000%   | 76.6932%     |  61.7647%  |  67.6471%  |
| Recall@10              |  84.8000%    | 85.3586%     |  73.5294%  |  82.3529%  |
| Recall Median          |   2.0        |    2.0       |    2.5     |    3.0     |
| Recall Mean            |   13.4       |    9.5       |   15.9     |   13.1     |


#### MSR-VTT 6k (Emotion): Baseline Model
| Test Data                 | Emotion+Neutral |  Emotion+Neutral          |  Emotion  | Emotion          |
|:---------------------------:|:--------------:|:--------------:|:--------------:|:--------------:|
| Metric                    | T2V          | V2T          | T2V          | V2T          |
| Recall@1               |   49.0000%   |  48.4032%    |  29.4118%  |  32.3529%  |
| Recall@5               |    73.2000%  |  75.6487%    |  58.8235%  |  70.5882%  |
| Recall@10              |  84.5000%    |  84.7305%    |  79.4118%  |  79.4118%  |
| Recall Median          |   2.0        |   2.0        |    3.5     |    2.0     |
| Recall Mean            |    13.6      |   9.9        |   16.8     |   14.7     |


#### MSR-VTT 6k (Emotion): Our Model (Emotion Embeddings)
| Test Data                 | Emotion+Neutral |  Emotion+Neutral          |  Emotion  | Emotion          |
|:---------------------------:|:--------------:|:--------------:|:--------------:|:--------------:|
| Metric                    | T2V          | V2T          | T2V          | V2T          |
| Recall@1                  |  23.9000%    |    44.8104%  | 5.8824%      | 8.8235%      |
| Recall@5                  |  41.5000%    |  28.1437%    | 14.7059%     | 20.5882%     |
| Recall@10                 |      49.00%  |   50.9980%   | 23.5294%     | 20.5882%     |
| Recall Median             |     11.0     |   9.5        | 83.0         | 66.5         |
| Recall Mean               |      124.4   |      72.6    | 206.6        | 115.5        |


As expected, in regard to the model's performance on emotion-containing queries in the "Emotion" columns, the performance degrades on the 6k(emotion) dataset in comparison to the other two datasets with both emotional and neutral data, as the model sees less data during training. From this, we conclude that training exclusively on data containing only emotion does not translate to improved performance on emotion-containing queries. In addition, when comparing the results for the entire test set with only the emotion-containing queries, we find that for all training set sizes, the recall values for "Emotion" are all lower than for "Emotion+Neutral". We find these as reasons to additionally implement methods to focus on the emotion information within our data, to better find such matches.


The experiment's findings indicate a decline in overall performance when contrasting the 'Baseline' with 'Ours'. Specifically, for 'Text to Video' (T2V), the R@1 metric fell sharply from 29.41% to 5.89%, and for 'Video to Text' (V2T), it decreased from 32.35% to 8.83%. A similar downward trend was observed in the R@5 metric, which dropped from 58.82% to 14.71% for T2V, and from 70.59% to 15.63% for V2T. The R@10 metric also saw a significant reduction, declining from 79.41% to 23.53% for T2V and from 79.41% to 20.59% for V2T.

### Qualitative Results (i.e. interesting findings supported by examples/outputs from your model)

To evaluate the proficiency of our emotion embedding model in learning about emotions, we visualized the embeddings for emotions using t-SNE. If our model has effectively learned emotion representations, we would expect embeddings associated with similar emotions to be mapped closer together in the space. 

Remarkably, the t-SNE visualization revealed that, compared to a baseline model, our model's text features classified as emotional are more distinctly clustered. This suggests that our emotion embedding approach successfully captures the nuances of emotional content of the text.

#### T-SNE Visualization with Joy Emotion
For example, the embeddings of the joy text, in green dots, in our model are slightly more clustered than the baseline model’s. However, the embeddings of the joy videos, in blue dots, are not as closely clustered as the joy text embeddings. 

![joy t-sne](https://github.com/Hyejin3194/MIE1517_Project_Emotion-Text-to-Video-Retrieval/blob/main/images/tsne_joy.png?raw=true)

#### T-SNE Visualization with Trust Emotion
Likewise, the embeddings of the trust text and trust video show the similar trend.
Since our embedding model only incorporated emotion embeddings for text, we observed a clustering effect for text-related emotions but not for emotions related to videos. Therefore, to enhance the performance of our model, we suggest the inclusion of a module that learns corresponding video embeddings in addition to text embeddings. This enhancement is expected to yield improved performance in tasks involving emotion recognition across both text and video content.


![trust t-sne](https://github.com/Hyejin3194/MIE1517_Project_Emotion-Text-to-Video-Retrieval/blob/main/images/tsne_trust.png?raw=true)

## 5. Demonstration
In the following, assuming we have trained our model on the MSR-VTT dataset with emotions, we show how we can use the model on unseen videos. 


### Demo Data Download

You can simply download our deomo dataset using this command:
``` 
bash download_demo.bash $PATH_TO_DOWNLOAD
```

In [15]:
! bash download_demo.bash $PATH_TO_DOWNLOAD

Downloading demo_video_6fps.zip.
--2024-04-05 20:33:06--  http://wget/
Resolving wget (wget)... failed: nodename nor servname provided, or not known.
wget: unable to resolve host address ‘wget’
--2024-04-05 20:33:06--  https://drive.google.com/uc?export=download&id=1mOTXPOBRGXgoAycgZ4pLZVCCKglJAyOC
Resolving drive.google.com (drive.google.com)... 172.217.165.14
Connecting to drive.google.com (drive.google.com)|172.217.165.14|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://drive.usercontent.google.com/download?id=1mOTXPOBRGXgoAycgZ4pLZVCCKglJAyOC&export=download [following]
--2024-04-05 20:33:07--  https://drive.usercontent.google.com/download?id=1mOTXPOBRGXgoAycgZ4pLZVCCKglJAyOC&export=download
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 142.251.41.65
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|142.251.41.65|:443... connected.
HTTP request sent, awaiting response... 

200 OK
Length: 7625392 (7.3M) [application/octet-stream]
Saving to: ‘demo_video_6fps.zip’


2024-04-05 20:33:11 (18.1 MB/s) - ‘demo_video_6fps.zip’ saved [7625392/7625392]

FINISHED --2024-04-05 20:33:11--
Total wall clock time: 4.2s
Downloaded: 1 files, 7.3M in 0.4s (18.1 MB/s)
Processing videos started.
Processing videos completed.
The preparation of the msrvtt dataset has been successfully completed.


### Demo Code

In [3]:
import os
import sys
sys.path.append(os.path.dirname(os.path.abspath("__file__")))

import time 
import numpy as np
import torch
from torch.utils.data import DataLoader
from transformers import CLIPTokenizerFast

from src.datasets.dataset_video_retrieval_demo import (
    HDVILAVideoRetrievalDataset, VideoRetrievalCollator)

from src.configs.config import shared_configs
from src.utils.misc import set_random_seed
from src.utils.logger import LOGGER
from src.utils.load_save import load_state_dict_with_mismatch
from src.utils.emotion_utils import encode_query

from src.utils.metrics import cal_cossim

  from .autonotebook import tqdm as notebook_tqdm


#### Helper functions

In [4]:
def mk_video_ret_dataloader(dataset_name, vis_format, anno_path, vis_dir, cfg, tokenizer, mode):
    is_train = mode == "train"
    dataset = HDVILAVideoRetrievalDataset(
        cfg=cfg,
        vis_dir=vis_dir,
        anno_path=anno_path,
        vis_format=vis_format,
        mode=mode
    )
    LOGGER.info(f"[{dataset_name}] is_train {is_train} "
                f"dataset size {len(dataset)}, ")

    batch_size = cfg.train_batch_size if is_train else cfg.test_batch_size
    vret_collator = VideoRetrievalCollator(
        tokenizer=tokenizer, max_length=cfg.max_txt_len, is_train=is_train, is_embed=cfg.is_embed)
    dataloader = DataLoader(dataset,
                            batch_size=batch_size,
                            shuffle=False,
                            num_workers=cfg.n_workers,
                            pin_memory=cfg.pin_mem,
                            collate_fn=vret_collator.collate_batch)
    return dataloader


def setup_model(cfg, device=None):
    LOGGER.info("Setup model...")
    
    if cfg.is_embed:
        from src.modeling.VidCLIP import VidCLIP
        model = VidCLIP(cfg)
    else:
        from src.modeling.VidCLIP import VidCLIP
        model = VidCLIP(cfg)

    if cfg.e2e_weights_path:
        LOGGER.info(f"Loading e2e weights from {cfg.e2e_weights_path}")
        
        load_state_dict_with_mismatch(model, cfg.e2e_weights_path)
    
    if hasattr(cfg, "overload_logit_scale"):
        model.overload_logit_scale(cfg.overload_logit_scale)
    
    model.to(device)

    LOGGER.info("Setup model done!")
    return model

@torch.no_grad()
def infer_demo(model, val_loaders, cfg, device=None):

    model.eval()

    st = time.time()
    
    for loader_name, val_loader in val_loaders.items():
        # print(f"Loop val_loader {loader_name}.")
        valid_len = len(val_loader.dataset)
        text_feats = []
        vis_feats = []
        vis_ids = []
        for val_step, batch in enumerate(val_loader):
            vis_ids.extend(batch.pop('vis_id')) # except vis_id
            batch['video'] = batch['video'].to(device)
            batch['text_input_ids'] = batch['text_input_ids'].to(device)
            batch['text_input_mask'] = batch['text_input_mask'].to(device)
            if cfg.is_embed:
                batch['emotions'] = batch['emotions'].to(device)
            # else:
            #     del batch['emotions']
            
            feats = model(**batch)  # dict
            # print('feats vis_features', feats['vis_features'].shape)
            vis_feat = feats['vis_features']
            text_feat = feats['text_features']

            # print('allgather vis_features', vis_feat.shape)

            text_feats.append(text_feat.cpu().numpy())
            vis_feats.append(vis_feat.cpu().numpy())

        text_feats = np.vstack(text_feats)
        vis_feats = np.vstack(vis_feats)

        text_feats = text_feats[:valid_len]
        vis_feats = vis_feats[:valid_len]
        
        sim_matrix = cal_cossim(text_feats, vis_feats)
        sorted_score = sorted(sim_matrix[0], reverse = True)
        ranks_idx = [sorted_score.index(s) for s in sim_matrix[0]] 
        happy_idx=[]
        angry_idx=[]
        for idx, vis_id in enumerate(vis_ids):
            if 'happy' in vis_id:
                happy_idx.append(idx)
            elif 'angry' in vis_id:
                angry_idx.append(idx)

        happy_score_lst = [sim_matrix[0][idx] for idx in happy_idx]
        angry_score_lst = [sim_matrix[0][idx] for idx in angry_idx]
        happy_score = sum(happy_score_lst)/len(happy_score_lst)
        angry_score = sum(angry_score_lst)/len(angry_score_lst)
        
        ranks = [vis_ids[idx] for idx in ranks_idx]
        print("Video Ranks based on the Cosine Similarity Score: ")
        print(ranks)
        print("\nHappy score mean: %f\n"%happy_score)
        print("Angry score mean: %f\n"%angry_score)
        return ranks

### Baseline Model

#### Model Setup

##### text query: "Happy smiling people" 

In [16]:
sys.argv = ['run_video_retrieval_demo.py', '--config', './src/configs/demo/demo_retrieval_vip_base_32.json']
cfg_base = shared_configs.parse_args()
cfg_base.e2e_weights_path = './pretrained_model/base_model_best_msrvtt7k_b64.pt' # load baseline model
cfg_base.is_embed = False
cfg_base.is_demo = True
cfg_base.query = "Happy smiling people"
cfg_base.emotion = encode_query(cfg_base.query, None)

### Emotion Embedded Model

#### Model Setup

##### Text query: "Happy smiling people"

In [17]:
sys.argv = ['run_video_retrieval_demo.py', '--config', './src/configs/demo/demo_retrieval_vip_base_32.json']
cfg_embed = shared_configs.parse_args()
cfg_base.e2e_weights_path = './pretrained_model/emotion_embed_model_best.pt' # load our model
cfg_embed.is_embed = True
cfg_embed.is_demo = True
cfg_embed.query = "Happy smiling people"
cfg_embed.emotion = encode_query(cfg_embed.query, None)

#### Load Test(Demo) Data

In [18]:
set_random_seed(cfg_base.seed)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# demo data loader
tokenizer = CLIPTokenizerFast.from_pretrained(cfg_base.clip_config)
inference_loaders = {}
for db in cfg_base.inference_datasets:
    inference_loaders[db.name] = mk_video_ret_dataloader(
        dataset_name=db.name, vis_format=db.vis_format,
        anno_path=db.txt, vis_dir=db.vis,
        cfg=cfg_base, tokenizer=tokenizer, mode="test"
    )

04/05/2024 20:33:19 - INFO - __main__ -   [demo-test] is_train False dataset size 21, 


### Evaluate the Base Model and Emotion Embedding Model

In [9]:
# base model
base_model = setup_model(cfg_base, device=device)
base_model.eval()

# emotion embed model
embed_model = setup_model(cfg_embed, device=device)
embed_model.eval()

04/05/2024 20:27:52 - INFO - __main__ -   Setup model...
Some weights of CLIPModel were not initialized from the model checkpoint at openai/clip-vit-base-patch32 and are newly initialized: ['vision_model.embeddings.added_cls', 'vision_model.embeddings.temporal_embedding']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
04/05/2024 20:27:54 - INFO - __main__ -   Loading e2e weights from ./pretrained_model/emotion_embed_model_best.pt
04/05/2024 20:27:56 - INFO - __main__ -   You can ignore the keys with `num_batches_tracked` or from task heads
04/05/2024 20:27:56 - INFO - __main__ -   Keys in loaded but not in model:
04/05/2024 20:27:56 - INFO - __main__ -   In total 1, ['clipmodel.text_model.embeddings.emotion_embedding.weight']
04/05/2024 20:27:56 - INFO - __main__ -   Keys in model but not in loaded:
04/05/2024 20:27:56 - INFO - __main__ -   In total 0, []
04/05/2024 20:27:56 - INFO - __main__ -   Keys in model and loaded, 

VidCLIP(
  (clipmodel): CLIPModel(
    (text_model): CLIPTextTransformer(
      (embeddings): CLIPTextEmbeddings(
        (token_embedding): Embedding(49408, 512)
        (position_embedding): Embedding(77, 512)
      )
      (encoder): CLIPEncoder(
        (layers): ModuleList(
          (0-11): 12 x CLIPEncoderLayer(
            (self_attn): CLIPAttention(
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (mlp): CLIPMLP(
              (activation_fn): QuickGELUActivation()
              (fc1): Linear(in_features=512, out_features=2048, bias=True)
              (fc2): Linear(in_features=2048, out_features=512, bias=True)
   

### Base Model Demo

In [None]:
from IPython.display import Video

base_ranks = infer_demo(base_model, inference_loaders, cfg_base, device)
video_dir = cfg_base.inference_datasets[0].vis

print("Top 3 Selected Videos for Query [%s]" %cfg_base.query)

base_video_list = []
for video in base_ranks[:3]:
    base_video_list.append(os.path.join(video_dir, video+'.mp4'))

for idx, video_path in enumerate(base_video_list): 
    print("\nRank %d: %s"%((idx+1),base_ranks[idx]))
    display(Video(video_path, embed=True))

Video Ranks based on the Cosine Similarity Score: 
['angry8', 'happy1', 'happy4', 'happy5', 'angry1', 'angry6', 'angry7', 'happy2', 'angry2', 'happy6', 'happy10', 'angry5', 'happy8', 'happy11', 'angry10', 'happy9', 'angry3', 'happy3', 'angry9', 'angry4', 'happy7']

Happy score mean: 0.210959

Angry score mean: 0.166257

Top 3 Selected Videos for Query [Happy smiling people]

Rank 1: angry8



Rank 2: happy1



Rank 3: happy4


### Our Model Demo

In [None]:
embed_ranks = infer_demo(embed_model, inference_loaders, cfg_embed, device)
print("Top 3 Selected Videos for Query [%s]" %cfg_embed.query)

embed_video_list = []
for video in embed_ranks[:3]:
    embed_video_list.append(os.path.join(video_dir, video+'.mp4'))

for idx, video_path in enumerate(embed_video_list): 
    print("\nRank %d: %s"%((idx+1),embed_ranks[idx]))
    display(Video(video_path, embed=True))

Video Ranks based on the Cosine Similarity Score: 
['happy8', 'happy4', 'angry2', 'happy5', 'happy7', 'angry1', 'happy3', 'angry7', 'angry10', 'angry6', 'happy9', 'happy6', 'happy1', 'happy11', 'angry9', 'angry5', 'happy10', 'angry8', 'angry3', 'angry4', 'happy2']

Happy score mean: 0.155641

Angry score mean: 0.144195

Top 3 Selected Videos for Query [Happy smiling people]

Rank 1: happy8



Rank 2: happy4



Rank 3: angry2


## 6. Disussion and Possible Future Directions

### What we wish we had known in advance:

- The quality of emotional data that can be extracted. Emotion is quite a subjective concept, and is difficult to define and classify for. The methods we decided to use in the end were lexicon-based methods, which are not the most advanced technique there is. Thus, our method which utilizes this suboptimal data can have trouble leveraging this data for improving performance on the video retrieval task. The incorporation of this kind of data can actually confuse the model instead, in turn causing a drop in performance.
- The difficulty in incorporating the emotion data. We were only able to use a very simple implementation in the form of “emotion embeddings” applied directly to the sequence data in the form of addition. Many sophisticated methods exist, and we leave this to future work. We had a list of methods we wanted to try, but were not able to implement due to time constraints, such as the attention mechanism or emotion-specific positional embeddings. 

![alternatives](https://github.com/Hyejin3194/MIE1517_Project_Emotion-Text-to-Video-Retrieval/blob/main/images/alternatives_background.png?raw=true)

- The amount of available data and its relation to our task. (The availability of emotional videos) → Our main objective was to create a service for users to more easily access emotional videos in the vast amount of data that they have access to. However, most of the datasets that are readily available for research are collected from a wide range of media like YouTube, or other consumer content, rather than for videos that would be commonly taken by regular people on their phones. In this sense, there is a discrepancy in the type of data we would expect our model to be used on, especially in that the datasets contain videos that do not really contain emotions. This negatively affected our models performance.

### Future directions, and things that can be improved:

- One aspect that we were not able to touch on was the video data. We extracted emotional information from the video/frame captions, but we think it would be possible to extract similar information from the videos themselves. One method we considered was using Facial Expression Recognition(FER) models to extract the expressions of the faces in the videos and convert them into emotions. We think this has potential to improve the performance of our model.

![FER](https://github.com/Hyejin3194/MIE1517_Project_Emotion-Text-to-Video-Retrieval/blob/main/images/FER.png?raw=true)