# Emotion-Specialized Text-to-Video Retrieval
## Project Progress Report (12 Mar 2024)
Team: 14 

Members: 

__Daye Lee__ 

__Wonseon Lim__ 

__Hyejin Oh__

__Paul Hyunbin Cho__ 

## Project Objective 
We aim to create a video-text retrieval model that better finds videos by taking the emotions expressed in a query into account. In the age of big data, most people have countless photos and videos saved on their devices. With this much data, it is becoming increasingly difficult to find a particular video or photo efficiently. As a result, video-text retrieval has been receiving more attention, and model capabilities have improved tremendously over the years. However, we believe that the performance of these models can be improved further by focusing on the emotional aspects inherent in both the text and the videos. This matters all the more given the growing demand for personalized and emotionally resonant experiences in digital media. __Thus, we hope to develop a tool that lets users easily access emotional content in videos, by modifying video-text retrieval models to incorporate emotion data.__ 

The structure of this project report is as follows:
1. Data Collection and Preprocessing
2. Model Development
3. Experiment Results
4. Remaining Project Plan
5. Possible Future Directions

As the experiments are not run in Jupyter Notebooks, we provide the link to our GitHub repository, which contains all of the code used for our experiments and implementations: https://github.com/Daye-Lee18/1517_XPretrain

# 1. Data Collection and Preprocessing


- Datasets
    - [MSRVTT data description]: MSR-VTT (1) contains 10K YouTube videos with 200K descriptions. Each video is approximately 10-20 seconds long and has 20 captions. We follow our baseline CLIP-ViP (2) and train models on 9K videos, reporting results on the 1K test set. 
    - [DiDeMo data description]: DiDeMo (3) consists of 10K Flickr videos annotated with 40K sentences. We follow our baseline CLIP-ViP and evaluate paragraph-to-video retrieval, concatenating all descriptions of a video into one query.
    - [MSVD data description]: The MSVD (4) dataset includes 1,970 videos with approximately 80,000 captions. We split it into 1,576 training videos and 394 test videos.
- Why we use these datasets  
    - Our task is to retrieve emotion-specific videos from text, as a downstream task built on the CLIP-ViP paper. We therefore use the same datasets as the CLIP-ViP paper. In addition, fine-tuning on the three datasets above is feasible with our current computational resources.
- Data Preprocessing
    - First, we collected three publicly available video caption datasets commonly used to evaluate our base task, text-to-video retrieval. Next, to create a dataset for our main task, emotional text-to-video retrieval, we applied the following two-step preprocessing.
    -  Step (1): We performed sentiment analysis on the captions of each video to determine whether they carry sentimental information. The sentiment analysis library calculates neutral, positive, and negative scores for the text, together with a compound score, which is an overall score derived from these individual scores. If the compound score is above 0.6, we classify the video as positive, and if it is below -0.6, we classify it as negative. The threshold of 0.6 is a value we chose ourselves; it is set higher than 0.5 so that the selected videos carry strong sentimental information.
    $\rightarrow$ [nltk.sentiment.SentimentIntensityAnalyzer](https://www.nltk.org/api/nltk.sentiment.SentimentIntensityAnalyzer.html?highlight=sentimentintensity)
    -  Step (2): After the video selection in the first step, we determine the emotional information present in each caption, since each video may have multiple captions. For this purpose, we used NRCLex, which scores a sentence against a predefined lexicon to count eight emotions present in it. We also computed positive, negative, and neutral sentiment scores. Consequently, the resulting data for each caption contains information on eight emotions along with positive, negative, and neutral sentiments.
    $\rightarrow$ [nrclex.NRCLex](https://pypi.org/project/NRCLex/)
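
To make the thresholding rule in Step (1) concrete, here is a minimal sketch; it is not the code used in our pipeline. The function name `sentiment_label` and the label `"unselected"` are illustrative assumptions. Only the 0.6 threshold and the strict comparisons come from the description above.

```python
def sentiment_label(compound: float, threshold: float = 0.6) -> str:
    """Map a compound score in [-1, 1] to a coarse label (illustrative helper)."""
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    # Scores inside [-threshold, threshold] are not treated as sentimental.
    return "unselected"


# Illustrative scores; any value strictly inside the band is left unselected.
for score in (0.85, -0.78, 0.25, 0.6):
    print(score, sentiment_label(score))
```

A score exactly equal to the threshold is left unselected, because both comparisons are strict.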

### References

(1) Jun Xu, Tao Mei, Ting Yao, and Yong Rui. "Msr-vtt: A large video description dataset for bridging video and language." In CVPR, pp. 5288–5296, 2016.

(2) Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, Jiebo Luo. "CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Alignment." In ICLR, 2023.

(3) Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. "Localizing moments in video with natural language." In ICCV, pp. 5803–5812, 2017.

(4) David Chen and William Dolan. "Collecting highly parallel data for paraphrase evaluation." In ACL: Human Language Technologies, pages 190–200, 2011.

In [None]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import nltk

from tqdm import tqdm
from sklearn.model_selection import train_test_split
from nltk.sentiment import SentimentIntensityAnalyzer
from nrclex import NRCLex

nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/ones/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

## 1.1 MSRVTT
### Data Load

In [None]:
# Load the data
mrsvtt_train9k_dir = "./data/msrvtt/annotations/msrvtt_ret_train9k.json"
mrsvtt_test1k_dir = "./data/msrvtt/annotations/msrvtt_ret_test1k.json"

with open(mrsvtt_train9k_dir, 'r') as f:
    mrsvtt_train9k_json = json.load(f)

with open(mrsvtt_test1k_dir, 'r') as f:
    mrsvtt_test1k_json = json.load(f)

# Extract video_id
train9k = pd.DataFrame(mrsvtt_train9k_json)
test1k = pd.DataFrame(mrsvtt_test1k_json)
train9k['video_id'] = train9k['video'].apply(lambda x: x.split('.')[0])
test1k['video_id'] = test1k['video'].apply(lambda x: x.split('.')[0])
print(train9k.shape, test1k.shape)

(180000, 4) (1000, 4)


In [None]:
#  Group captions by each video_id
merge_caption = lambda x: '. '.join(x)
train_video_text = train9k.groupby('video_id')['caption'].apply(merge_caption)
test_video_text = test1k.groupby('video_id')['caption'].apply(merge_caption)
train_video_text = pd.DataFrame(train_video_text).reset_index()
test_video_text = pd.DataFrame(test_video_text).reset_index()
print(train_video_text.shape, test_video_text.shape)

(9000, 2) (1000, 2)


### Step (1): Extract Sentimental Video using Sentiment Analysis

We used two methods to extract videos with emotional information: one scores the captions with SentimentIntensityAnalyzer(), and the other keeps videos whose captions contain explicit emotion words from a hand-picked list. The extract_emotional_data function implements the former, while the extract_emotion_manually function implements the latter. Out of 9,000 videos in the MSRVTT training set, 6,137 contain emotional information, and out of 1,000 videos in the test set, 34 contain emotional information.
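
The keyword-based method relies on regular-expression substring matching, which is worth illustrating. The sketch below uses Python's `re` module with an abbreviated version of the keyword pattern; the helper name `mentions_emotion` is hypothetical. Because matching is by substring, a keyword such as `fear` also covers `fearless`, and a short keyword such as `sad` can match unrelated words. Our reading is that the leading space in ` anger` keeps words like `dangerous` from matching, but this is an interpretation of the pattern, not a documented design decision.

```python
import re

# Abbreviated version of the keyword pattern (alternation of substrings).
PATTERN = re.compile(r"happy|sad|afraid|fear|surprise|joy|disgust|annoy| anger|angry")


def mentions_emotion(caption: str) -> bool:
    """Return True if any keyword occurs anywhere in the caption (hypothetical helper)."""
    return PATTERN.search(caption) is not None


print(mentions_emotion("a fearless climber on a cliff"))  # 'fear' is a substring of 'fearless'
print(mentions_emotion("a dangerous road"))               # ' anger' needs a space before 'anger'
print(mentions_emotion("a man sits on a saddle"))         # 'sad' also matches inside 'saddle'
```

The last case shows why keyword matching can over-select captions, which is one reason to combine it with the compound-score filter instead of using it alone.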

### Functions Implementation

In [None]:
def extract_emotion_manually(df, text=None):
    # assign the default text
    if text is None:
        text = "happy|sad|afraid|fear|surprise|joy|disgust|annoy| anger|angry|" \
                "excite|excited|exciting|scare|scared|scary|fright|frighten|frightened|frightening" \
                "|fearful|fearless|fearfully"
    # extract the data that contains the text
    manual_df = df[df['caption'].str.contains(text)]
    print("Number of manually selecting data:", len(manual_df))
    return manual_df


def extract_emotional_data(df, sent_bound=0.6, manual=False):
    # Initialize SentimentIntensityAnalyzer
    sia = SentimentIntensityAnalyzer()

    # calculate the compound sentiment score of every caption
    compound = pd.Series(
        [sia.polarity_scores(sentence)['compound'] for sentence in tqdm(df['caption'])],
        index=df.index,
    )
    # keep the emotional data: compound score > sent_bound (positive) or < -sent_bound (negative)
    # (a boolean mask avoids growing a DataFrame row by row and works for any index)
    emotion_df = df[(compound > sent_bound) | (compound < -sent_bound)].copy()

    print("Number of emotional data:", len(emotion_df))

    # extract the data that contains predefined emotion words
    if manual:
        manual_df = extract_emotion_manually(df)
        # keep only the manually selected rows that are not already in emotion_df
        if "sen_id" in manual_df.columns:
            only_manual_df = pd.merge(manual_df, emotion_df, on='sen_id', how='outer', indicator=True).query('_merge=="left_only"')
            only_manual_df = only_manual_df.drop(columns=["caption_y", "video_id_y", "_merge"]).rename(columns={"caption_x": "caption", "video_id_x": "video_id"})
        else:
            only_manual_df = pd.merge(manual_df, emotion_df, on='video_id', how='outer', indicator=True).query('_merge=="left_only"')
            only_manual_df = only_manual_df.drop(columns=["caption_y", "_merge"]).rename(columns={"caption_x": "caption"})
        print("Number of data only in manual data:", len(only_manual_df))
        emotion_df = pd.concat([emotion_df, only_manual_df])
        print("Toal number of emotional data:", len(emotion_df))

    return emotion_df

In [None]:
# extract videos that contain emotional information
emotion_train_row_df = extract_emotional_data(train_video_text, sent_bound=0.6, manual=True)
emotion_test_row_df = extract_emotional_data(test_video_text, sent_bound=0.6, manual=True)

100%|██████████| 9000/9000 [00:17<00:00, 526.53it/s]


Number of emotional data: 5984
Number of manually selecting data: 1019
Number of data only in manual data: 153
Toal number of emotional data: 6137


100%|██████████| 1000/1000 [00:00<00:00, 5879.03it/s]

Number of emotional data: 29
Number of manually selecting data: 7
Number of data only in manual data: 5
Toal number of emotional data: 34





In [None]:
# training set
no_emotion_train_row_df = train_video_text[~train_video_text.video_id.isin(emotion_train_row_df.video_id)]
no_emotion_train_row_df.loc[:, 'emotion'] = 0
emotion_train_row_df.loc[:, 'emotion'] = 1
print('Train shape:', emotion_train_row_df.shape, no_emotion_train_row_df.shape)
train_df = pd.concat([emotion_train_row_df, no_emotion_train_row_df])

# test set
num_emo_test = emotion_test_row_df.shape[0]
no_emotion_test_row_df = test_video_text[~test_video_text.video_id.isin(emotion_test_row_df.video_id)].sample(num_emo_test)
no_emotion_test_row_df['emotion'] = 0
emotion_test_row_df['emotion'] = 1
print('Test shape:', emotion_test_row_df.shape, no_emotion_test_row_df.shape)
test_row_df = pd.concat([emotion_test_row_df, no_emotion_test_row_df])
# total test set
no_emotion_test_row_df = test_video_text[~test_video_text.video_id.isin(emotion_test_row_df.video_id)]
no_emotion_test_row_df['emotion'] = 0
emotion_test_row_df['emotion'] = 1
print('Total Test shape:', emotion_test_row_df.shape, no_emotion_test_row_df.shape)
total_test_row_df = pd.concat([emotion_test_row_df, no_emotion_test_row_df])

Train shape: (122740, 5) (2863, 3)
Test shape: (34, 3) (34, 3)
Total Test shape: (34, 3) (966, 3)




### Check the sentimental compound score

As a sanity check on the selection, we print the polarity scores, including the compound score, for the first six captions of the full test set.
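
For readers interpreting these scores: as far as we understand the VADER implementation that NLTK wraps, the compound score is the sum of word-level valence scores, squashed into [-1, 1] by x / sqrt(x^2 + alpha) with alpha = 15. The sketch below reproduces only this normalization step; the function name `vader_normalize` is ours, and the valence sums passed in are made-up inputs, not values computed from any caption.

```python
import math


def vader_normalize(score_sum: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into [-1, 1] (sketch of VADER's normalization)."""
    norm = score_sum / math.sqrt(score_sum * score_sum + alpha)
    return max(-1.0, min(1.0, norm))


# A valence sum of roughly 2.9 is already enough to cross the 0.6 threshold.
for total in (1.0, 2.905, 6.0):
    print(total, round(vader_normalize(total), 4))
```

Because the input is a sum over words, long texts built by concatenating many captions can accumulate enough valence to cross the threshold even when each individual caption is only mildly sentimental. This should be kept in mind when reading the video-level results.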

In [None]:
# sentiment score
sia = SentimentIntensityAnalyzer()

count = 0
for text in total_test_row_df.caption:
    if count > 5:
        break
    print(text)
    print(sia.polarity_scores(text))
    print()
    count += 1

while other friends too try and hitting the basket another is eager to achieve his fourth successful basket in basketball
{'neg': 0.0, 'neu': 0.644, 'pos': 0.356, 'compound': 0.8555}

a young girl in a horror movie is haunted
{'neg': 0.576, 'neu': 0.424, 'pos': 0.0, 'compound': -0.7783}

a female soccer player accepts a reward while being cheered on by the crowd
{'neg': 0.0, 'neu': 0.492, 'pos': 0.508, 'compound': 0.8519}

a woman giving skin care tips
{'neg': 0.0, 'neu': 0.349, 'pos': 0.651, 'compound': 0.6808}

vladmir putin talks on the news about the fight against terrorism
{'neg': 0.444, 'neu': 0.556, 'pos': 0.0, 'compound': -0.802}

jolly good music troop delivering a program and the lady is in good spirit
{'neg': 0.0, 'neu': 0.455, 'pos': 0.545, 'compound': 0.8689}



### Step (2): Assign Specific Emotion

Using the NRC lexicon through the NRCLex library, we assigned scores to each caption that indicate whether it carries positive, negative, or neutral sentiment. In addition, we scored the eight emotion types defined by Robert Plutchik: joy, trust, fear, surprise, sadness, disgust, anger, and anticipation. This adds a total of 11 columns to the original dataset.

We applied this second step to three datasets: the roughly 6K training videos selected in the first step, a balanced test set of 68 videos (34 sentimental and 34 non-sentimental), and the entire 1K test set.
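
The way lexicon counts become the 11 columns can be sketched without the NRCLex dependency. In the sketch below, `TOY_LEXICON` is a hypothetical three-word lexicon invented for illustration; it is not the NRC lexicon, and `caption_features` is an illustrative helper, not part of our pipeline. The neutral flag follows the same rule as our column-creation code: a caption is neutral only if it has neither positive nor negative hits.

```python
from collections import Counter

EMOTIONS = ["joy", "trust", "fear", "surprise",
            "sadness", "disgust", "anger", "anticipation"]

# Hypothetical toy lexicon for illustration only (NOT the NRC lexicon).
TOY_LEXICON = {
    "eager": ["anticipation", "positive"],
    "successful": ["joy", "trust", "positive"],
    "haunted": ["fear", "negative"],
}


def caption_features(caption, lexicon=TOY_LEXICON):
    """Count lexicon hits per category and derive the 11 feature columns."""
    counts = Counter()
    for word in caption.lower().split():
        counts.update(lexicon.get(word, []))
    features = {emo: float(counts[emo]) for emo in EMOTIONS}
    features["positive"] = float(counts["positive"])
    features["negative"] = float(counts["negative"])
    no_polarity = counts["positive"] == 0 and counts["negative"] == 0
    features["neutral"] = 1.0 if no_polarity else 0.0
    return features


print(caption_features("a young girl in a horror movie is haunted"))
```

Note that the eight emotion columns and the positive/negative columns are raw counts, while the neutral column is a binary flag.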

In [None]:
# Emotions that NRC Lexicon library has
emostion_list = [
    "joy",
    "trust",
    "fear",
    "surprise",
    "sadness",
    "disgust",
    "anger",
    "anticipation"
]

### Functions Implementation

In [None]:
def create_sentiment_columns(df):
    """ Create sentiment columns using NRC lexicon
        : positive, negative, neutral
    """
    df = df.reset_index(drop=True)
    for idx in tqdm(range(df.shape[0])):
        emotion_counts = NRCLex(df.caption.iloc[idx]).raw_emotion_scores
        # positive case
        if 'positive' in emotion_counts.keys():
            df.loc[idx, 'positive'] = emotion_counts['positive']
        # negative case
        if 'negative' in emotion_counts.keys():
            df.loc[idx, 'negative'] = emotion_counts['negative']
        # neutral case
        if 'positive' not in emotion_counts.keys() and 'negative' not in emotion_counts.keys():
            df.loc[idx, 'neutral'] = 1

    df.fillna({'positive': 0, 'negative': 0, 'neutral': 0}, inplace=True)
    return df

def create_emotion_columns(df):
    """ Create emotion columns using NRC lexicon
        : anger, anticipation, disgust, fear, joy, sadness, surprise, trust
    """
    df = df.reset_index(drop=True)
    for idx in tqdm(range(df.shape[0])):
        emotion_counts = NRCLex(df.caption.iloc[idx]).raw_emotion_scores
        # create the emotion columns
        for emo in emostion_list:
            if emo in emotion_counts.keys():
                df.loc[idx, emo] = emotion_counts[emo]

    df.fillna({emo: 0 for emo in emostion_list}, inplace=True)
    return df

In [None]:
# Train
# create the sentiment columns
emotion_train_row_df = create_sentiment_columns(emotion_train_row_df)
# create the emotion columns
emotion_train_row_df = create_emotion_columns(emotion_train_row_df)

# Test
# create the sentiment columns
test_row_df = create_sentiment_columns(test_row_df)
# create the emotion columns
test_row_df = create_emotion_columns(test_row_df)

# Total Test
# create the sentiment columns
total_test_row_df = create_sentiment_columns(total_test_row_df)
# create the emotion columns
total_test_row_df = create_emotion_columns(total_test_row_df)


100%|██████████| 122740/122740 [01:20<00:00, 1533.17it/s]
100%|██████████| 122740/122740 [01:04<00:00, 1895.97it/s]
100%|██████████| 68/68 [00:00<00:00, 1865.87it/s]
100%|██████████| 68/68 [00:00<00:00, 1592.66it/s]
100%|██████████| 1000/1000 [00:00<00:00, 2152.19it/s]
100%|██████████| 1000/1000 [00:00<00:00, 2311.93it/s]


In [None]:
# train
emotion_train_row_df.to_json('msrvtt_train6k.json', orient='records', lines=False)
# test
test_row_df.to_json('msrvtt_test68.json', orient='records', lines=False)
# total test
total_test_row_df.to_json('msrvtt_test1k.json', orient='records', lines=False)

## 1.2 DiDeMo
### Data Load

In [None]:
# Load the data
train_dir = "./data/didemo/annotations/didemo_retrieval_train.json"
val_dir = "./data/didemo/annotations/didemo_retrieval_val.json"
test_dir = "./data/didemo/annotations/didemo_retrieval_test.json"

with open(train_dir, "r") as f:
    train_json = json.load(f)

with open(val_dir, "r") as f:
    val_json = json.load(f)

with open(test_dir, "r") as f:
    test_json = json.load(f)

# train
train_dict = {'video': [], 'caption': []}
for i, data in enumerate(train_json + val_json):
    cap_len = len(data['caption_list'])
    for j in range(cap_len):
        train_dict['video'].append(data['video'])
        train_dict['caption'].append(data['caption_list'][j])

# test
test_dict = {'video': [], 'caption': []}
for i, data in enumerate(test_json):
    cap_len = len(data['caption_list'])
    for j in range(cap_len):
        test_dict['video'].append(data['video'])
        test_dict['caption'].append(data['caption_list'][j])

# Extract video_id
train9k = pd.DataFrame(train_dict)
test1k = pd.DataFrame(test_dict)
train9k['video_id'] = train9k['video'].apply(lambda x: x.split('.')[0])
test1k['video_id'] = test1k['video'].apply(lambda x: x.split('.')[0])
print(train9k.shape, test1k.shape)

(37182, 3) (4017, 3)


In [None]:
#  Group captions by each video_id
merge_caption = lambda x: '. '.join(x)
train_video_text = train9k.groupby('video_id')['caption'].apply(merge_caption)
test_video_text = test1k.groupby('video_id')['caption'].apply(merge_caption)
train_video_text = pd.DataFrame(train_video_text).reset_index()
test_video_text = pd.DataFrame(test_video_text).reset_index()
print(train_video_text.shape, test_video_text.shape)

(9459, 2) (1003, 2)


### Step (1): Extract Sentimental Video using Sentiment Analysis

Out of 9,459 videos in the DiDeMo training set (the train and validation splits combined), 668 contain sentimental information, and out of 1,003 videos in the test set, 77 contain sentimental information.

In [None]:
# extract videos that contain emotional information
emotion_train_row_df = extract_emotional_data(train_video_text, sent_bound=0.6, manual=True)
emotion_test_row_df = extract_emotional_data(test_video_text, sent_bound=0.6, manual=True)

100%|██████████| 9459/9459 [00:03<00:00, 2390.30it/s]


Number of emotional data: 627
Number of manually selecting data: 59
Number of data only in manual data: 41
Toal number of emotional data: 668


100%|██████████| 1003/1003 [00:00<00:00, 2329.30it/s]

Number of emotional data: 75
Number of manually selecting data: 7
Number of data only in manual data: 2
Toal number of emotional data: 77





In [None]:
# training set
no_emotion_train_row_df = train_video_text[~train_video_text.video_id.isin(emotion_train_row_df.video_id)]
no_emotion_train_row_df.loc[:, 'emotion'] = 0
emotion_train_row_df.loc[:, 'emotion'] = 1
print('Train shape:', emotion_train_row_df.shape, no_emotion_train_row_df.shape)
train_df = pd.concat([emotion_train_row_df, no_emotion_train_row_df])

# test set
num_emo_test = emotion_test_row_df.shape[0]
no_emotion_test_row_df = test_video_text[~test_video_text.video_id.isin(emotion_test_row_df.video_id)].sample(num_emo_test)
no_emotion_test_row_df['emotion'] = 0
emotion_test_row_df['emotion'] = 1
print('Test shape:', emotion_test_row_df.shape, no_emotion_test_row_df.shape)
test_row_df = pd.concat([emotion_test_row_df, no_emotion_test_row_df])
# total test set
no_emotion_test_row_df = test_video_text[~test_video_text.video_id.isin(emotion_test_row_df.video_id)]
no_emotion_test_row_df['emotion'] = 0
emotion_test_row_df['emotion'] = 1
print('Total Test shape:', emotion_test_row_df.shape, no_emotion_test_row_df.shape)
total_test_row_df = pd.concat([emotion_test_row_df, no_emotion_test_row_df])

Train shape: (668, 3) (8791, 3)
Test shape: (77, 3) (77, 3)
Total Test shape: (77, 3) (926, 3)




### Check the sentimental compound score

As a sanity check on the selection, we print the polarity scores, including the compound score, for the first six captions of the full test set.

In [None]:
# sentiment score
sia = SentimentIntensityAnalyzer()

count = 0
for text in total_test_row_df.caption:
    if count > 5:
        break
    print(text)
    print(sia.polarity_scores(text))
    print()
    count += 1

man starts talking in microphone. words apear on bottom first time. the gentleman puts his left arm underneath his right arm.. man talking on microphone unfolds arm and puts hand behind his head. a man places his hand on the back of his head.
{'neg': 0.0, 'neu': 0.865, 'pos': 0.135, 'compound': 0.7506}

silohette of a woman holding a child can be seen in the fog. the screen goes blue and we see no people.. darkest blue. no silhouettes are visible
{'neg': 0.248, 'neu': 0.752, 'pos': 0.0, 'compound': -0.765}

bright flash goes off like someone took a picture
{'neg': 0.0, 'neu': 0.526, 'pos': 0.474, 'compound': 0.6597}

a hand first appears. man playing a touchscreen game.. the circle on the screen is now yellow.. the yellow button turns green and is pressed. the men are filmed.
{'neg': 0.0, 'neu': 0.833, 'pos': 0.167, 'compound': 0.6124}

people watching a blue screen and laughing. we see the lady on the left hold up one finger. women touching screen holds up her left index finger. man w

### Step (2): Assign Specific Emotion

We applied this process to three datasets: the 668 training videos selected in Step 1, a balanced test set of 154 videos (77 sentimental and 77 non-sentimental), and the entire 1K test set.
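
The balanced test set pairs every sentimental video with an equally sized random sample of the remaining videos. The sketch below shows this construction on plain ID lists. The function name `balanced_test_ids` and the fixed `seed` are assumptions added for reproducibility; they are not part of the recorded pipeline.

```python
import random


def balanced_test_ids(emotional_ids, other_ids, seed=0):
    """Keep all emotional IDs and add an equally sized random sample of the rest."""
    # Never ask for more samples than there are non-emotional videos.
    k = min(len(emotional_ids), len(other_ids))
    rng = random.Random(seed)  # fixed seed so the split can be regenerated
    sampled = rng.sample(list(other_ids), k)
    return list(emotional_ids) + sampled


# Sizes mirror the DiDeMo test split: 77 sentimental and 926 other videos.
ids = balanced_test_ids(list(range(77)), list(range(100, 1026)))
print(len(ids))
```

Fixing the seed means the same non-sentimental videos are drawn on every run, so evaluation numbers on the balanced set stay comparable across experiments.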

In [None]:
# Train
# create the sentiment columns
emotion_train_row_df = create_sentiment_columns(emotion_train_row_df)
# create the emotion columns
emotion_train_row_df = create_emotion_columns(emotion_train_row_df)

# Test
# create the sentiment columns
test_row_df = create_sentiment_columns(test_row_df)
# create the emotion columns
test_row_df = create_emotion_columns(test_row_df)

# Total Test
# create the sentiment columns
total_test_row_df = create_sentiment_columns(total_test_row_df)
# create the emotion columns
total_test_row_df = create_emotion_columns(total_test_row_df)


100%|██████████| 668/668 [00:01<00:00, 474.11it/s]
100%|██████████| 668/668 [00:00<00:00, 775.48it/s]
100%|██████████| 154/154 [00:00<00:00, 926.33it/s]
100%|██████████| 154/154 [00:00<00:00, 817.62it/s]
100%|██████████| 1003/1003 [00:01<00:00, 949.50it/s]
100%|██████████| 1003/1003 [00:01<00:00, 865.25it/s]


In [None]:
# train
emotion_train_row_df.to_json('didemo_train668.json', orient='records', lines=False)
# test
test_row_df.to_json('didemo_test154.json', orient='records', lines=False)
# total test
total_test_row_df.to_json('didemo_test1k.json', orient='records', lines=False)

## 1.3 MSVD
### Data Load

In [None]:
def split_each_caption(df):
    """Split the merged caption of each video back into one row per caption."""
    data_dict = {'video_id': [], 'caption': []}
    for video_id, merged_caption in df.values:
        for caption in merged_caption.split('.'):
            data_dict['caption'].append(caption)
            data_dict['video_id'].append(video_id)
    return pd.DataFrame(data_dict)[['video_id', 'caption']]

In [None]:
# Load the data
msvd_dir = "./data/msvd/annotations/AllVideoDescriptions.txt"

with open(msvd_dir, 'r') as f:
    msvd_text = f.readlines()
# skip the first 7 lines, which precede the caption entries
msvd_text = msvd_text[7:]
# parse msvd_text to dictionary type
msvd_dict = {}
for line in tqdm(msvd_text):
    line = line.split()
    video_id = line[0]
    description = " ".join(line[1:])
    if video_id in msvd_dict.keys():
        msvd_dict[video_id] += ". " + description
    else:
        msvd_dict[video_id] = description

video_text = pd.DataFrame(msvd_dict.items(), columns=['video_id', 'caption'])
print("The number of total video: ", video_text.video_id.nunique())
train_video_text, test_video_text = train_test_split(video_text, test_size=0.2, random_state=42)
train_video_text, test_video_text = train_video_text.reset_index(drop=True), test_video_text.reset_index(drop=True)
print(train_video_text.shape, test_video_text.shape)

# train & test
train1k = split_each_caption(train_video_text)
test400 = split_each_caption(test_video_text)
print(train1k.shape, test400.shape)

100%|██████████| 80827/80827 [00:00<00:00, 296552.87it/s]


The number of total video:  1970
(1576, 2) (394, 2)


In [None]:
# extract videos that contain emotional information
emotion_train_row_df = extract_emotional_data(train_video_text, sent_bound=0.6, manual=True)
emotion_test_row_df = extract_emotional_data(test_video_text, sent_bound=0.6, manual=True)

100%|██████████| 1576/1576 [00:03<00:00, 482.37it/s]


Number of emotional data: 1060
Number of manually selecting data: 186
Number of data only in manual data: 32
Toal number of emotional data: 1092


100%|██████████| 394/394 [00:00<00:00, 717.60it/s]

Number of emotional data: 254
Number of manually selecting data: 37
Number of data only in manual data: 3
Toal number of emotional data: 257





In [None]:
# training set
no_emotion_train_row_df = train_video_text[~train_video_text.video_id.isin(emotion_train_row_df.video_id)]
no_emotion_train_row_df.loc[:, 'emotion'] = 0
emotion_train_row_df.loc[:, 'emotion'] = 1
print('Train shape:', emotion_train_row_df.shape, no_emotion_train_row_df.shape)
train_df = pd.concat([emotion_train_row_df, no_emotion_train_row_df])

# test set
num_emo_test = min(emotion_test_row_df.shape[0], test_video_text.shape[0] - emotion_test_row_df.shape[0])
no_emotion_test_row_df = test_video_text[~test_video_text.video_id.isin(emotion_test_row_df.video_id)].sample(num_emo_test)
no_emotion_test_row_df['emotion'] = 0
emotion_test_row_df['emotion'] = 1
print('Test shape:', emotion_test_row_df.shape, no_emotion_test_row_df.shape)
test_row_df = pd.concat([emotion_test_row_df, no_emotion_test_row_df])
# total test set
no_emotion_test_row_df = test_video_text[~test_video_text.video_id.isin(emotion_test_row_df.video_id)]
no_emotion_test_row_df['emotion'] = 0
emotion_test_row_df['emotion'] = 1
print('Total Test shape:', emotion_test_row_df.shape, no_emotion_test_row_df.shape)
total_test_row_df = pd.concat([emotion_test_row_df, no_emotion_test_row_df])

Train shape: (1092, 3) (484, 3)
Test shape: (257, 3) (137, 3)
Total Test shape: (257, 3) (137, 3)




### Check the sentimental compound score

As a sanity check on the selection, we print the polarity scores, including the compound score, for the first six captions of the full test set.

In [None]:
# sentiment score
sia = SentimentIntensityAnalyzer()

count = 0
for text in total_test_row_df.caption:
    if count > 5:
        break
    print(text)
    print(sia.polarity_scores(text))
    print()
    count += 1

a man. a man is playing a guitar. a man is playing guitar on stage. john denver is on stage playing a guitar and singing. a man is playing and singing. a man is singing and playing guitar. john denver is singing and playing a guitar on stage. a man is singing on stage. a man is playing a guitar and singing on a stage. a man is playing a guitar. a man is playing guitar. someone is singing. john denver is playing a guitar. a man standing on stage is singing a song while strumming a guitar along with a band of musicians. john denver performed with his band. the man is singing and playing the guitar. john denver is playing his guitar and singing. a man plays the guitar. the man sang country songs on stage. the men is singing. a band is playing on stage. a man is playing guiatar. john denver singing and playing a guitar. a man is playing a guitar on stage. john denver is singing song with guiter. john denver sang and played his guitar on stage. a man singing a song. the man is singing and p

### Step (2): Assign Specific Emotion

We applied this process to two datasets: the 1,092 training videos selected in Step 1 and the entire 394-video test set, which we use as is because it already contains a reasonable mix of sentimental (257) and non-sentimental (137) videos.

In [None]:
# Train
# create the sentiment columns
emotion_train_row_df = create_sentiment_columns(emotion_train_row_df)
# create the emotion columns
emotion_train_row_df = create_emotion_columns(emotion_train_row_df)

# Total Test
# create the sentiment columns
total_test_row_df = create_sentiment_columns(total_test_row_df)
# create the emotion columns
total_test_row_df = create_emotion_columns(total_test_row_df)


100%|██████████| 1092/1092 [00:09<00:00, 119.37it/s]
100%|██████████| 1092/1092 [00:09<00:00, 110.41it/s]
100%|██████████| 394/394 [00:02<00:00, 135.39it/s]
100%|██████████| 394/394 [00:03<00:00, 118.65it/s]


In [None]:
# train
emotion_train_row_df.to_json('msvd_train1k.json', orient='records', lines=False)
# total test
total_test_row_df.to_json('msvd_test400.json', orient='records', lines=False)

# 2. Model Development

## 2.1 Baseline (CLIP-ViP)

<img src="./baseline.png" alt="nn" width="1000" height="800">

    - Task: video-language alignment (Video-to-Text, Text-to-Video)
    - Goal: 
        - Adapting image-text pre-trained models to video-text pre-training (i.e. post-training) 
        - Video Proxy mechanism on the basis of CLIP (CLIP-ViP) 
    - Approach 
        - In-domain auxiliary data generation: 
            - to bridge language domain gaps between images and videos datasets 
            - introduce auxiliary captions into large-scale video-subtitle data to reduce the language domain gap between pre-training and downstream data 
                - pre-training: Image-Text learning 
                - downstream data: Video-Text learning 

        - Video Proxy Mechanism:
            - to enable the Vision Transformer (ViT) model for both image and video encoding 
            - Before feeding into CLIP, we concatenate patch tokens with a set of learnable parameters called video proxy tokens
            - The output of the first video proxy token will be regarded as the video's representation 
    - Loss function:
        - info-NCE loss 



We use CLIP-ViP (1) as our baseline, a video-text retrieval model built on pretrained image-text CLIP models. As outlined above, it adapts CLIP to video-text retrieval through "post-training", together with in-domain auxiliary data generation and the video proxy mechanism. Like many other video-text retrieval models, it is trained with contrastive learning.
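The info-NCE objective listed in the baseline overview can be sketched as follows. This is a minimal NumPy illustration and not the CLIP-ViP training code: the function names, the temperature of 0.07, and the toy batch are assumptions made only for this example.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric info-NCE loss; row i of each input is a matched (video, text) pair."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # [batch, batch]; the diagonal holds matched pairs
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()   # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()   # text -> video
    return 0.5 * (loss_v2t + loss_t2v)

# Toy batch (assumed values): nearly aligned pairs vs. unrelated pairs
rng = np.random.default_rng(0)
video = rng.normal(size=(4, 16))
aligned_loss = info_nce_loss(video, video + 0.01 * rng.normal(size=(4, 16)))
random_loss = info_nce_loss(video, rng.normal(size=(4, 16)))
print(aligned_loss, random_loss)
```

The loss is small when each video is most similar to its own caption and grows when matched pairs are no more similar than mismatched ones, which is the behavior the retrieval model relies on.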


(1) Xue, Hongwei, et al. "Clip-vip: Adapting pre-trained image-text model to video-language alignment." The Eleventh International Conference on Learning Representations. 2022.
## 2.2 Our Model (Use of Emotion Embeddings)


<img src="./ourEmbeddingModel_2.png" alt="nn" width="1000" height="800">


    - Modify the creation of text embeddings in the CLIP-ViP model
        - Use the emotion features extracted in the dataset as "emotion embeddings"
        - We find the average of the emotion embeddings corresponding to the emotions in each caption
        - This average emotion embedding is added to the final text embedding to allow the model to incorporate emotion data
Below, we give a code example of the modified text embedding class of the original model.

### Model Code - Emotion Embeddings

In [14]:
import torch.nn as nn
import torch
import numpy as np

In [7]:
data = [{
"video": "video7112.mp4",
"video_id": "video7112",
"caption": "while other friends too try and hitting the basket another is eager to achieve his fourth successful basket in basketball",
"duration": 18.35,
"emotion": 1,
"positive": 4.0,
"negative": 0.0,
"neutral": 0.0,
"joy": 4.0,
"trust": 3.0,
"surprise": 1.0,
"anticipation": 3.0,
"fear": 0.0,
"sadness": 0.0,
"disgust": 0.0,
"anger": 0.0
}]

In [5]:
# Initialize embeddings for each of the 8 emotions
embed_dim = 512
emotion_embedding = nn.Embedding(8, embed_dim)

In [6]:
# Embedding table with embeddings for each of the 8 emotions
emotion_embedding

Embedding(8, 512)

In [34]:
# Change non-zero values to 1, effectively binarizing the input
emotions = torch.tensor(np.array([4.0, 3.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0])).unsqueeze(0) 
print(emotions.shape) # (bs, 8)
if emotions is not None:
    emotions = torch.where(emotions > 0, torch.ones_like(emotions), torch.zeros_like(emotions))
    print(f"emotions: {emotions}\n") 

torch.Size([1, 8])
emotions: tensor([[1., 1., 1., 1., 0., 0., 0., 0.]], dtype=torch.float64)



In [35]:
# Retrieve all emotion embeddings
batch_size = 1 
all_emotion_embeds = emotion_embedding.weight.unsqueeze(0).repeat(batch_size, 1, 1)  # [batch_size, 8, embed_dim]

In [36]:
all_emotion_embeds.shape

torch.Size([1, 8, 512])

In [40]:
# We find the emotion embeddings for the emotions that are present in the input
if emotions is not None:
    emotion_mask = emotions.unsqueeze(-1).type_as(all_emotion_embeds)  # [batch_size, 8, 1]
    print(f"emotion_mask: {emotion_mask}, shape = {emotion_mask.shape}\n")
    selected_emotion_embeds = all_emotion_embeds * emotion_mask  # [batch_size, 8, embed_dim]
    print(f"selected_motion_embeds: {selected_emotion_embeds.shape}\n")
    
    # We calculate the average of the emotion embeddings that are present
    safe_divisor = emotion_mask.sum(1) + (emotion_mask.sum(1) == 0).type_as(emotion_mask)
    emotion_embeds = selected_emotion_embeds.sum(1) / safe_divisor # [batch_size, embed_dim]

    print(f"emotion_embeds.shape: {emotion_embeds.shape}")

emotion_mask: tensor([[[1.],
         [1.],
         [1.],
         [1.],
         [0.],
         [0.],
         [0.],
         [0.]]]), shape = torch.Size([1, 8, 1])

selected_motion_embeds: torch.Size([1, 8, 512])

emotion_embeds.shape: torch.Size([1, 512])


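In [None]:
# Excerpt from inside the forward pass of the modified text embedding module;
# `self`, `inputs_embeds` and `position_ids` are defined there.
# The sequence length is read from the token embeddings: [batch_size, seq_length, embed_dim]
seq_length = inputs_embeds.shape[1]

# Expand the emotion embeddings to match the sequence length
emotion_embeds = emotion_embeds.unsqueeze(1).expand(-1, seq_length, -1)  # [batch_size, seq_length, embed_dim]

position_embeddings = self.position_embedding(position_ids)

# Final text embeddings are the sum of the input embeddings, position embeddings and emotion embeddings
embeddings = inputs_embeds + position_embeddings + emotion_embeds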
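For reference, the cells above can be collected into one self-contained module. This is only a sketch under stated assumptions: the class name `EmotionTextEmbeddings`, the vocabulary size, and the maximum number of positions are hypothetical and are not taken from the CLIP-ViP code base.

```python
import torch
import torch.nn as nn

class EmotionTextEmbeddings(nn.Module):
    """Token + position embeddings, plus an averaged emotion embedding added at every position."""

    def __init__(self, vocab_size=1000, max_positions=32, embed_dim=512, num_emotions=8):
        super().__init__()
        # vocab_size and max_positions are placeholder values for this sketch
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        self.position_embedding = nn.Embedding(max_positions, embed_dim)
        self.emotion_embedding = nn.Embedding(num_emotions, embed_dim)

    def forward(self, input_ids, emotions=None):
        seq_length = input_ids.shape[1]
        position_ids = torch.arange(seq_length, device=input_ids.device).unsqueeze(0)
        embeddings = self.token_embedding(input_ids) + self.position_embedding(position_ids)
        if emotions is not None:
            # Binarize the emotion scores: 1 where the emotion is present, 0 otherwise
            mask = (emotions > 0).to(embeddings.dtype).unsqueeze(-1)             # [bs, 8, 1]
            summed = (self.emotion_embedding.weight.unsqueeze(0) * mask).sum(1)  # [bs, dim]
            count = mask.sum(1).clamp(min=1.0)  # avoid division by zero for captions with no emotion
            emotion_embeds = summed / count                                      # [bs, dim]
            embeddings = embeddings + emotion_embeds.unsqueeze(1)                # broadcast over the sequence
        return embeddings

torch.manual_seed(0)
module = EmotionTextEmbeddings()
input_ids = torch.randint(0, 1000, (2, 10))
emotions = torch.tensor([[4.0, 3.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0],
                         [0.0] * 8])
out = module(input_ids, emotions)
print(out.shape)  # torch.Size([2, 10, 512])
```

In this sketch a caption with no emotions (the second row) receives a zero emotion vector, so its embeddings are identical to those produced without any emotion input.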

# 3. Model Performance Evaluation

## 3.1 Training Loss Curves and Validation Curves

Performance
1. MSR-VTT 9K: baseline
    - Training Loss
    - Validation Performance
2. MSR-VTT 7K: baseline
    - Training Loss
    - Validation Performance
3. MSR-VTT 6K: baseline
    - Training Loss
    - Validation Performance
4. MSR-VTT 6K: emotion embedding
    - Training Loss
    - Validation Performance

## 3.2 Final Results

To evaluate the effect of training on data splits of different sizes, we trained the baseline model on the conventional 7k and 9k training sets and validated on the 1k validation set. We also trained on the 6k-emotion dataset we created earlier. To measure overall performance, we first evaluate each model on the entire test set, labeled "Emotion+Neutral" in the tables. Separately, __to evaluate the performance of these models on the retrieval of the caption-video pairs with emotions, which we previously identified as 34 videos out of the 1000 videos in the test dataset, we calculate the recall values for only these 34 queries, instead of the total 1000.__ These results are listed in the columns labeled "Emotion" in the tables.
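The subset evaluation described above can be illustrated with a short sketch. The function name `retrieval_metrics` and the toy similarity matrix are hypothetical, and we assume here that each selected query is still ranked against the full pool of candidate videos; only the set of queries that is averaged over changes.

```python
import numpy as np

def retrieval_metrics(sim, query_ids=None):
    """sim[i, j] = similarity of text query i and video j; the correct video for query i is video i."""
    order = np.argsort(-sim, axis=1)                  # candidate videos sorted best-first
    gt = np.arange(len(sim))[:, None]
    ranks = np.argmax(order == gt, axis=1) + 1        # 1-based rank of the correct video
    if query_ids is not None:
        ranks = ranks[query_ids]                      # keep only the selected (e.g. emotion) queries
    return {
        "R@1": 100.0 * float(np.mean(ranks <= 1)),
        "R@5": 100.0 * float(np.mean(ranks <= 5)),
        "R@10": 100.0 * float(np.mean(ranks <= 10)),
        "MedianR": float(np.median(ranks)),
        "MeanR": float(np.mean(ranks)),
    }

# Toy 4x4 similarity matrix (assumed values); the correct ranks are 1, 2, 1, 4
sim = np.array([[0.9, 0.1, 0.2, 0.3],
                [0.8, 0.5, 0.1, 0.2],
                [0.1, 0.2, 0.7, 0.3],
                [0.9, 0.8, 0.7, 0.1]])
all_metrics = retrieval_metrics(sim)                       # all queries
subset_metrics = retrieval_metrics(sim, query_ids=[1, 3])  # only the selected queries
print(all_metrics)
print(subset_metrics)
```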

## MSR-VTT 7k: Baseline Model


| Test Data                 | Emotion+Neutral |  Emotion+Neutral          |  Emotion  | Emotion          |
|:---------------------------:|:--------------:|:--------------:|:--------------:|:--------------:|
| Metric                    | T2V          | V2T          | T2V          | V2T          |
| Recall@1               |   49.4000%   |  47.8044%    |  32.3529%  |  41.1765%  |
| Recall@5               |   73.0000%   |  74.9501%    |  61.7647%  |  67.6471%  |
| Recall@10              |   83.4000%   |   84.4311%   |  73.5294%  |  82.3529%  |
| Median Rank            |     2.0         |   2.0        |    3.5     |    3.0     |
| Mean Rank              |       14.5       |   10.3       |   18.1     |   14.6     |





## MSR-VTT 9k: Baseline Model


| Test Data                 | Emotion+Neutral |  Emotion+Neutral          |  Emotion  | Emotion          |
|:---------------------------:|:--------------:|:--------------:|:--------------:|:--------------:|
| Metric                    | T2V          | V2T          | T2V          | V2T          |
| Recall@1               |  49.5000%    | 49.3028%     |  35.2941%  |  35.2941%  |
| Recall@5               |   74.7000%   | 76.6932%     |  61.7647%  |  67.6471%  |
| Recall@10              |  84.8000%    | 85.3586%     |  73.5294%  |  82.3529%  |
| Median Rank            |   2.0        |    2.0       |    2.5     |    3.0     |
| Mean Rank              |   13.4       |    9.5       |   15.9     |   13.1     |




## MSR-VTT 6k (Emotion): Baseline Model
| Test Data                 | Emotion+Neutral |  Emotion+Neutral          |  Emotion  | Emotion          |
|:---------------------------:|:--------------:|:--------------:|:--------------:|:--------------:|
| Metric                    | T2V          | V2T          | T2V          | V2T          |
| Recall@1               |   49.0000%   |  48.4032%    |  29.4118%  |  32.3529%  |
| Recall@5               |    73.2000%  |  75.6487%    |  58.8235%  |  70.5882%  |
| Recall@10              |  84.5000%    |  84.7305%    |  79.4118%  |  79.4118%  |
| Median Rank            |   2.0        |   2.0        |    3.5     |    2.0     |
| Mean Rank              |    13.6      |   9.9        |   16.8     |   14.7     |



As expected, performance on emotion-containing queries (the "Emotion" columns) degrades when training on the 6k (emotion) dataset compared with the other two datasets, which contain both emotional and neutral data, because the model sees less data during training. The drop is clearest at Recall@1: 29.4% (T2V) and 32.4% (V2T) for 6k, versus 32.4% and 41.2% for 7k, and 35.3% and 35.3% for 9k. From this, we conclude that training exclusively on emotion-containing data does not translate into improved performance on emotion-containing queries. In addition, when comparing results on the entire test set with those on the emotion-containing queries alone, we find that for all training set sizes, every recall value for "Emotion" is lower than the corresponding value for "Emotion+Neutral". These observations motivate us to implement additional methods that focus on the emotion information in our data, to better find such matches.

## MSR-VTT 6k (Emotion): Our Model (Emotion Embeddings)
| Test Data                 | Emotion+Neutral |  Emotion+Neutral          |  Emotion  | Emotion          |
|:---------------------------:|:--------------:|:--------------:|:--------------:|:--------------:|
| Metric                    | T2V          | V2T          | T2V          | V2T          |
| Recall@1                  |  23.9000%    |    44.8104%  | 5.8824%      | 8.8235%      |
| Recall@5                  |  41.5000%    |  28.1437%    | 14.7059%     | 20.5882%     |
| Recall@10                 |  49.0000%    |   50.9980%   | 23.5294%     | 20.5882%     |
| Median Rank               |     11.0     |   9.5        | 83.0         | 66.5         |
| Mean Rank                 |      124.4   |      72.6    | 206.6        | 115.5        |


The table above shows the results of training our proposed model. Our naive use of emotion embeddings actually hurts performance, with recall values dropping well below the baseline (for example, T2V Recall@1 on the full test set falls from 49.0% to 23.9%). We suspect several possible causes:
* The emotion embeddings are currently added to all tokens in the sequence, including the padding tokens used to reach the maximum length. We suspect that this adds noise during training.
* The simple addition of these emotion embeddings may not be enough to capture the complex relationships they have with the captions and videos. More sophisticated methods should be explored.
* We currently do not model the absence of emotions as an embedding, although it is an important aspect of our data.

These are all points that we can improve during the remainder of the project.
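As an illustration of how the first and third points might be addressed, the sketch below adds an extra "no emotion" embedding and leaves padding positions untouched. It is a hypothetical variant that we have not evaluated: the function name `masked_emotion_embeds`, the use of index 8 for the neutral embedding, and the attention-mask input are all assumptions.

```python
import torch
import torch.nn as nn

embed_dim = 512
# Assumption: 8 emotion embeddings plus one extra "no emotion" embedding at index 8
emotion_embedding = nn.Embedding(9, embed_dim)

def masked_emotion_embeds(emotions, attention_mask):
    """emotions: [bs, 8] scores; attention_mask: [bs, seq_len] with 1 for real tokens, 0 for padding."""
    present = (emotions > 0).float()                        # [bs, 8]
    neutral = (present.sum(1, keepdim=True) == 0).float()   # [bs, 1], 1 if no emotion is present
    weights = torch.cat([present, neutral], dim=1)          # [bs, 9]
    weights = weights / weights.sum(1, keepdim=True)        # average over the selected embeddings
    pooled = weights @ emotion_embedding.weight             # [bs, embed_dim]
    # Zero out the emotion vector at padding positions
    return pooled.unsqueeze(1) * attention_mask.unsqueeze(-1).float()  # [bs, seq_len, embed_dim]

emotions = torch.tensor([[4.0, 3.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0],
                         [0.0] * 8])
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 0, 0, 0]])
emotion_part = masked_emotion_embeds(emotions, attention_mask)
print(emotion_part.shape)  # torch.Size([2, 5, 512])
```

The returned tensor would be added to the token and position embeddings in place of the current emotion term, so a caption without emotions receives the learned neutral vector instead of zeros.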

# 4. Remaining Project Plan (To-Do) 

* We believe that an in-depth analysis is required to understand the exact effects of integrating this emotion data during training. We will track how changes in model architecture affect the embedding space, using visualization approaches such as t-SNE. We will choose our final model based on these observations, so that it best aligns with our goals.

* As the goal of this project is to create a model that finds videos more relevant to the query, especially emotionally, we will conduct further analysis on how the model learns to incorporate these emotions. We will do this by identifying, in advance, captions that are semantically or emotionally similar to a previously unseen emotional query, and then tracking the changes in the rankings of these captions when the query is given as input.
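A minimal sketch of this planned rank-tracking analysis is shown below. The helper names `ranks_of` and `rank_changes` and all similarity values are hypothetical placeholders, since the analysis has not been run yet.

```python
import numpy as np

def ranks_of(similarities, target_ids):
    """1-based rank of each target item when all items are sorted by descending similarity."""
    order = np.argsort(-np.asarray(similarities))
    position = np.empty_like(order)
    position[order] = np.arange(1, len(order) + 1)
    return position[target_ids]

def rank_changes(sim_before, sim_after, target_ids):
    """Positive values mean the target moved up (towards rank 1) after the model change."""
    return ranks_of(sim_before, target_ids) - ranks_of(sim_after, target_ids)

# Toy similarities of one query to five captions under two models (assumed values)
sim_baseline = [0.9, 0.2, 0.4, 0.6, 0.1]
sim_emotion = [0.5, 0.8, 0.4, 0.7, 0.1]
related = [1, 3]  # captions identified in advance as emotionally similar to the query
changes = rank_changes(sim_baseline, sim_emotion, related)
print(changes)
```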

* Finally, as our main objective is to create a useful tool for retrieving more emotionally relevant videos based on user queries, we will build a functional application that uses this model to perform this task.

# 5. Possible Future Directions

In this section, we list what we hope to attempt or achieve during the remainder of this project. There are many aspects in which our project can be improved, and we aim to implement as many of these changes as possible.

1. Test Data Collection:
    - Use the image-captioning model OFA-Caption (Wang et al., 2022b) to generate one caption for the middle frame of each video in HD-VILA-100M, with a maximum length of 16 words
2. Model development
    - Our model currently adds the emotion embeddings to the text embeddings. Given the sequential nature of the input, this may not be the most effective way to apply them. Instead of a simple sum, we hope to try more sophisticated methods, such as concatenating the emotion embeddings and passing them through a fully connected layer to learn the final emotion embedding, changing how we encode emotions by also creating an embedding for captions without emotions, and several other modeling variations to improve performance.

    - By creating a new emotion encoder, we hope to apply cross-attention on video, text, and emotion features to better learn the relationships between the three. (1) 
    
    - In addition to using the emotional data in the text (captions), we hope to use Facial Expression Recognition (FER) models to extract emotional data from the videos and integrate these video emotion features as well.
3. Data Processing
   - Lack of training and test data:
         - Because the first step of data preprocessing (sentiment analysis) filters out so many videos, we are short on data to train on. We plan to experiment with using the full dataset together with the emotions assigned in step 2 of data preprocessing, essentially skipping step 1.
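The concatenation-plus-fully-connected-layer idea from the model development item could look roughly like the sketch below. The class name `ConcatEmotionFusion` and the layer sizes are hypothetical, and this variant has not been implemented or evaluated in our experiments.

```python
import torch
import torch.nn as nn

class ConcatEmotionFusion(nn.Module):
    """Concatenate the 8 masked emotion embeddings and project them with a fully connected layer."""

    def __init__(self, num_emotions=8, embed_dim=512):
        super().__init__()
        self.emotion_embedding = nn.Embedding(num_emotions, embed_dim)
        self.proj = nn.Linear(num_emotions * embed_dim, embed_dim)

    def forward(self, emotions):
        weight = self.emotion_embedding.weight                 # [8, embed_dim]
        mask = (emotions > 0).to(weight.dtype).unsqueeze(-1)   # [bs, 8, 1]
        selected = weight.unsqueeze(0) * mask                  # absent emotions become zero vectors
        return self.proj(selected.flatten(start_dim=1))        # [bs, embed_dim]

torch.manual_seed(0)
fusion = ConcatEmotionFusion()
emotions = torch.tensor([[4.0, 3.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0],
                         [0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 1.0]])
fused = fusion(emotions)
print(fused.shape)  # torch.Size([2, 512])
```

Unlike averaging, the projection can weight each emotion slot differently, at the cost of `num_emotions * embed_dim * embed_dim` additional parameters.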

(1) X. Zhang, M. Li, S. Lin, H. Xu and G. Xiao, "Transformer-based Multimodal Emotional Perception for Dynamic Facial Expression Recognition in the Wild," in IEEE Transactions on Circuits and Systems for Video Technology, doi: 10.1109/TCSVT.2023.3312858.
