# MultiModal Sentiment Analysis

# DA623 Course Project

# Author   : Uttam Biswas
# Roll No. : 244156008

### Motivation for choosing this topic

I chose Multimodal Sentiment Analysis because it mirrors the way humans interpret emotions: by integrating text, speech, and visual cues at the same time. In the age of social media and video calls, analyzing sentiment across several channels is highly relevant. What I find fascinating is how models align different data types and combine them, especially because the topic has significant potential for real-world problems in content moderation, virtual assistants, and user feedback systems.

### Historical Perspective

Multimodal Sentiment Analysis has evolved alongside advancements in multimodal learning. Early approaches extracted handcrafted features from text, audio, and video separately and combined them using traditional classifiers. With the rise of deep learning, datasets such as CMU-MOSI and CMU-MOSEI enabled deep fusion models (e.g., LSTM-based fusion, Tensor Fusion Networks) to learn joint representations.

More recently, transformer-based models like Multimodal Transformer (MulT) and FLAVA have pushed the state of the art by using attention mechanisms to align and integrate modalities more effectively. This topic is a natural extension of the broader trend in multimodal learning toward unified models that can reason across different data types.

### Learning from this project

To understand this topic, I first read several research papers, aiming to follow each one well enough to implement its ideas in the near future.

Among the papers I read, the one I found most interesting was the following:

### Title: Tensor Fusion Network for Multimodal Sentiment Analysis
#### Authors: Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, Louis-Philippe Morency

I'll try to explain as much as I can from this research paper.

### Abstract

Multimodal sentiment analysis is a growing area of research that goes beyond analyzing text alone to understand emotions. It looks at multiple sources, such as speech, facial expressions, and voice tone, together with language. In this paper, the authors argue that to understand emotions well, a model needs to learn patterns both within each type of data (like text or audio), called intra-modality dynamics, and between the types of data, called inter-modality dynamics. They propose a new method called the Tensor Fusion Network that can learn these patterns directly from the data. This approach is especially useful for analyzing online videos, where people often speak with expressive tones and gestures. Their experiments show that this method works better than previous state-of-the-art approaches for combining multiple data types to detect sentiment.

### Introduction

Multimodal sentiment analysis is becoming more important in the field of emotion and opinion understanding. Unlike traditional methods that only use text, this area focuses on analyzing opinions from videos by combining three sources: the words people say (language), how they say them (voice), and their body language or facial expressions (visuals).

This shift is especially useful in today's world where people often share their opinions through videos on platforms like YouTube or Facebook. One of the biggest challenges in this task is figuring out how different types of data (text, audio, visuals) interact with each other to express a particular emotion or feeling. For example, the phrase “This movie is sick” can mean different things depending on whether the person is smiling or frowning, or whether they sound excited or angry.

Another challenge is understanding patterns within each type of data on its own—especially language, since spoken opinions are often messy and less structured than written ones. People might hesitate, repeat themselves, or change their tone mid-sentence, which makes the analysis harder.

Earlier research tried two main approaches to combine data:

1) Early fusion, where features from all modalities are simply joined together at the beginning.

2) Late fusion, where each type of data is analyzed separately and the results are combined at the end.

But both of these have limitations. Early fusion can become too complex and confuse the model, while late fusion misses the chance to understand how the different types of data work together.
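The difference between the two strategies can be sketched in a few lines. This is only an illustration, not code from the paper: the feature sizes, the random weights, and the `linear_score` helper are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality feature vectors for one clip (sizes are arbitrary).
text_feat = rng.normal(size=8)
audio_feat = rng.normal(size=4)
visual_feat = rng.normal(size=5)

def linear_score(features, weights):
    """Toy sentiment scorer: a single dot product."""
    return float(features @ weights)

# Early fusion: concatenate the features first, then score the joint vector once.
early_input = np.concatenate([text_feat, audio_feat, visual_feat])
early_score = linear_score(early_input, rng.normal(size=early_input.shape[0]))

# Late fusion: score each modality on its own, then average the three decisions.
late_scores = [
    linear_score(text_feat, rng.normal(size=8)),
    linear_score(audio_feat, rng.normal(size=4)),
    linear_score(visual_feat, rng.normal(size=5)),
]
late_score = float(np.mean(late_scores))

print(early_input.shape)  # (17,)
```

In the late-fusion branch no weight ever sees two modalities at once, which is why it cannot model how they interact.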

To solve this, the authors propose a new model called the Tensor Fusion Network (TFN). This model is designed to learn both types of patterns—within a single modality and across multiple modalities—at the same time. It uses something called Tensor Fusion to capture interactions between one, two, or all three data types. Each type of input (text, audio, video) also has its own specialized sub-network to learn better from that specific kind of data.

Their experiments show that TFN performs better than previous models in understanding emotions from videos, both when using all modalities together and even when using just one.

![image.png](attachment:image.png)

This image helps explain how unimodal, bimodal, and trimodal interactions work in multimodal sentiment analysis, using simple examples like how someone says “This movie is sick” and what their facial expression or tone might add to the meaning.

##### Unimodal
Only one type of information (modality) is used: either the spoken words, the facial expression, or the voice.

Example 1: If someone just says, “This movie is sick,” it’s hard to tell whether they mean it in a good or bad way.
→ Result: Confusing (?)

Example 2: If someone just smiles, even without saying anything, we assume they’re happy.
→ Result: Positive (++)

Example 3: Just hearing a loud voice alone doesn’t clearly tell us if it’s a happy or angry loud.
→ Result: Confusing (?)

##### Bimodal
Now we combine two types of information, like words + facial expression or words + voice.

Example 1: “This movie is sick” + a smile → Sounds like they mean it in a good way.
→ Result: Positive (++)

Example 2: “This movie is sick” + a frown → Now it sounds negative, even though the words are the same.
→ Result: Negative (--)

Example 3: “This movie is sick” + loud voice → Still unclear; could be shouting with joy or anger.
→ Result: Confusing (?)

##### Trimodal
All three types of signals are combined — words, facial expression, and voice.

Example 1: “This movie is sick” + smile + loud voice → Clearly sounds excited and positive.
→ Result: Strongly Positive (+++)

Example 2: “This movie is fair” + smile + loud voice → Even though “fair” means okay, the tone and smile make it feel mildly positive.
→ Result: Positive (+)



### Related Work

Sentiment analysis is about understanding if people are happy, sad, or neutral based on what they say. At first, it was done using only text — like checking for positive or negative words or phrases. Later, more advanced methods started looking at how words are structured and how meanings are formed in sentences.

Now, people share opinions through videos a lot — like on YouTube — so researchers started using multimodal sentiment analysis, which looks at text (what is said), audio (how it’s said), and visuals (like facial expressions or gestures) all together.

There are some popular video datasets for this, especially CMU-MOSI, which helps models learn from real spoken sentences with labeled emotions. Some models use deep networks such as CNNs, and others use classical classifiers such as SVMs, to mix all this information and make better predictions.

This is also connected to emotion recognition, where voice and facial expressions help in understanding feelings. In general, using more than one type of data (text, sound, video) is becoming a big trend in machine learning.



### CMU-MOSI Dataset
The CMU-MOSI dataset is a collection of short video clips where people give their opinions about movies, mostly from YouTube. What makes this dataset different is that it doesn't just look at the text (what the person says), but also at their voice tone and facial expressions. Because it uses multiple types of data, it is called multimodal.



![image.png](attachment:image.png)

Each opinion in the dataset is labeled based on how positive or negative it is. The labels are on a 7-point scale ranging from highly negative to highly positive. There are a total of 2,199 opinion clips taken from 93 videos by 89 different speakers. On average, there are about 23 short opinion clips in each video, and each clip is about 4 seconds long.

To make sure the labels are fair, five human annotators watched each clip and rated the sentiment. They mostly agreed with each other, which shows that the labels are reliable.
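As a small illustration of how several annotator ratings could become one label, the sketch below averages five ratings on the -3 to +3 scale and derives a binary polarity from the result. The ratings are made up, and both the averaging step and the threshold at 0 are assumptions for illustration, not details taken from the text above.

```python
# Hypothetical ratings from five annotators for one clip, on the -3..+3 scale.
ratings = [2, 3, 2, 1, 2]

def clip_label(ratings):
    """Average the annotators' ratings into one continuous sentiment score."""
    return sum(ratings) / len(ratings)

def polarity(score):
    """Collapse a continuous score into a binary label (assumed threshold: 0)."""
    return "positive" if score >= 0 else "negative"

score = clip_label(ratings)
print(score, polarity(score))  # 2.0 positive
```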

### Tensor Fusion Network (TFN)
The Tensor Fusion Network (TFN) is a deep multimodal architecture designed to perform sentiment analysis on spoken opinion utterances by combining language, visual, and acoustic modalities. It is composed of three main components:

##### 1. Modality Embedding Subnetworks
Each modality (language, visual, and acoustic) has its own embedding subnetwork that takes raw features from that modality and transforms them into a rich, compact representation (embedding). These embeddings are learned specifically to capture meaningful patterns in each modality.

##### 2. Tensor Fusion Layer
Instead of simply concatenating the embeddings (as done in early fusion), this layer performs a 3-fold Cartesian product of the three embeddings. This operation explicitly models:

Unimodal interactions (individual modalities),

Bimodal interactions (pairs of modalities),

Trimodal interactions (jointly across all three modalities).

This results in a high-dimensional tensor that captures intricate cross-modal relationships, enabling more expressive modeling of sentiment-related cues.

##### 3. Sentiment Inference Subnetwork
This is the final prediction component. It takes the output from the Tensor Fusion Layer and performs sentiment inference. The exact output depends on the task type:

Binary classification (e.g., positive vs. negative sentiment),

5-class classification (e.g., very negative to very positive),

Regression (e.g., predicting a real-valued sentiment score)

### Spoken Language Embedding Subnetwork (Ul)
Spoken language often sounds different from written language. For example, someone might say, “I think it was alright... hmm... let me think... yeah... no... okay... yeah.” This kind of back-and-forth, where people talk while thinking, is very common in speech but not in writing. To handle this natural but unpredictable way of speaking, our model needs to focus on the meaningful parts and ignore the rest.

Our method learns a detailed representation for each spoken word, one by one, while also keeping track of the context from earlier words in the sentence. This helps the model understand what’s being said, even if later parts are unclear or off-topic. If useful information appears again later, the model can also pick that up and update its understanding.

We use an LSTM (Long Short-Term Memory) network, which is good at remembering what’s important over time. Each word is first turned into a 300-dimensional vector using GloVe word embeddings. The LSTM processes these word vectors and produces 128-dimensional outputs that capture the meaning and context of each word.

![image.png](attachment:image.png)

These outputs form a sequence of vectors, which we combine and pass through a fully connected neural layer to get a final 128-dimensional spoken language embedding $z^l$. This vector is what the rest of the system uses to represent the spoken input.

![image.png](attachment:image.png)

$h_l$ is the matrix of language representations formed by concatenating $h_1, h_2, h_3, \ldots, h_T$. It is then used as input to a fully connected network that generates the language embedding $z^l$:

![image.png](attachment:image.png)

where $W_l$ is the set of all weights in the $U_l$ network and $\sigma$ is the sigmoid function.
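The forward pass of this subnetwork can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the weights are random, the word vectors are random stand-ins for GloVe, the utterance length `T` is fixed at 20, and a single ReLU layer stands in for the fully connected network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(words, W, b, hidden=128):
    """Run a single-layer LSTM over `words` (T x 300) and return every h_t (T x hidden).

    W maps the concatenated [x_t, h_{t-1}] to the four gate pre-activations.
    """
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    outputs = []
    for x_t in words:
        gates = W @ np.concatenate([x_t, h]) + b
        i, f, o, m = np.split(gates, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(m)  # update the memory cell
        h = sigmoid(o) * np.tanh(c)                   # expose the gated output
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
T, embed_dim, hidden = 20, 300, 128      # T is an assumed fixed utterance length
words = rng.normal(size=(T, embed_dim))  # random stand-in for GloVe vectors
W = rng.normal(scale=0.05, size=(4 * hidden, embed_dim + hidden))
b = np.zeros(4 * hidden)

h_l = lstm_forward(words, W, b, hidden)  # one 128-d vector per word

# Fully connected step on the flattened h_l (a single ReLU layer is an assumption).
W_fc = rng.normal(scale=0.01, size=(128, T * hidden))
z_l = np.maximum(0.0, W_fc @ h_l.reshape(-1))

print(h_l.shape, z_l.shape)  # (20, 128) (128,)
```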

### Visual Embedding Subnetwork (Uv)
In opinion videos, people usually speak directly to the camera, so their facial expressions are the most important visual cues. For every video frame (taken at 30 frames per second), we detect the speaker’s face and analyze emotions like anger, joy, sadness, surprise, and more. This is done using the FACET tool, which also tracks 20 specific facial muscle movements called Action Units.

We also use OpenFace to get details like head movement, face angles, and 68 key points on the face. All of these features are collected for every frame and put together as a visual feature vector $v_j$.

To keep things simple, we calculate the average of these visual features across all frames in the video. This gives us a single combined visual vector v.

This vector is then passed into a neural network with three hidden layers (each with 32 ReLU units). This network, using weights $W_v$, creates the final visual embedding $z^v$, which is a 32-dimensional vector. This embedding captures the important visual signals from the speaker's face and can be used in the next steps of the system.
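The two steps described above, averaging over frames and then applying three ReLU layers of 32 units, can be sketched as follows. The number of features per frame (50), the frame count, and the random weights are hypothetical values chosen only to make the sketch runnable.

```python
import numpy as np

def mlp_relu(x, weights, biases):
    """Apply a stack of fully connected ReLU layers to the vector x."""
    for W, b in zip(weights, biases):
        x = np.maximum(0.0, W @ x + b)
    return x

rng = np.random.default_rng(0)
num_frames, feat_dim = 120, 50  # hypothetical: about 4 s at 30 fps, 50 features per frame
frames = rng.normal(size=(num_frames, feat_dim))  # one feature vector v_j per frame

# Average the per-frame features into a single visual vector v.
v = frames.mean(axis=0)

# Three hidden layers of 32 ReLU units each; random weights stand in for W_v.
sizes = [feat_dim, 32, 32, 32]
weights = [rng.normal(scale=0.1, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

z_v = mlp_relu(v, weights, biases)
print(v.shape, z_v.shape)  # (50,) (32,)
```

Averaging discards the order of the frames, which keeps the subnetwork small at the cost of losing temporal detail.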



![image.png](attachment:image.png)

### Acoustic Embedding Subnetwork (Ua)
For every opinion audio clip, we extract sound features using the COVAREP toolkit. These include MFCCs, pitch, voiced vs unvoiced segments, and other voice-related features like glottal flow, peak slope, and vocal tract shape. These features help capture important details of how a person is speaking and can reflect emotions in their voice.

Audio is split into short frames taken every 10 milliseconds (100 times per second). For each frame, a set of features is collected and then averaged across all frames in the clip to create a single summary vector a.

This acoustic vector a is passed into a deep neural network (with three hidden layers, each having 32 ReLU units) to generate the final sound-based embedding $z^a$. We found that making the network larger didn’t improve results, so we kept it simple and efficient.

![image.png](attachment:image.png)

### Tensor Fusion Layer
Instead of just combining features from language, visual, and audio data like earlier methods, we use a special fusion approach called Tensor Fusion. It captures how each modality (text, video, and audio) interacts individually, in pairs, and all together.

To do this, we first append a constant 1 to each embedding (from the language, visual, and audio networks) as one extra dimension, so that the tensor retains the single and pairwise terms alongside the joint ones. We then take the outer product of these three extended vectors. This forms a high-dimensional 3D tensor that holds all combinations of the input embeddings. With the embedding sizes above (128, 32, and 32), the tensor has 129 × 33 × 33 = 140,481 entries.

This tensor includes:

Unimodal regions – individual features from each modality,

Bimodal regions – how two modalities relate (e.g., text + audio),

Trimodal region – how all three interact together.

Though this fused tensor has many dimensions, it doesn’t use any learnable parameters. We found that the model still generalizes well because the structure is meaningful and not overly complex—making it easier for the next layers to extract useful information.
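A short NumPy sketch makes the role of the extra 1 concrete. The embeddings here are random placeholders and the function name `tensor_fusion` is my own; the checks at the end show where the unimodal, bimodal, and trimodal terms sit inside the tensor.

```python
import numpy as np

def tensor_fusion(z_l, z_v, z_a):
    """Outer product of the three embeddings, each extended with a constant 1."""
    l1 = np.append(z_l, 1.0)
    v1 = np.append(z_v, 1.0)
    a1 = np.append(z_a, 1.0)
    return np.einsum("i,j,k->ijk", l1, v1, a1)

rng = np.random.default_rng(0)
z_l = rng.normal(size=128)  # language embedding (placeholder values)
z_v = rng.normal(size=32)   # visual embedding (placeholder values)
z_a = rng.normal(size=32)   # acoustic embedding (placeholder values)

z_m = tensor_fusion(z_l, z_v, z_a)
print(z_m.shape)  # (129, 33, 33)
print(z_m.size)   # 140481

# Where the other two indices hit the appended 1, the raw embedding survives.
assert np.allclose(z_m[:-1, -1, -1], z_l)  # unimodal: language
assert np.allclose(z_m[-1, :-1, -1], z_v)  # unimodal: visual
assert np.allclose(z_m[-1, -1, :-1], z_a)  # unimodal: acoustic

# Where one index hits the 1, a pairwise outer product survives.
assert np.allclose(z_m[:-1, :-1, -1], np.outer(z_l, z_v))  # bimodal: language x visual

# The remaining block is the full three-way product.
assert np.allclose(z_m[:-1, :-1, :-1], np.einsum("i,j,k->ijk", z_l, z_v, z_a))
```

The function contains no weights, which matches the point that the fusion step itself has nothing to learn.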

![image.png](attachment:image.png)

This diagram compares Early Fusion (on the left) with Tensor Fusion (on the right) for combining features from three sources: language, visual, and audio.

##### 1. Left Side (Early Fusion):
Each input (language, visual, audio) is processed separately (unimodal).

Their outputs are simply stacked together into one big vector. This is called early fusion—just combining features without modeling their relationships.

##### 2. Right Side (Tensor Fusion):
Each input (language zˡ, visual zᵛ, audio zᵃ) is used to create a 3D structure.

This structure includes:

Unimodal parts (individual features),

Bimodal parts (interactions between any two modalities, like zˡ ⊗ zᵛ),

Trimodal part (interaction between all three: zˡ ⊗ zᵛ ⊗ zᵃ).

These interactions are modeled using outer products (⊗), capturing richer relationships.

### Sentiment Inference Subnetwork
After combining the language, visual, and audio features in the Tensor Fusion layer, the Sentiment Inference Subnetwork takes this combined info and predicts the speaker's sentiment.

This subnetwork is just a small neural network with two hidden layers. It can work in three ways:

For positive/negative prediction, it gives one output.

For five levels of sentiment (like very negative to very positive), it gives five outputs.

For score-based sentiment, it gives a number between 0 and 1.
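The three output variants can be sketched as three small heads on top of the last hidden layer. The activation choices (sigmoid for the single outputs, softmax for the five classes) are my assumption about how such heads are usually built, and the hidden vector and weights are random placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
hidden = rng.normal(size=128)  # placeholder for the output of the two hidden layers

w_binary = rng.normal(scale=0.1, size=128)
W_five = rng.normal(scale=0.1, size=(5, 128))
w_regress = rng.normal(scale=0.1, size=128)

p_positive = sigmoid(w_binary @ hidden)  # binary task: one output in (0, 1)
class_probs = softmax(W_five @ hidden)   # 5-class task: five probabilities
score = sigmoid(w_regress @ hidden)      # regression task: one number in (0, 1)

print(class_probs.shape)  # (5,)
```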

### Experiments
The paper conducts three experiments to test the effectiveness of their model, Tensor Fusion Network (TFN), in sentiment analysis using multiple types of data (text, visual, and audio):

First, they compare TFN with past best methods for analyzing sentiment using multiple data types. The table shows that TFN performs better than other models in all tasks — binary classification (positive/negative), 5-class classification (very negative to very positive), and regression (predicting a sentiment score).

Second, they test how important each part of TFN is — like the unimodal (single type), bimodal (two-type), and trimodal (three-type) interactions — and also compare TFN to simpler fusion methods like early fusion.

Third, they see how well each individual modality (language, visual, or audio) performs alone, and compare it with the best existing models that use only one of these.

### Experiment 1: Multimodal Sentiment Analysis
![image.png](attachment:image.png)

Here TFN was tested against the following models:

C-MKL: A neural network model that combines multiple types of input features using kernel learning. It was the best method before TFN.

SAL-CNN: A method designed to reduce bias related to individual speakers in deep learning.

SVM-MD and RF-MD: These are traditional (non-deep learning) models that combine features early and then use Support Vector Machines or Random Forests for classification.

The results, shown in Table 1, clearly indicate that TFN performs better than all other models, especially in the 5-class classification task, showing that it can better understand fine-grained sentiment (like distinguishing between “slightly negative” and “very negative”).

### Experiment 2: Tensor Fusion Evaluation
![image.png](attachment:image.png)

In Experiment 2, the authors test different parts (called subtensors) of their Tensor Fusion Network (TFN) to see how important each one is for predicting sentiment.

##### i) Testing one modality at a time:
Here the model is tested using only the language, visual, or acoustic input. Language turned out to be the most effective modality on its own.

##### ii) Testing combinations:
When testing only the bimodal interactions, the model achieved strong performance. However, using only the trimodal interactions led to slightly worse performance than using only the bimodal ones. Interestingly, when they removed just the trimodal interactions from the full model and kept the unimodal and bimodal components, performance dropped compared with the full TFN.

##### iii) Full TFN vs. Early Fusion:
The full Tensor Fusion Network, which includes all unimodal, bimodal, and trimodal components, produced the best results overall. In contrast, the early fusion approach, which simply combines all three inputs before processing, resulted in poorer performance.

### Experiment 3: Modality Embedding Subnetworks Evaluation

In this experiment, we compare how well our Modality Embedding Networks perform against the best existing methods for analyzing sentiment using just language, just visual input, or just acoustic input.


### 3.1: Language Sentiment Analysis
In this part of the study, we only use the language (text) part of the data to analyze emotions or sentiment. We compare our TFN_language model with other top models that were made for analyzing sentiment using text.
![image.png](attachment:image.png)
##### i) What was compared:
The comparison included several models. RNTN uses sentence structure and grammar to analyze sentiment, while DAN predicts sentiment by averaging the meanings of words. D-CNN applies convolutional neural networks to capture relationships between nearby words. Additionally, multimodal models like CMKL-L, SAL-CNN-L, and SVM-MD-L were included, but only their language components were used for this comparison.

##### ii) What did we find:
The TFN$ _{\text{language}} $ model outperformed all the other models in nearly every task. It achieved better results in binary classification (positive vs. negative), 5-class classification (ranging from very negative to very positive), and regression (predicting a continuous sentiment score). The improvement in performance is reflected in the final row of Table 3, labeled ΔSOTA_language.

##### iii) Why does it matter:
Many existing models were created to handle formal, written text like product reviews or news articles. However, spoken language, such as that found in the CMU-MOSI dataset, tends to be more casual and less structured.  TFN$ _{\text{language}} $ is more effective because it is better suited to understanding and analyzing the nuances of spoken sentiment.

### 3.2: Visual Sentiment Analysis
This part of the study looked at how well we can understand people’s emotions just by looking at their faces in videos.
![image.png](attachment:image.png)
##### i) What did we compare:
We tested our model, called TFN$ _{\text{visual}} $, against other popular models that also try to read emotions from facial expressions in videos. Some models watched short clips, some looked at each video frame one by one, and others used tools that read facial features. We also tested some models that usually use more than one type of input, but this time we only used their visual part.

##### ii) What did we find:
Our model did the best. It was more accurate at telling whether someone felt positive or negative, and it also made better predictions when rating emotions on a scale. These improvements are shown at the bottom of Table 4.

##### iii) Why does it matter:
Understanding emotions from video is tough because people show feelings in many small, different ways. Older models often miss these details. Our TFN$ _{\text{visual}} $ model is better at noticing these subtle expressions, especially in real conversations like the ones in our dataset.

### 3.3: Acoustic Sentiment Analysis
In this section, we only use audio (speech) information to predict the speaker's sentiment or emotion.
![image.png](attachment:image.png)
##### i) What did we compare?
We compared our model, called TFN_acoustic, with several well-known models that are designed to work with audio data, such as tone, pitch, and other speech features. These included HL-RNN, which uses LSTM to analyze high-level audio features like prosody; Adieu-Net, an end-to-end model that works directly with raw audio in PCM format; and SER-LSTM, which uses LSTM to process spectrogram images of audio. We also looked at multimodal models like CMKL-A, SAL-CNN-A, and SVM-MD-A, but in our case, we only used the acoustic parts of these models to keep the comparison fair.

##### ii) What did we find?
Our model, TFN_acoustic, performed better than all the others in different types of tasks. These tasks included binary classification (deciding if the sentiment is positive or negative), 5-class classification (which looks at more detailed levels of sentiment), and regression (which predicts a specific sentiment score). Compared to the other models, ours had higher accuracy and F1 scores, lower mean absolute error (MAE), and slightly better correlation with the true sentiment values. These improvements can be clearly seen in the row labeled ΔSOTA_acoustic in our results.

##### iii) Why is this important?
This is important because human speech contains many emotional signals that are often missed by traditional models. Older models may focus too much on hand-crafted features or fail to fully understand the emotional context in speech. Our model, TFN_acoustic, is able to better learn from speech data and capture how a person feels just by analyzing the way they talk. This makes it more effective for understanding emotions in spoken language.
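The metrics named in these comparisons (accuracy, F1, MAE, and correlation) can be computed as in the sketch below. The scores are invented for illustration and are not results from the paper. Splitting positive from negative at 0 and using plain binary F1 are assumptions; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def to_binary(scores):
    """Map continuous sentiment scores to 1 (positive) or 0 (negative); threshold 0 is assumed."""
    return (np.asarray(scores) >= 0).astype(int)

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

def f1_score(y_true, y_pred):
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mean_absolute_error(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def pearson_r(y_true, y_pred):
    return float(np.corrcoef(y_true, y_pred)[0, 1])

# Invented ground-truth and predicted scores on the -3..+3 scale.
truth = np.array([2.0, -1.0, 0.5, -2.5, 1.0])
preds = np.array([1.5, -0.5, -0.2, -2.0, 1.2])

acc = accuracy(to_binary(truth), to_binary(preds))
f1 = f1_score(to_binary(truth), to_binary(preds))
mae = mean_absolute_error(truth, preds)
corr = pearson_r(truth, preds)

print(round(acc, 2), round(f1, 2), round(mae, 2))  # 0.8 0.8 0.48
print(round(corr, 3))
```

Higher is better for accuracy, F1, and correlation, while lower is better for MAE.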

### Qualitative Analysis: CMU-MOSI Dataset
This part of our study looks at how different models interpret sentiment in specific examples, using spoken words, voice tone, and facial expressions. The goal is to predict a sentiment score between -3 (very negative) and +3 (very positive), based on what the speaker says, how they sound, and how they look.
![image.png](attachment:image.png)

##### i) What models are compared?
We compared several versions of our model. TFN-Acoustic uses only the audio signals, like tone of voice. TFN-Visual uses only facial expressions and other visual cues. TFN-Language focuses only on the spoken words. We also tested TFN-Early, a simpler version that concatenates all three types of input without explicitly modeling how they interact. Finally, we included TFN, our full model, which uses a Tensor Fusion Network to better capture how the different types of signals work together. For comparison, we used Ground Truth, which is the sentiment score given by human annotators.

##### ii) Example-wise summary:
In the first example, the sentence itself sounds only slightly negative, but the speaker is frowning, which shows stronger negative emotion. Only the full TFN model predicted this correctly, with a value very close to the human score. This shows that TFN is good at combining visual and language cues.

In the second example, the spoken words are not very clear in meaning, but the speaker’s smile and excited voice clearly show positive emotion. The TFN model predicted a score very close to the human-annotated one, because it could use all three types of signals together.

In the third example, the words sound positive, but the speaker shakes their head, which is a negative gesture. TFN balanced this contradiction and predicted a score near neutral, which matched the ground truth. This shows it can handle mixed signals from different sources.

In the fourth example, the speaker talks in a low-energy voice and is also frowning, both suggesting negativity. TFN-Acoustic handled this well, while the early fusion and visual-only models did not perform as well. The full TFN model gave a score close to the correct one, showing that the audio features were very helpful here.

### Conclusion
We proposed a novel end-to-end fusion method for sentiment analysis that explicitly captures unimodal, bimodal, and trimodal interactions among language, visual, and acoustic modalities.
Remark 1: Unlike early fusion approaches that tend to over-rely on the dominant modality (typically language), our method successfully models inter-modality dynamics, enabling better alignment and integration of complementary cues.

Extensive experiments on the CMU-MOSI dataset demonstrate that our approach achieves state-of-the-art performance, not only in multimodal sentiment analysis but also in unimodal settings (language-only, visual-only, and acoustic-only).
Remark 2: This highlights the robustness of our model even when only partial modality information is available.

Remark 3: Qualitative analysis confirms that our model captures nuanced cross-modal interactions, such as how conflicting visual or acoustic cues affect overall sentiment, which simpler models often fail to recognize.

Overall, our method provides a general and effective framework for modeling rich multimodal behaviors in sentiment prediction, and sets a strong foundation for future research in fine-grained affective computing.

### Reference:

Zadeh A., Chen M., Poria S., Cambria E., & Morency L.-P. (2017). Tensor Fusion Network for Multimodal Sentiment Analysis. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1103–1114. Association for Computational Linguistics. https://aclanthology.org/D17-1115



### What Surprised Me
##### i) Effectiveness of TFN over Early Fusion:
TFN captures complex inter-modality interactions far better than early fusion, which often biases heavily toward the language modality.

##### ii) Complementarity of Modalities:
Acoustic and visual features often disambiguate ambiguous language—something that unimodal models fail to capture.

##### iii) High Unimodal Performance:
TFN also achieves state-of-the-art results in language-only, visual-only, and acoustic-only settings—unexpected for a multimodal model.

### Scope for Improvement
##### i) Reduce Model Size:
Tensor fusion leads to high-dimensional representations. Use low-rank tensor approximations to reduce parameters and improve scalability.
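To illustrate why a low-rank approximation helps, the sketch below uses a general identity: if the weight tensor for one output unit is a sum of `rank` outer products of per-modality vectors, the result can be computed without ever building the 129 × 33 × 33 tensor. This is my own illustration of the idea, not a method from the TFN paper; the factor names `F_l`, `F_v`, `F_a` and the choice `rank = 4` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Embeddings extended with a constant 1, as in tensor fusion (placeholder values).
l1 = np.append(rng.normal(size=128), 1.0)
v1 = np.append(rng.normal(size=32), 1.0)
a1 = np.append(rng.normal(size=32), 1.0)

rank = 4  # hypothetical rank of the weight tensor
F_l = rng.normal(size=(rank, 129))
F_v = rng.normal(size=(rank, 33))
F_a = rng.normal(size=(rank, 33))

# Route 1: build the full weight tensor and the full fused tensor, then contract them.
W_full = np.einsum("ri,rj,rk->ijk", F_l, F_v, F_a)
Z = np.einsum("i,j,k->ijk", l1, v1, a1)
full_output = float(np.sum(W_full * Z))

# Route 2: project each modality separately and multiply the projections.
cheap_output = float(np.sum((F_l @ l1) * (F_v @ v1) * (F_a @ a1)))

assert np.isclose(full_output, cheap_output)

# Weights for one output unit: full tensor vs. low-rank factors.
print(W_full.size, rank * (129 + 33 + 33))  # 140481 780
```

Both routes give the same number, but the second stores 780 weights per output unit instead of 140,481.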

##### ii) Handle Missing Modalities:
Real-world data often has missing channels. Make TFN robust to incomplete modalities (e.g., through modality dropout or imputation).

##### iii) Better Temporal Modeling:
Replace or augment LSTM with transformers or temporal attention to better capture long-range dependencies.

##### iv) Fine-Grained Emotion Detection:
Extend TFN for emotion recognition beyond polarity, addressing nuances like sarcasm or mixed emotions.

##### v) Generalization to Larger Datasets:
Evaluate TFN on larger and more diverse multimodal datasets (e.g., MOSEI, MELD) to test real-world scalability.

