This repository contains the implementation of the Tensor Fusion Network (TFN) for multimodal sentiment analysis using the CMU-MOSI dataset. The TFN architecture incorporates language, visual, and acoustic modalities to predict sentiment intensity.
The CMU-MOSI dataset is an annotated collection of opinion videos from YouTube movie reviews. Each opinion segment is annotated for sentiment intensity on a seven-step Likert scale from highly negative to highly positive. The dataset comprises 2,199 opinion utterances from 93 distinct speakers, with an average utterance length of 4.2 seconds.
- Language Modality: Uses GloVe word vectors for spoken words.
- Visual Modality: Extracts facial expressions and action units using the FACET framework and OpenFace.
- Acoustic Modality: Extracts acoustic features using the COVAREP framework.
- Binary Sentiment Classification
- Five-Class Sentiment Classification
- Sentiment Regression
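All three tasks can be derived from the same continuous sentiment intensity score. As a minimal sketch, assuming scores in [-3, 3] and the common convention of clamping to [-2, 2] for five-class labels (both conventions are assumptions, not taken from this repository):

```python
def to_binary(score: float) -> int:
    """Binary label: 0 = negative, 1 = non-negative (assumed convention)."""
    return int(score >= 0)

def to_five_class(score: float) -> int:
    """Five-class label in 0..4, by clamping to [-2, 2] and rounding (illustrative)."""
    clamped = max(-2.0, min(2.0, score))
    return int(round(clamped)) + 2
```

Regression simply predicts the raw score directly.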
TFN consists of three main components:
- Modality Embedding Subnetworks: Extract features from the language, visual, and acoustic modalities.
- Tensor Fusion Layer: Explicitly models unimodal, bimodal, and trimodal interactions.
- Sentiment Inference Subnetwork: Performs sentiment inference based on the fused multimodal tensor.
- Language Embedding Subnetwork: Uses LSTM to learn time-dependent representations of spoken words.
- Visual Embedding Subnetwork: Uses a deep neural network to process visual features extracted from facial expressions.
- Acoustic Embedding Subnetwork: Uses a deep neural network to process acoustic features extracted from audio signals.
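The three subnetworks above can be sketched in PyTorch as follows. The feature and output dimensions here are illustrative only (300-d GloVe vectors for language; the visual/acoustic input sizes and hidden sizes are assumptions, not the paper's exact settings):

```python
import torch
import torch.nn as nn

class ModalityEmbeddings(nn.Module):
    """Sketch of TFN's three modality embedding subnetworks (illustrative sizes)."""

    def __init__(self, lang_dim=300, vis_dim=35, ac_dim=74, out_dim=32):
        super().__init__()
        # Language: LSTM over word vectors; the final hidden state summarizes the utterance.
        self.lang_lstm = nn.LSTM(lang_dim, out_dim, batch_first=True)
        # Visual / acoustic: small fully connected networks over per-utterance features.
        self.vis_net = nn.Sequential(nn.Linear(vis_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))
        self.ac_net = nn.Sequential(nn.Linear(ac_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

    def forward(self, words, visual, acoustic):
        _, (h, _) = self.lang_lstm(words)  # h: (num_layers, batch, out_dim)
        z_l = h.squeeze(0)                 # (batch, out_dim)
        z_v = self.vis_net(visual)
        z_a = self.ac_net(acoustic)
        return z_l, z_v, z_a
```

A full implementation would also handle padded sequences and dropout; this sketch only fixes the shapes of the three embeddings that feed the fusion layer.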
The Tensor Fusion Layer models the interactions between different modalities using a three-fold Cartesian product, generating a multimodal tensor that captures unimodal, bimodal, and trimodal dynamics.
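Concretely, the fusion appends a constant 1 to each modality embedding and takes the three-fold outer product. A minimal NumPy sketch (the embedding sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_l, d_v, d_a = 4, 3, 2  # illustrative language/visual/acoustic embedding sizes
z_l, z_v, z_a = rng.standard_normal(d_l), rng.standard_normal(d_v), rng.standard_normal(d_a)

# Append a constant 1 to each embedding so the product retains
# unimodal and bimodal terms alongside the trimodal ones.
zl1 = np.concatenate([z_l, [1.0]])
zv1 = np.concatenate([z_v, [1.0]])
za1 = np.concatenate([z_a, [1.0]])

# Three-fold outer (Cartesian) product: shape (d_l + 1, d_v + 1, d_a + 1).
fused = np.einsum('i,j,k->ijk', zl1, zv1, za1)
print(fused.shape)  # (5, 4, 3)
```

Because of the appended 1s, the fused tensor contains each unimodal embedding and every bimodal outer product as subtensors, which is what makes the unimodal/bimodal/trimodal ablations possible.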
A fully connected deep neural network that takes the multimodal tensor as input and performs sentiment classification or regression.
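The inference step amounts to flattening the fused tensor and passing it through fully connected layers. A bare-bones NumPy sketch, assuming a 5x4x3 fused tensor and arbitrary (untrained) weights:

```python
import numpy as np

rng = np.random.default_rng(0)
fused = rng.standard_normal((5, 4, 3))  # fused multimodal tensor (illustrative size)
x = fused.reshape(-1)                   # flatten to a 60-dim input vector

# One hypothetical hidden layer with ReLU, then a scalar regression head.
W1, b1 = 0.1 * rng.standard_normal((32, x.size)), np.zeros(32)
W2, b2 = 0.1 * rng.standard_normal((1, 32)), np.zeros(1)
h = np.maximum(W1 @ x + b1, 0.0)
score = float((W2 @ h + b2)[0])         # predicted sentiment intensity
```

For the classification tasks, the scalar head would be replaced by a 2-way or 5-way output with a softmax.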
Three sets of experiments were conducted:
- Multimodal Sentiment Analysis: Compared TFN with state-of-the-art multimodal sentiment analysis models.
- Tensor Fusion Evaluation: Analyzed the importance of subtensors and the impact of each modality.
- Modality Embedding Subnetworks Evaluation: Compared TFN's modality-specific networks with state-of-the-art unimodal sentiment analysis models.
TFN outperformed state-of-the-art approaches in binary sentiment classification, five-class sentiment classification, and sentiment regression. The ablation study showed the importance of modeling trimodal dynamics for improved performance.
- Python 3.x
- TensorFlow or PyTorch (depending on the implementation)
- Required Python libraries (listed in `requirements.txt`)
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/TFN-multimodal-sentiment.git
  cd TFN-multimodal-sentiment
  ```
- Install the required Python libraries:

  ```bash
  pip3 install -r requirements.txt
  ```
- Download the CMU-MOSI dataset from the official source.
- Extract the dataset and place it in the `data` directory.
- Preprocess the dataset:

  ```bash
  python3 preprocess.py --data_dir data/CMU-MOSI
  ```
- Train the TFN model:

  ```bash
  python3 train.py --config configs/tfn_config.json
  ```
Evaluate the trained model on the test set:

```bash
python3 evaluate.py --model_dir models/tfn --data_dir data/CMU-MOSI
```
Modify the configuration file `configs/tfn_config.json` to change hyperparameters, model settings, and dataset paths.
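A configuration file of this kind might look like the following; the keys and values shown are purely illustrative and are not guaranteed to match the actual schema of `configs/tfn_config.json`:

```json
{
  "data_dir": "data/CMU-MOSI",
  "task": "regression",
  "batch_size": 32,
  "learning_rate": 0.001,
  "epochs": 50,
  "embedding_dims": {"language": 128, "visual": 32, "acoustic": 32},
  "dropout": 0.15
}
```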
If you use this code or dataset in your research, please cite the original paper:
@inproceedings{zadeh2017tensor,
  title={Tensor Fusion Network for Multimodal Sentiment Analysis},
  author={Zadeh, Amir and Chen, Minghai and Poria, Soujanya and Cambria, Erik and Morency, Louis-Philippe},
  booktitle={Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
  year={2017}
}
This project is licensed under the MIT License.
Feel free to open an issue if you have any questions or need further assistance. Happy researching!