# Multimodal Sentiment Analysis: Combining Text and Visual Features
**- Aastha Singh Rai**
**- 210102001 (ECE)**


## Motivation
Sentiment analysis is the process of determining the emotional tone behind a body of text. While traditional sentiment analysis typically focuses on text data, there is a growing interest in combining multiple data modalities—such as text, images, and even audio—into a single model to improve the accuracy of predictions. This type of analysis, known as multimodal sentiment analysis, aims to understand the sentiment of a text more holistically by also considering its visual and auditory cues.

I chose to explore multimodal sentiment analysis because of its potential applications in various fields, such as movie genre classification, social media sentiment analysis, and customer service. For instance, movies often generate a unique sentiment through both the visual scenes (e.g., colors, facial expressions) and the accompanying dialogues. Understanding this combined sentiment can lead to better recommendation systems or targeted advertising.




## 1. Importing Required Libraries And Defining Parameters

In [1]:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, models
from transformers import BertTokenizer, BertModel
import numpy as np
from tqdm import tqdm
import warnings
from PIL import Image

# Filter warnings
warnings.filterwarnings('ignore')

# Constants
BATCH_SIZE = 32
EPOCHS = 3  # Reduced epochs for quick training
LEARNING_RATE = 2e-5
MAX_TEXT_LENGTH = 128
IMAGE_SIZE = 224
NUM_SAMPLES = 500  # Small synthetic dataset

# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cpu


## 2. Defining the Synthetic Dataset

**Why use synthetic data?**

In this project, we use synthetic data for a few reasons. First, synthetic data is quick and easy to generate, which allows for faster prototyping and experimentation. Generating synthetic images and texts also avoids the challenges of working with real-world datasets, which may be unbalanced or hard to acquire, and it provides controlled conditions for testing the pipeline. The main practical reason, though, is that the real datasets I found were very large, and my machine was unable to process them when I tried.

However, synthetic data has its drawbacks. It often lacks the complexity and nuances that real-world data possesses, which can limit the model’s ability to generalize well to unseen, real data. Therefore, while synthetic data is useful for initial experiments, the real goal is to replace this with real-world data to improve the robustness and accuracy of the model.

In [2]:

class SyntheticDataset(Dataset):
    def __init__(self, num_samples):
        self.num_samples = num_samples
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        self.transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ])

        # Generate synthetic data
        self.texts = [f"This is sample text {i}" for i in range(num_samples)]
        self.labels = np.random.randint(0, 2, size=num_samples)  # Binary labels

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # Generate synthetic image (random RGB)
        image = np.random.randint(0, 256, (IMAGE_SIZE, IMAGE_SIZE, 3), dtype=np.uint8)  # HWC format
        image = Image.fromarray(image)  # Convert to PIL image
        image = self.transform(image)

        # Tokenize text
        inputs = self.tokenizer(
            self.texts[idx],
            return_tensors='pt',
            padding='max_length',
            max_length=MAX_TEXT_LENGTH,
            truncation=True
        )

        return {
            'input_ids': inputs['input_ids'].squeeze(0),
            'attention_mask': inputs['attention_mask'].squeeze(0),
            'image': image,
            'label': torch.tensor(self.labels[idx], dtype=torch.float)
        }


## 3. Building the Multimodal Sentiment Model

In [3]:
class MultimodalSentimentModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Text model (frozen BERT)
        self.text_model = BertModel.from_pretrained('bert-base-uncased')
        for param in self.text_model.parameters():
            param.requires_grad = False

        # Image model (ResNet-18, randomly initialized and trainable, with a custom head)
        self.image_model = models.resnet18(weights=None)  # No pretrained weights for speed
        self.image_model.fc = nn.Sequential(
            nn.Linear(512, 128)  # Simplified head
        )

        # Classifier
        self.classifier = nn.Sequential(
            nn.Linear(768 + 128, 64),  # Smaller network
            nn.ReLU(),
            nn.Linear(64, 1)
        )

    def forward(self, input_ids, attention_mask, image):
        text_features = self.text_model(
            input_ids=input_ids,
            attention_mask=attention_mask
        ).last_hidden_state[:, 0, :]

        image_features = self.image_model(image)
        combined = torch.cat((text_features, image_features), dim=1)
        return self.classifier(combined).squeeze(1)

## 4. Training the Model

In [4]:
def train_epoch(model, dataloader, criterion, optimizer, device):
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0

    for batch in tqdm(dataloader, desc="Training", leave=False):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        images = batch['image'].to(device)
        labels = batch['label'].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask, images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        predicted = (torch.sigmoid(outputs) > 0.5).float()
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    return running_loss / len(dataloader), correct / total



## 5. Evaluating the Model

In [5]:
def evaluate(model, dataloader, criterion, device):
    model.eval()
    running_loss = 0.0
    correct = 0
    total = 0

    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating", leave=False):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            images = batch['image'].to(device)
            labels = batch['label'].to(device)

            outputs = model(input_ids, attention_mask, images)
            loss = criterion(outputs, labels)

            running_loss += loss.item()
            predicted = (torch.sigmoid(outputs) > 0.5).float()
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    return running_loss / len(dataloader), correct / total

## 6. Running the Training and Evaluation Loop

In [6]:
def main():
    # Create synthetic datasets
    train_dataset = SyntheticDataset(NUM_SAMPLES)
    val_dataset = SyntheticDataset(NUM_SAMPLES // 5)
    test_dataset = SyntheticDataset(NUM_SAMPLES // 5)

    # Create dataloaders
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)
    test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE)

    # Initialize model
    model = MultimodalSentimentModel().to(device)
    criterion = nn.BCEWithLogitsLoss()
    optimizer = optim.AdamW(model.parameters(), lr=LEARNING_RATE)

    # Training loop
    for epoch in range(EPOCHS):
        print(f"\nEpoch {epoch+1}/{EPOCHS}")
        train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
        val_loss, val_acc = evaluate(model, val_loader, criterion, device)

        print(f"Train Loss: {train_loss:.4f} | Accuracy: {train_acc:.4f}")
        print(f"Val Loss: {val_loss:.4f} | Accuracy: {val_acc:.4f}")

    # Quick test
    test_loss, test_acc = evaluate(model, test_loader, criterion, device)
    print(f"\nTest Loss: {test_loss:.4f} | Accuracy: {test_acc:.4f}")

if __name__ == '__main__':
    main()


Epoch 1/3




Train Loss: 0.6952 | Accuracy: 0.5000
Val Loss: 0.6931 | Accuracy: 0.4900

Epoch 2/3




Train Loss: 0.6938 | Accuracy: 0.5020
Val Loss: 0.6933 | Accuracy: 0.4700

Epoch 3/3




Train Loss: 0.6934 | Accuracy: 0.5040
Val Loss: 0.6932 | Accuracy: 0.4800


                                                         


Test Loss: 0.6940 | Accuracy: 0.5000




# Breakdown of the Code:

## Synthetic Dataset Class (`SyntheticDataset`):

In this part of the code, we create synthetic data—meaning the data is artificially generated rather than collected from real-world sources. This is useful for quickly testing our models and experiments without needing a large, real dataset.

### How does it work?

- We generate random text like "This is sample text 0", "This is sample text 1", etc., and create random RGB images, essentially simulating a scenario where both text and image data are provided.
- **Text Processing**: We use the BERT tokenizer to convert the text into a format that the BERT model can understand. Specifically, we generate `input_ids` (which represent the text as numbers) and `attention_mask` (which tells the model which parts of the input to focus on).
- **Image Processing**: The generated images (which are just random pixel values) are processed using a transformation pipeline, which includes converting them into tensor format and normalizing the pixel values to match what ResNet expects.
- **Labels**: To simulate a real task, each sample is randomly assigned a binary label (0 or 1). In a real scenario, these labels would correspond to the sentiment of the text and image (e.g., positive or negative sentiment).
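
The padding and masking step described in the list above can be sketched without downloading BERT. This is a toy illustration, not the tokenizer's actual implementation: `pad_and_mask` is a hypothetical helper, and the word ids in the example call are made up. The ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids of `bert-base-uncased`.

```python
# Toy illustration of padding and attention masks (pure Python, no BERT download).
# 101, 102 and 0 are the [CLS], [SEP] and [PAD] ids in bert-base-uncased;
# the word ids in the example call are made up for illustration.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def pad_and_mask(word_ids, max_length):
    """Wrap word ids in [CLS] ... [SEP], then truncate or pad to max_length."""
    # Leave room for the two special tokens when truncating
    ids = [CLS_ID] + list(word_ids)[: max_length - 2] + [SEP_ID]
    attention_mask = [1] * len(ids)   # 1 = real token
    n_pad = max_length - len(ids)
    ids += [PAD_ID] * n_pad           # fill the remainder with [PAD]
    attention_mask += [0] * n_pad     # 0 = padding, ignored by attention
    return ids, attention_mask

input_ids, attention_mask = pad_and_mask([2023, 2003, 7099], max_length=8)
print(input_ids)       # [101, 2023, 2003, 7099, 102, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

In the notebook, `MAX_TEXT_LENGTH = 128` plays the role of `max_length`, so every sample yields `input_ids` and `attention_mask` tensors of the same length and can be stacked into a batch.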

In summary, the `SyntheticDataset` class creates and processes synthetic data (text and images) and prepares it for use by the model.
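
To make the image normalization concrete, here is a minimal per-pixel sketch of what `ToTensor` followed by `Normalize` computes. It assumes an 8-bit RGB input; `normalize_pixel` is a hypothetical helper written for illustration, while the mean and standard deviation are the values used in the notebook.

```python
# Channel statistics passed to transforms.Normalize in the notebook
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """Map one 8-bit RGB pixel to the normalized values the image model receives."""
    scaled = [c / 255.0 for c in rgb]  # ToTensor: [0, 255] -> [0, 1]
    return [(s - m) / d for s, m, d in zip(scaled, MEAN, STD)]

black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
print([round(v, 3) for v in black])  # [-2.118, -2.036, -1.804]
print([round(v, 3) for v in white])  # [2.249, 2.429, 2.64]
```

So the random pixels generated in `__getitem__` end up roughly in the range -2.1 to 2.6 before they reach ResNet.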

## Multimodal Sentiment Model (`MultimodalSentimentModel`):

This part of the code is where the actual machine learning model is defined. It's a **multimodal model**, meaning it takes two types of data: text (processed by BERT) and images (processed by ResNet). The idea is to make predictions (like sentiment) based on both types of information at once.

### How does it work?

- **Text Processing**: The text data is passed through BERT, which generates a representation of the text using the `[CLS]` token from the last hidden state. This token acts as a summary of the entire text and is what we use to represent the text input in our model.
  
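- **Image Processing**: The images are passed through a ResNet-18 model. ResNet is a deep learning model commonly used for image classification. We modify the model slightly by replacing its last layer with a single linear layer that outputs a 128-dimensional feature vector. In this notebook the ResNet is created without pretrained weights (to keep things fast), so unlike BERT it is trained from scratch.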
  
- **Combining the Features**: Once we have the features from both the text and image models, we combine (concatenate) them into a single feature vector. This combined vector is then passed through a small neural network that makes the final prediction (whether the sentiment is positive or negative).
  
In short, this model is trying to learn how to make predictions by looking at both the text and the images together. This approach is common in multimodal learning, where models are trained to handle multiple types of data simultaneously.
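
The fusion step can be sketched in plain Python to show how the dimensions fit together. This is only an illustration under stated assumptions: the feature vectors and weights are random stand-ins rather than real BERT or ResNet outputs, and `linear`, `relu` and `random_layer` are hypothetical helpers. The sizes 768, 128 and 64 are the ones used in the notebook.

```python
import random

TEXT_DIM, IMAGE_DIM, HIDDEN_DIM = 768, 128, 64  # sizes used in the notebook

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def random_layer(n_in, n_out, rng):
    """Small random weights, standing in for trained parameters."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
text_features = [rng.gauss(0, 1) for _ in range(TEXT_DIM)]    # stand-in for the [CLS] vector
image_features = [rng.gauss(0, 1) for _ in range(IMAGE_DIM)]  # stand-in for the ResNet output

combined = text_features + image_features  # concatenation: 768 + 128 = 896 values
w1, b1 = random_layer(len(combined), HIDDEN_DIM, rng)
w2, b2 = random_layer(HIDDEN_DIM, 1, rng)
logit = linear(relu(linear(combined, w1, b1)), w2, b2)[0]  # single raw score

print(len(combined))  # 896
```

The single output is a logit, not a probability. The sigmoid is applied later, inside the loss function and when thresholding predictions.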

## Training and Evaluation Functions:

### Training Loop:
This is where the actual learning happens. During each epoch (iteration over the entire dataset), the model is trained using batches of data:
  
- For each batch, the model makes predictions (called "outputs") based on the input data.
- The loss (a measure of how far off the predictions are from the actual labels) is calculated.
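- The gradients of the loss with respect to the trainable parameters are computed using **backpropagation**, and the optimizer (AdamW) then uses these gradients to adjust the parameters, which helps the model improve its predictions over time.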
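
The three steps above can be traced by hand on a tiny logistic model. This is a simplified sketch, not the notebook's training loop: it uses plain gradient descent instead of AdamW, a single made-up sample instead of batches, and a hypothetical helper `sgd_step`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(w, b, x, y, lr):
    """One forward pass, backward pass and parameter update for a single sample."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # forward pass: logit
    p = sigmoid(z)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))  # binary cross-entropy
    grad_z = p - y  # d(loss)/d(logit) for sigmoid + binary cross-entropy
    new_w = [wi - lr * grad_z * xi for wi, xi in zip(w, x)]  # gradient descent update
    new_b = b - lr * grad_z
    return new_w, new_b, loss

w, b = [0.0, 0.0], 0.0
x, y = [1.0, 2.0], 1.0  # made-up sample with a positive label
losses = []
for _ in range(5):
    w, b, loss = sgd_step(w, b, x, y, lr=0.1)
    losses.append(loss)

print(round(losses[0], 4))  # 0.6931
print(losses[-1] < losses[0])  # True
```

Because this sample has a consistent label, the loss falls with every step. With the random labels used in this project there is no consistent signal to follow, which is why the logged loss stays near its starting value.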

### Evaluation Loop:
After each training epoch, and once more at the end, we need to check how well the model is performing. The evaluation loop is similar to the training loop, but we don’t update the model’s parameters here; we only measure how well the model performs on data it was not trained on (i.e., the validation or test sets).
  
- The loss and accuracy are calculated, and these metrics help us understand whether the model is improving or if it's overfitting (performing well on the training data but not generalizing well to new data).
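
As a small sketch of the accuracy calculation, the function below mirrors the `torch.sigmoid(outputs) > 0.5` check from the notebook in plain Python. The logits and labels are made up, and `accuracy_from_logits` is a hypothetical helper.

```python
import math

def accuracy_from_logits(logits, labels, threshold=0.5):
    """Fraction of samples whose thresholded sigmoid output matches the label."""
    correct = 0
    for z, y in zip(logits, labels):
        prob = 1.0 / (1.0 + math.exp(-z))  # sigmoid turns the logit into a probability
        predicted = 1.0 if prob > threshold else 0.0
        correct += int(predicted == y)
    return correct / len(labels)

logits = [2.0, -1.0, 0.3, -0.2]  # made-up model outputs
labels = [1.0, 0.0, 0.0, 1.0]    # made-up ground truth
print(accuracy_from_logits(logits, labels))  # 0.5
```

With a threshold of 0.5, the check is equivalent to asking whether the logit is positive, since the sigmoid of 0 is exactly 0.5.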

## Main Function:

The `main()` function is the entry point of the code, where everything is brought together.

- **Dataset Creation**: First, we create synthetic datasets (using the `SyntheticDataset` class) for training, validation, and testing. This simulates a real-world scenario where we need to have data to train and evaluate the model.
  
- **Model Initialization**: We initialize the multimodal sentiment model that we defined earlier, as well as the loss function (BCEWithLogitsLoss for binary classification) and optimizer (AdamW).
  
- **Training and Evaluation**: We run the training loop for a specified number of epochs. After each epoch, the model is evaluated on the validation set. Once training is complete, the model is evaluated one final time on the test set.
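
The loss function takes raw logits rather than probabilities. The sketch below shows one common, numerically stable way to compute binary cross-entropy directly from a logit. It is an illustration of the idea, not a claim about how PyTorch implements `BCEWithLogitsLoss` internally, and `bce_with_logits` is a hypothetical helper.

```python
import math

def bce_with_logits(z, y):
    """Numerically stable binary cross-entropy computed directly from a logit z."""
    # Algebraically equal to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))],
    # but never evaluates log(0) or exp of a large positive number.
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

print(round(bce_with_logits(0.0, 1.0), 4))  # 0.6931
print(round(bce_with_logits(2.0, 1.0), 4))  # 0.1269
print(bce_with_logits(1000.0, 0.0))         # 1000.0
```

The last call shows why the logit form is useful: computing the sigmoid first would round the probability to exactly 1.0, and taking the logarithm of `1 - 1.0` would fail.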

# Conclusion

The test loss of around 0.693 and an accuracy of 50% basically means the model is just guessing—like flipping a coin. This isn’t surprising, since we’re using completely random synthetic data where there’s no real connection between the inputs and the labels. But that’s okay! The goal here was to build and test the model pipeline, and we’ve done that. Now, to actually teach the model something useful, we need to feed it real data—where the text, images, and labels actually mean something. That’s the next step.
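
The 0.693 figure can be checked by hand. As a short derivation, assuming the model outputs a probability of $p = 0.5$ for every sample (the best it can do when the labels are unrelated to the inputs), the binary cross-entropy for a label $y \in \{0, 1\}$ is:

$$
\mathcal{L}(p, y) = -\bigl[\, y \log p + (1 - y) \log (1 - p) \,\bigr],
\qquad
\mathcal{L}(0.5, y) = -\log 0.5 = \log 2 \approx 0.6931 .
$$

The logged train, validation, and test losses all sit within about 0.002 of this value, which is consistent with a model that has not learned anything from the data.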

# Next Step

The next step would be to replace the synthetic dataset with real-world data. Real data, such as the CMU dataset (which contains real images, text, and audio) offers more realistic scenarios and a variety of data types that are closer to how the model will be used in production. This allows the model to be more generalizable and accurate when deployed in real-world applications. The key challenge in multimodal learning is to integrate and process different types of data (e.g., text, images, audio) effectively. Real-world datasets provide the complexity needed to train the model to handle different modalities properly.

# What I learnt

Because the synthetic inputs and labels were random, the model could not learn any real sentiment relationship between text and images, and its accuracy stayed at chance level. The value of the project was in building the pipeline. Working on it, I gained hands-on experience with combining text and image data in a single deep learning model. I learned how to use a pre-trained BERT model as a frozen text encoder and how to attach a ResNet-18 image encoder with a custom head. I also discovered the complexities of fusing multiple data types in a single neural network. Whether this fusion actually improves accuracy and robustness compared to single-modality approaches still needs to be tested on real data.