# Seeing the Feelings: Multimodal Sentiment Analysis Using Image and Text  
### Karnati Ravi Teja
### 220150006

## Motivation

Understanding human sentiment from text alone can be tricky: it’s easy to miss sarcasm, ambiguity, or subtle emotional cues. That’s where images come in. When text is paired with visuals (as in memes, reviews, or social media posts), the visuals add context that can make the sentiment clearer.

This project focuses on **Multimodal Sentiment Analysis** — combining both image and text inputs to get a more accurate read on emotions. I chose this topic because it brings together computer vision and language understanding in a way that feels both practical and impactful. 


## Multimodal Learning: A Quick Look Back

Multimodal learning combines different types of data, such as text, images, or audio, so that a model can draw on more than one signal. Early approaches were pretty straightforward: extract features from each modality and concatenate them.

But modern models go way beyond that. With techniques like **attention mechanisms** and **transformers**, they actually learn how different modalities interact.

### Some popular models in this space:
- **ViLBERT**: Extends BERT with a parallel visual stream, with the two streams exchanging information through co-attention layers
- **LXMERT**: Uses co-attention to deeply connect image and text features
- **CLIP**: Trains on huge image–text datasets to map both into the same embedding space

For this project, I’m implementing a simplified version of these ideas — using attention-based fusion to combine image and text signals more effectively.


## What I Learned From This Work

This project really helped me understand how **multimodal models** work in practice. Some key takeaways:

- I learned how separate encoders (like **ResNet** for images and **BERT** for text) can be combined for joint prediction tasks.
- I explored **attention fusion**, where features from different modalities don’t just get merged — they actually interact and highlight relevant parts in each other.
- I now get how **cross-modal attention** helps align visual cues with language, making predictions more accurate — especially when one modality is vague or noisy.
- Working with **pretrained models** showed me how powerful transfer learning can be, even across different types of data.

Overall, it felt good to go beyond theory and see how deep learning + attention mechanisms come together in the real world.

## How Attention Fusion Works

- First, text and images are encoded separately.
- **Attention fusion** helps the **text focus on relevant image parts**, and vice versa.
- This is done using **cross-attention**:
  - Each word in the sentence "asks" which image regions are important.
  - Each image patch "asks" which words are meaningful.
- The result? **Context-aware embeddings** that are fused together and then classified.

This approach is more expressive than simple concatenation because it allows the model to learn **interactions between modalities** instead of treating them as separate, independent pieces.
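To make the steps above concrete, here’s a minimal sketch of cross-attention fusion built on PyTorch’s `nn.MultiheadAttention`. The class name `CrossAttentionFusion`, the embedding size, the mean pooling, and the random inputs are all my own illustrative assumptions; this is not the model trained later in this notebook.

```python
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Toy cross-attention fusion; all sizes and names are illustrative."""

    def __init__(self, dim=64, num_heads=4, num_classes=3):
        super().__init__()
        # Text tokens query image patches, and image patches query text tokens.
        self.text_to_image = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim * 2, num_classes)

    def forward(self, text_tokens, image_patches):
        # text_tokens: [batch, num_words, dim]; image_patches: [batch, num_patches, dim]
        text_ctx, text_attn = self.text_to_image(
            query=text_tokens, key=image_patches, value=image_patches
        )
        image_ctx, _ = self.image_to_text(
            query=image_patches, key=text_tokens, value=text_tokens
        )
        # Mean-pool each context-aware sequence, then concatenate and classify.
        fused = torch.cat((text_ctx.mean(dim=1), image_ctx.mean(dim=1)), dim=1)
        return self.classifier(fused), text_attn


torch.manual_seed(0)
model = CrossAttentionFusion()
words = torch.randn(2, 5, 64)    # 2 samples, 5 word embeddings each
patches = torch.randn(2, 9, 64)  # 2 samples, 9 image-patch embeddings each
logits, attn = model(words, patches)
print(logits.shape)  # torch.Size([2, 3])
print(attn.shape)    # torch.Size([2, 5, 9]): one weight per (word, patch) pair
```

Each row of `attn` sums to 1, so it can be read as how much a given word attends to each image patch.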


### 1. **Text Classifier Model**
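The **Text Classifier** uses **word embeddings** and an **LSTM** network to process text data. Token ids are looked up in a frozen, pre-trained embedding matrix and passed through a bidirectional LSTM layer to capture contextual information. Despite its name, the module returns a feature vector of size `HIDDEN_SIZE * 2` (the final forward and backward states concatenated) rather than class scores; classification happens in the multimodal model.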


In [None]:
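import torch
import torch.nn as nn

# EMBEDDING_MATRIX (pre-trained word vectors) and HIDDEN_SIZE are assumed to be defined earlier

class TextClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        embedding_dim = EMBEDDING_MATRIX.shape[1]
        # Cast to float32 so the frozen embeddings match the LSTM's weight dtype
        self.embedding = nn.Embedding.from_pretrained(
            torch.as_tensor(EMBEDDING_MATRIX, dtype=torch.float32), freeze=True)
        self.dropout = nn.Dropout(0.5)
        self.lstm1 = nn.LSTM(embedding_dim, HIDDEN_SIZE, batch_first=True, bidirectional=True)

    def forward(self, x):
        embedded = self.embedding(x)        # [batch_size, seq_len, embedding_dim]
        embedded = self.dropout(embedded)
        output1, _ = self.lstm1(embedded)   # [batch_size, seq_len, HIDDEN_SIZE * 2]
        # Final forward state (last step) joined with final backward state (first step)
        last_hidden_state = torch.cat((output1[:, -1, :HIDDEN_SIZE], output1[:, 0, HIDDEN_SIZE:]), dim=1)
        return last_hidden_state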
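
The slicing in `forward` picks the forward direction’s state at the last time step and the backward direction’s state at the first time step. As a sanity check, here’s a small sketch (toy sizes, random input, `hidden_size` standing in for `HIDDEN_SIZE`) showing that this matches the final hidden states `h_n` that `nn.LSTM` returns for a single-layer bidirectional LSTM.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 8  # toy stand-in for HIDDEN_SIZE
lstm = nn.LSTM(input_size=4, hidden_size=hidden_size,
               batch_first=True, bidirectional=True)
x = torch.randn(3, 6, 4)  # [batch, seq_len, embedding_dim]

output, (h_n, _) = lstm(x)  # output: [3, 6, 16]; h_n: [2, 3, 8]

# Slicing used in TextClassifier.forward
sliced = torch.cat((output[:, -1, :hidden_size], output[:, 0, hidden_size:]), dim=1)

# Same vector read from the final hidden states (forward = h_n[0], backward = h_n[1])
from_h_n = torch.cat((h_n[0], h_n[1]), dim=1)

print(torch.allclose(sliced, from_h_n, atol=1e-6))  # True
```

Both forms read the last time step for the forward direction, so if batches are padded to a common length, that half comes from a padding position unless the sequences are packed with `torch.nn.utils.rnn.pack_padded_sequence`.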

### 2. **Multimodal Classifier Model**
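The Multimodal Classifier takes the feature vectors produced by the text and image models. In the attention-based variant, a learned softmax weighting is applied to the text features, which are then concatenated with the image features and passed to a linear layer that outputs one score per sentiment class. This is a simpler mechanism than the full cross-attention described above. An early-fusion variant is also included as an optional alternative, depending on your experiment.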

Attention-based fusion

In [None]:
class MultimodalClassifier(nn.Module):
    def __init__(self, text_model, image_model):
        super().__init__()
        self.text_model = text_model
        self.image_model = image_model
        # Learnable scalar that scales the attention scores (initialized to 1)
        self.attention_weights = nn.Parameter(torch.ones(1, 1, 1))
        self.fc = nn.Linear(HIDDEN_SIZE * 2 + 256, NUM_CLASSES)

    def forward(self, text_input, image_input):
        # Extract text features
        text_features = self.text_model(text_input)  # Shape: [batch_size, HIDDEN_SIZE * 2]

        # Extract image features
        image_features = self.image_model(image_input)  # Shape: [batch_size, 256]

        # Apply attention to text features (softmax over the feature dimension)
        attention_scores = torch.matmul(text_features.unsqueeze(2), self.attention_weights)  # Shape: [batch_size, HIDDEN_SIZE * 2, 1]
        attention_weights = torch.softmax(attention_scores, dim=1)  # Shape: [batch_size, HIDDEN_SIZE * 2, 1]
        # Re-weight each text feature; no sum here, so the feature dimension is kept
        attended_text_features = text_features * attention_weights.squeeze(2)  # Shape: [batch_size, HIDDEN_SIZE * 2]

        # Concatenate text features and image features
        multimodal_features = torch.cat((attended_text_features, image_features), dim=1)  # Shape: [batch_size, HIDDEN_SIZE * 2 + 256]

        # Classify the multimodal features
        logits = self.fc(multimodal_features)  # Shape: [batch_size, NUM_CLASSES]
        return logits


Early Fusion (Optional):

In [None]:
class EarlyFusionMultimodalClassifier(nn.Module):
    def __init__(self, text_model, image_model):
        super().__init__()
        self.text_model = text_model
        self.image_model = image_model
        # Element-wise fusion keeps the feature width, so this assumes the text
        # features (HIDDEN_SIZE * 2) and the image features (256) have the same size
        self.fc = nn.Linear(HIDDEN_SIZE * 2, NUM_CLASSES)

    def forward(self, text, image):
        text_out = self.text_model(text)     # Shape: [batch_size, HIDDEN_SIZE * 2]
        image_out = self.image_model(image)  # Shape: [batch_size, 256]

        # Element-wise multiplication of text and image features
        fusion = text_out * image_out

        # Flatten the fusion tensor to [batch_size, features]
        fusion = fusion.view(fusion.size(0), -1)

        out = self.fc(fusion)  # Shape: [batch_size, NUM_CLASSES]
        return out


### 3. **Dataset and DataLoader**

The dataset is a combination of text and image data, which is processed using word embeddings for the text and a pre-trained image classifier for the image features. Here’s how the dataset is loaded and split:

In [None]:
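import numpy as np
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, Subset

# MultimodalDataset, DF, and W2V_MODEL are assumed to be defined earlier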
data = MultimodalDataset(DF, W2V_MODEL)

indices = np.arange(len(data))
train_indices, test_indices = train_test_split(indices, test_size=0.2, random_state=42)

train_data = Subset(data, train_indices)
test_data = Subset(data, test_indices)

BATCH_SIZE = 2048

# create data loaders for train and test sets
train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True, pin_memory=True)
test_loader = DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=False, pin_memory=True)
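
`MultimodalDataset` itself isn’t shown in this write-up. As a rough picture of what the loaders hand to the model, here’s a sketch with a hypothetical `ToyMultimodalDataset` that returns random token ids, random images, and random labels. The item layout `(tokens, image, label)`, the tensor sizes, and the index split via `torch.randperm` are my assumptions for illustration, not the real dataset.

```python
import torch
from torch.utils.data import Dataset, DataLoader, Subset


class ToyMultimodalDataset(Dataset):
    """Hypothetical stand-in: random token ids, random images, random labels."""

    def __init__(self, num_samples=10, seq_len=12, vocab_size=50, num_classes=3):
        g = torch.Generator().manual_seed(0)
        self.tokens = torch.randint(0, vocab_size, (num_samples, seq_len), generator=g)
        self.images = torch.rand(num_samples, 3, 32, 32, generator=g)
        self.labels = torch.randint(0, num_classes, (num_samples,), generator=g)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.tokens[idx], self.images[idx], self.labels[idx]


toy = ToyMultimodalDataset()

# 80/20 split over shuffled indices, mirroring the Subset-based split above
perm = torch.randperm(len(toy), generator=torch.Generator().manual_seed(42)).tolist()
split = int(0.8 * len(toy))
toy_train = Subset(toy, perm[:split])
toy_test = Subset(toy, perm[split:])

toy_loader = DataLoader(toy_train, batch_size=4, shuffle=True)
tokens, images, labels = next(iter(toy_loader))
print(tokens.shape, images.shape, labels.shape)
# torch.Size([4, 12]) torch.Size([4, 3, 32, 32]) torch.Size([4])
```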


### 4. **Model Training**


The text and image encoders are wrapped in the multimodal classifier, which is trained using the Adam optimizer and Cross-Entropy loss. The setup looks like this:

In [None]:
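# ImageClassifier, DEVICE, and LR are not shown in this write-up and are assumed to be defined earlier
text_model = TextClassifier().to(DEVICE)
image_model = ImageClassifier().to(DEVICE)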

model = MultimodalClassifier(text_model, image_model).to(DEVICE)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
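
The cell above only builds the model, loss, and optimizer. Here’s a minimal sketch of what a training and evaluation loop around them could look like. To keep it runnable on its own, it uses a hypothetical `TinyFusionModel` and random toy data in place of `MultimodalClassifier` and the real loaders; the batch layout `(text, image, label)`, the helper names, and the hyperparameters are assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


class TinyFusionModel(nn.Module):
    """Hypothetical stand-in for MultimodalClassifier so the loop runs on its own."""

    def __init__(self, vocab_size=50, num_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 16)
        self.image_fc = nn.Linear(3 * 8 * 8, 16)
        self.fc = nn.Linear(32, num_classes)

    def forward(self, text_input, image_input):
        text_features = self.embedding(text_input).mean(dim=1)  # [batch, 16]
        image_features = self.image_fc(image_input.flatten(1))  # [batch, 16]
        return self.fc(torch.cat((text_features, image_features), dim=1))


def train_one_epoch(model, loader, criterion, optimizer, device):
    """Run one pass over the loader and return the mean loss per sample."""
    model.train()
    total_loss = 0.0
    for text_input, image_input, labels in loader:
        text_input = text_input.to(device)
        image_input = image_input.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(text_input, image_input), labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * labels.size(0)
    return total_loss / len(loader.dataset)


@torch.no_grad()
def evaluate(model, loader, device):
    """Return classification accuracy over the loader."""
    model.eval()
    correct = 0
    for text_input, image_input, labels in loader:
        logits = model(text_input.to(device), image_input.to(device))
        correct += (logits.argmax(dim=1).cpu() == labels).sum().item()
    return correct / len(loader.dataset)


torch.manual_seed(0)
device = torch.device("cpu")
demo_data = TensorDataset(
    torch.randint(0, 50, (32, 10)),  # random token ids
    torch.rand(32, 3, 8, 8),         # random "images"
    torch.randint(0, 3, (32,)),      # random labels
)
demo_loader = DataLoader(demo_data, batch_size=8, shuffle=True)

demo_model = TinyFusionModel().to(device)
demo_criterion = nn.CrossEntropyLoss()
demo_optimizer = torch.optim.Adam(demo_model.parameters(), lr=1e-2)

losses = [
    train_one_epoch(demo_model, demo_loader, demo_criterion, demo_optimizer, device)
    for _ in range(5)
]
accuracy = evaluate(demo_model, demo_loader, device)
print(losses[0], losses[-1], accuracy)
```

With the real setup, `model`, `criterion`, `optimizer`, `train_loader`, and `test_loader` from the cells above would take the place of the toy objects, provided the dataset yields batches in the same order.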


## Reflections

### What surprised me?
- The model could still capture emotions even when either the image or the text was unclear on its own — the fusion really helped!
- Looking at the attention maps was super insightful — they showed cool connections, like linking smiley faces in images to positive words in the text.

### Scope for improvement
- It’d be exciting to extend this to a **tri-modal setup** — adding audio alongside text and images.
- I could try using stronger pretrained models like **CLIP** or **BLIP** for better representation.
- Instead of just classifying sentiment (positive/neutral/negative), predicting **emotion intensity** as a regression task could make the model more nuanced.


## References

- GitHub Repo: https://github.com/imadhou/multimodal-sentiment-analysis
- Paper: https://arxiv.org/abs/2005.13907 (Multimodal Transformer for Sentiment Analysis)
- HuggingFace Transformers: https://huggingface.co/docs/transformers/index
- PyTorch Docs: https://pytorch.org/docs/stable/index.html
