![image.png](https://i.imgur.com/a3uAqnb.png)


# Multi-Modal Sentiment Analysis - Homework Assignment

![Combined Model Architecture](https://i.imgur.com/RVYyBe7.jpeg)

In this homework, you will build and compare three different models for sentiment analysis on a dataset of tweets, each containing both an image and text.

## 📌 Project Overview
- **Task**: Classify the sentiment of a tweet (positive, negative, or neutral) using its text, its image, and a combination of both.
- **Architecture**:
    1. An image-only model (CNN).
    2. A text-only model (RNN/LSTM/GRU).
    3. A combined, multi-modal model that fuses features from the two.
- **Dataset**: MVSA (Multi-view Social Data)
- **Goal**: Compare the performance of unimodal vs. multi-modal approaches for sentiment analysis.

## 📚 Learning Objectives
By completing this assignment, you will:
- Understand how to process a mixed-media dataset (images and text).
- Implement a CNN for image classification.
- Build an RNN/LSTM for text classification.
- Construct a multi-modal architecture by combining feature extractors.
- Evaluate and compare the performance of different models on the same task.

## 1️⃣ Dataset Setup (PROVIDED)

The MVSA dataset has been downloaded for you. The dataset structure is as follows:
- `labelResultAll.txt`: A file containing the labels for each data point. The format is `tweet_id,label`.
- `data/`: A folder containing all the image (`.jpg`) and text (`.txt`) files, named by their tweet ID.

In [6]:
import kagglehub
import os
from dotenv import load_dotenv

# Dataset already downloaded and prepared
load_dotenv()
path = kagglehub.dataset_download("vincemarcs/mvsasingle")
print("Path to dataset files:", path)

# Let's check the contents
print("\nContents of MVSA_Single:")
print(os.listdir(os.path.join(path, 'MVSA_Single')))

print("\nSample of files in the data folder:")
print(os.listdir(os.path.join(path, 'MVSA_Single', 'data'))[:5])

Path to dataset files: /home/ali/.cache/kagglehub/datasets/vincemarcs/mvsasingle/versions/1

Contents of MVSA_Single:
['data', 'labelResultAll.txt']

Sample of files in the data folder:
['3033.jpg', '1951.txt', '1304.txt', '3070.jpg', '1149.txt']


## 2️⃣ Import Libraries and Configuration

**Task**: Import all necessary libraries and set up configuration parameters.

**Requirements**:
- Import PyTorch, torchvision, pandas, and other utilities.
- Import libraries for text processing and evaluation (e.g., NLTK, Scikit-learn).
- Set random seeds for reproducibility.
- Configure hyperparameters with reasonable values.

In [None]:
# TODO: Import all necessary libraries (torch, nn, optim, pandas, etc.)
# TODO: Set random seeds for reproducibility (use seed=42)
# TODO: Check device availability and print (e.g., "cuda" or "cpu")
# TODO: Define configuration parameters:
IMG_SIZE = 224 # Image size (e.g., for ResNet)
BATCH_SIZE = 32 # Batch size
LEARNING_RATE = 1e-4 # Learning rate
NUM_EPOCHS = 10 # Number of training epochs
VOCAB_SIZE = 10000 # Maximum vocabulary size for text
MAX_LEN = 50 # Max sequence length for text
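
As a starting point, seeding and device selection could look like the sketch below. The helper name `set_seed` is hypothetical, and the sketch assumes NumPy is installed alongside PyTorch.

```python
import random

import numpy as np
import torch

SEED = 42  # seed value requested in the TODO above


def set_seed(seed: int) -> None:
    """Seed the Python, NumPy, and PyTorch RNGs (hypothetical helper name)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Safe to call without a GPU; it is silently ignored in that case.
    torch.cuda.manual_seed_all(seed)


set_seed(SEED)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)
```

Seeding makes random draws repeatable on the same machine, but some GPU operations can still be non-deterministic, so small run-to-run differences are possible.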

## 3️⃣ Data Loading and Preprocessing

**Task**: Load the labels, match them with their corresponding image and text files, and split the data.

**Requirements**:
- Read `labelResultAll.txt` into a pandas DataFrame.
- Map labels from ('positive', 'negative', 'neutral') to (0, 1, 2).
- Create a list of all data samples, where each sample is a tuple `(image_path, text_path, label)`.
- Split this list into training and validation sets (e.g., 80:20 split).

In [7]:
# TODO: Construct the full path to the data and label files


# TODO: Read 'labelResultAll.txt' using pandas. The file has no header and is comma-separated.
# Name the columns ['id', 'label'].


# TODO: Convert string labels ('positive', 'negative', 'neutral') to integer labels (0, 1, 2).
# You can use a dictionary for mapping.


# TODO: Create a list of all data samples. Each item should be a tuple:
# (path_to_image, path_to_text, integer_label)
# Iterate through the DataFrame and create the file paths.


# TODO: Split the data into training and validation sets using train_test_split from scikit-learn.
# Use a test_size of 0.2 and a random_state of 42.


# TODO: Print the number of samples in the training and validation sets.
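
The steps above can be sketched as follows. To keep the example runnable without the dataset, it reads a small made-up label file from memory instead of `labelResultAll.txt`; the ten rows and the `data_dir` value are illustrative assumptions, not real MVSA content.

```python
import io
import os

import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for labelResultAll.txt; the real file would be read the same
# way by passing its path instead of the StringIO object.
toy_labels = io.StringIO(
    "1,positive\n2,negative\n3,neutral\n4,positive\n5,neutral\n"
    "6,negative\n7,positive\n8,neutral\n9,negative\n10,positive\n"
)
data_dir = os.path.join("MVSA_Single", "data")  # assumed relative location

df = pd.read_csv(toy_labels, header=None, names=["id", "label"])
label_map = {"positive": 0, "negative": 1, "neutral": 2}
df["label"] = df["label"].map(label_map)

# One (image_path, text_path, label) tuple per tweet ID.
samples = [
    (
        os.path.join(data_dir, f"{row.id}.jpg"),
        os.path.join(data_dir, f"{row.id}.txt"),
        int(row.label),
    )
    for row in df.itertuples(index=False)
]

train_samples, val_samples = train_test_split(
    samples, test_size=0.2, random_state=42
)
print(f"Train: {len(train_samples)}  Val: {len(val_samples)}")
```

If the classes turn out to be imbalanced, passing `stratify=[s[2] for s in samples]` to `train_test_split` keeps the label proportions similar in both splits.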

## 4️⃣ Text and Image Transformations

**Task**: Define the preprocessing pipelines for both images and text.

**Requirements**:
- For images: Define transforms to resize, convert to tensor, and normalize.
- For text:
    - Build a vocabulary from the training text data.
    - Create a text pipeline function to tokenize, numericalize (convert tokens to integers), and pad sequences.

In [8]:
# TODO: Define image transforms using transforms.Compose:
#       - Resize to (IMG_SIZE, IMG_SIZE)
#       - ToTensor()
#       - Normalize with mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225] (ImageNet stats)


# TODO: Build the text vocabulary.
#       - Create a tokenizer function (e.g., using NLTK or basic string split).
#       - Iterate through the text files in your *training data*.
#       - Build a vocabulary that maps words to indices. Include special tokens for padding ('<pad>') and unknown words ('<unk>').


# TODO: Define a text processing pipeline function.
#       This function should take a text file path, read the text, tokenize it,
#       convert tokens to their corresponding vocabulary indices, and pad/truncate the sequence to MAX_LEN.
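
One possible shape for the vocabulary and text pipeline is sketched below, using a plain whitespace tokenizer. The function names are hypothetical, the example strings are made up, and `encode` takes a string rather than a file path; reading the `.txt` file and wrapping the result in `torch.tensor(..., dtype=torch.long)` are left out to keep the sketch dependency-free.

```python
from collections import Counter

VOCAB_SIZE = 10000
MAX_LEN = 50
PAD, UNK = "<pad>", "<unk>"


def tokenize(text):
    """Lowercase whitespace tokenizer; an NLTK tokenizer could be swapped in."""
    return text.lower().split()


def build_vocab(texts, max_size=VOCAB_SIZE):
    """Map the most frequent tokens to indices, reserving 0 for PAD and 1 for UNK."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    vocab = {PAD: 0, UNK: 1}
    for tok, _ in counts.most_common(max_size - len(vocab)):
        vocab[tok] = len(vocab)
    return vocab


def encode(text, vocab, max_len=MAX_LEN):
    """Tokenize, numericalize, then truncate or right-pad to exactly max_len."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokenize(text)][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


# Illustrative strings; in the notebook these come from the training .txt files.
train_texts = ["what a great day", "what a terrible day"]
vocab = build_vocab(train_texts)
encoded = encode("what a strange day", vocab, max_len=6)
print(encoded)
```

Building the vocabulary from the training split only means that words seen only in validation tweets map to `<unk>`, which mirrors what happens on unseen data.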

## 5️⃣ Custom Dataset and DataLoaders

**Task**: Create a custom Dataset class and set up DataLoaders.

**Requirements**:
- The `Dataset` should handle loading one sample (image, text, and label).
- `__getitem__` should apply the image transforms and text processing pipeline.
- Create `DataLoader` instances for both training and validation sets.

In [9]:
# TODO: Create an MVSADataset class inheriting from torch.utils.data.Dataset.
# TODO: In __init__(self, data_samples, image_transform, text_pipeline):
#       - Store the samples list, transforms, and text pipeline.
# TODO: Implement __len__ to return the number of samples.
# TODO: Implement __getitem__ to:
#       - Get the image_path, text_path, and label for the given index.
#       - Load the image with PIL, apply the image_transform.
#       - Apply the text_pipeline to the text_path to get a processed tensor.
#       - Return (processed_image, processed_text, label).

# TODO: Create train_dataset and val_dataset using your MVSADataset class.

# TODO: Create train_loader and val_loader with DataLoader.
#       - Use appropriate BATCH_SIZE and shuffle for the training loader.

## 6️⃣ Part A: Image-Only Model (CNN)

First, we will build, train, and evaluate a model that only uses the images to predict sentiment.

### 6.1 Define the CNN Architecture
**Task**: Create a CNN for image classification. A pre-trained model like ResNet is a good choice.

**Requirements**:
- Load a pre-trained ResNet (e.g., ResNet18).
- Replace the final fully connected layer to match the number of sentiment classes (3).

In [10]:
# TODO: Define the ImageModel class or function.
#       - Use models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
#         (on torchvision < 0.13, use the older models.resnet18(pretrained=True)).
#       - Freeze the parameters of the pre-trained layers to avoid updating them initially.
#       - Replace the final `fc` layer with a new `nn.Linear` layer with 3 output units.

# TODO: Initialize the image model and move it to the configured device.
# TODO: Print the model architecture.

### 6.2 Train and Evaluate the Image Model
**Task**: Write the training and evaluation loop for the image-only model.

**Requirements**:
- Set up the loss function (CrossEntropyLoss) and optimizer (Adam).
- Loop through epochs and batches to train the model.
- After training, evaluate the model's performance on the validation set.
- Calculate and display the final accuracy and a confusion matrix.

In [11]:
# TODO: Instantiate the loss function (nn.CrossEntropyLoss).
# TODO: Instantiate the optimizer (e.g., optim.Adam) for the image model's parameters.

# TODO: Write the training loop for NUM_EPOCHS:
#       For each batch in train_loader:
#       - Get images and labels, move them to the device.
#       - Zero gradients.
#       - Get model outputs.
#       - Calculate loss.
#       - Backpropagate and update weights.

# TODO: Write the evaluation loop:
#       - Set model to evaluation mode (model.eval()).
#       - Use torch.no_grad() to disable gradient calculations.
#       - Iterate through the val_loader, get predictions.
#       - Collect all true labels and predictions.

# TODO: Calculate and print the final validation accuracy.
# TODO: Generate and plot a confusion matrix for the image model.
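
The loop structure described above can be sketched as two helper functions. Their names are hypothetical, and the demo uses random tensors and a linear classifier as stand-ins for the real loader and CNN. Note that the toy loader yields `(inputs, labels)` pairs, whereas the loaders from Section 5 yield `(image, text, label)` triples, so the unpacking would need adjusting.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.metrics import accuracy_score, confusion_matrix
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(model, loader, criterion, optimizer, device):
    """Run one pass over the loader and return the mean loss per sample."""
    model.train()
    running_loss = 0.0
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * labels.size(0)
    return running_loss / len(loader.dataset)


def evaluate(model, loader, device):
    """Collect true labels and predicted classes without tracking gradients."""
    model.eval()
    y_true, y_pred = [], []
    with torch.no_grad():
        for inputs, labels in loader:
            preds = model(inputs.to(device)).argmax(dim=1)
            y_true.extend(labels.tolist())
            y_pred.extend(preds.cpu().tolist())
    return y_true, y_pred


# Toy stand-ins: random "images" and a linear classifier instead of the CNN.
torch.manual_seed(42)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
toy_data = TensorDataset(torch.randn(32, 3, 8, 8), torch.randint(0, 3, (32,)))
toy_loader = DataLoader(toy_data, batch_size=8, shuffle=True)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 3)).to(device)
criterion = nn.CrossEntropyLoss()
# Passing only trainable parameters matters once a backbone is frozen.
optimizer = optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)

for epoch in range(3):
    loss = train_one_epoch(model, toy_loader, criterion, optimizer, device)
    print(f"epoch {epoch + 1}: loss={loss:.4f}")

y_true, y_pred = evaluate(model, toy_loader, device)
print("accuracy:", accuracy_score(y_true, y_pred))
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)
```

Passing `labels=[0, 1, 2]` to `confusion_matrix` keeps the matrix 3x3 even if a class never appears in a small validation set.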

## 7️⃣ Part B: Text-Only Model (RNN/LSTM)

Next, we will build a model that uses only the text from the tweets.

### 7.1 Define the RNN/LSTM Architecture
**Task**: Create a text classification model using an Embedding layer and an LSTM/GRU layer.

**Requirements**:
- An `nn.Embedding` layer to convert word indices to dense vectors.
- An `nn.LSTM` or `nn.GRU` layer to process the sequence.
- A final `nn.Linear` layer to produce class scores.

In [12]:
# TODO: Define the TextModel class inheriting from nn.Module.
#       In __init__:
#       - Create an nn.Embedding layer (vocab_size, embedding_dim).
#       - Create an nn.LSTM or nn.GRU layer.
#       - Create an nn.Linear layer for the output classification (hidden_dim -> 3).
#       In forward(self, text):
#       - Pass text through embedding layer.
#       - Pass embeddings through LSTM/GRU.
#       - Use the final hidden state of the LSTM/GRU for classification.
#       - Pass the hidden state through the linear layer.

# TODO: Initialize the text model and move it to the device.
# TODO: Print the model architecture.
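
A minimal sketch of such a model with a single-layer LSTM follows. The default sizes (`embedding_dim=100`, `hidden_dim=128`) and the choice of index 0 for padding are assumptions, not values prescribed by the assignment.

```python
import torch
import torch.nn as nn


class TextModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim=100, hidden_dim=128,
                 num_classes=3, pad_idx=0):
        super().__init__()
        # padding_idx keeps the <pad> embedding fixed at zero.
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, text):
        embedded = self.embedding(text)        # (batch, seq_len, embedding_dim)
        _, (hidden, _) = self.lstm(embedded)   # hidden: (num_layers, batch, hidden_dim)
        return self.fc(hidden[-1])             # (batch, num_classes)


text_model = TextModel(vocab_size=10000)
dummy_batch = torch.randint(0, 10000, (4, 50))  # 4 sequences of MAX_LEN=50 ids
text_logits = text_model(dummy_batch)
print(text_logits.shape)
```

With right-padded inputs the final hidden state has also processed the trailing `<pad>` tokens. If that hurts accuracy, `torch.nn.utils.rnn.pack_padded_sequence` is one option for skipping them.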

### 7.2 Train and Evaluate the Text Model
**Task**: Train and evaluate the text-only model using the same training and evaluation procedure as in Section 6.2.

In [13]:
# TODO: Instantiate the loss function and optimizer for the text model.

# TODO: Write the training loop for the text model.
#       For each batch in train_loader:
#       - Get texts and labels, move them to the device.
#       - Train the model (forward pass, loss, backward pass, optimizer step).

# TODO: Write the evaluation loop for the text model on the validation set.

# TODO: Calculate and print the final validation accuracy.
# TODO: Generate and plot a confusion matrix for the text model.

## 8️⃣ Part C: Combined Multimodal Model

Finally, we'll combine the image and text feature extractors into a single model that classifies sentiment from both inputs.

### 8.1 Define the Multimodal Architecture
**Task**: Create a model that takes both an image and text as input.

**Requirements**:
- Use the pre-trained image CNN (without its final classifier layer) as an image feature extractor.
- Use the trained text model (without its final classifier layer) as a text feature extractor.
- Concatenate the features from both branches.
- Add one or more `nn.Linear` layers to classify the combined feature vector.

In [14]:
# TODO: Define the MultiModalModel class inheriting from nn.Module.
#       In __init__:
#       - Instantiate your image feature extractor (e.g., ResNet without the last layer).
#       - Instantiate your text feature extractor (e.g., Embedding + LSTM).
#       - Define a new classifier (nn.Sequential with Linear, ReLU, Dropout, Linear)
#         that takes the concatenated feature dimension as input.
#       In forward(self, image, text):
#       - Get image features.
#       - Get text features.
#       - Concatenate the features (torch.cat).
#       - Pass the combined features through the new classifier.
#       - Return the final logits.

# TODO: Initialize the multimodal model and move it to the device.
# TODO: Print the model architecture.

### 8.2 Train and Evaluate the Multimodal Model
**Task**: Train and evaluate the final combined model.

In [15]:
# TODO: Instantiate the loss function and optimizer for the multimodal model.

# TODO: Write the training loop for the multimodal model.
#       For each batch in train_loader:
#       - Get images, texts, and labels, move them to the device.
#       - Train the model (forward pass, loss, backward pass, optimizer step).

# TODO: Write the evaluation loop for the multimodal model on the validation set.

# TODO: Calculate and print the final validation accuracy.
# TODO: Generate and plot a confusion matrix for the multimodal model.

## 9️⃣ Performance Comparison

**Task**: Present the results of all three models side-by-side.

**Requirements**:
- Display the final validation accuracies for the Image-Only, Text-Only, and Multimodal models.
- Plot the confusion matrices for all three models in a single figure for easy comparison.

In [16]:
# TODO: Print the final validation accuracies for all three models in a summary table or list.

# TODO: Create a 1x3 subplot using matplotlib.
# TODO: Plot the confusion matrix for the image model in the first subplot.
# TODO: Plot the confusion matrix for the text model in the second subplot.
# TODO: Plot the confusion matrix for the multimodal model in the third subplot.
# TODO: Add titles to each subplot.
# TODO: Display the final plot.
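
A possible layout for the comparison figure is sketched below. The labels and predictions are random placeholders generated only so that the plotting code runs; they are not results from any model and should be replaced with the lists collected in the three evaluation loops.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, confusion_matrix

class_names = ["positive", "negative", "neutral"]  # indices 0, 1, 2

# Random placeholder labels and predictions, NOT real results.
rng = np.random.default_rng(42)
y_true = rng.integers(0, 3, size=100)
predictions = {
    "Image-Only": rng.integers(0, 3, size=100),
    "Text-Only": rng.integers(0, 3, size=100),
    "Multimodal": rng.integers(0, 3, size=100),
}

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (name, y_pred) in zip(axes, predictions.items()):
    acc = accuracy_score(y_true, y_pred)
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
    ConfusionMatrixDisplay(cm, display_labels=class_names).plot(ax=ax, colorbar=False)
    ax.set_title(f"{name} (acc={acc:.2f})")
    print(f"{name:<12} accuracy: {acc:.4f}")
fig.tight_layout()
plt.show()
```

Putting the accuracy in each subplot title keeps the summary numbers and the confusion matrices in one figure.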

## 📝 Evaluation Criteria

Your homework will be evaluated based on:

1.  **Implementation Correctness (40%)**
    - Correct implementation of all three model architectures (CNN, RNN/LSTM, Combined).
    - Proper data loading, preprocessing, and splitting.
    - Working training and evaluation loops for each model.

2.  **Model Training and Results (30%)**
    - All three models train without errors.
    - Loss decreases over epochs.
    - Final models produce reasonable predictions on the validation set.

3.  **Code Quality (20%)**
    - Clean, readable code with comments explaining key parts.
    - Correct use of PyTorch modules, tensor shapes, and data flow.
    - Efficient implementation.

4.  **Comparison and Visualization (10%)**
    - Clear presentation of final accuracies for all models.
    - Correctly generated and clearly labeled confusion matrices for comparison.

# Written by: Ali Habibullah