<a href="https://colab.research.google.com/github/ROODARGODARA/aspect-based-sentiment-analysis/blob/main/notebooks/multimodal_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multimodal Sentiment Analysis using Text and Images

This notebook implements a multimodal sentiment analysis system that combines
textual and visual information to classify sentiment as negative, neutral, or
positive. Text features are extracted with BERT (the [CLS] embedding) and image
features with ResNet-50; both are projected to 512 dimensions, concatenated,
and passed to a linear classifier for the final prediction.
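
As a compact summary of the fusion, the forward pass defined later in this notebook can be written as follows. The symbols ($h_t$, $h_v$, $W_t$, and so on) are notation introduced here for illustration only; the dimensions are taken from the `text_proj`, `image_proj`, and `classifier` layers in the model code.

```latex
\begin{aligned}
h_t &= \mathrm{BERT}(x_{\text{text}})_{[\mathrm{CLS}]} \in \mathbb{R}^{768}, &
h_v &= \mathrm{ResNet50}(x_{\text{image}}) \in \mathbb{R}^{2048}, \\
p_t &= W_t h_t + b_t \in \mathbb{R}^{512}, &
p_v &= W_v h_v + b_v \in \mathbb{R}^{512}, \\
z &= W_c \,\mathrm{ReLU}\big([\,p_t ; p_v\,]\big) + b_c \in \mathbb{R}^{3}, &
\hat{y} &= \arg\max_k z_k .
\end{aligned}
```

Here $[\,p_t ; p_v\,]$ denotes concatenation into a 1024-dimensional vector, and $z$ holds the logits passed to the cross-entropy loss during training.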


In [None]:
!pip install -q transformers torchvision pandas scikit-learn tqdm


In [None]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

from transformers import BertTokenizer, BertModel
from torchvision import models, transforms
from PIL import Image

import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from tqdm import tqdm


## Dataset Description

The dataset consists of short text captions, corresponding images, and sentiment
labels. Labels are encoded as:
- 0: Negative
- 1: Neutral
- 2: Positive
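
The encoding above can be kept in one place so that predicted class indices are easy to turn back into readable names. This is a minimal sketch; `ID_TO_LABEL`, `LABEL_TO_ID`, and `decode_predictions` are hypothetical helper names that are not part of the original notebook.

```python
# Label encoding as listed in the dataset description.
ID_TO_LABEL = {0: "negative", 1: "neutral", 2: "positive"}
LABEL_TO_ID = {name: idx for idx, name in ID_TO_LABEL.items()}


def decode_predictions(class_ids):
    """Map an iterable of integer class indices to sentiment names."""
    return [ID_TO_LABEL[int(class_id)] for class_id in class_ids]
```

For example, the `preds` list collected in `evaluate_model` could be passed to `decode_predictions` when inspecting individual errors.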


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device


device(type='cpu')

In [None]:
class MultimodalDataset(Dataset):
    def __init__(self, csv_path, image_dir, tokenizer, max_len=128):
        self.data = pd.read_csv(csv_path)
        self.image_dir = image_dir
        self.tokenizer = tokenizer
        self.max_len = max_len

        self.image_transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]
            )
        ])

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text = str(self.data.iloc[idx]["text"])
        image_name = self.data.iloc[idx]["image"]
        label = int(self.data.iloc[idx]["label"])

        encoding = self.tokenizer(
            text,
            max_length=self.max_len,
            padding="max_length",
            truncation=True,
            return_tensors="pt"
        )

        image_path = f"{self.image_dir}/{image_name}"
        image = Image.open(image_path).convert("RGB")
        image = self.image_transform(image)

        return {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "image": image,
            "label": torch.tensor(label, dtype=torch.long)
        }


In [None]:
class TextEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS]
        return cls_embedding


In [None]:
class ImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
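        # `pretrained=True` is deprecated in torchvision >= 0.13; this selects the same ImageNet weights.
        resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)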
        self.feature_extractor = nn.Sequential(*list(resnet.children())[:-1])

    def forward(self, images):
        features = self.feature_extractor(images)
        features = features.view(features.size(0), -1)
        return features


In [None]:
class MultimodalSentimentModel(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()

        self.text_encoder = TextEncoder()
        self.image_encoder = ImageEncoder()

        self.text_proj = nn.Linear(768, 512)
        self.image_proj = nn.Linear(2048, 512)

        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(1024, num_classes)
        )

    def forward(self, input_ids, attention_mask, images):
        text_features = self.text_encoder(input_ids, attention_mask)
        image_features = self.image_encoder(images)

        text_features = self.text_proj(text_features)
        image_features = self.image_proj(image_features)

        fused = torch.cat((text_features, image_features), dim=1)
        output = self.classifier(fused)

        return output


In [None]:
def train_model(model, dataloader, optimizer, criterion):
    model.train()
    total_loss = 0

    for batch in tqdm(dataloader):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        images = batch["image"].to(device)
        labels = batch["label"].to(device)

        optimizer.zero_grad()

        outputs = model(input_ids, attention_mask, images)
        loss = criterion(outputs, labels)

        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    return total_loss / len(dataloader)


In [None]:
def evaluate_model(model, dataloader):
    model.eval()
    preds, targets = [], []

    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            images = batch["image"].to(device)
            labels = batch["label"].to(device)

            outputs = model(input_ids, attention_mask, images)
            predictions = torch.argmax(outputs, dim=1)

            preds.extend(predictions.cpu().numpy())
            targets.extend(labels.cpu().numpy())

    acc = accuracy_score(targets, preds)
    f1 = f1_score(targets, preds, average="weighted")

    return acc, f1
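
To make `average="weighted"` in the call above concrete, the sketch below computes the same quantity by hand: the per-class F1 scores averaged with weights proportional to each class's number of true samples. It is an illustration in plain Python, not a replacement for scikit-learn, and `weighted_f1` is a hypothetical name.

```python
from collections import Counter


def weighted_f1(targets, preds):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(targets)  # number of true samples per class
    total = len(targets)
    score = 0.0
    for cls, n_true in support.items():
        true_pos = sum(1 for t, p in zip(targets, preds) if t == cls and p == cls)
        n_pred = sum(1 for p in preds if p == cls)
        precision = true_pos / n_pred if n_pred else 0.0
        recall = true_pos / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total  # weight by class support
    return score
```

Because the weights follow class frequency, a dominant class can hide poor performance on rare classes; macro averaging is the usual alternative when every class should count equally.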


## File Organization and Directory Correction

During initial setup in Google Colab, the dataset files were uploaded into a nested directory (`/content/content/`). The files were then moved programmatically with Python's `shutil` module so that the CSV sits at `/content/train.csv` and the images at `/content/images`, the paths the data loader expects. Keeping these paths consistent prevents file-not-found errors during training.

In [None]:
import shutil
import os

# Move train.csv
# shutil.move("/content/content/train.csv", "/content/train.csv")
if os.path.exists("/content/train_clean.csv"):  # skip if already moved on a previous run
    shutil.move("/content/train_clean.csv", "/content/train.csv")

# Move images folder
# shutil.move("/content/content/images", "/content/images")

# Verify
print(os.path.exists("/content/train.csv"))
print(os.path.exists("/content/images"))
print(os.listdir("/content/images"))

True
True
['2534.jpg', '2526.jpg', '2545.jpg', '2594.jpg', '2504.jpg', '2515.jpg', '2503.jpg', '2552.jpg', '2505.jpg', '2558.jpg', '2582.jpg', '2512.jpg', '2593.jpg', '2598.jpg', '2508.jpg', '2543.jpg', '2566.jpg', '2591.jpg', '2510.jpg', '2576.jpg', '2525.jpg', '2551.jpg', '2550.jpg', '2596.jpg', '2500.jpg', '2599.jpg', '2533.jpg', '2506.jpg', '2532.jpg', '2536.jpg', '2507.jpg', '2516.jpg', '2501.jpg', '2562.jpg', '2571.jpg', '2567.jpg', '2521.jpg', '2564.jpg', '2585.jpg', '2522.jpg', '2560.jpg', '2539.jpg', '2579.jpg', '2523.jpg', '2527.jpg', '2537.jpg', '2586.jpg', '2514.jpg', '2559.jpg', '2580.jpg', '2595.jpg', '2565.jpg', '2499.jpg', '2574.jpg', '2509.jpg', '2570.jpg', '2520.jpg', '2577.jpg', '2592.jpg', '2538.jpg', '2584.jpg', '2519.jpg', '2587.jpg', '2524.jpg', '2597.jpg', '2518.jpg', '2531.jpg', '2549.jpg', '2529.jpg', '2569.jpg', '2528.jpg', '2511.jpg', '2572.jpg', '2544.jpg', '2540.jpg', '2557.jpg', '2583.jpg', '2561.jpg', '2575.jpg', '2554.jpg', '2590.jpg', '2517.jpg', '2548

## Dataset Verification

After organizing the dataset, a verification step was performed using the os module to confirm the existence of required files and directories. This check ensures that the training script can correctly access the CSV file and associated image folder before model execution.

In [None]:
import os

print("train.csv exists:", os.path.exists("/content/train.csv"))
print("images folder exists:", os.path.exists("/content/images"))
print("images:", os.listdir("/content/images"))

train.csv exists: True
images folder exists: True
images: ['2534.jpg', '2526.jpg', '2545.jpg', '2594.jpg', '2504.jpg', '2515.jpg', '2503.jpg', '2552.jpg', '2505.jpg', '2558.jpg', '2582.jpg', '2512.jpg', '2593.jpg', '2598.jpg', '2508.jpg', '2543.jpg', '2566.jpg', '2591.jpg', '2510.jpg', '2576.jpg', '2525.jpg', '2551.jpg', '2550.jpg', '2596.jpg', '2500.jpg', '2599.jpg', '2533.jpg', '2506.jpg', '2532.jpg', '2536.jpg', '2507.jpg', '2516.jpg', '2501.jpg', '2562.jpg', '2571.jpg', '2567.jpg', '2521.jpg', '2564.jpg', '2585.jpg', '2522.jpg', '2560.jpg', '2539.jpg', '2579.jpg', '2523.jpg', '2527.jpg', '2537.jpg', '2586.jpg', '2514.jpg', '2559.jpg', '2580.jpg', '2595.jpg', '2565.jpg', '2499.jpg', '2574.jpg', '2509.jpg', '2570.jpg', '2520.jpg', '2577.jpg', '2592.jpg', '2538.jpg', '2584.jpg', '2519.jpg', '2587.jpg', '2524.jpg', '2597.jpg', '2518.jpg', '2531.jpg', '2549.jpg', '2529.jpg', '2569.jpg', '2528.jpg', '2511.jpg', '2572.jpg', '2544.jpg', '2540.jpg', '2557.jpg', '2583.jpg', '2561.jpg', '2575
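
Checking that the folder exists does not guarantee that every row of the CSV has a matching image. The sketch below lists image names referenced by the CSV but absent from the image directory. It assumes the CSV has an `image` column, as read in `MultimodalDataset.__getitem__`; `find_missing_images` is a hypothetical helper, not part of the original notebook.

```python
import csv
import os


def find_missing_images(csv_path, image_dir, image_column="image"):
    """Return image names listed in the CSV that are not present in image_dir."""
    available = set(os.listdir(image_dir))
    missing = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            name = row[image_column]
            if name not in available:
                missing.append(name)
    return missing


# Usage with the paths from this notebook:
# missing = find_missing_images("/content/train.csv", "/content/images")
# print(f"{len(missing)} missing images", missing[:10])
```

Running this before training surfaces missing files up front, rather than as an exception from `Image.open` partway through an epoch.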

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

train_dataset = MultimodalDataset(
    csv_path="/content/train.csv",
    image_dir="/content/images",
    tokenizer=tokenizer
)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

model = MultimodalSentimentModel().to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

BertModel LOAD REPORT from: bert-base-uncased
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
cls.seq_relationship.weight                | UNEXPECTED |  | 
cls.predictions.bias                       | UNEXPECTED |  | 
cls.predictions.transform.dense.bias       | UNEXPECTED |  | 
cls.predictions.transform.dense.weight     | UNEXPECTED |  | 
cls.seq_relationship.bias                  | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


In [26]:
EPOCHS = 5

for epoch in range(EPOCHS):
    loss = train_model(model, train_loader, optimizer, criterion)
    # NOTE: evaluated on the training data, so these metrics do not measure generalization.
    acc, f1 = evaluate_model(model, train_loader)

    print(f"Epoch {epoch+1}")
    print(f"Loss: {loss:.4f} | Accuracy: {acc:.4f} | F1: {f1:.4f}")

100%|██████████| 12/12 [03:27<00:00, 17.31s/it]


Epoch 1
Loss: 0.0000 | Accuracy: 1.0000 | F1: 1.0000


100%|██████████| 12/12 [03:00<00:00, 15.04s/it]


Epoch 2
Loss: 0.0000 | Accuracy: 1.0000 | F1: 1.0000


100%|██████████| 12/12 [03:02<00:00, 15.18s/it]


Epoch 3
Loss: 0.0000 | Accuracy: 1.0000 | F1: 1.0000


100%|██████████| 12/12 [03:02<00:00, 15.21s/it]


Epoch 4
Loss: 0.0000 | Accuracy: 1.0000 | F1: 1.0000


100%|██████████| 12/12 [03:01<00:00, 15.10s/it]


Epoch 5
Loss: 0.0000 | Accuracy: 1.0000 | F1: 1.0000


## Results and Observations

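The training loop runs end to end on the multimodal dataset, so the pipeline
(BERT text features, ResNet-50 image features, concatenation, linear
classifier) is functional. The reported numbers, however, should not be read as
a measure of model quality. Accuracy and F1 are computed on `train_loader`, the
same data used for training, and the demo dataset is small (12 batches of 8, so
at most 96 samples). A loss of 0.0000 with accuracy and F1 of 1.0000 from the
first epoch onward is more likely to reflect memorization, a model already
trained in an earlier run of the cell, or a degenerate label distribution than
genuine generalization. A held-out validation split and a check of the label
distribution are needed before drawing conclusions about how effective the
fusion of text and image features is.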
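
Because the training loop in this notebook evaluates on `train_loader`, one way to get more informative metrics is to hold out part of the data. The sketch below splits sample indices reproducibly in plain Python. The 80/20 ratio, the seed, and the name `split_indices` are assumptions made for illustration, not choices from the original notebook.

```python
import random


def split_indices(n_samples, val_fraction=0.2, seed=42):
    """Shuffle sample indices and split them into train and validation lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG, leaves global state untouched
    n_val = max(1, int(n_samples * val_fraction))
    return indices[n_val:], indices[:n_val]


# Possible usage with the objects defined earlier in this notebook:
# from torch.utils.data import Subset
# train_idx, val_idx = split_indices(len(train_dataset))
# train_loader = DataLoader(Subset(train_dataset, train_idx), batch_size=8, shuffle=True)
# val_loader = DataLoader(Subset(train_dataset, val_idx), batch_size=8)
# acc, f1 = evaluate_model(model, val_loader)
```

With so few samples the validation estimate will be noisy, so a stratified split or cross-validation may be worth considering if the classes are imbalanced.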
