<a href="https://colab.research.google.com/github/JaveyBae/exist2025/blob/main/fine_tuning_clip.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [42]:
# 1. Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# 2. Unzip the file
import zipfile
import os

# Specify the zip file path (modify according to your actual path)
zip_path = '/content/drive/MyDrive/memes.zip'
extract_dir = '/content/'  # Extract to Colab local storage (faster access)

# Create extraction directory
os.makedirs(extract_dir, exist_ok=True)

# Extract the zip file
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)

print(f"✅ Extraction complete. Files are located at: {extract_dir}")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ Extraction complete. Files are located at: /content/


In [43]:
!pip install -U sentence-transformers

import pandas as pd
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from sentence_transformers import SentenceTransformer
from PIL import Image
from tqdm import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cuda


# Task
Evaluate the trained multimodal classifier model on a subset of the memes dataset and report the accuracy.

## Load the trained model

### Subtask:
Load the model state dictionary from the saved file.


**Reasoning**:
Instantiate the model, load the state dictionary, and move the model to the correct device.
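
No code accompanies this step in the notebook, so what follows is only a sketch of the usual pattern. It uses a small stand-in module (`TinyHead`, a hypothetical name) and a temporary checkpoint file so that it runs on its own. In this notebook the model would presumably be `MultimodalClassifier` and the checkpoint `best_model.pth`.

```python
import os
import tempfile

import torch
from torch import nn


class TinyHead(nn.Module):
    """Hypothetical stand-in for MultimodalClassifier so the sketch runs alone."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 1)

    def forward(self, x):
        return self.fc(x)


device = "cuda" if torch.cuda.is_available() else "cpu"

# Save a checkpoint the same way the training loop does (state_dict only).
trained = TinyHead()
checkpoint_path = os.path.join(tempfile.mkdtemp(), "best_model.pth")
torch.save(trained.state_dict(), checkpoint_path)

# Instantiate a fresh model, load the weights, then move it to the device.
# map_location lets a checkpoint saved on a GPU load on a CPU-only machine.
restored = TinyHead()
restored.load_state_dict(torch.load(checkpoint_path, map_location=device))
restored.to(device)
restored.eval()  # disable dropout and use running batch-norm statistics
```

Calling `eval()` matters for the classifier in this notebook because it contains `Dropout` and `BatchNorm1d` layers, which behave differently in training and evaluation mode.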



## Prepare evaluation data

### Subtask:
Prepare evaluation data by creating a new `MemeDataset` and `DataLoader` for a subset of the data.


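**Reasoning**:
Hold out 20% of the dataset with `random_split` and wrap it in a `DataLoader` that uses `custom_collate_fn`, with shuffling disabled since order does not affect evaluation. This split is drawn independently of any split used during training, so the evaluation subset may include examples the model has already seen.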



In [31]:
from torch.utils.data import random_split

# Create a new instance of the MemeDataset class
full_dataset = MemeDataset(csv_file=CSV_PATH, image_dir=IMAGE_DIR)

# Determine the sizes for the training and evaluation sets
dataset_size = len(full_dataset)
eval_size = int(0.2 * dataset_size) # 20% for evaluation
train_size = dataset_size - eval_size # Remaining for training

# Split the dataset into training and evaluation sets
train_dataset, eval_dataset = random_split(full_dataset, [train_size, eval_size])

# Create a DataLoader for the evaluation dataset
EVAL_BATCH_SIZE = 16 # Or 32, depending on memory
eval_loader = DataLoader(eval_dataset, batch_size=EVAL_BATCH_SIZE, shuffle=False, collate_fn=custom_collate_fn)

print(f"Full dataset size: {dataset_size}")
print(f"Training dataset size: {train_size}")
print(f"Evaluation dataset size: {eval_size}")
print("✅ Evaluation DataLoader created.")

Full dataset size: 4044
Training dataset size: 3236
Evaluation dataset size: 808
✅ Evaluation DataLoader created.


## Evaluate the model

### Subtask:
Iterate through the evaluation DataLoader, get predictions from the model, and calculate accuracy.


**Reasoning**:
Put the model in evaluation mode, run each batch under `torch.no_grad()`, threshold the sigmoid of the logits at 0.5, and divide the number of correct predictions by the number of examples.
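
No evaluation cell survives at this point in the notebook. As a rough sketch of that accuracy calculation, written in plain Python so it runs without the model: `binary_accuracy` is a hypothetical helper, and the logits and labels in the usage lines are made-up values, not results from this dataset.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_accuracy(logits, labels, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    correct = 0
    for logit, label in zip(logits, labels):
        pred = 1 if sigmoid(logit) > threshold else 0
        correct += int(pred == label)
    return correct / len(labels)


# Illustrative values only.
example_logits = [2.0, -1.0, 0.3, -0.2]
example_labels = [1, 0, 0, 0]
accuracy = binary_accuracy(example_logits, example_labels)
print(f"accuracy = {accuracy:.2f}")
```

With a threshold of 0.5, comparing the sigmoid to the threshold is equivalent to checking whether the raw logit is positive, so the sigmoid is needed only when probabilities themselves are of interest.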



## Load and split data

### Subtask:
Load the dataset and split it into training, validation, and test sets using an 8:1:1 ratio.


**Reasoning**:
Load the CSV, create the dataset, calculate split sizes, and perform the random split.



In [33]:
# Load the CSV file
dataframe = pd.read_csv(CSV_PATH)

# Create the full dataset
full_dataset = MemeDataset(csv_file=CSV_PATH, image_dir=IMAGE_DIR)

# Calculate dataset sizes for 8:1:1 split
dataset_size = len(full_dataset)
train_size = int(0.8 * dataset_size)
val_size = int(0.1 * dataset_size)
test_size = dataset_size - train_size - val_size # Allocate remaining to test

# Split the dataset
train_dataset, val_dataset, test_dataset = random_split(full_dataset, [train_size, val_size, test_size])

print(f"Full dataset size: {dataset_size}")
print(f"Training dataset size: {train_size}")
print(f"Validation dataset size: {val_size}")
print(f"Test dataset size: {test_size}")

Full dataset size: 4044
Training dataset size: 3235
Validation dataset size: 404
Test dataset size: 405
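
The sizes printed above follow from integer truncation of the 80% and 10% shares, with the remainder assigned to the test set. A small sketch of that arithmetic, where `split_sizes` is a hypothetical helper that is not part of the notebook:

```python
def split_sizes(n_samples, train_frac=0.8, val_frac=0.1):
    """Return (train, val, test) sizes that always sum to n_samples."""
    train_size = int(train_frac * n_samples)
    val_size = int(val_frac * n_samples)
    # Give the rounding remainder to the test set so no example is dropped.
    test_size = n_samples - train_size - val_size
    return train_size, val_size, test_size


sizes = split_sizes(4044)
print(sizes)
```

`random_split` also accepts a `generator` argument. Passing a seeded `torch.Generator` would make the split reproducible across runs, which the cells in this notebook do not currently do.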


## Create dataloaders

### Subtask:
Create DataLoaders for the training, validation, and test sets.


**Reasoning**:
Create DataLoaders for the training, validation, and test sets using the defined batch size and the custom collate function.



In [34]:
# Define batch size
BATCH_SIZE = 16 # You can adjust this if needed

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=custom_collate_fn)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=custom_collate_fn)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=custom_collate_fn)

print(f"Batch size for DataLoaders: {BATCH_SIZE}")
print("✅ DataLoaders for training, validation, and test sets created.")

Batch size for DataLoaders: 16
✅ DataLoaders for training, validation, and test sets created.


## Define and initialize model

### Subtask:
Define and initialize the multimodal classifier model.


**Reasoning**:
Instantiate the MultimodalClassifier model, move it to the device, and print a confirmation message.



## Train the model with validation

### Subtask:
Train the model on the training set and evaluate it on the validation set during training to monitor performance and detect overfitting.


**Reasoning**:
Set the model to training mode, define the loss function and optimizer, and then iterate through epochs to train the model on the training data and evaluate on the validation data.
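
The training cell later in this notebook monitors validation loss and stops after three epochs without improvement. The bookkeeping can be sketched on its own as below, assuming a hypothetical `EarlyStopping` helper class. The validation losses are copied from the training log in this notebook.

```python
class EarlyStopping:
    """Track the best validation loss and signal when to stop."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0  # a checkpoint would be saved at this point
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Validation losses per epoch, taken from the training log in this notebook.
val_losses = [0.6430, 0.6322, 0.6093, 0.6238, 0.6132, 0.6122]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(f"Best epoch: {stopper.best_epoch}, stopped at epoch: {stopped_at}")
```

The comparison is strict, so an epoch that only matches the best loss counts against the patience budget, as it does in the notebook's own loop.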



**Reasoning**:
The dataset could not be found at `train.csv`, `/data/train.csv`, or `/kaggle/input/train.csv`. A listing of the current directory shows `processed_data_all_labels.csv` instead, which is likely the dataset for this task. Load that file into a pandas DataFrame and display its first rows, column names, and data types to confirm its structure.



In [37]:
# Load the dataset 'processed_data_all_labels.csv'
df = pd.read_csv('processed_data_all_labels.csv')

# Display the first few rows
display(df.head())

# Display column names and data types (df.info() prints its report and returns None)
df.info()

| | id | lang | text | task4_hard | task4_soft | task5_hard | task5_soft | task6_hard | task6_soft |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 110001 | es | 2+2=5 MITO Albert Einstein tenía bajo rendimie... | [1.0, 0.0] | [1.0, 0.0] | [1.0, 0.0] | [1.0, 0.0] | [1.0, 0.0, 0.0, 0.0, 0.0] | [1.0, 0.1667, 0.0, 0.0, 0.1667] |
| 1 | 110002 | es | CUANDO UNA MUJER VA A LUCHAR POR SUS DERECHOS | [1.0, 0.0] | [1.0, 0.0] | [1.0, 0.0] | [0.8333, 0.1667] | [1.0, 1.0, 0.0, 0.0, 0.0] | [0.6667, 0.6667, 0.0, 0.0, 0.1667] |
| 2 | 110003 | es | ІЯ ЕГЕЯ Е MOA ¿El Partido Republicano busca pe... | [0.0, 1.0] | [0.3333, 0.6667] | [1.0, 0.0] | [0.3333, 0.0] | [0.0, 0.0, 0.0, 0.0, 0.0] | [0.0, 0.1667, 0.1667, 0.0, 0.3333] |
| 3 | 110004 | es | Paises que "apoyan" los derechos de la mujer A... | [1.0, 0.0] | [0.5, 0.5] | [0.0, 1.0] | [0.0, 0.5] | [0.0, 0.0, 0.0, 0.0, 0.0] | [0.3333, 0.1667, 0.0, 0.0, 0.0] |
| 4 | 110005 | es | Ya verás como este 8 de marzo hay uno que te s... | [0.0, 1.0] | [0.3333, 0.6667] | [0.0, 1.0] | [0.1667, 0.1667] | [0.0, 0.0, 0.0, 0.0, 0.0] | [0.3333, 0.0, 0.0, 0.0, 0.0] |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4044 entries, 0 to 4043
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          4044 non-null   int64 
 1   lang        4044 non-null   object
 2   text        4044 non-null   object
 3   task4_hard  4044 non-null   object
 4   task4_soft  4044 non-null   object
 5   task5_hard  4044 non-null   object
 6   task5_soft  4044 non-null   object
 7   task6_hard  4044 non-null   object
 8   task6_soft  4044 non-null   object
dtypes: int64(1), object(8)
memory usage: 284.5+ KB

**Reasoning**:
The previous step successfully loaded the data into a DataFrame. The 'task4_hard' column, which is relevant to the subtask, is of type 'object' and contains list-like strings. To use this column for analysis, it needs to be converted to a more usable format. I will convert the 'task4_hard' column from string representation of a list to an actual list of floats, then extract the first element of each list, which represents the label for task4_hard, and convert it to an integer type for further analysis.



In [38]:
import ast

# Convert the 'task4_hard' column from string representation of list to actual list
df['task4_hard'] = df['task4_hard'].apply(ast.literal_eval)

# Extract the first element of the list and convert to integer
df['task4_hard_label'] = df['task4_hard'].apply(lambda x: int(x[0]))

# Display the first few rows with the new column
display(df[['task4_hard', 'task4_hard_label']].head())

# Display the data type of the new column (info() prints its report and returns None)
df['task4_hard_label'].info()

| | task4_hard | task4_hard_label |
|---|---|---|
| 0 | [1.0, 0.0] | 1 |
| 1 | [1.0, 0.0] | 1 |
| 2 | [0.0, 1.0] | 0 |
| 3 | [1.0, 0.0] | 1 |
| 4 | [0.0, 1.0] | 0 |


<class 'pandas.core.series.Series'>
RangeIndex: 4044 entries, 0 to 4043
Series name: task4_hard_label
Non-Null Count  Dtype
--------------  -----
4044 non-null   int64
dtypes: int64(1)
memory usage: 31.7 KB
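
The cell above takes the first element of each vector as the label, whereas `MemeDataset.__getitem__` later in this notebook uses `label_vec.index(1.0)`. For a two-element one-hot vector these two conventions give opposite labels. The sketch below shows this with two hypothetical helper functions. Which convention matches the intended meaning of `task4_hard` is not stated in the notebook.

```python
import ast


def label_from_first_element(label_str):
    """Convention of the exploratory cell: label = first entry of the vector."""
    return int(ast.literal_eval(label_str)[0])


def label_from_one_hot_index(label_str):
    """Convention of MemeDataset: label = position of the 1.0 entry."""
    return ast.literal_eval(label_str).index(1.0)


for raw in ["[1.0, 0.0]", "[0.0, 1.0]"]:
    first = label_from_first_element(raw)
    index = label_from_one_hot_index(raw)
    print(f"{raw}: first-element label = {first}, one-hot index label = {index}")
```

Accuracy is unaffected by which class is called 1, but the counts and any per-class metrics should be read with the relevant convention in mind.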

**Reasoning**:
The 'task4_hard_label' column has been successfully created and converted to an integer type. Now, to understand the distribution of labels in the 'task4_hard' column as required by the subtask, I will calculate and display the value counts for the 'task4_hard_label' column.



In [39]:
# Calculate and display the value counts for the 'task4_hard_label' column
label_counts = df['task4_hard_label'].value_counts()
print("Value counts for 'task4_hard_label':")
print(label_counts)

Value counts for 'task4_hard_label':
task4_hard_label
1    2324
0    1720
Name: count, dtype: int64
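
These counts give a reference point for the accuracies reported later: a classifier that always predicts the majority class would already be right on that share of the data. A small sketch of the arithmetic follows. The counts cover the full dataset, so the share in any particular random split may differ slightly.

```python
# Class counts reported by value_counts() above.
label_counts = {1: 2324, 0: 1720}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)
baseline_accuracy = label_counts[majority_label] / total

print(f"Total examples: {total}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```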


In [41]:
# Add more regularization and early stopping
import pandas as pd
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader, random_split
from sentence_transformers import SentenceTransformer
from PIL import Image
from tqdm import tqdm
import ast
import os

class MultimodalClassifier(nn.Module):
    def __init__(self, num_classes=1):
        super(MultimodalClassifier, self).__init__()
        self.img_model = SentenceTransformer('clip-ViT-B-32')
        self.text_model = SentenceTransformer('sentence-transformers/clip-ViT-B-32-multilingual-v1')

        embedding_dim = 512  # CLIP default dimension

        # Enhanced classifier with more regularization
        self.classifier = nn.Sequential(
            nn.Linear(embedding_dim * 2, 512),
            nn.ReLU(),
            nn.BatchNorm1d(512),  # Add Batch Normalization
            nn.Dropout(0.5),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.BatchNorm1d(256),
            nn.Dropout(0.4),  # More Dropout
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, num_classes)
        )

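    def forward(self, images, texts):
        # Both CLIP encoders run under no_grad, so they stay frozen;
        # only the classifier head below is trained.
        with torch.no_grad():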
            img_embeddings = self.img_model.encode(
                images,
                convert_to_tensor=True,
                device=self._get_device(),
                show_progress_bar=False,
                batch_size=len(images)
            )

            text_embeddings = self.text_model.encode(
                texts,
                convert_to_tensor=True,
                device=self._get_device(),
                show_progress_bar=False,
                batch_size=len(texts)
            )

        combined_embeddings = torch.cat((img_embeddings, text_embeddings), dim=1)
        logits = self.classifier(combined_embeddings)
        return logits

    def _get_device(self):
        return next(self.parameters()).device

# Dataset and collate_fn remain unchanged
class MemeDataset(Dataset):
    def __init__(self, csv_file, image_dir):
        self.dataframe = pd.read_csv(csv_file)
        self.image_dir = image_dir

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        img_id = str(self.dataframe.iloc[idx]['id'])
        possible_extensions = ['.jpeg', '.jpg', '.png', '.JPEG', '.JPG', '.PNG']

        img_path = None
        for ext in possible_extensions:
            temp_path = os.path.join(self.image_dir, img_id + ext)
            if os.path.exists(temp_path):
                img_path = temp_path
                break

        if img_path is None:
            raise FileNotFoundError(f"Image not found: {img_id}")

        image = Image.open(img_path).convert("RGB")
        text = self.dataframe.iloc[idx]['text']

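        # task4_hard is stored as a string such as "[1.0, 0.0]"; parse it and use
        # the position of the 1.0 entry as the class index. Note that this is the
        # inverse of the int(x[0]) convention used in the earlier exploratory cell.
        label_str = self.dataframe.iloc[idx]['task4_hard']
        label_vec = ast.literal_eval(label_str)
        label_idx = label_vec.index(1.0)
        label = torch.tensor(label_idx, dtype=torch.float)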

        return {'image': image, 'text': text, 'label': label}

def custom_collate_fn(batch):
    images = [item['image'] for item in batch]
    texts = [item['text'] for item in batch]
    labels = torch.stack([item['label'] for item in batch])

    return {
        'image': images,
        'text': texts,
        'label': labels
    }

# Training settings
EPOCHS = 10  # Increase epochs, use early stopping
LEARNING_RATE = 1e-4  # Reduce learning rate
BATCH_SIZE = 16
PATIENCE = 3  # Early stopping patience

CSV_PATH = 'processed_data_all_labels.csv'
IMAGE_DIR = 'memes'

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Data splitting 8:1:1
full_dataset = MemeDataset(csv_file=CSV_PATH, image_dir=IMAGE_DIR)
dataset_size = len(full_dataset)
train_size = int(0.8 * dataset_size)
val_size = int(0.1 * dataset_size)
test_size = dataset_size - train_size - val_size

train_dataset, val_dataset, test_dataset = random_split(
    full_dataset, [train_size, val_size, test_size]
)

print(f"Train set: {train_size}, Validation set: {val_size}, Test set: {test_size}")

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=custom_collate_fn)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=custom_collate_fn)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=custom_collate_fn)

# Model and optimizer
model = MultimodalClassifier(num_classes=1).to(device)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE, weight_decay=0.01)  # Add weight decay

# Early stopping mechanism
best_val_loss = float('inf')
patience_counter = 0

# Training loop
for epoch in range(EPOCHS):
    print(f"\n--- Epoch {epoch+1}/{EPOCHS} ---")

    # Training phase
    model.train()
    progress_bar = tqdm(train_loader, desc="Training")
    epoch_train_loss = 0

    for batch_idx, batch in enumerate(progress_bar):
        images = batch['image']
        texts = batch['text']
        labels = batch['label'].to(device)

        optimizer.zero_grad()
        outputs = model(images, texts)
        loss = criterion(outputs.squeeze(1), labels)
        loss.backward()
        optimizer.step()

        epoch_train_loss += loss.item()
        progress_bar.set_postfix({
            'loss': loss.item(),
            'avg_loss': epoch_train_loss / (batch_idx + 1)
        })

    # Validation phase
    model.eval()
    val_progress = tqdm(val_loader, desc="Validation")
    epoch_val_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():
        for batch in val_progress:
            images = batch['image']
            texts = batch['text']
            labels = batch['label'].to(device)

            outputs = model(images, texts)
            loss = criterion(outputs.squeeze(1), labels)
            epoch_val_loss += loss.item()

            probs = torch.sigmoid(outputs.squeeze(1))
            preds = (probs > 0.5).float()
            correct += (preds == labels).sum().item()
            total += labels.size(0)

    avg_train_loss = epoch_train_loss / len(train_loader)
    avg_val_loss = epoch_val_loss / len(val_loader)
    val_accuracy = correct / total

    print(f"Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}, Val Acc: {val_accuracy:.4f}")

    # Early stopping check
    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        patience_counter = 0
        torch.save(model.state_dict(), 'best_model.pth')
        print("✅ Saved best model")
    else:
        patience_counter += 1
        print(f"⚠️ Validation loss did not improve ({patience_counter}/{PATIENCE})")

        if patience_counter >= PATIENCE:
            print("🛑 Early stopping triggered, stopping training")
            break

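# Load best model for testing
print("\n--- Testing Phase ---")
# map_location keeps the load working if the checkpoint was saved on another device
model.load_state_dict(torch.load('best_model.pth', map_location=device))
model.eval()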

test_correct = 0
test_total = 0

with torch.no_grad():
    for batch in tqdm(test_loader, desc="Testing"):
        images = batch['image']
        texts = batch['text']
        labels = batch['label'].to(device)

        outputs = model(images, texts)
        probs = torch.sigmoid(outputs.squeeze(1))
        preds = (probs > 0.5).float()

        test_correct += (preds == labels).sum().item()
        test_total += labels.size(0)

test_accuracy = test_correct / test_total
print(f"\n🎯 Final test accuracy: {test_accuracy:.4f}")

Using device: cuda
Train set: 3235, Validation set: 404, Test set: 405

--- Epoch 1/10 ---


Training: 100%|██████████| 203/203 [01:20<00:00,  2.53it/s, loss=0.802, avg_loss=0.678]
Validation: 100%|██████████| 26/26 [00:09<00:00,  2.75it/s]


Train Loss: 0.6780, Val Loss: 0.6430, Val Acc: 0.6312
✅ Saved best model

--- Epoch 2/10 ---


Training: 100%|██████████| 203/203 [01:20<00:00,  2.53it/s, loss=0.816, avg_loss=0.632]
Validation: 100%|██████████| 26/26 [00:09<00:00,  2.64it/s]


Train Loss: 0.6320, Val Loss: 0.6322, Val Acc: 0.6510
✅ Saved best model

--- Epoch 3/10 ---


Training: 100%|██████████| 203/203 [01:20<00:00,  2.53it/s, loss=0.982, avg_loss=0.603]
Validation: 100%|██████████| 26/26 [00:09<00:00,  2.64it/s]


Train Loss: 0.6032, Val Loss: 0.6093, Val Acc: 0.6658
✅ Saved best model

--- Epoch 4/10 ---


Training: 100%|██████████| 203/203 [01:19<00:00,  2.54it/s, loss=1.09, avg_loss=0.585]
Validation: 100%|██████████| 26/26 [00:08<00:00,  2.95it/s]


Train Loss: 0.5853, Val Loss: 0.6238, Val Acc: 0.6460
⚠️ Validation loss did not improve (1/3)

--- Epoch 5/10 ---


Training: 100%|██████████| 203/203 [01:19<00:00,  2.54it/s, loss=0.646, avg_loss=0.557]
Validation: 100%|██████████| 26/26 [00:09<00:00,  2.65it/s]


Train Loss: 0.5572, Val Loss: 0.6132, Val Acc: 0.6609
⚠️ Validation loss did not improve (2/3)

--- Epoch 6/10 ---


Training: 100%|██████████| 203/203 [01:19<00:00,  2.56it/s, loss=1.05, avg_loss=0.536]
Validation: 100%|██████████| 26/26 [00:09<00:00,  2.62it/s]


Train Loss: 0.5359, Val Loss: 0.6122, Val Acc: 0.6708
⚠️ Validation loss did not improve (3/3)
🛑 Early stopping triggered, stopping training

--- Testing Phase ---


Testing: 100%|██████████| 26/26 [00:11<00:00,  2.30it/s]


🎯 Final test accuracy: 0.6617



