# Impact of Dataset Size and Distillation Techniques on Image Captioning Performance: An Empirical Study

Authors: Srushti Sangawar and Arunava Ghosh

Course CSCI 5922, University of Colorado Boulder

In this project, we explore the optimization of image captioning models by combining dataset distillation with pre-trained models. This notebook focuses on ResNet-50, which serves as our baseline. We aim to reduce the computational burden and improve model performance, particularly in resource-constrained environments. Our approach involves creating distilled datasets of different sizes (25%, 50%, 75%, and 100%) using gradient-based distillation and random selection methods. We then fine-tune the ResNet-50 model to generate captions through an LSTM network. Performance is evaluated using metrics such as BLEU and CIDEr scores, as well as training time. The results will help us understand the trade-off between dataset size, distillation techniques, and training efficiency.

# Environment Setup, Device Configuration

This cell checks whether CUDA (GPU support) is available on the system, which verifies that the model can use GPU acceleration for training and inference. The device is set to cuda if available and otherwise defaults to CPU.

Required libraries are also installed here as needed. Later cells install additional libraries where they are first required.

In [None]:

import torch
print("CUDA Available:", torch.cuda.is_available())
print("Device:", torch.device("cuda" if torch.cuda.is_available() else "cpu"))
if torch.cuda.is_available():
    print("GPU Name:", torch.cuda.get_device_name(0))

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
!pip install torch

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

import torch
torch.cuda.is_available()

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("CUDA Available:", torch.cuda.is_available())
if torch.cuda.is_available():  # get_device_name raises an error on CPU-only machines
    print("GPU Device:", torch.cuda.get_device_name(0))

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
! python --version

# Create Directory for Dataset and Models
Sets up the folders used to organize the dataset and model files. The Google Colab and Windows-specific lines are commented out because they do not apply to the current environment.


In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# !mkdir -p /content/drive/MyDrive/deepreel/data
# !mkdir -p /content/drive/MyDrive/deepreel/models
# codes/DeepReel_Making_Images_Talk (1).ipynb

# import os

# os.makedirs(r"C:\Users\agnibdeepreel\data", exist_ok=True)
# os.makedirs(r"C:\Users\agnib\deepreel\models", exist_ok=True)

import os

os.makedirs(r"deepreel/data", exist_ok=True)
os.makedirs(r"deepreel/models", exist_ok=True)

# Download and Extract Dataset
Downloads the COCO 2017 validation images and annotations into the specified folder if they are not already present. This is the dataset our study focuses on; the validation split contains 5,000 images. The archives are then extracted into the required folder.
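The "download only if not already present" behavior described above could be factored into a small helper. The sketch below is one possible shape, not the notebook's own code: `download_if_missing`, its `fetch` parameter, and `fake_fetch` are hypothetical names introduced here so the logic can be exercised offline.

```python
import os
import tempfile
import urllib.request


def download_if_missing(url, dest_path, fetch=urllib.request.urlretrieve):
    """Download url to dest_path unless the file already exists.

    Returns True if a download happened, False if it was skipped.
    `fetch` is injectable so the helper can be tried without network access.
    """
    if os.path.exists(dest_path):
        return False
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    fetch(url, dest_path)
    return True


# Offline demonstration with a stand-in fetcher that just writes a small file.
def fake_fetch(url, dest_path):
    with open(dest_path, "wb") as f:
        f.write(b"placeholder")


with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "data", "archive.zip")
    first = download_if_missing("http://example.invalid/archive.zip", target, fetch=fake_fetch)
    second = download_if_missing("http://example.invalid/archive.zip", target, fetch=fake_fetch)

print(first, second)  # True False
```

With the default `fetch`, a call such as `download_if_missing(val_url, val_zip_path)` would only hit the network the first time the cell is run.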

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

import urllib.request
import zipfile
import os


data_dir = r"deepreel/data"
os.makedirs(data_dir, exist_ok=True)


val_url = "http://images.cocodataset.org/zips/val2017.zip"
val_zip_path = os.path.join(data_dir, "val2017.zip")
if not os.path.exists(val_zip_path):  # skip the download when the archive is already present
    urllib.request.urlretrieve(val_url, val_zip_path)


with zipfile.ZipFile(val_zip_path, 'r') as zip_ref:
    zip_ref.extractall(data_dir)


ann_url = "http://images.cocodataset.org/annotations/annotations_trainval2017.zip"
ann_zip_path = os.path.join(data_dir, "annotations_trainval2017.zip")
if not os.path.exists(ann_zip_path):  # skip the download when the archive is already present
    urllib.request.urlretrieve(ann_url, ann_zip_path)


with zipfile.ZipFile(ann_zip_path, 'r') as zip_ref:
    zip_ref.extractall(data_dir)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

import os
import zipfile


val_zip = r"deepreel/data/val2017.zip"
ann_zip = r"deepreel/data/annotations_trainval2017.zip"


coco_path = r"coco"
os.makedirs(coco_path, exist_ok=True)


val_extract_path = os.path.join(coco_path, "val2017")
if not os.path.exists(val_extract_path):
    with zipfile.ZipFile(val_zip, 'r') as zip_ref:
        zip_ref.extractall(coco_path)
    print("val2017 extracted.")
else:
    print("val2017 already extracted.")


ann_extract_path = os.path.join(coco_path, "annotations")
if not os.path.exists(ann_extract_path):
    with zipfile.ZipFile(ann_zip, 'r') as zip_ref:
        zip_ref.extractall(coco_path)
    print("annotations extracted.")
else:
    print("annotations already extracted.")


In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
!pip install -q pycocotools

Now let’s read the captions_val2017.json and understand its structure.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
import json

annotations_path = r"coco/annotations/captions_val2017.json"


with open(annotations_path, 'r') as f:
    captions_data = json.load(f)


print(captions_data.keys())

Create a mapping from image_id → captions. Each image ID has 5 captions (occasionally more).

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
from collections import defaultdict

# Map image_id to list of captions
image_id_to_captions = defaultdict(list)

for ann in captions_data['annotations']:
    image_id_to_captions[ann['image_id']].append(ann['caption'])


example_id = captions_data['annotations'][0]['image_id']
print(f"Image ID: {example_id}")
print("Captions:", image_id_to_captions[example_id])
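To check how many captions each image actually has, the same grouping can be followed by a small histogram. The sketch below uses made-up toy annotations in the same shape as `captions_data['annotations']`; the names `toy_annotations`, `captions_by_image`, and `caption_count_hist` are illustrative only.

```python
from collections import Counter, defaultdict

# Toy annotations in the same shape as captions_data['annotations'] (made-up values).
toy_annotations = [
    {"image_id": 1, "caption": "a cat on a sofa"},
    {"image_id": 1, "caption": "a cat sleeping"},
    {"image_id": 2, "caption": "a dog in a park"},
]

# Group captions by image, as in the cell above.
captions_by_image = defaultdict(list)
for ann in toy_annotations:
    captions_by_image[ann["image_id"]].append(ann["caption"])

# Histogram: number of captions -> number of images with that many captions.
caption_count_hist = Counter(len(caps) for caps in captions_by_image.values())
print(dict(caption_count_hist))
```

Running the same two steps on `image_id_to_captions` would show whether every image has the same number of captions, which matters when batching captions later.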

Preprocess images for ResNet-50

ResNet-50 expects images to be:

- Size: 224x224
- Normalized with ImageNet stats

We'll use torchvision for this.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
import os
from PIL import Image
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms


transform = transforms.Compose([
    transforms.Resize((224, 224)),  
    transforms.ToTensor(), 
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet mean
                         std=[0.229, 0.224, 0.225])   # ImageNet std
])
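The normalization step is plain per-channel arithmetic, `(x - mean) / std`, applied after pixel values have been scaled to [0, 1]. As a minimal sketch in pure Python (the helper names `normalize_pixel` and `denormalize_pixel` are hypothetical, introduced only for illustration), here is the computation for a single RGB pixel and its inverse:

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Apply (x - mean) / std per channel to one RGB pixel scaled to [0, 1]."""
    return [(x - m) / s for x, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)]


def denormalize_pixel(rgb_norm):
    """Invert the normalization: x = x_norm * std + mean."""
    return [x * s + m for x, m, s in zip(rgb_norm, IMAGENET_MEAN, IMAGENET_STD)]


pixel = [0.5, 0.5, 0.5]  # mid-gray after scaling to [0, 1]
normalized = normalize_pixel(pixel)
restored = denormalize_pixel(normalized)
print(normalized)
print(restored)
```

The inverse is what is needed to display a normalized tensor as an image again.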

# Creating Custom Dataset

Create a custom dataset for val2017 images and captions

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
class CocoDataset(Dataset):
    def __init__(self, image_dir, captions_dict, transform=None):
        self.image_dir = image_dir
        self.captions_dict = captions_dict
        self.transform = transform
        self.image_ids = list(captions_dict.keys())

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, idx):
        image_id = self.image_ids[idx]
        image_path = os.path.join(self.image_dir, f'{image_id:012}.jpg')  # zero-padded filenames

        image = Image.open(image_path).convert('RGB')
        if self.transform:
            image = self.transform(image)

        captions = self.captions_dict[image_id]
        return image, captions

In [None]:

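def collate_keep_captions(batch):
    # Stack images into one tensor; keep captions as one list per image,
    # since the number of captions can differ between images.
    images = torch.stack([image for image, _ in batch])
    captions = [caps for _, caps in batch]
    return images, captions

dataset = CocoDataset('coco/val2017/', image_id_to_captions, transform)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate_keep_captions)

images, captions = next(iter(dataloader))
print("Batch image shape:", images.shape)
print("Captions for first image:\n", captions[0])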


In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# from torchvision.transforms.functional import to_pil_image



# image = images[0]


# unnorm = lambda t: t * torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1) + torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
# image = unnorm(image)
# image = torch.clamp(image, 0, 1)


# to_pil_image(image).show()


Extract features from images using ResNet-50

We will:

- Use a pre-trained ResNet-50 model
- Remove its last classification layer
- Get 2048-dimensional feature vectors for each image



In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
import torchvision.models as models
import torch.nn as nn


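# `weights=` replaces the deprecated `pretrained=True` flag (torchvision >= 0.13)
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)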


modules = list(resnet.children())[:-1]  # remove last fc layer
resnet = nn.Sequential(*modules)

# Freeze the weights
for param in resnet.parameters():
    param.requires_grad = False

resnet.eval()  # inference mode: batch norm uses its running statistics
resnet = resnet.to(device)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
with torch.no_grad():
    # Run the batch through the frozen backbone on the selected device
    features = resnet(images.to(device))
    features = features.view(features.size(0), -1)  # flatten from (B, 2048, 1, 1) to (B, 2048)

print("Image features shape:", features.shape)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

import json
from collections import defaultdict


with open(r'deepreel/data/annotations/captions_val2017.json', 'r') as f:
    captions_data = json.load(f)

# Create a mapping from image_id to list of caption
image_id_to_captions = defaultdict(list)
for annot in captions_data['annotations']:
    img_id = annot['image_id']
    caption = annot['caption']
    image_id_to_captions[img_id].append(caption)


sample_img_id = list(image_id_to_captions.keys())[0]
print(f"Sample image ID: {sample_img_id}")
print("Captions:", image_id_to_captions[sample_img_id])

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
%pip install tensorflow

# Caption Tokenization and Image Feature Extraction using ResNet-50

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
import tensorflow
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import re


all_captions = []
for caps in image_id_to_captions.values():
    for c in caps:
        # Add start and end tokens
        cleaned = '<start> ' + c.lower().strip() + ' <end>'
        all_captions.append(cleaned)


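# Keep '<' and '>' out of the filter list so '<start>' and '<end>' survive as tokens
# (the default Keras filters strip the angle brackets, leaving 'start' and 'end').
tokenizer = Tokenizer(oov_token="<unk>", filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n')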
tokenizer.fit_on_texts(all_captions)

# Convert captions to sequences of integer
caption_seqs = tokenizer.texts_to_sequences(all_captions)

# Add padding so all captions are same length
# max_caption_len = max(len(seq) for seq in caption_seqs)
max_caption_len = 32
caption_seqs_padded = pad_sequences(caption_seqs, maxlen=max_caption_len, padding='post')


vocab_size = len(tokenizer.word_index) + 1  # +1 for padding

print(f"Total captions: {len(all_captions)}")
print(f"Max caption length: {max_caption_len}")
print(f"Vocabulary size: {vocab_size}")
print("Sample padded sequence:", caption_seqs_padded[0])
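To make the tokenize-then-pad step concrete, here is a simplified pure-Python stand-in. It is not the Keras implementation: it splits on whitespace only, and the function names `build_word_index` and `encode_and_pad` are hypothetical. It follows the same conventions as the cell above: id 0 is reserved for padding, id 1 for the out-of-vocabulary token, more frequent words get smaller ids, and sequences are post-padded.

```python
def build_word_index(texts, oov_token="<unk>"):
    """Assign integer ids by descending word frequency; id 0 is reserved for padding."""
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # Most frequent first; sorted() is stable, so ties keep first-seen order.
    ordered = sorted(counts, key=lambda w: -counts[w])
    word_index = {oov_token: 1}
    for word in ordered:
        word_index[word] = len(word_index) + 1
    return word_index


def encode_and_pad(text, word_index, maxlen, oov_token="<unk>"):
    """Map words to ids (unknown words -> OOV id), then post-pad with 0 up to maxlen."""
    ids = [word_index.get(w, word_index[oov_token]) for w in text.lower().split()]
    if len(ids) > maxlen:
        ids = ids[len(ids) - maxlen:]  # keep the last maxlen tokens
    return ids + [0] * (maxlen - len(ids))


toy_texts = ["<start> a cat <end>", "<start> a dog <end>"]
toy_index = build_word_index(toy_texts)
encoded = encode_and_pad("<start> a bird <end>", toy_index, maxlen=6)
print(toy_index)
print(encoded)  # [2, 3, 1, 4, 0, 0]
```

The unseen word "bird" maps to the OOV id 1, and the two trailing zeros are padding, which is why `vocab_size` adds 1 to the size of the word index.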

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
%pip install tqdm

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image
import numpy as np
from tqdm import tqdm
import os


resnet_model = ResNet50(weights='imagenet', include_top=False, pooling='avg')  

image_dir = './coco/val2017'


image_ids = list(image_id_to_captions.keys())
image_paths = [os.path.join(image_dir, f"{image_id:012}.jpg") for image_id in image_ids]


def load_and_preprocess_img(img_path):
    img = image.load_img(img_path, target_size=(224, 224))
    img_array = image.img_to_array(img)
    img_array = np.expand_dims(img_array, axis=0)
    return preprocess_input(img_array)


features_dict = {}
for img_id, img_path in tqdm(zip(image_ids, image_paths), total=len(image_ids)):
    try:
        img_array = load_and_preprocess_img(img_path)
        features = resnet_model.predict(img_array, verbose=0)
        features_dict[img_id] = features.squeeze()
    except FileNotFoundError:
        print(f"Image not found: {img_path}")


sample_id = image_ids[0]
print(f"Feature shape for image {sample_id}: {features_dict[sample_id].shape}")


In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
import pickle


with open('features_dict.pkl', 'wb') as f:
    pickle.dump(features_dict, f)

print("Features dictionary saved successfully!")

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

with open('features_dict.pkl', 'rb') as f:
    features_dict = pickle.load(f)

print("Features dictionary loaded successfully!")

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
import json


features_dict_json = {key: value.tolist() for key, value in features_dict.items()}


with open('features_dict.json', 'w') as f:
    json.dump(features_dict_json, f)

print("Features dictionary saved to JSON successfully!")

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
import json
import numpy as np


with open('features_dict.json', 'r') as f:
    features_dict_json = json.load(f)


# JSON object keys are always strings, so convert them back to integer image IDs
features_dict = {int(key): np.array(value) for key, value in features_dict_json.items()}

print("Features dictionary loaded from JSON successfully!")
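One detail of the JSON round trip is worth seeing in isolation: JSON object keys are always strings, so integer image IDs come back as strings unless they are converted. A minimal sketch with a toy dictionary (the values are made up, and `features`, `restored_raw`, and `restored_int` are illustrative names):

```python
import json

features = {139: [0.1, 0.2], 285: [0.3, 0.4]}  # toy stand-in for features_dict

# Round trip through JSON text: integer keys are serialized as strings.
restored_raw = json.loads(json.dumps(features))
print(list(restored_raw))  # ['139', '285']

# Convert keys back so lookups by integer image ID work again.
restored_int = {int(k): v for k, v in restored_raw.items()}
print(139 in restored_raw, 139 in restored_int)  # False True
```

Pickle does not have this issue, since it preserves the key types as they were saved.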

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
import pickle
import os
from tqdm import tqdm


if os.path.exists('features_dict.pkl'):
    with open('features_dict.pkl', 'rb') as f:
        features_dict = pickle.load(f)
    print("Loaded existing features dictionary.")
else:
    features_dict = {}


for img_id, img_path in tqdm(zip(image_ids, image_paths), total=len(image_ids)):
    if img_id not in features_dict:
        try:
            img_array = load_and_preprocess_img(img_path)
            features = resnet_model.predict(img_array, verbose=0)
            features_dict[img_id] = features.squeeze()
        except FileNotFoundError:
            print(f"Image not found: {img_path}")


with open('features_dict.pkl', 'wb') as f:
    pickle.dump(features_dict, f)

# MODEL TRAINING

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Layer, Input, Dense, Embedding, LSTM, Bidirectional, Dropout, Add, Activation, Concatenate, add
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Lambda


# max_length = max(len(c.split()) for c in all_captions)
caption_lengths = [len(c.split()) for c in all_captions]
max_length = int(np.percentile(caption_lengths, 90))

vocab_size = len(tokenizer.word_index) + 1


X1, X2, y = [], [], []

for img_id, caption_list in image_id_to_captions.items():
    feature = features_dict.get(img_id)
    if feature is None:
        continue
    for caption in caption_list:
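        # Wrap with the same <start>/<end> markers used when fitting the tokenizer,
        # so the model learns where captions begin and end
        seq = tokenizer.texts_to_sequences(['<start> ' + caption.lower().strip() + ' <end>'])[0]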
        for i in range(1, len(seq)):
            in_seq, out_word = seq[:i], seq[i]
            
            in_seq = pad_sequences([in_seq], maxlen=max_length, padding='post')[0]
            X1.append(feature)
            X2.append(in_seq)
            y.append(out_word)


X1 = np.array(X1, dtype='float32')
X2 = np.array(X2)
y = np.array(y)

print(f"Image features shape: {X1.shape}")
print(f"Input sequence shape: {X2.shape}")
print(f"Target word shape: {y.shape}")


BATCH_SIZE = 64
dataset = tf.data.Dataset.from_tensor_slices(((X1, X2), y))
dataset = dataset.shuffle(buffer_size=1024).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)


class BahdanauAttention(Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = Dense(units)
        self.W2 = Dense(units)
        self.V = Dense(1)

    def call(self, features, hidden):
        
        hidden_with_time_axis = tf.expand_dims(hidden, 1)

        
        score = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis)))

        
        attention_weights = tf.nn.softmax(score, axis=1)

        
        context_vector = attention_weights * features
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights


inputs1 = Input(shape=(2048,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)


inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = Bidirectional(LSTM(256, return_sequences=True))(se2)


# fe2_expanded = tf.expand_dims(fe2, 1)  
fe2_expanded = Lambda(lambda x: tf.expand_dims(x, 1))(fe2)
attention = BahdanauAttention(256)
context_vector, attention_weights = attention(se3, fe2)


decoder_input = Concatenate()([context_vector, fe2])
decoder2 = Dense(256, activation='relu')(decoder_input)
outputs = Dense(vocab_size, activation='softmax')(decoder2)


model = Model(inputs=[inputs1, inputs2], outputs=outputs)
print("TensorFlow using GPU(s):", tf.config.list_physical_devices('GPU'))
# model = model.to(device)
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(1e-4), metrics=['sparse_categorical_accuracy'])
model.summary()


EPOCHS = 20
model.fit(dataset, epochs=EPOCHS)
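The nested loop in the training cell above turns each caption into several (prefix, next word) examples. The following pure-Python sketch shows that expansion on one toy sequence. The function name `make_training_pairs` and the token ids are hypothetical, and the truncation line assumes the most recent tokens are kept when a prefix exceeds `max_length`.

```python
def make_training_pairs(seq, max_length):
    """Expand one encoded caption into (padded prefix, next token) pairs."""
    pairs = []
    for i in range(1, len(seq)):
        prefix, target = seq[:i], seq[i]
        prefix = prefix[-max_length:]  # keep the most recent tokens if too long
        padded = prefix + [0] * (max_length - len(prefix))  # post-padding with 0
        pairs.append((padded, target))
    return pairs


toy_seq = [2, 3, 5, 4]  # e.g. <start> a cat <end> under a hypothetical vocabulary
pairs = make_training_pairs(toy_seq, max_length=5)
for padded, target in pairs:
    print(padded, "->", target)
```

A caption of n tokens yields n - 1 examples, each paired with the same image feature vector, which is why `X1` grows much larger than the number of images.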

# GENERATE CAPTIONS

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
def generate_caption(model, image_feature, tokenizer, max_len):
    in_text = '<start>'

    for _ in range(max_len):
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = tf.keras.preprocessing.sequence.pad_sequences([sequence], maxlen=max_len, padding='post')
        yhat = model.predict([np.expand_dims(image_feature, axis=0), sequence], verbose=0)
        yhat_idx = np.argmax(yhat[0])  # choose word with highest probability
        word = tokenizer.index_word.get(yhat_idx, None)
        if word is None or word == '<end>':
            break
        in_text += ' ' + word

    return in_text.replace('<start>', '').strip()
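The function above performs greedy decoding: at each step it takes the single most probable next word and stops at the end marker or after `max_len` steps. The sketch below isolates that control flow. It replaces `model.predict` with a hypothetical lookup table (`TOY_TABLE`) whose probabilities are made up and depend only on the last word.

```python
def greedy_decode(next_word_probs, start="<start>", end="<end>", max_len=10):
    """Repeatedly pick the most probable next word until `end` or max_len words."""
    words = [start]
    for _ in range(max_len):
        probs = next_word_probs(words)
        best = max(probs, key=probs.get)  # argmax over the candidate words
        if best == end:
            break
        words.append(best)
    return " ".join(words[1:])  # drop the start marker


# Hypothetical stand-in for model.predict: probabilities depend only on the last word.
TOY_TABLE = {
    "<start>": {"a": 0.9, "cat": 0.1},
    "a": {"cat": 0.6, "dog": 0.4},
    "cat": {"<end>": 0.7, "a": 0.3},
}

caption = greedy_decode(lambda words: TOY_TABLE[words[-1]])
print(caption)  # a cat
```

Greedy decoding is simple and fast, but it commits to each word without looking ahead; beam search is a common alternative when caption quality matters more than speed.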

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

sample_id = image_ids[10] 
sample_feature = features_dict[sample_id]
# max_caption_len = 15
max_caption_len = max_length

caption = generate_caption(model, sample_feature, tokenizer, max_caption_len)
print(f"Generated caption:\n{caption}")

# DATA DISTILLATION

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
import random
def random_selection(image_ids, percentage):
    
    num_images_to_select = int(len(image_ids) * percentage / 100)
    selected_image_ids = random.sample(image_ids, num_images_to_select)
    selected_image_paths = [os.path.join(image_dir, f"{image_id:012}.jpg") for image_id in selected_image_ids]

    return selected_image_ids, selected_image_paths
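Because `random.sample` draws from the global random state, the subsets above change between runs. If reproducible subsets are wanted, one option is a seeded local generator. This is a sketch, and the `random_subset` name and `seed` parameter are assumptions, not part of the notebook:

```python
import random


def random_subset(ids, percentage, seed=0):
    """Sample int(len(ids) * percentage / 100) ids without replacement, reproducibly."""
    rng = random.Random(seed)  # local generator; leaves the global random state alone
    k = int(len(ids) * percentage / 100)
    return rng.sample(list(ids), k)


toy_ids = list(range(100))
subset_a = random_subset(toy_ids, 25, seed=42)
subset_b = random_subset(toy_ids, 25, seed=42)
print(len(subset_a), subset_a == subset_b)  # 25 True
```

Fixing the seed makes the 25%, 50%, and 75% subsets comparable across reruns of the evaluation cells.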

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
%pip install scipy scikit-learn

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

import numpy as np
from sklearn.metrics import pairwise_distances

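def gradient_based_selection(image_ids, features_dict, percentage):
    # Note: despite the name, no gradients are computed here. As a stand-in, images are ranked
    # by the sum of Euclidean distances from their ResNet-50 features to all other images,
    # and the images farthest from the rest are kept.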
    feature_vectors = np.array([features_dict[img_id] for img_id in image_ids])
    distance_matrix = pairwise_distances(feature_vectors, metric='euclidean')
    distance_sums = np.sum(distance_matrix, axis=1)
    
    num_select = int(len(image_ids) * percentage / 100)
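    # Largest distance sums first; slicing this way also returns an empty selection when num_select == 0
    selected_indices = np.argsort(distance_sums)[::-1][:num_select]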
    selected_image_ids = [image_ids[i] for i in selected_indices]
    selected_image_paths = [os.path.join(image_dir, f"{image_id:012}.jpg") for image_id in selected_image_ids]
    
    return selected_image_ids, selected_image_paths
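The ranking idea used by the selection function above can be seen on a handful of 2-D points. This pure-Python sketch (the name `select_by_distance_sum` and the toy points are made up) sums each point's Euclidean distance to all others and keeps the top fraction:

```python
import math


def select_by_distance_sum(points, percentage):
    """Return indices of the points with the largest summed Euclidean distance to all others."""
    sums = [sum(math.dist(p, q) for q in points) for p in points]
    k = int(len(points) * percentage / 100)
    ranked = sorted(range(len(points)), key=lambda i: sums[i], reverse=True)
    return ranked[:k]


# Three clustered points and one outlier (toy 2-D "features").
toy_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
selected = select_by_distance_sum(toy_points, 25)
print(selected)  # [3]
```

The outlier is picked first, which illustrates the bias of this criterion: it favors unusual images over typical ones, so small subsets may under-represent the dense parts of the feature space.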


In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
from PIL import Image
from torchvision import transforms

def preprocess_image(image_path):
    
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])
    image = Image.open(image_path).convert('RGB')
    return transform(image)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
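def prepare_target_caption(image_id, tokenizer, max_len=30, ground_truths=None):
    # Note: this helper expects a Hugging Face-style tokenizer exposing `encode(...)`;
    # the Keras Tokenizer fitted above has no such method.
    if ground_truths is None:
        raise ValueError("ground_truths (image_id -> list of captions) is required")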
    caption = ground_truths[image_id][0]
    tokens = tokenizer.encode(caption, return_tensors='pt', padding='max_length',
                              max_length=max_len, truncation=True)
    return tokens

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


from tqdm import tqdm
import numpy as np

def generate_distilled_captions(image_ids, image_paths, model, tokenizer, max_caption_len, batch_size=32):
    
    distilled_captions = {}
    
    for i in tqdm(range(0, len(image_paths), batch_size)):
        batch_paths = image_paths[i:i+batch_size]
        batch_ids = image_ids[i:i+batch_size]
        batch_images = np.vstack([load_and_preprocess_img(path) for path in batch_paths])
        batch_features = resnet_model.predict(batch_images, verbose=0)
        
        for j, feature in enumerate(batch_features):
            caption = generate_caption(model, feature.squeeze(), tokenizer, max_caption_len)
            distilled_captions[batch_ids[j]] = caption

    return distilled_captions


In [None]:
!pip install -q pycocoevalcap

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

import time

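def measure_training_time(dataset_image_ids, dataset_image_paths, model, tokenizer, max_caption_len):
    # Note: this times feature extraction plus caption generation over the subset with the
    # already-trained model; no retraining happens inside this function.
    start_time = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    generate_distilled_captions(dataset_image_ids, dataset_image_paths, model, tokenizer, max_caption_len)
    return time.perf_counter() - start_time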
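A small generic wrapper can time any call and hand back its result as well, so the work does not have to be repeated just to measure it. This is a sketch, and `time_call` is a hypothetical helper, not part of the notebook:

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


result, elapsed = time_call(sum, range(1000))
print(result, f"{elapsed:.6f}s")
```

In this notebook, `generate_distilled_captions` runs once to collect captions and again inside the timing function; a wrapper like this could return both the captions and the elapsed time from a single run.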


In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Distill dataset using random selection
random_25_ids, random_25_paths = random_selection(image_ids, 25)
random_50_ids, random_50_paths = random_selection(image_ids, 50)
random_75_ids, random_75_paths = random_selection(image_ids, 75)
random_100_ids, random_100_paths = random_selection(image_ids, 100)

# Distill dataset using gradient-based selection (with mock method)
gradient_25_ids, gradient_25_paths = gradient_based_selection(image_ids, features_dict, 25)
gradient_50_ids, gradient_50_paths = gradient_based_selection(image_ids, features_dict, 50)
gradient_75_ids, gradient_75_paths = gradient_based_selection(image_ids, features_dict, 75)
gradient_100_ids, gradient_100_paths = gradient_based_selection(image_ids, features_dict, 100)

# Generate captions for each distillation size
captions_random_25 = generate_distilled_captions(random_25_ids, random_25_paths, model, tokenizer, max_caption_len)
captions_random_50 = generate_distilled_captions(random_50_ids, random_50_paths, model, tokenizer, max_caption_len)
captions_random_75 = generate_distilled_captions(random_75_ids, random_75_paths, model, tokenizer, max_caption_len)
captions_random_100 = generate_distilled_captions(random_100_ids, random_100_paths, model, tokenizer, max_caption_len)

captions_gradient_25 = generate_distilled_captions(gradient_25_ids, gradient_25_paths, model, tokenizer, max_caption_len)
captions_gradient_50 = generate_distilled_captions(gradient_50_ids, gradient_50_paths, model, tokenizer, max_caption_len)
captions_gradient_75 = generate_distilled_captions(gradient_75_ids, gradient_75_paths, model, tokenizer, max_caption_len)
captions_gradient_100 = generate_distilled_captions(gradient_100_ids, gradient_100_paths, model, tokenizer, max_caption_len)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
time_random_25 = measure_training_time(random_25_ids, random_25_paths, model, tokenizer, max_caption_len)
time_random_50 = measure_training_time(random_50_ids, random_50_paths, model, tokenizer, max_caption_len)
time_random_75 = measure_training_time(random_75_ids, random_75_paths, model, tokenizer, max_caption_len)
time_random_100 = measure_training_time(random_100_ids, random_100_paths, model, tokenizer, max_caption_len)

time_gradient_25 = measure_training_time(gradient_25_ids, gradient_25_paths, model, tokenizer, max_caption_len)
time_gradient_50 = measure_training_time(gradient_50_ids, gradient_50_paths, model, tokenizer, max_caption_len)
time_gradient_75 = measure_training_time(gradient_75_ids, gradient_75_paths, model, tokenizer, max_caption_len)
time_gradient_100 = measure_training_time(gradient_100_ids, gradient_100_paths, model, tokenizer, max_caption_len)

# Print elapsed times (caption generation on each subset)
print(f"Random 25% Time: {time_random_25}s")
print(f"Random 50% Time: {time_random_50}s")
print(f"Random 75% Time: {time_random_75}s")
print(f"Random 100% Time: {time_random_100}s")

print(f"Gradient 25% Time: {time_gradient_25}s")
print(f"Gradient 50% Time: {time_gradient_50}s")
print(f"Gradient 75% Time: {time_gradient_75}s")
print(f"Gradient 100% Time: {time_gradient_100}s")

In [None]:
from pycocotools.coco import COCO

annotation_file = r"deepreel/data/annotations/captions_val2017.json"
coco = COCO(annotation_file)

ground_truth_captions = {}

for img_id in coco.getImgIds():
    ann_ids = coco.getAnnIds(imgIds=img_id)
    anns = coco.loadAnns(ann_ids)
    ground_truth_captions[img_id] = [ann['caption'] for ann in anns]

print(f"Total images with ground truth captions: {len(ground_truth_captions)}")
print("Sample:\n", list(ground_truth_captions.items())[:1])


# EVALUATION OF CAPTIONS BY BLEU AND CIDEr Scores

In [None]:

from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

annotation_file = "deepreel/data/annotations/captions_val2017.json"
coco_gt = COCO(annotation_file)


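def evaluate_captions(coco_gt, generated_captions, exclude_metrics=None):
    # exclude_metrics: optional list of metric names (e.g. ["SPICE"]) dropped from the returned
    # scores; the filtering happens after coco_eval.evaluate() has run.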
    valid_img_ids = set(coco_gt.getImgIds())

    filtered_generated = {
        img_id: caption for img_id, caption in generated_captions.items()
        if img_id in valid_img_ids
    }

    if not filtered_generated:
        raise ValueError("No valid image IDs found in generated captions.")

    results = [{"image_id": img_id, "caption": filtered_generated[img_id]}
               for img_id in sorted(filtered_generated.keys())]

    coco_res = coco_gt.loadRes(results)
    coco_eval = COCOEvalCap(coco_gt, coco_res)
    coco_eval.params['image_id'] = list(filtered_generated.keys())
    coco_eval.evaluate()

    scores = coco_eval.eval
    if exclude_metrics:
        scores = {k: v for k, v in scores.items() if k not in exclude_metrics}

    return scores
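For intuition about what BLEU measures, its core ingredient is a clipped (modified) n-gram precision. The sketch below computes only the unigram case in pure Python. It omits the brevity penalty and the geometric mean over higher-order n-grams, and it is not the `pycocoevalcap` implementation; the function name and example sentences are made up.

```python
from collections import Counter


def clipped_unigram_precision(candidate, references):
    """Fraction of candidate words matched in the references, with per-word clipping."""
    cand_counts = Counter(candidate.lower().split())
    if not cand_counts:
        return 0.0
    # For each word, the maximum count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for word, n in Counter(ref.lower().split()).items():
            max_ref[word] = max(max_ref[word], n)
    # A candidate word is credited at most as often as it appears in one reference.
    matched = sum(min(n, max_ref[w]) for w, n in cand_counts.items())
    return matched / sum(cand_counts.values())


p1 = clipped_unigram_precision("a cat on the the mat", ["a cat sits on the mat"])
print(round(p1, 4))  # 0.8333
```

The repeated "the" is credited only once because the reference contains it once, giving 5 matched words out of 6. Clipping is what prevents a caption from scoring well by repeating common words.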






In [None]:
scores_random_25 = evaluate_captions(coco_gt, captions_random_25, exclude_metrics=["SPICE"])

print("Evaluation Metrics for Random 25%:")
for metric, score in scores_random_25.items():
    print(f"{metric}: {score:.4f}")

In [None]:
scores_random_50 = evaluate_captions(coco_gt, captions_random_50, exclude_metrics=["SPICE"])

print("Evaluation Metrics for Random 50%:")
for metric, score in scores_random_50.items():
    print(f"{metric}: {score:.4f}")

In [None]:
scores_random_75 = evaluate_captions(coco_gt, captions_random_75, exclude_metrics=["SPICE"])

print("Evaluation Metrics for Random 75% (SPICE excluded):")
for metric, score in scores_random_75.items():
    print(f"{metric}: {score:.4f}")


In [None]:
scores_random_100 = evaluate_captions(coco_gt, captions_random_100, exclude_metrics=["SPICE"])

print("Evaluation Metrics for Random 100%:")
for metric, score in scores_random_100.items():
    print(f"{metric}: {score:.4f}")


In [None]:
scores_gradient_25 = evaluate_captions(coco_gt, captions_gradient_25, exclude_metrics=["SPICE"])

print("Evaluation Metrics for Gradient 25%:")
for metric, score in scores_gradient_25.items():
    print(f"{metric}: {score:.4f}")

In [None]:
scores_gradient_50 = evaluate_captions(coco_gt, captions_gradient_50, exclude_metrics=["SPICE"])

print("Evaluation Metrics for Gradient 50%:")
for metric, score in scores_gradient_50.items():
    print(f"{metric}: {score:.4f}")

In [None]:
scores_gradient_75 = evaluate_captions(coco_gt, captions_gradient_75, exclude_metrics=["SPICE"])

print("Evaluation Metrics for Gradient 75%:")
for metric, score in scores_gradient_75.items():
    print(f"{metric}: {score:.4f}")

In [None]:
scores_gradient_100 = evaluate_captions(coco_gt, captions_gradient_100, exclude_metrics=["SPICE"])

print("Evaluation Metrics for Gradient 100%:")
for metric, score in scores_gradient_100.items():
    print(f"{metric}: {score:.4f}")


# MANUAL TESTING OF GENERATED CAPTION

In [None]:
# # Show image
# image_path = ""
# try:
#     img = Image.open(image_path)
#     plt.imshow(img)
#     plt.axis('off')
#     plt.title(caption)
#     plt.show()
# except FileNotFoundError:
#     print(f"Image {image_path} not found.")

In [None]:
def test_random_image_captioning():
    image_ids = ["000000000139", "000000000285", "000000000632"]
    image_ids = [int(img_id) for img_id in image_ids] 

    for img_id in image_ids:
        feature = features_dict.get(img_id)
        if feature is None:
            print(f"No features found for image {img_id}; skipping.")
            continue
        caption = generate_caption(model, feature, tokenizer, max_length)
        print(f"{img_id}: {caption}")

       

In [None]:
test_random_image_captioning()