# Impact of Dataset Size and Distillation Techniques on Image Captioning Performance: An Empirical Study

Authors: Srushti Sangawar and Arunava Ghosh

Course CSCI 5922, University of Colorado Boulder

In this project, we explore the optimization of image captioning models by combining dataset distillation and pre-trained models. This notebook focuses on GIT to give a comparative view against our baseline model. As in the other notebooks for this research, we aim to reduce the computational burden and improve model performance, particularly in resource-constrained environments. Our approach involves creating distilled datasets of different sizes (25%, 50%, 75%, and 100%) using gradient-based distillation and random selection methods. We then fine-tune the GIT model as required. Performance is evaluated using metrics such as BLEU and CIDEr scores, as well as training time. The results will help clarify the trade-off between dataset size, distillation techniques, and training efficiency.

# Environment Setup, Device Configuration

This cell checks whether CUDA (GPU support) is available, which determines whether the model can use GPU acceleration for training and inference. The device is set to `cuda` if available and falls back to the CPU otherwise.

Required libraries are also installed here. A few more are installed later in the notebook as they are needed.

In [None]:

import torch
print("CUDA Available:", torch.cuda.is_available())
print("Device:", torch.device("cuda" if torch.cuda.is_available() else "cpu"))
if torch.cuda.is_available():
    print("GPU Name:", torch.cuda.get_device_name(0))

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
!pip install torch

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Check for GPU
import torch
torch.cuda.is_available()

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("CUDA Available:", torch.cuda.is_available())
# get_device_name raises on CPU-only machines, so guard the call
if torch.cuda.is_available():
    print("GPU Device:", torch.cuda.get_device_name(0))

# Create Directory for Dataset and Models
Sets up the folders used to organize the dataset and model files. The Google Drive and Windows-specific paths from other environments are left commented out.


In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# !mkdir -p /content/drive/MyDrive/deepreel/data
# !mkdir -p /content/drive/MyDrive/deepreel/models
# codes/DeepReel_Making_Images_Talk (1).ipynb

# import os

# os.makedirs(r"C:\Users\agnibdeepreel\data", exist_ok=True)
# os.makedirs(r"C:\Users\agnib\deepreel\models", exist_ok=True)

import os

os.makedirs(r"deepreel/data", exist_ok=True)
os.makedirs(r"deepreel/models", exist_ok=True)

# Download and Extract Dataset
Downloads the COCO 2017 validation images and the caption annotations into the data folder and extracts them. This is the dataset our research focuses on; the validation split contains 5,000 images. The first cell below downloads and extracts unconditionally, while the second extracts into `coco/` only if the target folders do not already exist.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

import urllib.request
import zipfile
import os

# Create destination directory
data_dir = r"deepreel/data"
os.makedirs(data_dir, exist_ok=True)

# Download val2017.zip
val_url = "http://images.cocodataset.org/zips/val2017.zip"
val_zip_path = os.path.join(data_dir, "val2017.zip")
urllib.request.urlretrieve(val_url, val_zip_path)

# Extract val2017.zip
with zipfile.ZipFile(val_zip_path, 'r') as zip_ref:
    zip_ref.extractall(data_dir)

# Download annotations_trainval2017.zip
ann_url = "http://images.cocodataset.org/annotations/annotations_trainval2017.zip"
ann_zip_path = os.path.join(data_dir, "annotations_trainval2017.zip")
urllib.request.urlretrieve(ann_url, ann_zip_path)

# Extract annotations zip
with zipfile.ZipFile(ann_zip_path, 'r') as zip_ref:
    zip_ref.extractall(data_dir)
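
The download cell above re-fetches both archives every time it runs. One way to skip files that are already on disk is sketched here. `download_if_missing` and its `fetch` parameter are hypothetical names introduced for illustration, not part of the notebook.

```python
import os
import urllib.request


def download_if_missing(url, dest_path, fetch=urllib.request.urlretrieve):
    """Download url to dest_path unless the file already exists.

    Returns True if a download was performed, False if it was skipped.
    `fetch` is injectable so the logic can be exercised without network access.
    """
    if os.path.exists(dest_path):
        return False
    # Make sure the parent folder exists before writing the file
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    fetch(url, dest_path)
    return True
```

With this helper, the two `urllib.request.urlretrieve(...)` calls could be replaced by `download_if_missing(val_url, val_zip_path)` and `download_if_missing(ann_url, ann_zip_path)`. Note that an interrupted download would leave a partial file that this check treats as complete.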

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

import os
import zipfile

# Path to ZIP files
val_zip = r"deepreel/data/val2017.zip"
ann_zip = r"deepreel/data/annotations_trainval2017.zip"

# Target extraction path
coco_path = r"coco"
os.makedirs(coco_path, exist_ok=True)

# Extract val2017.zip
val_extract_path = os.path.join(coco_path, "val2017")
if not os.path.exists(val_extract_path):
    with zipfile.ZipFile(val_zip, 'r') as zip_ref:
        zip_ref.extractall(coco_path)
    print("val2017 extracted.")
else:
    print("val2017 already extracted.")

# Extract annotations_trainval2017.zip
ann_extract_path = os.path.join(coco_path, "annotations")
if not os.path.exists(ann_extract_path):
    with zipfile.ZipFile(ann_zip, 'r') as zip_ref:
        zip_ref.extractall(coco_path)
    print("annotations extracted.")
else:
    print("annotations already extracted.")


In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
!pip install -q pycocotools

Now let’s read the captions_val2017.json and understand its structure.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
import json

annotations_path = r"coco/annotations/captions_val2017.json"

# Load annotations
with open(annotations_path, 'r') as f:
    captions_data = json.load(f)

# Preview the keys
print(captions_data.keys())

Create a mapping from image_id → captions. Each image ID typically has five reference captions.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
from collections import defaultdict

# Map image_id to list of captions
image_id_to_captions = defaultdict(list)

for ann in captions_data['annotations']:
    image_id_to_captions[ann['image_id']].append(ann['caption'])

# Show one example
example_id = captions_data['annotations'][0]['image_id']
print(f"Image ID: {example_id}")
print("Captions:", image_id_to_captions[example_id])
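
To check how many captions each image actually has, the annotations can be counted per image. The sketch below uses a tiny made-up annotation list only to show the shape of the result; `caption_count_histogram` is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def caption_count_histogram(annotations):
    """Return {number_of_captions: number_of_images} for COCO-style annotations."""
    per_image = defaultdict(int)
    for ann in annotations:
        per_image[ann["image_id"]] += 1
    return dict(Counter(per_image.values()))


# Toy data, not taken from COCO
toy_annotations = [
    {"image_id": 1, "caption": "a cat"},
    {"image_id": 1, "caption": "a small cat"},
    {"image_id": 2, "caption": "a dog"},
]
print(caption_count_histogram(toy_annotations))  # {2: 1, 1: 1}
```

In the notebook, the same function could be called as `caption_count_histogram(captions_data['annotations'])`.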

# Importing required Libraries

In [None]:
import os
import json
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM
from collections import defaultdict
from tqdm.auto import tqdm

In [None]:
checkpoint = "microsoft/git-base"
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
model.train()

# Creating Custom Dataset

Create a custom dataset for the val2017 images and captions. Only the first caption of each image is used as the training target.

In [None]:
class CocoDataset(Dataset):
    def __init__(self, image_dir, captions_dict, processor):
        self.image_dir = image_dir
        self.captions_dict = captions_dict
        self.processor = processor
        self.image_ids = list(captions_dict.keys())

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, idx):
        image_id = self.image_ids[idx]
        image_path = os.path.join(self.image_dir, f'{image_id:012}.jpg')
        image = Image.open(image_path).convert('RGB')
        caption = self.captions_dict[image_id][0]

        inputs = self.processor(
            images=image,
            text=caption,
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            legacy=False
        )
        inputs = {key: val.squeeze(0) for key, val in inputs.items()}
        inputs["labels"] = inputs["input_ids"]
        return inputs
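
Because the processor pads to `max_length`, the `labels` copied from `input_ids` also contain padding positions. A common option, not confirmed to be what the authors intended, is to replace those positions with -100, the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below shows the idea on plain Python lists with made-up token IDs; `mask_padding_labels` is a hypothetical helper, and whether GIT's loss skips -100 in a given transformers version is an assumption to verify.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids, replacing positions where attention_mask is 0."""
    return [
        token if keep == 1 else ignore_index
        for token, keep in zip(input_ids, attention_mask)
    ]


# Made-up token IDs with two padded positions at the end
toy_ids = [101, 1037, 4937, 102, 0, 0]
toy_mask = [1, 1, 1, 1, 0, 0]
print(mask_padding_labels(toy_ids, toy_mask))  # [101, 1037, 4937, 102, -100, -100]
```

On tensors, the equivalent would be `input_ids.masked_fill(attention_mask == 0, -100)`.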

In [None]:
dataset = CocoDataset(
    image_dir="coco/val2017",
    captions_dict=image_id_to_captions,
    processor=processor
)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)


# Model initialization and training

In [None]:
from torch.optim import Adam

optimizer = Adam(model.parameters(), lr=5e-5)
num_epochs = 20

for epoch in range(num_epochs):
    print(f"Epoch {epoch+1}/{num_epochs}")
    total_loss = 0

    for batch in tqdm(dataloader):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.item()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch+1} Average Loss: {total_loss / len(dataloader):.4f}")


In [None]:
def generate_caption(image_path):
    # Caption a single image with the fine-tuned model
    model.eval()  # switch off dropout for inference
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model.generate(pixel_values=inputs["pixel_values"])
    return processor.tokenizer.decode(outputs[0], skip_special_tokens=True)

In [None]:
sample_image_path = "coco/val2017/000000000139.jpg"
print("Generated Caption:", generate_caption(sample_image_path))

# DATA DISTILLATION

In [None]:
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# import random
# def random_selection(image_ids, percentage):
#     """
#     Select a subset of images based on random selection.

#     Arguments:
#     - image_ids: List of image IDs from the dataset.
#     - percentage: Percentage of dataset to be selected.

#     Returns:
#     - selected_image_ids: Subset of image IDs.
#     - selected_image_paths: Corresponding image file paths.
#     """
#     num_images_to_select = int(len(image_ids) * percentage / 100)
#     selected_image_ids = random.sample(image_ids, num_images_to_select)
#     selected_image_paths = [os.path.join(image_dir, f"{image_id:012}.jpg") for image_id in selected_image_ids]

#     return selected_image_ids, selected_image_paths

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
%pip install scipy

In [None]:
# import numpy as np
# from sklearn.metrics import pairwise_distances

# def gradient_based_selection(image_ids, features_dict, percentage):
#     '''
#     Optimized selection using pairwise distances (O(n^2), but fast with vectorized ops)
#     '''
#     feature_vectors = np.array([features_dict[img_id] for img_id in image_ids])
#     distance_matrix = pairwise_distances(feature_vectors, metric='euclidean')
#     distance_sums = np.sum(distance_matrix, axis=1)
    
#     num_select = int(len(image_ids) * percentage / 100)
#     selected_indices = np.argsort(distance_sums)[-num_select:]
#     selected_image_ids = [image_ids[i] for i in selected_indices]
#     selected_image_paths = [os.path.join(image_dir, f"{image_id:012}.jpg") for image_id in selected_image_ids]
    
#     return selected_image_ids, selected_image_paths


In [None]:
# from tqdm import tqdm
# import numpy as np

# def generate_distilled_captions(image_ids, image_paths, model, tokenizer, max_caption_len, batch_size=32):
#     '''
#     Generate captions in batches for better performance.
#     '''
#     distilled_captions = {}
    
#     for i in tqdm(range(0, len(image_paths), batch_size)):
#         batch_paths = image_paths[i:i+batch_size]
#         batch_ids = image_ids[i:i+batch_size]
#         batch_images = np.vstack([load_and_preprocess_img(path) for path in batch_paths])
#         batch_features = resnet_model.predict(batch_images, verbose=0)
        
#         for j, feature in enumerate(batch_features):
#             caption = generate_caption(model, feature.squeeze(), tokenizer, max_caption_len)
#             distilled_captions[batch_ids[j]] = caption

#     return distilled_captions


In [None]:
!pip install -q pycocoevalcap

In [None]:
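import time

def measure_training_time(dataset_image_ids, dataset_image_paths, model, processor, batch_size=8):
    # Note: this times caption generation over the subset, not a training run.
    # Arguments match the GIT version of generate_distilled_captions defined later.
    start_time = time.time()
    generate_distilled_captions(dataset_image_ids, dataset_image_paths, model, processor, batch_size=batch_size)
    return time.time() - start_time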
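
Since training time is one of the reported metrics, a small reusable timer can wrap any block, whether that is a training loop or caption generation. This is a minimal sketch; `timed` and the `timings` dictionary are hypothetical names, and the summed range is only a stand-in for real work.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record wall-clock seconds for the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("toy_step", timings):
    sum(range(100_000))  # stand-in for a training or generation call
print(sorted(timings))  # ['toy_step']
```

`time.perf_counter` is used instead of `time.time` because it is monotonic and intended for measuring intervals.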

In [None]:
image_dir = './coco/val2017'

# Pick only image files for which we have captions
image_ids = list(image_id_to_captions.keys())
image_paths = [os.path.join(image_dir, f"{image_id:012}.jpg") for image_id in image_ids]

In [None]:
# import os
# import random

# def random_selection(image_ids, image_dir, percentage):
#     """
#     Select a subset of images based on random selection.

#     Arguments:
#     - image_ids: List of image IDs from the dataset.
#     - image_dir: Directory where images are stored.
#     - percentage: Percentage of dataset to be selected.

#     Returns:
#     - selected_image_ids: Subset of image IDs.
#     - selected_image_paths: Corresponding image file paths.
#     """
#     num_images_to_select = int(len(image_ids) * percentage / 100)
#     selected_image_ids = random.sample(image_ids, num_images_to_select)
#     selected_image_paths = [os.path.join(image_dir, f"{image_id:012}.jpg") for image_id in selected_image_ids]

#     return selected_image_ids, selected_image_paths


In [None]:
# random_25_ids, random_25_paths = random_selection(image_ids, image_dir,25)

In [None]:
%pip install tensorflow

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
import tensorflow
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import re

# Gather all captions into one list
all_captions = []
for caps in image_id_to_captions.values():
    for c in caps:
        # Add start and end tokens
        cleaned = '<start> ' + c.lower().strip() + ' <end>'
        all_captions.append(cleaned)

# Tokenizer setup
tokenizer = Tokenizer(oov_token="<unk>")
tokenizer.fit_on_texts(all_captions)

# Convert captions to sequences of integers
caption_seqs = tokenizer.texts_to_sequences(all_captions)

# Add padding so all captions are same length
max_caption_len = max(len(seq) for seq in caption_seqs)
caption_seqs_padded = pad_sequences(caption_seqs, maxlen=max_caption_len, padding='post')

# Vocab size
vocab_size = len(tokenizer.word_index) + 1  # +1 for padding

print(f"Total captions: {len(all_captions)}")
print(f"Max caption length: {max_caption_len}")
print(f"Vocabulary size: {vocab_size}")
print("Sample padded sequence:", caption_seqs_padded[0])

In [None]:
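# Stale call: random_25_ids is never defined (its cell above is commented out), and the
# (tokenizer, max_caption_len) arguments do not match the GIT version of
# generate_distilled_captions defined in the next cell. Kept commented out for reference.
# captions_random_25 = generate_distilled_captions(random_25_ids, random_25_paths, model, tokenizer, max_caption_len)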

In [None]:
from tqdm import tqdm

def generate_distilled_captions(image_ids, image_paths, model, processor, batch_size=8):
    """
    Generate distilled captions directly using the trained GIT model.
    """
    distilled_captions = {}

    for i in tqdm(range(0, len(image_paths), batch_size)):
        batch_paths = image_paths[i:i+batch_size]
        batch_ids = image_ids[i:i+batch_size]
        batch_images = [Image.open(path).convert('RGB') for path in batch_paths]

        # Preprocess
        inputs = processor(images=batch_images, return_tensors="pt", padding="max_length", truncation=True).to(device)

        # Generate
        with torch.no_grad():
            outputs = model.generate(pixel_values=inputs["pixel_values"], max_length=50)

        # Decode
        decoded_captions = processor.batch_decode(outputs, skip_special_tokens=True)

        for img_id, caption in zip(batch_ids, decoded_captions):
            distilled_captions[img_id] = caption.strip()

    return distilled_captions


In [None]:
import random
import os

def random_selection(image_ids, image_dir, percentage):
    """
    Randomly select a subset of images.
    """
    num_images_to_select = int(len(image_ids) * percentage / 100)
    selected_image_ids = random.sample(image_ids, num_images_to_select)
    selected_image_paths = [os.path.join(image_dir, f"{image_id:012}.jpg") for image_id in selected_image_ids]

    return selected_image_ids, selected_image_paths
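
`random.sample` draws a different subset on every run, which makes the random baseline hard to reproduce. A minimal sketch of a seeded variant follows; `seeded_selection` and its `seed` parameter are hypothetical additions, and the toy IDs are made up.

```python
import random


def seeded_selection(image_ids, percentage, seed=0):
    """Pick int(len * percentage / 100) IDs using a private, seeded RNG."""
    rng = random.Random(seed)  # does not touch the global random state
    num_to_select = int(len(image_ids) * percentage / 100)
    return rng.sample(image_ids, num_to_select)


toy_ids = list(range(20))
first = seeded_selection(toy_ids, 25, seed=42)
second = seeded_selection(toy_ids, 25, seed=42)
print(len(first), first == second)  # 5 True
```

Each percentage is still sampled independently, so the 25% subset is not guaranteed to be contained in the 50% subset.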


In [None]:
random_ids_25, random_paths_25 = random_selection(image_ids, image_dir, percentage=25)

In [None]:
distilled_captions_random_25 = generate_distilled_captions(random_ids_25, random_paths_25, model, processor)

In [None]:
# Choose your save directory
save_directory = "deepreel/models/git_finetuned"

# Create the directory if it doesn't exist
import os
os.makedirs(save_directory, exist_ok=True)

# Save model
model.save_pretrained(save_directory)

# Save processor (tokenizer + feature extractor)
processor.save_pretrained(save_directory)

print(f"Model and processor saved successfully at: {save_directory}")


In [None]:
random_ids_50, random_paths_50 = random_selection(image_ids, image_dir, percentage=50)
distilled_captions_random_50 = generate_distilled_captions(random_ids_50, random_paths_50, model, processor)

In [None]:
random_ids_75, random_paths_75 = random_selection(image_ids, image_dir, percentage=75)
distilled_captions_random_75 = generate_distilled_captions(random_ids_75, random_paths_75, model, processor)

In [None]:
random_ids_100, random_paths_100 = random_selection(image_ids, image_dir, percentage=100)
distilled_captions_random_100 = generate_distilled_captions(random_ids_100, random_paths_100, model, processor)

In [None]:
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch
import numpy as np
import os
from tqdm import tqdm

# Load CLIP model for feature extraction
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_model.eval()

def extract_clip_features(image_dir, image_ids):
    features_dict = {}
    
    for image_id in tqdm(image_ids):
        image_path = os.path.join(image_dir, f"{image_id:012}.jpg")
        image = Image.open(image_path).convert("RGB")

        # Preprocess
        inputs = clip_processor(images=image, return_tensors="pt").to(device)

        # Extract features
        with torch.no_grad():
            image_features = clip_model.get_image_features(**inputs)

        # Normalize features (important for distance calculation later)
        image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)
        image_features = image_features.squeeze().cpu().numpy()

        features_dict[image_id] = image_features

    return features_dict

# Usage
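image_dir = ""  # set this to the image folder before running (image_dir was './coco/val2017' earlier); an empty string breaks the paths built below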
features_dict = extract_clip_features(image_dir, image_ids)

print(f"Extracted features for {len(features_dict)} images.")


In [None]:
import numpy as np
from sklearn.metrics import pairwise_distances
import os

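def gradient_based_selection(image_ids, features_dict, image_dir, percentage):
    """
    Select the images whose CLIP features are farthest from the rest.

    Despite the name, no gradients are computed here: each image is scored by
    the sum of its Euclidean distances to every other image, and the
    top-scoring `percentage` percent are kept.
    """
    feature_vectors = np.array([features_dict[img_id] for img_id in image_ids])
    # Full pairwise distance matrix: O(n^2) time and memory
    distance_matrix = pairwise_distances(feature_vectors, metric='euclidean')
    distance_sums = np.sum(distance_matrix, axis=1)

    num_select = int(len(image_ids) * percentage / 100)
    # argsort is ascending, so the last num_select indices have the largest sums
    selected_indices = np.argsort(distance_sums)[-num_select:]

    selected_image_ids = [image_ids[i] for i in selected_indices]
    selected_image_paths = [os.path.join(image_dir, f"{image_ids[i]:012}.jpg") for i in selected_indices]

    return selected_image_ids, selected_image_paths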
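
To see what the distance-sum score favors, here is a small pure-Python version of the same ranking on four made-up 2D points. `farthest_by_distance_sum` is a hypothetical name; it mirrors the scoring in `gradient_based_selection` but is not the notebook's implementation.

```python
import math


def farthest_by_distance_sum(points, num_select):
    """Return indices of the num_select points with the largest summed distance to all others."""
    sums = [sum(math.dist(p, q) for q in points) for p in points]
    order = sorted(range(len(points)), key=lambda i: sums[i])  # ascending by score
    # Slice from the front so that num_select == 0 yields an empty list
    return order[len(order) - num_select:]


# Three clustered points and one outlier (toy data)
toy_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(farthest_by_distance_sum(toy_points, 1))  # [3]
```

The outlier is picked first, which suggests this criterion favors atypical images over representative ones. The slice is written as `order[len(order) - num_select:]` because `order[-0:]` would return every index when `num_select` is 0.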


In [None]:
gradient_ids_25, gradient_paths_25 = gradient_based_selection(image_ids, features_dict,image_dir, percentage=25)
distilled_captions_gradient_25 = generate_distilled_captions(gradient_ids_25, gradient_paths_25, model, processor)

In [None]:
gradient_ids_50, gradient_paths_50 = gradient_based_selection(image_ids, features_dict,image_dir, percentage=50)
distilled_captions_gradient_50 = generate_distilled_captions(gradient_ids_50, gradient_paths_50, model, processor)

In [None]:
gradient_ids_75, gradient_paths_75 = gradient_based_selection(image_ids, features_dict,image_dir, percentage=75)
distilled_captions_gradient_75 = generate_distilled_captions(gradient_ids_75, gradient_paths_75, model, processor)

In [None]:
gradient_ids_100, gradient_paths_100 = gradient_based_selection(image_ids, features_dict,image_dir, percentage=100)
distilled_captions_gradient_100 = generate_distilled_captions(gradient_ids_100, gradient_paths_100, model, processor)

In [None]:
from pycocotools.coco import COCO

# Path to COCO annotations (adjust if needed)
annotation_file = r"deepreel/data/annotations/captions_val2017.json"
coco = COCO(annotation_file)

# Create dictionary mapping image_id to a list of reference captions
ground_truth_captions = {}

for img_id in coco.getImgIds():
    ann_ids = coco.getAnnIds(imgIds=img_id)
    anns = coco.loadAnns(ann_ids)
    ground_truth_captions[img_id] = [ann['caption'] for ann in anns]

print(f"Total images with ground truth captions: {len(ground_truth_captions)}")
print("Sample:\n", list(ground_truth_captions.items())[:1])


In [None]:
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

annotation_file = "deepreel/data/annotations/captions_val2017.json"
coco_gt = COCO(annotation_file)


def evaluate_captions(coco_gt, generated_captions, exclude_metrics):
    
    valid_img_ids = set(coco_gt.getImgIds())

    filtered_generated = {
        img_id: caption for img_id, caption in generated_captions.items()
        if img_id in valid_img_ids
    }

    if not filtered_generated:
        raise ValueError("No valid image IDs found in generated captions.")

    results = [{"image_id": img_id, "caption": filtered_generated[img_id]}
               for img_id in sorted(filtered_generated.keys())]

    coco_res = coco_gt.loadRes(results)
    coco_eval = COCOEvalCap(coco_gt, coco_res)
    coco_eval.params['image_id'] = list(filtered_generated.keys())
    coco_eval.evaluate()

    # Exclude metrics if specified
    scores = coco_eval.eval
    if exclude_metrics:
        scores = {k: v for k, v in scores.items() if k not in exclude_metrics}

    return scores


scores_random_25 = evaluate_captions(coco_gt, distilled_captions_random_25, exclude_metrics=["SPICE"])
# scores_gradient_25 = evaluate_captions(coco_gt, captions_gradient_25, exclude_metrics=["SPICE"])

print(" Evaluation Metrics for Random 25%:")
for metric, score in scores_random_25.items():
    print(f"{metric}: {score:.4f}")
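
For intuition about what the BLEU scores measure, the sketch below computes clipped unigram precision, the building block of BLEU-1, on a made-up sentence pair. It is a simplified illustration only: the scorer in `pycocoevalcap` also applies its own tokenization, uses n-grams up to length 4, and includes a brevity penalty. `clipped_unigram_precision` is a hypothetical helper name.

```python
from collections import Counter


def clipped_unigram_precision(candidate, references):
    """Fraction of candidate words found in the references, with clipped counts."""
    cand_counts = Counter(candidate.lower().split())
    # For each word, the largest count seen in any single reference
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref.lower().split()).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(count, max_ref_counts[word]) for word, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


print(clipped_unigram_precision("a cat on a mat", ["a cat sits on the mat"]))  # 0.8
```

The candidate contains "a" twice but the reference only once, so only one occurrence is credited: 4 matched words out of 5.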


In [None]:
scores_random_50 = evaluate_captions(coco_gt, distilled_captions_random_50, exclude_metrics=["SPICE"])

print(" Evaluation Metrics for Random 50% :")
for metric, score in scores_random_50.items():
    print(f"{metric}: {score:.4f}")

In [None]:
scores_random_75 = evaluate_captions(coco_gt, distilled_captions_random_75, exclude_metrics=["SPICE"])

print(" Evaluation Metrics for Random 75% :")
for metric, score in scores_random_75.items():
    print(f"{metric}: {score:.4f}")

In [None]:
scores_random_100 = evaluate_captions(coco_gt, distilled_captions_random_100, exclude_metrics=["SPICE"])

print(" Evaluation Metrics for Random 100% :")
for metric, score in scores_random_100.items():
    print(f"{metric}: {score:.4f}")

In [None]:
scores_gradient_100 = evaluate_captions(coco_gt, distilled_captions_gradient_100, exclude_metrics=["SPICE"])

print(" Evaluation Metrics for gradient 100% :")
for metric, score in scores_gradient_100.items():
    print(f"{metric}: {score:.4f}")

In [None]:
scores_gradient_75 = evaluate_captions(coco_gt, distilled_captions_gradient_75, exclude_metrics=["SPICE"])

print(" Evaluation Metrics for gradient 75% :")
for metric, score in scores_gradient_75.items():
    print(f"{metric}: {score:.4f}")

In [None]:
scores_gradient_50 = evaluate_captions(coco_gt, distilled_captions_gradient_50, exclude_metrics=["SPICE"])

print(" Evaluation Metrics for gradient 50% :")
for metric, score in scores_gradient_50.items():
    print(f"{metric}: {score:.4f}")

In [None]:
scores_gradient_25 = evaluate_captions(coco_gt, distilled_captions_gradient_25, exclude_metrics=["SPICE"])

print(" Evaluation Metrics for gradient 25% :")
for metric, score in scores_gradient_25.items():
    print(f"{metric}: {score:.4f}")