# Deep Learning: Assignment #4
## Submission date: 28/01/2026, 23:59.
### Topics:
- Word Embeddings
- Transformers
- Vision Transformers
- Few-Shot Learning
- Self-Supervised Learning


**Submitted by:**

- **Student 1 — Name, ID**
- **Student 2 — Name, ID**


**Assignment Instructions:**

· Submissions are in **pairs only**. Write both names + IDs at the top of the notebook.

· Keep your code **clean, concise, and readable**.

· You may work in your IDE, but you **must** paste the final code back into the **matching notebook cells** and run it there.  


· <font color='red'>Write your textual answers in red.</font>  
(e.g., `<span style="color:red">your answer here</span>`)

· All figures, printed results, and outputs should remain visible in the notebook.  
Run **all cells** before submitting and **do not clear outputs**.

· Use relative paths — **no absolute file paths** pointing to local machines.

· **Important:** Your submission must be entirely your own.  
Any form of plagiarism (including uncredited use of ChatGPT or AI tools) will result in a **grade of 0** and disciplinary action.


In [None]:
# Global Setup

import os
import re
import math
import random
from collections import defaultdict, Counter
from dataclasses import dataclass
from typing import List, Dict, Tuple, Optional
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from tqdm.auto import tqdm
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torchvision
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)


## Question 1: Learning Vision–Language Representations with Transformers (60 Points)

Recent multimodal models have shown that images and natural language can be embedded into a **shared semantic space**, enabling tasks such as image–text retrieval and zero-shot inference without training task-specific classifiers.

In this question, you will build a simplified multimodal model inspired by CLIP, combining a **Vision Transformer (ViT)** image encoder with a **Transformer-based text encoder**, trained using a **contrastive objective**.



### Load & Preprocess Data

In this homework, we use the **Flickr8k dataset**, a small-scale vision–language dataset commonly used in image captioning and multimodal learning research.

The dataset contains:
- approximately 8,000 natural images,
- five human-written captions per image.

Each training example consists of an image paired with one of its captions.  
Throughout this assignment, images and captions will be used to learn a **shared embedding space** between vision and language.

You may use the following commands to download and extract the dataset into your working directory:


In [None]:
!wget "https://github.com/awsaf49/flickr-dataset/releases/download/v1.0/flickr8k.zip"
!unzip -q flickr8k.zip -d ./flickr8k

Before training any models, the raw images and captions must be preprocessed into a form suitable for Transformer-based models.

Your preprocessing pipeline should include:
- loading and transforming images into tensors of a fixed size,
- parsing the captions file and associating each image with its captions,
- basic text preprocessing (e.g., lowercasing and tokenization),
- converting captions into sequences of token indices,
- creating attention masks for padded tokens.

You are free to choose reasonable design decisions (e.g., maximum caption length, tokenization strategy), as long as they are applied **consistently** throughout the assignment.
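As a rough illustration of the text side of this pipeline, the sketch below lowercases and tokenizes captions, builds a vocabulary, and produces padded index sequences with attention masks. The special tokens, the helper names (`tokenize`, `build_vocab`, `encode`), and `max_len=8` are assumptions made for this example only, not requirements of the assignment.

```python
import re
from collections import Counter

# Hypothetical special tokens; any consistent choice works.
PAD, UNK, BOS, EOS = "<pad>", "<unk>", "<bos>", "<eos>"


def tokenize(caption):
    """Lowercase the caption and keep alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", caption.lower())


def build_vocab(captions, min_freq=1):
    """Map each sufficiently frequent token to an integer index."""
    counts = Counter(tok for cap in captions for tok in tokenize(cap))
    itos = [PAD, UNK, BOS, EOS] + sorted(t for t, n in counts.items() if n >= min_freq)
    return {tok: i for i, tok in enumerate(itos)}


def encode(caption, vocab, max_len=8):
    """Return (token ids, attention mask), both padded to max_len."""
    tokens = [BOS] + tokenize(caption)[: max_len - 2] + [EOS]
    ids = [vocab.get(t, vocab[UNK]) for t in tokens]
    mask = [1] * len(ids)  # 1 = real token, 0 = padding
    pad = max_len - len(ids)
    return ids + [vocab[PAD]] * pad, mask + [0] * pad


captions = ["A dog runs.", "Two dogs play in the snow!"]
vocab = build_vocab(captions)
ids, mask = encode("A dog runs.", vocab)
print(ids, mask)
```

Here the mask uses 1 for real tokens and 0 for padding; whichever convention you pick, keep it consistent with how your text encoder consumes the mask.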


In [None]:
# TODO: Implement

### Learning a Shared Vision–Language Representation



The goal of this part is to learn a **shared embedding space** for images and natural language.

Given an image and one of its captions, the model should map both modalities to vectors in the same vector space, such that:
- semantically matching image–caption pairs are close,
- non-matching pairs are far apart.

This shared representation will later be used for retrieval and zero-shot inference, without training task-specific classifiers.


To achieve this goal, you will build two Transformer-based models:

- **A Vision Transformer (ViT)** image encoder, which represents an image as a sequence of patch embeddings and produces a single image-level representation.
- **A Transformer encoder for text**, which represents a caption as a sequence of token embeddings and produces a single caption-level representation.

Both encoders should output vectors of the same dimension.  
These vectors will be projected into a shared embedding space and normalized before computing similarity.

The Vision Transformer should include:
- patch embedding,
- positional embeddings,
- a learnable classification token,
- a Transformer encoder stack.

The text encoder should include:
- token embeddings,
- positional embeddings,
- a Transformer encoder stack.

You may choose reasonable architectural hyperparameters (e.g., depth, embedding dimension), as long as they are used consistently and yield good results.
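The listed ViT components can be assembled roughly as in the sketch below. It is a minimal illustration, not a reference solution: the class name `TinyViT` and every size (64×64 inputs, 16×16 patches, `dim=32`, two layers) are placeholder assumptions, and patch embedding is implemented with a strided convolution, which is one common choice among several.

```python
import torch
import torch.nn as nn


class TinyViT(nn.Module):
    """Illustrative ViT encoder; all hyperparameters are placeholder assumptions."""

    def __init__(self, img_size=64, patch_size=16, dim=32, depth=2, heads=4):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding: one conv step per non-overlapping patch.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # Learnable classification token and positional embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)

    def forward(self, images):
        x = self.patch_embed(images)          # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)      # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.norm(x[:, 0])             # CLS output as the image-level vector


model = TinyViT()
features = model(torch.randn(2, 3, 64, 64))
print(features.shape)
```

A text encoder can follow the same pattern with `nn.Embedding` in place of the patch embedding. Note that `nn.TransformerEncoder` accepts a `src_key_padding_mask` argument in which `True` marks padded positions to be ignored.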


In [None]:
# TODO: Implement

#### Training and Evaluation



The two encoders are trained jointly using a **contrastive learning objective** over mini-batches of image–caption pairs.

During training, the model should increase similarity between matching image–caption pairs and decrease similarity between non-matching pairs within the same mini-batch.  
Similarity between embeddings should be measured using **cosine similarity**.
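One way to express this objective is a symmetric cross-entropy over the batch similarity matrix, in the style of CLIP. The sketch below is illustrative: the function name `clip_contrastive_loss` and the fixed `temperature=0.07` are assumptions, and it treats every off-diagonal pair as a negative, which ignores the case where two captions of the same image land in one batch.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is assumed to be a matching pair."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Cosine similarities scaled by the temperature: (B, B).
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)


emb = torch.eye(4)
aligned = clip_contrastive_loss(emb, emb)
misaligned = clip_contrastive_loss(emb, emb.roll(1, dims=0))
print(float(aligned), float(misaligned))
```

On this toy input the aligned pairs yield a much smaller loss than the shifted pairs, which is the behavior the objective is meant to reward.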

After training, evaluate the learned representations using **retrieval-based metrics**:
- image → text retrieval,
- text → image retrieval.

Report Recall@1, Recall@3, and Recall@5 on the validation set.

<br>

>**Recall@K**

Recall@K is a standard metric for evaluating retrieval-based models.

For each query (an image or a caption), the model ranks all candidates from the opposite modality according to embedding similarity.

Recall@K measures the fraction of queries for which **at least one correct match** appears among the top $K$ retrieved results.

For example:
- Recall@1 measures how often the top-ranked result is correct.
- Recall@3 measures how often a correct result appears within the top 3.
- Recall@5 measures how often a correct result appears within the top 5.

Higher Recall@K values indicate better alignment between image and text representations.
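The definition above can be sketched as follows, assuming a precomputed similarity matrix and, for each query, the set of indices of its correct candidates. The function name `recall_at_k` and the toy values are hypothetical; the toy example uses $K \in \{1, 2, 3\}$ only because it has four candidates.

```python
import numpy as np


def recall_at_k(sim, query_to_gt, ks=(1, 3, 5)):
    """sim: (num_queries, num_candidates); query_to_gt: list of sets of correct indices."""
    ranking = np.argsort(-sim, axis=1)  # best candidate first
    results = {}
    for k in ks:
        hits = [
            len(set(ranking[q, :k].tolist()) & gt) > 0
            for q, gt in enumerate(query_to_gt)
        ]
        results[k] = float(np.mean(hits))
    return results


sim = np.array([
    [0.9, 0.1, 0.3, 0.2],
    [0.8, 0.5, 0.6, 0.1],
    [0.2, 0.7, 0.1, 0.6],
])
gt = [{0}, {1}, {2, 3}]  # the third query has two correct candidates
scores = recall_at_k(sim, gt, ks=(1, 2, 3))
print(scores)
```

For image → text retrieval each image has several correct captions, so the ground-truth sets contain more than one index; for text → image retrieval each set contains a single image.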


These metrics will be used throughout the assignment to assess alignment quality.


In [None]:
# TODO: Implement

### Zero-Shot Caption Selection



In deep learning, **zero-shot evaluation** refers to making predictions on a task without training a model specifically for that task.  
Instead, the model relies entirely on representations learned during a different training objective.

In this assignment, the vision–language model is trained only to align images and captions in a shared embedding space using a contrastive objective. It is not trained to perform caption selection or classification directly. As a result, any success on caption selection reflects the quality of the learned representations and their semantic alignment.

In this evaluation, the trained Vision Transformer image encoder and Transformer-based text encoder are used without modification. Given a single image and a set of $N$ candidate captions, consisting of one correct caption and $N-1$ randomly selected captions, the model embeds the image and all captions and selects the caption whose embedding has the highest cosine similarity to the image embedding.

No additional parameters are introduced, and no further training is performed.

Evaluate caption selection accuracy for $N \in \{3, 5, 10, 20, 25\}$.

For each value of $N$, perform the evaluation over the **entire validation set**, using all images in the split. For each image, construct a candidate set consisting of the correct caption and $N-1$ randomly selected captions from other images, and record whether the correct caption is ranked highest by cosine similarity.

Report the resulting caption selection accuracy as a function of $N$, and visualize the results in a plot with $N$ on the horizontal axis and accuracy on the vertical axis.
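The evaluation loop for a single value of $N$ might look like the sketch below. It assumes embeddings have already been computed and L2-normalized, so a dot product equals cosine similarity; the function name `caption_selection_accuracy`, its arguments, and the toy data are hypothetical.

```python
import numpy as np


def caption_selection_accuracy(image_emb, caption_emb, caption_image_ids, n_candidates, seed=0):
    """image_emb: (I, D), caption_emb: (C, D), both L2-normalized.
    caption_image_ids[c] is the index of the image that caption c describes."""
    rng = np.random.default_rng(seed)
    caption_image_ids = np.asarray(caption_image_ids)
    correct = 0
    for img in range(image_emb.shape[0]):
        positives = np.flatnonzero(caption_image_ids == img)
        negatives = np.flatnonzero(caption_image_ids != img)
        pos = rng.choice(positives)  # one correct caption
        neg = rng.choice(negatives, size=n_candidates - 1, replace=False)
        candidates = np.concatenate([[pos], neg])  # correct caption at position 0
        scores = caption_emb[candidates] @ image_emb[img]
        correct += int(np.argmax(scores) == 0)
    return correct / image_emb.shape[0]


# Toy data: 4 images, 2 captions each, with perfectly aligned embeddings.
image_emb = np.eye(4)
caption_emb = np.repeat(np.eye(4), 2, axis=0)
caption_image_ids = np.repeat(np.arange(4), 2)
accuracy = caption_selection_accuracy(image_emb, caption_emb, caption_image_ids, n_candidates=3)
print(accuracy)
```

Fixing the random seed keeps the candidate sets identical across models, which makes later comparisons fairer. Chance-level accuracy is $1/N$, a useful baseline to draw on the plot.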

In addition, include several qualitative examples illustrating both correct and incorrect selections by showing an image alongside the candidate captions and the model’s similarity scores.

This evaluation provides an intuitive measure of how well semantic alignment has been learned.



In [None]:
# TODO: Implement

### Improving Visual Representations with Self-Supervised Learning




The experiments above rely on representations learned solely through image–caption alignment. However, Flickr8k provides limited paired supervision, which may limit the quality of the learned visual representations and, in turn, zero-shot performance.

Self-supervised learning addresses this limitation by allowing models to learn meaningful visual structure from images alone, without relying on captions or labels.

In this part, you will improve the Vision Transformer image encoder by **pretraining it using image-only self-supervised learning**, before performing vision–language alignment.

In the self-supervised stage, the Vision Transformer is trained using **only images**, without access to captions. For each image, two different augmented views are generated, and the model is trained to produce similar representations for views of the same image while producing dissimilar representations for views of different images.
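A common loss matching this description is the NT-Xent loss used in SimCLR; the sketch below is one possible instantiation, not the only acceptable method. The function name `nt_xent_loss` and `temperature=0.5` are assumptions, and `z1`, `z2` stand for the encoder outputs of the two augmented views, with row `i` of each coming from the same image.

```python
import torch
import torch.nn.functional as F


def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss over two batches of view embeddings, each of shape (B, D)."""
    b = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)  # (2B, D)
    sim = z @ z.t() / temperature                        # (2B, 2B)
    # Exclude self-similarity so a view is never its own positive or negative.
    self_mask = torch.eye(2 * b, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))
    # The positive for row i is the other view of the same image.
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)]).to(z.device)
    return F.cross_entropy(sim, targets)


views = torch.eye(4)
matched = nt_xent_loss(views, views)
mismatched = nt_xent_loss(views, views.roll(1, dims=0))
print(float(matched), float(mismatched))
```

In practice `z1` and `z2` would come from passing two random augmentations of each image through the Vision Transformer, often followed by a small projection head.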

After self-supervised pretraining, initialize the vision–language model with the pretrained Vision Transformer and repeat the contrastive image–caption training described earlier. You may choose whether to freeze the Vision Transformer or fine-tune it jointly with the text encoder, as long as the choice is applied consistently.

Evaluate the resulting model using the same protocols as before: image–text retrieval with Recall@K and zero-shot caption selection for $N = 3, 5, 10, 20, 25$. Compare these results to training the Vision Transformer from scratch, and visualize the differences using appropriate plots. Briefly discuss the effect of self-supervised pretraining on representation quality and zero-shot generalization.


In [None]:
# TODO: Implement

### Reflection



Answer the following questions in your own words.  
Your answers should demonstrate **conceptual understanding** rather than implementation details.

1. The model is trained using a contrastive objective on image–caption pairs, yet it is evaluated on caption selection without being trained for this task explicitly. Explain why this is possible, and what properties of the learned embedding space make zero-shot inference feasible.

2. Vision Transformers do not incorporate strong spatial inductive biases, unlike convolutional neural networks. Based on your experiments, discuss how this design choice affects learning in the low-data regime of Flickr8k, both before and after self-supervised pretraining.

3. Self-supervised pretraining improves downstream performance even though no captions are used during this stage. Explain what types of information the Vision Transformer can learn during self-supervised pretraining that are useful for vision–language alignment later.

4. Contrastive learning relies on negative examples drawn from within a mini-batch. Discuss how batch size influences the quality of learned representations, and how this consideration relates to both multimodal alignment and self-supervised learning.
