In [None]:
# 🔧 Setup: Run this cell first!
# Check GPU availability, print versions, and set random seeds

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected. Some cells may run slowly.")
    print("   Go to Runtime → Change runtime type → GPU")

print(f"\n📦 Python {sys.version.split()[0]}")
print(f"🔥 PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"🎲 Random seed set to {SEED}")

%matplotlib inline

# Contrastive Learning and the InfoNCE Loss from First Principles

*Part 1 of the Vizuara series on Contrastive Pretraining (CLIP-style)*

*Estimated time: 45 minutes*

## 1. Why Does This Matter?

Contrastive learning is the engine behind some of the most powerful AI models of the last few years -- CLIP, SimCLR, MoCo, and many more. The core idea is deceptively simple: learn representations by pulling similar things close and pushing different things apart.

By the end of this notebook, you will:
- Understand the intuition behind contrastive learning from first principles
- Implement the InfoNCE loss function from scratch
- Build a simple contrastive learner on 2D data
- Visualize how the embedding space organizes itself during training

Let us start by importing what we need. The goal for the rest of the notebook is a model that learns to cluster similar data points together purely from contrastive supervision:

In [None]:
# Setup and GPU check
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
torch.manual_seed(42)
np.random.seed(42)

## 2. Building Intuition

Imagine you are organizing a music library. You have thousands of songs, and you want to group them so that similar songs are nearby and dissimilar songs are far apart.

One approach: pick a song, find another song that is similar (a **positive pair**), and find songs that are different (**negative pairs**). Then nudge your organization so the similar pair gets closer, and the different pairs get further apart.

Now repeat this millions of times. Over time, jazz songs cluster together, rock songs cluster together, and classical music forms its own group -- all without ever labeling a single song.

This is exactly what contrastive learning does, but in a high-dimensional embedding space instead of a music library shelf.

### Key Insight

The beauty of contrastive learning is that you never need explicit class labels. You only need to know which pairs belong together; everything else in the batch is treated as "different." In CLIP, this pairing comes almost for free from the internet, where many images already come with a caption or alt text.

### Think About This

Before we write any code, ask yourself:
- If you had to measure "similarity" between two vectors, what mathematical operation would you use?
- Why might we want to normalize our vectors to unit length before comparing them?
- What happens if all your embeddings collapse to the same point? Does that satisfy "all positives are close"?

## 3. The Mathematics

### Cosine Similarity

The foundation of contrastive learning is measuring similarity between vectors. We use **cosine similarity**, which measures the angle between two vectors:

$$\text{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$$

Computationally, this means: take the dot product of a and b, then divide by the product of their lengths. If both vectors are normalized to unit length ($\|\mathbf{a}\| = \|\mathbf{b}\| = 1$), then cosine similarity is just the dot product.

Let us plug in some simple numbers. Suppose $\mathbf{a} = [3, 4]$ and $\mathbf{b} = [4, 3]$.

$$\text{sim} = \frac{3 \times 4 + 4 \times 3}{\sqrt{9+16} \times \sqrt{16+9}} = \frac{24}{5 \times 5} = \frac{24}{25} = 0.96$$

This tells us these vectors are very similar (pointing in nearly the same direction).
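
One property worth checking numerically is that cosine similarity ignores vector length, which is why normalizing to unit length first loses nothing. The sketch below rescales the two example vectors by arbitrary factors (2 and 5 are illustrative choices, not values from the notebook) and compares PyTorch's built-in `F.cosine_similarity` with the raw dot product, which does change with length.

```python
import torch
import torch.nn.functional as F

a = torch.tensor([3.0, 4.0])
b = torch.tensor([4.0, 3.0])

# Cosine similarity depends only on direction, so rescaling leaves it unchanged.
sim_original = F.cosine_similarity(a, b, dim=0)
sim_rescaled = F.cosine_similarity(2 * a, 5 * b, dim=0)

# The raw dot product, by contrast, grows with the vector lengths.
dot_original = torch.dot(a, b)
dot_rescaled = torch.dot(2 * a, 5 * b)

print(f"cosine: {sim_original.item():.4f} vs {sim_rescaled.item():.4f}")
print(f"dot:    {dot_original.item():.1f} vs {dot_rescaled.item():.1f}")
```

Because the score is unaffected by length, the encoder cannot lower the loss simply by making its outputs longer; only the directions matter.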

In [None]:
# Let us verify this computation
a = torch.tensor([3.0, 4.0])
b = torch.tensor([4.0, 3.0])

cosine_sim = torch.dot(a, b) / (torch.norm(a) * torch.norm(b))
print(f"Cosine similarity: {cosine_sim.item():.4f}")

# With normalized vectors, it is just the dot product
a_norm = F.normalize(a, dim=0)
b_norm = F.normalize(b, dim=0)
dot_product = torch.dot(a_norm, b_norm)
print(f"Dot product of normalized vectors: {dot_product.item():.4f}")
print(f"Same result? {torch.allclose(cosine_sim, dot_product)}")

### The InfoNCE Loss

Now the critical question: how do we use similarity to train a model?

Given a batch of $N$ pairs, we construct an $N \times N$ similarity matrix. For each row $i$, the diagonal entry $(i, i)$ is the positive pair, and all other entries are negatives. The loss treats this as an $N$-way classification problem:

$$\mathcal{L}_{\text{InfoNCE}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\text{sim}(z_i^a, z_i^b) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(z_i^a, z_j^b) / \tau)}$$

Computationally, this says: for each anchor embedding $z_i^a$, compute its similarity with all $N$ candidate embeddings, scale by temperature $\tau$, apply softmax, and take the negative log probability of the correct match.
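
The formula maps almost term by term onto code. As a minimal sketch (the helper name `info_nce_rows` and the random test embeddings are assumptions made for illustration), the log of the softmax ratio splits into the positive logit minus a log-sum-exp over the row, and the result should agree with `F.cross_entropy` when the labels are the diagonal indices.

```python
import torch
import torch.nn.functional as F

def info_nce_rows(sim, tau):
    """One-directional InfoNCE from an (N, N) similarity matrix (hypothetical helper)."""
    logits = sim / tau
    positives = logits.diagonal()                      # sim(z_i^a, z_i^b) / tau
    log_denominator = torch.logsumexp(logits, dim=1)   # log sum_j exp(sim(z_i^a, z_j^b) / tau)
    # -log(exp(pos) / sum exp) = log_denominator - pos, averaged over the N rows
    return (log_denominator - positives).mean()

torch.manual_seed(0)
za = F.normalize(torch.randn(6, 8), dim=-1)
zb = F.normalize(torch.randn(6, 8), dim=-1)
sim = za @ zb.T

manual = info_nce_rows(sim, tau=0.5)
reference = F.cross_entropy(sim / 0.5, torch.arange(6))
print(f"manual: {manual.item():.6f}  cross_entropy: {reference.item():.6f}")
```

Using `logsumexp` rather than exponentiating and summing by hand keeps the computation numerically stable when the temperature is small and the logits are large.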

Let us work through this with $N = 3$ and $\tau = 0.5$. Suppose our similarity scores are:

| | $z_1^b$ | $z_2^b$ | $z_3^b$ |
|---|---|---|---|
| $z_1^a$ | **0.9** | 0.1 | -0.2 |

The softmax for row 1:
$$p_1 = \frac{\exp(0.9/0.5)}{\exp(0.9/0.5) + \exp(0.1/0.5) + \exp(-0.2/0.5)} = \frac{\exp(1.8)}{\exp(1.8) + \exp(0.2) + \exp(-0.4)} = \frac{6.05}{6.05 + 1.22 + 0.67} = \frac{6.05}{7.94} = 0.762$$

The loss for this row: $-\log(0.762) = 0.272$.

This is pretty good -- the model assigns high probability to the right match. The loss approaches 0.0 as that probability approaches 1.0, but with cosine similarities bounded in $[-1, 1]$ and a finite temperature it never reaches 0.0 exactly.
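
To put the 0.272 in context, it helps to compare it with two reference values for a single row: the loss when every candidate scores equally (chance level, $\log N$) and the smallest loss reachable when the positive has similarity 1 and every negative has similarity -1. The sketch below is illustrative only; `row_loss` is a hypothetical helper, not something used later in the notebook.

```python
import math
import torch
import torch.nn.functional as F

def row_loss(sims, tau, positive_index=0):
    """Negative log softmax probability of the positive for one row of similarities."""
    logits = torch.tensor(sims) / tau
    return -F.log_softmax(logits, dim=0)[positive_index].item()

tau = 0.5
worked = row_loss([0.9, 0.1, -0.2], tau)    # the row from the worked example
chance = row_loss([0.0, 0.0, 0.0], tau)     # all candidates tied: log(N)
floor = row_loss([1.0, -1.0, -1.0], tau)    # best case for bounded cosine scores

print(f"worked example: {worked:.4f}")
print(f"chance level:   {chance:.4f} (log 3 = {math.log(3):.4f})")
print(f"per-row floor:  {floor:.4f}")
```

For $N$ candidates the per-row floor is $\log\left(1 + (N-1)\,e^{-2/\tau}\right)$, which is about 0.036 for $N = 3$ and $\tau = 0.5$. Lowering the temperature pushes this floor toward zero, while a larger batch raises both the floor and the chance-level loss.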

In [None]:
# Let us verify the InfoNCE computation step by step
sims = torch.tensor([[0.9, 0.1, -0.2],
                      [0.2, 0.85, 0.0],
                      [-0.1, 0.15, 0.95]])
tau = 0.5

# Scale by temperature
scaled = sims / tau
print("Scaled similarities:")
print(scaled)

# Apply softmax along rows
probs = F.softmax(scaled, dim=1)
print("\nSoftmax probabilities:")
print(probs)

# The correct matches are on the diagonal
labels = torch.arange(3)
loss = F.cross_entropy(scaled, labels)
print(f"\nInfoNCE loss: {loss.item():.4f}")
print(f"(Manual check: {-torch.log(probs[0, 0]).item():.4f} for row 0)")

## 4. Let's Build It -- Component by Component

### 4.1 The Similarity Matrix

Let us start by implementing the similarity matrix computation.

In [None]:
def compute_similarity_matrix(embeddings_a, embeddings_b, temperature=0.07):
    """
    Compute the NxN temperature-scaled cosine similarity matrix between two sets of embeddings.

    Args:
        embeddings_a: (N, D) embeddings (normalized inside this function)
        embeddings_b: (N, D) embeddings (normalized inside this function)
        temperature: temperature scaling factor

    Returns:
        (N, N) cosine similarity matrix divided by temperature
    """
    # Normalize embeddings to unit vectors
    a_norm = F.normalize(embeddings_a, dim=-1)
    b_norm = F.normalize(embeddings_b, dim=-1)

    # Cosine similarity matrix via matrix multiplication
    similarity = a_norm @ b_norm.T  # (N, N)

    # Scale by temperature
    return similarity / temperature

# Test it
N, D = 4, 8
a = torch.randn(N, D)
b = torch.randn(N, D)

sim_matrix = compute_similarity_matrix(a, b, temperature=0.1)
print(f"Similarity matrix shape: {sim_matrix.shape}")
print(f"Values range: [{sim_matrix.min():.2f}, {sim_matrix.max():.2f}]")

In [None]:
# Visualize the similarity matrix
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Random embeddings (before training)
random_sims = compute_similarity_matrix(
    torch.randn(8, 32), torch.randn(8, 32), temperature=1.0
)
im1 = axes[0].imshow(random_sims.detach().numpy(), cmap='RdYlGn', vmin=-1, vmax=1)
axes[0].set_title('Random Embeddings\n(Before Training)', fontsize=13)
axes[0].set_xlabel('Text Index')
axes[0].set_ylabel('Image Index')
plt.colorbar(im1, ax=axes[0])

# Ideal case: identity-like matrix
ideal = torch.eye(8) * 0.95 + torch.randn(8, 8) * 0.05
im2 = axes[1].imshow(ideal.numpy(), cmap='RdYlGn', vmin=-1, vmax=1)
axes[1].set_title('Aligned Embeddings\n(After Training)', fontsize=13)
axes[1].set_xlabel('Text Index')
axes[1].set_ylabel('Image Index')
plt.colorbar(im2, ax=axes[1])

plt.tight_layout()
plt.show()

### 4.2 The InfoNCE Loss Function

Now let us implement the full InfoNCE loss.

In [None]:
def info_nce_loss(embeddings_a, embeddings_b, temperature=0.07):
    """
    Compute symmetric InfoNCE loss.

    Args:
        embeddings_a: (N, D) first set of embeddings (e.g., images)
        embeddings_b: (N, D) second set of embeddings (e.g., texts)
        temperature: temperature parameter

    Returns:
        Scalar loss value
    """
    # Normalize
    a_norm = F.normalize(embeddings_a, dim=-1)
    b_norm = F.normalize(embeddings_b, dim=-1)

    # Compute similarity matrix
    logits = (a_norm @ b_norm.T) / temperature  # (N, N)

    # Labels: the diagonal entries are the correct matches
    labels = torch.arange(logits.size(0), device=logits.device)

    # Symmetric loss: a->b and b->a
    loss_a_to_b = F.cross_entropy(logits, labels)
    loss_b_to_a = F.cross_entropy(logits.T, labels)

    return (loss_a_to_b + loss_b_to_a) / 2

# Test with known good alignment
good_a = torch.randn(4, 16)
good_b = good_a + torch.randn(4, 16) * 0.1  # Slight noise
loss_good = info_nce_loss(good_a, good_b, temperature=0.5)

# Test with random alignment
bad_a = torch.randn(4, 16)
bad_b = torch.randn(4, 16)
loss_bad = info_nce_loss(bad_a, bad_b, temperature=0.5)

print(f"Loss with aligned pairs: {loss_good.item():.4f}")
print(f"Loss with random pairs:  {loss_bad.item():.4f}")
print(f"Aligned loss is lower? {(loss_good < loss_bad).item()}")

## 5. Your Turn

### TODO: Implement Temperature-Scaled Softmax Visualization

In [None]:
def visualize_temperature_effect(similarities, temperatures):
    """
    Visualize how temperature affects the softmax distribution.

    Args:
        similarities: 1D tensor of similarity scores
        temperatures: list of temperature values to compare
    """
    fig, axes = plt.subplots(1, len(temperatures), figsize=(5*len(temperatures), 4))

    for idx, tau in enumerate(temperatures):
        # ============ TODO ============
        # Step 1: Scale similarities by temperature (divide by tau)
        # Step 2: Apply softmax to get probabilities
        # Step 3: Plot as a bar chart on axes[idx]
        # ==============================

        scaled = ???  # YOUR CODE HERE
        probs = ???   # YOUR CODE HERE

        axes[idx].bar(range(len(probs)), probs.detach().numpy(), color='steelblue')
        axes[idx].set_title(f'tau = {tau}', fontsize=13)
        axes[idx].set_xlabel('Index')
        axes[idx].set_ylabel('Probability')
        axes[idx].set_ylim(0, 1.05)

    plt.tight_layout()
    plt.show()

# Test data: one high similarity, rest low
test_sims = torch.tensor([0.9, 0.2, 0.1, -0.1, 0.0])

In [None]:
# Verification
test_sims = torch.tensor([0.9, 0.2, 0.1, -0.1, 0.0])
visualize_temperature_effect(test_sims, [0.07, 0.5, 2.0])
# You should see: sharp peak at tau=0.07, moderate at tau=0.5, flat at tau=2.0
print("Check: Does the distribution get sharper as temperature decreases?")

### TODO: Implement Contrastive Accuracy

In [None]:
def contrastive_accuracy(embeddings_a, embeddings_b, temperature=0.07):
    """
    Compute how often the model picks the correct match.
    For each image, the predicted text is the one with highest similarity.

    Args:
        embeddings_a: (N, D) first set of embeddings
        embeddings_b: (N, D) second set of embeddings
        temperature: temperature parameter

    Returns:
        Accuracy as a float between 0 and 1
    """
    # ============ TODO ============
    # Step 1: Normalize both sets of embeddings
    # Step 2: Compute the NxN similarity matrix
    # Step 3: For each row, find the index of maximum similarity (argmax)
    # Step 4: Compare with ground truth labels (0, 1, 2, ..., N-1)
    # Step 5: Return the fraction of correct matches
    # ==============================

    accuracy = ???  # YOUR CODE HERE
    return accuracy

In [None]:
# Verification
aligned_a = torch.randn(10, 32)
aligned_b = aligned_a + torch.randn(10, 32) * 0.01  # Nearly identical
acc = contrastive_accuracy(aligned_a, aligned_b)
assert acc > 0.9, f"Expected high accuracy for aligned pairs, got {acc}"
print(f"Accuracy for aligned pairs: {acc:.2f}")

random_a = torch.randn(10, 32)
random_b = torch.randn(10, 32)
acc_rand = contrastive_accuracy(random_a, random_b)
print(f"Accuracy for random pairs: {acc_rand:.2f}")
print("Correct! Aligned pairs should have high accuracy, random pairs should be near 1/N")

## 6. Putting It All Together

Let us build a complete contrastive learning system on synthetic 2D data. We will create data points from different clusters and learn an encoder that maps them into an embedding space where similar points cluster together.

In [None]:
# Generate synthetic paired data
# Each "pair" consists of a point and a slightly perturbed version of it
def generate_contrastive_pairs(n_clusters=5, points_per_cluster=20, noise=0.3):
    """Generate paired data from clusters for contrastive learning."""
    centers = torch.randn(n_clusters, 2) * 3  # Cluster centers
    anchors = []
    positives = []
    labels = []

    for c in range(n_clusters):
        for _ in range(points_per_cluster):
            point = centers[c] + torch.randn(2) * 0.5
            augmented = point + torch.randn(2) * noise
            anchors.append(point)
            positives.append(augmented)
            labels.append(c)

    return (torch.stack(anchors), torch.stack(positives),
            torch.tensor(labels))

anchors, positives, labels = generate_contrastive_pairs()
print(f"Generated {len(anchors)} pairs from 5 clusters")
print(f"Input dimension: {anchors.shape[1]}")

In [None]:
# Visualize the raw data
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
colors = plt.cm.Set2(labels.numpy() / labels.max().item())

axes[0].scatter(anchors[:, 0].numpy(), anchors[:, 1].numpy(),
                c=colors, s=40, label='Anchors', marker='o')
axes[0].set_title('Anchor Points (by cluster)', fontsize=13)
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')

axes[1].scatter(anchors[:, 0].numpy(), anchors[:, 1].numpy(),
                c=colors, s=30, marker='o', alpha=0.5, label='Anchors')
axes[1].scatter(positives[:, 0].numpy(), positives[:, 1].numpy(),
                c=colors, s=30, marker='^', alpha=0.5, label='Positives')
for i in range(0, len(anchors), 5):
    axes[1].plot([anchors[i, 0], positives[i, 0]],
                 [anchors[i, 1], positives[i, 1]], 'k-', alpha=0.1)
axes[1].set_title('Contrastive Pairs (linked)', fontsize=13)
axes[1].legend()

plt.tight_layout()
plt.show()

In [None]:
# Build a simple encoder network
class ContrastiveEncoder(nn.Module):
    def __init__(self, input_dim=2, hidden_dim=64, embed_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, x):
        return self.encoder(x)

encoder = ContrastiveEncoder(input_dim=2, hidden_dim=64, embed_dim=16)
print(f"Encoder parameters: {sum(p.numel() for p in encoder.parameters()):,}")

## 7. Training and Results

In [None]:
# Training loop
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
temperature = 0.1
num_epochs = 200
batch_size = 32
losses = []

for epoch in range(num_epochs):
    # Shuffle data
    perm = torch.randperm(len(anchors))
    epoch_loss = 0
    n_batches = 0

    for start in range(0, len(anchors), batch_size):
        idx = perm[start:start + batch_size]
        batch_a = anchors[idx]
        batch_p = positives[idx]

        # Encode both views
        emb_a = encoder(batch_a)
        emb_p = encoder(batch_p)

        # Compute contrastive loss
        loss = info_nce_loss(emb_a, emb_p, temperature=temperature)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        n_batches += 1

    avg_loss = epoch_loss / n_batches
    losses.append(avg_loss)

    if (epoch + 1) % 50 == 0:
        print(f"Epoch {epoch+1}/{num_epochs} | Loss: {avg_loss:.4f}")

In [None]:
# Visualize training progress
plt.figure(figsize=(8, 4))
plt.plot(losses, color='steelblue', linewidth=2)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('InfoNCE Loss', fontsize=12)
plt.title('Contrastive Learning Training Progress', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Visualize learned embeddings
with torch.no_grad():
    learned_embeddings = encoder(anchors)
    learned_norm = F.normalize(learned_embeddings, dim=-1)

# Use PCA to project to 2D for visualization
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
emb_2d = pca.fit_transform(learned_norm.numpy())

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Before training (raw input)
axes[0].scatter(anchors[:, 0].numpy(), anchors[:, 1].numpy(),
                c=colors, s=40)
axes[0].set_title('Input Space (Raw Data)', fontsize=13)

# After training (learned embeddings)
axes[1].scatter(emb_2d[:, 0], emb_2d[:, 1], c=colors, s=40)
axes[1].set_title('Learned Embedding Space\n(After Contrastive Training)', fontsize=13)

plt.tight_layout()
plt.show()
print("Clusters should be more clearly separated in the embedding space!")

## 8. Final Output

In [None]:
# Final demonstration: use the learned encoder for retrieval
with torch.no_grad():
    all_embeddings = F.normalize(encoder(anchors), dim=-1)

# Pick a query point and find its nearest neighbors
query_idx = 0
query_emb = all_embeddings[query_idx:query_idx+1]
similarities = (query_emb @ all_embeddings.T).squeeze()

# Top 5 most similar, excluding the query itself
top_k = 5
similarities[query_idx] = float('-inf')  # never return the query as its own neighbor
top_indices = similarities.topk(top_k).indices

print(f"Query point cluster: {labels[query_idx].item()}")
print(f"Top-{top_k} most similar points:")
for rank, idx in enumerate(top_indices):
    verdict = "correct" if labels[idx] == labels[query_idx] else "WRONG"
    print(f"  Rank {rank+1}: cluster={labels[idx].item()}, similarity={similarities[idx]:.3f}")
    print(f"    -> {verdict}")

# Compute retrieval accuracy across all points
all_sims = all_embeddings @ all_embeddings.T
all_sims.fill_diagonal_(float('-inf'))  # a point must not retrieve itself
nearest = all_sims.argmax(dim=1)
retrieval_acc = (labels[nearest] == labels).float().mean().item()
print(f"\nOverall nearest-neighbor retrieval accuracy: {retrieval_acc:.1%}")
print("Congratulations! You have built a contrastive learning system from scratch!")

## 9. Reflection and Next Steps

### Reflection Questions
1. What would happen if we used a very small batch size (e.g., 2)? How would the number of negatives affect learning?
2. Why is temperature important? What happens if tau is too small or too large?
3. Could the embeddings collapse to all be the same vector? Why or why not?
4. How is InfoNCE different from a standard triplet loss?

### Optional Challenges
1. Try different temperature values (0.01, 0.1, 0.5, 1.0) and plot the final embedding quality for each.
2. Increase the number of clusters to 20. Does the model still separate them well?
3. Implement a version where the encoder is a deeper network. Does depth help?
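
For the third challenge, one possible starting point is a small factory that builds the same kind of MLP with a configurable number of hidden layers. This is only a sketch: `make_encoder` and its `depth` argument are hypothetical names, and `depth=2` is chosen so that it reproduces the layer layout of `ContrastiveEncoder` above.

```python
import torch
import torch.nn as nn

def make_encoder(input_dim=2, hidden_dim=64, embed_dim=16, depth=2):
    """Build an MLP encoder with `depth` hidden layers (hypothetical helper)."""
    layers = []
    in_dim = input_dim
    for _ in range(depth):
        layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
        in_dim = hidden_dim
    layers.append(nn.Linear(in_dim, embed_dim))  # final projection, no activation
    return nn.Sequential(*layers)

for depth in (2, 4, 6):
    model = make_encoder(depth=depth)
    n_params = sum(p.numel() for p in model.parameters())
    out = model(torch.randn(5, 2))
    print(f"depth={depth}: {n_params:,} parameters, output shape {tuple(out.shape)}")
```

Each variant can be trained with the same loop as before by passing `model.parameters()` to the optimizer, which keeps the comparison across depths fair.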