# Notebook 2: Contrastive Learning (SimCLR, MoCo)

One of the most impactful paradigms in self-supervised learning for vision is contrastive learning. In contrastive learning, the model learns by comparing representations of data: it is trained to pull representations of similar inputs closer together and push those of different inputs farther apart in feature space. The core idea is to create pairs (or sets) of examples where we know some are “positive” (should be similar) and others “negative” (should be dissimilar), even without manual labels. How can we get such information without labels? Through data augmentation and instance identity.

**Instance Discrimination:** Modern contrastive methods like SimCLR treat each individual image as its own class during pretraining. That is, an image and its augmented version form a positive pair (they should have similar embeddings), whereas an image and any other image form a negative pair (they should have dissimilar embeddings). By training the network with this objective, it learns to encode the distinguishing features of each image.

Illustration of contrastive learning: each image (e.g., a photo of a dog) is augmented twice to create two different views. The network produces embeddings for each view. A contrastive loss then attracts embeddings from the same image (green arrows) and repels embeddings from different images (red arrows).

The prototypical example of contrastive learning in SSL is SimCLR (A Simple Framework for Contrastive Learning of Visual Representations) by Chen et al., 2020. SimCLR demonstrated that a simple setup with large-batch contrastive training can learn very powerful representations:

- **Data Augmentation:** For each image, SimCLR generates two random augmentations (crops, color jitter, flip, blur, etc.). These two augmented images are treated as a positive pair, since they originated from the same source image. All other images in the batch are negatives.
- **Neural Network Encoder:** Both augmented images are passed through a base encoder network (e.g., ResNet) to obtain feature vectors. SimCLR then uses a small projection MLP to map these features into a latent space where the contrastive loss is applied. (The projection head was found to improve the quality of the underlying features, i.e., the encoder’s output, because the contrastive objective acts on the MLP output rather than directly on the encoder features. After pretraining, the head is discarded and the encoder output is used for downstream tasks.)
- **Contrastive Loss (NT-Xent):** The loss function used is a normalized temperature-scaled cross-entropy loss (also called InfoNCE). For a given pair of positive examples (i and j are two augmentations of the same image), the goal is to identify j among a set of negatives. Concretely, the similarity (often cosine similarity) between the representation of i and j should be maximized, while the similarity between i and any other k (a different image’s representation) is minimized. The loss pushes the dot product of positives higher than that of any negatives. Intuitively, the model learns to cluster augmented views of the same image in embedding space and separate clusters for different images.
- **Large Batch = Many Negatives:** SimCLR relies on very large batch sizes (e.g., 256, 1024, even 8192) so that each batch provides many negative examples. The more negatives, the harder the discrimination task, which forces the model to learn more nuanced features to tell every image apart from every other.
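
To make the loss bullet concrete: for a positive pair $(i, j)$ among the $2N$ augmented views in a batch, the SimCLR paper writes the loss as $\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$, where $\mathrm{sim}$ is cosine similarity and $\tau$ is the temperature. The following is a minimal sketch of that formula in PyTorch, not the authors' implementation; the function name `nt_xent_loss` and the default temperature are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Sketch of NT-Xent over 2N views (illustrative, not the reference code).

    z1[i] and z2[i] are projections of two augmentations of image i.
    For every view, its partner is the positive and the remaining
    2N - 2 views in the batch act as negatives.
    """
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), unit length
    sim = z @ z.t() / temperature                          # (2N, 2N) scaled cosine similarities
    # A view must not be compared with itself: mask the diagonal out of the softmax.
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))
    # Row i (first view) has its positive at column i + N, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Tiny usage example with random "embeddings" (batch of 8, dimension 16).
torch.manual_seed(0)
z1, z2 = torch.randn(8, 16), torch.randn(8, 16)
print(nt_xent_loss(z1, z2).item())
```

Writing the loss as a cross-entropy over each row of the similarity matrix makes the "identify j among the negatives" reading explicit: each view is a query and its partner view is the correct class.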

After training, the encoder can produce image embeddings that cluster semantically: even though the model never saw class labels, images of the same object or concept often end up with similar representations. These representations can be evaluated by training a simple linear classifier on top of the frozen encoder (a common protocol to measure representation quality). SimCLR achieved remarkable results: for instance, a linear classifier on SimCLR features reached 76.5% top-1 accuracy on ImageNet using a widened ResNet-50 (4×) encoder, matching a supervised ResNet-50, and with fine-tuning it surpassed supervised learning in low-label settings.

**Key discovery:** The combination of strong data augmentation, a big network, a projection head, and lots of negatives was sufficient to learn high-quality visual features without any manual labels. This was a breakthrough in 2020, greatly narrowing the gap between unsupervised and supervised learning on ImageNet.

Another influential method is MoCo (Momentum Contrast) by He et al., 2020. MoCo shares the same goal as SimCLR (instance discrimination via contrastive loss) but introduces a couple of innovations to make training more efficient with limited batch sizes:

- **Memory Bank / Queue:** Instead of relying on extremely large batches for negatives, MoCo maintains a memory queue of feature vectors from recent batches. This effectively provides a large set of negative samples without needing them all in the current batch. After each batch, the new embeddings are enqueued and the oldest are dequeued, keeping the queue size fixed (e.g., 65k).
- **Momentum Encoder:** A potential issue with using a memory bank is that the encoder keeps changing during training, so cached features (negatives) become stale. MoCo addresses this by having two networks: a query encoder (updated normally by backpropagation) and a key encoder that is updated slowly by momentum – i.e., the key encoder’s weights are a moving average of the query encoder’s weights. This momentum update (with a factor like 0.999) means the key encoder evolves more smoothly, so the features in the queue (computed by past key encoders) remain more consistent even as training progresses. In effect, the key encoder is "lagging behind" the query encoder, providing a form of consistency for negative comparisons.
- **Contrastive Setup:** For each image, MoCo creates a query (e.g., an augmented image passed through the query encoder) and a key (another augmentation passed through the key encoder). The positive pair is the query vs. its corresponding key. The negatives are the other keys in the queue. The loss is again InfoNCE. Only the query encoder is updated by gradients; the key encoder is updated by momentum.
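
The two mechanisms above (momentum update and fixed-size queue) can be sketched in a few lines. This is an illustrative sketch under simple assumptions, not MoCo's reference code: the names `momentum_update` and `KeyQueue` are hypothetical, the encoders are tiny linear layers, and each enqueued batch is assumed to be no larger than the queue.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    """key <- m * key + (1 - m) * query, applied parameter by parameter."""
    for q_param, k_param in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_param.mul_(m).add_(q_param, alpha=1.0 - m)

class KeyQueue:
    """Fixed-size FIFO buffer of key embeddings used as negatives (hypothetical helper)."""
    def __init__(self, dim, size):
        self.keys = F.normalize(torch.randn(size, dim), dim=1)  # random initial keys
        self.size = size
        self.ptr = 0  # position of the oldest entry

    @torch.no_grad()
    def enqueue_dequeue(self, new_keys):
        # Overwrite the oldest entries with the newest keys (assumes batch <= queue size).
        b = new_keys.size(0)
        idx = (self.ptr + torch.arange(b)) % self.size
        self.keys[idx] = new_keys.detach()
        self.ptr = (self.ptr + b) % self.size

# Usage: the key encoder starts as a copy of the query encoder and receives no gradients.
query_encoder = nn.Linear(4, 2)
key_encoder = copy.deepcopy(query_encoder)
for p in key_encoder.parameters():
    p.requires_grad_(False)

queue = KeyQueue(dim=2, size=8)
x = torch.randn(3, 4)
with torch.no_grad():
    k = F.normalize(key_encoder(x), dim=1)
queue.enqueue_dequeue(k)                      # newest keys replace the oldest ones
momentum_update(query_encoder, key_encoder)   # called after each optimizer step on the query encoder
```

With `m` close to 1 the key encoder changes only slightly per step, which is what keeps older entries in the queue roughly comparable to freshly computed keys.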

This design decouples the dictionary of negatives from batch size – one can use a moderate batch and still have a huge set of negatives accumulated in the queue. MoCo can also be seen as continuously building a dynamic dictionary of feature keys: as training goes, new images populate the dictionary and old ones leave, and the dictionary is looked up via the contrastive loss (matching queries to their keys).
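
The dictionary look-up described above can be written as a classification problem: each query has one positive key and the queue supplies the negatives. Below is a minimal sketch assuming the queue already holds L2-normalized keys; the function name `moco_loss` and the temperature default are assumptions for illustration rather than the official implementation.

```python
import torch
import torch.nn.functional as F

def moco_loss(q, k, queue, temperature=0.07):
    """InfoNCE with queue negatives (illustrative sketch).

    q:     (N, D) query embeddings from the query encoder
    k:     (N, D) positive key embeddings from the key encoder
    queue: (K, D) normalized key embeddings from earlier batches (negatives)
    """
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1).detach()            # no gradient flows into the keys
    l_pos = (q * k).sum(dim=1, keepdim=True)      # (N, 1) similarity to the positive key
    l_neg = q @ queue.t()                         # (N, K) similarity to every queued key
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    # The positive always sits in column 0, so the "correct class" is 0 for every query.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)

# Usage with random tensors: 4 queries, 32 queued negatives, dimension 8.
torch.manual_seed(0)
q = torch.randn(4, 8, requires_grad=True)
k = torch.randn(4, 8)
queue = F.normalize(torch.randn(32, 8), dim=1)
loss = moco_loss(q, k, queue)
loss.backward()   # gradients reach q (the query side) only
```

Because the number of negatives is the queue length rather than the batch size, the same loss works unchanged with a batch of 4 or of 4096.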

Conceptual diagram of MoCo: A query image (left) and a key image (a different augmentation of the same image) are encoded by two networks. The key encoder is updated via momentum from the query encoder (light blue). A queue of past key representations $\{k_0, k_1, \dots\}$ serves as a large set of negatives. The contrastive loss is computed such that the query $q$ should match its positive key $k^+$ and be distinct from other keys in the queue.

In practice, MoCo showed that a memory mechanism can stand in for gigantic batches. MoCo v1 achieved about 60% top-1 on ImageNet with linear evaluation (ResNet-50), and subsequent improvements (MoCo v2, v3) pushed these numbers higher, closing the gap with SimCLR.

Both SimCLR and MoCo are contrastive approaches and use a similar loss (InfoNCE). Their success has led to many follow-up works, as well as analyses showing that these methods learn features encoding object identity even though the pretext task was just instance discrimination. However, contrastive learning is not the end of the story: later in the course, we will also see non-contrastive approaches (like BYOL, SimSiam) that surprisingly learn good features without explicitly contrasting examples.


### Contrastive Loss in Code (SimCLR-style)

To solidify understanding, let's implement a simplified version of contrastive training for images. We will:
- Use a small CNN as the encoder.
- Define two random augmentation transforms.
- Generate pairs of images and compute the contrastive loss (InfoNCE).

Due to resource limits, we'll train on a small subset and for only a few epochs, just to illustrate the process.

**First, some utility code for augmentations and the loss:**

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as T

# Define random augmentation for SimCLR (similar to those in the paper)
augment = T.Compose([
    T.RandomResizedCrop(size=32),   # random crop (for CIFAR, original size 32, so acts as random crop+scale)
    T.RandomHorizontalFlip(),       # random flip
    T.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.2),  # color jitter
    T.RandomGrayscale(p=0.2),
    T.ToTensor()
])

# Simple CNN encoder (for 32x32 images like CIFAR-10)
class SmallCNN(nn.Module):
    def __init__(self, out_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),  # 16x16
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(), # 8x8
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),# 4x4
            nn.Flatten()
        )
        self.fc = nn.Linear(128*4*4, out_dim)  # project to feature vector
    def forward(self, x):
        x = self.conv(x)
        x = self.fc(x)
        return x

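# InfoNCE loss implementation (simplified): for each view in aug1, the negatives are
# the other images' views in aug2 (and vice versa). SimCLR's full NT-Xent loss also
# uses the other images' views from the same augmentation batch as negatives.
def info_nce_loss(z_i, z_j, temperature=0.5):
    # z_i and z_j are two sets of feature vectors (aug1 and aug2 for the batch)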
    # Normalize the vectors
    z_i = F.normalize(z_i, dim=1)
    z_j = F.normalize(z_j, dim=1)
    batch_size = z_i.size(0)
    # Compute similarity matrix (batch_size x batch_size)
    sim_matrix = torch.mm(z_i, z_j.t())  # dot products between all i in aug1 and all j in aug2
    # For contrastive loss, we want sim(i,j) for positive pairs (i and its j) to be high relative to others.
    # Create labels: 0,...,batch_size-1 where each i's positive is j at same index.
    labels = torch.arange(batch_size, device=z_i.device)
    # Scale by temperature (out-of-place, so the matmul output is not modified in place)
    sim_matrix = sim_matrix / temperature
    # Apply cross-entropy loss: each row i should have label i (meaning j = i is the correct match)
    loss_i = F.cross_entropy(sim_matrix, labels)
    loss_j = F.cross_entropy(sim_matrix.t(), labels)
    # Final loss is average of both directions
    return 0.5 * (loss_i + loss_j)

**Now, let's simulate training on CIFAR-10 (we'll use only a subset for speed). Note: This is a toy run; real contrastive training would require many more epochs and possibly a larger model and GPU for good results. But we'll observe if the loss decreases and maybe examine the learned embeddings:**

In [None]:
import torchvision
from torch.utils.data import DataLoader, Subset

# Load CIFAR-10 dataset
train_dataset = torchvision.datasets.CIFAR10(root='data', train=True, download=True)
# We'll ignore labels entirely for self-supervised training
# Use only a subset of data for demonstration (e.g., 10000 images)
subset_indices = list(range(10000))
train_subset = Subset(train_dataset, subset_indices)

# DataLoader with our augmentation applied on the fly
def simclr_collate_fn(batch):
    # Custom collate: apply augment to get two views for each image
    images, _ = zip(*batch)
    aug1 = [augment(img) for img in images]
    aug2 = [augment(img) for img in images]
    # Stack into tensors
    aug1 = torch.stack(aug1, dim=0)
    aug2 = torch.stack(aug2, dim=0)
    return aug1, aug2

loader = DataLoader(train_subset, batch_size=128, shuffle=True, collate_fn=simclr_collate_fn)

# Initialize model and optimizer
model = SmallCNN(out_dim=128)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

# Training loop (few epochs)
for epoch in range(5):
    total_loss = 0.0
    model.train()
    for aug1, aug2 in loader:
        aug1, aug2 = aug1.to(device), aug2.to(device)
        # Compute features
        z1 = model(aug1)
        z2 = model(aug2)
        # Contrastive loss
        loss = info_nce_loss(z1, z2, temperature=0.5)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    avg_loss = total_loss / len(loader)
    print(f"Epoch {epoch+1}, InfoNCE Loss: {avg_loss:.4f}")

You should see the InfoNCE loss decrease over epochs, indicating the model is getting better at pulling together augmented views and pushing apart different images. After training, the model (model.conv followed by model.fc) provides 128-dim representations. We can evaluate how meaningful they are via a simple linear probe on CIFAR-10 labels, or qualitatively by checking nearest neighbors in the representation space.

For brevity, let's do a quick check of nearest neighbors: pick a random image and find the closest image in the subset by cosine similarity of the learned features:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Switch model to eval
model.eval()
# Extract features for subset (this can be time-consuming for full data; we'll use subset again)
features = []
images = []
with torch.no_grad():
    for aug1, aug2 in loader:  # we can use one of the augmented views as representative
        aug1 = aug1.to(device)
        feat = model(aug1).cpu()
        features.append(feat)
        images.append(aug1.cpu())
features = torch.cat(features)
images = torch.cat(images)
features = F.normalize(features, dim=1)  # normalize for cosine similarity

# Pick a random image from the set and find nearest neighbor
idx = np.random.randint(len(features))
query_feat = features[idx]
# Cosine similarity with all others
sims = torch.mv(features, query_feat)  # dot product with each (since features are normalized, this is cosine)
sims[idx] = -1.0  # exclude itself
nn_idx = torch.argmax(sims).item()

# Display the query and its nearest neighbor
query_img = images[idx]
nn_img = images[nn_idx]
plt.figure(figsize=(4,2))  # wide enough for two side-by-side panels
plt.subplot(1,2,1); plt.imshow(query_img.permute(1,2,0)); plt.title("Query"); plt.axis('off')
plt.subplot(1,2,2); plt.imshow(nn_img.permute(1,2,0)); plt.title("Nearest Neighbor"); plt.axis('off')
plt.show()

Does the nearest neighbor look semantically similar to the query? Often, even with this small training, you might notice some alignment (e.g., both are greenish images, or both have similar textures). With full training (and not such a tiny model), contrastive learning yields representations where nearest neighbors are truly semantically related (e.g., pictures of dogs cluster together, etc.), despite never using labels during training. 
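
For a quantitative check, the linear probe mentioned earlier freezes the encoder and trains only a linear classifier on its features. The following is a minimal sketch of that protocol, not a tuned evaluation: the function name `linear_probe`, the full-batch SGD loop, and the hyperparameters are assumptions made for illustration, and the usage example runs on synthetic 2-D data with an identity "encoder" so it executes quickly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(encoder, x_train, y_train, x_test, y_test,
                 num_classes, epochs=200, lr=0.5):
    """Train a linear classifier on frozen encoder features; return test accuracy."""
    encoder.eval()
    with torch.no_grad():                 # the encoder is frozen: features are fixed inputs
        f_train = encoder(x_train)
        f_test = encoder(x_test)
    clf = nn.Linear(f_train.size(1), num_classes)
    opt = torch.optim.SGD(clf.parameters(), lr=lr)
    for _ in range(epochs):               # full-batch updates, for simplicity only
        opt.zero_grad()
        loss = F.cross_entropy(clf(f_train), y_train)
        loss.backward()
        opt.step()
    with torch.no_grad():
        preds = clf(f_test).argmax(dim=1)
    return (preds == y_test).float().mean().item()

# Usage on synthetic, linearly separable data (stand-in for real features and labels).
torch.manual_seed(0)
x = torch.randn(200, 2)
y = (x[:, 0] > 0).long()
acc = linear_probe(nn.Identity(), x[:150], y[:150], x[150:], y[150:], num_classes=2)
print(f"probe accuracy: {acc:.2f}")
```

To apply this to the model trained above, one would pass the trained encoder together with tensors of un-augmented CIFAR-10 images and their labels; with a model this small and so few epochs, the resulting accuracy should be treated only as a rough signal.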

**Takeaway:** Contrastive learning forces the model to learn what makes images unique. SimCLR’s loss, for example, implicitly teaches the model that an image of a dog must not be confused with a cat or a truck, because the only positive partner for a given dog image is an augmented version of itself – all other images (including other dogs in different poses) are negatives. As training progresses, the model develops internal features that cluster similar objects to minimize confusion.

In practice, many improvements and tricks have been developed on top of these basics:
- Using stop-gradient and asymmetry (BYOL, SimSiam) to avoid collapse without negatives.
- Clustering-based methods (SwAV, DeepCluster), which we won’t cover in detail here, that use group assignments as “pseudo-labels”.
- Tuning the temperature parameter in InfoNCE, which balances concentration vs. uniformity of the feature distribution.
- Architectural advances: e.g., using Vision Transformers as encoders in DINO (which is a self-distillation method combining ideas from contrastive and BYOL).

We will encounter some of these ideas in later notebooks. But first, our next notebook will shift from contrastive learning to another set of classic self-supervised tasks: pretext tasks like rotations, puzzles, and colorization.

**Bonus Exercise:** Experiment with the temperature hyperparameter in the InfoNCE loss. In the code above, we used temperature=0.5. Try reasoning about what would happen if the temperature is much lower (e.g., 0.1) or much higher (e.g., 2.0). How would that affect the training dynamics? (For an advanced exercise, if you have time and resources, you could run the training code with different temperatures to empirically observe the effect on loss convergence or feature quality.)
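
As a starting point for the exercise, the snippet below shows how temperature reshapes the softmax over a fixed set of similarities. The similarity values are made up for illustration (index 0 plays the role of the positive), and the helper name `positive_prob_and_entropy` is hypothetical.

```python
import torch
import torch.nn.functional as F

# Hypothetical cosine similarities for one anchor: index 0 is the positive,
# the remaining entries are negatives of varying difficulty.
sims = torch.tensor([0.9, 0.7, 0.2, -0.1])

def positive_prob_and_entropy(sims, temperature):
    """Return the softmax probability of the positive and the entropy of the distribution."""
    probs = F.softmax(sims / temperature, dim=0)
    entropy = -(probs * probs.log()).sum()
    return probs[0].item(), entropy.item()

for t in (0.1, 0.5, 2.0):
    p_pos, ent = positive_prob_and_entropy(sims, t)
    print(f"temperature={t}: p(positive)={p_pos:.3f}, entropy={ent:.3f}")
```

A lower temperature sharpens the distribution, so the loss (and its gradient) is dominated by the negatives most similar to the anchor; a higher temperature flattens it, so all negatives contribute more evenly.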

## References:
- Chen, T. et al. (2020). "A Simple Framework for Contrastive Learning of Visual Representations (SimCLR)." ICML. – The SimCLR paper showing simple contrastive learning can yield SOTA results.
- He, K. et al. (2020). "Momentum Contrast for Unsupervised Visual Representation Learning (MoCo)." CVPR. – Introduces the momentum encoder and queue for contrastive learning.
- Chen, X. et al. (2020). "Improved Baselines with Momentum Contrast (MoCo v2)." – An improved version of MoCo aligning more closely with SimCLR’s augmentations, closing the gap.
- SimCLR Blog (Google AI) – “Advancing Self-Supervised and Semi-Supervised Learning with SimCLR”, which provides an accessible overview of SimCLR and key findings (including linear eval accuracy).
- Grill, J.-B. et al. (2020). "Bootstrap Your Own Latent (BYOL)." – A pioneering work showing negative pairs aren’t strictly required, heralding a new wave of SSL methods beyond contrastive.
- Caron, M. et al. (2020). "Unsupervised Learning of Visual Features by Contrasting Cluster Assignments (SwAV)." – Combines contrastive learning with clustering; no explicit negative pairs, using a “swapped” prediction mechanism.
- van den Oord, A. et al. (2018). "Representation Learning with Contrastive Predictive Coding (CPC)." – Earlier work on contrastive learning (predicting the future in a sequence), which inspired InfoNCE loss usage in vision and NLP.
- Papers With Code – Contrastive Learning Methods – A repository of various contrastive and non-contrastive SSL methods for further exploration.
