# Notebook 1: Introduction to Self-Supervised Learning

Self-Supervised Learning (SSL) is a machine learning technique that leverages unlabeled data by generating its own supervisory signals (pseudo-labels) from the data itself. In other words, instead of relying on human-annotated labels, a self-supervised model creates implicit labels from the input data and learns to predict them. This approach is particularly useful in fields like computer vision and natural language processing, where labeled datasets are costly or impractical to obtain. By designing clever pretext tasks (auxiliary objectives that can be derived from unlabeled data), SSL enables a model to learn meaningful representations, which can later be applied to the tasks we actually care about (called downstream tasks), such as classification or detection.

In essence, SSL bridges the gap between supervised and unsupervised learning. It is technically a subset of unsupervised learning (no human labels are used), yet it resembles supervised learning in that the model is trained against a ground truth (albeit a ground truth obtained from the data itself). As Yann LeCun famously described, self-supervised methods amount to asking a model to "fill in the blanks": the model is given part of the data and must predict the missing piece.

Over the past few years, SSL has been instrumental in training advanced models across domains: from large transformer language models like BERT and GPT (predicting masked words, next sentences, or next tokens) to autoencoders and GANs in vision, and modern vision frameworks like SimCLR and MoCo. The ability to leverage vast amounts of unlabeled data has made SSL a cornerstone for pre-training models that can be fine-tuned with minimal labeled examples.

## Why Self-Supervision?

There are several motivations for self-supervised learning:
- **Data Abundance:** Unlabeled data (images, text, audio) is far more abundant than labeled data. SSL allows models to tap into this abundance by creating pseudo-labels. This dramatically reduces the need for manual annotation, making it time- and cost-efficient.
- **Representation Learning:** The goal of many SSL methods is to learn a rich feature representation from data without labels. These representations capture useful structure in the input (e.g., edges and shapes in images, syntactic/semantic context in text) which can be reused for various tasks via transfer learning.
- **Reduce Overfitting and Bias:** Because SSL pre-training typically uses very diverse data and tasks, the learned features tend to be more general and robust. Models like SimCLR have demonstrated that pre-training on unlabeled data and then fine-tuning can even outperform training from scratch on the labeled data, especially in low-label regimes.
- **Human Inspiration:** Humans learn from the world with minimal explicit labels – infants learn concepts by observation and inference (for example, by playing with objects or observing physics, essentially solving self-supervised tasks). SSL takes inspiration from this, aiming for models that can learn in a more human-like, label-efficient manner.

## Types of Self-Supervised Tasks

An SSL pipeline generally involves two kinds of tasks:

- **Pretext Tasks:** These are the self-supervised objectives we design for representation learning. They can be anything where we can derive a target from the data itself. Classic examples include: predicting the rotation applied to an image, solving a jigsaw puzzle arrangement of image patches, colorizing a grayscale image, predicting missing words in a sentence, etc. The pretext task is not of direct interest for deployment, but by training on it, the model learns internal features that are useful.
- **Downstream Tasks:** After a model is pre-trained on a pretext task, it is then fine-tuned or evaluated on a real task of interest (classification, segmentation, translation, etc.). The hope is that the pre-training has equipped the model with a good representation that makes learning the downstream task much easier (often requiring fewer labeled examples). This two-stage process is analogous to how one might first learn general knowledge and then specialize on a specific task.

### Common pretext tasks in vision include:
- **Context Prediction:** Given one part of an image, predict the position or content of another part (e.g., the pioneering work by Doersch et al. on predicting relative positions of patches).
- **Jigsaw Puzzle:** Break an image into patches, shuffle them, and train a network to reassemble the patches in the correct order. This task forces a model to understand global structure and context.
- **Rotation Prediction:** Rotate images by a random multiple of 90° and train a classifier to recognize the rotation angle. The model must learn about object orientation and features to do so.
- **Colorization:** Remove the color from an image and task the model with predicting the color (often in Lab color space) from the grayscale input. By learning to colorize, the model picks up on semantics (grass is green, sky is blue, etc.) without explicit labels for those objects.
- **Contrastive Learning:** Construct pairs of augmented images and train the model to identify which images originate from the same source versus different ones. Contrastive methods (covered in Notebook 2) like SimCLR and MoCo treat each image as its own class and use a contrastive loss to learn when two inputs are “similar” or “dissimilar”.
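
To make the idea of deriving labels from the data concrete, here is a minimal sketch of how a rotation-prediction batch could be built. The helper name `make_rotation_batch` is hypothetical, the random tensor stands in for real unlabeled images, and the sketch assumes square images so that the rotated copies can be stacked into one batch.

```python
import torch

def make_rotation_batch(images: torch.Tensor):
    """Build a rotation-prediction batch from unlabeled images.

    images: tensor of shape (N, C, H, W), assumed square (H == W).
    Returns (rotated, labels), where labels[i] in {0, 1, 2, 3} is the
    number of 90-degree turns applied to rotated[i].
    """
    n = images.size(0)
    # The pseudo-labels are sampled by us, not provided by an annotator
    labels = torch.randint(0, 4, (n,))
    rotated = torch.stack([
        torch.rot90(img, k=int(k), dims=(1, 2))  # rotate in the H-W plane
        for img, k in zip(images, labels)
    ])
    return rotated, labels

# Stand-in for a batch of unlabeled 28x28 grayscale images
images = torch.rand(8, 1, 28, 28)
rotated, labels = make_rotation_batch(images)
print(rotated.shape, labels.tolist())
```

A classifier with four output classes can then be trained on `(rotated, labels)` with an ordinary cross-entropy loss, even though no human ever labeled the images.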

### On the NLP side, typical pretext tasks are:
- **Masked Language Modeling (MLM):** Randomly mask out words in a sentence and have the model predict them. This is what BERT does, enabling it to capture bidirectional context.
- **Next Sentence Prediction (NSP):** Also used in BERT, where the model learns to predict if one sentence follows another, teaching it about discourse and sentence relations.
- **Autoregressive Language Modeling:** Models such as GPT predict the next word given the previous words (this is self-supervised because the next word is part of the data itself). This objective learns rich representations of language.
- **Denoising Autoencoders:** A general idea used by models like BART (for text) and by denoising image autoencoders: random noise or corruption is applied to the input, and the model learns to reconstruct the original.
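
As an illustration of the masking step behind MLM, the sketch below builds inputs and targets from a batch of token ids. It is a simplification under stated assumptions: `MASK_ID` is a hypothetical id reserved for the mask token, the token ids are random placeholders rather than the output of a real tokenizer, and every selected position is replaced by the mask token (BERT's actual recipe also leaves some selected tokens unchanged or swaps them for random ones).

```python
import torch

MASK_ID = 0          # hypothetical id reserved for the [MASK] token
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_tokens(token_ids: torch.Tensor, mask_prob: float = 0.15):
    """Return (inputs, targets) for masked language modeling.

    inputs:  copy of token_ids with randomly chosen positions set to MASK_ID.
    targets: original ids at the masked positions and IGNORE_INDEX elsewhere,
             so a cross-entropy loss only counts the masked tokens.
    """
    mask = torch.rand(token_ids.shape) < mask_prob
    inputs = token_ids.clone()
    inputs[mask] = MASK_ID
    targets = torch.full_like(token_ids, IGNORE_INDEX)
    targets[mask] = token_ids[mask]
    return inputs, targets

# Placeholder batch: 2 "sentences" of 12 token ids each (ids 1..999, so 0 is free for the mask)
token_ids = torch.randint(1, 1000, (2, 12))
inputs, targets = mask_tokens(token_ids)
print(inputs)
print(targets)
```

The model would receive `inputs` and be trained to produce `targets` at the masked positions; the supervision comes entirely from the original text.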

All these tasks share the theme: remove or alter part of the data, then predict it. As a result, self-supervised models learn useful internal representations while solving the pretext task. Once trained, we often discard the output layer (that was specific to the pretext task) and use the rest of the network (the encoder) as a pre-trained backbone for downstream tasks.

## A Simple Example: Autoencoding as Self-Supervision

One classic example of self-supervision is the autoencoder. An autoencoder trains a neural network to compress data into a latent representation and then reconstruct the original data from this representation. No human labels are needed: the training objective is simply to output the same data that was input (with some constraints like a bottleneck to prevent a trivial copying solution). This can be seen as a form of self-supervised learning because the "label" for each input is the input itself. Let’s illustrate a simple autoencoder on a vision dataset (MNIST for simplicity). The network will take in an image of a digit, compress it to a lower-dimensional code, then reconstruct the image. While this is a toy example, it demonstrates learning from data alone:


```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

# Define a simple autoencoder
class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: compress image to latent vector
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28*28, 128),
            nn.ReLU(True),
            nn.Linear(128, 32)  # latent vector of size 32
        )
        # Decoder: reconstruct image from latent vector
        self.decoder = nn.Sequential(
            nn.Linear(32, 128),
            nn.ReLU(True),
            nn.Linear(128, 28*28),
            nn.Sigmoid()  # output pixel values 0-1
        )

    def forward(self, x):
        z = self.encoder(x)
        x_recon = self.decoder(z)
        return x_recon

# Load MNIST dataset (as unlabeled data: the labels are never used)
transform = T.ToTensor()
train_data = torchvision.datasets.MNIST(root='data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True)

# Initialize model, loss, optimizer
autoencoder = Autoencoder()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

# Training loop (one epoch for a quick demo; use more epochs for actual training)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
autoencoder.to(device)
for epoch in range(1):  # adjust epochs as needed
    running_loss = 0.0
    for images, _ in train_loader:  # labels are discarded; the input is its own target
        images = images.to(device)
        # Forward pass
        recon = autoencoder(images)
        # Compute reconstruction loss against the flattened input
        loss = criterion(recon, images.view(images.size(0), -1))
        # Backprop and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * images.size(0)
    # Report the average loss over the epoch rather than the last batch only
    epoch_loss = running_loss / len(train_loader.dataset)
    print(f"Average training loss (MSE) for epoch {epoch+1}: {epoch_loss:.4f}")
```

Running the above will train the autoencoder (the training loss should decrease as the model learns to reconstruct digits). After training, we can visualize a few reconstructions to see what the model has learned:

```python
%matplotlib inline
import matplotlib.pyplot as plt

# Get a batch of test images
test_data = torchvision.datasets.MNIST(root='data', train=False, download=True, transform=transform)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=8, shuffle=True)
images, _ = next(iter(test_loader))
images = images.to(device)
with torch.no_grad():
    reconstructions = autoencoder(images).cpu().view(-1, 1, 28, 28)

# Plot original vs reconstructed
fig, axes = plt.subplots(2, 8, figsize=(8, 2))
for i in range(8):
    axes[0, i].imshow(images[i].cpu().squeeze(), cmap='gray')
    axes[0, i].axis('off')
    axes[1, i].imshow(reconstructions[i].squeeze(), cmap='gray')
    axes[1, i].axis('off')
plt.suptitle("Original images (top row) vs Autoencoder reconstructions (bottom row)")
plt.show()
```

In this autoencoder example, the pretext task was image reconstruction: the model learned to encode the image into a 32-dimensional vector such that it could decode it back. Although reconstructing digits is not our goal in practice, the hidden 32-dimensional representation hopefully captures salient features of the digits (strokes, shape, etc.). This representation could be used as input to a downstream classifier, or the encoder could be fine-tuned for a supervised task. Indeed, early unsupervised learning methods like autoencoders and variational autoencoders (VAEs) were forerunners of modern SSL.
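
One way to use the learned representation downstream is a linear probe: freeze the encoder and train only a small classifier on its 32-dimensional codes. The sketch below is illustrative rather than a tuned recipe. To stay self-contained it builds a stand-in `encoder` with the same layer sizes as the autoencoder above and feeds it random placeholder data; in the notebook you would use `autoencoder.encoder` and real labeled MNIST batches instead. The names `probe`, the learning rate, and the number of steps are assumptions chosen for the demo.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the trained encoder (same layer sizes as the autoencoder's encoder).
# In the notebook, replace this with: encoder = autoencoder.encoder
encoder = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, 32),
)

# Freeze the encoder so that only the classifier head is trained
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()

probe = nn.Linear(32, 10)  # linear classifier over the 10 digit classes
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

# Placeholder labeled batch (random data, only to show shapes and the training step)
images = torch.rand(64, 1, 28, 28)
labels = torch.randint(0, 10, (64,))

for step in range(5):
    with torch.no_grad():
        features = encoder(images)   # (64, 32) latent codes
    logits = probe(features)         # (64, 10) class scores
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: probe loss {loss.item():.4f}")
```

Keeping the encoder frozen measures how much class information the pretext task already put into the representation; unfreezing it and training with a small learning rate would correspond to fine-tuning instead.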

## Transfer Learning and SSL

One of the major benefits of self-supervised pre-training is seen in transfer learning. After performing SSL on a large corpus of data, we obtain a model (or feature extractor) that can be transferred to tasks with limited labeled data. A prominent success story is in NLP: models like BERT are pre-trained on massive text corpora via MLM and NSP and then fine-tuned on tasks like sentiment analysis or question answering, often achieving state-of-the-art results with far fewer task-specific examples than would otherwise be needed. Similarly, in vision, a model like MoCo or SimCLR pre-trained on millions of unlabeled images can be fine-tuned on a small labeled dataset (for example, 1% of the ImageNet labels) and still yield high accuracy. We will see concrete examples of this in later notebooks.

**Summary:** In this introductory notebook, we covered what self-supervised learning is and why it’s important. We highlighted various self-supervised tasks and how they enable learning from unlabeled data. In the next notebook, we will dive deeper into one of the most popular approaches in SSL for vision: contrastive learning, exemplified by SimCLR and MoCo.

**Bonus Exercise:** Think of a creative self-supervised task for a domain of your choice (e.g., audio, text, or even time-series data). For instance, for audio, you might remove a segment of a waveform and train a model to predict it (a sort of audio inpainting). Outline how you would set up this pretext task and what representation the model might learn. (No code solution needed — this is an open design question to spur creativity.)

## References:
- LeCun, Y. & Misra, I. (2021). "Self-Supervised Learning: The Dark Matter of Intelligence." (Facebook AI Blog). – Visionary discussion on why SSL is critical for AI, describing it as the "dark matter of intelligence".
- Chen, T. et al. (2020). "A Simple Framework for Contrastive Learning of Visual Representations." – Proposed SimCLR, a seminal contrastive SSL method.
- He, K. et al. (2020). "Momentum Contrast for Unsupervised Visual Representation Learning." – Introduced MoCo, which uses a queue for negatives and a momentum encoder.
- Noroozi, M. & Favaro, P. (2016). "Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles." – Established jigsaw puzzle as a pretext task.
- Gidaris, S. et al. (2018). "Unsupervised Representation Learning by Predicting Image Rotations." – Showed that predicting 0°, 90°, 180°, 270° rotations yields strong features.
- Zhang, R. et al. (2016). "Colorful Image Colorization." – Used colorization of grayscale images as a self-supervised task (treating L channel as input and ab as output).
- IBM Cloud Education (2023). "What is self-supervised learning?" (IBM Think Blog) – A clear overview of SSL concepts and differences from supervised/unsupervised.
- MNIST Dataset – A simple image dataset of handwritten digits used here for the autoencoder demo.
