In [None]:
# === Environment Setup ===
import os, sys, math, time, random, json, textwrap, warnings
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
try:
    import tensorflow as tf
    from tensorflow.keras import layers, Model
    from tensorflow.keras.datasets import fashion_mnist, mnist
    TENSORFLOW_AVAILABLE = True
except ImportError:
    TENSORFLOW_AVAILABLE = False
from IPython.display import display, Markdown, Image

# --- Configuration ---
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams.update({'font.size': 14, 'figure.figsize': (12, 8), 'figure.dpi': 150})
np.set_printoptions(suppress=True, linewidth=120, precision=4)

# --- Utility Functions ---
def note(msg): display(Markdown(f"<div class='alert alert-block alert-info'>📝 **Note:** {msg}</div>"))
def sec(title): print(f"\n{80*'='}\n| {title.upper()} |\n{80*'='}")

note("Environment initialized for Self-Supervised Learning.")

# Chapter 7.12: Self-Supervised Representation Learning

---

### Table of Contents

1.  [**Introduction: Learning from the Data Itself**](#intro)
2.  [**Generative/Reconstructive Learning: Autoencoders**](#autoencoders)
    - [The Denoising Autoencoder (DAE)](#dae)
    - [The Variational Autoencoder (VAE)](#vae)
3.  [**Contrastive Learning: SimCLR**](#contrastive)
    - [The NT-Xent Loss Function](#nt-xent)
4.  [**Applications in Economics**](#applications)
    - [Application 1: Denoising and Generative Modeling](#app-dae-vae)
    - [Application 2: Economic Feature Extraction from Text](#app-text)
5.  [**Exercises**](#exercises)
6.  [**Summary and Key Takeaways**](#summary)

<a id='intro'></a>
## 1. Introduction: Learning from the Data Itself

Supervised learning is powerful, but it depends on large human-annotated datasets, which are costly to build. **Self-Supervised Learning (SSL)** works around this bottleneck with **pretext tasks** in which the data provides its own supervision (for example, reconstructing a corrupted input or predicting a masked word). To solve the pretext task well, a model has to capture much of the data's underlying structure. The resulting internal representations can then be transferred to downstream tasks where labeled data is scarce.

This chapter explores two dominant SSL paradigms:
1.  **Generative/Reconstructive Learning:** We will focus on **Autoencoders**, which learn representations by compressing and decompressing data. We will cover the basic autoencoder, the more robust **Denoising Autoencoder**, and the powerful **Variational Autoencoder (VAE)**, which learns a probabilistic latent space.
2.  **Contrastive Learning:** We will introduce the core ideas behind modern contrastive methods like **SimCLR**, which learn representations by pulling augmented "positive pairs" together and pushing "negative pairs" apart in the embedding space.

<a id='autoencoders'></a>
## 2. Generative/Reconstructive Learning: Autoencoders

An **autoencoder** is an unsupervised neural network that learns a compressed, low-dimensional representation of data by learning to reconstruct its own input. It is composed of two sub-networks:
- An **encoder** that maps the high-dimensional input $\mathbf{x}$ to a low-dimensional latent space representation $\mathbf{z}$.
- A **decoder** that attempts to reconstruct the original input $\hat{\mathbf{x}}$ from the latent representation $\mathbf{z}$.

By forcing information through a narrow **bottleneck** (the latent space), the autoencoder is compelled to learn the most salient factors of variation in the data.
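
To make the bottleneck idea concrete, here is a minimal NumPy sketch. It relies on a known result: a linear autoencoder trained with squared error recovers the same subspace as PCA, so the sketch uses the top-$k$ principal directions as a stand-in for a trained encoder and decoder instead of running gradient descent. The data are synthetic, and the name `linear_autoencoder` is illustrative rather than part of any library.

```python
import numpy as np

def linear_autoencoder(X, k):
    """Closed-form linear autoencoder with a k-dimensional bottleneck (via SVD/PCA)."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = Vt[:k].T  # shape (n_features, k)

    def encode(x):
        return (x - mean) @ W      # x -> z (k numbers per observation)

    def decode(z):
        return z @ W.T + mean      # z -> x_hat

    return encode, decode

rng = np.random.default_rng(0)
# Synthetic data: 10 observed features driven by 3 latent factors plus small noise
factors = rng.normal(size=(500, 3))
loadings = rng.normal(size=(3, 10))
X = factors @ loadings + 0.05 * rng.normal(size=(500, 10))

errors = {}
for k in (1, 3, 10):
    encode, decode = linear_autoencoder(X, k)
    errors[k] = float(np.mean((X - decode(encode(X))) ** 2))
    print(f"bottleneck k={k:2d}  reconstruction MSE={errors[k]:.6f}")
```

Because the synthetic data have three underlying factors, the error should fall sharply once the bottleneck reaches $k = 3$; widening it further mostly fits noise.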

<a id='dae'></a>
### 2.1 The Denoising Autoencoder (DAE)

A simple autoencoder with enough capacity can learn an uninteresting identity function that merely copies its input. A **Denoising Autoencoder (DAE)** provides a more robust pretext task: the model receives a *corrupted* version of the input and is trained to reconstruct the original, *clean* version. Because copying the input no longer minimizes the loss, the model has to learn features that separate signal from noise.

![DAE Architecture](../images/07-Machine-Learning/dae_architecture.png)
*<center><b>Figure 1: The Denoising Autoencoder (DAE) framework.</b></center>*

**The Process:**
1.  Take the original input $\mathbf{x}$.
2.  Add random noise (e.g., Gaussian noise) to create a corrupted input $\tilde{\mathbf{x}}$.
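3.  Train the autoencoder to minimize the reconstruction error between the decoder's output $\hat{\mathbf{x}}$ and the **original, clean input** $\mathbf{x}$:
$$ L(\theta) = \lVert \mathbf{x} - g_\theta(f_\theta(\tilde{\mathbf{x}})) \rVert^2 $$
where $f_\theta$ is the encoder, $g_\theta$ is the decoder, and $\hat{\mathbf{x}} = g_\theta(f_\theta(\tilde{\mathbf{x}}))$.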
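
The corruption step and the loss can be sketched in a few lines of NumPy. This is only an illustration of the objective, not a trained model: the arrays are random stand-ins for images, and the "model" is a placeholder identity map that copies its noisy input. The helper names `corrupt` and `denoising_loss` are hypothetical.

```python
import numpy as np

def corrupt(x, noise_factor, rng):
    """Add Gaussian noise and clip back to the valid [0, 1] pixel range."""
    noisy = x + noise_factor * rng.normal(size=x.shape)
    return np.clip(noisy, 0.0, 1.0)

def denoising_loss(x_clean, x_hat):
    """Batch mean of the squared L2 distance between x_hat and the CLEAN input."""
    diff = (x_clean - x_hat).reshape(len(x_clean), -1)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

rng = np.random.default_rng(0)
x = rng.uniform(size=(4, 28, 28))               # stand-in for 4 clean images
x_tilde = corrupt(x, noise_factor=0.2, rng=rng)  # corrupted inputs

# Placeholder model: the identity map, i.e. a network that copies its input.
identity_loss = denoising_loss(x, x_tilde)
# Ideal model: recovers the clean image exactly.
perfect_loss = denoising_loss(x, x)
print(f"identity map loss: {identity_loss:.2f}")
print(f"perfect denoiser loss: {perfect_loss:.2f}")
```

The identity map incurs a strictly positive loss here, which is the point of the denoising objective: copying the input is no longer a solution.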

<a id='vae'></a>
### 2.2 Variational Autoencoders (VAEs)

The **Variational Autoencoder (VAE)**, introduced by Kingma and Welling (2013), is a more sophisticated, probabilistic, and generative model. Unlike a standard autoencoder that learns a single point in the latent space for each input, a VAE learns a **probability distribution**. The encoder doesn't output a latent vector $\mathbf{z}$, but rather the parameters of a distribution (typically the mean $\mu_z$ and log-variance $\log(\sigma_z^2)$ of a Gaussian) from which we can sample a latent vector.

This has two profound consequences:
1.  **Continuity and Regularization:** By forcing the encoder to learn a distribution, the VAE encourages the latent space to be more continuous and well-structured. Nearby points in the latent space correspond to semantically similar outputs.
2.  **Generative Capability:** Once trained, we can sample a random point from the latent distribution and pass it through the decoder to **generate new, plausible data** that has never been seen before.

The VAE is trained by minimizing a loss function with two components:
$$ \text{Loss}_{VAE} = \underbrace{||\mathbf{x} - \hat{\mathbf{x}}||^2}_{\text{Reconstruction Loss}} + \underbrace{D_{KL}(q(\mathbf{z}|\mathbf{x}) || p(\mathbf{z}))}_{\text{KL Divergence}} $$
The first term is the standard reconstruction error (the code lab in Section 4 uses binary cross-entropy instead, a common choice for pixel intensities in $[0, 1]$). The second term is the **Kullback-Leibler (KL) divergence**, which acts as a regularizer. It measures how much the learned latent distribution $q(\mathbf{z}|\mathbf{x})$ deviates from a prior distribution $p(\mathbf{z})$ (typically a standard normal). This pulls the encoded distributions toward the prior, so the latent codes form a structured, roughly Gaussian cloud rather than scattered, isolated clusters.
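
For a diagonal Gaussian encoder and a standard normal prior, the KL term has the closed form $D_{KL} = -\tfrac{1}{2} \sum_j \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$. The sketch below evaluates it in NumPy alongside the reparameterization trick used to sample $\mathbf{z}$. The input values are made up for illustration, and the function names are hypothetical.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per row."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)

rng = np.random.default_rng(0)
# Two hypothetical encoder outputs for a 2-d latent space:
# row 0 matches the prior exactly, row 1 is shifted and rescaled.
mu = np.array([[0.0, 0.0], [1.0, -2.0]])
log_var = np.array([[0.0, 0.0], [-1.0, 0.5]])

z = sample_latent(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
print("sampled z shape:", z.shape)
print("KL per example:", np.round(kl, 4))
```

The first row has zero KL because its distribution already equals the prior; the second row is penalized for both its shifted mean and its non-unit variance.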

<a id='contrastive'></a>
## 3. Contrastive Learning (SimCLR)

A different and highly successful paradigm for SSL is **Contrastive Learning**. Instead of reconstructing the input, the goal is to learn representations such that similar, or "positive," pairs of inputs are close together in the embedding space, while dissimilar, or "negative," pairs are far apart.

![SimCLR Framework](../images/07-Machine-Learning/simclr_framework.png)
*<center><b>Figure 2: The SimCLR framework.</b> An input image is augmented twice to form a positive pair. The encoder and projection head create latent vectors, and the contrastive loss maximizes the agreement between the positive pair relative to all other (negative) examples in the batch.</center>*

**SimCLR** (Chen et al., 2020, "A Simple Framework for Contrastive Learning of Visual Representations") is a canonical example:
1.  **Data Augmentation:** Take a batch of $N$ images. For each image, create two different, randomly augmented "views" (e.g., by random cropping and resizing, color distortion, or Gaussian blur). These two views form a **positive pair**, giving $2N$ views in total.
2.  **Encoder:** Pass all augmented images through an encoder network (typically a ResNet) to get their vector representations.
3.  **Projection Head:** The representations are passed through a small MLP projection head to map them into the space where the contrastive loss is applied.
4.  **Contrastive Loss (NT-Xent):** For a given positive pair, the loss function aims to maximize their agreement (e.g., cosine similarity) relative to the agreement with all other "negative" examples in the same batch.

By solving this pretext task, the encoder learns to produce representations that are invariant to the augmentations, capturing the high-level semantic content of the image rather than low-level pixel details.
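
The augmentation step can be sketched as follows. The transformations here (pad-and-crop, horizontal flip, brightness jitter) are simplified stand-ins chosen for brevity, not SimCLR's actual augmentation pipeline, and the images are random arrays. The function names are hypothetical.

```python
import numpy as np

def random_view(img, rng, pad=4):
    """One random view: pad-and-crop, optional horizontal flip, brightness jitter."""
    h, w = img.shape
    padded = np.pad(img, pad, mode="reflect")
    top = rng.integers(0, 2 * pad + 1)    # random crop offset, rows
    left = rng.integers(0, 2 * pad + 1)   # random crop offset, columns
    view = padded[top:top + h, left:left + w]
    if rng.random() < 0.5:
        view = view[:, ::-1]              # horizontal flip
    brightness = rng.uniform(0.8, 1.2)
    return np.clip(view * brightness, 0.0, 1.0)

def make_positive_pairs(batch, rng):
    """Return two stacks of views; views_a[i] and views_b[i] form a positive pair."""
    views_a = np.stack([random_view(img, rng) for img in batch])
    views_b = np.stack([random_view(img, rng) for img in batch])
    return views_a, views_b

rng = np.random.default_rng(0)
batch = rng.uniform(size=(8, 28, 28))     # stand-in for 8 grayscale images
views_a, views_b = make_positive_pairs(batch, rng)
print(views_a.shape, views_b.shape)
```

Every other view in the two stacks serves as a negative for a given anchor, so a batch of 8 images yields 16 views and 14 negatives per anchor.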

<a id='nt-xent'></a>
### The NT-Xent Loss Function

The loss function for a positive pair of examples $(i, j)$ is defined as:
$$ \ell_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j) / \tau)}{\sum_{k=1}^{2N} \mathbf{1}_{k \neq i} \exp(\text{sim}(z_i, z_k) / \tau)} $$
Where:
- $z_i$ and $z_j$ are the latent vectors of the positive pair.
- $\text{sim}(u, v) = \frac{u^\top v}{\lVert u \rVert \, \lVert v \rVert}$ is the cosine similarity.
- $\tau$ is a **temperature** parameter that scales the similarity scores. Lower temperatures amplify the differences between pairs.
- The denominator sums over the other $2N-1$ examples in the batch: the positive $z_j$ and the $2N-2$ negatives. The indicator $\mathbf{1}_{k \neq i}$ excludes only the anchor itself.

This is a form of cross-entropy loss: a softmax classification task in which the model must pick the correct positive out of the $2N-1$ candidates.
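
A minimal NumPy sketch of this loss, averaged over all $2N$ anchors, is given below. It assumes the two views of example $i$ sit at rows $i$ and $i + N$ of the stacked matrix. The default `tau=0.5` and the random inputs are arbitrary choices for illustration.

```python
import numpy as np

def nt_xent(z_a, z_b, tau=0.5):
    """NT-Xent loss averaged over all 2N anchors.

    z_a[i] and z_b[i] are projections of two views of the same example.
    """
    n = len(z_a)
    z = np.concatenate([z_a, z_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit norm: dot product = cosine
    sim = (z @ z.T) / tau                              # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude k == i from the denominator
    # Row index of each anchor's positive: i <-> i + N
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log of the denominator for each row
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob_pos = sim[np.arange(2 * n), pos] - log_denominator
    return float(-log_prob_pos.mean())

rng = np.random.default_rng(0)
z_a = rng.normal(size=(16, 8))
# Near-identical views: the positive pair is easy to pick out
aligned = nt_xent(z_a, z_a + 0.01 * rng.normal(size=z_a.shape))
# Views that share no signal: the positive looks like any negative
unrelated = nt_xent(z_a, rng.normal(size=z_a.shape))
print(f"aligned pairs:   {aligned:.3f}")
print(f"unrelated pairs: {unrelated:.3f}")
```

The loss is lower when the two views of each example map to nearby vectors, which is what training the encoder pushes toward.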

<a id='applications'></a>
## 4. Applications in Economics

<a id='app-dae-vae'></a>
### Application 1: Denoising and Generative Modeling
We first build and train a Denoising Autoencoder and visualize its output, then do the same for a Variational Autoencoder to showcase its generative capabilities. The code labs use the Fashion-MNIST and MNIST image datasets as stand-ins for noisy real-world data.

In [None]:
sec("Building, Training, and Visualizing a Denoising Autoencoder")
if TENSORFLOW_AVAILABLE:
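    (x_train_f, _), (x_test_f, _) = fashion_mnist.load_data()
    # Scale pixels to [0, 1], then add a trailing channel axis: (N, 28, 28) -> (N, 28, 28, 1)
    x_train_f = (x_train_f.astype('float32') / 255.)[..., tf.newaxis]
    x_test_f = (x_test_f.astype('float32') / 255.)[..., tf.newaxis]

    # Corrupt both splits with Gaussian noise and clip back to the valid pixel range
    noise_factor = 0.2
    x_train_noisy = tf.clip_by_value(x_train_f + noise_factor * tf.random.normal(shape=x_train_f.shape), 0., 1.)
    x_test_noisy = tf.clip_by_value(x_test_f + noise_factor * tf.random.normal(shape=x_test_f.shape), 0., 1.)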

    inp = layers.Input(shape=(28, 28, 1))
    x = layers.Conv2D(32, (3, 3), activation='relu', padding='same')(inp)
    x = layers.MaxPooling2D((2, 2), padding='same')(x)
    encoded = layers.Conv2D(32, (3, 3), activation='relu', padding='same')(x)
    x = layers.Conv2DTranspose(32, (3, 3), strides=2, activation='relu', padding='same')(encoded)
    decoded = layers.Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)
    autoencoder = Model(inp, decoded)
    autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
    
    note("Training DAE... (This may take a minute)")
    autoencoder.fit(x_train_noisy, x_train_f, epochs=10, batch_size=128, validation_data=(x_test_noisy, x_test_f), verbose=0)
    
    decoded_imgs = autoencoder.predict(x_test_noisy)
    n = 10
    plt.figure(figsize=(20, 6))
    for i in range(n):
        ax = plt.subplot(3, n, i + 1); plt.imshow(tf.squeeze(x_test_noisy[i])); plt.gray(); ax.axis('off')
        if i == n//2: ax.set_title('Noisy Input')
        ax = plt.subplot(3, n, i + 1 + n); plt.imshow(tf.squeeze(decoded_imgs[i])); plt.gray(); ax.axis('off')
        if i == n//2: ax.set_title('Reconstructed Output')
        ax = plt.subplot(3, n, i + 1 + 2*n); plt.imshow(tf.squeeze(x_test_f[i])); plt.gray(); ax.axis('off')
        if i == n//2: ax.set_title('Original Clean Image')
    plt.show()
else:
    note("TensorFlow not available. Skipping code labs.")

In [None]:
sec("Building, Visualizing, and Generating with a VAE")
if TENSORFLOW_AVAILABLE:
    keras = tf.keras  # `keras` is not imported in the setup cell, so alias it here
    original_dim = 28 * 28; intermediate_dim = 64; latent_dim = 2

    class Sampling(layers.Layer):
        """Reparameterization trick: z = mean + exp(0.5 * log_var) * eps, eps ~ N(0, I)."""
        def call(self, inputs):
            z_mean, z_log_var = inputs
            return z_mean + tf.exp(0.5 * z_log_var) * tf.random.normal(tf.shape(z_mean))

    # Encoder: flattened image -> mean and log-variance of q(z|x), plus one sample z
    encoder_inputs = keras.Input(shape=(original_dim,))
    x = layers.Dense(intermediate_dim, activation="relu")(encoder_inputs)
    z_mean = layers.Dense(latent_dim)(x)
    z_log_var = layers.Dense(latent_dim)(x)
    z = Sampling()([z_mean, z_log_var])
    encoder = keras.Model(encoder_inputs, [z_mean, z_log_var, z])

    # Decoder: latent vector -> pixel intensities in [0, 1]
    latent_inputs = keras.Input(shape=(latent_dim,))
    x = layers.Dense(intermediate_dim, activation="relu")(latent_inputs)
    decoder_outputs = layers.Dense(original_dim, activation="sigmoid")(x)
    decoder = keras.Model(latent_inputs, decoder_outputs)

    class VAE(keras.Model):
        def __init__(self, encoder, decoder, **kwargs):
            super().__init__(**kwargs)
            self.encoder, self.decoder = encoder, decoder

        def train_step(self, data):
            with tf.GradientTape() as tape:
                z_mean, z_log_var, z = self.encoder(data)
                recon = self.decoder(z)
                # binary_crossentropy averages over pixels; rescale to a per-image sum
                recon_loss = original_dim * tf.reduce_mean(keras.losses.binary_crossentropy(data, recon))
                # Closed-form KL between N(mean, exp(log_var)) and N(0, I), summed over latent dims
                kl_loss = tf.reduce_mean(tf.reduce_sum(-0.5 * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var)), axis=1))
                total_loss = recon_loss + kl_loss
            grads = tape.gradient(total_loss, self.trainable_weights)
            self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
            return {"loss": total_loss, "reconstruction_loss": recon_loss, "kl_loss": kl_loss}

    (x_train_m, y_train_m), (x_test_m, y_test_m) = mnist.load_data()
    mnist_digits = np.concatenate([x_train_m, x_test_m]).astype("float32") / 255
    mnist_digits = np.reshape(mnist_digits, (-1, original_dim))
    vae = VAE(encoder, decoder)
    vae.compile(optimizer=keras.optimizers.Adam())
    note("Training VAE... (This may take a minute)")
    vae.fit(mnist_digits, epochs=25, batch_size=128, verbose=0)
    
    z_mean, _, _ = vae.encoder.predict(mnist_digits, verbose=0)
    plt.figure(figsize=(12, 10)); plt.scatter(z_mean[:, 0], z_mean[:, 1], c=np.concatenate([y_train_m, y_test_m])); plt.colorbar()
    plt.xlabel("Latent Dim 1"); plt.ylabel("Latent Dim 2"); plt.title("VAE Latent Space")
    plt.show()

    # Decode a regular grid of latent points to see how digits vary across the latent space
    n = 15
    digit_size = 28
    figure = np.zeros((digit_size * n, digit_size * n))
    grid_x = np.linspace(-4, 4, n)
    grid_y = np.linspace(-4, 4, n)[::-1]
    for i, yi in enumerate(grid_y):
        for j, xi in enumerate(grid_x):
            x_decoded = vae.decoder.predict(np.array([[xi, yi]]), verbose=0)
            figure[i*digit_size:(i+1)*digit_size, j*digit_size:(j+1)*digit_size] = x_decoded[0].reshape(digit_size, digit_size)
    plt.figure(figsize=(10, 10)); plt.imshow(figure, cmap="Greys_r"); plt.axis("off"); plt.title("Generated Digits")
    plt.show()
else:
    note("TensorFlow not available. Skipping code labs.")

<a id='app-text'></a>
### Application 2: Economic Feature Extraction from Text

A powerful application of SSL in economics is using large, pre-trained language models to extract meaningful features from text data. Models like BERT are pre-trained with a self-supervised objective (masked language modeling) on a massive text corpus. We can use the resulting representations as features without any task-specific fine-tuning on our own data.

Here, we will use the `sentence-transformers` library, which provides easy access to pre-trained models, to convert sentences from central bank statements into vector embeddings. The `all-MiniLM-L6-v2` model used below was additionally tuned by its authors on sentence pairs with a contrastive objective, in the same spirit as Section 3. We can then visualize these embeddings to see whether the model groups semantically similar statements.
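
"Semantically similar" is usually made operational with cosine similarity between embedding vectors. The sketch below shows the computation on made-up three-dimensional vectors that stand in for real sentence embeddings. The values and topic labels are invented for illustration, and the helper names are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(E):
    """Pairwise cosine similarity between the rows of E."""
    E_unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    return E_unit @ E_unit.T

def nearest_neighbor(sim):
    """Index of the most similar OTHER row for each row."""
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)   # a row is not its own neighbor
    return masked.argmax(axis=1)

# Made-up 3-d vectors standing in for real sentence embeddings
toy_embeddings = np.array([
    [0.9, 0.1, 0.0],   # "inflation" sentence A
    [0.8, 0.2, 0.1],   # "inflation" sentence B
    [0.1, 0.9, 0.2],   # "labor market" sentence A
    [0.0, 0.8, 0.3],   # "labor market" sentence B
])
sim = cosine_similarity_matrix(toy_embeddings)
neighbors = nearest_neighbor(sim)
print(np.round(sim, 2))
print("nearest neighbor of each row:", neighbors)
```

Applied to real embeddings, the same matrix gives a numerical check on any grouping that a two-dimensional PCA plot suggests, since PCA discards part of the variance.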

In [None]:
sec("Feature Extraction from FOMC Statements")

try:
    from sentence_transformers import SentenceTransformer
    from sklearn.decomposition import PCA

    fomc_sentences = [
        "Inflation has remained persistently below the Committee's 2 percent objective.",
        "The labor market has continued to strengthen and that economic activity has been rising at a solid rate.",
        "Job gains have been solid, on average, in recent months, and the unemployment rate has remained low.",
        "Measures of inflation compensation remain low; survey-based measures of longer-term inflation expectations are little changed.",
        "The Committee decided to raise the target range for the federal funds rate to 2-1/4 to 2-1/2 percent.",
        "In light of the current shortfall of inflation from 2 percent, the Committee will carefully monitor actual and expected inflation.",
        "The Board of Governors of the Federal Reserve System voted unanimously to raise the rate for primary credit."
    ]

    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(fomc_sentences)
    note(f"Generated embeddings of shape: {embeddings.shape}")

    pca = PCA(n_components=2)
    embeddings_2d = pca.fit_transform(embeddings)

    plt.figure(figsize=(12, 8))
    plt.scatter(embeddings_2d[:,0], embeddings_2d[:,1])
    for i, (px, py) in enumerate(embeddings_2d):
        plt.annotate(f'S{i+1}', (px, py))
    plt.title('2D PCA of Sentence Embeddings from FOMC Statements')
    plt.show()
    note("Check the plot: S1, S4, and S6 (inflation) would be expected to sit near one another, as would S2 and S3 (labor market) and S5 and S7 (policy rate action). The exact layout depends on the model version and on how much variance the first two principal components capture. No topic labels were supplied; any grouping comes from the pre-trained embeddings.")

except ImportError:
    note("Skipping text embedding example: 'sentence-transformers' or 'scikit-learn' is not installed. You can install them with 'pip install sentence-transformers scikit-learn'.")

<a id='exercises'></a>
## 5. Exercises

1.  **DAE vs. VAE:** What is the fundamental difference in what a Denoising Autoencoder and a Variational Autoencoder learn about the data's latent space?
2.  **The Role of Noise:** In a Denoising Autoencoder, what would happen if the noise factor was set to 0? What if it was set to a very large value?
3.  **Contrastive Loss:** In the NT-Xent loss function, what is the role of the temperature parameter $\tau$? How would the optimization dynamics change if $\tau$ is very high or very low?
4.  **Data Augmentation:** Why is the choice of data augmentations so critical for the success of contrastive learning methods like SimCLR?

<a id='summary'></a>
## 6. Summary and Key Takeaways

Self-Supervised Learning represents a paradigm shift away from the reliance on large, manually labeled datasets, allowing models to learn rich representations from the inherent structure of the data itself.

**Key Concepts**:
- **Pretext Task**: SSL works by creating a 'pretext' task where the data provides its own labels (e.g., reconstructing a noisy image, predicting a masked word). By solving this task, the model is forced to learn meaningful features.
- **Reconstructive SSL (Autoencoders)**: This family of models, including Denoising and Variational Autoencoders, learns by compressing data into a low-dimensional latent space and then decompressing it. VAEs are notable for learning a probabilistic latent space, which allows for the generation of new data.
- **Contrastive SSL (SimCLR)**: This family of models learns by pulling representations of similar (positive) data pairs together while pushing dissimilar (negative) pairs apart. The choice of data augmentation to create positive pairs is critical to its success.
- **Transfer Learning**: The powerful representations learned via SSL can be transferred to downstream supervised tasks where labeled data is scarce, often leading to significant performance improvements.

### Solutions to Exercises

---

**1. DAE vs. VAE:**
A DAE learns a deterministic mapping from an input to a single point in the latent space. Its goal is a compression that is robust to corruption of the input. A VAE learns the parameters of a **probability distribution** (e.g., a mean and variance) for each input. This forces the latent space to be continuous and structured, which allows the VAE to be used as a **generative model** (i.e., we can sample from the latent space to create new data), a capability that a standard DAE lacks.

---

**2. The Role of Noise:**
- **Noise = 0:** The DAE becomes a standard autoencoder. It might learn a trivial identity function (simply copying the input) without learning any useful, robust features.
- **Noise very large:** The input becomes almost pure noise. The model would be unable to recover the original signal, and the reconstruction loss would be consistently high, preventing the model from learning anything meaningful.

---

**3. Contrastive Loss Temperature:**
The temperature parameter $\tau$ controls the 'sharpness' of the softmax over similarity scores.
- **High $\tau$**: The softmax distribution over candidates becomes softer (more uniform). The loss is less sensitive to the differences between negative samples, making the task easier but potentially leading to less powerful representations.
- **Low $\tau$**: The distribution becomes very 'peaky'. The model is forced to focus heavily on discriminating the positive pair from even very similar negative pairs (hard negatives). This makes the learning task harder but can lead to better, more discriminative representations.

---

**4. Data Augmentation in Contrastive Learning:**
The choice of augmentations defines what the model learns to be invariant to. The pretext task is to identify the positive pair, which consists of two different augmentations of the same image. Therefore, the model must learn representations that are robust to these augmentations.
If the augmentations are too simple (e.g., tiny crops), the task is too easy and the model learns little. If they are too aggressive (e.g., completely changing the color profile), the model might not be able to recognize the two views as belonging to the same image. The art of SSL is in designing augmentations that remove superficial information while preserving the core semantic content.
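
The answer to Exercise 3 can be checked numerically. The sketch below applies a temperature-scaled softmax to a set of hypothetical cosine similarities for one anchor. The similarity values and the three temperatures are made up for illustration.

```python
import numpy as np

def softmax_with_temperature(similarities, tau):
    """Softmax over similarity scores after dividing by the temperature tau."""
    scaled = np.asarray(similarities, dtype=float) / tau
    scaled = scaled - scaled.max()   # subtract the max for numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()

# Hypothetical cosine similarities between one anchor and four candidates:
# the first is the positive, the second is a hard negative.
sims = [0.9, 0.8, 0.1, -0.3]
probs = {tau: softmax_with_temperature(sims, tau) for tau in (0.05, 0.5, 5.0)}
for tau, p in probs.items():
    print(f"tau={tau:<4}  probs={np.round(p, 3)}")
```

At the lowest temperature nearly all probability mass is split between the positive and the hard negative, so the gradient concentrates on separating those two. At the highest temperature the four candidates receive almost equal weight.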