In [None]:
# 🔧 Setup: Run this cell first!
# Check GPU availability, report versions, and seed the random number generators

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected. Some cells may run slowly.")
    print("   Go to Runtime → Change runtime type → GPU")

print(f"\n📦 Python {sys.version.split()[0]}")
print(f"🔥 PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"🎲 Random seed set to {SEED}")

%matplotlib inline

# Denoising Score Matching -- Vizuara

## 1. Why Does This Matter?

In the previous notebook, we learned how to train a score function using the tractable score matching loss. It worked, but it required computing the Jacobian trace -- an operation that scales as $O(D^2)$, making it infeasible for images or any high-dimensional data.
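
To make that cost concrete, here is a minimal sketch of how a Jacobian trace could be computed with autograd. The names `jacobian_trace` and `score_fn` are hypothetical stand-ins and are not taken from the previous notebook. The loop needs one backward pass per input dimension, which is where the quadratic scaling comes from.

```python
import torch

def jacobian_trace(score_fn, x):
    """Trace of d score_fn(x) / dx for each row of x, using D backward passes.

    score_fn is a hypothetical stand-in for any map from (N, D) to (N, D)
    that treats each row independently.
    """
    x = x.clone().requires_grad_(True)
    scores = score_fn(x)                      # (N, D)
    trace = torch.zeros(x.shape[0])
    for d in range(x.shape[1]):               # one backward pass per dimension
        grad_d = torch.autograd.grad(scores[:, d].sum(), x, retain_graph=True)[0]
        trace = trace + grad_d[:, d]          # d s_d / d x_d for every row
    return trace

# Sanity check on a linear map s(x) = x @ A.T, whose Jacobian is A everywhere
A = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
x_demo = torch.randn(5, 2)
trace_demo = jacobian_trace(lambda x: x @ A.T, x_demo)
print(trace_demo)  # every entry equals trace(A) = 1 + 4 = 5
```

For a 2D toy problem the loop runs twice. For an image with thousands of pixels it would run thousands of times per training step.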

**Denoising Score Matching (DSM)** solves this problem with a remarkably simple idea: instead of matching the score of the clean data distribution, we match the score of a noisy version. Since we control the noise, we know the target score in closed form -- no Jacobian needed.

This single insight underpins the training objective of modern diffusion models (DDPM, Stable Diffusion, DALL-E 2, Sora).

By the end of this notebook, you will:
- Understand the DSM objective and derive it step by step
- Implement DSM training from scratch
- Train a score network on 2D data and visualize the learned score field
- See the direct equivalence between DSM and DDPM's noise prediction

## 2. Building Intuition

### The Magnet Analogy

Imagine a tabletop with invisible magnets hidden at specific spots. You want to map the magnetic field (which direction each point is pulled toward), but you cannot see or measure the magnets directly.

Here is the trick:

1. **Place a metal ball on a magnet.** This is your clean data point $x$.
2. **Flick the ball in a random direction.** It lands at a new spot $\tilde{x}$. This is your noisy data.
3. **Ask a student (neural network):** "Which direction would pull this ball back to its starting position?"

The student does not know where the ball came from. But YOU do -- because you placed it. So you can give the student feedback: "You guessed this direction, but the true direction is that way."

After enough practice with many balls on many magnets, the student learns the entire magnetic field -- the score function.

### Why This Works

The key insight: **we do not need to know the score of the clean distribution.** We only need the score of the transition from clean to noisy, which is a simple Gaussian. The direction from noisy back to clean is:

$$\text{target} = \frac{x - \tilde{x}}{\sigma^2}$$

This is just "the direction from the noisy point back to the clean point, divided by the noise variance." No Jacobians, no intractable integrals.

## 3. The Mathematics

### Step 1: Add Noise

We corrupt each clean data point $x$ with Gaussian noise:

$$\tilde{x} = x + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

The conditional distribution is:

$$q_\sigma(\tilde{x}|x) = \mathcal{N}(\tilde{x}; x, \sigma^2 I)$$

**Computationally:** sample a noise vector $\epsilon$ from a standard normal, scale it by $\sigma$, and add it to $x$.
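
As a quick sanity check of the corruption step, the sketch below noises the same clean point many times and looks at the empirical mean and standard deviation. The clean point, sample count, and variable names are illustrative choices, not part of the notebook's dataset.

```python
import torch

torch.manual_seed(0)

sigma_demo = 0.5
# Repeat one clean point many times so the statistics of x_tilde are easy to read
x_clean_demo = torch.tensor([3.0, 4.0]).repeat(100_000, 1)

eps_demo = torch.randn_like(x_clean_demo)             # epsilon ~ N(0, I)
x_noisy_demo = x_clean_demo + sigma_demo * eps_demo   # x_tilde = x + sigma * epsilon

# x_tilde should be centred on x, with per-coordinate std close to sigma
print(x_noisy_demo.mean(dim=0))
print(x_noisy_demo.std(dim=0))
```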

### Step 2: Compute the Target Score

Since the conditional is Gaussian, we can compute its score in closed form:

$$\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}|x) = \frac{x - \tilde{x}}{\sigma^2} = -\frac{\epsilon}{\sigma}$$

**Computationally:** this is just the negative noise divided by $\sigma$, or equivalently, the vector pointing from the noisy point back to the clean point, scaled by $1/\sigma^2$.
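
The closed form can be checked against autograd. The sketch below writes out the Gaussian log-density by hand, differentiates it with respect to $\tilde{x}$, and compares the result with $-\epsilon/\sigma$. The helper `gaussian_log_density` and the specific numbers are assumptions made for illustration.

```python
import math
import torch

torch.manual_seed(0)

def gaussian_log_density(x_tilde, x, sigma):
    """log N(x_tilde; x, sigma^2 I) for a single D-dimensional point."""
    dim = x.numel()
    sq_dist = ((x_tilde - x) ** 2).sum()
    return -sq_dist / (2 * sigma ** 2) - dim * math.log(sigma * math.sqrt(2 * math.pi))

x_ref = torch.tensor([1.0, -2.0])
eps_ref = torch.randn(2)
sigma_ref = 0.8
x_tilde_ref = (x_ref + sigma_ref * eps_ref).requires_grad_(True)

# Score of the conditional: gradient of the log-density with respect to x_tilde
log_q = gaussian_log_density(x_tilde_ref, x_ref, sigma_ref)
autograd_score = torch.autograd.grad(log_q, x_tilde_ref)[0]

closed_form_score = -eps_ref / sigma_ref
print(autograd_score, closed_form_score)  # the two vectors should agree
```

The normalizing constant does not depend on $\tilde{x}$, so it drops out of the gradient. That is why only the quadratic term matters.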

### Step 3: The DSM Loss

Train the score network to predict this target:

$$\mathcal{L}_{\text{DSM}} = \mathbb{E}_{x \sim p(x)} \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)} \left[ \left\| s_\theta(x + \sigma\epsilon) - \frac{-\epsilon}{\sigma} \right\|^2 \right]$$

Or equivalently, written in terms of $\tilde{x}$:

$$\mathcal{L}_{\text{DSM}} = \mathbb{E} \left[ \left\| s_\theta(\tilde{x}) - \frac{x - \tilde{x}}{\sigma^2} \right\|^2 \right]$$

**Computationally:** for each training step, sample a batch of clean data, add noise, compute the target direction, predict with the network, and minimize the MSE.
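
The per-sample target is noisy, yet regressing on it recovers the score of the noisy *marginal*. One way to see this is a 1D case where everything is known in closed form. The sketch below assumes data $x \sim \mathcal{N}(0, 1)$ (an illustrative choice, not the notebook's dataset), so the noisy marginal is $\mathcal{N}(0, 1 + \sigma^2)$ with score $-\tilde{x}/(1+\sigma^2)$. Fitting a linear model $s(\tilde{x}) = a\tilde{x}$ to the DSM targets by least squares should give $a \approx -1/(1+\sigma^2)$.

```python
import torch

torch.manual_seed(0)

sigma_1d = 0.5
n = 200_000

x_1d = torch.randn(n)                    # assumed data distribution: N(0, 1)
eps_1d = torch.randn(n)
x_tilde_1d = x_1d + sigma_1d * eps_1d    # noisy samples
target_1d = -eps_1d / sigma_1d           # per-sample DSM target

# Least-squares slope for s(x_tilde) = a * x_tilde, i.e. the minimiser of the
# DSM loss within the family of linear score models
a_fit = (x_tilde_1d * target_1d).sum() / (x_tilde_1d ** 2).sum()
a_true = -1.0 / (1.0 + sigma_1d ** 2)    # slope of the true noisy-marginal score

print(f"fitted slope: {a_fit.item():.3f}, marginal score slope: {a_true:.3f}")
```

Each individual target points back to one particular clean point, but averaging over all clean points that could have produced a given $\tilde{x}$ yields the marginal score. The MSE loss performs that averaging implicitly.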

### Numerical Walkthrough

Let us trace through one training sample.

Clean data point: $x = [3.0, 4.0]$, noise level: $\sigma = 0.5$

We sample noise: $\epsilon = [0.4, -0.6]$

Noisy point: $\tilde{x} = [3.0 + 0.5 \times 0.4, \; 4.0 + 0.5 \times (-0.6)] = [3.2, 3.7]$

Target score: $\frac{x - \tilde{x}}{\sigma^2} = \frac{[3.0, 4.0] - [3.2, 3.7]}{0.25} = \frac{[-0.2, 0.3]}{0.25} = [-0.8, 1.2]$

Alternative (using noise): $-\frac{\epsilon}{\sigma} = -\frac{[0.4, -0.6]}{0.5} = [-0.8, 1.2]$

Both give the same answer. This is exactly what we want.

If the network predicts $s_\theta(\tilde{x}) = [-0.7, 1.0]$, the loss is:

$$\|[-0.7, 1.0] - [-0.8, 1.2]\|^2 = \|[0.1, -0.2]\|^2 = 0.01 + 0.04 = 0.05$$
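
The same walkthrough can be reproduced in a few lines of PyTorch. This is only a check of the arithmetic above, and the variable names are illustrative.

```python
import torch

x_w = torch.tensor([3.0, 4.0])
eps_w = torch.tensor([0.4, -0.6])
sigma_w = 0.5

x_tilde_w = x_w + sigma_w * eps_w                        # [3.2, 3.7]
target_from_points = (x_w - x_tilde_w) / sigma_w ** 2    # [-0.8, 1.2]
target_from_noise = -eps_w / sigma_w                     # [-0.8, 1.2]

prediction_w = torch.tensor([-0.7, 1.0])
loss_w = ((prediction_w - target_from_points) ** 2).sum()  # about 0.05

print(x_tilde_w, target_from_points, target_from_noise, loss_w.item())
```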

## 4. Let's Build It -- Component by Component

### 4.1 Data Generation

In [None]:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

# Generate 2D data from a mixture of Gaussians
def generate_data(n_samples=2000):
    """Create a bimodal 2D dataset with two clusters."""
    cluster1 = torch.randn(n_samples // 2, 2) * 0.5 + torch.tensor([2.0, 2.0])
    cluster2 = torch.randn(n_samples // 2, 2) * 0.5 + torch.tensor([-2.0, -2.0])
    data = torch.cat([cluster1, cluster2], dim=0)
    return data

data = generate_data(2000)

plt.figure(figsize=(6, 6))
plt.scatter(data[:, 0].numpy(), data[:, 1].numpy(), alpha=0.3, s=5)
plt.title('Training Data: Mixture of 2 Gaussians')
plt.xlabel('x1')
plt.ylabel('x2')
plt.axis('equal')
plt.grid(True, alpha=0.3)
plt.show()

print(f"Dataset shape: {data.shape}")
print(f"Cluster 1 center: ~(2, 2), Cluster 2 center: ~(-2, -2)")

### 4.2 Noise Corruption

In [None]:
# Demonstrate the noise corruption process
sigma = 0.5  # Noise level

# Take 5 samples to show the process
n_demo = 5
clean_samples = data[:n_demo]
noise = torch.randn_like(clean_samples) * sigma
noisy_samples = clean_samples + noise
target_scores = (clean_samples - noisy_samples) / (sigma ** 2)

print("Clean -> Noisy -> Target Score")
print("-" * 50)
for i in range(n_demo):
    x = clean_samples[i].numpy()
    x_tilde = noisy_samples[i].numpy()
    target = target_scores[i].numpy()
    print(f"x={x} -> x_tilde={x_tilde} -> target={target}")

# Visualize
plt.figure(figsize=(8, 8))
plt.scatter(data[:, 0], data[:, 1], alpha=0.1, s=3, c='blue', label='Clean data')
noisy_all = data + torch.randn_like(data) * sigma
plt.scatter(noisy_all[:, 0], noisy_all[:, 1], alpha=0.1, s=3, c='red', label='Noisy data')
plt.legend()
plt.title(f'Clean vs Noisy Data (sigma={sigma})')
plt.axis('equal')
plt.grid(True, alpha=0.3)
plt.show()

print(f"\nThe noisy data is a slightly blurred version of the clean data.")
print(f"Each noisy point was displaced by noise with std={sigma}")

### 4.3 Score Network Architecture

In [None]:
class ScoreNetwork(nn.Module):
    """Simple MLP that predicts the score (gradient of log density)."""
    def __init__(self, input_dim=2, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.SiLU(),  # Smooth activation
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, input_dim)  # Same dimension as input
        )

    def forward(self, x):
        """
        Args:
            x: (batch_size, input_dim) -- noisy data points
        Returns:
            scores: (batch_size, input_dim) -- predicted score vectors
        """
        return self.net(x)

model = ScoreNetwork()
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

# Test forward pass
test_input = torch.randn(4, 2)
test_output = model(test_input)
print(f"Input shape:  {test_input.shape}")
print(f"Output shape: {test_output.shape}")
print("Output is the same dimension as input -- each point gets a score vector.")

### 4.4 The DSM Training Loop

In [None]:
def dsm_loss(model, clean_data, sigma):
    """
    Compute the Denoising Score Matching loss.

    Steps:
    1. Add Gaussian noise to create noisy data
    2. Compute target score: (clean - noisy) / sigma^2
    3. Predict score with the model
    4. Return MSE between prediction and target
    """
    # Step 1: Add noise
    noise = torch.randn_like(clean_data) * sigma
    noisy_data = clean_data + noise

    # Step 2: Target score
    target_score = (clean_data - noisy_data) / (sigma ** 2)

    # Step 3: Predict
    predicted_score = model(noisy_data)

    # Step 4: MSE loss
    loss = ((predicted_score - target_score) ** 2).sum(dim=-1).mean()

    return loss

# Verify loss computation
test_loss = dsm_loss(model, data[:32], sigma=0.5)
print(f"Initial DSM loss: {test_loss.item():.4f}")
print("(This should decrease during training)")

## 5. Your Turn -- TODO Exercises

### TODO 1: Implement DSM Loss from Noise Perspective

In [None]:
def dsm_loss_noise_form(model, clean_data, sigma):
    """
    Compute DSM loss using the noise prediction form.

    Instead of target = (x - x_tilde) / sigma^2,
    use target = -epsilon / sigma

    These are mathematically equivalent!

    Args:
        model: ScoreNetwork
        clean_data: (N, D) clean data points
        sigma: noise standard deviation

    Returns:
        loss: scalar
    """
    # ============ TODO ============
    # Step 1: Sample epsilon ~ N(0, I) with same shape as clean_data
    # Step 2: Compute noisy_data = clean_data + sigma * epsilon
    # Step 3: Compute target_score = -epsilon / sigma
    # Step 4: Get predicted_score from model(noisy_data)
    # Step 5: Return MSE between predicted and target
    # ==============================

    loss = None  # YOUR CODE HERE

    return loss

In [None]:
# Verification: with the same seed, both forms should give matching loss values
torch.manual_seed(42)
loss_original = dsm_loss(model, data[:256], sigma=0.5)
torch.manual_seed(42)
loss_noise = dsm_loss_noise_form(model, data[:256], sigma=0.5)

if loss_noise is not None:
    print(f"Original DSM loss: {loss_original.item():.4f}")
    print(f"Noise-form DSM loss: {loss_noise.item():.4f}")
    # Both calls start from the same seed, so if epsilon is drawn with a single
    # torch.randn_like(clean_data) call, both forms see the same noise and the
    # losses should agree up to floating-point error
    print("Both forms should match up to floating-point error")
else:
    print("Implement the function first!")

### TODO 2: Effect of Noise Level

In [None]:
def train_with_sigma(sigma, n_epochs=1000):
    """
    Train a score model with a specific noise level and return
    the trained model and its loss history.

    Args:
        sigma: noise standard deviation
        n_epochs: number of training epochs

    Returns:
        model: trained ScoreNetwork
        losses: list of loss values
    """
    # ============ TODO ============
    # Step 1: Create a fresh ScoreNetwork and Adam optimizer (lr=1e-3)
    # Step 2: Train for n_epochs using dsm_loss()
    # Step 3: Record losses
    # ==============================

    model = None  # YOUR CODE HERE
    losses = []   # YOUR CODE HERE

    return model, losses

# Train with different noise levels
# sigmas = [0.1, 0.5, 1.0, 2.0]
# models = {}
# all_losses = {}
# for s in sigmas:
#     print(f"\nTraining with sigma={s}...")
#     m, l = train_with_sigma(s)
#     models[s] = m
#     all_losses[s] = l

In [None]:
# Visualize: compare score fields for different noise levels
# (Uncomment after implementing train_with_sigma)

# fig, axes = plt.subplots(1, 4, figsize=(24, 6))
# n_grid = 20
# x_range = torch.linspace(-5, 5, n_grid)
# y_range = torch.linspace(-5, 5, n_grid)
# xx, yy = torch.meshgrid(x_range, y_range, indexing='ij')
# grid = torch.stack([xx.flatten(), yy.flatten()], dim=1)
#
# for ax, s in zip(axes, sigmas):
#     with torch.no_grad():
#         scores = models[s](grid)
#     ax.scatter(data[:, 0], data[:, 1], alpha=0.1, s=3, c='blue')
#     ax.quiver(grid[:, 0], grid[:, 1], scores[:, 0], scores[:, 1],
#               color='red', alpha=0.7, scale=50)
#     ax.set_title(f'sigma = {s}')
#     ax.set_aspect('equal')
#     ax.grid(True, alpha=0.3)
# plt.suptitle('Effect of Noise Level on Learned Score Field')
# plt.tight_layout()
# plt.show()

## 6. Putting It All Together

In [None]:
# Full DSM training pipeline
model = ScoreNetwork()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
sigma = 0.5
n_epochs = 2000

losses = []
for epoch in range(n_epochs):
    # Sample a random mini-batch of 512 points (each "epoch" here is one gradient step)
    idx = torch.randperm(len(data))
    batch = data[idx[:512]]

    loss = dsm_loss(model, batch, sigma)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    losses.append(loss.item())
    if (epoch + 1) % 500 == 0:
        print(f"Epoch {epoch+1}/{n_epochs}, Loss: {loss.item():.4f}")

plt.figure(figsize=(10, 4))
plt.plot(losses, alpha=0.7)
plt.title('DSM Training Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid(True, alpha=0.3)
plt.show()

print(f"Final loss: {losses[-1]:.4f}")

## 7. Training and Results

In [None]:
# Visualize the learned score field
n_grid = 25
x_range = torch.linspace(-5, 5, n_grid)
y_range = torch.linspace(-5, 5, n_grid)
xx, yy = torch.meshgrid(x_range, y_range, indexing='ij')
grid = torch.stack([xx.flatten(), yy.flatten()], dim=1)

with torch.no_grad():
    scores = model(grid)

plt.figure(figsize=(10, 10))
plt.scatter(data[:, 0], data[:, 1], alpha=0.15, s=5, c='blue')
plt.quiver(grid[:, 0].numpy(), grid[:, 1].numpy(),
           scores[:, 0].numpy(), scores[:, 1].numpy(),
           color='red', alpha=0.7, scale=50)
plt.title('Learned Score Field via Denoising Score Matching', fontsize=14)
plt.xlabel('x1')
plt.ylabel('x2')
plt.axis('equal')
plt.grid(True, alpha=0.3)
plt.show()

print("Near the data, the arrows should point toward the clusters.")
print("The score field acts as a compass guiding us toward high-density regions.")
print("And we achieved this WITHOUT computing any Jacobian traces!")

In [None]:
# Connection to DDPM: show that score = -noise/sigma
print("=" * 60)
print("DSM <-> DDPM Equivalence")
print("=" * 60)

# Take a sample
x_clean = data[0:1]
eps = torch.randn_like(x_clean)
x_noisy = x_clean + sigma * eps

# DSM target
dsm_target = (x_clean - x_noisy) / (sigma ** 2)

# DDPM-style target (scaled)
ddpm_target = -eps / sigma

print(f"\nClean point:    {x_clean[0].numpy()}")
print(f"Noise (eps):    {eps[0].numpy()}")
print(f"Noisy point:    {x_noisy[0].numpy()}")
print(f"\nDSM target:     {dsm_target[0].numpy()}")
print(f"DDPM-style:     {ddpm_target[0].numpy()}")
print(f"\nThey are identical! Score prediction = Noise prediction (up to scaling)")
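
The "up to scaling" part can be made precise for the single-noise-level setup used here. If a network predicts the noise, dividing by $-\sigma$ turns that prediction into a score, and the two squared-error losses differ by the constant factor $1/\sigma^2$. The sketch below checks this with a made-up noise prediction `eps_pred`. It does not model DDPM's full noise schedule, which also rescales the clean signal.

```python
import torch

torch.manual_seed(0)

sigma_eq = 0.5
eps_true = torch.randn(8, 2)                       # noise that was actually added
eps_pred = eps_true + 0.1 * torch.randn(8, 2)      # made-up noise prediction

score_pred = -eps_pred / sigma_eq                  # convert noise prediction to a score
score_target = -eps_true / sigma_eq                # DSM target

score_loss = ((score_pred - score_target) ** 2).sum(dim=-1).mean()
noise_loss = ((eps_pred - eps_true) ** 2).sum(dim=-1).mean()

# The two losses differ only by the constant factor 1 / sigma^2
print(score_loss.item(), noise_loss.item() / sigma_eq ** 2)
```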

## 8. Final Output

In [None]:
# Final visualization: DSM score field with score magnitude heatmap
fig, axes = plt.subplots(1, 2, figsize=(18, 8))

# Left: Score field
with torch.no_grad():
    scores = model(grid)
    score_magnitudes = torch.norm(scores, dim=-1).reshape(n_grid, n_grid)

axes[0].scatter(data[:, 0], data[:, 1], alpha=0.1, s=3, c='blue')
axes[0].quiver(grid[:, 0].numpy(), grid[:, 1].numpy(),
               scores[:, 0].numpy(), scores[:, 1].numpy(),
               color='red', alpha=0.7, scale=50)
axes[0].set_title('Learned Score Field (DSM)', fontsize=14)
axes[0].set_xlabel('x1')
axes[0].set_ylabel('x2')
axes[0].set_aspect('equal')
axes[0].grid(True, alpha=0.3)

# Right: Score magnitude heatmap
im = axes[1].imshow(score_magnitudes.numpy().T,
                     extent=[-5, 5, -5, 5], origin='lower',
                     cmap='viridis', aspect='equal')
axes[1].scatter(data[:, 0], data[:, 1], alpha=0.15, s=3, c='white')
axes[1].set_title('Score Magnitude ||s(x)||', fontsize=14)
axes[1].set_xlabel('x1')
axes[1].set_ylabel('x2')
plt.colorbar(im, ax=axes[1])

plt.suptitle('Denoising Score Matching: Learned Score Function', fontsize=16)
plt.tight_layout()
plt.show()

print("\nKey takeaway: DSM replaces the expensive Jacobian trace with a")
print("simple noise-direction target. This is the objective diffusion models train on.")
print("\nCompare: Tractable SM for a 784-dim image needs the trace of a 784x784 Jacobian,")
print("one extra backward pass per dimension. DSM needs no Jacobian at all. Night and day.")

## 9. Reflection and Next Steps

### Think About This

1. **What happens if sigma is too small?** The noisy distribution is nearly identical to the clean distribution. Does this help or hurt training?

2. **What happens if sigma is too large?** The noisy points are far from the clean data. Does the score function still capture the data structure?

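3. **Why does DSM work even though we match the score of the NOISY distribution, not the clean one?** (Hint: Vincent (2011) showed that the DSM objective equals, up to a constant that does not depend on the network, explicit score matching against the noisy distribution $q_\sigma$. As sigma approaches 0, $q_\sigma$ approaches the clean data distribution.)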

4. **How does multi-scale noise (many sigmas at once) solve the problems from questions 1 and 2?** This is what NCSN (Noise Conditional Score Networks) and DDPM do!
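
For questions 1 and 2, it can help to look at how the regression target itself scales with the noise level. Since the target is $-\epsilon/\sigma$, its expected squared norm is $D/\sigma^2$. The sketch below checks this numerically for a few illustrative values of sigma.

```python
import torch

torch.manual_seed(0)

dim = 2
mean_sq_norms = {}
for sigma_k in [0.01, 0.1, 0.5, 2.0]:
    eps_k = torch.randn(100_000, dim)
    target_k = -eps_k / sigma_k                       # DSM target for this sigma
    mean_sq_norms[sigma_k] = (target_k ** 2).sum(dim=-1).mean().item()
    print(f"sigma={sigma_k:<5} E||target||^2 ~ {mean_sq_norms[sigma_k]:10.2f}"
          f"   D/sigma^2 = {dim / sigma_k ** 2:10.2f}")
```

Small sigmas give very large, high-variance targets, while large sigmas give small targets that carry little detail about the data.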

### Extension Challenge

Modify the code to train with MULTIPLE noise levels simultaneously. For each training step, randomly sample a sigma from a set of values (e.g., [0.1, 0.3, 0.5, 1.0, 2.0]) and condition the network on sigma by concatenating it to the input. This is the bridge to full diffusion models.
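
As a possible starting point, here is a minimal sketch of the two pieces this challenge needs: a network that takes sigma as an extra input, and a loss that draws one sigma per sample. The names `ConditionalScoreNetwork` and `multi_sigma_dsm_loss` are hypothetical, and the $\sigma^2$ loss weighting is an assumption borrowed from NCSN rather than something this notebook prescribes.

```python
import torch
import torch.nn as nn

class ConditionalScoreNetwork(nn.Module):
    """Hypothetical sigma-conditioned variant: sigma is concatenated to the input."""
    def __init__(self, input_dim=2, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim + 1, hidden_dim),   # +1 for the sigma channel
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x, sigma):
        # x: (N, D), sigma: (N, 1)
        return self.net(torch.cat([x, sigma], dim=-1))

def multi_sigma_dsm_loss(model, clean_data, sigmas):
    """DSM loss with one sigma drawn per sample from the 1D tensor `sigmas`."""
    idx = torch.randint(len(sigmas), (clean_data.shape[0],))
    sigma = sigmas[idx].unsqueeze(-1)               # (N, 1)
    eps = torch.randn_like(clean_data)
    noisy_data = clean_data + sigma * eps
    target = -eps / sigma
    pred = model(noisy_data, sigma)
    # Assumed sigma^2 weighting keeps every noise level's loss on a similar scale
    per_sample = (sigma ** 2) * ((pred - target) ** 2).sum(dim=-1, keepdim=True)
    return per_sample.mean()

sigmas_demo = torch.tensor([0.1, 0.3, 0.5, 1.0, 2.0])
cond_model = ConditionalScoreNetwork()
demo_batch = torch.randn(64, 2)                     # placeholder batch, not the real data
loss_demo = multi_sigma_dsm_loss(cond_model, demo_batch, sigmas_demo)
print(loss_demo.item())
```

Without the weighting, the smallest sigma would dominate the loss because its targets are the largest.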

### What's Next

In the next notebook, we will use our trained score function to actually **generate new data** using Langevin Dynamics -- turning our score compass into a generative model.