In [None]:
# 🔧 Setup: Run this cell first!
# Check GPU availability and install dependencies

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected. Some cells may run slowly.")
    print("   Go to Runtime → Change runtime type → GPU")

print(f"\n📦 Python {sys.version.split()[0]}")
print(f"🔥 PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"🎲 Random seed set to {SEED}")

%matplotlib inline

# 🚀 Ha & Schmidhuber World Models: Teaching Agents to Dream

**Notebook 2 of 6: World Action Models Series | Vizuara**

**Estimated time: ~40 minutes**

In this notebook, we will build the landmark **World Models** architecture from Ha & Schmidhuber (2018): a system where an agent learns a compressed model of its environment and then trains entirely inside its own imagination. By the end, you will have a working V-M-C pipeline where an agent learns to act by dreaming.

In [None]:
# ============================================================
# Setup: Install dependencies and configure environment
# ============================================================
!pip install gymnasium matplotlib numpy -q

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
import gymnasium as gym
from collections import deque

%matplotlib inline

# Reproducibility
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## 1. Why Does This Matter?

What if an agent could **dream**?

Think about how you learned to ride a bicycle. Yes, you needed real practice, but much of your learning happened *mentally*. Before your next attempt, your brain replayed what went wrong, imagined corrections, and rehearsed new strategies. You were training inside a model of the world that your brain had built.

Ha and Schmidhuber asked: **can we give this same ability to artificial agents?**

Their 2018 paper introduced a beautifully simple three-component architecture:

| Component | Role | Analogy |
|-----------|------|---------|
| **V** (Vision) | Compress raw observations into compact codes | Your retina compressing millions of photons into a scene |
| **M** (Memory) | Predict what happens next | Your brain imagining "if I lean left, I will fall" |
| **C** (Controller) | Choose actions | Your reflexes: fast, simple, reactive |

The breakthrough insight: once V and M are trained on real experience, the Controller can be trained **entirely inside M's imagination**, with no real environment needed. The agent literally learns by dreaming.

> **By the end of this notebook, you will build a complete V-M-C pipeline, train each component, and watch an agent learn to balance a pole, first in reality, then inside its own dream.**

## 2. Building Intuition (No Code Yet)

Let us build intuition for each component with three analogies before we touch any mathematics.

### 2.1 The Vision Model (V): Like Compressing a Photo

Imagine you take a high-resolution photograph (10 megapixels) and save it as a JPEG. The JPEG is much smaller, but it still captures the essential content: the people, the objects, the scene. You threw away pixel-level noise but kept the *meaning*.

The Vision model does exactly this. It takes a raw observation (the full state of the environment) and compresses it into a tiny **latent vector** $z$. This vector is like a JPEG of the observation: much smaller and lossy, but it keeps most of what the agent needs to know.

### 2.2 The Memory Model (M): Like Predicting the Next Scene in a Movie

Now imagine you are watching a movie, and someone pauses it. Can you predict what happens in the next frame? If a ball is flying through the air, you predict it will continue along its arc. If a character is about to open a door, you predict the door will swing open.

The Memory model does precisely this. Given the current compressed observation $z_t$ and the action the agent takes $a_t$, it predicts what the *next* compressed observation $z_{t+1}$ will look like. It has learned the dynamics of the world.

### 2.3 The Controller (C): Simple Reflexes

Here is the surprising part: the Controller is deliberately **tiny**. It is just a single linear layer (a matrix multiply and a bias term). Why so simple?

Because the hard work has already been done by V and M. The Vision model has compressed the raw observation into a meaningful representation. The Memory model has built up a rich internal state that summarizes the history. The Controller just needs to react to these, much like your reflexes, which are fast and simple but draw on all the context your brain has built up.

### The Full Loop: See, Remember, Act

```
Observation → [V] → latent z → [M] → next hidden state h → [C] → action a
                                  ↑                                    |
                                  └────────────────────────────────────┘
```

The agent sees the world through V, remembers through M, and acts through C. And when we want to train in dreams? We simply disconnect from the real environment and let M generate imaginary observations.

## 3. The Mathematics

Now let us formalize each component. After every equation, we will explain what it means computationally so that the math never feels abstract.

### 3.1 The VAE (Vision Model)

The Vision model is a **Variational Autoencoder (VAE)**. It has two parts:

**Encoder.** Compresses observation $x_t$ into a latent distribution:

$$\mu_t, \log \sigma_t^2 = \text{Encoder}(x_t)$$

Computationally, this says: pass the observation through a neural network that outputs two vectors, a mean $\mu$ and a log-variance $\log \sigma^2$. These define a Gaussian distribution in latent space.

**Reparameterization Trick.** Samples from this distribution in a differentiable way:

$$z_t = \mu_t + \sigma_t \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

Computationally, this says: take the mean vector, then add noise scaled by the standard deviation. We sample the noise $\epsilon$ from a standard normal distribution. This is clever because the randomness is in $\epsilon$ (which does not depend on any parameters), so gradients can flow through $\mu$ and $\sigma$.

Let us plug in some numbers. Suppose our latent dimension is 2, and the encoder outputs $\mu = [0.5, -0.3]$ and $\sigma = [0.1, 0.2]$. We sample $\epsilon = [1.0, -0.5]$. Then:

$$z = [0.5 + 0.1 \times 1.0, \; -0.3 + 0.2 \times (-0.5)] = [0.6, \; -0.4]$$

This is exactly what we want: a sample that is close to the mean but with controlled randomness.
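
Let us check the same arithmetic in a few lines of plain Python. This is only a minimal sketch: the helper name `reparameterize` is chosen here for illustration, it works on Python lists rather than tensors, and $\epsilon$ is fixed to the value from the example instead of being sampled.

```python
import random

def reparameterize(mu, sigma, eps=None):
    """Return z = mu + sigma * eps, elementwise.

    If eps is not supplied, it is drawn from a standard normal,
    which is what happens during training.
    """
    if eps is None:
        eps = [random.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

# Numbers from the worked example above (eps fixed for illustration).
mu = [0.5, -0.3]
sigma = [0.1, 0.2]
eps = [1.0, -0.5]

z = reparameterize(mu, sigma, eps)
print([round(v, 6) for v in z])  # [0.6, -0.4]
```

Because `mu` and `sigma` enter only through multiplication and addition, a framework with automatic differentiation can pass gradients through both, while the random draw stays outside the differentiated path.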

**Decoder.** Reconstructs the observation from $z$:

$$\hat{x}_t = \text{Decoder}(z_t)$$

**VAE Loss.** Balances reconstruction quality and regularization:

$$\mathcal{L}_{\text{VAE}} = \underbrace{\| x_t - \hat{x}_t \|^2}_{\text{Reconstruction}} + \underbrace{\beta \cdot D_{\text{KL}}(\mathcal{N}(\mu, \sigma^2) \| \mathcal{N}(0, I))}_{\text{Regularization}}$$

The first term says: the reconstruction should look like the original. The second term says: the latent distribution should stay close to a standard normal. This prevents the encoder from "cheating" by using wildly different regions of latent space for different observations. The $\beta$ parameter controls the balance.
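
To get a feel for the regularization term, here is a small sketch of the closed-form KL divergence between a diagonal Gaussian and the standard normal, written in plain Python. The function name `kl_to_standard_normal` is an assumption made for this illustration, and the inputs reuse the $\mu$ and $\sigma$ from the reparameterization example.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian.

    Closed form: 0.5 * sum(mu^2 + sigma^2 - 1 - log(sigma^2)),
    which equals -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2).
    """
    return 0.5 * sum(
        m * m + s * s - 1.0 - math.log(s * s)
        for m, s in zip(mu, sigma)
    )

# A posterior that already matches the prior pays no penalty.
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))  # 0.0

# A narrow, off-center posterior is penalized.
kl = kl_to_standard_normal([0.5, -0.3], [0.1, 0.2])
print(f"{kl:.3f}")  # 3.107
```

Most of the penalty in the second case comes from the small standard deviations rather than from the shifted means, which is why a larger $\beta$ tends to push the encoder toward wider, more overlapping latent distributions.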

### 3.2 The MDN-RNN (Memory Model)

The Memory model is an **LSTM** combined with a **Mixture Density Network (MDN)**. It predicts the next latent state as a probability distribution.

At each time step, it receives the current latent vector $z_t$ and action $a_t$, concatenates them, and feeds them to an LSTM:

$$h_{t+1} = \text{LSTM}([z_t, a_t], h_t)$$

Computationally, this says: take the compressed observation and the action, glue them together into one vector, and feed them into an LSTM cell along with the previous hidden state. The LSTM updates its internal memory.

The LSTM hidden state $h_{t+1}$ is then used to predict the distribution over the next latent state:

$$P(z_{t+1} \mid a_t, z_t, h_t) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(z_{t+1} \mid \mu_k, \sigma_k^2)$$

This is a **mixture of Gaussians**: instead of predicting a single point, the model predicts $K$ possible next states with different weights $\pi_k$. This is important because the future is often uncertain. If the agent is at the top of a hill, it could roll left or right. A single Gaussian cannot capture this bimodality, but a mixture can.

Let us plug in numbers with $K=2$ Gaussians in a 1D latent space. Suppose the model outputs:
- Component 1: $\pi_1 = 0.7$, $\mu_1 = 0.3$, $\sigma_1 = 0.1$
- Component 2: $\pi_2 = 0.3$, $\mu_2 = -0.5$, $\sigma_2 = 0.2$

The model is saying: "There is a 70% chance the next latent state will be near 0.3, and a 30% chance it will be near -0.5." This captures genuine uncertainty about the future.
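
The two-component example can be made concrete with a short, self-contained sketch. The helper names (`gaussian_pdf`, `mixture_pdf`, `sample_mixture`) are assumptions made for this illustration, and the code uses only the Python standard library rather than the notebook's tensors.

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    """Density of a 1D Gaussian with mean mu and std dev sigma."""
    coef = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coef * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def mixture_pdf(x, pis, mus, sigmas):
    """Density of the mixture: sum over k of pi_k * N(x | mu_k, sigma_k)."""
    return sum(p * gaussian_pdf(x, m, s)
               for p, m, s in zip(pis, mus, sigmas))

def sample_mixture(pis, mus, sigmas, rng):
    """Pick a component by its weight, then sample from that Gaussian."""
    k = rng.choices(range(len(pis)), weights=pis)[0]
    return rng.gauss(mus[k], sigmas[k])

pis, mus, sigmas = [0.7, 0.3], [0.3, -0.5], [0.1, 0.2]

# The density has two peaks, and the one near 0.3 is taller.
print(mixture_pdf(0.3, pis, mus, sigmas) > mixture_pdf(-0.5, pis, mus, sigmas))

# Roughly 70% of samples should come from the component centered at 0.3.
rng = random.Random(0)
samples = [sample_mixture(pis, mus, sigmas, rng) for _ in range(10_000)]
frac_positive = sum(s > -0.1 for s in samples) / len(samples)
print(f"fraction near 0.3: {frac_positive:.2f}")
```

Sampling in two stages (component first, then value) is the same idea the dream rollout relies on when it draws the next latent state from the predicted mixture.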

### 3.3 The Controller (C)

The Controller is remarkably simple, just a linear mapping:

$$a_t = W_c \cdot [z_t, h_t] + b_c$$

Computationally, this says: concatenate the current compressed observation $z_t$ and the LSTM hidden state $h_t$, multiply by a weight matrix $W_c$, and add a bias $b_c$.

Why so simple? The intelligence lives in V (which creates meaningful representations) and M (which builds up a rich hidden state summarizing history). The Controller just needs to map from this already-rich representation to an action. Ha and Schmidhuber showed that even a linear controller suffices when it sits on top of powerful V and M components.

For CartPole, our Controller takes a concatenated vector of size $(\text{latent\_dim} + \text{hidden\_dim})$ and outputs two logits, one for each discrete action (left or right). The action is then sampled from the categorical distribution defined by these logits.
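
To see how small this mapping really is, here is a sketch of the linear controller in plain Python. It assumes the sizes used later in this notebook (`latent_dim=2`, `hidden_dim=64`, `action_dim=2`); the function `linear_controller` and the hand-picked weights are hypothetical and exist only for illustration.

```python
def linear_controller(z, h, weights, bias):
    """Compute logits = W @ [z, h] + b with plain Python lists."""
    features = z + h  # list concatenation plays the role of [z_t, h_t]
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]

latent_dim, hidden_dim, action_dim = 2, 64, 2

# Parameter count of one linear layer: weights plus biases.
n_params = (latent_dim + hidden_dim) * action_dim + action_dim
print(n_params)  # 134

# Hypothetical weights: action 0 looks at z[0], action 1 at z[1].
weights = [[0.0] * (latent_dim + hidden_dim) for _ in range(action_dim)]
weights[0][0] = 1.0
weights[1][1] = 1.0
bias = [0.0, 0.0]

z = [0.2, 0.9]
h = [0.0] * hidden_dim
logits = linear_controller(z, h, weights, bias)
action = logits.index(max(logits))  # greedy choice over the two logits
print(logits, action)  # [0.2, 0.9] 1
```

With only 134 parameters at these sizes, the Controller is small enough that even simple policy-search methods can optimize it.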

### 3.4 Dream Training

Here is where the magic happens. Once V and M are trained on real experience, we can train C entirely inside M's imagination:

1. Start with a real initial observation $x_0$, encode it to get $z_0 = V(x_0)$
2. Initialize the LSTM hidden state $h_0$
3. For each dream step $t$:
   - Controller picks action: $a_t = C(z_t, h_t)$
   - Memory predicts next state: $z_{t+1} \sim M(z_t, a_t, h_t)$
   - Memory updates hidden state: $h_{t+1} = \text{LSTM}([z_t, a_t], h_t)$
   - Estimate reward from the predicted state (we will define a simple reward function)
4. Accumulate total dream reward
5. Update Controller parameters to maximize dream reward

The agent is literally learning by dreaming: there is no interaction with the real environment during Controller training. This is exactly what we want.
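
The loop in step 3 can be sketched with toy stand-ins before we build the real components. Everything in this block is hypothetical: `toy_memory`, `toy_controller`, and `toy_reward` are placeholders with hand-written rules, the latent state is a single number, and the parameter update of step 5 is left out. The point is only the control flow, in which no environment is ever called.

```python
import random

def toy_memory(z, a, h, rng):
    """Hypothetical stand-in for M: returns (next z, next h).

    The action pushes z left (a=0) or right (a=1); h is a running
    average of past z values, mimicking a recurrent state.
    """
    push = 0.1 if a == 1 else -0.1
    z_next = z + push + rng.gauss(0.0, 0.01)
    h_next = 0.9 * h + 0.1 * z
    return z_next, h_next

def toy_controller(z, h):
    """Hypothetical stand-in for C: push back toward z = 0.

    It ignores h to keep the example short.
    """
    return 0 if z > 0 else 1

def toy_reward(z):
    """Hypothetical proxy reward: highest when z is near 0."""
    return max(0.0, 1.0 - abs(z))

def dream(z0, steps, rng):
    """Roll out `steps` imagined transitions without a real environment."""
    z, h, total = z0, 0.0, 0.0
    for _ in range(steps):
        a = toy_controller(z, h)         # Controller acts on (z_t, h_t)
        z, h = toy_memory(z, a, h, rng)  # Memory imagines z_{t+1}, h_{t+1}
        total += toy_reward(z)           # Score the imagined state
    return total

rng = random.Random(0)
print(f"dream return: {dream(z0=0.5, steps=50, rng=rng):.2f}")
```

In the real pipeline the three stand-ins are replaced by the trained VAE, the MDN-RNN, and the learnable Controller, but the structure of the loop stays the same.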

## 4. Let Us Build It

Now let us implement each component step by step. We will use CartPole-v1, which provides a 4-dimensional state vector (cart position, cart velocity, pole angle, pole angular velocity). This keeps things tractable for a Colab notebook while preserving all the conceptual richness.

### 4.1 Data Collection: Gathering Real Experience

First, we need to collect experience from the real environment using a random policy. The agent will act randomly and record what it sees.

In [None]:
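def collect_data(env_name="CartPole-v1", num_episodes=200, seed=SEED):
    """Collect (observation, action, next_observation, done) tuples
    using a random policy."""
    env = gym.make(env_name)
    # Seed the action sampler as well: env.reset(seed=...) does not
    # make env.action_space.sample() reproducible on its own.
    env.action_space.seed(seed)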

    observations, actions, next_observations, dones = [], [], [], []

    for ep in range(num_episodes):
        obs, _ = env.reset(seed=seed + ep)
        done = False
        while not done:
            action = env.action_space.sample()  # Random action
            next_obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            observations.append(obs)
            actions.append(action)
            next_observations.append(next_obs)
            dones.append(done)

            obs = next_obs

    env.close()
    print(f"Collected {len(observations)} transitions "
          f"from {num_episodes} episodes")
    return (np.array(observations), np.array(actions),
            np.array(next_observations), np.array(dones))

obs_data, act_data, next_obs_data, done_data = collect_data()

Let us visualize some of the collected observations to understand what our agent is seeing.

In [None]:
# 📊 Visualize collected observations
fig, axes = plt.subplots(2, 2, figsize=(10, 6))
labels = ["Cart Position", "Cart Velocity",
          "Pole Angle", "Pole Angular Velocity"]

for i, (ax, label) in enumerate(zip(axes.flat, labels)):
    ax.plot(obs_data[:200, i], alpha=0.7, linewidth=0.8)
    ax.set_title(label, fontsize=12)
    ax.set_xlabel("Time step")
    ax.set_ylabel("Value")
    ax.grid(True, alpha=0.3)

plt.suptitle("CartPole Observations (First 200 Steps)",
             fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()
print(f"Observation shape: {obs_data.shape}")
print(f"Observation range: [{obs_data.min():.2f}, {obs_data.max():.2f}]")

### 4.2 Building the VAE (Vision Model)

Now let us build the Vision model. Since CartPole gives us a 4D state vector (not images), we will use simple linear layers. The architecture is:

- **Encoder**: 4D observation → two hidden layers → ($\mu$, $\log \sigma^2$) of dimension `latent_dim`
- **Decoder**: latent $z$ → two hidden layers → 4D reconstruction

In [None]:
class VAE(nn.Module):
    """Variational Autoencoder for compressing observations
    into a latent space."""

    def __init__(self, obs_dim=4, hidden_dim=64, latent_dim=2):
        super().__init__()
        self.latent_dim = latent_dim

        # Encoder: observation -> (mu, logvar)
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)

        # Decoder: z -> reconstructed observation
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, obs_dim),
        )

    def encode(self, x):
        """Encode observation to latent distribution parameters."""
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        """Sample z using the reparameterization trick."""
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + std * eps

    def decode(self, z):
        """Decode latent vector back to observation space."""
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        x_recon = self.decode(z)
        return x_recon, mu, logvar, z

vae = VAE(obs_dim=4, hidden_dim=64, latent_dim=2).to(device)
print(f"VAE parameters: {sum(p.numel() for p in vae.parameters()):,}")

Now let us define the VAE loss function. Recall from Section 3: it is the sum of reconstruction loss and KL divergence.

In [None]:
def vae_loss_fn(x_recon, x, mu, logvar, beta=1.0):
    """Compute VAE loss = reconstruction + beta * KL divergence.

    The KL divergence for a Gaussian q(z|x) against N(0,I) has
    a closed-form solution, which we use here.
    """
    # Reconstruction loss (MSE)
    recon_loss = F.mse_loss(x_recon, x, reduction="sum")

    # KL divergence: -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return (recon_loss + beta * kl_loss) / x.size(0), recon_loss / x.size(0), kl_loss / x.size(0)

Let us train the VAE on our collected observations.

In [None]:
def train_vae(vae, obs_data, epochs=30, batch_size=128,
              lr=1e-3, beta=0.5):
    """Train the VAE on collected observations."""
    optimizer = optim.Adam(vae.parameters(), lr=lr)
    dataset = torch.FloatTensor(obs_data).to(device)

    losses = {"total": [], "recon": [], "kl": []}

    for epoch in range(epochs):
        # Shuffle data each epoch
        perm = torch.randperm(len(dataset))
        epoch_loss, epoch_recon, epoch_kl = 0.0, 0.0, 0.0
        n_batches = 0

        for i in range(0, len(dataset), batch_size):
            batch = dataset[perm[i:i + batch_size]]
            x_recon, mu, logvar, z = vae(batch)
            loss, recon, kl = vae_loss_fn(x_recon, batch, mu, logvar, beta)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()
            epoch_recon += recon.item()
            epoch_kl += kl.item()
            n_batches += 1

        losses["total"].append(epoch_loss / n_batches)
        losses["recon"].append(epoch_recon / n_batches)
        losses["kl"].append(epoch_kl / n_batches)

        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1:3d} | Loss: {losses['total'][-1]:.4f} "
                  f"| Recon: {losses['recon'][-1]:.4f} "
                  f"| KL: {losses['kl'][-1]:.4f}")

    return losses

vae_losses = train_vae(vae, obs_data)

Let us visualize the training curves and the quality of reconstruction.

In [None]:
# 📊 VAE Training Curves
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

axes[0].plot(vae_losses["total"], color="navy")
axes[0].set_title("Total VAE Loss", fontsize=12)
axes[0].set_xlabel("Epoch")
axes[0].grid(True, alpha=0.3)

axes[1].plot(vae_losses["recon"], color="crimson")
axes[1].set_title("Reconstruction Loss", fontsize=12)
axes[1].set_xlabel("Epoch")
axes[1].grid(True, alpha=0.3)

axes[2].plot(vae_losses["kl"], color="forestgreen")
axes[2].set_title("KL Divergence", fontsize=12)
axes[2].set_xlabel("Epoch")
axes[2].grid(True, alpha=0.3)

plt.suptitle("VAE Training Progress", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()

Now let us see how well the VAE reconstructs observations. Good reconstruction means the latent space captures the essential information.

In [None]:
# 📊 Original vs Reconstructed Observations
vae.eval()
test_obs = torch.FloatTensor(obs_data[:50]).to(device)
with torch.no_grad():
    recon, mu, logvar, z = vae(test_obs)

test_np = test_obs.cpu().numpy()
recon_np = recon.cpu().numpy()

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
labels = ["Cart Position", "Cart Velocity",
          "Pole Angle", "Pole Angular Velocity"]

for i, (ax, label) in enumerate(zip(axes.flat, labels)):
    ax.plot(test_np[:, i], "b-", label="Original", alpha=0.8)
    ax.plot(recon_np[:, i], "r--", label="Reconstructed", alpha=0.8)
    ax.set_title(label, fontsize=12)
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.suptitle("VAE Reconstruction Quality",
             fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()

# Quantify reconstruction error
mse = np.mean((test_np - recon_np) ** 2)
print(f"Mean Squared Reconstruction Error: {mse:.6f}")

Not bad, right? The VAE has learned to compress 4D observations into 2D latent vectors while preserving the essential information. Let us also visualize the latent space itself.

In [None]:
# 📊 Visualize the 2D Latent Space
vae.eval()
all_obs = torch.FloatTensor(obs_data).to(device)
with torch.no_grad():
    _, mu_all, _, _ = vae(all_obs)

mu_np = mu_all.cpu().numpy()

plt.figure(figsize=(8, 6))
scatter = plt.scatter(mu_np[:, 0], mu_np[:, 1],
                      c=obs_data[:, 2],  # Color by pole angle
                      cmap="coolwarm", alpha=0.3, s=5)
plt.colorbar(scatter, label="Pole Angle")
plt.xlabel("Latent Dimension 1", fontsize=12)
plt.ylabel("Latent Dimension 2", fontsize=12)
plt.title("VAE Latent Space (colored by pole angle)",
          fontsize=14, fontweight="bold")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

This is the structure we want to see: similar physical states (similar pole angles) should cluster together, with the color changing smoothly across the plot. When that holds, the VAE has learned a meaningful compression of the environment.

### 4.3 Building the MDN-RNN (Memory Model)

Now let us build the Memory model. It uses an LSTM cell to maintain a running summary of the agent's history, and a Mixture Density Network head to predict the distribution over the next latent state.
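
Before the class itself, it may help to see how a flat output vector is turned into mixture parameters. The sketch below uses plain Python lists and a hypothetical helper `parse_mdn_output`; it assumes the layout of $K$ mixture logits, then $K \times D$ means, then $K \times D$ log standard deviations, which is the layout the class below uses.

```python
import math

def parse_mdn_output(raw, num_gaussians, latent_dim):
    """Split a flat output vector into (pi, mu, sigma).

    Layout assumed here: K mixture logits, then K*D means,
    then K*D log standard deviations.
    """
    K, D = num_gaussians, latent_dim
    assert len(raw) == K * (1 + 2 * D)

    logits = raw[:K]
    mu_flat = raw[K:K + K * D]
    log_sigma_flat = raw[K + K * D:]

    # Softmax turns logits into weights that sum to 1.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    pi = [e / sum(exps) for e in exps]

    # Reshape the flat lists into K rows of D values each.
    mu = [mu_flat[k * D:(k + 1) * D] for k in range(K)]
    sigma = [[math.exp(v) for v in log_sigma_flat[k * D:(k + 1) * D]]
             for k in range(K)]
    return pi, mu, sigma

K, D = 3, 2
raw = [0.1 * i for i in range(K * (1 + 2 * D))]  # 15 placeholder values
pi, mu, sigma = parse_mdn_output(raw, K, D)
print(len(raw), round(sum(pi), 6))  # 15 1.0
```

The softmax and the exponential are what guarantee valid mixture weights and strictly positive standard deviations, regardless of what the linear head outputs.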

In [None]:
class MDNRNN(nn.Module):
    """MDN-RNN World Model: LSTM + Mixture Density Network.

    Takes (z_t, a_t) as input, maintains hidden state h_t,
    and predicts P(z_{t+1}) as a mixture of Gaussians.
    """

    def __init__(self, latent_dim=2, action_dim=1,
                 hidden_dim=64, num_gaussians=3):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.latent_dim = latent_dim
        self.num_gaussians = num_gaussians

        # LSTM cell: processes (z_t, a_t) sequentially
        input_dim = latent_dim + action_dim
        self.lstm = nn.LSTMCell(input_dim, hidden_dim)

        # MDN head: predicts mixture parameters from h
        # For each Gaussian: pi (weight), mu (mean), sigma (std)
        n_params = num_gaussians * (1 + latent_dim + latent_dim)
        self.mdn_head = nn.Linear(hidden_dim, n_params)

    def init_hidden(self, batch_size=1):
        """Initialize LSTM hidden and cell states to zeros."""
        h = torch.zeros(batch_size, self.hidden_dim).to(device)
        c = torch.zeros(batch_size, self.hidden_dim).to(device)
        return h, c

    def forward(self, z, action, hidden):
        """One step of the world model.

        Args:
            z: latent vector [batch, latent_dim]
            action: action taken [batch, 1]
            hidden: tuple of (h, c) for LSTM
        Returns:
            pi, mu, sigma: MDN parameters
            hidden: updated (h, c)
        """
        # Ensure the action has shape [batch, 1], then concatenate
        # it with z to form the LSTM input
        action = action.float()
        if action.dim() == 1:
            action = action.unsqueeze(-1)
        inp = torch.cat([z, action], dim=-1)
        h, c = self.lstm(inp, hidden)

        # Get MDN parameters from hidden state
        mdn_params = self.mdn_head(h)
        pi, mu, sigma = self._parse_mdn_params(mdn_params)

        return pi, mu, sigma, (h, c)

    def _parse_mdn_params(self, params):
        """Parse raw network output into mixture parameters."""
        K = self.num_gaussians
        D = self.latent_dim

        # Split into pi, mu, sigma
        pi_raw = params[:, :K]
        mu = params[:, K:K + K * D].view(-1, K, D)
        sigma_raw = params[:, K + K * D:].view(-1, K, D)

        # Apply activations
        pi = F.softmax(pi_raw, dim=-1)        # Weights sum to 1
        sigma = torch.exp(sigma_raw) + 1e-6   # Positive std dev

        return pi, mu, sigma

mdnrnn = MDNRNN(latent_dim=2, action_dim=1,
                 hidden_dim=64, num_gaussians=3).to(device)
print(f"MDN-RNN parameters: {sum(p.numel() for p in mdnrnn.parameters()):,}")

Now we need a loss function for the MDN-RNN. The loss is the negative log-likelihood of the actual next latent state under the predicted mixture of Gaussians.
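
Before the implementation, a quick look at why the mixture likelihood is usually computed with log-sum-exp rather than by summing probabilities directly. This is a standard-library sketch with made-up log values; the helper `logsumexp` is written out here for illustration.

```python
import math

def logsumexp(values):
    """Stable log(sum(exp(v))) via the max-subtraction trick."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

# Log of (pi_k * N(target | mu_k, sigma_k)) for two components,
# with values chosen to be very negative (a poorly predicted target).
log_terms = [-800.0, -805.0]

# Naive route: exp() underflows to 0.0, so taking log() would fail.
naive_sum = sum(math.exp(v) for v in log_terms)
print(naive_sum)  # 0.0

# Stable route: the negative log-likelihood stays finite.
nll = -logsumexp(log_terms)
print(f"{nll:.4f}")
```

Early in training, targets are often far from every predicted component, so this kind of underflow is a practical concern and not just a numerical curiosity.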

In [None]:
def mdn_loss_fn(pi, mu, sigma, target):
    """Negative log-likelihood of target under the mixture.

    Args:
        pi: mixture weights [batch, K]
        mu: means [batch, K, D]
        sigma: std devs [batch, K, D]
        target: actual next z [batch, D]
    """
    target = target.unsqueeze(1)  # [batch, 1, D]

    # Log probability of target under each Gaussian component
    # log N(x | mu, sigma) = -0.5 * ((x-mu)/sigma)^2 - log(sigma) - 0.5*log(2*pi)
    var = sigma ** 2
    log_prob = (-0.5 * ((target - mu) ** 2 / var)
                - torch.log(sigma)
                - 0.5 * np.log(2 * np.pi))
    log_prob = log_prob.sum(dim=-1)  # Sum over latent dims [batch, K]

    # Weighted by mixture coefficients (log-sum-exp for stability)
    log_pi = torch.log(pi + 1e-8)
    log_mixture = torch.logsumexp(log_pi + log_prob, dim=-1)  # [batch]

    return -log_mixture.mean()

Let us prepare the sequential training data and train the MDN-RNN.
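
Training sequences must not cross episode boundaries, because the transition from the last step of one episode to the first step of the next is not a real transition. The following sketch shows the windowing idea on a toy list of `done` flags. The helper `episode_windows` is hypothetical and separate from the notebook's own function; an episode of length $L$ yields $L - T + 1$ windows of length $T$.

```python
def episode_windows(dones, seq_len):
    """Return (start, end) index pairs for every full-length window
    that lies entirely inside one episode. `end` is exclusive."""
    windows, start = [], 0
    for i, done in enumerate(dones):
        if done or i == len(dones) - 1:
            # The episode covers indices start..i inclusive.
            last_start = i + 1 - seq_len
            for j in range(start, last_start + 1):
                windows.append((j, j + seq_len))
            start = i + 1
    return windows

# Two episodes of lengths 5 and 3 (True marks each episode's last step).
dones = [False, False, False, False, True, False, False, True]
print(episode_windows(dones, seq_len=3))
# [(0, 3), (1, 4), (2, 5), (5, 8)]
```

Episodes shorter than `seq_len` contribute no windows at all, which matters for CartPole because random-policy episodes tend to be short.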

In [None]:
def prepare_sequences(obs_data, act_data, next_obs_data,
                      done_data, vae, seq_len=16):
    """Encode observations to latent space and create sequences
    for MDN-RNN training."""
    vae.eval()
    with torch.no_grad():
        obs_t = torch.FloatTensor(obs_data).to(device)
        next_obs_t = torch.FloatTensor(next_obs_data).to(device)
        mu_z, _ = vae.encode(obs_t)
        mu_z_next, _ = vae.encode(next_obs_t)

    z_data = mu_z.cpu().numpy()
    z_next_data = mu_z_next.cpu().numpy()

    # Build sequences, respecting episode boundaries
    sequences = []
    start = 0
    for i in range(len(done_data)):
        if done_data[i] or i == len(done_data) - 1:
            ep_len = i - start + 1
            if ep_len >= seq_len:
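                # Window starts run from `start` through i + 1 - seq_len
                # inclusive, so the final full window is not dropped.
                for j in range(start, i + 2 - seq_len):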
                    seq_z = z_data[j:j + seq_len]
                    seq_a = act_data[j:j + seq_len]
                    seq_z_next = z_next_data[j:j + seq_len]
                    sequences.append((seq_z, seq_a, seq_z_next))
            start = i + 1

    print(f"Created {len(sequences)} training sequences "
          f"of length {seq_len}")
    return sequences

sequences = prepare_sequences(
    obs_data, act_data, next_obs_data, done_data, vae
)

In [None]:
def train_mdnrnn(mdnrnn, sequences, epochs=20,
                 batch_size=64, lr=1e-3):
    """Train MDN-RNN on sequential latent data."""
    optimizer = optim.Adam(mdnrnn.parameters(), lr=lr)
    losses = []

    for epoch in range(epochs):
        np.random.shuffle(sequences)
        epoch_loss = 0.0
        n_batches = 0

        for i in range(0, len(sequences), batch_size):
            batch_seqs = sequences[i:i + batch_size]
            bs = len(batch_seqs)

            # Stack batch
            z_batch = torch.FloatTensor(
                np.array([s[0] for s in batch_seqs])).to(device)
            a_batch = torch.FloatTensor(
                np.array([s[1] for s in batch_seqs])).to(device)
            z_next_batch = torch.FloatTensor(
                np.array([s[2] for s in batch_seqs])).to(device)

            # Process sequence step by step
            hidden = mdnrnn.init_hidden(bs)
            seq_loss = 0.0
            seq_len = z_batch.shape[1]

            for t in range(seq_len):
                pi, mu, sigma, hidden = mdnrnn(
                    z_batch[:, t], a_batch[:, t], hidden)
                step_loss = mdn_loss_fn(
                    pi, mu, sigma, z_next_batch[:, t])
                seq_loss += step_loss

            loss = seq_loss / seq_len
            optimizer.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(mdnrnn.parameters(), 1.0)
            optimizer.step()

            epoch_loss += loss.item()
            n_batches += 1

        avg_loss = epoch_loss / max(n_batches, 1)
        losses.append(avg_loss)
        if (epoch + 1) % 5 == 0:
            print(f"Epoch {epoch+1:3d} | MDN-RNN Loss: {avg_loss:.4f}")

    return losses

mdnrnn_losses = train_mdnrnn(mdnrnn, sequences)

In [None]:
# 📊 MDN-RNN Training Curve
plt.figure(figsize=(8, 4))
plt.plot(mdnrnn_losses, color="darkorange", linewidth=2)
plt.title("MDN-RNN Training Loss", fontsize=14, fontweight="bold")
plt.xlabel("Epoch")
plt.ylabel("Negative Log-Likelihood")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Let us verify that the Memory model can actually predict future latent states. We will take a real trajectory, feed it through the model step by step, and compare predicted vs actual latent states.

In [None]:
# 📊 Predicted vs Actual Latent Trajectories
mdnrnn.eval()
vae.eval()

# Pick a sequence from training data
test_seq = sequences[0]
z_seq = torch.FloatTensor(test_seq[0]).unsqueeze(0).to(device)
a_seq = torch.FloatTensor(test_seq[1]).unsqueeze(0).to(device)
z_next_seq = torch.FloatTensor(test_seq[2]).unsqueeze(0).to(device)

predicted_z = []
hidden = mdnrnn.init_hidden(1)

with torch.no_grad():
    for t in range(z_seq.shape[1]):
        pi, mu, sigma, hidden = mdnrnn(
            z_seq[:, t], a_seq[:, t], hidden)
        # Use the highest-weight Gaussian's mean as prediction
        best_k = pi.argmax(dim=-1)
        pred = mu[0, best_k[0]].cpu().numpy()
        predicted_z.append(pred)

predicted_z = np.array(predicted_z)
actual_z = z_next_seq[0].cpu().numpy()

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for d in range(2):
    axes[d].plot(actual_z[:, d], "b-o", label="Actual", markersize=4)
    axes[d].plot(predicted_z[:, d], "r--x",
                 label="Predicted", markersize=4)
    axes[d].set_title(f"Latent Dimension {d+1}", fontsize=12)
    axes[d].legend()
    axes[d].grid(True, alpha=0.3)

plt.suptitle("MDN-RNN: Predicted vs Actual Latent Trajectories",
             fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()

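If the predicted curve tracks the actual one closely, the Memory model has learned the dynamics of the latent space: it can predict where the agent's state will go next, given its current state and action. Keep in mind that this check uses a training sequence, so it measures fit rather than generalization. Accurate one-step prediction is exactly what we need for dream training.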

### 4.4 Building the Controller

The Controller is the simplest component: a single linear layer that maps from the combined representation $[z_t, h_t]$ to an action.

In [None]:
class Controller(nn.Module):
    """Simple linear controller: maps [z, h] to action logits.

    This is intentionally minimal: the intelligence lives
    in the Vision and Memory models.
    """

    def __init__(self, latent_dim=2, hidden_dim=64, action_dim=2):
        super().__init__()
        self.fc = nn.Linear(latent_dim + hidden_dim, action_dim)

    def forward(self, z, h):
        """Select action from latent state and memory.

        Args:
            z: current latent observation [batch, latent_dim]
            h: LSTM hidden state [batch, hidden_dim]
        Returns:
            action_logits: [batch, action_dim]
        """
        combined = torch.cat([z, h], dim=-1)
        return self.fc(combined)

controller = Controller(latent_dim=2, hidden_dim=64,
                        action_dim=2).to(device)
print(f"Controller parameters: "
      f"{sum(p.numel() for p in controller.parameters()):,}")
print(f"\nTotal system parameters: "
      f"{sum(p.numel() for p in vae.parameters()) + sum(p.numel() for p in mdnrnn.parameters()) + sum(p.numel() for p in controller.parameters()):,}")

Notice how few parameters the Controller has compared to V and M. This is by design: the Controller is a thin decision layer on top of rich representations.

### 4.5 Dream Training Loop

Now we arrive at the most exciting part: training the Controller inside the Memory model's imagination. The agent will never touch the real environment during this phase.

In [None]:
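def dream_reward(z, latent_dim=2):
    """Estimate reward from latent state.

    Since we cannot access the real reward function inside
    a dream, we use a simple hand-crafted proxy. For CartPole,
    we decode z back to observation space and check if the
    pole angle is small (the pole is balanced).

    Note: `latent_dim` is unused and kept only so existing calls
    still work. Decoding uses the global `vae`.
    """
    with torch.no_grad():
        obs_recon = vae.decode(z)
    # Proxy reward: close to 1 when the pole angle (dim 2) is small
    # and the cart position (dim 0) is near the center
    pole_angle = obs_recon[:, 2].abs()
    cart_pos = obs_recon[:, 0].abs()
    # Clamp each factor separately so that two negative factors
    # cannot multiply into a spurious positive reward
    angle_term = (1.0 - pole_angle).clamp(min=0.0)
    pos_term = (1.0 - 0.1 * cart_pos).clamp(min=0.0)
    return angle_term * pos_term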
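
To get a feel for the shape of this proxy, here is a framework-free sketch of the same kind of formula on plain floats, with each factor clipped at zero. `proxy_reward` is a hypothetical helper, and the sketch assumes the decoded observation is in raw CartPole units (radians and position units), which may not hold if observations were normalized earlier in the notebook.

```python
def proxy_reward(pole_angle: float, cart_pos: float) -> float:
    """Scalar version of the dream reward proxy (illustrative only)."""
    angle_term = max(0.0, 1.0 - abs(pole_angle))
    pos_term = max(0.0, 1.0 - 0.1 * abs(cart_pos))
    return angle_term * pos_term

for angle, pos in [(0.0, 0.0), (0.1, 0.0), (0.2, 2.0), (1.5, 0.0)]:
    print(f"angle={angle:+.1f}, pos={pos:+.1f} "
          f"-> reward={proxy_reward(angle, pos):.3f}")
```

Under that assumption, a pole angle of 0.2 rad (roughly where a real CartPole episode terminates) with a centered cart still yields a proxy reward of 0.8, so the signal separating "balanced" from "about to fail" is fairly weak. A steeper penalty on the angle is one thing to experiment with.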

In [None]:
def dream_rollout(controller, mdnrnn, vae, initial_obs,
                  dream_steps=50, temperature=1.0):
    """Roll out a trajectory inside the world model's dream.

    Args:
        initial_obs: real observation to start the dream [1, obs_dim]
        dream_steps: how many steps to dream forward
        temperature: controls stochasticity of MDN sampling

    Returns:
        total_reward: accumulated reward in the dream
        log_probs: log probabilities of chosen actions (for REINFORCE)
    """
    # Encode initial observation
    with torch.no_grad():
        mu, logvar = vae.encode(initial_obs)
        z = mu  # Use mean for initial state (no sampling noise)

    hidden = mdnrnn.init_hidden(1)
    total_reward = 0.0
    log_probs = []

    for t in range(dream_steps):
        # Controller selects action
        action_logits = controller(z, hidden[0])
        action_dist = torch.distributions.Categorical(
            logits=action_logits)
        action = action_dist.sample()
        log_probs.append(action_dist.log_prob(action))

        # Memory predicts next state
        with torch.no_grad():
            pi, mu_mdn, sigma_mdn, hidden = mdnrnn(
                z, action.float().unsqueeze(-1), hidden)

        # Sample from the predicted distribution
        # Pick a Gaussian component
        k = torch.multinomial(pi, 1).squeeze(-1)
        chosen_mu = mu_mdn[0, k[0]]
        chosen_sigma = sigma_mdn[0, k[0]] * temperature
        z = (chosen_mu
             + chosen_sigma * torch.randn_like(chosen_mu))
        z = z.unsqueeze(0)  # [1, latent_dim]

        # Estimate reward
        r = dream_reward(z)
        total_reward += r.item()

    return total_reward, log_probs
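
The two-stage sampling inside the loop (pick a mixture component, then sample from its Gaussian with the standard deviation scaled by `temperature`) can be sketched without PyTorch. This is a minimal one-dimensional illustration; `sample_mdn` and the mixture values are hypothetical and not taken from the trained model.

```python
import random

def sample_mdn(pi, mus, sigmas, temperature=1.0, rng=random):
    """Draw one value from a 1-D Gaussian mixture.

    pi: mixture weights (sum to 1); mus/sigmas: per-component parameters.
    Returns the sampled value and the index of the chosen component.
    """
    # Step 1: pick a component index k with probability pi[k]
    k = rng.choices(range(len(pi)), weights=pi, k=1)[0]
    # Step 2: sample from that Gaussian, with sigma scaled by temperature
    z = mus[k] + sigmas[k] * temperature * rng.gauss(0.0, 1.0)
    return z, k

rng = random.Random(0)
pi, mus, sigmas = [0.7, 0.3], [-1.0, 2.0], [0.1, 0.5]
for temp in (0.0, 1.0):
    z, k = sample_mdn(pi, mus, sigmas, temperature=temp, rng=rng)
    print(f"temperature={temp}: component {k}, z={z:.3f}")
```

With `temperature=0.0` the sample collapses onto the chosen component's mean, so the only randomness left is the choice of component. Larger temperatures make the dream noisier.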

Now let us train the Controller using the REINFORCE algorithm, but entirely inside dreams.

In [None]:
def train_controller_in_dreams(controller, mdnrnn, vae,
                               real_obs_data, epochs=100,
                               dreams_per_epoch=8,
                               dream_steps=50, lr=1e-3):
    """Train the Controller using REINFORCE, entirely in dreams.

    Each epoch: start multiple dreams from different real
    observations, accumulate policy gradient, update Controller.
    """
    optimizer = optim.Adam(controller.parameters(), lr=lr)
    reward_history = []

    mdnrnn.eval()
    vae.eval()
    controller.train()

    for epoch in range(epochs):
        epoch_rewards = []
        optimizer.zero_grad()

        for _ in range(dreams_per_epoch):
            # Pick a random real observation to start the dream
            idx = np.random.randint(len(real_obs_data))
            init_obs = torch.FloatTensor(
                real_obs_data[idx:idx+1]).to(device)

            total_reward, log_probs = dream_rollout(
                controller, mdnrnn, vae, init_obs, dream_steps)
            epoch_rewards.append(total_reward)

            # REINFORCE loss: -log_prob * reward
            dream_loss = 0
            for lp in log_probs:
                dream_loss -= lp * total_reward
            dream_loss = dream_loss / dreams_per_epoch
            dream_loss.backward()

        optimizer.step()

        avg_reward = np.mean(epoch_rewards)
        reward_history.append(avg_reward)

        if (epoch + 1) % 20 == 0:
            print(f"Epoch {epoch+1:3d} | "
                  f"Avg Dream Reward: {avg_reward:.2f}")

    return reward_history

print("Training Controller in dreams...")
print("(No real environment interaction!)\n")
dream_rewards = train_controller_in_dreams(
    controller, mdnrnn, vae, obs_data,
    epochs=100, dreams_per_epoch=8, dream_steps=50
)
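
The update above weights every log-probability in a dream by that dream's total reward. To isolate the policy-gradient idea from the world model, here is a small sketch on a toy two-armed bandit. Two things are swapped in for illustration and are not part of the notebook's training loop: the bandit itself, and a running-mean baseline, a common way to reduce the variance of REINFORCE. All names are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(arm_rewards, steps=2000, lr=0.1, seed=0):
    """REINFORCE with a running-mean baseline on a noisy bandit."""
    rng = random.Random(seed)
    logits = [0.0] * len(arm_rewards)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        reward = arm_rewards[action] + rng.gauss(0.0, 0.1)
        advantage = reward - baseline
        # For a softmax policy, d log pi(action) / d logit_i
        # equals one_hot(action)_i - probs_i
        for i in range(len(logits)):
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad_log_prob
        # Track the average reward seen so far as the baseline
        baseline += 0.05 * (reward - baseline)
    return softmax(logits)

final_probs = reinforce_bandit([1.0, 0.2])
print([round(p, 3) for p in final_probs])
```

The policy should end up strongly preferring the first arm. Subtracting a baseline does not change the expected gradient, but it keeps the update from pushing up every sampled action merely because all rewards are positive, which is also the situation in the dream reward used here.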

In [None]:
# 📊 Dream Training Reward Curve
plt.figure(figsize=(10, 4))
plt.plot(dream_rewards, alpha=0.3, color="mediumpurple", linewidth=0.8)

# Smoothed curve
window = 10
smoothed = np.convolve(dream_rewards,
                       np.ones(window)/window, mode="valid")
plt.plot(range(window-1, len(dream_rewards)), smoothed,
         color="indigo", linewidth=2, label="Smoothed (10-epoch)")

plt.title("Controller Training: Learning in Dreams",
          fontsize=14, fontweight="bold")
plt.xlabel("Epoch")
plt.ylabel("Average Dream Reward")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
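
The smoothed curve is shorter than the raw one, which is why it is plotted against `range(window-1, len(dream_rewards))`. A plain-Python sketch of the same moving average makes the bookkeeping explicit; the input below is a placeholder ramp, not actual training output, and `moving_average` is a hypothetical helper.

```python
def moving_average(values, window):
    """Mean over each full window, like np.convolve(..., mode="valid")
    with a uniform kernel of ones(window) / window."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

rewards = [float(r) for r in range(100)]  # placeholder curve, not real results
window = 10
smoothed = moving_average(rewards, window)
x_positions = list(range(window - 1, len(rewards)))
print(len(smoothed), len(x_positions))  # 91 91
print(smoothed[0])                      # mean of 0..9 = 4.5
```

Each smoothed point is aligned with the last epoch of its window, so the purple line starts at epoch index 9 rather than 0.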

## 5. 🔧 Your Turn (TODO)

Now it is your turn to implement two key functions. These are the core mechanisms that make World Models work.

### TODO 1: Implement the Reparameterization Trick

The reparameterization trick is what makes the VAE trainable. Given the mean $\mu$ and log-variance $\log \sigma^2$, you need to sample a latent vector $z$ such that gradients can flow through $\mu$ and $\log \sigma^2$.

Recall the formula: $z = \mu + \sigma \cdot \epsilon$, where $\sigma = e^{0.5 \cdot \log \sigma^2}$ and $\epsilon \sim \mathcal{N}(0, I)$.
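
One way to see why this makes the VAE trainable, sketched with the shorthand $s = \log \sigma^2$ (introduced here only for readability): once $\epsilon$ has been drawn, $z$ is a deterministic, differentiable function of $\mu$ and $s$, so element-wise

$$
\frac{\partial z}{\partial \mu} = 1, \qquad
\frac{\partial z}{\partial s} = \frac{\partial}{\partial s}\left(e^{0.5\, s}\right)\epsilon = \tfrac{1}{2}\,\sigma\,\epsilon .
$$

Both derivatives are well defined for a fixed draw of $\epsilon$. The randomness sits entirely in $\epsilon$, which does not depend on the encoder's parameters, so backpropagation can pass through $z$ into $\mu$ and $\log \sigma^2$.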

In [None]:
def reparameterize_todo(mu, logvar):
    """Sample z using the reparameterization trick.

    Args:
        mu: mean of the latent distribution [batch, latent_dim]
        logvar: log variance [batch, latent_dim]

    Returns:
        z: sampled latent vector [batch, latent_dim]

    Steps:
        1. Compute std from logvar: std = exp(0.5 * logvar)
        2. Sample epsilon from N(0, I) with same shape as std
        3. Return mu + std * epsilon
    """
    # ============ TODO ============
    # Implement the three steps described above.
    # Hint: use torch.exp() and torch.randn_like()
    # ==============================

    z = None  # YOUR CODE HERE

    return z

In [None]:
# ✅ Verification: Run this cell to check your implementation
torch.manual_seed(42)
test_mu = torch.tensor([[0.5, -0.3]])
test_logvar = torch.tensor([[-2.0, -1.0]])

torch.manual_seed(42)  # Reset seed for reproducible epsilon
result = reparameterize_todo(test_mu, test_logvar)

# Expected: mu + exp(0.5*logvar) * epsilon
# std = exp([-1.0, -0.5]) = [0.3679, 0.6065]
# epsilon (seed=42) = [0.3367, 0.1288]
# z = [0.5 + 0.3679*0.3367, -0.3 + 0.6065*0.1288]
#   = [0.6239, -0.2219]
expected = torch.tensor([[0.6239, -0.2219]])

if result is None:
    print("❌ You haven't implemented the function yet. "
          "Replace 'z = None' with your code.")
elif torch.allclose(result, expected, atol=1e-3):
    print("✅ Correct! Your reparameterization trick works perfectly.")
    print(f"   Result: {result.squeeze().tolist()}")
    print(f"   Expected: {expected.squeeze().tolist()}")
else:
    print(f"❌ Not quite. Got {result.squeeze().tolist()}, "
          f"expected {expected.squeeze().tolist()}")
    print("   Hint: std = exp(0.5 * logvar), not exp(logvar)")

### TODO 2: Implement the Dream Rollout

This is the heart of World Models: rolling forward through the Memory model to generate an imaginary trajectory and accumulate reward.

In [None]:
def dream_rollout_todo(controller, mdnrnn, vae,
                       initial_z, initial_hidden,
                       num_steps=20):
    """Roll forward through the world model's imagination.

    Args:
        controller: the Controller network
        mdnrnn: the Memory (MDN-RNN) network
        vae: the Vision (VAE) network (for reward estimation)
        initial_z: starting latent state [1, latent_dim]
        initial_hidden: tuple of (h, c) for LSTM
        num_steps: how many dream steps to take

    Returns:
        total_reward: float, accumulated reward over the dream
        z_trajectory: list of latent states visited

    For each step:
        1. Use controller to get action logits from (z, h)
        2. Sample action from Categorical distribution
        3. Run mdnrnn forward with (z, action, hidden)
           to get (pi, mu, sigma, new_hidden)
        4. Set next z deterministically to the mean of the
           highest-weight Gaussian (the component with largest pi)
        5. Compute reward using dream_reward(z)
        6. Accumulate reward and store z in trajectory
    """
    z = initial_z
    hidden = initial_hidden
    total_reward = 0.0
    z_trajectory = [z.detach().cpu().numpy().squeeze()]

    # ============ TODO ============
    # Implement the dream rollout loop described above.
    # Remember:
    #   - action_logits = controller(z, hidden[0])
    #   - Use torch.distributions.Categorical for sampling
    #   - For MDN sampling, use pi.argmax to pick best component
    #   - Use torch.no_grad() for mdnrnn forward pass
    #   - Accumulate reward as a float with .item()
    # ==============================

    pass  # YOUR CODE HERE

    return total_reward, z_trajectory

In [None]:
# ✅ Verification: Run this cell to check your implementation
torch.manual_seed(42)
np.random.seed(42)

# Create a test initial state
test_init_obs = torch.FloatTensor(obs_data[0:1]).to(device)
with torch.no_grad():
    test_mu, _ = vae.encode(test_init_obs)
test_hidden = mdnrnn.init_hidden(1)

controller.eval()
mdnrnn.eval()

result = dream_rollout_todo(
    controller, mdnrnn, vae,
    test_mu, test_hidden, num_steps=20
)

if result is None or (result[0] == 0.0 and len(result[1]) == 1):
    print("❌ You haven't implemented the dream rollout yet. "
          "Replace 'pass' with your loop code.")
else:
    total_r, z_traj = result
    print("✅ Dream rollout completed!")
    print(f"   Total reward over 20 steps: {total_r:.2f}")
    print(f"   Trajectory length: {len(z_traj)} states")
    if len(z_traj) == 21:  # initial + 20 steps
        print("   ✅ Correct trajectory length!")
    else:
        print(f"   ⚠️ Expected 21 states (initial + 20 steps), "
              f"got {len(z_traj)}")

    # Plot the dream trajectory
    z_traj_np = np.array(z_traj)
    plt.figure(figsize=(8, 5))
    plt.plot(z_traj_np[:, 0], z_traj_np[:, 1], "o-",
             color="mediumpurple", markersize=5, alpha=0.8)
    plt.plot(z_traj_np[0, 0], z_traj_np[0, 1], "g*",
             markersize=15, label="Start")
    plt.plot(z_traj_np[-1, 0], z_traj_np[-1, 1], "r*",
             markersize=15, label="End")
    plt.title("Dream Trajectory in Latent Space",
              fontsize=14, fontweight="bold")
    plt.xlabel("Latent Dim 1")
    plt.ylabel("Latent Dim 2")
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

## 6. Putting It All Together: The Full V-M-C Pipeline

Now let us connect all three components into a single coherent system that can both interact with the real environment and dream.

In [None]:
class WorldModelAgent:
    """Complete V-M-C World Model agent.

    Combines Vision (VAE), Memory (MDN-RNN), and Controller
    into a unified system that can:
    1. Act in the real environment
    2. Dream (roll forward in imagination)
    3. Train the Controller via dream experience
    """

    def __init__(self, vae, mdnrnn, controller):
        self.vae = vae
        self.mdnrnn = mdnrnn
        self.controller = controller
        self.hidden = None

    def reset(self):
        """Reset the memory state for a new episode."""
        self.hidden = self.mdnrnn.init_hidden(1)

    def act(self, observation):
        """Choose an action given a real observation.

        Full pipeline: observation → V → z → [z,h] → C → action
        """
        self.vae.eval()
        self.controller.eval()

        with torch.no_grad():
            obs_t = torch.FloatTensor(observation).unsqueeze(0).to(device)
            mu, _ = self.vae.encode(obs_t)
            z = mu  # Use mean (no sampling noise at test time)

            action_logits = self.controller(z, self.hidden[0])
            action = action_logits.argmax(dim=-1).item()

            # Update memory with this step
            action_t = torch.tensor([[action]]).float().to(device)
            _, _, _, self.hidden = self.mdnrnn(
                z, action_t, self.hidden)

        return action

    def evaluate_real(self, env_name="CartPole-v1",
                      num_episodes=10, max_steps=500):
        """Evaluate the agent in the real environment."""
        env = gym.make(env_name)
        episode_rewards = []

        for ep in range(num_episodes):
            obs, _ = env.reset(seed=SEED + ep + 1000)
            self.reset()
            total_reward = 0

            for step in range(max_steps):
                action = self.act(obs)
                obs, reward, terminated, truncated, _ = env.step(action)
                total_reward += reward
                if terminated or truncated:
                    break

            episode_rewards.append(total_reward)

        env.close()
        return episode_rewards

agent = WorldModelAgent(vae, mdnrnn, controller)
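
Note that `act` picks the action with `argmax`, whereas the dream training loop sampled from a `Categorical` distribution. The difference between the two selection rules can be sketched in plain Python; the logits below are made up for illustration, and the helper names are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_action(logits):
    """Evaluation-style choice: always the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sampled_action(logits, rng):
    """Training-style choice: draw from the softmax distribution."""
    return rng.choices(range(len(logits)), weights=softmax(logits), k=1)[0]

logits = [0.2, 1.0]  # made-up Controller output for two actions
rng = random.Random(0)
samples = [sampled_action(logits, rng) for _ in range(1000)]
print("greedy:", greedy_action(logits))
print("softmax:", [round(p, 3) for p in softmax(logits)])
print("fraction of sampled action 1:", sum(samples) / len(samples))
```

Sampling keeps exploring the less likely action during training, while the greedy rule commits to the Controller's current best guess. That is a reasonable default for evaluation, though it means the evaluated policy is not exactly the one that was trained.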

## 7. Training and Results

Let us evaluate our dream-trained agent in the **real** environment. Remember: the Controller was trained entirely inside the Memory model's imagination. Now we test whether those dream skills transfer to reality.

In [None]:
# Evaluate dream-trained agent in reality
print("Evaluating dream-trained agent in real CartPole...")
real_rewards = agent.evaluate_real(num_episodes=20)

print(f"\nResults over 20 episodes:")
print(f"  Mean reward:   {np.mean(real_rewards):.1f}")
print(f"  Std reward:    {np.std(real_rewards):.1f}")
print(f"  Max reward:    {np.max(real_rewards):.1f}")
print(f"  Min reward:    {np.min(real_rewards):.1f}")

In [None]:
# Compare with a random agent baseline
env = gym.make("CartPole-v1")
random_rewards = []
for ep in range(20):
    obs, _ = env.reset(seed=SEED + ep + 2000)
    total = 0
    done = False
    while not done:
        obs, r, term, trunc, _ = env.step(env.action_space.sample())
        total += r
        done = term or trunc
    random_rewards.append(total)
env.close()

print(f"Random agent mean reward: {np.mean(random_rewards):.1f}")
print(f"Dream-trained agent mean reward: {np.mean(real_rewards):.1f}")
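
With only 20 episodes per agent, the two means are noisy, so it can help to look at the standard error alongside each mean. The sketch below is a hypothetical helper run on placeholder numbers (not results from this notebook). It uses the sample standard deviation from the `statistics` module, which differs slightly from NumPy's default population `np.std`.

```python
import math
import statistics

def mean_and_sem(rewards):
    """Mean and standard error of the mean for a list of episode returns."""
    mean = statistics.mean(rewards)
    sem = statistics.stdev(rewards) / math.sqrt(len(rewards))
    return mean, sem

# Placeholder numbers for illustration only, not results from this notebook
random_demo = [18.0, 25.0, 14.0, 31.0, 22.0]
trained_demo = [60.0, 45.0, 80.0, 52.0, 63.0]

for name, data in [("random", random_demo), ("dream-trained", trained_demo)]:
    mean, sem = mean_and_sem(data)
    print(f"{name:>13}: {mean:.1f} +/- {sem:.1f}")
```

If the gap between the two means is only a small multiple of the standard errors, running more evaluation episodes is the simplest way to tell whether the dream-trained agent really beats the random baseline.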

In [None]:
# 📊 Performance Comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart comparison
axes[0].bar(["Random Agent", "Dream-Trained Agent"],
            [np.mean(random_rewards), np.mean(real_rewards)],
            color=["salmon", "mediumpurple"],
            edgecolor="black", linewidth=0.5)
axes[0].errorbar(
    [0, 1],
    [np.mean(random_rewards), np.mean(real_rewards)],
    yerr=[np.std(random_rewards), np.std(real_rewards)],
    fmt="none", color="black", capsize=5
)
axes[0].set_ylabel("Average Reward", fontsize=12)
axes[0].set_title("Random vs Dream-Trained Agent",
                   fontsize=14, fontweight="bold")
axes[0].grid(True, alpha=0.3, axis="y")

# Episode-by-episode comparison
axes[1].plot(random_rewards, "o-", color="salmon",
             label="Random", alpha=0.7)
axes[1].plot(real_rewards, "s-", color="mediumpurple",
             label="Dream-Trained", alpha=0.7)
axes[1].set_xlabel("Episode", fontsize=12)
axes[1].set_ylabel("Reward", fontsize=12)
axes[1].set_title("Per-Episode Rewards",
                   fontsize=14, fontweight="bold")
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 8. 🎯 Final Output: Real Environment vs. Agent's Dream

This is the culmination of our notebook. Let us compare what actually happens in the real environment with what the agent *imagines* will happen. We will start from the same initial state and roll forward ‚Äî once in reality, once in the dream.

In [None]:
# Run one real episode and one dream episode from the same start
env = gym.make("CartPole-v1")
obs, _ = env.reset(seed=123)
init_obs = obs.copy()

# --- Real trajectory ---
agent.reset()
real_trajectory = [obs.copy()]
real_actions = []
done = False
for step in range(100):
    action = agent.act(obs)
    obs, reward, terminated, truncated, _ = env.step(action)
    real_trajectory.append(obs.copy())
    real_actions.append(action)
    if terminated or truncated:
        break
env.close()
real_trajectory = np.array(real_trajectory)

In [None]:
# --- Dream trajectory ---
vae.eval()
mdnrnn.eval()
controller.eval()

init_obs_t = torch.FloatTensor(init_obs).unsqueeze(0).to(device)
with torch.no_grad():
    z, _ = vae.encode(init_obs_t)

hidden = mdnrnn.init_hidden(1)
dream_z_list = [z.cpu().numpy().squeeze()]
dream_obs_list = [init_obs.copy()]

with torch.no_grad():
    for step in range(min(len(real_actions), 100)):
        # Controller picks action
        logits = controller(z, hidden[0])
        action = logits.argmax(dim=-1)

        # Memory predicts next state
        pi, mu_m, sigma_m, hidden = mdnrnn(
            z, action.float().unsqueeze(-1), hidden)
        best_k = pi.argmax(dim=-1)
        z = mu_m[0, best_k[0]].unsqueeze(0)

        # Decode to observation space for visualization
        obs_recon = vae.decode(z)

        dream_z_list.append(z.cpu().numpy().squeeze())
        dream_obs_list.append(obs_recon.cpu().numpy().squeeze())

dream_obs_array = np.array(dream_obs_list)

In [None]:
# 📊 Side-by-Side: Real vs Dream Trajectories
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
labels = ["Cart Position", "Cart Velocity",
          "Pole Angle", "Pole Angular Velocity"]

n_steps = min(len(real_trajectory), len(dream_obs_array))

for i, (ax, label) in enumerate(zip(axes.flat, labels)):
    ax.plot(real_trajectory[:n_steps, i], "b-",
            label="Real Environment", linewidth=2, alpha=0.8)
    ax.plot(dream_obs_array[:n_steps, i], "r--",
            label="Agent's Dream", linewidth=2, alpha=0.8)
    ax.set_title(label, fontsize=13)
    ax.set_xlabel("Time Step")
    ax.legend(fontsize=10)
    ax.grid(True, alpha=0.3)

plt.suptitle("🎯 Real Environment vs. Agent's Dream",
             fontsize=16, fontweight="bold")
plt.tight_layout()
plt.show()

print(f"Real trajectory length:  {len(real_trajectory)} states")
print(f"Dream trajectory length: {len(dream_obs_array)} states")
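
The plots show the drift qualitatively. A per-step error makes it measurable. Below is a minimal sketch of a root-mean-square error across observation dimensions at each shared time step. `per_step_rmse` is a hypothetical helper, and the toy trajectories are invented for illustration rather than taken from the arrays above.

```python
import math

def per_step_rmse(real, dream):
    """RMSE across observation dimensions at each shared time step."""
    n = min(len(real), len(dream))
    return [
        math.sqrt(
            sum((r - d) ** 2 for r, d in zip(real[t], dream[t]))
            / len(real[t])
        )
        for t in range(n)
    ]

# Toy 2-D trajectories for illustration only, not notebook output
real_demo = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.4]]
dream_demo = [[0.0, 0.0], [0.1, 0.1], [0.5, 0.0]]
errors = per_step_rmse(real_demo, dream_demo)
print([round(e, 3) for e in errors])
```

The same function should work on `real_trajectory` and `dream_obs_array` after converting them with `.tolist()`, assuming both are in the same units. Plotting the result against the time step shows how quickly the dream departs from reality.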

In [None]:
# 📊 Dream trajectory in latent space
dream_z_np = np.array(dream_z_list)

# Also encode the real trajectory for comparison
with torch.no_grad():
    real_obs_t = torch.FloatTensor(real_trajectory).to(device)
    real_z, _ = vae.encode(real_obs_t)
    real_z_np = real_z.cpu().numpy()

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Real trajectory in latent space
axes[0].plot(real_z_np[:, 0], real_z_np[:, 1], "b-o",
             markersize=3, alpha=0.6, linewidth=1)
axes[0].plot(real_z_np[0, 0], real_z_np[0, 1], "g*",
             markersize=15, zorder=5, label="Start")
axes[0].plot(real_z_np[-1, 0], real_z_np[-1, 1], "r*",
             markersize=15, zorder=5, label="End")
axes[0].set_title("Real Trajectory (Latent Space)",
                   fontsize=13, fontweight="bold")
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Dream trajectory in latent space
axes[1].plot(dream_z_np[:, 0], dream_z_np[:, 1], "r-o",
             markersize=3, alpha=0.6, linewidth=1)
axes[1].plot(dream_z_np[0, 0], dream_z_np[0, 1], "g*",
             markersize=15, zorder=5, label="Start")
axes[1].plot(dream_z_np[-1, 0], dream_z_np[-1, 1], "r*",
             markersize=15, zorder=5, label="End")
axes[1].set_title("Dream Trajectory (Latent Space)",
                   fontsize=13, fontweight="bold")
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Label the axes on both panels
for ax in axes:
    ax.set_xlabel("Latent Dim 1")
    ax.set_ylabel("Latent Dim 2")

plt.suptitle("Latent Space Trajectories: Reality vs Imagination",
             fontsize=15, fontweight="bold")
plt.tight_layout()
plt.show()

In [None]:
# 🎯 Congratulations!
print("=" * 60)
print("🎯 CONGRATULATIONS!")
print("=" * 60)
print()
print("You have successfully built a complete World Model!")
print()
print("Here is what you accomplished in this notebook:")
print("  1. ✅ Collected experience from the real environment")
print("  2. ✅ Built a VAE that compresses observations to latent codes")
print("  3. ✅ Built an MDN-RNN that predicts future states")
print("  4. ✅ Built a simple linear Controller")
print("  5. ✅ Trained the Controller entirely in dreams")
print("  6. ✅ Evaluated the dream-trained agent in reality")
print("  7. ✅ Compared real vs imagined trajectories")
print()
print("The key insight: by learning a good model of the world")
print("(V + M), an agent can train its behavior (C) without")
print("any additional interaction with the real environment.")
print("It learns by dreaming.")
print()
print("=" * 60)

## 9. Reflection and Next Steps

### 🤔 Reflection Questions

Take a moment to think about these questions before moving on to the next notebook:

1. **Model fidelity:** The dream trajectories diverge from reality over time. Why does this happen, and what determines how quickly the dream drifts from reality? How might we improve this?

2. **Controller simplicity:** Ha and Schmidhuber deliberately used a simple linear controller. What are the advantages of this design choice? What happens if we make the controller more complex (e.g., a deep network)?

3. **Compounding errors:** In dream training, the Memory model's prediction errors compound at each step: the agent acts on a slightly wrong state, which leads to a slightly more wrong next state. How does this relate to the concept of "distribution shift" in machine learning? Can you think of ways to mitigate this?
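
As a starting point for the third question, compounding can be seen in the simplest possible setting. The sketch below assumes a one-dimensional linear system and a model whose one-step gain is off by 2%. Both are hypothetical choices for illustration and say nothing about the actual MDN-RNN error.

```python
def rollout(x0, gain, steps):
    """Iterate x_{t+1} = gain * x_t and return all visited states."""
    states = [x0]
    for _ in range(steps):
        states.append(gain * states[-1])
    return states

true_gain, model_gain = 1.00, 1.02  # hypothetical 2% one-step model error
real = rollout(1.0, true_gain, 30)
dream = rollout(1.0, model_gain, 30)
for t in (1, 10, 30):
    print(f"step {t:2d}: |dream - real| = {abs(dream[t] - real[t]):.3f}")
```

A one-step error of 0.02 grows to more than 0.8 after 30 steps, because each prediction is fed back in as the next input. Shorter dream horizons, or restarting dreams frequently from real observations as the training loop above does, limit how far this can go.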

### 🔧 Optional Challenges

1. **Increase the latent dimension.** Change `latent_dim` from 2 to 8 or 16. Does the VAE reconstruct better? Does the MDN-RNN predict better? Does the agent perform better in the real environment? Plot the results for different latent dimensions.

2. **Try a harder environment.** Replace CartPole-v1 with `MountainCar-v0` or `Acrobot-v1`. You will need to adjust the reward proxy function and possibly the network sizes. Can the World Model learn meaningful dynamics for these environments?

### What Comes Next

In the next notebook, we will explore how World Models evolved. Relative to the simplified setup used in this notebook, Dreamer (Hafner et al., 2020) extended these ideas with:
- **Continuous action spaces** instead of discrete ones
- **Learned reward models** instead of hand-crafted reward proxies
- **Latent imagination with actor-critic** instead of simple REINFORCE
- **Recurrent State-Space Models** that are more expressive than MDN-RNNs

The progression from Ha and Schmidhuber's V-M-C to Dreamer is a beautiful case study in how simple, elegant ideas get refined and scaled. We will see you there.

---

**Original paper:** Ha, D., & Schmidhuber, J. (2018). *World Models.* arXiv:1803.10122

**Notebook by Vizuara | World Action Models Series, Notebook 2 of 6**