# 🚀 World Action Models — Notebook Series

## How AI Learns to Imagine Before It Acts

*A Vizuara learning series based on the article: "World Action Models: How AI Learns to Imagine Before It Acts"*

---

Welcome to this series of six hands-on Google Colab notebooks! Together, they take you on a journey from the fundamentals of world models to the cutting edge of embodied AI — Vision-Language-Action models that can see, understand, and act in the physical world.

Every notebook is designed to be run independently in Google Colab with a T4 GPU. Each one builds on the concepts from the previous notebooks, but includes enough context to stand alone.

## 📚 The Learning Path

### Notebook 1: World Models from First Principles
**Estimated time:** ~25 minutes | **Prerequisites:** Basic PyTorch

- What is a world model? The formal definition: $s_{t+1} = f(s_t, a_t)$, where $s_t$ is the state and $a_t$ the action at time $t$
- Model-free vs model-based reinforcement learning
- Build a neural network world model for CartPole
- Implement a random shooting planner that uses imagination to act
- **Final output:** Planning agent vs random agent comparison
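
To give a flavor of the planner, here is a minimal sketch of random shooting written in plain Python. It is not the notebook's code: the toy point-mass `model`, the `reward`, and all parameter values are assumptions chosen so the example runs anywhere, whereas the notebook trains a neural network on CartPole.

```python
import random

def model(state, action):
    """Stand-in for a learned world model f(s_t, a_t) -> s_{t+1}.

    Toy 1-D point mass (an assumption, not CartPole):
    state = (position, velocity), action = force of -1 or +1.
    """
    pos, vel = state
    vel = vel + 0.1 * action
    pos = pos + 0.1 * vel
    return (pos, vel)

def reward(state):
    """Hypothetical reward: the closer to the origin, the better."""
    return -abs(state[0])

def random_shooting(state, horizon=10, n_candidates=200, rng=None):
    """Sample random action sequences, roll each out inside the model,
    and return the first action of the highest-scoring sequence."""
    rng = rng or random.Random()
    best_return, best_action = float("-inf"), None
    for _ in range(n_candidates):
        actions = [rng.choice((-1.0, 1.0)) for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:  # imagined rollout: only the model is queried
            s = model(s, a)
            total += reward(s)
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action

rng = random.Random(0)
print(random_shooting((1.0, 0.0), rng=rng))   # start right of the origin
print(random_shooting((-1.0, 0.0), rng=rng))  # start left of the origin
```

Only the first action of the best sequence is executed; in a control loop the planner would be called again at every step.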

---

### Notebook 2: Ha & Schmidhuber World Models — Teaching Agents to Dream
**Estimated time:** ~40 minutes | **Prerequisites:** Notebook 1, basic VAE knowledge

- The V-M-C architecture: Vision (VAE) + Memory (MDN-RNN) + Controller
- The reparameterization trick and probabilistic state prediction
- Dream training — training a controller entirely inside the world model's imagination
- **Final output:** Real environment vs agent's dream trajectory comparison
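
The reparameterization trick can be sketched in a few lines. This is an illustrative plain-Python version with made-up values for `mu` and `log_var`; in a PyTorch VAE the same formula is applied to tensors so that gradients can flow through `mu` and `log_var` while the randomness stays in `eps`.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

rng = random.Random(0)
mu, log_var = [1.0, -2.0], [math.log(0.25), math.log(4.0)]  # sigma = 0.5 and 2.0
samples = [reparameterize(mu, log_var, rng) for _ in range(20000)]
means = [sum(s[i] for s in samples) / len(samples) for i in range(2)]
print(means)  # empirical means, expected to land near mu
```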

---

### Notebook 3: DreamerV3 — Imagination-Based RL with the RSSM
**Estimated time:** ~45 minutes | **Prerequisites:** Notebooks 1-2

- The Recurrent State-Space Model: deterministic + stochastic paths
- Why the world is both predictable AND unpredictable
- Imagination-based actor-critic training
- **Final output:** RSSM imagining CartPole trajectories, real vs imagined comparison

---

### Notebook 4: JEPA — Predicting in Abstract Space
**Estimated time:** ~40 minutes | **Prerequisites:** Basic knowledge of transformers

- Yann LeCun's insight: predict what matters, ignore irrelevant details
- I-JEPA architecture: context encoder, target encoder, predictor
- EMA updates to prevent representation collapse
- **Final output:** t-SNE of learned representations, linear probe accuracy on CIFAR-10
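
As a rough illustration of the EMA update, the sketch below moves a list of target weights toward the online (context encoder) weights. The flat lists and the `decay` value are assumptions for illustration; in practice the same rule is applied parameter by parameter, and the target encoder receives no gradients.

```python
def ema_update(target, online, decay=0.996):
    """Move each target weight a small step toward the matching online weight."""
    return [decay * t + (1.0 - decay) * o for t, o in zip(target, online)]

target = [0.0, 0.0]
online = [1.0, -2.0]  # held fixed here; during training these change every step
for _ in range(1000):
    target = ema_update(target, online)
print(target)  # slowly approaches the online weights
```

Because the target changes slowly, the predictor is trained against a stable set of representations rather than one it can trivially chase.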

---

### Notebook 5: Genie — From a Single Image to an Interactive World
**Estimated time:** ~40 minutes | **Prerequisites:** Notebook 4 recommended

- World models as generators of entire interactive environments
- Learning actions from unlabeled video — no action labels needed!
- VQ-VAE tokenizer, Latent Action Model, Dynamics Model
- **Final output:** Interactive world generation — same start frame, different action sequences
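
The core of a VQ-VAE tokenizer is a nearest-neighbor lookup in a codebook. The following is a minimal sketch with a hypothetical four-entry, two-dimensional codebook; a real tokenizer learns the codebook and applies the lookup to every spatial position of an encoder output.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the codebook vector closest in squared L2 distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(quantize((0.9, 0.2), codebook))  # -> (1, (1.0, 0.0))
```

The returned index is the discrete token; the returned entry is what a decoder would receive in place of the continuous vector.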

---

### Notebook 6: Vision-Language-Action Models — See, Understand, Act
**Estimated time:** ~45 minutes | **Prerequisites:** Notebooks 1-5 recommended

- The convergence: vision + language + action in one model
- Flow matching for smooth action trajectory generation
- Physical Intelligence's π0 and Meta's V-JEPA 2
- **Final output:** VLA generating robot trajectories from language commands
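
Flow matching can be previewed with a small plain-Python sketch. It assumes the common straight-line path between a noise sample and a target action, and it uses an oracle velocity field for a single target instead of a trained network; the two-dimensional `goal` and `noise` vectors are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to the target action x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity network along that straight path."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(x0, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

goal = [0.5, -0.3]   # hypothetical 2-D action, e.g. a gripper displacement
noise = [1.2, 0.7]   # fixed "noise" sample so the example is deterministic

# Oracle velocity field for a single target; a trained network would replace it.
def oracle(x, t):
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, goal)]

print(interpolate(noise, goal, 0.5), target_velocity(noise, goal))
print(euler_sample(noise, oracle))  # ends close to goal
```

In a VLA, the velocity function would be a network conditioned on camera images and the language command, and the sample would be a whole chunk of future actions rather than a single vector.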

## 🗺️ How These Notebooks Connect

```
Notebook 1: World Models          ──► "What if agents could imagine?"
    │
    ▼
Notebook 2: V-M-C (Dream)        ──► "Train inside your own dream"
    │
    ▼
Notebook 3: DreamerV3 (RSSM)     ──► "Scale imagination to 150+ tasks"
    │
    ├─────────────────────┐
    ▼                     ▼
Notebook 4: JEPA          Notebook 5: Genie
"Predict abstractly"      "Generate entire worlds"
    │                     │
    └─────────┬───────────┘
              ▼
Notebook 6: VLA Models    ──► "See, understand, and act"
```

## 🔧 Setup Instructions

All notebooks are designed to run in **Google Colab** with a **T4 GPU**. To get started:

1. Open any notebook in Google Colab
2. Go to **Runtime → Change runtime type** and select **T4 GPU** as the hardware accelerator
3. Run the first setup cell to verify GPU access
4. Follow along from top to bottom!

Each notebook installs its own dependencies. No local setup is needed.
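
A GPU check along the following lines is typical for a setup cell. This is a sketch, not the notebooks' actual cell: the helper name `describe_accelerator` is hypothetical, and the import is guarded so the snippet also runs where PyTorch is missing.

```python
def describe_accelerator():
    """Return a one-line description of the accelerator PyTorch can see."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this runtime"
    if torch.cuda.is_available():
        return f"GPU available: {torch.cuda.get_device_name(0)}"
    return "No GPU detected; check Runtime > Change runtime type"

print(describe_accelerator())
```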

## 📖 About This Series

This notebook series accompanies the Vizuara Substack article: **"World Action Models: How AI Learns to Imagine Before It Acts."**

The article covers the conceptual landscape; these notebooks give you hands-on experience building each system from scratch. By the end, you will have implemented:

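- A neural network world model that predicts CartPole dynamics, with a random shooting planner
- A VAE + MDN-RNN world model whose controller is trained inside its own dreams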
- An RSSM with imagination-based RL
- An I-JEPA that learns without pixel reconstruction
- A Genie-style interactive world generator
- A VLA model with flow matching

Happy learning! 🎓

*— Vizuara AI*