# World Model Core
*Sean Steinle, Kiya Aminfar*

This notebook walks through the core aspects of world models, developing crucial pieces of code sequentially. Not that this code isn't meant for scale -- instead, this is for a demonstration of how we developed the code that we did.

## Table of Contents
1. [Collecting Rollout Data](#Collecting-Rollout-Data)
2. [Training the VAE](#Training-the-VAE)
3. [Training the MDN-RNN](#Training-the-MDN-RNN)
    - [Prepping Rollout Data for the MDN-RNN](#Prepping-Rollout-Data-for-the-MDN-RNN)
    - [Core Training](#Core-Training)
4. [Training the Controller](#Training-the-Controller)
5. [Early Results](#Early-Results)

In [1]:
import gymnasium as gym
import matplotlib.pyplot as plt
import os
import numpy as np

## Collecting Rollout Data

In [2]:
#Let's begin by creating an instance of our humanoid environment and checking out what basic observations look like.
env = gym.make('Humanoid-v5', render_mode="rgb_array")
obs, info = env.reset()

In [3]:
obs.shape, obs

((348,),
 array([ 1.40752215e+00,  1.00663365e+00,  6.02556701e-03,  1.16719831e-03,
        -4.62792461e-03,  5.14522305e-03, -9.58258864e-03,  3.63368217e-03,
         1.28224640e-04,  9.20090990e-03,  8.16243137e-03,  6.34724809e-03,
        -2.19157998e-03, -4.84501161e-03,  5.03633911e-03, -3.15465082e-03,
         2.63887772e-03, -6.97352955e-03,  9.36419674e-03, -8.20846185e-05,
        -1.74765736e-03,  6.09327069e-03, -5.16949615e-03, -4.27053733e-03,
         1.70261180e-03,  6.54108492e-03,  1.32864334e-03, -1.15680755e-03,
        -3.40851327e-03,  6.35224940e-03, -3.04810121e-03, -2.93511876e-04,
         7.56070784e-03,  8.48507331e-03, -6.35783697e-03,  3.95974482e-03,
         2.30324114e-03,  7.89472002e-03,  8.49384980e-03,  5.69242465e-04,
         5.84585705e-03,  4.37064015e-03, -4.43530877e-03,  8.93418639e-03,
        -1.32483179e-04,  2.30367321e+00,  2.28714761e+00,  4.46661737e-02,
        -1.23613986e-03,  7.63484451e-02,  3.07654488e-02, -1.63893363e-01,
   

In [4]:
info

{'x_position': np.float64(0.0025825735263227626),
 'y_position': np.float64(-0.0014417381873546332),
 'tendon_length': array([-0.00819099, -0.00181518]),
 'tendon_velocity': array([ 0.00059913, -0.01484291]),
 'distance_from_origin': np.float64(0.0029577516832452)}

As we can see, the humanoid environment gives us a TON of observations! We get dozens of variables representing various positions and velocities of body parts, the center of mass, and a lot of other variables I hardly understand. For an exhaustive list, see the [doc](https://gymnasium.farama.org/environments/mujoco/humanoid/#observation-space). The fact that there are so many variables here is what makes learning latent observations so obvious!

We also get some nice summary stats in info, but we aren't going to include them in our scrape.

In [20]:
import json

def collect_rollout_data(env_name: str, out_dir: str, n_timesteps: int=10000, print_n_episodes: int=1000):
    """Simulates `n_timesteps` in the `env_name` environment, saving observations, rewards, and actions to a triplet of .npy files at `out_dir`."""
    env = gym.make(env_name, render_mode='rgb_array')
    obs, info = env.reset()
    observations, rewards, actions = [], [] , []
    episode_count = 0

    for timestep in range(n_timesteps):  # Run for n_timesteps or until the episode ends
        action = env.action_space.sample() #select random action
        obs, reward, terminated, truncated, info = env.step(action) #execute and get results
        observations.append(obs) #save observation
        rewards.append(reward) #save reward
        actions.append(action) #save action
        if terminated or truncated: #check for game over, if so reset env
            episode_count+=1
            total_observations.append(episode_observations)
            total_rewards.append(episode_rewards)
            total_actions.append(episode_actions)
            if episode_count % print_n_episodes == 0: print(f"finished {episode_count} episodes") #provide update on training
            observation, info = env.reset()
        env.close()
    np_obs, np_rewards, np_actions = np.array(observations), np.array(rewards), np.array(actions)
    print(f"observations has shape: {np_obs.shape}\trewards has shape: {np_rewards.shape}\tactions has shape: {np_actions.shape}")
    np.save(f'{out_dir}/{env_name}_{n_timesteps}_rollout_observations.npy', np_obs) #load with: new_obs = np.load("../data/processed/Humanoid-v5_10000_rollout_observations.npy")
    np.save(f'{out_dir}/{env_name}_{n_timesteps}_rollout_rewards.npy', np_rewards)
    np.save(f'{out_dir}/{env_name}_{n_timesteps}_rollout_actions.npy', np_actions)
    return np_obs, np_rewards, np_actions

SyntaxError: unmatched ']' (2640406133.py, line 7)

In [22]:
humanoid_obs, humanoid_rewards, humanoid_actions, humanoid_done = collect_rollout_data('Humanoid-v5', "../data/processed", 10000, 100)

finished 100 episodes
finished 200 episodes
finished 300 episodes
finished 400 episodes
observations has shape: (10000, 348)	rewards has shape: (10000,)	actions has shape: (10000, 17)


In [23]:
#We see that the number of samples per episode changes, but the observation space's dimensionality never does!
humanoid_obs[0].shape, humanoid_obs[1].shape, humanoid_obs[2].shape

array([4.89014007, 4.92889786, 4.94746227, ..., 4.52836352, 4.57897921,
       4.56725108])

In [24]:
#Same holds for actions! We can even see the dimensionality of actions and observations are similar based on episode length
humanoid_actions[0].shape, humanoid_actions[1].shape, humanoid_actions[2].shape

array([[-0.1344295 , -0.33341795,  0.2567148 , ..., -0.26523423,
        -0.22723526, -0.34251404],
       [-0.2680746 , -0.05566841, -0.06947935, ..., -0.19162744,
        -0.23866333,  0.33647585],
       [-0.1664092 ,  0.0057418 , -0.32716182, ...,  0.06560624,
         0.14931448,  0.23692013],
       ...,
       [-0.3987542 ,  0.35181522,  0.2205222 , ..., -0.27054495,
         0.26330063, -0.19415343],
       [-0.26650003, -0.35020053, -0.03938405, ..., -0.18459655,
         0.04306056,  0.2507781 ],
       [-0.25835124,  0.19055103, -0.2883535 , ...,  0.04691168,
        -0.00619155, -0.23913088]], dtype=float32)

In [9]:
humanoid_done, humanoid_done.sum()

(array([False, False, False, ..., False, False, False]), np.int64(413))

## Training the VAE

Now that we have an easy function for gathering experiences in the environment, we need to train the VAE module of our world model which will compress the observation space into latent space with fewer dimensions.

Note that the original World Model implementation worked with tensorflow 1.18.0. This is incredibly outdated (worked with Python 3.5), so let's get a newer version (tensorflow 2.19.0). Additionally, we need to change the structure of the VAE from working with images to working with a vector of observation data! Luckily, ChatGPT is very good at updating code (or it will be very obvious if it is not!).

In [10]:
import tensorflow as tf
from tensorflow.keras import layers, Model, saving

@saving.register_keras_serializable()
class MLPVAE(Model):
    def __init__(self, input_dim=348, z_size=32, kl_tolerance=0.5):
        super(MLPVAE, self).__init__()
        self.z_size = z_size
        self.kl_tolerance = kl_tolerance

        # Encoder
        self.encoder = tf.keras.Sequential([
            layers.InputLayer(input_shape=(input_dim,)),
            layers.Dense(256, activation='relu'),
            layers.Dense(128, activation='relu'),
            layers.Dense(2 * z_size),  # output both mu and logvar
        ])

        # Decoder
        self.decoder = tf.keras.Sequential([
            layers.InputLayer(input_shape=(z_size,)),
            layers.Dense(128, activation='relu'),
            layers.Dense(256, activation='relu'),
            layers.Dense(input_dim, activation='linear'),  # output same shape as input
        ])

    def sample_z(self, mu, logvar):
        eps = tf.random.normal(shape=tf.shape(mu))
        sigma = tf.exp(0.5 * logvar)
        return mu + sigma * eps

    def encode(self, x):
        h = self.encoder(x)
        mu, logvar = tf.split(h, num_or_size_splits=2, axis=1)
        logvar = tf.clip_by_value(logvar, -10.0, 10.0)  # helps with exploding values
        return mu, logvar

    def decode(self, z):
        return self.decoder(z)

    def call(self, x):
        mu, logvar = self.encode(x)
        z = self.sample_z(mu, logvar)
        x_recon = self.decode(z)
        return x_recon, mu, logvar

    def compute_loss(self, x):
        x_recon, mu, logvar = self(x)
        recon_loss = tf.reduce_mean(tf.reduce_sum(tf.square(x - x_recon), axis=1))
        kl_loss = -0.5 * tf.reduce_sum(1 + logvar - tf.square(mu) - tf.exp(logvar), axis=1)
        kl_loss = tf.maximum(kl_loss, self.kl_tolerance * self.z_size)
        kl_loss = tf.reduce_mean(kl_loss)
        total_loss = recon_loss + kl_loss
        return total_loss, recon_loss, kl_loss

2025-05-09 11:48:41.087368: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-05-09 11:48:41.107521: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1746805721.126734   10102 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746805721.132519   10102 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1746805721.147023   10102 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

Let's also write a training function for convenience.

In [11]:
def create_dataset(x_train, batch_size=64, shuffle_buffer=10000):
    # Assuming x_train is a NumPy array of shape [n_samples, 348]
    dataset = tf.data.Dataset.from_tensor_slices(x_train.astype(np.float32))
    dataset = dataset.shuffle(shuffle_buffer).batch(batch_size).prefetch(tf.data.AUTOTUNE)
    return dataset

def train_vae(model, dataset, epochs=10, learning_rate=1e-4):
    optimizer = tf.keras.optimizers.Adam(learning_rate)

    for epoch in range(epochs):
        total_loss = 0.0
        total_batches = 0
        for x_batch in dataset:
            with tf.GradientTape() as tape:
                loss, recon_loss, kl_loss = model.compute_loss(x_batch)
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))

            total_loss += loss.numpy()
            total_batches += 1

        avg_loss = total_loss / total_batches
        print(f"Epoch {epoch+1}: avg loss = {avg_loss:.4f}")


In [12]:
# x_train should be a NumPy array of shape (n_samples, 348)
x_train = humanoid_obs
x_train = (x_train - np.mean(x_train, axis=0)) / (np.std(x_train, axis=0) + 1e-6)

dataset = create_dataset(humanoid_obs, batch_size=64)
vae = MLPVAE(input_dim=348, z_size=32)
train_vae(vae, dataset, epochs=20)

2025-05-09 11:48:47.993476: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
2025-05-09 11:49:06.936035: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


Epoch 1: avg loss = 302230.9688


2025-05-09 11:49:25.816401: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


Epoch 2: avg loss = 163989.3906
Epoch 3: avg loss = 75357.7734


2025-05-09 11:49:57.662699: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


Epoch 4: avg loss = 59448.0898
Epoch 5: avg loss = 53910.8359
Epoch 6: avg loss = 49874.7539
Epoch 7: avg loss = 45672.3164


2025-05-09 11:50:51.672229: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


Epoch 8: avg loss = 39600.4180
Epoch 9: avg loss = 33288.8203
Epoch 10: avg loss = 28312.5820
Epoch 11: avg loss = 24710.0449
Epoch 12: avg loss = 22066.8398
Epoch 13: avg loss = 19569.1250
Epoch 14: avg loss = 17441.9258
Epoch 15: avg loss = 15839.2998


2025-05-09 11:52:38.244919: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


Epoch 16: avg loss = 14604.4170
Epoch 17: avg loss = 13732.7695
Epoch 18: avg loss = 13018.8848
Epoch 19: avg loss = 12438.6729
Epoch 20: avg loss = 11923.9502


Loss is going down! At first I got a tons of NAN values, but it's because I wasn't normalizing the input data and I also needed to clip the logvar values we were getting as a result of the encoding process. If you get NANs again, a lower learning rate could help too. Onto saving the model!

In [13]:
vae.save_weights('../models/vae/humanoid_10000_vae_model.weights.h5') #save ONLY weights -- much simpler than serializing the entire object

In [14]:
new_vae = MLPVAE(input_dim=348, z_size=32) #instantiate new model object 
new_vae(tf.zeros((1, 348))) #invoke it to build its shape 
new_vae.load_weights('../models/vae/humanoid_10000_vae_model.weights.h5') #now load weights into empty vector

In [15]:
new_vae

<MLPVAE name=mlpvae_1, built=True>

In [16]:
vae

<MLPVAE name=mlpvae, built=True>

## Training the MDN-RNN

Now that we have a model which captures observations, we're theoretically ~1/3 done with the project! I say theoretically because this was probably the easiest part of the project. Now onto the meat of world models: capturing the transitions of our environment and training the MDN-RNN!

### Prepping Rollout Data for the MDN-RNN

To train the MDN-RNN, we first need to enhance our basic rollout dataset with predictions of `mu` and `logvar` for each experience. Then we'll feed this information to the MDN-RNN.

In [18]:
#We can use the dataset records still in memory. Note that we only need the observations for now!
humanoid_obs.shape, humanoid_rewards.shape, humanoid_actions.shape, humanoid_done.shape

((10000, 348), (10000,), (10000, 17), (10000,))

In [19]:
#We can also use the VAE still in memory!
vae

<MLPVAE name=mlpvae, built=True>

In [20]:
humanoid_obs[0]

array([ 1.39590622e+00,  9.99995520e-01, -1.48260806e-03,  2.35242370e-03,
        1.10815003e-03,  7.28172965e-03, -7.15410548e-02,  1.55540659e-02,
       -1.52926195e-02,  1.69786602e-02,  7.06296636e-02, -1.89977570e-03,
        1.16534514e-02, -6.50198093e-03,  5.72273184e-02, -8.02832134e-03,
       -1.75396758e-02,  1.50975119e-02,  1.15088466e-02,  2.90521775e-02,
       -2.29436430e-02,  1.14054975e-03, -3.53934903e-01, -1.84836038e-02,
       -2.29504838e-01, -5.99725486e-02,  1.94204832e+00, -3.87114357e-01,
        2.02086546e+00, -7.42386529e+00,  9.98969633e-01, -1.04443180e+00,
        2.28379743e+00,  7.55341866e+00, -6.47223162e-02,  1.17495983e+00,
       -8.30978728e-01,  6.81784841e+00, -2.13443194e-01, -1.98394812e+00,
        1.65379456e+00,  4.63941103e-01,  2.70873725e+00, -1.69449907e+00,
        3.42377348e-01,  2.29532303e+00,  2.28309398e+00,  4.81239231e-02,
       -1.25221834e-05,  1.16678255e-01,  4.11431168e-04, -2.47469251e-01,
       -1.54578979e-03,  

In [36]:
#That's it! Just predict for our entire observation set.
mu, logvar = vae.encode(humanoid_obs)
humanoid_z = vae.sample_z(mu, logvar).numpy()
z.shape

(10000, 32)

In [37]:
np.save(f'../data/processed/Humanoid-v5_10000_rollout_z.npy', humanoid_z)

## Core Training

Now that we have a dataset prepared to train the MDN-RNN, let's define the model and test it out!

In [None]:
#The 'recipe' is missing task hierarchies!
    #We never cared before because mastering domains was infeasible.

#Minecraft paper offers a cool example because we have a very clear task hierarchy! But crafting task hierarchies in real environments are nontrivial -- @Nick.

#Is foundation models as a prior good enough?
#We don't just need information about what environments are like, we need goals! We need to know what it takes to be good at X.
    #This is a RICH area for assurance research too. This set of goals is necessarily broad, but need to ensure they align tightly to our values and domain mastery.
#Information about computers might help with tasks, but what are the tasks? And how do the tasks present a cohesive picture of domain mastery?