# **Tutorial 2: Decision Diffuser (DD) for D4RL-MuJoCo**

## 1. Introduction

In this tutorial, we’ll implement a minimal Decision Diffuser (DD) using CleanDiffuser. DD is a planning-based diffusion RL algorithm that uses classifier-free guidance (CFG) to generate high-performance decision trajectories. We’ll be using the D4RL-MuJoCo dataset for both training and evaluation. Along the way, we’ll dive into CFG and explore how to customize CFG models for specific tasks. Let’s start with an overview of how CFG works.

### 1.1 Classifier-free Guidance (CFG)

In a conditional generation task, we aim to sample from a conditional distribution $q_0(\bm x|\bm y)$. By Bayes' rule, the score function of the noised conditional distribution $q_t(\bm x_t|\bm y)$ decomposes as:

$$
\nabla_{\bm x}\log q_t(\bm x_t|\bm y) = \nabla_{\bm x}\log q_t(\bm x_t) + \nabla_{\bm x}\log q_t(\bm y|\bm x_t),
$$

where the first term on the right is the score function of the unconditional distribution $q_t(\bm x_t)$, which we can estimate by training an unconditional diffusion model. The second term is the guidance term. Classifier guidance estimates it with a separately trained classifier; CFG avoids the classifier by rearranging the identity above:

$$
\nabla_{\bm x}\log q_t(\bm y|\bm x_t) = \nabla_{\bm x}\log q_t(\bm x_t|\bm y) - \nabla_{\bm x}\log q_t(\bm x_t).
$$

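Because the predicted noise is proportional to the negative score, both scores on the right can be estimated with noise prediction models. By training a conditional noise prediction model $\bm\epsilon_\theta(\bm x_t, t, \bm y)$, we can guide the sampling process without needing an additional classifier:

$$
\bar{\bm\epsilon}_\theta(\bm x_t, t, \bm y) = \bm\epsilon_\theta(\bm x_t, t) + w \cdot (\bm\epsilon_\theta(\bm x_t, t, \bm y) - \bm\epsilon_\theta(\bm x_t, t)),
$$

where $w$ represents the strength of the guidance: $w = 0$ recovers unconditional generation, $w = 1$ recovers the plain conditional model, and $w > 1$ amplifies the effect of the condition. In practice, a single network covers both cases by using a dummy condition $\bm y = \bm\Phi$ for unconditional generation, meaning $\bm\epsilon_\theta(\bm x_t, t, \bm\Phi) = \bm\epsilon_\theta(\bm x_t, t)$.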
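As a minimal sketch of this combination step (not CleanDiffuser's internal code), the function below mixes an unconditional and a conditional noise prediction. The name `cfg_noise` and the toy arrays are illustrative assumptions; in a real sampler both predictions would come from the same network, evaluated once with $\bm y$ and once with the dummy condition.

```python
import numpy as np


def cfg_noise(eps_uncond: np.ndarray, eps_cond: np.ndarray, w: float) -> np.ndarray:
    """Classifier-free guidance: move from the unconditional prediction
    toward (and, for w > 1, past) the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)


# Toy predictions standing in for network outputs (hypothetical values).
eps_uncond = np.array([0.0, 1.0])
eps_cond = np.array([1.0, 3.0])

print(cfg_noise(eps_uncond, eps_cond, w=0.0))  # equals eps_uncond
print(cfg_noise(eps_uncond, eps_cond, w=1.0))  # equals eps_cond
print(cfg_noise(eps_uncond, eps_cond, w=2.0))  # extrapolates past eps_cond
```

Large values of `w` push the sample harder toward the condition, usually at the cost of diversity.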

In decision-making tasks, the condition $\bm y$ can be complex, multimodal data such as image-based observations, language instructions, or point clouds. Some implementations even use large transformers for multimodal fusion while using smaller MLPs as the diffusion backbone. To make development and debugging easier, CleanDiffuser decouples the diffusion network $\bm\epsilon_\theta$ from the condition network $\bm\zeta_\phi$. Conditional diffusion models in CleanDiffuser are implemented as $\bm\epsilon_\theta(\bm x_t, t, \bm\zeta_\phi(\bm y))$, with a dummy condition $\bm\zeta_\phi(\bm\Phi) = \bm 0$. This is why Tutorial 1 used both an `NNDiffusion` and an `NNCondition` to build the diffusion model: the `NNDiffusion` corresponds to $\bm\epsilon_\theta$, and the `NNCondition` corresponds to $\bm\zeta_\phi$. If the condition is simple, we can use an `IdentityCondition` to pass it directly to `NNDiffusion`.
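To train one network for both the conditional and the unconditional case, CFG implementations commonly replace the condition with the dummy condition for a random subset of training samples. The sketch below illustrates that idea on already-embedded conditions. `drop_condition` and `drop_prob` are hypothetical names, and this is not a description of how `NNCondition` is implemented internally.

```python
import numpy as np


def drop_condition(cond_emb: np.ndarray, drop_prob: float, rng: np.random.Generator) -> np.ndarray:
    """Zero out whole condition embeddings with probability `drop_prob`.

    A zeroed row plays the role of the dummy condition zeta(Phi) = 0.
    """
    keep = rng.random(cond_emb.shape[0]) >= drop_prob  # one decision per sample
    return cond_emb * keep[:, None]


rng = np.random.default_rng(0)
cond_emb = np.ones((4, 3))  # 4 samples, embedding size 3 (toy values)

all_kept = drop_condition(cond_emb, drop_prob=0.0, rng=rng)     # nothing is dropped
all_dropped = drop_condition(cond_emb, drop_prob=1.0, rng=rng)  # every row becomes 0
mixed = drop_condition(cond_emb, drop_prob=0.25, rng=rng)       # some rows may be zeroed
```

Dropping the whole embedding per sample, rather than individual elements, is what lets the same network later be queried with and without the condition.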

### 1.2 Diffusion Planners

DD is a diffusion planner that uses CFG to generate high-performance decision trajectories. The core idea is to generate a high-quality trajectory and execute only its first action, much like MPC (Model Predictive Control) and other planning-based model-based RL algorithms. While those approaches rely on search methods and dynamics models to find optimal trajectories, diffusion planners achieve this through conditional generation.

To guide this generation process, we use a “high-performance” variable as the condition. One straightforward choice is the discounted return-to-go of the trajectory, $\sum_{s=t}^T \gamma^{s-t} r_s$, which is a Monte Carlo estimate of the trajectory’s value. During training, we normalize these values to the range [0, 1], where 1 represents the highest performance. During inference, we use relatively high normalized values (e.g., 0.8 to 1.0) as conditions to generate high-performance trajectories. For more details, you can refer to [Diffuser](https://arxiv.org/abs/2205.09991) and [Decision Diffuser](https://arxiv.org/abs/2211.15657).
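The return-to-go can be computed for every timestep in a single backward pass. The sketch below uses toy rewards and a hypothetical helper name, `discounted_return_to_go`; the rescaling at the end uses the maximum of the toy values, which is only one possible choice of normalization constant.

```python
def discounted_return_to_go(rewards, gamma):
    """Return [sum_{s>=t} gamma^(s-t) * r_s for every t], computed backwards in O(T)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [1.0, 1.0, 1.0]  # toy rewards, not D4RL data
values = discounted_return_to_go(rewards, gamma=0.5)
print(values)  # [1.75, 1.5, 1.0]

# Rescale by a constant so that the best value maps to 1.
scale = max(values)
conditions = [v / scale for v in values]
```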

## 2. Setting up the Environment and Dataset

For this tutorial, we’ll use the `halfcheetah-medium-v2` environment from D4RL-MuJoCo as an example. D4RL-MuJoCo is a widely used offline RL benchmark, and the `halfcheetah-medium-v2` task requires controlling a halfcheetah robot to move as fast as possible. CleanDiffuser provides a simple interface to load D4RL datasets.

In [2]:
import d4rl
import gym

from cleandiffuser.dataset.d4rl_mujoco_dataset import D4RLMuJoCoDataset

# horizon=4 is enough for halfcheetah tasks, as suggested in the Diffuser paper.
horizon = 4
env = gym.make("halfcheetah-medium-v2")
dataset = D4RLMuJoCoDataset(env.get_dataset(), terminal_penalty=-100, horizon=horizon)
obs_dim, act_dim = dataset.obs_dim, dataset.act_dim

load datafile: 100%|██████████| 21/21 [00:04<00:00,  5.06it/s]


## 3. Building the Diffusion Model

Unlike in Tutorial 1, here we need the diffusion model to generate decision trajectories, which look like this:

$$
\bm\tau = \left[
\begin{aligned}
&\bm s_0, \bm s_1, \dots, \bm s_{H-1} \\
&\bm a_0, \bm a_1, \dots, \bm a_{H-1}
\end{aligned}
\right],
$$

where $\bm s_t$ is the state at time $t$, $\bm a_t$ is the action at time $t$, and $H$ is the horizon. The trajectory $\bm\tau$ has the shape (H, obs_dim + act_dim). For this, we need a neural network backbone designed to generate sequences. We’ll use `DiT1d`, a modified version of DiT for 1D sequences. `DiT1d` expects the condition as a tensor of shape (batch_size, embed_dim), so we use an MLP `NNCondition` to map the scalar condition (the trajectory’s value) to a tensor of the required shape.
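To make the layout concrete, the sketch below builds a toy trajectory array and slices out the parts that matter later. The sizes and the names `H`, `n_obs`, and `n_act` are illustrative assumptions, not values read from the dataset.

```python
import numpy as np

H, n_obs, n_act = 4, 17, 6  # illustrative sizes

obs = np.zeros((H, n_obs), dtype=np.float32)  # s_0 ... s_{H-1}
act = np.ones((H, n_act), dtype=np.float32)   # a_0 ... a_{H-1}

# Each row t holds [s_t, a_t].
tau = np.concatenate([obs, act], axis=-1)
print(tau.shape)  # (4, 23)

first_obs = tau[0, :n_obs]     # s_0: the current observation
first_action = tau[0, n_obs:]  # a_0: the action a planner would execute
```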

> **Note:** The official DD generates only state trajectories and uses an inverse dynamics model $\bm a_t = \mathcal{I}(\bm s_t, \bm s_{t+1})$ to extract the action. In this tutorial, we skip the inverse dynamics model and directly use state-action trajectories.


In [3]:
import torch

from cleandiffuser.diffusion import ContinuousDiffusionSDE
from cleandiffuser.nn_condition import MLPCondition
from cleandiffuser.nn_diffusion import DiT1d

# Neural network backbones
nn_diffusion = DiT1d(
    x_dim=obs_dim + act_dim, emb_dim=128, d_model=320, n_heads=10, depth=2, timestep_emb_type="untrainable_fourier"
)
nn_condition = MLPCondition(in_dim=1, out_dim=128, hidden_dims=128, dropout=0.25)

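# fix_mask: 1 marks entries that are known and kept fixed (the current observation s_0)
fix_mask = torch.zeros((horizon, obs_dim + act_dim))
fix_mask[0, :obs_dim] = 1.0
# loss_weight: up-weight the first action a_0, the one that is actually executed
loss_weight = torch.ones((horizon, obs_dim + act_dim))
loss_weight[0, obs_dim:] = 10.0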

planner = ContinuousDiffusionSDE(
    nn_diffusion,
    nn_condition,
    fix_mask=fix_mask,
    loss_weight=loss_weight,
)

The code above also defines a `fix_mask` and a `loss_weight`. The `fix_mask` is a binary tensor of the same shape as the generated data, where 1 marks fixed data (known values, such as the current observation) and 0 marks data to be generated. The `loss_weight` assigns different weights to different parts of the data; here the first action gets a weight of 10, since it is the action we care most about.
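One way to picture the two tensors is the sketch below: the mask overwrites the fixed entries of a sample with the known values from the prior, and the weight rescales a per-element squared error. The helper names `apply_fix_mask` and `weighted_mse` are assumptions for illustration; whether `ContinuousDiffusionSDE` applies them in exactly this way is not shown in this tutorial.

```python
import numpy as np


def apply_fix_mask(x, prior, fix_mask):
    """Keep generated values where fix_mask == 0 and known values where fix_mask == 1."""
    return x * (1.0 - fix_mask) + prior * fix_mask


def weighted_mse(pred, target, loss_weight, fix_mask):
    """Squared error, weighted per element and ignoring the fixed entries."""
    err = (pred - target) ** 2 * loss_weight * (1.0 - fix_mask)
    return err.mean()


H, n_obs, n_act = 4, 3, 2  # toy sizes
mask = np.zeros((H, n_obs + n_act))
mask[0, :n_obs] = 1.0      # the first observation is known
weight = np.ones((H, n_obs + n_act))
weight[0, n_obs:] = 10.0   # emphasize the first action

prior = np.zeros((H, n_obs + n_act))
prior[0, :n_obs] = 0.5     # stand-in for the current (normalized) observation

noisy = np.random.default_rng(0).normal(size=prior.shape)  # stand-in for a noisy sample
sample = apply_fix_mask(noisy, prior, mask)
```

After masking, `sample[0, :n_obs]` equals the known observation while every other entry keeps its generated value.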

## 4. Training the Diffusion Model

In [5]:
import numpy as np
import pytorch_lightning as L
from pytorch_lightning.callbacks import ModelCheckpoint


class StateActionSequenceWrapper(torch.utils.data.Dataset):
    """Adapts dataset items to the {"x0", "condition_cfg"} format returned for training."""

    def __init__(self, dataset: torch.utils.data.Dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

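    def __getattr__(self, name):
        # Delegate unknown attributes to the wrapped dataset. The guard avoids
        # infinite recursion if `dataset` is not set yet (e.g. during unpickling).
        if name == "dataset":
            raise AttributeError(name)
        return getattr(self.dataset, name)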

    def __getitem__(self, idx):
        batch = self.dataset[idx]

        obs = batch["obs"]["state"]
        act = batch["act"]
        # 580 is a hard-coded constant to normalize the value to [-1, 1].
        # It depends on the environment.
        val = batch["val"] / 580.0

        return {
            "x0": np.concatenate([obs, act], axis=-1),
            "condition_cfg": val,
        }


save_path = "results/tutorial2_dd_for_d4rl_mujoco/"

dataloader = torch.utils.data.DataLoader(
    StateActionSequenceWrapper(dataset), batch_size=512, shuffle=True, num_workers=4, persistent_workers=True
)

callback = ModelCheckpoint(dirpath=save_path, filename="dd-{step}", every_n_train_steps=10_000)

trainer = L.Trainer(
    accelerator="gpu",
    devices=[0, 1, 2, 3],
    max_steps=500_000,
    deterministic=True,
    log_every_n_steps=200,
    default_root_dir=save_path,
    callbacks=[callback],
)

trainer.fit(planner, dataloader)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.utilities.rank_zero:You are using a CUDA device ('NVIDIA GeForce RTX 4090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision

Epoch 1026:  69%|██████▉   | 338/487 [00:24<00:10, 14.03it/s, v_num=1, diffusion_loss=0.0854]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_steps=500000` reached.




## 5. Evaluation

Inference with diffusion planners differs slightly from diffusion policies. First, we replace the first state in the prior with the current observation. Then, we generate a trajectory conditioned on a high-performance score. Finally, we extract and execute the first action. Here, we evaluate the trained model in the `halfcheetah-medium-v2` environment. We set the target score condition to `0.95` and the guidance strength to `15.0`.

In [9]:
n_seeds = 3

# device for evaluation
device = "cuda:0"

# loading from checkpoint
planner.load_state_dict(
    torch.load("results/tutorial2_dd_for_d4rl_mujoco/dd-step=500000.ckpt", map_location=device)["state_dict"]
)
planner.to(device).eval()

# evaluating
env_eval = gym.vector.make("halfcheetah-medium-v2", num_envs=50)
normalizer = dataset.get_normalizer()
condition = torch.full((50, 1), 0.95, device=device)
prior = torch.zeros((50, horizon, obs_dim + act_dim))
scores = []

for _ in range(n_seeds):
    
    obs, all_done, ep_rew, t = env_eval.reset(), False, 0.0, 0

    while not np.all(all_done):
        obs = torch.tensor(normalizer.normalize(obs))
        prior[:, 0, :obs_dim] = obs

        traj, log = planner.sample(
            prior,
            solver="ddpm",
            sample_steps=5,
            sampling_schedule="uniform_logsnr",
            condition_cfg=condition,
            w_cfg=15,
            use_ema=True,
            temperature=0.5,
        )
        act = traj[:, 0, obs_dim:].clip(-1.0, 1.0).cpu().numpy()

        obs, rew, done, info = env_eval.step(act)

        ep_rew += rew
        t += 1
        all_done = np.logical_or(all_done, done)

        print(f"[t={t}] rew: {rew}")

    scores.append(env.get_normalized_score(ep_rew).mean() * 100.)

print(f"D4RL score: {np.mean(scores)}+-{np.std(scores)}")



[t=1] rew: [-0.41064528 -0.30203276 -0.20408439 -0.25790227 -0.41847913 -0.39337452
  0.13060199  0.09943322 -0.48303013  0.04111774 -0.15423047 -0.21373067
 -0.91625248 -0.53069481 -0.05168775 -0.32483423 -0.19358753  0.25464607
 -0.07734991 -1.10700749 -0.16604815 -0.29155564  0.04072121  0.20223909
 -0.13359973 -0.80226683 -0.53838576 -0.10523825 -0.45469917 -0.54115094
 -0.32379289 -0.308251    0.21935378 -0.76873307 -0.11030326 -0.84384955
 -0.66522372  0.36740042 -0.89999698 -0.02216342 -0.68160968 -0.4262629
 -0.01935014 -0.04477123 -0.0544906  -0.36912971 -0.31445271 -0.44020865
 -0.17633822 -0.00880385]
[t=2] rew: [-0.82970939 -0.84336024 -0.567595   -0.41672669 -1.06254991 -0.80862268
  0.00671522  0.19081357 -0.78877307 -0.14846946 -0.9647318  -0.57616058
 -1.06108402 -1.14432744 -0.7677909  -0.75277192 -0.42861499  0.33303388
 -0.74326439 -1.5707345  -0.94771047 -0.6658399  -0.4218538  -0.02689107
 -0.30171476 -1.08704527 -0.92795743 -0.06756259 -1.04294568 -1.50142517
 -1.

The results are promising! Even without the inverse dynamics model used by the official DD, the performance is competitive with other popular offline RL algorithms (see the table below).

|Task|BC|CQL|IQL|DT|TT|Diffuser|DD (Official)|DD (Tutorial 2)|
|---|---|---|---|---|---|---|---|---|
|HalfCheetah-Medium-v2|42.6|44.0|47.4|42.6|46.9|44.2|49.1±1.0|48.0±0.3|
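
The scores in the table are D4RL normalized scores, which rescale the raw episode return so that 0 corresponds to a random policy and 100 to an expert policy. A minimal sketch of that rescaling follows. The reference returns below are placeholder numbers chosen for illustration, whereas `env.get_normalized_score` uses the reference values shipped with D4RL.

```python
def normalized_score(episode_return, random_return, expert_return):
    """Map a raw return to the D4RL scale: 0 = random policy, 100 = expert policy."""
    return 100.0 * (episode_return - random_return) / (expert_return - random_return)


# Placeholder reference returns (hypothetical, not the real D4RL constants).
random_return, expert_return = 0.0, 1000.0

print(normalized_score(0.0, random_return, expert_return))     # 0.0
print(normalized_score(1000.0, random_return, expert_return))  # 100.0
print(normalized_score(480.0, random_return, expert_return))   # 48.0
```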