# **Tutorial 1. A Minimal DBC (Diffusion Behavior Clone) Implementation**
## 1 Introduction
In this tutorial, we'll implement a minimal DBC (Diffusion Behavior Clone) with CleanDiffuser. 
DBC is an imitation learning algorithm that replicates behaviors from an offline demonstration dataset. 
It uses a diffusion model to generate samples from the policy distribution $\pi_\theta(a|s)$. The basic idea is the same as in diffusion-based 
image generation; the difference is that DBC is conditioned on the state $s$ and generates actions $a$.

Imitation learning requires a dataset of expert demonstrations. In this tutorial, we'll use the RelayKitchen environment, which consists of a 9 DoF position-controlled Franka robot interacting with a kitchen scene, including an openable microwave, four turnable oven burners, an oven light switch, a freely movable kettle, two hinged cabinets, and a sliding cabinet door. It also contains 566 human demonstrations of various tasks, such as opening the microwave, turning on the oven light, and moving the kettle. Agents are trained to imitate these demonstrations and finish as many tasks as possible within a limited time.

Let's start by downloading the expert demonstrations!

In [1]:
# Each `!` line runs in its own subshell, so `cd` only takes effect when chained with `&&`.
! mkdir -p ../dev && cd ../dev && wget https://diffusion-policy.cs.columbia.edu/data/training/kitchen.zip && unzip kitchen.zip && rm kitchen.zip

--2025-05-12 11:01:56--  https://diffusion-policy.cs.columbia.edu/data/training/kitchen.zip
Resolving diffusion-policy.cs.columbia.edu (diffusion-policy.cs.columbia.edu)... 128.59.16.27
Connecting to diffusion-policy.cs.columbia.edu (diffusion-policy.cs.columbia.edu)|128.59.16.27|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 777744116 (742M) [application/zip]
Saving to: ‘kitchen.zip’


2025-05-12 11:11:26 (1.30 MB/s) - ‘kitchen.zip’ saved [777744116/777744116]

Archive:  kitchen.zip
   creating: kitchen/
  inflating: kitchen/all_init_qvel.npy  
   creating: kitchen/kitchen_demos_multitask/
   creating: kitchen/kitchen_demos_multitask/friday_kettle_switch_hinge_slide/
  inflating: kitchen/kitchen_demos_multitask/friday_kettle_switch_hinge_slide/kitchen_playdata_2019_06_28_15_26_42.mjl  
  inflating: kitchen/kitchen_demos_multitask/friday_kettle_switch_hinge_slide/kitchen_playdata_2019_06_28_15_24_31.mjl  
  inflating: kitchen/kitchen_demos_multitask/friday_ke

## 2 Setting up the Environment and Preparing the Dataset

CleanDiffuser provides a simple interface for setting up the environment and preparing the dataset. The code below gives us a gym-like environment to interact with and a PyTorch `Dataset`. Note that `KitchenDataset` is a sequential dataset that returns state-action trajectory segments from the demonstration dataset. The `horizon` parameter is the length of each segment, and `pad_before` and `pad_after` are the padding lengths before and after each segment. Since we only consider single-step decision making here, we set `horizon=1`, `pad_before=0`, and `pad_after=0`.
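To make the roles of `horizon`, `pad_before`, and `pad_after` concrete, here is a minimal sketch of how a sequential dataset could enumerate segments within one episode. The helper `segment_indices` is hypothetical and is not part of CleanDiffuser; it assumes that padding repeats the first or last timestep, which is a common convention but may differ from what `KitchenDataset` does internally.

```python
def segment_indices(episode_len, horizon, pad_before=0, pad_after=0):
    """Return one list of timestep indices per segment (hypothetical helper).

    A segment may start up to `pad_before` steps before the episode and end up
    to `pad_after` steps after it. Out-of-range indices are clamped, which
    repeats the first or last timestep (an assumed padding convention).
    """
    segments = []
    first_start = -pad_before
    last_start = episode_len - horizon + pad_after
    for start in range(first_start, last_start + 1):
        window = range(start, start + horizon)
        segments.append([min(max(t, 0), episode_len - 1) for t in window])
    return segments


# Single-step setting used in this tutorial: one segment per timestep.
print(segment_indices(episode_len=4, horizon=1))
# A longer horizon with padding repeats the edge timesteps.
print(segment_indices(episode_len=4, horizon=3, pad_before=1, pad_after=1))
```

With `horizon=1` and no padding, every timestep becomes its own one-step segment, which is why each dataset item in this tutorial has a leading dimension of 1.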

In [1]:
import cleandiffuser

In [2]:
import scipy.interpolate as interpolate

In [5]:
import gym

from cleandiffuser.env import kitchen
from cleandiffuser.dataset.kitchen_dataset import KitchenDataset


env = gym.make('kitchen-all-v0')
dataset = KitchenDataset("../dev/kitchen", horizon=1, pad_before=0, pad_after=0)

data = dataset[0]
obs, act = data["obs"]["state"], data["action"]
obs_dim, act_dim = obs.shape[-1], act.shape[-1]
print(f'Finish loading data. Observation shape: {obs.shape}. Action shape {act.shape}.')

Reading configurations for Franka
Initializing Franka sim
Finish loading data. Observation shape: torch.Size([1, 60]). Action shape torch.Size([1, 9]).


## 3 Create the Diffusion Model

We'll use a diffusion model to generate samples from the policy distribution $\pi_\theta(a|s)$. Following DBC, we use DDPM with `PearceMlp` as the neural network backbone and `PearceObsCondition` as the condition network. If you are familiar with the UNet used in image diffusion models, `PearceMlp` plays the same role here: it predicts the noise to remove at each step. After creating the two networks, we combine them into the diffusion model.

In [6]:
import torch

from cleandiffuser.diffusion import ContinuousDiffusionSDE
from cleandiffuser.nn_condition import PearceObsCondition
from cleandiffuser.nn_diffusion import PearceMlp


device = "cuda:0" if torch.cuda.is_available() else "cpu"

# nn.Module: xt (bs, act_dim) x t (bs, ) x condition (bs, To * emb_dim) -> eps_theta (bs, act_dim)
nn_diffusion = PearceMlp(act_dim=act_dim, To=1, emb_dim=128, hidden_dim=512, timestep_emb_type="untrainable_fourier")
# nn.Module: obs (bs, To, obs_dim) x t (bs, ) -> condition (bs, To * emb_dim) if `flatten` else (bs, To, emb_dim)
nn_condition = PearceObsCondition(obs_dim=obs_dim, emb_dim=128, flatten=True, dropout=0.0)

actor = ContinuousDiffusionSDE(
        nn_diffusion, nn_condition,
        x_max=+1. * torch.ones(act_dim),
        x_min=-1. * torch.ones(act_dim),
        ema_rate=0.9999, device=device)



Let's further understand the used parameters here:

**PearceMlp**
- `act_dim`: (int) The dimension of the action space.
- `To`: (int) Number of observations to condition on. `To=1` means the model conditions on only the current observation. `To>1` means the model conditions on the current and previous history observations.
- `emb_dim`: (int) The embedding dimension of the neural network. Should match the `emb_dim` defined in the condition network.
- `hidden_dim`: (int) The hidden dimension of the neural network.
- `timestep_emb_type`: (str) The type of timestep embedding. Use `untrainable_fourier` or `fourier` for continuous time embeddings. Use `untrainable_positional` or `positional` for discrete time embeddings.
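As a rough illustration of what an untrainable Fourier timestep embedding does, the sketch below maps a real-valued time `t` to sine and cosine features using frequencies that are drawn once and never trained. The function name, the `scale` value, and the feature layout are assumptions made for illustration; CleanDiffuser's own embedding may differ in its details.

```python
import math
import random


def make_fourier_embedding(emb_dim, scale=16.0, seed=0):
    """Build a fixed (untrainable) random Fourier embedding for a scalar time t."""
    rng = random.Random(seed)
    # Frequencies are drawn once and are never updated by training.
    freqs = [rng.gauss(0.0, scale) for _ in range(emb_dim // 2)]

    def embed(t):
        angles = [2.0 * math.pi * f * t for f in freqs]
        return [math.cos(a) for a in angles] + [math.sin(a) for a in angles]

    return embed


embed = make_fourier_embedding(emb_dim=8)
e = embed(0.37)  # any real-valued diffusion time works, not only integer steps
print(len(e))
```

Because the features are defined for any real `t`, this kind of embedding suits continuous-time diffusion, whereas positional embeddings are usually indexed by integer steps.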

**PearceObsCondition**
- `obs_dim`: (int) The dimension of the observation space.
- `emb_dim`: (int) The embedding dimension of the neural network. Should match the `emb_dim` defined in the backbone network.
- `flatten`: (bool) Whether to flatten the output embedding from (bs, To, emb_dim) to (bs, To * emb_dim). Since `PearceMlp` requires a flattened condition (bs, To * emb_dim), we set `flatten=True`.
- `dropout`: (float) The *label* dropout rate. It should be larger than `0.0` if you want to use CFG (classifier-free guidance). Since we don't use CFG here (or you can regard it as a CFG always using guidance strength $w=1.0$), we set `dropout=0.0`.
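The remark about $w=1.0$ can be checked with a small sketch of classifier-free guidance. It assumes the common parameterization in which the guided prediction is `w * eps_cond + (1 - w) * eps_uncond`; this is consistent with the statement above that $w=1.0$ amounts to plain conditional sampling, but the function `cfg_combine` is hypothetical and not CleanDiffuser's API.

```python
def cfg_combine(eps_cond, eps_uncond, w):
    """Blend conditional and unconditional noise predictions (one CFG convention)."""
    return [w * c + (1.0 - w) * u for c, u in zip(eps_cond, eps_uncond)]


eps_cond = [0.2, -0.4]    # prediction given the observation
eps_uncond = [0.0, 0.1]   # prediction with the observation dropped

# With w = 1.0 the unconditional branch has zero weight.
print(cfg_combine(eps_cond, eps_uncond, w=1.0))
# With w > 1.0 the result is pushed further away from the unconditional prediction.
print(cfg_combine(eps_cond, eps_uncond, w=2.0))
```

Under this convention the unconditional branch is never needed when `w=1.0`, so there is no reason to train it with label dropout.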

**ContinuousDiffusionSDE**
- `nn_diffusion`: (DiffusionModel) The neural network backbone of the diffusion model.
- `nn_condition`: (Optional[BaseNNCondition]) The condition network of the diffusion model. If `None`, the model is unconditioned. Here we set `nn_condition` to the condition network `PearceObsCondition`.
- `x_max`: (Optional[torch.Tensor]) The maximum value of the generated tensor, i.e., action here. Since the action range is $[-1, 1]$, we set `x_max=1.0 * torch.ones(act_dim)`. Setting `x_max` can help constrain the generated action within the action range. If `None`, the model does not constrain the generated tensor.
- `x_min`: (Optional[torch.Tensor]) The minimum value of the generated tensor, i.e., action here. Since the action range is $[-1, 1]$, we set `x_min=-1.0 * torch.ones(act_dim)`. Setting `x_min` can help constrain the generated action within the action range. If `None`, the model does not constrain the generated tensor.
- `ema_rate`: (float) The exponential moving average rate. We set `ema_rate=0.9999` to stabilize the generation quality.
- `device`: (Optional[torch.device]) Device.
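To give a feel for how slowly an EMA with `ema_rate=0.9999` moves, here is a minimal sketch of the usual update rule on a flat list of numbers. It assumes the EMA copy is updated once per gradient step as `ema = rate * ema + (1 - rate) * param`; the helper `ema_update` is illustrative and not taken from CleanDiffuser.

```python
def ema_update(ema_params, params, rate):
    """One exponential-moving-average step over a flat list of parameters."""
    return [rate * e + (1.0 - rate) * p for e, p in zip(ema_params, params)]


# Track a parameter that jumps from 0.0 to 1.0 and then stays there.
ema = [0.0]
for _ in range(1000):
    ema = ema_update(ema, [1.0], rate=0.9999)
print(ema)
```

After 1000 steps the averaged value has covered only about a tenth of the distance to the new parameter, which is why a rate this close to 1 smooths out step-to-step noise in the weights.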

Before training, let's sample some actions from the untrained diffusion model!

In [7]:
n_samples = 2
# Repeat the single observation once per sample: (1, obs_dim) -> (n_samples, 1, obs_dim)
sampled_acts, log = actor.sample(
    prior=torch.zeros((n_samples, act_dim)), solver="ddpm", n_samples=n_samples, sample_steps=5,
    condition_cfg=obs.expand(n_samples, 1, obs_dim).to(device), w_cfg=1.0)
print(f'Sampled actions: {sampled_acts}')

Sampled actions: tensor([[ 0.0909,  0.9461,  0.9999, -1.0000,  0.9999, -0.6491,  1.0000, -1.0000,
          0.8635],
        [ 0.3170, -0.9147,  0.9246, -0.3781, -0.7969, -0.4925,  1.0000, -0.9998,
          0.0205]], device='cuda:0')


As expected, the sampled actions are meaningless because the model is untrained. Still, this shows how to sample actions $a$ from the policy distribution $\pi_\theta(a|s)$ with `actor.sample()`.

## 4 Training the Diffusion Model

Training in CleanDiffuser is straightforward. We don't need to handle the loss details of each diffusion model; we just call `actor.update()` to take a gradient step. Here we train the model to generate `act` conditioned on `obs` from the demonstration dataset. We train for 500k steps with a batch size of 256 and the default learning rate of 3e-4. (In practice, training for 100k steps already gives good performance.)
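For intuition about what such an update optimizes, a noise-prediction diffusion model is typically trained with the objective below. This is the standard form, stated here as an assumption; the exact time sampling and loss weighting inside CleanDiffuser may differ. Here $a_0$ is a demonstrated action, $s$ the corresponding state, $\epsilon_\theta$ the backbone's noise prediction, and $\alpha_t, \sigma_t$ the signal and noise scales of the forward process.

$$
\mathcal{L}(\theta) = \mathbb{E}_{(s, a_0) \sim \mathcal{D},\; t,\; \epsilon \sim \mathcal{N}(0, I)} \left[ \left\| \epsilon - \epsilon_\theta(a_t, t, s) \right\|^2 \right], \qquad a_t = \alpha_t a_0 + \sigma_t \epsilon
$$

In the training loop, `x0=act` plays the role of $a_0$ and `condition=obs` plays the role of $s$.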

In [8]:
import os

from torch.utils.data import DataLoader

from cleandiffuser.utils import loop_dataloader


savepath = "./tutorials/results/1_a_minimal_DBC_implementation/"
os.makedirs(savepath, exist_ok=True)

dataloader = DataLoader(dataset, batch_size=256, shuffle=True, num_workers=4, persistent_workers=True)

n_gradient_steps = 0
avg_loss = 0.
actor.train()
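for batch in loop_dataloader(dataloader):

    # horizon=1, so take the single timestep of each segment: (bs, 1, dim) -> (bs, dim)
    obs, act = batch["obs"]["state"][:, 0].to(device), batch["action"][:, 0].to(device)

    # One gradient step on the diffusion loss for actions `act` given `obs`
    avg_loss += actor.update(x0=act, condition=obs)["loss"]

    n_gradient_steps += 1

    # Report the mean loss over the last 1000 steps
    if n_gradient_steps % 1000 == 0:
        print(f'Step: {n_gradient_steps} | Loss: {avg_loss / 1000}')
        avg_loss = 0.

    # Save a checkpoint every 100k steps
    if n_gradient_steps % 100_000 == 0:
        actor.save(savepath + "diffusion.pt")

    if n_gradient_steps == 500_000:
        break
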

Step: 1000 | Loss: 0.21947902069985867
Step: 2000 | Loss: 0.15672770391404628
Step: 3000 | Loss: 0.1299504531994462
Step: 4000 | Loss: 0.1147472239881754
Step: 5000 | Loss: 0.10775078709423543
Step: 6000 | Loss: 0.10308647757023573
Step: 7000 | Loss: 0.09798835342004895
Step: 8000 | Loss: 0.09597688286751509
Step: 9000 | Loss: 0.09235397230088711
Step: 10000 | Loss: 0.09122545094415545
Step: 11000 | Loss: 0.08955496735870838
Step: 12000 | Loss: 0.08738295983150601
Step: 13000 | Loss: 0.08741957069188357
Step: 14000 | Loss: 0.08559414035826922
Step: 15000 | Loss: 0.08445013580098748
Step: 16000 | Loss: 0.08360904774442315
Step: 17000 | Loss: 0.08315508530288934
Step: 18000 | Loss: 0.08230660912767053
Step: 19000 | Loss: 0.08138406999409199
Step: 20000 | Loss: 0.0806511876322329
Step: 21000 | Loss: 0.07983257757872343
Step: 22000 | Loss: 0.07903466258198023
Step: 23000 | Loss: 0.07905950520932674
Step: 24000 | Loss: 0.0787675854600966
Step: 25000 | Loss: 0.07783875493332744
Step: 26000 |

KeyboardInterrupt: 

## 5 Evaluation

Let's see how our DBC performs in the RelayKitchen environment! We run 50 environments in parallel and repeat the evaluation over 3 rounds, for 150 episodes in total. The evaluation metric is the success rate of finishing `n` tasks. We use DDPM with 5 sampling steps (compared to the 50 sampling steps used in the official DBC implementation) to generate actions. The results show a success rate of 76.67% for finishing 4 tasks, compared to the 68% reported in the DBC paper!
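The metric can be sketched in a few lines of plain Python. This assumes that an episode's total reward equals the number of tasks it completed, which matches how the evaluation code compares episode rewards against task counts. The helper `task_success_rates` and the sample data are hypothetical, not results from this tutorial.

```python
def task_success_rates(completed_tasks, max_tasks=7):
    """Fraction of episodes that finished at least n tasks, for n = 1..max_tasks."""
    n_episodes = len(completed_tasks)
    return [sum(c >= n for c in completed_tasks) / n_episodes
            for n in range(1, max_tasks + 1)]


# Hypothetical per-episode task counts, not results from this tutorial.
completed = [4, 3, 4, 2]
print(task_success_rates(completed, max_tasks=4))
```

Since an episode that finishes 4 tasks also counts as finishing 1, 2, and 3 tasks, the success rates can only decrease as `n` grows.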

In [9]:
import numpy as np


solver = "ddpm"
sampling_step = 5
num_episodes = 3
num_envs = 50


actor.load(savepath + "diffusion.pt")
actor.eval()

# Parallelize evaluation
env_eval = gym.vector.make('kitchen-all-v0', num_envs=num_envs)

# Get normalizers
normalizers = dataset.get_normalizer()
state_normalizer = normalizers["obs"]["state"]
action_normalizer = normalizers["action"]

all_rews = []
for n in range(num_episodes):

    obs, done, ep_rews, ep_len = env_eval.reset(), False, 0., 0
    prior = torch.zeros((num_envs, act_dim), device=device)
    while not np.all(done):

        # Normalize observations with the dataset's normalizer
        obs = state_normalizer.normalize(obs)

        # Sample one action per environment, conditioned on its observation
        act, log = actor.sample(
            prior, solver=solver, n_samples=num_envs, sample_steps=sampling_step,
            sample_step_schedule="quad_continuous",
            w_cfg=1.0, condition_cfg=torch.tensor(obs, device=device, dtype=torch.float32))
        act = act.cpu().numpy()

        # Map the sampled actions back to the environment's action scale
        act = action_normalizer.unnormalize(act)

        obs, rew, _done, info = env_eval.step(act)
        # Only accumulate reward for environments that have not finished yet
        ep_rews += rew * (1 - done)
        done = np.logical_or(done, _done)
        ep_len += 1

        print(f'[({n + 1}/{num_episodes}), t={ep_len}] Episode reward: {ep_rews}')

    all_rews.append(ep_rews)

all_rews = np.array(all_rews)
task_sr = np.zeros((7, ))
for i in range(7):
    task_sr[i] = (all_rews > i).sum() / (num_episodes * num_envs)
    
print(f'Evaluated {int(num_episodes * num_envs)} episodes.')
print(f'Task success rate: {np.round(task_sr * 100., 2)}')

Reading configurations for Franka
Initializing Franka sim


  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")

