# Lab 08: Imitation Learning

In this lab, we look into the problem of learning from expert demonstrations.

- Find a policy $\pi(a | s)$ that best imitates the expert policy $\pi^*(a | s)$ in the given environment.
- Note that we do not need access to the environment rewards.
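
One common way to make "best imitates" precise is sketched below. It assumes a parametric policy $\pi_\theta$ and a per-state loss $\ell$; neither symbol is introduced in this lab, so treat the formula as an illustration and not as the lab's definition:

$$
\hat{\theta} = \arg\min_{\theta} \; \mathbb{E}_{s \sim d^{\pi^*}} \Big[ \ell\big(\pi_\theta(\cdot \mid s), \, \pi^*(\cdot \mid s)\big) \Big]
$$

Here $d^{\pi^*}$ denotes the distribution of states visited by the expert. No reward term appears in the objective, which is why the environment rewards are not needed.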

Major Imitation Learning techniques are:

1. Behavioural Cloning,
1. Imitation Learning via Interactive Demonstrator e.g. SMILe (Ross and Bagnell, 2010) or DAgger (Ross et al., 2011),
1. Inverse Reinforcement Learning -- out of scope of this lab.

We will solve the Ant problem, shown below, examining the first two approaches.

## Install dependencies

In [1]:
!pip -q install gymnasium[mujoco]
!pip -q install pyglet
!pip -q install pyopengl
!pip -q install pyvirtualdisplay

In [2]:
!pip install numpy==1.23.5
!pip install torch==2.5
!git clone https://github.com/lychanl/sample-factory.git

fatal: destination path 'sample-factory' already exists and is not an empty directory.


In [3]:
!pip install -q sample-factory[mujoco]

In [4]:
%cd sample-factory

/content/ML-stuff/1.2/RL/lab08/sample-factory


## Download Expert

In [5]:
!python -m sample_factory.huggingface.load_from_hub -r LLParallax/sf_Ant

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
/content/ML-stuff/1.2/RL/lab08/sample-factory/./train_dir/sf_Ant is already a clone of https://huggingface.co/LLParallax/sf_Ant. Make sure you pull the latest changes with `repo.git_pull()`.
[37m[1m[2025-04-24 16:55:00,826][11904] The repository LLParallax/sf_Ant has been cloned to ./train_dir/sf_Ant[0m


In [6]:
import functools

import torch

from sample_factory.algo.learning.learner import Learner
from sample_factory.algo.utils.env_info import extract_env_info
from sample_factory.algo.utils.make_env import make_env_func_batched
from sample_factory.algo.utils.rl_utils import prepare_and_normalize_obs
from sample_factory.cfg.arguments import load_from_checkpoint
from sample_factory.model.actor_critic import create_actor_critic
from sample_factory.model.model_utils import get_rnn_size
from sample_factory.utils.attr_dict import AttrDict
from sample_factory.utils.typing import Config


def create_expert(cfg):
    cfg = load_from_checkpoint(cfg)

    cfg.num_envs = 1

    env = make_env_func_batched(
        cfg, env_config=AttrDict(worker_index=0, vector_index=0, env_id=0), render_mode=None
    )

    if hasattr(env.unwrapped, "reset_on_init"):
        # reset call ruins the demo recording for VizDoom
        env.unwrapped.reset_on_init = False

    actor_critic = create_actor_critic(cfg, env.observation_space, env.action_space)
    actor_critic.eval()

    device = torch.device("cpu" if cfg.device == "cpu" else "cuda")
    actor_critic.model_to_device(device)

    policy_id = cfg.policy_index
    name_prefix = dict(latest="checkpoint", best="best")[cfg.load_checkpoint_kind]
    checkpoints = Learner.get_checkpoints(Learner.checkpoint_dir(cfg, policy_id), f"{name_prefix}_*")
    checkpoint_dict = Learner.load_checkpoint(checkpoints, device)
    actor_critic.load_state_dict(checkpoint_dict["model"])
    return actor_critic


def get_expert_actions(obs, cfg: Config, actor_critic, env, env_info, device):
    rnn_states = torch.zeros([env.num_agents, get_rnn_size(cfg)], dtype=torch.float32, device=device)

    obs = {"obs": obs}
    with torch.no_grad():
        normalized_obs = prepare_and_normalize_obs(actor_critic, obs)
        policy_outputs = actor_critic(normalized_obs, rnn_states)

        # sample actions from the distribution by default
        actions = policy_outputs["actions"]
    return actions

## Load expert model

In [7]:
from sample_factory.cfg.arguments import parse_full_cfg, parse_sf_args
from sample_factory.envs.env_utils import register_env
from sf_examples.mujoco.mujoco_params import add_mujoco_env_args, mujoco_override_defaults
from sf_examples.mujoco.train_mujoco import register_mujoco_components
from sf_examples.mujoco.mujoco_utils import MUJOCO_ENVS, make_mujoco_env


def register_mujoco_components():
    for env in MUJOCO_ENVS:
        register_env(env.name, make_mujoco_env)


register_mujoco_components()
argv = ["--algo=APPO", "--env=mujoco_ant", "--experiment=sf_Ant", "--train_dir=train_dir", "--no_render"]
parser, partial_cfg = parse_sf_args(argv=argv, evaluation=True)
add_mujoco_env_args(partial_cfg.env, parser)
mujoco_override_defaults(partial_cfg.env, parser)
cfg = parse_full_cfg(parser, argv=argv)
expert = create_expert(cfg)

[2025-04-24 16:55:04,034][11684] Loading existing experiment configuration from train_dir/sf_Ant/config.json
[2025-04-24 16:55:04,036][11684] Overriding arg 'experiment' with value 'sf_Ant' passed from command line
[2025-04-24 16:55:04,037][11684] Overriding arg 'train_dir' with value 'train_dir' passed from command line
[2025-04-24 16:55:04,038][11684] Adding new argument 'wandb_dir'='/content/ML-stuff/1.2/RL/lab08/sample-factory/wandb' that is not in the saved config file!
[2025-04-24 16:55:04,039][11684] Adding new argument 'fps'=0 that is not in the saved config file!
[2025-04-24 16:55:04,039][11684] Adding new argument 'eval_env_frameskip'=None that is not in the saved config file!
[2025-04-24 16:55:04,040][11684] Adding new argument 'no_render'=True that is not in the saved config file!
[2025-04-24 16:55:04,040][11684] Adding new argument 'save_video'=False that is not in the saved config file!
[2025-04-

## Helpers

Helper functions for collecting data and for evaluation.
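
Evaluation relies on turning a flat buffer of rewards and done flags into per-episode returns. A minimal sketch of that step on a toy buffer follows; `episode_returns` is a hypothetical name used only for this illustration, and the numbers are made up:

```python
import numpy as np


def episode_returns(rew, done):
    """Sum rewards within each finished episode of a flat rollout buffer."""
    rew = np.asarray(rew, dtype=float).reshape(-1)
    done = np.asarray(done).reshape(-1).astype(bool)
    # Cumulative reward sampled at the steps where an episode ended.
    cumsum_at_ends = np.cumsum(rew)[done]
    # Subtracting the previous end point leaves the return of each episode.
    return np.diff(cumsum_at_ends, prepend=0.0)


# Toy buffer: two finished episodes (3 and 2 steps) and one unfinished step.
rew = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
done = np.array([0, 0, 1, 0, 1, 0])
returns = episode_returns(rew, done)
print(returns)  # [6. 9.]
```

The trailing step belongs to an episode that never finished, so its reward is not counted in any return.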

In [8]:
import time

from IPython import display as ipydisplay

import torch
import gymnasium as gym
import matplotlib.pyplot as plt
import numpy as np

from matplotlib import animation


@torch.no_grad()
def run_policy(env, model, total_steps=10000, verbose=True):
    obs_array = np.empty([total_steps, *env.observation_space.shape])
    act_array = np.empty([total_steps, env.action_space.shape[0]])
    rew_array = np.empty([total_steps, 1])
    done_array = np.empty([total_steps, 1])

    iter_time = time.time()
    done = True
    for i in range(total_steps):
        if verbose and (i + 1) % 1000 == 0:
            steps_per_second = 1000 / (time.time() - iter_time)
            print(f'Step {i + 1}/{total_steps}, Steps per second: {steps_per_second}')
            iter_time = time.time()

        if done:
            obs, info = env.reset()

        act = model(torch.from_numpy(obs).unsqueeze(0).float())[0].detach().cpu().numpy()
        obs_, rew, terminated, truncated, _ = env.step(act)
        done = terminated or truncated

        obs_array[i] = obs
        act_array[i] = act
        rew_array[i] = rew
        done_array[i] = float(done)

        obs = obs_

    return obs_array, act_array, rew_array, done_array

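def calculate_returns(rew, done):
    # Cumulative reward over the whole buffer, sampled at the episode ends.
    rew_cumsum = np.cumsum(rew)
    ret_cumsum_trimmed = rew_cumsum[done.reshape(-1).astype(bool)]
    # Differences between consecutive episode ends give the per-episode returns.
    ret_cumsum_trimmed[1:] -= ret_cumsum_trimmed[:-1]
    return ret_cumsum_trimmed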

def evaluate_agent(env, model, verbose=False):
    _, _, rew, done = run_policy(env, model, total_steps=50000, verbose=verbose)
    rets = calculate_returns(rew, done)

    print(f'Num. episodes: {len(rets)}')
    print(f'Avg. return: {np.mean(rets)}')
    print(f'Max. return: {np.max(rets)}')
    print(f'Min. return: {np.min(rets)}')

@torch.no_grad()
def collect_frames(eval_env, model, num_frames=2000):
    state, _ = eval_env.reset()
    state = torch.from_numpy(np.array(state)).float()
    frames = []

    for _ in range(num_frames):
        frames.append(eval_env.render())

        action = model(state.unsqueeze(0))[0]
        next_state, reward, terminal, truncate, info = eval_env.step(action.detach().cpu().numpy())

        if terminal or truncate:
            # Start a new episode instead of reusing the stale final state.
            next_state, _ = eval_env.reset()
        state = torch.from_numpy(np.array(next_state)).float()

    return frames

def display_frames_as_video(frames):
    """
    Displays a list of frames as a video.
    """
    plt.figure(figsize=(frames[0].shape[1] / 72.0, frames[0].shape[0] / 72.0), dpi=72)
    patch = plt.imshow(frames[0])
    plt.axis('off')

    def animate(i):
        patch.set_data(frames[i])

    anim = animation.FuncAnimation(plt.gcf(), animate, frames=len(frames), interval=50)
    ipydisplay.display(ipydisplay.HTML(anim.to_jshtml()))

## 1. Behavioural Cloning

Algorithm

1. Collect the expert data.
2. Fit the model (classifier/regressor) to the expert data.
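
The two steps can be sketched on a toy problem before applying them to Ant. Everything in this block is an assumption made for illustration: `toy_expert` is a hypothetical linear map standing in for the downloaded expert, the states are random vectors and not environment rollouts, and the network sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the expert: a fixed linear map from states to actions.
expert_weights = torch.randn(4, 2)


def toy_expert(states):
    return states @ expert_weights


# 1. Collect the expert data (random states here instead of environment rollouts).
states = torch.randn(512, 4)
actions = toy_expert(states)

# 2. Fit a regressor to the (state, action) pairs with a mean-squared-error loss.
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

with torch.no_grad():
    initial_loss = loss_fn(policy(states), actions).item()

for _ in range(200):
    loss = loss_fn(policy(states), actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    final_loss = loss_fn(policy(states), actions).item()

print(f"MSE before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Regression with a mean-squared-error loss suits continuous actions; a discrete action space would call for a classifier and a cross-entropy loss.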

### Create model

In [9]:
import torch
import torch.nn as nn


class MLP(nn.Module):
    def __init__(self, input_shape, output_size, hidden_sizes=(256, 256), hidden_activation=nn.Tanh(), output_activation=None, l2_weight=0.0001):
        super(MLP, self).__init__()
        self.layers = nn.Sequential()

        # Input layer
        self.layers.add_module("input", nn.Linear(input_shape, hidden_sizes[0]))
        self.layers.add_module("input_activation", hidden_activation)

        # Hidden layers
        layer_sizes = zip(hidden_sizes[:-1], hidden_sizes[1:])
        for i, (h1, h2) in enumerate(layer_sizes):
            self.layers.add_module(f"hidden_{i}", nn.Linear(h1, h2))
            self.layers.add_module(f"activation_{i}", hidden_activation)

        # Output layer
        self.layers.add_module("output", nn.Linear(hidden_sizes[-1], output_size))
        if output_activation is not None:
            self.layers.add_module("output_activation", output_activation)

        # Regularization
        self.l2_weight = l2_weight

    def forward(self, x):
        # Forward pass through the network
        x = self.layers(x)
        return x

    def l2_regularization(self):
        l2_reg = None
        for name, param in self.named_parameters():
            if 'weight' in name:
                if l2_reg is None:
                    l2_reg = param.norm(2)
                else:
                    l2_reg = l2_reg + param.norm(2)
        return self.l2_weight * l2_reg

### Function for training the model

In [10]:
from torch.utils.data import DataLoader, TensorDataset


def train(obs, act, model, num_epochs=10, batch_size=32):
    obs_tensor = torch.tensor(obs, dtype=torch.float32)
    act_tensor = torch.tensor(act, dtype=torch.float32)

    dataset = TensorDataset(obs_tensor, act_tensor)
    data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)


    # Define the loss function and optimizer
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters())

    # Training loop
    for epoch in range(num_epochs):
        for batch_idx, (x_batch, y_batch) in enumerate(data_loader):
            # Forward pass
            y_pred = model(x_batch)

            # Compute loss
            loss = loss_fn(y_pred, y_batch) + model.l2_regularization()

            # Zero gradients, perform a backward pass, and update the weights.
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Print the loss of the last mini-batch after every epoch
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}")

In [11]:
env = gym.make('Ant-v4')
env.num_agents = 1
env_info = extract_env_info(env, cfg)
device = torch.device("cpu" if cfg.device == "cpu" else "cuda")
collected_data = run_policy(env, functools.partial(get_expert_actions, cfg=cfg, actor_critic=expert, env=env, env_info=env_info, device=device), total_steps=10000)

Step 1000/10000, Steps per second: 369.21153874085456
Step 2000/10000, Steps per second: 554.943421439983
Step 3000/10000, Steps per second: 578.2246508777101
Step 4000/10000, Steps per second: 561.3150039037606
Step 5000/10000, Steps per second: 541.2807653676655
Step 6000/10000, Steps per second: 555.0655522753391
Step 7000/10000, Steps per second: 437.14705200912044
Step 8000/10000, Steps per second: 417.0037301928885
Step 9000/10000, Steps per second: 530.0576108367568
Step 10000/10000, Steps per second: 565.5538486493964


In [12]:
obs, act, rewards, dones = collected_data

print(obs.shape, act.shape, rewards.shape, dones.shape)

# print(obs)

# EXERCISE: Create model
model = MLP(obs.shape[1], act.shape[1])

train(obs, act, model)

(10000, 27) (10000, 8) (10000, 1) (10000, 1)


Epoch 1/10, Loss: 0.01409242581576109
Epoch 2/10, Loss: 0.012592637911438942
Epoch 3/10, Loss: 0.01511292066425085
Epoch 4/10, Loss: 0.009069368243217468
Epoch 5/10, Loss: 0.012666529044508934
Epoch 6/10, Loss: 0.010790582746267319
Epoch 7/10, Loss: 0.007859050296247005
Epoch 8/10, Loss: 0.010145654901862144
Epoch 9/10, Loss: 0.01104153972119093
Epoch 10/10, Loss: 0.012880825437605381


In [13]:
evaluate_agent(env, model)

Num. episodes: 66
Avg. return: 3749.199646308135
Max. return: 5834.928417044925
Min. return: 20.85465267905238


### Exercise

Discuss the questions

1. In principle, do we need the expert policy for BC?

2. What are the problems with BC?

3. How can we help BC do better?


In [14]:
# Collect the exploratory data
def exploratory(obs, **kwargs):
    """Adds the Gaussian noise to the expert actions."""
    rnn_states = torch.zeros([env.num_agents, get_rnn_size(cfg)], dtype=torch.float32, device=device)

    obs = {"obs": obs}
    with torch.no_grad():
        normalized_obs = prepare_and_normalize_obs(kwargs['actor_critic'], obs)
        policy_outputs = kwargs['actor_critic'](normalized_obs, rnn_states)

        # sample actions from the distribution by default
        actions = policy_outputs["actions"]

    actions += torch.randn_like(actions) / 10
    return actions

expl_data = run_policy(env, functools.partial(exploratory, cfg=cfg, actor_critic=expert, env=env, env_info=env_info, device=device), total_steps=10000)

Step 1000/10000, Steps per second: 550.3415110988647
Step 2000/10000, Steps per second: 532.1157553333
Step 3000/10000, Steps per second: 411.6771605325425
Step 4000/10000, Steps per second: 428.5263867913097
Step 5000/10000, Steps per second: 547.7728002064513
Step 6000/10000, Steps per second: 541.01133668098
Step 7000/10000, Steps per second: 546.9134683653571
Step 8000/10000, Steps per second: 559.8663145753446
Step 9000/10000, Steps per second: 544.922158691554
Step 10000/10000, Steps per second: 434.3872657738518


In [15]:
obs_expl, act_expl, rewards, dones = expl_data
# Exercise: Run BC on the exploratory data

# ANSWER
model_expl = MLP(obs_expl.shape[1], act_expl.shape[1])

train(obs_expl, act_expl, model_expl)
# END ANSWER

Epoch 1/10, Loss: 0.03041917271912098
Epoch 2/10, Loss: 0.02428119257092476
Epoch 3/10, Loss: 0.0363578125834465
Epoch 4/10, Loss: 0.025699084624648094
Epoch 5/10, Loss: 0.027162104845046997
Epoch 6/10, Loss: 0.027334270998835564
Epoch 7/10, Loss: 0.0225650817155838
Epoch 8/10, Loss: 0.019606370478868484
Epoch 9/10, Loss: 0.02227487787604332
Epoch 10/10, Loss: 0.02308768220245838


In [16]:
evaluate_agent(env, model_expl)

Num. episodes: 54
Avg. return: 4755.789993483126
Max. return: 5639.069530215638
Min. return: 48.74341965483836


### Exercise

Answer the questions

1. Why does it do better?

2. How can we use the expert to further improve the data?


In [17]:
# Exercise: Infer the expert actions on the exploratory observations
#           and run BC on them.

# ANSWER
# Collect fresh expert rollouts.
collected_data = run_policy(env, functools.partial(get_expert_actions, cfg=cfg, actor_critic=expert, env=env, env_info=env_info, device=device), total_steps=10000)

obs, act, rewards, dones = collected_data

# Perturb the observations with Gaussian noise; the targets stay the expert actions.
obs = torch.tensor(obs, dtype=torch.float32)
eobs = obs + torch.randn_like(obs) / 10

model_oexpl = MLP(eobs.shape[1], act.shape[1])

train(eobs, act, model_oexpl)
# END ANSWER

Step 1000/10000, Steps per second: 396.4603215368312
Step 2000/10000, Steps per second: 422.40114726874106
Step 3000/10000, Steps per second: 574.4097764743793
Step 4000/10000, Steps per second: 567.0541616446894
Step 5000/10000, Steps per second: 567.7668777927734
Step 6000/10000, Steps per second: 577.6359270666994
Step 7000/10000, Steps per second: 571.7701519715985
Step 8000/10000, Steps per second: 490.80509253533006
Step 9000/10000, Steps per second: 428.8574150401889
Step 10000/10000, Steps per second: 478.4076317025538




Epoch 1/10, Loss: 0.01786557212471962
Epoch 2/10, Loss: 0.020376749336719513
Epoch 3/10, Loss: 0.01852814108133316
Epoch 4/10, Loss: 0.018537839874625206
Epoch 5/10, Loss: 0.01323208212852478
Epoch 6/10, Loss: 0.01210375688970089
Epoch 7/10, Loss: 0.014186827465891838
Epoch 8/10, Loss: 0.010674932971596718
Epoch 9/10, Loss: 0.012855072505772114
Epoch 10/10, Loss: 0.011490416713058949


In [18]:
evaluate_agent(env, model_oexpl)

Num. episodes: 58
Avg. return: 2647.846824096608
Max. return: 4700.022542569983
Min. return: 63.91135489550652


In [20]:
print(device)

cuda


### Exercise

Answer the questions

1. Did it help? Why?


1. How can you extend this idea?


## 2. Imitation Learning via Interactive Demonstrator

[DAgger](https://www.ri.cmu.edu/pub_files/2011/4/Ross-AISTATS11-NoRegret.pdf)

1. Collect the expert data.
2. Fit the model (classifier/regressor) to the expert data.
3. Collect the imitator data.
4. Infer the expert actions on the imitator data.
5. Fit the model to the extended dataset.
6. Repeat from 3.
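
The loop can be sketched on a toy problem. Everything in this block is an assumption made for illustration: the two-dimensional dynamics in `step`, the `expert_policy`, and the linear least-squares learner in `fit` stand in for the Ant environment, the downloaded expert, and the MLP used in this lab.

```python
import numpy as np

rng = np.random.default_rng(0)


def expert_policy(state):
    # Hypothetical expert: push the state back towards the origin.
    return -np.tanh(state)


def step(state, action):
    # Toy dynamics: the action nudges the state, plus a little noise.
    return state + 0.5 * action + 0.05 * rng.standard_normal(state.shape)


def fit(states, actions):
    # Linear least-squares regressor standing in for the MLP.
    weights, *_ = np.linalg.lstsq(states, actions, rcond=None)
    return lambda s: np.atleast_2d(s) @ weights


def rollout(policy, num_steps):
    # Run `policy` in the toy environment and record the visited states.
    state = rng.uniform(-2.0, 2.0, size=2)
    visited = []
    for _ in range(num_steps):
        visited.append(state)
        state = step(state, policy(state).reshape(-1))
    return np.array(visited)


# Steps 1-2: collect expert data and fit the first model.
data_states = rollout(expert_policy, 200)
data_actions = expert_policy(data_states)
policy = fit(data_states, data_actions)

for iteration in range(4):
    # Step 3: collect the states visited by the current imitator.
    new_states = rollout(policy, 200)
    # Step 4: label those states with the expert's actions.
    new_actions = expert_policy(new_states)
    # Step 5: aggregate and refit from scratch on the extended dataset.
    data_states = np.concatenate([data_states, new_states])
    data_actions = np.concatenate([data_actions, new_actions])
    policy = fit(data_states, data_actions)
    print(f"iter {iteration + 1}: dataset size = {len(data_states)}")
```

The key difference from behavioural cloning is in steps 3 and 4: the states come from the imitator's own rollouts, while the labels still come from the expert, so the training set covers the states the imitator actually reaches.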

In [19]:
# We will pre-train on less expert data to keep the same dataset size
obs_ = obs[:2000, :]
act_ = act[:2000, :]

# EXERCISE: pretrain on first 2000 samples
# ANSWER
...
# END ANSWER

evaluate_agent(env, model_dagger)

NameError: name 'model_dagger' is not defined

In [None]:
# Exercise: Implement DAgger

for i in range(4):
    print(f'\n### Iter. {i+1} ###')

    # ANSWER
    print('\n1. Data collection')
    obs_extra, _, _, _ = ...  # Collect 2k steps


    print('\n2. Training')
    # reset model for fair comparison
    model_dagger = ...

    # END ANSWER

    print('\n3. Evaluation')
    evaluate_agent(env, model_dagger)


### Iter. 1 ###

1. Data collection
Step 1000/2000, Steps per second: 3294.7693632223873
Step 2000/2000, Steps per second: 3707.6622117245215

2. Training
Epoch 1/10, Loss: 0.01619561016559601
Epoch 2/10, Loss: 0.01852767914533615
Epoch 3/10, Loss: 0.013723315671086311
Epoch 4/10, Loss: 0.014866928569972515
Epoch 5/10, Loss: 0.014802966266870499
Epoch 6/10, Loss: 0.011287961155176163
Epoch 7/10, Loss: 0.015429697930812836
Epoch 8/10, Loss: 0.01600058376789093
Epoch 9/10, Loss: 0.012179029174149036
Epoch 10/10, Loss: 0.011647832579910755

3. Evaluation
Num. episodes: 53
Avg. return: 4466.662482786521
Max. return: 5459.026620695091
Min. return: 384.4503018264944

### Iter. 2 ###

1. Data collection
Step 1000/2000, Steps per second: 3451.289655274561
Step 2000/2000, Steps per second: 3548.693703259671

2. Training
Epoch 1/10, Loss: 0.01004914939403534
Epoch 2/10, Loss: 0.010697782039642334
Epoch 3/10, Loss: 0.009375996887683868
Epoch 4/10, Loss: 0.00921697448939085
Epoch 5/10, Loss: 0.00

### Note

Training the expert with the PPO algorithm took 10M environment interactions. Here, we nearly match its performance with only 10k samples! Learning from an expert can be much more sample-efficient than reinforcement learning.