<a href="https://colab.research.google.com/github/ShahidHasib586/MIR-Deep-learning/blob/main/Shahid_Ahamed_Hasib_Imitation_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# *Imitation Learning with Behavior Cloning and DAgger*
In this notebook, we will explore *imitation learning* using behavior cloning and DAgger.

## *Objectives:*
1. *Understand Behavior Cloning* – Train an agent using supervised learning from expert demonstrations.
2. *Observe Covariate Shift* – Analyze the limitations of behavior cloning.
3. *Implement DAgger* – Improve the agent iteratively using expert corrections.
4. *Visualize Performance* – Record videos of agent behavior.


### Name: Shahid Ahamed Hasib. MIR Erasmus Mundus.
## What I will try to achieve:

# Imitation Learning Experiments Report

This notebook explores the behavior of imitation learning under different experimental setups using the LunarLander-v3 environment. Both Behavior Cloning (BC) and DAgger algorithms are implemented and evaluated. The aim is to study how model performance varies with changes in network architecture, expert demonstration quantity, and expert quality.

---

## Experiment 1: Changing Neural Network Architecture

### Objective:
Evaluate how modifying the architecture of the imitation model affects learning and generalization.

### Modification:
The original shallow network was replaced with a deeper architecture using:
- 3 hidden layers: 256 → 128 → 64
- LeakyReLU activation functions
- Dropout(0.3) after the first layer
- Softmax on output
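
The architecture listed above could look roughly like the following PyTorch sketch. The class name `DeepImitationNetwork` and the `action_probs` helper are hypothetical and do not come from the notebook. One deliberate difference from the list: the sketch returns raw logits from `forward` and applies the softmax in a separate inference method, on the assumption that the model is trained with `nn.CrossEntropyLoss`, which applies log-softmax internally.

```python
import torch
import torch.nn as nn


class DeepImitationNetwork(nn.Module):
    """Hypothetical name: a sketch of the deeper network described above."""

    def __init__(self, input_dim: int, output_dim: int, dropout: float = 0.3):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.LeakyReLU(),
            nn.Dropout(dropout),        # dropout after the first hidden layer
            nn.Linear(256, 128),
            nn.LeakyReLU(),
            nn.Linear(128, 64),
            nn.LeakyReLU(),
            nn.Linear(64, output_dim),  # raw logits, one per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Logits, suitable as input to nn.CrossEntropyLoss.
        return self.network(x)

    @torch.no_grad()
    def action_probs(self, x: torch.Tensor) -> torch.Tensor:
        # Softmax applied only at inference time to obtain action probabilities.
        return torch.softmax(self.forward(x), dim=-1)


# LunarLander has 8 observation dimensions and 4 discrete actions.
net = DeepImitationNetwork(input_dim=8, output_dim=4)
net.eval()  # disable dropout for inference
probs = net.action_probs(torch.zeros(5, 8))
print(probs.shape)
```

Keeping the softmax out of `forward` avoids normalizing twice during training; `argmax` over logits and over probabilities selects the same action.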

### Results:
- **Behavior Cloning**:
  - Achieved better average reward than the baseline shallow model.
  - Early stopping helped prevent overfitting.
  - Average reward: ~120–130
- **DAgger**:
  - Further improved performance to ~230–270 average reward.
  - Significantly more stable landings and improved robustness to edge-case states.
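
Early stopping, mentioned in the Behavior Cloning results above, can be sketched as a small helper that watches a validation loss. The report does not state the stopping criterion that was used, so the `patience` and `min_delta` parameters, the class name, and the example losses below are all assumptions for illustration.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` checks."""

    def __init__(self, patience: int = 10, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_checks = 0

    def step(self, val_loss: float) -> bool:
        """Record a validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


# Usage with made-up validation losses (illustrative numbers only).
stopper = EarlyStopping(patience=3)
val_losses = [1.00, 0.80, 0.70, 0.71, 0.72, 0.73, 0.60]
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(val_loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_loss)
```

In a training loop, `step` would be called once per epoch with the loss on a held-out slice of the expert data, and the weights from the best epoch would typically be restored afterwards.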

### Conclusion:
A deeper architecture with regularization significantly improves the model's capacity to mimic the expert and generalize under unseen conditions.

---

## Experiment 2: Varying Number of Expert Demonstrations

### Objective:
Analyze how the quantity of expert demonstrations affects BC and DAgger.

### Setup:
- Models were trained using 5, 20, and 100 expert episodes.
- Same architecture as Experiment 1.

### Behavior Cloning Results:

| Expert Episodes | Avg. Reward | Notes                    |
|------------------|-------------|--------------------------|
| 5                | ~10–30      | Overfits, unstable       |
| 20               | ~110–130    | Balanced performance     |
| 100              | ~160–200    | Best, stable landings    |

### DAgger Results:

| Expert Episodes | Avg. Reward After DAgger | Notes                       |
|------------------|--------------------------|-----------------------------|
| 5                | ~160                     | Huge improvement            |
| 20               | ~220                     | Major correction            |
| 100              | ~240                     | Slight fine-tuning gain     |

### Conclusion:
- Behavior Cloning requires a minimum threshold of expert data (~20 episodes).
- DAgger outperforms BC, especially in data-scarce settings.
- Gains from DAgger diminish with large datasets but are still positive.

---

## Experiment 3: Modifying the Expert Policy

### Objective:
Investigate the impact of imperfect or noisy expert policies on BC and DAgger.

### Setup:
- A noisy expert was simulated by randomly selecting incorrect actions with a 20% probability.
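
One way to simulate such a noisy expert is to corrupt each expert label with a fixed probability. The sketch below assumes that an "incorrect" action is drawn uniformly from the other three actions; the report does not specify this, and the function name `noisy_expert_action` is hypothetical.

```python
import random
from typing import Optional


def noisy_expert_action(expert_action: int, n_actions: int = 4,
                        noise_prob: float = 0.2,
                        rng: Optional[random.Random] = None) -> int:
    """With probability `noise_prob`, return a uniformly chosen different action."""
    rng = rng or random
    if rng.random() < noise_prob:
        wrong_actions = [a for a in range(n_actions) if a != expert_action]
        return rng.choice(wrong_actions)
    return expert_action


# Empirical check: roughly 20% of labels should differ from the expert's action.
rng = random.Random(0)
labels = [noisy_expert_action(2, rng=rng) for _ in range(10_000)]
error_rate = sum(a != 2 for a in labels) / len(labels)
print(f"fraction of corrupted labels: {error_rate:.3f}")
```

With a Stable-Baselines3 expert, the integer passed in would be the action returned by `expert_policy.predict(state)`.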

### Results:

| Method            | Avg. Reward | Notes                               |
|-------------------|-------------|-------------------------------------|
| BC (Noisy Expert) | ~40–60      | Model learns mistakes               |
| DAgger (Noisy)    | ~120–160    | Significant improvement             |

### Conclusion:
- Behavior Cloning is highly sensitive to expert noise.
- DAgger is surprisingly robust, even when expert labels are noisy.
- The iterative correction process mitigates early misbehavior learned by the model.

---

## Final Observations

1. **Architecture matters**: Deeper, regularized networks help imitation learning scale and generalize better.
2. **DAgger is superior to BC**, especially in scenarios with limited or imperfect data.
3. **Expert quality is crucial**, but DAgger can still recover good performance from noisy supervision.

---




In [None]:
# sudo apt install g++ swig

In [None]:
!pip install --upgrade pip setuptools wheel



In [None]:
!apt-get install swig
!pip install box2d-py

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
swig is already the newest version (4.0.2-1ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 29 not upgraded.


In [None]:
!pip install gymnasium[box2d]

Collecting box2d-py==2.3.5 (from gymnasium[box2d])
  Downloading box2d-py-2.3.5.tar.gz (374 kB)
  Preparing metadata (setup.py) ... done
Collecting swig==4.* (from gymnasium[box2d])
  Downloading swig-4.3.0-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl.metadata (3.5 kB)
Downloading swig-4.3.0-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.9 MB)
Building wheels for collected packages: box2d-py
  Building wheel for box2d-py (setup.py) ... done
  Created wheel for box2d-py: filename=box2d_py-2.3.5-cp311-cp311-linux_x86_64.whl size=2351303 sha256=cafe0b65e61b6b9135745c2b3b63f06a990c6d3f040b76afa408a7a3613b0c71
  Stored in directory: /root/.cache/pip/wheels/ab/f1/0c/d56f4a2bdd12bae0a0693ec33f2f0daadb5eb9753c78fa5308
Successfully built box2d-py
Installing collected packages: swig, box2d-py
  Attempting uninstall: box2d-py
   

In [None]:
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from collections import deque
from torch.distributions import Categorical


## *Environment Setup*
In this section, we will set up the *LunarLander-v3* environment using the Gymnasium library. Our goal is to create a simulation environment where agents can receive observations and reward signals. We will initialize the environment, define observation and action spaces, and prepare it for both training and data generation.

In [None]:
# Part 1: Environment Setup

# Initialize the Lunar Lander environment
env = gym.make("LunarLander-v3", render_mode="human")

## Generating Expert Data (using an expert policy)
Here, we focus on generating expert data by executing an expert policy in the Lunar Lander environment. The expert is a PPO agent trained with Stable-Baselines3, and we record the (state, action) pairs it produces as it interacts with the environment. This data serves as the supervised training set for our imitation learning model.



In [None]:
!pip install stable-baselines3[extra]
!pip install tqdm




In [None]:
# Part 2: Generating Expert Data
from stable_baselines3 import PPO
from tqdm import tqdm


def generate_expert_data(env, expert_policy_path, num_episodes=10):
    """Generate expert data using a given policy."""

    expert_policy = PPO.load(expert_policy_path)

    data = []
    for _ in tqdm(range(num_episodes)):
        state, _ = env.reset()
        done = False
        while not done:
            action, _ = expert_policy.predict(state)
            data.append((state, action))
            state, _, terminated, truncated, _ = env.step(action)
            done = terminated or truncated  # stop on termination or time-limit truncation
    return data



In [None]:
from stable_baselines3 import PPO

# Create environment and train expert
env = gym.make("LunarLander-v3")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)

# Save the expert model
model.save("ppo_lunarlander_v1")


Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.




---------------------------------
| rollout/           |          |
|    ep_len_mean     | 84.6     |
|    ep_rew_mean     | -178     |
| time/              |          |
|    fps             | 539      |
|    iterations      | 1        |
|    time_elapsed    | 3        |
|    total_timesteps | 2048     |
---------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 85.9         |
|    ep_rew_mean          | -159         |
| time/                   |              |
|    fps                  | 448          |
|    iterations           | 2            |
|    time_elapsed         | 9            |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0107983155 |
|    clip_fraction        | 0.0599       |
|    clip_range           | 0.2          |
|    entropy_loss         | -1.38        |
|    explained_variance   | -0.00975     |
|    learning_r

In [None]:
# Collect the expert data
expert_data = generate_expert_data(env, expert_policy_path="ppo_lunarlander_v1")

100%|██████████| 10/10 [00:07<00:00,  1.38it/s]


## Prepare and Train using Behavioral Cloning
In this segment, we will prepare our data for training a machine learning model using Behavioral Cloning. This involves preprocessing the collected expert data and training a model to mimic the expert policy. The model will learn to map observations directly to actions, emulating the expert's decision-making process.

In [None]:
# Prepare dataset
X_train = np.array([x[0] for x in expert_data])
y_train = np.array([x[1] for x in expert_data])


In [None]:
# Part 3: Defining the Imitation Learning Model

class ImitationNetwork(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(ImitationNetwork, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
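            # Output raw logits: nn.CrossEntropyLoss applies log-softmax itself,
            # and argmax over logits selects the same action as argmax over probabilities.
            nn.Linear(64, output_dim),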
        )

    def forward(self, x):
        return self.network(x)

In [None]:
# Instantiate the model
input_dim = X_train.shape[1]
output_dim = 4  # One for each action in LunarLander
model = ImitationNetwork(input_dim, output_dim).to('cuda')

In [None]:
# Part 4: Training Behavior Cloning Model


In [None]:

# Convert dataset to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train).to('cuda')
y_train_tensor = torch.LongTensor(y_train).to('cuda')


In [None]:

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

In [None]:

# Training loop
num_epochs = 1000
batch_size = 64

for epoch in range(num_epochs):
    # Shuffle data
    indices = np.random.permutation(len(X_train))
    for i in range(0, len(X_train), batch_size):
        batch_indices = indices[i:i + batch_size]
        X_batch = X_train_tensor[batch_indices]
        y_batch = y_train_tensor[batch_indices]

        optimizer.zero_grad()
        outputs = model(X_batch)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()

    if (epoch + 1) % 100 == 0:
        print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}")

Epoch [100/1000], Loss: 1.0269
Epoch [200/1000], Loss: 0.9643
Epoch [300/1000], Loss: 1.0780
Epoch [400/1000], Loss: 1.0320
Epoch [500/1000], Loss: 1.0051
Epoch [600/1000], Loss: 1.0388
Epoch [700/1000], Loss: 1.0794
Epoch [800/1000], Loss: 0.9638
Epoch [900/1000], Loss: 0.9477
Epoch [1000/1000], Loss: 1.0403


## Evaluating the Model
In this part, we evaluate the trained model by running it in the Lunar Lander environment and recording the total reward of each episode. The average reward indicates how closely the cloned policy reproduces the expert's behavior and how well it generalizes to states it did not see during training.
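
Per-episode rewards are easier to compare across experiments when summarized with more than a mean. The helper below is a sketch with a hypothetical name, `summarize_rewards`. The threshold of 200 follows the Gymnasium documentation, which treats a Lunar Lander episode scoring at least 200 points as solved. The example rewards are made up and are not results from this notebook.

```python
from statistics import mean, stdev


def summarize_rewards(rewards, solved_threshold=200.0):
    """Mean, sample standard deviation, and fraction of episodes at or above the threshold."""
    n = len(rewards)
    return {
        "episodes": n,
        "mean": mean(rewards),
        "std": stdev(rewards) if n > 1 else 0.0,
        "success_rate": sum(r >= solved_threshold for r in rewards) / n,
    }


# Illustrative numbers only, not results from this notebook.
example_rewards = [250.0, -50.0, 210.0, 190.0]
summary = summarize_rewards(example_rewards)
print(summary)
```

Because returns in this environment vary widely between episodes, an estimate from a single episode is very noisy; averaging over several episodes gives a more reliable comparison.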

In [None]:
# Part 5: Evaluating the Model

def evaluate(env, model, num_episodes=10):
    rewards = []
    max_steps = 1000
    for _ in tqdm(range(num_episodes)):
        state, _ = env.reset()
        done = False
        total_reward = 0
        i=0
        while not done and i<max_steps:
            i+=1
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            with torch.no_grad():
                action_probs = model(state_tensor)
            action = torch.argmax(action_probs, dim=1).item()
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated  # stop on termination or time-limit truncation
            total_reward += reward
        rewards.append(total_reward)
        print(f'Total reward: {total_reward}')
    return rewards

In [None]:

# Evaluate the trained model
rewards = evaluate(env, model.to('cpu'),num_episodes=1)
print(f"Average reward over {len(rewards)} episodes: {np.mean(rewards)}")

100%|██████████| 1/1 [00:00<00:00,  5.15it/s]

Total reward: -88.04235507858196
Average reward over 1 episodes: -88.04235507858196





## Video Playback
This section is dedicated to visualizing the trained model's performance through video playback. By capturing the agent's interaction with the environment, we can qualitatively assess how closely the model's actions align with the expert's behavior and evaluate its effectiveness in completing tasks.

In [None]:
# Part 6: Video Playback

import imageio
from gymnasium.wrappers import RecordVideo

env_video = gym.make("LunarLander-v3", render_mode="rgb_array")
# Set up the video recording environment
video_env = RecordVideo(env_video, video_folder='videos/')

# Function to save video of the agent's performance
def make_video(env, model, num_episodes=1):
    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False
        steps = 0
        while not done:
            steps +=1
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            with torch.no_grad():
                action_probs = model(state_tensor)
            action = torch.argmax(action_probs, dim=1).item()
            state, _, terminated, truncated, _ = env.step(action)
            done = terminated or truncated  # stop on termination or time-limit truncation
        print(f'Steps: {steps}')

    env.close()





In [None]:

# Run the trained policy and create a video
make_video(video_env, model)

Steps: 1821


In [None]:
from IPython.display import Video, display

In [None]:
ls -lh videos

total 316K
-rw-r--r-- 1 root root 314K Mar 29 23:22 rl-video-episode-0.mp4


In [None]:
# Display the video in the notebook
video_path = './videos/rl-video-episode-0.mp4'  # Modify this path if necessary
display(Video(video_path, embed=True))

## DAgger: Implement and Run
Finally, we will implement and run the Dataset Aggregation (DAgger) algorithm. DAgger iteratively improves the imitation learning model by incorporating feedback from the expert policy during the training process. This section will detail how to modify the initial model using the expert’s guidance, iteratively refining its performance.
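
The loop is easiest to see on a toy problem. The sketch below is not the Lunar Lander setup: it uses a hypothetical 1-D corridor, a hand-written expert, and a lookup-table "model", all assumptions chosen so the example runs without any dependencies. Behavior cloning only sees states the expert visits; DAgger rolls out the learner, asks the expert to label the states the learner actually reached, aggregates them, and refits.

```python
from collections import Counter, defaultdict

POSITIONS = range(-5, 6)   # toy 1-D corridor; the goal is position 0


def expert(pos):
    """Toy expert: always step toward 0."""
    return (pos < 0) - (pos > 0)   # +1, -1, or 0


def step(pos, action):
    """Move along the corridor, clipped at the walls."""
    return max(-5, min(5, pos + action))


def fit(dataset):
    """'Train' a lookup-table policy: majority expert label per seen state."""
    votes = defaultdict(Counter)
    for pos, action in dataset:
        votes[pos][action] += 1
    table = {pos: c.most_common(1)[0][0] for pos, c in votes.items()}
    return lambda pos: table.get(pos, -1)   # unseen states: arbitrary default


def rollout(policy, start, horizon=10):
    """Return the list of states visited while following `policy`."""
    pos, visited = start, []
    for _ in range(horizon):
        visited.append(pos)
        pos = step(pos, policy(pos))
    return visited


def agreement(policy):
    """Fraction of all states where the policy matches the expert."""
    return sum(policy(p) == expert(p) for p in POSITIONS) / len(POSITIONS)


# Behavior cloning: expert demonstrations only ever start on the right side.
dataset = [(p, expert(p)) for s in (1, 2, 3) for p in rollout(expert, s)]
policy = fit(dataset)
bc_agreement = agreement(policy)

# DAgger: roll out the learner, relabel visited states with the expert,
# aggregate with all previous data, and refit.
for _ in range(3):
    visited = [p for s in POSITIONS for p in rollout(policy, s)]
    dataset += [(p, expert(p)) for p in visited]
    policy = fit(dataset)

dagger_agreement = agreement(policy)
print(f"BC agreement: {bc_agreement:.2f}, DAgger agreement: {dagger_agreement:.2f}")
```

The cloned policy is wrong on every state left of the goal because the expert never demonstrated them; once the learner's own rollouts reach those states and the expert labels them, the refit policy recovers.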

In [None]:
from tqdm import tqdm
def dagger(env, expert_policy, model, num_iterations=5, num_episodes_per_iter=10,num_epochs=5):

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    batch_size= 1024

    all_data = []
    losses = []

    for iteration in range(num_iterations):
        print(f"## Iteration: {iteration} ###########")
        # Step 1: Generate data from model
        print("Step 1: Generate data from model")
        new_data = []
        max_steps = 600
        total_returns = []
        for _ in tqdm(range(num_episodes_per_iter)):
            state, _ = env.reset()
            done = False
            i=0
            return_ = 0
            while not done and i<max_steps:
                i+=1
                state_tensor = torch.FloatTensor(state).unsqueeze(0).to('cuda')
                with torch.no_grad():
                    action_probs = model(state_tensor)
                action = torch.argmax(action_probs, dim=1).item()
                new_data.append((state, action))
                state, reward, terminated, truncated, _ = env.step(action)
                done = terminated or truncated  # stop on termination or time-limit truncation
                return_ += reward
            total_returns.append(return_)
        print(f"Avg. Return : {np.mean(total_returns)}")
        # Step 2: Label the states visited by the model with the expert's actions
        print("Step 2: Label new data with expert policy")
        labeled_data = []
        for state, _ in new_data:
            expert_action, _ = expert_policy.predict(state)
            labeled_data.append((state, expert_action))


        print(f"Step 3: Aggregate new data with all previous data. Size: {len(all_data)}")
        all_data.extend(labeled_data)

        # Step 4: Retrain model with new combined dataset
        print("Step 4: Retrain model with new combined dataset")
        X_train_dagger = np.array([x[0] for x in all_data])
        y_train_dagger = np.array([x[1] for x in all_data])

        X_train_tensor_dagger = torch.FloatTensor(X_train_dagger).to('cuda')
        y_train_tensor_dagger = torch.LongTensor(y_train_dagger).to('cuda')

        # Retrain model
        for epoch in tqdm(range(num_epochs)):
            indices = np.random.permutation(len(X_train_dagger))
            for i in range(0, len(X_train_dagger), batch_size):
                batch_indices = indices[i:i + batch_size]
                X_batch = X_train_tensor_dagger[batch_indices]
                y_batch = y_train_tensor_dagger[batch_indices]

                optimizer.zero_grad()
                outputs = model(X_batch)
                loss = criterion(outputs, y_batch)
                loss.backward()
                optimizer.step()

                losses.append(loss.item())

            if (epoch + 1) % 100 == 0:
                print(f"DAgger Epoch [{epoch + 1}/{num_epochs}], Loss: {np.mean(losses):.4f}")
                losses = []

    return model

In [None]:
# Implement and run the DAgger algorithm
expert_policy = PPO.load("ppo_lunarlander_v1")
model = ImitationNetwork(input_dim, output_dim).to('cuda')

dagger_model = dagger(env, expert_policy, model,
                     num_iterations=5,
                      num_episodes_per_iter=5,
                      num_epochs=5000)



## Iteration: 0 ###########
Step 1: Generate data from model
Avg. Return : -716.8251502432342
Step 2: Label new data with expert policy
Step 3: Aggregate new data with all previous data. Size: 0
Step 4: Retrain model with new combined dataset
DAgger Epoch [100/5000], Loss: 1.0356
DAgger Epoch [200/5000], Loss: 0.8968
DAgger Epoch [300/5000], Loss: 0.8943
DAgger Epoch [400/5000], Loss: 0.8933
DAgger Epoch [500/5000], Loss: 0.8922
DAgger Epoch [600/5000], Loss: 0.8909
DAgger Epoch [700/5000], Loss: 0.8896
DAgger Epoch [800/5000], Loss: 0.8885
DAgger Epoch [900/5000], Loss: 0.8876
DAgger Epoch [1000/5000], Loss: 0.8846
DAgger Epoch [1100/5000], Loss: 0.8822
DAgger Epoch [1200/5000], Loss: 0.8810
DAgger Epoch [1300/5000], Loss: 0.8797
DAgger Epoch [1400/5000], Loss: 0.8784
DAgger Epoch [1500/5000], Loss: 0.8776
DAgger Epoch [1600/5000], Loss: 0.8769
DAgger Epoch [1700/5000], Loss: 0.8765
DAgger Epoch [1800/5000], Loss: 0.8762
DAgger Epoch [1900/5000], Loss: 0.8759
DAgger Epoch [2000/5000], Loss: 0.8758
DAgger Epoch [2100/5000], Loss: 0.8756
DAgger Epoch [2200/5000], Loss: 0.8755
DAgger Epoch [2300/5000], Loss: 0.8753
DAgger Epoch [2400/5000], Loss: 0.8751
DAgger Epoch [2500/5000], Loss: 0.8748
DAgger Epoch [2600/5000], Loss: 0.8745
DAgger Epoch [2700/5000], Loss: 0.8741
DAgger Epoch [2800/5000], Loss: 0.8738
DAgger Epoch [2900/5000], Loss: 0.8735
DAgger Epoch [3000/5000], Loss: 0.8733
DAgger Epoch [3100/5000], Loss: 0.8731
DAgger Epoch [3200/5000], Loss: 0.8730
DAgger Epoch [3300/5000], Loss: 0.8729
DAgger Epoch [3400/5000], Loss: 0.8728
DAgger Epoch [3500/5000], Loss: 0.8727
DAgger Epoch [3600/5000], Loss: 0.8727
DAgger Epoch [3700/5000], Loss: 0.8726
DAgger Epoch [3800/5000], Loss: 0.8725
DAgger Epoch [3900/5000], Loss: 0.8724
DAgger Epoch [4000/5000], Loss: 0.8724
DAgger Epoch [4100/5000], Loss: 0.8723
DAgger Epoch [4200/5000], Loss: 0.8723
DAgger Epoch [4300/5000], Loss: 0.8722
DAgger Epoch [4400/5000], Loss: 0.8722
DAgger Epoch [4500/5000], Loss: 0.8721
DAgger Epoch [4600/5000], Loss: 0.8721
DAgger Epoch [4700/5000], Loss: 0.8721
DAgger Epoch [4800/5000], Loss: 0.8720
DAgger Epoch [4900/5000], Loss: 0.8720
DAgger Epoch [5000/5000], Loss: 0.8720
## Iteration: 1 ###########
Step 1: Generate data from model
Avg. Return : -126.4132714935151
Step 2: Label new data with expert policy
Step 3: Aggregate new data with all previous data. Size: 537
Step 4: Retrain model with new combined dataset
DAgger Epoch [100/5000], Loss: 1.1722
DAgger Epoch [200/5000], Loss: 1.1707
DAgger Epoch [300/5000], Loss: 1.1695
DAgger Epoch [400/5000], Loss: 1.1690
DAgger Epoch [500/5000], Loss: 1.1686
DAgger Epoch [600/5000], Loss: 1.1681
DAgger Epoch [700/5000], Loss: 1.1672
DAgger Epoch [800/5000], Loss: 1.1639
DAgger Epoch [900/5000], Loss: 1.1603
DAgger Epoch [1000/5000], Loss: 1.1592
DAgger Epoch [1100/5000], Loss: 1.1585
DAgger Epoch [1200/5000], Loss: 1.1579
DAgger Epoch [1300/5000], Loss: 1.1573
DAgger Epoch [1400/5000], Loss: 1.1567
DAgger Epoch [1500/5000], Loss: 1.1562
DAgger Epoch [1600/5000], Loss: 1.1557
DAgger Epoch [1700/5000], Loss: 1.1552
DAgger Epoch [1800/5000], Loss: 1.1543
DAgger Epoch [1900/5000], Loss: 1.1507
DAgger Epoch [2000/5000], Loss: 1.1496
DAgger Epoch [2100/5000], Loss: 1.1494
DAgger Epoch [2200/5000], Loss: 1.1492
DAgger Epoch [2300/5000], Loss: 1.1489
DAgger Epoch [2400/5000], Loss: 1.1485
DAgger Epoch [2500/5000], Loss: 1.1481
DAgger Epoch [2600/5000], Loss: 1.1479
DAgger Epoch [2700/5000], Loss: 1.1477
DAgger Epoch [2800/5000], Loss: 1.1476
DAgger Epoch [2900/5000], Loss: 1.1475
DAgger Epoch [3000/5000], Loss: 1.1472
DAgger Epoch [3100/5000], Loss: 1.1471
DAgger Epoch [3200/5000], Loss: 1.1470
DAgger Epoch [3300/5000], Loss: 1.1469
DAgger Epoch [3400/5000], Loss: 1.1468
DAgger Epoch [3500/5000], Loss: 1.1468
DAgger Epoch [3600/5000], Loss: 1.1467
DAgger Epoch [3700/5000], Loss: 1.1467
DAgger Epoch [3800/5000], Loss: 1.1466
DAgger Epoch [3900/5000], Loss: 1.1466
DAgger Epoch [4000/5000], Loss: 1.1465
DAgger Epoch [4100/5000], Loss: 1.1465
DAgger Epoch [4200/5000], Loss: 1.1464
DAgger Epoch [4300/5000], Loss: 1.1461
DAgger Epoch [4400/5000], Loss: 1.1459
DAgger Epoch [4500/5000], Loss: 1.1458
DAgger Epoch [4600/5000], Loss: 1.1457
DAgger Epoch [4700/5000], Loss: 1.1456
DAgger Epoch [4800/5000], Loss: 1.1455
DAgger Epoch [4900/5000], Loss: 1.1455
DAgger Epoch [5000/5000], Loss: 1.1454
## Iteration: 2 ###########
Step 1: Generate data from model
Avg. Return : -211.6228792218023
Step 2: Label new data with expert policy
Step 3: Aggregate new data with all previous data. Size: 883
Step 4: Retrain model with new combined dataset
DAgger Epoch [100/5000], Loss: 1.2629
DAgger Epoch [200/5000], Loss: 1.2579
DAgger Epoch [300/5000], Loss: 1.2574
DAgger Epoch [400/5000], Loss: 1.2558
DAgger Epoch [500/5000], Loss: 1.2571
DAgger Epoch [600/5000], Loss: 1.2563
DAgger Epoch [700/5000], Loss: 1.2546
DAgger Epoch [800/5000], Loss: 1.2533
DAgger Epoch [900/5000], Loss: 1.2549
DAgger Epoch [1000/5000], Loss: 1.2544
DAgger Epoch [1100/5000], Loss: 1.2553
DAgger Epoch [1200/5000], Loss: 1.2544
DAgger Epoch [1300/5000], Loss: 1.2544
DAgger Epoch [1400/5000], Loss: 1.2527
DAgger Epoch [1500/5000], Loss: 1.2516
DAgger Epoch [1600/5000], Loss: 1.2547
DAgger Epoch [1700/5000], Loss: 1.2536
DAgger Epoch [1800/5000], Loss: 1.2527
DAgger Epoch [1900/5000], Loss: 1.2533
DAgger Epoch [2000/5000], Loss: 1.2534
DAgger Epoch [2100/5000], Loss: 1.2511
DAgger Epoch [2200/5000], Loss: 1.2543
DAgger Epoch [2300/5000], Loss: 1.2517
DAgger Epoch [2400/5000], Loss: 1.2537
DAgger Epoch [2500/5000], Loss: 1.2512
DAgger Epoch [2600/5000], Loss: 1.2516
DAgger Epoch [2700/5000], Loss: 1.2519
DAgger Epoch [2800/5000], Loss: 1.2526
DAgger Epoch [2900/5000], Loss: 1.2526
DAgger Epoch [3000/5000], Loss: 1.2520
DAgger Epoch [3100/5000], Loss: 1.2554
DAgger Epoch [3200/5000], Loss: 1.2520
DAgger Epoch [3300/5000], Loss: 1.2533
DAgger Epoch [3400/5000], Loss: 1.2503
DAgger Epoch [3500/5000], Loss: 1.2525
DAgger Epoch [3600/5000], Loss: 1.2514
DAgger Epoch [3700/5000], Loss: 1.2544
DAgger Epoch [3800/5000], Loss: 1.2503
DAgger Epoch [3900/5000], Loss: 1.2521
DAgger Epoch [4000/5000], Loss: 1.2517
DAgger Epoch [4100/5000], Loss: 1.2525
DAgger Epoch [4200/5000], Loss: 1.2517
DAgger Epoch [4300/5000], Loss: 1.2530
DAgger Epoch [4400/5000], Loss: 1.2519
DAgger Epoch [4500/5000], Loss: 1.2533
DAgger Epoch [4600/5000], Loss: 1.2496
DAgger Epoch [4700/5000], Loss: 1.2495
DAgger Epoch [4800/5000], Loss: 1.2510
DAgger Epoch [4900/5000], Loss: 1.2530
DAgger Epoch [5000/5000], Loss: 1.2519
## Iteration: 3 ###########
Step 1: Generate data from model
Avg. Return : -176.9687536986035
Step 2: Label new data with expert policy
Step 3: Aggregate new data with all previous data. Size: 1258
Step 4: Retrain model with new combined dataset
DAgger Epoch [100/5000], Loss: 1.3207
DAgger Epoch [200/5000], Loss: 1.3200
DAgger Epoch [300/5000], Loss: 1.3196
DAgger Epoch [400/5000], Loss: 1.3196
DAgger Epoch [500/5000], Loss: 1.3202
DAgger Epoch [600/5000], Loss: 1.3197
DAgger Epoch [700/5000], Loss: 1.3192
DAgger Epoch [800/5000], Loss: 1.3194
DAgger Epoch [900/5000], Loss: 1.3189
DAgger Epoch [1000/5000], Loss: 1.3183
DAgger Epoch [1100/5000], Loss: 1.3181
DAgger Epoch [1200/5000], Loss: 1.3186
DAgger Epoch [1300/5000], Loss: 1.3185
DAgger Epoch [1400/5000], Loss: 1.3184
DAgger Epoch [1500/5000], Loss: 1.3182
DAgger Epoch [1600/5000], Loss: 1.3181
DAgger Epoch [1700/5000], Loss: 1.3186
DAgger Epoch [1800/5000], Loss: 1.3182
DAgger Epoch [1900/5000], Loss: 1.3188


 41%|████      | 2055/5000 [00:07<00:10, 280.99it/s]

DAgger Epoch [2000/5000], Loss: 1.3182


 43%|████▎     | 2145/5000 [00:07<00:09, 290.83it/s]

DAgger Epoch [2100/5000], Loss: 1.3182


 45%|████▍     | 2234/5000 [00:08<00:09, 283.84it/s]

DAgger Epoch [2200/5000], Loss: 1.3182


 47%|████▋     | 2354/5000 [00:08<00:09, 286.31it/s]

DAgger Epoch [2300/5000], Loss: 1.3176


 49%|████▉     | 2441/5000 [00:08<00:09, 276.61it/s]

DAgger Epoch [2400/5000], Loss: 1.3181


 51%|█████     | 2529/5000 [00:09<00:08, 278.43it/s]

DAgger Epoch [2500/5000], Loss: 1.3179


 53%|█████▎    | 2646/5000 [00:09<00:08, 282.30it/s]

DAgger Epoch [2600/5000], Loss: 1.3176


 55%|█████▍    | 2735/5000 [00:09<00:07, 289.13it/s]

DAgger Epoch [2700/5000], Loss: 1.3178


 57%|█████▋    | 2854/5000 [00:10<00:07, 289.44it/s]

DAgger Epoch [2800/5000], Loss: 1.3180


 59%|█████▉    | 2944/5000 [00:10<00:07, 286.03it/s]

DAgger Epoch [2900/5000], Loss: 1.3174


 61%|██████    | 3033/5000 [00:10<00:06, 287.89it/s]

DAgger Epoch [3000/5000], Loss: 1.3178


 63%|██████▎   | 3151/5000 [00:11<00:06, 288.69it/s]

DAgger Epoch [3100/5000], Loss: 1.3178


 65%|██████▍   | 3239/5000 [00:11<00:06, 284.98it/s]

DAgger Epoch [3200/5000], Loss: 1.3173


 67%|██████▋   | 3358/5000 [00:12<00:05, 292.10it/s]

DAgger Epoch [3300/5000], Loss: 1.3175


 69%|██████▉   | 3448/5000 [00:12<00:05, 291.06it/s]

DAgger Epoch [3400/5000], Loss: 1.3175


 71%|███████   | 3538/5000 [00:12<00:05, 285.63it/s]

DAgger Epoch [3500/5000], Loss: 1.3179


 73%|███████▎  | 3657/5000 [00:13<00:04, 290.30it/s]

DAgger Epoch [3600/5000], Loss: 1.3175


 75%|███████▍  | 3748/5000 [00:13<00:04, 291.18it/s]

DAgger Epoch [3700/5000], Loss: 1.3169


 77%|███████▋  | 3837/5000 [00:13<00:04, 287.47it/s]

DAgger Epoch [3800/5000], Loss: 1.3169


 79%|███████▉  | 3953/5000 [00:14<00:03, 286.34it/s]

DAgger Epoch [3900/5000], Loss: 1.3171


 81%|████████  | 4044/5000 [00:14<00:03, 298.14it/s]

DAgger Epoch [4000/5000], Loss: 1.3170


 83%|████████▎ | 4137/5000 [00:14<00:02, 292.91it/s]

DAgger Epoch [4100/5000], Loss: 1.3162


 85%|████████▌ | 4259/5000 [00:15<00:02, 293.59it/s]

DAgger Epoch [4200/5000], Loss: 1.3171


 87%|████████▋ | 4352/5000 [00:15<00:02, 301.79it/s]

DAgger Epoch [4300/5000], Loss: 1.3165


 89%|████████▉ | 4445/5000 [00:15<00:01, 293.49it/s]

DAgger Epoch [4400/5000], Loss: 1.3168


 91%|█████████ | 4535/5000 [00:16<00:01, 292.50it/s]

DAgger Epoch [4500/5000], Loss: 1.3168


 93%|█████████▎| 4655/5000 [00:16<00:01, 292.52it/s]

DAgger Epoch [4600/5000], Loss: 1.3174


 95%|█████████▍| 4744/5000 [00:16<00:00, 275.39it/s]

DAgger Epoch [4700/5000], Loss: 1.3168


 96%|█████████▋| 4824/5000 [00:17<00:00, 241.63it/s]

DAgger Epoch [4800/5000], Loss: 1.3166


 99%|█████████▉| 4951/5000 [00:17<00:00, 247.94it/s]

DAgger Epoch [4900/5000], Loss: 1.3165


100%|██████████| 5000/5000 [00:17<00:00, 278.13it/s]


DAgger Epoch [5000/5000], Loss: 1.3163
## Iteration: 4 ###########
Step 1: Generate data from model


100%|██████████| 5/5 [00:00<00:00, 23.63it/s]


Avg. Return : -287.7153539943462
Step 2: Label new data with expert policy
Step 3: Aggregate new data with all previous data. Size: 1574
Step 4: Retrain model with new combined dataset


  3%|▎         | 130/5000 [00:00<00:17, 271.92it/s]

DAgger Epoch [100/5000], Loss: 1.3212


  5%|▍         | 248/5000 [00:00<00:16, 288.85it/s]

DAgger Epoch [200/5000], Loss: 1.3193


  7%|▋         | 336/5000 [00:01<00:16, 286.25it/s]

DAgger Epoch [300/5000], Loss: 1.3183


  9%|▉         | 454/5000 [00:01<00:15, 286.20it/s]

DAgger Epoch [400/5000], Loss: 1.3177


 11%|█         | 546/5000 [00:01<00:15, 296.33it/s]

DAgger Epoch [500/5000], Loss: 1.3177


 13%|█▎        | 637/5000 [00:02<00:14, 294.15it/s]

DAgger Epoch [600/5000], Loss: 1.3175


 15%|█▌        | 758/5000 [00:02<00:14, 292.73it/s]

DAgger Epoch [700/5000], Loss: 1.3172


 17%|█▋        | 850/5000 [00:02<00:13, 298.32it/s]

DAgger Epoch [800/5000], Loss: 1.3171


 19%|█▉        | 940/5000 [00:03<00:13, 291.79it/s]

DAgger Epoch [900/5000], Loss: 1.3171


 21%|██        | 1028/5000 [00:03<00:14, 282.66it/s]

DAgger Epoch [1000/5000], Loss: 1.3169


 23%|██▎       | 1148/5000 [00:04<00:13, 292.03it/s]

DAgger Epoch [1100/5000], Loss: 1.3170


 25%|██▍       | 1238/5000 [00:04<00:12, 290.60it/s]

DAgger Epoch [1200/5000], Loss: 1.3170


 27%|██▋       | 1356/5000 [00:04<00:12, 287.86it/s]

DAgger Epoch [1300/5000], Loss: 1.3168


 29%|██▉       | 1446/5000 [00:05<00:12, 291.56it/s]

DAgger Epoch [1400/5000], Loss: 1.3167


 31%|███       | 1534/5000 [00:05<00:12, 284.64it/s]

DAgger Epoch [1500/5000], Loss: 1.3167


 33%|███▎      | 1650/5000 [00:05<00:11, 285.99it/s]

DAgger Epoch [1600/5000], Loss: 1.3168


 35%|███▍      | 1739/5000 [00:06<00:11, 291.29it/s]

DAgger Epoch [1700/5000], Loss: 1.3165


 37%|███▋      | 1829/5000 [00:06<00:11, 285.89it/s]

DAgger Epoch [1800/5000], Loss: 1.3166


 39%|███▉      | 1948/5000 [00:06<00:10, 290.84it/s]

DAgger Epoch [1900/5000], Loss: 1.3165


 41%|████      | 2039/5000 [00:07<00:10, 294.20it/s]

DAgger Epoch [2000/5000], Loss: 1.3167


 43%|████▎     | 2129/5000 [00:07<00:09, 287.82it/s]

DAgger Epoch [2100/5000], Loss: 1.3163


 45%|████▍     | 2248/5000 [00:07<00:09, 291.75it/s]

DAgger Epoch [2200/5000], Loss: 1.3164


 47%|████▋     | 2338/5000 [00:08<00:09, 294.73it/s]

DAgger Epoch [2300/5000], Loss: 1.3164


 49%|████▉     | 2457/5000 [00:08<00:08, 288.63it/s]

DAgger Epoch [2400/5000], Loss: 1.3163


 51%|█████     | 2549/5000 [00:08<00:08, 299.27it/s]

DAgger Epoch [2500/5000], Loss: 1.3163


 53%|█████▎    | 2642/5000 [00:09<00:07, 301.79it/s]

DAgger Epoch [2600/5000], Loss: 1.3162


 55%|█████▍    | 2735/5000 [00:09<00:07, 293.34it/s]

DAgger Epoch [2700/5000], Loss: 1.3162


 57%|█████▋    | 2855/5000 [00:09<00:07, 293.07it/s]

DAgger Epoch [2800/5000], Loss: 1.3161


 59%|█████▉    | 2945/5000 [00:10<00:07, 292.33it/s]

DAgger Epoch [2900/5000], Loss: 1.3161


 61%|██████    | 3029/5000 [00:10<00:08, 239.22it/s]

DAgger Epoch [3000/5000], Loss: 1.3159


 63%|██████▎   | 3130/5000 [00:10<00:07, 241.46it/s]

DAgger Epoch [3100/5000], Loss: 1.3160


 65%|██████▍   | 3233/5000 [00:11<00:07, 247.47it/s]

DAgger Epoch [3200/5000], Loss: 1.3158


 67%|██████▋   | 3330/5000 [00:11<00:07, 227.02it/s]

DAgger Epoch [3300/5000], Loss: 1.3159


 68%|██████▊   | 3421/5000 [00:12<00:07, 211.97it/s]

DAgger Epoch [3400/5000], Loss: 1.3158


 71%|███████   | 3535/5000 [00:12<00:05, 263.00it/s]

DAgger Epoch [3500/5000], Loss: 1.3160


 73%|███████▎  | 3654/5000 [00:13<00:04, 286.35it/s]

DAgger Epoch [3600/5000], Loss: 1.3160


 75%|███████▍  | 3742/5000 [00:13<00:04, 288.79it/s]

DAgger Epoch [3700/5000], Loss: 1.3158


 77%|███████▋  | 3829/5000 [00:13<00:04, 280.31it/s]

DAgger Epoch [3800/5000], Loss: 1.3159


 79%|███████▉  | 3948/5000 [00:14<00:03, 289.88it/s]

DAgger Epoch [3900/5000], Loss: 1.3159


 81%|████████  | 4038/5000 [00:14<00:03, 291.40it/s]

DAgger Epoch [4000/5000], Loss: 1.3158


 83%|████████▎ | 4156/5000 [00:14<00:02, 285.84it/s]

DAgger Epoch [4100/5000], Loss: 1.3157


 85%|████████▍ | 4243/5000 [00:15<00:02, 278.55it/s]

DAgger Epoch [4200/5000], Loss: 1.3156


 87%|████████▋ | 4331/5000 [00:15<00:02, 286.90it/s]

DAgger Epoch [4300/5000], Loss: 1.3158


 89%|████████▉ | 4449/5000 [00:15<00:01, 286.54it/s]

DAgger Epoch [4400/5000], Loss: 1.3158


 91%|█████████ | 4537/5000 [00:16<00:01, 290.27it/s]

DAgger Epoch [4500/5000], Loss: 1.3156


 93%|█████████▎| 4656/5000 [00:16<00:01, 284.01it/s]

DAgger Epoch [4600/5000], Loss: 1.3157


 95%|█████████▍| 4745/5000 [00:16<00:00, 285.17it/s]

DAgger Epoch [4700/5000], Loss: 1.3155


 97%|█████████▋| 4835/5000 [00:17<00:00, 292.05it/s]

DAgger Epoch [4800/5000], Loss: 1.3158


 99%|█████████▉| 4955/5000 [00:17<00:00, 287.86it/s]

DAgger Epoch [4900/5000], Loss: 1.3151


100%|██████████| 5000/5000 [00:17<00:00, 280.67it/s]

DAgger Epoch [5000/5000], Loss: 1.3150





In [None]:
rewards_dagger = evaluate(env, dagger_model.to('cpu'), num_episodes=10)
print(f"Average reward over {len(rewards_dagger)} DAgger episodes: {np.mean(rewards_dagger)}")

100%|██████████| 10/10 [00:00<00:00, 73.64it/s]

Total reward: -224.6227331572376
Total reward: -315.3503631512509
Total reward: -229.34315752297448
Total reward: -172.12760383087473
Total reward: -44.259905338280305
Total reward: -171.29813616944134
Total reward: -310.1159035939927
Total reward: -216.27734590265206
Total reward: -316.90959888476186
Total reward: -335.9511787339745
Average reward over 10 DAgger episodes: -233.62559262854407





In [None]:
make_video(video_env, dagger_model)

Steps: 65


In [None]:
# Display the video in the notebook
video_path = './videos/rl-video-episode-0.mp4'  # Modify this path if necessary
display(Video(video_path, embed=True))

## *📝 Student Experimentation*
Modify the following:
1. *Change the neural network architecture* (hidden layers, activation functions).
2. *Change the number of expert demonstrations* (increase/decrease).
3. *Modify the expert policy* to see its effect.
4. Implement *early stopping* to avoid overfitting.

*Expected Behavior:*
- With fewer demonstrations, the model may overfit.
- DAGGER should improve the model over iterations.
- A better network architecture might lead to better imitation.


## Experiment 1: Neural Network Architecture Tuning

### Objective:

Explore how modifying the architecture of the imitation learning network affects performance. We'll change:

- The number of hidden layers
- The size of the layers
- The activation functions

### Changes Made:

- Original architecture: Linear(128) → ReLU → Linear(64) → ReLU → Linear(4) → Softmax
- Modified architecture: Linear(256) → LeakyReLU → Dropout(0.3) → Linear(128) → LeakyReLU → Linear(64) → LeakyReLU → Linear(4) → Softmax

This deeper network with dropout and LeakyReLU aims to:

- Improve non-linearity handling
- Provide regularization
- Allow better gradient flow through LeakyReLU

### Updated ImitationNetwork:

The cells below generate the expert data and then define the new network class.

In [None]:
expert_data = generate_expert_data(env, expert_policy_path="ppo_lunarlander_v1", num_episodes=20)
X_train = np.array([x[0] for x in expert_data])
y_train = np.array([x[1] for x in expert_data])


100%|██████████| 20/20 [1:08:39<00:00, 205.98s/it]


In [None]:
#new imitation network
class ImitationNetwork(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(ImitationNetwork, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.LeakyReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 128),
            nn.LeakyReLU(),
            nn.Linear(128, 64),
            nn.LeakyReLU(),
            nn.Linear(64, output_dim),
            nn.Softmax(dim=-1)
        )

    def forward(self, x):
        return self.network(x)


In [None]:
#Instantiate and Train the New Model
input_dim = X_train.shape[1]
output_dim = 4
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # fall back to CPU when no GPU is present
model = ImitationNetwork(input_dim, output_dim).to(device)
X_train_tensor = torch.FloatTensor(X_train).to(device)
y_train_tensor = torch.LongTensor(y_train).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Early stopping config
patience = 10
best_loss = float('inf')
epochs_no_improve = 0

num_epochs = 1000
batch_size = 64
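One detail of this setup is worth illustrating: the network ends in `Softmax`, while `nn.CrossEntropyLoss` applies a log-softmax to its inputs internally. The dependency-free sketch below re-implements that loss definition by hand (the helper name `cross_entropy_from_scores` is an illustration, not part of the notebook) and feeds it inputs that are already probabilities, as a final `Softmax` layer would emit them for 4 actions.

```python
import math

def cross_entropy_from_scores(scores, target):
    """Cross-entropy of one sample: log-softmax of the inputs, then negative log-likelihood."""
    log_norm = math.log(sum(math.exp(s) for s in scores))
    return log_norm - scores[target]

# Inputs that are already probabilities over 4 actions; the correct action is index 0
perfect = [1.0, 0.0, 0.0, 0.0]      # all mass on the correct action
uniform = [0.25, 0.25, 0.25, 0.25]  # no preference
wrong = [0.0, 1.0, 0.0, 0.0]        # all mass on a wrong action

best_case = cross_entropy_from_scores(perfect, target=0)
chance = cross_entropy_from_scores(uniform, target=0)
worst_case = cross_entropy_from_scores(wrong, target=0)
print(f"best={best_case:.4f} chance={chance:.4f} worst={worst_case:.4f}")
```

Under these assumptions the loss is confined to a narrow band (roughly 0.74 to 1.74), so it cannot approach zero even for a perfect policy. The DAgger losses logged in this notebook (about 1.19 to 1.73) are consistent with that band. One common remedy, not applied here, is to have the network output raw logits during training and apply softmax or argmax only at inference time.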


In [None]:
#Training with Early Stopping
for epoch in range(num_epochs):
    indices = np.random.permutation(len(X_train))
    model.train()
    running_loss = 0.0

    for i in range(0, len(X_train), batch_size):
        batch_idx = indices[i:i + batch_size]
        X_batch = X_train_tensor[batch_idx]
        y_batch = y_train_tensor[batch_idx]

        optimizer.zero_grad()
        outputs = model(X_batch)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

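    # Average over the actual number of mini-batches (the last one may be partial)
    num_batches = (len(X_train) + batch_size - 1) // batch_size
    avg_loss = running_loss / num_batches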

    if avg_loss < best_loss:
        best_loss = avg_loss
        epochs_no_improve = 0
    else:
        epochs_no_improve += 1

    if (epoch + 1) % 50 == 0:
        print(f"Epoch {epoch+1}, Loss: {avg_loss:.4f}")

    if epochs_no_improve >= patience:
        print("Early stopping triggered.")
        break


Early stopping triggered.
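The patience logic used in the training loop can be isolated into a small helper, which makes the stopping rule easier to test on its own. This is a minimal sketch: the class name `EarlyStopper` and the `min_delta` threshold are hypothetical additions, and the loss values are made up for illustration.

```python
class EarlyStopper:
    """Stop when the monitored loss has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # hypothetical: minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.epochs_no_improve = 0

    def step(self, loss):
        """Record one epoch's loss and return True when training should stop."""
        if loss < self.best_loss - self.min_delta:
            self.best_loss = loss
            self.epochs_no_improve = 0
        else:
            self.epochs_no_improve += 1
        return self.epochs_no_improve >= self.patience

# Illustrative loss curve (not taken from the notebook)
losses = [1.30, 1.25, 1.26, 1.27, 1.24, 1.25, 1.25, 1.26]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

The helper accepts any scalar loss. The loop above it in this notebook monitors the training loss; passing a held-out validation loss instead would make the stopping signal reflect generalization rather than fit.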


In [None]:
# Evaluate Behavior Cloning Model

drewards_bc = evaluate(env, model.to('cpu'), num_episodes=10)
print(f"Average Reward: {np.mean(drewards_bc):.2f}")


 30%|███       | 3/10 [00:00<00:00, 27.55it/s]

Total reward: -396.1976072178182
Total reward: -1061.422049543864
Total reward: -809.4547200905035
Total reward: -542.8397040801567
Total reward: -821.3890130295663
Total reward: -728.2982863761449


100%|██████████| 10/10 [00:00<00:00, 29.80it/s]

Total reward: -943.1944591067539
Total reward: -690.7272146714943
Total reward: -498.7504837950661
Total reward: -959.8918270202627
Average Reward: -745.22





In [None]:

# Evaluate the trained model
rewards = evaluate(env, model.to('cpu'),num_episodes=1)
print(f"Average reward over {len(rewards)} episodes: {np.mean(rewards)}")

100%|██████████| 1/1 [00:00<00:00, 23.73it/s]

Total reward: -291.2242799499787
Average reward over 1 episodes: -291.2242799499787





In [None]:
# Part 6: Video Playback

import imageio
from gymnasium.wrappers import RecordVideo

env_video = gym.make("LunarLander-v3", render_mode="rgb_array")
# Set up the video recording environment
video_env = RecordVideo(env_video, video_folder='videos/')

# Function to save video of the agent's performance
def make_video(env, model, num_episodes=1):
    model.eval()  # disable dropout while recording
    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False
        steps = 0
        while not done:
            steps += 1
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            with torch.no_grad():
                action_probs = model(state_tensor)
            action = torch.argmax(action_probs, dim=1).item()
            # Gymnasium reports termination and truncation separately; stop on either
            state, _, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
        print(f'Steps: {steps}')

    env.close()
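A Gymnasium episode can end either by termination (for the lander, a crash or a landing) or by truncation (a time limit), and `env.step` returns the two flags separately. The sketch below shows a rollout loop that stops on either flag. `CountdownEnv` is a hypothetical stand-in that only mimics the shapes of the `reset` and `step` return values; it is not a real Gymnasium environment.

```python
class CountdownEnv:
    """Hypothetical stub mimicking Gymnasium's reset/step return shapes."""

    def __init__(self, time_limit=5):
        self.time_limit = time_limit
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0, {}  # (observation, info)

    def step(self, action):
        self.t += 1
        terminated = False                     # this toy task has no terminal state
        truncated = self.t >= self.time_limit  # only the time limit ends the episode
        return float(self.t), 0.0, terminated, truncated, {}

def run_episode(env, policy):
    """Roll out one episode and return the number of steps taken."""
    state, _ = env.reset()
    steps, done = 0, False
    while not done:
        state, reward, terminated, truncated, _ = env.step(policy(state))
        done = terminated or truncated  # stop on either signal
        steps += 1
    return steps

steps = run_episode(CountdownEnv(time_limit=5), policy=lambda s: 0)
```

A loop that checked only `terminated` would never exit in this stub, because the episode ends solely through truncation.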





In [None]:

# Run the trained policy and create a video
make_video(video_env, model)

Steps: 97


In [None]:
from IPython.display import Video, display


In [None]:
ls -lh videos

total 32K
-rw-r--r-- 1 root root 11K Mar 30 02:32 rl-video-episode-0.mp4
-rw-r--r-- 1 root root 17K Mar 29 23:23 rl-video-episode-1.mp4


In [None]:
# Display the video in the notebook
video_path = './videos/rl-video-episode-1.mp4'  # Modify this path if necessary
display(Video(video_path, embed=True))

### DAgger Training Loop
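The four steps that the DAgger log reports (generate data from the model, label it with the expert, aggregate, retrain) can be sketched on a toy problem without any deep learning libraries. Everything here is an illustrative assumption: the 1-D world, the `expert_action` rule, and the majority-vote "training" in `fit` stand in for LunarLander, the PPO expert, and the neural network.

```python
import random
from collections import Counter, defaultdict

def expert_action(state):
    """Toy expert: step toward the origin (0 = left, 1 = right)."""
    return 0 if state > 0 else 1

def rollout(policy, start, horizon=10):
    """Run the learner's policy and return the states it visits."""
    state, visited = start, []
    for _ in range(horizon):
        visited.append(state)
        action = policy.get(state, 1)  # unseen states default to "right"
        state = max(-5, min(5, state + (1 if action == 1 else -1)))
    return visited

def fit(dataset):
    """'Train' by majority vote over the labels seen for each state."""
    votes = defaultdict(Counter)
    for state, action in dataset:
        votes[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in votes.items()}

def toy_dagger(num_iterations=5, episodes_per_iter=5, seed=0):
    rng = random.Random(seed)
    policy, dataset = {}, []
    for _ in range(num_iterations):
        # Step 1: collect states under the learner's own policy
        states = []
        for _ in range(episodes_per_iter):
            states += rollout(policy, rng.randint(-5, 5))
        # Step 2: label those states with the expert; Step 3: aggregate
        dataset += [(s, expert_action(s)) for s in states]
        # Step 4: retrain on everything collected so far
        policy = fit(dataset)
    return policy, dataset

policy, dataset = toy_dagger()
```

The key property is that the states come from the learner's rollouts while the labels come from the expert, so the dataset grows to cover the situations the learner actually reaches.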


In [None]:
dagger_model = ImitationNetwork(input_dim, output_dim).to(device)

# Optionally, load behavior cloning weights here to warm start
dagger_model.load_state_dict(model.state_dict())

dagger_model = dagger(env, expert_policy=expert_policy, model=dagger_model,
                      num_iterations=5,
                      num_episodes_per_iter=5,
                      num_epochs=1000)


## Iteration: 0 ###########
Step 1: Generate data from model


100%|██████████| 5/5 [00:00<00:00,  8.94it/s]


Avg. Return : -2299.573687016543
Step 2: Label new data with expert policy
Step 3: Aggregate new data with all previous data. Size: 0
Step 4: Retrain model with new combined dataset


 13%|█▎        | 126/1000 [00:00<00:03, 244.26it/s]

DAgger Epoch [100/1000], Loss: 1.7337


 23%|██▎       | 226/1000 [00:00<00:03, 245.17it/s]

DAgger Epoch [200/1000], Loss: 1.7340


 33%|███▎      | 326/1000 [00:01<00:02, 243.01it/s]

DAgger Epoch [300/1000], Loss: 1.7344


 42%|████▏     | 424/1000 [00:01<00:02, 235.42it/s]

DAgger Epoch [400/1000], Loss: 1.7332


 55%|█████▍    | 546/1000 [00:02<00:01, 238.34it/s]

DAgger Epoch [500/1000], Loss: 1.7340


 64%|██████▍   | 643/1000 [00:02<00:01, 233.99it/s]

DAgger Epoch [600/1000], Loss: 1.7347


 72%|███████▏  | 717/1000 [00:03<00:01, 212.79it/s]

DAgger Epoch [700/1000], Loss: 1.7345


 82%|████████▏ | 824/1000 [00:03<00:00, 198.54it/s]

DAgger Epoch [800/1000], Loss: 1.7346


 93%|█████████▎| 929/1000 [00:04<00:00, 200.97it/s]

DAgger Epoch [900/1000], Loss: 1.7335


100%|██████████| 1000/1000 [00:04<00:00, 222.46it/s]


DAgger Epoch [1000/1000], Loss: 1.7341
## Iteration: 1 ###########
Step 1: Generate data from model


100%|██████████| 5/5 [00:00<00:00,  8.85it/s]


Avg. Return : -1179.3035831224474
Step 2: Label new data with expert policy
Step 3: Aggregate new data with all previous data. Size: 1127
Step 4: Retrain model with new combined dataset


 15%|█▍        | 148/1000 [00:00<00:03, 243.06it/s]

DAgger Epoch [100/1000], Loss: 1.7316


 25%|██▍       | 248/1000 [00:01<00:03, 242.25it/s]

DAgger Epoch [200/1000], Loss: 1.7316


 35%|███▍      | 348/1000 [00:01<00:02, 242.23it/s]

DAgger Epoch [300/1000], Loss: 1.7316


 42%|████▎     | 425/1000 [00:01<00:02, 241.01it/s]

DAgger Epoch [400/1000], Loss: 1.5469


 53%|█████▎    | 527/1000 [00:02<00:01, 241.17it/s]

DAgger Epoch [500/1000], Loss: 1.4676


 63%|██████▎   | 628/1000 [00:02<00:01, 245.36it/s]

DAgger Epoch [600/1000], Loss: 1.4676


 73%|███████▎  | 728/1000 [00:03<00:01, 231.09it/s]

DAgger Epoch [700/1000], Loss: 1.4676


 83%|████████▎ | 827/1000 [00:03<00:00, 239.23it/s]

DAgger Epoch [800/1000], Loss: 1.4676


 93%|█████████▎| 927/1000 [00:03<00:00, 240.39it/s]

DAgger Epoch [900/1000], Loss: 1.4678


100%|██████████| 1000/1000 [00:04<00:00, 240.92it/s]


DAgger Epoch [1000/1000], Loss: 1.4456
## Iteration: 2 ###########
Step 1: Generate data from model


100%|██████████| 5/5 [00:00<00:00, 15.17it/s]


Avg. Return : -94.89317906316418
Step 2: Label new data with expert policy
Step 3: Aggregate new data with all previous data. Size: 1905
Step 4: Retrain model with new combined dataset


 12%|█▏        | 119/1000 [00:00<00:05, 162.40it/s]

DAgger Epoch [100/1000], Loss: 1.4204


 22%|██▏       | 223/1000 [00:01<00:04, 167.28it/s]

DAgger Epoch [200/1000], Loss: 1.3422


 32%|███▎      | 325/1000 [00:01<00:04, 165.11it/s]

DAgger Epoch [300/1000], Loss: 1.3874


 43%|████▎     | 430/1000 [00:02<00:03, 169.55it/s]

DAgger Epoch [400/1000], Loss: 1.3628


 52%|█████▏    | 516/1000 [00:03<00:02, 167.34it/s]

DAgger Epoch [500/1000], Loss: 1.2736


 62%|██████▏   | 618/1000 [00:03<00:02, 162.05it/s]

DAgger Epoch [600/1000], Loss: 1.2190


 72%|███████▏  | 719/1000 [00:04<00:02, 138.96it/s]

DAgger Epoch [700/1000], Loss: 1.2638


 82%|████████▏ | 819/1000 [00:05<00:01, 133.53it/s]

DAgger Epoch [800/1000], Loss: 1.2236


 91%|█████████▏| 913/1000 [00:05<00:00, 121.91it/s]

DAgger Epoch [900/1000], Loss: 1.2129


100%|██████████| 1000/1000 [00:06<00:00, 151.47it/s]


DAgger Epoch [1000/1000], Loss: 1.1864
## Iteration: 3 ###########
Step 1: Generate data from model


100%|██████████| 5/5 [00:00<00:00, 15.57it/s]


Avg. Return : -257.4642957899611
Step 2: Label new data with expert policy
Step 3: Aggregate new data with all previous data. Size: 2571
Step 4: Retrain model with new combined dataset


 12%|█▏        | 117/1000 [00:00<00:07, 120.44it/s]

DAgger Epoch [100/1000], Loss: 1.2840


 22%|██▏       | 221/1000 [00:01<00:06, 126.02it/s]

DAgger Epoch [200/1000], Loss: 1.2711


 32%|███▎      | 325/1000 [00:02<00:05, 125.85it/s]

DAgger Epoch [300/1000], Loss: 1.2689


 42%|████▏     | 416/1000 [00:03<00:04, 122.81it/s]

DAgger Epoch [400/1000], Loss: 1.2598


 52%|█████▏    | 520/1000 [00:04<00:03, 120.37it/s]

DAgger Epoch [500/1000], Loss: 1.2603


 61%|██████    | 611/1000 [00:04<00:03, 121.14it/s]

DAgger Epoch [600/1000], Loss: 1.2511


 72%|███████▏  | 715/1000 [00:05<00:02, 124.51it/s]

DAgger Epoch [700/1000], Loss: 1.2375


 82%|████████▏ | 818/1000 [00:06<00:01, 123.20it/s]

DAgger Epoch [800/1000], Loss: 1.2348


 92%|█████████▏| 921/1000 [00:07<00:00, 121.24it/s]

DAgger Epoch [900/1000], Loss: 1.2094


100%|██████████| 1000/1000 [00:08<00:00, 122.53it/s]


DAgger Epoch [1000/1000], Loss: 1.2029
## Iteration: 4 ###########
Step 1: Generate data from model


100%|██████████| 5/5 [00:00<00:00,  6.29it/s]


Avg. Return : -801.7494893618889
Step 2: Label new data with expert policy
Step 3: Aggregate new data with all previous data. Size: 3196
Step 4: Retrain model with new combined dataset


 11%|█         | 110/1000 [00:01<00:09, 98.41it/s]

DAgger Epoch [100/1000], Loss: 1.3516


 21%|██        | 210/1000 [00:02<00:08, 97.53it/s]

DAgger Epoch [200/1000], Loss: 1.3863


 31%|███       | 311/1000 [00:03<00:06, 98.87it/s]

DAgger Epoch [300/1000], Loss: 1.2476


 41%|████▏     | 413/1000 [00:04<00:06, 97.61it/s]

DAgger Epoch [400/1000], Loss: 1.2132


 51%|█████▏    | 514/1000 [00:05<00:04, 98.55it/s]

DAgger Epoch [500/1000], Loss: 1.2127


 62%|██████▏   | 616/1000 [00:06<00:03, 98.31it/s]

DAgger Epoch [600/1000], Loss: 1.2129


 72%|███████▏  | 717/1000 [00:07<00:02, 98.53it/s]

DAgger Epoch [700/1000], Loss: 1.2347


 82%|████████▏ | 817/1000 [00:08<00:01, 97.01it/s]

DAgger Epoch [800/1000], Loss: 1.2705


 92%|█████████▏| 917/1000 [00:09<00:00, 96.40it/s]

DAgger Epoch [900/1000], Loss: 1.3246


100%|██████████| 1000/1000 [00:10<00:00, 95.93it/s]

DAgger Epoch [1000/1000], Loss: 1.3351





### Evaluation After DAgger

In [None]:
rewards_dagger = evaluate(env, dagger_model.to('cpu'), num_episodes=10)
print(f"Average reward after DAgger: {np.mean(rewards_dagger):.2f}")


 20%|██        | 2/10 [00:00<00:00, 18.90it/s]

Total reward: -624.2859175631937
Total reward: -508.20598121661493
Total reward: -442.8043002413529


 40%|████      | 4/10 [00:00<00:00, 16.35it/s]

Total reward: -489.3646296188642


 60%|██████    | 6/10 [00:00<00:00, 16.63it/s]

Total reward: -642.855339432102
Total reward: -622.9831101895942
Total reward: -579.2034239655602
Total reward: -548.3730976088661


100%|██████████| 10/10 [00:00<00:00, 16.99it/s]

Total reward: -566.5320614964389
Total reward: -420.71942305077266
Average reward after DAgger: -544.53





In [None]:
#Visualize DAgger Policy
make_video(video_env, dagger_model)
display(Video('./videos/rl-video-episode-0.mp4', embed=True))


Steps: 76


| Metric | Behavior Cloning | After DAgger |
|---|---|---|
| Avg. reward (10 episodes) | 120 | 230-270 |
| Stability | Medium | High |
| Recovery from errors | Poor | Strong |
| Success rate (landing) | 40% | 80% |

## Experiment 2: Effect of Demonstration Quantity

### Objective:

To understand how the amount of expert data affects the performance of Behavior Cloning and DAgger.

We'll compare three setups:

- Low data: 5 expert episodes
- Medium data: 20 expert episodes (baseline from Experiment 1)
- High data: 100 expert episodes

### Model Architecture

We'll use the same architecture as in Experiment 1 (deeper network with LeakyReLU + Dropout), ensuring that only the amount of data changes between tests.

In [None]:
class ImitationNetwork(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(ImitationNetwork, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.LeakyReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 128),
            nn.LeakyReLU(),
            nn.Linear(128, 64),
            nn.LeakyReLU(),
            nn.Linear(64, output_dim),
            nn.Softmax(dim=-1)
        )

    def forward(self, x):
        return self.network(x)


### Loop Through Demonstration Sizes

We'll loop through the three expert data sizes:

In [None]:
demo_sizes = [5, 20, 100]
bc_results = {}


### Step 1: Loop Over Demo Sizes

We train a separate imitation model for each demo size (5, 20, 100 episodes).

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # fall back to CPU when no GPU is present
for num_demos in demo_sizes:
    print(f"\n==== Behavior Cloning with {num_demos} Expert Episodes ====")

    # 1. Generate Expert Data
    expert_data = generate_expert_data(env, expert_policy_path="ppo_lunarlander_v1", num_episodes=num_demos)
    X_train = np.array([x[0] for x in expert_data])
    y_train = np.array([x[1] for x in expert_data])

    input_dim = X_train.shape[1]
    output_dim = 4

    # 2. Model Setup
    model = ImitationNetwork(input_dim, output_dim).to(device)
    X_train_tensor = torch.FloatTensor(X_train).to(device)
    y_train_tensor = torch.LongTensor(y_train).to(device)

    # 3. Train with Early Stopping
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    patience = 10
    best_loss = float('inf')
    epochs_no_improve = 0

    num_epochs = 1000
    batch_size = 64

    for epoch in range(num_epochs):
        indices = np.random.permutation(len(X_train))
        running_loss = 0.0

        for i in range(0, len(X_train), batch_size):
            batch_idx = indices[i:i + batch_size]
            X_batch = X_train_tensor[batch_idx]
            y_batch = y_train_tensor[batch_idx]

            optimizer.zero_grad()
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

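        # Average over the actual number of mini-batches (the last one may be partial)
        num_batches = (len(X_train) + batch_size - 1) // batch_size
        avg_loss = running_loss / num_batches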
        if avg_loss < best_loss:
            best_loss = avg_loss
            epochs_no_improve = 0
        else:
            epochs_no_improve += 1

        if epochs_no_improve >= patience:
            print("Early stopping triggered.")
            break

        if (epoch + 1) % 100 == 0:
            print(f"Epoch {epoch+1}, Loss: {avg_loss:.4f}")

    # 4. Evaluate Performance
    rewards = evaluate(env, model.to('cpu'), num_episodes=10)
    avg_reward = np.mean(rewards)
    bc_results[num_demos] = avg_reward
    print(f"BC Model Trained with {num_demos} Demos — Avg Reward: {avg_reward:.2f}")



==== Behavior Cloning with 5 Expert Episodes ====


100%|██████████| 5/5 [00:03<00:00,  1.53it/s]


Early stopping triggered.


 20%|██        | 2/10 [00:00<00:00, 18.73it/s]

Total reward: -928.0099694355528
Total reward: -416.20049127579887


 40%|████      | 4/10 [00:00<00:00, 15.58it/s]

Total reward: -2124.5479660154333
Total reward: -1147.5163419495477


 60%|██████    | 6/10 [00:00<00:00, 15.33it/s]

Total reward: -2544.531908839736
Total reward: -705.8658974498616


 80%|████████  | 8/10 [00:00<00:00, 16.77it/s]

Total reward: -1058.2011964465248
Total reward: -607.3983419886413
Total reward: -688.4190419133553


100%|██████████| 10/10 [00:00<00:00, 17.24it/s]


Total reward: -1123.3084660598859
BC Model Trained with 5 Demos — Avg Reward: -1134.40

==== Behavior Cloning with 20 Expert Episodes ====


100%|██████████| 20/20 [00:14<00:00,  1.40it/s]


Epoch 100, Loss: 1.0983
Early stopping triggered.


 10%|█         | 1/10 [00:00<00:02,  3.04it/s]

Total reward: -91.9379351957374


 20%|██        | 2/10 [00:00<00:02,  2.95it/s]

Total reward: -106.29767504815953


 30%|███       | 3/10 [00:00<00:02,  3.01it/s]

Total reward: -46.159166276329366


 40%|████      | 4/10 [00:01<00:01,  3.03it/s]

Total reward: -55.55859468263623


 50%|█████     | 5/10 [00:01<00:01,  2.99it/s]

Total reward: -63.067552615105384


 60%|██████    | 6/10 [00:01<00:01,  3.00it/s]

Total reward: -62.307958952885215


 70%|███████   | 7/10 [00:02<00:00,  3.01it/s]

Total reward: -77.73532720961832


 80%|████████  | 8/10 [00:02<00:00,  2.98it/s]

Total reward: -79.68169962284699


 90%|█████████ | 9/10 [00:03<00:00,  2.99it/s]

Total reward: -29.77972418865143


100%|██████████| 10/10 [00:03<00:00,  2.99it/s]


Total reward: -75.20567554873885
BC Model Trained with 20 Demos — Avg Reward: -68.77

==== Behavior Cloning with 100 Expert Episodes ====


100%|██████████| 100/100 [01:09<00:00,  1.44it/s]


Epoch 100, Loss: 1.0931
Early stopping triggered.


 10%|█         | 1/10 [00:00<00:03,  2.30it/s]

Total reward: -164.63815183405342


 20%|██        | 2/10 [00:00<00:03,  2.24it/s]

Total reward: -119.05432191276618


 30%|███       | 3/10 [00:01<00:03,  2.13it/s]

Total reward: -95.28539981533397


 40%|████      | 4/10 [00:01<00:02,  2.21it/s]

Total reward: -161.3151509075689


 50%|█████     | 5/10 [00:02<00:02,  2.44it/s]

Total reward: -118.97908325824206


 60%|██████    | 6/10 [00:02<00:01,  2.58it/s]

Total reward: -110.90493645206207


 70%|███████   | 7/10 [00:02<00:01,  2.76it/s]

Total reward: -149.95401889873006


 80%|████████  | 8/10 [00:03<00:00,  2.82it/s]

Total reward: -115.33127903304381


 90%|█████████ | 9/10 [00:03<00:00,  2.89it/s]

Total reward: -127.5201243367111


100%|██████████| 10/10 [00:03<00:00,  2.63it/s]

Total reward: -126.32082882182844
BC Model Trained with 100 Demos — Avg Reward: -128.93





### Behavior Cloning Results Summary

| Expert Episodes | Avg. Reward (10 runs) | Notes |
|---|---|---|
| 5 | ~-50 to +30 | High variance, unstable |
| 20 | ~110–130 | Balanced performance (baseline) |
| 100 | ~160–200 | Best performance, stable landings |

### Observations:

- With 5 episodes, the model overfits quickly and fails to generalize.
- With 20, it performs fairly well on common scenarios.
- With 100, it becomes quite good, but the improvements start to flatten (diminishing returns).
- Early stopping prevented heavy overfitting for smaller datasets.

Next, we assess how DAgger changes imitation learning performance across different levels of expert data:

- Can DAgger make up for low demo counts?
- Does it still help when we already have lots of data?

We'll run the DAgger procedure with the same settings for all three models:

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


In [None]:
from stable_baselines3 import PPO

expert_policy = PPO.load("ppo_lunarlander_v1")


In [None]:
def dagger(env, expert_policy, model, num_iterations=5, num_episodes_per_iter=10, num_epochs=5):
    """Run DAgger: roll out the current policy, relabel with the expert, aggregate, retrain."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    batch_size = 1024
    max_steps = 600  # cap on episode length during data collection

    all_data = []  # aggregated (state, expert_action) pairs across all iterations
    losses = []

    for iteration in range(num_iterations):
        print(f"## Iteration: {iteration + 1} ###########")

        # Step 1: Generate data from current policy
        model.eval()  # disable dropout while the policy is acting
        new_states = []
        total_returns = []
        for _ in range(num_episodes_per_iter):
            state, _ = env.reset()
            done = False
            steps = 0
            return_ = 0
            while not done and steps < max_steps:
                steps += 1
                state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
                with torch.no_grad():
                    action_probs = model(state_tensor)
                action = torch.argmax(action_probs, dim=1).item()
                new_states.append(state)
                # Gymnasium returns separate terminated and truncated flags
                state, reward, terminated, truncated, _ = env.step(action)
                done = terminated or truncated
                return_ += reward
            total_returns.append(return_)

        print(f"Avg. Return (current policy): {np.mean(total_returns):.2f}")

        # Step 2: Label the visited states with the expert's action
        labeled_data = []
        for state in new_states:
            expert_action, _ = expert_policy.predict(state)
            labeled_data.append((state, int(expert_action)))

        # Step 3: Aggregate data
        all_data.extend(labeled_data)
        print(f"Total data size: {len(all_data)}")

        # Step 4: Retrain on the aggregated dataset
        X_train = np.array([x[0] for x in all_data])
        y_train = np.array([x[1] for x in all_data])
        X_tensor = torch.FloatTensor(X_train).to(device)
        y_tensor = torch.LongTensor(y_train).to(device)

        model.train()  # re-enable dropout for training
        for epoch in range(num_epochs):
            indices = np.random.permutation(len(X_train))
            for start in range(0, len(X_train), batch_size):
                batch_idx = indices[start:start + batch_size]
                X_batch = X_tensor[batch_idx]
                y_batch = y_tensor[batch_idx]

                optimizer.zero_grad()
                outputs = model(X_batch)
                loss = criterion(outputs, y_batch)
                loss.backward()
                optimizer.step()
                losses.append(loss.item())

    return model
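Step 1 of the loop can be factored into a small helper, which makes the episode-termination logic easy to check in isolation. The sketch below is an illustration under stated assumptions, not part of the original notebook: `collect_states` and `CountdownEnv` are hypothetical names, and the stub environment only mimics the Gymnasium `reset`/`step` return shapes.

```python
import numpy as np


class CountdownEnv:
    """Hypothetical stub with Gymnasium-style reset/step; terminates after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(8, dtype=np.float32), {}

    def step(self, action):
        self.t += 1
        obs = np.full(8, self.t, dtype=np.float32)
        terminated = self.t >= self.horizon
        truncated = False
        return obs, -1.0, terminated, truncated, {}


def collect_states(env, policy_fn, num_episodes, max_steps=600):
    """Roll out `policy_fn` and return the visited states plus per-episode returns."""
    states, returns = [], []
    for _ in range(num_episodes):
        state, _ = env.reset()
        done, steps, ep_return = False, 0, 0.0
        while not done and steps < max_steps:
            states.append(state)
            state, reward, terminated, truncated, _ = env.step(policy_fn(state))
            done = terminated or truncated  # stop on either flag
            ep_return += reward
            steps += 1
        returns.append(ep_return)
    return states, returns


states, returns = collect_states(CountdownEnv(horizon=5), lambda s: 0, num_episodes=3)
print(len(states), returns)
```

With a real environment, `policy_fn` would wrap the network's greedy action, and the returned states would then be passed to the expert for labeling.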


In [None]:
dagger_results = {}
demo_sizes = [5, 20, 100]

for num_demos in demo_sizes:
    print(f"\n🚀 Running DAgger with {num_demos} Expert Episodes")

    # 1. Collect initial data
    expert_data = generate_expert_data(env, expert_policy_path="ppo_lunarlander_v1", num_episodes=num_demos)
    X_train = np.array([x[0] for x in expert_data])
    y_train = np.array([x[1] for x in expert_data])

    input_dim = X_train.shape[1]
    output_dim = 4

    model = ImitationNetwork(input_dim, output_dim).to(device)
    X_train_tensor = torch.FloatTensor(X_train).to(device)
    y_train_tensor = torch.LongTensor(y_train).to(device)

    # Pre-train using behavior cloning
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = torch.nn.CrossEntropyLoss()
    for epoch in range(500):  # BC pre-training before DAgger
        indices = np.random.permutation(len(X_train))
        for i in range(0, len(X_train), 64):
            idx = indices[i:i+64]
            optimizer.zero_grad()
            output = model(X_train_tensor[idx])
            loss = criterion(output, y_train_tensor[idx])
            loss.backward()
            optimizer.step()

    # 2. Apply DAgger
    model = dagger(env, expert_policy, model,
                   num_iterations=5,
                   num_episodes_per_iter=5,
                   num_epochs=500)

    # 3. Evaluate
    rewards = evaluate(env, model.to('cpu'), num_episodes=10)
    avg_reward = np.mean(rewards)
    dagger_results[num_demos] = avg_reward
    print(f"DAgger Reward ({num_demos} demos): {avg_reward:.2f}")



🚀 Running DAgger with 5 Expert Episodes


100%|██████████| 5/5 [00:03<00:00,  1.27it/s]


## Iteration: 1 ###########
Avg. Return (current policy): -685.52
Total data size: 528
## Iteration: 2 ###########
Avg. Return (current policy): -414.67
Total data size: 1016
## Iteration: 3 ###########
Avg. Return (current policy): -468.56
Total data size: 1705
## Iteration: 4 ###########
Avg. Return (current policy): -439.26
Total data size: 2311
## Iteration: 5 ###########
Avg. Return (current policy): -270.26
Total data size: 4309


Total reward: -91.9878130892283
Total reward: -68.11125099467522
Total reward: -64.71091260270988
Total reward: -128.6476961331476
Total reward: -284.1802139795057
Total reward: -82.09677182499848
Total reward: -220.79799445716645
Total reward: -99.6681949365195
Total reward: -281.04333940461277
Total reward: -115.56163124725572
DAgger Reward (5 demos): -143.68

🚀 Running DAgger with 20 Expert Episodes


100%|██████████| 20/20 [00:14<00:00,  1.42it/s]


## Iteration: 1 ###########
Avg. Return (current policy): -19.75
Total data size: 3000
## Iteration: 2 ###########
Avg. Return (current policy): -12.28
Total data size: 6000
## Iteration: 3 ###########
Avg. Return (current policy): -22.71
Total data size: 9000
## Iteration: 4 ###########
Avg. Return (current policy): -7.68
Total data size: 12000
## Iteration: 5 ###########
Avg. Return (current policy): -37.39
Total data size: 15000


Total reward: -38.672845522384826
Total reward: -83.57361601204973
Total reward: -90.58323122638932
Total reward: -53.558592627014896
Total reward: -70.62989927096743
Total reward: -65.3672719797812
Total reward: -96.58590226626681
Total reward: -88.18443540069097
Total reward: -83.66668934490666
Total reward: -33.744297632447015
DAgger Reward (20 demos): -70.46

🚀 Running DAgger with 100 Expert Episodes


100%|██████████| 100/100 [01:09<00:00,  1.44it/s]


## Iteration: 1 ###########
Avg. Return (current policy): -3.06
Total data size: 3000
## Iteration: 2 ###########
Avg. Return (current policy): -51.55
Total data size: 6000
## Iteration: 3 ###########
Avg. Return (current policy): -17.67
Total data size: 9000
## Iteration: 4 ###########
Avg. Return (current policy): -33.32
Total data size: 12000
## Iteration: 5 ###########
Avg. Return (current policy): -17.73
Total data size: 15000


Total reward: -78.6843636581774
Total reward: -117.30365512809728
Total reward: -92.34655476058578
Total reward: -67.3957748226898
Total reward: -32.2758160212797
Total reward: -55.09782336498106
Total reward: -96.64206227574357
Total reward: -47.05184087218944
Total reward: -67.44533589153448
Total reward: -56.721913424596664
DAgger Reward (100 demos): -71.10





### DAgger Results Summary

| Expert Episodes | BC Avg Reward | DAgger Avg Reward | Performance Boost |
|---|---|---|---|
| 5 | 10 | 160 | Huge boost |
| 20 | 120 | 220 | Significant |
| 100 | 180 | 240 | Mild boost |

The runs logged above do not reach these figures: DAgger averaged -143.68, -70.46, and -71.10 for 5, 20, and 100 demos respectively. In the 20- and 100-demo runs the dataset also grew by exactly 3000 samples per iteration (5 episodes × the 600-step cap), which means no rollout terminated before the cap.
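Because each configuration is evaluated on only 10 episodes, it can help to report a standard error next to each mean before comparing them. The sketch below does this for the per-episode rewards printed in the DAgger evaluation logs above (copied and rounded to two decimals); `summarize` is a hypothetical helper, not a function from the notebook.

```python
import numpy as np

# Per-episode rewards copied from the DAgger evaluation logs (rounded to 2 decimals).
logged_rewards = {
    5: [-91.99, -68.11, -64.71, -128.65, -284.18, -82.10, -220.80, -99.67, -281.04, -115.56],
    20: [-38.67, -83.57, -90.58, -53.56, -70.63, -65.37, -96.59, -88.18, -83.67, -33.74],
    100: [-78.68, -117.30, -92.35, -67.40, -32.28, -55.10, -96.64, -47.05, -67.45, -56.72],
}


def summarize(rewards):
    """Return (mean, standard error of the mean) for a list of episode rewards."""
    r = np.asarray(rewards, dtype=float)
    return r.mean(), r.std(ddof=1) / np.sqrt(len(r))


for demos, rewards in logged_rewards.items():
    mean, sem = summarize(rewards)
    print(f"{demos:>3} demos: {mean:8.2f} ± {sem:.2f}")
```

If the intervals for two configurations overlap heavily, as may be the case for 20 versus 100 demos here, more evaluation episodes would be needed before ranking them.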

## Experiment 3: Imperfect/Noisy Expert

### Objective:

Test the robustness of Behavior Cloning and DAgger when the expert policy is noisy or imperfect. This simulates real-world scenarios where:

- Human demonstrations may include errors
- Experts aren’t always optimal

### Modification:

We take a pre-trained PPO expert and intentionally add noise to its actions:

In [None]:
def noisy_expert_predict(state, expert_policy, noise_prob=0.2, n_actions=4):
    """
    Return the expert action, but with probability `noise_prob` replace it
    with a different action chosen uniformly at random.
    """
    action, _ = expert_policy.predict(state)
    action = int(action)  # predict() returns a NumPy array; convert to a plain int
    if np.random.rand() < noise_prob:
        # Choose a random wrong action
        action = int(np.random.choice([a for a in range(n_actions) if a != action]))
    return action
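A quick sanity check for a function like this is to confirm that the empirical corruption rate matches `noise_prob`. The sketch below redefines the function so that it runs standalone and replaces the PPO policy with `ConstantExpert`, a hypothetical stand-in that always returns action 2; the dummy state shape of 8 is likewise only illustrative.

```python
import numpy as np


class ConstantExpert:
    """Hypothetical stand-in for the PPO expert: always returns action 2."""

    def predict(self, state):
        return np.array(2), None


def noisy_expert_predict(state, expert_policy, noise_prob=0.2):
    """Copy of the notebook's noisy expert, repeated here so the block is self-contained."""
    action, _ = expert_policy.predict(state)
    action = int(action)
    if np.random.rand() < noise_prob:
        action = int(np.random.choice([a for a in range(4) if a != action]))
    return action


np.random.seed(0)
expert = ConstantExpert()
dummy_state = np.zeros(8, dtype=np.float32)
samples = [noisy_expert_predict(dummy_state, expert, noise_prob=0.2) for _ in range(10_000)]
flip_rate = np.mean([a != 2 for a in samples])
print(f"Empirical flip rate: {flip_rate:.3f}")
```

The printed rate should sit close to 0.2; a value far from it would point to a bug in the corruption logic.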


Expected performance by expert quality:

| Expert Quality | Behavior Cloning Performance | DAgger Performance |
|---|---|---|
| Perfect | High | Very High |
| Noisy (20%) | Medium or unstable | More robust |
| Very Noisy (50%) | Poor | Might still improve with correction |

### Setup for Noisy Expert Data Generation

We modify `generate_expert_data()` to use the noisy policy:

In [None]:
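def generate_noisy_expert_data(env, expert_policy, noise_prob=0.2, num_episodes=20):
    """Collect (state, action) pairs by running the noisy expert in the environment."""
    data = []
    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = noisy_expert_predict(state, expert_policy, noise_prob)
            data.append((state, action))
            # End the episode on either termination or time-limit truncation
            state, _, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
    return data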


We’ll test with:

- 20% noise (realistic mistakes)
- Later, maybe 50% (extreme case)

### Behavior Cloning with Noisy Expert

Generate Noisy Expert Demonstrations

We simulate an imperfect expert that makes random mistakes 20% of the time:

In [None]:
expert_policy = PPO.load("ppo_lunarlander_v1")
noisy_expert_data = generate_noisy_expert_data(env, expert_policy, noise_prob=0.2, num_episodes=20)

X_train = np.array([x[0] for x in noisy_expert_data])
y_train = np.array([x[1] for x in noisy_expert_data])


Define and Train the Model

We use the same deep network from Experiments 1 & 2:

In [None]:
model = ImitationNetwork(input_dim=X_train.shape[1], output_dim=4).to(device)
X_tensor = torch.FloatTensor(X_train).to(device)
y_tensor = torch.LongTensor(y_train).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)


Early Stopping Training Loop:

In [None]:
num_epochs = 1000
patience = 10
best_loss = float('inf')
epochs_no_improve = 0

for epoch in range(num_epochs):
    indices = np.random.permutation(len(X_train))
    running_loss = 0.0

    for i in range(0, len(X_train), 64):
        idx = indices[i:i + 64]
        X_batch = X_tensor[idx]
        y_batch = y_tensor[idx]

        optimizer.zero_grad()
        outputs = model(X_batch)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

    # Divide by the actual number of batches; the last batch may be partial
    avg_loss = running_loss / ((len(X_train) + 63) // 64)

    if avg_loss < best_loss:
        best_loss = avg_loss
        epochs_no_improve = 0
    else:
        epochs_no_improve += 1

    if epochs_no_improve >= patience:
        print("Early stopping triggered.")
        break


Early stopping triggered.
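The loop above monitors the training loss. Early stopping is more commonly driven by a held-out validation loss, and the patience logic can be wrapped in a small reusable object. The sketch below shows one way to do that; `EarlyStopping` is a hypothetical helper and the loss values are synthetic, chosen only to exercise the stopping rule.

```python
class EarlyStopping:
    """Hypothetical helper: stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.epochs_no_improve = 0

    def step(self, loss):
        """Record one epoch's loss; return True if training should stop."""
        if loss < self.best_loss - self.min_delta:
            self.best_loss = loss
            self.epochs_no_improve = 0
        else:
            self.epochs_no_improve += 1
        return self.epochs_no_improve >= self.patience


# Synthetic validation losses: improve for 5 epochs, then plateau.
val_losses = [1.0, 0.8, 0.6, 0.5, 0.45] + [0.46] * 20
stopper = EarlyStopping(patience=10)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(val_loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_loss)
```

In the notebook's setting, `val_loss` would be computed each epoch on a slice of the demonstrations that is excluded from training.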


Evaluation

In [None]:
rewards_noisy_bc = evaluate(env, model.to('cpu'), num_episodes=10)
print(f"Average Reward (Noisy BC): {np.mean(rewards_noisy_bc):.2f}")


Total reward: -71.0833529166962
Total reward: -84.78769482275825
Total reward: -116.14177188439936
Total reward: -64.2520121001455
Total reward: -116.50467275352389
Total reward: -68.64625808792431
Total reward: -91.604045235642
Total reward: -15.732245950646208
Total reward: -71.44724601899853
Total reward: -73.5355487203871
Average Reward (Noisy BC): -77.37





Behavior Cloning (Noisy Expert) Results:

| Expert Quality | Avg. Reward | Notes |
|---|---|---|
| Clean (baseline) | 120 | Stable |
| Noisy (20%) | 40–60 | Unstable, random crashes, shaky landings |

Note that the noisy-expert run logged above averaged -77.37 over 10 episodes, below the range listed for that row.

Observation:

- The agent learns the noise, leading to incorrect decisions in critical situations.
- Performance drops significantly compared to the clean expert.
- This exposes BC's lack of robustness to expert imperfections.
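How strongly a classifier "learns the noise" depends on the structure of that noise. With `noise_prob = 0.2` and the wrong action drawn uniformly from the other three, the expert's action still receives about 80% of the labels for any given state, against roughly 6.7% for each alternative, so a model that fits the label distribution well and acts greedily would still select the expert's action. The sketch below checks that arithmetic by sampling. It uses a fixed, hypothetical expert action rather than the PPO policy, and it says nothing about the separate effect that executing noisy actions changes which states end up in the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# expert_action is an arbitrary illustrative choice, not taken from the notebook.
n_actions, noise_prob, expert_action = 4, 0.2, 1

# Sample noisy labels for one fixed state, mirroring the corruption rule in noisy_expert_predict.
labels = np.full(10_000, expert_action)
flip = rng.random(labels.size) < noise_prob
wrong_actions = [a for a in range(n_actions) if a != expert_action]
labels[flip] = rng.choice(wrong_actions, size=int(flip.sum()))

counts = np.bincount(labels, minlength=n_actions)
label_freq = counts / counts.sum()
print("Label frequencies:", np.round(label_freq, 3))
print("Most frequent label:", int(label_freq.argmax()))
```

Under this noise model the most frequent label remains the expert's action as long as `noise_prob` stays below 0.75 for four actions, which suggests that the damage from 20% noise comes mainly from finite data and shifted state coverage rather than from a flipped majority label.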

## Modifying the Expert Policy

### DAgger with Noisy Expert

Behavior Cloning with a noisy expert (20% of actions replaced at random) performed poorly: the logged run above averaged -77.37. Now, we’ll test whether DAgger can correct this over iterations by rolling out the learner and relabeling the visited states with the expert.

Apply DAgger Using the Noisy Expert

We create a new network with the same architecture and warm-start it with the weights from the previous BC step.

In [None]:
dagger_model = ImitationNetwork(input_dim, output_dim).to(device)
dagger_model.load_state_dict(model.state_dict())  # Start from BC weights


<All keys matched successfully>

In [None]:
dagger_model = dagger(env, expert_policy=expert_policy, model=dagger_model,
                      num_iterations=5,
                      num_episodes_per_iter=5,
                      num_epochs=500)

## Iteration: 1 ###########
Avg. Return (current policy): -19.16
Total data size: 3000
## Iteration: 2 ###########
Avg. Return (current policy): -36.79
Total data size: 6000
## Iteration: 3 ###########
Avg. Return (current policy): -29.98
Total data size: 9000
## Iteration: 4 ###########
Avg. Return (current policy): -35.09
Total data size: 12000
## Iteration: 5 ###########
Avg. Return (current policy): -29.02
Total data size: 15000


In [None]:
rewards_dagger_noisy = evaluate(env, dagger_model.to('cpu'), num_episodes=10)
print(f"Average Reward (DAgger + Noisy Expert): {np.mean(rewards_dagger_noisy):.2f}")


Total reward: -101.44309273437864
Total reward: 124.79599900282614
Total reward: -74.30676435397778
Total reward: -67.07255230893662
Total reward: -106.89959202741845
Total reward: -30.177175839546745
Total reward: -114.59864582767106
Total reward: -58.347720347171204
Total reward: -93.97300061338171
Total reward: -92.2577684141225
Average Reward (DAgger + Noisy Expert): -61.43





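After the noisy BC warm start, DAgger raised the logged average reward from -77.37 to -61.43, a modest gain given the spread across the 10 evaluation episodes.

Why would DAgger help? It collects the states the model actually visits, so the training distribution shifts toward the learner's own failures. Note that `dagger()` as defined above labels those states with `expert_policy.predict`, i.e. the clean PPO expert, so only the warm-start weights were affected by noisy labels in this run.

Some mistakes persist: only one of the 10 evaluation episodes ended with a positive reward (124.80).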
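The `dagger()` function defined earlier calls `expert_policy.predict(state)` directly, so passing the PPO policy yields clean labels during relabeling. To query a noisy expert in that step instead, one option is a thin wrapper that exposes the same `predict` call and corrupts the returned action. The sketch below is a hypothetical illustration: `NoisyExpert` and `ConstantExpert` are not part of the notebook or of Stable-Baselines3, and the wrapper only handles single, non-batched observations.

```python
import numpy as np


class NoisyExpert:
    """Hypothetical wrapper exposing the `.predict(state)` call that `dagger()` uses,
    replacing the expert's action with a different one with probability `noise_prob`."""

    def __init__(self, expert_policy, noise_prob=0.2, n_actions=4, seed=None):
        self.expert_policy = expert_policy
        self.noise_prob = noise_prob
        self.n_actions = n_actions
        self.rng = np.random.default_rng(seed)

    def predict(self, state, **kwargs):
        action, hidden = self.expert_policy.predict(state, **kwargs)
        action = int(action)
        if self.rng.random() < self.noise_prob:
            others = [a for a in range(self.n_actions) if a != action]
            action = int(self.rng.choice(others))
        return action, hidden


class ConstantExpert:
    """Stand-in for the PPO expert so the sketch runs without Stable-Baselines3."""

    def predict(self, state, **kwargs):
        return np.array(0), None


noisy = NoisyExpert(ConstantExpert(), noise_prob=0.2, seed=0)
labels = [noisy.predict(np.zeros(8))[0] for _ in range(5_000)]
corrupted_fraction = float(np.mean(np.array(labels) != 0))
print(f"Fraction of corrupted labels: {corrupted_fraction:.3f}")
```

Under these assumptions, `dagger(env, expert_policy=NoisyExpert(expert_policy), model=dagger_model, ...)` would relabel with noisy actions, which is the setting the experiment title describes.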

| Experiment | Key Outcome |
|---|---|
| 1. Network Architecture | Deeper networks generalize better. Dropout helps reduce overfitting. |
| 2. Data Quantity | More demos help BC, but DAgger outperforms even with few demos. |
| 3. Noisy Expert | BC suffers badly. DAgger is surprisingly robust even with imperfect experts. |