**Implement Behaviour Cloning using RL along with Direct policy Learning and Inverse RL**

In [None]:
# For Box2D env
!apt-get install swig
!pip install gym[box2d]
!pip install stable-baselines3[extra]

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following additional packages will be installed:
  swig3.0
Suggested packages:
  swig-doc swig-examples swig3.0-examples swig3.0-doc
The following NEW packages will be installed:
  swig swig3.0
0 upgraded, 2 newly installed, 0 to remove and 45 not upgraded.
Need to get 1,100 kB of archives.
After this operation, 5,822 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 swig3.0 amd64 3.0.12-1 [1,094 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 swig amd64 3.0.12-1 [6,460 B]
Fetched 1,100 kB in 1s (748 kB/s)
Selecting previously unselected package swig3.0.
(Reading database ... 155632 files and directories currently installed.)
Preparing to unpack .../swig3.0_3.0.12-1_amd64.deb ...
Unpack

In [None]:
import gym
from tqdm import tqdm
import numpy as np

In [None]:
import torch as th
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

In [None]:
from stable_baselines3 import PPO, A2C, SAC, TD3
from stable_baselines3.common.evaluation import evaluate_policy

In [None]:
# Example for continuous actions
# env_id = "LunarLanderContinuous-v2"

# Example for discrete actions
env_id = "CartPole-v1"

In [None]:
env = gym.make(env_id)

## Train Expert Model

We create an expert RL agent and let it learn to solve the task by interacting with the environment.


In [None]:
ppo_expert = PPO('MlpPolicy', env_id, verbose=1, create_eval_env=True)
ppo_expert.learn(total_timesteps=3e4, eval_freq=10000)
ppo_expert.save("ppo_expert")

Using cpu device
Creating environment from the given name 'CartPole-v1'
Creating environment from the given name 'CartPole-v1'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 21.9     |
|    ep_rew_mean     | 21.9     |
| time/              |          |
|    fps             | 1499     |
|    iterations      | 1        |
|    time_elapsed    | 1        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 25.6        |
|    ep_rew_mean          | 25.6        |
| time/                   |             |
|    fps                  | 1165        |
|    iterations           | 2           |
|    time_elapsed         | 3           |
|    total_timesteps      | 4096        |
|

Check the performance of the trained agent.

In [None]:
mean_reward, std_reward = evaluate_policy(ppo_expert, env, n_eval_episodes=10)

print(f"Mean reward = {mean_reward} +/- {std_reward}")



Mean reward = 500.0 +/- 0.0


## Create Student

We also create a student RL agent, which will later be trained on the expert dataset.


In [None]:
a2c_student = A2C('MlpPolicy', env_id, verbose=1)

Using cpu device
Creating environment from the given name 'CartPole-v1'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.



We now let our expert interact with the environment (unless we already have expert data) and store the resulting observations and actions to build an expert dataset.


In [None]:
num_interactions = int(4e4)

In [None]:
# Observations are stored the same way for both action-space types
expert_observations = np.empty((num_interactions,) + env.observation_space.shape)

if isinstance(env.action_space, gym.spaces.Box):
    # Continuous actions: one row of action_space.shape[0] values per step
    expert_actions = np.empty((num_interactions,) + (env.action_space.shape[0],))
else:
    # Discrete actions: a single integer per step (action_space.shape is ())
    expert_actions = np.empty((num_interactions,) + env.action_space.shape)

obs = env.reset()

for i in tqdm(range(num_interactions)):
    action, _ = ppo_expert.predict(obs, deterministic=True)
    expert_observations[i] = obs
    expert_actions[i] = action
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()

np.savez_compressed(
    "expert_data",
    expert_actions=expert_actions,
    expert_observations=expert_observations,
)

100%|██████████| 40000/40000 [00:16<00:00, 2380.95it/s]
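
If an expert archive already exists from an earlier run, the rollout loop can be skipped and the arrays read back instead. The following is a minimal sketch, assuming the archive was written with the same two keys used above; `load_expert_data` and the demo file are hypothetical names introduced only for illustration.

```python
import os
import tempfile

import numpy as np


def load_expert_data(path):
    """Return (observations, actions) from an archive written by np.savez_compressed."""
    with np.load(path) as data:
        return data["expert_observations"], data["expert_actions"]


# Demo on a throwaway archive filled with placeholder zeros (not real expert data).
demo_path = os.path.join(tempfile.mkdtemp(), "expert_data_demo.npz")
np.savez_compressed(
    demo_path,
    expert_actions=np.zeros(5),
    expert_observations=np.zeros((5, 4)),
)
demo_observations, demo_actions = load_expert_data(demo_path)
print(demo_observations.shape, demo_actions.shape)
```

In the notebook itself, the call would presumably be `load_expert_data("expert_data.npz")`, since `np.savez_compressed` appends the `.npz` suffix when it is missing.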


In [None]:
from torch.utils.data.dataset import Dataset, random_split

In [None]:
class ExpertDataSet(Dataset):
    def __init__(self, expert_observations, expert_actions):
        self.observations = expert_observations
        self.actions = expert_actions

    def __getitem__(self, index):
        return (self.observations[index], self.actions[index])

    def __len__(self):
        return len(self.observations)



We now instantiate the `ExpertDataSet` and split it into training and test datasets.


In [None]:
expert_dataset = ExpertDataSet(expert_observations, expert_actions)

train_size = int(0.8 * len(expert_dataset))

test_size = len(expert_dataset) - train_size

train_expert_dataset, test_expert_dataset = random_split(
    expert_dataset, [train_size, test_size]
)

In [None]:
print("test_expert_dataset: ", len(test_expert_dataset))
print("train_expert_dataset: ", len(train_expert_dataset))

test_expert_dataset:  8000
train_expert_dataset:  32000


In [None]:
def pretrain_agent(
    student,
    batch_size=64,
    epochs=1000,
    scheduler_gamma=0.7,
    learning_rate=1.0,
    log_interval=100,
    no_cuda=True,
    seed=1,
    test_batch_size=64,
):
    use_cuda = not no_cuda and th.cuda.is_available()
    th.manual_seed(seed)
    device = th.device("cuda" if use_cuda else "cpu")
    kwargs = {"num_workers": 1, "pin_memory": True} if use_cuda else {}

    if isinstance(env.action_space, gym.spaces.Box):
      criterion = nn.MSELoss()
    else:
      criterion = nn.CrossEntropyLoss()

    # Extract initial policy
    model = student.policy.to(device)

    def train(model, device, train_loader, optimizer):
        model.train()

        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()

            if isinstance(env.action_space, gym.spaces.Box):
              # A2C/PPO policy outputs actions, values, log_prob
              # SAC/TD3 policy outputs actions only
              if isinstance(student, (A2C, PPO)):
                action, _, _ = model(data)
              else:
                # SAC/TD3:
                action = model(data)
              action_prediction = action.double()
            else:
              # Retrieve the logits for A2C/PPO when using discrete actions
              dist = model.get_distribution(data)
              action_prediction = dist.distribution.logits
              target = target.long()

            loss = criterion(action_prediction, target)
            loss.backward()
            optimizer.step()
            if batch_idx % log_interval == 0:
                print(
                    "Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}".format(
                        epoch,
                        batch_idx * len(data),
                        len(train_loader.dataset),
                        100.0 * batch_idx / len(train_loader),
                        loss.item(),
                    )
                )
    def test(model, device, test_loader):
        model.eval()
        test_loss = 0
        with th.no_grad():
            for data, target in test_loader:
                data, target = data.to(device), target.to(device)

                if isinstance(env.action_space, gym.spaces.Box):
                  # A2C/PPO policy outputs actions, values, log_prob
                  # SAC/TD3 policy outputs actions only
                  if isinstance(student, (A2C, PPO)):
                    action, _, _ = model(data)
                  else:
                    # SAC/TD3:
                    action = model(data)
                  action_prediction = action.double()
                else:
                  # Retrieve the logits for A2C/PPO when using discrete actions
                  dist = model.get_distribution(data)
                  action_prediction = dist.distribution.logits
                  target = target.long()

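                # criterion returns the batch mean, so weight it by the batch size
                # and accumulate; dividing by the dataset size then gives the average
                test_loss += criterion(action_prediction, target).item() * len(data)
        test_loss /= len(test_loader.dataset)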
        print(f"Test set: Average loss: {test_loss:.4f}")

    # Here, we use PyTorch `DataLoader` to load our previously created `ExpertDataSet`
    # splits for training and testing
    train_loader = th.utils.data.DataLoader(
        dataset=train_expert_dataset, batch_size=batch_size, shuffle=True, **kwargs
    )
    test_loader = th.utils.data.DataLoader(
        dataset=test_expert_dataset, batch_size=test_batch_size, shuffle=True, **kwargs,
    )

    # Define an Optimizer and a learning rate schedule.
    optimizer = optim.Adadelta(model.parameters(), lr=learning_rate)
    scheduler = StepLR(optimizer, step_size=1, gamma=scheduler_gamma)

    # Now we are finally ready to train the policy model.
    for epoch in range(1, epochs + 1):
        train(model, device, train_loader, optimizer)
        test(model, device, test_loader)
        scheduler.step()

    # Implant the trained policy network back into the RL student agent
    student.policy = model
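
For discrete actions, `pretrain_agent` minimises the cross-entropy between the policy's logits and the expert's action indices. The sketch below reproduces that loss in plain NumPy to show what is being optimised; the logits and actions are made-up values, and `cross_entropy_from_logits` is a hypothetical helper, not part of Stable-Baselines3 or PyTorch.

```python
import numpy as np


def cross_entropy_from_logits(logits, targets):
    """Mean cross-entropy between unnormalised logits and integer class targets."""
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability assigned to the expert's action in each row.
    return -log_probs[np.arange(len(targets)), targets].mean()


# Two CartPole-like actions (0 = left, 1 = right); values are illustrative only.
demo_logits = np.array([[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]])
demo_expert_actions = np.array([0, 1, 1])
demo_loss = cross_entropy_from_logits(demo_logits, demo_expert_actions)
print(f"{demo_loss:.4f}")
```

The loss falls as the logit of the expert's chosen action grows relative to the others, which is what pushes the student towards the expert's decisions.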

Evaluate the agent before pretraining; its behaviour should be essentially random.

In [None]:
mean_reward, std_reward = evaluate_policy(a2c_student, env, n_eval_episodes=10)

print(f"Mean reward = {mean_reward} +/- {std_reward}")



Mean reward = 98.8 +/- 11.98999582985749




Having defined the training procedure, we can now run the pretraining!


In [None]:
pretrain_agent(
    a2c_student,
    epochs=3,
    scheduler_gamma=0.7,
    learning_rate=1.0,
    log_interval=100,
    no_cuda=True,
    seed=1,
    batch_size=64,
    test_batch_size=1000,
)
a2c_student.save("a2c_student")

Test set: Average loss: 0.0000
Test set: Average loss: 0.0000
Test set: Average loss: 0.0000
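
With `step_size=1`, the `StepLR` schedule multiplies the learning rate by `scheduler_gamma` after every epoch. As a small worked example, the rates used in the three epochs of the call above can be computed directly; `step_lr_schedule` is a hypothetical helper written for this illustration, not a PyTorch function.

```python
def step_lr_schedule(initial_lr, gamma, epochs):
    """Learning rate in effect during each epoch when it decays by gamma per epoch."""
    return [initial_lr * gamma ** epoch for epoch in range(epochs)]


# Same settings as the pretraining call: learning_rate=1.0, scheduler_gamma=0.7, epochs=3.
rates = step_lr_schedule(initial_lr=1.0, gamma=0.7, epochs=3)
print([round(rate, 2) for rate in rates])  # [1.0, 0.7, 0.49]
```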




Finally, let us test how well our student RL agent learned to mimic the behavior of the expert.


In [None]:
mean_reward, std_reward = evaluate_policy(a2c_student, env, n_eval_episodes=10)

print(f"Mean reward = {mean_reward} +/- {std_reward}")



Mean reward = 500.0 +/- 0.0


## Example Scenario

Let's consider a scenario where an autonomous vehicle needs to learn how to navigate through a city environment to reach a destination while obeying traffic rules and avoiding accidents. We'll use BC, RL with DPL, and IRL to tackle different aspects of this problem.

1. Behavior Cloning (BC):

In Behavior Cloning, we use expert demonstrations to train a model to mimic the expert's behavior. In our example, the expert could be a human driver providing demonstrations of safe and efficient driving behavior.

Usage: We collect a dataset of expert demonstrations, consisting of observations (states) and corresponding actions (steering, acceleration, etc.). We then train a model, such as a neural network, using supervised learning to predict actions from states.
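
As a toy illustration of this supervised step, the sketch below fits a linear steering policy to state-action pairs by least squares. Everything in it is an assumption made for the example: the two-dimensional state, the expert's gains, and the noise level are invented, and a real driving policy would use a far richer model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical state: [lateral offset from lane centre, heading error].
states = rng.uniform(-1.0, 1.0, size=(500, 2))

# Hypothetical expert: steers against both errors, with a little noise.
expert_gains = np.array([-0.8, -1.5])
expert_steering = states @ expert_gains + rng.normal(0.0, 0.01, size=500)

# Behaviour cloning as supervised regression: find gains so that
# states @ gains approximates the expert's steering commands.
cloned_gains, *_ = np.linalg.lstsq(states, expert_steering, rcond=None)


def cloned_policy(state):
    """Steering command predicted by the cloned linear policy."""
    return float(np.asarray(state) @ cloned_gains)


print(cloned_gains)
```

The recovered gains should land close to the expert's, because the cloned policy only ever sees states drawn from the same distribution as the demonstrations.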


2. Reinforcement Learning (RL) with Direct Policy Learning (DPL):

Reinforcement Learning allows the autonomous vehicle to learn from its interaction with the environment through trial and error. Direct Policy Learning (DPL) involves learning the policy directly from the observed state-action pairs, without relying on explicit reward signals.

Usage: We define the RL environment, including the city map, traffic rules, and other vehicles. The agent (autonomous vehicle) interacts with this environment, and reward-driven techniques such as Q-learning or Deep Q-Networks can learn a policy from scratch. With DPL, the agent instead updates its policy directly from the state-action pairs observed during training, without an explicitly defined reward signal.
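
One common way to realise learning from state-action pairs without rewards is an interactive loop in the style of DAgger: the learner drives, an expert labels the states the learner actually visited, and the learner is refit on the aggregated data. Reading DPL this way is an assumption on our part, and the one-dimensional lane-keeping dynamics, the expert rule, and the linear learner below are all invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)


def expert_action(offset):
    # Hypothetical expert: steer halfway back to the lane centre each step.
    return -0.5 * offset


def rollout(gain, steps=20):
    """Drive with the learner's policy a = gain * offset; return the visited offsets."""
    offset = rng.uniform(-1.0, 1.0)
    visited = []
    for _ in range(steps):
        visited.append(offset)
        # Toy dynamics: the action shifts the offset, plus a small disturbance.
        offset = offset + gain * offset + rng.normal(0.0, 0.05)
    return np.array(visited)


gain = 0.0  # untrained learner: never steers
all_offsets = np.empty(0)
all_labels = np.empty(0)

for iteration in range(5):
    visited = rollout(gain)                                # 1. learner drives
    labels = expert_action(visited)                        # 2. expert labels those states
    all_offsets = np.concatenate([all_offsets, visited])   # 3. aggregate the data
    all_labels = np.concatenate([all_labels, labels])
    # 4. refit the learner on everything collected so far (1-D least squares).
    gain = float(all_offsets @ all_labels / (all_offsets @ all_offsets))

print(gain)
```

No reward appears anywhere in the loop; the only training signal is the expert's action for each visited state.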


3. Inverse RL (IRL):

Inverse RL helps in learning the underlying reward function from observed behavior. In our scenario, it could help in inferring the implicit reward structure of safe and efficient driving from expert demonstrations or real-world data.

Usage: We use IRL to infer the reward function that likely led to the expert's behavior. By analyzing the expert demonstrations, IRL estimates the underlying reward structure that encourages safe and efficient driving behavior. This inferred reward function can then be used to guide the RL agent's learning process, ensuring it learns to prioritize actions that lead to high rewards.
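
A heavily simplified sketch of this idea, assuming the reward is linear in a few hand-picked features: compare the discounted feature expectations of expert trajectories with those of a baseline driver, and point the reward weights from the baseline towards the expert. This corresponds only to the first step of feature-matching approaches; a full IRL method would repeatedly retrain a policy under the estimated reward and update the weights. The features and trajectories are hypothetical.

```python
import numpy as np

GAMMA = 0.9  # discount factor, chosen arbitrarily for the example


def feature_expectations(trajectories, gamma=GAMMA):
    """Average discounted sum of per-step feature vectors over trajectories."""
    totals = []
    for trajectory in trajectories:
        features = np.asarray(trajectory, dtype=float)
        discounts = gamma ** np.arange(len(features))
        totals.append((discounts[:, None] * features).sum(axis=0))
    return np.mean(totals, axis=0)


# Hypothetical per-step features: [made progress, exceeded speed limit, near collision].
expert_trajectories = [
    [[1, 0, 0], [1, 0, 0], [1, 0, 0]],
    [[1, 0, 0], [0, 0, 0], [1, 0, 0]],
]
baseline_trajectories = [
    [[1, 1, 0], [1, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 0], [1, 0, 1]],
]

mu_expert = feature_expectations(expert_trajectories)
mu_baseline = feature_expectations(baseline_trajectories)

# Reward weights point from the baseline's feature expectations to the expert's.
weights = mu_expert - mu_baseline
weights /= np.linalg.norm(weights)


def reward(features):
    """Estimated reward for a single step's feature vector."""
    return float(np.asarray(features, dtype=float) @ weights)


print(weights)
```

With these toy trajectories the inferred weights reward progress and penalise speeding and near collisions, which is the kind of signal an RL agent could then be trained on.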