<div style="text-align: left">
    <img src='https://avatars.githubusercontent.com/u/101578736?v=4' width=100/>  
</div>

Author: Itay Segev

E-mail: [itaysegev@campus.technion.ac.il](mailto:itaysegev@campus.technion.ac.il)



#  Imitation Learning


<img src='https://upload.wikimedia.org/wikipedia/commons/1/1f/Makak_neonatal_imitation.png?1648499532601' width=1000/>

<a id="section:intro"></a>

# <img src="https://img.icons8.com/?size=50&id=55412&format=png&color=000000" style="height:50px;display:inline"> Introduction
---

Imitation Learning (IL) is a technique for learning a policy from demonstrations produced by an "expert" (in most cases, a human). There are several types of imitation learning methods, but the simplest approach is called Behavior Cloning (BC). In BC, we attempt to learn a classifier (or regressor if actions are continuous) where the feature space $\mathcal{X}$ is some representation of the state and the label set $\mathcal{Y}$ is the set of actions. The expert provides a "correct" action for a sample set of states by running in the environment and recording the actions taken at each state. This data is used to learn a classifier that predicts what action the expert would have taken at each state.
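
To make the "states as features, actions as labels" view concrete, here is a minimal sketch of BC with the simplest possible classifier: a lookup table that stores the expert's most frequent action for each seen state. The function name `fit_tabular_bc` and the demonstration data are hypothetical and only serve as an illustration; they are not part of the course code.

```python
from collections import Counter, defaultdict

def fit_tabular_bc(demonstrations):
    """Fit a lookup-table policy from (state, action) pairs by majority vote."""
    votes = defaultdict(Counter)
    for state, action in demonstrations:
        votes[state][action] += 1
    # for every state seen in the data, keep the action the expert chose most often
    return {state: counts.most_common(1)[0][0] for state, counts in votes.items()}

# hypothetical expert demonstrations: (state, action) pairs
demos = [("s0", "right"), ("s1", "right"), ("s0", "right"),
         ("s2", "up"), ("s1", "up"), ("s1", "right")]
bc_policy = fit_tabular_bc(demos)

print(bc_policy)
print("s3" in bc_policy)  # an unseen state has no entry in the table
```

A lookup table has nothing to say about states that never appeared in the demonstrations, which is one reason to prefer a parametric model such as a neural network.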

In this notebook, we present the BC method and implement an example on the [gym taxi environment](https://gymnasium.farama.org/environments/toy_text/taxi/). Our expert will be an A* algorithm with an admissible heuristic (ensuring optimality). We will then learn from the collected expert data using a multi-layer perceptron neural network, implemented in PyTorch.

Recommended Lecture on IL: [Part 1](https://www.youtube.com/watch?v=kGc8jOy5_zY), [Part 2](https://www.youtube.com/watch?v=06uB13C5pxw), [Part 3](https://www.youtube.com/watch?v=a5wkzPa4fO4)

## <img src="https://img.icons8.com/?size=50&id=43171&format=png&color=000000" style="height:30px;display:inline"> Setup


You will need to make a copy of this notebook in your Google Drive before you can edit the notebook. You can do so with **File &rarr; Save a copy in Drive**.

In [None]:
#@title mount your Google Drive
import os
connect_drive = False #@param {type: "boolean"}
if connect_drive:
  from google.colab import drive
  drive.mount('/content/gdrive', force_remount=True)

  # set up mount symlink
  DRIVE_PATH = '/content/gdrive/My\ Drive/cs236018_w24'
  DRIVE_PYTHON_PATH = DRIVE_PATH.replace('\\', '')
  if not os.path.exists(DRIVE_PYTHON_PATH):
    %mkdir $DRIVE_PATH

## the space in `My Drive` causes some issues,
## make a symlink to avoid this
SYM_PATH = '/content/cs236018_w24'
if not os.path.exists(SYM_PATH) and connect_drive:
  !ln -s $DRIVE_PATH $SYM_PATH




In [None]:
#@title apt install requirements

#@markdown Run each section with Shift+Enter

#@markdown Double-click on section headers to show code.

from IPython.display import clear_output

!apt update -qq
!apt install -y -qq --no-install-recommends \
        build-essential \
        curl \
        git \
        gnupg2 \
        make \
        cmake \
        ffmpeg \
        swig \
        libz-dev \
        unzip \
        zlib1g-dev \
        libglfw3 \
        libglfw3-dev \
        libxrandr2 \
        libxinerama-dev \
        libxi6 \
        libxcursor-dev \
        libgl1-mesa-dev \
        libgl1-mesa-glx \
        libglew-dev \
        libosmesa6-dev \
        lsb-release \
        ack-grep \
        patchelf \
        wget \
        xpra \
        xserver-xorg-dev \
        ffmpeg
!apt-get install python-opengl -y -qq
!apt install xvfb -y -qq
clear_output()

In [None]:
#@title clone course repo

%cd $SYM_PATH
# !git clone {repo_url}
!git clone --single-branch --branch main https://github.com/CLAIR-LAB-TECHNION/SDMRL.git
%cd SDMRL/tutorials/notebooks/imitation-learning

%pip install -r requirements_colab.txt


clear_output()

In [None]:
#@title set up virtual display

from pyvirtualdisplay import Display

display = Display(visible=0, size=(1400, 900))
display.start()

In [None]:
#@title test virtual display

#@markdown If you see a video of a taxi moving randomly in a grid, setup is complete!

import gym
from cs236018.infrastructure.colab_utils import (
    wrap_env,
    show_video
)

env = wrap_env(gym.make("Taxi-v3", render_mode='rgb_array'))

observation = env.reset()
for i in range(100):
    env.render()
    obs, rew, term, _ = env.step(env.action_space.sample() )
    if term:
        break

env.close()
print('Loading video...')
show_video()

In [None]:
%matplotlib inline

%load_ext autoreload
%autoreload 2

import itertools

from collections import deque

import matplotlib.pyplot as plt
import numpy as np
import torch

from aidm.Environments.gym_problem import GymProblem
from aidm.Search.best_first_search import a_star

from torch.utils.data import DataLoader
from tqdm.auto import tqdm

from cs236018.infrastructure.dataset import ImitationLearningDataset
from cs236018.infrastructure.pytorch_util import train_torch_model_sgd
from cs236018.infrastructure.utils import evaluate_policy
from cs236018.infrastructure.colab_utils import animate_policy



# initialize taxi env
taxi_env = wrap_env(gym.make("Taxi-v3", render_mode='rgb_array'))
taxi_env.reset()

# constants for taxi env planning
PASSENGER_IN_TAXI = 4  # passenger idx when in taxi
LOCS = taxi_env.unwrapped.locs  # environment locations

# random seed
SEED = 42

---

We model the world as a deterministic planning model in which the state space is the set of all possible environment configurations and the action space is the list of actions the taxi agent can execute within the environment. A configuration is any combination of taxi location, passenger location (including an "in taxi" indication), and destination location. In our example the domain map is a 5x5 grid, the passenger can be at one of 4 designated locations or in the taxi (5 options), and the destination is one of the same 4 locations, adding up to a total of $5\cdot 5\cdot 5\cdot 4 = 500$ states. This state space is small enough for the simplified taxi environment to be solved with deterministic planning and state-value function estimation algorithms. For our purposes, however, the model is used only 'under the hood' for simulation: we rely on its notation to collect data with an expert policy, rather than using the model framework directly for planning.
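
The state count above can be checked with a few lines of standalone code. The sketch below assumes the mixed-radix integer encoding used by Taxi-v3 (radices 5, 5, 5, 4); the free functions `encode` and `decode` are illustrative stand-ins for the environment's own methods.

```python
from itertools import product

def encode(taxi_row, taxi_col, passenger_idx, dest_idx):
    # mixed-radix encoding with radices (5, 5, 5, 4)
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_idx) * 4 + dest_idx

def decode(state):
    # invert the encoding by peeling off one digit at a time
    state, dest_idx = divmod(state, 4)
    state, passenger_idx = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_idx, dest_idx

# enumerate every combination of taxi row, taxi column, passenger index, destination index
all_states = {encode(r, c, p, d)
              for r, c, p, d in product(range(5), range(5), range(5), range(4))}

print(len(all_states))
print(decode(encode(3, 2, 3, 2)))
```

Every combination maps to a distinct integer, so the set contains exactly one entry per configuration.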


# <img src="https://img.icons8.com/?size=50&id=43254&format=png&color=000000" style="height:50px;display:inline"> The expert
---

As mentioned in the [introduction](#section:intro), the expert is an implementation of the A* algorithm for this environment. [aidm](https://github.com/CLAIR-LAB-TECHNION/aidm) is a library that provides generalized search algorithms and supports the taxi environment. The environment is wrapped as a custom `Problem` object that can be solved using A*.

In [None]:
taxi_problem = GymProblem(taxi_env, taxi_env.unwrapped.s)
taxi_problem.__class__.__bases__

A* requires an admissible heuristic if we want to guarantee optimality. Below we define the Manhattan distance heuristic:

In [None]:
def manhatten_dist(r1, c1, r2, c2):
    # classic Manhattan distance: |row1 - row2| + |col1 - col2|
    return abs(r1 - r2) + abs(c1 - c2)

def taxi_heuristic(node):
    # decode state integer to interpretable values
    taxi_row, taxi_col, passenger_idx, dest_idx = taxi_env.decode(node.state.get_key())

    # dist from the taxi to the destination
    return manhatten_dist(taxi_row, taxi_col, *LOCS[dest_idx])

#### <img src="https://img.icons8.com/?size=50&id=ndnNDCLXM-H6&format=png&color=000000" style="height:50px;display:inline"> Task 1:  Create your own heuristic

Design a new admissible heuristic for the taxi problem. Fill in the following code with your implementation and test your heuristic by running the next two code cells with your heuristic. Compare the time difference it takes to return a solution. Is it an optimal solution?
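
One way to sanity-check admissibility is to compare a heuristic against true shortest-path costs computed by breadth-first search. The sketch below does this on a small hypothetical grid with blocked cells (not the taxi map, whose walls lie between cells); all names in it are illustrative.

```python
from collections import deque

def bfs_distances(grid, goal):
    """True shortest-path step counts between `goal` and every reachable free cell."""
    rows, cols = len(grid), len(grid[0])
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != '#' and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist

def is_admissible(heuristic, grid, goal):
    """A heuristic is admissible if it never exceeds the true remaining cost."""
    true_cost = bfs_distances(grid, goal)
    return all(heuristic(cell, goal) <= cost for cell, cost in true_cost.items())

def manhattan(cell, goal):
    return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

def doubled_manhattan(cell, goal):
    return 2 * manhattan(cell, goal)

# hypothetical 4x4 map: '.' is free, '#' is blocked
toy_grid = ["....",
            ".##.",
            "....",
            "...."]
toy_goal = (0, 0)

print(is_admissible(manhattan, toy_grid, toy_goal))
print(is_admissible(doubled_manhattan, toy_grid, toy_goal))
```

The same idea applies to the taxi problem: a heuristic that ever exceeds the true number of remaining actions is not admissible, and A* may then return a suboptimal plan.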


In [None]:
# Your new heuristic function
def your_new_heuristic(node):
    # decode state integer to interpretable values
    taxi_row, taxi_col, passenger_idx, dest_idx = taxi_env.decode(node.state.get_key())

In [None]:
#@title Admissible heuristic

#@markdown In this hidden cell, you can find a heuristic that uses domain knowledge, which you can explore as well.
#@markdown Below we define the following heuristic:
#@markdown * if the passenger is in the taxi, calculate the Manhattan distance between the taxi and the destination and add 1 for the dropoff action

#@markdown * if the passenger is not in the taxi, calculate the Manhattan distances between the taxi and the passenger, and between the passenger and the destination. Add 2 for the pickup and dropoff actions.

#@markdown Double-click on section headers to show code.

def manhatten_dist(r1, c1, r2, c2):
    # classic Manhattan distance: |row1 - row2| + |col1 - col2|
    return abs(r1 - r2) + abs(c1 - c2)

def taxi_heuristic(node):
    # decode state integer to interpretable values
    taxi_row, taxi_col, passenger_idx, dest_idx = taxi_env.decode(node.state.get_key())

    # split to 2 cases where the passenger is in the taxi and not in the taxi.
    if passenger_idx == PASSENGER_IN_TAXI:
        # dist from the taxi to the destination
        return manhatten_dist(taxi_row, taxi_col, *LOCS[dest_idx]) + 1  # include dropoff
    elif passenger_idx == dest_idx:
        # passenger has reached the destination. this is a goal state
        return 0
    else:
        # dist from the taxi to the passenger and from the passenger to the destination
        passenger_dist = manhatten_dist(taxi_row, taxi_col, *LOCS[passenger_idx])
        dest_dist = manhatten_dist(*LOCS[passenger_idx], *LOCS[dest_idx])
        return passenger_dist + dest_dist + 2  # include pickup and dropoff actions

A policy takes an observation and creates a plan. While there are still actions in that plan, it performs the next action. Otherwise, it starts a new plan.

This concept is implemented in the `TaxiAStarPolicy` class.
The main method of the class is `__call__`, which takes an observation (`obs`) as input. The method works as follows:

1. **Check Current Plan**:
   - If there are no actions left in the current plan, or if the observation does not match the expected observation in the current plan, a new plan is created.

2. **Create New Plan**:
   - The problem is refreshed with a new initial state from the environment.
   - The A* algorithm is used to find a solution from the current state to the goal. This solution includes a series of states (`state_lst`) and corresponding actions (`sol`).

3. **Save the Plan**:
   - The expected states and actions are combined into tuples and stored in `cur_plan` for later use. Each tuple contains an expected observation and the corresponding expert action.

4. **Execute Next Action**:
   - The next action in the current plan is performed by popping it from the `cur_plan` deque.
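
The steps above can be sketched independently of the taxi environment. The following is a minimal illustration with a hypothetical planner on a 1-D line; `ReplanningPolicy` and `line_planner` are made-up names and do not use the `aidm` API.

```python
from collections import deque

class ReplanningPolicy:
    """Caches a plan of (expected_state, action) pairs; replans when it is empty or stale."""
    def __init__(self, planner):
        self.planner = planner      # callable: state -> list of (expected_state, action)
        self.cur_plan = deque()
        self.num_replans = 0

    def __call__(self, state):
        # steps 1-3: create and save a new plan if none is left or the state is unexpected
        if not self.cur_plan or self.cur_plan[0][0] != state:
            self.cur_plan = deque(self.planner(state))
            self.num_replans += 1
        # step 4: execute the next action
        return self.cur_plan.popleft()[1]

def line_planner(state, goal=3):
    # hypothetical planner: walk along the integer line toward `goal`
    # (returns an empty plan at the goal, so the policy should not be called there)
    step = 1 if goal > state else -1
    return [(s, step) for s in range(state, goal, step)]

policy = ReplanningPolicy(line_planner)
first_action = policy(0)    # plans once from state 0
second_action = policy(1)   # state 1 matches the plan, so no replanning
third_action = policy(5)    # unexpected state: the cached plan is stale, so replan
print(first_action, second_action, third_action, policy.num_replans)
```

Caching the plan avoids running the search at every step, while the staleness check keeps the policy correct if the environment ends up somewhere the plan did not expect.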


In [None]:
class TaxiAStarPolicy:
    def __init__(self, heuristic):
        self.heuristic = heuristic

        # a container for the plan actions.
        self.cur_plan = deque()

    def __call__(self, obs):
        # if out of actions (finished previous plan), or if observation is not in current plan,
        # create a new plan.
        if not self.cur_plan or self.cur_plan[0][0] != obs:
            # refresh the problem with a new initial state
            taxi_prob = GymProblem(taxi_env, taxi_env.unwrapped.s)

            # find the solution with the A* algorithm
            _, node, sol, _, _ = a_star(taxi_prob, heuristic_func=self.heuristic)

            # get a list of expected states
            state_lst = []
            while node.parent:
                node = node.parent
                state_lst.append(node.state.key)
            state_lst = reversed(state_lst)

            # save the plan for later extraction
            # a plan is a tuple of expected observations and the corresponding expert action
            self.cur_plan = deque(list(zip(state_lst, map(int, sol))))

        # pop the next action
        return self.cur_plan.popleft()[1]

taxi_expert = TaxiAStarPolicy(taxi_heuristic) #TODO: Change to your_new_heuristic

Let's see how our policy performs.

In [None]:
# This code can be terminated early with an interruption
animate_policy(taxi_env, taxi_expert, episode_limit=10)
print('Loading video...')
show_video()

As we can see, we can use planning to define an optimal policy for this domain that solves the problem in real time. This will be our expert from which we will collect our demonstration data.

# <img src="https://img.icons8.com/?size=50&id=104319&format=png&color=000000" style="height:50px;display:inline"> Trajectories
---



Given an initial state $s$ and a policy $\pi$, a **trajectory** (in a deterministic environment) is a sequence of state-action pairs $((s_1, a_1), \dots, (s_n, a_n))$ where $s_1=s$, $a_i = \pi(s_i)$, and, since the environment is deterministic, $s_{i+1}=P(s_i, a_i)$. In other words, a trajectory is an ordered collection of states and actions as experienced by running policy $\pi$ starting from state $s$.



In [None]:
# trajectory struct
class Trajectory:
    def __init__(self, observations=None, actions=None):
        self.observations = observations or []
        self.actions = actions or []

    def add_step(self, observation, action):
        self.observations.append(observation)
        self.actions.append(action)

    def __str__(self):
        return 'trajectory: ' + str(list(zip(self.observations, self.actions)))

    def __repr__(self):
        return str(self)

We can collect trajectories with our expert by simply recording the observations and the actions taken at these observations.

In [None]:
def get_trajectory(env, policy, max_trajectory_length=float('inf')):
    # init trajectory object
    trajectory = Trajectory()

    # get first observation
    obs = env.reset()

    # iterate and step in environment.
    # limit num actions for incomplete policies
    for i in itertools.count(start=1):
        action = policy(obs)
        trajectory.add_step(obs, action)
        obs, reward, done, info = env.step(action)

        if done or i >= max_trajectory_length:
            break

    return trajectory

trajectory = get_trajectory(taxi_env, taxi_expert)
trajectory

# <img src="https://img.icons8.com/?size=50&id=pkrAODkotBly&format=png&color=000000" style="height:50px;display:inline"> Data collection and preparation
---

We will now collect the data with which we will train by collecting multiple trajectories. As with most supervised learning settings, we will collect 3 datasets: training, validation, and testing. Since this is a relatively simple environment with a small number of states, we will collect a small number of trajectories so we do not encounter the entire state space in training. This way, we can see if our model is generalizing to new, unseen states.

In [None]:
def collect_data(env, policy, num_trajectories, max_trajectory_length=float('inf')):
    trajectories = []
    for _ in tqdm(range(num_trajectories)):
        trajectories.append(get_trajectory(env, policy, max_trajectory_length))

    return trajectories

# get the same trajectories every time!
taxi_env.seed(SEED)

taxi_raw_train_data = collect_data(taxi_env, taxi_expert, num_trajectories=400)
taxi_raw_val_data = collect_data(taxi_env, taxi_expert, num_trajectories=250)
taxi_raw_test_data = collect_data(taxi_env, taxi_expert, num_trajectories=250)

# show the first 5 training trajectories
taxi_raw_train_data[:5]

In the taxi environment, states are represented as integers. However, this type of input is not very informative for supervised learning algorithms. To improve the representation, we can decompose the integer into its state attributes. The preprocessing function below represents the state as a feature vector in $\mathcal{X}$ consisting of the taxi location, the passenger location, an indicator of whether the passenger is in the taxi, and the destination location. This structured representation allows the algorithm to learn more effectively.

In [None]:
def prep_taxi_state(state):
    # decompose state bits
    taxi_row, taxi_col, passenger_idx, destination_idx = taxi_env.decode(state)

    # get destination true location coordinates
    destination_row, destination_col = LOCS[destination_idx]

    # get passenger true location coordinates
    # add `in_taxi` indicator bit
    if passenger_idx == PASSENGER_IN_TAXI:
        passenger_row, passenger_col = taxi_row, taxi_col
        passenger_in_taxi = 1
    else:
        passenger_row, passenger_col = LOCS[passenger_idx]
        passenger_in_taxi = 0

    # return all data as a flat Tensor object for pytorch compatibility
    return torch.Tensor([taxi_row,
                         taxi_col,
                         passenger_row,
                         passenger_col,
                         passenger_in_taxi,
                         destination_row,
                         destination_col])

We build the `ImitationLearningDataset` class to package trajectories and preprocessing functions into a format compatible with PyTorch. This class unwraps the trajectories to extract state-action pairs required for supervised learning. Note that we do **NOT** remove duplicates. This is to maintain the true trajectory sample distribution of the expert policy. The dataset is designed to work seamlessly with PyTorch DataLoaders for efficient batching during model training.
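
The unwrapping step can be illustrated without PyTorch. The sketch below is not the course's `ImitationLearningDataset` implementation (which lives in the course repository); `flatten_trajectories` and the raw data are hypothetical and only show the idea of turning trajectories into supervised samples while keeping duplicates.

```python
def flatten_trajectories(trajectories, prep_obs=lambda obs: obs):
    """Unroll (observations, actions) trajectories into one list of (features, action) samples."""
    samples = []
    for observations, actions in trajectories:
        for obs, action in zip(observations, actions):
            # duplicates are kept on purpose to preserve the expert's sample distribution
            samples.append((prep_obs(obs), action))
    return samples

# hypothetical raw data: each trajectory is a pair (observations, actions)
raw = [([10, 11, 12], [0, 0, 4]),
       ([10, 11], [0, 0])]

# hypothetical preprocessing: split an integer observation into two features
samples = flatten_trajectories(raw, prep_obs=lambda obs: divmod(obs, 5))
print(samples)
```

A PyTorch `Dataset` would expose the same list through `__len__` and `__getitem__`, so that a `DataLoader` can batch it.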

In [None]:
taxi_ds_train = ImitationLearningDataset(taxi_raw_train_data, prep_obs=prep_taxi_state)
taxi_ds_val = ImitationLearningDataset(taxi_raw_val_data, prep_obs=prep_taxi_state)
taxi_ds_test = ImitationLearningDataset(taxi_raw_test_data, prep_obs=prep_taxi_state)

taxi_ds_train[0]

# <img src="https://img.icons8.com/?size=50&id=46802&format=png&color=000000" style="height:50px;display:inline"> Learning model
---

The learning model we will be using is the [multi-layer perceptron](https://www.sciencedirect.com/topics/computer-science/multilayer-perceptron#:~:text=Multi%20layer%20perceptron%20(MLP)%20is,input%20signal%20to%20be%20processed.) (MLP). This was one of the first neural network architectures. It uses multiple fully connected linear layers separated by non-linear activation functions ([ReLU](https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/) in our case). MLPs excel at finding statistical correlations in vector data (data represented as arrays or lists of numbers, where each element represents a different feature) that is strongly ordered (the sequence of elements in the vector matters and is consistent) and real valued, especially if these correlations are continuous or near-continuous. Given that our state representation matches this description, the MLP model is well-suited to handle this task efficiently.

<img src="https://miro.medium.com/v2/resize:fit:800/1*-IPQlOd46dlsutIbUq1Zcw.png">

In [None]:
from cs236018.policies.MLP_policy import build_mlp
# get the input vector length from a training example
in_features = len(taxi_ds_train[0][0])

# The output vector length is the number of actions
num_actions = taxi_env.action_space.n


# create MLP model with 3 hidden layers
mlp_taxi = build_mlp(input_size=in_features, output_size=num_actions, n_layers=3, size=32)


# TODO: Change the hyperparameters (n_layers, size) to see their effect on the learning process
# Try different values for n_layers and size and observe how they impact the performance of the model.

mlp_taxi

We selected the hyperparameters (`n_layers`, `size`) following the common practice of tuning: trying several values and keeping a configuration that works well for our task.

#### <img src="https://img.icons8.com/?size=50&id=ndnNDCLXM-H6&format=png&color=000000" style="height:50px;display:inline"> Task 2: Explore different hyperparameters

Try different values for n_layers and size and observe how they impact the performance of the model.

In [None]:
# When you fill code in cells for tasks, make sure you delete the following line
%%script true


# TODO: Change the hyperparameters (n_layers, size) to see their effect on the learning process
# Try different values for n_layers and size and observe how they impact the performance of the model.
n_layers = _
size = _


# create an MLP model with the chosen hyperparameters
mlp_taxi = build_mlp(in_features, num_actions, n_layers, size)

mlp_taxi

# <img src="https://img.icons8.com/?size=50&id=104328&format=png&color=000000" style="height:50px;display:inline"> Training
---


We will train the classifier with [stochastic gradient descent](https://towardsdatascience.com/stochastic-gradient-descent-clearly-explained-53d239905d31) optimization. We train using the Cross-Entropy loss, which penalizes the classifier for giving low scores to the true class and high scores to wrong classes (as a reminder, in this case a class is an action).
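
As a worked example of how the loss behaves, the sketch below computes the cross-entropy of a single sample by hand, using only the standard library. The scores are made up for illustration, and `cross_entropy` is a standalone helper rather than the `torch.nn.CrossEntropyLoss` implementation.

```python
import math

def cross_entropy(logits, true_class):
    """Cross-entropy for one sample: -log(softmax(logits)[true_class])."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[true_class]

# three hypothetical score vectors for a 3-action problem where action 0 is correct
confident_correct = cross_entropy([4.0, 0.0, 0.0], true_class=0)
uncertain = cross_entropy([0.0, 0.0, 0.0], true_class=0)
confident_wrong = cross_entropy([0.0, 4.0, 0.0], true_class=0)

print(confident_correct, uncertain, confident_wrong)
```

The loss is smallest when the true action receives the highest score, equals $\log 3$ when all three actions are scored equally, and grows when a wrong action is scored confidently.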

In [None]:
(train_losses,
 val_losses,
 train_accs,
 val_accs) = train_torch_model_sgd(
    model=mlp_taxi,                  # The neural network model to train
    ds_train=taxi_ds_train,          # The training dataset
    ds_val=taxi_ds_val,              # The validation dataset
    loss_fn=torch.nn.CrossEntropyLoss(), # The loss function used for training
    batch_size=16,                   # Number of samples per batch
    shuffle_data=True,               # Whether to shuffle the data before each epoch
    num_epochs=200,                  # Number of epochs to train the model
    learning_rate=1e-2,              # Learning rate for the optimizer
    weight_decay=1e-5,               # Weight decay (L2 regularization) for the optimizer
    print_every=10,                  # Frequency of printing training progress
    include_accs=True,               # Whether to calculate and return accuracies
    seed=SEED                        # Random seed for reproducibility
)

Now let us visualize the results

In [None]:
import seaborn as sns

# Set Seaborn style
sns.set(style="whitegrid")

# Create 1x2 figure grid
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 7))

# Plot losses
sns.lineplot(x=range(len(train_losses)), y=train_losses, ax=ax1, label='train')
sns.lineplot(x=range(len(val_losses)), y=val_losses, ax=ax1, label='validation')
ax1.set_title('Cross-Entropy Loss per Epoch')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Cross-Entropy Loss')
ax1.legend()

# Plot accuracies
sns.lineplot(x=range(len(train_accs)), y=train_accs, ax=ax2, label='train')
sns.lineplot(x=range(len(val_accs)), y=val_accs, ax=ax2, label='validation')
ax2.set_title('Accuracy per Epoch')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.legend()

# Adjust layout for better spacing
plt.tight_layout()

# Show the plots
plt.show()


We see a relatively short and stable training process. Seemingly, our learning algorithm is able to clone the expert's behavior almost perfectly. Below are the final accuracies of our model on our datasets.

In [None]:
def ds_acc(ds, model):
    # get the entire dataset at once.
    # this will cause memory issues with large datasets
    ds_data, ds_labels = next(iter(DataLoader(ds, batch_size=len(ds))))

    # get model predictions
    model.eval()
    with torch.no_grad():
        ds_preds = model(ds_data)

    # calculate prediction accuracies
    return torch.mean((torch.argmax(ds_preds, dim=-1) == ds_labels).float()).item()

# print the accuracy on each dataset
print(f'train acc      = {ds_acc(taxi_ds_train, mlp_taxi)}')
print(f'validation acc = {ds_acc(taxi_ds_val, mlp_taxi)}')
print(f'test acc       = {ds_acc(taxi_ds_test, mlp_taxi)}')

### <img src="https://img.icons8.com/?size=50&id=103792&format=png&color=000000" style="height:50px;display:inline"> **Discussion**


The above results would be considered excellent in most classification tasks. Does this mean we have an excellent policy? What happens when a predicted action is not optimal? Can our agent recover from such mistakes?

# <img src="https://img.icons8.com/?size=50&id=55069&format=png&color=000000" style="height:50px;display:inline"> Evaluating the policy
---


Our newly trained classifier can be used as a policy for the taxi environment. With every input observation, the classifier gives each action a score. The action that was scored the highest is the one returned by the policy.

In [None]:
class ClassifierPolicy:
    def __init__(self, model, prep_fn=None):
        self.model = model

        # if no preprocessing function is given, use the identity function
        if prep_fn is None:
            self.prep_fn = lambda x: x
        else:
            self.prep_fn = prep_fn

    def __call__(self, observation):
        # preprocess observation
        prepped_obs = self.prep_fn(observation)
        one_obs_batch = prepped_obs[None]  # convert to batch of size 1

        # run model to get action scores
        self.model.eval()
        with torch.no_grad():
            batch_scores = self.model(one_obs_batch)

        # get scores for single observation
        obs_score = batch_scores[0]

        # choose the action with the highest score
        return torch.argmax(obs_score).item()

# create a policy driven by the MLP model that uses the same observation preprocessing
# function as in training
taxi_il_policy = ClassifierPolicy(mlp_taxi, prep_fn=taxi_ds_train.prep_obs)

## How good can an imitation policy be?

Let us compare the performance of the expert policy and the classifier policy

In [None]:
taxi_env = gym.make("Taxi-v3").env
taxi_env.reset()

In [None]:
total_reward, mean_reward = evaluate_policy(taxi_env, taxi_expert, num_episodes=10_000,
                                            seed=SEED)
print('A* Policy')
print('---------')
print(f'total reward over all episodes: {total_reward}')
print(f'mean reward per episode:        {mean_reward}')

In [None]:
total_reward, mean_reward = evaluate_policy(taxi_env, taxi_il_policy, num_episodes=10_000, seed=SEED)
print('Classifier Policy')
print('-----------------')
print(f'total reward over all episodes: {total_reward}')
print(f'mean reward per episode:        {mean_reward}')

We note that, in expectation, our imitation policy performs almost as well as the expert. Sadly, this is not the only measure of a policy.

## Imitation learning failures



In the vast majority of cases, our policy acts identically to the expert. However, the policy can still fail miserably from one or more starting positions. Let us recreate one such scenario.

In [None]:
import pygame
from renderlab import RenderFrame
# Ensure all Pygame instances are closed
pygame.quit()
# This code can be terminated early with an interruption
taxi_env = gym.make("Taxi-v3").env

# reset env
taxi_env.reset()

failure_obs = taxi_env.encode(3, 2, 3, 2)
taxi_env.unwrapped.s = failure_obs

# step using the action chosen by the imitation policy
failure_action = taxi_il_policy(failure_obs)
taxi_env.step(failure_action)

# render the environment
env_render = taxi_env.render(mode='ansi')

# Display the render result
print(env_render)

As we can see, the taxi should go North on its way to location B to pick up the passenger. However, our classifier policy chooses to go East, thus crashing into the wall. Worse yet, as long as the episode is live, the state will remain the same and our policy will choose the same action over and over. In a real-world situation, we could not afford to repeatedly crash a car into the wall. Obviously, our expert can solve this problem with ease, but our classifier is unable to generalize. This kind of failure is caused by two phenomena, discussed below.

# <img src="https://img.icons8.com/?size=50&id=yg0Xl3Bazd07&format=png&color=000000" style="height:50px;display:inline"> Challenges in Behavioral Cloning
---


### Distributional shift



<img src="https://futureoflife.org/wp-content/uploads/2019/06/distributional-shift.png"/>

In the classification problem, we assume our finite sample set $S\subset \mathcal{X}\times \mathcal{Y}$ was sampled i.i.d. from some distribution $D$. In reality, the sample set is a collection of trajectories sampled from a distribution $D_{\pi_{\text{expert}}}$ that depends on the expert policy. When deploying our algorithm, we sample data from $D_{\pi}$, which depends on our policy $\pi$. Unless $\pi_{\text{expert}} = \pi$, we generally have $D_{\pi_{\text{expert}}} \neq D_{\pi}$, and so incoming data is sampled from outside the expected distribution.

When the agent observes a previously unseen state, it may act differently from the expert. When this happens, the next observation is sampled from $D_\pi$ and not $D_{\pi_{\text{expert}}}$, on which our agent is even more likely to make a mistake. This issue compounds as the distribution of the samples "shifts" from $D_{\pi_{\text{expert}}}$ to $D_\pi$.
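
A back-of-the-envelope calculation shows how quickly small errors add up. The sketch below assumes, purely for illustration, that the policy deviates from the expert independently at each step with probability `eps` and never recovers once it has deviated; `prob_no_mistake` is a hypothetical helper.

```python
def prob_no_mistake(eps, horizon):
    """Probability of matching the expert on every one of `horizon` steps,
    assuming an independent per-step error rate `eps` (a simplifying assumption)."""
    return (1.0 - eps) ** horizon

# a 1% per-step error rate over increasingly long episodes
for horizon in (10, 100, 1000):
    print(horizon, prob_no_mistake(0.01, horizon))
```

Under this simplified model, a policy that looks 99% accurate step by step is more likely than not to have left the expert's distribution somewhere along a 100-step episode.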

Distributional shift is hard to demonstrate on the single taxi domain due to its simplicity. In the above example, the taxi begins from an unseen state (we know it is unseen because the training accuracy is 1). Since the policy's failure leaves the state unchanged, the distribution has nowhere to shift.

### Accumulating Errors Due to Imperfect Data
Behavioral cloning is highly sensitive to the quality of the data. If the training data is perfect, meaning it contains the optimal actions for every possible state, the policy can learn effectively. However, in real-world scenarios, perfect data is rarely available. Even small mistakes in the data can lead to significant issues because of the compounding nature of errors. When the policy makes a mistake, it moves the agent into states that are not well-represented in the training data, leading to further mistakes.


### <img src="https://img.icons8.com/?size=50&id=103792&format=png&color=000000" style="height:50px;display:inline"> **Discussion**


In the real world, many problems involve continuous state spaces, where the states are not discrete and can take any value within a range.
How can we design sampling strategies that efficiently cover the continuous state space? What approaches can be employed to enhance the model's ability to generalize from a finite set of training samples to the entire continuous state space? How might distributional shift affect a policy operating in a continuous state space differently than in a discrete one?

#### <img src="https://img.icons8.com/cute-clipart/64/000000/warning-shield.png" style="height:30px;display:inline"> Advanced Topics in Imitation Learning

In the following cells, we are going to explore some advanced topics in imitation learning. We will describe them briefly without getting into too much depth. If you're curious and want to learn more, feel free to check out the further reading section or reach out to the course team.


## Data Collection and Augmentation Techniques







### Importance of Corrections in Training Data
One way to mitigate the issue of accumulating errors is by incorporating corrections in the training data. If the dataset contains examples of mistakes and the corresponding corrective actions, the policy can learn how to recover from errors. This makes the policy more robust and capable of handling a wider range of scenarios. For instance, if the training data includes states resulting from both optimal actions and mistakes, the policy can learn to navigate from erroneous situations back to optimal paths.
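
One way to obtain such corrections, sketched below, is to run the learner, record the states it actually reaches, and ask the expert what it would have done there. This resembles dataset-aggregation approaches such as DAgger, which the text above does not name; the corridor environment, the table-based learner, and all function names are hypothetical.

```python
def rollout_states(policy, start, step_fn, horizon):
    """States visited when running `policy` from `start` for `horizon` steps."""
    states, state = [], start
    for _ in range(horizon):
        states.append(state)
        state = step_fn(state, policy(state))
    return states

# hypothetical 1-D corridor: positions 0..5, goal at 5, actions are -1 / +1
def step_fn(state, action):
    return min(5, max(0, state + action))

def expert(state):
    return +1  # the expert always heads toward the goal

# a flawed learner: trained only on states 0..2, it guesses wrong everywhere else
learner_table = {0: +1, 1: +1, 2: +1}

def learner(state):
    return learner_table.get(state, -1)

before = rollout_states(learner, start=0, step_fn=step_fn, horizon=6)

# repeatedly visit states with the learner and add the expert's action for unseen ones
for _ in range(3):
    visited = rollout_states(learner, start=0, step_fn=step_fn, horizon=6)
    learner_table.update({s: expert(s) for s in visited if s not in learner_table})

after = rollout_states(learner, start=0, step_fn=step_fn, horizon=6)
print(before)
print(after)
```

Before the corrections the learner bounces back and forth at the edge of its training data; afterwards it reaches the goal, because the added samples cover exactly the states its own mistakes led it to.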

### Data Augmentation Strategies
Data augmentation involves creating additional training data by modifying existing data. This technique can help simulate various states that the policy might encounter, even if they were not present in the original dataset. For example, in the context of self-driving cars, side-facing cameras can provide alternative perspectives, simulating off-center positions that the car might need to correct.


In [None]:
from IPython.display import display, HTML
#@title Example: Drone Flying Through Forests Using Camera Data Augmentation
video_id = "umRdt3zGgpU"
html_code = f"""
<iframe width="800" height="450" src="https://www.youtube.com/embed/{video_id}" frameborder="0" allowfullscreen></iframe>
"""
display(HTML(html_code))

An illustrative example of data augmentation comes from a study where drones were trained to navigate through forests. Instead of using just the forward-facing camera, the researchers mounted cameras on a hat worn by a person walking through the forest. These cameras faced forward, left, and right. The left-facing camera was labeled with the action to go right, the right-facing camera with the action to go left, and the forward-facing camera with the action to go straight. This simple augmentation provided diverse training data, enabling the drone to learn corrective actions more effectively.
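The labeling rule described above can be sketched in a few lines. This is only an illustration: the camera names, the action strings, and the placeholder "frames" are assumptions made for the example, not the format used in the original study.

```python
# Hypothetical action labels for the three camera views described above.
CORRECTIVE_ACTION = {
    "left": "go_right",       # the left view shows what the drone would see after veering left
    "forward": "go_straight",
    "right": "go_left",       # the right view shows what the drone would see after veering right
}

def label_camera_frames(frames_by_camera):
    """Turn {camera_name: [frame, ...]} into a flat list of (frame, action) pairs."""
    dataset = []
    for camera, frames in frames_by_camera.items():
        action = CORRECTIVE_ACTION[camera]
        dataset.extend((frame, action) for frame in frames)
    return dataset

# Placeholder strings stand in for real images.
recording = {
    "left": ["img_l0", "img_l1"],
    "forward": ["img_f0", "img_f1"],
    "right": ["img_r0", "img_r1"],
}
dataset = label_camera_frames(recording)
print(len(dataset), dataset[0])
```

Every recorded moment yields three labeled samples instead of one, and two of them describe off-course views paired with the action that corrects them.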

## Non-Markovian Behavior

We previously noticed that in the above example, our taxi continuously chooses the East action, causing it to hit the wall and remain in place. Why does the agent not realize that this action is not helping it advance toward the goal? This is due to the ***Markovian assumption***. Under it, the next state is determined only by the current state and action, regardless of any previously visited states or actions taken. In other words, our model does not account for any memory of the past. Specifically, the agent has no recollection of hitting the wall, so it has no access to information that could hint that East is a bad action.

Remember that after training, neural networks are nothing more than functions. In our case, we have a deterministic model, i.e., for any observation $s$, $\pi(s)$ will always yield the same output. Since our model parameters will no longer change, the agent cannot escape this situation: it will keep choosing the same action in the same state indefinitely.


### Using Sequence Models to Incorporate Temporal Context
To address non-Markovian behavior, policies can be augmented with sequence models that incorporate a history of observations. Models like Long Short-Term Memory (LSTM) networks or Transformers can process sequences of observations and learn to make decisions based on the entire sequence. This allows the policy to account for temporal dependencies and make more informed decisions.
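As a rough sketch of what "incorporating a history of observations" means in practice, the snippet below keeps a sliding window of the last `k` observations. The class name `HistoryWrapper`, the padding value, and the hand-written `history_policy` rule are all assumptions for illustration. In a real setup the window would be fed to a learned sequence model such as an LSTM or a Transformer rather than to a fixed rule.

```python
from collections import deque

class HistoryWrapper:
    """Keeps the last `k` observations so a policy can condition on recent history."""

    def __init__(self, k, pad_obs):
        self.k = k
        self.pad_obs = pad_obs
        self.buffer = deque(maxlen=k)

    def reset(self, first_obs):
        # Fill the window with padding, then add the first real observation.
        self.buffer.clear()
        self.buffer.extend([self.pad_obs] * (self.k - 1))
        self.buffer.append(first_obs)
        return tuple(self.buffer)

    def step(self, obs):
        # The deque drops the oldest observation automatically.
        self.buffer.append(obs)
        return tuple(self.buffer)


def history_policy(history):
    """Toy rule: if the last move did not change the observation, stop repeating East."""
    stuck = len(history) >= 2 and history[-1] == history[-2]
    return "North" if stuck else "East"

hist = HistoryWrapper(k=3, pad_obs=None)
h0 = hist.reset(first_obs=7)   # window: (None, None, 7)
h1 = hist.step(7)              # the taxi hit a wall, so the observation is unchanged
print(history_policy(h0), history_policy(h1))
```

With a single observation as input, both calls would have to return the same action. With the window as input, the policy can tell that its previous move had no effect.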


In [None]:
#@title Example: Imitation with Transformers
video_id = "UuKAp9a6wMs"
html_code = f"""
<iframe width="800" height="450" src="https://www.youtube.com/embed/{video_id}" frameborder="0" allowfullscreen></iframe>
"""
display(HTML(html_code))

### Potential Pitfalls and Causal Confusion
While using sequence models can help address non-Markovian behavior, it can also introduce new challenges. One potential issue is causal confusion, where the policy learns spurious correlations rather than true causal relationships. For example, if the policy associates the activation of a brake light with the action of braking, it might not learn the underlying reason for braking (e.g., an obstacle ahead). Ensuring the policy focuses on the correct causal factors is crucial for effective learning.

## Multimodal Behavior

### Definition and Challenges

Multimodal behavior occurs when the expert's actions for a given state can follow multiple valid paths, leading to a complex distribution of possible actions. For instance, when faced with an obstacle, the expert might go left or right, both of which are correct. This poses a challenge for the policy, which might struggle to learn a single coherent action strategy if the training data contains such multimodal distributions.

### Solutions: Mixture of Gaussians, Latent Variable Models, Diffusion Models
Several advanced techniques can address the challenges of multimodal behavior:

**Mixture of Gaussians:**
A simple yet effective approach is to use a mixture of Gaussians. This method involves modeling the action distribution as a combination of multiple Gaussian distributions, each representing a different mode. The neural network outputs multiple means, variances, and weights for these Gaussians, allowing it to capture the multimodal nature of the actions.

**Latent Variable Models:**
Latent variable models introduce an additional latent variable that captures the underlying structure of the action distribution. Conditional variational autoencoders (CVAEs) are a popular choice, where the network learns to generate different modes by conditioning on this latent variable. During training, an encoder infers the latent variable from the demonstrated action, so that different values of the latent variable come to correspond to different modes, helping the network distinguish between them.

**Diffusion Models:**
Diffusion models are gaining popularity due to their effectiveness in generating complex distributions. These models start with a highly noisy version of the action and iteratively denoise it. The neural network learns to reverse the noise addition process, effectively modeling the multimodal action distribution.
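Of the three approaches above, the mixture of Gaussians is the easiest to sketch. The example below uses a hand-specified one-dimensional mixture (the weights, means, and standard deviations are assumptions chosen for illustration; in a real policy the network would output them per state). It compares the likelihood of an expert mode with the likelihood of the "averaged" action that a single-Gaussian regressor would tend to predict.

```python
import math
import random

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_log_likelihood(x, weights, means, stds):
    """log p(x) under a 1-D Gaussian mixture."""
    density = sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds))
    return math.log(density)

def sample_mixture(weights, means, stds, rng):
    """Pick a component according to its weight, then sample from that Gaussian."""
    idx = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return rng.gauss(means[idx], stds[idx])

# The expert steers left (-1) or right (+1) around an obstacle;
# an action of 0 would mean driving straight into it.
weights, means, stds = [0.5, 0.5], [-1.0, 1.0], [0.1, 0.1]

ll_mode = mixture_log_likelihood(1.0, weights, means, stds)
ll_mean = mixture_log_likelihood(0.0, weights, means, stds)
print(ll_mode > ll_mean)  # the averaged action is far less likely than either mode

rng = random.Random(0)
samples = [sample_mixture(weights, means, stds, rng) for _ in range(1000)]
print(sum(a < 0 for a in samples), sum(a > 0 for a in samples))
```

Sampling from the mixture commits to one of the two modes on each draw, instead of blending them into an action the expert never took.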




In [None]:
#@title Example: Imitation with latent variables
video_id = "w-CGSQAO5-Q"
html_code = f"""
<iframe width="800" height="450" src="https://www.youtube.com/embed/{video_id}" frameborder="0" allowfullscreen></iframe>
"""
display(HTML(html_code))

## Multitask Learning

Multitask learning involves training a policy to perform multiple tasks simultaneously. This approach can sometimes make imitation learning easier by providing more diverse training data and better state coverage.

### Training with Multiple Goals

**Example: Driving to Multiple Locations**
Instead of training an agent to drive to a single location (P1) with many demonstrations for that location, you can train a policy to drive to multiple locations. This involves conditioning the policy on the desired location, which can be provided as an input along with the state.

<img src='https://github.com/CLAIR-LAB-TECHNION/CLAI/blob/main/tutorials/assets/tut03_multi_tasks_learning.png?raw=true'>

**Benefits:**
- **Diverse State Coverage:** The expert will visit many different states when attempting to reach various locations, providing more comprehensive training data.
- **Robustness to Errors:** The policy learns to handle a wider variety of states, including those resulting from suboptimal behavior.

### Goal-Conditioned Behavioral Cloning

In goal-conditioned behavioral cloning, the policy is trained using trajectories where the final state (goal) is provided as an additional input. This approach assumes that each demonstration is a good example for reaching the final state observed in the trajectory.
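A minimal sketch of this relabeling step is shown below. The trajectory format (a list of `(state, action)` pairs together with the final state reached) and the grid coordinates are assumptions made for the example.

```python
def relabel_with_final_state(trajectory):
    """Convert one demonstration into (state, goal, action) training tuples.

    `trajectory` is a pair: a list of (state, action) steps and the final state reached.
    The final state is treated as the goal for every step of that demonstration.
    """
    steps, final_state = trajectory
    return [(state, final_state, action) for state, action in steps]

# Two demonstrations on a hypothetical grid; states are (row, col) tuples.
demo_a = ([((0, 0), "East"), ((0, 1), "East")], (0, 2))
demo_b = ([((0, 0), "South"), ((1, 0), "South")], (2, 0))

dataset = relabel_with_final_state(demo_a) + relabel_with_final_state(demo_b)
for state, goal, action in dataset:
    print(state, goal, action)
```

Note that state `(0, 0)` appears with two different actions. Without the goal as an extra input these labels would conflict; with it, each pair `(state, goal)` has a single consistent label.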


# <img src="https://img.icons8.com/?size=50&id=46678&format=png&color=000000" style="height:50px;display:inline"> DAGGER Algorithm
---



### Introduction to DAGGER
One interesting solution to distributional shift is [DAGGER](https://www.cs.cmu.edu/~sross1/publications/Ross-AIStats11-NoRegret.pdf) (Dataset Aggregation). In DAGGER, the aim is to bring the learned policy $\pi$ close to $\pi_{\text{expert}}$ on the states that $\pi$ itself visits, via an iterative algorithm. The working assumption is that if $\pi_{\text{expert}} \sim \pi$ then $D_{\pi_{\text{expert}}} \sim D_{\pi}$, i.e., similar policies induce similar state distributions. The idea is to perform behavior cloning, deploy the policy to collect more data, and then have an expert annotate the observations with the correct action. The new data is added to the old data and the process starts over. The most glaring issue with this technique is the need for an expert annotator (often a human), which can be very expensive or simply unsafe. As the name suggests, this kind of imitation learning is based on dataset aggregation.






### How DAGGER Works



<img src='https://github.com/CLAIR-LAB-TECHNION/CLAI/blob/main/tutorials/assets/tut03_DAgger.png?raw=true'>

#### Step-by-Step Algorithm:
1. **Initial Policy Training:** Train the initial policy on the expert demonstrations.
2. **Policy Execution:** Execute the policy in the real environment to collect observations.
3. **Human Labeling:** Ask human experts to label the collected observations with the correct actions.
4. **Data Aggregation:** Combine the new labeled data with the original training data.
5. **Policy Retraining:** Retrain the policy on the aggregated dataset.
6. **Repeat:** Iterate steps 2-5 until the policy converges.
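The steps above can be sketched on a toy problem. Everything in this example is an assumption made for illustration: a 10-state corridor with walls at both ends, a scripted expert standing in for the human labeler, and a lookup table standing in for the neural network policy. It is not the implementation used by the course script.

```python
GOAL, N_STATES, HORIZON = 5, 10, 12
START_STATES = [0, 9]
DEFAULT_ACTION = +1  # what the learner does in states it has never seen labeled

def expert_action(state):
    """Toy expert: always step toward GOAL, and stay once there."""
    return (GOAL > state) - (GOAL < state)

def step(state, action):
    """Move along the corridor; walls at both ends keep the agent in place."""
    return min(max(state + action, 0), N_STATES - 1)

def rollout(policy, start):
    """Run `policy` for HORIZON steps and return the visited states."""
    states, state = [], start
    for _ in range(HORIZON):
        states.append(state)
        state = step(state, policy(state))
    return states

def train(dataset):
    """'Behavior cloning' for a tabular policy: memorize the labeled action per state."""
    table = dict(dataset)
    return lambda state: table.get(state, DEFAULT_ACTION)

def reaches_goal(policy):
    return all(rollout(policy, s)[-1] == GOAL for s in START_STATES)

# Step 1: train on expert demonstrations (here: a single demo starting at state 0).
dataset = [(s, expert_action(s)) for s in rollout(expert_action, 0)]
policy = train(dataset)

iterations = 0
while not reaches_goal(policy):
    # Step 2: execute the current policy and record the states it visits.
    visited = [s for start in START_STATES for s in rollout(policy, start)]
    # Step 3: the expert labels those states with the correct action.
    new_data = [(s, expert_action(s)) for s in visited]
    # Step 4: aggregate with all previously collected data.
    dataset = dataset + new_data
    # Step 5: retrain on the aggregated dataset.
    policy = train(dataset)
    iterations += 1

print(iterations, reaches_goal(policy))
```

The initial policy only knows states 0 to 5, so when started at state 9 it keeps pushing into the wall, much like the taxi example. Each round of labeling teaches it one more state on the way back to the goal, which is why several iterations are needed even in this tiny domain.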


## Run DAGGER

We have provided a script that sets up and runs the DAgger algorithm for you. You can choose between four different environments and expert policies: 'Ant-v4', 'Walker2d-v4', 'HalfCheetah-v4', and 'Hopper-v4'. Additionally, you have the flexibility to explore the results by tweaking various parameters such as the number of training steps, batch sizes, network configurations, and more.

In [None]:
#@title imports

import time

from cs236018.scripts.run import run_training_loop

%load_ext autoreload
%autoreload 2

#### <img src="https://img.icons8.com/?size=50&id=ndnNDCLXM-H6&format=png&color=000000" style="height:50px;display:inline"> Task 3: Explore the DAgger Algorithm with Different Arguments

Change the values of various arguments such as `ep_len`, `n_iter`, `batch_size`, `n_layers`, `size`, and `learning_rate`, then run the algorithm with these new settings. Observe the results and record the performance metrics and behavior changes for each set of arguments.
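One possible way to organize this exploration is to enumerate the settings you want to try before running anything. The helper `make_grid` below is a hypothetical utility, not part of the provided `cs236018` package, and the specific values are placeholders.

```python
from itertools import product

def make_grid(**choices):
    """Yield one dict of overrides per combination of the listed values."""
    names = list(choices)
    for values in product(*(choices[name] for name in names)):
        yield dict(zip(names, values))

grid = list(make_grid(learning_rate=[5e-3, 1e-3], n_layers=[2, 3]))
for overrides in grid:
    print(overrides)
    # In the notebook, each override could be copied onto the `args` object
    # (e.g. args[name] = value) before calling run_training_loop(args).
    # Remember to create a fresh log directory for every run.
```

Keeping the overrides as plain dictionaries makes it easy to record which setting produced which TensorBoard run.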

In [None]:
#@title runtime arguments

class Args:

  def __getitem__(self, key):
    return getattr(self, key)

  def __setitem__(self, key, val):
    setattr(self, key, val)

  #@markdown expert data
  env_name = 'Walker2d-v4' #@param ['Ant-v4', 'Walker2d-v4', 'HalfCheetah-v4', 'Hopper-v4']
  expert_policy_file = 'cs236018/policies/experts/' + env_name.split('-')[0] + '.pkl'
  expert_data = 'cs236018/expert_data/expert_data_' + env_name + '.pkl'
  exp_name = 'dagger_' + env_name.split('-')[0]
  do_dagger = True
  ep_len = 1000 #@param {type: "integer"}
  save_params = False

  num_agent_train_steps_per_iter = 1000 #@param {type: "integer"}
  n_iter = 10 #@param {type: "integer"}

  #@markdown batches & buffers
  batch_size_initial = 2000
  batch_size = 1000 #@param {type: "integer"}
  eval_batch_size = 1000 #@param {type: "integer"}
  train_batch_size = 100 #@param {type: "integer"}
  max_replay_buffer_size = 1000000

  #@markdown network
  n_layers = 2 #@param {type: "integer"}
  size = 64 #@param {type: "integer"}
  learning_rate = 5e-3 #@param {type: "number"}

  video_log_freq = 5
  scalar_log_freq = 1

  #@markdown gpu & run-time settings
  no_gpu = False
  which_gpu = 0
  seed = 1 #@param {type: "integer"}

args = Args()


In [None]:
#@title create directory for logging
import os

data_path = '/content/cs236018_S24/tut03/data'
# exist_ok=True makes the call a no-op when the directory already exists
os.makedirs(data_path, exist_ok=True)
logdir = args.exp_name + '_' + args.env_name + \
         '_' + time.strftime("%d-%m-%Y_%H-%M-%S")
logdir = os.path.join(data_path, logdir)
args['logdir'] = logdir
os.makedirs(logdir, exist_ok=True)

In [None]:
## run training
print(args.logdir)
run_training_loop(args)

After running the DAgger algorithm, we can use TensorBoard to visualize the results and monitor the performance of our agent. TensorBoard is a powerful tool for visualizing machine learning experiments.

In [None]:
#@markdown You can visualize your runs with tensorboard from within the notebook

%load_ext tensorboard
%tensorboard --logdir /content/cs236018_S24/tut03/data

### <img src="https://img.icons8.com/?size=50&id=103792&format=png&color=000000" style="height:50px;display:inline"> **Discussion**

How does episode length (`ep_len`) influence learning and performance? What impact do varying training iterations (`n_iter`) and steps per iteration (`num_agent_train_steps_per_iter`) have? How do different batch sizes affect training stability and accuracy? How do changes in network architecture (`n_layers`, `size`) impact learning? How do different learning rates (`learning_rate`) affect convergence and stability?

# <img src="https://img.icons8.com/?size=100&id=46509&format=png&color=000000" style="height:50px;display:inline"> Conclusion
---



IL, and specifically BC, is a simple approach to solving basic RL problems. Although it is not perfect, it can achieve impressive results on average. However, we noticed two major issues that manage to cripple it even in a domain as simple as single-taxi.

Our learner is unable to generalize to unseen states, so one mistake can lead to complete failure. A straightforward solution is to collect more data. Due to the small number of states, we will most likely collect optimal actions for all possible states if we collect enough trajectories. Our deep classifier is able to completely fit the training data, and so in this case our classifier correctly chooses the optimal action (as would the expert) at every state.

The above solution is only relevant for a simple domain such as this one. It is not feasible in much larger or continuous state spaces, where our algorithms must be able to generalize to unseen observations if they are to be useful. In such cases, it is worth considering other approaches (for example, A* can be practical if you have a very informative heuristic, but it may fail on large domains without one).

There are other forms of imitation learning (see [this blog post](https://smartlabai.medium.com/a-brief-overview-of-imitation-learning-8a8a75c44a9c)) besides BC that were not discussed in this tutorial. These techniques are aimed at solving issues in BC.

The limitations of imitation learning highlight the need for methods that can autonomously collect data and improve policies without extensive human involvement. In future lectures, we will explore reinforcement learning (RL) techniques that address these challenges by allowing agents to learn from their own experiences, aiming for behaviors that surpass human performance.

# <img src="https://img.icons8.com/dusk/64/000000/plus-2-math.png" style="height:50px;display:inline"> Further Reading on Imitation Learning
---

* [A brief overview of Imitation Learning](https://smartlabai.medium.com/a-brief-overview-of-imitation-learning-8a8a75c44a9c)

* [Generative Adversarial Imitation Learning Paper](https://arxiv.org/abs/1606.03476)

* [DAGGER Algorithm](https://arxiv.org/abs/1011.0686)



* [Challenges of Imitation Learning](https://arxiv.org/abs/1811.06711)


# <img src="https://img.icons8.com/?size=100&id=46756&format=png&color=000000" style="height:50px;display:inline"> Credits
---
* This tutorial is based on a previous one written by Guy Azran.
* Examples and code snippets were taken from <a href="https://rail.eecs.berkeley.edu/deeprlcourse/">CS285 - Deep Reinforcement Learning Course at UC Berkeley</a>
* Icons from <a href="https://icons8.com/">Icons8.com</a>