# Behavior Cloning (BC)

In Behavior Cloning (BC), we find the optimal parameters $\theta$ of a policy $\pi_{\theta}$ by solving a regression (or classification) problem on the expert's dataset $\mathcal{D}$, that is, as ordinary supervised learning.<br>
You can therefore apply existing regression (or classification) methods directly, such as a Gaussian model, a Gaussian mixture model (GMM), non-parametric methods (LWR, GPR), or neural network learners.<br>
See [my post](https://tsmatz.wordpress.com/2017/08/30/regression-in-machine-learning-math-for-beginners/) for the design choices in regression problems.

In this notebook, I'll build a neural network policy $\pi_{\theta}$ and then optimize its parameters (weights) by minimizing the cross-entropy loss in PyTorch.
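
Written out, this objective is the negative log-likelihood of the expert's actions, which is the usual form of the cross-entropy loss for a categorical policy. Here $(s, a)$ denotes a state-action pair taken from $\mathcal{D}$; this notation is an assumption added for illustration.

$$\theta^{*} = \arg\min_{\theta} \; -\sum_{(s,a) \in \mathcal{D}} \log \pi_{\theta}(a \mid s)$$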

The trained policy can then be used as a starting point for regular reinforcement learning (RL) methods if you want to refine the model for better performance. (See [here](https://github.com/tsmatz/reinforcement-learning-tutorials) for RL algorithms.)

BC is a basic approach to imitation learning and is easy to apply in a wide range of scenarios.

It's worth noting, however, that BC has shortcomings in some situations.<br>
One of these is that an agent trained by BC may encounter unknown states that are not covered by the expert's behaviors, because the expert dataset doesn't contain enough data for failure scenarios. In most cases, the trained agent works well as long as the episode goes well, but it fails when it encounters such irregular states.<br>
In such cases, you can apply [DAgger](./02_dagger.ipynb) (the next example), or continue training the policy with regular reinforcement learning after BC has been applied.

Now let's start.

*(back to [index](https://github.com/tsmatz/imitation-learning-tutorials/))*

Before we start, we need to install the required packages.

In [None]:
!pip install torch numpy

## Restore environment

First, I restore the GridWorld environment from a JSON file.

For details about this environment, see [Readme.md](https://github.com/tsmatz/imitation-learning-tutorials/blob/master/Readme.md).

> Note : See [this script](./00_generate_expert_trajectories.ipynb) for generating the same environment.

In [1]:
import numpy as np
import random

GRID_SIZE = 50
MAX_TIMESTEP = 200

class GridWorld:
    """
    This environment is motivated by the following paper.
    https://proceedings.mlr.press/v15/boularias11a/boularias11a.pdf

    - It has 50 x 50 grids (cells).
    - The agent has four actions for moving in one of the directions of the compass.
    - If ```transition_prob``` = True, an action succeeds with probability 0.7;
      on failure, the agent moves in one of the other three directions, chosen uniformly at random.
    - A reward of 10 is given for reaching the goal state, located on the bottom-right corner.
    - For the remaining states,
      the reward function was randomly set to 0 with probability 2/3
      and to −1 with probability 1/3.
    - If the agent moves across the border, it's given the fail reward (i.e., reward=`-1`).
    - The initial state is sampled from a uniform distribution.
    """

    def __init__(self, reward_map, valid_states, transition_prob=True):
        """
        Initialize class.

        Parameters
        ----------
        reward_map : float[GRID_SIZE * GRID_SIZE]
            Reward for each state.
        valid_states : list(int[2])
            List of states from which the agent can reach the goal state without losing any reward.
            Each state is a 2d vector, [row, column].
            When you call reset(), the initial state is picked from these states.
        transition_prob : bool
            True if transition probability (above) is enabled.
            False when we generate an expert agent without noise.
        """
        self.reward_map = np.array(reward_map)
        self.valid_states = np.array(valid_states)
        self.transition_prob = transition_prob

    def reset(self):
        """
        Randomly pick the initial state (a single state) from the valid states.

        Returns
        ----------
        state : int
            The id of the picked state.
        """
        # initialize step count
        self.step_count = 0
        # pick up sample of valid states
        state_2d = random.choice(self.valid_states)
        # convert 2d index to 1d index
        state_1d = state_2d[0] * GRID_SIZE + state_2d[1]
        # return result
        return state_1d

    def step(self, action, state):
        """
        Take action, proceed step, and return the result.

        Parameters
        ----------
        action : int
            Action to take
            (0=UP 1=DOWN 2=LEFT 3=RIGHT)
        state : int
            Current state id.

        Returns
        ----------
        new_state : int
            New state id.
        reward : float
            The obtained reward.
        done : bool
            True if the episode has terminated (goal reached or MAX_TIMESTEP steps taken).
        """
        # if transition prob is enabled, apply transition
        if self.transition_prob:
            # the action succeeds with probability 0.7
            prob = [.1]*4
            prob[action] *= 7.0
            action_onehot = np.random.multinomial(1, prob)
        else:
            action_onehot = np.zeros(4, dtype=int)
            action_onehot[action] += 1
        # get 2d state
        row, col = divmod(state, GRID_SIZE)
        state_2d = np.array([row, col])
        # move state
        # (0=UP 1=DOWN 2=LEFT 3=RIGHT)
        up_and_down = action_onehot[1] - action_onehot[0]
        left_and_right = action_onehot[3] - action_onehot[2]
        new_state = state_2d + np.array([up_and_down, left_and_right])
        # set reward
        reward = 0.0
        if (new_state[0] < 0) or (new_state[0] >= GRID_SIZE) or (new_state[1] < 0) or (new_state[1] >= GRID_SIZE):
            # if location is out of border, set reward=-1
            reward -= 1.0
        else:
            # otherwise, add the reward of the new state
            state_1d = new_state[0] * GRID_SIZE + new_state[1]
            reward += self.reward_map[state_1d]
        # correct location
        new_state = np.clip(new_state, 0, GRID_SIZE-1)
        # return result
        self.step_count += 1
        done = (new_state[0]==GRID_SIZE-1 and new_state[1]==GRID_SIZE-1) or (self.step_count==MAX_TIMESTEP)
        return new_state[0] * GRID_SIZE + new_state[1], reward, done

In [2]:
import json

with open("gridworld.json", "r") as f:
    json_object = json.load(f)
    env = GridWorld(**json_object, transition_prob=False)
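
To make the state encoding concrete, here is a minimal sketch of how a 1d state id maps to a (row, column) cell and how a noise-free action moves the agent. The helper names `to_2d`, `to_1d`, `move` and the `MOVES` table are hypothetical, written only for illustration; they mirror the arithmetic in `GridWorld.step()` but ignore rewards and transition noise.

```python
GRID_SIZE = 50  # same value as in the environment cell

def to_2d(state_id):
    """Convert a 1d state id into (row, column)."""
    return divmod(state_id, GRID_SIZE)

def to_1d(row, col):
    """Convert (row, column) into a 1d state id."""
    return row * GRID_SIZE + col

# (row, column) offsets for 0=UP, 1=DOWN, 2=LEFT, 3=RIGHT
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

def move(state_id, action):
    """Apply a noise-free action and clip to the grid (rewards are ignored)."""
    row, col = to_2d(state_id)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), GRID_SIZE - 1)
    col = min(max(col + d_col, 0), GRID_SIZE - 1)
    return to_1d(row, col)

print(to_2d(153))     # (3, 3)
print(move(153, 3))   # 154: one cell to the right
print(move(0, 0))     # 0: moving up from the top row is clipped
print(to_1d(49, 49))  # 2499: the goal state in the bottom-right corner
```

Keeping states as a single integer is what lets the policy below take a plain one-hot vector of length `GRID_SIZE * GRID_SIZE` as its input.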

## Define policy

Now I build a policy $\pi_{\theta}$.

This network receives the current state (as a one-hot vector) as input and returns the logits over the four actions as output. The action is then sampled from the categorical distribution defined by these logits.
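
As a quick check of this input/output contract, the sketch below pushes a single one-hot state through a small network and inspects the shapes. The `tiny_policy` model is a hypothetical stand-in with untrained random weights, assumed here only so that the snippet runs on its own.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

NUM_STATES = 50 * 50  # GRID_SIZE * GRID_SIZE
NUM_ACTIONS = 4

# Hypothetical stand-in with one hidden layer (random, untrained weights)
tiny_policy = nn.Sequential(
    nn.Linear(NUM_STATES, 64),
    nn.ReLU(),
    nn.Linear(64, NUM_ACTIONS),
)

state_id = torch.tensor(153)
# one-hot encoding of a single state id --> shape (2500,)
s_onehot = F.one_hot(state_id, num_classes=NUM_STATES).float()
with torch.no_grad():
    # add a batch dimension for the network, then remove it --> shape (4,)
    logits = tiny_policy(s_onehot.unsqueeze(0)).squeeze(0)
# softmax turns the logits into a categorical distribution over actions
probs = F.softmax(logits, dim=-1)

print(s_onehot.shape, logits.shape)  # torch.Size([2500]) torch.Size([4])
print(probs)
```

Because the weights are random, the printed probabilities are close to uniform, which is why an untrained agent wanders and scores poorly.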

In [3]:
import torch
import torch.nn as nn
from torch.nn import functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

#
# Define model
#
class PolicyNet(nn.Module):
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.hidden = nn.Linear(GRID_SIZE*GRID_SIZE, hidden_dim)
        self.classify = nn.Linear(hidden_dim, 4)

    def forward(self, s):
        outs = self.hidden(s)
        outs = F.relu(outs)
        logits = self.classify(outs)
        return logits

#
# Generate model
#
policy_func = PolicyNet().to(device)

## Run agent before training

For comparison, I first run this agent without any training.

In this game, the maximum episode reward, obtained when no reward is lost, is ```10.0```. (See [Readme.md](https://github.com/tsmatz/imitation-learning-tutorials/blob/master/Readme.md) for the game rules in this environment.)<br>
As you can see below, the average reward of the untrained agent is low.

In [4]:
# Get feature vector (of shape (GRID_SIZE*GRID_SIZE,)) of state to feed model
def get_feature(state):
    """
    Return one-hot feature tensor from a 1d state id (or a batch of state ids).
    e.g, 3 --> [0, 0, 0, 1, 0, ... ]
    """
    # get one-hot tensor --> size (GRID_SIZE * GRID_SIZE,) for a single state,
    # or (batch_size, GRID_SIZE * GRID_SIZE) for a batch
    return F.one_hot(torch.tensor(state).to(device), num_classes=GRID_SIZE*GRID_SIZE)

# Pick stochastic samples with policy model
def pick_sample_and_logits(policy, s):
    """
    Stochastically pick up action and logits with policy model.

    Parameters
    ----------
    policy : torch.nn.Module
        Policy network to use
    s : tensor of int[GRID_SIZE*GRID_SIZE]
        The feature (one-hot) of state.

    Returns
    ----------
    action : tensor of int
        The picked action (0-d tensor).
    logits : tensor of float[4]
        Logits defining the categorical distribution.
        This is needed to optimize the model.
    """
    # Get logits from state
    # --> size : (4,)
    inputs = s.unsqueeze(dim=0)
    logits = policy(inputs.float())
    logits = logits.squeeze(dim=0)
    # From logits to probabilities
    # --> size : (4,)
    probs = F.softmax(logits, dim=-1)
    # Pick up action's sample
    # --> size : (1,)
    a = torch.multinomial(probs, num_samples=1)
    # --> size : ()
    a = a.squeeze()

    # Return
    return a, logits

In [5]:
def evaluate_policy(policy, eval_num, verbose=False):
    score_list = []
    for i in range(eval_num):
        score = 0
        done = False
        s = env.reset()
        while not done:
            s_onehot = get_feature(s)
            with torch.no_grad():
                a, _ = pick_sample_and_logits(policy, s_onehot)
            # convert the 0-d tensor to a plain Python int for the environment
            s, r, done = env.step(a.item(), s)
            score += r
        score_list.append(score)
        if verbose:
            print("Processed {:4d} / {:4d} episodes ...".format(i + 1, eval_num), end="\r")
    if verbose:
        print("\nDone")
    return sum(score_list) / len(score_list)

avg = evaluate_policy(policy_func, 300, verbose=True)
print("Average reward is {}.".format(avg))

Processed  300 /  300 episodes ...
Done
Average reward is -66.23.


## Train policy

Now we train our policy with expert data.

> Note : The expert data is located in the ```./expert_data``` folder in this repository. See [this script](./00_generate_expert_trajectories.ipynb) for generating the expert dataset.
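
Judging from how the training cell reads it, the dataset stores all episodes back to back in flat `states` and `actions` arrays, with `timestep_lens` giving the length of each episode. The sketch below shows that slicing on a made-up miniature dataset; the values and the `split_episodes` helper are hypothetical and only illustrate the layout.

```python
# Hypothetical miniature dataset with the same three keys the training cell reads
all_data = {
    "states": [0, 1, 2, 52, 10, 11],
    "actions": [3, 3, 1, 3, 3, 1],
    "timestep_lens": [4, 2],  # two episodes: 4 steps, then 2 steps
}

def split_episodes(data):
    """Yield (states, actions) for each episode from the flat arrays."""
    start = 0
    for length in data["timestep_lens"]:
        end = start + length
        yield data["states"][start:end], data["actions"][start:end]
        start = end

episodes = list(split_episodes(all_data))
for states, actions in episodes:
    print(states, actions)
```

Each yielded pair corresponds to one episode, which is the unit over which the training loop below accumulates the loss before taking an optimizer step.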

In this training, I compute the cross-entropy loss for the categorical distribution and optimize the policy using only the expert dataset.<br>
Unlike [reinforcement learning](https://github.com/tsmatz/reinforcement-learning-tutorials), the reward is unknown to the learner in this training.
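
For a single state, the cross-entropy loss is simply the negative log-probability that the policy assigns to the expert's action. The following is a small worked example in plain Python, assuming made-up logits; the `cross_entropy` helper is written here for illustration and is not the PyTorch function used in the training cell.

```python
import math

def cross_entropy(logits, expert_action):
    """Negative log-probability of the expert's action under softmax(logits)."""
    # subtract the max before exponentiating for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[expert_action]

# The expert chose action 3 (RIGHT) in all three cases
uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], 3)    # policy has no preference
confident = cross_entropy([0.0, 0.0, 0.0, 5.0], 3)  # policy agrees with expert
wrong = cross_entropy([5.0, 0.0, 0.0, 0.0], 3)      # policy prefers UP instead

print(uniform)    # log(4), about 1.386
print(confident)
print(wrong)
```

The loss shrinks as the policy puts more probability on the expert's action and grows when it prefers a different one, so minimizing it pulls the policy toward the demonstrations without ever looking at rewards.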

As you can see below, the average reward rises from about -66 before training to around 8.7 after 10,000 episodes, close to the maximum of ```10.0```. (See [Readme.md](https://github.com/tsmatz/imitation-learning-tutorials/blob/master/Readme.md) for the game rules in this environment.)

> Note : You can batch the inputs to speed up training, but here I run inference one state at a time to keep the code simple and readable, since the training itself is very simple.
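
For reference, a batched version of one training step might look like the sketch below. It is not the code used in this notebook: the `policy` network is a hypothetical stand-in, the (state, action) pairs are made up, and `train_step` is a helper introduced here only for illustration.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
NUM_STATES, NUM_ACTIONS = 50 * 50, 4

# Hypothetical stand-in policy and made-up (state, action) pairs for one episode
policy = nn.Sequential(nn.Linear(NUM_STATES, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS))
opt = torch.optim.AdamW(policy.parameters(), lr=0.001)
states = torch.tensor([0, 1, 2, 52, 53])
actions = torch.tensor([3, 3, 1, 3, 1])

def train_step(states, actions):
    """One optimizer step on a whole episode, computed as a single batch."""
    inputs = F.one_hot(states, num_classes=NUM_STATES).float()  # (batch, 2500)
    logits = policy(inputs)                                      # (batch, 4)
    # reduction="sum" corresponds to summing the per-step losses before backward()
    loss = F.cross_entropy(logits, actions, reduction="sum")
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Repeat the step on the same toy episode and watch the loss go down
losses = [train_step(states, actions) for _ in range(100)]
print(losses[0], losses[-1])
```

The only structural change from the one-by-one version is that the whole episode goes through the network in a single forward pass, which replaces the inner Python loop over (state, action) pairs.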

In [6]:
import pickle

# use the following expert dataset
dest_dir = "./expert_data"
checkpoint_files = ["ckpt0.pkl"]

# create optimizer
opt = torch.optim.AdamW(policy_func.parameters(), lr=0.001)

for ckpt in checkpoint_files:
    # load expert data from pickle
    with open(f"{dest_dir}/{ckpt}", "rb") as f:
        all_data = pickle.load(f)
    all_states = all_data["states"]
    all_actions = all_data["actions"]
    timestep_lens = all_data["timestep_lens"]
    # loop all episodes in demonstration
    current_timestep = 0
    for i, timestep_len in enumerate(timestep_lens):
        # pick up states and actions in a single episode
        states = all_states[current_timestep:current_timestep+timestep_len]
        actions = all_actions[current_timestep:current_timestep+timestep_len]
        # collect loss and optimize (train)
        opt.zero_grad()
        loss = []
        for s, a in zip(states, actions):
            s_onehot = get_feature(s)
            _, logits = pick_sample_and_logits(policy_func, s_onehot)
            l = F.cross_entropy(logits, torch.tensor(a).to(device), reduction="none")
            loss.append(l)
        total_loss = torch.stack(loss, dim=0)
        total_loss.sum().backward()
        opt.step()
        # log
        print("Processed {:5d} episodes in checkpoint {}...".format(i + 1, ckpt), end="\r")
        # run evaluation every 1000 episodes
        if i % 1000 == 999:
            avg = evaluate_policy(policy_func, 200)
            print(f"\nEvaluation result (Average reward): {avg}")
        # proceed to next episode
        current_timestep += timestep_len

Processed  1000 episodes in checkpoint ckpt0.pkl...
Evaluation result (Average reward): 3.13
Processed  2000 episodes in checkpoint ckpt0.pkl...
Evaluation result (Average reward): 7.215
Processed  3000 episodes in checkpoint ckpt0.pkl...
Evaluation result (Average reward): 7.58
Processed  4000 episodes in checkpoint ckpt0.pkl...
Evaluation result (Average reward): 8.285
Processed  5000 episodes in checkpoint ckpt0.pkl...
Evaluation result (Average reward): 8.225
Processed  6000 episodes in checkpoint ckpt0.pkl...
Evaluation result (Average reward): 8.575
Processed  7000 episodes in checkpoint ckpt0.pkl...
Evaluation result (Average reward): 8.485
Processed  8000 episodes in checkpoint ckpt0.pkl...
Evaluation result (Average reward): 8.715
Processed  9000 episodes in checkpoint ckpt0.pkl...
Evaluation result (Average reward): 8.63
Processed 10000 episodes in checkpoint ckpt0.pkl...
Evaluation result (Average reward): 8.705
