# Behaviour Cloning

Behaviour cloning is a type of imitation learning where an agent learns to perform tasks by mimicking expert demonstrations. The goal is to train a policy that can replicate the behavior of an expert based on observed state-action pairs.

We will use the `LunarLander-v2` environment from Gymnasium (the maintained successor of OpenAI Gym) as our testbed. The agent will learn to land the spacecraft from only a few demonstrations performed by an *Expert*. You can use any previously trained agent to generate these demonstrations; for simplicity, we provide a set of pre-recorded demonstrations, all with returns above 200 points.

## Loading Expert Demonstrations

Demonstrations are stored in the `datasets/` folder as `.npz` files. Each file contains two arrays: `observations` and `actions`. The `observations` array holds the environment states, and the `actions` array holds the corresponding *discrete* actions taken by the expert, one per observation.

The file `lunar_lander_10.npz` contains 10 expert demonstrations (full episodes).


In [1]:
import numpy as np
# Load the arrays stored in the .npz file into a dictionary
loaded = np.load("datasets/lunar_lander_10.npz")
dataset = dict(loaded)
print(dataset['observations'].shape)
print(dataset['actions'].shape)

(2838, 8)
(2838,)
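
Before training, it can help to check how the expert's actions are distributed, since a heavily skewed distribution makes the classification problem harder. The sketch below uses a synthetic stand-in with the same shapes as the loaded dataset (an assumption, so that the snippet runs without the file); with the real data, reuse the `dataset` dictionary from the cell above.

```python
import numpy as np

# Synthetic stand-in with the same layout as the loaded dataset.
# Assumption: replace this with the real `dataset` dictionary from the cell above.
rng = np.random.default_rng(0)
dataset = {
    "observations": rng.normal(size=(2838, 8)).astype(np.float32),
    "actions": rng.integers(0, 4, size=2838),
}

n_actions = 4  # LunarLander has four discrete actions

# Count how many times each action appears in the demonstrations.
counts = np.bincount(dataset["actions"], minlength=n_actions)
freqs = counts / counts.sum()

for action, (count, freq) in enumerate(zip(counts, freqs)):
    print(f"action {action}: {count} samples ({freq:.1%})")
```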


## Neural network architecture to clone the Expert

Create a neural network that takes the continuous observations as input and outputs one value (a *logit*) per discrete action available in the environment. Also add a couple of hidden layers, as in the previous agents. We will use the **Cross-Entropy loss function**, which is suitable for classification problems with C classes (here, C is the number of actions). It computes the cross-entropy between the predicted *logits* and the target class. The output layer should therefore not include a `Softmax` activation, because the loss function applies it internally to the logits.
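
To make the loss described above concrete, here is a small worked example in plain Python of the per-sample cross-entropy computed directly from logits, that is, `-log(softmax(logits)[target])`. The function name `cross_entropy_from_logits` and the logit values are illustrative only; in the notebook the PyTorch loss does this for you.

```python
import math

def cross_entropy_from_logits(logits, target):
    """Cross-entropy of one sample: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

logits = [2.0, 0.5, -1.0, 0.0]  # one raw network output per action

# The loss is small when the target is the action with the highest logit ...
loss_correct = cross_entropy_from_logits(logits, target=0)
# ... and large when the target is an action the network considers unlikely.
loss_wrong = cross_entropy_from_logits(logits, target=2)

print(f"target=0: {loss_correct:.3f}   target=2: {loss_wrong:.3f}")
```

Because the softmax is already inside the loss, applying `Softmax` in the network as well would normalize the outputs twice and distort the gradients.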

In [None]:
import torch as T
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np

class BCNetwork(nn.Module):
    def __init__(self, input_dims, fc1_dims, fc2_dims, n_actions):
        # Create the neural network architecture
        ...

    def forward(self, state):
        # Define the forward pass
        ...
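
One possible way to fill in the skeleton above is sketched here, assuming two fully connected hidden layers with ReLU activations. The class name `BCNetworkSketch` and the layer attribute names are illustrative choices, not part of the original exercise.

```python
import torch as T
import torch.nn as nn
import torch.nn.functional as F

class BCNetworkSketch(nn.Module):
    """Illustrative MLP: observation -> two hidden layers -> one logit per action."""

    def __init__(self, input_dims, fc1_dims, fc2_dims, n_actions):
        super().__init__()  # required so that nn.Module registers the layers
        self.fc1 = nn.Linear(input_dims, fc1_dims)
        self.fc2 = nn.Linear(fc1_dims, fc2_dims)
        self.out = nn.Linear(fc2_dims, n_actions)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.out(x)  # raw logits: no Softmax, the loss applies it

# Quick shape check with a dummy batch of 5 observations of size 8.
net = BCNetworkSketch(input_dims=8, fc1_dims=64, fc2_dims=64, n_actions=4)
logits = net(T.zeros(5, 8))
print(logits.shape)
```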

## Behavior Cloning Class

Create a `BehaviorCloning` class that encapsulates the training and evaluation logic. The class should include methods for:
- Initializing the neural network, optimizer, and loss function.
- Shuffling and batching the expert demonstrations.
- Training the model using the expert demonstrations as a regular supervised learning problem.
- Querying the trained model, returning the action with the highest logit for a given observation.
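
The shuffling and batching step in the list above can be sketched with a single random permutation of indices, which keeps every observation paired with its action. The helper name `iterate_minibatches` and the toy arrays are hypothetical; the class below is free to organize this differently.

```python
import numpy as np

def iterate_minibatches(observations, actions, batch_size, rng):
    """Yield shuffled (observations, actions) mini-batches for one epoch."""
    idx = rng.permutation(len(observations))  # one permutation shared by both arrays
    for start in range(0, len(idx), batch_size):
        batch = idx[start:start + batch_size]
        yield observations[batch], actions[batch]

# Toy data: row i of `obs` is [2*i, 2*i + 1], and its action is i.
rng = np.random.default_rng(0)
obs = np.arange(20, dtype=np.float32).reshape(10, 2)
act = np.arange(10)

batches = list(iterate_minibatches(obs, act, batch_size=4, rng=rng))
for ob, ac in batches:
    print(ob.shape, ac)
```

The last batch is smaller when the dataset size is not a multiple of the batch size, which is harmless for the cross-entropy loss.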

In [None]:

class BehaviorClonning():
    def __init__(self, gamma, input_dims, n_actions, batch_size, epochs, dataset, lr=0.003, hidden_layers=64):
        # Save all parameters ...
        ...
        # Initialize the policy network as a BCNetwork, 
        ...
        # the optimizer and the Cross Entropy Loss function ...
        ...

    def shuffle_dataset(self):
        # Shuffle the dataset ...
        ...

    def learn(self):
        for e in range(self.epochs):
            # Shuffle dataset at the start of each epoch ...
            for i in range(0, self.dataset['observations'].shape[0], self.batch_size):
                # Take a batch of observations and actions and convert them to tensors on the appropriate device ...
                ...
                # Training step ... (zero grad, forward pass, compute loss, backward pass, optimizer step) ...
                ...
                # keep track of the loss to print it later ...
                ...
    def save(self, path):
        T.save(self.policy.state_dict(), path)
    
    def load(self, path):
        self.policy.load_state_dict(T.load(path))
    
    def predict(self, obs):
        # Prepare the observation tensor ... (transform to tensor, unsqueeze(0), to device ...)
        ...
        # Forward pass through the policy network without gradient calculation
        # and return the action with the highest logit ...
        ...
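
The training step and the greedy prediction outlined in the skeleton above can be sketched as follows. This is a minimal illustration under stated assumptions: the policy is a small stand-in `nn.Sequential` model, the batch is synthetic, and the optimizer is Adam; none of these choices is prescribed by the exercise.

```python
import torch as T
import torch.nn as nn
import torch.optim as optim

T.manual_seed(0)
device = T.device("cuda" if T.cuda.is_available() else "cpu")

# Stand-in policy (assumption): any network that outputs one logit per action works.
policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4)).to(device)
optimizer = optim.Adam(policy.parameters(), lr=0.003)
loss_fn = nn.CrossEntropyLoss()

# Synthetic batch: float32 observations and int64 class indices as targets.
obs_batch = T.randn(32, 8, device=device)
act_batch = T.randint(0, 4, (32,), device=device)

losses = []
for _ in range(50):
    optimizer.zero_grad()              # clear gradients from the previous step
    logits = policy(obs_batch)         # forward pass
    loss = loss_fn(logits, act_batch)  # cross-entropy on raw logits
    loss.backward()                    # backward pass
    optimizer.step()                   # parameter update
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")

# Greedy prediction for a single observation: add a batch dimension,
# disable gradient tracking, and take the index of the largest logit.
obs = T.randn(8)
with T.no_grad():
    action = policy(obs.unsqueeze(0).to(device)).argmax(dim=1).item()
print(action)
```

With the real dataset, the NumPy actions need to be converted to an integer tensor of type `T.long`, since `nn.CrossEntropyLoss` expects class indices in that format.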

## Train the Behavior Cloning Agent
Train the behavior cloning agent using the expert demonstrations. Monitor the training loss to ensure that the model is learning effectively. Are 10 demonstrations enough to achieve good performance? How many epochs are required to reach convergence? 

In [None]:
# Choose hyperparameters to create the BehaviorClonning agent and 
# learn from the dataset
agent = BehaviorClonning(gamma=..., input_dims=8, n_actions=4, 
                         batch_size=..., epochs=..., dataset=dataset)
agent.learn()

## Agent Evaluation

Evaluate the trained behavior cloning agent in the `LunarLander-v2` environment. Run multiple episodes and record the average return to assess how well the agent has learned to mimic the expert's behavior. Visualize some episodes to qualitatively evaluate the agent's performance.

In [None]:
import gymnasium as gym
import imageio

# Test the BehaviorCloning agent with the defined environment
env = gym.make("LunarLander-v2", render_mode="rgb_array")
obs = env.reset()[0]
done = False
total_reward = 0.0
i = 0
frames = [] # For storing the frames of the environment
while not done:
    # Predict the action using the BC agent and step the environment
    # (this step should set `reward` and update `obs` and `done`)
    ...
    # Compute total reward and store frames for visualization
    ...
    total_reward += reward
    
print(f"Total reward: {total_reward}")
env.close()
imageio.mimsave("bc_lunarlander.gif", frames, duration=0.02)
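
The cell above runs a single episode. The multi-episode evaluation described at the start of this section can be sketched as a small helper that averages the return over several episodes. The names `evaluate`, `predict_fn`, and `ToyEnv` are hypothetical; `ToyEnv` is only a stand-in that follows the Gymnasium `reset`/`step` interface so that the snippet runs without Box2D installed.

```python
def evaluate(env, predict_fn, n_episodes=10):
    """Run `n_episodes` episodes and return (mean return, list of returns)."""
    returns = []
    for _ in range(n_episodes):
        obs, _info = env.reset()
        done, total = False, 0.0
        while not done:
            action = predict_fn(obs)
            obs, reward, terminated, truncated, _info = env.step(action)
            total += reward
            done = terminated or truncated  # an episode ends in either case
        returns.append(total)
    return sum(returns) / len(returns), returns

class ToyEnv:
    """Hypothetical stand-in: episodes last 5 steps and the reward equals the action."""

    def reset(self):
        self.t = 0
        return 0.0, {}

    def step(self, action):
        self.t += 1
        return float(self.t), float(action), self.t >= 5, False, {}

mean_return, returns = evaluate(ToyEnv(), predict_fn=lambda obs: 1, n_episodes=3)
print(f"Mean return over {len(returns)} episodes: {mean_return}")
```

With the real setup, `env` would be the Gymnasium environment created above and `predict_fn` would be the agent's `predict` method.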