# CPSC 533V: Assignment 3 - Behavioral Cloning and Deep Q Learning

## 48 points total (9% of final grade)

---
This assignment will help you transition from tabular approaches, topic of HW 2, to deep neural network approaches. You will implement the [Atari DQN / Deep Q-Learning](https://arxiv.org/abs/1312.5602) algorithm, which arguably kicked off the modern Deep Reinforcement Learning craze.

In this assignment we will use PyTorch as our deep learning framework. To familiarize yourself with PyTorch, your first task is to use a behavior cloning (BC) approach to learn a policy. Behavior cloning is a supervised learning method in which there exists a dataset of expert demonstrations (state-action pairs) and the goal is to learn a policy $\pi$ that mimics this expert. At any given state, your policy should choose the same action the expert would.

Since BC avoids the need to collect data from the policy you are trying to learn, it is relatively simple.
This makes it a nice stepping stone for implementing DQN. Furthermore, BC is relevant to modern approaches: for example, it is used as an initialization for systems like [AlphaGo][go] and [AlphaStar][star], which then use RL to further adapt the BC result.

<!--

I feel like this might be better suited to going lower in the document:

Unfortunately, in many tasks it is impossible to collect good expert demonstrations, making

It's not always possible to have good expert demonstrations for a task in an environment, and this is where reinforcement learning comes in handy. Through the reward signal retrieved by interacting with the environment, the agent learns by itself what a good policy is and can learn to outperform the experts.

-->

Goals:
- Familiarize yourself with PyTorch and its API, including models, datasets, and dataloaders.
- Implement a supervised learning approach (behavioral cloning) to learn a policy.
- Implement the DQN objective and learn a policy through environment interaction.

[go]:  https://deepmind.com/research/case-studies/alphago-the-story-so-far
[star]: https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii

## Submission information

- Complete the assignment by editing and executing the associated Python files.
- Copy and paste the code and the terminal output requested into the predefined cells of this Jupyter notebook.
- When done, upload the completed Jupyter notebook (ipynb file) to Canvas.

## Task 0: Preliminaries

### PyTorch

If you have never used PyTorch before, we recommend you follow this [60 Minutes Blitz][blitz] tutorial from the official website. It should give you enough context to be able to complete the assignment.


**If you have issues, post questions to Piazza**

### Installation

To install all required python packages:

```
python3 -m pip install -r requirements.txt
```

### Debugging


You can insert `import ipdb; ipdb.set_trace()` anywhere in your code; execution will pause at that line and drop you into an interactive debugger, where you can inspect variables and test out expressions. We recommend this as an effective way to debug the algorithms.


[blitz]: https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html

## Task 1: Behavioral Cloning

Behavioral Cloning is a type of supervised learning in which you are given a dataset of expert demonstration tuples $(s, a)$ and the goal is to learn a policy function $\hat a = \pi(s)$ such that $\hat a = a$.

The optimization objective is $\min_\theta D(\pi(s), a)$, where $\theta$ are the parameters of the policy $\pi$ (in our case, the weights of a neural network) and $D$ measures the difference between the predicted and expert actions.
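For a discrete action space such as CartPole's, one common choice for $D$ is the cross-entropy between the policy's predicted action distribution and the expert's action. The sketch below illustrates that choice in plain Python. The helper names `softmax` and `cross_entropy` and the example numbers are made up for illustration and are not part of the provided files.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, expert_action):
    """Negative log-probability the policy assigns to the expert's action."""
    probs = softmax(logits)
    return -math.log(probs[expert_action])

# Hypothetical network outputs for one state with two actions.
logits = [2.0, 0.5]
loss_agree = cross_entropy(logits, expert_action=0)     # expert picked the preferred action
loss_disagree = cross_entropy(logits, expert_action=1)  # expert picked the other action
print(loss_agree < loss_disagree)
```

In PyTorch, `torch.nn.CrossEntropyLoss` performs the same computation on a batch of raw network outputs and integer action labels.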

---

Before starting, we suggest reading through the provided files.

For Behavioral Cloning, the important files to understand are: `model.py`, `dataset.py` and `bc.py`.

- `model.py` has the skeleton for the model (which you will complete in the following questions).

- `dataset.py` has the skeleton for the dataset the model is trained with.

- `bc.py` has the structure for training the model with the dataset.


### 1.1 Dataset

We provide a pickle file with pre-collected expert demonstrations on CartPole from which to learn the policy $\pi$. The data has been collected from an expert policy on the environment, with the addition of a small amount of Gaussian noise to the actions.

The pickle file contains a list of (state, action) tuples stored as `numpy` values, in the following form:

```
[(state s, action a), (state s, action a), (state s, action a), ...]
```

In the `dataset.py` file, we provide skeleton code for creating a custom dataset. The provided code shows how to load the file.

Your goal is to override the `__getitem__` method so that it returns a dictionary of tensors of the correct type.

Hint: Look in the `bc.py` file to understand how the dataset is used.
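To get a feel for the data layout described above, it can help to load a list of `(state, action)` tuples and compute a few summary statistics. The sketch below uses a tiny made-up dataset and an in-memory pickle round trip instead of the real file, and plain Python lists instead of NumPy arrays; all values are hypothetical.

```python
import io
import pickle

# Toy stand-in for the expert data: a list of (state, action) tuples.
# The real file stores NumPy values; plain lists are used here for simplicity.
toy_data = [
    ([0.01, -0.04, -0.02, 0.05], 0),
    ([0.02, -0.24, -0.01, 0.33], 1),
    ([-0.03, 0.15, 0.03, -0.23], 1),
]

# Round-trip through pickle in memory, mimicking loading the file from disk.
buffer = io.BytesIO()
pickle.dump(toy_data, buffer)
buffer.seek(0)
data = pickle.load(buffer)

num_samples = len(data)
state_dim = len(data[0][0])

# Per-dimension minimum and maximum over all states.
states = [s for s, _ in data]
mins = [min(s[d] for s in states) for d in range(state_dim)]
maxs = [max(s[d] for s in states) for d in range(state_dim)]
actions_seen = sorted({a for _, a in data})

print(num_samples, state_dim)
print(mins)
print(maxs)
print(actions_seen)
```

The same quantities (dataset size, state dimensionality, per-dimension range, and the set of actions) are what the questions in this section ask about for the real dataset.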

Answer the following questions:

- **[QUESTION 2 points]** Insert your code in the placeholder below.

In [1]:
# PLACEHOLDER TO INSERT YOUR __getitem__ method here

def __getitem__(self, index):
    item = self.data[index]
    # TODO YOUR CODE HERE
    dic={'state': item[0], 'action': item[1]}
    return dic

In [6]:
import gym
import torch
import numpy as np
from eval_policy import eval_policy, device
from model import MyModel
from dataset import Dataset
ENV_NAME = 'CartPole-v0'
dataset = Dataset(data_path="{}_dataset.pkl".format(ENV_NAME))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=10, num_workers=4)


In [19]:
for iteration, data in enumerate(dataloader):
#     data = {k: v.to(device) for k, v in data.items()}
    print(data.items())

dict_items([('state', tensor([[ 0.0090, -0.0427, -0.0198,  0.0450],
        [ 0.0082, -0.2375, -0.0189,  0.3313],
        [ 0.0034, -0.0421, -0.0123,  0.0327],
        [ 0.0026, -0.2370, -0.0117,  0.3215],
        [-0.0022, -0.0418, -0.0052,  0.0252],
        [-0.0030, -0.2368, -0.0047,  0.3162],
        [-0.0077, -0.0416,  0.0016,  0.0220],
        [-0.0086, -0.2368,  0.0020,  0.3152],
        [-0.0133, -0.0417,  0.0083,  0.0232],
        [-0.0141, -0.2369,  0.0088,  0.3185]], dtype=torch.float64)), ('action', tensor([0, 1, 0, 1, 0, 1, 0, 1, 0, 1]))])
dict_items([('state', tensor([[-0.0189, -0.0419,  0.0152,  0.0286],
        [-0.0197, -0.2373,  0.0157,  0.3260],
        [-0.0245, -0.0424,  0.0223,  0.0383],
        [-0.0253, -0.2378,  0.0230,  0.3380],
        [-0.0301, -0.0430,  0.0298,  0.0526],
        [-0.0309,  0.1517,  0.0308, -0.2305],
        [-0.0279, -0.0439,  0.0262,  0.0718],
        [-0.0288,  0.1509,  0.0277, -0.2125],
        [-0.0257, -0.0446,  0.0234,  0.0888],
     

In [3]:
tensor = torch.rand(3, 4)

print(f"Shape of tensor: {tensor.shape}")
print(f"Datatype of tensor: {tensor.dtype}")
print(f"Device tensor is stored on: {tensor.device}")

Shape of tensor: torch.Size([3, 4])
Datatype of tensor: torch.float32
Device tensor is stored on: cpu


In [4]:
z = np.zeros([len(dataset), 4])
for i in range(len(dataset)):
    z[i] = dataset.data[i][0]

In [None]:
z

In [None]:
np.min(z,0)

In [None]:
np.max(z,0)

In [None]:
np.unique(z,0)

- **[QUESTION 2 points]** How big is the dataset provided?

99660*5

- **[QUESTION 2 points]** What is the dimensionality of $s$ and what range does each dimension of $s$ span?  I.e., how much of the state space does the expert data cover?

The state has 4 dimensions. Per-dimension minima are `[-0.72267057, -0.43303689, -0.05007198, -0.38122098]` and per-dimension maxima are `[2.39948596, 1.84697975, 0.14641718, 0.47143314]`.

- **[QUESTION 2 points]** What are the dimensionalities and ranges of the action $a$ in the dataset (how much of the action space does the expert data cover)?

Dim = 1, actions are 0 and 1.


### 1.2 Environment

Recall the state and action space of CartPole, from the previous assignment.

- **[QUESTION 2 points]** Considering the full state and action spaces, do you think the provided expert dataset has good coverage?  Why or why not? How might this impact the performance of our cloned policy?

I think the states in this dataset span only part of the full CartPole state space. If the policy drifts outside the covered range, or an episode starts outside it, the cloned policy has no expert examples to imitate there and will run into trouble.

### 1.3 Model

The file `model.py` provides skeleton code for the model. Your goal is to create the architecture of the network by adding layers that map the input to output.

You will need to update the `__init__` method and the `forward` method.

The `select_action` method has already been written for you.  This should be used when running the policy in the environment, while the `forward` function should be used at training time.
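Conceptually, selecting an action from the network's output means taking the index of the largest entry among the per-action outputs. The plain-Python sketch below shows that idea for a single state; the helper `greedy_action` and the output values are hypothetical and only illustrate the argmax step.

```python
def greedy_action(outputs):
    """Return the index of the largest output (ties go to the lowest index)."""
    return max(range(len(outputs)), key=lambda i: outputs[i])

# Hypothetical per-action outputs for a single state with two actions.
outputs = [-0.3, 1.2]
print(greedy_action(outputs))  # index of the larger entry
```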

- **[QUESTION 5 points]** Insert your code in the placeholder below.

In [20]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class MyModel(nn.Module):
    def __init__(self, state_size, action_size):
        super(MyModel, self).__init__()
        self.fc1 = nn.Linear(state_size, 120)  
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, action_size)

    def forward(self, x):
        # TODO YOUR CODE HERE FOR THE FORWARD PASS
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
#         x = F.one_hot(x)
        return x

    def select_action(self, state):
        self.eval()
        x = self.forward(state)
        self.train()
        return x.max(1)[1].view(1, 1).to(torch.long)


Answer the following questions:

- **[QUESTION 2 points]** What is the input of the network?

**YOUR ANSWER HERE**

- **[QUESTION 2 points]** What is the output?

**YOUR ANSWER HERE**


### 1.4 Training

The file `bc.py` is the entry point for training your behavioral cloning model. The skeleton and the main components are already there.

The missing parts for you to do are:

- Initializing the model
- Choosing a loss function
- Choosing an optimizer
- Playing with hyperparameters to train your model.

- **[QUESTION 5 points]** Insert your code in the placeholder below.

In [22]:
# PLACEHOLDER FOR YOUR CODE HERE
# HOW DID YOU INITIALIZE YOUR MODEL, OPTIMIZER AND LOSS FUNCTIONS? PASTE HERE YOUR FINAL CODE
# NOTE: YOU CAN KEEP THE FOLLOWING LINES COMMENTED OUT, AS RUNNING THIS CELL WILL PROBABLY RESULT IN ERRORS
import torch.optim as optim

model = MyModel(4,2)
optimizer = optim.SGD(model.parameters(), lr=0.01)

# output = net(input)
# target = 
# loss_function = F.binary_cross_entropy(output, target)

In [33]:
model(torch.tensor([1,1,1,1],dtype=double))

NameError: name 'double' is not defined

In [12]:
import gym
import torch
from eval_policy import eval_policy, device
from model import MyModel
from dataset import Dataset

BATCH_SIZE = 64
TOTAL_EPOCHS = 100
LEARNING_RATE = 10e-4
PRINT_INTERVAL = 500
TEST_INTERVAL = 2

ENV_NAME = 'CartPole-v0'

dataset = Dataset(data_path="{}_dataset.pkl".format(ENV_NAME))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, num_workers=4)
dataloader

for epoch in range(1, 5 + 1):
    for iteration, data in enumerate(dataloader):
        data = {k: v.to(device) for k, v in data.items()}

In [14]:
data

{'state': tensor([[ 1.7024,  1.1120,  0.0678,  0.1011],
         [ 1.7247,  1.3061,  0.0698, -0.1695],
         [ 1.7508,  1.1101,  0.0664,  0.1444],
         [ 1.7730,  1.3042,  0.0693, -0.1266],
         [ 1.7991,  1.1081,  0.0668,  0.1871],
         [ 1.8212,  1.3022,  0.0705, -0.0838],
         [ 1.8473,  1.1062,  0.0688,  0.2302],
         [ 1.8694,  1.3003,  0.0734, -0.0400],
         [ 1.8954,  1.1042,  0.0726,  0.2749],
         [ 1.9175,  1.2982,  0.0781,  0.0060],
         [ 1.9435,  1.1020,  0.0782,  0.3223],
         [ 1.9655,  1.2960,  0.0847,  0.0553]], dtype=torch.float64),
 'action': tensor([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1])}

In [16]:
model(data['state'])

RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'mat1' in call to _th_addmm

In [1]:
import gym
import torch
from eval_policy import eval_policy, device
from model import MyModel
from dataset import Dataset
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F


BATCH_SIZE = 64
TOTAL_EPOCHS = 100
LEARNING_RATE = 10e-4
PRINT_INTERVAL = 500
TEST_INTERVAL = 2

ENV_NAME = 'CartPole-v0'

dataset = Dataset(data_path="{}_dataset.pkl".format(ENV_NAME))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, num_workers=4)

env = gym.make(ENV_NAME)

# TODO INITIALIZE YOUR MODEL HERE
model = MyModel(4,2)
# optimizer = optim.SGD(model.parameters(), lr=LEARNING_RATE)

def train_behavioral_cloning():
    
    # TODO CHOOSE A OPTIMIZER AND A LOSS FUNCTION FOR TRAINING YOUR NETWORK
    optimizer = optim.SGD(model.parameters(), lr=LEARNING_RATE)
    loss_function = nn.CrossEntropyLoss()

    gradient_steps = 0

    for epoch in range(1, TOTAL_EPOCHS + 1):
        for iteration, data in enumerate(dataloader):
            data = {k: v.to(device) for k, v in data.items()}
            #print(data)
            output = model(data['state'].float())

            loss = loss_function(output, data["action"])

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if gradient_steps % PRINT_INTERVAL == 0:
                print('[epoch {:4d}/{}] [iter {:7d}] [loss {:.5f}]'
                    .format(epoch, TOTAL_EPOCHS, gradient_steps, loss.item()))
            
            gradient_steps += 1

        if epoch % TEST_INTERVAL == 0:
            score = eval_policy(policy=model, env=ENV_NAME)
            print('[Test on environment] [epoch {}/{}] [score {:.2f}]'
                .format(epoch, TOTAL_EPOCHS, score))

    model_name = "behavioral_cloning_{}.pt".format(ENV_NAME)
    print('Saving model as {}'.format(model_name))
    torch.save(model.state_dict(), model_name)


# if __name__ == "__main__":
#     train_behavioral_cloning()

train_behavioral_cloning()

[epoch    1/100] [iter       0] [loss 0.68736]
[epoch    1/100] [iter     500] [loss 0.67141]
[epoch    1/100] [iter    1000] [loss 0.65258]
[epoch    1/100] [iter    1500] [loss 0.63779]
[epoch    2/100] [iter    2000] [loss 0.62024]
[epoch    2/100] [iter    2500] [loss 0.60152]
[epoch    2/100] [iter    3000] [loss 0.57826]
[Test on environment] [epoch 2/100] [score 71.10]
[epoch    3/100] [iter    3500] [loss 0.54726]
[epoch    3/100] [iter    4000] [loss 0.52653]
[epoch    3/100] [iter    4500] [loss 0.48766]
[epoch    4/100] [iter    5000] [loss 0.47294]
[epoch    4/100] [iter    5500] [loss 0.42332]
[epoch    4/100] [iter    6000] [loss 0.42854]
[Test on environment] [epoch 4/100] [score 74.60]
[epoch    5/100] [iter    6500] [loss 0.41422]
[epoch    5/100] [iter    7000] [loss 0.39022]
[epoch    5/100] [iter    7500] [loss 0.33793]
[epoch    6/100] [iter    8000] [loss 0.34552]
[epoch    6/100] [iter    8500] [loss 0.32391]
[epoch    6/100] [iter    9000] [loss 0.30345]
[Test o

You can run your code by doing:

```
python3 bc.py
```

**Throughout this assignment, the code in `eval_policy.py` will be your best friend.** At any time, you can test your model by passing the path to the model weights and the environment name as arguments to the following command:

```
python3 eval_policy.py --model-path /path/to/model/weights --env ENV_NAME
```

In [7]:
import gym
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def eval_policy(policy, env='CartPole-v0', num_test_episodes=10, render=False, verbose=False):
    test_env = gym.make(env)
    test_rewards = []
    for i in range(num_test_episodes):
        state = test_env.reset()
        episode_total_reward = 0
        while True:
            state = torch.tensor([state], device=device, dtype=torch.float32)
            action = policy.select_action(state).cpu().numpy()[0][0]
            next_state, reward, done, _ = test_env.step(action)
            
            if render:
                test_env.render(mode='human')
            
            episode_total_reward += reward
            state = next_state
            if done:
                if verbose:
                    print('[Episode {:4d}/{}] [reward {:.1f}]'
                        .format(i, num_test_episodes, episode_total_reward))
                break
        test_rewards.append(episode_total_reward)
    test_env.close()
    return sum(test_rewards)/num_test_episodes


if __name__ == "__main__":
    import argparse
    from model import MyModel

    parser = argparse.ArgumentParser()
    parser.add_argument('--model-path', default=None, type=str,
        help='Path to the model weights.')
    parser.add_argument('--env', default=None, type=str,
        help='Name of the environment.')
    
    args = parser.parse_args()
    env = gym.make(args.env)
    model = MyModel(state_size=len(env.reset()), action_size=env.action_space.n)
    model.load_state_dict(torch.load(args.model_path))
    model = model.to(device)
    env.close()

    eval_policy(policy=model, env=args.env, render=True, verbose=True)

usage: ipykernel_launcher.py [-h] [--model-path MODEL_PATH] [--env ENV]
ipykernel_launcher.py: error: unrecognized arguments: -f C:\Users\12365\AppData\Roaming\jupyter\runtime\kernel-0657fba0-659a-4008-818a-49f6033ea0b9.json


SystemExit: 2

In [8]:
%tb

SystemExit: 2

In [4]:
## PASTE YOUR TERMINAL OUTPUT HERE
# NOTE: TO HAVE FEWER LINES PRINTED, YOU CAN SET THE VARIABLE PRINT_INTERVAL TO 5 or 10
[Episode    0/10] [reward 200.0]
[Episode    1/10] [reward 200.0]
[Episode    2/10] [reward 200.0]
[Episode    3/10] [reward 200.0]
[Episode    4/10] [reward 200.0]
[Episode    5/10] [reward 200.0]
[Episode    6/10] [reward 193.0]
[Episode    7/10] [reward 200.0]
[Episode    8/10] [reward 200.0]
[Episode    9/10] [reward 200.0]

**[QUESTION 2 points]** Did you manage to learn a good policy? How consistent is the reward you are getting?

**YOUR ANSWER HERE**

## Task 2: Deep Q Learning

There are two main issues with the behavior cloning approach.

- First, we are not always lucky enough to have access to a dataset of expert demonstrations.
- Second, replicating an expert policy suffers from compounding error. The policy $\pi$ only sees these "perfect" examples and has no knowledge of how to recover from states not visited by the expert. For this reason, as soon as it is presented with a state that is off the expert trajectory, it will perform poorly and will continue to deviate from a good trajectory without the possibility of recovering from errors.

---
The second task consists of solving the environment from scratch using RL, specifically the DQN algorithm, to learn a policy $\pi$.

For this task, familiarize yourself with the file `dqn.py`. We are going to re-use the file `model.py` for the model you created in the previous task.

Your task is very similar to the one in the previous assignment: implement the Q-learning algorithm. In this version, however, the Q-function is approximated with a neural network.

The algorithm (excerpted from Section 6.5 of [Sutton and Barto's book](http://incompleteideas.net/book/RLbook2018.pdf)) is given below:

![DQN algorithm](https://i.imgur.com/Mh4Uxta.png)

### 2.0 Think about your model...



**[QUESTION 2 points]** In DQN, we are using the same model as in task 1 for behavioral cloning. In both tasks the model receives as input the state and in both tasks the model outputs something that has the same dimensionality as the number of actions. These two outputs, though, represent very different things. What is each one representing?

**YOUR ANSWER HERE**

### 2.1 Update your Q-function

Complete the `optimize_model` function. This function receives as input a `state`, an `action`, the `next_state`, the `reward`, and `done`, representing the tuple $(s_t, a_t, s_{t+1}, r_t, done_t)$. Your task is to update your Q-function as shown in the [Atari DQN paper](https://arxiv.org/abs/1312.5602). For now, don't be concerned with the experience replay buffer; we'll get to that later.

![Loss function](https://i.imgur.com/tpTsV8m.png)
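The core of the update is the one-step target $y = r_t$ for terminal transitions and $y = r_t + \gamma \max_{a'} Q(s_{t+1}, a')$ otherwise, followed by a squared error against the current estimate $Q(s_t, a_t)$. The sketch below works through that arithmetic on plain numbers. The helper names `td_target` and `td_loss`, the discount value, and the Q-values are assumptions for illustration, not the contents of `dqn.py`.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step target: r if the episode ended, else r + gamma * max_a' Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def td_loss(q_value, target):
    """Squared error between the current estimate Q(s, a) and the target."""
    return (target - q_value) ** 2

# Hypothetical numbers for a single transition.
q_sa = 1.0            # current estimate Q(s_t, a_t)
next_q = [0.5, 2.0]   # Q(s_{t+1}, a') for each action a'
y = td_target(reward=1.0, next_q_values=next_q, done=False, gamma=0.9)
print(y)              # 1.0 + 0.9 * 2.0
print(td_loss(q_sa, y))
```

When doing this with a neural network in PyTorch, the target is usually treated as a constant, for example by computing it under `torch.no_grad()`, so that gradients flow only through $Q(s_t, a_t)$.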

- **[QUESTION 8 points]** Insert your code in the placeholder below.

In [7]:
## PLACEHOLDER TO INSERT YOUR optimize_model function here:

# def optimize_model(state, action, next_state, reward, done):
#     # TODO given a tuple (s_t, a_t, s_{t+1}, r_t, done_t) update your model weights

#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()

### 2.2 $\epsilon$-greedy strategy

You will need a strategy to explore your environment. The standard strategy is to use $\epsilon$-greedy. Implement it in the `choose_action` function template.
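As a reminder of how $\epsilon$-greedy works: with probability $\epsilon$ the agent picks a uniformly random action, and otherwise it picks the action with the highest estimated Q-value. The sketch below operates on a plain list of Q-values. The function name `epsilon_greedy`, its `rng` parameter, and the numbers are hypothetical and independent of the `choose_action` template.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [0.1, 0.7]  # hypothetical Q(s, a) for two actions
rng = random.Random(0)

# epsilon = 0 never explores, so the greedy action is always returned.
print(epsilon_greedy(q_values, epsilon=0.0, rng=rng))

# epsilon = 1 always explores, so both actions show up over many draws.
picks = [epsilon_greedy(q_values, epsilon=1.0, rng=rng) for _ in range(1000)]
print(set(picks))
```

A common design choice is to start with a large $\epsilon$ and decay it during training, and to act greedily ($\epsilon = 0$) at test time.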

- **[QUESTION 5 points]** Insert your code in the placeholder below.

In [9]:
## PLACEHOLDER TO INSERT YOUR choose_action function here:

# def choose_action(state, test_mode=False):
#     # TODO implement an epsilon-greedy strategy
#     raise NotImplementedError()

### 2.3 Train your model

Try to train a model in this way.

You can run your code by doing:

```
python3 dqn.py
```

**[QUESTION 2 points]** How many episodes does it take to learn (i.e., reach a good reward)?

**YOUR ANSWER HERE**

In [1]:
## PASTE YOUR TERMINAL OUTPUT HERE
# NOTE: TO HAVE FEWER LINES PRINTED, YOU CAN SET THE VARIABLE PRINT_INTERVAL TO 5 or 10

### 2.4 Add the Experience Replay Buffer

If you read the DQN paper (and as you can see from the algorithm picture above), the authors make use of an experience replay buffer to learn faster. We provide an implementation in the file `replay_buffer.py`. Update the `train_reinforcement_learning` code to push a tuple to the replay buffer and to sample a batch for the `optimize_model` function.
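For intuition about what a replay buffer does, the sketch below implements a minimal fixed-capacity buffer with uniform sampling. It is not the implementation in `replay_buffer.py`: the class name `SimpleReplayBuffer`, its method signatures, and the toy transitions are assumptions made for illustration.

```python
import random
from collections import deque

class SimpleReplayBuffer:
    """Fixed-capacity FIFO store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)  # oldest entries are dropped first
        self.rng = random.Random(seed)

    def push(self, state, action, next_state, reward, done):
        self.memory.append((state, action, next_state, reward, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up consecutive transitions.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

buffer = SimpleReplayBuffer(capacity=3, seed=0)
for t in range(5):
    # Hypothetical transitions; states are just integers here.
    buffer.push(state=t, action=t % 2, next_state=t + 1, reward=1.0, done=False)

print(len(buffer))                             # capped at the capacity
print(sorted(tr[0] for tr in buffer.memory))   # only the most recent states remain
batch = buffer.sample(2)
print(len(batch))
```

In a training loop, each environment step would push one transition, and each optimization step would sample a batch once the buffer holds at least `batch_size` entries.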

**[QUESTION 5 points]** How does the replay buffer improve performance?

In [12]:
## PASTE YOUR TERMINAL OUTPUT HERE
# NOTE: TO HAVE FEWER LINES PRINTED, YOU CAN SET THE VARIABLE PRINT_INTERVAL TO 5 or 10

## Task 3: Extra

Ideas to experiment with:

- Is the $\epsilon$-greedy strategy the best strategy available? Why not try something different?
- Why not make use of the model you have trained in the behavioral cloning part and fine-tune it with RL? How does that affect performance?
- You are perhaps bored with `CartPole-v0` by now. Another environment we suggest trying is `LunarLander-v2`. It will be harder to learn but with experimentation, you will find the correct optimizations for success. Piazza is also your friend :)
- What about learning from images? This requires more work because you have to extract the image from the environment. However, would it be possible? How much more challenging might you expect the learning to be in this case?
- The ReplayBuffer implementation provided is very simple. In class we have briefly mentioned Prioritized Experience Replay; how would the learning process change?
- An improvement over DQN is DoubleDQN, which is a very simple addition to the current code.
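As one illustration of the Double DQN idea from the list above: the standard target lets a single set of Q-values both choose and score the next action, whereas Double DQN lets the online network choose the action and a separate target network score it. The sketch below compares the two targets on plain numbers. It assumes a separate target network exists, which the code for this assignment may not include; the function names and values are hypothetical.

```python
def dqn_target(reward, next_q_target, done, gamma=0.99):
    """Standard target: the same Q-values both choose and score the next action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)

def double_dqn_target(reward, next_q_online, next_q_target, done, gamma=0.99):
    """Double DQN target: online values pick the action, target values score it."""
    if done:
        return reward
    best = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best]

# Hypothetical Q-values for the next state from the two networks.
next_q_online = [1.0, 0.2]   # online network prefers action 0
next_q_target = [0.5, 3.0]   # target network has a large estimate for action 1

print(dqn_target(0.0, next_q_target, done=False, gamma=0.5))
print(double_dqn_target(0.0, next_q_online, next_q_target, done=False, gamma=0.5))
```

Decoupling action selection from evaluation in this way is intended to reduce the overestimation that comes from taking a max over noisy estimates.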



In [13]:
# YOU CAN USE THIS CODEBLOCK AND ADD ANY BLOCK BELOW AS YOU NEED
# TO SHOW US THE IDEAS AND EXTRA EXPERIMENTS YOU RUN.
# HAVE FUN!