# CPSC 533V: Assignment 3 - Behavioral Cloning and Deep Q Learning

## 48 points total (9% of final grade)

Name: Haomiao Zhang Student Number: 33074155

---
This assignment will help you transition from the tabular approaches covered in HW 2 to deep neural network approaches. You will implement the [Atari DQN / Deep Q-Learning](https://arxiv.org/abs/1312.5602) algorithm, which arguably kicked off the modern Deep Reinforcement Learning craze.

In this assignment we will use PyTorch as our deep learning framework. To familiarize yourself with PyTorch, your first task is to use a behavior cloning (BC) approach to learn a policy. Behavior cloning is a supervised learning method in which there exists a dataset of expert demonstrations (state-action pairs) and the goal is to learn a policy $\pi$ that mimics this expert. At any given state, your policy should choose the same action the expert would.

Since BC avoids the need to collect data from the policy you are trying to learn, it is relatively simple.
This makes it a nice stepping stone for implementing DQN. Furthermore, BC is relevant to modern approaches: for example, it is used to initialize systems like [AlphaGo][go] and [AlphaStar][star], which then use RL to further adapt the BC result.


Goals:
- Familiarize yourself with PyTorch and its API, including models, datasets, and dataloaders.
- Implement a supervised learning approach (behavioral cloning) to learn a policy.
- Implement the DQN objective and learn a policy through environment interaction.

[go]:  https://deepmind.com/research/case-studies/alphago-the-story-so-far
[star]: https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii

## Submission information

- Complete the assignment by editing and executing the associated Python files.
- Copy and paste the code and the terminal output requested into the predefined cells of this Jupyter notebook.
- When done, upload the completed Jupyter notebook (ipynb file) to Canvas.

## Task 0: Preliminaries

### PyTorch

If you have never used PyTorch before, we recommend you follow this [60 Minutes Blitz][blitz] tutorial from the official website. It should give you enough context to be able to complete the assignment.


**If you have issues, post questions to Piazza**

### Installation

To install all required Python packages:

```
python3 -m pip install -r requirements.txt
```

### Debugging


You can insert `import ipdb; ipdb.set_trace()` anywhere in your code; execution will pause at that point and drop you into an interactive debugger, where you can inspect variables and test out expressions. We recommend this as an effective way to debug the algorithms.


[blitz]: https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html

## Task 1: Behavioral Cloning

Behavioral Cloning is a type of supervised learning in which you are given a dataset of expert demonstration tuples $(s, a)$ and the goal is to learn a policy function $\hat a = \pi(s)$, such that $\hat a = a$.

The optimization objective is $\min_\theta D(\pi(s), a)$, where $\theta$ are the parameters of the policy $\pi$ (in our case, the weights of a neural network) and $D$ represents some measure of difference between the actions.
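
For a discrete action space such as CartPole's, one common choice for $D$ is the cross-entropy between the policy's predicted action distribution and the expert action. This is an assumption made here for illustration; the objective above leaves $D$ open. Writing $\pi_\theta(a \mid s)$ for the probability the policy assigns to action $a$ in state $s$, and $\mathcal{D}$ for the demonstration dataset (both notations are introduced only for this sketch), the objective would read:

$$
\min_\theta \; \frac{1}{|\mathcal{D}|} \sum_{(s, a) \in \mathcal{D}} -\log \pi_\theta(a \mid s)
$$

Minimizing this pushes the policy to put high probability on the action the expert took in each demonstrated state.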

---

Before starting, we suggest reading through the provided files.

For Behavioral Cloning, the important files to understand are: `model.py`, `dataset.py` and `bc.py`.

- `model.py` has the skeleton for the model, which you will complete in the following questions.

- `dataset.py` has the skeleton for the dataset the model is trained on.

- `bc.py` has the structure for training the model with the dataset.


### 1.1 Dataset

We provide a pickle file with pre-collected expert demonstrations on CartPole from which to learn the policy $\pi$. The data has been collected from an expert policy on the environment, with the addition of a small amount of Gaussian noise to the actions.

The pickle file contains a list of (state, action) tuples stored as `numpy` data, laid out as follows:

```
[(state s, action a), (state s, action a), (state s, action a), ...]
```

In the `dataset.py` file, we provide skeleton code for creating a custom dataset. The provided code shows how to load the file.

Your goal is to override the `__getitem__` function so that it returns a dictionary of tensors of the correct type.

Hint: Look in the `bc.py` file to understand how the dataset is used.
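
As a rough illustration of how a custom dataset plugs into a PyTorch `DataLoader`, here is a minimal sketch. The class name `ToyDemoDataset` and the synthetic demonstrations are hypothetical stand-ins, not the contents of `dataset.py` or of the provided pickle file; the dictionary keys `state` and `action` and the explicit dtypes are assumptions.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDemoDataset(Dataset):
    """Hypothetical stand-in for the dataset skeleton in dataset.py."""

    def __init__(self, data):
        # data: list of (state, action) tuples, mirroring the pickle layout
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        state, action = self.data[index]
        return {
            "state": torch.as_tensor(state, dtype=torch.float32),
            "action": torch.as_tensor(action, dtype=torch.long),
        }


# Synthetic demonstrations: 4-dimensional states, binary actions.
rng = np.random.default_rng(0)
demo_data = [
    (rng.normal(size=4).astype(np.float32), int(rng.integers(0, 2)))
    for _ in range(10)
]

loader = DataLoader(ToyDemoDataset(demo_data), batch_size=4, shuffle=False)
batch = next(iter(loader))  # the default collate function stacks each dict entry
print(batch["state"].shape, batch["action"].shape)
```

Returning a dictionary lets the default collate function batch each entry separately, so a training loop can index the batch by name.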

Answer the following questions:

- **[QUESTION 2 points]** Insert your code in the placeholder below.

In [5]:
# PLACEHOLDER TO INSERT YOUR __getitem__ method here

def __getitem__(self, index):
    item = self.data[index]
    return {'state': torch.tensor(item[0]), 'action': torch.tensor(item[1])}

- **[QUESTION 2 points]** How big is the dataset provided?

The dataset has a length of 99660.

- **[QUESTION 2 points]** What is the dimensionality of $s$ and what range does each dimension of $s$ span?  I.e., how much of the state space does the expert data cover?

The state has 4 dimensions (cart position, cart velocity, pole angle, and pole angular velocity). The expert data covers the following range in each dimension: [-0.7227, 2.3995], [-0.4330, 1.8470], [-0.0501, 0.1464], [-0.3812, 0.4714]. These values are the minimum and maximum of each state dimension over the whole dataset.

- **[QUESTION 2 points]** What are the dimensionalities and ranges of the action $a$ in the dataset (how much of the action space does the expert data cover)?

The action has 1 dimension, and the actions covered by the expert data are {0, 1}, which is 100% of the action space.
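
The coverage numbers above can be computed along the following lines. This is a sketch on synthetic data; the list `data` is a hypothetical stand-in for the unpickled demonstrations, so the printed ranges will not match the ones reported above.

```python
import numpy as np

# Synthetic stand-in for the pickle contents: a list of (state, action) tuples.
rng = np.random.default_rng(0)
data = [(rng.uniform(-1.0, 1.0, size=4), int(rng.integers(0, 2))) for _ in range(1000)]

states = np.stack([s for s, _ in data])     # shape (N, 4)
state_min = states.min(axis=0)              # per-dimension minimum
state_max = states.max(axis=0)              # per-dimension maximum
actions = np.unique([a for _, a in data])   # distinct actions in the data

for d, (lo, hi) in enumerate(zip(state_min, state_max)):
    print(f"dim {d}: [{lo:.4f}, {hi:.4f}]")
print("actions:", actions)
```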


### 1.2 Environment

Recall the state and action space of CartPole, from the previous assignment.

- **[QUESTION 2 points]** Considering the full state and action spaces, do you think the provided expert dataset has good coverage?  Why or why not? How might this impact the performance of our cloned policy?

The expert data covers only part of the state space. For example, it contains no states where the cart is far in the negative direction, has a large negative velocity, or where the pole has a large negative angle. This limited coverage will hurt the cloned policy: in states the expert never visited, the policy has no demonstrations to imitate, and there may be better actions that the expert never demonstrated.

### 1.3 Model

The file `model.py` provides skeleton code for the model. Your goal is to create the architecture of the network by adding layers that map the input to output.

You will need to update the `__init__` method and the `forward` method.

The `select_action` method has already been written for you.  This should be used when running the policy in the environment, while the `forward` function should be used at training time.

- **[QUESTION 5 points]** Insert your code in the placeholder below.

In [4]:
# PLACEHOLDER TO INSERT YOUR MyModel class here

import torch
import torch.nn as nn
import torch.nn.functional as F


class MyModel(nn.Module):
    def __init__(self, state_size, action_size):
        super().__init__()

        # Two hidden layers of 64 units each, following the advice on Piazza;
        # the layer syntax follows the 60 Minute Blitz tutorial.
        self.hd1 = nn.Linear(state_size, 64)
        self.hd2 = nn.Linear(64, 64)
        self.output = nn.Linear(64, action_size)

    def forward(self, x):
        # Cast the input to float32 so that it matches the layer weights.
        x = F.relu(self.hd1(x.float()))
        x = F.relu(self.hd2(x))
        # Return raw scores (logits); no softmax is applied here.
        return self.output(x)

    def select_action(self, state):
        # Evaluate greedily, then restore training mode.
        self.eval()
        x = self.forward(state)
        self.train()
        return x.max(1)[1].view(1, 1).to(torch.long)

Answer the following questions:

- **[QUESTION 2 points]** What is the input of the network?

The input to the network is the state, a 4-dimensional vector for each example.

- **[QUESTION 2 points]** What is the output?

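The output of the network is one score per action, so its dimension is 2 for each example. Since the last layer is linear, these are unnormalized scores (logits); applying a softmax, as `CrossEntropyLoss` does internally, turns them into the probability of each action.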
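
The model above ends in a plain linear layer, so its raw outputs are scores rather than normalized probabilities. The following sketch, using a made-up output tensor, shows how a softmax turns such scores into probabilities and why taking the arg-max of either gives the same action, which is what `select_action` relies on.

```python
import torch

logits = torch.tensor([[1.5, -0.5]])     # hypothetical network output for one state
probs = torch.softmax(logits, dim=1)     # normalize the scores into probabilities
greedy_action = logits.argmax(dim=1)     # same index as probs.argmax(dim=1)

print(probs, greedy_action)
```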


### 1.4 Training

The file `bc.py` is the entry point for training your behavioral cloning model. The skeleton and the main components are already there.

The missing parts for you to do are:

- Initializing the model
- Choosing a loss function
- Choosing an optimizer
- Playing with hyperparameters to train your model.

- **[QUESTION 5 points]** Insert your code in the placeholder below.

In [5]:
# PLACEHOLDER FOR YOUR CODE HERE
# HOW DID YOU INITIALIZE YOUR MODEL, OPTIMIZER AND LOSS FUNCTIONS? PASTE HERE YOUR FINAL CODE
# NOTE: YOU CAN KEEP THE FOLLOWING LINES COMMENTED OUT, AS RUNNING THIS CELL WILL PROBABLY RESULT IN ERRORS

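model = MyModel(4, 2)  # 4 state dimensions in, 2 actions out
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

# Cross-entropy is a classification loss; it takes the model's raw scores
# and the integer action labels, which matches our output format.
loss_function = torch.nn.CrossEntropyLoss()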
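
To show how these three pieces fit together, here is a minimal sketch of a supervised training loop on a single synthetic batch. It is not the loop in `bc.py`: the stand-in network, the learning rate, the batch contents, and the number of steps are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-in network with the same input/output sizes as the CartPole policy.
model = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # assumed learning rate
loss_function = nn.CrossEntropyLoss()

# One synthetic batch in the dictionary format returned by the dataset.
batch = {"state": torch.randn(32, 4), "action": torch.randint(0, 2, (32,))}

losses = []
for _ in range(100):
    logits = model(batch["state"])                  # shape (32, 2)
    loss = loss_function(logits, batch["action"])   # targets are class indices
    optimizer.zero_grad()                           # clear old gradients
    loss.backward()                                 # backpropagate
    optimizer.step()                                # update the weights
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

Note that `CrossEntropyLoss` expects raw scores of shape `(batch, num_actions)` and integer labels of shape `(batch,)`.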

You can run your code by doing:

```
python3 bc.py
```

**During all of this assignment, the code in `eval_policy.py` will be your best friend.** At any time, you can test your model by giving as argument the path to the model weights and the environment name using the following command:

```
python3 eval_policy.py --model-path /path/to/model/weights --env ENV_NAME
```

In [None]:
# output of training phase, running python3 bc.py
[epoch    1/100] [iter       0] [loss 0.69714]
[epoch    1/100] [iter     500] [loss 0.68122]
[epoch    1/100] [iter    1000] [loss 0.66725]
[epoch    1/100] [iter    1500] [loss 0.65466]
[epoch    2/100] [iter    2000] [loss 0.63796]
[epoch    2/100] [iter    2500] [loss 0.62442]
[epoch    2/100] [iter    3000] [loss 0.60190]
[Test on environment] [epoch 2/100] [score 53.40]
[epoch    3/100] [iter    3500] [loss 0.58616]
[epoch    3/100] [iter    4000] [loss 0.56452]
[epoch    3/100] [iter    4500] [loss 0.53839]
[epoch    4/100] [iter    5000] [loss 0.52452]
[epoch    4/100] [iter    5500] [loss 0.48137]
[epoch    4/100] [iter    6000] [loss 0.47635]
[Test on environment] [epoch 4/100] [score 61.50]
[epoch    5/100] [iter    6500] [loss 0.45083]
[epoch    5/100] [iter    7000] [loss 0.43100]
[epoch    5/100] [iter    7500] [loss 0.37357]
[epoch    6/100] [iter    8000] [loss 0.38259]
[epoch    6/100] [iter    8500] [loss 0.36017]
[epoch    6/100] [iter    9000] [loss 0.36235]
[Test on environment] [epoch 6/100] [score 60.50]
[epoch    7/100] [iter    9500] [loss 0.36913]
[epoch    7/100] [iter   10000] [loss 0.30188]
[epoch    7/100] [iter   10500] [loss 0.34008]
[epoch    8/100] [iter   11000] [loss 0.33769]
[epoch    8/100] [iter   11500] [loss 0.29680]
[epoch    8/100] [iter   12000] [loss 0.21036]
[Test on environment] [epoch 8/100] [score 69.60]
[epoch    9/100] [iter   12500] [loss 0.29051]
[epoch    9/100] [iter   13000] [loss 0.25852]
[epoch    9/100] [iter   13500] [loss 0.25409]
[epoch    9/100] [iter   14000] [loss 0.36082]
[epoch   10/100] [iter   14500] [loss 0.30018]
[epoch   10/100] [iter   15000] [loss 0.29064]
[epoch   10/100] [iter   15500] [loss 0.21888]
[Test on environment] [epoch 10/100] [score 77.20]
[epoch   11/100] [iter   16000] [loss 0.24531]
[epoch   11/100] [iter   16500] [loss 0.24884]
[epoch   11/100] [iter   17000] [loss 0.24639]
[epoch   12/100] [iter   17500] [loss 0.22901]
[epoch   12/100] [iter   18000] [loss 0.28408]
[epoch   12/100] [iter   18500] [loss 0.25482]
[Test on environment] [epoch 12/100] [score 108.40]
[epoch   13/100] [iter   19000] [loss 0.31272]
[epoch   13/100] [iter   19500] [loss 0.18302]
[epoch   13/100] [iter   20000] [loss 0.17613]
[epoch   14/100] [iter   20500] [loss 0.24361]
[epoch   14/100] [iter   21000] [loss 0.19487]
[epoch   14/100] [iter   21500] [loss 0.25849]
[Test on environment] [epoch 14/100] [score 116.70]
[epoch   15/100] [iter   22000] [loss 0.24766]
[epoch   15/100] [iter   22500] [loss 0.21922]
[epoch   15/100] [iter   23000] [loss 0.16931]
[epoch   16/100] [iter   23500] [loss 0.26442]
[epoch   16/100] [iter   24000] [loss 0.11117]
[epoch   16/100] [iter   24500] [loss 0.28343]
[Test on environment] [epoch 16/100] [score 140.90]
[epoch   17/100] [iter   25000] [loss 0.17825]
[epoch   17/100] [iter   25500] [loss 0.16142]
[epoch   17/100] [iter   26000] [loss 0.20124]
[epoch   18/100] [iter   26500] [loss 0.14321]
[epoch   18/100] [iter   27000] [loss 0.23828]
[epoch   18/100] [iter   27500] [loss 0.20905]
[epoch   18/100] [iter   28000] [loss 0.32970]
[Test on environment] [epoch 18/100] [score 155.10]
[epoch   19/100] [iter   28500] [loss 0.12756]
[epoch   19/100] [iter   29000] [loss 0.24492]
[epoch   19/100] [iter   29500] [loss 0.23846]
[epoch   20/100] [iter   30000] [loss 0.14152]
[epoch   20/100] [iter   30500] [loss 0.13408]
[epoch   20/100] [iter   31000] [loss 0.23141]
[Test on environment] [epoch 20/100] [score 186.90]
[epoch   21/100] [iter   31500] [loss 0.21407]
[epoch   21/100] [iter   32000] [loss 0.14571]
[epoch   21/100] [iter   32500] [loss 0.19157]
[epoch   22/100] [iter   33000] [loss 0.09238]
[epoch   22/100] [iter   33500] [loss 0.12021]
[epoch   22/100] [iter   34000] [loss 0.13936]
[Test on environment] [epoch 22/100] [score 159.00]
[epoch   23/100] [iter   34500] [loss 0.17434]
[epoch   23/100] [iter   35000] [loss 0.12500]
[epoch   23/100] [iter   35500] [loss 0.24270]
[epoch   24/100] [iter   36000] [loss 0.17178]
[epoch   24/100] [iter   36500] [loss 0.14341]
[epoch   24/100] [iter   37000] [loss 0.13779]
[Test on environment] [epoch 24/100] [score 173.80]
[epoch   25/100] [iter   37500] [loss 0.11098]
[epoch   25/100] [iter   38000] [loss 0.11585]
[epoch   25/100] [iter   38500] [loss 0.12512]
[epoch   26/100] [iter   39000] [loss 0.09220]
[epoch   26/100] [iter   39500] [loss 0.13928]
[epoch   26/100] [iter   40000] [loss 0.20441]
[epoch   26/100] [iter   40500] [loss 0.13053]
[Test on environment] [epoch 26/100] [score 190.00]
[epoch   27/100] [iter   41000] [loss 0.13350]
[epoch   27/100] [iter   41500] [loss 0.12575]
[epoch   27/100] [iter   42000] [loss 0.07754]
[epoch   28/100] [iter   42500] [loss 0.12791]
[epoch   28/100] [iter   43000] [loss 0.11287]
[epoch   28/100] [iter   43500] [loss 0.25251]
[Test on environment] [epoch 28/100] [score 189.60]
[epoch   29/100] [iter   44000] [loss 0.16524]
[epoch   29/100] [iter   44500] [loss 0.12574]
[epoch   29/100] [iter   45000] [loss 0.05651]
[epoch   30/100] [iter   45500] [loss 0.15205]
[epoch   30/100] [iter   46000] [loss 0.08784]
[epoch   30/100] [iter   46500] [loss 0.12887]
[Test on environment] [epoch 30/100] [score 196.70]
[epoch   31/100] [iter   47000] [loss 0.08194]
[epoch   31/100] [iter   47500] [loss 0.12054]
[epoch   31/100] [iter   48000] [loss 0.10460]
[epoch   32/100] [iter   48500] [loss 0.12993]
[epoch   32/100] [iter   49000] [loss 0.10380]
[epoch   32/100] [iter   49500] [loss 0.11424]
[Test on environment] [epoch 32/100] [score 194.00]
[epoch   33/100] [iter   50000] [loss 0.12923]
[epoch   33/100] [iter   50500] [loss 0.07099]
[epoch   33/100] [iter   51000] [loss 0.08171]
[epoch   34/100] [iter   51500] [loss 0.09515]
[epoch   34/100] [iter   52000] [loss 0.07492]
[epoch   34/100] [iter   52500] [loss 0.13357]
[Test on environment] [epoch 34/100] [score 200.00]
[epoch   35/100] [iter   53000] [loss 0.13564]
[epoch   35/100] [iter   53500] [loss 0.08119]
[epoch   35/100] [iter   54000] [loss 0.06457]
[epoch   35/100] [iter   54500] [loss 0.18283]
[epoch   36/100] [iter   55000] [loss 0.10442]
[epoch   36/100] [iter   55500] [loss 0.10068]
[epoch   36/100] [iter   56000] [loss 0.07301]
[Test on environment] [epoch 36/100] [score 198.20]
[epoch   37/100] [iter   56500] [loss 0.14000]
[epoch   37/100] [iter   57000] [loss 0.13257]
[epoch   37/100] [iter   57500] [loss 0.07662]
[epoch   38/100] [iter   58000] [loss 0.08104]
[epoch   38/100] [iter   58500] [loss 0.13946]
[epoch   38/100] [iter   59000] [loss 0.04721]
[Test on environment] [epoch 38/100] [score 198.00]
[epoch   39/100] [iter   59500] [loss 0.09174]
[epoch   39/100] [iter   60000] [loss 0.13207]
[epoch   39/100] [iter   60500] [loss 0.08296]
[epoch   40/100] [iter   61000] [loss 0.08297]
[epoch   40/100] [iter   61500] [loss 0.08201]
[epoch   40/100] [iter   62000] [loss 0.06275]
[Test on environment] [epoch 40/100] [score 199.60]
[epoch   41/100] [iter   62500] [loss 0.07500]
[epoch   41/100] [iter   63000] [loss 0.07306]
[epoch   41/100] [iter   63500] [loss 0.00711]
[epoch   42/100] [iter   64000] [loss 0.07645]
[epoch   42/100] [iter   64500] [loss 0.12512]
[epoch   42/100] [iter   65000] [loss 0.07498]
[Test on environment] [epoch 42/100] [score 198.70]
[epoch   43/100] [iter   65500] [loss 0.05814]
[epoch   43/100] [iter   66000] [loss 0.14734]
[epoch   43/100] [iter   66500] [loss 0.09420]
[epoch   44/100] [iter   67000] [loss 0.06559]
[epoch   44/100] [iter   67500] [loss 0.08785]
[epoch   44/100] [iter   68000] [loss 0.01799]
[epoch   44/100] [iter   68500] [loss 0.04748]
[Test on environment] [epoch 44/100] [score 198.90]
[epoch   45/100] [iter   69000] [loss 0.06215]
[epoch   45/100] [iter   69500] [loss 0.14878]
[epoch   45/100] [iter   70000] [loss 0.04379]
[epoch   46/100] [iter   70500] [loss 0.08242]
[epoch   46/100] [iter   71000] [loss 0.14611]
[epoch   46/100] [iter   71500] [loss 0.03529]
[Test on environment] [epoch 46/100] [score 199.30]
[epoch   47/100] [iter   72000] [loss 0.11181]
[epoch   47/100] [iter   72500] [loss 0.05420]
[epoch   47/100] [iter   73000] [loss 0.06542]
[epoch   48/100] [iter   73500] [loss 0.02776]
[epoch   48/100] [iter   74000] [loss 0.06160]
[epoch   48/100] [iter   74500] [loss 0.05815]
[Test on environment] [epoch 48/100] [score 198.20]
[epoch   49/100] [iter   75000] [loss 0.07618]
[epoch   49/100] [iter   75500] [loss 0.04137]
[epoch   49/100] [iter   76000] [loss 0.06780]
[epoch   50/100] [iter   76500] [loss 0.05133]
[epoch   50/100] [iter   77000] [loss 0.05768]
[epoch   50/100] [iter   77500] [loss 0.05279]
[Test on environment] [epoch 50/100] [score 196.50]
[epoch   51/100] [iter   78000] [loss 0.06458]
[epoch   51/100] [iter   78500] [loss 0.02706]
[epoch   51/100] [iter   79000] [loss 0.15182]
[epoch   52/100] [iter   79500] [loss 0.02669]
[epoch   52/100] [iter   80000] [loss 0.03993]
[epoch   52/100] [iter   80500] [loss 0.04363]
[epoch   52/100] [iter   81000] [loss 0.04641]
[Test on environment] [epoch 52/100] [score 199.80]
[epoch   53/100] [iter   81500] [loss 0.04811]
[epoch   53/100] [iter   82000] [loss 0.07831]
[epoch   53/100] [iter   82500] [loss 0.13590]
[epoch   54/100] [iter   83000] [loss 0.05363]
[epoch   54/100] [iter   83500] [loss 0.01846]
[epoch   54/100] [iter   84000] [loss 0.07821]
[Test on environment] [epoch 54/100] [score 195.70]
[epoch   55/100] [iter   84500] [loss 0.06859]
[epoch   55/100] [iter   85000] [loss 0.04701]
[epoch   55/100] [iter   85500] [loss 0.05324]
[epoch   56/100] [iter   86000] [loss 0.06712]
[epoch   56/100] [iter   86500] [loss 0.06254]
[epoch   56/100] [iter   87000] [loss 0.04174]
[Test on environment] [epoch 56/100] [score 199.00]
[epoch   57/100] [iter   87500] [loss 0.11631]
[epoch   57/100] [iter   88000] [loss 0.05255]
[epoch   57/100] [iter   88500] [loss 0.06074]
[epoch   58/100] [iter   89000] [loss 0.08504]
[epoch   58/100] [iter   89500] [loss 0.05822]
[epoch   58/100] [iter   90000] [loss 0.04474]
[Test on environment] [epoch 58/100] [score 198.00]
[epoch   59/100] [iter   90500] [loss 0.04967]
[epoch   59/100] [iter   91000] [loss 0.04785]
[epoch   59/100] [iter   91500] [loss 0.04461]
[epoch   60/100] [iter   92000] [loss 0.02715]
[epoch   60/100] [iter   92500] [loss 0.04390]
[epoch   60/100] [iter   93000] [loss 0.03712]
[Test on environment] [epoch 60/100] [score 200.00]
[epoch   61/100] [iter   93500] [loss 0.04944]
[epoch   61/100] [iter   94000] [loss 0.06080]
[epoch   61/100] [iter   94500] [loss 0.02200]
[epoch   61/100] [iter   95000] [loss 0.05759]
[epoch   62/100] [iter   95500] [loss 0.04561]
[epoch   62/100] [iter   96000] [loss 0.05548]
[epoch   62/100] [iter   96500] [loss 0.05217]
[Test on environment] [epoch 62/100] [score 197.60]
[epoch   63/100] [iter   97000] [loss 0.08831]
[epoch   63/100] [iter   97500] [loss 0.02987]
[epoch   63/100] [iter   98000] [loss 0.04199]
[epoch   64/100] [iter   98500] [loss 0.04198]
[epoch   64/100] [iter   99000] [loss 0.03689]
[epoch   64/100] [iter   99500] [loss 0.04334]
[Test on environment] [epoch 64/100] [score 198.50]
[epoch   65/100] [iter  100000] [loss 0.04235]
[epoch   65/100] [iter  100500] [loss 0.05102]
[epoch   65/100] [iter  101000] [loss 0.07311]
[epoch   66/100] [iter  101500] [loss 0.06371]
[epoch   66/100] [iter  102000] [loss 0.03966]
[epoch   66/100] [iter  102500] [loss 0.04935]
[Test on environment] [epoch 66/100] [score 198.50]
[epoch   67/100] [iter  103000] [loss 0.05879]
[epoch   67/100] [iter  103500] [loss 0.06695]
[epoch   67/100] [iter  104000] [loss 0.05565]
[epoch   68/100] [iter  104500] [loss 0.03269]
[epoch   68/100] [iter  105000] [loss 0.08685]
[epoch   68/100] [iter  105500] [loss 0.06411]
[Test on environment] [epoch 68/100] [score 200.00]
[epoch   69/100] [iter  106000] [loss 0.04937]
[epoch   69/100] [iter  106500] [loss 0.04324]
[epoch   69/100] [iter  107000] [loss 0.05792]
[epoch   69/100] [iter  107500] [loss 0.04539]
[epoch   70/100] [iter  108000] [loss 0.14215]
[epoch   70/100] [iter  108500] [loss 0.04782]
[epoch   70/100] [iter  109000] [loss 0.06069]
[Test on environment] [epoch 70/100] [score 199.60]
[epoch   71/100] [iter  109500] [loss 0.03340]
[epoch   71/100] [iter  110000] [loss 0.06787]
[epoch   71/100] [iter  110500] [loss 0.06662]
[epoch   72/100] [iter  111000] [loss 0.03403]
[epoch   72/100] [iter  111500] [loss 0.08459]
[epoch   72/100] [iter  112000] [loss 0.04408]
[Test on environment] [epoch 72/100] [score 200.00]
[epoch   73/100] [iter  112500] [loss 0.02905]
[epoch   73/100] [iter  113000] [loss 0.04234]
[epoch   73/100] [iter  113500] [loss 0.04563]
[epoch   74/100] [iter  114000] [loss 0.04804]
[epoch   74/100] [iter  114500] [loss 0.05245]
[epoch   74/100] [iter  115000] [loss 0.05299]
[Test on environment] [epoch 74/100] [score 200.00]
[epoch   75/100] [iter  115500] [loss 0.02796]
[epoch   75/100] [iter  116000] [loss 0.05554]
[epoch   75/100] [iter  116500] [loss 0.03132]
[epoch   76/100] [iter  117000] [loss 0.03151]
[epoch   76/100] [iter  117500] [loss 0.02985]
[epoch   76/100] [iter  118000] [loss 0.02627]
[Test on environment] [epoch 76/100] [score 198.70]
[epoch   77/100] [iter  118500] [loss 0.04323]
[epoch   77/100] [iter  119000] [loss 0.04289]
[epoch   77/100] [iter  119500] [loss 0.03098]
[epoch   78/100] [iter  120000] [loss 0.03663]
[epoch   78/100] [iter  120500] [loss 0.04377]
[epoch   78/100] [iter  121000] [loss 0.04071]
[epoch   78/100] [iter  121500] [loss 0.06311]
[Test on environment] [epoch 78/100] [score 200.00]
[epoch   79/100] [iter  122000] [loss 0.01713]
[epoch   79/100] [iter  122500] [loss 0.07172]
[epoch   79/100] [iter  123000] [loss 0.04676]
[epoch   80/100] [iter  123500] [loss 0.05198]
[epoch   80/100] [iter  124000] [loss 0.02204]
[epoch   80/100] [iter  124500] [loss 0.06332]
[Test on environment] [epoch 80/100] [score 198.90]
[epoch   81/100] [iter  125000] [loss 0.04733]
[epoch   81/100] [iter  125500] [loss 0.02210]
[epoch   81/100] [iter  126000] [loss 0.03909]
[epoch   82/100] [iter  126500] [loss 0.01423]
[epoch   82/100] [iter  127000] [loss 0.03103]
[epoch   82/100] [iter  127500] [loss 0.07605]
[Test on environment] [epoch 82/100] [score 200.00]
[epoch   83/100] [iter  128000] [loss 0.04798]
[epoch   83/100] [iter  128500] [loss 0.07016]
[epoch   83/100] [iter  129000] [loss 0.01939]
[epoch   84/100] [iter  129500] [loss 0.02238]
[epoch   84/100] [iter  130000] [loss 0.13405]
[epoch   84/100] [iter  130500] [loss 0.02822]
[Test on environment] [epoch 84/100] [score 200.00]
[epoch   85/100] [iter  131000] [loss 0.01473]
[epoch   85/100] [iter  131500] [loss 0.05388]
[epoch   85/100] [iter  132000] [loss 0.06503]
[epoch   86/100] [iter  132500] [loss 0.01272]
[epoch   86/100] [iter  133000] [loss 0.05862]
[epoch   86/100] [iter  133500] [loss 0.06432]
[Test on environment] [epoch 86/100] [score 198.30]
[epoch   87/100] [iter  134000] [loss 0.04333]
[epoch   87/100] [iter  134500] [loss 0.04263]
[epoch   87/100] [iter  135000] [loss 0.03157]
[epoch   87/100] [iter  135500] [loss 0.02746]
[epoch   88/100] [iter  136000] [loss 0.03820]
[epoch   88/100] [iter  136500] [loss 0.01769]
[epoch   88/100] [iter  137000] [loss 0.03962]
[Test on environment] [epoch 88/100] [score 200.00]
[epoch   89/100] [iter  137500] [loss 0.02568]
[epoch   89/100] [iter  138000] [loss 0.04549]
[epoch   89/100] [iter  138500] [loss 0.03729]
[epoch   90/100] [iter  139000] [loss 0.11098]
[epoch   90/100] [iter  139500] [loss 0.04339]
[epoch   90/100] [iter  140000] [loss 0.08004]
[Test on environment] [epoch 90/100] [score 197.90]
[epoch   91/100] [iter  140500] [loss 0.01768]
[epoch   91/100] [iter  141000] [loss 0.08420]
[epoch   91/100] [iter  141500] [loss 0.03575]
[epoch   92/100] [iter  142000] [loss 0.01235]
[epoch   92/100] [iter  142500] [loss 0.04670]
[epoch   92/100] [iter  143000] [loss 0.02943]
[Test on environment] [epoch 92/100] [score 200.00]
[epoch   93/100] [iter  143500] [loss 0.03243]
[epoch   93/100] [iter  144000] [loss 0.03741]
[epoch   93/100] [iter  144500] [loss 0.03913]
[epoch   94/100] [iter  145000] [loss 0.02893]
[epoch   94/100] [iter  145500] [loss 0.02154]
[epoch   94/100] [iter  146000] [loss 0.02874]
[Test on environment] [epoch 94/100] [score 200.00]
[epoch   95/100] [iter  146500] [loss 0.03196]
[epoch   95/100] [iter  147000] [loss 0.01311]
[epoch   95/100] [iter  147500] [loss 0.03302]
[epoch   95/100] [iter  148000] [loss 0.02416]
[epoch   96/100] [iter  148500] [loss 0.02109]
[epoch   96/100] [iter  149000] [loss 0.02331]
[epoch   96/100] [iter  149500] [loss 0.05027]
[Test on environment] [epoch 96/100] [score 200.00]
[epoch   97/100] [iter  150000] [loss 0.05223]
[epoch   97/100] [iter  150500] [loss 0.02322]
[epoch   97/100] [iter  151000] [loss 0.01922]
[epoch   98/100] [iter  151500] [loss 0.05662]
[epoch   98/100] [iter  152000] [loss 0.13024]
[epoch   98/100] [iter  152500] [loss 0.02138]
[Test on environment] [epoch 98/100] [score 199.20]
[epoch   99/100] [iter  153000] [loss 0.02181]
[epoch   99/100] [iter  153500] [loss 0.04091]
[epoch   99/100] [iter  154000] [loss 0.04907]
[epoch  100/100] [iter  154500] [loss 0.01671]
[epoch  100/100] [iter  155000] [loss 0.02109]
[epoch  100/100] [iter  155500] [loss 0.03168]
[Test on environment] [epoch 100/100] [score 199.90]
Saving model as behavioral_cloning_CartPole-v0.pt

In [4]:
## PASTE YOUR TERMINAL OUTPUT HERE
# NOTE: TO HAVE FEWER LINES PRINTED, YOU CAN SET THE VARIABLE PRINT_INTERVAL TO 5 or 10

# Below is the result of python3 eval_policy.py --model-path behavioral_cloning_CartPole-v0.pt --env CartPole-v0
[Episode    0/10] [reward 200.0]
[Episode    1/10] [reward 200.0]
[Episode    2/10] [reward 200.0]
[Episode    3/10] [reward 200.0]
[Episode    4/10] [reward 200.0]
[Episode    5/10] [reward 200.0]
[Episode    6/10] [reward 200.0]
[Episode    7/10] [reward 200.0]
[Episode    8/10] [reward 200.0]
[Episode    9/10] [reward 200.0]

**[QUESTION 2 points]** Did you manage to learn a good policy? How consistent is the reward you are getting?

Judging by the reward, the network learned a good policy: every evaluation episode reaches a reward of 200.0, so the result is also very consistent. In terms of variety, however, I don't think the policy is as good. It seems biased toward moving to the right, probably because the expert data contains more samples on the right side.

## Task 2: Deep Q Learning

There are two main issues with the behavior cloning approach.

- First, we are not always lucky enough to have access to a dataset of expert demonstrations.
- Second, replicating an expert policy suffers from compounding error. The policy $\pi$ only sees these "perfect" examples and has no knowledge of how to recover from states not visited by the expert. For this reason, as soon as it is presented with a state that is off the expert trajectory, it will perform poorly and will continue to deviate from a good trajectory without the possibility of recovering from errors.

---
The second task consists of solving the environment from scratch using RL, and more specifically the DQN algorithm, to learn a policy $\pi$.

For this task, familiarize yourself with the file `dqn.py`. We are going to re-use the file `model.py` for the model you created in the previous task.

As in the previous assignment, your task is to implement the Q-learning algorithm, but this time the Q-function is approximated with a neural network.

The algorithm (excerpted from Section 6.5 of [Sutton's book](http://incompleteideas.net/book/RLbook2018.pdf)) is given below:

![DQN algorithm](https://i.imgur.com/Mh4Uxta.png)

### 2.0 Think about your model...



**[QUESTION 2 points]** In DQN, we are using the same model as in task 1 for behavioral cloning. In both tasks the model receives as input the state and in both tasks the model outputs something that has the same dimensionality as the number of actions. These two outputs, though, represent very different things. What is each one representing?

**YOUR ANSWER HERE**

### 2.1 Update your Q-function

Complete the `optimize_model` function. This function receives as input a `state`, an `action`, the `next_state`, the `reward`, and `done`, representing the tuple $(s_t, a_t, s_{t+1}, r_t, done_t)$. Your task is to update your Q-function as shown in the [Atari DQN paper](https://arxiv.org/abs/1312.5602). For now, don't be concerned with the experience replay buffer; we'll get to that later.

![Loss function](https://i.imgur.com/tpTsV8m.png)

- **[QUESTION 8 points]** Insert your code in the placeholder below.

In [7]:
## PLACEHOLDER TO INSERT YOUR optimize_model function here:

# def optimize_model(state, action, next_state, reward, done):
#     # TODO given a tuple (s_t, a_t, s_{t+1}, r_t, done_t) update your model weights

#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
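
For orientation, the one-step Q-learning loss behind this update could be sketched as follows. This is one possible formulation under stated assumptions, not the reference solution: the helper name `td_loss`, the discount factor `GAMMA`, the tensor shapes, and the use of a mean squared error are all choices made for this example, and no separate target network is used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

GAMMA = 0.99  # assumed discount factor


def td_loss(q_net, state, action, next_state, reward, done):
    """One-step Q-learning loss for a batch of transitions (hypothetical helper)."""
    # Q(s_t, a_t): pick the Q-value of the action that was actually taken.
    q_sa = q_net(state).gather(1, action)                        # (B, 1)
    with torch.no_grad():
        # max_a' Q(s_{t+1}, a'); no gradient flows through the target.
        next_q = q_net(next_state).max(dim=1, keepdim=True)[0]   # (B, 1)
        # Terminal transitions (done == 1) get no bootstrap term.
        target = reward + GAMMA * next_q * (1.0 - done)
    return F.mse_loss(q_sa, target)


torch.manual_seed(0)
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
B = 8
state, next_state = torch.randn(B, 4), torch.randn(B, 4)
action = torch.randint(0, 2, (B, 1))   # integer action indices, shape (B, 1)
reward = torch.ones(B, 1)
done = torch.zeros(B, 1)               # 1.0 marks a terminal transition

loss = td_loss(q_net, state, action, next_state, reward, done)
loss.backward()
print(loss.item())
```

Computing the target under `torch.no_grad()` treats it as a fixed regression label, so only the prediction $Q(s_t, a_t)$ receives gradients.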

### 2.2 $\epsilon$-greedy strategy

You will need a strategy to explore your environment. The standard strategy is to use $\epsilon$-greedy. Implement it in the `choose_action` function template.

- **[QUESTION 5 points]** Insert your code in the placeholder below.

In [9]:
## PLACEHOLDER TO INSERT YOUR choose_action function here:

# def choose_action(state, test_mode=False):
#     # TODO implement an epsilon-greedy strategy
#     raise NotImplementedError()
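
The core of an $\epsilon$-greedy rule can be sketched in plain Python as below. The function `epsilon_greedy` and its signature are hypothetical and differ from the `choose_action` template, which receives a state; here the Q-values are passed in directly to keep the example self-contained.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit


q_values = [0.2, 1.3]  # hypothetical Q-values for the two CartPole actions

greedy = epsilon_greedy(q_values, epsilon=0.0)   # never explores
rng = random.Random(0)
samples = [epsilon_greedy(q_values, epsilon=1.0, rng=rng) for _ in range(200)]
print(greedy, sorted(set(samples)))
```

A common pattern, assumed here rather than prescribed by the assignment, is to start with a large $\epsilon$, decay it during training, and set it to zero when evaluating the policy.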

### 2.3 Train your model

Try to train a model in this way.

You can run your code by doing:

```
python3 dqn.py
```

**[QUESTION 2 points]** How many episodes does it take to learn (i.e., reach a good reward)?

**YOUR ANSWER HERE**

In [1]:
## PASTE YOUR TERMINAL OUTPUT HERE
# NOTE: TO HAVE FEWER LINES PRINTED, YOU CAN SET THE VARIABLE PRINT_INTERVAL TO 5 or 10

### 2.4 Add the Experience Replay Buffer

As described in the DQN paper (and as shown in the algorithm picture above), the authors use an experience replay buffer to learn faster. We provide an implementation in the file `replay_buffer.py`. Update the `train_reinforcement_learning` code to push each transition tuple to the replay buffer and to sample a batch for the `optimize_model` function.
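
To make the push-and-sample pattern concrete, here is a minimal sketch of a uniform replay buffer. The class `SimpleReplayBuffer`, the `Transition` tuple, and the method names are hypothetical; the interface of the provided `replay_buffer.py` may differ.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "next_state", "reward", "done"])


class SimpleReplayBuffer:
    """Fixed-capacity FIFO buffer with uniform random sampling (illustrative only)."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # oldest transitions are dropped first

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = SimpleReplayBuffer(capacity=100)
for t in range(150):
    # Dummy transition: state, action, next_state, reward, done
    buffer.push([float(t)] * 4, t % 2, [float(t + 1)] * 4, 1.0, False)

batch = buffer.sample(32)
# Transpose a list of transitions into per-field tuples.
states, actions, next_states, rewards, dones = zip(*batch)
print(len(buffer), len(states))
```

Sampling uniformly from past transitions breaks the correlation between consecutive steps and lets each transition be reused for several updates.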

**[QUESTION 5 points]** How does the replay buffer improve performance?

In [12]:
## PASTE YOUR TERMINAL OUTPUT HERE
# NOTE: TO HAVE FEWER LINES PRINTED, YOU CAN SET THE VARIABLE PRINT_INTERVAL TO 5 or 10

## Task 3: Extra

Ideas to experiment with:

- Is the $\epsilon$-greedy strategy the best strategy available? Why not try something different?
- Why not make use of the model you have trained in the behavioral cloning part and fine-tune it with RL? How does that affect performance?
- You are perhaps bored with `CartPole-v0` by now. Another environment we suggest trying is `LunarLander-v2`. It will be harder to learn but with experimentation, you will find the correct optimizations for success. Piazza is also your friend :)
- What about learning from images? This requires more work because you have to extract the image from the environment. However, would it be possible? How much more challenging might you expect the learning to be in this case?
- The ReplayBuffer implementation provided is very simple. In class we have briefly mentioned Prioritized Experience Replay; how would the learning process change?
- An improvement over DQN is DoubleDQN, which is a very simple addition to the current code.
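
As a starting point for the Double DQN idea, the target computation could be sketched as follows. It assumes a separate, periodically synchronized `target_net`, which the assignment code may not have; the function name `double_dqn_target`, the discount factor, and the tensor shapes are likewise assumptions for this example.

```python
import torch
import torch.nn as nn

GAMMA = 0.99  # assumed discount factor


def double_dqn_target(online_net, target_net, next_state, reward, done):
    """Double DQN target: the online net selects, the target net evaluates."""
    with torch.no_grad():
        # The online network chooses the next action...
        best_action = online_net(next_state).argmax(dim=1, keepdim=True)  # (B, 1)
        # ...and the target network provides the value of that action.
        next_q = target_net(next_state).gather(1, best_action)            # (B, 1)
        return reward + GAMMA * next_q * (1.0 - done)


torch.manual_seed(0)
online_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

next_state = torch.randn(8, 4)
reward = torch.ones(8, 1)
done = torch.zeros(8, 1)

target = double_dqn_target(online_net, target_net, next_state, reward, done)
terminal_target = double_dqn_target(
    online_net, target_net, next_state, reward, torch.ones(8, 1)
)
print(target.shape)
```

Decoupling action selection from action evaluation is intended to reduce the overestimation that comes from taking a maximum over noisy Q-values.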



In [13]:
# YOU CAN USE THIS CODEBLOCK AND ADD ANY BLOCK BELOW AS YOU NEED
# TO SHOW US THE IDEAS AND EXTRA EXPERIMENTS YOU RUN.
# HAVE FUN!