# CPSC 533V: Assignment 3 - Behavioral Cloning and Deep Q Learning

## 48 points total (9% of final grade)

---
This assignment will help you transition from tabular approaches, topic of HW 2, to deep neural network approaches. You will implement the [Atari DQN / Deep Q-Learning](https://arxiv.org/abs/1312.5602) algorithm, which arguably kicked off the modern Deep Reinforcement Learning craze.

In this assignment we will use PyTorch as our deep learning framework. To familiarize yourself with PyTorch, your first task is to use a behavior cloning (BC) approach to learn a policy. Behavior cloning is a supervised learning method in which there exists a dataset of expert demonstrations (state-action pairs) and the goal is to learn a policy $\pi$ that mimics this expert. At any given state, your policy should choose the same action the expert would.

Since BC avoids the need to collect data from the policy you are trying to learn, it is relatively simple. 
This makes it a nice stepping stone for implementing DQN. BC is also relevant to modern approaches: for example, it is used to initialize systems like [AlphaGo][go] and [AlphaStar][star], which then use RL to further adapt the BC result.

<!--

I feel like this might be better suited to going lower in the document:

Unfortunately, in many tasks it is impossible to collect good expert demonstrations, making

it's not always possible to have good expert demonstrations for a task in an environemnt and this is where reinforcement learning comes handy. Through the reward signal retrieved by interacting with the environment, the agent learns by itself what is a good policy and can learn to outperform the experts.

-->

Goals:
- Familiarize yourself with PyTorch and its API, including models, datasets, and dataloaders.
- Implement a supervised learning approach (behavioral cloning) to learn a policy.
- Implement the DQN objective and learn a policy through environment interaction.

[go]:  https://deepmind.com/research/case-studies/alphago-the-story-so-far
[star]: https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii

## Submission information

- Complete the assignment by editing and executing the associated Python files.
- Copy and paste the code and the terminal output requested into the predefined cells in this Jupyter notebook.
- When done, upload the completed Jupyter notebook (ipynb file) to Canvas.

<ul style="list-style-type: none; font-size: 1.2em;">
<li>Name (and student ID): Kim Dinh </li>
<li>Name (and student ID): Alan Milligan </li>
</ul>

## Task 0: Preliminaries

### PyTorch

If you have never used PyTorch before, we recommend you follow this [60 Minutes Blitz][blitz] tutorial from the official website. It should give you enough context to be able to complete the assignment.


**If you have issues, post questions to Piazza**

### Installation

To install all required python packages:

```
python3 -m pip install -r requirements.txt
```

### Debugging


You can insert `import ipdb; ipdb.set_trace()` in your code; execution will pause at that line and drop you into an interactive debugger, where you can inspect variables and test out expressions. We recommend this as an effective method to debug the algorithms.


[blitz]: https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html

## Task 1: Behavioral Cloning

Behavioral Cloning is a type of supervised learning in which you are given a dataset of expert demonstration tuples $(s, a)$ and the goal is to learn a policy function $\hat a = \pi(s)$, such that $\hat a = a$.

The optimization objective is $\min_\theta D(\pi(s), a)$, where $\theta$ are the parameters of the policy $\pi$ (in our case the weights of a neural network) and $D$ represents some measure of difference between the actions.
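For a discrete action space such as CartPole's, one common choice for $D$ is the cross-entropy between the policy's predicted action distribution and the expert action. The sketch below computes it by hand for a single $(s, a)$ pair. The function names and the example logits are illustrative assumptions, not part of the provided files.

```python
import math

def softmax(logits):
    """Convert raw network outputs (logits) into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, expert_action):
    """Negative log-probability the policy assigns to the expert's action."""
    probs = softmax(logits)
    return -math.log(probs[expert_action])

# Hypothetical logits for the two CartPole actions (0 = push left, 1 = push right).
loss_uniform = cross_entropy([0.0, 0.0], expert_action=1)    # undecided policy: log(2)
loss_confident = cross_entropy([0.0, 5.0], expert_action=1)  # policy agrees with expert
```

An undecided policy over two actions has a loss of $\ln 2 \approx 0.693$, and the loss shrinks toward zero as the policy puts more probability on the expert's action.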

---

Before starting, we suggest reading through the provided files.

For Behavioral Cloning, the important files to understand are: `model.py`, `dataset.py` and `bc.py`.

- `model.py` has the skeleton for the model (which you will have to complete in the following questions).
- `dataset.py` has the skeleton for the dataset the model is trained on.
- `bc.py` has the structure for training the model with the dataset.


### 1.1 Dataset

We provide a pickle file with pre-collected expert demonstrations on CartPole from which to learn the policy $\pi$. The data has been collected from an expert policy on the environment, with the addition of a small amount of Gaussian noise to the actions.

The pickle file contains a list of (state, action) tuples, stored as `numpy` values, in the following format:

```
[(state s, action a), (state s, action a), (state s, action a), ...]
```

In the `dataset.py` file, we provide skeleton code for creating a custom dataset. The provided code shows how to load the file.

Your goal is to override the `__getitem__` method so that it returns a dictionary of tensors of the correct type.

Hint: Look in the `bc.py` file to understand how the dataset is used.
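As a rough illustration of the map-style dataset protocol (`__len__` plus `__getitem__`), here is a minimal sketch. It is not the provided `dataset.py`: the class name `DemoDataset` and the toy demonstrations are hypothetical, plain Python lists stand in for the `numpy` arrays, and in the assignment the class would subclass `torch.utils.data.Dataset` instead.

```python
import pickle

class DemoDataset:
    """Minimal map-style dataset over (state, action) tuples."""

    def __init__(self, data):
        self.data = data  # list of (state, action) tuples

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        state, action = self.data[index]
        # Return a dictionary so downstream code can look fields up by name.
        return {"state": [float(x) for x in state], "action": int(action)}

# Toy stand-in for the pickle file: two 4-dimensional states with binary actions.
demos = [([0.01, -0.02, 0.03, 0.04], 1), ([0.0, 0.1, -0.01, -0.2], 0)]
blob = pickle.dumps(demos)                 # what writing the file would store
dataset = DemoDataset(pickle.loads(blob))  # what loading the file gives back
sample = dataset[0]
```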

Answer the following questions:

- **[QUESTION 2 points]** Insert your code in the placeholder below.

In [5]:
# PLACEHOLDER TO INSERT YOUR __getitem__ method here
'''
def __getitem__(self, index):
    item = self.data[index]
    return {'state': item[0].astype('float32'), 'action': item[1]}
'''

- **[QUESTION 2 points]** How big is the dataset provided?

The dataset contains 99660 state-action pairs.

- **[QUESTION 2 points]** What is the dimensionality of $s$ and what range does each dimension of $s$ span?  I.e., how much of the state space does the expert data cover?

The state space has 4 dimensions. The range of each dimension in the expert data is as follows: $[-0.7227, 2.3995]$, $[-0.4330, 1.8470]$, $[-0.0501, 0.1464]$, $[-0.3812, 0.4714]$. The expert data covers only a small part of the state space. The observed cart positions span about 1/3 of the full cart-position range $[-4.8, 4.8]$, and the observed pole angles span less than 1/4 of the full pole-angle range $[-0.42, 0.42]$.
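The coverage fractions quoted above can be checked with a few lines of arithmetic. This sketch uses only the ranges stated in the answer; the helper name `coverage_fraction` is made up for illustration.

```python
def coverage_fraction(observed, full):
    """Fraction of the full interval's width spanned by the observed interval."""
    obs_lo, obs_hi = observed
    full_lo, full_hi = full
    return (obs_hi - obs_lo) / (full_hi - full_lo)

# Observed ranges from the expert data versus the full ranges quoted in the answer.
cart_position = coverage_fraction((-0.7227, 2.3995), (-4.8, 4.8))
pole_angle = coverage_fraction((-0.0501, 0.1464), (-0.42, 0.42))

print(f"cart position coverage: {cart_position:.3f}")
print(f"pole angle coverage:    {pole_angle:.3f}")
```

The cart position comes out at roughly 0.33 and the pole angle at roughly 0.23, in line with the "about 1/3" and "less than 1/4" estimates.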

- **[QUESTION 2 points]** What are the dimensionalities and ranges of the action $a$ in the dataset (how much of the action space does the expert data cover)?

The action is one-dimensional and takes one of two values, 0 or 1. The dataset contains both, so it covers the entire action space.

### 1.2 Environment

Recall the state and action space of CartPole, from the previous assignment.

- **[QUESTION 2 points]** Considering the full state and action spaces, do you think the provided expert dataset has good coverage?  Why or why not? How might this impact the performance of our cloned policy?

The expert data covers all of the action space, but it does not cover the state space well, as shown above. The cloned policy's performance may be inconsistent: it may often end up in regions of the state space that are not covered by the expert data, where it can easily make a bad move.

### 1.3 Model

The file `model.py` provides skeleton code for the model. Your goal is to create the architecture of the network by adding layers that map the input to output.

You will need to update the `__init__` method and the `forward` method.

The `select_action` method has already been written for you.  This should be used when running the policy in the environment, while the `forward` function should be used at training time.

- **[QUESTION 5 points]** Insert your code in the placeholder below.

In [4]:
# PLACEHOLDER TO INSERT YOUR MyModel class here
'''
class MyModel(nn.Module):
    def __init__(self, state_size, action_size):
        super(MyModel, self).__init__()
        # Basic Multilayer Perceptron
        self.fc1 = nn.Linear(state_size,64)
        self.rl1 = nn.ReLU(inplace=True)
        self.fc2 = nn.Linear(64,64)
        self.rl2 = nn.ReLU(inplace=True)
        self.fc3 = nn.Linear(64, action_size)

    def forward(self, x):
        return self.fc3(self.rl2(self.fc2(self.rl1(self.fc1(x)))))

    def select_action(self, state):
        self.eval()
        x = self.forward(state)
        self.train()
        return x.max(1)[1].view(1, 1).to(torch.long)
'''

Answer the following questions:

- **[QUESTION 2 points]** What is the input of the network?

The input of the network is a state which is a vector of size 4.

- **[QUESTION 2 points]** What is the output?

The output is a vector of size 2 where each entry is associated with an action.


### 1.4 Training

The file `bc.py` is the entry point for training your behavioral cloning model. The skeleton and the main components are already there.

The missing parts for you to do are:

- Initializing the model
- Choosing a loss function
- Choosing an optimizer
- Playing with hyperparameters to train your model.

- **[QUESTION 5 points]** Insert your code in the placeholder below.

In [5]:
# PLACEHOLDER FOR YOUR CODE HERE
# HOW DID YOU INITIALIZE YOUR MODEL, OPTIMIZER AND LOSS FUNCTIONS? PASTE HERE YOUR FINAL CODE
# NOTE: YOU CAN KEEP THE FOLLOWING LINES COMMENTED OUT, AS RUNNING THIS CELL WILL PROBABLY RESULT IN ERRORS
'''
model = MyModel(4, 2)
optimizer = torch.optim.Adam(model.parameters())
loss_function = torch.nn.CrossEntropyLoss()
'''

You can run your code by doing:

```
python3 bc.py
```

**During all of this assignment, the code in `eval_policy.py` will be your best friend.** At any time, you can test your model by giving as argument the path to the model weights and the environment name using the following command:

```
python3 eval_policy.py --model-path /path/to/model/weights --env ENV_NAME
```

In [4]:
## PASTE YOUR TERMINAL OUTPUT HERE
# NOTE: TO HAVE LESS LINES PRINTED, YOU CAN SET THE VARIABLE PRINT_INTERVAL TO 5 or 10
'''
[epoch    1/30] [iter       0] [loss 0.69308]
[epoch    1/30] [iter     500] [loss 0.14052]
[epoch    1/30] [iter    1000] [loss 0.08124]
[epoch    1/30] [iter    1500] [loss 0.05027]
[epoch    2/30] [iter    2000] [loss 0.01764]
[epoch    2/30] [iter    2500] [loss 0.01220]
[epoch    2/30] [iter    3000] [loss 0.01165]
[Test on environment] [epoch 2/30] [score 200.00]
[epoch    3/30] [iter    3500] [loss 0.00646]
[epoch    3/30] [iter    4000] [loss 0.01830]
[epoch    3/30] [iter    4500] [loss 0.01422]
[epoch    4/30] [iter    5000] [loss 0.00280]
[epoch    4/30] [iter    5500] [loss 0.02360]
[epoch    4/30] [iter    6000] [loss 0.04071]
[Test on environment] [epoch 4/30] [score 198.40]
[epoch    5/30] [iter    6500] [loss 0.00576]
[epoch    5/30] [iter    7000] [loss 0.00085]
[epoch    5/30] [iter    7500] [loss 0.01397]
[epoch    6/30] [iter    8000] [loss 0.00483]
[epoch    6/30] [iter    8500] [loss 0.00920]
[epoch    6/30] [iter    9000] [loss 0.04689]
[Test on environment] [epoch 6/30] [score 200.00]
[epoch    7/30] [iter    9500] [loss 0.02013]
[epoch    7/30] [iter   10000] [loss 0.00755]
[epoch    7/30] [iter   10500] [loss 0.00202]
[epoch    8/30] [iter   11000] [loss 0.00661]
[epoch    8/30] [iter   11500] [loss 0.01981]
[epoch    8/30] [iter   12000] [loss 0.00089]
[Test on environment] [epoch 8/30] [score 199.50]
[epoch    9/30] [iter   12500] [loss 0.01099]
[epoch    9/30] [iter   13000] [loss 0.00232]
[epoch    9/30] [iter   13500] [loss 0.01080]
[epoch    9/30] [iter   14000] [loss 0.00436]
[epoch   10/30] [iter   14500] [loss 0.01526]
[epoch   10/30] [iter   15000] [loss 0.02598]
[epoch   10/30] [iter   15500] [loss 0.03057]
[Test on environment] [epoch 10/30] [score 200.00]
[epoch   11/30] [iter   16000] [loss 0.00875]
[epoch   11/30] [iter   16500] [loss 0.00848]
[epoch   11/30] [iter   17000] [loss 0.00117]
[epoch   12/30] [iter   17500] [loss 0.00165]
[epoch   12/30] [iter   18000] [loss 0.00595]
[epoch   12/30] [iter   18500] [loss 0.00386]
[Test on environment] [epoch 12/30] [score 200.00]
[epoch   13/30] [iter   19000] [loss 0.00507]
[epoch   13/30] [iter   19500] [loss 0.00493]
[epoch   13/30] [iter   20000] [loss 0.00074]
[epoch   14/30] [iter   20500] [loss 0.00069]
[epoch   14/30] [iter   21000] [loss 0.00058]
[epoch   14/30] [iter   21500] [loss 0.01100]
[Test on environment] [epoch 14/30] [score 200.00]
[epoch   15/30] [iter   22000] [loss 0.00119]
[epoch   15/30] [iter   22500] [loss 0.00083]
[epoch   15/30] [iter   23000] [loss 0.00042]
[epoch   16/30] [iter   23500] [loss 0.00849]
[epoch   16/30] [iter   24000] [loss 0.00148]
[epoch   16/30] [iter   24500] [loss 0.00472]
[Test on environment] [epoch 16/30] [score 198.80]
[epoch   17/30] [iter   25000] [loss 0.01183]
[epoch   17/30] [iter   25500] [loss 0.00098]
[epoch   17/30] [iter   26000] [loss 0.02957]
[epoch   18/30] [iter   26500] [loss 0.02430]
[epoch   18/30] [iter   27000] [loss 0.00515]
[epoch   18/30] [iter   27500] [loss 0.01589]
[epoch   18/30] [iter   28000] [loss 0.00829]
[Test on environment] [epoch 18/30] [score 200.00]
[epoch   19/30] [iter   28500] [loss 0.00001]
[epoch   19/30] [iter   29000] [loss 0.00126]
[epoch   19/30] [iter   29500] [loss 0.00009]
[epoch   20/30] [iter   30000] [loss 0.00786]
[epoch   20/30] [iter   30500] [loss 0.00476]
[epoch   20/30] [iter   31000] [loss 0.01090]
[Test on environment] [epoch 20/30] [score 200.00]
[epoch   21/30] [iter   31500] [loss 0.00296]
[epoch   21/30] [iter   32000] [loss 0.02178]
[epoch   21/30] [iter   32500] [loss 0.00823]
[epoch   22/30] [iter   33000] [loss 0.00003]
[epoch   22/30] [iter   33500] [loss 0.00376]
[epoch   22/30] [iter   34000] [loss 0.00443]
[Test on environment] [epoch 22/30] [score 200.00]
[epoch   23/30] [iter   34500] [loss 0.00265]
[epoch   23/30] [iter   35000] [loss 0.02861]
[epoch   23/30] [iter   35500] [loss 0.01309]
[epoch   24/30] [iter   36000] [loss 0.00344]
[epoch   24/30] [iter   36500] [loss 0.00024]
[epoch   24/30] [iter   37000] [loss 0.00369]
[Test on environment] [epoch 24/30] [score 199.40]
[epoch   25/30] [iter   37500] [loss 0.07599]
[epoch   25/30] [iter   38000] [loss 0.03027]
[epoch   25/30] [iter   38500] [loss 0.00000]
[epoch   26/30] [iter   39000] [loss 0.00009]
[epoch   26/30] [iter   39500] [loss 0.02126]
[epoch   26/30] [iter   40000] [loss 0.00201]
[epoch   26/30] [iter   40500] [loss 0.02954]
[Test on environment] [epoch 26/30] [score 200.00]
[epoch   27/30] [iter   41000] [loss 0.00097]
[epoch   27/30] [iter   41500] [loss 0.02288]
[epoch   27/30] [iter   42000] [loss 0.01375]
[epoch   28/30] [iter   42500] [loss 0.00383]
[epoch   28/30] [iter   43000] [loss 0.00268]
[epoch   28/30] [iter   43500] [loss 0.01324]
[Test on environment] [epoch 28/30] [score 198.60]
[epoch   29/30] [iter   44000] [loss 0.00132]
[epoch   29/30] [iter   44500] [loss 0.00197]
[epoch   29/30] [iter   45000] [loss 0.00034]
[epoch   30/30] [iter   45500] [loss 0.00060]
[epoch   30/30] [iter   46000] [loss 0.00141]
[epoch   30/30] [iter   46500] [loss 0.01374]
[Test on environment] [epoch 30/30] [score 200.00]
'''

**[QUESTION 2 points]** Did you manage to learn a good policy? How consistent is the reward you are getting?

The learned policy is good and the rewards are fairly consistent. I evaluated the policy for 30 episodes and got a total reward of 200 in 25 episodes and a total reward above 180 in the other 5 episodes.

Even though the expert data does not cover the full state space, the agent still learns a good policy. This may be because the starting state is initialized in $[-0.05, 0.05]^4$, which is covered by the expert data. The agent can therefore learn a good policy that also keeps it inside the covered region most of the time.

## Task 2: Deep Q Learning

There are two main issues with the behavior cloning approach.

- First, we are not always lucky enough to have access to a dataset of expert demonstrations.
- Second, replicating an expert policy suffers from compounding error. The policy $\pi$ only sees these "perfect" examples and has no knowledge of how to recover from states not visited by the expert. For this reason, as soon as it is presented with a state that is off the expert trajectory, it will perform poorly and will continue to deviate from a good trajectory without the possibility of recovering from errors.

---
The second task consists of solving the environment from scratch using RL, and more specifically the DQN algorithm, to learn a policy $\pi$.

For this task, familiarize yourself with the file `dqn.py`. We are going to re-use the file `model.py` for the model you created in the previous task.

Your task is very similar to the one in the previous assignment, to implement the Q-learning algorithm, but in this version, our Q-function is approximated with a neural network.

The algorithm (excerpted from Section 6.5 of [Sutton's book](http://incompleteideas.net/book/RLbook2018.pdf)) is given below:

![DQN algorithm](https://i.imgur.com/Mh4Uxta.png)

### 2.0 Think about your model...



**[QUESTION 2 points]** In DQN, we are using the same model as in task 1 for behavioral cloning. In both tasks the model receives as input the state and in both tasks the model outputs something that has the same dimensionality as the number of actions. These two outputs, though, represent very different things. What is each one representing?

In behavior cloning, the output, after a softmax is applied (which the cross-entropy loss does internally), can be interpreted as a probability distribution over the actions. In DQN, each entry of the output is the estimated value $Q(s, a)$ of taking the corresponding action $a$ in the input state $s$.

### 2.1 Update your Q-function

Complete the `optimize_model` function. This function receives as input a `state`, an `action`, the `next_state`, the `reward` and `done`, representing the tuple $(s_t, a_t, s_{t+1}, r_t, done_t)$. Your task is to update your Q-function as shown in the [Atari DQN paper](https://arxiv.org/abs/1312.5602). For now, don't be concerned with the experience replay buffer; we'll get to that later.

![Loss function](https://i.imgur.com/tpTsV8m.png)
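To make the target in this loss concrete, here is a small pure-Python sketch of the one-step target $y = r + \gamma \max_{a'} Q(s', a')$ (or just $y = r$ for a terminal transition) and the squared error against $Q(s, a)$. The discount value and the example Q-values are assumptions chosen for illustration; the real `GAMMA` is whatever `dqn.py` defines.

```python
GAMMA = 0.99  # assumed discount factor, for illustration only

def td_target(reward, next_q_values, done, gamma=GAMMA):
    """One-step Q-learning target; no bootstrapping from terminal states."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def squared_td_error(q_sa, target):
    """Squared difference between the target and the current estimate Q(s, a)."""
    return (target - q_sa) ** 2

# Hypothetical Q-values of the next state for the two CartPole actions.
y_mid = td_target(1.0, [0.5, 2.0], done=False)   # 1.0 + 0.99 * 2.0
y_end = td_target(1.0, [0.5, 2.0], done=True)    # terminal: just the reward
loss = squared_td_error(q_sa=2.5, target=y_mid)
```

In a network-based implementation the target is treated as a constant, so no gradient flows through the next-state term.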

- **[QUESTION 8 points]** Insert your code in the placeholder below.

In [7]:
## PLACEHOLDER TO INSERT YOUR optimize_model function here:
'''
def optimize_model(state, action, next_state, reward, done):
    target_value = torch.tensor(reward).view(1)
    if not done:
        target_value += GAMMA * model(torch.tensor([next_state])).max(1)[0].detach()
    output = model(torch.tensor([state]))[0, action].view(1)
    loss = torch.nn.MSELoss()(output, target_value)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
'''

### 2.2 $\epsilon$-greedy strategy

You will need a strategy to explore your environment. The standard strategy is to use $\epsilon$-greedy. Implement it in the `choose_action` function template.
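As a framework-free illustration of the idea, the sketch below picks a uniformly random action with probability $\epsilon$ and the greedy (highest-value) action otherwise. The function name, the `rng` parameter, and the example Q-values are assumptions for this example and do not mirror the `choose_action` template.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # Exploit: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)  # seeded generator so the run is repeatable
greedy = epsilon_greedy([0.1, 0.9], epsilon=0.0, rng=rng)  # never explores
samples = [epsilon_greedy([0.1, 0.9], epsilon=1.0, rng=rng) for _ in range(200)]
```

With `epsilon=0.0` the call always returns the greedy action, which is the behavior you want at test time. With `epsilon=1.0` every action is sampled uniformly.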

- **[QUESTION 5 points]** Insert your code in the placeholder below.

In [9]:
## PLACEHOLDER TO INSERT YOUR choose_action function here:
'''
def choose_action(state, test_mode=False):
    if test_mode or random.random() > EPS_EXPLORATION:
        return torch.argmax(model(torch.tensor([state])).detach(), dim=1).view(1,1)
    else:
        return torch.tensor(random.randint(0,1)).view(1, 1)
'''

### 2.3 Train your model

Try to train a model in this way.

You can run your code by doing:

```
python3 dqn.py
```

**[QUESTION 2 points]** How many episodes does it take to learn (i.e., reach a good reward)?

It takes about 400 episodes for the agent to learn a good policy. However, training is unstable: even after the best policy has been seen, episode rewards still sometimes drop and then recover.

In [1]:
## PASTE YOUR TERMINAL OUTPUT HERE
# NOTE: TO HAVE LESS LINES PRINTED, YOU CAN SET THE VARIABLE PRINT_INTERVAL TO 5 or 10
'''
[Episode    5/1000] [Steps    8] [reward 9.0]
[Episode   10/1000] [Steps    8] [reward 9.0]
[Episode   15/1000] [Steps   12] [reward 13.0]
[Episode   20/1000] [Steps    8] [reward 9.0]
[Episode   25/1000] [Steps    7] [reward 8.0]
----------
saving model.
[TEST Episode 25] [Average Reward 9.0]
----------
[Episode   30/1000] [Steps   10] [reward 11.0]
[Episode   35/1000] [Steps    7] [reward 8.0]
[Episode   40/1000] [Steps    8] [reward 9.0]
[Episode   45/1000] [Steps   11] [reward 12.0]
[Episode   50/1000] [Steps   11] [reward 12.0]
----------
saving model.
[TEST Episode 50] [Average Reward 21.6]
----------
[Episode   55/1000] [Steps   10] [reward 11.0]
[Episode   60/1000] [Steps    8] [reward 9.0]
[Episode   65/1000] [Steps   10] [reward 11.0]
[Episode   70/1000] [Steps    8] [reward 9.0]
[Episode   75/1000] [Steps   10] [reward 11.0]
----------
[TEST Episode 75] [Average Reward 9.3]
----------
[Episode   80/1000] [Steps    8] [reward 9.0]
[Episode   85/1000] [Steps    8] [reward 9.0]
[Episode   90/1000] [Steps   14] [reward 15.0]
[Episode   95/1000] [Steps   16] [reward 17.0]
[Episode  100/1000] [Steps   10] [reward 11.0]
----------
[TEST Episode 100] [Average Reward 9.8]
----------
[Episode  105/1000] [Steps    8] [reward 9.0]
[Episode  110/1000] [Steps    9] [reward 10.0]
[Episode  115/1000] [Steps   10] [reward 11.0]
[Episode  120/1000] [Steps   16] [reward 17.0]
[Episode  125/1000] [Steps   10] [reward 11.0]
----------
[TEST Episode 125] [Average Reward 10.0]
----------
[Episode  130/1000] [Steps   14] [reward 15.0]
[Episode  135/1000] [Steps   21] [reward 22.0]
[Episode  140/1000] [Steps    8] [reward 9.0]
[Episode  145/1000] [Steps    8] [reward 9.0]
[Episode  150/1000] [Steps   14] [reward 15.0]
----------
[TEST Episode 150] [Average Reward 11.8]
----------
[Episode  155/1000] [Steps   12] [reward 13.0]
[Episode  160/1000] [Steps   12] [reward 13.0]
[Episode  165/1000] [Steps   21] [reward 22.0]
[Episode  170/1000] [Steps  126] [reward 127.0]
[Episode  175/1000] [Steps   10] [reward 11.0]
----------
[TEST Episode 175] [Average Reward 11.9]
----------
[Episode  180/1000] [Steps   10] [reward 11.0]
[Episode  185/1000] [Steps   19] [reward 20.0]
[Episode  190/1000] [Steps   32] [reward 33.0]
[Episode  195/1000] [Steps   14] [reward 15.0]
[Episode  200/1000] [Steps   39] [reward 40.0]
----------
saving model.
[TEST Episode 200] [Average Reward 38.8]
----------
[Episode  205/1000] [Steps   22] [reward 23.0]
[Episode  210/1000] [Steps   24] [reward 25.0]
[Episode  215/1000] [Steps   66] [reward 67.0]
[Episode  220/1000] [Steps   18] [reward 19.0]
[Episode  225/1000] [Steps   19] [reward 20.0]
----------
[TEST Episode 225] [Average Reward 20.1]
----------
[Episode  230/1000] [Steps   44] [reward 45.0]
[Episode  235/1000] [Steps   22] [reward 23.0]
[Episode  240/1000] [Steps   34] [reward 35.0]
[Episode  245/1000] [Steps   45] [reward 46.0]
[Episode  250/1000] [Steps   65] [reward 66.0]
----------
[TEST Episode 250] [Average Reward 17.6]
----------
[Episode  255/1000] [Steps   18] [reward 19.0]
[Episode  260/1000] [Steps   52] [reward 53.0]
[Episode  265/1000] [Steps   25] [reward 26.0]
[Episode  270/1000] [Steps   52] [reward 53.0]
[Episode  275/1000] [Steps   19] [reward 20.0]
----------
[TEST Episode 275] [Average Reward 25.0]
----------
[Episode  280/1000] [Steps   16] [reward 17.0]
[Episode  285/1000] [Steps   68] [reward 69.0]
[Episode  290/1000] [Steps   91] [reward 92.0]
[Episode  295/1000] [Steps  115] [reward 116.0]
[Episode  300/1000] [Steps  180] [reward 181.0]
----------
saving model.
[TEST Episode 300] [Average Reward 49.4]
----------
[Episode  305/1000] [Steps   17] [reward 18.0]
[Episode  310/1000] [Steps   32] [reward 33.0]
[Episode  315/1000] [Steps  138] [reward 139.0]
[Episode  320/1000] [Steps  199] [reward 200.0]
[Episode  325/1000] [Steps  163] [reward 164.0]
----------
saving model.
[TEST Episode 325] [Average Reward 118.4]
----------
[Episode  330/1000] [Steps    8] [reward 9.0]
[Episode  335/1000] [Steps   45] [reward 46.0]
[Episode  340/1000] [Steps   57] [reward 58.0]
[Episode  345/1000] [Steps   63] [reward 64.0]
[Episode  350/1000] [Steps  161] [reward 162.0]
----------
[TEST Episode 350] [Average Reward 11.4]
----------
[Episode  355/1000] [Steps  173] [reward 174.0]
[Episode  360/1000] [Steps   13] [reward 14.0]
[Episode  365/1000] [Steps  189] [reward 190.0]
[Episode  370/1000] [Steps  199] [reward 200.0]
[Episode  375/1000] [Steps  199] [reward 200.0]
----------
saving model.
[TEST Episode 375] [Average Reward 200.0]
----------
[Episode  380/1000] [Steps   10] [reward 11.0]
[Episode  385/1000] [Steps  154] [reward 155.0]
[Episode  390/1000] [Steps  168] [reward 169.0]
[Episode  395/1000] [Steps  197] [reward 198.0]
[Episode  400/1000] [Steps   15] [reward 16.0]
----------
[TEST Episode 400] [Average Reward 35.6]
----------
[Episode  405/1000] [Steps  199] [reward 200.0]
[Episode  410/1000] [Steps  186] [reward 187.0]
[Episode  415/1000] [Steps  199] [reward 200.0]
[Episode  420/1000] [Steps  120] [reward 121.0]
[Episode  425/1000] [Steps  199] [reward 200.0]
----------
[TEST Episode 425] [Average Reward 200.0]
----------
[Episode  430/1000] [Steps    8] [reward 9.0]
[Episode  435/1000] [Steps  115] [reward 116.0]
[Episode  440/1000] [Steps  166] [reward 167.0]
[Episode  445/1000] [Steps  169] [reward 170.0]
[Episode  450/1000] [Steps  100] [reward 101.0]
----------
[TEST Episode 450] [Average Reward 124.9]
----------
[Episode  455/1000] [Steps  199] [reward 200.0]
[Episode  460/1000] [Steps   20] [reward 21.0]
[Episode  465/1000] [Steps  146] [reward 147.0]
[Episode  470/1000] [Steps  164] [reward 165.0]
[Episode  475/1000] [Steps  199] [reward 200.0]
----------
[TEST Episode 475] [Average Reward 199.6]
----------
[Episode  480/1000] [Steps   25] [reward 26.0]
[Episode  485/1000] [Steps   13] [reward 14.0]
[Episode  490/1000] [Steps   99] [reward 100.0]
[Episode  495/1000] [Steps  118] [reward 119.0]
[Episode  500/1000] [Steps  124] [reward 125.0]
----------
[TEST Episode 500] [Average Reward 112.9]
----------
[Episode  505/1000] [Steps  164] [reward 165.0]
[Episode  510/1000] [Steps  164] [reward 165.0]
[Episode  515/1000] [Steps  147] [reward 148.0]
[Episode  520/1000] [Steps  199] [reward 200.0]
[Episode  525/1000] [Steps   16] [reward 17.0]
----------
[TEST Episode 525] [Average Reward 13.0]
----------
[Episode  530/1000] [Steps  199] [reward 200.0]
[Episode  535/1000] [Steps  199] [reward 200.0]
[Episode  540/1000] [Steps  199] [reward 200.0]
[Episode  545/1000] [Steps   12] [reward 13.0]
[Episode  550/1000] [Steps  199] [reward 200.0]
----------
[TEST Episode 550] [Average Reward 200.0]
----------
[Episode  555/1000] [Steps  199] [reward 200.0]
[Episode  560/1000] [Steps  162] [reward 163.0]
[Episode  565/1000] [Steps  160] [reward 161.0]
[Episode  570/1000] [Steps   13] [reward 14.0]
[Episode  575/1000] [Steps  158] [reward 159.0]
----------
[TEST Episode 575] [Average Reward 104.1]
----------
[Episode  580/1000] [Steps  140] [reward 141.0]
[Episode  585/1000] [Steps  147] [reward 148.0]
[Episode  590/1000] [Steps  120] [reward 121.0]
[Episode  595/1000] [Steps  135] [reward 136.0]
[Episode  600/1000] [Steps  164] [reward 165.0]
----------
[TEST Episode 600] [Average Reward 200.0]
----------
[Episode  605/1000] [Steps  131] [reward 132.0]
[Episode  610/1000] [Steps  199] [reward 200.0]
[Episode  615/1000] [Steps  199] [reward 200.0]
[Episode  620/1000] [Steps   87] [reward 88.0]
[Episode  625/1000] [Steps  199] [reward 200.0]
----------
[TEST Episode 625] [Average Reward 200.0]
----------
[Episode  630/1000] [Steps  199] [reward 200.0]
[Episode  635/1000] [Steps  199] [reward 200.0]
[Episode  640/1000] [Steps   11] [reward 12.0]
[Episode  645/1000] [Steps  126] [reward 127.0]
[Episode  650/1000] [Steps  135] [reward 136.0]
----------
[TEST Episode 650] [Average Reward 199.8]
----------
[Episode  655/1000] [Steps  199] [reward 200.0]
[Episode  660/1000] [Steps  130] [reward 131.0]
[Episode  665/1000] [Steps  141] [reward 142.0]
[Episode  670/1000] [Steps  199] [reward 200.0]
[Episode  675/1000] [Steps  199] [reward 200.0]
----------
[TEST Episode 675] [Average Reward 200.0]
----------
[Episode  680/1000] [Steps  199] [reward 200.0]
[Episode  685/1000] [Steps  199] [reward 200.0]
[Episode  690/1000] [Steps  199] [reward 200.0]
[Episode  695/1000] [Steps  196] [reward 197.0]
[Episode  700/1000] [Steps  199] [reward 200.0]
----------
[TEST Episode 700] [Average Reward 200.0]
----------
[Episode  705/1000] [Steps   14] [reward 15.0]
[Episode  710/1000] [Steps  199] [reward 200.0]
[Episode  715/1000] [Steps  199] [reward 200.0]
[Episode  720/1000] [Steps  199] [reward 200.0]
[Episode  725/1000] [Steps  199] [reward 200.0]
----------
[TEST Episode 725] [Average Reward 197.8]
----------
[Episode  730/1000] [Steps  199] [reward 200.0]
[Episode  735/1000] [Steps  199] [reward 200.0]
[Episode  740/1000] [Steps  199] [reward 200.0]
[Episode  745/1000] [Steps  199] [reward 200.0]
[Episode  750/1000] [Steps  199] [reward 200.0]
----------
[TEST Episode 750] [Average Reward 200.0]
----------
[Episode  755/1000] [Steps   16] [reward 17.0]
[Episode  760/1000] [Steps  189] [reward 190.0]
[Episode  765/1000] [Steps  199] [reward 200.0]
[Episode  770/1000] [Steps   16] [reward 17.0]
[Episode  775/1000] [Steps  199] [reward 200.0]
----------
[TEST Episode 775] [Average Reward 200.0]
----------
[Episode  780/1000] [Steps  199] [reward 200.0]
[Episode  785/1000] [Steps   59] [reward 60.0]
[Episode  790/1000] [Steps  199] [reward 200.0]
[Episode  795/1000] [Steps  199] [reward 200.0]
[Episode  800/1000] [Steps  199] [reward 200.0]
----------
[TEST Episode 800] [Average Reward 200.0]
----------
[Episode  805/1000] [Steps  199] [reward 200.0]
[Episode  810/1000] [Steps  199] [reward 200.0]
[Episode  815/1000] [Steps  199] [reward 200.0]
[Episode  820/1000] [Steps  199] [reward 200.0]
[Episode  825/1000] [Steps  199] [reward 200.0]
----------
[TEST Episode 825] [Average Reward 200.0]
----------
[Episode  830/1000] [Steps  199] [reward 200.0]
[Episode  835/1000] [Steps  199] [reward 200.0]
[Episode  840/1000] [Steps  199] [reward 200.0]
[Episode  845/1000] [Steps  199] [reward 200.0]
[Episode  850/1000] [Steps  199] [reward 200.0]
----------
[TEST Episode 850] [Average Reward 200.0]
----------
[Episode  855/1000] [Steps  199] [reward 200.0]
[Episode  860/1000] [Steps  199] [reward 200.0]
[Episode  865/1000] [Steps   12] [reward 13.0]
[Episode  870/1000] [Steps  199] [reward 200.0]
[Episode  875/1000] [Steps  199] [reward 200.0]
----------
[TEST Episode 875] [Average Reward 200.0]
----------
[Episode  880/1000] [Steps  199] [reward 200.0]
[Episode  885/1000] [Steps  199] [reward 200.0]
[Episode  890/1000] [Steps  199] [reward 200.0]
[Episode  895/1000] [Steps   12] [reward 13.0]
[Episode  900/1000] [Steps  199] [reward 200.0]
----------
[TEST Episode 900] [Average Reward 200.0]
----------
[Episode  905/1000] [Steps  199] [reward 200.0]
[Episode  910/1000] [Steps  199] [reward 200.0]
[Episode  915/1000] [Steps  199] [reward 200.0]
[Episode  920/1000] [Steps  199] [reward 200.0]
[Episode  925/1000] [Steps  199] [reward 200.0]
----------
[TEST Episode 925] [Average Reward 200.0]
----------
[Episode  930/1000] [Steps  199] [reward 200.0]
[Episode  935/1000] [Steps   11] [reward 12.0]
[Episode  940/1000] [Steps  199] [reward 200.0]
[Episode  945/1000] [Steps  199] [reward 200.0]
[Episode  950/1000] [Steps   11] [reward 12.0]
----------
[TEST Episode 950] [Average Reward 105.8]
----------
[Episode  955/1000] [Steps  199] [reward 200.0]
[Episode  960/1000] [Steps  199] [reward 200.0]
[Episode  965/1000] [Steps  199] [reward 200.0]
[Episode  970/1000] [Steps  199] [reward 200.0]
[Episode  975/1000] [Steps  199] [reward 200.0]
----------
[TEST Episode 975] [Average Reward 200.0]
----------
[Episode  980/1000] [Steps  199] [reward 200.0]
[Episode  985/1000] [Steps    9] [reward 10.0]
[Episode  990/1000] [Steps  199] [reward 200.0]
[Episode  995/1000] [Steps   11] [reward 12.0]
[Episode 1000/1000] [Steps  199] [reward 200.0]
----------
[TEST Episode 1000] [Average Reward 200.0]
'''

### 2.4 Add the Experience Replay Buffer

If you read the DQN paper (and as you can see from the algorithm picture above), the authors make use of an experience replay buffer to learn faster. We provide an implementation in the file `replay_buffer.py`. Update the `train_reinforcement_learning` code to push a tuple to the replay buffer and to sample a batch for the `optimize_model` function.
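For intuition, a replay buffer can be as simple as a fixed-size queue of transitions with uniform random sampling. The sketch below is a hypothetical stand-in, not the provided `replay_buffer.py`, whose interface may differ. The class name, method names, and the toy transitions are assumptions.

```python
import random
from collections import deque

class SimpleReplayBuffer:
    """Fixed-capacity buffer that evicts the oldest transition when full."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, next_state, reward, done):
        self.memory.append((state, action, next_state, reward, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        batch = random.sample(list(self.memory), batch_size)
        # Transpose a list of transitions into per-field tuples.
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.memory)

buffer = SimpleReplayBuffer(capacity=3)
for t in range(5):  # push 5 toy transitions into a buffer of size 3
    buffer.push([float(t)], t % 2, [float(t + 1)], 1.0, False)

states, actions, next_states, rewards, dones = buffer.sample(2)
```

Sampling uniformly from stored transitions breaks the strong correlation between consecutive steps and lets each transition be reused in several updates.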

In [None]:
# PLACEHOLDER for optimize_model
'''
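def optimize_model(state, action, next_state, reward, done):
    # TD target: r + gamma * max_a' Q(s', a'); (1-done) disables bootstrapping on terminal transitions.
    # detach() keeps gradients from flowing through the target.
    target_value = reward + GAMMA * (1-done) * model(next_state).max(1)[0].detach()
    # Q(s, a) for the actions actually taken in the batch.
    output = model(state).gather(1, action.type(torch.long)).squeeze(1)

    loss = torch.nn.MSELoss()(output, target_value)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()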
'''

**[QUESTION 5 points]** How does the replay buffer improve performance?

With a replay buffer, the agent learns a good policy much faster: a high average reward is reached after about 100 episodes. This is because the replay buffer allows the agent to learn from good past experience multiple times instead of using each transition only once. However, training is still somewhat unstable. This instability arises because the output and the target both depend on the same set of parameters (those of the Q-network).
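
One way to write this down, for a single transition $(s, a, r, s')$ with done flag $d$ and Q-network parameters $\theta$ (this notation is assumed here; it mirrors the `optimize_model` code above):

$$
y(\theta) = r + \gamma \, (1 - d) \, \max_{a'} Q(s', a'; \theta), \qquad L(\theta) = \big( Q(s, a; \theta) - y(\theta) \big)^2
$$

Even though the target is detached when computing gradients, its value $y(\theta)$ still changes after every update of $\theta$, so each step that moves $Q(s, a; \theta)$ also moves the target it is regressing toward.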

In [12]:
## PASTE YOUR TERMINAL OUTPUT HERE
# NOTE: TO HAVE FEWER LINES PRINTED, YOU CAN SET THE VARIABLE PRINT_INTERVAL TO 5 or 10
'''
[Episode    5/500] [Steps    9] [reward 10.0]
[Episode   10/500] [Steps    9] [reward 10.0]
[Episode   15/500] [Steps   10] [reward 11.0]
[Episode   20/500] [Steps    7] [reward 8.0]
[Episode   25/500] [Steps    8] [reward 9.0]
----------
saving model.
[TEST Episode 25] [Average Reward 9.4]
----------
[Episode   30/500] [Steps   11] [reward 12.0]
[Episode   35/500] [Steps   11] [reward 12.0]
[Episode   40/500] [Steps   19] [reward 20.0]
[Episode   45/500] [Steps    9] [reward 10.0]
[Episode   50/500] [Steps   10] [reward 11.0]
----------
saving model.
[TEST Episode 50] [Average Reward 40.9]
----------
[Episode   55/500] [Steps   24] [reward 25.0]
[Episode   60/500] [Steps  141] [reward 142.0]
[Episode   65/500] [Steps   76] [reward 77.0]
[Episode   70/500] [Steps   59] [reward 60.0]
[Episode   75/500] [Steps   86] [reward 87.0]
----------
saving model.
[TEST Episode 75] [Average Reward 69.3]
----------
[Episode   80/500] [Steps  102] [reward 103.0]
[Episode   85/500] [Steps  199] [reward 200.0]
[Episode   90/500] [Steps  199] [reward 200.0]
[Episode   95/500] [Steps  199] [reward 200.0]
[Episode  100/500] [Steps  199] [reward 200.0]
----------
saving model.
[TEST Episode 100] [Average Reward 197.1]
----------
[Episode  105/500] [Steps  199] [reward 200.0]
[Episode  110/500] [Steps  199] [reward 200.0]
[Episode  115/500] [Steps  199] [reward 200.0]
[Episode  120/500] [Steps  199] [reward 200.0]
[Episode  125/500] [Steps  199] [reward 200.0]
----------
saving model.
[TEST Episode 125] [Average Reward 200.0]
----------
[Episode  130/500] [Steps  199] [reward 200.0]
[Episode  135/500] [Steps  199] [reward 200.0]
[Episode  140/500] [Steps    8] [reward 9.0]
[Episode  145/500] [Steps    9] [reward 10.0]
[Episode  150/500] [Steps   17] [reward 18.0]
----------
[TEST Episode 150] [Average Reward 13.6]
----------
[Episode  155/500] [Steps  199] [reward 200.0]
[Episode  160/500] [Steps  199] [reward 200.0]
[Episode  165/500] [Steps  199] [reward 200.0]
[Episode  170/500] [Steps   66] [reward 67.0]
[Episode  175/500] [Steps  199] [reward 200.0]
----------
[TEST Episode 175] [Average Reward 200.0]
----------
[Episode  180/500] [Steps   57] [reward 58.0]
[Episode  185/500] [Steps   71] [reward 72.0]
[Episode  190/500] [Steps  199] [reward 200.0]
[Episode  195/500] [Steps  197] [reward 198.0]
[Episode  200/500] [Steps  198] [reward 199.0]
----------
[TEST Episode 200] [Average Reward 164.1]
----------
[Episode  205/500] [Steps  199] [reward 200.0]
[Episode  210/500] [Steps  199] [reward 200.0]
[Episode  215/500] [Steps  199] [reward 200.0]
[Episode  220/500] [Steps  199] [reward 200.0]
[Episode  225/500] [Steps   18] [reward 19.0]
----------
[TEST Episode 225] [Average Reward 200.0]
----------
[Episode  230/500] [Steps  199] [reward 200.0]
[Episode  235/500] [Steps  199] [reward 200.0]
[Episode  240/500] [Steps  180] [reward 181.0]
[Episode  245/500] [Steps  199] [reward 200.0]
[Episode  250/500] [Steps  199] [reward 200.0]
----------
[TEST Episode 250] [Average Reward 200.0]
----------
[Episode  255/500] [Steps  199] [reward 200.0]
[Episode  260/500] [Steps   35] [reward 36.0]
[Episode  265/500] [Steps  172] [reward 173.0]
[Episode  270/500] [Steps  199] [reward 200.0]
[Episode  275/500] [Steps  184] [reward 185.0]
----------
[TEST Episode 275] [Average Reward 200.0]
----------
[Episode  280/500] [Steps  110] [reward 111.0]
[Episode  285/500] [Steps  150] [reward 151.0]
[Episode  290/500] [Steps  199] [reward 200.0]
[Episode  295/500] [Steps  199] [reward 200.0]
[Episode  300/500] [Steps  199] [reward 200.0]
----------
[TEST Episode 300] [Average Reward 200.0]
----------
[Episode  305/500] [Steps  199] [reward 200.0]
[Episode  310/500] [Steps  199] [reward 200.0]
[Episode  315/500] [Steps   64] [reward 65.0]
[Episode  320/500] [Steps  117] [reward 118.0]
[Episode  325/500] [Steps  179] [reward 180.0]
----------
[TEST Episode 325] [Average Reward 165.0]
----------
[Episode  330/500] [Steps  112] [reward 113.0]
[Episode  335/500] [Steps  130] [reward 131.0]
[Episode  340/500] [Steps  174] [reward 175.0]
[Episode  345/500] [Steps  199] [reward 200.0]
[Episode  350/500] [Steps  199] [reward 200.0]
----------
[TEST Episode 350] [Average Reward 200.0]
----------
[Episode  355/500] [Steps  199] [reward 200.0]
[Episode  360/500] [Steps  199] [reward 200.0]
[Episode  365/500] [Steps   46] [reward 47.0]
[Episode  370/500] [Steps  101] [reward 102.0]
[Episode  375/500] [Steps   57] [reward 58.0]
----------
[TEST Episode 375] [Average Reward 200.0]
----------
[Episode  380/500] [Steps  199] [reward 200.0]
[Episode  385/500] [Steps  151] [reward 152.0]
[Episode  390/500] [Steps  196] [reward 197.0]
[Episode  395/500] [Steps  199] [reward 200.0]
[Episode  400/500] [Steps  199] [reward 200.0]
----------
[TEST Episode 400] [Average Reward 198.8]
----------
[Episode  405/500] [Steps  199] [reward 200.0]
[Episode  410/500] [Steps  199] [reward 200.0]
[Episode  415/500] [Steps  199] [reward 200.0]
[Episode  420/500] [Steps  199] [reward 200.0]
[Episode  425/500] [Steps  161] [reward 162.0]
----------
[TEST Episode 425] [Average Reward 200.0]
----------
[Episode  430/500] [Steps  199] [reward 200.0]
[Episode  435/500] [Steps  186] [reward 187.0]
[Episode  440/500] [Steps   19] [reward 20.0]
[Episode  445/500] [Steps  199] [reward 200.0]
[Episode  450/500] [Steps  175] [reward 176.0]
----------
[TEST Episode 450] [Average Reward 200.0]
----------
[Episode  455/500] [Steps  199] [reward 200.0]
[Episode  460/500] [Steps   15] [reward 16.0]
[Episode  465/500] [Steps   10] [reward 11.0]
[Episode  470/500] [Steps  199] [reward 200.0]
[Episode  475/500] [Steps  199] [reward 200.0]
----------
[TEST Episode 475] [Average Reward 200.0]
----------
[Episode  480/500] [Steps  199] [reward 200.0]
[Episode  485/500] [Steps  199] [reward 200.0]
[Episode  490/500] [Steps  199] [reward 200.0]
[Episode  495/500] [Steps  162] [reward 163.0]
[Episode  500/500] [Steps  199] [reward 200.0]
----------
[TEST Episode 500] [Average Reward 200.0]
'''

## Task 3: Extra

Ideas to experiment with:

- Is the $\epsilon$-greedy strategy the best strategy available? Why not try something different?
- Why not make use of the model you have trained in the behavioral cloning part and fine-tune it with RL? How does that affect performance?
- You are perhaps bored with `CartPole-v0` by now. Another environment we suggest trying is `LunarLander-v2`. It will be harder to learn but with experimentation, you will find the correct optimizations for success. Piazza is also your friend :)
- What about learning from images? This requires more work because you have to extract the image from the environment. However, would it be possible? How much more challenging might you expect the learning to be in this case?
- The ReplayBuffer implementation provided is very simple. In class we have briefly mentioned Prioritized Experience Replay; how would the learning process change?
- An improvement over DQN is DoubleDQN, which is a very simple addition to the current code.
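
As an illustration of the DoubleDQN idea from the last bullet, the sketch below computes the targets for a small batch in plain Python. The function name `double_dqn_targets`, the list-of-lists inputs, and the value of `GAMMA` are assumptions made for this example, not part of the assignment code. The only change from the standard DQN target is that the online network selects the next action while the target network evaluates it.

```python
GAMMA = 0.99  # assumed discount factor; use the notebook's own value


def double_dqn_targets(q_online_next, q_target_next, rewards, dones, gamma=GAMMA):
    """Compute Double DQN targets for a batch of transitions.

    q_online_next[i][a]: online-network Q-value of action a in next state i
    q_target_next[i][a]: target-network Q-value of action a in next state i
    """
    targets = []
    for q_on, q_tg, r, d in zip(q_online_next, q_target_next, rewards, dones):
        # 1) The online network picks the greedy next action.
        best_action = max(range(len(q_on)), key=lambda a: q_on[a])
        # 2) The target network supplies the value of that action.
        bootstrap = q_tg[best_action]
        # Terminal transitions (d == 1) do not bootstrap.
        targets.append(r + gamma * (1 - d) * bootstrap)
    return targets


# Toy batch: two transitions, two actions each; the second one is terminal.
targets = double_dqn_targets(
    q_online_next=[[1.0, 2.0], [5.0, 3.0]],
    q_target_next=[[10.0, 4.0], [7.0, 9.0]],
    rewards=[1.0, 1.0],
    dones=[0, 1],
)
```

In the first transition the online network prefers action 1, so the target uses the target network's value 4.0 for that action rather than its own maximum of 10.0, which is how DoubleDQN reduces overestimation. With PyTorch tensors, the same two steps would typically be an `argmax` over the online network's output followed by a `gather` on the target network's output.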



In [13]:
# YOU CAN USE THIS CODEBLOCK AND ADD ANY BLOCK BELOW AS YOU NEED
# TO SHOW US THE IDEAS AND EXTRA EXPERIMENTS YOU RUN.
# HAVE FUN!

# We now use the target network through a small modification to optimize_model:
# the target network computes the target Q value that the model output is regressed toward.
# The update for the target network is already provided in the training loop.
'''
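def optimize_model(state, action, next_state, reward, done):
    # The target now comes from the separate target network instead of the model itself.
    # detach() avoids computing gradients for the target network's parameters.
    target_value = reward + GAMMA * (1-done) * target(next_state).max(1)[0].detach()
    # Q(s, a) from the online model for the actions actually taken.
    output = model(state).gather(1, action.type(torch.long)).squeeze(1)

    loss = torch.nn.MSELoss()(output, target_value)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()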
'''

The agent learns a good policy after roughly 350 to 400 episodes (in the log below, the test average reward first reaches 200.0 at episode 350), and its performance from then on is mostly consistent.

In [None]:
'''
[Episode    5/4000] [Steps   10] [reward 11.0]
[Episode   10/4000] [Steps   11] [reward 12.0]
[Episode   15/4000] [Steps    8] [reward 9.0]
[Episode   20/4000] [Steps   12] [reward 13.0]
[Episode   25/4000] [Steps    8] [reward 9.0]
----------
saving model.
[TEST Episode 25] [Average Reward 9.4]
----------
[Episode   30/4000] [Steps   11] [reward 12.0]
[Episode   35/4000] [Steps   12] [reward 13.0]
[Episode   40/4000] [Steps   10] [reward 11.0]
[Episode   45/4000] [Steps    9] [reward 10.0]
[Episode   50/4000] [Steps   14] [reward 15.0]
----------
[TEST Episode 50] [Average Reward 9.0]
----------
[Episode   55/4000] [Steps   12] [reward 13.0]
[Episode   60/4000] [Steps   10] [reward 11.0]
[Episode   65/4000] [Steps   10] [reward 11.0]
[Episode   70/4000] [Steps    7] [reward 8.0]
[Episode   75/4000] [Steps   11] [reward 12.0]
----------
saving model.
[TEST Episode 75] [Average Reward 9.6]
----------
[Episode   80/4000] [Steps   13] [reward 14.0]
[Episode   85/4000] [Steps    9] [reward 10.0]
[Episode   90/4000] [Steps    9] [reward 10.0]
[Episode   95/4000] [Steps    9] [reward 10.0]
[Episode  100/4000] [Steps    9] [reward 10.0]
----------
[TEST Episode 100] [Average Reward 9.5]
----------
[Episode  105/4000] [Steps   10] [reward 11.0]
[Episode  110/4000] [Steps    7] [reward 8.0]
[Episode  115/4000] [Steps   11] [reward 12.0]
[Episode  120/4000] [Steps    8] [reward 9.0]
[Episode  125/4000] [Steps    8] [reward 9.0]
----------
saving model.
[TEST Episode 125] [Average Reward 9.8]
----------
[Episode  130/4000] [Steps    7] [reward 8.0]
[Episode  135/4000] [Steps    9] [reward 10.0]
[Episode  140/4000] [Steps   12] [reward 13.0]
[Episode  145/4000] [Steps    8] [reward 9.0]
[Episode  150/4000] [Steps   10] [reward 11.0]
----------
saving model.
[TEST Episode 150] [Average Reward 10.3]
----------
[Episode  155/4000] [Steps   10] [reward 11.0]
[Episode  160/4000] [Steps   12] [reward 13.0]
[Episode  165/4000] [Steps   13] [reward 14.0]
[Episode  170/4000] [Steps   10] [reward 11.0]
[Episode  175/4000] [Steps   19] [reward 20.0]
----------
saving model.
[TEST Episode 175] [Average Reward 12.6]
----------
[Episode  180/4000] [Steps   12] [reward 13.0]
[Episode  185/4000] [Steps   18] [reward 19.0]
[Episode  190/4000] [Steps   35] [reward 36.0]
[Episode  195/4000] [Steps   11] [reward 12.0]
[Episode  200/4000] [Steps    8] [reward 9.0]
----------
[TEST Episode 200] [Average Reward 9.3]
----------
[Episode  205/4000] [Steps   18] [reward 19.0]
[Episode  210/4000] [Steps   15] [reward 16.0]
[Episode  215/4000] [Steps   10] [reward 11.0]
[Episode  220/4000] [Steps    7] [reward 8.0]
[Episode  225/4000] [Steps   18] [reward 19.0]
----------
saving model.
[TEST Episode 225] [Average Reward 15.4]
----------
[Episode  230/4000] [Steps    8] [reward 9.0]
[Episode  235/4000] [Steps   16] [reward 17.0]
[Episode  240/4000] [Steps   12] [reward 13.0]
[Episode  245/4000] [Steps   40] [reward 41.0]
[Episode  250/4000] [Steps   18] [reward 19.0]
----------
saving model.
[TEST Episode 250] [Average Reward 25.4]
----------
[Episode  255/4000] [Steps   28] [reward 29.0]
[Episode  260/4000] [Steps   75] [reward 76.0]
[Episode  265/4000] [Steps   41] [reward 42.0]
[Episode  270/4000] [Steps   42] [reward 43.0]
[Episode  275/4000] [Steps   52] [reward 53.0]
----------
saving model.
[TEST Episode 275] [Average Reward 84.7]
----------
[Episode  280/4000] [Steps   53] [reward 54.0]
[Episode  285/4000] [Steps   54] [reward 55.0]
[Episode  290/4000] [Steps  123] [reward 124.0]
[Episode  295/4000] [Steps   75] [reward 76.0]
[Episode  300/4000] [Steps   47] [reward 48.0]
----------
saving model.
[TEST Episode 300] [Average Reward 105.0]
----------
[Episode  305/4000] [Steps   77] [reward 78.0]
[Episode  310/4000] [Steps   78] [reward 79.0]
[Episode  315/4000] [Steps   87] [reward 88.0]
[Episode  320/4000] [Steps   73] [reward 74.0]
[Episode  325/4000] [Steps  160] [reward 161.0]
----------
saving model.
[TEST Episode 325] [Average Reward 162.4]
----------
[Episode  330/4000] [Steps  157] [reward 158.0]
[Episode  335/4000] [Steps  199] [reward 200.0]
[Episode  340/4000] [Steps  199] [reward 200.0]
[Episode  345/4000] [Steps  199] [reward 200.0]
[Episode  350/4000] [Steps  199] [reward 200.0]
----------
saving model.
[TEST Episode 350] [Average Reward 200.0]
----------
[Episode  355/4000] [Steps  199] [reward 200.0]
[Episode  360/4000] [Steps  199] [reward 200.0]
[Episode  365/4000] [Steps  199] [reward 200.0]
[Episode  370/4000] [Steps  199] [reward 200.0]
[Episode  375/4000] [Steps  199] [reward 200.0]
----------
[TEST Episode 375] [Average Reward 200.0]
----------
[Episode  380/4000] [Steps  199] [reward 200.0]
[Episode  385/4000] [Steps  199] [reward 200.0]
[Episode  390/4000] [Steps  199] [reward 200.0]
[Episode  395/4000] [Steps  199] [reward 200.0]
[Episode  400/4000] [Steps  199] [reward 200.0]
----------
[TEST Episode 400] [Average Reward 200.0]
----------
[Episode  405/4000] [Steps  199] [reward 200.0]
[Episode  410/4000] [Steps  199] [reward 200.0]
[Episode  415/4000] [Steps  199] [reward 200.0]
[Episode  420/4000] [Steps  199] [reward 200.0]
[Episode  425/4000] [Steps  199] [reward 200.0]
----------
[TEST Episode 425] [Average Reward 200.0]
----------
[Episode  430/4000] [Steps  199] [reward 200.0]
[Episode  435/4000] [Steps  199] [reward 200.0]
[Episode  440/4000] [Steps  199] [reward 200.0]
[Episode  445/4000] [Steps  199] [reward 200.0]
[Episode  450/4000] [Steps  199] [reward 200.0]
----------
[TEST Episode 450] [Average Reward 200.0]
----------
[Episode  455/4000] [Steps  199] [reward 200.0]
[Episode  460/4000] [Steps  199] [reward 200.0]
[Episode  465/4000] [Steps  199] [reward 200.0]
[Episode  470/4000] [Steps  199] [reward 200.0]
[Episode  475/4000] [Steps  199] [reward 200.0]
'''