# Code for the final report in 02456 Deep Learning 

**Authors:** *Mads Emil Dahlgaard (S164206), Morten Wehlast Jørgensen (S147056), and Niels Asp Fuglsang (S164181)*

This notebook recreates the main results from the report *An analysis of levers in deep reinforcement learning and how they affect learning speed and generalization*. The full code base can be found at https://github.com/NielsFuglsang/02456-Deep-Learning.

---

## Introduction

Deep reinforcement learning is notorious for results that are difficult to reproduce. Even very small changes to hyperparameters can lead to completely different results. Therefore, in this project we set out to investigate how to best increase one's chances of improving model performance. We identified the following three areas.
- Data set size
- Input transformation
- Choice of policy optimization algorithm

We believe that these three areas are important for any reinforcement learning practitioner and particularly for newcomers to the field.

In order to investigate this we use the Starpilot environment from the Procgen Benchmark. The procedurally generated environment allows for virtually unlimited training data and is a perfect starting point for investigating the importance of data set size. In order to explore input transformation we chose to focus on two different convolutional neural networks: the simple Nature CNN and the more advanced IMPALA CNN. Lastly, we compare the state-of-the-art policy optimization algorithms PPO and TRPO. To sum up, we conclude the following three points.
- Data set size: larger volumes of varied data generally improve generalization
- Encoders: IMPALA CNN outperforms Nature CNN even on smaller tasks
- Policy optimization: PPO is both simpler and better performing than TRPO

## Technical stack

The Python version is 3.7.7 (64-bit). All experiments are run on DTU HPC GPU nodes. Specifically, the experiments are run on the `gpuv100` queue, which consists of 10 nodes with 2 Nvidia Tesla V100 Tensor Core GPUs; 6 of the nodes have 16 GB RAM and 4 of the nodes have 32 GB. The following list contains the most significant modules used. For a full list see `requirements.txt` in the GitHub repository.
- CUDA 9.2
- CuDNN 7.4.2.24
- Jupyter Lab 2.2.9
- Matplotlib 3.3.3
- Numpy 1.19.4
- OpenAI Gym 0.17.3
- Procgen 0.10.4
- PyTorch 1.7.0

## Code structure

We tried to make our code as modular as possible. The overall code structure looks like this:

```
├── src
│   └── __init__.py
│   └── encoder.py         # Classes for each encoder structure.
│   └── experiment.py      # Class for training and evaluating a policy.
│   └── policy.py          # PPO and TRPO classes. 
│   └── utils.py           # Functions for keeping track of environments and data.
├── params
│   └── ...                # JSON files specifying experiment hyperparameters.
├── .gitignore
├── README.md
├── job.sh                 # Jobscript for running on HPC.
├── requirements.txt       # Python modules.
├── run_experiment.py      # Read parameters from JSON file, train, and evaluate policy.
├── sender.sh              # Helper function to submit jobs on HPC.
```

The folder `src` contains the main code for this project and is structured as a Python package. The file `encoder.py` implements the two convolutional neural networks used in this project: IMPALA CNN and Nature CNN. The file `policy.py` contains the classes `PPO` and `TRPO`, each with two main methods: `act`, which uses the policy to sample an action, and `loss`, which returns the policy loss given some observation.
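
To make the role of a `loss` method concrete, the sketch below computes the clipped surrogate objective that PPO is known for, using plain Python floats so the arithmetic is easy to follow. It is not the code from `policy.py`: the function name `ppo_clip_loss` and its arguments are hypothetical, and it covers only the policy term, without the value and entropy terms that a full loss would add.

```python
import math


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped PPO policy loss for a batch given as lists of floats.

    Hypothetical helper for illustration; not taken from policy.py.
    """
    terms = []
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the updated and the data-collecting policy.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - eps, 1 + eps].
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        # Take the pessimistic (smaller) of the two surrogate terms.
        terms.append(min(ratio * adv, clipped * adv))
    # Negate because optimizers minimize, while the objective is maximized.
    return -sum(terms) / len(terms)


# The ratio is 1.5 for both samples; only the positive advantage is clipped.
loss = ppo_clip_loss([math.log(1.5)] * 2, [0.0] * 2, [1.0, -1.0])
```

With a positive advantage the clipped term (1.2) is used, while with a negative advantage the unclipped term (-1.5) is the smaller one, so the policy gains nothing from moving far away from the data-collecting policy in either case.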

The file `experiment.py` contains code for training and evaluating a policy in an environment. The `Experiment` class from `experiment.py` is independent of the choice of encoder and policy, so the same code can be used to train both a TRPO and a PPO policy. We therefore only need to specify the encoder, the policy, and the hyperparameters. As an example, the following code is sufficient to train a PPO policy with the IMPALA CNN as encoder.

```python
exp = Experiment(params)
encoder = Impala(in_channels, feature_dim)
policy = PPO(encoder, feature_dim, num_actions)
policy, log = exp.train(env, policy, optimizer, storage)
```

Notice the variables `params`, `in_channels`, `feature_dim`, and so on. They are all hyperparameters needed to train and evaluate the policy. The combination of policy optimization algorithm, encoder, and hyperparameters is essentially what defines an experiment. We came up with a way to specify all the parameters needed to run an experiment, which enabled us to easily conduct multiple different experiments without changing the code. All the experiments are specified as JSON files in the folder `params`. An experiment could look like this.

```JSON
{
    "total_steps" : 2e6,
    "num_envs": 32,
    "num_levels": 10,
    "num_steps": 256,
    "num_epochs": 3,
    "batch_size": 512,
    "eps": 0.2,
    "grad_eps": 0.5,
    "value_coef": 0.5,
    "entropy_coef": 0.01,
    "feature_dim": 128,
    "policy": "ppo",
    "encoder": "nature",
    "beta": 0,
    "lr": 5e-4
}
```

The Python file `run_experiment.py` then takes as argument the name of a JSON file with experiment parameters (without the `.json` extension) and runs the entire pipeline, i.e., train, evaluate, and save the results. If the above parameters are stored in `params/experiment1.json`, then the experiment can be executed by the following command.
```sh
python run_experiment.py experiment1
```
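
As a rough illustration of the first step of such a pipeline, the sketch below resolves an experiment name to a file in `params` and reads it. The helper name `load_params` and its `params_dir` argument are assumptions made for this example and may differ from what `run_experiment.py` actually does. One detail worth knowing is that Python's `json` module parses a number written with an exponent, such as `2e6`, as a float, so the sketch casts `total_steps` back to an integer.

```python
import json
from pathlib import Path


def load_params(name, params_dir="params"):
    """Read `<params_dir>/<name>.json` and return the parameters as a dict.

    Hypothetical helper; the real run_experiment.py may differ.
    """
    path = Path(params_dir) / f"{name}.json"
    with path.open() as f:
        params = json.load(f)
    # JSON numbers written with an exponent (e.g. 2e6) are parsed as floats,
    # so cast the step budget back to an integer before using it as a count.
    params["total_steps"] = int(params["total_steps"])
    return params
```

Under these assumptions, `load_params("experiment1")` would return the dictionary stored in `params/experiment1.json`.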

### Running the code on DTU HPC

We needed to specify the above as a jobscript in order to run the experiments on the DTU HPC cluster queue system. The jobscript is shown below.

```sh
#!/bin/sh
#BSUB -q gpuv100
#BSUB -gpu "num=1"
#BSUB -J name
#BSUB -n 1
#BSUB -W 10:00
#BSUB -R "rusage[mem=32GB]"
#BSUB -o logs/name.out
#BSUB -e logs/name.err

cd /zhome/ff/2/118359/projects/02456-Deep-Learning
source .venv/bin/activate

echo "Running script..."
python run_experiment.py name
```

To simplify this even further, we created a small shell script, `sender.sh`, that replaces `name` in the jobscript above with an input argument and submits the result to the queue using `bsub`. One can therefore conduct an experiment by creating a parameter JSON file and submitting the experiment like this:
```sh
source sender.sh name_of_experiment
```

This way we could easily specify and run the different experiments.
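
The substitution step can be pictured with the following Python sketch. It illustrates the idea only and is not the shell script itself: `JOB_TEMPLATE` is an abbreviated copy of the jobscript above, `render_jobscript` is a hypothetical name, and the submission to `bsub` is left out.

```python
import re

# Abbreviated copy of the jobscript, with `name` as the placeholder.
JOB_TEMPLATE = """#!/bin/sh
#BSUB -J name
#BSUB -o logs/name.out
#BSUB -e logs/name.err
python run_experiment.py name
"""


def render_jobscript(template, experiment):
    """Replace the placeholder word `name` with the experiment name."""
    # \b restricts the match to the whole word `name`; the lambda inserts
    # the experiment name literally, without backslash interpretation.
    return re.sub(r"\bname\b", lambda match: experiment, template)


script = render_jobscript(JOB_TEMPLATE, "experiment1")
```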

## Training loop

In the training loop we both generate the training data and train on that data. This is possible since we are dealing with a procedurally generated environment. Therefore, the training loop goes like this.
- Generate `num_steps` of training data.
- Go over that training data for `num_epochs` epochs.
     - For each epoch calculate the loss, perform backward propagation, and update the policy.
- Generate new data and evaluate the test reward.

The above loop is run as long as the accumulated number of steps is less than `total_steps`. Notice that this differs from most real-life machine learning, since we are able to simply generate new data. Furthermore, the above is run in parallel for 32 environments (`num_envs`) to speed up the process.
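
The steps above can be sketched as the following skeleton. The callables `collect`, `update`, and `evaluate` are hypothetical placeholders for rollout generation, one optimization epoch, and test evaluation; they are not the methods of the `Experiment` class, and the step accounting is left to `collect` rather than assumed.

```python
def train(total_steps, num_epochs, collect, update, evaluate):
    """Skeleton of the training loop with hypothetical callables.

    collect()  -> (batch, number of environment steps gathered)
    update(b)  -> runs one epoch of optimization on batch b
    evaluate() -> returns the current test reward
    """
    step = 0
    test_rewards = []
    while step < total_steps:
        # Generate fresh training data.
        batch, steps_collected = collect()
        # Reuse the same data for several epochs.
        for _ in range(num_epochs):
            update(batch)
        # Evaluate the test reward after each round of updates.
        test_rewards.append(evaluate())
        step += steps_collected
    return test_rewards
```

Passing the three stages in as callables keeps the loop itself independent of the encoder and policy, in the same spirit as the modular structure described earlier.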

In order to compare training and test performance it is important that these two metrics are evaluated in the same way. We therefore use mean episodic reward. The mean episodic reward is calculated by running all 32 environments for one episode and taking the average reward over the 32 environments. One episode lasts from when the agent starts playing until it fails. Here is a code snippet showing how this is implemented.
```python
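# `obs` holds the current observations of all environments.
total_reward = []
workers_finished = np.zeros(self.num_envs, dtype=bool)
while not np.all(workers_finished):

    # Use policy.
    action, _, _, _ = policy.act(obs)

    # Take step in environment.
    obs, reward, done, _ = env.step(action)
    for i in range(self.num_envs):
        # Mark the environment as finished once its first episode ends.
        if done[i]:
            workers_finished[i] = True
        # Ignore rewards from finished environments.
        if workers_finished[i]:
            reward[i] = 0

    total_reward.append(torch.Tensor(reward))

# Sum rewards over time per environment, then average over environments.
mean_reward = torch.stack(total_reward).sum(0).mean(0)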
```

This also means that the same code can be used to evaluate both training and test mean episodic reward simply by specifying what levels the agent should be evaluated on.
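
To see what the masking in the snippet above computes, here is a small pure-Python version applied to hand-written toy data for two environments. The function name and the data are made up for illustration; only the masking logic mirrors the snippet.

```python
def mean_episodic_reward(rewards, dones):
    """Average first-episode return over environments.

    `rewards[t][i]` and `dones[t][i]` are toy per-step values for step `t`
    and environment `i`; the function name is hypothetical.
    """
    num_envs = len(rewards[0])
    finished = [False] * num_envs
    returns = [0.0] * num_envs
    for step_rewards, step_dones in zip(rewards, dones):
        for i in range(num_envs):
            if step_dones[i]:
                finished[i] = True
            # As in the snippet above, rewards of finished environments are
            # ignored, including the reward on the step that reports `done`.
            if not finished[i]:
                returns[i] += step_rewards[i]
        if all(finished):
            break
    return sum(returns) / num_envs


rewards = [[1, 0], [1, 2], [5, 1], [9, 9]]
dones = [[False, False], [False, False], [True, False], [False, True]]
print(mean_episodic_reward(rewards, dones))  # (2 + 3) / 2 = 2.5
```

The first environment finishes one step earlier than the second, so its later rewards do not contribute, and each environment counts exactly one episode in the average.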

## Saving the results and logging

For each experiment we save a log of the progress. The log looks like this:
```python
log = {
    'step': steps,
    'train_mean_reward': train_mean_reward,
    'train_min_reward': train_min_reward,
    'train_max_reward': train_max_reward,
    'test_mean_reward': test_mean_reward,
    'test_min_reward': test_min_reward,
    'test_max_reward': test_max_reward,
    'pi_loss': pi_loss,
    'value_loss': value_loss,
    'entropy_loss': entropy_loss,
    'test_var': test_vars,
    'train_var': train_vars
}
```

where each variable is a list of size `total_steps/num_steps`, i.e., the number of training loops.
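
As an example of how such a log might be consumed, the sketch below extracts the final train and test reward and their difference. The helper `summarize` is hypothetical, it assumes the keys shown above and lists of equal length, and the values in the usage example are toy numbers rather than results from the report.

```python
def summarize(log):
    """Return the last logged step and the final train/test rewards.

    Hypothetical helper; assumes the keys shown above and that all
    lists in the log have the same length.
    """
    lengths = {len(values) for values in log.values()}
    if len(lengths) != 1:
        raise ValueError("log lists differ in length")
    final_train = log["train_mean_reward"][-1]
    final_test = log["test_mean_reward"][-1]
    return {
        "step": log["step"][-1],
        "train_mean_reward": final_train,
        "test_mean_reward": final_test,
        # Difference between training and test performance.
        "generalization_gap": final_train - final_test,
    }


# Toy values for illustration only.
toy_log = {
    "step": [1, 2],
    "train_mean_reward": [1.0, 5.0],
    "test_mean_reward": [0.5, 3.0],
}
summary = summarize(toy_log)
```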

---