#### Berkeley CS 285 - Deep Reinforcement Learning, Decision Making, and Control - Fall 2020

# Assignment 1: Imitation Learning

##### All pictures and slides are from Sergey Levine's course CS285 - Deep RL

> http://rail.eecs.berkeley.edu/deeprlcourse/

*The goal of this assignment is to experiment with imitation learning, including direct behavior cloning and
the DAgger algorithm. In lieu of a human demonstrator, demonstrations will be provided via an expert policy
that we have trained for you. Your goals will be to set up behavior cloning and DAgger, and compare their
performance on a few different continuous control tasks from the OpenAI Gym benchmark suite. Turn in your
report and code as described in Section 4.*

*The starter-code for this assignment can be found at*

> https://github.com/berkeleydeeprlcourse/homework_fall2020

*You have the option of running the code either on Google Colab or on your own machine. Please refer to the
README for more information on setup.*

# Wide screen notebook

Do not run if you prefer the default setup.

In [21]:
from IPython.display import display, HTML

display(HTML(data="""
<style>
    div#notebook-container    { width: 95%; }
    div#menubar-container     { width: 65%; }
    div#maintoolbar-container { width: 99%; }
</style>
"""))

TODO:

   * Structure of the program/system
   * Solution for each part that had to be solved
   * Analysis requested by the exercise
    

# Intro 

# Supervised Learning of Behaviours

We are going to use the same supervised learning setup as a neural network that learns to map inputs to an output category (e.g. a cat/dog photo classifier). Instead of learning to classify pictures, we're going to learn ("clone") behaviours, where a behaviour is a sequence of decisions.

We can think of this as a "copy-cat" of a given expert: we want to imitate the expert's behaviour as closely as possible.
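As a minimal sketch of what "cloning" means here, the snippet below fits a one-parameter linear policy to expert (observation, action) pairs by gradient descent on the mean squared error. The expert rule `a = -0.5 * o`, the learning rate and all variable names are made up for illustration; the homework trains a neural network instead.

```python
# Toy behaviour cloning: fit action = w * observation to expert data.
# The expert rule (a = -0.5 * o) is an assumption for this example only.
observations = [0.0, 0.5, 1.0, 1.5, 2.0]
expert_actions = [-0.5 * o for o in observations]

w = 0.0    # single policy parameter, initialised at zero
lr = 0.1   # learning rate
for _ in range(200):
    # gradient of the mean squared error between predicted and expert actions
    grad = sum(2 * (w * o - a) * o
               for o, a in zip(observations, expert_actions)) / len(observations)
    w -= lr * grad

print("learned weight:", w)
```

The learned weight approaches the expert's `-0.5` because the loss only asks the policy to reproduce the expert's labels on the states in the dataset.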

![](./img/terminology.png)

The problem can be formalized as a probabilistic graphical model in which, at each timestep, the agent is in a state, "sees" only an observation of that state, takes an action, and "moves" to a new state.

* **State:** The true configuration of the system (e.g. all the physical variables of a falling ball).

* **Observation:** What results from the state. It may NOT contain all the information in the state (e.g. a photo of the falling ball).
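The difference between the two can be sketched with the falling-ball example. The `BallState` class and the `observe` function are hypothetical names used only here: the state holds height and velocity, while the "photo" keeps only the height.

```python
from dataclasses import dataclass


@dataclass
class BallState:
    height: float    # metres
    velocity: float  # metres per second, positive means moving up


def observe(state):
    """A 'photo' shows where the ball is, but not how fast it is moving."""
    return {"height": state.height}


rising = BallState(height=2.0, velocity=3.0)
falling = BallState(height=2.0, velocity=-3.0)

# Two different states can produce exactly the same observation.
print(observe(rising) == observe(falling))
print(rising == falling)
```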

![](./img/observation_and_state.png)

![](./img/imitation.png)

Using only one photo (per timestep) as input **DOES NOT work** right away.

#### Problem

Small mistakes accumulate until the observed states are very different from those in the demonstrator's behaviour. Performance then becomes very bad, because the agent never saw those states during training.
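A toy illustration of this accumulation, under made-up assumptions: the state is a lateral offset `x` from the demonstrated path, the expert always steers back with `a = -x`, and the cloned policy has a constant per-step error of `0.05`. The cloned policy that ignores `x` drifts without bound, while a policy with the same error that does react to `x` stays close to the path.

```python
def rollout(policy, steps):
    """Return the lateral offset after each step; x = 0 is the demonstrated path."""
    x = 0.0
    offsets = []
    for _ in range(steps):
        x = x + policy(x)
        offsets.append(x)
    return offsets


def expert(x):
    return -x  # always steer back to the centre


def cloned(x):
    # Only saw x = 0 in the demonstrations, so it ignores x; 0.05 is its fitting error.
    return 0.05


def cloned_with_recovery(x):
    # Same fitting error, but it has learned how to correct an offset.
    return -x + 0.05


expert_offsets = rollout(expert, 100)
cloned_offsets = rollout(cloned, 100)
recovering_offsets = rollout(cloned_with_recovery, 100)
print(cloned_offsets[-1], recovering_offsets[-1])
```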

#### One solution

Use 3 photos: a centered one, and two pointing diagonally to the left and to the right, labeled "go forward", "rotate (compensate) right" and "rotate left" respectively.

![](./img/three_input.png)
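The labelling scheme can be sketched as follows. The `CAMERA_LABELS` mapping and the `label_frames` helper are hypothetical names, and the images are stood in for by plain strings; only the three labels come from the text above.

```python
# Each camera gets a fixed label, so the side views teach the policy how to recover.
CAMERA_LABELS = {
    "center": "go forward",
    "left": "rotate right",   # the view has drifted left, so compensate to the right
    "right": "rotate left",
}


def label_frames(frames):
    """frames: list of dicts mapping camera name -> image. Returns (image, label) pairs."""
    dataset = []
    for frame in frames:
        for camera, label in CAMERA_LABELS.items():
            dataset.append((frame[camera], label))
    return dataset


frames = [{"center": "img_c0", "left": "img_l0", "right": "img_r0"}]
dataset = label_frames(frames)
print(dataset)
```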

# DAgger: Dataset Aggregation

![](./img/dagger.png)

In this homework, we're going to change step 3:

> Instead of asking a human, we're going to load an expert policy (a trained neural network) and sample an action given an observation.

This happens in `rl_trainer.py`, in the method `do_relabel_with_expert(...)`:

```python
    def do_relabel_with_expert(self, expert_policy, paths):
        print("Relabelling collected observations with labels from an expert policy...")

        # TODO relabel collected observations (from our policy) with labels from an expert policy
        # HINT: query the policy (using the get_action function) with paths[i]["observation"]
        # and replace paths[i]["action"] with these expert labels
        for path in paths:
            # query the expert with the observations of this rollout
            # and overwrite our policy's actions with the expert's
            path["action"] = expert_policy.get_action(path["observation"])
        return paths
```

The returned `paths` is a list of dictionaries, one per trajectory, each holding that trajectory's $(s, a, r, s')$ data (see `utils.py` for the `Path` object).
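To see why relabelling and aggregating helps, here is a toy sketch of the whole DAgger loop. Everything in it is an assumption made for illustration: the states are integers, the "expert" steps toward 0, and the "policy" is a lookup table that stays put in states it has never seen. It is not the homework's implementation.

```python
def expert_action(state):
    """Toy expert: take one step toward state 0."""
    if state > 0:
        return -1
    if state < 0:
        return 1
    return 0


def rollout(policy, start, horizon):
    """Run the tabular policy; states missing from the table get action 0 (stay)."""
    visited, state = [], start
    for _ in range(horizon):
        visited.append(state)
        state += policy.get(state, 0)
    return visited, state


def run_dagger(n_iters, start=5, horizon=10):
    # The initial expert demonstration only ever visits state 0.
    dataset = {0: expert_action(0)}
    policy = dict(dataset)  # "training" here is just memorising the labels
    for _ in range(n_iters):
        visited, _ = rollout(policy, start, horizon)  # run our own policy
        for state in visited:
            dataset[state] = expert_action(state)     # relabel with the expert, aggregate
        policy = dict(dataset)                        # retrain on the aggregated dataset
    _, final_state = rollout(policy, start, horizon)
    return final_state


print(run_dagger(0))  # behaviour cloning only
print(run_dagger(5))  # five DAgger iterations
```

With no DAgger iterations the agent never leaves the starting state, because it has no label for it. Each iteration adds expert labels for the states the agent actually reached, so it gets one step further until it arrives at 0.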

# Code

As with the `do_relabel_with_expert` method, let's go over each TODO piece of code.

The files to read and edit, in order, are:

1. `scripts/run_hw1.py` (read-only file)

2. `infrastructure/rl_trainer.py`

3. `agents/bc_agent.py` (read-only file)

4. `policies/MLP_policy.py`

5. `infrastructure/replay_buffer.py`

6. `infrastructure/utils.py`

7. `infrastructure/pytorch_util.py`

## 1. `scripts/run_hw1.py` (read-only file)

Read through all the available parameters to get a better idea of the algorithms to be implemented.

You can also edit the main method so that you don't need to type the parameters in the console and can run the file directly, e.g.:

```python
if __name__ == "__main__":
    args = ['--expert_policy_file', '/home/user/CS_285-Deep_Reinforcement_Learning/hw1/cs285/policies/experts/Ant.pkl',
            '--expert_data', '/home/user/CS_285-Deep_Reinforcement_Learning/hw1/cs285/expert_data/expert_data_Ant-v2.pkl',
            '--env_name', 'Ant-v2',
            '--exp_name', 'bc_ant',
            '--ep_len', '5000',
            '--eval_batch_size', '5000',
            '--train_batch_size', '1000',
            '--num_agent_train_steps_per_iter', '100',
            '--no_gpu'
            ]
    main(args)
```

Later on, you can automate a hyperparameter search that calls `main` in a similar way.
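A possible shape for such a search is sketched below. The `main` function here is only a stand-in that parses the argument list with `argparse` and returns the settings, so the snippet runs without the homework code; the `search_space` values and the experiment name are made up.

```python
import argparse
import itertools


def main(args):
    """Stand-in for run_hw1.main: parse the argument list and return the settings."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--env_name', type=str, required=True)
    parser.add_argument('--exp_name', type=str, required=True)
    parser.add_argument('--train_batch_size', type=int, default=1000)
    parser.add_argument('--num_agent_train_steps_per_iter', type=int, default=1000)
    parser.add_argument('--no_gpu', action='store_true')
    return vars(parser.parse_args(args))


# Hypothetical grid: every combination of these values becomes one run.
search_space = {
    'train_batch_size': [100, 1000],
    'num_agent_train_steps_per_iter': [100, 1000, 10000],
}

results = []
keys = sorted(search_space)
for values in itertools.product(*(search_space[k] for k in keys)):
    args = ['--env_name', 'Ant-v2', '--exp_name', 'bc_ant_search', '--no_gpu']
    for key, value in zip(keys, values):
        args += ['--' + key, str(value)]
    results.append(main(args))

print(len(results), "runs")
```

Passing the list to `parse_args` explicitly is what makes it possible to call `main` from another script instead of from the command line.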

## 2. `infrastructure/rl_trainer.py`

**Note:** This file uses `utils.sample_trajectories` and `utils.sample_n_trajectories`, so you may want to read them first to better understand what's going on here.

**Note:** `tqdm()` was added to `train_agent()`.


In [22]:
!pygmentize cs285/infrastructure/rl_trainer.py

from collections import OrderedDict
import numpy as np
import time
import pickle
from tqdm import tqdm
import pkbar
import json

import gym
import torch

from cs285.infrastructure import pytorch_util as ptu
from cs285.infrastructure.logger import Logger
from cs285.infr

Inside `perform_logging`, a call was also added to a method that saves the recorded statistics to a file, so that runs can be opened and compared later.

```python
            # perform the logging
            for key, value in logs.items():
                print('{} : {}'.format(key, value))
                self.logger.log_scalar(value, key, itr)
            # save the statistics to a file once per logging step (for matplotlib or similar)
            self.logs_to_file(logs)
            print('Done logging...')
            self.logger.flush()
```
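The body of `logs_to_file` is not shown above, so the following is only a guess at what such a helper could look like. The standalone function signature, the `metrics_<itr>.json` file name and the example values are assumptions, not the notebook's actual code.

```python
import json
import os
import tempfile


def logs_to_file(logs, itr, logdir):
    """Hypothetical helper: write one iteration's statistics to a JSON file."""
    path = os.path.join(logdir, 'metrics_{}.json'.format(itr))
    with open(path, 'w') as f:
        # cast to float so that NumPy scalars would also be serialisable
        json.dump({key: float(value) for key, value in logs.items()}, f)
    return path


logdir = tempfile.mkdtemp()
path = logs_to_file({'Eval_AverageReturn': 1234.5, 'Eval_StdReturn': 67.8},
                    itr=0, logdir=logdir)

with open(path) as f:
    loaded = json.load(f)
print(loaded)
```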

## 6. `infrastructure/utils.py`

This file defines the sampling methods used above.

They "sample trajectories", i.e. they simulate the environment and record trajectories. A trajectory with $T$ steps is $\tau = [ (s_1, a_1, r_1, s'_1), (s_2, a_2, r_2, s'_2), \dots , (s_T, a_T, r_T, s'_T) ]$
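The idea can be sketched without Gym. The `CountdownEnv` class below is a made-up environment that mimics the `reset()` / `step()` interface, and the two functions are simplified stand-ins for the ones in `utils.py`, not copies of them.

```python
class CountdownEnv:
    """Toy environment: the state counts down from 3 and the episode ends at 0."""

    def reset(self):
        self.state = 3
        return self.state

    def step(self, action):
        self.state -= 1
        reward = 1.0
        done = self.state == 0
        return self.state, reward, done, {}


def sample_trajectory(env, policy, max_path_length):
    """Roll out one episode and record its (s, a, r, s') transitions."""
    ob = env.reset()
    transitions = []
    for _ in range(max_path_length):
        ac = policy(ob)
        next_ob, rew, done, _ = env.step(ac)
        transitions.append((ob, ac, rew, next_ob))
        ob = next_ob
        if done:
            break
    return transitions


def sample_trajectories(env, policy, min_timesteps, max_path_length):
    """Keep sampling whole trajectories until at least min_timesteps are collected."""
    paths, timesteps = [], 0
    while timesteps < min_timesteps:
        path = sample_trajectory(env, policy, max_path_length)
        paths.append(path)
        timesteps += len(path)
    return paths, timesteps


paths, timesteps = sample_trajectories(CountdownEnv(), lambda ob: 0,
                                       min_timesteps=7, max_path_length=10)
print(len(paths), timesteps)
```

Because only whole trajectories are collected, the number of recorded timesteps can exceed the requested minimum.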

In [23]:
!pygmentize cs285/infrastructure/utils.py

import numpy as np
import time

############################################
############################################

def sample_trajectory(env, policy, max_path_length, render=False, render_mode=('rgb_array')):
    # initialize env for the beginning of a new rollout
    ob = env.reset() #TODO # HINT: should be the output of resetting the env

    # init vars
    obs, acs, rewards, next_obs, terminals, image_obs = [], [], [], [], [], []
    steps = 0
    while True:

        # render image of the simulated env
        if render:
            if 'rgb_array'

## 3. `agents/bc_agent.py` (read-only file)

The BC (Behaviour Cloning) agent is a class that represents the agent entity.

It is initialized with a randomly initialized network (see `MLPPolicySL`) and an empty replay buffer, and it is trained with supervised learning on the expert data.

In [24]:
!pygmentize cs285/agents/bc_agent.py

from cs285.infrastructure.replay_buffer import ReplayBuffer
from cs285.policies.MLP_policy import MLPPolicySL
from .base_agent import BaseAgent


class BCAgent(BaseAgent):
    def __init__(self, env, agent_params):
        super(BCAgent, self).__init__()

        # init vars
        self.env = env
        self.agent_params = agent_params

        # actor/policy
        self.actor = MLPPolicySL(
            self

## 4. `policies/MLP_policy.py`

Note the differences and similarities between the discrete and the continuous policy.

MLP stands for Multi-Layer Perceptron.
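A rough sketch of that difference, using only the standard library. It assumes the usual setup in which a discrete policy turns network outputs (logits) into a categorical distribution, while a continuous policy outputs a mean and a log standard deviation for a Gaussian. The function names are hypothetical and the numbers stand in for network outputs.

```python
import math
import random


def discrete_policy_sample(logits, rng):
    """Softmax over the logits, then sample one action index."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    action = rng.choices(range(len(logits)), weights=probs)[0]
    return action, probs


def continuous_policy_sample(mean, logstd, rng):
    """Sample each action dimension from a Gaussian N(mean, exp(logstd)^2)."""
    return [rng.gauss(m, math.exp(s)) for m, s in zip(mean, logstd)]


rng = random.Random(0)
action, probs = discrete_policy_sample([1.0, 2.0, 0.5], rng)
vector = continuous_policy_sample([0.3, -1.2], [-50.0, -50.0], rng)
print(action, probs, vector)
```

In both cases the network is the same kind of MLP; what changes is how its outputs are interpreted as a distribution over actions.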

In [25]:
!pygmentize cs285/policies/MLP_policy.py

import abc
import itertools
from typing import Any
from torch import nn
from torch.nn import functional as F
from torch import optim

import numpy as np
import torch
from torch import distributions

from cs285.infrastructure import pytorch_util as ptu
from cs285.policies.

## 5. `infrastructure/replay_buffer.py`
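The replay buffer stores the data from every rollout and hands out random batches for training. The class below is a simplified stand-in written with plain lists; `TinyReplayBuffer` and its method bodies are assumptions for illustration, not the homework's implementation.

```python
import random


class TinyReplayBuffer:
    def __init__(self, max_size=1000000):
        self.max_size = max_size
        self.obs = []   # concatenated observations from all rollouts
        self.acs = []   # concatenated actions, aligned with self.obs

    def __len__(self):
        return len(self.obs)

    def add_rollouts(self, paths):
        for path in paths:
            self.obs.extend(path["observation"])
            self.acs.extend(path["action"])
        # keep only the most recent max_size entries
        self.obs = self.obs[-self.max_size:]
        self.acs = self.acs[-self.max_size:]

    def sample_random_data(self, batch_size, rng=random):
        # use the same indices for both lists so the pairs stay aligned
        indices = rng.sample(range(len(self.obs)), batch_size)
        return [self.obs[i] for i in indices], [self.acs[i] for i in indices]


buffer = TinyReplayBuffer(max_size=4)
buffer.add_rollouts([
    {"observation": [1, 2, 3], "action": [10, 20, 30]},
    {"observation": [4, 5], "action": [40, 50]},
])
batch_obs, batch_acs = buffer.sample_random_data(2)
print(len(buffer), batch_obs, batch_acs)
```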



In [26]:
!pygmentize cs285/infrastructure/replay_buffer.py

from cs285.infrastructure.utils import *


class ReplayBuffer(object):

    def __init__(self, max_size=1000000):

        self.max_size = max_size

        # store each rollout
        self.paths = []

        # store (concatenated) component arrays from each rollout
        self.obs = None
        self.acs = None
        self.rews = None
        self.next_obs = None
        self.terminals = None

    def __len__(self):
        if

## 7. `infrastructure/pytorch_util.py`

In [27]:
!pygmentize cs285/infrastructure/pytorch_util.py

from typing import Union

import torch
from torch import nn

Activation = Union[str, nn.Module]


_str_to_activation = {
    'relu': nn.ReLU(),
    'tanh': nn.Tanh(),
    'leaky_relu': nn.LeakyReLU(),
    'sigmoid': nn.Sigmoid(),
    'selu': nn.SELU(),
    'softplus': nn.Softplus(),
    'identity': nn.Identity(),
}


def build_mlp(
        input_size: int,
        output_size: int,
        n_layers: int

## Extra: `run_statistics.py` 

This script makes it easier to explore hyperparameters by running several simulations.
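One part of such a script is gathering the saved statistics of each run. The sketch below assumes that every run leaves JSON files whose names start with `metrics` in its log directory; the `load_metrics` helper, the file names and the values are hypothetical.

```python
import json
import os
import tempfile


def load_metrics(logdir):
    """Load every JSON file in logdir whose name starts with 'metrics'."""
    runs = []
    for name in sorted(os.listdir(logdir)):
        if name.startswith("metrics"):
            with open(os.path.join(logdir, name)) as f:
                runs.append(json.load(f))
    return runs


# Build a throwaway log directory with two metrics files and one unrelated file.
logdir = tempfile.mkdtemp()
for itr, value in enumerate([100.0, 250.0]):
    with open(os.path.join(logdir, "metrics_{}.json".format(itr)), "w") as f:
        json.dump({"Eval_AverageReturn": value}, f)
with open(os.path.join(logdir, "notes.txt"), "w") as f:
    f.write("ignored")

runs = load_metrics(logdir)
best = max(run["Eval_AverageReturn"] for run in runs)
print(len(runs), best)
```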


In [28]:
!pygmentize cs285/scripts/run_statistics.py

from run_hw1 import main as run_experiment
from os import listdir
from os.path import isfile, join
import json
import matplotlib.pyplot as plt
import numpy as np
# class stats():

def plot_results(logdirs, search_space):
    all_dicts = []
    for logdir in logdirs:
        for f in listdir(logdir):
            if str(f)[:7] == "metrics"