# Outlook

In this notebook, using BBRL, you will study some effects of partial observability
on the continuous action version of the LunarLander-v3 environment, using the TD3 algorithm.

To emulate partial observability, you will design dedicated wrappers. Then you will study
whether extending the input of the agent policy and critic with a memory of previous states
can help solve the partial observability issue. You will also study whether using action chunks
instead of single actions as output has an effect on the learning performance.
This will also be achieved by designing other temporal extension wrappers.

# Installation

In [None]:
# Prepare the environment

import os
import copy
import numpy as np
import gymnasium as gym
import math
import bbrl_gymnasium  # noqa: F401
import torch
import torch.nn as nn
from bbrl.agents import Agent, Agents, TemporalAgent
from bbrl_utils.algorithms import EpochBasedAlgo
from bbrl_utils.nn import build_mlp, setup_optimizer, soft_update_params
from bbrl_utils.notebook import setup_tensorboard
from bbrl.visu.plot_policies import plot_policy
from omegaconf import OmegaConf

import bbrl_utils

bbrl_utils.setup()

# Temporal modification wrappers

The [LunarLander-v3](https://gymnasium.farama.org/environments/box2d/lunar_lander/) environment is a gymnasium environment.
See the gymnasium page for a description of the state and action spaces.

To emulate partial observability in LunarLander-v3, you will hide the x and y velocities of the lander
by filtering them out of the state returned by the environment.
This is implemented with the ```FeatureFilterWrapper```.

To compensate for partial observability, you will extend the architecture of the agent
with a memory of previous states and extend its output with action chunks.
This is implemented with two wrappers, the ```ObsTimeExtensionWrapper```
and the ```ActionTimeExtensionWrapper```.

## The FeatureFilterWrapper

The FeatureFilterWrapper removes a feature from the returned observation
when calling the ```reset()``` and ```step(action)``` functions.
The index of the removed feature is given as a parameter when building the object.

To filter out the x and y velocities from the LunarLander-v3 environment,
the idea is to apply the wrapper once per feature to remove, using something like
```env = FeatureFilterWrapper(FeatureFilterWrapper(inner_env, X), Y)```
where ```inner_env``` is the LunarLander-v3 environment
and X and Y are the positions of the features you want to filter out.

Take care to filter the features in the right order,
as removing a feature changes the index of all subsequent features.
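
The index shift can be checked on a plain array. The sketch below assumes the observation layout described on the gymnasium page (x, y, x velocity, y velocity, angle, angular velocity, two leg contacts) and uses `np.delete` as a stand-in for the wrapper. The observation values are made up.

```python
import numpy as np

# Made-up observation following the LunarLander layout from the gymnasium page:
# [x, y, vx, vy, angle, angular velocity, left leg contact, right leg contact]
obs = np.array([0.1, 1.4, -0.3, -0.5, 0.02, 0.0, 0.0, 0.0])

# Removing vx (index 2) first shifts vy from index 3 to index 2,
# so the second removal must use index 2 again.
without_vx = np.delete(obs, 2)
both_removed = np.delete(without_vx, 2)

# Removing the higher index first leaves the lower index untouched.
same_result = np.delete(np.delete(obs, 3), 2)

# Reusing the original index 3 after the first removal drops the angle instead.
wrong = np.delete(without_vx, 3)
```

Here `both_removed` and `same_result` contain the same six features, whereas `wrong` still contains the y velocity and has lost the angle.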

One way to put such a wrapper with a parameter into the list of wrappers is to use a lambda:
for instance, `lambda env: FeatureFilterWrapper(env, 3),`

### Exercise 1: code the FeatureFilterWrapper class below.

Beyond rewriting the ```reset()``` and ```step(action)``` functions,
remember to adapt the observation space and its shape.
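
As a hint on adapting the observation space, the bounds of the filtered space can be derived from those of the inner space by dropping the same index. This is only a sketch: `filtered_bounds` is a hypothetical helper, the bounds are made up (they are not the LunarLander ones), and the resulting arrays would typically be passed as `low` and `high` to `gym.spaces.Box`.

```python
import numpy as np


def filtered_bounds(low, high, index):
    """Return the bounds of an observation space once feature `index` is removed."""
    return np.delete(low, index), np.delete(high, index)


# Made-up bounds for a 4-feature observation space (not the LunarLander values).
low = np.array([-1.0, -2.0, -3.0, -4.0], dtype=np.float32)
high = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)

new_low, new_high = filtered_bounds(low, high, index=1)
new_shape = new_low.shape  # one feature fewer than the inner space
```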

In [None]:
# [[STUDENT]]...

assert False, 'Not implemented yet'


## The ObsTimeExtensionWrapper

When facing a partially observable environment, a reactive agent trained with RL, which selects an action based only on the current observation,
is not guaranteed to reach optimality. An option to mitigate this fundamental limitation is to equip the agent with a memory of the past.

One way to do so is to use a recurrent neural network instead of a feedforward one to implement the agent: the neural network contains
some memory capacity and the RL process may tune this internal memory so as to remember exactly what is necessary from
past observations. This has been done many times using an LSTM, see for instance
[this early paper](https://proceedings.neurips.cc/paper/2001/file/a38b16173474ba8b1a95bcbc30d3b8a5-Paper.pdf).

Another way to do so is to equip the agent with a list-like memory of the past observations
and to extend the critic and policy to take as input the current observation and the previous ones.
This removes the difficulty of learning an adequate representation of the past, but this results in
enlarging the input size of the actor and critic networks. This can only be done if the required memory
horizon to behave optimally is small enough.

In the case of the LunarLander-v3 environment, one can immediately see that a memory of the previous
x and y coordinates is enough to compensate for the absence of the velocities,
since $\dot{a} \approx (a_{t} - a_{t-1})$.
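
The approximation can be illustrated numerically. The sketch below assumes a constant velocity and a hypothetical time step `dt`. The raw difference $a_t - a_{t-1}$ is proportional to the velocity, and dividing it by `dt` recovers the velocity itself.

```python
import numpy as np

dt = 0.02  # hypothetical time step between two observations
true_velocity = 1.5  # hypothetical constant velocity

# Positions sampled at consecutive time steps under constant velocity.
positions = 2.0 + true_velocity * dt * np.arange(5)

# Backward difference: (a_t - a_{t-1}) / dt
estimated_velocity = np.diff(positions) / dt
```

Since the factor `1 / dt` is constant, a network receiving both $a_t$ and $a_{t-1}$ has, in principle, the information needed to reconstruct the velocity.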

So we will extend the RL agent with a memory of size 1.

Though it may not be intuitive at first glance, the simplest way to do so is to embed the environment
into a wrapper which contains the required memory and produces the extended observations.
This way, the RL agent will naturally be built with an extended observation space,
and the wrapper will be in charge of concatenating the memorized observation from the previous step
with the current observation received from the inner environment when calling the ```step(action)``` function.
When calling the ```reset()``` function, the memory of observations should be reinitialized with null observations.
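
The memory logic alone, without any gymnasium machinery, can be sketched as follows. `ObsMemory` is a hypothetical helper, not the wrapper itself, and the ordering of the concatenation (previous observation first) is an assumption.

```python
import numpy as np


class ObsMemory:
    """Hypothetical helper holding the previous observation (memory of size 1)."""

    def __init__(self, obs_dim):
        self.obs_dim = obs_dim
        self.previous = np.zeros(obs_dim)

    def reset(self, first_obs):
        # The memory restarts with a null observation at the start of an episode.
        self.previous = np.zeros(self.obs_dim)
        return self._extend(first_obs)

    def step(self, obs):
        return self._extend(obs)

    def _extend(self, obs):
        # Assumed ordering: previous observation first, current observation last.
        obs = np.asarray(obs, dtype=float)
        extended = np.concatenate([self.previous, obs])
        self.previous = obs.copy()
        return extended


memory = ObsMemory(obs_dim=2)
first = memory.reset([1.0, 2.0])
second = memory.step([3.0, 4.0])
after_reset = memory.reset([5.0, 6.0])
```

The extended observation is twice as large as the inner one, so the observation space exposed to the agent has to be enlarged accordingly.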

### Exercise 2: code the ObsTimeExtensionWrapper class below.

Beyond rewriting the ```reset()``` and ```step(action)``` functions, remember to adapt the observation space and its shape.

In [None]:
# [[STUDENT]]...

assert False, 'Not implemented yet'


## The ActionTimeExtensionWrapper

It has been observed that, in partially observable environments, preparing to play
a sequence of actions and only playing the first can be better than only preparing for one action.
The difference comes from the fact that the critic evaluates
sequences of actions, even if only the first is played in practice.

Similarly to the ObsTimeExtensionWrapper, the corresponding behavior can be implemented with a wrapper.
The size of the action space of the extended environment should be
M times the size of the action space of the inner environment. This ensures that the policy and the critic
will consider extended actions.
Besides, the ```step(action)``` function should receive an extended action of size M times
the size of an action, and should only transmit the first action to the inner environment.

Warning: in gymnasium, the case where the action is one-dimensional requires a slightly
different treatment from the multi-dimensional case.
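
The two operations described above, enlarging the action bounds and extracting the first action, can be sketched with plain arrays. `extended_bounds` and `first_action` are hypothetical helpers. The sketch covers only the multi-dimensional case, leaves the one-dimensional case aside, and assumes the first inner action occupies the first entries of the extended action.

```python
import numpy as np


def extended_bounds(low, high, m):
    """Repeat the inner action bounds m times to bound an extended action."""
    return np.tile(low, m), np.tile(high, m)


def first_action(extended_action, action_dim):
    """Keep only the first inner action of an extended action."""
    return np.asarray(extended_action)[:action_dim]


# Continuous LunarLander actions have 2 components in [-1, 1].
low, high = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
ext_low, ext_high = extended_bounds(low, high, m=3)

# Made-up extended action made of M = 3 inner actions.
extended_action = np.array([0.5, -0.2, 0.1, 0.9, -1.0, 0.0])
played = first_action(extended_action, action_dim=2)
```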

### Exercise 3: code the ActionTimeExtensionWrapper class below.

Beyond rewriting the ```reset()``` and ```step(action)``` functions, remember to adapt the action space and its shape.

In [None]:
# [[STUDENT]]...

assert False, 'Not implemented yet'


In [None]:
class TD3(EpochBasedAlgo):
    def __init__(self, cfg, wrappers_factory):
        super().__init__(cfg, wrappers_factory)

        # Define the agents and optimizers for TD3

        assert False, 'Not implemented yet'





def run_td3(td3: TD3):
    for rb in td3.iter_replay_buffers():
        rb_workspace = rb.get_shuffled(td3.cfg.algorithm.batch_size)

        # Implement the learning loop

        assert False, 'Not implemented yet'


## Launching tensorboard to visualize the results

In [None]:
setup_tensorboard("./outputs")

# Experimental study

To run the experiments below, you can use the [TD3](http://proceedings.mlr.press/v80/fujimoto18a/fujimoto18a.pdf) algorithm.

You can simply copy and paste here the code you used during the corresponding labs.
We only provide a suggested set of hyper-parameters working well on the LunarLander-v3 environment for TD3.
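
As a reminder, the two ingredients that distinguish the critic target in the TD3 paper linked above are the minimum over two target critics and the smoothing of the target action. The sketch below illustrates them with plain numpy on made-up values. The function names are hypothetical, and the code from your previous labs may organize or simplify this differently.

```python
import numpy as np


def td3_target(reward, done, q1_next, q2_next, discount):
    """Clipped double-Q target: r + discount * (1 - done) * min(Q1', Q2')."""
    min_q_next = np.minimum(q1_next, q2_next)
    return reward + discount * (1.0 - done) * min_q_next


def smoothed_target_action(action, noise, noise_clip, low, high):
    """Target policy smoothing: add clipped noise, then clip to the action bounds."""
    return np.clip(action + np.clip(noise, -noise_clip, noise_clip), low, high)


# Made-up batch of two transitions; the second one is terminal.
reward = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])
q1_next = np.array([10.0, 5.0])
q2_next = np.array([8.0, 7.0])
target = td3_target(reward, done, q1_next, q2_next, discount=0.5)

action = smoothed_target_action(
    np.array([0.9, -0.5]), np.array([0.8, -0.1]), noise_clip=0.5, low=-1.0, high=1.0
)
```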

## Definition of the parameters

The logger is defined as `bbrl.utils.logger.TFLogger` so as to use a
tensorboard visualisation.

In [None]:
params = {
    "save_best": False,
    "base_dir": "${gym_env.env_name}/td3-S${algorithm.seed}_${current_time:}",
    "collect_stats": False,
    # Set to true to have an insight on the learned policy
    # (but slows down the evaluation a lot!)
    "plot_agents": False,
    "algorithm": {
        "seed": 2,
        "max_grad_norm": 0.5,
        "n_envs": 1,
        "n_steps": 1000,
        "nb_evals": 10,
        "discount_factor": 0.99999,
        "buffer_size": 1e6,
        "batch_size": 256,
        "tau_target": 0.005,
        "eval_interval": 5_000,
        "max_epochs": 3500,
        # Minimum number of transitions before learning starts
        "learning_starts": 100,
        "action_noise": 0.2,
        "architecture": {
            "actor_hidden_size": [64, 64],
            "critic_hidden_size": [400, 300],
        },
    },
    "gym_env": {
        "env_name": "LunarLander-v3",
        "env_args": { "continuous": True, }
    },
    "actor_optimizer": {
        "classname": "torch.optim.Adam",
        "lr": 1e-3,
        "eps": 5e-5,
    },
    "critic_optimizer": {
        "classname": "torch.optim.Adam",
        "lr": 1e-3,
        "eps": 5e-5,
    },
}

### Exercise 4:

You now have all the elements to study the impact of removing features from the environment
on the training performance, and the impact of temporally extending the agent in mitigating
partial observability, both with observation and with action extension.

In practice, you should produce the following learning curves:

- a learning curve of your algorithm on the standard LunarLander-v3 environment with full observability,
- two learning curves, one from removing $\dot{x}$ from LunarLander-v3 and the other from removing $\dot{\theta}$,
- one learning curve from removing both $\dot{x}$ and $\dot{\theta}$,
- the same four learning curves as above, but adding each of the temporal extension wrappers, separately or combined.
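
One possible way to keep track of the configurations listed above is to enumerate them explicitly. The labels below are hypothetical and only serve to count the runs. How each label maps to a list of wrappers is left to you.

```python
from itertools import product

# Hypothetical labels for the feature-removal settings and the wrapper settings.
removed_features = ["none", "x_dot", "theta_dot", "x_dot+theta_dot"]
extensions = ["none", "obs", "action", "obs+action"]

experiments = [
    {"removed": removed, "extension": extension}
    for removed, extension in product(removed_features, extensions)
]

# The four curves without any temporal extension wrapper.
baselines = [e for e in experiments if e["extension"] == "none"]
```

This gives 16 configurations in total, 4 of which are the curves without temporal extension, each to be repeated over several seeds.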

The way to combine these learning curves in different figures is open to you but should be carefully considered
depending on the conclusions you want to draw. Beware of drawing conclusions from insufficient statistics.
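
Regarding statistics, a common minimal practice is to repeat each configuration over several seeds and report an aggregate with a measure of dispersion. The sketch below computes the mean and the standard error across seeds. `summarize_runs` is a hypothetical helper and the returns are made up.

```python
import numpy as np


def summarize_runs(curves):
    """Mean and standard error over seeds for curves of shape (n_seeds, n_evals)."""
    curves = np.asarray(curves, dtype=float)
    mean = curves.mean(axis=0)
    # ddof=1 gives the unbiased sample standard deviation across seeds.
    stderr = curves.std(axis=0, ddof=1) / np.sqrt(curves.shape[0])
    return mean, stderr


# Made-up evaluation returns for 3 seeds and 2 evaluation points.
curves = [[-200.0, 100.0], [-180.0, 200.0], [-220.0, 150.0]]
mean, stderr = summarize_runs(curves)
```

With only a handful of seeds the standard error itself is poorly estimated, so overlapping curves should be interpreted with caution.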

Discuss what you observe and conclude from this study.

In [None]:
# [[STUDENT]]...

assert False, 'Not implemented yet'


# Lab report

Your report should contain:
- your source code (probably this notebook), do not forget to put your names on top of the notebook,
- in a separate pdf file with your names in the name of the file (name1_name2.pdf),  no longer than 6 pages:
    + a detailed enough description of all the choices you have made: the parameters you have set, the algorithms you have used, etc.,
    + the curves obtained when doing Exercise 4,
    + your conclusion from these experiments.

Beyond the elements required in this report, any additional study will be rewarded.
For instance, you can extend the temporal horizon for the state memory and/or action sequences beyond 2,
and study the impact on learning performance and training time, etc.
A great achievement would be to perform a comparison with the approach based on an LSTM.