# Outline
The experiments section consists of two parts:
- PART1: Demonstrating the performance of our implementations of various intrinsic rewards;
- PART2: Discussion of various issues of the application of intrinsic rewards.

# Preliminary work

- Mingqi:
  - Update the rllte framework to adapt to the latest reward class; [DONE]
  - Prepare the paper framework and write some preliminary contents. [DONE]
- Roger:
  - Finish the rest of the transfer work; [DONE]
  - Implement the *Disagree* reward module; [DONE]
  - Check the correctness of the workflow of the implemented modules.

- I already have PPO training results on 57 Atari games, so we don't need to train them again.
- Use Super Mario 1-1 without extrinsic rewards to test correctness.

# Baseline Setting

Unless a question below specifies otherwise, this is the default config to use for ALL rewards in ALL cases until we answer Q1, Q2, and Q3. After that, we will use the best config found for each reward.

- image input preprocessing: x / 255.
- reward normalization: rms
- combination of int. and ext. rewards: R=E+I
- reward filter: False
- update proportion (see https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_rnd_envpool.py#L468): 1.0
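
The defaults above could be captured in one config object so that every run starts from the same settings. This is a minimal sketch: the `BaselineConfig` class and its field names are hypothetical and are not part of rllte.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BaselineConfig:
    """Hypothetical container for the default settings listed above."""
    obs_norm: str = "x/255"         # image input preprocessing
    rew_norm: str = "rms"           # reward normalization
    combine: str = "E+I"            # R = extrinsic + intrinsic
    reward_filter: bool = False     # forward reward filter disabled
    update_proportion: float = 1.0  # fraction of samples used per reward-model update


BASELINE = BaselineConfig()

# A question overrides only the field it studies, e.g. Q1 with min-max normalization.
q1_minmax = replace(BASELINE, rew_norm="min-max")
```

Keeping the config frozen makes accidental in-place changes to the shared baseline fail loudly.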

Make sure to debug all rewards in Super Mario World 1 Level 1, which is the best environment for debugging exploration.

# Environments to use

Train on **intrinsic rewards only** in these environments to measure how good the algorithms are at **exploration**:
- Atari hard exploration games
    - Montezuma's Revenge
    - Pitfall
    - PrivateEye
    - Gravitar
    - Venture
    - Hero
    - Frostbite
    - Solaris
- SuperMarioBros-1-1-v3

Train on **intrinsic + extrinsic** rewards in these environments to measure whether they help achieve better performance:
- All other Atari games
- Crafter
- Procgen


# PART1

Using the baseline settings, report the performance of all intrinsic rewards.

- backbone algorithms: PPO, SAC
- rewards: all rewards modules
- games: 
  + 9 Atari games (image-based obs, discrete actions), 1e+7 steps
    + Exp.: PPO+Int. Reward
  
  + Super Mario World 1 Level 1
    + Exp.: PPO+Int. Reward (all rewards should work well here)

  + 4 bullet games (state-based obs, box actions), 1e+6 steps
    + Exp.: SAC+Int. Reward
- workload:
  + Mingqi: RE3, RISE, PseudoCounts, NGU, REVD
  + Roger: RND, ICM, RIDE, E3B, Disagree


# PART2
- General principle: each question uses only *the 1 kind of game most appropriate for that question + 1 algo.*

## Tuning Intrinsic Rewards
The goal of these questions is to find the settings each intrinsic reward needs to perform best.
At the end of this section, we can identify the best config for each reward.

e.g. RND: obs_rms=True, rew_rms=True, forward_filter=True
e.g. E3B: obs_rms=False, rew_rms=True, forward_filter=False

### Q1: The impact of different Reward normalization mechanisms on final performance. (Roger)
- RL algo: PPO (if you use E3B, use PPO not SAC)
- rewards: ICM, RND, RIDE, RE3, E3B
- candidates: 
  + vanilla
  + rms
  + min-max
- games: 
  + 9 Atari games, same as PART1
  + SuperMarioBros-1-1-v3
- framework: rllte
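
The three normalization candidates above can be sketched as follows. This is an illustration under assumptions, not the rllte implementation: `RunningMeanStd` and `normalize_rewards` are hypothetical names, and "rms" is taken to mean dividing by the running standard deviation without centring, as the cleanrl RND script linked in the baseline settings does.

```python
import math


class RunningMeanStd:
    """Running mean/variance over scalars, merged batch by batch."""

    def __init__(self):
        self.mean, self.var, self.count = 0.0, 1.0, 1e-4

    def update(self, batch):
        n = len(batch)
        b_mean = sum(batch) / n
        b_var = sum((x - b_mean) ** 2 for x in batch) / n
        delta = b_mean - self.mean
        total = self.count + n
        # Parallel-variance formula for combining two sets of statistics.
        m2 = self.var * self.count + b_var * n + delta ** 2 * self.count * n / total
        self.mean += delta * n / total
        self.var = m2 / total
        self.count = total


def normalize_rewards(rewards, mode, rms=None, eps=1e-8):
    """Apply one of the Q1 candidates to a batch of intrinsic rewards."""
    if mode == "vanilla":  # leave rewards untouched
        return list(rewards)
    if mode == "rms":      # scale by the running standard deviation
        rms.update(rewards)
        std = math.sqrt(rms.var)
        return [r / (std + eps) for r in rewards]
    if mode == "min-max":  # rescale the batch to [0, 1]
        lo, hi = min(rewards), max(rewards)
        return [(r - lo) / (hi - lo + eps) for r in rewards]
    raise ValueError(f"unknown mode: {mode}")
```

Note that "rms" keeps state across batches while "min-max" depends only on the current batch, which is one reason the two may behave differently over training.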

### Q2: The impact of different Observation normalization mechanisms on final performance. (Roger)
- RL algo: PPO
- rewards: ICM, RND, RIDE, RE3, E3B
- candidates: 
  + x/255.
  + rms
- games:
  + 9 Atari games, same as PART1
  + SuperMarioBros-1-1-v3
- framework: rllte
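
The two observation-normalization candidates above differ as sketched below. The names `PixelRunningStats`, `scale_255`, and `normalize_rms` are hypothetical, observations are treated as flat lists of pixel values for brevity, and the clip range of 5 is an assumption borrowed from common RND implementations.

```python
import math


class PixelRunningStats:
    """Per-pixel running mean/variance via Welford's online algorithm."""

    def __init__(self, size):
        self.n = 0
        self.mean = [0.0] * size
        self.m2 = [0.0] * size  # running sum of squared deviations

    def update(self, obs):
        self.n += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (x - self.mean[i])

    def std(self, i, eps=1e-8):
        var = self.m2[i] / self.n if self.n > 0 else 1.0
        return math.sqrt(var) + eps


def scale_255(obs):
    """Candidate 1: map raw pixel values to [0, 1]."""
    return [x / 255.0 for x in obs]


def normalize_rms(obs, stats, clip=5.0):
    """Candidate 2: standardize each pixel with running statistics, then clip."""
    out = []
    for i, x in enumerate(obs):
        z = (x - stats.mean[i]) / stats.std(i)
        out.append(max(-clip, min(clip, z)))
    return out
```

With running statistics, pixels that never change (for example a static HUD) map to zero, while `x / 255.` passes them through unchanged.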
  
### Q3: The impact of ForwardRewardFilter mechanisms on final performance. (Roger)
- RL algo: PPO
- rewards: ICM, RND, RIDE, RE3, E3B
- candidates:
  + Use RewardFilter + don't cut when done=True in value estimation
  + Don't use RewardFilter + cut when done=True in value estimation
- games: 
  + 9 Atari games, same as PART1
  + SuperMarioBros-1-1-v3
- framework: rllte
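
The two ingredients of the candidates above can be sketched as follows. The filter is modelled on the `RewardForwardFilter` class in the cleanrl RND script, rewritten with plain lists; `discounted_returns` is a hypothetical helper that only illustrates what "cut when done=True" means, and is not how rllte estimates values.

```python
class RewardForwardFilter:
    """Discounted running sum of intrinsic rewards, one entry per parallel env."""

    def __init__(self, gamma):
        self.gamma = gamma
        self.rewems = None

    def update(self, rews):
        if self.rewems is None:
            self.rewems = list(rews)
        else:
            self.rewems = [m * self.gamma + r for m, r in zip(self.rewems, rews)]
        return self.rewems


def discounted_returns(rewards, dones, gamma, cut_on_done):
    """Backward pass over one env's rollout.

    dones[t] is True when the episode ended after step t. With cut_on_done=False
    the return keeps accumulating across episode boundaries (non-episodic).
    """
    returns, running = [], 0.0
    for r, d in zip(reversed(rewards), reversed(dones)):
        if cut_on_done and d:
            running = 0.0
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


# Same rollout, with and without cutting at the episode boundary after step 1.
cut = discounted_returns([1.0, 1.0, 1.0], [False, True, False], 0.5, True)
no_cut = discounted_returns([1.0, 1.0, 1.0], [False, True, False], 0.5, False)
```

In the cleanrl script the filter output feeds a running variance estimate, and the intrinsic rewards are divided by its square root.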

### Q4: The co-learning dynamics of policies and intrinsic rewards: (Roger)
- The problem is that many intrinsic rewards require learning auxiliary models (e.g. inverse dynamics model, forward dynamics model, etc.) and it is not clear how to co-learn them with the policy.
- RL algo: PPO
- candidates: 
  + update_proportion = 0.25
  + update_proportion = 0.5
  + update_proportion = 0.75
  + update_proportion = 0.1

- rewards: ICM, RND, RIDE, RE3, E3B
- games: 
  + 9 Atari games, same as PART1
  + SuperMarioBros-1-1-v3
- framework: rllte
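
One way to read `update_proportion`, following the cleanrl RND script linked in the baseline settings, is as the probability that each sample contributes to the auxiliary-model loss. The sketch below assumes that reading; `masked_mean_loss` is a hypothetical name and works on plain lists rather than tensors.

```python
import random


def masked_mean_loss(per_sample_losses, update_proportion, rng=random):
    """Average the auxiliary-model loss over a random subset of the batch."""
    # Each sample is kept with probability `update_proportion`.
    mask = [1.0 if rng.random() < update_proportion else 0.0 for _ in per_sample_losses]
    kept = max(sum(mask), 1.0)  # avoid dividing by zero when nothing is kept
    return sum(loss * m for loss, m in zip(per_sample_losses, mask)) / kept
```

With `update_proportion = 1.0` this reduces to the ordinary batch mean, so the baseline setting trains the reward model on every sample.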

## Optimizing the intrinsic rewards

The goal of this section is to learn the best ways to optimize RL algos with intrinsic rewards.
Starting in this section, for each reward we will use the best config found in Q1, Q2, Q3, and Q4.

At the end of this section, we will know how to configure each algorithm to best optimize the intrinsic rewards.

e.g. PPO+RND: Separate adv estimation + no LSTM
e.g. PPO+E3B: Vanilla version R=ext+int + LSTM

### Q5: The impact of different integration patterns on final performance, on-policy setting only. (Mingqi)
- RL algo: PPO 
- rewards: ICM, RND, RIDE, RE3, E3B
- candidates:
  + vanilla version: ext. reward + int. reward -> adv. estimation -> policy update, only one branch in the value network.
  + cleanrl version: separate adv. estimation and two branches in the value network
  + RE3 version: multiply the estimated advantages by the intrinsic rewards 
- games: 
  + 9 Atari games, same as PART1
  + SuperMarioBros-1-1-v3
- framework: rllte
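
The three integration patterns above can be compared on a single rollout as sketched below. Several points are assumptions: episode boundaries are ignored for brevity, the coefficients `ext_coef=2.0` and `int_coef=1.0` are illustrative defaults, and the RE3-style variant is read as scaling extrinsic advantages by the per-step intrinsic reward. All function names are hypothetical.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one rollout without episode boundaries."""
    advs, running = [], 0.0
    next_value = last_value
    for r, v in zip(reversed(rewards), reversed(values)):
        delta = r + gamma * next_value - v
        running = delta + gamma * lam * running
        advs.append(running)
        next_value = v
    return advs[::-1]


def vanilla(ext, intr, values, last_value):
    """Sum the rewards first, then estimate advantages with one value branch."""
    total = [e + i for e, i in zip(ext, intr)]
    return gae(total, values, last_value)


def two_branch(ext, intr, ext_values, int_values, last_ext, last_int,
               ext_coef=2.0, int_coef=1.0):
    """Estimate advantages per reward stream, then mix them with coefficients."""
    ext_adv = gae(ext, ext_values, last_ext)
    int_adv = gae(intr, int_values, last_int)
    return [ext_coef * a + int_coef * b for a, b in zip(ext_adv, int_adv)]


def re3_style(ext, intr, values, last_value):
    """Scale each estimated advantage by that step's intrinsic reward."""
    return [a * i for a, i in zip(gae(ext, values, last_value), intr)]
```

The two-branch variant needs a value network with two output heads, which is why it is listed separately from the vanilla version.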

### Q6: Is memory required for better optimizing intrinsic rewards? (Roger)
- RL algo: PPO
- rewards: ICM, RND, RIDE, RE3, E3B
- candidates: 
  + LSTM policy
  + Vanilla policy
- games: 
  + 9 Atari games, same as PART1
  + SuperMarioBros-1-1-v3
  + SuperMarioBrosRandomLevels
- framework: rllte

## Research Questions

The goal of this section is to study recent research questions of interest in the literature.

e.g. Optimizing multiple intrinsic rewards together
e.g. intrinsic rewards in contextual MDPs

### Q7: The performance of mixed intrinsic rewards. (Mingqi)
- RL algo: PPO
- rewards: ICM, RND, RIDE, RE3, E3B
- candidates:
  + global + episodic (1)
    + RND+E3B
    + ICM+E3B
    + RIDE+E3B
  + global + episodic (2)
    + RND+RE3
    + ICM+RE3
    + RIDE+RE3
  + global + global
    + RND+ICM
    + RND+RIDE
    + ICM+RIDE

- games: 
  + 9 Atari games, same as PART1
  + SuperMarioBros-1-1-v3
  + SuperMarioBrosRandomLevels
  + Procgen subset
- framework: rllte
  

### Q8: Which intrinsic rewards generalize better in contextual MDPs? (Roger)
- rewards: ICM, RND, RIDE, RE3, E3B
- candidates: LSTM policy vs CNN policy.
- games: 
  + Procgen subset
  + SuperMarioBrosRandomLevels
  + Crafter (?)
- framework: rllte

# Update log

## 31/01/2024 by Mingqi
- transferred all the reward modules from the `experimental` folder to `rllte.xplore.reward`; old modules were moved to `rllte/xplore/reward/backup`
- updated `on_policy_agent.py` (Line 132-169) to adapt to the new reward base. For the convenience of experiments, we compute the intrinsic rewards (irs) directly, without the `if self.irs is not None` check, and use `R=E+I`
- updated the `.compute` function (Line 132-169) of `BaseReward`; it now requires the samples to contain all the potentially useful data:
    ``` py
    for key in ['observations', 'actions', 'rewards', 'terminateds', 'truncateds', 'infos', 'next_observations']:
        assert key in samples.keys(), f"Key {key} is not in samples."
    ```
    This is simpler to understand, and it lets `.compute`, `.watch`, and `.update` share the same arguments.

## Next Steps

- Fix pseudo_count() function for all rewards that use pseudo_counts
    + Change fixed memory for deques
- Implement RewardFilter as in (https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_rnd_envpool.py)
    - When computing values and using ForwardFilter, don't cut the value estimation when done=True
        + see lines 411-420 of the cleanrl script linked above
- Change `off_policy_agent.py` to adapt to the new reward class. 
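
The deque-based memory mentioned in the first item could look like the sketch below. It is loosely inspired by the k-nearest-neighbour kernel used in NGU-style episodic rewards, but it is a simplification under assumptions: distances are normalized by the mean of the current neighbours rather than a running average, and the class name and default values are hypothetical.

```python
import math
from collections import deque


class EpisodicPseudoCount:
    """Bounded episodic memory that turns k-NN similarity into a pseudo-count."""

    def __init__(self, capacity=1000, k=10, kernel_eps=1e-3, c=1e-3):
        # A deque with maxlen evicts the oldest embedding automatically.
        self.memory = deque(maxlen=capacity)
        self.k = k
        self.kernel_eps = kernel_eps
        self.c = c

    def reset(self):
        """Clear the memory at the start of a new episode."""
        self.memory.clear()

    def reward(self, embedding):
        """Return roughly 1/sqrt(pseudo-count) for `embedding`, then store it."""
        sq_dists = sorted(
            sum((a - b) ** 2 for a, b in zip(embedding, m)) for m in self.memory
        )[: self.k]
        self.memory.append(tuple(embedding))
        mean_sq = max(sum(sq_dists) / len(sq_dists), 1e-8) if sq_dists else 1.0
        # Each near-identical neighbour contributes about 1 to the pseudo-count.
        kernel = sum(self.kernel_eps / (d / mean_sq + self.kernel_eps) for d in sq_dists)
        return 1.0 / (math.sqrt(kernel) + self.c)
```

Revisiting the same embedding yields a shrinking reward, and the memory never grows past `capacity` regardless of episode length.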

## 31/01/2024 by Roger
- Added `normalize()` in the base reward class and an `obs_rms` bool for all rewards (defaulting to True for some and False for others)
    + Implemented obs normalization logic in the base reward and calls to normalize in `compute()` and `update()`
    + Added initialization of obs_norm parameters in `on_policy_agent.py (line 109-122)` based on cleanrl code

- Added `update_proportion` parameter to control how big the updates are. Necessary to answer Q6
    + Updated all `update()` functions for the rewards to use the `update_proportion` parameter

- Added Super Mario Bros and checked that PPO can solve it (it takes only 1M steps).

- Added calls to `self.update()` at each `self.compute()` in rewards
- Changed intrinsic reward Encoder to Mnih encoder to process 84x84 images
- Big change to schedule to better define the study


## 01/02/2024 by Mingqi
- fixed the `pseudo_counts.py` with a reasonable episodic memory;
- fixed RIDE, NGU;
- corrected the interpretations of all the arguments and code blocks;