# Outline
The experiments section consists of two parts:
- PART1: Demonstrating the performance of our implementations of various intrinsic rewards;
- PART2: Discussion of various issues of the application of intrinsic rewards.

# Preliminary work

- Mingqi:
  - Update the rllte framework to adapt the latest reward class;
  - Prepare the paper framework and write some preliminary contents.
- Roger:
  - Finish the rest of the transfer work;
  - Implement the *Disagree* reward module;
  - Check the correctness of the workflow of the implemented modules.

- I already have PPO training results on 57 Atari games, so we don't need to train them again.
- Use Super Mario 1-1 without extrinsic rewards to test correctness.

# Baseline Setting

- image input preprocessing: 
    + RND, RE3, Disagreement, REVD, RISE: rms
    + E3B, ICM, PseudoCounts, RIDE: x / 255.
- reward normalization: rms
- combination of int. and ext. rewards: R=E+I
- reward filter:
    + True for: RND, ICM, Disagreement
    + False for: E3B, PseudoCounts, RIDE, REVD, RISE, RE3
- update proportion (see https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_rnd_envpool.py#L468): 100%

- Note: Some of the baselines may not achieve good performance. In that case it might be necessary to adapt these settings.
    + e.g. maybe RIDE should use the reward filter, or maybe all methods should use x/255.
    + Make sure to debug them all in Super Mario World 1 Level 1, the best environment for debugging exploration.
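
The baseline settings above could be kept in one place as a small table, so that later changes (e.g. turning on the reward filter for RIDE) are a one-line edit. This is only a sketch: `BaselineSetting`, `BASELINES`, and the field names are hypothetical and not part of rllte.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineSetting:
    """Hypothetical container for the per-reward baseline settings."""
    obs_norm: str                    # "rms" or "x/255"
    reward_norm: str = "rms"         # same for all rewards in the baseline
    reward_filter: bool = False
    update_proportion: float = 1.0   # 100%


# Groups copied from the Baseline Setting list above.
RMS_OBS = ["RND", "RE3", "Disagreement", "REVD", "RISE"]
SCALED_OBS = ["E3B", "ICM", "PseudoCounts", "RIDE"]
FILTERED = {"RND", "ICM", "Disagreement"}

BASELINES = {
    name: BaselineSetting(obs_norm=norm, reward_filter=name in FILTERED)
    for norm, names in (("rms", RMS_OBS), ("x/255", SCALED_OBS))
    for name in names
}

for name, setting in sorted(BASELINES.items()):
    print(name, setting)
```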

# PART1

- backbone algorithms: PPO, SAC
- rewards: all reward modules
- games: 
  + 9 Atari games (image-based obs, discrete actions), 1e+7 steps
    + Exp.: PPO+Int. Reward
  + Super Mario World 1 Level 1
    + Exp.: PPO+Int. Reward (all rewards should work well here)
  + 4 bullet games (state-based obs, box actions), 1e+6 steps
    + Exp.: SAC+Int. Reward
- workload:
  + Mingqi: RE3, RISE, PseudoCounts, NGU, REVD
  + Roger: RND, ICM, RIDE, E3B, Disagree
- The 9 Atari games are:

# PART2
- General principle: each question uses only *the 1 kind of game most appropriate for that question + 1 algo.*

### Q1: The impact of different reward normalization mechanisms on final performance. (Mingqi)
- RL algo: PPO (if you use E3B, use PPO not SAC)
- rewards: ICM, RND, RIDE, RE3, E3B
- candidates: vanilla, running mean and std, min-max
- games: the same 9 Atari games as PART1 + Super Mario
- framework: rllte
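
To make the three candidates concrete, here is a minimal pure-Python sketch of how each could transform a batch of intrinsic rewards. The function names are hypothetical, and the `rms` variant assumes rewards are only divided by the running standard deviation (no mean subtraction); the actual rllte behaviour may differ.

```python
import math


class RunningMeanStd:
    """Running mean and variance, merged batch by batch (parallel update)."""

    def __init__(self, epsilon=1e-4):
        self.mean, self.var, self.count = 0.0, 1.0, epsilon

    def update(self, batch):
        n = len(batch)
        b_mean = sum(batch) / n
        b_var = sum((x - b_mean) ** 2 for x in batch) / n
        delta = b_mean - self.mean
        total = self.count + n
        # Combine the two sums of squared deviations.
        m2 = self.var * self.count + b_var * n + delta ** 2 * self.count * n / total
        self.mean += delta * n / total
        self.var = m2 / total
        self.count = total


def normalize_vanilla(rewards):
    """Candidate 1: leave the intrinsic rewards untouched."""
    return list(rewards)


def normalize_rms(rewards, rms):
    """Candidate 2: divide by the running std (assumption: no mean subtraction)."""
    rms.update(rewards)
    std = math.sqrt(rms.var) + 1e-8
    return [r / std for r in rewards]


def normalize_min_max(rewards):
    """Candidate 3: rescale the current batch to [0, 1]."""
    lo, hi = min(rewards), max(rewards)
    if hi == lo:
        return [0.0 for _ in rewards]
    return [(r - lo) / (hi - lo) for r in rewards]


rms = RunningMeanStd()
batch = [1.0, 2.0, 3.0, 4.0]
print(normalize_vanilla(batch))
print(normalize_rms(batch, rms))
print(normalize_min_max(batch))
```

Note that min-max here is computed per batch, while `rms` carries statistics across batches, so the two candidates also differ in how much history they use.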
  
### Q2: The impact of different integration patterns on final performance, on-policy setting only. (Mingqi)
- RL algo: PPO 
- rewards: ICM, RND, RIDE, RE3, E3B
- candidates:
  - vanilla version: ext. reward + int. reward -> adv. estimation -> policy update, only one branch in the value network.
  - cleanrl version: separate adv. estimation and two branches in the value network
  - RE3 version: multiply the estimated advantages by the intrinsic rewards. -- Q: Multiply or add? 
- games: the same 9 Atari games as PART1 + Super Mario
- framework: cleanrl (Roger: Why not rllte?)
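
The first two candidates can be sketched as follows for a single environment. This is an illustration under stated assumptions, not the cleanrl or rllte code: the helper names, the `dones[t]` convention, and the default coefficients are placeholders. The RE3 version is left out because the multiply-or-add question above is still open.

```python
def gae(rewards, values, last_value, dones, gamma, lam, episodic=True):
    """Generalized advantage estimation for one environment.

    dones[t] is True when the episode ended after step t. With
    episodic=False the done flags are ignored, so returns are not cut
    at episode boundaries.
    """
    n = len(rewards)
    advantages = [0.0] * n
    running = 0.0
    for t in reversed(range(n)):
        next_value = last_value if t == n - 1 else values[t + 1]
        not_done = 0.0 if (episodic and dones[t]) else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
    return advantages


def vanilla_advantages(ext_r, int_r, values, last_value, dones,
                       gamma=0.99, lam=0.95):
    """Vanilla: sum the rewards first, one value head, one GAE pass."""
    total = [e + i for e, i in zip(ext_r, int_r)]
    return gae(total, values, last_value, dones, gamma, lam)


def two_head_advantages(ext_r, int_r, ext_v, int_v, last_ext_v, last_int_v,
                        dones, ext_coef=1.0, int_coef=1.0,
                        ext_gamma=0.99, int_gamma=0.99, lam=0.95):
    """Two value heads: separate GAE passes, then a weighted sum.

    The intrinsic stream is treated as non-episodic (assumption).
    """
    ext_adv = gae(ext_r, ext_v, last_ext_v, dones, ext_gamma, lam, episodic=True)
    int_adv = gae(int_r, int_v, last_int_v, dones, int_gamma, lam, episodic=False)
    return [ext_coef * e + int_coef * i for e, i in zip(ext_adv, int_adv)]


ext_r, int_r = [1.0, 0.0, 0.0], [0.0, 1.0, 1.0]
zeros, dones = [0.0, 0.0, 0.0], [False, True, False]
print(vanilla_advantages(ext_r, int_r, zeros, 0.0, dones))
print(two_head_advantages(ext_r, int_r, zeros, zeros, 0.0, 0.0, dones))
```

With an episode boundary in the rollout the two patterns give different advantages, because only the two-head version lets the intrinsic return flow across the boundary.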

### Q3: The performance of mixed intrinsic rewards. (Mingqi)
- rewards: ICM, RND, RIDE, RE3. Roger: I think we must also use E3B here, since https://arxiv.org/abs/2306.03236 uses RND+E3B and it works well!
- candidates: different/same types.
- games: the same 9 Atari games as PART1. Roger: + Super Mario (1 level only / multiple levels). The idea is that RND+E3B should work better than RND alone when using multiple levels.
- framework: rllte
  
### Q4: Is memory required for better optimizing intrinsic rewards? (Roger)
- RL algo: PPO
- rewards: ICM, RND, RIDE, RE3, E3B
- candidates: LSTM policy vs CNN policy.
- games: the same 9 Atari games as PART1 + Super Mario
- framework: rllte

### Q5: Which intrinsic rewards generalize better in contextual MDPs? (Roger)
- rewards: ICM, RND, RIDE, RE3, E3B
- candidates: LSTM policy vs CNN policy.
- games: procgen games / Super Mario Multi Level
- framework: rllte

### Q6: The co-learning dynamics of policies and intrinsic rewards. (Roger)
- The problem is that many intrinsic rewards require learning auxiliary models (e.g. an inverse dynamics model or a forward dynamics model), and it is not clear how to co-learn them with the policy.
- RL algo: PPO
- candidates: 1 entire epoch over on-policy data every policy update | 1 single batch update every policy update (i.e. delayed updates) | Masked updates (i.e. like in cleanrl RND implementation)
- rewards: ICM, RND, RIDE, RE3, E3B
- games: the same 9 Atari games as PART1 + Super Mario
- framework: rllte
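
The three update schedules can be sketched as below. This is a framework-free illustration with hypothetical names (`auxiliary_update_plan`, `masked_mean_loss`); in practice the per-sample losses would come from the auxiliary model. The masked variant follows the general idea of keeping each sample with probability `update_proportion` and averaging the loss over the kept samples only.

```python
import random


def minibatches(n_samples, batch_size, rng):
    """Shuffle the sample indices and split them into minibatches."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    return [idx[i:i + batch_size] for i in range(0, n_samples, batch_size)]


def auxiliary_update_plan(schedule, n_samples, batch_size, rng):
    """Return the minibatches the auxiliary model is trained on per policy update."""
    batches = minibatches(n_samples, batch_size, rng)
    if schedule == "full_epoch":      # 1 entire epoch over the on-policy data
        return batches
    if schedule == "single_batch":    # 1 single batch (delayed updates)
        return batches[:1]
    raise ValueError(f"unknown schedule: {schedule}")


def masked_mean_loss(per_sample_losses, update_proportion, rng):
    """Masked update: keep each sample with probability update_proportion."""
    mask = [1.0 if rng.random() < update_proportion else 0.0
            for _ in per_sample_losses]
    kept = sum(mask)
    total = sum(l * m for l, m in zip(per_sample_losses, mask))
    return total / max(kept, 1.0)  # avoid dividing by zero when nothing is kept


rng = random.Random(0)
print(len(auxiliary_update_plan("full_epoch", 10, 4, rng)))
print(len(auxiliary_update_plan("single_batch", 10, 4, rng)))
print(masked_mean_loss([1.0, 2.0, 3.0], 0.25, rng))
```

With `update_proportion=1.0` the masked loss reduces to the plain mean, which matches the 100% baseline setting.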

# Update log

## 31/01/2024 by Mingqi
- transferred all the reward modules from the `experimental` folder to `rllte.xplore.reward`; old modules are moved to `rllte/xplore/reward/backup`
- updated `on_policy_agent.py (Line 132-169)` to adapt to the new reward base. For the convenience of experiments, we compute the intrinsic rewards directly without checking `if self.irs is not None`, and combine them as `R=E+I`
- updated the `.compute (Line 132-169)` function of `BaseReward`; it now requires the samples to contain all the potentially useful data:
    ``` py
    for key in ['observations', 'actions', 'rewards', 'terminateds', 'truncateds', 'infos', 'next_observations']:
        assert key in samples.keys(), f"Key {key} is not in samples."
    ```
    This is simpler to understand, and it lets `.compute, .watch, .update` share the same arguments.

## Next Steps

- Fix pseudo_count() function for all rewards that use pseudo_counts
    + Change fixed memory for deques
- Implement RewardFilter as in https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_rnd_envpool.py
    - When computing values and using ForwardFilter, don't cut the value estimation when done=True
        + see lines 411-420 of the cleanrl script linked above
- Change `off_policy_agent.py` to adapt to the new reward class. 
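
As a reference for the RewardFilter item, here is a minimal pure-Python sketch of a forward filter: it keeps a discounted running sum of intrinsic rewards per environment, and the spread of those sums is then used to scale the rewards. The helper `filter_and_scale` is hypothetical, and it computes the standard deviation over a single rollout for brevity, whereas a real implementation would likely keep running statistics across rollouts.

```python
import statistics


class RewardForwardFilter:
    """Discounted running sum of rewards, one entry per environment."""

    def __init__(self, gamma):
        self.gamma = gamma
        self.rewems = None

    def update(self, rewards):
        if self.rewems is None:
            self.rewems = list(rewards)
        else:
            self.rewems = [prev * self.gamma + r
                           for prev, r in zip(self.rewems, rewards)]
        return list(self.rewems)


def filter_and_scale(int_rewards, gamma=0.99):
    """Scale intrinsic rewards by the std of their filtered returns.

    int_rewards[t][e] is the intrinsic reward at step t for environment e.
    Simplification (assumption): the std is taken over this rollout only.
    """
    reward_filter = RewardForwardFilter(gamma)
    filtered = [x for step in int_rewards for x in reward_filter.update(step)]
    std = statistics.pstdev(filtered)
    scale = std if std > 0 else 1.0
    return [[r / scale for r in step] for step in int_rewards]


rollout = [[0.2, 0.1], [0.4, 0.3], [0.1, 0.5]]
print(filter_and_scale(rollout))
```

The filter is never reset on `done=True`, which is consistent with treating the intrinsic return as non-episodic.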

## 31/01/2024 by Roger
- Added `normalize()` in the base reward class and an `obs_rms` bool for all rewards (defaulting to True for some and to False for others)
    + Implemented obs normalization logic in the base reward and calls to normalize in `compute()` and `update()`
    + Added initialization of obs_norm parameters in `on_policy_agent.py (line 109-122)` based on cleanrl code

- Added `update_proportion` parameter to control how big the updates are. Necessary to answer Q6
    + Updated all `update()` functions for the rewards to use the `update_proportion` parameter
- Added calls to `self.update()` at each `self.compute()` in rewards
- Changed intrinsic reward Encoder to Mnih encoder to process 84x84 images
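
As a quick sanity check on the encoder change, the snippet below works out the flattened feature size for an 84x84 input. It assumes "Mnih encoder" means the usual three-layer convolutional stack from the DQN paper (32 filters 8x8 stride 4, 64 filters 4x4 stride 2, 64 filters 3x3 stride 1, no padding); if the rllte encoder differs, the numbers change accordingly.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# (out_channels, kernel, stride) for each layer; assumed DQN-style stack.
MNIH_LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]


def mnih_feature_size(height=84, width=84):
    """Flattened feature size after the assumed three conv layers."""
    channels = 0
    for channels, kernel, stride in MNIH_LAYERS:
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
    return channels * height * width


print(mnih_feature_size())  # 84 -> 20 -> 9 -> 7, so 64 * 7 * 7 = 3136
```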

## Next Steps
- How does obs_norm work for off-policy algorithms, e.g. SAC?