# Visualising MCC Exploration

This notebook logs exploratory results on adding teleportation to MCC, with state coverage visualisation. It is used for rough initial exploration.

21/01/2024
- Naive teleportation to argmax works
- Longer episodes are better than shorter
- Different intrinsic rewards show significantly different behavior
- Even the naive version generally improves over pure intrinsic reward
- Fails to beat intrinsic + extrinsic: perhaps because the negative extrinsic reward reveals information about the target? Not comparable, and I think this is fully explored in that reward shifting paper
- Keeps teleporting to same target
- This may be a problem with DDPG
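
The naive argmax rule above can be sketched as follows. This is a minimal illustration rather than the `cats` implementation: `select_teleport_target`, the `State` alias and the toy value function are hypothetical names, and the state is assumed to be a 2-D tuple.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[float, float]  # assumed 2-D state, e.g. (position, velocity)


def select_teleport_target(
    states: Sequence[State],
    value_fn: Callable[[State], float],
) -> int:
    """Return the index of the candidate state with the highest value estimate."""
    if not states:
        raise ValueError("no candidate states to teleport to")
    values = [value_fn(s) for s in states]
    # Naive rule: always pick the single best-scoring candidate.
    return max(range(len(states)), key=values.__getitem__)


# Toy usage with a hypothetical value function that prefers larger first coordinates.
trajectory = [(-0.5, 0.0), (-0.3, 0.02), (0.1, 0.04), (-0.1, -0.01)]
best = select_teleport_target(trajectory, value_fn=lambda s: s[0])
```

Because the rule is deterministic, an unchanged value estimate selects the same index on every call, which may be one contributor to the repeated-target behaviour noted above.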


28/01/2024
- Probabilistic teleportation works well
- Environment reset stochasticity is important
- Time limit aware Q functions are difficult to train!
- Proposal: Dynamic Truncation!
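
One way to make teleportation probabilistic is to sample the target from a softmax over value estimates. The notes do not say which distribution was used, so the softmax, the `temperature` parameter and both function names below are assumptions for illustration only.

```python
import math
import random
from typing import List, Optional, Sequence


def teleport_probabilities(values: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax over value estimates; a lower temperature approaches argmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    peak = max(values)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp((v - peak) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def sample_teleport_target(
    values: Sequence[float],
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample a target index in proportion to the softmax probabilities."""
    rng = rng or random.Random()
    probs = teleport_probabilities(values, temperature)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]


values = [0.1, 0.5, 2.0, 0.4]
probs = teleport_probabilities(values, temperature=0.5)
target = sample_teleport_target(values, temperature=0.5, rng=random.Random(0))
```

Sampling keeps the highest-valued state the most likely target while still giving other states a chance, so the target is not fixed between resets.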

4/02/2024
- ICM and RND lead to inherently different results - RND should be prioritised
- CATS fails to improve over the baseline on RND with fixed reset, but does with ICM. After a reset, the new trajectory follows the previous trajectory too closely, while resetting from the start leads to more divergence across the entire episode (and hence more exploration)
- Fixing the reset states leads to improved analysis
- Policy function gets stuck in the local minima of the Q function
- Analyse DQN instead? Skip the parametrized policy function and use an approximator?? Maybe implement QT-Opt https://arxiv.org/pdf/1806.10293.pdf. This may be important for obtaining interesting experiment results, since on MCC the policy generally fails to follow the critic even with large learning rates (why??)
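
QT-Opt replaces the parametrized policy with a cross-entropy method (CEM) search over actions that maximises the critic. A minimal sketch for a 1-D bounded action is given below. The function name, the default sample counts and the toy critic are illustrative assumptions, and `q_fn` is assumed to already be conditioned on the current state.

```python
import random
import statistics
from typing import Callable, Optional


def cem_argmax_action(
    q_fn: Callable[[float], float],
    low: float = -1.0,
    high: float = 1.0,
    n_samples: int = 64,
    n_elite: int = 6,
    n_iterations: int = 2,
    rng: Optional[random.Random] = None,
) -> float:
    """Approximate argmax_a Q(s, a) over [low, high] with the cross-entropy method."""
    rng = rng or random.Random()
    mean, std = (low + high) / 2, (high - low) / 2
    best_action, best_q = mean, q_fn(mean)
    for _ in range(n_iterations):
        # Sample candidate actions and clip them to the action bounds.
        actions = [min(high, max(low, rng.gauss(mean, std))) for _ in range(n_samples)]
        elites = sorted(actions, key=q_fn, reverse=True)[:n_elite]
        top_q = q_fn(elites[0])
        if top_q > best_q:
            best_action, best_q = elites[0], top_q
        # Refit the sampling distribution to the elite actions.
        mean = statistics.fmean(elites)
        std = statistics.pstdev(elites) + 1e-6
    return best_action


def toy_critic(action: float) -> float:
    """Hypothetical critic with a single maximum at action = 0.3."""
    return -((action - 0.3) ** 2)


action = cem_argmax_action(toy_critic, rng=random.Random(0))
```

Since the action is read directly off the critic, there is no separate actor that can lag behind it, which is the motivation given in the note above.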

11/02/2024
- Ensemble bootstrapping (Thompson sampling) seems to have uncertain impact over baseline, maybe slightly positive?
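
For reference, Thompson sampling with a bootstrapped ensemble usually means drawing one ensemble member per episode and acting greedily with respect to it. The sketch below uses discrete actions for brevity, which is a simplification if the environment has continuous actions. The class and its attributes are hypothetical and not part of `cats` or `kitten`.

```python
import random
from typing import Callable, Optional, Sequence

QFunction = Callable[[int, int], float]  # (state, action) -> estimated return


class EnsembleThompsonPolicy:
    """Acts greedily with respect to one randomly drawn ensemble member."""

    def __init__(
        self,
        q_ensemble: Sequence[QFunction],
        n_actions: int,
        rng: Optional[random.Random] = None,
    ) -> None:
        if not q_ensemble:
            raise ValueError("the ensemble needs at least one Q function")
        self.q_ensemble = list(q_ensemble)
        self.n_actions = n_actions
        self.rng = rng or random.Random()
        self.active = 0

    def resample(self) -> int:
        """Draw a new ensemble member; intended to be called once per episode."""
        self.active = self.rng.randrange(len(self.q_ensemble))
        return self.active

    def act(self, state: int) -> int:
        """Pick the action that the active member values most."""
        q = self.q_ensemble[self.active]
        return max(range(self.n_actions), key=lambda a: q(state, a))


# Two toy members that disagree: one prefers the largest action index, one the smallest.
policy = EnsembleThompsonPolicy(
    q_ensemble=[lambda s, a: float(a), lambda s, a: -float(a)],
    n_actions=3,
    rng=random.Random(0),
)
member = policy.resample()
action = policy.act(state=0)
```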


TODO:
- Confidence Bounds (How? Without latent density estimator?)
- Termination as an action
- Epsilon greedy
- Time aware exploration

Known Failure Modes
- Teleporting to the end of the episode, and immediately truncating

Ideas
- Bootstrapped Q value estimate for confidence bound guided estimation?
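
One possible reading of this idea is to treat the spread of the bootstrapped Q estimates as an uncertainty proxy and form an upper confidence bound from it. The function name and the `beta` weight below are assumptions, and this is only one of several ways to build such a bound.

```python
import statistics
from typing import Sequence


def upper_confidence_bound(q_estimates: Sequence[float], beta: float = 1.0) -> float:
    """Mean of the ensemble estimates plus beta times their population std."""
    mean = statistics.fmean(q_estimates)
    spread = statistics.pstdev(q_estimates)
    return mean + beta * spread


# Members that agree add no bonus; members that disagree raise the bound.
agree = upper_confidence_bound([1.0, 1.0, 1.0])
disagree = upper_confidence_bound([0.0, 2.0], beta=1.0)
```

This needs no latent density estimator, only the ensemble heads themselves.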

Interesting observations
- QT-Opt directly on the critic, rather than the target network, explores faster??

Reward normalisation messes up learning to reset

28/02/2024

Learning the reset distribution as a proper Markov chain helps a ton with lowering the number of resets required.
It is uncertain whether the step or sigmoid reset action performs better; experiments are needed. When checking every step, step clearly works better (the sigmoid probability adds up across checks), but sigmoid might allow finer tuning. The impact of setting on death is a lot less clear and may need a better experiment. For now, recommend adding it, as certain seeds with a large number of resets seem to benefit (again, an experiment is needed). For now, use sigmoid with a check around every $10$ steps.
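
The remark that the sigmoid probability adds up can be made concrete. Assuming each check independently triggers a reset with the same probability $p$, the chance of at least one reset over $n$ checks is $1 - (1 - p)^n$. The helper name and the numbers below are illustrative assumptions.

```python
def cumulative_reset_probability(
    p_per_check: float, episode_steps: int, check_every: int
) -> float:
    """Probability that at least one reset fires, assuming independent checks."""
    n_checks = episode_steps // check_every
    return 1.0 - (1.0 - p_per_check) ** n_checks


# A 5% per-check reset probability over a 100-step episode.
every_step = cumulative_reset_probability(0.05, episode_steps=100, check_every=1)
every_ten = cumulative_reset_probability(0.05, episode_steps=100, check_every=10)
```

Under these assumptions, checking every step makes a reset almost certain within the episode (about 0.99), while checking every 10 steps keeps it near 0.40, so a less frequent check gives the sigmoid output more usable range.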

Teleportation impacts the reset distribution, creating a different target.

2/04/2024
Check frequency is replaced with penalty.
This is annoying to get to work without teleportation, but works fine with it.

In [None]:
# Change with your own

# Define Imports and shared training information

# Std
import os
import copy

# Training
import numpy as np
import torch
import hydra
from hydra import initialize, compose

# Evaluation
from matplotlib import pyplot as plt

# Kitten
from kitten.common.util import *
from cats.evaluation import *
from cats.run import run

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

with initialize(version_base=None, config_path="cats/config"):
    cfg = compose(
        config_name="defaults.yaml",
    )


In [None]:
SEED = 0

# Work on a copy so that per-experiment overrides do not mutate the shared cfg
experiment_cfg = copy.deepcopy(cfg)
# experiment_cfg.noise.scale = [0.1, 0.01]
experiment = CatsExperiment(
    cfg=experiment_cfg,
    device=DEVICE
)
experiment.run()

N_ROWS, N_COL = 1, 4
fig, axs = plt.subplots(N_ROWS, N_COL)
fig.set_size_inches(N_COL * 6, N_ROWS * 4)
fig.subplots_adjust(wspace=0.3)
visualise_memory(experiment, fig, axs[0])
visualise_experiment_value_estimate(experiment, fig, axs[1], axs[2])
#visualise_teleport_targets(experiment, fig, axs[3])
print("Entropy ", entropy_memory(experiment.memory.rb))
print("Intrinsic Normalisation", str(experiment.intrinsic._reward_normalisation))
print("Intrinsic", evaluate_rnd(experiment))