Since **RLLTE** decouples RL algorithms into minimal primitives from the perspective of exploitation and exploration, intrinsic reward shaping is supported by default. Because intrinsic reward methods differ greatly in how they compute rewards, **RLLTE** adopts the following rules:

1. The environments are assumed to be ***vectorized***;
2. The ***compute_irs*** function of each intrinsic reward module has a mandatory argument ***samples***, which is a dict of the form:
     - obs `(n_steps, n_envs, *obs_shape)`, `torch.Tensor`
     - actions `(n_steps, n_envs, *action_shape)`, `torch.Tensor`
     - rewards `(n_steps, n_envs)`, `torch.Tensor`
     - next_obs `(n_steps, n_envs, *obs_shape)`, `torch.Tensor`
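
The layout above can be spelled out as a small helper. This is a minimal sketch in plain Python, not part of **RLLTE**: the function name `expected_sample_shapes` is hypothetical, and the example values (128 steps, 7 environments, observation shape `(9, 84, 84)`, action shape `(1,)`) are simply the ones used in the RE3 example in this notebook.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]


def expected_sample_shapes(
    n_steps: int, n_envs: int, obs_shape: Shape, action_shape: Shape
) -> Dict[str, Shape]:
    """Return the tensor shape expected under each key of `samples`.

    Hypothetical helper for illustration only; it is not an RLLTE function.
    """
    lead = (n_steps, n_envs)  # every entry starts with (n_steps, n_envs)
    return {
        "obs": lead + tuple(obs_shape),
        "actions": lead + tuple(action_shape),
        "rewards": lead,
        "next_obs": lead + tuple(obs_shape),
    }


shapes = expected_sample_shapes(
    n_steps=128, n_envs=7, obs_shape=(9, 84, 84), action_shape=(1,)
)
for key, shape in shapes.items():
    print(key, shape)
```

Allocating a `torch.Tensor` of each of these shapes would give a dict that follows the rule; a module that needs only some of the keys can be given just those.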

Take RE3 as an example: it computes the intrinsic reward of each state from the Euclidean distance between that state and its $k$-nearest neighbor within a mini-batch, so the ***obs*** data alone is sufficient to compute the reward. The following code shows how to use RE3:

In [3]:
from rllte.xplore.reward import RE3
from rllte.env import make_dmc_env
import torch as th

if __name__ == '__main__':
    num_envs = 7
    num_steps = 128
    # create env
    env = make_dmc_env(env_id="cartpole_balance", num_envs=num_envs)
    print(env.observation_space, env.action_space)
    # create RE3 instance
    re3 = RE3(
        observation_space=env.observation_space,
        action_space=env.action_space
    )
    # compute intrinsic rewards (random observations stand in for collected rollouts)
    obs = th.rand(size=(num_steps, num_envs, *env.observation_space.shape))
    intrinsic_rewards = re3.compute_irs(samples={'obs': obs})

    print(intrinsic_rewards.shape, type(intrinsic_rewards))
    print(intrinsic_rewards)

pygame 2.4.0 (SDL 2.26.4, Python 3.8.16)
Hello from the pygame community. https://www.pygame.org/contribute.html
Box(0, 255, (9, 84, 84), uint8) Box(-1.0, 1.0, (1,), float32)
torch.Size([128, 7]) <class 'torch.Tensor'>
tensor([[0.0081, 0.0083, 0.0079, 0.0080, 0.0075, 0.0077, 0.0079],
        [0.0075, 0.0079, 0.0078, 0.0076, 0.0080, 0.0083, 0.0083],
        [0.0077, 0.0081, 0.0083, 0.0078, 0.0078, 0.0077, 0.0076],
        [0.0081, 0.0080, 0.0080, 0.0084, 0.0085, 0.0082, 0.0080],
        [0.0079, 0.0081, 0.0077, 0.0073, 0.0080, 0.0079, 0.0079],
        [0.0083, 0.0077, 0.0081, 0.0079, 0.0075, 0.0080, 0.0082],
        [0.0085, 0.0078, 0.0076, 0.0082, 0.0078, 0.0082, 0.0080],
        [0.0081, 0.0082, 0.0078, 0.0077, 0.0076, 0.0081, 0.0082],
        [0.0075, 0.0080, 0.0087, 0.0077, 0.0076, 0.0082, 0.0078],
        [0.0080, 0.0077, 0.0080, 0.0072, 0.0080, 0.0081, 0.0079],
        [0.0078, 0.0080, 0.0076, 0.0076, 0.0077, 0.0076, 0.0081],
        [0.0084, 0.0080, 0.0076, 0.0081, 0.0082, 0.0080
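
To make the $k$-nearest-neighbor idea behind RE3 concrete, here is a minimal sketch in plain Python on a tiny batch of 2-D states. It is not the library's implementation: the function names are hypothetical, the states are used directly as vectors, and the reward is taken as `log(1 + d)` of the distance `d` to the $k$-th nearest neighbor, which is an assumption made for illustration. The `RE3` module may differ in these details.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_intrinsic_rewards(states: Sequence[Sequence[float]], k: int = 3) -> List[float]:
    """Reward each state by log(1 + distance to its k-th nearest neighbor).

    Hypothetical helper for illustration; the log(1 + d) form is an assumption.
    """
    if not 1 <= k < len(states):
        raise ValueError("k must satisfy 1 <= k < number of states")
    rewards = []
    for i, state in enumerate(states):
        # Sorted distances from state i to every other state in the batch.
        dists = sorted(
            euclidean(state, other) for j, other in enumerate(states) if j != i
        )
        rewards.append(math.log(1.0 + dists[k - 1]))
    return rewards


# Three states close together and one far away from the rest.
batch = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
rewards = knn_intrinsic_rewards(batch, k=1)
print(rewards)
```

The isolated state receives the largest reward, which is the behavior an exploration bonus is after: rarely visited regions of the state space are rewarded more than densely visited ones.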

You can also attach the intrinsic reward module to any of the implemented algorithms directly through the `.set` function. Run the cell below and the log will show that the intrinsic reward module is enabled:


In [4]:
from rllte.agent import PPO
from rllte.env import make_atari_env
from rllte.xplore.reward import RE3

if __name__ == "__main__":
    # env setup
    device = "cuda:0"
    env = make_atari_env(device=device)
    eval_env = make_atari_env(device=device)
    # create agent
    agent = PPO(env=env, 
                eval_env=eval_env, 
                device=device,
                tag="ppo_atari")
    # create intrinsic reward
    re3 = RE3(observation_space=env.observation_space,
              action_space=env.action_space,
              device=device)
    # set the module
    agent.set(reward=re3)
    # start training
    agent.train(num_train_steps=5000)

A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]


[08/29/2023 11:55:07 AM] - [[1m[34mINFO.[0m] - Invoking RLLTE Engine...
[08/29/2023 11:55:07 AM] - [[1m[34mINFO.[0m] - Tag               : ppo_atari
[08/29/2023 11:55:07 AM] - [[1m[34mINFO.[0m] - Device            : NVIDIA GeForce RTX 3090
[08/29/2023 11:55:07 AM] - [[1m[33mDEBUG[0m] - Agent             : PPO
[08/29/2023 11:55:07 AM] - [[1m[33mDEBUG[0m] - Encoder           : MnihCnnEncoder
[08/29/2023 11:55:07 AM] - [[1m[33mDEBUG[0m] - Policy            : OnPolicySharedActorCritic
[08/29/2023 11:55:07 AM] - [[1m[33mDEBUG[0m] - Storage           : VanillaRolloutStorage
[08/29/2023 11:55:07 AM] - [[1m[33mDEBUG[0m] - Distribution      : Categorical
[08/29/2023 11:55:07 AM] - [[1m[33mDEBUG[0m] - Augmentation      : False
[08/29/2023 11:55:07 AM] - [[1m[33mDEBUG[0m] - Intrinsic Reward  : True, RE3
[08/29/2023 11:55:09 AM] - [[1m[32mEVAL.[0m] - S: 0           | E: 0           | L: 30          | R: 53.000      | T: 0:00:03    
[08/29/2023 11:55:11 AM] - [[1m