grid2op.Reward

Reward


Objectives

This module implements some utilities to compute rewards given a grid2op.Action, a grid2op.Environment and some associated context (for example, whether an error occurred, etc.).

It is possible to modify the reward in use, either to better suit a training scheme or to better take some phenomenon into account, for example by simulating the effect of some grid2op.Action using grid2op.Observation.BaseObservation.simulate.

Doing so only requires deriving from BaseReward and implementing, most notably, the three abstract methods BaseReward.__init__, BaseReward.initialize and BaseReward.__call__.

Customization of the reward

In grid2op you can customize the reward function / reward kernel used by your agent. By default, when you create an environment, a reward has already been specified by the creator of the environment and you have nothing to do:

import grid2op
env_name = "l2rpn_case14_sandbox"

# the default reward of this environment is used, nothing to specify
env = grid2op.make(env_name)

obs = env.reset()
an_action = env.action_space()  # the "do nothing" action
obs, reward_value, done, info = env.step(an_action)

The reward value above is computed by a default function that depends on the environment you are using. In the example above, the "l2rpn_case14_sandbox" environment uses the RedispReward.
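
If you want to check programmatically which reward a given environment uses, one possibility is sketched below; it assumes the env.get_reward_instance() helper, whose exact name and behaviour may vary between grid2op versions:

import grid2op

env = grid2op.make("l2rpn_case14_sandbox")

# inspect the reward object currently attached to the environment
# (get_reward_instance is assumed here, check your grid2op version)
reward_used = env.get_reward_instance()
print(type(reward_used).__name__)  # expected to print "RedispReward" for this environment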

Using a reward function available in grid2op

If you want to customize your environment by adapting the reward, and use a reward already available in grid2op, it is rather simple: you need to specify it in the make command:

import grid2op
from grid2op.Reward import EpisodeDurationReward
env_name = "l2rpn_case14_sandbox"

env = grid2op.make(env_name, reward_class=EpisodeDurationReward)

obs = env.reset()
an_action = env.action_space()
obs, reward_value, done, info = env.step(an_action)

In this example, reward_value is computed using the formula defined in the EpisodeDurationReward.

Note

There is no error in this syntax: you need to provide the class and not an object (an instance) of the class (see the next paragraph for more information about that).

At the time of writing, the available reward functions are:

  • AlarmReward
  • AlertReward
  • BridgeReward
  • CloseToOverflowReward
  • ConstantReward
  • DistanceReward
  • EconomicReward
  • EpisodeDurationReward
  • FlatReward
  • GameplayReward
  • IncreasingFlatReward
  • L2RPNReward
  • LinesCapacityReward
  • LinesReconnectedReward
  • N1Reward
  • RedispReward

Grid2op also provides some convenience classes to combine different rewards. These are:

  • CombinedReward
  • CombinedScaledReward

Basically, these two classes allow you to combine (sum) different rewards into a single one, as sketched below.
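
For example, here is a minimal sketch of how CombinedReward can be set up; it follows the addReward(...) / initialize(...) pattern documented for CombinedScaledReward, and the exact API might differ depending on your grid2op version:

import grid2op
from grid2op.Reward import CombinedReward, GameplayReward, FlatReward

env_name = "l2rpn_case14_sandbox"
env = grid2op.make(env_name, reward_class=CombinedReward)

# retrieve the reward instance used by the environment and register the
# rewards to combine, each with a weight (API assumed, check your version)
cr = env.get_reward_instance()
cr.addReward("gameplay", GameplayReward(), 1.0)
cr.addReward("flat", FlatReward(), 0.5)
cr.initialize(env)

obs = env.reset()
obs, reward_value, done, info = env.step(env.action_space())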

Passing an instance instead of a class

On some occasions, it might be easier to work with instances of classes (objects) rather than with the classes themselves (especially if you want to customize the implementation used). You can do this without any issue:

import grid2op
from grid2op.Reward import N1Reward
env_name = "l2rpn_case14_sandbox"

n1_l1_reward = N1Reward(l_id=1)  # this is an object and not a class.
env = grid2op.make(env_name, reward_class=n1_l1_reward)

obs = env.reset()
an_action = env.action_space()
obs, reward_value, done, info = env.step(an_action)

In this example, reward_value is computed as the maximum flow on all the powerlines after the disconnection of powerline 1 (because we specified l_id=1 at creation). If you want to know the maximum flows after the disconnection of powerline 5, you can write:

import grid2op
from grid2op.Reward import N1Reward
env_name = "l2rpn_case14_sandbox"

n1_l5_reward = N1Reward(l_id=5)  # this is an object and not a class.
env = grid2op.make(env_name, reward_class=n1_l5_reward)
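
As a side note, if you want to monitor the "N-1" flows for several powerlines at the same time, one possible sketch relies on the other_rewards mechanism described in the section "Training with multiple rewards" below:

import grid2op
from grid2op.Reward import N1Reward

env_name = "l2rpn_case14_sandbox"

# one N1Reward per monitored powerline, each result is exposed in info["rewards"]
env = grid2op.make(env_name,
                   other_rewards={f"n1_l{l_id}": N1Reward(l_id=l_id) for l_id in [1, 5]})

obs = env.reset()
obs, reward_value, done, info = env.step(env.action_space())
n1_flow_line_5 = info["rewards"]["n1_l5"]  # value computed by N1Reward(l_id=5)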

Customizing the reward for the "simulate"

In grid2op, you have the possibility to simulate the impact of an action on some future steps with the use of obs.simulate(...) (see grid2op.Observation.BaseObservation.simulate) or obs.get_forecast_env() (see grid2op.Observation.BaseObservation.get_forecast_env).

These methods also involve reward computations. Grid2op lets you customize how these rewards are computed, in several ways:

import grid2op
from grid2op.Reward import EpisodeDurationReward
env_name = "l2rpn_case14_sandbox"

env = grid2op.make(env_name, reward_class=EpisodeDurationReward)
obs = env.reset()

an_action = env.action_space()
sim_obs, sim_reward, sim_d, sim_i = obs.simulate(an_action)

By default, sim_reward is computed with the same function as the environment, in this example EpisodeDurationReward.

If for some reason you want to customize the formula used to compute sim_reward and cannot (or do not want to) modify the reward of the environment, you can:

import grid2op
from grid2op.Reward import EpisodeDurationReward
env_name = "l2rpn_case14_sandbox"

env = grid2op.make(env_name)
obs = env.reset()

# only the reward used by obs.simulate is changed, the reward of env.step is not affected
env.observation_space.change_reward(EpisodeDurationReward)
an_action = env.action_space()

sim_obs, sim_reward, sim_d, sim_i = obs.simulate(an_action)
next_obs, reward_value, done, info = env.step(an_action)

In this example, sim_reward is computed using the EpisodeDurationReward (on forecast data) and reward_value is computed using the default reward of "l2rpn_case14_sandbox" on the "real" time series data.
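
The obs.get_forecast_env() route mentioned above works in a similar spirit. Here is a minimal sketch, assuming the forecast environment is used like a regular environment (reset then step); the exact behaviour, in particular which reward it uses, may depend on your grid2op version:

import grid2op
from grid2op.Reward import EpisodeDurationReward
env_name = "l2rpn_case14_sandbox"

env = grid2op.make(env_name)
obs = env.reset()
env.observation_space.change_reward(EpisodeDurationReward)
an_action = env.action_space()

# a "forecast environment" is built from the forecasts available in the observation
forecast_env = obs.get_forecast_env()
f_obs = forecast_env.reset()
f_obs, f_reward, f_done, f_info = forecast_env.step(an_action)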

Creating a new reward

If you don't find any suitable reward function in grid2op (or in another package) you might want to implement one yourself.

To that end, you need to implement a class that derives from BaseReward, like this:

import grid2op
from grid2op.Reward import BaseReward
from grid2op.Action import BaseAction
from grid2op.Environment import BaseEnv


class MyCustomReward(BaseReward):
    def __init__(self, whatever, you, want, logger=None):
        self.whatever = whatever
        # some code needed
        ...
        super().__init__(logger)

    def __call__(self,
                 action: BaseAction,
                 env: BaseEnv,
                 has_error: bool,
                 is_done: bool,
                 is_illegal: bool,
                 is_ambiguous: bool) -> float:
        # the only method really required.
        # called at each step to compute the reward:
        # this is where you need to code the "formula" of your reward
        ...

    def initialize(self, env: BaseEnv):
        # optional
        # called once, the first time the reward is used
        pass

    def reset(self, env: BaseEnv):
        # optional
        # called by the environment each time it is "reset"
        pass

    def close(self):
        # optional
        # called once when the environment is deleted
        pass

And then you can use your (custom) reward like any other:

import grid2op
from the_above_script import MyCustomReward
env_name = "l2rpn_case14_sandbox"

custom_reward = MyCustomReward(whatever=1, you=2, want=42)
env = grid2op.make(env_name, reward_class=custom_reward)
obs = env.reset()
an_action = env.action_space()
obs, reward_value, done, info = env.step(an_action)

And now reward_value is computed using the formula you defined in __call__.
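
As an illustration, here is a small self-contained sketch of a custom reward that favours keeping powerlines far from their thermal limits. The class name LineMarginReward is made up for this example, and it assumes that env.get_obs() and the obs.rho attribute (flow on each powerline divided by its thermal limit) behave as in recent grid2op versions:

import numpy as np

import grid2op
from grid2op.Reward import BaseReward
from grid2op.Action import BaseAction
from grid2op.Environment import BaseEnv


class LineMarginReward(BaseReward):
    """Illustrative reward: average remaining capacity margin over all powerlines."""
    def __init__(self, logger=None):
        super().__init__(logger)
        # bounds conventionally exposed by rewards (used e.g. by CombinedScaledReward)
        self.reward_min = -1.0
        self.reward_max = 1.0

    def __call__(self,
                 action: BaseAction,
                 env: BaseEnv,
                 has_error: bool,
                 is_done: bool,
                 is_illegal: bool,
                 is_ambiguous: bool) -> float:
        if has_error or is_illegal or is_ambiguous:
            # penalize "game over", illegal and ambiguous actions
            return self.reward_min
        obs = env.get_obs()  # current observation (assumed API, check your version)
        # obs.rho is the flow on each powerline divided by its thermal limit
        margin = 1.0 - np.minimum(obs.rho, 1.0)
        return float(np.mean(margin))


custom_reward = LineMarginReward()
env = grid2op.make("l2rpn_case14_sandbox", reward_class=custom_reward)
obs = env.reset()
obs, reward_value, done, info = env.step(env.action_space())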

Training with multiple rewards

In the standard reinforcement learning framework the reward is unique. In grid2op, we didn't want to modify that.

However, power grids are complex environments with some specific and unusual dynamics. For these reasons, it can be difficult to compress all the relevant signals into one single scalar. To speed up the learning process, or to force the agent to adopt more resilient strategies, it can be useful to look at different aspects of the problem, and thus to use different rewards. Grid2op allows you to do so: at each time step (and also when using the simulate function) it is possible to compute several rewards. These rewards must inherit from BaseReward and be provided at the initialization of the environment.

This can be done as follows:

import grid2op
from grid2op.Reward import GameplayReward, L2RPNReward

env = grid2op.make("l2rpn_case14_sandbox",
                   reward_class=L2RPNReward,
                   other_rewards={"gameplay": GameplayReward})
obs = env.reset()
act = env.action_space()  # the do nothing action
obs, reward, done, info = env.step(act)  # apply the do nothing action on the environment

In this example, reward comes from the L2RPNReward and the result of the reward computed with the GameplayReward is accessible with info["rewards"]["gameplay"]. For this example we chose to name the other reward "gameplay", which echoes the name of the GameplayReward class, for convenience; the name can be absolutely any string you want.

NB In the case of the L2RPN competitions, the reward can be modified by the competitors, and so can the "other_rewards" keyword argument. The only restriction is that the key "__score" will be used by the organizers to compute the score of the agent. Any attempt to modify it will be erased by the score function used by the organizers without any warning.

What happens in the "reset"

TODO

Detailed Documentation by class

grid2op.Reward