grid2op.Reward
This module implements utilities to compute rewards given a grid2op.Action, a grid2op.Environment and some associated context (for example, whether an error occurred).
You can change the reward to better suit a training scheme, or to better account for some phenomenon by simulating the effect of a grid2op.Action using grid2op.Observation.BaseObservation.simulate.
Doing so only requires deriving from BaseReward and overriding, most notably, the three methods BaseReward.__init__, BaseReward.initialize and BaseReward.__call__.
In grid2op you can customize the reward function (or reward kernel) used by your agent. By default, when you create an environment, its creator has already specified a reward for you and you have nothing to do:
import grid2op
env_name = "l2rpn_case14_sandbox"
env = grid2op.make(env_name)
obs = env.reset()
an_action = env.action_space()
obs, reward_value, done, info = env.step(an_action)
The value of the reward function above is computed by a default function that depends on the environment you are using. For the example above, the "l2rpn_case14_sandbox" environment uses RedispReward.
Using a reward function available in grid2op
If you want to customize your environment by using another reward that is available in grid2op, it is rather simple: you only need to specify it in the make command:
import grid2op
from grid2op.Reward import EpisodeDurationReward
env_name = "l2rpn_case14_sandbox"
env = grid2op.make(env_name, reward_class=EpisodeDurationReward)
obs = env.reset()
an_action = env.action_space()
obs, reward_value, done, info = env.step(an_action)
In this example, reward_value is computed using the formula defined in EpisodeDurationReward.
Note
There is no error in the syntax. You need to provide the class and not an object of the class (see next paragraph for more information about that).
At the time of writing, the available reward functions are:
AlarmReward
AlertReward
BridgeReward
CloseToOverflowReward
ConstantReward
DistanceReward
EconomicReward
EpisodeDurationReward
FlatReward
GameplayReward
IncreasingFlatReward
L2RPNReward
LinesCapacityReward
LinesReconnectedReward
N1Reward
RedispReward
The provided rewards also include some convenience classes to combine different rewards. These are:
CombinedReward
CombinedScaledReward
Basically, these two classes allow you to combine (sum) different rewards into a single one.
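To make "combine (sum)" concrete, here is a minimal plain-Python sketch of a weighted sum of several reward terms. It does not use grid2op and is not how CombinedReward is implemented: `combine_rewards`, the two toy terms and the dictionary-based `state` are hypothetical names chosen only for illustration.

```python
from typing import Callable, Dict, Tuple

# A reward term maps a state (here a plain dict) to a float.
RewardTerm = Callable[[dict], float]


def combine_rewards(terms: Dict[str, Tuple[RewardTerm, float]], state: dict) -> float:
    """Return the weighted sum of every reward term evaluated on `state`."""
    return sum(weight * term(state) for term, weight in terms.values())


# Two toy terms (hypothetical, for illustration only).
def alive_bonus(state: dict) -> float:
    # 1.0 while the episode is still running, 0.0 after a game over
    return 0.0 if state["game_over"] else 1.0


def margin_term(state: dict) -> float:
    # 1.0 when lines are unloaded, 0.0 when the most loaded line reaches its limit
    return max(0.0, 1.0 - state["max_loading"])


# Each entry is (term, weight); the names are free-form labels.
terms = {"alive": (alive_bonus, 1.0), "margin": (margin_term, 2.0)}
state = {"game_over": False, "max_loading": 0.75}
combined = combine_rewards(terms, state)
print(combined)
```

The weights let one term dominate or merely nudge the total, which is usually the main design decision when several signals are merged into one scalar.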
On some occasions, it might be easier to work with instances of classes (objects) rather than with the classes themselves, especially if you want to customize the implementation used. You can do this without any issue:
import grid2op
from grid2op.Reward import N1Reward
env_name = "l2rpn_case14_sandbox"
n1_l1_reward = N1Reward(l_id=1) # this is an object and not a class.
env = grid2op.make(env_name, reward_class=n1_l1_reward)
obs = env.reset()
an_action = env.action_space()
obs, reward_value, done, info = env.step(an_action)
In this example, reward_value is computed as the maximum flow on all the powerlines after the disconnection of powerline 1 (because we specified l_id=1 at creation). If you want to know the maximum flow after the disconnection of powerline 5 instead, you can use:
import grid2op
from grid2op.Reward import N1Reward
env_name = "l2rpn_case14_sandbox"
n1_l5_reward = N1Reward(l_id=5) # this is an object and not a class.
env = grid2op.make(env_name, reward_class=n1_l5_reward)
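The distinction between passing a class and passing an object can be illustrated in plain Python. The sketch below is not grid2op's code: `build_reward` and `ToyN1Reward` are hypothetical, and only show why an object is needed once the constructor takes an argument such as l_id.

```python
import inspect


class ToyN1Reward:
    """Hypothetical stand-in for a reward whose constructor takes an argument."""

    def __init__(self, l_id: int = 0):
        self.l_id = l_id


def build_reward(reward_spec):
    """Accept either a class (built with its defaults) or a ready-made object."""
    if inspect.isclass(reward_spec):
        # A class can only be instantiated with its default arguments here.
        return reward_spec()
    # An object keeps whatever arguments it was created with.
    return reward_spec


from_class = build_reward(ToyN1Reward)
from_object = build_reward(ToyN1Reward(l_id=5))
print(from_class.l_id, from_object.l_id)
```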
In grid2op, you can simulate the impact of an action on some future steps with obs.simulate(...) (see grid2op.Observation.BaseObservation.simulate) or obs.get_forecast_env() (see grid2op.Observation.BaseObservation.get_forecast_env).
These methods also compute rewards, and grid2op lets you customize how these rewards are computed. You can change this in several ways:
import grid2op
from grid2op.Reward import EpisodeDurationReward
env_name = "l2rpn_case14_sandbox"
env = grid2op.make(env_name, reward_class=EpisodeDurationReward)
obs = env.reset()
an_action = env.action_space()
sim_obs, sim_reward, sim_d, sim_i = obs.simulate(an_action)
By default, sim_reward is computed with the same function as the environment, in this example EpisodeDurationReward.
If for some reason you want to customize the formula used to compute sim_reward and cannot (or do not want to) modify the reward of the environment, you can do the following:
import grid2op
from grid2op.Reward import EpisodeDurationReward
env_name = "l2rpn_case14_sandbox"
env = grid2op.make(env_name)
obs = env.reset()
env.observation_space.change_reward(EpisodeDurationReward)
an_action = env.action_space()
sim_obs, sim_reward, sim_d, sim_i = obs.simulate(an_action)
next_obs, reward_value, done, info = env.step(an_action)
In this example, sim_reward is computed using EpisodeDurationReward (on forecast data) and reward_value is computed using the default reward of "l2rpn_case14_sandbox" on the "real" time series data.
If you don't find any suitable reward function in grid2op (or in another package), you might want to implement one yourself.
To that end, you need to implement a class that derives from BaseReward, like this:
import grid2op
from grid2op.Reward import BaseReward
from grid2op.Action import BaseAction
from grid2op.Environment import BaseEnv
class MyCustomReward(BaseReward):
    def __init__(self, whatever, you, want, logger=None):
        # store whatever your reward needs
        self.whatever = whatever
        # some code needed
        ...
        super().__init__(logger)

    def __call__(self,
                 action: BaseAction,
                 env: BaseEnv,
                 has_error: bool,
                 is_done: bool,
                 is_illegal: bool,
                 is_ambiguous: bool) -> float:
        # only method really required.
        # called at each step to compute the reward.
        # this is where you need to code the "formula" of your reward
        ...

    def initialize(self, env: BaseEnv):
        # optional
        # called once, the first time the reward is used
        pass

    def reset(self, env: BaseEnv):
        # optional
        # called by the environment each time it is "reset"
        pass

    def close(self):
        # optional
        # called once when the environment is deleted
        pass
And then you can use your (custom) reward like any other:
import grid2op
from the_above_script import MyCustomReward
env_name = "l2rpn_case14_sandbox"
custom_reward = MyCustomReward(whatever=1, you=2, want=42)
env = grid2op.make(env_name, reward_class=custom_reward)
obs = env.reset()
an_action = env.action_space()
obs, reward_value, done, info = env.step(an_action)
And now reward_value is computed using the formula you defined in __call__.
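As a concrete instance of the template above, here is a minimal sketch of a reward that only looks at the boolean flags passed to __call__. `LegalActionReward`, `bonus` and `penalty` are hypothetical names. The fallback BaseReward stub exists only so the snippet can be run without grid2op installed, and setting reward_min and reward_max assumes that BaseReward exposes these attributes as the range of the reward.

```python
try:
    from grid2op.Reward import BaseReward
except ImportError:
    # Stand-in used only when grid2op is not installed, so the sketch still runs.
    class BaseReward:
        def __init__(self, logger=None):
            self.logger = logger


class LegalActionReward(BaseReward):
    """Hypothetical reward: `bonus` for a clean step, `penalty` otherwise."""

    def __init__(self, bonus: float = 1.0, penalty: float = -1.0, logger=None):
        super().__init__(logger)
        self.bonus = float(bonus)
        self.penalty = float(penalty)
        # Assumption: these two attributes describe the range of the reward.
        self.reward_min = min(self.bonus, self.penalty)
        self.reward_max = max(self.bonus, self.penalty)

    def __call__(self, action, env, has_error, is_done, is_illegal, is_ambiguous) -> float:
        # Penalize any step that failed or whose action was rejected.
        if has_error or is_illegal or is_ambiguous:
            return self.penalty
        return self.bonus


reward = LegalActionReward(bonus=1.0, penalty=-5.0)
# `action` and `env` are unused by this toy reward, so None is enough here.
clean = reward(None, None, has_error=False, is_done=False,
               is_illegal=False, is_ambiguous=False)
illegal = reward(None, None, has_error=False, is_done=False,
                 is_illegal=True, is_ambiguous=False)
print(clean, illegal)
```

Because this reward never reads `action` or `env`, it can be checked in isolation before being plugged into an environment.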
In the standard reinforcement learning framework the reward is unique. In grid2op, we didn't want to modify that.
However, powergrids are complex environments with some specific and unusual dynamics. For these reasons it can be difficult to compress all these signals into one single scalar. To speed up the learning process, or to force the Agent to adopt more resilient strategies, it can be useful to look at different aspects, and thus to use different rewards. Grid2op allows you to do so. At each time step (and also when using the simulate function) it is possible to compute different rewards. These rewards must inherit from BaseReward and be provided at the initialization of the Environment.
This can be done as follows:
import grid2op
from grid2op.Reward import GameplayReward, L2RPNReward
env = grid2op.make("case14_realistic", reward_class=L2RPNReward, other_rewards={"gameplay": GameplayReward})
obs = env.reset()
act = env.action_space() # the do nothing action
obs, reward, done, info = env.step(act) # implement the do nothing action on the environment
In this example, "reward" comes from L2RPNReward, and the result of the reward computed with GameplayReward is accessible with info["rewards"]["gameplay"]. For convenience, we chose to name the other reward "gameplay" in this example, after the name of the class GameplayReward. The name can be absolutely any string you want.
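Below is a minimal sketch of reading the extra rewards after a step. To stay runnable without grid2op it uses a hand-written `info` dictionary with the shape described above. The numeric values are placeholders rather than real outputs, and `summarize_rewards` is a hypothetical helper.

```python
from typing import Dict


def summarize_rewards(main_reward: float, info: dict) -> Dict[str, float]:
    """Merge the main reward and the entries of info["rewards"] into one flat dict."""
    summary = {"main": float(main_reward)}
    # info["rewards"] maps each name given in `other_rewards` to its value.
    for name, value in info.get("rewards", {}).items():
        summary[name] = float(value)
    return summary


# Placeholder values standing in for the result of env.step(act).
reward = 10.0
info = {"rewards": {"gameplay": 1.0}}
summary = summarize_rewards(reward, info)
print(summary)
```

A flat dictionary like this is convenient for logging every signal of a step side by side while still training on the main reward only.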
NB In the case of L2RPN competitions, the reward can be modified by the competitors, and so can the "other_rewards" keyword argument. The only restriction is that the key "__score" will be used by the organizers to compute the score of the agent. Any attempt to modify it will be erased, without any warning, by the score function used by the organizers.