## Introduction

This notebook uses a dedicated environment called "l2rpn_neurips_2020_track1", which has 36 substations. Grid2op ships with many environments, each posing different problems; this notebook covers only this one.

The approach in this notebook is based on loss minimization. We combine the SARSA algorithm with a Q-network (a neural network that approximates Q-values); the resulting method is called the Deep SARSA algorithm.

The algorithm selects the current action $a$ in the current state $S$ using an $\epsilon$-greedy policy. The reward $r$ and the next state $S'$ are then observed.
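
As a rough illustration of $\epsilon$-greedy selection, the sketch below picks a random action with probability $\epsilon$ and the highest-valued action otherwise. The function name `epsilon_greedy` and the Q-values are made up for this example; this is not the agent's actual `predict_movement` implementation.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # explore: uniform choice among all actions
        return int(rng.integers(len(q_values)))
    # exploit: action with the highest estimated Q-value
    return int(np.argmax(q_values))


rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2])  # hypothetical Q-values for three actions
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
random_action = epsilon_greedy(q_values, epsilon=1.0, rng=rng)
```

With `epsilon=0.0` the choice is always greedy, which matches how `my_act` queries the network at evaluation time later in this notebook.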

The current state $S$, current action $a$, reward $r$ and next state $S'$ are then stored in the replay buffer (replay memory). The main idea behind the replay buffer is to train the Q-network on these stored experiences: for each of them, the Q-network computes the Q-value $Q(S,a)$.
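
A replay buffer can be sketched as a bounded queue that is sampled uniformly. The class `SimpleReplayBuffer` below is a hypothetical stand-in, not the `ReplayBuffer` from `l2rpn_baselines` used later; it only mirrors the `(state, action, reward, done, next_state)` ordering that the training code unpacks.

```python
import random
from collections import deque


class SimpleReplayBuffer:
    """Fixed-capacity buffer; the oldest experiences are dropped first."""

    def __init__(self, capacity):
        self._storage = deque(maxlen=capacity)

    def add(self, state, action, reward, done, next_state):
        self._storage.append((state, action, reward, done, next_state))

    def sample(self, batch_size):
        # uniform sampling without replacement
        batch = random.sample(list(self._storage), batch_size)
        states, actions, rewards, dones, next_states = zip(*batch)
        return states, actions, rewards, dones, next_states

    def __len__(self):
        return len(self._storage)


buffer = SimpleReplayBuffer(capacity=3)
for step in range(5):
    # toy experiences: the state is just the step index
    buffer.add(state=step, action=0, reward=1.0, done=False, next_state=step + 1)
states, actions, rewards, dones, next_states = buffer.sample(batch_size=2)
```

Sampling past experiences at random, rather than training on consecutive steps, reduces the correlation between the samples in a batch.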

A copy of the Q-network is used as a target network to compute the Q-value of the next action $a'$ in the next state $S'$, where $a'$ is also selected with the $\epsilon$-greedy policy. The target network predicts the next Q-values $\hat{Q}(S',a')$, which are used to compute the target state-action values $r+\gamma\,\hat{Q}(S',a')$. These targets then enter the loss function.
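
The target computation can be sketched with NumPy as follows. The function `sarsa_targets` and the numbers are illustrative assumptions. The `(1 - done)` mask, which drops the bootstrap term on terminal transitions, is a common convention that the text above does not state explicitly.

```python
import numpy as np


def sarsa_targets(rewards, dones, next_q_values, next_actions, gamma):
    """Compute r + gamma * Q_hat(S', a') for a batch of transitions."""
    batch_index = np.arange(len(next_actions))
    # Q_hat(S', a'): value of the action actually selected in the next state
    q_next = next_q_values[batch_index, next_actions]
    # assumption: no bootstrap term when the episode ended (done == 1)
    return rewards + gamma * (1.0 - dones) * q_next


rewards = np.array([1.0, 0.5])
dones = np.array([0.0, 1.0])
next_q_values = np.array([[0.2, 0.8],   # target-network outputs for S'
                          [0.4, 0.6]])
next_actions = np.array([1, 0])         # a' chosen by the epsilon-greedy policy
targets = sarsa_targets(rewards, dones, next_q_values, next_actions, gamma=0.9)
```

Unlike Q-learning, which would take the maximum over `next_q_values`, SARSA uses the value of the action $a'$ that the policy actually selected.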

The loss function is the mean squared error (MSE), calculated as $L = \frac{1}{|K|}\sum_{i=1}^{|K|}\left[\left(r_i+\gamma\,\hat{Q}(S'_i,a'_i)\right)-Q(S_i,a_i)\right]^2$, where $|K|$ is the number of experiences in the sampled batch. The Deep SARSA algorithm minimises this loss by adjusting the weights of the Q-network.
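
For concreteness, the MSE over a batch can be sketched in NumPy as below. The helper `mse_loss` and the values are illustrative only; in the agent itself the loss is computed and minimised inside the network's training step.

```python
import numpy as np


def mse_loss(targets, q_values):
    """Mean over the batch of (target - Q(S, a))^2."""
    return float(np.mean((targets - q_values) ** 2))


targets = np.array([1.72, 0.5])   # hypothetical r + gamma * Q_hat(S', a')
q_values = np.array([1.5, 1.0])   # hypothetical Q(S, a) from the Q-network
loss = mse_loss(targets, q_values)  # (0.22**2 + 0.5**2) / 2
```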

The structure of this notebook is as follows:
1. Importing necessary libraries
2. Preprocess environment
3. Process replay memory
4. Create Deep SARSA Agent
5. Evaluate Agent

## Importing Necessary Libraries

In [2]:
import os
import warnings
import numpy as np
import copy
import argparse
from tqdm import tqdm
import json
from abc import ABC, abstractmethod
from collections.abc import Iterable

import grid2op
from grid2op.Exceptions import Grid2OpException
from grid2op.Agent import AgentWithConverter
from grid2op.Converter import IdToAct
from grid2op.MakeEnv import make
from grid2op.Runner import Runner

from l2rpn_baselines.utils.replayBuffer import ReplayBuffer
from l2rpn_baselines.utils.trainingParam import TrainingParam
from l2rpn_baselines.utils.save_log_gif import save_log_gif
from grid2op.Reward import L2RPNReward
from l2rpn_baselines.utils.waring_msgs import _WARN_GPU_MEMORY

In [3]:
try:
    from grid2op.Chronics import MultifolderWithCache
    _CACHE_AVAILABLE_DEEPQAGENT = True
except ImportError:
    _CACHE_AVAILABLE_DEEPQAGENT = False

try:
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=FutureWarning)
        import tensorflow as tf
        import tensorflow.keras.optimizers as tfko
        from tensorflow.keras.models import Sequential, Model
        from tensorflow.keras.layers import Activation, Dense
        from tensorflow.keras.layers import Input
    _CAN_USE_TENSORFLOW = True
except ImportError:
    _CAN_USE_TENSORFLOW = False

## Setting up defaults

In [4]:
DEFAULT_LOGS_DIR = "./logs-eval/dsarsa_baseline"
DEFAULT_NB_EPISODE = 1
DEFAULT_NB_PROCESS = 1
DEFAULT_MAX_STEPS = -1
DEFAULT_NAME = "Deep_SARSA"

## Base class for the Deep SARSA agent
This class trains a Deep SARSA agent and logs the training process.

It derives from :class:`grid2op.Agent.AgentWithConverter` and as such implements :func:`DeepSARSAAgent.convert_obs` and :func:`DeepSARSAAgent.my_act`.

It is meant to be a baseline, so it also implements:

- :func:`DeepSARSAAgent.load`: to load the agent
- :func:`DeepSARSAAgent.save`: to save the agent
- :func:`DeepSARSAAgent.train`: to train the agent

#### Attributes

filter_action_fun: ``callable``
The function used to filter the action of the action space. See the documentation of grid2op:
:class:`grid2op.Converter.IdToAct`
`here <https://grid2op.readthedocs.io/en/v0.9.3/converter.html#grid2op.Converter.IdToAct>`_ for more information.

replay_buffer: The experience replay buffer

deep_sarsa: :class:`BaseDeepSARSA` 
        The neural network, represented as a :class:`BaseDeepSARSA` object.

name: ``str``
        The name of the Agent

store_action: ``bool``
        Whether to record which actions your agent took. Storing the actions can slightly slow down
        the computation (by less than 1%) but can help understand what your agent is doing during its learning process.

dict_action: ``dict``
        The actions taken by the agent, stored as a dictionary. This can be useful to know which types of actions
        are taken by your agent. Only filled if :attr:`DeepSARSAAgent.store_action` is ``True``.

istraining: ``bool``
        Whether you are training this agent or not. No longer really used; kept mainly for backward compatibility.

epsilon: ``float``
        The epsilon greedy exploration parameter.

nb_injection: ``int``
        Number of actions tagged as "injection". See the
        `official grid2op documentation <https://grid2op.readthedocs.io/en/v0.9.3/action.html?highlight=get_types#grid2op.Action.BaseAction.get_types>`_
        for more information.

nb_voltage: ``int``
        Number of actions tagged as "voltage". See the
        `official grid2op documentation <https://grid2op.readthedocs.io/en/v0.9.3/action.html?highlight=get_types#grid2op.Action.BaseAction.get_types>`_
        for more information.

nb_topology: ``int``
        Number of actions tagged as "topology". See the
        `official grid2op documentation <https://grid2op.readthedocs.io/en/v0.9.3/action.html?highlight=get_types#grid2op.Action.BaseAction.get_types>`_
        for more information.

nb_redispatching: ``int``
        Number of actions tagged as "redispatching". See the
        `official grid2op documentation <https://grid2op.readthedocs.io/en/v0.9.3/action.html?highlight=get_types#grid2op.Action.BaseAction.get_types>`_
        for more information.

nb_storage: ``int``
        Number of actions tagged as "storage". See the
        `official grid2op documentation <https://grid2op.readthedocs.io/en/v0.9.3/action.html?highlight=get_types#grid2op.Action.BaseAction.get_types>`_
        for more information.
        
nb_curtail: ``int``
        Number of actions tagged as "curtailment". See the
        `official grid2op documentation <https://grid2op.readthedocs.io/en/v0.9.3/action.html?highlight=get_types#grid2op.Action.BaseAction.get_types>`_
        for more information.

nb_do_nothing: ``int``
        Number of actions tagged as "do_nothing", *i.e.* when an action does not modify the state of the grid. See the
        `official grid2op documentation <https://grid2op.readthedocs.io/en/v0.9.3/action.html?highlight=get_types#grid2op.Action.BaseAction.get_types>`_
        for more information.

verbose: ``bool``
        Controls the amount of logging (outside of tensorboard) during training. For now: verbose=True will
        allow some printing on the command prompt, and verbose=False will drastically reduce the amount of information
        printed during training.

In [5]:
class DeepSARSAAgent(AgentWithConverter):
    def __init__(self,
                 action_space,
                 nn_archi,
                 name="DeepSARSAAgent",
                 store_action=True,
                 istraining=False,
                 filter_action_fun=None,
                 verbose=False,
                 observation_space=None,
                 **kwargs_converters):
        if not _CAN_USE_TENSORFLOW:
            raise RuntimeError("Cannot import tensorflow, this function cannot be used.")
        
        AgentWithConverter.__init__(self, action_space, action_space_converter=IdToAct, **kwargs_converters)
        self.filter_action_fun = filter_action_fun
        if self.filter_action_fun is not None:
            self.action_space.filter_action(self.filter_action_fun)

        # and now back to the origin implementation
        self.replay_buffer = None
        self.__nb_env = None

        self.deep_sarsa = None
        self._training_param = None
        self._tf_writer = None
        self.name = name
        self._losses = None
        self.__graph_saved = False
        self.store_action = store_action
        self.dict_action = {}
        self.istraining = istraining
        self.epsilon = 1.0

        # for tensorboard
        self._train_lr = None

        self._reset_num = None

        self._max_iter_env_ = 1000000
        self._curr_iter_env = 0
        self._max_reward = 0.

        # action type
        self.nb_injection = 0
        self.nb_voltage = 0
        self.nb_topology = 0
        self.nb_line = 0
        self.nb_redispatching = 0
        self.nb_curtail = 0
        self.nb_storage = 0
        self.nb_do_nothing = 0

        # for over sampling the hard scenarios
        self._prev_obs_num = 0
        self._time_step_lived = None
        self._nb_chosen = None
        self._proba = None
        self._prev_id = 0
        # this is for the "limit the episode length" depending on your previous success
        self._total_sucesses = 0

        # neural network architecture
        self._nn_archi = nn_archi

        # observation transformers
        self._obs_as_vect = None
        self._tmp_obs = None
        self._indx_obs = None
        self.verbose = verbose
        if observation_space is not None:
            self.init_obs_extraction(observation_space)

        # for the frequency of action type
        self.current_ = 0
        self.nb_ = 10
        self._nb_this_time = np.zeros((self.nb_, 8), dtype=int)

        # buffers for action statistics, allocated in `_fill_vectors`
        self._vector_size = None
        self._actions_per_ksteps = None
        self._illegal_actions_per_ksteps = None
        self._ambiguous_actions_per_ksteps = None

    def _fill_vectors(self, training_param):
        # `np.int` was removed from recent NumPy releases; the builtin `int` is equivalent here
        self._vector_size = self.nb_ * training_param.update_tensorboard_freq
        self._actions_per_ksteps = np.zeros((self._vector_size, self.action_space.size()), dtype=int)
        self._illegal_actions_per_ksteps = np.zeros(self._vector_size, dtype=int)
        self._ambiguous_actions_per_ksteps = np.zeros(self._vector_size, dtype=int)

    # grid2op.Agent interface
    def convert_obs(self, observation):
        """
        Generic way to convert an observation. This transforms it into a vector and then selects the attributes that were
        selected in :attr:`l2rpn_baselines.utils.NNParams.list_attr_obs` (that have been extracted once and for all
        in the :attr:`DeepSARSAAgent._indx_obs` vector).

        Parameters
        ----------
        observation: :class:`grid2op.Observation.BaseObservation`
            The current observation sent by the environment

        Returns
        -------
        _tmp_obs: ``numpy.ndarray``
            The observation as vector with only the proper attribute selected (TODO scaling will be available
            in future version)

        """
        obs_as_vect = observation.to_vect()
        self._tmp_obs[:] = obs_as_vect[self._indx_obs]
        return self._tmp_obs

    def my_act(self, transformed_observation, reward, done=False):
        """
        This function will return the action (its id) selected by the underlying :attr:`DeepSARSAAgent.deep_sarsa` network.

        Before being used, this method requires that the :attr:`DeepSARSAAgent.deep_sarsa` is created. To that end, a call
        to :func:`DeepSARSAAgent.init_deep_sarsa` needs to have been performed (this is automatically done if you use
        the baselines we provide and their `evaluate` and `train` scripts).

        Parameters
        ----------
        transformed_observation: ``numpy.ndarray``
            The observation, as transformed after :func:`DeepSARSAAgent.convert_obs`

        reward: ``float``
            The reward of the last time step. Ignored by this method. Present for backward compatibility with the
            OpenAI gym interface.

        done: ``bool``
            Whether the episode is over or not. This is not used, and is only present to comply with the
            OpenAI gym interface.

        Returns
        -------
        res: ``int``
            The id of the action taken.

        """
        predict_movement_int, *_ = self.deep_sarsa.predict_movement(transformed_observation,
                                                                epsilon=0.0,
                                                                training=False)
        res = int(predict_movement_int)
        self._store_action_played(res)
        return res

    @staticmethod
    def get_action_size(action_space, filter_fun, kwargs_converters):
        """
        This function returns the size of the action space that a :class:`DeepSARSAAgent` built
        with these parameters would have.

        Parameters
        ----------
        action_space: :class:`grid2op.ActionSpace`
            The grid2op action space used.

        filter_fun: ``callable``
            see :attr:`DeepSARSAAgent.filter_action_fun` for more information

        kwargs_converters: ``dict``
            see the documentation of grid2op for more information:
            `here <https://grid2op.readthedocs.io/en/v0.9.3/converter.html?highlight=idToAct#grid2op.Converter.IdToAct.init_converter>`_

        """
        converter = IdToAct(action_space)
        converter.init_converter(**kwargs_converters)
        if filter_fun is not None:
            converter.filter_action(filter_fun)
        return converter.n

    def init_obs_extraction(self, observation_space):
        """
        This method should be called to initialize the observation extraction (the observation is fed as a vector
        to the neural network) from its description as a list of attribute names.
        """
        tmp = np.zeros(0, dtype=np.uint)  # TODO platform independent
        for obs_attr_name in self._nn_archi.get_obs_attr():
            beg_, end_, dtype_ = observation_space.get_indx_extract(obs_attr_name)
            tmp = np.concatenate((tmp, np.arange(beg_, end_, dtype=np.uint)))
        self._indx_obs = tmp
        self._tmp_obs = np.zeros((1, tmp.shape[0]), dtype=np.float32)

    # baseline interface
    def load(self, path):
        """
        Part of the l2rpn_baselines interface, this function reads back a trained model, for example to continue its
        training or to evaluate its performance.

        **NB** To reload an agent, it must have exactly the same name and have been saved at the right location.

        Parameters
        ----------
        path: ``str``
            The path where the agent has previously been saved.

        """
        # not modified compared to the original implementation
        tmp_me = os.path.join(path, self.name)
        if not os.path.exists(tmp_me):
            raise RuntimeError("The model should be stored in \"{}\". But this appears to be empty".format(tmp_me))
        self._load_action_space(tmp_me)

        # TODO handle case where training param class has been overridden
        self._training_param = TrainingParam.from_json(os.path.join(tmp_me, "training_params.json"))
        self.deep_sarsa = self._nn_archi.make_nn(self._training_param)
        try:
            self.deep_sarsa.load_network(tmp_me, name=self.name)
        except Exception as e:
            raise RuntimeError("Impossible to load the model located at \"{}\" with error \n{}".format(path, e))

        for nm_attr in ["_time_step_lived", "_nb_chosen", "_proba"]:
            conv_path = os.path.join(tmp_me, "{}.npy".format(nm_attr))
            if os.path.exists(conv_path):
                setattr(self, nm_attr, np.load(file=conv_path))

    def save(self, path):
        """
        Part of the l2rpn_baselines interface, this method saves a model. Its name is used at saving time. The
        same name must be reused when loading it back.

        Parameters
        ----------
        path: ``str``
            The path where to save the agent.

        """
        if path is not None:
            tmp_me = os.path.join(path, self.name)
            if not os.path.exists(tmp_me):
                os.mkdir(tmp_me)
            nm_conv = "action_space.npy"
            conv_path = os.path.join(tmp_me, nm_conv)
            if not os.path.exists(conv_path):
                self.action_space.save(path=tmp_me, name=nm_conv)

            self._training_param.save_as_json(tmp_me, name="training_params.json")
            self._nn_archi.save_as_json(tmp_me, "nn_architecture.json")
            self.deep_sarsa.save_network(tmp_me, name=self.name)

            # TODO save the "oversampling" part, and all the other info
            for nm_attr in ["_time_step_lived", "_nb_chosen", "_proba"]:
                conv_path = os.path.join(tmp_me, "{}.npy".format(nm_attr))
                attr_ = getattr(self, nm_attr)
                if attr_ is not None:
                    np.save(arr=attr_, file=conv_path)

    def train(self,
              env,
              iterations,
              save_path,
              logdir,
              training_param=None):
        """
        This function trains the baseline.

        If `save_path` is not None, the model is saved regularly, and also at the end of training.

        Parameters
        ----------
        env: :class:`grid2op.Environment.Environment` or :class:`grid2op.Environment.MultiEnvironment`
            The environment used to train your model.

        iterations: ``int``
            The number of training iterations. NB when reloading a model, this is **NOT** the number of training steps
            that will be used when retraining. Indeed, if `iterations` is 1000 and the model was already trained for 750 time
            steps, then when reloaded, the training will occur on 250 (=1000 - 750) time steps only.

        save_path: ``str``
            Location at which to save the model

        logdir: ``str``
            Location at which tensorboard related information will be kept.

        training_param: :class:`l2rpn_baselines.utils.TrainingParam`
            The meta parameters for the training procedure. This is currently ignored if the model is reloaded (in that
            case the parameters used when first created will be used)

        """

        if training_param is None:
            training_param = TrainingParam()

        self._train_lr = training_param.lr

        if self._training_param is None:
            self._training_param = training_param
        else:
            training_param = self._training_param
        self._init_deep_sarsa(self._training_param, env)
        self._fill_vectors(self._training_param)

        self._init_replay_buffer()

        # efficient reading of the data (read them in chunks of roughly 1 day)
        nb_ts_one_day = 24 * 60 / 5  # number of time steps per day
        self._set_chunk(env, nb_ts_one_day)

        # Create file system related vars
        if save_path is not None:
            save_path = os.path.abspath(save_path)
            os.makedirs(save_path, exist_ok=True)

        if logdir is not None:
            logpath = os.path.join(logdir, self.name)
            self._tf_writer = tf.summary.create_file_writer(logpath, name=self.name)
        else:
            logpath = None
            self._tf_writer = None
        UPDATE_FREQ = training_param.update_tensorboard_freq  # update tensorboard every "UPDATE_FREQ" steps
        SAVING_NUM = training_param.save_model_each

        if hasattr(env, "nb_env"):
            nb_env = env.nb_env
            warnings.warn("Training using {} environments".format(nb_env))
            self.__nb_env = nb_env
        else:
            self.__nb_env = 1

        self.init_obs_extraction(env.observation_space)

        training_step = self._training_param.last_step
        self.epsilon = self._training_param.initial_epsilon

        # now the number of alive frames and total reward depends on the "underlying environment". It is vector instead
        # of scalar
        alive_frame, total_reward = self._init_global_train_loop()
        reward, done = self._init_local_train_loop()
        epoch_num = 0
        self._losses = np.zeros(iterations)
        alive_frames = np.zeros(iterations)
        total_rewards = np.zeros(iterations)
        new_state = None
        self._reset_num = 0
        self._curr_iter_env = 0
        self._max_reward = env.reward_range[1]

        # action types
        # injection, voltage, topology, line, redispatching = action.get_types()
        self.nb_injection = 0
        self.nb_voltage = 0
        self.nb_topology = 0
        self.nb_line = 0
        self.nb_redispatching = 0
        self.nb_curtail = 0
        self.nb_storage = 0
        self.nb_do_nothing = 0

        # for non uniform random sampling of the scenarios
        th_size = None
        self._prev_obs_num = 0
        if self.__nb_env == 1:
            if _CACHE_AVAILABLE_DEEPQAGENT:
                if isinstance(env.chronics_handler.real_data, MultifolderWithCache):
                    th_size = env.chronics_handler.real_data.cache_size
            if th_size is None:
                th_size = len(env.chronics_handler.real_data.subpaths)

            # number of time steps lived per scenario
            if self._time_step_lived is None or self._time_step_lived.shape[0] != th_size:
                self._time_step_lived = np.zeros(th_size, dtype=np.uint64)
            # number of times a given scenario has been played
            if self._nb_chosen is None or self._nb_chosen.shape[0] != th_size:
                self._nb_chosen = np.zeros(th_size, dtype=np.uint)
            # sampling probability of each scenario (uniform at start)
            if self._proba is None or self._proba.shape[0] != th_size:
                self._proba = np.ones(th_size, dtype=np.float64)

        self._prev_id = 0
        # this is for the "limit the episode length" depending on your previous success
        self._total_sucesses = 0

        with tqdm(total=iterations - training_step, disable=not self.verbose) as pbar:
            while training_step < iterations:
                # reset or build the environment
                initial_state = self._need_reset(env, training_step, epoch_num, done, new_state)

                # Slowly decay the exploration parameter epsilon
                # if self.epsilon > training_param.FINAL_EPSILON:
                self.epsilon = self._training_param.get_next_epsilon(current_step=training_step)

                # then we need to predict the next moves. Agents have been adapted to predict a batch of data
                pm_i, pq_v, act = self._next_move(initial_state, self.epsilon, training_step)
                EPS = self.epsilon

                # todo store the illegal / ambiguous / ... actions
                reward, done = self._init_local_train_loop()
                if self.__nb_env == 1:
                    act = act[0]

                temp_observation_obj, temp_reward, temp_done, info = env.step(act)
                if self.__nb_env == 1:
                    temp_observation_obj = [temp_observation_obj]
                    temp_reward = np.array([temp_reward], dtype=np.float32)
                    temp_done = np.array([temp_done], dtype=bool)  # `np.bool` was removed from recent NumPy
                    info = [info]

                new_state = self._convert_obs_train(temp_observation_obj)
                self._updage_illegal_ambiguous(training_step, info)
                done, reward, total_reward, alive_frame, epoch_num \
                    = self._update_loop(done, temp_reward, temp_done, alive_frame, total_reward, reward, epoch_num)

                # update the replay buffer
                self._store_new_state(initial_state, pm_i, reward, done, new_state)
                
                # now train the model
                if not self._train_model(training_step):
                    # infinite loss in this case
                    raise RuntimeError("ERROR INFINITE LOSS")

                # Save the network every SAVING_NUM iterations, and at the last iteration
                if training_step % SAVING_NUM == 0 or training_step == iterations - 1:
                    self.save(save_path)

                # save some information to tensorboard
                alive_frames[epoch_num] = np.mean(alive_frame)
                total_rewards[epoch_num] = np.mean(total_reward)
                self._store_action_played_train(training_step, pm_i)
                self._save_tensorboard(training_step, epoch_num, UPDATE_FREQ, total_rewards, alive_frames)
                training_step += 1
                pbar.update(1)

        self.save(save_path)

    # auxiliary functions
    # the functions below are used to train with multiple environments
    def _convert_obs_train(self, observations):
        """ create the observations that are used for training."""
        if self._obs_as_vect is None:
            size_obs = self.convert_obs(observations[0]).shape[1]
            self._obs_as_vect = np.zeros((self.__nb_env, size_obs), dtype=np.float32)

        for i, obs in enumerate(observations):
            self._obs_as_vect[i, :] = self.convert_obs(obs).reshape(-1)
        return self._obs_as_vect

    def _create_action_if_not_registered(self, action_int):
        """make sure that `action_int` is present in dict_action"""
        if action_int not in self.dict_action:
            act = self.action_space.all_actions[action_int]
            is_inj, is_volt, is_topo, is_line_status, is_redisp, is_storage, is_dn, is_curtail = \
                False, False, False, False, False, False, False, False
            try:
                # feature unavailable in grid2op <= 0.9.2
                try:
                    # storage was introduced in grid2op 1.5.0, so earlier versions return only 5 types
                    is_inj, is_volt, is_topo, is_line_status, is_redisp = act.get_types()
                except ValueError as exc_:
                    try:
                        is_inj, is_volt, is_topo, is_line_status, is_redisp, is_storage = act.get_types()
                    except ValueError as exc_:
                        is_inj, is_volt, is_topo, is_line_status, is_redisp, is_storage, is_curtail = act.get_types()

                is_dn = (not is_inj) and (not is_volt) and (not is_topo) and (not is_line_status) and (not is_redisp)
                is_dn = is_dn and (not is_storage)
                is_dn = is_dn and (not is_curtail)
            except Exception as exc_:
                pass

            self.dict_action[action_int] = [0, act,
                                            (is_inj, is_volt, is_topo, is_line_status, is_redisp, is_storage, is_curtail, is_dn)]

    def _store_action_played(self, action_int):
        """if activated, this function will store the action taken by the agent."""
        if self.store_action:
            self._create_action_if_not_registered(action_int)

            self.dict_action[action_int][0] += 1
            (is_inj, is_volt, is_topo, is_line_status, is_redisp, is_storage, is_curtail, is_dn) = self.dict_action[action_int][2]
            if is_inj:
                self.nb_injection += 1
            if is_volt:
                self.nb_voltage += 1
            if is_topo:
                self.nb_topology += 1
            if is_line_status:
                self.nb_line += 1
            if is_redisp:
                self.nb_redispatching += 1
            if is_storage:
                self.nb_storage += 1
                self.nb_redispatching += 1
            if is_curtail:
                self.nb_curtail += 1
            if is_dn:
                self.nb_do_nothing += 1

    def _convert_all_act(self, act_as_integer):
        """this function converts the actions given as a list of integers. It outputs a list of valid grid2op Actions"""
        res = []
        for act_id in act_as_integer:
            res.append(self.convert_act(act_id))
            self._store_action_played(act_id)
        return res

    def _load_action_space(self, path):
        """ load the action space in case the model is reloaded"""
        if not os.path.exists(path):
            raise RuntimeError("The model should be stored in \"{}\". But this appears to be empty".format(path))
        try:
            self.action_space.init_converter(
                all_actions=os.path.join(path, "action_space.npy"))
        except Exception as e:
            raise RuntimeError("Impossible to reload converter action space with error \n{}".format(e))

    # utilities for data reading
    def _set_chunk(self, env, nb):
        """
        to optimize the data reading process. See the official grid2op documentation for the effect of setting
        the chunk size for the environment.
        """
        env.set_chunk_size(int(max(100, nb)))

    def _train_model(self, training_step):
        """train the deep sarsa networks."""
        self._training_param.tell_step(training_step)
        if training_step > max(self._training_param.min_observation, self._training_param.minibatch_size) and \
            self._training_param.do_train():

            # train the model
            s_batch, a_batch, r_batch, d_batch, s2_batch = self.replay_buffer.sample(self._training_param.minibatch_size)
            tf_writer = None
            if self.__graph_saved is False:
                tf_writer = self._tf_writer
            
            loss = self.deep_sarsa.train(s_batch, a_batch, r_batch, d_batch, s2_batch, self.epsilon, 
                                     tf_writer)
            # save learning rate for later
            self._train_lr = self.deep_sarsa._optimizer_model._decayed_lr('float32').numpy()
            self.__graph_saved = True
            if not np.all(np.isfinite(loss)):
                # if the loss is not finite i stop the learning
                return False
            self.deep_sarsa.target_train()
            self._losses[training_step:] = np.sum(loss)
        return True

    def _updage_illegal_ambiguous(self, curr_step, info):
        """update the count of illegal and ambiguous actions"""
        tmp_ = curr_step % self._vector_size
        self._illegal_actions_per_ksteps[tmp_] = np.sum([el["is_illegal"] for el in info])
        self._ambiguous_actions_per_ksteps[tmp_] = np.sum([el["is_ambiguous"] for el in info])

    def _store_action_played_train(self, training_step, action_id):
        """store which actions were played, for tensorboard only."""
        which_row = training_step % self._vector_size
        self._actions_per_ksteps[which_row, :] = 0
        self._actions_per_ksteps[which_row, action_id] += 1

    def _fast_forward_env(self, env, time=7*24*60/5):
        """use this function to skip some time steps when the environment is reset."""
        # `time` may be a float (e.g. 7*24*60/5), so cast the upper bound to int
        my_int = np.random.randint(0, int(min(time, env.chronics_handler.max_timestep())))
        env.fast_forward_chronics(my_int)

    def _reset_env_clean_state(self, env):
        """
        reset this environment to a proper state. This should rather be integrated in grid2op. And will probably
        be integrated partially starting from grid2op 1.0.0
        """
        # /!\ DO NOT ATTEMPT TO MODIFY OTHERWISE IT WILL PROBABLY CRASH /!\
        # /!\ THIS WILL BE PART OF THE ENVIRONMENT IN FUTURE GRID2OP RELEASE (>= 1.0.0) /!\
        # AND OF COURSE USING THIS METHOD DURING THE EVALUATION IS COMPLETELY FORBIDDEN
        if self.__nb_env > 1:
            return
        env.current_obs = None
        env.env_modification = None
        env._reset_maintenance()
        env._reset_redispatching()
        env._reset_vectors_and_timings()
        _backend_action = env._backend_action_class()
        _backend_action.all_changed()
        env._backend_action = _backend_action
        env.backend.apply_action(_backend_action)
        _backend_action.reset()
        *_, fail_to_start, info = env.step(env.action_space())
        if fail_to_start:
            # this is happening because not enough care has been taken to handle these problems
            # more care will be taken when this feature will be available in grid2op directly.
            raise Grid2OpException("Impossible to initialize the powergrid, the powerflow diverges at iteration 0. "
                                   "Available information: {}".format(info))
        env._reset_vectors_and_timings()

    def _need_reset(self, env, observation_num, epoch_num, done, new_state):
        """perform the proper reset of the environment"""
        if self._training_param.step_increase_nb_iter is not None and \
           self._training_param.step_increase_nb_iter > 0:
            self._max_iter_env(min(max(self._training_param.min_iter,
                                       self._training_param.max_iter_fun(self._total_sucesses)),
                                   self._training_param.max_iter))  # TODO
        self._curr_iter_env += 1
        if new_state is None:
            # it's the first ever loop
            obs = env.reset()
            if self.__nb_env == 1:
                obs = [obs]
            new_state = self._convert_obs_train(obs)
        elif self.__nb_env > 1:
            pass
        elif done[0]:
            nb_ts_one_day = 24*60/5
            if False:
                # the 3-4 lines below allow to reuse the loaded dataset and continue further up
                try:
                    self._reset_env_clean_state(env)
                    # random fast forward between now and next day
                    self._fast_forward_env(env, time=nb_ts_one_day)
                except (StopIteration, Grid2OpException):
                    env.reset()
                    # random fast forward between now and next week
                    self._fast_forward_env(env, time=7*nb_ts_one_day)

            # update the number of time steps it has lived
            ts_lived = observation_num - self._prev_obs_num
            if self._time_step_lived is not None:
                self._time_step_lived[self._prev_id] += ts_lived
            self._prev_obs_num = observation_num
            if self._training_param.oversampling_rate is not None:
                # proba = np.sqrt(1. / (self._time_step_lived +1))
                # # over sampling some kind of "UCB like" stuff
                # # https://banditalgs.com/2016/09/18/the-upper-confidence-bound-algorithm/

                # proba = 1. / (self._time_step_lived + 1)
                self._proba[:] = 1. / (self._time_step_lived ** self._training_param.oversampling_rate + 1)
                self._proba /= np.sum(self._proba)

            _prev_id = self._prev_id
            self._prev_id = None
            if _CACHE_AVAILABLE_DEEPQAGENT:
                if isinstance(env.chronics_handler.real_data, MultifolderWithCache):
                    self._prev_id = env.chronics_handler.real_data.sample_next_chronics(self._proba)
            if self._prev_id is None:
                self._prev_id = _prev_id + 1
                self._prev_id %= self._time_step_lived.shape[0]

            obs = self._reset_env(env, epoch_num)
            if self._training_param.sample_one_random_action_begin is not None and \
                    observation_num < self._training_param.sample_one_random_action_begin:
                done = True
                while done:
                    act = env.action_space(env.action_space._sample_set_bus())
                    obs, reward, done, info = env.step(act)
                    if info["is_illegal"] or info["is_ambiguous"]:
                        # there is no guarantee that a sampled action is legal or
                        # unambiguous.
                        # if it is not, i "simply" restart the process, as if the action
                        # broke everything
                        done = True

                    if done:
                        obs = self._reset_env(env, epoch_num)
                    else:
                        if self.verbose:
                            print("step {}: {}".format(observation_num, act))

                obs = [obs]
            new_state = self._convert_obs_train(obs)
        return new_state

    def _reset_env(self, env, epoch_num):
        env.reset()
        if self._nb_chosen is not None:
            self._nb_chosen[self._prev_id] += 1

        # random fast forward between now and next week
        if self._training_param.random_sample_datetime_start is not None:
            self._fast_forward_env(env, time=self._training_param.random_sample_datetime_start)

        self._curr_iter_env = 0
        obs = [env.current_obs]
        if epoch_num % len(env.chronics_handler.real_data.subpaths) == 0:
            # re-shuffle the data
            env.chronics_handler.shuffle(lambda x: x[np.random.choice(len(x), size=len(x), replace=False)])
        return obs

    def _init_replay_buffer(self):
        """create and initialize the replay buffer"""
        self.replay_buffer = ReplayBuffer(self._training_param.buffer_size)

    def _store_new_state(self, initial_state, predict_movement_int, reward, done, new_state):
        """store the new state in the replay buffer"""
        # vectorized version of the previous code
        for i_s, pm_i, reward, done, ns in zip(initial_state, predict_movement_int, reward, done, new_state):
            self.replay_buffer.add(i_s,
                                   pm_i,
                                   reward,
                                   done,
                                   ns)

    def _max_iter_env(self, new_max_iter):
        """update the number of maximum iteration allowed."""
        self._max_iter_env_ = new_max_iter

    def _next_move(self, curr_state, epsilon, training_step):
        # assumes that action 0 encodes "do nothing", otherwise it will NOT work (for the observer)
        pm_i, pq_v, q_actions = self.deep_sarsa.predict_movement(curr_state, epsilon, training=True)
        pm_i, pq_v = self._short_circuit_actions(training_step, pm_i, pq_v, q_actions)
        act = self._convert_all_act(pm_i)
        return pm_i, pq_v, act

    def _short_circuit_actions(self, training_step, pm_i, pq_v, q_actions):
        if self._training_param.min_observe is not None and \
                training_step < self._training_param.min_observe:
            # action is replaced by do nothing due to the "observe only" specification
            pm_i[:] = 0
            pq_v[:] = q_actions[:, 0]
        return pm_i, pq_v

    def _init_global_train_loop(self):
        alive_frame = np.zeros(self.__nb_env, dtype=int)  # "np.int" was removed from recent NumPy releases
        total_reward = np.zeros(self.__nb_env, dtype=np.float32)
        return alive_frame, total_reward

    def _update_loop(self, done, temp_reward, temp_done, alive_frame, total_reward, reward, epoch_num):
        if self.__nb_env == 1:
            # force end of episode at early stage of learning
            if self._curr_iter_env >= self._max_iter_env_:
                temp_done[0] = True
                temp_reward[0] = self._max_reward
                self._total_sucesses += 1

        done = temp_done
        alive_frame[done] = 0
        total_reward[done] = 0.
        self._reset_num += np.sum(done)
        if self._reset_num >= self.__nb_env:
            # increase the "global epoch num" represented by "epoch_num" only when on average
            # all environments are "done"
            epoch_num += 1
            self._reset_num = 0

        total_reward[~done] += temp_reward[~done]
        alive_frame[~done] += 1
        return done, temp_reward, total_reward, alive_frame, epoch_num

    def _init_local_train_loop(self):
        # reward, done = np.zeros(self.nb_process), np.full(self.nb_process, fill_value=False, dtype=np.bool)
        reward = np.zeros(self.__nb_env, dtype=np.float32)
        done = np.full(self.__nb_env, fill_value=False, dtype=bool)  # "np.bool" was removed from recent NumPy releases
        return reward, done

    def _init_deep_sarsa(self, training_param, env):
        """
        This function initializes the neural network.
        """
        if self.deep_sarsa is None:
            self.deep_sarsa = self._nn_archi.make_nn(training_param)
        self.init_obs_extraction(env.observation_space)

    def _save_tensorboard(self, step, epoch_num, UPDATE_FREQ, epoch_rewards, epoch_alive):
        """save all the information needed in tensorboard."""
        if self._tf_writer is None:
            return

        # Log some useful metrics every UPDATE_FREQ steps
        if step % UPDATE_FREQ == 0 and epoch_num > 0:
            if step % (10 * UPDATE_FREQ) == 0:
                # print the top k "hardest" scenarios (i.e. those chosen the most often)
                if self.verbose:
                    top_k = 10
                    if self._nb_chosen is not None:
                        array_ = np.argsort(self._nb_chosen)[-top_k:][::-1]
                        print("hardest scenarios\n{}".format(array_))
                        print("They have been chosen respectively\n{}".format(self._nb_chosen[array_]))
                        # print("Associated proba are\n{}".format(self._proba[array_]))
                        print("The number of timesteps played is\n{}".format(self._time_step_lived[array_]))
                        print("avg (across all scenarios) number of timesteps played {}"
                              "".format(np.mean(self._time_step_lived)))
                        print("Time alive: {}".format(self._time_step_lived[array_] / (self._nb_chosen[array_] + 1)))
                        print("Avg time alive: {}".format(np.mean(self._time_step_lived / (self._nb_chosen + 1 ))))

            with self._tf_writer.as_default():
                last_alive = epoch_alive[(epoch_num-1)]
                last_reward = epoch_rewards[(epoch_num-1)]

                mean_reward = np.nanmean(epoch_rewards[:epoch_num])
                mean_alive = np.nanmean(epoch_alive[:epoch_num])

                mean_reward_30 = mean_reward
                mean_alive_30 = mean_alive
                mean_reward_100 = mean_reward
                mean_alive_100 = mean_alive

                tmp = self._actions_per_ksteps > 0
                tmp = tmp.sum(axis=0)
                nb_action_taken_last_kstep = np.sum(tmp > 0)

                nb_illegal_act = np.sum(self._illegal_actions_per_ksteps)
                nb_ambiguous_act = np.sum(self._ambiguous_actions_per_ksteps)

                if epoch_num >= 100:
                    mean_reward_100 = np.nanmean(epoch_rewards[(epoch_num-100):epoch_num])
                    mean_alive_100 = np.nanmean(epoch_alive[(epoch_num-100):epoch_num])

                if epoch_num >= 30:
                    mean_reward_30 = np.nanmean(epoch_rewards[(epoch_num-30):epoch_num])
                    mean_alive_30 = np.nanmean(epoch_alive[(epoch_num-30):epoch_num])

                # to ensure "fair" comparison between single env and multi env
                step_tb = step  # * self.__nb_env
                # if multiply by the number of env we have "trouble" with random exploration at the beginning
                # because it lasts the same number of "real" steps

                # show first the mean reward and mean time alive (hence the upper case)
                tf.summary.scalar("Mean_alive_30", mean_alive_30, step_tb,
                                  description="Average number of steps (per episode) made over the last 30 "
                                              "completed episodes")
                tf.summary.scalar("Mean_reward_30", mean_reward_30, step_tb,
                                  description="Average (final) reward obtained over the last 30 completed episodes")

                # then it's alpha numerical order, hence the "z_" in front of some information
                tf.summary.scalar("loss", self._losses[step], step_tb,
                                  description="Training loss (for the last training batch)")

                tf.summary.scalar("last_alive", last_alive, step_tb,
                                  description="Final number of steps for the last complete episode")
                tf.summary.scalar("last_reward", last_reward, step_tb,
                                  description="Final reward over the last complete episode")

                tf.summary.scalar("mean_reward", mean_reward, step_tb,
                                  description="Average reward over the whole episodes played")
                tf.summary.scalar("mean_alive", mean_alive, step_tb,
                                  description="Average time alive over the whole episodes played")

                tf.summary.scalar("mean_reward_100", mean_reward_100, step_tb,
                                  description="Average (final) reward obtained over the last 100 completed episodes")
                tf.summary.scalar("mean_alive_100", mean_alive_100, step_tb,
                                  description="Average number of steps (per episode) made over the last 100 "
                                              "completed episodes")

                tf.summary.scalar("nb_different_action_taken", nb_action_taken_last_kstep, step_tb,
                                  description="Number of different actions played the last "
                                              "{} steps".format(self.nb_ * UPDATE_FREQ))
                tf.summary.scalar("nb_illegal_act", nb_illegal_act, step_tb,
                                  description="Number of illegal actions played the last "
                                              "{} steps".format(self.nb_ * UPDATE_FREQ))
                tf.summary.scalar("nb_ambiguous_act", nb_ambiguous_act, step_tb,
                                  description="Number of ambiguous actions played the last "
                                              "{} steps".format(self.nb_ * UPDATE_FREQ))
                tf.summary.scalar("nb_total_success", self._total_sucesses, step_tb,
                                  description="Number of times the episode was completed entirely "
                                              "(no game over)")

                tf.summary.scalar("z_lr", self._train_lr, step_tb,
                                  description="Current learning rate")
                tf.summary.scalar("z_epsilon", self.epsilon, step_tb,
                                  description="Current epsilon (from the epsilon greedy)")
                tf.summary.scalar("z_max_iter", self._max_iter_env_, step_tb,
                                  description="Maximum number of time steps before deciding a scenario "
                                              "is over (=win)")
                tf.summary.scalar("z_total_episode", epoch_num, step_tb,
                                  description="Total number of episode played (number of \"reset\")")

                self.deep_sarsa.save_tensorboard(step_tb)

                if self.store_action:
                    self._store_frequency_action_type(UPDATE_FREQ, step_tb)

                

    def _store_frequency_action_type(self, UPDATE_FREQ, step_tb):
        self.current_ += 1
        self.current_ %= self.nb_
        nb_inj, nb_volt, nb_topo, nb_line, nb_redisp, nb_storage, nb_curtail, nb_dn = self._nb_this_time[self.current_, :]
        self._nb_this_time[self.current_, :] = [self.nb_injection,
                                                self.nb_voltage,
                                                self.nb_topology,
                                                self.nb_line,
                                                self.nb_redispatching,
                                                self.nb_storage,
                                                self.nb_curtail,
                                                self.nb_do_nothing]

        curr_inj = self.nb_injection - nb_inj
        curr_volt = self.nb_voltage - nb_volt
        curr_topo = self.nb_topology - nb_topo
        curr_line = self.nb_line - nb_line
        curr_redisp = self.nb_redispatching - nb_redisp
        curr_storage = self.nb_storage - nb_storage
        curr_curtail = self.nb_curtail - nb_curtail
        curr_dn = self.nb_do_nothing - nb_dn

        # curtailment actions are counted as well, so that the frequencies below sum to 1
        total_act_num = curr_inj + curr_volt + curr_topo + curr_line + curr_redisp + curr_dn + curr_storage + curr_curtail
        tf.summary.scalar("zz_freq_inj",
                          curr_inj / total_act_num,
                          step_tb,
                          description="Frequency of \"injection\" actions "
                                      "type played over the last {} actions"
                                      "".format(self.nb_ * UPDATE_FREQ))
        tf.summary.scalar("zz_freq_voltage",
                          curr_volt / total_act_num,
                          step_tb,
                          description="Frequency of \"voltage\" actions "
                                      "type played over the last {} actions"
                                      "".format(self.nb_ * UPDATE_FREQ))
        tf.summary.scalar("z_freq_topo",
                          curr_topo / total_act_num,
                          step_tb,
                          description="Frequency of \"topo\" actions "
                                      "type played over the last {} actions"
                                      "".format(self.nb_ * UPDATE_FREQ))
        tf.summary.scalar("z_freq_line_status",
                          curr_line / total_act_num,
                          step_tb,
                          description="Frequency of \"line status\" actions "
                                      "type played over the last {} actions"
                                      "".format(self.nb_ * UPDATE_FREQ))
        tf.summary.scalar("z_freq_redisp",
                          curr_redisp / total_act_num,
                          step_tb,
                          description="Frequency of \"redispatching\" actions "
                                      "type played over the last {} actions"
                                      "".format(self.nb_ * UPDATE_FREQ))
        tf.summary.scalar("z_freq_do_nothing",
                          curr_dn / total_act_num,
                          step_tb,
                          description="Frequency of \"do nothing\" actions "
                                      "type played over the last {} actions"
                                      "".format(self.nb_ * UPDATE_FREQ))
        tf.summary.scalar("z_freq_storage",
                          curr_storage / total_act_num,
                          step_tb,
                          description="Frequency of \"storage\" actions "
                                      "type played over the last {} actions"
                                      "".format(self.nb_ * UPDATE_FREQ))
        tf.summary.scalar("z_freq_curtail",
                          curr_curtail / total_act_num,
                          step_tb,
                          description="Frequency of \"curtailment\" actions "
                                      "type played over the last {} actions"
                                      "".format(self.nb_ * UPDATE_FREQ))
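
The oversampling rule applied in `_need_reset` above (scenarios on which the agent survived for fewer steps are sampled more often) can be illustrated in isolation. This is only a minimal NumPy sketch: the function name `scenario_sampling_proba` and the example values are hypothetical, and only the formula itself is taken from the code above.

```python
import numpy as np


def scenario_sampling_proba(time_step_lived, oversampling_rate):
    """Return normalized sampling probabilities for each scenario.

    Scenarios with a small number of survived time steps get a larger weight.
    """
    lived = np.asarray(time_step_lived, dtype=float)
    # same formula as in `_need_reset`: 1 / (lived ** rate + 1)
    proba = 1.0 / (lived ** oversampling_rate + 1.0)
    # normalize so that the weights form a probability distribution
    return proba / proba.sum()


# hypothetical number of time steps survived on 4 scenarios
lived = np.array([0, 10, 100, 1000])
proba = scenario_sampling_proba(lived, oversampling_rate=0.5)
print(proba)
```

With a larger `oversampling_rate` the distribution concentrates more strongly on the scenarios where the agent failed early.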

## Abstract class to create neural networks

This class represents the parametrization of the Q value (or of more quantities, in the case of SAC) by a neural network.
        
It is composed of 2 different networks:

- model: which is the main model
- target_model: which has the same architecture and same initial weights as "model" but is updated less frequently to stabilize training
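
The slow update of the target model can be sketched with plain NumPy arrays standing in for the network weights. This is a simplified illustration of Polyak averaging, assuming each network is just a list of arrays; the helper name `soft_update` is hypothetical and not part of the notebook.

```python
import numpy as np


def soft_update(source_weights, target_weights, tau):
    """Move each target array a fraction `tau` of the way toward the source array."""
    return [tau * src + (1.0 - tau) * dst
            for src, dst in zip(source_weights, target_weights)]


# toy "weights": the main model holds ones, the target model holds zeros
model_w = [np.ones((2, 2)), np.ones(2)]
target_w = [np.zeros((2, 2)), np.zeros(2)]

target_w = soft_update(model_w, target_w, tau=0.1)  # target is now 0.1 everywhere
target_w = soft_update(model_w, target_w, tau=0.1)  # 0.1 * 1 + 0.9 * 0.1 = 0.19
```

A small `tau` makes the target model lag behind the main model, which is what keeps the regression targets stable during training.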

It has basic methods to make predictions, to train the model, and train the target model.

This class is abstract and must be subclassed before any object can be created. The only pure virtual function is :func:`BaseDeepSARSA.construct_q_network`, which creates the neural network from the nn_params (:class:`NNParam`) provided as input.

#### Attributes

_action_size: ``int``
        Total number of actions

_observation_size: ``int``
        Size of the observation space considered

_nn_archi: :class:`NNParam`
        The parameters of the neural networks that will be created

_training_param: :class:`TrainingParam`
        The meta parameters for the training scheme (used especially for learning rate or gradient clipping for example)

_lr: ``float``
        The initial learning rate

_lr_decay_steps: ``float``
        The decay step of the learning rate

_lr_decay_rate: ``float``
        The rate at which the learning rate will decay

_model:
        Main neural network model, here a keras Model object.

_target_model:
        a copy of the main neural network that will be updated less frequently (also known as "target model" in RL
        community)
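
To see how `_lr`, `_lr_decay_steps` and `_lr_decay_rate` interact, the sketch below evaluates an inverse time decay schedule by hand. It assumes the continuous (non-staircase) form documented for the Keras `InverseTimeDecay` schedule; the function name and the numeric values are hypothetical.

```python
def inverse_time_decay(initial_lr, decay_steps, decay_rate, step):
    """Learning rate after `step` updates under (non-staircase) inverse time decay."""
    return initial_lr / (1.0 + decay_rate * step / decay_steps)


# hypothetical values, for illustration only
lr_start = inverse_time_decay(1e-3, decay_steps=1000, decay_rate=0.5, step=0)
lr_later = inverse_time_decay(1e-3, decay_steps=1000, decay_rate=0.5, step=2000)
print(lr_start, lr_later)
```

With these values the learning rate is halved after 2000 steps, since `1 + 0.5 * 2000 / 1000 = 2`.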




In [6]:
class BaseDeepSARSA(ABC):
    
    def __init__(self,
                 nn_params,
                 training_param=None,
                 verbose=False):
        if not _CAN_USE_TENSORFLOW:
            raise RuntimeError("Cannot import tensorflow, this function cannot be used.")
        self._action_size = nn_params.action_size
        self._observation_size = nn_params.observation_size
        self._nn_archi = nn_params
        self.verbose = verbose

        if training_param is None:
            self._training_param = TrainingParam()
        else:
            self._training_param = training_param

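        # read from self._training_param so that the default TrainingParam() is used when None is given
        self._lr = self._training_param.lr
        self._lr_decay_steps = self._training_param.lr_decay_steps
        self._lr_decay_rate = self._training_param.lr_decay_rate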

        self._model = None
        self._target_model = None
        self._schedule_model = None
        self._optimizer_model = None
        self._custom_objects = None  # to be able to load other keras layers type

    def make_optimiser(self):
        """
        helper function to create the proper optimizer (Adam) with the learning rates and its decay
        parameters.
        """
        schedule = tfko.schedules.InverseTimeDecay(self._lr, self._lr_decay_steps, self._lr_decay_rate)
        return schedule, tfko.Adam(learning_rate=schedule)

    @abstractmethod
    def construct_q_network(self):
        """
         Abstract method that needs to be overridden.

         It should create :attr:`BaseDeepSARSA._model` and :attr:`BaseDeepSARSA._target_model`
        """
        raise NotImplementedError("Not implemented")

    def predict_movement(self, data, epsilon, batch_size=None, training=False):
        """
        Predict the action to play: with probability epsilon a random action is chosen, otherwise the
        action with the highest Q value is chosen.
        """
        if batch_size is None:
            batch_size = data.shape[0]

        q_actions = self._model(data, training=training).numpy()
        opt_policy = np.argmax(q_actions, axis=-1)
        if epsilon > 0.:
            rand_val = np.random.random(batch_size)
            opt_policy[rand_val < epsilon] = np.random.randint(0, self._action_size, size=(np.sum(rand_val < epsilon)))
        return opt_policy, q_actions[np.arange(batch_size), opt_policy], q_actions

    def train(self, s_batch, a_batch, r_batch, d_batch, s2_batch, epsilon_tr, tf_writer=None, batch_size=None):
        """
        Train the network on a batch of transitions.
        
        .. seealso::
            https://towardsdatascience.com/dueling-double-deep-q-learning-using-tensorflow-2-x-7bbbcec06a2a
            for the update rules
        
        Parameters
        ----------
        s_batch:
            the state vector (before the action is taken)
        a_batch:
            the action taken
        s2_batch:
            the state vector (after the action is taken)
        d_batch:
            says whether or not the episode was over
        r_batch:
            the reward obtained this step
        epsilon_tr:
            the epsilon used to draw the next action (epsilon-greedy) when building the target
        """
        if batch_size is None:
            batch_size = s_batch.shape[0]

        # Save the graph just the first time
        if tf_writer is not None:
            tf.summary.trace_on()
        target = self._model(s_batch, training=True).numpy()
        # SARSA: the next action a' is drawn with the epsilon-greedy policy
        next_a, fut_actions_3, fut_action_2 = self.predict_movement(s2_batch,epsilon=epsilon_tr,training=True)
        fut_action = self._model(s2_batch, training=True).numpy()
        

        if tf_writer is not None:
            with tf_writer.as_default():
                tf.summary.trace_export("model-graph", 0)
            tf.summary.trace_off()
        target_next = self._target_model(s2_batch, training=True).numpy()

        idx = np.arange(batch_size)
        target[idx, a_batch] = r_batch
        # update the value for not done episode
        nd_batch = ~d_batch  # update with this rule only batch that did not game over
        # greedy next action (Q-learning style); computed here but not used in the target below
        next_action = np.argmax(fut_action, axis=-1)
        fut_Q = target_next[idx, next_a]  # Q value of the epsilon-greedy next action (SARSA)
        fut_Q_new = target_next[idx, next_action]  # Q value of the greedy next action (unused)
        
        target[nd_batch, a_batch[nd_batch]] += self._training_param.discount_factor * fut_Q[nd_batch]
        loss = self.train_on_batch(self._model, self._optimizer_model, s_batch, target)
        return loss

    def train_on_batch(self, model, optimizer_model, x, y_true):
        """train the model on a batch of examples. This can be overridden."""
        loss = model.train_on_batch(x, y_true)
        return loss

    @staticmethod
    def get_path_model(path, name=None):
        """
        Get the location at which the neural networks will be saved.

        Returns
        -------
        path_model: ``str``
            The path at which the model will be saved (path include both path and name, it is the full path at which
            the neural networks are saved)

        path_target_model: ``str``
            The path at which the target model will be saved
        """

        if name is None:
            path_model = path
        else:
            path_model = os.path.join(path, name)
        path_target_model = "{}_target".format(path_model)
        return path_model, path_target_model

    def save_network(self, path, name=None, ext="h5"):
        """
        save the neural networks.

        Parameters
        ----------
        path: ``str``
            The path at which the models need to be saved
        name: ``str``
            The name given to this model

        ext: ``str``
            The file extension (by default h5)
        """
        # Saves model at specified path as h5 file
        # nothing has changed
        path_model, path_target_model = self.get_path_model(path, name)
        self._model.save('{}.{}'.format(path_model, ext))
        self._target_model.save('{}.{}'.format(path_target_model, ext))

    def load_network(self, path, name=None, ext="h5"):
        """
        Load the neural networks.

        Parameters
        ----------
        path: ``str``
            The path from which the models are loaded
        name: ``str``
            The name given to this model

        ext: ``str``
            The file extension (by default h5)
        """
        path_model, path_target_model = self.get_path_model(path, name)
        # fix for issue https://github.com/keras-team/keras/issues/7440
        self.construct_q_network()

        self._model.load_weights('{}.{}'.format(path_model, ext))

        with warnings.catch_warnings():
            warnings.filterwarnings("ignore")
            self._target_model.load_weights('{}.{}'.format(path_target_model, ext))
        if self.verbose:
            print("Successfully loaded network.")

    def target_train(self, tau=None):
        """
        update the target model with the parameters given in the :attr:`BaseDeepSARSA._training_param`.
        """
        if tau is None:
            tau = self._training_param.tau
        tau_inv = 1.0 - tau

        target_params = self._target_model.trainable_variables
        source_params = self._model.trainable_variables
        for src, dest in zip(source_params, target_params):
            # Polyak averaging
            var_update = src.value() * tau
            var_persist = dest.value() * tau_inv
            dest.assign(var_update + var_persist)

    def save_tensorboard(self, current_step):
        """function used to save other information to tensorboard"""
        pass
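
The target built in `train` above can be reproduced on toy arrays. The sketch below is a NumPy-only illustration under simplifying assumptions: the Q values are given directly as arrays instead of being produced by Keras models, and the function names `epsilon_greedy` and `sarsa_targets` are hypothetical.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """Pick the greedy action per row, replaced by a random one with probability epsilon."""
    actions = np.argmax(q_values, axis=-1)
    explore = rng.random(q_values.shape[0]) < epsilon
    actions[explore] = rng.integers(0, q_values.shape[1], size=int(explore.sum()))
    return actions


def sarsa_targets(q_current, actions, rewards, done, q_next_target, next_actions, gamma):
    """Regression targets: r for finished episodes, r + gamma * Q_target(S', a') otherwise."""
    idx = np.arange(q_current.shape[0])
    target = q_current.copy()
    target[idx, actions] = rewards
    fut_q = q_next_target[idx, next_actions]  # Q value of the action actually drawn in S'
    not_done = ~done
    target[not_done, actions[not_done]] += gamma * fut_q[not_done]
    return target


rng = np.random.default_rng(0)
q_current = np.array([[1.0, 2.0], [3.0, 4.0]])      # Q(S, .) from the main model
q_next_online = np.array([[0.0, 1.0], [1.0, 0.0]])  # Q(S', .) from the main model
q_next_target = np.array([[0.5, 2.0], [9.0, 9.0]])  # Q(S', .) from the target model
actions = np.array([0, 1])
rewards = np.array([1.0, -1.0])
done = np.array([False, True])

# epsilon = 0 keeps the example deterministic (purely greedy next actions)
next_actions = epsilon_greedy(q_next_online, epsilon=0.0, rng=rng)
target = sarsa_targets(q_current, actions, rewards, done, q_next_target, next_actions, gamma=0.9)
print(target)
```

Only the entry of the action that was played is modified in each row; the other entries keep the current prediction, so they contribute nothing to the mean square error.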

## Saving of trained neural networks
This class provides an easy way to save and restore, as JSON, the shape of your neural networks (number of layers, non-linearities, size of each layer, etc.).
        
#### Attributes

nn_class: :class:`l2rpn_baselines.BaseDeepSARSA`
        The neural network class that will be created with each call of :func:`l2rpn_baselines.make_nn`

observation_size: ``int``
        The size of the observation space.

action_size: ``int``
        The size of the action space.

sizes: ``list``
        A list of integers, each giving the number of hidden units of one layer. The number of hidden layers is
        given by the length of this list.

activs: ``list``
        List of activation functions (given as strings). It should have the same length as :attr:`NNParam.sizes`.
        Each entry should be the name of a keras activation function.

list_attr_obs: ``list``
        List of the attributes that will be used from the observation and concatenated to be fed to the neural network.
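
The save and restore cycle can be sketched with a reduced stand-in class. `TinyNNParam` below is hypothetical and only mimics the attributes listed above; the attribute names passed in `list_attr_obs` are example values.

```python
import json


class TinyNNParam:
    """Hypothetical, reduced stand-in for NNParam (illustration only)."""

    def __init__(self, action_size, observation_size, sizes, activs, list_attr_obs):
        if len(sizes) != len(activs):
            raise ValueError("'sizes' and 'activs' must have the same length")
        self.action_size = int(action_size)
        self.observation_size = int(observation_size)
        self.sizes = [int(el) for el in sizes]
        self.activs = [str(el) for el in activs]
        self.list_attr_obs = [str(el) for el in list_attr_obs]

    def to_dict(self):
        """Convert to a dictionary made only of JSON-compatible types."""
        return {"action_size": self.action_size,
                "observation_size": self.observation_size,
                "sizes": self.sizes,
                "activs": self.activs,
                "list_attr_obs": self.list_attr_obs}

    @classmethod
    def from_dict(cls, dict_):
        """Rebuild an instance from the output of `to_dict`."""
        return cls(**dict_)


params = TinyNNParam(action_size=5, observation_size=12,
                     sizes=[64, 32], activs=["relu", "relu"],
                     list_attr_obs=["rho", "line_status"])
as_json = json.dumps(params.to_dict())          # what would be written to disk
restored = TinyNNParam.from_dict(json.loads(as_json))
```

Storing the architecture as JSON next to the weights makes it possible to rebuild the same network before loading the weights.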

In [7]:
class NNParam(object):

    _int_attr = ["action_size", "observation_size"]
    _float_attr = []
    _str_attr = []
    _list_float = []
    _list_str = ["activs", "list_attr_obs"]
    _list_int = ["sizes"]
    nn_class = BaseDeepSARSA

    def __init__(self,
                 action_size,
                 observation_size,
                 sizes,
                 activs,
                 list_attr_obs,
                 ):
        self.observation_size = observation_size
        self.action_size = action_size
        self.sizes = [int(el) for el in sizes]
        self.activs = [str(el) for el in activs]
        if len(self.sizes) != len(self.activs):
            raise RuntimeError("\"sizes\" and \"activs\" lists do not have the same length. It's not clear how many layers "
                               "you want your neural network to have.")
        self.list_attr_obs = [str(el) for el in list_attr_obs]

    @classmethod
    def get_path_model(cls, path, name=None):
        """get the path at which the model will be saved"""
        return cls.nn_class.get_path_model(path, name=name)

    def make_nn(self, training_param):
        """build the appropriate BaseDeepSARSA"""
        res = self.nn_class(self, training_param)
        return res

    @staticmethod
    def get_obs_size(env, list_attr_name):
        """get the size of the flattened observation"""
        res = 0
        for obs_attr_name in list_attr_name:
            beg_, end_, dtype_ = env.observation_space.get_indx_extract(obs_attr_name)
            res += end_ - beg_  # no "+1" needed because "end_" is excluded, following the python convention
        return res

    def get_obs_attr(self):
        """get the names of the observation attributes that will be extracted """
        return self.list_attr_obs

    # utilities, do not change
    def to_dict(self):
        """convert this instance to a dictionary"""
        # TODO copy and paste from TrainingParam
        res = {}
        for attr_nm in self._int_attr:
            tmp = getattr(self, attr_nm)
            if tmp is not None:
                res[attr_nm] = int(tmp)
            else:
                res[attr_nm] = None
        for attr_nm in self._float_attr:
            tmp = getattr(self, attr_nm)
            if tmp is not None:
                res[attr_nm] = float(tmp)
            else:
                res[attr_nm] = None
        for attr_nm in self._str_attr:
            tmp = getattr(self, attr_nm)
            if tmp is not None:
                res[attr_nm] = str(tmp)
            else:
                res[attr_nm] = None

        for attr_nm in self._list_float:
            tmp = getattr(self, attr_nm)
            res[attr_nm] = self._convert_list_to_json(tmp, float)
        for attr_nm in self._list_int:
            tmp = getattr(self, attr_nm)
            res[attr_nm] = self._convert_list_to_json(tmp, int)
        for attr_nm in self._list_str:
            tmp = getattr(self, attr_nm)
            res[attr_nm] = self._convert_list_to_json(tmp, str)
        return res

    @classmethod
    def _convert_list_to_json(cls, obj, type_):
        if isinstance(obj, type_):
            res = obj
        elif isinstance(obj, np.ndarray):
            if len(obj.shape) == 1:
                res = [type_(el) for el in obj]
            else:
                res = [cls._convert_list_to_json(el, type_) for el in obj]
        elif isinstance(obj, Iterable):
            res = [cls._convert_list_to_json(el, type_) for el in obj]
        else:
            res = type_(obj)
        return res

    @classmethod
    def _attr_from_json(cls, json, type_):
        if isinstance(json, type_):
            res = json
        elif isinstance(json, list):
            res = [cls._convert_list_to_json(obj=el, type_=type_) for el in json]
        else:
            res = type_(json)
        return res

    @classmethod
    def from_dict(cls, tmp):
        """load from a dictionary"""
        # TODO copy and paste from TrainingParam (more or less)
        cls_as_dict = {}
        for attr_nm in cls._int_attr:
            if attr_nm in tmp:
                tmp_ = tmp[attr_nm]
                if tmp_ is not None:
                    cls_as_dict[attr_nm] = int(tmp_)
                else:
                    cls_as_dict[attr_nm] = None

        for attr_nm in cls._float_attr:
            if attr_nm in tmp:
                tmp_ = tmp[attr_nm]
                if tmp_ is not None:
                    cls_as_dict[attr_nm] = float(tmp_)
                else:
                    cls_as_dict[attr_nm] = None

        for attr_nm in cls._str_attr:
            if attr_nm in tmp:
                tmp_ = tmp[attr_nm]
                if tmp_ is not None:
                    cls_as_dict[attr_nm] = str(tmp_)
                else:
                    cls_as_dict[attr_nm] = None

        for attr_nm in cls._list_float:
            if attr_nm in tmp:
                cls_as_dict[attr_nm] = cls._attr_from_json(tmp[attr_nm], float)
        for attr_nm in cls._list_int:
            if attr_nm in tmp:
                cls_as_dict[attr_nm] = cls._attr_from_json(tmp[attr_nm], int)
        for attr_nm in cls._list_str:
            if attr_nm in tmp:
                cls_as_dict[attr_nm] = cls._attr_from_json(tmp[attr_nm], str)

        res = cls(**cls_as_dict)
        return res

    @classmethod
    def from_json(cls, json_path):
        """load from a json file"""
        # TODO copy and paste from TrainingParam
        if not os.path.exists(json_path):
            raise FileNotFoundError("No file is located at \"{}\"".format(json_path))
        with open(json_path, "r") as f:
            dict_ = json.load(f)
        return cls.from_dict(dict_)

    def save_as_json(self, path, name=None):
        """save as a json file"""
        # TODO copy and paste from TrainingParam
        res = self.to_dict()
        if name is None:
            name = "neural_net_parameters.json"
        if not os.path.exists(path):
            raise RuntimeError("Directory \"{}\" not found to save the NN parameters".format(path))
        if not os.path.isdir(path):
            raise NotADirectoryError("\"{}\" should be a directory".format(path))
        path_out = os.path.join(path, name)
        with open(path_out, "w", encoding="utf-8") as f:
            json.dump(res, fp=f, indent=4, sort_keys=True)

    def center_reduce(self, env):
        """currently not implemented for this class, "coming soon" as we might say"""
        # TODO see TestLeapNet for this feature
        self._center_reduce_vect(env.get_obs(), "x")

    def _get_adds_mults_from_name(self, obs, attr_nm):
        if attr_nm in ["prod_p"]:
            add_tmp = np.array([-0.5 * (pmax + pmin) for pmin, pmax in zip(obs.gen_pmin, obs.gen_pmax)])
            mult_tmp = np.array([1. / max((pmax - pmin), 0.) for pmin, pmax in zip(obs.gen_pmin, obs.gen_pmax)])
        elif attr_nm in ["prod_q"]:
            add_tmp = 0.
            mult_tmp = np.array([1. / max(abs(val), 1.0) for val in obs.prod_q])
        elif attr_nm in ["load_p", "load_q"]:
            add_tmp = np.array([-val for val in getattr(obs, attr_nm)])
            mult_tmp = 0.5
        elif attr_nm in ["load_v", "prod_v", "v_or", "v_ex"]:
            add_tmp = 0.
            mult_tmp = np.array([1. / val for val in getattr(obs, attr_nm)])
        elif attr_nm == "hour_of_day":
            add_tmp = -12.
            mult_tmp = 1.0 / 12
        elif attr_nm == "minute_of_hour":
            add_tmp = -30.
            mult_tmp = 1.0 / 30
        elif attr_nm == "day_of_week":
            add_tmp = -4.
            mult_tmp = 1.0 / 4
        elif attr_nm == "day":
            add_tmp = -15.
            mult_tmp = 1.0 / 15.
        elif attr_nm in ["target_dispatch", "actual_dispatch"]:
            add_tmp = 0.
            mult_tmp = np.array([1. / (pmax - pmin) for pmin, pmax in zip(obs.gen_pmin, obs.gen_pmax)])
        elif attr_nm in ["a_or", "a_ex", "p_or", "p_ex", "q_or", "q_ex"]:
            add_tmp = 0.
            mult_tmp = np.array([1.0 / max(val, 1.0) for val in getattr(obs, attr_nm)])
        else:
            add_tmp = 0.
            mult_tmp = 1.0
        return add_tmp, mult_tmp

    def _center_reduce_vect(self, obs, nn_part):
        """
        compute the xxxx_adds and xxxx_mults for one part of the neural network called nn_part,
        depending on what attribute of the observation is extracted
        """
        if not isinstance(obs, grid2op.Observation.BaseObservation):
            # in multi processing i receive a set of observation there so i might need
            # to extract only the first one
            obs = obs[0]

        li_attr_obs = getattr(self, "list_attr_obs_{}".format(nn_part))
        adds = []
        mults = []
        for attr_nm in li_attr_obs:
            add_tmp, mult_tmp = self._get_adds_mults_from_name(obs, attr_nm)
            mults.append(mult_tmp)
            adds.append(add_tmp)
        setattr(self, "{}_adds".format(nn_part), adds)
        setattr(self, "{}_mults".format(nn_part), mults)
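
The `adds` and `mults` computed above are meant to center and scale each observation attribute. As a minimal sketch, assuming they are applied as `(x + add) * mult` (this application step is not shown in the class above, and the helper name `center_reduce_value` is hypothetical), the time attributes would be mapped to roughly [-1, 1]:

```python
def center_reduce_value(x, add, mult):
    """Hypothetical helper: shift by `add`, then scale by `mult`."""
    return (x + add) * mult


# offsets and scales used above for "hour_of_day" and "minute_of_hour"
hour_add, hour_mult = -12.0, 1.0 / 12
minute_add, minute_mult = -30.0, 1.0 / 30

scaled_hours = [center_reduce_value(h, hour_add, hour_mult) for h in (0, 12, 24)]
scaled_minutes = [center_reduce_value(m, minute_add, minute_mult) for m in (0, 30, 60)]
print(scaled_hours)    # values close to -1, 0 and 1
print(scaled_minutes)  # values close to -1, 0 and 1
```

Attributes that fall in the final `else` branch keep `add = 0` and `mult = 1`, so under the same assumption they would be passed to the network unchanged.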

## To save training parameters of the model

A class to store the training parameters of the models.
	
        
#### Attributes

buffer_size: ``int``
        Size of the replay buffer

minibatch_size: ``int``
        Size of the training minibatch
update_freq: ``int``
        Frequency at which the model is trained. The model is trained once every `update_freq` steps, using a
        minibatch of `minibatch_size` transitions sampled from the experience replay buffer.

final_epsilon: ``float``
        Final value of epsilon (for the epsilon-greedy exploration)
initial_epsilon: ``float``
        Initial value of epsilon (for the epsilon-greedy exploration)
step_for_final_epsilon: ``int``
        Number of steps after which the final epsilon (for the epsilon-greedy exploration) is reached

min_observation: ``int``
        number of observations before starting to train the neural nets. Before this number of iterations, the agent
        will simply interact with the environment.

lr: ``float``
        The initial learning rate

lr_decay_steps: ``int``
        The learning rate decay step

lr_decay_rate: ``float``
        The learning rate decay rate

num_frames: ``int``
        Currently not used

discount_factor: ``float``
        The discount factor. A high discount factor favors long-term rewards (and hence longer episodes), while
        a small one favors immediate rewards. It is often called "gamma" in the RL literature: the agent tries to
        maximize the sum of the discounted rewards, $\sum_{t \geq t_0} \gamma^{t - t_0} r_t$.

tau: ``float``
        Rate of the soft update of the target model. The target model is updated according to
        `target_model_weights[i] = tau * model_weights[i] + (1 - tau) * target_model_weights[i]`

min_iter: ``int``
        It is possible in the training schedule to limit the number of time steps an episode can last. This is mainly
        useful at the beginning of training, to avoid ending up in a state where the grid has been modified so much
        that the agent will never encounter a similar state again. Stopping the episode before this happens can
        help the learning.

max_iter: ``int``
        Just like "min_iter", but instead of being the minimum number of iterations, it is the maximum.

update_nb_iter: ``int``
        If `max_iter_fun` is the default one, this number gives the number of times a scenario must be completed
        successfully before the maximum number of time steps allowed is increased.

step_increase_nb_iter: ``int`` or  ``None``
        By how many time steps the maximum number of time steps allowed per episode is increased. Set it to 0 to
        deactivate this.

max_iter_fun: ``function``
        A function that returns the maximum number of steps an episode can last for the current epoch. For example
        it can be `max_iter_fun = lambda epoch_num : np.sqrt(50 * epoch_num)`
        [default: `lambda nb_success: step_increase_nb_iter * int(nb_success / update_nb_iter)`]

oversampling_rate: ``float`` or ``None``
        Set it to None to deactivate the oversampling of hard scenarios. Otherwise, this oversampling is done
        with something like `proba = 1. / (time_step_lived**oversampling_rate + 1)` where `proba` is the probability
        to be selected at the next call to "reset" and `time_step_lived` is the number of time steps the agent
        survived on that scenario.

random_sample_datetime_start: ``int`` or ``None``
        If ``None`` during training the chronics will always start at the datetime the chronics start.
        Otherwise, the training scheme will skip a number of time steps between 0 and  `random_sample_datetime_start`
        when loading the next chronics. This is particularly useful when you want your agent to learn to operate
        the grid regardless of the hour of day or day of the week.

update_tensorboard_freq: ``int``
        Frequency at which tensorboard is refreshed (tensorboard summaries are saved every update_tensorboard_freq
        steps)

save_model_each: ``int``
        Frequency at which the model is saved (it is saved every "save_model_each" steps)

max_global_norm_grad: ``float``
        Maximum global norm of the gradient allowed (can make the training more stable). Set it to ``None`` to deactivate it.
        Not all baselines are compatible.

max_value_grad: ``float``
        Maximum value the gradient can take. Assign it to ``None`` to deactivate it. This can make the training
        more stable in some cases, but can slow down the training process too. Not all baselines are compatible.

max_loss: ``float``
        Clip the value of the loss function. Set it to ``None`` to deactivate it. Again, this can make the training
        more stable but possibly slower. Not all baselines are compatible.
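
To make the role of `discount_factor` concrete, here is a minimal sketch of the discounted sum of rewards described above. The function name `discounted_return` and the reward values are hypothetical and only serve as an illustration; this is not code used by the agent.

```python
def discounted_return(rewards, discount_factor):
    """Sum of gamma^(t - t_0) * r_t over the rewards collected from t_0 onward."""
    total = 0.0
    # iterate backwards so that each step needs one multiplication and one addition
    for reward in reversed(rewards):
        total = reward + discount_factor * total
    return total


rewards = [1.0, 1.0, 1.0]  # made-up rewards, for illustration only
print(discounted_return(rewards, 0.5))   # later rewards count much less
print(discounted_return(rewards, 0.99))  # later rewards count almost fully
```

With a discount factor close to 1 (the default above is 0.99), rewards obtained many steps later still contribute noticeably to the return, which is why a high value favors agents that keep the grid running for longer.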

In [8]:
class TrainingParam(object):
    
    _tol_float_equal = float(1e-8)

    _int_attr = ["buffer_size", "minibatch_size", "step_for_final_epsilon",
                 "min_observation", "last_step", "num_frames", "update_freq",
                 "min_iter", "max_iter", "update_tensorboard_freq", "save_model_each", "_update_nb_iter",
                 "step_increase_nb_iter", "min_observe", "sample_one_random_action_begin"]
    _float_attr = ["_final_epsilon", "_initial_epsilon", "lr", "lr_decay_steps", "lr_decay_rate",
                   "discount_factor", "tau", "oversampling_rate",
                   "max_global_norm_grad", "max_value_grad", "max_loss"]

    def __init__(self,
                 buffer_size=40000,
                 minibatch_size=64,
                 step_for_final_epsilon=100000,  # step at which final_epsilon is reached
                 min_observation=5000,  # 5000
                 final_epsilon=1./(7*288.),  # have on average 1 random action per week of approx 7*288 time steps
                 initial_epsilon=0.4,
                 lr=1e-4,
                 lr_decay_steps=10000,
                 lr_decay_rate=0.999,
                 num_frames=1,
                 discount_factor=0.99,
                 tau=0.01,
                 update_freq=256,
                 min_iter=50,
                 max_iter=8064,  # 1 month
                 update_nb_iter=10,
                 step_increase_nb_iter=0,  # by default (0) the maximum episode length is not progressively increased
                 update_tensorboard_freq=1000,  # update tensorboard every "update_tensorboard_freq" steps
                 save_model_each=10000,  # save the model every "save_model_each" steps
                 random_sample_datetime_start=None,
                 oversampling_rate=None,
                 max_global_norm_grad=None,
                 max_value_grad=None,
                 max_loss=None,

                 # observer: let the neural network "observe" for a given amount of time
                 # all actions are replaced by a do nothing
                 min_observe=None,

                 # i do a random action at the beginning of an episode until a certain number of step
                 # is made
                 # it's recommended to have "min_observe" to be larger than this (this is an int)
                 sample_one_random_action_begin=None,
                 ):

        self.random_sample_datetime_start = random_sample_datetime_start

        self.buffer_size = int(buffer_size)
        self.minibatch_size = int(minibatch_size)
        self.min_observation = int(min_observation)
        self._final_epsilon = float(final_epsilon)  # on average 1 random action per week of approx 7*288 time steps at the end (never completely kill the exploration)
        self._initial_epsilon = float(initial_epsilon)
        self.step_for_final_epsilon = float(step_for_final_epsilon)
        self.lr = float(lr)
        self.lr_decay_steps = float(lr_decay_steps)
        self.lr_decay_rate = float(lr_decay_rate)

        # gradient clipping (if supported)
        self.max_global_norm_grad = max_global_norm_grad
        self.max_value_grad = max_value_grad
        self.max_loss = max_loss

        # observer
        self.min_observe = min_observe
        self.sample_one_random_action_begin = sample_one_random_action_begin

        self.last_step = int(0)
        self.num_frames = int(num_frames)
        self.discount_factor = float(discount_factor)
        self.tau = float(tau)
        self.update_freq = int(update_freq)
        self.min_iter = int(min_iter)
        self.max_iter = int(max_iter)
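        # go through the property setter so that `_1_update_nb_iter` (used by `default_max_iter_fun`) is set too
        self._1_update_nb_iter = None
        self.update_nb_iter = int(update_nb_iter)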
        if step_increase_nb_iter is None:
            # 0 and None have the same effect: they disable the feature
            step_increase_nb_iter = 0
        self.step_increase_nb_iter = step_increase_nb_iter

        if oversampling_rate is not None:
            self.oversampling_rate = float(oversampling_rate)
        else:
            self.oversampling_rate = None

        self.update_tensorboard_freq = update_tensorboard_freq
        self.save_model_each = save_model_each
        self.max_iter_fun = self.default_max_iter_fun
        self._compute_exp_facto()

    @property
    def final_epsilon(self):
        return self._final_epsilon

    @final_epsilon.setter
    def final_epsilon(self, final_epsilon):
        self._final_epsilon = final_epsilon
        self._compute_exp_facto()

    @property
    def initial_epsilon(self):
        return self._initial_epsilon

    @initial_epsilon.setter
    def initial_epsilon(self, initial_epsilon):
        self._initial_epsilon = initial_epsilon
        self._compute_exp_facto()

    @property
    def update_nb_iter(self):
        return self._update_nb_iter

    @update_nb_iter.setter
    def update_nb_iter(self, update_nb_iter):
        self._update_nb_iter = update_nb_iter
        if self._update_nb_iter is not None and self._update_nb_iter > 0:
            self._1_update_nb_iter = 1.0 / self._update_nb_iter
        else:
            self._1_update_nb_iter = 1.0

    def _compute_exp_facto(self):
        if self.final_epsilon is not None and self.initial_epsilon is not None and self.final_epsilon > 0:
            self._exp_facto = np.log(self.initial_epsilon/self.final_epsilon)
        else:
            # TODO
            self._exp_facto = 1

    def default_max_iter_fun(self, nb_success):
        """the default max iteration function used"""
        return self.step_increase_nb_iter * int(nb_success * self._1_update_nb_iter)

    def tell_step(self, current_step):
        """tell this instance the number of training steps that have been made"""
        self.last_step = current_step

    def get_next_epsilon(self, current_step):
        """get the next epsilon for the e greedy exploration"""
        self.tell_step(current_step)
        if self.step_for_final_epsilon is None or self.initial_epsilon is None \
                or self._exp_facto is None or self.final_epsilon is None:
            res = 0.
        else:
            if current_step > self.step_for_final_epsilon:
                res = self.final_epsilon
            else:
                # exponential decrease
                res = self.initial_epsilon * np.exp(- (current_step / self.step_for_final_epsilon) * self._exp_facto )
        return res

    def to_dict(self):
        """serialize this instance to a dictionary."""
        res = {}
        for attr_nm in self._int_attr:
            tmp = getattr(self, attr_nm)
            if tmp is not None:
                res[attr_nm] = int(tmp)
            else:
                res[attr_nm] = None
        for attr_nm in self._float_attr:
            tmp = getattr(self, attr_nm)
            if tmp is not None:
                res[attr_nm] = float(tmp)
            else:
                res[attr_nm] = None
        return res

    @staticmethod
    def from_dict(tmp):
        """initialize this instance from a dictionary"""
        if not isinstance(tmp, dict):
            raise RuntimeError("TrainingParam from dict must be called with a dictionary, and not {}".format(tmp))
        res = TrainingParam()
        for attr_nm in TrainingParam._int_attr:
            if attr_nm in tmp:
                tmp_ = tmp[attr_nm]
                if tmp_ is not None:
                    setattr(res, attr_nm, int(tmp_))
                else:
                    setattr(res, attr_nm, None)

        for attr_nm in TrainingParam._float_attr:
            if attr_nm in tmp:
                tmp_ = tmp[attr_nm]
                if tmp_ is not None:
                    setattr(res, attr_nm, float(tmp_))
                else:
                    setattr(res, attr_nm, None)
        res.update_nb_iter = res._update_nb_iter
        res.initial_epsilon = res._initial_epsilon
        res._compute_exp_facto()
        return res

    @staticmethod
    def from_json(json_path):
        """initialize this instance from a json"""
        if not os.path.exists(json_path):
            raise FileNotFoundError("No file is located at \"{}\"".format(json_path))
        with open(json_path, "r") as f:
            dict_ = json.load(f)
        return TrainingParam.from_dict(dict_)

    def save_as_json(self, path, name=None):
        """save this instance as a json"""
        res = self.to_dict()
        if name is None:
            name = "training_parameters.json"
        if not os.path.exists(path):
            raise RuntimeError("Directory \"{}\" not found to save the training parameters".format(path))
        if not os.path.isdir(path):
            raise NotADirectoryError("\"{}\" should be a directory".format(path))
        path_out = os.path.join(path, name)
        with open(path_out, "w", encoding="utf-8") as f:
            json.dump(res, fp=f, indent=4, sort_keys=True)

    def do_train(self):
        """return whether or not i should train the model at this time step"""
        return self.last_step % self.update_freq == 0

    def __eq__(self, other):
        res = True
        for el in self._int_attr:
            me_ = getattr(self, el)
            oth_ = getattr(other, el)
            if me_ is None and oth_ is not None:
                res = False
                break
            if oth_ is None and me_ is not None:
                res = False
                break
            if me_ is None and oth_ is None:
                continue
            if int(me_) != int(oth_):
                res = False
                break
        if res:
            for el in self._float_attr:
                me_ = getattr(self, el)
                oth_ = getattr(other, el)
                if me_ is None and oth_ is not None:
                    res = False
                    break
                if oth_ is None and me_ is not None:
                    res = False
                    break
                if me_ is None and oth_ is None:
                    continue
                if abs(float(me_) - float(oth_)) > self._tol_float_equal:
                    res = False
                    break
        return res
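
The exploration schedule implemented by `get_next_epsilon` can be checked in isolation. The sketch below restates the same exponential decay as a standalone function (the name `epsilon_at` is hypothetical; the default values are copied from the `TrainingParam` constructor above):

```python
import math


def epsilon_at(step, initial_epsilon=0.4, final_epsilon=1.0 / (7 * 288), step_for_final_epsilon=100000):
    """Standalone restatement of the exponential decay used by `get_next_epsilon`."""
    if step > step_for_final_epsilon:
        return final_epsilon
    exp_facto = math.log(initial_epsilon / final_epsilon)
    return initial_epsilon * math.exp(-(step / step_for_final_epsilon) * exp_facto)


for step in (0, 50000, 100000, 200000):
    print(step, epsilon_at(step))
```

Epsilon starts at `initial_epsilon`, reaches `final_epsilon` exactly at `step_for_final_epsilon`, and stays there afterwards. Halfway through, it equals the geometric mean of the two values, which is what "exponential decrease" means here.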

## Create neural network using abstract class created above
Constructs the desired deep q learning network
        
#### Attributes

schedule_lr_model:
        The schedule for the learning rate.


In [9]:
class DeepQ_NN(BaseDeepSARSA):

    def __init__(self,
                 nn_params,
                 training_param=None):
        if not _CAN_USE_TENSORFLOW:
            raise RuntimeError("Cannot import tensorflow, this function cannot be used.")
        
        if training_param is None:
            training_param = TrainingParam()
        BaseDeepSARSA.__init__(self,
                           nn_params,
                           training_param)
        self.schedule_lr_model = None
        self.construct_q_network()

    def construct_q_network(self):
        """
        This function will make 2 identical models, one will serve as a target model, the other one will be trained
        regularly.
        """
        self._model = Sequential()
        input_layer = Input(shape=(self._nn_archi.observation_size,),
                            name="state")
        lay = input_layer
        for lay_num, (size, act) in enumerate(zip(self._nn_archi.sizes, self._nn_archi.activs)):
            lay = Dense(size, name="layer_{}".format(lay_num))(lay)  # put at self.action_size
            lay = Activation(act)(lay)

        output = Dense(self._action_size, name="output")(lay)

        self._model = Model(inputs=[input_layer], outputs=[output])
        self._schedule_lr_model, self._optimizer_model = self.make_optimiser()
        self._model.compile(loss='mse', optimizer=self._optimizer_model)

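        # build the target model as a structural copy that owns its weights: a second `Model` built from the
        # same tensors would share its layers (and hence its weights) with `self._model`
        from tensorflow.keras.models import clone_model  # lazy import
        self._target_model = clone_model(self._model)
        self._target_model.set_weights(self._model.get_weights())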
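
The target model is then kept close to the trained model through the soft update controlled by `tau`. The following sketch applies that rule to plain Python lists; the function name `soft_update` and the weight values are hypothetical, and real weights would be arrays rather than flat lists of floats.

```python
def soft_update(model_weights, target_weights, tau):
    """Return the new target weights: tau * model + (1 - tau) * target, element-wise."""
    return [tau * w + (1.0 - tau) * w_target
            for w, w_target in zip(model_weights, target_weights)]


model_weights = [1.0, 2.0]   # made-up weights of the trained network
target_weights = [0.0, 0.0]  # made-up weights of the target network
for _ in range(3):
    target_weights = soft_update(model_weights, target_weights, tau=0.5)
print(target_weights)
```

A large `tau` (0.5 here, for readability) makes the target follow the trained network quickly. With the default `tau=0.01`, the target moves only 1% of the way at each update, which keeps the regression targets more stable.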

## Define parameters for deep neural nets
This defines the specific parameters for the DeepQ network.
Nothing really differs from the base class, except that `NNParam.nn_class` (nn_class) is set to `DeepQ_NN`.

In [10]:
class DeepQ_NNParam(NNParam):
    _int_attr = copy.deepcopy(NNParam._int_attr)
    _float_attr = copy.deepcopy(NNParam._float_attr)
    _str_attr = copy.deepcopy(NNParam._str_attr)
    _list_float = copy.deepcopy(NNParam._list_float)
    _list_str = copy.deepcopy(NNParam._list_str)
    _list_int = copy.deepcopy(NNParam._list_int)

    nn_class = DeepQ_NN

    def __init__(self,
                 action_size,
                 observation_size,  # TODO this might not be useful
                 sizes,
                 activs,
                 list_attr_obs
                 ):
        NNParam.__init__(self,
                         action_size,
                         observation_size,  # TODO this might not be useful
                         sizes,
                         activs,
                         list_attr_obs
                         )

## A Skeleton Class That Inherits DeepSARSAAgent Class
A simple deep Q-learning algorithm. It does nothing different from its base class.

In [11]:
class DeepQSimple(DeepSARSAAgent):
    pass


## A function to train the deep SARSA agent created above

This function implements the "training" part of the baseline "DeepQSimple".
        
#### Parameters

env: :class:`grid2op.Environment`
        The environment on which you need to train your agent.

name: ``str``
        The name of your agent.

iterations: ``int``
        For how many iterations (steps) you want to train your agent. NB: these are not episodes, these are steps.

save_path: ``str``
        Where do you want to save your baseline.

load_path: ``str``
        If you want to reload your baseline, specify the path where it is located. **NB** if a baseline is reloaded,
        some of the arguments provided to this function will not be used.

logs_dir: ``str``
        Where to store the tensorboard generated logs during the training. ``None`` if you don't want to log them.

training_param: :class:`l2rpn_baselines.utils.TrainingParam`
        The parameters describing the way you will train your model.

filter_action_fun: ``function``
        A function to filter the action space. See the
        [IdToAct.filter_action](https://grid2op.readthedocs.io/en/latest/converter.html#grid2op.Converter.IdToAct.filter_action)
        documentation.

verbose: ``bool``
        If you want something to be printed on the terminal (a better logging strategy will be put in place at some point)

kwargs_converters: ``dict``
        A dictionary containing the keyword arguments passed at the initialization of the
        :class:`grid2op.Converter.IdToAct` that serves as "Base" for the Agent.

kwargs_archi: ``dict``
        Key word arguments used for making the :class:`DeepQ_NNParam` object that will be used to build the baseline.

#### Returns

baseline: :class:`DeepQSimple`
        The trained baseline.

In [12]:
def train(env,
          name=DEFAULT_NAME,
          iterations=1,
          save_path=None,
          load_path=None,
          logs_dir=None,
          training_param=None,
          filter_action_fun=None,
          kwargs_converters={},
          kwargs_archi={},
          verbose=True):
    
    import tensorflow as tf  # lazy import to save import time
    # Limit gpu usage
    try:
        physical_devices = tf.config.list_physical_devices('GPU')
        if len(physical_devices) > 0:
            tf.config.experimental.set_memory_growth(physical_devices[0], True)
    except AttributeError:
        # issue of https://stackoverflow.com/questions/59266150/attributeerror-module-tensorflow-core-api-v2-config-has-no-attribute-list-p
        try:
            physical_devices = tf.config.experimental.list_physical_devices('GPU')
            if len(physical_devices) > 0:
                tf.config.experimental.set_memory_growth(physical_devices[0], True)
        except Exception:
            warnings.warn(_WARN_GPU_MEMORY)
    except Exception:
        warnings.warn(_WARN_GPU_MEMORY)

    if training_param is None:
        training_param = TrainingParam()

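    # compute the proper size for the converter
    # (work on a copy so that neither the caller's dict nor the shared default `{}` is mutated)
    kwargs_archi = dict(kwargs_archi)
    kwargs_archi["action_size"] = DeepQSimple.get_action_size(env.action_space, filter_action_fun, kwargs_converters)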

    if load_path is not None:
        path_model, path_target_model = DeepQ_NN.get_path_model(load_path, name)
        if verbose:
            print("INFO: Reloading a model, the architecture parameters will be ignored")
        nn_archi = DeepQ_NNParam.from_json(os.path.join(path_model, "nn_architecture.json"))
    else:
        nn_archi = DeepQ_NNParam(**kwargs_archi)

    baseline = DeepQSimple(action_space=env.action_space,
                           nn_archi=nn_archi,
                           name=name,
                           istraining=True,
                           verbose=verbose,
                           filter_action_fun=filter_action_fun,
                            **kwargs_converters
                            )

    if load_path is not None:
        if verbose:
            print("INFO: Reloading a model, training parameters will be ignored")
        baseline.load(load_path)
        training_param = baseline._training_param

    baseline.train(env,
                   iterations,
                   save_path=save_path,
                   logdir=logs_dir,
                   training_param=training_param)
    # as in our example (and in our explanation) we recommend saving the model regularly in the "train" function,
    # so it is not necessary to save it again here. But if you choose not to follow this advice, it is more than
    # recommended to save the "baseline" at the end of this function with:
    # baseline.save(save_path)
    return baseline

## Method to evaluate the agent
Evaluates the performance of the trained `DeepQSimple` agent.


#### Parameters

env: :class:`grid2op.Environment`
        The environment on which you evaluate your agent.

name: ``str``
        The name of the trained baseline

load_path: ``str``
        Path where the agent has been stored

logs_path: ``str``
        Where to write the results of the assessment

nb_episode: ``int``
        How many episodes to run during the assessment of the performances

nb_process: ``int``
        On how many processes the assessment will be made (setting this > 1 can lead to some speed-ups but can be
        unstable on some platforms)

max_steps: ``int``
        How many steps at maximum your agent will be assessed

verbose: ``bool``
        If ``True``, print the model summary, a progress bar and the evaluation summary.

save_gif: ``bool``
        Whether or not you want to save, as a gif, the performance of your agent. It might cause memory issues (might
        take a lot of ram) and drastically increase computation time.

#### Returns

agent: :class:`l2rpn_baselines.utils.DeepSARSAAgent`
        The loaded agent that has been evaluated thanks to the runner.

res: ``list``
        The results of the Runner on which the agent was tested.
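
The returned `res` can be post-processed directly. The sketch below assumes each entry is a 5-tuple that unpacks as `(_, chron_name, cum_reward, nb_time_step, max_ts)`, which is how the function below reads it when printing the summary. The helper name `summarize_results` and the sample values are hypothetical.

```python
def summarize_results(res):
    """Aggregate Runner-style results: mean cumulated reward and mean fraction of time steps survived."""
    nb_episode = len(res)
    mean_reward = sum(cum_reward for _, _, cum_reward, _, _ in res) / nb_episode
    survival = sum(nb_time_step / max_ts for _, _, _, nb_time_step, max_ts in res) / nb_episode
    return mean_reward, survival


# made-up results, for illustration only (paths, names and numbers are hypothetical)
fake_res = [("path/a", "chronics_a", 1000.0, 50, 100),
            ("path/b", "chronics_b", 3000.0, 100, 100)]
mean_reward, survival = summarize_results(fake_res)
print("mean reward: {:.1f}, mean survival: {:.0%}".format(mean_reward, survival))
```

The survival fraction is often the more readable of the two numbers, since it says how long the agent kept the grid running relative to the episode length.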

In [13]:
def evaluate(env,
             name=DEFAULT_NAME,
             load_path=None,
             logs_path=DEFAULT_LOGS_DIR,
             nb_episode=DEFAULT_NB_EPISODE,
             nb_process=DEFAULT_NB_PROCESS,
             max_steps=DEFAULT_MAX_STEPS,
             verbose=False,
             save_gif=False,
             filter_action_fun=None):
    

    import tensorflow as tf  # lazy import to save import time
    # Limit gpu usage
    physical_devices = tf.config.list_physical_devices('GPU')
    if len(physical_devices):
        tf.config.experimental.set_memory_growth(physical_devices[0], True)

    runner_params = env.get_params_for_runner()
    runner_params["verbose"] = verbose

    if load_path is None:
        raise RuntimeError("Cannot evaluate a model if there is nothing to be loaded.")
    path_model, path_target_model = DeepQ_NN.get_path_model(load_path, name)
    nn_archi = DeepQ_NNParam.from_json(os.path.join(path_model, "nn_architecture.json"))

    # Run
    # Create agent
    agent = DeepQSimple(action_space=env.action_space,
                        name=name,
                        store_action=nb_process == 1,
                        nn_archi=nn_archi,
                        observation_space=env.observation_space,
                        filter_action_fun=filter_action_fun)

    # Load weights from file
    agent.load(load_path)

    # Build runner
    runner = Runner(**runner_params,
                    agentClass=None,
                    agentInstance=agent)

    # Print model summary
    stringlist = []
    agent.deep_sarsa._model.summary(print_fn=lambda x: stringlist.append(x))
    short_model_summary = "\n".join(stringlist)
    if verbose:
        print(short_model_summary)

    # Run
    os.makedirs(logs_path, exist_ok=True)
    res = runner.run(path_save=logs_path,
                     nb_episode=nb_episode,
                     nb_process=nb_process,
                     max_iter=max_steps,
                     pbar=verbose)

    # Print summary
    if verbose:
        print("Evaluation summary:")
        for _, chron_name, cum_reward, nb_time_step, max_ts in res:
            msg_tmp = "chronics at: {}".format(chron_name)
            msg_tmp += "\ttotal score: {:.6f}".format(cum_reward)
            msg_tmp += "\ttime steps: {:.0f}/{:.0f}".format(nb_time_step, max_ts)
            print(msg_tmp)

        if len(agent.dict_action):
            # Print the distinct actions the agent played
            print("The agent played {} different actions".format(len(agent.dict_action)))
            for id_, (nb, act, types) in agent.dict_action.items():
                print("Action with ID {} was played {} times".format(id_, nb))
                print("{}".format(act))
                print("-----------")

    if save_gif:
        if verbose:
            print("Saving the gif of the episodes")
        save_log_gif(logs_path, res)

    return agent, res
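
The `res` list returned by `evaluate` holds one tuple per episode, which the summary loop unpacks as `(_, chron_name, cum_reward, nb_time_step, max_ts)`. As a minimal sketch, the hypothetical helper `summarize_results` below (it is not part of Grid2Op or of this notebook) turns such tuples into a survival ratio per scenario. The sample tuples reuse two results printed by the evaluation cell at the end of this notebook, with the chronics path shortened to a placeholder.

```python
def summarize_results(res):
    """Return (scenario name, cumulative reward, survival ratio) per episode.

    Hypothetical helper: assumes each entry of `res` is a 5-tuple
    (path, chron_name, cum_reward, nb_time_step, max_ts), as unpacked
    in the summary loop of `evaluate`.
    """
    summary = []
    for _, chron_name, cum_reward, nb_time_step, max_ts in res:
        # Fraction of the scenario the agent survived; guard against max_ts == 0
        ratio = nb_time_step / max_ts if max_ts else 0.0
        summary.append((chron_name, float(cum_reward), ratio))
    return summary


# Values copied from the evaluation output; "<path>" is a placeholder
sample_res = [
    ("<path>/Scenario_april_000", "Scenario_april_000", 31518.462890625, 617, 8062),
    ("<path>/Scenario_april_001", "Scenario_april_001", 39090.1484375, 780, 8062),
]

for name, reward, ratio in summarize_results(sample_res):
    print("{}\treward: {:.1f}\tsurvived: {:.1%}".format(name, reward, ratio))
```

A ratio like this makes episodes comparable when scenarios differ in length, which the raw number of time steps does not.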

## The code snippet below defines the network, trains it, and saves it

In [None]:
warnings.filterwarnings('ignore')
env = grid2op.make("l2rpn_neurips_2020_track1_small", reward_class=L2RPNReward)
tp = TrainingParam()

li_attr_obs_X = ["day_of_week", "hour_of_day", "minute_of_hour", "prod_p", "prod_v", "load_p", "load_q",
                 "actual_dispatch", "target_dispatch", "topo_vect", "time_before_cooldown_line",
                 "time_before_cooldown_sub", "rho", "timestep_overflow", "line_status"]

observation_size = NNParam.get_obs_size(env, li_attr_obs_X)
sizes = [800, 800, 800, 494, 494, 494]  # size of each hidden layer
kwargs_archi = {"observation_size": observation_size,
                "sizes": sizes,
                "activs": ["relu" for _ in sizes],  # ReLU activation for every hidden layer
                "list_attr_obs": li_attr_obs_X}

kwargs_converters = {"all_actions": None,
                     "set_line_status": False,
                     "change_bus_vect": True,
                     "set_topo_vect": False}
# define the name of the model
nm_ = "Deep_SARSA_Agent"
try:
    train(env,
          name=nm_,
          iterations=1000000,
          save_path="./D_SARSA_Agent/model",
          load_path=None,
          logs_dir="./D_SARSA_Agent/logs",
          training_param=tp,
          kwargs_converters=kwargs_converters,
          kwargs_archi=kwargs_archi)
finally:
    env.close()

  1%|▋                                                                       | 9999/1000000 [07:59<13:28:12, 20.42it/s]

hardest scenarios
[143 109  92  93  94  95  96  97  98  99]
They have been chosen respectively
[1 1 1 1 1 1 1 1 1 1]
The number of timesteps played is
[40 40  8 28 16 17 36 31 20 40]
avg (accross all scenarios) number of timsteps played 17.302083333333332
Time alive: [20.  20.   4.  14.   8.   8.5 18.  15.5 10.  20. ]
Avg time alive: 8.668402777777779


  2%|█▍                                                                     | 19998/1000000 [16:02<12:07:03, 22.46it/s]

hardest scenarios
[143 184 190 189 188 187 186 185 183 175]
They have been chosen respectively
[1 1 1 1 1 1 1 1 1 1]
The number of timesteps played is
[40 12 12 48 60 53 64 21 20 28]
avg (accross all scenarios) number of timsteps played 34.40972222222222
Time alive: [20.   6.   6.  24.  30.  26.5 32.  10.5 10.  14. ]
Avg time alive: 17.22222222222222


  3%|██▏                                                                    | 29999/1000000 [23:53<12:12:00, 22.09it/s]

hardest scenarios
[287 204 211 210 209 208 207 206 205 203]
They have been chosen respectively
[1 1 1 1 1 1 1 1 1 1]
The number of timesteps played is
[ 63  44  29  20  28  24  28 148  65  76]
avg (accross all scenarios) number of timsteps played 51.05902777777778
Time alive: [31.5 22.  14.5 10.  14.  12.  14.  74.  32.5 38. ]
Avg time alive: 25.546875


  4%|██▊                                                                    | 39999/1000000 [31:42<11:33:05, 23.08it/s]

hardest scenarios
[287 382 312 313 314 315 316 317 318 319]
They have been chosen respectively
[1 1 1 1 1 1 1 1 1 1]
The number of timesteps played is
[ 63 223 705 816 244 104  32 212 232 196]
avg (accross all scenarios) number of timsteps played 67.63541666666667
Time alive: [ 31.5 111.5 352.5 408.  122.   52.   16.  106.  116.   98. ]
Avg time alive: 33.83506944444444


  5%|███▌                                                                   | 49999/1000000 [39:21<11:08:38, 23.68it/s]

hardest scenarios
[287 303 293 294 295 296 297 298 299 300]
They have been chosen respectively
[1 1 1 1 1 1 1 1 1 1]
The number of timesteps played is
[ 63 261 377 168  20  60 322  96 228 436]
avg (accross all scenarios) number of timsteps played 85.94618055555556
Time alive: [ 31.5 130.5 188.5  84.   10.   30.  161.   48.  114.  218. ]
Avg time alive: 42.990451388888886


  6%|████▎                                                                  | 59998/1000000 [47:00<11:00:46, 23.71it/s]

hardest scenarios
[287 284 273 274 275 276 277 278 279 280]
They have been chosen respectively
[1 1 1 1 1 1 1 1 1 1]
The number of timesteps played is
[ 63 235 132  16 193 276  56  40 271 211]
avg (accross all scenarios) number of timsteps played 103.85243055555556
Time alive: [ 31.5 117.5  66.    8.   96.5 138.   28.   20.  135.5 105.5]
Avg time alive: 51.943576388888886


  7%|████▉                                                                  | 70000/1000000 [55:00<11:00:05, 23.48it/s]

hardest scenarios
[287 221 207 208 209 210 211 212 213 214]
They have been chosen respectively
[1 1 1 1 1 1 1 1 1 1]
The number of timesteps played is
[ 63  22  28  24  28  20  29 188   7  19]
avg (accross all scenarios) number of timsteps played 121.47395833333333
Time alive: [31.5 11.  14.  12.  14.  10.  14.5 94.   3.5  9.5]
Avg time alive: 60.75434027777778


  8%|█████▌                                                               | 80000/1000000 [1:02:46<10:43:41, 23.82it/s]

hardest scenarios
[16  9 19 18 17 15 14 13 11 10]
They have been chosen respectively
[2 2 2 2 2 2 2 2 2 2]
The number of timesteps played is
[144  45  17 145 549 124 232 889  80 190]
avg (accross all scenarios) number of timsteps played 138.54513888888889
Time alive: [ 48.          15.           5.66666667  48.33333333 183.
  41.33333333  77.33333333 296.33333333  26.66666667  63.33333333]
Avg time alive: 67.95515046296296


  9%|██████▏                                                              | 89999/1000000 [1:10:24<10:43:39, 23.56it/s]

hardest scenarios
[36 22 24 25 26 27 29 30 31 32]
They have been chosen respectively
[2 2 2 2 2 2 2 2 2 2]
The number of timesteps played is
[ 318  479  940  530   89   46  328  109  761 1081]
avg (accross all scenarios) number of timsteps played 154.50173611111111
Time alive: [106.         159.66666667 313.33333333 176.66666667  29.66666667
  15.33333333 109.33333333  36.33333333 253.66666667 360.33333333]
Avg time alive: 72.69733796296298


 10%|██████▉                                                              | 99998/1000000 [1:18:04<10:34:12, 23.65it/s]

hardest scenarios
[64 39 58 57 56 55 54 53 52 50]
They have been chosen respectively
[2 2 2 2 2 2 2 2 2 2]
The number of timesteps played is
[ 469  774  239  252 1327  536 1396  391  105  310]
avg (accross all scenarios) number of timsteps played 172.99131944444446
Time alive: [156.33333333 258.          79.66666667  84.         442.33333333
 178.66666667 465.33333333 130.33333333  35.         103.33333333]
Avg time alive: 78.20052083333333


 11%|███████▍                                                            | 109999/1000000 [1:25:43<12:46:28, 19.35it/s]

hardest scenarios
[97 67 79 78 77 76 75 73 72 71]
They have been chosen respectively
[2 2 2 2 2 2 2 2 2 2]
The number of timesteps played is
[261  69 669  52 524 549 235 196  55  83]
avg (accross all scenarios) number of timsteps played 190.953125
Time alive: [ 87.          23.         223.          17.33333333 174.66666667
 183.          78.33333333  65.33333333  18.33333333  27.66666667]
Avg time alive: 83.57204861111111


 12%|████████▏                                                           | 120000/1000000 [1:33:25<10:17:20, 23.76it/s]

hardest scenarios
[143  93  95  96  97  98  99 100 101 102]
They have been chosen respectively
[2 2 2 2 2 2 2 2 2 2]
The number of timesteps played is
[ 66  32 680 249 261 341  45  57 288 875]
avg (accross all scenarios) number of timsteps played 208.07638888888889
Time alive: [ 22.          10.66666667 226.66666667  83.          87.
 113.66666667  15.          19.          96.         291.66666667]
Avg time alive: 88.68344907407408


 13%|████████▊                                                           | 129999/1000000 [1:41:13<10:21:17, 23.34it/s]

hardest scenarios
[143 141 131 132 133 134 135 136 137 138]
They have been chosen respectively
[2 2 2 2 2 2 2 2 2 2]
The number of timesteps played is
[ 66 183 104 128  12  75 221 109  64  80]
avg (accross all scenarios) number of timsteps played 225.41493055555554
Time alive: [22.         61.         34.66666667 42.66666667  4.         25.
 73.66666667 36.33333333 21.33333333 26.66666667]
Avg time alive: 93.89756944444444


 14%|█████████▌                                                          | 139999/1000000 [1:48:58<10:03:20, 23.76it/s]

hardest scenarios
[143 173 160 161 162 163 164 165 166 167]
They have been chosen respectively
[2 2 2 2 2 2 2 2 2 2]
The number of timesteps played is
[ 66  30 532 261  31 138 721 296 264 238]
avg (accross all scenarios) number of timsteps played 242.90104166666666
Time alive: [ 22.          10.         177.33333333  87.          10.33333333
  46.         240.33333333  98.66666667  88.          79.33333333]
Avg time alive: 98.81626157407408


 15%|██████████▏                                                         | 149998/1000000 [1:56:38<10:06:02, 23.38it/s]

hardest scenarios
[143 184 190 189 188 187 186 185 183 175]
They have been chosen respectively
[2 2 2 2 2 2 2 2 2 2]
The number of timesteps played is
[ 66 613  29 111  67 284  79 236  76 270]
avg (accross all scenarios) number of timsteps played 259.0642361111111
Time alive: [ 22.         204.33333333   9.66666667  37.          22.33333333
  94.66666667  26.33333333  78.66666667  25.33333333  90.        ]
Avg time alive: 102.82783564814815


 16%|███████████                                                          | 159999/1000000 [2:04:17<9:50:16, 23.72it/s]

hardest scenarios
[287 205 212 211 210 209 208 207 206 204]
They have been chosen respectively
[2 2 2 2 2 2 2 2 2 2]
The number of timesteps played is
[215 278 284 145  24 768 451 235 503 107]
avg (accross all scenarios) number of timsteps played 276.74131944444446
Time alive: [ 71.66666667  92.66666667  94.66666667  48.33333333   8.
 256.         150.33333333  78.33333333 167.66666667  35.66666667]
Avg time alive: 105.51446759259258


 17%|███████████▌                                                        | 169998/1000000 [2:11:57<12:00:18, 19.20it/s]

hardest scenarios
[287 347 339 340 341 342 343 344 345 346]
They have been chosen respectively
[2 2 2 2 2 2 2 2 2 2]
The number of timesteps played is
[ 215    4   19  509  804 1092  524  130 1013  543]
avg (accross all scenarios) number of timsteps played 295.0138888888889
Time alive: [ 71.66666667   1.33333333   6.33333333 169.66666667 268.
 364.         174.66666667  43.33333333 337.66666667 181.        ]
Avg time alive: 110.11863425925925


 18%|████████████▍                                                        | 179999/1000000 [2:19:37<9:52:36, 23.06it/s]

hardest scenarios
[287 382 312 313 314 315 316 317 318 319]
They have been chosen respectively
[2 2 2 2 2 2 2 2 2 2]
The number of timesteps played is
[ 215  614  797  820 1051  120   49  219  259  283]
avg (accross all scenarios) number of timsteps played 312.171875
Time alive: [ 71.66666667 204.66666667 265.66666667 273.33333333 350.33333333
  40.          16.33333333  73.          86.33333333  94.33333333]
Avg time alive: 114.67158564814814


 19%|█████████████                                                        | 189998/1000000 [2:27:15<9:33:33, 23.54it/s]

hardest scenarios
[287 303 293 294 295 296 297 298 299 300]
They have been chosen respectively
[2 2 2 2 2 2 2 2 2 2]
The number of timesteps played is
[ 215  497 1755  309  267   64  398  105  232  858]
avg (accross all scenarios) number of timsteps played 329.3645833333333
Time alive: [ 71.66666667 165.66666667 585.         103.          89.
  21.33333333 132.66666667  35.          77.33333333 286.        ]
Avg time alive: 117.75173611111111


 20%|█████████████▊                                                       | 199999/1000000 [2:35:00<9:27:01, 23.51it/s]

hardest scenarios
[287 269 257 258 259 260 261 262 263 264]
They have been chosen respectively
[2 2 2 2 2 2 2 2 2 2]
The number of timesteps played is
[215 604 834 547 179 315 983 779 697 251]
avg (accross all scenarios) number of timsteps played 346.1458333333333
Time alive: [ 71.66666667 201.33333333 278.         182.33333333  59.66666667
 105.         327.66666667 259.66666667 232.33333333  83.66666667]
Avg time alive: 119.75376157407406


 21%|██████████████▍                                                      | 210000/1000000 [2:42:43<9:20:53, 23.47it/s]

hardest scenarios
[287 242 229 230 231 232 233 234 235 236]
They have been chosen respectively
[2 2 2 2 2 2 2 2 2 2]
The number of timesteps played is
[215 784 178 103 539  35  25  57 344  50]
avg (accross all scenarios) number of timsteps played 364.40625
Time alive: [ 71.66666667 261.33333333  59.33333333  34.33333333 179.66666667
  11.66666667   8.33333333  19.         114.66666667  16.66666667]
Avg time alive: 124.4950810185185


 22%|███████████████▏                                                     | 219998/1000000 [2:50:23<9:12:39, 23.52it/s]

hardest scenarios
[287 227 213 214 215 216 217 218 219 220]
They have been chosen respectively
[2 2 2 2 2 2 2 2 2 2]
The number of timesteps played is
[ 215   89   11  446 1060  527  187   60  823  207]
avg (accross all scenarios) number of timsteps played 381.59027777777777
Time alive: [ 71.66666667  29.66666667   3.66666667 148.66666667 353.33333333
 175.66666667  62.33333333  20.         274.33333333  69.        ]
Avg time alive: 129.89091435185185


 23%|███████████████▊                                                     | 230000/1000000 [2:58:07<9:04:41, 23.56it/s]

hardest scenarios
[287 191 197 196 195 194 193 192 190 199]
They have been chosen respectively
[2 2 2 2 2 2 2 2 2 2]
The number of timesteps played is
[215  28 248  95 303  66 255 235  29 239]
avg (accross all scenarios) number of timsteps played 398.48090277777777
Time alive: [ 71.66666667   9.33333333  82.66666667  31.66666667 101.
  22.          85.          78.33333333   9.66666667  79.66666667]
Avg time alive: 132.90943287037035


 24%|████████████████▌                                                    | 240000/1000000 [3:05:50<8:52:14, 23.80it/s]

hardest scenarios
[30 15 29 28 27 26 25 24 23 22]
They have been chosen respectively
[3 3 3 3 3 3 3 3 3 3]
The number of timesteps played is
[ 109  323 1038  224 1043   96  761 1172  376  494]
avg (accross all scenarios) number of timsteps played 415.85069444444446
Time alive: [ 27.25  80.75 259.5   56.   260.75  24.   190.25 293.    94.   123.5 ]
Avg time alive: 136.01938657407405


 25%|█████████████████▏                                                   | 249998/1000000 [3:13:29<8:47:39, 23.69it/s]

hardest scenarios
[40 42 29 30 31 32 34 35 36 37]
They have been chosen respectively
[3 3 3 3 3 3 3 3 3 3]
The number of timesteps played is
[1174  264 1038 1484  882 1199  530 1455  546 1611]
avg (accross all scenarios) number of timsteps played 433.7986111111111
Time alive: [293.5   66.   259.5  371.   220.5  299.75 132.5  363.75 136.5  402.75]
Avg time alive: 138.80280671296293


 26%|█████████████████▉                                                   | 260000/1000000 [3:21:13<8:39:46, 23.73it/s]

hardest scenarios
[97 52 77 76 75 74 72 71 70 69]
They have been chosen respectively
[3 3 3 3 3 3 3 3 3 3]
The number of timesteps played is
[ 407  129  768  972  243  973  272   99 1165   44]
avg (accross all scenarios) number of timsteps played 450.5208333333333
Time alive: [101.75  32.25 192.   243.    60.75 243.25  68.    24.75 291.25  11.  ]
Avg time alive: 140.3304398148148


 27%|██████████████████▋                                                  | 269998/1000000 [3:28:57<8:34:45, 23.64it/s]

hardest scenarios
[143  71  91  92  93  94  95  96  97  98]
They have been chosen respectively
[3 3 3 3 3 3 3 3 3 3]
The number of timesteps played is
[ 66  99 702 531 249 829 838 360 407 434]
avg (accross all scenarios) number of timsteps played 468.62152777777777
Time alive: [ 16.5   24.75 175.5  132.75  62.25 207.25 209.5   90.   101.75 108.5 ]
Avg time alive: 143.0687210648148


 28%|███████████████████▎                                                 | 280000/1000000 [3:36:32<8:33:28, 23.37it/s]

hardest scenarios
[143 108 101 102 103 104 105 106 107 109]
They have been chosen respectively
[3 3 3 3 3 3 3 3 3 3]
The number of timesteps played is
[ 302  112  567 1011  307 1888 1694  367 1559  136]
avg (accross all scenarios) number of timsteps played 485.21527777777777
Time alive: [ 75.5   28.   141.75 252.75  76.75 472.   423.5   91.75 389.75  34.  ]
Avg time alive: 146.63888888888889


 29%|████████████████████                                                 | 290000/1000000 [3:44:17<8:25:00, 23.43it/s]

hardest scenarios
[143 125 127 128 129 130 131 132 133 134]
They have been chosen respectively
[3 3 3 3 3 3 3 3 3 3]
The number of timesteps played is
[302 282 542 194 295 571 192 707 406 639]
avg (accross all scenarios) number of timsteps played 503.40972222222223
Time alive: [ 75.5   70.5  135.5   48.5   73.75 142.75  48.   176.75 101.5  159.75]
Avg time alive: 150.03949652777777


 30%|████████████████████▋                                                | 299998/1000000 [3:52:04<8:30:13, 22.87it/s]

hardest scenarios
[143 176 163 164 165 166 167 168 169 170]
They have been chosen respectively
[3 3 3 3 3 3 3 3 3 3]
The number of timesteps played is
[302 390 258 736 305 916 257 822 127 112]
avg (accross all scenarios) number of timsteps played 520.0225694444445
Time alive: [ 75.5   97.5   64.5  184.    76.25 229.    64.25 205.5   31.75  28.  ]
Avg time alive: 151.62037037037035


 31%|█████████████████████▍                                               | 309999/1000000 [3:59:44<8:05:14, 23.70it/s]

hardest scenarios
[287 217 197 196 195 194 193 192 191 190]
They have been chosen respectively
[3 3 3 3 3 3 3 3 3 3]
The number of timesteps played is
[ 222  195  466  226  807 1051 1348 1019  220  531]
avg (accross all scenarios) number of timsteps played 537.8628472222222
Time alive: [ 55.5   48.75 116.5   56.5  201.75 262.75 337.   254.75  55.   132.75]
Avg time alive: 153.17491319444446


 32%|██████████████████████                                               | 319999/1000000 [4:07:27<7:59:04, 23.66it/s]

hardest scenarios
[287 220 228 227 226 225 224 223 222 221]
They have been chosen respectively
[3 3 3 3 3 3 3 3 3 3]
The number of timesteps played is
[222 294 568 184 274 123 115 407 197 105]
avg (accross all scenarios) number of timsteps played 554.8350694444445
Time alive: [ 55.5   73.5  142.    46.    68.5   30.75  28.75 101.75  49.25  26.25]
Avg time alive: 154.2941261574074


 33%|██████████████████████▊                                              | 329998/1000000 [4:15:10<9:29:12, 19.62it/s]

hardest scenarios
[287 332 323 324 325 326 327 328 329 330]
They have been chosen respectively
[3 3 3 3 3 3 3 3 3 3]
The number of timesteps played is
[ 222  567 1095  322  717  453  187  980 1199 1181]
avg (accross all scenarios) number of timsteps played 572.7170138888889
Time alive: [ 55.5  141.75 273.75  80.5  179.25 113.25  46.75 245.   299.75 295.25]
Avg time alive: 157.1997974537037


 34%|███████████████████████▍                                             | 339998/1000000 [4:22:52<8:00:22, 22.90it/s]

hardest scenarios
[287 306 296 297 298 299 300 301 302 303]
They have been chosen respectively
[3 3 3 3 3 3 3 3 3 3]
The number of timesteps played is
[ 222 1249   82  449  109 1101 1233  268  738  504]
avg (accross all scenarios) number of timsteps played 589.21875
Time alive: [ 55.5  312.25  20.5  112.25  27.25 275.25 308.25  67.   184.5  126.  ]
Avg time alive: 157.6161747685185


 35%|████████████████████████▏                                            | 350000/1000000 [4:30:32<7:40:13, 23.54it/s]

hardest scenarios
[287 285 274 275 276 277 278 279 280 281]
They have been chosen respectively
[3 3 3 3 3 3 3 3 3 3]
The number of timesteps played is
[ 222  439 1385  351 1102  372  611 2142  386  241]
avg (accross all scenarios) number of timsteps played 607.3072916666666
Time alive: [ 55.5  109.75 346.25  87.75 275.5   93.   152.75 535.5   96.5   60.25]
Avg time alive: 159.54571759259258


 36%|████████████████████████▊                                            | 359999/1000000 [4:38:15<7:28:36, 23.78it/s]

hardest scenarios
[287 240 242 243 244 245 246 247 248 249]
They have been chosen respectively
[3 3 3 3 3 3 3 3 3 3]
The number of timesteps played is
[ 222  377  808 1048  311  539  544  662  724  215]
avg (accross all scenarios) number of timsteps played 624.5034722222222
Time alive: [ 55.5   94.25 202.   262.    77.75 134.75 136.   165.5  181.    53.75]
Avg time alive: 161.57118055555554


 37%|█████████████████████████▌                                           | 370000/1000000 [4:45:53<7:32:26, 23.21it/s]

hardest scenarios
[287 318 224 225 226 227 228 229 230 231]
They have been chosen respectively
[3 3 3 3 3 3 3 3 3 3]
The number of timesteps played is
[222 331 115 123 274 184 568 202 451 595]
avg (accross all scenarios) number of timsteps played 641.8454861111111
Time alive: [ 55.5   82.75  28.75  30.75  68.5   46.   142.    50.5  112.75 148.75]
Avg time alive: 164.24146412037035


 38%|██████████████████████████▏                                          | 380000/1000000 [4:53:33<7:17:30, 23.62it/s]

hardest scenarios
[287 206 212 211 210 209 208 207 205 231]
They have been chosen respectively
[3 3 3 3 3 3 3 3 3 3]
The number of timesteps played is
[ 222  527  489  224  209  824 1500  274  370  595]
avg (accross all scenarios) number of timsteps played 659.1892361111111
Time alive: [ 55.5  131.75 122.25  56.    52.25 206.   375.    68.5   92.5  148.75]
Avg time alive: 166.55295138888889


 39%|██████████████████████████▉                                          | 389998/1000000 [5:01:16<7:08:55, 23.70it/s]

hardest scenarios
[14  7 13 12 11  9  8 10  6  5]
They have been chosen respectively
[4 4 4 4 4 4 4 4 4 4]
The number of timesteps played is
[ 732  408  961  566  920  220  743  432 2507  504]
avg (accross all scenarios) number of timsteps played 676.7239583333334
Time alive: [146.4  81.6 192.2 113.2 184.   44.  148.6  86.4 501.4 100.8]
Avg time alive: 168.25703125


 40%|███████████████████████████▌                                         | 399999/1000000 [5:08:57<7:05:04, 23.53it/s]

hardest scenarios
[34 31 22 23 24 25 27 28 29 30]
They have been chosen respectively
[4 4 4 4 4 4 4 4 4 4]
The number of timesteps played is
[1006 2574  523  498 1333  834 1167  252 1193 1938]
avg (accross all scenarios) number of timsteps played 693.859375
Time alive: [201.2 514.8 104.6  99.6 266.6 166.8 233.4  50.4 238.6 387.6]
Avg time alive: 169.75190972222222


 41%|████████████████████████████▎                                        | 409998/1000000 [5:16:39<7:04:57, 23.14it/s]

hardest scenarios
[66 54 62 61 60 59 58 57 56 55]
They have been chosen respectively
[4 4 4 4 4 4 4 4 4 4]
The number of timesteps played is
[1757 2138  890  760  823   78 1277  301 1886  675]
avg (accross all scenarios) number of timsteps played 711.5868055555555
Time alive: [351.4 427.6 178.  152.  164.6  15.6 255.4  60.2 377.2 135. ]
Avg time alive: 171.00711805555557


 42%|████████████████████████████▉                                        | 420000/1000000 [5:24:20<6:50:02, 23.57it/s]

hardest scenarios
[100  74  87  86  85  84  83  82  81  80]
They have been chosen respectively
[4 4 4 4 4 4 4 4 4 4]
The number of timesteps played is
[ 301  981 2012  419  301  971  894  835 1306 1193]
avg (accross all scenarios) number of timsteps played 729.0954861111111
Time alive: [ 60.2 196.2 402.4  83.8  60.2 194.2 178.8 167.  261.2 238.6]
Avg time alive: 172.59505208333334


 43%|█████████████████████████████▋                                       | 429998/1000000 [5:31:59<6:39:46, 23.76it/s]

hardest scenarios
[143 110  93  94  95  96  97  98  99 100]
They have been chosen respectively
[4 4 4 4 4 4 4 4 4 4]
The number of timesteps played is
[1401  879 1631 1104 1539  564 1064  518  955  301]
avg (accross all scenarios) number of timsteps played 745.9427083333334
Time alive: [280.2 175.8 326.2 220.8 307.8 112.8 212.8 103.6 191.   60.2]
Avg time alive: 174.6756076388889


 44%|██████████████████████████████▎                                      | 439998/1000000 [5:39:38<6:41:08, 23.27it/s]

hardest scenarios
[143 127 118 119 120 121 122 123 124 125]
They have been chosen respectively
[4 4 4 4 4 4 4 4 4 4]
The number of timesteps played is
[1401  758 1550  463  773  688  725  543  788  298]
avg (accross all scenarios) number of timsteps played 763.8298611111111
Time alive: [280.2 151.6 310.   92.6 154.6 137.6 145.  108.6 157.6  59.6]
Avg time alive: 176.53098958333334


 45%|███████████████████████████████                                      | 449998/1000000 [5:47:20<6:25:17, 23.79it/s]

hardest scenarios
[143 145 147 148 149 150 151 152 153 154]
They have been chosen respectively
[4 4 4 4 4 4 4 4 4 4]
The number of timesteps played is
[1401 1345 1489  777  834  296 2764  630 1535  851]
avg (accross all scenarios) number of timsteps played 779.9270833333334
Time alive: [280.2 269.  297.8 155.4 166.8  59.2 552.8 126.  307.  170.2]
Avg time alive: 177.80208333333334


 46%|███████████████████████████████▋                                     | 460000/1000000 [5:55:01<6:18:22, 23.79it/s]

hardest scenarios
[143 165 167 168 169 170 171 172 173 174]
They have been chosen respectively
[4 4 4 4 4 4 4 4 4 4]
The number of timesteps played is
[1401 1977  410  826  358  636  225  525  351  135]
avg (accross all scenarios) number of timsteps played 798.4184027777778
Time alive: [280.2 395.4  82.  165.2  71.6 127.2  45.  105.   70.2  27. ]
Avg time alive: 179.88081597222222


 47%|████████████████████████████████▍                                    | 469998/1000000 [6:02:46<6:17:23, 23.41it/s]

hardest scenarios
[287 208 215 214 213 212 211 210 209 207]
They have been chosen respectively
[4 4 4 4 4 4 4 4 4 4]
The number of timesteps played is
[ 230 1625 1315  808   78  760  583  217  836  373]
avg (accross all scenarios) number of timsteps played 815.2534722222222
Time alive: [ 46.  325.  263.  161.6  15.6 152.  116.6  43.4 167.2  74.6]
Avg time alive: 179.52986111111113


 48%|█████████████████████████████████                                    | 479998/1000000 [6:10:26<6:05:16, 23.73it/s]

hardest scenarios
[287 226 340 341 342 343 344 345 346 227]
They have been chosen respectively
[4 4 4 4 4 4 4 4 4 4]
The number of timesteps played is
[ 230  441  757 1068 2231  784 1127 1052  719  428]
avg (accross all scenarios) number of timsteps played 833.234375
Time alive: [ 46.   88.2 151.4 213.6 446.2 156.8 225.4 210.4 143.8  85.6]
Avg time alive: 180.9650173611111


 49%|█████████████████████████████████▊                                   | 490000/1000000 [6:18:07<6:50:15, 20.72it/s]

hardest scenarios
[287 328 319 320 321 322 323 324 325 326]
They have been chosen respectively
[4 4 4 4 4 4 4 4 4 4]
The number of timesteps played is
[ 230 1275 1224  726 1296  937 1295  584 1056  880]
avg (accross all scenarios) number of timsteps played 850.4635416666666
Time alive: [ 46.  255.  244.8 145.2 259.2 187.4 259.  116.8 211.2 176. ]
Avg time alive: 183.0246527777778


 50%|██████████████████████████████████▌                                  | 500000/1000000 [6:25:47<6:01:53, 23.03it/s]

hardest scenarios
[287 309 299 300 301 302 303 304 305 306]
They have been chosen respectively
[4 4 4 4 4 4 4 4 4 4]
The number of timesteps played is
[ 230  490 1333 1257  291  815  512  298  655 1292]
avg (accross all scenarios) number of timsteps played 867.5954861111111
Time alive: [ 46.   98.  266.6 251.4  58.2 163.  102.4  59.6 131.  258.4]
Avg time alive: 184.22925347222224


 51%|███████████████████████████████████▏                                 | 509999/1000000 [6:33:29<5:44:15, 23.72it/s]

hardest scenarios
[287 127 269 270 271 272 273 274 275 276]
They have been chosen respectively
[4 4 4 4 4 4 4 4 4 4]
The number of timesteps played is
[ 230  758 1504  546  816 1345 1303 1517  538 1193]
avg (accross all scenarios) number of timsteps played 885.1059027777778
Time alive: [ 46.  151.6 300.8 109.2 163.2 269.  260.6 303.4 107.6 238.6]
Avg time alive: 184.48906250000002


 52%|███████████████████████████████████▉                                 | 519999/1000000 [6:41:10<5:40:57, 23.46it/s]

hardest scenarios
[287 213 245 246 247 248 249 250 251 252]
They have been chosen respectively
[4 4 4 4 4 4 4 4 4 4]
The number of timesteps played is
[ 230   78  763  776  743 2017  231 1381  379 1118]
avg (accross all scenarios) number of timsteps played 902.6041666666666
Time alive: [ 46.   15.6 152.6 155.2 148.6 403.4  46.2 276.2  75.8 223.6]
Avg time alive: 186.5287326388889


 53%|████████████████████████████████████▌                                | 530000/1000000 [6:48:51<5:31:34, 23.62it/s]

hardest scenarios
[287 251 221 222 223 224 225 226 227 228]
They have been chosen respectively
[4 4 4 4 4 4 4 4 4 4]
The number of timesteps played is
[ 230  379  658  428  490  424  438  441  428 1433]
avg (accross all scenarios) number of timsteps played 919.6875
Time alive: [ 46.   75.8 131.6  85.6  98.   84.8  87.6  88.2  85.6 286.6]
Avg time alive: 187.32899305555554


 53%|████████████████████████████████████                                | 531007/1000000 [6:49:40<13:35:28,  9.59it/s]
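
Each block in the log above reports, for the ten hardest scenarios, how many times each was chosen, the number of timesteps played, and a "Time alive" value. The code that prints these lines is not shown in this section, so the relation below is inferred from the printed numbers only: every "Time alive" entry appears to equal the timesteps played divided by the number of times chosen plus one. A minimal check, using the hypothetical helper `time_alive` and values copied from two log blocks:

```python
def time_alive(timesteps_played, times_chosen):
    """Inferred relation (an assumption, not taken from the training code):
    time alive = timesteps played / (times chosen + 1)."""
    return [t / (c + 1) for t, c in zip(timesteps_played, times_chosen)]


# First log block: every scenario has been chosen once
first = time_alive([40, 40, 8, 28, 16, 17, 36, 31, 20, 40], [1] * 10)
print(first)

# Block logged at iteration 389998: every scenario has been chosen 4 times
later = time_alive([732, 408, 961, 566, 920, 220, 743, 432, 2507, 504], [4] * 10)
print(later)
```

Both printed lists can be compared against the corresponding "Time alive" lines in the log.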

## The code snippet below evaluates the saved agent and returns the loaded agent together with the runner results
If ``verbose=True``, a detailed log of the process is printed.

In [14]:
env = grid2op.make("l2rpn_neurips_2020_track1_small", reward_class=L2RPNReward)
evaluate(env,
         name="Deep_SARSA_Agent",
         load_path="./D_SARSA_Agent/model",
         logs_path="./D_SARSA_Agent/logs",
         nb_episode=10,
         nb_process=1,
         max_steps=-1,
         verbose=False,
         save_gif=False)

(<__main__.DeepQSimple at 0x238d617d790>,
 [('C:\\Users\\tejus_\\data_grid2op\\l2rpn_neurips_2020_track1_small\\chronics\\Scenario_april_000',
   'Scenario_april_000',
   31518.462890625,
   617,
   8062),
  ('C:\\Users\\tejus_\\data_grid2op\\l2rpn_neurips_2020_track1_small\\chronics\\Scenario_april_001',
   'Scenario_april_001',
   39090.1484375,
   780,
   8062),
  ('C:\\Users\\tejus_\\data_grid2op\\l2rpn_neurips_2020_track1_small\\chronics\\Scenario_april_002',
   'Scenario_april_002',
   65330.15625,
   1299,
   8062),
  ('C:\\Users\\tejus_\\data_grid2op\\l2rpn_neurips_2020_track1_small\\chronics\\Scenario_april_003',
   'Scenario_april_003',
   35800.421875,
   689,
   8062),
  ('C:\\Users\\tejus_\\data_grid2op\\l2rpn_neurips_2020_track1_small\\chronics\\Scenario_april_004',
   'Scenario_april_004',
   25113.173828125,
   484,
   8062),
  ('C:\\Users\\tejus_\\data_grid2op\\l2rpn_neurips_2020_track1_small\\chronics\\Scenario_april_005',
   'Scenario_april_005',
   12245.357421875,
