## Train an RL agent
This notebook focuses on the following topics:
 - training an RL agent,
 - defining a reward function,
 - defining a featurize function.

In this notebook, a reinforcement learning agent is trained to control the current flowing through an inductor.
A simple case shows how the agent can learn and be applied to an electrical power grid simulated with the Dare package.

The use case is shown in the figure below.
This environment consists of a single-phase electrical power grid with one source and one load connected via a cable.

![](figures/RL_single_agent.png "")

First, the environment is defined in the configuration shown in the figure.
For more information on how to set up an environment, see `Env_Create_DEMO.ipynb`.


In [1]:
using Dare
using ReinforcementLearning

In [2]:
# calculate the passive load for the desired setting / power rating
R_load, L_load, X, Z = Parallel_Load_Impedance(100e3, 1, 230)

# define grid using CM
CM = [0. 1.
    -1. 0.]

# Set parameters according to the figure above
parameters = Dict{Any, Any}(
    "source" => Any[
                    Dict{Any, Any}("pwr" => 200e3, "control_type" => "RL", "mode" => "user_def", "fltr" => "L"),
                    ],
    "load"   => Any[
                    Dict{Any, Any}("impedance" => "RL", "R" => R_load, "L" => L_load,"v_limit"=>1e4, "i_limit"=>1e4),
                    ],
    "grid" => Dict{Any, Any}("phase" => 1)
)

(1.058, Inf, Inf, 1.058 + 0.0im)

To teach the agent to control the current, it needs two pieces of information: the value the current should have (the reference value, provided via `featurize`) and how good the state reached by the chosen action is (provided via `reward_function`).

Therefore, the reference value has to be defined.
Here we use a constant value to keep the example simple.
Since the `reference(t)` function takes the simulation time as its argument, more complex, time-dependent signals could be defined as well.

In [None]:
function reference(t)
    return [1]
end
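
As an illustration of such time-dependent signals, the following minimal sketch shows two alternatives to the constant reference. The names `reference_step` and `reference_sine` and all parameter values are assumptions made for this example; they are not part of the Dare package. Both return a one-element vector like `reference(t)` above.

```julia
# Hypothetical step reference: jumps from `before` to `after` at time `t_step` (in seconds).
function reference_step(t; t_step = 0.05, before = 1.0, after = 2.0)
    return [t < t_step ? before : after]
end

# Hypothetical sinusoidal reference with the given amplitude and frequency `f` (in Hz).
function reference_sine(t; amplitude = 1.0, f = 50.0)
    return [amplitude * sin(2π * f * t)]
end

reference_step(0.0)    # value before the step
reference_step(0.06)   # value after the step
reference_sine(0.005)  # a quarter period of a 50 Hz sine, close to the amplitude
```

To use one of them, it would have to take the place of `reference(t)` in the `featurize` and reward functions, and its values should stay within the current limit used for normalization.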

Then the `featurize()` function is defined.
It has two jobs:
1. Add the reference value based on the control target which should be learned:
Here the signal generated by the `reference` function is added to the states given to the agent. This is necessary for the agent to learn that, in this case, the reward is maximized if the measured current matches the reference value.
This reference value has to be normalized appropriately so that it fits the range of the normalized states.

2. Hand over the states which should be known to the agent:
The environment can consist of more states than should be known to the agent. One reason, as in this example, is that the agent supplies a load which is, e.g., 1 km away from the source the agent controls.
In that case it is common that the agent has no knowledge of the load states, since no communication or exchange of measurements is assumed between the source and the load.
Another example is an electrical power grid that consists of more sources and loads, where the other sources are controlled by other agents or classic controllers. In that case, every controller / agent typically knows the states of the source it controls, but not the states of sources controlled by another agent / controller.

Both functionalities are implemented in the following featurize function:

In [2]:
function featurize(x0 = nothing, t0 = nothing; env = nothing, name = nothing)
    if !isnothing(name)
        state = env.state
        if name == "agent"
            # keep only the states the RL agent is allowed to observe
            state = state[findall(x -> x in env.state_ids_RL, env.state_ids)]
            # append the reference, normalized by the current limit of the source
            norm_ref = env.nc.parameters["source"][1]["i_limit"]
            state = vcat(state, reference(env.t)/norm_ref)
        end
    elseif isnothing(env)
        # no environment given: append zeros of the size of the reference to x0
        return vcat(x0, zeros(size(reference(t0))))
    else
        return env.state
    end
    return state
end

featurize (generic function with 3 methods)
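
The two jobs of `featurize` can be tried out on toy data without an environment. The following sketch is only an illustration: the vectors stand in for `env.state_ids`, `env.state_ids_RL` and `env.state`, and the state names `load1_v_C` and `load1_i_L`, the numeric values, the limit `i_limit_toy` and the helper `toy_reference` are all made up for this example.

```julia
# Toy stand-ins for env.state_ids, env.state_ids_RL and env.state (assumed values).
state_ids_toy    = ["source1_i_L1", "load1_v_C", "load1_i_L"]
state_ids_RL_toy = ["source1_i_L1"]      # states the agent may observe
state_toy        = [0.2, 0.5, -0.1]      # normalized states
i_limit_toy      = 10.0                  # assumed current limit used for normalization
toy_reference(t) = [1]                   # constant reference, as in this notebook

# Job 2: keep only the states known to the agent.
visible = state_toy[findall(x -> x in state_ids_RL_toy, state_ids_toy)]

# Job 1: append the reference, normalized to the range of the states.
observation = vcat(visible, toy_reference(0.0) / i_limit_toy)
# observation == [0.2, 0.1]
```

The agent then sees one measured state and one reference value, while the two load states stay hidden.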

Before the environment is defined, the `reward_function()` has to be defined to give the agent feedback on how good the chosen action was.
First, the state to be controlled is taken from the current environment states.
Since the states are normalized by the limits the electrical components can handle, an absolute value greater than `1` means that the state limit is exceeded, which typically leads to a system crash.

Therefore, it is first checked whether the absolute value of the measured state is greater than `1`. In that case a punishment is returned, which is chosen here as `r = -1`.
If the controlled state is within the valid state space, the reward is calculated based on the error between the desired reference value and the measured state value.
If these values are the same, meaning the agent perfectly fulfills the control task, a reward of `r = 1` is returned to the agent ($r \in [-1, 1]$).
If the measured value differs from the reference, the error, which in this example is the square root of the halved absolute error (abbreviated MRE here), is subtracted from the maximal reward: `r = 1 - MRE`:

$r = 1 - \sqrt{\frac{|i_\mathrm{L,ref} - i_\mathrm{L1}|}{2}}$

To keep the reward in the desired range, the current difference is divided by 2. In the worst case, the reference value equals the corresponding current limit, $i_\mathrm{L,ref} = i_\mathrm{lim}$, and the measured current equals the negative current limit, $i_\mathrm{L1} = -i_\mathrm{lim}$. Without the division by 2, more than 1 would then be subtracted.
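
A few sample values make the reward range concrete. The sketch below re-implements the formula above for scalar currents that are already normalized; `toy_reward` is a hypothetical helper written for this illustration and not part of the Dare package.

```julia
# Hypothetical scalar version of the reward; both arguments are normalized currents.
function toy_reward(i_ref, i_meas)
    # limit exceeded: return the punishment
    abs(i_meas) > 1 && return -1.0
    # otherwise subtract the square root of the halved absolute error from 1
    return 1 - sqrt(abs(i_ref - i_meas) / 2)
end

toy_reward(1.0, 1.0)    # 1.0, reference met exactly
toy_reward(0.5, 0.0)    # 0.5
toy_reward(1.0, -1.0)   # 0.0, worst case inside the limits
toy_reward(0.0, 1.5)    # -1.0, limit exceeded
```

Inside the valid state space the reward therefore lies between 0 and 1, and the value -1 is reserved for limit violations.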

In [None]:
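function reward_function(env, name = nothing)

    # pick the controlled state: the inductor current of source 1
    index_1 = findfirst(x -> x == "source1_i_L1", env.state_ids)
    state_to_control = [env.state[index_1]]

    if any(abs.(state_to_control) .> 1)
        # state limit exceeded: return the punishment
        return -1
    else
        refs = reference(env.t)
        norm_ref = env.nc.parameters["source"][1]["i_limit"]
        # reward based on the normalized control error
        r = 1 .- ((abs.(refs/norm_ref - state_to_control)/2).^0.5)
        return r
    end
end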

Then, the defined parameters and the featurize and reward functions are used to create an environment consisting of the electrical power grid. To keep this first learning example simple, the action passed to the environment is not delayed internally (`action_delay = 0`).

In [None]:
env = SimEnv(
    CM = CM, 
    parameters = parameters, 
    t_end = 0.1, 
    featurize = featurize, 
    reward_function = reward_function, 
    action_delay = 0)

In this example, a `Deep Deterministic Policy Gradient` agent (https://arxiv.org/abs/1509.02971, https://spinningup.openai.com/en/latest/algorithms/ddpg.html) is chosen, which can learn a control task on continuous state and action spaces.
It is configured inside the `setup_agents()` function, which takes the control types defined in the parameter dict and hands the correct indices to the corresponding controllers / agents.

TODO: Give agent(s) from the outside, automize indices used in reward

multi_agent = setup_agents(env)

The `setup_agents` function returns `multi_agent`, an instance of the Dare package type `MultiAgentGridController`. It contains the different agents and classic controllers and maps their actions to the corresponding sources defined by the indices in `setup_agents`.

In this example, `multi_agent` consists of only one RL agent. It can be trained for 20 episodes using the `learn()` function:

In [None]:
learn(multi_agent, env, num_episodes = 20)

Finally, a test case is run with `simulate()` and the results are plotted:

In [None]:

# plot the controlled inductor current (same state id as used in the reward function)
states_to_plot = ["source1_i_L1"]
action_to_plot = ["source1_u_a"]

hook = DataHook(collect_state_ids = states_to_plot)

# reuse the agent trained above instead of setting up a new, untrained one
simulate(multi_agent, env, hook=hook)

plot_hook_results(hook = hook,
                  states_to_plot  = states_to_plot)