# Single-Agent RL training and deployment

The following notebook provides an introduction to training a single reinforcement learning (RL) agent with the proximal policy optimization (PPO) algorithm. We use CommonPower to create a simulation of a power system, within which one node (corresponding to a multi-family household) is controlled by the RL agent. Since RL does not natively account for constraints, such as a minimum state of charge of a battery, we have implemented a safety layer that is wrapped around the RL agent. It extracts all necessary constraints from the power system model and checks whether a control action suggested by the agent is safe. If necessary, the safety layer adjusts the action before passing it on to the simulation. The agent then receives feedback informing it about the adjustment of its action.

Within this notebook, you will learn how to 
- use CommonPower to modularly construct a power system,
- set up an RL agent,
- assign nodes to this agent, 
- train the RL agent, and
- monitor the training process using Tensorboard.

## Before getting started
1. Make sure you have installed all necessary requirements as described in the `Readme.txt`.
2. Optional (only if you want to experiment with tracking training using Weights&Biases): Sign up for the academic version of Weights&Biases [here](https://wandb.ai/site/research).

## Important resources for further information
### Short introduction to RL
If you have never worked with RL before, we recommend reading the [OpenAI Spinning Up](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html) introduction to RL.
### PPO implementation
We use the RL algorithm implementations from the StableBaselines3 (SB3) repository. You can learn more about the repository and the available algorithms [here](https://stable-baselines3.readthedocs.io/en/master/guide/quickstart.html).
### Tensorboard
[Tensorboard](https://www.tensorflow.org/tensorboard/get_started) can be used to track training of any kind of network.
### Weights&Biases
Weights&Biases (W&B) is an alternative to Tensorboard with very nice visualizations and some advanced options. It helps you keep an overview of your experiments and compare different hyperparameter settings. Find more information in their [documentation](https://docs.wandb.ai/quickstart).

In [None]:
import pathlib
from datetime import timedelta  # used below for horizon, frequency, and resampling
import wandb
import matplotlib.pyplot as plt
from functools import partial
from commonpower.modelling import ModelHistory
from commonpower.core import System, Node, Bus
from commonpower.models.busses import *
from commonpower.models.components import *
from commonpower.models.powerflow import *
from commonpower.control.controllers import RLControllerSB3, OptimalController
from commonpower.control.safety_layer.safety_layers import ActionProjectionSafetyLayer
from commonpower.control.runners import SingleAgentTrainer, DeploymentRunner
from commonpower.control.wrappers import SingleAgentWrapper
from commonpower.control.logging.loggers import *
from commonpower.data_forecasting import *
from commonpower.utils.param_initialization import *
from stable_baselines3 import PPO
import tensorboard

## System set-up

First, we have to define the power system within which we want to control one node using an RL agent. 


In [None]:
horizon = timedelta(hours=24)
frequency = timedelta(minutes=60)
fixed_start = "27.11.2016"

# path to data profiles
current_path = pathlib.Path().absolute()
data_path = current_path / 'data' / '1-LV-rural2--1-sw'
data_path = data_path.resolve()

ds1 = CSVDataSource(data_path  / 'LoadProfile.csv',
            delimiter=";", 
            datetime_format="%d.%m.%Y %H:%M", 
            rename_dict={"time": "t", "H0-A_pload": "p", "H0-A_qload": "q"},
            auto_drop=True, 
            resample=timedelta(minutes=60))

ds2 = CSVDataSource(data_path / 'LoadProfile.csv',
            delimiter=";", 
            datetime_format="%d.%m.%Y %H:%M", 
            rename_dict={"time": "t", "G1-B_pload": "psib", "G1-C_pload": "psis", "G2-A_pload": "psi"},
            auto_drop=True, 
            resample=timedelta(minutes=60))

ds3 = CSVDataSource(data_path / 'RESProfile.csv', 
        delimiter=";", 
        datetime_format="%d.%m.%Y %H:%M", 
        rename_dict={"time": "t", "PV3": "p"},
        auto_drop=True, 
        resample=timedelta(minutes=60)).apply_to_column("p", lambda x: -x)

dp1 = DataProvider(ds1, LookBackForecaster(frequency=frequency, horizon=horizon))
dp2 = DataProvider(ds2, LookBackForecaster(frequency=frequency, horizon=horizon))
dp3 = DataProvider(ds3, PerfectKnowledgeForecaster(frequency=frequency, horizon=horizon))
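
For readers unfamiliar with the CSV layout, the following standard-library sketch mirrors what the arguments of `ds3` describe: a semicolon delimiter, the timestamp format `%d.%m.%Y %H:%M`, renaming `time`/`PV3` to `t`/`p`, and the sign flip applied to `p`. It is only an illustration and not how `CSVDataSource` is implemented. The two data rows are made up, the assumption that `auto_drop=True` discards columns that are not renamed is ours, and resampling is left out.

```python
import csv
import io
from datetime import datetime

# Hypothetical two-row excerpt in the same layout as RESProfile.csv (values made up).
raw = "time;PV3;PV4\n27.11.2016 00:00;0.0;0.0\n27.11.2016 12:00;0.42;0.40\n"

rename = {"time": "t", "PV3": "p"}
rows = []
for record in csv.DictReader(io.StringIO(raw), delimiter=";"):
    # Keep only the renamed columns (assumed meaning of auto_drop=True).
    row = {rename[key]: value for key, value in record.items() if key in rename}
    row["t"] = datetime.strptime(row["t"], "%d.%m.%Y %H:%M")
    row["p"] = -float(row["p"])  # same sign flip as the lambda applied to column "p"
    rows.append(row)

print(rows)
```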

In [None]:
# nodes
n1 = Bus("MultiFamilyHouse", {
    'p': (-50, 50),
    'q': (-50, 50),
    'v': (0.95, 1.05),
    'd': (-15, 15)
})

# trading unit with price data for buying and selling electricity (to reduce problem complexity, we assume that
# prices for selling and buying are the same --> TradingBusLinear)
m1 = TradingBusLinear("Trading1", {
    'p': (-50, 50),
    'q': (-50, 50)
}).add_data_provider(dp2)

# components
# energy storage system
capacity = 3  #kWh
e1 = ESSLinear("ESS1", {
    'rho': 0.1, 
    'p': (-1.5, 1.5), 
    'q': (0, 0), 
    'soc': (0.2 * capacity, 0.8 * capacity), 
    "soc_init": RangeInitializer(0.2 * capacity, 0.8 * capacity)
})

# photovoltaic with generation data
r1 = RenewableGen("PV1").add_data_provider(dp3)

# static load with data source
d1 = Load("Load1").add_data_provider(dp1)

# we first have to add the nodes to the system 
# and then add components to the node in order to obtain a tree-like structure
sys = System(power_flow_model=PowerBalanceModel()).add_node(n1).add_node(m1)

# add components to nodes
n1.add_node(d1).add_node(e1).add_node(r1)

# show system structure: 
sys.pprint()

## Setting up the RL Controller

We first set up a controller, then add the node we want to control. The system will be balanced through the market node, which is controlled by an optimal controller (handed over as `global_controller` when instantiating the `SingleAgentTrainer`).

Since RL controllers do not natively account for constraints (such as a limit on the state of charge of the storage system), we have to add a safety layer to the controller. The `ActionProjectionSafetyLayer` outputs an action that is as close as possible to the action suggested by the RL controller while also satisfying all constraints of the system. Every time the safety layer has to intervene, a penalty term is added to the reward of the RL agent. A `penalty_factor` is used to weight this penalty against the rest of the reward.
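
To make the projection idea concrete, here is a minimal sketch for a single battery with the bounds used in this notebook. It is not CommonPower's implementation: the safety layer extracts its constraints from the system model, whereas this sketch hard-codes a one-dimensional case in which projecting onto the feasible set reduces to clipping. The function names, the lossless update `soc + p * dt`, the convention that positive power charges the battery, and the penalty being proportional to the size of the correction are all assumptions made for illustration.

```python
def project_action(p_proposed, soc, soc_bounds=(0.6, 2.4), p_bounds=(-1.5, 1.5), dt_hours=1.0):
    """Return the feasible power closest to p_proposed (illustrative only)."""
    # Power range that keeps the next SOC inside its bounds (assumed lossless update).
    p_low = max(p_bounds[0], (soc_bounds[0] - soc) / dt_hours)
    p_high = min(p_bounds[1], (soc_bounds[1] - soc) / dt_hours)
    # In one dimension, the closest feasible point is obtained by clipping.
    return min(max(p_proposed, p_low), p_high)


def penalized_reward(reward, p_proposed, p_safe, penalty_factor=0.1):
    """Subtract a penalty proportional to the size of the correction (assumed form)."""
    return reward - penalty_factor * abs(p_safe - p_proposed)


# Charging with 1.5 kW for one hour from 2.0 kWh would exceed the upper bound of 2.4 kWh.
p_safe = project_action(p_proposed=1.5, soc=2.0)
print(p_safe, penalized_reward(-1.0, 1.5, p_safe))
```

An action that is already feasible passes through unchanged and incurs no penalty, so the agent is only penalized when the safety layer actually has to intervene.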

Furthermore, the node will have to buy electricity to even out its power balance. To inform the controller of the cost of electricity, we use the `price_callback` function which is linked to the market node controlled by the global controller.

In [None]:
agent1 = RLControllerSB3(
    name='agent1', 
    safety_layer=ActionProjectionSafetyLayer(penalty_factor=0.1)
)

We use the SB3 PPO implementation to train our RL agent and log the training progress using Tensorboard. If you want to try Weights&Biases for logging, you can uncomment the respective line. For more information on potential hyperparameters for PPO, check the [documentation](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html).

In [None]:
# specify a seed for the random number generator used during training to improve reproducibility of results
# (It is common to train with ~5 different random seeds when you are, for example, testing a new safeguarding
# approach. For this notebook, one seed is enough.)
training_seed = 42

# set up configuration for the PPO algorithm
alg_config = {}
alg_config['total_steps'] = 1500*int(horizon.total_seconds() // 3600)
alg_config['algorithm'] = PPO
alg_config['policy'] = 'MlpPolicy'
alg_config['device'] = 'cpu'
alg_config['n_steps'] = int(horizon.total_seconds() // 3600)
alg_config['learning_rate'] = 0.0008
alg_config['batch_size'] = 12

# set up logger
log_dir = './test_run/'
logger = TensorboardLogger(log_dir=log_dir)
# You can also use Weights&Biases to monitor training. If you uncomment the next line, make sure to exchange the 
# "entity_name" parameter!
# logger = WandBLogger(log_dir='./test_run/', entity_name="srl4ps", project_name="commonpower", alg_config=alg_config, callback=WandBSafetyCallback)
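
As a quick sanity check of the step counts configured above, the following sketch reproduces the arithmetic in plain Python. It assumes a single (non-vectorized) environment, so that one PPO rollout holds exactly `n_steps` transitions. The variable names are illustrative and not part of CommonPower or SB3.

```python
from datetime import timedelta

horizon = timedelta(hours=24)
steps_per_episode = int(horizon.total_seconds() // 3600)  # one step per hour

total_steps = 1500 * steps_per_episode  # total environment steps used for training
n_steps = steps_per_episode             # transitions collected before each policy update
batch_size = 12

n_rollouts = total_steps // n_steps              # number of policy updates
minibatches_per_epoch = n_steps // batch_size    # minibatches per pass over one rollout

print(steps_per_episode, total_steps, n_rollouts, minibatches_per_epoch)
```

With these settings, each rollout covers one simulated day, and every pass over the collected data splits the 24 transitions into two minibatches of 12.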

## Running the Training

To run the training, we first need to instantiate a runner object. The `SingleAgentWrapper` is used to make the system compatible with the Stable-Baselines3 PPO implementation. WARNING: Training will take a while (roughly two hours).

In [None]:
# specify the path where the model should be saved
model_path = "./saved_models/my_model"

runner = SingleAgentTrainer(
    sys=sys, 
    global_controller=agent1, 
    wrapper=SingleAgentWrapper, 
    alg_config=alg_config, 
    forecast_horizon=horizon,
    control_horizon=horizon,
    logger = logger,
    save_path = model_path, 
    seed = training_seed
)
runner.run(fixed_start=fixed_start)

### Training visualization
If you used the `TensorboardLogger`, you can plot the training metrics using the Tensorboard notebook magic. The most interesting charts for us are `safety/mean_eps_penalty`, `safety/n_action_correction`, `rollout/ep_reward_mean`, `train/loss`, and `train/explained_variance`. Think about what these charts tell you and discuss it!

In [None]:
%load_ext tensorboard
%tensorboard --logdir test_run

## Deploying the trained agent
After training the agent, you can deploy it on the system. This means that the neural network representing our controller will deterministically choose the best control input for the given observation according to its policy.
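
The difference between exploring during training and acting deterministically during deployment can be illustrated with a one-dimensional Gaussian policy. This is a conceptual sketch only and not SB3 or CommonPower code: `select_action` and its arguments are hypothetical names, and the mean and standard deviation stand in for the outputs of a trained policy network.

```python
import random


def select_action(mean, std, deterministic, rng):
    """Pick an action from a 1-D Gaussian policy (illustrative, not SB3 code)."""
    if deterministic:
        return mean  # deployment: always the most likely action
    return rng.gauss(mean, std)  # training: sample around the mean to keep exploring


rng = random.Random(0)
deployed = [select_action(0.3, 0.5, True, rng) for _ in range(3)]
explored = [select_action(0.3, 0.5, False, rng) for _ in range(3)]
print(deployed)
print(explored)
```

For the same observation, the deterministic variant returns the same action every time, which makes the deployment run repeatable.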

In [None]:
# Just for demonstration purposes, we show here how to load a pre-trained policy
# However, in the present case this would not be necessary, since "agent1" has saved the policy after training

# First, we need to create a new agent and pass the pretrained_policy_path from which to load the neural network 
# params. Adding n1 to this agent will create a warning since we are overwriting agent1, which is desired in this case
agent2 = RLControllerSB3(
    name="pretrained_agent", 
    safety_layer=ActionProjectionSafetyLayer(penalty_factor=0.1),
    pretrained_policy_path = model_path
)

# The deployment runner has to be instantiated with the same arguments used during training
# The runner will automatically recognize that it has to load the policy for agent2
# To ensure proper comparison of the trained RL agent with an optimal controller, we use the same seed for both
eval_seed = 5

rl_model_history = ModelHistory([sys])
rl_deployer = DeploymentRunner(
    sys=sys, 
    global_controller=agent2,  
    alg_config=alg_config,
    wrapper=SingleAgentWrapper,
    forecast_horizon=horizon,
    control_horizon=horizon,
    history=rl_model_history,
    seed = eval_seed
)
# Finally, we can simulate the system with the trained controller for the given day
rl_deployer.run(n_steps=24, fixed_start=fixed_start)
# let us extract some logs for comparison with an optimal controller
# We want to compare the cost of the household over the course of the day.
rl_power_import_cost = rl_model_history.get_history_for_element(m1, name='cost') # cost for buying electricity
rl_dispatch_cost = rl_model_history.get_history_for_element(n1, name='cost') # cost for operating the components in the household
rl_total_cost = [(rl_power_import_cost[t][0], rl_power_import_cost[t][1] + rl_dispatch_cost[t][1]) for t in range(len(rl_power_import_cost))]
rl_soc = rl_model_history.get_history_for_element(e1, name="soc") # state of charge of the battery

## Benchmarking Trained Agent and Optimal Controller
We want to compare the results of our trained agent with an optimal controller. 

In [None]:
# We can use the same system but we have to set up a new runner. 
# This time, the global controller will take over the control of the household
oc_model_history = ModelHistory([sys])
oc_deployer = DeploymentRunner(
    sys=sys, 
    global_controller=OptimalController('global'), 
    forecast_horizon=horizon,
    control_horizon=horizon,
    history=oc_model_history,
    seed = eval_seed
)

In [None]:
oc_deployer.run(n_steps=24, fixed_start=fixed_start)
# we retrieve logs for the system cost
oc_power_import_cost = oc_model_history.get_history_for_element(m1, name='cost') # cost for buying electricity
oc_dispatch_cost = oc_model_history.get_history_for_element(n1, name='cost') # cost for operating the components in the household
oc_total_cost = [(oc_power_import_cost[t][0], oc_power_import_cost[t][1] + oc_dispatch_cost[t][1]) for t in range(len(oc_power_import_cost))]
oc_soc = oc_model_history.get_history_for_element(e1, name="soc") # state of charge

In [None]:
# plotting the cost of RL agent and optimal controller
plt.plot(range(len(rl_total_cost)), [x[1] for x in rl_total_cost], label="Cost RL")
plt.plot(range(len(oc_total_cost)), [x[1] for x in oc_total_cost], label="Cost optimal control")
plt.xticks(ticks=range(len(rl_power_import_cost)), labels=[x[0] for x in rl_power_import_cost])
plt.xticks(rotation=45)
plt.xlabel("Timestamp")
plt.ylabel("Value")
plt.title("Comparison of household cost for RL and optimal controller")
plt.tight_layout()
plt.legend()
plt.show()

To make sure that both runs are comparable, we check that they started with the same initial SOC of the battery.
It is the only random element in the current system set-up.

In [None]:
# plotting the state of charge of the batteries
plt.plot(range(len(rl_soc)), [x[1] for x in rl_soc], label="SOC RL")
plt.plot(range(len(oc_soc)), [x[1] for x in oc_soc], label="SOC optimal control")
plt.xticks(ticks=range(len(rl_soc)), labels=[x[0] for x in rl_soc])
plt.xticks(rotation=45)
plt.xlabel("Timestamp")
plt.ylabel("Value")
plt.title("Comparison of battery state of charge (SOC) for RL and optimal controller")
plt.tight_layout()
plt.legend()
plt.show()
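
Besides reading the starting point off the plot, the initial values can be compared numerically. The sketch below uses made-up SOC histories in the `(timestamp, value)` layout that the indexing in the cells above implies. In the notebook, `rl_soc` and `oc_soc` would take the place of the two example lists.

```python
import math

# Hypothetical SOC histories, not results from this notebook.
rl_soc_example = [("27.11.2016 00:00", 1.37), ("27.11.2016 01:00", 1.10)]
oc_soc_example = [("27.11.2016 00:00", 1.37), ("27.11.2016 01:00", 0.60)]

# Compare only the first entry: both runs should start from the same SOC.
same_start = math.isclose(rl_soc_example[0][1], oc_soc_example[0][1], abs_tol=1e-6)
print("same initial SOC:", same_start)
```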

In [None]:
# Let's get the total cost for one day:
cost_day_rl = sum(cost for _, cost in rl_total_cost)
cost_day_oc = sum(cost for _, cost in oc_total_cost)
print(f"The daily cost \n a) with the RL controller: {cost_day_rl} \n b) with the optimal controller: {cost_day_oc}")
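
To express the difference between the two controllers as a single number, one option is the relative gap to the optimal controller. The helper below is a hypothetical convenience function, and the costs passed to it are made-up values rather than results from this notebook. In practice you would pass `cost_day_rl` and `cost_day_oc`.

```python
def relative_gap(cost_rl, cost_opt):
    """Relative extra cost of the RL controller compared with the optimal controller."""
    if cost_opt == 0:
        raise ValueError("relative gap is undefined for a zero reference cost")
    return (cost_rl - cost_opt) / abs(cost_opt)


# Hypothetical daily costs, not results from this notebook.
print(f"relative gap: {relative_gap(2.30, 2.00):.1%}")
```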

As you can see, the RL controller does not quite achieve the performance of the optimal controller. Why might that be?

## Things to try

You can use this notebook to experiment a bit. Here are some ideas:
- Try changing the `penalty_factor` and see how it affects the training
- Try setting the `fixed_start` argument in `runner.run()` to `None` to train on multiple days from one year. WARNING: You will also have to increase the `total_steps` and probably do some hyperparameter tuning!