# Simple PPO agent

This notebook provides a quick training run of the [Proximal Policy Optimization](https://arxiv.org/abs/1707.06347) (PPO) algorithm on the `MicroGridEnv` environment.

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%cd ../..

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from datetime import datetime
from tqdm import tqdm
from collections import OrderedDict
from gymnasium.utils.env_checker import check_env
import gymnasium as gym

from gym4real.envs.microgrid.utils import parameter_generator

In [None]:
sns.set_style('darkgrid')
plot_colors = sns.color_palette()
sns.set(font_scale=1.2)

## PPO Agent
We adopt the Stable-Baselines3 implementation of PPO, described [here](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html).

Here we initialize both the environment used to train the agent and the one used to evaluate it. The evaluation environment presents different consumption profiles from the training one, and the evaluation is run on 5 profiles.

In [None]:
# Uncomment the following line to install stable-baselines3
#!pip install stable-baselines3

In [None]:
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy
from stable_baselines3.common.env_util import make_vec_env

In [None]:
n_episodes = 5
n_envs = 4

# Validation profiles belonging to the train set
eval_profiles = [350, 351, 352, 353, 354] 

In [None]:
params = parameter_generator(world_options='gym4real/envs/microgrid/world_train.yaml')
env = make_vec_env("gym4real/microgrid-v0", n_envs=n_envs, env_kwargs={'settings':params})

In [None]:
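# Train PPO with the default Stable-Baselines3 hyperparameters
model = PPO(MlpPolicy, env)
# Budget: length of the `generation` series x number of environments x number of episodes
model.learn(total_timesteps=len(env.get_attr('generation')[0]) * n_envs * n_episodes,
            progress_bar=True)
model.save('examples/microgrid/trained_models/PPO_quick')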
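
The training budget passed to `learn` above is a product of three factors, and PPO in Stable-Baselines3 collects experience in fixed-size rollouts of `n_steps * n_envs` transitions (`n_steps` defaults to 2048). As far as we understand the training loop, collection continues until the requested budget is reached, so the number of timesteps actually collected is rounded up to a multiple of the rollout size. The sketch below illustrates that arithmetic only. The episode length is a hypothetical value, not the real length of a microgrid profile, and `effective_timesteps` is a helper introduced here for illustration, not a library function.

```python
import math


def effective_timesteps(total_timesteps: int, n_steps: int, n_envs: int) -> int:
    """Round a requested budget up to a whole number of rollouts."""
    rollout_size = n_steps * n_envs  # transitions gathered per policy update
    return math.ceil(total_timesteps / rollout_size) * rollout_size


episode_length = 1_000  # hypothetical; the notebook reads it from the environment
n_envs = 4
n_episodes = 5

# Same product as in the training cell
requested = episode_length * n_envs * n_episodes
# Assumes the default n_steps=2048 of Stable-Baselines3 PPO
collected = effective_timesteps(requested, n_steps=2048, n_envs=n_envs)

print(f"requested={requested}, collected={collected}")
```

With a short budget like the one used in this quick run, the rounding can be a noticeable fraction of the total, which is worth keeping in mind when comparing runs with different `n_envs`.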

## Comparison with Random policy

Here we compare the saved PPO model with a simple random policy. Both policies are evaluated on several test profiles that the agent has never seen during training.

In [None]:
eval_params = parameter_generator(world_options='gym4real/envs/microgrid/world_test.yaml')

# Test profiles belonging to the test set
test_profiles = [370, 371, 372, 373, 374, 375, 376, 377, 378, 379]
rewards = {}

### Random Policy
At each decision step, the action is sampled at random from the action space.

In [None]:
env = gym.make("gym4real/microgrid-v0", settings=eval_params)

alg = 'random'
rewards[alg] = {}

for profile in tqdm(test_profiles):
    obs, info = env.reset(options={'eval_profile': str(profile)})
    done = False
    cumulated_reward = 0
    rewards[alg][profile] = []

    while not done:
        action = env.action_space.sample()  # Randomly select an action
        obs, reward, terminated, truncated, info = env.step(action)  
        done = terminated or truncated
        cumulated_reward += reward
        rewards[alg][profile].append(cumulated_reward)

### PPO agent
Here we load the previously created model `PPO_quick`.

In [None]:
env = make_vec_env("gym4real/microgrid-v0", n_envs=1, env_kwargs={'settings':eval_params})

alg = 'ppo'
rewards[alg] = {}

# `make_vec_env` already returns a vectorized environment, so it can be used directly
vec_env = env
# Load the model saved at the end of training
model = PPO.load("examples/microgrid/trained_models/PPO_quick")

for profile in tqdm(test_profiles):
    vec_env.set_options({'eval_profile': str(profile)})
    obs = vec_env.reset()

    cumulated_reward = 0
    rewards[alg][profile] = []
    done = False
    
    while not done:
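        # Actions are sampled from the policy by default; pass deterministic=True for greedy actions
        action, _states = model.predict(obs)
        # VecEnv API: rewards and dones are arrays with one entry per environment
        obs, r, dones, info = vec_env.step(action)
        done = dones[0]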
        cumulated_reward += r[0]
        rewards[alg][profile].append(cumulated_reward)

Let's compare the cumulative rewards, averaged over the test profiles, of `PPO` after a quick training run and of the `random` policy.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 4), tight_layout=True)

for alg in rewards:
    # One cumulative-reward curve per test profile, stacked row-wise
    curves = np.array([rewards[alg][profile] for profile in rewards[alg]])
    means = curves.mean(axis=0)
    stds = curves.std(axis=0)
    # Half-width of the 95% confidence interval (normal approximation)
    ci = 1.96 * stds / np.sqrt(len(curves))

    ax.plot(means, label=alg)
    ax.fill_between(range(len(means)), means - ci, means + ci, alpha=0.1)

ax.legend()
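
The shaded band in the plot above is a pointwise 95% confidence interval under a normal approximation: at each time step, the half-width is `1.96 * std / sqrt(n)`, where `n` is the number of test profiles. The sketch below reproduces that computation in plain Python on made-up curves, so the numbers are illustrative and not results from the microgrid environment. The helper `mean_and_ci` is introduced here for illustration. It also truncates the curves to the shortest one, an assumption added to handle episodes of different lengths, which the plotting cell does not do.

```python
import math
import statistics


def mean_and_ci(curves, z=1.96):
    """Pointwise mean and CI half-width across cumulative-reward curves."""
    # Truncate to the shortest curve so each step averages the same episodes
    horizon = min(len(curve) for curve in curves)
    n = len(curves)
    means, half_widths = [], []
    for t in range(horizon):
        values = [curve[t] for curve in curves]
        means.append(statistics.fmean(values))
        # Population standard deviation, matching the default of np.std (ddof=0)
        std = statistics.pstdev(values)
        half_widths.append(z * std / math.sqrt(n))
    return means, half_widths


# Hypothetical cumulative rewards for three episodes (made-up numbers)
curves = [
    [-1.0, -2.0, -3.0],
    [-1.0, -3.0, -5.0],
    [-1.0, -4.0, -7.0, -9.0],  # longer episode, truncated to 3 steps
]

means, half_widths = mean_and_ci(curves)
for t, (m, h) in enumerate(zip(means, half_widths)):
    print(f"step {t}: mean={m:.3f}, 95% CI=[{m - h:.3f}, {m + h:.3f}]")
```

With only 10 test profiles, the normal quantile 1.96 gives a slightly optimistic band. A Student-t quantile with 9 degrees of freedom (about 2.262) would be a more conservative choice for `z`.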