## PPO with single-agent control

In this notebook, we show how to use Proximal Policy Optimization (PPO) with Nocturne and [Stable Baselines 3 (SB3)](https://stable-baselines3.readthedocs.io/en/master/index.html). SB3 is a library that implements a range of well-known RL algorithms.

### Wrappers

The Nocturne `BaseEnv` returns its outputs as dictionaries, but the SB3 `PPO` class expects numpy arrays. To make our environment compatible with SB3, we create a wrapper class. A wrapper modifies an environment's interface without changing the environment's own code, which reduces boilerplate and increases modularity.
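
To make the wrapper idea concrete, here is a minimal sketch of the pattern. It is not the actual `NocturneToSB3` implementation: `ToyDictEnv` and `SingleAgentArrayWrapper` are hypothetical names, and we assume the dictionaries are keyed by agent id and that exactly one agent is controlled.

```python
import numpy as np


class ToyDictEnv:
    """Hypothetical stand-in for an env whose outputs are dicts keyed by agent id."""

    def reset(self):
        return {"agent_0": np.zeros(3, dtype=np.float32)}

    def step(self, action_dict):
        obs = {"agent_0": np.ones(3, dtype=np.float32)}
        rew = {"agent_0": 1.0}
        done = {"agent_0": True}
        info = {"agent_0": {}}
        return obs, rew, done, info


class SingleAgentArrayWrapper:
    """Expose a one-agent dict env through array-based reset/step calls."""

    def __init__(self, env):
        self.env = env
        self.agent_id = None

    def reset(self):
        obs_dict = self.env.reset()
        # Assumption: a single controlled agent, so the dict has exactly one key.
        (self.agent_id,) = obs_dict.keys()
        return np.asarray(obs_dict[self.agent_id], dtype=np.float32)

    def step(self, action):
        # Re-pack the bare action into the dict format the inner env expects.
        obs, rew, done, info = self.env.step({self.agent_id: action})
        key = self.agent_id
        return (
            np.asarray(obs[key], dtype=np.float32),
            float(rew[key]),
            bool(done[key]),
            info[key],
        )


wrapped = SingleAgentArrayWrapper(ToyDictEnv())
first_obs = wrapped.reset()
next_obs, reward, done, info = wrapped.step(action=0)
```

The inner environment is left untouched; only the wrapper knows about both formats, which is what keeps the conversion logic in one place.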

In [3]:
import wandb
# Import base environment and wrapper
from utils.config import load_config_nb
from nocturne.envs.base_env import BaseEnv
from nocturne.wrappers.sb3_wrappers import NocturneToSB3

In [7]:
# Load environment settings
env_config = load_config_nb('env_config')
env_config.data_path = f'../{env_config.data_path}'

# Make sure to only control a single agent at a time. This is achieved by setting max_num_vehicles = 1
env_config["max_num_vehicles"] = 1

In [8]:
# Initialize env and wrap it with SB3 wrapper
env = NocturneToSB3(BaseEnv(env_config))

### PPO

Now all we have to do is initialize the SB3 `PPO` class and we're ready to learn! We use Weights & Biases (`wandb`) to take care of the logging. If you prefer not to use `wandb`, set `LOGGING = False` and pass `verbose=1` to `PPO` so that training progress is printed to the console instead.


---

> 🔦 More info on PPO and its settings can be found in the [SB3 docs](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html).

---

In [9]:
from stable_baselines3 import PPO

In [10]:
LOGGING = True

In [None]:
if LOGGING:
    wandb.login()
    run = wandb.init(
        project="single_agent_control_sb3_ppo",
        sync_tensorboard=True,
    )
    run_id = run.id
else:
    run_id = None

# Init PPO algorithm
model = PPO(
    policy="MlpPolicy",  # Policy type
    n_steps=4096,  # Number of steps per rollout
    batch_size=128,  # Minibatch size
    env=env,  # Our wrapped environment
    seed=42,  # Always seed for reproducibility
    verbose=0,  # Set to 1 to print training info when not using wandb
    tensorboard_log=f"runs/{run_id}" if run_id is not None else None,  # Sync with wandb
)

# Learn
model.learn(total_timesteps=200_000)
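
To get a feel for what these settings imply, the short calculation below works out how many rollouts and gradient updates the run performs. It is a back-of-the-envelope sketch: `n_envs = 1` and `n_epochs = 10` are assumptions (the latter is the SB3 `PPO` default, which the cell above does not override), and we assume SB3 completes the final rollout rather than truncating it.

```python
import math

# Values taken from the training cell
total_timesteps = 200_000
n_steps = 4096
batch_size = 128

# Assumptions, not set explicitly in the notebook
n_envs = 1     # a single wrapped environment
n_epochs = 10  # SB3 PPO default number of passes over each rollout

rollout_size = n_steps * n_envs
num_rollouts = math.ceil(total_timesteps / rollout_size)
minibatches_per_epoch = rollout_size // batch_size
gradient_steps_per_rollout = minibatches_per_epoch * n_epochs
frames_collected = num_rollouts * rollout_size

print(f"rollouts: {num_rollouts}")
print(f"minibatches per epoch: {minibatches_per_epoch}")
print(f"gradient steps per rollout: {gradient_steps_per_rollout}")
print(f"frames collected: {frames_collected}")
```

Under these assumptions the run collects 49 rollouts (200,704 frames, slightly more than requested) and takes 320 gradient steps after each one.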

### 🤔 How good is your policy?

Hooray! You have just trained your first PPO agent in Nocturne! 🏁

Now take a look at the information you logged during training: did the agent learn?

One important metric for assessing the effectiveness of your policy is the average cumulative reward per episode. In our case, the **maximum** achievable return per episode is 1 per agent. With the configuration above, your policy should approach this value within roughly 150,000 steps. Here, steps (the `global_step`) represent the total number of **frames** our policy network has seen; you can think of this as the agent's accumulated experience.
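
For reference, the average cumulative reward per episode can be computed from a flat log of per-step rewards and episode-termination flags, as in the sketch below. The `episode_returns` helper and the sample data are hypothetical and purely illustrative; SB3 typically reports the equivalent quantity for you as `rollout/ep_rew_mean`.

```python
def episode_returns(rewards, dones):
    """Sum rewards within each episode; an episode ends where done is True."""
    returns, running = [], 0.0
    for reward, done in zip(rewards, dones):
        running += reward
        if done:
            returns.append(running)
            running = 0.0  # start accumulating the next episode
    return returns


# Made-up log covering three episodes
rewards = [0.0, 0.0, 1.0, 0.0, 0.5, 0.0, 1.0]
dones = [False, False, True, False, True, False, True]

returns = episode_returns(rewards, dones)
mean_return = sum(returns) / len(returns)
print(returns, round(mean_return, 3))
```

In this toy log two of the three episodes reach the maximum return of 1, so the mean sits a little below it; a well-trained policy should push the mean toward 1.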