## Proximal Policy Optimization (PPO)

PPO is a popular reinforcement learning (RL) algorithm, widely used for its simplicity and effectiveness across many environments. It is an on-policy algorithm that balances policy improvement against stability by limiting how far each update can move the policy away from the one that collected the data. It supports both continuous and discrete action spaces.
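For reference, the balance between improvement and stability comes from PPO's clipped surrogate objective. The notation below is the standard one from the PPO literature rather than anything defined in this tutorial: $r_t(\theta)$ is the probability ratio between the new and old policies, $\hat{A}_t$ is an advantage estimate, and $\epsilon$ is the clipping parameter (called `clip_range` in section 2).

$$
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
\qquad
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]
$$

Taking the minimum means the objective gains nothing from pushing the ratio outside $[1-\epsilon,\ 1+\epsilon]$ in the direction that would otherwise improve it, which keeps each update close to the data-collecting policy.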

In this tutorial, we will use the `CartPole-v1` environment from OpenAI Gym to demonstrate how PPO works with stable_baselines3 and PyTorch. The goal is to keep a pole balanced upright on a cart by pushing the cart left or right. We will cover the following sections:

1. Setting up the environment
2. Configuring PPO
3. Training the agent
4. Saving and loading the model
5. Evaluating the agent
6. Visualizing the agent's performance


In [None]:
# Required imports
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy

### 1. Setting up the environment

We will create a `CartPole-v1` environment using OpenAI Gym and wrap it in a `DummyVecEnv` to ensure compatibility with stable_baselines3.

In [None]:
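# DummyVecEnv takes a list of factory functions, each returning a fresh environment.
# Creating the environment inside the lambda avoids capturing the reassigned `env` name.
env = DummyVecEnv([lambda: gym.make('CartPole-v1')])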
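To show what a vectorized wrapper changes, here is a toy pure-Python sketch. It is not the stable_baselines3 implementation: `CountdownEnv` and `ToyVecEnv` are hypothetical classes made up for this illustration. The sketch shows two behaviours that matter later in the tutorial: observations, rewards, and done flags come back as one entry per environment, and a finished environment is reset automatically.

```python
# Toy illustration only: hypothetical classes, not the stable_baselines3 code.
class CountdownEnv:
    """Hypothetical single environment whose episode ends after `length` steps."""

    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the observation is just the step counter

    def step(self, action):
        self.t += 1
        done = self.t >= self.length
        return self.t, 1.0, done, {}


class ToyVecEnv:
    """Hypothetical wrapper that batches several environments."""

    def __init__(self, env_fns):
        # Each factory is called once, so every sub-environment is a fresh instance.
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        obs, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            o, r, d, _ = env.step(action)
            if d:
                # Auto-reset: hand back the first observation of the next episode.
                o = env.reset()
            obs.append(o)
            rewards.append(r)
            dones.append(d)
        return obs, rewards, dones


vec_env = ToyVecEnv([lambda: CountdownEnv(2)])
first_obs = vec_env.reset()
step1 = vec_env.step([0])
step2 = vec_env.step([0])  # the episode ends here and the env resets itself
print(first_obs, step1, step2)
```

Even with a single environment, every returned value is a one-element batch, which is why code that steps a vectorized environment indexes into the done flags.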

### 2. Configuring PPO

We will now configure the PPO algorithm. Some important hyperparameters are:

- `learning_rate`: The learning rate for the optimizer.
- `n_steps`: Number of steps to collect from each environment before every update; the rollout holds `n_steps` × (number of environments) transitions.
- `batch_size`: Minibatch size for each gradient step.
- `n_epochs`: Number of passes over the collected rollout during each update.
- `gamma`: Discount factor for future rewards.
- `gae_lambda`: Factor for trade-off between bias and variance for Generalized Advantage Estimator (GAE).
- `clip_range`: Clipping parameter to limit the change in policy during an update.

You can customize these hyperparameters based on your specific problem or use the default settings.
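To make `gamma`, `gae_lambda`, and `clip_range` concrete, here is a minimal pure-Python sketch of the two computations they control. It is an illustration under simplifying assumptions (one trajectory segment with no episode boundary inside it, scalar inputs), not the stable_baselines3 implementation; the function names `gae_advantages` and `clipped_surrogate` are hypothetical.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over one segment with no terminal state.

    `values[t]` is the value estimate for step t and `last_value` is the
    estimate for the state that follows the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of TD errors (weight gamma * gae_lambda).
        running = delta + gamma * gae_lambda * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """PPO's clipped objective for a single sample (to be maximized)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# Toy numbers chosen for illustration only.
advantages = gae_advantages(rewards=[1.0, 1.0, 1.0], values=[0.0, 0.0, 0.0],
                            last_value=0.0, gamma=1.0, gae_lambda=1.0)
print(advantages)  # [3.0, 2.0, 1.0]
print(clipped_surrogate(ratio=1.5, advantage=1.0))  # 1.2
```

With `gae_lambda=1.0` the advantages reduce to discounted returns minus the value baseline (low bias, high variance); with `gae_lambda=0.0` they reduce to one-step TD errors (higher bias, lower variance). In the last line, a ratio of 1.5 with a positive advantage is capped at 1 + `clip_range`, so the objective stops rewarding a larger policy change.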

In [None]:
model = PPO(
    'MlpPolicy', env, learning_rate=0.0003, n_steps=2048, batch_size=64, n_epochs=10,
    gamma=0.99, gae_lambda=0.95, clip_range=0.2, verbose=1
)
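The rollout and minibatch settings above interact, and a little arithmetic shows how much optimization work each update involves. This is a back-of-the-envelope sketch, assuming the single environment created in section 1 and that each epoch splits the rollout into equally sized minibatches.

```python
n_envs = 1        # one CartPole copy inside the DummyVecEnv
n_steps = 2048
batch_size = 64
n_epochs = 10

rollout_size = n_envs * n_steps                     # transitions collected per update
minibatches_per_epoch = rollout_size // batch_size  # gradient steps in one pass
gradient_steps_per_update = minibatches_per_epoch * n_epochs

print(rollout_size, minibatches_per_epoch, gradient_steps_per_update)  # 2048 32 320
```

Choosing a `batch_size` that divides the rollout size evenly avoids a smaller leftover minibatch at the end of each epoch.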

### 3. Training the agent

Train the agent by calling the `learn()` method and specifying the total number of training timesteps.

In [None]:
model.learn(total_timesteps=100000)
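As a rough guide to what this call does, the sketch below estimates how many policy updates fit into the training budget. It assumes one environment, the `n_steps=2048` setting from section 2, and that data is collected in whole rollouts, so training can run slightly past `total_timesteps`.

```python
import math

total_timesteps = 100_000
rollout_size = 2048  # n_steps * number of environments (assumed to be 1)

n_rollouts = math.ceil(total_timesteps / rollout_size)
timesteps_collected = n_rollouts * rollout_size

print(n_rollouts, timesteps_collected)  # 49 100352
```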

### 4. Saving and loading the model

Save the trained model with the `save()` method, which writes a `ppo_cartpole.zip` archive, and restore it later with the `PPO.load()` class method.

In [None]:
model.save('ppo_cartpole')
model = PPO.load('ppo_cartpole')

### 5. Evaluating the agent

Evaluate the performance of the trained agent using the `evaluate_policy()` function from stable_baselines3. It runs the agent for `n_eval_episodes` episodes and returns the mean episode reward and its standard deviation.

In [None]:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f'Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}')
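To help read the printed numbers, the sketch below reproduces the same summary from a list of per-episode rewards. The reward values are made up for illustration, and the names `toy_episode_rewards`, `toy_mean`, and `toy_std` are hypothetical. It assumes the population standard deviation, which matches NumPy's default `np.std`.

```python
import statistics

# Made-up per-episode rewards, for illustration only.
toy_episode_rewards = [500.0, 480.0, 500.0, 455.0, 500.0]

toy_mean = statistics.fmean(toy_episode_rewards)
toy_std = statistics.pstdev(toy_episode_rewards)  # population standard deviation

print(f'Mean reward: {toy_mean:.2f} +/- {toy_std:.2f}')
```

`CartPole-v1` gives a reward of 1 for every step the pole stays up and caps episodes at 500 steps, so a mean close to 500 with a small standard deviation indicates that the agent balances the pole for nearly the full episode every time.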

### 6. Visualizing the agent's performance

To visualize the agent's performance, you can render the environment using the `render()` method. Rendering opens a window, so it may not work in Jupyter notebooks or on headless (non-GUI) systems.

In [None]:
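for episode in range(3):
    obs = env.reset()
    done = False
    while not done:
        # Pick the most likely action instead of sampling, for a stable visualization
        action, _ = model.predict(obs, deterministic=True)
        # The vectorized env returns arrays with one entry per environment
        obs, _, dones, _ = env.step(action)
        done = dones[0]
        env.render()
env.close()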