<a href="https://colab.research.google.com/github/ToiLaKiet/UIT-CS115/blob/main/PPO_StableBaselines_Play_CartPoleV1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!apt-get install ffmpeg freeglut3-dev xvfb  # For visualization
!pip install "stable-baselines3[extra]>=2.0.0a4"

## Imports

In [2]:
import gymnasium as gym
import numpy as np

The first thing you need to import is the RL model. Check the documentation to see which algorithms can be used for which kind of problem.

In [3]:
from stable_baselines3 import PPO

The next thing you need to import is the policy class that will be used to create the networks (for the policy/value functions).

In [4]:
from stable_baselines3.ppo.policies import MlpPolicy

## Create the Gym env and instantiate the agent
For this example, we will use the CartPole environment, a classic control problem.

In [5]:
env = gym.make("CartPole-v1")
model = PPO(MlpPolicy, env, verbose=0)

We create a helper function to evaluate the agent:

In [6]:
from stable_baselines3.common.base_class import BaseAlgorithm


def evaluate(
    model: BaseAlgorithm,
    num_episodes: int = 100,
    deterministic: bool = True,
) -> float:
    """
    Evaluate an RL agent for `num_episodes`.

    :param model: the RL agent; its environment is retrieved with `model.get_env()`
    :param num_episodes: number of episodes to evaluate it
    :param deterministic: whether to use deterministic or stochastic actions
    :return: mean reward over the `num_episodes` evaluated
    """
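    # `get_env()` returns the vectorized (VecEnv) wrapper around the environment.
    vec_env = model.get_env()
    # Reset once: a VecEnv resets itself automatically when an episode ends.
    obs = vec_env.reset()
    all_episode_rewards = []
    for _ in range(num_episodes):
        episode_rewards = []
        done = False
        # `reward` and `done` are arrays with one entry per environment;
        # this loop assumes a single environment.
        while not done:
            action, _states = model.predict(obs, deterministic=deterministic)
            obs, reward, done, _info = vec_env.step(action)
            episode_rewards.append(reward)
        all_episode_rewards.append(sum(episode_rewards))
    mean_episode_reward = np.mean(all_episode_rewards)
    print(f"Mean reward: {mean_episode_reward:.2f} - Num episodes: {num_episodes}")
    return mean_episode_reward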
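
The helper calls `reset()` only once because Stable-Baselines3 vectorized environments reset themselves when an episode ends. The sketch below illustrates that control flow with a hypothetical stand-in, `ToyAutoResetEnv`, which is not part of any library and always ends an episode after a fixed number of steps. It is only meant to show how per-episode returns are accumulated and averaged.

```python
from statistics import mean


class ToyAutoResetEnv:
    """Hypothetical stand-in for a single-environment VecEnv (illustration only)."""

    def __init__(self, episode_length: int = 5) -> None:
        self.episode_length = episode_length
        self.steps = 0

    def reset(self) -> int:
        self.steps = 0
        return 0  # dummy observation

    def step(self, action: int):
        self.steps += 1
        done = self.steps >= self.episode_length
        reward = 1.0  # CartPole-style reward: +1 for every step survived
        if done:
            self.steps = 0  # automatic reset at the end of an episode
        return 0, reward, done, {}


def evaluate_toy(env: ToyAutoResetEnv, num_episodes: int) -> float:
    env.reset()  # called once, before the first episode
    returns = []
    for _ in range(num_episodes):
        episode_return, done = 0.0, False
        while not done:
            _obs, reward, done, _info = env.step(0)
            episode_return += reward
        returns.append(episode_return)
    return mean(returns)


mean_return = evaluate_toy(ToyAutoResetEnv(episode_length=5), num_episodes=3)
print(f"Mean return: {mean_return:.2f}")
```

Every toy episode lasts five steps with a reward of 1 per step, so each episode return, and therefore the mean, is 5.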

Let's evaluate the untrained agent. Its networks are still randomly initialized, so it should behave like a random agent.

In [7]:
# Random Agent, before training
mean_reward_before_train = evaluate(model, num_episodes=100, deterministic=True)

Mean reward: 193.08 - Num episodes: 100


Stable-Baselines3 already provides this helper:

In [8]:
from stable_baselines3.common.evaluation import evaluate_policy

In [9]:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100, warn=False)
print(f"mean_reward: {mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward: 183.54 +/- 48.50
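
The two numbers returned by `evaluate_policy` are the mean and the standard deviation of the per-episode returns. The sketch below reproduces that summary on a made-up list of returns, assuming the population form of the standard deviation (NumPy's default). The values in `episode_returns` are illustrative and are not results from this notebook.

```python
from statistics import fmean, pstdev

# Illustrative per-episode returns (not measured values).
episode_returns = [10.0, 20.0, 30.0]

mean_return = fmean(episode_returns)
std_return = pstdev(episode_returns)  # population standard deviation

print(f"mean_reward: {mean_return:.2f} +/- {std_return:.2f}")
```

A large standard deviation relative to the mean indicates that the agent's performance varies a lot from episode to episode, which is worth checking before comparing two runs.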


## Train the model

In [10]:
# Train the agent for 10000 steps
model.learn(total_timesteps=10_000)


Training model for 10000 steps...
Training complete!


<stable_baselines3.ppo.ppo.PPO at 0x7e26c2bd7010>

## Evaluate the trained agent

In [11]:
# Evaluate the trained agent
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=200)
print(f"mean_reward: {mean_reward:.2f} +/- {std_reward:.2f}")

Evaluating trained agent...
mean_reward: 403.90 +/- 108.30


## Let's test the agent after training

In [12]:
# Evaluate trained model
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)
print(f"Average reward: {mean_reward:.2f}")

Evaluating trained model...
Average reward: 200.50
