In [2]:
#ML Imports
import gymnasium as gym
from stable_baselines3 import A2C, PPO, DQN
from stable_baselines3.ppo import MlpPolicy
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_vec_env
import pygame
import time
import numpy as np

# Goals for this Stage of Research
* Create and render three different Gymnasium environments
* Train three different Stable-Baselines3 RL models on each environment
* Measure the performance of each model before and after training

### 1. Taxi-v3
Here we run the Taxi-v3 environment from Gymnasium through Stable-Baselines3's VecEnv (vectorized environment), which steps several independent copies of a Gymnasium environment together so that training can draw on experience from all of them at once.

In [3]:
vec_env = make_vec_env("Taxi-v3",n_envs=4)

#### Now, we will train three models and compare their performance before and after training, using the **evaluate_policy** function from Stable-Baselines3.
1. Proximal Policy Optimization (PPO): A policy-gradient algorithm in which the ratio between the new and old policy's action probabilities is clipped in the training objective. This discourages drastic changes in policy, which may otherwise result in a sudden drop in performance.
2. Advantage Actor-Critic (A2C): This algorithm uses two function approximators (neural networks): one to determine the policy (the Actor), and one to estimate the value of the current state (the Critic). The Actor is updated not with the raw outcome of its policy, but with the *advantage*: how much better the chosen action turned out than the Critic expected for that state. This can speed up the learning process in some environments, though it isn't obvious to me why that is. Perhaps it's because the Critic offers a less noisy training signal than the return averaged over the course of an entire episode of the game, especially when the rewards provided by the environment are sparse.
3. Deep Q-Network (DQN): The foundational algorithm for applying deep learning to Q-learning. Older RL algorithms used a *Q-table (State, Action) -> Q*, where Q is the expected cumulative reward of taking a given action in a given state. A DQN model replaces the Q-table with a function approximator (i.e. a neural network) for the function *(State, Action) -> Q*.
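The core update behind each of the three algorithms can be sketched in a few lines of plain Python. This is a simplified illustration, not the Stable-Baselines3 implementation: the function names are hypothetical, the advantage is a one-step estimate (the library uses multi-step estimates), and the Q update is the tabular rule that DQN approximates with a network. The default `clip_range=0.2` matches the Stable-Baselines3 PPO default.

```python
from collections import defaultdict


def ppo_clipped_term(ratio, advantage, clip_range=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


def one_step_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    """Actor-critic signal: how much better the step was than the critic expected."""
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s


def q_learning_update(q_table, state, action, reward, next_state, n_actions,
                      alpha=0.1, gamma=0.99, done=False):
    """Tabular Q-learning: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = 0.0 if done else max(q_table[(next_state, a)] for a in range(n_actions))
    target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


# A probability ratio of 1.5 with a positive advantage is capped at 1 + clip_range.
print(ppo_clipped_term(ratio=1.5, advantage=1.0))

# Q-table with a default of 0.0 for unseen (state, action) pairs.
q_table = defaultdict(float)
print(q_learning_update(q_table, state=0, action=1, reward=-1.0, next_state=2, n_actions=6))
```

The clipping only removes the incentive to push the ratio further once it leaves the `[1 - eps, 1 + eps]` band in the direction the advantage favors; it does not clip the network weights themselves.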

In [7]:
#PPO MODEL

model1 = PPO(MlpPolicy, vec_env, verbose=0)

mean_reward,std_reward = evaluate_policy(model1, vec_env, n_eval_episodes=100)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

model1.learn(total_timesteps=10000)

mean_reward, std_reward = evaluate_policy(model1, vec_env, n_eval_episodes=100)
print(f"Mean Reward: {mean_reward:.2f} +/- {std_reward:.2f}")

mean reward: -1258.49 +/- 882.40
Mean Reward: -200.00 +/- 0.00


In [8]:
#A2C MODEL

model2 = A2C("MlpPolicy", vec_env, verbose=0)

mean_reward,std_reward = evaluate_policy(model2, vec_env, n_eval_episodes=100)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

model2.learn(total_timesteps=10000)

mean_reward,std_reward = evaluate_policy(model2, vec_env, n_eval_episodes=100)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

mean reward: -1294.40 +/- 875.09
mean reward: -200.00 +/- 0.00


In [9]:
#DQN MODEL

model3 = DQN("MlpPolicy", vec_env, verbose=0)

mean_reward,std_reward = evaluate_policy(model3, vec_env, n_eval_episodes=100)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

model3.learn(total_timesteps=10000)

mean_reward,std_reward = evaluate_policy(model3, vec_env, n_eval_episodes=100)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

mean reward: -2000.00 +/- 0.00
mean reward: -200.00 +/- 0.00


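### All three models improved in the Taxi-v3 Env!

Interestingly, the untrained A2C and PPO models out-performed the untrained DQN model, but all three ended at exactly -200 after only 10000 training timesteps. A score of -200 is not the maximum, though. Taxi-v3 gives -1 per step and cuts episodes off after 200 steps, so -200 +/- 0.00 means every evaluation episode ran to the time limit without a single illegal pickup or drop-off (-10 each) and without delivering the passenger (+20). The models learned to avoid the large penalties, but not yet to complete the task.

All three algorithms can be trained in this simple Env, but 10000 timesteps was not enough for any of them to solve it.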
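To put the recorded scores on a scale, here is a small sketch of the episode-return arithmetic. It assumes the reward values given in the Gymnasium Taxi documentation (-1 per step, -10 for an illegal pickup or drop-off, +20 for a successful delivery) and the 200-step episode limit; the helper name `taxi_return` is hypothetical.

```python
STEP_REWARD = -1        # every ordinary step
ILLEGAL_REWARD = -10    # illegal pickup or drop-off (replaces the -1 for that step)
DELIVERY_REWARD = 20    # successful drop-off (replaces the -1 for that step)
MAX_EPISODE_STEPS = 200


def taxi_return(ordinary_steps, illegal_actions, delivered):
    """Total episode reward for a given mix of step types."""
    total = ordinary_steps * STEP_REWARD + illegal_actions * ILLEGAL_REWARD
    if delivered:
        total += DELIVERY_REWARD
    return total


# Wandering until the time limit without any illegal action:
print(taxi_return(MAX_EPISODE_STEPS, 0, delivered=False))   # -200
# Repeating an illegal pickup/drop-off on every step until the time limit:
print(taxi_return(0, MAX_EPISODE_STEPS, delivered=False))   # -2000
# A hypothetical successful trip: 12 ordinary steps, then the drop-off:
print(taxi_return(12, 0, delivered=True))                   # 8
```

Under these assumptions, -2000 and -200 both correspond to episodes that hit the time limit, while an agent that actually delivers the passenger finishes with a small positive return.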

### 2. CartPole-v1

Let's now test the same three models (A2C, PPO, and DQN) in CartPole-v1 to see how they perform.

In [14]:
vec_env = make_vec_env("CartPole-v1", n_envs=4)
model1 = DQN("MlpPolicy", vec_env, verbose=0)

mean_reward,std_reward = evaluate_policy(model1, vec_env, n_eval_episodes=100)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

model1.learn(total_timesteps=100000)

mean_reward,std_reward = evaluate_policy(model1, vec_env, n_eval_episodes=100)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

mean reward: 9.36 +/- 0.77
mean reward: 120.66 +/- 37.59


In [15]:
model2 = A2C("MlpPolicy", vec_env, verbose=0)
mean_reward,std_reward = evaluate_policy(model2, vec_env, n_eval_episodes=100)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

model2.learn(total_timesteps=100000)

mean_reward,std_reward = evaluate_policy(model2, vec_env, n_eval_episodes=100)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

mean reward: 9.18 +/- 0.70
mean reward: 500.00 +/- 0.00


In [16]:
model3 = PPO("MlpPolicy", vec_env, verbose=0)

mean_reward,std_reward = evaluate_policy(model3, vec_env, n_eval_episodes=100)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

model3.learn(total_timesteps=100000)

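# Evaluate the trained PPO model (model3). The output recorded below came from a run
# that passed model1 (the DQN model) here by mistake, so it does not reflect PPO.
mean_reward,std_reward = evaluate_policy(model3, vec_env, n_eval_episodes=100)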
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

mean reward: 30.00 +/- 6.25
mean reward: 118.85 +/- 48.78


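### A2C Outperformed DQN!

Interestingly, the A2C model was the only one to reach the maximum score of 500 (CartPole-v1 ends each episode after 500 steps), and it did so in every evaluation episode. DQN progressed more slowly, reaching about 120 after the same 100000 timesteps. PPO's trained performance was not actually measured: the second evaluation in the PPO cell was run on model1 (the DQN model) instead of model3, which is why its result resembles DQN's. The original guess was that A2C's second function approximator sped up its learning process, but PPO also trains a critic, so that alone would not separate the two. A single run per model can't settle the cause.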

### 3. FrozenLake-v1
This is another simple environment, which adds an element of randomness to how the agent interacts with its environment. The agent attempts to reach a goal by moving across a slippery ice field while avoiding holes in the ice. Due to the 'slipperiness' of the ice, the agent often does not move in the direction chosen by the policy: with the default settings it moves in the intended direction only one third of the time, and otherwise slides in one of the two perpendicular directions.
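The slipping rule can be sketched without Gymnasium at all. The snippet below assumes the default `is_slippery=True` behavior, in which the intended direction and the two perpendicular directions are equally likely, and uses FrozenLake's action encoding (0 = left, 1 = down, 2 = right, 3 = up). The helper names are hypothetical.

```python
import random
from collections import Counter

LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
NAMES = {LEFT: "left", DOWN: "down", RIGHT: "right", UP: "up"}


def possible_moves(action):
    """Directions the agent may actually move in: the intended one and both perpendiculars."""
    return [(action - 1) % 4, action, (action + 1) % 4]


def sample_move(action, rng):
    """Pick the realized direction uniformly from the three candidates."""
    return rng.choice(possible_moves(action))


rng = random.Random(0)
counts = Counter(sample_move(RIGHT, rng) for _ in range(3000))
for direction, n in sorted(counts.items()):
    print(f"intended right, moved {NAMES[direction]}: {n / 3000:.2f}")
```

The agent never slides directly backwards, so a policy can still bias its movement, for example by choosing an action whose three possible outcomes all avoid an adjacent hole.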

In [10]:
vec_env = make_vec_env('FrozenLake-v1', n_envs=4)
vec_env.reset()

array([0, 0, 0, 0], dtype=int64)

In [11]:
model1 = PPO('MlpPolicy',vec_env,verbose=0)

mean_reward,std_reward = evaluate_policy(model1, vec_env, n_eval_episodes=100)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

model1.learn(total_timesteps=100000)

mean_reward,std_reward = evaluate_policy(model1, vec_env, n_eval_episodes=100)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

mean reward: 0.00 +/- 0.00
mean reward: 0.54 +/- 0.50


In [12]:
model2 = A2C('MlpPolicy',vec_env,verbose=0)

mean_reward,std_reward = evaluate_policy(model2, vec_env, n_eval_episodes=100)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

model2.learn(total_timesteps=100000)

mean_reward,std_reward = evaluate_policy(model2, vec_env, n_eval_episodes=100)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

mean reward: 0.00 +/- 0.00
mean reward: 0.74 +/- 0.44


In [13]:
model3 = DQN('MlpPolicy',vec_env,verbose=0)

mean_reward,std_reward = evaluate_policy(model3, vec_env, n_eval_episodes=100)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

model3.learn(total_timesteps=100000)

mean_reward,std_reward = evaluate_policy(model3, vec_env, n_eval_episodes=100)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

mean reward: 0.00 +/- 0.00
mean reward: 0.52 +/- 0.50


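Again, the A2C model out-performed the other two (0.74 versus 0.54 and 0.52). FrozenLake pays 1 for reaching the goal and 0 otherwise, so these numbers are success rates over 100 evaluation episodes. This might imply that A2C is better in general, or just better at training in vectorized environments, but a gap of this size from one training run per model is weak evidence. Jury's still out on this.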
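One way to make such comparisons less dependent on a single run is to repeat each experiment over several seeds and report the spread across runs. The sketch below shows only the aggregation step. `repeat_experiment` and `fake_run` are hypothetical names, and `fake_run` returns made-up scores as a stand-in for a real train-then-evaluate function.

```python
import random
import statistics


def repeat_experiment(run_once, seeds):
    """Call run_once(seed) for each seed and summarize the resulting scores."""
    scores = [run_once(seed) for seed in seeds]
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, spread, scores


def fake_run(seed):
    """Placeholder for 'train a model with this seed, then evaluate it'.

    Returns an invented score; it carries no information about any real model.
    """
    return random.Random(seed).uniform(0.4, 0.8)


mean, spread, scores = repeat_experiment(fake_run, seeds=range(5))
print(f"mean reward over {len(scores)} seeds: {mean:.2f} +/- {spread:.2f}")
```

With Stable-Baselines3, `run_once` would build the environment and model with `seed=seed` (both `make_vec_env` and the model constructors accept a `seed` argument), call `learn`, and return the mean reward from `evaluate_policy`.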

Edit: Future Evan here. Interestingly, PPO consistently out-performed A2C in the Embodied Communication Game. However, these models are also profoundly inconsistent, and running a single test per **Model** x **Environment** pair is likely an unreliable metric of model performance overall. I'm looking into a way to combin