# Beating Brock in Pokémon Red using Reinforcement Learning

## By: Patrick Sharp

It's recommended that you read this notebook on [GitHub](https://github.com/PSharp725/Pokemon-Red-RL/blob/main/src/notebooks/report.ipynb) so that the embedded videos are visible.

## Overview

This project aims to explore Reinforcement Learning (RL) algorithms within the context of classic video game environments, specifically focusing on Pokémon Red. The project's goals are twofold:

1. to implement and evaluate various RL algorithms in a custom Gymnasium environment based on Pokémon Red (thanks to [this repo](https://github.com/PWhiddy/PokemonRedExperiments/tree/master) by PWhiddy), and

2. to investigate the critical balance between exploration and exploitation in RL agent performance.

Unlike many traditional RL projects that focus on low-dimensional or heavily simplified environments (e.g., CartPole, FrozenLake), this project attempts to tackle a partially observable, high-variance game world. This naturally introduces complexity in terms of state representation, reward sparsity, and policy generalization.

The ultimate practical goal is to train agents capable of defeating Brock, the first gym leader, thereby obtaining the Boulder Badge — a milestone early in the game but nontrivial in terms of action selection, state abstraction, and long-term planning.

### What is Pokémon Red?

Pokémon Red (1996, Game Freak/Nintendo) is a role-playing video game (RPG) in which players control a protagonist navigating a fictional world, capturing and training creatures called Pokémon, and battling other trainers to earn badges and progress the storyline.

From an RL perspective, Pokémon Red presents a sequential decision-making problem with attributes including:

- Partial Observability: The agent cannot directly observe true world states (e.g., enemy Pokémon's hidden stats).

- Long-Term Dependencies: Success depends not just on immediate actions but on strategies developed across long sequences (e.g., choosing to train a Pokémon early affects performance hours later).

- Stochasticity: Many events (critical hits, enemy move choices) introduce randomness into outcomes.

- Sparse Rewards: Winning a battle or earning a badge occurs only after potentially hundreds of intermediate steps without explicit reward signals.

Thus, it provides a rich testbed beyond simplistic RL benchmarks.

### Why Pokémon Red?

Pokémon Red was selected due to personal nostalgia, as it was my first introduction to video games at the age of four, played on my yellow Game Boy Pocket. This nostalgic connection provides intrinsic motivation to dive deeper into the problem space.

---


## Setting up the environment 

### PyBoy

[PyBoy](https://github.com/Baekalfen/PyBoy) is a Python-based emulator for the Nintendo Game Boy, designed to provide programmatic access to the emulation process through a clean API. It allows external scripts to read game memory, send controller inputs, and observe screen outputs — all crucial capabilities for integrating reinforcement learning agents with a game environment that was never originally designed for AI training.

For this project, PyBoy acts as the critical bridge between the RL algorithms and Pokémon Red. It enables the custom Gymnasium environment to interface directly with the game's internal state, sending actions (e.g., pressing 'A', 'Start', navigating menus) and receiving observations (e.g., screen pixels, memory values) in a way that is compatible with modern RL pipelines. Without such programmatic control and visibility into game state, training agents in a complex environment like Pokémon Red would be effectively infeasible.

Moreover, using PyBoy ensures deterministic, reproducible experiments — an essential property for debugging RL agents, evaluating exploration strategies, and properly measuring algorithmic performance.

We will be using PyBoy to help with running our environment.

### Action Space

The environment maps the Game Boy controls to the following list of button-press events, which forms the action space used by the environment wrapper:

```Python
import warnings
warnings.filterwarnings("ignore", message="Using SDL2 binaries from pysdl2-dll*")
from pyboy.utils import WindowEvent

valid_actions = [
    WindowEvent.PRESS_ARROW_DOWN,
    WindowEvent.PRESS_ARROW_LEFT,
    WindowEvent.PRESS_ARROW_RIGHT,
    WindowEvent.PRESS_ARROW_UP,
    WindowEvent.PRESS_BUTTON_A,
    WindowEvent.PRESS_BUTTON_B,
    WindowEvent.PRESS_BUTTON_START,
]
```


![alt text](../assets/images/Pokemon_red_controls.png "Controls")

Nintendo. (1996) Pokémon Red Trainer's Guide. Nintendo of America Inc. Retrieved from https://pokemon-project.com/juegos/manual/manual-GB-Pokemon-Rojo-Azul-EN.pdf

---

### Verifying the game file

Before using Pokémon Red within the custom Gymnasium environment, it is critical to ensure that the game file (ROM) being used matches the expected version supported by the environment. To do this, we verify the integrity of the PokemonRed.gb file by calculating its SHA-1 checksum.

Using the following command:

```Python
# Check the hash of the ROM file
!shasum ../../PokemonRed.gb
```

```
ea9bcae617fdf159b045185467ae58b2e4a48b9a  ../../PokemonRed.gb
```


we compute the SHA-1 hash of the ROM file. The expected hash, according to PWhiddy's repository documentation, is: `ea9bcae617fdf159b045185467ae58b2e4a48b9a`.

If the output of the command matches this expected value, it confirms that the ROM file is identical at the binary level to the one the custom Gymnasium environment was built and tested against.
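
The same check can also be done from Python, which may be convenient on systems where `shasum` is not available. The sketch below is an illustration rather than part of the original notebook: the helper names `sha1_of_file` and `rom_matches` and the chunk size are assumptions, while the expected hash is the one quoted above.

```python
import hashlib

# Expected SHA-1 quoted above from PWhiddy's repository documentation.
EXPECTED_SHA1 = "ea9bcae617fdf159b045185467ae58b2e4a48b9a"


def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def rom_matches(path, expected=EXPECTED_SHA1):
    """True if the file at `path` has the expected SHA-1 checksum."""
    return sha1_of_file(path) == expected
```

Calling `rom_matches("../../PokemonRed.gb")` before constructing the environment would catch a mismatched ROM early instead of failing later in less obvious ways.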

---

## The Environment

The environment tracks the following game-state keys:

| Observation Key | Description | Type |
|-----------------|-------------|------|
| event      | Number of events observed | int |
| level      | Sum of all Pokémon levels | int |
| heal       | Amount of healing from items or visiting a Pokémon Center | float |
| op_lvl     | Opponent Pokémon's level | int |
| dead       | Number of times "dying" (Pokémon fainting) | int |
| badge      | Number of Gym Badges | int |
| explore    | Number of map tiles visited | int |
| stuck      | Number of times stuck | int |


The reward structure is designed to encourage both exploration and meaningful game progress. Events triggered and new tiles visited serve to promote thorough map exploration and storyline advancement. Achieving Gym Badges and capturing Pokémon are explicitly incentivized, with further rewards tied to strengthening the player's team through leveling. To foster strategic and sustainable gameplay, the agent is rewarded for maintaining the health of its Pokémon through healing, while deaths are penalized. Additionally, the system detects and penalizes situations where the agent becomes stuck, encouraging continuous forward movement and discouraging inefficient behaviors.

An example of the environment setup can be seen below.

```Python
import sys
import os
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
from red_gym_env import RedGymEnv

STATE_REWARDS = {
    "event": True,
    "level": False,
    "heal": True,
    "op_lvl": False,
    "dead": False,
    "badge": True,
    "explore": True,
    "stuck": True,
}
STATE_REWARD_WEIGHTS = {
    "event": 4,
    "level": 1,
    "heal": 10,
    "op_lvl": 0.2,
    "dead": -0.1,
    "badge": 10,
    "explore": 0.1,
    "stuck": -0.05,
}
REWARD_SCALE = 0.5
EXPLORE_WEIGHT = 0.25
EP_LENGTH = 2048 * 80
NUM_CPU = 32  # Also sets the number of episodes per training iteration

env_config = {
    'headless': True,
    'save_final_state': True,
    'early_stop': False,
    'action_freq': 24,
    'init_state': './init.state',
    'max_steps': EP_LENGTH,
    'save_video': True,
    'fast_video': False,
    'session_path': 'temp_path',
    'gb_path': './PokemonRed.gb',
    'debug': False,
    'reward_scale': REWARD_SCALE,
    'explore_weight': EXPLORE_WEIGHT,
    'print_rewards': True,
}
```
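
How settings like these might combine into a single scalar reward can be sketched as follows. This is a minimal illustration, assuming the total is a scaled sum of the weighted, enabled components. The function `total_reward` and the sample state values are hypothetical, and the actual logic inside `RedGymEnv` may differ (for example, in how `explore_weight` is applied).

```python
# Toggles and weights copied from the configuration above.
STATE_REWARDS = {
    "event": True, "level": False, "heal": True, "op_lvl": False,
    "dead": False, "badge": True, "explore": True, "stuck": True,
}
STATE_REWARD_WEIGHTS = {
    "event": 4, "level": 1, "heal": 10, "op_lvl": 0.2,
    "dead": -0.1, "badge": 10, "explore": 0.1, "stuck": -0.05,
}
REWARD_SCALE = 0.5


def total_reward(state, enabled=STATE_REWARDS,
                 weights=STATE_REWARD_WEIGHTS, scale=REWARD_SCALE):
    """Assumed combination rule: scale * sum(weight * value) over enabled keys."""
    weighted = sum(
        weights[key] * value
        for key, value in state.items()
        if enabled.get(key, False)  # disabled components contribute nothing
    )
    return scale * weighted


# Hypothetical game state, purely for illustration.
example_state = {
    "event": 3, "level": 12, "heal": 0.5, "op_lvl": 5,
    "dead": 1, "badge": 1, "explore": 100, "stuck": 2,
}
print(total_reward(example_state))
```

With these sample values the enabled terms sum to about 36.9 before scaling, so the scaled reward is about 18.45. The disabled keys (`level`, `op_lvl`, `dead`) are ignored even though they have weights.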

---

## Approach

### Random Agent

To establish a performance baseline, a RandomAgent was implemented. The agent is a subclass of a generic BaseAgent, and its decision policy is trivially simple: it samples an action uniformly at random from the environment's available action_space. The relevant code structure is as follows:

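```Python
from agents.base_agent import BaseAgent


class RandomAgent(BaseAgent):
    """Baseline agent that ignores observations and acts uniformly at random."""

    def __init__(self, action_space):
        super().__init__(action_space)

    def select_action(self, observation):
        # Sample uniformly from the environment's action space.
        return self.action_space.sample()


# RedGymEnv and env_config come from the environment setup cell above.
env = RedGymEnv(env_config)
agent = RandomAgent(env.action_space)

obs, _ = env.reset()
done = False

while not done:
    action = agent.select_action(obs)
    # Gymnasium's step() returns (obs, reward, terminated, truncated, info).
    # Here the episode ends on truncation, i.e. when the step limit is reached.
    obs, reward, _, done, _ = env.step(action)
    env.save_and_print_info(done, obs)
```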

At each timestep, the agent selects a random action without considering the environment's current state (observation), any past experience, or the long-term consequences of its actions. This leads to purely stochastic behavior, serving as a control for comparison against more sophisticated reinforcement learning algorithms.

The agent was tested in the custom Gymnasium environment that simulates Pokémon Red gameplay (the `RedGymEnv` wrapper). It was allowed to act from the initial game state until the termination condition was reached (the episode step limit).

#### Gameplay Behavior and Observations

An analysis of the gameplay rollout shows the RandomAgent displaying several key characteristics:

- Erratic and Inefficient Movement: The agent frequently alternates between movement directions without consistent navigation goals. For example, it may move up briefly, then left, then up again, then open menus without any strategic pattern.

- Excessive Menu Interaction: Because random button presses include selecting the "Start" button, the agent often opens the game menu inadvertently. Upon opening the menu, it issues random inputs, sometimes moving the cursor but rarely exiting the menu intentionally. This behavior significantly interrupts any forward gameplay progress.

- Minimal Progress Toward Objectives: The agent fails to make meaningful progress toward defeating the first Gym Leader, Brock, or even toward reaching the Viridian City Pokémon Center. Any movement toward objectives is purely coincidental and almost immediately undone by subsequent random actions.

- High Frequency of No-Op Actions: Many random inputs have little to no effect on the environment's state (e.g., pressing a directional input when movement is blocked by a wall). These actions contribute to wasted timesteps and further delay progress.

#### Critical Evaluation

The gameplay of the random agent highlights the extreme inefficiency of unguided exploration in large, structured environments like Pokémon Red. Even with a simple spatial goal such as reaching a nearby city, random movement fails spectacularly because:

- State-Space Size: Pokémon Red has a very large and complex state space. Random actions almost never produce beneficial state transitions.

- Sparse Rewards: Positive feedback (e.g., gaining experience, winning battles) is sparse and conditional on complex sequences of actions. Random agents are exceedingly unlikely to stumble into these sequences by chance.

- Structured Tasks: The game requires highly structured sequences (e.g., talking to an NPC, navigating menus carefully) that random behavior simply cannot achieve.
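
To put a rough number on how unlikely structured sequences are under random play, the sketch below computes the probability that a uniform random policy over the seven buttons produces one specific sequence of a given length. It is a deliberately simplified model: it assumes exactly one sequence succeeds, whereas in the real game many different input sequences reach the same goal, so these figures overstate the difficulty. The function name `prob_exact_sequence` is hypothetical.

```python
NUM_ACTIONS = 7  # size of the valid_actions list used by the environment


def prob_exact_sequence(length, num_actions=NUM_ACTIONS):
    """Probability that uniform random sampling yields one specific sequence."""
    return (1.0 / num_actions) ** length


for length in (5, 10, 20):
    p = prob_exact_sequence(length)
    print(f"{length:>2} steps: probability {p:.2e}")
```

A specific 10-step sequence, for instance, has probability 1 in 7^10 = 282,475,249, which is why even short scripted interactions are effectively out of reach for a purely random agent.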

As a result, the RandomAgent acts as a clear lower bound on performance. Future agents must significantly outperform this baseline to be considered successful.

<img src="../assets/gifs/Random_agent.gif" alt="Random Agent" width="300"/>


---

## PPO

### Agent Selection Rationale

For the first reinforcement learning baseline beyond random behavior, the **Proximal Policy Optimization (PPO)** algorithm was selected.  
PPO is a policy-gradient method that strikes a strong balance between **training stability** and **sample efficiency**, two factors critical in complex environments like Pokémon Red.

**Key reasons for choosing PPO include:**
- **Robustness**: PPO is known for its reliability and ease of tuning compared to older policy-gradient methods like Vanilla Policy Gradient (VPG) or Trust Region Policy Optimization (TRPO) ([Schulman et al., 2017](https://arxiv.org/abs/1707.06347)).
- **Efficient Use of Data**: PPO uses a clipped objective function that avoids making overly large policy updates, allowing for more stable learning from batches of collected experience.
- **Generality Across Action Spaces**: PPO supports both discrete and continuous actions. Pokémon Red exposes only a small discrete set of button presses, but those presses drive both spatial navigation and complex interaction mechanics (menus, items, NPC dialogues), so an algorithm that handles discrete control without special modification is advantageous.
- **Wide Empirical Success**: PPO has been successfully applied to a wide range of tasks from robotics control to game playing, making it a safe and widely trusted starting point for early agent development.

Thus, PPO serves as a **natural first choice** for building an agent capable of strategic gameplay in Pokémon Red.



### PPO Agent Implementation Details

The PPO agent was implemented using [**Stable-Baselines3**](https://github.com/DLR-RM/stable-baselines3)'s `PPO` algorithm, with some modifications to better handle the complexity of Pokémon Red gameplay.

To improve data-collection throughput and training speed, a vectorized environment setup (`SubprocVecEnv`) was used to run multiple instances of the RedGym environment in parallel. Each parallel environment was initialized with a unique random seed to encourage diverse experiences across subprocesses.

The core environment creation function was structured as follows:

```Python
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import CallbackList, CheckpointCallback
from stable_baselines3.common.utils import set_random_seed
from stable_baselines3.common.vec_env import SubprocVecEnv

# `sess_path`, `train_steps_batch`, and `TensorboardCallback` are defined
# elsewhere in the training script and are not shown here.
ep_length = EP_LENGTH  # assumed to mirror EP_LENGTH from the environment setup
num_cpu = NUM_CPU


def make_env(rank, env_conf, seed=0):
    """
    Utility function for multiprocessed env.
    :param rank: (int) index of the subprocess
    :param env_conf: (dict) configuration passed to RedGymEnv
    :param seed: (int) the initial seed for RNG
    """
    def _init():
        env = RedGymEnv(env_conf)
        # Offset the seed by rank so each subprocess gets a different seed.
        env.reset(seed=(seed + rank))
        return env
    set_random_seed(seed)
    return _init


env = SubprocVecEnv([make_env(i, env_config) for i in range(num_cpu)])

checkpoint_callback = CheckpointCallback(
    save_freq=ep_length // 2,
    save_path=sess_path,
    name_prefix="poke"
)
callbacks = [checkpoint_callback, TensorboardCallback(sess_path)]

model = PPO(
    "MultiInputPolicy",
    env,
    verbose=1,
    n_steps=train_steps_batch,
    batch_size=512,
    n_epochs=1,
    gamma=0.997,
    ent_coef=0.01,
    tensorboard_log=sess_path
)
model.learn(
    total_timesteps=ep_length * num_cpu * 10_000,  # Attempt to run 10,000 iterations
    callback=CallbackList(callbacks),
    tb_log_name="poke_ppo"
)
```


#### Overview of PPO

Proximal Policy Optimization (PPO) is a policy-gradient reinforcement learning algorithm that optimizes the expected cumulative reward by adjusting the agent’s action policy directly.  
Unlike vanilla policy-gradient methods, which can become unstable when a single update changes the policy too much, PPO uses a clipped objective that removes the incentive to move the new policy far from the previous one.

This ensures that learning proceeds in small, trusted steps, maintaining a balance between improving the policy and preserving exploration.

Formally, the core PPO objective is:


$$L^{CLIP}(\theta) = \mathbb{E} \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$


where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the probability ratio between the new and old policy, $\hat{A}_t$ is the advantage estimate (the value of an action relative to a baseline), and $\epsilon$ is the clipping range.
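
The clipped objective can be sketched in a few lines of plain Python. This is an illustration of the formula above, not the Stable-Baselines3 implementation; the function name `clipped_surrogate` is hypothetical, and `epsilon=0.2` is assumed here as a commonly used clipping range.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch."""
    terms = []
    for r, adv in zip(ratios, advantages):
        unclipped = r * adv
        # Restrict the probability ratio to [1 - epsilon, 1 + epsilon].
        clipped_ratio = max(1.0 - epsilon, min(r, 1.0 + epsilon))
        clipped = clipped_ratio * adv
        # Taking the minimum gives a pessimistic bound on the improvement.
        terms.append(min(unclipped, clipped))
    return sum(terms) / len(terms)


# A ratio of 1.5 with positive advantage is capped at 1.2 * A,
# so pushing the ratio further yields no additional objective value.
print(clipped_surrogate([1.5], [1.0]))
# A ratio of 0.5 with negative advantage uses the clipped term 0.8 * A.
print(clipped_surrogate([0.5], [-1.0]))
```

In both cases the objective stops rewarding movement of the ratio beyond the clipping range, which is what keeps each policy update small.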



#### Gameplay Behavior and Observations

Reviewing the PPO agent’s rollout shows clear signs of learning compared to the RandomAgent baseline:

- **Purposeful Movement**: The agent demonstrates more directed movement patterns. It tends to move persistently in a single direction (e.g., heading toward city exits), reducing random directional changes.
- **Reduced Menu Misuse**: The agent still occasionally interacts with the game menu, but it now exits menus more quickly and does not get "stuck" inside menus nearly as often.
- **Navigation Toward Objectives**: The agent shows a tendency to reach new areas and even approach entering Viridian City. This indicates some understanding (learned through reward feedback) that spatial exploration is valuable.
- **State Awareness**: Unlike the random agent, the PPO agent occasionally stops or adjusts movement in response to blocked paths or obstacles, hinting at a learned association between certain states and actions.



#### Critical Evaluation

While the PPO agent's gameplay is still imperfect, it shows significant improvements in several areas compared to the RandomAgent baseline:

| Aspect                  | RandomAgent                         | PPOAgent                                   |
|--------------------------|-------------------------------------|-------------------------------------------|
| Movement Directionality  | Random and chaotic                  | Mostly consistent, goal-oriented         |
| Menu Interaction         | Frequent, disruptive               | Occasional, quickly exited               |
| Progress Toward Goals    | Almost none                         | Partial — beginning to reach objectives  |
| Reward Accumulation      | Minimal or none                     | Noticeably improved                      |

However, despite clear evidence of learning, several limitations were observed:

- **Training Time Constraints**: PPO, like most modern deep reinforcement learning algorithms, requires very large amounts of environment interaction to fully optimize its policy. Given the complexity and sparsity of the reward structure in Pokémon Red, training had to be curtailed before the agent could fully master the game mechanics. The model shown here was trained for ~2,000 iterations (167,772,160 total timesteps, or ~20 hours).
- **Partial Success Evidence**: Despite the limited training, during several experimental runs, the agent was observed to successfully defeat Brock and obtain the first badge. These successes demonstrate that the agent has the capacity to learn complex sequences when given sufficient training time and feedback.
- **Continued Learning Potential**: The agent's performance was still improving in the final training runs, suggesting that additional training would likely have further enhanced both reliability and gameplay sophistication.

In summary, the PPO agent shows clear evidence of learning under challenging conditions, but also highlights the resource demands inherent to applying deep reinforcement learning to large, structured environments like Pokémon Red.

<img src="../assets/gifs/ppo_agent.gif" alt="PPO Agent" width="300"/>


---

## A2C

### Agent Selection Rationale

In addition to PPO, the **Advantage Actor-Critic (A2C)** algorithm was also implemented as a baseline reinforcement learning agent.  
A2C is a synchronous variant of the Actor-Critic family of algorithms, combining **policy optimization** and **value function estimation** into a single framework.

**Key reasons for exploring A2C include:**
- **Simplicity**: A2C is algorithmically simpler than PPO and faster to implement and tune.
- **Faster Updates**: A2C performs updates every `n_steps` without the need for complex clipping mechanisms like PPO, which can speed up early-stage learning.
- **Strong Theoretical Foundations**: Actor-Critic methods directly optimize the policy while using a critic to reduce variance in policy gradient estimates, providing more stable updates than pure policy gradient methods.

Although A2C is less stable and less robust compared to PPO in very large environments, it serves as a useful comparison point to evaluate how algorithmic complexity influences agent performance in Pokémon Red.



#### Overview of A2C

Advantage Actor-Critic (A2C) works by maintaining two function approximators, which may share some network layers:
- **Actor**: Proposes actions based on the current policy.
- **Critic**: Estimates the value of the current state to guide the actor's learning.

The key idea behind A2C is to use the **advantage function** $A(s, a)$ to update the policy:

$$A(s, a) = Q(s, a) - V(s)$$

where:
- $Q(s, a)$ is the estimated return for taking action $a$ in state $s$,
- $V(s)$ is the estimated value of state $s$.

This advantage reduces the variance of policy gradient updates, making learning more stable.
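
In practice $Q(s, a)$ is not known and must be estimated from sampled rewards. One common estimator, sketched below, uses the discounted n-step return bootstrapped with the critic's value of the final state. This is an illustration only: the function name `n_step_advantage` is hypothetical, and the estimator that Stable-Baselines3 uses internally may differ.

```python
def n_step_advantage(rewards, value_start, value_end, gamma=0.997):
    """Estimate A(s_t, a_t) from an n-step rollout.

    rewards     -- rewards r_t, ..., r_{t+n-1} collected after taking a_t
    value_start -- critic estimate V(s_t)
    value_end   -- critic estimate V(s_{t+n}), used to bootstrap the return
    """
    # Work backwards: G = r_t + gamma * (r_{t+1} + gamma * (... + gamma * V(s_{t+n})))
    ret = value_end
    for reward in reversed(rewards):
        ret = reward + gamma * ret
    # The n-step return stands in for Q(s_t, a_t); subtract the baseline V(s_t).
    return ret - value_start


# Small worked example with gamma = 0.5 so the arithmetic is easy to follow:
# return = 1 + 0.5 * 0 + 0.25 * 2 + 0.125 * 4 = 2.0, advantage = 2.0 - 1.0 = 1.0
print(n_step_advantage([1.0, 0.0, 2.0], value_start=1.0, value_end=4.0, gamma=0.5))
```

A positive result means the sampled actions did better than the critic expected from that state, so the actor's probability of those actions is increased.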



### A2C Agent Implementation Details

The A2C agent was implemented using **Stable-Baselines3**'s `A2C` algorithm.  
The setup was intentionally kept similar to PPO for a fair comparison, with parallelized environments (`SubprocVecEnv`) used to increase data-collection throughput.

The model was configured as follows:

```python
model = A2C(
    "MultiInputPolicy",
    env,
    verbose=1,
    n_steps=train_steps_batch,
    gamma=0.997,
    ent_coef=0.01,
    tensorboard_log=sess_path
)

model.learn(
    total_timesteps=(ep_length) * num_cpu * 10_000,
    callback=CallbackList(callbacks),
    tb_log_name="poke_a2c"
)
```

#### Key Hyperparameters

The key hyperparameters used for training the A2C agent were:

- **Policy**: `MultiInputPolicy`
- **Batch Size**: Determined internally by `n_steps` (A2C does not use a separate batch size hyperparameter like PPO).
- **Discount Factor (gamma)**: 0.997
- **Entropy Coefficient**: 0.01 (to promote exploration)
- **Number of Steps per Update (`n_steps`)**: Defined via `train_steps_batch`
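
To give a feel for what a discount factor of 0.997 means, the sketch below computes two standard rules of thumb: the effective horizon $1 / (1 - \gamma)$ and the number of steps after which a future reward is weighted at one half. The variable names are hypothetical, and "step" here means one agent action, not one emulator frame.

```python
import math

GAMMA = 0.997  # discount factor used for both PPO and A2C

# Rule-of-thumb horizon: the sum of gamma^k over all k >= 0.
effective_horizon = 1.0 / (1.0 - GAMMA)

# Number of steps k at which gamma^k drops to 0.5.
half_life = math.log(0.5) / math.log(GAMMA)


def discount_weight(k, gamma=GAMMA):
    """Weight applied to a reward received k steps in the future."""
    return gamma ** k


print(f"effective horizon: {effective_horizon:.0f} steps")
print(f"half-life: {half_life:.0f} steps")
print(f"weight at 1000 steps: {discount_weight(1000):.3f}")
```

The effective horizon comes out at roughly 333 agent steps and the half-life at roughly 231 steps, so rewards that arrive several thousand steps later contribute very little to the value estimate at the current step.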

Similar to PPO, training time constraints limited the extent of optimization, as large amounts of environment interaction are necessary for effective learning in Pokémon Red. The A2C agent was trained for a similar amount of time as the PPO agent, but completed fewer training steps because it ran more slowly and was more memory intensive.



#### Gameplay Behavior and Observations

The A2C agent displayed some improvements over the RandomAgent baseline, but far less polished behavior than the PPO agent:

- **Movement**: The agent generally moves in a chaotic fashion similar to the random agent, indicating that it needs further training.
- **Menu Interaction**: Similar to the random agent, the A2C agent spends a lot of unnecessary time in the menu moving around.
- **Exploration Patterns**: The agent explores the map but appears less systematic than the PPO agent, sometimes retracing steps unnecessarily.
- **Progress Toward Objectives**: Reaches new areas occasionally but at a slower pace than PPO.



#### Critical Evaluation

The A2C agent demonstrates an ability to learn environmental structure and take meaningful actions, but with more variability and less reliability compared to PPO:

| Aspect                   | RandomAgent                        | A2CAgent                                   | PPOAgent                                  |
|---------------------------|------------------------------------|-------------------------------------------|------------------------------------------|
| Movement Directionality   | Random and chaotic                 | Random and chaotic   | Mostly consistent, goal-oriented         |
| Menu Interaction          | Frequent, disruptive              | Frequent, disruptive           | Rare and quickly exited                  |
| Progress Toward Goals     | Almost none                        | Partial — slow but observable             | Partial — faster and more consistent     |
| Reward Accumulation       | Minimal or none                    | Minimal or none                      | Noticeably improved                      |

**Additional observations:**
- **Training Time Constraints**: Similar to PPO, the A2C agent was limited by the extensive training time required to fully master the environment.
- **Signs of Continued Learning**: The agent showed ongoing improvements during training, but no consistent achievement of major milestones (such as defeating Brock) within the recorded training sessions.

Overall, the A2C agent represents an improvement over pure random behavior, but its performance lags behind PPO, reflecting the known limitations of A2C in highly complex, sparse-reward environments.

<img src="../assets/gifs/a2c_agent.gif" alt="A2C Agent" width="300"/>


---

## Results

### Final Results Table

| Agent            | Performance Summary                                                                                      | Training Notes                                              | Training Time | Total Training Steps | Gameplay Example                                      |
|------------------|----------------------------------------------------------------------------------------------------------|-------------------------------------------------------------|---------------|----------------------|------------------------------------------------------|
| **Random Agent** | - Purely random movement<br>- Frequent accidental menu openings<br>- No meaningful exploration or progress | - No training required; purely random baseline             | N/A           | N/A                  | ![Random Agent](../assets/gifs/Random_agent.gif)      |
| **A2C Agent**     | - Partial goal-directed movement<br>- Occasional menu errors<br>- Some map exploration, but inefficient   | - Requires long training sessions<br>- Sensitive to memory constraints | 20 hours   | 84,541,440          | ![A2C Agent](../assets/gifs/a2c_agent.gif)            |
| **PPO Agent**     | - Consistent and purposeful movement<br>- Minimal menu errors<br>- Strong exploration and level progress  | - High training time needed<br>- Most stable and efficient to optimize | 20 hours   |      167,772,160      | ![PPO Agent](../assets/gifs/ppo_agent.gif)            |
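
The step counts and training times in the table imply very different throughput for the two trained agents. The short calculation below makes that explicit. It treats the approximate 20-hour figure as exact, so the results are rough estimates, and the helper name `steps_per_second` is hypothetical.

```python
TRAINING_HOURS = 20  # approximate wall-clock time reported for both agents
TOTAL_STEPS = {"A2C": 84_541_440, "PPO": 167_772_160}


def steps_per_second(total_steps, hours):
    """Average environment steps processed per second of wall-clock time."""
    return total_steps / (hours * 3600)


for name, steps in TOTAL_STEPS.items():
    rate = steps_per_second(steps, TRAINING_HOURS)
    print(f"{name}: {rate:,.0f} steps/s")

ratio = TOTAL_STEPS["A2C"] / TOTAL_STEPS["PPO"]
print(f"A2C completed {ratio:.1%} of PPO's steps")
```

Under these assumptions PPO averaged about 2,330 steps per second against about 1,174 for A2C, so A2C saw roughly half as much experience in the same time. This should be kept in mind when comparing the two agents' behavior.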




## Conclusion


The current set of experiments demonstrated that PPO consistently outperformed both the A2C and Random agents, showing strong goal-directed behavior, efficient exploration, and meaningful progress through the game environment. However, despite these promising early results, the agent’s training remains incomplete.

Moving forward, I plan to continue training the PPO agent to achieve even more stable and optimized policies. One key area of improvement involves expanding the reward structure. Through additional research into the game's memory locations, I intend to introduce new reward signals that scale with the rarity and difficulty of Pokémon encountered, as well as the complexity of in-game events triggered. This refinement would provide the agent with a more nuanced understanding of game dynamics beyond simply exploration and badge collection.

Additionally, I plan to implement a dynamic objective system where the agent is provided a list of high-level goals (e.g., "reach Pewter City Gym" or "capture a rare Pokémon"). The reward would then scale based on the agent’s distance from its current objective, encouraging more strategic, directed behavior instead of purely reactive exploration.
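
One possible shape for such a distance-based objective reward is sketched below. It is only a sketch of the idea under stated assumptions: positions are taken to be `(x, y)` tile coordinates on a single map, distance is Manhattan distance, and the names `manhattan` and `objective_reward` as well as the `weight` and `goal_bonus` values are hypothetical. None of this exists in the current environment.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) tile coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def objective_reward(prev_pos, new_pos, goal_pos, weight=0.1, goal_bonus=5.0):
    """Reward progress toward the current objective.

    Positive when the step reduces the distance to the goal, negative when
    it increases it, plus a one-off bonus on reaching the goal tile.
    """
    progress = manhattan(prev_pos, goal_pos) - manhattan(new_pos, goal_pos)
    reward = weight * progress
    if new_pos == goal_pos:
        reward += goal_bonus
    return reward


# Moving one tile closer to a goal at (3, 0) earns a small positive reward.
print(objective_reward((0, 0), (1, 0), (3, 0)))
# Moving one tile away is penalized by the same amount.
print(objective_reward((1, 0), (0, 0), (3, 0)))
```

Rewarding the change in distance rather than the distance itself means the agent gains nothing by oscillating back and forth, since the gains and losses cancel out.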

These enhancements are intended to create a richer, denser reward landscape, improving both the sample efficiency and long-term learning potential of the PPO agent. Over time, I aim for the agent not only to reach early-game milestones but to progressively achieve higher-order objectives in Pokémon Red in a more human-like, goal-driven manner.

## References

Game Freak. (1996). Pokémon Red Version [Video game]. Nintendo.

Nintendo. (1998). Pokémon Red Version Instruction Manual. Nintendo of America Inc. Retrieved from https://pokemon-project.com/juegos/manual/manual-GB-Pokemon-Rojo-Azul-EN.pdf

Sutton, Richard S., & Barto, Andrew G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Retrieved from http://incompleteideas.net/book/the-book-2nd.html

Whidden, Peter. (2024). PokemonRedExperiments: A Gymnasium environment for Pokémon Red. GitHub [Pokemon Red env]. Retrieved from https://github.com/PWhiddy/PokemonRedExperiments

Brockman, Greg, Cheung, Vicki, Pettersson, Ludwig, Schneider, Jonas, Schulman, John, Tang, Jie, & Zaremba, Wojciech. (2016). OpenAI Gym [Software]. GitHub. Retrieved from https://github.com/openai/gym

Danielsen, Baekgaard, Niklasson, Jakob, & others. (2018–2024). PyBoy: Game Boy Emulator for Reinforcement Learning Research [Software]. GitHub. Retrieved from https://github.com/Baekalfen/PyBoy

Raffin, Antonin, Hill, Ashley, Gleave, Adam, Kanervisto, Anssi, Ernestus, Noah, & Dormann, Christian. (2021). Stable-Baselines3: Reliable Reinforcement Learning Implementations [Software]. GitHub. Retrieved from https://github.com/DLR-RM/stable-baselines3