# Attempting to Beat Brock in Pokémon Red using Reinforcement Learning

## By: Patrick Sharp

## Overview

This project aims to explore Reinforcement Learning (RL) algorithms within the context of classic video game environments, specifically focusing on Pokémon Red. The project's goals are twofold:

1. to implement and evaluate various RL algorithms in a custom Gymnasium environment based on Pokémon Red (thanks to [this repo](https://github.com/PWhiddy/PokemonRedExperiments/tree/master) by PWhiddy), and

2. to investigate the critical balance between exploration and exploitation in RL agent performance.

Unlike many traditional RL projects that focus on low-dimensional environments (e.g., CartPole, FrozenLake), this project attempts to tackle a partially observable, high-variance game world. This naturally introduces complexity in terms of state representation, reward sparsity, and policy generalization.

The ultimate practical goal is to train agents capable of defeating Brock, the first gym leader, thereby obtaining the Boulder Badge — a milestone early in the game but nontrivial in terms of action selection, state abstraction, and long-term planning.

### What is Pokémon Red?

Pokémon Red (1996, Game Freak/Nintendo) is a role-playing video game (RPG) in which players control a protagonist navigating a fictional world, capturing and training creatures called Pokémon, and battling other trainers to earn badges and progress the storyline.

From an RL perspective, Pokémon Red presents a sequential decision-making problem with attributes including:

- Partial Observability: The agent cannot directly observe true world states (e.g., enemy Pokémon's hidden stats).

- Long-Term Dependencies: Success depends not just on immediate actions but on strategies developed across long sequences (e.g., choosing to train a Pokémon early affects performance hours later).

- Stochasticity: Many events (critical hits, enemy move choices) introduce randomness into outcomes.

- Sparse Rewards: Winning a battle or earning a badge occurs only after potentially hundreds of intermediate steps without explicit reward signals.

Thus, it provides a rich testbed beyond simplistic RL benchmarks.

### Why Pokémon Red?

Pokémon Red was selected out of personal nostalgia: it was my first introduction to video games at the age of four, played on my yellow Game Boy Pocket. That connection provides intrinsic motivation to dig deeper into the problem space.

---


## Setting up the environment

### PyBoy

[PyBoy](https://github.com/Baekalfen/PyBoy) is a Python-based emulator for the Nintendo Game Boy, designed to provide programmatic access to the emulation process through a clean API. It allows external scripts to read game memory, send controller inputs, and observe screen outputs — all crucial capabilities for integrating reinforcement learning agents with a game environment that was never originally designed for AI training.

For this project, PyBoy acts as the critical bridge between the RL algorithms and Pokémon Red. It enables the custom Gymnasium environment to interface directly with the game's internal state, sending actions (e.g., pressing 'A', 'Start', navigating menus) and receiving observations (e.g., screen pixels, memory values) in a way that is compatible with modern RL pipelines. Without such programmatic control and visibility into game state, training agents in a complex environment like Pokémon Red would be effectively infeasible.

Moreover, emulation is deterministic given the same ROM, starting save state, and input sequence, so PyBoy makes experiments reproducible. This property is essential for debugging RL agents, evaluating exploration strategies, and measuring algorithmic performance.

### Action Space

The environment wrapper exposes the following Game Boy controls as its list of valid actions:

```Python
import warnings

# Silence the SDL2 binaries notice emitted on import.
warnings.filterwarnings("ignore", message="Using SDL2 binaries from pysdl2-dll*")
from pyboy.utils import WindowEvent

# Buttons the agent is allowed to press.
valid_actions = [
    WindowEvent.PRESS_ARROW_DOWN,
    WindowEvent.PRESS_ARROW_LEFT,
    WindowEvent.PRESS_ARROW_RIGHT,
    WindowEvent.PRESS_ARROW_UP,
    WindowEvent.PRESS_BUTTON_A,
    WindowEvent.PRESS_BUTTON_B,
    WindowEvent.PRESS_BUTTON_START,
]
```

We omit the Select button (`WindowEvent.PRESS_BUTTON_SELECT`) and the no-op action from the available actions, since neither contributes meaningfully to exploration and both could slow the agent's progress.
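
One way to picture how a wrapper might consume this list: the agent's discrete action is an index into the list, the matching button is pressed, the emulator advances a fixed number of frames, and the button is released partway through. The sketch below is an assumption about that flow, not the environment's actual code. It uses plain strings in place of `WindowEvent` constants and a hypothetical `FakeEmulator` so that it runs without PyBoy or a ROM. The 24-frame `action_freq` mirrors the configuration in the Approach section; the 8-frame hold is an illustrative choice.

```Python
# Illustrative only: strings stand in for pyboy.utils.WindowEvent constants.
PRESS_EVENTS = ["PRESS_ARROW_DOWN", "PRESS_ARROW_LEFT", "PRESS_ARROW_RIGHT",
                "PRESS_ARROW_UP", "PRESS_BUTTON_A", "PRESS_BUTTON_B",
                "PRESS_BUTTON_START"]
# Each press needs a matching release, otherwise the button stays held.
RELEASE_EVENTS = [e.replace("PRESS_", "RELEASE_", 1) for e in PRESS_EVENTS]


class FakeEmulator:
    """Hypothetical stand-in for an emulator; records inputs and counts frames."""

    def __init__(self):
        self.log = []
        self.frames = 0

    def send_input(self, event):
        self.log.append((self.frames, event))

    def tick(self):
        self.frames += 1


def run_action(emulator, action_index, action_freq=24, hold_frames=8):
    """Press a button, release it after `hold_frames`, and run `action_freq` frames in total."""
    emulator.send_input(PRESS_EVENTS[action_index])
    for frame in range(action_freq):
        if frame == hold_frames:
            emulator.send_input(RELEASE_EVENTS[action_index])
        emulator.tick()


emu = FakeEmulator()
run_action(emu, action_index=4)  # index 4 is the A button in the list above
print(emu.log)     # [(0, 'PRESS_BUTTON_A'), (8, 'RELEASE_BUTTON_A')]
print(emu.frames)  # 24
```

Holding each action for a fixed number of frames means one agent step corresponds to a visible in-game change, such as moving one tile or advancing one line of dialogue, rather than a single emulator frame.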

![Pokémon Red Game Boy controls](../assets/images/Pokemon_red_controls.png "Controls")

Nintendo. (1996) [Pokémon Red Trainer's Guide](../../Pokemon-Red-User-Manual.pdf). Nintendo of America Inc. Retrieved from https://pokemon-project.com/juegos/manual/manual-GB-Pokemon-Rojo-Azul-EN.pdf


---

### Verifying the game file

Before using Pokémon Red within the custom Gymnasium environment, it is critical to ensure that the game file (ROM) matches the version the environment supports. To do this, we verify the integrity of the `PokemonRed.gb` file by calculating its SHA-1 checksum.

The following command computes the SHA-1 hash of the ROM file:

```Python
# Check the hash of the ROM file
!shasum ../../PokemonRed.gb
```

```
ea9bcae617fdf159b045185467ae58b2e4a48b9a  ../../PokemonRed.gb
```

The expected hash, according to PWhiddy's repository documentation, is `ea9bcae617fdf159b045185467ae58b2e4a48b9a`.

If the output of the command matches this expected value, it confirms that the ROM file is identical at the binary level to the one the custom Gymnasium environment was built and tested against.
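
The same check can be done from Python, which is convenient when `shasum` is not available. This is a minimal sketch using the standard library's `hashlib`; the function names `sha1_of_file` and `rom_matches` are illustrative and not part of PWhiddy's repository.

```Python
import hashlib

# Expected SHA-1 of the ROM, as quoted above.
EXPECTED_SHA1 = "ea9bcae617fdf159b045185467ae58b2e4a48b9a"


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def rom_matches(path, expected=EXPECTED_SHA1):
    """Return True if the file at `path` has the expected SHA-1 hash."""
    return sha1_of_file(path) == expected.lower()
```

Calling `rom_matches("../../PokemonRed.gb")` before constructing the environment gives an early, explicit failure instead of confusing behavior from a mismatched ROM.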

## Approach


### RL Environment

```Python
import os
import sys
import uuid
from pathlib import Path

# Make the parent directory importable so red_gym_env can be found.
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
from red_gym_env import RedGymEnv

# Which reward components are enabled.
STATE_REWARDS = {
    "event": True,
    "level": False,
    "heal": True,
    "op_lvl": False,
    "dead": False,
    "badge": True,
    "explore": True,
    "stuck": True,
}
# Weight applied to each reward component.
STATE_REWARD_WEIGHTS = {
    "event": 4,
    "level": 1,
    "heal": 10,
    "op_lvl": 0.2,
    "dead": -0.1,
    "badge": 10,
    "explore": 0.1,
    "stuck": -0.05,
}
REWARD_SCALE = 0.5
EXPLORE_WEIGHT = 0.25
EP_LENGTH = 2048 * 80
NUM_CPU = 32  # Also sets the number of episodes per training iteration

# Assumption: sess_path was not defined in the original cell; a unique
# per-run output directory is used here.
sess_path = Path(f"session_{uuid.uuid4().hex[:8]}")

env_config = {
    'headless': True,
    'save_final_state': True,
    'early_stop': False,
    'action_freq': 24,
    'init_state': './init.state',
    'max_steps': EP_LENGTH,
    'save_video': True,
    'fast_video': False,
    'session_path': sess_path,
    'gb_path': './PokemonRed.gb',
    'debug': False,
    'reward_scale': REWARD_SCALE,
    'explore_weight': EXPLORE_WEIGHT,
    'print_rewards': True,
}
env = RedGymEnv(env_config)
```
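
How the flags and weights above turn into a scalar reward is not shown in this notebook. A plausible reading, sketched here as an assumption, is that each enabled component is multiplied by its weight, the results are summed, and the total is multiplied by `REWARD_SCALE`. The function name `combine_rewards` and the raw component values are hypothetical.

```Python
STATE_REWARDS = {"event": True, "level": False, "heal": True, "op_lvl": False,
                 "dead": False, "badge": True, "explore": True, "stuck": True}
STATE_REWARD_WEIGHTS = {"event": 4, "level": 1, "heal": 10, "op_lvl": 0.2,
                        "dead": -0.1, "badge": 10, "explore": 0.1, "stuck": -0.05}
REWARD_SCALE = 0.5


def combine_rewards(raw, enabled=STATE_REWARDS, weights=STATE_REWARD_WEIGHTS,
                    scale=REWARD_SCALE):
    """Weighted sum over enabled components, times `scale` (assumed formula)."""
    total = sum(weights[name] * value
                for name, value in raw.items()
                if enabled.get(name, False))
    return scale * total


# Made-up raw component values for a single step.
raw_components = {"event": 3, "level": 7, "heal": 0.5, "badge": 1,
                  "explore": 20, "stuck": 4}
reward = combine_rewards(raw_components)
print(round(reward, 3))  # 14.4
```

Under this reading the disabled `level` component contributes nothing regardless of its weight, and the small negative `stuck` weight acts as a mild penalty rather than dominating the signal.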

---

### Random Agent

```Python
import numpy as np
from agents.base_agent import BaseAgent

class RandomAgent(BaseAgent):
    def __init__(self, action_space):
        super().__init__(action_space)

    def select_action(self, observation):
        return self.action_space.sample()

env = RedGymEnv(env_config)
agent = RandomAgent(env.action_space)

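obs, _ = env.reset()
done = False

while not done:
    action = agent.select_action(obs)
    # Gymnasium's step() returns (obs, reward, terminated, truncated, info).
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    env.save_and_print_info(done, obs)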
```
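
`BaseAgent` lives in the project's `agents` package and is not shown here. Below is a minimal sketch of what such a base class could look like. The `select_action` contract is taken from the `RandomAgent` above, while the abstract-class layout and the `DummyActionSpace` stand-in are assumptions made so the example runs without Gymnasium.

```Python
import random
from abc import ABC, abstractmethod


class BaseAgent(ABC):
    """Hypothetical base class: stores the action space; subclasses pick actions."""

    def __init__(self, action_space):
        self.action_space = action_space

    @abstractmethod
    def select_action(self, observation):
        """Return an action for the given observation."""


class DummyActionSpace:
    """Stand-in for a discrete action space, exposing only sample()."""

    def __init__(self, n, seed=None):
        self.n = n
        self._rng = random.Random(seed)

    def sample(self):
        return self._rng.randrange(self.n)


class RandomAgent(BaseAgent):
    def select_action(self, observation):
        # Ignores the observation entirely, which makes it a useful baseline.
        return self.action_space.sample()


agent = RandomAgent(DummyActionSpace(n=7, seed=0))
actions = [agent.select_action(observation=None) for _ in range(5)]
print(actions)
```

Keeping the interface this small means learned agents and the random baseline can be swapped into the same rollout loop without other changes.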

---


### PPO

```Python
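from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import CallbackList, CheckpointCallback
from stable_baselines3.common.utils import set_random_seed
from stable_baselines3.common.vec_env import SubprocVecEnv
# Project-local module; the import path is an assumption.
from tensorboard_callback import TensorboardCallback


def make_env(rank, env_conf, seed=0):
    """
    Utility function for multiprocessed env.
    :param rank: (int) index of the subprocess
    :param env_conf: (dict) configuration passed to RedGymEnv
    :param seed: (int) the initial seed for RNG
    """
    def _init():
        env = RedGymEnv(env_conf)
        env.reset(seed=(seed + rank))
        return env
    set_random_seed(seed)
    return _init


# One environment per worker process.
num_cpu = NUM_CPU
env = SubprocVecEnv([make_env(i, env_config) for i in range(num_cpu)])

# Assumption: train_steps_batch (rollout length per environment) is not
# defined in the original notebook; this value is illustrative.
train_steps_batch = EP_LENGTH // 64

# sess_path is the same session directory passed to env_config.
checkpoint_callback = CheckpointCallback(
    save_freq=EP_LENGTH // 2,
    save_path=sess_path,
    name_prefix="poke"
)
callbacks = [checkpoint_callback, TensorboardCallback(sess_path)]

model = PPO(
    "MultiInputPolicy",
    env,
    verbose=1,
    n_steps=train_steps_batch,
    batch_size=512,
    n_epochs=1,
    gamma=0.997,
    ent_coef=0.01,
    tensorboard_log=sess_path
)
model.learn(
    total_timesteps=EP_LENGTH * num_cpu * 10_000,  # Attempt to run 10,000 iterations
    callback=CallbackList(callbacks),
    tb_log_name="poke_ppo"
)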
```
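
The training budget implied by these settings can be worked out directly. The sketch below takes `EP_LENGTH`, `NUM_CPU`, and the batch size from the code above; `N_STEPS` is a placeholder rollout length chosen for illustration. It relies on two Stable-Baselines3 behaviors: the rollout buffer holds `n_steps * n_envs` transitions, and `CheckpointCallback` counts `save_freq` in calls to the vectorized `step()`, each of which advances every environment by one step.

```Python
EP_LENGTH = 2048 * 80   # steps per episode, from the configuration above
NUM_CPU = 32            # number of parallel environments
BATCH_SIZE = 512        # PPO minibatch size used above
N_STEPS = 2560          # placeholder rollout length per environment

total_timesteps = EP_LENGTH * NUM_CPU * 10_000
rollout_size = N_STEPS * NUM_CPU               # transitions collected per PPO update
minibatches_per_epoch = rollout_size // BATCH_SIZE
checkpoint_every = (EP_LENGTH // 2) * NUM_CPU  # total timesteps between checkpoints

print(f"{total_timesteps:,}")        # 52,428,800,000
print(f"{rollout_size:,}")           # 81,920
print(minibatches_per_epoch)         # 160
print(f"{checkpoint_every:,}")       # 2,621,440
```

With these numbers, one checkpoint is written for every half episode per environment, and the rollout size divides evenly by the batch size, which avoids a truncated final minibatch.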

---

### A2C


```Python
from stable_baselines3 import A2C

# Reuses env, callbacks, and train_steps_batch from the PPO setup above.
model = A2C(
    "MultiInputPolicy",
    env,
    verbose=1,
    n_steps=train_steps_batch,
    gamma=0.997,
    ent_coef=0.01,
    tensorboard_log=sess_path
)

model.learn(
    total_timesteps=EP_LENGTH * NUM_CPU * 10_000,
    callback=CallbackList(callbacks),
    tb_log_name="poke_a2c"
)
```