## Experiment Goal

The goal of this experiment is to setup very minimalistic implementation of a battleing agent. This wil then function as a starting point for further development.

In [44]:
import pandas as pd
import numpy as np
import os
import poke_battle_sim as pb
import random
from sklearn.preprocessing import LabelEncoder

import gymnasium as gym
from typing import Optional

In [45]:
# Data imports
package_dir = str(os.sep).join(str(pb.poke_sim.__file__).split(os.sep)[0:-1])
data_dir = os.path.join(package_dir, 'data')

# Load dataframes
abilities = pd.read_csv(os.path.join(data_dir, 'abilities.csv'))
items_gen4 = pd.read_csv(os.path.join(data_dir, 'items_gen4.csv'))
move_list = pd.read_csv(os.path.join(data_dir, 'move_list.csv'))
natures = pd.read_csv(os.path.join(data_dir, 'natures.csv'))
pokemon_stats = pd.read_csv(os.path.join(data_dir, 'pokemon_stats.csv'))
# pokemon_stats.set_index('ndex', inplace=True)
type_effectiveness = pd.read_csv(os.path.join(data_dir, 'type_effectiveness.csv'))

In [46]:
# Data helper methods
def get_random_nature():
    return random.choice(natures.values)

def get_stats_by_id(pokedex_id: int):
    if pokedex_id < min(pokemon_stats['ndex']) or pokedex_id > max(pokemon_stats['ndex']):
        raise ValueError(f'{pokedex_id} is not a valid pokedex id')
    
    return pb.PokeSim._pokemon_stats[pokedex_id - 1][4:10]

def get_stats_by_name(name: str):
    if name not in pokemon_stats['name'].values:
        raise ValueError(f'{name} is not a valid pokemon name')
    
    search_results = [ i for i in pb.PokeSim._pokemon_stats if i[1] == name ] # TODO make this search more time efficient
    if len(search_results) != 1:
        raise ValueError(f'Invalid search results: expected 1, got {len(search_results)} while searching for {name}')

    return search_results[0][4:10]

def get_ability_id_by_name(name: str):
    if name not in abilities['ability_name'].values:
        raise ValueError(f'{name} is not a valid starter ability')

    return abilities[abilities['ability_name'] == name]['ability_id'].values[0]

In [47]:
# Encoding/Decoding methods
# print(pb.conf.global_settings.POSSIBLE_GENDERS)
gender_encoder = LabelEncoder()
gender_encoder.fit(pb.conf.global_settings.POSSIBLE_GENDERS)

def get_gender_encoding(gender: str):
    return gender_encoder.transform([gender])[0]

def get_gender_decoding(gender: int):
    return gender_encoder.inverse_transform([gender])[0]

def get_random_gender_mf():
    return random.choice(['male', 'female'])

type_encoder = LabelEncoder()
type_encoder.fit(pokemon_stats[[ 'type 1', 'type 2' ]].values.flatten())

def get_type_encoding(type_name: str | float):
    if isinstance(type_name, float) and np.isnan(type_name):
        return type_encoder.transform([np.nan])[0]
    
    if not type_name or type_name.lower() == 'none' or type_name.lower() == 'nan' or type_name == '':
        return type_encoder.transform([np.nan])[0]
    
    return type_encoder.transform([type_name])[0]

def get_type_decoding(type_id: int):
    return type_encoder.inverse_transform([type_id])[0]

def all_type_encodings():
    return np.array(type_encoder.transform(type_encoder.classes_))

# for c in type_encoder.classes_:
#     print(f'{c} -> {get_type_encoding(c)} -> {get_type_decoding(get_type_encoding(c))}')

## About the state space

The starting point for the state space comes from the first rival battle in the game. The choice for this is to keep the state space as small as possible to make it easier to debug and understand the agent's behavior. It is also arguably the most interesting of starting points as this is the very first battle in the game. Making the state space any smaller would result in an agent that does not really learn to do anything of meaning.

### Sizing the state space

Alto the statespace seems small, it is still quite large. At first glance, all the state space is are 2 pokemons:
> $ \{ p1, p2 \} $

However, when we look into what each pokemon's attributes it becomes aperent how fast the state space grows. To make things easier, lets look at the pokemon showdown calculator to see what could be included in a battle state:

![pokemon showdown calculator screenshot](showdown_calculator_screenshot.png)

The calculator shows inputs for the current pokemons out in battle (so not the remaining party in the players parties). Every single input (be it buttons or text fields) in the calculator is a part of the state space. Note that not all inputs apply to all party members, for example, the buttons centered on the calculator (say Protect for example) only apply to the pokemon out in battle. Given the valid input space for each of these fields (base hp can range from 0 to a practical maximum of 150 for example), it becomse almost astronamicaly large. This large state space is what will dictate the approaches usable in this experiment.

### Choosing an approach for the sized state space

From this first glance I can already safely state that Q-Tabels are not a viable approach, as the tables would be to large to work with within a reasonable amount of time (for my computer at least). We could try to see which parts of the state space to scrap, which would be an interesting experiment. It might be nice to see wheter an agent could learn to battle, for example, without knowing anything about a pokemons attribute that dictate its stats (like base stats, IVs, EVs, etc) except perhaps its level. However I will be opting for different approache.

I will be opting for a Deep Q-Learning approach (page 867 of the book). I might even try to implement Double Deep Q-Learning if the model overfits. Deep Q-Learning is an on-policy, model-free approach that uses a deep neural network to approximate the Q-function. This is a good approach for this experiment as it can handle large state spaces. Double Deep Q-Learning is the off-policy variant of Deep Q-Learning that uses two networks to prevent overfitting.

## Observation Space

The observation space should include the following:
- The agents party
- That NPC's party
- Stat changes from buffs and debuffs (like leer, growl, etc)

A party consists of one to six Pokémon in the form of a dictionary, where the keys are the position of a Pokémon within the party (so key = 0 is the Pokémon in front of the party, key one the next Pokémon in the party etc).

A Pokémon is a tuple with:
- Stat totals (computed stats based on EV’s, IV’s, Base stats, Level and nature)
- Types (one or two types)
- Its ability
- Available moves

Some more notes on the observation space:
- For the agent's party, all this information is known beforehand. 
- For the NPC's party, the agent will have to learn how to gather this information throughout every episode.

To get these observations, we will first define it and then see how we can get it from the simulator.

#### Data on available pokemon in the starting battle

In [48]:
starter_lvl = 5
starter_names = ['turtwig', 'chimchar', 'piplup']
starter_moves = {
    'turtwig': ['tackle', 'withdraw'],
    'chimchar': ['scratch', 'leer'],
    'piplup': ['pound', 'growl']
}
starter_abilities = {
    'turtwig': 'overgrow',
    'chimchar': 'blaze',
    'piplup': 'torrent'
}

In [49]:
def get_starter(name: str):
    if name not in starter_names:
        raise ValueError(f'{name} is not a valid starter name')

    name = random.choice(starter_names)
    stats = get_stats_by_name(name)

    return pb.Pokemon(
        name_or_id=name,
        level=starter_lvl,
        moves=starter_moves[name],
        gender=get_random_gender_mf(),
        ability=starter_abilities[name],
        nature=get_random_nature(),
        cur_hp=stats[0],
        stats_actual=stats
    )


def get_random_starter():
    name = random.choice(starter_names)
    return get_starter(name)


def get_rival_starter(agent_starter_name: str):
    name = ''
    if agent_starter_name == 'turtwig':
        name = 'chimchar'
    elif agent_starter_name == 'chimchar':
        name = 'piplup'
    else:
        name = 'turtwig'

    return get_starter(name)

#### From data to observation space 

In order to turn the data we have at our disposal to the observation space, we will have to do the following:
- See what [state spaces gym makes available](https://gymnasium.farama.org/api/spaces/fundamental/#fundamental-spaces) to us
- See what the data looks like
- Translate the data to the available state spaces

#### About state spaces

Statespaces all have numerical values so it seems, take the Discrete space for example: its essentially just a set of integers. The dictionary might have textual keys, but the values are all just other numerical spaces (or nested dictionaries). I will try and summerize the state space in terms of what they are and when to use them

**Fundamental Spaces:**
> | Name          | Description                                      | When to Use                               |
> |---------------|------------------------------------------------- |-------------------------------------------|
> | Box           | Continuous space with bounds for each dimension. | For continuous values like positions.     |
> | Discrete      | Finite range of non-negative integers.           | For finite actions or states.             |
> | MultiBinary   | Binary space, each dimension is 0 or 1.          | For independent on/off states.            |
> | MultiDiscrete | Multi-dimensional discrete ranges.               | For actions with separate finite options. |
> | Text          | Space for text or character sequences.           | For tasks involving text input/output.    |

**Composite Spaces:**
> | Name     | Description                               | When to Use                                |
> |----------|-------------------------------------------|--------------------------------------------|
> | Dict     | Combines spaces as key-value pairs.       | For JSON-like structures.                  |
> | Tuple    | Combines spaces by position.              | For ordered combinations like coordinates. |
> | Sequence | Variable-length sequences of elements.    | For variable input/output, e.g., lists.    |
> | Graph    | Represents nodes and edges with features. | For relational or graph data.              |
> | OneOf    | Allows elements from multiple spaces.     | For mutually exclusive action types.       |

**State utility functions:**
> | Name            | Description                                       | When to Use                                 |
> |-----------------|---------------------------------------------------|---------------------------------------------|
> | flatten_space() | Converts composite space to a flat `Box`.         | For vectorizing complex spaces.             |
> | flatten()       | Converts a space element into a vector.           | For preprocessing data into a flat form.    |
> | flatdim()       | Gets the dimensionality of a flat space.          | For model input size or preprocessing.      |
> | unflatten()     | Converts a vector back to the original structure. | For restoring structured data.              |

#### Pokemon stats

It seems that the Discrete space is the most fitting for the stats of the pokemon. The stats are all integers, each stat having their own minimum and maximum value.

In [50]:
starter_df = pokemon_stats.copy()
starter_df = starter_df[starter_df['name'].isin(starter_names)]
starter_df

Unnamed: 0,ndex,name,type 1,type 2,hp,attack,defense,sp. atk,sp. def,speed,height,weight,base exp.,gen
386,387,turtwig,grass,,55,68,64,45,55,31,4,102,64,4
389,390,chimchar,fire,,44,58,44,58,44,61,5,62,62,4
392,393,piplup,water,,53,51,53,61,56,40,4,52,63,4


In [51]:
stat_columns = ['hp', 'attack', 'defense', 'sp. atk', 'sp. def', 'speed']

In [52]:
min_max_stat = {}
for col in stat_columns:
    min_max_stat[col] = (starter_df[col].min(), starter_df[col].max() + 1)

In [53]:
hp_space = gym.spaces.Discrete(min_max_stat['hp'][1]) # HP is the only stat that can be 0, so starts=0 (which is the default)
attack_space = gym.spaces.Discrete(min_max_stat['attack'][1], start=min_max_stat['attack'][0])
defense_space = gym.spaces.Discrete(min_max_stat['defense'][1], start=min_max_stat['defense'][0])
sp_atk_space = gym.spaces.Discrete(min_max_stat['sp. atk'][1], start=min_max_stat['sp. atk'][0])
sp_def_space = gym.spaces.Discrete(min_max_stat['sp. def'][1], start=min_max_stat['sp. def'][0])
speed_space = gym.spaces.Discrete(min_max_stat['speed'][1], start=min_max_stat['speed'][0])

#### Typing

Types are strings that we already have an encoded representation for. We can again use the Discrete space for this.

In [54]:
typing_space = gym.spaces.Discrete(all_type_encodings().max())

#### Abilities

It seems that we already have a numerical representation of the abilities. We can use the Discrete space for this as well.

In [55]:
starter_abilities_df = abilities[abilities['ability_name'].isin(starter_abilities.values())]
starter_abilities_df

Unnamed: 0,ability_id,ability_name,gen
64,65,overgrow,3
65,66,blaze,3
66,67,torrent,3


In [56]:
min_max_ability = (starter_abilities_df['ability_id'].min(), starter_abilities_df['ability_id'].max())
print(min_max_ability)
ability_space = gym.spaces.Discrete(min_max_ability[1], start=min_max_ability[0])

(65, 67)


#### Moves

I would prefer if we could make each move a tuple for each individual move and have the values of the tuple be discrete spaces. First lets look at the columns of the moves dataframe.

In [57]:
starter_moves_values = np.array(list(starter_moves.values())).flatten()
starter_move_list = move_list[move_list['identifier'].isin(starter_moves_values)].copy()
starter_move_list

Unnamed: 0,id,identifier,generation_id,type_id,power,pp,accuracy,priority,target_id,move_class,effect_id,effect_chance,effect_amt,effect_stat
0,1,pound,1,normal,40.0,35,100.0,0,10,2,1,,,
9,10,scratch,1,normal,40.0,35,100.0,0,10,2,1,,,
32,33,tackle,1,normal,40.0,35,100.0,0,10,2,1,,,
42,43,leer,1,normal,,30,100.0,0,11,1,17,,-1.0,2.0
44,45,growl,1,normal,,40,100.0,0,11,1,17,,-1.0,1.0
109,110,withdraw,1,water,,40,,0,7,1,16,,1.0,2.0


In [58]:
starter_move_list.isna().sum()

id               0
identifier       0
generation_id    0
type_id          0
power            3
pp               0
accuracy         1
priority         0
target_id        0
move_class       0
effect_id        0
effect_chance    6
effect_amt       3
effect_stat      3
dtype: int64

#### About the moves dataframe

About these columns:
- The `id` column we can drop as a move is essentially defined by other stats and its effect.
- The `identifier` column we can drop as it is not needed for the agent.
- The `generation_id` column we can drop as it is not needed for the agent.
- The `type_id` we need to apply label encodeding (which should be easy).
- The `power` column we can use as is, as it is a numerical value.
  - The `np.nan` values we can replace with 0 for moves that do stat changes (leer, growl and withdraw).
  - These stats being changed by these moves are dictated by the `effect_stat` column.
- The `pp` column we can use as is, as it is a numerical value.
- The `accuracy` column we can use as is, as it is a numerical value.
  - The `np.nan` values we can replace with -1 for moves that are accuracy independent (like withdraw).
- The `priority` column we can use as is, as it is a numerical value.
- The `target_id` column we can use as is.
  - The column describes what the move targets. 
  - A move that targets the users stats (like withdraw) or a move that targets the opponents HP (like tackle) for example all have a unique `target_id`.
- The `move_class` column we can use as is.
  - The column describes what kind of move it is (like physical, special or status).
  - I thought label encoding would be needed, but it seems the column is already encoded (1 for status, 2 for physical and 3 for special).
- The `effect_id` column we can use as is.
  - The column describes what kind of effect the move has (like stat change, status effect or damage).
  - It is essentially a label encoding for each unique effect, which is perfect!
- The `effect_chance` column we can use as is.
  - The column describes the chance of an extra effect happening, if any.
  - We can replace the `np.nan` values with 0 for moves that have no effect (like tackle).
- The `effect_amt` column we can use as is.
  - The column describes the amount of the effect that happens, if any.
  - It impact moves with a secondary effect such as stat changes (like with the move ominous wind).
  - We can replace the `np.nan` values with 0 for moves that have no effect (like tackle).
- The `effect_stat` column we can use as is.
  - The column describes what stat the move changes, if any.
  - We can replace the `np.nan` values with 0 for moves that deal direct damage (like tackle) to indicate it targets the HP stat.

In [59]:
starter_move_list.drop(columns=['id', 'identifier', 'generation_id'], inplace=True)
starter_move_list['type_id'] = starter_move_list['type_id'].apply(lambda x: get_type_encoding(x))
# starter_move_list

In [60]:
stat_change_effect_ids = [ 16, 17 ]
condition = starter_move_list['effect_id'].isin(stat_change_effect_ids)
starter_move_list.loc[condition, 'power'] = starter_move_list.loc[condition, 'power'].fillna(0)
# starter_move_list

In [61]:
effect_id_that_are_accuracy_independend = [ 16 ]
condition = starter_move_list['effect_id'].isin(effect_id_that_are_accuracy_independend)
starter_move_list.loc[condition, 'accuracy'] = starter_move_list.loc[condition, 'accuracy'].fillna(-1)
# starter_move_list

In [62]:
effect_id_that_have_no_secondary_effect = [ 1, 16, 17 ]
condition = starter_move_list['effect_id'].isin(effect_id_that_have_no_secondary_effect)
starter_move_list.loc[condition, 'effect_amt'] = starter_move_list.loc[condition, 'effect_amt'].fillna(0)
starter_move_list.loc[condition, 'effect_chance'] = starter_move_list.loc[condition, 'effect_chance'].fillna(0)
# starter_move_list

In [63]:
effect_id_that_deal_direct_damage = [ 1 ]
condition = starter_move_list['effect_id'].isin(effect_id_that_deal_direct_damage)
starter_move_list.loc[condition, 'effect_stat'] = starter_move_list.loc[condition, 'effect_stat'].fillna(0)
# starter_move_list

In [64]:
starter_move_list = starter_move_list.astype(int)

#### Move data frame after above changes

In [65]:
starter_move_list

Unnamed: 0,type_id,power,pp,accuracy,priority,target_id,move_class,effect_id,effect_chance,effect_amt,effect_stat
0,11,40,35,100,0,10,2,1,0,0,0
9,11,40,35,100,0,10,2,1,0,0,0
32,11,40,35,100,0,10,2,1,0,0,0
42,11,0,30,100,0,11,1,17,0,-1,2
44,11,0,40,100,0,11,1,17,0,-1,1
109,16,0,40,-1,0,7,1,16,0,1,2


In [66]:
starter_move_list.describe()

Unnamed: 0,type_id,power,pp,accuracy,priority,target_id,move_class,effect_id,effect_chance,effect_amt,effect_stat
count,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0
mean,11.833333,20.0,35.833333,83.166667,0.0,9.833333,1.5,8.833333,0.0,-0.166667,0.833333
std,2.041241,21.908902,3.763863,41.233077,0.0,1.47196,0.547723,8.588752,0.0,0.752773,0.983192
min,11.0,0.0,30.0,-1.0,0.0,7.0,1.0,1.0,0.0,-1.0,0.0
25%,11.0,0.0,35.0,100.0,0.0,10.0,1.0,1.0,0.0,-0.75,0.0
50%,11.0,20.0,35.0,100.0,0.0,10.0,1.5,8.5,0.0,0.0,0.5
75%,11.0,40.0,38.75,100.0,0.0,10.75,2.0,16.75,0.0,0.0,1.75
max,16.0,40.0,40.0,100.0,0.0,11.0,2.0,17.0,0.0,1.0,2.0


In [67]:
starter_move_list.isna().sum()

type_id          0
power            0
pp               0
accuracy         0
priority         0
target_id        0
move_class       0
effect_id        0
effect_chance    0
effect_amt       0
effect_stat      0
dtype: int64

#### About the empty move

It is important to think about the case where a pokemon has less then 4 moves. We could use splash as a placeholder move, as its a move that litteraly does nothing, but this would perhaps be missleading for the agent. 

**Important note:**
> All these values need to be distinctly unique, as we can not have this empty move tuple be the same as any other move tuple. Otherwise it will negativly impact the agent's learning.

In [68]:
for c in starter_move_list.columns:
    print(c, starter_move_list[c].unique())

type_id [11 16]
power [40  0]
pp [35 30 40]
accuracy [100  -1]
priority [0]
target_id [10 11  7]
move_class [2 1]
effect_id [ 1 17 16]
effect_chance [0]
effect_amt [ 0 -1  1]
effect_stat [0 2 1]


The empty move will be defined as followed:

| Column          | Value                                          |
|-----------------|------------------------------------------------|
| `type_id`       | 17 (the `np.nan` encoded value)                |
| `power`         | `starter_move_list['power'].min() - 1`         |
| `pp`            | `starter_move_list['pp'].min() - 1`            |
| `accuracy`      | `starter_move_list['accuracy'].min() - 1`      |
| `priority`      | `starter_move_list['priority'].min() - 1`      |
| `target_id`     | `starter_move_list['target_id'].min() - 1`     |
| `move_class`    |`starter_move_list['move_class'].min() - 1`     |
| `effect_id`     |`starter_move_list['effect_id'].min() - 1`      |
| `effect_chance` | `starter_move_list['effect_chance'].min() - 1` |
| `effect_amt`    | `starter_move_list['effect_amt'].min() - 1`    |
| `effect_stat`   | `starter_move_list['effect_stat'].min() - 1`   |

This results in the following tuple:
> $\lambda = (17, -1, -1, -1, 0, -1, -1, -1, -1, -1, -1)$

In [69]:
starter_move_list.loc[len(starter_move_list)] = {
    'type_id': get_type_encoding(np.nan),
    'power': starter_move_list['power'].min() - 1,
    'pp': starter_move_list['pp'].min() - 1,
    'accuracy': starter_move_list['accuracy'].min() - 1,
    'priority': starter_move_list['priority'].min() - 1,
    'target_id': starter_move_list['target_id'].min() - 1,
    'move_class': starter_move_list['move_class'].min() - 1,
    'effect_id': starter_move_list['effect_id'].min() - 1,
    'effect_chance': starter_move_list['effect_chance'].min() - 1,
    'effect_amt': starter_move_list['effect_amt'].min() - 1,
    'effect_stat': starter_move_list['effect_stat'].min() - 1
}
starter_move_list

Unnamed: 0,type_id,power,pp,accuracy,priority,target_id,move_class,effect_id,effect_chance,effect_amt,effect_stat
0,11,40,35,100,0,10,2,1,0,0,0
9,11,40,35,100,0,10,2,1,0,0,0
32,11,40,35,100,0,10,2,1,0,0,0
42,11,0,30,100,0,11,1,17,0,-1,2
44,11,0,40,100,0,11,1,17,0,-1,1
109,16,0,40,-1,0,7,1,16,0,1,2
6,17,-1,29,-2,-1,6,0,0,-1,-2,-1


#### Moves as a tuple of Discrete spaces

In [70]:
# Type space is already defined
move_power_space = gym.spaces.Discrete(starter_move_list['power'].max(), start=starter_move_list['power'].min() - 1)
move_pp_space = gym.spaces.Discrete(starter_move_list['pp'].max(), start=starter_move_list['pp'].min() - 1)
move_accuracy_space = gym.spaces.Discrete(starter_move_list['accuracy'].max(), start=starter_move_list['accuracy'].min() - 1)
# move_priority_space = gym.spaces.Discrete(starter_move_list['priority'].max(), start=starter_move_list['priority'].min() - 1)
move_target_space = gym.spaces.Discrete(starter_move_list['target_id'].max(), start=starter_move_list['target_id'].min() - 1)
move_class_space = gym.spaces.Discrete(starter_move_list['move_class'].max(), start=starter_move_list['move_class'].min() - 1)
move_effect_id_space = gym.spaces.Discrete(starter_move_list['effect_id'].max(), start=starter_move_list['effect_id'].min() - 1)
# move_effect_chance_space = gym.spaces.Discrete(starter_move_list['effect_chance'].max(), start=starter_move_list['effect_chance'].min() - 1)
move_effect_amt_space = gym.spaces.Discrete(starter_move_list['effect_amt'].max(), start=starter_move_list['effect_amt'].min() - 1)
move_effect_stat_space = gym.spaces.Discrete(starter_move_list['effect_stat'].max(), start=starter_move_list['effect_stat'].min() - 1)

The 2 spaces commented out are the ones that, specifically for the starter pokemons, are always 0, making them redundant. 

In [71]:
move_space = gym.spaces.Tuple([
    typing_space,
    move_power_space,
    move_pp_space,
    move_accuracy_space,
    # move_priority_space,
    move_target_space,
    move_class_space,
    move_effect_id_space,
    # move_effect_chance_space,
    move_effect_amt_space,
    move_effect_stat_space
])

#### Now for the pokemon tuple

In [72]:
pokemon_space = gym.spaces.Tuple([
    hp_space,
    attack_space,
    defense_space,
    sp_atk_space,
    sp_def_space,
    speed_space,
    typing_space,
    typing_space,
    ability_space,
    move_space,
    move_space,
    move_space,
    move_space
])

In [73]:
# Manual space size calculation
space_size = 0
for space in pokemon_space:
    if isinstance(space, gym.spaces.Discrete):
        space_size += len(range(space.start, space.n))
    elif isinstance(space, gym.spaces.Tuple):
        for s in space:
            space_size += len(range(s.start, s.n))

print(space_size)

# Recursive space size calculation
def recursive_space_size(space: gym.spaces.Space, size: int = 0):
    if isinstance(space, gym.spaces.Discrete):
        size += len(range(space.start, space.n))
    elif isinstance(space, gym.spaces.Tuple):
        for s in space:
            size = recursive_space_size(s, size)

    return size

print(recursive_space_size(pokemon_space))

1028
1028


#### And finally the party tuple

In [74]:
party_space = gym.spaces.Tuple([
    pokemon_space,
    pokemon_space,
    pokemon_space,
    pokemon_space,
    pokemon_space,
    pokemon_space
])

In [75]:
recursive_space_size(party_space) # 6 * 1022

6168

#### Volatile status

In `%ENV-DIR%/poke_battle_sim/poke_sim/core/pokemon.py::Pokemon::reset_stats()` we can see that for that pokemon instance, `self.stat_tages` is set to a list of ints. This prorty is not available once the pokemon is instantiated, as their is no refrence to it in the `__init__` method. The `reset_stats()` has a refrence in the `%ENV-DIR%/poke_battle_sim/poke_sim/util/process_move.py::_ef_050()` method. It seems that each effect ID from the `move_list` dataframe has its own method in this file. Lets look at the `ef_017` (the effect ID of growl and leer) method to see what it does. 

```py
    if defender.is_alive and defender.trainer.mist:
        battle.add_text(defender.nickname + "'s protected by mist.")
        return True
    give_stat_change(defender, battle, move_data.ef_stat, move_data.ef_amount)
```

It seems the `%ENV-DIR%/poke_battle_sim/poke_sim/util/process_move.py::give_stat_change()` method is used to apply stat changes. This in turn is used in `%ENV-DIR%/poke_battle_sim/poke_sim/core/pokemon.py::Battle` instance. This allows me to conclude that somewhere when the battle is started, the `stat_tages` property becomes available.

In [76]:
lucas = pb.Trainer('lucas', [get_random_starter()])
barry = pb.Trainer('barry', [get_rival_starter(lucas.poke_list[0].name)])
battle = pb.Battle(lucas, barry)

In [77]:
battle.start()

In [78]:
print(battle.t1.poke_list[0].stat_stages)
print(battle.t2.poke_list[0].stat_stages)

[0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0]


In [79]:
battle.turn(
    t1_turn=['move', lucas.poke_list[0].moves[1].name],
    t2_turn=['move', barry.poke_list[0].moves[1].name]
)
print(battle.cur_text)

['lucas sent out PIPLUP!', 'barry sent out CHIMCHAR!', 'Turn 1:', 'CHIMCHAR used Leer!', "PIPLUP's Defense fell!", 'PIPLUP used Growl!', "CHIMCHAR's Attack fell!"]


In [80]:
print(battle.t1.poke_list[0].stat_stages)
print(battle.t2.poke_list[0].stat_stages)

[0, 0, -1, 0, 0, 0]
[0, -1, 0, 0, 0, 0]


So now we know how to get this from the simulator, we need to define a statespace that can hold this information. The Discrete space seems to be the best fit for this. This is because all stat change stages are integers ranging from -6 to 6.

In [81]:
stat_stage_space = gym.spaces.Discrete(6, start=-6)

## Action Space

The complete action space for the agent is defined by the set of buttons that can be pressed on the controller. These include the arrow keys (`up`, `down`, `left`, `right`), `A`, `B`, `L`, `R`, `X`, `Y`, `start`, and `select`. This will be referred to as the **fundamental action space**.

Utilizing the fundamental action space directly may be unmeaningful due to its granularity and lack of abstraction. Instead, I define a **derived action space** that represents higher-level, semantically meaningful actions. These actions are constructed by combining or sequencing fundamental actions to achieve specific in-game outcomes. More specically, the dericed action space will include the following actions:
- `switch`: Switch the active pokemon.
- `move`: Use a move.
- `item`: Use an item.

**HOWEVER**, in the starter battle the agent will only be able to use the `move` action. The `item` and `switch` actions will be added in later experiments.

Luckily, the `poke-battle-sim` supports the use of these derived actions!

> From `%ENV-DIR%/Lib/site-packages/poke_battle_sim/core/battle.py::Battle::turn`:
> ```python
> """
> The three types of valid actions are:
> 1. Moves - formatted as ['move', $move_name]
> 2. Items - formatted as ['item', $item, $item_target_pos, $move_target_name?]
> 3. Switch-out - formatted as ['other', 'switch']
> """
> ```

## Reward Function

The initial reward function will be quite simplistic. The agent will be rewarded for winning a battle and penalized for losing a battle. The agent will also be rewarded for fainting an NPC's Pokémon and penalized for when its own Pokémon faints. The rewards and penalties will be kept small to prevent the agent from learning to exploit the reward function.

We want to keep te rewards small to prevent the agent from learning to exploit the reward function. The rewards and penalties will be as follows:

| State Description | Reward Associated with reaching this state |
|-------------------|--------------------------------------------|
| Win | +1 |
| Lose | -1 |

### On future reward shaping

It might be good to research how to make the reward function follow a behaviour that takes into account different party sizes. For example: should the reward of winning a 6v6 battle be higher, less then or equal to winning a 1v1 battle? This question will be explored in later experiments.

## Environment Implementation

In [82]:
from poke_battle_sim.core.move import Move
from poke_battle_sim.core.pokemon import Pokemon

# Important: Destroy any battle object before creating a new one
lucas = None
barry = None
battle = None


class StarterBattleEnvironment(gym.Env):
    def __init__(self):
        self._lucas = pb.Trainer('lucas', [get_random_starter()])
        self._barry = pb.Trainer(
            'barry', [get_rival_starter(self._lucas.poke_list[0].name)])
        self._battle = pb.Battle(self._lucas, self._barry)
        self._battle.start()

        # Action mappings, formated as:
        #   action_id: (action_type, pokemon_id, move_id)'
        # Where:
        #   action_type is one of 'move', 'switch', 'item'
        #   pokemon_id is always 0 (for targetting the pokemon in the first party slot)
        #   move_id is the index of the move in the pokemon's move list
        self._action_mappings = {
            0: ('move', 0, 0),
            1: ('move', 0, 1),
        }
        self.action_space = gym.spaces.Discrete(len(self._action_mappings))

        # Observation Space
        self.observation_space = gym.spaces.Dict({
            "agent_pokemon_1_hp": hp_space,
            "agent_pokemon_1_attack": attack_space,
            "agent_pokemon_1_defense": defense_space,
            "agent_pokemon_1_sp_atk": sp_atk_space,
            "agent_pokemon_1_sp_def": sp_def_space,
            "agent_pokemon_1_speed": speed_space,
            "agent_pokemon_1_typing": typing_space,
            "agent_pokemon_1_typing": typing_space,
            "agent_pokemon_1_ability": ability_space,

            "agent_pokemon_1_move_1_type": typing_space,
            "agent_pokemon_1_move_1_power": move_power_space,
            "agent_pokemon_1_move_1_pp": move_pp_space,
            "agent_pokemon_1_move_1_accuracy": move_accuracy_space,
            "agent_pokemon_1_move_1_target": move_target_space,
            "agent_pokemon_1_move_1_class": move_class_space,
            "agent_pokemon_1_move_1_effect_id": move_effect_id_space,
            "agent_pokemon_1_move_1_effect_amt": move_effect_amt_space,
            "agent_pokemon_1_move_1_effect_stat": move_effect_stat_space,

            "agent_pokemon_1_move_2_type": typing_space,
            "agent_pokemon_1_move_2_power": move_power_space,
            "agent_pokemon_1_move_2_pp": move_pp_space,
            "agent_pokemon_1_move_2_accuracy": move_accuracy_space,
            "agent_pokemon_1_move_2_target": move_target_space,
            "agent_pokemon_1_move_2_class": move_class_space,
            "agent_pokemon_1_move_2_effect_id": move_effect_id_space,
            "agent_pokemon_1_move_2_effect_amt": move_effect_amt_space,
            "agent_pokemon_1_move_2_effect_stat": move_effect_stat_space,

            "npc_pokemon_1_hp": hp_space,
            "npc_pokemon_1_attack": attack_space,
            "npc_pokemon_1_defense": defense_space,
            "npc_pokemon_1_sp_atk": sp_atk_space,
            "npc_pokemon_1_sp_def": sp_def_space,
            "npc_pokemon_1_speed": speed_space,
            "npc_pokemon_1_typing": typing_space,
            "npc_pokemon_1_typing": typing_space,
            "npc_pokemon_1_ability": ability_space,

            "npc_pokemon_1_move_1_type": typing_space,
            "npc_pokemon_1_move_1_power": move_power_space,
            "npc_pokemon_1_move_1_pp": move_pp_space,
            "npc_pokemon_1_move_1_accuracy": move_accuracy_space,
            "npc_pokemon_1_move_1_target": move_target_space,
            "npc_pokemon_1_move_1_class": move_class_space,
            "npc_pokemon_1_move_1_effect_id": move_effect_id_space,
            "npc_pokemon_1_move_1_effect_amt": move_effect_amt_space,
            "npc_pokemon_1_move_1_effect_stat": move_effect_stat_space,

            "npc_pokemon_1_move_2_type": typing_space,
            "npc_pokemon_1_move_2_power": move_power_space,
            "npc_pokemon_1_move_2_pp": move_pp_space,
            "npc_pokemon_1_move_2_accuracy": move_accuracy_space,
            "npc_pokemon_1_move_2_target": move_target_space,
            "npc_pokemon_1_move_2_class": move_class_space,
            "npc_pokemon_1_move_2_effect_id": move_effect_id_space,
            "npc_pokemon_1_move_2_effect_amt": move_effect_amt_space,
            "npc_pokemon_1_move_2_effect_stat": move_effect_stat_space,

            # TODO change this to a for loop
        })

    def _get_info(self):
        return {
            't1': self._battle.t1.name,
            't1_pokemons': [p.name for p in self._battle.t1.poke_list],
            't2': self._battle.t2.name,
            't2_pokemons': [p.name for p in self._battle.t2.poke_list],
        }

    def _get_move_obs(self, move: Move) -> list[int]:
        return [
            get_type_encoding(move.type),
            move.power if move.power else 0,
            move.cur_pp if move.cur_pp else 0,
            move.acc if move.acc else 0,
            # move.prio if move.prio else 0,
            move.target if move.target else 0,
            move.category if move.category else 0,
            move.ef_id if move.ef_id else 0,
            # move.ef_chance if move.ef_chance else 0,
            move.ef_amount if move.ef_amount else 0,
            move.ef_stat if move.ef_stat else 0
        ]

    def _get_empty_move_obs(self) -> list[int]:
        return [17, -1, -1, -1, 0, -1, -1, -1, -1, -1, -1]

    def _get_pokemon_obs(self, pokemon: Pokemon) -> list[int]:
        move1_obs = self._get_move_obs(pokemon.moves[0])

        if len(pokemon.moves) == 4:
            move2_obs = self._get_move_obs(pokemon.moves[1])
            move3_obs = self._get_move_obs(pokemon.moves[2])
            move4_obs = self._get_move_obs(pokemon.moves[3])
        elif len(pokemon.moves) == 3:
            move2_obs = self._get_move_obs(pokemon.moves[1])
            move3_obs = self._get_move_obs(pokemon.moves[2])
            move4_obs = self._get_empty_move_obs()
        elif len(pokemon.moves) == 2:
            move2_obs = self._get_move_obs(pokemon.moves[1])
            move3_obs = self._get_empty_move_obs()
            move4_obs = self._get_empty_move_obs()
        else:
            move2_obs = self._get_empty_move_obs()
            move3_obs = self._get_empty_move_obs()
            move4_obs = self._get_empty_move_obs()

        output = [
            pokemon.cur_hp,
            pokemon.stats_actual[1],
            pokemon.stats_actual[2],
            pokemon.stats_actual[3],
            pokemon.stats_actual[4],
            pokemon.stats_actual[5],
            get_type_encoding(pokemon.types[0]),
            get_type_encoding(pokemon.types[1]),
            get_ability_id_by_name(pokemon.ability)
        ]
        output.extend(move1_obs)
        output.extend(move2_obs)
        output.extend(move3_obs)
        output.extend(move4_obs)

        return output

    def _get_obs(self):
        output = []

        output.extend(self._get_pokemon_obs(self._lucas.poke_list[0]))
        output.extend(self._lucas.poke_list[0].stat_stages)

        output.extend(self._get_pokemon_obs(self._barry.poke_list[0]))
        output.extend(self._barry.poke_list[0].stat_stages)

        return {
            "agent_pokemon_1_hp": output[0],
            "agent_pokemon_1_attack": output[1],
            "agent_pokemon_1_defense": output[2],
            "agent_pokemon_1_sp_atk": output[3],
            "agent_pokemon_1_sp_def": output[4],
            "agent_pokemon_1_speed": output[5],
            "agent_pokemon_1_typing": output[6],
            "agent_pokemon_1_typing": output[7],
            "agent_pokemon_1_ability": output[8],
            "agent_pokemon_1_move_1_type": output[9],
            "agent_pokemon_1_move_1_power": output[10],
            "agent_pokemon_1_move_1_pp": output[11],
            "agent_pokemon_1_move_1_accuracy": output[12],
            "agent_pokemon_1_move_1_target": output[13],
            "agent_pokemon_1_move_1_class": output[14],
            "agent_pokemon_1_move_1_effect_id": output[15],
            "agent_pokemon_1_move_1_effect_amt": output[16],
            "agent_pokemon_1_move_1_effect_stat": output[17],
            "agent_pokemon_1_move_2_type": output[18],
            "agent_pokemon_1_move_2_power": output[19],
            "agent_pokemon_1_move_2_pp": output[20],
            "agent_pokemon_1_move_2_accuracy": output[21],
            "agent_pokemon_1_move_2_target": output[22],
            "agent_pokemon_1_move_2_class": output[23],
            "agent_pokemon_1_move_2_effect_id": output[24],
            "agent_pokemon_1_move_2_effect_amt": output[25],
            "agent_pokemon_1_move_2_effect_stat": output[26],
            "npc_pokemon_1_hp": output[27],
            "npc_pokemon_1_attack": output[28],
            "npc_pokemon_1_defense": output[29],
            "npc_pokemon_1_sp_atk": output[30],
            "npc_pokemon_1_sp_def": output[31],
            "npc_pokemon_1_speed": output[32],
            "npc_pokemon_1_typing": output[33],
            "npc_pokemon_1_typing": output[34],
            "npc_pokemon_1_ability": output[35],
            "npc_pokemon_1_move_1_type": output[36],
            "npc_pokemon_1_move_1_power": output[37],
            "npc_pokemon_1_move_1_pp": output[38],
            "npc_pokemon_1_move_1_accuracy": output[39],
            "npc_pokemon_1_move_1_target": output[40],
            "npc_pokemon_1_move_1_class": output[41],
            "npc_pokemon_1_move_1_effect_id": output[42],
            "npc_pokemon_1_move_1_effect_amt": output[43],
            "npc_pokemon_1_move_1_effect_stat": output[44],
            "npc_pokemon_1_move_2_type": output[45],
            "npc_pokemon_1_move_2_power": output[46],
            "npc_pokemon_1_move_2_pp": output[47],
            "npc_pokemon_1_move_2_accuracy": output[48],
            "npc_pokemon_1_move_2_target": output[49],
            "npc_pokemon_1_move_2_class": output[50],
            "npc_pokemon_1_move_2_effect_id": output[51],
            "npc_pokemon_1_move_2_effect_amt": output[52],
            "npc_pokemon_1_move_2_effect_stat": output[53],
        }

    def _reward(self):
        if self._battle is None:
            raise ValueError('Battle not initialized')

        if self._battle.get_winner() == self._lucas:
            return 1
        elif self._battle.get_winner() == self._barry:
            return -1
        else:
            return -0.01  # Time penalty

    def step(self, action):
        if self._battle is None:
            raise ValueError('Battle not initialized')

        # Perform the action
        action_type, pokemon_id, move_id = self._action_mappings[action]
        self._battle.turn(
            t1_turn=[action_type,
                     self._lucas.poke_list[pokemon_id].moves[move_id].name],
            t2_turn=[
                'move',
                np.random.choice(self._barry.poke_list[0].moves).name
            ]  # TODO find a decision tree implementation of gen 4 AI
        )

        # Get the observation
        observation = self._get_obs()
        reward = self._reward()
        terminated = self._battle.is_finished()
        truncated = False
        info = self._get_info()

        return observation, reward, terminated, truncated, info

    def reset(self, seed: Optional[int] = None, options: Optional[dict] = None):
        # We need the following line to seed self.np_random
        super().reset(seed=seed)

        # Reset the battle simulation
        self._lucas = None
        self._barry = None
        self._battle = None

        self._lucas = pb.Trainer('lucas', [get_random_starter()])
        self._barry = pb.Trainer(
            'barry', [get_rival_starter(self._lucas.poke_list[0].name)])
        self._battle = pb.Battle(self._lucas, self._barry)
        self._battle.start()

        # Get the initial observation
        observation = self._get_obs()
        info = self._get_info()
        return observation, info

In [105]:
env = StarterBattleEnvironment()
obs, info = env.reset()

In [107]:
for k in obs.keys():
    print(obs[k])

44
58
44
58
44
61
17
66
11
40
35
100
10
2
1
0
0
11
0
30
100
11
1
17
-1
2
17
-1
-1
-1
0
-1
-1
-1
-1
-1
17
-1
-1
-1
0
-1
-1
-1
-1
-1
-1
0
0
0
0
0


## Policy

For this starter battle environment I will be starting out by using an epsilon greedy policy. This is chosen as it is a simple policy that is easy to implement and understand. The epsilon greedy policy is a policy that selects the best action with a probability of $1 - \epsilon$ and a random action with a probability of $\epsilon$. This allows the agent to explore the environment while still exploiting the best actions it has learned.

If epslin greedy yields poor results, I will switch to an epsilon decay policy. This policy is similar to the epsilon greedy policy, but the epsilon value decays over time. This allows the agent to explore more in the beginning and exploit more towards the end of training.

<!-- 
- Eplsion greedy for starter battle
- Eplsion greedy compared with Boltzmann exploration for future battles 
-->

In [83]:
class BasePolicy:
    def __init__(self) -> None:
        pass

    def action(self, action: np.ndarray) -> int:
        raise NotImplementedError

    def update(self, step: int) -> None:
        raise NotImplementedError

    def config(self) -> dict:
        d = {k: v for k, v in self.__dict__.items() if not k.startswith('_') and not callable(v)}
        d['type'] = self.__class__.__name__
        return d
    

class EpsilonGreedy(BasePolicy):
    def __init__(self, epsilon: float, n_actions: int) -> None:
        self.epsilon = epsilon
        self.n_actions = n_actions

    def action(self, q_values: np.ndarray) -> int:
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        else:
            return np.argmax(q_values)
        
    def update(self, step: int) -> None:
        pass


class EpsilonDecay(BasePolicy):
    def __init__(self, epsilon: float, _min: float, decay_rate: float, n_actions: int) -> None:
        self.epsilon_init = epsilon
        self.epsilon_current = epsilon
        self.epsilon_min = _min
        self.decay_rate = decay_rate
        self.n_actions = n_actions

    def action(self, q_values: np.ndarray) -> int:
        if np.random.random() < self.epsilon_current:
            return np.random.randint(self.n_actions)
        else:
            return np.argmax(q_values)

    def update(self, step: int) -> None:
        self.epsilon_current = max(
            self.epsilon_current * (self.decay_rate ** step),
            self.epsilon_min
        )

## Logging data for analysis

The rewards will be logged over time to see potential exploitations of the reward function.

Logging:
- Cumuliative reward (must have)
  - Should rise over time
  - Should become less volatile over time
- Every N percent, do a test run (must have)
  - Do a battle
  - Log the battle text
  - Log the battle outcome
- Log how much of the state space has been explored by the agent (should have)
- Log how the agents decision making changes over time (quite advanded, could have)
- Loss over time (should have)
- Reward trends over time (should have)
- Exploration (e.g., epsilon value) over time (should have)

In [84]:
tensorboard_dir = os.path.abspath('./initial_pokemon_battleing_agent')
if not os.path.exists(tensorboard_dir):
    os.makedirs(tensorboard_dir)

## Model Free Approach (Deep Q-Learning)

Architecture: Decide on the architecture of your Q-network. For example:
- Fully connected layers for small, discrete state spaces.
- Convolutional layers if your state is represented as images (e.g., screenshots of the game).

Output: Ensure the network outputs a value for each action in the action space.

In [85]:
from stable_baselines3 import DQN

env = StarterBattleEnvironment()
model = DQN('MultiInputPolicy', env, verbose=1, tensorboard_log=tensorboard_dir)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


Unfortunatly, the DQN model does not accept a custom policy. This means that the epsilon greedy policy will not be used in this approach. In future experiments, I will be looking into making a custom model with a custom policy.

In [None]:
timesteps = 10000
log_interval = timesteps // 100

In [103]:
model.learn(
    total_timesteps=timesteps, 
    log_interval=log_interval, 
    tb_log_name=f"dqn_starter_battle"
)

Logging to d:\Users\luc\repos\deth\research_lvde\experiments\initial_pokemon_battleing_agent\dqn_starter_battle_18


OverflowError: int too big to convert

## Model Based Approach (...)

TODO research model based approach

In [93]:
# TODO: implement model based stuff

## Training The Agents

In [94]:
env = StarterBattleEnvironment()
policy = EpsilonGreedy(0.1, env.action_space.n)

## Debugging and Visualization

...

## Conclusion

- Expand observation space to include more information about the battle.
  - Define empty pokemon in observation space
- Research gen4 ai and implement it the environment