## Experiment Goal

The goal of this experiment is to set up a very minimal implementation of a battling agent. This will then function as a starting point for further development.

In [240]:
import pandas as pd
import numpy as np
import os
import poke_battle_sim as pb
import random
from sklearn.preprocessing import LabelEncoder

import gymnasium as gym
from typing import Optional
from stable_baselines3 import DQN

import time
import re

In [152]:
# Data imports
package_dir = os.path.dirname(pb.poke_sim.__file__)
data_dir = os.path.join(package_dir, 'data')

# Load dataframes
abilities = pd.read_csv(os.path.join(data_dir, 'abilities.csv'))
items_gen4 = pd.read_csv(os.path.join(data_dir, 'items_gen4.csv'))
move_list = pd.read_csv(os.path.join(data_dir, 'move_list.csv'))
natures = pd.read_csv(os.path.join(data_dir, 'natures.csv'))
pokemon_stats = pd.read_csv(os.path.join(data_dir, 'pokemon_stats.csv'))
# pokemon_stats.set_index('ndex', inplace=True)
type_effectiveness = pd.read_csv(os.path.join(data_dir, 'type_effectiveness.csv'))

In [153]:
# Data helper methods
def get_random_nature() -> str:
    return random.choice(natures.values)[0]

def get_stats_by_id(pokedex_id: int):
    if pokedex_id < min(pokemon_stats['ndex']) or pokedex_id > max(pokemon_stats['ndex']):
        raise ValueError(f'{pokedex_id} is not a valid pokedex id')
    
    return pb.PokeSim._pokemon_stats[pokedex_id - 1][4:10]

def get_stats_by_name(name: str):
    if name not in pokemon_stats['name'].values:
        raise ValueError(f'{name} is not a valid pokemon name')
    
    search_results = [ i for i in pb.PokeSim._pokemon_stats if i[1] == name ] # TODO make this search more time efficient
    if len(search_results) != 1:
        raise ValueError(f'Invalid search results: expected 1, got {len(search_results)} while searching for {name}')

    return search_results[0][4:10]

def get_ability_id_by_name(name: str):
    if name not in abilities['ability_name'].values:
        raise ValueError(f'{name} is not a valid ability name')

    return abilities[abilities['ability_name'] == name]['ability_id'].values[0]
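
The `TODO` in `get_stats_by_name` asks for a faster search than scanning every row. One possible direction, sketched here under assumptions, is to build a name-to-row dictionary once and look names up in constant time. `SAMPLE_ROWS`, `build_name_index` and `stats_from_index` are hypothetical names; the sample rows only imitate the row layout used above (name at index 1, stats at indices 4 to 9) and are not read from the simulator.

```python
# Stand-in rows shaped like the ones indexed above: name at index 1, stats at 4:10.
SAMPLE_ROWS = [
    [387, "turtwig", "grass", "", 55, 68, 64, 45, 55, 31],
    [390, "chimchar", "fire", "", 44, 58, 44, 58, 44, 61],
    [393, "piplup", "water", "", 53, 51, 53, 61, 56, 40],
]


def build_name_index(rows):
    """Map each name to its row once, so later lookups avoid a full scan."""
    index = {}
    for row in rows:
        name = row[1]
        if name in index:
            raise ValueError(f"duplicate name: {name}")
        index[name] = row
    return index


def stats_from_index(index, name):
    """Return the six stats for a name, mirroring the [4:10] slice used above."""
    try:
        return index[name][4:10]
    except KeyError:
        raise ValueError(f"{name} is not a valid pokemon name") from None


name_index = build_name_index(SAMPLE_ROWS)
print(stats_from_index(name_index, "piplup"))
```

Building the index costs one pass over the rows, after which every lookup is a single dictionary access.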

In [154]:
# Encoding/Decoding methods
# print(pb.conf.global_settings.POSSIBLE_GENDERS)
gender_encoder = LabelEncoder()
gender_encoder.fit(pb.conf.global_settings.POSSIBLE_GENDERS)

def get_gender_encoding(gender: str):
    return gender_encoder.transform([gender])[0]

def get_gender_decoding(gender: int):
    return gender_encoder.inverse_transform([gender])[0]

def get_random_gender_mf():
    return random.choice(['male', 'female'])

type_encoder = LabelEncoder()
type_encoder.fit(pokemon_stats[[ 'type 1', 'type 2' ]].values.flatten())

def get_type_encoding(type_name: str | float):
    if isinstance(type_name, float) and np.isnan(type_name):
        return type_encoder.transform([np.nan])[0]
    
    if not type_name or type_name.lower() == 'none' or type_name.lower() == 'nan' or type_name == '':
        return type_encoder.transform([np.nan])[0]
    
    return type_encoder.transform([type_name])[0]

def get_type_decoding(type_id: int):
    return type_encoder.inverse_transform([type_id])[0]

def all_type_encodings():
    return np.array(type_encoder.transform(type_encoder.classes_))

# for c in type_encoder.classes_:
#     print(f'{c} -> {get_type_encoding(c)} -> {get_type_decoding(get_type_encoding(c))}')

## About the state space

The starting point for the state space is the first rival battle in the game. I chose it to keep the state space as small as possible, which makes the agent's behavior easier to debug and understand. It is also arguably the most interesting starting point, as it is the very first battle in the game. Making the state space any smaller would result in an agent that does not learn anything meaningful.

### Sizing the state space

Although the state space seems small, it is still quite large. At first glance, the state space consists of just two Pokémon:
> $ \{ p1, p2 \} $

However, when we look into each Pokémon's attributes, it becomes apparent how fast the state space grows. To make things easier, let's look at the Pokémon Showdown calculator to see what could be included in a battle state:

![pokemon showdown calculator screenshot](showdown_calculator_screenshot.png)

The calculator shows inputs for the Pokémon currently out in battle (so not the rest of each player's party). Every input in the calculator (buttons and text fields alike) is part of the state space. Note that not all inputs apply to all party members: the buttons in the center of the calculator (Protect, for example) only apply to the Pokémon out in battle. Given the valid input range for each of these fields (base HP, for example, can range from 0 to a practical maximum of 150), the state space becomes almost astronomically large. This size dictates which approaches are usable in this experiment.
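
To get a feeling for how fast the size grows, the number of joint states can be estimated as the product of the number of values each field can take. The sketch below does this with made-up cardinalities: every number in `ASSUMED_FIELD_SIZES` is an illustrative assumption, not a count taken from the calculator, and `state_space_size` is a hypothetical helper.

```python
import math

# Hypothetical number of distinct values per field for ONE Pokémon.
# These figures are illustrative assumptions, not measured from the calculator.
ASSUMED_FIELD_SIZES = {
    "level": 100,
    "hp": 151,            # 0 to 150 inclusive
    "attack": 151,
    "defense": 151,
    "sp_atk": 151,
    "sp_def": 151,
    "speed": 151,
    "stat_stages": 13 ** 5,  # five stats, each with a stage from -6 to 6
}


def state_space_size(field_sizes: dict[str, int], n_pokemon: int = 2) -> int:
    """Count joint states: product over fields, raised to the number of Pokémon."""
    per_pokemon = math.prod(field_sizes.values())
    return per_pokemon ** n_pokemon


size = state_space_size(ASSUMED_FIELD_SIZES)
print(f"about 10^{len(str(size)) - 1} joint states for two Pokémon")
```

Even with this short list of fields, the count is far beyond what a lookup table could hold.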

### Choosing an approach for the sized state space

From this first glance I can already safely state that Q-tables are not a viable approach, as the tables would be too large to work with within a reasonable amount of time (for my computer at least). We could try to see which parts of the state space to scrap, which would be an interesting experiment. It might be nice to see whether an agent could learn to battle without, for example, knowing anything about the attributes that dictate a Pokémon's stats (like base stats, IVs, EVs, etc.) except perhaps its level. However, I will take a different approach.

I will be opting for a Deep Q-Learning approach (page 867 of the book). I might even try to implement Double Deep Q-Learning if the plain DQN overestimates its Q-values. Deep Q-Learning is an off-policy, model-free approach that uses a deep neural network to approximate the Q-function. This is a good approach for this experiment as it can handle large state spaces. Double Deep Q-Learning is a variant that uses the online network to choose the next action and the target network to evaluate it, which reduces the overestimation of Q-values.
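
The difference between the two methods shows up in how the training target is built. The following is a minimal sketch in plain Python, not the Stable Baselines implementation; `dqn_target` and `double_dqn_target` are hypothetical names, and the Q-values are passed in as plain lists with one entry per action.

```python
from typing import Sequence


def dqn_target(reward: float, next_q_target: Sequence[float],
               gamma: float = 0.99, done: bool = False) -> float:
    """Standard DQN: the target network both picks and scores the next action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward: float, next_q_online: Sequence[float],
                      next_q_target: Sequence[float],
                      gamma: float = 0.99, done: bool = False) -> float:
    """Double DQN: the online network picks the action, the target network scores it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


# The online network prefers action 0, the target network rates action 1 highest.
print(dqn_target(1.0, [0.5, 2.0], gamma=0.5))
print(double_dqn_target(1.0, [3.0, 1.0], [0.5, 2.0], gamma=0.5))
```

Because Double DQN does not take the maximum over the same estimates it evaluates, a single overly optimistic Q-value has less influence on the target.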

## Observation Space

The observation space should include the following:
- The agent's party
- The NPC's party
- Stat changes from buffs and debuffs (like leer, growl, etc.)

A party consists of one to six Pokémon in the form of a dictionary, where each key is the position of a Pokémon within the party (key 0 is the Pokémon at the front of the party, key 1 the next one, and so on).

A Pokémon is a tuple with:
- Stat totals (stats computed from EVs, IVs, base stats, level and nature)
- Types (one or two types)
- Its ability
- Available moves

Some more notes on the observation space:
- For the agent's party, all this information is known beforehand. 
- For the NPC's party, the agent will have to learn how to gather this information throughout every episode.

To get these observations, we will first define it and then see how we can get it from the simulator.
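
As a rough picture of the structure described above, a party could be written down as a dictionary of tuples before any translation to gym spaces. This is only a sketch: `PokemonObservation` and `Party` are hypothetical names, and the example values are chimchar's base stats, typing, ability and starter moves as used elsewhere in this notebook.

```python
from typing import NamedTuple


class PokemonObservation(NamedTuple):
    """One Pokémon as described above: stats, types, ability and moves."""
    stats: tuple[int, int, int, int, int, int]  # hp, attack, defense, sp. atk, sp. def, speed
    types: tuple[str, ...]                      # one or two types
    ability: str
    moves: tuple[str, ...]                      # available moves


# A party maps a position (0 = front of the party) to a Pokémon.
Party = dict[int, PokemonObservation]

agent_party: Party = {
    0: PokemonObservation(
        stats=(44, 58, 44, 58, 44, 61),
        types=("fire",),
        ability="blaze",
        moves=("scratch", "leer"),
    )
}

print(agent_party[0].ability, agent_party[0].moves)
```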

#### Data on available pokemon in the starting battle

In [155]:
starter_lvl = 5
starter_names = ['turtwig', 'chimchar', 'piplup']
starter_moves = {
    'turtwig': ['tackle', 'withdraw'],
    'chimchar': ['scratch', 'leer'],
    'piplup': ['pound', 'growl']
}
starter_abilities = {
    'turtwig': 'overgrow',
    'chimchar': 'blaze',
    'piplup': 'torrent'
}

In [156]:
def get_starter(name: str):
    if name not in starter_names:
        raise ValueError(f'{name} is not a valid starter name')
    
    stats = get_stats_by_name(name)

    return pb.Pokemon(
        name_or_id=name,
        level=starter_lvl,
        moves=starter_moves[name],
        gender=get_random_gender_mf(),
        ability=starter_abilities[name],
        nature=get_random_nature(),
        cur_hp=stats[0],
        stats_actual=stats
    )


def get_random_starter():
    name = random.choice(starter_names)
    return get_starter(name)


def get_rival_starter(agent_starter_name: str):
    name = ''
    if agent_starter_name == 'turtwig':
        name = 'chimchar'
    elif agent_starter_name == 'chimchar':
        name = 'piplup'
    elif agent_starter_name == 'piplup':
        name = 'turtwig'

    return get_starter(name)
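
Note that `get_starter` passes the base stats from the data files as `stats_actual`. If level-dependent stat totals were wanted instead, the stat formula used by the main-series games since generation 3 could be sketched as below. `calc_hp` and `calc_stat` are hypothetical helpers, the nature is passed as a percentage (90, 100 or 110) to stay in integer arithmetic, and whether the simulator can compute these values itself is not checked here.

```python
def calc_hp(base: int, level: int, iv: int = 0, ev: int = 0) -> int:
    """HP total: floor((2*base + iv + floor(ev/4)) * level / 100) + level + 10."""
    return (2 * base + iv + ev // 4) * level // 100 + level + 10


def calc_stat(base: int, level: int, iv: int = 0, ev: int = 0,
              nature_pct: int = 100) -> int:
    """Non-HP total: the same core term plus 5, then scaled by the nature."""
    raw = (2 * base + iv + ev // 4) * level // 100 + 5
    return raw * nature_pct // 100


# Chimchar at level 5 with no IVs or EVs and a neutral nature.
print(calc_hp(44, 5), calc_stat(58, 5))
```

At level 5 the totals are much smaller than the base stats, which would also shrink the Discrete spaces needed to hold them.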

In [157]:
starter_name = get_random_starter().name
print(f'Random starter: {starter_name}')
print(f'Rival starter: {get_rival_starter(starter_name).name}')

Random starter: chimchar
Rival starter: piplup


#### From data to observation space 

To turn the data we have at our disposal into an observation space, we will have to do the following:
- See what [state spaces gym makes available](https://gymnasium.farama.org/api/spaces/fundamental/#fundamental-spaces) to us
- See what the data looks like
- Translate the data to the available state spaces

#### About state spaces

All spaces seem to hold numerical values. Take the Discrete space, for example: it is essentially just a set of integers. A Dict space might have textual keys, but its values are all other numerical spaces (or nested Dict spaces). Below I try to summarize the spaces in terms of what they are and when to use them.

**Fundamental Spaces:**
> | Name          | Description                                      | When to Use                               |
> |---------------|------------------------------------------------- |-------------------------------------------|
> | Box           | Continuous space with bounds for each dimension. | For continuous values like positions.     |
> | Discrete      | Finite range of non-negative integers.           | For finite actions or states.             |
> | MultiBinary   | Binary space, each dimension is 0 or 1.          | For independent on/off states.            |
> | MultiDiscrete | Multi-dimensional discrete ranges.               | For actions with separate finite options. |
> | Text          | Space for text or character sequences.           | For tasks involving text input/output.    |

**Composite Spaces:**
> | Name     | Description                               | When to Use                                |
> |----------|-------------------------------------------|--------------------------------------------|
> | Dict     | Combines spaces as key-value pairs.       | For JSON-like structures.                  |
> | Tuple    | Combines spaces by position.              | For ordered combinations like coordinates. |
> | Sequence | Variable-length sequences of elements.    | For variable input/output, e.g., lists.    |
> | Graph    | Represents nodes and edges with features. | For relational or graph data.              |
> | OneOf    | Allows elements from multiple spaces.     | For mutually exclusive action types.       |

**State utility functions:**
> | Name            | Description                                       | When to Use                                 |
> |-----------------|---------------------------------------------------|---------------------------------------------|
> | flatten_space() | Converts composite space to a flat `Box`.         | For vectorizing complex spaces.             |
> | flatten()       | Converts a space element into a vector.           | For preprocessing data into a flat form.    |
> | flatdim()       | Gets the dimensionality of a flat space.          | For model input size or preprocessing.      |
> | unflatten()     | Converts a vector back to the original structure. | For restoring structured data.              |
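
To make the utility functions less abstract, here is a pure-Python imitation of what flattening does to discrete values. It assumes gymnasium's behavior of turning a `Discrete` value into a one-hot vector and concatenating one-hot vectors for a `MultiDiscrete` value; it does not call the library, and `one_hot`, `flatten_multi_discrete` and `flatdim_multi_discrete` are hypothetical names.

```python
def one_hot(value: int, n: int) -> list[int]:
    """Imitate flattening a Discrete(n) value: a vector with a single 1."""
    if not 0 <= value < n:
        raise ValueError(f"{value} is outside 0..{n - 1}")
    vector = [0] * n
    vector[value] = 1
    return vector


def flatten_multi_discrete(values: list[int], nvec: list[int]) -> list[int]:
    """Imitate flattening a MultiDiscrete value: concatenated one-hot vectors."""
    flat: list[int] = []
    for value, n in zip(values, nvec):
        flat.extend(one_hot(value, n))
    return flat


def flatdim_multi_discrete(nvec: list[int]) -> int:
    """The flattened length is the sum of the per-dimension sizes."""
    return sum(nvec)


print(one_hot(2, 4))
print(flatten_multi_discrete([1, 0], [2, 3]))
```

The flattened length grows with the size of every discrete dimension, which is one more reason to keep the individual spaces small.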

### **EDITOR'S NOTE**

I just found out that Stable Baselines does not work well with Discrete spaces that do not start at 0 (whether the start is negative or positive). I had planned to shrink the state space by starting each space at the lowest value its stat can take. Unfortunately, that is not possible with Stable Baselines.

Since I have been working on defining an environment for about a week now (most of which went towards defining the state space), I am going to simplify my life by making the state space a lot smaller. If you think about it, the agent's actions have only two kinds of impact in the starter battle:
- The agent can choose to attack, lowering the opponent's health (and vice versa for the opponent)
- The agent can choose to lower one of the opponent's stats (and vice versa for the opponent)

Thus I will be reducing the state space to just include the pokemon stats and the stat stages.
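
A reduced observation of this kind could be flattened into one vector of non-negative integers per battle. The sketch below assumes six stats and six stat stages per Pokémon, with the stages (which range from -6 to 6) shifted by 6 so that they start at 0; `encode_pokemon` and `encode_observation` are hypothetical helpers. A vector like this could, for instance, back a `gym.spaces.MultiDiscrete` space.

```python
def encode_pokemon(stats: list[int], stat_stages: list[int]) -> list[int]:
    """Concatenate six stats with six stat stages shifted from -6..6 to 0..12."""
    if len(stats) != 6 or len(stat_stages) != 6:
        raise ValueError("expected exactly 6 stats and 6 stat stages")
    if any(not -6 <= stage <= 6 for stage in stat_stages):
        raise ValueError("stat stages must lie between -6 and 6")
    return list(stats) + [stage + 6 for stage in stat_stages]


def encode_observation(agent: tuple[list[int], list[int]],
                       opponent: tuple[list[int], list[int]]) -> list[int]:
    """Agent values first, then the opponent's, as one flat vector."""
    return encode_pokemon(*agent) + encode_pokemon(*opponent)


chimchar = ([44, 58, 44, 58, 44, 61], [0, -1, 0, 0, 0, 0])
piplup = ([53, 51, 53, 61, 56, 40], [0, 0, -1, 0, 0, 0])
observation = encode_observation(chimchar, piplup)
print(len(observation), observation[:12])
```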

#### Pokemon stats

It seems that the Discrete space is the most fitting for the stats of the Pokémon. The stats are all integers, and each stat has its own minimum and maximum value.

In [158]:
starter_df = pokemon_stats.copy()
starter_df = starter_df[starter_df['name'].isin(starter_names)]
starter_df

Unnamed: 0,ndex,name,type 1,type 2,hp,attack,defense,sp. atk,sp. def,speed,height,weight,base exp.,gen
386,387,turtwig,grass,,55,68,64,45,55,31,4,102,64,4
389,390,chimchar,fire,,44,58,44,58,44,61,5,62,62,4
392,393,piplup,water,,53,51,53,61,56,40,4,52,63,4


In [159]:
stat_columns = ['hp', 'attack', 'defense', 'sp. atk', 'sp. def', 'speed']

In [160]:
hp_space = gym.spaces.Discrete(starter_df['hp'].max() + 1)
attack_space = gym.spaces.Discrete(starter_df['attack'].max() + 1)
defense_space = gym.spaces.Discrete(starter_df['defense'].max() + 1)
sp_atk_space = gym.spaces.Discrete(starter_df['sp. atk'].max() + 1)
sp_def_space = gym.spaces.Discrete(starter_df['sp. def'].max() + 1)
speed_space = gym.spaces.Discrete(starter_df['speed'].max() + 1)

In [161]:
# Sanity checks
_max = starter_df['hp'].max()
assert all([ not hp_space.contains(-1), hp_space.contains(0), hp_space.contains(_max), not hp_space.contains(_max + 1) ])

_max = starter_df['attack'].max()
assert all([ not attack_space.contains(-1), attack_space.contains(0), attack_space.contains(_max), not attack_space.contains(_max + 1) ])

_max = starter_df['defense'].max()
assert all([ not defense_space.contains(-1), defense_space.contains(0), defense_space.contains(_max), not defense_space.contains(_max + 1) ])

_max = starter_df['sp. atk'].max()
assert all([ not sp_atk_space.contains(-1), sp_atk_space.contains(0), sp_atk_space.contains(_max), not sp_atk_space.contains(_max + 1) ])

_max = starter_df['sp. def'].max()
assert all([ not sp_def_space.contains(-1), sp_def_space.contains(0), sp_def_space.contains(_max), not sp_def_space.contains(_max + 1) ])

_max = starter_df['speed'].max()
assert all([ not speed_space.contains(-1), speed_space.contains(0), speed_space.contains(_max), not speed_space.contains(_max + 1) ])
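
The six checks above all follow the same pattern, so they could be folded into one helper. The sketch below takes any `contains`-style callable; `has_zero_based_bounds` and `make_range_check` are hypothetical names, and `make_range_check` is only a stand-in that behaves like a zero-based Discrete space so the block runs without gym.

```python
from typing import Callable


def has_zero_based_bounds(contains: Callable[[int], bool], max_value: int) -> bool:
    """True if the space accepts 0 and max_value but rejects -1 and max_value + 1."""
    return (
        not contains(-1)
        and contains(0)
        and contains(max_value)
        and not contains(max_value + 1)
    )


def make_range_check(n: int) -> Callable[[int], bool]:
    """Hypothetical stand-in that behaves like Discrete(n).contains."""
    return lambda value: 0 <= value < n


# The largest starter attack stat in the table above is 68, so n = 69.
assert has_zero_based_bounds(make_range_check(69), 68)
```

In the notebook the stand-in would be replaced by the real method, for example `has_zero_based_bounds(attack_space.contains, int(starter_df['attack'].max()))`.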

#### Volatile status

In `%ENV-DIR%/poke_battle_sim/poke_sim/core/pokemon.py::Pokemon::reset_stats()` we can see that, for that Pokémon instance, `self.stat_stages` is set to a list of ints. This property is not yet available right after the Pokémon is instantiated, as there is no reference to it in the `__init__` method. `reset_stats()` is referenced in the `%ENV-DIR%/poke_battle_sim/poke_sim/util/process_move.py::_ef_050()` method. It seems that each effect ID from the `move_list` dataframe has its own method in this file. Let's look at the `ef_017` method (17 is the effect ID of growl and leer) to see what it does.

```py
    if defender.is_alive and defender.trainer.mist:
        battle.add_text(defender.nickname + "'s protected by mist.")
        return True
    give_stat_change(defender, battle, move_data.ef_stat, move_data.ef_amount)
```

It seems the `%ENV-DIR%/poke_battle_sim/poke_sim/util/process_move.py::give_stat_change()` method is used to apply stat changes. This in turn is used in the `%ENV-DIR%/poke_battle_sim/poke_sim/core/pokemon.py::Battle` instance. This allows me to conclude that somewhere around the start of the battle, the `stat_stages` property becomes available.

In [162]:
lucas = pb.Trainer('lucas', [get_random_starter()])
barry = pb.Trainer('barry', [get_rival_starter(lucas.poke_list[0].name)])
battle = pb.Battle(lucas, barry)
battle.start()

print(battle.t1.poke_list[0].stat_stages)
print(battle.t2.poke_list[0].stat_stages)

battle.turn(
    t1_turn=['move', lucas.poke_list[0].moves[1].name],
    t2_turn=['move', barry.poke_list[0].moves[1].name]
)
print(battle.t1.poke_list[0].stat_stages)
print(battle.t2.poke_list[0].stat_stages)

[0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0]
[0, -1, 0, 0, 0, 0]
[0, 0, -1, 0, 0, 0]


Now that we know how to get this from the simulator, we need to define a space that can hold this information. The Discrete space seems to be the best fit, because all stat stages are integers ranging from -6 to 6. To make the observation space work with Stable Baselines, we have to map the stat stages from the range -6 to 6 onto the range 0 to 12.

In [163]:
stat_stage_space = gym.spaces.Discrete(13)

In [164]:
assert all([ not stat_stage_space.contains(-1), stat_stage_space.contains(0), stat_stage_space.contains(12), not stat_stage_space.contains(13) ])

In [165]:
def map_stat_stages(stat_stages: list[int]) -> np.ndarray:
    if len(stat_stages) != 6:
        raise ValueError('Expected exactly 6 stat stages')
    
    # map from -6 / 6 to 0 / 12
    return np.array(stat_stages) + 6

assert np.array_equal(map_stat_stages([0, 0, 0, 0, 0, 0]), np.array([6, 6, 6, 6, 6, 6]))
assert np.array_equal(map_stat_stages([6, 6, 6, 6, 6, 6]), np.array([12, 12, 12, 12, 12, 12]))
assert np.array_equal(map_stat_stages([-6, -6, -6, -6, -6, -6]), np.array([0, 0, 0, 0, 0, 0]))
assert np.array_equal(map_stat_stages([-6, -4, -2, 2, 4, 6]), np.array([0, 2, 4, 8, 10, 12]))
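
Two companions to the mapping above may be useful later: the inverse mapping, and the multiplier a stage stands for. The sketch below assumes the main-series rule for attack, defense, special attack, special defense and speed (a positive stage s multiplies the stat by (2 + s) / 2, a negative one by 2 / (2 - s)); how the simulator applies stages internally is not checked here, and `unmap_stat_stage` and `stage_multiplier` are hypothetical names.

```python
from fractions import Fraction


def unmap_stat_stage(mapped: int) -> int:
    """Inverse of the +6 shift: turn 0..12 back into -6..6."""
    if not 0 <= mapped <= 12:
        raise ValueError(f"{mapped} is outside 0..12")
    return mapped - 6


def stage_multiplier(stage: int) -> Fraction:
    """Multiplier for a stat stage, kept exact by using fractions."""
    if not -6 <= stage <= 6:
        raise ValueError(f"{stage} is outside -6..6")
    if stage >= 0:
        return Fraction(2 + stage, 2)
    return Fraction(2, 2 - stage)


# A single growl or leer puts the target at stage -1.
print(unmap_stat_stage(5), stage_multiplier(-1))
```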

#### Move observations

In [166]:
all_starter_moves = np.array(list(starter_moves.values())).flatten()

In [167]:
move_list = move_list[move_list['identifier'].isin(all_starter_moves)]
move_list

Unnamed: 0,id,identifier,generation_id,type_id,power,pp,accuracy,priority,target_id,move_class,effect_id,effect_chance,effect_amt,effect_stat
0,1,pound,1,normal,40.0,35,100.0,0,10,2,1,,,
9,10,scratch,1,normal,40.0,35,100.0,0,10,2,1,,,
32,33,tackle,1,normal,40.0,35,100.0,0,10,2,1,,,
42,43,leer,1,normal,,30,100.0,0,11,1,17,,-1.0,2.0
44,45,growl,1,normal,,40,100.0,0,11,1,17,,-1.0,1.0
109,110,withdraw,1,water,,40,,0,7,1,16,,1.0,2.0


The following columns will be included in the observation space:
- power
- pp
- target_id
- move_class
- effect_id

In [168]:
columns_of_interest = ['power', 'pp', 'target_id', 'move_class', 'effect_id']
move_list = move_list[columns_of_interest]

In [169]:
move_list

Unnamed: 0,power,pp,target_id,move_class,effect_id
0,40.0,35,10,2,1
9,40.0,35,10,2,1
32,40.0,35,10,2,1
42,,30,11,1,17
44,,40,11,1,17
109,,40,7,1,16


In [170]:
move_list = move_list.fillna({'power': 0}).astype(int)
move_list


Unnamed: 0,power,pp,target_id,move_class,effect_id
0,40,35,10,2,1
9,40,35,10,2,1
32,40,35,10,2,1
42,0,30,11,1,17
44,0,40,11,1,17
109,0,40,7,1,16


In [171]:
move_power_space = gym.spaces.Discrete(move_list['power'].max() + 1)
move_pp_space = gym.spaces.Discrete(move_list['pp'].max() + 1)
move_target_space = gym.spaces.Discrete(move_list['target_id'].max() + 1)
move_class_space = gym.spaces.Discrete(move_list['move_class'].max() + 1)
move_effect_id_space = gym.spaces.Discrete(move_list['effect_id'].max() + 1)

In [172]:
assert (False, True, True, False) == (
    move_power_space.contains(-1), 
    move_power_space.contains(0), 
    move_power_space.contains(move_list['power'].max()), 
    move_power_space.contains(move_list['power'].max() + 1)
)

assert (False, True, True, False) == (
    move_pp_space.contains(-1), 
    move_pp_space.contains(0), 
    move_pp_space.contains(move_list['pp'].max()), 
    move_pp_space.contains(move_list['pp'].max() + 1)
)

assert (False, True, True, False) == (
    move_target_space.contains(-1), 
    move_target_space.contains(0), 
    move_target_space.contains(move_list['target_id'].max()), 
    move_target_space.contains(move_list['target_id'].max() + 1)
)

assert (False, True, True, False) == (
    move_class_space.contains(-1), 
    move_class_space.contains(0), 
    move_class_space.contains(move_list['move_class'].max()), 
    move_class_space.contains(move_list['move_class'].max() + 1)
)

assert (False, True, True, False) == (
    move_effect_id_space.contains(-1), 
    move_effect_id_space.contains(0), 
    move_effect_id_space.contains(move_list['effect_id'].max()), 
    move_effect_id_space.contains(move_list['effect_id'].max() + 1)
)

#### The empty move

In [173]:
empty_move = {
    'power': 0,
    'pp': 0,
    'target_id': 0,
    'move_class': 0,
    'effect_id': 0
}
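
The empty move is meant to fill the unused move slots, since each starter only knows two moves. A minimal sketch of that padding, assuming four move slots per Pokémon and the five columns selected above (power, pp, target_id, move_class, effect_id), could look as follows. `EMPTY_MOVE`, `STARTER_MOVE_ROWS` and `pad_moves` are hypothetical names; the rows are copied from the cleaned move table.

```python
# Column order: power, pp, target_id, move_class, effect_id
EMPTY_MOVE = (0, 0, 0, 0, 0)

STARTER_MOVE_ROWS = {
    "pound": (40, 35, 10, 2, 1),
    "scratch": (40, 35, 10, 2, 1),
    "tackle": (40, 35, 10, 2, 1),
    "leer": (0, 30, 11, 1, 17),
    "growl": (0, 40, 11, 1, 17),
    "withdraw": (0, 40, 7, 1, 16),
}

# The placeholder must not collide with any real move.
assert EMPTY_MOVE not in STARTER_MOVE_ROWS.values()


def pad_moves(moves: list[tuple], slots: int = 4) -> list[tuple]:
    """Fill the remaining move slots with the empty move."""
    if len(moves) > slots:
        raise ValueError(f"a Pokémon can hold at most {slots} moves")
    return list(moves) + [EMPTY_MOVE] * (slots - len(moves))


chimchar_moves = pad_moves([STARTER_MOVE_ROWS["scratch"], STARTER_MOVE_ROWS["leer"]])
print(chimchar_moves)
```

Since every real move has a non-zero `pp`, the all-zero tuple stays distinguishable from the moves in this table.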

### OLD STATE SPACES (not included in starter battle environment)

This chapter is purely for archival purposes, to show my work.

#### Typing

Types are strings that we already have an encoded representation for. We can again use the Discrete space for this.

In [174]:
# typing_space = gym.spaces.Discrete(all_type_encodings().max() + 1)

In [175]:
# # Sanity checks
# assert not typing_space.contains(-1)
# for i in type_encoder.classes_:
#     assert typing_space.contains(get_type_encoding(i))
# assert not typing_space.contains(18)

#### Abilities

It seems that we already have a numerical representation of the abilities. We can use the Discrete space for this as well.

In [176]:
# starter_abilities_df = abilities[abilities['ability_name'].isin(starter_abilities.values())]
# starter_abilities_df

In [177]:
# min_max_ability = (starter_abilities_df['ability_id'].min(), starter_abilities_df['ability_id'].max() + 1 - starter_abilities_df['ability_id'].min())
# ability_space = gym.spaces.Discrete(min_max_ability[1], start=min_max_ability[0])
# ability_space

In [178]:
# ability_space = gym.spaces.Discrete(starter_abilities_df['ability_id'].max() + 1)

In [179]:
# assert not ability_space.contains(-1)
# assert ability_space.contains(starter_abilities_df['ability_id'].min())
# assert ability_space.contains(starter_abilities_df['ability_id'].max())
# assert not ability_space.contains(starter_abilities_df['ability_id'].max() + 1)

#### Moves

I would prefer to make each individual move a tuple whose values are Discrete spaces. First, let's look at the columns of the moves dataframe.

In [180]:
# starter_moves_values = np.array(list(starter_moves.values())).flatten()
# starter_move_list = move_list[move_list['identifier'].isin(starter_moves_values)].copy()
# starter_move_list

In [181]:
# starter_move_list.isna().sum()

#### About the moves dataframe

About these columns:
- The `id` column we can drop as a move is essentially defined by other stats and its effect.
- The `identifier` column we can drop as it is not needed for the agent.
- The `generation_id` column we can drop as it is not needed for the agent.
- The `type_id` column needs label encoding (which should be easy).
- The `power` column we can use as is, as it is a numerical value.
  - The `np.nan` values we can replace with 0 for moves that only change stats (leer, growl and withdraw).
  - The stats changed by these moves are given by the `effect_stat` column.
- The `pp` column we can use as is, as it is a numerical value.
- The `accuracy` column we can use as is, as it is a numerical value.
  - The `np.nan` values we can replace with -1 for moves that are accuracy independent (like withdraw).
- The `priority` column we can use as is, as it is a numerical value.
- The `target_id` column we can use as is.
  - The column describes what the move targets. 
  - For example, a move that targets the user's stats (like withdraw) and a move that targets the opponent's HP (like tackle) each have their own `target_id`.
- The `move_class` column we can use as is.
  - The column describes what kind of move it is (like physical, special or status).
  - I thought label encoding would be needed, but it seems the column is already encoded (1 for status, 2 for physical and 3 for special).
- The `effect_id` column we can use as is.
  - The column describes what kind of effect the move has (like stat change, status effect or damage).
  - It is essentially a label encoding for each unique effect, which is perfect!
- The `effect_chance` column we can use as is.
  - The column describes the chance of an extra effect happening, if any.
  - We can replace the `np.nan` values with 0 for moves that have no effect (like tackle).
- The `effect_amt` column we can use as is.
  - The column describes the amount of the effect that happens, if any.
  - It affects moves with a secondary effect such as stat changes (like the move ominous wind).
  - We can replace the `np.nan` values with 0 for moves that have no effect (like tackle).
- The `effect_stat` column we can use as is.
  - The column describes what stat the move changes, if any.
  - We can replace the `np.nan` values with 0 for moves that deal direct damage (like tackle) to indicate it targets the HP stat.
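
The replacement rules from this list can be summarized in a small function that works on one move at a time. The sketch below is a simplification: it applies each default whenever a value is missing instead of checking the move's effect first, and it works on plain dictionaries rather than the dataframe. `NAN_DEFAULTS`, `is_missing` and `clean_move` are hypothetical names; the example row is withdraw, as listed in the move table.

```python
import math

# Replacement value per column, taken from the list above.
NAN_DEFAULTS = {
    "power": 0,
    "accuracy": -1,
    "effect_chance": 0,
    "effect_amt": 0,
    "effect_stat": 0,
}


def is_missing(value) -> bool:
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def clean_move(row: dict) -> dict:
    """Fill missing values with their defaults and cast floats to ints."""
    cleaned = {}
    for column, value in row.items():
        if is_missing(value):
            if column not in NAN_DEFAULTS:
                raise ValueError(f"no replacement rule for missing {column!r}")
            value = NAN_DEFAULTS[column]
        cleaned[column] = int(value) if isinstance(value, float) else value
    return cleaned


withdraw = {
    "type_id": "water", "power": float("nan"), "pp": 40, "accuracy": float("nan"),
    "priority": 0, "target_id": 7, "move_class": 1, "effect_id": 16,
    "effect_chance": float("nan"), "effect_amt": 1.0, "effect_stat": 2.0,
}
print(clean_move(withdraw))
```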

In [182]:
# move_list[move_list['identifier'] == 'aerial-ace']

In [183]:
# best_case_accuracy = round(move_list['accuracy'].max() * (8/2) * 1.3 * 1.1 * 1.2 * 1.2 * (5/3))
# worst_case_accuracy = round(max(move_list['accuracy'].min(), 1) * (2/8) * 0.8 * 0.8 * 0.6 * 0.8 * 0.5 * 0.9 * 0.9)

# best_case_accuracy, worst_case_accuracy

In [184]:
# move_list[move_list['accuracy'] <= 0]

In [185]:
# starter_move_list.drop(columns=['id', 'identifier', 'generation_id'], inplace=True)
# starter_move_list['type_id'] = starter_move_list['type_id'].apply(lambda x: get_type_encoding(x))
# # starter_move_list

In [186]:
# stat_change_effect_ids = [ 16, 17 ]
# condition = starter_move_list['effect_id'].isin(stat_change_effect_ids)
# starter_move_list.loc[condition, 'power'] = starter_move_list.loc[condition, 'power'].fillna(0)
# # starter_move_list

In [187]:
# effect_id_that_are_accuracy_independend = [ 16 ]
# condition = starter_move_list['effect_id'].isin(effect_id_that_are_accuracy_independend)
# starter_move_list.loc[condition, 'accuracy'] = starter_move_list.loc[condition, 'accuracy'].fillna(-1)
# # starter_move_list

In [188]:
# effect_id_that_have_no_secondary_effect = [ 1, 16, 17 ]
# condition = starter_move_list['effect_id'].isin(effect_id_that_have_no_secondary_effect)
# starter_move_list.loc[condition, 'effect_amt'] = starter_move_list.loc[condition, 'effect_amt'].fillna(0)
# starter_move_list.loc[condition, 'effect_chance'] = starter_move_list.loc[condition, 'effect_chance'].fillna(0)
# # starter_move_list

In [189]:
# effect_id_that_deal_direct_damage = [ 1 ]
# condition = starter_move_list['effect_id'].isin(effect_id_that_deal_direct_damage)
# starter_move_list.loc[condition, 'effect_stat'] = starter_move_list.loc[condition, 'effect_stat'].fillna(0)
# # starter_move_list

In [190]:
# starter_move_list = starter_move_list.astype(int)

#### Move data frame after above changes

In [191]:
# starter_move_list

In [192]:
# starter_move_list.describe()

In [193]:
# starter_move_list.isna().sum()

#### About the empty move

It is important to think about the case where a Pokémon has fewer than 4 moves. We could use splash as a placeholder move, as it is a move that literally does nothing, but this could be misleading for the agent.

**Important note:**
> The combination of these values needs to be unique: the empty move tuple must not be the same as any other move tuple, otherwise it will negatively impact the agent's learning.

In [194]:
# for c in starter_move_list.columns:
#     print(c, starter_move_list[c].unique())

The empty move will be defined as follows:
> $\lambda = (17, -1, -1, -1, 0, -1, -1, -1, -1, 0, -1)$

In [195]:
# empty_move = {
#     'type_id': get_type_encoding(np.nan),
#     'power': -1,
#     'pp': -1,
#     'accuracy': -1,
#     'priority': 0,
#     'target_id': -1,
#     'move_class': -1,
#     'effect_id': -1,
#     'effect_chance': -1,
#     'effect_amt': 0,
#     'effect_stat': -1
# }

In [196]:
# starter_move_list.loc[len(starter_move_list)] = empty_move
# starter_move_list

#### Moves as a tuple of Discrete spaces

In [197]:
# starter_move_list.drop(columns=['type_id'], inplace=True)

In [198]:
# movecol_n_start = {}
# for col in starter_move_list.columns:
#     start = starter_move_list[col].min()
#     n = starter_move_list[col].max() + 1 - start

#     movecol_n_start[col] = (n, start)

# movecol_n_start

In [199]:
# # Type space is already defined
# move_power_space = gym.spaces.Discrete(movecol_n_start['power'][0], start=movecol_n_start['power'][1])
# move_pp_space = gym.spaces.Discrete(movecol_n_start['pp'][0], start=movecol_n_start['pp'][1])
# move_accuracy_space = gym.spaces.Discrete(movecol_n_start['accuracy'][0], start=movecol_n_start['accuracy'][1])
# move_priority_space = gym.spaces.Discrete(movecol_n_start['priority'][0], start=movecol_n_start['priority'][1])
# move_target_space = gym.spaces.Discrete(movecol_n_start['target_id'][0], start=movecol_n_start['target_id'][1])
# move_class_space = gym.spaces.Discrete(movecol_n_start['move_class'][0], start=movecol_n_start['move_class'][1])
# move_effect_id_space = gym.spaces.Discrete(movecol_n_start['effect_id'][0], start=movecol_n_start['effect_id'][1])
# move_effect_chance_space = gym.spaces.Discrete(movecol_n_start['effect_chance'][0], start=movecol_n_start['effect_chance'][1])
# move_effect_amt_space = gym.spaces.Discrete(movecol_n_start['effect_amt'][0], start=movecol_n_start['effect_amt'][1])
# move_effect_stat_space = gym.spaces.Discrete(movecol_n_start['effect_stat'][0], start=movecol_n_start['effect_stat'][1])

In [200]:
# # Sanity checks
# print(
#     move_power_space.contains(starter_move_list['power'].min() - 1), 
#     move_power_space.contains(starter_move_list['power'].min()), 
#     move_power_space.contains(starter_move_list['power'].max()), 
#     move_power_space.contains(starter_move_list['power'].max() + 1)
# )

# print(
#     move_pp_space.contains(starter_move_list['pp'].min() - 1), 
#     move_pp_space.contains(starter_move_list['pp'].min()), 
#     move_pp_space.contains(starter_move_list['pp'].max()), 
#     move_pp_space.contains(starter_move_list['pp'].max() + 1)
# )

# print(
#     move_accuracy_space.contains(starter_move_list['accuracy'].min() - 1), 
#     move_accuracy_space.contains(starter_move_list['accuracy'].min()), 
#     move_accuracy_space.contains(starter_move_list['accuracy'].max()), 
#     move_accuracy_space.contains(starter_move_list['accuracy'].max() + 1)
# )

# print(
#     move_priority_space.contains(starter_move_list['priority'].min() - 1), 
#     move_priority_space.contains(starter_move_list['priority'].min()), 
#     move_priority_space.contains(starter_move_list['priority'].max()), 
#     move_priority_space.contains(starter_move_list['priority'].max() + 1)
# )

# print(
#     move_target_space.contains(starter_move_list['target_id'].min() - 1), 
#     move_target_space.contains(starter_move_list['target_id'].min()), 
#     move_target_space.contains(starter_move_list['target_id'].max()), 
#     move_target_space.contains(starter_move_list['target_id'].max() + 1)
# )

# print(
#     move_class_space.contains(starter_move_list['move_class'].min() - 1), 
#     move_class_space.contains(starter_move_list['move_class'].min()), 
#     move_class_space.contains(starter_move_list['move_class'].max()), 
#     move_class_space.contains(starter_move_list['move_class'].max() + 1)
# )

# print(
#     move_effect_id_space.contains(starter_move_list['effect_id'].min() - 1), 
#     move_effect_id_space.contains(starter_move_list['effect_id'].min()), 
#     move_effect_id_space.contains(starter_move_list['effect_id'].max()), 
#     move_effect_id_space.contains(starter_move_list['effect_id'].max() + 1)
# )

# print(
#     move_effect_chance_space.contains(starter_move_list['effect_chance'].min() - 1), 
#     move_effect_chance_space.contains(starter_move_list['effect_chance'].min()), 
#     move_effect_chance_space.contains(starter_move_list['effect_chance'].max()), 
#     move_effect_chance_space.contains(starter_move_list['effect_chance'].max() + 1)
# )

# print(
#     move_effect_amt_space.contains(starter_move_list['effect_amt'].min() - 1), 
#     move_effect_amt_space.contains(starter_move_list['effect_amt'].min()), 
#     move_effect_amt_space.contains(starter_move_list['effect_amt'].max()), 
#     move_effect_amt_space.contains(starter_move_list['effect_amt'].max() + 1)
# )

# print(
#     move_effect_stat_space.contains(starter_move_list['effect_stat'].min() - 1), 
#     move_effect_stat_space.contains(starter_move_list['effect_stat'].min()), 
#     move_effect_stat_space.contains(starter_move_list['effect_stat'].max()), 
#     move_effect_stat_space.contains(starter_move_list['effect_stat'].max() + 1)
# )

The two commented-out spaces are the ones that are always 0 for the starter Pokémon, which makes them redundant.

In [201]:
# move_space = gym.spaces.Tuple([
#     typing_space,
#     move_power_space,
#     move_pp_space,
#     move_accuracy_space,
#     # move_priority_space,
#     move_target_space,
#     move_class_space,
#     move_effect_id_space,
#     # move_effect_chance_space,
#     move_effect_amt_space,
#     move_effect_stat_space
# ])

#### Now for the pokemon tuple

In [202]:
# pokemon_space = gym.spaces.Tuple([
#     hp_space,
#     attack_space,
#     defense_space,
#     sp_atk_space,
#     sp_def_space,
#     speed_space,
#     typing_space,
#     typing_space,
#     ability_space,
#     move_space,
#     move_space,
#     move_space,
#     move_space
# ])

In [203]:
# # Manual space size calculation
# space_size = 0
# for space in pokemon_space:
#     if isinstance(space, gym.spaces.Discrete):
#         space_size += len(range(space.start, space.n))
#     elif isinstance(space, gym.spaces.Tuple):
#         for s in space:
#             space_size += len(range(s.start, s.n))

# print(space_size)

# # Recursive space size calculation
# def recursive_space_size(space: gym.spaces.Space, size: int = 0):
#     if isinstance(space, gym.spaces.Discrete):
#         size += len(range(space.start, space.n))
#     elif isinstance(space, gym.spaces.Tuple):
#         for s in space:
#             size = recursive_space_size(s, size)

#     return size

# print(recursive_space_size(pokemon_space))

#### And finally the party tuple

In [204]:
# party_space = gym.spaces.Tuple([
#     pokemon_space,
#     pokemon_space,
#     pokemon_space,
#     pokemon_space,
#     pokemon_space,
#     pokemon_space
# ])

In [205]:
# recursive_space_size(party_space) # 6 * 1022

## Action Space

The complete action space for the agent is defined by the set of buttons that can be pressed on the controller. These include the arrow keys (`up`, `down`, `left`, `right`), `A`, `B`, `L`, `R`, `X`, `Y`, `start`, and `select`. This will be referred to as the **fundamental action space**.

Using the fundamental action space directly is not very meaningful because it is too granular and lacks abstraction. Instead, I define a **derived action space** that represents higher-level, semantically meaningful actions. These actions are constructed by combining or sequencing fundamental actions to achieve specific in-game outcomes. More specifically, the derived action space will include the following actions:
- `switch`: Switch the active pokemon.
- `move`: Use a move.
- `item`: Use an item.

**HOWEVER**, in the starter battle the agent will only be able to use the `move` action. The `item` and `switch` actions will be added in later experiments.

Luckily, `poke-battle-sim` supports these derived actions directly!

> From `%ENV-DIR%/Lib/site-packages/poke_battle_sim/core/battle.py::Battle::turn`:
> ```python
> """
> The three types of valid actions are:
> 1. Moves - formatted as ['move', $move_name]
> 2. Items - formatted as ['item', $item, $item_target_pos, $move_target_name?]
> 3. Switch-out - formatted as ['other', 'switch']
> """
> ```
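
As a minimal sketch of how a discrete action id could be translated into the list format quoted above: the helper name `to_sim_action` and the example move names are assumptions made for illustration and are not part of `poke-battle-sim`.

```python
from typing import List, Sequence


def to_sim_action(action_id: int, move_names: Sequence[str]) -> List[str]:
    """Translate a discrete action id into a ['move', $move_name] action list.

    Hypothetical helper: only the 'move' action is covered, because the
    starter battle does not use items or switching.
    """
    if not 0 <= action_id < len(move_names):
        raise ValueError(f'{action_id} is not a valid action id')
    return ['move', move_names[action_id]]


# Illustrative move names for a starter Pokémon
print(to_sim_action(1, ['tackle', 'withdraw']))  # ['move', 'withdraw']
```

Keeping this translation in one place means the agent only ever sees integer actions, while the simulator keeps receiving the list format it expects.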

## Reward Function

The initial reward function will be quite simplistic. The agent will be rewarded for winning a battle and penalized for losing one. Because the starter battle is a 1v1, fainting the NPC's Pokémon is the same as winning and having its own Pokémon faint is the same as losing, so fainting needs no separate reward. Every non-terminating state also carries a small time penalty.

The rewards and penalties are kept small to prevent the agent from learning to exploit the reward function. They are as follows:

| State Description | Reward Associated with reaching this state |
|-------------------|--------------------------------------------|
| Win | +1 |
| Lose | -1 |
| Non-terminating state | -0.01 |
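
The table can be sketched as a small pure function. This is only an illustration under simplifying assumptions: trainers are identified by plain strings and the names `reward_for`, `WIN_REWARD`, `LOSS_PENALTY` and `STEP_PENALTY` are hypothetical.

```python
from typing import Optional

WIN_REWARD = 1.0
LOSS_PENALTY = -1.0
STEP_PENALTY = -0.01  # small time penalty for every non-terminating state


def reward_for(winner: Optional[str], agent: str = 'lucas') -> float:
    """Return the reward for the state reached after a turn.

    `winner` is None while the battle is still running; trainer names are
    plain strings here purely for illustration.
    """
    if winner is None:
        return STEP_PENALTY
    return WIN_REWARD if winner == agent else LOSS_PENALTY


# A battle won on the 9th turn: eight time penalties plus the win reward
episode_return = 8 * reward_for(None) + reward_for('lucas')
print(round(episode_return, 2))  # 0.92
```

With these values a win is always worth more than the accumulated time penalty unless a battle lasts 100 turns or longer, so the penalty nudges the agent towards shorter battles without outweighing the outcome.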

### On future reward shaping

It might be worth researching how the reward function should account for different party sizes. For example: should the reward for winning a 6v6 battle be higher than, lower than, or equal to the reward for winning a 1v1 battle? This question will be explored in later experiments.

## Environment Implementation

In [206]:
# Important: Destroy any battle object before creating a new one
lucas = None
barry = None
battle = None

class StarterBattleEnvironment(gym.Env):
    def __init__(self):
        self._lucas = pb.Trainer('lucas', [get_random_starter()])
        self._barry = pb.Trainer(
            'barry', [get_rival_starter(self._lucas.poke_list[0].name)])
        self._battle = pb.Battle(self._lucas, self._barry)
        self._battle.start()

        # Action mappings, formatted as:
        #   action_id: (action_type, pokemon_id, move_id)
        # Where:
        #   action_type is one of 'move', 'switch', 'item'
        #   pokemon_id is always 0 (for targeting the pokemon in the first party slot)
        #   move_id is the index of the move in the pokemon's move list
        self._action_mappings = {
            0: ('move', 0, 0),
            1: ('move', 0, 1),
        }
        self.action_space = gym.spaces.Discrete(len(self._action_mappings))

        # Observation Space
        self.observation_space = gym.spaces.Dict()
        self.observation_prefixes = [ 'agent', 'npc' ]
        for prefix in self.observation_prefixes:
            for pokemon in range(1, 2):
                self.observation_space[f"{prefix}_pokemon{pokemon}_hp"] = hp_space
                self.observation_space[f"{prefix}_pokemon{pokemon}_attack"] = attack_space
                self.observation_space[f"{prefix}_pokemon{pokemon}_defense"] = defense_space
                self.observation_space[f"{prefix}_pokemon{pokemon}_sp_atk"] = sp_atk_space
                self.observation_space[f"{prefix}_pokemon{pokemon}_sp_def"] = sp_def_space
                self.observation_space[f"{prefix}_pokemon{pokemon}_speed"] = speed_space

                for move in range(1, 3):
                    self.observation_space[f"{prefix}_pokemon{pokemon}_move{move}_power"] = move_power_space
                    self.observation_space[f"{prefix}_pokemon{pokemon}_move{move}_pp"] = move_pp_space
                    self.observation_space[f"{prefix}_pokemon{pokemon}_move{move}_target"] = move_target_space
                    self.observation_space[f"{prefix}_pokemon{pokemon}_move{move}_class"] = move_class_space
                    self.observation_space[f"{prefix}_pokemon{pokemon}_move{move}_effect_id"] = move_effect_id_space

            self.observation_space[f"{prefix}_stat_stage_attack"] = stat_stage_space
            self.observation_space[f"{prefix}_stat_stage_defense"] = stat_stage_space
            self.observation_space[f"{prefix}_stat_stage_sp_atk"] = stat_stage_space
            self.observation_space[f"{prefix}_stat_stage_sp_def"] = stat_stage_space
            self.observation_space[f"{prefix}_stat_stage_speed"] = stat_stage_space
    
    def _get_obs(self):
        obs = {}
        for prefix, trainer in zip(self.observation_prefixes, [self._battle.t1, self._battle.t2]):
            for pokemon in range(1, 2):
                p = trainer.poke_list[pokemon - 1]
                obs[f"{prefix}_pokemon{pokemon}_hp"] = p.cur_hp
                obs[f"{prefix}_pokemon{pokemon}_attack"] = p.base[1]
                obs[f"{prefix}_pokemon{pokemon}_defense"] = p.base[2]
                obs[f"{prefix}_pokemon{pokemon}_sp_atk"] = p.base[3]
                obs[f"{prefix}_pokemon{pokemon}_sp_def"] = p.base[4]
                obs[f"{prefix}_pokemon{pokemon}_speed"] = p.base[5]

                for move in range(1, 3):
                    m = p.moves[move - 1]
                    obs[f"{prefix}_pokemon{pokemon}_move{move}_power"] = m.power if m.power else empty_move['power']
                    obs[f"{prefix}_pokemon{pokemon}_move{move}_pp"] = m.current_pp if m.current_pp else empty_move['pp']
                    obs[f"{prefix}_pokemon{pokemon}_move{move}_target"] = m.target if m.target else empty_move['target_id']
                    obs[f"{prefix}_pokemon{pokemon}_move{move}_class"] = m.category if m.category else empty_move['move_class']
                    obs[f"{prefix}_pokemon{pokemon}_move{move}_effect_id"] = m.ef_id if m.ef_id else empty_move['effect_id']

            stat_stages = map_stat_stages(trainer.poke_list[0].stat_stages)
            obs[f"{prefix}_stat_stage_attack"] = stat_stages[1]
            obs[f"{prefix}_stat_stage_defense"] = stat_stages[2]
            obs[f"{prefix}_stat_stage_sp_atk"] = stat_stages[3]
            obs[f"{prefix}_stat_stage_sp_def"] = stat_stages[4]
            obs[f"{prefix}_stat_stage_speed"] = stat_stages[5]

        return obs

    def _get_info(self):
        return {
            't1_pokemons': [p.name for p in self._battle.t1.poke_list],
            't2_pokemons': [p.name for p in self._battle.t2.poke_list],
        }
    
    def _reward(self):
        if self._battle.get_winner() == self._lucas:
            return 1
        elif self._battle.get_winner() == self._barry:
            return -1
        
        return -0.01  # Time penalty

    def step(self, action):
        if self._battle.is_finished():
            raise ValueError('Cannot perform action in a finished battle')
        
        # Perform the action
        action_type, pokemon_id, move_id = self._action_mappings[action]

        # Punish heavily for invalid actions        
        if not self._battle.t1.is_valid_action([action_type, self._battle.t1.poke_list[pokemon_id].moves[move_id].name]):
            reward = -2
        else: # Move is valid
            self._battle.turn(
                t1_turn=[
                    action_type,
                    self._battle.t1.poke_list[pokemon_id].moves[move_id].name
                ],
                t2_turn=[
                    "move",
                    random.choice(
                        list(filter(
                            lambda x: self._battle.t2.is_valid_action(
                                ["move", x.name]
                            ),
                            self._battle.t2.current_poke.moves
                        ))
                    ).name
                ]
            )
            reward = self._reward()

        # TODO implement if statement for the following:
        # - targeting pokemon that is not currently in battle
        # - using an item the trainer does not have access to
        # - switching to a fainted pokemon

        observation = self._get_obs()
        terminated = self._battle.winner is not None
        truncated = False
        info = self._get_info()

        return observation, reward, terminated, truncated, info

    def reset(self, seed: Optional[int] = None, options: Optional[dict] = None):
        # We need the following line to seed self.np_random
        super().reset(seed=seed)

        # Reset the battle simulation
        self._lucas = None
        self._barry = None
        self._battle = None

        self._lucas = pb.Trainer('lucas', [get_random_starter()])
        self._barry = pb.Trainer(
            'barry', [get_rival_starter(self._lucas.poke_list[0].name)])
        self._battle = pb.Battle(self._lucas, self._barry)
        self._battle.start()

        return self._get_obs(), self._get_info()

In [207]:
from stable_baselines3.common.env_checker import check_env

env = StarterBattleEnvironment()
check_env(env)

In [208]:
env = StarterBattleEnvironment()
obs, info = env.reset()

In [209]:
# Sanity check
assert all([ i[0] == i[1] for i in zip(obs.keys(), env.observation_space.spaces.keys()) ])
assert len(obs) == len(env.observation_space)

In [282]:
done = False
start = time.time()
max_time = 1 # seconds

obs, _ = env.reset()
while not done:
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated or (time.time() - start > max_time)
end = time.time()

print(f'Time taken: {end - start} seconds')
print('Battle log:')
cur_txt_parsed = ""
for line in env._battle.cur_text:
    if re.match(r'Turn \d+:', line):
        cur_txt_parsed += '\n'
    cur_txt_parsed += line
    cur_txt_parsed += ' '
print(cur_txt_parsed)

Time taken: 0.001188039779663086 seconds
Battle log:
lucas sent out TURTWIG! barry sent out CHIMCHAR! 
Turn 1: CHIMCHAR used Scratch! TURTWIG used Withdraw! TURTWIG's Defense rose! 
Turn 2: CHIMCHAR used Leer! TURTWIG's Defense fell! TURTWIG used Tackle! 
Turn 3: CHIMCHAR used Scratch! TURTWIG used Tackle! 
Turn 4: CHIMCHAR used Scratch! A critical hit! TURTWIG used Tackle! 
Turn 5: CHIMCHAR used Leer! TURTWIG's Defense fell! TURTWIG used Tackle! 
Turn 6: CHIMCHAR used Scratch! TURTWIG used Tackle! 
Turn 7: CHIMCHAR used Leer! TURTWIG's Defense fell! TURTWIG used Tackle! 
Turn 8: CHIMCHAR used Scratch! TURTWIG used Tackle! 
Turn 9: CHIMCHAR used Scratch! TURTWIG used Tackle! CHIMCHAR fainted! lucas has defeated barry! 


## Policy

For this starter battle environment I will start with an epsilon-greedy policy. It is chosen because it is simple to implement and understand. The epsilon-greedy policy selects the best known action with probability $1 - \epsilon$ and a random action with probability $\epsilon$. This allows the agent to explore the environment while still exploiting the best actions it has learned.

If epsilon-greedy yields poor results, I will switch to an epsilon-decay policy. This policy is similar to the epsilon-greedy policy, but the epsilon value decays over time. This allows the agent to explore more in the beginning and exploit more towards the end of training.
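
To make the decay concrete, the sketch below computes an exponentially decaying epsilon, $\epsilon_t = \max(\epsilon_{min}, \epsilon_0 \cdot d^t)$, where $d$ is the decay rate and $t$ the step. The function name `decayed_epsilon` and the parameter values are illustrative assumptions, not tuned settings.

```python
def decayed_epsilon(epsilon_init: float, epsilon_min: float,
                    decay_rate: float, step: int) -> float:
    """Exponentially decayed exploration rate, clipped at epsilon_min."""
    return max(epsilon_init * decay_rate ** step, epsilon_min)


# Epsilon starts at 1.0 and shrinks until it reaches the floor of 0.05
for step in (0, 100, 1000, 10000):
    print(step, round(decayed_epsilon(1.0, 0.05, 0.999, step), 3))
```

The value is always derived from the initial epsilon and the step count, so calling the function twice for the same step gives the same result.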

<!-- 
- Epsilon greedy for starter battle
- Epsilon greedy compared with Boltzmann exploration for future battles 
-->

In [210]:
class BasePolicy:
    def __init__(self) -> None:
        pass

    def action(self, q_values: np.ndarray) -> int:
        raise NotImplementedError

    def update(self, step: int) -> None:
        raise NotImplementedError

    def config(self) -> dict:
        d = {k: v for k, v in self.__dict__.items() if not k.startswith('_') and not callable(v)}
        d['type'] = self.__class__.__name__
        return d
    
class EpsilonGreedy(BasePolicy):
    def __init__(self, epsilon: float, n_actions: int) -> None:
        self.epsilon = epsilon
        self.n_actions = n_actions

    def action(self, q_values: np.ndarray) -> int:
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        else:
            return np.argmax(q_values)
        
    def update(self, step: int) -> None:
        pass

class EpsilonDecay(BasePolicy):
    def __init__(self, epsilon: float, _min: float, decay_rate: float, n_actions: int) -> None:
        self.epsilon_init = epsilon
        self.epsilon_current = epsilon
        self.epsilon_min = _min
        self.decay_rate = decay_rate
        self.n_actions = n_actions

    def action(self, q_values: np.ndarray) -> int:
        if np.random.random() < self.epsilon_current:
            return np.random.randint(self.n_actions)
        else:
            return np.argmax(q_values)

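    def update(self, step: int) -> None:
        # Decay from the initial value: multiplying the current value by
        # decay_rate ** step on every call would compound the decay
        self.epsilon_current = max(
            self.epsilon_init * (self.decay_rate ** step),
            self.epsilon_min
        )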

## Logging data for analysis

The rewards will be logged over time to spot potential exploitation of the reward function.

Logging:
- Cumulative reward (must have)
  - Should rise over time
  - Should become less volatile over time
- Every N percent, do a test run (must have)
  - Do a battle
  - Log the battle text
  - Log the battle outcome
- Log how much of the state space has been explored by the agent (should have)
- Log how the agent's decision making changes over time (quite advanced, could have)
- Loss over time (should have)
- Reward trends over time (should have)
- Exploration (e.g., epsilon value) over time (should have)
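
As a minimal sketch of the first "must have" item, the class below tracks per-episode returns and a rolling mean over recent episodes. The class name `RewardLogger` and its interface are hypothetical; a real setup could instead rely on the episode statistics that the training library already writes to TensorBoard.

```python
from collections import deque
from typing import Deque, List


class RewardLogger:
    """Tracks per-episode returns and a rolling mean over recent episodes."""

    def __init__(self, window: int = 100) -> None:
        self.episode_returns: List[float] = []
        self._recent: Deque[float] = deque(maxlen=window)
        self._current = 0.0

    def log_step(self, reward: float, done: bool) -> None:
        """Accumulate the reward and close the episode when it is done."""
        self._current += reward
        if done:
            self.episode_returns.append(self._current)
            self._recent.append(self._current)
            self._current = 0.0

    def rolling_mean(self) -> float:
        """Mean return over the last `window` finished episodes."""
        return sum(self._recent) / len(self._recent) if self._recent else 0.0


# Two short episodes: one win and one loss, each after a single time penalty
logger = RewardLogger(window=2)
for reward, done in [(-0.01, False), (1, True), (-0.01, False), (-1, True)]:
    logger.log_step(reward, done)
print(logger.episode_returns, logger.rolling_mean())
```

A rising and less volatile rolling mean is the signal the list above asks for.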

In [211]:
tensorboard_dir = os.path.abspath('./initial_pokemon_battleing_agent')
os.makedirs(tensorboard_dir, exist_ok=True)

## Model Free Approach (Deep Q-Learning)

Architecture: Decide on the architecture of your Q-network. For example:
- Fully connected layers for small, discrete state spaces.
- Convolutional layers if your state is represented as images (e.g., screenshots of the game).

Output: Ensure the network outputs a value for each action in the action space.

In [212]:
env = StarterBattleEnvironment()
model = DQN(
    'MultiInputPolicy',
    env,
    verbose=1,
    tensorboard_log=tensorboard_dir,
    exploration_fraction=0.5
)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


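Unfortunately, the DQN model does not accept a custom exploration policy. This means that the policy classes defined above will not be used in this approach; instead, the model relies on its built-in exploration schedule, which is controlled by parameters such as `exploration_fraction`. In future experiments, I will look into making a custom model with a custom policy.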
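
For reference, the exploration rate reported in the training log is annealed linearly. The sketch below reproduces such a schedule; it is an approximation written for this notebook, not the library's own code, and it assumes the Stable-Baselines3 defaults `exploration_initial_eps=1.0` and `exploration_final_eps=0.05`. The function name `linear_epsilon` is hypothetical.

```python
def linear_epsilon(step: int, total_timesteps: int,
                   exploration_fraction: float = 0.1,
                   eps_start: float = 1.0, eps_end: float = 0.05) -> float:
    """Linearly anneal epsilon from eps_start to eps_end, then hold it."""
    anneal_steps = exploration_fraction * total_timesteps
    progress = min(step / anneal_steps, 1.0)
    return eps_start + progress * (eps_end - eps_start)


# With exploration_fraction=0.5, epsilon reaches its floor halfway through training
for step in (0, 250_000, 500_000, 1_000_000):
    print(step, round(linear_epsilon(step, 1_000_000, exploration_fraction=0.5), 3))
```

A larger `exploration_fraction` therefore keeps the agent acting randomly for a larger share of the run.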

In [213]:
total_timesteps = 1000000

In [214]:
model.learn(
    total_timesteps=total_timesteps,
    tb_log_name='dqn_starter_battle',
)

Logging to d:\Users\luc\repos\deth\research_lvde\experiments\initial_pokemon_battleing_agent\dqn_starter_battle_6


----------------------------------
| rollout/            |          |
|    ep_len_mean      | 20.2     |
|    ep_rew_mean      | -0.192   |
|    exploration_rate | 1        |
| time/               |          |
|    episodes         | 4        |
|    fps              | 1426     |
|    time_elapsed     | 0        |
|    total_timesteps  | 81       |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 21       |
|    ep_rew_mean      | 0.3      |
|    exploration_rate | 1        |
| time/               |          |
|    episodes         | 8        |
|    fps              | 1142     |
|    time_elapsed     | 0        |
|    total_timesteps  | 168      |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.0244   |
|    n_updates        | 16       |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean    

<stable_baselines3.dqn.dqn.DQN at 0x2562610d340>

In [215]:
model_name = f'dqn_starter_battle_{len(os.listdir(tensorboard_dir))}'
model_path = os.path.join(tensorboard_dir, model_name)

In [216]:
# model.save() appends '.zip' when the path has no extension
if not os.path.isfile(model_path + '.zip'):
    model.save(model_path)

#### Inference on latest model

In [226]:
import time
import re

env = StarterBattleEnvironment()
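saved_models = [ os.path.join(tensorboard_dir, i) for i in os.listdir(tensorboard_dir) if i.endswith('.zip') ]
model_path = max(saved_models, key=os.path.getmtime)  # os.listdir order is arbitrary, so pick the newest file explicitly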
model = DQN.load(model_path, env)

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [279]:
done = False
rewards = []
start = time.time()
max_time = 1 # seconds

obs, _ = env.reset()
while not done:
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(int(action))
    rewards.append(reward)
    done = terminated or truncated or (time.time() - start > max_time)
end = time.time()

print(f'Time taken: {end - start} seconds')
print(f'Total rewards: {sum(rewards)}')

print('Battle log:')
cur_txt_parsed = ""
for line in env._battle.cur_text:
    if re.match(r'Turn \d+:', line):
        cur_txt_parsed += '\n'
    cur_txt_parsed += line
    cur_txt_parsed += ' '
print(cur_txt_parsed)

Time taken: 0.015010356903076172 seconds
Total rewards: 0.86
Battle log:
lucas sent out CHIMCHAR! barry sent out PIPLUP! 
Turn 1: CHIMCHAR used Leer! PIPLUP's Defense fell! PIPLUP used Pound! 
Turn 2: CHIMCHAR used Leer! PIPLUP's Defense fell! PIPLUP used Pound! 
Turn 3: CHIMCHAR used Scratch! A critical hit! PIPLUP used Growl! CHIMCHAR's Attack fell! 
Turn 4: CHIMCHAR used Scratch! PIPLUP used Growl! CHIMCHAR's Attack fell! 
Turn 5: CHIMCHAR used Scratch! PIPLUP used Growl! CHIMCHAR's Attack fell! 
Turn 6: CHIMCHAR used Scratch! PIPLUP used Growl! CHIMCHAR's Attack fell! 
Turn 7: CHIMCHAR used Scratch! PIPLUP used Growl! CHIMCHAR's Attack fell! 
Turn 8: CHIMCHAR used Scratch! PIPLUP used Growl! CHIMCHAR's Attack won't go any lower! 
Turn 9: CHIMCHAR used Scratch! PIPLUP used Pound! 
Turn 10: CHIMCHAR used Scratch! PIPLUP used Pound! 
Turn 11: CHIMCHAR used Scratch! PIPLUP used Growl! 
Turn 12: CHIMCHAR used Scratch! PIPLUP used Growl! 
Turn 13: CHIMCHAR used Leer! PIPLUP's Defense fel

#### Development debugging log

The cell above keeps giving errors:
> ```
> File d:\Users\luc\anaconda3\envs\deth\Lib\site-packages\stable_baselines3\common\vec_env\dummy_vec_env.py:110, in DummyVecEnv._save_obs(self, env_idx, obs)
>     108     self.buf_obs[key][env_idx] = obs
>     109 else:
> --> 110     self.buf_obs[key][env_idx] = obs[key]
> 
> OverflowError: int too big to convert
> ```

And this one:
> ```
> File d:\Users\luc\anaconda3\envs\deth\Lib\site-packages\stable_baselines3\common\preprocessing.py:125, in preprocess_obs(obs, observation_space, normalize_images)
>     121     return obs.float()
>     123 elif isinstance(observation_space, spaces.Discrete):
>     124     # One hot encoding and convert to float to avoid errors
> --> 125     return F.one_hot(obs.long(), num_classes=int(observation_space.n)).float()
>     127 elif isinstance(observation_space, spaces.MultiDiscrete):
>     128     # Tensor concatenation of one hot encodings of each Categorical sub-space
>     129     return th.cat(
>     130         [
>     131             F.one_hot(obs_.long(), num_classes=int(observation_space.nvec[idx])).float()
>    (...)
>     134         dim=-1,
>     135     ).view(obs.shape[0], sum(observation_space.nvec))
> 
> RuntimeError: Class values must be smaller than num_classes.
> ```

##### `RuntimeError: Class values must be smaller than num_classes.`

As it turned out, I had set up the observation spaces completely wrong. Essentially, all discrete spaces had a faulty `n` and `start` value. I rewrote the code for creating the observation spaces and the error was resolved. I also added sanity checks for each space.

I then ran the following code to check if the environment is set up correctly:
> ```py
> from stable_baselines3.common.env_checker import check_env
> env = StarterBattleEnvironment()
> check_env(env)
> ```

As it turned out, it wasn't. The check above gave me the following output:
> ```
> UserWarning: Discrete observation space (key='agent_pokemon1_attack') with a non-zero start (start=51) is not supported by Stable-Baselines3. You can use a wrapper or update your observation space.
> ```

It seemed weird to me that the issue persists even though Stable-Baselines3 wraps the env in a `DummyVecEnv`. Regardless, I will try to fix it by setting the `start` value to 0 for all discrete spaces.
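
Setting `start` to 0 means every raw value has to be shifted by the minimum of its range before it is placed in the observation. A minimal sketch of that shift follows; the helper names and the maximum value of 58 are illustrative assumptions, and only the minimum of 51 comes from the warning above.

```python
def discrete_size(min_value: int, max_value: int) -> int:
    """Number of categories needed for a zero-based Discrete space."""
    return max_value - min_value + 1


def to_zero_based(value: int, min_value: int) -> int:
    """Shift a raw value so that the smallest possible value maps to 0."""
    return value - min_value


# Illustrative range for an attack stat: raw values 51..58
n = discrete_size(51, 58)
print(n)                      # 8
print(to_zero_based(51, 51))  # 0
print(to_zero_based(58, 51))  # 7
```

Every shifted value then lies in `[0, n - 1]`, which is the range one-hot encoding requires.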

##### Solution

I heavily reduced the state space to make it simpler, as after 4 days of debugging I still could not get the environment to work. See the [editor's note](#EDITORS-NOTE) for more information. Regardless, it works now.

##### `OverflowError: int too big to convert`

After printing some info about the battle instance, it became obvious why the overflow error occurs:
> ```py
> falty_pokemon = env._battle.t1.poke_list[0]
> print(falty_pokemon.name)
> print(falty_pokemon.stats_actual)
> print(falty_pokemon.stats_effective)
> print(falty_pokemon.stat_stages)
> print(len([ i for i in env._battle.cur_text if 'growl' in i.lower() or 'withdraw' in i.lower() ]))
> ```

**Output:**
> ```
> turtwig
> [55, 1, 140319401438009622528, 45, 55, 31]
> [55, 1, 140319401438009622528, 45, 55, 31]
> [0, -6, 6, 0, 0, 0]
> 24
> ```

To summarize the above:
- The agent and NPC used stat-changing moves (like Growl and Withdraw) 24 times in total.
- This resulted in Turtwig's stats changing:
    - The attack stat was lowered by 6 stages
    - The defense stat was raised by 6 stages

This indicated to me that something was wrong with the simulation package I was using. After some debugging I found the fault to reside in the `Pokemon.calculate_stats_effective` method. Each time I ran the method on the faulty Pokémon instance, it quadrupled the defense stat, which is consistent with the ×4 multiplier of a +6 stat stage being applied again to the already-modified stat.

```
falty_pokemon.calculate_stats_effective()
[55, 1, 561277605752038490112, 45, 55, 31]
```
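
The factor of four matches the standard stat-stage multiplier at stage +6. The sketch below shows that multiplier and a way to compute the effective stat that cannot compound, because it always starts from the unmodified stat. It illustrates the suspected failure mode only and is not the package's code; the function names are hypothetical.

```python
from fractions import Fraction


def stage_multiplier(stage: int) -> Fraction:
    """Stat-stage multiplier for stages in [-6, 6]."""
    if not -6 <= stage <= 6:
        raise ValueError('stage must be between -6 and 6')
    return Fraction(2 + stage, 2) if stage >= 0 else Fraction(2, 2 - stage)


def effective_stat(unmodified_stat: int, stage: int) -> int:
    """Derive the effective stat from the unmodified stat, so repeated calls agree."""
    return int(unmodified_stat * stage_multiplier(stage))


print(stage_multiplier(6))                          # 4
# Applying the multiplier to an already-modified stat reproduces the jump seen above
print(140319401438009622528 * stage_multiplier(6))  # 561277605752038490112
# Starting from the unmodified stat gives the same answer on every call
print(effective_stat(55, 6), effective_stat(55, 6))  # 220 220
```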

I created an issue on the GitHub repo of the package: https://github.com/hiimvincent/poke-battle-sim/issues/5

---
##### Solution

I found a [fork of the package](https://github.com/thomas18F/pykemon) that fixed the issue (along with some other issues). I uninstalled the package and installed the forked version.

### Training summary

I trained the model whilst tweaking 2 hyperparameters:
- `exploration_fraction`: The fraction of the total number of steps during which the exploration rate is annealed.
- `total_timesteps`: The total number of steps to train the model.

For the first iteration I used the default exploration fraction (0.1) and a total of ten thousand timesteps. For the second iteration I increased the exploration fraction to 0.2 and kept the timesteps the same. The results are as follows:

![First and second iterations](./initial_pokemon_battleing_agent/tb_itterations_1_and_2.png)

The results seemed too noisy, and the model was not performing very well. The mean episode reward was way below what I wanted it to be. The increase in exploration fraction did not seem to have any effect. So I thought: what if I further increase the exploration fraction whilst also increasing the total timesteps? The results are as follows:

![Third and fourth iterations](./initial_pokemon_battleing_agent/tb_itterations_3_and_4.png)

These results are with an exploration fraction of 0.5 and a total of one hundred thousand timesteps, which took about 3.3 minutes on my CPU (Ryzen 9 7900X). The results are much better than the previous iterations: the mean episode reward is still quite volatile, but it is much higher than before and it has a clear upward trend. The model seems to be learning, but it is still not performing as well as I would like it to.

So for my last experiment I increased the total timesteps to 1 million while keeping the exploration fraction at 0.5. The results are as follows:

![Fifth and final iteration](./initial_pokemon_battleing_agent/tb_itterations_5.png)

This took 35 minutes, again on my CPU, which suggests that training time grows roughly linearly with the total number of timesteps. The results are much better than the previous iterations: the mean episode reward is still somewhat volatile, but it is much higher than before and it averages out to a higher value, especially towards the end of the training session (meaning its exploitation works pretty well).
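
The linear-growth observation can be checked with a quick throughput calculation from the two timings reported above. The helper name `steps_per_second` is hypothetical.

```python
def steps_per_second(timesteps: int, minutes: float) -> float:
    """Average training throughput for a run."""
    return timesteps / (minutes * 60)


runs = {
    '100k timesteps': (100_000, 3.3),   # (timesteps, minutes)
    '1M timesteps': (1_000_000, 35.0),
}
for name, (timesteps, minutes) in runs.items():
    print(f'{name}: {steps_per_second(timesteps, minutes):.0f} steps/s')
# 100k timesteps: 505 steps/s
# 1M timesteps: 476 steps/s
```

The throughput stays in the same range for both runs, so ten times as many timesteps cost roughly ten times as much wall-clock time.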

### On further training

The following topics would be interesting to explore in future experiments:
- It might be nice to look into why PyTorch, the backend used by Stable-Baselines3, does not recognize my GPU. Training on the GPU could speed up training times.

## Model Based Approach (...)

TODO research model based approach

In [219]:
# TODO: implement model based stuff

## Training The Agents

...

## Debugging and Visualization

...

## Conclusion

- Expand observation space to include more information about the battle.
  - Define an empty Pokémon in the observation space
- Research the gen 4 AI and implement it in the environment