## 1. Load Data and Remove Bad Row

We begin by loading the `train.jsonl` data line by line. During this process, we explicitly check for and remove `battle_id` 4877, which was identified in the competition forums as having a faulty label. This ensures our model does not learn from incorrect data.

In [63]:
import json
import pandas as pd
import os

# --- Define the path to our data ---
COMPETITION_NAME = 'fds-pokemon-battles-prediction-2025'
DATA_PATH = os.path.join('../input', COMPETITION_NAME)

train_file_path = os.path.join(DATA_PATH, 'train.jsonl')
test_file_path = os.path.join(DATA_PATH, 'test.jsonl')
train_data = []

# Read the file line by line
print(f"Loading data from '{train_file_path}'...")
try:
    with open(train_file_path, 'r') as f:
        for line in f:
            # json.loads() parses one line (one JSON object) into a Python dictionary
            battle = json.loads(line)
            # Drop the known bad row (battle_id 4877)
            if battle.get('battle_id') != 4877:
                train_data.append(battle)
    print(f"Successfully loaded {len(train_data)} battles.")

except FileNotFoundError:
    print(f"ERROR: Could not find the training file at '{train_file_path}'.")
    print("Please make sure you have added the competition data to this notebook.")

Loading data from '../input/fds-pokemon-battles-prediction-2025/train.jsonl'...
Successfully loaded 9999 battles.
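
The same filter extends naturally to a set of excluded IDs. The following is a minimal sketch, not part of the notebook: `load_jsonl` and its `exclude_ids` parameter are hypothetical names, and an in-memory stream stands in for `train.jsonl`.

```python
import io
import json

def load_jsonl(stream, exclude_ids=frozenset()):
    """Parse one JSON object per line, skipping blank lines and excluded battle IDs."""
    records = []
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate empty lines, e.g. a trailing newline
        record = json.loads(line)
        if record.get("battle_id") in exclude_ids:
            continue  # drop rows flagged as bad
        records.append(record)
    return records

# Toy data standing in for train.jsonl (hypothetical rows)
sample = io.StringIO(
    '{"battle_id": 4876, "player_won": true}\n'
    '{"battle_id": 4877, "player_won": false}\n'
    '\n'
    '{"battle_id": 4878, "player_won": true}\n'
)
battles = load_jsonl(sample, exclude_ids={4877})
print([b["battle_id"] for b in battles])
```

Keeping the excluded IDs in a set makes it easy to add further rows later without touching the loop.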


## 2. Preprocessing & Domain Knowledge

Before we can engineer features, we must establish our "domain knowledge." This involves creating a lookup dictionary (a Pokédex) for Pokémon stats and defining the game's type-effectiveness rules.

### 2.1 Building the Global Pokédex

We create a `GLOBAL_POKEDEX_STATS` (for base stats) and `GLOBAL_POKEDEX_TYPES` (for types) by iterating through the **training data** (`train_data`) one time. This dictionary maps Pokémon names (e.g., 'snorlax') to their respective data.

This is built **only** from `train_data` to prevent data leakage from the test set.

In [64]:


print("building the Pokedex...")
GLOBAL_POKEDEX_STATS = {}
GLOBAL_POKEDEX_TYPES = {}

all_data_for_pokedex = train_data  

for battle in all_data_for_pokedex:
    # Collect every Pokémon described in this battle: P1's full team plus P2's lead
    all_details = battle.get('p1_team_details', []) + [battle.get('p2_lead_details')]
    
    for p in all_details:
        if p:  # Make sure the Pokémon entry (p) is not None
            name = p.get('name')
            # Only add Pokémon we have not seen before
            if name and name not in GLOBAL_POKEDEX_STATS:
                GLOBAL_POKEDEX_STATS[name] = p  # full details dict, including base stats
                GLOBAL_POKEDEX_TYPES[name] = p.get('types', [])  # store only the types

print(f"Pokédex built! Found {len(GLOBAL_POKEDEX_STATS)} different Pokémon.")


building the Pokedex...
Pokédex built! Found 20 different Pokémon.


### 2.2 Helper Function: `compute_team_average_stats`

This is a helper function that will be used by our main feature engineering script. It takes a list of Pokémon names (e.g., `['snorlax', 'chansey']`) and the `GLOBAL_POKEDEX_STATS`, and returns a dictionary of the average base stats for that specific list of Pokémon.

In [65]:
def compute_team_average_stats(team_list, pokedex):
    """
    Calculates the average stats for a list of Pokémon names
    using a reference pokedex.
    """
    
    stat_keys = ["base_hp", "base_atk", "base_def", "base_spa", "base_spd", "base_spe"]

    # Initialize stat sums
    stats_sum = {key: 0 for key in stat_keys}
    count = 0

    for name in team_list:
        mon = pokedex.get(name)

        if mon:  # the Pokémon exists in the Pokédex
            for key in stat_keys:
                stats_sum[key] += mon.get(key, 0)
            count += 1

    if count == 0:
        # Return 0 for all stats if the team is unknown or empty
        return {key: 0 for key in stat_keys}

    # Calculate average
    return {key: stats_sum[key] / count for key in stat_keys}
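
To illustrate the behaviour described above, here is a condensed, self-contained restatement of the same averaging idea. It is a sketch rather than the notebook's function: `average_known_stats` and `toy_pokedex` are hypothetical names, and only two stat keys are used to keep the example short. The point it shows is that names missing from the Pokédex are skipped instead of being counted as zeros.

```python
def average_known_stats(names, pokedex, stat_keys=("base_hp", "base_spe")):
    """Average the given stats over the names that exist in the pokedex."""
    known = [pokedex[name] for name in names if pokedex.get(name)]
    if not known:
        # Unknown or empty team: fall back to zeros, as the notebook helper does
        return {key: 0 for key in stat_keys}
    return {key: sum(mon.get(key, 0) for mon in known) / len(known) for key in stat_keys}

# Toy Pokédex with two entries (illustrative subset of the stat keys)
toy_pokedex = {
    "snorlax": {"name": "snorlax", "base_hp": 160, "base_spe": 30},
    "chansey": {"name": "chansey", "base_hp": 250, "base_spe": 50},
}

# 'missingno' is not in the Pokédex, so the average is taken over two Pokémon, not three
team_avg = average_known_stats(["snorlax", "chansey", "missingno"], toy_pokedex)
print(team_avg)
```

Skipping unknown names matters for P2, whose team is only partially revealed: dividing by the number of known Pokémon avoids dragging the averages toward zero.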

### 2.3 Helper Function: Type Effectiveness Rules

We define the Gen 1 type-effectiveness rules using three dictionaries: `super_effective_map`, `immune_map`, and `not_very_effective_map`. These maps are essential for calculating type advantages during a battle. 

In [66]:

# --- Type maps (Gen 1) ---
# move type -> list of defender types on which the move has the given effect
# e.g. a fighting move is super effective against normal, rock and ice,
# resisted by flying (among others), and has no effect on ghost
super_effective_map = {
    "fighting": ["normal", "rock", "ice"], "flying": ["grass", "fighting", "bug"],
    "poison": ["grass", "bug"], "ground": ["fire", "electric", "poison", "rock"],
    "rock": ["fire", "ice", "flying", "bug"], "bug": ["grass", "poison", "psychic"],
    "ghost": ["ghost"], "fire": ["grass", "ice", "bug"], "water": ["fire", "ground", "rock"],
    "grass": ["water", "ground", "rock"], "electric": ["water", "flying"],
    "psychic": ["fighting", "poison"], "ice": ["grass", "ground", "flying", "dragon"],
    "dragon": ["dragon"], "normal": []
}

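immune_map = {
    # Keyed by MOVE type -> defender types that take no damage.
    # Gen 1 immunities: normal/fighting moves vs ghost, ground moves vs flying,
    # electric moves vs ground, ghost moves vs normal and (a Gen 1 quirk) vs psychic.
    "normal": ["ghost"], "fighting": ["ghost"], "ground": ["flying"],
    "ghost": ["normal", "psychic"], "electric": ["ground"], "psychic": [],
    "flying": [], "poison": [], "rock": [], "bug": [], "fire": [], "water": [],
    "grass": [], "ice": [], "dragon": []
}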

not_very_effective_map = {
    "normal": ["rock"], "fighting": ["poison", "flying", "psychic", "bug"],
    "flying": ["electric", "rock"], "poison": ["poison", "ground", "rock", "ghost"],
    "ground": ["grass", "bug"], "rock": ["fighting", "ground"],
    "bug": ["fire", "fighting", "flying", "ghost"], "ghost": [],
    "fire": ["fire", "water", "rock", "dragon"], "water": ["water", "grass", "dragon"],
    "grass": ["fire", "grass", "poison", "flying", "bug", "dragon"],
    "electric": ["grass", "electric", "dragon"], "psychic": ["psychic"],
    "ice": ["water", "ice"], "dragon": []
}
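
Because the three maps are written by hand, a quick consistency check can catch entries that contradict each other, such as a pair listed as both super effective and immune. The sketch below is an optional addition, not part of the notebook: `find_conflicts` is a hypothetical helper, and the small `toy_*` maps are illustrative excerpts rather than the full dictionaries.

```python
from collections import Counter

def find_conflicts(*type_maps):
    """Return (move_type, defender_type) pairs that appear in more than one map."""
    seen = Counter()
    for type_map in type_maps:
        for move_type, defenders in type_map.items():
            for defender in set(defenders):
                seen[(move_type, defender)] += 1
    return sorted(pair for pair, count in seen.items() if count > 1)

# Illustrative excerpts for electric moves
toy_super = {"electric": ["water", "flying"]}
toy_resisted = {"electric": ["grass", "electric", "dragon"]}
toy_immune = {"electric": ["ground"]}
toy_broken = {"electric": ["water"]}  # deliberately wrong, to trigger the check

clean = find_conflicts(toy_super, toy_resisted, toy_immune)
dirty = find_conflicts(toy_super, toy_resisted, toy_broken)
print(clean)
print(dirty)
```

In the notebook, the call would take the three full maps; an empty list means no pair is assigned two different effects.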

### 2.4 Helper Function: `get_effectiveness`

This function acts as our "game engine" or "referee". It takes a move's type and the defender's types, checks them against the maps we just defined, and returns the final damage multiplier (e.g., `0.0` for immune, `0.5` for resisted, `2.0` or `4.0` for super-effective).

In [67]:
def get_effectiveness(move_type, defender_types):
    """
    Calculate the damage multiplier of a move against a defender.
    Single-typed defenders may be encoded as ['notype', 'type2'] or ['type1', 'notype'];
    both layouts are handled.
    """
    move_type = move_type.lower()
    if not defender_types:
        return 1.0 
        
    def_type1 = defender_types[0].lower()
    def_type2 = None
    if len(defender_types) > 1:
        def_type2 = defender_types[1].lower()

    effectiveness = 1.0

    # Check type 1
    if def_type1 != 'notype':
        # check immunity
        if move_type in immune_map and def_type1 in immune_map[move_type]:
            return 0.0 
        # Check resistance
        if move_type in not_very_effective_map and def_type1 in not_very_effective_map[move_type]:
            effectiveness *= 0.5
        # Check weakness
        if move_type in super_effective_map and def_type1 in super_effective_map[move_type]:
            effectiveness *= 2.0
            
    # Check for type 2
    if def_type2 and def_type2 != 'notype':
        
        if move_type in immune_map and def_type2 in immune_map[move_type]:
            return 0.0 
        
        if move_type in not_very_effective_map and def_type2 in not_very_effective_map[move_type]:
            effectiveness *= 0.5
        
        if move_type in super_effective_map and def_type2 in super_effective_map[move_type]:
            effectiveness *= 2.0

    return effectiveness
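
For a dual-typed defender the final multiplier is the product of the per-type multipliers, which is where values such as `4.0`, `0.25`, or a cancelled-out `1.0` come from. The sketch below shows this with an alternative formulation, a single chart keyed by `(move_type, defender_type)` pairs. This is not the notebook's implementation: `multiplier` and `toy_chart` are hypothetical names, and the chart holds only a handful of entries.

```python
def multiplier(move_type, defender_types, chart):
    """Multiply the per-type factors; pairs missing from the chart count as neutral (1.0)."""
    result = 1.0
    for defender_type in defender_types:
        if defender_type == "notype":
            continue  # placeholder for single-typed Pokémon
        result *= chart.get((move_type, defender_type), 1.0)
    return result

# A few entries only, for illustration
toy_chart = {
    ("ice", "dragon"): 2.0,
    ("ice", "flying"): 2.0,
    ("ice", "water"): 0.5,
    ("ice", "ice"): 0.5,
    ("electric", "water"): 2.0,
    ("electric", "ground"): 0.0,
}

double_weak = multiplier("ice", ["dragon", "flying"], toy_chart)      # 2.0 * 2.0
double_resist = multiplier("ice", ["water", "ice"], toy_chart)        # 0.5 * 0.5
cancelled = multiplier("ice", ["water", "flying"], toy_chart)         # 0.5 * 2.0
immune = multiplier("electric", ["water", "ground"], toy_chart)       # 2.0 * 0.0
print(double_weak, double_resist, cancelled, immune)
```

An immunity is stored as `0.0`, so it zeroes the product without needing an early return.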

## 3. Core Feature Extraction: `check_team`

This is the most complex and important function in the notebook. Its job is to replay an entire battle by processing the `battle_timeline` and aggregating all key events into a single row of data.

It "watches" the battle turn-by-turn and tracks over 40 different metrics, which are all returned as a single tuple. These metrics include:

* **Team Composition:** The set of P2's *revealed* Pokémon (`p2_team`).
* **Health & KOs:** Final HP, total defeated Pokémon (`p1_fainted_set`), and who got the `first_ko`.
* **Status Conditions:** Differentiates between `crucial_status` (sleep, freeze) and `minor_status` (paralysis, poison, burn).
* **Effectiveness:** Counts every `super_effective`, `resisted`, and `immune` hit for both players.
* **Momentum:** Tracks `null_moves` (lost turns) and `avg_power` (offensive pressure), while correctly handling `Seismic Toss`.
* **Checkpoints:** It "snapshots" the state of the battle (HP, KOs, status) at **Turn 10** and **Turn 20** to give the model a sense of how the battle progressed over time.
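
As an illustration of the status tracking described above, the sketch below counts *newly* inflicted statuses by remembering the last known status of each Pokémon. It is a simplified stand-alone version: `count_new_statuses` and the `observations` list are hypothetical, and unlike `check_team` it does not require the opponent to have used a move on that turn.

```python
def count_new_statuses(observations):
    """observations: (pokemon_name, status) pairs in turn order.

    Returns (crucial, minor) counts of status changes, where a change is only
    counted when the status differs from the last one seen for that Pokémon.
    """
    last_known = {}
    crucial = minor = 0
    for name, status in observations:
        previous = last_known.get(name, "nostatus")
        if status != previous and status not in ("nostatus", "fnt"):
            if status in ("slp", "frz"):
                crucial += 1
            elif status in ("par", "psn", "brn"):
                minor += 1
        last_known[name] = status
    return crucial, minor

observations = [
    ("starmie", "nostatus"),
    ("starmie", "par"),       # new minor status
    ("starmie", "par"),       # unchanged, not counted again
    ("snorlax", "slp"),       # new crucial status
    ("snorlax", "nostatus"),  # woke up
    ("snorlax", "slp"),       # put to sleep again, counted again
    ("starmie", "fnt"),       # fainting is not a status infliction
]
status_counts = count_new_statuses(observations)
print(status_counts)
```

Tracking the previous status per Pokémon is what prevents a multi-turn sleep or paralysis from being counted once per turn.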

In [68]:
# Takes the timeline of a single battle, P1's team details, P2's lead details, and the type Pokédex
def check_team(battle_log, p1_team_details, p2_lead_details, global_pokedex_types):
    
    # Collect the Pokémon of P1 and P2, plus sets tracking which of them have fainted
    p1_team_names = [p.get('name') for p in p1_team_details]
    if p2_lead_details:
        p2_team = {p2_lead_details.get('name')}
    else:
        p2_team = set()
    p1_fainted_set = set()
    p2_fainted_set = set()
    
    # Status setups
    p1_status_inflicted_count = 0 
    p2_status_inflicted_count = 0
    p1_crucial_status_count = 0  # For slp (sleep) and frz (freeze)
    p1_minor_status_count = 0    # For par (paralysis), psn (poison), brn (burn)
    p2_crucial_status_count = 0  
    p2_minor_status_count = 0

    p1_last_known_status = {}
    p2_last_known_status = {}

    # First Blood
    p1_got_first_ko = 0
    p2_got_first_ko = 0 
    # Flag: has the first KO already been recorded?
    first_ko_awarded = False 
    first_ko_turn = 31  # TODO: decide how best to initialise this; 31 is a sentinel meaning no first KO has been seen yet
    

    # Hp features
    # Start from 1.0 (100%) for every Pokémon on P1's team
    p1_hp_map = {name: 1.0 for name in p1_team_names if name} 
    # P2's team is not known yet; it is filled in as Pokémon are revealed
    p2_hp_map = {} 


    # Effectiveness counters
    p1_super_effective_hits, p2_super_effective_hits = 0, 0
    p1_resisted_hits, p2_resisted_hits = 0, 0
    p1_immune_hits, p2_immune_hits = 0, 0

    # Null moves are sometimes the result of smart predictions, but a high count may
    # mean that a player is at a disadvantage because of status conditions and forced
    # switches, i.e. a loss of momentum
    p1_null_move_count = 0
    p2_null_move_count = 0

    # Count each player's offensive moves and accumulate their base power, so that the
    # average base power can describe the player's strategy
    # (Seismic Toss is a special move, so it is treated differently)
    p1_total_power = 0
    p1_offensive_move_count = 0
    p2_total_power = 0
    p2_offensive_move_count = 0

    
    # Snapshot HP, KOs and status counts at turns 10 and 20 to get an idea of how the battle progresses
    p1_hp_at_turn_10, p2_hp_at_turn_10 = 6.0, 6.0
    p1_defeated_at_turn_10, p2_defeated_at_turn_10 = 0, 0 
    p1_status_at_turn_10, p2_status_at_turn_10 = 0, 0
    p1_crucial_status_count_at_turn_10, p2_crucial_status_count_at_turn_10 = 0,0
    p1_minor_status_count_at_turn_10, p2_minor_status_count_at_turn_10 = 0,0

    
    p1_hp_at_turn_20, p2_hp_at_turn_20 = 6.0, 6.0
    p1_defeated_at_turn_20, p2_defeated_at_turn_20 = 0, 0 
    p1_status_at_turn_20, p2_status_at_turn_20 = 0, 0
    p1_crucial_status_count_at_turn_20, p2_crucial_status_count_at_turn_20 = 0,0
    p1_minor_status_count_at_turn_20, p2_minor_status_count_at_turn_20 = 0,0


    
    # Loop over the battle log
    for turno in battle_log:
        turn_num = turno.get('turn')
        p1_state = turno.get('p1_pokemon_state')
        p2_state = turno.get('p2_pokemon_state')
        p1_move = turno.get('p1_move_details')
        p2_move = turno.get('p2_move_details')
        
        if not p1_state or not p2_state:
            continue 
        
        if not p1_move:
            p1_null_move_count += 1
        if not p2_move:
            p2_null_move_count += 1
        
        # --- Update names and HP for P1 and P2 ---
        p1_name = p1_state.get('name')
        if p1_name:
            p1_hp_map[p1_name] = p1_state.get('hp_pct', 0.0)
            
            
        p2_name = p2_state.get('name')
        if p2_name: 
            p2_team.add(p2_name)
            p2_hp_map[p2_name] = p2_state.get('hp_pct', 0.0)
           
        
        # ---  Fainted part --- 
        p1_fainted_this_turn = False
        p2_fainted_this_turn = False
        
        # Check whether P1's active Pokémon has fainted this turn
        if (p1_state.get('status') == 'fnt' or p1_state.get('hp_pct') == 0.0):
            if p1_name and p1_name not in p1_fainted_set:
                p1_fainted_set.add(p1_name)
                p1_fainted_this_turn = True
        
        # now for p2
        if (p2_state.get('status') == 'fnt' or p2_state.get('hp_pct') == 0.0):
            if p2_name and p2_name not in p2_fainted_set:
                p2_fainted_set.add(p2_name)
                p2_fainted_this_turn = True
        
        # ---  First blood part --- 
        if not first_ko_awarded:
            if p2_fainted_this_turn and not p1_fainted_this_turn:
                # P2 fainted, P1 did not -> P1 gets first blood
                p1_got_first_ko = 1
                first_ko_awarded = True 
                first_ko_turn = turn_num
            elif p1_fainted_this_turn and not p2_fainted_this_turn:
                # P1 fainted, P2 did not -> P2 gets first blood
                p2_got_first_ko = 1
                first_ko_awarded = True 
                first_ko_turn = turn_num
            elif p1_fainted_this_turn and p2_fainted_this_turn:
                # Both fainted (e.g. Selfdestruct) -> no advantage for either side
                first_ko_awarded = True 
                first_ko_turn = turn_num
        
        
        # --- Status Part ---
        if p2_name:
            p2_current_status = p2_state.get('status', 'nostatus')
            p2_prev_status = p2_last_known_status.get(p2_name, 'nostatus')
            is_new_status = (p2_current_status != p2_prev_status) and (p2_current_status not in ['nostatus', 'fnt'])
            if is_new_status and p1_move:
                p1_status_inflicted_count += 1
                if p2_current_status in ['slp', 'frz']:
                    p1_crucial_status_count += 1
                elif p2_current_status in ['par', 'psn', 'brn']:
                    p1_minor_status_count += 1    
            
            p2_last_known_status[p2_name] = p2_current_status
        if p1_name:
            p1_current_status = p1_state.get('status', 'nostatus')
            p1_prev_status = p1_last_known_status.get(p1_name, 'nostatus')
            is_new_status = (p1_current_status != p1_prev_status) and (p1_current_status not in ['nostatus', 'fnt'])
            if is_new_status and p2_move:
                p2_status_inflicted_count += 1
                if p1_current_status in ['slp', 'frz']:
                    p2_crucial_status_count += 1
                elif p1_current_status in ['par', 'psn', 'brn']:
                    p2_minor_status_count += 1  
            p1_last_known_status[p1_name] = p1_current_status
    
        # Now we insert the typing if we know the pokemon
        p1_types = global_pokedex_types.get(p1_name, []) 
        p2_types = global_pokedex_types.get(p2_name, [])

        # --- Attacks part --- 
        # p1 attacks p2
        if p1_move and p1_move.get('type') and p2_types:
            if p1_move.get('base_power', 0) > 0:  # not a status move
                move_name = p1_move.get('name', '').lower()
    
                if move_name == 'seismictoss':
                    p1_offensive_move_count += 1
                    
    
                    if 'ghost' in [t.lower() for t in p2_types]:
                        effectiveness = 0.0
                    else:
                        effectiveness = 1.0
    
                    if effectiveness == 0.0:
                        p1_immune_hits += 1
    
                else:
                    # Normal attacking move
                    p1_total_power += p1_move.get('base_power', 0)
                    p1_offensive_move_count += 1
                    effectiveness = get_effectiveness(p1_move['type'], p2_types)
    
                    if effectiveness >= 2.0:
                        p1_super_effective_hits += 1
                    elif effectiveness == 0.0:
                        p1_immune_hits += 1
                    elif effectiveness < 1.0:
                        p1_resisted_hits += 1
    
        # p2 attacks p1
        if p2_move and p2_move.get('type') and p1_types:
            if p2_move.get('base_power', 0) > 0:
                move_name = p2_move.get('name', '').lower()
    
                if move_name == 'seismictoss':
                    p2_offensive_move_count += 1
                    # (Seismic Toss counter removed)
    
                    if 'ghost' in [t.lower() for t in p1_types]:
                        effectiveness = 0.0
                    else:
                        effectiveness = 1.0
    
                    if effectiveness == 0.0:
                        p2_immune_hits += 1
    
                else:
                    p2_total_power += p2_move.get('base_power', 0)
                    p2_offensive_move_count += 1
                    effectiveness = get_effectiveness(p2_move['type'], p1_types)
    
                    if effectiveness >= 2.0:
                        p2_super_effective_hits += 1
                    elif effectiveness == 0.0:
                        p2_immune_hits += 1
                    elif effectiveness < 1.0:
                        p2_resisted_hits += 1

    
        # Compute the current HP, status and KO totals used for the turn-10 and turn-20 snapshots
        current_p1_hp_total = sum(p1_hp_map.values())
        current_p1_defeated = len(p1_fainted_set)
        current_p1_status_count = p1_status_inflicted_count
        current_p1_crucial_status_count  = p1_crucial_status_count
        current_p1_minor_status_count = p1_minor_status_count

        # P2 HP: unrevealed Pokémon are assumed to be at full HP (1.0 each)
        current_p2_hidden = 6 - len(p2_team)
        current_p2_hp_calc = current_p2_hidden * 1.0
        for name in p2_team:
            if name not in p2_fainted_set:
                current_p2_hp_calc += p2_hp_map.get(name, 1.0)
        current_p2_hp_total = current_p2_hp_calc
        
        current_p2_defeated = len(p2_fainted_set)
        current_p2_status_count = p2_status_inflicted_count
        current_p2_crucial_status_count  = p2_crucial_status_count
        current_p2_minor_status_count = p2_minor_status_count
        
        
        if turn_num == 10:
            p1_hp_at_turn_10, p2_hp_at_turn_10 = current_p1_hp_total, current_p2_hp_total
            p1_defeated_at_turn_10, p2_defeated_at_turn_10 = current_p1_defeated, current_p2_defeated
            p1_status_at_turn_10, p2_status_at_turn_10 = current_p1_status_count, current_p2_status_count
            p1_crucial_status_count_at_turn_10, p2_crucial_status_count_at_turn_10 = current_p1_crucial_status_count, current_p2_crucial_status_count
            p1_minor_status_count_at_turn_10, p2_minor_status_count_at_turn_10 = current_p1_minor_status_count, current_p2_minor_status_count

        if turn_num == 20:
            p1_hp_at_turn_20, p2_hp_at_turn_20 = current_p1_hp_total, current_p2_hp_total
            p1_defeated_at_turn_20, p2_defeated_at_turn_20 = current_p1_defeated, current_p2_defeated
            p1_status_at_turn_20, p2_status_at_turn_20 = current_p1_status_count, current_p2_status_count
            p1_crucial_status_count_at_turn_20, p2_crucial_status_count_at_turn_20 = current_p1_crucial_status_count, current_p2_crucial_status_count
            p1_minor_status_count_at_turn_20, p2_minor_status_count_at_turn_20 = current_p1_minor_status_count, current_p2_minor_status_count



    
    # --- Update after all the turns (Alive, HP, Avg Power) ---
    p1_alive_count = 6 - len(p1_fainted_set) 
    p2_alive_count = 6 - len(p2_fainted_set)
    p1_total_hp = sum(p1_hp_map.values())
    p2_hidden_count = 6 - len(p2_team)
    p2_total_hp = p2_hidden_count * 1.0
    

    # Calculates the average
    p1_avg_power = p1_total_power / p1_offensive_move_count if p1_offensive_move_count > 0 else 0
    p2_avg_power = p2_total_power / p2_offensive_move_count if p2_offensive_move_count > 0 else 0

    for name in p2_team:
        if name in p2_fainted_set:
            p2_total_hp += 0.0
        else:
            p2_total_hp += p2_hp_map.get(name, 1.0) 

    # --- Return the features ---
    return (p2_team,
            p1_alive_count, p2_alive_count, 
            p1_status_inflicted_count, p2_status_inflicted_count, 
            p1_total_hp, p2_total_hp,
            p1_super_effective_hits, p2_super_effective_hits,
            p1_resisted_hits, p2_resisted_hits,
            p1_immune_hits, p2_immune_hits, 
            p1_avg_power, p2_avg_power, 
            p1_null_move_count, p2_null_move_count,
            p1_got_first_ko, p2_got_first_ko,
            
            p1_crucial_status_count, p1_minor_status_count,
            p2_crucial_status_count, p2_minor_status_count,
            
            first_ko_turn,
            p1_hp_at_turn_10, p2_hp_at_turn_10,
            p1_defeated_at_turn_10, 
            p2_defeated_at_turn_10,
            p1_status_at_turn_10, p2_status_at_turn_10,
            p1_crucial_status_count_at_turn_10, 
            p2_crucial_status_count_at_turn_10, 
            p1_minor_status_count_at_turn_10, 
            p2_minor_status_count_at_turn_10, 
            
            p1_hp_at_turn_20, p2_hp_at_turn_20 ,
            p1_defeated_at_turn_20, p2_defeated_at_turn_20,
            p1_status_at_turn_20, p2_status_at_turn_20,
            p1_crucial_status_count_at_turn_20, p2_crucial_status_count_at_turn_20,
            p1_minor_status_count_at_turn_20, p2_minor_status_count_at_turn_20
            
            
           )
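
The P2 total-HP logic inside `check_team` is easy to lose among the other counters, so here it is isolated as a small worked example. This is a sketch under the same assumption the function makes (every unrevealed P2 Pokémon is at full HP); `estimate_team_hp` is a hypothetical helper name, and the HP values are made up.

```python
def estimate_team_hp(revealed_hp, fainted, team_size=6):
    """Sum of HP fractions for a partially revealed team.

    revealed_hp: dict mapping revealed Pokémon name -> last seen HP fraction (0.0 to 1.0)
    fainted:     set of revealed names that have fainted (they contribute 0)
    """
    hidden = team_size - len(revealed_hp)
    total = hidden * 1.0  # unrevealed Pokémon are assumed to be at full HP
    for name, hp in revealed_hp.items():
        if name not in fainted:
            total += hp
    return total

# Three of six Pokémon revealed, one of them fainted
revealed = {"alakazam": 0.5, "chansey": 0.0, "tauros": 0.75}
total_hp = estimate_team_hp(revealed, fainted={"chansey"})
print(total_hp, total_hp / 6)  # raw sum and the value normalised to [0, 1]
```

Here the three hidden Pokémon contribute 3.0, Alakazam 0.5 and Tauros 0.75. Dividing by 6, as `create_features` does later, puts P1 and P2 on the same scale.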

## 4. Feature Engineering: `create_features`

This function acts as the "factory" that builds our final `DataFrame`. It iterates through each battle and:
1.  **Extracts Static Features:** Calculates pre-battle "static" features, like the average stats of P1's full team.
2.  **Calls `check_team`:** Runs the timeline-processing function to get the 40+ "dynamic" features.
3.  **Combines & Creates Features:** Combines all static and dynamic data into a final set of features, calculating "advantage" (`A - B`) and "ratio" (`A / B`) columns, which are often useful for tree-based models like XGBoost because they expose the comparison between the two players as a single value to split on.

In [69]:
from tqdm.notebook import tqdm
import numpy as np
def create_features(data: list[dict]) -> pd.DataFrame:
    feature_list = []
    for battle in tqdm(data, desc="Extracting features"):
        features = {}
        battle_log = battle.get("battle_timeline",[])
        p1_team = battle.get('p1_team_details', [])
        p2_lead = battle.get('p2_lead_details')        

    
    

        # --- Timeline Features ---
        # Values from check_team
        (
            p2_team, p1_alive, p2_alive, 
            p1_status, p2_status, 
            p1_hp, p2_hp, 
            p1_se_hits, p2_se_hits,
            p1_resist_hits, p2_resist_hits,
            p1_immune_hits, p2_immune_hits,
            p1_avg_power, p2_avg_power, 
            p1_null_moves, p2_null_moves,
            p1_got_first_ko, p2_got_first_ko,
            p1_crucial_status_count, p1_minor_status_count,
            p2_crucial_status_count, p2_minor_status_count,
            first_ko_turn,
            p1_hp_t10, p2_hp_t10, p1_defeated_t10, p2_defeated_t10,
            p1_status_t10, p2_status_t10,
            p1_crucial_t10, p2_crucial_t10, 
            p1_minor_t10, p2_minor_t10,
            p1_hp_t20, p2_hp_t20, p1_defeated_t20, p2_defeated_t20,
            p1_status_t20, p2_status_t20,
            p1_crucial_t20, p2_crucial_t20,
            p1_minor_t20, p2_minor_t20
        ) = check_team(battle_log, p1_team, p2_lead, GLOBAL_POKEDEX_TYPES)

        # --- AVERAGE TEAM STATS FOR PLAYER 1 ---
        p1_team_names = [p.get('name') for p in p1_team]
        # Use the helper function to compute the average stats
        p1_avg_stats = compute_team_average_stats(p1_team_names, GLOBAL_POKEDEX_STATS)
        

        features['p1_mean_hp']  = p1_avg_stats['base_hp']
        features['p1_mean_atk'] = p1_avg_stats['base_atk']
        features['p1_mean_def'] = p1_avg_stats['base_def']
        features['p1_mean_spa'] = p1_avg_stats['base_spa']
        features['p1_mean_spd'] = p1_avg_stats['base_spd']
        features['p1_mean_spe'] = p1_avg_stats['base_spe']
        
        # --- AVERAGE TEAM STATS FOR PLAYER 2 (REVEALED POKÉMON ONLY) ---
        # Use the helper function to compute the average stats
        p2_avg_stats = compute_team_average_stats(p2_team, GLOBAL_POKEDEX_STATS)


        features['p2_mean_hp']  = p2_avg_stats['base_hp']
        features['p2_mean_atk'] = p2_avg_stats['base_atk']
        features['p2_mean_def'] = p2_avg_stats['base_def']
        features['p2_mean_spa'] = p2_avg_stats['base_spa']
        features['p2_mean_spd'] = p2_avg_stats['base_spd']
        features['p2_mean_spe'] = p2_avg_stats['base_spe']

        

        p1_hp = p1_hp/6
        p2_hp = p2_hp/6 
        
        # Features on Hp, Alive, Deceased or Status pokemon
        #features['p1_alive_pkmn'] = p1_alive
        #features['p2_alive_pkmn'] = p2_alive
        features['p2_revealed_count'] = len(p2_team)
        features['p1_defeated_pkmn'] = 6-p1_alive
        features['p2_defeated_pkmn'] = 6-p2_alive
        features['p1_status_count'] = p1_status
        features['p2_status_count'] = p2_status
        features['p1_crucial_status_count'] = p1_crucial_status_count
        features['p1_minor_status_count'] = p1_minor_status_count
        features['p2_crucial_status_count'] = p2_crucial_status_count
        features['p2_minor_status_count'] = p2_minor_status_count
        features['p1_minor_status_advantage'] = p1_minor_status_count - p2_minor_status_count
        features['p1_crucial_status_advantage'] = p1_crucial_status_count - p2_crucial_status_count
        
        features['p1_total_hp'] = p1_hp
        features['p2_total_hp'] = p2_hp
        features['hp_p1__advantage'] = p1_hp - p2_hp
        
        # Features on effectiveness ---
        features['p1_super_effective_hits'] = p1_se_hits
        features['p2_super_effective_hits'] = p2_se_hits
        features['p1_resisted_hits'] = p1_resist_hits
        features['p2_resisted_hits'] = p2_resist_hits
        features['p1_immune_hits'] = p1_immune_hits
        features['p2_immune_hits'] = p2_immune_hits

        # Advantages on types
        features['p1_se_hits_advantage'] = p1_se_hits - p2_se_hits
        features['p1_resist_hits_advantage'] = p1_resist_hits - p2_resist_hits
        features['p1_immune_hits_advantage'] = p1_immune_hits - p2_immune_hits

        # Features on power of the moves and null counter
        features['p1_avg_power'] = p1_avg_power
        features['p2_avg_power'] = p2_avg_power
        features['p1_null_moves'] = p1_null_moves
        features['p2_null_moves'] = p2_null_moves
        
        # First blood
        features['p1_got_first_ko'] = p1_got_first_ko
        features['p2_got_first_ko'] = p2_got_first_ko
        features['first_ko_advantage'] = p1_got_first_ko - p2_got_first_ko
        features['first_ko_turn'] = first_ko_turn

        

        # Ratio features
        features['hp_ratio'] = p1_hp / (p2_hp + 1e-6)
        features['se_hits_ratio'] = (p1_se_hits + 1) / (p2_se_hits + 1)
        features['alive_ratio'] = (p1_alive + 1) / (p2_alive + 1)

        p1_team_names = [p.get('name') for p in p1_team]
        p2_team_names = p2_team 

        # Threat P1
        features['p1_has_snorlax'] = 1 if 'snorlax' in p1_team_names else 0
        features['p1_has_tauros'] = 1 if 'tauros' in p1_team_names else 0
        features['p1_has_chansey'] = 1 if 'chansey' in p1_team_names else 0
        
        # Threat P2
        features['p2_has_snorlax'] = 1 if 'snorlax' in p2_team_names else 0
        features['p2_has_tauros'] = 1 if 'tauros' in p2_team_names else 0
        features['p2_has_chansey'] = 1 if 'chansey' in p2_team_names else 0
        
        # Advantage
        features['p1_threat_advantage'] = (features['p1_has_snorlax'] + features['p1_has_tauros'] + features['p1_has_chansey']) - \
                                      (features['p2_has_snorlax'] + features['p2_has_tauros'] + features['p2_has_chansey'])

        
        
        
        # --- Turn 10 Features ---
        p1_hp_t10_norm = p1_hp_t10 / 6.0
        p2_hp_t10_norm = p2_hp_t10 / 6.0
        features['hp_adv_t10'] = p1_hp_t10_norm - p2_hp_t10_norm
        features['defeated_adv_t10'] = p2_defeated_t10 - p1_defeated_t10 
        features['status_adv_t10'] = p1_status_t10 - p2_status_t10
        features['crucial_status_adv_t10'] = p1_crucial_t10 - p2_crucial_t10
        features['minor_status_adv_t10'] = p1_minor_t10 - p2_minor_t10
        
        
        features['p1_hp_t10'] = p1_hp_t10_norm
        features['p2_hp_t10'] = p2_hp_t10_norm
        features['p1_defeated_t10'] = p1_defeated_t10
        features['p2_defeated_t10'] = p2_defeated_t10 
        
        # --- Turn 20 Features ---
        p1_hp_t20_norm = p1_hp_t20 / 6.0
        p2_hp_t20_norm = p2_hp_t20 / 6.0
        features['hp_adv_t20'] = p1_hp_t20_norm - p2_hp_t20_norm
        features['defeated_adv_t20'] = p2_defeated_t20 - p1_defeated_t20
        features['status_adv_t20'] = p1_status_t20 - p2_status_t20
        features['crucial_status_adv_t20'] = p1_crucial_t20 - p2_crucial_t20
        features['minor_status_adv_t20'] = p1_minor_t20 - p2_minor_t20
        
        
        features['p1_hp_t20'] = p1_hp_t20_norm
        features['p2_hp_t20'] = p2_hp_t20_norm
        features['p1_defeated_t20'] = p1_defeated_t20
        features['p2_defeated_t20'] = p2_defeated_t20

 

        features['battle_id'] = battle.get('battle_id')
        if 'player_won' in battle:
            features['player_won'] = int(battle['player_won'])
            
        feature_list.append(features)
        
    return pd.DataFrame(feature_list).fillna(0)

In [70]:
# Create feature DataFrames for both training and test sets
print("Processing training data...")
train_df = create_features(train_data)

print("\nProcessing test data...")
test_data = []
with open(test_file_path, 'r') as f:
    for line in f:
        test_data.append(json.loads(line))
test_df = create_features(test_data)

print("\nTraining features preview:")

#display(train_df)
display(train_df.head(10))

Processing training data...


Extracting features:   0%|          | 0/9999 [00:00<?, ?it/s]


Processing test data...


Extracting features:   0%|          | 0/5000 [00:00<?, ?it/s]


Training features preview:


Unnamed: 0,p1_mean_hp,p1_mean_atk,p1_mean_def,p1_mean_spa,p1_mean_spd,p1_mean_spe,p2_mean_hp,p2_mean_atk,p2_mean_def,p2_mean_spa,...,defeated_adv_t20,status_adv_t20,crucial_status_adv_t20,minor_status_adv_t20,p1_hp_t20,p2_hp_t20,p1_defeated_t20,p2_defeated_t20,battle_id,player_won
0,115.833333,72.5,63.333333,100.0,100.0,80.0,141.25,71.25,60.0,98.75,...,0,-1,0,-1,0.800528,0.801446,0,0,0,1
1,123.333333,72.5,65.833333,90.0,90.0,61.666667,115.833333,72.5,63.333333,100.0,...,-2,-1,0,-1,0.533333,0.676667,2,0,1,1
2,124.166667,84.166667,71.666667,90.0,90.0,65.833333,110.0,55.0,51.25,110.0,...,-1,0,1,-1,0.833333,0.693333,1,0,2,1
3,121.666667,77.5,65.833333,103.333333,103.333333,75.833333,101.25,101.25,77.5,90.0,...,-2,-1,1,-2,0.616667,0.791667,2,0,3,1
4,114.166667,75.833333,79.166667,97.5,97.5,72.5,128.0,77.0,67.0,93.0,...,0,2,0,2,0.78,0.721667,0,0,4,1
5,103.333333,70.833333,70.0,100.0,100.0,85.0,138.0,69.0,56.0,105.0,...,0,0,0,0,0.785,0.776667,0,0,5,1
6,74.166667,89.166667,105.833333,99.166667,99.166667,80.833333,124.166667,85.833333,75.833333,85.0,...,-1,0,0,0,0.543333,0.776667,2,1,6,1
7,89.166667,86.666667,76.666667,103.333333,103.333333,88.333333,84.0,75.0,60.0,95.0,...,-1,-4,-3,-1,0.738333,0.885,1,0,7,1
8,74.166667,89.166667,105.833333,99.166667,99.166667,80.833333,109.166667,68.333333,70.833333,92.5,...,-2,3,1,2,0.45,0.621667,3,1,8,1
9,120.833333,75.0,63.333333,104.166667,104.166667,77.5,105.833333,75.0,72.5,96.666667,...,-1,0,1,-1,0.65082,0.765714,1,0,9,1
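
Because `create_features` builds each row from a dictionary and the test frame is later indexed with the training column names (`test_df[features]`), it can be worth confirming that both frames expose the same feature columns. The following is a minimal sketch and not part of the original notebook: `align_feature_columns` and the toy frames are hypothetical names used only for illustration, and it assumes that a feature missing from the test set may be filled with 0, as `create_features` already does through `fillna(0)`.

```python
import pandas as pd

def align_feature_columns(train_df, test_df, exclude=("battle_id", "player_won")):
    """Hypothetical helper: return (feature_names, X_train, X_test) with matching columns."""
    feature_names = [c for c in train_df.columns if c not in exclude]
    X_train = train_df[feature_names]
    # reindex adds any feature missing from the test frame (filled with 0)
    # and drops columns that are not in the training feature list.
    X_test = test_df.reindex(columns=feature_names, fill_value=0)
    return feature_names, X_train, X_test

# Toy frames standing in for train_df / test_df (values are made up)
toy_train = pd.DataFrame({
    "battle_id": [0, 1],
    "p1_hp_t20": [0.8, 0.5],
    "status_adv_t20": [-1, 0],
    "player_won": [1, 0],
})
toy_test = pd.DataFrame({"battle_id": [0], "p1_hp_t20": [0.7]})

names, X_tr, X_te = align_feature_columns(toy_train, toy_test)
print(list(X_te.columns))
```

If `create_features` always emits the same keys for every battle, this step changes nothing; it only guards against a `KeyError` when a column appears in one set and not the other.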


## 5. Model Training & Validation

We now use `train_df` to train and validate the model. Our strategy has three parts:

1.  **Cross-Validation:** We first run a 5-fold `StratifiedKFold` cross-validation on the *entire* training set (`X`, `y`). The resulting "Mean CV Accuracy" and its standard deviation across folds estimate how well our features generalize, and are more stable than a score from a single split. Note that this run uses a fixed configuration (500 trees, learning rate 0.05), which differs from the final model's settings.
2.  **Optimal Parameter Search:** We then use a separate `train_test_split` (85/15) to find the `best_iteration` (the optimal number of trees) using `early_stopping_rounds`. Training stops once the validation metric has not improved for 30 rounds, which limits overfitting.
3.  **Final Training:** Finally, we create a new `XGBClassifier` using the `best_iteration` we found and train it on **100%** of the training data (`X`, `y`). This is the `final_model` we will use for submission.
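
To make the first step concrete, here is a small sketch of what stratification does, using made-up labels rather than the competition data. The names `y_toy`, `X_toy`, and `cv_toy` are hypothetical. With a 60/40 class balance, each validation fold produced by `StratifiedKFold` keeps the same share of positive labels, which is why fold scores are comparable to one another.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up binary labels: 60 wins and 40 losses
y_toy = np.array([1] * 60 + [0] * 40)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Share of positive labels in each validation fold
fold_rates = [float(y_toy[val_idx].mean()) for _, val_idx in cv_toy.split(X_toy, y_toy)]
fold_sizes = [len(val_idx) for _, val_idx in cv_toy.split(X_toy, y_toy)]
print(fold_rates)
print(fold_sizes)
```

A plain `KFold` with shuffling would give folds whose win rate drifts around 0.6, adding noise to the per-fold accuracy.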

In [73]:
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import accuracy_score
import numpy as np

# Build the feature matrix X and target y, excluding the ID and label columns
features = [col for col in train_df.columns if col not in ['battle_id', 'player_won']]
X = train_df[features]
y = train_df['player_won']

# Cross-validation splitter
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# --- XGBoost model used for cross-validation ---
model_cv = XGBClassifier(
    n_estimators=500,     
    learning_rate=0.05,
    max_depth=2,
    random_state=42,
    n_jobs=-1
)


print("Running 5-Fold Cross-Validation...")
# cross_val_score fits a fresh copy of the model on each of the 5 folds.
cv_scores = cross_val_score(model_cv, X, y, cv=cv, scoring="accuracy", n_jobs=-1)

print("Cross-Validation scores for each fold:")
print(cv_scores)
print(f"Mean CV Accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")


# With a performance estimate in hand, hold out 15% of the training data as a
# validation set so that early stopping can find the best number of boosting rounds.
X_train_full, X_val_full, y_train_full, y_val_full = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y
)

# Train with early stopping to find the optimal boosting round
model_finder = XGBClassifier(
    n_estimators=2000,
    learning_rate=0.03,
    max_depth=2,
    early_stopping_rounds=30,
    random_state=42,
    n_jobs=-1
)
model_finder.fit(
    X_train_full, 
    y_train_full, 
    eval_set=[(X_val_full, y_val_full)],
    verbose=False
)

best_iteration = model_finder.best_iteration
print(f"Retraining on full data with {best_iteration} estimators...")

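# We found the optimal number of trees, so we can do the final training.
# Note: best_iteration is a 0-based index, so the best round corresponds to
# best_iteration + 1 trees; using best_iteration here trains one tree fewer,
# which makes a negligible difference.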
final_model = XGBClassifier(
    n_estimators=best_iteration, 
    learning_rate=0.03,
    max_depth=2,
    random_state=42,
    n_jobs=-1
)
final_model.fit(X, y) # Train on 100% of the train data

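# Predict on the test set (the same prediction is recomputed in Section 6
# when the submission file is built)
X_test = test_df[features]
preds = final_model.predict(X_test)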

print("Model retrained on full data and predictions generated.")

Running 5-Fold Cross-Validation...
Cross-Validation scores for each fold:
[0.843      0.8355     0.852      0.844      0.84742371]
Mean CV Accuracy: 0.8444 ± 0.0054
Retraining on full data with 595 estimators...
Model retrained on full data and predictions generated.


## 6. Create Submission File

Using our `final_model` (trained on 100% of the training data with the number of trees found by early stopping), we generate predictions on `test_df` and save the results to `submission.csv`.

In [74]:
print("Generating predictions on the test set...")

# Select the same feature columns that were used for training
X_test = test_df[features]

# Use the final model (trained on 100% of the training data)
test_predictions = final_model.predict(X_test)

# Create the submission DataFrame
submission_df = pd.DataFrame({
    'battle_id': test_df['battle_id'],
    'player_won': test_predictions
})

# Save the DataFrame to a .csv file
submission_df.to_csv('submission.csv', index=False)

print("\n'submission.csv' file created successfully!")
display(submission_df.head(10))

Generating predictions on the test set...

'submission.csv' file created successfully!


Unnamed: 0,battle_id,player_won
0,0,0
1,1,1
2,2,1
3,3,1
4,4,1
5,5,1
6,6,1
7,7,1
8,8,1
9,9,1
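
Before uploading, a quick structural check on the submission frame can catch mistakes such as duplicated IDs or non-binary predictions. The sketch below is an illustration rather than part of the original notebook: `check_submission` and the toy frames are hypothetical, and the expected layout (a `battle_id` column followed by a `player_won` column holding 0 or 1) is assumed from the `submission_df` built above, not from the competition rules.

```python
import pandas as pd

def check_submission(submission, expected_ids):
    """Hypothetical helper: return a list of problems; an empty list means none were found."""
    problems = []
    if list(submission.columns) != ["battle_id", "player_won"]:
        problems.append(f"unexpected columns: {list(submission.columns)}")
        return problems  # the remaining checks need both columns
    if submission["battle_id"].duplicated().any():
        problems.append("duplicate battle_id values")
    if set(submission["battle_id"]) != set(expected_ids):
        problems.append("battle_id values do not match the test set")
    if not submission["player_won"].isin([0, 1]).all():
        problems.append("player_won contains values other than 0 and 1")
    return problems

# Toy frames (made-up values) showing a clean and a faulty submission
toy_good = pd.DataFrame({"battle_id": [0, 1, 2], "player_won": [0, 1, 1]})
toy_bad = pd.DataFrame({"battle_id": [0, 0, 2], "player_won": [0, 1, 2]})

good_report = check_submission(toy_good, expected_ids=[0, 1, 2])
bad_report = check_submission(toy_bad, expected_ids=[0, 1, 2])
print(good_report)
print(bad_report)
```

In this notebook the call would be `check_submission(submission_df, test_df['battle_id'])`.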
