# FDS Challenge: Starter Notebook

This notebook will guide you through the first steps of the competition. Our goal here is to show you how to:

1.  Load the `train.jsonl` and `test.jsonl` files from the competition data.
2.  Create a very simple set of features from the data.
3.  Train a basic model.
4.  Generate a `submission.csv` file in the correct format.
5.  Submit your results.

Let's get started!

### 1. Loading and Inspecting the Data

When you create a notebook within a Kaggle competition, the competition's data is automatically attached and available in the `../input/` directory.

The dataset is in a `.jsonl` format, which means each line is a separate JSON object. This is great because we can process it one line at a time without needing to load the entire large file into memory.

Let's write a simple loop to load the training data and inspect the first battle.

In [1]:
import json
import pandas as pd
import os

# --- Define the path to our data ---
COMPETITION_NAME = 'fds-pokemon-battles-prediction-2025'
DATA_PATH = os.path.join('../input', COMPETITION_NAME)

train_file_path = os.path.join(DATA_PATH, 'train.jsonl')
test_file_path = os.path.join(DATA_PATH, 'test.jsonl')
train_data = []

# Read the file line by line
print(f"Loading data from '{train_file_path}'...")
try:
    with open(train_file_path, 'r') as f:
        for line in f:
            # json.loads() parses one line (one JSON object) into a Python dictionary
            train_data.append(json.loads(line))

    print(f"Successfully loaded {len(train_data)} battles.")

    # Let's inspect the first battle to see its structure
    print("\n--- Structure of the first train battle: ---")
    if train_data:
        first_battle = train_data[0]
        
        # To keep the output clean, we can create a copy and truncate the timeline
        battle_for_display = first_battle.copy()
        battle_for_display['battle_timeline'] = battle_for_display.get('battle_timeline', [])[:2] # Show first 2 turns
        
        # Use json.dumps for pretty-printing the dictionary
        print(json.dumps(battle_for_display, indent=4))
        if len(first_battle.get('battle_timeline', [])) > 3:
            print("    ...")
            print("    (battle_timeline has been truncated for display)")


except FileNotFoundError:
    print(f"ERROR: Could not find the training file at '{train_file_path}'.")
    print("Please make sure you have added the competition data to this notebook.")

Loading data from '../input/fds-pokemon-battles-prediction-2025/train.jsonl'...
Successfully loaded 10000 battles.

--- Structure of the first train battle: ---
{
    "player_won": true,
    "p1_team_details": [
        {
            "name": "starmie",
            "level": 100,
            "types": [
                "psychic",
                "water"
            ],
            "base_hp": 60,
            "base_atk": 75,
            "base_def": 85,
            "base_spa": 100,
            "base_spd": 100,
            "base_spe": 115
        },
        {
            "name": "exeggutor",
            "level": 100,
            "types": [
                "grass",
                "psychic"
            ],
            "base_hp": 95,
            "base_atk": 95,
            "base_def": 85,
            "base_spa": 125,
            "base_spd": 125,
            "base_spe": 55
        },
        {
            "name": "chansey",
            "level": 100,
            "types": [
                "normal",

### 2. Basic Feature Engineering

A successful model will likely require creating many complex features. For this starter notebook, however, we will create a very simple feature set based **only on the initial team stats**. This will be enough to train a model and generate a submission file.

It's up to you to engineer more powerful features!

In [2]:
pokedex = {
    "bulbasaur": ("grass", "poison"),
    "ivysaur": ("grass", "poison"),
    "venusaur": ("grass", "poison"),
    "charmander": ("fire", "notype"),
    "charmeleon": ("fire", "notype"),
    "charizard": ("fire", "flying"),
    "squirtle": ("water", "notype"),
    "wartortle": ("water", "notype"),
    "blastoise": ("water", "notype"),
    "caterpie": ("bug", "notype"),
    "metapod": ("bug", "notype"),
    "butterfree": ("bug", "flying"),
    "weedle": ("bug", "poison"),
    "kakuna": ("bug", "poison"),
    "beedrill": ("bug", "poison"),
    "pidgey": ("normal", "flying"),
    "pidgeotto": ("normal", "flying"),
    "pidgeot": ("normal", "flying"),
    "rattata": ("normal", "notype"),
    "raticate": ("normal", "notype"),
    "spearow": ("normal", "flying"),
    "fearow": ("normal", "flying"),
    "ekans": ("poison", "notype"),
    "arbok": ("poison", "notype"),
    "pikachu": ("electric", "notype"),
    "raichu": ("electric", "notype"),
    "sandshrew": ("ground", "notype"),
    "sandslash": ("ground", "notype"),
    "nidoran♀": ("poison", "notype"),
    "nidorina": ("poison", "notype"),
    "nidoqueen": ("poison", "ground"),
    "nidoran♂": ("poison", "notype"),
    "nidorino": ("poison", "notype"),
    "nidoking": ("poison", "ground"),
    "clefairy": ("normal", "notype"),
    "clefable": ("normal", "notype"),
    "vulpix": ("fire", "notype"),
    "ninetales": ("fire", "notype"),
    "jigglypuff": ("normal", "notype"),
    "wigglytuff": ("normal", "notype"),
    "zubat": ("poison", "flying"),
    "golbat": ("poison", "flying"),
    "oddish": ("grass", "poison"),
    "gloom": ("grass", "poison"),
    "vileplume": ("grass", "poison"),
    "paras": ("bug", "grass"),
    "parasect": ("bug", "grass"),
    "venonat": ("bug", "poison"),
    "venomoth": ("bug", "poison"),
    "diglett": ("ground", "notype"),
    "dugtrio": ("ground", "notype"),
    "meowth": ("normal", "notype"),
    "persian": ("normal", "notype"),
    "psyduck": ("water", "notype"),
    "golduck": ("water", "notype"),
    "mankey": ("fighting", "notype"),
    "primeape": ("fighting", "notype"),
    "growlithe": ("fire", "notype"),
    "arcanine": ("fire", "notype"),
    "poliwag": ("water", "notype"),
    "poliwhirl": ("water", "notype"),
    "poliwrath": ("water", "fighting"),
    "abra": ("psychic", "notype"),
    "kadabra": ("psychic", "notype"),
    "alakazam": ("psychic", "notype"),
    "machop": ("fighting", "notype"),
    "machoke": ("fighting", "notype"),
    "machamp": ("fighting", "notype"),
    "bellsprout": ("grass", "poison"),
    "weepinbell": ("grass", "poison"),
    "victreebel": ("grass", "poison"),
    "tentacool": ("water", "poison"),
    "tentacruel": ("water", "poison"),
    "geodude": ("rock", "ground"),
    "graveler": ("rock", "ground"),
    "golem": ("rock", "ground"),
    "ponyta": ("fire", "notype"),
    "rapidash": ("fire", "notype"),
    "slowpoke": ("water", "psychic"),
    "slowbro": ("water", "psychic"),
    "magnemite": ("electric", "notype"),
    "magneton": ("electric", "notype"),
    "farfetch'd": ("normal", "flying"),
    "doduo": ("normal", "flying"),
    "dodrio": ("normal", "flying"),
    "seel": ("water", "notype"),
    "dewgong": ("water", "ice"),
    "grimer": ("poison", "notype"),
    "muk": ("poison", "notype"),
    "shellder": ("water", "notype"),
    "cloyster": ("water", "ice"),
    "gastly": ("ghost", "poison"),
    "haunter": ("ghost", "poison"),
    "gengar": ("ghost", "poison"),
    "onix": ("rock", "ground"),
    "drowzee": ("psychic", "notype"),
    "hypno": ("psychic", "notype"),
    "krabby": ("water", "notype"),
    "kingler": ("water", "notype"),
    "voltorb": ("electric", "notype"),
    "electrode": ("electric", "notype"),
    "exeggcute": ("grass", "psychic"),
    "exeggutor": ("grass", "psychic"),
    "cubone": ("ground", "notype"),
    "marowak": ("ground", "notype"),
    "hitmonlee": ("fighting", "notype"),
    "hitmonchan": ("fighting", "notype"),
    "lickitung": ("normal", "notype"),
    "koffing": ("poison", "notype"),
    "weezing": ("poison", "notype"),
    "rhyhorn": ("ground", "rock"),
    "rhydon": ("ground", "rock"),
    "chansey": ("normal", "notype"),
    "tangela": ("grass", "notype"),
    "kangaskhan": ("normal", "notype"),
    "horsea": ("water", "notype"),
    "seadra": ("water", "notype"),
    "goldeen": ("water", "notype"),
    "seaking": ("water", "notype"),
    "staryu": ("water", "notype"),
    "starmie": ("water", "psychic"),
    "mr. mime": ("psychic", "notype"),
    "scyther": ("bug", "flying"),
    "jynx": ("ice", "psychic"),
    "electabuzz": ("electric", "notype"),
    "magmar": ("fire", "notype"),
    "pinsir": ("bug", "notype"),
    "tauros": ("normal", "notype"),
    "magikarp": ("water", "notype"),
    "gyarados": ("water", "flying"),
    "lapras": ("water", "ice"),
    "ditto": ("normal", "notype"),
    "eevee": ("normal", "notype"),
    "vaporeon": ("water", "notype"),
    "jolteon": ("electric", "notype"),
    "flareon": ("fire", "notype"),
    "porygon": ("normal", "notype"),
    "omanyte": ("rock", "water"),
    "omastar": ("rock", "water"),
    "kabuto": ("rock", "water"),
    "kabutops": ("rock", "water"),
    "aerodactyl": ("rock", "flying"),
    "snorlax": ("normal", "notype"),
    "articuno": ("ice", "flying"),
    "zapdos": ("electric", "flying"),
    "moltres": ("fire", "flying"),
    "dratini": ("dragon", "notype"),
    "dragonair": ("dragon", "notype"),
    "dragonite": ("dragon", "flying"),
    "mewtwo": ("psychic", "notype"),
    "mew": ("psychic", "notype"),
}

super_effective = {
    "normal": [],
    "fire": ["grass", "ice", "bug"],
    "water": ["fire", "ground", "rock"],
    "electric": ["water", "flying"],
    "grass": ["water", "ground", "rock"],
    "ice": ["grass", "ground", "flying", "dragon"],
    "fighting": ["normal", "ice", "rock"],
    "poison": ["grass", "bug"],
    "ground": ["fire", "electric", "poison", "rock"],
    "flying": ["grass", "fighting", "bug"],
    "psychic": ["fighting", "poison"],
    "bug": ["grass", "psychic"],
    "rock": ["fire", "ice", "flying", "bug"],
    "ghost": ["ghost"],
    "dragon": ["dragon"],
}

In [3]:
from tqdm.notebook import tqdm
import numpy as np

def has_type_advantage(pokemon1 : str, pokemon2 : str, pokedex : dict) -> bool:
    types1 = pokedex[pokemon1]
    types2 = pokedex[pokemon2]
    for t1 in types1:
        if t1 == "notype":
            continue
        for t2 in types2:
            if t2 == "notype":
                continue
            if t2 in super_effective.get(t1, []):
                return True
    return False

def create_simple_features(data: list[dict]) -> pd.DataFrame:
    """
    A very basic feature extraction function.
    It only uses the aggregated base stats of the player's team and opponent's lead.
    """
    feature_list = []
    for battle in tqdm(data, desc="Extracting features"):
        features = {}
        
        # --- Player 1 Lead Features ---
        p1_team = battle.get('p1_team_details', [])
        if p1_team:
            tmp_count=0
            for p in p1_team:
                tmp_count += p.get('base_spe', 0)
            #features['p1_spe_sum'] = tmp_count

        # --- Player 2 Lead Features ---
        p2_lead = battle.get('p2_lead_details')
        #if p2_lead:
            # Player 2's lead Pokémon's stats
            #features['p2_lead_spe'] = p2_lead.get('base_spe', 0)

        # --- Type advantage counter for Player 1 ---
        battle_timeline = battle.get('battle_timeline', 0)
        if battle_timeline:
            # Features I am going to extract:
            p1_state_counter = 0
            p2_state_counter = 0
            p1_hp_advantage_counter = 0
            p1_null_move_counter = 0
            p2_null_move_counter = 0
            p1_faints_counter = 0
            p2_faints_counter = 0
            p1_type_advantage_counter = 0
            p1_effects_counter = 0
            p2_effects_counter = 0
            p1_boosts_counter = 0 
            p2_boosts_counter = 0 
            p1_tauros_counter = 0
            p1_chansey_counter = 0
            p2_tauros_counter = 0
            p2_chansey_counter = 0
            p1_frz = 0
            p2_frz = 0

            p1_hyperbeam_counter = 0
            p2_hyperbeam_counter = 0
            p1_thunder_wave = 0 
            p2_thunder_wave = 0
            p1_explosion_selfdestruct = 0
            p2_explosion_selfdestruct = 0
            p1_body_slam = 0
            p2_body_slam = 0
            p1_rest = 0
            p2_rest = 0
            
            p1_early_faints = 0  # First 10 turns
            p2_early_faints = 0
            p1_early_hp_advantage = 0
            first_blood = 0  # Who got first KO? (-1 = P2, 0 = none, 1 = P1)

            # Iterate through battle timeline
            for turn_idx, battle_step in enumerate(battle_timeline):
                p1_pokemon_state = battle_step.get('p1_pokemon_state', 0)
                p2_pokemon_state = battle_step.get('p2_pokemon_state', 0)
                
                # -- Type advantage counter
                if has_type_advantage(p1_pokemon_state.get('name', 0), p2_pokemon_state.get('name', 0), pokedex):
                    p1_type_advantage_counter += 1
                    
                # -- Counter for P1 pokemons' states
                p1_status = p1_pokemon_state.get('status', 0)
                if p1_status and p1_status != "fnt":
                    p1_state_counter += 1
                    if p1_status == "frz":
                        p1_frz += 1
                    
                # -- Counter for P2 pokemons' states
                p2_status = p2_pokemon_state.get('status', 0)
                if p2_status and p2_status != "fnt":
                    p2_state_counter += 1
                    if p2_status == "frz":
                        p2_frz += 1
                
                # -- Counter for hp percentage advatange for P1
                p1_hp_pct = p1_pokemon_state.get('hp_pct', 0)
                p2_hp_pct = p2_pokemon_state.get('hp_pct', 0)
                if p1_hp_pct and p2_hp_pct and (p1_hp_pct > p2_hp_pct):
                    p1_hp_advantage_counter += 1

                # -- P1 pokemon faints counter
                if p1_status == "fnt":
                        p1_faints_counter += 1
                    
                # -- P2 pokemon faints counter
                if p2_status == "fnt":
                        p2_faints_counter += 1
                
                # -- Effects counter
                p1_effects = p1_pokemon_state.get('effects', 0)
                p2_effects = p2_pokemon_state.get('effects', 0)
                if not "noeffect" in p1_effects:
                    p1_effects_counter += 1
                if not "noeffect" in p2_effects:
                    p2_effects_counter += 1

                # -- Boosts counter
                p1_boosts = p1_pokemon_state.get('boosts', 0)
                p2_boosts = p2_pokemon_state.get('boosts', 0)
                if any(value > 0 for value in p1_boosts.values()):
                    p1_boosts_counter += 1
                if any(value > 0 for value in p2_boosts.values()):
                    p2_boosts_counter += 1
                
                # Tauros and Chansey counter
                p1_name = p1_pokemon_state.get('name', 0)
                p1_tauros_counter = p1_tauros_counter + 1 if p1_name == "tauros" else p1_tauros_counter
                p1_chansey_counter = p1_chansey_counter +1 if p1_name == "chansey" else p1_chansey_counter

                p2_name = p2_pokemon_state.get('name', 0)
                p2_tauros_counter = p2_tauros_counter + 1 if p2_name == "tauros" else p2_tauros_counter
                p2_chansey_counter = p2_chansey_counter + 1 if p2_name == "chansey" else p2_chansey_counter
                
                # Counters for p1 and p2 lost turns (move is null)
                p1_move_details = battle_step.get('p1_move_details', 0)
                p2_move_details = battle_step.get('p2_move_details', 0)
                if not p1_move_details:
                    p1_null_move_counter += 1
                if not p2_move_details:
                    p2_null_move_counter += 1

                # -- Move  counters
                if p1_move_details:
                    if p1_move_details.get('name', 0) == 'hyperbeam':
                        p1_hyperbeam_counter +=1
                    elif p1_move_details.get('name', 0) == 'bodyslam':
                        p1_body_slam +=1
                    elif p1_move_details.get('name', 0) == 'thunderwave':
                        p1_thunder_wave +=1
                    elif p1_move_details.get('name', 0) == 'explosion':
                        p1_explosion_selfdestruct +=1
                    elif p1_move_details.get('name', 0) == 'selfdestruct':
                        p1_explosion_selfdestruct +=1
                    elif p1_move_details.get('name', 0) == 'rest':
                        p1_rest +=1
                        
                if p2_move_details:
                    if p2_move_details.get('name', 0) == 'hyperbeam':
                        p2_hyperbeam_counter +=1
                    elif p2_move_details.get('name', 0) == 'bodyslam':
                        p2_body_slam +=1
                    elif p2_move_details.get('name', 0) == 'thunderwave':
                        p2_thunder_wave +=1
                    elif p2_move_details.get('name', 0) == 'explosion':
                        p2_explosion_selfdestruct +=1
                    elif p2_move_details.get('name', 0) == 'selfdestruct':
                        p2_explosion_selfdestruct +=1
                    elif p2_move_details.get('name', 0) == 'rest':
                        p2_rest +=1

                # Early game tracking (first 10 turns)
                if turn_idx < 10:
                    if p1_status == "fnt":
                        p1_early_faints += 1
                        if first_blood == 0:
                            first_blood = -1
                    if p2_status == "fnt":
                        p2_early_faints += 1
                        if first_blood == 0:
                            first_blood = 1
                    if p1_hp_pct and p2_hp_pct and (p1_hp_pct > p2_hp_pct):
                        p1_early_hp_advantage += 1

                
        features['p1_state_counter'] = p1_state_counter
        features['p2_state_counter'] = p2_state_counter
        features['p1_hp_advantage_counter'] = p1_hp_advantage_counter
        features['p1_null_move_counter'] = p1_null_move_counter
        features['p2_null_move_counter'] = p2_null_move_counter
        features['p1_faints_counter'] = p1_faints_counter
        features['p2_faints_counter'] = p2_faints_counter
        features['p1_type_advantage_counter'] = p1_type_advantage_counter
        features['p1_effects_counter'] = p1_effects_counter
        features['p2_effects_counter'] = p2_effects_counter
        features['p1_boosts_counter'] = p1_boosts_counter
        features['p2_boosts_counter'] = p2_boosts_counter
        features['p1_chansey_counter'] = p1_chansey_counter
        features['p1_tauros_counter'] = p1_tauros_counter
        features['p2_chansey_counter'] = p2_chansey_counter
        features['p2_tauros_counter'] = p2_tauros_counter
        features['p1_early_faints'] = p1_early_faints
        features['p2_early_faints'] = p2_early_faints
        features['first_blood'] = first_blood
        features['p1_early_hp_advantage'] = p1_early_hp_advantage
        features['p1_frz'] = p1_frz
        features['p2_frz'] = p2_frz

        features['p1_hyperbeam_counter'] = p1_hyperbeam_counter
        features['p2_hyperbeam_counter'] = p2_hyperbeam_counter
        features['p1_thunder_wave'] = p1_thunder_wave
        features['p2_thunder_wave'] = p2_thunder_wave
        features['p1_explosion_selfdestruct'] = p1_explosion_selfdestruct
        features['p2_explosion_selfdestruct'] = p2_explosion_selfdestruct
        features['p1_body_slam'] = p1_body_slam
        features['p2_body_slam'] = p2_body_slam
        features['p1_rest'] = p1_rest
        features['p2_rest'] = p2_rest
        


        features['faint_difference'] = p2_faints_counter - p1_faints_counter  # Positive = good for P1
        features['null_move_difference'] = p2_null_move_counter - p1_null_move_counter
        features['state_difference'] = p2_state_counter - p1_state_counter  # More opponent statuses = good
        features['effects_difference'] = p2_effects_counter - p1_effects_counter
        #features['boosts_difference'] = p1_boosts_counter - p2_boosts_counter
        
        # We also need the ID and the target variable (if it exists)
        features['battle_id'] = battle.get('battle_id')
        if 'player_won' in battle:
            features['player_won'] = int(battle['player_won'])
            
        feature_list.append(features)
        
    return pd.DataFrame(feature_list).fillna(0)

# Create feature DataFrames for both training and test sets
print("Processing training data...")
train_df = create_simple_features(train_data)

print("\nProcessing test data...")
test_data = []
with open(test_file_path, 'r') as f:
    for line in f:
        test_data.append(json.loads(line))
test_df = create_simple_features(test_data)

print("\nTraining features preview:")
display(train_df.head())

Processing training data...


Extracting features:   0%|          | 0/10000 [00:00<?, ?it/s]


Processing test data...


Extracting features:   0%|          | 0/5000 [00:00<?, ?it/s]


Training features preview:


Unnamed: 0,p1_state_counter,p2_state_counter,p1_hp_advantage_counter,p1_null_move_counter,p2_null_move_counter,p1_faints_counter,p2_faints_counter,p1_type_advantage_counter,p1_effects_counter,p2_effects_counter,...,p1_body_slam,p2_body_slam,p1_rest,p2_rest,faint_difference,null_move_difference,state_difference,effects_difference,battle_id,player_won
0,29,29,12,3,14,1,1,3,2,0,...,0,8,1,0,0,11,0,-2,0,1
1,27,30,15,7,7,3,0,1,0,0,...,7,9,0,0,-3,0,3,0,1,1
2,29,30,13,3,8,1,0,0,9,12,...,0,0,0,0,-1,5,1,3,2,1
3,27,30,10,7,5,3,0,4,0,4,...,2,3,0,0,-3,-2,3,4,3,1
4,29,30,13,4,4,1,0,1,0,2,...,3,2,0,0,-1,0,1,2,4,1


In [4]:
from sklearn.model_selection import KFold
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)

# Create a dictionary to store all results
model_results = {}
print("Kfold initialized")

Kfold initialized


In [5]:
# Define our features (X) and target (y)
features = [col for col in train_df.columns if col not in ['battle_id', 'player_won']]
X_train = train_df[features]
y_train = train_df['player_won']

X_test = test_df[features]

## Standardize Feature Dataset
Since some models may require normalized features to work properly, like logistic regression and SVM, we must create normalized train and test splits.

## Load Models and train them
1. **Logistic Regression**
    - What it does: Linear model that estimates probability of binary outcomes using a logistic function
    - Why it's good: Fast, interpretable, works well with linearly separable data, handles feature interactions with polynomial features
    - Best for: Capturing linear relationships between features and outcomes

2. **Random Forest**
    - What it does: Ensemble of decision trees, each trained on random subsets of data and features
    - Why it's good: Handles non-linear relationships, robust to outliers, provides feature importance, doesn't need normalization
    - Best for: Capturing complex, non-linear patterns in battle dynamics

3. **Support Vector Machine (SVM)**
    - What it does: Finds optimal hyperplane that maximally separates classes in high-dimensional space
    - Why it's good: Effective in high-dimensional spaces, versatile with different kernels (linear, RBF), good with small-to-medium datasets
    - Best for: Finding complex decision boundaries between winning/losing patterns

4. **Gradient Boosting (Optional)**
    - What it does: Sequentially builds trees where each corrects errors of previous ones
    - Why it's good: Often achieves highest accuracy, handles feature interactions well
    - Best for: Squeezing out extra performance

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, classification_report
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# Define models with pipelines
log_reg_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

svm_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC(kernel='rbf', probability=True, random_state=42))
])

rf = RandomForestClassifier(n_estimators=300, max_depth=10, random_state=42)

# Define CatBoost model
catboost_model = CatBoostClassifier(
    iterations=500,           # Number of boosting iterations
    learning_rate=0.1,        # Step size
    depth=6,                  # Tree depth
    loss_function='Logloss',  # Binary classification loss
    verbose=False,            # Suppress training output
    random_state=42,
    eval_metric='Accuracy'    # Metric to monitor
)

# Define LightGBM model
lgbm_model = LGBMClassifier(
    n_estimators=500,         # Number of boosting rounds
    learning_rate=0.1,        # Step size
    max_depth=6,              # Maximum tree depth
    num_leaves=31,            # Max number of leaves (2^depth - 1 typically)
    random_state=42,
    verbose=-1,               # Suppress warnings
    force_col_wise=True       # Faster training for your data size
)

# Define XGBoost
xgb_model = XGBClassifier(
    n_estimators=500,
    learning_rate=0.1,
    max_depth=6,
    random_state=42,
    eval_metric='logloss',
    verbosity=0
)

# Train and evaluate each model individually
models = {
    'Logistic Regression': log_reg_pipe, 
    'Random Forest': rf,
    'SVM': svm_pipe,
    'Catboost': catboost_model,
    'LightGBM': lgbm_model,
    'XGBoost': xgb_model 
}

print("Individual Model Performance:")
for name, model in models.items():
    # All models receive X_train (pipelines handle scaling internally)
    scores = cross_val_score(model, X_train, y_train, cv=kf, scoring='accuracy')
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")
    model_results[name] = {
        'model': model,
        'cv_mean': scores.mean(),
        'cv_std': scores.std(),
        'cv_scores': scores
    }


Individual Model Performance:
Logistic Regression: 0.8045 (+/- 0.0077)
Random Forest: 0.8036 (+/- 0.0069)
SVM: 0.8051 (+/- 0.0076)
Catboost: 0.8017 (+/- 0.0067)
LightGBM: 0.7973 (+/- 0.0104)
XGBoost: 0.8009 (+/- 0.0057)


## Voting Ensamble
Voting Classifier combines predictions through voting:
- Hard voting: Majority class vote (3 models predict class 1, 1 predicts class 0 → output is class 1)
- Soft voting: Averages predicted probabilities, then picks class with highest average (usually better)

In [7]:
from sklearn.ensemble import VotingClassifier

# Create pipelines to handle scaling for specific models
log_reg_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

svm_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC(kernel='rbf', probability=True, random_state=42))
])

# Random Forest doesn't need scaling
rf_model = RandomForestClassifier(n_estimators=300, max_depth=10, random_state=42)

# Create voting ensemble
voting_ensemble = VotingClassifier(
    estimators=[
        ('lr', log_reg_pipe),
        ('rf', rf_model),
        ('svm', svm_pipe),
        ('catboost' , catboost_model),
        ('lgbm' , lgbm_model),
        ('xgboost' , xgb_model)
    ],
    voting='soft'
)

# ===== EVALUATE WITH K-FOLD ON TRAINING DATA =====
print("Performing K-Fold Cross-Validation on training data...")
cv_scores = cross_val_score(voting_ensemble, X_train, y_train, cv=kf, scoring='accuracy')
print(f"\nK-Fold CV Results:")
print(f"Mean Accuracy: {cv_scores.mean():.4f}")
print(f"Std Deviation: {cv_scores.std():.4f}")
print(f"Individual Fold Scores: {cv_scores.round(4)}")

model_results['Voting Ensemble'] = {
    'model': voting_ensemble,
    'cv_mean': scores.mean(),
    'cv_std': scores.std(),
    'cv_scores': scores
}

Performing K-Fold Cross-Validation on training data...

K-Fold CV Results:
Mean Accuracy: 0.8066
Std Deviation: 0.0065
Individual Fold Scores: [0.8125 0.7965 0.803  0.8145 0.8065]


## Stacking Classifier
Stacking Classifier uses a meta-model to learn how to best combine base model predictions:
- Base models make predictions
- Meta-model (usually Logistic Regression) learns optimal weighting of base predictions

In [8]:
from sklearn.ensemble import StackingClassifier

# Define base learners (with pipelines for normalization)
base_learners = [
    ('lr', log_reg_pipe),
    ('rf', rf_model),
    ('svm', svm_pipe),
    ('catboost' , catboost_model),
    ('lgbm' , lgbm_model),
    ('xgboost' , xgb_model)
]

# Define meta-learner
meta_learner = LogisticRegression(max_iter=1000, random_state=42)

# Create stacking ensemble
stacking_ensemble = StackingClassifier(
    estimators=base_learners,
    final_estimator=meta_learner,
    cv=5  # Use cross-validation to generate training data for meta-learner
)

# ===== EVALUATE WITH K-FOLD ON TRAINING DATA =====
print("Performing K-Fold Cross-Validation for Stacking Ensemble...")
cv_scores = cross_val_score(stacking_ensemble, X_train, y_train, cv=kf, scoring='accuracy')
print(f"\nStacking Ensemble CV Results:")
print(f"Mean Accuracy: {cv_scores.mean():.4f}")
print(f"Std Deviation: {cv_scores.std():.4f}")
print(f"Individual Fold Scores: {cv_scores.round(4)}")

model_results['Stacking Ensemble'] = {
    'model': stacking_ensemble,
    'cv_mean': scores.mean(),
    'cv_std': scores.std(),
    'cv_scores': scores
}

Performing K-Fold Cross-Validation for Stacking Ensemble...

Stacking Ensemble CV Results:
Mean Accuracy: 0.8075
Std Deviation: 0.0071
Individual Fold Scores: [0.814  0.8005 0.8055 0.8175 0.8   ]


In [9]:
# ============================================================
# DETAILED MODEL ANALYSIS (OPTIMIZED)
# ============================================================
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
from sklearn.model_selection import cross_val_predict

print("=" * 60)
print("GENERATING CROSS-VALIDATION PREDICTIONS (ONE TIME)")
print("=" * 60)

# Generate all predictions ONCE and store them
cv_predictions = {}
cv_probabilities = {}

for name, results in model_results.items():
    print(f"Generating predictions for {name}...")
    
    # Get class predictions
    cv_predictions[name] = cross_val_predict(results['model'], X_train, y_train, cv=kf)
    
    # Get probability predictions
    cv_probabilities[name] = cross_val_predict(results['model'], X_train, y_train, 
                                                cv=kf, method='predict_proba')[:, 1]

print("✓ All predictions generated!\n")

GENERATING CROSS-VALIDATION PREDICTIONS (ONE TIME)
Generating predictions for Logistic Regression...
Generating predictions for Random Forest...
Generating predictions for SVM...
Generating predictions for Catboost...
Generating predictions for LightGBM...
Generating predictions for XGBoost...
Generating predictions for Voting Ensemble...
Generating predictions for Stacking Ensemble...
✓ All predictions generated!



### 4.1 Selecting Best Model
In previous blocks, multiple model are evaluated with cross validation. Now we select the best one by cross validation mean accuracy score, train it on full training set and generate predictions for the submission csv

In [10]:
# Display comparison of all models
print("\n" + "=" * 60)
print("MODEL COMPARISON")
print("=" * 60)

# Sort by CV accuracy
sorted_models = sorted(model_results.items(), key=lambda x: x[1]['cv_mean'], reverse=True)

for rank, (name, results) in enumerate(sorted_models, 1):
    print(f"{rank}. {name:25s} {results['cv_mean']:.4f} (+/- {results['cv_std']:.4f})")

# Identify best model
best_model_name, best_results = sorted_models[0]
print(f"\nBest Model: {best_model_name}")
print(f"CV Accuracy: {best_results['cv_mean']:.4f}")


MODEL COMPARISON
1. SVM                       0.8051 (+/- 0.0076)
2. Logistic Regression       0.8045 (+/- 0.0077)
3. Random Forest             0.8036 (+/- 0.0069)
4. Catboost                  0.8017 (+/- 0.0067)
5. XGBoost                   0.8009 (+/- 0.0057)
6. Voting Ensemble           0.8009 (+/- 0.0057)
7. Stacking Ensemble         0.8009 (+/- 0.0057)
8. LightGBM                  0.7973 (+/- 0.0104)

Best Model: SVM
CV Accuracy: 0.8051


In [11]:
# Train best model on full training set
print("\n" + "=" * 60)
print("TRAINING BEST MODEL")
print("=" * 60)

best_model = best_results['model']
print(f"Training {best_model_name} on full training set...")
best_model.fit(X_train, y_train)
print("Training complete!")


TRAINING BEST MODEL
Training SVM on full training set...
Training complete!


### 4.2 Creating the Submission File

The competition requires a `.csv` file with two columns: `battle_id` and `player_won`. Let's use our trained model to make predictions on the test set and format them correctly.

In [12]:
import pandas as pd

# Generate predictions with best model
print("\n" + "=" * 60)
print("GENERATING PREDICTIONS")
print("=" * 60)

y_pred = best_model.predict(X_test)

# Create submission file
submission_df = pd.DataFrame({
    'battle_id': test_df['battle_id'],
    'player_won': y_pred
})

submission_df.to_csv('submission.csv', index=False)

print(f"Predictions saved to 'predictions.csv'")
print(f"Total predictions: {len(submission_df)}")
print(f"\nPrediction distribution:")
print(f"Player won (1): {(y_pred == 1).sum()} ({(y_pred == 1).sum()/len(y_pred)*100:.1f}%)")
print(f"Player lost (0): {(y_pred == 0).sum()} ({(y_pred == 0).sum()/len(y_pred)*100:.1f}%)")

display(submission_df.head())


GENERATING PREDICTIONS
Predictions saved to 'predictions.csv'
Total predictions: 5000

Prediction distribution:
Player won (1): 2478 (49.6%)
Player lost (0): 2522 (50.4%)


Unnamed: 0,battle_id,player_won
0,0,0
1,1,1
2,2,1
3,3,1
4,4,0


### 5. Submitting Your Results

Once you have generated your `submission.csv` file, there are two primary ways to submit it to the competition.

---

#### Method A: Submitting Directly from the Notebook

This is the standard method for code competitions. It ensures that your submission is linked to the code that produced it, which is crucial for reproducibility.

1.  **Save Your Work:** Click the **"Save Version"** button in the top-right corner of the notebook editor.
2.  **Run the Notebook:** In the pop-up window, select **"Save & Run All (Commit)"** and then click the **"Save"** button. This will run your entire notebook from top to bottom and save the output, including your `submission.csv` file.
3.  **Go to the Viewer:** Once the save process is complete, navigate to the notebook viewer page. 
4.  **Submit to Competition:** In the viewer, find the **"Submit to Competition"** section. This is usually located in the header of the output section or in the vertical "..." menu on the right side of the page. Clicking the **Submit** button this will submit your generated `submission.csv` file.

After submitting, you will see your score in the **"Submit to Competition"** section or in the [Public Leaderboard](https://www.kaggle.com/competitions/fds-pokemon-battles-prediction-2025/leaderboard?).

---

#### Method B: Manual Upload

You can also generate your predictions and submission file using any environment you prefer (this notebook, Google Colab, or your local machine).

1.  **Generate the `submission.csv` file** using your model.
2.  **Download the file** to your computer.
3.  **Navigate to the [Leaderboard Page](https://www.kaggle.com/competitions/fds-pokemon-battles-prediction-2025/leaderboard?)** and click on the **"Submit Predictions"** button.
4.  **Upload Your File:** Drag and drop or select your `submission.csv` file to upload it.

This method is quick, but keep in mind that for the final evaluation, you might be required to provide the code that generated your submission.

Good luck!