# 05 - Feature Engineering

**Purpose**: Create derived features for modeling and deeper analysis.

**Features to Create**:
1. **Matchup features**: Trophy diff, elixir diff, card level diff, spell diff
2. **Deck complexity**: Weighted score based on elixir, spell count, legendary count
3. **Archetype indicators**: Beatdown, cycle, spell-heavy, building-heavy flags
4. **Card synergy scores**: Based on historical win rates of card pairs
5. **Trophy brackets**: Categorical variables for skill levels

**Output**: Clean feature matrix saved as Parquet for modeling
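
The deck complexity score in item 2 is described only as a weighted combination of elixir, spell count, and legendary count. A minimal sketch of that idea follows. The weights and the column names (`winner_avg_elixir`, `winner_spell_count`, `winner_legendary_count`) are illustrative assumptions, not values taken from `feature_engineering`.

```python
import pandas as pd

# Hypothetical weights, chosen only for illustration.
COMPLEXITY_WEIGHTS = {"avg_elixir": 1.0, "spell_count": 0.5, "legendary_count": 0.75}


def deck_complexity(df: pd.DataFrame, player: str, weights=COMPLEXITY_WEIGHTS) -> pd.Series:
    """Weighted sum of per-deck statistics for one player ('winner' or 'loser')."""
    score = pd.Series(0.0, index=df.index)
    for stat, weight in weights.items():
        # Column names such as 'winner_avg_elixir' are assumed, not confirmed.
        score = score + weight * df[f"{player}_{stat}"]
    return score


toy_decks = pd.DataFrame({
    "winner_avg_elixir": [3.0, 4.0],
    "winner_spell_count": [2, 4],
    "winner_legendary_count": [0, 2],
})
toy_complexity = deck_complexity(toy_decks, "winner")
print(toy_complexity)
```

Because the three inputs are on different scales, standardizing them before weighting may be a better choice in practice.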

In [None]:
import os
import sys

import duckdb
import numpy as np
import pandas as pd

PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.insert(0, os.path.join(PROJECT_ROOT, 'src'))

# Use Parquet if available (faster), fallback to CSV
DATA_PATH = os.path.join(PROJECT_ROOT, 'battles.parquet')
if not os.path.exists(DATA_PATH):
    DATA_PATH = os.path.join(PROJECT_ROOT, 'battles.csv')

from duckdb_utils import get_connection, create_battles_view, query_to_df, save_to_parquet, create_sample
from feature_engineering import (
    create_card_level_features,
    create_deck_archetype_features,
    create_trophy_bracket_features,
    create_matchup_features,
    create_tower_damage_features
)

con = get_connection()
create_battles_view(con, DATA_PATH)

## 1. Load Base Data

Work with a 10% sample of the battles while developing the features.

In [10]:
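# Create the 10% sample if it does not exist yet; otherwise load it from disk.
# The existence check is resolved against PROJECT_ROOT, while create_sample()
# receives the relative path, so it is assumed to resolve output_path against
# the project root as well.
sample_path = 'artifacts/sample_battles_10pct.parquet'
if not os.path.exists(os.path.join(PROJECT_ROOT, sample_path)):
    print("Creating 10% sample...")
    sample = create_sample(con, sample_pct=10, output_path=sample_path)
else:
    print("Loading existing sample...")
    sample = pd.read_parquet(os.path.join(PROJECT_ROOT, sample_path))

print(f"Sample size: {len(sample):,} battles")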

✓ Saved 1,678,990 rows to artifacts\sample_battles_10pct.parquet (222.0 MB)
Sample size: 1,678,990 battles


## 2. Create Matchup Features

In [11]:
# Add matchup comparison features
sample_features = create_matchup_features(sample)

print("Matchup features created:")
print("  - trophy_diff")
print("  - elixir_diff")
print("  - card_level_diff")
print("  - spell_diff")

Matchup features created:
  - trophy_diff
  - elixir_diff
  - card_level_diff
  - spell_diff
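
The body of `create_matchup_features` is not shown in this notebook. As a rough sketch of how difference features like these could be built, the block below subtracts loser columns from winner columns. The source column names and the winner-minus-loser sign convention are assumptions; the real schema may differ.

```python
import pandas as pd

# Hypothetical source columns for each difference feature.
DIFF_PAIRS = {
    "trophy_diff": ("winner_trophies", "loser_trophies"),
    "elixir_diff": ("winner_avg_elixir", "loser_avg_elixir"),
    "card_level_diff": ("winner_total_card_level", "loser_total_card_level"),
    "spell_diff": ("winner_spell_count", "loser_spell_count"),
}


def add_diff_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with winner-minus-loser difference columns added."""
    out = df.copy()
    for name, (winner_col, loser_col) in DIFF_PAIRS.items():
        out[name] = out[winner_col] - out[loser_col]
    return out


toy_battles = pd.DataFrame({
    "winner_trophies": [4500, 5100],
    "loser_trophies": [4480, 5150],
    "winner_avg_elixir": [3.5, 4.1],
    "loser_avg_elixir": [3.9, 3.0],
    "winner_total_card_level": [88, 96],
    "loser_total_card_level": [90, 96],
    "winner_spell_count": [2, 3],
    "loser_spell_count": [1, 3],
})
toy_matchups = add_diff_features(toy_battles)
print(toy_matchups[list(DIFF_PAIRS)])
```

Working on a copy keeps the input frame unchanged, which makes the step safe to re-run in a notebook.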


## 3. Create Deck Archetype Features

In [12]:
# Add archetype indicators for winner and loser
sample_features = create_deck_archetype_features(sample_features, player='winner')
sample_features = create_deck_archetype_features(sample_features, player='loser')

print("Archetype features created for both players")

Archetype features created for both players
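
The rules inside `create_deck_archetype_features` are not shown here. The sketch below illustrates one way such binary flags might be derived from per-deck statistics. All thresholds and input column names (`{player}_avg_elixir`, `{player}_spell_count`, `{player}_building_count`) are hypothetical; only the output names follow the pattern seen in the feature summary.

```python
import pandas as pd


def add_archetype_flags(
    df: pd.DataFrame,
    player: str,
    cycle_max_elixir: float = 3.0,     # assumed threshold
    beatdown_min_elixir: float = 4.0,  # assumed threshold
    spell_heavy_min: int = 3,          # assumed threshold
    building_heavy_min: int = 2,       # assumed threshold
) -> pd.DataFrame:
    """Return a copy of df with 0/1 archetype flags for one player."""
    out = df.copy()
    elixir = out[f"{player}_avg_elixir"]
    out[f"{player}_cycle"] = (elixir <= cycle_max_elixir).astype(int)
    out[f"{player}_beatdown"] = (elixir >= beatdown_min_elixir).astype(int)
    out[f"{player}_spell_heavy"] = (
        out[f"{player}_spell_count"] >= spell_heavy_min
    ).astype(int)
    out[f"{player}_building_heavy"] = (
        out[f"{player}_building_count"] >= building_heavy_min
    ).astype(int)
    return out


toy_archetypes = pd.DataFrame({
    "winner_avg_elixir": [2.8, 4.3, 3.5],
    "winner_spell_count": [2, 1, 3],
    "winner_building_count": [1, 0, 2],
})
toy_flags = add_archetype_flags(toy_archetypes, player="winner")
print(toy_flags.filter(like="winner_").drop(columns=list(toy_archetypes.columns)))
```

With these assumed thresholds the flags are not mutually exclusive: a deck can be both cycle and spell-heavy, or match none of them.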


## 4. Create Trophy Bracket Features

In [13]:
# Categorize battles by trophy level
sample_features = create_trophy_bracket_features(sample_features)

print("Trophy bracket distribution:")
print(sample_features['trophy_bracket'].value_counts())

Trophy bracket distribution:
trophy_bracket
4000-5000     1255866
5000-6000      313223
3000-4000       53743
1000-2000       19417
2000-3000       14010
6000-8000       13360
0-1000           9353
8000-10000         18
Name: count, dtype: int64
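
The bracket labels in the output above suggest fixed trophy edges at 0, 1000, 2000, 3000, 4000, 5000, 6000, 8000, and 10000. A minimal sketch of that binning with `pd.cut` follows. Which trophy column the helper actually bins and how it treats values exactly on an edge are not shown, so the left-closed intervals and the `winner_trophies` name here are assumptions.

```python
import pandas as pd

# Edges inferred from the labels printed above.
BRACKET_EDGES = [0, 1000, 2000, 3000, 4000, 5000, 6000, 8000, 10000]
BRACKET_LABELS = [
    f"{lo}-{hi}" for lo, hi in zip(BRACKET_EDGES[:-1], BRACKET_EDGES[1:])
]


def to_trophy_bracket(trophies: pd.Series) -> pd.Series:
    """Bin trophy counts into ordered, left-closed brackets such as [4000, 5000)."""
    return pd.cut(trophies, bins=BRACKET_EDGES, labels=BRACKET_LABELS, right=False)


# 'winner_trophies' is a hypothetical column name.
toy_trophies = pd.Series([950, 1000, 4321, 5999, 6000, 8200], name="winner_trophies")
toy_brackets = to_trophy_bracket(toy_trophies)
print(toy_brackets.value_counts().sort_index())
```

Because `pd.cut` returns an ordered categorical, `sort_index()` lists the brackets from lowest to highest rather than by count. Values outside the edges, including exactly 10000 under this convention, become `NaN`.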


## 5. Create Tower Damage Features

In [14]:
# ⚠️ NOTE: Crown-related features cause DATA LEAKAGE for prediction tasks!
# Crown counts are the OUTCOME of battles, not input features.
# Skipping create_tower_damage_features() to prevent leakage in modeling.

# If needed for descriptive analysis only (not modeling), create them separately:
# sample_features = create_tower_damage_features(sample_features)

print("⚠️  Skipping crown features (DATA LEAKAGE PREVENTION)")
print("   Crown counts reveal battle outcomes - cannot be used for prediction!")

⚠️  Skipping crown features (DATA LEAKAGE PREVENTION)
   Crown counts reveal battle outcomes - cannot be used for prediction!
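
Skipping the helper does not remove crown columns that may already be present in the raw data. One possible safeguard, sketched below, is to drop any column whose name contains an outcome-related token before the matrix is saved. The token list and the toy column names (`winner_crowns`, `loser_crowns`) are assumptions about the schema.

```python
import pandas as pd

# Substrings assumed to mark outcome-derived columns.
LEAKY_TOKENS = ("crown",)


def drop_leaky_columns(df: pd.DataFrame, tokens=LEAKY_TOKENS):
    """Return (df without leaky columns, list of the column names that were dropped)."""
    leaky = [col for col in df.columns if any(tok in col.lower() for tok in tokens)]
    return df.drop(columns=leaky), leaky


toy_matrix = pd.DataFrame({
    "trophy_diff": [20, -50],
    "winner_crowns": [3, 1],   # hypothetical outcome column
    "loser_crowns": [0, 0],    # hypothetical outcome column
    "winner_beatdown": [0, 1],
})
toy_clean, toy_dropped = drop_leaky_columns(toy_matrix)
print(f"Dropped: {toy_dropped}")
print(f"Kept: {list(toy_clean.columns)}")
```

Returning the dropped names makes it easy to log what was removed and to confirm that no legitimate feature matched a token by accident.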


## 6. Save Feature Matrix

In [15]:
# Save engineered features for modeling
save_to_parquet(sample_features, 'artifacts/model_features.parquet')

print(f"\n✓ Feature matrix saved with {len(sample_features.columns)} columns")

✓ Saved 1,678,990 rows to artifacts\model_features.parquet (230.2 MB)

✓ Feature matrix saved with 87 columns


## 7. Feature Summary

In [16]:
# List all engineered features
engineered_cols = [col for col in sample_features.columns 
                   if any(x in col for x in ['_diff', '_heavy', '_beatdown', '_cycle', 'bracket', 'close_game'])]

print(f"Engineered features ({len(engineered_cols)}):")
for col in sorted(engineered_cols):
    print(f"  - {col}")

Engineered features (13):
  - card_level_diff
  - elixir_diff
  - loser_beatdown
  - loser_building_heavy
  - loser_cycle
  - loser_spell_heavy
  - spell_diff
  - trophy_bracket
  - trophy_diff
  - winner_beatdown
  - winner_building_heavy
  - winner_cycle
  - winner_spell_heavy
