# 06 - Modeling: Battle Outcome Prediction

**Purpose**: Build and compare predictive models (this notebook counts toward the technical-rigor score).

**Goal**: Predict battle outcomes based on deck composition alone.

**Benchmark**: Previous research achieved **56.94% accuracy** - aim to beat this!

**Models to Try**:
1. Logistic Regression (baseline)
2. Random Forest (feature importance insights)
3. XGBoost (likely best performance)

**Key Metrics**:
- Accuracy
- Precision/Recall
- ROC-AUC
- Feature importance (for insights!)

In [None]:
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, confusion_matrix
import xgboost as xgb

PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.insert(0, os.path.join(PROJECT_ROOT, 'src'))

from visualization import setup_presentation_style
setup_presentation_style()

## 1. Load Feature Matrix

In [None]:
# Load engineered features from notebook 05
features = pd.read_parquet(os.path.join(PROJECT_ROOT, 'artifacts/model_features.parquet'))

print(f"Loaded {len(features):,} battles with {features.shape[1]} columns")

## 2. Prepare Data for Modeling

In [None]:
# TODO: Define the target variable.
# As stored, every row is labeled from the winner's side, so a "winner won"
# label would be 1 on every row and the model would have nothing to learn.
# Restructure so each battle becomes 2 rows (one per player),
# with outcome = 1 if that player won, 0 if that player lost.

# Example structure:
# y = features['outcome']  # 1 or 0
# X = features[feature_columns]  # numeric features only

print("TODO: Restructure data and select features")
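The restructuring described in the TODO above could look roughly like the sketch below. It assumes the feature matrix has one row per battle with paired `winner_*` / `loser_*` columns and a `battle_id` column. Those names (and `avg_elixir`) are hypothetical, so adjust them to whatever `model_features.parquet` actually contains.

```python
import pandas as pd


def to_player_rows(battles: pd.DataFrame,
                   id_col: str = "battle_id",
                   win_prefix: str = "winner_",
                   lose_prefix: str = "loser_") -> pd.DataFrame:
    """Turn one-row-per-battle data into two rows per battle (one per player)."""
    win_cols = [c for c in battles.columns if c.startswith(win_prefix)]
    lose_cols = [c for c in battles.columns if c.startswith(lose_prefix)]

    def perspective(own_cols, own_prefix, opp_cols, opp_prefix, outcome):
        # Rename columns so they describe "this player" vs. "the opponent"
        own = battles[own_cols].rename(columns=lambda c: "own_" + c[len(own_prefix):])
        opp = battles[opp_cols].rename(columns=lambda c: "opp_" + c[len(opp_prefix):])
        out = pd.concat([battles[[id_col]], own, opp], axis=1)
        out["outcome"] = outcome
        return out

    winners = perspective(win_cols, win_prefix, lose_cols, lose_prefix, 1)
    losers = perspective(lose_cols, lose_prefix, win_cols, win_prefix, 0)
    return pd.concat([winners, losers], ignore_index=True)


# Toy data standing in for the real feature matrix (hypothetical column names)
toy = pd.DataFrame({
    "battle_id": [1, 2, 3],
    "winner_avg_elixir": [3.5, 4.1, 3.0],
    "loser_avg_elixir": [4.0, 3.8, 3.3],
})

player_rows = to_player_rows(toy)
feature_columns = [c for c in player_rows.columns if c.startswith(("own_", "opp_"))]
X = player_rows[feature_columns]
y = player_rows["outcome"]
```

Keeping `battle_id` on every row is deliberate: it lets later steps keep both rows of a battle in the same train or test split.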

## 3. Train/Test Split

In [None]:
# TODO: Split data
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42, stratify=y
# )

print("TODO: Create train/test split")
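One caveat with the row-level split above: if each battle is restructured into two mirrored rows, as the TODO in section 2 proposes, a random split can put one row in train and its mirror in test, which leaks the outcome. A minimal sketch of a battle-level split follows, using scikit-learn's `GroupShuffleSplit`. The `battle_id` grouping and the toy feature name are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in: 50 battles, 2 mirrored rows each (battle_id is hypothetical)
rng = np.random.default_rng(42)
n_battles = 50
battle_id = np.repeat(np.arange(n_battles), 2)
X = pd.DataFrame({"own_avg_elixir": rng.normal(3.8, 0.4, size=2 * n_battles)})
y = pd.Series(np.tile([1, 0], n_battles), name="outcome")

# Hold out 20% of battles (not 20% of rows), so both rows of a battle stay together
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=battle_id))

X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
```

Because every battle contributes exactly one win row and one loss row, both splits stay balanced at 50/50 without `stratify=y`.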

## 4. Model 1: Logistic Regression (Baseline)

In [None]:
# TODO: Train logistic regression
# lr_model = LogisticRegression(max_iter=1000, random_state=42)
# lr_model.fit(X_train, y_train)
# lr_pred = lr_model.predict(X_test)
# lr_acc = accuracy_score(y_test, lr_pred)
# print(f"Logistic Regression Accuracy: {lr_acc:.4f}")
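As a runnable illustration of the baseline, the sketch below fits the same `LogisticRegression` settings on synthetic data from `make_classification`, not on real battle data. Two additions go beyond the commented template and are suggestions only: a `StandardScaler` step, which usually helps the solver converge when features sit on different scales, and ROC-AUC computed from predicted probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the deck features (not real battle data)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale, then fit; the pipeline applies the same scaling at predict time
lr_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
lr_model.fit(X_train, y_train)

lr_pred = lr_model.predict(X_test)
lr_proba = lr_model.predict_proba(X_test)[:, 1]  # probability of class 1
lr_acc = accuracy_score(y_test, lr_pred)
lr_auc = roc_auc_score(y_test, lr_proba)
print(f"Logistic Regression Accuracy: {lr_acc:.4f}, ROC-AUC: {lr_auc:.4f}")
```

Note that ROC-AUC needs scores or probabilities rather than hard 0/1 predictions, which is why `predict_proba` is used alongside `predict`.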

## 5. Model 2: Random Forest

In [None]:
# TODO: Train random forest
# rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
# rf_model.fit(X_train, y_train)
# rf_pred = rf_model.predict(X_test)
# rf_acc = accuracy_score(y_test, rf_pred)
# print(f"Random Forest Accuracy: {rf_acc:.4f}")
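A single train/test split can be noisy when the expected margin over the benchmark is only a few points. The first cell imports `cross_val_score`, so one option is to report cross-validated accuracy for the forest. The sketch below does that on synthetic data. The 5-fold choice is an assumption, not something this notebook prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the deck features (not real battle data)
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           random_state=42)

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation: one accuracy value per held-out fold
cv_scores = cross_val_score(rf_model, X, y, cv=5, scoring="accuracy")
print(f"Random Forest CV accuracy: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")
```

If each battle appears as two mirrored rows, passing `cv=GroupKFold(n_splits=5)` together with `groups=` (the battle identifiers) to `cross_val_score` would keep both rows in the same fold.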

## 6. Model 3: XGBoost

In [None]:
# TODO: Train XGBoost
# xgb_model = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
# xgb_model.fit(X_train, y_train)
# xgb_pred = xgb_model.predict(X_test)
# xgb_acc = accuracy_score(y_test, xgb_pred)
# print(f"XGBoost Accuracy: {xgb_acc:.4f}")

## 7. Feature Importance Analysis

**THIS IS KEY FOR PRESENTATION INSIGHTS!**

In [None]:
# TODO: Extract feature importances from best model
# Plot top 15 most important features
# These tell the story of what matters most for winning!
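One way to extract and rank importances is sketched below. It uses a random forest on synthetic data with placeholder feature names (`feature_0`, `feature_1`, ...). The helper `top_importances` is hypothetical, and it should work with any fitted model that exposes `feature_importances_`, which includes `RandomForestClassifier` and `xgb.XGBClassifier`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def top_importances(model, feature_names, k=15):
    """Return the k largest feature importances as a descending Series."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False).head(k)


# Synthetic stand-in with placeholder feature names (not real battle data)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
top15 = top_importances(model, feature_names, k=15)
print(top15)
```

Calling `top15.sort_values().plot.barh()` draws the ranking as a horizontal bar chart with the most important feature on top. Impurity-based importances can overstate noisy or high-cardinality features, so `sklearn.inspection.permutation_importance` on the test set is a common cross-check before presenting conclusions.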

## 8. Model Evaluation Summary

In [None]:
# TODO: Create summary table
# results = pd.DataFrame({
#     'Model': ['Logistic Regression', 'Random Forest', 'XGBoost'],
#     'Accuracy': [lr_acc, rf_acc, xgb_acc],
#     'ROC-AUC': [...]
# })

print("TODO: Summarize model performance")
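The summary table can also be built from a dictionary of fitted models, so accuracy and ROC-AUC are computed the same way for each. The sketch below is illustrative only: `summarize` and the `Beats benchmark` column are hypothetical additions, the data is synthetic, and XGBoost is left out to keep the example dependency-light.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

BENCHMARK_ACC = 0.5694  # accuracy reported by the previous research


def summarize(models, X_test, y_test):
    """Build one row per fitted model with accuracy and ROC-AUC."""
    rows = []
    for name, model in models.items():
        proba = model.predict_proba(X_test)[:, 1]  # probability of class 1
        acc = accuracy_score(y_test, model.predict(X_test))
        rows.append({
            "Model": name,
            "Accuracy": acc,
            "ROC-AUC": roc_auc_score(y_test, proba),
            "Beats benchmark": bool(acc > BENCHMARK_ACC),
        })
    return pd.DataFrame(rows).sort_values("Accuracy", ascending=False,
                                          ignore_index=True)


# Synthetic stand-in for the deck features (not real battle data)
X, y = make_classification(n_samples=800, n_features=10, n_informative=4,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
for model in models.values():
    model.fit(X_train, y_train)

results = summarize(models, X_test, y_test)
print(results)
```

A fitted `xgb.XGBClassifier` can be added to the `models` dictionary unchanged, since it also provides `predict` and `predict_proba`.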

## Insights for Presentation

**Key Points**:
1. Achieved X% accuracy (compare to 56.94% benchmark)
2. Top 3 most important features are: [list]
3. This means: [actionable insight from feature importance]