# Notebook 09: Player Timeline Construction

**Phase 1: Reverse Engineering the Winning Approach**

## Objective

Transform battle-centric data into **player-centric timelines** with temporal features.

**Paradigm Shift**:
- **Before**: Analyzing 16.9M battles (what deck wins?)
- **After**: Analyzing N players over time (what behaviors predict retention?)

## What This Notebook Does

1. **Extract Player-Battle Pairs**: Each battle → 2 rows (winner + loser)
2. **Sort by Player & Time**: Create chronological player timelines
3. **Engineer Temporal Features**:
   - `next_battleTime` - When did they play again?
   - `return_gap_hours` - Time between battles
   - `fast_return_1hr` - Returned within 1 hour? (Boolean)
   - `loss_streak` - Consecutive losses
   - `win_streak` - Consecutive wins
4. **Filter Active Players**: Keep players with 10+ matches
5. **Aggregate to Player Level**: One row per player with summary stats

## Outputs

- `artifacts/phase_1_3_outputs/player_timeline.parquet` - Raw player timelines
- `artifacts/phase_1_3_outputs/player_timeline_features.parquet` - With temporal features
- `artifacts/phase_1_3_outputs/player_timeline_filtered.parquet` - Timelines restricted to players with 10+ matches
- `artifacts/phase_1_3_outputs/player_aggregated.parquet` - Player-level summaries

---

## Setup & Imports

In [None]:
import sys
import os
from pathlib import Path

# Add src to path
sys.path.insert(0, os.path.join(os.getcwd(), '..', 'src'))

import pandas as pd
import numpy as np
import duckdb
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import seaborn as sns

# Import our custom utilities
from temporal_features import (
    create_player_timeline_from_battles,
    engineer_temporal_features,
    aggregate_to_player_level
)

# Visualization setup
sns.set_style("whitegrid")
sns.set_context("notebook")
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ Imports successful")
print(f"Working directory: {os.getcwd()}")

## Step 1: Load Battle Data

We'll use DuckDB to scan the 9.2GB CSV and pull only the columns we need into pandas, so the full file never has to be loaded into memory.

**Strategy**: Start with a sample (10%) to validate the pipeline, then scale to full dataset.
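
One caveat when validating on a battle-level sample: dropping battles thins every player's timeline, so the gaps between the battles that remain get longer. The sketch below illustrates this with a single hypothetical player. It keeps every 10th battle as a deterministic stand-in for a 10% random sample. The helper `median_gap_hours` is illustrative and not part of `temporal_features`.

```python
import pandas as pd

# One hypothetical player who battles every 30 minutes, 100 times in a row.
full = pd.Series(pd.date_range("2021-01-01", periods=100, freq="30min"))


def median_gap_hours(battle_times: pd.Series) -> float:
    """Median time between consecutive battles, in hours."""
    gaps = battle_times.sort_values().diff().dt.total_seconds() / 3600
    return float(gaps.median())


# Keep every 10th battle: a deterministic stand-in for a 10% random sample.
thinned = full.iloc[::10]

print(f"Median gap, all battles: {median_gap_hours(full):.1f} h")
print(f"Median gap, 10% kept:    {median_gap_hours(thinned):.1f} h")
```

Return gaps and streaks computed in sample mode are therefore best treated as a pipeline check rather than as final behavioral estimates.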

In [None]:
# Configuration
USE_SAMPLE = True  # Set to False for full dataset
SAMPLE_RATE = 0.10  # 10% sample

# Connect to DuckDB
con = duckdb.connect()

print("Loading battle data from battles.csv...")
print(f"Sample mode: {USE_SAMPLE}" + (f" ({SAMPLE_RATE:.0%} of battles)" if USE_SAMPLE else ""))

# Create view (OR REPLACE so this cell can be re-run on the same connection)
con.execute("""
    CREATE OR REPLACE VIEW battles AS
    SELECT * FROM read_csv_auto('../battles.csv',
        SAMPLE_SIZE=-1,
        IGNORE_ERRORS=true
    )
""")

# Load subset of columns we need
query = """
SELECT
    battleTime,
    "winner.tag",
    "winner.startingTrophies",
    "winner.trophyChange",
    "winner.crowns",
    "loser.tag",
    "loser.startingTrophies",
    "loser.trophyChange",
    "loser.crowns",
    "gameMode.id",
    "arena.id"
FROM battles
"""

if USE_SAMPLE:
    query += f" USING SAMPLE {SAMPLE_RATE*100:.0f}%"

print("\nExecuting query (this may take 1-3 minutes)...")
battles_df = con.sql(query).df()

print(f"\n✅ Loaded {len(battles_df):,} battles")
print(f"Memory usage: {battles_df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
print(f"\nColumns: {list(battles_df.columns)}")
print(f"\nFirst few rows:")
battles_df.head()

## Step 2: Create Player Timeline

**The Paradigm Shift Happens Here**

Each battle becomes **TWO rows**:
1. Winner's perspective (outcome=1)
2. Loser's perspective (outcome=0)

This allows us to track individual player journeys over time.
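
As a rough illustration of the reshaping, here is a minimal sketch of what a function like `create_player_timeline_from_battles` might do. The actual implementation lives in `temporal_features` and is not shown in this notebook, so `battles_to_player_rows` is a hypothetical stand-in. The output column names (`player_tag`, `outcome`, `trophies_before`, `trophy_change`) are taken from the verification cell below, and the toy battles are made up.

```python
import pandas as pd


def battles_to_player_rows(battles: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: turn each battle into a winner row and a loser row."""

    def one_side(side: str, outcome: int) -> pd.DataFrame:
        # Select one participant's columns and rename them to player-centric names.
        return pd.DataFrame({
            "player_tag": battles[f"{side}.tag"],
            "battleTime": pd.to_datetime(battles["battleTime"]),
            "outcome": outcome,
            "trophies_before": battles[f"{side}.startingTrophies"],
            "trophy_change": battles[f"{side}.trophyChange"],
        })

    rows = pd.concat([one_side("winner", 1), one_side("loser", 0)], ignore_index=True)
    # Chronological order within each player is what makes this a timeline.
    return rows.sort_values(["player_tag", "battleTime"], kind="stable").reset_index(drop=True)


# Two made-up battles between the same two players.
battles = pd.DataFrame({
    "battleTime": ["2021-01-01 10:00:00", "2021-01-01 09:00:00"],
    "winner.tag": ["#A", "#B"],
    "winner.startingTrophies": [4972, 4900],
    "winner.trophyChange": [30, 28],
    "loser.tag": ["#B", "#A"],
    "loser.startingTrophies": [4928, 5000],
    "loser.trophyChange": [-30, -28],
})

timeline = battles_to_player_rows(battles)
print(timeline)
```

Each of the two battles contributes one row to `#A` and one to `#B`, so the output has four rows ordered by player and then by time.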

In [None]:
print("Creating player timeline...")
print(f"Input: {len(battles_df):,} battles")
print(f"Expected output: ~{len(battles_df) * 2:,} player-battle pairs\n")

player_timeline = create_player_timeline_from_battles(battles_df)

print(f"✅ Player timeline created!")
print(f"Total rows: {len(player_timeline):,}")
print(f"Unique players: {player_timeline['player_tag'].nunique():,}")
print(f"Date range: {player_timeline['battleTime'].min()} to {player_timeline['battleTime'].max()}")
print(f"\nSample player timeline:")
player_timeline.head(10)

In [None]:
# Verify timeline is sorted correctly
print("Verification: Check one player's timeline is chronological")
sample_player = player_timeline['player_tag'].iloc[100]  # Arbitrary (not random): the player who owns row 100

full_sample_timeline = player_timeline[player_timeline['player_tag'] == sample_player]
sample_timeline = full_sample_timeline.head(10)

print(f"\nPlayer: {sample_player}")
print(f"Battles: {len(full_sample_timeline)}")
print(f"\nFirst 10 battles (should be chronological):")
print(sample_timeline[['battleTime', 'outcome', 'trophies_before', 'trophy_change']].to_string())

# Check sorting across this player's whole timeline, not just the 10 rows printed
is_sorted = full_sample_timeline['battleTime'].is_monotonic_increasing
print(f"\n{'✅' if is_sorted else '❌'} Timeline is {'properly' if is_sorted else 'NOT'} sorted")

## Step 3: Engineer Temporal Features

**The Key Innovation**: Calculate how players behave *over time*

Features:
- When do they return?
- How quickly?
- After wins or losses?
- Streaks?
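
To make these questions concrete, the sketch below shows one way the five features could be computed with pandas. It is not the code behind `engineer_temporal_features`, which is not shown here. `add_temporal_features` is a hypothetical name, and two choices are assumptions: a streak counts the current battle, and a player's last observed battle (no known next battle) is treated as not a fast return.

```python
import pandas as pd


def add_temporal_features(timeline: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of per-player return-gap and streak features."""
    df = timeline.sort_values(["player_tag", "battleTime"]).reset_index(drop=True)

    # Next battle of the same player; NaT on each player's last observed battle.
    df["next_battleTime"] = df.groupby("player_tag")["battleTime"].shift(-1)
    df["return_gap_hours"] = (
        (df["next_battleTime"] - df["battleTime"]).dt.total_seconds() / 3600
    )
    # Assumption: a missing gap compares as False, i.e. "no fast return".
    df["fast_return_1hr"] = df["return_gap_hours"] < 1

    # A new run starts when the outcome differs from the player's previous battle
    # (the first battle of each player compares against NaN, so it always starts a run).
    new_run = df["outcome"] != df.groupby("player_tag")["outcome"].shift()
    run_id = new_run.cumsum()
    run_position = df.groupby(run_id).cumcount() + 1

    # Assumption: streak length includes the current battle.
    df["win_streak"] = run_position.where(df["outcome"] == 1, 0)
    df["loss_streak"] = run_position.where(df["outcome"] == 0, 0)
    return df


# Made-up timeline: #A goes win, loss, loss, win; #B loses twice.
toy = pd.DataFrame({
    "player_tag": ["#A", "#A", "#A", "#A", "#B", "#B"],
    "battleTime": pd.to_datetime([
        "2021-01-01 10:00", "2021-01-01 10:20", "2021-01-01 10:50",
        "2021-01-01 13:00", "2021-01-01 09:00", "2021-01-01 12:00",
    ]),
    "outcome": [1, 0, 0, 1, 0, 0],
})

features = add_temporal_features(toy)
print(features[["player_tag", "outcome", "return_gap_hours",
                "fast_return_1hr", "win_streak", "loss_streak"]])
```

The `shift(-1)` within each player answers "when do they return?", and the run counter answers "on what streak?" without any explicit Python loop over players.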

In [None]:
print("Engineering temporal features...")
print("This calculates: next_battleTime, return_gap, streaks, etc.")
print("Expected time: 1-3 minutes\n")

player_timeline_features = engineer_temporal_features(player_timeline)

print(f"\n✅ Temporal features added!")
print(f"New columns: {[col for col in player_timeline_features.columns if col not in player_timeline.columns]}")
print(f"\nSample with features:")
player_timeline_features.head()

In [None]:
# Analyze the features
print("Feature Statistics:")
print("="*60)

print(f"\n1. Return Gap Distribution:")
print(player_timeline_features['return_gap_hours'].describe())

print(f"\n2. Fast Return Rate (< 1 hour):")
fast_return_rate = player_timeline_features['fast_return_1hr'].mean()
print(f"   {fast_return_rate:.1%} of battles followed by fast return")

print(f"\n3. Loss Streak Distribution:")
print(player_timeline_features['loss_streak'].value_counts().head(10))

print(f"\n4. Win Streak Distribution:")
print(player_timeline_features['win_streak'].value_counts().head(10))

In [None]:
# Visualize return gap distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Return gap (capped at 48 hours for visualization)
return_gaps_capped = player_timeline_features['return_gap_hours'].clip(upper=48)
axes[0].hist(return_gaps_capped.dropna(), bins=50, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Return Gap (hours, capped at 48)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Return Times')
axes[0].axvline(1, color='red', linestyle='--', label='1 hour threshold')
axes[0].legend()

# Loss streak distribution
loss_streak_counts = player_timeline_features['loss_streak'].value_counts().head(15).sort_index()
axes[1].bar(loss_streak_counts.index, loss_streak_counts.values, edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Loss Streak Length')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Loss Streaks')

plt.tight_layout()
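Path('../presentation/figures').mkdir(parents=True, exist_ok=True)  # savefig fails if the folder is missing
plt.savefig('../presentation/figures/phase1_temporal_features.png', dpi=300, bbox_inches='tight')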
plt.show()

print("\n✅ Visualization saved to presentation/figures/phase1_temporal_features.png")

## Step 4: Filter to Active Players

Like the winning team, we filter to players with **10+ matches** to ensure meaningful behavioral patterns.

In [None]:
# Count matches per player
matches_per_player = player_timeline_features.groupby('player_tag').size()

print("Match Count Distribution:")
print(matches_per_player.describe())

print(f"\nPlayers by match count:")
print(f"  Total players: {len(matches_per_player):,}")
print(f"  1-9 matches: {(matches_per_player < 10).sum():,} ({(matches_per_player < 10).sum()/len(matches_per_player):.1%})")
print(f"  10-49 matches: {((matches_per_player >= 10) & (matches_per_player < 50)).sum():,}")
print(f"  50-99 matches: {((matches_per_player >= 50) & (matches_per_player < 100)).sum():,}")
print(f"  100+ matches: {(matches_per_player >= 100).sum():,}")

# Filter
MIN_MATCHES = 10
active_players = matches_per_player[matches_per_player >= MIN_MATCHES].index

player_timeline_filtered = player_timeline_features[
    player_timeline_features['player_tag'].isin(active_players)
]

print(f"\n✅ Filtered to players with {MIN_MATCHES}+ matches")
print(f"Active players: {len(active_players):,}")
print(f"Battles analyzed: {len(player_timeline_filtered):,}")
print(f"Data retention: {len(player_timeline_filtered)/len(player_timeline_features):.1%}")

## Step 5: Aggregate to Player Level

**Final transformation**: Timeline → One row per player with summary stats
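
The sketch below shows what such an aggregation might look like using pandas named aggregation. It is not the implementation of `aggregate_to_player_level`. `aggregate_players` is a hypothetical name, the column names mirror those printed in the statistics cell further down, and the definition of `days_active` (span between first and last battle) is an assumption. `trophy_momentum` is left out because its definition is not given in this notebook.

```python
import pandas as pd


def aggregate_players(features: pd.DataFrame, min_matches: int = 10) -> pd.DataFrame:
    """Hypothetical sketch: collapse a feature timeline to one row per player."""
    agg = features.groupby("player_tag").agg(
        match_count=("outcome", "size"),
        win_rate=("outcome", "mean"),
        first_battle=("battleTime", "min"),
        last_battle=("battleTime", "max"),
        avg_return_gap_hours=("return_gap_hours", "mean"),      # NaN gaps are skipped
        median_return_gap_hours=("return_gap_hours", "median"),
        fast_return_rate=("fast_return_1hr", "mean"),
        max_loss_streak=("loss_streak", "max"),
    )
    # Assumption: days_active is the time span between first and last battle.
    span = agg["last_battle"] - agg["first_battle"]
    agg["days_active"] = span.dt.total_seconds() / 86400
    return agg[agg["match_count"] >= min_matches].reset_index()


# Made-up feature rows: #A has three battles, #B only one.
toy_features = pd.DataFrame({
    "player_tag": ["#A", "#A", "#A", "#B"],
    "battleTime": pd.to_datetime([
        "2021-01-01 00:00", "2021-01-01 00:30", "2021-01-03 00:00", "2021-01-02 08:00",
    ]),
    "outcome": [1, 0, 0, 1],
    "return_gap_hours": [0.5, 47.5, None, None],
    "fast_return_1hr": [True, False, False, False],
    "loss_streak": [0, 1, 2, 0],
})

players = aggregate_players(toy_features, min_matches=2)
print(players.T)
```

With `min_matches=2`, only `#A` survives, which mirrors how the 10-match threshold removes players with too little history to summarize.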

In [None]:
print("Aggregating to player level...")

player_aggregated = aggregate_to_player_level(
    player_timeline_filtered,
    min_matches=MIN_MATCHES
)

print(f"\n✅ Player-level dataset created!")
print(f"Players: {len(player_aggregated):,}")
print(f"Features: {len(player_aggregated.columns)}")
print(f"\nColumns: {list(player_aggregated.columns)}")
print(f"\nSample:")
player_aggregated.head()

In [None]:
# Player-level statistics
print("Player-Level Summary Statistics:")
print("="*60)

print(f"\n1. Engagement Metrics:")
print(f"   Avg matches: {player_aggregated['match_count'].mean():.1f}")
print(f"   Median matches: {player_aggregated['match_count'].median():.0f}")
print(f"   Avg days active: {player_aggregated['days_active'].mean():.1f}")

print(f"\n2. Performance Metrics:")
print(f"   Avg win rate: {player_aggregated['win_rate'].mean():.1%}")
print(f"   Median trophy momentum: {player_aggregated['trophy_momentum'].median():.0f}")

print(f"\n3. Behavioral Metrics:")
print(f"   Avg return gap: {player_aggregated['avg_return_gap_hours'].mean():.1f} hours")
print(f"   Median return gap: {player_aggregated['median_return_gap_hours'].median():.1f} hours")
print(f"   Fast return rate: {player_aggregated['fast_return_rate'].mean():.1%}")
print(f"   Avg max loss streak: {player_aggregated['max_loss_streak'].mean():.1f}")

In [None]:
# Visualize player-level distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Match count
axes[0, 0].hist(player_aggregated['match_count'], bins=50, edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Match Count')
axes[0, 0].set_ylabel('Players')
axes[0, 0].set_title('Distribution of Match Counts')
axes[0, 0].axvline(player_aggregated['match_count'].median(), color='red', linestyle='--', 
                   label=f'Median: {player_aggregated["match_count"].median():.0f}')
axes[0, 0].legend()

# Win rate
axes[0, 1].hist(player_aggregated['win_rate'], bins=50, edgecolor='black', alpha=0.7)
axes[0, 1].set_xlabel('Win Rate')
axes[0, 1].set_ylabel('Players')
axes[0, 1].set_title('Distribution of Win Rates')
axes[0, 1].axvline(0.5, color='red', linestyle='--', label='50%')
axes[0, 1].legend()

# Return gap (linear scale, capped at 100 hours for visualization)
return_gaps_plot = player_aggregated['avg_return_gap_hours'].clip(upper=100)
axes[1, 0].hist(return_gaps_plot, bins=50, edgecolor='black', alpha=0.7)
axes[1, 0].set_xlabel('Avg Return Gap (hours, capped at 100)')
axes[1, 0].set_ylabel('Players')
axes[1, 0].set_title('Distribution of Return Times')

# Fast return rate
axes[1, 1].hist(player_aggregated['fast_return_rate'], bins=50, edgecolor='black', alpha=0.7)
axes[1, 1].set_xlabel('Fast Return Rate (< 1 hour)')
axes[1, 1].set_ylabel('Players')
axes[1, 1].set_title('Distribution of Fast Return Behavior')

plt.tight_layout()
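Path('../presentation/figures').mkdir(parents=True, exist_ok=True)  # savefig fails if the folder is missing
plt.savefig('../presentation/figures/phase1_player_distributions.png', dpi=300, bbox_inches='tight')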
plt.show()

print("\n✅ Visualization saved to presentation/figures/phase1_player_distributions.png")

## Step 6: Save Outputs

Save all intermediate and final datasets for use in Phase 2 and 3.

In [None]:
# Create output directory if needed
output_dir = Path('../artifacts/phase_1_3_outputs')
output_dir.mkdir(parents=True, exist_ok=True)

print("Saving outputs...")

# 1. Raw player timeline
player_timeline_path = output_dir / 'player_timeline.parquet'
player_timeline.to_parquet(player_timeline_path)
print(f"✅ Saved: {player_timeline_path} ({len(player_timeline):,} rows)")

# 2. Player timeline with features
player_timeline_features_path = output_dir / 'player_timeline_features.parquet'
player_timeline_features.to_parquet(player_timeline_features_path)
print(f"✅ Saved: {player_timeline_features_path} ({len(player_timeline_features):,} rows)")

# 3. Filtered timeline (10+ matches)
player_timeline_filtered_path = output_dir / 'player_timeline_filtered.parquet'
player_timeline_filtered.to_parquet(player_timeline_filtered_path)
print(f"✅ Saved: {player_timeline_filtered_path} ({len(player_timeline_filtered):,} rows)")

# 4. Player aggregated
player_aggregated_path = output_dir / 'player_aggregated.parquet'
player_aggregated.to_parquet(player_aggregated_path)
print(f"✅ Saved: {player_aggregated_path} ({len(player_aggregated):,} rows)")

print("\n" + "="*60)
print("✅ PHASE 1 COMPLETE!")
print("="*60)
print(f"\nOutputs saved to: {output_dir}")
print(f"\nNext: Run notebook 10-behavioral-tilt-analysis.ipynb")

## Summary

**What We Accomplished**:
1. ✅ Transformed battles → Player timelines (a 10% sample of the 16.9M battles unless `USE_SAMPLE = False`)
2. ✅ Engineered 7 temporal features (return gaps, streaks, etc.)
3. ✅ Filtered to active players (10+ matches)
4. ✅ Aggregated to player level (one row per player)
5. ✅ Saved 4 datasets for Phase 2 & 3

**Key Insights** (read the values from the cell outputs above; a Markdown cell does not evaluate these expressions):
- Unique players: `player_timeline['player_tag'].nunique()`
- Active players (10+): `len(active_players)`
- Avg return gap (hours): `player_aggregated['avg_return_gap_hours'].mean()`
- Fast return rate: `player_aggregated['fast_return_rate'].mean()`

**Paradigm Shift Achieved**: ✅
- Before: Battle-centric ("what deck wins?")
- After: Player-centric ("what behaviors predict retention?")

---

**Next Phase**: Behavioral Tilt Analysis (Notebook 10)