# Shrine Bowl Analytics Competition - Exploratory Data Analysis

This notebook explores the competition datasets to understand:
1. Data structure and quality
2. Player linkage across datasets
3. Drill types and tracking data characteristics
4. Target variable (NFL rookie outcomes) distribution

**Goal**: Identify the most promising features and analysis paths for the PTP Score model.


In [None]:
import polars as pl
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import sys

# Add the project root (parent of the notebook directory) to sys.path
# so that the `src` package is importable
sys.path.insert(0, str(Path.cwd().parent))

from src.data.pipeline import DataPipeline, load_all_data

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

# Initialize pipeline
pipeline = DataPipeline(data_dir='../Shrine Bowl Data')

print('Data Pipeline Initialized')
print(f'Practice files: {len(pipeline.list_practice_files())}')
print(f'Game files: {len(pipeline.list_game_files())}')


## 1. Dataset Overview


In [None]:
# Load all static datasets
data = load_all_data('../Shrine Bowl Data')

print('=== DATASET SHAPES ===')
for name, df in data.items():
    print(f'{name}: {df.shape}')


In [None]:
# Player data overview - check column coverage
players = data['players']
print('=== PLAYER DATA COLUMN COVERAGE ===')
for col in players.columns:
    non_null = players.select(pl.col(col).is_not_null().sum()).item()
    pct = 100 * non_null / len(players)
    if pct > 50:  # Only show well-populated columns
        print(f'{col}: {non_null}/{len(players)} ({pct:.0f}%)')


## 2. Target Variable Analysis (NFL Rookie Outcomes)


In [None]:
# Rookie stats - our prediction target
rookie = data['rookie_stats']

print('=== ROOKIE STATS SUMMARY ===')
print(f'Total players with rookie data: {len(rookie)}')
print(f"\nRookie seasons: {rookie.select('rookie_season').unique().sort('rookie_season').to_series().to_list()}")
print(f"\nPositions: {rookie.select('position').unique().to_series().to_list()}")


In [None]:
# Total snaps distribution - key outcome variable
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Raw distribution (drop nulls so the median and summary stats are not NaN)
snaps = rookie.select('total_snaps').to_series().drop_nulls().to_numpy()
axes[0].hist(snaps, bins=50, edgecolor='black', alpha=0.7, color='#2c5f2d')
axes[0].set_xlabel('Total Rookie Snaps')
axes[0].set_ylabel('Count')
axes[0].set_title('Distribution of NFL Rookie Year Snaps')
axes[0].axvline(np.median(snaps), color='red', linestyle='--', label=f'Median: {np.median(snaps):.0f}')
axes[0].legend()

# Log-transformed
log_snaps = np.log1p(snaps)
axes[1].hist(log_snaps, bins=30, edgecolor='black', alpha=0.7, color='#4a7c59')
axes[1].set_xlabel('Log(Total Snaps + 1)')
axes[1].set_ylabel('Count')
axes[1].set_title('Log-Transformed Snaps (Better for Modeling)')

plt.tight_layout()
plt.show()

print('Snaps Statistics:')
print(f'  Min: {np.min(snaps)}')
print(f'  Max: {np.max(snaps)}')
print(f'  Mean: {np.mean(snaps):.1f}')
print(f'  Median: {np.median(snaps):.1f}')
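One way to check that the log transform is actually better behaved is to compare skewness before and after it. The sketch below uses only the standard library and a small hypothetical list of snap counts (not competition data); `skewness` is an illustrative helper, not part of the project's `src` package.

```python
import math


def skewness(values):
    """Biased (population) skewness: third central moment / std**3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical snap counts, for illustration only.
example_snaps = [0, 5, 12, 30, 60, 110, 250, 480, 900]
example_log_snaps = [math.log1p(s) for s in example_snaps]

raw_skew = skewness(example_snaps)
log_skew = skewness(example_log_snaps)

print(f"raw skewness:   {raw_skew:.2f}")
print(f"log1p skewness: {log_skew:.2f}")
```

A skewness much closer to zero after `log1p` would support using the transformed target; running the same helper on the real `snaps` array is the actual check.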


## 3. Player Linkage Analysis


In [None]:
# How many players can we analyze?
print('=== PLAYER LINKAGE ===')

# Players in tracking
tracking_ids = pipeline.get_tracking_player_ids()
print(f'Players with tracking data: {len(tracking_ids)}')

# Players with outcomes
outcomes = data['player_outcomes']
outcome_ids = set(outcomes.select('player_id').to_series().to_list())
print(f'Players with rookie outcomes: {len(outcome_ids)}')

# Intersection
# set() guards against get_tracking_player_ids() returning a list rather than a set
analyzable_ids = set(tracking_ids) & outcome_ids
print(f'Players with BOTH (analyzable): {len(analyzable_ids)}')
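The intersection count alone does not show where players drop out of the linkage. A minimal sketch of a fuller report follows, assuming IDs may arrive with mixed types (for example `int` in one file and `str` in another); that mismatch is an assumption, not something observed in the data. `linkage_report` is a hypothetical helper, and the IDs are made up.

```python
def linkage_report(tracking_ids, outcome_ids):
    """Summarise overlap between two ID collections.

    IDs are normalised to stripped strings so that 103 and "103" match.
    This normalisation is an assumption about how the IDs might differ.
    """
    tracking = {str(i).strip() for i in tracking_ids}
    outcomes = {str(i).strip() for i in outcome_ids}
    both = tracking & outcomes
    return {
        "both": len(both),
        "tracking_only": len(tracking - outcomes),
        "outcomes_only": len(outcomes - tracking),
        "pct_tracking_linked": 100 * len(both) / len(tracking) if tracking else 0.0,
    }


# Hypothetical IDs with mixed types, for illustration only.
report = linkage_report([101, "102", 103, 104], ["103", 104, 105])
for key, value in report.items():
    print(f"{key}: {value}")
```

A large `tracking_only` count would suggest players who were tracked but have no recorded rookie season, while a large `outcomes_only` count points at outcomes for players outside the tracked sessions.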


In [None]:
# Get analyzable players with full details
analyzable = pipeline.get_analyzable_players()

print('=== ANALYZABLE PLAYERS ===')
print(f'Total: {len(analyzable)}')

# Position breakdown
print('\nBy Position:')
print(analyzable.group_by('position').len().sort('len', descending=True))

# Draft round breakdown
print('\nBy Draft Round:')
print(analyzable.group_by('draft_round').len().sort('draft_round'))


## 4. Drill Types and Tracking Data


In [None]:
# Session timestamps - understand drill structure
sessions = data['sessions']

print('=== DRILL TYPES ===')
drill_counts = sessions.group_by('drillType').len().sort('len', descending=True)
print(drill_counts.head(15))


In [None]:
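# Identify 1-on-1 drill sessions (key for our analysis).
# A regex is used instead of substring checks, which would also match
# names such as "11 on 11" (contains both '1' and 'on').
import re
ONE_ON_ONE = re.compile(r'(?:^|\D)1[\s_-]*(?:on|vs?\.?)[\s_-]*1(?!\d)', re.IGNORECASE)
drill_types = sessions.select('drillType').unique().drop_nulls().to_series().to_list()
one_on_one_drills = [d for d in drill_types if ONE_ON_ONE.search(str(d))]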

print('=== 1-ON-1 RELATED DRILLS ===')
for d in sorted(one_on_one_drills):
    count = sessions.filter(pl.col('drillType') == d).shape[0]
    print(f'  {d}: {count} sessions')


In [None]:
# Load sample tracking data to understand structure
# Filter to 1-on-1 drills for efficiency
one_on_one_lf = pipeline.load_one_on_one_drills(lazy=True)

# Get sample and schema
sample = one_on_one_lf.head(10000).collect()

print('=== 1-ON-1 TRACKING DATA SCHEMA ===')
print(sample.schema)

print('\n=== SAMPLE DATA ===')
print(sample.head(5))


In [None]:
# Check tracking data statistics
print('=== TRACKING DATA STATISTICS (1-on-1 drills sample) ===')
for col in ['s', 'a', 'x', 'y', 'z', 'dir', 'sa', 'dis']:
    if col in sample.columns:
        vals = sample.select(col).drop_nulls().to_series()
        if len(vals) > 0:
            print(f'{col}: min={vals.min():.2f}, max={vals.max():.2f}, mean={vals.mean():.2f}')


## 5. Key Findings Summary

### Data Availability
- **113 players** have both tracking data and NFL rookie outcomes
- Primary outcome variable: `total_snaps` (non-negative count, right-skewed; model it as `log1p(total_snaps)`)
- Good coverage of combine metrics (40-yard dash: 90%, 3-cone: 78%)

### Drill Types
- Multiple 1-on-1 drill sessions available for isolation analysis
- Key drills: "1 on 1", "1v1", "Best of 1 on 1", "Bigs 1 on 1"
- Tracking data includes: x, y, z position, speed (s), acceleration (a), direction (dir)

### Modeling Considerations
- Use `log1p(total_snaps)` as the regression target so that zero-snap seasons stay defined
- Consider binary classification: "significant contributor" (>200 snaps) vs "limited role"
- Position-specific analysis may improve signal
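
The two target definitions above can be sketched as follows. This is a minimal, dependency-free illustration: `build_targets` and `invert_log_target` are hypothetical names, the snap counts are made up, and the 200-snap cutoff is taken from the bullet above rather than tuned.

```python
import math

SNAP_THRESHOLD = 200  # "significant contributor" cutoff from the notes above


def build_targets(total_snaps, threshold=SNAP_THRESHOLD):
    """Return (regression targets, binary labels) for a list of snap counts."""
    log_targets = [math.log1p(s) for s in total_snaps]
    labels = [int(s > threshold) for s in total_snaps]  # strictly greater than
    return log_targets, labels


def invert_log_target(prediction):
    """Map a prediction on the log1p scale back to snaps."""
    return math.expm1(prediction)


# Hypothetical snap counts, for illustration only.
example_snaps = [0, 150, 200, 201, 850]
log_targets, labels = build_targets(example_snaps)

print(labels)
print([round(t, 3) for t in log_targets])
```

Note that a player with exactly 200 snaps falls in the "limited role" class under the strict inequality; whether the boundary should be inclusive is a modeling choice.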

### Next Steps
1. Extract kinematic features from 1-on-1 drills
2. Build physical profile index from combine data
3. Aggregate college production metrics
4. Train and validate PTP Score model
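
As a starting point for step 1, per-repetition kinematic features could be computed along these lines. This is a sketch under stated assumptions: each frame is a dict with the `x`, `y`, `s`, and `a` columns described above, frames are already ordered in time and belong to one player in one repetition, and `a` is treated as a magnitude. `kinematic_features` is a hypothetical helper and the frames are made up.

```python
import math


def kinematic_features(frames):
    """Summarise one player's frames from a single drill repetition.

    Assumes `frames` is a time-ordered list of dicts with keys
    'x', 'y', 's' (speed) and 'a' (acceleration).
    """
    speeds = [f["s"] for f in frames]
    accels = [f["a"] for f in frames]
    # Path length: sum of straight-line distances between consecutive positions.
    path_length = sum(
        math.hypot(b["x"] - a["x"], b["y"] - a["y"])
        for a, b in zip(frames, frames[1:])
    )
    return {
        "max_speed": max(speeds),
        "mean_speed": sum(speeds) / len(speeds),
        "max_accel": max(accels),
        "path_length": path_length,
    }


# Hypothetical frames, for illustration only.
example_frames = [
    {"x": 0.0, "y": 0.0, "s": 0.0, "a": 0.0},
    {"x": 3.0, "y": 4.0, "s": 5.0, "a": 2.5},
    {"x": 6.0, "y": 8.0, "s": 7.0, "a": 1.0},
]
features = kinematic_features(example_frames)
print(features)
```

In practice these summaries would be computed per player and per repetition, then aggregated (for example by median) before joining to the outcomes table.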
