# Feature Engineering
In this section, we transform raw game stats into features that help predict fantasy scores. These include:

  - Rolling averages - Player's last 5-10 game performance
  - Matchup data - How players perform vs specific teams/defenses
  - Schedule factors - Rest days, back-to-back games, home/away
  - Recent form - Is the player trending up or down?
  - Opponent strength - Defensive rankings and pace of play

  The goal is to give our model the same context fantasy experts use - for example, whether LeBron tends to score less on
  back-to-backs or Steph shoots better at home. These features capture patterns beyond basic box scores.

**Import libraries**

In [24]:
import pandas as pd
import pickle

**Load dataset**

In [25]:
# Load the cleaned dataset with current players
with open('../data/processed/player_stats_current.pkl', 'rb') as f:
    df = pickle.load(f)

print("✅ Dataset loaded successfully!")
print(f"📊 Shape: {df.shape}")
print(f"📅 Date range: {df['gameDate'].min()[:10]} to {df['gameDate'].max()[:10]}")
# Note: nunique().sum() adds the number of distinct first names to the number of
# distinct last names; it is not a count of distinct (firstName, lastName) pairs.
print(f"👥 Players: {df[['firstName', 'lastName']].nunique().sum()}")

✅ Dataset loaded successfully!
📊 Shape: (239838, 36)
📅 Date range: 2003-10-07 to 2025-11-09
👥 Players: 1238


## Feature 1 - Rolling Average
**What are Rolling Averages?**
  Rolling averages calculate the average performance over the last N games for each player. Instead of using
  season-long averages, we look at recent form to better predict upcoming performance.

  **Example:**
  If LeBron's last 5 games were: 45, 38, 52, 41, 48 fantasy points,
  then his `rolling_avg_5` = (45 + 38 + 52 + 41 + 48) ÷ 5 = 44.8
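
As a quick check of the arithmetic, here is a minimal plain-Python sketch. The helper `mean_of_last` and the sixth score (30) are made up for illustration; the extra game is there only to show how the window slides forward by dropping the oldest game.

```python
# Fantasy points from the example above, oldest game first
scores = [45, 38, 52, 41, 48]


def mean_of_last(values, window):
    """Average of the most recent `window` values (hypothetical helper)."""
    recent = values[-window:]
    return sum(recent) / len(recent)


avg_before = mean_of_last(scores, 5)
print(avg_before)  # 44.8

# Hypothetical next game (illustrative value, not real data)
scores.append(30)

# The window is now 38, 52, 41, 48, 30: the 45-point game has dropped out
avg_after = mean_of_last(scores, 5)
print(avg_after)
```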

  **Why This Matters for ML:**
  - **Recent form** is often more predictive than season-long averages
  - **Captures trends** - is the player getting hot or cooling off?
  - **Injury impact** - shows whether a player is returning to form after an injury
  - **Matchup adjustments** - reflects how a player has performed against recently faced opponents
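
One possible way to turn the "getting hot or cooling off" idea into a number is to compare the short window with the longer one. The sketch below is an assumption, not part of the notebook: the column name `form_trend` is hypothetical and the values are made up.

```python
import pandas as pd

# Made-up rolling averages for two games (illustrative values only)
frame = pd.DataFrame({
    "rolling_avg_5": [44.8, 30.0],
    "rolling_avg_10": [40.0, 36.5],
})

# Hypothetical trend feature: positive means the last 5 games beat the
# 10-game baseline (heating up), negative means the player is cooling off
frame["form_trend"] = frame["rolling_avg_5"] - frame["rolling_avg_10"]

print(frame)
```

A difference is only one option; a ratio of the two windows would express the same idea on a relative scale.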

  **Technical Details:**
  - `shift(1)` ensures we don't use today's game to predict today's game (no data leakage)
  - `min_periods=1` handles a player's earliest games, where fewer than N prior games exist (the first game has none, so its value is `NaN`)
  - We calculate both 5-game (recent) and 10-game (longer trend) windows
  - Applied per player using `groupby()` to maintain player-specific rolling windows
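
To make the effect of `shift(1)` concrete, the following sketch compares a rolling mean with and without the shift on a tiny made-up series. A 3-game window is used only to keep the example short; the variable names `leaky` and `lagged` are illustrative.

```python
import pandas as pd

# Made-up fantasy scores for one player, oldest game first
scores = pd.Series([10.0, 20.0, 30.0, 40.0])

# Without shift: each window includes the current game (data leakage)
leaky = scores.rolling(3, min_periods=1).mean()

# With shift(1): each window only sees games played before the current one
lagged = scores.shift(1).rolling(3, min_periods=1).mean()

demo = pd.DataFrame({"score": scores, "leaky": leaky, "lagged": lagged})
print(demo)
```

In the `lagged` column the first row is `NaN` because no earlier game exists, and every later row is the mean of up to three previous scores.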

**Data pre-processing**

First, we sort the data by player and date (crucial for rolling averages).
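
The sketch below illustrates why the sort order matters, using a made-up three-game frame whose rows are deliberately out of date order. The helper `lagged_mean` and all values are assumptions for illustration.

```python
import pandas as pd

# Made-up games for one player, intentionally out of chronological order
toy = pd.DataFrame({
    "firstName": ["A", "A", "A"],
    "lastName": ["One", "One", "One"],
    "gameDate": pd.to_datetime(["2025-01-05", "2025-01-01", "2025-01-03"]),
    "espn_fantasy_score": [30.0, 10.0, 20.0],
})


def lagged_mean(frame):
    """Mean of up to 5 previous rows, in whatever order the rows are in."""
    return frame["espn_fantasy_score"].shift(1).rolling(5, min_periods=1).mean()


# Unsorted: the 2025-01-01 game "sees" the score from 2025-01-05 (a future game)
unsorted_feature = lagged_mean(toy)

# Sorted: each game only sees chronologically earlier games
sorted_toy = toy.sort_values(["firstName", "lastName", "gameDate"]).reset_index(drop=True)
sorted_feature = lagged_mean(sorted_toy)

# Sanity check: dates increase within each player after sorting
is_sorted = bool(
    sorted_toy.groupby(["firstName", "lastName"])["gameDate"]
    .apply(lambda s: s.is_monotonic_increasing)
    .all()
)
print(is_sorted)
```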


In [27]:
# Handle mixed datetime formats safely
print("📅 Converting gameDate to datetime...")
print(f"🔍 Current dtype: {df['gameDate'].dtype}")

# Check for mixed formats and convert safely
if df['gameDate'].dtype == 'object':
    df['gameDate'] = pd.to_datetime(df['gameDate'], format='mixed', utc=True).dt.tz_convert(None)
else:
    # Already datetime, just remove timezone if present
    if hasattr(df['gameDate'].dtype, 'tz') and df['gameDate'].dt.tz is not None:
        df['gameDate'] = df['gameDate'].dt.tz_convert(None)

print("✅ DateTime conversion completed")

# Sort data by player and date
df = df.sort_values(['firstName', 'lastName', 'gameDate']).reset_index(drop=True)

print("✅ Data sorted by player and date")
print(f"📅 Date range: {df['gameDate'].min().date()} to {df['gameDate'].max().date()}")

📅 Converting gameDate to datetime...
🔍 Current dtype: object
✅ DateTime conversion completed
✅ Data sorted by player and date
📅 Date range: 2003-10-07 to 2025-11-09


**Calculate the rolling average**

In [36]:
def calculate_rolling_features(group):
    """Calculate lagged rolling averages for one player's games (sorted by date)."""
    # Work on a copy so the slice passed in by groupby is not modified in place
    group = group.copy()

    # Fantasy score averages; shift(1) keeps the current game out of its own window
    group['rolling_avg_5'] = group['espn_fantasy_score'].shift(1).rolling(5, min_periods=1).mean()
    group['rolling_avg_10'] = group['espn_fantasy_score'].shift(1).rolling(10, min_periods=1).mean()

    # Key stats averages over the previous 5 games
    group['points_last5'] = group['points'].shift(1).rolling(5, min_periods=1).mean()
    group['rebounds_last5'] = group['reboundsTotal'].shift(1).rolling(5, min_periods=1).mean()
    group['assists_last5'] = group['assists'].shift(1).rolling(5, min_periods=1).mean()
    group['minutes_last5'] = group['numMinutes'].shift(1).rolling(5, min_periods=1).mean()

    return group

In [37]:
# Apply to each player
df = df.groupby(['firstName', 'lastName'], group_keys=False).apply(calculate_rolling_features)
print("✅ Rolling averages calculated!")
print(f"📊 New columns: {[col for col in df.columns if 'last' in col]}")

✅ Rolling averages calculated!
📊 New columns: ['lastName', 'fantasy_last5', 'fantasy_last10', 'points_last5', 'rebounds_last5', 'assists_last5', 'minutes_last5']


Show examples:

In [38]:
# Show example
print("\n📋 Example - Recent games with rolling averages:")
# Filter on both names so the sample is unambiguous
sample = df[(df['firstName'] == 'LeBron') & (df['lastName'] == 'James')].tail(3)
print(sample[['gameDate', 'espn_fantasy_score', 'rolling_avg_5', 'rolling_avg_10', 'points_last5']].round(1))


📋 Example - Recent games with rolling averages:
                  gameDate  espn_fantasy_score  rolling_avg_5  rolling_avg_10  \
155999 2025-10-05 20:30:00                 0.0           47.2            42.7   
156000 2025-10-14 22:00:00                 0.0           37.6            38.8   
156001 2025-10-15 22:30:00                 0.0           22.2            35.6   

        points_last5  
155999          21.6  
156000          17.4  
156001           9.8  
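
The per-player computation above can also be sketched without `groupby().apply()`, which in recent pandas releases warns when the applied function operates on the grouping columns. The version below is an alternative under stated assumptions, not the notebook's method: it uses `groupby()[column].transform()` on a made-up toy frame, and the `feature_specs` mapping is a hypothetical helper.

```python
import pandas as pd

# Toy stand-in for the real dataset; all values are made up for illustration
toy = pd.DataFrame({
    "firstName": ["A", "A", "A", "A", "B", "B", "B"],
    "lastName": ["One", "One", "One", "One", "Two", "Two", "Two"],
    "gameDate": pd.to_datetime([
        "2025-01-01", "2025-01-03", "2025-01-05", "2025-01-07",
        "2025-01-02", "2025-01-04", "2025-01-06",
    ]),
    "espn_fantasy_score": [10.0, 20.0, 30.0, 40.0, 5.0, 15.0, 25.0],
    "points": [4.0, 8.0, 12.0, 16.0, 2.0, 6.0, 10.0],
})
keys = ["firstName", "lastName"]
toy = toy.sort_values(keys + ["gameDate"]).reset_index(drop=True)

# Hypothetical mapping: new feature name -> (source column, window size)
feature_specs = {
    "rolling_avg_5": ("espn_fantasy_score", 5),
    "rolling_avg_10": ("espn_fantasy_score", 10),
    "points_last5": ("points", 5),
}

for feature, (source, window) in feature_specs.items():
    # transform returns a Series aligned to toy's index, one value per row
    toy[feature] = toy.groupby(keys)[source].transform(
        lambda s, w=window: s.shift(1).rolling(w, min_periods=1).mean()
    )

print(toy[["firstName", "espn_fantasy_score", "rolling_avg_5", "points_last5"]])
```

Because `transform` works on one column at a time, the grouping columns are never passed to the function, and each player's first row is still `NaN`.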
