# Feature Engineering
In Feature Engineering, we'll transform raw game stats into smart features that help predict fantasy scores. This includes:

  - Rolling averages - Player's last 5-10 game performance
  - Matchup data - How players perform vs specific teams/defenses
  - Schedule factors - Rest days, back-to-back games, home/away
  - Recent form - Is the player trending up or down?
  - Opponent strength - Defensive rankings and pace of play

  The goal is to give our model the same context fantasy experts use, such as whether LeBron scores less on
  back-to-backs or Steph shoots better at home. These features capture patterns beyond basic box scores.

**Import libraries**

In [1]:
import pandas as pd
import pickle

**Load dataset**

In [2]:
# Load the cleaned dataset with current players
with open('../data/processed/player_stats_current.pkl', 'rb') as f:
    df = pickle.load(f)

print("✅ Dataset loaded successfully!")
print(f"📊 Shape: {df.shape}")
print(f"📅 Date range: {df['gameDate'].min()[:10]} to {df['gameDate'].max()[:10]}")
# Note: this sums the distinct first names and the distinct last names, not distinct (first, last) pairs
print(f"👥 Players: {df[['firstName', 'lastName']].nunique().sum()}")

✅ Dataset loaded successfully!
📊 Shape: (239838, 36)
📅 Date range: 2003-10-07 to 2025-11-09
👥 Players: 1238
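
The `Players` figure printed above comes from `nunique().sum()`, which adds the number of distinct first names to the number of distinct last names. A minimal sketch on a hypothetical toy roster (the `toy` frame and its rows are made up for illustration) shows how that differs from counting distinct (first, last) pairs:

```python
import pandas as pd

# Toy roster: two rows for the same player, and two players sharing a first name
toy = pd.DataFrame({
    'firstName': ['LeBron', 'Anthony', 'Anthony', 'LeBron'],
    'lastName':  ['James', 'Edwards', 'Davis', 'James'],
})

# Sum of per-column unique counts: 2 first names + 3 last names
summed = int(toy[['firstName', 'lastName']].nunique().sum())

# Number of distinct (firstName, lastName) pairs
distinct_players = int(toy.groupby(['firstName', 'lastName']).ngroups)

print(summed, distinct_players)
# → 5 3
```

Counting pairs still merges two different players who share a full name, so a unique player ID column, if one is available, would be the safer key.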


## Feature 1 - Rolling Average
**What are Rolling Averages?**
  Rolling averages calculate the average performance over the last N games for each player. Instead of using
  season-long averages, we look at recent form to better predict upcoming performance.

  **Example:**
  If LeBron's last 5 games were: 45, 38, 52, 41, 48 fantasy points
  His `rolling_avg_fantasy_5` for the next game = (45 + 38 + 52 + 41 + 48) ÷ 5 = 44.8

  **Why This Matters for ML:**
  - **Recent form** is often more predictive of the next game than a season-long average
  - **Captures trends** - is the player getting hot or cooling off?
  - **Injury impact** - shows if player is returning to form after injury
  - **Matchup adjustments** - some players perform better against certain teams recently

  **Technical Details:**
  - `shift(1)` ensures we don't use today's game to predict today's game (no data leakage)
  - `min_periods=1` lets the average use however many earlier games exist when a player has fewer than N; the player's very first game has no history and stays `NaN`
  - We calculate both 5-game (recent) and 10-game (longer trend) windows
  - Applied per player using `groupby()` to maintain player-specific rolling windows
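
The `shift(1)` and `min_periods=1` behaviour described above can be checked on a toy series. This is a minimal sketch rather than the notebook's pipeline: the first five scores reuse the LeBron example, the sixth game (40) is hypothetical, and the names `scores` and `feature` are illustrative.

```python
import pandas as pd

# Five example scores from the text, plus one hypothetical sixth game (40)
scores = pd.Series([45, 38, 52, 41, 48, 40], dtype=float)

# shift(1) moves every score down one row, so each row sees only earlier games;
# rolling(5, min_periods=1) then averages up to five of those earlier games
feature = scores.shift(1).rolling(5, min_periods=1).mean()

print(feature.tolist())
# → [nan, 45.0, 41.5, 45.0, 44.0, 44.8]
```

The first row is `NaN` because no earlier game exists, and the 44.8 from the example only appears on the sixth row, the game played after those five.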

**Data pre-processing**

First, we need to sort the data by player and date. Rolling windows follow row order, so this step is crucial.


In [3]:
# Handle mixed datetime formats safely
print("📅 Converting gameDate to datetime...")
print(f"🔍 Current dtype: {df['gameDate'].dtype}")

# Check for mixed formats and convert safely
if df['gameDate'].dtype == 'object':
    df['gameDate'] = pd.to_datetime(df['gameDate'], format='mixed', utc=True).dt.tz_convert(None)
else:
    # Already datetime, just remove timezone if present
    if hasattr(df['gameDate'].dtype, 'tz') and df['gameDate'].dt.tz is not None:
        df['gameDate'] = df['gameDate'].dt.tz_convert(None)

print("✅ DateTime conversion completed")

# Sort data by player and date
df = df.sort_values(['firstName', 'lastName', 'gameDate']).reset_index(drop=True)

print("✅ Data sorted by player and date")
print(f"📅 Date range: {df['gameDate'].min().date()} to {df['gameDate'].max().date()}")

📅 Converting gameDate to datetime...
🔍 Current dtype: object
✅ DateTime conversion completed
✅ Data sorted by player and date
📅 Date range: 2003-10-07 to 2025-11-09


**Calculate the rolling average**

In [4]:
def calculate_rolling_features(group):
    """Calculate rolling averages for each player"""
    # Fantasy score averages
    group['rolling_avg_fantasy_5'] = group['espn_fantasy_score'].shift(1).rolling(5, min_periods=1).mean()
    group['rolling_avg_fantasy_10'] = group['espn_fantasy_score'].shift(1).rolling(10, min_periods=1).mean()
    
    # Key stats averages
    group['points_last5'] = group['points'].shift(1).rolling(5, min_periods=1).mean()
    group['rebounds_last5'] = group['reboundsTotal'].shift(1).rolling(5, min_periods=1).mean()
    group['assists_last5'] = group['assists'].shift(1).rolling(5, min_periods=1).mean()
    group['minutes_last5'] = group['numMinutes'].shift(1).rolling(5, min_periods=1).mean()
    
    return group

In [5]:
# Apply to each player
df = df.groupby(['firstName', 'lastName'], group_keys=False).apply(calculate_rolling_features)
print("✅ Rolling averages calculated!")
# Note: filtering on 'last' also matches 'lastName' and misses the 'rolling_avg_fantasy_*' columns
print(f"📊 New columns: {[col for col in df.columns if 'last' in col]}")

✅ Rolling averages calculated!
📊 New columns: ['lastName', 'points_last5', 'rebounds_last5', 'assists_last5', 'minutes_last5']


Show examples:

In [30]:
# Show example
print("\n📋 Example - Recent games with rolling averages:")
sample = df[df['firstName'] == 'LeBron'].head(20)
print(sample[['gameDate', 'espn_fantasy_score', 'rolling_avg_fantasy_5', 'rolling_avg_fantasy_10', 'points_last5']].round(1))


📋 Example - Recent games with rolling averages:
                  gameDate  espn_fantasy_score  rolling_avg_fantasy_5  \
153989 2003-10-07 19:30:00                27.0                    NaN   
153990 2003-10-08 19:00:00                16.0                   27.0   
153991 2003-10-29 22:30:00                63.0                   21.5   
153992 2003-10-30 22:30:00                36.0                   35.3   
153993 2003-11-01 22:00:00                22.0                   35.5   
153994 2003-11-05 20:00:00                43.0                   32.8   
153995 2003-11-07 19:30:00                18.0                   36.0   
153996 2003-11-08 19:30:00                41.0                   36.4   
153997 2003-11-10 19:00:00                39.0                   32.0   
153998 2003-11-12 19:30:00                36.0                   32.6   
153999 2003-11-14 19:30:00                 7.0                   35.4   
154000 2003-11-15 19:30:00                44.0                   28.2   

## Feature 2 - Matchup Data
**What is Matchup Data?**
How players perform against each of the 30 NBA teams. Some players consistently score higher/lower against
certain teams due to defensive schemes, pace of play, and style matchups.

**Example**

Suppose LeBron averages 52 fantasy points vs. the Warriors but 38 vs. the Celtics.

**Implementation**

Calculate the average fantasy score against each opponent team from the player's most recent earlier games
against that team (up to five), so that older meetings drop out of the average.

**Why This Matters**

Instead of treating all opponents equally, the model learns that matchups significantly impact
performance.
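
The idea can be sketched on a toy game log. This is an illustrative alternative under stated assumptions, not the notebook's implementation: it writes a single hypothetical column, `vs_opponent_avg`, holding the average for whichever opponent is faced in that row, instead of one `vs_<team>_avg` column per team. The toy data is made up; the column names mirror the notebook's.

```python
import pandas as pd

# Hypothetical toy log for one player
toy = pd.DataFrame({
    'firstName': ['A'] * 5,
    'lastName': ['Player'] * 5,
    'gameDate': pd.to_datetime(['2024-01-01', '2024-01-03', '2024-01-05',
                                '2024-01-07', '2024-01-09']),
    'opponentteamName': ['Heat', 'Nets', 'Heat', 'Heat', 'Nets'],
    'espn_fantasy_score': [30.0, 20.0, 40.0, 50.0, 10.0],
})

# Rolling windows follow row order, so sort chronologically first
toy = toy.sort_values(['firstName', 'lastName', 'gameDate'])

# Within each (player, opponent) pair: drop the current game with shift(1),
# then average up to the five most recent earlier meetings
keys = ['firstName', 'lastName', 'opponentteamName']
toy['vs_opponent_avg'] = (
    toy.groupby(keys)['espn_fantasy_score']
       .transform(lambda s: s.shift(1).rolling(5, min_periods=1).mean())
)

print(toy['vs_opponent_avg'].tolist())
# → [nan, nan, 30.0, 35.0, 20.0]
```

The first meeting with each opponent is `NaN` because there is no earlier game to average.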

**Reset matchup data**

In [20]:
# Reset the matchup columns first
matchup_cols = [col for col in df.columns if col.startswith('vs_')]
df = df.drop(columns=matchup_cols)

**Calculate Matchup Data**

In [21]:
# Better approach - calculate matchup averages from earlier games only
def calculate_matchup_features_fixed(group):
    """Calculate a player's average fantasy score vs each opponent team, using only earlier games."""
    # Sort chronologically; keep the original index so the result aligns with df
    group = group.sort_values('gameDate').copy()

    for opponent in group['opponentteamName'].unique():
        col_name = f'vs_{opponent}_avg'
        # Initialize the matchup column with NaN
        group[col_name] = float('nan')

        # Scores from this player's games against this opponent, in date order
        opponent_mask = group['opponentteamName'] == opponent
        scores_vs_opponent = group.loc[opponent_mask, 'espn_fantasy_score']

        # Mean of the last 5 previous games vs this opponent (or all if fewer than 5).
        # shift(1) excludes the current game; the first meeting stays NaN.
        group.loc[opponent_mask, col_name] = (
            scores_vs_opponent.shift(1).rolling(5, min_periods=1).mean()
        )

    return group

**Apply to dataframe**

In [22]:
# Apply to each player
print("🏀 Calculating matchup features...")
df = df.groupby(['firstName', 'lastName'], group_keys=False).apply(calculate_matchup_features_fixed)

# Fill remaining NaN values (rows with no earlier game vs that opponent) with a per-player average
matchup_cols = [col for col in df.columns if col.startswith('vs_')]
for col in matchup_cols:
    # Note: the mean is taken over the player's rows being filled, and it includes later games
    mask = df[col].isna()
    df.loc[mask, col] = df.loc[mask].groupby(['firstName', 'lastName'])['espn_fantasy_score'].transform('mean')

print(f"✅ Matchup features created: {len(matchup_cols)} opponent-specific averages")
print(f"📊 Sample columns: {matchup_cols[:5]}")

🏀 Calculating matchup features...
✅ Matchup features created: 36 opponent-specific averages
📊 Sample columns: ['vs_Heat_avg', 'vs_Cavaliers_avg', 'vs_Lakers_avg', 'vs_Clippers_avg', 'vs_Nets_avg']
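
As written, the fill step above appears to average each player's scores over every row being filled, regardless of date, so a filled value can include games played later. One possible leak-free fallback is the player's average over earlier games only. The sketch below assumes rows are already sorted by player and date; the toy data and the name `prior_avg` are hypothetical.

```python
import pandas as pd

# Hypothetical toy log for one player, already in date order
toy = pd.DataFrame({
    'firstName': ['A'] * 4,
    'lastName': ['Player'] * 4,
    'espn_fantasy_score': [10.0, 20.0, 30.0, 40.0],
    'vs_Heat_avg': [float('nan'), float('nan'), 10.0, float('nan')],
})

# Expanding mean of earlier games only: shift(1) drops the current game
prior_avg = (
    toy.groupby(['firstName', 'lastName'])['espn_fantasy_score']
       .transform(lambda s: s.shift(1).expanding().mean())
)

# Fill gaps with the prior-games average; existing values are kept
toy['vs_Heat_avg'] = toy['vs_Heat_avg'].fillna(prior_avg)

print(toy['vs_Heat_avg'].tolist())
# → [nan, 10.0, 10.0, 20.0]
```

The player's first game stays `NaN` under this scheme, since no earlier game exists to average.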


**Test**

In [28]:
print(df[(df['lastName'] == 'Edwards') & (df['firstName']=='Anthony') & (df['opponentteamName'] == 'Grizzlies')][['gameDate',
  'opponentteamName', 'vs_Grizzlies_avg', 'espn_fantasy_score']])

                 gameDate opponentteamName  vs_Grizzlies_avg  \
12746 2020-12-12 20:00:00        Grizzlies         38.983871   
12747 2020-12-14 20:00:00        Grizzlies          1.000000   
12759 2021-01-13 20:00:00        Grizzlies          8.500000   
12797 2021-04-02 20:00:00        Grizzlies          6.666667   
12814 2021-05-05 20:00:00        Grizzlies         17.250000   
12833 2021-11-08 20:00:00        Grizzlies         30.400000   
12840 2021-11-20 20:00:00        Grizzlies         38.200000   
12860 2022-01-13 20:00:00        Grizzlies         42.200000   
12878 2022-02-24 20:00:00        Grizzlies         49.800000   
12899 2022-04-16 15:30:00        Grizzlies         42.600000   
12900 2022-04-19 20:30:00        Grizzlies         38.600000   
12901 2022-04-21 19:30:00        Grizzlies         35.400000   
12902 2022-04-23 22:00:00        Grizzlies         34.200000   
12903 2022-04-26 19:30:00        Grizzlies         36.800000   
12904 2022-04-29 21:00:00        Grizzli