# Production Feature Engineering Pipeline

**Objective:** Transform raw FPL data into a high-dimensional feature space $X$ suitable for regression models (XGBoost/LightGBM).

**Methodology:**
1. **Temporal Features:** EWMA (Exponentially Weighted Moving Average) and rolling means over windows of [3, 6, 10] GWs.
2. **Stability Metrics:** Coefficient of Variation (CV) for minutes (detecting rotation risk).
3. **Contextual Features:** Opponent Defensive Strength & **Future Fixture Difficulty**.
4. **Interaction Features:** Value Efficiency & Home Advantage.
5. **Target Engineering:** Cumulative points over gameweeks $t+1$ to $t+3$.
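
The stability metric in item 2 can be illustrated on toy data. This is a minimal sketch, assuming a 5-game window and made-up minute counts for two hypothetical players; the column name `minutes_cv_5` mirrors the one built later in this notebook.

```python
import pandas as pd

# Hypothetical minutes: player 1 starts every game, player 2 is rotated
toy = pd.DataFrame({
    "element": [1] * 5 + [2] * 5,
    "minutes": [90, 90, 90, 90, 90, 90, 0, 60, 0, 90],
})

grouped = toy.groupby("element")["minutes"]
rolling_mean = grouped.transform(lambda s: s.rolling(5, min_periods=1).mean())
rolling_std = grouped.transform(lambda s: s.rolling(5, min_periods=1).std())

# CV = std / mean; the small constant guards against division by zero
toy["minutes_cv_5"] = rolling_std / (rolling_mean + 1e-6)

# Last available CV per player: near 0 for the regular starter, high for the rotated one
latest_cv = toy.groupby("element")["minutes_cv_5"].last()
print(latest_cv)
```

A CV near zero marks consistent minutes, while a value close to 1 means the spread in minutes is as large as the average itself.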

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

BASE_DIR = Path.cwd().parent if 'Notebooks' in str(Path.cwd()) else Path.cwd()
PROCESSED_DIR = BASE_DIR / "data" / "processed"

print(f"Base Directory: {BASE_DIR}")

## 1. Data Ingestion
Loading the unified preprocessed dataset. Ensure `src/preprocess.py` has been run recently to include `opponent_team_name`.

In [None]:
df = pd.read_csv(PROCESSED_DIR / "fpl_unified_preprocessed.csv")

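# Temporal sort is NON-NEGOTIABLE for time series: the rolling/shift features
# assume each player's rows are in chronological order.
# kickoff_time breaks ties when a player has two fixtures in one gameweek.
df['kickoff_time'] = pd.to_datetime(df['kickoff_time'])
df = df.sort_values(['season', 'GW', 'kickoff_time', 'element']).reset_index(drop=True)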

print(f"Loaded {len(df):,} rows.")
print(f"Columns: {list(df.columns[:10])}...")

# Quick check for the patch
if 'opponent_team_name' not in df.columns:
    raise ValueError("CRITICAL: 'opponent_team_name' missing. Run src/preprocess.py first!")

## 2. Temporal Features (The Past)

We implement two types of memory:
1. **Rolling Mean:** Unweighted average over the window. A good baseline.
2. **EWMA (Exponentially Weighted Moving Average):** Assigns higher weight to recent games:
   $$y_t = \alpha x_t + (1-\alpha) y_{t-1}$$
   where $\alpha = 2/(\text{span}+1)$ and $y_0 = x_0$.

**Windows:** `[3, 6, 10]` gameweeks.
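
To make the recursion concrete, the sketch below applies the formula by hand and compares it with pandas' `ewm(span=..., adjust=False)`. The points series is made-up toy data, and `ewma_recursive` is a helper written only for this illustration.

```python
import numpy as np
import pandas as pd

def ewma_recursive(values, span):
    """Apply y_t = alpha * x_t + (1 - alpha) * y_{t-1}, seeded with y_0 = x_0."""
    alpha = 2.0 / (span + 1)
    out = [float(values[0])]
    for x in values[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return np.array(out)

points = [2, 6, 1, 12, 3]  # toy per-gameweek points, not real data

manual = ewma_recursive(points, span=3)  # span=3 gives alpha = 0.5
via_pandas = pd.Series(points, dtype=float).ewm(span=3, adjust=False).mean().to_numpy()

print(manual)  # [2.    4.    2.5   7.25  5.125]
assert np.allclose(manual, via_pandas)
```

With `adjust=False`, pandas uses exactly this recursion; the default `adjust=True` uses a normalised weighted average instead, which differs in the first few observations.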

In [None]:
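def engineer_temporal_features(df):
    """Add lag, rolling-mean, EWMA and minutes-stability features per player.

    Assumes rows are already in chronological order within each `element`.
    If `element` IDs are reassigned each season in your data, group by
    ['season', 'element'] instead so histories do not mix across seasons.
    """
    df = df.copy()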
    
    metrics = ['total_points', 'minutes', 'ict_index', 'influence', 'creativity', 'threat', 'goals_scored', 'assists']
    windows = [3, 6, 10]
    
    # 1. Lags (What happened last game?)
    print("Generating Lag Features...")
    for m in ['total_points', 'minutes']:
        df[f'{m}_lag_1'] = df.groupby('element')[m].shift(1)

    # 2. Rolling & EWMA
    print(f"Generating Rolling & EWMA Features for windows {windows}...")
    for window in windows:
        for m in metrics:
            # Simple Rolling Mean
            df[f'{m}_roll_{window}'] = df.groupby('element')[m].transform(lambda x: x.rolling(window, min_periods=1).mean())
            
            # EWMA (More sensitive to form)
            df[f'{m}_ewma_{window}'] = df.groupby('element')[m].transform(lambda x: x.ewm(span=window, adjust=False).mean())

    # 3. Stability Metric (Coefficient of Variation for Minutes)
    # Low CV = "Nailed" (consistent minutes). High CV = rotation risk.
    # Caveats: the std of a single observation is NaN, so each player's first row has no CV;
    # a player with 0 minutes across the whole window also gets CV = 0, so read it with minutes_mean_5.
    print("Generating Stability Metrics...")
    df['minutes_std_5'] = df.groupby('element')['minutes'].transform(lambda x: x.rolling(5, min_periods=1).std())
    df['minutes_mean_5'] = df.groupby('element')['minutes'].transform(lambda x: x.rolling(5, min_periods=1).mean())
    df['minutes_cv_5'] = df['minutes_std_5'] / (df['minutes_mean_5'] + 1e-6)  # Avoid div/0

    return df

df = engineer_temporal_features(df)
print("Temporal Engineering Complete.")

## 3. Contextual Features (The Future)

**The Problem:** Standard models only look at *past* opponents. 
**The Solution:** We must inject knowledge of the *upcoming* schedule.

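1. Calculate **Team Defensive Strength** (rolling 5-GW mean of goals conceded; lower means a stronger defence).
2. Map this strength to the player's **Next 3 Opponents**. Because the metric counts goals conceded, a *higher* value signals an *easier* fixture.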
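
One way to do the mapping without using results from games that have not been played yet is to shift the opponent *names* backwards and then look up each opponent's strength as of the current gameweek. The sketch below shows this for the next opponent only. It is an illustration under assumptions, not the notebook's method: the teams, the player, and all numbers are made up, and `next_opp_name_1` is a hypothetical helper column.

```python
import pandas as pd

# Hypothetical team table: rolling goals conceded as known after each GW
team_defense = pd.DataFrame({
    "season": ["2023-24"] * 6,
    "GW": [1, 1, 1, 2, 2, 2],
    "team": ["A", "B", "C", "A", "B", "C"],
    "def_strength_5": [1.0, 2.0, 0.0, 1.5, 1.0, 0.5],
})

# One hypothetical player who faces B in GW1, C in GW2 and A in GW3
players = pd.DataFrame({
    "season": ["2023-24"] * 3,
    "element": [7, 7, 7],
    "GW": [1, 2, 3],
    "opponent_team_name": ["B", "C", "A"],
})
players = players.sort_values(["season", "element", "GW"])

# Step 1: who does the player face NEXT? (shift names, not strengths)
players["next_opp_name_1"] = (
    players.groupby(["season", "element"])["opponent_team_name"].shift(-1)
)

# Step 2: look up that team's strength as of the CURRENT gameweek
lookup = team_defense.rename(
    columns={"team": "next_opp_name_1", "def_strength_5": "next_opp_strength_1"}
)
players = players.merge(
    lookup, on=["season", "GW", "next_opp_name_1"], how="left"
)

print(players[["GW", "opponent_team_name", "next_opp_name_1", "next_opp_strength_1"]])
```

The GW1 row picks up team C's strength measured at GW1, not C's strength after the GW2 match, so the feature only contains information available at prediction time. The final row has no next opponent and stays NaN.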

In [None]:
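# A. Calculate Team Defensive Strength
def calculate_team_defense(df):
    # Rolling mean of goals conceded per team: lower = stronger defence.
    # Note: player-level `goals_conceded` only counts goals while that player was on the pitch,
    # so the squad mean understates the team total; `.max()` may be a closer proxy.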
    team_stats = df.groupby(['season', 'GW', 'team'])['goals_conceded'].mean().reset_index()
    team_stats = team_stats.sort_values(['season', 'team', 'GW'])
    team_stats['def_strength_5'] = team_stats.groupby('team')['goals_conceded'].transform(lambda x: x.rolling(5, min_periods=1).mean())
    return team_stats[['season', 'GW', 'team', 'def_strength_5']]

team_defense = calculate_team_defense(df)

# B. Merge Defensive Strength for CURRENT Match (The Opponent)
# We match (Player's Opponent Name) -> (Team Defense Table)
df = df.merge(
    team_defense,
    left_on=['season', 'GW', 'opponent_team_name'],
    right_on=['season', 'GW', 'team'],
    how='left',
    suffixes=('', '_opp_lookup')
)
df.rename(columns={'def_strength_5': 'opponent_strength_current'}, inplace=True)
df.drop(columns=['team_opp_lookup'], inplace=True)

# Fill NaNs (Early gameweeks or missing map) with League Average
mean_def = df['opponent_strength_current'].mean()
df['opponent_strength_current'] = df['opponent_strength_current'].fillna(mean_def)

print("Opponent Strength Calculated.")

In [None]:
# C. Look-Ahead: Calculate Upcoming Fixture Difficulty
# Logic: shift 'opponent_strength_current' BACKWARDS within each player-season,
# so shift(-1) is the NEXT game's opponent strength.
# Caveat: def_strength_5 at GW g includes the GW g result, so these columns embed
# outcomes from gameweeks after t. Treat them as leakage-prone.

print("Calculating Future Fixture Difficulty (Next 3 GWs)...")

# Ensure sorted by player time; kickoff_time orders double-gameweek fixtures
df = df.sort_values(['season', 'element', 'GW', 'kickoff_time'])

# Group by season as well so the look-ahead never reaches into the following season
for k in (1, 2, 3):
    df[f'next_opp_strength_{k}'] = df.groupby(['season', 'element'])['opponent_strength_current'].shift(-k)

# Feature: Average difficulty of next 3 games
df['upcoming_difficulty_3gw'] = df[['next_opp_strength_1', 'next_opp_strength_2', 'next_opp_strength_3']].mean(axis=1)

# Note: the last 3 rows of each player-season have NaNs in the shifted columns. This is expected.
# mean(axis=1) skips NaNs, so the average is NaN only on a player's final row of the season.
print(df[['name', 'GW', 'opponent_team_name', 'opponent_strength_current', 'upcoming_difficulty_3gw']].head(10))

## 4. Interaction Features

Combining raw features to capture efficiency and relative performance.
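
As one illustration of a relative-performance feature, a rolling home-versus-away form ratio can be built by computing the rolling mean on each venue subset, aligning it back to the full index, and forward-filling within each player. This is a sketch on made-up data; `venue_form` and the output column names are hypothetical, and it assumes a unique index with rows already sorted by player and time.

```python
import pandas as pd

# Toy data for one hypothetical player alternating home and away games
toy = pd.DataFrame({
    "element": [1, 1, 1, 1, 1, 1],
    "GW": [1, 2, 3, 4, 5, 6],
    "was_home": [True, False, True, False, True, False],
    "total_points": [8, 2, 6, 1, 10, 3],
})

def venue_form(frame, at_home, window=10):
    """Rolling mean of points at one venue type, carried forward to every row."""
    subset = frame[frame["was_home"] == at_home]
    rolled = (
        subset.groupby("element")["total_points"]
        .rolling(window, min_periods=1)
        .mean()
        .reset_index(level=0, drop=True)  # back to the original row index
    )
    # Rows at the other venue become NaN, then inherit the latest known value
    return rolled.reindex(frame.index).groupby(frame["element"]).ffill()

toy["home_form"] = venue_form(toy, at_home=True)
toy["away_form"] = venue_form(toy, at_home=False)

# > 1 suggests the player scores better at home, < 1 better away
toy["home_away_ratio"] = toy["home_form"] / (toy["away_form"] + 1e-6)
print(toy)
```

The ratio is NaN until the player has at least one game at each venue, which a tree model such as XGBoost or LightGBM can handle natively.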

In [None]:
# 1. Value Efficiency (recent form per unit of price)
# Note: if `value` is in raw FPL units (tenths of £1m), this is points per £0.1m, not per £1m.
df['value_efficiency'] = df['total_points_ewma_6'] / (df['value'] + 0.1)

# 2. Home/Away Bias
# A rolling home-vs-away points ratio (> 1 favours home, < 1 favours away) would need
# home-only and away-only rows merged back onto a common index, so it is not built here.
# Simplified approach: an interaction term equal to the 10-GW EWMA form
# for home games and 0 for away games.
df['home_advantage_feature'] = df['total_points_ewma_10'] * df['was_home'].astype(int)

## 5. Target Generation

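We predict each player's total points over the next 3 gameweeks, $\sum_{k=1}^{3} p_{t+k}$. The window counts rows, so if the data holds one row per fixture, a double gameweek takes up two of the three.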
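
The forward-looking sum can be built from a backward-looking rolling sum followed by a negative shift. A minimal sketch on a made-up points series for a single player:

```python
import pandas as pd

points = pd.Series([2, 6, 1, 12, 3, 5], dtype=float)  # toy data, one player
horizon = 3

# rolling(horizon).sum() at row i covers rows i-2..i;
# shift(-horizon) moves that value to row i-3, where it covers rows t+1..t+3
target = points.rolling(horizon).sum().shift(-horizon)

# Cross-check against an explicit forward sum
for t in range(len(points) - horizon):
    assert target.iloc[t] == points.iloc[t + 1 : t + 1 + horizon].sum()

print(target.tolist())  # [19.0, 16.0, 20.0, nan, nan, nan]
```

The current row's own points are excluded from the target, and the last `horizon` rows have no complete future window, so they are NaN.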

In [None]:
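horizon = 3
# rolling(horizon).sum() ends at the current row; shift(-horizon) re-anchors it to rows t+1..t+horizon.
# Grouping by season as well keeps the window from running into the following season.
df['target_points_next_3'] = (
    df.groupby(['season', 'element'])['total_points']
      .transform(lambda x: x.rolling(horizon).sum().shift(-horizon))
)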

# Validate
print("Target Distribution:")
print(df['target_points_next_3'].describe())

## 6. Clean & Save
Remove artifacts and save the Feature Matrix $X$.

In [None]:
# Drop rows where the target is NaN (the final `horizon` rows of each player's series)
df_model = df.dropna(subset=['target_points_next_3'])

# Drop highly collinear / leakage columns
drop_cols = ['match_score', 'own_goals', 'penalties_missed', 'penalties_saved', 'saves', 'bonus', 'bps'] 
df_model = df_model.drop(columns=[c for c in drop_cols if c in df_model.columns])

print(f"Final Feature Matrix Shape: {df_model.shape}")

output_path = PROCESSED_DIR / "fpl_features_production.csv"
df_model.to_csv(output_path, index=False)
print(f"Saved to {output_path}")