# Feature Forensics & Audit

**Objective:** Mathematically validate the engineered features from `02_feature_engineering.ipynb`.

**Forensic Questions:**
1. **Distribution Check:** Are the new features (EWMA, Interaction) roughly normal, or are they skewed with heavy outliers?
2. **Target Correlation:** Which features actually correlate with `target_points_next_3`?
3. **Multicollinearity:** Did we create redundant features?
4. **Visual Proof:** Does a higher `upcoming_difficulty_3gw` actually align with fewer points over the next three gameweeks?
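
A quick numeric take on the distribution question, as a minimal sketch: it reports skewness and the share of values outside the 1.5 * IQR fences for each column. The synthetic columns `ewma_like_demo` and `interaction_like_demo` and the helper `distribution_report` are hypothetical stand-ins, not columns or functions from the production dataset.

```python
import numpy as np
import pandas as pd


def distribution_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return skewness and the share of IQR-rule outliers per numeric column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # A value is flagged when it falls outside the 1.5 * IQR fences
    outside = (frame < q1 - 1.5 * iqr) | (frame > q3 + 1.5 * iqr)
    return pd.DataFrame({"skew": frame.skew(), "outlier_share": outside.mean()})


# Synthetic data (assumption): one symmetric column, one right-skewed column
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "ewma_like_demo": rng.normal(loc=3.0, scale=1.0, size=2000),
    "interaction_like_demo": rng.lognormal(mean=0.0, sigma=1.0, size=2000),
})

report = distribution_report(demo)
print(report.round(3))
```

On the real data, the same helper could be called on a selection such as `df[['total_points_ewma_6', 'value_efficiency']]`.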

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

BASE_DIR = Path.cwd().parent if 'Notebooks' in str(Path.cwd()) else Path.cwd()
PROCESSED_DIR = BASE_DIR / "data" / "processed"

# Load the "Model Ready" dataset
df = pd.read_csv(PROCESSED_DIR / "fpl_features_production.csv")
print(f"Evidence Loaded: {df.shape[0]:,} rows x {df.shape[1]} columns")

## 1. The Interrogation (Correlation Analysis)
We check which features have the strongest linear relationship with the future target.
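
Because Pearson correlation only measures linear association, a feature with a strong but curved relationship to the target can look weaker than it is. The sketch below illustrates this on made-up data (the series `x` and `y` are purely illustrative) by comparing Pearson with a rank-based (Spearman) correlation, computed here as Pearson on the ranks.

```python
import numpy as np
import pandas as pd

# Synthetic, perfectly monotonic but non-linear relationship (assumption)
x = pd.Series(np.linspace(0.1, 5.0, 50))
y = np.exp(x)

# Pearson: strength of the *linear* relationship
pearson = x.corr(y)

# Spearman: Pearson applied to ranks, so it captures any monotonic relationship
spearman = x.rank().corr(y.rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

If the two numbers diverge for a real feature, that may be a hint the relationship is monotonic but not linear.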

In [None]:
# Filter for numeric columns only
numeric_df = df.select_dtypes(include=[np.number])

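# Compute correlations with the target. Drop the target itself (its self-correlation
# is always 1.0) and NaN results from constant columns so they don't pollute the rankings.
target = 'target_points_next_3'
correlations = (
    numeric_df.drop(columns=[target])
    .corrwith(df[target])
    .dropna()
    .sort_values(ascending=False)
)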

print("Top 10 Positively Correlated Features:")
print(correlations.head(10))

print("\nTop 10 Negatively Correlated Features:")
print(correlations.tail(10))

# Plotting the top 20 predictors
plt.figure(figsize=(10, 8))
top_features = pd.concat([correlations.head(10), correlations.tail(10)])
top_features.plot(kind='barh', color='teal')
plt.title("Linear Correlation with Target (Top 10 Positive and Negative)")
plt.xlabel("Pearson Correlation Coefficient")
plt.grid(True, alpha=0.3)
plt.show()

## 2. Evidence Validation: Does 'Difficulty' Matter?
We visualize if our new `upcoming_difficulty_3gw` feature actually separates high performers from low performers.
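
The visual check can be backed by a number. As a minimal sketch on synthetic data, the code below bins a made-up difficulty score into quintiles and tests whether the median outcome falls from one bin to the next. The columns `difficulty_demo` and `points_demo` and the built-in downward relationship are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): points fall as difficulty rises, plus noise
rng = np.random.default_rng(42)
demo = pd.DataFrame({"difficulty_demo": rng.uniform(1.0, 5.0, size=1000)})
demo["points_demo"] = (
    12.0 - 2.0 * demo["difficulty_demo"] + rng.normal(0.0, 1.0, size=1000)
)

# Same quintile binning idea as the boxplot
demo["difficulty_bin"] = pd.qcut(
    demo["difficulty_demo"],
    q=5,
    labels=["Easiest", "Easy", "Medium", "Hard", "Hardest"],
)

# Median outcome per bin, ordered from easiest to hardest
bin_medians = demo.groupby("difficulty_bin", observed=True)["points_demo"].median()
trend_is_downward = bool(bin_medians.is_monotonic_decreasing)

print(bin_medians.round(2))
print("Medians fall with difficulty:", trend_is_downward)
```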

In [None]:
# We bin the difficulty into categories for cleaner plotting
df['difficulty_bin'] = pd.qcut(df['upcoming_difficulty_3gw'], q=5, labels=['Easiest', 'Easy', 'Medium', 'Hard', 'Hardest'])

plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='difficulty_bin', y='target_points_next_3', showfliers=False)
plt.title("Impact of Upcoming Fixture Difficulty on Future Points")
plt.xlabel("Average Opponent Defensive Strength (Next 3 Games)")
plt.ylabel("Actual Points Scored (Next 3 Games)")
plt.show()

print("Forensic Note: If the medians trend DOWNWARDS as difficulty increases, the feature carries signal.")

## 3. Signal vs. Noise: EWMA vs Rolling
Does the Exponential Moving Average (EWMA) capture form better than the simple Rolling Mean?
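
To see why the two smoothers can disagree, here is a small sketch on a made-up points series with a sudden jump in form. It assumes the EWMA feature was built with `span=6` and `adjust=False`, which the source does not confirm. Under that assumption the smoothing factor is `alpha = 2 / (span + 1) = 2/7`, and each value is `alpha * x_t + (1 - alpha) * previous_value`.

```python
import pandas as pd

# Made-up series (assumption): six blanks followed by three 10-point hauls
points = pd.Series([0, 0, 0, 0, 0, 0, 10, 10, 10], dtype=float)

# Simple rolling mean: every game in the window has equal weight
roll = points.rolling(window=6).mean()

# EWMA: recent games weigh more; span=6 and adjust=False are assumptions
ewma = points.ewm(span=6, adjust=False).mean()

comparison = pd.DataFrame({"points": points, "roll_6": roll, "ewma_6": ewma})
print(comparison.round(3))
```

After the jump, the EWMA column rises faster than the rolling mean, which is the behaviour one would hope for when tracking a change in form.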

In [None]:
# Pick a volatile player (high coefficient of variation in minutes)
volatile = df.loc[df['minutes_cv_5'] > 0.5, 'element']
if volatile.empty:
    raise ValueError("No player with minutes_cv_5 > 0.5; lower the threshold.")
sample_player = volatile.iloc[0]
player_data = df[df['element'] == sample_player].sort_values('GW')
player_name = player_data['name'].iloc[0]

plt.figure(figsize=(14, 6))
plt.plot(player_data['GW'], player_data['total_points'], 'o', alpha=0.3, label='Raw Points', color='gray')
plt.plot(player_data['GW'], player_data['total_points_roll_6'], '--', label='Rolling Mean (6)', color='blue')
plt.plot(player_data['GW'], player_data['total_points_ewma_6'], '-', linewidth=2, label='EWMA (6)', color='red')

plt.title(f"Signal Processing: Rolling vs EWMA for {player_name}")
plt.legend()
plt.ylabel("Points")
plt.xlabel("Gameweek")
plt.show()

## 4. The "Nailed" Test (Stability Metrics)
We validate whether `minutes_cv_5` correctly identifies rotation risks.
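
For intuition, the coefficient of variation is the standard deviation divided by the mean. The sketch below computes a 5-game rolling version for two made-up players. The exact definition behind `minutes_cv_5` (sample vs. population standard deviation, handling of zero means) is not shown in this notebook, so the column `minutes_cv_5_demo` is an assumption about how it may have been built.

```python
import numpy as np
import pandas as pd

# Made-up minutes (assumption): player 1 is nailed, player 2 is rotated
minutes = pd.DataFrame({
    "element": [1] * 5 + [2] * 5,
    "GW": list(range(1, 6)) * 2,
    "minutes": [90, 90, 90, 90, 90, 90, 0, 60, 0, 90],
})

grouped = minutes.groupby("element")["minutes"]
roll_mean = grouped.transform(lambda s: s.rolling(5).mean())
roll_std = grouped.transform(lambda s: s.rolling(5).std())  # sample std (ddof=1)

# CV = std / mean; a zero mean is mapped to NaN to avoid division by zero
minutes["minutes_cv_5_demo"] = roll_std / roll_mean.replace(0, np.nan)

# Latest available CV per player
latest_cv = minutes.groupby("element")["minutes_cv_5_demo"].last()
print(latest_cv.round(3))
```

A nailed starter sits near 0, while a heavily rotated player ends up well above the thresholds used in this notebook.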

In [None]:
plt.figure(figsize=(10, 6))
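# Cap the sample at the dataset size and fix the seed so the plot is reproducible
plot_sample = df.sample(n=min(2000, len(df)), random_state=42)
sns.scatterplot(data=plot_sample, x='minutes_cv_5', y='minutes_mean_5', alpha=0.5, hue='position')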
plt.title("Rotation Risk Analysis: Coefficient of Variation vs Mean Minutes")
plt.xlabel("Minutes Stability (CV) - Higher is Riskier")
plt.ylabel("Average Minutes (Last 5)")
plt.axvline(0.2, color='r', linestyle='--', label='Unstable Threshold')
plt.legend()
plt.show()

## 5. Collinearity Audit
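Check for highly correlated feature pairs. Redundant features add no new information; they destabilize linear-model coefficients and split importance across duplicates in tree-based models.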
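
A heatmap is easy to misread when many features are involved, so it can help to list the offending pairs explicitly. The sketch below does this on synthetic data; the column names and the 0.9 cutoff are assumptions chosen for illustration, not values taken from this project.

```python
import numpy as np
import pandas as pd

# Synthetic features (assumption): two near-duplicates and one independent column
rng = np.random.default_rng(7)
base = rng.normal(size=300)
demo = pd.DataFrame({
    "points_roll_demo": base,
    "points_ewma_demo": base + 0.1 * rng.normal(size=300),
    "difficulty_demo": rng.normal(size=300),
})

corr = demo.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is skipped
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)

THRESHOLD = 0.9  # hypothetical cutoff for "redundant"
redundant = pairs[pairs > THRESHOLD]
print(redundant)
```

The same steps could be applied to `df[cols_to_check]` to turn the heatmap into a short list of candidate features to drop.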

In [None]:
cols_to_check = [
    'total_points_ewma_6', 'total_points_roll_6', 
    'ict_index_ewma_6', 'value_efficiency', 
    'upcoming_difficulty_3gw', 'minutes_cv_5'
]

plt.figure(figsize=(10, 8))
sns.heatmap(df[cols_to_check].corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title("Feature Correlation Matrix")
plt.show()