# Concept

We have time series data: a sequence of gameweeks (GW) for each player.

    GW1, GW2, GW3, ...

Machine learning models (like XGBoost) require supervised learning data: rows of features X paired with a target y (X → y).

    We want to predict next week's points (y_{t+1}).

    We use stats available up to the current week (X_t) to do it.

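We must construct a sliding window that moves forward one gameweek at a time:

    Input (X): Stats from Gameweek 1 to 5.

    Target (y): Points in Gameweek 6.

The window then slides: Gameweeks 2 to 6 predict Gameweek 7, and so on. The window length is a modelling choice; the lag features built below use each player's previous 3 appearances.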
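The sliding window can be sketched on toy data. This is a minimal illustration, not the notebook's pipeline: the points values, the `WINDOW` constant, and the `mean_pts_window` column name are all made up for the example. It builds one (X, y) row per target gameweek and then checks that the same feature falls out of the `shift(1).rolling(...)` idiom.

```python
import pandas as pd

# Toy points history for one hypothetical player, gameweeks 1..8 (made-up numbers).
points = pd.Series([2, 6, 1, 9, 3, 5, 8, 2], index=range(1, 9), name="total_points")

WINDOW = 5  # mirrors the "Gameweek 1 to 5" example; purely illustrative

rows = []
for target_gw in range(WINDOW + 1, len(points) + 1):
    # Label-based slice, inclusive on both ends: the WINDOW gameweeks before the target.
    window = points.loc[target_gw - WINDOW : target_gw - 1]
    rows.append({
        "target_gw": target_gw,
        "mean_pts_window": window.mean(),        # X: summary of the past window
        "total_points": points.loc[target_gw],   # y: points in the target gameweek
    })

supervised = pd.DataFrame(rows)
print(supervised)

# The same feature via shift + rolling: shift(1) pushes every value one gameweek
# later, so the rolling mean at gameweek t only covers gameweeks t-WINDOW .. t-1.
rolled = points.shift(1).rolling(window=WINDOW).mean()
print(rolled)
```

Each row of `supervised` uses only information from before its target gameweek, which is the property the lag features need to preserve.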

# Data cleaning

## Merging the History

In [5]:
import pandas as pd
from pathlib import Path
import glob

# 1. Setup Paths
DATA_PATH = Path("../data/raw")

# 2. Load all Season Files
all_files = glob.glob(str(DATA_PATH / "*_merged_gw.csv"))

dfs = []
for filename in all_files:
    # Extract season name from filename
    season_name = Path(filename).name.split('_')[0]
    
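    # Read CSV
    # Use the python engine and skip malformed lines so a bad row doesn't break the whole load
    # (on_bad_lines='skip' requires pandas >= 1.3.0; low_memory is not supported with engine='python').
    # Try UTF-8 first: latin-1 never fails to decode, but it turns UTF-8 names into mojibake
    # (e.g. 'Doucouré' becomes 'DoucourÃ©'). Fall back to latin-1 only if UTF-8 decoding fails.
    try:
        df = pd.read_csv(filename, encoding='utf-8', engine='python', on_bad_lines='skip')
    except UnicodeDecodeError:
        df = pd.read_csv(filename, encoding='latin-1', engine='python', on_bad_lines='skip')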
    
    # Add metadata column (so we know which season this row belongs to)
    df['season'] = season_name
    
    dfs.append(df)

# 3. Concatenate into one massive dataset
master_df = pd.concat(dfs, axis=0, ignore_index=True)

# 4. Clean Column Names (Engineering Best Practice)
# Some seasons use 'Kickoff time' vs 'kickoff_time'. We normalize this.
master_df.columns = master_df.columns.str.lower().str.replace(' ', '_')

print(f"Dataset Shape: {master_df.shape}")
print(f"Columns: {master_df.columns.tolist()}")
master_df.head()

Dataset Shape: (120220, 43)
Columns: ['name', 'position', 'team', 'xp', 'assists', 'bonus', 'bps', 'clean_sheets', 'creativity', 'element', 'fixture', 'goals_conceded', 'goals_scored', 'ict_index', 'influence', 'kickoff_time', 'minutes', 'opponent_team', 'own_goals', 'penalties_missed', 'penalties_saved', 'red_cards', 'round', 'saves', 'selected', 'team_a_score', 'team_h_score', 'threat', 'total_points', 'transfers_balance', 'transfers_in', 'transfers_out', 'value', 'was_home', 'yellow_cards', 'gw', 'season', 'expected_assists', 'expected_goal_involvements', 'expected_goals', 'expected_goals_conceded', 'starts', 'modified']


,name,position,team,xp,assists,bonus,bps,clean_sheets,creativity,element,...,was_home,yellow_cards,gw,season,expected_assists,expected_goal_involvements,expected_goals,expected_goals_conceded,starts,modified
0,Aaron Connolly,FWD,Brighton,0.5,0,0,-3,0,0.3,78,...,True,0,1,2020-21,,,,,,
1,Aaron Cresswell,DEF,West Ham,2.1,0,0,11,0,11.2,435,...,True,0,1,2020-21,,,,,,
2,Aaron Mooy,MID,Brighton,0.0,0,0,0,0,0.0,60,...,True,0,1,2020-21,,,,,,
3,Aaron Ramsdale,GK,Sheffield Utd,2.5,0,0,12,0,0.0,483,...,True,0,1,2020-21,,,,,,
4,Abdoulaye Doucouré,MID,Everton,1.3,0,0,20,1,44.6,512,...,False,0,1,2020-21,,,,,,
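One caveat about normalizing column names after concatenation: if two seasons really did spell the same column differently, `pd.concat` would first keep them as two separate columns, and lower-casing afterwards would leave two columns with the same name. The column list printed above shows no duplicate names, so this dataset does not appear to hit the problem. The sketch below uses two made-up frames and a hypothetical helper, `normalize_columns`, to show the difference between normalizing late and normalizing each file before concatenation.

```python
import pandas as pd

def normalize_columns(df):
    """Hypothetical helper: lower-case column names and replace spaces with underscores."""
    out = df.copy()
    out.columns = out.columns.str.lower().str.replace(" ", "_")
    return out

# Two made-up season frames that spell the same column differently.
season_a = pd.DataFrame({"name": ["A"], "Kickoff time": ["2019-08-10"]})
season_b = pd.DataFrame({"name": ["B"], "kickoff_time": ["2020-09-12"]})

# Normalize AFTER concat: both spellings survive the concat, then collide on rename.
late = normalize_columns(pd.concat([season_a, season_b], ignore_index=True))
print(late.columns.tolist())
print("duplicate column names:", late.columns.duplicated().any())

# Normalize BEFORE concat: the columns line up and no values are lost to NaN.
early = pd.concat(
    [normalize_columns(season_a), normalize_columns(season_b)],
    ignore_index=True,
)
print(early)
```

A cheap guard in the real notebook would be to check `master_df.columns.duplicated().any()` right after the rename.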


## Lag Features

In [12]:
# 1. Sort Data
master_df['kickoff_time'] = pd.to_datetime(master_df['kickoff_time'], errors='coerce')
master_df = master_df.sort_values(by=['name', 'kickoff_time'])

# 2. Group by Player
grouped = master_df.groupby('name')

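# --- TIER 1: POINTS ---
# shift(1) drops the current match from the window, so each row only sees the player's
# previous 3 appearances and the target (total_points) never leaks into its own features.
# The rolling mean is NaN until 3 prior appearances exist; fillna(0) keeps those early rows
# (so the dropna() further down does not remove them) at the cost of encoding them as zero form.
master_df['mean_pts_3'] = grouped['total_points'].transform(
    lambda x: x.shift(1).rolling(window=3).mean()
).fillna(0)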

# --- TIER 2: UNDERLYING STATS ---
# THREAT: Are they getting in the box?
master_df['mean_threat_3'] = grouped['threat'].transform(
    lambda x: x.shift(1).rolling(window=3).mean()
).fillna(0)

# CREATIVITY: Are they making passes?
master_df['mean_creativity_3'] = grouped['creativity'].transform(
    lambda x: x.shift(1).rolling(window=3).mean()
).fillna(0)

# --- TIER 3: CONTEXT ---
# Security (Minutes)
master_df['mean_mins_3'] = grouped['minutes'].transform(
    lambda x: x.shift(1).rolling(window=3).mean()
).fillna(0)

# Cost (The Budget Constraint)
# Some season files use 'now_cost' while others use 'value' (or neither).
# Guard against KeyError by creating/deriving a consistent 'now_cost' column.
# Checking 'value' first keeps this cell idempotent: re-running it never divides
# an already-converted 'now_cost' by 10 a second time.
if 'value' in master_df.columns:
    master_df['now_cost'] = master_df['value'] / 10.0
elif 'now_cost' in master_df.columns:
    master_df['now_cost'] = master_df['now_cost'] / 10.0
else:
    # fallback: create a default column so downstream code doesn't fail
    master_df['now_cost'] = 0.0

# --- FORMATTING ---
master_df['opponent_team'] = master_df['opponent_team'].astype('category')
master_df['position'] = master_df['position'].astype('category')
master_df['was_home'] = master_df['was_home'].astype(bool)

# Select the "Pro" Feature Set
final_cols = [
    # Metadata
    'name', 'season', 'kickoff_time', 
    # Context Features (Input)
    'position', 'was_home', 'opponent_team', 'now_cost',
    # Lag Features (The Signals)
    'mean_pts_3', 'mean_threat_3', 'mean_creativity_3', 'mean_mins_3',
    # Target (Output)
    'total_points'
]

# Create Training Set
train_df = master_df[final_cols].dropna()

# Save
output_path = Path("../data/processed/training_data.csv")
output_path.parent.mkdir(parents=True, exist_ok=True)
train_df.to_csv(output_path, index=False)

print(f"Training Data Saved: {len(train_df)} rows.")

Training Data Saved: 120220 rows.
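The saved row count matches the full dataset shape (120220), which is consistent with the lag columns containing no NaN by the time `dropna()` runs. A minimal sketch on made-up data shows why: it applies the same `shift(1).rolling(window=3).mean()` transform to two hypothetical players and compares what `dropna()` would remove with and without `fillna(0)`. The `min_periods=1` variant at the end is only one possible alternative, not something the notebook uses.

```python
import pandas as pd

# Made-up history: P1 has 5 appearances, P2 has only 2.
toy = pd.DataFrame({
    "name": ["P1"] * 5 + ["P2"] * 2,
    "total_points": [2, 6, 1, 9, 3, 4, 7],
})

grouped = toy.groupby("name")["total_points"]

# Same transform as the notebook, kept once with NaN and once filled with 0.
lagged = grouped.transform(lambda x: x.shift(1).rolling(window=3).mean())
toy["mean_pts_3_raw"] = lagged
toy["mean_pts_3"] = lagged.fillna(0)

# Alternative: average over however many prior appearances exist (at least 1).
toy["mean_pts_3_partial"] = grouped.transform(
    lambda x: x.shift(1).rolling(window=3, min_periods=1).mean()
)

print(toy)
print("rows kept if NaN is dropped:", len(toy.dropna(subset=["mean_pts_3_raw"])))
print("rows kept after fillna(0):  ", len(toy.dropna(subset=["mean_pts_3"])))
```

With NaN left in place, only rows with three prior appearances survive; with `fillna(0)` every row survives, but a player's first three appearances are indistinguishable from a run of zero-point games. Which behaviour is preferable depends on how the model should treat players with little history.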
