# Data Manipulation and Feature Building


## Spectator Liveliness Score — **SLS-F+** (FotMob, z-standardized)

### Notation (per match)
- Teams: $t \in \{\mathrm{H}, \mathrm{A}\}$ = Home, Away  
- Minutes played:  
  $$ M \;=\; 90 \;+\; \Delta_{45} \;+\; \Delta_{90} $$
  where $\Delta_{45}$, $\Delta_{90}$ are added-time minutes at 45′ and 90′ (fallback $M=95$ if unknown).

- Per team raw stats (from the match JSON, Period = “All”):  
  $\mathrm{xG}_t$, $\mathrm{Shots}_t$, $\mathrm{SoT}_t$ (shots on target), $\mathrm{BigCh}_t$ (big chances), $\mathrm{Corners}_t$, $\mathrm{ToB}_t$ (touches in opposition box; fallback: $\mathrm{SiB}_t$ = shots inside box).

- Attendance & capacity: $A$ (attendance), $C$ (stadium capacity); occupancy $ \rho = A/C \in [0,1]$.

---

### 1) Aggregate to match totals and per-minute rates
**Totals across both teams:**
$$
\begin{aligned}
\mathrm{xG}_{\mathrm{tot}}      &= \mathrm{xG}_{\mathrm{H}} + \mathrm{xG}_{\mathrm{A}} \\
\mathrm{SoT}_{\mathrm{tot}}     &= \mathrm{SoT}_{\mathrm{H}} + \mathrm{SoT}_{\mathrm{A}} \\
\mathrm{BigCh}_{\mathrm{tot}}   &= \mathrm{BigCh}_{\mathrm{H}} + \mathrm{BigCh}_{\mathrm{A}} \\
\mathrm{Corners}_{\mathrm{tot}} &= \mathrm{Corners}_{\mathrm{H}} + \mathrm{Corners}_{\mathrm{A}} \\
\mathrm{ToB}_{\mathrm{tot}}     &= \mathrm{ToB}_{\mathrm{H}} + \mathrm{ToB}_{\mathrm{A}} \quad (\text{or } \mathrm{SiB}_{\mathrm{H}}+\mathrm{SiB}_{\mathrm{A}} \text{ if ToB unavailable})
\end{aligned}
$$

**Per-minute features (match-level):**
$$
\begin{aligned}
xg_{\mathrm{pm}}      &= \frac{\mathrm{xG}_{\mathrm{tot}}}{M} \\
sot_{\mathrm{pm}}     &= \frac{\mathrm{SoT}_{\mathrm{tot}}}{M} \\
big_{\mathrm{pm}}     &= \frac{\mathrm{BigCh}_{\mathrm{tot}}}{M} \\
corn_{\mathrm{pm}}    &= \frac{\mathrm{Corners}_{\mathrm{tot}}}{M} \\
tob_{\mathrm{pm}}     &= \frac{\mathrm{ToB}_{\mathrm{tot}}}{M} \quad \text{(or } \frac{\mathrm{SiB}_{\mathrm{tot}}}{M}\text{)}
\end{aligned}
$$

Attendance occupancy (match-level):
$$
\rho \;=\; \frac{A}{C}
$$
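
The aggregation above can be sketched in a few lines. This is a minimal illustration, assuming each team's stats arrive as a plain dictionary; the key names and all numbers are made up for the example and are not FotMob's schema.

```python
def per_minute_features(home, away, added_45=None, added_90=None):
    """Sum both teams' stats and divide by minutes played M.

    `home` / `away` are hypothetical dicts with keys xG, SoT, BigCh, Corners, ToB.
    Falls back to M = 95 when either added-time value is unknown.
    """
    if added_45 is None or added_90 is None:
        minutes = 95.0
    else:
        minutes = 90.0 + added_45 + added_90
    keys = ["xG", "SoT", "BigCh", "Corners", "ToB"]
    rates = {f"{k}_pm": (home[k] + away[k]) / minutes for k in keys}
    return rates, minutes


def occupancy(attendance, capacity):
    """rho = A / C, or None when a value is missing or capacity is not positive."""
    if attendance is None or not capacity or capacity <= 0:
        return None
    return attendance / capacity


# Illustrative numbers only
home = {"xG": 1.8, "SoT": 6, "BigCh": 3, "Corners": 7, "ToB": 30}
away = {"xG": 0.7, "SoT": 2, "BigCh": 1, "Corners": 3, "ToB": 15}
rates, M = per_minute_features(home, away, added_45=2, added_90=8)
rho = occupancy(72000, 75000)
```

With 2 and 8 added minutes, $M = 100$, so a combined xG of 2.5 gives $xg_{\mathrm{pm}} = 0.025$.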

---

### 2) Season standardization (z-scores)
Compute means $\mu_{\bullet}$ and standard deviations $\sigma_{\bullet}$ over **all matches in the same league & season**. Then:
$$
\begin{aligned}
z_{xg}   &= \frac{xg_{\mathrm{pm}} - \mu_{xg}}{\sigma_{xg}} \,,\quad &
z_{sot}  &= \frac{sot_{\mathrm{pm}} - \mu_{sot}}{\sigma_{sot}} \,,\quad &
z_{big}  &= \frac{big_{\mathrm{pm}} - \mu_{big}}{\sigma_{big}} \,,\\[4pt]
z_{corn} &= \frac{corn_{\mathrm{pm}} - \mu_{corn}}{\sigma_{corn}} \,,\quad &
z_{tob}  &= \frac{tob_{\mathrm{pm}} - \mu_{tob}}{\sigma_{tob}} \,,\quad &
z_{\rho} &= \frac{\rho - \mu_{\rho}}{\sigma_{\rho}}
\end{aligned}
$$
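
A minimal sketch of this standardization for a single feature, assuming the per-minute values for one league and season are already collected in a list (the helper name `zscores` and the sample values are illustrative). It uses the population standard deviation and skips missing values, which is one reasonable reading of the definition above.

```python
import math


def zscores(values):
    """Population z-scores; None entries are skipped and stay None in the output."""
    present = [v for v in values if v is not None]
    if not present:
        return [None] * len(values)
    mu = sum(present) / len(present)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in present) / len(present))
    if sigma == 0:
        # A constant feature carries no information: treat every match as average
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - mu) / sigma for v in values]


# Illustrative per-minute xG values; one match has no data
xg_pm = [0.010, 0.020, 0.030, None, 0.040]
z_xg = zscores(xg_pm)
```

The non-missing z-scores sum to zero by construction, and the missing match stays `None` so a later step can decide how to handle it.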

---

### 3) Core + attendance boost → raw score
**Default component weights (they sum to 1 when all five components are available):**
$$
w_{xg}=0.50,\quad w_{sot}=0.20,\quad w_{big}=0.10,\quad w_{corn}=0.10,\quad w_{tob}=0.10
$$

If any component is missing (e.g., $tob_{\mathrm{pm}}$), renormalize weights:
$$
\alpha_i \;=\; \frac{w_i}{\sum_{j \in \mathcal{S}} w_j}\quad\text{for } i\in\mathcal{S}\ (\mathcal{S}=\text{available components})
$$

**Core (danger + pace):**
$$
\mathrm{core} \;=\; \sum_{i\in\mathcal{S}} \alpha_i\, z_i
$$
(where $z_i \in \{z_{xg}, z_{sot}, z_{big}, z_{corn}, z_{tob}\}$ as available)

**Attendance boost (atmosphere):**
$$
\mathrm{boost} \;=\; \operatorname{clip}\!\left(\beta \, z_{\rho},\; -0.30,\; 0.30\right),\quad \beta = 0.15
$$

**Raw score:**
$$
\mathrm{raw} \;=\; \mathrm{core} \;+\; \mathrm{boost}
$$
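
The core, boost, and raw score can be sketched as follows, with weights renormalized over whichever components are present for a given match. The function name `raw_score`, the short component keys, and the choice to give no boost when occupancy is unknown are assumptions made for this illustration.

```python
DEFAULT_WEIGHTS = {"xg": 0.50, "sot": 0.20, "big": 0.10, "corn": 0.10, "tob": 0.10}


def raw_score(z, z_rho, weights=DEFAULT_WEIGHTS, beta=0.15, cap=0.30):
    """Return (core, boost, raw) for one match.

    `z` maps component name -> z-score, with None (or a missing key) for
    unavailable components; weights are renormalized over what is present.
    """
    available = {k: w for k, w in weights.items() if z.get(k) is not None}
    total = sum(available.values())
    if total == 0:
        core = 0.0
    else:
        core = sum((w / total) * z[k] for k, w in available.items())
    # Assumption: unknown occupancy contributes no boost
    z_rho = 0.0 if z_rho is None else z_rho
    boost = max(-cap, min(cap, beta * z_rho))
    return core, boost, core + boost


# Illustrative z-scores; touches in box is unavailable for this match
core, boost, raw = raw_score(
    {"xg": 1.0, "sot": 0.5, "big": 0.0, "corn": -1.0, "tob": None},
    z_rho=3.0,
)
```

Here the four available weights sum to 0.9, so the core is $0.5/0.9 \approx 0.556$, and the boost $0.15 \times 3 = 0.45$ is clipped to $0.30$.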

---

### 4) Final 0–100 index scaling
Let $\mu_{\mathrm{raw}}, \sigma_{\mathrm{raw}}$ be the season mean and std of $\mathrm{raw}$. Define:
$$
z_{\mathrm{raw}} \;=\; \frac{\mathrm{raw}-\mu_{\mathrm{raw}}}{\sigma_{\mathrm{raw}}}
$$
$$
\boxed{\ \mathrm{SLS\!-\!F^+} \;=\; \operatorname{clip}\!\left(50 \;+\; 15\, z_{\mathrm{raw}},\; 0,\; 100\right)\ }
$$
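
As a quick check of the scaling, here is a small sketch of the mapping from $z_{\mathrm{raw}}$ to the index (the function name `sls_index` is illustrative). Because one standard deviation is worth 15 points, the clip only takes effect when $|z_{\mathrm{raw}}| > 50/15 \approx 3.33$.

```python
def sls_index(z_raw):
    """Map a standardized raw score to the 0-100 index: clip(50 + 15 * z, 0, 100)."""
    return max(0.0, min(100.0, 50.0 + 15.0 * z_raw))


average_match = sls_index(0.0)    # league-average liveliness
lively_match = sls_index(1.0)     # one standard deviation above average
clipped_low = sls_index(-4.0)     # 50 - 60 = -10, clipped to the lower bound
clipped_high = sls_index(4.0)     # 50 + 60 = 110, clipped to the upper bound
```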

---

### Fields to extract from the match JSON (for **label** computation)

- Team stats (Period = “All”, per team):  
  $\mathrm{xG}_t$, $\mathrm{SoT}_t$, $\mathrm{BigCh}_t$, $\mathrm{Corners}_t$  
  *(Optional)* $\mathrm{ToB}_t$ (touches in opposition box). If absent, use $\mathrm{SiB}_t$ (shots inside box) from shot events.

- Match timing: $\Delta_{45}$, $\Delta_{90}$ (added time at 45′ and 90′) to build $M$.

- Stadium & attendance: $A$ (attendance), $C$ (capacity) to build $\rho$.

---

### Features we will also compute (from multiple match JSONs) for **modeling SLS-F+ pre-match**

For each team, over rolling last $N$ matches (e.g., $N{=}5$), with home/away splits as needed:

- **Attacking form per 90:**  
  $\overline{\mathrm{xG}}_{90}$, $\overline{\mathrm{SoT}}_{90}$, $\overline{\mathrm{BigCh}}_{90}$, $\overline{\mathrm{Corners}}_{90}$  
  *(Optional)* $\overline{\mathrm{ToB}}_{90}$ or $\overline{\mathrm{SiB}}_{90}$

- **Defensive concessions per 90:**  
  $\overline{\mathrm{xGA}}_{90}$ (opponent xG against), $\overline{\mathrm{SoT\_against}}_{90}$, $\overline{\mathrm{BigCh\_against}}_{90}$

- **Context:**  
  Home flag; Days rest since last match (from match dates);  
  **Occupancy prior:** average $\rho$ in recent home games.

These features are derived by repeating the label-side extractions across prior match JSONs and averaging per 90. They are the primary inputs for a pre-match predictor of $\mathrm{SLS\!-\!F^+}$.
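
A minimal sketch of one such rolling feature, assuming a team's earlier matches are kept as a chronologically ordered list of dictionaries (the helper name `rolling_per90` and the numbers are illustrative). It pools totals and minutes over the window rather than averaging per-match rates, which weights longer matches slightly more.

```python
def rolling_per90(history, stat, n=5):
    """Per-90 rate of `stat` over a team's last `n` matches in `history`.

    `history` holds only matches played strictly before the one being
    predicted; each entry has the stat total and a "minutes" value.
    Returns None when there is no usable history.
    """
    recent = history[-n:]
    minutes = sum(g["minutes"] for g in recent)
    if minutes <= 0:
        return None
    return sum(g[stat] for g in recent) / minutes * 90.0


# Illustrative history for one team
history = [
    {"xG": 1.0, "minutes": 95},
    {"xG": 2.0, "minutes": 100},
    {"xG": 0.5, "minutes": 95},
]
xg_90 = rolling_per90(history, "xG", n=2)   # uses only the last two matches
```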




### Step 1: Load Match Index and Initialize Data Structures

First, we load the index.json file which lists all matches grouped by round (matchweek). This index provides each match’s ID, teams, and the relative path to its detailed JSON file. We will iterate through these rounds and matches to collect data.
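
To make the expected layout concrete, the sketch below writes a tiny hypothetical index to a temporary directory and walks it. The field names (`round`, `matches`, `matchId`, `home`, `away`, `jsonPath`) are the ones the loader reads; the team names, IDs, and paths are invented, and `iter_matches` is an illustrative helper.

```python
import json
import os
import tempfile

# Hypothetical two-round index mirroring the fields the loader reads
index = [
    {"round": 0, "matches": [
        {"matchId": 1, "home": "Team A", "away": "Team B", "jsonPath": "round_0/1.json"},
    ]},
    {"round": 1, "matches": [
        {"matchId": 2, "home": "Team B", "away": "Team A", "jsonPath": "round_1/2.json"},
    ]},
]


def iter_matches(index_path):
    """Yield (round number, match entry, path to match JSON) for every match."""
    with open(index_path, "r") as f:
        schedule = json.load(f)
    base_dir = os.path.dirname(index_path)
    for round_info in schedule:
        for match_info in round_info["matches"]:
            match_path = os.path.join(base_dir, match_info["jsonPath"])
            yield round_info["round"], match_info, match_path


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.json")
    with open(path, "w") as f:
        json.dump(index, f)
    rows = list(iter_matches(path))
```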

### Step 2: Extract Match Metrics from JSON

For each match, we extract the required inputs for SLS-F+ as specified:

- Attendance and Stadium Capacity – from content.matchFacts.infoBox (to compute occupancy).

- Added time minutes – from events of type "AddedTime" at 45′ and 90′ (to compute total match minutes).

- Team stats (full match, “All” period) – from content.stats.Periods.All.stats:

  - Expected Goals (expected_goals) for home and away.

  - Total shots (total_shots) for home and away.

  - Shots on target (ShotsOnTarget) for home and away.

  - Big chances (big_chance) for home and away.

  - Corners (corners) for home and away.

  - Touches in opposition box – team totals. If a team-level stat for this exists in the JSON (key touches_opp_box), we’ll use it. Otherwise, we would sum player stats from the "Attack" section (if needed).
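
The lookup of a single team stat can be sketched as below. The toy structure follows the shape the extraction code assumes (a list of groups, each with a list of stat items holding a home and an away value); the group key `shots`, the helper name `find_stat`, and the values are illustrative.

```python
def find_stat(stats_groups, stat_key):
    """Return (home, away) for the first stat item whose key matches, else None."""
    for group in stats_groups:
        for item in group.get("stats", []):
            if item.get("key") == stat_key:
                values = item.get("stats", [])
                if len(values) >= 2:
                    return values[0], values[1]
    return None


# Toy structure shaped like the layout the extraction code assumes
stats_groups = [
    {"key": "top_stats", "stats": [
        {"key": "expected_goals", "stats": ["1.80", "0.70"]},
        {"key": "corners", "stats": [7, 3]},
    ]},
    {"key": "shots", "stats": [
        {"key": "ShotsOnTarget", "stats": [6, 2]},
    ]},
]
xg = find_stat(stats_groups, "expected_goals")
tob = find_stat(stats_groups, "touches_opp_box")   # absent in this toy match
```

Returning `None` for an absent stat keeps "not reported" distinct from a genuine zero, which matters when deciding whether to renormalize the weights.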

In [15]:
# --- Imports ---
import json, os
import math
import pandas as pd
from datetime import datetime


In [8]:
# Load the index of matches (schedule for the season)
index_path = "24-25_PL_Data_raw/index.json"  # path to the index file (adjust if needed)
with open(index_path, 'r') as f:
    schedule = json.load(f)

# Prepare a list to collect per-match data
match_data_list = []

# Loop through each round in the schedule
for round_info in schedule:
    round_num = round_info["round"]
    for match_info in round_info["matches"]:
        match_id   = match_info["matchId"]
        home_team  = match_info["home"]
        away_team  = match_info["away"]
        json_rel   = match_info["jsonPath"]         # e.g., "round_0/4506263_matchDetails_Manchester_United-vs-Fulham.json"
        # Construct absolute path to the match JSON file
        base_dir   = os.path.dirname(index_path)    # base directory of index.json
        match_path = os.path.join(base_dir, json_rel)
        if not os.path.exists(match_path):
            # Fallback: if the file is not found in the expected subfolder, try base directory
            match_path = os.path.join(base_dir, os.path.basename(json_rel))
        if not os.path.exists(match_path):
            # If file still not found, skip this match (data might be missing)
            continue

        # --- Load the match JSON data ---
        with open(match_path, 'r') as f:
            match_json = json.load(f)
        
            # --- Extract attendance and capacity for occupancy ---
            info_box   = match_json.get("content", {}).get("matchFacts", {}).get("infoBox", {})
            attendance = info_box.get("Attendance")
            capacity   = None
            if "Stadium" in info_box:
                capacity = info_box["Stadium"].get("capacity")
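            # Compute occupancy (fraction of stadium filled); require numeric values so a
            # string or missing entry yields None instead of raising a TypeError
            if isinstance(attendance, (int, float)) and isinstance(capacity, (int, float)) and capacity > 0:
                occ = attendance / capacity
            else:
                occ = None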
    
            # --- Determine total match minutes (90 + added time) ---
            match_minutes = 90
            # Find events of type "AddedTime" at 45' and 90'
            def find_added_times(obj):
                results = []
                if isinstance(obj, dict):
                    if obj.get("type") == "AddedTime" and "minutesAddedInput" in obj:
                        results.append(obj)
                    for value in obj.values():
                        results += find_added_times(value)
                elif isinstance(obj, list):
                    for item in obj:
                        results += find_added_times(item)
                return results
    
            added_time_events = find_added_times(match_json.get("header", {}))
            # Sum all minutes added in first half (time == 45) and second half (time == 90)
            added_first_half  = sum(ev.get("minutesAddedInput", 0) or 0 for ev in added_time_events if ev.get("time") == 45)
            added_second_half = sum(ev.get("minutesAddedInput", 0) or 0 for ev in added_time_events if ev.get("time") == 90)
            match_minutes     = 90 + added_first_half + added_second_half
            if not added_time_events:
                # No added-time events found: use the documented fallback M = 95
                match_minutes = 95
    
            # --- Extract team aggregate stats from "Top stats" section ---
            stats_all   = match_json.get("content", {}).get("stats", {}).get("Periods", {}).get("All", {})
            stats_groups = stats_all.get("stats", []) if isinstance(stats_all, dict) else stats_all  # handle dict or list form
            # Initialize stat values
            xG_home = xG_away = 0.0
            shots_home = shots_away = 0
            SOT_home = SOT_away = 0
            bigch_home = bigch_away = 0
            corners_home = corners_away = 0
    
            # Find the "Top stats" group and extract relevant stats
            top_stats_group = next((grp for grp in stats_groups if grp.get("key") == "top_stats"), None)
            if top_stats_group:
                for stat_item in top_stats_group.get("stats", []):
                    key   = stat_item.get("key")
                    values = stat_item.get("stats", [])
                    if len(values) < 2:
                        continue  # skip if we don't have both home and away values
                    home_val, away_val = values[0], values[1]
                    # Convert values to numeric (floats/ints). Some values may be strings (e.g., "2.43").
                    def to_number(v):
                        if v is None:
                            return 0
                        if isinstance(v, (int, float)):
                            return float(v)
                        if isinstance(v, str):
                            s = v.strip()
                            if s == "":
                                return 0
                            # If the string includes a percentage or other text (unlikely in these stats), extract the numeric part
                            try:
                                return float(s)
                            except ValueError:
                                # If string has a format like "408 (85%)", take the first part before space or parentheses
                                import re
                                num_str = re.match(r'[\d\.]+', s)  # match leading numeric part
                                return float(num_str.group()) if num_str else 0
                        return 0
    
                    hv = to_number(home_val)
                    av = to_number(away_val)
                    if key == "expected_goals":
                        xG_home, xG_away = hv, av
                    elif key == "total_shots":
                        shots_home, shots_away = int(hv), int(av)
                    elif key == "ShotsOnTarget":
                        SOT_home, SOT_away = int(hv), int(av)
                    elif key == "big_chance":
                        bigch_home, bigch_away = int(hv), int(av)
                    elif key == "corners":
                        corners_home, corners_away = int(hv), int(av)
    
            # --- Extract touches in opposition box (if available) ---
            touches_opp_box_home = touches_opp_box_away = None
            for group in stats_groups:
                for stat_item in group.get("stats", []):
                    if stat_item.get("key") == "touches_opp_box":
                        vals = stat_item.get("stats", [])
                        if len(vals) >= 2:
                            try:
                                touches_opp_box_home = int(vals[0])
                                touches_opp_box_away = int(vals[1])
                            except ValueError:
                                touches_opp_box_home = float(vals[0])
                                touches_opp_box_away = float(vals[1])
            # If touches in opp. box is not directly provided, we could sum player stats (not shown here for brevity).
            # Until then, keep both values as None so the match is excluded from the ToB mean/std
            # rather than being counted as a match with 0 touches.
            if touches_opp_box_home is None or touches_opp_box_away is None:
                touches_opp_box_home = touches_opp_box_away = None
    
            # --- Compute match totals and per-minute rates ---
            xG_total     = xG_home + xG_away
            shots_total  = shots_home + shots_away
            SOT_total    = SOT_home + SOT_away
            bigch_total  = bigch_home + bigch_away
            corners_total= corners_home + corners_away
            tob_total    = None if (touches_opp_box_home is None or touches_opp_box_away is None) else (touches_opp_box_home + touches_opp_box_away)
    
            # Rates per minute of match (to normalize pace across different match lengths)
            minutes = match_minutes if match_minutes > 0 else 90
            xG_per_min       = xG_total / minutes
            shots_per_min    = shots_total / minutes
            SOT_per_min      = SOT_total / minutes
            bigch_per_min    = bigch_total / minutes
            corners_per_min  = corners_total / minutes
            tob_per_min      = tob_total / minutes if tob_total is not None else None
    
            # Store all collected metrics for this match
            match_data_list.append({
                "Round": round_num,
                "HomeTeam": home_team,
                "AwayTeam": away_team,
                "Attendance": attendance,
                "Capacity": capacity,
                "Occupancy": occ,
                "MatchMinutes": minutes,
                "xG_home": xG_home, "xG_away": xG_away,
                "Shots_home": shots_home, "Shots_away": shots_away,
                "ShotsOnTarget_home": SOT_home, "ShotsOnTarget_away": SOT_away,
                "BigChances_home": bigch_home, "BigChances_away": bigch_away,
                "Corners_home": corners_home, "Corners_away": corners_away,
                "TouchesOppBox_home": touches_opp_box_home, "TouchesOppBox_away": touches_opp_box_away,
                "xG_per_min": xG_per_min,
                "Shots_per_min": shots_per_min,
                "ShotsOnTarget_per_min": SOT_per_min,
                "BigChances_per_min": bigch_per_min,
                "Corners_per_min": corners_per_min,
                "TouchesOppBox_per_min": tob_per_min
            })
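
Stat values may arrive as numbers or as strings such as "2.43" or "408 (85%)". The following standalone sketch shows one way to parse them; it is a variant of the helper in the cell above, not a copy of it. It returns a caller-supplied default (here `None`) for unparseable input so that missing data is not silently turned into zero, and the name `_LEADING_NUMBER` is illustrative.

```python
import re

# Leading integer or decimal, optionally negative
_LEADING_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def to_number(value, default=None):
    """Parse ints, floats and strings such as "2.43" or "408 (85%)"."""
    if isinstance(value, bool):
        return default           # bool is a subclass of int; do not treat it as a stat
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        match = _LEADING_NUMBER.match(value.strip())
        if match:
            return float(match.group())
    return default


parsed = [to_number(v) for v in ["2.43", "408 (85%)", 7, "", None]]
```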


### Step 3: Standardize Features (League-wide Z-Scores)
SLS-F+ uses z-scores to normalize each feature relative to the distribution of that feature across the league and season. Now that we have all matches’ data in match_data_list, we calculate the mean and standard deviation for each per-minute feature and for occupancy across all matches:

In [9]:
# Compute league-wide mean (μ) and std deviation (σ) for each per-minute feature
features = ["xG_per_min", "ShotsOnTarget_per_min", "Shots_per_min", "BigChances_per_min", "Corners_per_min"]
if any(m["TouchesOppBox_per_min"] is not None for m in match_data_list):
    features.append("TouchesOppBox_per_min")

feature_means = {}
feature_stds  = {}
for feat in features:
    # Gather all non-null values for this feature across matches
    vals = [m[feat] for m in match_data_list if m[feat] is not None]
    if not vals:
        feature_means[feat] = 0.0
        feature_stds[feat]  = 0.0
    else:
        # Calculate mean
        mu = sum(vals) / len(vals)
        # Calculate population std (using N, not N-1, since we consider full season data)
        variance = sum((x - mu)**2 for x in vals) / len(vals)
        sigma    = math.sqrt(variance)
        feature_means[feat] = mu
        feature_stds[feat]  = sigma

# Compute league-wide mean and std for occupancy as well
occ_vals = [m["Occupancy"] for m in match_data_list if m["Occupancy"] is not None]
mu_occ   = sum(occ_vals)/len(occ_vals) if occ_vals else 0.0
var_occ  = sum((x - mu_occ)**2 for x in occ_vals)/len(occ_vals) if occ_vals else 0.0
sigma_occ= math.sqrt(var_occ) if occ_vals else 0.0

# Add z-scores for each match in our data list
for m in match_data_list:
    # Z-score for each feature f: z_f = (f_value - μ_f) / σ_f
    for feat in features:
        val = m[feat]
        mu  = feature_means.get(feat, 0.0)
        sigma = feature_stds.get(feat, 0.0)
        if sigma and val is not None:
            m[f"z_{feat}"] = (val - mu) / (sigma + 1e-9)  # small epsilon to avoid division by zero
        else:
            m[f"z_{feat}"] = 0.0  # if missing or zero std, set z-score to 0 (average)
    # Z-score for occupancy
    if m["Occupancy"] is not None and sigma_occ:
        m["z_occ"] = (m["Occupancy"] - mu_occ) / (sigma_occ + 1e-9)
    else:
        m["z_occ"] = 0.0


### Step 4: Calculate SLS-F+ (Core Blend + Attendance Boost)

With all features standardized, we calculate the core liveliness component and the crowd boost for each match, then combine them into the raw SLS-F+ score and scale it to a 0–100 range:

In [10]:
# Define feature weights for the core liveliness score
use_tob = "TouchesOppBox_per_min" in features  # whether touches in opp. box data is available
if use_tob:
    core_weights = {
        "xG_per_min": 0.50,
        "ShotsOnTarget_per_min": 0.20,
        "BigChances_per_min": 0.10,
        "Corners_per_min": 0.10,
        "TouchesOppBox_per_min": 0.10
    }
else:
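    # If ToB is not available, renormalize the remaining default weights to sum to 1
    # (alpha_i = w_i / sum of available w_j, roughly 0.556, 0.222, 0.111, 0.111)
    base_weights = {
        "xG_per_min": 0.50,
        "ShotsOnTarget_per_min": 0.20,
        "BigChances_per_min": 0.10,
        "Corners_per_min": 0.10
        # (Optionally could include Shots_per_min at a small weight if ToB is absent)
    }
    weight_sum = sum(base_weights.values())
    core_weights = {feat: w / weight_sum for feat, w in base_weights.items()}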

β = 0.15  # weight factor for attendance boost

for m in match_data_list:
    # Core component: weighted sum of feature z-scores
    core_score = 0.0
    for feat, w in core_weights.items():
        core_score += w * m.get(f"z_{feat}", 0.0)
    m["core"] = core_score

    # Attendance boost: β * z_occ, clamped to [-0.30, +0.30]
    boost = β * m["z_occ"]
    if boost > 0.30: 
        boost = 0.30
    if boost < -0.30:
        boost = -0.30
    m["boost"] = boost

    # Raw combined score (before scaling)
    m["raw_score"] = core_score + boost

# Compute mean and std of the raw combined scores across all matches (for final scaling)
raw_values = [m["raw_score"] for m in match_data_list]
mu_raw  = sum(raw_values)/len(raw_values) if raw_values else 0.0
var_raw = sum((x - mu_raw)**2 for x in raw_values)/len(raw_values) if raw_values else 0.0
sigma_raw = math.sqrt(var_raw) if raw_values else 0.0

# Scale raw scores to 0–100 range with mean ~50 and std dev ~15
for m in match_data_list:
    if sigma_raw:
        z_raw = (m["raw_score"] - mu_raw) / (sigma_raw + 1e-9)
    else:
        z_raw = 0.0
    m["z_raw"]   = z_raw
    # Linear mapping: mean->50, 1 std->15 points
    SLS = 50 + 15 * z_raw
    # Clip to [0, 100]
    if SLS < 0:   SLS = 0.0
    if SLS > 100: SLS = 100.0
    m["SLS_Fplus"] = SLS


### Step 5: Save Results to CSV Files

With all matches processed, we can organize the results into tables. We will create one CSV file per round (matchweek), as well as an aggregated file for all rounds combined. Each table will have one row per match with the following columns:

- Round (matchweek number, 0-indexed in our data if round 0 = week 1).

- HomeTeam, AwayTeam.

- Occupancy (attendance fraction, e.g., 0.96 meaning 96% of stadium filled).

- xG_home, xG_away.

- Shots_home, Shots_away.

- ShotsOnTarget_home, ShotsOnTarget_away.

- BigChances_home, BigChances_away.

- Corners_home, Corners_away.

- TouchesOppBox_home, TouchesOppBox_away (total touches in opposition box for each team).

- SLS_Fplus (the computed liveliness score for the match, 0–100).

Let's convert our list of match dictionaries into a pandas DataFrame for convenient output, then save the CSVs:

In [14]:
# Convert the collected data into a DataFrame
df = pd.DataFrame(match_data_list)

# Select and order the columns for output
columns = [
    "Round", "HomeTeam", "AwayTeam", "Occupancy",
    "xG_home", "xG_away", "Shots_home", "Shots_away",
    "ShotsOnTarget_home", "ShotsOnTarget_away",
    "BigChances_home", "BigChances_away",
    "Corners_home", "Corners_away"
]
# Include touches in box columns if available in data
if any(df["TouchesOppBox_home"].notna()):
    columns += ["TouchesOppBox_home", "TouchesOppBox_away"]
# Append the target score
columns.append("SLS_Fplus")

df_out = df[columns]

# Ensure output directory exists
os.makedirs("tables", exist_ok=True)

# Save one CSV per round
for rnd, group in df_out.groupby("Round"):
    output_path = f"tables/round_{rnd}.csv"
    # Drop the Round column in individual round files (since it's obvious from filename)
    group.drop(columns=["Round"], inplace=False).to_csv(output_path, index=False)

# Also save a combined file for all rounds (with Round included)
df_out.to_csv("tables/all_rounds.csv", index=False)


## Feature Extraction

To predict the Spectator Liveliness Score (SLS-F+) before a match, we need to derive features from each team’s recent performance without any data leakage. We will compute per-team features based on the last $N=5$ matches (or fewer, if the team has not yet played five) as follows:

**Attacking form (per 90 minutes):** Average Expected Goals (xG), Shots on Target (SoT), Big Chances, Corners, and (optionally) Touches in Opposition Box over the last 5 matches. These features capture a team’s offensive output rate.

**Defensive form (per 90 minutes):** Average xG Against (opponent’s xG), Shots on Target against, and Big Chances against over the last 5 matches. These reflect how much the team concedes chances.

**Contextual factors:**

- A binary home-field indicator (implicitly handled by having separate home/away feature sets per match).

- Days of rest since each team’s previous match.

- Occupancy prior: the average stadium occupancy (fraction of capacity filled) in the home team’s recent home games, which serves as a proxy for crowd atmosphere. (For away teams, this specific feature is not used, since occupancy is tied to home stadiums.)
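
The rest-days feature can be sketched as a simple datetime difference. This assumes kickoff times have already been parsed into `datetime` objects; the function name `days_rest` and the dates are illustrative.

```python
from datetime import datetime


def days_rest(prev_kickoff, kickoff):
    """Days between two kickoffs, or None if the team has no previous match."""
    if prev_kickoff is None or kickoff is None:
        return None
    return (kickoff - prev_kickoff).total_seconds() / 86400.0


# Illustrative kickoff times
prev = datetime(2024, 8, 17, 15, 0)
curr = datetime(2024, 8, 24, 12, 30)
rest = days_rest(prev, curr)          # 6 days and 21.5 hours
first_match = days_rest(None, curr)   # no history yet
```

Keeping the fractional part distinguishes, for example, a Saturday-to-Tuesday turnaround from a Saturday-to-Wednesday one more finely than whole days would.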

We must ensure these features for a given match are computed only from past matches. For the first few matches of the season, when a team has little or no history (fewer than 3 games), we will use league-average values as a fallback to avoid unreliable small samples. The league averages are computed incrementally from matches that have already been played (e.g. using data from previous rounds), thereby avoiding any future data leakage. This way, early-season matches are essentially treated as between “average” teams until enough data accumulates.
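
The ordering that prevents leakage is: read the features first, then record the match. A minimal sketch under simplifying assumptions follows: it tracks a single stat for a single team, updates the league totals from that team's matches only, and uses 1.3 as an illustrative baseline prior. The names `LeagueAverage` and `team_feature` are hypothetical.

```python
class LeagueAverage:
    """Running league per-90 rate built only from matches already processed."""

    def __init__(self, baseline):
        self.baseline = baseline   # used before any match has been recorded
        self.total = 0.0
        self.minutes = 0.0

    def rate_per90(self):
        if self.minutes <= 0:
            return self.baseline
        return self.total / self.minutes * 90.0

    def update(self, stat_total, minutes):
        self.total += stat_total
        self.minutes += minutes


def team_feature(history, league, min_games=3, n=5):
    """Team per-90 rate from its last n matches, or the league rate if history is short."""
    if len(history) < min_games:
        return league.rate_per90()
    recent = history[-n:]
    minutes = sum(m for _, m in recent)
    return sum(s for s, _ in recent) / minutes * 90.0


league = LeagueAverage(baseline=1.3)
history = []      # (xG, minutes) tuples for one team, oldest first
features = []
for xg, minutes in [(1.0, 95.0), (2.0, 95.0), (1.5, 95.0), (0.5, 95.0)]:
    features.append(team_feature(history, league))  # read BEFORE recording the match
    history.append((xg, minutes))
    league.update(xg, minutes)
```

The first match gets the baseline, the next two get the running league rate, and only the fourth uses the team's own three-match history.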

Additionally, we can consider including the team identities (HomeTeam and AwayTeam) as categorical features. Incorporating team names (as one-hot encoded or categorical variables) can help the model capture team-specific tendencies and historical home advantage differences. However, this comes at the risk of overfitting to known teams; if our goal is purely within one season and we have enough data, it may be beneficial. In our approach below, we will keep the team names as features (which can later be encoded) so that the model can learn, for example, that certain teams’ matches tend to be more lively, or that certain teams perform better at home.
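
One way to encode team identity later is a plain one-hot expansion with separate home and away indicators. The sketch below uses placeholder team names and an illustrative helper `one_hot_teams`; a library encoder could be used in its place.

```python
def one_hot_teams(rows, teams):
    """Add home_<team> / away_<team> 0/1 columns to a copy of each row dict."""
    encoded = []
    for row in rows:
        out = dict(row)
        for team in teams:
            out[f"home_{team}"] = int(row["HomeTeam"] == team)
            out[f"away_{team}"] = int(row["AwayTeam"] == team)
        encoded.append(out)
    return encoded


# Placeholder fixtures
rows = [
    {"HomeTeam": "Team A", "AwayTeam": "Team B"},
    {"HomeTeam": "Team C", "AwayTeam": "Team A"},
]
# Fix the category list up front so every row gets the same columns
teams = sorted({r[k] for r in rows for k in ("HomeTeam", "AwayTeam")})
encoded = one_hot_teams(rows, teams)
```

Separate home and away indicators let a model learn different effects for the same team depending on venue, at the cost of twice as many columns.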

In [19]:
import os
import json
from datetime import datetime
import pandas as pd
import math

############################################################
# Config / helpers
############################################################

INDEX_PATH = "24-25_PL_Data_raw/index.json"
SLS_TABLE_PATH = "tables/all_rounds.csv"
OUT_DIR = "feature_tables"
os.makedirs(OUT_DIR, exist_ok=True)

# Baseline priors for first ~2 rounds before we have any history at all.
# These are just safe starting priors so model doesn't blow up early season.
BASELINE_PRIOR = {
    "xG_per90": 1.3,      # typical team xG per 90
    "SoT_per90": 3.5,     # shots on target per team per match
    "BigCh_per90": 2.0,   # big chances
    "Corn_per90": 5.0,    # corners
    "ToB_per90": 24.0,    # touches in opp box
    "occ": 0.95,          # 95% full, rough EPL vibe
    "league_xG_match": 2.6,
    "league_corners_match": 10.0
}

def safe_iso_to_dt(s):
    """Convert various FotMob-style timestamps -> datetime or None."""
    if s is None:
        return None
    # common forms: "2024-08-10T19:00:00.000Z", "2024-08-10T19:00:00Z"
    s = s.replace("Z","")
    try:
        return datetime.fromisoformat(s)
    except ValueError:
        return None

def safe_float(x):
    """Convert string/int/None -> float 0.0 fallback."""
    if x is None:
        return 0.0
    if isinstance(x,(int,float)):
        return float(x)
    try:
        return float(str(x).split()[0])
    except Exception:
        return 0.0

def per90(sum_stat, sum_mins):
    """Convert raw total over N minutes to per-90 rate."""
    if sum_mins is None or sum_mins <= 0:
        return 0.0
    return (sum_stat / sum_mins) * 90.0

############################################################
# Core rolling feature builder
############################################################

# We'll keep:
# - team_history[team] = list of dicts, one per past match
# - global aggregates for league context "so far"
team_history = {}

global_occ_sum = 0.0
global_occ_n = 0

# For per-team priors when not enough history:
global_team_minutes = 0.0
global_xG_sum = 0.0
global_SoT_sum = 0.0
global_BigCh_sum = 0.0
global_Corn_sum = 0.0
global_ToB_sum = 0.0

# For league-wide context features
global_match_count = 0
global_match_xG_sum = 0.0
global_match_corners_sum = 0.0

feature_rows = []

# Load index (season schedule grouped by round)
with open(INDEX_PATH, "r") as f:
    schedule = json.load(f)

# Loop rounds in chronological order
for round_info in schedule:
    rnd = round_info["round"]

    # First pass: read raw match JSONs for the round
    round_matches = []
    for m in round_info["matches"]:
        match_id = m["matchId"]
        home_team = m["home"]
        away_team = m["away"]

        # resolve path to the match JSON
        # index.json may say e.g. "round_0/4506...json"
        guess_path = os.path.join(os.path.dirname(INDEX_PATH), m["jsonPath"])
        if not os.path.exists(guess_path):
            # fallback: just take basename
            guess_path = os.path.join(os.path.dirname(INDEX_PATH),
                                      os.path.basename(m["jsonPath"]))
        if not os.path.exists(guess_path):
            # if file not found, skip
            continue

        with open(guess_path, "r") as jf:
            match_json = json.load(jf)

        # Match datetime (used for rest-days calc)
        match_date = None
        # Some FotMob exports contain this under general.matchTimeUTCDate
        match_date = safe_iso_to_dt(
            match_json.get("general",{}).get("matchTimeUTCDate")
        )

        # Attendance / capacity -> occupancy
        info_box = (
            match_json.get("content",{})
                      .get("matchFacts",{})
                      .get("infoBox",{})
        )
        attendance = info_box.get("Attendance")
        stadium = info_box.get("Stadium",{})
        capacity = stadium.get("capacity")
        attendance = attendance if isinstance(attendance,int) else safe_float(attendance)
        capacity = capacity if isinstance(capacity,int) else safe_float(capacity)
        occupancy = None
        if attendance and capacity and capacity > 0:
            occupancy = attendance / capacity

        # Approximate match minutes
        # If we don't have stoppage-time detail, assume ~95 total
        match_minutes = 95.0

        # Extract team stats from "Periods -> All -> stats"
        # We want: xG, ShotsOnTarget, big_chance, corners, touches_opp_box
        xG_home = xG_away = 0.0
        SoT_home = SoT_away = 0.0
        BigCh_home = BigCh_away = 0.0
        Corn_home = Corn_away = 0.0
        ToB_home = ToB_away = 0.0

        periods_all = (
            match_json.get("content",{})
                      .get("stats",{})
                      .get("Periods",{})
                      .get("All",{})
        )
        stats_groups = periods_all.get("stats", [])

        # Pull top_stats first
        top_stats_group = None
        for grp in stats_groups:
            if grp.get("key") == "top_stats":
                top_stats_group = grp
                break
        if top_stats_group:
            for stat_item in top_stats_group.get("stats",[]):
                key = stat_item.get("key")
                vals = stat_item.get("stats",[])
                if len(vals) < 2:
                    continue
                hv = safe_float(vals[0])
                av = safe_float(vals[1])

                if key == "expected_goals":
                    xG_home, xG_away = hv, av
                elif key == "ShotsOnTarget":
                    SoT_home, SoT_away = hv, av
                elif key == "big_chance":
                    BigCh_home, BigCh_away = hv, av
                elif key == "corners":
                    Corn_home, Corn_away = hv, av

        # Touches in opposition box might live in a different stat group
        for grp in stats_groups:
            for stat_item in grp.get("stats",[]):
                if stat_item.get("key") == "touches_opp_box":
                    vals = stat_item.get("stats",[])
                    if len(vals) >= 2:
                        ToB_home = safe_float(vals[0])
                        ToB_away = safe_float(vals[1])

        round_matches.append({
            "Round": rnd,
            "date": match_date,
            "home_team": home_team,
            "away_team": away_team,
            "minutes": match_minutes,
            "occupancy": occupancy,
            "xG_home": xG_home, "xG_away": xG_away,
            "SoT_home": SoT_home, "SoT_away": SoT_away,
            "BigCh_home": BigCh_home, "BigCh_away": BigCh_away,
            "Corn_home": Corn_home, "Corn_away": Corn_away,
            "ToB_home": ToB_home, "ToB_away": ToB_away
        })

    # Second pass: build forward-looking features for each match in this round
    for match in round_matches:
        home = match["home_team"]
        away = match["away_team"]

        # init team histories if first time seen
        if home not in team_history:
            team_history[home] = []
        if away not in team_history:
            team_history[away] = []

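        # helper: per-90 attacking form over the team's most recent matches (up to 5)
        def build_attacking(team):
            hist = team_history[team]
            if len(hist) < 3:
                # fewer than 3 prior matches: fall back to the league average per
                # team-minute so far, or to the baseline prior if nothing has been played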
                if global_team_minutes > 0:
                    per_min_xG   = global_xG_sum   / global_team_minutes
                    per_min_SoT  = global_SoT_sum  / global_team_minutes
                    per_min_BC   = global_BigCh_sum/ global_team_minutes
                    per_min_Corn = global_Corn_sum / global_team_minutes
                    per_min_ToB  = global_ToB_sum  / global_team_minutes
                    return {
                        "xG_att_90":   per_min_xG*90,
                        "SoT_att_90":  per_min_SoT*90,
                        "BigCh_att_90":per_min_BC*90,
                        "Corn_att_90": per_min_Corn*90,
                        "ToB_att_90":  per_min_ToB*90
                    }
                else:
                    return {
                        "xG_att_90":   BASELINE_PRIOR["xG_per90"],
                        "SoT_att_90":  BASELINE_PRIOR["SoT_per90"],
                        "BigCh_att_90":BASELINE_PRIOR["BigCh_per90"],
                        "Corn_att_90": BASELINE_PRIOR["Corn_per90"],
                        "ToB_att_90":  BASELINE_PRIOR["ToB_per90"]
                    }
            # else: use last up to 5 games
            recent = hist[-5:]
            mins_sum = sum(g["minutes"] for g in recent)
            return {
                "xG_att_90":   per90(sum(g["xG_for"]        for g in recent), mins_sum),
                "SoT_att_90":  per90(sum(g["SoT_for"]       for g in recent), mins_sum),
                "BigCh_att_90":per90(sum(g["BigCh_for"]     for g in recent), mins_sum),
                "Corn_att_90": per90(sum(g["Corn_for"]      for g in recent), mins_sum),
                "ToB_att_90":  per90(sum(g["ToB_for"]       for g in recent), mins_sum)
            }

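        # helper: per-90 defensive form (what the team has conceded recently)
        def build_defensive(team):
            hist = team_history[team]
            if len(hist) < 3:
                # league-wide, totals conceded equal totals created, so the
                # attacking sums double as the league-average concession rates
                if global_team_minutes > 0: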
                    per_min_xGA   = global_xG_sum   / global_team_minutes
                    per_min_SoTA  = global_SoT_sum  / global_team_minutes
                    per_min_BCA   = global_BigCh_sum/ global_team_minutes
                    return {
                        "xGA_def_90":    per_min_xGA*90,
                        "SoT_agst_90":   per_min_SoTA*90,
                        "BigCh_agst_90": per_min_BCA*90
                    }
                else:
                    return {
                        "xGA_def_90":    BASELINE_PRIOR["xG_per90"],
                        "SoT_agst_90":   BASELINE_PRIOR["SoT_per90"],
                        "BigCh_agst_90": BASELINE_PRIOR["BigCh_per90"]
                    }
            recent = hist[-5:]
            mins_sum = sum(g["minutes"] for g in recent)
            return {
                "xGA_def_90":    per90(sum(g["xG_against"]    for g in recent), mins_sum),
                "SoT_agst_90":   per90(sum(g["SoT_against"]   for g in recent), mins_sum),
                "BigCh_agst_90": per90(sum(g["BigCh_against"] for g in recent), mins_sum)
            }

        # helper: days since the team's previous match
        def days_rest(team, match_dt):
            hist = team_history[team]
            if not hist or match_dt is None or hist[-1]["date"] is None:
                return 7  # default: assume a week's rest when no previous date is known
            delta = match_dt - hist[-1]["date"]
            return max(delta.days, 0)

        # home occupancy prior = mean of last up to 5 home games' occupancy,
        # or league avg occupancy so far, or baseline.
        def occupancy_prior_home(team):
            hist = team_history[team]
            home_games = [g["occupancy"] for g in hist
                          if g["homeAway"] == "home" and g["occupancy"] is not None]
            if len(home_games) < 3:
                if global_occ_n > 0:
                    return global_occ_sum / global_occ_n
                return BASELINE_PRIOR["occ"]
            recent_home = home_games[-5:]
            return sum(recent_home)/len(recent_home)

        # league averages so far (for context features)
        if global_match_count > 0:
            league_xG_sofar = global_match_xG_sum / global_match_count
            league_corn_sofar = global_match_corners_sum / global_match_count
        else:
            league_xG_sofar = BASELINE_PRIOR["league_xG_match"]
            league_corn_sofar = BASELINE_PRIOR["league_corners_match"]

        # build forms
        home_att = build_attacking(home)
        away_att = build_attacking(away)
        home_def = build_defensive(home)
        away_def = build_defensive(away)

        home_rest = days_rest(home, match["date"])
        away_rest = days_rest(away, match["date"])

        occ_prior = occupancy_prior_home(home)

        # composite predictors
        # attack vs defense: a side's xG created plus the xG its opponent concedes (both per 90)
        Home_AttackVsDefense = home_att["xG_att_90"] + away_def["xGA_def_90"]
        Away_AttackVsDefense = away_att["xG_att_90"] + home_def["xGA_def_90"]
        # corners per 90 serve as the tempo proxy here
        TempoSum = home_att["Corn_att_90"] + away_att["Corn_att_90"]
        SoTSum = home_att["SoT_att_90"] + away_att["SoT_att_90"]
        DaysRestDiff = home_rest - away_rest

        # put it all together (wide row per match)
        row = {
            "Round": match["Round"],
            "HomeTeam": home,
            "AwayTeam": away,

            # context
            "Home_days_rest": home_rest,
            "Away_days_rest": away_rest,
            "DaysRestDiff": DaysRestDiff,
            "Home_occ_prior": occ_prior,
            "LeagueAvg_xG_perMatch_sofar": league_xG_sofar,
            "LeagueAvg_Corners_perMatch_sofar": league_corn_sofar,
            "HomeFlag": 1,  # by definition row is from home POV

            # home attacking form
            "Home_xG_att_90": home_att["xG_att_90"],
            "Home_SoT_att_90": home_att["SoT_att_90"],
            "Home_BigCh_att_90": home_att["BigCh_att_90"],
            "Home_Corn_att_90": home_att["Corn_att_90"],
            "Home_ToB_att_90": home_att["ToB_att_90"],

            # home defensive form
            "Home_xGA_def_90": home_def["xGA_def_90"],
            "Home_SoT_agst_90": home_def["SoT_agst_90"],
            "Home_BigCh_agst_90": home_def["BigCh_agst_90"],

            # away attacking form
            "Away_xG_att_90": away_att["xG_att_90"],
            "Away_SoT_att_90": away_att["SoT_att_90"],
            "Away_BigCh_att_90": away_att["BigCh_att_90"],
            "Away_Corn_att_90": away_att["Corn_att_90"],
            "Away_ToB_att_90": away_att["ToB_att_90"],

            # away defensive form
            "Away_xGA_def_90": away_def["xGA_def_90"],
            "Away_SoT_agst_90": away_def["SoT_agst_90"],
            "Away_BigCh_agst_90": away_def["BigCh_agst_90"],

            # composites
            "Home_AttackVsDefense": Home_AttackVsDefense,
            "Away_AttackVsDefense": Away_AttackVsDefense,
            "TempoSum": TempoSum,
            "SoTSum": SoTSum
        }

        feature_rows.append(row)

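        # --- AFTER computing features, update history and league aggregates with this match. ---
        # This ordering prevents leakage: a match's stats enter the histories only after its
        # pre-match row has been created. The league aggregates are updated match by match,
        # so later fixtures in a round see earlier ones; round_matches should be in kickoff order.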

        # update team_history for home
        team_history[home].append({
            "date": match["date"],
            "homeAway": "home",
            "opponent": away,
            "minutes": match["minutes"],
            "occupancy": match["occupancy"],
            "xG_for": match["xG_home"],
            "SoT_for": match["SoT_home"],
            "BigCh_for": match["BigCh_home"],
            "Corn_for": match["Corn_home"],
            "ToB_for": match["ToB_home"],
            "xG_against": match["xG_away"],
            "SoT_against": match["SoT_away"],
            "BigCh_against": match["BigCh_away"]
        })

        # update team_history for away
        team_history[away].append({
            "date": match["date"],
            "homeAway": "away",
            "opponent": home,
            "minutes": match["minutes"],
            "occupancy": match["occupancy"],
            "xG_for": match["xG_away"],
            "SoT_for": match["SoT_away"],
            "BigCh_for": match["BigCh_away"],
            "Corn_for": match["Corn_away"],
            "ToB_for": match["ToB_away"],
            "xG_against": match["xG_home"],
            "SoT_against": match["SoT_home"],
            "BigCh_against": match["BigCh_home"]
        })

        # update league running totals (each match adds two team-appearances of `mins` minutes)
        mins = match["minutes"]
        global_team_minutes += mins * 2
        global_xG_sum += match["xG_home"] + match["xG_away"]
        global_SoT_sum += match["SoT_home"] + match["SoT_away"]
        global_BigCh_sum += match["BigCh_home"] + match["BigCh_away"]
        global_Corn_sum += match["Corn_home"] + match["Corn_away"]
        global_ToB_sum += match["ToB_home"] + match["ToB_away"]

        if match["occupancy"] is not None:
            global_occ_sum += match["occupancy"]
            global_occ_n += 1

        # league-wide match context
        global_match_count += 1
        global_match_xG_sum += (match["xG_home"] + match["xG_away"])
        global_match_corners_sum += (match["Corn_home"] + match["Corn_away"])

############################################################
# Merge in the SLS_Fplus target and save CSVs
############################################################

wide_df = pd.DataFrame(feature_rows)

# Load the SLS table; it must provide the merge keys and the SLS_Fplus target.
sls_df = pd.read_csv(SLS_TABLE_PATH)

required_cols = ["Round", "HomeTeam", "AwayTeam", "SLS_Fplus"]
missing_cols = [c for c in required_cols if c not in sls_df.columns]
if missing_cols:
    # Earlier code may have saved the target under a different column name;
    # rename it to "SLS_Fplus" before this point if so.
    raise KeyError(f"{SLS_TABLE_PATH} is missing required columns: {missing_cols}")

model_df = pd.merge(
    wide_df,
    sls_df[["Round","HomeTeam","AwayTeam","SLS_Fplus"]],
    on=["Round","HomeTeam","AwayTeam"],
    how="left"
)
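
# --- Optional diagnostic (illustrative sketch, not part of the original pipeline) ---
# Because the merge is a left join, a fixture with no matching SLS row keeps its
# features but gets SLS_Fplus = NaN, for instance when a team name is spelled
# differently in the two tables. The hypothetical helper unmatched_keys lists
# those fixtures by key, e.g.:
#   sls_keys = set(zip(sls_df["Round"], sls_df["HomeTeam"], sls_df["AwayTeam"]))
#   print(unmatched_keys(feature_rows, sls_keys))
def unmatched_keys(rows, target_keys, key_fields=("Round", "HomeTeam", "AwayTeam")):
    """Return key tuples from rows that do not appear in target_keys."""
    return [
        key
        for key in (tuple(row[f] for f in key_fields) for row in rows)
        if key not in target_keys
    ]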

# Save wide match-level feature table
wide_out_path = os.path.join(OUT_DIR, "match_features_wide.csv")
model_df.to_csv(wide_out_path, index=False)

# Also build long format (one row per team per match) for per-team exploration
long_rows = []
for _, r in model_df.iterrows():
    # home row
    long_rows.append({
        "Round": r["Round"],
        "Team": r["HomeTeam"],
        "Role": "home",
        "Opponent": r["AwayTeam"],
        "xG_att_90": r["Home_xG_att_90"],
        "SoT_att_90": r["Home_SoT_att_90"],
        "BigCh_att_90": r["Home_BigCh_att_90"],
        "Corn_att_90": r["Home_Corn_att_90"],
        "ToB_att_90": r["Home_ToB_att_90"],
        "xGA_def_90": r["Home_xGA_def_90"],
        "SoT_agst_90": r["Home_SoT_agst_90"],
        "BigCh_agst_90": r["Home_BigCh_agst_90"],
        "Days_rest": r["Home_days_rest"],
        "Occ_prior": r["Home_occ_prior"],
        "AttackVsDefense": r["Home_AttackVsDefense"],
        "TempoSum": r["TempoSum"],
        "SoTSum": r["SoTSum"],
        "LeagueAvg_xG_perMatch_sofar": r["LeagueAvg_xG_perMatch_sofar"],
        "LeagueAvg_Corners_perMatch_sofar": r["LeagueAvg_Corners_perMatch_sofar"],
        "SLS_Fplus": r["SLS_Fplus"]
    })
    # away row
    long_rows.append({
        "Round": r["Round"],
        "Team": r["AwayTeam"],
        "Role": "away",
        "Opponent": r["HomeTeam"],
        "xG_att_90": r["Away_xG_att_90"],
        "SoT_att_90": r["Away_SoT_att_90"],
        "BigCh_att_90": r["Away_BigCh_att_90"],
        "Corn_att_90": r["Away_Corn_att_90"],
        "ToB_att_90": r["Away_ToB_att_90"],
        "xGA_def_90": r["Away_xGA_def_90"],
        "SoT_agst_90": r["Away_SoT_agst_90"],
        "BigCh_agst_90": r["Away_BigCh_agst_90"],
        "Days_rest": r["Away_days_rest"],
        "Occ_prior": None,  # away team doesn't "own" stadium occupancy
        "AttackVsDefense": r["Away_AttackVsDefense"],
        "TempoSum": r["TempoSum"],
        "SoTSum": r["SoTSum"],
        "LeagueAvg_xG_perMatch_sofar": r["LeagueAvg_xG_perMatch_sofar"],
        "LeagueAvg_Corners_perMatch_sofar": r["LeagueAvg_Corners_perMatch_sofar"],
        "SLS_Fplus": r["SLS_Fplus"]
    })

long_df = pd.DataFrame(long_rows)

long_out_path = os.path.join(OUT_DIR, "match_features_long.csv")
long_df.to_csv(long_out_path, index=False)

print("Saved:")
print(f" - {wide_out_path}")
print(f" - {long_out_path}")
print("Columns in wide file:")
print(model_df.columns.tolist())
print("Columns in long file:")
print(long_df.columns.tolist())

Saved:
 - feature_tables/match_features_wide.csv
 - feature_tables/match_features_long.csv
Columns in wide file:
['Round', 'HomeTeam', 'AwayTeam', 'Home_days_rest', 'Away_days_rest', 'DaysRestDiff', 'Home_occ_prior', 'LeagueAvg_xG_perMatch_sofar', 'LeagueAvg_Corners_perMatch_sofar', 'HomeFlag', 'Home_xG_att_90', 'Home_SoT_att_90', 'Home_BigCh_att_90', 'Home_Corn_att_90', 'Home_ToB_att_90', 'Home_xGA_def_90', 'Home_SoT_agst_90', 'Home_BigCh_agst_90', 'Away_xG_att_90', 'Away_SoT_att_90', 'Away_BigCh_att_90', 'Away_Corn_att_90', 'Away_ToB_att_90', 'Away_xGA_def_90', 'Away_SoT_agst_90', 'Away_BigCh_agst_90', 'Home_AttackVsDefense', 'Away_AttackVsDefense', 'TempoSum', 'SoTSum', 'SLS_Fplus']
Columns in long file:
['Round', 'Team', 'Role', 'Opponent', 'xG_att_90', 'SoT_att_90', 'BigCh_att_90', 'Corn_att_90', 'ToB_att_90', 'xGA_def_90', 'SoT_agst_90', 'BigCh_agst_90', 'Days_rest', 'Occ_prior', 'AttackVsDefense', 'TempoSum', 'SoTSum', 'LeagueAvg_xG_perMatch_sofar', 'LeagueAvg_Corners_perMatch_s