<a href="https://colab.research.google.com/github/Ashvin7/pl-xg-ml/blob/main/03_baseline_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Clone the GitHub repo into Colab
!git clone https://github.com/Ashvin7/pl-xg-ml.git

Cloning into 'pl-xg-ml'...
remote: Enumerating objects: 37, done.
remote: Counting objects: 100% (37/37), done.
remote: Compressing objects: 100% (35/35), done.
remote: Total 37 (delta 14), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (37/37), 287.55 KiB | 4.71 MiB/s, done.
Resolving deltas: 100% (14/14), done.


# Phase 3: Baseline Models

This notebook implements three baseline approaches:

## Model 1 — Linear Regression Baseline
- Goal: Predict `points`
- Input: the first of `xg_diff_league`, `xgd_league`, or `xg_per_match` present in the data (`xgd_league` for this dataset)
- Output: `points`
- Metrics: RMSE, MAE, Spearman, Kendall

## Model 2 — xG League Table Projection
- Goal: Rank teams
- Rank by: the first of `xg_diff_league`, `xgd_league`, or `xg_per_match` present in the data (`xgd_league` for this dataset)
- Metrics: Spearman, Kendall (rank vs actual table)

## Model 3 — Elo-Inspired Season Strength Model
- Goal: Predict **points** and **rankings**
- Steps:
  1) Initialize ratings in first season (e.g., 0)
  2) Update ratings season-by-season using xGD
  3) Use rating to:
     - Predict points (regression)
     - Rank teams


In [2]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

from scipy.stats import spearmanr, kendalltau

In [3]:
DATA_PATH = "pl-xg-ml/data/processed/phase2_model_dataset.csv"

df = pd.read_csv(DATA_PATH)

df.columns = [c.strip() for c in df.columns]
df["team"] = df["team"].astype(str).str.strip()
df["season"] = df["season"].astype(str).str.strip()

df = df.sort_values(["season", "team"]).reset_index(drop=True)

print("Shape:", df.shape)
print("Seasons:", sorted(df["season"].unique()))
df.head()


Shape: (160, 18)
Seasons: ['2017-18', '2018-19', '2019-20', '2020-21', '2021-22', '2022-23', '2023-24', '2024-25']


,season,team,season_start_year,season_idx,split,points,goal_diff,matches,goals_for,goals_against,points_per_match,gd_per_match,xg_league,xga_league,xgd_league,xgd_per_match,xg_squad,xg_per_match_squad
0,2017-18,Arsenal,2017,0,train,63,23,38,74,51,1.657895,0.605263,68.3,47.8,20.5,0.539474,68.3,1.797368
1,2017-18,Bournemouth,2017,0,train,44,-16,38,45,61,1.157895,-0.421053,38.8,59.2,-20.4,-0.536842,38.8,1.021053
2,2017-18,Brighton,2017,0,train,40,-20,38,34,54,1.052632,-0.526316,37.0,50.8,-13.8,-0.363158,37.0,0.973684
3,2017-18,Burnley,2017,0,train,54,-3,38,36,39,1.421053,-0.078947,32.3,51.2,-18.9,-0.497368,32.3,0.85
4,2017-18,Chelsea,2017,0,train,70,24,38,62,38,1.842105,0.631579,54.4,33.8,20.6,0.542105,54.4,1.431579


**In Phase 3, we establish baseline models to benchmark performance before moving to more advanced approaches.**

Baselines:

- Linear Regression: predict season points from a single xG-based feature.
- xG League Table Projection: rank teams by xG-based strength and compare to the actual table.
- Elo-inspired Season Strength: build a season-to-season “strength rating” using xGD, then use it to predict points and rank teams.

Evaluation:

- Regression: RMSE, MAE
- Ranking: Spearman, Kendall

Data leakage rule: When predicting season t, we only use information available up to season t-1 for the Elo-inspired rating.
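The leakage rule can be illustrated with a lagged feature. The following is a minimal sketch, not code from this notebook: the toy frame, its values, and the `xgd_prev` column name are made up for illustration. It assumes one row per team per season and uses `groupby(...).shift(1)` so that each row only sees the same team's previous row.

```python
import pandas as pd

# Hypothetical toy data: two teams, three seasons (values are made up).
toy = pd.DataFrame({
    "season": ["2017-18", "2018-19", "2019-20"] * 2,
    "team": ["A"] * 3 + ["B"] * 3,
    "xgd": [10.0, 12.0, 8.0, -5.0, -3.0, 1.0],
})

# Sort chronologically within each team, then shift by one row so that
# the feature for season t holds the value observed in the previous season.
toy = toy.sort_values(["team", "season"]).reset_index(drop=True)
toy["xgd_prev"] = toy.groupby("team")["xgd"].shift(1)

print(toy)
```

Each team's first season has no history, so `xgd_prev` is missing there. Note that `shift(1)` looks at the team's previous row in the data, which is season t-1 only if the team appears in every season.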

In [4]:
def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

def regression_metrics(y_true, y_pred):
    return {
        "RMSE": rmse(y_true, y_pred),
        "MAE": float(mean_absolute_error(y_true, y_pred)),
    }

def ranking_metrics(df_season, score_col, actual_col="points"):
    """
    Compare ranking induced by score_col vs actual points ranking for one season.
    """
    tmp = df_season[[score_col, actual_col]].dropna().copy()
    # higher score => better rank (descending)
    score_rank = tmp[score_col].rank(ascending=False, method="average")
    actual_rank = tmp[actual_col].rank(ascending=False, method="average")

    sp = spearmanr(score_rank, actual_rank).statistic
    kt = kendalltau(score_rank, actual_rank).statistic
    return {"Spearman": float(sp), "Kendall": float(kt)}
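`ranking_metrics` ranks both columns in descending order before correlating them. As a small sanity check (a sketch on made-up numbers, with `scores` and `points` as hypothetical inputs), the result should match calling SciPy on the raw values directly, because Spearman and Kendall depend only on relative order and flipping both rankings leaves the correlation unchanged.

```python
import pandas as pd
from scipy.stats import spearmanr, kendalltau

# Made-up scores and points for five teams (note the tie in points).
scores = pd.Series([30.0, 12.5, -4.0, 8.0, -20.0])
points = pd.Series([80, 60, 45, 60, 30])

# Descending ranks with average handling of ties, as in ranking_metrics.
score_rank = scores.rank(ascending=False, method="average")
points_rank = points.rank(ascending=False, method="average")

# Correlations on the explicit ranks vs. on the raw values.
rho_ranked, _ = spearmanr(score_rank, points_rank)
rho_raw, _ = spearmanr(scores, points)
tau_ranked, _ = kendalltau(score_rank, points_rank)
tau_raw, _ = kendalltau(scores, points)

print(rho_ranked, rho_raw)
print(tau_ranked, tau_raw)
```

The explicit ranking step is therefore a readability choice rather than a requirement; it makes the "higher score means better rank" convention visible in the code.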

In [5]:
print("Columns:", list(df.columns))
print("Shape:", df.shape)
print("Seasons:", sorted(df["season"].unique()))

# Pick the best available single-feature baseline
candidate_features = ["xg_diff_league", "xgd_league", "xg_per_match"]
feature_col = None
for c in candidate_features:
    if c in df.columns:
        feature_col = c
        break

if feature_col is None:
    raise ValueError(f"None of {candidate_features} exist in df. Update candidate_features to match your dataset.")

print("Using single-feature baseline:", feature_col)

Columns: ['season', 'team', 'season_start_year', 'season_idx', 'split', 'points', 'goal_diff', 'matches', 'goals_for', 'goals_against', 'points_per_match', 'gd_per_match', 'xg_league', 'xga_league', 'xgd_league', 'xgd_per_match', 'xg_squad', 'xg_per_match_squad']
Shape: (160, 18)
Seasons: ['2017-18', '2018-19', '2019-20', '2020-21', '2021-22', '2022-23', '2023-24', '2024-25']
Using single-feature baseline: xgd_league


**Train/Test Split (Chronological)**

We use a season-based split to avoid leakage:

- Train on older seasons
- Test on the most recent season (or the last N seasons)

This better reflects real forecasting: use history to predict future outcomes.

In [6]:
seasons = sorted(df["season"].unique())
test_seasons = [seasons[-1]]
train_seasons = seasons[:-1]

train_df = df[df["season"].isin(train_seasons)].dropna(subset=["points", feature_col]).copy()
test_df  = df[df["season"].isin(test_seasons)].dropna(subset=["points", feature_col]).copy()

print("Train seasons:", train_seasons)
print("Test seasons:", test_seasons)
print("Train rows:", train_df.shape[0], "Test rows:", test_df.shape[0])

Train seasons: ['2017-18', '2018-19', '2019-20', '2020-21', '2021-22', '2022-23', '2023-24']
Test seasons: ['2024-25']
Train rows: 140 Test rows: 20
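The "last N seasons" variant mentioned above can be sketched as a small helper. `chronological_split` and its `n_test` parameter are hypothetical names, not part of the notebook. The sketch assumes season labels such as `'2017-18'` sort chronologically as strings, which holds for the labels printed above.

```python
def chronological_split(season_labels, n_test=1):
    """Split season labels into (train, test), holding out the last n_test seasons."""
    ordered = sorted(set(season_labels))
    if not 0 < n_test < len(ordered):
        raise ValueError("n_test must be between 1 and len(seasons) - 1")
    return ordered[:-n_test], ordered[-n_test:]

all_seasons = ["2017-18", "2018-19", "2019-20", "2020-21",
               "2021-22", "2022-23", "2023-24", "2024-25"]

# Hold out the two most recent seasons instead of one.
train, test = chronological_split(all_seasons, n_test=2)
print("Train:", train)
print("Test:", test)
```

With `n_test=1` the helper reproduces the split used in this notebook.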


**Model 1: Linear Regression Baseline**

Goal: Predict season points using a single xG-based feature.

Input: `xgd_league` (the column selected as `feature_col` above)

Output: points

Metrics: RMSE, MAE + ranking correlations (Spearman, Kendall)

In [7]:
X_train = train_df[[feature_col]].values
y_train = train_df["points"].values

X_test  = test_df[[feature_col]].values
y_test  = test_df["points"].values

lin = LinearRegression()
lin.fit(X_train, y_train)

pred_test = lin.predict(X_test)

print("Linear Regression coefficients:")
print("  intercept:", float(lin.intercept_))
print("  coef:", float(lin.coef_[0]))

reg = regression_metrics(y_test, pred_test)
print("\nTest regression metrics:", reg)

# Attach preds for ranking eval
test_pred_df = test_df[["season", "team", "points", feature_col]].copy()
test_pred_df["pred_points_linreg"] = pred_test
test_pred_df.sort_values(["season", "pred_points_linreg"], ascending=[True, False]).head(10)

Linear Regression coefficients:
  intercept: 52.61640870372261
  coef: 0.7430463029138841

Test regression metrics: {'RMSE': 5.976387855950987, 'MAE': 4.118390863317276}


,season,team,points,xgd_league,pred_points_linreg
151,2024-25,Liverpool,84,43.6,85.013228
140,2024-25,Arsenal,74,25.5,71.564089
145,2024-25,Chelsea,69,20.5,67.848858
152,2024-25,Manchester City,71,20.4,67.774553
154,2024-25,Newcastle Utd,66,18.3,66.214156
142,2024-25,Bournemouth,56,15.5,64.133626
146,2024-25,Crystal Palace,53,11.3,61.012832
141,2024-25,Aston Villa,66,6.0,57.074687
144,2024-25,Brighton,61,4.1,55.662899
143,2024-25,Brentford,56,3.6,55.291375
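As a quick check on how the fitted line maps xGD to points, the printed intercept and coefficient can be applied by hand. This sketch copies the values from the output above; `predict_points` is a hypothetical helper, not part of the notebook.

```python
# Coefficients copied from the printed output above.
INTERCEPT = 52.61640870372261
COEF = 0.7430463029138841

def predict_points(xgd):
    """Points predicted by the single-feature linear model."""
    return INTERCEPT + COEF * xgd

# xgd_league values taken from the table above.
for team, xgd in [("Liverpool", 43.6), ("Arsenal", 25.5), ("Brentford", 3.6)]:
    print(f"{team}: {predict_points(xgd):.6f}")
```

Read this way, a team with an xGD of zero is predicted about 52.6 points, and each additional expected goal of difference is worth roughly 0.74 points.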


In [8]:
results_rank_lin = []
for s in test_seasons:
    d = test_pred_df[test_pred_df["season"] == s]
    m = ranking_metrics(d, "pred_points_linreg", actual_col="points")
    m["season"] = s
    results_rank_lin.append(m)

rank_lin_df = pd.DataFrame(results_rank_lin)
rank_lin_df

,Spearman,Kendall,season
0,0.91833,0.801086,2024-25


**Model 2: xG League Table Projection**

Goal: Rank teams using an xG-based “strength” score and compare to the actual table.

We’ll evaluate ranking quality (Spearman, Kendall) season-by-season.

In [9]:
rank_score_col = None
for c in ["xg_diff_league", "xgd_league", "xg_per_match"]:
    if c in df.columns:
        rank_score_col = c
        break

print("Ranking score column:", rank_score_col)

results_rank_xg = []
for s in seasons:
    d = df[df["season"] == s].dropna(subset=[rank_score_col, "points"])
    if len(d) < 5:
        continue
    m = ranking_metrics(d, rank_score_col, actual_col="points")
    m["season"] = s
    results_rank_xg.append(m)

rank_xg_df = pd.DataFrame(results_rank_xg)
rank_xg_df

Ranking score column: xgd_league


,Spearman,Kendall,season
0,0.720905,0.542584,2017-18
1,0.928518,0.814826,2018-19
2,0.876507,0.702167,2019-20
3,0.845749,0.709005,2020-21
4,0.860474,0.744066,2021-22
5,0.875188,0.684211,2022-23
6,0.803162,0.620692,2023-24
7,0.91833,0.801086,2024-25


In [10]:
rank_xg_df.describe(include="all")

,Spearman,Kendall,season
count,8.0,8.0,8
unique,,,8
top,,,2017-18
freq,,,1
mean,0.853604,0.70233,
std,0.066606,0.090005,
min,0.720905,0.542584,
25%,0.835102,0.668331,
50%,0.867831,0.705586,
75%,0.886963,0.758321,


**Model 3: Elo-inspired Season Strength (Season-to-Season)**

We build a latent team strength rating updated season-by-season using xG differential (xGD).

**Key idea:**

1) Initialize ratings in the first season.
2) After each season, update each team’s rating based on its xGD.
3) When predicting season t, use the rating computed from seasons up to t-1 (no leakage).

We’ll then:

- Use the rating to predict points (regression)
- Use the rating to predict rankings (Spearman/Kendall)
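Written out, the update in the code cell that follows is an exponential moving average of standardized xGD. The notation here is introduced only for illustration: $r_i^{(t)}$ is team $i$'s rating entering season $t$, $z_i^{(t)}$ is its xGD standardized within season $t$ (mean $\mu^{(t)}$, population standard deviation $\sigma^{(t)}$), and $\alpha$ is the update weight.

$$
r_i^{(t+1)} = (1-\alpha)\, r_i^{(t)} + \alpha\, z_i^{(t)},
\qquad
z_i^{(t)} = \frac{\mathrm{xGD}_i^{(t)} - \mu^{(t)}}{\sigma^{(t)}}
$$

Starting from $r_i^{(0)} = 0$ and assuming the team appears in every season, unrolling the recursion gives

$$
r_i^{(t)} = \alpha \sum_{k=0}^{t-1} (1-\alpha)^{k}\, z_i^{(t-1-k)},
$$

so the most recent season carries weight $\alpha$ and each older season is discounted by a further factor of $(1-\alpha)$. In seasons where a team is absent from the data, the loop performs no update and the rating simply carries over.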

In [11]:
# Choose xGD source for Elo updates
xgd_col = "xgd_league" if "xgd_league" in df.columns else ("xg_diff_league" if "xg_diff_league" in df.columns else None)
if xgd_col is None:
    raise ValueError("Need xgd_league or xg_diff_league in df to build Elo-inspired rating.")

alpha = 0.35  # update weight (tunable)

seasons = sorted(df["season"].unique())
teams = sorted(df["team"].unique())

# init ratings at 0
ratings = {t: 0.0 for t in teams}

rows = []
for s in seasons:
    season_df = df[df["season"] == s].copy()

    # Record current (pre-update) rating for this season (this is what you'd know at season start)
    season_df["elo_rating_pre"] = season_df["team"].map(ratings)

    # Standardize xGD *within season* so scale is stable across years
    tmp = season_df.dropna(subset=[xgd_col])
    if len(tmp) > 0:
        mu = tmp[xgd_col].mean()
        sd = tmp[xgd_col].std(ddof=0)
        if sd == 0:  # all teams identical: avoid division by zero
            sd = 1.0
    else:
        mu, sd = 0.0, 1.0

    season_df["xgd_z"] = (season_df[xgd_col] - mu) / sd

    # After season finishes, update ratings using that season's xGD
    for _, r in season_df.iterrows():
        t = r["team"]
        xgd_z = r["xgd_z"]
        if pd.isna(xgd_z):
            continue
        ratings[t] = (1 - alpha) * ratings[t] + alpha * float(xgd_z)

    rows.append(season_df)

elo_df = pd.concat(rows, ignore_index=True)
elo_df[["season", "team", "points", xgd_col, "elo_rating_pre", "xgd_z"]].head(10)

,season,team,points,xgd_league,elo_rating_pre,xgd_z
0,2017-18,Arsenal,63,20.5,0.0,0.914393
1,2017-18,Bournemouth,44,-20.4,0.0,-0.910378
2,2017-18,Brighton,40,-13.8,0.0,-0.615916
3,2017-18,Burnley,54,-18.9,0.0,-0.843454
4,2017-18,Chelsea,70,20.6,0.0,0.918854
5,2017-18,Crystal Palace,44,5.3,0.0,0.236239
6,2017-18,Everton,49,-11.9,0.0,-0.531146
7,2017-18,Huddersfield,37,-17.2,0.0,-0.767608
8,2017-18,Leicester City,47,1.9,0.0,0.084546
9,2017-18,Liverpool,75,39.1,0.0,1.74424
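To make the update concrete, here is a hand calculation for one team. It is a sketch that reuses `alpha = 0.35` and Arsenal's 2017-18 `xgd_z` from the table above; `update_rating` is a hypothetical helper that mirrors the update line inside the loop.

```python
ALPHA = 0.35  # same update weight as in the cell above

def update_rating(rating, xgd_z, alpha=ALPHA):
    """Blend the old rating with the season's standardized xGD."""
    return (1 - alpha) * rating + alpha * xgd_z

# Arsenal enter 2017-18 with rating 0.0 and post xgd_z = 0.914393.
arsenal_next_pre = update_rating(0.0, 0.914393)
print(round(arsenal_next_pre, 6))
```

The result, roughly 0.32, is the value that would be recorded as Arsenal's `elo_rating_pre` for the next season in which they appear.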


In [12]:
# IMPORTANT: to avoid leakage, use elo_rating_pre as the feature (rating available entering the season)
train_elo = elo_df[elo_df["season"].isin(train_seasons)].dropna(subset=["points", "elo_rating_pre"]).copy()
test_elo  = elo_df[elo_df["season"].isin(test_seasons)].dropna(subset=["points", "elo_rating_pre"]).copy()

X_train = train_elo[["elo_rating_pre"]].values
y_train = train_elo["points"].values

X_test = test_elo[["elo_rating_pre"]].values
y_test = test_elo["points"].values

lin_elo = LinearRegression()
lin_elo.fit(X_train, y_train)
pred_test = lin_elo.predict(X_test)

print("Elo->Points linear model:")
print("  intercept:", float(lin_elo.intercept_))
print("  coef:", float(lin_elo.coef_[0]))

print("\nTest regression metrics:", regression_metrics(y_test, pred_test))

elo_test_pred = test_elo[["season", "team", "points", "elo_rating_pre"]].copy()
elo_test_pred["pred_points_elo"] = pred_test
elo_test_pred.sort_values(["season", "pred_points_elo"], ascending=[True, False]).head(10)

Elo->Points linear model:
  intercept: 50.80686648257863
  coef: 21.770512735127856

Test regression metrics: {'RMSE': 15.221167811613292, 'MAE': 12.297920062646615}


,season,team,points,elo_rating_pre,pred_points_elo
152,2024-25,Manchester City,71,2.039073,95.198539
151,2024-25,Liverpool,84,1.489139,83.226178
140,2024-25,Arsenal,74,1.161549,76.094386
145,2024-25,Chelsea,69,0.692867,65.890943
157,2024-25,Tottenham,38,0.326895,57.923535
144,2024-25,Brighton,61,0.277094,56.839348
154,2024-25,Newcastle Utd,66,0.251926,56.291434
153,2024-25,Manchester Utd,42,0.218163,55.556387
143,2024-25,Brentford,56,0.09667,52.911432
149,2024-25,Ipswich Town,22,0.0,50.806866
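The last row above shows a side effect of initializing ratings at 0: Ipswich Town have no earlier season in this dataset, so they enter 2024-25 with `elo_rating_pre = 0.0` and are predicted exactly the intercept. The sketch below reproduces a few rows by hand from the printed coefficients; `predict_from_rating` is a hypothetical helper, and the ratings are the rounded values shown in the table.

```python
# Coefficients copied from the printed Elo->Points output above.
INTERCEPT = 50.80686648257863
COEF = 21.770512735127856

def predict_from_rating(rating):
    """Points predicted from a pre-season rating by the Elo->points line."""
    return INTERCEPT + COEF * rating

# (team, pre-season rating, actual points) copied from the table above.
rows = [
    ("Manchester City", 2.039073, 71),
    ("Tottenham", 0.326895, 38),
    ("Ipswich Town", 0.0, 22),
]
for team, rating, actual in rows:
    pred = predict_from_rating(rating)
    print(f"{team}: predicted {pred:.1f}, actual {actual}, error {pred - actual:+.1f}")
```

Errors of this size on individual teams are consistent with the test RMSE being much higher than Model 1's, although the table only shows the ten highest predictions.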


In [13]:
results_rank_elo = []
for s in test_seasons:
    d = elo_test_pred[elo_test_pred["season"] == s].copy()
    m = ranking_metrics(d, "elo_rating_pre", actual_col="points")
    m["season"] = s
    results_rank_elo.append(m)

rank_elo_df = pd.DataFrame(results_rank_elo)
rank_elo_df

,Spearman,Kendall,season
0,0.59014,0.450943,2024-25


**Phase 3 Results Summary**

We compare:

Model 1: Linear regression on `xgd_league` (`feature_col`)

Model 2: xG ranking projection using `xgd_league` (`rank_score_col`)

Model 3: Elo-inspired season strength rating (pre-season rating)

Next: choose the strongest baseline and carry it into Phase 4+ as a benchmark.

In [14]:
summary_rows = []

# Model 1
for s in test_seasons:
    d = test_pred_df[test_pred_df["season"] == s].copy()
    r = ranking_metrics(d, "pred_points_linreg", actual_col="points")
    summary_rows.append({
        "model": f"Model1_LinReg({feature_col})",
        "season": s,
        **r
    })

# Model 2
for s in test_seasons:
    d = df[df["season"] == s].dropna(subset=[rank_score_col, "points"]).copy()
    r = ranking_metrics(d, rank_score_col, actual_col="points")
    summary_rows.append({
        "model": f"Model2_xGRank({rank_score_col})",
        "season": s,
        **r
    })

# Model 3
for s in test_seasons:
    d = elo_test_pred[elo_test_pred["season"] == s].copy()
    r = ranking_metrics(d, "elo_rating_pre", actual_col="points")
    summary_rows.append({
        "model": "Model3_EloRating(pre)",
        "season": s,
        **r
    })

summary_df = pd.DataFrame(summary_rows)
summary_df

,model,season,Spearman,Kendall
0,Model1_LinReg(xgd_league),2024-25,0.91833,0.801086
1,Model2_xGRank(xgd_league),2024-25,0.91833,0.801086
2,Model3_EloRating(pre),2024-25,0.59014,0.450943
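Models 1 and 2 report identical rank correlations. That is expected rather than a coincidence: with a single feature and a positive coefficient, the linear prediction is a strictly increasing function of `xgd_league`, so both induce the same ordering of teams. A small sketch using the Model 1 coefficients and the ten 2024-25 xGD values printed earlier:

```python
INTERCEPT = 52.61640870372261
COEF = 0.7430463029138841  # positive, so the ordering is preserved

# xgd_league values for the ten teams shown in the Model 1 output.
xgd = {
    "Liverpool": 43.6, "Arsenal": 25.5, "Chelsea": 20.5,
    "Manchester City": 20.4, "Newcastle Utd": 18.3, "Bournemouth": 15.5,
    "Crystal Palace": 11.3, "Aston Villa": 6.0, "Brighton": 4.1,
    "Brentford": 3.6,
}
pred = {team: INTERCEPT + COEF * x for team, x in xgd.items()}

# Order teams from best to worst under each score.
order_by_xgd = sorted(xgd, key=xgd.get, reverse=True)
order_by_pred = sorted(pred, key=pred.get, reverse=True)
print(order_by_xgd == order_by_pred)  # True
```

For ranking purposes Model 1 is therefore equivalent to Model 2; it adds information only through its point estimates (RMSE, MAE).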
