# xG Prediction from Starting XI via Player-Dummy Regression

This notebook fits a **linear regression (OLS)** model that predicts a team's match **xG** from its **starting 11** using **player dummy variables** (one-hot indicators).

It then **predicts xG** for your **proposed optimal lineup**.

---

## Data
Expected CSV columns (example):
- `match_id`
- `team_id`, `team_name`
- `opponent_id`, `opponent_name`
- `starting_11_players` (stringified Python list of 11 player names)
- `team_xg` (float)
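The `starting_11_players` column arrives as text, so it has to be parsed back into a list before any features can be built. A minimal sketch of that step follows; the raw string is a made-up, shortened cell value rather than a row from the dataset.

```python
import ast

# Hypothetical raw cell value, shortened to three names for illustration
raw = "['Joshua Kimmich', 'Leroy Sané', 'Harry Kane']"

# literal_eval only parses Python literals; it does not execute code
lineup = ast.literal_eval(raw)

print(type(lineup).__name__, len(lineup))
```

`ast.literal_eval` is preferred over `eval` here because it rejects anything that is not a plain literal.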



In [1]:
import pandas as pd
import numpy as np
import ast
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

# Optional: statsmodels for classical OLS summaries
import statsmodels.api as sm

DATA_PATH = "match_team_lineups_xg.csv"  # <-- update if needed

df = pd.read_csv(DATA_PATH)

# Drop incomplete rows first: ast.literal_eval raises on NaN values
df = df.dropna(subset=["team_xg", "starting_11_players"])

# Parse the stringified list safely
df["starting_11_players"] = df["starting_11_players"].apply(ast.literal_eval)

# Keep only rows whose lineup parsed to a non-empty list
df = df[df["starting_11_players"].apply(lambda x: isinstance(x, list) and len(x) > 0)].copy()

print("Rows:", len(df))
print(df.head(3))


Rows: 612
   match_id  team_id               team_name  opponent_id  \
0    122838       33       FC Bayern München           38   
1    122838       38        SV Werder Bremen           33   
2    122839       37  RasenBallsport Leipzig           41   

         opponent_name                                starting_11_players  \
0     SV Werder Bremen  [Joshua Kimmich, Leroy Sané, Harry Kane, Kings...   
1    FC Bayern München  [Mitchell Weiser, Milos Veljkovic, Leonardo Bi...   
2  Bayer 04 Leverkusen  [Timo Werner, Willi Orbán, Benjamin Henrichs, ...   

   team_xg  
0  2.58382  
1  0.80446  
2  1.34721  


## 1) Build design matrices

We build player one-hot features with `MultiLabelBinarizer`.  
Two model variants are constructed:

- **Model A:** Player dummies only  
- **Model B:** Player dummies + **team** and **opponent** fixed effects (one-hot for `team_id` and `opponent_id`)

> Tip: Model B often stabilizes player coefficients by absorbing team/opponent baseline strength.
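To make the encoding concrete, here is a small sketch of what `MultiLabelBinarizer` produces. The three two-player lineups and the names `toy_lineups`, `toy_mlb`, and `toy_X` are invented for illustration; the `p::` prefix mirrors the one used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy lineups (hypothetical): each inner list is one team's starters in one match
toy_lineups = [["A", "B"], ["B", "C"], ["A", "C"]]

toy_mlb = MultiLabelBinarizer()
toy_X = pd.DataFrame(
    toy_mlb.fit_transform(toy_lineups),
    columns=[f"p::{name}" for name in toy_mlb.classes_],
)

# One column per distinct player, one row per lineup, 1 where the player started
print(toy_X)
```

Each row sums to the lineup size, which matters later when an intercept is added to the model.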


In [2]:
# Player one-hot
mlb = MultiLabelBinarizer(sparse_output=False)
X_players = mlb.fit_transform(df["starting_11_players"])
player_names = mlb.classes_
X_players = pd.DataFrame(X_players, columns=[f"p::{n}" for n in player_names], index=df.index)

y = df["team_xg"].astype(float)

# Fixed effects (optional)
X_team = pd.get_dummies(df["team_id"], prefix="team", dtype=float)
X_opp  = pd.get_dummies(df["opponent_id"], prefix="opp", dtype=float)

# Model A, B
X_A = X_players.copy()
X_B = pd.concat([X_players, X_team, X_opp], axis=1)

print("Unique players:", len(player_names))
print("X_A shape:", X_A.shape, "X_B shape:", X_B.shape)


Unique players: 421
X_A shape: (612, 421) X_B shape: (612, 457)


## 2) Fit OLS (and a safe fallback)

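With many dummies, classical OLS can be **rank-deficient** (perfect collinearity): there are 421 player columns for 612 rows before the train/test split, and if every lineup has exactly 11 players, each row's player dummies sum to 11, which makes them collinear with the intercept.  
We therefore fit three estimators:

1) **statsmodels OLS**, for a classical summary with robust standard errors;
2) **sklearn LinearRegression** (OLS), for prediction;
3) **Ridge**, which is optional but recommended if OLS is unstable.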

You can choose which model to use for the final prediction.
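The rank problem can be reproduced on a tiny design matrix. The sketch below uses a made-up 4-match, 4-player example with two starters per row; the data and variable names are assumptions for illustration only.

```python
import numpy as np

# Toy design (hypothetical): 4 lineups, 4 players, exactly 2 starters per row
X = np.array([
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
], dtype=float)
y = np.array([2.0, 1.0, 1.5, 1.5])  # made-up xG values

# Add an intercept column, as an OLS fit with a constant would
X_const = np.column_stack([np.ones(len(X)), X])

rank = np.linalg.matrix_rank(X_const)
print("columns:", X_const.shape[1], "rank:", rank)

# lstsq still returns a solution: the minimum-norm one among all least-squares fits
beta, _, rank_lstsq, _ = np.linalg.lstsq(X_const, y, rcond=None)
print("min-norm coefficients:", np.round(beta, 3))
```

The rank is lower than the column count, so the individual coefficients are not uniquely determined even though a fit is returned.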


In [3]:
RANDOM_SEED = 7
TEST_SIZE = 0.2

def fit_and_eval(X: pd.DataFrame, y: pd.Series, name: str):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=TEST_SIZE, random_state=RANDOM_SEED
    )

    # 1) statsmodels OLS (its default pinv solver tolerates singular designs;
    #    the try/except is only a safeguard)
    sm_result = None
    try:
        X_train_sm = sm.add_constant(X_train, has_constant="add")
        X_test_sm  = sm.add_constant(X_test,  has_constant="add")
        sm_model = sm.OLS(y_train.values, X_train_sm.values)
        sm_result = sm_model.fit(cov_type="HC3")  # robust SEs
        y_pred_sm = sm_result.predict(X_test_sm.values)
        mae = mean_absolute_error(y_test, y_pred_sm)
        r2  = r2_score(y_test, y_pred_sm)
        print(f"[{name}] statsmodels OLS: MAE={mae:.3f}, R2={r2:.3f}")
    except Exception as e:
        print(f"[{name}] statsmodels OLS failed (likely collinearity): {type(e).__name__}: {e}")

    # 2) sklearn OLS (always works; uses pseudo-inverse)
    lr = LinearRegression(n_jobs=None)
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    r2  = r2_score(y_test, y_pred)
    print(f"[{name}] sklearn OLS:      MAE={mae:.3f}, R2={r2:.3f}")

    # 3) Ridge (optional)
    ridge = Ridge(alpha=1.0, random_state=RANDOM_SEED)
    ridge.fit(X_train, y_train)
    y_pred_r = ridge.predict(X_test)
    mae_r = mean_absolute_error(y_test, y_pred_r)
    r2_r  = r2_score(y_test, y_pred_r)
    print(f"[{name}] Ridge(alpha=1):   MAE={mae_r:.3f}, R2={r2_r:.3f}")

    return {"sm": sm_result, "ols": lr, "ridge": ridge}

models_A = fit_and_eval(X_A, y, "Model A (players only)")
print("-"*80)
models_B = fit_and_eval(X_B, y, "Model B (players+team+opp FE)")




[Model A (players only)] statsmodels OLS: MAE=1.849, R2=-7.545
[Model A (players only)] sklearn OLS:      MAE=1.849, R2=-7.535
[Model A (players only)] Ridge(alpha=1):   MAE=0.724, R2=-0.122
--------------------------------------------------------------------------------




[Model B (players+team+opp FE)] statsmodels OLS: MAE=1.623, R2=-4.786
[Model B (players+team+opp FE)] sklearn OLS:      MAE=1.623, R2=-4.777
[Model B (players+team+opp FE)] Ridge(alpha=1):   MAE=0.724, R2=-0.036


## 3) (Optional) View top player effects

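For interpretability, we use the **Model B Ridge** coefficients, which are more stable than the OLS ones, and show the largest positive and negative coefficients. The ranking covers every Model B feature, so team and opponent fixed effects (for example `opp_41`) can appear alongside players. Because Ridge's held-out R² is slightly negative (-0.036), treat these values as rough indications rather than reliable player effects.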
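One way to see why Ridge coefficients are more stable: the penalty makes the normal equations solvable even when the dummy matrix is rank-deficient. The sketch below solves the closed-form Ridge system on a made-up design. It omits the intercept for brevity, which is an assumption that differs from sklearn's `Ridge` default, and all data and names are hypothetical.

```python
import numpy as np

# Toy design (hypothetical): 4 lineups, 4 players, 2 starters per row
X = np.array([
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
], dtype=float)
y = np.array([2.0, 1.0, 1.5, 1.5])  # made-up xG values
alpha = 1.0

gram = X.T @ X
print("rank of X'X:", np.linalg.matrix_rank(gram), "of", gram.shape[0])

# Adding alpha * I makes the system positive definite, so it has a unique solution
penalized = gram + alpha * np.eye(gram.shape[0])
beta_ridge = np.linalg.solve(penalized, X.T @ y)
print("ridge coefficients:", np.round(beta_ridge, 3))
```

The price of that uniqueness is shrinkage toward zero, so the magnitudes depend on the chosen `alpha`.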


In [4]:
def top_coefs(model, feature_names, k=15):
    coefs = pd.Series(model.coef_, index=feature_names)
    pos = coefs.sort_values(ascending=False).head(k)
    neg = coefs.sort_values(ascending=True).head(k)
    return pos, neg

feature_names_B = X_B.columns.tolist()
pos, neg = top_coefs(models_B["ridge"], feature_names_B, k=15)

print("Top + coefficients (Ridge, Model B):")
display(pos.to_frame("coef"))
print("\nTop - coefficients (Ridge, Model B):")
display(neg.to_frame("coef"))


Top + coefficients (Ridge, Model B):


| feature | coef |
|---|---|
| p::Xavi Simons | 0.658705 |
| p::Nadiem Amiri | 0.61816 |
| p::Steffen Tigges | 0.592471 |
| p::Deniz Undav | 0.580088 |
| p::Lukas Daschner | 0.543394 |
| p::Silas | 0.542139 |
| p::Donyell Malen | 0.525222 |
| p::Timo Werner | 0.522667 |
| p::Leroy Sané | 0.499289 |
| p::Tim Oermann | 0.497723 |



Top - coefficients (Ridge, Model B):


| feature | coef |
|---|---|
| p::Aleksandar Pavlovic | -0.774804 |
| opp_41 | -0.720435 |
| p::Junior Dina Ebimbe | -0.656733 |
| p::Patrick Osterhage | -0.595317 |
| p::Dayot Upamecano | -0.591132 |
| p::Nico Elvedi | -0.543542 |
| p::Emir Karic | -0.52801 |
| p::Fábio Carvalho | -0.521751 |
| p::Angelo Stiller | -0.508488 |
| p::Moritz Jenz | -0.508291 |


## 4) Predict xG for your proposed optimal lineup

Fill in:
- `OPTIMAL_LINEUP` (list of 11 player names)
- `TEAM_ID` (integer) and `OPPONENT_ID` (integer) **if using Model B**
- Choose `PREDICT_WITH` among: `"A_OLS"`, `"A_RIDGE"`, `"B_OLS"`, `"B_RIDGE"`

If you don't have a specific opponent in mind, you can:
- Use Model A (no opponent/team FE), or
- For Model B, set opponent/team IDs to the **most common** in the training data (rough baseline).
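
Because every feature is a 0/1 indicator, a prediction reduces to a sum of coefficients. The notation below is introduced here for illustration and is not used elsewhere in the notebook: $\beta_0$ is the intercept, $\beta_p$ the coefficient of player $p$, $\gamma_t$ the effect of team $t$, $\delta_o$ the effect of opponent $o$, $L$ the proposed lineup, and $S$ the set of players seen in training.

```latex
\widehat{\mathrm{xG}} = \beta_0 + \sum_{p \in L \cap S} \beta_p + \gamma_t + \delta_o
```

Players outside $S$ contribute nothing to the sum, and Model A drops the $\gamma_t$ and $\delta_o$ terms.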


In [None]:
# -----------------------------
# USER INPUTS (edit these)
# -----------------------------
OPTIMAL_LINEUP = [
    # paste your 11 names exactly as they appear in the dataset
    # "Player 1", "Player 2", ...
]

TEAM_ID = None       # e.g., 33 for FC Bayern München
OPPONENT_ID = None   # e.g., 38 for SV Werder Bremen

PREDICT_WITH = "B_RIDGE"  # one of: "A_OLS", "A_RIDGE", "B_OLS", "B_RIDGE"


# -----------------------------
# Helpers
# -----------------------------
def make_feature_row(lineup, X_columns, team_id=None, opp_id=None):
    # player dummies
    lineup = list(lineup)
    x_players = pd.Series(0.0, index=X_columns)

    # set player indicators (only for those seen during training)
    for p in lineup:
        col = f"p::{p}"
        if col in x_players.index:
            x_players[col] = 1.0

    # team/opponent fixed effects if present
    if team_id is not None:
        tcol = f"team_{team_id}"
        if tcol in x_players.index:
            x_players[tcol] = 1.0

    if opp_id is not None:
        ocol = f"opp_{opp_id}"
        if ocol in x_players.index:
            x_players[ocol] = 1.0

    return x_players.to_frame().T

def pick_default_id(series):
    # most frequent category
    return series.value_counts().idxmax()

if PREDICT_WITH.startswith("B") and (TEAM_ID is None or OPPONENT_ID is None):
    # choose defaults for a generic baseline if user didn't set IDs
    if TEAM_ID is None:
        TEAM_ID = int(pick_default_id(df["team_id"]))
    if OPPONENT_ID is None:
        OPPONENT_ID = int(pick_default_id(df["opponent_id"]))
    print(f"Using defaults TEAM_ID={TEAM_ID}, OPPONENT_ID={OPPONENT_ID} (edit above if you want specific).")

if len(OPTIMAL_LINEUP) == 0:
    raise ValueError("Please paste your OPTIMAL_LINEUP list of 11 players above.")

# -----------------------------
# Build row and predict
# -----------------------------
if PREDICT_WITH == "A_OLS":
    row = make_feature_row(OPTIMAL_LINEUP, X_A.columns)
    pred = float(models_A["ols"].predict(row)[0])
elif PREDICT_WITH == "A_RIDGE":
    row = make_feature_row(OPTIMAL_LINEUP, X_A.columns)
    pred = float(models_A["ridge"].predict(row)[0])
elif PREDICT_WITH == "B_OLS":
    row = make_feature_row(OPTIMAL_LINEUP, X_B.columns, TEAM_ID, OPPONENT_ID)
    pred = float(models_B["ols"].predict(row)[0])
elif PREDICT_WITH == "B_RIDGE":
    row = make_feature_row(OPTIMAL_LINEUP, X_B.columns, TEAM_ID, OPPONENT_ID)
    pred = float(models_B["ridge"].predict(row)[0])
else:
    raise ValueError("Invalid PREDICT_WITH option.")

# Diagnostics: which lineup players were unseen?
seen = set(player_names)
unseen = [p for p in OPTIMAL_LINEUP if p not in seen]

print("Predicted xG:", pred)
print("Unseen players (not in training data, treated as 0):", unseen)


## 5) (Optional) Save model artifacts

This cell saves:
- player vocabulary (`players.json`)
- fitted coefficients (CSV)

Useful if you want to plug prediction into your lineup-optimization pipeline.
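
As a rough sketch of how a downstream pipeline might consume these artifacts, the snippet below scores a lineup from a coefficient table alone. The table contents, the intercept value, and the helper `predict_from_coefs` are hypothetical; only the two-column layout follows the CSV written by this notebook. Note that sklearn keeps the intercept in `intercept_`, separate from `coef_`.

```python
import pandas as pd

# Hypothetical coefficient table in the same layout as coef_modelB_ridge.csv
coef_df = pd.DataFrame({
    "feature": ["p::A", "p::B", "p::C", "team_1", "opp_2"],
    "coef_ridge_modelB": [0.30, -0.10, 0.20, 0.15, -0.25],
})
intercept = 1.40  # hypothetical; in practice taken from the fitted model's intercept_

coefs = coef_df.set_index("feature")["coef_ridge_modelB"]

def predict_from_coefs(lineup, team_id, opp_id):
    """Sum the intercept and the coefficients of all active indicator features."""
    active = [f"p::{p}" for p in lineup] + [f"team_{team_id}", f"opp_{opp_id}"]
    # reindex yields NaN for features unseen in training; they contribute 0
    return intercept + coefs.reindex(active).fillna(0.0).sum()

pred = predict_from_coefs(["A", "C", "Z"], team_id=1, opp_id=2)
print(round(pred, 3))
```

Here player `"Z"` is not in the table and is ignored, which matches how the prediction cell above treats unseen players.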


In [None]:
import json, pathlib

OUT_DIR = pathlib.Path("xg_lineup_model_artifacts")
OUT_DIR.mkdir(exist_ok=True)

# Save player list
(OUT_DIR / "players.json").write_text(json.dumps(list(player_names), indent=2), encoding="utf-8")

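# Save coefficients for chosen model
coef_df = pd.DataFrame({
    "feature": X_B.columns,
    "coef_ridge_modelB": models_B["ridge"].coef_
})
coef_df.to_csv(OUT_DIR / "coef_modelB_ridge.csv", index=False)

# coef_ excludes the intercept, which is needed to reproduce predictions
(OUT_DIR / "intercept_modelB_ridge.json").write_text(
    json.dumps({"intercept": float(models_B["ridge"].intercept_)}), encoding="utf-8"
)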

print("Saved to:", OUT_DIR.resolve())
