# EPL Shots + Shots on Target (SOT) Props — End-to-End Notebook

This notebook is a **reproducible pipeline** to model **EPL player props** for **Shots** and **Shots on Target (SOT)**, produce **fair odds**, **de-vig** bookmaker prices, backtest with **walk-forward** splits, and generate a weekly **candidate table**.

**No guarantee of profit.** The goal is a rigorous process that can *detect* (or falsify) an edge.

---

## Where the data comes from (default plan)

### 1) Player minutes + shots outcomes (historical)
**FBref via the `soccerdata` Python package**:
- Match schedule (`read_schedule`) gives match date + home/away teams + a stable `game_id`.
- Lineups (`read_lineup`) give **started** + **minutes_played** + position.
- Shot events (`read_shot_events`) are aggregated to per-player **shots** and **SOT** per match.

This is enough to build a player-match table for 5+ seasons **without a paid stats feed**.
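
As a rough illustration of the aggregation step, the sketch below counts shots and SOT per player-match from a list of shot events using only the standard library. The event tuples and outcome labels (`Goal`, `Saved`, `Off Target`, `Blocked`) are made-up stand-ins rather than the exact FBref payload; the notebook itself performs this step with pandas.

```python
from collections import Counter

# Hypothetical shot events: (match_id, player, outcome). Labels are assumptions.
events = [
    ("m1", "Player A", "Goal"),
    ("m1", "Player A", "Off Target"),
    ("m1", "Player A", "Saved"),
    ("m1", "Player B", "Blocked"),
    ("m2", "Player A", "Saved"),
]

ON_TARGET = {"goal", "saved"}  # outcomes assumed to count as on target

shots = Counter()
sot = Counter()
for match_id, player, outcome in events:
    key = (match_id, player)
    shots[key] += 1
    # Exact label match, so a label that merely contains "saved" is not counted.
    if outcome.strip().lower() in ON_TARGET:
        sot[key] += 1

# One record per player-match; players without an on-target shot get sot = 0.
player_match = {key: {"shots": n, "sot": sot.get(key, 0)} for key, n in shots.items()}
print(player_match)
```

Players who appear in the lineup but take no shots never show up in the event list, which is why the pipeline left-joins these counts onto the lineup table and fills the gaps with zero.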

### 2) Bookmaker odds (weekly + optional historical)
**The Odds API** (requires an API key):
- Sport key: `soccer_epl`
- Player-prop markets used here: `player_shots`, `player_shots_on_target`
- Fetch **upcoming** odds via the event-odds endpoint, and (if you have a paid plan) fetch **historical snapshots** for backtests.

If you don't want to use an API, you can also provide `odds_df.csv` / `upcoming_odds.csv` manually in the schema described below.
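
For orientation, the event-odds request described above might be assembled as follows. This is a sketch that only builds the URL (no network call); the endpoint path, sport key, and market keys are the ones used later in this notebook, while `event_odds_url`, the event id, and the API key shown are placeholders.

```python
from urllib.parse import urlencode

BASE = "https://api.the-odds-api.com/v4"
SPORT = "soccer_epl"
MARKETS = ["player_shots", "player_shots_on_target"]

def event_odds_url(event_id: str, api_key: str, regions: str = "us") -> str:
    """Build the event-odds URL for the two player-prop markets (hypothetical helper)."""
    params = {
        "apiKey": api_key,
        "regions": regions,
        "markets": ",".join(MARKETS),
        "oddsFormat": "decimal",
    }
    return f"{BASE}/sports/{SPORT}/events/{event_id}/odds/?{urlencode(params)}"

# Placeholder values; substitute a real event id and your own key.
url = event_odds_url("EVENT_ID_PLACEHOLDER", "YOUR_API_KEY")
print(url)
```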

---

## Files this notebook will create or use (CSV)

### `model_df.csv` (auto-built if missing)
A player-match history table with at least:
- `match_id, date, season, team, opponent, home_away`
- `player_id, player_name, position`
- `minutes, started`
- targets: `shots, sot`

### `odds_df.csv` (you provide OR build from API snapshots you store)
Schema:
- `match_id, date, player_id, market, line, odds_over, odds_under, book, timestamp`
Optional:
- `odds_over_close, odds_under_close, timestamp_close`

### `upcoming_odds.csv` (optional; can be pulled from The Odds API)
Same schema as `odds_df.csv` but for upcoming matches.
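
If you build these files by hand, a quick schema check can catch missing columns early. The sketch below writes one hypothetical row to an in-memory CSV and reads it back; every value in the row (ids, book name, prices) is a placeholder, and only the column names come from the schema above.

```python
import csv
import io

REQUIRED = [
    "match_id", "date", "player_id", "market", "line",
    "odds_over", "odds_under", "book", "timestamp",
]

# Hypothetical row; all values are placeholders for illustration.
row = {
    "match_id": "abc123",
    "date": "2025-01-04",
    "player_id": "0f1e2d3c4b5a",
    "market": "shots",          # "shots" or "sot"
    "line": 1.5,
    "odds_over": 1.80,          # decimal odds
    "odds_under": 2.00,
    "book": "ExampleBook",
    "timestamp": "2025-01-03T18:00:00Z",
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=REQUIRED)
writer.writeheader()
writer.writerow(row)

# Read it back and verify that every required column is present.
buf.seek(0)
rows = list(csv.DictReader(buf))
missing = [c for c in REQUIRED if c not in rows[0]]
print("missing columns:", missing)
```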

---

## What you get out
- Minutes uncertainty model: **P(start)** + **minutes distribution**
- Shots model: **Negative Binomial** rate per 90 + minutes integration
- SOT model: **direct NB** vs **structural (Shots × Accuracy)**
- De-vig (proportional + Shin)
- Walk-forward backtest + calibration plots + (optional) CLV
- `candidates.csv` + `model_version.json`
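
To make the two de-vig options concrete, here is a minimal standard-library sketch for a two-way over/under market. It is not necessarily the implementation used later in the notebook: the function names are illustrative, the Shin variant solves for the insider-trading parameter `z` by bisection using the standard Shin formula, and it assumes the implied probabilities sum to at least 1.

```python
import math

def devig_proportional(odds_over: float, odds_under: float):
    """Scale the implied probabilities so that they sum to 1."""
    q_over, q_under = 1.0 / odds_over, 1.0 / odds_under
    total = q_over + q_under
    return q_over / total, q_under / total

def devig_shin(odds_over: float, odds_under: float, iters: int = 100):
    """Shin de-vig: find z such that the adjusted probabilities sum to 1."""
    q = [1.0 / odds_over, 1.0 / odds_under]
    booksum = sum(q)

    def probs(z: float):
        return [
            (math.sqrt(z * z + 4.0 * (1.0 - z) * qi * qi / booksum) - z) / (2.0 * (1.0 - z))
            for qi in q
        ]

    lo, hi = 0.0, 0.999
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(probs(mid)) > 1.0:
            lo = mid  # probabilities still too large: increase z
        else:
            hi = mid
    p = probs(0.5 * (lo + hi))
    s = sum(p)  # renormalise to absorb any residual numerical error
    return p[0] / s, p[1] / s

prop = devig_proportional(1.5, 2.6)
shin = devig_shin(1.5, 2.6)
print("proportional:", prop)
print("shin:        ", shin)
```

On a lopsided market the Shin adjustment removes relatively more margin from the longshot side than the proportional method does, which is the usual reason to compare the two.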


In [None]:
# 1) Setup & config

from __future__ import annotations

import os
import json
import math
from dataclasses import dataclass, asdict
from typing import Dict, List, Tuple, Optional

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display

from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline

import statsmodels.api as sm
from statsmodels.discrete.discrete_model import NegativeBinomial

RNG = np.random.default_rng(7)

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 200)

@dataclass
class Config:
    data_dir: str = "data"
    model_df_path: str = "data/model_df.csv"
    odds_df_path: str = "data/odds_df.csv"              # historical (optional)
    upcoming_odds_path: str = "data/upcoming_odds.csv"  # upcoming (optional)
    cache_dir: str = "cache"

    # Feature params
    ewm_span_player: int = 10
    ewm_span_team: int = 12
    min_minutes_for_rates: float = 10.0

    # Accuracy smoothing prior: p = (sot + a) / (shots + a + b)
    acc_alpha: float = 2.0
    acc_beta: float = 6.0

    # Minutes model
    minutes_max: int = 95
    minutes_mc_samples: int = 4000

    # NB sampling
    count_mc_samples: int = 4000

    # De-vig method
    devig_method: str = "proportional"  # "proportional" or "shin"

    # Betting / filtering rules
    min_ev_mean: float = 0.03
    require_ev_low_positive: bool = True
    min_p_start: float = 0.55
    max_minutes_ci_width: float = 35.0

    # Backtest
    walk_forward_step_days: int = 14
    min_train_rows: int = 20000

CFG = Config()
os.makedirs(CFG.cache_dir, exist_ok=True)

print(CFG)
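
The accuracy prior in the config, `p = (sot + a) / (shots + a + b)`, behaves like adding `a` on-target and `b` off-target pseudo-shots to each player's history. A small worked sketch with the default `a = 2`, `b = 6` (the function name is illustrative, and the shot counts are invented):

```python
def smoothed_accuracy(sot: float, shots: float, a: float = 2.0, b: float = 6.0) -> float:
    """Shrink raw accuracy sot/shots toward the prior mean a / (a + b)."""
    return (sot + a) / (shots + a + b)

no_history = smoothed_accuracy(0, 0)       # falls back to the prior mean a / (a + b)
small_sample = smoothed_accuracy(3, 10)    # (3 + 2) / (10 + 8)
large_sample = smoothed_accuracy(30, 100)  # (30 + 2) / (100 + 8)

print(no_history, small_sample, large_sample)
```

With the same raw accuracy of 30%, the larger sample ends up closer to 0.30, while a player with no shots sits exactly at the prior mean.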


In [None]:
# 2) Utilities
import re
import hashlib

_TEAM_ALIASES = {
    # common FBref vs sportsbook naming differences
    "man utd": "manchester united",
    "man united": "manchester united",
    "man city": "manchester city",
    "spurs": "tottenham",
    "wolves": "wolverhampton",
    "newcastle utd": "newcastle united",
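    # Long-form sportsbook names and FBref abbreviations.
    # These spellings are assumptions; verify them against your own feeds.
    "brighton and hove albion": "brighton",
    "tottenham hotspur": "tottenham",
    "wolverhampton wanderers": "wolverhampton",
    "west ham united": "west ham",
    "nott ham forest": "nottingham forest",  # "Nott'ham Forest" after punctuation is stripped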
}

def canon_team(s: str) -> str:
    s = str(s).lower()
    s = re.sub(r"[^a-z0-9 ]+", " ", s)
    s = re.sub(r"\butd\b", "united", s)
    s = re.sub(r"\bfc\b", "", s)
    s = re.sub(r"\s+", " ", s).strip()
    return _TEAM_ALIASES.get(s, s)

def canon_player(s: str) -> str:
    s = str(s).lower()
    s = re.sub(r"[^a-z0-9 ]+", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

def name_hash_id(name: str, n: int = 12) -> str:
    h = hashlib.sha1(canon_player(name).encode("utf-8")).hexdigest()
    return h[:n]

def parse_date(s: pd.Series) -> pd.Series:
    return pd.to_datetime(s, utc=False, errors="coerce").dt.tz_localize(None)

def assert_columns(df: pd.DataFrame, cols: List[str], name: str) -> None:
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise ValueError(f"{name} missing columns: {missing}")

def safe_div(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    b = np.asarray(b)
    out = np.zeros_like(np.asarray(a), dtype=float)
    m = b != 0
    out[m] = np.asarray(a)[m] / b[m]
    return out

def pos_bucket(pos: str) -> str:
    # Simple mapping; tweak for your data's position codes.
    # Codes are matched as whole tokens so that "WB" (wing-back) is not
    # misread as a winger through a bare "W" substring.
    codes = set(re.findall(r"[A-Z]+", str(pos).upper()))
    if codes & {"FW", "ST", "CF"}:
        return "FWD"
    if codes & {"W", "LW", "RW", "AM", "CAM", "LM", "RM"}:
        return "ATT_MID"
    if codes & {"MF", "CM", "DM", "CDM"}:
        return "MID"
    if codes & {"DF", "CB", "FB", "LB", "RB", "WB"}:
        return "DEF"
    if "GK" in codes:
        return "GK"
    return "OTHER"

def clip_minutes(x: np.ndarray, mx: int) -> np.ndarray:
    return np.clip(x, 0, mx)

def nb2_sample(mu: np.ndarray, alpha: float, rng: np.random.Generator) -> np.ndarray:
    """Sample Negative Binomial with mean mu and variance mu + alpha*mu^2 (NB2)."""
    mu = np.asarray(mu, dtype=float)
    if alpha <= 1e-12:
        return rng.poisson(mu)
    r = 1.0 / alpha
    p = r / (r + mu)
    return rng.negative_binomial(r, p, size=mu.shape)

def event_from_count(count: np.ndarray, line: float) -> np.ndarray:
    return (count > line).astype(int)

def implied_prob_decimal(odds: float) -> float:
    return 1.0 / odds if odds and odds > 0 else np.nan

def to_implied_probs(odds_over: float, odds_under: float) -> Tuple[float, float]:
    q_over = implied_prob_decimal(odds_over)
    q_under = implied_prob_decimal(odds_under)
    return q_over, q_under
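
The NB2 parameterisation used by `nb2_sample` (mean `mu`, variance `mu + alpha * mu^2`, with `r = 1 / alpha` and `p = r / (r + mu)`) can be checked analytically. The sketch below computes the PMF with the standard library and the probability of an "over", following the `count > line` convention of `event_from_count`. The values of `mu` and `alpha` are illustrative, `mu` is assumed to be strictly positive, and the helper names are not part of the notebook.

```python
import math

def nb2_pmf(k: int, mu: float, alpha: float) -> float:
    """NB2 probability of exactly k events (requires mu > 0 and alpha > 0)."""
    r = 1.0 / alpha
    p = r / (r + mu)
    log_pmf = (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(p) + k * math.log(1.0 - p)
    )
    return math.exp(log_pmf)

def prob_over(line: float, mu: float, alpha: float) -> float:
    """P(count > line): one minus the PMF summed over 0..floor(line)."""
    k_max = math.floor(line)
    return 1.0 - sum(nb2_pmf(k, mu, alpha) for k in range(k_max + 1))

mu, alpha = 2.0, 0.4  # illustrative values only
pmf = [nb2_pmf(k, mu, alpha) for k in range(200)]  # tail beyond 200 is negligible here
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print("mean:", mean, "variance:", var, "target variance:", mu + alpha * mu ** 2)
print("P(over 1.5):", prob_over(1.5, mu, alpha))
```

Comparing the analytic probability with the Monte Carlo estimate from `nb2_sample` is a cheap way to confirm that the sampler and the line convention agree.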


In [None]:
# 3) Load / build data (FBref stats + odds inputs)

import time
import requests

def load_model_df(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    required = [
        "match_id","date","season","team","opponent","home_away",
        "player_id","player_name","position",
        "minutes","started","shots","sot"
    ]
    assert_columns(df, required, "model_df")
    df["date"] = parse_date(df["date"])
    df = df.dropna(subset=["date"]).copy()
    df["home_away"] = df["home_away"].astype(str).str.upper().str[0]  # H/A
    df["started"] = df["started"].astype(int)
    for c in ["minutes","shots","sot"]:
        df[c] = pd.to_numeric(df[c], errors="coerce")
    df["pos_bucket"] = df["position"].map(pos_bucket)
    df = df.sort_values(["date","match_id","player_id"]).reset_index(drop=True)
    return df

def load_odds_df(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    required = ["match_id","date","player_id","market","line","odds_over","odds_under","book","timestamp"]
    assert_columns(df, required, "odds_df")
    df["date"] = parse_date(df["date"])
    df["market"] = df["market"].astype(str).str.lower()
    df["line"] = pd.to_numeric(df["line"], errors="coerce")
    df["odds_over"] = pd.to_numeric(df["odds_over"], errors="coerce")
    df["odds_under"] = pd.to_numeric(df["odds_under"], errors="coerce")
    # optional close
    for c in ["odds_over_close","odds_under_close"]:
        if c in df.columns:
            df[c] = pd.to_numeric(df[c], errors="coerce")
    df = df.dropna(subset=["date","line","odds_over","odds_under"]).copy()
    df = df.sort_values(["date","match_id","player_id","market","line"]).reset_index(drop=True)
    return df

os.makedirs(CFG.data_dir, exist_ok=True)
os.makedirs(CFG.cache_dir, exist_ok=True)

# ---- 3.1 Build model_df from FBref (default stats source) ----
# soccerdata uses FBref endpoints and caches locally.
# Seasons are written as explicit "YYYY-YY" strings: bare integers can be parsed
# ambiguously by soccerdata (e.g. 2021 may be read as the 2020-21 season).
DEFAULT_SEASONS = ["2020-21", "2021-22", "2022-23", "2023-24", "2024-25"]

def build_model_df_from_fbref(
    seasons: List[str] = DEFAULT_SEASONS,
    leagues: str = "ENG-Premier League",
    output_csv: str = CFG.model_df_path,
) -> pd.DataFrame:
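    try:
        import soccerdata as sd
    except ImportError as e:
        raise ImportError(
            "Missing dependency 'soccerdata'. Install it with: pip install soccerdata"
        ) from e
    from pathlib import Path

    # Pass a Path: soccerdata treats data_dir as a pathlib.Path when creating the cache folder.
    fbref = sd.FBref(leagues=leagues, seasons=seasons, data_dir=Path(CFG.cache_dir))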

    # Schedule gives us stable game_id (we use this as match_id)
    sched = fbref.read_schedule().reset_index()
    assert_columns(sched, ["league","season","game","date","home_team","away_team","game_id"], "fbref.schedule")
    sched = sched.rename(columns={"game_id":"match_id"})
    sched["date"] = parse_date(sched["date"])
    sched = sched.dropna(subset=["date"]).copy()

    sched_small = sched[["league","season","game","match_id","date","home_team","away_team"]].copy()
    sched_small["home_team_c"] = sched_small["home_team"].map(canon_team)
    sched_small["away_team_c"] = sched_small["away_team"].map(canon_team)

    # Lineups: starter + minutes played + position
    lineups = fbref.read_lineup().reset_index()
    assert_columns(lineups, ["league","season","game","player","team","is_starter","position","minutes_played"], "fbref.lineups")

    lineups = lineups.merge(sched_small, on=["league","season","game"], how="left")
    lineups = lineups.dropna(subset=["match_id","date"]).copy()
    lineups = lineups.rename(columns={
        "player":"player_name",
        "team":"team",
        "minutes_played":"minutes",
        "is_starter":"started",
        "position":"position",
    })
    lineups["started"] = lineups["started"].astype(int)
    lineups["minutes"] = pd.to_numeric(lineups["minutes"], errors="coerce").fillna(0.0)

    # Shot events: aggregate to shots and shots-on-target per player-match
    shots = fbref.read_shot_events().reset_index()
    assert_columns(shots, ["league","season","game","player","team","outcome"], "fbref.shot_events")
    shots = shots.merge(sched_small, on=["league","season","game"], how="left")
    shots = shots.dropna(subset=["match_id","date"]).copy()

    # Define SoT from shot outcomes: "Goal" and "Saved" count as on target.
    # Exact matching avoids counting any label that merely contains "saved"
    # (e.g. "Saved off Target") as on target.
    out = shots["outcome"].astype(str).str.strip().str.lower()
    is_sot = out.isin(["goal", "saved"])

    agg_shots = (
        shots.groupby(["match_id","team","player"], as_index=False)
        .size()
        .rename(columns={"size":"shots"})
    )
    agg_sot = (
        shots.loc[is_sot]
        .groupby(["match_id","team","player"], as_index=False)
        .size()
        .rename(columns={"size":"sot"})
    )
    agg = agg_shots.merge(agg_sot, on=["match_id","team","player"], how="left").fillna({"sot":0})
    agg = agg.rename(columns={"player":"player_name"})
    agg["shots"] = pd.to_numeric(agg["shots"], errors="coerce").fillna(0.0)
    agg["sot"] = pd.to_numeric(agg["sot"], errors="coerce").fillna(0.0)

    model_df = lineups.merge(agg, on=["match_id","team","player_name"], how="left").fillna({"shots":0,"sot":0})

    # Add opponent and home/away
    home_ctx = sched_small.rename(columns={"home_team":"team","away_team":"opponent"})[["match_id","date","season","team","opponent"]]
    home_ctx["home_away"] = "H"
    away_ctx = sched_small.rename(columns={"away_team":"team","home_team":"opponent"})[["match_id","date","season","team","opponent"]]
    away_ctx["home_away"] = "A"
    ctx = pd.concat([home_ctx, away_ctx], ignore_index=True)

    model_df = model_df.merge(ctx, on=["match_id","date","season","team"], how="left")
    model_df["home_away"] = model_df["home_away"].astype(str).str.upper().str[0]
    model_df["player_id"] = model_df["player_name"].map(lambda s: name_hash_id(s))
    model_df["position"] = model_df["position"].astype(str)
    model_df = model_df[[
        "match_id","date","season","team","opponent","home_away",
        "player_id","player_name","position",
        "minutes","started","shots","sot"
    ]].copy()

    model_df.to_csv(output_csv, index=False)
    print(f"[OK] Wrote {output_csv} with {len(model_df):,} player-match rows")
    return model_df

# ---- 3.2 Odds helpers (The Odds API) ----
# This is optional: you can always supply odds_df.csv / upcoming_odds.csv manually.
ODDS_API_KEY = os.environ.get("ODDS_API_KEY", "").strip()
ODDS_API_SPORT = "soccer_epl"
ODDS_API_REGIONS = "us"  # player props coverage is mainly US books (per provider docs)
ODDS_API_MARKETS = ["player_shots", "player_shots_on_target"]  # "shots" and "SOT"

def _oddsapi_get(url: str, params: Dict[str, str]) -> dict:
    r = requests.get(url, params=params, timeout=60)
    r.raise_for_status()
    return r.json()

def fetch_upcoming_odds_from_oddsapi(
    api_key: str,
    max_events: int = 30,
    output_csv: Optional[str] = CFG.upcoming_odds_path,
) -> pd.DataFrame:
    # 1) List current/upcoming events (to get event IDs)
    events_url = f"https://api.the-odds-api.com/v4/sports/{ODDS_API_SPORT}/events"
    events = _oddsapi_get(events_url, {"apiKey": api_key})
    events = events[:max_events]

    # Pull the FBref schedule to map odds events -> FBref match_id (game_id).
    # model_df only covers past matches, so a fresh schedule is read here
    # (cheap thanks to caching) when soccerdata is installed.
    if os.path.exists(CFG.model_df_path):
        try:
            from pathlib import Path
            import soccerdata as sd
            fbref = sd.FBref(leagues="ENG-Premier League", seasons=DEFAULT_SEASONS, data_dir=Path(CFG.cache_dir))
            sched = fbref.read_schedule().reset_index().rename(columns={"game_id":"match_id"})
            sched["date"] = parse_date(sched["date"])
            sched = sched.dropna(subset=["date"]).copy()
        except Exception:
            sched = pd.DataFrame(columns=["match_id","date","home_team","away_team"])
    else:
        sched = pd.DataFrame(columns=["match_id","date","home_team","away_team"])

    if len(sched):
        sched["home_c"] = sched["home_team"].map(canon_team)
        sched["away_c"] = sched["away_team"].map(canon_team)

    rows = []
    for ev in events:
        event_id = ev.get("id")
        commence_time = pd.to_datetime(ev.get("commence_time"), utc=True, errors="coerce").tz_convert(None)
        home = ev.get("home_team")
        away = ev.get("away_team")
        if not event_id or pd.isna(commence_time) or not home or not away:
            continue

        # 2) Query event-odds for player prop markets (shots, SOT)
        odds_url = f"https://api.the-odds-api.com/v4/sports/{ODDS_API_SPORT}/events/{event_id}/odds/"
        params = {
            "apiKey": api_key,
            "regions": ODDS_API_REGIONS,
            "markets": ",".join(ODDS_API_MARKETS),
            "oddsFormat": "decimal",
        }
        try:
            ev_odds = _oddsapi_get(odds_url, params)
        except Exception as e:
            print("Failed odds fetch for", event_id, ":", e)
            continue

        # Map to FBref match_id (game_id) if possible
        fbref_match_id = None
        if len(sched):
            home_c = canon_team(home)
            away_c = canon_team(away)
            d0 = commence_time.date()
            cand = sched[
                (sched["home_c"] == home_c) &
                (sched["away_c"] == away_c) &
                (sched["date"].dt.date == d0)
            ]
            if len(cand) == 0:
                # allow +/- 1 day for timezone differences
                cand = sched[
                    (sched["home_c"] == home_c) &
                    (sched["away_c"] == away_c) &
                    (sched["date"].dt.date.isin([ (commence_time - pd.Timedelta(days=1)).date(), (commence_time + pd.Timedelta(days=1)).date() ]))
                ]
            if len(cand):
                fbref_match_id = cand.iloc[0]["match_id"]

        # Parse outcomes into rows
        for bk in ev_odds.get("bookmakers", []):
            book = bk.get("title") or bk.get("key") or "unknown"
            ts = bk.get("last_update") or ev_odds.get("commence_time") or None
            for m in bk.get("markets", []):
                mkey = m.get("key")
                if mkey not in ODDS_API_MARKETS:
                    continue
                market = "shots" if mkey == "player_shots" else "sot"
                # outcomes are like: {"name":"Over"/"Under","description":"Player","price":..,"point":..}
                tmp = []
                for o in m.get("outcomes", []):
                    side = str(o.get("name","")).lower()
                    player = o.get("description") or o.get("player") or None
                    point = o.get("point")
                    price = o.get("price")
                    if player is None or point is None or price is None:
                        continue
                    tmp.append((player, float(point), side, float(price)))

                if not tmp:
                    continue

                # pivot to over/under
                dfm = pd.DataFrame(tmp, columns=["player_name","line","side","price"])
                pivot = dfm.pivot_table(index=["player_name","line"], columns="side", values="price", aggfunc="first").reset_index()
                if "over" not in pivot.columns or "under" not in pivot.columns:
                    continue

                for _, r in pivot.iterrows():
                    rows.append({
                        "match_id": fbref_match_id or event_id,  # fallback to Odds API id if mapping fails
                        "date": commence_time.date().isoformat(),
                        "player_id": name_hash_id(r["player_name"]),
                        "market": market,
                        "line": float(r["line"]),
                        "odds_over": float(r["over"]),
                        "odds_under": float(r["under"]),
                        "book": str(book),
                        "timestamp": str(ts) if ts is not None else "",
                    })

        # basic rate-limit friendliness
        time.sleep(0.25)

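    # Fix the column set so an empty pull still writes a CSV with a valid header.
    odds_df = pd.DataFrame(
        rows,
        columns=["match_id","date","player_id","market","line","odds_over","odds_under","book","timestamp"],
    )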
    if output_csv is not None:
        odds_df.to_csv(output_csv, index=False)
        print(f"[OK] Wrote {output_csv} with {len(odds_df):,} prop prices")
    return odds_df

# ---- 3.3 Load data objects ----
# model_df
if os.path.exists(CFG.model_df_path):
    model_df = load_model_df(CFG.model_df_path)
else:
    print("No model_df.csv found. Building from FBref via soccerdata...")
    model_df = build_model_df_from_fbref(output_csv=CFG.model_df_path)

# odds_df (historical, for backtest)
if os.path.exists(CFG.odds_df_path):
    odds_df = load_odds_df(CFG.odds_df_path)
else:
    print("No odds_df.csv found at", CFG.odds_df_path, "(backtests vs book odds will be skipped).")
    odds_df = pd.DataFrame(columns=["match_id","date","player_id","market","line","odds_over","odds_under","book","timestamp"])

# upcoming odds (optional)
if (not os.path.exists(CFG.upcoming_odds_path)) and ODDS_API_KEY:
    print("No upcoming_odds.csv found. Pulling upcoming odds from The Odds API...")
    _ = fetch_upcoming_odds_from_oddsapi(ODDS_API_KEY, output_csv=CFG.upcoming_odds_path)

print("model_df:", model_df.shape, "odds_df:", odds_df.shape)
display(model_df.head())
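
The over/under pairing inside `fetch_upcoming_odds_from_oddsapi` is easier to follow on a tiny example. The sketch below mirrors that step with the standard library only; the payload is a hypothetical market in the outcome shape noted in the parsing code (`name`, `description`, `price`, `point`), and the player names and prices are invented.

```python
from collections import defaultdict

# Hypothetical outcomes for one bookmaker market.
outcomes = [
    {"name": "Over", "description": "Player A", "price": 1.80, "point": 1.5},
    {"name": "Under", "description": "Player A", "price": 2.00, "point": 1.5},
    {"name": "Over", "description": "Player B", "price": 2.40, "point": 0.5},  # no matching Under
]

pairs = defaultdict(dict)
for o in outcomes:
    player, point, price = o.get("description"), o.get("point"), o.get("price")
    side = str(o.get("name", "")).lower()
    if player is None or point is None or price is None or side not in ("over", "under"):
        continue
    # Keep the first price seen for each (player, line, side).
    pairs[(player, float(point))].setdefault(side, float(price))

# Only lines quoted on both sides can be de-vigged, so one-sided lines are dropped.
rows = [
    {"player_name": player, "line": line, "odds_over": s["over"], "odds_under": s["under"]}
    for (player, line), s in pairs.items()
    if "over" in s and "under" in s
]
print(rows)
```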


In [None]:
# 4) Clean & merge helpers

def build_match_level_tables(model_df: pd.DataFrame) -> pd.DataFrame:
    """Create match-level team totals (one row per match and team)."""
    team_totals = (
        model_df.groupby(["match_id","date","team"], as_index=False)
          .agg(shots_for=("shots","sum"), sot_for=("sot","sum"))
    )
    return team_totals

def add_team_opp_rolling_features(model_df: pd.DataFrame) -> pd.DataFrame:
    df = model_df.copy()
    team_totals = build_match_level_tables(df)

    # Build team-opponent mapping using player rows
    key = df[["match_id","date","team","opponent"]].drop_duplicates()
    match_tbl = key.merge(team_totals, on=["match_id","date","team"], how="left")

    # Attach opponent totals: for a given (match_id, date, team, opponent),
    # the opponent's shots_for in that match is the team's shots_against.
    opp_lookup = team_totals.rename(
        columns={"team":"opponent", "shots_for":"shots_against", "sot_for":"sot_against"}
    )
    match_tbl = match_tbl.merge(opp_lookup, on=["match_id","date","opponent"], how="left")

    # per90 at team level: assume 90 minutes; this is a crude normalization but works pre-match
    match_tbl["team_shots_for_per90"] = match_tbl["shots_for"]
    match_tbl["team_sot_for_per90"] = match_tbl["sot_for"]
    match_tbl["team_shots_allowed_per90"] = match_tbl["shots_against"]
    match_tbl["team_sot_allowed_per90"] = match_tbl["sot_against"]

    match_tbl = match_tbl.sort_values(["team","date","match_id"])

    # EWMA rolling with shift(1) to prevent leakage (each team's own history)
    for c in [
        "team_shots_for_per90","team_sot_for_per90",
        "team_shots_allowed_per90","team_sot_allowed_per90",
    ]:
        match_tbl[c + "_ewm"] = (
            match_tbl.groupby("team")[c]
            .transform(lambda s: s.ewm(span=CFG.ewm_span_team, adjust=False).mean().shift(1))
        )

    # The opponent's pre-match "allowed" EWMAs become this row's opp_* features.
    opp_def = match_tbl[["match_id","team","team_shots_allowed_per90_ewm","team_sot_allowed_per90_ewm"]].rename(
        columns={
            "team":"opponent",
            "team_shots_allowed_per90_ewm":"opp_shots_allowed_per90_ewm",
            "team_sot_allowed_per90_ewm":"opp_sot_allowed_per90_ewm",
        }
    )
    match_tbl = match_tbl.merge(opp_def, on=["match_id","opponent"], how="left")

    # Merge back to player rows
    feat_cols = ["team_shots_for_per90_ewm","team_sot_for_per90_ewm","opp_shots_allowed_per90_ewm","opp_sot_allowed_per90_ewm"]
    df = df.merge(match_tbl[["match_id","date","team","opponent"] + feat_cols], on=["match_id","date","team","opponent"], how="left")

    return df

def add_player_rolling_features(model_df: pd.DataFrame) -> pd.DataFrame:
    df = model_df.copy()
    minutes = df["minutes"].to_numpy()
    denom90 = np.maximum(minutes / 90.0, CFG.min_minutes_for_rates / 90.0)

    df["shots_per90"] = df["shots"] / denom90
    df["sot_per90"] = df["sot"] / denom90

    df = df.sort_values(["player_id","date","match_id"]).reset_index(drop=True)

    # EWMA for rates (shifted)
    df["shots_per90_ewm"] = df.groupby("player_id")["shots_per90"].transform(
        lambda s: s.ewm(span=CFG.ewm_span_player, adjust=False).mean().shift(1)
    )
    df["sot_per90_ewm"] = df.groupby("player_id")["sot_per90"].transform(
        lambda s: s.ewm(span=CFG.ewm_span_player, adjust=False).mean().shift(1)
    )

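    # Smoothed historical accuracy prior (cumulative totals before the current match).
    # Subtracting the current row keeps the lag within each player; a plain
    # .cumsum().shift(1) would carry the previous player's totals into the
    # next player's first row.
    g = df.groupby("player_id")
    cum_shots = g["shots"].cumsum() - df["shots"]
    cum_sot = g["sot"].cumsum() - df["sot"]
    a, b = CFG.acc_alpha, CFG.acc_beta
    df["acc_prior"] = (cum_sot + a) / (cum_shots + a + b)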

    # Recent starts/minutes features for minutes model
    df["starts_ewm"] = g["started"].transform(lambda s: s.ewm(span=CFG.ewm_span_player, adjust=False).mean().shift(1))
    df["minutes_ewm"] = g["minutes"].transform(lambda s: s.ewm(span=CFG.ewm_span_player, adjust=False).mean().shift(1))

    return df

def add_rest_days(model_df: pd.DataFrame) -> pd.DataFrame:
    df = model_df.copy()
    # Team rest days: days since last team match
    team_dates = df[["team","date"]].drop_duplicates().sort_values(["team","date"])
    team_dates["team_rest_days"] = team_dates.groupby("team")["date"].diff().dt.days
    df = df.merge(team_dates, on=["team","date"], how="left")

    # Player rest days
    player_dates = df[["player_id","date"]].drop_duplicates().sort_values(["player_id","date"])
    player_dates["player_rest_days"] = player_dates.groupby("player_id")["date"].diff().dt.days
    df = df.merge(player_dates, on=["player_id","date"], how="left")

    for c in ["team_rest_days","player_rest_days"]:
        df[c] = df[c].fillna(df[c].median())
    return df

def build_features(model_df: pd.DataFrame) -> pd.DataFrame:
    df = model_df.copy()
    df = add_team_opp_rolling_features(df)
    df = add_player_rolling_features(df)
    df = add_rest_days(df)

    # Home indicator
    df["is_home"] = (df["home_away"] == "H").astype(int)

    # Basic missing handling
    for c in [
        "shots_per90_ewm","sot_per90_ewm","acc_prior","starts_ewm","minutes_ewm",
        "team_shots_for_per90_ewm","team_sot_for_per90_ewm","opp_shots_allowed_per90_ewm","opp_sot_allowed_per90_ewm",
        "team_rest_days","player_rest_days"
    ]:
        if c in df.columns:
            df[c] = df[c].fillna(df[c].median())
    return df

feat_df = build_features(model_df)
feat_df.head()
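
The leakage guard used throughout this cell is "EWMA, then shift by one match", so the feature for match *t* only sees matches before *t*. A pure-Python sketch of the same recursion as `ewm(span=..., adjust=False).mean().shift(1)` follows; the helper name and the input values are illustrative.

```python
from typing import List, Optional

def ewm_shifted(values: List[float], span: int) -> List[Optional[float]]:
    """Pre-match EWMA: entry t is the EWMA of values[0..t-1], or None if there is no history."""
    alpha = 2.0 / (span + 1.0)
    out: List[Optional[float]] = []
    prev: Optional[float] = None  # EWMA through the previous match
    for x in values:
        out.append(prev)  # feature for this match uses past matches only
        prev = x if prev is None else (1.0 - alpha) * prev + alpha * x
    return out

shots_per90 = [2.0, 4.0, 0.0]  # invented per-match rates for one player
features = ewm_shifted(shots_per90, span=3)  # alpha = 0.5
print(features)
```

The first match has no history, which is why the notebook fills the resulting missing values (with the column median) before modelling.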


In [None]:
# 5) EDA (quick sanity checks)

def quick_eda(df: pd.DataFrame) -> None:
    print("Rows:", df.shape[0])
    print("Date range:", df["date"].min(), "->", df["date"].max())
    print("Players:", df["player_id"].nunique(), "Teams:", df["team"].nunique())
    print("\nShots summary:")
    print(df["shots"].describe())
    print("\nSOT summary:")
    print(df["sot"].describe())
    print("\nMinutes summary:")
    print(df["minutes"].describe())

quick_eda(feat_df)

plt.figure()
feat_df["shots"].hist(bins=30)
plt.title("Shots distribution")
plt.show()

plt.figure()
feat_df["sot"].hist(bins=20)
plt.title("SOT distribution")
plt.show()


In [None]:
# 6) Minutes model: P(start) + minutes distributions

MINUTES_FEATURES_NUM = [
    "starts_ewm","minutes_ewm","team_rest_days","player_rest_days",
    "team_shots_for_per90_ewm","opp_shots_allowed_per90_ewm",
]
MINUTES_FEATURES_CAT = ["pos_bucket","team"]

def fit_start_model(train: pd.DataFrame) -> Pipeline:
    X = train[MINUTES_FEATURES_NUM + MINUTES_FEATURES_CAT]
    y = train["started"].astype(int)

    pre = ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), MINUTES_FEATURES_NUM),
            ("cat", OneHotEncoder(handle_unknown="ignore"), MINUTES_FEATURES_CAT),
        ]
    )
    clf = LogisticRegression(max_iter=2000, C=0.8, solver="lbfgs")
    pipe = Pipeline([("pre", pre), ("clf", clf)])
    pipe.fit(X, y)
    return pipe

def fit_minutes_regressor(train: pd.DataFrame, started_value: int) -> Pipeline:
    sub = train[train["started"] == started_value].copy()
    X = sub[MINUTES_FEATURES_NUM + MINUTES_FEATURES_CAT]
    y = sub["minutes"].astype(float)

    pre = ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), MINUTES_FEATURES_NUM),
            ("cat", OneHotEncoder(handle_unknown="ignore"), MINUTES_FEATURES_CAT),
        ]
    )
    reg = HistGradientBoostingRegressor(
        loss="squared_error",
        max_depth=5,
        learning_rate=0.06,
        max_iter=300,
        random_state=7
    )
    pipe = Pipeline([("pre", pre), ("reg", reg)])
    pipe.fit(X, y)
    return pipe

def estimate_resid_std(train: pd.DataFrame, reg: Pipeline, started_value: int) -> float:
    sub = train[train["started"] == started_value].copy()
    X = sub[MINUTES_FEATURES_NUM + MINUTES_FEATURES_CAT]
    y = sub["minutes"].astype(float).to_numpy()
    pred = reg.predict(X)
    resid = y - pred
    # robust-ish
    return float(np.nanmedian(np.abs(resid)) * 1.4826)

@dataclass
class MinutesModel:
    start_pipe: Pipeline
    min_start_pipe: Pipeline
    min_bench_pipe: Pipeline
    std_start: float
    std_bench: float

    def predict_p_start(self, rows: pd.DataFrame) -> np.ndarray:
        X = rows[MINUTES_FEATURES_NUM + MINUTES_FEATURES_CAT]
        return self.start_pipe.predict_proba(X)[:, 1]

    def sample_minutes(self, rows: pd.DataFrame, n: int, rng: np.random.Generator) -> np.ndarray:
        """Return samples with shape (len(rows), n)."""
        p = self.predict_p_start(rows)
        X = rows[MINUTES_FEATURES_NUM + MINUTES_FEATURES_CAT]
        mu_start = self.min_start_pipe.predict(X)
        mu_bench = self.min_bench_pipe.predict(X)

        # Sample start indicator
        start_draw = rng.random((len(rows), n)) < p[:, None]

        # Clipped-normal approximation (fast and practical): draw both
        # branches, pick one per sample, then clip to [0, minutes_max].
        m_start = rng.normal(mu_start[:, None], self.std_start, size=(len(rows), n))
        m_bench = rng.normal(mu_bench[:, None], self.std_bench, size=(len(rows), n))

        m = np.where(start_draw, m_start, m_bench)
        return clip_minutes(m, CFG.minutes_max)

def fit_minutes_model(train: pd.DataFrame) -> MinutesModel:
    start_pipe = fit_start_model(train)
    min_start = fit_minutes_regressor(train, started_value=1)
    min_bench = fit_minutes_regressor(train, started_value=0) if (train["started"] == 0).any() else fit_minutes_regressor(train, started_value=1)
    std_start = estimate_resid_std(train, min_start, started_value=1)
    std_bench = estimate_resid_std(train, min_bench, started_value=0) if (train["started"] == 0).any() else std_start
    return MinutesModel(start_pipe, min_start, min_bench, std_start, std_bench)

# Example fit on all data (for later candidate generation); backtest will refit per fold
minutes_model_all = fit_minutes_model(feat_df)

# quick check on a random sample
sample_rows = feat_df.sample(5, random_state=7)
m_samp = minutes_model_all.sample_minutes(sample_rows, n=2000, rng=RNG)
p_start = minutes_model_all.predict_p_start(sample_rows)
pd.DataFrame({
    "player": sample_rows["player_name"].values,
    "p_start": p_start,
    "min_mean": m_samp.mean(axis=1),
    "min_p10": np.quantile(m_samp, 0.10, axis=1),
    "min_p90": np.quantile(m_samp, 0.90, axis=1),
})
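
The minutes distribution above is a two-component mixture: draw a start indicator, draw minutes from the matching branch, then clip. The sketch below reproduces that logic for a single player with the standard library so the shape of the output is easy to inspect. All parameter values (`p_start`, branch means, standard deviations) are invented for illustration and the function is not the notebook's `MinutesModel`.

```python
import random
import statistics

def sample_minutes(p_start, mu_start, mu_bench, sd_start, sd_bench, n, rng, minutes_max=95):
    """Mixture draw: start vs bench branch, normal noise, clipped to [0, minutes_max]."""
    out = []
    for _ in range(n):
        if rng.random() < p_start:
            m = rng.gauss(mu_start, sd_start)
        else:
            m = rng.gauss(mu_bench, sd_bench)
        out.append(min(max(m, 0.0), minutes_max))
    return out

rng = random.Random(7)
draws = sample_minutes(
    p_start=0.8, mu_start=80.0, mu_bench=20.0,
    sd_start=12.0, sd_bench=10.0, n=4000, rng=rng,
)

cuts = statistics.quantiles(draws, n=10)  # 9 cut points: cuts[0] ~ p10, cuts[8] ~ p90
print("mean:", statistics.mean(draws), "p10:", cuts[0], "p90:", cuts[8])
```

Because the mixture is bimodal, the p10 to p90 width is driven mostly by start uncertainty, which is what the `max_minutes_ci_width` filter in the config is meant to screen on.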


In [None]:
# 7) Shots model (Negative Binomial rate per 90)

SHOTS_FEATURES_NUM = [
    "shots_per90_ewm",
    "team_shots_for_per90_ewm",
    "opp_shots_allowed_per90_ewm",
    "team_rest_days","player_rest_days",
    "is_home",
]
SHOTS_FEATURES_CAT = ["pos_bucket"]

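def make_design_matrix(df: pd.DataFrame, num_cols: List[str], cat_cols: List[str], fit_cols: Optional[List[str]]=None) -> Tuple[pd.DataFrame, List[str]]:
    X_num = df[num_cols].astype(float).copy()
    # dtype=float keeps the dummies numeric (recent pandas versions default to bool)
    X_cat = pd.get_dummies(df[cat_cols].astype(str), prefix=cat_cols, dummy_na=False, dtype=float)
    X = pd.concat([X_num, X_cat], axis=1)
    X = sm.add_constant(X, has_constant="add")
    if fit_cols is None:
        # Drop one reference level per categorical: a full dummy set is collinear with the constant
        ref_cols = [sorted(c for c in X_cat.columns if c.startswith(f"{p}_"))[0] for p in cat_cols]
        X = X.drop(columns=ref_cols)
        fit_cols = list(X.columns)
    else:
        # Align to the training columns: unseen levels are dropped, missing ones are filled with 0
        X = X.reindex(columns=fit_cols, fill_value=0.0)
    return X, fit_cols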

@dataclass
class NBModel:
    res: any
    cols: List[str]

    @property
    def alpha(self) -> float:
        # statsmodels' discrete NegativeBinomial appends the dispersion to the
        # parameter vector, so alpha is the last entry of res.params.
        try:
            alpha = float(np.asarray(self.res.params, dtype=float)[-1])
        except Exception:
            alpha = np.nan
        if np.isfinite(alpha) and alpha > 0:
            return alpha
        # Otherwise use the value cached at fit time, if there is one
        cached = getattr(getattr(self.res, "model", None), "_dispersion", None)
        if cached is not None and np.isfinite(cached) and cached > 0:
            return float(cached)
        # Conservative fallback
        return 0.4

    def predict_mu90(self, rows: pd.DataFrame, num_cols: List[str], cat_cols: List[str]) -> np.ndarray:
        X, _ = make_design_matrix(rows, num_cols, cat_cols, fit_cols=self.cols)
        # Zero offset = log(90/90), so the prediction is the mean count per full 90 minutes
        offset = np.zeros(len(rows))
        mu90 = self.res.predict(X, offset=offset)
        return np.asarray(mu90, dtype=float)

def fit_nb_count_model(train: pd.DataFrame, y_col: str, num_cols: List[str], cat_cols: List[str]) -> NBModel:
    df = train.copy()
    # Offset = log(minutes / 90) from realized minutes, so the coefficients model a per-90 rate
    offset = np.log(np.maximum(df["minutes"].astype(float).to_numpy() / 90.0, 1e-6))
    y = df[y_col].astype(int).to_numpy()

    X, cols = make_design_matrix(df, num_cols, cat_cols, fit_cols=None)
    # Discrete NegativeBinomial (NB2)
    mod = NegativeBinomial(y, X, offset=offset)
    res = mod.fit(disp=False, maxiter=200)
    # Cache the estimated dispersion (last fitted parameter). Positional access
    # via NumPy works whether res.params is an array or a labeled Series.
    try:
        alpha_hat = float(np.asarray(res.params, dtype=float)[-1])
    except Exception:
        alpha_hat = np.nan
    res.model._dispersion = alpha_hat if np.isfinite(alpha_hat) and alpha_hat > 0 else 0.4
    return NBModel(res=res, cols=cols)

shots_nb_all = fit_nb_count_model(feat_df.dropna(subset=SHOTS_FEATURES_NUM), "shots", SHOTS_FEATURES_NUM, SHOTS_FEATURES_CAT)
print(shots_nb_all.res.summary())


In [None]:
# 8) SOT models: (A) direct NB and (B) structural Shots × Accuracy

SOT_FEATURES_NUM = [
    "sot_per90_ewm",
    "team_sot_for_per90_ewm",
    "opp_sot_allowed_per90_ewm",
    "team_rest_days","player_rest_days",
    "is_home",
]
SOT_FEATURES_CAT = ["pos_bucket"]

# A) direct NB on SOT count with minutes offset
sot_nb_all = fit_nb_count_model(
    feat_df.dropna(subset=SOT_FEATURES_NUM),
    "sot",
    SOT_FEATURES_NUM,
    SOT_FEATURES_CAT
)
print(sot_nb_all.res.summary())

# B) accuracy model for Binomial(SOT | Shots)
# We fit a Binomial GLM on per-shot accuracy with var_weights=shots.
ACC_FEATURES_NUM = [
    "acc_prior",
    "shots_per90_ewm",
    "team_sot_for_per90_ewm",
    "opp_sot_allowed_per90_ewm",
    "is_home",
]
ACC_FEATURES_CAT = ["pos_bucket"]

@dataclass
class BinomModel:
    res: any
    cols: List[str]

    def predict_p(self, rows: pd.DataFrame, num_cols: List[str], cat_cols: List[str]) -> np.ndarray:
        X, _ = make_design_matrix(rows, num_cols, cat_cols, fit_cols=self.cols)
        p = np.asarray(self.res.predict(X), dtype=float)
        return np.clip(p, 0.03, 0.75)

def fit_accuracy_model(train: pd.DataFrame) -> BinomModel:
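    # Keep rows with at least one shot and complete accuracy features
    # (the GLM does not handle NaNs in the design matrix)
    df = train[train["shots"] > 0].dropna(subset=ACC_FEATURES_NUM).copy()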
    y = (df["sot"] / df["shots"]).clip(0, 1).astype(float)
    w = df["shots"].astype(float).to_numpy()

    X, cols = make_design_matrix(df, ACC_FEATURES_NUM, ACC_FEATURES_CAT, fit_cols=None)
    mod = sm.GLM(y, X, family=sm.families.Binomial(), var_weights=w)
    res = mod.fit()
    return BinomModel(res=res, cols=cols)

acc_model_all = fit_accuracy_model(feat_df)
print(acc_model_all.res.summary())

def predict_accuracy(rows: pd.DataFrame, acc_model: BinomModel) -> np.ndarray:
    return acc_model.predict_p(rows, ACC_FEATURES_NUM, ACC_FEATURES_CAT)


In [None]:
# 9) Probability engine: compute P(Over line) for shots and SOT with minutes uncertainty

def predict_over_probs_shots(
    rows: pd.DataFrame,
    lines: np.ndarray,
    minutes_model: MinutesModel,
    shots_model: NBModel,
    n_mc: int,
    rng: np.random.Generator
) -> np.ndarray:
    """Return shape (len(rows), len(lines))"""
    mu90 = shots_model.predict_mu90(rows, SHOTS_FEATURES_NUM, SHOTS_FEATURES_CAT)  # mean shots at 90
    m = minutes_model.sample_minutes(rows, n=n_mc, rng=rng)  # (N, n_mc)
    mu = mu90[:, None] * (m / 90.0)
    alpha = shots_model.alpha
    shots_samp = nb2_sample(mu, alpha=alpha, rng=rng)
    out = np.zeros((len(rows), len(lines)), dtype=float)
    for j, line in enumerate(lines):
        out[:, j] = (shots_samp > line).mean(axis=1)
    return out

def predict_over_probs_sot_direct(
    rows: pd.DataFrame,
    lines: np.ndarray,
    minutes_model: MinutesModel,
    sot_model: NBModel,
    n_mc: int,
    rng: np.random.Generator
) -> np.ndarray:
    mu90 = sot_model.predict_mu90(rows, SOT_FEATURES_NUM, SOT_FEATURES_CAT)
    m = minutes_model.sample_minutes(rows, n=n_mc, rng=rng)
    mu = mu90[:, None] * (m / 90.0)
    alpha = sot_model.alpha
    sot_samp = nb2_sample(mu, alpha=alpha, rng=rng)
    out = np.zeros((len(rows), len(lines)), dtype=float)
    for j, line in enumerate(lines):
        out[:, j] = (sot_samp > line).mean(axis=1)
    return out

def predict_over_probs_sot_structural(
    rows: pd.DataFrame,
    lines: np.ndarray,
    minutes_model: MinutesModel,
    shots_model: NBModel,
    acc_model: BinomModel,
    n_mc: int,
    rng: np.random.Generator
) -> np.ndarray:
    mu90_shots = shots_model.predict_mu90(rows, SHOTS_FEATURES_NUM, SHOTS_FEATURES_CAT)
    p_acc = predict_accuracy(rows, acc_model)  # (N,)
    m = minutes_model.sample_minutes(rows, n=n_mc, rng=rng)
    mu_shots = mu90_shots[:, None] * (m / 90.0)
    shots_samp = nb2_sample(mu_shots, alpha=shots_model.alpha, rng=rng)
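    # Binomial thinning per draw: each simulated shot is on target with probability p.
    # predict_p already clips p to [0.03, 0.75]; this wider clip is only a safeguard.
    p = np.clip(p_acc[:, None], 0.01, 0.9)
    sot_samp = rng.binomial(shots_samp, p)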
    out = np.zeros((len(rows), len(lines)), dtype=float)
    for j, line in enumerate(lines):
        out[:, j] = (sot_samp > line).mean(axis=1)
    return out

# Quick demo on a few rows
demo = feat_df.sample(3, random_state=7)
shots_lines = np.array([0.5, 1.5, 2.5])
sot_lines = np.array([0.5, 1.5])

p_shots = predict_over_probs_shots(demo, shots_lines, minutes_model_all, shots_nb_all, n_mc=2000, rng=RNG)
p_sot_a = predict_over_probs_sot_direct(demo, sot_lines, minutes_model_all, sot_nb_all, n_mc=2000, rng=RNG)
p_sot_b = predict_over_probs_sot_structural(demo, sot_lines, minutes_model_all, shots_nb_all, acc_model_all, n_mc=2000, rng=RNG)

pd.DataFrame({
    "player": demo["player_name"].values,
    "p_shots>1.5": p_shots[:,1],
    "p_sot>0.5_direct": p_sot_a[:,0],
    "p_sot>0.5_struct": p_sot_b[:,0],
})
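
The functions above rely on `nb2_sample`, which is defined earlier in the notebook. For reference, here is a minimal sketch of one common way to draw NB2 counts: a gamma-Poisson mixture with mean `mu` and variance `mu + alpha * mu**2`. The name `nb2_sample_sketch` is hypothetical, and this is not necessarily how the notebook's own helper is written.

```python
import numpy as np

def nb2_sample_sketch(mu, alpha, rng):
    """Draw NB2 counts with mean mu and variance mu + alpha * mu**2 (illustrative only)."""
    mu = np.maximum(np.asarray(mu, dtype=float), 0.0)
    if alpha <= 0:
        # alpha -> 0 is the Poisson limit
        return rng.poisson(mu)
    shape = 1.0 / alpha
    # Gamma-distributed rate with mean mu (shape k, scale mu / k) ...
    lam = rng.gamma(shape, mu / shape)
    # ... then a Poisson count given that rate
    return rng.poisson(lam)

rng = np.random.default_rng(0)
draws = nb2_sample_sketch(np.full((2, 100_000), 2.0), alpha=0.4, rng=rng)
# Sample mean and variance should land near 2.0 and 2.0 + 0.4 * 2.0**2 = 3.6
print(draws.mean(), draws.var())
```

Because `mu` can be an `(N, n_mc)` array, the same call handles one mean per player per minutes draw, which is how the engine above folds minutes uncertainty into the count distribution.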


In [None]:
# 10) De-vig methods (proportional + Shin)

def devig_proportional(q_over: float, q_under: float) -> Tuple[float, float]:
    s = q_over + q_under
    if not np.isfinite(s) or s <= 0:
        return np.nan, np.nan
    return q_over / s, q_under / s

def devig_shin(q_over: float, q_under: float, tol: float=1e-9, max_iter: int=100) -> Tuple[float, float]:
    """Shin method for a two-outcome market. Returns (p_over, p_under)."""
    qs = np.array([q_over, q_under], dtype=float)
    if np.any(~np.isfinite(qs)) or np.any(qs <= 0):
        return np.nan, np.nan

    # Booksum (1 + overround). Shin's z needs a positive margin, so fall back
    # to proportional normalization when the book has none.
    booksum = qs.sum()
    if booksum <= 1.0:
        return devig_proportional(q_over, q_under)

    def shin_probs(z: float) -> np.ndarray:
        # p_i = (sqrt(z^2 + 4(1-z) q_i^2 / booksum) - z) / (2(1-z)), with q_i the raw implied probs
        return (np.sqrt(z*z + 4*(1 - z)*qs*qs/booksum) - z) / (2*(1 - z))

    # sum(shin_probs(z)) falls from sqrt(booksum) > 1 at z=0, so bisect on z in [0, 1)
    lo, hi = 0.0, 0.999999
    for _ in range(max_iter):
        z = 0.5 * (lo + hi)
        s = shin_probs(z).sum()
        if abs(s - 1.0) < tol:
            break
        if s > 1.0:
            lo = z
        else:
            hi = z
    ps = shin_probs(z)
    ps = ps / ps.sum()
    return float(ps[0]), float(ps[1])

def devig_row(row: pd.Series, method: str) -> Tuple[float, float]:
    q_over, q_under = to_implied_probs(row["odds_over"], row["odds_under"])
    if method == "proportional":
        return devig_proportional(q_over, q_under)
    if method == "shin":
        return devig_shin(q_over, q_under)
    raise ValueError(f"Unknown devig method: {method}")

def ev_decimal(p: float, odds: float) -> float:
    # EV per 1 unit stake
    return p * (odds - 1.0) - (1.0 - p)

# demo on odds_df
tmp = odds_df.head(5).copy()
tmp[["p_mkt_over","p_mkt_under"]] = tmp.apply(lambda r: pd.Series(devig_row(r, CFG.devig_method)), axis=1)
tmp
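
As a worked example of the arithmetic in this cell, take a hypothetical two-way price of 1.80 (over) / 2.00 (under) and a hypothetical model probability of 0.58. The sketch below converts decimal odds to implied probabilities, removes the margin proportionally, and computes EV per unit staked. It re-implements the formulas locally so it runs on its own; the odds and the probability are made up for illustration.

```python
def implied_probs(odds_over, odds_under):
    # Raw implied probabilities still contain the bookmaker margin
    return 1.0 / odds_over, 1.0 / odds_under

def devig_prop(q_over, q_under):
    # Proportional de-vig: rescale so the two probabilities sum to 1
    s = q_over + q_under
    return q_over / s, q_under / s

def ev_per_unit(p, odds):
    # Win (odds - 1) with probability p, lose the stake otherwise
    return p * (odds - 1.0) - (1.0 - p)

q_over, q_under = implied_probs(1.80, 2.00)
overround = q_over + q_under - 1.0             # about 0.0556
p_over, p_under = devig_prop(q_over, q_under)  # 10/19 and 9/19
ev_fair = ev_per_unit(p_over, 1.80)            # -1/19: betting at the de-vigged price loses the margin
ev_model = ev_per_unit(0.58, 1.80)             # about +0.044 if the model's 0.58 were right
print(round(overround, 4), round(p_over, 4), round(ev_fair, 4), round(ev_model, 4))
```

The gap between `ev_fair` and zero is the margin a model has to overcome before a positive `ev_model` means anything.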


In [None]:
# 11) Merge odds with features and label outcomes

def merge_odds_with_features(odds_df: pd.DataFrame, feat_df: pd.DataFrame) -> pd.DataFrame:
    # Join on match_id, date, player_id
    merged = odds_df.merge(
        feat_df,
        on=["match_id","date","player_id"],
        how="left",
        suffixes=("", "_feat")
    )
    # keep only rows where we have outcomes (for backtesting)
    merged = merged.dropna(subset=["minutes","shots","sot"])
    # outcome labels
    merged["y_count"] = np.where(merged["market"] == "shots", merged["shots"], merged["sot"])
    merged["y_over"] = (merged["y_count"] > merged["line"]).astype(int)
    return merged

odds_feat = merge_odds_with_features(odds_df, feat_df)
print("Merged odds rows:", odds_feat.shape)
odds_feat.head()


In [None]:
# 12) Walk-forward backtest

def build_prediction_block(
    block: pd.DataFrame,
    minutes_model: MinutesModel,
    shots_model: NBModel,
    sot_direct_model: NBModel,
    acc_model: BinomModel,
    n_mc: int,
    rng: np.random.Generator
) -> pd.DataFrame:
    out = block.copy()
    out["p_start"] = minutes_model.predict_p_start(out)
    # minutes CI
    m_samp = minutes_model.sample_minutes(out, n=n_mc, rng=rng)
    out["min_mean"] = m_samp.mean(axis=1)
    out["min_p10"] = np.quantile(m_samp, 0.10, axis=1)
    out["min_p90"] = np.quantile(m_samp, 0.90, axis=1)
    out["min_ci_width"] = out["min_p90"] - out["min_p10"]

    # Predict probabilities depending on market
    out["p_model_over_direct"] = np.nan
    out["p_model_over_struct"] = np.nan

    # Shots
    mask_sh = out["market"] == "shots"
    if mask_sh.any():
        lines = out.loc[mask_sh, "line"].to_numpy()
        # vectorize by grouping unique lines (faster)
        uniq_lines = np.unique(lines)
        probs = predict_over_probs_shots(out.loc[mask_sh], uniq_lines, minutes_model, shots_model, n_mc=n_mc, rng=rng)
        line_to_idx = {l:i for i,l in enumerate(uniq_lines)}
        out.loc[mask_sh, "p_model_over_direct"] = [probs[i, line_to_idx[l]] for i,l in enumerate(lines)]
        out.loc[mask_sh, "p_model_over_struct"] = out.loc[mask_sh, "p_model_over_direct"]  # same for shots

    # SOT: compute both direct and structural
    mask_sot = out["market"] == "sot"
    if mask_sot.any():
        lines = out.loc[mask_sot, "line"].to_numpy()
        uniq_lines = np.unique(lines)

        probs_a = predict_over_probs_sot_direct(out.loc[mask_sot], uniq_lines, minutes_model, sot_direct_model, n_mc=n_mc, rng=rng)
        probs_b = predict_over_probs_sot_structural(out.loc[mask_sot], uniq_lines, minutes_model, shots_model, acc_model, n_mc=n_mc, rng=rng)

        line_to_idx = {l:i for i,l in enumerate(uniq_lines)}
        out.loc[mask_sot, "p_model_over_direct"] = [probs_a[i, line_to_idx[l]] for i,l in enumerate(lines)]
        out.loc[mask_sot, "p_model_over_struct"] = [probs_b[i, line_to_idx[l]] for i,l in enumerate(lines)]

    # Market probs
    out[["p_mkt_over","p_mkt_under"]] = out.apply(lambda r: pd.Series(devig_row(r, CFG.devig_method)), axis=1)

    # EVs using offered odds
    out["ev_over_direct"] = out.apply(lambda r: ev_decimal(r["p_model_over_direct"], r["odds_over"]), axis=1)
    out["ev_over_struct"] = out.apply(lambda r: ev_decimal(r["p_model_over_struct"], r["odds_over"]), axis=1)

    # CLV if available
    if "odds_over_close" in out.columns and "odds_under_close" in out.columns:
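        def devig_close(r):
            if pd.isna(r.get("odds_over_close")) or pd.isna(r.get("odds_under_close")):
                return np.nan
            q_over, q_under = to_implied_probs(r["odds_over_close"], r["odds_under_close"])
            # De-vig the close with the same method as the offered price so the CLV difference is like-for-like
            devig_fn = devig_shin if CFG.devig_method == "shin" else devig_proportional
            p_over, _ = devig_fn(q_over, q_under)
            return p_over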
        out["p_close_over"] = out.apply(devig_close, axis=1)
        out["clv_prob"] = out["p_close_over"] - out["p_mkt_over"]
    else:
        out["p_close_over"] = np.nan
        out["clv_prob"] = np.nan

    return out

def walk_forward_backtest(odds_feat: pd.DataFrame) -> pd.DataFrame:
    df = odds_feat.sort_values("date").reset_index(drop=True)
    start_date = df["date"].min()
    end_date = df["date"].max()
    cut = start_date + pd.Timedelta(days=365)  # require at least 1 year warmup by default
    preds_all = []

    while cut < end_date:
        train = feat_df[feat_df["date"] < cut].copy()
        test = df[(df["date"] >= cut) & (df["date"] < cut + pd.Timedelta(days=CFG.walk_forward_step_days))].copy()

        if len(train) < CFG.min_train_rows or len(test) == 0:
            cut += pd.Timedelta(days=CFG.walk_forward_step_days)
            continue

        # Fit models
        minutes_model = fit_minutes_model(train)
        shots_model = fit_nb_count_model(train.dropna(subset=SHOTS_FEATURES_NUM), "shots", SHOTS_FEATURES_NUM, SHOTS_FEATURES_CAT)
        sot_direct_model = fit_nb_count_model(train.dropna(subset=SOT_FEATURES_NUM), "sot", SOT_FEATURES_NUM, SOT_FEATURES_CAT)
        acc_model = fit_accuracy_model(train)

        # Predict
        block_preds = build_prediction_block(
            test, minutes_model, shots_model, sot_direct_model, acc_model,
            n_mc=min(CFG.count_mc_samples, 2000),  # keep fold runtime reasonable
            rng=RNG
        )
        preds_all.append(block_preds)
        print(f"Fold @ {cut.date()} | train={len(train):,} test={len(test):,}")

        cut += pd.Timedelta(days=CFG.walk_forward_step_days)

    if not preds_all:
        raise RuntimeError("No folds produced. Reduce CFG.min_train_rows or adjust dates.")
    return pd.concat(preds_all, ignore_index=True)

bt_preds = walk_forward_backtest(odds_feat)
bt_preds.head()
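
To make the fold logic easier to inspect, the following sketch isolates just the date arithmetic of the walk-forward loop: train on everything strictly before `cut`, test on the window `[cut, cut + step)`, then advance. The function name and the 28-day step are assumptions for illustration; the notebook reads the real step from `CFG.walk_forward_step_days`.

```python
import pandas as pd

def walk_forward_folds(dates, warmup_days=365, step_days=28):
    """Yield (cut, test_end): train on date < cut, test on cut <= date < test_end."""
    dates = pd.to_datetime(pd.Series(dates))
    cut = dates.min() + pd.Timedelta(days=warmup_days)
    end = dates.max()
    while cut < end:
        test_end = cut + pd.Timedelta(days=step_days)
        yield cut, test_end
        cut = test_end

# Hypothetical odds history spanning Aug 2022 to Dec 2023
dates = pd.date_range("2022-08-01", "2023-12-31", freq="D")
folds = list(walk_forward_folds(dates, warmup_days=365, step_days=28))
for cut, test_end in folds:
    print(cut.date(), "->", test_end.date())
```

Each test window starts exactly where the previous one ended, so no odds row is scored twice and no model sees matches dated on or after its own cut.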


In [None]:
# 13) Backtest evaluation: calibration, Brier/logloss, and SOT model selection

def eval_block(df: pd.DataFrame, p_col: str, label_col: str = "y_over") -> Dict[str, float]:
    d = df.dropna(subset=[p_col, label_col]).copy()
    if d.empty:
        return {"brier": np.nan, "logloss": np.nan, "n": 0}
    p = np.clip(d[p_col].to_numpy(), 1e-6, 1 - 1e-6)
    y = d[label_col].astype(int).to_numpy()
    return {
        "brier": float(brier_score_loss(y, p)),
        # labels=[0, 1] keeps log_loss defined when a slice contains only one outcome class
        "logloss": float(log_loss(y, p, labels=[0, 1])),
        "n": int(len(d))
    }

def reliability_curve(y: np.ndarray, p: np.ndarray, bins: int = 10) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    p = np.clip(p, 1e-9, 1 - 1e-9)
    edges = np.linspace(0, 1, bins + 1)
    idx = np.digitize(p, edges) - 1
    bin_centers = 0.5 * (edges[:-1] + edges[1:])
    obs = np.full(bins, np.nan)
    cnt = np.zeros(bins, dtype=int)
    for b in range(bins):
        m = idx == b
        cnt[b] = m.sum()
        if cnt[b] > 0:
            obs[b] = y[m].mean()
    return bin_centers, obs, cnt

def plot_reliability(df: pd.DataFrame, p_col: str, title: str):
    d = df.dropna(subset=[p_col, "y_over"]).copy()
    p = d[p_col].to_numpy()
    y = d["y_over"].astype(int).to_numpy()
    x, obs, cnt = reliability_curve(y, p, bins=10)
    plt.figure()
    plt.plot([0,1],[0,1])
    plt.plot(x, obs, marker="o")
    plt.title(title)
    plt.xlabel("Predicted probability (bin center)")
    plt.ylabel("Observed frequency")
    plt.ylim(0,1)
    plt.show()

# Evaluate by market
shots_bt = bt_preds[bt_preds["market"] == "shots"].copy()
sot_bt = bt_preds[bt_preds["market"] == "sot"].copy()

print("Shots model (direct):", eval_block(shots_bt, "p_model_over_direct"))
print("SOT direct:", eval_block(sot_bt, "p_model_over_direct"))
print("SOT structural:", eval_block(sot_bt, "p_model_over_struct"))

plot_reliability(shots_bt, "p_model_over_direct", "Reliability — Shots (model)")
plot_reliability(sot_bt, "p_model_over_direct", "Reliability — SOT (direct)")
plot_reliability(sot_bt, "p_model_over_struct", "Reliability — SOT (structural)")

# Choose the SOT model with the lower Brier score (ties or NaN default to direct);
# check the reliability plots above before trusting the choice
sot_direct_brier = eval_block(sot_bt, "p_model_over_direct")["brier"]
sot_struct_brier = eval_block(sot_bt, "p_model_over_struct")["brier"]
SOT_CHOICE = "struct" if sot_struct_brier < sot_direct_brier else "direct"
print("Selected SOT model:", SOT_CHOICE)


In [None]:
# 14) Strategy simulation and CLV summaries (simple baseline)

def fractional_kelly(p: float, odds: float, frac: float=0.25, cap: float=0.03) -> float:
    b = odds - 1.0
    q = 1 - p
    if b <= 0:
        return 0.0
    f = (b*p - q) / b
    f = max(0.0, f) * frac
    return float(min(f, cap))

def apply_bet_filters(df: pd.DataFrame, p_col: str, ev_col: str) -> pd.DataFrame:
    out = df.copy()
    out["minutes_ci_width"] = out["min_ci_width"]
    out["pass_filters"] = (
        (out["p_start"] >= CFG.min_p_start) &
        (out["minutes_ci_width"] <= CFG.max_minutes_ci_width) &
        (out[ev_col] >= CFG.min_ev_mean)
    )
    return out

def simulate_strategy(df: pd.DataFrame, p_col: str, ev_col: str, label_col: str="y_over") -> pd.DataFrame:
    d = apply_bet_filters(df, p_col, ev_col).copy()
    d = d[d["pass_filters"]].copy()
    if d.empty:
        return d
    d["stake"] = d.apply(lambda r: fractional_kelly(float(r[p_col]), float(r["odds_over"])), axis=1)
    d["pnl"] = d["stake"] * np.where(d[label_col].astype(int) == 1, d["odds_over"] - 1.0, -1.0)
    return d

# Use chosen SOT model
bt_preds["p_used"] = np.where(
    bt_preds["market"] == "shots",
    bt_preds["p_model_over_direct"],
    np.where(SOT_CHOICE == "struct", bt_preds["p_model_over_struct"], bt_preds["p_model_over_direct"])
)
bt_preds["ev_used"] = np.where(
    bt_preds["market"] == "shots",
    bt_preds["ev_over_direct"],
    np.where(SOT_CHOICE == "struct", bt_preds["ev_over_struct"], bt_preds["ev_over_direct"])
)

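bets = simulate_strategy(bt_preds, "p_used", "ev_used")
if bets.empty:
    # simulate_strategy returns before adding stake/pnl columns when nothing passes
    print("No bets passed the filters.")
else:
    print("Bets:", len(bets), "Total PnL:", bets["pnl"].sum(), "Avg stake:", bets["stake"].mean())

# CLV summary if available
if not bets.empty and bets["clv_prob"].notna().any():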
    print("Mean CLV (prob):", bets["clv_prob"].mean())
    print("Share positive CLV:", (bets["clv_prob"] > 0).mean())

# Plot cumulative pnl
if len(bets) > 0:
    b = bets.sort_values("date").copy()
    b["cum_pnl"] = b["pnl"].cumsum()
    plt.figure()
    plt.plot(b["date"], b["cum_pnl"])
    plt.title("Cumulative PnL (baseline strategy)")
    plt.xlabel("Date")
    plt.ylabel("PnL (units)")
    plt.show()
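
For intuition on the stake sizes this strategy produces, here is a small worked example of the fractional Kelly rule with the same defaults as above (quarter Kelly, 3% cap). It restates the formula in a standalone function with a hypothetical name, and the probabilities and odds are invented.

```python
def fractional_kelly_sketch(p, odds, frac=0.25, cap=0.03):
    """Fraction of bankroll to stake: frac * full Kelly, floored at 0 and capped."""
    b = odds - 1.0  # net payout per unit staked
    if b <= 0:
        return 0.0
    full = (b * p - (1.0 - p)) / b
    return min(max(0.0, full) * frac, cap)

stake_small = fractional_kelly_sketch(0.55, 2.00)   # full Kelly 0.10, quarter Kelly 0.025
stake_capped = fractional_kelly_sketch(0.70, 2.00)  # full Kelly 0.40, quarter Kelly 0.10, capped at 0.03
stake_none = fractional_kelly_sketch(0.45, 2.00)    # negative edge, no bet
print(stake_small, stake_capped, stake_none)
```

With these defaults the cap binds whenever full Kelly exceeds 12% of bankroll, so large model-versus-market disagreements do not translate into proportionally large stakes.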


In [None]:
# 15) Candidate generation for upcoming matchweek (using upcoming_odds.csv)

def make_candidates(
    upcoming_odds: pd.DataFrame,
    feat_df: pd.DataFrame,
    minutes_model: MinutesModel,
    shots_model: NBModel,
    sot_direct_model: NBModel,
    acc_model: BinomModel,
    n_mc: int,
    rng: np.random.Generator,
    sot_choice: str
) -> pd.DataFrame:
    # Join features for the player-match rows. You must have those rows present pre-match.
    # Practical approach: build a placeholder row for each upcoming odds line using latest known features.
    # Here we approximate by taking each player's most recent feature row and overwriting match context if you have it.
    last_feat = (
        feat_df.sort_values("date")
              .groupby("player_id", as_index=False)
              .tail(1)
    )
    base = upcoming_odds.merge(last_feat, on="player_id", how="left", suffixes=("", "_last"))
    # date and match_id come from the odds side of the merge (un-suffixed columns); normalize the date dtype
    base["date"] = parse_date(base["date"])
    # If you have team/opponent/home_away for upcoming, include them in upcoming_odds and overwrite here.
    # Otherwise we keep last known team/opponent as a rough approximation (you should improve this for real use).

    preds = build_prediction_block(
        base, minutes_model, shots_model, sot_direct_model, acc_model,
        n_mc=n_mc, rng=rng
    )

    preds["p_used"] = np.where(
        preds["market"] == "shots",
        preds["p_model_over_direct"],
        np.where(sot_choice == "struct", preds["p_model_over_struct"], preds["p_model_over_direct"])
    )
    preds["ev_used"] = preds.apply(lambda r: ev_decimal(float(r["p_used"]), float(r["odds_over"])), axis=1)

    # Conservative lower bound: a flat 3-point haircut on p_used (a simple proxy; replace with a bootstrap interval if desired)
    preds["p_conservative"] = np.clip(preds["p_used"] - 0.03, 0.0, 1.0)
    preds["ev_low"] = preds.apply(lambda r: ev_decimal(float(r["p_conservative"]), float(r["odds_over"])), axis=1)

    preds["stake_rec"] = preds.apply(lambda r: fractional_kelly(float(r["p_used"]), float(r["odds_over"])), axis=1)

    preds["risk_flag_rotation"] = (preds["p_start"] < CFG.min_p_start).astype(int)
    preds["risk_flag_minutes_wide"] = (preds["min_ci_width"] > CFG.max_minutes_ci_width).astype(int)

    # Filters
    preds["pass_filters"] = (
        (preds["ev_used"] >= CFG.min_ev_mean) &
        ((preds["ev_low"] > 0) if CFG.require_ev_low_positive else True) &
        (preds["p_start"] >= CFG.min_p_start) &
        (preds["min_ci_width"] <= CFG.max_minutes_ci_width)
    )

    cols = [
        "date","match_id","player_id","player_name","team","opponent","market","line",
        "book","timestamp","odds_over","odds_under",
        "p_mkt_over","p_used","p_conservative",
        "ev_used","ev_low","stake_rec",
        "p_start","min_mean","min_p10","min_p90","min_ci_width",
        "risk_flag_rotation","risk_flag_minutes_wide",
        "pass_filters"
    ]
    for c in cols:
        if c not in preds.columns:
            preds[c] = np.nan
    preds = preds[cols].sort_values(["pass_filters","ev_used"], ascending=[False, False])
    return preds

if os.path.exists(CFG.upcoming_odds_path):
    upcoming_odds = load_odds_df(CFG.upcoming_odds_path)
    candidates = make_candidates(
        upcoming_odds, feat_df,
        minutes_model_all, shots_nb_all, sot_nb_all, acc_model_all,
        n_mc=CFG.count_mc_samples, rng=RNG,
        sot_choice=SOT_CHOICE
    )
    display(candidates.head(30))
else:
    print("No upcoming_odds.csv found at", CFG.upcoming_odds_path)


In [None]:
# 16) Exports (candidates + model_version)

def export_outputs(candidates: Optional[pd.DataFrame], out_dir: str = "outputs") -> None:
    os.makedirs(out_dir, exist_ok=True)
    version = {
        "run_datetime": str(pd.Timestamp.now()),
        "devig_method": CFG.devig_method,
        "sot_choice": SOT_CHOICE,
        "config": asdict(CFG),
        "note": "This notebook does not guarantee profit; validate with calibration + CLV over time."
    }
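    with open(os.path.join(out_dir, "model_version.json"), "w") as f:
        # default=str keeps the dump from failing on non-JSON types (e.g. Path or Timestamp) in the config
        json.dump(version, f, indent=2, default=str)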

    if candidates is not None:
        candidates.to_csv(os.path.join(out_dir, "candidates.csv"), index=False)
        print("Wrote:", os.path.join(out_dir, "candidates.csv"))
    print("Wrote:", os.path.join(out_dir, "model_version.json"))

if "candidates" in locals():
    export_outputs(candidates)
else:
    export_outputs(None)


# Assumptions & limitations

- **Data provenance & terms:** This notebook’s default “free-ish” stats source is **FBref via `soccerdata`** (cached locally). You are responsible for complying with the data providers’ terms of use, `robots.txt` rules, and rate limits. If you need guaranteed stability/compliance, use a licensed provider (Opta/StatsPerform, Wyscout, etc.).
- **SOT definition:** We infer “shots on target” from FBref shot-event outcomes containing **Goal** or **Saved**. If you have an official SOT field, swap it in (and re-run the backtest).
- **Odds history is the bottleneck:** A real edge test needs *historical* offered prices (and ideally a “closing” snapshot for CLV). If you only have current odds, you can still validate calibration of the model, but you can’t credibly validate “beating the market.”
- **Historical player props availability:** The Odds API’s historical coverage for non-featured markets (like player props) starts later than the underlying match stats. For earlier seasons you’ll need (a) your own archived snapshots, or (b) a paid/vendor feed.
- **Name matching is imperfect:** Odds feeds usually key props by **player name**, not a stable ID. This notebook uses a canonicalized-name hash as `player_id`, which can collide for rare duplicates and can break on diacritics / name changes / mid-season transfers. Always spot-check candidates.
- **Minutes uncertainty dominates:** If your minutes model is wrong (rotation, injuries, lineup shocks), shots/SOT prop pricing will be wrong. Treat the minutes model as a first-class component.
- **Market efficiency:** Shots/SOT props for popular players can be very efficient. “No stable edge detected” is a valid result.
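
To illustrate the name-matching caveat above, here is a minimal sketch of a canonicalized-name hash. The function names and the 12-character SHA-1 prefix are assumptions for illustration; the notebook's own ID helper lives in an earlier cell and may differ.

```python
import hashlib
import unicodedata

def canonical_name(name: str) -> str:
    # Decompose accented letters, drop the combining marks, lowercase
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace punctuation with spaces and collapse runs of whitespace
    cleaned = "".join(ch if ch.isalnum() else " " for ch in stripped.lower())
    return " ".join(cleaned.split())

def name_to_player_id(name: str, n_hex: int = 12) -> str:
    # Short, stable hash of the canonical name (hypothetical ID scheme)
    return hashlib.sha1(canonical_name(name).encode("utf-8")).hexdigest()[:n_hex]

same = name_to_player_id("Rúben Dias") == name_to_player_id("ruben  dias")
differ = name_to_player_id("Martin Ødegaard") != name_to_player_id("Martin Odegaard")
print(canonical_name("Son Heung-min"), same, differ)
```

The second comparison shows one way such a scheme breaks: `Ø` has no Unicode decomposition, so a feed that spells the name with a plain `O` hashes to a different `player_id` unless an explicit alias table is added.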

Next improvements (high ROI):
- Add an external projected-lineups signal to the minutes model (even a weak one helps).
- Improve upcoming-match context features (opponent allowance, home/away) instead of copying the last match’s features.
- Replace the conservative EV proxy with a proper bootstrap/posterior interval on `p_over`.
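
As a starting point for the last item, the sketch below shows the shape of a bootstrap interval on `p_over`. It deliberately uses a toy model, a plain Poisson rate estimated from one player's past counts, rather than the notebook's NB and minutes pipeline. All names and the sample counts are hypothetical; a real version would refit the full models on each resample.

```python
import math
import numpy as np

def bootstrap_p_over(counts, line, n_boot=2000, q_low=0.10, seed=0):
    """Mean and lower quantile of P(count > line) across bootstrap refits (toy Poisson model)."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    # Resample match rows with replacement and re-estimate the rate each time
    idx = rng.integers(0, len(counts), size=(n_boot, len(counts)))
    lam = counts[idx].mean(axis=1)
    # P(X > line) = 1 - sum_{k <= floor(line)} exp(-lam) * lam**k / k!
    ks = np.arange(int(math.floor(line)) + 1)
    fact = np.array([math.factorial(int(k)) for k in ks], dtype=float)
    cdf = (np.exp(-lam)[:, None] * lam[:, None] ** ks / fact).sum(axis=1)
    p = 1.0 - cdf
    return float(p.mean()), float(np.quantile(p, q_low))

# Made-up shot counts from ten past matches
history = [2, 0, 3, 1, 4, 2, 1, 0, 3, 2]
p_mean, p_low = bootstrap_p_over(history, line=1.5)
print(round(p_mean, 3), round(p_low, 3))
```

Using `p_low` in place of the flat haircut would make the conservative EV wider for players with short or noisy histories and tighter for players with long, stable ones.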
