<a href="https://colab.research.google.com/github/Ashvin7/pl-xg-ml/blob/main/notebooks/Phase_0_Project_Setup.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Phase 0: Project Setup & Scope Definition

This notebook documents the initial setup and scoping decisions for the project:
**Predicting Premier League Standings Using Expected Goals (xG/xGA) and Machine Learning**.


## Project Scope

**Primary Target Variable**
- Final season points total (regression task)

**Secondary Target Variable**
- Final league ranking (derived from predicted points)
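
Deriving the ranking from predicted points amounts to a descending sort. The following is a minimal sketch, assuming a hypothetical table with `team` and `predicted_points` columns (names and values are invented for illustration). Real league tables break ties on goal difference and goals scored, which this sketch does not model.

```python
import pandas as pd

# Hypothetical predictions; team names and values are illustrative only.
preds = pd.DataFrame({
    "team": ["Team A", "Team B", "Team C", "Team D"],
    "predicted_points": [78.4, 85.1, 61.0, 78.4],
})

# Rank 1 = most points. method="min" gives tied teams the same (best) rank;
# the real league breaks ties on goal difference, which is not modeled here.
preds["predicted_rank"] = (
    preds["predicted_points"].rank(ascending=False, method="min").astype(int)
)

table = preds.sort_values(["predicted_rank", "team"]).reset_index(drop=True)
print(table)
```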

**Supporting Outputs**
- Top-4 qualification probability
- Relegation probability
- Over/under-performance relative to xG

These targets are fixed for the duration of the project.
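
One way the top-4 and relegation probabilities could be estimated is by simulating many seasons around the predicted points. The sketch below is an assumption, not the project's chosen method: it treats prediction error as normal with a placeholder spread (`error_std`), and all point values are invented.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical point predictions for a 20-team league (illustrative values).
teams = [f"Team {i + 1}" for i in range(20)]
predicted_points = np.linspace(90, 25, num=20)

# Assumption: prediction error is roughly normal with a fixed spread.
# The spread (in points) is a made-up placeholder, not an estimated value.
error_std = 8.0
n_sims = 10_000

# Simulated points: one row per simulated season, one column per team.
sims = predicted_points + rng.normal(0.0, error_std, size=(n_sims, 20))

# Double argsort turns points into positions: 0 = first place, 19 = last.
positions = np.argsort(np.argsort(-sims, axis=1), axis=1)

top4_prob = (positions < 4).mean(axis=0)
relegation_prob = (positions >= 17).mean(axis=0)  # bottom three go down

for team, p4, prel in zip(teams[:3], top4_prob[:3], relegation_prob[:3]):
    print(f"{team}: top-4 {p4:.2f}, relegation {prel:.2f}")
```

In practice the spread would be estimated from held-out residuals rather than fixed by hand.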


## Seasons Included

The analysis will include the following Premier League seasons:

- 2015–16 through 2023–24

This range covers nine seasons (180 team-seasons at 20 teams per season), which is recent enough to remain tactically relevant, and includes notable case studies such as Leicester City’s 2015–16 title-winning season.
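
For looping over seasons in later phases, the labels can be generated rather than typed by hand. A small sketch, assuming a hypothetical helper `season_labels` and an ASCII-hyphen label format such as `2015-16` (the format actually used in file names or source URLs may differ):

```python
def season_labels(first_start_year: int, last_start_year: int) -> list[str]:
    """Return labels such as '2015-16' for each season in the inclusive range."""
    return [
        f"{year}-{(year + 1) % 100:02d}"
        for year in range(first_start_year, last_start_year + 1)
    ]


seasons = season_labels(2015, 2023)
print(len(seasons), seasons[0], seasons[-1])
```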


## Core Feature Set

The core feature set is fixed to prevent scope creep and ensure model comparability.

**Attack & Defense**
- xG per 90
- xGA per 90
- xG difference per 90
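
The three attack and defense features can be derived from season totals. A minimal sketch, assuming hypothetical column names (`matches`, `xg`, `xga`) and invented values, and assuming every team match counts as 90 minutes so that per-90 equals per-match:

```python
import pandas as pd

# Hypothetical season totals; column names and values are illustrative.
season = pd.DataFrame({
    "team": ["Team A", "Team B"],
    "matches": [38, 38],
    "xg": [76.0, 38.0],
    "xga": [38.0, 57.0],
})

# Assumption: every team match counts as 90 minutes, so per-90 = per match.
season["xg_per90"] = season["xg"] / season["matches"]
season["xga_per90"] = season["xga"] / season["matches"]
season["xgd_per90"] = season["xg_per90"] - season["xga_per90"]

print(season[["team", "xg_per90", "xga_per90", "xgd_per90"]])
```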

**Chance Creation**
- Shots
- Shots on target
- Shot-creating actions

**Possession & Control**
- Possession percentage
- Progressive passes

**Pressing**
- PPDA (passes per defensive action)
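
PPDA is commonly computed as the opponent's passes divided by the team's defensive actions (tackles, interceptions, fouls, challenges), usually restricted to a zone of the pitch. The exact zone and action list vary by data provider, so the sketch below assumes the two counts have already been filtered; the function name and inputs are hypothetical.

```python
def ppda(opponent_passes: int, defensive_actions: int) -> float:
    """Opponent passes allowed per defensive action.

    Lower values indicate more intense pressing. Both counts are assumed
    to be pre-filtered to the pitch zone used by the data provider.
    """
    if defensive_actions <= 0:
        raise ValueError("defensive_actions must be positive")
    return opponent_passes / defensive_actions


print(ppda(opponent_passes=300, defensive_actions=30))
```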

**Contextual Features**
- Home vs away performance splits
- Rolling form metrics (last 3–5 matches)
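
Rolling form can be sketched with a grouped rolling mean. The example assumes a hypothetical match log with `team`, `matchweek`, and `points` columns, uses invented results, and picks a window of 5 arbitrarily from the 3 to 5 range. Shifting by one match is an assumption made here so that the feature only reflects earlier results.

```python
import pandas as pd

# Hypothetical match log for one team, in chronological order.
matches = pd.DataFrame({
    "team": ["Team A"] * 6,
    "matchweek": [1, 2, 3, 4, 5, 6],
    "points": [3, 1, 0, 3, 3, 1],
})

window = 5  # the scope allows 3 to 5 matches; 5 is an arbitrary choice here

matches = matches.sort_values(["team", "matchweek"]).reset_index(drop=True)

# shift(1) excludes the current match so the feature only uses past results.
matches["form_points"] = matches.groupby("team")["points"].transform(
    lambda s: s.shift(1).rolling(window, min_periods=1).mean()
)

print(matches)
```

The first match of each team has no history, so its form value is missing and needs an explicit fill or drop rule downstream.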

Additional features may be explored experimentally but will not be required for the core modeling pipeline.


In [None]:
import os

folders = [
    "data/raw",
    "data/processed",
    "figures",
    "src/models",
    "src/evaluation",
    "src/simulation"
]

# exist_ok=True makes this cell safe to re-run; nested paths are created as needed.
for folder in folders:
    os.makedirs(folder, exist_ok=True)

print("Project folder structure created.")


In [None]:
# lxml is the HTML parser pd.read_html uses by default.
!pip install pandas numpy matplotlib seaborn scikit-learn xgboost shap statsmodels lxml

In [None]:
import pandas as pd
import numpy as np
import sklearn
import xgboost

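# Report versions so the environment can be reproduced later.
print("pandas", pd.__version__, "| numpy", np.__version__)
print("scikit-learn", sklearn.__version__, "| xgboost", xgboost.__version__)
print("Environment ready.")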


## Data Sources

- FBref: team-level season statistics (xG, xGA, shots, possession)
- Understat: match-level xG and xGA data
- StatsBomb Open Data: event-level data (pressures, passes, xA)

Raw data will be stored in `data/raw` (relative to the project root) and processed into modeling tables in `data/processed`.
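
Keeping the storage locations in one place avoids hard-coded strings across notebooks. A minimal sketch using `pathlib`; the helper name `raw_path` and the `<source>_<season>.csv` naming scheme are hypothetical, not a convention fixed by this project:

```python
from pathlib import Path

RAW_DIR = Path("data/raw")
PROCESSED_DIR = Path("data/processed")


def raw_path(source: str, season: str) -> Path:
    """Build a path such as data/raw/fbref_2015-16.csv (naming is hypothetical)."""
    return RAW_DIR / f"{source}_{season}.csv"


print(raw_path("fbref", "2015-16"))
```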


In [None]:
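# NOTE: This is a preliminary connectivity check only.
# Data ingestion and cleaning will be performed in Phase 1 (01_eda.ipynb).
# pd.read_html needs an HTML parser (lxml, or bs4 with html5lib) to be installed.

url = "https://fbref.com/en/comps/9/Premier-League-Stats"

try:
    tables = pd.read_html(url)
    print(f"Found {len(tables)} tables.")
except Exception as exc:  # e.g. an HTTP error if the site rejects or rate-limits the request
    print(f"Connectivity check failed: {exc}")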