## XGBoost for Expected Goals (xG) Modeling

This notebook applies **XGBoost (Extreme Gradient Boosting)** to estimate **expected goals (xG)** using dataset **DS4**.  
XGBoost is a highly efficient and scalable implementation of **gradient boosted decision trees**, designed to deliver both accuracy and speed. 

While Random Forest aggregates many independently trained decision trees, XGBoost builds trees **sequentially**: each new tree is fit to the errors left by the ensemble built so far. This **boosting approach** often improves accuracy on tabular data and helps the model capture **non-linear relationships** and **complex feature interactions**.  
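
The sequential idea can be illustrated with a minimal sketch of gradient boosting for squared error, where each tree is fit to the current residuals. This is not XGBoost's actual algorithm (XGBoost additionally uses second-order gradient information and regularization); the toy data, `learning_rate`, `n_rounds`, and all variable names below are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: a non-linear target with a feature interaction
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 1, size=(500, 2))
y_toy = np.sin(4 * X_toy[:, 0]) * X_toy[:, 1] + rng.normal(0, 0.05, size=500)

learning_rate = 0.1  # shrinkage applied to each tree's contribution
n_rounds = 50        # number of boosting rounds (trees)

# Start from a constant prediction: the mean of the target
prediction = np.full_like(y_toy, y_toy.mean())
initial_mse = np.mean((y_toy - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # Residuals are the errors of the ensemble built so far
    residuals = y_toy - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=42)
    tree.fit(X_toy, residuals)
    # Add a shrunken correction from the new tree
    prediction += learning_rate * tree.predict(X_toy)
    trees.append(tree)

final_mse = np.mean((y_toy - prediction) ** 2)
print(f"Training MSE before boosting: {initial_mse:.5f}")
print(f"Training MSE after {n_rounds} rounds: {final_mse:.5f}")
```

The training error shrinks as trees are added, which is why boosted models also need a held-out test set to detect overfitting.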

To remain consistent with previous models, we evaluate XGBoost using:  

- **RMSE (Root Mean Squared Error)** and **MAE (Mean Absolute Error)**, measuring prediction error in the units of the target; RMSE penalizes large errors more heavily than MAE.  

- **R² (Coefficient of Determination)** and **Explained Variance**, assessing how much of the variance in the target is captured by the model.  

- **Pearson and Spearman Correlation**, quantifying the strength of linear and monotonic relationships between predicted and true values.  

- **Calibration Curve**, providing a graphical evaluation of how closely predicted probabilities align with observed values across probability bins.
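
As a rough illustration of the metrics listed above, the sketch below computes them on a small set of made-up values. The arrays `y_true` and `y_pred`, the choice of five equal-width bins, and all variable names are assumptions for demonstration, not results from DS4. RMSE is computed here as the square root of `mean_squared_error`, which also works on scikit-learn versions that predate `root_mean_squared_error`.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import (
    explained_variance_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)

# Made-up true and predicted xG values (illustration only)
y_true = np.array([0.05, 0.10, 0.30, 0.70, 0.02, 0.45])
y_pred = np.array([0.07, 0.08, 0.35, 0.62, 0.03, 0.50])

metrics = {
    "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
    "mae": float(mean_absolute_error(y_true, y_pred)),
    "r2": float(r2_score(y_true, y_pred)),
    "explained_variance": float(explained_variance_score(y_true, y_pred)),
    "pearson": float(pearsonr(y_true, y_pred)[0]),
    "spearman": float(spearmanr(y_true, y_pred)[0]),
}
for name, value in metrics.items():
    print(f"{name:>20}: {value:.4f}")

# Binned calibration: compare mean prediction with mean target per bin
edges = np.linspace(0.0, 1.0, 6)              # five equal-width bins
bin_idx = np.digitize(y_pred, edges[1:-1])    # bin index 0..4 per prediction
calibration_rows = []
for b in np.unique(bin_idx):
    mask = bin_idx == b
    calibration_rows.append(
        (int(b), float(y_pred[mask].mean()), float(y_true[mask].mean()))
    )
for b, mean_pred, mean_true in calibration_rows:
    print(f"bin {b}: mean predicted = {mean_pred:.3f}, mean observed = {mean_true:.3f}")
```

Plotting mean predicted against mean observed values per bin gives the calibration curve; points close to the diagonal indicate well-calibrated predictions.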


#### Imports and global settings

In [None]:
# Standard library
import os
import random
import warnings

# Third-party
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import (
    explained_variance_score,
    mean_absolute_error,
    r2_score,
    root_mean_squared_error,  # requires scikit-learn >= 1.4
)
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Silence library warnings to keep notebook output readable.
# Note: this also hides deprecation warnings.
warnings.filterwarnings("ignore")

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
random.seed(RANDOM_STATE)

# Display options
pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", lambda x: f"{x:,.4f}")

# Output paths
OUTPUT_DIR = "../task1_xg/outputs"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Model directory
MODEL_DIR = "../task1_xg/models"
os.makedirs(MODEL_DIR, exist_ok=True)

print("Setup complete. Ready to load data.")

#### Load dataset DS4

In [None]:
DATA_PATH = "../task1_xg/data/DS4.csv"
ds4 = pd.read_csv(DATA_PATH)

print(f"Dataset loaded: {ds4.shape[0]} rows, {ds4.shape[1]} columns")
print("Columns:", list(ds4.columns))

# Preview first rows
ds4.head()

#### Define features, target and train/test split

In [None]:
# Target column; every other column in DS4 is used as a feature
target_column = "target_xg"
train_columns = [col for col in ds4.columns if col != target_column]

X = ds4[train_columns]
y = ds4[target_column]

# Check on target
print("\nTarget (xG) stats:")
print(y.describe())
print(f"Range: {y.min():.4f} - {y.max():.4f}")

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=RANDOM_STATE
)

print(f"\nTraining set: {X_train.shape[0]} rows, {X_train.shape[1]} features")
print(f"Test set:     {X_test.shape[0]} rows, {X_test.shape[1]} features")

#### Training the XGBoost Model