# Logistic Regression for Expected Goals (xG) Modeling

This notebook applies a **Logistic Regression model** to estimate **expected goals (xG)** using dataset **DS3**. The aim is to assign to each shot the probability of resulting in a goal. Although Logistic Regression is formally a classification method, it is the traditional baseline in football analytics for xG modeling, since it naturally provides a probability between 0 and 1 that can be interpreted as the xG value.

Logistic Regression is an appropriate starting point because it is **simple and widely established** in the literature. Its coefficients can be inspected directly to assess the contribution of each feature, and because the model is fit by minimizing log loss, it tends to produce **well-calibrated probabilities** when properly trained. This makes it a natural benchmark for comparing more complex approaches such as Random Forest, XGBoost, or Neural Networks.
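
As a rough illustration of what inspecting the coefficients can look like, the sketch below fits a Logistic Regression on synthetic shots and reports each coefficient together with its odds ratio (`exp(coefficient)`). The features `distance` and `is_header`, the generating coefficients, and the names `X_demo`, `demo_model`, and `coef_table` are assumptions made up for this example; they are not taken from DS3.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic shots (hypothetical features, not DS3 columns)
rng = np.random.default_rng(0)
n_shots = 2000
distance = rng.uniform(5, 35, n_shots)    # metres from goal (assumed feature)
is_header = rng.integers(0, 2, n_shots)   # 1 if the shot is a header (assumed feature)

# Assumed generating process: scoring gets less likely with distance and for headers
true_logit = 1.0 - 0.15 * distance - 0.8 * is_header
goal = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

X_demo = pd.DataFrame({"distance": distance, "is_header": is_header})
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, goal)

# A coefficient is the change in log-odds per unit of the feature;
# exponentiating it gives the multiplicative change in the odds of a goal.
coef_table = pd.DataFrame(
    {
        "coefficient": demo_model.coef_[0],
        "odds_ratio": np.exp(demo_model.coef_[0]),
    },
    index=X_demo.columns,
)
print(coef_table)
```

An odds ratio below 1 means the feature lowers the odds of scoring. Coefficients are only comparable across features that share a scale, so standardizing the inputs first is a common choice when ranking feature contributions.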

The model is evaluated with the following metrics:

- **Brier Score**, measuring the mean squared error of probabilistic predictions.  

- **Log Loss**, rewarding correct calibration and penalizing confident but wrong predictions.  

- **RMSE (Root Mean Squared Error)** and **MAE (Mean Absolute Error)**, providing complementary error measures on the predicted probabilities; for binary outcomes, RMSE is the square root of the Brier Score.  

- **Calibration Curve (Reliability Diagram)**, offering a graphical evaluation of the alignment between predicted probabilities and observed outcomes.
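
The metrics listed above can be sketched on a small hand-made example. The arrays `y_true_demo` and `y_prob_demo` and the simulated `p_sim` / `y_sim` are illustrative assumptions, not DS3 data; only the scikit-learn functions match those imported in this notebook.

```python
import numpy as np
from sklearn.metrics import (
    brier_score_loss,
    log_loss,
    mean_squared_error,
    mean_absolute_error,
)
from sklearn.calibration import calibration_curve

# Toy shots: 1 = goal, 0 = no goal, with made-up predicted probabilities
y_true_demo = np.array([0, 0, 1, 0, 1, 0, 0, 1])
y_prob_demo = np.array([0.05, 0.10, 0.70, 0.20, 0.40, 0.05, 0.30, 0.90])

brier = brier_score_loss(y_true_demo, y_prob_demo)
mse = mean_squared_error(y_true_demo, y_prob_demo)
rmse = np.sqrt(mse)                      # RMSE is the square root of the MSE
mae = mean_absolute_error(y_true_demo, y_prob_demo)
ll = log_loss(y_true_demo, y_prob_demo)

print(f"Brier: {brier:.4f}  MSE: {mse:.4f}  RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}  LogLoss: {ll:.4f}")

# Calibration curve on simulated, perfectly calibrated predictions:
# each outcome is drawn with exactly its predicted probability.
rng = np.random.default_rng(0)
p_sim = rng.uniform(0, 1, 5000)
y_sim = rng.binomial(1, p_sim)
frac_pos, mean_pred = calibration_curve(y_sim, p_sim, n_bins=10)

# For a calibrated model the two arrays are close in every bin
print(np.round(frac_pos - mean_pred, 3))
```

With binary labels the Brier score and the MSE are the same quantity, so reporting both mainly serves as a consistency check; the RMSE puts the error back on the probability scale.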

This model therefore serves as a transparent and interpretable **baseline**, against which more advanced models will later be compared.

#### Imports and Global Settings

In [None]:
#  Imports and global settings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    brier_score_loss,
    log_loss,
    mean_squared_error,
    mean_absolute_error
)
from sklearn.calibration import calibration_curve
from sklearn.model_selection import train_test_split

import os
import random

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
random.seed(RANDOM_STATE)

# Display options
pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", lambda x: f"{x:,.4f}")


# Output paths
OUTPUT_DIR = "../task1_xg/outputs/04_logreg"
os.makedirs(OUTPUT_DIR, exist_ok=True)

print("Setup complete. Ready to load data.")


Setup complete. Ready to load data.


#### Load Dataset DS3 and train/test split

In [3]:
# Load the dataset
DATA_PATH = "../task1_xg/data/DS3.csv" 
ds3 = pd.read_csv(DATA_PATH)

print(f"Dataset loaded: {ds3.shape[0]} rows, {ds3.shape[1]} columns")
ds3.columns.tolist()

Dataset loaded: 2 rows, 1 columns


['version https://git-lfs.github.com/spec/v1']

In [None]:
# Define features and target
target_column = "target_xg"  # expected goals column
train_columns = [col for col in ds3.columns if col != target_column]

X = ds3[train_columns]
y = ds3[target_column]

# Sanity check on target
print("\nTarget (xG) stats:")
print(y.describe())
print(f"Range: {y.min():.4f} - {y.max():.4f}")

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=(y > 0).astype(int)
)

print(f"\nTraining set: {X_train.shape[0]} rows")
print(f"Test set: {X_test.shape[0]} rows")


#### Logistic Regression: training and evaluation

In [None]:
# Metrics function
def evaluate_predictions(y_true, y_pred, model_name="Model"):
    """Print and return error metrics for predicted goal probabilities."""
    metrics = {
        "Brier": brier_score_loss(y_true, y_pred),
        "LogLoss": log_loss(y_true, y_pred),
        # Take the square root so the value is an RMSE, not an MSE
        "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),
        "MAE": mean_absolute_error(y_true, y_pred)
    }
    print(f"\n{model_name} performance:")
    for k, v in metrics.items():
        print(f"{k:>8}: {v:.4f}")
    return metrics

In [None]:
# Initialize the model
log_reg = LogisticRegression(
    penalty="l2",               # regularization
    C=1.0,                      # inverse of regularization strength
    solver="lbfgs",             # efficient solver for small/medium datasets
    max_iter=1000,              # to ensure convergence
    random_state=RANDOM_STATE
)

# Train the model
log_reg.fit(X_train, y_train)

# Predict probabilities (xG values)
y_train_pred = log_reg.predict_proba(X_train)[:, 1]
y_test_pred = log_reg.predict_proba(X_test)[:, 1]

# Evaluate
logreg_train_metrics = evaluate_predictions(y_train, y_train_pred, "Logistic Regression (train)")
logreg_test_metrics = evaluate_predictions(y_test, y_test_pred, "Logistic Regression (test)")
