# Which Statistics Drive Winning in European Football? – Feature Engineering

This notebook focuses on transforming the cleaned match-level dataset into a set of features suitable for modeling. A binary target variable is created to represent match outcome, and match statistics are converted into relative performance measures between the home and away teams.

Specifically, this notebook:
- Defines the target variable as a home team win indicator
- Constructs differential features (home − away) for selected match statistics
- Produces a modeling-ready dataset where each row represents a match with its corresponding outcome and relative performance features

The resulting target variable and differential features are saved to a CSV file for use in model training and evaluation.

In [1]:
import pandas as pd

# Read filtered CSV
df = pd.read_csv("../data/football_2023_2024_top5_clean.csv")

The target variable created in this section is a binary indicator of whether the home team won the match, since winning is the outcome being assessed against the performance statistics across the leagues. Draws and away wins are both coded as 0.

In [2]:
# Create the binary target: 1 if the home team scored more goals, else 0 (draw or away win)
df["home_win"] = (df["home_goals"] > df["away_goals"]).astype(int)

# Verify the target by checking its class distribution
print("Target distribution (counts):")
print(df["home_win"].value_counts())

print("\nTarget distribution (proportions):")
print(df["home_win"].value_counts(normalize=True).round(3))

# Spot-check that home_win is 1 only when home_goals exceeds away_goals
df[["League", "home_goals", "away_goals", "home_win"]].sample(10)


Target distribution (counts):
home_win
0    847
1    637
Name: count, dtype: int64

Target distribution (proportions):
home_win
0    0.571
1    0.429
Name: proportion, dtype: float64


Unnamed: 0,League,home_goals,away_goals,home_win
1427,Laliga,0,2,0
57,Bundesliga,3,1,1
1168,Serie-a,2,2,0
534,Ligue-1,2,3,0
463,Ligue-1,1,3,0
462,Ligue-1,2,0,1
1169,Serie-a,1,1,0
1101,Serie-a,1,0,1
1194,Serie-a,1,1,0
602,Ligue-1,1,2,0
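
Because the target is binary, class 0 mixes two different results: draws and away wins. The sketch below shows one way to see that split. It runs on a small toy frame with made-up scores rather than the real dataset, and the `result` column is a hypothetical helper that does not exist in this notebook.

```python
import pandas as pd

# Toy rows standing in for the real match data (scores are illustrative only)
toy = pd.DataFrame({
    "home_goals": [0, 3, 2, 1],
    "away_goals": [2, 1, 2, 0],
})

# Same rule as the notebook: 1 only when the home side scores strictly more
toy["home_win"] = (toy["home_goals"] > toy["away_goals"]).astype(int)

# Hypothetical helper column (not in the notebook) separating draws from away wins
goal_diff = toy["home_goals"] - toy["away_goals"]
toy["result"] = goal_diff.apply(
    lambda d: "home_win" if d > 0 else ("draw" if d == 0 else "away_win")
)

# Cross-tabulate to see which results fall into each target class
breakdown = pd.crosstab(toy["result"], toy["home_win"])
print(breakdown)
```

Running the same cross-tabulation on `df` would show how much of class 0 consists of draws, which may matter when interpreting the model later.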


Differential features are used to represent relative match dominance, because football outcomes are determined by performance relative to the opponent rather than by absolute values alone.

In [3]:


# Create home-minus-away differential features
df["xg_diff"] = df["xg_home"] - df["xg_away"]

# Strip the percent sign and cast possession to float so it can be subtracted
df["possession_home"] = (
    df["possession_home"]
    .str.replace("%", "", regex=False)
    .astype(float)
)

df["possession_away"] = (
    df["possession_away"]
    .str.replace("%", "", regex=False)
    .astype(float)
)
df["possession_diff"] = df["possession_home"] - df["possession_away"]


df["shots_on_target_diff"] = df["shots_on_target_home"] - df["shots_on_target_away"]
df["corners_diff"] = df["corners_home"] - df["corners_away"]
df["passes_diff"] = df["passes_home"] - df["passes_away"]

df[[
    "xg_home","xg_away","xg_diff",
    "possession_home","possession_away","possession_diff",
    "shots_on_target_home","shots_on_target_away","shots_on_target_diff",
    "corners_home","corners_away","corners_diff",
    "passes_home","passes_away","passes_diff"
]].head(10)


Unnamed: 0,xg_home,xg_away,xg_diff,possession_home,possession_away,possession_diff,shots_on_target_home,shots_on_target_away,shots_on_target_diff,corners_home,corners_away,corners_diff,passes_home,passes_away,passes_diff
0,1.94,2.76,-0.82,42.0,58.0,-16.0,5.0,8.0,-3.0,14.0,8.0,6.0,363.0,703.0,-340.0
1,1.32,2.25,-0.93,55.0,45.0,10.0,3.0,8.0,-5.0,7.0,8.0,-1.0,482.0,313.0,169.0
2,2.86,0.77,2.09,62.0,38.0,24.0,8.0,2.0,6.0,5.0,4.0,1.0,728.0,382.0,346.0
3,2.83,0.42,2.41,71.0,29.0,42.0,10.0,1.0,9.0,7.0,3.0,4.0,727.0,258.0,469.0
4,2.02,1.73,0.29,46.0,54.0,-8.0,6.0,5.0,1.0,8.0,3.0,5.0,432.0,520.0,-88.0
5,0.92,2.56,-1.64,50.0,50.0,0.0,7.0,9.0,-2.0,5.0,12.0,-7.0,370.0,419.0,-49.0
6,2.21,1.59,0.62,48.0,52.0,-4.0,9.0,4.0,5.0,6.0,5.0,1.0,469.0,509.0,-40.0
7,1.84,1.45,0.39,58.0,42.0,16.0,11.0,5.0,6.0,8.0,4.0,4.0,716.0,475.0,241.0
8,2.68,1.04,1.64,43.0,57.0,-14.0,5.0,5.0,0.0,3.0,7.0,-4.0,303.0,517.0,-214.0
9,3.87,1.98,1.89,48.0,52.0,-4.0,12.0,8.0,4.0,8.0,8.0,0.0,355.0,392.0,-37.0
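
The differential columns above all follow the same `<stat>_home` minus `<stat>_away` pattern, so they could also be built in a loop. The following is a minimal sketch of that idea, not the method used in this notebook: `add_diff_features` is a hypothetical helper, and the toy values (including the `"42%"` string format for possession) are assumptions chosen to mirror the columns shown above.

```python
import pandas as pd

def add_diff_features(frame, stats):
    """Return a copy of `frame` with a `<stat>_diff` column (home minus away) per stat."""
    out = frame.copy()
    for stat in stats:
        home_col, away_col = f"{stat}_home", f"{stat}_away"
        for col in (home_col, away_col):
            # Text columns such as "42%" are converted to float;
            # columns that are already numeric are left untouched
            if not pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].astype(str).str.rstrip("%").astype(float)
        out[f"{stat}_diff"] = out[home_col] - out[away_col]
    return out

# Toy input (illustrative values, not read from the real CSV)
toy = pd.DataFrame({
    "xg_home": [1.94, 2.86],
    "xg_away": [2.76, 0.77],
    "possession_home": ["42%", "62%"],
    "possession_away": ["58%", "38%"],
})

features = add_diff_features(toy, ["xg", "possession"])
print(features[["xg_diff", "possession_diff"]])
```

Working on a copy keeps the input frame unchanged, which makes the helper safe to re-run in a notebook cell; the in-place possession conversion above would fail on a second run because the columns are no longer strings.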


In [4]:
# Save processed CSV
df.to_csv("../data/football_2023_2024_features.csv", index=False)
