This notebook loads the raw football match data set, which contains match logs with a variety of statistics from many leagues over many years. The data is filtered down to the following:

- Leagues: Bundesliga, English Premier League, La Liga, Ligue 1, and Serie A
- Season: 2023/2024
- Statistics: Match score, xG, ball possession, shots on target, corners, and total passes

The output is a filtered DataFrame in which each row represents a single match, ready for feature engineering.

In [18]:
import pandas as pd

In [19]:
# Load the full dataset
df = pd.read_csv("../data/Football.csv")
df.shape


(95384, 91)

## Scope and Feature Selection

This project analyses Europe's top five football leagues (Premier League, La Liga, Bundesliga, Serie A, and Ligue 1) in the 2023/2024 season. A single season is used to avoid effects from changes in playing style that may occur from one season to the next.

The selected statistics represent complementary aspects of performance:
- **Quality of goal-scoring chances**: Expected goals (xG)
- **Attacking execution**: Shots on target
- **Ball control**: Possession, measuring the proportion of match time a team controls the ball
- **Ball circulation and tempo**: Total passes, reflecting how actively possession is used
- **Sustained attacking pressure**: Corners

In [20]:
# Keep only the columns and rows needed for the analysis

# Keep only the selected columns (the "_Host" columns hold the away team's values)
cols = [
    'League', 'season_year',
    'home_score', 'away_score',
    'expected_goals_xg_home', 'expected_goals_xg_host',
    'Ball_Possession_Home', 'Ball_Possession_Host',
    'Shots_on_Goal_Home', 'Shots_on_Goal_Host',
    'Corner_Kicks_Home', 'Corner_Kicks_Host',
    'Total_Passes_Home', 'Total_Passes_Host',
]
df = df[cols]

# Keep only 2023/2024 matches from the top five leagues
df = df[
    (df["season_year"] == "2023/2024") &
    (df["League"].isin([
        "Premier-league",
        "Laliga",
        "Bundesliga",
        "Serie-a",
        "Ligue-1"
    ]))
]
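
Because `isin` silently ignores labels that do not occur in the data, a misspelled league name would simply yield fewer rows without raising an error. The following is a minimal sketch of how the filter above could be sanity-checked. It runs on a small made-up frame (`toy`, with a hypothetical "Eredivisie" row) rather than the real data set, and the variable names are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the raw data; all values are made up for illustration.
toy = pd.DataFrame({
    "League": ["Premier-league", "Laliga", "Eredivisie", "Serie-a", "Premier-league"],
    "season_year": ["2023/2024", "2023/2024", "2023/2024", "2022/2023", "2023/2024"],
    "home_score": [2, 0, 1, 3, 4],
})

top5 = ["Premier-league", "Laliga", "Bundesliga", "Serie-a", "Ligue-1"]

# Requested labels that never occur in the data (isin would skip them silently).
missing_labels = sorted(set(top5) - set(toy["League"].unique()))

# Same filter logic as in the notebook: one season, five leagues.
mask = (toy["season_year"] == "2023/2024") & toy["League"].isin(top5)
filtered = toy[mask]

# Number of matches that survive the filter, per league.
matches_per_league = filtered["League"].value_counts()

print(missing_labels)
print(matches_per_league)
```

Applied to the real `df`, a non-empty `missing_labels` list would point to a spelling mismatch between the filter and the `League` column.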

In [21]:
# Rename the columns to short, consistent names

df = df.rename(columns={
    "home_score": "home_goals",
    "away_score": "away_goals",
    "expected_goals_xg_home": "xg_home",
    "expected_goals_xg_host": "xg_away",
    "Ball_Possession_Home": "possession_home",
    "Ball_Possession_Host": "possession_away",
    "Shots_on_Goal_Home": "shots_on_target_home",
    "Shots_on_Goal_Host": "shots_on_target_away",
    "Corner_Kicks_Home": "corners_home",
    "Corner_Kicks_Host": "corners_away",
    "Total_Passes_Home": "passes_home",
    "Total_Passes_Host": "passes_away"
})

df.sample(10)


Unnamed: 0,League,home_goals,away_goals,season_year,xg_home,xg_away,possession_home,possession_away,shots_on_target_home,shots_on_target_away,corners_home,corners_away,passes_home,passes_away
63285,Serie-a,3,0,2023/2024,1.19,0.8,60%,40%,2.0,4.0,4.0,2.0,,
36811,Premier-league,2,1,2023/2024,3.12,0.5,69%,31%,5.0,2.0,8.0,1.0,650.0,298.0
83343,Laliga,0,2,2023/2024,0.79,1.12,41%,59%,2.0,4.0,8.0,5.0,,
189,Bundesliga,1,1,2023/2024,1.92,1.03,54%,46%,4.0,3.0,4.0,6.0,523.0,415.0
37088,Premier-league,6,1,2023/2024,1.74,0.62,66%,34%,8.0,1.0,12.0,1.0,731.0,395.0
63087,Serie-a,1,2,2023/2024,1.38,1.22,55%,45%,2.0,6.0,6.0,2.0,,
16507,Ligue-1,0,1,2023/2024,1.04,0.48,55%,45%,5.0,2.0,6.0,4.0,490.0,408.0
36888,Premier-league,4,3,2023/2024,2.82,1.54,57%,43%,10.0,5.0,12.0,3.0,549.0,426.0
62994,Serie-a,0,2,2023/2024,1.21,1.71,44%,56%,3.0,9.0,2.0,5.0,,
63250,Serie-a,1,3,2023/2024,0.65,0.78,46%,54%,5.0,6.0,2.0,6.0,,
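
The sample above shows two things that matter for later feature engineering: possession is stored as text such as `60%`, and the pass counts are missing in the sampled Serie A and La Liga rows. The sketch below shows one possible way to handle both, using three rows copied from the sample output. The frame name `sample` and the choice to convert possession to a plain number are assumptions for illustration, not steps performed in this notebook.

```python
import pandas as pd

# Three rows copied from the sample output above (subset of columns).
sample = pd.DataFrame({
    "League": ["Serie-a", "Premier-league", "Bundesliga"],
    "possession_home": ["60%", "69%", "54%"],
    "possession_away": ["40%", "31%", "46%"],
    "passes_home": [float("nan"), 650.0, 523.0],
    "passes_away": [float("nan"), 298.0, 415.0],
})

# Strip the trailing "%" and convert to numbers; unparseable values become NaN.
for col in ["possession_home", "possession_away"]:
    sample[col] = pd.to_numeric(sample[col].str.rstrip("%"), errors="coerce")

# Count missing pass values per league to see where the gaps are concentrated.
missing_passes = sample["passes_home"].isna().groupby(sample["League"]).sum()

print(sample.dtypes)
print(missing_passes)
```

Running the same per-league count on the full filtered `df` would show whether the missing pass data affects whole leagues or only individual matches.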


In [22]:
# Save the filtered data
df.to_csv("../data/football_2023_2024_top5_clean.csv", index=False)
