This notebook loads the raw football match data set, which contains match logs with a range of statistics from multiple leagues across many seasons, and filters it down to the following:

- Leagues: Bundesliga, English Premier League, La Liga, Ligue 1, and Serie A
- Season: 2023/2024
- Statistics: match score, xG, ball possession, shots on target, corners, and total passes

The output is a clean DataFrame in which each row represents a single match, ready for feature engineering.

In [1]:
import pandas as pd

In [2]:
# Load the full dataset

df = pd.read_csv(r"C:\cross-league-match-outcome-drivers\data\Football.csv")
df.shape


(95384, 91)

In [3]:
# Keep only the columns and rows of interest

# Select columns (in the raw data, the "_Host" suffix marks the away team)
cols = [
    'League', 'home_score', 'away_score', 'season_year',
    'expected_goals_xg_home', 'expected_goals_xg_host',
    'Ball_Possession_Home', 'Ball_Possession_Host',
    'Shots_on_Goal_Home', 'Shots_on_Goal_Host',
    'Corner_Kicks_Home', 'Corner_Kicks_Host',
    'Total_Passes_Home', 'Total_Passes_Host',
]
df = df[cols]

# Keep only 2023/2024 matches from the five selected leagues
df = df[
    (df["season_year"] == "2023/2024") &
    (df["League"].isin([
        "Premier-league",
        "Laliga",
        "Bundesliga",
        "Serie-a",
        "Ligue-1"
    ]))
]
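
The row filter depends on the league and season labels matching the raw data exactly, so a misspelled label would drop a whole league without any error. The sketch below repeats the same mask on a small toy DataFrame (the rows are illustrative, not real matches) and lists any requested league that ends up with no rows. The names `toy`, `top5`, `counts`, and `missing` are made up for this example.

```python
import pandas as pd

# Toy stand-in for the raw data; values are illustrative only.
toy = pd.DataFrame({
    "League": ["Premier-league", "Laliga", "Championship", "Serie-a", "Premier-league"],
    "season_year": ["2023/2024", "2023/2024", "2023/2024", "2022/2023", "2023/2024"],
    "home_score": [2, 1, 0, 3, 1],
    "away_score": [1, 1, 2, 0, 4],
})

top5 = ["Premier-league", "Laliga", "Bundesliga", "Serie-a", "Ligue-1"]

# Same boolean mask as in the cell above, applied to the toy frame.
mask = (toy["season_year"] == "2023/2024") & toy["League"].isin(top5)
filtered = toy[mask]

# Matches kept per league; a requested league with no rows hints at a label mismatch.
counts = filtered["League"].value_counts()
missing = sorted(set(top5) - set(counts.index))

print(counts)
print("Leagues with no rows:", missing)
```

Running the same two summary lines on the real `df` after filtering would be one way to confirm that all five leagues survived the filter.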

In [4]:
# Clean up the column names

df = df.rename(columns={
    "home_score": "home_goals",
    "away_score": "away_goals",
    "expected_goals_xg_home": "xg_home",
    "expected_goals_xg_host": "xg_away",
    "Ball_Possession_Home": "possession_home",
    "Ball_Possession_Host": "possession_away",
    "Shots_on_Goal_Home": "shots_on_target_home",
    "Shots_on_Goal_Host": "shots_on_target_away",
    "Corner_Kicks_Home": "corners_home",
    "Corner_Kicks_Host": "corners_away",
    "Total_Passes_Home": "passes_home",
    "Total_Passes_Host": "passes_away"
})
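
By default, `DataFrame.rename` skips any key in the mapping that is not an existing column, so a typo in a source column name goes unnoticed. As a minimal sketch on toy data, passing `errors="raise"` turns that case into a `KeyError`. The misspelled key `away_scores` and the variable names are hypothetical and only serve the illustration.

```python
import pandas as pd

toy = pd.DataFrame({"home_score": [2], "away_score": [1]})

# "away_scores" is a deliberate typo to show the failure mode.
mapping = {"home_score": "home_goals", "away_scores": "away_goals"}

# Default behaviour: the unknown key is skipped without any warning.
silent = toy.rename(columns=mapping)
print(list(silent.columns))

# With errors="raise", the same typo surfaces as a KeyError.
try:
    toy.rename(columns=mapping, errors="raise")
    raised = False
except KeyError as exc:
    raised = True
    print("KeyError:", exc)
```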


In [5]:
# Save filtered data

df.to_csv("../data/football_2023_2024_top5_clean.csv", index=False)
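
A quick way to confirm the export is to read the file back and compare it with the frame in memory. The sketch below does this with a toy DataFrame and a temporary directory so it can run anywhere. The file name `toy_clean.csv` and the sample values are assumptions for the example, not part of the project.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame using a few of the cleaned column names; values are illustrative.
toy = pd.DataFrame({
    "League": ["Bundesliga", "Ligue-1"],
    "home_goals": [2, 0],
    "away_goals": [1, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_clean.csv"
    # index=False keeps the row index out of the file, as in the cell above.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded frame should have the same shape and column order.
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```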
