This project focuses on the analysis of event and tracking data from the German Bundesliga (seasons 22/23 & 23/24), applying AI methodologies to tackle match analysis challenges.
The preprocessing pipeline prepares raw event and tracking data for modeling. Key steps include:
- Data Parsing: Raw event and tracking data are parsed into structured formats.
- Aggregation: Team-level performance metrics are aggregated for each match.
- Post-Match Prediction: Features are generated based on aggregated data for retrospective analysis.
- Pre-Match Prediction KPI Calculation: Key performance indicators (KPIs) are computed using pre-match data.
- Pre-Match Prediction (Goal Number & Match Outcome): These KPIs serve as input features to predict the number of goals per team in future fixtures. The predicted goals are then used to estimate the outcomes of upcoming matches.
The project includes two main predictive modeling tasks:
- Pre-Match Prediction: Forecasting outcomes using only data available before the match begins.
To predict the match outcome from pre-match information, a two-stage approach was developed. In the first stage, machine learning models were trained to predict the number of goals scored by each team. These models used only information on the performances of both opposing teams in their last three matches. In the second stage, the predicted numbers of goals for both teams were used to simulate the match outcome and quantify the outcome probabilities.
Features: To predict the target variable, the number of goals scored by an individual team (the considered team) in an upcoming match against an upcoming opponent, we engineered features based on the core idea introduced by Dixon and Coles for forecasting future performances (Dixon & Coles, 1997). According to this idea, a well-designed model for predicting match outcomes should incorporate the following aspects: (i) the abilities of both teams, (ii) the contextual factor of match venue (home vs. away), (iii) a team's ability should be reflected by its recent performances, (iv) this ability should consider both offensive and defensive strengths, and (v) those recent performances should be weighted by the strength of the opponents faced in those matches. Therefore, the computed features cover the match context (e.g. venue), the relative quality of the opponent (e.g. difference in table position between the considered and the opposing team), the attacking performance of the considered team in the last three matches (e.g. EPV measures, xG measures), the defensive performance of the upcoming opponent in the last three matches (e.g. xBG measures, xG conceded measures), and the difficulty of the last three matches for both the considered team and the upcoming opponent (e.g. table position of the last three opponents). All feature specifications and their descriptions can be found in Table 1.
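As a rough illustration of the feature idea above, the following sketch averages per-match values over a team's last three matches and combines them with context features. The field names (`xg`, `xg_conceded`, `opp_table_pos`), the function names, and the choice of a plain mean are assumptions made for this example; the actual feature definitions are those in Table 1.

```python
from statistics import mean


def last_n_mean(history, key, n=3):
    """Mean of `key` over the most recent `n` matches (history is oldest first)."""
    recent = history[-n:]
    return mean(match[key] for match in recent)


def build_features(team_hist, opp_hist, is_home, team_pos, opp_pos):
    """Hypothetical pre-match feature row for the considered team."""
    return {
        # match context
        "is_home": int(is_home),
        # relative quality of the opponent
        "table_pos_diff": team_pos - opp_pos,
        # attacking performance of the considered team
        "team_xg_last3": last_n_mean(team_hist, "xg"),
        # defensive performance of the upcoming opponent
        "opp_xg_conceded_last3": last_n_mean(opp_hist, "xg_conceded"),
        # difficulty of the last three matches for both teams
        "team_opp_pos_last3": last_n_mean(team_hist, "opp_table_pos"),
        "opp_opp_pos_last3": last_n_mean(opp_hist, "opp_table_pos"),
    }


# Made-up example values; the first entry lies outside the three-match window.
team_hist = [
    {"xg": 0.5, "xg_conceded": 2.0, "opp_table_pos": 1},
    {"xg": 1.0, "xg_conceded": 1.0, "opp_table_pos": 4},
    {"xg": 2.0, "xg_conceded": 0.5, "opp_table_pos": 10},
    {"xg": 3.0, "xg_conceded": 1.5, "opp_table_pos": 16},
]
opp_hist = [
    {"xg": 1.5, "xg_conceded": 1.0, "opp_table_pos": 3},
    {"xg": 0.5, "xg_conceded": 2.0, "opp_table_pos": 6},
    {"xg": 1.0, "xg_conceded": 3.0, "opp_table_pos": 9},
]
features = build_features(team_hist, opp_hist, is_home=True, team_pos=5, opp_pos=12)
```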
Modelling of number of goals (xGoalNumber): The prediction of the number of goals of a considered team in an upcoming match against an upcoming opponent (xGoalNumber) used information on the match performance of each team in its last three matches. In addition, to assess the difficulty of each team's third-to-last opponent, the table positions prior to that match were calculated based on the three matches before it. To ensure sufficient performance data for the competing teams, the first six matchdays of both seasons were excluded. This left a total of 504 matches (28 remaining matchdays with nine matches each, over two seasons) and 1008 samples of individual teams. Random Forest and XGBoost regression models were trained for both the xG pre-match and the EPV pre-match approach. For the EPV pre-match approach, all xG-related variables were excluded (see blue features in Table 1), so that the approach only incorporates information about context, strength of opponent, and offensive and defensive performance in terms of shots, goals, EPV, and xBG. Conversely, for the xG pre-match approach, all EPV- and xBG-related variables were excluded (see orange features in Table 1). An 80/20 hold-out split was applied on a match-by-match basis to prevent data leakage: no samples from the same match appeared in both the training and the test dataset. On the 80% training dataset, hyperparameters were tuned with a randomized search under five-fold cross-validation, optimizing for RMSE.
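The match-by-match hold-out split can be sketched as follows. This is a minimal standard-library illustration, not the project's implementation: the `match_id` field, the function name, and the seed are assumptions, and the tooling actually used is not stated here (scikit-learn's `GroupShuffleSplit` is one common alternative for grouped splits).

```python
import random


def split_by_match(samples, test_fraction=0.2, seed=42):
    """Split samples so that both team rows of a match land in the same subset."""
    match_ids = sorted({s["match_id"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(match_ids)
    n_test = round(len(match_ids) * test_fraction)
    test_ids = set(match_ids[:n_test])
    train = [s for s in samples if s["match_id"] not in test_ids]
    test = [s for s in samples if s["match_id"] in test_ids]
    return train, test


# Two samples (home and away team) per match, mirroring 504 matches / 1008 samples.
samples = [
    {"match_id": m, "side": side}
    for m in range(504)
    for side in ("home", "away")
]
train, test = split_by_match(samples)

# No match contributes samples to both subsets.
train_ids = {s["match_id"] for s in train}
test_ids = {s["match_id"] for s in test}
```

Splitting on match identifiers rather than on individual rows matters because the two rows of one match share the same final score, so a row-wise split could leak the target into the training data.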
Prediction of match outcomes: After the goal prediction, a double Poisson distribution (with a maximum of eight goals, the highest number scored in the current dataset) was applied to the predicted number of goals for each team estimated by the machine learning model. Modelling match outcomes with Poisson distributions follows Dixon & Coles (1997), Karlis & Ntzoufras (2003), and Ley et al. (2019), who reported substantial predictive performance for this approach. The distribution was then used to simulate the match outcome 10,000 times. As in the post-match approach, the simulated results were used to quantify the probabilities of the final match outcome.
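The simulation stage can be illustrated with the sketch below: two independent Poisson draws per simulated match, one per team, with the predicted goals as rates. The function names, the example rates, the seed, and the way the eight-goal maximum is enforced (simple clipping) are assumptions for illustration only; how the cap is applied in the project is not specified here.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson sample with Knuth's algorithm (fine for small rates)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_outcome(xgoals_home, xgoals_away, n_sims=10_000, max_goals=8, seed=0):
    """Estimate home win / draw / away win probabilities by simulation."""
    rng = random.Random(seed)
    counts = {"home_win": 0, "draw": 0, "away_win": 0}
    for _ in range(n_sims):
        # Assumption: the eight-goal maximum is enforced by clipping.
        home = min(sample_poisson(xgoals_home, rng), max_goals)
        away = min(sample_poisson(xgoals_away, rng), max_goals)
        if home > away:
            counts["home_win"] += 1
        elif home < away:
            counts["away_win"] += 1
        else:
            counts["draw"] += 1
    return {outcome: n / n_sims for outcome, n in counts.items()}


# Example with made-up predicted goals for the home and away team.
probs = simulate_outcome(1.8, 1.1)
```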
- Post-Match Prediction: Retrospective analysis using full match data.
To predict the match outcome based on post-match information, two approaches - xG and EPV - were applied for both teams after the considered match. For each match, 10,000 match simulations were run. Each simulation determined whether a goal resulted from each individual shot (xG post-match) or each individual possession (EPV post-match), based on the probabilities computed by the respective model. This process generated 10,000 potential match outcomes reflecting the performance of both teams during the match. The distribution of these outcomes was then used to quantify the probabilities of the final match outcome.
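A minimal sketch of the xG post-match simulation is shown below: every shot is treated as an independent Bernoulli trial with its xG value as the success probability. The function names, the seed, and the example shot values are assumptions; the EPV variant would follow the same pattern with one probability per possession instead of per shot.

```python
import random


def simulate_goals(event_probs, rng):
    """Count goals when each event scores independently with its own probability."""
    return sum(rng.random() < p for p in event_probs)


def post_match_probabilities(home_probs, away_probs, n_sims=10_000, seed=0):
    """Estimate outcome probabilities from per-shot (or per-possession) probabilities."""
    rng = random.Random(seed)
    counts = {"home_win": 0, "draw": 0, "away_win": 0}
    for _ in range(n_sims):
        home = simulate_goals(home_probs, rng)
        away = simulate_goals(away_probs, rng)
        if home > away:
            counts["home_win"] += 1
        elif home < away:
            counts["away_win"] += 1
        else:
            counts["draw"] += 1
    return {outcome: n / n_sims for outcome, n in counts.items()}


# Made-up xG values for the shots of each team in one match.
home_shots = [0.05, 0.12, 0.40, 0.08, 0.76]
away_shots = [0.03, 0.10, 0.22]
probs = post_match_probabilities(home_shots, away_shots)
```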
- SHAP Value Analysis: Model explainability is achieved through SHAP values, helping interpret which features influence predictions in the pre-match models.
Utility scripts and helper functions that support preprocessing, feature engineering, and evaluation.
- Dr. Leander Forcher
leander.forcher@kit.edu