# Investigating Data Needs

When searching for data, it’s essential to understand how a model's performance changes with different training datasets. This helps us determine what data to prioritize for optimal results.

In this notebook, we examine how the choice of leagues in the training data affects model performance when evaluated exclusively on the English Premier League (EPL). To do this, we will:

0. [Establish a Baseline Model](#baseline-training-on-all-available-epl-data-only):
    - The model will predict second-half-of-season results using first-half season statistics as features.
    - It will be trained on all available EPL data*.

1. [Experiment with Data Cutoffs](#experiment-1-cutting-training-data-to-recent-seasons):
    - We will assess the impact of restricting the training data to more recent EPL seasons by removing older seasons.

2. [Incorporate Additional Leagues](#experiment-2-extending-training-data-to-other-top-leagues-all-seasons):
    - We will explore how training on multiple leagues affects model performance when applied to EPL predictions.

3. [Combine Both Strategies](#experiment-3-training-on-all-top-leagues-recent-seasons):
    - We will train the model using data from multiple leagues but only from recent seasons. This will help determine whether including more leagues can compensate for a lack of historical EPL data should we encounter this issue in our search.

4. [Remove EPL Data Completely](#experiment-4-training-on-non-epl-data-only):
    - Finally, as a bonus, we will train on non-EPL data only and test on EPL data to understand how important it is for the training set to include the league we are predicting.

*The dataset we are using goes back to 2010; however, additional seasons are available from the source.

By conducting these experiments, we aim to identify the most effective training data composition for predicting EPL outcomes.

The modelling we use in this notebook (particularly the concept of using aggregated first-half-of-season stats to predict results in the second half) does not necessarily reflect how our future model(s) will work. It is merely a method that sufficiently represents the likely complexity of our future model(s) whilst allowing us to use easily available historical data.
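To make the "first half predicts second half" idea concrete, here is a minimal sketch on a toy fixture list. It is not the implementation behind `create_dataframe`, which is not shown in this notebook. The `h_goals` and `a_goals` columns and the `h_avg_`/`a_avg_` prefixes are hypothetical names chosen for illustration, and the season is split at the midpoint of the date-sorted fixture list, which is an assumption.

```python
import pandas as pd

# Toy season with four teams; h_goals / a_goals are hypothetical column names.
games = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-09-01", "2020-09-08", "2020-09-15", "2020-09-22",
             "2021-02-01", "2021-02-08", "2021-02-15", "2021-02-22"]
        ),
        "h_team": ["A", "C", "A", "B", "B", "D", "C", "D"],
        "a_team": ["B", "D", "C", "D", "A", "C", "A", "B"],
        "h_goals": [2, 0, 1, 3, 1, 2, 0, 1],
        "a_goals": [0, 0, 1, 1, 2, 2, 1, 0],
    }
).sort_values("date", ignore_index=True)

# Assumption: the season is split at the midpoint of the fixture list.
midpoint = len(games) // 2
first_half, second_half = games.iloc[:midpoint], games.iloc[midpoint:]

# Stack home and away appearances so each row is one team's view of one game.
home = first_half.rename(
    columns={"h_team": "team", "h_goals": "scored", "a_goals": "conceded"}
)
away = first_half.rename(
    columns={"a_team": "team", "a_goals": "scored", "h_goals": "conceded"}
)
team_games = pd.concat([home, away])[["team", "scored", "conceded"]]

# Per-team, per-game averages computed from first-half games only.
team_stats = team_games.groupby("team").mean()

# Attach the home and away teams' first-half stats to each second-half game.
features = second_half.merge(
    team_stats.add_prefix("h_avg_"), left_on="h_team", right_index=True
).merge(team_stats.add_prefix("a_avg_"), left_on="a_team", right_index=True)

# Target: did the home team win the second-half game?
features["h_win"] = (features["h_goals"] > features["a_goals"]).astype(int)
```

Only second-half games become training rows in this sketch, and their features are built exclusively from first-half results, so no information from the game being predicted leaks into its own features.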

The data used in this notebook was sourced from [Football-Data.co.uk](https://Football-Data.co.uk).



In [1]:
import pandas as pd
import mlflow

import sys
import os

sys.path.append(os.path.abspath(os.path.join("../src")))
from src.ingestion.preprocess import create_dataframe
from experiment import run_experiment
from helper_functions import generate_random_string

In [2]:
raw_data = pd.read_csv("../data/raw_games.csv")

In [3]:
data = create_dataframe(raw_data)

2025-02-20 18:00:01.098 | INFO     | legacy.preprocess:create_dataframe:13 - Starting dataset creation
2025-02-20 18:00:01.102 | INFO     | legacy.preprocess:create_dataframe:113 - Home and away dataframes created succesfully
2025-02-20 18:00:01.106 | INFO     | legacy.preprocess:create_dataframe:119 - Home and away dataframes succesfully combined and grouped
2025-02-20 18:00:01.113 | INFO     | legacy.preprocess:create_dataframe:151 - Team stats succesfully merged onto game data
2025-02-20 18:00:01.115 | INFO     | legacy.preprocess:create_dataframe:156 - Duplicate columns successfully dropped
2025-02-20 18:00:01.118 | INFO     | legacy.preprocess:create_dataframe:170 - Dataset crea

In [4]:
def list_remove(lst: list, remove: list) -> list:
    return [x for x in lst if x not in remove]

In [5]:
non_features = [
    "season",
    "div",
    "date",
    "h_team",
    "a_team",
    "ftr",
    "b365h",
    "b365d",
    "b365a",
    "h_win",
    "bookies_prob",
]
FEATURE_NAMES = list_remove(data.columns, non_features)
ALL_SEASONS = list(data["season"].drop_duplicates())
TEST_SEASONS = ["23_24", "24_25"]
VAL_SEASONS = ["21_22", "22_23"]

In [6]:
columns = FEATURE_NAMES + ["bookies_prob", "div", "season", "h_win"]

data = data[columns]

train_full = data[~data["season"].isin(VAL_SEASONS + TEST_SEASONS)]
train_full = train_full.drop(columns=["bookies_prob"])

val = data[data["season"].isin(VAL_SEASONS) & (data["div"] == "E0")]
val = val.drop(columns=["div"])
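Every experiment in this notebook builds its training set by filtering `train_full` on `div` and `season` and then dropping `div`. As a rough sketch, that pattern could be captured in one helper. The function name `select_training_rows` and its parameters are hypothetical, and the toy frame below uses `SP1` and `D1` purely as illustrative non-EPL division codes.

```python
import pandas as pd


def select_training_rows(df, seasons=None, include_divs=None, exclude_divs=None):
    """Filter rows by season and division, then drop the `div` column."""
    mask = pd.Series(True, index=df.index)
    if seasons is not None:
        mask &= df["season"].isin(seasons)
    if include_divs is not None:
        mask &= df["div"].isin(include_divs)
    if exclude_divs is not None:
        mask &= ~df["div"].isin(exclude_divs)
    return df.loc[mask].drop(columns="div")


# Toy stand-in for train_full; "x" represents the feature columns.
toy = pd.DataFrame(
    {
        "season": ["10_11", "17_18", "17_18", "20_21"],
        "div": ["E0", "E0", "SP1", "D1"],
        "x": [1, 2, 3, 4],
    }
)

epl_only = select_training_rows(toy, include_divs=["E0"])            # baseline
recent_all = select_training_rows(toy, seasons=["17_18", "20_21"])   # recent seasons, all leagues
non_epl = select_training_rows(toy, exclude_divs=["E0"])             # EPL removed
```

The cells below spell the filters out explicitly instead, which keeps each experiment readable on its own at the cost of some repetition.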

## Baseline: Training on All Available EPL Data Only

In [7]:
train_bl = train_full.copy()
train_bl = train_bl.loc[train_bl["div"] == "E0"]
train_bl = train_bl.drop(columns="div")

In [8]:
bl_run_id = generate_random_string()
print(f"Experiment run ID: {bl_run_id}")

run_experiment(
    "data-needs-experiment",
    train_bl,
    val,
    hidden_units=None,
    learning_rate=0.001,
    num_epochs=10000,
    num_samples=1000,
    num_batches=1,
    league_tag="EPL",
    run_id=bl_run_id,
    run_description="training on 2010-2021 EPL data",
    return_model=False,
)

Experiment run ID: GMRPfoeZ


Training Progress: 100%|██████████| 10000/10000 [00:17<00:00, 569.19epoch/s]
Training Output Sampling Progress: 100%|██████████| 1000/1000 [00:24<00:00, 40.57it/s]
Validation Output Sampling Progress: 100%|██████████| 1000/1000 [00:24<00:00, 40.20it/s]


🏃 View run rogue-robin-152 at: http://localhost:5001/#/experiments/3/runs/8c724b3c65f54bd4953f0f5a6d52bd94
🧪 View experiment at: http://localhost:5001/#/experiments/3


In [9]:
run_data_bl = mlflow.search_runs(filter_string=f"tags.run_id = '{bl_run_id}'")[
    [
        "tags.run_description",
        "params.n_train",
        "params.num_train_seasons",
        "metrics.train_auc",
        "metrics.val_auc",
        "metrics.train_mse",
        "metrics.val_mse",
        "metrics.val_mse_diff",
        "metrics.val_auc_diff",
    ]
]

run_data_bl

| | tags.run_description | params.n_train | params.num_train_seasons | metrics.train_auc | metrics.val_auc | metrics.train_mse | metrics.val_mse | metrics.val_mse_diff | metrics.val_auc_diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | training on 2010-2021 EPL data | 2067 | 11 | 0.711393 | 0.717485 | 0.215309 | 0.21142 | 0.005083 | -0.024936 |
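The source of `run_experiment` is not shown here, so the exact metric definitions are an assumption. The logged values are consistent with `val_mse_diff` and `val_auc_diff` being the model's metric minus the same metric computed from the bookmakers' implied probabilities (`bookies_prob`). A minimal pure-Python sketch of that reading, on made-up probabilities:

```python
def mse(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def auc(probs, outcomes):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up data: 1 = home win, 0 = no home win.
h_win = [1, 0, 1, 0]
model_prob = [0.60, 0.40, 0.50, 0.55]
bookies_prob = [0.70, 0.30, 0.60, 0.40]

# Assumed definitions: model metric minus bookmaker metric.
mse_diff = mse(model_prob, h_win) - mse(bookies_prob, h_win)
auc_diff = auc(model_prob, h_win) - auc(bookies_prob, h_win)
```

Under this reading, a positive `val_mse_diff` and a negative `val_auc_diff` both mean the model is behind the bookmakers, which matches the signs in the table above.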


## Experiment 1: Cutting Training Data To Recent Seasons 

### Part A: 2013-2021

In [10]:
e1a_train_seasons = [
    "13_14",
    "14_15",
    "15_16",
    "16_17",
    "17_18",
    "18_19",
    "19_20",
    "20_21",
]
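The season lists in this notebook are typed out by hand. As an alternative sketch, the same labels can be generated from the starting years, assuming the `YY_YY` format seen in the data. The helper name `season_labels` is hypothetical.

```python
def season_labels(first_start_year: int, last_start_year: int) -> list[str]:
    """Return labels such as "13_14" for every season starting in the range (inclusive)."""
    return [
        f"{year % 100:02d}_{(year + 1) % 100:02d}"
        for year in range(first_start_year, last_start_year + 1)
    ]


seasons_2013_2021 = season_labels(2013, 2020)  # "13_14" through "20_21"
seasons_2017_2021 = season_labels(2017, 2020)  # "17_18" through "20_21"
```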

In [11]:
train_e1a = train_full.copy()
train_e1a = train_e1a.loc[
    (train_e1a["div"] == "E0") & (train_e1a["season"].isin(e1a_train_seasons))
]
train_e1a = train_e1a.drop(columns="div")

In [12]:
e1a_run_id = generate_random_string()
print(f"Experiment run ID: {e1a_run_id}")

run_experiment(
    "data-needs-experiment",
    train_e1a,
    val,
    hidden_units=None,
    learning_rate=0.001,
    num_epochs=10000,
    num_samples=1000,
    num_batches=1,
    league_tag="EPL",
    run_id=e1a_run_id,
    run_description="training on 2013-2021 EPL data",
    return_model=False,
)

Experiment run ID: FCEI7kcV


Training Progress: 100%|██████████| 10000/10000 [00:16<00:00, 623.37epoch/s]
Training Output Sampling Progress: 100%|██████████| 1000/1000 [00:25<00:00, 39.84it/s]
Validation Output Sampling Progress: 100%|██████████| 1000/1000 [00:25<00:00, 39.75it/s]


🏃 View run orderly-gull-357 at: http://localhost:5001/#/experiments/3/runs/7430fb2d1b934a2589a1e630022ed722
🧪 View experiment at: http://localhost:5001/#/experiments/3


In [13]:
run_data_e1a = mlflow.search_runs(
    filter_string=f"tags.run_id = '{e1a_run_id}'"
)[
    [
        "tags.run_description",
        "params.n_train",
        "params.num_train_seasons",
        "metrics.train_auc",
        "metrics.val_auc",
        "metrics.train_mse",
        "metrics.val_mse",
        "metrics.val_mse_diff",
        "metrics.val_auc_diff",
    ]
]

run_data_all = pd.concat([run_data_bl, run_data_e1a])
run_data_e1a

| | tags.run_description | params.n_train | params.num_train_seasons | metrics.train_auc | metrics.val_auc | metrics.train_mse | metrics.val_mse | metrics.val_mse_diff | metrics.val_auc_diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | training on 2013-2021 EPL data | 1505 | 8 | 0.710823 | 0.710517 | 0.215461 | 0.213557 | 0.00722 | -0.031905 |


### Part B: 2017-2021

In [14]:
e1b_train_seasons = ["17_18", "18_19", "19_20", "20_21"]

In [15]:
train_e1b = train_full.copy()
train_e1b = train_e1b.loc[
    (train_e1b["div"] == "E0") & (train_e1b["season"].isin(e1b_train_seasons))
]
train_e1b = train_e1b.drop(columns="div")

In [16]:
e1b_run_id = generate_random_string()
print(f"Experiment run ID: {e1b_run_id}")

run_experiment(
    "data-needs-experiment",
    train_e1b,
    val,
    hidden_units=None,
    learning_rate=0.001,
    num_epochs=10000,
    num_samples=1000,
    num_batches=1,
    league_tag="EPL",
    run_id=e1b_run_id,
    run_description="training on 2017-2021 EPL data",
    return_model=False,
)

Experiment run ID: N7c44Xbn


Training Progress: 100%|██████████| 10000/10000 [00:15<00:00, 654.87epoch/s]
Training Output Sampling Progress: 100%|██████████| 1000/1000 [00:25<00:00, 39.65it/s]
Validation Output Sampling Progress: 100%|██████████| 1000/1000 [00:25<00:00, 39.29it/s]


🏃 View run shivering-panda-160 at: http://localhost:5001/#/experiments/3/runs/f79dcb1807dc4affbaf66c04e380939f
🧪 View experiment at: http://localhost:5001/#/experiments/3


In [17]:
run_data_e1b = mlflow.search_runs(
    filter_string=f"tags.run_id = '{e1b_run_id}'"
)[
    [
        "tags.run_description",
        "params.n_train",
        "params.num_train_seasons",
        "metrics.train_auc",
        "metrics.val_auc",
        "metrics.train_mse",
        "metrics.val_mse",
        "metrics.val_mse_diff",
        "metrics.val_auc_diff",
    ]
]

run_data_all = pd.concat([run_data_all, run_data_e1b])
run_data_e1b

| | tags.run_description | params.n_train | params.num_train_seasons | metrics.train_auc | metrics.val_auc | metrics.train_mse | metrics.val_mse | metrics.val_mse_diff | metrics.val_auc_diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | training on 2017-2021 EPL data | 755 | 4 | 0.717868 | 0.707477 | 0.212262 | 0.215056 | 0.008719 | -0.034945 |


## Experiment 2: Extending Training Data To Other Top Leagues (All Seasons)

In [18]:
train_e2 = train_full.copy()
train_e2 = train_e2.drop(columns="div")

In [19]:
e2_run_id = generate_random_string()
print(f"Experiment run ID: {e2_run_id}")

run_experiment(
    "data-needs-experiment",
    train_e2,
    val,
    hidden_units=None,
    learning_rate=0.001,
    num_epochs=10000,
    num_samples=1000,
    num_batches=1,
    league_tag="EPL",
    run_id=e2_run_id,
    run_description="training on 2010-2021 all leagues data",
    return_model=False,
)

Experiment run ID: oDfvlGn7


Training Progress: 100%|██████████| 10000/10000 [00:23<00:00, 421.22epoch/s]
Training Output Sampling Progress: 100%|██████████| 1000/1000 [00:26<00:00, 37.12it/s]
Validation Output Sampling Progress: 100%|██████████| 1000/1000 [00:25<00:00, 38.66it/s]


🏃 View run resilient-bass-744 at: http://localhost:5001/#/experiments/3/runs/c38208be72c44f60932f962d7037a7ad
🧪 View experiment at: http://localhost:5001/#/experiments/3


In [20]:
run_data_e2 = mlflow.search_runs(filter_string=f"tags.run_id = '{e2_run_id}'")[
    [
        "tags.run_description",
        "params.n_train",
        "params.num_train_seasons",
        "metrics.train_auc",
        "metrics.val_auc",
        "metrics.train_mse",
        "metrics.val_mse",
        "metrics.val_mse_diff",
        "metrics.val_auc_diff",
    ]
]

run_data_all = pd.concat([run_data_all, run_data_e2])
run_data_e2

| | tags.run_description | params.n_train | params.num_train_seasons | metrics.train_auc | metrics.val_auc | metrics.train_mse | metrics.val_mse | metrics.val_mse_diff | metrics.val_auc_diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | training on 2010-2021 all leagues data | 9840 | 11 | 0.69937 | 0.71195 | 0.217906 | 0.212736 | 0.006399 | -0.030471 |


## Experiment 3: Training on All Top Leagues (Recent Seasons)

### Part A: 2013-2021

In [21]:
e3a_train_seasons = [
    "13_14",
    "14_15",
    "15_16",
    "16_17",
    "17_18",
    "18_19",
    "19_20",
    "20_21",
]

In [22]:
train_e3a = train_full.copy()
train_e3a = train_e3a.loc[train_e3a["season"].isin(e3a_train_seasons)]
train_e3a = train_e3a.drop(columns="div")

In [23]:
e3a_run_id = generate_random_string()
print(f"Experiment run ID: {e3a_run_id}")

run_experiment(
    "data-needs-experiment",
    train_e3a,
    val,
    hidden_units=None,
    learning_rate=0.001,
    num_epochs=10000,
    num_samples=1000,
    num_batches=1,
    league_tag="EPL",
    run_id=e3a_run_id,
    run_description="training on 2013-2021 all leagues data",
    return_model=False,
)

Experiment run ID: 2TWRtRia


Training Progress: 100%|██████████| 10000/10000 [00:22<00:00, 446.52epoch/s]
Training Output Sampling Progress: 100%|██████████| 1000/1000 [00:26<00:00, 37.20it/s]
Validation Output Sampling Progress: 100%|██████████| 1000/1000 [00:25<00:00, 38.51it/s]


🏃 View run unruly-swan-65 at: http://localhost:5001/#/experiments/3/runs/1060f87251a545fda6d48a86a4f549b6
🧪 View experiment at: http://localhost:5001/#/experiments/3


In [24]:
run_data_e3a = mlflow.search_runs(
    filter_string=f"tags.run_id = '{e3a_run_id}'"
)[
    [
        "tags.run_description",
        "params.n_train",
        "params.num_train_seasons",
        "metrics.train_auc",
        "metrics.val_auc",
        "metrics.train_mse",
        "metrics.val_mse",
        "metrics.val_mse_diff",
        "metrics.val_auc_diff",
    ]
]

run_data_all = pd.concat([run_data_all, run_data_e3a])
run_data_e3a

| | tags.run_description | params.n_train | params.num_train_seasons | metrics.train_auc | metrics.val_auc | metrics.train_mse | metrics.val_mse | metrics.val_mse_diff | metrics.val_auc_diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | training on 2013-2021 all leagues data | 7127 | 8 | 0.7075 | 0.71195 | 0.215632 | 0.213103 | 0.006767 | -0.030471 |


### Part B: 2017-2021

In [25]:
e3b_train_seasons = ["17_18", "18_19", "19_20", "20_21"]

In [26]:
train_e3b = train_full.copy()
train_e3b = train_e3b.loc[train_e3b["season"].isin(e3b_train_seasons)]
train_e3b = train_e3b.drop(columns="div")

In [27]:
e3b_run_id = generate_random_string()
print(f"Experiment run ID: {e3b_run_id}")

run_experiment(
    "data-needs-experiment",
    train_e3b,
    val,
    hidden_units=None,
    learning_rate=0.001,
    num_epochs=10000,
    num_samples=1000,
    num_batches=1,
    league_tag="EPL",
    run_id=e3b_run_id,
    run_description="training on 2017-2021 all leagues data",
    return_model=False,
)

Experiment run ID: a2jP5Tj8


Training Progress: 100%|██████████| 10000/10000 [00:18<00:00, 530.29epoch/s]
Training Output Sampling Progress: 100%|██████████| 1000/1000 [00:26<00:00, 37.74it/s]
Validation Output Sampling Progress: 100%|██████████| 1000/1000 [00:26<00:00, 38.43it/s]


🏃 View run gentle-swan-773 at: http://localhost:5001/#/experiments/3/runs/42d5f745d07c4f28b7517e96b0fb8926
🧪 View experiment at: http://localhost:5001/#/experiments/3


In [28]:
run_data_e3b = mlflow.search_runs(
    filter_string=f"tags.run_id = '{e3b_run_id}'"
)[
    [
        "tags.run_description",
        "params.n_train",
        "params.num_train_seasons",
        "metrics.train_auc",
        "metrics.val_auc",
        "metrics.train_mse",
        "metrics.val_mse",
        "metrics.val_mse_diff",
        "metrics.val_auc_diff",
    ]
]

run_data_all = pd.concat([run_data_all, run_data_e3b])
run_data_e3b

| | tags.run_description | params.n_train | params.num_train_seasons | metrics.train_auc | metrics.val_auc | metrics.train_mse | metrics.val_mse | metrics.val_mse_diff | metrics.val_auc_diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | training on 2017-2021 all leagues data | 3532 | 4 | 0.716869 | 0.712036 | 0.212448 | 0.213771 | 0.007435 | -0.030385 |


## Experiment 4: Training on Non-EPL Data Only

In [29]:
train_e4 = train_full.copy()
train_e4 = train_e4.loc[train_e4["div"] != "E0"]
train_e4 = train_e4.drop(columns="div")

In [30]:
e4_run_id = generate_random_string()
print(f"Experiment run ID: {e4_run_id}")

run_experiment(
    "data-needs-experiment",
    train_e4,
    val,
    hidden_units=None,
    learning_rate=0.001,
    num_epochs=10000,
    num_samples=1000,
    num_batches=1,
    league_tag="EPL",
    run_id=e4_run_id,
    run_description="training on 2010-2021 EPL removed data",
    return_model=False,
)

Experiment run ID: HHTybMHw


Training Progress: 100%|██████████| 10000/10000 [00:22<00:00, 443.83epoch/s]
Training Output Sampling Progress: 100%|██████████| 1000/1000 [00:27<00:00, 36.84it/s]
Validation Output Sampling Progress: 100%|██████████| 1000/1000 [00:26<00:00, 38.22it/s]


🏃 View run thundering-slug-333 at: http://localhost:5001/#/experiments/3/runs/4a20ebf1dddd464ca501c04965e88498
🧪 View experiment at: http://localhost:5001/#/experiments/3


In [32]:
run_data_e4 = mlflow.search_runs(filter_string=f"tags.run_id = '{e4_run_id}'")[
    [
        "tags.run_description",
        "params.n_train",
        "params.num_train_seasons",
        "metrics.train_auc",
        "metrics.val_auc",
        "metrics.train_mse",
        "metrics.val_mse",
        "metrics.val_mse_diff",
        "metrics.val_auc_diff",
    ]
]

run_data_all = pd.concat([run_data_all, run_data_e4])
run_data_e4

| | tags.run_description | params.n_train | params.num_train_seasons | metrics.train_auc | metrics.val_auc | metrics.train_mse | metrics.val_mse | metrics.val_mse_diff | metrics.val_auc_diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | training on 2010-2021 EPL removed data | 7773 | 11 | 0.696691 | 0.713012 | 0.218244 | 0.212588 | 0.006251 | -0.02941 |


## Comparison and Conclusion

In [33]:
run_data_all.drop(
    columns=["metrics.train_auc", "metrics.train_mse"]
).sort_values("metrics.val_mse", ascending=True)

| | tags.run_description | params.n_train | params.num_train_seasons | metrics.val_auc | metrics.val_mse | metrics.val_mse_diff | metrics.val_auc_diff |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | training on 2010-2021 EPL data | 2067 | 11 | 0.717485 | 0.21142 | 0.005083 | -0.024936 |
| 0 | training on 2010-2021 EPL removed data | 7773 | 11 | 0.713012 | 0.212588 | 0.006251 | -0.02941 |
| 0 | training on 2010-2021 all leagues data | 9840 | 11 | 0.71195 | 0.212736 | 0.006399 | -0.030471 |
| 0 | training on 2013-2021 all leagues data | 7127 | 8 | 0.71195 | 0.213103 | 0.006767 | -0.030471 |
| 0 | training on 2013-2021 EPL data | 1505 | 8 | 0.710517 | 0.213557 | 0.00722 | -0.031905 |
| 0 | training on 2017-2021 all leagues data | 3532 | 4 | 0.712036 | 0.213771 | 0.007435 | -0.030385 |
| 0 | training on 2017-2021 EPL data | 755 | 4 | 0.707477 | 0.215056 | 0.008719 | -0.034945 |
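One way to read the `val_mse_diff` column is relative to the baseline's gap. The sketch below copies `val_mse` and `val_mse_diff` from the comparison table and assumes that `val_mse_diff` is the model's validation MSE minus the bookmakers' MSE. That assumption can be checked, because the implied bookmaker MSE should then be the same for every run.

```python
# (val_mse, val_mse_diff) copied from the comparison table.
runs = {
    "2010-2021 EPL": (0.211420, 0.005083),
    "2010-2021 EPL removed": (0.212588, 0.006251),
    "2010-2021 all leagues": (0.212736, 0.006399),
    "2013-2021 all leagues": (0.213103, 0.006767),
    "2013-2021 EPL": (0.213557, 0.007220),
    "2017-2021 all leagues": (0.213771, 0.007435),
    "2017-2021 EPL": (0.215056, 0.008719),
}

# Assumption: val_mse_diff = val_mse - bookmaker MSE, so subtracting the
# diff from val_mse should recover (nearly) the same value for every run.
implied_bookies_mse = {name: mse - diff for name, (mse, diff) in runs.items()}

# Express each run's gap to the bookmakers as a multiple of the baseline gap.
baseline_gap = runs["2010-2021 EPL"][1]
relative_gap = {name: diff / baseline_gap for name, (_, diff) in runs.items()}

for name, ratio in relative_gap.items():
    print(f"{name:<24} gap = {ratio:.2f}x baseline")
```

On these numbers the implied bookmaker MSE is about 0.2063 for every run. The 2017-2021 EPL run has a raw MSE only about 0.0036 above the baseline, yet its gap to the bookmakers is roughly 1.7 times the baseline's gap.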


Before drawing any conclusions from the differences in performance, the reader must keep a few important points in mind:

1. We are dealing with very small discrepancies in the evaluation metrics that, at face value, seem negligible. However, the `val_mse_diff` and `val_auc_diff` columns, which represent the disparity in MSE and AUC between our model and the bookies' odds, show that these differences are not as insignificant as they first appear. For example, a $0.001$ difference in MSE might seem inconsequential, but if the total difference between our model’s MSE and that of the bookies' odds is only $0.02$, that $0.001$ deterioration becomes more meaningful and something we should strive to avoid. It is therefore important that we use the `_diff` columns for context when discussing changes in metrics between runs.

2. Because of how the model's features are constructed, we are only utilising half of the available data points per season (i.e., games in the second half of the season). This means that any conclusions we attribute to insufficient data may not hold to the same extent (if at all) if we adopt an approach that allows us to use most or all of each season’s games.

3. Similarly, we cannot rule out that a different approach could lead to entirely different conclusions for other reasons. For example, if we were to find different features that (for whatever reason) carry more predictive signal in later seasons, a model that relies heavily on such features would likely benefit from a more recent training set. However, the goal of this notebook is to develop a rough understanding of how replacing earlier EPL seasons with more recent seasons from other leagues might affect model performance. This is a trade-off we may have to consider if we want a more comprehensive feature set.
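Point 2 can be checked against the logged parameters. The sketch below divides `n_train` by `num_train_seasons` for the EPL-only runs and compares the result with the 380 games in a full EPL season (20 teams playing 38 matches each). The variable names are illustrative.

```python
# (n_train, num_train_seasons) for the EPL-only runs, copied from the results tables.
epl_runs = {
    "2010-2021": (2067, 11),
    "2013-2021": (1505, 8),
    "2017-2021": (755, 4),
}

GAMES_PER_EPL_SEASON = 380  # 20 teams, 38 matches each

# Average number of training rows contributed by each season.
games_per_season = {
    label: n_train / n_seasons for label, (n_train, n_seasons) in epl_runs.items()
}

for label, per_season in games_per_season.items():
    share = per_season / GAMES_PER_EPL_SEASON
    print(f"{label}: {per_season:.1f} rows per season ({share:.0%} of a full season)")
```

Each EPL season contributes roughly 188 training rows, just under half of the 380 games played, so an approach that could use every game would roughly double the EPL training data without adding any seasons or leagues.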