# Predicting Premier League Match Results
## Overview

This notebook trains a model on scraped match data to predict whether a team wins a Premier League match. The model is a random forest classifier whose features include the opponent, the venue (home or away), the kick-off hour, and the day of the week. It also uses rolling averages over each team's previous three matches to capture recent form. The trained model is then evaluated on a held-out test set of later matches. The notebook also includes some data exploration and visualization to gain insight into the data.

The model's performance is evaluated using accuracy and precision. Accuracy is the proportion of matches classified correctly. Precision is the proportion of predicted wins that were actual wins; it is the most important metric for this model, since we care most about how often a predicted win turns out to be correct. The error rate, the proportion of misclassified matches, is one minus the accuracy. Note that the variable named `error` in the code below stores the accuracy score, not the error rate.
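For reference, the three metrics can be written in terms of confusion-matrix counts. The symbols $TP$, $FP$, $TN$, and $FN$ (true/false positives/negatives, with "win" as the positive class) are notation introduced here, not names used in the notebook:

$$
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{error rate} = 1 - \text{accuracy} = \frac{FP + FN}{TP + TN + FP + FN}
$$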

The model might be improved further by trying other algorithms, such as support vector machines or gradient boosting.

In [1]:
import pandas as pd
import joblib

In [213]:
matches = pd.read_csv("pl_matches.csv", index_col=0)

In [214]:
matches.head()

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,match report,notes,sh,sot,dist,fk,pk,pkatt,season,team
0,2024-08-17,12:30,Premier League,Matchweek 1,Sat,Away,W,2.0,0.0,Ipswich Town,...,Match Report,,18.0,5.0,14.8,0.0,0,0,2024,Liverpool
1,2024-08-25,16:30,Premier League,Matchweek 2,Sun,Home,W,2.0,0.0,Brentford,...,Match Report,,19.0,8.0,13.6,1.0,0,0,2024,Liverpool
2,2024-09-01,16:00,Premier League,Matchweek 3,Sun,Away,W,3.0,0.0,Manchester Utd,...,Match Report,,11.0,3.0,13.4,0.0,0,0,2024,Liverpool
3,2024-09-14,15:00,Premier League,Matchweek 4,Sat,Home,L,0.0,1.0,Nott'ham Forest,...,Match Report,,14.0,5.0,14.9,0.0,0,0,2024,Liverpool
5,2024-09-21,15:00,Premier League,Matchweek 5,Sat,Home,W,3.0,0.0,Bournemouth,...,Match Report,,19.0,12.0,16.6,0.0,0,0,2024,Liverpool


In [215]:
matches.shape

(5560, 28)

In [216]:
matches["team"].value_counts()

team
Liverpool                   278
West Ham United             278
Chelsea                     278
Arsenal                     278
Brighton and Hove Albion    278
Tottenham Hotspur           278
Crystal Palace              278
Newcastle United            278
Manchester City             278
Manchester United           278
Everton                     278
Southampton                 240
Wolverhampton Wanderers     240
Leicester City              240
Burnley                     228
Bournemouth                 202
Aston Villa                 202
Fulham                      164
Watford                     152
Brentford                   126
Leeds United                114
Sheffield United            114
Nottingham Forest            88
Norwich City                 76
West Bromwich Albion         76
Huddersfield Town            76
Luton Town                   38
Cardiff City                 38
Swansea City                 38
Stoke City                   38
Ipswich Town                 12
Name: count, dtype: int64

In [217]:
matches[matches["team"] == "Liverpool"].sort_values("date")

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,match report,notes,sh,sot,dist,fk,pk,pkatt,season,team
0,2017-08-12,12:30,Premier League,Matchweek 1,Sat,Away,D,3.0,3.0,Watford,...,Match Report,,13.0,4.0,13.9,0.0,1,1,2017,Liverpool
2,2017-08-19,15:00,Premier League,Matchweek 2,Sat,Home,W,1.0,0.0,Crystal Palace,...,Match Report,,23.0,13.0,18.6,2.0,0,0,2017,Liverpool
4,2017-08-27,16:00,Premier League,Matchweek 3,Sun,Home,W,4.0,0.0,Arsenal,...,Match Report,,18.0,10.0,15.9,0.0,0,0,2017,Liverpool
5,2017-09-09,12:30,Premier League,Matchweek 4,Sat,Away,L,0.0,5.0,Manchester City,...,Match Report,,7.0,3.0,20.5,2.0,0,0,2017,Liverpool
7,2017-09-16,15:00,Premier League,Matchweek 5,Sat,Home,D,1.0,1.0,Burnley,...,Match Report,,35.0,9.0,20.6,0.0,0,0,2017,Liverpool
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10,2024-10-20,16:30,Premier League,Matchweek 8,Sun,Home,W,2.0,1.0,Chelsea,...,Match Report,,8.0,4.0,22.7,1.0,1,1,2024,Liverpool
12,2024-10-27,16:30,Premier League,Matchweek 9,Sun,Away,D,2.0,2.0,Arsenal,...,Match Report,,9.0,4.0,17.8,0.0,0,0,2024,Liverpool
14,2024-11-02,15:00,Premier League,Matchweek 10,Sat,Home,W,2.0,1.0,Brighton,...,Match Report,,16.0,8.0,14.0,0.0,0,0,2024,Liverpool
16,2024-11-09,20:00,Premier League,Matchweek 11,Sat,Home,W,2.0,0.0,Aston Villa,...,Match Report,,14.0,5.0,16.1,0.0,0,0,2024,Liverpool


In [218]:
matches["round"].value_counts()

round
Matchweek 1     160
Matchweek 8     160
Matchweek 2     160
Matchweek 12    160
Matchweek 11    160
Matchweek 10    160
Matchweek 9     160
Matchweek 7     160
Matchweek 6     160
Matchweek 5     160
Matchweek 4     160
Matchweek 3     160
Matchweek 33    140
Matchweek 28    140
Matchweek 30    140
Matchweek 31    140
Matchweek 32    140
Matchweek 34    140
Matchweek 29    140
Matchweek 35    140
Matchweek 36    140
Matchweek 37    140
Matchweek 26    140
Matchweek 27    140
Matchweek 21    140
Matchweek 18    140
Matchweek 25    140
Matchweek 24    140
Matchweek 23    140
Matchweek 22    140
Matchweek 20    140
Matchweek 19    140
Matchweek 17    140
Matchweek 16    140
Matchweek 15    140
Matchweek 14    140
Matchweek 13    140
Matchweek 38    140
Name: count, dtype: int64

In [219]:
matches.dtypes

date              object
time              object
comp              object
round             object
day               object
venue             object
result            object
gf               float64
ga               float64
opponent          object
xg               float64
xga              float64
poss             float64
attendance       float64
captain           object
formation         object
opp formation     object
referee           object
match report      object
notes            float64
sh               float64
sot              float64
dist             float64
fk               float64
pk                 int64
pkatt              int64
season             int64
team              object
dtype: object

In [220]:
del matches["comp"]

In [221]:
del matches["notes"]

In [222]:
matches["date"] = pd.to_datetime(matches["date"])

In [223]:
matches["target"] = (matches["result"] == "W").astype("int")
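The target is binary, so draws and losses are collapsed into a single "not a win" class. A minimal plain-Python sketch of the same rule, using a made-up list of results rather than the real data:

```python
# Hypothetical sample of result codes in the same W/D/L format as the data.
results = ["W", "D", "L", "W", "D"]

# Same rule as the target column: 1 for a win, 0 for a draw or a loss.
target = [int(r == "W") for r in results]

print(target)  # [1, 0, 0, 1, 0]
```

One consequence of this encoding is that the model cannot distinguish a predicted draw from a predicted loss.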

In [224]:
matches

Unnamed: 0,date,time,round,day,venue,result,gf,ga,opponent,xg,...,match report,sh,sot,dist,fk,pk,pkatt,season,team,target
0,2024-08-17,12:30,Matchweek 1,Sat,Away,W,2.0,0.0,Ipswich Town,2.6,...,Match Report,18.0,5.0,14.8,0.0,0,0,2024,Liverpool,1
1,2024-08-25,16:30,Matchweek 2,Sun,Home,W,2.0,0.0,Brentford,2.5,...,Match Report,19.0,8.0,13.6,1.0,0,0,2024,Liverpool,1
2,2024-09-01,16:00,Matchweek 3,Sun,Away,W,3.0,0.0,Manchester Utd,1.8,...,Match Report,11.0,3.0,13.4,0.0,0,0,2024,Liverpool,1
3,2024-09-14,15:00,Matchweek 4,Sat,Home,L,0.0,1.0,Nott'ham Forest,0.9,...,Match Report,14.0,5.0,14.9,0.0,0,0,2024,Liverpool,0
5,2024-09-21,15:00,Matchweek 5,Sat,Home,W,3.0,0.0,Bournemouth,2.0,...,Match Report,19.0,12.0,16.6,0.0,0,0,2024,Liverpool,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38,2018-04-15,16:00,Matchweek 34,Sun,Away,W,1.0,0.0,Manchester Utd,0.7,...,Match Report,10.0,4.0,18.1,0.0,0,0,2017,West Bromwich Albion,1
39,2018-04-21,12:30,Matchweek 35,Sat,Home,D,2.0,2.0,Liverpool,1.3,...,Match Report,13.0,6.0,17.7,0.0,0,0,2017,West Bromwich Albion,0
40,2018-04-28,15:00,Matchweek 36,Sat,Away,W,1.0,0.0,Newcastle Utd,0.7,...,Match Report,9.0,2.0,20.1,0.0,0,0,2017,West Bromwich Albion,1
41,2018-05-05,15:00,Matchweek 37,Sat,Home,W,1.0,0.0,Tottenham,1.6,...,Match Report,9.0,1.0,10.2,0.0,0,0,2017,West Bromwich Albion,1


In [225]:
matches["venue_code"] = matches["venue"].astype("category").cat.codes

In [226]:
matches["opp_code"] = matches["opponent"].astype("category").cat.codes
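`astype("category").cat.codes` numbers the distinct values in sorted order, which is why `Away` becomes 0 and `Home` becomes 1 in the tables that follow. The sketch below reproduces that rule in plain Python for illustration only; `category_codes` is a hypothetical helper, not a pandas function.

```python
def category_codes(values):
    """Map each value to its rank among the sorted unique values."""
    categories = sorted(set(values))
    lookup = {name: code for code, name in enumerate(categories)}
    return [lookup[v] for v in values], lookup

# Made-up venue column in the same Home/Away format as the data.
venues = ["Away", "Home", "Away", "Home", "Home"]
codes, lookup = category_codes(venues)

print(lookup)  # {'Away': 0, 'Home': 1}
print(codes)   # [0, 1, 0, 1, 1]
```

Because the codes depend on which values are present, a different set of opponents would yield different numbers, so the same mapping has to be reused when encoding new matches.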

In [227]:
import json

# Extract the mapping directly from the DataFrame
opponent_to_code = dict(matches[['opponent', 'opp_code']].drop_duplicates().values)

# Save the mapping as a JSON file
with open('opponent_to_code.json', 'w') as f:
    json.dump(opponent_to_code, f)

print("Mapping saved successfully:", opponent_to_code)

Mapping saved successfully: {'Ipswich Town': 12, 'Brentford': 3, 'Manchester Utd': 18, "Nott'ham Forest": 21, 'Bournemouth': 2, 'Wolves': 30, 'Crystal Palace': 8, 'Chelsea': 7, 'Arsenal': 0, 'Brighton': 4, 'Aston Villa': 1, 'Southampton': 23, 'West Ham': 29, 'Newcastle Utd': 19, 'Fulham': 10, 'Tottenham': 26, 'Manchester City': 17, 'Liverpool': 15, 'Leicester City': 14, 'Everton': 9, 'Burnley': 5, 'Sheffield Utd': 22, 'Luton Town': 16, 'Leeds United': 13, 'Norwich City': 20, 'Watford': 27, 'West Brom': 28, 'Huddersfield': 11, 'Cardiff City': 6, 'Stoke City': 24, 'Swansea City': 25}
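The saved mapping is presumably meant to be reloaded when encoding opponents for new predictions. A minimal sketch of that step follows; it writes a small excerpt of the mapping to a temporary file so it does not touch the real one, and both `encode_opponent` and the `-1` fallback for unseen teams are assumptions made for illustration.

```python
import json
import os
import tempfile

# Small excerpt of the mapping printed above.
sample_mapping = {"Arsenal": 0, "Aston Villa": 1, "Bournemouth": 2}

# Write to a temporary directory so the real mapping file is not overwritten.
path = os.path.join(tempfile.mkdtemp(), "opponent_to_code.json")
with open(path, "w") as f:
    json.dump(sample_mapping, f)

with open(path) as f:
    loaded_mapping = json.load(f)

def encode_opponent(name, mapping, unknown=-1):
    """Return the stored code, or `unknown` for a team not seen in training."""
    return mapping.get(name, unknown)

print(encode_opponent("Arsenal", loaded_mapping))     # 0
print(encode_opponent("Sunderland", loaded_mapping))  # -1
```

How an unseen opponent should be handled is a design decision; the model has never trained on such a code, so its prediction for that row would not be meaningful.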


In [228]:
matches["hour"] = matches["time"].str.replace(":.+", "", regex=True).astype("int")

In [229]:
matches["day_code"] = matches["date"].dt.dayofweek
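The two time features are the kick-off hour (the part of `time` before the colon) and the weekday number, where Monday is 0 and Sunday is 6. The same derivation for a single match can be sketched with the standard library; `kickoff_features` is a hypothetical helper name.

```python
from datetime import datetime

def kickoff_features(date_str, time_str):
    """Return (hour, day_code) for a match; Monday is 0 and Sunday is 6."""
    hour = int(time_str.split(":")[0])
    day_code = datetime.strptime(date_str, "%Y-%m-%d").weekday()
    return hour, day_code

# First row of the data: a Saturday 12:30 kick-off.
print(kickoff_features("2024-08-17", "12:30"))  # (12, 5)
```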

In [230]:
matches

Unnamed: 0,date,time,round,day,venue,result,gf,ga,opponent,xg,...,fk,pk,pkatt,season,team,target,venue_code,opp_code,hour,day_code
0,2024-08-17,12:30,Matchweek 1,Sat,Away,W,2.0,0.0,Ipswich Town,2.6,...,0.0,0,0,2024,Liverpool,1,0,12,12,5
1,2024-08-25,16:30,Matchweek 2,Sun,Home,W,2.0,0.0,Brentford,2.5,...,1.0,0,0,2024,Liverpool,1,1,3,16,6
2,2024-09-01,16:00,Matchweek 3,Sun,Away,W,3.0,0.0,Manchester Utd,1.8,...,0.0,0,0,2024,Liverpool,1,0,18,16,6
3,2024-09-14,15:00,Matchweek 4,Sat,Home,L,0.0,1.0,Nott'ham Forest,0.9,...,0.0,0,0,2024,Liverpool,0,1,21,15,5
5,2024-09-21,15:00,Matchweek 5,Sat,Home,W,3.0,0.0,Bournemouth,2.0,...,0.0,0,0,2024,Liverpool,1,1,2,15,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38,2018-04-15,16:00,Matchweek 34,Sun,Away,W,1.0,0.0,Manchester Utd,0.7,...,0.0,0,0,2017,West Bromwich Albion,1,0,18,16,6
39,2018-04-21,12:30,Matchweek 35,Sat,Home,D,2.0,2.0,Liverpool,1.3,...,0.0,0,0,2017,West Bromwich Albion,0,1,15,12,5
40,2018-04-28,15:00,Matchweek 36,Sat,Away,W,1.0,0.0,Newcastle Utd,0.7,...,0.0,0,0,2017,West Bromwich Albion,1,0,19,15,5
41,2018-05-05,15:00,Matchweek 37,Sat,Home,W,1.0,0.0,Tottenham,1.6,...,0.0,0,0,2017,West Bromwich Albion,1,1,26,15,5


In [231]:
from sklearn.ensemble import RandomForestClassifier

In [232]:
rf = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=1)

In [233]:
train = matches[matches["date"] < '2024-01-01']

In [234]:
test = matches[matches["date"] > '2024-01-01']
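Both comparisons above are strict, so any match dated exactly on the cutoff ends up in neither the training nor the test set. The sketch below shows the effect on three made-up dates and one possible alternative (an inclusive test boundary); it is an illustration, not a change to the cells above.

```python
from datetime import date

cutoff = date(2024, 1, 1)
# Hypothetical match dates straddling the cutoff.
match_dates = [date(2023, 12, 31), date(2024, 1, 1), date(2024, 1, 2)]

# Strict comparisons on both sides, as in the split above.
train_strict = [d for d in match_dates if d < cutoff]
test_strict = [d for d in match_dates if d > cutoff]

# Making the test side inclusive assigns every date to exactly one split.
test_inclusive = [d for d in match_dates if d >= cutoff]

print(len(train_strict) + len(test_strict))     # 2: the cutoff date is in neither split
print(len(train_strict) + len(test_inclusive))  # 3
```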

In [235]:
predictors = ["venue_code", "opp_code", "hour", "day_code"]

In [236]:
rf.fit(train[predictors], train["target"])

In [237]:
preds = rf.predict(test[predictors])

In [238]:
from sklearn.metrics import accuracy_score

In [239]:
# Note: despite its name, `error` stores the accuracy score (share of correct predictions).
error = accuracy_score(test["target"], preds)

In [240]:
error

0.6501650165016502

In [241]:
combined = pd.DataFrame(dict(actual=test["target"], predicted=preds))

In [242]:
pd.crosstab(index=combined["actual"], columns=combined["predicted"])

actual \ predicted,0,1
0,307,75
1,137,87


In [243]:
from sklearn.metrics import precision_score

precision_score(test["target"], preds)

0.5370370370370371
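The accuracy and precision reported above can be recovered by hand from the four crosstab counts. The sketch below also computes two figures the notebook does not report, added here for context: recall, and the accuracy of a baseline that always predicts the majority class ("not a win").

```python
# Counts copied from the crosstab above (rows: actual, columns: predicted).
tn, fp = 307, 75   # actual 0: predicted 0, predicted 1
fn, tp = 137, 87   # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
# Accuracy obtained by always predicting the more common class.
baseline = max(tn + fp, fn + tp) / total

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} baseline={baseline:.3f}")
# accuracy=0.650 precision=0.537 recall=0.388 baseline=0.630
```

On these counts the accuracy is only about two percentage points above the majority-class baseline, which is one reason to focus on precision rather than accuracy.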

In [244]:
grouped_matches = matches.groupby("team")

In [245]:
group = grouped_matches.get_group("Liverpool").sort_values("date")

In [246]:
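def rolling_averages(group, cols, new_cols):
    # Order one team's matches chronologically so the window looks backwards in time.
    group = group.sort_values("date")
    # closed='left' excludes the current match, so each value averages the previous 3 matches only.
    rolling_stats = group[cols].rolling(3, closed='left').mean()
    group[new_cols] = rolling_stats
    # Rows without a full 3-match history (e.g. a team's first 3 matches) are dropped.
    group = group.dropna(subset=new_cols)
    return group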

In [247]:
cols = ["gf", "ga", "sh", "sot", "dist", "fk", "pk", "pkatt"]
new_cols = [f"{c}_rolling" for c in cols]

rolling_averages(group, cols, new_cols)

Unnamed: 0,date,time,round,day,venue,result,gf,ga,opponent,xg,...,hour,day_code,gf_rolling,ga_rolling,sh_rolling,sot_rolling,dist_rolling,fk_rolling,pk_rolling,pkatt_rolling
5,2017-09-09,12:30,Matchweek 4,Sat,Away,L,0.0,5.0,Manchester City,0.7,...,12,5,2.666667,1.000000,18.000000,9.000000,16.133333,0.666667,0.333333,0.333333
7,2017-09-16,15:00,Matchweek 5,Sat,Home,D,1.0,1.0,Burnley,2.2,...,15,5,1.666667,1.666667,16.000000,8.666667,18.333333,1.333333,0.000000,0.000000
9,2017-09-23,17:30,Matchweek 6,Sat,Away,W,3.0,2.0,Leicester City,1.7,...,17,5,1.666667,2.000000,20.000000,7.333333,19.000000,0.666667,0.000000,0.000000
11,2017-10-01,16:30,Matchweek 7,Sun,Away,D,1.0,1.0,Newcastle Utd,1.3,...,16,6,1.333333,2.666667,21.666667,6.000000,19.766667,1.000000,0.000000,0.000000
12,2017-10-14,12:30,Matchweek 8,Sat,Home,D,0.0,0.0,Manchester Utd,1.5,...,12,5,1.666667,1.333333,25.000000,5.666667,18.866667,0.666667,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10,2024-10-20,16:30,Matchweek 8,Sun,Home,W,2.0,1.0,Chelsea,1.9,...,16,6,2.000000,0.333333,14.666667,7.000000,17.833333,0.333333,0.333333,0.333333
12,2024-10-27,16:30,Matchweek 9,Sun,Away,D,2.0,2.0,Arsenal,0.8,...,16,6,1.666667,0.666667,11.000000,4.333333,19.866667,0.666667,0.666667,0.666667
14,2024-11-02,15:00,Matchweek 10,Sat,Home,W,2.0,1.0,Brighton,1.6,...,15,5,1.666667,1.000000,11.000000,4.000000,19.800000,0.333333,0.333333,0.333333
16,2024-11-09,20:00,Matchweek 11,Sat,Home,W,2.0,0.0,Aston Villa,2.0,...,20,5,2.000000,1.333333,11.000000,5.333333,18.166667,0.333333,0.333333,0.333333
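To make the window concrete, the sketch below recomputes `gf_rolling` in plain Python for Liverpool's first five 2017 matches, using the `gf` values shown earlier. `rolling_mean_left` is a hypothetical helper written to mirror `rolling(3, closed='left').mean()` on a single column; it is not part of pandas.

```python
def rolling_mean_left(values, window=3):
    """Mean of the `window` values before each position, excluding the current one."""
    out = []
    for i in range(len(values)):
        if i < window:
            out.append(None)  # not enough history yet; pandas would give NaN
        else:
            out.append(sum(values[i - window:i]) / window)
    return out

# Goals scored in Matchweeks 1 to 5 of the 2017 season.
gf = [3.0, 1.0, 4.0, 0.0, 1.0]
print(rolling_mean_left(gf))
```

The last two entries come out at roughly 2.667 and 1.667, matching the `gf_rolling` values for Matchweeks 4 and 5 in the table above. The first three are empty, which is why those matches are dropped.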


In [248]:
matches_rolling = matches.groupby("team").apply(lambda x: rolling_averages(x, cols, new_cols))


In [249]:
matches_rolling

team,Unnamed: 0,date,time,round,day,venue,result,gf,ga,opponent,xg,...,hour,day_code,gf_rolling,ga_rolling,sh_rolling,sot_rolling,dist_rolling,fk_rolling,pk_rolling,pkatt_rolling
Arsenal,4,2017-09-09,15:00,Matchweek 4,Sat,Home,W,3.0,0.0,Bournemouth,2.2,...,15,5,1.333333,2.666667,17.666667,5.333333,18.133333,0.000000,0.000000,0.000000
Arsenal,6,2017-09-17,13:30,Matchweek 5,Sun,Away,D,0.0,0.0,Chelsea,1.4,...,13,6,1.000000,1.666667,14.333333,5.000000,16.766667,0.333333,0.000000,0.000000
Arsenal,8,2017-09-25,20:00,Matchweek 6,Mon,Home,W,2.0,0.0,West Brom,2.2,...,20,0,1.000000,1.333333,12.000000,3.666667,16.566667,0.333333,0.000000,0.000000
Arsenal,10,2017-10-01,12:00,Matchweek 7,Sun,Home,W,2.0,0.0,Brighton,2.4,...,12,6,1.666667,0.000000,14.333333,5.333333,17.400000,1.333333,0.333333,0.333333
Arsenal,11,2017-10-14,17:30,Matchweek 8,Sat,Away,L,1.0,2.0,Watford,1.0,...,17,5,1.333333,0.000000,17.000000,5.000000,18.333333,1.666667,0.333333,0.333333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wolverhampton Wanderers,9,2024-10-20,14:00,Matchweek 8,Sun,Home,L,1.0,2.0,Manchester City,0.8,...,14,6,1.666667,3.333333,11.666667,4.333333,19.566667,0.000000,0.000000,0.000000
Wolverhampton Wanderers,10,2024-10-26,15:00,Matchweek 9,Sat,Away,D,2.0,2.0,Brighton,1.3,...,15,5,1.666667,3.000000,9.333333,3.666667,20.466667,0.000000,0.000000,0.000000
Wolverhampton Wanderers,11,2024-11-02,17:30,Matchweek 10,Sat,Home,D,2.0,2.0,Crystal Palace,1.5,...,17,5,2.000000,3.000000,11.333333,5.000000,16.800000,0.000000,0.000000,0.000000
Wolverhampton Wanderers,12,2024-11-09,15:00,Matchweek 11,Sat,Home,W,2.0,0.0,Southampton,1.3,...,15,5,1.666667,2.000000,9.333333,5.000000,15.700000,0.000000,0.000000,0.000000


In [250]:
matches_rolling = matches_rolling.droplevel('team')

In [251]:
matches_rolling

Unnamed: 0,date,time,round,day,venue,result,gf,ga,opponent,xg,...,hour,day_code,gf_rolling,ga_rolling,sh_rolling,sot_rolling,dist_rolling,fk_rolling,pk_rolling,pkatt_rolling
4,2017-09-09,15:00,Matchweek 4,Sat,Home,W,3.0,0.0,Bournemouth,2.2,...,15,5,1.333333,2.666667,17.666667,5.333333,18.133333,0.000000,0.000000,0.000000
6,2017-09-17,13:30,Matchweek 5,Sun,Away,D,0.0,0.0,Chelsea,1.4,...,13,6,1.000000,1.666667,14.333333,5.000000,16.766667,0.333333,0.000000,0.000000
8,2017-09-25,20:00,Matchweek 6,Mon,Home,W,2.0,0.0,West Brom,2.2,...,20,0,1.000000,1.333333,12.000000,3.666667,16.566667,0.333333,0.000000,0.000000
10,2017-10-01,12:00,Matchweek 7,Sun,Home,W,2.0,0.0,Brighton,2.4,...,12,6,1.666667,0.000000,14.333333,5.333333,17.400000,1.333333,0.333333,0.333333
11,2017-10-14,17:30,Matchweek 8,Sat,Away,L,1.0,2.0,Watford,1.0,...,17,5,1.333333,0.000000,17.000000,5.000000,18.333333,1.666667,0.333333,0.333333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9,2024-10-20,14:00,Matchweek 8,Sun,Home,L,1.0,2.0,Manchester City,0.8,...,14,6,1.666667,3.333333,11.666667,4.333333,19.566667,0.000000,0.000000,0.000000
10,2024-10-26,15:00,Matchweek 9,Sat,Away,D,2.0,2.0,Brighton,1.3,...,15,5,1.666667,3.000000,9.333333,3.666667,20.466667,0.000000,0.000000,0.000000
11,2024-11-02,17:30,Matchweek 10,Sat,Home,D,2.0,2.0,Crystal Palace,1.5,...,17,5,2.000000,3.000000,11.333333,5.000000,16.800000,0.000000,0.000000,0.000000
12,2024-11-09,15:00,Matchweek 11,Sat,Home,W,2.0,0.0,Southampton,1.3,...,15,5,1.666667,2.000000,9.333333,5.000000,15.700000,0.000000,0.000000,0.000000


In [252]:
matches_rolling.index = range(matches_rolling.shape[0])

In [253]:
def make_predictions(data, predictors):
    # Time-based split; rows dated exactly 2023-01-01 fall in neither set.
    train = data[data["date"] < '2023-01-01']
    test = data[data["date"] > '2023-01-01']
    # Refits the module-level `rf` in place on the training rows.
    rf.fit(train[predictors], train["target"])
    preds = rf.predict(test[predictors])
    combined = pd.DataFrame(dict(actual=test["target"], predicted=preds), index=test.index)
    # `error` holds the accuracy score, not the error rate.
    error = accuracy_score(test["target"], preds)
    precision = precision_score(test["target"], preds)
    return combined, precision, error

In [254]:
combined, precision, error = make_predictions(matches_rolling, predictors + new_cols)

In [255]:
precision

0.585427135678392

In [256]:
error

0.6610407876230661

In [257]:
combined = combined.merge(matches_rolling[["date", "team", "opponent", "result"]], left_index=True, right_index=True)

In [258]:
combined.head(10)

Unnamed: 0,actual,predicted,date,team,opponent,result
203,0,1,2023-01-03,Arsenal,Newcastle Utd,D
204,1,1,2023-01-15,Arsenal,Tottenham,W
205,1,0,2023-01-22,Arsenal,Manchester Utd,W
206,0,1,2023-02-04,Arsenal,Everton,L
207,0,1,2023-02-11,Arsenal,Brentford,D
208,0,0,2023-02-15,Arsenal,Manchester City,L
209,1,1,2023-02-18,Arsenal,Aston Villa,W
210,1,0,2023-02-25,Arsenal,Leicester City,W
211,1,0,2023-03-01,Arsenal,Everton,W
212,1,1,2023-03-04,Arsenal,Bournemouth,W


In [259]:
class MissingDict(dict):
    __missing__ = lambda self, key: key

map_values = {"Brighton and Hove Albion": "Brighton", "Manchester United": "Manchester Utd", "Newcastle United": "Newcastle Utd", "Tottenham Hotspur": "Tottenham", "West Ham United": "West Ham", "Wolverhampton Wanderers": "Wolves"} 
mapping = MissingDict(**map_values)
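The `team` column uses full club names while `opponent` uses shortened ones, so names must be normalised before the two can be matched. `MissingDict` returns the key itself when no entry exists, and pandas' `Series.map` honours `__missing__` on dict subclasses, so unmapped names pass through unchanged instead of becoming NaN. The lookup behaviour alone can be checked without pandas (`demo_mapping` is a name used only in this sketch):

```python
class MissingDict(dict):
    # Called by dict.__getitem__ when the key is absent: return the key itself.
    __missing__ = lambda self, key: key

demo_mapping = MissingDict(**{"Manchester United": "Manchester Utd",
                              "Wolverhampton Wanderers": "Wolves"})

print(demo_mapping["Manchester United"])  # Manchester Utd
print(demo_mapping["Arsenal"])            # Arsenal (no entry, so the name passes through)
print(demo_mapping.get("Arsenal"))        # None: .get() does not trigger __missing__
```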

In [260]:
combined["new_team"] = combined["team"].map(mapping)

In [261]:
merged = combined.merge(combined, left_on=["date", "new_team"], right_on=["date", "opponent"])

In [262]:
merged

Unnamed: 0,actual_x,predicted_x,date,team_x,opponent_x,result_x,new_team_x,actual_y,predicted_y,team_y,opponent_y,result_y,new_team_y
0,0,1,2023-01-03,Arsenal,Newcastle Utd,D,Arsenal,0,0,Newcastle United,Arsenal,D,Newcastle Utd
1,1,1,2023-01-15,Arsenal,Tottenham,W,Arsenal,0,0,Tottenham Hotspur,Arsenal,L,Tottenham
2,1,0,2023-01-22,Arsenal,Manchester Utd,W,Arsenal,0,1,Manchester United,Arsenal,L,Manchester Utd
3,0,1,2023-02-04,Arsenal,Everton,L,Arsenal,1,0,Everton,Arsenal,W,Everton
4,0,1,2023-02-11,Arsenal,Brentford,D,Arsenal,0,0,Brentford,Arsenal,D,Brentford
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1302,0,0,2024-10-20,Wolverhampton Wanderers,Manchester City,L,Wolves,1,0,Manchester City,Wolves,W,Manchester City
1303,0,0,2024-10-26,Wolverhampton Wanderers,Brighton,D,Wolves,0,0,Brighton and Hove Albion,Wolves,D,Brighton
1304,0,0,2024-11-02,Wolverhampton Wanderers,Crystal Palace,D,Wolves,0,0,Crystal Palace,Wolves,D,Crystal Palace
1305,1,1,2024-11-09,Wolverhampton Wanderers,Southampton,W,Wolves,0,0,Southampton,Wolves,L,Southampton


In [263]:
merged[(merged["predicted_x"] == 1) & (merged["predicted_y"] == 0)]["actual_x"].value_counts()

actual_x
1    206
0    131
Name: count, dtype: int64
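These counts cover only the fixtures where the model predicted a win for one side and not a win for the other. The precision on that subset follows directly from the two numbers:

```python
# Counts from the value_counts output above.
wins, non_wins = 206, 131

precision_when_sides_agree = wins / (wins + non_wins)
print(round(precision_when_sides_agree, 3))  # 0.611
```

That is about 61%, compared with roughly 58.5% for all predicted wins on the same test period, so requiring the two sides' predictions to be consistent appears to filter out some false positives.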

In [264]:
joblib.dump(rf, 'rf_rolling_model.pkl')

['rf_rolling_model.pkl']
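The saved model was last fitted inside `make_predictions` on `predictors + new_cols`, so it expects those 12 features in that order. The helper below is a hypothetical sketch (not part of the notebook) that assembles one input row in training order and fails loudly if a feature is missing; the example values are made up.

```python
# Column order used when the model was last fitted (predictors + new_cols).
predictors = ["venue_code", "opp_code", "hour", "day_code"]
cols = ["gf", "ga", "sh", "sot", "dist", "fk", "pk", "pkatt"]
feature_order = predictors + [f"{c}_rolling" for c in cols]

def build_feature_row(features):
    """Return values in training order, raising if any feature is missing."""
    missing = [name for name in feature_order if name not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [features[name] for name in feature_order]

# Hypothetical input: a home match (venue_code 1) at 15:00 on a Saturday.
example = {name: 0.0 for name in feature_order}
example.update({"venue_code": 1, "opp_code": 0, "hour": 15, "day_code": 5})
row = build_feature_row(example)
print(len(row))  # 12
```

A row built this way could then be passed to the reloaded model, for example `joblib.load('rf_rolling_model.pkl').predict([row])`. That call is untested here and should be treated as an outline.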