## Predicting Football Matches

In this project, I will predict the winner of football matches in the English Premier League (EPL).

Project Steps:

* Scrape match data from [FBRef](https://fbref.com/en/comps/9/Premier-League-Stat) using requests, BeautifulSoup, and pandas.
* Clean the data and get it ready for machine learning using pandas.
* Make predictions about who will win a match using scikit-learn.
* Measure error and improve our predictions.

In [1]:
# Import the libraries
import pandas as pd

# Import the ML model and the evaluation metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score


In [2]:
# Read the data
matches = pd.read_csv("matches_.csv", index_col=0, encoding="latin-1")

In [3]:
# Inspect the head of df
matches.head()

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,match report,notes,sh,sot,dist,fk,pk,pkatt,season,team
1,15/8/2021,16:30,Premier League,Matchweek 1,Sun,Away,L,0,1,Tottenham,...,Match Report,,18,4,16.9,1,0,0,2022,Manchester City
2,21/8/2021,15:00,Premier League,Matchweek 2,Sat,Home,W,5,0,Norwich City,...,Match Report,,16,4,17.3,1,0,0,2022,Manchester City
3,28/8/2021,12:30,Premier League,Matchweek 3,Sat,Home,W,5,0,Arsenal,...,Match Report,,25,10,14.3,0,0,0,2022,Manchester City
4,11/9/2021,15:00,Premier League,Matchweek 4,Sat,Away,W,1,0,Leicester City,...,Match Report,,25,8,14.0,0,0,0,2022,Manchester City
6,18/9/2021,15:00,Premier League,Matchweek 5,Sat,Home,D,0,0,Southampton,...,Match Report,,16,1,15.7,1,0,0,2022,Manchester City


In [4]:
matches.shape

(1521, 27)

In [5]:
# 2 seasons * 20 squads * 38 matches
2 * 20 * 38

1520

In [6]:
matches["team"].value_counts()

Manchester City             76
Brighton and Hove Albion    76
Liverpool                   76
Everton                     76
Burnley                     76
Leeds United                76
Aston Villa                 76
Crystal Palace              76
Chelsea                     76
Southampton                 76
Leicester City              76
Newcastle United            76
Wolverhampton Wanderers     76
West Ham United             76
Manchester United           76
Tottenham Hotspur           76
Arsenal                     76
Brentford                   39
Watford                     38
Norwich City                38
Fulham                      38
West Bromwich Albion        38
Sheffield United            38
Name: team, dtype: int64
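
The counts above add up to 17 × 76 + 39 + 5 × 38 = 1521, so the one row beyond the expected 1520 sits with Brentford. A minimal sketch of how such groups could be flagged follows. The helper name `unexpected_counts` and the toy frame are made up for illustration; on the real data one would presumably pass `matches` and `expected=38`.

```python
import pandas as pd

def unexpected_counts(df, expected):
    """Return (team, season) groups whose row count differs from `expected`."""
    counts = df.groupby(["team", "season"]).size()
    return counts[counts != expected]

# Toy data (hypothetical): team "A" has one row too many in season 2022.
toy = pd.DataFrame({
    "team":   ["A", "A", "A", "B", "B"],
    "season": [2022, 2022, 2022, 2022, 2022],
})
flagged = unexpected_counts(toy, expected=2)
print(flagged)
```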

In [45]:
# Let's take a look at a team.
matches[matches["team"] == "Arsenal"].sort_values("date")

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,fk,pk,pkatt,season,team,venue_code,opp_code,hour,day_code,target
11,2020-01-11,16:30,Premier League,Matchweek 7,Sun,Away,W,1,0,Manchester Utd,...,0,1,1,2021,Arsenal,0,13,16,5,1
6,2020-04-10,14:00,Premier League,Matchweek 4,Sun,Home,W,2,1,Sheffield Utd,...,0,0,0,2021,Arsenal,1,16,14,4,1
18,2020-06-12,16:30,Premier League,Matchweek 11,Sun,Away,L,0,2,Tottenham,...,1,0,0,2021,Arsenal,0,18,16,4,0
13,2020-08-11,19:15,Premier League,Matchweek 8,Sun,Home,L,0,3,Aston Villa,...,0,0,0,2021,Arsenal,1,1,19,1,0
2,2020-09-19,20:00,Premier League,Matchweek 2,Sat,Home,W,2,1,West Ham,...,0,0,0,2021,Arsenal,1,21,20,5,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31,2022-06-03,14:00,Premier League,Matchweek 28,Sun,Away,W,3,2,Watford,...,0,0,0,2022,Arsenal,0,19,14,4,1
41,2022-08-05,14:00,Premier League,Matchweek 36,Sun,Home,W,2,1,Leeds United,...,2,0,0,2022,Arsenal,1,9,14,4,1
36,2022-09-04,15:00,Premier League,Matchweek 32,Sat,Home,L,1,2,Brighton,...,2,0,0,2022,Arsenal,1,3,15,6,0
28,2022-10-02,19:45,Premier League,Matchweek 24,Thu,Away,W,1,0,Wolves,...,0,0,0,2022,Arsenal,0,22,19,6,1


In [8]:
# Let's validate the match week
matches["round"].value_counts()

Matchweek 34    41
Matchweek 1     40
Matchweek 29    40
Matchweek 22    40
Matchweek 23    40
Matchweek 24    40
Matchweek 25    40
Matchweek 26    40
Matchweek 27    40
Matchweek 28    40
Matchweek 31    40
Matchweek 2     40
Matchweek 32    40
Matchweek 30    40
Matchweek 33    40
Matchweek 35    40
Matchweek 36    40
Matchweek 37    40
Matchweek 21    40
Matchweek 20    40
Matchweek 19    40
Matchweek 18    40
Matchweek 3     40
Matchweek 4     40
Matchweek 5     40
Matchweek 6     40
Matchweek 7     40
Matchweek 8     40
Matchweek 9     40
Matchweek 10    40
Matchweek 11    40
Matchweek 12    40
Matchweek 13    40
Matchweek 14    40
Matchweek 15    40
Matchweek 16    40
Matchweek 17    40
Matchweek 38    40
Name: round, dtype: int64

# Let's clean the data

In [9]:
matches.dtypes

date             object
time             object
comp             object
round            object
day              object
venue            object
result           object
gf                int64
ga                int64
opponent         object
xg              float64
xga             float64
poss              int64
attendance      float64
captain          object
formation        object
referee          object
match report     object
notes           float64
sh                int64
sot               int64
dist            float64
fk                int64
pk                int64
pkatt             int64
season            int64
team             object
dtype: object

In [10]:
matches.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1521 entries, 1 to 41
Data columns (total 27 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   date          1521 non-null   object 
 1   time          1521 non-null   object 
 2   comp          1521 non-null   object 
 3   round         1521 non-null   object 
 4   day           1521 non-null   object 
 5   venue         1521 non-null   object 
 6   result        1521 non-null   object 
 7   gf            1521 non-null   int64  
 8   ga            1521 non-null   int64  
 9   opponent      1521 non-null   object 
 10  xg            1521 non-null   float64
 11  xga           1521 non-null   float64
 12  poss          1521 non-null   int64  
 13  attendance    825 non-null    float64
 14  captain       1521 non-null   object 
 15  formation     1521 non-null   object 
 16  referee       1521 non-null   object 
 17  match report  1521 non-null   object 
 18  notes         0 non-null      

In [11]:
# Convert the date column to datetime
matches["date"] = pd.to_datetime(matches["date"])

# Encode the venue (Home/Away) as a numeric code
matches["venue_code"] = matches["venue"].astype("category").cat.codes

# Give each opponent a unique numeric code
matches["opp_code"] = matches["opponent"].astype("category").cat.codes

# Keep only the hour from the time column
matches["hour"] = matches["time"].str.replace(":.+", "", regex=True).astype("int")

# Day of the week as a number (Monday=0, Sunday=6)
matches["day_code"] = matches["date"].dt.dayofweek

# Binary target: win = 1, loss or draw = 0
matches["target"] = (matches["result"] == "W").astype("int")
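
To make the encodings concrete, here is a small sketch on three made-up rows. It is not the notebook's data, and it uses `str.split` instead of the regex replacement above purely for readability; both keep the part of the time before the colon.

```python
import pandas as pd

# Toy rows (hypothetical values) with the same column names as the real data.
toy = pd.DataFrame({
    "venue": ["Away", "Home", "Home"],
    "time": ["16:30", "15:00", "12:30"],
    "result": ["L", "W", "D"],
})

# Category codes follow the sorted category labels: "Away" -> 0, "Home" -> 1.
toy["venue_code"] = toy["venue"].astype("category").cat.codes
# Keep the part before the colon as an integer hour.
toy["hour"] = toy["time"].str.split(":").str[0].astype(int)
# Only a win counts as the positive class; draws and losses become 0.
toy["target"] = (toy["result"] == "W").astype(int)
print(toy)
```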

In [12]:
matches.head()

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,fk,pk,pkatt,season,team,venue_code,opp_code,hour,day_code,target
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0,1,Tottenham,...,1,0,0,2022,Manchester City,0,18,16,6,0
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5,0,Norwich City,...,1,0,0,2022,Manchester City,1,15,15,5,1
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5,0,Arsenal,...,0,0,0,2022,Manchester City,1,0,12,5,1
4,2021-11-09,15:00,Premier League,Matchweek 4,Sat,Away,W,1,0,Leicester City,...,0,0,0,2022,Manchester City,0,10,15,1,1
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0,0,Southampton,...,1,0,0,2022,Manchester City,1,17,15,5,0
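
The raw file writes dates day-first (`15/8/2021`). The Matchweek 4 row above, listed as `11/9/2021` in the raw data, comes out as `2021-11-09`, which suggests that ambiguous dates were read month-first. One possible fix, sketched here on two of the raw strings, is to pass an explicit format. The outputs elsewhere in this notebook were produced without it, so `day_code`, the train/test split, and the scores would likely change if it were applied.

```python
import pandas as pd

# Two raw date strings copied from the first table in this notebook.
raw = pd.Series(["15/8/2021", "11/9/2021"])

# An explicit day/month/year format removes the ambiguity.
parsed = pd.to_datetime(raw, format="%d/%m/%Y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
print(parsed.dt.dayofweek.tolist())  # Monday=0 ... Sunday=6
```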


# Time to train the ML model

In [14]:
# Initiate the class
rf = RandomForestClassifier(n_estimators=100, min_samples_split=10, random_state=42)

In [15]:
# Split into train and test sets by date, since this is time-series data.
# Both comparisons are strict, so a match dated exactly 2022-01-01 falls in neither set.
train = matches[matches["date"] < '2022-01-01']
test = matches[matches["date"] > '2022-01-01']

predictors = ["venue_code", "opp_code", "hour", "day_code"]

# Fit the model
rf.fit(train[predictors], train["target"])

RandomForestClassifier(min_samples_split=10, random_state=42)
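
A minimal sketch of a date split that assigns every row to exactly one side, in case matches played on the cutoff date should be kept. The helper name `split_by_date` and the toy dates are illustrative and not part of the notebook.

```python
import pandas as pd

def split_by_date(df, cutoff):
    """Rows before `cutoff` go to train; rows on or after it go to test."""
    cutoff = pd.Timestamp(cutoff)
    train = df[df["date"] < cutoff]
    test = df[df["date"] >= cutoff]
    return train, test

# Toy data (hypothetical): one match falls exactly on the cutoff date.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-12-28", "2022-01-01", "2022-01-02"]),
    "target": [1, 0, 1],
})
toy_train, toy_test = split_by_date(toy, "2022-01-01")
print(len(toy_train), len(toy_test))
```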

In [16]:
# Predict on the test set
preds = rf.predict(test[predictors])

In [17]:
# Evaluate accuracy on the test set
accuracy = accuracy_score(test["target"], preds)
print(accuracy)

0.6066838046272494


In [18]:
# Create confusion matrix
combined = pd.DataFrame(dict(actual=test["target"], predicted=preds))
pd.crosstab(index=combined["actual"], columns=combined["predicted"])

actual \ predicted,0,1
0,194,40
1,113,42


In [19]:
# Evaluate the precision
precision_score(test["target"], preds)

0.5121951219512195
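
Both scores can be read straight off the confusion matrix above. As a quick check, using the four counts from that table:

```python
# Counts copied from the confusion matrix (rows: actual, columns: predicted).
tn, fp = 194, 40   # actual 0: predicted 0, predicted 1
fn, tp = 113, 42   # actual 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted wins that were wins
print(round(accuracy, 4), round(precision, 4))
```

In other words, of the 82 matches predicted as wins, 42 were wins.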

# Create additional predictors to improve accuracy of the model

In [20]:
# Group the df by team
grouped_matches = matches.groupby("team")

In [21]:
group = grouped_matches.get_group("Arsenal").sort_values("date")

In [22]:
group

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,fk,pk,pkatt,season,team,venue_code,opp_code,hour,day_code,target
11,2020-01-11,16:30,Premier League,Matchweek 7,Sun,Away,W,1,0,Manchester Utd,...,0,1,1,2021,Arsenal,0,13,16,5,1
6,2020-04-10,14:00,Premier League,Matchweek 4,Sun,Home,W,2,1,Sheffield Utd,...,0,0,0,2021,Arsenal,1,16,14,4,1
18,2020-06-12,16:30,Premier League,Matchweek 11,Sun,Away,L,0,2,Tottenham,...,1,0,0,2021,Arsenal,0,18,16,4,0
13,2020-08-11,19:15,Premier League,Matchweek 8,Sun,Home,L,0,3,Aston Villa,...,0,0,0,2021,Arsenal,1,1,19,1,0
2,2020-09-19,20:00,Premier League,Matchweek 2,Sat,Home,W,2,1,West Ham,...,0,0,0,2021,Arsenal,1,21,20,5,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31,2022-06-03,14:00,Premier League,Matchweek 28,Sun,Away,W,3,2,Watford,...,0,0,0,2022,Arsenal,0,19,14,4,1
41,2022-08-05,14:00,Premier League,Matchweek 36,Sun,Home,W,2,1,Leeds United,...,2,0,0,2022,Arsenal,1,9,14,4,1
36,2022-09-04,15:00,Premier League,Matchweek 32,Sat,Home,L,1,2,Brighton,...,2,0,0,2022,Arsenal,1,3,15,6,0
28,2022-10-02,19:45,Premier League,Matchweek 24,Thu,Away,W,1,0,Wolves,...,0,0,0,2022,Arsenal,0,22,19,6,1


In [23]:
def rolling_averages(group, cols, new_cols):
    group = group.sort_values("date")
    # Rolling mean of the previous 3 matches; closed='left' excludes the current match
    rolling_stats = group[cols].rolling(3, closed='left').mean()
    group[new_cols] = rolling_stats
    # The first 3 matches of each team have no full window, so drop them
    group = group.dropna(subset=new_cols)
    return group
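
To see what `closed='left'` does, here is a small sketch on a made-up series. Each value is the mean of the three rows before it, so the first three rows have no full window and come out as NaN. Shifting by one row before a plain rolling mean is shown as an equivalent form for a fixed window like this one.

```python
import pandas as pd

# Toy series (hypothetical values), already in date order.
goals = pd.Series([1, 2, 3, 4, 5])

# closed="left" drops the current row from each window: row i sees rows i-3..i-1.
rolled = goals.rolling(3, closed="left").mean()

# Equivalent form: shift by one row, then take a plain 3-row rolling mean.
shifted = goals.shift(1).rolling(3).mean()

print(rolled.tolist())
```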

In [24]:
# Create cols, new cols
cols = ["gf", "ga", "sh", "sot", "dist", "fk", "pk", "pkatt"]
new_cols = [f"{c}_rolling" for c in cols]

new_cols

['gf_rolling',
 'ga_rolling',
 'sh_rolling',
 'sot_rolling',
 'dist_rolling',
 'fk_rolling',
 'pk_rolling',
 'pkatt_rolling']

In [25]:
rolling_averages(group, cols, new_cols)

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,day_code,target,gf_rolling,ga_rolling,sh_rolling,sot_rolling,dist_rolling,fk_rolling,pk_rolling,pkatt_rolling
13,2020-08-11,19:15,Premier League,Matchweek 8,Sun,Home,L,0,3,Aston Villa,...,1,0,1.000000,1.000000,8.000000,2.666667,18.166667,0.333333,0.333333,0.333333
2,2020-09-19,20:00,Premier League,Matchweek 2,Sat,Home,W,2,1,West Ham,...,5,1,0.666667,2.000000,10.333333,3.000000,15.800000,0.333333,0.000000,0.000000
4,2020-09-28,20:00,Premier League,Matchweek 3,Mon,Away,L,1,3,Liverpool,...,0,0,0.666667,2.000000,10.333333,2.333333,15.333333,0.333333,0.000000,0.000000
7,2020-10-17,17:30,Premier League,Matchweek 5,Sat,Away,L,0,1,Manchester City,...,5,0,1.000000,2.333333,7.666667,2.666667,15.400000,0.000000,0.000000,0.000000
9,2020-10-25,19:15,Premier League,Matchweek 6,Sun,Home,L,0,1,Leicester City,...,6,0,1.000000,1.666667,7.000000,3.000000,16.266667,0.666667,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31,2022-06-03,14:00,Premier League,Matchweek 28,Sun,Away,W,3,2,Watford,...,4,1,2.666667,1.333333,16.333333,5.333333,18.533333,0.333333,0.666667,0.666667
41,2022-08-05,14:00,Premier League,Matchweek 36,Sun,Home,W,2,1,Leeds United,...,4,1,2.666667,1.666667,17.333333,4.666667,18.066667,0.000000,0.333333,0.333333
36,2022-09-04,15:00,Premier League,Matchweek 32,Sat,Home,L,1,2,Brighton,...,6,0,3.333333,1.333333,20.000000,7.000000,15.933333,0.666667,0.333333,0.333333
28,2022-10-02,19:45,Premier League,Matchweek 24,Thu,Away,W,1,0,Wolves,...,6,1,2.000000,1.666667,18.333333,5.666667,15.700000,1.333333,0.000000,0.000000


In [26]:
# Apply rolling averages for every match, every team
matches_rolling = matches.groupby("team").apply(lambda x: rolling_averages(x, cols, new_cols))

matches_rolling

team,Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,day_code,target,gf_rolling,ga_rolling,sh_rolling,sot_rolling,dist_rolling,fk_rolling,pk_rolling,pkatt_rolling
Arsenal,13,2020-08-11,19:15,Premier League,Matchweek 8,Sun,Home,L,0,3,Aston Villa,...,1,0,1.000000,1.000000,8.000000,2.666667,18.166667,0.333333,0.333333,0.333333
Arsenal,2,2020-09-19,20:00,Premier League,Matchweek 2,Sat,Home,W,2,1,West Ham,...,5,1,0.666667,2.000000,10.333333,3.000000,15.800000,0.333333,0.000000,0.000000
Arsenal,4,2020-09-28,20:00,Premier League,Matchweek 3,Mon,Away,L,1,3,Liverpool,...,0,0,0.666667,2.000000,10.333333,2.333333,15.333333,0.333333,0.000000,0.000000
Arsenal,7,2020-10-17,17:30,Premier League,Matchweek 5,Sat,Away,L,0,1,Manchester City,...,5,0,1.000000,2.333333,7.666667,2.666667,15.400000,0.000000,0.000000,0.000000
Arsenal,9,2020-10-25,19:15,Premier League,Matchweek 6,Sun,Home,L,0,1,Leicester City,...,6,0,1.000000,1.666667,7.000000,3.000000,16.266667,0.666667,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wolverhampton Wanderers,38,2022-07-05,15:00,Premier League,Matchweek 36,Sat,Away,D,2,2,Chelsea,...,1,0,0.666667,2.000000,12.000000,4.666667,18.166667,0.666667,0.000000,0.000000
Wolverhampton Wanderers,35,2022-08-04,20:00,Premier League,Matchweek 32,Fri,Away,L,0,1,Newcastle Utd,...,3,0,1.333333,2.000000,12.666667,4.333333,16.433333,0.666667,0.000000,0.000000
Wolverhampton Wanderers,25,2022-10-02,19:45,Premier League,Matchweek 24,Thu,Home,L,0,1,Arsenal,...,6,0,1.000000,2.000000,8.666667,3.666667,15.766667,0.000000,0.000000,0.000000
Wolverhampton Wanderers,31,2022-10-03,19:30,Premier League,Matchweek 19,Thu,Home,W,4,0,Watford,...,0,1,0.666667,1.333333,11.333333,3.333333,17.133333,0.666667,0.000000,0.000000


In [27]:
# Drop the additional index level.

matches_rolling = matches_rolling.droplevel('team')

matches_rolling

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,day_code,target,gf_rolling,ga_rolling,sh_rolling,sot_rolling,dist_rolling,fk_rolling,pk_rolling,pkatt_rolling
13,2020-08-11,19:15,Premier League,Matchweek 8,Sun,Home,L,0,3,Aston Villa,...,1,0,1.000000,1.000000,8.000000,2.666667,18.166667,0.333333,0.333333,0.333333
2,2020-09-19,20:00,Premier League,Matchweek 2,Sat,Home,W,2,1,West Ham,...,5,1,0.666667,2.000000,10.333333,3.000000,15.800000,0.333333,0.000000,0.000000
4,2020-09-28,20:00,Premier League,Matchweek 3,Mon,Away,L,1,3,Liverpool,...,0,0,0.666667,2.000000,10.333333,2.333333,15.333333,0.333333,0.000000,0.000000
7,2020-10-17,17:30,Premier League,Matchweek 5,Sat,Away,L,0,1,Manchester City,...,5,0,1.000000,2.333333,7.666667,2.666667,15.400000,0.000000,0.000000,0.000000
9,2020-10-25,19:15,Premier League,Matchweek 6,Sun,Home,L,0,1,Leicester City,...,6,0,1.000000,1.666667,7.000000,3.000000,16.266667,0.666667,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38,2022-07-05,15:00,Premier League,Matchweek 36,Sat,Away,D,2,2,Chelsea,...,1,0,0.666667,2.000000,12.000000,4.666667,18.166667,0.666667,0.000000,0.000000
35,2022-08-04,20:00,Premier League,Matchweek 32,Fri,Away,L,0,1,Newcastle Utd,...,3,0,1.333333,2.000000,12.666667,4.333333,16.433333,0.666667,0.000000,0.000000
25,2022-10-02,19:45,Premier League,Matchweek 24,Thu,Home,L,0,1,Arsenal,...,6,0,1.000000,2.000000,8.666667,3.666667,15.766667,0.000000,0.000000,0.000000
31,2022-10-03,19:30,Premier League,Matchweek 19,Thu,Home,W,4,0,Watford,...,0,1,0.666667,1.333333,11.333333,3.333333,17.133333,0.666667,0.000000,0.000000


In [28]:
# Reset index of matches_rolling
matches_rolling.index = range(matches_rolling.shape[0])
matches_rolling

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,day_code,target,gf_rolling,ga_rolling,sh_rolling,sot_rolling,dist_rolling,fk_rolling,pk_rolling,pkatt_rolling
0,2020-08-11,19:15,Premier League,Matchweek 8,Sun,Home,L,0,3,Aston Villa,...,1,0,1.000000,1.000000,8.000000,2.666667,18.166667,0.333333,0.333333,0.333333
1,2020-09-19,20:00,Premier League,Matchweek 2,Sat,Home,W,2,1,West Ham,...,5,1,0.666667,2.000000,10.333333,3.000000,15.800000,0.333333,0.000000,0.000000
2,2020-09-28,20:00,Premier League,Matchweek 3,Mon,Away,L,1,3,Liverpool,...,0,0,0.666667,2.000000,10.333333,2.333333,15.333333,0.333333,0.000000,0.000000
3,2020-10-17,17:30,Premier League,Matchweek 5,Sat,Away,L,0,1,Manchester City,...,5,0,1.000000,2.333333,7.666667,2.666667,15.400000,0.000000,0.000000,0.000000
4,2020-10-25,19:15,Premier League,Matchweek 6,Sun,Home,L,0,1,Leicester City,...,6,0,1.000000,1.666667,7.000000,3.000000,16.266667,0.666667,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1444,2022-07-05,15:00,Premier League,Matchweek 36,Sat,Away,D,2,2,Chelsea,...,1,0,0.666667,2.000000,12.000000,4.666667,18.166667,0.666667,0.000000,0.000000
1445,2022-08-04,20:00,Premier League,Matchweek 32,Fri,Away,L,0,1,Newcastle Utd,...,3,0,1.333333,2.000000,12.666667,4.333333,16.433333,0.666667,0.000000,0.000000
1446,2022-10-02,19:45,Premier League,Matchweek 24,Thu,Home,L,0,1,Arsenal,...,6,0,1.000000,2.000000,8.666667,3.666667,15.766667,0.000000,0.000000,0.000000
1447,2022-10-03,19:30,Premier League,Matchweek 19,Thu,Home,W,4,0,Watford,...,0,1,0.666667,1.333333,11.333333,3.333333,17.133333,0.666667,0.000000,0.000000


# Retrain the ML model

In [29]:
def make_predictions(data, predictors):
    # Same date-based split as before; refits the global model `rf` defined above
    train = data[data["date"] < '2022-01-01']
    test = data[data["date"] > '2022-01-01']
    rf.fit(train[predictors], train["target"])
    preds = rf.predict(test[predictors])
    # Keep the test index so the predictions can be joined back to the match details
    combined = pd.DataFrame(dict(actual=test["target"], predicted=preds), index=test.index)
    precision = precision_score(test["target"], preds)
    return combined, precision

In [30]:
combined, precision = make_predictions(matches_rolling, predictors + new_cols)

In [31]:
precision

0.6024096385542169

In [32]:
combined

Unnamed: 0,actual,predicted
55,1,1
56,0,0
57,1,0
58,1,1
59,1,0
...,...,...
1444,0,0
1445,0,0
1446,0,0
1447,1,0


In [33]:
# Let's merge the date, team, opponent, and result with the actual and predicted values
combined = combined.merge(matches_rolling[["date", "team", "opponent", "result"]], left_index=True, right_index=True)

In [34]:
combined.head(10)

Unnamed: 0,actual,predicted,date,team,opponent,result
55,1,1,2022-01-05,Arsenal,West Ham,W
56,0,0,2022-01-23,Arsenal,Burnley,D
57,1,0,2022-02-19,Arsenal,Brentford,W
58,1,1,2022-02-24,Arsenal,Wolves,W
59,1,0,2022-03-13,Arsenal,Leicester City,W
60,0,0,2022-03-16,Arsenal,Liverpool,L
61,1,0,2022-03-19,Arsenal,Aston Villa,W
62,0,0,2022-04-04,Arsenal,Crystal Palace,L
63,0,0,2022-04-16,Arsenal,Southampton,L
64,1,0,2022-04-20,Arsenal,Chelsea,W


# Combine home and away predictions

In [35]:
# Map full team names to the short names used in the opponent column.
# MissingDict returns the key itself when a name has no entry.

class MissingDict(dict):
    __missing__ = lambda self, key: key

map_values = {
    "Brighton and Hove Albion": "Brighton", 
    "Manchester United": "Manchester Utd", 
    "Newcastle United": "Newcastle Utd", 
    "Tottenham Hotspur": "Tottenham", 
    "West Ham United": "West Ham", 
    "Wolverhampton Wanderers": "Wolves"} 
mapping = MissingDict(**map_values)

In [36]:
# Validate the mapping: looking up "West Ham United" should return "West Ham"
mapping["West Ham United"]

'West Ham'
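
The fallback matters when the mapping is passed to `Series.map`: with a plain dict, names without an entry would become NaN, whereas a dict subclass that defines `__missing__` supplies its own default. A small sketch with a one-entry mapping and made-up input, written with a regular method instead of a lambda:

```python
import pandas as pd

class MissingDict(dict):
    def __missing__(self, key):
        # Fall back to the key itself when no short name is defined.
        return key

mapping = MissingDict({"West Ham United": "West Ham"})
teams = pd.Series(["West Ham United", "Arsenal"])

mapped = teams.map(mapping)                              # unmapped names pass through
plain = teams.map({"West Ham United": "West Ham"})       # unmapped names become NaN
print(mapped.tolist())
```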

In [37]:
# Create a new column with the mapped team name
combined["new_team"] = combined["team"].map(mapping)

In [38]:
# Join each team's row with its opponent's row for the same match
merged = combined.merge(combined, left_on=["date", "new_team"], right_on=["date", "opponent"])

In [39]:
merged.head()

Unnamed: 0,actual_x,predicted_x,date,team_x,opponent_x,result_x,new_team_x,actual_y,predicted_y,team_y,opponent_y,result_y,new_team_y
0,1,1,2022-01-05,Arsenal,West Ham,W,Arsenal,0,0,West Ham United,Arsenal,L,West Ham
1,0,0,2022-01-23,Arsenal,Burnley,D,Arsenal,0,0,Burnley,Arsenal,D,Burnley
2,1,0,2022-02-19,Arsenal,Brentford,W,Arsenal,0,0,Brentford,Arsenal,L,Brentford
3,1,1,2022-02-24,Arsenal,Wolves,W,Arsenal,0,0,Wolverhampton Wanderers,Arsenal,L,Wolves
4,1,0,2022-03-13,Arsenal,Leicester City,W,Arsenal,0,0,Leicester City,Arsenal,L,Leicester City
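
The self-merge pairs each row with the row in which that team appears as the opponent on the same date, so the `_x` columns describe one side of a match and the `_y` columns the other. A minimal sketch on a single made-up match (the prediction values are hypothetical):

```python
import pandas as pd

# Toy predictions: one match seen from both sides.
toy_combined = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-05", "2022-01-05"]),
    "team": ["Arsenal", "West Ham United"],
    "new_team": ["Arsenal", "West Ham"],
    "opponent": ["West Ham", "Arsenal"],
    "predicted": [1, 0],
})

# Match on the same date, with the left side's team equal to the right side's opponent.
toy_merged = toy_combined.merge(
    toy_combined, left_on=["date", "new_team"], right_on=["date", "opponent"]
)
print(toy_merged[["team_x", "predicted_x", "team_y", "predicted_y"]])
```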


In [40]:
# Look at matches where one team is predicted to win and the other is predicted not to win
merged[(merged["predicted_x"] == 1) & (merged["predicted_y"] == 0)]["actual_x"].value_counts()

1    47
0    26
Name: actual_x, dtype: int64

In [43]:
# 47 correct out of 47 + 26 = 73 such predictions
round(47 / 73, 3)

0.644

In [42]:
matches.columns

Index(['date', 'time', 'comp', 'round', 'day', 'venue', 'result', 'gf', 'ga',
       'opponent', 'xg', 'xga', 'poss', 'attendance', 'captain', 'formation',
       'referee', 'match report', 'notes', 'sh', 'sot', 'dist', 'fk', 'pk',
       'pkatt', 'season', 'team', 'venue_code', 'opp_code', 'hour', 'day_code',
       'target'],
      dtype='object')