## EPL Game Predictions
In this project, the goal is to use the match data in `matches_1.csv` to predict game results and then evaluate how accurate those predictions are.

## Read In Data
First we read in the data, then familiarize ourselves with the DataFrame and the types of data it holds.

In [415]:
import pandas as pd
matches = pd.read_csv('matches_1.csv', index_col = 0)
matches.head()

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,match report,notes,sh,sot,dist,fk,pk,pkatt,season,team
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0,1,Tottenham,...,Match Report,,18.0,4.0,17.3,1.0,0.0,0.0,2022,Manchester City
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5,0,Norwich City,...,Match Report,,16.0,4.0,18.5,1.0,0.0,0.0,2022,Manchester City
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5,0,Arsenal,...,Match Report,,25.0,10.0,14.8,0.0,0.0,0.0,2022,Manchester City
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1,0,Leicester City,...,Match Report,,25.0,8.0,14.3,0.0,0.0,0.0,2022,Manchester City
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0,0,Southampton,...,Match Report,,16.0,1.0,16.4,1.0,0.0,0.0,2022,Manchester City


In [416]:
matches.shape

(1520, 27)

In [417]:
matches['team'].value_counts()

Manchester City             76
Wolverhampton Wanderers     76
Burnley                     76
Leeds United                76
Everton                     76
Southampton                 76
Aston Villa                 76
Liverpool                   76
Newcastle United            76
Crystal Palace              76
Brighton and Hove Albion    76
Leicester City              76
West Ham United             76
Manchester United           76
Arsenal                     76
Tottenham Hotspur           76
Chelsea                     76
Brentford                   38
Watford                     38
Norwich City                38
Fulham                      38
West Bromwich Albion        38
Sheffield United            38
Name: team, dtype: int64

In [418]:
matches['round'].value_counts()

Matchweek 1     40
Matchweek 29    40
Matchweek 22    40
Matchweek 23    40
Matchweek 24    40
Matchweek 25    40
Matchweek 26    40
Matchweek 27    40
Matchweek 28    40
Matchweek 31    40
Matchweek 2     40
Matchweek 32    40
Matchweek 30    40
Matchweek 34    40
Matchweek 35    40
Matchweek 36    40
Matchweek 33    40
Matchweek 37    40
Matchweek 21    40
Matchweek 20    40
Matchweek 19    40
Matchweek 18    40
Matchweek 3     40
Matchweek 4     40
Matchweek 5     40
Matchweek 6     40
Matchweek 7     40
Matchweek 8     40
Matchweek 9     40
Matchweek 10    40
Matchweek 11    40
Matchweek 12    40
Matchweek 13    40
Matchweek 14    40
Matchweek 15    40
Matchweek 16    40
Matchweek 17    40
Matchweek 38    40
Name: round, dtype: int64

In [419]:
matches.dtypes

date             object
time             object
comp             object
round            object
day              object
venue            object
result           object
gf                int64
ga                int64
opponent         object
xg              float64
xga             float64
poss            float64
attendance      float64
captain          object
formation        object
referee          object
match report     object
notes           float64
sh              float64
sot             float64
dist            float64
fk              float64
pk              float64
pkatt           float64
season            int64
team             object
dtype: object

In [420]:
matches['date'] = pd.to_datetime(matches['date'])


In [421]:
matches.dtypes

date            datetime64[ns]
time                    object
comp                    object
round                   object
day                     object
venue                   object
result                  object
gf                       int64
ga                       int64
opponent                object
xg                     float64
xga                    float64
poss                   float64
attendance             float64
captain                 object
formation               object
referee                 object
match report            object
notes                  float64
sh                     float64
sot                    float64
dist                   float64
fk                     float64
pk                     float64
pkatt                  float64
season                   int64
team                    object
dtype: object

## Create Predictors for the Model

In [422]:
matches['venue_code'] = matches['venue'].astype('category').cat.codes

In [423]:
matches['opp_code'] = matches['opponent'].astype('category').cat.codes

In [424]:
matches['hour'] = matches['time'].str.replace(':.+','',regex = True).astype('int')

In [425]:
matches['day_code'] = matches['date'].dt.dayofweek

In [426]:
matches['target'] = (matches['result'] == 'W').astype('int')
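To make the encoding steps concrete, here is a small sketch on a made-up frame (the `toy` rows are illustrative and not taken from `matches_1.csv`). Category codes follow the sorted order of the distinct values, which is why `Away` becomes 0 and `Home` becomes 1.

```python
import pandas as pd

# Hypothetical mini-frame that reuses the column names of `matches`.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-08-15", "2021-08-21", "2021-09-22"]),
    "time": ["16:30", "15:00", "19:45"],
    "venue": ["Away", "Home", "Home"],
    "result": ["L", "W", "D"],
})

# Categories are sorted alphabetically, so Away -> 0 and Home -> 1.
toy["venue_code"] = toy["venue"].astype("category").cat.codes
# Keep only the hour: "16:30" -> 16.
toy["hour"] = toy["time"].str.replace(":.+", "", regex=True).astype("int")
# Monday is 0 and Sunday is 6.
toy["day_code"] = toy["date"].dt.dayofweek
# Draws and losses both count as "not a win".
toy["target"] = (toy["result"] == "W").astype("int")

print(toy[["venue_code", "hour", "day_code", "target"]])
```

Note that the target treats a draw the same as a loss, so the model only answers "win or not".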

In [427]:
matches


Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,fk,pk,pkatt,season,team,venue_code,opp_code,hour,day_code,target
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0,1,Tottenham,...,1.0,0.0,0.0,2022,Manchester City,0,18,16,6,0
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5,0,Norwich City,...,1.0,0.0,0.0,2022,Manchester City,1,15,15,5,1
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5,0,Arsenal,...,0.0,0.0,0.0,2022,Manchester City,1,0,12,5,1
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1,0,Leicester City,...,0.0,0.0,0.0,2022,Manchester City,0,10,15,5,1
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0,0,Southampton,...,1.0,0.0,0.0,2022,Manchester City,1,17,15,5,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38,2021-05-02,19:15,Premier League,Matchweek 34,Sun,Away,L,0,4,Tottenham,...,0.0,0.0,0.0,2021,Sheffield United,0,18,19,6,0
39,2021-05-08,15:00,Premier League,Matchweek 35,Sat,Home,L,0,2,Crystal Palace,...,1.0,0.0,0.0,2021,Sheffield United,1,6,15,5,0
40,2021-05-16,19:00,Premier League,Matchweek 36,Sun,Away,W,1,0,Everton,...,0.0,0.0,0.0,2021,Sheffield United,0,7,19,6,1
41,2021-05-19,18:00,Premier League,Matchweek 37,Wed,Away,L,0,1,Newcastle Utd,...,1.0,0.0,0.0,2021,Sheffield United,0,14,18,2,0


## Train RandomForest Model
After fitting the model on the training data and generating predictions for the test data, we will measure its accuracy and precision.

In [428]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 50, min_samples_split = 10, random_state = 1)


In [429]:
train_data = matches[matches['date'] < '2022-01-01']
test_data = matches[matches['date'] > '2022-01-01']
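One detail of this split is worth spelling out: both comparisons are strict, so any row dated exactly 2022-01-01 would land in neither set. The sketch below shows this on hypothetical dates; whether the real file contains matches on that day is not checked here.

```python
import pandas as pd

# Hypothetical match dates around the cutoff.
toy = pd.DataFrame({"date": pd.to_datetime(
    ["2021-12-28", "2022-01-01", "2022-01-02"])})

cutoff = "2022-01-01"
train = toy[toy["date"] < cutoff]
test = toy[toy["date"] > cutoff]
# Rows in neither set: anything dated exactly at the cutoff.
dropped = toy[toy["date"] == cutoff]

print(len(train), len(test), len(dropped))

# Using >= on the test side would keep the boundary day in the test set.
test_inclusive = toy[toy["date"] >= cutoff]
print(len(test_inclusive))
```

Splitting by date rather than at random keeps future matches out of the training data, which matters for a time-ordered problem like this one.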

In [430]:
predictors = ['opp_code', 'venue_code', 'hour', 'day_code']

In [431]:
rf.fit(train_data[predictors], train_data['target'])

RandomForestClassifier(min_samples_split=10, n_estimators=50, random_state=1)

In [432]:
prediction = rf.predict(test_data[predictors])

In [433]:
from sklearn.metrics import accuracy_score

In [434]:
acc = accuracy_score(test_data['target'], prediction)

In [435]:
acc

0.6108247422680413

In [436]:
combo = pd.DataFrame(dict(actual = test_data['target'], predictions = prediction))

In [437]:
pd.crosstab(index = combo['actual'], columns = combo['predictions'])

predictions,0,1
actual,,
0,187,46
1,105,50


In [438]:
from sklearn.metrics import precision_score

In [439]:
precision_score(test_data['target'], prediction)

0.5208333333333334
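Both scores can be recomputed by hand from the crosstab above, which helps show what each one measures. The sketch below uses the four counts from that table; the variable names are ours.

```python
# Counts copied from the crosstab above.
tn, fp = 187, 46   # actual 0: predicted 0, predicted 1
fn, tp = 105, 50   # actual 1: predicted 0, predicted 1

# Accuracy: share of all predictions that were correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)
# Precision: share of predicted wins that really were wins.
precision = tp / (tp + fp)

print(round(accuracy, 4), round(precision, 4))  # 0.6108 0.5208
```

The model misses 105 of the 155 actual wins, which the accuracy figure alone hides; that is why precision is tracked separately from here on.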

## Generating More Predictors and Retraining Model
The new predictors are rolling averages of each team's stats over its previous three matches. They are added to the original list of predictors.

In [440]:
grouped_matches = matches.groupby('team')

In [441]:
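def rolling_averages(group,cols,new_cols):
    # Sort chronologically so each window looks back in time
    group = group.sort_values('date')
    # closed = 'left' excludes the current match, so only earlier results are averaged
    rolling_stat = group[cols].rolling(3, closed = 'left').mean()
    group[new_cols] = rolling_stat
    # Drop rows without a complete three-match window (e.g. a team's first three matches)
    group = group.dropna(subset = new_cols)
    return group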
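The effect of `closed = 'left'` is easiest to see on a tiny made-up series. This sketch assumes a pandas version that accepts `closed` for fixed-size windows (1.2 or later, to the best of our knowledge); the numbers are invented.

```python
import pandas as pd

# Hypothetical goals scored in five consecutive matches.
toy = pd.DataFrame({"gf": [1, 2, 3, 4, 5]})

# closed="left" drops the current row from its own window, so each
# average only uses the three matches played before it.
toy["gf_rolling"] = toy["gf"].rolling(3, closed="left").mean()

print(toy)
```

The first three rows come out as `NaN` because fewer than three earlier matches exist, and the fourth row averages rows one to three. Leaving the current match out matters: otherwise the goals scored in a match would leak into the features used to predict that same match.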


In [442]:
cols = ['gf', 'ga', 'sh', 'sot', 'dist', 'fk', 'pk', 'pkatt']
new_cols = [f'{c}_rolling' for c in cols]

In [443]:
new_cols

['gf_rolling',
 'ga_rolling',
 'sh_rolling',
 'sot_rolling',
 'dist_rolling',
 'fk_rolling',
 'pk_rolling',
 'pkatt_rolling']

In [444]:
matches_rolling = matches.groupby('team').apply(lambda x: rolling_averages(x,cols,new_cols))

In [445]:
matches_rolling = matches_rolling.droplevel('team')

In [446]:
matches_rolling

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,day_code,target,gf_rolling,ga_rolling,sh_rolling,sot_rolling,dist_rolling,fk_rolling,pk_rolling,pkatt_rolling
6,2020-10-04,14:00,Premier League,Matchweek 4,Sun,Home,W,2,1,Sheffield Utd,...,6,1,2.000000,1.333333,8.000000,3.666667,14.633333,0.666667,0.000000,0.000000
7,2020-10-17,17:30,Premier League,Matchweek 5,Sat,Away,L,0,1,Manchester City,...,5,0,1.666667,1.666667,5.666667,3.666667,15.366667,0.000000,0.000000,0.000000
9,2020-10-25,19:15,Premier League,Matchweek 6,Sun,Home,L,0,1,Leicester City,...,6,0,1.000000,1.666667,7.000000,3.666667,16.566667,0.666667,0.000000,0.000000
11,2020-11-01,16:30,Premier League,Matchweek 7,Sun,Away,W,1,0,Manchester Utd,...,6,1,0.666667,1.000000,9.666667,4.000000,16.566667,1.000000,0.000000,0.000000
13,2020-11-08,19:15,Premier League,Matchweek 8,Sun,Home,L,0,3,Aston Villa,...,6,0,0.333333,0.666667,9.666667,2.666667,19.333333,1.000000,0.333333,0.333333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37,2022-04-30,15:00,Premier League,Matchweek 35,Sat,Home,L,0,3,Brighton,...,5,0,0.666667,1.000000,8.666667,3.333333,17.400000,0.000000,0.000000,0.000000
38,2022-05-07,15:00,Premier League,Matchweek 36,Sat,Away,D,2,2,Chelsea,...,5,0,0.000000,1.666667,8.666667,2.333333,18.666667,0.333333,0.000000,0.000000
39,2022-05-11,20:15,Premier League,Matchweek 33,Wed,Home,L,1,5,Manchester City,...,2,0,0.666667,2.000000,11.666667,3.000000,17.800000,0.333333,0.000000,0.000000
40,2022-05-15,14:00,Premier League,Matchweek 37,Sun,Home,D,1,1,Norwich City,...,6,0,1.000000,3.333333,10.666667,2.666667,17.100000,0.333333,0.000000,0.000000


In [447]:
matches_rolling.index = range(matches_rolling.shape[0])

In [448]:
matches_rolling

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,day_code,target,gf_rolling,ga_rolling,sh_rolling,sot_rolling,dist_rolling,fk_rolling,pk_rolling,pkatt_rolling
0,2020-10-04,14:00,Premier League,Matchweek 4,Sun,Home,W,2,1,Sheffield Utd,...,6,1,2.000000,1.333333,8.000000,3.666667,14.633333,0.666667,0.000000,0.000000
1,2020-10-17,17:30,Premier League,Matchweek 5,Sat,Away,L,0,1,Manchester City,...,5,0,1.666667,1.666667,5.666667,3.666667,15.366667,0.000000,0.000000,0.000000
2,2020-10-25,19:15,Premier League,Matchweek 6,Sun,Home,L,0,1,Leicester City,...,6,0,1.000000,1.666667,7.000000,3.666667,16.566667,0.666667,0.000000,0.000000
3,2020-11-01,16:30,Premier League,Matchweek 7,Sun,Away,W,1,0,Manchester Utd,...,6,1,0.666667,1.000000,9.666667,4.000000,16.566667,1.000000,0.000000,0.000000
4,2020-11-08,19:15,Premier League,Matchweek 8,Sun,Home,L,0,3,Aston Villa,...,6,0,0.333333,0.666667,9.666667,2.666667,19.333333,1.000000,0.333333,0.333333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1443,2022-04-30,15:00,Premier League,Matchweek 35,Sat,Home,L,0,3,Brighton,...,5,0,0.666667,1.000000,8.666667,3.333333,17.400000,0.000000,0.000000,0.000000
1444,2022-05-07,15:00,Premier League,Matchweek 36,Sat,Away,D,2,2,Chelsea,...,5,0,0.000000,1.666667,8.666667,2.333333,18.666667,0.333333,0.000000,0.000000
1445,2022-05-11,20:15,Premier League,Matchweek 33,Wed,Home,L,1,5,Manchester City,...,2,0,0.666667,2.000000,11.666667,3.000000,17.800000,0.333333,0.000000,0.000000
1446,2022-05-15,14:00,Premier League,Matchweek 37,Sun,Home,D,1,1,Norwich City,...,6,0,1.000000,3.333333,10.666667,2.666667,17.100000,0.333333,0.000000,0.000000


In [449]:
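def make_predictions(data,predictors):
    # Same date-based split as before: train on matches before 2022, test on 2022 matches
    train_data = data[data['date'] < '2022-01-01']
    test_data = data[data['date'] > '2022-01-01']
    # Refit the random forest `rf` defined earlier on the given predictors
    rf.fit(train_data[predictors], train_data['target'])
    prediction = rf.predict(test_data[predictors])
    # Keep the test index so the predictions can be joined back to match details
    combo = pd.DataFrame(dict(actual = test_data['target'], predictions = prediction), index = test_data.index)
    precision = precision_score(test_data['target'], prediction)
    return combo, precision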

In [450]:
preds = predictors + new_cols

In [451]:
combo, precision = make_predictions(matches_rolling, preds)

### Precision has improved (from about 0.52 to about 0.57), but could be better

In [453]:
precision

0.5657894736842105

In [454]:
combo

Unnamed: 0,actual,predictions
55,0,1
56,1,1
57,1,0
58,1,1
59,1,1
...,...,...
1443,0,0
1444,0,0
1445,0,0
1446,0,0


In [455]:
combo = combo.merge(matches_rolling[['date', 'team', 'opponent', 'result']], left_index = True, right_index = True)

In [456]:
combo

Unnamed: 0,actual,predictions,date,team,opponent,result
55,0,1,2022-01-23,Arsenal,Burnley,D
56,1,1,2022-02-10,Arsenal,Wolves,W
57,1,0,2022-02-19,Arsenal,Brentford,W
58,1,1,2022-02-24,Arsenal,Wolves,W
59,1,1,2022-03-06,Arsenal,Watford,W
...,...,...,...,...,...,...
1443,0,0,2022-04-30,Wolverhampton Wanderers,Brighton,L
1444,0,0,2022-05-07,Wolverhampton Wanderers,Chelsea,D
1445,0,0,2022-05-11,Wolverhampton Wanderers,Manchester City,L
1446,0,0,2022-05-15,Wolverhampton Wanderers,Norwich City,D


In [457]:
class MissingDict(dict):
    # __missing__ is called for absent keys; returning the key leaves unmapped names unchanged
    __missing__ = lambda self, key: key

# Map the long names in 'team' to the short names used in 'opponent'
mapping_values = {'Brighton and Hove Albion': 'Brighton',
                  'Manchester United': 'Manchester Utd',
                  'Newcastle United': 'Newcastle Utd',
                  'Tottenham Hotspur': 'Tottenham',
                  'West Ham United': 'West Ham',
                  'Wolverhampton Wanderers': 'Wolves'
                 }
mapping = MissingDict(**mapping_values)
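The reason for subclassing `dict` is that `Series.map` turns unmapped values into `NaN` for a plain dict, but uses `__missing__` when the dict subclass defines it. Note that the special method needs two underscores on each side. Here is a minimal sketch with a hypothetical one-entry mapping:

```python
import pandas as pd

class MissingDict(dict):
    # Called by dict lookups when a key is absent: return the key itself.
    def __missing__(self, key):
        return key

# Hypothetical one-entry mapping, just for illustration.
short_names = MissingDict({"Wolverhampton Wanderers": "Wolves"})

teams = pd.Series(["Wolverhampton Wanderers", "Arsenal"])
mapped = teams.map(short_names)
print(mapped.tolist())  # ['Wolves', 'Arsenal']
```

With a plain `dict`, the second entry would come back as `NaN` instead of `Arsenal`.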



In [458]:
combo['new_team'] = combo['team'].map(mapping)

In [459]:
combo

Unnamed: 0,actual,predictions,date,team,opponent,result,new_team
55,0,1,2022-01-23,Arsenal,Burnley,D,Arsenal
56,1,1,2022-02-10,Arsenal,Wolves,W,Arsenal
57,1,0,2022-02-19,Arsenal,Brentford,W,Arsenal
58,1,1,2022-02-24,Arsenal,Wolves,W,Arsenal
59,1,1,2022-03-06,Arsenal,Watford,W,Arsenal
...,...,...,...,...,...,...,...
1443,0,0,2022-04-30,Wolverhampton Wanderers,Brighton,L,Wolves
1444,0,0,2022-05-07,Wolverhampton Wanderers,Chelsea,D,Wolves
1445,0,0,2022-05-11,Wolverhampton Wanderers,Manchester City,L,Wolves
1446,0,0,2022-05-15,Wolverhampton Wanderers,Norwich City,D,Wolves


## Seeing if Predictions Were Consistent
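Each match appears twice in the data, once from each team's perspective. We pair the two rows for every match and check precision again, keeping only the matches where the two predictions agree: one side is predicted to win and the other is not.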

In [460]:
merged = combo.merge(combo, left_on=['date', "new_team"], right_on = ['date', 'opponent'])

In [461]:
merged

Unnamed: 0,actual_x,predictions_x,date,team_x,opponent_x,result_x,new_team_x,actual_y,predictions_y,team_y,opponent_y,result_y,new_team_y
0,0,1,2022-01-23,Arsenal,Burnley,D,Arsenal,0,0,Burnley,Arsenal,D,Burnley
1,1,1,2022-02-10,Arsenal,Wolves,W,Arsenal,0,0,Wolverhampton Wanderers,Arsenal,L,Wolves
2,1,0,2022-02-19,Arsenal,Brentford,W,Arsenal,0,0,Brentford,Arsenal,L,Brentford
3,1,1,2022-02-24,Arsenal,Wolves,W,Arsenal,0,0,Wolverhampton Wanderers,Arsenal,L,Wolves
4,1,1,2022-03-06,Arsenal,Watford,W,Arsenal,0,0,Watford,Arsenal,L,Watford
...,...,...,...,...,...,...,...,...,...,...,...,...,...
383,0,0,2022-04-30,Wolverhampton Wanderers,Brighton,L,Wolves,1,0,Brighton and Hove Albion,Wolves,W,Brighton
384,0,0,2022-05-07,Wolverhampton Wanderers,Chelsea,D,Wolves,0,1,Chelsea,Wolves,D,Chelsea
385,0,0,2022-05-11,Wolverhampton Wanderers,Manchester City,L,Wolves,1,1,Manchester City,Wolves,W,Manchester City
386,0,0,2022-05-15,Wolverhampton Wanderers,Norwich City,D,Wolves,0,0,Norwich City,Wolves,D,Norwich City
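The self-merge is easier to follow on a single fixture. The sketch below uses two invented rows with the same column names; each row finds its counterpart because one row's `new_team` equals the other row's `opponent` on the same date.

```python
import pandas as pd

# Hypothetical predictions for one fixture, one row per side.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2022-02-10", "2022-02-10"]),
    "new_team": ["Arsenal", "Wolves"],
    "opponent": ["Wolves", "Arsenal"],
    "predictions": [1, 0],
    "actual": [1, 0],
})

# Pair each row with the row for the same match seen from the other side.
pairs = toy.merge(toy, left_on=["date", "new_team"],
                  right_on=["date", "opponent"])

# Keep fixtures where the model picks a win for x and no win for y.
agree = pairs[(pairs["predictions_x"] == 1) & (pairs["predictions_y"] == 0)]
print(agree[["new_team_x", "new_team_y", "actual_x"]])
```

Columns that exist on both sides get the `_x` and `_y` suffixes, while `date` stays a single column because it is a join key with the same name on both sides.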


In [462]:
merged[(merged['predictions_x'] == 1) & (merged['predictions_y'] == 0)]['actual_x'].value_counts()

1    40
0    28
Name: actual_x, dtype: int64

### Able to improve precision by just a bit more!

In [463]:
round(40 / (40 + 28), 4)

0.5882

## Next Possible Steps
We already have a decent model, but it is far from perfect. Possible next steps:
- Scrape more data (more than the two seasons we have now)
- Add more predictors
- Try a different model that can still pick up non-linear patterns
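
As a rough illustration of the last point, the sketch below swaps in scikit-learn's `GradientBoostingClassifier`. The choice of model is our assumption, not something the notebook tested, and the data is synthetic, so the printed precision says nothing about real match predictions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score

# Synthetic stand-in for the predictor matrix: NOT the real match data.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4))
# Made-up non-linear rule so the model has something to learn.
y = ((X[:, 0] * X[:, 1] > 0) & (X[:, 2] > -0.5)).astype(int)

# Earlier rows train, later rows test, mimicking a time-ordered split.
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

gb = GradientBoostingClassifier(random_state=1)
gb.fit(X_train, y_train)
test_pred = gb.predict(X_test)
precision = precision_score(y_test, test_pred)
print(precision)
```

On the real data, the same `fit` and `predict` calls would take the place of `rf` inside `make_predictions`, which keeps the comparison with the random forest like for like.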