# Predicting EPL Football Match Winners Using Machine Learning

In [1]:
import pandas as pd

import numpy as np

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score

from sklearn.metrics import precision_score

In [2]:
plm = pd.read_csv("premier_league_matches.csv", index_col = 0)

In [3]:
plm.head()

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,match report,notes,sh,sot,dist,fk,pk,pkatt,season,team
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0.0,1.0,Tottenham,...,Match Report,,18.0,4.0,16.9,1.0,0.0,0.0,2022,Manchester City
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5.0,0.0,Norwich City,...,Match Report,,16.0,4.0,17.3,1.0,0.0,0.0,2022,Manchester City
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5.0,0.0,Arsenal,...,Match Report,,25.0,10.0,14.3,0.0,0.0,0.0,2022,Manchester City
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1.0,0.0,Leicester City,...,Match Report,,25.0,8.0,14.0,0.0,0.0,0.0,2022,Manchester City
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0.0,0.0,Southampton,...,Match Report,,16.0,1.0,15.7,1.0,0.0,0.0,2022,Manchester City


In [4]:
plm.shape

(1389, 27)

In [5]:
plm.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1389 entries, 1 to 42
Data columns (total 27 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   date          1389 non-null   object 
 1   time          1389 non-null   object 
 2   comp          1389 non-null   object 
 3   round         1389 non-null   object 
 4   day           1389 non-null   object 
 5   venue         1389 non-null   object 
 6   result        1389 non-null   object 
 7   gf            1389 non-null   float64
 8   ga            1389 non-null   float64
 9   opponent      1389 non-null   object 
 10  xg            1389 non-null   float64
 11  xga           1389 non-null   float64
 12  poss          1389 non-null   float64
 13  attendance    693 non-null    float64
 14  captain       1389 non-null   object 
 15  formation     1389 non-null   object 
 16  referee       1389 non-null   object 
 17  match report  1389 non-null   object 
 18  notes         0 non-null      

## Investigating Missing Data

Some of the match data is missing. It is time to determine exactly what's missing. In the English Premier League, there are 20 teams, and each team plays 38 matches. There is data for two seasons. So there should be 2 * 20 * 38 matches, or 1520.

Three teams are relegated each season to a lower league, and three are promoted. So given the relegations/promotions that happened at the end of the 2020-2021 season, there should be 6 teams with 38 matches and 17 teams with 76 matches. Of course, since the data was scraped partway through the season, this may not be true.

Steps:
1. Determine how many matches there is data for.
2. Find how many rows are missing due to the data being scraped partway through the season.
3. Figure out whether there are any teams that are missing more data than expected.

### Determine How Many Matches there is Data For

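There is data for 1,389 matches, because that is the number of rows in the dataset. Each row represents one match from one team's point of view, which is why the expected total above counts 38 rows per team per season.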

### Find How Many Rows are Missing Due to the Data Being Scraped Partway Through The Season

In total, 1,520 - 1,389 = 131 rows are missing. The per-team counts in the next step show how many of these are explained by the data being scraped partway through the season.
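The arithmetic behind these counts can be written out as a short check. This is only a sketch: the variable names are made up for illustration, and the row count of 1,389 is taken from `plm.shape` above.

```python
# Expected number of rows: one row per team per match.
teams_per_season = 20
matches_per_team = 38
seasons = 2
expected_rows = seasons * teams_per_season * matches_per_team

# Cross-check using promotion/relegation: 6 teams appear in one season only,
# 17 teams appear in both seasons.
one_season_teams = 6
two_season_teams = 17
expected_by_team = (one_season_teams * matches_per_team
                    + two_season_teams * 2 * matches_per_team)

# Row count reported by plm.shape earlier in the notebook.
actual_rows = 1389
missing_rows = expected_rows - actual_rows

print(expected_rows, expected_by_team, missing_rows)
```

Both ways of counting give the same expected total, so the 131 missing rows are not an artifact of how the expectation was computed.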

### Figure out Whether There are Any Teams that are Missing More Data Than Expected

In [6]:
plm['team'].value_counts()

Southampton                 72
Brighton and Hove Albion    72
Manchester United           72
West Ham United             72
Newcastle United            72
Burnley                     71
Leeds United                71
Crystal Palace              71
Manchester City             71
Wolverhampton Wanderers     71
Tottenham Hotspur           71
Arsenal                     71
Leicester City              70
Chelsea                     70
Aston Villa                 70
Everton                     70
Liverpool                   38
Fulham                      38
West Bromwich Albion        38
Sheffield United            38
Brentford                   34
Watford                     33
Norwich City                33
Name: team, dtype: int64

Unexpectedly, 7 teams have 38 or fewer matches, which is at most one season of data. Only 6 such teams were expected (3 relegated and 3 promoted), and Liverpool was neither relegated nor promoted last season.

In [7]:
plm[plm['team'] == 'Liverpool']

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,match report,notes,sh,sot,dist,fk,pk,pkatt,season,team
1,2020-09-12,17:30,Premier League,Matchweek 1,Sat,Home,W,4.0,3.0,Leeds United,...,Match Report,,20.0,4.0,17.0,0.0,2.0,2.0,2021,Liverpool
2,2020-09-20,16:30,Premier League,Matchweek 2,Sun,Away,W,2.0,0.0,Chelsea,...,Match Report,,17.0,5.0,17.7,1.0,0.0,0.0,2021,Liverpool
4,2020-09-28,20:00,Premier League,Matchweek 3,Mon,Home,W,3.0,1.0,Arsenal,...,Match Report,,21.0,9.0,16.8,0.0,0.0,0.0,2021,Liverpool
6,2020-10-04,19:15,Premier League,Matchweek 4,Sun,Away,L,2.0,7.0,Aston Villa,...,Match Report,,14.0,8.0,15.8,1.0,0.0,0.0,2021,Liverpool
7,2020-10-17,12:30,Premier League,Matchweek 5,Sat,Away,D,2.0,2.0,Everton,...,Match Report,,22.0,8.0,15.0,1.0,0.0,0.0,2021,Liverpool
9,2020-10-24,20:00,Premier League,Matchweek 6,Sat,Home,W,2.0,1.0,Sheffield Utd,...,Match Report,,17.0,5.0,18.2,1.0,0.0,0.0,2021,Liverpool
11,2020-10-31,17:30,Premier League,Matchweek 7,Sat,Home,W,2.0,1.0,West Ham,...,Match Report,,8.0,2.0,18.6,1.0,1.0,1.0,2021,Liverpool
13,2020-11-08,16:30,Premier League,Matchweek 8,Sun,Away,D,1.0,1.0,Manchester City,...,Match Report,,9.0,2.0,21.5,0.0,1.0,1.0,2021,Liverpool
14,2020-11-22,19:15,Premier League,Matchweek 9,Sun,Home,W,3.0,0.0,Leicester City,...,Match Report,,24.0,12.0,11.9,0.0,0.0,0.0,2021,Liverpool
16,2020-11-28,12:30,Premier League,Matchweek 10,Sat,Away,D,1.0,1.0,Brighton,...,Match Report,,6.0,2.0,20.9,0.0,0.0,0.0,2021,Liverpool


The Liverpool rows shown above all belong to the 2021 season (2020-2021), so one season of Liverpool data, apparently 2021-2022, is missing.
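One way to confirm which seasons each team has data for is to summarize the `season` column per team. The sketch below uses a small made-up DataFrame (`toy`) rather than `plm`, so the team and season values are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of the dataset: one row per team per match.
toy = pd.DataFrame({
    "team": ["Liverpool", "Liverpool", "Arsenal", "Arsenal", "Brentford"],
    "season": [2021, 2021, 2021, 2022, 2022],
})

# Number of distinct seasons, plus the first and last season, for each team.
summary = toy.groupby("team")["season"].agg(["nunique", "min", "max"])

# Teams that appear in a single season only.
single_season = summary[summary["nunique"] == 1]
print(single_season)
```

Applied to `plm`, the same `groupby` would list every team with a single season, which separates promoted and relegated teams from a team whose season is genuinely missing.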

In [8]:
plm['round'].value_counts()

Matchweek 1     39
Matchweek 16    39
Matchweek 34    39
Matchweek 32    39
Matchweek 31    39
Matchweek 29    39
Matchweek 28    39
Matchweek 26    39
Matchweek 25    39
Matchweek 24    39
Matchweek 23    39
Matchweek 2     39
Matchweek 19    39
Matchweek 17    39
Matchweek 20    39
Matchweek 15    39
Matchweek 5     39
Matchweek 3     39
Matchweek 13    39
Matchweek 12    39
Matchweek 4     39
Matchweek 11    39
Matchweek 10    39
Matchweek 9     39
Matchweek 8     39
Matchweek 14    39
Matchweek 7     39
Matchweek 6     39
Matchweek 30    37
Matchweek 27    37
Matchweek 22    37
Matchweek 21    37
Matchweek 18    37
Matchweek 33    32
Matchweek 35    20
Matchweek 36    20
Matchweek 37    20
Matchweek 38    20
Name: round, dtype: int64

Each matchweek should have 39 rows rather than 40, because Liverpool is missing for one season (20 teams in one season plus 19 in the other). Several matchweeks have fewer than 39 rows because the 2021-2022 data was scraped while the season was still ongoing. Matchweeks 35 to 38, for example, have only 20 rows each, which is a single season. This is okay: the data can still be used, and it explains where some of the missing rows went.

## Cleaning Data for Machine Learning

Next, the data will be cleaned and prepared for machine learning.

Steps:
1. Verify that all of the columns that will be used as predictors are numeric.
2. Remove any extra columns that aren't informative, or will not be used. This will make it easier to work with the `plm` DataFrame later.
3. Ensure that the date column is stored as the correct datatype. The date will be needed to create predictors later.

### Verify that All of the Columns that will be Used as Predictors are Numeric

In [9]:
plm.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1389 entries, 1 to 42
Data columns (total 27 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   date          1389 non-null   object 
 1   time          1389 non-null   object 
 2   comp          1389 non-null   object 
 3   round         1389 non-null   object 
 4   day           1389 non-null   object 
 5   venue         1389 non-null   object 
 6   result        1389 non-null   object 
 7   gf            1389 non-null   float64
 8   ga            1389 non-null   float64
 9   opponent      1389 non-null   object 
 10  xg            1389 non-null   float64
 11  xga           1389 non-null   float64
 12  poss          1389 non-null   float64
 13  attendance    693 non-null    float64
 14  captain       1389 non-null   object 
 15  formation     1389 non-null   object 
 16  referee       1389 non-null   object 
 17  match report  1389 non-null   object 
 18  notes         0 non-null      

For now, the `object` columns will be left as they are (except for `date`). Numeric predictors will be created from them later.

### Removing any Extra Columns that Aren't Necessary

`notes` and `match report` look suspicious in the preview of the dataset: they may carry no information. These two columns need to be investigated.

#### Investigating `notes`

In [10]:
plm['notes'].unique()

array([nan])

In [11]:
del plm['notes']

It is clear that `notes` is useless because it is full of null values.

#### Investigating `match report`

In [12]:
plm['match report'].unique()

array(['Match Report'], dtype=object)

In [13]:
del plm['match report']

It is clear that `match report` is useless because every value is `'Match Report'`.
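Both columns were dropped for the same reason: they hold a single value (or nothing but nulls). As a sketch, such columns can be found in one pass with `nunique(dropna = False)`. The DataFrame `toy` below is made up for illustration and is not the notebook's `plm`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one all-null column, one constant column, one useful column.
toy = pd.DataFrame({
    "notes": [np.nan, np.nan, np.nan],
    "match report": ["Match Report"] * 3,
    "gf": [0.0, 5.0, 1.0],
})

# Count distinct values per column, treating NaN as a value of its own.
n_unique = toy.nunique(dropna=False)

# A column with at most one distinct value cannot help a model.
constant_cols = n_unique[n_unique <= 1].index.tolist()
cleaned = toy.drop(columns=constant_cols)
print(constant_cols)
```

This finds the same two columns that were inspected by hand with `unique()`, and it would also catch any other constant column, such as `comp` if every row is `'Premier League'`.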

### Storing the Date Column as the Correct Datatype (Datetime)

In [14]:
plm['date'] = pd.to_datetime(plm['date'])

In [15]:
plm.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1389 entries, 1 to 42
Data columns (total 25 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   date        1389 non-null   datetime64[ns]
 1   time        1389 non-null   object        
 2   comp        1389 non-null   object        
 3   round       1389 non-null   object        
 4   day         1389 non-null   object        
 5   venue       1389 non-null   object        
 6   result      1389 non-null   object        
 7   gf          1389 non-null   float64       
 8   ga          1389 non-null   float64       
 9   opponent    1389 non-null   object        
 10  xg          1389 non-null   float64       
 11  xga         1389 non-null   float64       
 12  poss        1389 non-null   float64       
 13  attendance  693 non-null    float64       
 14  captain     1389 non-null   object        
 15  formation   1389 non-null   object        
 16  referee     1389 non-null 

## Creating Predictors for Machine Learning

Now that the data is cleaned for the time being, the predictors and the target can be set up. The target is the result of the match for the team: a win is coded as 2, a draw as 1, and a loss as 0. This target will then be predicted using the predictor columns.

The initial predictors will be a set of codes that correspond to the venue, the opponent, the hour, and the day.

Steps:
1. Create a target column that is 0 when the team lost, 2 when they won the match, and 1 when they had a draw.
2. Create the Initial Predictors:
    * Turn the `venue` and `opponent` columns into numeric columns.
    * Add in numeric columns indicating the hour when the match took place and the day it took place.

### Create a Target Column

In [16]:
plm['target'] = plm['result']

In [17]:
plm['target'] = np.where(plm['result'] == 'L', 0, plm['target'])

In [18]:
plm['target'] = np.where(plm['result'] == 'W', 2, plm['target'])

In [19]:
plm['target'] = np.where(plm['result'] == 'D', 1, plm['target'])

In [20]:
plm['target'] = plm['target'].astype(int)
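The target cells above recode `result` one value at a time with `np.where`. An equivalent and more compact option is a dictionary lookup with `Series.map`. The sketch below runs on a made-up Series (`results`), not on `plm`; the coding itself (L = 0, D = 1, W = 2) is the one described in this section.

```python
import pandas as pd

# Hypothetical sample of the `result` column.
results = pd.Series(["L", "W", "D", "W"])

# Same coding as in the notebook: loss = 0, draw = 1, win = 2.
codes = {"L": 0, "D": 1, "W": 2}
target = results.map(codes)

# map() yields NaN for any value missing from the dictionary,
# so check for that before casting to int.
if target.isna().any():
    raise ValueError("unexpected value in result column")
target = target.astype(int)
print(target.tolist())
```

The explicit check matters because an unexpected result code would otherwise only surface as a less obvious error at the `astype(int)` step.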

In [21]:
plm

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,referee,sh,sot,dist,fk,pk,pkatt,season,team,target
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0.0,1.0,Tottenham,...,Anthony Taylor,18.0,4.0,16.9,1.0,0.0,0.0,2022,Manchester City,0
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5.0,0.0,Norwich City,...,Graham Scott,16.0,4.0,17.3,1.0,0.0,0.0,2022,Manchester City,2
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5.0,0.0,Arsenal,...,Martin Atkinson,25.0,10.0,14.3,0.0,0.0,0.0,2022,Manchester City,2
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1.0,0.0,Leicester City,...,Paul Tierney,25.0,8.0,14.0,0.0,0.0,0.0,2022,Manchester City,2
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0.0,0.0,Southampton,...,Jonathan Moss,16.0,1.0,15.7,1.0,0.0,0.0,2022,Manchester City,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38,2021-05-02,19:15,Premier League,Matchweek 34,Sun,Away,L,0.0,4.0,Tottenham,...,Andre Marriner,8.0,1.0,17.4,0.0,0.0,0.0,2021,Sheffield United,0
39,2021-05-08,15:00,Premier League,Matchweek 35,Sat,Home,L,0.0,2.0,Crystal Palace,...,Simon Hooper,7.0,0.0,11.4,1.0,0.0,0.0,2021,Sheffield United,0
40,2021-05-16,19:00,Premier League,Matchweek 36,Sun,Away,W,1.0,0.0,Everton,...,Jonathan Moss,10.0,3.0,17.0,0.0,0.0,0.0,2021,Sheffield United,2
41,2021-05-19,18:00,Premier League,Matchweek 37,Wed,Away,L,0.0,1.0,Newcastle Utd,...,Robert Jones,11.0,1.0,16.0,1.0,0.0,0.0,2021,Sheffield United,0


### Create Initial Predictors

#### Turn the `venue` and `opponent` into Numeric Columns

##### Turn `venue` Into a Numeric Column

In [22]:
plm['venue_int'] = plm['venue'].astype('category').cat.codes

In [23]:
plm

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,sh,sot,dist,fk,pk,pkatt,season,team,target,venue_int
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0.0,1.0,Tottenham,...,18.0,4.0,16.9,1.0,0.0,0.0,2022,Manchester City,0,0
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5.0,0.0,Norwich City,...,16.0,4.0,17.3,1.0,0.0,0.0,2022,Manchester City,2,1
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5.0,0.0,Arsenal,...,25.0,10.0,14.3,0.0,0.0,0.0,2022,Manchester City,2,1
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1.0,0.0,Leicester City,...,25.0,8.0,14.0,0.0,0.0,0.0,2022,Manchester City,2,0
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0.0,0.0,Southampton,...,16.0,1.0,15.7,1.0,0.0,0.0,2022,Manchester City,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38,2021-05-02,19:15,Premier League,Matchweek 34,Sun,Away,L,0.0,4.0,Tottenham,...,8.0,1.0,17.4,0.0,0.0,0.0,2021,Sheffield United,0,0
39,2021-05-08,15:00,Premier League,Matchweek 35,Sat,Home,L,0.0,2.0,Crystal Palace,...,7.0,0.0,11.4,1.0,0.0,0.0,2021,Sheffield United,0,1
40,2021-05-16,19:00,Premier League,Matchweek 36,Sun,Away,W,1.0,0.0,Everton,...,10.0,3.0,17.0,0.0,0.0,0.0,2021,Sheffield United,2,0
41,2021-05-19,18:00,Premier League,Matchweek 37,Wed,Away,L,0.0,1.0,Newcastle Utd,...,11.0,1.0,16.0,1.0,0.0,0.0,2021,Sheffield United,0,0


##### Turn `opponent` into a Numeric Column

In [24]:
plm['opp_int'] = plm['opponent'].astype('category').cat.codes
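Category codes are assigned in sorted order of the labels, which is why `Away` becomes 0 and `Home` becomes 1 in the output above. The sketch below shows how to recover the code-to-label mapping. It uses a small made-up Series rather than `plm`.

```python
import pandas as pd

# Hypothetical sample of the `venue` column.
venue = pd.Series(["Away", "Home", "Home", "Away"])

# Convert once to a categorical so codes and labels stay in sync.
venue_cat = venue.astype("category")
venue_int = venue_cat.cat.codes

# Position i in .cat.categories is the label for code i.
code_to_label = dict(enumerate(venue_cat.cat.categories))
print(venue_int.tolist(), code_to_label)
```

The same lookup works for `opponent`. Note that the integer codes carry no real ordering, and that `opponent` uses shorter names than `team` in this dataset (for example `Tottenham` versus `Tottenham Hotspur`), so the two columns would not share codes without first harmonizing the names.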

In [25]:
plm

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,sot,dist,fk,pk,pkatt,season,team,target,venue_int,opp_int
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0.0,1.0,Tottenham,...,4.0,16.9,1.0,0.0,0.0,2022,Manchester City,0,0,18
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5.0,0.0,Norwich City,...,4.0,17.3,1.0,0.0,0.0,2022,Manchester City,2,1,15
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5.0,0.0,Arsenal,...,10.0,14.3,0.0,0.0,0.0,2022,Manchester City,2,1,0
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1.0,0.0,Leicester City,...,8.0,14.0,0.0,0.0,0.0,2022,Manchester City,2,0,10
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0.0,0.0,Southampton,...,1.0,15.7,1.0,0.0,0.0,2022,Manchester City,1,1,17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38,2021-05-02,19:15,Premier League,Matchweek 34,Sun,Away,L,0.0,4.0,Tottenham,...,1.0,17.4,0.0,0.0,0.0,2021,Sheffield United,0,0,18
39,2021-05-08,15:00,Premier League,Matchweek 35,Sat,Home,L,0.0,2.0,Crystal Palace,...,0.0,11.4,1.0,0.0,0.0,2021,Sheffield United,0,1,6
40,2021-05-16,19:00,Premier League,Matchweek 36,Sun,Away,W,1.0,0.0,Everton,...,3.0,17.0,0.0,0.0,0.0,2021,Sheffield United,2,0,7
41,2021-05-19,18:00,Premier League,Matchweek 37,Wed,Away,L,0.0,1.0,Newcastle Utd,...,1.0,16.0,1.0,0.0,0.0,2021,Sheffield United,0,0,14


### Add in Numeric Columns Indicating the Hour when the Match Took Place and the Day it Took Place

#### Add in the Numeric Column Indicating the Hour when the Match Took Place

In [26]:
plm['hour'] = plm['time'].str.replace(":.+", "", regex = True).astype('int')

In [27]:
plm

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,dist,fk,pk,pkatt,season,team,target,venue_int,opp_int,hour
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0.0,1.0,Tottenham,...,16.9,1.0,0.0,0.0,2022,Manchester City,0,0,18,16
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5.0,0.0,Norwich City,...,17.3,1.0,0.0,0.0,2022,Manchester City,2,1,15,15
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5.0,0.0,Arsenal,...,14.3,0.0,0.0,0.0,2022,Manchester City,2,1,0,12
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1.0,0.0,Leicester City,...,14.0,0.0,0.0,0.0,2022,Manchester City,2,0,10,15
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0.0,0.0,Southampton,...,15.7,1.0,0.0,0.0,2022,Manchester City,1,1,17,15
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38,2021-05-02,19:15,Premier League,Matchweek 34,Sun,Away,L,0.0,4.0,Tottenham,...,17.4,0.0,0.0,0.0,2021,Sheffield United,0,0,18,19
39,2021-05-08,15:00,Premier League,Matchweek 35,Sat,Home,L,0.0,2.0,Crystal Palace,...,11.4,1.0,0.0,0.0,2021,Sheffield United,0,1,6,15
40,2021-05-16,19:00,Premier League,Matchweek 36,Sun,Away,W,1.0,0.0,Everton,...,17.0,0.0,0.0,0.0,2021,Sheffield United,2,0,7,19
41,2021-05-19,18:00,Premier League,Matchweek 37,Wed,Away,L,0.0,1.0,Newcastle Utd,...,16.0,1.0,0.0,0.0,2021,Sheffield United,0,0,14,18


#### Add in the Numeric Column Indicating the Day when the Match Took Place

In [28]:
plm['day_int'] = plm['date'].dt.dayofweek
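As a quick sanity check of the two time-based predictors, the sketch below rebuilds them on three dates and kickoff times copied from the tables above. `dt.dayofweek` counts from Monday = 0 to Sunday = 6, and splitting on `":"` is used here as an alternative to the regular expression. The DataFrame `toy` is made up for illustration.

```python
import pandas as pd

# Three date/time pairs taken from rows shown earlier in the notebook.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-08-15", "2021-08-21", "2021-05-19"]),
    "time": ["16:30", "15:00", "18:00"],
})

# Hour of kickoff: the part of the time string before the colon.
toy["hour"] = toy["time"].str.split(":").str[0].astype(int)

# Day of week as an integer: Monday = 0, ..., Sunday = 6.
toy["day_int"] = toy["date"].dt.dayofweek
print(toy)
```

The results agree with the `day` column in the data: the Sunday match gets 6, the Saturday match 5, and the Wednesday match 2.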

In [29]:
plm

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,fk,pk,pkatt,season,team,target,venue_int,opp_int,hour,day_int
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0.0,1.0,Tottenham,...,1.0,0.0,0.0,2022,Manchester City,0,0,18,16,6
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5.0,0.0,Norwich City,...,1.0,0.0,0.0,2022,Manchester City,2,1,15,15,5
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5.0,0.0,Arsenal,...,0.0,0.0,0.0,2022,Manchester City,2,1,0,12,5
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1.0,0.0,Leicester City,...,0.0,0.0,0.0,2022,Manchester City,2,0,10,15,5
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0.0,0.0,Southampton,...,1.0,0.0,0.0,2022,Manchester City,1,1,17,15,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38,2021-05-02,19:15,Premier League,Matchweek 34,Sun,Away,L,0.0,4.0,Tottenham,...,0.0,0.0,0.0,2021,Sheffield United,0,0,18,19,6
39,2021-05-08,15:00,Premier League,Matchweek 35,Sat,Home,L,0.0,2.0,Crystal Palace,...,1.0,0.0,0.0,2021,Sheffield United,0,1,6,15,5
40,2021-05-16,19:00,Premier League,Matchweek 36,Sun,Away,W,1.0,0.0,Everton,...,0.0,0.0,0.0,2021,Sheffield United,2,0,7,19,6
41,2021-05-19,18:00,Premier League,Matchweek 37,Wed,Away,L,0.0,1.0,Newcastle Utd,...,1.0,0.0,0.0,2021,Sheffield United,0,0,14,18,2


## Training an Initial Machine Learning Model

Now that there is a target and there are predictors, it is time to train the initial model. A random forest classifier will be used to make the initial predictions and measure the accuracy of the predictions.

The data will first have to be split into training and test sets. The training set is what the model will be trained with, and the test set will measure the accuracy of the model's predictions.

### Initialize a Random Forest Classifier

In [30]:
rf = RandomForestClassifier(n_estimators = 40, min_samples_split = 10, random_state = 1)

### Split Up the Training and Test Data (Based on Whether the Match was Played in 2022)

In [31]:
train = plm[plm['date'] < '2022-01-01']

In [32]:
test = plm[plm['date'] >= '2022-01-01']

### Train the Model on the Training Data, and Make Predictions on the Test Data

#### Train the Model on the Training Data

In [33]:
predictors = ['venue_int', 'opp_int', 'hour', 'day_int']

In [34]:
rf.fit(train[predictors], train['target'])

RandomForestClassifier(min_samples_split=10, n_estimators=40, random_state=1)

#### Make Predictions on the Test Data

In [35]:
preds = rf.predict(test[predictors])

### Measure the Accuracy and Precision of the Predictions

In [36]:
accuracy_score(test['target'], preds)

0.45390070921985815

In [37]:
combined = pd.DataFrame(dict(actual = test['target'], prediction = preds))

In [38]:
pd.crosstab(index = combined['actual'], columns = combined['prediction'])

actual \ prediction,0,1,2
0,65,30,24
1,21,10,25
2,37,17,53


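The next cell raises an error: `precision_score` defaults to `average='binary'`, but `target` has three classes (0, 1, and 2). An explicit `average` argument, such as `average='weighted'` or `average=None`, is needed for a multiclass target.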

In [39]:
precision_score(test['target'], preds)

ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
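One way to get a precision figure for a three-class target is to pass `average` explicitly. The sketch below does not use `test` or `preds`. Instead, it rebuilds equivalent label arrays from the confusion matrix printed above, so the names `cm`, `y_true`, and `y_pred` are introduced here only for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score

# Confusion matrix from the crosstab above: rows = actual, columns = prediction.
cm = np.array([[65, 30, 24],
               [21, 10, 25],
               [37, 17, 53]])

# Expand each cell into that many (actual, predicted) label pairs.
actual_labels = np.repeat([0, 1, 2], 3)   # 0,0,0,1,1,1,2,2,2
predicted_labels = np.tile([0, 1, 2], 3)  # 0,1,2,0,1,2,0,1,2
y_true = np.repeat(actual_labels, cm.ravel())
y_pred = np.repeat(predicted_labels, cm.ravel())

accuracy = accuracy_score(y_true, y_pred)

# average=None returns one precision value per class (loss, draw, win).
per_class = precision_score(y_true, y_pred, average=None)

# average='weighted' averages the per-class values, weighted by class support.
weighted = precision_score(y_true, y_pred, average="weighted")
print(accuracy, per_class, weighted)
```

The per-class values are the diagonal of the matrix divided by the column totals: 65/123 for losses, 10/57 for draws, and 53/102 for wins. Draws are predicted far less reliably than the other two outcomes, which a single accuracy number hides.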

The model did not do well as its accuracy is around 45%.

## Improving the Model with Rolling Averages

The next step is to improve the accuracy of the model with rolling averages. A rolling average is the average of a team's stats over its last few matches (the number of matches, or window size, has to be chosen). These rolling averages will give the model information about what happened in the matches prior to the current one.

To compute these rolling averages, the data needs to be grouped by teams. Grouping by team will ensure that there will be rolling averages for matches by that team only. The dates also need to be sorted so that the rolling averages are in the right order.

Be careful not to include the current row in the rolling average. The current row contains stats for the match that is being predicted. In the real world, when the outcome of a future match is being predicted, nobody knows how many goals the team will score in that match (since it hasn't been played yet!).

Steps:
1. Group the data by team.
2. Within each group, compute rolling averages for informative columns.
3. Add the columns back to the original DataFrame.
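The three steps can be tried out on a tiny example before running them on the full data. The sketch below uses a made-up DataFrame (`toy`) with two teams and a single stat, and a hypothetical helper named `add_rolling`. It assumes a pandas version that accepts `closed` for fixed-size windows (1.2 or later).

```python
import pandas as pd

# Hypothetical data: team A has five matches, team B has four.
toy = pd.DataFrame({
    "team": ["A"] * 5 + ["B"] * 4,
    "date": pd.to_datetime([
        "2021-01-01", "2021-01-08", "2021-01-15", "2021-01-22", "2021-01-29",
        "2021-01-02", "2021-01-09", "2021-01-16", "2021-01-23",
    ]),
    "gf": [1.0, 2.0, 3.0, 4.0, 5.0, 0.0, 0.0, 3.0, 6.0],
})

def add_rolling(group, cols, new_cols, window=3):
    # Sort by date so the window only ever looks at earlier matches.
    group = group.sort_values("date").copy()
    # closed="left" leaves the current row out of its own average.
    group[new_cols] = group[cols].rolling(window, closed="left").mean()
    # Rows without `window` earlier matches have NaN averages; drop them.
    return group.dropna(subset=new_cols)

cols = ["gf"]
new_cols = ["gf_rolling"]

# Steps 1-3: group by team, compute per group, stitch the groups back together.
parts = [add_rolling(g, cols, new_cols) for _, g in toy.groupby("team")]
toy_rolling = pd.concat(parts, ignore_index=True)
print(toy_rolling)
```

For team A's fourth match the rolling value is the mean of the first three matches (1, 2, 3), not of matches two to four, which confirms that the current match is excluded. Each team loses its first three rows because there is no complete window for them.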

### Group the Data by Team

In [41]:
grouped_plm = plm.groupby('team')

In [48]:
group = grouped_plm.get_group('Manchester City')

In [49]:
group

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,fk,pk,pkatt,season,team,target,venue_int,opp_int,hour,day_int
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0.0,1.0,Tottenham,...,1.0,0.0,0.0,2022,Manchester City,0,0,18,16,6
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5.0,0.0,Norwich City,...,1.0,0.0,0.0,2022,Manchester City,2,1,15,15,5
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5.0,0.0,Arsenal,...,0.0,0.0,0.0,2022,Manchester City,2,1,0,12,5
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1.0,0.0,Leicester City,...,0.0,0.0,0.0,2022,Manchester City,2,0,10,15,5
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0.0,0.0,Southampton,...,1.0,0.0,0.0,2022,Manchester City,1,1,17,15,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54,2021-05-01,12:30,Premier League,Matchweek 34,Sat,Away,W,2.0,0.0,Crystal Palace,...,1.0,0.0,0.0,2021,Manchester City,2,0,6,12,5
56,2021-05-08,17:30,Premier League,Matchweek 35,Sat,Home,L,1.0,2.0,Chelsea,...,0.0,0.0,1.0,2021,Manchester City,0,1,5,17,5
57,2021-05-14,20:00,Premier League,Matchweek 36,Fri,Away,W,4.0,3.0,Newcastle Utd,...,1.0,0.0,0.0,2021,Manchester City,2,0,14,20,4
58,2021-05-18,19:00,Premier League,Matchweek 37,Tue,Away,L,2.0,3.0,Brighton,...,1.0,0.0,0.0,2021,Manchester City,0,0,3,19,1


### Within Each Group, Compute Rolling Averages

In [51]:
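def rolling_averages(group, cols, new_cols):
    # Sort chronologically so each window only looks backwards in time.
    group = group.sort_values('date')
    # closed = "left" excludes the current match from its own 3-match average.
    rolling_stats = group[cols].rolling(3, closed = "left").mean()
    # Store the averages under the new column names before dropping rows.
    group[new_cols] = rolling_stats
    group = group.dropna(subset = new_cols)
    return group

In [45]:
cols = ['gf', 'ga', 'sh', 'sot', 'dist', 'fk', 'pk', 'pkatt']
new_cols = [f'{c}_rolling' for c in cols]

In [46]:
new_cols

['gf_rolling',
 'ga_rolling',
 'sh_rolling',
 'sot_rolling',
 'dist_rolling',
 'fk_rolling',
 'pk_rolling',
 'pkatt_rolling']

In [52]:
rolling_averages(group, cols, new_cols)

The assignment `group[new_cols] = rolling_stats` is essential. Without it, `dropna(subset = new_cols)` raises a `KeyError` listing all eight names in `new_cols`, because none of those columns exist yet. With the assignment in place, the call returns the Manchester City matches with the eight `*_rolling` columns added. At least the first three matches are dropped, since they do not have three earlier matches to average.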