# Premier League 2021/22 Predictions

This model uses data from https://www.football-data.co.uk/englandm.php and https://www.kaggle.com/quadeer15sh/premier-league-standings-11-seasons-20102021, as well as https://en.wikipedia.org/wiki/2020%E2%80%9321_EFL_Championship

Current-season league data was taken from https://footystats.org/england/premier-league on 2021/11/10.

## Importing data
We load match data for the past five completed seasons (2016/17 to 2020/21), the current 2021/22 season, and the historical standings table.

In [205]:
import pandas as pd

csv_names = ["16_17", "17_18", "18_19", "19_20", "20_21", "21_22", "EPL Standings 2010-2021"]
CSVs = {}
path = "./Data/"

# Load every CSV into a dict keyed by its file name (without the extension)
for name in csv_names:
    CSVs[name] = pd.read_csv(path + name + ".csv")

CSVs[csv_names[0]].head()

Unnamed: 0,Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,...,BbAv<2.5,BbAH,BbAHh,BbMxAHH,BbAvAHH,BbMxAHA,BbAvAHA,PSCH,PSCD,PSCA
0,E0,13/08/16,Burnley,Swansea,0,1,A,0,0,D,...,1.61,32,-0.25,2.13,2.06,1.86,1.81,2.79,3.16,2.89
1,E0,13/08/16,Crystal Palace,West Brom,0,1,A,0,0,D,...,1.52,33,-0.5,2.07,2.0,1.9,1.85,2.25,3.15,3.86
2,E0,13/08/16,Everton,Tottenham,1,1,D,1,0,H,...,1.77,32,0.25,1.91,1.85,2.09,2.0,3.64,3.54,2.16
3,E0,13/08/16,Hull,Leicester,2,1,H,1,0,H,...,1.67,31,0.25,2.35,2.26,2.03,1.67,4.68,3.5,1.92
4,E0,13/08/16,Man City,Sunderland,2,1,H,1,0,H,...,2.48,34,-1.5,1.81,1.73,2.2,2.14,1.25,6.5,14.5
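
The match-level files hold one row per game, while the regression below works from a standings table. As a minimal sketch of how the two relate, the block below derives W, D, L, GF and GA from match rows. It assumes only the `HomeTeam`, `AwayTeam`, `FTHG` and `FTAG` columns shown above; the helper name `standings_from_matches` is hypothetical and is not part of the original notebook.

```python
import pandas as pd

# The five matches from the preview above (home/away teams and full-time goals)
matches = pd.DataFrame({
    "HomeTeam": ["Burnley", "Crystal Palace", "Everton", "Hull", "Man City"],
    "AwayTeam": ["Swansea", "West Brom", "Tottenham", "Leicester", "Sunderland"],
    "FTHG": [0, 0, 1, 2, 2],
    "FTAG": [1, 1, 1, 1, 1],
})


def standings_from_matches(matches: pd.DataFrame) -> pd.DataFrame:
    """Aggregate match rows into per-team W, D, L, GF, GA (hypothetical helper)."""
    # View every match once from the home side and once from the away side
    home = pd.DataFrame({"Team": matches["HomeTeam"],
                         "GF": matches["FTHG"], "GA": matches["FTAG"]})
    away = pd.DataFrame({"Team": matches["AwayTeam"],
                         "GF": matches["FTAG"], "GA": matches["FTHG"]})
    rows = pd.concat([home, away], ignore_index=True)

    # Exactly one of W, D, L is 1 for each team in each match
    rows["W"] = (rows["GF"] > rows["GA"]).astype(int)
    rows["D"] = (rows["GF"] == rows["GA"]).astype(int)
    rows["L"] = (rows["GF"] < rows["GA"]).astype(int)

    return rows.groupby("Team")[["W", "D", "L", "GF", "GA"]].sum()


table = standings_from_matches(matches)
print(table)
```

Running the same helper on a full season file would give one row per team with season totals, assuming the file uses the same column names.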


## Linear regression

For linear regression, we will use a team's season statistics to predict its final league position

The features we want are:
- Games won last season
- Games drawn last season
- Games lost last season
- Promoted last season
- Goals for last season
- Goals against last season


First, we isolate the data from last season

In [206]:
csv_name = csv_names[-1]
csv = CSVs[csv_name]

data_last_season = csv.loc[csv['Season'] == "2020-21"]
data_last_season.head()

Unnamed: 0,Season,Pos,Team,Pld,W,D,L,GF,GA,GD,Pts,Qualification or relegation
200,2020-21,1,Manchester City,38,27,5,6,83,32,51,86,Qualification for the Champions League group s...
201,2020-21,2,Manchester United,38,21,11,6,73,44,29,74,Qualification for the Champions League group s...
202,2020-21,3,Liverpool,38,20,9,9,68,42,26,69,Qualification for the Champions League group s...
203,2020-21,4,Chelsea,38,19,10,9,58,36,22,67,Qualification for the Champions League group s...
204,2020-21,5,Leicester City,38,20,6,12,68,50,18,66,Qualification for the Europa League group stag...


Next, we set the team name as the index and drop the columns we don't want

In [207]:
data_reduced = data_last_season.set_index("Team")

# `columns=` already names the axis, so no separate `axis` argument is needed
data_reduced = data_reduced.drop(columns=["Season", "Pld", "GD", "Pts", "Qualification or relegation"])

data_reduced.head()

Team,Pos,W,D,L,GF,GA
Manchester City,1,27,5,6,83,32
Manchester United,2,21,11,6,73,44
Liverpool,3,20,9,9,68,42
Chelsea,4,19,10,9,58,36
Leicester City,5,20,6,12,68,50


Next we account for promotion and relegation: the promoted teams are added and the relegated teams are dropped. Promoted teams keep their stats from the Championship, scaled by the number of games played (a Championship season has 46 games, a Premier League season 38). Championship results are not directly comparable to Premier League results, so we also add a "Promoted" feature that lets the model correct for this

In [218]:
data_promoted = data_reduced.assign(Promoted=False)
# Copy so that flagging the promoted teams does not also modify data_promoted
data_complete = data_promoted.copy()
data_complete.loc[["Leeds United", "West Bromwich Albion", "Fulham"], "Promoted"] = True
data_complete

Team,Pos,W,D,L,GF,GA,Promoted
Manchester City,1,27,5,6,83,32,False
Manchester United,2,21,11,6,73,44,False
Liverpool,3,20,9,9,68,42,False
Chelsea,4,19,10,9,58,36,False
Leicester City,5,20,6,12,68,50,False
West Ham United,6,19,8,11,62,47,False
Tottenham Hotspur,7,18,8,12,68,45,False
Arsenal,8,18,7,13,55,39,False
Leeds United,9,18,5,15,62,54,True
Everton,10,17,8,13,47,48,False
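
The cell above only sets the `Promoted` flag. As a rough sketch of the scaling step described in the text, the block below rescales a 46-game Championship record to a 38-game equivalent. The helper name `scale_to_premier_league` and the "Example FC" row are hypothetical: the numbers are illustrative placeholders, not real Championship results.

```python
import pandas as pd

CHAMPIONSHIP_GAMES = 46
PREMIER_LEAGUE_GAMES = 38


def scale_to_premier_league(stats: pd.DataFrame,
                            cols=("W", "D", "L", "GF", "GA")) -> pd.DataFrame:
    """Rescale per-season totals from 46 games to 38 (hypothetical helper)."""
    factor = PREMIER_LEAGUE_GAMES / CHAMPIONSHIP_GAMES
    scaled = stats.copy()
    scaled[list(cols)] = stats[list(cols)] * factor
    return scaled


# Placeholder record for an imaginary promoted team (W + D + L = 46)
example = pd.DataFrame(
    {"W": [23], "D": [13], "L": [10], "GF": [69], "GA": [46]},
    index=pd.Index(["Example FC"], name="Team"),
)

scaled_example = scale_to_premier_league(example)
print(scaled_example.round(2))
```

After scaling, W + D + L adds up to 38, so the promoted team's totals sit on the same scale as the Premier League rows. The scaling says nothing about the difference in league strength, which is what the `Promoted` flag is meant to absorb.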


Next we min-max normalize the numerical columns to the range 0 to 1, and separate the target (`Pos`) from the features

In [219]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
numerical_cols = ["W", "D", "L", "GF", "GA"]
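# Work on a copy so that data_complete keeps the raw (unscaled) values
data_norm = data_complete.copy()
# Min-max scaling: map each column onto the range 0 to 1
data_norm[numerical_cols] = data_complete[numerical_cols].apply(lambda x: (x - x.min()) / (x.max() - x.min()))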

y_column = "Pos"
y = data_norm[y_column].to_frame()
X = data_norm.loc[:, data_norm.columns != y_column]
X

Team,W,D,L,GF,GA,Promoted
Manchester City,1.0,0.25,0.0,1.0,0.0,False
Manchester United,0.727273,0.75,0.0,0.84127,0.272727,False
Liverpool,0.681818,0.583333,0.130435,0.761905,0.227273,False
Chelsea,0.636364,0.666667,0.130435,0.603175,0.090909,False
Leicester City,0.681818,0.333333,0.26087,0.761905,0.409091,False
West Ham United,0.636364,0.5,0.217391,0.666667,0.340909,False
Tottenham Hotspur,0.590909,0.5,0.26087,0.761905,0.295455,False
Arsenal,0.590909,0.416667,0.304348,0.555556,0.159091,False
Leeds United,0.590909,0.25,0.391304,0.666667,0.5,True
Everton,0.545455,0.5,0.304348,0.428571,0.363636,False
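
To make the normalization step concrete, here is a small sketch of the same min-max formula applied to the W column of the top five teams. The helper name `min_max` is hypothetical, and the guard for a constant column is an addition that the notebook's lambda does not have. The scaled values differ from the table above because the minimum and maximum here are taken over five teams rather than all twenty.

```python
import pandas as pd


def min_max(column: pd.Series) -> pd.Series:
    """Scale a column to the range 0 to 1 (hypothetical helper)."""
    span = column.max() - column.min()
    if span == 0:
        # A constant column would otherwise divide by zero; map it to 0.0
        return pd.Series(0.0, index=column.index, name=column.name)
    return (column - column.min()) / span


# Wins for the top five teams of 2020-21, taken from the standings above
wins = pd.Series(
    [27, 21, 20, 19, 20],
    index=["Manchester City", "Manchester United", "Liverpool",
           "Chelsea", "Leicester City"],
    name="W",
)

scaled_wins = min_max(wins)
print(scaled_wins)
```

The team with the most wins maps to 1 and the team with the fewest maps to 0, so each column ends up on a comparable scale regardless of its original units.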


Finally, we are ready to perform our regression

In [220]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as seabornInstance
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Hold out 20% of the 20 teams (4 teams) for testing; random_state fixes the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

print(regressor.intercept_)

print(regressor.coef_)

[15.13867771]
[[-14.54724789  -2.37738511  15.15513369  -0.99589237  -4.23592226
   -2.05192714]]
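
The printed coefficients are easier to read once they are paired with feature names. The sketch below does that with the values printed above, relying on the fact that scikit-learn returns coefficients in the column order of `X` (W, D, L, GF, GA, Promoted). It then reproduces one prediction by hand from Manchester United's normalized features in the earlier table. The variable names are illustrative only.

```python
import pandas as pd

feature_names = ["W", "D", "L", "GF", "GA", "Promoted"]

# Intercept and coefficients copied from the output above
intercept = 15.13867771
coefficients = pd.Series(
    [-14.54724789, -2.37738511, 15.15513369,
     -0.99589237, -4.23592226, -2.05192714],
    index=feature_names,
)

# Manchester United's normalized features (Promoted=False is encoded as 0)
man_utd = pd.Series(
    [0.727273, 0.75, 0.0, 0.84127, 0.272727, 0.0],
    index=feature_names,
)

# A linear model's prediction is the intercept plus the weighted sum of features
predicted_position = intercept + (coefficients * man_utd).sum()

print(coefficients.sort_values())
print(round(predicted_position, 2))
```

A negative coefficient pushes the predicted position towards the top of the table (lower numbers), so wins pull a team up and losses push it down, as expected.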


Let's compare the predictions with the actual positions of the four held-out teams

In [221]:
y_pred = regressor.predict(X_test)
results = pd.DataFrame({'Actual': y_test.values.flatten(), 'Predicted': y_pred.flatten()}, index=y_test.index)
results

Team,Actual,Predicted
West Bromwich Albion,19,17.373374
Manchester United,2,0.782756
Sheffield United,20,25.986935
Leeds United,9,7.044684


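Not bad! The four held-out teams come out in the correct order, although the raw predictions are not confined to the 1 to 20 range (Sheffield United is predicted at about 26). Now let's see what our model predicts based on this season's form so far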

In [226]:
test_data_this_season = csv.loc[csv['Season'] == "2021-22"]
test_reduced = test_data_this_season.set_index("Team").drop(columns=["Season", "Pld", "GD", "Pts", "Qualification or relegation"])
test_promoted = test_reduced.assign(Promoted=False)
# Copy so that flagging the promoted teams does not also modify test_promoted
test_complete = test_promoted.copy()
test_complete.loc[["Brentford", "Watford", "Norwich City"], "Promoted"] = True

# Normalize on a copy, using this season's own min and max for each column
test_norm = test_complete.copy()
test_norm[numerical_cols] = test_complete[numerical_cols].apply(lambda x: (x - x.min()) / (x.max() - x.min()))

X_test_data = test_norm.loc[:, test_norm.columns != y_column]
y_pred_21_22 = regressor.predict(X_test_data)
results = pd.DataFrame({'Predicted': y_pred_21_22.flatten()}, index=test_norm.index)
results = results.sort_values('Predicted')
results

Team,Predicted
Chelsea,-0.726725
Liverpool,0.458125
West Ham United,1.677037
Manchester City,3.063134
Arsenal,6.043497
Brighton & Hove Albion,6.319705
Crystal Palace,7.162639
Manchester United,9.026933
Leicester City,10.292231
Southampton,10.379737


Linear regression output is unbounded, so the predicted values don't fall neatly on a 1 to 20 scale (Chelsea even comes out below zero), but we can still read the predicted table off from the order.
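
As a small sketch of reading positions off the order, the raw predictions can be turned into ranks with pandas. The block uses only the ten rows shown above, so the ranks run from 1 to 10 here; on the full results they would run from 1 to 20. The variable names are illustrative.

```python
import pandas as pd

# Raw predictions copied from the ten rows shown above
predicted = pd.Series(
    {
        "Chelsea": -0.726725,
        "Liverpool": 0.458125,
        "West Ham United": 1.677037,
        "Manchester City": 3.063134,
        "Arsenal": 6.043497,
        "Brighton & Hove Albion": 6.319705,
        "Crystal Palace": 7.162639,
        "Manchester United": 9.026933,
        "Leicester City": 10.292231,
        "Southampton": 10.379737,
    },
    name="Predicted",
)

# Lowest raw value gets position 1; method="first" breaks any ties by row order
predicted_position = predicted.rank(method="first").astype(int)

print(predicted_position)
```

Ranking keeps the ordering the model produced and discards the out-of-range magnitudes.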

It looks like bad news for my team, Tottenham Hotspur! Maybe we'll have to try a different model...