# Pythagorean Expectation and English Soccer

In soccer, teams score goals, and we can calculate Pythagorean Expectations based on goals scored and goals conceded.

The structure of competition in soccer in most countries around the world is different from the sports we have looked at so far. Rather than leagues operating as independent entities, they are connected through a hierarchical system, sometimes called "the pyramid". In England, the English Premier League is at the top of the pyramid (it used to be called the First Division) and contains 20 teams. 

Beneath the Premier League is The Football League Championship (it used to be called Division Two) and it contains 24 teams. The Premier League and the Championship are linked via the system of promotion and relegation. At the end of each season, the three worst performing teams (measured by points won in competition) are relegated to play Championship soccer in the following season, to be replaced by the three best performing teams in the Championship. Beneath the Championship are two more leagues - League One (formerly Third Division) and League Two (formerly Fourth Division). These leagues are also linked, hierarchically, through promotion and relegation. Thus it makes sense to think of these four divisions as part of a common system. 

In any one season, there are 92 teams in the system. Even though teams compete in different divisions, we can define both win percentage and Pythagorean Expectation for each team and see how closely the Pythagorean Expectation tracks the actual win percentage.

In each of the four divisions, every team plays every other team twice in a season, once at home and once away. There is no playoff to decide the title, so the champion is the team with the most points at the end of the season (3 points for a win, 1 for a draw, i.e. a tie, and 0 for a loss). Unlike the leagues we have looked at so far, draws are not only possible but quite common, so we need to adjust our definition of win percentage. We could create a statistic such as the percentage of maximum possible points, but instead we do something simpler: we give a value of 1 for a win, 0 for a loss, and 1/2 for a draw.
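
To make the draw adjustment concrete, here is a minimal sketch comparing the win value used in this notebook with the points-based alternative mentioned above. The function names and the 20-10-8 record are hypothetical, chosen only for illustration.

```python
def win_value_pct(wins: int, draws: int, losses: int) -> float:
    """Win percentage where a win counts 1, a draw 1/2 and a loss 0."""
    games = wins + draws + losses
    return (wins + 0.5 * draws) / games


def points_pct(wins: int, draws: int, losses: int) -> float:
    """Share of the maximum possible points (3 per win, 1 per draw)."""
    games = wins + draws + losses
    return (3 * wins + draws) / (3 * games)


# Hypothetical 38-game record: 20 wins, 10 draws, 8 losses
print(round(win_value_pct(20, 10, 8), 3))  # 0.658
print(round(points_pct(20, 10, 8), 3))     # 0.614
```

The two measures differ because the points system values a draw at one third of a win, while the win value treats it as half a win.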

We now follow the same procedure we have used to date:

Pythagorean Expectation in Football = (total goals scored)**2/((total goals scored)**2 + (total goals conceded)**2)
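
As a quick check of the formula, the sketch below evaluates it for a single team. The function name is hypothetical; the inputs (61 goals scored, 48 conceded) are Arsenal's season totals as they appear in the output tables later in this notebook.

```python
def pythagorean_expectation(goals_for: float, goals_against: float) -> float:
    """Pythagorean Expectation with exponent 2, as defined above."""
    gf_sq = goals_for ** 2
    ga_sq = goals_against ** 2
    return gf_sq / (gf_sq + ga_sq)


# 61 goals scored, 48 conceded
print(round(pythagorean_expectation(61, 48), 6))  # 0.617593
```

A team that scores exactly as many goals as it concedes gets an expectation of 0.5, whatever the totals are.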

In [1]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
Eng18 = pd.read_csv('2021-2022.csv')
print(Eng18.columns.tolist())

['Div', 'Date', 'Time', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTHG', 'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR', 'B365H', 'B365D', 'B365A', 'BWH', 'BWD', 'BWA', 'IWH', 'IWD', 'IWA', 'PSH', 'PSD', 'PSA', 'WHH', 'WHD', 'WHA', 'VCH', 'VCD', 'VCA', 'MaxH', 'MaxD', 'MaxA', 'AvgH', 'AvgD', 'AvgA', 'B365>2.5', 'B365<2.5', 'P>2.5', 'P<2.5', 'Max>2.5', 'Max<2.5', 'Avg>2.5', 'Avg<2.5', 'AHh', 'B365AHH', 'B365AHA', 'PAHH', 'PAHA', 'MaxAHH', 'MaxAHA', 'AvgAHH', 'AvgAHA', 'B365CH', 'B365CD', 'B365CA', 'BWCH', 'BWCD', 'BWCA', 'IWCH', 'IWCD', 'IWCA', 'PSCH', 'PSCD', 'PSCA', 'WHCH', 'WHCD', 'WHCA', 'VCCH', 'VCCD', 'VCCA', 'MaxCH', 'MaxCD', 'MaxCA', 'AvgCH', 'AvgCD', 'AvgCA', 'B365C>2.5', 'B365C<2.5', 'PC>2.5', 'PC<2.5', 'MaxC>2.5', 'MaxC<2.5', 'AvgC>2.5', 'AvgC<2.5', 'AHCh', 'B365CAHH', 'B365CAHA', 'PCAHH', 'PCAHA', 'MaxCAHH', 'MaxCAHA', 'AvgCAHH', 'AvgCAHA']


In [3]:
#FTHG - Number of Goals scored by Home Team
#FTAG - Number of Goals scored by Away Team
#FTR - Result of the match

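# .copy() gives an independent frame, so the columns added below do not raise SettingWithCopyWarning
Eng18 = Eng18[['Div','Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR']].copy()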
Eng18

Unnamed: 0,Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,E0,13/08/2021,Brentford,Arsenal,2,0,H
1,E0,14/08/2021,Man United,Leeds,5,1,H
2,E0,14/08/2021,Burnley,Brighton,1,2,A
3,E0,14/08/2021,Chelsea,Crystal Palace,3,0,H
4,E0,14/08/2021,Everton,Southampton,3,1,H
...,...,...,...,...,...,...,...
375,E0,22/05/2022,Crystal Palace,Man United,1,0,H
376,E0,22/05/2022,Leicester,Southampton,4,1,H
377,E0,22/05/2022,Liverpool,Wolves,3,1,H
378,E0,22/05/2022,Man City,Aston Villa,3,2,H


In [4]:
# Once again our data is in the form of game results. We first identify whether the result was a win for the home team (H),
# the away team (A) or a draw (D). We also create the counting variable.

Eng18['hwinvalue']=np.where(Eng18['FTR']=='H',1,np.where(Eng18['FTR']=='D',.5,0))
Eng18['awinvalue']=np.where(Eng18['FTR']=='A',1,np.where(Eng18['FTR']=='D',.5,0))
Eng18['count']=1

In [6]:
# Once again we have to create separate dfs to calculate home team and away team performance.
# Here is the home team df, including only the variables we need.
# count (renamed Ph) - Number of home matches played

Enghome = Eng18.groupby(['HomeTeam','Div'])[['count','hwinvalue', 'FTHG','FTAG']].sum().reset_index()
Enghome = Enghome.rename(columns={'HomeTeam':'team','count':'Ph','FTHG':'FTHGh','FTAG':'FTAGh'})
Enghome

Unnamed: 0,team,Div,Ph,hwinvalue,FTHGh,FTAGh
0,Arsenal,E0,19,14.0,35,17
1,Aston Villa,E0,19,8.5,29,29
2,Brentford,E0,19,8.5,22,21
3,Brighton,E0,19,8.5,19,23
4,Burnley,E0,19,8.0,18,25
5,Chelsea,E0,19,12.5,37,22
6,Crystal Palace,E0,19,11.0,27,17
7,Everton,E0,19,10.0,27,25
8,Leeds,E0,19,7.0,19,38
9,Leicester,E0,19,12.0,34,23


In [8]:
# Now we create the mirror image df for the away team results.
# count (renamed Pa) - Number of away matches played

Engaway = Eng18.groupby('AwayTeam')[['count','awinvalue', 'FTHG','FTAG']].sum().reset_index()
Engaway = Engaway.rename(columns={'AwayTeam':'team','count':'Pa','FTHG':'FTHGa','FTAG':'FTAGa'})
Engaway

Unnamed: 0,team,Pa,awinvalue,FTHGa,FTAGa
0,Arsenal,19,9.5,31,26
1,Aston Villa,19,7.5,25,23
2,Brentford,19,8.0,35,26
3,Brighton,19,11.0,21,23
4,Burnley,19,6.0,28,16
5,Chelsea,19,14.0,11,39
6,Crystal Palace,19,7.5,29,23
7,Everton,19,4.0,41,16
8,Leeds,19,7.5,41,23
9,Leicester,19,7.0,36,28


In [9]:
# Merge the home team and away team results
# FTHGh - Goals scored by the team in its home matches
# FTAGh - Goals conceded by the team in its home matches
# FTHGa - Goals conceded by the team in its away matches
# FTAGa - Goals scored by the team in its away matches


Eng18 = pd.merge(Enghome, Engaway, on = ['team'])
Eng18

Unnamed: 0,team,Div,Ph,hwinvalue,FTHGh,FTAGh,Pa,awinvalue,FTHGa,FTAGa
0,Arsenal,E0,19,14.0,35,17,19,9.5,31,26
1,Aston Villa,E0,19,8.5,29,29,19,7.5,25,23
2,Brentford,E0,19,8.5,22,21,19,8.0,35,26
3,Brighton,E0,19,8.5,19,23,19,11.0,21,23
4,Burnley,E0,19,8.0,18,25,19,6.0,28,16
5,Chelsea,E0,19,12.5,37,22,19,14.0,11,39
6,Crystal Palace,E0,19,11.0,27,17,19,7.5,29,23
7,Everton,E0,19,10.0,27,25,19,4.0,41,16
8,Leeds,E0,19,7.0,19,38,19,7.5,41,23
9,Leicester,E0,19,12.0,34,23,19,7.0,36,28
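
The merge above uses the default inner join, which silently drops any team that appears in only one of the two frames. One possible sanity check, sketched here on made-up toy frames rather than the real `Enghome` and `Engaway`, is an outer merge with pandas' `validate` and `indicator` options. The frame names and values below are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the home and away frames (values are made up)
home = pd.DataFrame({"team": ["A", "B", "C"], "Ph": [19, 19, 19]})
away = pd.DataFrame({"team": ["A", "B", "D"], "Pa": [19, 19, 19]})

# validate raises MergeError if a team is duplicated on either side;
# indicator adds a _merge column saying where each row came from
checked = pd.merge(home, away, on="team", how="outer",
                   validate="one_to_one", indicator=True)

# Teams present in only one frame
unmatched = sorted(checked.loc[checked["_merge"] != "both", "team"])
print(unmatched)  # ['C', 'D']
```

If the list is empty for the real data, the inner merge loses no teams.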


In [10]:
# Sum the results by home and away measures to get the team overall performance for the season

Eng18['W'] = Eng18['hwinvalue']+Eng18['awinvalue']
Eng18['G'] = Eng18['Ph']+Eng18['Pa']
Eng18['GF'] = Eng18['FTHGh']+Eng18['FTAGa']
Eng18['GA'] = Eng18['FTAGh']+Eng18['FTHGa']
Eng18

Unnamed: 0,team,Div,Ph,hwinvalue,FTHGh,FTAGh,Pa,awinvalue,FTHGa,FTAGa,W,G,GF,GA
0,Arsenal,E0,19,14.0,35,17,19,9.5,31,26,23.5,38,61,48
1,Aston Villa,E0,19,8.5,29,29,19,7.5,25,23,16.0,38,52,54
2,Brentford,E0,19,8.5,22,21,19,8.0,35,26,16.5,38,48,56
3,Brighton,E0,19,8.5,19,23,19,11.0,21,23,19.5,38,42,44
4,Burnley,E0,19,8.0,18,25,19,6.0,28,16,14.0,38,34,53
5,Chelsea,E0,19,12.5,37,22,19,14.0,11,39,26.5,38,76,33
6,Crystal Palace,E0,19,11.0,27,17,19,7.5,29,23,18.5,38,50,46
7,Everton,E0,19,10.0,27,25,19,4.0,41,16,14.0,38,43,66
8,Leeds,E0,19,7.0,19,38,19,7.5,41,23,14.5,38,42,79
9,Leicester,E0,19,12.0,34,23,19,7.0,36,28,19.0,38,62,59


In [11]:
# Create the win percentage and Pythagorean Expectation
# wpc  - win percentage (a win counts 1, a draw 1/2)
# pyth - Pythagorean Expectation (win percentage predicted from goals)

Eng18['wpc'] = Eng18['W']/Eng18['G']
Eng18['pyth'] = Eng18['GF']**2/(Eng18['GF']**2 + Eng18['GA']**2)
Eng18

Unnamed: 0,team,Div,Ph,hwinvalue,FTHGh,FTAGh,Pa,awinvalue,FTHGa,FTAGa,W,G,GF,GA,wpc,pyth
0,Arsenal,E0,19,14.0,35,17,19,9.5,31,26,23.5,38,61,48,0.618421,0.617593
1,Aston Villa,E0,19,8.5,29,29,19,7.5,25,23,16.0,38,52,54,0.421053,0.481139
2,Brentford,E0,19,8.5,22,21,19,8.0,35,26,16.5,38,48,56,0.434211,0.423529
3,Brighton,E0,19,8.5,19,23,19,11.0,21,23,19.5,38,42,44,0.513158,0.476757
4,Burnley,E0,19,8.0,18,25,19,6.0,28,16,14.0,38,34,53,0.368421,0.291551
5,Chelsea,E0,19,12.5,37,22,19,14.0,11,39,26.5,38,76,33,0.697368,0.841369
6,Crystal Palace,E0,19,11.0,27,17,19,7.5,29,23,18.5,38,50,46,0.486842,0.541594
7,Everton,E0,19,10.0,27,25,19,4.0,41,16,14.0,38,43,66,0.368421,0.297985
8,Leeds,E0,19,7.0,19,38,19,7.5,41,23,14.5,38,42,79,0.381579,0.220362
9,Leicester,E0,19,12.0,34,23,19,7.0,36,28,19.0,38,62,59,0.5,0.524778
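
One rough way to gauge how closely the two measures track each other is the Pearson correlation. The sketch below uses only the ten rows displayed in the output above, typed in by hand, so it illustrates the calculation and is not a result for the full league.

```python
import numpy as np

# wpc and pyth for the ten teams shown above (Arsenal through Leicester)
wpc = np.array([0.618421, 0.421053, 0.434211, 0.513158, 0.368421,
                0.697368, 0.486842, 0.368421, 0.381579, 0.500000])
pyth = np.array([0.617593, 0.481139, 0.423529, 0.476757, 0.291551,
                 0.841369, 0.541594, 0.297985, 0.220362, 0.524778])

# Off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(wpc, pyth)[0, 1]
print(f"Correlation between wpc and pyth: {r:.3f}")
```

On the full DataFrame the equivalent call would be `Eng18['wpc'].corr(Eng18['pyth'])`.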


## Let's find the overachieving and underachieving teams

In [13]:
Eng18['Difference'] = Eng18['wpc'] - Eng18['pyth']

# Identify overachievers and underachievers
overachievers = Eng18[Eng18['Difference'] > 0]
underachievers = Eng18[Eng18['Difference'] < 0]

print("Overachievers:")
print(overachievers[['team', 'wpc', 'pyth', 'Difference']])

print("\nUnderachievers:")
print(underachievers[['team', 'wpc', 'pyth', 'Difference']])

Overachievers:
           team       wpc      pyth  Difference
0       Arsenal  0.618421  0.617593    0.000828
2     Brentford  0.434211  0.423529    0.010681
3      Brighton  0.513158  0.476757    0.036401
4       Burnley  0.368421  0.291551    0.076870
7       Everton  0.368421  0.297985    0.070436
8         Leeds  0.381579  0.220362    0.161217
12   Man United  0.552632  0.500000    0.052632
13    Newcastle  0.473684  0.334948    0.138736
14      Norwich  0.223684  0.069743    0.153941
15  Southampton  0.407895  0.291732    0.116162
17      Watford  0.223684  0.163162    0.060523
19       Wolves  0.473684  0.438506    0.035178

Underachievers:
              team       wpc      pyth  Difference
1      Aston Villa  0.421053  0.481139   -0.060086
5          Chelsea  0.697368  0.841369   -0.144001
6   Crystal Palace  0.486842  0.541594   -0.054752
9        Leicester  0.500000  0.524778   -0.024778
10       Liverpool  0.842105  0.928932   -0.086827
11        Man City  0.842105  0.935478