# Hypothesis Testing

The purpose of this file is to conduct hypothesis testing on the Professional Football League Dataset. The dataset comprises over 9,000 matches played in the top soccer divisions of Germany (Bundesliga), England (English Premier League), Spain (La Liga), France (Ligue 1), and Italy (Serie A). It covers every season from 2016-2017 through 2020-2021 for each of the aforementioned leagues.

In [1]:
#if you do not have researchpy, it may be worth running this command first --> !pip install researchpy
import pandas as pd
import numpy as np
from scipy import stats
import scipy.stats.distributions as dist
import researchpy as rp
import os

In [2]:
#This first command tells you what the current directory is for your notebook
path = os.getcwd()
print(path)
#Change this below command to the location where the file is stored on your local machine
os.chdir(r"C:\Users\stanma02\Desktop\Final Project\WebScraper and Data\Capstone Data")
path = os.getcwd()
print(path)

C:\Users\stanma02\Downloads
C:\Users\stanma02\Desktop\Final Project\WebScraper and Data\Capstone Data


In [3]:
df = pd.read_csv("EPL.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1900 entries, 0 to 1899
Data columns (total 44 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Match_ID             1900 non-null   int64  
 1   League               1900 non-null   object 
 2   Season               1900 non-null   object 
 3   Wk                   1900 non-null   int64  
 4   Day                  1900 non-null   object 
 5   Date                 1900 non-null   object 
 6   Time                 1900 non-null   object 
 7   Home                 1900 non-null   object 
 8   Score                1900 non-null   object 
 9   Away                 1900 non-null   object 
 10  Attendance           1900 non-null   float64
 11  Venue                1900 non-null   object 
 12  Referee              1900 non-null   object 
 13  Match Report         1900 non-null   object 
 14  Notes                0 non-null      float64
 15  Venue Key            1900 non-null   o

In [4]:
##Here are some examples of metrics to test
df['GoalDifferential']=df['HomeGoal']-df['AwayGoal'] #home goals minus away goals (a sum would give total goals, not a differential)
df['HomePassAccuracy']=df['HomePassesCompleted']/df['HomePassesAttempts']
df['HomeTandInt']=df['HomeTackles']+df['HomeInterceptions']
df['AwayFoulsandYRcards']= df['AwayFouls']+2*df['AwayYellow']+4*df['AwayRed']
df['HomeFoulsandYRcards']= df['HomeFouls']+2*df['HomeYellow']+4*df['HomeRed']

In [5]:
df['Season'].value_counts()

18-19    380
20-21    380
17-18    380
19-20    380
16-17    380
Name: Season, dtype: int64

In [6]:
df['VAR'].value_counts()

1    1140
0     760
Name: VAR, dtype: int64

In [7]:
#Here we can create two new dataframes: one with all the games that have VAR and one where all the games do not have VAR
NoVarGames=df[df['VAR']==0]
VarGames=df[df['VAR']==1]

In [8]:
df['Attendance'].value_counts()

0.0        440
2000.0      13
10000.0      7
8000.0       4
24121.0      4
          ... 
24490.0      1
39328.0      1
53145.0      1
10446.0      1
31488.0      1
Name: Attendance, Length: 1360, dtype: int64

In [9]:
#Here we create a simple categorical variable for fan attendance, just for initial testing.

def attendance(c):
    if c['Attendance']==0:
        return "NoFans"
    else:
        return "Fans"

df['BiAttendance']=df.apply(attendance,axis=1)

In [10]:
No_Fans= df[df['Attendance']==0]
No_Fans['Season'].value_counts()
#Below we can see that every game with no fans falls in the 19-20 and 20-21 seasons, when the pandemic occurred

20-21    348
19-20     92
Name: Season, dtype: int64

In [11]:
#Here we are looking at the split of games given presence of fans and VAR
Table= pd.crosstab(df.BiAttendance,df.VAR)
Table

VAR             0    1
BiAttendance          
Fans          760  700
NoFans          0  440


Next we test for the impact of VAR on game results. We run this test only on matches where fans were present. The crosstab above shows that every match without fans was played with VAR, so restricting the sample to matches with fans keeps fan presence from confounding the result.

The code below shows the win, tie, and loss rates for home teams with and without VAR, for matches where fans were present. We then test whether the proportion of home wins (the win rate) differs between matches with VAR and matches without it. We can do this with a two-sample z-test for proportions. An example of this can be found in this article: https://medium.com/analytics-vidhya/testing-a-difference-in-population-proportions-in-python-89d57a06254

This is the test:

All other variables held constant, controlling for the presence of fans:

Null Hypothesis: VAR has no impact on home team win percentage

Alternative Hypothesis: VAR has an impact on home team win percentage

Our criterion for statistical significance will be a p-value below 0.05.
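
For reference, the pooled two-sample z statistic computed in the cells that follow can be written as below. The symbols are notation introduced here, not names from the notebook: $x_1, n_1$ are the home wins and number of matches with VAR, $x_0, n_0$ are the same quantities without VAR, and $\Phi$ is the standard normal CDF.

$$
\hat{p}_1 = \frac{x_1}{n_1}, \qquad \hat{p}_0 = \frac{x_0}{n_0}, \qquad \hat{p} = \frac{x_1 + x_0}{n_1 + n_0}
$$

$$
z = \frac{\hat{p}_1 - \hat{p}_0}{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_0}\right)}}, \qquad p\text{-value} = 2\,\Phi(-|z|)
$$

Under the null hypothesis both groups share one win rate, which is why the standard error uses the pooled proportion $\hat{p}$ rather than the two group proportions.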

In [12]:
Fans= df[df['Attendance']!=0]
NoFans= df[df['Attendance']==0]

In [13]:
pd.crosstab(Fans.HomeResult,Fans.VAR).apply(lambda r:r/r.sum(),axis=0)

VAR                0         1
HomeResult                    
Loss        0.285526  0.322857
Tie         0.240789  0.210000
Win         0.473684  0.467143


In [14]:
total_proportion_Won = (Fans.HomeResult == "Win").mean()
num_NoVAR=Fans[Fans.VAR==0]
num_VAR=Fans[Fans.VAR==1]
print(num_NoVAR.shape)
print(num_VAR.shape)

(760, 50)
(700, 50)


In [15]:
prop = Fans.groupby("VAR")["HomeResult"].agg([lambda z: np.mean(z=="Win"), "size"])
prop.columns = ["prop_won", 'counts']
prop.head()

     prop_won  counts
VAR                  
0    0.473684     760
1    0.467143     700


In [16]:
variance= total_proportion_Won*(1-total_proportion_Won)
standard_error= np.sqrt(variance*(1/prop.counts[1] + 1/prop.counts[0]))
print(standard_error)

0.026147843167994504


In [17]:
best_estimate= (prop.prop_won[1]-prop.prop_won[0])
print(best_estimate)

h_est=0

test_stat= (best_estimate-h_est)/standard_error

print(test_stat)

-0.006541353383458626
-0.25016799058460687


In [18]:
# Calculate the  p-value
pvalue = 2*dist.norm.cdf(-np.abs(test_stat)) # Multiplying the one-tail area by two gives a two-tailed test.
print("Computed P-value is", pvalue)

Computed P-value is 0.8024574381567686


At this P-value, we fail to reject the null hypothesis that VAR has no impact on home team win percentage.
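
The steps in cells 14 through 18 can be collected into a single helper. This is a minimal sketch rather than part of the original notebook: the function name `two_proportion_ztest` is hypothetical, and the win counts (327 of 700 with VAR, 360 of 760 without) are back-calculated from the proportions and counts printed above. It uses only the standard library, so the normal CDF is obtained through `math.erfc` instead of `scipy`.

```python
import math

def two_proportion_ztest(wins_a, n_a, wins_b, n_b):
    """Pooled two-sided z-test for H0: p_a == p_b. Returns (z, p_value)."""
    p_a = wins_a / n_a
    p_b = wins_b / n_b
    # Under H0 both groups share one win rate, so pool the two samples
    pooled = (wins_a + wins_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value: 2 * Phi(-|z|) is the same as erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Assumed counts: group a = VAR present, group b = VAR absent (fans present in both)
z, p_value = two_proportion_ztest(327, 700, 360, 760)
print(f"z = {z:.4f}, p-value = {p_value:.4f}")
```

With these counts the helper should agree with the test statistic and p-value computed in the cells above, up to rounding.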

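Because we found no evidence that VAR affects the home win rate, we use all games, with and without VAR, when measuring the impact of fans. Note that failing to reject the null hypothesis does not prove that VAR has no effect; it only means that this sample does not show one.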

In [19]:
df['BiAttendance'].value_counts()

Fans      1460
NoFans     440
Name: BiAttendance, dtype: int64

This is the test we will conduct on the dataset of all English Premier League games:

All other variables held constant, having controlled for the presence of VAR:

Null Hypothesis: The presence of fans has no impact on home team win percentage

Alternative Hypothesis: The presence of fans has an impact on home team win percentage

Our criterion for statistical significance will be a p-value below 0.05.

In [20]:
pd.crosstab(df.HomeResult,df.BiAttendance).apply(lambda r:r/r.sum(),axis=0)

BiAttendance      Fans    NoFans
HomeResult                      
Loss          0.303425  0.388636
Tie           0.226027  0.225000
Win           0.470548  0.386364


In [21]:
total_proportion_Won = (df.HomeResult == "Win").mean()
num_NoFans=df[df.BiAttendance=="NoFans"]
num_Fans=df[df.BiAttendance=="Fans"]
print(num_NoFans.shape)
print(num_Fans.shape)

(440, 50)
(1460, 50)


In [22]:
prop = df.groupby("BiAttendance")["HomeResult"].agg([lambda z: np.mean(z=="Win"), "size"])
prop.columns = ["prop_won", 'counts']
prop.head()

              prop_won  counts
BiAttendance                  
Fans          0.470548    1460
NoFans        0.386364     440


In [23]:
total_proportion_won = (df.HomeResult == "Win").mean()
variance= total_proportion_won*(1-total_proportion_won)
standard_error= np.sqrt(variance*(1/prop.counts.Fans + 1/prop.counts.NoFans))
print(standard_error)

0.02706157059861965


In [24]:
best_estimate= (prop.prop_won.Fans-prop.prop_won.NoFans)
print(best_estimate)

h_est=0

test_stat= (best_estimate-h_est)/standard_error

print(test_stat)

0.08418430884184308
3.1108434203791977


In [25]:
# Calculate the  p-value
pvalue = 2*dist.norm.cdf(-np.abs(test_stat)) # Multiplying the one-tail area by two gives a two-tailed test.
print("Computed P-value is", pvalue)

Computed P-value is 0.0018655383114333341


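Based on a 0.05 level of significance, we reject the null hypothesis that the presence of fans has no impact on home team win percentage. Home teams won about 47% of matches played with fans, compared with about 39% of matches played without them.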
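
A confidence interval for the difference in win rates gives a sense of how large the effect is, which the p-value alone does not. The sketch below is a supplementary check rather than part of the original analysis. It assumes a normal-approximation (Wald) 95% interval with an unpooled standard error, the helper name `diff_confidence_interval` is hypothetical, and the win counts (687 of 1460 with fans, 170 of 440 without) are back-calculated from the proportions and counts printed above.

```python
import math

def diff_confidence_interval(wins_a, n_a, wins_b, n_b, z_crit=1.959964):
    """Wald interval for p_a - p_b; z_crit defaults to the 95% normal quantile."""
    p_a = wins_a / n_a
    p_b = wins_b / n_b
    diff = p_a - p_b
    # Unpooled standard error: each group keeps its own variance estimate
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, diff - z_crit * se, diff + z_crit * se

# Assumed counts: group a = fans present, group b = no fans
diff, lower, upper = diff_confidence_interval(687, 1460, 170, 440)
print(f"difference = {diff:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

An interval that excludes zero is consistent with rejecting the null hypothesis at the 0.05 level.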