# Hypothesis Testing

The purpose of this file is to conduct hypothesis testing on the Professional Football League Dataset. The dataset comprises over 9,000 matches played in the top soccer divisions of Germany (Bundesliga), England (English Premier League), Spain (La Liga), France (Ligue 1), and Italy (Serie A). It contains matches from the 2016-2017 season through the 2020-2021 season for each of the aforementioned leagues.

In [1]:
#if you do not have researchpy, it may be worth running this command first --> !pip install researchpy
import pandas as pd
import numpy as np
from scipy import stats
import scipy.stats.distributions as dist
import researchpy as rp
import os

In [2]:
#This first command tells you what the current directory is for your notebook
path = os.getcwd()
print(path)
#Change this below command to the location where the file is stored on your local machine
os.chdir(r"C:\Users\stanma02\Desktop\Final Project\WebScraper and Data\Capstone Data")
path = os.getcwd()
print(path)

C:\Users\stanma02\Downloads
C:\Users\stanma02\Desktop\Final Project\WebScraper and Data\Capstone Data


In [3]:
df = pd.read_csv("SerieA.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1898 entries, 0 to 1897
Data columns (total 44 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Match_ID             1898 non-null   int64  
 1   League               1898 non-null   object 
 2   Season               1898 non-null   object 
 3   Wk                   1898 non-null   int64  
 4   Day                  1898 non-null   object 
 5   Date                 1898 non-null   object 
 6   Time                 1898 non-null   object 
 7   Home                 1898 non-null   object 
 8   Score                1898 non-null   object 
 9   Away                 1898 non-null   object 
 10  Attendance           1898 non-null   float64
 11  Venue                1898 non-null   object 
 12  Referee              1898 non-null   object 
 13  Match Report         1898 non-null   object 
 14  Notes                0 non-null      float64
 15  Venue Key            1898 non-null   o

In [4]:
##Here are some examples of metrics to test
df['GoalDifferential']=df['HomeGoal']-df['AwayGoal'] #differential is home goals minus away goals
df['HomePassAccuracy']=df['HomePassesCompleted']/df['HomePassesAttempts']
df['HomeTandInt']=df['HomeTackles']+df['HomeInterceptions']
df['AwayFoulsandYRcards']= df['AwayFouls']+2*df['AwayYellow']+4*df['AwayRed']
df['HomeFoulsandYRcards']= df['HomeFouls']+2*df['HomeYellow']+4*df['HomeRed']

In [5]:
df['Season'].value_counts()

17-18    380
18-19    380
19-20    380
20-21    379
16-17    379
Name: Season, dtype: int64

In [6]:
df['VAR'].value_counts()

1    1519
0     379
Name: VAR, dtype: int64

In [7]:
#Here we can create two new dataframes: one with all the games that have VAR and one where all the games do not have VAR
NoVarGames=df[df['VAR']==0]
VarGames=df[df['VAR']==1]

In [8]:
df['Attendance'].value_counts()

0.0        466
1000.0      42
10000.0     23
25000.0     16
20000.0     13
          ... 
27039.0      1
44118.0      1
16536.0      1
13918.0      1
25513.0      1
Name: Attendance, Length: 1239, dtype: int64

In [9]:
#Here we create a simple categorical variable for fan attendance, just for initial testing.

def attendance(c):
    if c['Attendance']==0:
        return "NoFans"
    else:
        return "Fans"

df['BiAttendance']=df.apply(attendance,axis=1)

In [10]:
No_Fans= df[df['Attendance']==0]
No_Fans['Season'].value_counts()
#Below we can see the vast majority of games with no fans are in the 20-21 and 19-20 seasons, when the pandemic occurred

20-21    335
19-20    130
18-19      1
Name: Season, dtype: int64

In [11]:
#Here we are looking at the split of games given presence of fans and VAR
Table= pd.crosstab(df.BiAttendance,df.VAR)
Table

VAR             0     1
BiAttendance
Fans          379  1053
NoFans          0   466


Now we are going to test for the impact of VAR on game results. We run this test only on matches where fans are present. This controls for the presence of fans, which could otherwise confound our results.

The code below shows the overall win, tie, and loss rates for home teams with and without VAR, restricted to matches where fans are present. Next, we test whether the proportion of home wins (the win rate) differs between matches with VAR and matches without it. We can do this with a two-sample z-test for proportions. An example of this can be found in this article: https://medium.com/analytics-vidhya/testing-a-difference-in-population-proportions-in-python-89d57a06254

This is the test:

All other variables held constant, controlling for the presence of fans:

Null Hypothesis: VAR has no impact on home team win percentage

Alternative Hypothesis: VAR has an impact on home team win percentage

Our criterion: a p-value below 0.05 will be considered statistically significant
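
The two-sample proportion test described above can be sketched as a small helper. This is a minimal illustration, not the notebook's own code: the function name `two_proportion_ztest` and its arguments are hypothetical, it assumes a pooled standard error under the null hypothesis, and it uses only the standard library (`math.erfc`) rather than `scipy` for the normal tail probability. The inputs in the example call are made up.

```python
import math

def two_proportion_ztest(p1, n1, p2, n2):
    """Two-sided pooled z-test for a difference in proportions (hypothetical helper).

    p1, p2 are the sample proportions and n1, n2 the sample sizes.
    Returns (z statistic, two-sided p-value).
    """
    # Pooled proportion under the null hypothesis that both groups share one rate
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    # Standard error of the difference p1 - p2 under the null
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Illustrative made-up inputs: 50% of 200 versus 40% of 300
z, p_value = two_proportion_ztest(0.50, 200, 0.40, 300)
print(z, p_value)
```

Working from proportions and counts keeps the helper independent of any particular dataframe layout; the null hypothesis is rejected at the 0.05 level when the returned p-value is below 0.05.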

In [12]:
Fans= df[df['Attendance']!=0]
NoFans= df[df['Attendance']==0]

In [13]:
pd.crosstab(Fans.HomeResult,Fans.VAR).apply(lambda r:r/r.sum(),axis=0)

VAR                0         1
HomeResult
Loss        0.303430  0.330484
Tie         0.211082  0.243115
Win         0.485488  0.426401


In [14]:
total_proportion_Won = (Fans.HomeResult == "Win").mean()
num_NoVAR=Fans[Fans.VAR==0]
num_VAR=Fans[Fans.VAR==1]
print(num_NoVAR.shape)
print(num_VAR.shape)

(379, 50)
(1053, 50)


In [15]:
prop = Fans.groupby("VAR")["HomeResult"].agg([lambda z: np.mean(z=="Win"), "size"])
prop.columns = ["prop_won", 'counts']
prop.head()

     prop_won  counts
VAR
0    0.485488     379
1    0.426401    1053


In [16]:
variance= total_proportion_Won*(1-total_proportion_Won)
standard_error= np.sqrt(variance*(1/prop.counts[1] + 1/prop.counts[0]))
print(standard_error)

0.029748833780531395


In [17]:
best_estimate= (prop.prop_won[1]-prop.prop_won[0])
print(best_estimate)

h_est=0

test_stat= (best_estimate-h_est)/standard_error

print(test_stat)

-0.05908736691498345
-1.9862078409827328


In [18]:
# Calculate the  p-value
pvalue = 2*dist.norm.cdf(-np.abs(test_stat)) # Multiply by two for a two-tailed test.
print("Computed P-value is", pvalue)

Computed P-value is 0.04701025057889068


At this p-value (about 0.047, just below our 0.05 threshold), we reject the null hypothesis that VAR has no impact on home team win percentage.

Now that we have evidence that VAR is associated with home team win percentage, we treat it as a potential confounding variable. When we measure the impact of fans, we therefore use only games where VAR is present.

In [19]:
#assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the VarGames slice
VarGames= VarGames.assign(BiAttendance=VarGames.apply(attendance,axis=1))


In [20]:
VarGames['BiAttendance'].value_counts()

Fans      1053
NoFans     466
Name: BiAttendance, dtype: int64

This is the test we will conduct on the dataset of VAR games:

All other variables held constant, controlling for the presence of VAR:

Null Hypothesis: The presence of fans has no impact on home team win percentage

Alternative Hypothesis: The presence of fans has an impact on home team win percentage

Our criterion: a p-value below 0.05 will be considered statistically significant

In [21]:
pd.crosstab(VarGames.HomeResult,VarGames.BiAttendance).apply(lambda r:r/r.sum(),axis=0)

BiAttendance      Fans    NoFans
HomeResult
Loss          0.330484  0.334764
Tie           0.243115  0.248927
Win           0.426401  0.416309


In [22]:
total_proportion_Won = (VarGames.HomeResult == "Win").mean()
num_NoFans=VarGames[VarGames.BiAttendance=="NoFans"]
num_Fans=VarGames[VarGames.BiAttendance=="Fans"]
print(num_NoFans.shape)
print(num_Fans.shape)

(466, 50)
(1053, 50)


In [23]:
prop = VarGames.groupby("BiAttendance")["HomeResult"].agg([lambda z: np.mean(z=="Win"), "size"])
prop.columns = ["prop_won", 'counts']
prop.head()

              prop_won  counts
BiAttendance
Fans          0.426401    1053
NoFans        0.416309     466


In [24]:
total_proportion_won = (VarGames.HomeResult == "Win").mean()
variance= total_proportion_won*(1-total_proportion_won)
standard_error= np.sqrt(variance*(1/prop.counts.Fans + 1/prop.counts.NoFans))
print(standard_error)

0.0274897954741792


In [25]:
best_estimate= (prop.prop_won.Fans-prop.prop_won.NoFans)
print(best_estimate)

h_est=0

test_stat= (best_estimate-h_est)/standard_error

print(test_stat)

0.010091746858556627
0.36710883746063777


In [26]:
# Calculate the  p-value
pvalue = 2*dist.norm.cdf(-np.abs(test_stat)) # Multiply by two for a two-tailed test.
print("Computed P-value is", pvalue)

Computed P-value is 0.7135378355002948


Based on a 0.05 level of significance, we fail to reject the null hypothesis that the presence of fans has no impact on home team win percentage.
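
Failing to reject the null hypothesis does not show that fans have no effect; it only means this sample cannot distinguish the two win rates. One way to see the range of differences compatible with the data is an approximate 95% confidence interval for the difference in proportions. The sketch below is an illustration rather than part of the original analysis: the helper name `diff_proportion_ci` is hypothetical, it assumes a normal (Wald) approximation with an unpooled standard error, and it takes as inputs the win rates and match counts printed in cell 23.

```python
import math
from statistics import NormalDist

def diff_proportion_ci(p1, n1, p2, n2, level=0.95):
    """Wald confidence interval for p1 - p2 (hypothetical helper)."""
    # Unpooled standard error: each group keeps its own variance estimate
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # Two-sided critical value of the standard normal, about 1.96 for 95%
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p1 - p2
    return diff - z_crit * se, diff + z_crit * se

# Win rates and match counts reported above for VAR games: Fans vs NoFans
lower, upper = diff_proportion_ci(0.426401, 1053, 0.416309, 466)
print(f"95% CI for the difference in home win rate: ({lower:.3f}, {upper:.3f})")
```

With these inputs the interval runs from roughly -0.04 to 0.06, so it includes zero, which is consistent with the large p-value computed in cell 26.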