In [1]:
import pandas as pd
import numpy as np
import time
from datetime import date
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_1samp

#### Importing the dataframe

In [2]:
df = pd.read_csv("Clean Data/expanded_clean.csv")

# Hypothesis testing:  T Test

My hypothesis is that allowing online sports betting raises the amount wagered per person by at least 30%.
This means:

Null Hypothesis
H0 : average handle per capita for states with online betting = 1.3 times average handle per capita for states without online betting

Alternative Hypothesis:
H1 : average handle per capita for states with online betting > 1.3 times average handle per capita for states without online betting

This is a one-sided test, since the alternative hypothesis is that the population mean is higher than our test value.

We test at a 95% confidence level, which means we reject H0 if the p-value is lower than 0.05.
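
The one-sided decision rule can be sketched as follows. This is a minimal illustration on synthetic data, not on the notebook's data: `sample`, `benchmark` and the numbers used are made up for the example, and the `alternative="greater"` argument assumes SciPy 1.6 or newer.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Synthetic stand-in data (assumption): 200 values with a true mean of 8
rng = np.random.default_rng(0)
sample = rng.normal(loc=8.0, scale=2.0, size=200)
benchmark = 7.0  # hypothetical test value, playing the role of 1.3 * mean_no

# Two-sided test (SciPy's default)
stat, p_two_sided = ttest_1samp(sample, benchmark)

# One-sided test: H1 says the population mean is greater than the benchmark
stat_greater, p_greater = ttest_1samp(sample, benchmark, alternative="greater")

# Decision rule at the 5% significance level
alpha = 0.05
reject_h0 = p_greater < alpha
print(stat_greater, p_greater, reject_h0)
```

Halving the two-sided p-value gives the same number as `alternative="greater"` only when the test statistic is positive; for a negative statistic the one-sided p-value is instead 1 minus half of the two-sided value.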

#### Data

In [3]:
# First we calculate the mean for the states in which online betting is not allowed
mean_no = np.mean(df[df['online'] == 0]['handle_capita'])
mean_no

5.375786548196981

In [4]:
# Now we get a list of handle per capita values for the states with online betting (rows with zero handle are excluded)
data_yes = list(df[(df['online'] == 1) & (df['handle'] != 0)]['handle_capita'])

### Making the Hypothesis Test

In [5]:
import scipy.stats
from scipy.stats import ttest_1samp

In [6]:
stat, pval = ttest_1samp(data_yes, (mean_no*1.3))

In [7]:
# ttest_1samp returns a two-sided p-value; halving it gives the one-tailed p-value as long as the statistic is positive
print(pval/2)
print(stat)

5.069847618641556e-65
18.773064278669978


### Conclusion
The p-value is very close to zero, so we can reject H0. This means the mean with online betting very likely differs from 1.3 times the mean without online betting.
Since the test statistic is positive, the sample mean lies above the value assumed in H0, which supports our H1.
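
To see why the sign of the statistic indicates the direction of the difference, the one-sample t statistic can be computed by hand. The sketch below uses synthetic data and a hypothetical test value `mu0` (both assumptions, not the notebook's values) and compares the manual result with `ttest_1samp`.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Synthetic stand-in data (assumption), not the notebook's data_yes
rng = np.random.default_rng(1)
sample = rng.normal(loc=8.0, scale=2.0, size=50)
mu0 = 7.0  # hypothetical test value

n = len(sample)
# t = (sample mean - test value) / (sample standard deviation / sqrt(n))
# ddof=1 gives the sample (not population) standard deviation
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

t_scipy, _ = ttest_1samp(sample, mu0)
print(t_manual, t_scipy)
```

The numerator is the sample mean minus the test value, so a positive statistic means the sample mean is above the value stated in H0.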

# Hypothesis testing:  ANOVA Test

We will group our data into states with low, medium and high tax rates and test the assumption that the tax rate affects the handle per capita. So that the groups are also available for our further analysis, the grouping itself is done in notebook 6 and the resulting data is used here.

### Hypothesis

We assume that the tax group affects the handle per capita.

H0: mean handle_capita for 'Low' = mean handle_capita for 'Medium' = mean handle_capita for 'High'

H1: the mean handle_capita is not equal for all groups

We again test at a 95% confidence level.

In [8]:
# First we have to create our groups
Low = list(df[df['tax_group'] == 'Low']['handle_capita'])
Medium = list(df[df['tax_group'] == 'Medium']['handle_capita'])
High = list(df[df['tax_group'] == 'High']['handle_capita'])

In [9]:
# The groups differ a lot in size. f_oneway does not require equal group sizes, but we balance the groups anyway
print(len(Low))
print(len(Medium))
print(len(High))

333
511
191
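
As a side note, SciPy's `f_oneway` accepts groups of different lengths. The sketch below runs it on synthetic groups whose sizes mirror the counts printed above; the group names, means and spreads are invented for the illustration and are not the notebook's data.

```python
import numpy as np
from scipy.stats import f_oneway

# Synthetic groups of unequal size (assumption), not the notebook's tax groups
rng = np.random.default_rng(2)
group_a = rng.normal(loc=5.0, scale=2.0, size=333)
group_b = rng.normal(loc=5.5, scale=2.0, size=511)
group_c = rng.normal(loc=6.0, scale=2.0, size=191)

# One-way ANOVA runs directly on groups of different lengths
stat_unequal, pvalue_unequal = f_oneway(group_a, group_b, group_c)
print(stat_unequal, pvalue_unequal)
```

Balanced groups make the test less sensitive to unequal variances, which is one reason to subsample, but it is a choice rather than a requirement.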


In [10]:
# We will take a random sample of 100 values from each group
import random

In [11]:
Low = random.sample(Low, 100)
Medium = random.sample(Medium, 100)
High = random.sample(High, 100)
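
Because the samples are drawn at random, the ANOVA result changes from run to run unless the random generator is seeded. A minimal sketch of reproducible sampling, assuming an arbitrary seed of 42 and a made-up list `values` in place of a real group:

```python
import random

values = list(range(1000))  # hypothetical stand-in for one group's values

rng = random.Random(42)  # the seed value is arbitrary (assumption)
first_draw = rng.sample(values, 100)

rng = random.Random(42)  # same seed, so the same sample is drawn again
second_draw = rng.sample(values, 100)

print(first_draw == second_draw)
```

Calling `random.seed(42)` once before the three `random.sample` calls would have the same effect for the module-level functions used in this notebook.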

In [12]:
# Now we can execute the ANOVA test
from scipy.stats import f_oneway

In [13]:
stat, pvalue = f_oneway(Low, Medium, High)

#### Result

In [14]:
print(stat)
print(pvalue)

5.5216017358426805
0.004420847125742754


Our p-value (about 0.004) is below 0.05, which means that we can reject H0.
Therefore H1 seems very likely to be true. In simpler terms: we can assume that the amount wagered per capita differs between the tax groups.

# Saving the new dataframe

In [15]:
# As csv
df.to_csv('Clean Data/intermediate.csv', index=False)