## Goodness-of-fit tests check whether a sample comes from a population with a specific distribution. The chi-squared test uses the chi-squared distribution as the reference distribution for its test statistic. Other similar tests include the Anderson-Darling and Kolmogorov-Smirnov tests.

## As an example, say we flip a coin. We test whether the outcomes of the flips come from a specified distribution. The expectation is that coin flips follow a binomial distribution in which heads and tails each have probability 0.5.

$$
\begin{aligned}
H_0 &: \text{The data follow the specified distribution} \\
H_a &: \text{The data do not follow the specified distribution}
\end{aligned}
$$

Calculated Statistic:

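$$
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
$$

- $k$ - number of categories; the statistic is compared against the chi-squared critical value with $k-1$ degrees of freedom at the chosen significance level $\alpha$  
- $O_i$ - observed count in category $i$  
- $E_i$ - expected count in category $i$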
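The formula above can be sketched directly in NumPy. This is a minimal illustration rather than a library routine: the helper name `chi_square_statistic` and the three-category counts are hypothetical and chosen only to show the arithmetic.

```python
import numpy as np


def chi_square_statistic(observed, expected):
    """Return the sum of (O_i - E_i)^2 / E_i over all k categories."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.sum((observed - expected) ** 2 / expected))


# Hypothetical counts for k = 3 categories (not taken from the notebook)
observed = [18, 22, 20]
expected = [20, 20, 20]

statistic = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1  # k - 1

print(statistic, degrees_of_freedom)
```

The expected counts must be in the same units as the observed counts (raw counts, not proportions) and should sum to the same total.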

Example: A coin is flipped 100 times and 40 heads are observed. Is this coin biased? (Check at 95% confidence.)

Heads: exp = 50, obs = 40, o-e = -10, (o-e)^2 = 100, (o-e)^2/e = 2  
Tails: exp = 50, obs = 60, o-e = 10, (o-e)^2 = 100, (o-e)^2/e = 2

chi^2 = 4  
df = 1  
chi_critical = 3.84

Since 4 > 3.84, we reject the null hypothesis. At the 5% significance level (95% confidence), the data suggest the coin is biased.
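The critical value of 3.84 does not have to come from a table. As a small sketch, assuming SciPy is available, it can be obtained from the inverse CDF of the chi-squared distribution (`scipy.stats.chi2.ppf`) and compared with the hand-computed statistic. The variable names here are illustrative only.

```python
from scipy import stats

alpha = 0.05  # significance level for a 95% confidence check
observed = [40, 60]  # heads, tails
expected = [50, 50]

# Hand-computed statistic, mirroring the worked example
chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# Critical value: the (1 - alpha) quantile of chi-squared with df degrees of freedom
chi2_critical = stats.chi2.ppf(1 - alpha, df)
reject_null = chi2_stat > chi2_critical

print(f"chi^2 = {chi2_stat:.2f}, critical = {chi2_critical:.2f}, reject = {reject_null}")
```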

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

sns.set_style("darkgrid")

In [6]:
expected = [50, 50]
observed = [40, 60]

# if f_exp is omitted, all categories are assumed to be equally likely
stats.chisquare(f_obs=observed, f_exp=expected)
# p < 0.05, so we reject the null hypothesis

Power_divergenceResult(statistic=4.0, pvalue=0.04550026389635857)

In [8]:
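# observed counts for each of the six faces of a die
obs_d = [10, 6, 8, 22, 11, 8]

# no f_exp given, so a fair die (equal expected counts) is assumed
stats.chisquare(obs_d)
# p < 0.05, so we reject the null: the die appears to be biased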

Power_divergenceResult(statistic=15.215384615384615, pvalue=0.009480606629220312)
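To make the default behaviour explicit, the sketch below builds the uniform expected counts by hand and checks that `stats.chisquare` returns the same result with and without them. The names `n_rolls` and `uniform_expected` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

obs_d = [10, 6, 8, 22, 11, 8]
n_rolls = sum(obs_d)

# A fair die spreads the rolls evenly across the six faces
uniform_expected = np.full(len(obs_d), n_rolls / len(obs_d))

default_result = stats.chisquare(obs_d)
explicit_result = stats.chisquare(obs_d, f_exp=uniform_expected)

print(default_result.statistic, explicit_result.statistic)
```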

In [15]:
shirt_sales = pd.DataFrame(
    data={
        "size": ["small", "medium", "large", "extra-large"],
        "historic_proportion_sales": [0.1, 0.2, 0.4, 0.3],
        "actual_sales": [25, 41, 91, 68]
    }
)
shirt_sales["expected_sales"] = shirt_sales["actual_sales"].sum()*shirt_sales["historic_proportion_sales"]
shirt_sales

| | size | historic_proportion_sales | actual_sales | expected_sales |
|---|---|---|---|---|
| 0 | small | 0.1 | 25 | 22.5 |
| 1 | medium | 0.2 | 41 | 45.0 |
| 2 | large | 0.4 | 91 | 90.0 |
| 3 | extra-large | 0.3 | 68 | 67.5 |


In [17]:
stats.chisquare(shirt_sales["actual_sales"].values, shirt_sales["expected_sales"].values)

# p > 0.05, so we do not reject the null: actual sales do not differ significantly from the historic proportions

Power_divergenceResult(statistic=0.648148148148148, pvalue=0.8853267818237286)
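The reported p-value is the upper-tail probability of the chi-squared distribution at the observed statistic. As a sketch, assuming the four shirt sizes give 3 degrees of freedom, it can be recovered with the survival function `scipy.stats.chi2.sf`:

```python
from scipy import stats

statistic = 0.648148148148148  # value reported by stats.chisquare above
df = 4 - 1  # four shirt sizes, so k - 1 = 3

# Probability of a statistic at least this large if the null is true
p_value = stats.chi2.sf(statistic, df)

print(p_value)
```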