# Hypothesis testing
In this activity, we will learn how to perform hypothesis testing by statistical simulation.

## Use case
Imagine you are the owner of two electronics stores. In both stores, you sell the same model of cell phone at the same price. You are curious whether the sales of this cell phone are the same at each store.

## Simulation
Let's simulate these sales with Python and the NumPy library.


In [1]:
#import numpy
import numpy as np

#set a random seed to replicate results
np.random.seed(42)

In [2]:
#since we don't have real data,
#we will simulate one year of daily sales with the np.random.normal function.

#sales history in days
history = 365

#generate one year sales data for store A
mean_A = 20
std_A = 5
shop_A_sales = np.random.normal(mean_A, std_A, history)

#generate one year sales data for store B
mean_B = 19.5
std_B = 5
shop_B_sales = np.random.normal(mean_B, std_B, history)


We will be testing the following hypotheses:

* H0: the mean of sales of shop A equals the mean of sales of shop B (i.e. the difference between the mean sales is equal to zero)
* HA: the means are not equal

Set the significance level alpha (the probability of rejecting the null hypothesis when it is true) to 0.05.

In [3]:
#set the significance level
alpha = 0.05
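To make the meaning of alpha concrete, here is a minimal sketch that repeats a small permutation test many times on data where the null hypothesis is true by construction, and counts how often it is rejected. The helper `perm_p_value`, the sample sizes, the number of experiments, and the seed are all hypothetical choices made for this illustration; it also uses NumPy's `default_rng` generator rather than the global `np.random` functions used elsewhere in this activity.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch

def perm_p_value(a, b, n_perm, rng):
    """Two-sided permutation p-value for the difference in means."""
    observed = a.mean() - b.mean()
    pooled = np.concatenate((a, b))
    # One row per permutation; each row is shuffled independently.
    shuffled = rng.permuted(np.tile(pooled, (n_perm, 1)), axis=1)
    diffs = shuffled[:, :len(a)].mean(axis=1) - shuffled[:, len(a):].mean(axis=1)
    return np.mean(np.abs(diffs) >= np.abs(observed))

alpha = 0.05
n_experiments = 300  # kept small so the sketch runs quickly
rejections = 0
for _ in range(n_experiments):
    # Both samples come from the same distribution, so H0 is true here.
    a = rng.normal(20, 5, 30)
    b = rng.normal(20, 5, 30)
    if perm_p_value(a, b, 300, rng) < alpha:
        rejections += 1

false_positive_rate = rejections / n_experiments
print("Share of true null hypotheses rejected:", false_positive_rate)
```

The printed share should land in the neighborhood of alpha, which is what "the probability of rejecting the null hypothesis when it is true" means in practice.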


In [4]:
#the means of sales for both stores are :

#print the store A mean
print("Store A mean: ", shop_A_sales.mean())

#print the store B mean
print("Store B mean: ", shop_B_sales.mean())

#the difference in the means of sales for both stores is :
observed_means_diff = shop_A_sales.mean() - shop_B_sales.mean()
print("Observed means difference: ", observed_means_diff)


Store A mean:  20.04973201106029
Store B mean:  19.309929401404304
Observed means difference:  0.7398026096559853


Because the mean of sales in store A is close to the mean of sales in store B compared with the standard deviation of daily sales (5 in both stores), it is tough to decide by eye whether the mean sales are equal.

Let's simulate what it would look like if both stores' sales were identically distributed. We can do that by combining sales data from both stores.

In [5]:
both_sales = np.concatenate((shop_A_sales, shop_B_sales))

Now we have to permute both_sales and re-create the sales of each store from the permuted data.


A permutation is a random reordering of the entries in an array.
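As a small illustration of this definition, the sketch below shuffles a toy array (the values are made up for the example) and checks that the permutation contains exactly the same entries in a different order, so the overall mean is unchanged.

```python
import numpy as np

# Toy data, invented for this illustration.
values = np.array([10, 20, 30, 40, 50])

# np.random.permutation returns a shuffled copy; the input is left untouched.
shuffled = np.random.permutation(values)

print("Original:", values)
print("Shuffled:", shuffled)

# Sorting the shuffled copy recovers the original entries.
same_entries = np.array_equal(np.sort(shuffled), values)
print("Same entries:", same_entries)

# The overall mean does not depend on the order of the entries.
same_mean = shuffled.mean() == values.mean()
print("Same mean:", same_mean)
```

Only the assignment of values to positions changes, which is why splitting a permuted array into two halves gives a fair picture of what the two stores would look like if store labels did not matter.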


In [6]:
#shuffle the combined sales data aka permutation
sales_perm = np.random.permutation(both_sales)

# permutation replicates
#split the permuted sales data into two parts
perm_shop_A_sales = sales_perm[:len(shop_A_sales)]
perm_shop_B_sales = sales_perm[len(shop_A_sales):]


After this step, we have to compute the difference between the permutation replicates means.

In [7]:
#print the difference between the permutation replicates means
print("Permutation replicates means difference: ", perm_shop_A_sales.mean() - perm_shop_B_sales.mean())

Permutation replicates means difference:  0.21098789154327235


We can see that the difference between the permutation replicates means (about 0.21) is not the same as the observed difference between the original sales means (about 0.74). But this was only one permutation. Let's try 10,000 different permutations and store the differences of the permutation replicates means in a list.

In [8]:
#create an empty list to store the differences of the permutation replicates means
perm_repl_means = []

#generate 10000 permutation replicates
for i in range(10000):
    #permutation
    sales_perm = np.random.permutation(both_sales)
    
    #split the permuted sales data into two parts
    perm_shop_A_sales = sales_perm[:len(shop_A_sales)]
    perm_shop_B_sales = sales_perm[len(shop_A_sales):]
    
    #permutation replicates mean
    perm_repl_mean = perm_shop_A_sales.mean() - perm_shop_B_sales.mean()
    
    #append the permutation replicates mean to the list
    perm_repl_means.append(perm_repl_mean)
    

The last thing that remains is to compute the p-value.

Note: The p-value is the probability of observing a test statistic as extreme or more extreme than the one you've observed, given that the null hypothesis is true.
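The whole procedure (pool, shuffle, split, compare) can be wrapped in one reusable function. The following is a minimal sketch, not a definitive implementation: the function name `permutation_p_value` and its parameters are hypothetical, and it uses NumPy's `default_rng` generator instead of the global seed used in this activity.

```python
import numpy as np

def permutation_p_value(sample_a, sample_b, n_permutations=10_000, seed=None):
    """Two-sided permutation p-value for the difference in means."""
    rng = np.random.default_rng(seed)
    sample_a = np.asarray(sample_a, dtype=float)
    sample_b = np.asarray(sample_b, dtype=float)

    # Test statistic on the data as observed.
    observed = sample_a.mean() - sample_b.mean()

    # Under H0 the group labels are interchangeable, so pool the samples.
    pooled = np.concatenate((sample_a, sample_b))
    n_a = len(sample_a)

    diffs = np.empty(n_permutations)
    for i in range(n_permutations):
        shuffled = rng.permutation(pooled)
        diffs[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()

    # Share of replicates at least as extreme as the observed difference.
    return np.mean(np.abs(diffs) >= np.abs(observed))

# Two extreme cases with made-up data:
identical = np.arange(50)
p_identical = permutation_p_value(identical, identical.copy(), 2000, seed=0)
p_separated = permutation_p_value(identical, identical + 100, 2000, seed=0)
print("Identical samples:      p =", p_identical)
print("Well-separated samples: p =", p_separated)
```

When the two samples are identical, the observed difference is zero and every replicate is at least as extreme, so the p-value is 1. When the samples do not overlap at all, essentially no shuffle reproduces such a large gap, so the p-value is 0 for any realistic number of permutations.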


In [9]:
# compute the p-value
p = np.sum(np.abs(perm_repl_means) >= np.abs(observed_means_diff)) / len(perm_repl_means)

# print the p-value
print("p-value: ", p)

p-value:  0.041


The p-value tells us that if sales in both stores were identically distributed, there would be about a 4.1% chance of getting a difference of means at least as extreme as the one we observed.

In [10]:
#finally, we can make a decision based on the p-value
if p < alpha:
    print("We reject the null hypothesis aka H0. There is a significant difference between the means of sales for both stores.")
else:
    print("We fail to reject the null hypothesis aka H0. There is no significant difference between the means of sales for both stores.")

We reject the null hypothesis aka H0. There is a significant difference between the means of sales for both stores.


Because the p-value is smaller than our significance level alpha, we reject the null hypothesis that the mean cell phone sales are equal in both stores.



Think about where we used the central limit theorem in this exercise.

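In every loop, we made a new permutation of the sales data and split it to calculate two new sample means (x̄). Each of these means averages 365 daily values, so by the central limit theorem the means, and the differences between them, are approximately normally distributed.
The 10,000 differences do not estimate the true population mean (μ; σ denotes the standard deviation). They approximate the distribution of the difference in means when the null hypothesis is true, which is centered near zero and is what we compare the observed difference against.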
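One way to see this connection is to compare the spread of the permutation replicates with the standard error that a normal approximation would predict, `s * sqrt(2 / n)`, where `s` is the standard deviation of the pooled data and `n` is the number of days per store. The sketch below does this on freshly simulated data; the seed, the number of permutations, and the variable names are assumptions made for the illustration, and it uses `default_rng` rather than the global seed from earlier cells.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for this sketch
n = 365

# Two stores with the same distribution of daily sales, pooled together.
pooled = np.concatenate((rng.normal(20, 5, n), rng.normal(20, 5, n)))

# Differences of permutation replicate means.
diffs = np.empty(5000)
for i in range(diffs.size):
    shuffled = rng.permutation(pooled)
    diffs[i] = shuffled[:n].mean() - shuffled[n:].mean()

# Standard error of the difference in means for two groups of size n.
predicted_se = pooled.std(ddof=1) * np.sqrt(2 / n)

# For a normal distribution, about 95% of values lie within 1.96 standard errors.
share_within = np.mean(np.abs(diffs) <= 1.96 * predicted_se)

print("Mean of replicate differences:", diffs.mean())
print("Std of replicate differences: ", diffs.std())
print("Predicted standard error:     ", predicted_se)
print("Share within 1.96 SE:         ", share_within)
```

If the replicates behave as the central limit theorem suggests, the two spread figures should be close and roughly 95% of the replicates should fall within 1.96 predicted standard errors of zero.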