# Exercise: hypothesis testing
In the previous unit, we ran two hypothesis tests. In both examples, we deliberately created a sample that differed from the underlying population. Repeat the two hypothesis tests, this time with randomly selected samples. You can keep the same sample sizes. Draw the appropriate conclusions for each test.

In [245]:
import numpy as np
import pandas as pd
import scipy.stats as stats

In [246]:
# population data
poisson1 = stats.poisson.rvs(mu=55, size=200000)
poisson2 = stats.poisson.rvs(mu=10, size=100000)

population = np.concatenate((poisson1, poisson2))
population 

array([47, 49, 48, ...,  9,  8, 10])

In [247]:
# taking a sample from population
sample = np.random.choice(population, size=100)


In [248]:
sample.mean()

41.01

In [249]:
population.mean()

39.995846666666665

### T-test

Does the sample's mean differ significantly from the population's mean?

H0 = the sample has the same mean as the population
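To make the test concrete, here is a minimal sketch of the one-sample t statistic computed by hand and compared with `stats.ttest_1samp`. It assumes a stand-in population built with `np.random.default_rng` (the seed and the use of `rng.poisson` instead of `stats.poisson.rvs` are choices made for this illustration only), so the numbers will not match the cells of this notebook.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# stand-in population with the same shape as the one used in this notebook
population = np.concatenate((
    rng.poisson(lam=55, size=200000),
    rng.poisson(lam=10, size=100000),
))
sample = rng.choice(population, size=100)

# t = (sample mean - hypothesised mean) / (sample std / sqrt(n))
n = len(sample)
standard_error = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - population.mean()) / standard_error

# two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# the library call should agree with the manual computation
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=population.mean())
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

The p-value is two-sided because H0 is about equality of means, not about the sample mean being larger or smaller.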

In [250]:
# data
data = pd.DataFrame(
    ["red"] * 50000 + ["blue"] * 30000 + ["green"] * 10000 + ["white"] * 10000
)

In [251]:
# compute the t-test
t_statistic, p_value = stats.ttest_1samp(sample, popmean=population.mean())

In [252]:
# print p_value
p_value

0.6406694594182962

In [253]:
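# check the null hypothesis: reject H0 if p_value < 0.05
print("Reject the null hypothesis:", p_value < 0.05)

Reject the null hypothesis: False

Since the p-value (0.64) is well above 0.05, we fail to reject H0: the mean of the randomly selected sample does not differ significantly from the population mean.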
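A single random sample can still fall below the 0.05 threshold by chance. The sketch below repeats the t-test on many random samples to estimate how often that happens. It assumes a stand-in population generated with `np.random.default_rng` and an arbitrary choice of 1000 repetitions; neither comes from the original notebook.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only

# stand-in population with the same shape as the one used in this notebook
population = np.concatenate((
    rng.poisson(lam=55, size=200000),
    rng.poisson(lam=10, size=100000),
))

# draw 1000 independent random samples of size 100 (one sample per row)
samples = rng.choice(population, size=(1000, 100))

# run one t-test per row against the population mean
_, p_values = stats.ttest_1samp(samples, popmean=population.mean(), axis=1)

# fraction of random samples for which H0 would be rejected at the 5% level
rejection_rate = (p_values < 0.05).mean()
print(rejection_rate)
```

When H0 holds, as it does for random samples, the rejection rate should be close to the significance level, so a small share of rejections is expected and does not indicate a problem with the sampling.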


### Chi-square test

We test the following hypothesis: 

H0 = the sample has the same distribution as the population 

In [254]:
# count the number of values for each colour in the data
data_count = pd.crosstab(index=data[0], columns="count")
data_count

col_0  count
0
blue   30000
green  10000
red    50000
white  10000


In [255]:
# setup the sample 
sample_data = data.sample(1000)

In [256]:
# count the number of values for each colour from the sample
sample_data_count = pd.crosstab(index=sample_data[0], columns="count")
sample_data_count 

col_0  count
0
blue     311
green    110
red      465
white    114


In [257]:
# compute the expected count
exepected_count = data_count * len(sample_data) / len(data)
exepected_count

col_0  count
0
blue   300.0
green  100.0
red    500.0
white  100.0


In [258]:
# compute the chi-square
chi_square = (((sample_data_count - exepected_count) ** 2) / exepected_count).sum()
chi_square

col_0
count    5.813333
dtype: float64

In [259]:
# compute the p-value (df = number of categories - 1 = 3)
p_value = 1 - stats.chi2.cdf(x=chi_square, df=3)
p_value

array([0.12105369])

In [260]:
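print("Reject the null hypothesis:", p_value < 0.05)

Reject the null hypothesis: [False]

Since the p-value (0.12) is above 0.05, we fail to reject H0: the colour distribution of the randomly selected sample is consistent with that of the population.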
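The manual chi-square computation can be cross-checked with `stats.chisquare`. The sketch below hard-codes the observed counts printed earlier in this notebook (311, 110, 465, 114) and the population proportions (0.3, 0.1, 0.5, 0.1); the variable names are chosen for this illustration.

```python
import numpy as np
import scipy.stats as stats

# observed counts from the sample, in the order blue, green, red, white
observed = np.array([311, 110, 465, 114])

# expected counts: population proportions scaled to the sample size (1000)
expected = np.array([0.3, 0.1, 0.5, 0.1]) * observed.sum()

# chisquare uses df = number of categories - 1 by default
chi_square, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi_square, p_value)

# the same p-value from the survival function, equivalent to 1 - cdf
p_sf = stats.chi2.sf(chi_square, df=len(observed) - 1)
print(p_sf)
```

Both approaches should reproduce the statistic (about 5.81) and the p-value (about 0.12) obtained step by step above.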
