# Hypothesis Tests Implementation using Python

In [27]:
import numpy as np
from scipy.stats import chi2_contingency, ttest_1samp
from statsmodels.stats.weightstats import ztest

### Chi-Squared Test
Tests whether two categorical variables are related or independent.

Assumptions:

- Observations used in the calculation of the contingency table are independent.
- A common rule of thumb is an expected count of at least 5 in each cell of the contingency table.


Interpretation:

- H0: the two variables are independent.
- H1: there is a dependency between the two variables.

Test statistic:

$$ \chi^2 = \sum_{i,j}\frac{(O_{i,j} - \hat{E}_{i,j})^2}{\hat{E}_{i,j}} $$

In [32]:
# contingency table
data = np.array([[120, 90, 40],
        [110, 95, 45]])
print(data)
stat, p, dof, expected = chi2_contingency(data)
print('stat = %.3f, p = %.3f' % (stat, p))
if p > 0.05:
	print('Failed to reject H0: Samples are independent')
else:
	print('H0 Rejected: Samples are dependent')

[[120  90  40]
 [110  95  45]]
stat = 0.864, p = 0.649
Failed to reject H0: Samples are independent


Since the p-value (0.649) is greater than 0.05, we fail to reject the null hypothesis: the data do not provide evidence of a dependency between the two variables.
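The test statistic can also be reproduced by hand, which makes the formula concrete. The following is a minimal sketch, assuming the same contingency table as above; the names `observed`, `stat_manual`, and `p_manual` are illustrative and not part of the original notebook. Expected counts are computed as (row total × column total) / grand total, and the p-value comes from the chi-squared distribution with (rows − 1) × (columns − 1) degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Same contingency table as in the cell above
observed = np.array([[120, 90, 40],
                     [110, 95, 45]])

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
n = observed.sum()
expected = row_totals * col_totals / n             # broadcasts to shape (2, 3)

# Chi-squared statistic: sum over all cells of (O - E)^2 / E
stat_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom and upper-tail p-value
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = chi2.sf(stat_manual, dof)

# Cross-check against SciPy
stat_scipy, p_scipy, dof_scipy, expected_scipy = chi2_contingency(observed)

print('manual: stat = %.3f, p = %.3f, dof = %d' % (stat_manual, p_manual, dof))
print('scipy:  stat = %.3f, p = %.3f, dof = %d' % (stat_scipy, p_scipy, dof_scipy))
print('smallest expected count = %.1f' % expected.min())
```

Both lines should agree with the `stat = 0.864, p = 0.649` reported above. The smallest expected count is a quick way to check the cell-size assumption before trusting the result.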

### T-Test
Tests whether the sample mean is statistically different from a known or hypothesised population mean.

Assumptions:
- The population is approximately normally distributed
- The observations in the sample are independent of each other
- The population standard deviation is not known, so the sample standard deviation is used

Interpretation:
- H0: The population mean is equal to the assumed mean
- H1: The population mean is not equal to the assumed mean

Test statistic:
$$T = \frac{\bar{X} - \mu}{\frac{s}{\sqrt{N}}}$$

where $\mu$ is the assumed mean and $s$ is the sample standard deviation.

In [34]:
# Draw 20 random integers from 20 to 29 (the upper bound 30 is exclusive)
data = np.random.randint(20, 30, size=20)
print("The data randomly distributed between 20 and 30 are: ", data)

The data randomly distributed between 20 and 30 are:  [29 29 23 28 27 29 29 23 23 27 24 28 25 22 22 27 21 25 26 22]


In [35]:
data_mean = np.mean(data)
print("Mean of the data: ", data_mean)
print("Testing the Sample with an assumed mean of 25")
# One-sample, two-sided t-test against the assumed mean
tstat, pval = ttest_1samp(data, 25)
print('stat = %.3f, p = %.3f' % (tstat, pval))
if pval > 0.05:
	print('Failed to reject H0: The known assumed mean is equal to the expected sample mean')
else:
	print('H0 Rejected: The known assumed mean is not equal to the expected sample mean')

Mean of the data:  25.45
Testing the Sample with an assumed mean of 25
stat = 0.724, p = 0.478
Failed to reject H0: The known assumed mean is equal to the expected sample mean
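To connect the result to the test statistic, the sketch below recomputes it by hand from the 20 values printed above, using the sample standard deviation (`ddof=1`) in the denominator. The variable names are illustrative assumptions; the sample is copied from the output so the numbers are reproducible even though the original cell draws random data.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

# Sample copied from the printed output above
sample = np.array([29, 29, 23, 28, 27, 29, 29, 23, 23, 27,
                   24, 28, 25, 22, 22, 27, 21, 25, 26, 22])
assumed_mean = 25

n = sample.size
sample_mean = sample.mean()
sample_sd = sample.std(ddof=1)            # sample standard deviation (divides by n - 1)
standard_error = sample_sd / np.sqrt(n)

# t statistic and two-sided p-value with n - 1 degrees of freedom
t_manual = (sample_mean - assumed_mean) / standard_error
p_manual = 2 * t.sf(abs(t_manual), df=n - 1)

# Cross-check against SciPy
t_scipy, p_scipy = ttest_1samp(sample, assumed_mean)

print('manual: stat = %.3f, p = %.3f' % (t_manual, p_manual))
print('scipy:  stat = %.3f, p = %.3f' % (t_scipy, p_scipy))
```

Both lines should match the `stat = 0.724, p = 0.478` shown above.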


### Z-Test
Tests whether a sample mean differs from a known or hypothesised population mean when the population variance is known and the sample size is large (n >= 30).

Assumptions:
- The observations in the sample are independent of each other
- The population variance is known
- The sample size is large

Interpretation:
- H0: The population mean is equal to the assumed mean
- H1: The population mean is larger than the assumed mean
 
Test statistic:
$$Z = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{N}}}$$

In [60]:
# Generate a random sample of 50 values from a normal distribution with mean 100 and sd 15
data = np.random.normal(100, 15, size=50)

print("The 50 values randomly sampled from a normal distribution (mean 100, sd 15) are: \n", data)
# Print the sample mean and standard deviation (np.std uses ddof=0 by default)
print('mean = %.2f std = %.2f' % (np.mean(data), np.std(data)))

The 50 values randomly sampled from a normal distribution (mean 100, sd 15) are: 
 [105.72079612  77.29111111 101.64576895 101.60891679  77.18732548
  98.66849317  85.36956341  90.96426974 117.130895   105.87871293
 102.05520568  91.12885397 101.57850652 101.15981363 112.06194031
  84.59467265 111.29664749  98.98304705 126.78205818 121.81068045
  98.00456923 117.16212387 105.89468253 105.15019511 119.63815283
  82.50258047  76.82717293  65.65051216  98.30898489 106.90534966
 112.74943883  88.87732915  93.90471398  99.47712548  81.57373683
  84.29942799 117.70983087  90.13335387 109.49975705  97.68070926
 104.07899638 122.75635205  97.39811195  79.91255871 103.39248619
  73.36506844 111.39959981 110.44923095 119.2030554  114.20915758]
mean = 100.02 std = 14.28


In [61]:
assumed_mean = 100
# One-sided test: alternative='larger' means H1 is "population mean > assumed_mean"
z_score, p_value = ztest(data, value=assumed_mean, alternative='larger')
print('stat = %.3f, p = %.3f' % (z_score, p_value))

if p_value > 0.05:
	print('Failed to reject H0: The known assumed mean is equal to the expected sample mean')
else:
	print('H0 Rejected: The population mean is larger than the known assumed mean')

stat = 0.010, p = 0.496
Failed to reject H0: The known assumed mean is equal to the expected sample mean
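The z statistic can also be computed directly from the formula. The sketch below uses a hypothetical helper, `one_sample_z` (not part of statsmodels or the original notebook), that takes a known population standard deviation when one is supplied and otherwise falls back to the sample standard deviation with `ddof=1`, which appears to be what `ztest` estimates from the data. The one-sided p-value is the upper tail of the standard normal distribution, matching `alternative='larger'`. A seeded generator is assumed so the run is repeatable; its values will differ from the sample printed above.

```python
import numpy as np
from scipy.stats import norm

def one_sample_z(sample, mu0, sigma=None):
    """Return (z, p) for a one-sided test of H1: population mean > mu0.

    sigma is the known population standard deviation; if it is None,
    the sample standard deviation (ddof=1) is used as an estimate.
    """
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    if sigma is None:
        sigma = sample.std(ddof=1)
    z = (sample.mean() - mu0) / (sigma / np.sqrt(n))
    p = norm.sf(z)   # upper-tail probability of the standard normal
    return z, p

# Illustrative data: seeded so repeated runs give the same sample
rng = np.random.default_rng(0)
sample = rng.normal(100, 15, size=50)

z_known, p_known = one_sample_z(sample, mu0=100, sigma=15)   # sigma treated as known
z_est, p_est = one_sample_z(sample, mu0=100)                 # sigma estimated from the sample

print('known sigma:     stat = %.3f, p = %.3f' % (z_known, p_known))
print('estimated sigma: stat = %.3f, p = %.3f' % (z_est, p_est))
```

With 50 observations the two versions usually give similar results, because the sample standard deviation is a reasonable estimate of the population value at that sample size.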
