In [1]:
import numpy as np
from scipy import stats

# 1. Test a claim - significance test
## 1.1. One sample
### 1.1.1. Mean (quantitative data)
Water is unsafe when the mean bacteria level is above 88. The null hypothesis is that this water is safe. We reject it if the one-sided P-value is below 0.05.

H0: m = 88

Ha: m > 88

In [2]:
bac = np.array([248, 37, 146, 19, 66, 236, 164, 30, 13, 144, 242, 20])

In [3]:
t_statistic, P_value = stats.ttest_1samp(bac, 88)

In [4]:
P_value = P_value / 2  # one-sided P; halving the two-sided P is valid here because t > 0
print('t =', t_statistic)
print('P =', P_value)

t = 0.9499095403173593
P = 0.18128112986496991


P > 0.05, so we cannot reject H0. The sample does not provide evidence that the mean bacteria level exceeds 88.
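
The halving step can be avoided by asking SciPy for the one-sided test directly. The sketch below assumes SciPy 1.6 or later, where `ttest_1samp` accepts an `alternative` argument; the names `t_two`, `p_two`, `t_one` and `p_one` are introduced here for illustration only.

```python
import numpy as np
from scipy import stats

bac = np.array([248, 37, 146, 19, 66, 236, 164, 30, 13, 144, 242, 20])

# Two-sided test, as in the cell above
t_two, p_two = stats.ttest_1samp(bac, 88)

# One-sided test for Ha: m > 88 (assumes SciPy >= 1.6)
t_one, p_one = stats.ttest_1samp(bac, 88, alternative='greater')

print('t =', t_one)
print('P =', p_one)
```

Unlike manual halving, `alternative='greater'` also gives the correct one-sided P-value when the t statistic is negative.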

### 1.1.2. Proportion (categorical data)
We have two sheets with pictures of dogs and dog owners. On one sheet, owners and dogs are matched correctly. On the other sheet, they are matched at random.

61 students are asked to guess which sheet shows the real matches, and 49 of them guess correctly. If they were guessing at random, we would expect 50% to be right. Is this evidence that the students are doing better than plain guessing? (Significance level: 0.05.)

H0: p = 0.5

Ha: p > 0.5

In [5]:
# TODO: look for a simpler solution.

p0 = 0.5
n = 61
succ = 49
p_hat = succ / n

In [6]:
z_stat = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5
P_value = 1 - stats.norm.cdf(z_stat)

In [7]:
print('P =', P_value)

P = 1.082577341104951e-06


P < 0.05 so we reject H0. Students are doing better than random guessing.
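
One possible simpler route is an exact binomial test, which needs no manual z statistic. This is a sketch assuming SciPy 1.7 or later, where `stats.binomtest` is available; it is an exact test rather than the normal approximation used above, so its P-value will not match the one above digit for digit. The names `result` and `P_approx` are introduced here for illustration.

```python
from scipy import stats

p0 = 0.5
n = 61
succ = 49

# Exact one-sided binomial test for Ha: p > 0.5 (assumes SciPy >= 1.7)
result = stats.binomtest(succ, n, p0, alternative='greater')
print('exact P =', result.pvalue)

# Normal approximation as above, using the survival function
# norm.sf(z), which equals 1 - norm.cdf(z) but is more accurate in the far tail
p_hat = succ / n
z_stat = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5
P_approx = stats.norm.sf(z_stat)
print('approximate P =', P_approx)
```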

## 1.2. Two samples
### 1.2.1. Compare means (quantitative data)
Is it true that lean people spend more time standing and walking than obese people do?

Collect daily minutes spent standing/walking from 10 lean and 10 obese people. Calculate the average daily minutes for these two groups.

- H0: averages are equal
- Ha: averages are not equal

If P < 0.05, H0 will be rejected.

In [8]:
lean  = np.array([511.1, 607.925, 319.212, 584.644, 578.869, 543.388, 677.188, 555.656, 374.831, 504.7])
obese = np.array([260.244, 464.756, 367.138, 413.667, 347.375, 416.531, 358.65, 267.344, 410.631, 426.356])
lean.mean() - obese.mean()

152.4821

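The sample averages differ by about 152 minutes. The t-test below checks whether this difference is statistically significant.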

In [9]:
# Student's t-test: two-sided and pooled-variance by default
t_stat, p_val = stats.ttest_ind(lean, obese)

In [10]:
print(t_stat, p_val)

3.8083756040290306 0.0012871405333009754


P < 0.05, so we reject H0. Because the sample difference is positive (t > 0), the data indicate that lean people do spend more time standing and walking.
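
Since the question is directional and nothing guarantees equal variances in the two groups, a one-sided Welch test is a reasonable variant. The sketch below assumes SciPy 1.6 or later for the `alternative` argument; `t_welch` and `p_welch` are names introduced here for illustration.

```python
import numpy as np
from scipy import stats

lean = np.array([511.1, 607.925, 319.212, 584.644, 578.869,
                 543.388, 677.188, 555.656, 374.831, 504.7])
obese = np.array([260.244, 464.756, 367.138, 413.667, 347.375,
                  416.531, 358.65, 267.344, 410.631, 426.356])

# Welch's t-test: equal_var=False drops the equal-variance assumption.
# alternative='greater' tests Ha: mean(lean) > mean(obese) (assumes SciPy >= 1.6).
t_welch, p_welch = stats.ttest_ind(lean, obese, equal_var=False,
                                   alternative='greater')
print(t_welch, p_welch)
```

Because both groups have the same number of observations, the Welch t statistic equals the pooled one; only the degrees of freedom, and therefore the P-value, differ.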

### 1.2.2. Compare proportions (categorical data)
A false event was shown to a large sample of test subjects, who were asked if they "remember" it. 616 subjects self-identified as progressive, 49 self-identified as conservative.

Of the progressive group, 212 "remembered" the event. Of the conservative group, 7 "remembered" it. Is this strong evidence that a larger proportion of progressives have this false memory?

- H0: p1 = p2
- Ha: p1 > p2 (p1: progressives, p2: conservatives)

If P < 0.05, H0 will be rejected.

In [11]:
n1 = 616
n2 = 49
s1 = 212
s2 = 7

p1 = s1 / n1
p2 = s2 / n2
p = (s1 + s2) / (n1 + n2)
z = (p1 - p2) / (p * (1 - p) * (1 / n1 + 1 / n2)) ** 0.5

In [12]:
P = 1 - stats.norm.cdf(z)  # one-sided (upper-tail) P-value

In [13]:
print('P =', P)

P = 0.0019527407222336146


P < 0.05 so we reject H0. Progressives are more likely to have this false memory.
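
The calculation above can be wrapped in a small reusable function. This is a minimal sketch: `two_proportion_ztest` is a hypothetical helper written for this notebook, not a SciPy function, and it assumes the one-sided alternative p1 > p2 with a pooled standard error.

```python
from scipy import stats

def two_proportion_ztest(s1, n1, s2, n2):
    """Pooled z-test of H0: p1 = p2 against Ha: p1 > p2 (hypothetical helper)."""
    p1 = s1 / n1
    p2 = s2 / n2
    # Pooled proportion under H0
    p_pool = (s1 + s2) / (n1 + n2)
    se = (p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) ** 0.5
    z = (p1 - p2) / se
    # Upper-tail probability of the standard normal distribution
    return z, stats.norm.sf(z)

z, P = two_proportion_ztest(212, 616, 7, 49)
print('z =', z)
print('P =', P)
```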