# Assn07:  Hypothesis testing

Refer to the Lecture07 notes (available on PandA) when completing this assignment.

In [1]:
import numpy as np
from scipy import stats

___

## Questions


**Question**: Are a one-sample t test and a paired t test equivalent?  Why or why not?

**Answer**: They are equivalent.

We need to check that $\bar{y} - \bar{\mu}$ equals $\bar{d}$, the mean of the differences $d_i = y_i - \mu_i$.

First, we write $\bar{y}$ as the sum of all $y_i$ divided by $n$:
$$\bar{y} = \frac{y_1 + y_2 + ... + y_n}{n}$$

Similarly, we write $\bar{\mu}$ as the sum of all $\mu_i$ divided by $n$:
$$\bar{\mu} = \frac{\mu_1 + \mu_2 + ... + \mu_n}{n}$$

Then we subtract and pair each $y_i$ with its $\mu_i$:
$$\bar{y} - \bar{\mu} = \frac{(y_1 + y_2 + ... + y_n)-(\mu_1 + \mu_2 + ... + \mu_n)}{n}$$
$$\bar{y} - \bar{\mu} = \frac{(y_1 - \mu_1)+ (y_2 - \mu_2) + ... + (y_n-\mu_n)}{n}$$

This is the sum of the differences $y_i - \mu_i$ divided by $n$, which is exactly $\bar{d}$:
$$\bar{y} - \bar{\mu} = \frac{(y_1 - \mu_1)+ (y_2 - \mu_2) + ... + (y_n-\mu_n)}{n} = \bar{d}$$

So a paired t test is the same as a one-sample t test applied to the differences $d_i$ with a hypothesized mean of 0.
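
As a quick numerical check of the argument above, the following sketch runs a paired t test and a one-sample t test on the pairwise differences (hypothesized mean 0). It reuses the paired example data from the "Paired t test" task later in this assignment; the variable names are illustrative only.

```python
import numpy as np
from scipy import stats

# Paired example data (same values as the "Paired t test" task)
y_pre  = np.array([3, 0, 6, 7, 4, 3, 2, 1, 4])
y_post = np.array([5, 1, 5, 7, 10, 9, 7, 11, 8])

# Paired t test on the two samples
t_paired, p_paired = stats.ttest_rel(y_pre, y_post)

# One-sample t test on the differences, with hypothesized mean 0
d = y_pre - y_post
t_one, p_one = stats.ttest_1samp(d, 0)

print("paired:     t=%.3f p=%.3f" % (t_paired, p_paired))
print("one-sample: t=%.3f p=%.3f" % (t_one, p_one))
```

Both lines should report the same t and p values, which is what the derivation predicts.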

___

**Question**: What is the difference between two-tailed and one-tailed results?

**Answer**: A one-tailed test considers only one direction of difference, for example only the case where the mean of the first dataset is higher than that of the other. It does not consider the opposite case, where the mean of the first dataset is lower. A two-tailed test is typically more general because it considers both cases.
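
To make the distinction concrete, here is a minimal sketch that computes one-tailed and two-tailed p values from the same t statistic using the t distribution in `scipy.stats`. It uses the one-sample example data from later in this assignment and assumes a hypothesized mean of 0; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

y = np.array([23, 15, -5, 7, 1, -10, 12, -8, 20, 8, -2, -5])
n = len(y)
df = n - 1                                          # degrees of freedom
t = np.mean(y) / (np.std(y, ddof=1) / np.sqrt(n))   # one-sample t, mu = 0 assumed

p_greater = stats.t.sf(t, df)        # one-tailed: is the mean greater than 0?
p_less    = stats.t.cdf(t, df)       # one-tailed: is the mean less than 0?
p_two     = 2 * stats.t.sf(abs(t), df)   # two-tailed: is the mean different from 0?

print("one-tailed (greater): p=%.3f" % p_greater)
print("one-tailed (less):    p=%.3f" % p_less)
print("two-tailed:           p=%.3f" % p_two)
```

The two-tailed p value is twice the smaller of the two one-tailed values, so a one-tailed test in the matching direction reaches significance more easily than a two-tailed test.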

___

## t statistic calculations

___

**Task**: Use `np.mean` and `np.std` to compute the t statistic value for the "One-sample t test" example from the Lecture07 notes (see Equation 1).

**Note!**  The function `np.std` uses `ddof=0` by default; for hypothesis testing you should use `np.std(y, ddof=1)`.
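
The effect of `ddof` can be seen directly. The sketch below, which assumes the one-sample example data used in the next cell, compares the default `ddof=0` (divide by n) with `ddof=1` (divide by n - 1) and reproduces the latter by hand.

```python
import numpy as np

y = np.array([23, 15, -5, 7, 1, -10, 12, -8, 20, 8, -2, -5])
n = len(y)

std_ddof0 = np.std(y)            # default: divides the squared deviations by n
std_ddof1 = np.std(y, ddof=1)    # sample standard deviation: divides by n - 1

# Manual computation of the sample standard deviation for comparison
std_manual = np.sqrt(np.sum((y - np.mean(y)) ** 2) / (n - 1))

print("ddof=0: %.4f" % std_ddof0)
print("ddof=1: %.4f" % std_ddof1)
print("manual: %.4f" % std_manual)
```

With `ddof=0` the standard deviation is slightly smaller, which would inflate the t statistic.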

In [13]:
# (Enter relevant calculations here)

import numpy as np
from scipy import stats
y       = np.array([23, 15, -5, 7, 1, -10, 12, -8, 20, 8, -2, -5]) 
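mean = np.mean(y)
std = np.std(y, ddof = 1)   # sample standard deviation
# One-sample t statistic (this formula implicitly uses a hypothesized mean of 0)
t = mean/(std/(len(y))**0.5)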
print('t=',t)

t= 1.4492553137533357
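
The hand-computed value can be cross-checked against `scipy.stats.ttest_1samp`. This sketch assumes a hypothesized mean of 0, which is what the manual formula above implies.

```python
import numpy as np
from scipy import stats

y = np.array([23, 15, -5, 7, 1, -10, 12, -8, 20, 8, -2, -5])

# Manual t statistic, as in the cell above
t_manual = np.mean(y) / (np.std(y, ddof=1) / len(y) ** 0.5)

# Same test via SciPy, with hypothesized mean 0 (assumption)
t_scipy, p_scipy = stats.ttest_1samp(y, 0)

print("manual t=%.6f" % t_manual)
print("scipy  t=%.6f p=%.3f" % (t_scipy, p_scipy))
```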


___

**Task**: Use `np.mean` and `np.std` to compute the t statistic value for the "Paired t test" example from the Lecture07 notes (see Equation 2).

**Note!**  The function `np.std` uses `ddof=0` by default; for hypothesis testing you should use `np.std(y, ddof=1)`.

In [24]:
# (Enter relevant calculations here)
y_pre  = np.array( [3, 0, 6, 7, 4, 3, 2, 1, 4] )
y_post = np.array( [5, 1, 5, 7, 10, 9, 7, 11, 8] )
y_d = y_pre - y_post
print(y_d)
dmean = np.mean(y_d)
dstd = np.std(y_d, ddof = 1)
print("t=", dmean/(dstd/(len(y_d))**0.5))

[ -2  -1   1   0  -6  -6  -5 -10  -4]
t= -3.1428571428571423
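
The sign of the paired t statistic depends only on the order of subtraction. The following sketch, using a hypothetical helper `paired_t`, shows that swapping the order flips the sign while keeping the magnitude, and that `scipy.stats.ttest_rel(a, b)` follows the `a - b` convention.

```python
import numpy as np
from scipy import stats

y_pre  = np.array([3, 0, 6, 7, 4, 3, 2, 1, 4])
y_post = np.array([5, 1, 5, 7, 10, 9, 7, 11, 8])

def paired_t(a, b):
    """t statistic for the paired differences a - b (illustrative helper)."""
    d = a - b
    return np.mean(d) / (np.std(d, ddof=1) / np.sqrt(len(d)))

t_pre_post = paired_t(y_pre, y_post)   # pre - post, as in the cell above
t_post_pre = paired_t(y_post, y_pre)   # post - pre: same magnitude, opposite sign

# SciPy's paired test uses the first argument minus the second
t_scipy, p_scipy = stats.ttest_rel(y_pre, y_post)

print("pre - post: t=%.3f" % t_pre_post)
print("post - pre: t=%.3f" % t_post_pre)
print("scipy:      t=%.3f p=%.3f" % (t_scipy, p_scipy))
```

The two-tailed p value is unaffected by the order, so the conclusion of the test does not change.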


___

**Task**: Use `np.mean` and `np.std` to compute the t statistic value for the "Two-sample t test" example from the Lecture07 notes (see Equation 3).

**Note!**  The function `np.std` uses `ddof=0` by default; for hypothesis testing you should use `np.std(y, ddof=1)`.

In [43]:
# (Enter relevant calculations here)

beginning = np.array( [3067, 2730, 2840, 2913, 2789] )
end       = np.array( [3200, 2777, 2623, 3044, 2834] )

bmean = np.mean(beginning)
emean = np.mean(end)

bstd = np.std(beginning, ddof = 1)
estd = np.std(end, ddof = 1)

nb = len(beginning)
ne = len(end)

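# Pooled standard deviation: weight each sample variance by its degrees of freedom
d = (nb-1)*(bstd**2) + (ne-1)*(estd**2)
std = (d/(ne+nb-2))**0.5
# Two-sample t statistic (equal variances assumed)
t = (bmean - emean)/(std*(1/nb + 1/ne)**0.5)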

print('t=',t)

t= -0.2372742730908139
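
The pooled-variance calculation above should agree with `scipy.stats.ttest_ind`, which assumes equal variances by default (`equal_var=True`). A minimal cross-check, with illustrative variable names:

```python
import numpy as np
from scipy import stats

beginning = np.array([3067, 2730, 2840, 2913, 2789])
end       = np.array([3200, 2777, 2623, 3044, 2834])

nb, ne = len(beginning), len(end)

# Pooled variance from the two sample variances (ddof=1)
pooled_var = ((nb - 1) * np.var(beginning, ddof=1)
              + (ne - 1) * np.var(end, ddof=1)) / (nb + ne - 2)

# Manual two-sample t statistic
t_manual = (np.mean(beginning) - np.mean(end)) / np.sqrt(pooled_var * (1 / nb + 1 / ne))

# SciPy's independent two-sample test with pooled variance (the default)
t_scipy, p_scipy = stats.ttest_ind(beginning, end)

print("manual t=%.6f" % t_manual)
print("scipy  t=%.6f p=%.3f" % (t_scipy, p_scipy))
```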


___

## Dataset analyses

___

**Task**: Use Python to conduct t tests for the datasets at the following links. Then fill in the table below.

* [Dataset 1](https://www.youtube.com/watch?v=OHHhzLHakKA) (StatisticsHowTo)
* [Dataset 2](http://www.statstutor.ac.uk/resources/uploaded/paired-t-test.pdf) (StatsTutor.co.uk)
* [Dataset 3](https://www.statsdirect.com/help/parametric_methods/single_sample_t.htm) (StatsDirect.com)
* [Dataset 4](https://www.youtube.com/watch?v=Q0V7WpzICI8) (MarinStatsLectures)
* [Dataset 5](http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/SAS/SAS4-OneSampleTtest/SAS4-OneSampleTtest6.html) (Boston University)



**Answer**: (Enter your answers in the table below. Report t and p values using 3 decimals. For example, use "0.000", not "0.00012345".)

<br>

Dataset | Test type | t | p | H0 rejected?
------------ | ------------ | ------------- | ------------- | -------------
<img width=50/> | <img width=100/> | <img width=100/> | <img width=100/> | <img width=100/>
1      | Two-sample |    -0.283      |     0.779       | No
2      | Paired t   |    -3.231      |     0.004       | Yes
3      | One-sample |     4.512      |     0.000       | Yes
4      | Paired t   |     2.340      |     0.041       | Yes
5      | One-sample |     7.719      |     0.000       | Yes


In [16]:
# (Enter relevant calculations here)
import numpy as np
from scipy import stats

D1old = np.array([44,49,56,51,38,44,61,51,49,60,39,51,43,37,45])
D1new = np.array([51,42,37,45,47,65,49,69,38,44,49,56,51,50,38])
t1,p1    = stats.ttest_ind(D1old, D1new)
print("t1=%.3f"%t1,"p1=%.3f"%p1)

D2old = np.array([18,21,16,22,19,24,17,21,23,18,14,16,16,19,18,20,12,22,15,17])
D2new = np.array([22,25,17,24,16,29,20,23,19,20,15,15,18,26,18,24,18,25,19,16])
t2,p2 = stats.ttest_rel(D2old, D2new)
print("t2=%.3f"%t2,"p2=%.3f"%p2)

D3dat = np.array([128,127,118,115,144,142,133,140,132,131,111,132,149,122,139,119,136,129,126,128])
mu3 = 120
t3,p3 = stats.ttest_1samp(D3dat, mu3)
print("t3=%.3f"%t3,"p3=%.3f"%p3)

D4old = np.array([135,142,137,122,147,151,131,117,154,143,133])
D4new = np.array([127,145,131,125,132,147,119,125,132,139,122])
t4,p4 = stats.ttest_rel(D4old, D4new)
print("t4=%.3f"%t4,"p4=%.3f"%p4)

D5old = np.array([240,243,250,254,264,279,284,285,290,298,302,310,312,315,322,337,348,384,386,520])
D5new = np.array([209,209,173,165,239,270,274,254,223,209,219,281,251,208,227,269,299,238,285,325])
mu5 = 200
t5,p5 = stats.ttest_1samp(D5old, mu5)
print('t5=%.3f'%t5,"p5=%.3f"%p5)

t1=-0.283 p1=0.779
t2=-3.231 p2=0.004
t3=4.512 p3=0.000
t4=2.340 p4=0.041
t5=7.719 p5=0.000
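
The "H0 rejected?" column follows from comparing each p value with a significance level. The sketch below shows one way to build a table row from a test result. The helper `table_row` is hypothetical, and the significance level of 0.05 is an assumption, since the assignment text does not state one. Dataset 3 is used as the example.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # assumed significance level; not stated in the assignment text

def table_row(dataset, test_type, t, p, alpha=ALPHA):
    """Format one results-table row (hypothetical helper for illustration)."""
    rejected = "Yes" if p < alpha else "No"
    return "%d | %s | %.3f | %.3f | %s" % (dataset, test_type, t, p, rejected)

# Dataset 3: one-sample t test against a hypothesized mean of 120
D3dat = np.array([128, 127, 118, 115, 144, 142, 133, 140, 132, 131,
                  111, 132, 149, 122, 139, 119, 136, 129, 126, 128])
t3, p3 = stats.ttest_1samp(D3dat, 120)

row3 = table_row(3, "One-sample", t3, p3)
print(row3)
```

Generating the rows from the computed values avoids transcription slips when copying numbers into the table by hand.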


___

# BONUS

(No bonus problem for this assignment)