### Lab: Hypothesis Testing
It is assumed that the mean systolic blood pressure is μ = 120 mm Hg. In the Honolulu Heart Study, a sample of n = 100 people had an average systolic blood pressure of 130.1 mm Hg with a standard deviation of 21.21 mm Hg. Is the group significantly different (with respect to systolic blood pressure!) from the regular population?

- Set up the hypothesis test.
- Write down all the steps followed for setting up the test.
- Calculate the test statistic by hand and also code it in Python. It should be 4.76190. What decision can you make based on this calculated value?


In [13]:
import numpy as np
import pandas as pd
import scipy.stats #as stats
from scipy.stats import t
from scipy.stats import ttest_ind

#### 6 Analysis Steps
Step 1: Determine H0 and H1 (Ha)<br>
Step 2: Check whether the sample size is n >= 30 or n < 30. Here n >= 30, but because σ of the population is unknown we use the t-test statistic.<br>
Step 3: Calculate the test statistic. We chose t: $t_{sample} = (\bar{x} - \mu) / (S / \sqrt{n})$<br>
Step 4: Calculate the critical value tc to derive the interval<br>
Step 5: Decide whether H0 is rejected or we fail to reject H0<br>
Step 6: Determine the P value.
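
Steps 3 to 5 can be sketched as one small helper that works from summary statistics only. This is a minimal sketch, not part of the original lab: the function name `one_sample_t_decision` and its parameters are hypothetical, and the default `alpha=0.05` is an assumption that matches the 95% level used in Step 4.

```python
import math
from scipy import stats

def one_sample_t_decision(x_bar, mu_0, s, n, alpha=0.05):
    """Two-sided one-sample t-test from summary statistics (Steps 3 to 5)."""
    # Step 3: test statistic of the sample
    t_stat = (x_bar - mu_0) / (s / math.sqrt(n))
    # Step 4: two-sided critical value with n - 1 degrees of freedom
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    # Step 5: reject H0 when the statistic falls outside [-t_crit, t_crit]
    reject_h0 = abs(t_stat) > t_crit
    return t_stat, t_crit, reject_h0

# Values from the Honolulu Heart Study example
t_stat, t_crit, reject_h0 = one_sample_t_decision(130.1, 120, 21.21, 100)
print(f"t = {t_stat:.4f}, critical value = {t_crit:.4f}, reject H0: {reject_h0}")
```

Keeping the decision rule in a function makes it easy to rerun the same test with a different significance level.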

#### 1. Setup of hypothesis test
**Step 1:** <br>
H0: The mean systolic blood pressure of the group in the Honolulu Heart Study does not differ from the mean systolic blood pressure of the regular population: **μ = 120 mm Hg**.<br>
H1: The mean systolic blood pressure of the group in the Honolulu Heart Study differs from that of the regular population: **μ ≠ 120 mm Hg** (the sample mean is **x̄ = 130.1 mm Hg**).<br>
We chose a two-sided test.

#### Step 2:
Sample size:     n = 100<br>
Population mean: μ = 120 mm Hg<br>
Sample mean:     x̄ = 130.1 mm Hg<br>
Sample std. dev.: S = 21.21 mm Hg<br>
Std. dev. of population is unknown: σ = ?<br>

As σ of the population is not known, we choose the t-test statistic for Step 3.

#### Step 3: Calculation of the test statistic of the sample

In [2]:
# Calculation of the t-test statistic of the sample
# (x̄ - μ) / (S / 100^1/2) 
t_sample = (130.1-120)/(21.21/(100**(1/2)))
print('t_sample = {:.4f}'.format(t_sample))

t_sample = 4.7619


#### Step 4:
#### Calculation of critical interval

In [5]:
# 2-sided student test
tc = scipy.stats.t.ppf(1-(0.05/2), df=99) #can be also written like: tc = scipy.stats.t.ppf(0.975, df=99)
print('tc = +/-{:.4f}'.format(tc))

# 2-sided normal distribution critical value for comparison; confidence level 95%, which is the same as saying α = 1 − 0.95 = 0.05
zc = scipy.stats.norm.ppf(0.975)
print('zc = +/-{:.4f}'.format(zc))

tc = +/-1.9842
zc = +/-1.9600
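
The two critical values are close because the t distribution approaches the normal distribution as the degrees of freedom grow. The following sketch illustrates this; the list of degrees of freedom is an arbitrary choice made for illustration.

```python
from scipy import stats

# Two-sided 95% critical value of the standard normal distribution
z_crit = stats.norm.ppf(0.975)

# The t critical value shrinks towards z_crit as the degrees of freedom grow
for dof in (9, 29, 99, 999):
    t_crit = stats.t.ppf(0.975, df=dof)
    print(f"df = {dof:4d}: tc = {t_crit:.4f}, tc - zc = {t_crit - z_crit:.4f}")
```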


#### Step 5:
H0 is rejected: t_sample = 4.7619 lies outside the interval [-1.9842, 1.9842], i.e. in the critical region.<br>
H1 is accepted: the sample mean of 130.1 mm Hg is significantly different from the assumed mean systolic blood pressure of 120 mm Hg.

#### Step 6

In [6]:
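# The quantile 0.99999671 was found by trial: it gives a t value close to t_sample = 4.7619
tc = scipy.stats.t.ppf(0.99999671, df=99)
P_one_tail = (1-0.99999671)*100  # upper-tail probability in percent
P = 2*P_one_tail                 # two-sided test: both tails count
print('One-tail P of t_sample = {:.5f}%'.format(P_one_tail))
print('Two-sided P of t_sample = {:.5f}%'.format(P))
tc

One-tail P of t_sample = 0.00033%
Two-sided P of t_sample = 0.00066%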


4.761251569030441
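
Instead of searching for the matching quantile by trial, the tail probability can be computed directly with the survival function `scipy.stats.t.sf`. This is a minimal sketch that recomputes `t_sample` from the values given in the exercise; the variable names are chosen for illustration.

```python
from scipy import stats

# Test statistic of the sample, as in Step 3
t_sample = (130.1 - 120) / (21.21 / 100 ** 0.5)

# Upper-tail probability P(T > t_sample) for 99 degrees of freedom
p_one_tail = stats.t.sf(t_sample, df=99)

# Two-sided test: the same probability lies in the lower tail
p_two_sided = 2 * p_one_tail
print(f"one-tail P = {p_one_tail:.2e}, two-sided P = {p_two_sided:.2e}")
```

Both values are far below α = 0.05, which agrees with the decision in Step 5.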

Next comes a simple example of a two-sample, one-sided t-test. In a packing plant, a machine packs cartons with jars. A new machine is supposed to pack faster on average than the machine currently used. To test that hypothesis, the times it takes each machine to pack ten cartons are recorded. The results, in seconds, are shown in the tables in the file Data/machine.txt. Assuming the conditions for conducting the t-test are met, do the data provide sufficient evidence that one machine is better than the other?

#### Step 1:
H0: The new machine does not pack faster than the old machine: the mean time the new machine needs to pack ten cartons is not lower than that of the old machine (μ_new >= μ_old).<br>
H1: The new machine packs faster than the old machine: its mean time to pack ten cartons is lower (μ_new < μ_old). The sample means are x̄_old = 43.23 and x̄_new = 42.14.
#### Step 2: 
As the sample sizes are n < 30 and the standard deviations of the populations are unknown, we use the t-test statistic.
#### Step 3: 
Calculate the test statistic. We chose t for two samples: $t = (\bar{x}_{new} - \bar{x}_{old}) / \sqrt{S_{new}^2 / n_{new} + S_{old}^2 / n_{old}}$<br>
Step 4: Calculate the critical value tc to derive the interval<br>
Step 5: Decide whether H0 is rejected or we fail to reject H0<br>
Step 6: Determine the P value.

In [7]:
df = pd.read_csv('Data_Machine.csv')

In [8]:
df=df.iloc[:,0:2]
df.rename(columns={"  Old_Machine": "Old_Machine"}, inplace=True)
df.columns

Index(['New_Machine', 'Old_Machine'], dtype='object')

In [16]:
n_old = len(df['Old_Machine'].values)
n_new = len(df['New_Machine'].values)
old_machine_mean=df['Old_Machine'].mean()
new_machine_mean=df['New_Machine'].mean()
std_old = np.std(df['Old_Machine'].values, ddof = 1) # ddof = 1 gives the sample standard deviation
std_new = np.std(df['New_Machine'].values, ddof = 1) # ddof = 1 gives the sample standard deviation

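# t statistic with unpooled variances (new minus old, so a faster new machine gives t < 0)
t = ( new_machine_mean - old_machine_mean ) / np.sqrt( ((std_new**2)/n_new) + ((std_old**2)/n_old ) )

# one-sided Student t critical values; H0 is rejected when t < -tc
# df=9 is the conservative choice min(n_old, n_new) - 1 when each sample has ten observations
tc90 = scipy.stats.t.ppf(0.9, df=9)
tc95 = scipy.stats.t.ppf(0.95, df=9) # 5% significance level
tc99 = scipy.stats.t.ppf(0.99, df=9)
# ttest_ind returns a two-sided p-value (pooled variances by default); halve it for the one-sided test
P = ttest_ind(df['Old_Machine'].values, df['New_Machine'].values)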

print("The sample mean of the Old_Machine is: {:.3f}".format(old_machine_mean))
print("The sample mean of the New_Machine is: {:.3f}".format(new_machine_mean))
print("The sample standard deviation of the Old_Machine is: {:.3f}".format(std_old))
print("The sample standard deviation of the New_Machine is: {:.3f}".format(std_new))
print("Our t statistic is: {:.3f}".format(t))
print('tc90 = +/-{:.4f}'.format(tc90))
print('tc95 = +/-{:.4f}'.format(tc95))
print('tc99 = +/-{:.4f}'.format(tc99))
print('H0 can be rejected, the new machine is significantly faster.')
P

The sample mean of the Old_Machine is: 43.230
The sample mean of the New_Machine is: 42.140
The sample standard deviation of the Old_Machine is: 0.750
The sample standard deviation of the New_Machine is: 0.683
Our t statistic is: -3.397
tc90 = +/-1.3830
tc95 = +/-1.8331
tc99 = +/-2.8214
H0 can be rejected, the new machine is significantly faster.


Ttest_indResult(statistic=3.3972307061176026, pvalue=0.0032111425007745158)
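
The one-sided P value can also be cross-checked from the printed summary statistics alone. This is a sketch under stated assumptions: the means and standard deviations are the rounded values from the output above, and a sample size of 10 per machine is assumed (consistent with the `df=9` used for the critical values). It uses `scipy.stats.ttest_ind_from_stats` with `equal_var=False`, i.e. a Welch test, which matches the unpooled t statistic computed by hand.

```python
from scipy import stats

# Rounded summary statistics from the output above; n = 10 per machine is an assumption
n_new = n_old = 10
res = stats.ttest_ind_from_stats(
    mean1=42.14, std1=0.683, nobs1=n_new,   # new machine
    mean2=43.23, std2=0.750, nobs2=n_old,   # old machine
    equal_var=False,                        # Welch test: variances are not pooled
)

# H1 is "new < old": halve the two-sided p-value when t points in that direction
p_one_sided = res.pvalue / 2 if res.statistic < 0 else 1 - res.pvalue / 2
print(f"t = {res.statistic:.3f}, one-sided P = {p_one_sided:.4f}")
```

A one-sided P value below 0.01 supports rejecting H0 even at the 99% level, in line with the comparison of t against tc99.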