# Lab | Inferential statistics

### Instructions

It is assumed that the mean systolic blood pressure is μ = 120 mm Hg. In the Honolulu Heart Study, a sample of n = 100 people had an average systolic blood pressure of 130.1 mm Hg with a standard deviation of 21.21 mm Hg. Is the group significantly different (with respect to systolic blood pressure!) from the regular population?

- Set up the hypothesis test.
- Write down all the steps followed for setting up the test.
- Calculate the test statistic by hand and also code it in Python. It should be 4.76190. We will take a look at how to make decisions based on this calculated value.

### 1 Inferential Statistics Background

Hypothesis testing is a form of inferential statistics that allows us to draw conclusions about an entire population based on a representative sample. When you estimate the properties of a population from a sample, the sample statistics are unlikely to equal the actual population value exactly. For instance, your sample mean is unlikely to equal the population mean. The difference between the sample statistic and the population value is the sampling error. Differences that you observe in samples might therefore be due to sampling error rather than representing a true effect at the population level.
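To make the idea of sampling error concrete, here is a minimal simulation sketch. It assumes a hypothetical, normally distributed population with mean 120 and standard deviation 21.21; these parameters and the seed are chosen only for illustration and are not data from the Honolulu Heart Study.

```python
import numpy as np

# Hypothetical population parameters, chosen only for illustration.
POP_MEAN = 120.0
POP_STD = 21.21
N_SAMPLE = 100

rng = np.random.default_rng(seed=0)

# Draw 1,000 independent samples of size 100 and record each sample mean.
samples = rng.normal(loc=POP_MEAN, scale=POP_STD, size=(1000, N_SAMPLE))
sample_means = samples.mean(axis=1)

# Sampling error of each sample: sample mean minus population mean.
sampling_errors = sample_means - POP_MEAN

print("smallest sampling error:", sampling_errors.min())
print("largest sampling error:", sampling_errors.max())
print("spread of the sample means:", sample_means.std(ddof=1))
print("theoretical standard error:", POP_STD / np.sqrt(N_SAMPLE))
```

Every sample mean misses the population mean by a small amount, and the spread of those misses is close to the standard error, i.e. the population standard deviation divided by the square root of the sample size.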

### 2 Test setup

### 2.1. Test Hypothesis

**H0 - Null hypothesis:**

The difference between the assumed mean systolic blood pressure μ = 120 mm Hg and the average of 130.1 mm Hg observed in the Honolulu Heart Study is due to sampling error alone and doesn't represent a true effect. The population from which the sample was drawn has the same mean as the regular population, so Δμ = 0.

**HA - Alternative hypothesis:**

The mean systolic blood pressure of the population from which the sample was drawn differs from the assumed population mean of 120 mm Hg, Δμ != 0.

### 2.2 Choose level of significance / critical value

The significance level alpha is the probability of rejecting the null hypothesis when it is actually true (a type I error). The most popular choice is 0.05. It means that if the null hypothesis is true and you repeat the experiment many times, you would wrongly reject it in about 5% of the repetitions. The corresponding confidence level is 1 - alpha = 95%.

**significance level: alpha = 0.05**
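The meaning of alpha can be checked with a small simulation. The sketch below assumes a hypothetical normal population in which the null hypothesis is true (mean exactly 120, standard deviation 21.21) and counts how often a one-sample t-test rejects it anyway. The population parameters, the number of repetitions, and the seed are illustrative assumptions.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05
rng = np.random.default_rng(seed=1)

# 2,000 repeated experiments of n = 100, all drawn from a population
# whose true mean is 120, so the null hypothesis holds in every one.
samples = rng.normal(loc=120.0, scale=21.21, size=(2000, 100))

# One-sample t-test per experiment (one test per row).
result = stats.ttest_1samp(samples, popmean=120.0, axis=1)

# Share of experiments in which H0 is rejected although it is true.
false_positive_rate = np.mean(result.pvalue < ALPHA)
print("false positive rate:", false_positive_rate)
```

The printed rate should land close to 0.05, which is exactly the error rate that the choice of alpha accepts.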


### 2.3. Assumptions about data distribution / Test selection

Depending on the assumptions you can make about your data, different statistical tests are appropriate for testing the hypothesis.

1. Paired or unpaired: The sample is not matched to a second set of measurements on the same participants.

    **Conclusion: Unpaired**

2. Parametric or non-parametric: Systolic blood pressure is assumed to follow a known distribution (approximately a normal distribution).

    **Conclusion: Parametric**

![alt text](https://github.com/KevinSpurk/lab-inferential-statistics/blob/master/files_for_lab/pics/t-tests-selection-tree.png "T Test Selection Tree")

**Conclusion: Perform a one-sample t-test**

A Welch t-test would be an alternative, but the standard deviation and sample size for the population are not available.

### 3. t-test

The test will produce a **t value** to evaluate whether to reject the null hypothesis or not. It will also produce a **p value**, which is the probability of obtaining a test statistic at least as extreme as the one observed if the null hypothesis is true. This value is also interpreted when deciding whether or not to reject the null hypothesis. In particular, a small p value indicates that a result this extreme would be unlikely if the null hypothesis were true.

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import statsmodels.stats
import math

### 3.1. Calculate t value

In [2]:
n_sample = 100
mean_pop = 120
mean_sample = 130.1
stddev_sample = 21.21

In [3]:
t_stat = round((mean_sample - mean_pop)/(stddev_sample/math.sqrt(n_sample)), 4)
t_stat

4.7619
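For the calculation by hand, the same steps can be written out as follows. The symbols $\bar{x}$ (sample mean), $s$ (sample standard deviation) and $n$ (sample size) are notation introduced here for the values defined in the cell above.

```latex
t = \frac{\bar{x} - \mu}{s / \sqrt{n}}
  = \frac{130.1 - 120}{21.21 / \sqrt{100}}
  = \frac{10.1}{2.121}
  \approx 4.7619
```

The denominator, 2.121 mm Hg, is the standard error of the sample mean.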

### 3.2. Compare t with critical t value

To make a comparison, we can use the following Python function or check a [t table](https://www.gradecalculator.tech/t-table/ "t table"). Since HA is Δμ != 0, which includes Δμ < 0 and Δμ > 0, I use the values for a two-tailed test.

**Inputs:**

- cumulative probability q = 1 - alpha/2 (two-tailed test with alpha = 0.05, i.e. a 95% confidence level)
- degrees of freedom (number of independent observations in test sample - 1)

In [4]:
t_critical = stats.t.ppf(q=1-.05/2,df=n_sample-1)
t_critical

1.9842169515086827

**Conclusion:** t = 4.7619 is bigger than the critical value of 1.9842, which implies that the difference between the means is statistically significant at alpha = 0.05.

### 3.3 Compare p value with significance level

In [5]:
stats.t.sf(np.abs(t_stat), n_sample-1)*2

6.562827764430223e-06

**Conclusion:** the p value is very small and crucially below alpha = 0.05. This result supports the implication of the t statistic. Finally, **the null hypothesis should be rejected** on that basis, meaning that the difference between the mean systolic blood pressure observed in the sample and the assumed population mean is statistically significant.
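The three steps above (t value, critical value, p value) can be bundled into one helper that works from summary statistics alone. This is a sketch, not part of the original lab: the function name `one_sample_t_test` and the returned dictionary keys are hypothetical, and the t value is not rounded before the p value is computed.

```python
import math
from scipy import stats


def one_sample_t_test(mean_sample, mean_pop, stddev_sample, n_sample, alpha=0.05):
    """Two-tailed one-sample t-test from summary statistics (hypothetical helper)."""
    # Standard error of the sample mean.
    standard_error = stddev_sample / math.sqrt(n_sample)
    t_stat = (mean_sample - mean_pop) / standard_error
    df = n_sample - 1

    # Two-tailed critical value and p value from the t distribution.
    t_critical = stats.t.ppf(1 - alpha / 2, df)
    p_value = 2 * stats.t.sf(abs(t_stat), df)

    return {
        "t_stat": t_stat,
        "t_critical": t_critical,
        "p_value": p_value,
        "reject_h0": bool(p_value < alpha),
    }


result = one_sample_t_test(mean_sample=130.1, mean_pop=120,
                           stddev_sample=21.21, n_sample=100)
print(result)
```

Comparing the absolute t value with the critical value and comparing the p value with alpha always lead to the same decision, so `reject_h0` only needs one of the two checks.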