## Description
In today's class, we'll learn a method of statistical inference called *hypothesis testing*.   

*Theory time.* The general idea of statistical hypothesis testing is to set up a statistical hypothesis - e.g., that some parameter is greater than 0 - and try to *reject it* by showing that our data would be very unlikely to be observed if this hypothesis were true.

Why do we reject statistical hypotheses rather than confirm them? Because in general, confirming them is impossible using random samples. For example, let's suppose that we study the toxicity of a new drug by giving it to 10 rats and observing them for a week. The fact that the rats didn't get sick may suggest that the drug is safe, but it doesn't prove it - after all, it's just 10 animals observed for a short time. On the other hand, if all 10 animals get sick (and the control group is fine), we can confidently reject the hypothesis that the drug is safe.

Another example: let's suppose we have a hypothesis that the average log-length of a human protein is equal to 2.30, and we sample 200 proteins to verify it. If their average log-length turns out to be similar to 2.30 - for example, 2.31 with a standard deviation of 0.1 - this doesn't prove our hypothesis. The true average log-length may still be equal to, for example, 2.29 or 2.305. However, if the average log-length of our sample turns out to be 2.70 with a standard deviation of 0.1, this is strong evidence that our hypothesis is false.
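To make these numbers concrete, here is a minimal sketch that turns the summary values from the example into a one-sample t statistic (the statistic itself is defined in the *One-sample t-test* section of this notebook). It assumes that the quoted standard deviation is the sample standard deviation; the helper name `t_from_summary` is ours and is used only for illustration.

```python
import numpy as np
from scipy.stats import t as tstud

def t_from_summary(sample_mean, sample_sd, n, mu0):
    """One-sample t statistic and two-sided p-value from summary values."""
    t_stat = np.sqrt(n) * (sample_mean - mu0) / sample_sd
    p_value = 2 * tstud.cdf(-abs(t_stat), n - 1)
    return t_stat, p_value

# Sample mean close to the hypothesised value 2.30
t_close, p_close = t_from_summary(2.31, 0.1, 200, 2.30)
# Sample mean far from the hypothesised value 2.30
t_far, p_far = t_from_summary(2.70, 0.1, 200, 2.30)

print('Mean 2.31: t =', t_close, 'p-value =', p_close)
print('Mean 2.70: t =', t_far, 'p-value =', p_far)
```

In the first case the p-value stays above the usual 5% threshold, so we don't reject the hypothesis (which is not the same as proving it). In the second case the p-value is tiny and we reject it.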

More generally: data that roughly agree with our beliefs don't prove that our beliefs are true. However, data that strongly contradict our beliefs give us grounds to reject them.





## Data & library imports

In [None]:
!pip install gdown

In [None]:
!gdown https://drive.google.com/uc?id=1xOJfD-jexDbHSOCg1EiyAxqc5kXjMvX0
!gdown https://drive.google.com/uc?id=1y5NKR3aWB0DbAuSWcg6ffa1Atu2unpOA

In [None]:
# !pip install --upgrade scipy

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
from scipy.stats import t as tstud
from scipy.stats import ttest_ind, ttest_rel, ttest_1samp, norm, kstest, mannwhitneyu, shapiro, chisquare, chi2_contingency
import plotly.graph_objects as go
from statsmodels.stats.multitest import fdrcorrection

In [None]:
protein_lengths = pd.read_csv('protein_lengths.tsv', sep='\t')
protein_lengths['LogLength'] = np.log10(protein_lengths['Protein length'])
protein_lengths

In [None]:
human_protein_lengths = protein_lengths.loc[protein_lengths['Common name'] == 'Human'].copy()
# Note: without .copy(), some versions of Pandas may return a View.
# This may interfere with adding a new column to human_protein_lengths.
print(human_protein_lengths.head())
print()
print(human_protein_lengths.describe())

In [None]:
citizen_incomes = pd.read_csv('citizen incomes.tsv', sep='\t')
citizen_incomes

### Student's t-test

A Student's t-test is any statistical hypothesis test in which the test statistic follows a <a href="https://en.wikipedia.org/wiki/Student%27s_t-distribution">Student's t-distribution</a> under the null hypothesis.

#### Z-test

We assume that $X_1, X_2, \ldots, X_n$ are normally distributed with **unknown** mean $\mu$ and **known** variance $\sigma^2$. We formulate the null hypothesis $H_0$ that the mean is $\mu_0$, while the alternative hypothesis $H_1$ states that either $\mu \neq \mu_0$ (two-sided) or, e.g., $\mu > \mu_0$ (one-sided). The test statistic is as follows:

$$Z = \frac{\bar{x}-\mu_0}{\frac{\sigma}{\sqrt{n}}} = \sqrt{n}\frac{\bar{x}-\mu_0}{\sigma}.$$

Under the null hypothesis, $Z$ has the standard normal distribution, i.e., $Z \sim N(0,1)$, which allows us to determine the critical region and to calculate the p-value. In the case of a two-sided test, the p-value is the probability under the null hypothesis that $Z$ is at least as extreme as the observed value $z$, i.e., that $|Z| \geq |z|$. Based on the p-value, we reject the null hypothesis if

$$P(Z \leq -|z|) + P(Z \geq |z|) = 2P(Z \leq -|z|) \leq \alpha.$$
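The Z-test above can be sketched in a few lines. This is only an illustration on synthetic data: the function name `z_test_two_sided`, the random seed and the simulated sample are our assumptions, not part of the course data.

```python
import numpy as np
from scipy.stats import norm

def z_test_two_sided(x, mu0, sigma):
    """Z statistic and two-sided p-value for H0: mu = mu0 with known sigma."""
    x = np.asarray(x, dtype=float)
    z = np.sqrt(x.size) * (x.mean() - mu0) / sigma
    # P(Z <= -|z|) + P(Z >= |z|) = 2 P(Z <= -|z|)
    p_value = 2 * norm.cdf(-abs(z))
    return z, p_value

# Synthetic sample with true mean 0.5 and known standard deviation 1
rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=1.0, size=50)
z, p = z_test_two_sided(x, mu0=0.0, sigma=1.0)
print('z =', z, 'p-value =', p)

# A small deterministic check: mean 3, mu0 = 2, sigma = 1, n = 4 gives z = 2
z_check, p_check = z_test_two_sided([3, 3, 3, 3], mu0=2.0, sigma=1.0)
print('z =', z_check, 'p-value =', p_check)
```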

#### One-sample t-test

If the variance of the normally distributed population is **not known**, then for testing a hypothesis regarding the mean the one-sample t-test is used. Let $H_0 : \mu = \mu_0$ and $H_1: \mu \neq \mu_0$. The test statistic is given by 

$$T = \sqrt{n}\frac{\bar{x}-\mu_0}{s},$$

where $s$ is the sample standard deviation, i.e.,

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar x)^2}.$$

Then the test statistic T has the Student's t-distribution with $n-1$ degrees of freedom and thus the p-value can be calculated as follows:

$$2P(T_{n-1} \leq -|t|).$$


#### Two-sample t-test

This is a test of the null hypothesis that the means of two normally distributed populations are equal **with** the assumption that the **unknown** variances $\sigma_x^2$ and $\sigma_y^2$ of the two populations are **equal**.

We consider two samples from the two distributions, i.e., $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_m$. We test the hypothesis that $H_0: \mu_x = \mu_y$ against $H_1: \mu_x \neq \mu_y$. Let $\bar{x}$ and $\bar{y}$ be the sample means of the two samples, respectively. Similarly, let us denote by $s_x^2$ and $s_y^2$ the variances of the two samples. Then, the test statistic is given by:

$$t = \frac{(\bar{x} - \bar{y}) - (\mu_x - \mu_y)}{S_p\sqrt{\frac{1}{n} + \frac{1}{m}}},$$

where the $S_p^2$ (pooled variance) is defined as follows:

$$S_p^2 = \frac{(n-1)s_x^2+(m-1)s_y^2}{n+m-2}.$$

The t statistic follows the t-distribution with $n+m-2$ degrees of freedom.
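As a sanity check of the formulas above, the following sketch computes the pooled two-sample t statistic by hand and compares it with `ttest_ind(..., equal_var=True)` from `scipy`. The data are synthetic, and the helper name `pooled_t_test` is hypothetical.

```python
import numpy as np
from scipy.stats import t as tstud, ttest_ind

def pooled_t_test(x, y):
    """Two-sided two-sample t-test assuming equal variances (H0: mu_x = mu_y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = x.size, y.size
    # Pooled variance S_p^2 built from the unbiased sample variances
    sp2 = ((n - 1) * x.var(ddof=1) + (m - 1) * y.var(ddof=1)) / (n + m - 2)
    t_stat = (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / n + 1 / m))
    p_value = 2 * tstud.cdf(-abs(t_stat), n + m - 2)
    return t_stat, p_value

# Synthetic samples of different sizes with the same variance
rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=15)
y = rng.normal(loc=0.5, scale=1.0, size=12)

t_manual, p_manual = pooled_t_test(x, y)
reference = ttest_ind(x, y, equal_var=True)
print('Manual:', t_manual, p_manual)
print('Scipy: ', reference.statistic, reference.pvalue)
```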


#### Two-sample t-test (Welch's test)

This is a test of the null hypothesis that the means of two normally distributed populations are equal **without** the assumption that the variances of the two populations are equal. 

We consider two samples from the two distributions, i.e., $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_m$. We test the hypothesis that $H_0: \mu_x = \mu_y$ against $H_1: \mu_x \neq \mu_y$. Let $\bar{x}$ and $\bar{y}$ be the sample means of the two samples, respectively. Similarly, let us denote by $s_x^2$ and $s_y^2$ the variances of the two samples. Then, the test statistic is given by:

$$t = \frac{(\bar{x} - \bar{y}) - (\mu_x - \mu_y)}{\sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}}.$$
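The section does not state the reference distribution for this statistic. As a hedged addition based on the standard formulation of Welch's test (the same one as in the Wikipedia article linked in Exercise 2), under $H_0$ the statistic approximately follows a t-distribution whose degrees of freedom are estimated with the Welch-Satterthwaite equation:

$$\nu \approx \frac{\left(\frac{s_x^2}{n} + \frac{s_y^2}{m}\right)^2}{\frac{\left(s_x^2/n\right)^2}{n-1} + \frac{\left(s_y^2/m\right)^2}{m-1}}.$$

In general $\nu$ is not an integer, and it always lies between $\min(n-1, m-1)$ and $n+m-2$.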

## Testing the value of the mean

To illustrate the basic concepts of statistical hypothesis testing, we'll start with simple tests for a hypothesis that the true mean value is equal to some value $\mu_0$, with an alternative hypothesis that it's different:  

$$H_0: \mu = \mu_0$$
$$H_1: \mu \neq \mu_0$$  
Note the lack of hats above any of the symbols - the hypotheses are about parameters, not estimators or any other random values.   

**Exercise 1.** Consider the human protein log-length data in the `human_protein_lengths` data frame. Consider the following two null hypotheses: $H_0^{(1)}: \mu = 2.711540$ and $H_0^{(2)}: \mu = 6$. We'll use the Student's t test to verify the hypotheses using a random sample.  

1. Select a random sample of protein log-lengths of size $N=20$.   
2. Calculate the test statistics for one-sample Student's t-tests with the assumption that the standard deviation is unknown and is estimated from the sample. Pay attention to which kind of variance estimator you need to use (biased or unbiased). How many test statistics do you need to calculate to test the two hypotheses, $H_0^{(1)}$ and $H_0^{(2)}$?   
3. Use the `tstud.ppf` (i.e. the quantile function of Student's t distribution) to calculate the critical set on the significance level 5% (i.e., Type I error risk 5%). Pay attention to the shape of the critical set - for our alternative hypothesis $H_1$, this set is a union of two half-lines. How many quantile values do you need to calculate to test the two hypotheses, $H_0^{(1)}$ and $H_0^{(2)}$?   
  3.1. Based on the values of the test statistic and the critical set, do we reject our null hypotheses? Did we correctly detect which hypothesis is true and which is false?    
4. Use the `tstud.cdf` to calculate the p-values. Again, pay attention to the shape of the critical set. How many cdf values do you need to calculate to test the two hypotheses, $H_0^{(1)}$ and $H_0^{(2)}$?  
  4.1. Based on the p-values, do we reject any of our hypotheses on the significance level 5%? Did we correctly detect which hypothesis is true and which is false?     
5. Compare your results to the Student's t test implementation in `scipy`. The appropriate test has already been loaded in the *Data & library imports* section.   
6. Are there any assumptions of the test that are violated? If so, how strongly and what effect could it have on the test result?  
7. Based on the results of this exercise, can you conclude that $N=20$ is enough to prove that $\mu=2.711540$? Does the answer to this question depend on $N$?   

In [None]:
# Set the parameters
mu1 = 2.711540
mu2 = 6.
N = 20
df = N - 1
alpha = 0.05

# Get the sample
sample=human_protein_lengths.sample(N)

## Calculate the test statistic
mean_log = sample['LogLength'].mean()
print('Estimated mean:', mean_log)
# Here we use pandas.Series.std with the default setting ddof=1, i.e., the unbiased standard deviation estimator.
# With std() from NumPy, we would have to use std_log = np.std(sample['LogLength'].values, ddof=1), i.e., we would have
# to explicitly set ddof to 1 because the default value is 0.
std_log = sample['LogLength'].std()
test1 = np.sqrt(N)*(mean_log - mu1)/std_log
test2 = np.sqrt(N)*(mean_log - mu2)/std_log
print('Value of the test statistic for H^1_0:', test1)
print('Value of the test statistic for H^2_0:', test2)

## Calculate the critical regions
q = tstud.ppf(1 - alpha/2, df)
print('We reject the hypothesis if T <', -q, "or T >", q)

## Calculate the p-value:
print('P-value for H^1_0:', 2*tstud.cdf(-abs(test1), df))
print('P-value for H^2_0:', 2*tstud.cdf(-abs(test2), df))

## Verification of the p-value and critical region calculation
print('P-value corresponding to the critical region:', 2*tstud.cdf(-q, df))

## Scipy implementation:
print('Scipy results, H^1_0:', ttest_1samp(sample['LogLength'], popmean=mu1))
print('Scipy results, H^2_0:', ttest_1samp(sample['LogLength'], popmean=mu2))

**Exercise 2.** In this exercise, we'll see a more useful application of statistical hypothesis testing: comparing two populations. Say we want to use a random sample to check if the log-lengths of human proteins are, on average, higher than the ones of another organism - like the bay bolete (a kind of mushroom).

1. What are the appropriate null and alternative hypotheses if we want to use a random sample to show that human proteins are longer in terms of the average log-length?     
2. Select a random sample of human proteins ($N_\text{human}=20$) and a random sample of bay bolete proteins ($N_\text{bolete}=20$) from the `protein_lengths` data frame.
3. Use the two-sample Student's t-test implemented in the `ttest_ind` function from the `scipy` library to calculate the p-value. Pay attention to the `alternative` keyword, as well as to any keyword arguments that may correspond to assumptions such as the equality of variances.   
4. Based on the p-value, do we reject $H_0$ on a significance level 5%? Did we confirm that humans have longer or shorter proteins than the mushroom (in terms of the log-length)?      
5.\*\* Implement your own test and compare the p-value. You can use the equations described [here](https://en.wikipedia.org/wiki/Welch%27s_t-test).  
6. What happens if we take a reverse hypothesis - that humans have lower protein log-lengths than the mushroom? Did our results confirm anything now?  
  6.1. Is the result of the test true? Compare the true average log-lengths of the two organisms.  
  6.2. Is the result of the test from point 5 true?   
7.\* Do a test for a null hypothesis that the average protein log-lengths are equal, and an alternative that they are different (in any direction; a so-called *two-sided* alternative hypothesis). Explain the difference in the results compared to the previous points.    
  7.1. Roughly speaking, what is the difference in the shape of the critical region for a one-sided and a two-sided alternative hypothesis?   
  7.2.\* Calculate the ratio of the p-value for the two-sided alternative to the p-value for the one-sided alternative that the human log-lengths are smaller than the bolete log-lengths. Explain the result.  
8. Can we use the Student's t test to test protein lengths rather than log-lengths?   


In [None]:
## Parameters
N_human = 20
N_bolete = 20

## Sample
human_sample=protein_lengths.loc[protein_lengths['Common name'] == 'Human', 'LogLength'].sample(N_human)
bolete_sample=protein_lengths.loc[protein_lengths['Common name'] == 'Bay bolete (mushroom)', 'LogLength'].sample(N_bolete)

## Do the test:
print('Scipy results for Human > Bolete alternative:', ttest_ind(human_sample, bolete_sample, alternative='greater', equal_var=False))
print('Scipy results for Human < Bolete alternative:', ttest_ind(human_sample, bolete_sample, alternative='less', equal_var=False))
print('Scipy results for a two-sided alternative:', ttest_ind(human_sample, bolete_sample, alternative='two-sided', equal_var=False))
print('Two-sided p-value divided by the one-sided (Human < Bolete) p-value:',
      ttest_ind(human_sample,
                bolete_sample,
                alternative='two-sided',
                equal_var=False)[1]/ttest_ind(human_sample,
                                              bolete_sample,
                                              alternative='less',
                                              equal_var=False)[1])
# Note: ttest_ind(a, b, alternative='greater') tests for E(a) > E(b).
# Warning: ttest_ind assumes equal variances by default!

## Implement the test:
meanX = human_sample.mean()
meanY = bolete_sample.mean()
var_meanX = human_sample.var()/N_human  # an estimator of the variance of the estimator of the mean of X
var_meanY = bolete_sample.var()/N_bolete
delta_mean = meanX-meanY
sd_delta = np.sqrt(var_meanX + var_meanY)
test_statistic = delta_mean / sd_delta
print('Manually calculated value of the T statistic:', test_statistic)
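# Welch-Satterthwaite approximation of the degrees of freedom
df_welch = sd_delta**4 / (var_meanX**2/(N_human-1) + var_meanY**2/(N_bolete-1))
# One-sided p-value for the alternative Human > Bolete, i.e. P(T >= t)
print('Manually calculated p-value (one-sided, Human > Bolete):', tstud.sf(test_statistic, df_welch))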

## Compare to the true averages:
print(protein_lengths.groupby('Common name').mean(numeric_only=True))

## Non-parametric tests

The two-sample Student's t-test assumes a normal distribution of the populations. When this assumption is only slightly violated, like for the protein log-length data, the results may still be reliable, especially for large sample sizes. For many data sets, as the sample size increases, the distribution of the estimator of the mean converges to a normal distribution (this is guaranteed by the Central Limit Theorem). This means that the estimator of the mean is distributed "more normally" than the original data (where "more normal" refers to the distance between the cumulative distribution functions). This increases the robustness of the t-test to small deviations from normality. However, when this assumption is strongly violated, like for the non-transformed protein length data, the results are no longer reliable. One way to solve this problem is to use non-parametric tests. A non-parametric test is a test that does not assume that the data follow a particular parametric family of distributions.
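The "more normal" claim can be checked with a small simulation. The sketch below uses a synthetic, strongly skewed population (a log-normal one, chosen by us purely for illustration) and measures the Kolmogorov-Smirnov distance between the standardized values and the standard normal distribution; the helper names are hypothetical.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(3)
# Synthetic, strongly skewed "population" (an assumption made for illustration)
population = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

def ks_distance_to_normal(values):
    """KS distance between the standardized values and the N(0, 1) CDF."""
    z = (values - values.mean()) / values.std(ddof=1)
    return kstest(z, 'norm').statistic

def sample_means(n, repeats=2000):
    """Means of `repeats` random samples of size n, drawn with replacement."""
    return rng.choice(population, size=(repeats, n)).mean(axis=1)

d_raw = ks_distance_to_normal(population)
d_10 = ks_distance_to_normal(sample_means(10))
d_100 = ks_distance_to_normal(sample_means(100))

print('KS distance, raw data:          ', d_raw)
print('KS distance, means of 10 values:', d_10)
print('KS distance, means of 100 values:', d_100)
```

The distance should shrink as the sample size grows, although for a population this skewed the means of 10 observations are still visibly non-normal.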

One of the most common non-parametric tests is the Mann-Whitney U-test, also known as the two-sample Wilcoxon's test. It's often used as a replacement for the Student's t-test when the data is not distributed normally. However, the null hypotheses of these two tests are different, and it's important to understand this difference to avoid misleading results.

In contrast to the Student's t-test, the Mann-Whitney test doesn't test the equality of parameters like the mean - hence the name *non-parametric*. Instead, its null hypothesis is that $\mathbb{P}(X > Y) = 1/2$, i.e., that if we take a random observation $X$ from the first population and a random observation $Y$ from the second population, it's equally likely that the first is greater or smaller than the second. A one-sided alternative hypothesis may be, e.g., that $\mathbb{P}(X > Y) > 1/2$, i.e., that observations from the first population tend to be larger than observations from the second one. In this case, we say that the first population is *stochastically greater* than the second one.

Sidenote: the actual null hypothesis of the Mann-Whitney test is slightly different, but the one described above is a very close approximation that's much simpler to interpret and use in practice.
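To connect $\mathbb{P}(X > Y)$ with the test statistic: the U statistic of the first sample counts the pairs in which the observation from the first sample is larger (ties count as one half), so dividing it by the number of pairs gives an estimate of $\mathbb{P}(X > Y)$. The sketch below illustrates this on synthetic data. It assumes SciPy 1.7 or newer, where `mannwhitneyu` returns the statistic of the first sample; the helper name `prob_x_greater_than_y` is ours.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def prob_x_greater_than_y(x, y):
    """Fraction of (x_i, y_j) pairs with x_i > y_j; ties count as one half."""
    x = np.asarray(x, dtype=float)[:, None]   # column vector
    y = np.asarray(y, dtype=float)[None, :]   # row vector
    return np.mean((x > y) + 0.5 * (x == y))

# Synthetic skewed samples; the first population tends to give larger values
rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=30)
y = rng.exponential(scale=1.0, size=25)

p_hat = prob_x_greater_than_y(x, y)
u_stat = mannwhitneyu(x, y).statistic
print('Estimated P(X > Y):', p_hat)
print('U / (n * m):       ', u_stat / (x.size * y.size))
```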

**Exercise 3.\*** Implement your own version of the Mann-Whitney test. You can find the necessary equations [here](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) (use the normal approximation for the test statistic).

1. Use your implementation to test whether the protein lengths are higher in human than in the bay bolete (use a random sample of size $N$ of your choice).
2. Compare your results to the `mannwhitneyu` function from `scipy`. Pay attention to the default parameters to obtain identical results.  
3. Compare the results of the Mann-Whitney's test to the Student's t  test. Are the results of the two tests consistent? Can you conclude that one of the organisms has longer proteins? Is this a correct result?  

In [None]:
N_human = 20
N_bolete = 20

bolete_protein_lengths = protein_lengths.loc[protein_lengths['Common name'] == 'Bay bolete (mushroom)']

sample1 = protein_lengths.loc[protein_lengths['Common name'] == 'Human'].sample(N_human)
sample2 = protein_lengths.loc[protein_lengths['Common name'] == 'Bay bolete (mushroom)'].sample(N_bolete)

## Testing lengths
print('Scipy test results for lengths:')
print(ttest_ind(sample1['Protein length'],
          sample2['Protein length'],
          equal_var=False,
          alternative='greater'))

print(mannwhitneyu(sample1['Protein length'],
             sample2['Protein length'],
             alternative='greater'))
# Note: mannwhitneyu(a, b, alternative='greater') checks if a is stochastically
# greater than b (the null is that they're equal)

## Manual implementation
print()
print('Manual test results for lengths:')
# Combine the data:
combined = pd.concat([sample1, sample2]).copy()
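# Get the ranks (tied values receive their average rank, as in scipy).
# Note: sdU below has no tie correction, so if there are ties,
# the manual p-value may differ slightly from the scipy one.
combined.sort_values('Protein length', inplace=True)
combined['rank'] = combined['Protein length'].rank(method='average')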
# Compute the statistic:
U = sum(combined.loc[combined['Common name'] == 'Human', 'rank'])
U -= N_human*(N_human+1.)/2.
print('Manually computed value of the U test statistic:', U)
# U is now equal to the number of pairs of human and bolete proteins where
# the human one is longer (a tied pair counts as one half).
# Therefore, large values correspond to the alternative hypothesis
# that human proteins are longer.
# Compute the mean and standard deviation of the test statistic:
mU = N_human*N_bolete/2.
sdU = np.sqrt(N_human*N_bolete*(N_human+N_bolete+1.)/12.)
Z = (U-mU)/sdU
print('Expected value of U:', mU, 'with standard deviation:', sdU)
print('Z-score:', Z)
# Compute the approximate p-value in a one-sided test:
pval = 1 - norm.cdf(Z)
print('Manually computed p-value:', pval)
print(mannwhitneyu(sample1['Protein length'],
             sample2['Protein length'],
             alternative='greater',
             use_continuity=False,
             method = 'asymptotic'))  # to make it equal to our statistic


**Exercise 4.** The `citizen_incomes` data frame, loaded in the *Data & library imports* section, contains information about the yearly income in USD of randomly sampled individuals from two countries, encoded as Country `A` and Country `B`. Use an appropriate statistical test to check whether citizens of one of the countries earn more than citizens of the other country. If you use more than one test and get contradictory results, explain why that happens.

In [None]:
# Note the reverse direction of the hypotheses below
print('T-test results:', ttest_ind(citizen_incomes.loc[citizen_incomes['Country']=='A', 'Income'],
                                   citizen_incomes.loc[citizen_incomes['Country']=='B', 'Income'],
                                   equal_var=False,
                                   alternative='less'))
print('U-test results:', mannwhitneyu(citizen_incomes.loc[citizen_incomes['Country']=='A', 'Income'],
                                      citizen_incomes.loc[citizen_incomes['Country']=='B', 'Income'],
                                      alternative='greater'))

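# Select the Income column explicitly so that non-numeric columns don't break the aggregation
print('Average incomes:', citizen_incomes.groupby('Country')['Income'].mean())
print('Median incomes:', citizen_incomes.groupby('Country')['Income'].median())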

px.histogram(citizen_incomes, x='Income', color='Country', barmode='overlay')

**Exercise 5.\*\*** In this exercise, we'll see how the violations of test assumptions influence the distribution of the test statistics under the null hypothesis.   

1. Formulate a statistical hypothesis about the incomes of citizens of country `B` which you can test using any version of the Student's t test (regardless of whether its assumptions are satisfied), and in which the null hypothesis $H_0$ is *true*.   
  1.1.\* Optionally, formulate a hypothesis which you can test using both Student's t and Mann-Whitney's u test.  
2. Which assumptions of your test are violated on this data set? Are they approximately satisfied for large sample sizes?    
3. Repeat the following $R=5000$ times:   
  3.1. Draw two random samples, each of size $N=10$.   
  3.2. Calculate the values of the Student's T statistic, either manually or using functions from `scipy`.   
  3.3.\* Calculate the Mann-Whitney's U statistic using functions from `scipy`.
  3.4. Save the values of the statistics in lists.  
4. Create histograms that depict the distributions of the statistics. Are the distributions correct (i.e. do they agree with the theoretical assumptions of the tests)? If not, how does it influence the test results?   
  4.1.\* Draw the probability density function of the theoretical distribution of the test statistics (under the null hypothesis) on the histograms.  
  4.2. Calculate the probability that your test makes a false positive error in this data set (i.e. that it incorrectly rejects a true null hypothesis; i.e. that the test statistic is within the theoretically calculated critical region).  
5. Did we analyze all the possible ways in which violated assumptions can influence test results? If not, what other problems or errors can be caused by the violated assumptions?   
6. What happens if you use protein log-lengths instead of citizen incomes?     



In [None]:
## Parameters
N = 10
R = 5000

test_population = citizen_incomes.loc[citizen_incomes['Country']=='B', 'Income']
# test_population = human_protein_lengths['LogLength']

## Distributions of the test statistics
T_values = []
U_values = []

for _ in range(R):
  sample1 = test_population.sample(N)
  sample2 = test_population.sample(N)
  #sample1 = human_protein_lengths['Protein length'].sample(N)
  #sample2 = human_protein_lengths['Protein length'].sample(N)
  T = ttest_ind(sample1,
                sample2,
                equal_var=False)[0]
  U = mannwhitneyu(sample1,
                   sample2,
                   method = 'asymptotic')[0]
  # Note: the value of the test statistic doesn't depend on the alternative
  # hypothesis; method = 'asymptotic' is faster
  T_values.append(T)
  U_values.append(U)


fig1 = px.histogram(T_values, histnorm='probability density')
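# Reference curve: t-distribution with 2N-2 degrees of freedom
# (two samples of size N drawn from the same population)
fig1.add_trace(
    go.Scatter(x=np.linspace(-3, 3), y=tstud.pdf(np.linspace(-3, 3), 2*N-2),
               name='Student\'s T')
)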
# Just for comparison, I add a normal pdf
fig1.add_trace(
    go.Scatter(x=np.linspace(-3, 3), y=norm.pdf(np.linspace(-3, 3)),
    name='Gaussian')
)
fig1.show()

# For the U statistic, remember that its expectation isn't 0.
# We can use the expectation and standard deviation from the previous exercise
mU = N**2/2.
sdU = np.sqrt(N**2*(2*N+1.)/12.)
fig2 = px.histogram(U_values, histnorm='probability density')
fig2.add_trace(
    go.Scatter(x=np.linspace(mU-3*sdU, mU+3*sdU), y=norm.pdf(np.linspace(mU-3*sdU, mU+3*sdU), loc=mU, scale=sdU),
    name='Gaussian')
)
fig2.show()

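# Calculate the empirical probability of the critical regions
alpha = 0.05     # significance level (set here so the cell doesn't depend on Exercise 1)
df_t = 2*N - 2   # approximate degrees of freedom for two samples of size N
qt = tstud.ppf(1 - alpha/2, df_t)
qn = norm.ppf(1-alpha/2)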
print('Probability of Type I error for Mann-Whitney:', np.mean([abs((u-mU)/sdU) > qn for u in U_values]))
print('Probability of Type I error for Student:', np.mean([abs(t) > qt for t in T_values]))