In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import scipy.stats as stats

# One-sample tests

# Hypothesis of mean

Analysts usually want the data to be normally distributed, but in practice it is often enough to have many observations: the Central Limit Theorem places no requirement on the distribution of the original data (beyond finite variance), and for large samples the sample mean is approximately normal anyway. Normality matters when there are too few observations to rely on the CLT. This remark applies not only to this test, but to most of the tests below.
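To make the remark about the CLT more tangible, here is a small simulation sketch. The exponential distribution, the sample size of 200, the seed and the helper `skewness` are my own illustrative choices, not anything required by the tests below. It compares the skewness of heavily skewed raw data with the skewness of sample means taken from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

def skewness(a):
    """Sample skewness (third standardized moment); hypothetical helper."""
    a = np.asarray(a, dtype=float)
    centered = a - a.mean()
    return (centered ** 3).mean() / a.std() ** 3

# Raw exponential data are strongly right-skewed (theoretical skewness is 2).
raw = rng.exponential(scale=1.0, size=100_000)

# 2000 independent samples of size 200; keep only the mean of each sample.
sample_means = rng.exponential(scale=1.0, size=(2000, 200)).mean(axis=1)

raw_skew = skewness(raw)
means_skew = skewness(sample_means)
print(f"skewness of raw data:     {raw_skew:.2f}")
print(f"skewness of sample means: {means_skew:.2f}")
```

The sample means should come out far more symmetric than the raw data, which is why the t-statistic behaves well for large samples even when the data themselves are not normal.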



Since the one-sample T-test and Z-test differ only in whether the variance of the data is known, and therefore in the distribution of the resulting statistic (t or standard normal), I will only consider the T-test here.

Assume we need to check whether the mean of our data equals some constant or is larger.

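$H_0: \mu = \mu_0; H_1: \mu > \mu_0$ Statistic: $T = \dfrac{\bar{x} - \mu_0}{s/\sqrt n} \sim t_{n-1}$, where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation and $n$ is the sample size.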
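As a sanity check of the formula, the statistic can be computed by hand and compared with `stats.ttest_1samp`. This is only a sketch: the simulated sample `x`, the seed and the name `mu0` are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)              # arbitrary seed
x = rng.normal(loc=105, scale=10, size=50)   # illustrative sample
mu0 = 100                                    # hypothesized mean under H_0

n = x.size
# T = (sample mean - mu0) / (s / sqrt(n)), with s the unbiased sample std (ddof=1)
t_manual = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
# Right-tail probability of t_{n-1}, matching H_1: mu > mu0
p_manual = stats.t.sf(t_manual, df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(x, mu0, alternative='greater')
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```

Note the `ddof=1`: NumPy's default `std` divides by $n$, not $n - 1$, and would give a slightly different statistic.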

In [2]:
n1 = stats.norm.rvs(loc = 100, scale = 10, size = 1000)
n2 = stats.norm.rvs(loc = 105, scale = 10, size = 1000)
mu = 100
alpha = 0.05

In [3]:
def one_samp_t(x, mu, alpha = 0.05, alt = 'greater'):
    # One-sample t-test of H_0: mean(x) == mu against the alternative `alt`
    _, p = stats.ttest_1samp(x, mu, alternative=alt)
    if p > alpha:
        print(f'{p = }, we fail to reject the null hypothesis' )
    else:
        print(f'{p = }, we reject the null hypothesis' )

In [4]:
one_samp_t(n1, mu)
one_samp_t(n2, mu)

p = 0.6533477528274056, we fail to reject the null hypothesis
p = 7.835084194727925e-54, we reject the null hypothesis


# Proportion test

This test answers the question: "Is there statistically significant evidence that a population proportion differs from a given value?" For instance, when conducting a survey we might want to test an assumption about the proportion of males within a population.

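This test uses the statistic $T = \dfrac{x - np}{\sqrt{npq}} \sim N(0, 1)$, where $x$ is the observed number of successes, $n$ is the sample size, $p$ is the proportion under $H_0$ and $q = 1 - p$. The normal approximation holds for large $n$.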

In [5]:
# Assume we have n data points; the observed proportion is p_obs and the tested (null) proportion is p_test
n = 1000
p_test = 0.52
p_obs = 0.56
p = 0.05  # significance level

In [6]:
# z_p is the p-value of the one-sided z-test for the proportion
_, z_p = sm.stats.proportions_ztest(n*p_obs, n, p_test, alternative = 'larger')
if z_p > p:
    print(f'{z_p = }, {p = } H_0 is not rejected')
else:
    print(f'{z_p = }, {p = } H_0 is rejected')

z_p = 0.005413460638565616, p = 0.05 H_0 is rejected
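The p-value printed above can be reproduced by hand with the standard library alone. The sketch below is my own illustration (the helper `z_test_greater` is hypothetical); it computes the statistic with two choices of the variance: the null proportion, as in the formula with $\sqrt{npq}$, and the observed proportion. As far as I can tell, the second variant is the one that matches the statsmodels output, which suggests that `proportions_ztest` estimates the variance from the sample by default (its `prop_var` argument controls this).

```python
from math import sqrt
from statistics import NormalDist

n, p_test, p_obs = 1000, 0.52, 0.56

def z_test_greater(p_obs, p_test, n, p_for_variance):
    """One-sided (greater) z-test for a proportion; hypothetical helper.

    p_for_variance is the proportion plugged into the standard error.
    """
    se = sqrt(p_for_variance * (1 - p_for_variance) / n)
    z = (p_obs - p_test) / se
    p_value = 1 - NormalDist().cdf(z)  # right tail of N(0, 1)
    return z, p_value

# Variance under H_0 (the textbook formula with sqrt(n*p*q))
z_null, p_null = z_test_greater(p_obs, p_test, n, p_for_variance=p_test)
# Variance estimated from the observed proportion
z_samp, p_samp = z_test_greater(p_obs, p_test, n, p_for_variance=p_obs)

print(f"null variance:   z = {z_null:.4f}, p-value = {p_null:.6f}")
print(f"sample variance: z = {z_samp:.4f}, p-value = {p_samp:.6f}")
```

Both variants reject $H_0$ at the 0.05 level here, so the choice of variance only matters in borderline cases.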


# Two-sample tests

# Two-sample mean

As in the previous section, let's skip the known-variance case. I'll only note that with known variances the statistic has the same form as in the case of unequal unknown variances (see below), but it follows the standard normal distribution instead of a t-distribution.

There are two main cases here: *independent* and *related* samples. Let's start with the related one.

For related samples the statistic is $T = \dfrac{\bar{x}_1 - \bar{x}_2}{S(\text{diff})\big/\sqrt n} \sim t_{n-1}$. Here $\bar{x}_i$ is the mean of the $i^{th}$ sample and $S(\text{diff})$ is the standard deviation of the pairwise differences between the samples. The samples must be of equal size $n$.
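A useful way to read this formula: the related-samples test is just a one-sample t-test applied to the pairwise differences, with hypothesized mean 0. The sketch below checks this numerically. The arrays `before` and `after`, the seed and the effect size are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed
# Truly paired data: each "after" value is built from its own "before" value.
before = rng.normal(loc=5, scale=10, size=200)
after = before + rng.normal(loc=0.5, scale=2, size=200)

diff = before - after
# T = mean(diff) / (S(diff) / sqrt(n)), with the unbiased std (ddof=1)
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(diff.size))

t_rel, p_rel = stats.ttest_rel(before, after)
t_one, p_one = stats.ttest_1samp(diff, 0.0)
print(np.isclose(t_manual, t_rel), np.isclose(p_rel, p_one))
```

Because the test works on differences, the large spread shared by both measurements cancels out, which is what makes paired designs sensitive to small shifts.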

In [7]:
# Generating samples to be compared pairwise (paired by index; note that each array is drawn independently here)
rvs1 = stats.norm.rvs(loc=5,scale=10,size=500)
rvs2 = (stats.norm.rvs(loc=5,scale=10,size=500) + stats.norm.rvs(scale=0.2,size=500))
rvs3 = (stats.norm.rvs(loc=8,scale=10,size=500) + stats.norm.rvs(scale=0.2,size=500))

In [8]:
print(stats.ttest_rel(rvs1, rvs2))
print(stats.ttest_rel(rvs1, rvs3))

Ttest_relResult(statistic=-0.0765391447613296, pvalue=0.9390208496632412)
Ttest_relResult(statistic=-6.337317859691792, pvalue=5.23189470185983e-10)


There is no statistically significant evidence that the means of rvs1 and rvs2 differ (p ≈ 0.94), while for rvs1 and rvs3 the hypothesis of equal means is rejected (p ≈ 5e-10).

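For independent samples with equal unknown variance we should evaluate the statistic $T = \dfrac{\bar{x}_1 - \bar{x}_2}{S_p\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}}\sim t_{n_1+n_2-2}$, where $S_p = \sqrt{\dfrac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}}$ is the pooled standard deviation, $\bar{x}_i$ is the mean, $S_i^2$ the sample variance and $n_i$ the size of the $i^{th}$ sample.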

Hard one, isn't it?
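It gets less scary when written as code. Below is a sketch of the pooled-variance statistic computed by hand: the two sample variances are averaged with weights $n_i - 1$, and the square root of that pooled variance scales the difference of means. The samples `a` and `b` and the seed are my own illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # arbitrary seed
a = rng.normal(loc=5, scale=10, size=500)
b = rng.normal(loc=8, scale=10, size=100)
n1, n2 = a.size, b.size

# Pooled variance: weighted average of the two unbiased sample variances
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
# Two-sided p-value from the t-distribution with n1 + n2 - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n1 + n2 - 2)

t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=True)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```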

In [21]:
rvs1 = stats.norm.rvs(loc=5,scale=10,size=500)
rvs2 = stats.norm.rvs(loc=5,scale=10,size=500)
rvs3 = stats.norm.rvs(loc=8, scale=10, size=100)

In [22]:
print(stats.ttest_ind(rvs1,rvs2, equal_var = True))
print(stats.ttest_ind(rvs1,rvs3, equal_var = True))

Ttest_indResult(statistic=-1.2113723100522373, pvalue=0.22603949901711712)
Ttest_indResult(statistic=-3.453964630016503, pvalue=0.0005914880434097882)


No comment is needed here, I think: the first pair shows no significant difference in means, the second one does.

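If the samples have unknown variances that are not assumed to be equal, we use a different statistic (Welch's t-test): $T = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$, which approximately follows a t-distribution with the number of degrees of freedom given by the Welch-Satterthwaite formula.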
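Here is a hand-rolled sketch of this unequal-variance statistic together with the Welch-Satterthwaite degrees of freedom, checked against `stats.ttest_ind(..., equal_var=False)`. The samples `a` and `b`, their parameters and the seed are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # arbitrary seed
a = rng.normal(loc=5, scale=15, size=500)
b = rng.normal(loc=8, scale=20, size=100)
n1, n2 = a.size, b.size

# Per-sample variance of the mean: S_i^2 / n_i
v1 = a.var(ddof=1) / n1
v2 = b.var(ddof=1) / n2

t_manual = (a.mean() - b.mean()) / np.sqrt(v1 + v2)
# Welch-Satterthwaite approximation for the degrees of freedom
df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), df)  # two-sided p-value

t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```

The degrees of freedom are generally not an integer here, and they always lie between $\min(n_1, n_2) - 1$ and $n_1 + n_2 - 2$.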

In [23]:
rvs1 = stats.norm.rvs(loc=5,scale=10,size=500)
rvs2 = stats.norm.rvs(loc=5,scale=15,size=500)
rvs3 = stats.norm.rvs(loc=8, scale=20, size=100)

In [25]:
print(stats.ttest_ind(rvs1, rvs2, equal_var=False))
print(stats.ttest_ind(rvs2, rvs3, equal_var=False))

Ttest_indResult(statistic=-1.2291931298976304, pvalue=0.21933042140248887)
Ttest_indResult(statistic=-4.94156655583275, pvalue=2.437465652574751e-06)


# Proportion test

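As in the one-sample case, we use the statistic $T = \dfrac{p_1 - p_2}{\sqrt{\dfrac{p_1q_1}{n_1}+\dfrac{p_2q_2}{n_2}}}\sim N(0, 1)$, where $p_i$ is the observed proportion in the $i^{th}$ sample, $q_i = 1 - p_i$ and $n_i$ is the sample size.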

In [48]:
# Sample sizes
n1 = 1000
n2 = 500
# Observed proportions in the two samples
p1_test = 0.52
p2_test = 0.55
p = 0.05  # significance level

In [49]:
_, p_value = sm.stats.proportions_ztest(count = [n1*p1_test, n2*p2_test], nobs = [n1, n2])
print(f'{p_value = }')

p_value = 0.2724568487259569


In [50]:
p2_test = 0.6
_, p_value = sm.stats.proportions_ztest(count = [n1*p1_test, n2*p2_test], nobs = [n1, n2])
print(f'{p_value = }')

p_value = 0.0033463056952775846
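For the curious, the two p-values above can be reproduced with the standard library. The helper `two_prop_z` below is my own sketch, not a statsmodels function. It supports both the unpooled standard error from the formula in this section and a pooled one, where a single proportion is estimated from both samples together under $H_0: p_1 = p_2$. As far as I can tell, the pooled variant is the one that matches what `proportions_ztest` printed.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_z(p1, n1, p2, n2, pooled=True):
    """Two-sided z-test for the difference of two proportions (hypothetical helper)."""
    if pooled:
        # One common proportion estimated from both samples under H_0
        p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
        se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    else:
        # Separate variance estimate for each sample
        se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z_pooled, p_pooled = two_prop_z(0.52, 1000, 0.55, 500, pooled=True)
z_unpooled, p_unpooled = two_prop_z(0.52, 1000, 0.55, 500, pooled=False)
_, p_second = two_prop_z(0.52, 1000, 0.6, 500, pooled=True)

print(f"pooled:   p-value = {p_pooled:.6f}")
print(f"unpooled: p-value = {p_unpooled:.6f}")
print(f"second comparison (p2 = 0.6), pooled: p-value = {p_second:.6f}")
```

The two standard errors differ only slightly for these numbers, so the conclusions are the same either way.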


That is all I wanted to cover in this notebook. More complicated and less frequently used tests, like ANOVA, Levene's or Bartlett's tests, deserve a separate notebook.

THE END.