In [1]:
import pandas as pd
import numpy as np

from statsmodels.stats.weightstats import ttest_ind as ttest_ind_sm
from statsmodels.stats.weightstats import DescrStatsW, CompareMeans
from statsmodels.stats.power import TTestIndPower
from scipy.stats import ttest_ind, t

In [2]:
# We'll first set the seed for the numpy random number generator
np.random.seed(1869)

## Simple t-test

We will use the t-test implementation available in scipy.stats. First we'll read in the data in the file hypothesis_test_example.csv in the Data directory of the repository. 

In [3]:
# Read in t-test example dataset
df_simple_example = pd.read_csv("../Data/hypothesis_test_example.csv")

Let's look at what the data consists of.

In [4]:
df_simple_example.head(n=10)

Unnamed: 0,x_A,x_B
0,2.035733,-0.171019
1,1.105017,0.65591
2,-0.745565,0.994498
3,1.8823,1.146082
4,-0.369233,0.317476
5,0.806429,1.124682
6,0.320897,-0.413721
7,0.910782,0.326461
8,1.141452,0.30949
9,1.803123,0.538578


Okay, we've got two simple columns of data: one column from the A group and one column from the B group.

Let's take a quick look at the summary statistics of the data. We'll just use the describe function from pandas to do this.

In [5]:
# Extract summary statistics of the data
df_simple_example.describe()

Unnamed: 0,x_A,x_B
count,100.0,100.0
mean,0.529,0.176
std,0.93847,0.983872
min,-1.036634,-1.95743
25%,-0.241617,-0.44883
50%,0.544784,0.250698
75%,1.108813,0.676324
max,3.126451,2.612088


From the summary statistics we can see there is a difference in the sample means, with the B group sample data having a mean of 0.176, whilst the A group sample data has a mean of 0.529. But does this provide evidence that the underlying population means of group A and group B are different? Let's run the t-test to find out.

In [6]:
# The scipy t-test function is really easy to use. We just pass in the two columns of data. We are assuming that 
# the underlying population variances are the same in each group. 
ttest_ind(df_simple_example['x_A'], df_simple_example['x_B'])

TtestResult(statistic=2.5961983095998966, pvalue=0.010132851609223453, df=198.0)

From this t-test we get a test statistic (t-value) of 2.596 and a p-value of 0.0101. If we use an $\alpha$ threshold of $\alpha=0.05$, then we would reject the null hypothesis and conclude that there is evidence (not proof) that the underlying population means of the two groups are different.
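The call above assumes equal population variances. As a hedged aside, the sketch below shows how the same scipy function can be asked for Welch's t-test instead, via its equal_var argument. The data here are synthetic and the variable names (sample_a, sample_b) are illustrative assumptions, not part of the example dataset.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic data (an assumption for illustration only): two groups with
# different spreads and different sample sizes.
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=0.5, scale=1.0, size=100)
sample_b = rng.normal(loc=0.0, scale=2.0, size=40)

# Default behaviour: pooled-variance (Student's) t-test.
pooled = ttest_ind(sample_a, sample_b)

# Welch's t-test: does not assume equal population variances.
welch = ttest_ind(sample_a, sample_b, equal_var=False)

print("Pooled t-test: t =", pooled.statistic, " p =", pooled.pvalue)
print("Welch t-test:  t =", welch.statistic, " p =", welch.pvalue)
```

When the two sample sizes are equal, the two versions give the same t-value and differ only in the degrees of freedom used for the p-value, which is why unequal sample sizes are used in this sketch.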

For completeness, let's see how we can also run that t-test using the statsmodels package. We'll use the statsmodels.stats.weightstats.ttest_ind function. To avoid the clash in function names with the scipy.stats.ttest_ind function we have imported this as ttest_ind_sm. Again the function is very easy to use. We just pass the two columns of data to the function. The default settings will assume that the underlying population variances are the same for each group and construct an estimate of that common variance from all the data pooled together. The function will also assume we are doing a two-tailed hypothesis test.

In [7]:
# Run the statsmodels version of the two-sample two-tailed t-test
ttest_ind_sm(df_simple_example['x_A'], df_simple_example['x_B'])

(2.596198309599896, 0.01013285160922347, 198.0)

The calculated t-value and p-value are the same as for the scipy version of the t-test. The 198 refers to the number of degrees of freedom used in the calculation of the p-value. In this case the number of degrees of freedom is 200 - 2 = 198 because we have 200 observations in total but we have estimated two sample means from that data when calculating the pooled variance that goes into the t-value.

We can check that the p-value calculated by statsmodels and scipy is the same as from the t-distribution formula given in the main text with $\nu=198$ by using the t-distribution implementation in scipy.

In [8]:
# Calculate the p-value for the observed t-value of 
# t=2.5961983095998953 and df=198 by using the cumulative 
# distribution function (CDF) of Student's t-distribution.
# scipy gives us an implementation of the t-distribution. 
# We want to calculate the area under the PDF that is to the
# right of 2.5961983095998953 and to the left of -2.5961983095998953.
# Since the t-distribution is symmetric about zero this is calculated as
# 2*(1 - CDF(2.5961983095998953)) or 2*CDF(-2.5961983095998953).

p_value1 = 2.0*(1.0 - t.cdf(2.5961983095998953, 198))
p_value2 = 2.0*t.cdf(-2.5961983095998953, 198)

print("p-value estimate from right-hand tail of PDF = ", p_value1)
print("p-value estimate from left-hand tail of PDF = ", p_value2)

p-value estimate from right-hand tail of PDF =  0.010132851609223614
p-value estimate from left-hand tail of PDF =  0.01013285160922349


Up to machine precision, the two different ways of calculating the p-value from the CDF of the t-distribution implementation in scipy give the same result, and that result matches what we get from the scipy and statsmodels t-test functions.
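As a small complement to the CDF-based calculation, the same tail area can be obtained from the survival function t.sf, which computes 1 - CDF directly and avoids the subtraction. This is only a sketch; the variable names are illustrative, and the t-value and degrees of freedom are the ones used in the cell above.

```python
from scipy.stats import t

# Observed t-value and degrees of freedom, as used in the cell above
t_value = 2.5961983095998953
dof = 198

# Two-tailed p-value: twice the area in the upper tail beyond |t|.
# t.sf(x, df) is the survival function, i.e. 1 - CDF(x).
p_value_sf = 2.0 * t.sf(abs(t_value), dof)

print("p-value from the survival function = ", p_value_sf)
```

The printed value should agree with the p-value of roughly 0.0101 reported by the t-test functions.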

## Permutation based test of difference in means

In [9]:
# Set a large, but reasonable number of permutations to run. 
# In this case I've chosen to generate 100000 permuted datasets.
# This may take a couple of minutes to run.
n_permutations = 100000

# First I'll combine the original data into a single array. This makes performing the permutation easier.
x_All = np.concatenate((df_simple_example['x_A'].to_numpy(), df_simple_example['x_B'].to_numpy()))

## Next I'll calculate the observed test-statistic value and store it 
## in a variable called t_observed

# Create arrays to hold the indices of the datapoints 
# belonging to the A group and the B group. To start, the
# A group datapoints are at indices 0 to 99 and the B group
# datapoints are at indices 100 to 199 (both ranges inclusive)
nA = df_simple_example.shape[0]
nB = nA

A_indices = np.arange(0, nA)
B_indices = np.arange(nA, (nA+nB))

# Calculate the mean of each sample group
m_A = np.mean(x_All[A_indices])
m_B = np.mean(x_All[B_indices])
    
# Calculate the sample variances of each sample group
# The ddof=1 means we are using unbiased estimators for 
# the sample variance calculations
s2_A = np.var(x_All[A_indices], ddof=1)
s2_B = np.var(x_All[B_indices], ddof=1)
    
# Calculate the t-value test-statistic for the original data
sigma2_observed = (((nA-1)*s2_A) + ((nB-1)*s2_B))/(nA+nB-2)
t_observed = (m_A - m_B)/ (np.sqrt(sigma2_observed) * np.sqrt(2.0/nA))

print("Observed t-value is = ", t_observed)

## Now perform the permutations

# Set our p-value estimate count to zero
p_count = 0.0

# Loop over the permutations
for i in range(n_permutations):
    #Generate the permutation
    permuted_indices = np.random.permutation(nA+nB)
    A_indices = permuted_indices[0:nA]
    B_indices = permuted_indices[nA:(nA+nB)]
    
    # Calculate the mean of each sample group 
    # for the permuted dataset
    m_A = np.mean(x_All[A_indices])
    m_B = np.mean(x_All[B_indices])
    
    # Calculate the sample variances of each sample group
    # for the permuted dataset
    s2_A = np.var(x_All[A_indices], ddof=1)
    s2_B = np.var(x_All[B_indices], ddof=1)
    
    # Calculate the t-value for the permuted dataset
    sigma2_permuted = (((nA-1)*s2_A) + ((nB-1)*s2_B))/(nA+nB-2)
    t_permuted = (m_A - m_B)/ (np.sqrt(sigma2_permuted) * np.sqrt(2.0/nA))
    
    # Update our count if the t-value for the permuted dataset 
    # exceeds (in magnitude) that for the real dataset
    if np.abs(t_permuted) >= np.abs(t_observed):
        p_count += 1.0
        
# Now estimate the p-value
p_value_permutation = (1.0+p_count)/(1.0+n_permutations)
print("Permutation estimated p-value = ", p_value_permutation)

Observed t-value is =  2.5961983095998966
Permutation estimated p-value =  0.01032989670103299


The scipy.stats t-test implementation also allows us to estimate the p-value using a permutation-based calculation. So let's run the scipy.stats version and see how it compares. To do so we just specify the permutation argument to the scipy.stats.ttest_ind function. We'll use the same number of permutations as before. 

In [10]:
# Run the scipy t-test with permutation-based p-value estimation
ttest_ind(df_simple_example['x_A'].to_numpy(), df_simple_example['x_B'].to_numpy(), permutations=n_permutations)

TtestResult(statistic=2.5961983095998966, pvalue=0.00988990110098899, df=nan)

The permutation-based p-value estimates from our own code and from the scipy.stats.ttest_ind function are very similar. Since both are based on random generation of permutations, we would expect to see small differences, as the two calculations generate different sets of permutations. As we increase the number of permutations used, we would expect the difference between the two p-value estimates to decrease. Try increasing n_permutations to 1000000 and re-running - but remember that increasing n_permutations by a factor of 10 means the code takes roughly 10 times longer to run.
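The permutation procedure can also be wrapped up as a small reusable function. The sketch below is a simplified variant, not the code used above: it uses the absolute difference in sample means as the test statistic rather than the pooled t-value, and it uses a local numpy Generator instead of the global seed. The function name permutation_p_value and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def permutation_p_value(x_a, x_b, n_permutations=10_000, seed=0):
    """Two-tailed permutation p-value for a difference in means.

    Hypothetical helper: the test statistic is |mean(A) - mean(B)|.
    """
    rng = np.random.default_rng(seed)
    x_all = np.concatenate((x_a, x_b))
    n_a = len(x_a)

    # Test statistic for the data as originally labelled
    observed = abs(np.mean(x_a) - np.mean(x_b))

    count = 0
    for _ in range(n_permutations):
        # Shuffle the pooled data and re-split into two groups
        shuffled = rng.permutation(x_all)
        permuted = abs(shuffled[:n_a].mean() - shuffled[n_a:].mean())
        if permuted >= observed:
            count += 1

    # Add one to numerator and denominator, as in the cell above
    return (1.0 + count) / (1.0 + n_permutations)

# Usage on synthetic data (not the example dataset)
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.5, scale=1.0, size=100)
group_b = rng.normal(loc=0.0, scale=1.0, size=100)
print("Permutation p-value = ", permutation_p_value(group_a, group_b, n_permutations=2000))
```

Passing an explicit seed keeps the estimate reproducible without touching numpy's global random state.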

## Confidence Interval calculation

We'll use the simple example data again to demonstrate how to use the statsmodels.stats.weightstats.CompareMeans class to calculate a confidence interval for the difference between two population means given two i.i.d. samples of data from those populations. We'll have to wrap the pandas series holding the samples as statsmodels.stats.weightstats.DescrStatsW objects.

In [11]:
# First we'll instantiate a CompareMeans object to run
# the confidence interval calculation. We just pass in
# our two samples of data. We must wrap the samples as
# DescrStatsW objects. Since we are not applying any 
# non-uniform weights to the observations we can just 
# pass the pandas series for each sample into the 
# constructor for the DescrStatsW class
mean_comparison = CompareMeans(DescrStatsW(df_simple_example['x_A']), DescrStatsW(df_simple_example['x_B']))

# Now compute the 95% confidence interval for the 
# difference in means using the tconfint_diff method 
# of the CompareMeans class. A 95% confidence level 
# (alpha=0.05) is the default
mean_difference_95CI = mean_comparison.tconfint_diff()

mean_difference_95CI

(0.08486864535681571, 0.6211313546431838)

We can see that the 95% confidence interval does not cross zero, so we would conclude that a hypothesis test would reject a null hypothesis of there being no difference in population means, when tested at the $\alpha=0.05$ level.

Let's run another confidence interval calculation, but this time with a higher level of confidence, say 99%.

In [12]:
mean_difference_99CI = mean_comparison.tconfint_diff(alpha=0.01)

mean_difference_99CI

(-0.0006375498458349727, 0.7066375498458345)

Clearly, the 99% confidence interval is wider than the 95% confidence interval. The 99% confidence interval (just) straddles zero, so we would not reject the null hypothesis at $\alpha=0.01$ in a two-tailed test of the hypothesis $\mu_{A} = \mu_{B}$, even though we rejected it at $\alpha=0.05$. This is consistent with the p-value of 0.0101 obtained earlier, and it illustrates the subjective nature of hypothesis testing.
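To see where these intervals come from, the sketch below recomputes them from the rounded summary statistics shown earlier (means, standard deviations and counts), using a pooled-variance standard error and the t-distribution quantile function in scipy. The helper name pooled_t_confint is hypothetical, and because the inputs are rounded the results should only match the statsmodels output to a few decimal places.

```python
import numpy as np
from scipy.stats import t

def pooled_t_confint(mean_a, mean_b, std_a, std_b, n_a, n_b, alpha=0.05):
    """Confidence interval for mean_a - mean_b, assuming equal variances.

    Hypothetical helper written for illustration.
    """
    dof = n_a + n_b - 2
    # Pooled estimate of the common variance
    pooled_var = ((n_a - 1) * std_a**2 + (n_b - 1) * std_b**2) / dof
    # Standard error of the difference in sample means
    std_err = np.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    # Two-tailed critical value of the t-distribution
    t_crit = t.ppf(1.0 - alpha / 2.0, dof)
    diff = mean_a - mean_b
    return diff - t_crit * std_err, diff + t_crit * std_err

# Rounded values taken from the describe() output above
ci_95 = pooled_t_confint(0.529, 0.176, 0.93847, 0.983872, 100, 100, alpha=0.05)
ci_99 = pooled_t_confint(0.529, 0.176, 0.93847, 0.983872, 100, 100, alpha=0.01)

print("95% CI = ", ci_95)
print("99% CI = ", ci_99)
```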

## Power calculation for t-test

We will use the statsmodels two-sample t-test power calculation function solve_power from the statsmodels.stats.power.TTestIndPower class. Unfortunately, this function may throw a warning due to warnings from the underlying Boost library that scipy makes use of. See the statsmodels issue https://github.com/statsmodels/statsmodels/issues/8624 for more details. For convenience of output we have suppressed the warning.

The solve_power function takes several arguments: i) effect_size, ii) nobs1, iii) alpha, iv) power, v) ratio, vi) alternative. We specify all but one of effect_size, nobs1, alpha, power, ratio and the function will determine the value of the unspecified argument that is necessary for consistency with the other arguments. In the example below, we have left nobs1 unspecified, so the function will determine the number of observations needed to give us a power of 0.8, when the effect size $|\mu_{A} - \mu_{B}|/\sigma = 0.5$, and the type-I error rate $\alpha=0.05$. The argument ratio (set to its default value of 1.0 here) is the ratio of the sample sizes drawn from the A and B populations, so a ratio of 1.0 means we are calculating the value of $N$ required if we are going to have an equal number of observations from A and B. The alternative argument is the type of alternative hypothesis we want to test. 'two-sided' means "two-tailed".

In [13]:
# Import the warnings module so we can ignore the warning thrown by scipy
import warnings

# Wrap the call to TTestIndPower().solve_power so that warnings are ignored
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    print("Sample size required = ", TTestIndPower().solve_power(effect_size=0.5, nobs1=None, alpha=0.05, power=0.8, ratio=1.0, alternative='two-sided'))

Sample size required =  63.765611775409525
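As a rough cross-check of this number, the sketch below computes the power of the two-tailed, equal-sample-size t-test from the noncentral t-distribution in scipy and then solves for the sample size with a root-finder. This is an illustrative reimplementation under stated assumptions (equal group sizes, common variance), not the statsmodels code itself; the function name two_sample_power is hypothetical. In practice the result would be rounded up to a whole number of observations per group.

```python
import math

import numpy as np
from scipy.optimize import brentq
from scipy.stats import nct, t

def two_sample_power(n_per_group, effect_size=0.5, alpha=0.05):
    """Power of a two-tailed two-sample t-test with equal group sizes.

    Hypothetical helper written for illustration.
    """
    dof = 2.0 * n_per_group - 2.0
    # Noncentrality parameter for equal group sizes
    noncentrality = effect_size * np.sqrt(n_per_group / 2.0)
    # Critical value of the central t-distribution under the null
    t_crit = t.ppf(1.0 - alpha / 2.0, dof)
    # Probability of landing in either rejection region under the alternative
    upper_tail = nct.sf(t_crit, dof, noncentrality)
    lower_tail = nct.cdf(-t_crit, dof, noncentrality)
    return upper_tail + lower_tail

# Find the (fractional) sample size per group at which power equals 0.8
n_required = brentq(lambda n: two_sample_power(n) - 0.8, 2.0, 500.0)

print("Sample size required = ", n_required)
print("Rounded up to whole observations per group = ", math.ceil(n_required))
```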


To double check that this does give the required power, we can use the power function from the TTestIndPower class to calculate the power for the sample size we have just estimated. If the estimated sample size is correct, we should get back a power of 0.8. All the other arguments to the power function are the same as before, i.e. as we passed to the solve_power function.

In [14]:
print("Power = ", TTestIndPower().power(effect_size=0.5, nobs1=63.765610587854034, alpha=0.05, ratio=1.0, alternative='two-sided'))

Power =  0.7999999950676624


Note that if we calculate the power for a larger sample size, we should get a higher value. Let's try with a sample size of 85.

In [15]:
print("Power = ", TTestIndPower().power(effect_size=0.5, nobs1=85, alpha=0.05, ratio=1.0, alternative='two-sided'))

Power =  0.8998940700985045


We can see that the power has increased to nearly 90%, i.e. there is approximately a 90% probability of rejecting the null hypothesis in a two-tailed t-test when $|\mu_{A} - \mu_{B}|/\sigma = 0.5$, $\alpha=0.05$ and $N_{A} = N_{B} = 85$.
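This interpretation of power as a rejection probability can be checked empirically with a small Monte Carlo simulation. The sketch below assumes normally distributed data with unit variance in both groups, so that the difference in means equals the effect size; the function name simulated_power and the number of simulations are illustrative choices.

```python
import numpy as np
from scipy.stats import ttest_ind

def simulated_power(n_per_group=85, effect_size=0.5, alpha=0.05,
                    n_simulations=5000, seed=0):
    """Estimate power by simulating many two-sample experiments.

    Hypothetical helper: assumes normal data with unit variance.
    """
    rng = np.random.default_rng(seed)
    # Each row is one simulated experiment
    sample_a = rng.normal(effect_size, 1.0, size=(n_simulations, n_per_group))
    sample_b = rng.normal(0.0, 1.0, size=(n_simulations, n_per_group))
    # Run one pooled-variance, two-tailed t-test per row
    result = ttest_ind(sample_a, sample_b, axis=1)
    # Power estimate: fraction of experiments that reject the null
    return float(np.mean(result.pvalue < alpha))

print("Simulated power = ", simulated_power())
```

With a few thousand simulated experiments the estimate should land close to 0.9, with some random variation from run to run if the seed is changed.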