## How to perform a 2 sample t-test?

Let us say we want to test whether the average height of men in a population differs from the average height of women. We take a sample from each group and use a t-test to see whether the observed difference is statistically significant.

Steps:

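1. Determine a null and alternate hypothesis
In general, the null hypothesis states that there is no difference between the means of the two populations being tested. The alternate hypothesis states that there is one. In this example we can say that:

Null hypothesis (H0): the mean height of men is equal to the mean height of women.
Alternate hypothesis (H1): the mean height of men is different from the mean height of women.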

2. Collect sample data
The next step is to collect data for each population group. In our example we will collect 2 sets of data, one with the heights of women and one with the heights of men. The sample sizes should ideally be the same, but they can be different. Let's say that the sample sizes are nx and ny.

3. Determine a significance level and degrees of freedom
The significance level is what we call alpha (α). The typical value of α is 0.05. This means we accept a 5% chance of rejecting the null hypothesis when it is actually true, which corresponds to a 95% confidence level. The degrees of freedom can be calculated with the following formula:

$$df = n_x + n_y - 2$$

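4. Calculate the t-statistic
The t-statistic can be calculated using the formula below:

$$t = \frac{M_x - M_y}{S \sqrt{\frac{1}{n_x} + \frac{1}{n_y}}}$$

where $M_x$ and $M_y$ are the mean values of the two samples (men and women),
$n_x$ and $n_y$ are the sizes of the two samples, and
$S$ is the pooled standard deviation, computed from the sample variances $s_x^2$ and $s_y^2$:

$$S = \sqrt{\frac{(n_x - 1) s_x^2 + (n_y - 1) s_y^2}{n_x + n_y - 2}}$$

When both samples have the same size $N$, this reduces to $S = \sqrt{(s_x^2 + s_y^2)/2}$ and the denominator becomes $S \sqrt{2/N}$.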
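
As a rough illustration of this step, the sketch below computes the pooled t-statistic for two samples that may differ in size, using only the Python standard library. The function name `pooled_t_statistic` and the height values are made up for the example, and the pooled form assumes both populations have the same variance.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(x, y):
    """Two-sample t-statistic with a pooled standard deviation.

    Assumes equal population variances. Returns (t, degrees_of_freedom).
    """
    nx, ny = len(x), len(y)
    dof = nx + ny - 2
    # statistics.variance divides by n - 1 (the sample variance)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / dof
    s = math.sqrt(pooled_var)
    t = (mean(x) - mean(y)) / (s * math.sqrt(1 / nx + 1 / ny))
    return t, dof

# Hypothetical heights in cm, for illustration only
men = [178.0, 182.0, 175.0, 180.0, 185.0]
women = [165.0, 170.0, 160.0, 168.0]
t_stat, dof = pooled_t_statistic(men, women)
```

Swapping the two samples only flips the sign of the statistic, which is why a two-sided test looks at its absolute value.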

5. Calculate the critical t-value from the t distribution
To calculate the critical t-value, we need 2 things: the chosen value of alpha and the degrees of freedom. The formula for the critical t-value is complex, but the value is fixed for a given pair of degrees of freedom and alpha, so it is traditionally looked up in a t-distribution table.


In Python, rather than looking the value up in a table, we will use a function from the SciPy package. (I promise you, it's the only time we will use it!)
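
One way to get the critical value without a table is the inverse CDF (`ppf`) of SciPy's t distribution. This is a minimal sketch; the values of `alpha` and `dof` are illustrative, and whether the one-sided or two-sided value is the right one depends on how the alternate hypothesis is stated.

```python
from scipy import stats

alpha = 0.05  # significance level
dof = 18      # degrees of freedom, e.g. two samples of 10 values each

# Two-sided test: alpha is split equally between the two tails
t_crit_two_sided = stats.t.ppf(1 - alpha / 2, dof)

# One-sided test: all of alpha sits in a single tail
t_crit_one_sided = stats.t.ppf(1 - alpha, dof)
```

The two-sided critical value is always the larger of the two, because each tail only gets half of alpha.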

6. Compare the critical t-value with the calculated t-statistic
If the absolute value of the calculated t-statistic is greater than the critical t-value, the test concludes that there is a statistically significant difference between the two populations. Therefore, you reject the null hypothesis that there is no difference between the two populations.

In any other case, the test fails to reject the null hypothesis: the data do not provide enough evidence to conclude that the heights of men and women are different.
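
The comparison in this step can be sketched as a small helper. The name `two_sided_decision` is hypothetical, the inputs are illustrative, and the sketch assumes a two-sided alternate hypothesis (the means are simply "different").

```python
from scipy import stats

def two_sided_decision(t_stat, dof, alpha=0.05):
    """Return True if the null hypothesis is rejected at level alpha."""
    # Critical value with alpha/2 in each tail
    t_crit = stats.t.ppf(1 - alpha / 2, dof)
    # Reject when the statistic falls in either tail
    return bool(abs(t_stat) > t_crit)

# Illustrative values: a large statistic and a small one, both with 18 degrees of freedom
reject_large = two_sided_decision(5.1, 18)
reject_small = two_sided_decision(1.0, 18)
```

Here `reject_large` means the null hypothesis is rejected, while `reject_small` means the test fails to reject it, which is not the same as proving the two means are equal.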



In [None]:
## Import the packages
import numpy as np
from scipy import stats

In [None]:
## Define 2 random samples
#Sample size
N = 10
#Gaussian distributed data with mean = 2 and var = 1
a = np.random.randn(N) + 2
#Gaussian distributed data with mean = 0 and var = 1
b = np.random.randn(N)

In [3]:

## Calculate the Standard Deviation
#Calculate the variance to get the standard deviation

#For the unbiased sample variance we divide by N-1 instead of N, and therefore the parameter ddof = 1
var_a = a.var(ddof=1)
var_b = b.var(ddof=1)

In [4]:

#Pooled standard deviation (this simple average of the two variances assumes equal sample sizes)
s = np.sqrt((var_a + var_b)/2)
s


0.72345968699968499

In [7]:


## Calculate the t-statistic (for equal sample sizes, sqrt(1/N + 1/N) = sqrt(2/N))
t = (a.mean() - b.mean())/(s*np.sqrt(2/N))
t

5.0954478472219273

In [8]:

## Compare with the critical t-value
#Degrees of freedom
df = 2*N - 2
df

18

In [9]:

#One-sided p-value: P(T > t) under the null hypothesis; double it for a two-sided test
p = 1 - stats.t.cdf(t,df=df)
p

3.7774106135035623e-05

In [10]:
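## Cross-checking with the built-in SciPy function
#Note: ttest_ind already returns the two-sided p-value, so p2 itself is about
#twice the one-sided p computed above; the 2*p2 printed here is about 4 times that p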
t2, p2 = stats.ttest_ind(a,b)
print("t = " + str(t2))
print("p = " + str(2*p2))

t = 5.09544784722
p = 0.00015109642454
