## Two sample t test

**A two-sample t-test is used to determine whether two population means are equal.** The data can take one of two forms:

* 1) Data is *paired* - For example, a group of students is given coaching classes, and the effect of the coaching on the marks scored by those same students is determined.
* 2) Data is *not paired* - For example, find out whether the miles per gallon of cars of Japanese make is superior to that of cars of Indian make.
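
The difference between the two cases can be illustrated with a minimal sketch. The marks below are synthetic (drawn from a seeded random generator) and the variable names are hypothetical; the point is only that the paired test works on per-student differences, while the independent test ignores the pairing.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical marks for 20 students before coaching; students differ a lot from each other.
before = rng.normal(loc=60, scale=10, size=20)
# Each student's gain from coaching is small relative to that between-student spread.
after = before + rng.normal(loc=3, scale=1, size=20)

paired = stats.ttest_rel(after, before)  # uses the per-student differences
indep = stats.ttest_ind(after, before)   # treats the two columns as unrelated samples

print("paired      t = %.2f, p = %.4g" % (paired.statistic, paired.pvalue))
print("independent t = %.2f, p = %.4g" % (indep.statistic, indep.pvalue))
```

When the same subjects are measured twice, the paired test removes the between-subject variation, so it is usually far more sensitive than the independent test applied to the same numbers.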

In [4]:
import numpy                     as     np
import pandas                    as     pd
from   scipy.stats               import ttest_1samp, ttest_ind # independent
import matplotlib.pyplot         as     plt
import matplotlib
import seaborn as sns
import scipy.stats as stats
import statsmodels.stats.api as sm

## Example 1 - Independent Two Sample T-Test

A hotel manager wants to enhance the initial impressions that hotel guests have when they check in. Contributing to initial impressions is the time it takes to deliver a guest’s luggage to the room after check-in. A random sample of 20 deliveries on a particular day was selected in Wing A of the hotel, and a random sample of 20 deliveries was selected in Wing B. The results are stored in Luggage. Analyze the data and determine whether there is a difference between the mean delivery times in the two wings of the hotel.

In [5]:
mydata = pd.read_csv('Luggage.csv')
mydata.head()

Unnamed: 0,WingA,WingB
0,10.7,7.2
1,9.89,6.68
2,11.83,9.29
3,9.04,8.95
4,9.37,6.61


### Step 1: Define null and alternative hypotheses

The null hypothesis states that the mean luggage delivery times are the same in both wings, $\mu_{A}$ = $\mu_{B}$. The alternative hypothesis states that the mean delivery times differ, $\mu_{A}$ $\neq$ $\mu_{B}$.

* $H_0$: $\mu_{A}$ - $\mu_{B}$ = 0, i.e. $\mu_{A}$ = $\mu_{B}$
* $H_A$: $\mu_{A}$ - $\mu_{B}$ $\neq$ 0, i.e. $\mu_{A}$ $\neq$ $\mu_{B}$

### Step 2: Decide the significance level

Here we select $\alpha$ = 0.05 and the population standard deviation is not known.

### Step 3: Identify the test statistic

* We have two independent samples and we do not know the population standard deviation.
* Both samples have the same size, 20 deliveries per wing.
* The samples are small (n < 30) and the population standard deviation is unknown, so we use the t distribution and the $t_{STAT}$ test statistic for the two-sample unpaired test.
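
Under the equal-variance assumption, the statistic takes the standard pooled form $t_{STAT} = (\bar{X}_A - \bar{X}_B) / \sqrt{S_p^2 (1/n_A + 1/n_B)}$, where $S_p^2$ is the pooled sample variance and the degrees of freedom are $n_A + n_B - 2$. The sketch below computes it by hand and compares it with `ttest_ind`. The helper `pooled_t_statistic` is a hypothetical name introduced here, and only the five rows shown by `mydata.head()` are used, so the numbers are illustrative rather than the result for the full data set.

```python
import numpy as np
from scipy import stats

def pooled_t_statistic(x, y):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)
    # Pooled sample variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
    t_stat = (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t_stat, n1 + n2 - 2

# The five rows displayed by mydata.head() above (a subset, for illustration only).
wing_a = [10.70, 9.89, 11.83, 9.04, 9.37]
wing_b = [7.20, 6.68, 9.29, 8.95, 6.61]

t_manual, df = pooled_t_statistic(wing_a, wing_b)
t_scipy, p_scipy = stats.ttest_ind(wing_a, wing_b)  # equal_var=True by default

print("manual t =", t_manual, "scipy t =", t_scipy, "df =", df)
```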

### Step 4: Calculate the p - value and test statistic

**This is a two-sided test for the null hypothesis that two independent samples have identical average (expected) values. By default, the test assumes that the populations have identical variances.**

For this exercise, we are going to first assume that the variance is equal and then compute the necessary statistical values.
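
If the equal-variance assumption is in doubt, one common option is to check it with Levene's test and to run Welch's t-test by passing `equal_var=False` to `ttest_ind`. The sketch below uses synthetic delivery times with deliberately different spreads; the data and variable names are assumptions for illustration, not the contents of the Luggage file.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical delivery times (minutes), 20 per wing, with different spreads.
wing_a = rng.normal(loc=10.0, scale=2.5, size=20)
wing_b = rng.normal(loc=8.5, scale=1.0, size=20)

# Levene's test: the null hypothesis is that the two populations have equal variances.
levene_stat, levene_p = stats.levene(wing_a, wing_b)

pooled = stats.ttest_ind(wing_a, wing_b)                  # equal_var=True (default)
welch = stats.ttest_ind(wing_a, wing_b, equal_var=False)  # Welch's t-test

print("Levene p-value:", levene_p)
print("pooled: t = %.3f, p = %.4g" % (pooled.statistic, pooled.pvalue))
print("Welch : t = %.3f, p = %.4g" % (welch.statistic, welch.pvalue))
```

With equal sample sizes the pooled and Welch t statistics coincide; the two tests differ only in their degrees of freedom, so the Welch p-value is never smaller than the pooled one.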

In [6]:
t_statistic, p_value  = ttest_ind(mydata['WingA'],mydata['WingB']) #ttest_ind function is used to do t test on two samples
print('tstat',t_statistic)    
print('P Value',p_value)    

tstat 5.16151166403543
P Value 8.007988032535588e-06


In [9]:
t1=stats.t.ppf(q=0.025,df=38)  # lower critical value; df = n1 + n2 - 2 = 20 + 20 - 2
print(round(t1, 3))
t2=stats.t.ppf(q=0.975,df=38)  # upper critical value
print(round(t2, 3))

-2.024
2.024
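
The critical values can be turned into an explicit decision rule. The sketch below assumes 20 deliveries per wing, as stated in the problem (so 20 + 20 - 2 = 38 degrees of freedom for the pooled test), and copies the t statistic printed by `ttest_ind` above.

```python
from scipy import stats

alpha = 0.05
n_a, n_b = 20, 20                 # sample sizes stated in the problem
df = n_a + n_b - 2                # degrees of freedom for the pooled test
t_statistic = 5.16151166403543    # value printed by ttest_ind above

# Two-sided test: reject H0 when |t| exceeds the upper alpha/2 critical value.
t_critical = stats.t.ppf(1 - alpha / 2, df)
reject_h0 = abs(t_statistic) > t_critical

print("critical value: %.3f, reject H0: %s" % (t_critical, reject_h0))
```

The critical-value rule and the p-value rule always lead to the same decision; they are two views of the same comparison.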


### Step 5: Decide whether to reject or fail to reject the null hypothesis

In [4]:
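# Reject the null hypothesis when p_value < alpha_level
print ("two-sample t-test p-value=", p_value)

alpha_level = 0.05

if p_value < alpha_level:
    print('We conclude that the mean times to deliver luggage in the two wings of the hotel are not the same.')
else:
    print('We do not have enough evidence to conclude that the mean delivery times differ.')

two-sample t-test p-value= 8.007988032535588e-06
We conclude that the mean times to deliver luggage in the two wings of the hotel are not the same.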


## Example 2 - Paired T-Test

The file Concrete contains the compressive strength, in thousands of pounds per square inch (psi), of 40 samples of concrete taken two and seven days after pouring. (Data extracted from O. Carrillo-Gamboa and R. F. Gunst, “Measurement-Error-Model Collinearities,” Technometrics, 34 (1992): 454–464.)

At the 0.01 level of significance, is there evidence that the mean strength is lower at two days than at seven days?


In [15]:
mydata = pd.read_csv('Concrete.csv')
mydata

Unnamed: 0,Sample,Two Days,Seven Days
0,1,2.83,3.505
1,2,3.295,3.43
2,3,2.71,3.67
3,4,2.855,3.355
4,5,2.98,3.985
5,6,3.065,3.63
6,7,3.765,4.57
7,8,3.265,3.7
8,9,3.17,3.66
9,10,2.895,3.25


### Step 1: Define null and alternative hypotheses

* The null hypothesis states that the mean compressive strength of the concrete is not lower at two days than at seven days, $\mu_{2}$ $\geq$ $\mu_{7}$.
* The alternative hypothesis states that the mean compressive strength of the concrete is lower at two days than at seven days, $\mu_{2}$ < $\mu_{7}$.

* $H_0$: $\mu_{2}$ - $\mu_{7}$ $\geq$  0
* $H_A$: $\mu_{2}$ - $\mu_{7}$ <  0

Here, $\mu_2$ denotes the mean compressive strength of the concrete after two days and $\mu_7$ denotes the mean compressive strength of the concrete after seven days.

### Step 2: Decide the significance level

Here we select $\alpha$ = 0.01 as given in the question.

### Step 3: Identify the test statistic

* Both samples have the same size, because each of the 40 concrete samples is measured twice.
* We have two paired samples and we do not know the population standard deviation.
* Because the population standard deviation of the differences is unknown, we use the t distribution and the $t_{STAT}$ test statistic for the two-sample paired test, with n - 1 = 39 degrees of freedom.
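
A paired t-test is equivalent to a one-sample t-test on the pairwise differences, with $t_{STAT} = \bar{D} / (S_D / \sqrt{n})$. The sketch below checks this equivalence using only the ten rows displayed above, so the numbers are illustrative and are not the result for all 40 samples.

```python
import numpy as np
from scipy import stats

# The ten rows displayed above (thousands of psi); a subset of the 40 samples.
two_days = np.array([2.830, 3.295, 2.710, 2.855, 2.980,
                     3.065, 3.765, 3.265, 3.170, 2.895])
seven_days = np.array([3.505, 3.430, 3.670, 3.355, 3.985,
                       3.630, 4.570, 3.700, 3.660, 3.250])

d = two_days - seven_days                            # pairwise differences
n = len(d)
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))   # one-sample t on the differences

t_rel, p_rel = stats.ttest_rel(two_days, seven_days)
t_one, p_one = stats.ttest_1samp(d, 0.0)

print("manual:", t_manual, "ttest_rel:", t_rel, "ttest_1samp:", t_one)
```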

### Step 4: Calculate the p - value and test statistic

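**We use scipy.stats.ttest_rel to perform the t-test on two related samples of scores. By default, this is a two-sided test for the null hypothesis that two related or repeated samples have identical average (expected) values. We give the two sample observations as input, and the function returns the t statistic and the two-tailed p-value. Because our alternative hypothesis is one-sided ($\mu_{2}$ < $\mu_{7}$), we halve the two-tailed p-value; this is valid only when the t statistic is negative, that is, when it points in the direction of the alternative.**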

In [6]:
# paired t-test
t_statistic, p_value  =  stats.ttest_rel(mydata['Two Days'],mydata['Seven Days'])
print('tstat  %1.3f' % t_statistic)
print("p-value for one-tail:", p_value/2)

tstat  -9.372
p-value for one-tail: 7.768158524368873e-12
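
Instead of halving the two-tailed p-value by hand, the one-sided test can be requested directly through the `alternative` parameter, which `ttest_rel` accepts in SciPy 1.6 and later (the installed version is an assumption here). The sketch again uses only the ten rows displayed earlier, so its numbers are illustrative.

```python
import numpy as np
from scipy import stats

# The ten rows displayed earlier (thousands of psi); a subset of the 40 samples.
two_days = np.array([2.830, 3.295, 2.710, 2.855, 2.980,
                     3.065, 3.765, 3.265, 3.170, 2.895])
seven_days = np.array([3.505, 3.430, 3.670, 3.355, 3.985,
                       3.630, 4.570, 3.700, 3.660, 3.250])

two_sided = stats.ttest_rel(two_days, seven_days)
# alternative='less' tests H_A: mean(two_days - seven_days) < 0.
one_sided = stats.ttest_rel(two_days, seven_days, alternative='less')

print("t statistic:", two_sided.statistic)
print("one-sided p:", one_sided.pvalue, "two-sided p / 2:", two_sided.pvalue / 2)
```

The two p-values agree because the t statistic is negative. Had it been positive, the correct lower-tail p-value would be one minus half of the two-tailed value, which the `alternative` argument handles automatically.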


In [16]:
t1=stats.t.ppf(q=0.01,df=39)  # lower-tail critical value; df = n - 1 = 40 - 1
print(round(t1, 3))
t2=stats.t.ppf(q=1-0.01,df=39)  # upper-tail value, shown for reference only
print(round(t2, 3))

-2.426
2.426
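
For this lower-tail test only the negative critical value matters. The sketch below states the decision rule explicitly, assuming the 40 paired samples given in the problem (so 39 degrees of freedom) and copying the t statistic printed by `ttest_rel` above.

```python
from scipy import stats

alpha = 0.01
n_pairs = 40            # number of concrete samples stated in the problem
df = n_pairs - 1        # a paired test has n - 1 degrees of freedom
t_statistic = -9.372    # value printed by ttest_rel above

# Lower-tail test (H_A: mu_2 - mu_7 < 0): reject H0 when t falls below the
# alpha quantile of the t distribution; the upper tail plays no role here.
t_critical = stats.t.ppf(alpha, df)
reject_h0 = t_statistic < t_critical

print("critical value: %.3f, reject H0: %s" % (t_critical, reject_h0))
```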


### Step 5: Decide whether to reject or fail to reject the null hypothesis

In [17]:
# one-tailed p_value < 0.01 => reject the null hypothesis in favour of the alternative:
# the mean strength at two days is lower than at seven days at the 1% significance level
print ("Paired two-sample t-test p-value=", p_value/2)

alpha_level = 0.01

if (p_value/2) < alpha_level:
    print('We have enough evidence to reject the null hypothesis in favour of alternative hypothesis')
    
else:
    print('We do not have enough evidence to reject the null hypothesis in favour of alternative hypothesis')
    

Paired two-sample t-test p-value= 7.768158524368873e-12
We have enough evidence to reject the null hypothesis in favour of alternative hypothesis
