## Two sample t test

**The two sample t test (Snedecor and Cochran 1989) is used to determine whether two population means are equal.
A common application is to test whether a new treatment, approach, or process yields better results than the current one.**

The two samples can be of two kinds:

* 1) Data is *paired* - For example, a group of students is given coaching classes and the effect of coaching on the marks scored is determined.
* 2) Data is *not paired* - For example, find out whether the miles per gallon of cars of Japanese make is superior to that of cars of Indian make.

## Two sample t test for unpaired data is defined as 
* $H_0$: $\mu_1 = \mu_2$
* $H_a$: $\mu_1 \neq \mu_2$

### Test statistic T = $\frac{\overline{X}_1 - \overline{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

* where $n_1$ and $n_2$ are the sample sizes and $\overline{X}_1$ and $\overline{X}_2$ are the sample means
* $s_1^2$ and $s_2^2$ are the sample variances
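
As a minimal sketch of the formula above, the statistic can be computed directly with NumPy and compared with SciPy. The two small samples are made up for illustration only, and `equal_var=False` is passed so that `ttest_ind` uses the same unpooled standard error as the formula.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical samples, for illustration only (not data from this notebook)
sample_1 = np.array([5.1, 4.9, 6.2, 5.8, 5.5, 6.0])
sample_2 = np.array([4.2, 4.8, 5.0, 4.4, 5.3])

n1, n2 = len(sample_1), len(sample_2)
mean_1, mean_2 = sample_1.mean(), sample_2.mean()

# ddof=1 gives the sample variance (divides by n - 1)
var_1, var_2 = sample_1.var(ddof=1), sample_2.var(ddof=1)

# Test statistic: difference of means over the unpooled standard error
t_manual = (mean_1 - mean_2) / np.sqrt(var_1 / n1 + var_2 / n2)

# equal_var=False makes SciPy use the same unpooled standard error
t_scipy, p_value = ttest_ind(sample_1, sample_2, equal_var=False)
print(t_manual, t_scipy)
```

The two printed statistics should agree up to floating-point rounding.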

### EXERCISE

Compare two unrelated samples. Data was collected on the weight loss of 16 women and 20 men enrolled in a weight reduction program.
At $\alpha$ = 0.05, test whether the mean weight loss of these two groups is different.

In [1]:
Weight_loss_Male   = [ 3.69, 4.12, 4.65, 3.19,  4.34, 3.68, 4.12, 4.50, 3.70, 3.09,3.65, 4.73, 3.93, 3.46, 3.28, 4.43, 4.13, 3.62, 3.71, 2.92]
Weight_loss_Female = [2.99, 1.80, 3.79, 4.12, 1.76, 3.50, 3.61, 2.32, 3.67, 4.26, 4.57, 3.01, 3.82, 4.33, 3.40, 3.86]

In [9]:
from    scipy.stats             import  ttest_1samp,ttest_ind, wilcoxon, ttest_ind_from_stats
import  scipy.stats             as      stats  
from    statsmodels.stats.power import  ttest_power
import  matplotlib.pyplot       as      plt

### Step 1: Define null and alternative hypotheses

### Step 2: Decide the significance level

### Step 3: Identify the test statistic

### Step 4: Calculate the p - value and test statistic

**We use `scipy.stats.ttest_ind` to calculate the t-test for the means of two independent samples of scores, given the two sets of observations. This function returns the t statistic and the two-tailed p-value.**

**This is a two-sided test for the null hypothesis that two independent samples have identical average (expected) values. By default (`equal_var=True`), the test assumes that the populations have identical variances and pools the two sample variances; passing `equal_var=False` uses the unpooled standard error shown in the test statistic above (Welch's t-test).**
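
To make the equal-variance assumption concrete, here is a rough sketch that computes the pooled-variance statistic by hand and compares it with the default call to `ttest_ind`. The samples are hypothetical and the variable names are chosen only for this illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical samples, for illustration only (not data from this notebook)
a = np.array([5.1, 4.9, 6.2, 5.8, 5.5, 6.0])
b = np.array([4.2, 4.8, 5.0, 4.4, 5.3])
n1, n2 = len(a), len(b)

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)

# Pooled t statistic with n1 + n2 - 2 degrees of freedom
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

# Two-tailed p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df)

# Default call (equal_var=True) uses the same pooled estimate
t_scipy, p_scipy = stats.ttest_ind(a, b)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```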

### Step 5: Decide whether to reject the null hypothesis

In [3]:
# Null hypothesis : Mean of both the samples are equal
# Alternate hypothesis : Not equal
# 5% significance level
# Two sample t-test of unpaired data
# Calculating p-value
t_statistic, p_value = ttest_ind(Weight_loss_Male, Weight_loss_Female)
print(t_statistic, p_value)

1.827188295981286 0.0764604205335295


In [None]:
# Decide whether to reject the null hypothesis at the 5% significance level
if p_value < 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

## Two sample t test for paired data

### EXERCISE

Compare two related samples. Data was collected on the marks scored by 25 students in their final practice exam and the marks scored by the same students after attending special coaching classes conducted by their college.
At 5% level of significance, is there any evidence that the coaching classes have any effect on the marks scored?

In [12]:
Marks_before = [ 52, 56, 61, 47, 58, 52, 56, 60, 52, 46, 51, 62, 54, 50, 48, 59, 56, 51, 52, 44, 52, 45, 57, 60, 45]

Marks_after  = [62, 64, 40, 65, 76, 82, 53, 68, 77, 60, 69, 34, 69, 73, 67, 82, 62, 49, 44, 43, 77, 61, 67, 67, 54]

### Step 1: Define null and alternative hypotheses

### Step 2: Decide the significance level

### Step 3: Identify the test statistic

### Step 4: Calculate the p - value and test statistic

**`scipy.stats.ttest_rel` calculates the t-test on two related samples of scores.
This is a two-sided test for the null hypothesis that two related or repeated samples have identical average (expected) values. It takes the two sets of observations as input and returns the t statistic and the two-tailed p-value. The cell below runs the equivalent one-sample test, `scipy.stats.ttest_1samp`, on the pairwise differences against a mean of 0.**
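
As a quick check of that equivalence, the sketch below runs both `ttest_rel` on the paired marks and `ttest_1samp` on their differences. It repeats the marks from this exercise so that it runs on its own; the names `t_rel`, `t_diff`, and so on are chosen only for this illustration.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

marks_before = np.array([52, 56, 61, 47, 58, 52, 56, 60, 52, 46, 51, 62, 54,
                         50, 48, 59, 56, 51, 52, 44, 52, 45, 57, 60, 45])
marks_after = np.array([62, 64, 40, 65, 76, 82, 53, 68, 77, 60, 69, 34, 69,
                        73, 67, 82, 62, 49, 44, 43, 77, 61, 67, 67, 54])

# Paired t-test on the two related samples
t_rel, p_rel = ttest_rel(marks_after, marks_before)

# One-sample t-test on the pairwise differences against a mean of 0
t_diff, p_diff = ttest_1samp(marks_after - marks_before, 0)

print(t_rel, p_rel)
print(t_diff, p_diff)
```

Both calls should print the same statistic and p-value up to floating-point rounding.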

### Step 5: Decide whether to reject the null hypothesis

In [14]:
# Null hypothesis : Mean marks scored before and after the coaching classes are equal, hence no effect of coaching classes.
# Alternate hypothesis : Not equal, hence there is an effect of coaching classes on the marks.
# 5% significance level
# Two sample t-test of paired data
# Calculating p-value
difference = [Marks_after[i]-Marks_before[i] for i in range(len(Marks_before))]
t_statistic, p_value = ttest_1samp(difference, 0)
print(t_statistic, p_value)

3.404831324883169 0.0023297583680290364


In [15]:
# Decide whether to reject the null hypothesis at the 5% significance level
if p_value < 0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")

Reject null hypothesis


### EXERCISE
**Alcohol consumption before and after love failure is given below. Conduct a paired t test to check whether the alcohol consumption is more after the love failure at 5% level of significance.**

### Step 1: Define null and alternative hypotheses

### Step 2: Decide the significance level

### Step 3: Identify the test statistic

### Step 4: Calculate the p - value and test statistic

**We use `scipy.stats.ttest_1samp` to calculate the t-test on the differences between the paired sample scores.**

In [18]:
import numpy as np

Alchohol_Consumption_before = np.array([470, 354, 496, 351, 349, 449, 378, 359, 469, 329, 389, 497, 493, 268, 445, 287, 338, 271, 412, 335])
Alchohol_Consumption_after  = np.array([408, 439, 321, 437, 335, 344, 318, 492, 531, 417, 358, 391, 398, 394, 508, 399, 345, 341, 326, 467])

D  = Alchohol_Consumption_after -Alchohol_Consumption_before
print(D)
print('Mean is %3.2f and standard deviation is %3.2f' %(D.mean(),np.std(D,ddof = 1)))

[ -62   85 -175   86  -14 -105  -60  133   62   88  -31 -106  -95  126
   63  112    7   70  -86  132]
Mean is 11.50 and standard deviation is 95.68


In [19]:
# Null hypothesis : Mean alcohol consumption is either the same or less after the love failure.
# Alternate hypothesis : Mean alcohol consumption is more after the love failure.
# 5% significance level
# Paired t-test, run as a one-sample t-test on the differences D
# ttest_1samp returns a two-sided p-value; for this one-sided alternative, halve it when t_statistic > 0
t_statistic, p_value = ttest_1samp(D, 0)
print(t_statistic, p_value)

0.5375404241815105 0.5971346738292477


### Step 5: Decide whether to reject the null hypothesis

In [20]:
# Decide whether to reject the null hypothesis at the 5% significance level
if p_value < 0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")

Fail to reject null hypothesis
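
The exercise asks whether consumption is *more* after the love failure, which is a one-sided question, whereas `ttest_1samp` returns a two-sided p-value by default. A minimal sketch of the one-sided p-value follows, taking the upper tail of the t distribution with n - 1 degrees of freedom. It repeats the data from this exercise so that it runs on its own; the names `diff`, `p_one_sided`, and so on are chosen only for this illustration.

```python
import numpy as np
from scipy import stats

before = np.array([470, 354, 496, 351, 349, 449, 378, 359, 469, 329,
                   389, 497, 493, 268, 445, 287, 338, 271, 412, 335])
after = np.array([408, 439, 321, 437, 335, 344, 318, 492, 531, 417,
                  358, 391, 398, 394, 508, 399, 345, 341, 326, 467])
diff = after - before

# Two-sided one-sample t-test on the differences
t_statistic, p_two_sided = stats.ttest_1samp(diff, 0)

# One-sided p-value for H_a: mean difference > 0 (upper tail of the t distribution)
df = len(diff) - 1
p_one_sided = stats.t.sf(t_statistic, df)

print(t_statistic, p_two_sided, p_one_sided)
if p_one_sided < 0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")
```

Because the t statistic is positive here, the one-sided p-value is half the two-sided one, and the conclusion at the 5% level does not change.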


### EXERCISE

**Sugar consumption in grams of 22 patients (both diabetic and non-diabetic) is given below:**

**At 5% level of significance, is there evidence that the mean sugar consumption is different for diabetic and non-diabetic?** 

*In the following table, 0 means diabetic and 1 means non-diabetic.*
    

In [21]:
import numpy       as np
import scipy.stats as stats
weight               = np.array([[9.31, 0],[7.76, 0],[6.98, 1],[7.88, 1],[8.49, 1],[10.05, 1],[8.80, 1],[10.88, 1],[6.13, 1],[7.90, 1], \
                            [11.51, 0],[12.59, 0],[7.05, 1],[11.85, 0],[9.99, 0],[7.48, 0],[8.79, 0],[8.69, 1],[9.68, 0],[8.58, 1],\
                           [9.19, 0],[8.11, 1]])

sugar_diabetic       = weight[:,1] == 0
sugar_diabetic       = weight[sugar_diabetic][:,0]
sugar_nondiabetic    = weight[:,1] == 1
sugar_nondiabetic    = weight[sugar_nondiabetic][:,0] 

#### Hint: 

Use the numpy array, sugar_diabetic and numpy array, sugar_nondiabetic for your analysis.

In [22]:
# Null hypothesis : Mean of both the samples are equal
# Alternate hypothesis : Not equal
# 5% significance level
# Two sample t-test of unpaired data
# Calculating p-value
t_statistic, p_value = ttest_ind(sugar_diabetic, sugar_nondiabetic)
print(t_statistic, p_value)

2.3730593333971224 0.02777741611352253


In [23]:
# Decide whether to reject the null hypothesis at the 5% significance level
if p_value < 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

Reject the null hypothesis


### EXERCISE

**The delivery times of pizza from an online food delivery service firm and the home delivery from a local restaurant are given below. At 5% level of significance, is the mean delivery time for the online food delivery service firm less than the mean delivery time for home delivery from the local restaurant?**

In [24]:
Pizza_delivery_online = np.array([16.8, 11.7, 15.6, 16.7, 17.5, 18.1, 14.1, 21.8, 13.9, 20.8])
Pizza_delivery_local  = np.array([22.0, 15.2, 18.7, 15.6, 20.8, 19.5, 17.0, 19.5, 16.5, 24.0])

#### Hint: Use paired t test

In [25]:
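# Null hypothesis : Mean delivery time for the online food delivery service firm is equal to or more than the mean delivery time for home delivery from the local restaurant.
# Alternate hypothesis : Mean delivery time for the online food delivery service firm is less than the mean delivery time for home delivery from the local restaurant.
# 5% significance level
# Paired t-test, run as a one-sample t-test on the differences (local - online)
# ttest_1samp returns a two-sided p-value; for this one-sided alternative, halve it when t_statistic > 0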
t_statistic, p_value = ttest_1samp(Pizza_delivery_local - Pizza_delivery_online, 0)
print(t_statistic, p_value)

3.0447930464454114 0.013909593560837055


In [26]:
# Decide whether to reject the null hypothesis at the 5% significance level
if p_value < 0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")

Reject null hypothesis
