# Lab | Inferential statistics - T-test & P-value

## 1. Another simple example of a two-sample t-test (pooled, for when the variances are equal), this time as a one-sided test

In a packing plant, a machine packs cartons with jars. A new machine is expected to pack faster, on average, than the machine currently in use. To test that hypothesis, the time each machine takes to pack ten cartons is recorded. The results, in seconds, are in the file `files_for_lab/machine.txt`.
Assuming the conditions for a t-test are met, do the data provide sufficient evidence that the new machine is faster than the old one?




In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

### Test description

H0: μ (new machine) ≥ μ (old machine)

H1: μ (new machine) < μ (old machine), i.e. the new machine needs less time per carton (one-sided test)

Level of significance: 0.05

If the test statistic falls in the critical region, then we reject the Null Hypothesis
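The critical region can be read off the t distribution instead of a printed table. The sketch below assumes the one-sided test named in the section heading (the new machine's mean time is lower) and pooled degrees of freedom for two samples of ten cartons each; the variable names are illustrative.

```python
from scipy.stats import t

alpha = 0.05                      # level of significance from the test description
n_new, n_old = 10, 10             # ten cartons timed on each machine
df = (n_new - 1) + (n_old - 1)    # pooled degrees of freedom

# Lower-tail critical value: H0 is rejected when the t statistic falls below it
t_critical = t.ppf(alpha, df)
print(f"df = {df}, reject H0 if t < {t_critical:.3f}")
```

With 18 degrees of freedom the threshold is about -1.734.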

### Data manipulation

In [2]:
data = pd.read_excel('files_for_lab/machine.xlsx')
data

Unnamed: 0,New machine,Old machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


In [3]:
data.columns

Index(['New machine', '    Old machine'], dtype='object')

In [4]:
data.rename(columns={'New machine': 'new_machine', '    Old machine': 'old_machine'}, inplace=True)
data

Unnamed: 0,new_machine,old_machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


In [5]:
data.describe()

Unnamed: 0,new_machine,old_machine
count,10.0,10.0
mean,42.14,43.23
std,0.683455,0.749889
min,41.0,41.7
25%,41.8,42.8
50%,42.2,43.4
75%,42.625,43.75
max,43.2,44.1


### Test implementation

In [6]:
new_mean = 42.14
new_std = 0.683
new_n = 10

old_mean = 43.23
old_std = 0.75
old_n = 10
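These summary statistics are enough to compute the pooled t statistic by hand. The following is a minimal sketch, assuming equal population variances and using the unrounded standard deviations reported by `data.describe()`; the helper names `pooled_var` and `std_error` are illustrative.

```python
import math

# Summary statistics copied from data.describe()
new_mean, new_std, new_n = 42.14, 0.683455, 10
old_mean, old_std, old_n = 43.23, 0.749889, 10

df = new_n + old_n - 2
# Pooled variance: average of the sample variances weighted by their degrees of freedom
pooled_var = ((new_n - 1) * new_std**2 + (old_n - 1) * old_std**2) / df
# Standard error of the difference between the two sample means
std_error = math.sqrt(pooled_var * (1 / new_n + 1 / old_n))
t_stat = (new_mean - old_mean) / std_error
print(f"pooled std = {math.sqrt(pooled_var):.3f}, t = {t_stat:.2f}, df = {df}")
```

A negative statistic means the new machine's sample mean is the smaller of the two.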

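In [7]:
from scipy.stats import ttest_ind

# Use the recorded packing times themselves. Drawing random samples with
# norm.rvs would test simulated data and give a different result on every run.
new = data['new_machine']
old = data['old_machine']

In [8]:
# Pooled (equal_var=True) one-sided test of H1: mean(new) < mean(old).
# The `alternative` argument requires SciPy 1.6 or later.
ttest_ind(new, old, equal_var=True, alternative='less')

Approximate output, recomputed from the summary statistics above: statistic ≈ -3.40, one-sided pvalue ≈ 0.002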

### Conclusion

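Conclusion: We reject the null hypothesis at the 5% significance level. The sample means from `data.describe()` (42.14 s for the new machine, 43.23 s for the old one) differ by about 3.4 pooled standard errors, far beyond the one-sided critical value of -1.734 for 18 degrees of freedom.

The data therefore provide sufficient evidence that the new machine packs faster, on average, than the old machine.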
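The decision can be checked end to end on the recorded times. This sketch copies the ten values per machine from the table printed earlier and converts the pooled t statistic into a one-sided p-value through the lower tail of the t distribution, which avoids relying on the newer `alternative` argument of `ttest_ind`; the names `p_one_sided` and `reject_h0` are illustrative.

```python
from scipy.stats import t, ttest_ind

# Packing times in seconds, copied from the table printed earlier
new = [42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]
old = [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1]

# Pooled two-sample t-test (equal_var=True is also the default)
result = ttest_ind(new, old, equal_var=True)
df = len(new) + len(old) - 2

# One-sided p-value for H1: mean(new) < mean(old) is the lower-tail area
p_one_sided = t.cdf(result.statistic, df)
reject_h0 = p_one_sided < 0.05
print(f"t = {result.statistic:.2f}, one-sided p = {p_one_sided:.4f}, reject H0: {reject_h0}")
```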

## 2. An additional problem (not mandatory): here we cannot assume that the population variances are equal, so we cannot pool the variances.
   Independent random samples of 17 sophomores and 13 juniors attending a large university yield the following data on grade point averages. The data are provided in the file `files_for_lab/student_gpa.txt`.
   At the 5% significance level, do the data provide sufficient evidence to conclude that the mean GPAs of sophomores and juniors at the university differ?

   Test statistics can be calculated as: [link to the image - Test statistics calculation for Unpooled Variance Case](https://education-team-2020.s3-eu-west-1.amazonaws.com/data-analytics/7.04/7.04-unpooled_variances.png)

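   For this lab, the degrees of freedom are taken as `(n1-1)+(n2-1)`. This is a simplification: the unpooled (Welch) test normally uses the Welch-Satterthwaite approximation, which never exceeds that value.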

### Test description

H0: μ (sophomores) = μ (juniors)

H1: μ (sophomores) != μ (juniors)

Level of significance: 0.05
    
Degrees of freedom = 28, calculation: (16 + 12)

If the test statistic falls in the critical region, then we will reject the Null Hypothesis
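For a two-sided alternative the significance level is split between both tails, so the critical region has two parts. A short sketch, assuming the simplified degrees of freedom stated above (the variable names are illustrative):

```python
from scipy.stats import t

alpha = 0.05
n_soph, n_jun = 17, 13
df = (n_soph - 1) + (n_jun - 1)   # the lab's simplified degrees of freedom

# Two-sided test: alpha/2 in each tail, so use the 97.5th percentile
t_critical = t.ppf(1 - alpha / 2, df)
print(f"df = {df}, reject H0 if |t| > {t_critical:.3f}")
```

With 28 degrees of freedom the threshold is about 2.048 in absolute value.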

### Data manipulation

In [9]:
data = pd.read_excel('files_for_lab/student_gpa.xlsx')
data

Unnamed: 0,Sophomores\t Juniors,Unnamed: 1
0,3.04,2.56
1,1.71,2.77
2,3.3,2.7
3,2.88,3.0
4,2.11,2.98
5,2.6,3.47
6,2.92,3.26
7,3.6,3.2
8,2.28,3.19
9,2.82,2.65


In [10]:
data.rename(columns={'Sophomores\t  Juniors': 'sophomores', 'Unnamed: 1': 'juniors'}, inplace=True)

In [11]:
data.describe()

Unnamed: 0,sophomores,juniors
count,17.0,13.0
mean,2.84,2.980769
std,0.519832,0.309259
min,1.71,2.56
25%,2.6,2.7
50%,2.92,3.0
75%,3.13,3.2
max,3.6,3.47
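From the summary statistics above, the unpooled test statistic and the Welch-Satterthwaite degrees of freedom can be computed by hand. This is a sketch of the standard Welch formulas, not a transcription of the linked image; the short names `m1`, `s1`, `n1` and so on are illustrative.

```python
import math

# Summary statistics copied from data.describe()
m1, s1, n1 = 2.84, 0.519832, 17        # sophomores
m2, s2, n2 = 2.980769, 0.309259, 13    # juniors

# Estimated variance of each sample mean
v1, v2 = s1**2 / n1, s2**2 / n2
t_stat = (m1 - m2) / math.sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
print(f"t = {t_stat:.2f}, Welch df = {df_welch:.1f}, simplified df = {n1 + n2 - 2}")
```

The approximate degrees of freedom come out slightly below 28, so the simplified value makes the test marginally less conservative.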


### Test implementation

In [12]:
sophomores = data['sophomores'].tolist()
juniors = data['juniors'].dropna().tolist()

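In [13]:
# equal_var=False runs Welch's t-test, which does not pool the variances
ttest_ind(sophomores, juniors, equal_var=False)

Approximate output, recomputed from the summary statistics above: statistic ≈ -0.92, pvalue ≈ 0.36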

### Conclusion

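Conclusion: From a t-distribution table, with 28 degrees of freedom and a two-sided 5% significance level, the critical t value is 2.048, so H0 is rejected only if the test statistic falls below -2.048 or above 2.048.

The test statistic (about -0.9) lies inside that interval and the p-value is well above 0.05, so we fail to reject the null hypothesis: the data do not provide sufficient evidence that the mean GPAs of sophomores and juniors differ.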
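To see how much the equal-variance assumption matters here, the pooled and unpooled tests can be run side by side from the summary statistics alone. The sketch below uses `scipy.stats.ttest_ind_from_stats` with the values reported by `data.describe()`; the `summary` dictionary is an illustrative helper.

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics copied from data.describe()
summary = dict(mean1=2.84, std1=0.519832, nobs1=17,        # sophomores
               mean2=2.980769, std2=0.309259, nobs2=13)    # juniors

pooled = ttest_ind_from_stats(**summary, equal_var=True)   # pools the variances
welch = ttest_ind_from_stats(**summary, equal_var=False)   # Welch's unpooled test

for name, res in [("pooled", pooled), ("Welch", welch)]:
    print(f"{name}: t = {res.statistic:.3f}, p = {res.pvalue:.3f}, "
          f"reject H0: {res.pvalue < 0.05}")
```

Both variants lead to the same decision on these data, although the statistic and p-value differ slightly.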