# Lab | Inferential statistics - T-test & P-value

## 1. Another simple example of a two-sample t-test (pooled, for when the variances are equal), this time one-sided

In a packing plant, a machine packs cartons with jars. It is supposed that a new machine will pack faster on average than the machine currently used. To test that hypothesis, the time each machine takes to pack ten cartons is recorded. The results, in seconds, are shown in the tables in the file `files_for_lab/machine.txt`.
Assuming the conditions for a t-test are met, do the data provide sufficient evidence that the new machine packs faster than the old one?




In [50]:
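import pandas as pd

# Load the packing times (in seconds) recorded for both machines
data = pd.read_excel("machine.xlsx")
# Note: the "Old machine" column name is indexed with its leading spaces
data[["New machine", "    Old machine"]]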


| | New machine | Old machine |
|---|---|---|
| 0 | 42.1 | 42.7 |
| 1 | 41.0 | 43.6 |
| 2 | 41.3 | 43.8 |
| 3 | 41.8 | 43.3 |
| 4 | 42.4 | 42.5 |
| 5 | 42.8 | 43.5 |
| 6 | 43.2 | 43.1 |
| 7 | 42.3 | 41.7 |
| 8 | 41.8 | 44.0 |
| 9 | 42.7 | 44.1 |


In [49]:
new_mean = data["New machine"].mean(axis=0)
old_mean = data["    Old machine"].mean(axis=0)

print(" new machine mean: ",new_mean,"\n","old machine mean: ",old_mean)

 new machine mean:  42.14 
 old machine mean:  43.230000000000004


## Hypothesis test
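- H0: µ_new = µ_old (the new machine's mean packing time is not lower than the old machine's)
- H1: µ_new < µ_old (the new machine packs faster on average)
- This is a one-tailed t-test for two independent samples, comparing the sample means 42.14 and 43.23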
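To make the test statistic concrete, the pooled two-sample t statistic can be worked out by hand from the ten packing times listed above. This is a minimal sketch using only the standard library. The variable names are illustrative, and the one-sided critical value 1.734 (18 degrees of freedom, assuming a 5% significance level) is taken from a standard t-table.

```python
from math import sqrt
from statistics import mean, variance

# Packing times in seconds, copied from the table above
new = [42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]
old = [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1]

n1, n2 = len(new), len(old)
df = n1 + n2 - 2  # pooled degrees of freedom

# Pooled variance: sample variances weighted by their degrees of freedom
pooled_var = ((n1 - 1) * variance(new) + (n2 - 1) * variance(old)) / df

# Standard error of the difference in means, then the t statistic
std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = (mean(new) - mean(old)) / std_err

# One-sided critical value for df = 18 at alpha = 0.05 (from a t-table)
t_crit = 1.734
print(f"t = {t_stat:.4f}, df = {df}, reject H0: {t_stat < -t_crit}")
```

Because H1 says the new mean is lower, H0 is rejected only when the statistic falls below the negative critical value.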

In [33]:
# we can import the ttest_ind from scipy so we don't have to work it out by hand

from scipy.stats import ttest_ind


T_test = ttest_ind(list(data["New machine"]), list(data["    Old machine"]))

# ttest_ind is two-tailed by default, so we halve the p-value to get the one-tailed value;
# this is valid here only because the statistic is negative, i.e. in the direction stated in H1
print(T_test, "\np-value: ", T_test.pvalue / 2)


Ttest_indResult(statistic=-3.3972307061176026, pvalue=0.0032111425007745158) 
p-value:  0.0016055712503872579
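As an alternative to halving the two-tailed p-value, `ttest_ind` can run the one-sided test directly through its `alternative` parameter. This assumes SciPy 1.6.0 or later is installed, which the lab does not state. The sketch below re-enters the packing times as plain lists so that it runs on its own.

```python
from scipy.stats import ttest_ind

# Packing times in seconds, copied from the table above
new = [42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]
old = [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1]

# alternative="less" tests H1: mean(new) < mean(old); equal_var=True pools the variances
result = ttest_ind(new, old, equal_var=True, alternative="less")
one_sided_p = result.pvalue
print(f"t = {result.statistic:.4f}, one-sided p = {one_sided_p:.4f}")
```

This avoids the sign check that manual halving needs: had the statistic come out positive, the correct one-sided p-value would be `1 - p/2` rather than `p/2`.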


## Result

- The one-tailed p-value for our t-statistic (-3.40) is 0.0016. This is less than our alpha level of 0.05, so we reject the null hypothesis.
- We therefore conclude, at the 5% significance level, that the new machine packs faster on average than the old one.

## 2. An additional problem (not mandatory): here we cannot assume that the population variances are equal, so we cannot pool the variances
   Independent random samples of 17 sophomores and 13 juniors attending a large university yield the following data on grade point averages. The data are provided in the file `files_for_lab/student_gpa.txt`.
   At the 5% significance level, do the data provide sufficient evidence to conclude that the mean GPAs of sophomores and juniors at the university differ?

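   The test statistic can be calculated as: [link to the image - Test statistics calculation for Unpooled Variance Case](https://education-team-2020.s3-eu-west-1.amazonaws.com/data-analytics/7.04/7.04-unpooled_variances.png)

   For this lab, the degrees of freedom are taken as `(n1-1)+(n2-1)`. Note that `ttest_ind` with `equal_var=False` (Welch's t-test) instead uses the Welch-Satterthwaite approximation, which is never larger than this value.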
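For reference, the unpooled (Welch) test statistic and the Welch-Satterthwaite degrees of freedom can be written out as below. The notation is assumed here rather than taken from the linked image: $\bar{x}_i$, $s_i^2$, and $n_i$ denote the sample mean, sample variance, and size of sample $i$.

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}
$$

The value $\nu$ always lies between $\min(n_1 - 1,\, n_2 - 1)$ and $(n_1 - 1) + (n_2 - 1)$, so using the latter gives a slightly smaller critical value than the approximation would.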

In [35]:
data=pd.read_excel("student_gpa.xlsx")
data

| | Sophomores | Juniors |
|---|---|---|
| 0 | 3.04 | 2.56 |
| 1 | 1.71 | 2.77 |
| 2 | 3.3 | 2.7 |
| 3 | 2.88 | 3.0 |
| 4 | 2.11 | 2.98 |
| 5 | 2.6 | 3.47 |
| 6 | 2.92 | 3.26 |
| 7 | 3.6 | 3.2 |
| 8 | 2.28 | 3.19 |
| 9 | 2.82 | 2.65 |


## Setting up test

- This is a two-tailed test: H0 is µ_sophomores = µ_juniors and H1 is µ_sophomores ≠ µ_juniors, so the difference could go in either direction
- Degrees of freedom = (17 - 1) + (13 - 1) = 28
- Alpha level 0.05

<img src="https://cdn1.byjus.com/wp-content/uploads/2020/04/T-table.png" height="400">

- If the absolute value of our t-statistic is above the critical value of 2.048 (two-tailed, 28 degrees of freedom), then we can reject the H0

In [44]:
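# here we use the same scipy function as above, but with equal_var=False (Welch's t-test)
# because we cannot assume that the population variances are equal
# only the first 13 rows of the Juniors column are used, matching the 13 juniors sampled
T_test = ttest_ind(list(data["Sophomores"]), list(data["  Juniors"][0:13]), equal_var=False)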
T_test


Ttest_indResult(statistic=-0.9231495630900278, pvalue=0.3642180675348571)
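To see what `equal_var=False` computes, here is a minimal standard-library sketch of Welch's t statistic and the Welch-Satterthwaite degrees of freedom. The helper name `welch_t` is hypothetical, and the lists hold only the ten rows displayed above, not the full samples of 17 and 13, so the numbers it prints will not match the `ttest_ind` output.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite df) for two samples."""
    # Estimated variance of each sample mean: s^2 / n
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t_stat = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t_stat, df

# Only the ten rows shown in the table above (the full data has 17 and 13 values)
sophomores = [3.04, 1.71, 3.3, 2.88, 2.11, 2.6, 2.92, 3.6, 2.28, 2.82]
juniors = [2.56, 2.77, 2.7, 3.0, 2.98, 3.47, 3.26, 3.2, 3.19, 2.65]

t_stat, df = welch_t(sophomores, juniors)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

The variances are kept separate rather than pooled, which is why the degrees of freedom come out as a non-integer that depends on how unequal the two sample variances are.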

## Result

- The absolute value of the t-statistic is 0.92, which is not above 2.048, so we do not reject the H0.
- The p-value (0.364) is also not less than 0.05, which leads to the same decision.
- To conclude, at the 5% significance level the data do not provide sufficient evidence that the mean GPAs of sophomores and juniors differ.