# Lab | Inferential statistics - T-test & P-value


### Instructions

# 1.

This is another simple example of a two-sample t-test (pooled, for when the variances are equal). This time, however, the test is one-sided.

In a packing plant, a machine packs cartons with jars. It is supposed that a new machine will pack faster on the average than the machine currently used. To test that hypothesis, the times it takes each machine to pack ten cartons are recorded. The results, in seconds, are shown in the tables in the file `files_for_lab/machine.txt`.
Assuming the conditions for conducting a t-test are met, do the data provide sufficient evidence that the new machine is faster than the old one?

In [1]:
import numpy as np
import pandas as pd
import statistics
import math
from scipy.stats import t

In [2]:
data = pd.read_csv('machine.txt', sep = "\t")

In [3]:
data

Unnamed: 0,New machine,Old machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


In [4]:
#new_machine = [42.1, 41, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]
#old_machine = [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44, 44.1]

In [7]:
sample_mean1 = round((sum(data['New machine']) / len(data['New machine'])),2)
sample_mean1

42.14

In [8]:
sample_std1 = round(statistics.stdev(data['New machine']),2)
sample_std1

0.68

In [9]:
sample_mean2 = round((sum(data['Old machine']) / len(data['Old machine'])),2)
sample_mean2

43.23

In [10]:
sample_std2 = round(statistics.stdev(data['Old machine']),2)
sample_std2

0.75

#### The null hypothesis is that there is no difference between the two population means, i.e.
H0: μ1 - μ2 = 0

#### The alternative is that the new machine is faster, i.e.
Ha: μ1 - μ2 < 0

Here μ1 and μ2 are the population mean packing times of the new and old machines, estimated by `sample_mean1` and `sample_mean2`.
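For reference, the pooled two-sample statistic computed in the next cell can be written as below. The notation is an assumption introduced here, not taken from the lab files: $\bar{x}_i$, $s_i$ and $n_i$ are the sample mean, sample standard deviation and size of sample $i$.

```latex
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}},
\qquad
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\qquad
\text{df} = n_1 + n_2 - 2
```

With $n_1 = n_2 = 10$ this gives 18 degrees of freedom.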

In [11]:
n = 10

pooled_sample_std = math.sqrt(((n-1)*sample_std1**2 + (n-1)*sample_std2**2)/(n+n-2))
statistic = (sample_mean1-sample_mean2)/(pooled_sample_std*math.sqrt((1/n)+(1/n)))
print("T Statistic is: ", statistic)

T Statistic is:  -3.4047540987884606


In [27]:
# Using Python to find the p-value and critical value (left-tailed test)
print("P value is: ", round(t.cdf(statistic, n+n-2), 4)) # area to the left of the statistic
print("Critical value of t is: ", round(t.ppf(0.05, n+n-2), 4)) # alpha is 0.05, all in the left tail

P value is:  0.0016
Critical value of t is:  -1.7341


In [25]:
from scipy.stats import ttest_ind

display(ttest_ind(data['New machine'], data['Old machine'], axis=0, equal_var=True, nan_policy='propagate'))

Ttest_indResult(statistic=-3.3972307061176026, pvalue=0.0032111425007745158)
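The hand-computed statistic (-3.4047) and the `ttest_ind` statistic (-3.3972) differ slightly because the means and standard deviations were rounded to two decimals before being combined. The sketch below repeats the pooled calculation without intermediate rounding, using only the standard library; the helper name `pooled_t_statistic` is hypothetical and the data are retyped from the table above.

```python
import math
import statistics

# Packing times in seconds, copied from the table above
new_machine = [42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]
old_machine = [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1]


def pooled_t_statistic(sample1, sample2):
    """Pooled two-sample t statistic, computed without intermediate rounding."""
    n1, n2 = len(sample1), len(sample2)
    var1 = statistics.variance(sample1)  # sample variance (n - 1 denominator)
    var2 = statistics.variance(sample2)
    pooled_std = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    mean_diff = statistics.mean(sample1) - statistics.mean(sample2)
    return mean_diff / (pooled_std * math.sqrt(1 / n1 + 1 / n2))


unrounded_statistic = pooled_t_statistic(new_machine, old_machine)
print("T statistic without rounding:", unrounded_statistic)
```

The result should agree with the `ttest_ind` statistic to within floating-point error, which suggests rounding only when reporting the final value.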

#### Looking at the t table, we find that the critical value is -1.7341. Our test statistic, -3.4047, falls in the rejection region (t < -1.7341). At the 5% significance level, we therefore reject the null hypothesis and conclude that there is enough evidence to suggest that the new machine is faster than the old machine. (`ttest_ind` reports a two-sided p-value by default; halving 0.0032 gives a one-sided p-value of about 0.0016, well below 0.05.)
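The one-sided p-value can also be requested from `ttest_ind` directly. This is a minimal sketch, assuming SciPy 1.6 or later, where the `alternative` keyword is available; the lists are retyped from the table above rather than read from `machine.txt`.

```python
from scipy.stats import ttest_ind

# Packing times in seconds, copied from the table above
new_machine = [42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]
old_machine = [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1]

# Default: two-sided test (Ha: the means differ)
two_sided = ttest_ind(new_machine, old_machine, equal_var=True)

# Left-tailed test (Ha: mean of the first sample is less than mean of the second)
one_sided = ttest_ind(new_machine, old_machine, equal_var=True, alternative='less')

print("Two-sided p-value:", two_sided.pvalue)
print("One-sided p-value:", one_sided.pvalue)
```

Because the statistic is negative, which is the direction of the alternative, the one-sided p-value is half the two-sided one.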

# 2.

An additional problem (not mandatory): here we cannot assume that the population variances are equal, so we cannot pool the variances.
   Independent random samples of 17 sophomores and 13 juniors attending a large university yield the following data on grade point averages. Data is provided in the file `student_gpa.txt`.
   At the 5% significance level, do the data provide sufficient evidence to conclude that the mean GPAs of sophomores and juniors at the university differ?

   Test statistics can be calculated as: [link to the image - Test statistics calculation for Unpooled Variance Case](https://education-team-2020.s3-eu-west-1.amazonaws.com/data-analytics/7.04/7.04-unpooled_variances.png)

   Degrees of freedom is `(n1-1)+(n2-1)`.
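The linked image is not reproduced here. As a hedged reference, the standard unpooled (Welch) statistic is shown below, together with the Welch-Satterthwaite approximation $\nu$ that many references use for the degrees of freedom in place of $(n_1-1)+(n_2-1)$. The approximation is offered as a commonly used alternative, not as part of the lab's instructions; $\bar{x}_i$, $s_i$ and $n_i$ are assumed notation for the sample mean, standard deviation and size of sample $i$.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}
{\frac{\left(s_1^2 / n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2 / n_2\right)^2}{n_2 - 1}}
```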

In [13]:
sophomores = [3.04, 1.71, 3.3, 2.88, 2.11, 2.6, 2.92, 3.6, 2.28, 2.82, 3.03, 3.13, 2.86, 3.49, 3.11, 2.13, 3.27]
juniors = [2.56, 2.77, 2.7, 3, 2.98, 3.47, 3.26, 3.2, 3.19, 2.65, 3, 3.39, 2.58]

In [18]:
len(sophomores)

17

In [19]:
len(juniors)

13

In [14]:
mean_s = round((sum(sophomores) / len(sophomores)),2)
mean_s

2.84

In [15]:
std_s = round(statistics.stdev(sophomores),2)
std_s

0.52

In [16]:
mean_j = round((sum(juniors) / len(juniors)),2)
mean_j

2.98

In [17]:
std_j = round(statistics.stdev(juniors),2)
std_j

0.31

In [26]:
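# equal_var=True runs the pooled test; equal_var=False gives the unpooled (Welch) test that this problem describes
display(ttest_ind(sophomores, juniors, axis=0, equal_var=True, nan_policy='propagate'))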

Ttest_indResult(statistic=-0.864325455323425, pvalue=0.39475359666695975)

In [22]:
n_s = 17
n_j = 13

pooled_sample_std = math.sqrt(((n_s-1)*std_s**2 + (n_j-1)*std_j**2)/(n_s+n_j-2))
statistic = (mean_s-mean_j)/(pooled_sample_std*math.sqrt((1/n_s)+(1/n_j)))
print("T Statistic is: ", statistic)

T Statistic is:  -0.8589504911088421
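The two calculations above pool the variances, whereas the problem statement says they cannot be pooled. The sketch below shows one way the unpooled statistic could be computed with the standard library alone. The helper name `welch_t_test` is hypothetical, and the Welch-Satterthwaite degrees of freedom it returns are a common alternative to the `(n1-1)+(n2-1)` given in the instructions.

```python
import math
import statistics

sophomores = [3.04, 1.71, 3.3, 2.88, 2.11, 2.6, 2.92, 3.6, 2.28, 2.82,
              3.03, 3.13, 2.86, 3.49, 3.11, 2.13, 3.27]
juniors = [2.56, 2.77, 2.7, 3, 2.98, 3.47, 3.26, 3.2, 3.19, 2.65, 3, 3.39, 2.58]


def welch_t_test(sample1, sample2):
    """Return the unpooled t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    # Squared standard error contributed by each sample (s_i^2 / n_i)
    se1_sq = statistics.variance(sample1) / n1
    se2_sq = statistics.variance(sample2) / n2
    mean_diff = statistics.mean(sample1) - statistics.mean(sample2)
    t_stat = mean_diff / math.sqrt(se1_sq + se2_sq)
    dof = (se1_sq + se2_sq) ** 2 / (se1_sq ** 2 / (n1 - 1) + se2_sq ** 2 / (n2 - 1))
    return t_stat, dof


welch_statistic, welch_dof = welch_t_test(sophomores, juniors)
print("Unpooled T statistic is: ", welch_statistic)
print("Welch-Satterthwaite degrees of freedom: ", welch_dof)
```

With SciPy available, `ttest_ind(sophomores, juniors, equal_var=False)` should report the same statistic.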


In [28]:
# Using Python to find the p-value and critical value (two-sided test)
#print("P value is: ", 2*t.cdf(-abs(statistic), n_s+n_j-2))
print("Critical value of t is: ", round(t.ppf(0.025, n_s+n_j-2), 4)) # alpha is 0.05, so 0.025 in each tail; reject if |t| > 2.0484

Critical value of t is:  -2.0484


Since the p-value of 0.39 is larger than α = 0.05, we fail to reject the null hypothesis.
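The p-value quoted here can be recovered from the test statistic. This is a minimal sketch, assuming the statistic reported by `ttest_ind` above and 17 + 13 - 2 = 28 degrees of freedom: for a two-sided test, the p-value is twice the tail area beyond |t|.

```python
from scipy.stats import t

statistic = -0.864325455323425  # value reported by ttest_ind above
dof = 17 + 13 - 2               # pooled degrees of freedom

# Two-sided p-value: probability of a statistic at least this extreme in either tail
p_two_sided = 2 * t.cdf(-abs(statistic), dof)
print("Two-sided p-value:", p_two_sided)
```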

At 5% level of significance, the data does not provide sufficient evidence that the mean GPAs of sophomores and juniors at the university are different.