### Instructions

In a packing plant, a machine packs cartons with jars. It is supposed that a new machine will pack faster on average than the machine currently used. To test that hypothesis, the time each machine takes to pack ten cartons is recorded. The results, in seconds, are shown in the tables in the file files_for_lab/machine.txt. Assuming the conditions for conducting a t-test are met, do the data provide sufficient evidence to show that one machine is better than the other?

In [1]:
import pandas as pd
import numpy as np
import math as m
import scipy.stats as stats

In [3]:
data = pd.read_csv("files_for_lab/machine.txt", encoding='utf-16', sep='\t')

In [4]:
data

|   | New machine | Old machine |
|---|---|---|
| 0 | 42.1 | 42.7 |
| 1 | 41.0 | 43.6 |
| 2 | 41.3 | 43.8 |
| 3 | 41.8 | 43.3 |
| 4 | 42.4 | 42.5 |
| 5 | 42.8 | 43.5 |
| 6 | 43.2 | 43.1 |
| 7 | 42.3 | 41.7 |
| 8 | 41.8 | 44.0 |
| 9 | 42.7 | 44.1 |


##### Set up the hypothesis

* mu = mean packing time per carton, in seconds (a lower mean means a faster machine)
* H0: Both machines have the same mean packing time. muNew = muOld
* Ha: The new machine is faster than the old one. muOld > muNew


##### Assumptions

Following the description, we assume the following:
* The two samples are independent
* The data are normally distributed
* The two samples have similar variances (so the variances can be pooled)
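The assumptions above are taken as given in the exercise, but they can also be inspected. The following is a minimal sketch, assuming the packing times are typed in directly from the table rather than read from the file; the variable names `new_machine` and `old_machine` are introduced here for illustration. It runs a Shapiro-Wilk normality test on each sample and compares the sample variances. With only ten observations per group, these checks have little power, so they are a sanity check rather than proof.

```python
import numpy as np
from scipy import stats

# Packing times in seconds, copied from the table above (illustrative variable names)
new_machine = np.array([42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])
old_machine = np.array([42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1])

# Shapiro-Wilk test: H0 is that the sample comes from a normal distribution,
# so a large p-value means normality is not contradicted by the data
shapiro_p = {}
for name, sample in [("new", new_machine), ("old", old_machine)]:
    stat, p = stats.shapiro(sample)
    shapiro_p[name] = p
    print(f"Shapiro-Wilk ({name} machine): W={stat:.3f}, p={p:.3f}")

# Unbiased sample variances (ddof=1); similar values support pooling
var_new = new_machine.var(ddof=1)
var_old = old_machine.var(ddof=1)
print(f"Sample variance new: {var_new:.4f}, old: {var_old:.4f}")
```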

##### Choose the appropriate test and run it

Criteria for choosing the test:
* The samples are independent
* The data are normally distributed

We should therefore choose an independent two-sample t-test.
* Since we want to test whether one mean is greater than the other, we set alternative = "greater"
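To make the chosen test concrete, here is a sketch of the pooled-variance t statistic computed by hand and compared with `stats.ttest_ind`. It assumes the data are typed in from the table (the names `old_machine` and `new_machine` are illustrative). Note that with `alternative="greater"`, SciPy tests whether the mean of the first sample is greater than the mean of the second, and the p-value it returns is already one-sided.

```python
import numpy as np
from scipy import stats

# Packing times in seconds, copied from the table above (illustrative variable names)
old_machine = np.array([42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1])
new_machine = np.array([42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])

n_old, n_new = len(old_machine), len(new_machine)
df = n_old + n_new - 2  # degrees of freedom for the pooled test

# Pooled variance: weighted average of the two unbiased sample variances
pooled_var = ((n_old - 1) * old_machine.var(ddof=1)
              + (n_new - 1) * new_machine.var(ddof=1)) / df

# Standard error of the difference in means, then the t statistic
se = np.sqrt(pooled_var * (1 / n_old + 1 / n_new))
t_manual = (old_machine.mean() - new_machine.mean()) / se

# One-sided p-value: upper tail, because Ha is muOld > muNew
p_manual = stats.t.sf(t_manual, df)

# Same test via SciPy; the returned p-value is already one-sided
t_scipy, p_scipy = stats.ttest_ind(old_machine, new_machine,
                                   equal_var=True, alternative="greater")

print(f"manual: t={t_manual:.4f}, p={p_manual:.5f}")
print(f"scipy:  t={t_scipy:.4f}, p={p_scipy:.5f}")
```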

In [7]:
data.columns

Index(['New machine', '    Old machine'], dtype='object')

In [8]:
# We create the control and test group

control = np.array(data['    Old machine'])
treatment = np.array(data['New machine'])

control, treatment

(array([42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44. , 44.1]),
 array([42.1, 41. , 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]))

In [13]:
# setting up the t test
# alternative="greater" tests mean(control) > mean(treatment) and already returns the one-sided p-value
ttest, p_value = stats.ttest_ind(control, treatment, equal_var=True, alternative="greater")
print("pvalue (one-sided): ", round(p_value, 5))

if p_value < 0.05:
    print("Reject H0: the mean packing time of the old machine is larger than that of the new machine")
else:
    print("No evidence to reject the null hypothesis")

pvalue (one-sided):  0.00161
Reject H0: the mean packing time of the old machine is larger than that of the new machine


In [12]:
control.mean(), treatment.mean()

(43.230000000000004, 42.14)

Conclusion
* The one-sided p-value (0.0016) is below 0.05 --> We reject the null hypothesis in favor of Ha: the new machine packs faster on average

# Exercise 2

An additional problem (not mandatory): In this case we can't assume that the population variances are equal. Hence in this case we cannot pool the variances. Independent random samples of 17 sophomores and 13 juniors attending a large university yield the following data on grade point averages. Data is provided in the file files_for_lab/student_gpa.txt. At the 5% significance level, do the data provide sufficient evidence to conclude that the mean GPAs of sophomores and juniors at the university differ?

The test statistic can be calculated as shown in the linked image: Test statistics calculation for Unpooled Variance Case.
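The linked image is not reproduced in this export. Assuming it shows the standard unpooled (Welch) two-sample statistic, it would read as follows, where the sample means, sample variances, and sample sizes of the two groups are written with subscripts 1 and 2 (notation introduced here for illustration):

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
```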

The degrees of freedom are (n1-1)+(n2-1).
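As a sketch of how the unpooled statistic and its degrees of freedom can be computed by hand: the exercise suggests (n1-1)+(n2-1) degrees of freedom, while `stats.ttest_ind(..., equal_var=False)` uses the Welch-Satterthwaite approximation, so the two p-values can differ slightly. The GPA values below are typed in from the arrays printed by this notebook's sanity check, and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# GPA values as printed by the notebook (illustrative variable names)
sophomores = np.array([3.04, 1.71, 3.30, 2.88, 2.11, 2.60, 2.92, 3.60, 2.28,
                       2.82, 3.03, 3.13, 2.86, 3.49, 3.11, 2.13, 3.27])
juniors = np.array([2.56, 2.77, 2.70, 3.00, 2.98, 3.47, 3.26, 3.20, 3.19,
                    2.65, 3.00, 3.39, 2.58])

n1, n2 = len(sophomores), len(juniors)

# Per-group variance of the sample mean: s^2 / n (unbiased variance, ddof=1)
v1 = sophomores.var(ddof=1) / n1
v2 = juniors.var(ddof=1) / n2

# Unpooled t statistic
t_manual = (sophomores.mean() - juniors.mean()) / np.sqrt(v1 + v2)

# Degrees of freedom: the exercise's simple rule vs. Welch-Satterthwaite
df_simple = (n1 - 1) + (n2 - 1)
df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# Two-sided p-values under each choice of degrees of freedom
p_simple = 2 * stats.t.sf(abs(t_manual), df_simple)
p_welch = 2 * stats.t.sf(abs(t_manual), df_welch)

# SciPy's Welch test for comparison
t_scipy, p_scipy = stats.ttest_ind(sophomores, juniors, equal_var=False)

print(f"t = {t_manual:.4f}")
print(f"df simple = {df_simple}, p = {p_simple:.5f}")
print(f"df Welch  = {df_welch:.2f}, p = {p_welch:.5f}")
print(f"scipy p   = {p_scipy:.5f}")
```

Either choice of degrees of freedom leads to the same decision at the 5% level for these data, since both p-values are far above 0.05.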

In [16]:
data2 = pd.read_csv("files_for_lab/student_gpa.txt", sep='\t')

In [18]:
data2

|   | Sophomores | Juniors |
|---|---|---|
| 0 | 3.04 | 2.56 |
| 1 | 1.71 | 2.77 |
| 2 | 3.3 | 2.7 |
| 3 | 2.88 | 3.0 |
| 4 | 2.11 | 2.98 |
| 5 | 2.6 | 3.47 |
| 6 | 2.92 | 3.26 |
| 7 | 3.6 | 3.2 |
| 8 | 2.28 | 3.19 |
| 9 | 2.82 | 2.65 |


#### Define the hypothesis

* H0: the mean GPA of sophomores and juniors is the same
* Ha: the mean GPAs of sophomores and juniors differ

##### Assumptions

* The two samples are independent
* The population variances cannot be assumed equal
* The data are assumed to be normally distributed, because the exercise recommends a t-test

##### Choosing the test

* Given these assumptions, we use an independent two-sample t-test
* We set equal_var = False (Welch's t-test) because the variances cannot be assumed equal
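The exercise states that the variances cannot be assumed equal. As an optional check, the sketch below compares the sample variances and runs Levene's test, whose null hypothesis is that the groups have equal variances. The GPA values are typed in from the arrays printed by this notebook, and the variable names are illustrative. A non-significant Levene result would not by itself justify pooling, so `equal_var=False` remains the cautious choice.

```python
import numpy as np
from scipy import stats

# GPA values as printed by the notebook (illustrative variable names)
sophomores = np.array([3.04, 1.71, 3.30, 2.88, 2.11, 2.60, 2.92, 3.60, 2.28,
                       2.82, 3.03, 3.13, 2.86, 3.49, 3.11, 2.13, 3.27])
juniors = np.array([2.56, 2.77, 2.70, 3.00, 2.98, 3.47, 3.26, 3.20, 3.19,
                    2.65, 3.00, 3.39, 2.58])

# Unbiased sample variances (ddof=1)
var_s = sophomores.var(ddof=1)
var_j = juniors.var(ddof=1)
print(f"Sample variance sophomores: {var_s:.4f}, juniors: {var_j:.4f}")

# Levene's test (median-centered by default): H0 is equal variances
lev_stat, lev_p = stats.levene(sophomores, juniors)
print(f"Levene: W={lev_stat:.3f}, p={lev_p:.3f}")
```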

In [19]:
data2.columns

Index(['Sophomores', '  Juniors'], dtype='object')

In [20]:
# Create one array per group for the t test
S = np.array(data2['Sophomores'])
J = np.array(data2['  Juniors'])

# Remove NaNs (the groups have different sizes) so the test can run
S = S[~np.isnan(S)]
J = J[~np.isnan(J)]

S, J  # sanity check

(array([3.04, 1.71, 3.3 , 2.88, 2.11, 2.6 , 2.92, 3.6 , 2.28, 2.82, 3.03,
        3.13, 2.86, 3.49, 3.11, 2.13, 3.27]),
 array([2.56, 2.77, 2.7 , 3.  , 2.98, 3.47, 3.26, 3.2 , 3.19, 2.65, 3.  ,
        3.39, 2.58]))

In [21]:
#PRECHECK
S.mean(), J.mean()

(2.84, 2.980769230769231)

In [22]:
# setting up the t test
ttest, p_value = stats.ttest_ind(S, J, equal_var=False, alternative="two-sided")
print("pvalue: ", round(p_value, 5))

print("Since our hypothesis is two-sided, pvalue two-sided", p_value)
if p_value < 0.05:
    print("Reject H0: the mean GPA of sophomores differs from the mean GPA of juniors")
else:
    print("No evidence to reject the null hypothesis")

pvalue:  0.36422
Since our hypothesis is two-sided, pvalue two-sided 0.3642180675348571
No evidence to reject the null hypothesis
