In [15]:
import scikit_posthocs as sp
import numpy as np
from scipy import stats
import pandas as pd
import os
pd.options.display.float_format = '{:,.4f}'.format

#### 1. Another simple example of a two-sample t-test (pooled, for equal variances), this time one-sided

In a packing plant, a machine packs cartons with jars. A new machine is expected to pack faster on average than the machine currently in use. To test that hypothesis, the time each machine takes to pack ten cartons is recorded. The results, in seconds, are given in the file files_for_lab/machine.txt. Assuming the conditions for a t-test are met, do the data provide sufficient evidence that the new machine is better than the old one?

$H_{0}$: The performances of the machines are equal (same mean packing time).  
$H_{1}$: The new machine performs better than the old one (lower mean packing time).   


In [28]:
os.chdir(r'C:\Users\TrendingPC\Desktop\IronHAck\LABS\LABS-unit-7\4.T-test_&_P-values\files_for_lab')
data = pd.read_csv('machine.txt', encoding="utf-16", sep = "\t")

data

Unnamed: 0,New machine,Old machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


In [35]:
data.columns

Index(['New machine', '    Old machine'], dtype='object')
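The leading spaces in `'    Old machine'` appear to come from padding in the tab-separated header. A minimal sketch of one way to normalize the column names, assuming a small in-memory sample (the `raw` string is hypothetical and only mimics the padded header):

```python
import io
import pandas as pd

# Hypothetical in-memory sample that mimics the padded header of machine.txt
raw = "New machine\t    Old machine\n42.1\t42.7\n41.0\t43.6\n"

df = pd.read_csv(io.StringIO(raw), sep="\t")
print(list(df.columns))  # the header still carries the padding

# Strip surrounding whitespace from every column name
df.columns = df.columns.str.strip()
print(list(df.columns))

# Columns can now be selected without reproducing the padding
old_times = df["Old machine"].to_numpy()
```

After stripping, the columns can be referenced by their plain names, which avoids hard-coding the exact number of spaces.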

In [33]:
new = np.array(data['New machine'])
old = np.array(data['    Old machine'])

In [36]:
new

array([42.1, 41. , 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])

In [37]:
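def check_normality(data):
    # Shapiro-Wilk test; H0: the sample comes from a normal distribution
    test_stat_normality, p_value_normality = stats.shapiro(data)
    print("p value:%.4f" % p_value_normality)
    if p_value_normality < 0.05:
        print("Reject null hypothesis >> The data is not normally distributed")
    else:
        # Failing to reject does not prove normality; it only finds no evidence against it
        print("Fail to reject null hypothesis >> The data is normally distributed")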

In [38]:
def check_variance_homogeneity(group1, group2):
    # Levene's test; H0: both groups have equal variances
    test_stat_var, p_value_var = stats.levene(group1, group2)
    print("p value:%.4f" % p_value_var)
    if p_value_var < 0.05:
        print("Reject null hypothesis >> The variances of the samples are different.")
    else:
        # Failing to reject means no evidence of unequal variances was found
        print("Fail to reject null hypothesis >> The variances of the samples are same.")

In [39]:
check_normality(old)
check_normality(new)

p value:0.5010
Fail to reject null hypothesis >> The data is normally distributed
p value:0.9676
Fail to reject null hypothesis >> The data is normally distributed


In [42]:
check_variance_homogeneity(old, new)

p value:0.8795
Fail to reject null hypothesis >> The variances of the samples are same.


In [44]:
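# Pooled two-sample t-test (ttest_ind assumes equal variances by default)
t_stat, p_value = stats.ttest_ind(old, new)
# H1: the new machine is faster, i.e. mean(old) > mean(new), so t_stat should be positive.
# Halving the two-sided p-value is valid only in that case.
p_value_one_sided = p_value/2 if t_stat > 0 else 1 - p_value/2
print("p value:%.4f" % p_value)
print("since the hypothesis is one sided >> use p_value/2 >> p_value_one_sided:%.4f" % p_value_one_sided)
if p_value_one_sided < 0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")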

p value:0.0032
since the hypothesis is one sided >> use p_value/2 >> p_value_one_sided:0.0016
Reject null hypothesis


#### At the 5% significance level, there is enough evidence to conclude that the new machine performs better (packs faster) than the old one
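As a cross-check, the one-sided p-value can be computed by hand from the pooled-variance formula. This is a minimal sketch using the packing times from the table above; the variable names are illustrative, and the `alternative` argument assumes SciPy 1.6 or later. It should land close to the 0.0016 reported above.

```python
import numpy as np
from scipy import stats

# Packing times in seconds, copied from the table above
new = np.array([42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])
old = np.array([42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1])

n_old, n_new = len(old), len(new)
df = n_old + n_new - 2

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n_old - 1) * old.var(ddof=1) + (n_new - 1) * new.var(ddof=1)) / df
std_err = np.sqrt(pooled_var * (1 / n_old + 1 / n_new))

# H1: mean(old) > mean(new), so the upper tail of the t distribution is used
t_manual = (old.mean() - new.mean()) / std_err
p_manual = stats.t.sf(t_manual, df)

# Same test through SciPy; `alternative` requires SciPy 1.6 or later
t_scipy, p_scipy = stats.ttest_ind(old, new, equal_var=True, alternative="greater")

print("manual: t = %.4f, one-sided p = %.4f" % (t_manual, p_manual))
print("SciPy:  t = %.4f, one-sided p = %.4f" % (t_scipy, p_scipy))
```

Passing `alternative="greater"` states the direction of the hypothesis explicitly, so the result does not depend on remembering to check the sign of the statistic before halving.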

#### 2. An additional problem (not mandatory): 

Here we cannot assume that the population variances are equal, so we cannot pool the variances. Independent random samples of 17 sophomores and 13 juniors attending a large university yield the following data on grade point averages. The data are provided in the file files_for_lab/student_gpa.txt. At the 5% significance level, do the data provide sufficient evidence to conclude that the mean GPAs of sophomores and juniors at the university differ?

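* With pooled variances the degrees of freedom would be (n1-1)+(n2-1); for the unpooled (Welch) test they are approximated by the Welch-Satterthwaite formula and are at most (n1-1)+(n2-1).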

$H_{0}$: The mean GPAs of sophomores and juniors at the university do not differ  
$H_{1}$: The mean GPAs of sophomores and juniors at the university differ
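When the variances cannot be pooled, the usual choice is Welch's t-test. The following is a minimal sketch of that test under stated assumptions: `welch_t_test`, `group_a` and `group_b` are hypothetical names, and the sample values are made up for illustration (they are not the lab data).

```python
import numpy as np
from scipy import stats

def welch_t_test(a, b):
    """Two-sided Welch t-test; returns (t, df, p)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Squared standard error of each sample mean
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    p = 2 * stats.t.sf(abs(t), df)
    return t, df, p

# Hypothetical samples of unequal size, for illustration only (not the lab data)
group_a = [3.1, 2.4, 2.9, 3.6, 2.2, 2.8, 3.3]
group_b = [2.7, 3.0, 3.4, 3.1, 2.9]

t, df, p = welch_t_test(group_a, group_b)

# SciPy runs the same test when equal_var=False
t_ref, p_ref = stats.ttest_ind(group_a, group_b, equal_var=False)

print("manual: t = %.4f, df = %.2f, p = %.4f" % (t, df, p))
print("SciPy:  t = %.4f, p = %.4f" % (t_ref, p_ref))
```

On the lab data, the equivalent call would be `stats.ttest_ind(Sophomores, Juniors, equal_var=False)` once the NaN padding has been removed from the shorter sample.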

In [46]:
data2 = pd.read_csv('student_gpa.txt', sep = "\t")

data2

Unnamed: 0,Sophomores,Juniors
0,3.04,2.56
1,1.71,2.77
2,3.3,2.7
3,2.88,3.0
4,2.11,2.98
5,2.6,3.47
6,2.92,3.26
7,3.6,3.2
8,2.28,3.19
9,2.82,2.65


In [47]:
data2.columns

Index(['Sophomores', '  Juniors'], dtype='object')

In [60]:
Sophomores = np.array(data2['Sophomores'])
Juniors = np.array(data2['  Juniors'])

# Drop the NaN padding from 'Juniors' (13 juniors vs. 17 sophomores)
Juniors = Juniors[~np.isnan(Juniors)]
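The same clean-up can be done on the pandas side before converting to NumPy. A minimal sketch, assuming a hypothetical frame whose shorter column is padded with NaN in the same way:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "B" has fewer observations and is padded with NaN
frame = pd.DataFrame({"A": [3.0, 2.5, 3.2, 2.8], "B": [2.9, 3.1, np.nan, np.nan]})

# dropna() on the single column leaves the other column untouched
b_values = frame["B"].dropna().to_numpy()
a_values = frame["A"].to_numpy()

print(len(a_values), len(b_values))
```

Calling `dropna()` on the whole frame instead would also discard the valid rows of the longer column, so the drop is applied per column.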

In [62]:
check_normality(Sophomores)
check_normality(Juniors)

p value:0.3154
Fail to reject null hypothesis >> The data is normally distributed
p value:0.4130
Fail to reject null hypothesis >> The data is normally distributed


#### Both samples are consistent with a normal distribution

In [63]:
check_variance_homogeneity(Sophomores, Juniors)

p value:0.2016
Fail to reject null hypothesis >> The variances of the samples are same.


#### Levene's test finds no evidence that the sample variances differ

In [68]:
# Mann-Whitney U is a rank-based alternative to the two-sample t-test.
# H1 is two-sided ("differ"), so the two-sided p-value is used as is, without halving.
u_stat, p_value = stats.mannwhitneyu(Sophomores, Juniors, alternative='two-sided')
print("p value:%.4f" % p_value)
if p_value < 0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")

p value:0.7377
Fail to reject null hypothesis


#### There is not enough evidence to conclude that the mean GPAs of sophomores and juniors at the university differ