# 0.Instructions

This is another simple example of a two-sample t-test (pooled, for when the variances are equal). This time it is a one-sided t-test.

In a packing plant, a machine packs cartons with jars. A new machine is expected to pack faster on average than the machine currently in use. To test that hypothesis, the time it takes each machine to pack ten cartons is recorded. The results, in seconds, are shown in the tables in the file *files_for_lab/machine.txt*. Assuming the conditions for a t-test are met, do the data provide sufficient evidence that one machine is better than the other?

In [1]:
import pandas as pd 
import numpy as np

In [2]:
from scipy import stats
from scipy.stats import norm, t, ttest_ind, ttest_1samp
import math

In [3]:
df = pd.read_excel("machine.xlsx")

# 1.EDA

Explore data, standardize header names and declare variables (mean, std, etc)

In [4]:
df 

Unnamed: 0,New machine,Old machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


In [5]:
# standardize the column names because there are unnecessary spaces
df.columns.to_list()

['New machine', '    Old machine']

In [6]:
def standardize_col(col):
    return col.lower().replace(" ","")

In [7]:
new_cols=[]
for col in df.columns:
    new_cols.append(standardize_col(col))

In [8]:
new_cols

['newmachine', 'oldmachine']

In [9]:
df.columns=new_cols
df.head()

Unnamed: 0,newmachine,oldmachine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
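The same cleanup can also be done without a helper function and loop, using the vectorized string methods on the column index. A minimal sketch, assuming a small stand-in DataFrame (`demo`, a hypothetical name) with the same headers as the lab file:

```python
import pandas as pd

# Hypothetical stand-in frame with the same messy headers as the lab file
demo = pd.DataFrame({"New machine": [42.1, 41.0], "    Old machine": [42.7, 43.6]})

# Lowercase every header and drop all spaces in one pass
demo.columns = demo.columns.str.lower().str.replace(" ", "", regex=False)
clean_cols = demo.columns.to_list()
print(clean_cols)
```

This should give the same header names as `standardize_col` applied column by column.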


## Statistics (mean, stdev)

In [10]:
# describe all the data 
df.describe()

Unnamed: 0,newmachine,oldmachine
count,10.0,10.0
mean,42.14,43.23
std,0.683455,0.749889
min,41.0,41.7
25%,41.8,42.8
50%,42.2,43.4
75%,42.625,43.75
max,43.2,44.1


In [11]:
# define length
new_n = len(df)
old_n = len(df)
totalsample_n = (old_n + new_n)
n = len(df)
print('Old machine sample size is:', old_n, 'New machine sample size is:', new_n, 'Total sample size is:', totalsample_n, 'Sample size is:', n, sep='\n')

Old machine sample size is:
10
New machine sample size is:
10
Total sample size is:
20
Sample size is:
10


In [12]:
new_mean = df.mean()['newmachine']
print('New machine mean is:', new_mean)

New machine mean is: 42.14


In [13]:
old_mean = df.mean()['oldmachine']
print('Old machine mean is:', round(old_mean,2)) 

Old machine mean is: 43.23


In [14]:
new_std = df.std()['newmachine']
print('New machine standard deviation is:', round(new_std,2))

New machine standard deviation is: 0.68


In [15]:
old_std = df.std()['oldmachine']
print('Old machine standard deviation is:', round(old_std,2))

Old machine standard deviation is: 0.75


# 2.Degree of freedom (DOF)

For the pooled two-sample t-test, the degrees of freedom are the combined sample size minus 2, because one mean is estimated for each of the two samples (old and new).

DOF = n1 + n2 - 2

In [16]:
DOF = ((new_n + old_n) - 2)
DOF

18

# 3.Define the hypothesis

**H0 Null hypothesis:** the mean packing time is the same for both machines; there is no improvement
- new_time = old_time


**Ha Alternative hypothesis:** the new machine is faster than the old one, so its mean packing time is lower
- new_time < old_time

# 4.Calculate the t-stat

In [17]:
# both samples have the same size n, so this matches the pooled t-statistic
t_statistic= (new_mean-old_mean)/np.sqrt(((new_std**2)/n)+((old_std**2)/n))
print('The T-statistic is: ', round(t_statistic,4))

The T-statistic is:  -3.3972
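For reference, the pooled form of the statistic combines the two sample variances, weighted by their degrees of freedom, before computing the standard error. The sketch below is illustrative only: the helper name `pooled_t_statistic` and the arrays are introduced here, with the packing times copied from the table above.

```python
import numpy as np

# Packing times in seconds, copied from the table above
new_times = np.array([42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])
old_times = np.array([42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1])

def pooled_t_statistic(a, b):
    """Two-sample t-statistic under the equal-variance (pooled) assumption."""
    n_a, n_b = len(a), len(b)
    # Pooled variance: sample variances weighted by their degrees of freedom
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    # Standard error of the difference between the two means
    standard_error = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (a.mean() - b.mean()) / standard_error

t_pooled = pooled_t_statistic(new_times, old_times)
print("Pooled t-statistic:", round(t_pooled, 4))
```

With equal sample sizes, the pooled and unpooled standard errors coincide, which is why the simpler formula in the cell above yields the same value.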


# 5.Calculate the critical value

In [18]:
# critical value from the t distribution with DOF degrees of freedom
# one-sided (left-tail) test at the conventional 0.05 significance level
critical_value = t.ppf(0.05, DOF)
print('The critical value is: ', round(critical_value,4))

The critical value is:  -1.7341


# 6.Calculate the p value

In [19]:
# is it less than 0.05?
# a p-value below 0.05 (5%) is conventionally taken as evidence against H0
# one-sided (left-tail) p-value from the t distribution
pvalue = t.cdf(t_statistic, DOF)
print('The P-value is: ', round(pvalue,4))

The P-value is:  0.0016


# 7.Confidence intervals

In [20]:
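# 90% confidence interval for the new machine mean (5% in each tail)
# note: this uses the normal quantile and sqrt(n-1); a t-based interval would use t.ppf(0.95, n-1) and sqrt(n)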
absoluteZ = abs(norm.ppf(0.05))
upperCI = new_mean + absoluteZ*new_std/np.sqrt(n-1)
print('The Upper CI is:', round(upperCI,4))

The Upper CI is: 42.5147


In [21]:
lowerCI = new_mean - absoluteZ*new_std/math.sqrt(n-1)
print('The Lower CI is:', round(lowerCI,4))

The Lower CI is: 41.7653


In [22]:
# 90% confidence interval for the old machine mean, same formula as for the new machine
upperCI = old_mean + absoluteZ*old_std/np.sqrt(n-1)
print('The Upper CI is:', round(upperCI,4))

The Upper CI is: 43.6412


In [23]:
lowerCI = old_mean - absoluteZ*old_std/math.sqrt(n-1)
print('The Lower CI is:', round(lowerCI,4))

The Lower CI is: 42.8188
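With only ten observations per machine, an interval based on the t distribution is the more conventional choice. A minimal sketch for the new machine, assuming the same 90% level as above and using `scipy.stats.t.interval` together with `scipy.stats.sem`; the array repeats the values from the table, and the variable names are introduced here for illustration:

```python
import numpy as np
from scipy import stats

# New machine packing times in seconds, copied from the table above
new_times = np.array([42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])

n_obs = len(new_times)
sample_mean = new_times.mean()
# Standard error of the mean: sample std (ddof=1) divided by sqrt(n)
standard_error = stats.sem(new_times)

# Two-sided 90% interval from the t distribution with n - 1 degrees of freedom
ci_low, ci_high = stats.t.interval(0.90, n_obs - 1, loc=sample_mean, scale=standard_error)
print("t-based 90% CI:", round(ci_low, 4), round(ci_high, 4))
```

The t-based interval comes out slightly wider than the normal-based one, reflecting the extra uncertainty from estimating the standard deviation on a small sample.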


# 8.Conclusions

**T-statistic (-3.40) < T-critical (-1.73)**

The calculated t-statistic falls below the critical t-value, that is, inside the left-tail rejection region, so the null hypothesis is rejected at the 0.05 level.

The mean packing times of the two machines differ.

There is evidence that the new machine is faster than the old one. 


# Appendix: other ways of calculating the T-test & P-value

In [24]:
from scipy.stats import ttest_1samp

In [25]:
from statistics import pvariance 
from statistics import variance 

In [26]:
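# pooled standard deviation computed from the two sample standard deviations (about 0.717441)
pooled_standard_deviation = np.sqrt(((new_n - 1)*new_std**2 + (old_n - 1)*old_std**2) / DOF)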

In [27]:
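# note: this runs a one-sample test on each column against popmean=0.71; it does not compare the two machines
test_t = ttest_1samp(a=df, popmean=0.71)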
test_t

Ttest_1sampResult(statistic=array([191.69237331, 179.30662739]), pvalue=array([1.45543303e-17, 2.65464518e-17]))

In [28]:
# calculate the population variance (divides by n rather than n-1)
pop_variance_om = pvariance(df['oldmachine'])
print('Population variance of Old Machine is:', round(pop_variance_om,2))

Population variance of Old Machine is: 0.51


In [29]:
# calculate the population variance (divides by n rather than n-1)
pop_variance_nm = pvariance(df['newmachine'])
print('Population variance of New Machine is:', round(pop_variance_nm,2))

Population variance of New Machine is: 0.42


In [30]:
sample1 = df.std()['newmachine']/np.sqrt(new_n)
sample1

0.21612753436596469

In [31]:
sample2 = df.std()['oldmachine']/np.sqrt(old_n)
sample2

0.2371356854910985

In [32]:
sed = np.sqrt((sample1**2) + (sample2**2))
sed

0.3208495666888837

In [33]:
# final t-stat
t_stat =(df.mean()['newmachine'] - df.mean()['oldmachine'])/sed
print('T-statistic is:', round(t_stat,4))

T-statistic is: -3.3972


In [34]:
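# calculate the p-value: doubling the left tail gives the two-sided p-value
# the one-sided (left-tail) p-value is t_dist.cdf(t_stat) on its own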
t_dist = t(DOF)
2 * t_dist.cdf(t_stat)

0.0032111425007745158

In [35]:
# short way to calculate the t-stat and p-value with scipy; the p-value returned is two-sided by default
t_stat, p_val = stats.ttest_ind(df['newmachine'], df['oldmachine'], equal_var=True)
print('T-statistic is:', round(t_stat,4),'\n','P-value is:', round(p_val,4))

T-statistic is: -3.3972 
 P-value is: 0.0032
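Since the hypothesis in this lab is one-sided, the one-sided p-value can also be requested directly. A hedged sketch, assuming SciPy 1.6 or later (where `ttest_ind` accepts the `alternative` argument); the arrays repeat the packing times from the table, and the variable names are introduced here:

```python
import numpy as np
from scipy import stats

# Packing times in seconds, copied from the table above
new_times = np.array([42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])
old_times = np.array([42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1])

# alternative='less' tests Ha: mean(new_times) < mean(old_times)
t_one_sided, p_one_sided = stats.ttest_ind(
    new_times, old_times, equal_var=True, alternative="less"
)
print("T-statistic is:", round(t_one_sided, 4))
print("One-sided P-value is:", round(p_one_sided, 4))
```

The t-statistic is unchanged; because it is negative, the one-sided p-value is half of the two-sided value reported above.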
