# Hypothesis Testing

## An F&B manager wants to determine whether there is any significant difference in the diameter of the cutlet between two units. A randomly selected sample of cutlets was collected from both units and measured. Analyze the data and draw inferences at the 5% significance level. Please state the assumptions and the tests that you carried out to check the validity of the assumptions.
Minitab File: Cutlets.mtw


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Step 1: Define the null (H0) and alternative (H1) hypotheses
# H0: There is no difference in the diameter of the cutlet between units A & B [A = B]
# H1: There is a difference in the diameter of the cutlet between units A & B [A != B]
# Step 2: As we don't know the population standard deviation, we use the t statistic
# Step 3: Significance level = 0.05

In [3]:
df = pd.read_csv("Cutlets.csv")
df

Unnamed: 0,Unit A,Unit B
0,6.809,6.7703
1,6.4376,7.5093
2,6.9157,6.73
3,7.3012,6.7878
4,7.4488,7.1522
5,7.3871,6.811
6,6.8755,7.2212
7,7.0621,6.6606
8,6.684,7.2402
9,6.8236,7.0503


In [4]:
Mean_A = df["Unit A"].mean() # Unit A Mean
Mean_A

7.01909142857143

In [5]:
Mean_B = df["Unit B"].mean() # Unit B Mean
Mean_B

6.964297142857142

In [6]:
Std_A = df["Unit A"].std() # Unit A standard Deviation
Std_A

0.2884084841815496

In [7]:
Std_B = df["Unit B"].std() # Unit B standard Deviation
Std_B

0.3434006470631082

In [8]:
# Step 4: calculate the t statistic
# Note: the denominator is the standard error of Unit A alone (n = 35); Std_B is not used here
t_statstical = (Mean_A-Mean_B)/(Std_A/np.sqrt(35)) # t statistic
print("t statistical = ",t_statstical)

t statistical =  1.1239869273042056
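
The statistic above divides the mean difference by the standard error of Unit A alone. A two-sample comparison would normally account for the variability of both units. As a hedged sketch, the summary values printed above (with n = 35 per unit, as used in the code) can be passed to SciPy's `ttest_ind_from_stats`. The equal-variance setting is an assumption that should be checked separately, and the variable names are illustrative.

```python
from scipy import stats

# Summary statistics copied from the outputs above; n = 35 per unit is taken from the code.
mean_a, std_a, n_a = 7.01909142857143, 0.2884084841815496, 35
mean_b, std_b, n_b = 6.964297142857142, 0.3434006470631082, 35

# Two-sample, two-tailed t-test from summary statistics (pooled variance is an assumption).
result = stats.ttest_ind_from_stats(
    mean_a, std_a, n_a, mean_b, std_b, n_b, equal_var=True
)
t_two_sample, p_two_sample = result.statistic, result.pvalue

alpha = 0.05
print("t =", t_two_sample, "p =", p_two_sample)
print("Reject H0" if p_two_sample < alpha else "Fail to reject H0")
```

Comparing the p-value with alpha gives the same decision as comparing the absolute t statistic with the two-tailed critical value.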


In [9]:
# Step 5: calculate the critical t value (two-tailed, alpha = 0.05, df = 34)
from scipy import stats  # needed for stats.t.ppf
t_critical = stats.t.ppf(0.975,df=34)  # t critical value
print("t critical = ",t_critical)

In [None]:
# Two-tailed decision rule: compare the absolute t statistic with the critical value
if abs(t_statstical) < t_critical:
    print("No difference in the diameter of the cutlet between units A & B")
else:
    print("There is a difference in the diameter of the cutlet between units A & B")
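
The question also asks for the assumptions behind the test. A minimal sketch of how they might be checked is given here, assuming Shapiro-Wilk for normality and Levene's test for equal variances (these choices are not taken from the original notebook). It uses only the ten rows displayed above, so its numbers are illustrative and the full `Cutlets.csv` should be used in practice.

```python
import pandas as pd
from scipy import stats

# Only the ten rows displayed above; illustrative subset, not the full data set.
sample = pd.DataFrame({
    "Unit A": [6.809, 6.4376, 6.9157, 7.3012, 7.4488,
               7.3871, 6.8755, 7.0621, 6.684, 6.8236],
    "Unit B": [6.7703, 7.5093, 6.73, 6.7878, 7.1522,
               6.811, 7.2212, 6.6606, 7.2402, 7.0503],
})

alpha = 0.05

# Assumption 1: each sample is approximately normal (Shapiro-Wilk, H0 = normal).
shapiro_a = stats.shapiro(sample["Unit A"])
shapiro_b = stats.shapiro(sample["Unit B"])

# Assumption 2: both units have equal variances (Levene, H0 = equal variances).
levene = stats.levene(sample["Unit A"], sample["Unit B"])

# Pooled t-test if equal variances are not rejected, otherwise Welch's t-test.
equal_var = levene.pvalue >= alpha
ttest = stats.ttest_ind(sample["Unit A"], sample["Unit B"], equal_var=equal_var)

print("Shapiro A p =", shapiro_a.pvalue, "| Shapiro B p =", shapiro_b.pvalue)
print("Levene p =", levene.pvalue, "| equal_var =", equal_var)
print("t =", ttest.statistic, "p =", ttest.pvalue)
```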

## A hospital wants to determine whether there is any difference in the average Turn Around Time (TAT) of reports of the laboratories on their preferred list. They collected a random sample and recorded TAT for reports of 4 laboratories. TAT is defined as sample collected to report dispatch. Analyze the data and determine whether there is any difference in average TAT among the different laboratories at 5% significance level.
 
Minitab File: LabTAT.mtw

In [23]:
# We need to compare the averages of more than two groups, so we use one-way ANOVA
# Define the null (H0) and alternative (H1) hypotheses
# H0: u1 = u2 = u3 = u4
# H1: At least one mean is different

In [24]:
reports_data = pd.read_csv("LabTAT.csv")
reports_data

Unnamed: 0,Laboratory 1,Laboratory 2,Laboratory 3,Laboratory 4
0,185.35,165.53,176.70,166.13
1,170.49,185.91,198.45,160.79
2,192.77,194.92,201.23,185.18
3,177.33,183.00,199.61,176.42
4,193.41,169.57,204.63,152.60
...,...,...,...,...
115,178.49,170.66,193.80,172.68
116,176.08,183.98,215.25,177.64
117,202.48,174.54,203.99,170.27
118,182.40,197.18,194.52,150.87


In [29]:
from scipy import stats
# One-way ANOVA with each of the four laboratory columns as a separate group
stats.f_oneway(reports_data.iloc[:,0],reports_data.iloc[:,1],reports_data.iloc[:,2],reports_data.iloc[:,3])

F_onewayResult(statistic=118.70421654401437, pvalue=2.1156708949992414e-57)
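
The ANOVA result still has to be turned into a decision at the 5% level. The sketch below shows one way to do that. `decide` is a hypothetical helper, and the small data frame holds only the first five rows displayed above, so its F statistic is illustrative. The final call uses the p-value reported for the full data set.

```python
import pandas as pd
from scipy import stats

# First five displayed rows only; illustrative subset, not the full LabTAT.csv.
labs = pd.DataFrame({
    "Laboratory 1": [185.35, 170.49, 192.77, 177.33, 193.41],
    "Laboratory 2": [165.53, 185.91, 194.92, 183.00, 169.57],
    "Laboratory 3": [176.70, 198.45, 201.23, 199.61, 204.63],
    "Laboratory 4": [166.13, 160.79, 185.18, 176.42, 152.60],
})

def decide(p_value, alpha=0.05):
    """Hypothetical helper: state the decision for a given p-value."""
    if p_value < alpha:
        return "Reject H0: at least one laboratory mean differs"
    return "Fail to reject H0: no evidence that the means differ"

# Pass every laboratory column as its own group.
f_stat, p_subset = stats.f_oneway(*(labs[col] for col in labs.columns))
print("Subset F =", f_stat, "p =", p_subset, "->", decide(p_subset))

# p-value reported above for the full data set.
reported_p = 2.1156708949992414e-57
print("Full data ->", decide(reported_p))
```

Because the reported p-value is far below 0.05, the null hypothesis of equal average TAT is rejected for the full data set.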

## Sales of products in four different regions are tabulated for males and females. Find out whether the male-female buyer ratios are similar across regions.

In [59]:
buy= pd.read_csv("BuyerRatio.csv")
buy

Unnamed: 0,Observed Values,East,West,North,South
0,Males,50,142,131,70
1,Females,435,1523,1356,750


In [60]:
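# Counts per region (East, West, North, South) for each gender
Male = buy.iloc[0,1:]
Female = buy.iloc[1,1:]
# Independent two-sample t-test on the regional counts
stats.ttest_ind(Male,Female)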

Ttest_indResult(statistic=-3.5844638979607404, pvalue=0.011580726207478622)
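
A t-test on the counts compares the average number of male and female buyers, not whether the male-to-female ratio changes from region to region. A chi-square test of independence is a common alternative for that question. The sketch below applies it to the table shown above. It is offered as an assumption about what the exercise intends, not as the notebook's own method.

```python
import pandas as pd
from scipy import stats

# Observed counts copied from the BuyerRatio table above.
observed = pd.DataFrame(
    {"East": [50, 435], "West": [142, 1523], "North": [131, 1356], "South": [70, 750]},
    index=["Males", "Females"],
)

# H0: gender and region are independent (buyer ratios are similar across regions).
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

alpha = 0.05
print("chi2 =", chi2, "p =", p_value, "dof =", dof)
if p_value < alpha:
    print("Reject H0: buyer ratios differ across regions")
else:
    print("Fail to reject H0: no evidence that buyer ratios differ across regions")
```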

## TeleCall uses 4 centers around the globe to process customer order forms. They audit a certain % of the customer order forms. Any error in an order form renders it defective, and it has to be reworked before processing. The manager wants to check whether the defective % varies by centre. Please analyze the data at the 5% significance level and help the manager draw appropriate inferences.

In [63]:
Customer = pd.read_csv("Costomer+OrderForm.csv")
Customer

Unnamed: 0,Phillippines,Indonesia,Malta,India
0,Error Free,Error Free,Defective,Error Free
1,Error Free,Error Free,Error Free,Defective
2,Error Free,Defective,Defective,Error Free
3,Error Free,Error Free,Error Free,Error Free
4,Error Free,Error Free,Defective,Error Free
...,...,...,...,...
295,Error Free,Error Free,Error Free,Error Free
296,Error Free,Error Free,Error Free,Error Free
297,Error Free,Error Free,Defective,Error Free
298,Error Free,Error Free,Error Free,Error Free


In [64]:
Phillippines_value=Customer['Phillippines'].value_counts()
Indonesia_value=Customer['Indonesia'].value_counts()
Malta_value=Customer['Malta'].value_counts()
India_value=Customer['India'].value_counts()

In [72]:
print(Phillippines_value,Indonesia_value,Malta_value,India_value)

Error Free    271
Defective      29
Name: Phillippines, dtype: int64 Error Free    267
Defective      33
Name: Indonesia, dtype: int64 Error Free    269
Defective      31
Name: Malta, dtype: int64 Error Free    280
Defective      20
Name: India, dtype: int64


In [70]:
# Rows: Error Free, Defective; columns: Phillippines, Indonesia, Malta, India
chiStats = stats.chi2_contingency([[271,267,269,280],[29,33,31,20]])
chiStats

(3.858960685820355,
 0.2771020991233135,
 3,
 array([[271.75, 271.75, 271.75, 271.75],
        [ 28.25,  28.25,  28.25,  28.25]]))
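
The tuple above holds the chi-square statistic, the p-value, the degrees of freedom and the expected counts. A short sketch of how it could be unpacked and read at the 5% level follows, using the counts printed earlier. The variable names are illustrative.

```python
import pandas as pd
from scipy import stats

# Counts copied from the value_counts output above.
counts = pd.DataFrame(
    {"Phillippines": [271, 29], "Indonesia": [267, 33],
     "Malta": [269, 31], "India": [280, 20]},
    index=["Error Free", "Defective"],
)

# Observed defective percentage per centre.
defective_pct = 100 * counts.loc["Defective"] / counts.sum()
print(defective_pct)

# H0: the defective proportion is the same in all centres.
chi2, p_value, dof, expected = stats.chi2_contingency(counts)

alpha = 0.05
print("chi2 =", chi2, "p =", p_value, "dof =", dof)
if p_value < alpha:
    print("Reject H0: the defective % varies by centre")
else:
    print("Fail to reject H0: no evidence that the defective % varies by centre")
```

With the p-value of about 0.277 shown in the output above, which is greater than 0.05, the null hypothesis is not rejected.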