In [1]:
import statsmodels.api as sm
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

**Question 1**

An F&B manager wants to determine whether there is any significant difference in the diameter of the cutlet between two units. A randomly selected sample of cutlets was collected from both units and measured. Analyze the data and draw inferences at the 5% significance level. Please state the assumptions and the tests that you carried out to check the validity of the assumptions.


**Parameter:** Cutlets from Unit A & Unit B

**Parameter of Interest:** $\mu_1 - \mu_2$, diameter of the cutlet.

**Null Hypothesis:** $\mu_1 = \mu_2$

**Alternative Hypothesis:** $\mu_1 \neq \mu_2$

**Checking for Inequality** 

**2 sample 2-tail test**

In [2]:
cutlets = pd.read_csv('cutlets.csv')
cutlets.head()

Unnamed: 0,Unit A,Unit B
0,6.809,6.7703
1,6.4376,7.5093
2,6.9157,6.73
3,7.3012,6.7878
4,7.4488,7.1522


In [3]:
cutlets.mean()

Unit A    7.019091
Unit B    6.964297
dtype: float64

In [4]:
# Two-sample z-test: the statistic returned here is a z value, not a t value
p=sm.stats.ztest(cutlets["Unit A"].dropna(), cutlets["Unit B"].dropna(), alternative='two-sided')
print('z value = {} \np-value = {}'.format(p[0] , p[1]))

z value = 0.7228688704678063 
p-value = 0.46976045023906055


#### or

In [5]:
# Two-sample t-test; drop missing values from both columns for consistency
p=stats.ttest_ind(cutlets["Unit A"].dropna(), cutlets["Unit B"].dropna())
print('t value = {} \np-value = {}'.format(p[0] , p[1]))

t value = 0.7228688704678063 
p-value = 0.4722394724599501
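
Both cells report the same statistic but slightly different p-values. One plausible reading, assuming both functions pool the two sample variances by default, is that only the reference distribution differs: standard normal for the z-test, Student's t for the t-test. The standard-library sketch below recomputes the normal-based two-sided p-value from the statistic printed above; the helper name `normal_two_sided_p` is illustrative and not part of the notebook.

```python
import math

def normal_two_sided_p(z: float) -> float:
    """Two-sided p-value under a standard normal: 2 * (1 - Phi(|z|))."""
    return math.erfc(abs(z) / math.sqrt(2))

# Statistic copied from the output above
statistic = 0.7228688704678063
p_normal = normal_two_sided_p(statistic)
print(round(p_normal, 4))  # 0.4698, in line with the z-test cell
```

The t-based p-value is a little larger because the t distribution has heavier tails than the normal.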


**Conclusion of the hypothesis test**

Since the p-value is greater than 0.05, we fail to reject the null hypothesis: there is no significant difference in the mean diameter of the cutlets between the two units.
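
The question also asks for the assumptions behind the test to be checked. A minimal sketch of two common checks follows: Shapiro-Wilk for normality of each sample and Levene's test for equal variances. The function name `check_assumptions` and the synthetic arrays are assumptions made for illustration; in the notebook the inputs would be `cutlets["Unit A"]` and `cutlets["Unit B"]`.

```python
import numpy as np
from scipy import stats

def check_assumptions(a, b, alpha=0.05):
    """Return p-values for normality (per sample) and equal variance checks."""
    _, p_norm_a = stats.shapiro(a)   # H0: sample a is normally distributed
    _, p_norm_b = stats.shapiro(b)   # H0: sample b is normally distributed
    _, p_var = stats.levene(a, b)    # H0: both samples have equal variance
    return {
        "shapiro_a": p_norm_a,
        "shapiro_b": p_norm_b,
        "levene": p_var,
        "all_hold": min(p_norm_a, p_norm_b, p_var) > alpha,
    }

# Placeholder data standing in for the two cutlet columns (NOT the real data)
rng = np.random.default_rng(0)
unit_a = rng.normal(loc=7.0, scale=0.3, size=35)
unit_b = rng.normal(loc=7.0, scale=0.3, size=35)

result = check_assumptions(unit_a, unit_b)
print(result)
```

If Levene's test rejects equal variances, passing `equal_var=False` to `stats.ttest_ind` (Welch's t-test) is a common alternative.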


**Question 2**

A hospital wants to determine whether there is any difference in the average Turn Around Time (TAT) of reports of the laboratories on their preferred list. They collected a random sample and recorded TAT for reports of 4 laboratories. TAT is defined as sample collected to report dispatch. Analyze the data and determine whether there is any difference in average TAT among the different laboratories at 5% significance level.

**Parameter:** Reports of the laboratories

**Parameter of Interest:** Time required for the reports

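**Null Hypothesis:** the mean TAT is the same for all four laboratories

**Alternative Hypothesis:** at least one laboratory's mean TAT differs

**Test:** One-way ANOVA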

In [6]:
LabTAT=pd.read_csv('LabTAT.csv')
LabTAT.head()

Unnamed: 0,Laboratory 1,Laboratory 2,Laboratory 3,Laboratory 4
0,185.35,165.53,176.7,166.13
1,170.49,185.91,198.45,160.79
2,192.77,194.92,201.23,185.18
3,177.33,183.0,199.61,176.42
4,193.41,169.57,204.63,152.6


In [7]:
stats.f_oneway(LabTAT.iloc[:,0], LabTAT.iloc[:,1],LabTAT.iloc[:,2],LabTAT.iloc[:,3])

F_onewayResult(statistic=118.70421654401437, pvalue=2.1156708949992414e-57)

**Conclusion**

Since the p-value is less than 0.05, we reject the null hypothesis. At least one laboratory on the preferred list has an average Turn Around Time (TAT) that differs from the others.
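
For reference, the F statistic behind a one-way ANOVA is the ratio of between-group to within-group mean squares. The standard-library sketch below computes it for a toy data set; the function name `one_way_f` and the toy groups are illustrative assumptions, not the laboratory data.

```python
from statistics import mean

def one_way_f(*groups):
    """F = (SS_between / (k - 1)) / (SS_within / (n - k)) for k groups, n values."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

f_toy = one_way_f([1, 2, 3], [2, 3, 4], [3, 4, 5])
print(f_toy)  # 3.0
```

A large F means the group means are spread out relative to the noise within each group, which is what drives the very small p-value in the cell above.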

**Question 3**

Sales of products in four different regions are tabulated for males and females. Find out whether the male-female buyer ratios are similar across regions.

**H0:** the male-female buyer proportions are equal across all regions

**Ha:** the male-female buyer proportions differ in at least one region

**Chi-Squared - Test of Independence**



In [8]:
product_sale = pd.read_csv('BuyerRatio.csv')
product_sale

Unnamed: 0,Observed Values,East,West,North,South
0,Males,50,142,131,70
1,Females,435,1523,1356,750
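
A contingency table typed by hand can drift away from the data it is meant to represent. One way to avoid that, sketched below, is to derive the table directly from the DataFrame. The frame here is rebuilt from the values displayed above so that the example is self-contained; in the notebook `product_sale` already exists.

```python
import pandas as pd

# Values reproduced from the table displayed above
product_sale = pd.DataFrame({
    "Observed Values": ["Males", "Females"],
    "East": [50, 435],
    "West": [142, 1523],
    "North": [131, 1356],
    "South": [70, 750],
})

# Rows: Males, Females; columns: East, West, North, South
table = product_sale.set_index("Observed Values").to_numpy()
print(table.tolist())  # [[50, 142, 131, 70], [435, 1523, 1356, 750]]
```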


In [9]:
from scipy.stats import chi2_contingency

In [10]:
# contingency table (rows: Males, Females; columns: East, West, North, South),
# using the values shown in the BuyerRatio.csv output above
table = [[50, 142, 131, 70],
[435, 1523, 1356, 750]]
print(table)

[[50, 142, 131, 70], [435, 1523, 1356, 750]]


In [11]:
stat, p, dof, expected = chi2_contingency(table)
print(f'Statistic = {stat:.4f} \np value = {p:.4f} \nDegree of freedom = {dof} \nExpected values = {np.round(expected, 2)}')

Statistic = 1.5959 
p value = 0.6603 
Degree of freedom = 3 
Expected values = [[  42.77  146.81  131.12   72.3 ]
 [ 442.23 1518.19 1355.88  747.7 ]]


**Conclusion**

Since the p-value is greater than 0.05, we fail to reject the null hypothesis: there is no significant evidence that the male-female buyer ratios differ across regions.

**Question 4**

TeleCall uses 4 centres around the globe to process customer order forms. They audit a certain % of the customer order forms. Any error in an order form renders it defective, and it has to be reworked before processing. The manager wants to check whether the defective % varies by centre. Please analyze the data at the 5% significance level and help the manager draw appropriate inferences.

**H0: defective % does not vary by centre**

**Ha: defective % varies by centre**

**Using Anova**

In [12]:
COF=pd.read_csv('CostomerOrderForm.csv')
COF.head()

Unnamed: 0,Phillippines,Indonesia,Malta,India
0,Error Free,Error Free,Defective,Error Free
1,Error Free,Error Free,Error Free,Defective
2,Error Free,Defective,Defective,Error Free
3,Error Free,Error Free,Error Free,Error Free
4,Error Free,Error Free,Defective,Error Free


In [13]:
from sklearn.preprocessing import LabelEncoder
# Encode every column; labels are sorted alphabetically, so Defective -> 0 and Error Free -> 1
COF = COF.apply(LabelEncoder().fit_transform)

In [14]:
stats.f_oneway(COF.iloc[:,0], COF.iloc[:,1],COF.iloc[:,2],COF.iloc[:,3])

F_onewayResult(statistic=1.286168556089167, pvalue=0.2776780955705948)

**Conclusion**

Since the p-value is greater than 0.05, we fail to reject the null hypothesis: there is no significant evidence that the defective % varies by centre.

**Using chi-square test**

In [15]:
COF['Phillippines'].value_counts()

1    271
0     29
Name: Phillippines, dtype: int64

In [16]:
COF['Indonesia'].value_counts()

1    267
0     33
Name: Indonesia, dtype: int64

In [17]:
COF['Malta'].value_counts()

1    269
0     31
Name: Malta, dtype: int64

In [18]:
COF['India'].value_counts()

1    280
0     20
Name: India, dtype: int64

In [19]:
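# Creating contingency table from the value counts above
# Rows: Error Free (1), Defective (0); columns: Phillippines, Indonesia, Malta, India
cont_table = np.array([[271,267,269,280],[29,33,31,20]])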
cont_table

array([[271, 267, 269, 280],
       [ 29,  33,  31,  20]])
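
The counts can also be tabulated in one step instead of being copied from four separate `value_counts()` outputs. The sketch below assumes the original string labels and uses only the five rows shown by `COF.head()`, so its numbers are illustrative and not the full-data counts; after label encoding, the index would be `1` and `0` instead of the strings.

```python
import pandas as pd

# First five rows reproduced from the COF.head() output (illustrative subset only)
sample = pd.DataFrame({
    "Phillippines": ["Error Free"] * 5,
    "Indonesia": ["Error Free", "Error Free", "Defective", "Error Free", "Error Free"],
    "Malta": ["Defective", "Error Free", "Defective", "Error Free", "Defective"],
    "India": ["Error Free", "Defective", "Error Free", "Error Free", "Error Free"],
})

# Count each label per column; a label absent from a column becomes 0
counts = (
    sample.apply(pd.Series.value_counts)
    .reindex(["Error Free", "Defective"])
    .fillna(0)
    .astype(int)
)
print(counts.to_numpy().tolist())  # [[5, 4, 2, 4], [0, 1, 3, 1]]
```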

In [20]:
stat, p, dof, expected = chi2_contingency(cont_table)
print(f'Statistic = {stat} \np value = {p} \nDegree of freedom = {dof} \nExpected values = {expected}')

Statistic = 3.858960685820355 
p value = 0.2771020991233135 
Degree of freedom = 3 
Expected values = [[271.75 271.75 271.75 271.75]
 [ 28.25  28.25  28.25  28.25]]
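
The numbers above can be reproduced by hand. Under independence, each expected count is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells. The standard-library sketch below applies this to the same contingency table; the function name `chi_square_statistic` is illustrative.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic, degrees of freedom and expected counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof, expected

stat_manual, dof_manual, expected_manual = chi_square_statistic(
    [[271, 267, 269, 280], [29, 33, 31, 20]]
)
print(round(stat_manual, 4), dof_manual, expected_manual[0][0])  # 3.859 3 271.75
```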


**Conclusion**

Since the p-value is greater than 0.05, we fail to reject the null hypothesis: there is no significant evidence that the defective % varies by centre.