# Problem Statement

1- A F&B manager wants to determine whether there is any significant
difference in the diameter of the cutlet between two units. A randomly
selected sample of cutlets was collected from both units and measured.
Analyze the data and draw inferences at 5% significance level. Please state
the assumptions and tests that you carried out to check validity of the
assumptions.
File: Cutlets.csv

### We will load our dataset

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy
from scipy import stats

In [10]:
data = pd.read_csv('Cutlets.csv')
data

Unnamed: 0,Unit A,Unit B
0,6.809,6.7703
1,6.4376,7.5093
2,6.9157,6.73
3,7.3012,6.7878
4,7.4488,7.1522
5,7.3871,6.811
6,6.8755,7.2212
7,7.0621,6.6606
8,6.684,7.2402
9,6.8236,7.0503


The file has blank rows from row 35 onward, so we drop them from the DataFrame.

In [13]:
data.drop(data.index[35:51],axis=0,inplace=True)
data

Unnamed: 0,Unit A,Unit B
0,6.809,6.7703
1,6.4376,7.5093
2,6.9157,6.73
3,7.3012,6.7878
4,7.4488,7.1522
5,7.3871,6.811
6,6.8755,7.2212
7,7.0621,6.6606
8,6.684,7.2402
9,6.8236,7.0503
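
The positional drop above relies on knowing that the blank rows start at index 35. A minimal sketch of an alternative, assuming the blank rows are read as NaN in every column, is to let pandas find them with `dropna`. The small frame below is made-up illustrative data, not `Cutlets.csv`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a CSV with trailing blank rows (NOT the real Cutlets.csv)
demo = pd.DataFrame({
    'Unit A': [6.809, 6.4376, np.nan, np.nan],
    'Unit B': [6.7703, 7.5093, np.nan, np.nan],
})

# Drop rows in which every column is missing, without hard-coding row positions
cleaned = demo.dropna(how='all')
print(cleaned.shape)
```

Using `how='all'` keeps rows where only one unit is missing, which may or may not be what the analysis wants.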


Both variables are continuous, so we compare them against each other: <br>
1. We will check whether Unit A and Unit B are equal.
2. We will then check whether the external conditions are the same for both samples and choose the tests based on the result.

In [15]:
# First we will check for normality

# H0 - Unit A and Unit B are both normally distributed
# H1 - At least one unit is not normally distributed

# We use the Shapiro-Wilk test for this


print(stats.shapiro(data['Unit A']))
print(stats.shapiro(data['Unit B']))

ShapiroResult(statistic=0.9649458527565002, pvalue=0.3199819028377533)
ShapiroResult(statistic=0.9727300405502319, pvalue=0.5224985480308533)


Since both p-values are > 0.05, we fail to reject the null hypothesis, so both samples can be treated as normally distributed.

Next we check whether the variances of the two variables are equal. The external conditions are not the same, since the data come from two different units.

In [16]:
# Variance Test

# Ho - Variance is equal between units
# H1 - Variance is different between units

scipy.stats.levene(data['Unit A'],data['Unit B'])

LeveneResult(statistic=0.665089763863238, pvalue=0.4176162212502553)

Since the p-value is greater than 0.05, we fail to reject the null hypothesis, so the variances can be treated as equal.

We will now conduct a two-sample t-test assuming equal variances.

In [17]:
# Ho - The mean diameters of the cutlets from Unit A and Unit B are equal
# H1 - The mean diameters of the two units are not equal

scipy.stats.ttest_ind(data['Unit A'],data['Unit B'])

Ttest_indResult(statistic=0.7228688704678061, pvalue=0.4722394724599501)

Since the p-value (0.472) is greater than 0.05, we fail to reject the null hypothesis.

Inference: at the 5% significance level, there is no significant difference between the cutlet diameters of the two units.
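
To make the equal-variance t-test less of a black box, here is a minimal sketch that computes the pooled-variance statistic by hand and compares it with `stats.ttest_ind`. It uses only the ten rows printed above, so the numbers will not match the result on the full file.

```python
import numpy as np
from scipy import stats

# The ten rows printed above; the full file has more rows,
# so these numbers will not match the full-sample result.
unit_a = np.array([6.809, 6.4376, 6.9157, 7.3012, 7.4488,
                   7.3871, 6.8755, 7.0621, 6.684, 6.8236])
unit_b = np.array([6.7703, 7.5093, 6.73, 6.7878, 7.1522,
                   6.811, 7.2212, 6.6606, 7.2402, 7.0503])

n_a, n_b = len(unit_a), len(unit_b)
# Pooled variance: weighted average of the two sample variances (ddof=1)
pooled_var = ((n_a - 1) * unit_a.var(ddof=1)
              + (n_b - 1) * unit_b.var(ddof=1)) / (n_a + n_b - 2)
t_manual = (unit_a.mean() - unit_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
# Two-sided p-value from the t distribution with n_a + n_b - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n_a + n_b - 2)

t_scipy, p_scipy = stats.ttest_ind(unit_a, unit_b)  # equal_var=True is the default
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

If Levene's test had rejected equal variances, passing `equal_var=False` to `stats.ttest_ind` would run Welch's t-test instead.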

# Problem Statement

A hospital wants to determine whether there is any difference in the
average Turn Around Time (TAT) of reports of the laboratories on their
preferred list. They collected a random sample and recorded TAT for
reports of 4 laboratories. TAT is defined as sample collected to report
dispatch.
Analyze the data and determine whether there is any difference in
average TAT among the different laboratories at 5% significance level.
File: LabTAT.csv

Here we have more than two continuous variables, so we start with a normality test.

In [18]:
data = pd.read_csv('lab_tat_updated.csv')
data

Unnamed: 0,Laboratory_1,Laboratory_2,Laboratory_3,Laboratory_4
0,185.35,165.53,176.70,166.13
1,170.49,185.91,198.45,160.79
2,192.77,194.92,201.23,185.18
3,177.33,183.00,199.61,176.42
4,193.41,169.57,204.63,152.60
...,...,...,...,...
115,160.25,170.66,193.80,172.68
116,176.08,183.98,215.25,177.64
117,202.48,174.54,211.22,170.27
118,182.40,197.18,194.52,150.87


In [19]:
# First we will check for normality

# Ho - All variables are normal
# H1 - At least one variable is not normal

print(stats.shapiro(data['Laboratory_1']))
print(stats.shapiro(data['Laboratory_2']))
print(stats.shapiro(data['Laboratory_3']))
print(stats.shapiro(data['Laboratory_4']))

ShapiroResult(statistic=0.9886691570281982, pvalue=0.42317795753479004)
ShapiroResult(statistic=0.9936322569847107, pvalue=0.8637524843215942)
ShapiroResult(statistic=0.9796067476272583, pvalue=0.06547004729509354)
ShapiroResult(statistic=0.9913753271102905, pvalue=0.6618951559066772)


The p-values of all four variables are > 0.05, so we fail to reject the null hypothesis.

Since all variables can be treated as normal, we will now test for equal variances.
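
With several columns, the repeated `stats.shapiro` calls can be written as a loop. The sketch below runs on synthetic normal data generated with NumPy (an assumption for illustration, not the lab file), so its p-values say nothing about the real laboratories.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (NOT the real lab file): four normal columns
rng = np.random.default_rng(0)
columns = ['Laboratory_1', 'Laboratory_2', 'Laboratory_3', 'Laboratory_4']
demo = pd.DataFrame(rng.normal(loc=180, scale=15, size=(120, 4)), columns=columns)

# Run the Shapiro-Wilk test on every column and collect the p-values
shapiro_p = {col: stats.shapiro(demo[col]).pvalue for col in demo.columns}
for col, p in shapiro_p.items():
    verdict = 'fail to reject normality' if p > 0.05 else 'reject normality'
    print(col, round(p, 4), verdict)
```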

In [20]:
# Ho - Variances of all variables are equal
# H1 - At least one variance is different

scipy.stats.levene(data['Laboratory_1'],data['Laboratory_2'],data['Laboratory_3'],data['Laboratory_4'])

LeveneResult(statistic=1.025294593220823, pvalue=0.38107781677304564)

Since the p-value is greater than 0.05, we fail to reject the null hypothesis and will now conduct a one-way ANOVA.

In [21]:
# Ho - Mean TAT of all labs is equal
# H1 - Mean TAT of at least one lab is not equal

F, p = stats.f_oneway(data['Laboratory_1'],data['Laboratory_2'],data['Laboratory_3'],data['Laboratory_4'])
print(F)
print(p)

121.39264646442368
2.143740909435053e-58


The p-value is about 2e-58, which is < 0.05, so we reject the null hypothesis. <br>

Inference: the mean TAT differs significantly across the labs, meaning at least one lab's mean differs from the others.
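
As a rough illustration of what `stats.f_oneway` computes, the sketch below derives the F statistic from the between-group and within-group sums of squares. It uses only the nine rows printed above (rows 0-4 and 115-118), so the values will not match the full-sample result.

```python
import numpy as np
from scipy import stats

# The nine rows printed above (rows 0-4 and 115-118), not the full file
labs = [
    np.array([185.35, 170.49, 192.77, 177.33, 193.41, 160.25, 176.08, 202.48, 182.40]),
    np.array([165.53, 185.91, 194.92, 183.00, 169.57, 170.66, 183.98, 174.54, 197.18]),
    np.array([176.70, 198.45, 201.23, 199.61, 204.63, 193.80, 215.25, 211.22, 194.52]),
    np.array([166.13, 160.79, 185.18, 176.42, 152.60, 172.68, 177.64, 170.27, 150.87]),
]

k = len(labs)                      # number of groups
n = sum(len(g) for g in labs)      # total number of observations
grand_mean = np.concatenate(labs).mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in labs)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in labs)

# F is the ratio of the two mean squares
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
p_manual = stats.f.sf(f_manual, k - 1, n - k)

f_scipy, p_scipy = stats.f_oneway(*labs)
print(f_manual, f_scipy)
print(p_manual, p_scipy)
```

ANOVA only says that at least one mean differs; a post-hoc comparison would be needed to find which labs differ.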

# Problem Statement

Sales of products in four different regions are tabulated for males and
females. Find if male-female buyer ratios are similar across regions.

| | East | West | North | South |
|---|---|---|---|---|
| Males | 50 | 142 | 131 | 70 |
| Females | 550 | 351 | 480 | 350 |

Let's load our data

In [23]:
data = pd.read_csv('BuyerRatio.csv')
data

Unnamed: 0,Observed Values,East,West,North,South
0,Males,50,142,131,70
1,Females,435,1523,1356,750


In [24]:
from statsmodels.stats.proportion import proportions_ztest

In [29]:
alpha = 0.05  # 5% significance level
Male = [50,142,131,70]
Female = [435,1523,1356,750]
Sales = [Male,Female]
print(Sales)

[[50, 142, 131, 70], [435, 1523, 1356, 750]]


In [31]:
# Ho - Male-female ratio is similar across regions
# H1 - Ratio is not similar in at least one region

chiStats = stats.chi2_contingency(Sales)
print('Chi-square statistic=%f p-value=%f' % (chiStats[0], chiStats[1]))
print('Interpret by p-Value')
if chiStats[1] < 0.05:
    print('we reject null hypothesis')
else:
    print('we fail to reject null hypothesis')

Chi-square statistic=1.595946 p-value=0.660309
Interpret by p-Value
we fail to reject null hypothesis


In [33]:
# Critical-value approach at alpha = 0.05
alpha = 0.05
dof = chiStats[2]  # degrees of freedom returned by chi2_contingency
critical_value = stats.chi2.ppf(q=1 - alpha, df=dof)  # upper 5% cutoff of the chi-square distribution

observed_chi_val = chiStats[0]
# observed chi-square <= critical value: no evidence that the variables are related
# observed chi-square > critical value: the variables are not independent (and hence may be related)
print('Interpret by critical value')
if observed_chi_val <= critical_value:
    # observed value is outside the critical region, so we fail to reject the null hypothesis
    print ('Null hypothesis cannot be rejected (variables are not related)')
else:
    # observed value is in the critical region, so we reject the null hypothesis
    print ('Null hypothesis cannot be accepted (variables are not independent)')

Interpret by critical value
Null hypothesis cannot be rejected (variables are not related)


Inference: there is no significant evidence that the male-female buyer ratio differs across regions.
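
For readers who want to see where the chi-square statistic comes from, here is a small sketch that rebuilds the expected counts under independence from the `Sales` table used above and checks them against `stats.chi2_contingency`. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Observed counts from the Sales table above (rows: Males, Females)
observed = np.array([[50, 142, 131, 70],
                     [435, 1523, 1356, 750]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
# Expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / observed.sum()

chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)

res = stats.chi2_contingency(observed)
print(round(chi2_manual, 6), round(p_manual, 6), dof)
```

The p-value rule and the critical-value rule always give the same decision, because both compare the same statistic with the same chi-square distribution.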

# Problem Statement

Telecall uses 4 centers around the globe to process customer order forms.
They audit a certain % of the customer order forms. Any error in order form
renders it defective and must be reworked before processing. The manager
wants to check whether the defective % varies by center. Please analyze
the data at 5% significance level and help the manager draw appropriate
inferences.
File: Customer OrderForm.csv

In [34]:
data = pd.read_csv('CustomerOrderform.csv')

In [35]:
data.head()

Unnamed: 0,Phillippines,Indonesia,Malta,India
0,Error Free,Error Free,Defective,Error Free
1,Error Free,Error Free,Error Free,Defective
2,Error Free,Defective,Defective,Error Free
3,Error Free,Error Free,Error Free,Error Free
4,Error Free,Error Free,Defective,Error Free


In [36]:
phillippines_value = data['Phillippines'].value_counts()
indonesia_value = data['Indonesia'].value_counts()
malta_value = data['Malta'].value_counts()
india_value = data['India'].value_counts()
print(phillippines_value)
print(indonesia_value)
print(malta_value)
print(india_value)

Error Free    271
Defective      29
Name: Phillippines, dtype: int64
Error Free    267
Defective      33
Name: Indonesia, dtype: int64
Error Free    269
Defective      31
Name: Malta, dtype: int64
Error Free    280
Defective      20
Name: India, dtype: int64
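
Rather than retyping these counts by hand, the contingency table could be built directly from the DataFrame. The sketch below shows one way, on a tiny made-up frame (not the real order-form file), assuming every column holds only the labels `Error Free` and `Defective`.

```python
import pandas as pd

# Made-up stand-in for the order-form data (NOT the real file)
demo = pd.DataFrame({
    'Phillippines': ['Error Free', 'Error Free', 'Defective', 'Error Free'],
    'Indonesia': ['Error Free', 'Defective', 'Defective', 'Error Free'],
    'Malta': ['Defective', 'Error Free', 'Error Free', 'Error Free'],
    'India': ['Error Free', 'Error Free', 'Error Free', 'Defective'],
})

# Count each status per column; rows become statuses, columns stay centers
table = demo.apply(pd.Series.value_counts).fillna(0).astype(int)
table = table.loc[['Error Free', 'Defective']]  # fix the row order explicitly
print(table)
```

The resulting `table.values` can be passed straight to `stats.chi2_contingency`.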


In [39]:
chiStats = stats.chi2_contingency([[271,267,269,280],[29,33,31,20]])
print('Chi-square statistic=%f p-value=%f' % (chiStats[0], chiStats[1]))
print('Interpret by p-Value')
if chiStats[1] < 0.05:
    print('we reject null hypothesis')
else:
    print('we fail to reject null hypothesis')

Chi-square statistic=3.858961 p-value=0.277102
Interpret by p-Value
we fail to reject null hypothesis


In [40]:
# Critical-value approach at alpha = 0.05
alpha = 0.05
critical_value = stats.chi2.ppf(q=1 - alpha, df=chiStats[2])
observed_chi_val = chiStats[0]
print('Interpret by critical value')
if observed_chi_val <= critical_value:
    print ('Null hypothesis cannot be rejected (variables are not related)')
else:
    print ('Null hypothesis cannot be accepted (variables are not independent)')

Interpret by critical value
Null hypothesis cannot be rejected (variables are not related)


Inference: there is no significant evidence that the defective % varies across the centers.
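
To put the test result next to the raw numbers, the defective percentage per center can be computed from the counts printed earlier. This is only a descriptive sketch; the variable names are illustrative.

```python
import numpy as np

centers = ['Phillippines', 'Indonesia', 'Malta', 'India']
error_free = np.array([271, 267, 269, 280])  # counts printed above
defective = np.array([29, 33, 31, 20])

# Share of defective forms per center, in percent
defect_pct = 100 * defective / (error_free + defective)
for name, pct in zip(centers, defect_pct):
    print(f'{name}: {pct:.2f}%')
```

The rates look different at a glance, but the chi-square test above indicates that differences of this size are consistent with chance at the 5% level.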

# Problem Statement

5. Fantaloons sales managers commented that the % of males versus females walking into the store differs based on the day of the week. Analyze the data and determine whether there is evidence at the 5% significance level to support this hypothesis.

In [42]:
Fantaloons=pd.read_csv('Fantaloons.csv')
Fantaloons.head()

Unnamed: 0,Weekdays,Weekend
0,Male,Female
1,Female,Male
2,Female,Male
3,Male,Female
4,Female,Female


In [43]:
Weekdays_value=Fantaloons['Weekdays'].value_counts()
Weekend_value=Fantaloons['Weekend'].value_counts()
print(Weekdays_value,Weekend_value)

Female    287
Male      113
Name: Weekdays, dtype: int64 Female    233
Male      167
Name: Weekend, dtype: int64


In [44]:
# Compare the male share of walk-ins on weekdays with the male share on weekends
count = np.array([Weekdays_value['Male'], Weekend_value['Male']])  # males: 113 (weekdays), 167 (weekend)
nobs = np.array([Weekdays_value.sum(), Weekend_value.sum()])  # total walk-ins: 400 and 400

# Ho -> the male proportion is the same on weekdays and weekends
# Ha -> the male proportion differs between weekdays and weekends
stat, pval = proportions_ztest(count, nobs, alternative='two-sided')
# alternative can be 'two-sided', 'smaller' or 'larger';
# with two samples, 'smaller' means Ha: p1 < p2 and 'larger' means Ha: p1 > p2
print('{0:0.3f}'.format(pval))
# p-value < 0.05 -> reject Ho, the proportions are unequal

# Ho -> male proportion on weekdays <= male proportion on weekends
# Ha -> male proportion on weekdays > male proportion on weekends
stat, pval = proportions_ztest(count, nobs, alternative='larger')
print('{0:0.3f}'.format(pval))
# p-value > 0.05 -> fail to reject Ho, so there is no evidence
# that the male share is larger on weekdays

0.000
1.000


Inference: the two-sided p-value is < 0.05, so we reject the null hypothesis. The proportion of males versus females walking into the store differs between weekdays and weekends; in this sample the male share is lower on weekdays (113 of 400) than on weekends (167 of 400).
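
One way to check whether the male share differs between weekdays and weekends, using only the counts printed above, is to compute the two-proportion z statistic by hand and compare it with a chi-square test on the 2x2 table. This is a sketch with illustrative variable names; without the continuity correction the two tests should agree, since the chi-square statistic equals z squared.

```python
import numpy as np
from scipy import stats

# Counts printed above: rows are Male/Female, columns are Weekdays/Weekend
table = np.array([[113, 167],
                  [287, 233]])

males = table[0]
totals = table.sum(axis=0)            # 400 walk-ins in each column
p_hat = males / totals                # male share on weekdays and on weekends
p_pool = males.sum() / totals.sum()   # pooled male share under H0

# Two-proportion z statistic with a pooled standard error
se = np.sqrt(p_pool * (1 - p_pool) * (1 / totals[0] + 1 / totals[1]))
z = (p_hat[0] - p_hat[1]) / se
p_z = 2 * stats.norm.sf(abs(z))

# Chi-square test on the same table, without Yates' continuity correction
res = stats.chi2_contingency(table, correction=False)
chi2, p_chi2 = res[0], res[1]
print(round(z, 4), round(chi2, 4))
print(p_z, p_chi2)
```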