In [1]:
import pandas as pd
import numpy as np
from scipy import stats

## Q1. Null and Alternate Hypothesis for the analysis
+ Alternate Hypothesis :- There is significant difference in diameter of Unit A and Unit B
+ Null Hypothesis :- There is no Significant difference in diameter of Unit A and Unit B 

In [2]:
df = pd.read_csv("Cutlets.csv")

In [3]:
print(df)

    Unit A  Unit B
0   6.8090  6.7703
1   6.4376  7.5093
2   6.9157  6.7300
3   7.3012  6.7878
4   7.4488  7.1522
5   7.3871  6.8110
6   6.8755  7.2212
7   7.0621  6.6606
8   6.6840  7.2402
9   6.8236  7.0503
10  7.3930  6.8810
11  7.5169  7.4059
12  6.9246  6.7652
13  6.9256  6.0380
14  6.5797  7.1581
15  6.8394  7.0240
16  6.5970  6.6672
17  7.2705  7.4314
18  7.2828  7.3070
19  7.3495  6.7478
20  6.9438  6.8889
21  7.1560  7.4220
22  6.5341  6.5217
23  7.2854  7.1688
24  6.9952  6.7594
25  6.8568  6.9399
26  7.2163  7.0133
27  6.6801  6.9182
28  6.9431  6.3346
29  7.0852  7.5459
30  6.7794  7.0992
31  7.2783  7.1180
32  7.1561  6.6965
33  7.3943  6.5780
34  6.9405  7.3875


## From the data it seems that we can perform a two-sample t-test to determine if there's a significant difference in the diameter of the cutlets between Unit A and Unit B.

+ Assumptions :- 
1. Normality: The data in each group are approximately normally distributed.
+ Will use Shapiro-Wilk test to check validity of this assumption
2. Homogeneity of Variances: The variances of the two groups should be approximately equal. 
+ Will use a Levene's test to check validity of this assumption

## Shapiro-Wilk test to check normality

In [4]:
from scipy.stats import shapiro

In [5]:
stat,p = shapiro(df['Unit A'])
print("stat = %.3f, p = %.3f"%(stat,p) )

stat = 0.965, p = 0.320


In [6]:
stat,p = shapiro(df['Unit B'])
print("stat = %.3f, p = %.3f"%(stat,p))

stat = 0.973, p = 0.523


## Since both p-values (0.320 for Unit A and 0.523 for Unit B) are greater than the typical significance level of 0.05, we do not have enough evidence to reject the null hypothesis of normality. This suggests that we can proceed with the assumption of normality for our data.

+ Now, we can move on to checking the homogeneity of variances using Levene's test. This test assesses whether the variances of the two groups are approximately equal.

In [7]:
# Levene's test for homogenity of variance
statistic, pvalue = stats.levene(df['Unit A'],df['Unit B'])
print("statistic = %.3f,p-value = %.3f"%(statistic, pvalue))

statistic = 0.665,p-value = 0.418


+ The Levene's test for homogeneity of variances gives a p-value of 0.418, which is greater than 0.05. This suggests that we do not have enough evidence to reject the null hypothesis of equal variances. Therefore, we can proceed with the assumption of equal variances.

## Given that the assumptions of normality and equal variances hold, we can confidently use a two-sample t-test to compare the mean diameter of cutlets between Unit A and Unit B.

In [8]:
t_statistics,p_value = stats.ttest_ind(df['Unit A'],df['Unit B'])
print("t-statistics = %.3f,pvalue = %.3f"%(t_statistics,p_value))

t-statistics = 0.723,pvalue = 0.472


+ The two-sample t-test yields a t-statistic of approximately 0.723 and a p-value of 0.472. Since the p-value is greater than the significance level of 0.05, we do not have enough evidence to reject the null hypothesis.

## Therefore, based on the data, we do not find a significant difference in the mean diameter of cutlets between Unit A and Unit B at the 5% significance level.

## Q2. 

## Null Hypothesis :- 
    + There is no significant difference between Turn around Time of all the laboratries. 
## Alternate Hypothesis :- 
    + There is significant difference between Turn around Time of all the laboratries.

In [9]:
df2 = pd.read_csv('LabTAT.csv')

In [10]:
df2

Unnamed: 0,Laboratory 1,Laboratory 2,Laboratory 3,Laboratory 4
0,185.35,165.53,176.70,166.13
1,170.49,185.91,198.45,160.79
2,192.77,194.92,201.23,185.18
3,177.33,183.00,199.61,176.42
4,193.41,169.57,204.63,152.60
...,...,...,...,...
115,178.49,170.66,193.80,172.68
116,176.08,183.98,215.25,177.64
117,202.48,174.54,203.99,170.27
118,182.40,197.18,194.52,150.87


+ We'll perform a one-way analysis of variance (ANOVA) test to compare the means of multiple groups. after checking the validity of our assumptions 
+ Assumptions for each set of Data:

    - 1. Normality: We Check if TAT data for each laboratory is approximately normally distributed using the Shapiro-Wilk test.
    - 2. Homogeneity of Variances: We Check if variances of TAT are approximately equal among the laboratories using Levene's test.

In [11]:
labs = list(df2.columns)
for i in range(1,5):
    stat,p = shapiro(df2.iloc[i])
    print(labs[i-1])
    print("stat = %.3f, p = %.3f"%(stat,p))

Laboratory 1
stat = 0.975, p = 0.872
Laboratory 2
stat = 0.986, p = 0.937
Laboratory 3
stat = 0.822, p = 0.148
Laboratory 4
stat = 0.959, p = 0.771


+ For all laboratories, the p-values are greater than 0.05, indicating that we do not have enough evidence to reject the null hypothesis of normality. Therefore, we can proceed with the assumption of normality for the data. Let's now check the homogeneity of variances using Levene's test. 

In [12]:
stat,p = stats.levene(df2['Laboratory 1'],df2['Laboratory 2'],df2['Laboratory 3'],df2['Laboratory 4'])
print("stat = %.3f, p = %.3f"%(stat,p))

stat = 2.600, p = 0.052


+ The Levene's test for homogeneity of variances gives a p-value of 0.052, which is very close to the typical significance level of 0.05. While it's not significantly less than 0.05, it's still worth noting that the assumption of equal variances might be borderline.

    + Given this result, We continue with one-way ANOVA test, but it's essential to interpret the results cautiously, considering the p-value from Levene's test.

In [13]:
# anova_result
stat,p = stats.f_oneway(df2['Laboratory 1'],df2['Laboratory 2'],df2['Laboratory 3'],df2['Laboratory 4'])
print("stat = %.3f, p = %.3f"%(stat,p))

stat = 118.704, p = 0.000


+ The one-way ANOVA test indicates a significant difference in average Turn Around Time (TAT) among the laboratories. The p-value is extremely low (p = 0.000), which is less than the typical significance level of 0.05. Therefore, we have enough evidence to reject the null hypothesis of equal means, suggesting that at least one laboratory has a different average TAT compared to the others.

## Q3. 

In [14]:
buyers = pd.read_csv("BuyerRatio.csv",index_col='Observed Values')

In [15]:
buyers

Unnamed: 0_level_0,East,West,North,South
Observed Values,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Males,50,142,131,70
Females,435,1523,1356,750


+ Null Hypothesis :- All Proportions are Equal
+ Alternate Hypothesis :- Not all Proportions are equal

In [16]:
from scipy.stats import chi2_contingency
# Perform Chi2 test
chi2,p_value,_,_ = chi2_contingency(buyers)

In [17]:
# Print the results
print(f"Chi-square value: {chi2}")
print(f"P-value: {p_value}")

Chi-square value: 1.595945538661058
P-value: 0.6603094907091882


+ The Chi-square test of independence yields a Chi-square value of approximately 1.596 and a p-value of 0.660.

+ Since the p-value is greater than the common significance level of 0.05, we do not have enough evidence to reject the null hypothesis. This suggests that there is no significant association between gender and region, and the male-female buyer ratios are similar across the four regions

## Q.4

In [18]:
orig_order_forms = pd.read_csv('Costomer+OrderForm.csv')

In [19]:
order_forms = orig_order_forms.copy()

In [20]:
from sklearn.preprocessing import LabelEncoder

In [21]:
orig_order_forms.head()

Unnamed: 0,Phillippines,Indonesia,Malta,India
0,Error Free,Error Free,Defective,Error Free
1,Error Free,Error Free,Error Free,Defective
2,Error Free,Defective,Defective,Error Free
3,Error Free,Error Free,Error Free,Error Free
4,Error Free,Error Free,Defective,Error Free


In [22]:
orig_order_forms.describe()

Unnamed: 0,Phillippines,Indonesia,Malta,India
count,300,300,300,300
unique,2,2,2,2
top,Error Free,Error Free,Error Free,Error Free
freq,271,267,269,280


In [23]:
lab = LabelEncoder()

In [24]:
countries = ['Phillippines','Indonesia','Malta','India']
for country in countries:
    order_forms[country] = lab.fit_transform(order_forms[country])

+ Label 0 :- Defective 
+ Label 1 :- Error Free

In [25]:
order_forms.head()

Unnamed: 0,Phillippines,Indonesia,Malta,India
0,1,1,0,1
1,1,1,1,0
2,1,0,0,1
3,1,1,1,1
4,1,1,0,1


In [26]:
countries_order = pd.DataFrame(
    [
    [300-sum(order_forms['Phillippines']),sum(order_forms['Phillippines'])],
    [300-sum(order_forms['Indonesia']),sum(order_forms['Indonesia'])],
    [300-sum(order_forms['Malta']),sum(order_forms['Malta'])],
    [300-sum(order_forms['India']),sum(order_forms['India'])]
    ],index =countries ,columns=['Defective','Error Free'])
countries_order

Unnamed: 0,Defective,Error Free
Phillippines,29,271
Indonesia,33,267
Malta,31,269
India,20,280


#### Null Hypothesis (H0): There is no significant difference in defective percentages among the centers.
#### Alternative Hypothesis (H1): There is a significant difference in defective percentages among the centers.

In [27]:
# Perform chi-square test
chi2_stat, p_val, dof, expected = chi2_contingency(countries_order)

# Print the results
print(f"Chi-square Statistic: {chi2_stat}")
print(f"P-value: {p_val}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
print(expected)

Chi-square Statistic: 3.8589606858203545
P-value: 0.2771020991233144
Degrees of Freedom: 3
Expected Frequencies:
[[ 28.25 271.75]
 [ 28.25 271.75]
 [ 28.25 271.75]
 [ 28.25 271.75]]


+ Since the p-value (0.2771) is greater than the significance level of 0.05 (5%), we fail to reject the null hypothesis. There is not enough evidence to conclude that the defective percentages vary significantly among the centers.