# Case Study on ANOVA

XYZ Company has offices in four different zones. The company wishes to
investigate the following:
    
● The mean sales generated by each zone.

● Total sales generated by all the zones for each month.

● Check whether all the zones generate the same amount of sales.


In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import scipy.stats as stats

In [2]:
# Load the zone-wise sales dataset
data = pd.read_csv("F:\\pythonprogramming\\Sales_data_zone_wise.csv")
data.head()

|   | Month | Zone - A | Zone - B | Zone - C | Zone - D |
|---|---|---|---|---|---|
| 0 | Month - 1 | 1483525 | 1748451 | 1523308 | 2267260 |
| 1 | Month - 2 | 1238428 | 1707421 | 2212113 | 1994341 |
| 2 | Month - 3 | 1860771 | 2091194 | 1282374 | 1241600 |
| 3 | Month - 4 | 1871571 | 1759617 | 2290580 | 2252681 |
| 4 | Month - 5 | 1244922 | 1606010 | 1818334 | 1326062 |


In [3]:
# Print a summary of the DataFrame: index range, column dtypes,
# non-null counts and memory usage
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Month     29 non-null     object
 1   Zone - A  29 non-null     int64 
 2   Zone - B  29 non-null     int64 
 3   Zone - C  29 non-null     int64 
 4   Zone - D  29 non-null     int64 
dtypes: int64(4), object(1)
memory usage: 1.3+ KB


In [4]:
#Printing the dataset
data

|   | Month | Zone - A | Zone - B | Zone - C | Zone - D |
|---|---|---|---|---|---|
| 0 | Month - 1 | 1483525 | 1748451 | 1523308 | 2267260 |
| 1 | Month - 2 | 1238428 | 1707421 | 2212113 | 1994341 |
| 2 | Month - 3 | 1860771 | 2091194 | 1282374 | 1241600 |
| 3 | Month - 4 | 1871571 | 1759617 | 2290580 | 2252681 |
| 4 | Month - 5 | 1244922 | 1606010 | 1818334 | 1326062 |
| 5 | Month - 6 | 1534390 | 1573128 | 1751825 | 2292044 |
| 6 | Month - 7 | 1820196 | 1992031 | 1786826 | 1688055 |
| 7 | Month - 8 | 1625696 | 1665534 | 2161754 | 2363315 |
| 8 | Month - 9 | 1652644 | 1873402 | 1755290 | 1422059 |
| 9 | Month - 10 | 1852450 | 1913059 | 1754314 | 1608387 |


In [5]:
# To verify null data items
data.isnull().sum()  

Month       0
Zone - A    0
Zone - B    0
Zone - C    0
Zone - D    0
dtype: int64

In [6]:
data.describe()

|   | Zone - A | Zone - B | Zone - C | Zone - D |
|---|---|---|---|---|
| count | 29.0 | 29.0 | 29.0 | 29.0 |
| mean | 1540493.0 | 1755560.0 | 1772871.0 | 1842927.0 |
| std | 261940.1 | 168389.9 | 333193.7 | 375016.5 |
| min | 1128185.0 | 1527574.0 | 1237722.0 | 1234311.0 |
| 25% | 1305972.0 | 1606010.0 | 1523308.0 | 1520406.0 |
| 50% | 1534390.0 | 1740365.0 | 1767047.0 | 1854412.0 |
| 75% | 1820196.0 | 1875658.0 | 2098463.0 | 2180416.0 |
| max | 2004480.0 | 2091194.0 | 2290580.0 | 2364132.0 |


# 1. The mean sales generated by each zone.

In [7]:
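# Append the column-wise means as an extra row labelled 'Group Means'.
# numeric_only=True skips the text column 'Month', which recent pandas
# versions no longer drop silently.
data.loc['Group Means'] = data.mean(numeric_only=True)
data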

|   | Month | Zone - A | Zone - B | Zone - C | Zone - D |
|---|---|---|---|---|---|
| 0 | Month - 1 | 1483525.0 | 1748451.0 | 1523308.0 | 2267260.0 |
| 1 | Month - 2 | 1238428.0 | 1707421.0 | 2212113.0 | 1994341.0 |
| 2 | Month - 3 | 1860771.0 | 2091194.0 | 1282374.0 | 1241600.0 |
| 3 | Month - 4 | 1871571.0 | 1759617.0 | 2290580.0 | 2252681.0 |
| 4 | Month - 5 | 1244922.0 | 1606010.0 | 1818334.0 | 1326062.0 |
| 5 | Month - 6 | 1534390.0 | 1573128.0 | 1751825.0 | 2292044.0 |
| 6 | Month - 7 | 1820196.0 | 1992031.0 | 1786826.0 | 1688055.0 |
| 7 | Month - 8 | 1625696.0 | 1665534.0 | 2161754.0 | 2363315.0 |
| 8 | Month - 9 | 1652644.0 | 1873402.0 | 1755290.0 | 1422059.0 |
| 9 | Month - 10 | 1852450.0 | 1913059.0 | 1754314.0 | 1608387.0 |


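That is, the mean sales generated by each zone are:

● Zone - A : 1.540493e+06 (about 1.54 million)

● Zone - B : 1.755560e+06 (about 1.76 million)

● Zone - C : 1.772871e+06 (about 1.77 million)

● Zone - D : 1.842927e+06 (about 1.84 million)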
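
The per-zone means can also be computed without adding a summary row to the dataset. The following is a minimal sketch, assuming only the first five months shown in the `data.head()` output (so its means differ from the 29-month figures above); the names `sample`, `zone_cols` and `zone_means` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# First five months, copied from the data.head() output (not the full dataset)
sample = pd.DataFrame({
    "Month": ["Month - 1", "Month - 2", "Month - 3", "Month - 4", "Month - 5"],
    "Zone - A": [1483525, 1238428, 1860771, 1871571, 1244922],
    "Zone - B": [1748451, 1707421, 2091194, 1759617, 1606010],
    "Zone - C": [1523308, 2212113, 1282374, 2290580, 1818334],
    "Zone - D": [2267260, 1994341, 1241600, 2252681, 1326062],
})

# Select the numeric zone columns explicitly so 'Month' is never averaged
zone_cols = [col for col in sample.columns if col.startswith("Zone")]

# Column-wise mean: one value per zone; 'sample' itself is left unchanged
zone_means = sample[zone_cols].mean()
print(zone_means)
```

Keeping the means in a separate Series leaves the monthly rows untouched for any later analysis.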

# 2. Total sales generated by all the zones for each month.

In [8]:
# Row-wise total across the four zone columns; numeric_only=True skips 'Month'
data.sum(axis=1, numeric_only=True)

0              7.022544e+06
1              7.152303e+06
2              6.475939e+06
3              8.174449e+06
4              5.995328e+06
5              7.151387e+06
6              7.287108e+06
7              7.816299e+06
8              6.703395e+06
9              7.128210e+06
10             7.032783e+06
11             6.111084e+06
12             5.925424e+06
13             7.155515e+06
14             5.934156e+06
15             6.506659e+06
16             7.149383e+06
17             7.083490e+06
18             6.971953e+06
19             7.124599e+06
20             7.389597e+06
21             7.560001e+06
22             6.687919e+06
23             7.784747e+06
24             6.095918e+06
25             6.512360e+06
26             6.267918e+06
27             7.470920e+06
28             6.772277e+06
Group Means    6.911851e+06
dtype: float64
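
As a cross-check on the first few totals, here is a small sketch that sums the zone columns for the first five months shown earlier. It indexes the result by month label rather than row number, which is a presentation choice made here and not something the notebook does; `sample`, `zone_cols` and `monthly_totals` are illustrative names.

```python
import pandas as pd

# First five months, copied from the data.head() output (not the full dataset)
sample = pd.DataFrame({
    "Month": ["Month - 1", "Month - 2", "Month - 3", "Month - 4", "Month - 5"],
    "Zone - A": [1483525, 1238428, 1860771, 1871571, 1244922],
    "Zone - B": [1748451, 1707421, 2091194, 1759617, 1606010],
    "Zone - C": [1523308, 2212113, 1282374, 2290580, 1818334],
    "Zone - D": [2267260, 1994341, 1241600, 2252681, 1326062],
})

zone_cols = ["Zone - A", "Zone - B", "Zone - C", "Zone - D"]

# Sum across the zone columns (axis=1) so each month gets one total
monthly_totals = sample.set_index("Month")[zone_cols].sum(axis=1)
print(monthly_totals)
```

The total for Month - 1 is 7022544, which agrees with the first entry (7.022544e+06) in the output above.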

# 3. Check whether all the zones generate the same amount of sales.


Using one-way ANOVA:

Level of significance, alpha = 0.05

Null Hypothesis, H0 : The mean sales are the same in all 4 zones.

Alternate Hypothesis, Ha : The mean sales differ in at least one of the 4 zones.
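
For reference, the test statistic behind one-way ANOVA compares the variability between zone means with the variability within each zone. The standard formula is sketched below; the symbols $k$ (number of zones), $n_i$ (observations in zone $i$), $N$ (total observations), $\bar{x}_i$ (zone mean) and $\bar{x}$ (grand mean) are notation introduced here, not taken from the notebook.

```latex
F = \frac{\mathrm{SSB}/(k-1)}{\mathrm{SSW}/(N-k)},
\qquad
\mathrm{SSB} = \sum_{i=1}^{k} n_i \left(\bar{x}_i - \bar{x}\right)^2,
\qquad
\mathrm{SSW} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left(x_{ij} - \bar{x}_i\right)^2
```

With $k = 4$ zones and, assuming 29 monthly observations per zone, $N = 116$, the statistic follows an $F$ distribution with $(3, 112)$ degrees of freedom under H0.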

In [12]:
from scipy.stats import f_oneway

In [13]:
alpha_value = 0.05
# One-way ANOVA across the four zone columns
fvalue, pvalue = f_oneway(data['Zone - A'], data['Zone - B'], data['Zone - C'], data['Zone - D'])
print("p-value-:", pvalue)
print("Significance level-:", alpha_value)


p-value-: 0.0007050314102121292
Significance level-: 0.05
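
To show what `f_oneway` computes, the sketch below works out the F statistic by hand from the between-group and within-group sums of squares and compares it with SciPy's result. It assumes only the first ten months printed earlier, so its p-value is not expected to match the 29-month result above; all variable names are illustrative.

```python
import numpy as np
from scipy.stats import f, f_oneway

# First ten months per zone, copied from the printed dataset (not all 29 months)
groups = [
    np.array([1483525, 1238428, 1860771, 1871571, 1244922,
              1534390, 1820196, 1625696, 1652644, 1852450]),  # Zone - A
    np.array([1748451, 1707421, 2091194, 1759617, 1606010,
              1573128, 1992031, 1665534, 1873402, 1913059]),  # Zone - B
    np.array([1523308, 2212113, 1282374, 2290580, 1818334,
              1751825, 1786826, 2161754, 1755290, 1754314]),  # Zone - C
    np.array([2267260, 1994341, 1241600, 2252681, 1326062,
              2292044, 1688055, 2363315, 1422059, 1608387]),  # Zone - D
]

k = len(groups)                          # number of groups (zones)
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F statistic and its upper-tail probability under the F(k-1, N-k) distribution
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = f.sf(f_manual, k - 1, n_total - k)

# SciPy's one-way ANOVA on the same groups, for comparison
f_scipy, p_scipy = f_oneway(*groups)
print(f_manual, f_scipy)
print(p_manual, p_scipy)
```

The manual and SciPy values should agree up to floating-point error.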


In [11]:
if pvalue <= alpha_value:
    print("Reject H0, Different amount of sales are generated in 4 Zones")
else:
    print("Retain H0, Same amount of sales are generated in 4 Zones")

Reject H0, Different amount of sales are generated in 4 Zones
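
One caveat: a 'Group Means' row was appended to `data` in section 1, and if that row is still present when the test runs, each zone's mean is counted as an extra observation. That leaves the within-group sum of squares unchanged while adding observations, which inflates the F statistic. The sketch below reproduces the situation on the first five months only and shows one way to drop the row before testing; `sample`, `with_summary` and `monthly` are hypothetical names, and the notebook itself may or may not have been affected.

```python
import pandas as pd
from scipy.stats import f_oneway

# First five months, copied from the data.head() output (not the full dataset)
sample = pd.DataFrame({
    "Month": ["Month - 1", "Month - 2", "Month - 3", "Month - 4", "Month - 5"],
    "Zone - A": [1483525, 1238428, 1860771, 1871571, 1244922],
    "Zone - B": [1748451, 1707421, 2091194, 1759617, 1606010],
    "Zone - C": [1523308, 2212113, 1282374, 2290580, 1818334],
    "Zone - D": [2267260, 1994341, 1241600, 2252681, 1326062],
})
zone_cols = ["Zone - A", "Zone - B", "Zone - C", "Zone - D"]

# Mimic section 1: append the zone means as an extra row
with_summary = sample.copy()
with_summary.loc["Group Means"] = sample[zone_cols].mean()

# ANOVA with the summary row still included (each mean acts as an observation)
f_with, p_with = f_oneway(*(with_summary[c].astype(float) for c in zone_cols))

# Drop the summary row so that only real monthly observations are tested
monthly = with_summary.drop(index="Group Means")
f_clean, p_clean = f_oneway(*(monthly[c].astype(float) for c in zone_cols))

print("With summary row:   ", f_with, p_with)
print("Without summary row:", f_clean, p_clean)
```

Computing the means into a separate variable instead of a new row avoids the issue altogether.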



Conclusion:

Here, the p-value (0.0007050314102121292) is less than the significance level (0.05). Hence we reject the null hypothesis and conclude that the mean sales are not the same across the 4 zones: at least one zone generates a different amount of sales on average.