Case Study on ANOVA

XYZ Company has offices in four different zones. The company wishes to
investigate the following:

● The mean sales generated by each zone.

● Total sales generated by all the zones for each month.

● Check whether all the zones generate the same amount of sales.

Help the company carry out this study using the data provided.

In [62]:

#importing necessary modules
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from google.colab import drive

Reading and preparing data

In [63]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [64]:
# Read the dataset into a DataFrame and preview the first five rows
df = pd.read_csv('/content/drive/My Drive/Sales_data_zone_wise.csv')
df.head()

,Month,Zone - A,Zone - B,Zone - C,Zone - D
0,Month - 1,1483525,1748451,1523308,2267260
1,Month - 2,1238428,1707421,2212113,1994341
2,Month - 3,1860771,2091194,1282374,1241600
3,Month - 4,1871571,1759617,2290580,2252681
4,Month - 5,1244922,1606010,1818334,1326062


In [65]:
df.columns #displaying features

Index(['Month', 'Zone - A', 'Zone - B', 'Zone - C', 'Zone - D'], dtype='object')

In [66]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Month     29 non-null     object
 1   Zone - A  29 non-null     int64 
 2   Zone - B  29 non-null     int64 
 3   Zone - C  29 non-null     int64 
 4   Zone - D  29 non-null     int64 
dtypes: int64(4), object(1)
memory usage: 1.3+ KB


In [67]:
df.isna().sum()  #checking null values

Month       0
Zone - A    0
Zone - B    0
Zone - C    0
Zone - D    0
dtype: int64

In [68]:
df.describe()  # extracting statistical data

,Zone - A,Zone - B,Zone - C,Zone - D
count,29.0,29.0,29.0,29.0
mean,1540493.0,1755560.0,1772871.0,1842927.0
std,261940.1,168389.9,333193.7,375016.5
min,1128185.0,1527574.0,1237722.0,1234311.0
25%,1305972.0,1606010.0,1523308.0,1520406.0
50%,1534390.0,1740365.0,1767047.0,1854412.0
75%,1820196.0,1875658.0,2098463.0,2180416.0
max,2004480.0,2091194.0,2290580.0,2364132.0


Findings:
• The dataset contains the zone-wise monthly sales of a company.

• There are four zones (A, B, C and D), and 29 months of data are given.

• There are no null values in the data.

• There are 29 observations and five columns: one of object dtype and four of int64 dtype.

1) The mean sales generated by each zone.

In [69]:
# Mean sales per zone; numeric_only=True skips the non-numeric 'Month' column
Mean_sales = df.mean(axis=0, numeric_only=True)
Mean_sales.round()

Zone - A    1540493.0
Zone - B    1755560.0
Zone - C    1772871.0
Zone - D    1842927.0
dtype: float64

Findings:

The mean (average) sales generated by each zone are shown above.

• Zone - D has the highest average sales.

• Zone - B and Zone - C have almost the same average sales.

• Zone - A has the lowest average sales.
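
The ranking described in these findings can also be read off programmatically. The following is a minimal sketch, not the notebook's own cell: it assumes only the five rows printed by df.head() earlier, so its means differ from the 29-month figures above, and the names `sample` and `zone_means` are illustrative.

```python
import pandas as pd

# Illustrative data: only the five rows shown by df.head(), not the full dataset.
sample = pd.DataFrame({
    "Month": ["Month - 1", "Month - 2", "Month - 3", "Month - 4", "Month - 5"],
    "Zone - A": [1483525, 1238428, 1860771, 1871571, 1244922],
    "Zone - B": [1748451, 1707421, 2091194, 1759617, 1606010],
    "Zone - C": [1523308, 2212113, 1282374, 2290580, 1818334],
    "Zone - D": [2267260, 1994341, 1241600, 2252681, 1326062],
})

# numeric_only=True leaves the text column 'Month' out of the calculation.
zone_means = sample.mean(numeric_only=True)

# Sort from highest to lowest mean so the ranking is explicit.
print(zone_means.sort_values(ascending=False).round())

# idxmax / idxmin return the column label with the largest / smallest mean.
print("Highest mean sales:", zone_means.idxmax())
print("Lowest mean sales:", zone_means.idxmin())
```

On the full DataFrame the same calls would be applied to df instead of sample.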

2) Total sales generated by all the zones for each month.

In [70]:
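# Sum only the four zone columns so 'Month' is excluded and the cell is safe to re-run
zone_cols = ['Zone - A', 'Zone - B', 'Zone - C', 'Zone - D']
df['Total Sales'] = df[zone_cols].sum(axis=1)
df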

,Month,Zone - A,Zone - B,Zone - C,Zone - D,Total Sales
0,Month - 1,1483525,1748451,1523308,2267260,7022544
1,Month - 2,1238428,1707421,2212113,1994341,7152303
2,Month - 3,1860771,2091194,1282374,1241600,6475939
3,Month - 4,1871571,1759617,2290580,2252681,8174449
4,Month - 5,1244922,1606010,1818334,1326062,5995328
5,Month - 6,1534390,1573128,1751825,2292044,7151387
6,Month - 7,1820196,1992031,1786826,1688055,7287108
7,Month - 8,1625696,1665534,2161754,2363315,7816299
8,Month - 9,1652644,1873402,1755290,1422059,6703395
9,Month - 10,1852450,1913059,1754314,1608387,7128210


In [73]:

#displaying minimum value of sales. 
df['Total Sales'].min()

5925424

In [74]:
df['Total Sales'].max()

8174449

Findings:

• Maximum total sales were generated in the 4th month (8174449).

• Minimum total sales were generated in the 13th month (5925424).
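
min() and max() return only the values. One way to recover the month that goes with each value is idxmax / idxmin. The sketch below is an illustration under the assumption that only the five rows printed by df.head() are available, so the minimum it finds is not the 13th-month value reported above. The names `sample`, `best` and `worst` are hypothetical.

```python
import pandas as pd

# Illustrative data: only the five rows shown by df.head(), not the full dataset.
sample = pd.DataFrame({
    "Month": ["Month - 1", "Month - 2", "Month - 3", "Month - 4", "Month - 5"],
    "Zone - A": [1483525, 1238428, 1860771, 1871571, 1244922],
    "Zone - B": [1748451, 1707421, 2091194, 1759617, 1606010],
    "Zone - C": [1523308, 2212113, 1282374, 2290580, 1818334],
    "Zone - D": [2267260, 1994341, 1241600, 2252681, 1326062],
})

# Row-wise total over the zone columns only.
zone_cols = [c for c in sample.columns if c.startswith("Zone")]
sample["Total Sales"] = sample[zone_cols].sum(axis=1)

# idxmax / idxmin give the row label of the extreme value; .loc fetches that row.
best = sample.loc[sample["Total Sales"].idxmax()]
worst = sample.loc[sample["Total Sales"].idxmin()]

print("Highest total:", best["Month"], best["Total Sales"])
print("Lowest total:", worst["Month"], worst["Total Sales"])
```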

3) Check whether all the zones generate the same amount of sales.


Here we use a one-way ANOVA test.

Stating the hypotheses:

Null hypothesis H0: The mean sales of all zones are equal.

Alternate hypothesis Ha: The mean sales of at least one zone differ from the others.

Significance level: 0.05

In [75]:
import scipy
from scipy import stats
from scipy.stats import f_oneway

In [76]:
# One-way ANOVA across the four zones
F, p = scipy.stats.f_oneway(df['Zone - A'], df['Zone - B'], df['Zone - C'], df['Zone - D'])
print(['F-statistic:', F])
print(['p-value:', p])
if p > 0.05:
    print('Fail to reject Null Hypothesis')
else:
    print('Reject Null Hypothesis')

['F-statistic:', 5.672056106843581]
['p-value:', 0.0011827601694503335]
Reject Null Hypothesis


Findings:

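• F-statistic = 5.6721 and p-value = 0.00118 < 0.05.

Hence we reject the null hypothesis in favour of the alternate hypothesis: the mean sales are not the same across all zones, i.e. at least one zone's mean differs from the others. The test alone does not say which zones differ.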
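
To show where the F-statistic comes from, the sketch below computes it by hand as the ratio of between-group to within-group mean squares and compares the result with scipy.stats.f_oneway. It is an illustration only: it assumes the five rows printed by df.head(), so its F and p values are not the ones reported above, and all variable names are hypothetical.

```python
import numpy as np
from scipy import stats

# Illustrative data: the five rows shown by df.head(), one array per zone.
groups = [
    np.array([1483525, 1238428, 1860771, 1871571, 1244922], dtype=float),  # Zone - A
    np.array([1748451, 1707421, 2091194, 1759617, 1606010], dtype=float),  # Zone - B
    np.array([1523308, 2212113, 1282374, 2290580, 1818334], dtype=float),  # Zone - C
    np.array([2267260, 1994341, 1241600, 2252681, 1326062], dtype=float),  # Zone - D
]

k = len(groups)                         # number of groups (zones)
n_total = sum(len(g) for g in groups)   # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: spread of the group means around the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = k - 1
df_within = n_total - k

# F is the ratio of the two mean squares.
f_manual = (ss_between / df_between) / (ss_within / df_within)
# p-value: upper tail of the F distribution with (df_between, df_within) degrees of freedom.
p_manual = stats.f.sf(f_manual, df_between, df_within)

f_scipy, p_scipy = stats.f_oneway(*groups)
print("manual:", f_manual, p_manual)
print("scipy: ", f_scipy, p_scipy)
```

For the full dataset (4 zones, 29 months each) the degrees of freedom would be 4 - 1 = 3 and 116 - 4 = 112.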