# Case Study on ANOVA

XYZ Company has offices in four different zones. The company wishes to investigate the following:

● The mean sales generated by each zone.

● Total sales generated by all the zones for each month.

● Check whether all the zones generate the same amount of sales.

Help the company carry out this study using the data provided.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

In [2]:
df=pd.read_csv('Sales_data_zone_wise.csv')

In [3]:
df.head()

Unnamed: 0,Month,Zone - A,Zone - B,Zone - C,Zone - D
0,Month - 1,1483525,1748451,1523308,2267260
1,Month - 2,1238428,1707421,2212113,1994341
2,Month - 3,1860771,2091194,1282374,1241600
3,Month - 4,1871571,1759617,2290580,2252681
4,Month - 5,1244922,1606010,1818334,1326062


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Month     29 non-null     object
 1   Zone - A  29 non-null     int64 
 2   Zone - B  29 non-null     int64 
 3   Zone - C  29 non-null     int64 
 4   Zone - D  29 non-null     int64 
dtypes: int64(4), object(1)
memory usage: 1.3+ KB


Insight:

1. The dataset has 29 rows (one per month) and 5 columns.

2. No column contains null values.

3. Of the 5 columns, 4 are of integer type (the zone sales) and 1 is of object type (Month).
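The same three facts can also be checked directly rather than read off the `df.info()` output. The following is a minimal sketch, assuming only the five rows shown by `df.head()`; the name `sample` is introduced here for illustration, and the full file has 29 rows.

```python
import pandas as pd

# First five rows from the df.head() output; the full file has 29 rows.
sample = pd.DataFrame({
    'Month': ['Month - 1', 'Month - 2', 'Month - 3', 'Month - 4', 'Month - 5'],
    'Zone - A': [1483525, 1238428, 1860771, 1871571, 1244922],
    'Zone - B': [1748451, 1707421, 2091194, 1759617, 1606010],
    'Zone - C': [1523308, 2212113, 1282374, 2290580, 1818334],
    'Zone - D': [2267260, 1994341, 1241600, 2252681, 1326062],
})

# Shape: (rows, columns)
print(sample.shape)

# Number of null values per column
null_counts = sample.isna().sum()
print(null_counts)

# Number of numeric columns
numeric_cols = sample.select_dtypes(include='number').columns
print(len(numeric_cols))
```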

# The mean sales generated by each zone.

In [5]:
df.mean(numeric_only=True)  # skip the non-numeric Month column

Zone - A    1.540493e+06
Zone - B    1.755560e+06
Zone - C    1.772871e+06
Zone - D    1.842927e+06
dtype: float64

The output above is in scientific notation. To make it easier to read, we set pandas to display floats in fixed-point notation.

In [6]:
pd.set_option('display.float_format', lambda x: '%.5f' % x)

In [7]:
df.mean(numeric_only=True)  # skip the non-numeric Month column

Zone - A   1540493.13793
Zone - B   1755559.58621
Zone - C   1772871.03448
Zone - D   1842926.75862
dtype: float64
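`pd.set_option` changes the display format for the rest of the session. If that is not wanted, a scoped alternative is possible. The sketch below is one way to do it, assuming the four means printed above are typed in by hand; the names `means` and `as_int` are illustrative only.

```python
import pandas as pd

# Zone means copied from the output above
means = pd.Series({
    'Zone - A': 1540493.13793,
    'Zone - B': 1755559.58621,
    'Zone - C': 1772871.03448,
    'Zone - D': 1842926.75862,
})

# Apply the float format only inside this block
with pd.option_context('display.float_format', '{:,.2f}'.format):
    print(means)

# Or round to whole numbers and convert to integers
as_int = means.round().astype('int64')
print(as_int)
```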

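Insight:
Zone - D has the highest mean monthly sales (about 1.84 million) and Zone - A the lowest (about 1.54 million). The mean alone does not tell us how spread out the data are; spread is measured by the standard deviation.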

In [8]:
df.describe()

Unnamed: 0,Zone - A,Zone - B,Zone - C,Zone - D
count,29.0,29.0,29.0,29.0
mean,1540493.13793,1755559.58621,1772871.03448,1842926.75862
std,261940.06187,168389.8859,333193.72453,375016.47949
min,1128185.0,1527574.0,1237722.0,1234311.0
25%,1305972.0,1606010.0,1523308.0,1520406.0
50%,1534390.0,1740365.0,1767047.0,1854412.0
75%,1820196.0,1875658.0,2098463.0,2180416.0
max,2004480.0,2091194.0,2290580.0,2364132.0
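Since the zones have different means, it can help to look at the standard deviation relative to the mean. The sketch below computes the coefficient of variation (std / mean) from the `mean` and `std` rows of the table above. This measure is an addition for illustration and not part of the original analysis; the name `summary` is hypothetical.

```python
import pandas as pd

# mean and std rows copied from the df.describe() output above
summary = pd.DataFrame(
    {
        'mean': [1540493.13793, 1755559.58621, 1772871.03448, 1842926.75862],
        'std': [261940.06187, 168389.88590, 333193.72453, 375016.47949],
    },
    index=['Zone - A', 'Zone - B', 'Zone - C', 'Zone - D'],
)

# Coefficient of variation: spread relative to the size of the mean
cv = summary['std'] / summary['mean']
print(cv.sort_values())
```

A larger value means monthly sales vary more relative to that zone's average.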


# Total sales generated by all the zones for each month.

In [9]:
df['Sales Total']= df['Zone - A'] + df['Zone - B'] + df['Zone - C'] + df['Zone - D']
df[['Month', 'Sales Total']]

Unnamed: 0,Month,Sales Total
0,Month - 1,7022544
1,Month - 2,7152303
2,Month - 3,6475939
3,Month - 4,8174449
4,Month - 5,5995328
5,Month - 6,7151387
6,Month - 7,7287108
7,Month - 8,7816299
8,Month - 9,6703395
9,Month - 10,7128210


Insight:
The data cover 29 months. For each month, the table above gives the total sales across all four zones; the first 10 months are shown here.
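The monthly total can also be computed without naming each column, which helps if more zones are added later. This is a minimal sketch, assuming the zone columns all contain the text `Zone` and using only the five rows shown by `df.head()`; `sample` and `zone_cols` are illustrative names.

```python
import pandas as pd

# First five rows from the df.head() output; the full file has 29 rows.
sample = pd.DataFrame({
    'Month': ['Month - 1', 'Month - 2', 'Month - 3', 'Month - 4', 'Month - 5'],
    'Zone - A': [1483525, 1238428, 1860771, 1871571, 1244922],
    'Zone - B': [1748451, 1707421, 2091194, 1759617, 1606010],
    'Zone - C': [1523308, 2212113, 1282374, 2290580, 1818334],
    'Zone - D': [2267260, 1994341, 1241600, 2252681, 1326062],
})

# Select every column whose name contains 'Zone', then sum across each row
zone_cols = sample.filter(like='Zone')
sample['Sales Total'] = zone_cols.sum(axis=1)
print(sample[['Month', 'Sales Total']])
```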

# Check whether all the zones generate the same amount of sales.

There is only one factor (independent variable), the zone, so we use one-way ANOVA.

Null Hypothesis: The mean sales are the same for all zones.

Alternate Hypothesis: The mean sales differ for at least one zone.
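To see what one-way ANOVA measures, the F statistic can be computed by hand as the ratio of between-zone variance to within-zone variance. The sketch below does this and compares the result with `stats.f_oneway`. It assumes only the five months shown by `df.head()`, so its numbers will not match a test run on all 29 months; the variable names are illustrative.

```python
import numpy as np
import scipy.stats as stats

# First five months from the df.head() output (not the full 29 months)
zones = {
    'Zone - A': [1483525, 1238428, 1860771, 1871571, 1244922],
    'Zone - B': [1748451, 1707421, 2091194, 1759617, 1606010],
    'Zone - C': [1523308, 2212113, 1282374, 2290580, 1818334],
    'Zone - D': [2267260, 1994341, 1241600, 2252681, 1326062],
}
samples = [np.asarray(v, dtype=float) for v in zones.values()]

k = len(samples)                        # number of groups (zones)
n_total = sum(len(s) for s in samples)  # total number of observations
grand_mean = np.concatenate(samples).mean()

# Between-group and within-group sums of squares
ss_between = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)
ss_within = sum(((s - s.mean()) ** 2).sum() for s in samples)

# Mean squares: each sum of squares divided by its degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

# Reference result from SciPy on the same five-month sample
f_scipy, p_scipy = stats.f_oneway(*samples)
print(f_manual, f_scipy)
print(p_manual, p_scipy)
```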

In [10]:
fvalue, pvalue = stats.f_oneway(df['Zone - A'], df['Zone - B'], df['Zone - C'], df['Zone - D'])
print('The F-value is: ',fvalue,'\n' 'The P-value is: ',pvalue,'\n')
# Compare the p-value with the 5% significance level
if pvalue > 0.05:
    print('Fail to reject the null hypothesis: no significant difference in mean sales between zones')
else:
    print('Reject the null hypothesis: the mean sales differ across zones')

The F-value is:  5.672056106843581 
The P-value is:  0.0011827601694503335 

Reject the null hypothesis: the mean sales differ across zones


Insight:

The p-value from the ANOVA is significant (p ≈ 0.0012 < 0.05), so we reject the null hypothesis. This means the mean sales are not the same across all zones; at least one zone differs from the others.


Note on F value: For fixed degrees of freedom, the p-value decreases as the F value increases. An F value greater than the critical F value at the chosen significance level corresponds to a p-value below that level.
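The relationship in the note can be checked numerically. The sketch below assumes 4 zones with 29 months each, which gives 3 and 112 degrees of freedom, and a 5% significance level. It takes the F value reported above and computes the critical F value and the p-value from the F distribution.

```python
import scipy.stats as stats

f_value = 5.672056106843581  # F statistic reported by f_oneway above

k = 4             # number of zones
n_per_zone = 29   # months per zone
dfn = k - 1                 # between-group degrees of freedom
dfd = k * n_per_zone - k    # within-group degrees of freedom
alpha = 0.05

# Critical value: the F value that leaves alpha in the upper tail
f_crit = stats.f.ppf(1 - alpha, dfn, dfd)

# p-value: upper-tail probability of the observed F value
p_value = stats.f.sf(f_value, dfn, dfd)

print('Critical F:', f_crit)
print('p-value:', p_value)
print('F > critical F:', f_value > f_crit)
print('p < alpha:', p_value < alpha)
```

The last two comparisons always agree, because they are two ways of stating the same decision rule.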