# Case Study on Testing of Hypothesis

Questions:
    1. The company wishes to clarify whether there is any increase in sales after stepping into digital marketing.
    2. The company needs to check whether there is any dependency between the features “Region” and “Manager”.

In [1]:
import scipy.stats as st
import math
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv('Sales_add.csv')

In [3]:
data.head()

,Month,Region,Manager,Sales_before_digital_add(in $),Sales_After_digital_add(in $)
0,Month-1,Region - A,Manager - A,132921,270390
1,Month-2,Region - A,Manager - C,149559,223334
2,Month-3,Region - B,Manager - A,146278,244243
3,Month-4,Region - B,Manager - B,152167,231808
4,Month-5,Region - C,Manager - B,159525,258402


In [4]:
data.columns

Index(['Month', 'Region', 'Manager', 'Sales_before_digital_add(in $)',
       'Sales_After_digital_add(in $)'],
      dtype='object')

In [5]:
data.shape

(22, 5)

In [6]:
data.dtypes

Month                             object
Region                            object
Manager                           object
Sales_before_digital_add(in $)     int64
Sales_After_digital_add(in $)      int64
dtype: object

In [7]:
#summary statistics
data.describe()

,Sales_before_digital_add(in $),Sales_After_digital_add(in $)
count,22.0,22.0
mean,149239.954545,231123.727273
std,14844.042921,25556.777061
min,130263.0,187305.0
25%,138087.75,214960.75
50%,147444.0,229986.5
75%,157627.5,250909.0
max,178939.0,276279.0


In [8]:
data.isna().any()

Month                             False
Region                            False
Manager                           False
Sales_before_digital_add(in $)    False
Sales_After_digital_add(in $)     False
dtype: bool

1. To test whether there is an increase in mean sales after digital marketing

Null Hypothesis H0: There is no difference in mean sales before and after digital marketing

Alternative Hypothesis H1: Mean sales after digital marketing are greater than mean sales before

We fix the significance level at alpha = 0.05.

This test can be done with the scipy.stats library. scipy.stats.ttest_rel() calculates the t-test on two related samples of scores. The same months are measured before and after digital marketing, so the two samples are paired.

In [9]:
pre = data['Sales_before_digital_add(in $)']  # sales before digital marketing (pandas Series)
pos = data['Sales_After_digital_add(in $)']   # sales after digital marketing (pandas Series)

In [10]:
# alternative='less' tests H1: mean(pre - pos) < 0, i.e. sales after are higher
st.ttest_rel(pre, pos, alternative='less')

Ttest_relResult(statistic=-12.09070525287017, pvalue=3.168333502287889e-11)

The p-value (about 3.2e-11) is far below the significance level of 0.05. Hence, we reject the null hypothesis.

We conclude that mean sales increased after stepping into digital marketing.
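
The paired t-test used above can also be computed by hand from the per-month differences, which makes the role of alternative='less' easier to see. The sketch below is only an illustration: it uses the five rows shown by data.head() rather than the full 22-month dataset, so its numbers differ from the notebook result, and the helper name paired_t_less is hypothetical.

```python
import math

import numpy as np
import scipy.stats as st

# Illustrative data only: the five rows shown by data.head() above,
# not the full 22-month dataset, so the result differs from In [10].
pre = np.array([132921, 149559, 146278, 152167, 159525], dtype=float)
pos = np.array([270390, 223334, 244243, 231808, 258402], dtype=float)


def paired_t_less(before, after):
    """One-sided paired t-test of H1: mean(before - after) < 0 (hypothetical helper)."""
    d = before - after                                   # per-month differences
    n = d.size
    t_stat = d.mean() / (d.std(ddof=1) / math.sqrt(n))   # mean difference over its standard error
    p_value = st.t.cdf(t_stat, df=n - 1)                 # lower-tail probability
    return t_stat, p_value


t_manual, p_manual = paired_t_less(pre, pos)
res = st.ttest_rel(pre, pos, alternative='less')

print(t_manual, p_manual)
print(res.statistic, res.pvalue)
```

Both print statements should agree up to floating-point error, since ttest_rel performs the same calculation on the differences pre - pos.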

2. Dependency between 'Region' and 'Manager'

Null Hypothesis H0: Region and Manager are independent attributes
    
Alternative Hypothesis H1: Region and Manager are dependent on each other

We fix the significance level at alpha = 0.05.

In [11]:
data['Region'].value_counts()

Region - A    10
Region - B     7
Region - C     5
Name: Region, dtype: int64

In [12]:
data['Manager'].value_counts()

Manager - A    9
Manager - B    7
Manager - C    6
Name: Manager, dtype: int64

In [13]:
#contingency table for Region and Manager
ct = pd.crosstab(data.Region, data.Manager, margins=True) 
ct

Region,Manager - A,Manager - B,Manager - C,All
Region - A,4,3,3,10
Region - B,4,1,2,7
Region - C,1,3,1,5
All,9,7,6,22


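This test can be done with the SciPy library. scipy.stats.chi2_contingency() performs the chi-square test of independence on a contingency table and returns the test statistic, the p-value, the degrees of freedom and the expected frequencies.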

In [14]:
# observed counts only: drop the 'All' row and column added by margins=True
obs_arr = ct.iloc[:3, :3].to_numpy()
st.chi2_contingency(obs_arr)

(3.050566893424036,
 0.5493991051158094,
 4,
 array([[4.09090909, 3.18181818, 2.72727273],
        [2.86363636, 2.22727273, 1.90909091],
        [2.04545455, 1.59090909, 1.36363636]]))
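
The four values returned above (statistic, p-value, degrees of freedom, expected frequencies) can be reproduced from the row and column totals. The following is a minimal sketch, assuming the observed counts are typed in by hand from the crosstab; variable names such as observed and small_cells are illustrative and not part of the original notebook.

```python
import numpy as np
import scipy.stats as st

# Observed counts copied from the crosstab above, without the 'All' margins.
observed = np.array([[4, 3, 3],
                     [4, 1, 2],
                     [1, 3, 1]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
grand_total = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = row_totals @ col_totals / grand_total

chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = st.chi2.sf(chi2_stat, dof)               # upper-tail probability

# Number of cells with an expected count below 5 (a common rule-of-thumb check)
small_cells = int((expected < 5).sum())

print(chi2_stat, p_value, dof, small_cells)
```

With only 22 observations spread over nine cells, every expected count falls below 5, which is worth keeping in mind when reading the p-value.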

In [15]:
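# The p-value (about 0.55) is greater than 0.05, hence we fail to reject the null hypothesis.
# The data give no evidence of a dependency between Region and Manager.
# Note: every expected count in the output above is below 5, so the chi-square
# approximation should be read with caution.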