# Case Study on Testing of Hypothesis

A company has started investing in digital marketing as a new way to promote
its products. It collected sales data and decided to carry out a study on it.

● The company wishes to clarify whether there is any increase in sales after
stepping into digital marketing.

● The company needs to check whether there is any dependency between the
features “Region” and “Manager”.

In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import scipy.stats as stats
from scipy.stats import ttest_rel

In [2]:
#Importing the required dataset
data = pd.read_csv("F:\\pythonprogramming\\Sales_add.csv")
data.head()

Unnamed: 0,Month,Region,Manager,Sales_before_digital_add(in $),Sales_After_digital_add(in $)
0,Month-1,Region - A,Manager - A,132921,270390
1,Month-2,Region - A,Manager - C,149559,223334
2,Month-3,Region - B,Manager - A,146278,244243
3,Month-4,Region - B,Manager - B,152167,231808
4,Month-5,Region - C,Manager - B,159525,258402


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 5 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   Month                           22 non-null     object
 1   Region                          22 non-null     object
 2   Manager                         22 non-null     object
 3   Sales_before_digital_add(in $)  22 non-null     int64 
 4   Sales_After_digital_add(in $)   22 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 1008.0+ bytes


In [4]:
#Printing the dataset
data

Unnamed: 0,Month,Region,Manager,Sales_before_digital_add(in $),Sales_After_digital_add(in $)
0,Month-1,Region - A,Manager - A,132921,270390
1,Month-2,Region - A,Manager - C,149559,223334
2,Month-3,Region - B,Manager - A,146278,244243
3,Month-4,Region - B,Manager - B,152167,231808
4,Month-5,Region - C,Manager - B,159525,258402
5,Month-6,Region - A,Manager - B,137163,256948
6,Month-7,Region - C,Manager - C,130625,222106
7,Month-8,Region - A,Manager - A,131140,230637
8,Month-9,Region - B,Manager - C,171259,226261
9,Month-10,Region - C,Manager - B,141956,193735


In [5]:
# Count the missing values in each column
data.isnull().sum()

Month                             0
Region                            0
Manager                           0
Sales_before_digital_add(in $)    0
Sales_After_digital_add(in $)     0
dtype: int64

In [6]:
data.describe()

Unnamed: 0,Sales_before_digital_add(in $),Sales_After_digital_add(in $)
count,22.0,22.0
mean,149239.954545,231123.727273
std,14844.042921,25556.777061
min,130263.0,187305.0
25%,138087.75,214960.75
50%,147444.0,229986.5
75%,157627.5,250909.0
max,178939.0,276279.0


# 1. The company wishes to clarify whether there is any increase in sales after stepping into digital marketing.

In [7]:
bef_digi = data["Sales_before_digital_add(in $)"]
aft_digi = data["Sales_After_digital_add(in $)"]

In [8]:
bef_digi.describe()

count        22.000000
mean     149239.954545
std       14844.042921
min      130263.000000
25%      138087.750000
50%      147444.000000
75%      157627.500000
max      178939.000000
Name: Sales_before_digital_add(in $), dtype: float64

In [9]:
aft_digi.describe()

count        22.000000
mean     231123.727273
std       25556.777061
min      187305.000000
25%      214960.750000
50%      229986.500000
75%      250909.000000
max      276279.000000
Name: Sales_After_digital_add(in $), dtype: float64

In [10]:
print(np.var(bef_digi), np.var(aft_digi))

210329900.67975205 623460269.4710745
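Note that these figures are population variances: `np.var` divides by n by default (`ddof=0`), whereas the `std` reported by `describe()` is the sample standard deviation (`ddof=1`). A minimal sketch of how the two relate, using only the summary figures printed above (the variable names are illustrative, not part of the notebook):

```python
# Summary figures copied from the describe() output above
n = 22
sample_std_before = 14844.042921  # describe() reports the ddof=1 standard deviation

# Sample variance (divides by n - 1)
sample_var_before = sample_std_before ** 2

# Population variance (divides by n), which is what np.var returns by default
population_var_before = sample_var_before * (n - 1) / n

print(round(population_var_before))
```

The result agrees with the first number printed by `np.var` up to the rounding of the standard deviation.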


Setting Hypothesis

Null Hypothesis: H0: There is no increase in sales after stepping into digital marketing (mean sales after <= mean sales before)

Alternative Hypothesis: Ha: There is an increase in sales after stepping into digital marketing (mean sales after > mean sales before)

In [33]:
# Paired t-test
# Compares the means of two related samples (each row is measured before and after)
ttest_pair = ttest_rel(a=bef_digi, b=aft_digi)
ttest_pair

Ttest_relResult(statistic=-12.09070525287017, pvalue=6.336667004575778e-11)

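ttest_rel runs a two-sided test by default, while Ha is one-sided. The test statistic is negative (mean sales before are lower than mean sales after), so the one-sided p-value is half the two-sided value, about 3.17e-11. Because this is far below alpha = 0.05, we reject the null hypothesis and conclude that sales increased after stepping into digital marketing.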
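The mechanics of the paired test can be sketched on the ten rows printed earlier. This is only an illustration on a subset of the 22 rows, so its numbers will not match the result above. It assumes SciPy 1.6 or later, where `ttest_rel` accepts the `alternative` argument, and the names `before`, `after`, and `diffs` are illustrative.

```python
import math
import statistics
from scipy.stats import ttest_rel

# First ten rows of the dataset as printed above (a subset, for illustration only)
before = [132921, 149559, 146278, 152167, 159525,
          137163, 130625, 131140, 171259, 141956]
after = [270390, 223334, 244243, 231808, 258402,
         256948, 222106, 230637, 226261, 193735]

# A paired t-test is a one-sample t-test on the per-row differences
diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)
t_manual = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

two_sided = ttest_rel(before, after)
# Ha: mean(before) < mean(after), i.e. sales increased
one_sided = ttest_rel(before, after, alternative="less")

print("t (manual):", t_manual)
print("t (scipy): ", two_sided.statistic)
print("two-sided p:", two_sided.pvalue)
print("one-sided p:", one_sided.pvalue)
```

Passing `alternative="less"` states the direction of Ha in the call itself, which avoids halving the two-sided p-value by hand.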

# 2. The company needs to check whether there is any dependency between the features “Region” and “Manager”.


Setting Hypothesis

Null Hypothesis: H0: There is no dependency between Region and Manager

Alternative Hypothesis: Ha : There is dependency between Region and Manager

In [12]:
# Chi-square test of independence
# Determines whether there is a significant association between two categorical variables

In [13]:
# Build the contingency table of Region against Manager
data_table = pd.crosstab(data['Region'], data['Manager'])
print(data_table)

Manager     Manager - A  Manager - B  Manager - C
Region                                           
Region - A            4            3            3
Region - B            4            1            2
Region - C            1            3            1


In [14]:
#Observed Values
Obs_values = data_table.values
print("Observed Values :-\n",Obs_values)


Observed Values :-
 [[4 3 3]
 [4 1 2]
 [1 3 1]]


In [15]:
# chi2_contingency returns the test statistic, the p-value,
# the degrees of freedom, and the expected counts
val = stats.chi2_contingency(data_table)
val

(3.050566893424036,
 0.5493991051158094,
 4,
 array([[4.09090909, 3.18181818, 2.72727273],
        [2.86363636, 2.22727273, 1.90909091],
        [2.04545455, 1.59090909, 1.36363636]]))

In [16]:
Expect_values = val[3]
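The expected counts are the cell counts that would arise if Region and Manager were independent: row total times column total, divided by the grand total. A minimal sketch in plain Python, using the observed counts printed above (the variable names are illustrative):

```python
# Observed counts from the Region x Manager crosstab above
observed = [[4, 3, 3],
            [4, 1, 2],
            [1, 3, 1]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for row in expected:
    print([round(v, 4) for v in row])
```

These values should agree with the array returned by `chi2_contingency`.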

In [19]:
# Degrees of freedom for a contingency table: (rows - 1) * (columns - 1)
row_nos, column_nos = data_table.shape
degree_of_free = (row_nos - 1) * (column_nos - 1)
print("Degree of freedom:-", degree_of_free)
alpha_value = 0.05

Degree of freedom:- 4


In [20]:
#calculating chi-square statistics
from scipy.stats import chi2
# Sum (observed - expected)^2 / expected over every cell of the table
chi_square_statistics = ((Obs_values - Expect_values) ** 2 / Expect_values).sum()

In [32]:
print("chi-square statistics-: ", round(chi_square_statistics, 4))

chi-square statistics-:  3.0506


In [28]:
#p-value
# chi2.sf is the survival function, i.e. 1 - cdf
p_value = chi2.sf(chi_square_statistics, df=degree_of_free)
print("p-value-:", round(p_value, 4))
print("Significance level-:",alpha_value)
print("Degree of freedom-:",degree_of_free)

p-value-: 0.5494
Significance level-: 0.05
Degree of freedom-: 4
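As a cross-check on the p-value, the chi-square distribution with 4 degrees of freedom has a simple closed-form tail probability, P(X > x) = exp(-x/2) * (1 + x/2). The sketch below applies it to the statistic reported by `chi2_contingency` earlier; the variable names are illustrative and the shortcut holds only for 4 degrees of freedom.

```python
import math

# Statistic and p-value as reported by chi2_contingency above
chi2_stat = 3.050566893424036
reported_p = 0.5493991051158094

# Tail probability of a chi-square variable with 4 degrees of freedom
p_closed_form = math.exp(-chi2_stat / 2) * (1 + chi2_stat / 2)

print(p_closed_form)
print(abs(p_closed_form - reported_p) < 1e-9)
```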


In [31]:
if p_value <= alpha_value:
    print("Reject H0, There is a dependency between Region and Manager")
else:
    print("Fail to reject H0, no evidence of a dependency between Region and Manager")

Fail to reject H0, no evidence of a dependency between Region and Manager
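
One caveat on this result: the chi-square approximation is usually considered reliable only when the expected counts are not too small. A common rule of thumb asks for expected counts of at least 5 in most cells. A minimal sketch checking this against the expected values printed earlier (the threshold of 5 is the rule-of-thumb assumption, not something this notebook specifies):

```python
# Expected counts as reported by chi2_contingency above
expected = [[4.09090909, 3.18181818, 2.72727273],
            [2.86363636, 2.22727273, 1.90909091],
            [2.04545455, 1.59090909, 1.36363636]]

cells = [v for row in expected for v in row]

# Count the cells whose expected count falls below the rule-of-thumb threshold
small_cells = sum(v < 5 for v in cells)
share_small = small_cells / len(cells)

print(f"{small_cells} of {len(cells)} expected counts are below 5 ({share_small:.0%})")
print("minimum expected count:", min(cells))
```

With 22 observations spread over 9 cells, every expected count is below 5, so the p-value should be read as approximate.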
