# Case Study on Testing of Hypothesis


Importing necessary libraries

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
from scipy.stats import chi2_contingency

**Case Study on Testing of Hypothesis**
A company has started investing in digital marketing as a new way to promote its products. It collected sales data and decided to carry out a study on it.
● The company wishes to clarify whether there is any increase in sales after stepping into digital marketing.
● The company needs to check whether there is any dependency between the features “Region” and “Manager”.
Help the company carry out this study using the data provided.

Read the data

In [2]:
sales_data=pd.read_csv('/content/Sales_add (1).csv')

In [3]:
sales_data.head()

Unnamed: 0,Month,Region,Manager,Sales_before_digital_add(in $),Sales_After_digital_add(in $)
0,Month-1,Region - A,Manager - A,132921,270390
1,Month-2,Region - A,Manager - C,149559,223334
2,Month-3,Region - B,Manager - A,146278,244243
3,Month-4,Region - B,Manager - B,152167,231808
4,Month-5,Region - C,Manager - B,159525,258402


In [4]:
sales_data.tail()

Unnamed: 0,Month,Region,Manager,Sales_before_digital_add(in $),Sales_After_digital_add(in $)
17,Month-18,Region - C,Manager - B,167996,191517
18,Month-19,Region - B,Manager - A,132135,227040
19,Month-20,Region - A,Manager - B,152493,212579
20,Month-21,Region - B,Manager - A,147425,263388
21,Month-22,Region - A,Manager - C,130263,243020


In [6]:
sales_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 5 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   Month                           22 non-null     object
 1   Region                          22 non-null     object
 2   Manager                         22 non-null     object
 3   Sales_before_digital_add(in $)  22 non-null     int64 
 4   Sales_After_digital_add(in $)   22 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 1008.0+ bytes


In [7]:
sales_data.describe()

Unnamed: 0,Sales_before_digital_add(in $),Sales_After_digital_add(in $)
count,22.0,22.0
mean,149239.954545,231123.727273
std,14844.042921,25556.777061
min,130263.0,187305.0
25%,138087.75,214960.75
50%,147444.0,229986.5
75%,157627.5,250909.0
max,178939.0,276279.0


# Clarify whether there is any increase in sales after stepping into digital marketing
State the null and alternative hypotheses
**Null hypothesis, H0**: There is no increase in sales after stepping into digital marketing, i.e. digital marketing has no impact on sales.

**Alternative hypothesis, HA**: There is an increase in sales after stepping into digital marketing, i.e. digital marketing has an impact on sales.

**Identify the test statistic**
The sample is small (22 months) and the population standard deviation is unknown, so we use a t-test to compare mean sales before and after digital marketing.

In [5]:
# Two-sample, two-sided t-test; ttest_ind treats the two columns as independent samples
t_stat, p_value = ttest_ind(sales_data['Sales_before_digital_add(in $)'], sales_data['Sales_After_digital_add(in $)'])
p_value

2.614368006904645e-16

Set the level of significance to 5%, i.e. alpha = 0.05.

In [8]:
alpha=0.05

In [9]:
if p_value<alpha:
    print('Reject null hypothesis')
else:
    print('Fail to reject null hypothesis')

Reject null hypothesis


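We reject the null hypothesis: mean monthly sales differ significantly before and after digital marketing. Since the mean rose from about 149,240 to about 231,124 (see the describe() output), the data support an increase in sales after stepping into digital marketing.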
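Because each month is measured both before and after the campaign, and the alternative hypothesis is one-directional, a paired one-sided t-test is a natural cross-check on the result above. The sketch below is only illustrative: it uses the ten rows visible in the head() and tail() outputs rather than the full 22-row file, the variable names are made up for this example, and the `alternative` argument assumes SciPy 1.6 or later.

```python
import numpy as np
from scipy.stats import ttest_rel

# Only the ten rows shown in the head() and tail() outputs (illustrative subset,
# not the full dataset); values are paired month by month.
before = np.array([132921, 149559, 146278, 152167, 159525,
                   167996, 132135, 152493, 147425, 130263])
after = np.array([270390, 223334, 244243, 231808, 258402,
                  191517, 227040, 212579, 263388, 243020])

# Paired, one-sided test.
# H0: mean(after - before) <= 0, HA: mean(after - before) > 0.
t_stat_paired, p_value_paired = ttest_rel(after, before, alternative='greater')

mean_gain = (after - before).mean()
print(f"mean gain: {mean_gain:.1f}, t = {t_stat_paired:.3f}, p = {p_value_paired:.3g}")

alpha = 0.05
if p_value_paired < alpha:
    print('Reject null hypothesis')
else:
    print('Fail to reject null hypothesis')
```

On the full data, the same call would take the two sales columns of sales_data in place of the hard-coded arrays.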

# Checking whether there is any dependency between the features “Region” and “Manager”

**Null hypothesis, H0**: There is no dependency between the features 'Region' and 'Manager' or features 'Region' and 'Manager' are independent

**Alternative hypothesis, HA**: There is some kind of dependency between the features 'Region' and 'Manager' or features 'Region' and 'Manager' are not independent

Identify the test statistic
Both features are qualitative (categorical), so we use the chi-square test of independence.

# 1. Using count
We cross-tabulate the 'Region' and 'Manager' counts and then perform the chi-square test on that table.

In [10]:

pd.crosstab(sales_data.Region,sales_data.Manager)

Region,Manager - A,Manager - B,Manager - C
Region - A,4,3,3
Region - B,4,1,2
Region - C,1,3,1


In [11]:
chi_square_args = pd.crosstab(sales_data.Region,sales_data.Manager).values
chi_stat,p_value,dof,expected= chi2_contingency(chi_square_args)
p_value

0.5493991051158094

The level of significance remains 5%, i.e. alpha = 0.05.

In [12]:
if p_value<alpha:
    print('Reject null hypothesis')
else:
    print('Fail to reject null hypothesis')

Fail to reject null hypothesis


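The test fails to reject the null hypothesis (p ≈ 0.55 > 0.05). At the 5% significance level, the counts give no evidence of dependency between 'Region' and 'Manager'. This is not proof of independence: with only 22 observations, every expected cell count is below 5, so the chi-square approximation is rough.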
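The count-based test can be reproduced directly from the cross-tabulated counts shown above, which also makes it easy to inspect the expected frequencies that chi2_contingency returns. This is a minimal sketch with the table typed in by hand; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Region x Manager counts copied from the crosstab output
# (rows: Region A, B, C; columns: Manager A, B, C).
counts = np.array([[4, 3, 3],
                   [4, 1, 2],
                   [1, 3, 1]])

chi_stat_counts, p_value_counts, dof_counts, expected_counts = chi2_contingency(counts)

print(f"chi-square = {chi_stat_counts:.4f}, dof = {dof_counts}, p = {p_value_counts:.4f}")
print(np.round(expected_counts, 2))

# A common rule of thumb asks for expected counts of at least 5 per cell.
print("largest expected count:", round(expected_counts.max(), 2))
```

Since the largest expected count falls short of 5, the p-value from this table is best read as a rough indication rather than a precise probability.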

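# 2. Using sum of sales
We cross-tabulate the total sales for each 'Region' and 'Manager' combination and then perform the chi-square test on that table. Since the data is not in the required format, we first reshape it with the melt function. Note that the chi-square test is designed for counts, so a result computed on dollar totals depends on the unit of measurement and should be read with caution.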

In [17]:
sales_data_new=pd.melt(sales_data,id_vars=['Manager','Region'],value_vars=['Sales_before_digital_add(in $)','Sales_After_digital_add(in $)'],value_name='Sales')


In [18]:
sales_data_new.head()

Unnamed: 0,Manager,Region,variable,Sales
0,Manager - A,Region - A,Sales_before_digital_add(in $),132921
1,Manager - C,Region - A,Sales_before_digital_add(in $),149559
2,Manager - A,Region - B,Sales_before_digital_add(in $),146278
3,Manager - B,Region - B,Sales_before_digital_add(in $),152167
4,Manager - B,Region - C,Sales_before_digital_add(in $),159525


In [19]:
sales_data_new.tail()

Unnamed: 0,Manager,Region,variable,Sales
39,Manager - B,Region - C,Sales_After_digital_add(in $),191517
40,Manager - A,Region - B,Sales_After_digital_add(in $),227040
41,Manager - B,Region - A,Sales_After_digital_add(in $),212579
42,Manager - A,Region - B,Sales_After_digital_add(in $),263388
43,Manager - C,Region - A,Sales_After_digital_add(in $),243020
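To see what melt does to the frame, it can help to run it on a tiny example. The sketch below uses two rows from the head() output with shortened, made-up column names (`sales_before`, `sales_after`); it is meant only to show the wide-to-long reshaping, not to replace the cell above.

```python
import pandas as pd

# Two rows taken from the head() output; the short column names are
# illustrative stand-ins for the dataset's real column names.
wide = pd.DataFrame({
    'Manager': ['Manager - A', 'Manager - C'],
    'Region': ['Region - A', 'Region - A'],
    'sales_before': [132921, 149559],
    'sales_after': [270390, 223334],
})

# Each value column becomes a block of rows: the old column name goes into
# 'variable' and the number goes into 'Sales'.
long_df = pd.melt(wide,
                  id_vars=['Manager', 'Region'],
                  value_vars=['sales_before', 'sales_after'],
                  value_name='Sales')
print(long_df)
```

Two input rows with two value columns give four output rows, which is why the 22-row dataset becomes 44 rows after melting.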


In [20]:
df=pd.crosstab(sales_data_new['Region'],sales_data_new['Manager'],values=sales_data_new['Sales'],aggfunc='sum')
df

Region,Manager - A,Manager - B,Manager - C
Region - A,1624951,1123683,1121946
Region - B,1510751,383975,760034
Region - C,376799,1113131,352731


In [21]:
chi_stat,p_value,dof,expected=chi2_contingency(df)
p_value

0.0

In [22]:
if p_value<alpha:
    print('Reject null hypothesis')
else:
    print('Fail to reject null hypothesis')

Reject null hypothesis


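The test rejects the null hypothesis at the 5% significance level, so when we consider total sales, 'Region' and 'Manager' appear dependent. Because the table holds dollar amounts rather than counts, this p-value of 0.0 mainly reflects the size of the numbers and is weaker evidence than a test on counts.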
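One way to check how much this result depends on the unit of measurement is to rerun the test on the same table expressed in millions of dollars. The chi-square statistic scales linearly with the table, so the decision can change with the unit. The sketch below types in the sums from the crosstab output above; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Region x Manager sales totals copied from the crosstab output (in dollars).
sales_sums = np.array([[1624951, 1123683, 1121946],
                       [1510751, 383975, 760034],
                       [376799, 1113131, 352731]])

# Same table, once in dollars and once in millions of dollars.
stat_dollars, p_dollars, dof_dollars, _ = chi2_contingency(sales_sums)
stat_millions, p_millions, dof_millions, _ = chi2_contingency(sales_sums / 1_000_000)

print(f"dollars : chi-square = {stat_dollars:.1f}, p = {p_dollars:.3g}")
print(f"millions: chi-square = {stat_millions:.4f}, p = {p_millions:.3g}")
print(f"ratio of statistics: {stat_dollars / stat_millions:.1f}")
```

The proportions in the table are identical in both runs; only the scale differs. A test on counts does not have this ambiguity, because a count has no unit to rescale.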

# Conclusion

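The data support an increase in sales after stepping into digital marketing.

For 'Region' and 'Manager', the count-based chi-square test finds no evidence of dependency (p ≈ 0.55), while the test on total sales rejects independence (p = 0.0). The second result rests on dollar totals rather than counts, so it should be treated with caution.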