# Case study on testing of hypothesis
A company has started investing in digital marketing as a new way to promote
its products. To evaluate this, it collected sales data and decided to carry out a study on it.
* The company wishes to clarify whether there is any increase in sales after
stepping into digital marketing.
* The company needs to check whether there is any dependency between the
features “Region” and “Manager”.
Help the company carry out the study using the data provided.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data=pd.read_csv('Sales_add.csv')

In [3]:
data.head()

Unnamed: 0,Month,Region,Manager,Sales_before_digital_add(in $),Sales_After_digital_add(in $)
0,Month-1,Region - A,Manager - A,132921,270390
1,Month-2,Region - A,Manager - C,149559,223334
2,Month-3,Region - B,Manager - A,146278,244243
3,Month-4,Region - B,Manager - B,152167,231808
4,Month-5,Region - C,Manager - B,159525,258402


In [4]:
df=pd.DataFrame(data)

In [5]:
df

Unnamed: 0,Month,Region,Manager,Sales_before_digital_add(in $),Sales_After_digital_add(in $)
0,Month-1,Region - A,Manager - A,132921,270390
1,Month-2,Region - A,Manager - C,149559,223334
2,Month-3,Region - B,Manager - A,146278,244243
3,Month-4,Region - B,Manager - B,152167,231808
4,Month-5,Region - C,Manager - B,159525,258402
5,Month-6,Region - A,Manager - B,137163,256948
6,Month-7,Region - C,Manager - C,130625,222106
7,Month-8,Region - A,Manager - A,131140,230637
8,Month-9,Region - B,Manager - C,171259,226261
9,Month-10,Region - C,Manager - B,141956,193735


In [6]:
df.describe()

Unnamed: 0,Sales_before_digital_add(in $),Sales_After_digital_add(in $)
count,22.0,22.0
mean,149239.954545,231123.727273
std,14844.042921,25556.777061
min,130263.0,187305.0
25%,138087.75,214960.75
50%,147444.0,229986.5
75%,157627.5,250909.0
max,178939.0,276279.0


# 1. The company wishes to clarify whether there is any increase in sales after stepping into digital marketing.

# Hypothesis:
* H_0 : Mean sales before and after digital marketing are equal
* H_a : Mean sales after digital marketing are higher than before
* Significance level  : 0.05
* Test : one-tailed two-sample t-test
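
Written out, with $\mu$ denoting mean monthly sales, the hypotheses and the pooled two-sample statistic that `stats.ttest_ind` computes by default (equal variances assumed) take the form below. The symbols $\bar{x}$, $s$, $n$ and $s_p$ are notation introduced here for illustration, not taken from the notebook.

$$
H_0:\ \mu_{\text{after}} = \mu_{\text{before}}, \qquad H_a:\ \mu_{\text{after}} > \mu_{\text{before}}
$$

$$
t = \frac{\bar{x}_{\text{before}} - \bar{x}_{\text{after}}}{s_p \sqrt{\frac{1}{n_{\text{before}}} + \frac{1}{n_{\text{after}}}}}, \qquad
s_p^2 = \frac{(n_{\text{before}} - 1)\, s_{\text{before}}^2 + (n_{\text{after}} - 1)\, s_{\text{after}}^2}{n_{\text{before}} + n_{\text{after}} - 2}
$$

With the samples in this order (before, after), an increase in sales shows up as a negative $t$, and the statistic has $n_{\text{before}} + n_{\text{after}} - 2$ degrees of freedom.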

In [10]:
from scipy import stats

In [11]:
from statsmodels.stats.weightstats import ttest_ind

In [26]:
salesbefore=int(np.mean(df['Sales_before_digital_add(in $)']))
salesafter=int(np.mean(df['Sales_After_digital_add(in $)']))
print('Average sales before digital marketing is',salesbefore)
print('Average sales after digital marketing is',salesafter)

Average sales before digital marketing is 149239
Average sales after digital marketing is 231123


In [30]:
tscore,pval=stats.ttest_ind(df['Sales_before_digital_add(in $)'],df['Sales_After_digital_add(in $)'])
print("The t score is: ", round(tscore, 5))
print("The p value is: ", round(pval, 20))

The t score is:  -12.99508
The p value is:  2.6144e-16
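
As a sanity check, the t score above can be reproduced from the `df.describe()` summary alone. The sketch below assumes the pooled-variance (Student) form, which is the `stats.ttest_ind` default; the variable names are illustrative.

```python
import math
from scipy import stats

# Summary statistics copied from the df.describe() output above
n = 22
mean_before, std_before = 149239.954545, 14844.042921
mean_after, std_after = 231123.727273, 25556.777061

# Pooled variance; with equal group sizes it is the average of the two variances
pooled_var = ((n - 1) * std_before**2 + (n - 1) * std_after**2) / (2 * n - 2)
std_err = math.sqrt(pooled_var * (1 / n + 1 / n))

t_score = (mean_before - mean_after) / std_err
dof = 2 * n - 2
p_two_sided = 2 * stats.t.sf(abs(t_score), dof)

print("t score:", round(t_score, 3))   # approximately -12.995
print("degrees of freedom:", dof)      # 42
print("two-sided p value:", p_two_sided)
```

The sign of the t score only reflects the order of the arguments: "before" is passed first and has the lower mean.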


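`stats.ttest_ind` reports a two-sided p-value. Because the t score is negative (mean sales before are lower than mean sales after), the one-tailed p-value is half of the reported value. Either way the p-value is far below 0.05, so we reject the null hypothesis that mean sales before and after digital marketing are equal and conclude that sales increased significantly after digital marketing was introduced.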
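
A minimal sketch of two variations, using only the ten rows displayed in the `df` output (so the numbers will differ from the notebook's 22-row results). Passing `alternative='less'` makes `stats.ttest_ind` one-tailed directly, and `stats.ttest_rel` treats the before/after figures in each row as a pair. The paired test is not what the notebook ran; it is shown as an option, on the assumption that each row describes the same month, region and manager before and after the campaign. The `alternative` argument requires SciPy 1.6 or later.

```python
import math
import numpy as np
from scipy import stats

# First ten rows shown in the df output (the full data set has 22 rows)
before = np.array([132921, 149559, 146278, 152167, 159525,
                   137163, 130625, 131140, 171259, 141956])
after = np.array([270390, 223334, 244243, 231808, 258402,
                  256948, 222106, 230637, 226261, 193735])

# Two-sided independent test, as in the notebook
t_two, p_two = stats.ttest_ind(before, after)

# One-tailed version: H_a is mean(before) < mean(after)
t_one, p_one = stats.ttest_ind(before, after, alternative='less')

# Paired one-tailed test on the per-row differences (an assumption, see above)
t_rel, p_rel = stats.ttest_rel(before, after, alternative='less')

print("two-sided p:", p_two)
print("one-tailed p:", p_one)
print("paired one-tailed p:", p_rel)
```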

# --------

# 2. The company needs to check whether there is any dependency between the features “Region” and “Manager”. 

# Hypothesis:
* H_0 : Region and Manager are independent
* H_a : Region and Manager are not independent
* Significance level : 0.05
* Test : chi-squared test of independence
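
For reference, the test compares each observed count $O_{ij}$ with the count $E_{ij}$ expected under independence. The notation ($R_i$ for row totals, $C_j$ for column totals, $N$ for the grand total, $r$ and $c$ for the number of rows and columns) is introduced here for illustration.

$$
E_{ij} = \frac{R_i\, C_j}{N}, \qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\text{dof} = (r - 1)(c - 1)
$$

A table with three regions and three managers therefore has $(3 - 1)(3 - 1) = 4$ degrees of freedom.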

In [31]:
data['Region'].value_counts()

Region - A    10
Region - B     7
Region - C     5
Name: Region, dtype: int64

In [32]:
data['Manager'].value_counts()

Manager - A    9
Manager - B    7
Manager - C    6
Name: Manager, dtype: int64

In [36]:
pd.crosstab(data.Region,data.Manager,margins=True)

Region,Manager - A,Manager - B,Manager - C,All
Region - A,4,3,3,10
Region - B,4,1,2,7
Region - C,1,3,1,5
All,9,7,6,22


In [35]:
from scipy.stats import chi2_contingency
from scipy.stats import chi2

In [37]:
table=[[4,3,3],[4,1,2],[1,3,1]]
stat,p,dof,expected=chi2_contingency(table)
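
The counts in `table` were typed in by hand from the crosstab. A sketch of an alternative that avoids transcription errors is to build the crosstab without `margins=True` (the `All` row and column must not be fed to the test) and pass it straight to `chi2_contingency`. The frame below holds only the ten rows displayed in the `df` output, so its counts differ from the full 22-row table; `sample` and `observed` are illustrative names.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Region and Manager for the ten rows shown in the df output
sample = pd.DataFrame({
    "Region": ["Region - A", "Region - A", "Region - B", "Region - B", "Region - C",
               "Region - A", "Region - C", "Region - A", "Region - B", "Region - C"],
    "Manager": ["Manager - A", "Manager - C", "Manager - A", "Manager - B", "Manager - B",
                "Manager - B", "Manager - C", "Manager - A", "Manager - C", "Manager - B"],
})

# No margins: only the observed counts go into the test
observed = pd.crosstab(sample["Region"], sample["Manager"])

stat, p, dof, expected = chi2_contingency(observed)
print(observed)
print("dof:", dof)
```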

In [39]:
alpha=0.05

In [41]:
print('Significance level=%.3f,p=%.3f'%(alpha,p))

Significance level=0.050,p=0.549


Since the p-value (0.549) is greater than the significance level of 0.05, we fail to reject the null hypothesis. The data provide no evidence of a dependency between the features 'Region' and 'Manager'.
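
The notebook imports `chi2` without using it. One way it could be used, sketched here, is to compare the test statistic with the critical value at the chosen significance level, which leads to the same decision as the p-value comparison. The block also prints the expected counts: all of them are below 5 for this table, and a common rule of thumb asks for expected counts of at least 5, so the chi-squared approximation should be read with some caution.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed counts from the crosstab (rows: Region A-C, columns: Manager A-C)
table = np.array([[4, 3, 3],
                  [4, 1, 2],
                  [1, 3, 1]])
alpha = 0.05

stat, p, dof, expected = chi2_contingency(table)

# Critical value: reject H_0 only if the statistic exceeds it
critical = chi2.ppf(1 - alpha, dof)

print("statistic=%.3f, critical=%.3f, dof=%d, p=%.3f" % (stat, critical, dof, p))
print("reject H_0:", stat > critical)
print("expected counts:")
print(np.round(expected, 2))
```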

# Insights:
    
1. For the first question, a one-tailed t-test showed a significant increase in sales after the introduction of digital marketing.
2. For the second question, the chi-squared test found no significant dependency between the features 'Region' and 'Manager'.