# Case Study 4: Testing of Hypothesis

### A company has started investing in digital marketing as a new way to promote its products. It collected data and decided to carry out a study on it.
### 1. The company wishes to clarify whether there is any increase in sales after stepping into digital marketing.
### 2. The company needs to check whether there is any dependency between the features “Region” and “Manager”.
### Help the company carry out the study using the data provided.

Importing required libraries and packages

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats
from scipy.stats import chi2_contingency

Descriptive analytics on the dataset.

In [2]:
data=pd.read_csv("Sales_add.csv")

In [3]:
data.head()

Unnamed: 0,Month,Region,Manager,Sales_before_digital_add(in $),Sales_After_digital_add(in $)
0,Month-1,Region - A,Manager - A,132921,270390
1,Month-2,Region - A,Manager - C,149559,223334
2,Month-3,Region - B,Manager - A,146278,244243
3,Month-4,Region - B,Manager - B,152167,231808
4,Month-5,Region - C,Manager - B,159525,258402


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 5 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   Month                           22 non-null     object
 1   Region                          22 non-null     object
 2   Manager                         22 non-null     object
 3   Sales_before_digital_add(in $)  22 non-null     int64 
 4   Sales_After_digital_add(in $)   22 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 1008.0+ bytes


In [5]:
data.describe()

Unnamed: 0,Sales_before_digital_add(in $),Sales_After_digital_add(in $)
count,22.0,22.0
mean,149239.954545,231123.727273
std,14844.042921,25556.777061
min,130263.0,187305.0
25%,138087.75,214960.75
50%,147444.0,229986.5
75%,157627.5,250909.0
max,178939.0,276279.0


In [6]:
data.dtypes

Month                             object
Region                            object
Manager                           object
Sales_before_digital_add(in $)     int64
Sales_After_digital_add(in $)      int64
dtype: object

### 1. The company wishes to clarify whether there is any increase in sales after stepping into digital marketing.

In [7]:
before=data['Sales_before_digital_add(in $)']
after=data['Sales_After_digital_add(in $)']

In [8]:
before.head()

0    132921
1    149559
2    146278
3    152167
4    159525
Name: Sales_before_digital_add(in $), dtype: int64

In [9]:
after.head()

0    270390
1    223334
2    244243
3    231808
4    258402
Name: Sales_After_digital_add(in $), dtype: int64

#### The aim of the test is to conclude whether there is any increase in sales after stepping into digital marketing.

##### Null hypothesis (H0): There is no increase in sales after stepping into digital marketing, i.e., mean of sales after <= mean of sales before
##### Alternative hypothesis (Ha): Sales increase with digital marketing, i.e., mean of sales after > mean of sales before


Level of significance: 
    The threshold against which the p-value is compared when deciding whether to reject the null hypothesis.
    It is normally denoted by alpha and is generally set to 0.05 (5%), meaning we accept a 5% probability of rejecting the null hypothesis when it is actually true.

alpha = 0.05

The sample size is below 30 (22 observations) and the population variance is unknown, so we use a t-test. We are comparing two samples (sales before and after digital marketing), and the alternative hypothesis is one-directional, which implies a one-tailed two-sample t-test.

We use the ttest_ind method to carry out the independent-samples t-test on the two pandas Series created above (before and after), with the equal_var parameter set to True, i.e. assuming both populations have the same variance.

Calling scipy.stats.ttest_ind(before, after) tests the difference before.mean() - after.mean(), so a negative statistic means the mean of after is larger than the mean of before.
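To make the tested quantity concrete, the pooled-variance t statistic can be reproduced by hand. The following is a minimal sketch, assuming only the five rows shown by `data.head()` (not the full 22-row dataset), so its numbers differ from the notebook's result; the names `pooled_var`, `t_manual` and `t_scipy` are illustrative.

```python
import numpy as np
from scipy import stats

# First five rows shown by data.head(); the full dataset has 22 rows.
before = np.array([132921, 149559, 146278, 152167, 159525])
after = np.array([270390, 223334, 244243, 231808, 258402])

n1, n2 = len(before), len(after)

# Pooled variance built from the two unbiased sample variances (ddof=1).
pooled_var = ((n1 - 1) * before.var(ddof=1) + (n2 - 1) * after.var(ddof=1)) / (n1 + n2 - 2)

# t statistic for the difference before.mean() - after.mean().
t_manual = (before.mean() - after.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# The same statistic as computed by SciPy under the equal-variance assumption.
t_scipy, p_two_sided = stats.ttest_ind(before, after, equal_var=True)

print(t_manual, t_scipy)
```

The two printed values should agree, and the sign is negative because the mean of `after` exceeds the mean of `before`.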

In [10]:
res = stats.ttest_ind(before,after,equal_var=True)
res

Ttest_indResult(statistic=-12.995084451110877, pvalue=2.614368006904645e-16)

This is the output of a two-tailed t-test, so we must divide the p-value by 2 for our one-tailed test. Halving is valid here because the statistic is negative, i.e. the observed difference lies in the direction of the alternative hypothesis (mean of sales after > mean of sales before). For a significance level alpha, we need

p/2 < alpha

in order to reject the null hypothesis H0. Here p/2 ≈ 1.3e-16, which is far below alpha = 0.05, so we reject H0.

#### Conclusion: 

##### >> p/2 value (≈ 1.3e-16) is less than the significance level (0.05), so we reject the null hypothesis, i.e. there is a significant increase in sales after stepping into digital marketing
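As a cross-check, the one-tailed p-value can be requested directly instead of halving the two-tailed one. The sketch below assumes SciPy 1.6 or later (where the `alternative` keyword is available) and again uses only the five rows shown by `data.head()`. Because each row records the same month before and after, it also runs a paired test (`ttest_rel`) as a possible alternative to the independent-samples test used above; this is a suggestion, not part of the original analysis.

```python
import numpy as np
from scipy import stats

# First five rows shown by data.head(); the full dataset has 22 rows.
before = np.array([132921, 149559, 146278, 152167, 159525])
after = np.array([270390, 223334, 244243, 231808, 258402])

alpha = 0.05

# One-sided independent-samples test: Ha is mean(before) < mean(after).
ind = stats.ttest_ind(before, after, equal_var=True, alternative="less")

# Paired test on the per-month differences: Ha is mean(after - before) > 0.
paired = stats.ttest_rel(after, before, alternative="greater")

print("independent, one-sided p:", ind.pvalue, "reject H0:", ind.pvalue < alpha)
print("paired, one-sided p:     ", paired.pvalue, "reject H0:", paired.pvalue < alpha)
```

On this five-row sample both tests reject H0 at alpha = 0.05, in line with the conclusion above.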

# ----------------------------------------------------------------------------------------------------------

### 2. The company needs to check whether there is any dependency between the features “Region” and “Manager”.

Pearson’s Chi-Square test is a statistical hypothesis test for independence between categorical variables.

A contingency table (also called a crosstab) is used in statistics to summarise the relationship between two or more categorical variables.

Contingency table of observed counts for Manager (rows) and Region (columns).

In [11]:
data_crosstab = pd.crosstab(data['Manager'], data['Region'], margins = False)
                         
print(data_crosstab)

Region       Region - A  Region - B  Region - C
Manager                                        
Manager - A           4           4           1
Manager - B           3           1           3
Manager - C           3           2           1


#### The aim of the test is to conclude whether the two variables (Manager and Region) are related to each other.

##### Null hypothesis (H0): There is no relation between the variables, i.e. they are independent.
##### Alternative hypothesis (Ha): There is a significant relation between the two.

We can verify the hypothesis by using p-value:

We define a significance level to decide whether the relation between the variables is statistically significant. Generally a significance level (alpha) of 0.05 is chosen. This alpha value is the probability of erroneously rejecting H0 when it is true. If the p-value for the test is greater than alpha, we fail to reject H0; if it is less than or equal to alpha, we reject H0.



The chi2_contingency() function of the scipy.stats module takes the contingency table as a 2D array. It returns the test statistic, the p-value, the degrees of freedom and the table of expected frequencies (the counts expected under independence, computed from the row and column totals), in that order.

In [12]:
# Observed counts copied from the contingency table above (rows: managers, columns: regions)
data1 = [[4, 4, 1], [3, 1, 3], [3, 2, 1]]
stat, p, dof, expected = chi2_contingency(data1)

Next, we compare the obtained p-value with the alpha value of 0.05.

In [13]:
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (H0 holds true)')

p value is 0.5493991051158094
Independent (H0 holds true)
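The expected table and the statistic returned by chi2_contingency() can be reproduced by hand, which shows what the test actually compares. This is a minimal sketch using the observed counts from the contingency table above; the names `expected_manual`, `stat_manual` and `dof_manual` are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the contingency table (rows: managers, columns: regions).
observed = np.array([[4, 4, 1], [3, 1, 3], [3, 2, 1]])

# Expected count per cell under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic and degrees of freedom: (rows - 1) * (columns - 1).
stat_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

stat, p, dof, expected = chi2_contingency(observed)

print(np.round(expected_manual, 2))
print(stat_manual, stat, dof_manual, dof)
```

Every expected count in this table is below 5, so the chi-square approximation is rough for a sample this small and the p-value should be read with some caution.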


####  Conclusion

#### p-value (≈ 0.549) > alpha (0.05), so we fail to reject H0; that is, the data show no significant relation between Region and Manager.