## Case Study on Testing of Hypothesis
A company has started investing in digital marketing as a new way of promoting its products. It collected data on the effort and decided to carry out a study on it.
- The company wishes to clarify whether there is any increase in sales after stepping into digital marketing.
- The company needs to check whether there is any dependency between the features **“Region”** and **“Manager”**.

Help the company carry out its study using the data provided.

In [1]:
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

import scipy.stats as stats

In [2]:
# Loading the csv file into a pandas dataframe
df = pd.read_csv('Sales_add.csv')

# display the contents in the data set
df

| | Month | Region | Manager | Sales_before_digital_add(in $) | Sales_After_digital_add(in $) |
|---|---|---|---|---|---|
| 0 | Month-1 | Region - A | Manager - A | 132921 | 270390 |
| 1 | Month-2 | Region - A | Manager - C | 149559 | 223334 |
| 2 | Month-3 | Region - B | Manager - A | 146278 | 244243 |
| 3 | Month-4 | Region - B | Manager - B | 152167 | 231808 |
| 4 | Month-5 | Region - C | Manager - B | 159525 | 258402 |
| 5 | Month-6 | Region - A | Manager - B | 137163 | 256948 |
| 6 | Month-7 | Region - C | Manager - C | 130625 | 222106 |
| 7 | Month-8 | Region - A | Manager - A | 131140 | 230637 |
| 8 | Month-9 | Region - B | Manager - C | 171259 | 226261 |
| 9 | Month-10 | Region - C | Manager - B | 141956 | 193735 |


In [3]:
# summary of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 5 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   Month                           22 non-null     object
 1   Region                          22 non-null     object
 2   Manager                         22 non-null     object
 3   Sales_before_digital_add(in $)  22 non-null     int64 
 4   Sales_After_digital_add(in $)   22 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 1008.0+ bytes


The dataset contains the following:

- There are 22 rows and 5 features: Month, Region, Manager, Sales_before_digital_add and Sales_After_digital_add.

In [4]:
# Count the missing values in each column of the dataset
df.isna().sum() # or you can use df.isnull().sum() as well

Month                             0
Region                            0
Manager                           0
Sales_before_digital_add(in $)    0
Sales_After_digital_add(in $)     0
dtype: int64

As the output shows, there are no missing values in the dataset, so we can move on to descriptive analytics.

**Descriptive analytics on the dataset**

In [5]:
# the total number of elements in each dimension
df.shape

(22, 5)

We can use a t-test here because the sample is small (n = 22). A t-test is the usual choice for hypothesis tests on a mean when the sample size n is below about 30 and the population standard deviation is unknown.

In [6]:
# counting the rows recorded for each region
df.groupby('Region').count() 

| Region | Month | Manager | Sales_before_digital_add(in $) | Sales_After_digital_add(in $) |
|---|---|---|---|---|
| Region - A | 10 | 10 | 10 | 10 |
| Region - B | 7 | 7 | 7 | 7 |
| Region - C | 5 | 5 | 5 | 5 |


In [7]:
# calculating the average sales before and after, region-wise
# (the columns are selected with a list; selecting with a bare tuple fails in pandas 2.x)
df.groupby(['Region'])[['Sales_before_digital_add(in $)','Sales_After_digital_add(in $)']].mean().round(2)

| Region | Sales_before_digital_add(in $) | Sales_After_digital_add(in $) |
|---|---|---|
| Region - A | 148204.9 | 238853.1 |
| Region - B | 150523.57 | 228727.86 |
| Region - C | 149513.0 | 219019.2 |


The region-wise averages show that sales are considerably higher in every region after the move into digital marketing.
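To put a number on that increase, the relative uplift can be computed from the region-wise means in the table above. This is only a minimal sketch: the names `region_means` and `uplift_pct` are illustrative and not part of the original notebook, and the figures are copied from the table rather than recomputed from `Sales_add.csv`.

```python
import pandas as pd

# Region-wise mean sales, copied from the table above (in $)
region_means = pd.DataFrame(
    {
        "before": [148204.90, 150523.57, 149513.00],
        "after": [238853.10, 228727.86, 219019.20],
    },
    index=["Region - A", "Region - B", "Region - C"],
)

# Relative increase of the "after" mean over the "before" mean, in percent
region_means["uplift_pct"] = (
    (region_means["after"] - region_means["before"]) / region_means["before"] * 100
).round(2)

print(region_means)
```

A descriptive uplift like this does not replace the hypothesis test; it only shows the size of the difference that the test will assess.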

In [8]:
# counting the rows recorded for each manager
df.groupby('Manager').count() 

| Manager | Month | Region | Sales_before_digital_add(in $) | Sales_After_digital_add(in $) |
|---|---|---|---|---|
| Manager - A | 9 | 9 | 9 | 9 |
| Manager - B | 7 | 7 | 7 | 7 |
| Manager - C | 6 | 6 | 6 | 6 |


In [9]:
# calculating the average sales before and after, manager-wise
# (the columns are selected with a list; selecting with a bare tuple fails in pandas 2.x)
df.groupby(['Manager'])[['Sales_before_digital_add(in $)','Sales_After_digital_add(in $)']].mean().round(2)

| Manager | Sales_before_digital_add(in $) | Sales_After_digital_add(in $) |
|---|---|---|
| Manager - A | 145875.22 | 244402.67 |
| Manager - B | 155499.29 | 218899.14 |
| Manager - C | 146984.5 | 225467.33 |


The manager-wise averages show the same pattern: sales are considerably higher for every manager after the move into digital marketing.

In [10]:
# counting the rows recorded for each region and manager combination
df.groupby(['Region','Manager']).count() 

| Region | Manager | Month | Sales_before_digital_add(in $) | Sales_After_digital_add(in $) |
|---|---|---|---|---|
| Region - A | Manager - A | 4 | 4 | 4 |
| Region - A | Manager - B | 3 | 3 | 3 |
| Region - A | Manager - C | 3 | 3 | 3 |
| Region - B | Manager - A | 4 | 4 | 4 |
| Region - B | Manager - B | 1 | 1 | 1 |
| Region - B | Manager - C | 2 | 2 | 2 |
| Region - C | Manager - A | 1 | 1 | 1 |
| Region - C | Manager - B | 3 | 3 | 3 |
| Region - C | Manager - C | 1 | 1 | 1 |


In [11]:
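# calculating the average sales before and after for each region and manager combination
# (the columns are selected with a list; selecting with a bare tuple fails in pandas 2.x)
df.groupby(['Region','Manager'])[['Sales_before_digital_add(in $)','Sales_After_digital_add(in $)']].mean().round(2)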

| Region | Manager | Sales_before_digital_add(in $) | Sales_After_digital_add(in $) |
|---|---|---|---|
| Region - A | Manager - A | 148628.5 | 257609.25 |
| Region - A | Manager - B | 155617.0 | 218944.0 |
| Region - A | Manager - C | 140228.0 | 233754.0 |
| Region - B | Manager - A | 142725.0 | 234962.75 |
| Region - B | Manager - B | 152167.0 | 231808.0 |
| Region - B | Manager - C | 165299.0 | 214718.0 |
| Region - C | Manager - A | 147463.0 | 229336.0 |
| Region - C | Manager - B | 156492.33 | 214551.33 |
| Region - C | Manager - C | 130625.0 | 222106.0 |


In [12]:
# statistical summary of the data
df[['Sales_before_digital_add(in $)','Sales_After_digital_add(in $)']].describe().T # .T transposes the summary so that each feature is a row

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Sales_before_digital_add(in $) | 22.0 | 149239.954545 | 14844.042921 | 130263.0 | 138087.75 | 147444.0 | 157627.5 | 178939.0 |
| Sales_After_digital_add(in $) | 22.0 | 231123.727273 | 25556.777061 | 187305.0 | 214960.75 | 229986.5 | 250909.0 | 276279.0 |


This is the complete statistical summary of the data. Sales before range from 130263.0 to 178939.0, while sales after range from 187305.0 to 276279.0. The top 25% of months have sales of at least 157627.5 before and at least 250909.0 after (the 75th percentile); the bottom 25% have sales of at most 138087.75 before and at most 214960.75 after (the 25th percentile). The sample mean is 149239.95 before and 231123.73 after, with standard deviations of 14844.04 and 25556.78 respectively.

### 1. The company wishes to clarify whether there is any increase in sales after stepping into digital marketing.

**Hypothesis**

**Step-1: Set up Hypothesis (NULL and Alternate):**
- (Null Hypothesis) H0: the average sales after digital add is less than or equal to the average sales before digital add
- (Alternate Hypothesis) H1: the average sales after digital add is greater than the average sales before digital add

**Step-2: Set the criteria for the decision:** To set the criteria for a decision, we state the level of significance for the test. It could be 5%, 1% or 0.5%.
- It is not given in the problem, so let’s assume 5% (0.05).

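**Step-3: Compute the random chance probability:** The test statistic and its p-value tell us how likely a result at least as extreme as the observed one would be if the null hypothesis were true. A high probability means the data are consistent with the null hypothesis, so there is not enough evidence to reject it.

- Compute the random chance probability using the related-samples (paired) t-test, `scipy.stats.ttest_rel`.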
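As an illustration of what the paired t-test computes, the statistic can be reproduced by hand from the per-month differences. The sketch below uses only the first ten rows shown in the data preview, not the full 22-row file, so its numbers will differ from the notebook's result. The names `d`, `t_manual` and `p_manual` are illustrative.

```python
import numpy as np
import scipy.stats as stats

# First ten rows shown in the data preview (sales in $); the full file has 22 rows
before = np.array([132921, 149559, 146278, 152167, 159525,
                   137163, 130625, 131140, 171259, 141956])
after = np.array([270390, 223334, 244243, 231808, 258402,
                  256948, 222106, 230637, 226261, 193735])

# A paired test works on the per-month differences
d = before - after
n = len(d)

# t = mean(d) / (std(d) / sqrt(n)), using the sample standard deviation (ddof=1)
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The library routine should agree with the manual calculation
t_scipy, p_scipy = stats.ttest_rel(before, after)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

Because the differences are taken as before minus after, a negative statistic corresponds to higher sales after the campaign.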

**Step-4: Make Decision:** 
Here we compare the p-value with the predefined significance level: if it is less than the significance level we reject the null hypothesis, otherwise we fail to reject (retain) it. Whether we retain or reject the null hypothesis, we might be wrong, because we are observing a sample and not the entire population. There are four decision alternatives regarding the truth or falsity of the decision we make about a null hypothesis:
1. The decision to retain the null hypothesis could be correct.
2. The decision to retain the null hypothesis could be incorrect; this is known as a Type II error.
3. The decision to reject the null hypothesis could be correct.
4. The decision to reject the null hypothesis could be incorrect; this is known as a Type I error.
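The link between the significance level and the Type I error can be seen in a small simulation. The following is a sketch on synthetic data only: the normal distribution, its mean and spread, and the number of trials are all assumptions chosen for illustration and are not taken from the company's data.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
alpha = 0.05
n_trials, n_months = 2000, 22

# Both samples come from the same distribution, so H0 (no difference) is true
before = rng.normal(loc=150_000, scale=15_000, size=(n_trials, n_months))
after = rng.normal(loc=150_000, scale=15_000, size=(n_trials, n_months))

# One paired t-test per simulated dataset (one dataset per row)
_, pvalues = stats.ttest_rel(before, after, axis=1)

# Every rejection here is a Type I error, since H0 holds by construction
type1_rate = np.mean(pvalues < alpha)
print(f"Observed Type I error rate: {type1_rate:.3f}")
```

The observed rate should land close to `alpha`, which is what choosing a 5% significance level means: accepting roughly a 5% chance of rejecting a true null hypothesis.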

From the hypotheses, this calls for a one-tailed paired t-test: paired because the before and after figures are measured on the same months and are therefore related, and one-tailed because the alternative hypothesis states a direction (an increase in sales). Since no level is specified, we use the default 95% confidence level, that is, a significance level of 0.05 (5%).

In [13]:
# assign sales before and after values to the variables
sales_before_add = df['Sales_before_digital_add(in $)']
sales_after_add = df['Sales_After_digital_add(in $)']

In [14]:
# paired (related-samples) t-test: returns the t statistic and, by default, a two-sided p-value
tvalue, pvalue = stats.ttest_rel(sales_before_add,sales_after_add)
stats.ttest_rel(sales_before_add,sales_after_add)

Ttest_relResult(statistic=-12.09070525287017, pvalue=6.336667004575778e-11)

In [15]:
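# checking the one-tailed hypothesis: the p-value from ttest_rel is two-sided, so halve it
# and require a negative statistic (mean sales before < mean sales after)
if tvalue < 0 and pvalue / 2 < 0.05:
    print('(Null Hypothesis) Reject H0')
else:
    print('(Null Hypothesis) Fail to reject H0')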

(Null Hypothesis) Reject H0


The p-value is far below the significance level of 0.05, so we reject the null hypothesis: the data give strong evidence that sales increased after stepping into digital marketing.
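Since the alternative hypothesis is directional, the one-tailed p-value can also be requested directly through the `alternative` argument of `scipy.stats.ttest_rel` (available in SciPy 1.6 and later). The sketch below again uses only the ten rows shown in the data preview, so the numbers are illustrative rather than the notebook's 22-row result.

```python
import numpy as np
import scipy.stats as stats

# First ten rows shown in the data preview (sales in $); the full file has 22 rows
before = np.array([132921, 149559, 146278, 152167, 159525,
                   137163, 130625, 131140, 171259, 141956])
after = np.array([270390, 223334, 244243, 231808, 258402,
                  256948, 222106, 230637, 226261, 193735])

# alternative="greater" tests H1: mean(after - before) > 0
one_sided = stats.ttest_rel(after, before, alternative="greater")

# Default two-sided test on the same data, for comparison
two_sided = stats.ttest_rel(after, before)

print("one-sided:", one_sided.statistic, one_sided.pvalue)
print("two-sided:", two_sided.statistic, two_sided.pvalue)
```

When the statistic points in the direction of the alternative, the one-sided p-value is half of the two-sided one, which is why halving the default p-value gives the same decision.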

### 2. The company needs to check whether there is any dependency between the features “Region” and “Manager”.

**Hypothesis**

- (Null Hypothesis) H0: Region and Manager features are independent
- (Alternate Hypothesis) H1: Region and Manager features are dependent

In [16]:
# counting the rows recorded for each region and manager combination
df.groupby(['Region','Manager']).count() 

| Region | Manager | Month | Sales_before_digital_add(in $) | Sales_After_digital_add(in $) |
|---|---|---|---|---|
| Region - A | Manager - A | 4 | 4 | 4 |
| Region - A | Manager - B | 3 | 3 | 3 |
| Region - A | Manager - C | 3 | 3 | 3 |
| Region - B | Manager - A | 4 | 4 | 4 |
| Region - B | Manager - B | 1 | 1 | 1 |
| Region - B | Manager - C | 2 | 2 | 2 |
| Region - C | Manager - A | 1 | 1 | 1 |
| Region - C | Manager - B | 3 | 3 | 3 |
| Region - C | Manager - C | 1 | 1 | 1 |


In [17]:
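# calculating the average sales before and after for each region and manager combination
# (the columns are selected with a list; selecting with a bare tuple fails in pandas 2.x)
df.groupby(['Region','Manager'])[['Sales_before_digital_add(in $)','Sales_After_digital_add(in $)']].mean().round(2)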

| Region | Manager | Sales_before_digital_add(in $) | Sales_After_digital_add(in $) |
|---|---|---|---|
| Region - A | Manager - A | 148628.5 | 257609.25 |
| Region - A | Manager - B | 155617.0 | 218944.0 |
| Region - A | Manager - C | 140228.0 | 233754.0 |
| Region - B | Manager - A | 142725.0 | 234962.75 |
| Region - B | Manager - B | 152167.0 | 231808.0 |
| Region - B | Manager - C | 165299.0 | 214718.0 |
| Region - C | Manager - A | 147463.0 | 229336.0 |
| Region - C | Manager - B | 156492.33 | 214551.33 |
| Region - C | Manager - C | 130625.0 | 222106.0 |


In [18]:
# unstack - Returns a DataFrame having a new level of column labels whose inner-most level consists of the pivoted index labels
df1=df.groupby('Region')['Manager'].value_counts().unstack()
df1

| Region | Manager - A | Manager - B | Manager - C |
|---|---|---|---|
| Region - A | 4 | 3 | 3 |
| Region - B | 4 | 1 | 2 |
| Region - C | 1 | 3 | 1 |
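A contingency table like the one above can also be built with `pd.crosstab`, which fills combinations that never occur with 0, whereas `value_counts().unstack()` leaves them as NaN. The sketch below shows the difference on the ten rows from the data preview, where Region - C and Manager - A happen not to appear together. The names `preview`, `unstacked` and `crossed` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Region and Manager for the first ten rows shown in the data preview
preview = pd.DataFrame({
    "Region": ["Region - A", "Region - A", "Region - B", "Region - B", "Region - C",
               "Region - A", "Region - C", "Region - A", "Region - B", "Region - C"],
    "Manager": ["Manager - A", "Manager - C", "Manager - A", "Manager - B", "Manager - B",
                "Manager - B", "Manager - C", "Manager - A", "Manager - C", "Manager - B"],
})

# Approach used in the notebook: a missing combination becomes NaN
unstacked = preview.groupby("Region")["Manager"].value_counts().unstack()

# Alternative: crosstab reports a missing combination as a count of 0
crossed = pd.crosstab(preview["Region"], preview["Manager"])

print(unstacked)
print(crossed)
```

On the full 22-row dataset every combination occurs at least once, so both approaches give the same table there. The distinction only matters when a cell is empty.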


In [19]:
# perform the Chi-Square Test for Independence
chivalue, pvalue, dof, expvalues = stats.chi2_contingency(df1)

In [20]:
# Display the Chi-Square Test of Independence values
print('Chi square Test Statistic:',chivalue)
print('Pvalue of test data:',pvalue)
print('Degrees of Freedom:',dof)
print('Expected observations of data:',expvalues)

Chi square Test Statistic: 3.050566893424036
Pvalue of test data: 0.5493991051158094
Degrees of Freedom: 4
Expected observations of data: [[4.09090909 3.18181818 2.72727273]
 [2.86363636 2.22727273 1.90909091]
 [2.04545455 1.59090909 1.36363636]]


In [21]:
# chi-square critical value for a significance level of 0.05 and degrees of freedom (3-1)*(3-1) = 4
chicriteria=stats.chi2.ppf(1-0.05,df=4)
chicriteria

9.487729036781154

In [22]:
# checking Hypothesis condition

if(chivalue>chicriteria and pvalue<0.05):
    print('Reject H0: Region and Manager features are dependent')
else:
    print('Fail to reject H0: no evidence that Region and Manager features are dependent')

Fail to reject H0: no evidence that Region and Manager features are dependent


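As you can see, we fail to reject the null hypothesis. Hence the chi-square test for independence gives no evidence of a dependency between the features Region and Manager. Note that every expected count in this table is below 5, so the chi-square approximation is rough for a sample this small.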
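The chi-square statistic reported above can be reproduced by hand from the observed Region by Manager counts, which makes the calculation behind `chi2_contingency` explicit. This is a minimal sketch: the observed counts are copied from the contingency table in this notebook, and the names `observed`, `expected`, `chi2_manual` and `p_manual` are illustrative.

```python
import numpy as np
import scipy.stats as stats

# Observed counts: rows are Region A/B/C, columns are Manager A/B/C
observed = np.array([[4, 3, 3],
                     [4, 1, 2],
                     [1, 3, 1]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
n = observed.sum()

# Under independence, expected count = row total * column total / grand total
expected = row_totals @ col_totals / n

# Chi-square statistic: sum over cells of (O - E)^2 / E
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value: upper tail of the chi-square distribution
p_manual = stats.chi2.sf(chi2_manual, dof)

# Compare with the library routine
chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(observed)
print(chi2_manual, chi2_scipy)
print(p_manual, p_scipy)
```

The manual values match the library output because the Yates continuity correction in `chi2_contingency` is only applied when there is a single degree of freedom.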