# Case Study on Testing of Hypothesis

A company has started investing in digital marketing as a new way to promote its products. It collected sales data and decided to carry out a study on it:
* The company wishes to clarify whether there is any increase in sales after stepping into digital marketing.
* The company needs to check whether there is any dependency between the features “Region” and “Manager”.

Help the company carry out this study using the data provided.

In [2]:
import pandas as pd

In [3]:
sd=pd.read_csv('Sales_add.csv')
sd.head()

| | Month | Region | Manager | Sales_before_digital_add(in $) | Sales_After_digital_add(in $) |
|---|---|---|---|---|---|
| 0 | Month-1 | Region - A | Manager - A | 132921 | 270390 |
| 1 | Month-2 | Region - A | Manager - C | 149559 | 223334 |
| 2 | Month-3 | Region - B | Manager - A | 146278 | 244243 |
| 3 | Month-4 | Region - B | Manager - B | 152167 | 231808 |
| 4 | Month-5 | Region - C | Manager - B | 159525 | 258402 |


**Data Analysis:**

In [5]:
sd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 5 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   Month                           22 non-null     object
 1   Region                          22 non-null     object
 2   Manager                         22 non-null     object
 3   Sales_before_digital_add(in $)  22 non-null     int64 
 4   Sales_After_digital_add(in $)   22 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 1008.0+ bytes


In [9]:
sd.isna().any()

Month                             False
Region                            False
Manager                           False
Sales_before_digital_add(in $)    False
Sales_After_digital_add(in $)     False
dtype: bool

In [7]:
sd.describe()

| | Sales_before_digital_add(in $) | Sales_After_digital_add(in $) |
|---|---|---|
| count | 22.0 | 22.0 |
| mean | 149239.954545 | 231123.727273 |
| std | 14844.042921 | 25556.777061 |
| min | 130263.0 | 187305.0 |
| 25% | 138087.75 | 214960.75 |
| 50% | 147444.0 | 229986.5 |
| 75% | 157627.5 | 250909.0 |
| max | 178939.0 | 276279.0 |


In [3]:
sd.describe(include=['object'])

| | Month | Region | Manager |
|---|---|---|---|
| count | 22 | 22 | 22 |
| unique | 22 | 3 | 3 |
| top | Month-22 | Region - A | Manager - A |
| freq | 1 | 10 | 9 |


**Observations:**
1. The data set shows the monthly sales of a company, reported by different managers in different regions.
2. The data set has 5 features: 3 are of the object datatype and the other 2 are 64-bit integers.
3. There are no null values.
4. There are 22 entries, one per month.

### a) The company wishes to clarify whether there is any increase in sales after stepping into digital marketing.

**Null Hypothesis, H0**: There is **no increase** in sales after stepping into digital marketing.

**Alternative Hypothesis, H1**: There is **an increase** in sales after stepping into digital marketing.

* The data set has fewer than 30 entries and the population variance is unknown, so a t-test is appropriate.
* The alternative hypothesis claims that there is an increase in sales, so we use a one-tailed (right-tailed) t-test.
* Since the before and after sales are measured on the same 22 months, we use a paired-sample t-test.

If the p-value is less than the chosen significance level, most commonly 0.05, we reject the null hypothesis.
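
The paired, right-tailed test described above can be sketched by hand as a one-sample t-test on the differences. This is only an illustration under stated assumptions: the helper name `paired_right_tailed_ttest` and the toy figures are hypothetical and are not taken from `Sales_add.csv`.

```python
import numpy as np
from scipy import stats

def paired_right_tailed_ttest(after, before):
    """Return (t statistic, one-sided p-value) for H1: mean(after - before) > 0."""
    diff = np.asarray(after, dtype=float) - np.asarray(before, dtype=float)
    n = diff.size
    # t = mean of the paired differences divided by its standard error
    t_stat = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
    # Right-tail probability of the t distribution with n - 1 degrees of freedom
    p_one_sided = stats.t.sf(t_stat, df=n - 1)
    return t_stat, p_one_sided

# Hypothetical toy figures, used only to illustrate the calculation
before = [100, 120, 110, 130, 125]
after = [150, 160, 155, 170, 180]

t_stat, p_value = paired_right_tailed_ttest(after, before)
print('t = %0.3f, one-sided p = %0.6f' % (t_stat, p_value))
```

When the statistic is positive, this one-sided p-value is half of the two-sided p-value that `stats.ttest_rel` reports by default.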

In [5]:
from scipy import stats

In [6]:
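# ttest_rel returns a two-sided p-value by default; for the right-tailed test
# the one-sided p-value is half of it when the statistic is positive.
t_stat,p_two_sided=stats.ttest_rel(sd['Sales_After_digital_add(in $)'],sd['Sales_before_digital_add(in $)'])
print('Statistic value is: %0.3f'%t_stat)
print('P-value is:%0.12f'%p_two_sided)

Statistic value is: 12.091
P-value is:0.000000000063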


**Inferences:**
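* The test statistic is positive and the p-value is far below 0.05 (the one-sided p-value is half of the two-sided value that `ttest_rel` reports by default), so we reject the null hypothesis in favour of the alternative hypothesis.
* So we can conclude that there is a statistically significant increase in the company's sales after stepping into digital marketing.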

### b) The company needs to check whether there is any dependency between the features “Region” and “Manager”.

* Since the "Region" and "Manager" features hold categorical values, we can perform a chi-square test.
* The company needs to know whether the two features depend on each other, so we use the chi-square test for independence.

**Null Hypothesis, H0**: There is **no dependency** between the features "Region" & "Manager"

**Alternative Hypothesis, H1**: There **is dependency** between the features "Region" & "Manager"

In [27]:
# pandas.crosstab: compute a simple cross tabulation of two (or more) factors.
# By default it computes a frequency table of the factors unless an array of
# values and an aggregation function are passed.
ctab=pd.crosstab(sd["Region"], sd["Manager"])
ctab

| Region | Manager - A | Manager - B | Manager - C |
|---|---|---|---|
| Region - A | 4 | 3 | 3 |
| Region - B | 4 | 1 | 2 |
| Region - C | 1 | 3 | 1 |


In [28]:
s,p,d,e=stats.chi2_contingency(ctab)

In [30]:
print('Statistic value is: %0.3f'%s)
print('P-value is:%0.12f'%p)
print('Degrees of freedom: %d'%d)
print('Expected frequencies are:',e)

Statistic value is: 3.051
P-value is:0.549399105116
Degrees of freedom: 4
Expected frequencies are: [[4.09090909 3.18181818 2.72727273]
 [2.86363636 2.22727273 1.90909091]
 [2.04545455 1.59090909 1.36363636]]
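
The expected frequencies and the statistic printed above can be reproduced by hand, which may help show what the test is doing. The following is a minimal sketch, assuming the observed counts are copied from the crosstab shown earlier; the variable names are illustrative only.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the Region x Manager crosstab above
observed = np.array([[4, 3, 3],
                     [4, 1, 2],
                     [1, 3, 1]])

row_totals = observed.sum(axis=1, keepdims=True)  # entries per region
col_totals = observed.sum(axis=0, keepdims=True)  # entries per manager
n = observed.sum()                                # 22 entries in total

# Under independence, expected count = row total * column total / grand total
expected = row_totals * col_totals / n

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = stats.chi2.sf(chi2, dof)

# Share of cells whose expected count is below 5
share_small = (expected < 5).mean()

print('chi2 = %0.3f, dof = %d, p = %0.6f' % (chi2, dof, p_value))
print('share of expected counts below 5: %0.2f' % share_small)
```

Every expected count here is below 5, so the chi-square approximation should be read with some caution for a sample of this size.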


**Inferences:**
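* Since the p-value is > 0.05, we fail to reject the null hypothesis.
* So the data give no evidence of a dependency between the features "Region" and "Manager".
* All expected frequencies are below 5, so the chi-square approximation is rough for a sample this small and the result should be interpreted with caution.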