# Assessing Campaign Performance

<b>About the dataset</b><br>
The dataset consists of feature vectors belonging to 12,330 sessions. 
The dataset was formed so that each session 
would belong to a different user in a 1-year period to avoid 
any tendency to a specific campaign, special day, user 
profile, or period.

The dataset consists of 10 numerical and 8 categorical attributes. 
The 'Revenue' attribute can be used as the class label. 

"Administrative", "Administrative Duration", "Informational", "Informational Duration", "Product Related" and "Product Related Duration" represent the number of pages of each type visited in the session and the total time spent in each of these page categories. The values of these features are derived from the URL information of the pages visited by the user and are updated in real time when the user takes an action, e.g. moving from one page to another.

The "Bounce Rate", "Exit Rate" and "Page Value" features represent the metrics measured by Google Analytics for each page of the e-commerce site. The "Bounce Rate" of a web page is the percentage of visitors who enter the site from that page and then leave ("bounce") without triggering any other requests to the analytics server during that session. The "Exit Rate" of a web page is, of all pageviews of that page, the percentage that were the last in the session. The "Page Value" feature represents the average value of a web page that a user visited before completing an e-commerce transaction.

The "Special Day" feature indicates the closeness of the visit to a specific special day (e.g. Mother's Day, Valentine's Day) on which sessions are more likely to end in a transaction. The value is determined by considering the dynamics of e-commerce, such as the duration between the order date and the delivery date. For example, for Valentine's Day it takes a nonzero value between February 2 and February 12, reaches its maximum of 1 on February 8, and is zero before and after this period unless the date is close to another special day.

The dataset also includes operating system, browser, region, traffic type, visitor type (returning or new visitor), a Boolean value indicating whether the visit falls on a weekend, and the month of the year.
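
To make the "Bounce Rate" and "Exit Rate" definitions concrete, here is a minimal sketch on four hypothetical sessions. The `sessions` list, the page names, and the two helper functions are invented for illustration and are not part of the dataset. A bounce is approximated here as a session with a single pageview, which is a simplification of how an analytics server would detect it.

```python
# Hypothetical sessions: each is the ordered list of pages viewed (not real data).
sessions = [
    ["home"],
    ["home", "product"],
    ["product", "cart", "home"],
    ["home", "product", "cart"],
]

def bounce_rate(page, sessions):
    """Share of sessions entering on `page` that contain only that one pageview."""
    entrances = [s for s in sessions if s[0] == page]
    if not entrances:
        return 0.0
    bounces = [s for s in entrances if len(s) == 1]
    return len(bounces) / len(entrances)

def exit_rate(page, sessions):
    """Share of all pageviews of `page` that were the last pageview of a session."""
    views = sum(s.count(page) for s in sessions)
    if views == 0:
        return 0.0
    exits = sum(1 for s in sessions if s[-1] == page)
    return exits / views

print("Bounce rate of 'home':", bounce_rate("home", sessions))
print("Exit rate of 'home':", exit_rate("home", sessions))
```

In this toy example three sessions enter on `home` and one of them bounces, while `home` is viewed four times and is the last page in two sessions.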

<b>Aim of the Project:</b><br>

The aim is to perform A/B testing with a Chi-Square test. The dataset is first divided into two groups, and the proportion of shoppers who made a purchase is compared between them. A Chi-Square test then determines whether the difference in proportions is statistically significant or due to chance.

In [1]:
#import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
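# import the stats submodule explicitly; a bare `import scipy` does not load it in older SciPy versions
import scipy.stats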

In [2]:
#read the dataset
data=pd.read_csv("online_shoppers_intention.csv")

In [3]:
# shape of the data
data.shape

(12330, 18)

The dataset has 12,330 rows and 18 columns.

In [4]:
# first 3 rows of the data
data.head(3)

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,False,False


In [5]:
# data type of features
data.dtypes

Administrative               int64
Administrative_Duration    float64
Informational                int64
Informational_Duration     float64
ProductRelated               int64
ProductRelated_Duration    float64
BounceRates                float64
ExitRates                  float64
PageValues                 float64
SpecialDay                 float64
Month                       object
OperatingSystems             int64
Browser                      int64
Region                       int64
TrafficType                  int64
VisitorType                 object
Weekend                       bool
Revenue                       bool
dtype: object

Two features (Month and VisitorType) are of object type, two (Weekend and Revenue) are Boolean, and the remaining 14 are numeric.

In [6]:
# info of the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Administrative           12330 non-null  int64  
 1   Administrative_Duration  12330 non-null  float64
 2   Informational            12330 non-null  int64  
 3   Informational_Duration   12330 non-null  float64
 4   ProductRelated           12330 non-null  int64  
 5   ProductRelated_Duration  12330 non-null  float64
 6   BounceRates              12330 non-null  float64
 7   ExitRates                12330 non-null  float64
 8   PageValues               12330 non-null  float64
 9   SpecialDay               12330 non-null  float64
 10  Month                    12330 non-null  object 
 11  OperatingSystems         12330 non-null  int64  
 12  Browser                  12330 non-null  int64  
 13  Region                   12330 non-null  int64  
 14  TrafficType           

There are no null values in the dataset.

In [7]:
# check for duplicate rows
data[data.duplicated()]

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
158,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,3,Returning_Visitor,False,False
159,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,3,2,3,3,Returning_Visitor,False,False
178,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,3,2,3,3,Returning_Visitor,False,False
418,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Mar,1,1,1,1,Returning_Visitor,True,False
456,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Mar,2,2,4,1,Returning_Visitor,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11934,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Dec,1,1,1,2,New_Visitor,False,False
11938,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Dec,1,1,4,1,Returning_Visitor,True,False
12159,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Dec,1,1,1,3,Returning_Visitor,False,False
12180,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Dec,1,13,9,20,Returning_Visitor,False,False
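
The cell above lists the duplicated rows without counting them. A minimal sketch of counting and, if desired, dropping duplicates follows. The `toy` frame is hypothetical and only stands in for `data`. Whether identical rows should be dropped here is a judgment call, since two distinct sessions (for example single-page bounces) can legitimately produce the same feature values. The rest of this notebook keeps all rows.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; the values are made up.
toy = pd.DataFrame({
    "ProductRelated": [1, 1, 2, 1],
    "Month": ["Feb", "Feb", "Feb", "Mar"],
    "Revenue": [False, False, False, False],
})

# duplicated() marks every repeat of an earlier row (the first occurrence stays False)
n_duplicates = toy.duplicated().sum()
print("Duplicated rows:", n_duplicates)

# drop_duplicates() keeps the first occurrence of each row
deduplicated = toy.drop_duplicates()
print("Rows after dropping duplicates:", len(deduplicated))
```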


In [8]:
#check for the null values
data.isna().sum()

Administrative             0
Administrative_Duration    0
Informational              0
Informational_Duration     0
ProductRelated             0
ProductRelated_Duration    0
BounceRates                0
ExitRates                  0
PageValues                 0
SpecialDay                 0
Month                      0
OperatingSystems           0
Browser                    0
Region                     0
TrafficType                0
VisitorType                0
Weekend                    0
Revenue                    0
dtype: int64

# Our aim is to see whether the marketing campaign has a positive impact on revenue

In [9]:
# Level of significance: 0.05
# Confidence level: 95%

In [10]:
# Define the hypotheses
# H0: there is no difference in conversion rate between the two groups; any observed difference is due to chance
# Ha: there is a statistically significant difference in conversion rate between the two groups

In [11]:
# Test metric
# In this case the test metric is the conversion rate

In [12]:
# Test parameters
# Significance level: 0.05, meaning we accept a 5% risk of rejecting a true null hypothesis (Type I error)
# Confidence level: 95%
# Power: 90%
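
The significance level and power listed above are the inputs to a sample-size calculation. A minimal sketch follows, using the standard normal-approximation formula for comparing two proportions. The function name and the 15% and 17% conversion rates are assumptions chosen for illustration; the notebook itself does not state a minimum detectable effect.

```python
from math import ceil
from scipy.stats import norm

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.90):
    """Approximate sessions needed per group for a two-sided two-proportion test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_beta = norm.ppf(power)            # quantile matching the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Assumed rates for illustration only: 15% baseline, 17% target
n_per_group = sample_size_two_proportions(0.15, 0.17)
print("Sessions needed per group:", n_per_group)
```

The smaller the difference that has to be detected, the larger the required sample, because the squared difference sits in the denominator.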

In [13]:
# split the dataset into 2 groups, assuming one group is from before the campaign and the other from after it
# frac=0.5 puts a random 50% of the rows in the first group; the remaining rows form the second group
data_withoutM=data.sample(frac=0.5,random_state=1)
data_withM=data.drop(data_withoutM.index)

In practice, sampling is the most crucial part.
In real-world scenarios, sampling involves the following steps:
1. Define the target population.
2. Define the sampling frame.
3. Decide on a sampling method, such as simple random sampling or stratified sampling.
4. Calculate the sample size based on factors such as precision, resources, and time availability.
5. Implement the sampling procedure.
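
As an illustration of the stratified option in the list above, here is a minimal sketch that splits a frame 50/50 within each visitor type. The `toy` frame is hypothetical, and choosing `VisitorType` as the stratum is an assumption; the notebook itself uses simple random sampling. `groupby(...).sample` requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical toy frame; in the notebook this would be `data`.
toy = pd.DataFrame({
    "VisitorType": ["Returning_Visitor"] * 8 + ["New_Visitor"] * 4,
    "Revenue": [True, False] * 6,
})

# Draw 50% of the rows within each VisitorType stratum
group_a = toy.groupby("VisitorType").sample(frac=0.5, random_state=1)
# The remaining rows form the second group
group_b = toy.drop(group_a.index)

print(group_a["VisitorType"].value_counts())
print(group_b["VisitorType"].value_counts())
```

Both groups end up with the same mix of returning and new visitors, so the visitor type cannot explain a difference in conversion rate between them.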

In [14]:
# check the size of data_withoutM
data_withoutM.shape[0]

6165

In [15]:
# check the size of data_withM
data_withM.shape[0]

6165

In [16]:
# calculate the total number of visitors, the conversion rate, and the number of purchases in each group
# conversion rate = number of sessions with a purchase (Revenue is True) / number of sessions in the group

len_a=data_withoutM.shape[0]
len_b=data_withM.shape[0]

Con_Rate_a=data_withoutM['Revenue'].mean()
Con_Rate_b=data_withM['Revenue'].mean()

purchase_a=data_withoutM['Revenue'].sum()
purchase_b=data_withM['Revenue'].sum()

In [17]:
# conversion rate of both samples
Con_Rate_a,Con_Rate_b

(0.15182481751824817, 0.15766423357664233)

In [18]:
# size of the samples
len_a,len_b

(6165, 6165)

In [19]:
# now we need to check whether this difference in conversion rate is due to chance
# we use the Chi-Square test for this
# with a significance level of 0.05 (95% confidence)

In [20]:
# Conduct a Chi-Square test
# rows: purchase / no purchase; columns: group without campaign / group with campaign
observed = [[purchase_a, purchase_b], [len_a - purchase_a, len_b - purchase_b]]
chi2, p_value, dof, expected = scipy.stats.chi2_contingency(observed)
print('Chi-Square test result:')
print('Chi-Square value:', chi2)
print('P-value:', p_value)
print('Degrees of freedom:', dof)

Chi-Square test result:
Chi-Square value: 0.7595733625892976
P-value: 0.3834620707096291
Degrees of freedom: 1
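
The statistic reported above can be reproduced by hand, which shows what the test measures. In the sketch below the purchase counts 936 and 972 are derived from the conversion rates shown earlier multiplied by the group size of 6165. `chi2_contingency` applies Yates' continuity correction by default for 2x2 tables, so the sketch computes the statistic both with and without it.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Purchases and non-purchases per group; 936 and 972 follow from the
# conversion rates above multiplied by the group size of 6165.
observed = np.array([[936, 972],
                     [6165 - 936, 6165 - 972]])

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson statistic, without and with Yates' continuity correction
chi2_plain = ((observed - expected) ** 2 / expected).sum()
chi2_yates = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()

# SciPy applies the Yates correction by default for 2x2 tables
chi2_scipy = chi2_contingency(observed)[0]

print("Without correction:", chi2_plain)
print("With Yates correction:", chi2_yates)
print("scipy chi2_contingency:", chi2_scipy)
```

The corrected value matches the result printed by the notebook, and the uncorrected value is slightly larger. Passing `correction=False` to `chi2_contingency` would return the uncorrected statistic.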


In [21]:
# Next, calculate the critical value used to decide whether to reject the null hypothesis.

critical_value = scipy.stats.chi2.ppf(q=0.95,  # critical value at the 0.05 significance level
                                      df=1)    # degrees of freedom reported by the test
critical_value

3.841458820694124

In [22]:
# Conclusion:
# the Chi-Square statistic (0.76) is less than the critical value (3.84), and the p-value (0.38) is greater than 0.05
# Hence we fail to reject the null hypothesis
# The observed difference in conversion rate is consistent with chance; there is no statistically significant difference between the two groups

In [23]:
# Calculate the effect size
# Cohen's d indicates how much the two groups differ on the variable being measured
# It is the difference between the two group means divided by the pooled standard deviation
# For a binary outcome such as Revenue, the pooled standard deviation is sqrt(p * (1 - p)), with p the overall conversion rate
# Multiplying by (1/len_a + 1/len_b) inside the root would give the standard error of the difference instead,
# and the ratio would then be a z-statistic (about 0.896 here), not an effect size

total_visitors = len_a + len_b
total_conv_rate = (purchase_a + purchase_b) / total_visitors
pooled_std = (total_conv_rate * (1 - total_conv_rate)) ** 0.5
cohen_d = (Con_Rate_b - Con_Rate_a) / pooled_std
print(f"Effect size (Cohen's d): {cohen_d:.4f}")

Effect size (Cohen's d): 0.0161


In [24]:
# A Cohen's d of about 0.016 is far below 0.2, the conventional threshold for a small effect
# so the difference between the two groups is negligible in practical terms as well

We can therefore conclude that the difference between the two groups is neither statistically nor practically significant: the effect size is negligible, and we fail to reject the null hypothesis at the 0.05 significance level. This is the expected outcome, because both groups are random halves of the same dataset.
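
For a difference between two proportions, Cohen's h is a commonly used effect size and offers a cross-check. It is the difference between the arcsine-transformed proportions, with 0.2, 0.5 and 0.8 as the conventional thresholds for small, medium and large effects. A minimal sketch follows; the counts 936 and 972 are derived from the conversion rates reported earlier multiplied by the group size of 6165.

```python
from math import asin, sqrt

# Conversion rates of the two groups, from the counts 936 and 972 out of 6165
p_a = 936 / 6165
p_b = 972 / 6165

# Cohen's h: difference of the arcsine-transformed proportions
cohen_h = 2 * asin(sqrt(p_b)) - 2 * asin(sqrt(p_a))
print("Effect size (Cohen's h):", round(cohen_h, 4))
```

With these counts h comes out well below 0.2, the threshold for a small effect.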