<a href="https://colab.research.google.com/github/Emenike-Amara/Projects/blob/main/A_B_Testing_with_Marketing_Data_Using_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction** ✨

Somara is a fictitious organization dedicated to making a positive impact in the world. Its marketing team is passionate about maximizing the effectiveness of its campaigns and ensuring that its efforts generate a significant return on investment. The team faces a critical question: should it rely on Public Service Announcements (PSAs), or explore the potential of traditional advertisements to reach its target audience and drive the desired action?

To tackle this challenge, the team turned to the Data team for a method to scientifically evaluate and compare the impact of the two approaches, providing insights to inform the marketing strategy.




---



*Join me as we explore the power of A/B testing and witness how it can unlock the potential for success in marketing campaigns.*

# **Project Objective ✨**
This project applies A/B testing to a marketing campaign and compares the impact of two approaches: advertisements and public service announcements (PSAs). 
The goal is to analyze the groups, determine whether the ads were successful, estimate how much the company can make from the ads, and check whether the difference between the groups is statistically significant.



---
You can view the dataset [here](https://www.kaggle.com/datasets/faviovaz/marketing-ab-testing?resource=download). 


***Let's delve in***

# **Step 1** ▶ : Connecting to the Dataset and Importing the Necessary Libraries

In [None]:
#Install this package so I can view the chart
!pip install altair_viewer
!pip install vega_datasets
!pip install vega

In [None]:
#Loading data and declaring path reference
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import altair as alt
alt.data_transformers.disable_max_rows()
alt.renderers.enable('altair_viewer')
from google.colab import drive
drive.mount('/content/drive')
path1='/content/drive/My Drive/marketing_AB.csv'



Mounted at /content/drive


In [None]:
# Data
data = pd.read_csv(path1)
data.head()

Unnamed: 0.1,Unnamed: 0,user id,test group,converted,total ads,most ads day,most ads hour
0,0,1069124,ad,False,130,Monday,20
1,1,1119715,ad,False,93,Tuesday,22
2,2,1144181,ad,False,21,Tuesday,18
3,3,1435133,ad,False,355,Tuesday,10
4,4,1015700,ad,False,276,Friday,14


# **Step 2** ▶: Exploratory Data Analysis 

In [None]:
data['test group'].unique()

array(['ad', 'psa'], dtype=object)

In [None]:
#clean the column names

data.columns = data.columns.str.replace(' ', '_')

In [None]:
data.head()

Unnamed: 0,Unnamed:_0,user_id,test_group,converted,total_ads,most_ads_day,most_ads_hour
0,0,1069124,ad,False,130,Monday,20
1,1,1119715,ad,False,93,Tuesday,22
2,2,1144181,ad,False,21,Tuesday,18
3,3,1435133,ad,False,355,Tuesday,10
4,4,1015700,ad,False,276,Friday,14


In [None]:
#To determine if User activity level can be a measured metric for the experimentation

data['user_id'].value_counts().sort_values()

#unique entry per user

1069124    1
1081965    1
1637531    1
1257223    1
1492276    1
          ..
1313930    1
1561741    1
1383070    1
1188359    1
1237779    1
Name: user_id, Length: 588101, dtype: int64

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 588101 entries, 0 to 588100
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   Unnamed:_0     588101 non-null  int64 
 1   user_id        588101 non-null  int64 
 2   test_group     588101 non-null  object
 3   converted      588101 non-null  bool  
 4   total_ads      588101 non-null  int64 
 5   most_ads_day   588101 non-null  object
 6   most_ads_hour  588101 non-null  int64 
dtypes: bool(1), int64(4), object(2)
memory usage: 27.5+ MB


In [None]:
#Sample size for the A/B testing

len(data)

588101

In [None]:
data.describe().head()

Unnamed: 0,Unnamed:_0,user_id,total_ads,most_ads_hour
count,588101.0,588101.0,588101.0,588101.0
mean,294050.0,1310692.0,24.820876,14.469061
std,169770.279668,202226.0,43.715181,4.834634
min,0.0,900000.0,1.0,0.0
25%,147025.0,1143190.0,4.0,11.0


In [None]:
group_distribution = data['test_group'].value_counts()

# Print the distribution by test group
print(group_distribution)

ad     564577
psa     23524
Name: test_group, dtype: int64


❗ The groups in the dataset are heavily skewed towards the "ad" group, as noted in the data card. Random assignment is a fundamental principle of A/B testing, and an equal 50/50 split is the usual target. Here the split is roughly 96/4 (564,577 users in "ad" versus 23,524 in "psa"), which deviates from that guideline.

In marketing campaigns, such an imbalance is not uncommon. Withholding the advertisement from users has an opportunity cost, so a larger proportion of users is usually exposed to the treatment being tested. For a fixed total sample, a 50/50 split gives the most statistical power; what makes the unequal split workable here is that the smaller control group is still large (23,524 users).

While the split deviates from the ideal, it is a deliberate design choice. Exposing more users to the campaign allows its performance to be evaluated across most of the target audience while still keeping a control group for comparison.

The skewed distribution is therefore a strategic trade-off in the context of marketing campaigns: it gives up some statistical efficiency in exchange for reach and cost efficiency, and the large overall sample keeps the comparison informative.
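
The cost of the imbalance can be sketched with a quick calculation. The standard error of a difference between two group means scales with `sqrt(1/n_ad + 1/n_psa)`. The snippet below uses only the group sizes printed above and compares that factor for the observed split with a hypothetical 50/50 split of the same total. The variable names are illustrative and not part of the original analysis.

```python
import math

# Group sizes taken from the value_counts() output above
n_ad, n_psa = 564577, 23524
n_total = n_ad + n_psa

# Share of users in each group, in percent
share_ad = 100 * n_ad / n_total
share_psa = 100 * n_psa / n_total

# The standard error of a difference in means is proportional to this factor
se_factor_observed = math.sqrt(1 / n_ad + 1 / n_psa)

# Hypothetical 50/50 split of the same total (for comparison only)
se_factor_balanced = math.sqrt(2 / (n_total / 2))

se_inflation = se_factor_observed / se_factor_balanced

print(f"ad: {share_ad:.1f}%, psa: {share_psa:.1f}%")
print(f"Standard error relative to a 50/50 split: {se_inflation:.2f}x")
```

A ratio above 1 means the observed split yields a wider standard error than a balanced split of the same size would.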

# **Step 3** ▶: Defining and Designing the A/B Test
We use `user_id` to identify each customer and the `test_group` column to assign users to groups.

Test group value 'psa' = 0 (Control group)

Test group value 'ad' = 1 (Experimental group)

**Metrics for success:** 
> Analyze by users: conversion rate per channel

> How many times a user sees  an ad (to understand how the users are affected by the change)




In [None]:
# Step 1: Group by test group and sum the 'converted' column to count conversions per group
conversion_counts = data.groupby('test_group')['converted'].sum()

# Step 2: Calculate the total number of conversions
total_conversions = data['converted'].sum()

# Step 3: Compute the percentage contribution of each class
percentage_contribution = (conversion_counts / total_conversions) * 100

# Print the results
print("Conversion count by class:")
print(conversion_counts)
print("\nPercentage contribution of each class:")
print(percentage_contribution)

Conversion count by class:
test_group
ad     14423
psa      420
Name: converted, dtype: int64

Percentage contribution of each class:
test_group
ad     97.170383
psa     2.829617
Name: converted, dtype: float64


In [None]:
#Determine the conversion rate of the different campaigns 

conversion_rate = (conversion_counts/group_distribution)*100
print ("\nPct_Conversion_rate of each campaign: ")
print (conversion_rate )


Pct_Conversion_rate of each campaign: 
test_group
ad     2.554656
psa    1.785411
dtype: float64


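From this result, the ad group converts at a higher rate than the PSA group: **2.55%** versus **1.79%**. The **97.17 : 2.83** split of total conversions mostly reflects the skewed group sizes, not campaign performance, so it should not be read as a conversion rate. Let us determine whether the difference in conversion rates is statistically significant.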

In [None]:
# Create the Altair chart: number of users by total ads seen,
# with one line per day on which users saw the most ads

chart = alt.Chart(data).mark_line(size=1).encode(
    alt.X('total_ads:Q', axis=alt.Axis(title='total_ads')),
    alt.Y('count():Q', axis=alt.Axis(title='number_of_users')),
    tooltip=['most_ads_day:N'],
    color='most_ads_day:N'
).properties(
    width=600,
    height=400
)

chart.show()



# **Step 4** ▶ : Hypothesis Definition

H1 = With the introduction of ads, users are more likely to engage with the ads, and the conversion rate will improve.

H0 = The ad campaign has no effect on user engagement and consequently does not affect the conversion rate.

 *❗Caveat: The MDE (minimum detectable effect) was not calculated here. You can calculate the required sample size using this [Calculator](https://www.optimizely.com/sample-size-calculator/#/?conversion=2&effect=30&significance=95)*
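
As a rough illustration of what a sample size calculation involves, the sketch below applies the standard normal-approximation formula for comparing two proportions. The 2% baseline and 30% relative effect mirror the parameters in the calculator link; the 80% power and the function name `sample_size_per_group` are assumptions made for this example. The linked calculator may use a different statistical method, so this is not intended to reproduce its numbers.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per group for a two-sided two-proportion test.

    Hypothetical helper: assumes equal group sizes and a normal approximation.
    """
    p1 = baseline                       # control conversion rate
    p2 = baseline * (1 + relative_mde)  # rate we want to be able to detect
    p_bar = (p1 + p2) / 2

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # critical value for power

    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


n_per_group = sample_size_per_group(baseline=0.02, relative_mde=0.30)
print("Approximate users needed per group:", n_per_group)
```

Smaller effects or lower baseline rates increase the required sample quickly, because the difference `p2 - p1` is squared in the denominator.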

# **Step 5** ▶ : Analyze the Results

In [None]:
# Separate data for each test group
ad_data = data.loc[data['test_group'] == 'ad', 'converted']   #Experiment group
psa_data = data.loc[data['test_group'] == 'psa', 'converted']  # Control group

# Perform statistical analysis on the separate arrays

t_statistic, p_value = stats.ttest_ind(psa_data, ad_data)

# Print the results
print("Group A data:", psa_data)
print("Group B data:", ad_data)
print("T-statistic:", t_statistic)
print("P-value:", p_value)

Group A data: 18        False
38        False
68        False
140       False
157       False
          ...  
588052    False
588063    False
588066    False
588069    False
588081    False
Name: converted, Length: 23524, dtype: bool
Group B data: 0         False
1         False
2         False
3         False
4         False
          ...  
588096    False
588097    False
588098    False
588099    False
588100    False
Name: converted, Length: 564577, dtype: bool
T-statistic: -7.37040597428566
P-value: 1.7033052627831264e-13


In [None]:
# Check the p-value to determine statistical significance

alpha = 0.05  # Significance level

if p_value < alpha:
    print("Statistically significant results. Reject the null hypothesis.")
else:
    print("Not statistically significant results. Fail to reject the null hypothesis.")

Statistically significant results. Reject the null hypothesis.
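
Because `converted` is a binary outcome, a two-proportion z-test is a common alternative to the t-test used above. The sketch below is a cross-check under that assumption, not the notebook's original method. It uses only the conversion counts and group sizes printed earlier, and the variable names are illustrative.

```python
import math

# Counts taken from the outputs above
conv_ad, n_ad = 14423, 564577    # experimental group
conv_psa, n_psa = 420, 23524     # control group

p_ad = conv_ad / n_ad
p_psa = conv_psa / n_psa

# Pooled conversion rate under the null hypothesis of no difference
p_pooled = (conv_ad + conv_psa) / (n_ad + n_psa)

# Standard error of the difference under the null hypothesis
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_ad + 1 / n_psa))

z_stat = (p_ad - p_psa) / se

# Two-sided p-value from the standard normal distribution
p_value_z = math.erfc(abs(z_stat) / math.sqrt(2))

print("z-statistic:", z_stat)
print("p-value:", p_value_z)
```

With samples this large, the z-statistic should be close in magnitude to the t-statistic reported above, so both tests lead to the same decision at `alpha = 0.05`.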


# **Step 6** ▶ : Share Findings



> *Based on the statistical analysis, the difference between the two groups is statistically significant (p ≈ 1.7e-13). This indicates that the advertisement had a measurable effect on user behavior, with a higher conversion rate in the ad group than in the PSA group. It suggests the advertisement is effective at driving customer actions, although the dataset records conversions and not purchase amounts.*


> To estimate how much money the company will make, the relevant metrics are conversion rate, average order value, customer lifetime value, and revenue generated per customer. For this project, we calculate the **% improvement (Lift)**, which can then be applied to the baseline values of those financial metrics.

In [None]:
mean_group_ad = data[data['test_group'] == 'ad']['converted'].mean()
mean_group_psa = data[data['test_group'] == 'psa']['converted'].mean()

print(mean_group_ad)
print(mean_group_psa)

0.025546559636683747
0.01785410644448223


In [None]:
Lift = ((mean_group_ad - mean_group_psa) / mean_group_psa) * 100

print (Lift)

43.085064022225836
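
The lift is a point estimate, so it helps to see how much uncertainty surrounds the underlying difference in conversion rates. The sketch below computes an approximate 95% confidence interval for the absolute difference using the normal (Wald) approximation. That choice of interval and the variable names are assumptions for illustration; the counts come from the outputs above.

```python
import math
from statistics import NormalDist

# Counts taken from the outputs above
conv_ad, n_ad = 14423, 564577
conv_psa, n_psa = 420, 23524

p_ad = conv_ad / n_ad
p_psa = conv_psa / n_psa
diff = p_ad - p_psa

# Relative lift in percent, matching the calculation above
lift_pct = diff / p_psa * 100

# Unpooled standard error of the difference in proportions
se_diff = math.sqrt(p_ad * (1 - p_ad) / n_ad + p_psa * (1 - p_psa) / n_psa)

# 95% Wald confidence interval for the absolute difference
z_crit = NormalDist().inv_cdf(0.975)
ci_low = diff - z_crit * se_diff
ci_high = diff + z_crit * se_diff

print(f"Lift: {lift_pct:.2f}%")
print(f"Absolute difference: {diff:.4%}")
print(f"95% CI for the difference: ({ci_low:.4%}, {ci_high:.4%})")
```

An interval that excludes zero is consistent with the significance test in Step 5.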


**Conclusion**: Based on the A/B test results, using ads leads to a statistically significant increase in the conversion rate compared to the control group (PSA). The conversion rate rose from about 1.79% to 2.55%, a relative **43.1% improvement** (roughly 0.77 percentage points). This finding highlights the effectiveness of ads in driving user engagement and encouraging desired actions, and points to the potential for increased revenue through targeted marketing campaigns.