# A/B Testing Preparation and t-Test in Python

## About the Dataset

This dataset contains performance data from an A/B test comparing two marketing campaigns: a control campaign and a test campaign. It includes key engagement and conversion metrics recorded over a 30-day period for each campaign.

### Tasks

* Clean the data by handling missing values and inconsistent formatting.
* Perform a two-sample t-test to determine whether the difference in the chosen KPI, the number of purchases (Purchases), between the Control and Test campaigns is statistically significant.

### Data Dictionary

* <b>Campaign Name:</b> The name of the campaign - control or test.
* <b>Date:</b> Date of the record.
* <b>Spend [USD]:</b> Amount spent on the campaign in dollars.
* <b># of Impressions:</b> Number of impressions the ad received during the campaign.
* <b>Reach:</b> Number of unique impressions the ad received.
* <b># of Website Clicks:</b> Number of website clicks received through the ads.
* <b># of Searches:</b> Number of users who performed searches on the website.
* <b># of View Content:</b> Number of users who viewed content and products on the website.
* <b># of Add to Cart:</b> Number of users who added products to the cart.
* <b># of Purchase:</b> Number of purchases.

## Preparation

### Loading the Libraries

Loading the relevant Python libraries

In [143]:
import numpy as np
import pandas as pd
import re
from scipy import stats

In [144]:
pd.__version__

'2.2.1'

### Loading the Data

Loading the A/B testing results dataset into a DataFrame

In [145]:
urlControl = 'https://raw.githubusercontent.com/Adi-Shalit/AB-Testing-SQL-Python-Tableau-Project/main/control_group.csv'
df_control = pd.read_csv(urlControl, sep = ";")
urlTest = 'https://raw.githubusercontent.com/Adi-Shalit/AB-Testing-SQL-Python-Tableau-Project/main/test_group.csv'
df_test = pd.read_csv(urlTest, sep = ";")

In [146]:
df_control.head()

Unnamed: 0,Campaign Name,Date,Spend [USD],# of Impressions,Reach,# of Website Clicks,# of Searches,# of View Content,# of Add to Cart,# of Purchase
0,Control Campaign,1.08.2019,2280,82702.0,56930.0,7016.0,2290.0,2159.0,1819.0,618.0
1,Control Campaign,2.08.2019,1757,121040.0,102513.0,8110.0,2033.0,1841.0,1219.0,511.0
2,Control Campaign,3.08.2019,2343,131711.0,110862.0,6508.0,1737.0,1549.0,1134.0,372.0
3,Control Campaign,4.08.2019,1940,72878.0,61235.0,3065.0,1042.0,982.0,1183.0,340.0
4,Control Campaign,5.08.2019,1835,,,,,,,


In [147]:
df_test.head()

Unnamed: 0,Campaign Name,Date,Spend [USD],# of Impressions,Reach,# of Website Clicks,# of Searches,# of View Content,# of Add to Cart,# of Purchase
0,Test Campaign,1.08.2019,3008,39550,35820,3038,1946,1069,894,255
1,Test Campaign,2.08.2019,2542,100719,91236,4657,2359,1548,879,677
2,Test Campaign,3.08.2019,2365,70263,45198,7885,2572,2367,1268,578
3,Test Campaign,4.08.2019,2710,78451,25937,4216,2216,1437,566,340
4,Test Campaign,5.08.2019,2297,114295,95138,5863,2106,858,956,768


Both DataFrames appear to have the same columns, and each has a campaign name column, so let's combine them into one

In [148]:
df = pd.concat([df_control, df_test], ignore_index=True)
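The assumption that both frames share the same columns can also be checked explicitly before concatenating. A minimal sketch, using two small stand-in frames (the `left` and `right` names and their values are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-ins for df_control and df_test
left = pd.DataFrame({"Campaign Name": ["Control Campaign"], "Spend [USD]": [2280]})
right = pd.DataFrame({"Campaign Name": ["Test Campaign"], "Spend [USD]": [3008]})

# Stop early if the column names or their order differ
if list(left.columns) != list(right.columns):
    raise ValueError("Column layouts differ")

# ignore_index=True rebuilds a continuous 0..n-1 index
combined = pd.concat([left, right], ignore_index=True)
print(combined.shape)
```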

Let's sample some rows to see if everything seems normal

#### Sampling rows

In [149]:
df.sample(10)

Unnamed: 0,Campaign Name,Date,Spend [USD],# of Impressions,Reach,# of Website Clicks,# of Searches,# of View Content,# of Add to Cart,# of Purchase
26,Control Campaign,27.08.2019,2061,104678.0,91579.0,4941.0,3549.0,3249.0,980.0,605.0
20,Control Campaign,21.08.2019,1803,74654.0,59873.0,5691.0,2711.0,2496.0,1460.0,800.0
6,Control Campaign,7.08.2019,2544,142123.0,127852.0,2640.0,1388.0,1106.0,1166.0,499.0
33,Test Campaign,4.08.2019,2710,78451.0,25937.0,4216.0,2216.0,1437.0,566.0,340.0
32,Test Campaign,3.08.2019,2365,70263.0,45198.0,7885.0,2572.0,2367.0,1268.0,578.0
36,Test Campaign,7.08.2019,2838,53986.0,42148.0,4221.0,2733.0,2182.0,1301.0,890.0
19,Control Campaign,20.08.2019,2675,113430.0,78625.0,2578.0,1001.0,848.0,1709.0,299.0
55,Test Campaign,26.08.2019,2311,80841.0,61589.0,3820.0,2037.0,1046.0,346.0,284.0
34,Test Campaign,5.08.2019,2297,114295.0,95138.0,5863.0,2106.0,858.0,956.0,768.0
31,Test Campaign,2.08.2019,2542,100719.0,91236.0,4657.0,2359.0,1548.0,879.0,677.0


looks OK to me :)

### Understanding the Data

Basic details about the dataset

#### Data Shape

In [150]:
print(f'This Dataframe has {df.shape[0]} rows over {df.shape[1]} columns')

This Dataframe has 60 rows over 10 columns


### Cleanup

In the following section we'll validate data types and deal with empty and duplicated rows

#### Validating Datatypes

In [151]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Campaign Name        60 non-null     object 
 1   Date                 60 non-null     object 
 2   Spend [USD]          60 non-null     int64  
 3   # of Impressions     59 non-null     float64
 4   Reach                59 non-null     float64
 5   # of Website Clicks  59 non-null     float64
 6   # of Searches        59 non-null     float64
 7   # of View Content    59 non-null     float64
 8   # of Add to Cart     59 non-null     float64
 9   # of Purchase        59 non-null     float64
dtypes: float64(7), int64(1), object(2)
memory usage: 4.8+ KB


<b>Converting 'Date' column into a valid datetime64 type</b>

In [152]:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
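As an alternative to `dayfirst=True`, the date layout can be spelled out with an explicit format string, which leaves no room for a month-first interpretation. A minimal sketch on a made-up sample in the same day.month.year layout (the `raw_dates` series is hypothetical):

```python
import pandas as pd

# Hypothetical sample in the same day.month.year layout as the Date column
raw_dates = pd.Series(["1.08.2019", "12.08.2019", "30.08.2019"])

# %d = day, %m = month, %Y = four-digit year, separated by dots
parsed = pd.to_datetime(raw_dates, format="%d.%m.%Y")

print(parsed.dt.day.tolist())
print(parsed.dt.month.tolist())
```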

In [153]:
df[['Date']].sample(5)

Unnamed: 0,Date
13,2019-08-14
6,2019-08-07
49,2019-08-20
24,2019-08-25
21,2019-08-22


<b>Observation:</b> Apparently each campaign is measured over 30 different days.

<b>Converting column names:</b> replacing spaces with underscores for SQL analysis later on

In [154]:
df.columns = df.columns.str.replace(" ", "_")

In [155]:
df.head()

Unnamed: 0,Campaign_Name,Date,Spend_[USD],#_of_Impressions,Reach,#_of_Website_Clicks,#_of_Searches,#_of_View_Content,#_of_Add_to_Cart,#_of_Purchase
0,Control Campaign,2019-08-01,2280,82702.0,56930.0,7016.0,2290.0,2159.0,1819.0,618.0
1,Control Campaign,2019-08-02,1757,121040.0,102513.0,8110.0,2033.0,1841.0,1219.0,511.0
2,Control Campaign,2019-08-03,2343,131711.0,110862.0,6508.0,1737.0,1549.0,1134.0,372.0
3,Control Campaign,2019-08-04,1940,72878.0,61235.0,3065.0,1042.0,982.0,1183.0,340.0
4,Control Campaign,2019-08-05,1835,,,,,,,


Renaming the columns to more readable names

In [156]:
df = df.rename(columns={
    'Campaign_Name': 'Campaign',
    'Spend_[USD]': 'Spend_USD',
    '#_of_Impressions': 'Impressions',
    '#_of_Website_Clicks': 'Website_Clicks',
    '#_of_Searches': 'Searches',
    '#_of_View_Content': 'View_Content',
    '#_of_Add_to_Cart': 'Add_to_Cart',
    '#_of_Purchase': 'Purchases'
})
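The two renaming steps could also be expressed as a single regex-based helper applied to the raw column names. This is only a sketch: `clean_column` is a hypothetical helper, and its output differs slightly from the manual mapping above (it yields `Campaign_Name` and `Purchase` rather than `Campaign` and `Purchases`).

```python
import re

def clean_column(name: str) -> str:
    """Hypothetical helper: normalise one raw column name."""
    name = re.sub(r"^#\s*of\s+", "", name)    # drop a leading "# of "
    name = re.sub(r"[\[\]]", "", name)         # drop square brackets
    return re.sub(r"\s+", "_", name.strip())   # whitespace -> underscore

raw_columns = ["Campaign Name", "Spend [USD]", "# of Website Clicks", "# of Purchase"]
print([clean_column(c) for c in raw_columns])
```

With a DataFrame, the helper could be applied as `df.columns = [clean_column(c) for c in df.columns]`.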

In [157]:
df.head()

Unnamed: 0,Campaign,Date,Spend_USD,Impressions,Reach,Website_Clicks,Searches,View_Content,Add_to_Cart,Purchases
0,Control Campaign,2019-08-01,2280,82702.0,56930.0,7016.0,2290.0,2159.0,1819.0,618.0
1,Control Campaign,2019-08-02,1757,121040.0,102513.0,8110.0,2033.0,1841.0,1219.0,511.0
2,Control Campaign,2019-08-03,2343,131711.0,110862.0,6508.0,1737.0,1549.0,1134.0,372.0
3,Control Campaign,2019-08-04,1940,72878.0,61235.0,3065.0,1042.0,982.0,1183.0,340.0
4,Control Campaign,2019-08-05,1835,,,,,,,


#### Null Values

In [158]:
df.isnull().sum()

Campaign          0
Date              0
Spend_USD         0
Impressions       1
Reach             1
Website_Clicks    1
Searches          1
View_Content      1
Add_to_Cart       1
Purchases         1
dtype: int64

Every metric column except Spend_USD contains exactly one null value. Let's check whether all of these nulls belong to a single row

In [159]:
df_invalid_rows = df[df['Reach'].isnull()]

In [160]:
df_invalid_rows.head()

Unnamed: 0,Campaign,Date,Spend_USD,Impressions,Reach,Website_Clicks,Searches,View_Content,Add_to_Cart,Purchases
4,Control Campaign,2019-08-05,1835,,,,,,,


As expected, this single row accounts for all the null values we saw before.
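Filtering on `Reach` alone assumes the other nulls sit in the same row. A slightly more general check counts every row that has at least one null in any column. A minimal sketch on a made-up frame (the `sample` data is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which one row is missing several metrics
sample = pd.DataFrame({
    "Spend_USD": [2280, 1757, 1835],
    "Reach": [56930.0, 102513.0, np.nan],
    "Purchases": [618.0, 511.0, np.nan],
})

# Boolean mask: True for rows with a null in any column
has_null = sample.isnull().any(axis=1)
rows_with_nulls = sample[has_null]

print(len(rows_with_nulls))             # number of affected rows
print(rows_with_nulls.index.tolist())   # their index labels
```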

This is a small fraction of the data (1/60 ≈ 1.7%), so removing it should have little impact on the statistical analysis.

The missing values include the KPI itself, and any attempt to fill them in synthetically (by mean or median) could introduce bias into the A/B test, so the row is dropped rather than imputed.

<b>Deleting the row with the null values</b>

In [161]:
df = df.dropna()

Checking again if null values exist

In [162]:
df.isnull().sum()

Campaign          0
Date              0
Spend_USD         0
Impressions       0
Reach             0
Website_Clicks    0
Searches          0
View_Content      0
Add_to_Cart       0
Purchases         0
dtype: int64

We're good to go

#### Duplicate Rows

In [163]:
df.nunique()

Campaign           2
Date              30
Spend_USD         59
Impressions       59
Reach             59
Website_Clicks    59
Searches          58
View_Content      56
Add_to_Cart       59
Purchases         55
dtype: int64

In [164]:
df.duplicated().sum()

0

No exact duplicate rows were found in the dataset 
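`df.duplicated()` only flags rows that match in every column. Since each campaign should have one record per day, it can also be worth checking for repeats on the (Campaign, Date) pair alone. A minimal sketch on made-up data (the `sample` frame is hypothetical):

```python
import pandas as pd

# Hypothetical frame: one record per campaign per day
sample = pd.DataFrame({
    "Campaign": ["Control Campaign", "Control Campaign", "Test Campaign"],
    "Date": pd.to_datetime(["2019-08-01", "2019-08-02", "2019-08-01"]),
    "Purchases": [618.0, 511.0, 255.0],
})

# Count rows that repeat an earlier (Campaign, Date) combination
key_duplicates = sample.duplicated(subset=["Campaign", "Date"]).sum()
print(key_duplicates)
```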

## Statistical Analysis: Independent Samples t-Test

In this part of the project, we aim to statistically evaluate whether there is a significant difference between the control group and the test group in terms of a selected performance indicator (KPI).

While the dataset does not specify a single target metric or KPI, I have chosen to focus on the <b>number of purchases (Purchases)</b> as the main outcome variable of interest. This KPI represents direct conversions and is commonly used to assess the effectiveness of marketing campaigns.

To perform this analysis, we will use the independent samples t-test (also known as a two-sample t-test). This test is appropriate when comparing the means of two independent groups (control vs. test) on a continuous numerical variable. I selected a significance level <b>(α) of 0.05</b> to limit the probability of a Type I error to 5%.

<b>Null Hypothesis (H₀):</b> The mean number of purchases in the Control Campaign equals the mean number of purchases in the Test Campaign

<b>Alternative Hypothesis (H₁):</b> The mean number of purchases in the Control Campaign differs from the mean number of purchases in the Test Campaign.

Checking the number of observations from each campaign

In [165]:
df['Campaign'].value_counts()

Campaign
Test Campaign       30
Control Campaign    29
Name: count, dtype: int64

The Test Campaign has 30 observations and the Control Campaign has 29, since one control row was dropped during cleanup

Separating the dataset into two groups, one for each campaign

In [166]:
df_control = df[df['Campaign'] == 'Control Campaign']['Purchases']
df_test = df[df['Campaign'] == 'Test Campaign']['Purchases']

In [167]:
df_control.head()

0    618.0
1    511.0
2    372.0
3    340.0
5    764.0
Name: Purchases, dtype: float64
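Before running the test, it can help to look at the count, mean and standard deviation of each group side by side. A minimal sketch using `groupby` on made-up values (the `sample` frame is hypothetical and much smaller than the real dataset):

```python
import pandas as pd

# Hypothetical data: three observations per campaign
sample = pd.DataFrame({
    "Campaign": ["Control Campaign"] * 3 + ["Test Campaign"] * 3,
    "Purchases": [618.0, 511.0, 372.0, 255.0, 677.0, 578.0],
})

# One row per campaign with sample size, mean and sample standard deviation
summary = sample.groupby("Campaign")["Purchases"].agg(["count", "mean", "std"])
print(summary)
```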

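Performing the t-test. Passing equal_var=False runs Welch's variant of the test, which does not assume equal variances in the two groups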

In [168]:
t_stat, p_value = stats.ttest_ind(df_control, df_test, equal_var=False)
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

T-statistic: 0.030212884995111548
P-value: 0.9760037958073526


Since the p‑value (0.9760) is much greater than our α = 0.05, <b>we fail to reject the null hypothesis</b>. There is no statistically significant difference in mean Purchases between the Control Campaign and the Test Campaign at the 5% significance level.
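To see what `stats.ttest_ind(..., equal_var=False)` computes, Welch's statistic can be reproduced by hand: the difference in means divided by the unpooled standard error, with degrees of freedom from the Welch-Satterthwaite approximation. A minimal sketch on made-up arrays (`a` and `b` are hypothetical samples, not the campaign data):

```python
import numpy as np
from scipy import stats

# Hypothetical samples of unequal size
a = np.array([618.0, 511.0, 372.0, 340.0, 764.0])
b = np.array([255.0, 677.0, 578.0, 340.0, 768.0, 890.0])

# Per-group variance of the mean (sample variance / n)
vm_a = a.var(ddof=1) / len(a)
vm_b = b.var(ddof=1) / len(b)

# Welch's t: difference in means over the unpooled standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(vm_a + vm_b)

# Welch-Satterthwaite degrees of freedom
dof = (vm_a + vm_b) ** 2 / (vm_a ** 2 / (len(a) - 1) + vm_b ** 2 / (len(b) - 1))

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# Compare with SciPy's implementation
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```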

Exporting the DataFrame to a CSV file for further analysis using SQL

In [171]:
# Write the cleaned data to disk; index=False omits the row index column
df.to_csv('AB_data_for_Analysis.csv', index=False)
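Writing a DataFrame to CSV and reading it back can be sketched as follows. An in-memory buffer stands in for the file here, and the `sample` frame is hypothetical. Note that CSV stores dates as plain text, so `parse_dates` is needed to get a datetime column back.

```python
import io
import pandas as pd

# Hypothetical cleaned frame
sample = pd.DataFrame({
    "Campaign": ["Control Campaign", "Test Campaign"],
    "Date": pd.to_datetime(["2019-08-01", "2019-08-01"]),
    "Purchases": [618.0, 255.0],
})

# Write without the row index, then rewind the buffer for reading
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# Read back, asking pandas to parse the Date column as datetimes
restored = pd.read_csv(buffer, parse_dates=["Date"])
print(restored.dtypes)
```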