### Data dictionary:

Index: Row index

user id: Unique user ID

test group: "ad" if the person saw the advertisement, "psa" if they saw only the public service announcement

converted: True if the person bought the product, otherwise False

total ads: Number of ads the person saw

most ads day: Day of the week on which the person saw the most ads

most ads hour: Hour of the day (0 to 23) in which the person saw the most ads

In [5]:
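import pandas as pd

# Load the data and drop the redundant index column written by an earlier export.
df = pd.read_csv('marketing_AB.csv').drop(columns='Unnamed: 0')

# Store the boolean conversion flag as 0/1 so that its mean is the conversion rate.
df['converted'] = df['converted'].astype(int)
df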

Unnamed: 0,user id,test group,converted,total ads,most ads day,most ads hour
0,1069124,ad,0,130,Monday,20
1,1119715,ad,0,93,Tuesday,22
2,1144181,ad,0,21,Tuesday,18
3,1435133,ad,0,355,Tuesday,10
4,1015700,ad,0,276,Friday,14
...,...,...,...,...,...,...
588096,1278437,ad,0,1,Tuesday,23
588097,1327975,ad,0,1,Tuesday,23
588098,1038442,ad,0,3,Tuesday,23
588099,1496395,ad,0,1,Tuesday,23


In [6]:
from summarytools import dfSummary
dfSummary(df)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,user id [int64],Mean (sd) : 1310692.2 (202226.0) min < med < max: 900000.0 < 1313725.0 < 1654483.0 IQR (CV) : 340898.0 (6.5),"588,101 distinct values",,0 (0.0%)
2,test group [object],1. ad 2. psa,"564,577 (96.0%) 23,524 (4.0%)",,0 (0.0%)
3,converted [int64],1. 0 2. 1,"573,258 (97.5%) 14,843 (2.5%)",,0 (0.0%)
4,total ads [int64],Mean (sd) : 24.8 (43.7) min < med < max: 1.0 < 13.0 < 2065.0 IQR (CV) : 23.0 (0.6),807 distinct values,,0 (0.0%)
5,most ads day [object],1. Friday 2. Monday 3. Sunday 4. Thursday 5. Saturday 6. Wednesday 7. Tuesday,"92,608 (15.7%) 87,073 (14.8%) 85,391 (14.5%) 82,982 (14.1%) 81,660 (13.9%) 80,908 (13.8%) 77,479 (13.2%)",,0 (0.0%)
6,most ads hour [int64],Mean (sd) : 14.5 (4.8) min < med < max: 0.0 < 14.0 < 23.0 IQR (CV) : 7.0 (3.0),24 distinct values,,0 (0.0%)


In [8]:
treatment_count = df[df['test group'] == 'ad'].shape[0]
control_count = df[df['test group'] == 'psa'].shape[0]

print(f"Number of treatment IDs: {treatment_count}")
print(f"Number of control IDs: {control_count}")

Number of treatment IDs: 564577
Number of control IDs: 23524
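
The counts above show that the "ad" group is roughly 24 times larger than the "psa" group. A per-group conversion rate is a natural companion to these counts. The following is a minimal sketch on a toy frame: the rows are invented for illustration, and only the column names `test group` and `converted` come from the dataset.

```python
import pandas as pd

# Toy stand-in for df; the values are invented for illustration only.
toy = pd.DataFrame({
    "test group": ["ad", "ad", "ad", "ad", "psa", "psa"],
    "converted": [1, 0, 0, 1, 0, 1],
})

# Group size and conversion rate per test group in a single aggregation.
group_summary = toy.groupby("test group")["converted"].agg(
    n="count", conversion_rate="mean"
)
print(group_summary)
```

Because `converted` is stored as 0/1, the mean of the column within each group is that group's conversion rate.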


Balance the groups by downsampling the treatment group to the size of the control group, then combine both into a single DataFrame.

In [26]:
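treatment_group = df[df['test group'] == 'ad']
control_group = df[df['test group'] == 'psa']

# Downsample the treatment group to the control group's size (23,524 rows).
# No random_state is set, so a rerun draws different rows.
treatment_group = treatment_group.sample(n=len(control_group))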

print(treatment_group.info())
print('')
print(control_group.info())

balanced_df = pd.concat([treatment_group, control_group], axis=0)
#balanced_df = balanced_df.groupby('test group')['converted'].mean()
balanced_df

<class 'pandas.core.frame.DataFrame'>
Index: 23524 entries, 186287 to 201987
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   user id        23524 non-null  int64 
 1   test group     23524 non-null  object
 2   converted      23524 non-null  int64 
 3   total ads      23524 non-null  int64 
 4   most ads day   23524 non-null  object
 5   most ads hour  23524 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 1.3+ MB
None

<class 'pandas.core.frame.DataFrame'>
Index: 23524 entries, 18 to 588081
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   user id        23524 non-null  int64 
 1   test group     23524 non-null  object
 2   converted      23524 non-null  int64 
 3   total ads      23524 non-null  int64 
 4   most ads day   23524 non-null  object
 5   most ads hour  23524 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 

Unnamed: 0,user id,test group,converted,total ads,most ads day,most ads hour
186287,1380967,ad,0,33,Monday,0
274517,1631043,ad,0,14,Wednesday,11
71801,1558982,ad,0,23,Saturday,14
42948,1465873,ad,0,94,Sunday,16
435376,1188726,ad,0,1,Saturday,13
...,...,...,...,...,...,...
588052,900959,psa,0,16,Tuesday,22
588063,902828,psa,0,3,Tuesday,22
588066,914578,psa,0,1,Tuesday,22
588069,909042,psa,0,6,Tuesday,22
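
Since the downsampling is random, the sampled rows (and so the test statistic computed from them) can change from run to run. One possible way to make the step repeatable is to pass a seed to `DataFrame.sample`. This is a sketch on toy data, not the notebook's actual run: the rows are invented, and the seed value 42 is an arbitrary assumption.

```python
import pandas as pd

# Toy stand-in for df; the values are invented for illustration only.
toy = pd.DataFrame({
    "test group": ["ad"] * 8 + ["psa"] * 2,
    "converted": [0, 1, 0, 0, 1, 0, 0, 0, 0, 1],
})

treatment = toy[toy["test group"] == "ad"]
control = toy[toy["test group"] == "psa"]

# Take the sample size from the control group and fix the seed
# (42 is an arbitrary choice) so that reruns select the same rows.
treatment_sample = treatment.sample(n=len(control), random_state=42)

balanced = pd.concat([treatment_sample, control], axis=0)
print(balanced["test group"].value_counts())
```

With a fixed `random_state`, repeating the sampling call returns the same index labels, which makes downstream results easier to compare across runs.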


In [27]:
from scipy.stats import chi2_contingency

# Create a contingency table
contingency_table = pd.crosstab(balanced_df['test group'], balanced_df['converted'])

# Perform the chi-square test
chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-square: {chi2}, P-value: {p}")

Chi-square: 37.880253824674476, P-value: 7.522276275604638e-10
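
The small p-value indicates that conversion and test group are unlikely to be independent, but it does not say how large the difference is. The sketch below shows one way to read conversion rates and relative lift off a contingency table of the same shape. The counts are hypothetical and are not the notebook's actual table. It also compares the default call with `correction=False`, because `chi2_contingency` applies Yates' continuity correction by default when the table has one degree of freedom.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical 2x2 counts for illustration; not the notebook's actual table.
table = pd.DataFrame(
    {0: [970, 985], 1: [30, 15]},
    index=pd.Index(["ad", "psa"], name="test group"),
)
table.columns.name = "converted"

# Conversion rate per group: converted count divided by the row total.
rates = table[1] / table.sum(axis=1)

# Relative lift of the ad group over the psa group.
lift = rates["ad"] / rates["psa"] - 1

# Default call (Yates' correction applied for a 2x2 table) versus uncorrected.
chi2_corr, p_corr, dof, _ = chi2_contingency(table)
chi2_raw, p_raw, _, _ = chi2_contingency(table, correction=False)

print(rates)
print(f"Relative lift: {lift:.1%}")
print(f"Corrected chi2: {chi2_corr:.3f}, uncorrected chi2: {chi2_raw:.3f}")
```

Reporting the rates next to the test result keeps the statistical significance and the practical size of the effect visible together.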


