# A/B Test Challenge



---

#### What is an A/B Test? 

An A/B test is a research methodology that supports decision making by letting you measure the impact of a change in a product (e.g. a digital product). For this challenge you will analyse the data from an A/B test performed on a digital product in which a new set of sponsored ads was included.


#### Measure of success

Metrics are needed to measure the success of your product. They are typically split into the following categories: 

- __Engagement-based metrics:__ number of users, number of downloads, number of active users, user retention, etc.

- __Revenue and monetization metrics:__ ads and affiliate links, subscription-based, in-app purchases, etc.

- __Technical metrics:__ service level indicators (uptime of the app, downtime of the app, latency).



In [4]:
import numpy as np
import pandas as pd

from statsmodels.stats.weightstats import ztest
from scipy import stats

import seaborn as sns
import matplotlib.pylab as plt



---

## Metrics understanding

In this part you must analyse the metrics involved in the test. We will focus on the following metrics:

- Activity level + Daily active users (DAU).

- Click-through rate (CTR)

### Activity level

In the following part you must perform every calculation you consider necessary in order to answer the following questions:

- How many activity levels can you find in the dataset? (An activity level of zero means no activity.)

- How many users are there at each activity level?

- How many activity levels are there per day, and how many records are there for each activity level?

At the end of this section you must provide your conclusions about the _activity level_ of the users.

__Dataset:__ `activity_pretest.csv`

In [2]:
# your-code: 1st unique, 2nd value_counts or groupby, 3rd
# First, read the DataFrame to see what it contains
df_acti_pretest = pd.read_csv('./abtest/activity_pretest.csv')
df_acti_pretest



Unnamed: 0,userid,dt,activity_level
0,a5b70ae7-f07c-4773-9df4-ce112bc9dc48,2021-10-01,0
1,d2646662-269f-49de-aab1-8776afced9a3,2021-10-01,0
2,c4d1cfa8-283d-49ad-a894-90aedc39c798,2021-10-01,0
3,6889f87f-5356-4904-a35a-6ea5020011db,2021-10-01,0
4,dbee604c-474a-4c9d-b013-508e5a0e3059,2021-10-01,0
...,...,...,...
1859995,200d65e6-b1ce-4a47-8c2b-946db5c5a3a0,2021-10-31,20
1859996,535dafe4-de7c-4b56-acf6-aa94f21653bc,2021-10-31,20
1859997,0428ca3c-e666-4ef4-8588-3a2af904a123,2021-10-31,20
1859998,a8cd1579-44d4-48b3-b3d6-47ae5197dbc6,2021-10-31,20


In [3]:
# How many activity levels can you find in the dataset? (An activity level of zero means no activity.)
df_acti_pretest['activity_level'].unique()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20])

In [4]:
df_acti_pretest['userid'].nunique()


60000

In [5]:
# How many users are there at each activity level?
# Note: count() returns records (user-day rows); nunique() would return distinct users.
df_acti_pretest.groupby('activity_level')['userid'].count()

activity_level
0     909125
1      48732
2      49074
3      48659
4      48556
5      49227
6      48901
7      48339
8      48396
9      48820
10     48943
11     48832
12     48911
13     48534
14     48620
15     48599
16     48934
17     48395
18     48982
19     48901
20     24520
Name: userid, dtype: int64

In [6]:
# How many activity levels are there per day, and how many records for each activity level?
df_acti_pretest.groupby(['dt'])[['activity_level']].nunique().reset_index()

Unnamed: 0,dt,activity_level
0,2021-10-01,21
1,2021-10-02,21
2,2021-10-03,21
3,2021-10-04,21
4,2021-10-05,21
5,2021-10-06,21
6,2021-10-07,21
7,2021-10-08,21
8,2021-10-09,21
9,2021-10-10,21
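
The cell above answers only the first half of the third question (distinct levels per day). A minimal sketch of the second half, records per day and per activity level, is shown below on a small made-up frame. The `toy` DataFrame and its values are hypothetical stand-ins that reuse the column names of `activity_pretest.csv`.

```python
import pandas as pd

# Hypothetical stand-in for activity_pretest.csv (same column names, made-up rows).
toy = pd.DataFrame({
    "userid": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "dt": ["2021-10-01"] * 3 + ["2021-10-02"] * 3,
    "activity_level": [0, 2, 2, 1, 0, 2],
})

# Distinct users observed at least once at each activity level.
users_per_level = toy.groupby("activity_level")["userid"].nunique()

# Records per day (rows) and activity level (columns).
records_per_day_level = pd.crosstab(toy["dt"], toy["activity_level"])

print(users_per_level)
print(records_per_day_level)
```

Applied to the real data, the same `pd.crosstab` call on `df_acti_pretest` would give one row per day and one column per activity level.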


### Daily active users (DAU)

![ab_test](./img/user_activity_ab_testing.JPG)


Daily active users (DAU) is the number of users who are active on a given day (an activity level of zero means no activity). Calculate this metric and provide your insights about it.

__Dataset:__ `activity_pretest.csv`
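
As a rough illustration of the definition, the sketch below computes DAU on a made-up frame. The `toy` data is hypothetical; only the column names match the dataset. It uses `nunique()` rather than `count()` so that a user who happened to appear twice on the same day would still be counted once, which is an assumption about how duplicates should be treated.

```python
import pandas as pd

# Hypothetical rows; column names follow activity_pretest.csv.
toy = pd.DataFrame({
    "userid": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "dt": ["2021-10-01"] * 3 + ["2021-10-02"] * 3,
    "activity_level": [0, 2, 5, 1, 0, 0],
})

# A user is active on a day when activity_level is greater than zero.
active = toy[toy["activity_level"] > 0]

# DAU: distinct active users per day.
dau = active.groupby("dt")["userid"].nunique()

print(dau)
print(dau.describe())  # spread of DAU across days
```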

In [7]:
df_acti_pretest.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1860000 entries, 0 to 1859999
Data columns (total 3 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   userid          object
 1   dt              object
 2   activity_level  int64 
dtypes: int64(1), object(2)
memory usage: 42.6+ MB


In [13]:
# your-code
#activity_pretest = df_acti_pretest[df_acti_pretest['activity_level']!=0]
dau_pretest = df_acti_pretest.loc[df_acti_pretest['activity_level'] != 0,:].groupby('dt').count()['userid']
dau_pretest



dt
2021-10-01    30634
2021-10-02    30775
2021-10-03    30785
2021-10-04    30599
2021-10-05    30588
2021-10-06    30639
2021-10-07    30637
2021-10-08    30600
2021-10-09    30902
2021-10-10    30581
2021-10-11    30489
2021-10-12    30715
2021-10-13    30761
2021-10-14    30716
2021-10-15    30637
2021-10-16    30708
2021-10-17    30741
2021-10-18    30694
2021-10-19    30587
2021-10-20    30795
2021-10-21    30705
2021-10-22    30573
2021-10-23    30645
2021-10-24    30815
2021-10-25    30616
2021-10-26    30673
2021-10-27    30661
2021-10-28    30734
2021-10-29    30723
2021-10-30    30628
2021-10-31    30519
Name: userid, dtype: int64

### Click-through rate (CTR)

![ab_test](./img/ad_click_through_rate_ab_testing.JPG)

Click-through rate (CTR) is the percentage of ads shown to a user on a given day that the user clicked. Analyse this metric (e.g. average CTR per day) and provide your insights about it.

__Dataset:__ `ctr_pretest.csv`
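
To make the definition concrete, here is a small worked example. The `clicks` and `impressions` columns are hypothetical: `ctr_pretest.csv` already stores the resulting percentage in its `ctr` column, so this only sketches how such a value could be derived and then averaged per day.

```python
import pandas as pd

# Hypothetical raw counts per user and day (not part of the provided datasets).
raw = pd.DataFrame({
    "userid": ["u1", "u2", "u1", "u2"],
    "dt": ["2021-10-01", "2021-10-01", "2021-10-02", "2021-10-02"],
    "clicks": [3, 1, 2, 4],
    "impressions": [10, 4, 8, 10],
})

# CTR as a percentage: clicks divided by ads shown, times 100.
raw["ctr"] = 100 * raw["clicks"] / raw["impressions"]

# Average CTR per day: the unweighted mean of the per-user CTRs.
avg_ctr_per_day = raw.groupby("dt")["ctr"].mean()

print(avg_ctr_per_day)
```

Note that the unweighted mean treats every user equally regardless of how many ads they saw; weighting by impressions would be a different metric.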

In [15]:
# your-code
ctr_pretest = pd.read_csv('./abtest/ctr_pretest.csv')
ctr_pretest


Unnamed: 0,userid,dt,ctr
0,4b328144-df4b-47b1-a804-09834942dce0,2021-10-01,34.28
1,34ace777-5e9d-40b3-a859-4145d0c35c8d,2021-10-01,34.67
2,8028cccf-19c3-4c0e-b5b2-e707e15d2d83,2021-10-01,34.77
3,652b3c9c-5e29-4bf0-9373-924687b1567e,2021-10-01,35.42
4,45b57434-4666-4b57-9798-35489dc1092a,2021-10-01,35.04
...,...,...,...
950870,a09a3687-b71a-4a67-b1ef-9b05c9770c4c,2021-10-31,32.33
950871,c843a595-b94c-42e1-b2fe-ec096070681e,2021-10-31,30.09
950872,edcdf0c1-3d8f-47e8-b7dd-05505749eb69,2021-10-31,35.71
950873,76b7a9ae-98fa-4c77-869d-594a4ef7282d,2021-10-31,34.76


In [16]:
ctr_promedio = ctr_pretest[['dt','ctr']].groupby('dt').mean()
ctr_promedio

dt,ctr
2021-10-01,32.993446
2021-10-02,32.991664
2021-10-03,32.995086
2021-10-04,32.992995
2021-10-05,33.004375
2021-10-06,33.018564
2021-10-07,32.9885
2021-10-08,32.998654
2021-10-09,33.005082
2021-10-10,33.007134


---

## Pretest metrics 

In this section you will analyse the metrics using the dataset that includes the results for the test and control groups, but only for the pretest data (i.e. prior to November 1st, 2021). You must provide insights about the metrics (__Activity level__, __DAU__ and __CTR__) and also perform a hypothesis test to determine whether there is any statistically significant difference between the groups prior to the start of the experiment. You must try different approaches (i.e. __z-test__ and __t-test__) and compare the results.


__Datasets:__ `activity_all.csv`, `ctr_all.csv`

In [5]:
# your-code
activity_all = pd.read_csv('./abtest/activity_all.csv')
activity_all
ctr_all = pd.read_csv('./abtest/ctr_all.csv')
ctr_all


Unnamed: 0,userid,dt,groupid,ctr
0,60389fa7-2d71-4cdf-831c-c2bb277ffa1e,2021-11-13,0,31.81
1,b59cb225-d160-4851-92d2-7cc8120a2f63,2021-11-13,0,30.46
2,aa336050-934e-453f-a5b0-dd881fcd114e,2021-11-13,0,34.25
3,8df767f4-a10f-4322-a722-676b7e02b372,2021-11-13,0,34.92
4,a74762ed-4da0-42ab-91d2-40d7e808dfe9,2021-11-13,0,34.95
...,...,...,...,...
2303403,932e0348-ea2d-4b98-8782-aa84420f0796,2021-11-12,1,37.27
2303404,6775a825-6d3d-4dc3-9335-cad061736752,2021-11-12,1,39.14
2303405,a7b55365-21f1-4123-b2b5-485a8c7b98da,2021-11-12,1,40.05
2303406,a6fa937c-6f40-4f04-b15b-f1de09e179db,2021-11-12,1,38.14


In [20]:
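# Note: this cutoff keeps data through 2021-10-30 only; use `< '2021-11-01'` to include October 31 in the pretest window.
acti_pre = activity_all.loc[activity_all['dt'] <= '2021-10-30',:]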
acti_pre1 = acti_pre.loc[acti_pre['groupid'] == 1,:]
acti_pre2 = acti_pre.loc[acti_pre['groupid'] == 0,:]
print(acti_pre1.info())
print(acti_pre2.info())

<class 'pandas.core.frame.DataFrame'>
Index: 901470 entries, 2 to 3624660
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   userid          901470 non-null  object
 1   dt              901470 non-null  object
 2   groupid         901470 non-null  int64 
 3   activity_level  901470 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 34.4+ MB
None
<class 'pandas.core.frame.DataFrame'>
Index: 898530 entries, 0 to 3624659
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   userid          898530 non-null  object
 1   dt              898530 non-null  object
 2   groupid         898530 non-null  int64 
 3   activity_level  898530 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 34.3+ MB
None


In [25]:
dau_pre1 = acti_pre1.loc[acti_pre1['activity_level'] != 0,:].groupby('dt').count()['userid']
dau_pre2 = acti_pre2.loc[acti_pre2['activity_level'] != 0,:].groupby('dt').count()['userid']
dau_pre1_mean = dau_pre1.mean()
dau_pre1_mean
dau_pre2_mean = dau_pre2.mean()
dau_pre2_mean

15324.633333333333

In [26]:
# One-sample z-test of the group 1 DAU series against its own mean.
# The statistic is 0 by construction, so this does not compare the groups;
# a two-sample call such as ztest(dau_pre1, dau_pre2) would.
z_stat, p_value = ztest(dau_pre1, value=dau_pre1_mean)

# Print the results
print(f"Z statistic: {z_stat}")
print(f"p-value: {p_value}")

# Decide based on the p-value
alpha = 0.05
if p_value < alpha:
    print("We reject the null hypothesis")
else:
    print("We cannot reject the null hypothesis")

Z statistic: 0.0
p-value: 1.0
We cannot reject the null hypothesis


In [28]:
# One-sample z-test of the group 0 DAU series against its own mean.
# As with group 1, the statistic is 0 by construction and says nothing about group differences.
z_stat, p_value = ztest(dau_pre2, value=dau_pre2_mean)

# Print the results
print(f"Z statistic: {z_stat}")
print(f"p-value: {p_value}")

# Decide based on the p-value
alpha = 0.05
if p_value < alpha:
    print("We reject the null hypothesis")
else:
    print("We cannot reject the null hypothesis")

Z statistic: 0.0
p-value: 1.0
We cannot reject the null hypothesis
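
Both tests above compare a series with its own mean, so a result of z = 0 and p = 1 is guaranteed and carries no information about the groups. A minimal sketch of a two-sample comparison follows, run on synthetic DAU values (the numbers are generated, not taken from the dataset). The z statistic is computed by hand with an unpooled standard error, and Welch's t-test comes from `scipy.stats.ttest_ind`. The `ztest` function already imported in this notebook also accepts a second sample as its second argument.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic daily DAU for two groups over 30 days (made-up values for illustration).
dau_test = rng.normal(loc=15300, scale=60, size=30)
dau_control = rng.normal(loc=15300, scale=60, size=30)

# Two-sample z-test: difference in means over its unpooled standard error,
# with the p-value taken from the standard normal distribution.
diff = dau_test.mean() - dau_control.mean()
se = np.sqrt(dau_test.var(ddof=1) / len(dau_test)
             + dau_control.var(ddof=1) / len(dau_control))
z_stat = diff / se
z_p = 2 * stats.norm.sf(abs(z_stat))

# Welch's t-test: same statistic, p-value from a t distribution instead.
t_stat, t_p = stats.ttest_ind(dau_test, dau_control, equal_var=False)

print(f"z = {z_stat:.3f}, p = {z_p:.3f}")
print(f"t = {t_stat:.3f}, p = {t_p:.3f}")
```

With about 30 daily observations per group the two p-values differ only slightly; the t-test is the more conservative of the two because the t distribution has heavier tails.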


---

## Experiment metrics 

In this section you must perform the same analysis as in the previous section, but using the data generated during the experiment (i.e. after November 1st, 2021). You must provide insights about the metrics (__Activity level__, __DAU__ and __CTR__) and also perform a hypothesis test to determine whether there is any statistically significant difference between the groups during the experiment. You must try different approaches (i.e. __z-test__ and __t-test__) and compare the results.


__Datasets:__ `activity_all.csv`, `ctr_all.csv`
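
One possible way to separate the two periods is sketched below on made-up rows. It assumes the experiment begins on 2021-11-01, so that day is included in the experiment window; adjust `start` if the intended boundary differs. Converting `dt` to datetimes avoids relying on string comparison.

```python
import pandas as pd

# Hypothetical rows with the same columns as activity_all.csv.
toy = pd.DataFrame({
    "userid": ["u1", "u2", "u1", "u2"],
    "dt": ["2021-10-31", "2021-10-31", "2021-11-01", "2021-11-02"],
    "groupid": [0, 1, 0, 1],
    "activity_level": [3, 0, 2, 4],
})
toy["dt"] = pd.to_datetime(toy["dt"])

# Assumed first day of the experiment.
start = pd.Timestamp("2021-11-01")

pretest = toy[toy["dt"] < start]
experiment = toy[toy["dt"] >= start]

# DAU per group and day during the experiment (active means activity_level > 0).
exp_dau = (
    experiment[experiment["activity_level"] > 0]
    .groupby(["groupid", "dt"])["userid"]
    .nunique()
)
print(exp_dau)
```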

In [6]:
# your-code
df_experimento = activity_all[activity_all['dt'] > '2021-11-01']
df_experimento
ctr_all

Unnamed: 0,userid,dt,groupid,ctr
0,60389fa7-2d71-4cdf-831c-c2bb277ffa1e,2021-11-13,0,31.81
1,b59cb225-d160-4851-92d2-7cc8120a2f63,2021-11-13,0,30.46
2,aa336050-934e-453f-a5b0-dd881fcd114e,2021-11-13,0,34.25
3,8df767f4-a10f-4322-a722-676b7e02b372,2021-11-13,0,34.92
4,a74762ed-4da0-42ab-91d2-40d7e808dfe9,2021-11-13,0,34.95
...,...,...,...,...
2303403,932e0348-ea2d-4b98-8782-aa84420f0796,2021-11-12,1,37.27
2303404,6775a825-6d3d-4dc3-9335-cad061736752,2021-11-12,1,39.14
2303405,a7b55365-21f1-4123-b2b5-485a8c7b98da,2021-11-12,1,40.05
2303406,a6fa937c-6f40-4f04-b15b-f1de09e179db,2021-11-12,1,38.14


In [7]:
exp_group = df_experimento[df_experimento['groupid'] == 1]  
exp_control = df_experimento[df_experimento['groupid'] == 0]   

In [8]:
exp_activity_test = exp_group['activity_level']
exp_activity_control = exp_control['activity_level']


In [9]:
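# DAU per group. Note: without an activity_level != 0 filter this counts every user with a record that day, active or not.
exp_dau_test = exp_group.groupby('dt')['userid'].nunique()
exp_dau_control = exp_control.groupby('dt')['userid'].nunique()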
mean = exp_dau_test.mean()
mean

30049.0

In [None]:
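# Two-sample z-test on experiment DAU: test group (groupid 1) vs control group (groupid 0).
# Only rows with activity_level != 0 count as active users.
active_exp = df_experimento[df_experimento['activity_level'] != 0]
dau_by_group = active_exp.groupby(['groupid', 'dt'])['userid'].nunique()
z_stat, p_value = ztest(dau_by_group.loc[1], dau_by_group.loc[0])

# Print the results
print(f"Z statistic: {z_stat}")
print(f"p-value: {p_value}")

# Decide based on the p-value
alpha = 0.05
if p_value < alpha:
    print("We reject the null hypothesis")
else:
    print("We cannot reject the null hypothesis")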


---

## Conclusions

Please provide your conclusions after the analyses and your recommendation whether we may or may not implement the changes in the digital product.

In [8]:
# your-conclusions
'''
Case 1: we cannot reject H1
Case 2: we cannot reject H1
Case 3: we can reject H1
Case 4: we can reject H1
'''

---