# A/B Test Challenge



---

#### What is an A/B Test? 

An A/B test is a research methodology that supports decision making by measuring the impact of a change in a product (e.g.: a digital product). For this challenge you will analyse the data from an A/B test performed on a digital product in which a new set of sponsored ads was introduced.


#### Measure of success

Metrics are needed to measure the success of your product. They are typically split into the following categories: 

- __Engagement-based metrics:__ number of users, number of downloads, number of active users, user retention, etc.

- __Revenue and monetization metrics:__ ads and affiliate links, subscription-based, in-app purchases, etc.

- __Technical metrics:__ service level indicators (uptime of the app, downtime of the app, latency).



---

## Metrics understanding

In this part you must analyse the metrics involved in the test. We will focus on the following metrics:

- Activity level + Daily active users (DAU).

- Click-through rate (CTR)

### Activity level

In the following part you must perform every calculation you consider necessary in order to answer the following questions:

- How many activity levels can you find in the dataset? (An activity level of zero means no activity.)

- How many users are there at each activity level?

- How many activity levels are there per day, and how many records are there for each activity level?

At the end of this section you must provide your conclusions about the _activity level_ of the users.

__Dataset:__ `activity_pretest.csv`

In [1]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math

In [2]:
#Read file
activity_pretest = pd.read_csv('./data/activity_pretest.csv')
activity_pretest

Unnamed: 0,userid,dt,activity_level
0,a5b70ae7-f07c-4773-9df4-ce112bc9dc48,2021-10-01,0
1,d2646662-269f-49de-aab1-8776afced9a3,2021-10-01,0
2,c4d1cfa8-283d-49ad-a894-90aedc39c798,2021-10-01,0
3,6889f87f-5356-4904-a35a-6ea5020011db,2021-10-01,0
4,dbee604c-474a-4c9d-b013-508e5a0e3059,2021-10-01,0
...,...,...,...
1859995,200d65e6-b1ce-4a47-8c2b-946db5c5a3a0,2021-10-31,20
1859996,535dafe4-de7c-4b56-acf6-aa94f21653bc,2021-10-31,20
1859997,0428ca3c-e666-4ef4-8588-3a2af904a123,2021-10-31,20
1859998,a8cd1579-44d4-48b3-b3d6-47ae5197dbc6,2021-10-31,20


In [3]:
#info
activity_pretest.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1860000 entries, 0 to 1859999
Data columns (total 3 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   userid          object
 1   dt              object
 2   activity_level  int64 
dtypes: int64(1), object(2)
memory usage: 42.6+ MB


In [4]:
#drop duplicates
activity_pretest.drop_duplicates(inplace=True)

In [5]:
#check if there are null values
activity_pretest.isnull().sum()

userid            0
dt                0
activity_level    0
dtype: int64

In [6]:
#drop level 0, which means no activity
activity_pretest = activity_pretest[activity_pretest.activity_level != 0]
activity_pretest

Unnamed: 0,userid,dt,activity_level
909125,428070b0-083e-4c0e-8444-47bf91e99fff,2021-10-01,1
909126,93370f9c-56ef-437f-99ff-cb7c092d08a7,2021-10-01,1
909127,0fb7120a-53cf-4a51-8b52-bf07b8659bd6,2021-10-01,1
909128,ce64a9d8-07d9-4dca-908d-5e1e4568003d,2021-10-01,1
909129,e08332f0-3a5c-4ed2-b957-87e464e89b97,2021-10-01,1
...,...,...,...
1859995,200d65e6-b1ce-4a47-8c2b-946db5c5a3a0,2021-10-31,20
1859996,535dafe4-de7c-4b56-acf6-aa94f21653bc,2021-10-31,20
1859997,0428ca3c-e666-4ef4-8588-3a2af904a123,2021-10-31,20
1859998,a8cd1579-44d4-48b3-b3d6-47ae5197dbc6,2021-10-31,20


In [7]:
#descriptive statistics
activity_pretest.describe()

Unnamed: 0,activity_level
count,950875.0
mean,10.256362
std,5.635938
min,1.0
25%,5.0
50%,10.0
75%,15.0
max,20.0


In [8]:
#optional: convert dt into a date column (left disabled)
#activity_pretest['dt'] = pd.to_datetime(activity_pretest['dt'])
#activity_pretest.info()

In [8]:
#activity levels
activity_pretest["activity_level"].unique()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20], dtype=int64)

In [9]:
#users per date per activity level
activity_pretest.groupby(["dt", "activity_level"]).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,userid
dt,activity_level,Unnamed: 2_level_1
2021-10-01,1,1602
2021-10-01,2,1507
2021-10-01,3,1587
2021-10-01,4,1551
2021-10-01,5,1586
...,...,...
2021-10-31,16,1499
2021-10-31,17,1534
2021-10-31,18,1531
2021-10-31,19,1616


In [10]:
#amount of users for each activity level
activity_pretest.groupby(["activity_level"]).agg({"userid": "count"})

Unnamed: 0_level_0,userid
activity_level,Unnamed: 1_level_1
1,48732
2,49074
3,48659
4,48556
5,49227
6,48901
7,48339
8,48396
9,48820
10,48943


In [11]:
activity_pretest.groupby(["activity_level"]).agg({"userid": "count"}).mean()

userid    47543.75
dtype: float64

In [12]:
#activity levels per day and records per each activity level
activity_pretest.groupby(["dt", "activity_level"]).agg({"userid": "count"})

Unnamed: 0_level_0,Unnamed: 1_level_0,userid
dt,activity_level,Unnamed: 2_level_1
2021-10-01,1,1602
2021-10-01,2,1507
2021-10-01,3,1587
2021-10-01,4,1551
2021-10-01,5,1586
...,...,...
2021-10-31,16,1499
2021-10-31,17,1534
2021-10-31,18,1531
2021-10-31,19,1616


In [13]:
activity_pretest.groupby(["dt", "activity_level"]).agg({"userid": "count"}).mean()

userid    1533.669355
dtype: float64

#### Conclusions
- There are 20 activity levels (1 to 20) in the dataset, plus level zero, which means no activity.

- There are 60000 unique users. Each activity level has a mean of about 47544 records over the month; this figure counts rows (user-days), not distinct users.

- All 20 activity levels appear every day, with a mean of about 1534 records per activity level per day.
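
The difference between records and distinct users per activity level can be checked directly. The following is a minimal sketch on a small hypothetical frame (`sample` and its values are invented for illustration and only mirror the columns of `activity_pretest.csv`):

```python
import pandas as pd

# Hypothetical sample with the same columns as activity_pretest.csv
sample = pd.DataFrame({
    "userid": ["u1", "u1", "u2", "u3", "u3", "u3"],
    "dt": ["2021-10-01", "2021-10-02", "2021-10-01",
           "2021-10-01", "2021-10-02", "2021-10-03"],
    "activity_level": [1, 1, 2, 1, 2, 2],
})

# "count" counts rows (user-days); "nunique" counts distinct users
per_level = sample.groupby("activity_level")["userid"].agg(
    records="count",
    distinct_users="nunique",
)
print(per_level)
```

On the real data, a gap between the two columns would indicate that the same users reach a given level on several days.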

### Daily active users (DAU)

![ab_test](./img/user_activity_ab_testing.JPG)


Daily active users (DAU) refers to the number of users that are active per day (an activity level of zero means no activity). You must calculate this metric and provide your insights about it.

__Dataset:__ `activity_pretest.csv`

In [14]:
activity_pretest.groupby(["dt"]).agg({"userid": "count"})

Unnamed: 0_level_0,userid
dt,Unnamed: 1_level_1
2021-10-01,30634
2021-10-02,30775
2021-10-03,30785
2021-10-04,30599
2021-10-05,30588
2021-10-06,30639
2021-10-07,30637
2021-10-08,30600
2021-10-09,30902
2021-10-10,30581


In [15]:
activity_pretest.groupby(["dt"]).agg({"userid": "count"}).mean()

userid    30673.387097
dtype: float64

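#### Conclusions
- On average, about 30673 users are active per day, roughly half of the 60000 users.
- The daily counts shown vary little (between 30581 and 30902 over the first ten days).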
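
Counting rows per day equals DAU only if each user has at most one row per day. A slightly more defensive sketch counts distinct users instead. The helper name `daily_active_users` and the sample values are hypothetical:

```python
import pandas as pd

def daily_active_users(activity: pd.DataFrame) -> pd.Series:
    """Distinct users with activity_level > 0 for each day."""
    active = activity[activity["activity_level"] > 0]
    return active.groupby("dt")["userid"].nunique().rename("dau")

# Hypothetical sample: includes inactive rows and one duplicated row
sample = pd.DataFrame({
    "userid": ["u1", "u2", "u3", "u1", "u2", "u2", "u3"],
    "dt": ["2021-10-01"] * 3 + ["2021-10-02"] * 4,
    "activity_level": [3, 0, 5, 0, 2, 2, 0],
})

dau = daily_active_users(sample)
print(dau)
print(dau.agg(["mean", "min", "max"]))
```

Because `nunique` is used, the duplicated row for `u2` does not inflate the count for the second day.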

### Click-through rate (CTR)

![ab_test](./img/ad_click_through_rate_ab_testing.JPG)

Click-through rate (CTR) refers to the percentage of the ads shown to a user during a given day that the user clicks on. You must analyse this metric (e.g.: average CTR per day) and provide your insights about it.

__Dataset:__ `ctr_pretest.csv`

In [16]:
#Read file
ctr_pretest = pd.read_csv('./data/ctr_pretest.csv')
ctr_pretest

Unnamed: 0,userid,dt,ctr
0,4b328144-df4b-47b1-a804-09834942dce0,2021-10-01,34.28
1,34ace777-5e9d-40b3-a859-4145d0c35c8d,2021-10-01,34.67
2,8028cccf-19c3-4c0e-b5b2-e707e15d2d83,2021-10-01,34.77
3,652b3c9c-5e29-4bf0-9373-924687b1567e,2021-10-01,35.42
4,45b57434-4666-4b57-9798-35489dc1092a,2021-10-01,35.04
...,...,...,...
950870,a09a3687-b71a-4a67-b1ef-9b05c9770c4c,2021-10-31,32.33
950871,c843a595-b94c-42e1-b2fe-ec096070681e,2021-10-31,30.09
950872,edcdf0c1-3d8f-47e8-b7dd-05505749eb69,2021-10-31,35.71
950873,76b7a9ae-98fa-4c77-869d-594a4ef7282d,2021-10-31,34.76


In [17]:
#descriptive statistics
ctr_pretest.describe()

Unnamed: 0,ctr
count,950875.0
mean,33.000242
std,1.731677
min,30.0
25%,31.5
50%,33.0
75%,34.5
max,36.0


In [18]:
#number of ctr records per user per day
ctr_pretest.groupby(["dt", "userid"]).agg({"ctr": "count"})

Unnamed: 0_level_0,Unnamed: 1_level_0,ctr
dt,userid,Unnamed: 2_level_1
2021-10-01,00037d4d-ebfa-4a99-9d3e-adbefd6dae3a,1
2021-10-01,0004c8bb-df77-43b2-a93c-7398e9bc5175,1
2021-10-01,000a9421-3a7c-4910-bb77-508f415faaf1,1
2021-10-01,000c70d2-75cd-4814-ba43-a30736b395fa,1
2021-10-01,001291fe-996d-47b5-a7ee-c73f4b77814b,1
...,...,...
2021-10-31,ffee543d-1e45-489c-b75f-fac28882cdf3,1
2021-10-31,ffeeb5d7-bafb-4aa5-909f-0f7fb96de8e4,1
2021-10-31,fff370e6-ceed-4282-865f-7bb0863f7ec9,1
2021-10-31,fffdf2f8-7f61-4fb3-b5fc-6323a72290a7,1


In [19]:
ctr_pretest.groupby(["dt", "userid"]).agg({"ctr": "count"}).mean()

ctr    1.0
dtype: float64

In [20]:
#number of ctr records per day
ctr_pretest.groupby(["dt"]).agg({"ctr": "count"})

Unnamed: 0_level_0,ctr
dt,Unnamed: 1_level_1
2021-10-01,30634
2021-10-02,30775
2021-10-03,30785
2021-10-04,30599
2021-10-05,30588
2021-10-06,30639
2021-10-07,30637
2021-10-08,30600
2021-10-09,30902
2021-10-10,30581


In [21]:
ctr_pretest.groupby(["dt"]).agg({"ctr": "count"}).mean()

ctr    30673.387097
dtype: float64

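#### Conclusions
- Each user has, on average, one CTR record per day.
- There is an average of 30673 CTR records per day, which matches the DAU.
- The mean CTR is about 33% (standard deviation 1.73), with values between 30% and 36%.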
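
The counts above describe how many CTR records exist, not the CTR itself. A minimal sketch of the average CTR per day suggested in the task follows; the frame `sample` is hypothetical and only mirrors the columns of `ctr_pretest.csv`:

```python
import pandas as pd

# Hypothetical sample with the same columns as ctr_pretest.csv
sample = pd.DataFrame({
    "userid": ["u1", "u2", "u1", "u2", "u3"],
    "dt": ["2021-10-01", "2021-10-01",
           "2021-10-02", "2021-10-02", "2021-10-02"],
    "ctr": [30.0, 34.0, 33.0, 35.0, 31.0],
})

# Mean, spread and number of records of CTR for each day
daily_ctr = sample.groupby("dt")["ctr"].agg(["mean", "std", "count"])
print(daily_ctr)

# Average of the daily means (each day weighted equally)
print(daily_ctr["mean"].mean())
```

Note that the average of daily means weights every day equally, so it can differ slightly from the overall mean when days have different numbers of records.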

---

## Pretest metrics 

In this section you will analyse the metrics using the dataset that includes the results for the test and control groups, but only for the pretest data (i.e.: prior to November 1st, 2021). You must provide insights about the metrics (__Activity level__, __DAU__ and __CTR__) and also perform a hypothesis test to determine whether there is any statistically significant difference between the groups prior to the start of the experiment. You must try different approaches (i.e.: __z-test__ and __t-test__) and compare the results.


__Datasets:__ `activity_all.csv`, `ctr_all.csv`

In [24]:
#import data
activity_all = pd.read_csv('./data/activity_all.csv')
activity_all.head()

Unnamed: 0,userid,dt,groupid,activity_level
0,a5b70ae7-f07c-4773-9df4-ce112bc9dc48,2021-10-01,0,0
1,d2646662-269f-49de-aab1-8776afced9a3,2021-10-01,0,0
2,c4d1cfa8-283d-49ad-a894-90aedc39c798,2021-10-01,1,0
3,6889f87f-5356-4904-a35a-6ea5020011db,2021-10-01,0,0
4,dbee604c-474a-4c9d-b013-508e5a0e3059,2021-10-01,1,0


In [28]:
#data info
activity_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3660000 entries, 0 to 3659999
Data columns (total 4 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   userid          object
 1   dt              object
 2   groupid         int64 
 3   activity_level  int64 
dtypes: int64(2), object(2)
memory usage: 111.7+ MB


In [30]:
#drop duplicates
activity_all.drop_duplicates(inplace=True)

In [33]:
#check if there are null values
activity_all.isnull().sum()

userid            0
dt                0
groupid           0
activity_level    0
dtype: int64

In [34]:
#pretest data
activity_pre = activity_all[activity_all['dt'] < '2021-11-01']
activity_pre.head()

Unnamed: 0,userid,dt,groupid,activity_level
0,a5b70ae7-f07c-4773-9df4-ce112bc9dc48,2021-10-01,0,0
1,d2646662-269f-49de-aab1-8776afced9a3,2021-10-01,0,0
2,c4d1cfa8-283d-49ad-a894-90aedc39c798,2021-10-01,1,0
3,6889f87f-5356-4904-a35a-6ea5020011db,2021-10-01,0,0
4,dbee604c-474a-4c9d-b013-508e5a0e3059,2021-10-01,1,0
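
The pretest filter above compares `dt` as text. That works here because every value is a zero-padded `YYYY-MM-DD` string, which sorts in chronological order. A sketch of the same filter on parsed dates, using a hypothetical `sample` frame, shows that both approaches select the same rows under that assumption:

```python
import pandas as pd

# Hypothetical sample; only the dt format matches the real files
sample = pd.DataFrame({
    "userid": ["u1", "u2", "u3", "u4"],
    "dt": ["2021-10-31", "2021-11-01", "2021-10-05", "2021-11-13"],
})

# String comparison: valid only for zero-padded YYYY-MM-DD values
pre_by_string = sample[sample["dt"] < "2021-11-01"]

# Datetime comparison: does not depend on the text format
dt_parsed = pd.to_datetime(sample["dt"], format="%Y-%m-%d")
pre_by_datetime = sample[dt_parsed < pd.Timestamp("2021-11-01")]

print(pre_by_string.equals(pre_by_datetime))
```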


In [36]:
#drop activity level 0
activity_pre = activity_pre[activity_pre.activity_level != 0]
activity_pre.head()

Unnamed: 0,userid,dt,groupid,activity_level
1356592,428070b0-083e-4c0e-8444-47bf91e99fff,2021-10-01,1,1
1356593,93370f9c-56ef-437f-99ff-cb7c092d08a7,2021-10-01,1,1
1356594,0fb7120a-53cf-4a51-8b52-bf07b8659bd6,2021-10-01,1,1
1356595,ce64a9d8-07d9-4dca-908d-5e1e4568003d,2021-10-01,0,1
1356596,e08332f0-3a5c-4ed2-b957-87e464e89b97,2021-10-01,1,1
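
Before any hypothesis test, the two groups can be compared descriptively. The sketch below computes DAU per group and a per-group activity summary on a hypothetical `sample` frame with the same columns as `activity_pre`; the values are invented:

```python
import pandas as pd

# Hypothetical sample with the same columns as activity_pre
sample = pd.DataFrame({
    "userid": ["u1", "u2", "u3", "u4", "u1", "u2", "u3", "u4"],
    "dt": ["2021-10-01"] * 4 + ["2021-10-02"] * 4,
    "groupid": [0, 0, 1, 1, 0, 0, 1, 1],
    "activity_level": [3, 0, 5, 2, 1, 4, 0, 7],
})

active = sample[sample["activity_level"] > 0]

# DAU per group: one row per day, one column per groupid
dau_by_group = (
    active.groupby(["dt", "groupid"])["userid"].nunique().unstack("groupid")
)
print(dau_by_group)

# Records, distinct users and mean activity level for each group
summary = active.groupby("groupid").agg(
    records=("userid", "count"),
    users=("userid", "nunique"),
    mean_activity=("activity_level", "mean"),
)
print(summary)
```

If the groups were assigned at random, these pretest figures should look similar for both values of `groupid`.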


In [None]:
#Activity level


In [23]:
#import data
ctr_all = pd.read_csv('./data/ctr_all.csv')
ctr_all.head()

Unnamed: 0,userid,dt,groupid,ctr
0,60389fa7-2d71-4cdf-831c-c2bb277ffa1e,2021-11-13,0,31.81
1,b59cb225-d160-4851-92d2-7cc8120a2f63,2021-11-13,0,30.46
2,aa336050-934e-453f-a5b0-dd881fcd114e,2021-11-13,0,34.25
3,8df767f4-a10f-4322-a722-676b7e02b372,2021-11-13,0,34.92
4,a74762ed-4da0-42ab-91d2-40d7e808dfe9,2021-11-13,0,34.95


In [29]:
#data info
ctr_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2303408 entries, 0 to 2303407
Data columns (total 4 columns):
 #   Column   Dtype  
---  ------   -----  
 0   userid   object 
 1   dt       object 
 2   groupid  int64  
 3   ctr      float64
dtypes: float64(1), int64(1), object(2)
memory usage: 70.3+ MB


In [31]:
#drop duplicates
ctr_all.drop_duplicates(inplace=True)

In [32]:
#check if there are null values
ctr_all.isnull().sum()

userid     0
dt         0
groupid    0
ctr        0
dtype: int64

In [35]:
#pretest data
ctr_pre = ctr_all[ctr_all['dt'] < '2021-11-01']
ctr_pre.head()

Unnamed: 0,userid,dt,groupid,ctr
808703,4b328144-df4b-47b1-a804-09834942dce0,2021-10-01,0,34.28
808704,34ace777-5e9d-40b3-a859-4145d0c35c8d,2021-10-01,0,34.67
808705,8028cccf-19c3-4c0e-b5b2-e707e15d2d83,2021-10-01,0,34.77
808706,652b3c9c-5e29-4bf0-9373-924687b1567e,2021-10-01,0,35.42
808707,45b57434-4666-4b57-9798-35489dc1092a,2021-10-01,0,35.04


---

## Experiment metrics 

In this section you must perform the same analysis as in the previous section, but using the data generated during the experiment (i.e.: after November 1st, 2021). You must provide insights about the metrics (__Activity level__, __DAU__ and __CTR__) and also perform a hypothesis test to determine whether there is any statistically significant difference between the groups during the experiment. You must try different approaches (i.e.: __z-test__ and __t-test__) and compare the results.


__Datasets:__ `activity_all.csv`, `ctr_all.csv`

In [17]:
# your-code
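
One possible starting point for this section is sketched below on a hypothetical `sample` frame with the columns of `ctr_all.csv`. It assumes the experiment period starts on 2021-11-01 and that `groupid` 0 is the control group and 1 the test group; neither is confirmed by the data description. The helper `diff_in_means_ci` is also hypothetical and uses a normal approximation:

```python
import numpy as np
import pandas as pd

def diff_in_means_ci(control, treatment, z_crit=1.96):
    """Difference in means with an approximate 95% confidence interval."""
    control = np.asarray(control, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    diff = treatment.mean() - control.mean()
    se = np.sqrt(
        control.var(ddof=1) / control.size
        + treatment.var(ddof=1) / treatment.size
    )
    return diff, diff - z_crit * se, diff + z_crit * se

# Hypothetical sample with the same columns as ctr_all.csv
sample = pd.DataFrame({
    "userid": ["u1", "u1", "u1", "u2", "u2", "u2"],
    "dt": ["2021-10-31", "2021-11-01", "2021-11-02",
           "2021-11-01", "2021-11-02", "2021-10-31"],
    "groupid": [0, 0, 0, 1, 1, 1],
    "ctr": [33.0, 32.0, 34.0, 37.0, 39.0, 33.0],
})

# Keep only the experiment period (assumed to start on 2021-11-01)
is_experiment = pd.to_datetime(sample["dt"]) >= pd.Timestamp("2021-11-01")
experiment = sample[is_experiment]

control_ctr = experiment.loc[experiment["groupid"] == 0, "ctr"]
treatment_ctr = experiment.loc[experiment["groupid"] == 1, "ctr"]

diff, low, high = diff_in_means_ci(control_ctr, treatment_ctr)
print(f"difference={diff:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

A confidence interval that excludes zero points in the same direction as a significant z-test at the 5% level, and it also shows the size of the effect.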




---

## Conclusions

Please provide your conclusions from the analyses and your recommendation on whether or not we should implement the changes in the digital product.

In [18]:
# your-conclusions




---