### Understand the use case

Let's see how to calculate metrics from raw datasets :)

In [2]:
!pip install altair

Collecting altair
  Downloading altair-5.2.0-py3-none-any.whl.metadata (8.7 kB)
Collecting toolz (from altair)
  Downloading toolz-0.12.1-py3-none-any.whl.metadata (5.1 kB)
Downloading altair-5.2.0-py3-none-any.whl (996 kB)
Downloading toolz-0.12.1-py3-none-any.whl (56 kB)
Installing collected packages: toolz, altair
Successfully installed altair-5.2.0 toolz-0.12.1


In [3]:
import pandas as pd 
import numpy as np 
import altair as alt 
from datetime import datetime 

alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

## Metrics Calculations

### 1. User Activity (a metric)

In [8]:
data = pd.read_csv("/workspaces/Learn_AB_Testing/ab-testing-in-python/Course notes/Activity_pretest.csv")
data.head()

Unnamed: 0,userid,dt,activity_level
0,a5b70ae7-f07c-4773-9df4-ce112bc9dc48,2021-10-01,0
1,d2646662-269f-49de-aab1-8776afced9a3,2021-10-01,0
2,c4d1cfa8-283d-49ad-a894-90aedc39c798,2021-10-01,0
3,6889f87f-5356-4904-a35a-6ea5020011db,2021-10-01,0
4,dbee604c-474a-4c9d-b013-508e5a0e3059,2021-10-01,0


**Activity level** --> How many times a user was active in the app on a given day. Activity is defined as opening the app.

In [13]:
data.activity_level.value_counts().sort_values()

## Levels run from 0 to 20. Each count is a number of user-day rows, e.g. 24520 rows have activity_level == 20 (the app was opened 20 times that day)

activity_level
20     24520
7      48339
17     48395
8      48396
13     48534
4      48556
15     48599
14     48620
3      48659
1      48732
9      48820
11     48832
19     48901
6      48901
12     48911
16     48934
10     48943
18     48982
2      49074
5      49227
0     909125
Name: count, dtype: int64
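Sorting by count scatters the levels, so it can help to sort by the level itself and to look at shares instead of raw counts. A minimal sketch on a tiny made-up frame (the `toy` values are invented for illustration; only the column names match the dataset):

```python
import pandas as pd

# Hypothetical toy data: one row per user per day, like Activity_pretest.csv.
toy = pd.DataFrame({
    "userid": ["a", "b", "c", "a", "b", "c"],
    "dt": ["2021-10-01"] * 3 + ["2021-10-02"] * 3,
    "activity_level": [0, 3, 0, 5, 3, 0],
})

# Rows per level, ordered by level rather than by frequency.
level_counts = toy["activity_level"].value_counts().sort_index()

# Share of user-day rows at each level (fractions that sum to 1).
level_share = toy["activity_level"].value_counts(normalize=True).sort_index()

print(level_counts)
print(level_share)
```

On the real data, the same two lines applied to `data["activity_level"]` would show directly how large the inactive (level 0) share is.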

In [14]:
## The same breakdown with groupby: describe() summarizes userid and dt within each activity level
data.groupby("activity_level").describe().head()

activity_level,userid count,userid unique,userid top,userid freq,dt count,dt unique,dt top,dt freq
0,909125,60000,6b953416-72e5-4b6e-b634-41c8d3bf98a4,27,909125,31,2021-10-11,29511
1,48732,33688,3c5297b6-602e-4479-9a97-e2b4cb444f0a,6,48732,31,2021-10-19,1620
2,49074,33761,3d5b7e5d-d7b8-459b-a4f0-33231fc930fd,6,49074,31,2021-10-14,1665
3,48659,33634,fd9d8064-2f3f-47ba-9deb-0a38bc0b1a3d,6,48659,31,2021-10-28,1663
4,48556,33502,dc396a83-174c-4244-8a33-71eae2283eeb,8,48556,31,2021-10-29,1632


In [22]:
## Now that we know the distribution of activity levels, keep only the active rows (activity_level > 0) and count users per day and level.


activity = data.query("activity_level > 0").groupby(['dt', 'activity_level']).count().reset_index()

In [23]:
activity.count()

dt                620
activity_level    620
userid            620
dtype: int64
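The 620 rows correspond to 31 dates times 20 non-zero activity levels. Note that after `.count()` the `userid` column holds a row count, not an id. A possible alternative, sketched on invented toy data, is `.size()` with an explicit column name (`n_users` is a name chosen here for illustration, not one used in the notebook):

```python
import pandas as pd

# Hypothetical toy data with the same columns as the activity dataset.
toy = pd.DataFrame({
    "userid": ["a", "b", "c", "a", "b", "c"],
    "dt": ["2021-10-01"] * 3 + ["2021-10-02"] * 3,
    "activity_level": [0, 3, 3, 5, 3, 0],
})

# Count rows per (date, level) and give the result a descriptive name.
users_per_level = (
    toy.query("activity_level > 0")
       .groupby(["dt", "activity_level"])
       .size()
       .reset_index(name="n_users")
)

print(users_per_level)
```

With this shape, the chart's y encoding would read `n_users:Q` instead of `userid:Q`.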

In [24]:
## let's visualize this :) 

alt.Chart(activity).mark_line(size=1).encode(
    alt.X('dt:T', axis=alt.Axis(title = 'date')),
    alt.Y('userid:Q', axis=alt.Axis(title = 'number of users')),
    tooltip=['activity_level'], 
    color='activity_level:N'
).properties(
    width=600,
    height=400, 
    title="Activity level"
)

### 2. Daily Active Users (DAU)

In this dataset, a userid will count towards DAU if their activity_level for that day is not zero.

In [20]:
activity = data.query('activity_level > 0').groupby(['dt']).count().reset_index()

## let's visualize this :) 
alt.Chart(activity).mark_line(size=4).encode(
    alt.X('dt:T', axis=alt.Axis(title = 'date')),
    alt.Y('userid:Q', axis=alt.Axis(title = 'number of users'))
).properties(
    width=600,
    height=400, 
    title='Daily Active Users'
)
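Counting rows per day equals counting users only if each user has at most one row per day. If that is not guaranteed, counting distinct ids is the safer choice. A minimal sketch, assuming invented toy data in which user `a` deliberately appears twice on the first day:

```python
import pandas as pd

# Hypothetical toy data; user "a" has a duplicate row on 2021-10-01.
toy = pd.DataFrame({
    "userid": ["a", "b", "c", "a", "a", "b", "c"],
    "dt": ["2021-10-01"] * 4 + ["2021-10-02"] * 3,
    "activity_level": [2, 0, 1, 4, 0, 3, 0],
})

active = toy.query("activity_level > 0")

# Row count per day (what .count() gives) vs. distinct users per day.
rows_per_day = active.groupby("dt")["userid"].count()
dau = active.groupby("dt")["userid"].nunique().reset_index(name="dau")

print(rows_per_day)
print(dau)
```

Here the row count for the first day is 3 while the DAU is 2, which is the difference `nunique()` protects against.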

### 3. Click-Through Rate (CTR)

In [38]:
data = pd.read_csv("/workspaces/Learn_AB_Testing/ab-testing-in-python/Course notes/Ctr_pretest.csv")
data.head()

Unnamed: 0,userid,dt,ctr
0,4b328144-df4b-47b1-a804-09834942dce0,2021-10-01,34.28
1,34ace777-5e9d-40b3-a859-4145d0c35c8d,2021-10-01,34.67
2,8028cccf-19c3-4c0e-b5b2-e707e15d2d83,2021-10-01,34.77
3,652b3c9c-5e29-4bf0-9373-924687b1567e,2021-10-01,35.42
4,45b57434-4666-4b57-9798-35489dc1092a,2021-10-01,35.04


**CTR** -> The click-through rate is the percentage of the ads a user saw on a given day that they clicked. Example: on 2021-10-01, the first user in the table clicked 34.28% of the ads they were shown :)
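The dataset only ships the final `ctr` column, but the definition can be sketched from raw counts. The columns `ads_seen` and `ads_clicked` below are hypothetical and do not exist in `Ctr_pretest.csv`; the numbers are invented:

```python
import pandas as pd

# Hypothetical raw event counts per user and day (not part of the real CSV).
raw = pd.DataFrame({
    "userid": ["a", "b", "c"],
    "dt": ["2021-10-01"] * 3,
    "ads_seen": [50, 40, 20],
    "ads_clicked": [17, 14, 7],
})

# CTR as a percentage: clicks divided by impressions, times 100.
raw["ctr"] = (raw["ads_clicked"] / raw["ads_seen"] * 100).round(2)

print(raw)
```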

In [39]:
data.describe()

Unnamed: 0,ctr
count,950875.0
mean,33.000242
std,1.731677
min,30.0
25%,31.5
50%,33.0
75%,34.5
max,36.0


In [40]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 950875 entries, 0 to 950874
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   userid  950875 non-null  object 
 1   dt      950875 non-null  object 
 2   ctr     950875 non-null  float64
dtypes: float64(1), object(2)
memory usage: 21.8+ MB
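`info()` shows that `dt` is stored as `object`, i.e. plain strings. Altair parses it through the `:T` shorthand, but for date arithmetic in pandas it may be worth converting the column first. A minimal sketch on invented rows, assuming the `YYYY-MM-DD` format seen in `head()`:

```python
import pandas as pd

# Hypothetical toy rows with the same columns as the CTR dataset.
toy = pd.DataFrame({
    "userid": ["a", "b"],
    "dt": ["2021-10-01", "2021-10-02"],
    "ctr": [34.28, 33.10],
})

# Parse the string column into real timestamps.
toy["dt"] = pd.to_datetime(toy["dt"], format="%Y-%m-%d")

print(toy.dtypes)
# Datetime accessors are now available, e.g. the weekday name.
print(toy["dt"].dt.day_name().tolist())
```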


In [45]:
ctr = data.groupby(['dt']).mean(numeric_only = True).reset_index()

alt.Chart(ctr).mark_line(size=4).encode(
    alt.X('dt:T', axis=alt.Axis(title = 'date')),
    alt.Y('ctr:Q', axis=alt.Axis(title = 'ctr'), scale=alt.Scale(domain=[32, 34])),
    tooltip=['ctr'], 
).properties(
    width=600,
    height=400, 
    title='Average Daily CTR'
)