# Metrics for site assignment
Create a set of functions that calculate metrics for a news site: CTR, CR, average page time, and share of readings.

CTR (click-through rate): ad conversion, i.e. the ratio of the number of clicks to the number of ad displays.

CR (conversion rate): conversion to a particular section of the site, i.e. the ratio of the number of users on the previous page to the number of users on the current page.

Average page time: the ratio of the total time to the number of users. Visits shorter than 5 seconds on a page should not be counted.

Share of readings: how often users share a page, i.e. the ratio of the number of shares to the number of users.
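
Written as formulas, the four definitions above read roughly as follows. The notation is shorthand introduced here for illustration (with $t_u$ the time user $u$ spent on the page), and the choice to divide the average page time by the number of users whose visits are kept, rather than by all users, is an assumption.

```latex
\begin{align*}
\mathrm{CTR} &= \frac{\text{number of clicks}}{\text{number of ad displays}} \\[4pt]
\mathrm{CR} &= \frac{\text{number of users on the previous page}}{\text{number of users on the current page}} \\[4pt]
\text{Average page time} &= \frac{\sum_{u \,:\, t_u \ge 5\,\mathrm{s}} t_u}{\bigl|\{\, u : t_u \ge 5\,\mathrm{s} \,\}\bigr|} \\[4pt]
\text{Share of readings} &= \frac{\text{number of shares}}{\text{number of users}}
\end{align*}
```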

The news site has a main page and four categories: politics, sport, science, and technology. Every category gets about 10 new posts every day. This information can be used to solve the task.

# First, let's create fake datasets

## For CTR

For the CTR, we will have a table with the following headers:

| date | ads_displayed | clicks_on_ads |
|------|---------------|---------------|

The table is indexed by the date.
* ads_displayed is the number of ads displayed on the specified date.
* clicks_on_ads is the number of clicks on the ads displayed on the specified date.

## For CR

For CR, we will have a table with the following headers:

| date | politics | sport | science | tech | politics_to_sport | politics_to_science | politics_to_tech | sport_to_politics | sport_to_science | sport_to_tech | science_to_sport | science_to_politics | science_to_tech | tech_to_politics | tech_to_sport | tech_to_science |
|------|----------|-------|---------|------|-------------------|---------------------|------------------|-------------------|------------------|---------------|------------------|---------------------|-----------------|------------------|---------------|-----------------|

The table is indexed by the date.

* politics, sport, science, tech refer to the number of users that read articles from these categories.
* politics_to_sport, politics_to_science, politics_to_tech refer to the number of users that went from reading politics articles to sport, science and tech articles respectively.
* sport_to_politics, sport_to_science, sport_to_tech refer to the number of users that went from reading sport articles to politics, science and tech articles respectively.
* science_to_politics, science_to_sport, science_to_tech refer to the number of users that went from reading science articles to politics, sport and tech articles respectively.
* tech_to_politics, tech_to_sport, tech_to_science refer to the number of users that went from reading tech articles to politics, sport and science articles respectively.

## Average page time
For the average page time, we will have the following multiindex table:

| date | user_uid | time |
|------|----------|------|

The first index of the table will be the date. The second index will be the user_uid, which identifies a user uniquely. Lastly, we will have the time in seconds, so we can filter out visits that lasted less than five seconds.

## Share of readings

For the share of readings, we will have the following multiindex table:

| date | post_uid | shares |
|------|----------|--------|

The first index of the table will be the date. The second index will be the post_uid (named `post_id` in the code further down), which identifies a post uniquely. Lastly, we will have the number of shares.

## Creating the first table and calculating CTR

For all the tables, we will generate data for 30 days, starting today.

In [1]:
from datetime import datetime, timedelta
from random import randint
import pandas as pd
number_of_days = 30
start_date = datetime.today()
dates = [start_date + timedelta(days=day) for day in range(number_of_days)]
ads_displayed = [randint(1000, 4000) for i in range(number_of_days)]
clicks_on_ads = [randint(0, ads) for ads in ads_displayed]
ctr_data = {'date': dates, 'ads_displayed': ads_displayed, 'clicks_on_ads': clicks_on_ads}
ctr_dataframe = pd.DataFrame(data=ctr_data)

In [2]:
ctr_dataframe.head()

Unnamed: 0,date,ads_displayed,clicks_on_ads
0,2020-03-21 15:14:54.032866,1192,1055
1,2020-03-22 15:14:54.032866,3930,25
2,2020-03-23 15:14:54.032866,3795,1909
3,2020-03-24 15:14:54.032866,1267,433
4,2020-03-25 15:14:54.032866,2784,2033
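
Because the table above is drawn with `randint` and starts at `datetime.today()`, every run produces different numbers. A minimal sketch of a reproducible variant follows; the helper name `make_ctr_dataframe`, the fixed start date and the seed value are assumptions made for illustration and are not part of the original notebook.

```python
import random

import pandas as pd


def make_ctr_dataframe(number_of_days=30, start="2020-03-21", seed=42):
    """Build a fake CTR table that is identical on every run (hypothetical helper)."""
    # A local generator keeps the global random state untouched
    rng = random.Random(seed)
    # date_range yields midnight timestamps, one per day
    dates = pd.date_range(start=start, periods=number_of_days, freq="D")
    ads_displayed = [rng.randint(1000, 4000) for _ in range(number_of_days)]
    # Clicks can never exceed the number of ads displayed on that day
    clicks_on_ads = [rng.randint(0, ads) for ads in ads_displayed]
    return pd.DataFrame(
        {"date": dates, "ads_displayed": ads_displayed, "clicks_on_ads": clicks_on_ads}
    )


sample = make_ctr_dataframe()
```

Calling `make_ctr_dataframe()` twice with the same seed returns equal dataframes, which makes the later results easier to compare between runs.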


Now, we will create a function that adds a new column with the CTR:

## Function that calculates CTR

In [3]:
def ctr(new_column, first_column, second_column):
    # Adds new_column to the global ctr_dataframe as first_column / second_column
    ctr_dataframe[new_column] = ctr_dataframe[first_column]/ctr_dataframe[second_column]

In [4]:
ctr('ctr', 'clicks_on_ads', 'ads_displayed')

In [5]:
ctr_dataframe.head()

Unnamed: 0,date,ads_displayed,clicks_on_ads,ctr
0,2020-03-21 15:14:54.032866,1192,1055,0.885067
1,2020-03-22 15:14:54.032866,3930,25,0.006361
2,2020-03-23 15:14:54.032866,3795,1909,0.50303
3,2020-03-24 15:14:54.032866,1267,433,0.341752
4,2020-03-25 15:14:54.032866,2784,2033,0.730244
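
The function above writes into the global `ctr_dataframe`. As a possible alternative, here is a sketch of a version that takes the dataframe as an argument and returns the CTR as a Series; the name `click_through_rate` and the handling of days with zero displays are assumptions, not part of the original.

```python
import numpy as np
import pandas as pd


def click_through_rate(df, clicks_column="clicks_on_ads", displays_column="ads_displayed"):
    """Return the per-row CTR as a Series without modifying df."""
    # Assumption: a day with no displays has no defined CTR, so it becomes NaN
    displays = df[displays_column].replace(0, np.nan)
    return df[clicks_column] / displays


example = pd.DataFrame({"ads_displayed": [1192, 3930, 0], "clicks_on_ads": [1055, 25, 0]})
example_ctr = click_through_rate(example)
```

The caller decides where the result goes, for instance `example["ctr"] = click_through_rate(example)`.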


We can also compute a single CTR for the whole period by dividing the sum of `clicks_on_ads` by the sum of `ads_displayed`.

In [6]:
total_clicks_on_ads = ctr_dataframe['clicks_on_ads'].sum()
total_ads_displayed = ctr_dataframe['ads_displayed'].sum()
# Named overall_ctr so that it does not shadow the ctr() function defined above
overall_ctr = total_clicks_on_ads/total_ads_displayed
print(f'CTR: {overall_ctr:.2f}')

CTR: 0.49


It is instructive to compare this value with the mean of the `ctr` column:

In [7]:
mean_ctr = ctr_dataframe['ctr'].mean()
print(f'Mean CTR: {mean_ctr:.2f}')

Mean CTR: 0.51


They are similar but not the same. The overall CTR weights each day by its number of ads displayed, whereas the mean of the `ctr` column gives every day equal weight. A direct comparison confirms that they are not equal:

In [8]:
overall_ctr == mean_ctr

False
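
The gap between the two numbers comes from weighting. The small sketch below uses made-up values for two days (they are not taken from the table above) to illustrate that the ratio of sums equals a mean of daily CTRs weighted by `ads_displayed`, while the plain mean treats both days equally.

```python
import pandas as pd

# Two invented days: a quiet day with a high CTR and a busy day with a low CTR
daily = pd.DataFrame({"ads_displayed": [1000, 4000], "clicks_on_ads": [500, 400]})
daily["ctr"] = daily["clicks_on_ads"] / daily["ads_displayed"]

# Ratio of sums: total clicks over total displays
aggregate_ctr = daily["clicks_on_ads"].sum() / daily["ads_displayed"].sum()

# Plain mean: every day counts the same
unweighted_mean_ctr = daily["ctr"].mean()

# Weighted mean: every day counts in proportion to its displays
weighted_mean_ctr = (daily["ctr"] * daily["ads_displayed"]).sum() / daily["ads_displayed"].sum()
```

Here `aggregate_ctr` and `weighted_mean_ctr` are both 900/5000 = 0.18, while `unweighted_mean_ctr` is (0.5 + 0.1)/2 = 0.3.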

For the following metrics, we will calculate them for each date, as this makes more sense from a business perspective. To get a metric for a period, say March 1 to March 21, we would simply take the (unweighted) mean of the daily values over that period.

## Creating the second table and calculating the CR

In [9]:
columns_cr_dataframe = ['politics','sport','science','tech','politics_to_sport','politics_to_science',
           'politics_to_tech','sport_to_politics','sport_to_science','sport_to_tech','science_to_sport',
           'science_to_politics','science_to_tech','tech_to_politics','tech_to_sport','tech_to_science']
cr_data = {'date': dates}
cr_data.update({column: [] for column in columns_cr_dataframe})
for i in range(number_of_days):
    for column in columns_cr_dataframe:
        cr_data[column] += [randint(2000, 10000)]
cr_dataframe = pd.DataFrame(data=cr_data)

In [10]:
cr_dataframe.head()

Unnamed: 0,date,politics,sport,science,tech,politics_to_sport,politics_to_science,politics_to_tech,sport_to_politics,sport_to_science,sport_to_tech,science_to_sport,science_to_politics,science_to_tech,tech_to_politics,tech_to_sport,tech_to_science
0,2020-03-21 15:14:54.032866,2676,8062,6820,3066,6980,4444,6455,6858,2955,7841,8672,5742,5326,5567,8763,7281
1,2020-03-22 15:14:54.032866,7661,9975,5689,2712,4112,7283,8230,6061,9994,5193,6279,7630,9000,7906,7985,7319
2,2020-03-23 15:14:54.032866,4691,2848,3994,2953,9172,4607,7592,9677,6849,6070,6844,2309,3621,7607,4317,7596
3,2020-03-24 15:14:54.032866,7778,8435,5831,9925,2769,5519,2009,3645,2774,7911,8583,8703,3569,6932,2375,8138
4,2020-03-25 15:14:54.032866,5165,2955,6738,7134,8331,7564,8001,4853,2285,3100,4854,9659,7785,9179,6468,4761


Now, we calculate 4 new columns, `cr_politics`, `cr_sport`, `cr_science` and `cr_tech`, using the function below:

## Function that calculates CR

In [11]:
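def cr(new_column, first_column, second_column, third_column, fourth_column):
    # Each .sum() aggregates over all dates, so the result is a single scalar
    # that pandas broadcasts to every row of new_column
    cr_dataframe[new_column] = cr_dataframe[first_column].sum() / (
        cr_dataframe[second_column].sum()
        + cr_dataframe[third_column].sum()
        + cr_dataframe[fourth_column].sum()
    )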

In [12]:
cr('cr_politics', 'politics', 'politics_to_sport', 'politics_to_science', 'politics_to_tech')
cr('cr_sport', 'sport', 'sport_to_politics', 'sport_to_science', 'sport_to_tech')
cr('cr_science', 'science', 'science_to_politics', 'science_to_sport', 'science_to_tech')
cr('cr_tech', 'tech', 'tech_to_politics', 'tech_to_sport', 'tech_to_science')

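Now, we can show the conversion rate for each category of the news site. Because `cr` sums every column over all dates before dividing, each date carries the same value, which is the conversion rate for the whole period: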

In [13]:
cr_dataframe[['date', 'cr_politics', 'cr_sport', 'cr_science', 'cr_tech']].head()

Unnamed: 0,date,cr_politics,cr_sport,cr_science,cr_tech
0,2020-03-21 15:14:54.032866,0.330019,0.347513,0.344555,0.297053
1,2020-03-22 15:14:54.032866,0.330019,0.347513,0.344555,0.297053
2,2020-03-23 15:14:54.032866,0.330019,0.347513,0.344555,0.297053
3,2020-03-24 15:14:54.032866,0.330019,0.347513,0.344555,0.297053
4,2020-03-25 15:14:54.032866,0.330019,0.347513,0.344555,0.297053
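
If a separate conversion rate per date is wanted instead, one option is to drop the sums over dates and divide row by row. The sketch below keeps the same ratio as `cr` (category readers over the sum of the three transition columns); the function name `cr_per_date` is hypothetical, and the two example rows reuse the politics values shown in the table earlier in this section.

```python
import pandas as pd


def cr_per_date(df, category, transition_columns):
    """Row-wise variant: category readers divided by the summed transitions on each date."""
    # axis=1 adds the transition columns within each row, giving one total per date
    return df[category] / df[transition_columns].sum(axis=1)


example = pd.DataFrame({
    "politics": [2676, 7661],
    "politics_to_sport": [6980, 4112],
    "politics_to_science": [4444, 7283],
    "politics_to_tech": [6455, 8230],
})
example["cr_politics"] = cr_per_date(
    example, "politics", ["politics_to_sport", "politics_to_science", "politics_to_tech"]
)
```

With this variant the column differs from date to date, so taking its mean afterwards is no longer a no-op.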


As with the last metric, we can calculate the mean for each category. Since each `cr_*` column holds the same value on every date, the mean simply equals that value:

In [14]:
def mean_cr_per_category(category):
    return cr_dataframe[category].mean()

In [15]:
politics_cr = mean_cr_per_category('cr_politics')
sport_cr = mean_cr_per_category('cr_sport')
science_cr = mean_cr_per_category('cr_science')
tech_cr = mean_cr_per_category('cr_tech')

In [16]:
print(f'Mean CR for politics: {politics_cr:.2f}')
print(f'Mean CR for sport: {sport_cr:.2f}')
print(f'Mean CR for science: {science_cr:.2f}')
print(f'Mean CR for tech: {tech_cr:.2f}')

Mean CR for politics: 0.33
Mean CR for sport: 0.35
Mean CR for science: 0.34
Mean CR for tech: 0.30


## Creating the third table and calculating average page time

For each date, we start with 5 users per day. This parameter can be adjusted later so the data looks more "real". **To keep things simple, we are not going to vary this parameter from date to date**.

In [17]:
users_per_day = 5
multiindex_dates = []
for date in dates:
    for i in range(users_per_day):
        multiindex_dates += [date]
uids = []
uid = 1
for date in dates:
    for i in range(users_per_day):
        uids += [uid]
        uid += 1
time = []
for date in dates:
    for i in range(users_per_day):
        random_time = randint(0, 15)
        time += [random_time]
index = [multiindex_dates, uids]
index = list(zip(*index))
index = pd.MultiIndex.from_tuples(index, names=['date', 'user_uid'])

In [18]:
avg_page_time_dataframe = pd.DataFrame(time, index=index, columns=['time'])

In [19]:
avg_page_time_dataframe

Unnamed: 0_level_0,Unnamed: 1_level_0,time
date,user_uid,Unnamed: 2_level_1
2020-03-21 15:14:54.032866,1,7
2020-03-21 15:14:54.032866,2,5
2020-03-21 15:14:54.032866,3,13
2020-03-21 15:14:54.032866,4,14
2020-03-21 15:14:54.032866,5,2
...,...,...
2020-04-19 15:14:54.032866,146,2
2020-04-19 15:14:54.032866,147,13
2020-04-19 15:14:54.032866,148,1
2020-04-19 15:14:54.032866,149,1
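
The three nested loops that build the index can be written more compactly. The following is a sketch of an equivalent construction with `Index.repeat` and `pd.MultiIndex.from_arrays`; the three-day `dates` range is a stand-in for the notebook's own `dates` list and is an assumption made to keep the example self-contained.

```python
import pandas as pd

# Stand-in for the notebook's dates (assumption: three days are enough to illustrate)
dates = pd.date_range("2020-03-21", periods=3, freq="D")
users_per_day = 5

# Each date repeated once per user: d1, d1, d1, d1, d1, d2, ...
repeated_dates = dates.repeat(users_per_day)
# One running id per row, starting at 1
uids = list(range(1, len(repeated_dates) + 1))

index = pd.MultiIndex.from_arrays([repeated_dates, uids], names=["date", "user_uid"])
```

The same pattern works for the posts table by replacing `users_per_day` with the number of posts per day and renaming the second level.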


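Great! Now, we need a function that filters out short visits and returns a dataframe with the mean time for each date. Note that the strict comparison `> min_time` used here also drops visits of exactly 5 seconds; `>=` would exclude only visits shorter than 5 seconds: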

In [20]:
def avg_page_time_per_day(min_time=5):
    filtered_avg_page_time_dataframe = avg_page_time_dataframe[avg_page_time_dataframe['time'] > min_time]
    return filtered_avg_page_time_dataframe.groupby(by=['date']).mean()

In [21]:
avg_time_dataframe = avg_page_time_per_day()
avg_time_dataframe.head()

Unnamed: 0_level_0,time
date,Unnamed: 1_level_1
2020-03-21 15:14:54.032866,11.333333
2020-03-22 15:14:54.032866,10.5
2020-03-23 15:14:54.032866,8.0
2020-03-24 15:14:54.032866,13.5
2020-03-25 15:14:54.032866,12.0
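
The task statement excludes only times below 5 seconds, whereas the function above also excludes times equal to 5. A sketch of an inclusive variant is shown below, applied to the first day's five values from the table earlier in this section (7, 5, 13, 14, 2). The function name `avg_page_time_inclusive` is hypothetical, and plain strings are used for the dates to keep the example short.

```python
import pandas as pd


def avg_page_time_inclusive(df, min_time=5):
    """Mean time per date, keeping visits of at least min_time seconds."""
    kept = df[df["time"] >= min_time]
    # Group on the 'date' level of the MultiIndex and average the remaining times
    return kept.groupby(level="date")["time"].mean()


index = pd.MultiIndex.from_tuples(
    [("2020-03-21", uid) for uid in range(1, 6)],
    names=["date", "user_uid"],
)
first_day = pd.DataFrame({"time": [7, 5, 13, 14, 2]}, index=index)
inclusive_mean = avg_page_time_inclusive(first_day)
```

For this day the inclusive mean is (7 + 5 + 13 + 14)/4 = 9.75 seconds, compared with (7 + 13 + 14)/3 ≈ 11.33 seconds under the strict comparison.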


Again, we can take the average time for all the dates in this new dataframe:

In [22]:
average_time = avg_time_dataframe['time'].mean()
print(f'Average time for the 30 days: {average_time:.2f} seconds')

Average time for the 30 days: 10.58 seconds


## Creating the fourth table and calculating share of readings

The approach is almost the same as the one we took for the last metric, so we will be repeating some code. However, we need an additional piece of information from the last metric, namely the `users_per_day` variable. We also need the number of posts per day, which we take to be `10` (the instructions mention about 10 new posts per category per day). Again, **to keep things simple, we are not going to vary this parameter from date to date**.

In [23]:
posts_per_day = 10
multiindex_dates = []
for date in dates:
    for i in range(posts_per_day):
        multiindex_dates += [date]
uids = []
uid = 1
for date in dates:
    for i in range(posts_per_day):
        uids += [uid]
        uid += 1
shares = []
for date in dates:
    for i in range(posts_per_day):
        random_shares = randint(0, users_per_day)
        shares += [random_shares]
index = [multiindex_dates, uids]
index = list(zip(*index))
index = pd.MultiIndex.from_tuples(index, names=['date', 'post_id'])

In [24]:
shares_dataframe = pd.DataFrame(shares, index=index, columns=['shares'])

In [25]:
shares_dataframe

Unnamed: 0_level_0,Unnamed: 1_level_0,shares
date,post_id,Unnamed: 2_level_1
2020-03-21 15:14:54.032866,1,4
2020-03-21 15:14:54.032866,2,5
2020-03-21 15:14:54.032866,3,3
2020-03-21 15:14:54.032866,4,5
2020-03-21 15:14:54.032866,5,5
...,...,...
2020-04-19 15:14:54.032866,296,0
2020-04-19 15:14:54.032866,297,1
2020-04-19 15:14:54.032866,298,1
2020-04-19 15:14:54.032866,299,0


Great! Now, we need to divide the number of shares of each post by the total number of users per day.

## Function that calculates shares per users

In [26]:
def shares_per_users():
    shares_dataframe['shares_per_total_users'] = shares_dataframe['shares']/users_per_day

In [27]:
shares_per_users()
shares_dataframe.head(15)

Unnamed: 0_level_0,Unnamed: 1_level_0,shares,shares_per_total_users
date,post_id,Unnamed: 2_level_1,Unnamed: 3_level_1
2020-03-21 15:14:54.032866,1,4,0.8
2020-03-21 15:14:54.032866,2,5,1.0
2020-03-21 15:14:54.032866,3,3,0.6
2020-03-21 15:14:54.032866,4,5,1.0
2020-03-21 15:14:54.032866,5,5,1.0
2020-03-21 15:14:54.032866,6,4,0.8
2020-03-21 15:14:54.032866,7,1,0.2
2020-03-21 15:14:54.032866,8,4,0.8
2020-03-21 15:14:54.032866,9,0,0.0
2020-03-21 15:14:54.032866,10,5,1.0
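
The column above is a per-post ratio. Another possible reading of "number of shares over number of users" is a site-wide daily figure: all shares on a date divided by that date's users. The sketch below computes both for the first day's ten posts listed above; the function name `shares_per_user_per_day` is hypothetical, and which reading is intended is an assumption left open here.

```python
import pandas as pd


def shares_per_user_per_day(df, users_per_day):
    """Total shares on each date divided by that date's number of users."""
    return df.groupby(level="date")["shares"].sum() / users_per_day


users_per_day = 5
index = pd.MultiIndex.from_tuples(
    [("2020-03-21", post_id) for post_id in range(1, 11)],
    names=["date", "post_id"],
)
first_day = pd.DataFrame({"shares": [4, 5, 3, 5, 5, 4, 1, 4, 0, 5]}, index=index)

site_wide = shares_per_user_per_day(first_day, users_per_day)
per_post_mean = (first_day["shares"] / users_per_day).groupby(level="date").mean()
```

For this day the site-wide figure is 36/5 = 7.2 shares per user, and the per-post mean is 0.72; with a fixed number of posts per day the two differ exactly by that number of posts.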


We can calculate the average value of `shares_per_total_users` for each date:

In [28]:
shares_per_total_users_dataframe = shares_dataframe.groupby('date')['shares_per_total_users'].mean()
shares_per_total_users_dataframe.head(15)

date
2020-03-21 15:14:54.032866    0.72
2020-03-22 15:14:54.032866    0.58
2020-03-23 15:14:54.032866    0.50
2020-03-24 15:14:54.032866    0.52
2020-03-25 15:14:54.032866    0.42
2020-03-26 15:14:54.032866    0.40
2020-03-27 15:14:54.032866    0.60
2020-03-28 15:14:54.032866    0.28
2020-03-29 15:14:54.032866    0.48
2020-03-30 15:14:54.032866    0.50
2020-03-31 15:14:54.032866    0.64
2020-04-01 15:14:54.032866    0.42
2020-04-02 15:14:54.032866    0.54
2020-04-03 15:14:54.032866    0.68
2020-04-04 15:14:54.032866    0.54
Name: shares_per_total_users, dtype: float64

Lastly, we are going to calculate the mean shares per total users for all the dates:

In [29]:
mean_shares_per_total_users = shares_per_total_users_dataframe.mean()
print(f'Mean shares per total users for the 30 days: {mean_shares_per_total_users:.2f}')

Mean shares per total users for the 30 days: 0.50
