# Calculating 'Amplification Ratio:' Reproducing Twitter Research with DP Guarantees

References:
* [Twitter Blog Summary](https://blog.twitter.com/en_us/topics/company/2021/rml-politicalcontent)
* [Twitter Preprint](https://cdn.cms-twdigitalassets.com/content/dam/blog-twitter/official/en_us/company/2021/rml/Algorithmic-Amplification-of-Politics-on-Twitter.pdf) (includes Supplementary Information section)
* [PNAS Paper](https://doi.org/10.1073/pnas.2025334119)
* [PNAS Supporting Information](https://www.pnas.org/highwire/filestream/1021951/field_highwire_adjunct_files/0/pnas.2025334119.sapp.pdf)

### Imports and data

In [1]:
# standard
import pandas as pd
from datetime import datetime, timedelta

In [2]:
# from google drive, twitter datasets -> demo -> to benchmark -> demo 2 datasets
df=pd.read_parquet("../../../../Data/OpenMined/1M_rows_dataset_sample.parquet")

# old dataset: df=pd.read_parquet("https://github.com/madhavajay/datasets/blob/main/spicy_bird/1M_rows_dataset_sample.parquet?raw=true")

In [3]:
# add a date column to simplify some calculations, check sample to verify
df['tweet_date'] = df.tweet_date_time.dt.date
df.sample(5)

Unnamed: 0,tweet_id,impressions,impressions_ct,bucket,tweet_origin,tweet_date_time,date,time,user_id,user_country,url,publication_title,ad_fontes_bias,ad_fontes_reliability,domain,tweet_date
547788,68,12069,2551,0,Canada,2021-05-27 13:47:09,2021-05-27,0 days 13:47:09,6401,U.S.A.,https://www.alternet.org/2019/04/trump-lashes-...,AlterNet,-13.33333,24.66667,.alternet.org,2021-05-27
142355,81,19904,4298,0,Canada,2021-01-08 07:59:51,2021-01-08,0 days 07:59:51,476714,France,https://www.alternet.org/2021/02/marjorie-tayl...,AlterNet,-21.75,27.0,.alternet.org,2021-01-08
511322,56,15806,12927,1,U.S.A.,2020-05-25 10:37:36,2020-05-25,0 days 10:37:36,11326,Japan,https://www.alternet.org/2019/02/top-gop-leade...,AlterNet,-14.0,26.33333,.alternet.org,2020-05-25
502836,56,15806,2879,0,Japan,2020-05-25 10:37:36,2020-05-25,0 days 10:37:36,222070,U.S.A.,https://www.alternet.org/2019/02/top-gop-leade...,AlterNet,-14.0,26.33333,.alternet.org,2020-05-25
281786,78,8925,7108,1,U.S.A.,2021-06-28 23:20:30,2021-06-28,0 days 23:20:30,408560,Germany,https://www.alternet.org/2020/09/trump-white-h...,AlterNet,-19.66667,40.0,.alternet.org,2021-06-28


In [41]:
# utility function - pretty-print a decimal as a percentage
def pretty_percentage(decimal):
    return f"{decimal:+.2%}"

# NumPy
This is essentially a prototype that can be followed when creating the PySyft version

### Calculating Amplification Ratio with NumPy
**Note 1**: eventually this will require a little more calculation due to imbalanced sample sizes in the holdback experiment (~4:1 algorithm:chronological, i.e. treatment:control)

**Note 2**: currently, each row in this dataset represents an impression. If and when data is aggregated, amplification ratio should be calculated by summing the impressions column rather than counting rows

To calculate amplification ratio for a given set of tweets:

In [27]:
def set_amplification_ratio(df, tweet_id_set):
    chron_impressions = len(df[df.tweet_id.isin(tweet_id_set) & df.bucket==0])
    algo_impressions = len(df[df.tweet_id.isin(tweet_id_set) & df.bucket==1])
    amp_ratio = algo_impressions / chron_impressions
    return amp_ratio

In [40]:
# amplificaiton ratio for a specific set of tweets
sample_tweet_set = [8,9,10]
print(pretty_percentage(
    set_amplification_ratio(df=df, tweet_id_set = sample_tweet_set)))

+1.31%


To calculate amplification ratio for a given user (author):

In [42]:
def user_amplification_ratio(df, user_id):
    chron_impressions = len(df[(df.author_id==user_id) & df.bucket==0])
    algo_impressions = len(df[(df.author_id==user_id) & df.bucket==1])
    amp_ratio = algo_impressions / chron_impressions
    return amp_ratio

In [47]:
# currently this does not work because "user_id" represents the reader/viewer, not the author/creator
# sample_user = 68355
# print(pretty_percentage(
#    user_amplification_ratio(df=df, user_id=sample_user)))

To calculate amplification ratio for a given publication:

In [49]:
def pub_amplification_ratio(df, pub):
    chron_impressions = len(df[(df.publication_title==pub) & df.bucket==0])
    algo_impressions = len(df[(df.publication_title==pub) & df.bucket==1])
    amp_ratio = algo_impressions / chron_impressions
    return amp_ratio

In [51]:
sample_pub='Al Jazeera'
print(pretty_percentage(
    pub_amplification_ratio(df=df, pub=sample_pub)))

+20.81%


To calculate amplification ratio month-to-month for a given publication:

In [52]:
df.groupby([df.tweet_date_time.dt.to_period("M").rename("year-month")]).apply(
                lambda x: pretty_percentage(pub_amplification_ratio(df=x,pub=sample_pub)))

year-month
2020-01      +0.00%
2020-02    +100.00%
2020-03     +35.54%
2020-04      +0.00%
2020-05     +55.64%
2020-06      +0.00%
2020-08     +27.47%
2020-09      +0.00%
2020-10     +11.59%
2020-11      +0.00%
2020-12     +46.12%
2021-01      +0.00%
2021-02      +9.74%
2021-03     +39.26%
2021-04     +72.24%
2021-05      +0.00%
2021-06     +17.49%
2021-07      +0.00%
2021-08     +17.95%
Freq: M, dtype: object

### Testing Equality with NumPy

* scipy bootstrap function [implementation](https://github.com/scipy/scipy/blob/v1.8.0/scipy/stats/_bootstrap.py#L215-L488)
* the [bootstrapped python library](https://github.com/facebookarchive/bootstrapped) from Facebook open source

In [60]:
import bootstrapped.bootstrap as bs
import bootstrapped.compare_functions as bs_compare
import bootstrapped.stats_functions as bs_stats

In [94]:
# will compare two publications
tweet_set_a = df[(df.publication_title=='Al Jazeera')]
tweet_set_a = df[(df.publication_title=='AlterNet')]

In [86]:
# get impressions
tweet_set_a = tweet_set_a.drop_duplicates(subset=['tweet_id','bucket','impressions'])
tweet_set_b = tweet_set_b.drop_duplicates(subset=['tweet_id','bucket','impressions'])

In [101]:
n_total_runs = 50
sample_size = 10

n_runs_a_greater = 0

for _ in range(n_total_runs):
    sample_a = tweet_set_a.sample(sample_size).tweet_id.tolist()
    ratio_a = set_amplification_ratio(df=df, tweet_id_set=sample_a)
    sample_b = tweet_set_b.sample(sample_size).tweet_id.tolist()
    ratio_b = set_amplification_ratio(df=df, tweet_id_set=sample_b)
    if ratio_a > ratio_b:
        n_runs_a_greater += 1

print(f"probability a > b: {n_runs_a_greater/n_total_runs}")

probability a > b: 0.56


In [9]:
# array(['AlterNet', 'Al Jazeera'], dtype=object)
# to come: confidence intervals

### Calculating Amplification Ratio with PySyft

In [10]:
# to come
# connect to domain_node dataset
# data = domain_node.datasets[-1]["dataset name"]
# calculate with sum_result = data.sum()
# retrieve with published_result = sum_result.publish(sigma=1e6)