# Calculating 'Amplification Ratio:' Reproducing Twitter Research with DP Guarantees

References:
* [Twitter Blog Summary](https://blog.twitter.com/en_us/topics/company/2021/rml-politicalcontent)
* [Twitter Preprint](https://cdn.cms-twdigitalassets.com/content/dam/blog-twitter/official/en_us/company/2021/rml/Algorithmic-Amplification-of-Politics-on-Twitter.pdf) (includes Supplementary Information section)
* [PNAS Paper](https://doi.org/10.1073/pnas.2025334119)
* [PNAS Supporting Information](https://www.pnas.org/highwire/filestream/1021951/field_highwire_adjunct_files/0/pnas.2025334119.sapp.pdf)

### Imports and data

In [1]:
# standard
import pandas as pd
from datetime import datetime, timedelta

In [2]:
# from google drive, twitter datasets -> demo -> to benchmark -> demo 2 datasets
df=pd.read_parquet("1M_rows_dataset_sample.parquet")

# old dataset: df=pd.read_parquet("https://github.com/madhavajay/datasets/blob/main/spicy_bird/1M_rows_dataset_sample.parquet?raw=true")

In [3]:
# add a date column to simplify some calculations, check sample to verify
df['tweet_date'] = df.tweet_date_time.dt.date
df.sample(5)

Unnamed: 0,tweet_id,impressions,impressions_ct,bucket,tweet_origin,tweet_date_time,date,time,user_id,user_country,url,publication_title,ad_fontes_bias,ad_fontes_reliability,domain,tweet_date
493117,56,15806,2879,0,Japan,2020-05-25 10:37:36,2020-05-25,0 days 10:37:36,241143,Germany,https://www.alternet.org/2019/02/top-gop-leade...,AlterNet,-14.0,26.33333,.alternet.org,2020-05-25
749328,12,7820,6364,1,Canada,2020-12-27 10:23:27,2020-12-27,0 days 10:23:27,198392,Japan,https://www.aljazeera.com/news/2020/06/police-...,Al Jazeera,-6.25,46.25,.aljazeera.com,2020-12-27
610076,76,11502,9369,1,Japan,2020-01-02 06:05:10,2020-01-02,0 days 06:05:10,429932,France,https://www.alternet.org/2020/08/republicans-a...,AlterNet,-10.33333,33.0,.alternet.org,2020-01-02
24009,63,11477,2462,0,Germany,2020-10-14 10:27:49,2020-10-14,0 days 10:27:49,21339,U.K.,https://www.alternet.org/2019/03/this-needs-to...,AlterNet,-27.0,24.25,.alternet.org,2020-10-14
224954,57,9619,1774,0,France,2021-07-26 22:10:18,2021-07-26,0 days 22:10:18,106198,U.K.,https://www.alternet.org/2019/02/conservative-...,AlterNet,-20.0,15.25,.alternet.org,2021-07-26


### Calculating Amplification Ratio with NumPy
**Note**: eventually this will require a little more calculation due to imbalanced sample sizes in the holdback experiment (~4:1 algorithm:chronological, i.e. treatment:control)



To calculate amplification ratio for a given set of tweets:

In [4]:
def set_amplification_ratio(df, tweet_id_set):
    chron_impressions = df[df.tweet_id.isin(tweet_id_set) & df.bucket==0].impressions.sum()
    algo_impressions = df[df.tweet_id.isin(tweet_id_set) & df.bucket==1].impressions.sum()
    amp_ratio = algo_impressions / chron_impressions
    return amp_ratio

In [5]:
# amplificaiton ratio for a specific set of tweets
sample_tweet_set = [8,9,10]
print(set_amplification_ratio(df=df, tweet_id_set = sample_tweet_set))

0.014264363543452287


To calculate amplification ratio for a given publication:

In [6]:
def pub_amplification_ratio(df, pub):
    chron_impressions = df[(df.publication_title==pub) & df.bucket==0].impressions.sum()
    algo_impressions = df[(df.publication_title==pub) & df.bucket==1].impressions.sum()
    amp_ratio = algo_impressions / chron_impressions
    return amp_ratio

In [7]:
sample_pub='Al Jazeera'
print(pub_amplification_ratio(df=df, pub=sample_pub))

0.17641674594218157


To calculate amplification ratio month-to-month for a given publication:

In [8]:
df.groupby([df.tweet_date_time.dt.to_period("M").rename("year-month")]).apply(
                lambda x: pub_amplification_ratio(df=x,pub=sample_pub))

year-month
2020-01    0.000000
2020-02    1.000000
2020-03    0.252640
2020-04    0.000000
2020-05    0.482793
2020-06    0.000000
2020-08    0.271657
2020-09    0.000000
2020-10    0.113357
2020-11    0.000000
2020-12    0.423205
2021-01    0.000000
2021-02    0.053475
2021-03    0.455225
2021-04    0.766344
2021-05    0.000000
2021-06    0.147059
2021-07    0.000000
2021-08    0.087397
Freq: M, dtype: float64

In [9]:
# to come: confidence intervals, statistical comparisons

### Calculating Amplification Ratio with PySyft

In [10]:
# to come
# connect to domain_node dataset
# data = domain_node.datasets[-1]["dataset name"]
# calculate with sum_result = data.sum()
# retrieve with published_result = sum_result.publish(sigma=1e6)