# Calculating the 'Amplification Ratio': Reproducing Twitter Research with DP Guarantees

References:
* [Twitter Blog Summary](https://blog.twitter.com/en_us/topics/company/2021/rml-politicalcontent)
* [Twitter Preprint](https://cdn.cms-twdigitalassets.com/content/dam/blog-twitter/official/en_us/company/2021/rml/Algorithmic-Amplification-of-Politics-on-Twitter.pdf) (includes Supplementary Information section)
* [PNAS Paper](https://doi.org/10.1073/pnas.2025334119)
* [PNAS Supporting Information](https://www.pnas.org/highwire/filestream/1021951/field_highwire_adjunct_files/0/pnas.2025334119.sapp.pdf)

### Imports and data

In [40]:
# standard
import pandas as pd
from datetime import datetime, timedelta

In [24]:
df=pd.read_parquet("https://github.com/madhavajay/datasets/blob/main/spicy_bird/1M_rows_dataset_sample.parquet?raw=true")

In [65]:
# add a date-only column for convenience, then sample a few rows to check it
df['tweet_date'] = df.tweet_date_time.dt.date
df.sample(5)

| Unnamed: 0 | tweet_id | impressions | tweet_date_time | date | time | user_id | url | publication_title | ad_fontes_bias | ad_fontes_reliability | domain | tweet_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 538808 | 62 | 17590 | 2020-10-04 15:02:56 | 2020-10-04 | 15:02:56 | 106881 | https://www.alternet.org/2019/03/trump-has-sol... | AlterNet | -26.25 | 18.0 | .alternet.org | 2020-10-04 |
| 831701 | 90 | 9765 | 2020-03-23 04:12:22 | 2020-03-23 | 04:12:22 | 465948 | https://amgreatness.com/2020/08/05/the-nhl-get... | American Greatness | 22.0 | 28.66667 | .amgreatness.com | 2020-03-23 |
| 728679 | 81 | 19904 | 2021-01-08 07:59:51 | 2021-01-08 | 07:59:51 | 314727 | https://www.alternet.org/2021/02/marjorie-tayl... | AlterNet | -21.75 | 27.0 | .alternet.org | 2021-01-08 |
| 939271 | 102 | 7877 | 2020-02-24 12:43:30 | 2020-02-24 | 12:43:30 | 570878 | https://amgreatness.com/2021/01/05/morning-gre... | American Greatness | 30.66667 | 11.33333 | .amgreatness.com | 2020-02-24 |
| 610629 | 69 | 21661 | 2021-07-28 02:44:17 | 2021-07-28 | 02:44:17 | 199347 | https://www.alternet.org/2020/07/even-republic... | AlterNet | -19.0 | 28.33333 | .alternet.org | 2021-07-28 |


### Calculating with NumPy

In [73]:
# choose a random pair of tweets to calculate an amplification ratio
# one tweet will be "control" and one will be "treatment"
# (since this dataset doesn't currently have treatment/control labels)

pub='Al Jazeera'
rand_pair = df[df.publication_title==pub].sample(2)
control_impressions = rand_pair['impressions'].iloc[0]
treatment_impressions = rand_pair['impressions'].iloc[1]

print(f"Impressions on Treatment: {treatment_impressions}\n\nImpressions on Control: {control_impressions}")

Impressions on Treatment: 4972

Impressions on Control: 12442


In [83]:
# for now, just the raw ratio of the two impression counts
# the Twitter analysis will need an extra normalization step,
# because the treatment and control groups differ in size

ratio = treatment_impressions / control_impressions
print(f"Amplification Ratio: {ratio*100:.1f}%")

Amplification Ratio: 40.0%
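
One way to handle the size imbalance is to divide each group's total impressions by its number of users before taking the ratio. The sketch below assumes that per-user normalization is the adjustment needed. The function name `amplification_ratio` and the example numbers are hypothetical and do not come from the dataset.

```python
def amplification_ratio(treatment_total, control_total, n_treatment, n_control):
    """Ratio of per-user impressions in treatment vs. control.

    Assumption: imbalance is corrected by dividing each group's total
    impressions by that group's size before taking the ratio.
    """
    if n_treatment <= 0 or n_control <= 0:
        raise ValueError("group sizes must be positive")
    if control_total <= 0:
        raise ValueError("control impressions must be positive")
    per_user_treatment = treatment_total / n_treatment
    per_user_control = control_total / n_control
    return per_user_treatment / per_user_control


# Hypothetical totals and group sizes, for illustration only
example_ratio = amplification_ratio(8000, 1000, n_treatment=400, n_control=100)
print(f"Normalized Amplification Ratio: {example_ratio * 100:.1f}%")
```

With equal group sizes the function reduces to the raw ratio computed above.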


In [None]:
# to come: sum based on treatment/control labels
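
Until the dataset carries real labels, the label-based sum can be sketched on a toy frame. The `group` column and all values below are hypothetical; only `publication_title`, `impressions` and `user_id` mirror column names in the real data. Normalizing by the number of distinct users per group is an assumption, not a confirmed part of the Twitter method.

```python
import pandas as pd

# Toy stand-in for df; the `group` label column is hypothetical
toy = pd.DataFrame({
    "publication_title": ["A", "A", "A", "B", "B", "B"],
    "impressions": [100, 300, 200, 50, 150, 100],
    "user_id": [1, 2, 3, 4, 5, 6],
    "group": ["control", "treatment", "treatment",
              "control", "control", "treatment"],
})

# Total impressions per publication, one column per group
totals = (
    toy.groupby(["publication_title", "group"])["impressions"]
    .sum()
    .unstack("group")
)

# Number of distinct users in each group, used to correct for imbalance
users = toy.groupby("group")["user_id"].nunique()

# Per-user impressions, then treatment over control for each publication
per_user = totals / users
amp = per_user["treatment"] / per_user["control"]
print(amp)
```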

In [None]:
# to come: statistical comparison
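
One possible form for the statistical comparison is a percentile bootstrap interval for the ratio of mean impressions. This is a sketch of a generic technique, not necessarily the procedure used in the Twitter paper. The helper `bootstrap_ratio_ci` and the sample arrays are hypothetical.

```python
import numpy as np


def bootstrap_ratio_ci(treatment, control, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(treatment) / mean(control)."""
    rng = np.random.default_rng(seed)
    treatment = np.asarray(treatment, dtype=float)
    control = np.asarray(control, dtype=float)

    # Resample each group independently, with replacement
    t_idx = rng.integers(0, len(treatment), size=(n_boot, len(treatment)))
    c_idx = rng.integers(0, len(control), size=(n_boot, len(control)))

    # One ratio of means per bootstrap replicate
    ratios = treatment[t_idx].mean(axis=1) / control[c_idx].mean(axis=1)
    low, high = np.quantile(ratios, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)


# Hypothetical per-tweet impression counts, for illustration only
sample_treatment = [4972, 9765, 7877, 12000, 8300]
sample_control = [12442, 17590, 19904, 21661, 15000]
low, high = bootstrap_ratio_ci(sample_treatment, sample_control)
print(f"95% bootstrap interval for the ratio: [{low:.3f}, {high:.3f}]")
```

An interval that excludes 1.0 would suggest the two groups differ in mean impressions, subject to the usual caveats about small samples.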

### Calculating with PySyft

In [None]:
# to come