Last updated: July 5, 2023

Last run: July 5, 2023

**Additional filtering based on data (as described in the published study)**

## Unreviewed science in the news: The evolution of preprint media coverage from 2014-2021


Juan Pablo Alperin

**Related Publication:**
Fleerackers, A., Shores, K., Chtena, N. & Alperin, J.P. (2023). Unreviewed science in the news: The evolution of preprint media coverage from 2014-2021. *bioRxiv*.

**Related Dataset:**
Alperin, Juan Pablo; Fleerackers; Shores, 2023, "Data for: Unreviewed science in the news", https://doi.org/10.7910/DVN/ZHQUFD, *Harvard Dataverse*

In [1]:
import pandas as pd

## Preprint filtering

In [2]:
df = pd.read_csv('data/top outlets/preprints_mention_details_with_dates.csv')

df['posted_on'] = pd.to_datetime(df.posted_on)

df['best_published_pub_date'] = pd.to_datetime(df['best_published_pub_date'])
df['best_preprint_pub_date'] = pd.to_datetime(df['best_preprint_pub_date'])

preprints_original = df.copy()

In [3]:
N = df.shape[0]
print("{:,}".format(N))

40,039


In [4]:
# Remove everything with a preprint date before 2013, because dates in that range are problematic
df = df[df.best_preprint_pub_date >= '2013']

In [5]:
print("{:,}".format(N-df.shape[0]))
N=df.shape[0]

3,619
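
A note on the comparison above: when a datetime column is compared with the string `'2013'`, pandas parses the string as a timestamp (midnight on 2013-01-01), and rows with a missing date compare as `False`, so they are dropped as well. The sketch below illustrates this on a toy frame; the dates are made up for illustration and are not taken from the study's data.

```python
import pandas as pd

# Toy data (hypothetical values, not from the study's dataset).
toy = pd.DataFrame({
    "best_preprint_pub_date": pd.to_datetime(
        ["2012-12-31", "2013-01-01", "2015-06-30", None]
    )
})

# The string '2013' is parsed as the timestamp 2013-01-01 00:00:00.
mask_str = toy.best_preprint_pub_date >= "2013"
mask_ts = toy.best_preprint_pub_date >= pd.Timestamp("2013-01-01")

# Both masks agree; the row with a missing date (NaT) compares as False.
print(mask_str.tolist())
print(mask_str.equals(mask_ts))
```

If rows with missing preprint dates should be kept instead, the mask would need an explicit `isnull()` branch, as the later date filters in this notebook do.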


In [6]:
# Filter out records that appear to be postprints rather than preprints
# (published version dated on or before the preprint)
df = df[(df.best_published_pub_date.isnull()) | (df.best_published_pub_date > df.best_preprint_pub_date)]

print("{:,}".format(N - df.shape[0]))
N=df.shape[0]

165


In [7]:
# remaining
print("{:,}".format(df.shape[0]))

36,255


In [8]:
# Date data too uncertain to decide if preprint or postprint (dates too close together):
# drop records whose preprint and publication dates are within 7 days of each other

df = df[(df.best_published_pub_date.isnull()) | (abs((df.best_published_pub_date - df.best_preprint_pub_date).dt.days) > 7)]

In [9]:
print("{:,}".format(N - df.shape[0]))
N=df.shape[0]

332
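
To make the 7-day rule concrete, here is a minimal sketch on toy data (the dates are hypothetical). It mirrors the mask used above: a record is kept if it has no publication date, or if the publication and preprint dates are more than 7 days apart in either direction.

```python
import pandas as pd

# Toy data (hypothetical values): one preprint date, five publication dates.
toy = pd.DataFrame({
    "best_preprint_pub_date": pd.to_datetime(["2020-01-01"] * 5),
    "best_published_pub_date": pd.to_datetime(
        [None, "2020-01-05", "2020-01-08", "2020-03-01", "2019-12-01"]
    ),
})

# Gap in whole days; a missing publication date gives NaN.
gap_days = (toy.best_published_pub_date - toy.best_preprint_pub_date).dt.days

# Keep unpublished records, or records with a gap strictly greater than 7 days.
keep = toy.best_published_pub_date.isnull() | (gap_days.abs() > 7)

print(gap_days.tolist())
print(keep.tolist())
```

A gap of exactly 7 days is dropped because the comparison is strict. The last toy row (publication 31 days before the preprint) passes this filter only because it is symmetric; in the notebook such rows were already removed by the earlier postprint filter.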


In [10]:
# Drop mentions where the news story appeared more than 5 days before the preprint was published

df = df[df.days_since_preprint >= -5]

In [11]:
print("{:,}".format(N - df.shape[0]))
N=df.shape[0]

327


In [12]:
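# Drop mentions that appear to refer to the postprint, not the preprint:
# keep only mentions with no publication date or made more than 5 days before publication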

df = df[(df.days_since_publication.isnull()) | (df.days_since_publication < -5)]

In [13]:
print("{:,}".format(N - df.shape[0]))
N=df.shape[0]

3,547


In [14]:
# We have the same news story for the same DOI/arxiv_id more than once
df = df.drop_duplicates(subset=['doi', 'arxiv_id', 'outlet', 'moreover_url'])

print("{:,}".format(N - df.shape[0]))
N=df.shape[0]

1,021
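
The deduplication above keys on four columns, and a given row typically has only one of `doi` or `arxiv_id` filled in. The sketch below, on hypothetical toy rows, shows that `drop_duplicates` treats missing values in a key column as equal to each other, so two mentions with the same arXiv ID, outlet, and URL collapse into one even when their DOI is missing.

```python
import pandas as pd

# Toy data (hypothetical identifiers and outlets).
toy = pd.DataFrame({
    "doi": ["10.1101/x", "10.1101/x", "10.1101/x", None, None],
    "arxiv_id": [None, None, None, "2001.00001", "2001.00001"],
    "outlet": ["Outlet A", "Outlet A", "Outlet B", "Outlet A", "Outlet A"],
    "moreover_url": ["u1", "u1", "u1", "u2", "u2"],
    "order": [1, 2, 3, 4, 5],
})

# Same key columns as the cell above; the first occurrence of each key is kept.
deduped = toy.drop_duplicates(subset=["doi", "arxiv_id", "outlet", "moreover_url"])

print(deduped["order"].tolist())
print(len(toy) - len(deduped))
```

Rows 1 and 2 share every key, as do rows 4 and 5, so one of each pair is removed; row 3 survives because the outlet differs.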


In [15]:
df.to_csv('data/top outlets/preprints_mentions_final.csv', index=False)

In [16]:
preprints = df.copy()
del df

## Summary of preprint filtering

In [17]:
N_mentions = preprints.shape[0]
N_mentions_original = preprints_original.shape[0]

N_preprints = preprints.altmetric_id.unique().shape[0]
N_preprints_original = preprints_original.altmetric_id.unique().shape[0]

N_stories =  preprints[['outlet', 'moreover_url']].drop_duplicates().shape[0]
N_stories_original =  preprints_original[['outlet', 'moreover_url']].drop_duplicates().shape[0]

N_outlets = preprints.outlet.unique().shape[0]

print("{:,} mentions ({:.1f}%)".format(N_mentions, 100*N_mentions/N_mentions_original))
print("{:,} preprints ({:.1f}%)".format(N_preprints, 100*N_preprints/N_preprints_original))
print("{:,} stories ({:.1f}%)".format(N_stories, 100*N_stories/N_stories_original))
print()
print("In total, filtering led to the exclusion of {:,} mentions ({:.1f}% of original dataset)".format(
    N_mentions_original-N_mentions, 100*(N_mentions_original-N_mentions)/N_mentions_original))
print()
print("Our final preprint sample comprised {:,} mentions of {:,} preprints in {:,} stories published by the 99 outlets in our sample (a preprint could be covered in several stories and a single story could mention multiple preprints)".format( 
     N_mentions, N_preprints, N_stories))



31,028 mentions (77.5%)
11,538 preprints (76.7%)
25,249 stories (80.8%)

In total, filtering led to the exclusion of 9,011 mentions (22.5% of original dataset)

Our final preprint sample comprised 31,028 mentions of 11,538 preprints in 25,249 stories published by the 99 outlets in our sample (a preprint could be covered in several stories and a single story could mention multiple preprints)
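
As a sanity check, the per-step counts printed in cells 5 through 14 can be tallied against the totals in this summary. The sketch below uses only numbers reported above; the step labels are shorthand introduced here for readability.

```python
# Counts copied from the cell outputs above; labels are informal shorthand.
original = 40_039
removed = {
    "preprint dated before 2013": 3_619,
    "apparent postprints": 165,
    "dates within 7 days": 332,
    "mention > 5 days before preprint": 327,
    "mention refers to published version": 3_547,
    "duplicate stories": 1_021,
}

total_removed = sum(removed.values())
final = original - total_removed

print("{:,} removed ({:.1f}%)".format(total_removed, 100 * total_removed / original))
print("{:,} remaining ({:.1f}%)".format(final, 100 * final / original))
```

The tally reproduces the 9,011 excluded mentions (22.5%) and the 31,028 retained mentions (77.5%) reported in the summary.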


## WoS filtering

In [18]:
input_file = 'data/top outlets/wos_with_doi_mention_details.csv'
# columns = ['Altmetric_ID', 'Order', 'DOI', 'Arxiv_ID', 'First_Seen_On', 'PubDate', 'wos_year', 'wos_title', 'Author_Name', 'Author_Url', 'Posted_On', 'Title', 'moreover_url']

In [19]:
df = pd.read_csv(input_file)
df.columns = [x.lower() for x in df.columns]
df.rename(columns = {'author_name': 'outlet', 'author_url': 'outlet_url'}, inplace=True)
df['posted_on'] = pd.to_datetime(df.posted_on)
df['posted_on_year'] = df.posted_on.map(lambda x: x.year)

print("{:,}".format(df.shape[0]))

articles_original = df.copy()

1,657,202


In [20]:
# Remove everything before 2013, to match what we did with preprints because of problematic dates
N = df.shape[0]
df = df[df.wos_year >= 2013]

In [21]:
print("{:,}".format(N - df.shape[0]))
N=df.shape[0]

156,187


In [22]:
# Merging WoS with Preprint data to find duplicates, etc.
merged = df.set_index(['altmetric_id', 'order']).merge(preprints[['altmetric_id', 'order', 'posted_on']].set_index(['altmetric_id', 'order']), 
         left_index=True, right_index=True, 
         how='outer', 
        suffixes=['', '_p'],
         indicator='source'
        )

In [23]:
print("Removing {:,} WoS mentions that are in the preprint mentions folder".format(merged[merged.source == 'both'].shape[0]))

df = merged[merged.source == 'left_only']
del df['source']
del df['posted_on_p']

Removing 579 WoS mentions that are in the preprint mentions folder
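
The merge above is an anti-join: the `indicator` column records whether each key appears in the left frame, the right frame, or both, and only `left_only` rows are kept. The following is a minimal sketch of the same idea on toy frames (hypothetical IDs), using a left join on columns rather than an outer join on the index; it assumes the right-hand keys are unique, which is enforced here with `drop_duplicates`.

```python
import pandas as pd

# Toy data (hypothetical IDs): 'wos' plays the role of the WoS mentions,
# 'pre' the role of the filtered preprint mentions.
wos = pd.DataFrame({
    "altmetric_id": [1, 1, 2, 3],
    "order": [1, 2, 1, 1],
    "doi": ["a", "a", "b", "c"],
})
pre = pd.DataFrame({
    "altmetric_id": [1, 4],
    "order": [2, 1],
})

keys = ["altmetric_id", "order"]

# indicator="source" adds a column with values left_only / right_only / both.
merged = wos.merge(pre[keys].drop_duplicates(), on=keys, how="left", indicator="source")

n_both = int((merged.source == "both").sum())
wos_only = merged[merged.source == "left_only"].drop(columns="source")

print("Removing {:,} overlapping mentions".format(n_both))
print(wos_only.doi.tolist())
```

A left join is enough for this purpose because rows that exist only in the right frame are discarded anyway; the outer join in the notebook yields the same `left_only` set.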


In [24]:
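# We have the same news story for the same DOI more than once
# (N was last updated after the 2013 filter, so the count printed below
# also includes the mentions removed in the merge step above)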
df = df.drop_duplicates(subset=['doi', 'outlet', 'moreover_url'])

print("{:,}".format(N - df.shape[0]))
N=df.shape[0]

14,482
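
The bookkeeping used throughout this notebook (print `N - df.shape[0]`, then reset `N`) depends on `N` being refreshed after every step. One way to make that automatic is a small helper that applies a boolean mask and reports the rows removed. This is a sketch only: `apply_step` is a hypothetical name, not part of the original analysis, and the toy values are made up.

```python
import pandas as pd

def apply_step(df, keep_mask, label):
    """Keep rows where keep_mask is True and report how many were removed."""
    before = len(df)
    out = df[keep_mask]
    print("{}: removed {:,}, {:,} remaining".format(label, before - len(out), len(out)))
    return out

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "wos_year": [2010, 2013, 2020, 2020],
    "doi": ["a", "b", "c", "c"],
})

toy = apply_step(toy, toy.wos_year >= 2013, "pre-2013")
toy = apply_step(toy, ~toy.duplicated(subset=["doi"]), "duplicates")
```

Because each call measures the frame it receives, the reported count always refers to that step alone.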


In [25]:
print("{:,}".format(df.shape[0]))
df.to_csv('data/top outlets/wos_mentions_final.csv')

1,486,533


In [26]:
articles = df.copy()
del df

## Summary of WoS filtering

In [27]:
N_mentions = articles.shape[0]
N_mentions_original = articles_original.shape[0]

N_articles = articles.doi.unique().shape[0]
N_articles_original = articles_original.doi.unique().shape[0]

N_stories =  articles[['outlet', 'moreover_url']].drop_duplicates().shape[0]
N_stories_original =  articles_original[['outlet', 'moreover_url']].drop_duplicates().shape[0]

N_outlets = articles.outlet.unique().shape[0]

print("{:,} mentions ({:.1f}%)".format(N_mentions, 100*N_mentions/N_mentions_original))
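# NOTE: N_preprints and N_preprints_original are carried over from the preprint summary,
# so this line repeats the preprint figures; the article counts are in N_articles and N_articles_original
print("{:,} preprints ({:.1f}%)".format(N_preprints, 100*N_preprints/N_preprints_original))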
print("{:,} stories ({:.1f}%)".format(N_stories, 100*N_stories/N_stories_original))
print()
print("In total, filtering led to the exclusion of {:,} mentions ({:.1f}% of original dataset)".format(
    N_mentions_original-N_mentions, 100*(N_mentions_original-N_mentions)/N_mentions_original))
print()
print("The final published research sample comprised {:,} mentions of {:,} distinct research outputs across {:,} stories published by the {:,} outlets in our sample (a research output could be covered in several stories and a single story could mention multiple outputs)".format( 
     N_mentions, N_articles, N_stories, N_outlets))



1,486,533 mentions (89.7%)
11,538 preprints (76.7%)
1,084,048 stories (97.4%)

In total, filtering led to the exclusion of 170,669 mentions (10.3% of original dataset)

The final published research sample comprised 1,486,533 mentions of 397,446 distinct research outputs across 1,084,048 stories published by the 94 outlets in our sample (a research output could be covered in several stories and a single story could mention multiple outputs)
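
The WoS totals can be cross-checked the same way, using only the counts printed in cells 19, 21, and 24. The variable names below are introduced here for illustration.

```python
# Counts copied from the cell outputs above.
original = 1_657_202
removed_pre_2013 = 156_187
# Printed in cell 24; N was not reset after the merge step, so this figure
# covers both the overlap with preprint mentions and the duplicate stories.
removed_overlap_and_duplicates = 14_482

final = original - removed_pre_2013 - removed_overlap_and_duplicates
excluded = original - final

print("{:,} mentions ({:.1f}%)".format(final, 100 * final / original))
print("{:,} excluded ({:.1f}%)".format(excluded, 100 * excluded / original))
```

The result matches the 1,486,533 retained mentions (89.7%) and 170,669 excluded mentions (10.3%) reported above.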


In [28]:
"A total of {:,} mentions of {:,} publications across {:,} news stories.".format(articles.shape[0], articles.doi.unique().shape[0], articles[['outlet', 'moreover_url']].drop_duplicates().shape[0])

'A total of 1,486,533 mentions of 397,446 publications across 1,084,048 news stories.'