### Define Search Parameters

There are two important objects to understand and define for the search:


- 'keywords' are used to run individual queries and retrieve papers
- 'relevance_terms' are used to quickly filter the retrieved papers by abstract and title

The relevance terms are organized into groups. For a paper to count as relevant, at least one term from each group must be a substring of the paper's abstract or title. You can leave the terms blank if you want to forgo this simple filtering.
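
To make the group logic concrete, here is a minimal sketch of such a filter. The function name `is_relevant` and its signature are assumptions made for illustration; the actual check inside the `search` module may differ in detail.

```python
def is_relevant(title, abstract, relevance_terms):
    """Return True if every non-empty term group has at least one term that
    occurs as a substring of the title or abstract (case-insensitive).

    Hypothetical helper, not the implementation used by the search module.
    """
    text = f"{title or ''} {abstract or ''}".lower()
    return all(
        any(term.lower() in text for term in group)
        for group in relevance_terms
        if group  # an empty group imposes no constraint
    )


example_topic_terms = ["narrat", "conspira"]
example_task_terms = ["detect", "classif"]
example_groups = [example_topic_terms, example_task_terms]

# Matches one topic term and one task term
print(is_relevant("Detecting conspiracy narratives", "", example_groups))  # True
# Matches a topic term but no task term
print(is_relevant("A history of narrative theory", "", example_groups))    # False
# No groups at all: nothing is filtered out
print(is_relevant("Anything at all", "", []))                              # True
```

Because this is plain substring matching, a stem like "isinformat" covers both "misinformation" and "disinformation", while a very short term like "ai" also matches inside unrelated words such as "explain" or "contain".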

In [1]:
from search import *
import csv
import pandas as pd

In [None]:
# Max number of results to return per search
max_results = 10

# Papers must be published after
min_year = 0

keywords = ["Conspiracy Narratives",]

# OPTIONAL
# Basic stems of our search terms, these will be combined
# to filter papers found with the main keywords
topic_terms = [
    "narrat", "isinformat", "conspira", "propagand", "fake news", "fact-check"
]
task_terms = [
    "detect", "track", "model", "predict", "classif",
    "extract", "identif", "recognition", "analys",
]
method_terms = [
    "nlp", "natural language processing", "ai",
    "dataset", "algorithm", "graph", "network", "comput",
    "llm", "large language model",
]
relevance_terms = [topic_terms, task_terms, method_terms] 

# Setup API key for Scopus and ScienceDirect searches
setup_elsevier_api("api_keys/elsevier_api_key.txt")

# Needed exclusively for Semantic Scholar
semanticscholar_api_key_path = "api_keys/semantic_scholar_api_key.txt"

# OPTIONAL
# Proper etiquette for OpenAlex searches is to identify yourself
email = "example@example.org"

# OPTIONAL
# Can be used to check if the search contains any preselected papers
gold_titles = init_gold_titles("gold_papers.txt")

# Needed to process results
all_results = []
seen_keys = set()

### Search Individual APIs

In [3]:
search_acl_anthology(
    keywords=keywords,
    seen_keys=seen_keys,
    all_results=all_results,
    min_year=min_year,
    relevance_terms=relevance_terms,
    gold_titles=gold_titles
)

Searching ACL Anthology...: 21606it [00:05, 3436.11it/s]Unknown TeX-math command: \choose
Searching ACL Anthology...: 51835it [00:14, 3376.58it/s]Unknown TeX-math command: \mathtt{MuMOInstruct}
Unknown TeX-math command: \mathtt{MuMOInstruct}
Unknown TeX-math command: \mathtt{GeLLM^3O}
Unknown TeX-math command: \mathtt{GeLLM^3O}
Unknown TeX-math command: \mathtt{GeLLM^3O}
Unknown TeX-math command: \mathtt{GeLLM^3O}
Unknown TeX-math command: \mathtt{MuMOInstruct}
Searching ACL Anthology...: 52183it [00:14, 3310.31it/s]Unknown TeX-math command: \textcolor{red}{100~trillion}
Searching ACL Anthology...: 52553it [00:14, 3417.12it/s]Unknown TeX-math command: \dots
Searching ACL Anthology...: 54933it [00:15, 2730.04it/s]Unknown TeX-math command: \mathbf
Unknown TeX-math command: \mathbf
Unknown TeX-math command: \mathbf
Unknown TeX-math command: \mathbf
Unknown TeX-math command: \mathbf
Unknown TeX-math command: \mathbf
Unknown TeX-math command: \mathbf
Unknown TeX-math command: \left\lfloor
U

Found 4 candidate papers

Added: Which side are you on? Insider-Outsider classification in conspiracy-theoretic social media
Added: Conspiracy Narratives in the Protest Movement Against COVID-19 Restrictions in Germany. A Long-term Content Analysis of Telegram Chat Groups.
Added: Identifying Conspiracy Theories News based on Event Relation Graph
Adding more...
Gold papers found in this search: 2
Added 4 new papers






In [4]:
search_arxiv(
    keywords=keywords,
    seen_keys=seen_keys,
    all_results=all_results,
    min_year=min_year,
    max_results=max_results,
    relevance_terms=relevance_terms,
    gold_titles=gold_titles
)

Searching arXiv...: 100%|██████████| 1/1 [00:00<00:00,  5.80it/s]

Found 8 candidate papers

Added: Recontextualized Knowledge and Narrative Coalitions on Telegram
Added: An automated pipeline for the discovery of conspiracy and conspiracy theory narrative frameworks: Bridgegate, Pizzagate and storytelling on the web
Added: Visual Framing of Science Conspiracy Videos: Integrating Machine Learning with Communication Theories to Study the Use of Color and Brightness
Adding more...
Gold papers found in this search: 2
Added 7 new papers






In [5]:
search_crossref(
    keywords=keywords,
    seen_keys=seen_keys,
    all_results=all_results,
    min_year=min_year,
    max_results=max_results,
    relevance_terms=relevance_terms,
    gold_titles=gold_titles
)

Searching Crossref...: 100%|██████████| 1/1 [00:14<00:00, 14.27s/it]

Found 0 candidate papers

Gold papers found in this search: 0
Added 0 new papers






In [6]:
# search_scholar is currently out of order
# search_scholar(keywords, seen_keys, all_results, relevance_terms=relevance_terms, min_year=min_year, max_results=max_results, gold_titles=gold_titles)

In [7]:
search_openalex(
    keywords=keywords,
    seen_keys=seen_keys,
    all_results=all_results,
    min_year=min_year,
    max_results=max_results,
    relevance_terms=relevance_terms,
    gold_titles=gold_titles,
    email=email # unique for open alex
)

Searching OpenAlex...: 100%|██████████| 1/1 [00:00<00:00,  1.96it/s]

Found 2 candidate papers

Added: The Making of the English Working Class.
Added: The spreading of misinformation online
Gold papers found in this search: 0
Added 2 new papers






In [8]:
search_sciencedirect(
    keywords=keywords,
    seen_keys=seen_keys,
    all_results=all_results,
    min_year=min_year,
    max_results=max_results,
    relevance_terms=relevance_terms,
    gold_titles=gold_titles
)

Searching ScienceDirect...: 100%|██████████| 1/1 [00:00<00:00, 491.60it/s]

Found 2 candidate papers

Added: Association of the belief in conspiracy narratives with vaccination status and recommendation behaviours of German physicians
Added: Disinformed social movements: A large-scale mapping of conspiracy narratives as online harms during the COVID-19 pandemic
Gold papers found in this search: 0
Added 2 new papers






In [9]:
search_scopus(
    keywords=keywords,
    seen_keys=seen_keys,
    all_results=all_results,
    min_year=min_year,
    max_results=max_results,
    relevance_terms=relevance_terms,
    gold_titles=gold_titles
)

Searching Scopus...: 100%|██████████| 1/1 [00:00<00:00, 73.58it/s]

Found 10 candidate papers

Added: Exploring expert figures in alien-related UFO conspiracy theories
Added: The public debate of covid-19 vaccines on social media: a systematic review
Added: Faith in Fear: Conspiracy Theories as an Explanatory Tradition Within the American Christian Right
Adding more...
Gold papers found in this search: 0
Added 10 new papers






In [10]:
search_semanticscholar(
    keywords=keywords,
    seen_keys=seen_keys,
    all_results=all_results,
    semanticscholar_api_key_path=semanticscholar_api_key_path, # unique for semschol
    min_year=min_year,
    max_results=max_results,
    relevance_terms=relevance_terms,
    gold_titles=gold_titles
)

Searching Semantic Scholar...: 100%|██████████| 1/1 [00:02<00:00,  2.98s/it]

Found 4 candidate papers

Added: Conspiracy narratives and vaccine hesitancy: a scoping review of prevalence, impact, and interventions
Added: Credibility cues of conspiracy narratives: exploring the belief-driven credibility evaluation of a YouTube conspiracy video
Added: Conviction in the absence of proof: Conspiracy mentality mediates religiosity’s relationship with support for COVID-19 conspiracy narratives
Adding more...
Gold papers found in this search: 0
Added 4 new papers






### Checking and Saving Results

If you provided a gold paper file, we can check how many of its titles were found. However, this only compares lowercased titles exactly, so a title with even minor differences won't be recognized. Searches can also become quite large. I've had many gold papers flagged as not found that actually were somewhere in the search results, because their titles deviated slightly from the expected form.
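
If exact matching flags too many gold papers as missing, a tolerant comparison can help with a manual second check. The sketch below uses `difflib.SequenceMatcher` from the standard library. The helper names `normalize_title` and `find_near_match`, the threshold of 0.9, and the example "found" title are all assumptions for illustration and not part of `nr_gold_papers_found`.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    title = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(title.split())


def find_near_match(gold_title, found_titles, threshold=0.9):
    """Return the most similar found title if its similarity ratio reaches
    the threshold, otherwise None. Hypothetical helper for manual checks."""
    gold = normalize_title(gold_title)
    best_title, best_score = None, 0.0
    for candidate in found_titles:
        score = SequenceMatcher(None, gold, normalize_title(candidate)).ratio()
        if score > best_score:
            best_title, best_score = candidate, score
    return best_title if best_score >= threshold else None


# Made-up search results, only to show the call
found = ["Automated Detection of Tropes in Short Texts", "Something unrelated"]
print(find_near_match("automated detection of tropes in short texts.", found))
```

The threshold trades missed matches against false positives, so it is worth checking a few borderline cases by hand before relying on it.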

Be sure to check the save file to not lose any data!

In [11]:
nr_gold_papers_found(all_results, gold_titles, True)

print(f"Found {len(all_results)} papers in total\n")

for i, paper in enumerate(all_results[:3]):
    # Single quotes inside the f-strings keep this valid before Python 3.12
    print(f"\nPaper {i+1}")
    print(f"Title: {paper.get('title')}")
    print(f"Authors: {paper.get('authors', [])}")
    print(f"DOI: {paper.get('doi')}")
    print(f"Abstract: {paper.get('abstract')}...")
    print(f"Year: {paper.get('year')}")
    print(f"Source: {paper.get('source')}")

Not found: analyzing mis/disinformation: understandingswiss covid-19 narratives through nlp analysis
Not found: automated detection of tropes in short texts.
Not found: emotional triggers and cognitive manipulation in romanian social media: a nlp analysis
Not found: models for predicting changes in public opinion during the implementation of the narrative in social media
Not found: monitoring narratives about the energy transition in germany
Not found: multiclaimnet: a massively multilingual dataset of fact-checked claim clusters
Not found: propainsight: toward deeper understanding of propaganda in terms of techniques, appeals, and intent
Not found: tracking and identifying international propaganda and influence networks online
Not found: tracking the takes and trajectories of english-language news narratives across trustworthy and worrisome websites
Not found: trust in disinformation narratives: a trust in the news experiment
Not found: ukelectionnarratives: a dataset of misleading na

In [12]:
with open("results/candidate_papers.csv", mode="w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=all_results[0].keys())
    writer.writeheader()
    writer.writerows(all_results)

In [13]:
df = pd.read_csv("results/candidate_papers.csv")
df["source"].value_counts()

source
scopus             10
arxiv               7
acl_anthology       4
semanticscholar     4
openalex            2
sciencedirect       2
Name: count, dtype: int64

### Filter Papers using Abstract and Title

You can find the prompts I used for LLM-based processing in annotate/prompting.py. Use them as a reference for creating your own annotation prompts according to what you're interested in.

I tried to abstract the rest of the prompting logic as far as I could, so that API rate limits and JSON errors don't crash an entire annotation run.

It's probably also a good idea to deduplicate the search results first.
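
The general idea of keeping one bad response from crashing a whole run can be sketched as follows. This is not the implementation of `annotate_df`. The function `annotate_rows`, the `call_model` callable, and the use of `RuntimeError` as a stand-in for API failures such as rate limits are assumptions for illustration.

```python
import json

MARKER = "requires reannotation"


def annotate_rows(rows, call_model):
    """Annotate each row dict in place and flag failures instead of raising.

    `call_model` is a hypothetical callable that takes a row and returns a
    JSON string. Rows already marked as done are skipped, so running the
    function again resumes with the rows that failed earlier.
    """
    for row in rows:
        if row.get(MARKER) is False:
            continue  # annotated successfully in a previous run
        try:
            annotation = json.loads(call_model(row))
            if not isinstance(annotation, dict):
                raise ValueError("expected a JSON object")
            row.update(annotation)
            row[MARKER] = False
        except (ValueError, TypeError, RuntimeError):
            # Malformed JSON or a failed API call: mark the row for a retry
            row[MARKER] = True
    return rows


def fake_model(row):
    """Stand-in for an LLM call; returns broken JSON for one title."""
    return '{"survey": "No"}' if row["title"] != "bad" else "not json"


rows = annotate_rows([{"title": "ok"}, {"title": "bad"}], fake_model)
print([row[MARKER] for row in rows])  # [False, True]
```

Tracking the state per row is what makes it possible to rerun the cell and continue where the previous attempt stopped.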

In [14]:
import pandas as pd
pd.options.mode.copy_on_write = True
from academiccloud_api import OpenAIClient
from annotate import (
    annotate_df,
    get_screening_prompt
    )

In [15]:
if "df" not in globals():  # Check if df is not defined
    df = pd.read_csv("results/candidate_papers.csv")

print(f"Processing {len(df)} papers")

# Abstracts are needed to filter papers
# Thus, papers without abstracts are removed
before = len(df)
df = df[df["abstract"].notna() & df["abstract"].str.strip().ne("")]
print(f"Removed {before - len(df)} papers with missing or empty abstracts")

# Deduplicate by abstract and title
for col in ["abstract", "title"]:
    if col in df.columns:
        before = len(df)
        df = df.drop_duplicates(subset=col, keep="first")
        after = len(df)
        print(f"Removed {before - after} duplicates by '{col}'")

# Deduplicate by non-empty DOIs
if "doi" in df.columns:
    has_doi = df["doi"].notna() & df["doi"].str.strip().ne("")
    df_with_doi = df[has_doi]
    df_without_doi = df[~has_doi]
    before = len(df_with_doi)
    df_with_doi = df_with_doi.drop_duplicates(subset="doi", keep="first")
    after = len(df_with_doi)
    df = pd.concat([df_with_doi, df_without_doi], ignore_index=True)
    print(f"Removed {before - after} duplicates by 'doi'")

df.head()

Processing 29 papers
Removed 0 papers with missing or empty abstracts
Removed 0 duplicates by 'abstract'
Removed 0 duplicates by 'title'
Removed 0 duplicates by 'doi'


Unnamed: 0,title,authors,doi,abstract,url,year,source
0,Which side are you on? Insider-Outsider classi...,"[Name(first='Pavan', last='Holur'), Name(first...",10.18653/v1/2022.acl-long.341,Social media is a breeding ground for threat n...,https://aclanthology.org/2022.acl-long.341.pdf,2022,acl_anthology
1,Conspiracy Narratives in the Protest Movement ...,"[Name(first='Manuel', last='Weigand'), Name(fi...",10.18653/v1/2022.nlpcss-1.8,From the start of the COVID-19 pandemic in Ger...,https://aclanthology.org/2022.nlpcss-1.8.pdf,2022,acl_anthology
2,Identifying Conspiracy Theories News based on ...,"[Name(first='Yuanyuan', last='Lei'), Name(firs...",10.18653/v1/2023.findings-emnlp.656,"Conspiracy theories, as a type of misinformati...",https://aclanthology.org/2023.findings-emnlp.6...,2023,acl_anthology
3,An automated pipeline for the discovery of con...,"['Timothy R. Tangherlini', 'Shadi Shahsavari',...",10.1371/journal.pone.0233879,Although a great deal of attention has been pa...,http://arxiv.org/pdf/2008.09961v1,2020,arxiv
4,What distinguishes conspiracy from critical na...,"['Damir Korenčić', 'Berta Chulvi', 'Xavier Bon...",10.1111/exsy.13671,The current prevalence of conspiracy theories ...,http://arxiv.org/pdf/2407.10745v1,2024,arxiv


In [16]:
ac = OpenAIClient("api_keys/api_key.txt")
model = "qwen3-32b"

def screening_prompt_args(row):
    title = row['title']
    abstract = row['abstract']
    return [title, abstract]

# Fills the selected prompt with values returned by the
# function above and annotates the papers
df = annotate_df(
    df,
    client=ac,
    model=model,
    prompt_fn=get_screening_prompt,
    get_prompt_args=screening_prompt_args,
)

# Single quotes inside the f-string keep this valid before Python 3.12
needs_reannotation = df[df["requires reannotation"]]
if len(needs_reannotation) > 0:
    print(f"{len(needs_reannotation)} papers require reannotation!")
    print(
        "Simply run this cell again if the cause is not an API rate limit "
        "(in which case you have to wait for them to be reset).\n"
        "It should automatically pick up from rows which are not yet marked as completed.\n"
    )

df.to_csv("results/candidate_papers.csv", index=False)
df.head()

Annotating papers...: 100%|██████████| 3/3 [00:41<00:00, 13.67s/it]


Unnamed: 0,title,authors,doi,abstract,url,year,source,shared task,survey,disinformation focused,...,indicative quote,tasks present,tasks,methods present,methods,datasets present,domains,additional concepts present,additional concepts,requires reannotation
0,Which side are you on? Insider-Outsider classi...,"[Name(first='Pavan', last='Holur'), Name(first...",10.18653/v1/2022.acl-long.341,Social media is a breeding ground for threat n...,https://aclanthology.org/2022.acl-long.341.pdf,2022,acl_anthology,No,No,Yes,...,Social media is a breeding ground for threat n...,Yes,Insider-Outsider classification,Yes,pretrained language modeling,Yes,conspiracy-theoretic social media,No,,False
1,Conspiracy Narratives in the Protest Movement ...,"[Name(first='Manuel', last='Weigand'), Name(fi...",10.18653/v1/2022.nlpcss-1.8,From the start of the COVID-19 pandemic in Ger...,https://aclanthology.org/2022.nlpcss-1.8.pdf,2022,acl_anthology,No,No,Yes,...,We investigate this claim by measuring the fre...,Yes,"Narrative Classification,\nTopic Modeling",Yes,"Fine-tuning a Distilbert model,\nTopic Modelling",Yes,COVID-19 protests in Germany Telegram messages,No,,False
2,Identifying Conspiracy Theories News based on ...,"[Name(first='Yuanyuan', last='Lei'), Name(firs...",10.18653/v1/2023.findings-emnlp.656,"Conspiracy theories, as a type of misinformati...",https://aclanthology.org/2023.findings-emnlp.6...,2023,acl_anthology,No,No,Yes,...,"Conspiracy theories, as a type of misinformati...",Yes,Conspiracy Theory Identification,Yes,"Graph-based Analysis,\nLanguage Models,\nHeter...",Yes,News Articles,No,,False


### Annotate Full Text Papers

Perform paper selection based on the annotations, then download the selection (automatically wherever possible) and annotate the full PDFs.

You can, for example, implement the master boolean query for your search parameters here. Remember that DataFrames require the &, | and ~ operators for boolean logic: Python's and, or and not cannot be overloaded to work element-wise, so pandas uses the bitwise operators instead.

I would also recommend renaming the csv file here to not lose any annotations from the previous broad search.
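
As a small illustration of boolean masks with `&`, `|` and `~`, here is a sketch on a made-up three-row DataFrame. The column names mirror the annotation columns used in this notebook, but the data is invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "survey": ["No", "Yes", "No"],
    "source": ["arxiv", "scopus", None],
    "tasks present": ["Yes", "Yes", "Yes"],
})

# Each comparison needs its own parentheses, because & and | bind more
# tightly than == and != in Python
mask = (
    (toy["survey"] != "Yes")
    & toy["source"].notna()
    & (toy["tasks present"] == "Yes")
)
toy_selection = toy[mask]
print(len(toy_selection))  # 1

# ~ negates a mask: everything that was filtered out
print(len(toy[~mask]))  # 2
```

Writing `toy["survey"] != "Yes" and toy["source"].notna()` instead raises a `ValueError`, since `and` asks for a single truth value of the whole Series.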

In [17]:
import pandas as pd
pd.options.mode.copy_on_write = True
from academiccloud_api import OpenAIClient
from annotate import (
    annotate_df,
    get_review_prompt,
    scrape_paper
    )
from tqdm import tqdm

In [18]:
if "df" not in globals():  # Continue with an existing file
    df = pd.read_csv("results/candidate_papers.csv")

selection = df[
    (df["shared task"] != "Yes")
    & (df["survey"] != "Yes")
    # I manually added some papers that I wanted annotated
    # but that weren't in any of the API searches.
    # No cheating allowed, so entries without a source get kicked.
    & df["source"].notna()
    & (df["disinformation focused"] == "Yes")
    & (df["narrative focused"] == "Yes")
    & (df["tasks present"] == "Yes")
]
selection.to_csv("results/selection.csv", index=False)
print(f"Total: {len(selection)}")
selection["source"].value_counts()

Total: 3


source
acl_anthology    3
Name: count, dtype: int64

This is also where we scrape as many of the PDFs as we can find. It's very possible, though, that you will have to manually edit the resulting CSV here. For example, I had to add links to downloaded PDFs, as many DOIs point to very different landing pages that do not always feature the PDF directly in their payload.

scrape_paper is meant to handle both websites and local file links. I decided to just download the PDFs and add local file://... links, since I was already going through the landing pages of the DOIs one by one anyway.
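
One way to tell the two kinds of links apart is to look at the URL scheme. The sketch below is not the code of `scrape_paper`. The helper `classify_link` and the example file path are assumptions for illustration, built only on the standard library.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def classify_link(link):
    """Return ("local", Path) for file:// links and ("remote", url) for
    http(s) links. Hypothetical helper, not part of the annotate module."""
    parsed = urlparse(link)
    if parsed.scheme == "file":
        return "local", Path(url2pathname(parsed.path))
    if parsed.scheme in ("http", "https"):
        return "remote", link
    raise ValueError(f"Unsupported link: {link!r}")


print(classify_link("https://aclanthology.org/2022.acl-long.341.pdf"))
# The local path below is made up for the example
print(classify_link("file:///tmp/papers/example.pdf"))
```

With such a check, the url column of the CSV can mix web addresses and manually downloaded files without needing a separate column.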

In [19]:
tqdm.pandas()

selection["paper markdown"] = selection.progress_apply(scrape_paper, axis=1)
selection = selection[selection["paper markdown"].apply(lambda x: isinstance(x, str) and x.strip() != "")]

selection.to_csv("results/selection.csv", index=False)
selection.head()

100%|██████████| 3/3 [00:07<00:00,  2.59s/it]


Unnamed: 0,title,authors,doi,abstract,url,year,source,shared task,survey,disinformation focused,...,tasks present,tasks,methods present,methods,datasets present,domains,additional concepts present,additional concepts,requires reannotation,paper markdown
0,Which side are you on? Insider-Outsider classi...,"[Name(first='Pavan', last='Holur'), Name(first...",10.18653/v1/2022.acl-long.341,Social media is a breeding ground for threat n...,https://aclanthology.org/2022.acl-long.341.pdf,2022,acl_anthology,No,No,Yes,...,Yes,Insider-Outsider classification,Yes,pretrained language modeling,Yes,conspiracy-theoretic social media,No,,False,## **Which side are you on? Insider-Outsider c...
1,Conspiracy Narratives in the Protest Movement ...,"[Name(first='Manuel', last='Weigand'), Name(fi...",10.18653/v1/2022.nlpcss-1.8,From the start of the COVID-19 pandemic in Ger...,https://aclanthology.org/2022.nlpcss-1.8.pdf,2022,acl_anthology,No,No,Yes,...,Yes,"Narrative Classification,\nTopic Modeling",Yes,"Fine-tuning a Distilbert model,\nTopic Modelling",Yes,COVID-19 protests in Germany Telegram messages,No,,False,# **Conspiracy Narratives in the Protest Movem...
2,Identifying Conspiracy Theories News based on ...,"[Name(first='Yuanyuan', last='Lei'), Name(firs...",10.18653/v1/2023.findings-emnlp.656,"Conspiracy theories, as a type of misinformati...",https://aclanthology.org/2023.findings-emnlp.6...,2023,acl_anthology,No,No,Yes,...,Yes,Conspiracy Theory Identification,Yes,"Graph-based Analysis,\nLanguage Models,\nHeter...",Yes,News Articles,No,,False,# **Identifying Conspiracy Theories News based...


Once all the PDFs are accounted for they can be reviewed.

In [25]:
if "selection" not in globals():  # Continue with an existing file
    selection = pd.read_csv("results/selection.csv")

ac = OpenAIClient("api_keys/api_key.txt")
model = "qwen3-32b"

# IMPORTANT: reset the reannotation marker before we can add more annotations
selection = selection.drop(columns=["requires reannotation"])

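def review_prompt_args(row):
    full_paper = row["paper markdown"]
    # Cut everything from the reference list onwards; partition() keeps the
    # whole text when no "**References**" marker is present, whereas slicing
    # with find() would drop the last character in that case
    full_paper = full_paper.partition("**References**")[0]
    disinfo_topics = row["disinformation topics"].split(",") if isinstance(row["disinformation topics"], str) else "Disinformation"
    return [full_paper, disinfo_topics]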

selection = annotate_df(
    selection,
    client=ac,
    model=model,
    prompt_fn=get_review_prompt,
    get_prompt_args=review_prompt_args,
)

# Single quotes inside the f-string keep this valid before Python 3.12
needs_reannotation = selection[selection["requires reannotation"]]
if len(needs_reannotation) > 0:
    print(f"{len(needs_reannotation)} papers require reannotation!")
    print(
        "Simply run this cell again if the cause is not an API rate limit "
        "(in which case you have to wait for them to be reset).\n"
        "It should automatically pick up from rows which are not yet marked as completed.\n"
    )

selection.to_csv("results/selection.csv", index=False)
selection.head()

Annotating papers...: 100%|██████████| 3/3 [01:16<00:00, 25.63s/it]


Unnamed: 0,title,authors,doi,abstract,url,year,source,shared task,survey,disinformation focused,...,models,languages specified,languages,target group specified,target groups,data perspective specified,perspectives,modalities specified,modalities,requires reannotation
0,Which side are you on? Insider-Outsider classi...,"[Name(first='Pavan', last='Holur'), Name(first...",10.18653/v1/2022.acl-long.341,Social media is a breeding ground for threat n...,https://aclanthology.org/2022.acl-long.341.pdf,2022,acl_anthology,No,No,Yes,...,"BERT,\nRoBERTa,\nDistilBERT,\nXLM",No,,Yes,"Vaccine-hesitant communities,\nGroups associat...",Yes,Social media users generating conspiracy-theor...,Yes,Text,False
1,Conspiracy Narratives in the Protest Movement ...,"[Name(first='Manuel', last='Weigand'), Name(fi...",10.18653/v1/2022.nlpcss-1.8,From the start of the COVID-19 pandemic in Ger...,https://aclanthology.org/2022.nlpcss-1.8.pdf,2022,acl_anthology,No,No,Yes,...,Distilbert,Yes,German,No,,Yes,Telegram users in the Querdenken movement orga...,Yes,Text,False
2,Identifying Conspiracy Theories News based on ...,"[Name(first='Yuanyuan', last='Lei'), Name(firs...",10.18653/v1/2023.findings-emnlp.656,"Conspiracy theories, as a type of misinformati...",https://aclanthology.org/2023.findings-emnlp.6...,2023,acl_anthology,No,No,Yes,...,"Longformer,\nBi-LSTM,\nHeterogeneous Graph Att...",No,,No,,Yes,The data reflects news articles from both cons...,No,,False
