# Searching for Research Papers in arXiv

This notebook demonstrates how to interact with the `arXiv` API using `Floki`, specifically through the `ArxivFetcher` class. We will explore:

* How to search for papers using advanced query strings.
* How to filter results by date (e.g., the last 15 days).
* How to retrieve metadata for papers.
* How to download the top 5 papers for further exploration.
* How to extract and process text from the downloaded PDFs, with each page stored as a separate document.

In [None]:
# Install required libraries
!pip install floki-ai
!pip install arxiv

## Initialize Logging

In [1]:
import logging
logging.basicConfig(level=logging.INFO)

## Importing Necessary Modules

Import the required module and set up the `ArxivFetcher` to start searching for papers.

In [2]:
from floki.document import ArxivFetcher

# Initialize the fetcher
fetcher = ArxivFetcher()

## Basic Search by Query String

In this example, we search for papers related to "machine learning". The results are returned as `Document` objects: `text` holds the paper's summary, and `metadata` holds details such as the title and authors.

In [3]:
# Search for papers related to "machine learning"
results = fetcher.search(query="machine learning", max_results=5)

# Display the metadata and summaries of the retrieved documents
for doc in results:
    print(f"Title: {doc.metadata['title']}")
    print(f"Authors: {', '.join(doc.metadata['authors'])}")
    print(f"Summary: {doc.text}\n")

INFO:floki.document.fetcher.arxiv:Searching for query: machine learning
INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=machine+learning&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100
INFO:arxiv:Got first page: 100 of 377877 total results
INFO:floki.document.fetcher.arxiv:Found 5 results for query: machine learning


Title: Learning segmentation from point trajectories
Authors: Laurynas Karazija, Iro Laina, Christian Rupprecht, Andrea Vedaldi
Summary: We consider the problem of segmenting objects in videos based on their motion
and no other forms of supervision. Prior work has often approached this problem
by using the principle of common fate, namely the fact that the motion of
points that belong to the same object is strongly correlated. However, most
authors have only considered instantaneous motion from optical flow. In this
work, we present a way to train a segmentation network using long-term point
trajectories as a supervisory signal to complement optical flow. The key
difficulty is that long-term motion, unlike instantaneous motion, is difficult
to model -- any parametric approximation is unlikely to capture complex motion
patterns over long periods of time. We instead draw inspiration from subspace
clustering approaches, proposing a loss function that seeks to group the
trajectories into l

## Advanced Query Strings

Here we demonstrate using advanced query strings with logical operators like `AND`, `OR`, and `NOT`.
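As a rough illustration of how such a query string is formed and how it appears in a request URL, the sketch below joins terms with a Boolean operator and URL-encodes the result using only the standard library. The helper `build_query` is hypothetical and not part of `Floki` or the `arxiv` package; only the `search_query` parameter name is taken from the request URLs in the log output.

```python
from urllib.parse import urlencode


def build_query(terms, operator="AND", field="all"):
    """Join terms with a Boolean operator under a field prefix, e.g. all:(a AND b).

    Hypothetical helper for illustration; not part of Floki or arxiv.
    """
    joined = f" {operator} ".join(terms)
    return f"{field}:({joined})"


query = build_query(["agents", "cybersecurity"])
print(query)

# URL-encode the query the way it appears in a request URL
print(urlencode({"search_query": query}))
```

The encoded form can be compared against the `search_query=` part of the logged request URL to confirm what was actually sent.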

Search for papers where "agents" and "cybersecurity" both appear:

In [4]:
results = fetcher.search(query="all:(agents AND cybersecurity)", max_results=10)

for doc in results:
    print(f"Title: {doc.metadata['title']}")
    print(f"Authors: {', '.join(doc.metadata['authors'])}")
    print(f"Summary: {doc.text}\n")

INFO:floki.document.fetcher.arxiv:Searching for query: all:(agents AND cybersecurity)
INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=all%3A%28agents+AND+cybersecurity%29&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100
INFO:arxiv:Got first page: 95 of 95 total results
INFO:floki.document.fetcher.arxiv:Found 10 results for query: all:(agents AND cybersecurity)


Title: CyberMentor: AI Powered Learning Tool Platform to Address Diverse Student Needs in Cybersecurity Education
Authors: Tianyu Wang, Nianjun Zhou, Zhixiong Chen
Summary: Many non-traditional students in cybersecurity programs often lack access to
advice from peers, family members and professors, which can hinder their
educational experiences. Additionally, these students may not fully benefit
from various LLM-powered AI assistants due to issues like content relevance,
locality of advice, minimum expertise, and timing. This paper addresses these
challenges by introducing an application designed to provide comprehensive
support by answering questions related to knowledge, skills, and career
preparation advice tailored to the needs of these students. We developed a
learning tool platform, CyberMentor, to address the diverse needs and pain
points of students majoring in cybersecurity. Powered by agentic workflow and
Generative Large Language Models (LLMs), the platform leverages
Retriev

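Search for papers where "quantum" appears but not "computing". The arXiv API user manual lists the exclusion operator as `ANDNOT`, so it is worth checking whether results from the `NOT` form used here actually exclude the second term: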

In [5]:
results = fetcher.search(query="all:(quantum NOT computing)", max_results=10)

for doc in results:
    print(f"Title: {doc.metadata['title']}")
    print(f"Authors: {', '.join(doc.metadata['authors'])}")
    print(f"Summary: {doc.text}\n")

INFO:floki.document.fetcher.arxiv:Searching for query: all:(quantum NOT computing)
INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=all%3A%28quantum+NOT+computing%29&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100
INFO:arxiv:Got first page: 100 of 356845 total results
INFO:floki.document.fetcher.arxiv:Found 10 results for query: all:(quantum NOT computing)


Title: Mukkamala-Pereñiguez master function for even-parity perturbations of the Schwarzschild spacetime
Authors: Eric Poisson
Summary: Mukkamala and Pere\~niguez recently discovered a new master function for
even-parity metric perturbations of the Schwarzschild spacetime. Remarkably,
this function satisfies the Regge-Wheeler equation (instead of the Zerilli
equation), which was previously understood to govern the odd-parity sector of
the perturbation only. In this paper I follow up on their work. First, I
identify a source term for their Regge-Wheeler equation, constructed from the
perturbing energy-momentum tensor. Second, I relate the new master function to
the radiation fields at future null infinity and the event horizon. Third, I
reconstruct the metric perturbation from the new master function, in the
Regge-Wheeler gauge. The main conclusion of this work is that the greater
simplicity of the Regge-Wheeler equation (relative to the Zerilli equation) is
offset by a greater complexi

Search for papers authored by a specific person:

In [6]:
results = fetcher.search(query='au:"John Doe"', max_results=10)

for doc in results:
    print(f"Title: {doc.metadata['title']}")
    print(f"Authors: {', '.join(doc.metadata['authors'])}")
    print(f"Summary: {doc.text}\n")

INFO:floki.document.fetcher.arxiv:Searching for query: au:"John Doe"
INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=au%3A%22John+Doe%22&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100
INFO:arxiv:Got first page: 1 of 1 total results
INFO:floki.document.fetcher.arxiv:Found 1 results for query: au:"John Doe"


Title: Double Deep Q-Learning in Opponent Modeling
Authors: Yangtianze Tao, John Doe
Summary: Multi-agent systems in which secondary agents with conflicting agendas also
alter their methods need opponent modeling. In this study, we simulate the main
agent's and secondary agents' tactics using Double Deep Q-Networks (DDQN) with
a prioritized experience replay mechanism. Then, under the opponent modeling
setup, a Mixture-of-Experts architecture is used to identify various opponent
strategy patterns. Finally, we analyze our models in two environments with
several agents. The findings indicate that the Mixture-of-Experts model, which
is based on opponent modeling, performs better than DDQN.



## Filter Papers by Date (e.g., Last 15 Days)

In [7]:
from datetime import datetime, timedelta

# Calculate the date 15 days ago, formatted as YYYYMMDD
from_date = (datetime.now() - timedelta(days=15)).strftime("%Y%m%d")

# Search for recent papers
recent_results = fetcher.search(
    query="all:(agents AND cybersecurity)",
    from_date=from_date,
    to_date=datetime.now().strftime("%Y%m%d"),
    max_results=5
)

# Display recent papers
for doc in recent_results:
    print(f"Title: {doc.metadata['title']}")
    print(f"Authors: {', '.join(doc.metadata['authors'])}")
    print(f"Published: {doc.metadata['published']}")
    print(f"Summary: {doc.text}\n")

INFO:floki.document.fetcher.arxiv:Searching for query: all:(agents AND cybersecurity)
INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=all%3A%28agents+AND+cybersecurity%29+AND+submittedDate%3A%5B20250107+TO+20250122%5D&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100
INFO:arxiv:Got first page: 1 of 1 total results
INFO:floki.document.fetcher.arxiv:Found 1 results for query: all:(agents AND cybersecurity) AND submittedDate:[20250107 TO 20250122]


Title: CyberMentor: AI Powered Learning Tool Platform to Address Diverse Student Needs in Cybersecurity Education
Authors: Tianyu Wang, Nianjun Zhou, Zhixiong Chen
Published: 2025-01-16
Summary: Many non-traditional students in cybersecurity programs often lack access to
advice from peers, family members and professors, which can hinder their
educational experiences. Additionally, these students may not fully benefit
from various LLM-powered AI assistants due to issues like content relevance,
locality of advice, minimum expertise, and timing. This paper addresses these
challenges by introducing an application designed to provide comprehensive
support by answering questions related to knowledge, skills, and career
preparation advice tailored to the needs of these students. We developed a
learning tool platform, CyberMentor, to address the diverse needs and pain
points of students majoring in cybersecurity. Powered by agentic workflow and
Generative Large Language Models (LLMs), the plat
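
The log output above shows the date range appended to the query as a `submittedDate:[... TO ...]` clause. The sketch below reproduces that string with the standard library. The helpers `date_window` and `with_date_clause` are hypothetical names, and how `ArxivFetcher` builds the clause internally is an assumption inferred from the log, not from its source. A fixed reference date is used so the result is reproducible.

```python
from datetime import datetime, timedelta


def date_window(days, now=None):
    """Return (from_date, to_date) as YYYYMMDD strings covering the last `days` days."""
    now = now or datetime.now()
    start = now - timedelta(days=days)
    return start.strftime("%Y%m%d"), now.strftime("%Y%m%d")


def with_date_clause(query, from_date, to_date):
    """Append a submittedDate range in the form seen in the log output."""
    return f"{query} AND submittedDate:[{from_date} TO {to_date}]"


# Fixed reference date (assumed here to match the run shown in the log)
from_date, to_date = date_window(15, now=datetime(2025, 1, 22))
print(with_date_clause("all:(agents AND cybersecurity)", from_date, to_date))
```

Computing both ends of the window from a single `now` value avoids the small risk of the two dates straddling midnight.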

## Download Top 5 Papers as PDF Files

In [8]:
import os
from pathlib import Path

# Create a directory for downloaded papers
os.makedirs("arxiv_papers", exist_ok=True)

# Search and download PDFs
download_results = fetcher.search(query="all:(agents AND cybersecurity)", max_results=5, download=True, dirpath=Path("arxiv_papers"))

for paper in download_results:
    print(f"Downloaded Paper: {paper['title']}")
    print(f"File Path: {paper['file_path']}\n")

INFO:floki.document.fetcher.arxiv:Searching for query: all:(agents AND cybersecurity)
INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=all%3A%28agents+AND+cybersecurity%29&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100
INFO:arxiv:Got first page: 95 of 95 total results
INFO:floki.document.fetcher.arxiv:Found 5 results for query: all:(agents AND cybersecurity)
INFO:floki.document.fetcher.arxiv:Downloading paper to arxiv_papers/2501.09709v1.CyberMentor__AI_Powered_Learning_Tool_Platform_to_Address_Diverse_Student_Needs_in_Cybersecurity_Education.pdf
INFO:floki.document.fetcher.arxiv:Downloading paper to arxiv_papers/2501.00855v1.What_is_a_Social_Media_Bot__A_Global_Comparison_of_Bot_and_Human_Characteristics.pdf
INFO:floki.document.fetcher.arxiv:Downloading paper to arxiv_papers/2412.20787v3.SecBench__A_Comprehensive_Multi_Dimensional_Benchmarking_Dataset_for_LLMs_in_Cybersecurity.pdf
INFO:floki.document.fetcher.

Downloaded Paper: CyberMentor: AI Powered Learning Tool Platform to Address Diverse Student Needs in Cybersecurity Education
File Path: arxiv_papers/2501.09709v1.CyberMentor__AI_Powered_Learning_Tool_Platform_to_Address_Diverse_Student_Needs_in_Cybersecurity_Education.pdf

Downloaded Paper: What is a Social Media Bot? A Global Comparison of Bot and Human Characteristics
File Path: arxiv_papers/2501.00855v1.What_is_a_Social_Media_Bot__A_Global_Comparison_of_Bot_and_Human_Characteristics.pdf

Downloaded Paper: SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity
File Path: arxiv_papers/2412.20787v3.SecBench__A_Comprehensive_Multi_Dimensional_Benchmarking_Dataset_for_LLMs_in_Cybersecurity.pdf

Downloaded Paper: BotSim: LLM-Powered Malicious Social Botnet Simulation
File Path: arxiv_papers/2412.13420v1.BotSim__LLM_Powered_Malicious_Social_Botnet_Simulation.pdf

Downloaded Paper: algoTRIC: Symmetric and asymmetric encryption algorithms for Cryptography 

In [9]:
download_results[0]

{'entry_id': 'http://arxiv.org/abs/2501.09709v1',
 'title': 'CyberMentor: AI Powered Learning Tool Platform to Address Diverse Student Needs in Cybersecurity Education',
 'authors': ['Tianyu Wang', 'Nianjun Zhou', 'Zhixiong Chen'],
 'published': '2025-01-16',
 'updated': '2025-01-16',
 'primary_category': 'cs.CY',
 'categories': ['cs.CY', 'cs.AI', 'K.3.2; I.2.1'],
 'pdf_url': 'http://arxiv.org/pdf/2501.09709v1',
 'file_path': 'arxiv_papers/2501.09709v1.CyberMentor__AI_Powered_Learning_Tool_Platform_to_Address_Diverse_Student_Needs_in_Cybersecurity_Education.pdf'}
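
Because each returned record carries a `file_path`, a quick check that every file actually landed on disk can be useful before further processing. This is a minimal sketch using only `pathlib`; `missing_files` is a hypothetical helper, and the sample record stands in for `download_results`.

```python
from pathlib import Path


def missing_files(records):
    """Return the file_path of every record whose file is not on disk."""
    return [r["file_path"] for r in records if not Path(r["file_path"]).is_file()]


# Stand-in record; in the notebook this would be `download_results`
sample = [{"title": "Example paper", "file_path": "arxiv_papers/does_not_exist.pdf"}]
print(missing_files(sample))
```

An empty list means every recorded path points to an existing file.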

## Download Top 5 Papers as PDF Files (Include Summary)

In [10]:
import os
from pathlib import Path

# Create the directory that the PDFs will be downloaded to
os.makedirs("more_arxiv", exist_ok=True)

# Search and download PDFs
download_results = fetcher.search(query="all:(agents AND cybersecurity)", max_results=5, download=True, dirpath=Path("more_arxiv"), include_summary=True)

INFO:floki.document.fetcher.arxiv:Searching for query: all:(agents AND cybersecurity)
INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=all%3A%28agents+AND+cybersecurity%29&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100
INFO:arxiv:Got first page: 94 of 94 total results
INFO:floki.document.fetcher.arxiv:Found 5 results for query: all:(agents AND cybersecurity)
INFO:floki.document.fetcher.arxiv:Downloading paper to more_arxiv/2501.00855v1.What_is_a_Social_Media_Bot__A_Global_Comparison_of_Bot_and_Human_Characteristics.pdf
INFO:floki.document.fetcher.arxiv:Downloading paper to more_arxiv/2412.20787v3.SecBench__A_Comprehensive_Multi_Dimensional_Benchmarking_Dataset_for_LLMs_in_Cybersecurity.pdf
INFO:floki.document.fetcher.arxiv:Downloading paper to more_arxiv/2412.13420v1.BotSim__LLM_Powered_Malicious_Social_Botnet_Simulation.pdf
INFO:floki.document.fetcher.arxiv:Downloading paper to more_arxiv/2412.15237v1.algoTRI

In [11]:
download_results[0]

{'entry_id': 'http://arxiv.org/abs/2501.00855v1',
 'title': 'What is a Social Media Bot? A Global Comparison of Bot and Human Characteristics',
 'authors': ['Lynnette Hui Xian Ng', 'Kathleen M. Carley'],
 'published': '2025-01-01',
 'updated': '2025-01-01',
 'primary_category': 'cs.CY',
 'categories': ['cs.CY', 'cs.AI', 'cs.SI'],
 'pdf_url': 'http://arxiv.org/pdf/2501.00855v1',
 'file_path': 'more_arxiv/2501.00855v1.What_is_a_Social_Media_Bot__A_Global_Comparison_of_Bot_and_Human_Characteristics.pdf',
 'summary': 'Chatter on social media is 20% bots and 80% humans. Chatter by bots and\nhumans is consistently different: bots tend to use linguistic cues that can be\neasily automated while humans use cues that require dialogue understanding.\nBots use words that match the identities they choose to present, while humans\nmay send messages that are not related to the identities they present. Bots and\nhumans differ in their communication structure: sampled bots have a star\ninteraction stru

In [13]:
print(download_results[0]["summary"])

Chatter on social media is 20% bots and 80% humans. Chatter by bots and
humans is consistently different: bots tend to use linguistic cues that can be
easily automated while humans use cues that require dialogue understanding.
Bots use words that match the identities they choose to present, while humans
may send messages that are not related to the identities they present. Bots and
humans differ in their communication structure: sampled bots have a star
interaction structure, while sampled humans have a hierarchical structure.
These conclusions are based on a large-scale analysis of social media tweets
across ~200mil users across 7 events. Social media bots took the world by storm
when social-cybersecurity researchers realized that social media users not only
consisted of humans but also of artificial agents called bots. These bots wreck
havoc online by spreading disinformation and manipulating narratives. Most
research on bots are based on special-purposed definitions, mostly predicat
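
The summaries come back with hard line breaks, as the `\n` characters in the raw record show. If a single-line or shortened form is easier to work with, something like the sketch below may help. `tidy_summary` is a hypothetical helper built on the standard library, and the sample text is the opening of the summary printed above.

```python
import textwrap


def tidy_summary(text, width=None):
    """Collapse line breaks and repeated spaces; optionally shorten to `width` characters."""
    flat = " ".join(text.split())
    if width is None:
        return flat
    # textwrap.shorten cuts at a word boundary and appends the placeholder
    return textwrap.shorten(flat, width=width, placeholder="...")


raw = (
    "Chatter on social media is 20% bots and 80% humans. Chatter by bots and\n"
    "humans is consistently different"
)
print(tidy_summary(raw))
print(tidy_summary(raw, width=40))
```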

## Reading Downloaded PDFs

To read the downloaded PDF files, we'll use the `PyPDFReader` class from `floki.document`. This allows us to extract the content of each page while retaining the associated metadata for further processing.

In [None]:
# Install the library required for reading PDFs, if it is not already installed
!pip install pypdf

The following code reads each downloaded PDF file and extracts its pages. Each page is stored as a separate `Document` object, containing both the page's text and the metadata from the original PDF.

In [16]:
from pathlib import Path
from floki.document import PyPDFReader

# Initialize the PDF reader and a list to collect every page
reader = PyPDFReader()
docs_read = []

# Remove 'summary' from metadata in download_results
for paper in download_results:
    paper.pop("summary", None)  # Remove the 'summary' key if it exists

# Process each downloaded PDF
for paper in download_results:
    local_pdf_path = Path(paper["file_path"])  # Ensure the key matches the output
    documents = reader.load(local_pdf_path, additional_metadata=paper)  # Load the PDF with metadata
    
    # Append each page's document to the main list
    docs_read.extend(documents)  # Flatten into one list of all documents

# Verify the results
print(f"Extracted {len(docs_read)} documents from the PDFs.")



Extracted 107 documents from the PDFs.
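
Since each page becomes its own document, one quick sanity check is to count pages per paper. The sketch below assumes each document exposes a `metadata` dictionary with a `title` key, as the metadata passed through `additional_metadata` provides. `PageDoc` is a hypothetical stand-in class so the example runs without `Floki`; in the notebook, `docs_read` would take the place of `sample_docs`.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Stand-in for a page-level document with `text` and `metadata` attributes."""
    text: str
    metadata: dict = field(default_factory=dict)


def pages_per_paper(docs):
    """Count page-level documents per paper title."""
    return Counter(doc.metadata["title"] for doc in docs)


sample_docs = [
    PageDoc("page one", {"title": "Paper A", "page_number": 1}),
    PageDoc("page two", {"title": "Paper A", "page_number": 2}),
    PageDoc("page one", {"title": "Paper B", "page_number": 1}),
]
for title, count in pages_per_paper(sample_docs).items():
    print(f"{title}: {count} pages")
```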


In [17]:
docs_read[0:15]

[Document(metadata={'file_path': 'more_arxiv/2501.00855v1.What_is_a_Social_Media_Bot__A_Global_Comparison_of_Bot_and_Human_Characteristics.pdf', 'page_number': 1, 'total_pages': 33, 'entry_id': 'http://arxiv.org/abs/2501.00855v1', 'title': 'What is a Social Media Bot? A Global Comparison of Bot and Human Characteristics', 'authors': ['Lynnette Hui Xian Ng', 'Kathleen M. Carley'], 'published': '2025-01-01', 'updated': '2025-01-01', 'primary_category': 'cs.CY', 'categories': ['cs.CY', 'cs.AI', 'cs.SI'], 'pdf_url': 'http://arxiv.org/pdf/2501.00855v1'}, text='What is a Social Media Bot? A Global Comparison of\nBot and Human Characteristics\nLynnette Hui Xian Ng1,* and Kathleen M. Carley1\n1Center for Informed Democracy & Social - cybersecurity (IDeaS), Societal and Software Systems Carnegie Mellon\nUniversity, Pittsburgh, PA 15213\n*lynnetteng@cmu.edu\nABSTRACT\nChatter on social media about global events comes from 20% bots and 80% humans. The chatter by bots and humans is\nconsistently d