# Searching for Research Papers in arXiv

This notebook demonstrates how to interact with the `arXiv` API using `Floki`, specifically through the `ArxivFetcher` class. We will explore:

* How to search for papers using advanced query strings.
* How to filter results by date (e.g., last 24 hours).
* How to retrieve metadata for papers.
* How to download the top 5 papers for further exploration.
* How to extract and process text from the downloaded PDFs, with each page stored as a separate document.

In [None]:
### Install Required Libraries
!pip install floki-ai
!pip install arxiv

## Initialize Logging

In [1]:
import logging
logging.basicConfig(level=logging.INFO)

## Importing Necessary Modules

Import the required module and set up the `ArxivFetcher` to start searching for papers.

In [2]:
from floki.document import ArxivFetcher

# Initialize the fetcher
fetcher = ArxivFetcher()


## Basic Search by Query String

In this example, we search for papers related to "machine learning". Each result is returned as a `Document` object whose `text` holds the paper's summary and whose `metadata` holds details such as the title and authors.

In [3]:
# Search for papers related to "machine learning"
results = fetcher.search(query="machine learning", max_results=5)

# Display the metadata and summaries of the retrieved documents
for doc in results:
    print(f"Title: {doc.metadata['title']}")
    print(f"Authors: {', '.join(doc.metadata['authors'])}")
    print(f"Summary: {doc.text}\n")

INFO:floki.document.fetcher.arxiv:Searching for query: machine learning
INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=machine+learning&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100
INFO:arxiv:Got first page: 100 of 374510 total results
INFO:floki.document.fetcher.arxiv:Found 5 results for query: machine learning


Title: PERSE: Personalized 3D Generative Avatars from A Single Portrait
Authors: Hyunsoo Cha, Inhee Lee, Hanbyul Joo
Summary: We present PERSE, a method for building an animatable personalized generative
avatar from a reference portrait. Our avatar model enables facial attribute
editing in a continuous and disentangled latent space to control each facial
attribute, while preserving the individual's identity. To achieve this, our
method begins by synthesizing large-scale synthetic 2D video datasets, where
each video contains consistent changes in the facial expression and viewpoint,
combined with a variation in a specific facial attribute from the original
input. We propose a novel pipeline to produce high-quality, photorealistic 2D
videos with facial attribute editing. Leveraging this synthetic attribute
dataset, we present a personalized avatar creation method based on the 3D
Gaussian Splatting, learning a continuous and disentangled latent space for
intuitive facial attribute manipul
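
Full abstracts make the output above long to scan. The following is a minimal sketch of a more compact display, assuming only the `text` and `metadata` fields shown in this section. The `Doc` class is a hypothetical stand-in for the objects returned by `fetcher.search()`, and `one_line_summary` is an illustrative helper, not part of Floki.

```python
import textwrap
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a search result (not Floki's Document class)."""
    text: str
    metadata: dict = field(default_factory=dict)


def one_line_summary(doc, width=80):
    """Return 'Title (byline): shortened summary' on a single line."""
    authors = doc.metadata.get("authors", [])
    if not authors:
        byline = "unknown authors"
    elif len(authors) == 1:
        byline = authors[0]
    else:
        byline = f"{authors[0]} et al."
    # textwrap.shorten collapses whitespace, including the hard line breaks
    # inside the abstracts, and truncates at a word boundary.
    summary = textwrap.shorten(doc.text, width=width, placeholder="...")
    return f"{doc.metadata.get('title', 'untitled')} ({byline}): {summary}"


sample = Doc(
    text=(
        "We present PERSE, a method for building an animatable personalized "
        "generative\navatar from a reference portrait."
    ),
    metadata={
        "title": "PERSE: Personalized 3D Generative Avatars from A Single Portrait",
        "authors": ["Hyunsoo Cha", "Inhee Lee", "Hanbyul Joo"],
    },
)
print(one_line_summary(sample, width=60))
```

With real results, the same helper could be applied as `for doc in results: print(one_line_summary(doc))`, provided the returned objects expose `text` and `metadata` as they do in the cell above.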

## Advanced Query Strings

Here we demonstrate using advanced query strings with logical operators like `AND`, `OR`, and `NOT`.

Search for papers where "agents" and "cybersecurity" both appear:

In [4]:
results = fetcher.search(query="all:(agents AND cybersecurity)", max_results=10)

for doc in results:
    print(f"Title: {doc.metadata['title']}")
    print(f"Authors: {', '.join(doc.metadata['authors'])}")
    print(f"Summary: {doc.text}\n")

INFO:floki.document.fetcher.arxiv:Searching for query: all:(agents AND cybersecurity)
INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=all%3A%28agents+AND+cybersecurity%29&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100
INFO:arxiv:Got first page: 93 of 93 total results
INFO:floki.document.fetcher.arxiv:Found 10 results for query: all:(agents AND cybersecurity)


Title: SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity
Authors: Pengfei Jing, Mengyun Tang, Xiaorong Shi, Xing Zheng, Sen Nie, Shi Wu, Yong Yang, Xiapu Luo
Summary: Evaluating Large Language Models (LLMs) is crucial for understanding their
capabilities and limitations across various applications, including natural
language processing and code generation. Existing benchmarks like MMLU, C-Eval,
and HumanEval assess general LLM performance but lack focus on specific expert
domains such as cybersecurity. Previous attempts to create cybersecurity
datasets have faced limitations, including insufficient data volume and a
reliance on multiple-choice questions (MCQs). To address these gaps, we propose
SecBench, a multi-dimensional benchmarking dataset designed to evaluate LLMs in
the cybersecurity domain. SecBench includes questions in various formats (MCQs
and short-answer questions (SAQs)), at different capability levels (Knowledge
Retention and Logi

Search for papers where "quantum" appears but not "computing":

In [5]:
results = fetcher.search(query="all:(quantum NOT computing)", max_results=10)

for doc in results:
    print(f"Title: {doc.metadata['title']}")
    print(f"Authors: {', '.join(doc.metadata['authors'])}")
    print(f"Summary: {doc.text}\n")

INFO:floki.document.fetcher.arxiv:Searching for query: all:(quantum NOT computing)
INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=all%3A%28quantum+NOT+computing%29&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100
INFO:arxiv:Got first page: 100 of 355744 total results
INFO:floki.document.fetcher.arxiv:Found 10 results for query: all:(quantum NOT computing)


Title: Holographic observers for time-band algebras
Authors: Kristan Jensen, Suvrat Raju, Antony J. Speranza
Summary: We study the algebra of observables in a time band on the boundary of anti-de
Sitter space in a theory of quantum gravity. Strictly speaking this algebra
does not have a commutant because products of operators within the time band
give rise to operators outside the time band. However, we show that in a state
where the bulk contains a macroscopic observer, it is possible to define a
coarse-grained version of this algebra with a non-trivial commutant, and a
resolution limited by the observer's characteristics. This algebra acts on a
little Hilbert space that describes excitations about the observer's state and
time-translated versions of this state. Our construction requires a choice of
dressing that determines how elements of the algebra transform under the
Hamiltonian. At leading order in gravitational perturbation theory, and with a
specific choice of dressing, our con

Search for papers authored by a specific person:

In [6]:
results = fetcher.search(query='au:"John Doe"', max_results=10)

for doc in results:
    print(f"Title: {doc.metadata['title']}")
    print(f"Authors: {', '.join(doc.metadata['authors'])}")
    print(f"Summary: {doc.text}\n")

INFO:floki.document.fetcher.arxiv:Searching for query: au:"John Doe"
INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=au%3A%22John+Doe%22&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100
INFO:arxiv:Got first page: 1 of 1 total results
INFO:floki.document.fetcher.arxiv:Found 1 results for query: au:"John Doe"


Title: Double Deep Q-Learning in Opponent Modeling
Authors: Yangtianze Tao, John Doe
Summary: Multi-agent systems in which secondary agents with conflicting agendas also
alter their methods need opponent modeling. In this study, we simulate the main
agent's and secondary agents' tactics using Double Deep Q-Networks (DDQN) with
a prioritized experience replay mechanism. Then, under the opponent modeling
setup, a Mixture-of-Experts architecture is used to identify various opponent
strategy patterns. Finally, we analyze our models in two environments with
several agents. The findings indicate that the Mixture-of-Experts model, which
is based on opponent modeling, performs better than DDQN.
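
The query strings used in this section follow a small number of patterns, so they can be composed programmatically. The sketch below only builds strings and reproduces the three queries shown above. The helper names `all_fields` and `author` are illustrative and are not part of Floki or the `arxiv` package.

```python
def all_fields(*terms, operator="AND"):
    """Combine terms inside an `all:(...)` group, as in the examples above."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    if not terms:
        raise ValueError("at least one term is required")
    joined = f" {operator} ".join(terms)
    return f"all:({joined})"


def author(name):
    """Build an author query; the name is quoted so it is matched as a phrase."""
    return f'au:"{name}"'


print(all_fields("agents", "cybersecurity"))
print(all_fields("quantum", "computing", operator="NOT"))
print(author("John Doe"))
```

The resulting strings can be passed directly as the `query` argument of `fetcher.search`.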



## Filter Papers by Date (e.g., Last 24 Hours)

In [7]:
from datetime import datetime, timedelta

# Calculate the date 24 hours ago, formatted as YYYYMMDD
last_24_hours = (datetime.now() - timedelta(days=1)).strftime("%Y%m%d")

# Search for recent papers
recent_results = fetcher.search(
    query="all:(agents AND cybersecurity)",
    from_date=last_24_hours,
    to_date=datetime.now().strftime("%Y%m%d"),
    max_results=5
)

# Display recent papers
for doc in recent_results:
    print(f"Title: {doc.metadata['title']}")
    print(f"Authors: {', '.join(doc.metadata['authors'])}")
    print(f"Published: {doc.metadata['published']}")
    print(f"Summary: {doc.text}\n")

INFO:floki.document.fetcher.arxiv:Searching for query: all:(agents AND cybersecurity)
INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=all%3A%28agents+AND+cybersecurity%29+AND+submittedDate%3A%5B20241230+TO+20241231%5D&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100
INFO:arxiv:Got first page: 1 of 1 total results
INFO:floki.document.fetcher.arxiv:Found 1 results for query: all:(agents AND cybersecurity) AND submittedDate:[20241230 TO 20241231]


Title: SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity
Authors: Pengfei Jing, Mengyun Tang, Xiaorong Shi, Xing Zheng, Sen Nie, Shi Wu, Yong Yang, Xiapu Luo
Published: 2024-12-30
Summary: Evaluating Large Language Models (LLMs) is crucial for understanding their
capabilities and limitations across various applications, including natural
language processing and code generation. Existing benchmarks like MMLU, C-Eval,
and HumanEval assess general LLM performance but lack focus on specific expert
domains such as cybersecurity. Previous attempts to create cybersecurity
datasets have faced limitations, including insufficient data volume and a
reliance on multiple-choice questions (MCQs). To address these gaps, we propose
SecBench, a multi-dimensional benchmarking dataset designed to evaluate LLMs in
the cybersecurity domain. SecBench includes questions in various formats (MCQs
and short-answer questions (SAQs)), at different capability levels (Knowle
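
The log output above shows the date range being appended to the query as a `submittedDate:[FROM TO]` clause. The sketch below computes the same day-granular window and formats that clause as a plain string. It uses UTC because arXiv's API documentation describes `submittedDate` values in GMT; whether the local-time and UTC dates differ in practice depends on when the notebook runs. The helper names are illustrative, and the clause format is inferred from the log line rather than from Floki's source.

```python
from datetime import datetime, timedelta, timezone


def date_window(days_back=1, now=None):
    """Return (from_date, to_date) as YYYYMMDD strings, computed in UTC."""
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(days=days_back)
    return start.strftime("%Y%m%d"), now.strftime("%Y%m%d")


def submitted_date_clause(from_date, to_date):
    """Format the range the way it appears in the log output above."""
    return f"submittedDate:[{from_date} TO {to_date}]"


# A fixed timestamp keeps the example reproducible.
fixed_now = datetime(2024, 12, 31, 12, 0, tzinfo=timezone.utc)
from_date, to_date = date_window(days_back=1, now=fixed_now)
print(submitted_date_clause(from_date, to_date))
```

Because the strings carry only a date, the window spans two calendar days (yesterday and today) rather than exactly 24 hours.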

## Download Top 5 Papers as PDF Files

In [8]:
import os
from pathlib import Path

# Create a directory for downloaded papers
os.makedirs("arxiv_papers", exist_ok=True)

# Search and download PDFs
download_results = fetcher.search(
    query="all:(agents AND cybersecurity)",
    max_results=5,
    download=True,
    dirpath=Path("arxiv_papers"),
)

for paper in download_results:
    print(f"Downloaded Paper: {paper['title']}")
    print(f"File Path: {paper['file_path']}\n")

INFO:floki.document.fetcher.arxiv:Searching for query: all:(agents AND cybersecurity)
INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=all%3A%28agents+AND+cybersecurity%29&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100
INFO:arxiv:Got first page: 93 of 93 total results
INFO:floki.document.fetcher.arxiv:Found 5 results for query: all:(agents AND cybersecurity)
INFO:floki.document.fetcher.arxiv:Downloading paper to arxiv_papers/2412.20787v1.SecBench__A_Comprehensive_Multi_Dimensional_Benchmarking_Dataset_for_LLMs_in_Cybersecurity.pdf
INFO:floki.document.fetcher.arxiv:Downloading paper to arxiv_papers/2412.13420v1.BotSim__LLM_Powered_Malicious_Social_Botnet_Simulation.pdf
INFO:floki.document.fetcher.arxiv:Downloading paper to arxiv_papers/2412.15237v1.algoTRIC__Symmetric_and_asymmetric_encryption_algorithms_for_Cryptography____A_comparative_analysis_in_AI_era.pdf
INFO:floki.document.fetcher.arxiv:Downloading paper

Downloaded Paper: SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity
File Path: arxiv_papers/2412.20787v1.SecBench__A_Comprehensive_Multi_Dimensional_Benchmarking_Dataset_for_LLMs_in_Cybersecurity.pdf

Downloaded Paper: BotSim: LLM-Powered Malicious Social Botnet Simulation
File Path: arxiv_papers/2412.13420v1.BotSim__LLM_Powered_Malicious_Social_Botnet_Simulation.pdf

Downloaded Paper: algoTRIC: Symmetric and asymmetric encryption algorithms for Cryptography -- A comparative analysis in AI era
File Path: arxiv_papers/2412.15237v1.algoTRIC__Symmetric_and_asymmetric_encryption_algorithms_for_Cryptography____A_comparative_analysis_in_AI_era.pdf

Downloaded Paper: The Fusion of Large Language Models and Formal Methods for Trustworthy AI Agents: A Roadmap
File Path: arxiv_papers/2412.06512v1.The_Fusion_of_Large_Language_Models_and_Formal_Methods_for_Trustworthy_AI_Agents__A_Roadmap.pdf

Downloaded Paper: Out-of-Distribution Detection for Neurosymboli
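
The file names in the log follow a visible pattern: the versioned arXiv ID, then the title with every non-word character replaced by an underscore, then `.pdf`. The sketch below reproduces that pattern so a file name can be predicted from an ID and a title. This is inferred from the output above, not taken from Floki's documentation, and `expected_pdf_name` is a hypothetical helper.

```python
import re


def expected_pdf_name(short_id, title):
    """Predict the file name pattern seen in the download log above."""
    # Each character that is not a letter, digit, or underscore becomes "_",
    # so ": " turns into "__" and " -- " turns into "____".
    safe_title = re.sub(r"[^\w]", "_", title)
    return f"{short_id}.{safe_title}.pdf"


print(expected_pdf_name(
    "2412.13420v1",
    "BotSim: LLM-Powered Malicious Social Botnet Simulation",
))
```

In practice, the `file_path` value returned by the download step remains the authoritative location; a predicted name is mainly useful for checking whether a paper is already on disk.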

## Reading Downloaded PDFs

To read the downloaded PDF files, we'll use the `PyPDFReader` class from `floki.document`. It extracts the content of each page while retaining the associated metadata for further processing.

In [None]:
# Install the library required for reading PDFs, if it is not already installed
!pip install pypdf

The following code reads each downloaded PDF file and extracts its pages. Each page is stored as a separate `Document` object, containing both the page's text and the metadata from the original PDF.

In [9]:
from pathlib import Path
from floki.document import PyPDFReader

# Initialize the PDF reader and the list that collects every page
docs_read = []
reader = PyPDFReader()

# Process each downloaded PDF
for paper in download_results:
    local_pdf_path = Path(paper["file_path"])  # Ensure the key matches the output
    documents = reader.load(local_pdf_path, additional_metadata=paper)  # Load the PDF with metadata
    
    # Append each page's document to the main list
    docs_read.extend(documents)  # Flatten into one list of all documents

# Verify the results
print(f"Extracted {len(docs_read)} documents from the PDFs.")



Extracted 83 documents from the PDFs.
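
Since every page becomes its own document, a quick sanity check is to count the page documents per file and compare the count with the `total_pages` value in the metadata. The sketch below assumes only the `file_path`, `page_number`, and `total_pages` keys visible in the output that follows. `PageDoc` is a hypothetical stand-in for the page-level objects, and the sample data is made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Hypothetical stand-in for a page-level document (not Floki's class)."""
    text: str
    metadata: dict = field(default_factory=dict)


def pages_per_file(docs):
    """Count how many page documents were produced for each PDF."""
    return Counter(doc.metadata["file_path"] for doc in docs)


def incomplete_files(docs):
    """Return file paths whose page count differs from `total_pages`."""
    counts = pages_per_file(docs)
    totals = {d.metadata["file_path"]: d.metadata["total_pages"] for d in docs}
    return [path for path, n in counts.items() if n != totals[path]]


# Illustrative data: a.pdf is complete, b.pdf is missing one of its 3 pages.
pages = [
    PageDoc("p1", {"file_path": "a.pdf", "page_number": 1, "total_pages": 2}),
    PageDoc("p2", {"file_path": "a.pdf", "page_number": 2, "total_pages": 2}),
    PageDoc("p1", {"file_path": "b.pdf", "page_number": 1, "total_pages": 3}),
    PageDoc("p2", {"file_path": "b.pdf", "page_number": 2, "total_pages": 3}),
]
print(dict(pages_per_file(pages)))
print(incomplete_files(pages))
```

If the real page documents expose `metadata` the same way, calling `pages_per_file(docs_read)` would show how the 83 extracted documents are distributed across the downloaded papers.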


In [12]:
docs_read[0:15]

[Document(metadata={'file_path': 'arxiv_papers/2412.20787v1.SecBench__A_Comprehensive_Multi_Dimensional_Benchmarking_Dataset_for_LLMs_in_Cybersecurity.pdf', 'page_number': 1, 'total_pages': 11, 'entry_id': 'http://arxiv.org/abs/2412.20787v1', 'title': 'SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity', 'authors': ['Pengfei Jing', 'Mengyun Tang', 'Xiaorong Shi', 'Xing Zheng', 'Sen Nie', 'Shi Wu', 'Yong Yang', 'Xiapu Luo'], 'published': '2024-12-30', 'updated': '2024-12-30', 'primary_category': 'cs.CR', 'categories': ['cs.CR', 'cs.AI'], 'pdf_url': 'http://arxiv.org/pdf/2412.20787v1'}, text='SecBench: A Comprehensive Multi-Dimensional\nBenchmarking Dataset for LLMs in Cybersecurity\nPENGFEI JING, The Hong Kong Polytechnic University, Tencent Security Keen Lab, China\nMENGYUN TANG, Tencent Zhuque Lab, China\nXIAORONG SHI, Tencent Zhuque Lab, China\nXING ZHENG, Tencent Zhuque Lab, China\nSEN NIE, Tencent Security Keen Lab, China\nSHI WU, Tencent Sec