## 📚 **Notebook 01: ArXiv API Exploration**

### Purpose
Explore the ArXiv API to understand how to search for and retrieve AI research papers programmatically. This notebook establishes the foundation for our paper discovery pipeline.

### What We'll Do

| Step | Task | Description | Output |
|------|------|-------------|--------|
| 1 | **Install & Import** | Set up arxiv library and dependencies | Working environment |
| 2 | **Basic Search** | Test simple keyword searches | List of recent papers |
| 3 | **Explore Metadata** | Examine paper structure (title, abstract, authors, etc.) | Understanding of data fields |
| 4 | **Advanced Queries** | Filter by category, date, sort options | Targeted search results |
| 5 | **Download PDFs** | Test PDF retrieval functionality | Sample PDF files |
| 6 | **Build Search Function** | Create reusable search utility | Production-ready code |

### Key Questions to Answer
- What metadata does ArXiv provide?
- How do we filter for AI/ML papers specifically?
- Can we reliably download PDFs?
- What are the rate limits and best practices?

### Expected Outcomes
- Working knowledge of the ArXiv API
- Sample dataset of 10-20 recent AI papers
- Reusable search function for future notebooks
- Understanding of data structure for agent design

---

**Last Updated:** January 2026  


In [4]:
# Cell 2: Imports and Setup

"""
I'll use the arxiv Python library, a community-maintained wrapper around the arXiv API.
"""

# Core libraries
import arxiv  # ArXiv API wrapper
import pandas as pd  # Data manipulation
from datetime import datetime, timedelta  # Date handling
import time  # For rate limiting




In [5]:
# Cell 3: Initialize ArXiv Client

"""
Create a configured ArXiv client with sensible defaults.
The client handles pagination, rate limiting, and retries automatically.
"""

# Initialize client with configuration
client = arxiv.Client(
    page_size=100,        # Results per API request (the API allows up to 2000)
    delay_seconds=3,      # Polite rate limiting (3 seconds between requests)
    num_retries=3         # Retry failed requests up to 3 times
)
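
As a rough sense of what these settings imply, the sketch below estimates how many page requests a search needs and the minimum time spent pausing between them. `estimate_fetch_cost` is a hypothetical helper, not part of the `arxiv` library. It assumes one pause of `delay_seconds` between consecutive page requests and ignores retries and network latency.

```python
import math


def estimate_fetch_cost(max_results, page_size=100, delay_seconds=3.0):
    """Return (requests, minimum_wait_seconds) for a paginated fetch.

    Hypothetical helper: assumes one pause between consecutive page
    requests and ignores retries and network latency.
    """
    if max_results <= 0:
        return 0, 0.0
    requests = math.ceil(max_results / page_size)
    # The first request is not preceded by a pause.
    return requests, (requests - 1) * delay_seconds


print(estimate_fetch_cost(5))     # (1, 0.0)
print(estimate_fetch_cost(1000))  # (10, 27.0)
```

With the small `max_results` values used in this notebook, every search fits in a single request, so the delay only matters for larger pulls.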



In [6]:
# Cell 4: Basic Search Test

"""
Test a simple search query to understand the API response structure.
Search for recent papers on "large language models" (LLM).
"""

# Define a basic search
search = arxiv.Search(
    query="large language models",  # Search term
    max_results=5,                   # Limit to 5 papers for testing
    sort_by=arxiv.SortCriterion.SubmittedDate,  # Most recent first
    sort_order=arxiv.SortOrder.Descending
)

# Execute search and collect results
print("Searching for: 'large language models'\n")


results = list(client.results(search))

# Display basic info for each paper
for i, paper in enumerate(results, 1):
    print(f"\n{i}. {paper.title}")
    print(f"   Authors: {', '.join([author.name for author in paper.authors[:3]])}...")
    print(f"   Published: {paper.published.strftime('%Y-%m-%d')}")
    print(f"   ArXiv ID: {paper.entry_id.split('/')[-1]}")

print(f"Retrieved {len(results)} papers successfully")

Searching for: 'large language models'


1. Unveiling the 3D structure of the central molecular zone from stellar kinematics and photometry: The 50 and 20 km/s clouds
   Authors: Francisco Nogueras-Lara, Ashley T. Barnes, Jonathan D. Henshaw...
   Published: 2026-01-08
   ArXiv ID: 2601.05252v1

2. Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video
   Authors: Zeren Jiang, Chuanxia Zheng, Iro Laina...
   Published: 2026-01-08
   ArXiv ID: 2601.05251v1

3. QNeRF: Neural Radiance Fields on a Simulated Gate-Based Quantum Computer
   Authors: Daniele Lizzio Bosco, Shuteng Wang, Giuseppe Serra...
   Published: 2026-01-08
   ArXiv ID: 2601.05250v1

4. LaST$_{0}$: Latent Spatio-Temporal Chain-of-Thought for Robotic Vision-Language-Action Model
   Authors: Zhuoyang Liu, Jiaming Liu, Hao Chen...
   Published: 2026-01-08
   ArXiv ID: 2601.05248v1

5. Random Models and Guarded Logic
   Authors: Oskar Fiuk...
   Published: 2026-01-08
   ArXiv ID: 2601.05247v1
Retrieved 5 papers successfully
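
To make the result structure easier to inspect, each paper can be flattened into a plain dictionary. The sketch below is one possible way to do that. `paper_to_record` is a hypothetical helper, and it relies only on the attributes this notebook reads (`title`, `authors`, `published`, `entry_id`, `categories`). The `SimpleNamespace` object is a stand-in with made-up values so the example runs without network access.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def paper_to_record(paper, max_authors=3):
    """Flatten a result-like object into a dict (hypothetical helper)."""
    names = [author.name for author in paper.authors]
    shown = ", ".join(names[:max_authors])
    if len(names) > max_authors:
        # Mark truncation only when authors were actually dropped.
        shown += ", ..."
    return {
        "arxiv_id": paper.entry_id.split("/")[-1],
        "title": paper.title,
        "authors": shown,
        "published": paper.published.strftime("%Y-%m-%d"),
        "categories": list(paper.categories),
    }


# Stand-in for a search result; all values here are invented placeholders.
fake_paper = SimpleNamespace(
    entry_id="http://arxiv.org/abs/0000.00000v1",
    title="Example Title",
    authors=[SimpleNamespace(name="A. Author")],
    published=datetime(2026, 1, 8, tzinfo=timezone.utc),
    categories=["cs.LG"],
)

record = paper_to_record(fake_paper)
print(record)
```

A list of such records can be passed to `pd.DataFrame(...)` to get one row per paper.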

In [8]:
# Cell 5: Why Did We Get Wrong Results?

"""
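The search returned irrelevant papers because:
1. ArXiv searches across ALL categories (physics, math, CS, etc.)
2. The unquoted query matches papers containing ANY of the words, not the exact phrase
3. Sorting by submission date surfaces the newest loose matches, not the most relevant ones
4. We need to filter by category and use better query syntax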
"""

# Let's examine what categories these papers are in
print("🔍 Analyzing the categories of our 'wrong' results:\n")

for i, paper in enumerate(results, 1):
    # paper.categories is already a list of strings
    categories = paper.categories
    print(f"{i}. {paper.title[:60]}...")
    print(f"   Categories: {', '.join(categories)}")
    print()

print("💡 Notice: None of these are in cs.AI or cs.LG (machine learning)!")
print("   We need to filter by category!")

🔍 Analyzing the categories of our 'wrong' results:

1. Unveiling the 3D structure of the central molecular zone fro...
   Categories: astro-ph.GA

2. Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular V...
   Categories: cs.CV

3. QNeRF: Neural Radiance Fields on a Simulated Gate-Based Quan...
   Categories: cs.CV

4. LaST$_{0}$: Latent Spatio-Temporal Chain-of-Thought for Robo...
   Categories: cs.RO

5. Random Models and Guarded Logic...
   Categories: cs.LO

💡 Notice: None of these are in cs.AI or cs.LG (machine learning)!
   We need to filter by category!
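
The "better query syntax" point can be sketched as a small query builder. The arXiv API accepts field prefixes such as `all:` and `cat:`, double quotes for exact phrases, and the Boolean operators `AND` and `OR` with parentheses for grouping. `build_query` below is a hypothetical helper that only assembles the query string; it does not contact the API.

```python
def build_query(phrase, categories):
    """Combine an exact-phrase match with a category filter.

    Hypothetical helper: `all:` searches every field, `cat:` restricts
    results to a subject category.
    """
    phrase_part = f'all:"{phrase}"'
    if not categories:
        return phrase_part
    category_part = " OR ".join(f"cat:{category}" for category in categories)
    # Parentheses keep the OR group together when combined with AND.
    return f"{phrase_part} AND ({category_part})"


query = build_query("large language models", ["cs.AI", "cs.LG", "cs.CL"])
print(query)
# all:"large language models" AND (cat:cs.AI OR cat:cs.LG OR cat:cs.CL)
```

The resulting string can be passed as the `query` argument of `arxiv.Search`.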


In [10]:
# Cell 6: Search with Category Filtering

"""
ArXiv categories for AI/ML:
- cs.AI  = Artificial Intelligence
- cs.LG  = Machine Learning
- cs.CL  = Computation and Language (NLP)
- cs.CV  = Computer Vision (listed for reference; not included in the query below)
"""

# Better search with category filtering
search_ai = arxiv.Search(
    query="cat:cs.AI OR cat:cs.LG OR cat:cs.CL",  # Filter by AI/ML categories
    max_results=10,
    sort_by=arxiv.SortCriterion.SubmittedDate,
    sort_order=arxiv.SortOrder.Descending
)

print("🔍 Searching AI/ML papers from cs.AI, cs.LG, cs.CL categories\n")
print("-" * 80)

ai_papers = list(client.results(search_ai))

for i, paper in enumerate(ai_papers, 1):
    # paper.categories is already a list of strings
    categories = paper.categories
    print(f"\n{i}. {paper.title}")
    print(f"   Authors: {', '.join([author.name for author in paper.authors[:2]])}...")
    print(f"   Published: {paper.published.strftime('%Y-%m-%d')}")
    print(f"   Categories: {', '.join(categories[:3])}")

print("\n" + "-" * 80)
print(f"✅ Retrieved {len(ai_papers)} AI/ML papers!")

🔍 Searching AI/ML papers from cs.AI, cs.LG, cs.CL categories

--------------------------------------------------------------------------------

1. Optimal Lower Bounds for Online Multicalibration
   Authors: Natalie Collina, Jiuyao Lu...
   Published: 2026-01-08
   Categories: cs.LG, math.ST, stat.ML

2. GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
   Authors: Shih-Yang Liu, Xin Dong...
   Published: 2026-01-08
   Categories: cs.CL, cs.AI, cs.LG

3. RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation
   Authors: Boyang Wang, Haoran Zhang...
   Published: 2026-01-08
   Categories: cs.CV, cs.AI, cs.RO

4. Robust Reasoning as a Symmetry-Protected Topological Phase
   Authors: Ilmo Sung...
   Published: 2026-01-08
   Categories: cs.LG, cond-mat.dis-nn, cs.AI

5. Measuring and Fostering Peace through Machine Learning and Artificial Intelligence
   Authors: P. Gilda, P. Dungarwal...
   Published