## 📚 **Notebook 01: ArXiv API Exploration**

### Purpose
Explore the ArXiv API to understand how to search for and retrieve AI research papers programmatically. This notebook establishes the foundation for our paper discovery pipeline.

### What We'll Do

| Step | Task | Description | Output |
|------|------|-------------|--------|
| 1 | **Install & Import** | Set up arxiv library and dependencies | Working environment |
| 2 | **Basic Search** | Test simple keyword searches | List of recent papers |
| 3 | **Explore Metadata** | Examine paper structure (title, abstract, authors, etc.) | Understanding of data fields |
| 4 | **Advanced Queries** | Filter by category, date, sort options | Targeted search results |
| 5 | **Download PDFs** | Test PDF retrieval functionality | Sample PDF files |
| 6 | **Build Search Function** | Create reusable search utility | Production-ready code |

### Key Questions to Answer
- What metadata does ArXiv provide?
- How do we filter for AI/ML papers specifically?
- Can we reliably download PDFs?
- What are the rate limits and best practices?

### Expected Outcomes
- Working knowledge of ArXiv API
- Sample dataset of 10-20 recent AI papers
- Reusable search function for future notebooks
- Understanding of data structure for agent design

---

**Last Updated:** January 2026  


In [4]:
# Cell 2: Imports and Setup

"""
I'll use the arxiv Python library (arxiv.py), a community-maintained wrapper, for API access.
"""

# Core libraries
import arxiv  # ArXiv API wrapper
import pandas as pd  # Data manipulation
from datetime import datetime, timedelta  # Date handling
import time  # For rate limiting




In [5]:
# Cell 3: Initialize ArXiv Client

"""
Create a configured ArXiv client with sensible defaults.
The client handles pagination, rate limiting, and retries automatically.
"""

# Initialize client with configuration
client = arxiv.Client(
    page_size=100,        # Results per API request (100 is the library default, not the maximum)
    delay_seconds=3,      # Polite rate limiting (3 seconds between requests)
    num_retries=3         # Retry failed requests up to 3 times
)



In [6]:
# Cell 4: Basic Search Test

"""
Test a simple search query to understand the API response structure.
Search for recent papers on "large language models" (LLM).
"""

# Define a basic search
search = arxiv.Search(
    query="large language models",  # Search term
    max_results=5,                   # Limit to 5 papers for testing
    sort_by=arxiv.SortCriterion.SubmittedDate,  # Most recent first
    sort_order=arxiv.SortOrder.Descending
)

# Execute search and collect results
print("Searching for: 'large language models'\n")


results = list(client.results(search))

# Display basic info for each paper
for i, paper in enumerate(results, 1):
    print(f"\n{i}. {paper.title}")
    print(f"   Authors: {', '.join([author.name for author in paper.authors[:3]])}...")
    print(f"   Published: {paper.published.strftime('%Y-%m-%d')}")
    print(f"   ArXiv ID: {paper.entry_id.split('/')[-1]}")

print(f"Retrieved {len(results)} papers successfully")

Searching for: 'large language models'


1. Unveiling the 3D structure of the central molecular zone from stellar kinematics and photometry: The 50 and 20 km/s clouds
   Authors: Francisco Nogueras-Lara, Ashley T. Barnes, Jonathan D. Henshaw...
   Published: 2026-01-08
   ArXiv ID: 2601.05252v1

2. Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video
   Authors: Zeren Jiang, Chuanxia Zheng, Iro Laina...
   Published: 2026-01-08
   ArXiv ID: 2601.05251v1

3. QNeRF: Neural Radiance Fields on a Simulated Gate-Based Quantum Computer
   Authors: Daniele Lizzio Bosco, Shuteng Wang, Giuseppe Serra...
   Published: 2026-01-08
   ArXiv ID: 2601.05250v1

4. LaST$_{0}$: Latent Spatio-Temporal Chain-of-Thought for Robotic Vision-Language-Action Model
   Authors: Zhuoyang Liu, Jiaming Liu, Hao Chen...
   Published: 2026-01-08
   ArXiv ID: 2601.05248v1

5. Random Models and Guarded Logic
   Authors: Oskar Fiuk...
   Published: 2026-01-08
   ArXiv ID: 2601.05247v1
Retrieved 5 papers successfully

In [8]:
# Cell 5: Why Did We Get Wrong Results?

"""
The search returned irrelevant papers because:
1. ArXiv searches across ALL categories (physics, math, CS, etc.)
2. An unquoted query matches individual words, not the exact phrase
3. Sorting by submission date ignores relevance, so the newest loose matches come first
4. We need to filter by category and use better query syntax
"""

# Let's examine what categories these papers are in
print("üîç Analyzing the categories of our 'wrong' results:\n")

for i, paper in enumerate(results, 1):
    # paper.categories is already a list of strings
    categories = paper.categories
    print(f"{i}. {paper.title[:60]}...")
    print(f"   Categories: {', '.join(categories)}")
    print()

print("üí° Notice: None of these are in cs.AI or cs.LG (machine learning)!")
print("   We need to filter by category!")

üîç Analyzing the categories of our 'wrong' results:

1. Unveiling the 3D structure of the central molecular zone fro...
   Categories: astro-ph.GA

2. Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular V...
   Categories: cs.CV

3. QNeRF: Neural Radiance Fields on a Simulated Gate-Based Quan...
   Categories: cs.CV

4. LaST$_{0}$: Latent Spatio-Temporal Chain-of-Thought for Robo...
   Categories: cs.RO

5. Random Models and Guarded Logic...
   Categories: cs.LO

💡 Notice: None of these are in cs.AI or cs.LG (machine learning)!
   We need to filter by category!
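
To make the diagnosis above concrete, here is a minimal sketch of how I could build a stricter query string: the phrase goes in double quotes behind a field prefix, and the category filter sits in its own parenthesized group. The helper name `build_arxiv_query` and its defaults are my own assumptions for illustration; only the `all:` and `cat:` prefixes, the quotes, and the boolean operators come from the arXiv query syntax.

```python
def build_arxiv_query(phrase, categories=("cs.AI", "cs.LG", "cs.CL"), field="all"):
    """Build an arXiv query string: a quoted phrase AND a category filter.

    Hypothetical helper for illustration; not part of the arxiv library.
    """
    parts = []
    if phrase:
        # Double quotes request an exact phrase match within the chosen field
        parts.append(f'{field}:"{phrase}"')
    if categories:
        category_filter = " OR ".join(f"cat:{cat}" for cat in categories)
        # Parentheses keep the OR group together when combined with AND
        parts.append(f"({category_filter})")
    if not parts:
        raise ValueError("Provide a phrase, at least one category, or both")
    return " AND ".join(parts)


print(build_arxiv_query("large language models"))
# all:"large language models" AND (cat:cs.AI OR cat:cs.LG OR cat:cs.CL)
```

The resulting string can be passed as the `query` argument of `arxiv.Search`. Swapping `field` to `ti` or `abs` would restrict the phrase match to titles or abstracts.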


In [10]:
# Cell 6: Search with Category Filtering

"""
ArXiv categories for AI/ML:
- cs.AI  = Artificial Intelligence
- cs.LG  = Machine Learning
- cs.CL  = Computation and Language (NLP)
- cs.CV  = Computer Vision
"""

# Better search with category filtering
search_ai = arxiv.Search(
    query="cat:cs.AI OR cat:cs.LG OR cat:cs.CL",  # Filter by AI/ML categories
    max_results=10,
    sort_by=arxiv.SortCriterion.SubmittedDate,
    sort_order=arxiv.SortOrder.Descending
)

print("üîç Searching AI/ML papers from cs.AI, cs.LG, cs.CL categories\n")
print("-" * 80)

ai_papers = list(client.results(search_ai))

for i, paper in enumerate(ai_papers, 1):
    # paper.categories is already a list of strings
    categories = paper.categories
    print(f"\n{i}. {paper.title}")
    print(f"   Authors: {', '.join([author.name for author in paper.authors[:2]])}...")
    print(f"   Published: {paper.published.strftime('%Y-%m-%d')}")
    print(f"   Categories: {', '.join(categories[:3])}")

print("\n" + "-" * 80)
print(f"✅ Retrieved {len(ai_papers)} AI/ML papers!")

🔍 Searching AI/ML papers from cs.AI, cs.LG, cs.CL categories

--------------------------------------------------------------------------------

1. Optimal Lower Bounds for Online Multicalibration
   Authors: Natalie Collina, Jiuyao Lu...
   Published: 2026-01-08
   Categories: cs.LG, math.ST, stat.ML

2. GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
   Authors: Shih-Yang Liu, Xin Dong...
   Published: 2026-01-08
   Categories: cs.CL, cs.AI, cs.LG

3. RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation
   Authors: Boyang Wang, Haoran Zhang...
   Published: 2026-01-08
   Categories: cs.CV, cs.AI, cs.RO

4. Robust Reasoning as a Symmetry-Protected Topological Phase
   Authors: Ilmo Sung...
   Published: 2026-01-08
   Categories: cs.LG, cond-mat.dis-nn, cs.AI

5. Measuring and Fostering Peace through Machine Learning and Artificial Intelligence
   Authors: P. Gilda, P. Dungarwal...
   Published

### **🔬 Exploring Paper Metadata**

Now I'm going to dig deeper into what information ArXiv actually provides for each paper. This matters because I need to know what data will be available when building the agent pipeline.

**What I'm doing here:**
- Examining the full structure of a paper object to see all available fields
- Checking whether abstracts are complete enough for analysis
- Testing whether I can reliably access PDF links for download



In [11]:
# Cell 7: Explore Full Paper Metadata

"""
Let's examine one paper in detail to understand all available metadata.
This will inform how we structure our data pipeline later.
"""

# Pick the first paper from our AI/ML results
sample_paper = ai_papers[0]

print("üìÑ DETAILED PAPER STRUCTURE")
print("=" * 80)
print(f"\nüîπ Title: {sample_paper.title}")
print(f"\nüîπ ArXiv ID: {sample_paper.entry_id.split('/')[-1]}")
print(f"\nüîπ Published Date: {sample_paper.published.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"\nüîπ Updated Date: {sample_paper.updated.strftime('%Y-%m-%d %H:%M:%S')}")

print(f"\nüîπ Authors ({len(sample_paper.authors)}):")
for author in sample_paper.authors[:5]:  # Show first 5
    print(f"   - {author.name}")
if len(sample_paper.authors) > 5:
    print(f"   ... and {len(sample_paper.authors) - 5} more")

print(f"\nüîπ Categories: {', '.join(sample_paper.categories)}")

print(f"\nüîπ Abstract ({len(sample_paper.summary)} characters):")
print(f"   {sample_paper.summary[:300]}...")  # First 300 chars

print(f"\nüîπ PDF URL: {sample_paper.pdf_url}")

print(f"\nüîπ ArXiv Page URL: {sample_paper.entry_id}")

print(f"\nüîπ Primary Category: {sample_paper.primary_category}")

print(f"\nüîπ Comment: {sample_paper.comment if sample_paper.comment else 'None'}")

print("\n" + "=" * 80)
print("‚úÖ All key metadata fields are accessible and complete!")

📄 DETAILED PAPER STRUCTURE

🔹 Title: Optimal Lower Bounds for Online Multicalibration

🔹 ArXiv ID: 2601.05245v1

🔹 Published Date: 2026-01-08 18:59:32

🔹 Updated Date: 2026-01-08 18:59:32

🔹 Authors (4):
   - Natalie Collina
   - Jiuyao Lu
   - Georgy Noarov
   - Aaron Roth

🔹 Categories: cs.LG, math.ST, stat.ML

🔹 Abstract (937 characters):
   We prove tight lower bounds for online multicalibration, establishing an information-theoretic separation from marginal calibration.
  In the general setting where group functions can depend on both context and the learner's predictions, we prove an $Ω(T^{2/3})$ lower bound on expected multicalibrat...

🔹 PDF URL: https://arxiv.org/pdf/2601.05245v1

🔹 ArXiv Page URL: http://arxiv.org/abs/2601.05245v1

🔹 Primary Category: cs.LG

🔹 Comment: None

✅ All key metadata fields are accessible and complete!
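
Since I keep extracting the ID with `entry_id.split('/')[-1]`, which leaves the version suffix attached (`2601.05245v1`), a small helper that separates the base ID from the version may be useful for deduplicating revisions later. This is only a sketch: the name `split_arxiv_id` is hypothetical, and the pattern assumes new-style identifiers (`YYMM.number`), so old-style IDs such as `hep-th/9901001` are deliberately rejected.

```python
import re

# New-style identifiers look like 2601.05245v1 (YYMM.number plus optional version)
_ARXIV_ID = re.compile(r"(?P<base>\d{4}\.\d{4,5})(?:v(?P<version>\d+))?$")


def split_arxiv_id(entry_id):
    """Return (base_id, version) from an entry URL or bare ID.

    version is None when the identifier carries no 'vN' suffix.
    Hypothetical helper for illustration; not part of the arxiv library.
    """
    match = _ARXIV_ID.search(entry_id)
    if match is None:
        raise ValueError(f"Not a new-style arXiv identifier: {entry_id!r}")
    version = match.group("version")
    return match.group("base"), (int(version) if version else None)


print(split_arxiv_id("http://arxiv.org/abs/2601.05245v1"))
# ('2601.05245', 1)
```

Keeping the base ID separate means two versions of the same paper can be recognized as one record.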


## 📥 Testing PDF Download

Now I need to verify that I can actually download PDFs programmatically. This is critical because my agent will need to extract full paper content, not just abstracts.

**What I'm testing:**
- Whether the arxiv library can download PDFs automatically
- Whether file sizes indicate complete downloads
- Whether files are saved correctly to our data folder

**Why this matters:** The entire "Paper Analyzer" agent depends on being able to read full papers. If PDF downloads are unreliable, I'll need a backup strategy.

In [12]:
# Cell 8: Test PDF Download Functionality

"""
Test downloading a PDF to ensure we can access full paper content.
We'll download to our data/raw folder.
"""

import os

# Create data/raw directory if it doesn't exist
os.makedirs('../data/raw', exist_ok=True)

# Download the first paper's PDF
sample_paper = ai_papers[0]
paper_id = sample_paper.entry_id.split('/')[-1].replace('.', '_')

print(f"üì• Downloading: {sample_paper.title[:60]}...")
print(f"   ArXiv ID: {paper_id}")
print(f"   PDF URL: {sample_paper.pdf_url}\n")

# Download PDF
pdf_path = f"../data/raw/{paper_id}.pdf"
sample_paper.download_pdf(filename=pdf_path)

# Check if download succeeded
if os.path.exists(pdf_path):
    file_size = os.path.getsize(pdf_path) / 1024  # Size in KB
    print(f"✅ Download successful!")
    print(f"   Saved to: {pdf_path}")
    print(f"   File size: {file_size:.1f} KB")
else:
    print("❌ Download failed!")

📥 Downloading: Optimal Lower Bounds for Online Multicalibration...
   ArXiv ID: 2601_05245v1
   PDF URL: https://arxiv.org/pdf/2601.05245v1

✅ Download successful!
   Saved to: ../data/raw/2601_05245v1.pdf
   File size: 684.4 KB
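
A file that exists and has a non-zero size is not necessarily a valid PDF; an error page saved under a `.pdf` name would pass the check above. As a slightly stronger sanity check, here is a sketch that also looks for the `%PDF-` signature at the start of the file. The function name `looks_like_pdf` and the 1 KB minimum size are my own assumptions, not anything the arxiv library provides.

```python
import os


def looks_like_pdf(path, min_bytes=1024):
    """Return True if the file exists, is at least min_bytes long,
    and starts with the PDF signature bytes.

    Hypothetical helper; min_bytes=1024 is an arbitrary assumed threshold.
    """
    if not os.path.isfile(path) or os.path.getsize(path) < min_bytes:
        return False
    with open(path, "rb") as handle:
        # Every PDF file begins with the ASCII signature "%PDF-"
        return handle.read(5) == b"%PDF-"
```

Calling `looks_like_pdf(pdf_path)` right after the download would flag truncated or non-PDF responses before the Paper Analyzer tries to parse them. It does not prove the file is complete, only that it is plausibly a PDF.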


## 🔧 Building a Reusable Search Function

I'm now going to create a clean, production-ready function that I can reuse across all notebooks and eventually in my agent pipeline. This will be the foundation of the "Research Finder" agent.

**What I'm building:**
- A flexible search function that handles different query types and categories
- Structured return data (a list of dictionaries, not raw result objects)
- Error handling so the agent doesn't crash on bad queries



In [14]:
# Cell 9: Build Reusable ArXiv Search Function

"""
Create a clean, reusable function for searching ArXiv papers.
This will be the core of our Research Finder agent.
"""

def search_arxiv_papers(
    query=None,
    categories=("cs.AI", "cs.LG", "cs.CL"),  # Tuple avoids a shared mutable default
    max_results=10,
    days_back=7,
    sort_by="submitted"
):
    """
    Search ArXiv for AI/ML papers with flexible parameters.
    
    Args:
        query (str): Keyword search (e.g., "transformer models")
        categories (list): ArXiv categories to filter by
        max_results (int): Maximum number of papers to return
        days_back (int): Only get papers from last N days (None for all time)
        sort_by (str): "submitted" or "relevance"
    
    Returns:
        list: List of dictionaries containing paper metadata
    """
    
    # Build query string
    if query and categories:
        # Combine keyword search with category filter; parenthesize both
        # sides so the boolean operators group as intended
        category_query = " OR ".join([f"cat:{cat}" for cat in categories])
        full_query = f"({query}) AND ({category_query})"
    elif categories:
        # Category filter only
        full_query = " OR ".join([f"cat:{cat}" for cat in categories])
    elif query:
        # Keyword only (not recommended - gets all categories)
        full_query = query
    else:
        raise ValueError("Must provide either query or categories")
    
    # Set sort criterion
    if sort_by == "submitted":
        sort_criterion = arxiv.SortCriterion.SubmittedDate
    else:
        sort_criterion = arxiv.SortCriterion.Relevance
    
    # Create search
    search = arxiv.Search(
        query=full_query,
        max_results=max_results,
        sort_by=sort_criterion,
        sort_order=arxiv.SortOrder.Descending
    )
    
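    # Execute search
    try:
        results = client.results(search)
        papers = []

        # paper.published is timezone-aware (UTC), so compare it against an
        # aware "now" instead of stripping tzinfo and using naive local time
        cutoff_date = None
        if days_back:
            cutoff_date = datetime.now().astimezone() - timedelta(days=days_back)

        for paper in results:
            # Filter by date if specified (applied after max_results,
            # so fewer than max_results papers may be returned)
            if cutoff_date and paper.published < cutoff_date:
                continue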
            
            # Structure the data
            paper_data = {
                'arxiv_id': paper.entry_id.split('/')[-1],
                'title': paper.title,
                'authors': [author.name for author in paper.authors],
                'published': paper.published.strftime('%Y-%m-%d'),
                'categories': paper.categories,
                'primary_category': paper.primary_category,
                'abstract': paper.summary,
                'pdf_url': paper.pdf_url,
                'arxiv_url': paper.entry_id
            }
            papers.append(paper_data)
        
        return papers
    
    except Exception as e:
        print(f"❌ Search failed: {e}")
        return []


print("Function Defined: Ready to use in production pipeline")

Function Defined: Ready to use in production pipeline
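
Because the function returns plain dictionaries, downstream steps can work on the results with ordinary Python. As a rough sketch, the snippet below counts how often each category appears, which is a quick way to check that the category filter is doing what I expect. The helper name `count_categories` is hypothetical, and the two sample records reuse titles and categories from the outputs above (only the first three categories were printed there) rather than calling the API.

```python
from collections import Counter


def count_categories(papers):
    """Count how often each category appears across search results.

    Expects dictionaries with a 'categories' list, the shape that
    search_arxiv_papers returns. Hypothetical helper for illustration.
    """
    return Counter(cat for paper in papers for cat in paper["categories"])


# Sample records copied from the outputs earlier in this notebook
sample = [
    {"title": "Optimal Lower Bounds for Online Multicalibration",
     "categories": ["cs.LG", "math.ST", "stat.ML"]},
    {"title": "Robust Reasoning as a Symmetry-Protected Topological Phase",
     "categories": ["cs.LG", "cond-mat.dis-nn", "cs.AI"]},
]

print(count_categories(sample).most_common(1))
# [('cs.LG', 2)]
```

With live data, the same call would be `count_categories(search_arxiv_papers(query="transformer models"))`.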
