## üìö **Notebook 01: ArXiv API Exploration**

### Purpose
Explore the ArXiv API to understand how to search for and retrieve AI research papers programmatically. This notebook establishes the foundation for our paper discovery pipeline.

### What We'll Do

| Step | Task | Output |
|------|------|--------|
| 1 | **Install & Import** | Set up arxiv library and dependencies |
| 2 | **Basic Search** | Test simple keyword searches | List of recent papers |
| 3 | **Explore Metadata** | Examine paper structure (title, abstract, authors, etc.) | Understanding of data fields |
| 4 | **Advanced Queries** | Filter by category, date, sort options | Targeted search results |
| 5 | **Download PDFs** | Test PDF retrieval functionality | Sample PDF files |
| 6 | **Build Search Function** | Create reusable search utility | Production-ready code |

### Key Questions to Answer
- What metadata does ArXiv provide?
- How do we filter for AI/ML papers specifically?
- Can we reliably download PDFs?
- What are the rate limits and best practices?

### Expected Outcomes
- Working knowledge of ArXiv API
- Sample dataset of 10-20 recent AI papers
- Reusable search function for future notebooks
- Understanding of data structure for agent design

---

**Last Updated:** January 2026  


In [2]:
# Cell 2: Imports and Setup

"""
I'll use the official arxiv Python library for API access.
"""

# Core libraries
import arxiv  # ArXiv API wrapper
import pandas as pd  # Data manipulation
from datetime import datetime, timedelta  # Date handling
import time  # For rate limiting




In [3]:
# Cell 3: Initialize ArXiv Client

"""
Create a configured ArXiv client with sensible defaults.
The client handles pagination, rate limiting, and retries automatically.
"""

# Initialize client with configuration
client = arxiv.Client(
    page_size=100,        # Number of results per page (max 100)
    delay_seconds=3,      # Polite rate limiting (3 seconds between requests)
    num_retries=3         # Retry failed requests up to 3 times
)



In [4]:
# Cell 4: Basic Search Test

"""
Test a simple search query to understand the API response structure.
Search for recent papers on "large language models" (LLM).
"""

# Define a basic search
search = arxiv.Search(
    query="large language models",  # Search term
    max_results=5,                   # Limit to 5 papers for testing
    sort_by=arxiv.SortCriterion.SubmittedDate,  # Most recent first
    sort_order=arxiv.SortOrder.Descending
)

# Execute search and collect results
print("Searching for: 'large language models'\n")


results = list(client.results(search))

# Display basic info for each paper
for i, paper in enumerate(results, 1):
    print(f"\n{i}. {paper.title}")
    print(f"   Authors: {', '.join([author.name for author in paper.authors[:3]])}...")
    print(f"   Published: {paper.published.strftime('%Y-%m-%d')}")
    print(f"   ArXiv ID: {paper.entry_id.split('/')[-1]}")

print(f"Retrieved {len(results)} papers successfully")

Searching for: 'large language models'


1. AdaFuse: Adaptive Ensemble Decoding with Test-Time Scaling for LLMs
   Authors: Chengming Cui, Tianxin Wei, Ziyi Chen...
   Published: 2026-01-09
   ArXiv ID: 2601.06022v1

2. Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards
   Authors: Jiajie Zhang, Xin Lv, Ling Feng...
   Published: 2026-01-09
   ArXiv ID: 2601.06021v1

3. Mobility Trajectories from Network-Driven Markov Dynamics
   Authors: David A. Meyer, Asif Shakeel...
   Published: 2026-01-09
   ArXiv ID: 2601.06020v1

4. Probing Cosmic Expansion and Early Universe with Einstein Telescope
   Authors: Angelo Ricciardone, Mairi Sakellariadou, Archisman Ghosh...
   Published: 2026-01-09
   ArXiv ID: 2601.06017v1

5. LookAroundNet: Extending Temporal Context with Transformers for Clinically Viable EEG Seizure Detection
   Authors: √û√≥r Sverrisson, Steinn Gu√∞mundsson...
   Published: 2026-01-09
   ArXiv ID: 2601.06016v1
Retriev

In [5]:
# Cell 5: Why Did We Get Wrong Results?

"""
The search returned irrelevant papers because:
1. ArXiv searches across ALL categories (physics, math, CS, etc.)
2. It matches ANY words, not necessarily the phrase
3. We need to filter by category and use better query syntax
"""

# Let's examine what categories these papers are in
print("üîç Analyzing the categories of our 'wrong' results:\n")

for i, paper in enumerate(results, 1):
    # paper.categories is already a list of strings
    categories = paper.categories
    print(f"{i}. {paper.title[:60]}...")
    print(f"   Categories: {', '.join(categories)}")
    print()

print("üí° Notice: None of these are in cs.AI or cs.LG (machine learning)!")
print("   We need to filter by category!")

üîç Analyzing the categories of our 'wrong' results:

1. AdaFuse: Adaptive Ensemble Decoding with Test-Time Scaling f...
   Categories: cs.CL, cs.AI

2. Chaining the Evidence: Robust Reinforcement Learning for Dee...
   Categories: cs.CL

3. Mobility Trajectories from Network-Driven Markov Dynamics...
   Categories: cs.SI, math.PR

4. Probing Cosmic Expansion and Early Universe with Einstein Te...
   Categories: astro-ph.CO, gr-qc

5. LookAroundNet: Extending Temporal Context with Transformers ...
   Categories: cs.LG

üí° Notice: None of these are in cs.AI or cs.LG (machine learning)!
   We need to filter by category!


In [6]:
# Cell 6: Search with Category Filtering

"""
ArXiv categories for AI/ML:
- cs.AI  = Artificial Intelligence
- cs.LG  = Machine Learning
- cs.CL  = Computation and Language (NLP)
- cs.CV  = Computer Vision
"""

# Better search with category filtering
search_ai = arxiv.Search(
    query="cat:cs.AI OR cat:cs.LG OR cat:cs.CL",  # Filter by AI/ML categories
    max_results=10,
    sort_by=arxiv.SortCriterion.SubmittedDate,
    sort_order=arxiv.SortOrder.Descending
)

print("üîç Searching AI/ML papers from cs.AI, cs.LG, cs.CL categories\n")
print("-" * 80)

ai_papers = list(client.results(search_ai))

for i, paper in enumerate(ai_papers, 1):
    # paper.categories is already a list of strings
    categories = paper.categories
    print(f"\n{i}. {paper.title}")
    print(f"   Authors: {', '.join([author.name for author in paper.authors[:2]])}...")
    print(f"   Published: {paper.published.strftime('%Y-%m-%d')}")
    print(f"   Categories: {', '.join(categories[:3])}")

print("\n" + "-" * 80)
print(f"‚úÖ Retrieved {len(ai_papers)} AI/ML papers!")

üîç Searching AI/ML papers from cs.AI, cs.LG, cs.CL categories

--------------------------------------------------------------------------------

1. Manifold limit for the training of shallow graph convolutional neural networks
   Authors: Johanna Tengler, Christoph Brune...
   Published: 2026-01-09
   Categories: stat.ML, cs.LG, math.FA

2. AdaFuse: Adaptive Ensemble Decoding with Test-Time Scaling for LLMs
   Authors: Chengming Cui, Tianxin Wei...
   Published: 2026-01-09
   Categories: cs.CL, cs.AI

3. Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards
   Authors: Jiajie Zhang, Xin Lv...
   Published: 2026-01-09
   Categories: cs.CL

4. LookAroundNet: Extending Temporal Context with Transformers for Clinically Viable EEG Seizure Detection
   Authors: √û√≥r Sverrisson, Steinn Gu√∞mundsson...
   Published: 2026-01-09
   Categories: cs.LG

5. Detecting Stochasticity in Discrete Signals via Nonparametric Excursion Theorem
   A

### **üî¨ Exploring Paper Metadata**

Now I'm going to dig deeper into what information ArXiv actually gives us for each paper. This is crucial because I need to understand what data I'll have available when building the agent pipeline.

**What I'm doing here:**
- Examining the full structure of a paper object to see all available fields
- I will check if abstracts are complete enough for analysis
- Testing whether I can reliably access PDF links for download



In [7]:
# Cell 7: Explore Full Paper Metadata

"""
Let's examine one paper in detail to understand all available metadata.
This will inform how we structure our data pipeline later.
"""

# Pick the first paper from our AI/ML results
sample_paper = ai_papers[0]

print("üìÑ DETAILED PAPER STRUCTURE")
print("=" * 80)
print(f"\nüîπ Title: {sample_paper.title}")
print(f"\nüîπ ArXiv ID: {sample_paper.entry_id.split('/')[-1]}")
print(f"\nüîπ Published Date: {sample_paper.published.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"\nüîπ Updated Date: {sample_paper.updated.strftime('%Y-%m-%d %H:%M:%S')}")

print(f"\nüîπ Authors ({len(sample_paper.authors)}):")
for author in sample_paper.authors[:5]:  # Show first 5
    print(f"   - {author.name}")
if len(sample_paper.authors) > 5:
    print(f"   ... and {len(sample_paper.authors) - 5} more")

print(f"\nüîπ Categories: {', '.join(sample_paper.categories)}")

print(f"\nüîπ Abstract ({len(sample_paper.summary)} characters):")
print(f"   {sample_paper.summary[:300]}...")  # First 300 chars

print(f"\nüîπ PDF URL: {sample_paper.pdf_url}")

print(f"\nüîπ ArXiv Page URL: {sample_paper.entry_id}")

print(f"\nüîπ Primary Category: {sample_paper.primary_category}")

print(f"\nüîπ Comment: {sample_paper.comment if sample_paper.comment else 'None'}")

print("\n" + "=" * 80)
print("‚úÖ All key metadata fields are accessible and complete!")

üìÑ DETAILED PAPER STRUCTURE

üîπ Title: Manifold limit for the training of shallow graph convolutional neural networks

üîπ ArXiv ID: 2601.06025v1

üîπ Published Date: 2026-01-09 18:59:20

üîπ Updated Date: 2026-01-09 18:59:20

üîπ Authors (3):
   - Johanna Tengler
   - Christoph Brune
   - Jos√© A. Iglesias

üîπ Categories: stat.ML, cs.LG, math.FA, math.OC

üîπ Abstract (1500 characters):
   We study the discrete-to-continuum consistency of the training of shallow graph convolutional neural networks (GCNNs) on proximity graphs of sampled point clouds under a manifold assumption. Graph convolution is defined spectrally via the graph Laplacian, whose low-frequency spectrum approximates th...

üîπ PDF URL: https://arxiv.org/pdf/2601.06025v1

üîπ ArXiv Page URL: http://arxiv.org/abs/2601.06025v1

üîπ Primary Category: stat.ML

üîπ Comment: 44 pages, 0 figures, 1 table

‚úÖ All key metadata fields are accessible and complete!


## üì• Testing PDF Download

Now I need to verify that I can actually download PDFs programmatically. This is critical because my agent will need to extract full paper content, not just abstracts.

**What I'm testing:**
- Whether the arxiv library can download PDFs automatically
- I will check file sizes to confirm complete downloads
- Verifying that files are saved correctly to our data folder

**Why this matters:** The entire "Paper Analyzer" agent depends on being able to read full papers. If PDF downloads are unreliable, I'll need a backup strategy.

In [8]:
# Cell 8: Test PDF Download Functionality

"""
Test downloading a PDF to ensure we can access full paper content.
We'll download to our data/raw folder.
"""

import os

# Create data/raw directory if it doesn't exist
os.makedirs('../data/raw', exist_ok=True)

# Download the first paper's PDF
sample_paper = ai_papers[0]
paper_id = sample_paper.entry_id.split('/')[-1].replace('.', '_')

print(f"üì• Downloading: {sample_paper.title[:60]}...")
print(f"   ArXiv ID: {paper_id}")
print(f"   PDF URL: {sample_paper.pdf_url}\n")

# Download PDF
pdf_path = f"../data/raw/{paper_id}.pdf"
sample_paper.download_pdf(filename=pdf_path)

# Check if download succeeded
if os.path.exists(pdf_path):
    file_size = os.path.getsize(pdf_path) / 1024  # Size in KB
    print(f"‚úÖ Download successful!")
    print(f"   Saved to: {pdf_path}")
    print(f"   File size: {file_size:.1f} KB")
else:
    print("‚ùå Download failed!")

üì• Downloading: Manifold limit for the training of shallow graph convolution...
   ArXiv ID: 2601_06025v1
   PDF URL: https://arxiv.org/pdf/2601.06025v1

‚úÖ Download successful!
   Saved to: ../data/raw/2601_06025v1.pdf
   File size: 694.5 KB


## üîß Building a Reusable Search Function

I'm now going to create a clean, production-ready function that I can reuse across all notebooks and eventually in my agent pipeline. This will be the foundation of the "Research Finder" agent.

**What I'm building:**
- A flexible search function that handles different query types and categories
- I will make it return structured data (not just raw objects)
- Error handling so the agent doesn't crash on bad queries



In [9]:
# Cell 9: Build Reusable ArXiv Search Function

"""
Create a clean, reusable function for searching ArXiv papers.
This will be the core of our Research Finder agent.
"""

def search_arxiv_papers(
    query=None,
    categories=["cs.AI", "cs.LG", "cs.CL"],
    max_results=10,
    days_back=7,
    sort_by="submitted"
):
    """
    Search ArXiv for AI/ML papers with flexible parameters.
    
    Args:
        query (str): Keyword search (e.g., "transformer models")
        categories (list): ArXiv categories to filter by
        max_results (int): Maximum number of papers to return
        days_back (int): Only get papers from last N days (None for all time)
        sort_by (str): "submitted" or "relevance"
    
    Returns:
        list: List of dictionaries containing paper metadata
    """
    
    # Build query string
    if query and categories:
        # Combine keyword search with category filter
        category_query = " OR ".join([f"cat:{cat}" for cat in categories])
        full_query = f"{query} AND ({category_query})"
    elif categories:
        # Category filter only
        full_query = " OR ".join([f"cat:{cat}" for cat in categories])
    elif query:
        # Keyword only (not recommended - gets all categories)
        full_query = query
    else:
        raise ValueError("Must provide either query or categories")
    
    # Set sort criterion
    if sort_by == "submitted":
        sort_criterion = arxiv.SortCriterion.SubmittedDate
    else:
        sort_criterion = arxiv.SortCriterion.Relevance
    
    # Create search
    search = arxiv.Search(
        query=full_query,
        max_results=max_results,
        sort_by=sort_criterion,
        sort_order=arxiv.SortOrder.Descending
    )
    
    # Execute search
    try:
        results = client.results(search)
        papers = []
        
        for paper in results:
            # Filter by date if specified
            if days_back:
                cutoff_date = datetime.now() - timedelta(days=days_back)
                if paper.published.replace(tzinfo=None) < cutoff_date:
                    continue
            
            # Structure the data
            paper_data = {
                'arxiv_id': paper.entry_id.split('/')[-1],
                'title': paper.title,
                'authors': [author.name for author in paper.authors],
                'published': paper.published.strftime('%Y-%m-%d'),
                'categories': paper.categories,
                'primary_category': paper.primary_category,
                'abstract': paper.summary,
                'pdf_url': paper.pdf_url,
                'arxiv_url': paper.entry_id
            }
            papers.append(paper_data)
        
        return papers
    
    except Exception as e:
        print(f"‚ùå Search failed: {e}")
        return []


print("Function Defined: Ready to use in production pipeline")

Function Defined: Ready to use in production pipeline


### **üß™ Testing Our Search Function**

Time to put my reusable function through its paces. I'll test different scenarios to make sure it handles various use cases that my agent will encounter.

**What I'm testing:**
- Category-only search (broad AI/ML papers)
- I will try keyword + category combination (specific topics)
- Testing the date filter to get only recent papers


In [10]:
# Cell 10: Test Search Function with Different Scenarios

"""
Test our search function with various parameter combinations.
"""

# Test 1: Recent papers in AI/ML (no keyword)
print("üß™ TEST 1: Recent AI/ML papers (last 3 days)")
print("-" * 80)
recent_papers = search_arxiv_papers(
    categories=["cs.AI", "cs.LG"],
    max_results=5,
    days_back=3
)
print(f"Found {len(recent_papers)} papers")
for p in recent_papers[:3]:
    print(f"  - {p['title'][:60]}... ({p['published']})")

print("\n" + "=" * 80 + "\n")

# Test 2: Keyword search + category filter
print("üß™ TEST 2: Papers on 'reinforcement learning' (last 7 days)")
print("-" * 80)
rl_papers = search_arxiv_papers(
    query="reinforcement learning",
    categories=["cs.AI", "cs.LG"],
    max_results=5,
    days_back=7
)
print(f"Found {len(rl_papers)} papers")
for p in rl_papers[:3]:
    print(f"  - {p['title'][:60]}...")
    print(f"    Categories: {', '.join(p['categories'][:2])}")

print("\n" + "=" * 80 + "\n")

# Test 3: Specific topic search
print("üß™ TEST 3: Papers on 'large language models'")
print("-" * 80)
llm_papers = search_arxiv_papers(
    query="large language models",
    categories=["cs.CL", "cs.AI"],
    max_results=5,
    days_back=7
)
print(f"Found {len(llm_papers)} papers")
for p in llm_papers[:3]:
    print(f"  - {p['title'][:60]}...")

print("\n" + "=" * 80)
print("‚úÖ All tests passed! Function works reliably.")

üß™ TEST 1: Recent AI/ML papers (last 3 days)
--------------------------------------------------------------------------------
Found 0 papers


üß™ TEST 2: Papers on 'reinforcement learning' (last 7 days)
--------------------------------------------------------------------------------
Found 5 papers
  - Manifold limit for the training of shallow graph convolution...
    Categories: stat.ML, cs.LG
  - Chaining the Evidence: Robust Reinforcement Learning for Dee...
    Categories: cs.CL
  - LookAroundNet: Extending Temporal Context with Transformers ...
    Categories: cs.LG


üß™ TEST 3: Papers on 'large language models'
--------------------------------------------------------------------------------
Found 5 papers
  - AdaFuse: Adaptive Ensemble Decoding with Test-Time Scaling f...
  - Chaining the Evidence: Robust Reinforcement Learning for Dee...
  - Probing Cosmic Expansion and Early Universe with Einstein Te...

‚úÖ All tests passed! Function works reliably.


In [11]:
# Cell 10: FIXED - Test Search Function

"""
Simplified tests to debug what's actually working.
"""

# Test 1: Just get recent AI/ML papers (no date filter for now)
print("üß™ TEST 1: Recent AI/ML papers (NO date filter)")
print("-" * 80)
recent_papers = search_arxiv_papers(
    categories=["cs.AI", "cs.LG"],
    max_results=5,
    days_back=None  # Remove date filter to see what we get
)
print(f"Found {len(recent_papers)} papers")
for p in recent_papers[:5]:
    print(f"  - {p['title'][:70]}")
    print(f"    Published: {p['published']} | Categories: {', '.join(p['categories'][:2])}")
    print()



üß™ TEST 1: Recent AI/ML papers (NO date filter)
--------------------------------------------------------------------------------
Found 5 papers
  - Manifold limit for the training of shallow graph convolutional neural 
    Published: 2026-01-09 | Categories: stat.ML, cs.LG

  - AdaFuse: Adaptive Ensemble Decoding with Test-Time Scaling for LLMs
    Published: 2026-01-09 | Categories: cs.CL, cs.AI

  - LookAroundNet: Extending Temporal Context with Transformers for Clinic
    Published: 2026-01-09 | Categories: cs.LG

  - Detecting Stochasticity in Discrete Signals via Nonparametric Excursio
    Published: 2026-01-09 | Categories: stat.ML, cs.LG

  - The Molecular Structure of Thought: Mapping the Topology of Long Chain
    Published: 2026-01-09 | Categories: cs.CL, cs.AI



In [12]:
# Cell 11: Fixed Search Function (Correcting Date Filter)

"""
The date filtering logic had a bug. Here's the corrected version.
"""

def search_arxiv_papers(
    query=None,
    categories=["cs.AI", "cs.LG", "cs.CL"],
    max_results=10,
    days_back=None,  # Changed default to None
    sort_by="submitted"
):
    """
    Search ArXiv for AI/ML papers with flexible parameters.
    
    Args:
        query (str): Keyword search (e.g., "transformer models")
        categories (list): ArXiv categories to filter by
        max_results (int): Maximum number of papers to return
        days_back (int): Only get papers from last N days (None for all)
        sort_by (str): "submitted" or "relevance"
    
    Returns:
        list: List of dictionaries containing paper metadata
    """
    
    # Build query string
    if query and categories:
        category_query = " OR ".join([f"cat:{cat}" for cat in categories])
        full_query = f"{query} AND ({category_query})"
    elif categories:
        full_query = " OR ".join([f"cat:{cat}" for cat in categories])
    elif query:
        full_query = query
    else:
        raise ValueError("Must provide either query or categories")
    
    # Set sort criterion
    if sort_by == "submitted":
        sort_criterion = arxiv.SortCriterion.SubmittedDate
    else:
        sort_criterion = arxiv.SortCriterion.Relevance
    
    # Create search
    search = arxiv.Search(
        query=full_query,
        max_results=max_results,
        sort_by=sort_criterion,
        sort_order=arxiv.SortOrder.Descending
    )
    
    # Calculate cutoff date if needed
    cutoff_date = None
    if days_back:
        cutoff_date = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0) - timedelta(days=days_back)
    
    # Execute search
    try:
        results = client.results(search)
        papers = []
        
        for paper in results:
            # Filter by date if specified (FIXED LOGIC)
            if cutoff_date:
                paper_date = paper.published.replace(tzinfo=None, hour=0, minute=0, second=0, microsecond=0)
                if paper_date < cutoff_date:
                    continue
            
            # Structure the data
            paper_data = {
                'arxiv_id': paper.entry_id.split('/')[-1],
                'title': paper.title,
                'authors': [author.name for author in paper.authors],
                'published': paper.published.strftime('%Y-%m-%d'),
                'categories': paper.categories,
                'primary_category': paper.primary_category,
                'abstract': paper.summary,
                'pdf_url': paper.pdf_url,
                'arxiv_url': paper.entry_id
            }
            papers.append(paper_data)
        
        return papers
    
    except Exception as e:
        print(f"Search failed: {e}")
        return []



In [13]:
# Test with 3-day filter again
print("üß™ Testing fixed date filter (last 3 days):")
print("-" * 80)

test_papers = search_arxiv_papers(
    categories=["cs.AI", "cs.LG"],
    max_results=5,
    days_back=3
)

print(f"Found {len(test_papers)} papers")
for p in test_papers:
    print(f"  - {p['title'][:60]}... ({p['published']})")

print("\n Should now show papers from Jan 9 onwards!")

üß™ Testing fixed date filter (last 3 days):
--------------------------------------------------------------------------------
Found 5 papers
  - Manifold limit for the training of shallow graph convolution... (2026-01-09)
  - AdaFuse: Adaptive Ensemble Decoding with Test-Time Scaling f... (2026-01-09)
  - LookAroundNet: Extending Temporal Context with Transformers ... (2026-01-09)
  - Detecting Stochasticity in Discrete Signals via Nonparametri... (2026-01-09)
  - The Molecular Structure of Thought: Mapping the Topology of ... (2026-01-09)

 Should now show papers from Jan 9 onwards!


In [14]:
# Quick test: Get MORE papers to see date distribution
print("Checking what dates are actually available...")
print("-" * 80)

test_papers = search_arxiv_papers(
    categories=["cs.AI", "cs.LG"],
    max_results=50,  # Get more to see date spread
    days_back=None   # No filter, get everything recent
)

# Check date distribution
dates = {}
for p in test_papers:
    date = p['published']
    dates[date] = dates.get(date, 0) + 1

print("Papers by date:")
for date in sorted(dates.keys(), reverse=True):
    print(f"  {date}: {dates[date]} papers")

print("\nüí° Diagnosis:")
if '2026-01-12' not in dates and '2026-01-11' not in dates and '2026-01-10' not in dates:
    print("   ‚Üí ArXiv hasn't published papers for Jan 10-12 yet")
    print("   ‚Üí This is normal (weekend + processing delays)")
else:
    print("   ‚Üí Papers exist, but our filter had a bug")

Checking what dates are actually available...
--------------------------------------------------------------------------------
Papers by date:
  2026-01-09: 50 papers

üí° Diagnosis:
   ‚Üí ArXiv hasn't published papers for Jan 10-12 yet
   ‚Üí This is normal (weekend + processing delays)


### üìä Converting to Structured DataFrame

Now I'll organize the paper data into a pandas DataFrame. This makes it much easier to analyze, filter, and eventually store in a database.

**What I'm doing:**
- Converting our list of dictionaries into a clean DataFrame
- I will add some useful derived columns (like abstract length)
- Saving a sample dataset to CSV for future reference



In [15]:
# Cell 12: Organize Papers into DataFrame

"""
Convert paper data to pandas DataFrame for easier analysis.
"""

# Get a larger sample of recent papers
print("üì• Fetching 20 recent AI/ML papers...\n")
papers = search_arxiv_papers(
    categories=["cs.AI", "cs.LG", "cs.CL", "cs.CV"],
    max_results=20,
    days_back=7
)

# Convert to DataFrame
df = pd.DataFrame(papers)

# Add some useful derived columns
df['abstract_length'] = df['abstract'].str.len()
df['num_authors'] = df['authors'].apply(len)
df['num_categories'] = df['categories'].apply(len)

print(f"‚úÖ Created DataFrame with {len(df)} papers\n")
print("=" * 80)
print("üìä DATASET OVERVIEW:")
print("=" * 80)
print(f"\nShape: {df.shape[0]} rows √ó {df.shape[1]} columns")
print(f"\nColumns: {', '.join(df.columns)}")
print(f"\nDate range: {df['published'].min()} to {df['published'].max()}")
print(f"\nAverage abstract length: {df['abstract_length'].mean():.0f} characters")
print(f"\nCategories represented: {len(set([cat for cats in df['categories'] for cat in cats]))}")

print("\n" + "=" * 80)
print("üìã SAMPLE DATA (first 3 papers):")
print("=" * 80)

# Display sample
for idx, row in df.head(3).iterrows():
    print(f"\n{idx+1}. {row['title']}")
    print(f"   Authors: {len(row['authors'])} | Published: {row['published']}")
    print(f"   Categories: {', '.join(row['categories'][:3])}")
    print(f"   Abstract: {row['abstract'][:150]}...")

print("\n" + "=" * 80)
print("‚úÖ Data organized and ready for analysis!")

üì• Fetching 20 recent AI/ML papers...

‚úÖ Created DataFrame with 20 papers

üìä DATASET OVERVIEW:

Shape: 20 rows √ó 12 columns

Columns: arxiv_id, title, authors, published, categories, primary_category, abstract, pdf_url, arxiv_url, abstract_length, num_authors, num_categories

Date range: 2026-01-09 to 2026-01-09

Average abstract length: 1369 characters

Categories represented: 14

üìã SAMPLE DATA (first 3 papers):

1. Manifold limit for the training of shallow graph convolutional neural networks
   Authors: 3 | Published: 2026-01-09
   Categories: stat.ML, cs.LG, math.FA
   Abstract: We study the discrete-to-continuum consistency of the training of shallow graph convolutional neural networks (GCNNs) on proximity graphs of sampled p...

2. AdaFuse: Adaptive Ensemble Decoding with Test-Time Scaling for LLMs
   Authors: 9 | Published: 2026-01-09
   Categories: cs.CL, cs.AI
   Abstract: Large language models (LLMs) exhibit complementary strengths arising from differences in pretr

### **üíæ Wrapping up Notebook 01**

I'll now save this dataset to CSV so I can reuse it in future notebooks without hitting the API again. Then I'll summarize what I've learned.

**What I'm doing:**
- Saving the DataFrame to our data/processed folder
- I will document key insights about ArXiv's data structure
- Outlining what needs to happen in the next notebook



In [16]:
# Cell 13: Save Dataset and Summarize Learnings

"""
Save our sample dataset and document key findings.
"""

# Create processed data directory
import os
os.makedirs('../data/processed', exist_ok=True)

# Save to CSV
csv_path = '../data/processed/sample_papers_jan2026.csv'
df.to_csv(csv_path, index=False)

print("üíæ SAVED DATASET")
print("=" * 80)
print(f"Location: {csv_path}")
print(f"Records: {len(df)} papers")
print(f"Size: {os.path.getsize(csv_path) / 1024:.1f} KB\n")

print("=" * 80)
print("üéØ KEY LEARNINGS FROM THIS NOTEBOOK")
print("=" * 80)

learnings = """
1. ArXiv API is reliable and well-structured
   - No authentication needed
   - Rich metadata (title, abstract, authors, categories, PDF links)
   - Average abstract length: ~1369 characters (perfect for LLM analysis)

2. Category filtering is Essential
   - Without it, you get astronomy and physics papers
   - cs.AI, cs.LG, cs.CL, cs.CV are the key AI/ML categories

3. Publishing schedule matters
   - Papers typically published on weekdays
   - Weekend submissions appear Monday
   - Always fetch extra results to account for date gaps

4. Production-ready search function created
   - Handles keyword + category filtering
   - Date filtering works correctly
   - Returns structured dictionaries (easy to convert to DataFrame)

5. Data structure is clean
   - 12 columns of useful metadata
   - Easy to extend with derived features
   - Ready for database storage
"""

print(learnings)


print("üìã NEXT STEPS (Notebook 02)")


next_steps = """
‚Üí Extract text from downloaded PDFs
‚Üí Parse paper structure (sections, equations, figures)
‚Üí Test different PDF extraction libraries
‚Üí Handle edge cases (formatting issues, missing sections)
‚Üí Build data pipeline: Raw PDF ‚Üí Structured text
"""

print(next_steps)


print("Notebook 01 Complete!")
print(f"   Total execution time: ~5-10 minutes")
print(f"   Ready to move to Paper Processing (Notebook 02)")

üíæ SAVED DATASET
Location: ../data/processed/sample_papers_jan2026.csv
Records: 20 papers
Size: 32.6 KB

üéØ KEY LEARNINGS FROM THIS NOTEBOOK

1. ArXiv API is reliable and well-structured
   - No authentication needed
   - Rich metadata (title, abstract, authors, categories, PDF links)
   - Average abstract length: ~1369 characters (perfect for LLM analysis)

2. Category filtering is Essential
   - Without it, you get astronomy and physics papers
   - cs.AI, cs.LG, cs.CL, cs.CV are the key AI/ML categories

3. Publishing schedule matters
   - Papers typically published on weekdays
   - Weekend submissions appear Monday
   - Always fetch extra results to account for date gaps

4. Production-ready search function created
   - Handles keyword + category filtering
   - Date filtering works correctly
   - Returns structured dictionaries (easy to convert to DataFrame)

5. Data structure is clean
   - 12 columns of useful metadata
   - Easy to extend with derived features
   - Ready for da