# Data Collection from arXiv

## Objective
Collect academic papers from the arXiv API to build our recommendation dataset.

## Plan
1. Explore the arXiv API
2. Collect sample papers (~10k for development)
3. Extract key fields: title, abstract, authors, categories, publication dates (citation counts are not available from the arXiv API)
4. Save to disk for processing

## arXiv Categories
- cs.AI - Artificial Intelligence
- cs.LG - Machine Learning
- cs.CL - Computation and Language (NLP)
- cs.CV - Computer Vision
- cs.IR - Information Retrieval
- stat.ML - Machine Learning (Statistics)
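
The category codes above map directly onto the `cat:` field prefix of the arXiv query syntax. As a minimal sketch (the helper name `build_category_query` is hypothetical and not part of the `arxiv` package), several codes can be combined into one query string with `OR`:

```python
def build_category_query(categories):
    """Join arXiv category codes into a single query string using OR."""
    return " OR ".join(f"cat:{code}" for code in categories)


query = build_category_query(["cs.AI", "cs.LG", "stat.ML"])
print(query)  # cat:cs.AI OR cat:cs.LG OR cat:stat.ML
```

This notebook queries one category at a time instead, which keeps the number of papers per category under direct control.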

In [1]:
import arxiv
import pandas as pd
import json
from datetime import datetime
from tqdm import tqdm
import time
import os

print("✓ Imports successful")

✓ Imports successful


In [2]:
# Test the arXiv API with a simple search
print("Testing arXiv API...")

# Search for 5 recent ML papers
search = arxiv.Search(
    query="cat:cs.LG",  # Category: Machine Learning
    max_results=5,
    sort_by=arxiv.SortCriterion.SubmittedDate
)

print("\nSample papers from cs.LG (Machine Learning):\n")
# Search.results() is deprecated; run the search through a Client instead
for i, paper in enumerate(arxiv.Client().results(search), 1):
    print(f"{i}. {paper.title}")
    print(f"   Authors: {', '.join([a.name for a in paper.authors[:3]])}...")
    print(f"   Published: {paper.published.date()}")
    print(f"   Categories: {', '.join(paper.categories)}")
    print(f"   Abstract: {paper.summary[:150]}...")
    print()

print("✓ API is working!")

Testing arXiv API...

Sample papers from cs.LG (Machine Learning):



1. Biases in the Blind Spot: Detecting What LLMs Fail to Mention
   Authors: Iván Arcuschin, David Chanin, Adrià Garriga-Alonso...
   Published: 2026-02-10
   Categories: cs.LG, cs.AI
   Abstract: Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We call these ...

2. Olaf-World: Orienting Latent Actions for Video World Modeling
   Authors: Yuxin Jiang, Yuchao Gu, Ivor W. Tsang...
   Published: 2026-02-10
   Categories: cs.CV, cs.AI, cs.LG
   Abstract: Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfa...

3. Towards Explainable Federated Learning: Understanding the Impact of Differential Privacy
   Authors: Júlio Oliveira, Rodrigo Ferreira, André Riker...
   Published: 2026-02-10
   Categories: cs.LG, cs.CR
   Abstract: Data privacy and eXplainable Artificial Intelligence (XAI) are two important aspect

In [3]:
# Get one paper and examine all available fields
search = arxiv.Search(query="cat:cs.LG", max_results=1)
# Search.results() is deprecated; run the search through a Client instead
paper = next(arxiv.Client().results(search))

print("Available paper attributes:\n")
print(f"Title: {paper.title}")
print(f"Paper ID: {paper.entry_id}")
print(f"Published: {paper.published}")
print(f"Updated: {paper.updated}")
print(f"Authors: {[a.name for a in paper.authors]}")
print(f"Categories: {paper.categories}")
print(f"Primary Category: {paper.primary_category}")
print(f"Abstract length: {len(paper.summary)} characters")
print(f"PDF URL: {paper.pdf_url}")
print(f"\nFull abstract:\n{paper.summary}")

Available paper attributes:

Title: Design Rule Checking with a CNN Based Feature Extractor
Paper ID: http://arxiv.org/abs/2012.11510v1
Published: 2020-12-21 17:26:31+00:00
Updated: 2020-12-21 17:26:31+00:00
Authors: ['Luis Francisco', 'Tanmay Lagare', 'Arpit Jain', 'Somal Chaudhary', 'Madhura Kulkarni', 'Divya Sardana', 'W. Rhett Davis', 'Paul Franzon']
Categories: ['cs.LG']
Primary Category: cs.LG
Abstract length: 700 characters
PDF URL: https://arxiv.org/pdf/2012.11510v1

Full abstract:
Design rule checking (DRC) is getting increasingly complex in advanced nodes technologies. It would be highly desirable to have a fast interactive DRC engine that could be used during layout. In this work, we establish the proof of feasibility for such an engine. The proposed model consists of a convolutional neural network (CNN) trained to detect DRC violations. The model was trained with artificial data that was derived from a set of $50$ SRAM designs. The focus in this demonstration was metal 1 ru

In [4]:
# Categories we want to collect from
CATEGORIES = [
    'cs.AI',      # Artificial Intelligence
    'cs.LG',      # Machine Learning
    'cs.CL',      # Computation and Language (NLP)
    'cs.CV',      # Computer Vision
    'cs.IR',      # Information Retrieval
    'stat.ML'     # Statistics - Machine Learning
]

# How many papers per category
PAPERS_PER_CATEGORY = 2000  # Start with 2k per category = 12k total

print(f"Target collection:")
print(f"  Categories: {len(CATEGORIES)}")
print(f"  Papers per category: {PAPERS_PER_CATEGORY}")
print(f"  Total papers: {len(CATEGORIES) * PAPERS_PER_CATEGORY}")
print(f"\nEstimated time: ~10-15 minutes")

Target collection:
  Categories: 6
  Papers per category: 2000
  Total papers: 12000

Estimated time: ~10-15 minutes
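
The time estimate can be sanity-checked with simple arithmetic. The sketch below assumes the `arxiv` client fetches 100 results per request and waits 3 seconds between requests; these are believed to be the package defaults (`page_size`, `delay_seconds`), but check them against the installed version. The function name is hypothetical, and the result is only a rough floor from the waits alone, ignoring network and parsing time.

```python
import math


def estimate_delay_floor(n_categories, n_sorts, per_sort,
                         page_size=100, delay_seconds=3.0):
    """Rough floor on collection time (seconds) from inter-request waits."""
    pages_per_search = math.ceil(per_sort / page_size)
    # Assumption: no wait is needed before the first page of a search
    waits_per_search = max(pages_per_search - 1, 0)
    return n_categories * n_sorts * waits_per_search * delay_seconds


seconds = estimate_delay_floor(n_categories=6, n_sorts=2, per_sort=1000)
print(f"{seconds / 60:.1f} minutes")  # 5.4 minutes
```

Two sort orders of 1,000 papers each across 6 categories give 12 searches of 10 pages, so the waits alone account for roughly 5 minutes of the estimate.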


In [7]:
# Use the new Client API (fixes deprecation warning)
client = arxiv.Client()

def collect_papers_balanced(category, max_results=1000):
    """
    Collect up to `max_results` papers per sort order (most recent first,
    then most relevant) for one category, skipping IDs already seen, so
    the sample is not limited to the newest submissions.
    """
    papers = []
    seen_ids = set()
    
    # Sort criteria to use
    sort_criteria = [
        (arxiv.SortCriterion.SubmittedDate, "recent"),
        (arxiv.SortCriterion.Relevance, "relevant")
    ]
    
    for sort_by, label in sort_criteria:
        print(f"  Collecting {label} papers...")
        
        search = arxiv.Search(
            query=f"cat:{category}",
            max_results=max_results,
            sort_by=sort_by
        )
        
        for paper in tqdm(client.results(search),
                         total=max_results,
                         desc=f"  {label}"):
            paper_id = paper.entry_id.split('/')[-1]
            
            # Skip if already collected
            if paper_id in seen_ids:
                continue
            
            seen_ids.add(paper_id)
            papers.append({
                'paper_id': paper_id,
                'title': paper.title,
                'abstract': paper.summary,
                'authors': [a.name for a in paper.authors],
                'categories': paper.categories,
                'primary_category': paper.primary_category,
                'published': str(paper.published.date()),
                'updated': str(paper.updated.date()),
                'pdf_url': paper.pdf_url
            })
    
    return papers

# Test with a small sample: 5 papers per sort order, so up to 10 unique papers
print("Testing balanced collection with cs.AI (10 papers)...")
test_papers = collect_papers_balanced('cs.AI', max_results=5)
print(f"\n✓ Collected {len(test_papers)} unique papers")

# Check date range
dates = [p['published'] for p in test_papers]
print(f"Date range: {min(dates)} → {max(dates)}")

Testing balanced collection with cs.AI (10 papers)...
  Collecting recent papers...
  recent: 100%|██████████| 5/5 [00:00<00:00, 23.02it/s]
  Collecting relevant papers...
  relevant: 100%|██████████| 5/5 [00:03<00:00,  1.45it/s]

✓ Collected 10 unique papers
Date range: 2020-12-09 → 2026-02-10
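
Note that `paper_id` as extracted in `collect_papers_balanced` keeps the version suffix (for example `2012.11510v1`). If a version-independent key is preferred for deduplication, the suffix can be stripped. This is a minimal sketch with a hypothetical helper name, assuming new-style identifiers of the form `YYMM.NNNNN`:

```python
import re

# Matches a trailing version marker such as "v1" or "v12"
_VERSION_SUFFIX = re.compile(r"v\d+$")


def base_arxiv_id(entry_id):
    """Return the arXiv identifier without its version suffix."""
    short_id = entry_id.rstrip("/").split("/")[-1]
    return _VERSION_SUFFIX.sub("", short_id)


print(base_arxiv_id("http://arxiv.org/abs/2012.11510v1"))  # 2012.11510
```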





In [8]:
CATEGORIES = [
    'cs.AI',
    'cs.LG',
    'cs.CL',
    'cs.CV',
    'cs.IR',
    'stat.ML'
]

PAPERS_PER_SORT = 1000  # 1000 recent + 1000 relevant per category

all_papers = []

print("Starting balanced data collection...")
print("="*50)

for category in CATEGORIES:
    print(f"\nCollecting {category}...")
    
    papers = collect_papers_balanced(category, max_results=PAPERS_PER_SORT)
    all_papers.extend(papers)
    
    print(f"✓ Collected {len(papers)} unique papers from {category}")
    print(f"  Total so far: {len(all_papers)}")
    
    # Be polite to the API
    time.sleep(3)

print("\n" + "="*50)
print(f"✓ Collection complete!")
print(f"  Total papers (before global dedup): {len(all_papers)}")

Starting balanced data collection...

Collecting cs.AI...
  Collecting recent papers...
  recent: 100%|██████████| 1000/1000 [00:27<00:00, 35.95it/s]
  Collecting relevant papers...
  relevant: 100%|██████████| 1000/1000 [00:38<00:00, 26.17it/s]
✓ Collected 2000 unique papers from cs.AI
  Total so far: 2000

Collecting cs.LG...
  Collecting recent papers...
  recent: 100%|██████████| 1000/1000 [00:28<00:00, 35.61it/s]
  Collecting relevant papers...
  relevant: 100%|██████████| 1000/1000 [00:40<00:00, 24.75it/s]
✓ Collected 2000 unique papers from cs.LG
  Total so far: 4000

Collecting cs.CL...
  Collecting recent papers...
  recent: 100%|██████████| 1000/1000 [00:28<00:00, 35.69it/s]
  Collecting relevant papers...
  relevant: 100%|██████████| 1000/1000 [00:42<00:00, 23.79it/s]
✓ Collected 2000 unique papers from cs.CL
  Total so far: 6000

Collecting cs.CV...
  Collecting recent papers...
  recent: 100%|██████████| 1000/1000 [00:39<00:00, 25.61it/s]
  Collecting relevant papers...
  relevant: 100%|██████████| 1000/1000 [00:36<00:00, 27.62it/s]
✓ Collected 2000 unique papers from cs.CV
  Total so far: 8000

Collecting cs.IR...
  Collecting recent papers...
  recent: 100%|██████████| 1000/1000 [00:36<00:00, 27.77it/s]
  Collecting relevant papers...
  relevant: 100%|██████████| 1000/1000 [00:54<00:00, 18.18it/s]
✓ Collected 2000 unique papers from cs.IR
  Total so far: 10000

Collecting stat.ML...
  Collecting recent papers...
  recent: 100%|██████████| 1000/1000 [00:43<00:00, 23.22it/s]
  Collecting relevant papers...
  relevant: 100%|██████████| 1000/1000 [01:33<00:00, 10.75it/s]
✓ Collected 2000 unique papers from stat.ML
  Total so far: 12000

✓ Collection complete!
  Total papers (before global dedup): 12000
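
Each call to `collect_papers_balanced` only deduplicates within a single category, so a paper that is cross-listed (for example in both cs.AI and cs.LG) can be collected once per category. A global pass over `all_papers` is therefore still needed. The sketch below shows the idea in plain Python; the helper name and the toy records are hypothetical and only illustrate the first-occurrence-wins behaviour.

```python
def dedupe_by_id(records, key="paper_id"):
    """Keep the first record seen for each ID, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        if record[key] in seen:
            continue
        seen.add(record[key])
        unique.append(record)
    return unique


# Toy records (not real papers): "A" was fetched under two categories
sample = [
    {"paper_id": "A", "fetched_from": "cs.AI"},
    {"paper_id": "B", "fetched_from": "cs.AI"},
    {"paper_id": "A", "fetched_from": "cs.LG"},
]
unique = dedupe_by_id(sample)
print(len(sample) - len(unique))  # 1
```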


In [9]:
# Convert to dataframe
df = pd.DataFrame(all_papers)

print("Before deduplication:")
print(f"  Total papers: {len(df)}")

# Remove duplicates based on paper_id
df = df.drop_duplicates(subset='paper_id')

print(f"\nAfter deduplication:")
print(f"  Unique papers: {len(df)}")
print(f"  Duplicates removed: {len(all_papers) - len(df)}")

# Reset index
df = df.reset_index(drop=True)

# Basic stats
print(f"\nDataset overview:")
print(f"  Date range: {df['published'].min()} → {df['published'].max()}")
print(f"\nPapers per primary category:")
print(df['primary_category'].value_counts())

# Save to disk
os.makedirs('../data/raw', exist_ok=True)
df.to_csv('../data/raw/arxiv_papers.csv', index=False)
print(f"\n✓ Saved to ../data/raw/arxiv_papers.csv")

df.to_json('../data/raw/arxiv_papers.json', orient='records', indent=2)
print(f"✓ Saved to ../data/raw/arxiv_papers.json")

Before deduplication:
  Total papers: 12000

After deduplication:
  Unique papers: 9562
  Duplicates removed: 2438

Dataset overview:
  Date range: 2008-08-07 → 2026-02-10

Papers per primary category:
primary_category
cs.LG          2036
cs.CV          1676
cs.CL          1652
cs.IR          1264
cs.AI           650
               ... 
cs.PF             1
q-bio.MN          1
nlin.CD           1
cs.GL             1
astro-ph.HE       1
Name: count, Length: 102, dtype: int64

✓ Saved to ../data/raw/arxiv_papers.csv
✓ Saved to ../data/raw/arxiv_papers.json
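
One caveat when reloading the CSV: list-valued columns such as `authors` and `categories` are written as their string representation, so they come back as strings, not lists. The JSON export keeps them as real lists. The sketch below reproduces the effect with the standard `csv` module and a hypothetical record, and restores the list with `ast.literal_eval`:

```python
import ast
import csv
import io

# Hypothetical record with a list-valued field, as in the collected data
row = {"paper_id": "example-id", "categories": ["cs.LG", "cs.AI"]}

# Write the list as its string representation, as happens for list cells in a CSV
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["paper_id", "categories"])
writer.writeheader()
writer.writerow({"paper_id": row["paper_id"],
                 "categories": str(row["categories"])})

# Read it back: the field is now a plain string
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(type(loaded["categories"]).__name__)  # str

# Safely parse the string back into a Python list
restored = ast.literal_eval(loaded["categories"])
print(restored)  # ['cs.LG', 'cs.AI']
```

With pandas, the same parse can be applied per column after `pd.read_csv`, for example `df['categories'].apply(ast.literal_eval)`.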


In [11]:
import matplotlib.pyplot as plt
import ast

# Abstract length distribution
df['abstract_length'] = df['abstract'].str.len()

print("Abstract Statistics:")
print(f"  Mean length:   {df['abstract_length'].mean():.0f} characters")
print(f"  Median length: {df['abstract_length'].median():.0f} characters")
print(f"  Min length:    {df['abstract_length'].min()} characters")
print(f"  Max length:    {df['abstract_length'].max()} characters")

# Papers per year
df['year'] = pd.to_datetime(df['published']).dt.year
yearly_counts = df['year'].value_counts().sort_index()

print(f"\nPapers per year (last 5 years):")
print(yearly_counts.tail(5))

# Check for missing abstracts
missing = df['abstract'].isna().sum()
print(f"\nMissing abstracts: {missing}")
print(f"Empty abstracts: {(df['abstract'] == '').sum()}")

print("\n✓ EDA complete - data looks clean!")

Abstract Statistics:
  Mean length:   1228 characters
  Median length: 1230 characters
  Min length:    125 characters
  Max length:    1924 characters

Papers per year (last 5 years):
year
2022     118
2023     276
2024     152
2025     782
2026    4133
Name: count, dtype: int64

Missing abstracts: 0
Empty abstracts: 0

✓ EDA complete - data looks clean!


In [12]:
# Check the suspiciously short abstracts
print("Shortest abstracts:")
short_papers = df.nsmallest(3, 'abstract_length')[['title', 'abstract', 'abstract_length']]

for _, row in short_papers.iterrows():
    print(f"\nTitle: {row['title']}")
    print(f"Length: {row['abstract_length']} chars")
    print(f"Abstract: {row['abstract']}")
    print("-"*50)

Shortest abstracts:

Title: Anomaly Detection Based on Deep Learning Using Video for Prevention of Industrial Accidents
Length: 125 chars
Abstract: This paper proposes an anomaly detection method for the prevention of industrial accidents using machine learning technology.
--------------------------------------------------

Title: Covapixels
Length: 167 chars
Abstract: We propose and discuss the summarization of superpixel-type image tiles/patches using mean and covariance information. We refer to the resulting objects as covapixels.
--------------------------------------------------

Title: A Minesweeper Solver Using Logic Inference, CSP and Sampling
Length: 187 chars
Abstract: Minesweeper as a puzzle video game and is proved that it is an NPC problem. We use CSP, Logic Inference and Sampling to make a minesweeper solver and we limit us each select in 5 seconds.
--------------------------------------------------


In [13]:
# Check impact of different thresholds
thresholds = [200, 400, 600, 800, 1000]

print("Impact of different minimum abstract lengths:\n")
print(f"{'Threshold':<12} {'Papers Kept':<15} {'Papers Removed':<15} {'% Removed':<10}")
print("-" * 52)

for threshold in thresholds:
    kept = len(df[df['abstract_length'] >= threshold])
    removed = len(df) - kept
    pct = (removed / len(df)) * 100
    print(f"{threshold:<12} {kept:<15} {removed:<15} {pct:.1f}%")

print(f"\nCurrent dataset: {len(df)} papers")

Impact of different minimum abstract lengths:

Threshold    Papers Kept     Papers Removed  % Removed 
----------------------------------------------------
200          9557            5               0.1%
400          9516            46              0.5%
600          9280            282             2.9%
800          8601            961             10.1%
1000         7181            2381            24.9%

Current dataset: 9562 papers


In [14]:
# Look at papers between 600-800 characters
mid_range = df[(df['abstract_length'] >= 600) & 
               (df['abstract_length'] < 800)]

print(f"Papers between 600-800 chars: {len(mid_range)}")
print("\nSample abstracts in this range:\n")

for _, row in mid_range.sample(3).iterrows():
    print(f"Title: {row['title']}")
    print(f"Length: {row['abstract_length']} chars")
    print(f"Abstract: {row['abstract']}")
    print("-"*50)

Papers between 600-800 chars: 679

Sample abstracts in this range:

Title: How important is Recall for Measuring Retrieval Quality?
Length: 680 chars
Abstract: In realistic retrieval settings with large and evolving knowledge bases, the total number of documents relevant to a query is typically unknown, and recall cannot be computed. In this paper, we evaluate several established strategies for handling this limitation by measuring the correlation between retrieval quality metrics and LLM-based judgments of response quality, where responses are generated from the retrieved documents. We conduct experiments across multiple datasets with a relatively low number of relevant documents (2-15). We also introduce a simple retrieval quality measure that performs well without requiring knowledge of the total number of relevant documents.
--------------------------------------------------
Title: Demonstrating PAR4SEM - A Semantic Writing Aid with Adaptive Paraphrasing
Length: 674 chars
Abstract:

In [15]:
# Filter short abstracts
original_count = len(df)
df = df[df['abstract_length'] >= 600].reset_index(drop=True)

print(f"Before filtering: {original_count} papers")
print(f"After filtering:  {len(df)} papers")
print(f"Removed:          {original_count - len(df)} papers")
print(f"\n✓ Dataset ready for embedding generation!")
print(f"  Papers: {len(df)}")
print(f"  Date range: {df['published'].min()} → {df['published'].max()}")
print(f"  Avg abstract length: {df['abstract_length'].mean():.0f} chars")

# Save cleaned dataset
df.to_csv('../data/raw/arxiv_papers_cleaned.csv', index=False)
df.to_json('../data/raw/arxiv_papers_cleaned.json', orient='records', indent=2)
print(f"\n✓ Saved cleaned dataset")

Before filtering: 9562 papers
After filtering:  9280 papers
Removed:          282 papers

✓ Dataset ready for embedding generation!
  Papers: 9280
  Date range: 2008-08-07 → 2026-02-10
  Avg abstract length: 1251 chars

✓ Saved cleaned dataset
