# Test arXiv PDF Parser Integration

This notebook tests the arXiv client, PDF parsing via Docling, metadata extraction, and the chunking workflow.

In [1]:
import os
os.chdir("./..")
print("CWD:", os.getcwd())

CWD: d:\Projects\arxiv-ai-explorer\backend


In [2]:
import asyncio
from pathlib import Path

from src.services.arxiv.client import ArxivClient
from src.services.pdf_parser.factory import make_pdf_parser_service
from src.services.arxiv.metadata_extractor import MetadataExtractor
from src.config import get_settings



## Test 1: Search for recent papers

In [3]:
settings = get_settings()
client = ArxivClient()
category = (settings.arxiv_categories[0] if settings.arxiv_categories else 'cs.AI')
print(f'Using category: {category}')

papers = await client.search_papers(
    query=f'cat:{category}',
    max_results=3,
    sort_by='submittedDate',
    sort_order='descending',
)
print(f'Found {len(papers)} papers')
for i, paper in enumerate(papers):
    print(f"\n{i+1}. {paper['title'][:80]}...")
    print(f"   arXiv ID: {paper['arxiv_id']}")
    print(f"   PDF URL: {paper['pdf_url']}")

test_paper = papers[0] if papers else None
assert test_paper, 'No papers found for the selected category'

Using category: cs.AI
[ 2025-09-21 14:28:04,048 ] [researchmind] | Module: client |Function: _make_request | Line: 50 - INFO - Fetching arXiv data: http://export.arxiv.org/api/query?search_query=cat%3Acs.AI&start=0&max_results=3&sortBy=submittedDate&sortOrder=descending
[ 2025-09-21 14:28:04,210 ] [researchmind] | Module: client |Function: search_papers | Line: 178 - INFO - Found 3 papers for query: cat:cs.AI
Found 3 papers

1. Generalizable Geometric Image Caption Synthesis...
   arXiv ID: 2509.15217v1
   PDF URL: http://arxiv.org/pdf/2509.15217v1

2. Explicit Context-Driven Neural Acoustic Modeling for High-Fidelity RIR Generatio...
   arXiv ID: 2509.15210v1
   PDF URL: http://arxiv.org/pdf/2509.15210v1

3. FlowRL: Matching Reward Distributions for LLM Reasoning...
   arXiv ID: 2509.15207v1
   PDF URL: http://arxiv.org/pdf/2509.15207v1
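
The log line above shows the request URL that the client sends to the arXiv API. As a minimal sketch, the same URL can be rebuilt with the standard library; `build_query_url` is a hypothetical helper written for illustration and is not part of `ArxivClient`.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"  # endpoint copied from the log line above


def build_query_url(category: str, max_results: int = 3, start: int = 0) -> str:
    """Build an arXiv API URL for the newest submissions in one category."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    # urlencode percent-encodes the colon in "cat:cs.AI" as %3A
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_query_url("cs.AI")
print(url)
```

The parameter names (`search_query`, `start`, `max_results`, `sortBy`, `sortOrder`) are taken from the logged URL; how `ArxivClient` assembles them internally is not shown in this notebook.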


In [4]:
test_paper

{'arxiv_id': '2509.15217v1',
 'title': 'Generalizable Geometric Image Caption Synthesis',
 'abstract': 'Multimodal large language models have various practical applications that demand strong reasoning abilities. Despite recent advancements, these models still struggle to solve complex geometric problems. A key challenge stems from the lack of high-quality image-text pair datasets for understanding geometric images. Furthermore, most template-based data synthesis pipelines typically fail to generalize to questions beyond their predefined templates. In this paper, we bridge this gap by introducing a complementary process of Reinforcement Learning with Verifiable Rewards (RLVR) into the data generation pipeline. By adopting RLVR to refine captions for geometric images synthesized from 50 basic geometric relations and using reward signals derived from mathematical problem-solving tasks, our pipeline successfully captures the key features of geometry problem-solving. This enables better ta

## Test 2: Download PDF from arXiv URL

In [5]:
print(f"Testing with paper: {test_paper['title']}")
print(f"arXiv ID: {test_paper['arxiv_id']}")

download_dir = Path('./data/test_downloads')
download_dir.mkdir(parents=True, exist_ok=True)

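# NOTE: pdf_url is hard-coded to 2509.15207 (FlowRL) rather than test_paper['pdf_url'],
# so the file is named after test_paper's arXiv ID but holds a different paper.
pdf_path = await client.download_pdf(
    pdf_url="https://arxiv.org/pdf/2509.15207",
    download_path=download_dir / f"{test_paper['arxiv_id'].replace('/', '_')}.pdf",
    max_file_size_mb=150
)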
print(f'Downloaded PDF to: {pdf_path}')
print(f'File size: {pdf_path.stat().st_size / (1024*1024):.2f} MB')
print(f'File exists: {pdf_path.exists()}')

Testing with paper: Generalizable Geometric Image Caption Synthesis
arXiv ID: 2509.15217v1
[ 2025-09-21 14:28:04,397 ] [researchmind] | Module: client |Function: download_pdf | Line: 272 - INFO - Downloading PDF from: https://arxiv.org/pdf/2509.15207
[ 2025-09-21 14:28:04,650 ] [researchmind] | Module: client |Function: download_pdf | Line: 296 - INFO - Downloaded PDF (0.9MB) to: data\test_downloads\2509.15217v1.pdf
Downloaded PDF to: data\test_downloads\2509.15217v1.pdf
File size: 0.88 MB
File exists: True
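
In the run above, the PDF is fetched from `https://arxiv.org/pdf/2509.15207` but saved as `2509.15217v1.pdf`, so the file name and the file content refer to different papers. A small guard along the following lines could catch that before downloading. This is only a sketch: `base_arxiv_id` and `url_matches_paper` are hypothetical helpers, and the regex assumes new-style arXiv IDs (`YYMM.NNNNN`, optionally followed by a version suffix).

```python
import re

# Matches new-style arXiv IDs such as 2509.15207 or 2509.15217v1 (assumption:
# old-style IDs like "hep-th/9901001" are out of scope for this sketch).
_ARXIV_ID = re.compile(r"(\d{4}\.\d{4,5})(v\d+)?")


def base_arxiv_id(text: str) -> str:
    """Return the version-less arXiv ID found in a URL or ID string."""
    match = _ARXIV_ID.search(text)
    if match is None:
        raise ValueError(f"no arXiv ID found in {text!r}")
    return match.group(1)


def url_matches_paper(pdf_url: str, arxiv_id: str) -> bool:
    """True when the PDF URL and the paper record point at the same paper."""
    return base_arxiv_id(pdf_url) == base_arxiv_id(arxiv_id)


same = url_matches_paper("http://arxiv.org/pdf/2509.15217v1", "2509.15217v1")
different = url_matches_paper("https://arxiv.org/pdf/2509.15207", "2509.15217v1")
print(same, different)
```

Comparing version-less IDs is a design choice: it lets a URL without a version suffix match a record that carries one.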


## Test 3: Parse downloaded PDF with Docling

In [6]:
pdf_parser = make_pdf_parser_service()
parsed_content = await pdf_parser.parse_pdf(pdf_path)

    

[ 2025-09-21 14:28:04,686 ] [docling.datamodel.document] | Module: document |Function: _guess_format | Line: 328 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
[ 2025-09-21 14:28:04,897 ] [docling.document_converter] | Module: document_converter |Function: _convert | Line: 318 - INFO - Going to convert document batch...
[ 2025-09-21 14:28:04,898 ] [docling.document_converter] | Module: document_converter |Function: _get_pipeline | Line: 363 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 60c8066c482b9239b869b997da3fb1da
[ 2025-09-21 14:28:05,879 ] [docling.models.factories.base_factory] | Module: base_factory |Function: load_from_plugins | Line: 112 - INFO - Loading plugin 'docling_defaults'
[ 2025-09-21 14:28:05,885 ] [docling.models.factories] | Module: __init__ |Function: get_picture_description_factory | Line: 26 - INFO - Registered picture descriptions: ['vlm', 'api']
[ 2025-09-21 14:28:05,943 ] [docling.models.factories.base_factory] | Module: base

In [27]:

if parsed_content:
    # `paper` is the loop variable left over from the search cell (the last search result)
    paper_with_content = {**paper}
    paper_with_content['content'] = parsed_content.raw_text
    paper_with_content['sections'] = [
        {'title': section.title, 'content': section.content}
        for section in parsed_content.sections
    ]
    paper_with_content['is_processed'] = True

In [29]:
paper_with_content

{'arxiv_id': '2509.15207v1',
 'title': 'FlowRL: Matching Reward Distributions for LLM Reasoning',
 'abstract': 'We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (\\eg, PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of $10.0\\%$ over GRPO and $5.1\\%$ over PPO on

## Test 4: Build paper_with_content and extract metadata

In [8]:
from src.services.arxiv.metadata_extractor import MetadataExtractor


In [9]:
paper_with_content = {**test_paper}
paper_with_content['content'] = parsed_content.raw_text if parsed_content else ''
paper_with_content['sections'] = [
    {'title': s.title, 'content': s.content} for s in (parsed_content.sections if parsed_content else [])
]
paper_with_content['is_processed'] = True

metadata_extractor = MetadataExtractor()
enriched = await metadata_extractor.extract_metadata(paper_with_content)



[ 2025-09-21 14:28:44,995 ] [researchmind] | Module: metadata_extractor |Function: extract_metadata | Line: 177 - INFO - Extracting metadata for paper: 2509.15217v1


[ 2025-09-21 14:28:45,031 ] [researchmind] | Module: metadata_extractor |Function: extract_metadata | Line: 251 - INFO - Extracted metadata: 1 metrics


In [10]:
paper_with_content.update(enriched)


In [11]:
paper_with_content

{'arxiv_id': '2509.15217v1',
 'title': 'Generalizable Geometric Image Caption Synthesis',
 'abstract': 'Multimodal large language models have various practical applications that demand strong reasoning abilities. Despite recent advancements, these models still struggle to solve complex geometric problems. A key challenge stems from the lack of high-quality image-text pair datasets for understanding geometric images. Furthermore, most template-based data synthesis pipelines typically fail to generalize to questions beyond their predefined templates. In this paper, we bridge this gap by introducing a complementary process of Reinforcement Learning with Verifiable Rewards (RLVR) into the data generation pipeline. By adopting RLVR to refine captions for geometric images synthesized from 50 basic geometric relations and using reward signals derived from mathematical problem-solving tasks, our pipeline successfully captures the key features of geometry problem-solving. This enables better ta

## Test 5: Chunker test


In [13]:
from src.services.chunking.chunker import PaperChunker

In [14]:
paper_chunker = PaperChunker()

chunks = paper_chunker.chunk_paper(enriched)



[ 2025-09-21 14:28:45,530 ] [sentence_transformers.SentenceTransformer] | Module: SentenceTransformer |Function: __init__ | Line: 219 - INFO - Use pytorch device_name: cpu
[ 2025-09-21 14:28:45,531 ] [sentence_transformers.SentenceTransformer] | Module: SentenceTransformer |Function: __init__ | Line: 227 - INFO - Load pretrained SentenceTransformer: all-MiniLM-L6-v2


In [15]:
chunks

[{'arxiv_id': '2509.15217v1',
  'title': 'Generalizable Geometric Image Caption Synthesis',
  'primary_category': 'cs.AI',
  'categories': ['cs.AI', 'cs.CV', 'cs.LG'],
  'section_title': 'Content',
  'section_type': 'Content',
  'chunk_index': 0,
  'total_chunks': 1,
  'chunk_text': 'arXiv:2509.15207v1  [cs.LG]  18 Sep 2025\nFlowRL: Matching Reward Distributions for LLM Reasoning\n2025-09-17',
  'start_char': 0,
  'end_char': 107,
  'published_date': '2025-09-18T17:59:11Z',
  'authors': ['Yue Xin',
   'Wenyuan Wang',
   'Rui Pan',
   'Ruida Wang',
   'Howard Meng',
   'Renjie Pi',
   'Shizhe Diao',
   'Tong Zhang'],
  'word_count': 13},
 {'arxiv_id': '2509.15217v1',
  'title': 'Generalizable Geometric Image Caption Synthesis',
  'primary_category': 'cs.AI',
  'categories': ['cs.AI', 'cs.CV', 'cs.LG'],
  'section_title': 'FlowRL: Matching Reward Distributions for LLM Reasoning',
  'section_type': 'FlowRL: Matching Reward Distributions for LLM Reasoning',
  'chunk_index': 0,
  'total_chu
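
Each chunk above carries `chunk_index`, `total_chunks`, `chunk_text`, `start_char`, `end_char`, and `word_count`. The sketch below shows one way such records could be produced with fixed-size character windows. It is an assumption for illustration only: `chunk_text`, `chunk_size`, and `overlap` are hypothetical names, and `PaperChunker` may split text differently (the log suggests it loads an embedding model, which this sketch does not use).

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[dict]:
    """Split text into overlapping character windows and describe each one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    spans = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        spans.append((start, end))
        if end == len(text):
            break  # the last window reached the end of the text
        start += step
    return [
        {
            "chunk_index": i,
            "total_chunks": len(spans),
            "chunk_text": text[s:e],
            "start_char": s,
            "end_char": e,
            "word_count": len(text[s:e].split()),
        }
        for i, (s, e) in enumerate(spans)
    ]


# The text of the first chunk shown in the output above.
header = (
    "arXiv:2509.15207v1  [cs.LG]  18 Sep 2025\n"
    "FlowRL: Matching Reward Distributions for LLM Reasoning\n"
    "2025-09-17"
)
demo_chunks = chunk_text(header)
print(demo_chunks[0]["end_char"], demo_chunks[0]["word_count"])
```

For this short header the sketch yields a single chunk whose `end_char` and `word_count` agree with the values in the notebook output (107 and 13).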

In [16]:
enriched.keys()

dict_keys(['arxiv_id', 'title', 'abstract', 'authors', 'categories', 'primary_category', 'published', 'updated', 'pdf_url', 'arxiv_url', 'doi', 'journal_ref', 'content', 'sections', 'is_processed', 'metrics', 'research_area', 'research_areas_all', 'word_count', 'author_count', 'institutions'])

dict_keys(['arxiv_id', 'title', 'summary', 'authors', 'categories', 'primary_category', 'published', 'updated', 'pdf_url', 'abs_url', 'doi', 'journal_ref', 'content', 'sections', 'is_processed', 'datasets', 'metrics', 'research_area', 'research_areas_all', 'word_count', 'author_count', 'institutions'])


In [17]:
enriched

{'arxiv_id': '2509.15217v1',
 'title': 'Generalizable Geometric Image Caption Synthesis',
 'abstract': 'Multimodal large language models have various practical applications that demand strong reasoning abilities. Despite recent advancements, these models still struggle to solve complex geometric problems. A key challenge stems from the lack of high-quality image-text pair datasets for understanding geometric images. Furthermore, most template-based data synthesis pipelines typically fail to generalize to questions beyond their predefined templates. In this paper, we bridge this gap by introducing a complementary process of Reinforcement Learning with Verifiable Rewards (RLVR) into the data generation pipeline. By adopting RLVR to refine captions for geometric images synthesized from 50 basic geometric relations and using reward signals derived from mathematical problem-solving tasks, our pipeline successfully captures the key features of geometry problem-solving. This enables better ta

# Test: Testing the retriever

In [1]:
import os
os.chdir("./..")
print("CWD:", os.getcwd())

from src.services.retrieval import get_retriever

from dotenv import load_dotenv
load_dotenv()



CWD: d:\Projects\arxiv-ai-explorer\backend







True

In [2]:
q = "reward function"
top_k = 10


In [3]:
retriever = get_retriever()



In [4]:
include_sections = []
exclude_sections = ["References", "Content"]

chunks_no_references = retriever.vector_search(q, limit=top_k, include_sections=include_sections, exclude_sections=exclude_sections)


Tool Called with Vector search: reward function | Limit: 10 | Include Sections: [] | Exclude Sections: ['References', 'Content']


In [5]:
chunks_no_ref_sections = [section["section_title"] for section in chunks_no_references if section["section_title"]]
chunks_no_ref_sections

['E Additional Ablation Studies',
 '3.1. From Reward Maximization to Distribution Matching',
 '1. Introduction',
 '5. Experiment Settings',
 '4.1 Experimental Setup',
 'FlowRL: Matching Reward Distributions for LLM Reasoning',
 'B. Theoretical Analysis',
 '2. Preliminaries',
 '4.1. Reinforcement Learning for Reasoning',
 'C. GFlowNets']

In [6]:
chunks = retriever.vector_search(q, limit=top_k, include_sections=[], exclude_sections=[])


Tool Called with Vector search: reward function | Limit: 10 | Include Sections: [] | Exclude Sections: []


In [7]:
chunks_sections = [section["section_title"] for section in chunks if section["section_title"]]
chunks_sections

['References',
 'Content',
 'E Additional Ablation Studies',
 '3.1. From Reward Maximization to Distribution Matching',
 '1. Introduction',
 '5. Experiment Settings',
 '4.1 Experimental Setup',
 'FlowRL: Matching Reward Distributions for LLM Reasoning',
 'B. Theoretical Analysis',
 '2. Preliminaries']
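
Comparing the two result lists, excluding `References` and `Content` does not shorten the result: two further sections fill the freed slots. That is consistent with the filter being applied before the `limit` is enforced. The sketch below reproduces this behaviour in memory. `filter_by_section` is a hypothetical helper, and the 12-item ranking is an assumption pieced together from the two outputs; the real retriever most likely pushes the filter into the vector store query.

```python
def filter_by_section(ranked, include_sections=None, exclude_sections=None, limit=10):
    """Keep the first `limit` chunks whose section passes the include/exclude filters."""
    include = set(include_sections or [])
    exclude = set(exclude_sections or [])
    kept = []
    for chunk in ranked:
        title = chunk["section_title"]
        if include and title not in include:
            continue  # an empty include list means "allow every section"
        if title in exclude:
            continue
        kept.append(chunk)
        if len(kept) == limit:
            break
    return kept


# Assumed similarity ranking (best first), consistent with both outputs above.
ranked_titles = [
    "References",
    "Content",
    "E Additional Ablation Studies",
    "3.1. From Reward Maximization to Distribution Matching",
    "1. Introduction",
    "5. Experiment Settings",
    "4.1 Experimental Setup",
    "FlowRL: Matching Reward Distributions for LLM Reasoning",
    "B. Theoretical Analysis",
    "2. Preliminaries",
    "4.1. Reinforcement Learning for Reasoning",
    "C. GFlowNets",
]
ranked = [{"section_title": t} for t in ranked_titles]

unfiltered = [c["section_title"] for c in filter_by_section(ranked)]
filtered = [
    c["section_title"]
    for c in filter_by_section(ranked, exclude_sections=["References", "Content"])
]
print(filtered)
```

Filtering after truncating to `limit` would instead return only eight sections here, which is not what the notebook output shows.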

# Test: Testing OpenAI agents

In [8]:
from agents import Agent, ModelSettings
from src.agents.tools import search_papers

retrieval_agent = Agent(
    name="Paper Retrieval Agent",
    instructions=(
        "You are a research assistant. "
        "Use the search_papers tool to retrieve relevant papers or chunks. "
        "Apply filters when the user specifies sections to include or exclude. "
        "Answer the user's question based on the retrieved papers or chunks."
    ),
    model="gpt-5-mini",
    tools=[search_papers],
    model_settings=ModelSettings(tool_choice="auto"),
)




In [None]:
from agents import Runner

query = "How do large language models think and reason? Exclude sections titled `References`."


In [10]:
res = await Runner.run(retrieval_agent, query)

Tool Called with Search papers: reward function for COT reward function "COT" "Chain of Thought" reward function
Tool Called with Vector search: reward function for COT reward function "COT" "Chain of Thought" reward function | Limit: 10 | Include Sections: None | Exclude Sections: ['References']
Tool Called with Search papers: "reward function" "CoT" "chain-of-thought" r(x,y) reward function CoT paper
Tool Called with Vector search: "reward function" "CoT" "chain-of-thought" r(x,y) reward function CoT paper | Limit: 10 | Include Sections: None | Exclude Sections: ['References']


In [11]:
res.final_output

'Short answer (from FlowRL, excluding References):\n\n- Base formulation: FlowRL treats the reward r(x,y) as the scalar outcome reward and converts it into a target distribution\n  p~(y|x) ∝ exp(β · r(x,y)) / Zφ(x).\n\n- Modified reward used in FlowRL (incorporates a reference model prior):\n  replace β·r(x,y) with β·r(x,y) + log π_ref(y|x),\n  so the target becomes proportional to π_ref(y|x) · exp(β·r(x,y)).\n\n- Practical normalizations and transforms:\n  - Outcome reward r(x,y) is group-normalized within each sampled group:\n    r̂_i = (r_i − mean(r)) / std(r), where r = {r1,...,rG}.\n  - Length (reward) shaping: the log-probability term is normalized by sequence length (1/|y| · log πθ(y|x)) to avoid exploding gradients for long CoT sequences.\n  - Importance-sampling / off-policy correction: use weight w = detach[πθ(y|x)] / π_old(y|x) with PPO-style clipping to stabilize updates.\n\n- Hyperparameter: they follow prior work and set β = 15 in experiments.\n\nIn short: the CoT reward 

I searched the available paper excerpts and can answer based on those documents (I excluded any "References" sections as you requested). Two papers in the database are especially informative about how modern LLMs “think” and perform reasoning (how they are trained/steered and what mechanisms matter):

Key high-level points
- LLM “reasoning” is implemented as generating extended token sequences (chains‑of‑thought, CoT) whose intermediate steps function as the model’s internal reasoning trajectory.
- Supervised pretraining + in‑context examples produce a base ability to produce those trajectories; targeted fine‑tuning and RL are used to make those trajectories reliably correct and useful for downstream tasks. (See GeoReasoning-10K work and FlowRL excerpts.)
- Reward design and training objective strongly shape what the model considers a good reasoning trace: naive reward maximization can collapse to a single dominant solution mode, while objectives that encourage distributional coverage of high‑reward trajectories produce more diverse and robust reasoning. (FlowRL.)

How this looks in the papers (details and mechanisms)
- Reasoning as trajectories / long CoT sequences:
  - Reasoning tasks are cast as producing long token trajectories (CoT). Long trajectories (e.g., thousands of tokens) create optimization challenges for RL and standard reward signals (FlowRL excerpts).
- Reinforcement learning and reward signals:
  - GeoReasoning-10K: uses an RL loop to refine captions for geometric images. The reward is composite: a reasoning reward (does the caption enable solving the downstream math question, as evaluated by a frozen LLM) and a caption reward (semantic similarity to ground truth, measured by ROUGE/BLEU). The reasoning reward checks both answer format and correctness. This shapes model outputs toward captions that contain the key reasoning facts.
  - FlowRL: notes that reward‑maximizing RL (PPO, etc.) tends to optimize toward dominant modes and can ignore other valid reasoning paths (mode collapse). Instead, it proposes matching the model’s output distribution to a target reward distribution so the model samples diverse, high‑reward trajectories in proportion to their reward.
- Distribution matching / energy-based normalization:
  - FlowRL introduces a learned partition function Zφ(x) that turns scalar rewards into a normalized target distribution (π̃(y|x) ∝ exp(β r(x,y)) / Zφ(x)). Minimizing reverse KL between policy and that target (or using an equivalent trajectory‑balance loss inspired by GFlowNets) encourages sampling a variety of good reasoning trajectories rather than a single high‑probability one.
- Practical techniques to avoid sparse rewards and mode collapse:
  - Use auxiliary similarity rewards (to avoid early sparsity), e.g., caption ROUGE/BLEU in GeoReasoning.
  - Use entropy regularization or explicit training with higher-entropy/diverse data; or change objective from reward maximization to distribution matching (FlowRL).
  - Use frozen LLMs as verifiers to compute reasoning rewards (GeoReasoning).
- Empirical effects reported in snippets:
  - FlowRL reports consistent improvements on math and code reasoning benchmarks by promoting diverse solution exploration rather than collapsing to a single mode.
  - GeoReasoning-10K reports accuracy gains on several mathematical reasoning benchmarks when captions are refined with RL using verifiable rewards.

Short intuitive summary
LLMs “think” by producing sequences of tokens that can be read as stepwise reasoning. Training and fine‑tuning shape which sequences the model prefers. Carefully designed reward signals and training objectives (especially ones that reward a distribution of good reasoning traces rather than a single mode) improve accuracy, generalization, and robustness of the model’s internal reasoning paths.

Limitations / caveat
The above summary is drawn only from the provided excerpts (FlowRL: "FlowRL: Matching Reward Distributions for LLM Reasoning" and GeoReasoning: "Generalizable Geometric Image Caption Synthesis"). There are many other perspectives and experiments in the broader literature not included in these snippets. If you want, I can pull more excerpts from other papers in the database (e.g., on chain‑of‑thought, mechanistic interpretability, or verification) to expand or compare views.
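
The agent output above describes two FlowRL ingredients: group-normalized rewards, r̂_i = (r_i − mean(r)) / std(r), and a target distribution proportional to exp(β · r). The sketch below works through both for a small, finite group of sampled responses. It rests on stated assumptions: the population standard deviation is used (the summary does not say which variant applies), and the normalizer is computed by direct summation over the group, whereas FlowRL is described as learning the partition function Zφ(x). The function names are hypothetical.

```python
import math


def group_normalize(rewards):
    """Return (r_i - mean) / std for each reward, using the population std."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    if std == 0:
        return [0.0 for _ in rewards]  # all rewards equal: nothing to rank
    return [(r - mean) / std for r in rewards]


def target_distribution(rewards, beta=15.0):
    """Normalize exp(beta * r) over a finite group of responses."""
    scaled = [beta * r for r in rewards]
    shift = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)  # stands in for the learned Z in this toy setting
    return [w / total for w in weights]


normalized = group_normalize([1.0, 0.0, 0.0, 1.0])
probs = target_distribution(normalized, beta=1.0)
print(normalized)
print([round(p, 4) for p in probs])
```

With β = 15, the value the summary reports for the experiments, the same call concentrates almost all probability mass on the highest-reward responses; β = 1 is used in the demo only to keep the numbers readable.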