# ARXIV CONTENT RETRIEVER


- [Build an Arxiv paper content retriever](https://nayakpplaban.medium.com/build-an-arxiv-paper-content-retriever-and-summarizer-agent-using-completely-local-openai-swarm-cc03afeecef1)


## SETUP


## LLM

In [None]:
import os
from dotenv import load_dotenv

load_dotenv()
model = os.getenv("LLM_MODEL")

'qwen2.5:3b'

In [47]:
from openai import OpenAI
from swarm import Swarm, Agent

ollama_client = OpenAI(
    base_url="http://localhost:11434/v1", api_key="ollama"  # required but unused
)
# Wrap the Ollama-backed OpenAI client so Swarm agents run against the local server
client = Swarm(client=ollama_client)

## helper function to return arxiv paper details based on the topic provided

In [11]:
import arxiv
import pandas as pd


def get_url_topic(topic):
    """Search arXiv for `topic` and return the newest match as formatted text."""
    # Echo the topic being searched
    print(topic)
    # Set up the search parameters
    search = arxiv.Search(
        query=topic,
        max_results=1,  # You can adjust this number as needed
        sort_by=arxiv.SortCriterion.SubmittedDate,
        sort_order=arxiv.SortOrder.Descending,
    )

    # Prepare a list to store results
    all_data = []

    # Execute the search and collect results.
    # Note: Search.results() is deprecated in recent arxiv releases and emits a
    # DeprecationWarning; arxiv.Client().results(search) is the replacement.
    for result in search.results():
        paper_info = {
            "Title": result.title,
            "Date": result.published.date(),
            "Id": result.entry_id,
            "Summary": result.summary,
            "URL": result.pdf_url,
        }
        all_data.append(paper_info)

    # Return a message instead of hitting an unbound `results` when nothing matches
    if not all_data:
        return f"No arXiv results found for topic: {topic}"

    return "\n\n".join(
        [
            f"Title:{d['Title']}\nDate:{d['Date']}\nURL:{d['URL']}\nSummary:{d['Summary']}"
            for d in all_data
        ]
    )

In [14]:
print(get_url_topic("ChunkRag"))

ChunkRag


  for result in search.results():


Title:ChunkRAG: Novel LLM-Chunk Filtering Method for RAG Systems
Date:2024-10-25
URL:http://arxiv.org/pdf/2410.19572v4
Summary:Retrieval-Augmented Generation (RAG) systems using large language models
(LLMs) often generate inaccurate responses due to the retrieval of irrelevant
or loosely related information. Existing methods, which operate at the document
level, fail to effectively filter out such content. We propose LLM-driven chunk
filtering, ChunkRAG, a framework that enhances RAG systems by evaluating and
filtering retrieved information at the chunk level. Our approach employs
semantic chunking to divide documents into coherent sections and utilizes
LLM-based relevance scoring to assess each chunk's alignment with the user's
query. By filtering out less pertinent chunks before the generation phase, we
significantly reduce hallucinations and improve factual accuracy. Experiments
show that our method outperforms existing RAG models, achieving higher accuracy
on tasks requiring precis

## helper function to extract content of arxiv paper based on the url provided

In [48]:
from typing import List, Dict
from langchain_ollama import ChatOllama
from arxiv2text import arxiv_to_text


class SummarizerAgent:
    def __init__(self):
        # Small local model served by Ollama; temperature 0 keeps output stable
        self.llm = ChatOllama(model="llama3.2:1b", temperature=0.0, num_predict=1000)

    def extract_content(self, url: str) -> str:
        """
        Download the arXiv PDF at `url` and return its text
        """
        return arxiv_to_text(url)

    def summarize_paper(self, paper: Dict, content: str = "") -> str:
        """
        Summarize a single paper using the local Llama 3.2 model.
        `content` is optional; without it the summary relies on the metadata only.
        """
        prompt = f"""
        Please provide a concise summary of the following research paper:
        Title: {paper['title']}
        Authors: {', '.join(paper['authors'])}
        Abstract: {paper['summary']}
        Content: {content}

        Generate a clear, concise, and informative summary in no more than 6-8 sentences.
        """

        # invoke() returns an AIMessage; the generated text is in .content
        return self.llm.invoke(prompt).content

    def summarize_papers(self, papers: List[Dict]) -> List[Dict]:
        """
        Summarize multiple papers
        """
        summarized_papers = []
        for paper in papers:
            summary = self.summarize_paper(paper)
            summarized_papers.append(
                {"title": paper["title"], "summary": summary, "original_paper": paper}
            )

        return summarized_papers

In [None]:
# test the helper function


def extract_content(url):
    """Download the arXiv PDF at `url` and return its text."""
    summ = SummarizerAgent()
    content = summ.extract_content(url)
    return content


content = extract_content("https://arxiv.org/pdf/2502.13966v2")

In [52]:
llm = ChatOllama(model="llama3.2:latest", temperature=0, num_predict=1000)
llm.invoke("what is 2+2")

AIMessage(content='2 + 2 = 4', additional_kwargs={}, response_metadata={'model': 'llama3.2:latest', 'created_at': '2025-02-21T01:57:56.685224Z', 'done': True, 'done_reason': 'stop', 'total_duration': 958943166, 'load_duration': 34555791, 'prompt_eval_count': 31, 'prompt_eval_duration': 845000000, 'eval_count': 8, 'eval_duration': 78000000, 'message': Message(role='assistant', content='', images=None, tool_calls=None)}, id='run-57d47a04-d0d2-47d7-94ad-075d4c8b6bb7-0', usage_metadata={'input_tokens': 31, 'output_tokens': 8, 'total_tokens': 39})

## Create URL Agent

In [53]:
url_agent = Agent(
    name="Extract URL Assistant",
    instructions="Get the arxiv search results for the given topic.",
    functions=[get_url_topic],
    model=model,
)

## Create Extract URL Agent

In [54]:
extract_url = Agent(
    name="URL Assistant", instructions="Get the URL from the given content.", model=model
)
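
The URL Assistant relies on the model to pick the link out of the search text. Since `get_url_topic` always emits a line starting with `URL:`, a deterministic fallback is also possible. The following is a minimal sketch of that idea; `first_pdf_url` is a hypothetical helper name, not part of the original notebook or of Swarm.

```python
import re
from typing import Optional

# Matches the "URL:<link>" line in the text format produced by get_url_topic.
_URL_LINE = re.compile(r"^URL:\s*(https?://\S+)", re.MULTILINE)


def first_pdf_url(search_output: str) -> Optional[str]:
    """Return the first URL found on a 'URL:' line, or None if there is none."""
    match = _URL_LINE.search(search_output)
    return match.group(1) if match else None


sample = (
    "Title:ChunkRAG: Novel LLM-Chunk Filtering Method for RAG Systems\n"
    "Date:2024-10-25\n"
    "URL:http://arxiv.org/pdf/2410.19572v4\n"
    "Summary:Retrieval-Augmented Generation (RAG) systems ..."
)
print(first_pdf_url(sample))
```

A helper like this could be used to cross-check the agent's answer, or to replace the agent entirely when the input format is known.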

## Create Summary Agent

In [55]:
content_agent = Agent(
    name="Extract Summary Assistant",
    instructions="""Generate a clear, concise, and informative summary of the arXiv paper. The summary should include the authors of the paper, the date it was published, and
                  the concept behind the topic explained in the paper.""",
    functions=[extract_content],
    model=model,
)
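
The notebook defines three agents but does not show them chained end to end. As a rough sketch of the intended flow (search, pick the URL, fetch the text, summarize), the steps can be composed as plain callables. Everything here is an assumption for illustration: `run_pipeline` and its parameter names are hypothetical, and stub lambdas stand in for the real search, download, and LLM calls so the example runs without Ollama or network access.

```python
from typing import Callable, Optional


def run_pipeline(
    topic: str,
    search: Callable[[str], str],
    find_url: Callable[[str], Optional[str]],
    fetch: Callable[[str], str],
    summarize: Callable[[str], str],
) -> str:
    """Chain the four steps; each argument is a plain callable (hypothetical names)."""
    listing = search(topic)  # e.g. get_url_topic
    url = find_url(listing)  # e.g. the URL-extraction step
    if url is None:
        return f"No paper URL found for topic: {topic}"
    return summarize(fetch(url))  # e.g. extract_content, then the summary agent


# Stub callables stand in for the real search, download, and LLM calls.
demo = run_pipeline(
    "ChunkRag",
    search=lambda t: f"Title:{t}\nURL:http://example.org/paper.pdf",
    find_url=lambda text: text.split("URL:")[1].strip() if "URL:" in text else None,
    fetch=lambda url: f"text of {url}",
    summarize=lambda text: text.upper(),
)
print(demo)
```

In the real notebook each stub would be replaced by a call such as `client.run(agent=..., messages=[...])`, with the last message's content passed to the next step.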

## Test Summary Agent

In [56]:
content[:50]

'Where’s the Bug? Attention Probing for Scalable Fa'
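
The extracted text of a full paper can be much longer than the context window of a small local model, in which case the model may only see part of the input. One hedged mitigation is to cap the text before sending it. The sketch below is an assumption, not part of the original notebook: `truncate_for_prompt` is a hypothetical helper, and the 8000-character default is an arbitrary placeholder that should be tuned to the model actually served by Ollama.

```python
def truncate_for_prompt(text: str, max_chars: int = 8000) -> str:
    """Cut `text` to at most `max_chars` characters, preferring a word boundary."""
    if len(text) <= max_chars:
        return text
    # Back up to the last space before the limit so a word is not split in half.
    cut = text.rfind(" ", 0, max_chars)
    if cut <= 0:
        cut = max_chars  # no space found: fall back to a hard cut
    return text[:cut].rstrip() + " [truncated]"


long_text = "word " * 5000
short = truncate_for_prompt(long_text, max_chars=100)
print(len(long_text), len(short))
```

The agent call could then be given `truncate_for_prompt(content)` instead of the raw `content`. Character counts are only a rough proxy for tokens, so a token-based limit would be more precise.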

In [57]:
summary_response = client.run(
    agent=content_agent, messages=[{"role": "user", "content": content}]
)

print(summary_response.messages[-1]["content"])

I understand that you're providing additional information and results related to your study on feedback loop (FL) error analysis using prompting methods for natural language processing. Here's a refined version with some formatting adjustments and clarity improvements:

---

### Supplementary Information

#### B. Additional Results

##### B.1 Error Bars for FL Results

Error bars are typically used to visualize the variability or uncertainty around the mean of results. For the FL error analysis presented in Table 2, the following table provides additional information on error margins:

| Metric | Prompting Method (Accuracy) |
|--------|--------------------------------|
| D4J    | 0.144 ± X                     |
| GH-Py  |                                  | 
| N/A    |                                  |  
| GH-J   | 0.375 (N/A, No result available)|  
| DeepFix|R/No data                      |
| TSSB   | R/R                            |
| MS4J   | 0.169                          |  
| Ju