
A User Agent receives the query,

It constructs a custom arXiv API URL,

Downloads and parses the XML,

Extracts summary and metadata,

Converts metadata into IEEE citation format, and

Displays results in a clean table.

The user can then download specific papers.

RAG is then run over a downloaded paper to answer questions about it.
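
The parse-and-extract steps listed above can be sketched on a tiny feed without touching the network. The feed fragment below is hand-written and hypothetical (including the ID `0000.00000v1` and the author name); real arXiv responses carry more fields. `summarize_feed` is an illustrative name, not a function used elsewhere in this notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written Atom fragment for illustration only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <published>2024-01-15T12:00:00Z</published>
    <title>An Example
      Title</title>
    <summary>A short example abstract.</summary>
    <author><name>Ada Lovelace</name></author>
  </entry>
</feed>"""

ATOM = {"atom": "http://www.w3.org/2005/Atom"}

def summarize_feed(xml_text):
    """Return one dict per <entry> with its ID, title, and author names."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.findall("atom:entry", ATOM):
        rows.append({
            # Keep only the part after "/abs/" as the paper ID.
            "id": entry.findtext("atom:id", namespaces=ATOM).rsplit("/abs/", 1)[-1],
            # Collapse the line break and indentation inside the title.
            "title": " ".join(entry.findtext("atom:title", namespaces=ATOM).split()),
            "authors": [a.findtext("atom:name", namespaces=ATOM)
                        for a in entry.findall("atom:author", ATOM)],
        })
    return rows

print(summarize_feed(SAMPLE_FEED))
```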

In [None]:
!pip install langchain langchain_groq


In [4]:
from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableMap
import os
from google.colab import userdata

# The Groq API key is read from Colab's secret store
llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0.2, api_key=userdata.get('GROQ_API_KEY'))


In [11]:
system_prompt = """
You are a helpful assistant specialized in generating correct arXiv API query URLs.

Refer to the official arXiv API documentation:
https://info.arxiv.org/help/api/user-manual.html#5-appendices

Your task is:
- Read the user's natural language research query.
- Extract the meaningful search intent.
- Convert it into a valid arXiv API query URL (as per the manual).
- Use fields like `ti:` (title), `au:` (author), `cat:` (category), `submittedDate:`, etc., based on context.
- Return only the final URL string starting with "https://export.arxiv.org/api/query?search_query=..."

Example Input: "AI in healthcare research paper for last 3 years"
Expected Output: https://export.arxiv.org/api/query?search_query=ti:AI+AND+ti:healthcare+AND+submittedDate:[202106010600+TO+202506010600]

Another Input: "Research paper of Artificial Intelligence domain"
Output: https://export.arxiv.org/api/query?search_query=cat:cs.AI

Now handle this new query:
"""
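
The example dates in the prompt are fixed strings, so the model has to guess what "last 3 years" means relative to today. As a point of comparison, here is a minimal sketch that builds the same kind of URL deterministically. `date_window` and `build_query_url` are hypothetical helpers, and the sketch assumes single-word title terms (multi-word phrases would need quoting and URL encoding).

```python
from datetime import datetime, timezone

BASE_URL = "https://export.arxiv.org/api/query?search_query="

def date_window(years, now=None):
    """Return a submittedDate range covering the last `years` years."""
    now = now or datetime.now(timezone.utc)
    try:
        start = now.replace(year=now.year - years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; fall back to the 28th.
        start = now.replace(year=now.year - years, day=28)
    fmt = "%Y%m%d%H%M"  # YYYYMMDDHHMM, the format used in the prompt examples
    return f"[{start.strftime(fmt)}+TO+{now.strftime(fmt)}]"

def build_query_url(title_terms, years=None, now=None):
    """Join title terms (and an optional date window) with AND."""
    clauses = [f"ti:{term}" for term in title_terms]
    if years is not None:
        clauses.append("submittedDate:" + date_window(years, now))
    return BASE_URL + "+AND+".join(clauses)

print(build_query_url(["AI", "healthcare"], years=3, now=datetime(2025, 6, 1, 6, 0)))
# https://export.arxiv.org/api/query?search_query=ti:AI+AND+ti:healthcare+AND+submittedDate:[202206010600+TO+202506010600]
```

A helper like this could also be used to inject today's date into the prompt instead of replacing the LLM step.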


In [12]:
from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["user_query"],
    template=system_prompt + "{user_query}"
)

chain = prompt | llm

In [13]:
user_query = "AI in healthcare research paper for last 3 years"
response = chain.invoke({"user_query": user_query})
print("Generated URL:", response.content)

Generated URL: https://export.arxiv.org/api/query?search_query=ti:AI+AND+ti:healthcare+AND+submittedDate:[202106010600+TO+202506010600]


In [14]:
user_query = "AI in Agriculture research paper for last 5 years"
response = chain.invoke({"user_query": user_query})
print("Generated URL:", response.content)

Generated URL: To construct the arXiv API query URL for the given search query, we need to break down the query into its components and map them to the appropriate arXiv API search fields.

The search query is: "AI in Agriculture research paper for last 5 years"

1. "AI" and "Agriculture" are the keywords that should appear in the title (`ti:`) of the research papers.
2. The time frame is the last 5 years, which means we need to calculate the date range for the `submittedDate:` field.

Assuming the current year is 2024 (for calculation purposes), the last 5 years would be from 2019 to 2024. However, since the task requires a precise date format, let's consider the start date as June 1, 2019, and the end date as June 1, 2024, for simplicity. The date format required is `YYYYMMDDHHMM`.

- Start date (2019): 201906010600
- End date (2024): 202406010600

Now, let's construct the search query:

- `ti:AI` for AI in the title
- `AND` to combine conditions
- `ti:Agriculture` for Agriculture in

In [15]:
response.content.find('https')

1150

In [16]:
user_query = "Blockchain in Voting research paper for last 3 years"
response = chain.invoke({"user_query": user_query})
print("Generated URL:", response.content)

Generated URL: To construct the arXiv API query URL for the given search query, we need to break down the intent into its components and map them to the appropriate arXiv search fields.

The search query is: "Blockchain in Voting research paper for last 3 years"

1. **Blockchain** and **Voting** are the key topics, which should be searched within the title (`ti:`) of the papers.
2. The time frame is the **last 3 years**. Given that the current year is 2025 (as of the today date provided, 23 June 2025), we calculate the start of the period as 2022. Thus, the submitted date range should be from 2022 to 2025.

Now, let's construct the query:

- **Title Search**: `ti:Blockchain+AND+ti:Voting`
- **Date Range**: Assuming the start date is January 1, 2022, and the end date is June 23, 2025 (to include papers up to the current date), the date format for arXiv API is `YYYYMMDDHHMM`. Therefore, the date range is `[202201010600+TO+202506231600]`.

Combining these elements into a single search que

In [17]:
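import re

# Take the last URL in the response. Backticks are excluded so that Markdown
# code formatting around the URL is not captured along with it.
urls = re.findall(r'https?://[^\s`]+', response.content)
arxiv_query = urls[-1] if urls else None

print(arxiv_query)

https://export.arxiv.org/api/query?search_query=ti:Blockchain+AND+ti:Voting+AND+submittedDate:[202201010600+TO+202506231600]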
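
Because the model sometimes wraps the URL in explanation, it may help to validate what the regex finds. A slightly more defensive sketch follows: it strips trailing punctuation and only accepts URLs that start with the arXiv query endpoint. `extract_arxiv_url` is a hypothetical helper, and the sample text is made up for the demonstration.

```python
import re

ARXIV_PREFIX = "https://export.arxiv.org/api/query?"

def extract_arxiv_url(text):
    """Return the last arXiv API query URL found in `text`, or None."""
    candidates = re.findall(r"https?://\S+", text)
    # Strip punctuation that commonly trails a URL in prose or Markdown.
    # "]" is deliberately not stripped because date ranges end with it.
    cleaned = [c.rstrip("`'\".,;)") for c in candidates]
    matches = [c for c in cleaned if c.startswith(ARXIV_PREFIX)]
    return matches[-1] if matches else None

sample = (
    "See https://info.arxiv.org/help/api/user-manual.html for details. "
    "Final URL: `https://export.arxiv.org/api/query?search_query="
    "ti:Blockchain+AND+submittedDate:[202201010600+TO+202506231600]`"
)
print(extract_arxiv_url(sample))
```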


Download the XML

In [18]:
import requests

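def fetch_arxiv_data(query_url):
    """Fetch the Atom XML feed for an arXiv API query URL and return it as bytes."""
    # A timeout keeps the notebook from hanging on a stalled connection
    response = requests.get(query_url, timeout=30)
    if response.status_code == 200:
        return response.content
    else:
        raise Exception(f"Failed to fetch: {response.status_code}")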

In [None]:
fetch_arxiv_data(arxiv_query)
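
The generated URL carries no paging parameters, so the API falls back to its default page size. The arXiv API accepts `start` and `max_results` query parameters; the sketch below appends them when they are missing. `with_paging` is a hypothetical helper and the value 25 is an arbitrary choice.

```python
def with_paging(query_url, start=0, max_results=25):
    """Append start/max_results to an arXiv query URL unless already present."""
    if "max_results=" in query_url or "start=" in query_url:
        return query_url  # respect paging the caller (or the LLM) already set
    return f"{query_url}&start={start}&max_results={max_results}"

print(with_paging("https://export.arxiv.org/api/query?search_query=cat:cs.AI"))
# https://export.arxiv.org/api/query?search_query=cat:cs.AI&start=0&max_results=25
```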

Build IEEE-style citations

In [20]:
import xml.etree.ElementTree as ET
from datetime import datetime

def parse_arxiv_entries_with_citation(xml_data):
    root = ET.fromstring(xml_data)
    ns = {'atom': 'http://www.w3.org/2005/Atom'}
    entries = root.findall('atom:entry', ns)

    results = []

    for entry in entries:
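        # Keep everything after "/abs/" so old-style IDs such as "hep-th/9901001v1" stay intact
        arxiv_id = entry.find('atom:id', ns).text.split('/abs/')[-1]
        # Collapse the line breaks and repeated spaces that appear in long titles and abstracts
        title = ' '.join(entry.find('atom:title', ns).text.split())
        summary = ' '.join(entry.find('atom:summary', ns).text.split())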

        # Author extraction
        authors = entry.findall('atom:author', ns)
        author_names = []
        for author in authors:
            name = author.find('atom:name', ns).text
            first, *last = name.split()
            initials = first[0] + "."
            last_name = last[-1] if last else first
            author_names.append(f"{initials} {last_name}")
        author_str = ', '.join(author_names)

        # Published date
        published = entry.find('atom:published', ns).text
        date_obj = datetime.strptime(published, "%Y-%m-%dT%H:%M:%SZ")
        month_year = date_obj.strftime("%b. %Y")

        # IEEE-style citation
        citation = f'{author_str}, “{title.rstrip(".")},” *arXiv preprint arXiv:{arxiv_id}*, {month_year}. [Online]. Available: https://arxiv.org/abs/{arxiv_id}'

        results.append({
            "Title": title,
            "Summary": summary,
            "arXiv ID": arxiv_id,
            "Citation": citation
        })

    return results
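
The author loop above keeps only the first initial and the last word of each name, and joins authors with commas. A variant that keeps middle initials and places "and" before the final author is sketched below. The exact punctuation is an assumption about IEEE style and should be checked against the IEEE reference guide; `abbreviate` and `join_ieee_authors` are hypothetical helpers, and the names are placeholders.

```python
def abbreviate(name):
    """Turn 'Alan Mathison Turing' into 'A. M. Turing'."""
    parts = name.split()
    if len(parts) <= 1:
        return name.strip()  # single-word names are left unchanged
    initials = " ".join(p[0] + "." for p in parts[:-1])
    return f"{initials} {parts[-1]}"

def join_ieee_authors(names):
    """Join abbreviated names, with 'and' before the last author."""
    short = [abbreviate(n) for n in names]
    if len(short) <= 2:
        return " and ".join(short)
    return ", ".join(short[:-1]) + ", and " + short[-1]

print(join_ieee_authors(["Ada Lovelace", "Alan Mathison Turing", "Grace Hopper"]))
# A. Lovelace, A. M. Turing, and G. Hopper
```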


In [21]:

xml_data = fetch_arxiv_data(arxiv_query)

papers = parse_arxiv_entries_with_citation(xml_data)

import pandas as pd
df = pd.DataFrame(papers)
df.head()


,Title,Summary,arXiv ID,Citation
0,Anonymous voting scheme using quantum assisted...,Voting forms the most important tool for arriv...,2206.03182v1,"S. Mishra, K. Thapliyal, S. Rewanth, A. Parakh..."
1,SBvote: Scalable Self-Tallying Blockchain-Base...,Decentralized electronic voting solutions repr...,2206.06019v1,"I. Stančíková, I. Homoliak, “SBvote: Scalable ..."
2,A Blockchain-based Electronic Voting System: E...,The development of an electronic voting system...,2307.10726v1,"A. Spanos, I. Kantzavelou, “A Blockchain-based..."
3,DeepThought: a Reputation and Voting-based Blo...,Thanks to built-in immutability and persistenc...,2209.11032v2,"M. Gennaro, L. Italiano, G. Meroni, G. Quattro..."
4,Voting Participation and Engagement in Blockch...,This paper investigates the potential of block...,2404.08906v1,"L. Ante, A. Saggu, B. Schellinger, F. Wazinksi..."


In [22]:
df

,Title,Summary,arXiv ID,Citation
0,Anonymous voting scheme using quantum assisted...,Voting forms the most important tool for arriv...,2206.03182v1,"S. Mishra, K. Thapliyal, S. Rewanth, A. Parakh..."
1,SBvote: Scalable Self-Tallying Blockchain-Base...,Decentralized electronic voting solutions repr...,2206.06019v1,"I. Stančíková, I. Homoliak, “SBvote: Scalable ..."
2,A Blockchain-based Electronic Voting System: E...,The development of an electronic voting system...,2307.10726v1,"A. Spanos, I. Kantzavelou, “A Blockchain-based..."
3,DeepThought: a Reputation and Voting-based Blo...,Thanks to built-in immutability and persistenc...,2209.11032v2,"M. Gennaro, L. Italiano, G. Meroni, G. Quattro..."
4,Voting Participation and Engagement in Blockch...,This paper investigates the potential of block...,2404.08906v1,"L. Ante, A. Saggu, B. Schellinger, F. Wazinksi..."
5,Effects of Vote Delegation in Blockchains: Who...,This paper investigates which alternative bene...,2408.05410v1,"H. Gersbach, M. Schneider, P. Shahkar, “Effect..."
6,Validated Strong Consensus Protocol for Asynch...,Vote-based blockchains construct a state machi...,2409.08161v2,"Y. Xu, J. Shao, T. Slaats, B. Düdder, Y. Zhou,..."
7,Blockchain-based decentralized voting system s...,This research study focuses primarily on Block...,2303.06306v1,"J. Singh, U. Rastogi, Y. Goel, B. Gupta, U. Ut..."
8,"ElectAnon: A Blockchain-Based, Anonymous, Robu...",Remote voting has become more critical in rece...,2204.00057v2,"C. Onur, A. Yurdakul, “ElectAnon: A Blockchain..."
9,Understanding Blockchain Governance: Analyzing...,Smart contracts are contractual agreements bet...,2305.17655v3,"J. Messias, V. Pahari, B. Chandrasekaran, K. G..."


In [33]:
# from google.colab import sheets
# sheet = sheets.InteractiveSheet(df=df)
# # Use to create Google Sheet

Download the PDF of a research paper

In [None]:
!pip install arxiv

In [30]:
import arxiv
import os

# List of arXiv paper IDs (you can add more here)
# list1 = [
#     "1605.08386v1",
#     "2401.12981v1",
#     "2506.03188v1"
# ]
list1 = list(df['arXiv ID'])
# Directory to store downloads
download_dir = "./mydir"
os.makedirs(download_dir, exist_ok=True)

# Initialize arXiv client
client = arxiv.Client()

# Download only the first paper; widen the slice (e.g. list1[0:3]) to fetch more
for paper_id in list1[0:1]:
    try:
        # Fetch the paper
        paper = next(client.results(arxiv.Search(id_list=[paper_id])))

        # Create a unique filename per paper (e.g., use arXiv ID)
        filename = f"{paper_id.replace('/', '_')}.pdf"

        # Download the paper PDF
        paper.download_pdf(dirpath=download_dir, filename=filename)

        print(f"Downloaded {paper_id} to {os.path.join(download_dir, filename)}")
    except Exception as e:
        print(f"Failed to download {paper_id}: {e}")


Downloaded 2206.03182v1 to ./mydir/2206.03182v1.pdf


Now run RAG over the research paper

The user asks to simplify or explain a paper, referring to it by its position in the table, its topic, or its paper ID. Then:
Download the paper
Feed it to the LLM
Let the conversation happen

QA Agent: Verifies relevance of content.
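
The retrieval half of RAG can be illustrated without any model: split the paper into chunks, score each chunk against the question, and pass only the best ones to the LLM. The toy sketch below uses word-count vectors and cosine similarity in place of the sentence-transformer embeddings and FAISS index used later in this notebook. All names and the sample chunks are hypothetical.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lower-case word counts, standing in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, k=1):
    """Return the k chunks most similar to the question."""
    q = bag_of_words(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:k]

chunks = [
    "The scheme uses a quantum blockchain for voting.",
    "Related work covers paper ballots.",
    "We thank the reviewers.",
]
print(retrieve("How does the quantum blockchain work?", chunks))
# ['The scheme uses a quantum blockchain for voting.']
```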

In [None]:
!pip install pypdf langchain langchain-community sentence-transformers faiss-cpu


In [5]:
import os
import requests
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
# Embeddings and vector stores live in langchain_community in recent releases
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# Groq-hosted chat model, consistent with the URL-generation step
from langchain_groq import ChatGroq



In [None]:

# Setup LLM (same as before)
llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0.3, api_key=userdata.get('GROQ_API_KEY'))
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [42]:
import requests

import os
import arxiv
def get_or_download_paper(paper_id, folder="./mydir"):
    os.makedirs(folder, exist_ok=True)
    filename = f"{paper_id.replace('/', '_')}.pdf"
    file_path = os.path.join(folder, filename)

    if os.path.exists(file_path):
        print(f"✅ Found cached PDF for paper ID: {paper_id}")
    else:
        print(f"⬇️ Downloading paper {paper_id} from arXiv using official client...")
        try:
            client = arxiv.Client()
            search = arxiv.Search(id_list=[paper_id])
            paper = next(client.results(search))
            paper.download_pdf(dirpath=folder, filename=filename)
            print(f"✅ Downloaded {paper_id} to {file_path}")
        except Exception as e:
            raise RuntimeError(f"❌ Failed to download {paper_id}: {e}")

    return file_path


In [43]:
def build_rag_chain(pdf_path):
    loader = PyPDFLoader(pdf_path)
    pages = loader.load()

    splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    docs = splitter.split_documents(pages)

    vectordb = FAISS.from_documents(docs, embedding)
    retriever = vectordb.as_retriever()

    qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
    return qa
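
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-offset sliding window. This is an illustration of what the two parameters mean, not LangChain's algorithm: `CharacterTextSplitter` splits on a separator and merges the pieces, so its chunk boundaries will differ. `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of indexing some text twice.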

In [44]:
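def answer_user_question(paper_id, question, folder="./mydir"):
    """Download the paper if needed, index it, and answer one question about it."""
    pdf_path = get_or_download_paper(paper_id, folder=folder)
    rag_chain = build_rag_chain(pdf_path)
    # RetrievalQA takes the question under "query" and returns the answer under "result"
    answer = rag_chain.invoke({"query": question})["result"]
    return answer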
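
As written, every question re-loads the PDF and rebuilds the vector index. For a multi-turn conversation about one paper, the chain could be built once and reused. The sketch below shows the idea with a plain dictionary. `get_cached_chain` and `_chain_cache` are hypothetical names, and the builder is passed in so the example runs without LangChain.

```python
_chain_cache = {}  # paper_id -> chain object built for that paper

def get_cached_chain(paper_id, build):
    """Return the chain for paper_id, calling build(paper_id) only on first use."""
    if paper_id not in _chain_cache:
        _chain_cache[paper_id] = build(paper_id)
    return _chain_cache[paper_id]

# Stand-in builder for the demonstration; it records how often it is called.
calls = []
def fake_build(paper_id):
    calls.append(paper_id)
    return f"chain-for-{paper_id}"

get_cached_chain("2206.03182v1", fake_build)
get_cached_chain("2206.03182v1", fake_build)
print(calls)  # ['2206.03182v1'], the second call hit the cache
```

In this notebook the builder could be something like `lambda pid: build_rag_chain(get_or_download_paper(pid))`.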

In [45]:
paper_id = "2206.03182v1"  # Can be extracted from table or user message
question = "What is the core contribution of this paper?"

response = answer_user_question(paper_id, question)
print("🤖", response)



✅ Found cached PDF for paper ID: 2206.03182v1
🤖 The core contribution of this paper appears to be the proposal of an anonymous voting scheme using quantum-assisted blockchain technology. The paper aims to address the limitations of traditional electronic voting systems and provide a secure, verifiable, and auditable voting process using the principles of quantum mechanics and blockchain technology. The proposed scheme utilizes a permissioned quantum blockchain to ensure the integrity and security of the voting process, and it outlines the roles and responsibilities of various stakeholders, including the voting authority, miners, and voters.


In [47]:
import os

os.makedirs("./mydir", exist_ok=True)

In [50]:
paper_id = "2409.08161v2"  # Can be extracted from table or user message
question = "Why was Hot stuff excluded from the experiment?"

response = answer_user_question(paper_id, question)
print("🤖", response)



✅ Found cached PDF for paper ID: 2409.08161v2
🤖 HotStuff was excluded from the experiment in the asynchronous network (bad-case latency) because it cannot function in an asynchronous network.
