## Agentic Research Assistant

### Part 1
The application enables users to research a specific topic using arXiv. The user enters a research topic, which triggers an agent to search arXiv, retrieve relevant papers, extract their text, convert the text to embeddings, and store the embeddings in a vector database. Once indexing is complete, the system informs the user.
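
To make the chunking step that precedes embedding more concrete, here is a minimal sketch of fixed-size splitting with overlap. It is a simplified stand-in, not the algorithm used by `RecursiveCharacterTextSplitter` (which also tries to split on separators such as paragraph breaks). The helper name `chunk_text` and the toy sizes in the usage line are assumptions for illustration.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: consecutive chunks share `chunk_overlap` characters,
    so a sentence cut at a boundary still appears whole in one of them.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Toy example with tiny sizes so the overlap is easy to see
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk is then embedded and stored separately, which is why a larger overlap increases the number of vectors that need to be indexed.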

The second part lets users query the indexed knowledge base by entering research questions. The agent retrieves relevant passages using semantic search, generates a response grounded in those papers, and cites the original sources with links and metadata.
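
The retrieval-and-citation flow can be illustrated without any external service. The sketch below ranks a few hand-made vectors by cosine similarity and then collapses the retrieved chunks into one citation per paper. The vectors, titles, and `example.org` URLs are made up, and `top_k` and `format_sources` are hypothetical helper names; in the notebook itself the vector database performs the search.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, records, k=3):
    """Return the metadata of the k records most similar to the query vector.

    `records` is a list of (vector, metadata) pairs, standing in for the index.
    """
    ranked = sorted(records, key=lambda rec: cosine(query_vec, rec[0]), reverse=True)
    return [meta for _, meta in ranked[:k]]


def format_sources(metadatas):
    """Build one markdown link per paper, keeping first-seen order."""
    seen = set()
    lines = []
    for meta in metadatas:
        url = meta.get("url")
        if not url or url in seen:
            continue  # several chunks often come from the same paper
        seen.add(url)
        lines.append(f"- [{meta.get('title') or 'Untitled'}]({url})")
    return lines


# Made-up 2-dimensional "embeddings"; real embeddings have many more dimensions
records = [
    ([1.0, 0.0], {"title": "Paper A", "url": "https://example.org/a"}),
    ([0.9, 0.1], {"title": "Paper A", "url": "https://example.org/a"}),
    ([0.0, 1.0], {"title": "Paper B", "url": "https://example.org/b"}),
]

hits = top_k([1.0, 0.0], records, k=3)
for line in format_sources(hits):
    print(line)
# - [Paper A](https://example.org/a)
# - [Paper B](https://example.org/b)
```

De-duplicating by URL is a design choice: it keeps the source list short when several of the top chunks belong to the same paper.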

In [1]:
import os
from dotenv import load_dotenv
import arxiv
import requests
from PyPDF2 import PdfReader
from io import BytesIO
from tqdm import tqdm

# LangChain + OpenAI
from langchain_openai import AzureOpenAIEmbeddings, AzureChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_core.documents import Document

# Pinecone v4 + LangChain Pinecone VectorStore
from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore


load_dotenv()

AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
PINECONE_INDEX_NAME = os.getenv("PINECONE_INDEX_NAME", "arxiv")

# Azure OpenAI LLM + Embeddings
llm = AzureChatOpenAI(
    azure_deployment="gpt-4.1",
    temperature=0.2,
    max_tokens=1000
)

embeddings_model = AzureOpenAIEmbeddings(
    azure_deployment="text-embedding-3-small"
)

#  Pinecone v4 Initialization
pc = Pinecone(api_key=PINECONE_API_KEY)

# Create the index if it doesn't exist yet
if PINECONE_INDEX_NAME not in pc.list_indexes().names():
    pc.create_index(
        name=PINECONE_INDEX_NAME,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )

index = pc.Index(PINECONE_INDEX_NAME)
print("Connected to Pinecone:", PINECONE_INDEX_NAME)

# Fetch and parse arXiv PDF data
def fetch_arxiv_papers(query, max_results=5):
    search = arxiv.Search(
        query=query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.Relevance
    )
    client = arxiv.Client()

    papers = []
    for result in client.results(search):
        try:
            pdf_url = result.pdf_url
            response = requests.get(pdf_url, timeout=20)
            response.raise_for_status()

            pdf_file = BytesIO(response.content)
            reader = PdfReader(pdf_file)
            text = ""

            for page in reader.pages:
                page_text = page.extract_text()
                if page_text:
                    text += page_text + "\n"

            papers.append({
                "id": result.entry_id,
                "title": result.title,
                "abstract": result.summary,
                "url": result.entry_id,
                "pdf_text": text,
            })
        except Exception as e:
            print(f"Skipping {result.title[:50]} due to error: {e}")

    print(f"Retrieved {len(papers)} papers for query '{query}'")
    return papers

# Chunk + Embed + Upsert into Pinecone
def index_papers(papers):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200
    )

    vectors = []

    for paper in tqdm(papers, desc="Indexing papers"):
        text = paper.get("pdf_text")
        if not text:
            continue

        docs = splitter.split_documents([
            Document(page_content=text, metadata=paper)
        ])

        for idx, d in enumerate(docs):
            embedding = embeddings_model.embed_query(d.page_content)
            chunk_id = f"{paper['id']}_chunk{idx}"

            metadata = {
                "title": paper["title"],
                "url": paper["url"],
                "text": d.page_content,
                "chunk_id": idx
            }

            vectors.append((chunk_id, embedding, metadata))

    if vectors:
        index.upsert(vectors=vectors)
        print(f"Indexed {len(vectors)} chunks!")
    else:
        print("No data to upsert!")
        
### Part 2
# VectorStore for Retrieval
vectorstore = PineconeVectorStore.from_existing_index(
    index_name=PINECONE_INDEX_NAME,
    embedding=embeddings_model
)

# RAG Query System
def retrieve_context(query, k=5):
    docs = vectorstore.similarity_search(query, k=k)
    context = "\n\n".join([d.page_content for d in docs])
    # Build markdown links; no spaces inside the parentheses so the links render correctly
    sources = [f"- [{d.metadata.get('title')}]({d.metadata.get('url')})" for d in docs]
    return context, sources

def research_qa(query):
    context, sources = retrieve_context(query)

    system_prompt = (
        "You are a helpful AI research assistant. "
        "Answer based ONLY on the provided excerpts from papers. "
        "Cite your sources at the end using markdown links. "
        "If the context lacks an answer, say so clearly."
    )

    messages = [
        SystemMessage(content=system_prompt),
        HumanMessage(content=f"Question:\n{query}\n\nContext:\n{context}")
    ]

    response = llm.invoke(messages)
    print("\nAnswer:\n", response.content)
    print("\nSources:")
    for s in sources:
        print(s)


topic = input("Enter research topic: ")
papers = fetch_arxiv_papers(topic)
index_papers(papers)

question = input("\nEnter your research question: ")
research_qa(question)


Connected to Pinecone: arxiv
Retrieved 5 papers for query 'generative AI'


Indexing papers: 100%|██████████| 5/5 [01:05<00:00, 13.14s/it]


Indexed 346 chunks!


Found document with no `text` key. Skipping.



Answer:
 Emerging trends in generative AI include:

1. **Monetization and Open Source Dynamics**: Developers are increasingly monetizing generative AI systems, even when releasing open-source versions. These open-source releases often come with restrictions or conditions to enable monetization, reflecting broader industry disputes about openness and control (e.g., the European Commission’s Google Android case) [Vincent, 2023; Dastin et al., 2023].

2. **Integrated Services**: There is a notable influx of integrated generative AI services. These include search engines incorporating large language models (LLMs), personal assistants, note-taking and editing tools, creative task automation, video-editing applications, and generative AI-augmented search. This trend points to generative AI being embedded across a wide array of digital products and workflows [Reid, 2023].

3. **Advanced Information Access**: Generative AI models are distinguished by their ability to generate complex, high-qu

### Part 2: Agentic Research Assistant

Extend Part 1 by merging the two separate applications into a single unified system in which an LLM-driven agent decides, based on user intent, whether the user is in an indexing phase or a query phase. The agent should manage state transitions explicitly: it informs the user when indexing is complete and they can begin asking questions, and it detects when a user wants to start a new research topic. The system maintains conversational context and guides the user through natural transition points, clearly communicating what can be done next, so the research workflow does not require separate interfaces.
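
The explicit state transitions described above can be sketched as a small state machine that is independent of the LLM. This is only an illustration under stated assumptions: `Session`, `parse_intent`, and `step` are hypothetical names, the raw classifier reply is passed in as a plain string instead of coming from a model call, and indexing and answering are replaced by placeholder messages.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    """Conversation state: the active topic and whether it has been indexed."""
    topic: Optional[str] = None
    indexed: bool = False


def parse_intent(raw_reply: str) -> str:
    """Normalize a classifier reply to 'index' or 'query', defaulting to 'query'."""
    cleaned = raw_reply.strip().lower().strip(".'\"")
    return "index" if cleaned == "index" else "query"


def step(session: Session, user_input: str, raw_reply: str) -> str:
    """Advance the session by one turn and return the message shown to the user."""
    # With no topic yet, any input is treated as a topic to index
    intent = "index" if session.topic is None else parse_intent(raw_reply)

    if intent == "index":
        # A real system would fetch and index papers here before updating state
        session.topic = user_input
        session.indexed = True
        return (f"Indexing complete for '{user_input}'. "
                "You can now ask questions or start a new topic.")

    if not session.indexed:
        return "Nothing is indexed yet. Please enter a research topic first."

    # A real system would run retrieval and generation here
    return f"(answer to '{user_input}' grounded in papers on '{session.topic}')"


session = Session()
print(step(session, "graph neural networks", ""))
print(step(session, "What are the main architectures?", "query"))
print(step(session, "diffusion models", "Index."))
```

Keeping the state in one object makes each transition easy to test without calling the model or the vector database.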

In [3]:
import os
from dotenv import load_dotenv
import arxiv
import requests
from PyPDF2 import PdfReader
from io import BytesIO
from tqdm import tqdm

from langchain_openai import AzureOpenAIEmbeddings, AzureChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_core.documents import Document

from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore

# env & clients
load_dotenv()
PINECONE_INDEX_NAME = os.getenv("PINECONE_INDEX_NAME", "arxiv")

llm = AzureChatOpenAI(azure_deployment="gpt-4.1", temperature=0.2, max_tokens=1000)
embeddings_model = AzureOpenAIEmbeddings(azure_deployment="text-embedding-3-small")
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

if PINECONE_INDEX_NAME not in pc.list_indexes().names():
    pc.create_index(name=PINECONE_INDEX_NAME, dimension=1536, metric="cosine",
                    spec=ServerlessSpec(cloud="aws", region="us-east-1"))

index = pc.Index(PINECONE_INDEX_NAME)
vectorstore = PineconeVectorStore.from_existing_index(index_name=PINECONE_INDEX_NAME, embedding=embeddings_model)


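# Parameters chosen to keep requests and metadata within Pinecone size limits
MAX_CHUNK_SIZE = 500   # characters per chunk
CHUNK_OVERLAP = 100    # characters shared by consecutive chunks
BATCH_SIZE = 50        # vectors sent per upsert request

# Note: the two constants below are defined but not referenced elsewhere in this cell
MAX_PINECONE_METADATA_SIZE = 3_500_000  # bytes (3.5MB, safe below 4MB limit)
MAX_METADATA_TEXT = 500  # truncate metadata text to 500 chars (safe under 40KB)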

def split_large_chunk(text, initial_chunk_size=MAX_CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=initial_chunk_size,
        chunk_overlap=chunk_overlap
    )
    return splitter.split_text(text)

# fetching and indexing data
def fetch_and_index(topic, max_results=3):
    search = arxiv.Search(query=topic, max_results=max_results, sort_by=arxiv.SortCriterion.Relevance)
    client = arxiv.Client()
    papers = []

    print(f"\n Fetching {max_results} papers for topic: '{topic}'")
    results = list(client.results(search))
    for r in tqdm(results, desc="Downloading PDFs"):
        try:
            pdf = requests.get(r.pdf_url, timeout=20)
            pdf.raise_for_status()
            reader = PdfReader(BytesIO(pdf.content))
            text = "".join(page.extract_text() or "" for page in reader.pages)
            if not text.strip():
                continue
            papers.append({"id": r.entry_id, "title": r.title, "url": r.entry_id, "pdf_text": text})
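        except Exception as e:
            # Skip papers whose PDF cannot be downloaded or parsed, but report why
            tqdm.write(f"Skipping '{r.title[:50]}' due to error: {e}")
            continue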

    vectors = []
    print("\n Creating embeddings and preparing for indexing...")

    for paper in tqdm(papers, desc="Processing papers"):
        base_chunks = split_large_chunk(paper["pdf_text"], initial_chunk_size=MAX_CHUNK_SIZE)
        for idx, chunk in enumerate(base_chunks):
            embedding = embeddings_model.embed_query(chunk)
            metadata = {
                "title": paper["title"],
                "url": paper["url"],
                "chunk_id": idx,
                "text": chunk
            }
            vectors.append((f"{paper['id']}_chunk{idx}", embedding, metadata))

    print("\n Uploading chunks to Pinecone...")
    for i in tqdm(range(0, len(vectors), BATCH_SIZE), desc="Indexing chunks"):
        batch = vectors[i:i+BATCH_SIZE]
        index.upsert(vectors=batch)

    print(f"\n Indexed {len(vectors)} chunks from {len(papers)} papers.")
    return papers

# retrieve context and answer questions
def answer_question(query):
    print(f"\n Your question: {query}\n")
    docs = vectorstore.similarity_search(query, k=5)
    context = "\n\n".join(d.page_content for d in docs)
    sources = [f"- [{d.metadata.get('title')}]({d.metadata.get('url')}) (chunk {d.metadata.get('chunk_id',0)})" for d in docs]

    system_prompt = (
        "You are a helpful AI research assistant. Answer ONLY using the excerpts provided. "
        "Cite sources at the end. If the context doesn't answer, say so clearly."
    )

    messages = [SystemMessage(content=system_prompt),
                HumanMessage(content=f"Question:\n{query}\n\nContext:\n{context}")]
    response = llm.invoke(messages)
    print("\nAnswer:\n", response.content)
    print("\nSources:")
    for s in sources:
        print(s)

# intent classifier to decide indexing vs querying
def classify_intent(user_input, current_topic):
    """
    Decide if the user wants to 'index' a new topic or 'query' the current topic.
    """
    if not current_topic:
        return "index"

    prompt = (
        f"Current topic: {current_topic}\n"
        f"User input: \"{user_input}\"\n"
        "Decide the intent: 'index' if user wants a new topic, 'query' if asking a question. "
        "Respond with only 'index' or 'query'."
    )
    result = llm.invoke([HumanMessage(content=prompt)])
    intent = result.content.strip().lower().replace(".", "")
    return "index" if intent == "index" else "query"

# main loop
print(" Hello! Enter a research topic to fetch papers.")
print("Type 'reset' for a new topic, 'quit' to exit.")

current_topic = None
indexed = False

while True:
    # Dynamic instructions for users to decide next action
    if current_topic and indexed:
        prompt_text = "Ask a question or type a new topic to fetch papers: "
    else:
        prompt_text = "Type a research topic to fetch papers (or 'quit' to exit): "

    user_input = input(f"\nYou: {prompt_text}").strip()

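    if not user_input:
        # Ignore empty input instead of sending it to the classifier
        continue

    if user_input.lower() in ["quit", "exit"]:
        print("Goodbye! ")
        break

    # 'reset' is advertised in the greeting, so handle it before the classifier
    if user_input.lower() == "reset":
        current_topic = None
        indexed = False
        print("\n Topic cleared. Enter a new research topic.")
        continue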

    # Model detects if this is a new topic or a query
    intent = classify_intent(user_input, current_topic)

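    if intent == "index":
        # User wants to index a new topic
        papers = fetch_and_index(user_input)
        if not papers:
            # Nothing was retrieved, so keep the previous state and ask again
            print(f"\n No papers could be indexed for '{user_input}'. Please try another topic.")
            continue
        current_topic = user_input
        indexed = True
        print(f"\n Indexed papers for '{current_topic}'. You can now ask questions or type a new topic.")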
    else:
        # User is asking a question
        if not indexed:
            print("\n No research indexed yet! Please provide a topic first.")
            continue
        answer_question(user_input)



 Hello! Enter a research topic to fetch papers.
Type 'reset' for a new topic, 'quit' to exit.

 Fetching 3 papers for topic: 'machine learning'


Downloading PDFs: 100%|██████████| 3/3 [00:13<00:00,  4.62s/it]



 Creating embeddings and preparing for indexing...


Processing papers: 100%|██████████| 3/3 [02:14<00:00, 44.74s/it]



 Uploading chunks to Pinecone...


Indexing chunks: 100%|██████████| 15/15 [00:30<00:00,  2.05s/it]


 Indexed 704 chunks from 3 papers.

 Indexed papers for 'machine learning'. You can now ask questions or type a new topic.






 Your question: What are emerging trends in machine learning?


Answer:
 Emerging trends in machine learning include:

1. **Analysis of Massive and Complex Datasets**: The increasing availability of open and large data sources has enabled machine learning to analyze massive datasets, identify complex patterns and relationships, and perform non-linear forecasting, which traditional survey methods may not capture [1,2].

2. **Applications in Diverse Fields**: Machine learning is being applied in areas such as autonomous driving, protein structure prediction, and medicine, demonstrating its transformative impact across disciplines [He et al., 2019; Grigorescu et al., 2020; Jumper].

3. **Recurring Concept Drift in Data Streams**: There is growing interest in handling recurring concept drift in data streams, which refers to changes in data distributions over time. This is important for applications where data evolves, requiring models to adapt continuously [Suárez-Cetrulo et al., 2023].

