### App 1
The application lets a user enter a research topic, which triggers an agent to search arXiv, download the relevant papers, extract their text, split it into chunks, embed each chunk, and store the vectors in a Pinecone index. Once indexing is complete, the system informs the user.


In [None]:
# %%capture --no-stderr
# !pip install --quiet -U langchain_openai langchain_core langchain langchain-community langchain_pinecone langchain-text-splitters beautifulsoup4 requests pinecone tabulate python-dotenv tqdm
# !pip install arxiv
# !pip install PyPDF2


In [None]:
import os 
from dotenv import load_dotenv 
from langchain_openai import AzureOpenAIEmbeddings, AzureChatOpenAI 
from langchain_text_splitters import RecursiveCharacterTextSplitter
import langchain 
import arxiv 
import requests 
from PyPDF2 import PdfReader 
from io import BytesIO
from tqdm import tqdm
from pinecone import Pinecone

In [4]:
load_dotenv()

AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
PINECONE_ENV = os.getenv("PINECONE_ENV")
PINECONE_INDEX_NAME = os.getenv("PINECONE_INDEX_NAME", "arxiv")


In [None]:
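pc = Pinecone(api_key=PINECONE_API_KEY)
# Open the configured index instead of a hard-coded name so the message below stays accurate
index = pc.Index(PINECONE_INDEX_NAME)
print("Connected to Pinecone index:", PINECONE_INDEX_NAME, index)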

Connected to Pinecone index: arxiv <pinecone.db_data.index.Index object at 0x000001ECEB820CE0>


In [6]:
llm = AzureChatOpenAI(
    azure_deployment="gpt-4.1",
    temperature=0.2,
    max_tokens=1000
    )

In [11]:
from langchain_openai import AzureOpenAIEmbeddings

# Embeddings
embeddings_model = AzureOpenAIEmbeddings(
    azure_deployment="text-embedding-3-small",
    dimensions=1536
)

In [None]:
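def fetch_arxiv_papers(query, max_results=5):
    """Search arXiv for `query`, download each PDF, and return a list of paper dicts with the extracted text."""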
    search = arxiv.Search(
        query=query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.Relevance
    )
    
    client = arxiv.Client()

    papers = []

    for result in client.results(search):
        try:
            pdf_url = result.pdf_url
            response = requests.get(pdf_url, timeout=20)
            response.raise_for_status()

            pdf_file = BytesIO(response.content)
            reader = PdfReader(pdf_file)
            full_text = ""
            for page in reader.pages:
                page_text = page.extract_text()
                if page_text:
                    full_text += page_text + "\n"

            combined_text = (result.summary or "") + "\n\n" + full_text

            papers.append({
                "id": result.entry_id,
                "title": result.title,
                "abstract": result.summary,
                "url": result.entry_id,
                "pdf_text": full_text,
                "text": combined_text[:2000],
            })
        except Exception as e:
            print(f"Skipping {result.title[:50]}... due to error: {e}")

    print(f"Retrieved {len(papers)} papers for query '{query}'")
    return papers
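
`result.entry_id` is a full URL of the form `https://arxiv.org/abs/<id>`, so using it as the record ID produces long identifiers. A minimal sketch of reducing it to the bare arXiv identifier follows; the helper name `short_arxiv_id` is hypothetical and not part of the notebook. The `arxiv` package offers `Result.get_short_id()` for the same purpose.

```python
def short_arxiv_id(entry_id: str) -> str:
    """Return the part of an arXiv entry URL that follows 'arxiv.org/abs/'.

    Hypothetical helper for illustration. Input without that segment
    is returned unchanged.
    """
    return entry_id.split("arxiv.org/abs/")[-1]


# New-style and old-style identifiers are both preserved, including the version suffix
print(short_arxiv_id("http://arxiv.org/abs/1704.06857v1"))
print(short_arxiv_id("http://arxiv.org/abs/hep-th/9901001v2"))
```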

In [13]:
try:
    # Newer versions of LangChain
    from langchain_core.documents import Document
except ImportError:
    # Older versions fallback
    from langchain.docstore.document import Document


In [None]:
def index_papers(papers, index, embeddings_model):
    """
    Index papers into Pinecone, split into overlapping chunks for accurate retrieval.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200
    )

    vectors_to_upsert = []
    total_chunks = 0

    for paper in tqdm(papers, desc="Chunking and indexing papers"):
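        # fetch_arxiv_papers stores the truncated abstract + body under "text"; "content" is kept for other sources
        text = paper.get("pdf_text") or paper.get("text") or paper.get("content")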
        if not text:
            print(f"Skipping '{paper.get('title', 'Unknown')}' — no text found.")
            continue

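        # Document object for LangChain splitter. Pass only small metadata fields:
        # the splitter attaches the metadata to every chunk it produces.
        docs = [Document(
            page_content=text,
            metadata={"title": paper.get("title", "Unknown"), "url": paper.get("url", "")},
        )]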

        # Split into smaller overlapping chunks
        doc_chunks = splitter.split_documents(docs)

        for i, chunk in enumerate(doc_chunks):
            chunk_text = chunk.page_content.strip()
            if not chunk_text:
                continue

            # Create embedding for each chunk
            embedding = embeddings_model.embed_query(chunk_text)

            # Add chunk metadata
            metadata = {
                "title": paper.get("title", "Unknown"),
                "url": paper.get("url", ""),
                "chunk_id": i,
                "text": chunk_text,
            }

            # Unique ID per chunk (paper_id + chunk index)
            paper_id = paper.get("id") or paper.get("url") or paper.get("title", "")[:50]
            chunk_id = f"{paper_id}_chunk{i}"

            vectors_to_upsert.append((chunk_id, embedding, metadata))

        total_chunks += len(doc_chunks)

    # Upsert all chunks to Pinecone in batches so that no single request grows too large
    if vectors_to_upsert:
        index.upsert(vectors=vectors_to_upsert, batch_size=100)
        print(f"Indexed {total_chunks} chunks from {len(papers)} papers.")
    else:
        print("No chunks indexed.")
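
To make the effect of `chunk_size=1000` and `chunk_overlap=200` concrete, here is a simplified sketch of fixed-size windows with overlap. It is a stand-in for illustration only: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line boundaries, so its chunks will generally differ from these. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows where consecutive windows share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Small numbers make the overlap visible: each chunk repeats the last 2 characters of the previous one
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.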

In [None]:
topic = input("Enter research topic: ")
papers = fetch_arxiv_papers(topic)
index_papers(papers, index, embeddings_model)

print(f"Success: Indexed {len(papers)} papers on '{topic}' in Pinecone db!")


✅ Retrieved 5 papers for query 'deep learning'


Chunking and indexing papers: 100%|██████████| 5/5 [02:12<00:00, 26.47s/it]


✅ Indexed 344 chunks from 5 papers.
✅ Indexed 5 papers on 'deep learning' in Pinecone!
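
The progress bar above reports roughly 26 seconds per paper, and `index_papers` embeds every chunk with its own `embed_query` call. One way to reduce the number of round trips is to embed chunks in batches. The sketch below assumes a hypothetical helper `embed_in_batches` that accepts any function mapping a list of texts to a list of vectors; LangChain embedding models expose `embed_documents` with that shape. The batch size of 64 is an arbitrary assumption, and any speed-up depends on the deployment's rate limits.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[List[float]]:
    """Embed `texts` batch by batch and return the vectors in the original order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Stand-in embedder (assumption) so the sketch runs without network access."""
    return [[float(len(t))] for t in batch]


print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
```

In the notebook, `embed_fn` would be `embeddings_model.embed_documents`, called on the list of chunk texts for one paper.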


---------- End of App 1 ----------


### App 2
This application lets users query the indexed knowledge base with research questions. The agent retrieves relevant paper chunks using semantic search, generates a response grounded in those papers, and cites the original sources with links and metadata.

In [16]:
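from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_openai import AzureChatOpenAI
# PineconeVectorStore replaces the deprecated langchain_pinecone.Pinecone class and
# avoids shadowing the pinecone.Pinecone client imported earlier
from langchain_pinecone import PineconeVectorStore

In [17]:
# Reads PINECONE_API_KEY from the environment; the default text_key "text" matches
# the metadata field written by index_papers
vectorstore = PineconeVectorStore(
    index_name=PINECONE_INDEX_NAME,
    embedding=embeddings_model
)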


In [18]:
def retrieve_context(query, k=5):
    docs = vectorstore.similarity_search(query, k=k)
    context_parts = []
    sources = []

    for d in docs:
        # Fallback: use text from metadata if page_content is empty
        content = d.page_content or d.metadata.get("text", "")
        if content:
            context_parts.append(content)
            sources.append(f"- [{d.metadata.get('title', 'Untitled')}]({d.metadata.get('url', 'Unknown URL')})")

    context_text = "\n\n".join(context_parts)
    return context_text, sources
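
Because several retrieved chunks can come from the same paper, `sources` may list the same citation more than once. A small sketch of dropping the repeats while keeping the ranking order returned by the similarity search follows; the helper name `dedupe_preserving_order` is hypothetical.

```python
def dedupe_preserving_order(items: list[str]) -> list[str]:
    """Remove duplicate entries while keeping the first occurrence of each, in order."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(items))


sources = [
    "- [Paper A](https://example.org/a)",
    "- [Paper B](https://example.org/b)",
    "- [Paper A](https://example.org/a)",
]
print(dedupe_preserving_order(sources))
```

The example URLs are placeholders. In `retrieve_context`, the call would be applied to `sources` just before returning.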


In [None]:
def research_qa(query, k=5):
    context, sources = retrieve_context(query, k=k)

    system_prompt = (
        "You are a helpful AI research assistant. "
        "Use the provided academic paper excerpts to answer the user's question clearly and concisely. "
        "Cite your sources at the end in markdown link format like [Title](URL). "
        "If the answer cannot be found in the papers, say so explicitly."
    )

    user_prompt = f"Question:\n{query}\n\nContext:\n{context}"

    messages = [
        SystemMessage(content=system_prompt),
        HumanMessage(content=user_prompt)
    ]

    response = llm.invoke(messages)

    print("\nAnswer:")
    print(response.content)
    print("\nSources:")
    for s in sources:
        print(s)

In [20]:
question = input("Enter your research question: ")
research_qa(question)


Answer:
Deep learning algorithms are a subset of machine learning methods that use neural networks with many layers (hence "deep") to automatically learn hierarchical feature representations from raw data. The key characteristics and advantages of deep learning algorithms include:

- **Layered Feature Representation:** Deep learning models learn successive layers of increasingly abstract and meaningful features from data. Each layer transforms the input data into a more complex representation, enabling the model to capture intricate patterns and relationships. This process is often referred to as "feature representation learning" ([A Review on Deep Learning Techniques Applied to Semantic Segmentation](https://arxiv.org/abs/1704.06857)).

- **Automatic Feature Extraction:** Unlike traditional machine learning, which often requires manual feature engineering, deep learning algorithms automatically discover the features that best represent the data, making them highly effective for compl

In [None]:
<!-- https://colab.research.google.com/drive/1iA6VqdRi1RPirf3PtLfPT74UJiHdPKxr?usp=sharing#scrollTo=ApA0U6w5v8er -->
<!-- https://colab.research.google.com/drive/1-o4mnBbFtTXfaP-2FaG3X3F6q5zxRUbW?usp=sharing#scrollTo=tD4t0206sPgC -->


to do:
Part-2: Agentic Research Assistant

Extend Part-1 by merging the two separate applications into a single unified system where an LLM-driven agent intelligently decides whether the user is in an indexing phase or a query phase based on user intent. The agent should manage state transitions explicitly—informing the user when indexing is complete and they can begin asking questions, or detecting when a user wants to start a new research topic. The system maintains conversational context and guides the user through natural transition points with clear communication about what can be done next, creating a seamless research workflow without requiring separate interfaces.
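
One possible shape for the phase handling described above, as a sketch only. The names `Phase`, `ResearchSession`, and `classify_intent` are hypothetical. The keyword rule in `classify_intent` is a placeholder for the LLM call that would decide user intent, and `index_fn` and `answer_fn` stand in for the indexing and question-answering functions from Part 1.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Phase(Enum):
    INDEXING = "indexing"
    QUERY = "query"


def classify_intent(message: str, has_index: bool) -> Phase:
    """Placeholder for an LLM intent classifier (the keyword rule is an assumption)."""
    wants_new_topic = "new topic" in message.lower()
    if not has_index or wants_new_topic:
        return Phase.INDEXING
    return Phase.QUERY


@dataclass
class ResearchSession:
    index_fn: Callable[[str], int]   # indexes a topic and returns the number of papers
    answer_fn: Callable[[str], str]  # answers a question from the indexed papers
    topic: Optional[str] = None      # None until something has been indexed
    history: List[Tuple[str, str]] = field(default_factory=list)

    def handle(self, message: str) -> str:
        """Route one user message to indexing or question answering."""
        phase = classify_intent(message, has_index=self.topic is not None)
        self.history.append((phase.value, message))
        if phase is Phase.INDEXING:
            # A real agent would extract the topic from the message; here it is used as-is
            count = self.index_fn(message)
            self.topic = message
            return (
                f"Indexed {count} papers on '{message}'. "
                "You can now ask questions or name a new topic."
            )
        return self.answer_fn(message)


# Stub functions (assumptions) so the sketch runs without Pinecone or an LLM
session = ResearchSession(index_fn=lambda topic: 5, answer_fn=lambda q: f"(answer to: {q})")
print(session.handle("deep learning"))
print(session.handle("What is feature learning?"))
print(session.handle("new topic: graph neural networks"))
```

Keeping the phase decision in one function makes it straightforward to swap the keyword rule for an LLM prompt later without touching the session logic.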

In [None]:
!pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter


In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_openai import AzureChatOpenAI
# Use the maintained vector store class rather than the deprecated aliases
from langchain_pinecone import PineconeVectorStore

from langchain_openai import AzureOpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
