# Explanation of Key Components


## Data Collection Pipeline:       
Uses arXiv API to search for papers          
Prioritizes HTML versions (available for papers since Dec 2023)            
Falls back to abstracts for older papers          
Avoids duplicate processing with database checks      

## Vector Store Management:         
Uses LangChain's Chroma wrapper for compatibility         
Stores paper content with rich metadata        
Chunks long papers for better retrieval           
Enables semantic search with GoogleAI embeddings         

## LangGraph RAG Workflow:          
Structured workflow with search and generate steps          
Maintains state throughout the process        
Uses few-shot prompting for high-quality responses         
De-duplicates search results            

## Enhanced Search Capabilities:         
Query expansion with few-shot examples               
Deep paper analysis for selected documents          
Relevance scoring and ranking          

## User Interaction:         
Interactive search interface          
Options for query expansion             
Deep-dive into specific papers           
Links to download PDFs             

In [22]:
import os
import requests
import arxiv
import json
import numpy as np
from bs4 import BeautifulSoup
from IPython.display import Markdown
from tqdm import tqdm

# LangChain & Google GenAI specific
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain.tools import Tool
import google.generativeai as genai

# LangChain vectorstore
from langchain_community.vectorstores import Chroma

# LangGraph specific
from typing import TypedDict, Annotated, Sequence, List
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langchain_core.messages import HumanMessage, AIMessage, ToolMessage
from langchain_core.runnables import RunnableLambda

"""
Python version: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:27:10) [MSC v.1938 64 bit (AMD64)]
ChromaDB version: 0.5.3
LangChain version: 0.3.7
Google GenAI version: 0.8.5
langgraph                 0.3.34                   pypi_0    pypi
langgraph-checkpoint      2.0.24                   pypi_0    pypi
langgraph-prebuilt        0.1.8                    pypi_0    pypi
langgraph-sdk             0.1.63                   pypi_0    pypi
"""

'\nPython version: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:27:10) [MSC v.1938 64 bit (AMD64)]\nChromaDB version: 0.5.3\nLangChain version: 0.3.7\nGoogle GenAI version: 0.8.5\nlanggraph                 0.3.34                   pypi_0    pypi\nlanggraph-checkpoint      2.0.24                   pypi_0    pypi\nlanggraph-prebuilt        0.1.8                    pypi_0    pypi\nlanggraph-sdk             0.1.63                   pypi_0    pypi\n'

In [41]:
#!pip install -U google-generativeai langchain-google-genai

In [42]:
# Configure with your API key
genai.configure(api_key=GOOGLE_API_KEY)

# List all available models
for m in genai.list_models():
    if 'generateContent' in m.supported_generation_methods:
        print(f"Model name: {m.name}")
        print(f"Display name: {m.display_name}")
        print(f"Supported methods: {m.supported_generation_methods}")
        print("-" * 50)

Model name: models/gemini-1.0-pro-vision-latest
Display name: Gemini 1.0 Pro Vision
Supported methods: ['generateContent', 'countTokens']
--------------------------------------------------
Model name: models/gemini-pro-vision
Display name: Gemini 1.0 Pro Vision
Supported methods: ['generateContent', 'countTokens']
--------------------------------------------------
Model name: models/gemini-1.5-pro-latest
Display name: Gemini 1.5 Pro Latest
Supported methods: ['generateContent', 'countTokens']
--------------------------------------------------
Model name: models/gemini-1.5-pro-001
Display name: Gemini 1.5 Pro 001
Supported methods: ['generateContent', 'countTokens', 'createCachedContent']
--------------------------------------------------
Model name: models/gemini-1.5-pro-002
Display name: Gemini 1.5 Pro 002
Supported methods: ['generateContent', 'countTokens', 'createCachedContent']
--------------------------------------------------
Model name: models/gemini-1.5-pro
Display name: Gemin

In [43]:
embedding_models = [m for m in genai.list_models() if "embedding" in m.name.lower()]

if embedding_models:
    # Use the first available embedding model
    embedding_model_name = embedding_models[0].name
    print(f"Using embedding model: {embedding_model_name}")
else:
    # Fall back to a text model for embeddings
    embedding_model_name = "models/gemini-1.5-pro-latest"
    print(f"No embedding models found. Using text model for embeddings: {embedding_model_name}")

Using embedding model: models/embedding-gecko-001


In [44]:

# API Configuration
GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY", "AIzaSyBuYf9Sdm8M8tIMvfArkcS_YUjhEZfZqes")
genai.configure(api_key=GOOGLE_API_KEY)

# Initialize Gemini model for RAG workflow
gemini_pro = ChatGoogleGenerativeAI(
    model="models/gemini-1.5-pro-latest",  # Use exactly this format from the list
    temperature=0.3,
    google_api_key=GOOGLE_API_KEY
)

# Initialize embeddings
embeddings = GoogleGenerativeAIEmbeddings(
    model="embedding-gecko-001",  # Not "models/embedding-001"
    google_api_key=GOOGLE_API_KEY
)

In [45]:
# Initialize vector store with LangChain's wrapper
vectorstore = Chroma(
    collection_name="arxiv_papers",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)

## arXiv Data Collection Functions

In [46]:

# ============= arXiv Data Collection Functions =============

def arxiv_api_search(query, max_results=3):
    """Search arXiv for papers matching keywords"""
    search = arxiv.Search(
        query=query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.SubmittedDate,
        sort_order=arxiv.SortOrder.Descending
    )
    return list(search.results())

def check_html_availability(paper_id):
    """Check if HTML version is available for paper"""
    html_url = f"https://arxiv.org/html/{paper_id}"
    response = requests.head(html_url)
    return response.status_code == 200

def get_html_content(paper_id):
    """Get HTML content of paper if available"""
    html_url = f"https://arxiv.org/html/{paper_id}"
    response = requests.get(html_url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract main content section
        content = soup.find('main')
        if content:
            # Remove script tags
            for script in content.find_all('script'):
                script.decompose()
            return content.get_text(separator=' ', strip=True)
        return ""
    return None

def extract_pdf_abstract(paper):
    """For papers without HTML, use the abstract as content"""
    return paper.summary

def is_paper_in_vectorstore(paper_id):
    """Check if paper is already in vector store"""
    try:
        # Try to retrieve metadata for the paper
        results = vectorstore.get(
            where={"paper_id": paper_id},
            limit=1
        )
        return len(results['ids']) > 0
    except:
        # If there's an error (e.g., collection doesn't exist), paper is not there
        return False

## Vector Store Operations

In [47]:
# ============= Vector Store Operations =============

def store_paper_in_vector_store(paper, content, content_source):
    """Store paper and its content in vector store"""
    paper_id = paper.entry_id.split('/')[-1]
    
    # Chunk content for more effective retrieval
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=100,
        length_function=len
    )
    chunks = text_splitter.split_text(content)
    
    # Prepare metadata
    metadatas = [{
        "paper_id": paper_id,
        "title": paper.title,
        "authors": ", ".join(author.name for author in paper.authors),
        "published": paper.published.strftime("%Y-%m-%d"),
        "url": paper.entry_id,
        "chunk_id": i,
        "source": content_source,
        "abstract": paper.summary[:500] + "..." if len(paper.summary) > 500 else paper.summary
    } for i in range(len(chunks))]


# Store in vector store
    vectorstore.add_texts(
        texts=chunks,
        metadatas=metadatas,
        ids=[f"{paper_id}_{i}" for i in range(len(chunks))]
    )
    
    print(f"Stored {len(chunks)} chunks for paper: {paper.title}")
    return len(chunks)

def semantic_search(query, n_results=5):
    """Search for relevant papers in vector store"""
    results = vectorstore.similarity_search_with_score(
        query=query,
        k=n_results
    )
    
    formatted_results = []
    for doc, score in results:
        metadata = doc.metadata
        formatted_results.append({
            "content": doc.page_content,
            "metadata": metadata,
            "similarity": 1.0 - float(score)  # Convert distance to similarity
        })
    
    return formatted_results

## LangGraph RAG Workflow

In [48]:
# ============= LangGraph RAG Workflow =============
from typing import List, Sequence, Dict, Any, TypedDict

# Define the state for our workflow
class RAGState(TypedDict):
    query: str
    context: List[str]
    messages: Sequence[HumanMessage | AIMessage | ToolMessage]
    search_results: List[Dict[str, Any]]

# Search function node
def search_papers(state: RAGState) -> RAGState:
    """Node that performs semantic search"""
    query = state["query"]
    results = semantic_search(query, n_results=5)
    
    context = []
    search_results = []
    
    # Format search results for context
    for result in results:
        metadata = result["metadata"]
        document = result["content"]
        
        context.append(f"Paper: {metadata['title']}\nAuthors: {metadata['authors']}\nPublished: {metadata['published']}\nURL: {metadata['url']}\nAbstract: {metadata['abstract']}\n\nExcerpt: {document[:500]}...\n\n")
        
        # De-duplicate search results by paper_id
        if not any(r["paper_id"] == metadata["paper_id"] for r in search_results):
            search_results.append({
                "title": metadata["title"],
                "url": metadata["url"],
                "paper_id": metadata["paper_id"],
                "similarity": result["similarity"]
            })
    
    state["context"] = context
    state["search_results"] = search_results
    return state

# Response generation node
def generate_response(state: RAGState) -> RAGState:
    """Node that generates a response based on context"""
    query = state["query"]
    context = state["context"]
    
    # Format the full prompt with context
    few_shot_prompt = """
    You are a helpful research assistant that provides accurate, relevant information from arXiv papers. I will provide a query and relevant excerpts from papers. Please answer based only on the provided information.

    Example Query: "Is there hybrid convolutional neural networks and vision transformers?"
    Example Context: [A Hybrid Fully Convolutional CNN-Transformer Model for Inherently Interpretable Medical Image Classification]
    Example Response: Based on the provided paper, there is an interpretable-by-design hybrid fully convolutional CNN-Transformer architecture for medical image classification. The model enhances interpretability for multi-class tasks.
    
    Example Query: "How does the Less-Attention Vision Transformer architecture address the computational inefficiencies and saturation problems of traditional Vision Transformers?"
    Example Context: [You Only Need Less Attention at Each Stage in Vision Transformers]
    Example Response: Less-Attention Vision Transformer reduces ViT's quadratic attention cost by reusing early-layer attention scores through linear transformations. It also mitigates attention saturation using residual downsampling and a custom loss to preserve attention structure.
    When answering:
    1. Only use information present in the provided context
    2. Cite specific papers when presenting findings
    3. Clearly indicate if the answer is incomplete or if more information is needed
    4. Structure your response with clear sections and bullet points
    5. Acknowledge limitations in the available research

    Now answer my query using only the information in the provided context.
"""
    
    # Combine context into one string
    context_text = "\n".join(context)
    
    # Create messages
    messages = [
        HumanMessage(content=f"{few_shot_prompt}\n\nQuery: {query}\n\nContext: {context_text}")
    ]
    
    # Generate response
    ai_message = gemini_pro.invoke(messages)
    
    # Add to state
    state["messages"] = list(state.get("messages", [])) + [messages[0], ai_message]
    
    return state

# Build the LangGraph
def build_rag_graph():
    """Create the LangGraph workflow"""
    # Initialize workflow
    workflow = StateGraph(RAGState)
    
    # Add nodes
    workflow.add_node("search", search_papers)
    workflow.add_node("generate", generate_response)
    
    # Add edges
    workflow.add_edge("search", "generate")
    workflow.add_edge("generate", END)
    
    # Set entry point
    workflow.set_entry_point("search")
    
    # Compile
    return workflow.compile()

# Create the executable graph
rag_graph = build_rag_graph()

##  Query Expansion with Few-Shot 

In [49]:
# ============= Query Expansion with Few-Shot =============

def expand_research_query(query):
    """Expand a research query using few-shot prompting"""
    few_shot_examples = """
    Example 1:
    Query: "transformer models for NLP"
    Expanded: The query is about transformer architecture models used in natural language processing. I should search for papers about BERT, GPT, T5, and other transformer variants, their applications in NLP tasks like translation, summarization, and question answering, and recent improvements to transformer architectures.

    Example 2:
    Query: "quantum computing cryptography"
    Expanded: The query relates to the intersection of quantum computing and cryptography. I should search for papers about quantum threats to classical cryptography, post-quantum cryptographic algorithms resistant to quantum attacks, quantum key distribution protocols, and quantum cryptographic primitives.

    Example 3:
    Query: "reinforcement learning in robotics"
    Expanded: This query is about applying reinforcement learning techniques to robotic systems. I should search for papers on robotic control using RL, sim-to-real transfer for robotic learning, sample-efficient RL methods for physical systems, deep RL for manipulation tasks, and multi-agent RL for coordinated robots.
    """

    prompt = f"""Based on the examples below, expand my research query to identify key concepts, relevant subtopics, and specific areas to explore:

{few_shot_examples}

Query: "{query}"
Expanded:"""

    response = gemini_pro.invoke([HumanMessage(content=prompt)])
    return response.content

# ============= arXiv Research Pipeline =============

def research_arxiv(query, use_expanded_query=True):
    """Main research function that ties everything together"""
    print(f"Original query: {query}")
    
    # Step 1: Optionally expand the query
    if use_expanded_query:
        expanded_query = expand_research_query(query)
        print(f"\nExpanded query:\n{expanded_query}")
        search_query = f"{query} {expanded_query}"
    else:
        search_query = query
    
    # Step 2: Get papers from arXiv API
    papers = arxiv_api_search(search_query)
    print(f"\nFound {len(papers)} papers from arXiv API")
    
    # Step 3: Process and store papers
    papers_processed = 0
    for paper in tqdm(papers, desc="Processing papers"):
        # Extract paper ID
        paper_id = paper.entry_id.split('/')[-1]
        
        # Check if paper is already in database
        if is_paper_in_vectorstore(paper_id):
            print(f"Paper already in database: {paper.title}")
            continue
        
        # Try to get HTML content first
        has_html = check_html_availability(paper_id)
        
        if has_html:
            content = get_html_content(paper_id)
            if content:
                store_paper_in_vector_store(paper, content, "html")
                papers_processed += 1
                # Add a small delay to avoid overwhelming the server
                time.sleep(0.5)
                continue
        
        # Fallback to abstract for older papers
        store_paper_in_vector_store(paper, paper.summary, "abstract")
        papers_processed += 1
        # Add a small delay to avoid overwhelming the server
        time.sleep(0.5)
    
    print(f"\nProcessed {papers_processed} new papers")
    
    # Step 4: Run the RAG workflow
    print("\nGenerating research response...")
    result = rag_graph.invoke({
        "query": query,
        "context": [],
        "messages": [],
        "search_results": []
    })
    
    return {
        "answer": result["messages"][-1].content,
        "search_results": result["search_results"]
    }

## Paper Deep Analysis 

In [50]:
# ============= Paper Deep Analysis =============

def analyze_paper_deeply(paper_id, query):
    """Perform a deeper analysis of a specific paper"""
    # Check if HTML is available
    has_html = check_html_availability(paper_id)
    
    if has_html:
        content = get_html_content(paper_id)
        source = "HTML"
    else:
        # Get paper from arXiv
        search = arxiv.Search(id_list=[paper_id])
        results = list(search.results())
        if results:
            paper = results[0]
            content = paper.summary
            source = "Abstract only"
        else:
            return "Paper not found"
    
    # Few-shot prompt for paper analysis
    analysis_prompt = """
You are a research assistant performing in-depth analysis of academic papers. Analyze this paper in relation to the research query.

Example:
Paper: "Attention Is All You Need" (Transformer architecture)
Query: "efficient NLP models"
Analysis:
# Paper Analysis: Attention Is All You Need

## Key Contributions
- Introduced the transformer architecture based entirely on self-attention mechanisms
- Eliminated recurrence and convolutions entirely in sequence modeling
- Achieved state-of-the-art results in machine translation with significantly reduced training time

## Methodology
- Self-attention mechanism computes representation of a sequence by relating different positions
- Multi-head attention allows the model to jointly attend to information from different representation subspaces
- Position-wise feed-forward networks apply the same feed-forward network to each position

## Notable Findings
- Achieved BLEU score of 28.4 on English-to-German translation, outperforming previous best models
- Training was 3.5x faster than the best previous models on the same hardware
- Demonstrated effectiveness on English constituency parsing without any task-specific tuning

## Relevance to Query
- Directly addresses efficiency by significantly reducing training time
- The parallelizable nature of attention enables much faster training on modern hardware
- The architecture has become the foundation for many efficient NLP models that followed

## Limitations
- High memory requirements for very long sequences (quadratic complexity)
- Position encoding scheme may not be optimal for all sequence modeling tasks
- Limited evaluation on tasks beyond machine translation

Now analyze the following paper:
"""

    prompt = f"{analysis_prompt}\n\nPaper content ({source}):\n{content[:5000]}\n\nQuery: {query}\n\nAnalysis:"
    
    response = gemini_pro.invoke([HumanMessage(content=prompt)])
    return response.content

## User Interaction

In [51]:
# ============= User Interaction =============

def run_interactive_search():
    """Interactive search function"""
    query = input("Enter your research query: ")
    
    # Ask if user wants to expand the query
    expand = input("Do you want to expand the query with few-shot prompting? (y/n, default=y): ").lower() != 'n'
    
    print("\nSearching and analyzing papers...")
    results = research_arxiv(query, use_expanded_query=expand)
    
    print("\n=== RESEARCH RESULTS ===\n")
    display(Markdown(results["answer"]))
    
    print("\n=== TOP PAPERS ===\n")
    for i, paper in enumerate(results["search_results"][:5]):
        print(f"{i+1}. {paper['title']}")
        print(f"   URL: {paper['url']}")
        print(f"   Relevance: {paper['similarity']:.2f}\n")
    
    # Ask if user wants to explore any paper in depth
    while True:
        selection = input("\nEnter paper number for deeper analysis (or 0 to exit): ")
        if selection == '0' or not selection.strip():
            break
            
        if selection.isdigit() and 1 <= int(selection) <= len(results["search_results"]):
            paper_id = results["search_results"][int(selection)-1]["paper_id"]
            paper_url = results["search_results"][int(selection)-1]["url"]
            paper_title = results["search_results"][int(selection)-1]["title"]
            
            print(f"\nPerforming deep analysis of paper: {paper_title}...")
            analysis = analyze_paper_deeply(paper_id, query)
            
            print("\n=== PAPER ANALYSIS ===\n")
            display(Markdown(analysis))
            
            print(f"\nPaper URL: {paper_url}")
            
            # Option to download PDF
            download = input("Do you want to download the PDF? (y/n): ").lower()
            if download == 'y':
                print(f"PDF available at: https://arxiv.org/pdf/{paper_id}.pdf")
        else:
            print("Invalid selection. Please try again.")

##  Example Usage 

In [52]:
if __name__ == "__main__":
    # Interactive mode
    run_interactive_search()
    
    # Or direct usage:
    # results = research_arxiv("transformer models for efficient NLP")
    # print(results["answer"])


Searching and analyzing papers...
Original query: is there hybrid cnn and vision transformers?


  return list(search.results())



Found 3 papers from arXiv API


Processing papers:   0%|          | 0/3 [00:00<?, ?it/s]


GoogleGenerativeAIError: Error embedding content: 400 * BatchEmbedContentsRequest.model: unexpected model name format
* BatchEmbedContentsRequest.requests[0].model: unexpected model name format
