# 🤖 AI Research Assistant with LangGraph

A modular, extensible research assistant built on LangGraph, BM25 retrieval over PDFs, and a Groq-hosted LLM (`llama-3.3-70b-versatile`).

## ✨ Key Features

- **Object-Oriented Design**: Clean agent classes with clear responsibilities
- **Design Patterns**: Strategy, Builder, Facade, and Agent patterns for extensibility
- **Flexible Pipelines**: Easy to customize agent workflows
- **BM25 Retrieval**: Fast lexical search over PDF documents
- **Three-Agent System**: Researcher → Reviewer → Synthesizer

## 🚀 Quick Start

1. Run all cells in order
2. Use the examples to see the system in action
3. Customize pipelines using the builder pattern
4. Test your own queries with `quick_research()`

---

In [2]:
import os
import re
from glob import glob
from typing import List, Dict, Any, TypedDict, Optional
from dataclasses import dataclass

# Document loading & splitting
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# LLM
from langchain_groq import ChatGroq

# LangGraph
from langgraph.graph import StateGraph

# BM25
from rank_bm25 import BM25Okapi

# LangGraph Research Assistant with Design Patterns

This notebook demonstrates:
- **Strategy Pattern**: Interchangeable retrieval strategies (BM25 today; semantic or hybrid retrieval could be added later)
- **Agent Pattern**: Each agent is a class with a defined interface
- **Builder Pattern**: For constructing the research pipeline
- Modular, extensible architecture for easy customization

In [None]:
# ---------------------------
# Configuration
# ---------------------------
FILES_DIR = "../hackathon - Copie/files"  # folder containing PDFs
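# Uses GROQ_API_KEY from the environment if it is already set; otherwise replace the placeholder
os.environ.setdefault("GROQ_API_KEY", "put-apikey-here")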

# Initialize LLM
llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0)

## Step 1: Define Base Classes and Interfaces

In [4]:
from abc import ABC, abstractmethod

# ---------------------------
# Simple tokenizer for BM25
# ---------------------------
def simple_tokenize(text: str) -> List[str]:
    """Tokenize text for BM25 indexing"""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if len(t) > 1]

# ---------------------------
# Data classes
# ---------------------------
@dataclass
class DocChunk:
    """Represents a document chunk with content and metadata"""
    page_content: str
    metadata: Dict[str, Any]

@dataclass
class ResearchState:
    """State object passed between agents"""
    topic: str
    summary: Optional[str] = None
    critique: Optional[str] = None
    insight: Optional[str] = None
    sources: Optional[List[str]] = None
    
    def to_dict(self) -> dict:
        """Convert to dictionary for LangGraph"""
        return {
            "topic": self.topic,
            "summary": self.summary,
            "critique": self.critique,
            "insight": self.insight,
            "sources": self.sources
        }
    
    @classmethod
    def from_dict(cls, data: dict) -> 'ResearchState':
        """Create from dictionary"""
        return cls(**data)
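
To make the tokenizer's behavior concrete, here is a small standalone check. It repeats `simple_tokenize` from the cell above so that it runs on its own; the sample sentence is made up for illustration.

```python
import re
from typing import List


def simple_tokenize(text: str) -> List[str]:
    """Same logic as the cell above: lowercase, keep runs of word characters, drop 1-char tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if len(t) > 1]


# Hypothetical sample text, not taken from the PDFs
demo_tokens = simple_tokenize("AI is a tool for CO2-aware climate modelling!")
print(demo_tokens)
```

Two things are worth noticing: common words such as "is" and "for" are kept (there is no stop-word list), and hyphenated terms are split at the hyphen, so queries match on the individual parts.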

## Step 2: Strategy Pattern - Retrieval Strategies

In [5]:
# ---------------------------
# Strategy Pattern: Retrieval Interface
# ---------------------------
class RetrievalStrategy(ABC):
    """Abstract base class for retrieval strategies"""
    
    @abstractmethod
    def retrieve(self, query: str, k: int = 3) -> List[DocChunk]:
        """Retrieve relevant documents for a query"""
        pass
    
    @abstractmethod
    def get_strategy_name(self) -> str:
        """Return the name of this strategy"""
        pass


class BM25RetrievalStrategy(RetrievalStrategy):
    """BM25 lexical retrieval strategy"""
    
    def __init__(self, chunks: List[DocChunk]):
        self.chunks = chunks
        self.tokenized_texts = [simple_tokenize(c.page_content) for c in chunks]
        self.bm25 = BM25Okapi(self.tokenized_texts)
    
    def retrieve(self, query: str, k: int = 3) -> List[DocChunk]:
        """Retrieve documents using BM25 ranking"""
        q_tokens = simple_tokenize(query)
        if not q_tokens:
            return []
        
        scores = self.bm25.get_scores(q_tokens)
        idx_scores = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
        top = [i for i, sc in idx_scores[:k] if sc > 0]
        
        # Fallback: if no chunk scores above zero, still return the k highest-ranked chunks
        if not top and len(idx_scores) > 0:
            top = [i for i, _ in idx_scores[:k]]
        
        return [self.chunks[i] for i in top]
    
    def get_strategy_name(self) -> str:
        return "BM25 Lexical Retrieval"


# Placeholder for future strategies
class SemanticRetrievalStrategy(RetrievalStrategy):
    """Placeholder for semantic/embedding-based retrieval"""
    
    def __init__(self, chunks: List[DocChunk]):
        self.chunks = chunks
        # TODO: Initialize embeddings
    
    def retrieve(self, query: str, k: int = 3) -> List[DocChunk]:
        # TODO: Implement semantic search
        raise NotImplementedError("Semantic retrieval not yet implemented")
    
    def get_strategy_name(self) -> str:
        return "Semantic Embedding Retrieval"
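
For intuition about what the BM25 strategy ranks on, the following is a minimal pure-Python sketch of a textbook BM25 score. It is not the `rank_bm25` implementation: the non-negative IDF variant, the function name `bm25_scores`, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the query with a textbook BM25 formula."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (an assumption; libraries differ on this detail)
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation (k1) with document-length normalization (b)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus, already tokenized (hypothetical content)
demo_docs = [
    ["wildfire", "detection", "with", "computer", "vision"],
    ["carbon", "stock", "estimation", "in", "forests"],
    ["energy", "cost", "of", "training", "models"],
]
demo_scores = bm25_scores(["wildfire", "vision"], demo_docs)
demo_ranked = sorted(range(len(demo_docs)), key=lambda i: demo_scores[i], reverse=True)
print(demo_ranked, demo_scores)
```

Only the first document shares terms with the query, so it is the only one with a non-zero score. This is also the situation in which the `sc > 0` filter in `BM25RetrievalStrategy.retrieve` trims the result list below `k`.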

## Step 3: Document Loader

In [6]:
# ---------------------------
# Document Loader Class
# ---------------------------
class DocumentLoader:
    """Handles loading and chunking of documents"""
    
    def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size, 
            chunk_overlap=chunk_overlap
        )
    
    def load_pdfs(self, files_dir: str) -> List[DocChunk]:
        """Load and chunk all PDFs from a directory"""
        chunks: List[DocChunk] = []
        pdf_paths = sorted(glob(os.path.join(files_dir, "*.pdf")))
        
        print(f"📥 Loading {len(pdf_paths)} PDF(s)...")
        
        for pdf_path in pdf_paths:
            filename = os.path.basename(pdf_path)
            try:
                loader = PyPDFLoader(pdf_path)
                docs = loader.load()
            except Exception as e:
                print(f"⚠️  Warning: failed to load {pdf_path}: {e}")
                continue
            
            # Add metadata
            for i, d in enumerate(docs):
                if not d.metadata:
                    d.metadata = {}
                d.metadata["source"] = filename
                d.metadata["orig_page_index"] = d.metadata.get("page", i)
            
            # Split into chunks
            doc_chunks = self.splitter.split_documents(docs)
            for idx, c in enumerate(doc_chunks):
                meta = dict(c.metadata)
                meta["chunk_id"] = f"{filename}__chunk{idx}"
                chunks.append(DocChunk(page_content=c.page_content, metadata=meta))
        
        print(f"✅ Loaded and chunked {len(chunks)} chunks from {len(pdf_paths)} PDF(s).")
        return chunks
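
The effect of `chunk_size` and `chunk_overlap` can be seen with a simplified sliding-window splitter. This is only a sketch: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraphs and sentences before falling back to characters, whereas the hypothetical `sliding_window_chunks` helper below cuts at fixed character offsets.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows where consecutive windows share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return windows


demo_windows = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo_windows)
```

With the notebook's defaults (1000 and 200), each chunk can repeat up to 200 characters of its neighbor, which reduces the chance that a relevant sentence is cut in half at a chunk boundary.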

## Step 4: Agent Classes

In [7]:
# ---------------------------
# Agent Base Class
# ---------------------------
class Agent(ABC):
    """Abstract base class for all agents"""
    
    def __init__(self, llm, name: str):
        self.llm = llm
        self.name = name
    
    @abstractmethod
    def process(self, state: dict) -> dict:
        """Process the state and return updates"""
        pass
    
    def __call__(self, state: dict) -> dict:
        """Make agent callable for LangGraph"""
        return self.process(state)


# ---------------------------
# Researcher Agent
# ---------------------------
class ResearcherAgent(Agent):
    """Agent that retrieves and summarizes relevant documents"""
    
    def __init__(self, llm, retrieval_strategy: RetrievalStrategy, k: int = 4):
        super().__init__(llm, "Researcher")
        self.retrieval_strategy = retrieval_strategy
        self.k = k
    
    def process(self, state: dict) -> dict:
        """Retrieve documents and create summary"""
        topic = state.get("topic", "").strip()
        
        if not topic:
            return {"summary": "No topic provided."}
        
        # Retrieve documents
        docs = self.retrieval_strategy.retrieve(topic, k=self.k)
        
        if not docs:
            return {"summary": "No relevant documents found."}
        
        # Prepare context
        context_pieces = []
        sources = []
        
        for d in docs:
            snippet = d.page_content.strip()
            if len(snippet) > 800:
                snippet = snippet[:800].rsplit(" ", 1)[0] + " ..."
            
            source = d.metadata.get("source", "unknown")
            chunk_id = d.metadata.get("chunk_id", "")
            context_pieces.append(f"[SOURCE: {source} | CHUNK: {chunk_id}]\n{snippet}")
            sources.append(source)
        
        context = "\n\n---\n\n".join(context_pieces)
        
        # Create prompt
        prompt = (
            f"You are a research assistant. The user asked about: '{topic}'.\n\n"
            f"Read the following retrieved excerpts (using {self.retrieval_strategy.get_strategy_name()}) "
            f"and produce a concise summary of the main findings or facts relevant to the topic. "
            f"Be explicit about which sources support which points.\n\n"
            f"EXCERPTS:\n\n{context}\n\n"
            "Return a short summary and a short list of (source -> supporting sentence)."
        )
        
        # Get LLM response
        resp = self.llm.invoke(prompt)
        summary_text = getattr(resp, "content", None) or str(resp)
        
        return {
            "summary": summary_text, 
            "sources": list(dict.fromkeys(sources))
        }


# ---------------------------
# Reviewer Agent
# ---------------------------
class ReviewerAgent(Agent):
    """Agent that critically reviews the research summary"""
    
    def __init__(self, llm):
        super().__init__(llm, "Reviewer")
    
    def process(self, state: dict) -> dict:
        """Review and critique the summary"""
        summary = state.get("summary", "")
        
        if not summary:
            return {"critique": "No summary to review."}
        
        prompt = (
            "You are a critical reviewer. Read the following summary and point out: "
            "1) statements that lack direct support from the provided excerpts, "
            "2) possible biases or missing considerations, and "
            "3) questions or follow-ups to verify the claims.\n\n"
            f"SUMMARY:\n\n{summary}\n\n"
            "Give your critique in bullet points."
        )
        
        resp = self.llm.invoke(prompt)
        critique_text = getattr(resp, "content", None) or str(resp)
        
        return {"critique": critique_text}


# ---------------------------
# Synthesizer Agent
# ---------------------------
class SynthesizerAgent(Agent):
    """Agent that synthesizes insights from research and review"""
    
    def __init__(self, llm):
        super().__init__(llm, "Synthesizer")
    
    def process(self, state: dict) -> dict:
        """Synthesize final insights"""
        summary = state.get("summary") or ""
        critique = state.get("critique") or ""
        sources = state.get("sources") or []  # guard against a missing or None value before join()
        
        prompt = (
            "You are a synthesizer. Combine the summary and critique into a 'Collective Insight Report'. "
            "Include: a 2-3 sentence insight, 2 testable hypotheses or follow-up experiments, and which sources "
            "would be most relevant to test those hypotheses. Keep it concise.\n\n"
            f"SUMMARY:\n{summary}\n\nCRITIQUE:\n{critique}\n\nSOURCES:\n{', '.join(sources)}"
        )
        
        resp = self.llm.invoke(prompt)
        insight_text = getattr(resp, "content", None) or str(resp)
        
        return {"insight": insight_text}
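
Two small details in `ResearcherAgent.process` are easy to miss: excerpts are cut at a word boundary before they go into the prompt, and the source list is de-duplicated while keeping first-seen order. The sketch below isolates both; `truncate_snippet` is a hypothetical helper name, and the inputs are made up.

```python
def truncate_snippet(text: str, limit: int = 800) -> str:
    """Trim text to at most `limit` characters, cut back to the last full word, and mark the cut."""
    text = text.strip()
    if len(text) > limit:
        text = text[:limit].rsplit(" ", 1)[0] + " ..."
    return text


# A tiny limit makes the word-boundary behavior visible
demo_snippet = truncate_snippet("alpha beta gamma", limit=12)

# dict.fromkeys keeps the first occurrence of each key, in insertion order
demo_sources = list(dict.fromkeys(["a.pdf", "b.pdf", "a.pdf"]))

print(demo_snippet)
print(demo_sources)
```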

## Step 5: Builder Pattern - Research Pipeline Builder

In [8]:
# ---------------------------
# Builder Pattern: Pipeline Builder
# ---------------------------
class ResearchPipelineBuilder:
    """Builder for constructing research pipelines with different configurations"""
    
    def __init__(self, llm):
        self.llm = llm
        self.retrieval_strategy = None
        self.agents = []
        self.graph_config = []
    
    def with_retrieval_strategy(self, strategy: RetrievalStrategy):
        """Set the retrieval strategy"""
        self.retrieval_strategy = strategy
        return self
    
    def with_researcher(self, k: int = 4):
        """Add researcher agent"""
        if not self.retrieval_strategy:
            raise ValueError("Retrieval strategy must be set before adding researcher")
        
        researcher = ResearcherAgent(self.llm, self.retrieval_strategy, k=k)
        self.agents.append(("researcher", researcher))
        return self
    
    def with_reviewer(self):
        """Add reviewer agent"""
        reviewer = ReviewerAgent(self.llm)
        self.agents.append(("reviewer", reviewer))
        return self
    
    def with_synthesizer(self):
        """Add synthesizer agent"""
        synthesizer = SynthesizerAgent(self.llm)
        self.agents.append(("synthesizer", synthesizer))
        return self
    
    def build(self):
        """Build the LangGraph pipeline"""
        if not self.agents:
            raise ValueError("No agents added to pipeline")
        
        # Create state schema
        class PipelineState(TypedDict):
            topic: str
            summary: Optional[str]
            critique: Optional[str]
            insight: Optional[str]
            sources: Optional[List[str]]
        
        # Create graph
        graph = StateGraph(PipelineState)
        
        # Add nodes
        for name, agent in self.agents:
            graph.add_node(name, agent)
        
        # Add edges (sequential for now)
        for i in range(len(self.agents) - 1):
            graph.add_edge(self.agents[i][0], self.agents[i + 1][0])
        
        # Set entry point and mark the last agent as the finish point
        graph.set_entry_point(self.agents[0][0])
        graph.set_finish_point(self.agents[-1][0])
        
        return graph.compile()


# ---------------------------
# Research Assistant (Facade Pattern)
# ---------------------------
class ResearchAssistant:
    """High-level interface for the research system"""
    
    def __init__(self, pipeline, retrieval_strategy: RetrievalStrategy):
        self.pipeline = pipeline
        self.retrieval_strategy = retrieval_strategy
    
    def research(self, topic: str) -> dict:
        """Perform research on a topic"""
        print(f"🔬 Researching: {topic}")
        print(f"📊 Using: {self.retrieval_strategy.get_strategy_name()}")
        print("-" * 80)
        
        result = self.pipeline.invoke({"topic": topic})
        return result
    
    def print_results(self, result: dict):
        """Pretty print research results"""
        topic = result.get("topic", "Unknown")
        
        print("\n" + "=" * 80)
        print(f"📝 RESEARCH REPORT: {topic}")
        print("=" * 80 + "\n")
        
        print("📘 RESEARCHER SUMMARY:")
        print("-" * 80)
        print(result.get("summary") or "n/a")
        print()
        
        print("\n🔍 REVIEWER CRITIQUE:")
        print("-" * 80)
        print(result.get("critique") or "n/a")
        print()
        
        print("\n💡 COLLECTIVE INSIGHT:")
        print("-" * 80)
        print(result.get("insight") or "n/a")
        print()
        
        sources = result.get("sources") or []
        if sources:
            print("\n📚 SOURCES USED:")
            print("-" * 80)
            for i, source in enumerate(sources, 1):
                print(f"{i}. {source}")
        
        print("\n" + "=" * 80 + "\n")
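
As a mental model for what the compiled sequential pipeline does, the sketch below runs nodes in order and merges each node's partial update into a shared state dictionary. It is a simplification written without LangGraph, not a description of LangGraph internals; `run_sequential` and the two fake nodes are hypothetical names.

```python
from typing import Callable, List, Tuple

Node = Callable[[dict], dict]


def run_sequential(nodes: List[Tuple[str, Node]], state: dict) -> dict:
    """Call each node with the current state and merge its partial update back in."""
    state = dict(state)  # copy so the caller's input is left untouched
    for _name, node in nodes:
        state.update(node(state))
    return state


def fake_researcher(state: dict) -> dict:
    return {"summary": f"summary of {state['topic']}", "sources": ["a.pdf"]}


def fake_synthesizer(state: dict) -> dict:
    return {"insight": f"insight from: {state['summary']}"}


demo_result = run_sequential(
    [("researcher", fake_researcher), ("synthesizer", fake_synthesizer)],
    {"topic": "bm25"},
)
print(demo_result)
```

Because no reviewer ran in this example, the result has no `critique` key, which is why `print_results` reads every field through `result.get(...)` with a fallback.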

## Step 6: Initialize the System

In [9]:
# Load documents
loader = DocumentLoader(chunk_size=1000, chunk_overlap=200)
chunks = loader.load_pdfs(FILES_DIR)

# Create retrieval strategy
bm25_strategy = BM25RetrievalStrategy(chunks) if chunks else None

if not bm25_strategy:
    print("⚠️  No documents loaded. Please add PDFs to the files directory.")
else:
    print(f"✅ System ready with {len(chunks)} document chunks")

📥 Loading 5 PDF(s)...


could not convert string to float: b'0.00-9999999' : FloatObject (b'0.00-9999999') invalid; use 0.0 instead
could not convert string to float: b'0.00-9999999' : FloatObject (b'0.00-9999999') invalid; use 0.0 instead


✅ Loaded and chunked 262 chunks from 5 PDF(s).
✅ System ready with 262 document chunks


In [10]:
# Build the research pipeline using the builder pattern
pipeline = (ResearchPipelineBuilder(llm)
    .with_retrieval_strategy(bm25_strategy)
    .with_researcher(k=4)
    .with_reviewer()
    .with_synthesizer()
    .build())

# Create the research assistant
assistant = ResearchAssistant(pipeline, bm25_strategy)

print("🤖 Research Assistant is ready!")
print("📊 Pipeline: Researcher → Reviewer → Synthesizer")
print(f"🔍 Strategy: {bm25_strategy.get_strategy_name()}")

🤖 Research Assistant is ready!
📊 Pipeline: Researcher → Reviewer → Synthesizer
🔍 Strategy: BM25 Lexical Retrieval


## 🧪 Example 1: Research on Climate Change and AI

In [11]:
# Run research on a specific topic
result1 = assistant.research("How is AI being used to combat climate change?")
assistant.print_results(result1)

🔬 Researching: How is AI being used to combat climate change?
📊 Using: BM25 Lexical Retrieval
--------------------------------------------------------------------------------

📝 RESEARCH REPORT: How is AI being used to combat climate change?

📘 RESEARCHER SUMMARY:
--------------------------------------------------------------------------------
**Summary:** AI is being used to combat climate change through various applications, including integrative emissions monitoring and management for nature-based climate solutions, such as forests. AI techniques, like machine learning and computer vision, can detect wildfires, estimate carbon stock, and support disaster response efforts. However, the development of AI models also has a significant environmental impact, with high energy consumption and emissions. 

**Source-Sentence List:**
- 3_Climate And Resource Awareness is Imperative to Achieving Sustainable AI   and Preventing a Global A.pdf -> "required 30.84M GPU hours ... with a

## 🧪 Example 2: Research on Machine Learning for Weather Prediction

In [None]:
result2 = assistant.research("What are the applications of machine learning in weather forecasting?")
assistant.print_results(result2)

## 🧪 Example 3: Building a Custom Pipeline

You can easily customize the pipeline by changing the order or configuration of agents:

In [12]:
# Example: Create a simpler pipeline with just researcher and synthesizer (no reviewer)
simple_pipeline = (ResearchPipelineBuilder(llm)
    .with_retrieval_strategy(bm25_strategy)
    .with_researcher(k=3)
    .with_synthesizer()
    .build())

simple_assistant = ResearchAssistant(simple_pipeline, bm25_strategy)

# Test the simple pipeline
result3 = simple_assistant.research("Climate data analysis techniques")
simple_assistant.print_results(result3)

🔬 Researching: Climate data analysis techniques
📊 Using: BM25 Lexical Retrieval
--------------------------------------------------------------------------------

📝 RESEARCH REPORT: Climate data analysis techniques

📘 RESEARCHER SUMMARY:
--------------------------------------------------------------------------------
**Summary:** Climate data analysis techniques involve the use of AI and machine learning methods to analyze and understand the impact of climate on various factors such as disease prevalence, emissions, and resource management. Techniques such as computer vision and deep learning can be used to analyze climate data and support automation and decision-making. Climate data analysis can help identify patterns and relationships between climate factors and disease trends, and can inform strategies for mitigating the impacts of climate change.

**Source -> Supporting Sentence:**
* 2_Towards AI-driven Integrative Emissions Monitoring  Management for   Nature-Based Clim

## 📊 Design Patterns Summary

This notebook demonstrates several design patterns:

### 1. **Strategy Pattern** (Retrieval Strategies)
- **Interface**: `RetrievalStrategy`
- **Implementations**: `BM25RetrievalStrategy`, `SemanticRetrievalStrategy` (placeholder)
- **Benefit**: Easy to swap retrieval algorithms without changing agent code
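
The swap-ability claim can be illustrated with a reduced, standalone version of the pattern. To keep the sketch free of notebook dependencies, it uses a stand-in interface (`TextRetriever`, returning plain strings instead of `DocChunk` objects) and two hypothetical strategies; none of these names exist in the notebook.

```python
from abc import ABC, abstractmethod
from typing import List


class TextRetriever(ABC):
    """Stand-in for RetrievalStrategy that works on plain strings (assumption for brevity)."""

    def __init__(self, texts: List[str]):
        self.texts = texts

    @abstractmethod
    def retrieve(self, query: str, k: int = 3) -> List[str]:
        ...


class KeywordOverlapRetriever(TextRetriever):
    """Ranks texts by how many query words they share; drops texts with no overlap."""

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        q_words = set(query.lower().split())
        scored = [(len(q_words & set(t.lower().split())), t) for t in self.texts]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [t for score, t in ranked if score > 0][:k]


class FirstKRetriever(TextRetriever):
    """Trivial baseline: ignores the query and returns the first k texts."""

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        return self.texts[:k]


def build_context(retriever: TextRetriever, query: str) -> str:
    """Caller code depends only on the interface, so either strategy can be passed in."""
    return " | ".join(retriever.retrieve(query, k=2))


demo_texts = ["solar power forecasts", "wildfire detection models", "forest carbon stock"]
overlap_context = build_context(KeywordOverlapRetriever(demo_texts), "wildfire models")
baseline_context = build_context(FirstKRetriever(demo_texts), "wildfire models")
print(overlap_context)
print(baseline_context)
```

`build_context` plays the role that `ResearcherAgent` plays in the notebook: it never needs to know which strategy it was given.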

### 2. **Agent Pattern** (Agent Classes)
- **Base Class**: `Agent`
- **Implementations**: `ResearcherAgent`, `ReviewerAgent`, `SynthesizerAgent`
- **Benefit**: Each agent has clear responsibilities and can be tested independently
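
One way to test an agent in isolation is to replace the LLM with a test double. The sketch below assumes only what the agents above rely on: an `invoke(prompt)` method whose return value has a `.content` attribute. `FakeLLM` and `MiniReviewer` are hypothetical names; `MiniReviewer` is a trimmed stand-in for `ReviewerAgent` so the example runs without the notebook.

```python
from types import SimpleNamespace


class FakeLLM:
    """Test double: records every prompt and returns a canned reply with a .content attribute."""

    def __init__(self, reply: str):
        self.reply = reply
        self.prompts = []

    def invoke(self, prompt: str):
        self.prompts.append(prompt)
        return SimpleNamespace(content=self.reply)


class MiniReviewer:
    """Trimmed stand-in mirroring the control flow of ReviewerAgent.process."""

    def __init__(self, llm):
        self.llm = llm

    def process(self, state: dict) -> dict:
        summary = state.get("summary") or ""
        if not summary:
            return {"critique": "No summary to review."}
        resp = self.llm.invoke(f"Critique this summary:\n\n{summary}")
        return {"critique": getattr(resp, "content", None) or str(resp)}


fake_llm = FakeLLM("- claim 1 lacks a source")
reviewer = MiniReviewer(fake_llm)

empty_update = reviewer.process({"topic": "x"})  # no summary: the LLM must not be called
review_update = reviewer.process({"summary": "AI detects wildfires."})
print(empty_update, review_update, len(fake_llm.prompts))
```

Inside the notebook, the same `FakeLLM` could be passed to the real `ReviewerAgent` or `SynthesizerAgent`, since they only call `invoke` and read `content`.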

### 3. **Builder Pattern** (Pipeline Builder)
- **Class**: `ResearchPipelineBuilder`
- **Benefit**: Fluent API for constructing complex pipelines with different configurations

### 4. **Facade Pattern** (Research Assistant)
- **Class**: `ResearchAssistant`
- **Benefit**: Simple high-level interface hiding complex pipeline details

---

## 🔧 How to Extend

### Adding a New Retrieval Strategy:
```python
class HybridRetrievalStrategy(RetrievalStrategy):
    def __init__(self, chunks):
        self.bm25 = BM25RetrievalStrategy(chunks)
        # Add semantic retriever here
    
    def retrieve(self, query: str, k: int = 3):
        # Combine BM25 and semantic results
        raise NotImplementedError("Hybrid retrieval not yet implemented")
    
    def get_strategy_name(self):
        return "Hybrid BM25 + Semantic Retrieval"
```
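
One common way to combine two ranked lists is reciprocal rank fusion (RRF). The sketch below is a possible approach for the "combine" step, not something the notebook implements; the function name, the constant `k=60`, and the chunk identifiers are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of ids: each appearance adds 1 / (k + rank) to that id's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical chunk ids as they might be ranked by two different retrievers
bm25_ranking = ["chunk_a", "chunk_b", "chunk_c"]
semantic_ranking = ["chunk_b", "chunk_c", "chunk_a"]

fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
print(fused)
```

RRF uses only rank positions, so it avoids having to put BM25 scores and embedding similarities on a common scale. In a `HybridRetrievalStrategy`, the `chunk_id` stored in each chunk's metadata could serve as the id being fused.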

### Adding a New Agent:
```python
class FactCheckerAgent(Agent):
    def __init__(self, llm):
        super().__init__(llm, "FactChecker")
    
    def process(self, state: dict):
        # Implement fact-checking logic
        # PipelineState needs a matching `fact_check` key for this update to be kept
        return {"fact_check": "..."}
```

### Building Custom Pipelines:
```python
custom_pipeline = (ResearchPipelineBuilder(llm)
    .with_retrieval_strategy(custom_strategy)
    .with_researcher(k=5)
    .with_reviewer()
    .with_custom_agent(FactCheckerAgent(llm))  # hypothetical: not defined on ResearchPipelineBuilder above
    .with_synthesizer()
    .build())
```
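
`ResearchPipelineBuilder` as defined in Step 5 has no `with_custom_agent` method, so the snippet above needs one added first. A minimal sketch of such a method follows, written as a mixin; `CustomAgentMixin`, `DemoBuilder`, and `DemoAgent` are hypothetical names, and `DemoBuilder` is only a stand-in so the example runs without LangGraph.

```python
class CustomAgentMixin:
    """Hypothetical helper: adds a generic `with_custom_agent` step to any builder
    that stores its nodes in `self.agents` as (name, agent) tuples."""

    def with_custom_agent(self, agent, name=None):
        # Default node name: the agent's `name` attribute (or its class name), lowercased
        node_name = name or getattr(agent, "name", type(agent).__name__).lower()
        if any(existing == node_name for existing, _ in self.agents):
            raise ValueError(f"Duplicate node name: {node_name}")
        self.agents.append((node_name, agent))
        return self  # keep the fluent chaining style


class DemoBuilder(CustomAgentMixin):
    """Stand-in for ResearchPipelineBuilder (assumption: same `agents` list layout)."""

    def __init__(self):
        self.agents = []


class DemoAgent:
    name = "FactChecker"


demo_builder = (DemoBuilder()
    .with_custom_agent(DemoAgent())
    .with_custom_agent(DemoAgent(), name="second_check"))
node_names = [name for name, _ in demo_builder.agents]
print(node_names)
```

In the notebook, something like `class ExtendedBuilder(CustomAgentMixin, ResearchPipelineBuilder): pass` should pick up the method. The `PipelineState` schema inside `build()` would likely also need a field for whatever the custom agent returns (for example `fact_check`).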

## 🧪 Interactive Testing

You can easily test different queries:

In [13]:
# Quick test function
def quick_research(topic: str):
    """Quick research on any topic"""
    result = assistant.research(topic)
    assistant.print_results(result)
    return result

# Example: Test with your own query
# Uncomment and modify the line below:
# quick_research("Your research topic here")