## Code for Chapter 3 of the book LangChain for Life Science and Healthcare, by Dr. Ivan Reznikov

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bXSuZZMUjI6IqWfPD-CO2BCt4NzzjROR?usp=sharing)

This notebook demonstrates how to build a very simple Retrieval-Augmented Generation (RAG) system using LangChain, combining document retrieval with conversational AI agents. We'll create a system that can answer questions about Nobel Prize winners by searching through PDF documents and external databases.

## Package Installation

In [None]:
# Install the required packages
!pip install -qU langchain langchain-community langchain-core langchain-openai langchain-text-splitters \
  openai tiktoken chromadb pypdf xmltodict


In [None]:
!pip freeze | grep "lang\|openai\|tiktoken\|chroma"

google-ai-generativelanguage==0.6.15
google-cloud-language==2.17.2
langchain==0.3.26
langchain-community==0.3.26
langchain-core==0.3.66
langchain-openai==0.3.27
langchain-text-splitters==0.3.8
langcodes==3.5.0
langsmith==0.4.1
language_data==1.3.0
libclang==18.1.1
openai==1.93.0


In [None]:
import pprint

# LangChain

## Environment Setup and API Configuration

In [None]:
from google.colab import userdata
import os

# Set OpenAI API key from Google Colab's user environment or default
def set_openai_api_key(default_key: str = "YOUR_API_KEY") -> None:
    """Set the OpenAI API key from Google Colab's user environment or use a default value."""

    os.environ["OPENAI_API_KEY"] = userdata.get("LC4LS_OPENAI_API_KEY") or default_key
    #os.environ["LANGCHAIN_TRACING_V2"]="true"
    #os.environ["LANGCHAIN_PROJECT"]="lc4ls-ch3-1"
    #os.environ["LANGCHAIN_API_KEY"] = userdata.get("LANGCHAIN_API_KEY")

try:
  set_openai_api_key()
except Exception as e:
  print(e)

The import cell below brings in the LangChain components used throughout this notebook:

**API Key Management:**
- The cell above defines a helper that reads the OpenAI API key from Colab's `userdata` secret storage instead of hard-coding it

**Core Components:**
- **ChatOpenAI**: Interface to OpenAI's chat models (GPT-4, GPT-3.5, etc.)
- **OpenAIEmbeddings**: Creates vector embeddings using OpenAI's embedding models
- **Chroma**: Vector database for storing and searching document embeddings

**Chain Components:**
- **RunnablePassthrough**: Passes data through unchanged in a chain
- **RunnableLambda**: Wraps Python functions to be used in LangChain chains

**Memory System:**
- **ConversationSummaryBufferMemory**: Maintains conversation history with automatic summarization
- **ChatMessageHistory**: Stores individual chat messages

**Prompt Management:**
- **ChatPromptTemplate**: Creates structured prompts for chat models
- **MessagesPlaceholder**: Placeholder for dynamic message insertion

**Document Processing:**
- **PyPDFLoader**: Loads and extracts text from PDF files
- **RecursiveCharacterTextSplitter**: Intelligently splits documents into chunks

**Agent System:**
- **AgentExecutor**: Runs agents with tools and memory
- **Various agent utilities**: For formatting and parsing agent interactions

**Configuration:**
- Uses GPT-4o-mini model for cost efficiency
- Enables verbose mode for detailed operation logging

In [None]:
# model
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

#index
from langchain_community.vectorstores import Chroma

# chains
from langchain_core.runnables import (
    RunnablePassthrough,
    RunnableLambda,
)

# memory
from langchain.memory import ConversationSummaryBufferMemory, ChatMessageHistory

# prompts
from langchain_core.prompts.chat import ChatPromptTemplate, MessagesPlaceholder

#tools
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.tools import format_tool_to_openai_function
from langchain.tools.retriever import create_retriever_tool
from langchain_community.tools.pubmed.tool import PubmedQueryRun

# agents
from langchain.agents import AgentExecutor
from langchain.agents.format_scratchpad import format_to_openai_function_messages
from langchain.agents.output_parsers.openai_functions import (
    OpenAIFunctionsAgentOutputParser,
)

# other imports
from operator import itemgetter

In [None]:
model_name = "gpt-4o-mini"
verbose = True

## Memory System Setup

**Memory Architecture:**
- **MEMORY_KEY**: Standardized key for accessing conversation history across the system
- **ConversationSummaryBufferMemory**: Advanced memory type that:
  - Keeps recent messages in full detail
  - Summarizes older messages to save token space
  - Maintains conversation context efficiently

**Initialization Process:**
1. **ChatMessageHistory**: Creates a storage container for individual messages
2. **Warm-up Exchange**: Pre-loads the conversation with an initial greeting to establish context
3. **Memory Configuration**:
   - Uses GPT model for generating summaries
   - `return_messages=True`: Returns messages in chat format
   - `memory_key`: Specifies how to access this memory in chains
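
The idea of keeping recent messages in full while summarizing older ones can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the class `SummaryBufferSketch`, the word-count tokenizer, and `naive_summarize` are hypothetical stand-ins (the notebook delegates summarization to the chat model).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (role, content)


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


@dataclass
class SummaryBufferSketch:
    """Toy memory: recent messages verbatim, older ones folded into a summary."""
    summarize: Callable[[str, List[Message]], str]
    max_tokens: int = 20
    summary: str = ""
    buffer: List[Message] = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        self.buffer.append((role, content))
        self._prune()

    def _prune(self) -> None:
        # Evict the oldest messages until the buffer fits the token budget.
        evicted: List[Message] = []
        while self.buffer and sum(count_tokens(c) for _, c in self.buffer) > self.max_tokens:
            evicted.append(self.buffer.pop(0))
        if evicted:
            self.summary = self.summarize(self.summary, evicted)

    def load(self) -> List[Message]:
        # The summary, if any, is returned as a leading system message.
        prefix = [("system", self.summary)] if self.summary else []
        return prefix + list(self.buffer)


def naive_summarize(existing: str, messages: List[Message]) -> str:
    """Hypothetical summarizer; a real setup would ask the LLM instead."""
    notes = "; ".join(f"{role} said {count_tokens(content)} words" for role, content in messages)
    return (existing + " " + notes).strip()


memory_sketch = SummaryBufferSketch(summarize=naive_summarize, max_tokens=12)
memory_sketch.add("human", "Hi, I want help with some questions")
memory_sketch.add("ai", "Sure, can you specify your task?")
memory_sketch.add("human", "Who won the chemistry prize?")
print(memory_sketch.load())
```

The token budget here is deliberately tiny so that eviction happens after two messages; a realistic limit would be in the hundreds or thousands of tokens.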

In [None]:
# Creating memory
MEMORY_KEY = "chat_history"

def setup_conversation_memory(model_name):
    """Initialize conversation memory with a warm-up exchange"""
    # Uses the module-level MEMORY_KEY defined above, so memory and prompt share one key

    # Create history with initial warm-up messages
    history = ChatMessageHistory()
    history.add_user_message("Hi, I want you to help me to answer some questions and complete a couple of tasks")
    history.add_ai_message("Hello! Sure I can help you. Can you specify your task?")

    # Create memory that can summarize conversation as it grows
    memory = ConversationSummaryBufferMemory(
        llm=ChatOpenAI(temperature=0, model_name=model_name),
        return_messages=True,
        memory_key=MEMORY_KEY,
        chat_memory=history
    )

    return memory, history

**Memory Testing:**
- `load_memory_variables({})`: Displays current memory state
- `predict_new_summary()`: Demonstrates how the system would summarize the conversation

This setup ensures the agent maintains context across multiple interactions while managing token usage efficiently.

In [None]:
memory, history = setup_conversation_memory(model_name)
print(memory.load_memory_variables({}))
memory.predict_new_summary(existing_summary="", messages=history.messages)

{'chat_history': [HumanMessage(content='Hi, I want you to help me to answer some questions and complete a couple of tasks', additional_kwargs={}, response_metadata={}), AIMessage(content='Hello! Sure I can help you. Can you specify your task?', additional_kwargs={}, response_metadata={})]}


'The human asks the AI for help with questions and tasks. The AI responds positively, asking for clarification on the specific tasks.'

## Prompt Template Creation

**Prompt Structure:**
1. **System Message**: Establishes the agent's role and behavior:
   - Specializes in scientific questions
   - Instructed to use available tools when needed
   - Emphasizes careful question comprehension
   - Provides clear fallback response for unknown information

2. **Memory Placeholder**: `MessagesPlaceholder(variable_name=memory_key)`
   - Dynamically inserts conversation history
   - Maintains context across interactions

3. **User Input**: `("user", "{input}")`
   - Placeholder for the current user question
   - Will be filled dynamically during execution

4. **Agent Scratchpad**: `MessagesPlaceholder(variable_name="agent_scratchpad")`
   - Shows the agent's reasoning process
   - Displays tool usage and intermediate steps
   - Critical for function-calling agents
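
The order of the four parts matters, because the model reads them as one message list. A minimal sketch of that assembly, using plain tuples instead of LangChain message objects (the function `assemble_messages` and the sample messages are illustrative assumptions, not part of the notebook):

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, content)

SYSTEM_TEXT = "You are an assistant who helps answer scientific questions."


def assemble_messages(
    user_input: str,
    chat_history: List[Message],
    agent_scratchpad: List[Message],
) -> List[Message]:
    """Return the messages in the order the template declares them."""
    return (
        [("system", SYSTEM_TEXT)]     # 1. behavioral instructions
        + list(chat_history)          # 2. memory placeholder
        + [("user", user_input)]      # 3. current question
        + list(agent_scratchpad)      # 4. tool calls and results so far
    )


history_example = [
    ("user", "Hi, I want you to help me"),
    ("assistant", "Hello! Can you specify your task?"),
]
scratchpad_example = [
    ("assistant", "call search_through_pdf_text"),
    ("function", "Quantum dots are nanometre-sized crystals"),
]
assembled = assemble_messages("What are quantum dots?", history_example, scratchpad_example)
for role, content in assembled:
    print(f"{role}: {content}")
```

On the first model call of a turn the scratchpad is empty; it grows by one tool call and one tool result each time the agent uses a tool.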

In [None]:
# Create a prompt using the template
def create_prompt_template(memory_key):
    """Create a prompt template with system message and placeholders"""
    return ChatPromptTemplate.from_messages([
        (
            "system",
            """
            You are an assistant who helps answer scientific questions.
            Use tools you have if required.
            Be sure to understand the question correctly.
            If you don't know the answer - respond, "Sorry, I don't know."
            """,
        ),
        MessagesPlaceholder(variable_name=memory_key),
        ("user", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ])

prompt = create_prompt_template(MEMORY_KEY)

**Purpose of this structure:**
- Provides clear behavioral guidelines
- Maintains conversation context
- Enables transparent reasoning process
- Supports tool integration seamlessly

## Document Loading and Processing

**Document Source:**
- **PDF URL**: Official Nobel Prize document containing detailed scientific information
- **PyPDFLoader**: LangChain's PDF processing tool that:
  - Downloads the PDF from the URL
  - Extracts text content
  - Preserves document structure and metadata
  - Handles multi-page documents automatically

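**Output Structure:**
- `pdf_doc[0:3]` displays the first three pages (the loader returns one `Document` per page)
- Each element contains:
  - `page_content`: the text extracted from the page
  - `metadata`: the source URL, the zero-based `page` index, `page_label`, `total_pages`, and PDF properties such as producer and creation date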

This provides the raw material that will be processed into searchable chunks for our RAG system.


In [None]:
pdf_link = "https://www.nobelprize.org/uploads/2023/10/advanced-chemistryprize2023.pdf"

In [None]:
pdf_loader = PyPDFLoader(pdf_link)
pdf_doc = pdf_loader.load()

In [None]:
pdf_doc[0:3]

[Document(metadata={'producer': 'Adobe Acrobat Pro (64-bit) 23.3.20284', 'creator': 'Adobe Acrobat Pro (64-bit) 23.3.20284', 'creationdate': '2023-09-30T09:06:00+02:00', 'moddate': '2023-10-26T14:26:47+02:00', 'title': '', 'source': 'https://www.nobelprize.org/uploads/2023/10/advanced-chemistryprize2023.pdf', 'total_pages': 19, 'page': 0, 'page_label': '1'}, page_content='Nobel Prize ® and the Nobel Prize ® medal design mark \nare registrated trademarks of the Nobel Foundation\n4 OCTOBER 2023\nScientific Background to the Nobel Prize in Chemistry 2023\nQUANTUM DOTS – SEEDS OF NANOSCIENCE\nThe Nobel Committee for Chemistry\nTHE ROYAL SWEDISH ACADEMY OF SCIENCES  has as its aim to promote the sciences and strengthen their influence in society.\nBOX 50005 (LILLA FRESCATIVÄGEN 4 A), SE-104 05 STOCKHOLM, SWEDEN \nTEL +46 8 673 95 00 \uf0ab WWW.KVA.SE'),
 Document(metadata={'producer': 'Adobe Acrobat Pro (64-bit) 23.3.20284', 'creator': 'Adobe Acrobat Pro (64-bit) 23.3.20284', 'creationdate'

## Text Chunking Strategy

**RecursiveCharacterTextSplitter Configuration:**
- **separators=["\n"]**: Splits only at newlines, so a chunk never breaks in the middle of a line
- **chunk_size=500**: Each chunk holds at most about 500 characters
  - Balances context preservation with embedding efficiency
  - A reasonable starting point; the best size depends on the documents and the embedding model
- **chunk_overlap=200**: Up to 200 characters are repeated between consecutive chunks
  - Reduces the chance that a statement is cut in half at a chunk boundary
  - Helps keep context continuous across chunks
- **keep_separator=False**: Does not keep the separator attached to each split piece; pieces merged into one chunk are still joined by newlines, as the `\n` characters in the sample chunk's `page_content` show

**What this configuration aims for:**
1. **Structural Integrity**: Chunks start and end on line boundaries. In text extracted from a PDF, a newline usually marks a line break rather than a paragraph break, so paragraphs can still be split
2. **Search Focus**: 500-character chunks keep each search result short and specific
3. **Context Preservation**: Overlap limits the information lost between chunks
4. **Embedding Efficiency**: Short chunks stay well within the input limits of embedding models
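
To make the interplay of `chunk_size` and `chunk_overlap` concrete, here is a simplified line-packing sketch. It is not the `RecursiveCharacterTextSplitter` algorithm itself: the function `chunk_lines` is a hypothetical helper that splits on newlines only, and a single line longer than `chunk_size` is left oversized.

```python
from typing import List


def chunk_lines(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Greedily pack lines into chunks of at most chunk_size characters,
    repeating trailing lines (up to chunk_overlap characters) in the next chunk."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks: List[str] = []
    current: List[str] = []
    for line in text.split("\n"):
        if current and len("\n".join(current + [line])) > chunk_size:
            chunks.append("\n".join(current))
            # Carry over the longest tail of the finished chunk that fits the overlap.
            tail: List[str] = []
            for previous in reversed(current):
                if len("\n".join([previous] + tail)) > chunk_overlap:
                    break
                tail.insert(0, previous)
            # Shrink the tail if it would push the new chunk over chunk_size.
            while tail and len("\n".join(tail + [line])) > chunk_size:
                tail.pop(0)
            current = tail
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


sample_text = "aaaa\nbbbb\ncccc\ndddd"
sample_chunks = chunk_lines(sample_text, chunk_size=10, chunk_overlap=5)
print(sample_chunks)
```

With these toy values each chunk holds two lines, and the last line of one chunk reappears as the first line of the next, which is the effect the 200-character overlap has at a larger scale.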

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n"],
    chunk_size=500,
    chunk_overlap=200,
    keep_separator=False,
)
chunks = text_splitter.split_documents(pdf_doc)

In [None]:
chunks[10]

Document(metadata={'producer': 'Adobe Acrobat Pro (64-bit) 23.3.20284', 'creator': 'Adobe Acrobat Pro (64-bit) 23.3.20284', 'creationdate': '2023-09-30T09:06:00+02:00', 'moddate': '2023-10-26T14:26:47+02:00', 'title': '', 'source': 'https://www.nobelprize.org/uploads/2023/10/advanced-chemistryprize2023.pdf', 'total_pages': 19, 'page': 3, 'page_label': '4'}, page_content='photocatalysis.  \nTheory and early observations of quantum size effects \nThe basic theoretical concept underlying quantum dots is referred to as the ‘ particle-in-a-box’ \nproblem. When a quantum mechanical particle, such as an electron, is confined inside a ‘box’ with \na size L  comparable to the particle’s de Broglie wavelength, the energies of the wave function’s \nallowed eigenstates depend critically on L, and the energy spacing ∆E scales as 1/L2. This concept')

## Vector Database Creation

**Vector Store Setup:**
- **Chroma**: Efficient vector database that:
  - Stores document chunks as vector embeddings
  - Enables fast similarity search
  - Maintains metadata associations
  - Supports filtering and advanced queries

- **OpenAIEmbeddings(model="text-embedding-3-large")**:
  - Uses `text-embedding-3-large`, the larger of OpenAI's third-generation embedding models
  - Creates high-dimensional vector representations of text
  - Captures semantic meaning and relationships
  - Enables accurate similarity matching

**Retriever Configuration:**
- **search_kwargs={"k": 3}**: Returns top 3 most relevant chunks for each query
- **Semantic Search**: Finds documents based on meaning, not just keyword matching
- **Relevance Ranking**: Orders results by similarity score

**Process Flow:**
1. Each text chunk → Vector embedding
2. User query → Vector embedding  
3. Similarity search → Top k relevant chunks
4. Retrieved chunks → Context for AI response

This creates the foundation for accurate, context-aware question answering.
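
The process flow above boils down to scoring every stored vector against the query vector and keeping the best `k`. A minimal sketch with hand-made two-dimensional vectors follows; the vectors, labels, and the choice of cosine similarity are illustrative assumptions (real embeddings have thousands of dimensions, and the distance metric actually used depends on how the Chroma collection is configured).

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: List[float], index: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k entries most similar to the query, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings": the first axis loosely stands for quantum-dot content.
toy_index = {
    "intro to quantum dots": [1.0, 0.0],
    "reference list": [0.0, 1.0],
    "synthesis method": [0.8, 0.6],
}
toy_query = [1.0, 0.0]
top_hits = top_k(toy_query, toy_index, k=2)
for doc_id, score in top_hits:
    print(f"{score:.2f}  {doc_id}")
```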

In [None]:
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-large"),
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

In [None]:
# Creating vectorstore

def load_index():
    ...

#retriever=load_index().vectorstore.as_retriever()

## Tool Development

1. **Document Retrieval Tool:**
   - **create_retriever_tool()**: Converts the vector store retriever into an agent tool
   - **Name**: "search_through_pdf_text" - Clear, descriptive identifier
   - **Description**: Tells the agent when and how to use this tool
   - **Function**: Searches through the Nobel Prize PDF for relevant information

2. **PubMed Tool:**
   - **PubmedQueryRun()**: Connects to PubMed database
   - Searches scientific literature and research papers
   - Provides access to millions of biomedical publications
   - Enables real-time research queries

**Tool Integration:**
- **tools = [PubmedQueryRun(), retrive_tool]**: Creates tool collection
- **format_tool_to_openai_function()**: Converts LangChain tools to OpenAI function format
- **formatted_functions**: Prepared for OpenAI's function calling system

**Agent Capabilities:**
- Can search local PDF knowledge
- Can query external scientific databases
- Automatically selects appropriate tool based on question type
- Combines multiple information sources for comprehensive answers
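
Conceptually, each tool is exposed to the model as a name, a description, and a JSON schema for its arguments; the model replies with the name of the tool to call and the arguments as a JSON string. The sketch below imitates that round trip in plain Python. The tool names match the ones that appear in this notebook's logs, but the two search functions, the `TOOLS` registry, and `dispatch` are hypothetical stand-ins rather than LangChain or OpenAI APIs.

```python
import json
from typing import Any, Dict


def search_pdf(query: str) -> str:
    """Hypothetical stand-in for the PDF retriever tool."""
    return f"[pdf results for: {query}]"


def search_pubmed(query: str) -> str:
    """Hypothetical stand-in for the PubMed tool."""
    return f"[pubmed results for: {query}]"


TOOLS: Dict[str, Dict[str, Any]] = {
    "search_through_pdf_text": {
        "description": "Searches the Nobel Prize PDF text.",
        "func": search_pdf,
    },
    "pub_med": {
        "description": "Searches PubMed for biomedical literature.",
        "func": search_pubmed,
    },
}


def to_function_spec(name: str, description: str) -> Dict[str, Any]:
    """Describe a single-argument tool as a JSON-schema function spec."""
    return {
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }


def dispatch(function_call: Dict[str, str]) -> str:
    """Run the tool named in a model-style function call."""
    arguments = json.loads(function_call["arguments"])
    return TOOLS[function_call["name"]]["func"](arguments["query"])


specs = [to_function_spec(name, tool["description"]) for name, tool in TOOLS.items()]
model_call = {"name": "pub_med", "arguments": json.dumps({"query": "cadmium selenide"})}
dispatched = dispatch(model_call)
print([spec["name"] for spec in specs])
print(dispatched)
```

The description string is the only hint the model gets about when a tool applies, so it is worth writing it as carefully as a prompt.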

In [None]:
retrive_tool = create_retriever_tool(
    retriever,
    "search_through_pdf_text",
    "This function searches and returns data from pdf text regarding nobel prize winners and their work",
)

tools = [PubmedQueryRun(), retrive_tool]
formatted_functions = [format_tool_to_openai_function(t) for t in tools]

## Agent Assembly

**Language Model Setup:**
- **temperature=0**: Makes responses as reproducible as the model allows, which suits factual question answering
- **llm_with_tools**: Binds available functions to the language model

**Agent Chain Architecture:**
1. **Input Processing:**
   - `"input": lambda x: x["input"]` - Extracts user question
   - `"agent_scratchpad"` - Formats intermediate reasoning steps

2. **Memory Integration:**
   - `RunnablePassthrough.assign()` - Adds memory to the chain
   - `memory.load_memory_variables()` - Retrieves conversation history
   - `itemgetter(MEMORY_KEY)` - Extracts specific memory content

3. **Chain Flow:**
   - Input → Memory loading → Prompt formatting → LLM processing → Output parsing

**Agent Executor:**
- **AgentExecutor**: Orchestrates the entire system
- **agent**: The reasoning chain we just created
- **tools**: Available functions (PDF search, PubMed)
- **memory**: Conversation memory system
- **verbose=True**: Shows detailed execution steps

**Result:**
A complete conversational AI system that can:
- Maintain conversation context
- Search through documents
- Query external databases
- Provide transparent reasoning
- Give answers grounded in the retrieved text
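
The executor's job can be pictured as a loop: ask the model what to do, run the requested tool, append the result to the intermediate steps, and repeat until the model returns a final answer. The following sketch shows that loop under simplifying assumptions. `run_agent` and `scripted_decide` are hypothetical names, and the scripted decision function replaces the LLM so the example runs offline; it is not how `AgentExecutor` is implemented internally.

```python
from typing import Callable, Dict, List, Tuple

Decision = Dict[str, str]   # either {"tool", "tool_input"} or {"output"}
Step = Tuple[Decision, str]  # (tool call, observation)


def run_agent(
    decide: Callable[[str, List[Step]], Decision],
    tools: Dict[str, Callable[[str], str]],
    user_input: str,
    max_iterations: int = 5,
) -> Dict[str, object]:
    """Alternate between the decision function and tools until a final answer."""
    intermediate_steps: List[Step] = []
    for _ in range(max_iterations):
        decision = decide(user_input, intermediate_steps)
        if "output" in decision:  # final answer: stop looping
            return {"output": decision["output"], "steps": intermediate_steps}
        observation = tools[decision["tool"]](decision["tool_input"])
        intermediate_steps.append((decision, observation))
    return {"output": "Stopped: iteration limit reached.", "steps": intermediate_steps}


def scripted_decide(user_input: str, steps: List[Step]) -> Decision:
    """Hypothetical stand-in for the LLM: search once, then answer."""
    if not steps:
        return {"tool": "search_through_pdf_text", "tool_input": user_input}
    return {"output": f"Based on the search: {steps[-1][1]}"}


fake_tools = {"search_through_pdf_text": lambda q: f"3 chunks about '{q}'"}
agent_result = run_agent(scripted_decide, fake_tools, "quantum dots")
print(agent_result["output"])
```

The iteration cap is a safeguard: a decision function that never returns a final answer would otherwise loop forever.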

In [None]:
llm = ChatOpenAI(temperature=0, model_name=model_name)
llm_with_tools = llm.bind(functions=formatted_functions)

In [None]:
agent = (
    {
        "input": lambda x: x["input"],
        "agent_scratchpad": lambda x: format_to_openai_function_messages(
            x["intermediate_steps"]
        ),
    }
    | RunnablePassthrough.assign(
        **{MEMORY_KEY: RunnableLambda(memory.load_memory_variables) | itemgetter(MEMORY_KEY)}
    )
    | prompt
    | llm_with_tools
    | OpenAIFunctionsAgentOutputParser()
)

qa = AgentExecutor(agent=agent, tools=tools, memory=memory, verbose=verbose)

## Query Examples and Testing

### Nobel Prize Literature Question
Let's now ask some questions regarding the Nobel Prize winners:

In [None]:
# Pretty-print the agent's final answer from an AgentExecutor result
def print_response_for_query(result):
    output_format = f"""
    ===================================
    {result['output']}
    ===================================
    """
    pprint.pprint(output_format)

In [None]:
query = "Who and for what won the Nobel prize in Literature in 2023?"
result = qa.invoke({"input": query})



> Entering new AgentExecutor chain...

Invoking: `search_through_pdf_text` with `{'query': 'Nobel Prize in Literature 2023'}`

Nobel Prize ® and the Nobel Prize ® medal design mark 
are registrated trademarks of the Nobel Foundation
4 OCTOBER 2023
Scientific Background to the Nobel Prize in Chemistry 2023
QUANTUM DOTS – SEEDS OF NANOSCIENCE
The Nobel Committee for Chemistry
THE ROYAL SWEDISH ACADEMY OF SCIENCES  has as its aim to promote the sciences and strengthen their influence in society.
BOX 50005 (LILLA FRESCATIVÄGEN 4 A), SE-104 05 STOCKHOLM, SWEDEN 
TEL +46 8 673 95 00  WWW.KVA.SE

1 (17)
 
Quantum dots — seeds of nanoscience 
The Royal Swedish Academy of Sciences has decided to award Moungi G. Bawendi, Louis 
E. Brus, and Aleksey I. Yekimov the Nobel Prize in Chemistry 2023, for the discovery and 
synthesis of quantum dots.
Introduction 
This year’s Nobel Prize in Chemistry recognizes the discovery and synthesis of nanometre-sized 
semiconductor crystals, the properties of which are determined by quantum size effects. Referred

Related Nobel Prizes 
The Nobel Prize in Chemistry 1996 – “f

In [None]:
print_response_for_query(result)

('\n'
 "    Sorry, I don't know.\n"
 '    ')


In [None]:
query = "Sorry, I meant chem"
result = qa.invoke({"input": query})



> Entering new AgentExecutor chain...

Invoking: `search_through_pdf_text` with `{'query': 'Nobel Prize in Chemistry 2023'}`

Nobel Prize ® and the Nobel Prize ® medal design mark 
are registrated trademarks of the Nobel Foundation
4 OCTOBER 2023
Scientific Background to the Nobel Prize in Chemistry 2023
QUANTUM DOTS – SEEDS OF NANOSCIENCE
The Nobel Committee for Chemistry
THE ROYAL SWEDISH ACADEMY OF SCIENCES  has as its aim to promote the sciences and strengthen their influence in society.
BOX 50005 (LILLA FRESCATIVÄGEN 4 A), SE-104 05 STOCKHOLM, SWEDEN 
TEL +46 8 673 95 00  WWW.KVA.SE

1 (17)
 
Quantum dots — seeds of nanoscience 
The Royal Swedish Academy of Sciences has decided to award Moungi G. Bawendi, Louis 
E. Brus, and Aleksey I. Yekimov the Nobel Prize in Chemistry 2023, for the discovery and 
synthesis of quantum dots.
Introduction 
This year’s Nobel Prize in Chemistry recognizes the discovery and synthesis of nanometre-sized 
semi

In [None]:
print_response_for_query(result)

('\n'
 '    The Nobel Prize in Chemistry 2023 was awarded to Moungi G. Bawendi, '
 'Louis E. Brus, and Aleksey I. Yekimov for their discovery and synthesis of '
 'quantum dots. These are nanometre-sized semiconductor crystals whose '
 'properties are determined by quantum size effects. Quantum dots represent a '
 'new class of materials that are distinct from both molecular and bulk '
 'materials.\n'
 '    ')


This demonstrates the agent's conversational abilities:
- **Context Understanding**: Recognizes "chem" refers to Chemistry based on conversation history
- **Memory Utilization**: Uses previous question context to understand the clarification  
- **Adaptive Search**: Reruns the PDF search with the corrected query, `Nobel Prize in Chemistry 2023`, as the log above shows
- **Conversation Flow**: Shows natural dialogue capabilities

### Multichain example

This tests technical document comprehension:
- **Scientific Terminology**: Searches for specific crystal structures
- **PDF Deep Search**: Requires detailed analysis of document content
- **Technical Accuracy**: Must identify correct chemical compounds and materials
- **Context Specificity**: Focuses on crystallographic information

In [None]:
query = "What crystals were used in the paper?"
result = qa.invoke({"input": query})



> Entering new AgentExecutor chain...

Invoking: `search_through_pdf_text` with `{'query': 'Nobel Prize in Chemistry 2023 quantum dots crystals'}`

1 (17)
 
Quantum dots — seeds of nanoscience 
The Royal Swedish Academy of Sciences has decided to award Moungi G. Bawendi, Louis 
E. Brus, and Aleksey I. Yekimov the Nobel Prize in Chemistry 2023, for the discovery and 
synthesis of quantum dots.
Introduction 
This year’s Nobel Prize in Chemistry recognizes the discovery and synthesis of nanometre-sized 
semiconductor crystals, the properties of which are determined by quantum size effects. Referred

This year’s Nobel Prize in Chemistry recognizes the discovery and synthesis of nanometre-sized 
semiconductor crystals, the properties of which are determined by quantum size effects. Referred 
to as quantum dots, such nanoparticles are so small that their physical size determines the 
quantum mechanical states of the material’s charge carriers. 
Qu
ant

In [None]:
print_response_for_query(result)

('\n'
 '    The paper discusses the discovery and synthesis of nanometre-sized '
 'semiconductor crystals known as quantum dots. However, it does not specify '
 'the exact types of crystals used in the synthesis of these quantum dots. '
 'Quantum dots can be made from various semiconductor materials, but the '
 'specific materials used by the Nobel laureates in their work are not '
 'detailed in the provided text.\n'
 '    ')


## Advanced Retrieval Configurations
**Basic Retrieval**: Standard similarity search returning top 3 most relevant chunks.

In [None]:
retriever = vectorstore.as_retriever(
   search_kwargs={"k": 3}
)
retriever.invoke(query)

[Document(metadata={'source': 'https://www.nobelprize.org/uploads/2023/10/advanced-chemistryprize2023.pdf', 'creator': 'Adobe Acrobat Pro (64-bit) 23.3.20284', 'total_pages': 19, 'creationdate': '2023-09-30T09:06:00+02:00', 'moddate': '2023-10-26T14:26:47+02:00', 'page': 16, 'producer': 'Adobe Acrobat Pro (64-bit) 23.3.20284', 'title': '', 'page_label': '17'}, page_content='153 (3), 989-&. DOI: DOI 10.1103/PhysRev.153.989. \n(50) Berry, C. R. Structure and Optical Absorption of Agi Microcrystals. Phys Rev 1967, 161 (3), \n848-&. DOI: DOI 10.1103/PhysRev.161.848. \n(51) Katzschmann, R.; Kranold, R.; Rehfeld, A. Optical-Anomaly of Small Particles in Glasses. \nPhys Status Solidi A 1977, 40 (2), K161-K161. DOI: DOI 10.1002/pssa.2210400255. \n(52) Montanarella, F.; Kovalenko, M. V. Three Millennia of Nanocrystals. A CS Nano 2022, 16'),
 Document(metadata={'title': '', 'source': 'https://www.nobelprize.org/uploads/2023/10/advanced-chemistryprize2023.pdf', 'moddate': '2023-10-26T14:26:47+02:

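**MMR (Maximal Marginal Relevance)**: A retrieval strategy that trades relevance against diversity:
- **fetch_k=10**: First retrieves the 10 chunks most similar to the query
- **k=3**: Returns 3 of those candidates
- **lambda_mult=0.25**: Sets the balance between relevance and diversity; 1 means minimum diversity (pure relevance) and 0 means maximum diversity, so 0.25 leans toward diverse results
- **Purpose**: Reduces redundancy among the returned chunks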
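
The selection rule behind MMR is short enough to write out. Each round picks the candidate that maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the candidate's highest similarity to anything already selected. The sketch below uses toy two-dimensional vectors and a hypothetical `mmr_select` helper; it illustrates the technique and is not the vector store's own code.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query_vec: List[float],
    vectors: Dict[str, List[float]],
    k: int,
    fetch_k: int,
    lambda_mult: float,
) -> List[str]:
    """Pick k ids from the fetch_k most relevant, penalizing near-duplicates."""
    # Stage 1: shortlist the fetch_k candidates most similar to the query.
    pool = sorted(vectors, key=lambda d: cosine(query_vec, vectors[d]), reverse=True)[:fetch_k]
    selected: List[str] = []
    # Stage 2: repeatedly take the candidate with the best relevance/redundancy trade-off.
    while pool and len(selected) < k:
        def score(doc_id: str) -> float:
            relevance = cosine(query_vec, vectors[doc_id])
            redundancy = max(
                (cosine(vectors[doc_id], vectors[s]) for s in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected


toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.99, 0.14],  # near-duplicate of "a"
    "c": [0.6, 0.8],    # less relevant, but different
}
diverse = mmr_select([1.0, 0.0], toy_vectors, k=2, fetch_k=3, lambda_mult=0.25)
relevant_only = mmr_select([1.0, 0.0], toy_vectors, k=2, fetch_k=3, lambda_mult=1.0)
print(diverse, relevant_only)
```

With `lambda_mult=0.25` the near-duplicate is skipped in favor of the more distinct vector, while `lambda_mult=1.0` reduces MMR to plain similarity ranking.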

In [None]:
retriever = vectorstore.as_retriever(
    search_type="mmr", search_kwargs={"k": 3, "fetch_k": 10, "lambda_mult":0.25}
)
retriever.invoke(query)

[Document(metadata={'moddate': '2023-10-26T14:26:47+02:00', 'total_pages': 19, 'title': '', 'creator': 'Adobe Acrobat Pro (64-bit) 23.3.20284', 'page_label': '17', 'page': 16, 'source': 'https://www.nobelprize.org/uploads/2023/10/advanced-chemistryprize2023.pdf', 'producer': 'Adobe Acrobat Pro (64-bit) 23.3.20284', 'creationdate': '2023-09-30T09:06:00+02:00'}, page_content='153 (3), 989-&. DOI: DOI 10.1103/PhysRev.153.989. \n(50) Berry, C. R. Structure and Optical Absorption of Agi Microcrystals. Phys Rev 1967, 161 (3), \n848-&. DOI: DOI 10.1103/PhysRev.161.848. \n(51) Katzschmann, R.; Kranold, R.; Rehfeld, A. Optical-Anomaly of Small Particles in Glasses. \nPhys Status Solidi A 1977, 40 (2), K161-K161. DOI: DOI 10.1002/pssa.2210400255. \n(52) Montanarella, F.; Kovalenko, M. V. Three Millennia of Nanocrystals. A CS Nano 2022, 16'),
 Document(metadata={'producer': 'Adobe Acrobat Pro (64-bit) 23.3.20284', 'title': '', 'page': 5, 'creator': 'Adobe Acrobat Pro (64-bit) 23.3.20284', 'total_

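**Filtered Retrieval**: Adds metadata filtering to search:
- **filter={"page": {"$lte": 12}}**: Only returns chunks whose `page` metadata is 12 or less. The `page` value is zero-based (the first page has `'page': 0, 'page_label': '1'`), so this covers the first 13 pages
- **Use Case**: Focus on specific document sections
- **Efficiency**: Reduces search space for targeted queries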
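
A filter of this shape is a small predicate over each chunk's metadata. The sketch below evaluates such a predicate in plain Python to show which pages pass. The `matches` helper and the operator table are illustrative assumptions that mimic the `$lte`-style syntax; they are not Chroma's implementation. The toy metadata follows the loader output shown earlier, where `page` is zero-based and `page_label` is the printed page number.

```python
from typing import Any, Callable, Dict, List

OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "$eq": lambda value, bound: value == bound,
    "$ne": lambda value, bound: value != bound,
    "$lt": lambda value, bound: value < bound,
    "$lte": lambda value, bound: value <= bound,
    "$gt": lambda value, bound: value > bound,
    "$gte": lambda value, bound: value >= bound,
}


def matches(metadata: Dict[str, Any], where: Dict[str, Any]) -> bool:
    """Return True if the metadata satisfies every condition in the filter."""
    for key, condition in where.items():
        if key not in metadata:
            return False
        value = metadata[key]
        if isinstance(condition, dict):
            # Operator form, e.g. {"$lte": 12}
            if not all(OPERATORS[op](value, bound) for op, bound in condition.items()):
                return False
        elif value != condition:
            # Shorthand form, e.g. {"page": 3} means equality
            return False
    return True


toy_chunks: List[Dict[str, Any]] = [
    {"page": 0, "page_label": "1"},
    {"page": 12, "page_label": "13"},
    {"page": 13, "page_label": "14"},
    {"page": 16, "page_label": "17"},
]
kept = [chunk for chunk in toy_chunks if matches(chunk, {"page": {"$lte": 12}})]
print([chunk["page_label"] for chunk in kept])
```

The chunk with page label 13 passes the filter because its zero-based `page` value is 12.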

In [None]:
retriever = vectorstore.as_retriever(
   search_kwargs={"k": 3, "filter": {"page": {"$lte": 12}}}
)
retriever.invoke(query)

[Document(metadata={'moddate': '2023-10-26T14:26:47+02:00', 'creationdate': '2023-09-30T09:06:00+02:00', 'producer': 'Adobe Acrobat Pro (64-bit) 23.3.20284', 'creator': 'Adobe Acrobat Pro (64-bit) 23.3.20284', 'page_label': '6', 'total_pages': 19, 'source': 'https://www.nobelprize.org/uploads/2023/10/advanced-chemistryprize2023.pdf', 'page': 5, 'title': ''}, page_content='Institute.27 He aimed to understand the chemical composition and structure of colloidal particles \nin coloured glasses, as well as the mechanism of their growth. 27 Using techniques familiar to him \nfrom his PhD training in semiconductor physics, he and his co-workers measured the optical \nabsorption spectrum of heat-treated silicate glasses with additions of Cu and Cl of the order of a \nfew percent, above the solubility limit of the matrix. At the cryogenic temperature of 4.2 K, the'),
 Document(metadata={'creationdate': '2023-09-30T09:06:00+02:00', 'producer': 'Adobe Acrobat Pro (64-bit) 23.3.20284', 'title': ''

## Complex Multi-Tool Query

- **Instruction Following**: Must exclude specific phrases as requested
- **Focus Refinement**: Concentrate on chemical compounds rather than general information
- **Query Optimization**: Requires reformulating search strategy
- **Technical Precision**: Demands specific chemical terminology


In [None]:
query = "What crystals were used in the paper? exclude mentioning Nobel Prize in Chemistry 2023 in the query, focus on the chemical compounds"
result = qa.invoke({"input": query})



> Entering new AgentExecutor chain...

Invoking: `search_through_pdf_text` with `{'query': 'crystals used in quantum dots synthesis'}`

dots with much more well-defined size and with high optical quality.62 This discovery opened the 
door to the development of applications of colloidal quantum dots. Their synthesis begins with 
the injection and immediate pyrolysis of organometallic reagents (the precursors for the desired 
nanoparticles) into a hot coordinating solvent with a high boiling point. The rapid increase of 
reagent concentrations results in abrupt supersaturation and in nucleation that takes place at a

8 (17) 
iso
late inherently size -dependent behaviour.58 For example, luminescence from nanocrystals 
available in the late 1980s was typically limited by a quantum yield of only a few percent.59, 61 
In 1993, Moungi Bawendi and co-workers developed a method for the synthesis of quantum 
dots with much more well-defined size and with 

In [None]:
print_response_for_query(result)

('\n'
 '    The synthesis of quantum dots typically involves the use of '
 'organometallic reagents as precursors for the desired nanoparticles. '
 'However, the specific chemical compounds or crystals used in the synthesis '
 'are not detailed in the retrieved text. Generally, quantum dots can be made '
 'from various semiconductor materials, such as cadmium selenide (CdSe), '
 'cadmium telluride (CdTe), and lead sulfide (PbS), among others. If you need '
 'more specific information, please let me know!\n'
 '    ')


The next query is designed to require multi-step reasoning:

**Step 1**: Search PDF for crystal mentions (uses retrieval tool)

**Step 2**: Extract specific crystal names from results  

**Step 3**: Query PubMed for recent publications on each crystal (uses PubMed tool)

**Step 4**: Organize and present findings systematically

**Agent Capabilities Tested:**
- **Multi-tool coordination**: Using both PDF search and PubMed
- **Information synthesis**: Combining local and external data sources
- **Complex query decomposition**: Breaking down multi-part requests
- **Data organization**: Presenting structured results clearly

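This query exercises the agent end to end by combining conversation context with a live external database search. In the recorded run below, the first tool call goes straight to PubMed with `cadmium selenide`, a compound the agent listed from general knowledge in its previous answer rather than one retrieved from the PDF, so the compound list should not be read as coming from the paper.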

In [None]:
query = "What are the titles of the 3 most recent publications for each of the crystals mentioned in the paper?"
result = qa.invoke({"input": query})



> Entering new AgentExecutor chain...

Invoking: `pub_med` with `{'query': 'cadmium selenide'}`

Too Many Requests, waiting for 0.20 seconds...
Too Many Requests, waiting for 0.40 seconds...
Published: 2025-06-12
Title: Doxorubicin detection in plasma and blood using a luminescence turn-off nanosensor based on alloyed CdZnSeS/ZnS quantum dots.
Copyright Information: © 2025. The Author(s), under exclusive licence to Springer-Verlag GmbH Austria, part of Springer Nature.
Summary::
The feasibility of doxorubicin (Dox) detection in plasma and blood using a luminescence turn-off nanosensor based on alloyed CdZnSeS/ZnS quantum dots (QDs) is demonstrated. The QD luminescence quenching in the presence of Dox was carefully examined. Optimal assay conditions were selected, including plasma dilution and QD concentration. The calculated limit of detection value was 0.02 µg/mL. Time-resolved QD luminescence quenching and absorption measurements allowed us to

In [None]:
print_response_for_query(result)

('\n'
 '    Here are the three most recent publications for each of the mentioned '
 'crystals:\n'
 '\n'
 '### Cadmium Selenide (CdSe)\n'
 '1. **Title:** Doxorubicin detection in plasma and blood using a luminescence '
 'turn-off nanosensor based on alloyed CdZnSeS/ZnS quantum dots.\n'
 '   - **Published:** 2025-06-12\n'
 '\n'
 '2. **Title:** Highly sensitive CdSe/ZnS quantum dots immunosensor via DNA '
 'tetrahedra nanostructure-assembled with multiple hybridization chain '
 'reaction.\n'
 '   - **Published:** 2025-05-24\n'
 '\n'
 '3. **Title:** [No additional publication found in the search results.]\n'
 '\n'
 '### Cadmium Telluride (CdTe)\n'
 '1. **Title:** A Very Thin MCT Film in HDVIP Achieves High Absorption.\n'
 '   - **Published:** 2025-06-13\n'
 '\n'
 '2. **Title:** Quantitative assessment of ohmic-type CdTe sensor response in '
 'a photon-counting X-ray imaging detector under continuous 12-49 keV '
 'irradiations.\n'
 '   - **Published:** 2025-07-01\n'
 '\n'
 '3. **Title:** [

## Summary

This notebook demonstrates a complete RAG (Retrieval-Augmented Generation) system that:

1. **Processes Documents**: Loads, chunks, and indexes PDF content
2. **Creates Vector Database**: Enables semantic search capabilities  
3. **Implements Memory**: Maintains conversation context
4. **Provides Tools**: Combines local search with external databases
5. **Uses Agents**: Orchestrates complex, multi-step reasoning
6. **Delivers Results**: Provides accurate, context-aware answers