### What Are Retrievers?
Retrievers are components in LangChain designed to fetch relevant documents from a data source in response to a user’s query.

- Input: User’s query

- Process: Searching a data source for documents relevant to the query.

- Output: Multiple relevant document objects (LangChain document objects)

The data source can be anything, such as a vector store or an API. Internally, retrievers act like search engines, scanning the data source to identify the most relevant documents and returning them to the user.

### Key Points About Retrievers in LangChain

1. There is not just one retriever; multiple retrievers exist for different use cases.
2. All retrievers in LangChain are “runnables,” meaning they expose an `invoke` method and can be plugged into chains alongside other runnables such as prompts and models.
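
To make the input/process/output contract concrete, here is a minimal sketch in plain Python. It is not LangChain's implementation: `ToyDocument` and `ToyKeywordRetriever` are hypothetical names invented for this illustration, and the scoring (counting shared words) is an assumption chosen for simplicity. The only point is the shape of the interface: a query string goes into `invoke`, and a list of document objects comes out.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a LangChain document object."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class ToyKeywordRetriever:
    """Hypothetical retriever: ranks documents by how many query words they share."""

    def __init__(self, documents, k=2):
        self.documents = documents
        self.k = k

    def invoke(self, query):
        query_words = set(query.lower().split())
        scored = []
        for doc in self.documents:
            overlap = len(query_words & set(doc.page_content.lower().split()))
            if overlap > 0:
                scored.append((overlap, doc))
        # Highest overlap first; list.sort is stable, so ties keep corpus order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[: self.k]]


toy_corpus = [
    ToyDocument("retrievers fetch relevant documents for a query"),
    ToyDocument("vector stores hold embeddings"),
    ToyDocument("a query is matched against documents"),
]
toy_retriever = ToyKeywordRetriever(toy_corpus, k=2)
toy_results = toy_retriever.invoke("relevant documents for my query")
for doc in toy_results:
    print(doc.page_content)
```

Any object with this shape can be swapped for another without changing the calling code, which is the practical benefit of every retriever sharing the same `invoke` entry point.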


### Types of Retrievers
Retrievers can be categorized in two main ways:

1. Based on Data Source:

- Wikipedia Retriever: Queries Wikipedia API to fetch relevant articles based on keywords in the user’s query. It uses keyword matching rather than semantic or syntactic search.

- Vector Store Retriever: Fetches relevant documents from vector databases like Chroma or FAISS. It performs semantic search using embeddings, comparing query and document vectors.

- Arxiv Retriever: Searches research papers on the arXiv preprint archive.

2. Based on Retrieval Strategy:

Different retrievers use different search mechanisms to find relevant documents:

- MMR (Maximum Marginal Relevance) Retriever: Balances relevance and diversity in search results.

- Multi-Query Retriever: Breaks down ambiguous queries into multiple clearer queries and merges results.

- Contextual Compression Retriever: Compresses documents after retrieval to keep only relevant content based on the query.

### Wikipedia Retriever
This retriever queries the Wikipedia API to fetch relevant articles matching a user query. It is keyword-based and not semantic search. Upon receiving a query, it sends it to Wikipedia, retrieves relevant articles, and returns them as LangChain document objects. 


In [1]:
from langchain_community.retrievers import  WikipediaRetriever


In [3]:
retriever = WikipediaRetriever(top_k_results=2, lang="en")

query = "Research The Topic of What is Agentic AI ?"

docs = retriever.invoke(query)

In [4]:
# Print retrieved content
for i, doc in enumerate(docs):
    print(f"\n--- Result {i+1} ---")
    print(f"Content:\n{doc.page_content}...")  # the trailing "..." only marks the end of each result


--- Result 1 ---
Content:
Perplexity AI, Inc., or simply Perplexity, is an American privately held software company offering a web search engine that processes user queries and synthesizes responses. It uses large language models and incorporates real-time web search capabilities, enabling it to provide responses based on current Internet content. With a conversational approach, Perplexity allows users to ask follow-up questions and receive contextual answers. All responses include citations to their sources from the Internet to support transparency and allow users to verify information. A free public version is available, while a paid Pro subscription offers access to more advanced language models and additional features.
Perplexity AI, Inc. was founded in 2022 by Aravind Srinivas, Denis Yarats, Johnny Ho, and Andy Konwinski. As of July 2025, the company was valued at US$18 billion.
Perplexity AI has attracted legal scrutiny over allegations of copyright infringement, unauthorized co

### Vector Store Retriever
The vector store retriever is the most common type in LangChain. It sits on top of a vector database such as Chroma or FAISS, in which documents are stored as dense vectors (embeddings). When a query arrives, it is converted into an embedding vector with the same embedding model, and a semantic similarity search across the stored document vectors retrieves the most relevant documents.
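
The similarity search described above can be sketched without any vector database. The example below is an illustration under stated assumptions, not how Chroma, FAISS, or Qdrant are implemented: the three-dimensional vectors are made-up toy embeddings, and `similarity_search` here is a hypothetical helper that does a brute-force cosine-similarity ranking.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_search(query_vector, stored, k=2):
    """Return the texts of the k stored vectors closest to the query vector."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy (text, embedding) pairs; real embeddings have hundreds or thousands of dimensions.
toy_store = [
    ("attention mechanisms", [0.9, 0.1, 0.0]),
    ("recurrent networks", [0.6, 0.8, 0.0]),
    ("cooking recipes", [0.0, 0.1, 0.9]),
]
toy_query_vector = [1.0, 0.0, 0.0]
top_texts = similarity_search(toy_query_vector, toy_store, k=2)
print(top_texts)
```

Production vector stores typically replace the brute-force ranking with an approximate nearest-neighbour index, but the ranking criterion is the same idea.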


In [1]:
import os
from dotenv import load_dotenv
from langchain_google_genai import GoogleGenerativeAIEmbeddings

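load_dotenv()  # loads GOOGLE_API_KEY from a local .env file into the environment
if not os.getenv("GOOGLE_API_KEY"):
    raise RuntimeError("GOOGLE_API_KEY is not set; add it to your .env file")
embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")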

In [2]:
import os
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# 1. Load PDF and extract text
loader = PyMuPDFLoader("Attenstion all you Need.pdf")
documents = loader.load()

# 2. Split text into chunks
splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = splitter.split_documents(documents)

# 3. Initialize embeddings
embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")

# 4. Set up persistent Qdrant storage in Output folder
output_folder = os.path.join(os.getcwd(), "Output")
os.makedirs(output_folder, exist_ok=True)
qdrant_path = os.path.join(output_folder, "qdrant_db")

client = QdrantClient(path=qdrant_path)

vector_size = len(embeddings.embed_query("sample text"))
print(vector_size)

if not client.collection_exists("test"):
    client.create_collection(
        collection_name="test",
        vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE)
    )

# 5. Store embeddings in Qdrant
vector_store = QdrantVectorStore(
    client=client,
    collection_name="test",
    embedding=embeddings,
)

vector_store.add_documents(docs)
print(f"Stored {len(docs)} chunks in Qdrant at {qdrant_path}")

3072
Stored 15 chunks in Qdrant at t:\LangGraph\Langchain\Output\qdrant_db


In [9]:
# Wrap the vector store as a retriever that returns the top 2 matches
retriever = vector_store.as_retriever(search_kwargs={"k": 2})
query = "What is Transformer architecture?"
results = retriever.invoke(query)

In [10]:
results

[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'source': 'Attenstion all you Need.pdf', 'file_path': 'Attenstion all you Need.pdf', 'total_pages': 15, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'trapped': '', 'modDate': 'D:20240410211143Z', 'creationDate': 'D:20240410211143Z', 'page': 1, '_id': '16b51798c77b471796d321685fde1e11', '_collection_name': 'test'}, page_content='1\nIntroduction\nRecurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks\nin particular, have been firmly established as state of the art approaches in sequence modeling and\ntransduction problems such as language modeling and machine translation [35, 2, 5]. Numerous\nefforts have since continued to push the boundaries of recurrent language models and encoder-decoder\narchitectures [38, 24, 15].\nRecurrent models typically f

In [11]:
for i, doc in enumerate(results):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content)


--- Result 1 ---
1
Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks
in particular, have been firmly established as state of the art approaches in sequence modeling and
transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous
efforts have since continued to push the boundaries of recurrent language models and encoder-decoder
architectures [38, 24, 15].
Recurrent models typically factor computation along the symbol positions of the input and output
sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden
states ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently
sequential nature precludes parallelization within training examples, which becomes critical at longer
sequence lengths, as memory constraints limit batching across examples. Recent work has achieved
significant improvements in computational efficie

In [12]:
results = vector_store.similarity_search(query, k=2)
for i, doc in enumerate(results):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content)


--- Result 1 ---
1
Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks
in particular, have been firmly established as state of the art approaches in sequence modeling and
transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous
efforts have since continued to push the boundaries of recurrent language models and encoder-decoder
architectures [38, 24, 15].
Recurrent models typically factor computation along the symbol positions of the input and output
sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden
states ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently
sequential nature precludes parallelization within training examples, which becomes critical at longer
sequence lengths, as memory constraints limit batching across examples. Recent work has achieved
significant improvements in computational efficie

### MMR (Maximum Marginal Relevance) Retriever
MMR addresses a common problem of redundancy in retrieval results. For example, a simple similarity search might return multiple documents expressing the same idea, reducing diversity in retrieved content. MMR ensures that documents returned are not only relevant to the query but also diverse from each other.

- MMR first selects the most relevant document.
- Subsequent documents are chosen for both relevance to the query and dissimilarity to the documents already selected.
- This reduces redundancy and provides a broader perspective on the query topic.
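
The selection loop described above can be sketched in a few lines of plain Python. This is a minimal illustration of the commonly used MMR scoring rule, `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, and is not taken from LangChain's source. The helper name `mmr_select`, the toy texts, and the toy vectors are all assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(query_vector, candidates, k=2, lambda_mult=0.5):
    """Greedy MMR over (text, vector) pairs; returns the selected texts in order."""
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        best_item, best_score = None, -math.inf
        for item in remaining:
            relevance = cosine_similarity(query_vector, item[1])
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine_similarity(item[1], chosen[1]) for chosen in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_item, best_score = item, score
        selected.append(best_item)
        remaining.remove(best_item)
    return [text for text, _ in selected]


toy_candidates = [
    ("The Transformer relies on self-attention", [0.8, 0.6, 0.0]),
    ("Self-attention is what the Transformer relies on", [0.78, 0.62, 0.0]),  # near-duplicate
    ("Training cost of the Transformer", [0.7, 0.0, 0.7]),
]
toy_query_vector = [1.0, 0.0, 0.0]

balanced = mmr_select(toy_query_vector, toy_candidates, k=2, lambda_mult=0.5)
relevance_only = mmr_select(toy_query_vector, toy_candidates, k=2, lambda_mult=1.0)
print(balanced)
print(relevance_only)
```

With `lambda_mult=1.0` the redundancy term vanishes and the near-duplicate is picked second; with `lambda_mult=0.5` the near-duplicate is penalized and the document about training cost is picked instead.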


In [3]:
retriever = vector_store.as_retriever(
    search_type="mmr",                   # <-- This enables MMR
    search_kwargs={"k": 3, "lambda_mult": 0.5}  # k = number of results; lambda_mult: 1 = least diversity, 0 = most diversity
)

In [4]:
query = "What is Transformer architecture?"
results = retriever.invoke(query)

In [5]:
for i, doc in enumerate(results):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content)


--- Result 1 ---
1
Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks
in particular, have been firmly established as state of the art approaches in sequence modeling and
transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous
efforts have since continued to push the boundaries of recurrent language models and encoder-decoder
architectures [38, 24, 15].
Recurrent models typically factor computation along the symbol positions of the input and output
sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden
states ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently
sequential nature precludes parallelization within training examples, which becomes critical at longer
sequence lengths, as memory constraints limit batching across examples. Recent work has achieved
significant improvements in computational efficie

### Multi-Query Retriever
This retriever is designed to handle ambiguous or broad queries by breaking them down into multiple specific queries using an LLM (Large Language Model). The process:

1. An ambiguous user query is sent to an LLM.
2. The LLM generates multiple related but more specific queries.
3. Each query is sent to a standard retriever (e.g., similarity-based) to fetch results.
4. The results are merged, duplicates are removed, and the top results are returned to the user.

Example: For the query “How can I stay healthy?”, the LLM might generate sub-queries like “What foods maintain good health?”, “How often should I exercise?”, etc.
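
The fan-out and merge steps can be sketched without an LLM. In the illustration below, `generate_queries` is a hypothetical stand-in that returns hard-coded rephrasings instead of calling a model, `keyword_retrieve` is a hypothetical word-overlap retriever, and keeping the original question as the first query is a choice made for this sketch. None of this is LangChain's own code; it only shows how results from several queries are merged with duplicates removed.

```python
def generate_queries(question):
    """Hypothetical stand-in for the LLM step: returns fixed rephrasings."""
    return [
        question,
        "What foods maintain good health?",
        "How often should I exercise?",
    ]


def keyword_retrieve(query, corpus, k=2):
    """Hypothetical base retriever: ranks texts by shared lowercase words."""
    query_words = set(query.lower().replace("?", "").split())
    scored = [(len(query_words & set(text.lower().split())), text) for text in corpus]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


def multi_query_retrieve(question, corpus, k=2):
    """Run every generated query, then merge results while dropping duplicates."""
    merged, seen = [], set()
    for query in generate_queries(question):
        for text in keyword_retrieve(query, corpus, k):
            if text not in seen:
                seen.add(text)
                merged.append(text)
    return merged


toy_corpus = [
    "leafy foods and fruit support good health",
    "adults should exercise most days",
    "sleep and exercise help you stay healthy",
    "the stock market closed higher",
]
question = "How can I stay healthy?"
single = keyword_retrieve(question, toy_corpus)
merged = multi_query_retrieve(question, toy_corpus)
print(single)
print(merged)
```

The single query only reaches the document that happens to share its wording, while the merged result also surfaces the documents about food and exercise. The sleep document is matched by two queries but appears once.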

In [6]:
# Create retrievers
similarity_retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 5})

In [7]:
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_google_genai import ChatGoogleGenerativeAI
multiquery_retriever = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(search_kwargs={"k": 5}),
    llm=ChatGoogleGenerativeAI(model="gemini-2.5-flash")
)

In [8]:
query = "What is Transformer architecture?"
similarity_results = similarity_retriever.invoke(query)
multiquery_results = multiquery_retriever.invoke(query)

In [9]:
for i, doc in enumerate(similarity_results):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content)

print("*"*150)

for i, doc in enumerate(multiquery_results):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content)


--- Result 1 ---
1
Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks
in particular, have been firmly established as state of the art approaches in sequence modeling and
transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous
efforts have since continued to push the boundaries of recurrent language models and encoder-decoder
architectures [38, 24, 15].
Recurrent models typically factor computation along the symbol positions of the input and output
sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden
states ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently
sequential nature precludes parallelization within training examples, which becomes critical at longer
sequence lengths, as memory constraints limit batching across examples. Recent work has achieved
significant improvements in computational efficie

### Contextual Compression Retriever
This advanced retriever improves retrieval quality by compressing documents after retrieval, keeping only parts relevant to the user query and discarding irrelevant content.

Problem addressed:
Many documents contain multiple topics. For example, a single document might talk about both the Grand Canyon and photosynthesis. If a query asks about photosynthesis, returning the entire document may include irrelevant sections, reducing answer quality.

How it works:

1. A base retriever fetches relevant documents for the query.
2. The documents and query are sent to an LLM-based compressor.
3. The compressor trims each document, keeping only the parts relevant to the query.
4. The compressed documents are returned.

This helps reduce noise and improves user experience by providing precise, focused information. It is especially useful for large documents or when you want to reduce context length for downstream LLM processing.
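
The trimming step can be illustrated with a toy compressor. The sketch below is an assumption-laden stand-in: instead of an LLM, the hypothetical `compress` function keeps only sentences that share a non-trivial word with the query, and the small stop-word set is made up for this example. A real LLM-based compressor judges relevance by meaning rather than by shared words.

```python
import re

# Hypothetical, deliberately tiny stop-word list for this illustration.
STOP_WORDS = {"what", "is", "the", "a", "of", "how", "does", "and"}


def compress(text, query):
    """Keep only the sentences that share a non-stop-word with the query."""
    query_words = {
        word for word in re.findall(r"[a-z]+", query.lower()) if word not in STOP_WORDS
    }
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [
        sentence
        for sentence in sentences
        if query_words & set(re.findall(r"[a-z]+", sentence.lower()))
    ]
    return " ".join(kept)


toy_document = (
    "The Grand Canyon is a steep canyon carved by the Colorado River. "
    "Photosynthesis is how plants convert light into chemical energy. "
    "The canyon is located in Arizona. "
    "Photosynthesis takes place in chloroplasts."
)
compressed = compress(toy_document, "What is photosynthesis?")
print(compressed)
```

Only the two photosynthesis sentences survive, so a downstream model receives a shorter, on-topic context.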

In [10]:
from langchain.retrievers.document_compressors import LLMChainExtractor
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash")
compressor = LLMChainExtractor.from_llm(llm)
base_retriever = vector_store.as_retriever(search_kwargs={"k": 5})

In [11]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever

compression_retriever = ContextualCompressionRetriever(
    base_retriever=base_retriever,
    base_compressor=compressor
)

In [12]:
# Query the retriever
query = "What is Transformer architecture?"
compressed_results = compression_retriever.invoke(query)

In [None]:
for i, doc in enumerate(compressed_results):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content)


--- Result 1 ---
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output.
To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

--- Result 2 ---
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output.
To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

--- Result 3 ---
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully
connected l
