# Retrievers

## Wikipedia Retriever

In [15]:
from langchain_community.retrievers import WikipediaRetriever

# Initialize the retriever (optional: set the language and the number of results via top_k_results)
retriever = WikipediaRetriever(top_k_results=2, lang="en")

In [16]:

# Define your query
query = "the geopolitical history of india and pakistan from the perspective of a china"

# Get relevant Wikipedia documents
docs = retriever.invoke(query)

In [17]:
print(docs[0].page_content)

The India–Pakistan war of 1965, also known as the second India–Pakistan war, was an armed conflict between Pakistan and India that took place from August 1965 to September 1965.
The conflict began following Pakistan's unsuccessful Operation Gibraltar, which was designed to infiltrate forces into Jammu and Kashmir to precipitate an insurgency against Indian rule. The seventeen day war caused thousands of casualties on both sides and witnessed the largest engagement of armoured vehicles and the largest tank battle since World War II. Hostilities between the two countries ended after a ceasefire was declared through UNSC Resolution 211 following a diplomatic intervention by the Soviet Union and the United States, and the subsequent issuance of the Tashkent Declaration. Much of the war was fought by the countries' land forces in Kashmir and along the border between India and Pakistan. This war saw the largest amassing of troops in Kashmir since the Partition of India in 1947, a number that

In [18]:
for doc in docs:
    print(doc.metadata)
    print(doc.metadata['source'])

{'title': 'India–Pakistan war of 1965', 'summary': "The India–Pakistan war of 1965, also known as the second India–Pakistan war, was an armed conflict between Pakistan and India that took place from August 1965 to September 1965.\nThe conflict began following Pakistan's unsuccessful Operation Gibraltar, which was designed to infiltrate forces into Jammu and Kashmir to precipitate an insurgency against Indian rule. The seventeen day war caused thousands of casualties on both sides and witnessed the largest engagement of armoured vehicles and the largest tank battle since World War II. Hostilities between the two countries ended after a ceasefire was declared through UNSC Resolution 211 following a diplomatic intervention by the Soviet Union and the United States, and the subsequent issuance of the Tashkent Declaration. Much of the war was fought by the countries' land forces in Kashmir and along the border between India and Pakistan. This war saw the largest amassing of troops in Kashmi

## Vector Store

In [19]:
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

hf_embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={"device": "cuda"}
    ) 

vectorstore = Chroma(
    persist_directory="my_chroma_db",
    embedding_function=hf_embeddings,
    collection_name="sample"
)

In [20]:
vectorstore.get(include=['embeddings','documents', 'metadatas'])

{'ids': ['e9851729-4def-49ab-8d91-6dff1f0de97d',
  'c8590840-3c64-4d3d-a7de-5c41eaddf66a',
  'cd667cb3-d609-47c8-a4c0-c9a5ea9958f7',
  'e629abfe-9ab7-4bdd-bd83-fd9243ec770c',
  '7ea6f0e6-7ac2-4ed8-bde3-2ccd1e41ab4e'],
 'embeddings': array([[ 0.00994725,  0.0691433 , -0.0514712 , ..., -0.03543341,
          0.01284809,  0.01248289],
        [ 0.00127746,  0.03129849, -0.02375378, ..., -0.00518363,
         -0.03280615,  0.02737718],
        [-0.10265911,  0.02650811,  0.02271506, ..., -0.03359749,
         -0.07984943, -0.01507711],
        [ 0.02123394, -0.02468549, -0.04494366, ..., -0.1099581 ,
          0.0057256 ,  0.09915379],
        [ 0.01873979,  0.04382843, -0.0430425 , ..., -0.07801619,
         -0.07840686, -0.00304195]], shape=(5, 384)),
 'documents': ['Virat Kohli is one of the most successful and consistent batsmen in IPL history. Known for his aggressive batting style and fitness, he has led the Royal Challengers Bangalore in multiple seasons.',
  "Rohit Sharma is the mo

In [21]:
# Wrap the vector store as a retriever that returns the top 2 matches
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

In [22]:
retriever

VectorStoreRetriever(tags=['Chroma', 'HuggingFaceEmbeddings'], vectorstore=<langchain_chroma.vectorstores.Chroma object at 0x000001466141E350>, search_kwargs={'k': 2})

In [23]:
retriever.invoke("Who is a bowler?")

[Document(id='e629abfe-9ab7-4bdd-bd83-fd9243ec770c', metadata={'team': 'Mumbai Indians'}, page_content='Jasprit Bumrah is considered one of the best fast bowlers in T20 cricket. Playing for Mumbai Indians, he is known for his yorkers and death-over expertise.'),
 Document(id='7ea6f0e6-7ac2-4ed8-bde3-2ccd1e41ab4e', metadata={'team': 'Chennai Super Kings'}, page_content='Ravindra Jadeja is a dynamic all-rounder who contributes with both bat and ball. Representing Chennai Super Kings, his quick fielding and match-winning performances make him a key player.')]

In [24]:
retriever.batch(["Who is a bowler?", "Who is a batsman?"])

[[Document(id='e629abfe-9ab7-4bdd-bd83-fd9243ec770c', metadata={'team': 'Mumbai Indians'}, page_content='Jasprit Bumrah is considered one of the best fast bowlers in T20 cricket. Playing for Mumbai Indians, he is known for his yorkers and death-over expertise.'),
  Document(id='7ea6f0e6-7ac2-4ed8-bde3-2ccd1e41ab4e', metadata={'team': 'Chennai Super Kings'}, page_content='Ravindra Jadeja is a dynamic all-rounder who contributes with both bat and ball. Representing Chennai Super Kings, his quick fielding and match-winning performances make him a key player.')],
 [Document(id='7ea6f0e6-7ac2-4ed8-bde3-2ccd1e41ab4e', metadata={'team': 'Chennai Super Kings'}, page_content='Ravindra Jadeja is a dynamic all-rounder who contributes with both bat and ball. Representing Chennai Super Kings, his quick fielding and match-winning performances make him a key player.'),
  Document(id='e9851729-4def-49ab-8d91-6dff1f0de97d', metadata={'team': 'Royal Challengers Bangalore'}, page_content='Virat Kohli is 

In [25]:
for chunk in retriever.stream("Who is a bowler?"):
    print(chunk)

[Document(id='e629abfe-9ab7-4bdd-bd83-fd9243ec770c', metadata={'team': 'Mumbai Indians'}, page_content='Jasprit Bumrah is considered one of the best fast bowlers in T20 cricket. Playing for Mumbai Indians, he is known for his yorkers and death-over expertise.'), Document(id='7ea6f0e6-7ac2-4ed8-bde3-2ccd1e41ab4e', metadata={'team': 'Chennai Super Kings'}, page_content='Ravindra Jadeja is a dynamic all-rounder who contributes with both bat and ball. Representing Chennai Super Kings, his quick fielding and match-winning performances make him a key player.')]


In [26]:
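# Similarity search directly on the vector store
# Note: `query` still holds the Wikipedia query defined in the earlier section
results = vectorstore.similarity_search(query, k=2)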

In [None]:
for i, doc in enumerate(results):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content)


--- Result 1 ---
MS Dhoni, famously known as Captain Cool, has led Chennai Super Kings to multiple IPL titles. His finishing skills, wicketkeeping, and leadership are legendary.

--- Result 2 ---
Rohit Sharma is the most successful captain in IPL history, leading Mumbai Indians to five titles. He's known for his calm demeanor and ability to play big innings under pressure.


- **`similarity_search()` queries the vector store directly for the nearest documents, while a retriever wraps the vector store in the Runnable interface so it can plug into LangChain pipelines and be called with `.invoke()`, `.batch()`, or `.stream()`.**

- **`similarity_search()` ranks by one similarity metric fixed by the store (such as cosine similarity), while a retriever lets you pick the retrieval strategy through `search_type` (for example `"similarity"` or `"mmr"`) and tune it through `search_kwargs` (for example `k` and, in stores that support them, metadata filters), which makes it more flexible.**
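
To make the distinction concrete, here is a minimal pure-Python sketch. `ToyVectorStore` and `ToyRetriever` are hypothetical stand-ins written for this note, not LangChain classes, and the 2-D vectors are made up. The store ranks by cosine similarity; the retriever only forwards its `search_kwargs` and exposes a uniform `invoke`/`batch` interface.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical stand-in for a vector store: texts with precomputed vectors."""

    def __init__(self, texts, vectors):
        self.texts = texts
        self.vectors = vectors

    def similarity_search(self, query_vector, k=2):
        # Rank every stored vector by cosine similarity to the query.
        ranked = sorted(
            range(len(self.texts)),
            key=lambda i: cosine_similarity(query_vector, self.vectors[i]),
            reverse=True,
        )
        return [self.texts[i] for i in ranked[:k]]


class ToyRetriever:
    """Hypothetical wrapper that gives the store an invoke/batch interface."""

    def __init__(self, store, search_kwargs=None):
        self.store = store
        self.search_kwargs = search_kwargs or {}

    def invoke(self, query_vector):
        return self.store.similarity_search(query_vector, **self.search_kwargs)

    def batch(self, query_vectors):
        return [self.invoke(q) for q in query_vectors]


toy_store = ToyVectorStore(
    texts=["fast bowler", "opening batsman", "all-rounder"],
    vectors=[[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],  # made-up embeddings
)
toy_retriever = ToyRetriever(toy_store, search_kwargs={"k": 2})

bowler_hits = toy_retriever.invoke([0.9, 0.1])
batch_hits = toy_retriever.batch([[0.9, 0.1], [0.1, 0.9]])
print(bowler_hits)
print(batch_hits)
```

Because the wrapper owns the search configuration, callers only pass a query, which is what lets a retriever slot into a pipeline without the pipeline knowing which store sits behind it.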

## MMR (Maximal Marginal Relevance)

* Problem with basic retrieval: the top-k results can be near-duplicates of one another.

In [43]:
from langchain_core.documents import Document

documents = [
    Document(page_content="LangChain makes it easy to work with LLMs."),
    Document(page_content="LangChain is used to build LLM based applications."),
    Document(page_content="Chroma is used to store and search document embeddings."),
    Document(page_content="Embeddings are vector representations of text."),
    Document(page_content="MMR helps you get diverse results when doing similarity search."),
    Document(page_content="LangChain supports Chroma, FAISS, Pinecone, and more.")]

In [44]:
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

# 1️⃣ Create embedding model
hf_embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={"device": "cuda"}
    )
# 2️⃣ Create FAISS vector store in memory
vectorstore = FAISS.from_documents(documents, hf_embeddings)

In [None]:
query = "What is langchain?" 

**basic retrieval**
- similarity search

In [59]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

results = retriever.invoke(query)
for doc in results:
    print(doc.page_content)

LangChain supports Chroma, FAISS, Pinecone, and more.
LangChain is used to build LLM based applications.
LangChain makes it easy to work with LLMs.


**MMR search type**

In [60]:
retriever = vectorstore.as_retriever(
    search_type="mmr",
    # lambda_mult: 0 = maximum diversity, 1 = minimum diversity (pure similarity ranking)
    search_kwargs={"k": 3, "lambda_mult": 0.5},
)
results = retriever.invoke(query)
for doc in results:
    print(doc.page_content)

LangChain supports Chroma, FAISS, Pinecone, and more.
LangChain is used to build LLM based applications.
Embeddings are vector representations of text.
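
The effect of `lambda_mult` is easier to see in a small standalone sketch of the greedy MMR selection rule. This is an illustration of the usual formulation (score = `lambda_mult` × relevance − (1 − `lambda_mult`) × redundancy), not LangChain's actual code; `mmr_select` and the 2-D toy vectors are assumptions made up for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Greedy MMR: return the indices of up to k documents.

    Each step picks the candidate that maximises
    lambda_mult * relevance - (1 - lambda_mult) * redundancy,
    where redundancy is the highest similarity to anything already picked.
    """
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


toy_query = [1.0, 0.0]
toy_docs = [
    [0.9, 0.4],   # 0: very relevant
    [0.9, 0.45],  # 1: near-duplicate of document 0
    [0.8, -0.6],  # 2: relevant, but points in a different direction
    [0.0, 1.0],   # 3: unrelated to the query
]

similar_only = mmr_select(toy_query, toy_docs, k=2, lambda_mult=1.0)
balanced = mmr_select(toy_query, toy_docs, k=2, lambda_mult=0.5)
print(similar_only)  # [0, 1]: the near-duplicate takes the second slot
print(balanced)      # [0, 2]: the redundancy penalty pushes the duplicate out
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection is plain similarity ranking; lowering it trades some relevance for coverage, which matches the run above where the third LangChain sentence gave way to the embeddings sentence.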


## MultiQuery Retriever

Idea Behind MultiQueryRetriever

**User Query:**
- *How can I build a good GenAI application?*

This could actually mean:

- *How do I choose the right LLM?*
- *How do I implement RAG in my project?*
- *How can I reduce hallucinations in LLMs?*
- *What techniques improve prompt engineering?*
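
The flow implied by this idea (rephrase the question, retrieve for each variant, merge without duplicates) can be sketched without any LLM. Everything here is hypothetical: `rewrite_query` hard-codes variants where a real setup would ask an LLM, `TOY_CORPUS` is invented, and `keyword_retriever` ranks by shared words rather than embeddings.

```python
TOY_CORPUS = [
    "Pick an LLM by comparing cost, latency, and quality.",
    "RAG grounds a GenAI application in retrieved documents.",
    "Prompt templates keep GenAI application behaviour consistent.",
]


def rewrite_query(question):
    """Hypothetical stand-in for the LLM that rephrases the question."""
    return [
        question,
        "How do I choose the right LLM?",
        "How do I implement RAG in my project?",
    ]


def _words(text):
    """Lowercase the text and strip basic punctuation before splitting."""
    for mark in "?.,":
        text = text.replace(mark, "")
    return set(text.lower().split())


def keyword_retriever(query, k=2):
    """Hypothetical retriever: rank corpus entries by words shared with the query."""
    query_words = _words(query)

    def overlap(text):
        return len(query_words & _words(text))

    ranked = sorted(TOY_CORPUS, key=overlap, reverse=True)
    return [text for text in ranked[:k] if overlap(text) > 0]


def multi_query_retrieve(question, k=2):
    """Retrieve for every query variant and merge, keeping first-seen order."""
    seen = set()
    merged = []
    for variant in rewrite_query(question):
        for text in keyword_retriever(variant, k=k):
            if text not in seen:  # drop documents already returned by another variant
                seen.add(text)
                merged.append(text)
    return merged


question = "How can I build a good GenAI application?"
single_hits = keyword_retriever(question)
merged_hits = multi_query_retrieve(question)
print(len(single_hits), len(merged_hits))
```

In this toy run the original question alone never reaches the LLM-selection sentence; the rephrased variant does, and the merge step keeps the RAG sentence from appearing twice.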

---



![image.png](attachment:image.png)

In [7]:
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.documents import Document
from langchain_ollama import ChatOllama
from langchain_classic.retrievers.multi_query import MultiQueryRetriever

In [21]:
from langchain_core.documents import Document

all_docs = [
    Document(
        page_content="Regular walking boosts heart health and can reduce symptoms of depression.",
        metadata={"source": "H1", "category": "health"}
    ),
    Document(
        page_content="Consuming leafy greens and fruits helps detox the body and improve longevity.",
        metadata={"source": "H2", "category": "health"}
    ),
    Document(
        page_content="Deep sleep is crucial for cellular repair and emotional regulation.",
        metadata={"source": "H3", "category": "health"}
    ),
    Document(
        page_content="Mindfulness and controlled breathing lower cortisol and improve mental clarity.",
        metadata={"source": "H4", "category": "health"}
    ),
    Document(
        page_content="Drinking sufficient water throughout the day helps maintain metabolism and energy.",
        metadata={"source": "H5", "category": "health"}
    ),
    Document(
        page_content="The solar energy system in modern homes helps balance electricity demand.",
        metadata={"source": "I1", "category": "solar energy"}
    ),
    Document(
        page_content="Python balances readability with power, making it a popular system design language.",
        metadata={"source": "I2", "category": "technology"}
    ),
    Document(
        page_content="Photosynthesis enables plants to produce energy by converting sunlight.",
        metadata={"source": "I3", "category": "biology"}
    ),
    Document(
        page_content="The 2022 FIFA World Cup was held in Qatar and drew global energy and excitement.",
        metadata={"source": "I4", "category": "sports"}
    ),
    Document(
        page_content="Black holes bend spacetime and store immense gravitational energy.",
        metadata={"source": "I5", "category": "physics"}
    ),
]

In [None]:
# 1️⃣ Create embedding model
hf_embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={"device": "cuda"}
    )
# 2️⃣ Create FAISS vector store in memory
vectorstore = FAISS.from_documents(all_docs, hf_embeddings)

In [23]:
# Create retrievers
similarity_retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 5})

In [32]:
multiquery_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=ChatOllama(model="qwen3:latest")
)

In [25]:
# Query
query = "How to improve energy levels and maintain balance?"

In [39]:
# Retrieve results
similarity_results = similarity_retriever.invoke(query)
for i, doc in enumerate(similarity_results):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content , "--" , doc.metadata['category'])



--- Result 1 ---
Drinking sufficient water throughout the day helps maintain metabolism and energy. -- health

--- Result 2 ---
The solar energy system in modern homes helps balance electricity demand. -- solar energy

--- Result 3 ---
Consuming leafy greens and fruits helps detox the body and improve longevity. -- health

--- Result 4 ---
Mindfulness and controlled breathing lower cortisol and improve mental clarity. -- health

--- Result 5 ---
Photosynthesis enables plants to produce energy by converting sunlight. -- biology


In [42]:
multiquery_results = multiquery_retriever.invoke(query)

In [35]:
for i, doc in enumerate(multiquery_results):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content, "--" , doc.metadata['category'])


--- Result 1 ---
Drinking sufficient water throughout the day helps maintain metabolism and energy. -- health

--- Result 2 ---
The solar energy system in modern homes helps balance electricity demand. -- solar energy

--- Result 3 ---
Consuming leafy greens and fruits helps detox the body and improve longevity. -- health

--- Result 4 ---
Photosynthesis enables plants to produce energy by converting sunlight. -- biology

--- Result 5 ---
Regular walking boosts heart health and can reduce symptoms of depression. -- health

--- Result 6 ---
Mindfulness and controlled breathing lower cortisol and improve mental clarity. -- health


**Generate queries manually**

In [45]:
generated_queries = multiquery_retriever.llm_chain.invoke(
    {"question": query}
)
generated_queries

['What lifestyle changes can enhance energy levels and promote overall balance?  ',
 'What nutritional strategies and habits can boost energy levels and support daily balance?  ',
 'What holistic practices and routines can improve energy levels and foster mental and physical balance?']

In [46]:
all_results = []

for q in generated_queries:
    docs = vectorstore.similarity_search(q, k=5)
    all_results.extend(docs)

In [48]:
for doc in all_results:
    print(doc.page_content , "--" , doc.metadata['category'])

Drinking sufficient water throughout the day helps maintain metabolism and energy. -- health
Consuming leafy greens and fruits helps detox the body and improve longevity. -- health
The solar energy system in modern homes helps balance electricity demand. -- solar energy
Regular walking boosts heart health and can reduce symptoms of depression. -- health
Photosynthesis enables plants to produce energy by converting sunlight. -- biology
Drinking sufficient water throughout the day helps maintain metabolism and energy. -- health
Consuming leafy greens and fruits helps detox the body and improve longevity. -- health
The solar energy system in modern homes helps balance electricity demand. -- solar energy
Regular walking boosts heart health and can reduce symptoms of depression. -- health
Photosynthesis enables plants to produce energy by converting sunlight. -- biology
Drinking sufficient water throughout the day helps maintain metabolism and energy. -- health
Mindfulness and controlled br

In [51]:
unique_docs = {doc.page_content: doc for doc in all_results}.values()
unique_docs = list(unique_docs)

for doc in unique_docs:
    print(doc.page_content , "--" , doc.metadata['category'])

Drinking sufficient water throughout the day helps maintain metabolism and energy. -- health
Consuming leafy greens and fruits helps detox the body and improve longevity. -- health
The solar energy system in modern homes helps balance electricity demand. -- solar energy
Regular walking boosts heart health and can reduce symptoms of depression. -- health
Photosynthesis enables plants to produce energy by converting sunlight. -- biology
Mindfulness and controlled breathing lower cortisol and improve mental clarity. -- health


## ContextualCompressionRetriever

![image.png](attachment:image.png)

In [58]:
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.documents import Document
from langchain_ollama import ChatOllama
from langchain_classic.retrievers.document_compressors import LLMChainExtractor
from langchain_classic.retrievers.contextual_compression import ContextualCompressionRetriever

In [64]:
docs = [
    Document(
        page_content=(
            """The Grand Canyon is one of the most visited natural wonders in the world.
            Photosynthesis is the process by which green plants convert sunlight into energy.
            Millions of tourists travel to see it every year. The rocks date back millions of years."""
        ),
        metadata={"source": "Doc1", "category": "nature"}
    ),

    Document(
        page_content=(
            """In medieval Europe, castles were built primarily for defense.
            The chlorophyll in plant cells captures sunlight during photosynthesis.
            Knights wore armor made of metal. Siege weapons were often used to breach castle walls."""
        ),
        metadata={"source": "Doc2", "category": "history , nature"}
    ),

    Document(
        page_content=(
            """Basketball was invented by Dr. James Naismith in the late 19th century.
            It was originally played with a soccer ball and peach baskets. NBA is now a global league."""
        ),
        metadata={"source": "Doc3", "category": "sports"}
    ),

    Document(
        page_content=(
            """The history of cinema began in the late 1800s. Silent films were the earliest form.
            Thomas Edison was among the pioneers. Photosynthesis does not occur in animal cells.
            Modern filmmaking involves complex CGI and sound design."""
        ),
        metadata={"source": "Doc4", "category": "cinema , nature"}
    )
]

In [65]:
# 1️⃣ Create embedding model
hf_embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={"device": "cuda"}
    )
# 2️⃣ Create FAISS vector store in memory
vectorstore = FAISS.from_documents(docs, hf_embeddings)

In [66]:
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

In [67]:
# Set up the compressor using an LLM
llm = ChatOllama(model="qwen3:latest")
compressor = LLMChainExtractor.from_llm(llm)

In [68]:
# Create the contextual compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_retriever=base_retriever,
    base_compressor=compressor
)

In [69]:
# Query the retriever
query = "What is photosynthesis?"
compressed_results = compression_retriever.invoke(query)

In [70]:
for i, doc in enumerate(compressed_results):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content , '--' ,doc.metadata['category'])



--- Result 1 ---
Photosynthesis is the process by which green plants convert sunlight into energy. -- nature

--- Result 2 ---
The chlorophyll in plant cells captures sunlight during photosynthesis. -- history , nature

--- Result 3 ---
Photosynthesis does not occur in animal cells. -- cinema , nature
