## Based on different data/document sources
---

### wikipedia retriever

In [1]:
from langchain_community.retrievers import WikipediaRetriever
retriever = WikipediaRetriever(
    language="en",
    top_k_results=2
)

In [2]:
query = "when is Artificial intelligence become popular?"
docs = retriever.invoke(query)

In [3]:
for i in docs:
    print(i.page_content)
    print("-"*100)

Generative artificial intelligence (Generative AI, GenAI, or GAI) is a subfield of artificial intelligence that uses generative models to produce text, images, videos, or other forms of data. These models learn the underlying patterns and structures of their training data and use them to produce new data based on the input, which often comes in the form of natural language prompts.
Generative AI tools have become more common since an "AI boom" in the 2020s. This boom was made possible by improvements in transformer-based deep neural networks, particularly large language models (LLMs). Major tools include chatbots such as ChatGPT, DeepSeek, Copilot, Gemini, Llama, and Grok; text-to-image artificial intelligence image generation systems such as Stable Diffusion, Midjourney, and DALL-E; and text-to-video AI generators such as Sora. Technology companies developing generative AI include OpenAI, Anthropic, Microsoft, Google, DeepSeek, and Baidu.
Generative AI has raised many ethical question

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

spliter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
)

### Vector Store Retrievers

In [5]:
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader("src/articles", glob="*.pdf")
docs = loader.load()
docs = spliter.split_documents(docs)

In [6]:
from langchain_chroma import Chroma

from langchain_google_genai import GoogleGenerativeAIEmbeddings, GoogleGenerativeAI
from dotenv import load_dotenv
load_dotenv()

embedding_model = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

In [7]:
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embedding_model,
    persist_directory="my_chroma_db",
    collection_name="my_news_collection"
)

In [8]:
# convert to retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

In [9]:
# vectorstore.delete_collection()
# vectorstore.get()

In [10]:
retriever.invoke("google")

[Document(id='66083772-01dc-49c9-8311-b806edf6adc5', metadata={'source': 'src/articles/news_10020031.pdf'}, page_content='Google has long dominated the search engine market, shaping the structure of the internet through the rise and fall of content driven by its search'),
 Document(id='4f2e231e-61ee-4e36-9367-76902bd3c4e3', metadata={'source': 'src/articles/news_10018577.pdf'}, page_content='As expected, Google’s annual conference this time is focused on AI, with the search and advertising giant showcasing a lot of new products and'),
 Document(id='0a61602f-a9ba-4a28-8390-9b77edb4d283', metadata={'source': 'src/articles/news_10020031.pdf'}, page_content='providing an end-to-end search experience. However, Google is also facing a prominent antitrust challenge where the US government is pushing for the'),
 Document(id='39a27e18-df33-485f-9065-181c0e76920c', metadata={'source': 'src/articles/news_10020031.pdf'}, page_content='it’s also a glimpse of what’s to come,” Elizabeth Reid, vice pr

in vector store, similarity stores can be used to do the same thing(i.e, getting similarity result), but the vector store only uses one algorighm, with retrievers, we can use different retrieval strategies

## based on retrieval strategy

---

### 1. MMR (Maximal Marginal Relevance)

MMR is an information retrieval algorithm designed to reduce redundancy in the retrieved results while maintaining high relevance to the query.

**Why MMR Retriever?**

In regular similarity search, you may get documents that are:
All very similar to each other
- Repeating the same info
- Lacking diverse perspectives

MMR Retriever avoids that by:
- Picking the most relevant document first
- Then picking the next most relevant and least similar to already selected docs

And so on...
This helps especially in RAG pipelines where:

- You want your context window to contain diverse but still relevant information
- Especially useful when documents are semantically overlapping

In [11]:
from langchain_community.vectorstores import FAISS


# Step 2: Create the FAISS vector store from documents
vectorstore = FAISS.from_documents(
    documents=docs,
    embedding=embedding_model
)

In [23]:
# Enable MMR in the retriever
retriever_1 = vectorstore.as_retriever(
    search_type="mmr",                   # <-- This enables MMR
    search_kwargs={"k": 5, "lambda_mult": 0.2}  # k = top results, lambda_mult = relevance-diversity balance, 0 means very different and 1 acts as similarity search
)

retriever_2 = vectorstore.as_retriever(
    search_type="mmr",                  
    search_kwargs={"k": 5, "lambda_mult": 1} 
)

In [24]:
result_set_1 = retriever_1.invoke("what is google glasses")
result_set_2 = retriever_2.invoke("what is google glasses")

In [25]:
for i in result_set_1:
    print(i.page_content)
    
print("-" * 50)

for i in result_set_1:
    print(i.page_content)

Listing the big terror attacks in India, Sharma told The Indian Express: “Even when Parliament of India was attacked, India was outraged, but still
William Hutchinson, quoted by Desert Sun newspaper said “Everything is in question, whether this is an act of terrorism.”
jihadi terrorism as a weapon of war has ended. When the Islamist Republic’s military rulers are unable to convince the world of their credentials,
analyse Operation Sindoor. Today, I have come to say with pride that the revenge of Pahalgam has been taken by decimating the headquarters of (terror
our region created because of decades of cross-border terrorism, (by) terrorist groups which are funded, nurtured and sheltered by the Pakistani
--------------------------------------------------
Listing the big terror attacks in India, Sharma told The Indian Express: “Even when Parliament of India was attacked, India was outraged, but still
William Hutchinson, quoted by Desert Sun newspaper said “Everything is in question, wheth

### Multiquery Retriever

If user query is not clear about the context, then it will be passed to another llm, so that it will generate some query related to user asked query, in a better way

In [27]:
from langchain_core.documents import Document
from langchain.retrievers.multi_query import MultiQueryRetriever

# Relevant health & wellness documents
all_docs = [
    Document(page_content="Regular walking boosts heart health and can reduce symptoms of depression.", metadata={"source": "H1"}),
    Document(page_content="Consuming leafy greens and fruits helps detox the body and improve longevity.", metadata={"source": "H2"}),
    Document(page_content="Deep sleep is crucial for cellular repair and emotional regulation.", metadata={"source": "H3"}),
    Document(page_content="Mindfulness and controlled breathing lower cortisol and improve mental clarity.", metadata={"source": "H4"}),
    Document(page_content="Drinking sufficient water throughout the day helps maintain metabolism and energy.", metadata={"source": "H5"}),
    Document(page_content="The solar energy system in modern homes helps balance electricity demand.", metadata={"source": "I1"}),
    Document(page_content="Python balances readability with power, making it a popular system design language.", metadata={"source": "I2"}),
    Document(page_content="Photosynthesis enables plants to produce energy by converting sunlight.", metadata={"source": "I3"}),
    Document(page_content="The 2022 FIFA World Cup was held in Qatar and drew global energy and excitement.", metadata={"source": "I4"}),
    Document(page_content="Black holes bend spacetime and store immense gravitational energy.", metadata={"source": "I5"}),
]

In [35]:
llm = GoogleGenerativeAI(model='gemini-2.5-flash-preview-05-20')

In [28]:
vectorstore = FAISS.from_documents(documents=all_docs, embedding=embedding_model)

similarity_retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 5})

multiquery_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=llm
)

In [29]:
# Query
query = "How to improve energy levels and maintain balance?"

In [30]:
# Retrieve results
similarity_results = similarity_retriever.invoke(query)
multiquery_results= multiquery_retriever.invoke(query)

In [31]:
for i in similarity_results:
    print(i.page_content)
    
print("-" * 50)

for i in multiquery_results:
    print(i.page_content)

Drinking sufficient water throughout the day helps maintain metabolism and energy.
Mindfulness and controlled breathing lower cortisol and improve mental clarity.
Regular walking boosts heart health and can reduce symptoms of depression.
Consuming leafy greens and fruits helps detox the body and improve longevity.
The solar energy system in modern homes helps balance electricity demand.
--------------------------------------------------
Drinking sufficient water throughout the day helps maintain metabolism and energy.
Mindfulness and controlled breathing lower cortisol and improve mental clarity.
Consuming leafy greens and fruits helps detox the body and improve longevity.
Regular walking boosts heart health and can reduce symptoms of depression.
Deep sleep is crucial for cellular repair and emotional regulation.


### ContextualCompressionRetriever

Sometime fetched document/chunk might have different contextual meaning sentences which may not necessary 

In [34]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Recreate the document objects from the previous data
docs = [
    Document(page_content=(
        """The Grand Canyon is one of the most visited natural wonders in the world.
        Photosynthesis is the process by which green plants convert sunlight into energy.
        Millions of tourists travel to see it every year. The rocks date back millions of years."""
    ), metadata={"source": "Doc1"}),

    Document(page_content=(
        """In medieval Europe, castles were built primarily for defense.
        The chlorophyll in plant cells captures sunlight during photosynthesis.
        Knights wore armor made of metal. Siege weapons were often used to breach castle walls."""
    ), metadata={"source": "Doc2"}),

    Document(page_content=(
        """Basketball was invented by Dr. James Naismith in the late 19th century.
        It was originally played with a soccer ball and peach baskets. NBA is now a global league."""
    ), metadata={"source": "Doc3"}),

    Document(page_content=(
        """The history of cinema began in the late 1800s. Silent films were the earliest form.
        Thomas Edison was among the pioneers. Photosynthesis does not occur in animal cells.
        Modern filmmaking involves complex CGI and sound design."""
    ), metadata={"source": "Doc4"})
]

In [37]:
vectorstore = FAISS.from_documents(docs, embedding_model)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
compressor = LLMChainExtractor.from_llm(llm)

In [38]:
# Create the contextual compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_retriever=base_retriever,
    base_compressor=compressor
)

In [39]:
# Query the retriever
query = "What is photosynthesis?"
compressed_results = compression_retriever.invoke(query)

In [40]:
for i in compressed_results:
    print(i.page_content)

Photosynthesis is the process by which green plants convert sunlight into energy.
Photosynthesis does not occur in animal cells.
The chlorophyll in plant cells captures sunlight during photosynthesis.
