# ***What Are Retrievers?***
- A retriever is a LangChain component that fetches relevant documents from a data source in response to a user's query.
- There are multiple types of retrievers.
- The key point: every retriever is a runnable, so it can be used as a step in a chain.
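
To make the "runnable" idea concrete, here is a minimal plain-Python sketch. It does not use LangChain itself: `ToyRetriever`, `ToyFormatter`, and `pipe` are hypothetical stand-ins that only mimic the shared `invoke` interface, which is what allows components to be chained.

```python
# Hypothetical stand-ins for illustration only; these are not LangChain classes.

class ToyRetriever:
    """Returns stored texts that share at least one word with the query."""

    def __init__(self, texts):
        self.texts = texts

    def invoke(self, query):
        words = set(query.lower().split())
        return [t for t in self.texts if words & set(t.lower().split())]


class ToyFormatter:
    """Joins retrieved texts into a single context string."""

    def invoke(self, docs):
        return "\n".join(docs)


def pipe(*steps):
    """Chain steps so that each step's output feeds the next step's invoke()."""
    def run(value):
        for step in steps:
            value = step.invoke(value)
        return value
    return run


chain = pipe(
    ToyRetriever(["dhaka is the capital", "rivers flood often"]),
    ToyFormatter(),
)
print(chain("capital city"))
```

Because every step exposes the same `invoke` method, the retriever can be swapped for any other retriever without touching the rest of the chain.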


### ***Types of Retrievers Based on Datasources***
- Wikipedia Retriever
- Vector Store Retriever
- Arxiv Retriever
- Custom Retriever

### ***Types of Retrievers Based on Retrieval Strategy***
- MMR (Maximal Marginal Relevance)
- Multi-Query Retriever
- Contextual Compression Retriever



## *Wikipedia Retriever*

- The Wikipedia retriever queries the Wikipedia API to fetch content relevant to a given query.
- It performs a keyword search and returns the most relevant results.
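
As a rough illustration of keyword-based retrieval, the sketch below ranks a few in-memory pages by how many query words they contain. This is a toy stand-in and not how the Wikipedia API actually ranks results; `keyword_search` and `pages` are hypothetical names, and `top_k` only mirrors the role of `top_k_results`.

```python
def keyword_search(query, pages, top_k=2):
    """Rank pages by how many distinct query words appear in their text."""
    terms = set(query.lower().split())
    scored = []
    for title, text in pages.items():
        overlap = len(terms & set(text.lower().split()))
        if overlap:
            scored.append((overlap, title))
    # Highest overlap first; ties are broken alphabetically by title
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:top_k]]


# Hypothetical mini-corpus standing in for Wikipedia pages
pages = {
    "History of Bangladesh": "history of bangladesh from ancient bengal to 1971",
    "Bengal tiger": "the bengal tiger lives in the sundarbans",
    "Pakistan": "pakistan and its history since 1947",
}
print(keyword_search("history of bangladesh 1971", pages))
```

Pages that share no word with the query are dropped entirely, which is the main weakness of pure keyword matching compared with the embedding-based retrievers that follow.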


In [1]:
from langchain_community.retrievers import WikipediaRetriever

In [2]:
# Initialize the retriever (optional: set the language and the number of results)
retriever = WikipediaRetriever(
    top_k_results=2,
    lang='en'
)

In [3]:
## the query
query = "The geopolitical history of Bangladesh and Pakistan from the perspective of 1971"

docs = retriever.invoke(query)
len(docs)

2

In [None]:
docs[0]

Document(metadata={'title': 'History of Bangladesh', 'summary': "The history of Bangladesh dates back over four millennia to the Chalcolithic period. The region's early history was characterized by a succession of Hindu and Buddhist kingdoms and empires that fought for control over the Bengal region. Islam arrived in the 8th century and gradually became dominant from the early 13th century with the conquests led by Bakhtiyar Khalji and the activities of Sunni missionaries like Shah Jalal. Muslim rulers promoted the spread of Islam by building mosques across the region. From the 14th century onward, Bengal was ruled by the Bengal Sultanate, founded by Fakhruddin Mubarak Shah, who established an individual currency. The Bengal Sultanate expanded under rulers like Shamsuddin Ilyas Shah, leading to economic prosperity and military dominance, with Bengal being referred to by Europeans as the richest country to trade with. The region later became a part of the Mughal Empire, and according to

In [5]:
for i, doc in enumerate(docs):
    print(f"\n--- Result {i+1} ----")
    print(f"Content: \n{doc.page_content}....")


--- Result 1 ----
Content: 
The history of Bangladesh dates back over four millennia to the Chalcolithic period. The region's early history was characterized by a succession of Hindu and Buddhist kingdoms and empires that fought for control over the Bengal region. Islam arrived in the 8th century and gradually became dominant from the early 13th century with the conquests led by Bakhtiyar Khalji and the activities of Sunni missionaries like Shah Jalal. Muslim rulers promoted the spread of Islam by building mosques across the region. From the 14th century onward, Bengal was ruled by the Bengal Sultanate, founded by Fakhruddin Mubarak Shah, who established an individual currency. The Bengal Sultanate expanded under rulers like Shamsuddin Ilyas Shah, leading to economic prosperity and military dominance, with Bengal being referred to by Europeans as the richest country to trade with. The region later became a part of the Mughal Empire, and according to historian C. A. Bayly, it was proba

## Vector Store Retrievers

In [6]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.documents import Document

In [7]:
doc1 = Document(
    page_content="Bangladesh gained independence in 1971 after a nine-month-long war with Pakistan, which is commemorated every year on Victory Day, December 16."
)

doc2 = Document(
    page_content="The Sundarbans, located in the southwestern part of Bangladesh, is the largest mangrove forest in the world and home to the Royal Bengal Tiger."
)

doc3 = Document(
    page_content="Dhaka, the capital of Bangladesh, is one of the most densely populated cities in the world and a major hub for trade, textiles, and culture in South Asia."
)

doc4 = Document(
    page_content="The Padma Bridge, inaugurated in 2022, is one of the most significant infrastructure projects in Bangladesh, connecting the country's southwest to the capital."
)

doc5 = Document(
    page_content="Bangladesh is one of the largest contributors to the global ready-made garments industry, with the sector being a key driver of its economy and employment."
)

documents = [doc1, doc2, doc3, doc4, doc5]

In [8]:
embeddings = HuggingFaceEmbeddings(model_name = 'sentence-transformers/all-MiniLM-L6-v2')



In [9]:
embeddings

HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
), model_name='sentence-transformers/all-MiniLM-L6-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False)

In [10]:
vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    collection_name='My_collection'
)



In [12]:
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 2
    }
)

In [13]:
query = "Tell me Name of the city which is the capital of bangladesh?"

response = retriever.invoke(query)

In [15]:
for i, doc in enumerate(response):
    print(f"\n---- Results {i+1} ------")
    print(f"{doc.page_content}")


---- Results 1 ------
Dhaka, the capital of Bangladesh, is one of the most densely populated cities in the world and a major hub for trade, textiles, and culture in South Asia.

---- Results 2 ------
Bangladesh is one of the largest contributors to the global ready-made garments industry, with the sector being a key driver of its economy and employment.


## ***🧠Maximal Marginal Relevance (MMR)***
*`How can we pick results that are not only relevant to the query but also different from each other?`*

- MMR is an information retrieval algorithm designed to reduce redundancy in the retrieved results while maintaining high relevance to the query.

#### How does the MMR retriever work?
- It picks the most relevant document first.
- It then picks the next document that is most relevant to the query and least similar to the documents already selected.
- It repeats this until `k` documents have been selected.

`This helps especially in RAG pipelines where:`
- you want your context window to contain diverse but still relevant information
- the documents overlap semantically
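
The greedy selection described above can be sketched in plain Python. This is a minimal illustration on hand-made 2-D vectors, not the vector store's own implementation; the `cosine` and `mmr` helpers and the toy vectors are assumptions made for the example. The score used here is `lambda_mult * relevance - (1 - lambda_mult) * redundancy`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return the indices of k documents chosen greedily by the MMR score."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy "embeddings": documents 0 and 1 are near-duplicates, document 2 is different
query = [1.0, 0.2]
docs = [[1.0, 0.0], [0.99, 0.05], [0.5, 0.8]]

print(mmr(query, docs, k=2, lambda_mult=1.0))  # relevance only
print(mmr(query, docs, k=2, lambda_mult=0.5))  # relevance and diversity balanced
```

With `lambda_mult=1.0` the two near-duplicates are both returned; with `lambda_mult=0.5` the second slot goes to the less similar document instead.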


In [16]:
doc1 = Document(
    page_content="The Sundarbans in Bangladesh is the largest tidal halophytic mangrove forest in the world and a UNESCO World Heritage Site."
)

doc2 = Document(
    page_content="Dhaka is the capital of Bangladesh, known for its vibrant culture, bustling streets, and the historic Lalbagh Fort."
)

doc3 = Document(
    page_content="The Padma River, one of the major rivers of Bangladesh, plays a crucial role in agriculture, transportation, and fisheries."
)

doc4 = Document(
    page_content="The Padma Bridge, inaugurated in 2022, has significantly improved connectivity between the capital and southern Bangladesh."
)

doc5 = Document(
    page_content="Chittagong, a major port city in Bangladesh, is known for its natural beauty, the Hill Tracts, and the busy seaport."
)

doc6 = Document(
    page_content="Cox’s Bazar in Bangladesh boasts the world’s longest natural sandy sea beach, attracting tourists year-round."
)

doc7 = Document(
    page_content="Bangladesh’s economy is rapidly growing, with the garments sector, remittances, and agriculture contributing heavily to GDP."
)

doc8 = Document(
    page_content="The Rohingya refugee crisis led to the displacement of hundreds of thousands of people to Bangladesh’s Cox’s Bazar region."
)
documents = [doc1, doc2, doc3, doc4, doc5, doc6, doc7, doc8]

In [17]:
from langchain_community.vectorstores import FAISS


vectorStore = FAISS.from_documents(
    documents=documents,
    embedding=embeddings,
)

In [None]:
retriever = vectorStore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 3,
        # lambda_mult balances relevance and diversity:
        # 1 = relevance only (behaves like plain similarity search), 0 = maximum diversity
        "lambda_mult": 1
    }
)

In [19]:
query = "Tell me about infrastructure and rivers in Bangladesh"
response = retriever.invoke(query)

In [20]:
for i, doc in enumerate(response):
    print(f"\n---- Results {i+1} ------")
    print(f"{doc.page_content}")


---- Results 1 ------
The Padma River, one of the major rivers of Bangladesh, plays a crucial role in agriculture, transportation, and fisheries.

---- Results 2 ------
Dhaka is the capital of Bangladesh, known for its vibrant culture, bustling streets, and the historic Lalbagh Fort.

---- Results 3 ------
Bangladesh’s economy is rapidly growing, with the garments sector, remittances, and agriculture contributing heavily to GDP.


In [21]:
retriever = vectorStore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 3,
        # 0.5 weighs relevance and diversity equally
        "lambda_mult": 0.5
    }
)

In [22]:
query = "Tell me about infrastructure and rivers in Bangladesh"
response = retriever.invoke(query)

In [23]:
for i, doc in enumerate(response):
    print(f"\n---- Results {i+1} ------")
    print(f"{doc.page_content}")


---- Results 1 ------
The Padma River, one of the major rivers of Bangladesh, plays a crucial role in agriculture, transportation, and fisheries.

---- Results 2 ------
The Rohingya refugee crisis led to the displacement of hundreds of thousands of people to Bangladesh’s Cox’s Bazar region.

---- Results 3 ------
Bangladesh’s economy is rapidly growing, with the garments sector, remittances, and agriculture contributing heavily to GDP.


## ***Multi Query Retriever***
`Sometimes a single query does not capture all the ways the information is phrased in the documents.`
### Query:
#### How can I stay healthy?
`Could mean`
- What should I eat?
- How often should I exercise?
- How can I manage stress?

*A simple `similarity search` might miss documents that discuss these topics but do not use the word "healthy".*
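
The idea can be sketched without an LLM: run several phrasings of the question through the same retriever and merge the unique results. In the sketch below the query variants are hard-coded (in the real retriever an LLM generates them), and `multi_query_retrieve`, `naive_retrieve`, and `corpus` are hypothetical names used only for illustration.

```python
import re


def tokens(text):
    """Lowercase word set with punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))


def multi_query_retrieve(queries, retrieve):
    """Run every query variant and return the de-duplicated union of results."""
    seen = set()
    merged = []
    for q in queries:
        for doc in retrieve(q):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Hypothetical corpus; none of the documents contains the word "healthy"
corpus = [
    "Eat vegetables and fruit every day",
    "Exercise for thirty minutes most days",
    "Meditation helps manage stress",
]


def naive_retrieve(query):
    """Toy retriever: return documents sharing at least one word with the query."""
    return [doc for doc in corpus if tokens(query) & tokens(doc)]


# Hard-coded stand-ins for the variants an LLM would generate
variants = [
    "How can I stay healthy?",
    "What should I eat?",
    "How often should I exercise?",
    "How can I manage stress?",
]

print(naive_retrieve(variants[0]))                     # original query alone
print(multi_query_retrieve(variants, naive_retrieve))  # all variants merged
```

The original question alone matches nothing in this toy corpus, while the merged variants recover all three documents.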

In [24]:
from langchain.retrievers.multi_query import MultiQueryRetriever

In [25]:
doc1 = Document(
    page_content="Bangladesh has made significant strides in digital transformation, with projects like 'Digital Bangladesh' aiming to modernize services and governance."
)

doc2 = Document(
    page_content="The Sundarbans is a unique ecosystem in southern Bangladesh that supports a wide range of wildlife, including the endangered Royal Bengal Tiger."
)

doc3 = Document(
    page_content="The textile and garment industry is the backbone of Bangladesh’s export economy, employing millions and generating billions in revenue."
)

doc4 = Document(
    page_content="The Padma Bridge is a landmark infrastructure project that connects the southwestern region to Dhaka, reducing travel time dramatically."
)

doc5 = Document(
    page_content="Bangladesh is home to several UNESCO World Heritage Sites, including the Sundarbans and the historic mosque city of Bagerhat."
)

doc6 = Document(
    page_content="Tourism in Bangladesh is growing, with destinations like Cox’s Bazar, Sylhet’s tea gardens, and the hill tracts of Bandarban gaining popularity."
)

doc7 = Document(
    page_content="Remittance from overseas Bangladeshi workers is a vital source of foreign currency and supports millions of households across the country."
)

doc8 = Document(
    page_content="Education in Bangladesh has improved over the years, with increased enrollment rates, gender parity in primary schools, and growing digital learning tools."
)

documents = [doc1, doc2, doc3, doc4, doc5, doc6, doc7, doc8]

In [26]:
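import os

from dotenv import load_dotenv
from langchain_groq import ChatGroq

# Load GROQ_API_KEY and HF_TOKEN from a local .env file
load_dotenv()

groq_api_key = os.getenv("GROQ_API_KEY")

# Re-export HF_TOKEN only if it is set; assigning None to os.environ raises a TypeError
hf_token = os.getenv("HF_TOKEN")
if hf_token:
    os.environ["HF_TOKEN"] = hf_token

llm = ChatGroq(groq_api_key=groq_api_key, model="Llama3-8b-8192")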

In [27]:
vectorStore = FAISS.from_documents(
    documents=documents,
    embedding=embeddings
)

In [30]:
## simple Retriever
similarity_retriever = vectorStore.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 5
    }
)

In [31]:
## advanced Retriever
multiquery_retirever = MultiQueryRetriever.from_llm(
    retriever=vectorStore.as_retriever(search_kwargs={"k": 5}),
    llm=llm
)

In [32]:
query = "Tell me about the development of Bangladesh"  # an ambiguous query

In [33]:
similarity_response = similarity_retriever.invoke(query)
multiquery_response = multiquery_retirever.invoke(query)

In [35]:
for i, doc in enumerate(similarity_response):
    print(f"\n---- Results {i+1} ------")
    print(f"{doc.page_content}")
print("\n")
print("*"*150)

for i, doc in enumerate(multiquery_response):
    print(f"\n---- Results {i+1} ------")
    print(f"{doc.page_content}")


---- Results 1 ------
The textile and garment industry is the backbone of Bangladesh’s export economy, employing millions and generating billions in revenue.

---- Results 2 ------
Bangladesh has made significant strides in digital transformation, with projects like 'Digital Bangladesh' aiming to modernize services and governance.

---- Results 3 ------
Education in Bangladesh has improved over the years, with increased enrollment rates, gender parity in primary schools, and growing digital learning tools.

---- Results 4 ------
Tourism in Bangladesh is growing, with destinations like Cox’s Bazar, Sylhet’s tea gardens, and the hill tracts of Bandarban gaining popularity.

---- Results 5 ------
Bangladesh is home to several UNESCO World Heritage Sites, including the Sundarbans and the historic mosque city of Bagerhat.


******************************************************************************************************************************************************

---- Results 1 -

## ***Contextual Compression Retriever***
`The Contextual Compression Retriever (CCR) in LangChain is an advanced retriever that improves retrieval quality by compressing documents after retrieval, keeping only the content relevant to the user's query.`

### Example
* Question: What is photosynthesis?

### Document
*The Grand Canyon is a famous natural site. `Photosynthesis` is how plants convert light into energy. Many tourists visit the Sundarbans every year.*
- Only the second sentence is relevant to the question.


### When to Use
- Documents are large and mix several kinds of content.
- You want to reduce the context length sent to the LLM.
- You need to improve answer accuracy in a RAG pipeline.
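
The photosynthesis example can be mimicked with a toy extractor that keeps only the sentences sharing a keyword with the question. This is a simplified sketch: the compressor used later in this notebook asks an LLM to do the extraction, whereas the hypothetical `compress` function and `STOPWORDS` set below rely on plain word overlap.

```python
import re

# Small hand-picked stopword list so that words like "is" do not count as matches
STOPWORDS = {"what", "is", "the", "a", "how", "in", "of"}


def compress(document, query):
    """Keep only the sentences that share a non-stopword term with the query."""
    terms = set(re.findall(r"[a-z]+", query.lower())) - STOPWORDS
    # Split after sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [
        s for s in sentences
        if terms & set(re.findall(r"[a-z]+", s.lower()))
    ]
    return " ".join(kept)


document = (
    "The Grand Canyon is a famous natural site. "
    "Photosynthesis is how plants convert light into energy. "
    "Many tourists visit the Sundarbans every year."
)
print(compress(document, "What is photosynthesis?"))
```

Only the middle sentence survives, so the LLM downstream receives a shorter and more focused context.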

In [36]:
doc1 = Document(
    page_content="Bangladesh has a tropical monsoon climate characterized by wide seasonal variations in rainfall, high temperatures, and humidity. However, the Sundarbans region plays a crucial role in protecting the coastal zones from natural disasters like cyclones and tidal surges."
)

doc2 = Document(
    page_content="The national dish of Bangladesh is Hilsa fish curry, commonly served with rice. While food is central to Bangladeshi culture, the country is also heavily investing in renewable energy sources like solar power to reduce dependency on fossil fuels."
)

doc3 = Document(
    page_content="Cox’s Bazar is the longest sea beach in the world and a major tourist destination in Bangladesh. It draws thousands of visitors each year. At the same time, the nearby refugee camps house displaced Rohingya populations, creating humanitarian and logistical challenges."
)

doc4 = Document(
    page_content="The Liberation War of 1971 is a defining moment in Bangladesh’s history. The country gained independence from Pakistan after a nine-month war. While cricket is now the most popular sport, football was more dominant during the early post-independence years."
)

doc5 = Document(
    page_content="Bangladesh is a riverine country with over 700 rivers flowing throughout its territory. Although many rivers have ecological and cultural importance, frequent flooding poses major threats to agriculture and infrastructure."
)

doc6 = Document(
    page_content="The country’s booming garment industry contributes more than 80% of Bangladesh’s total exports. While fashion giants source from Bangladeshi factories, the sector also faces challenges like labor rights issues and factory safety concerns."
)

doc7 = Document(
    page_content="Education in Bangladesh has improved in terms of access and gender parity. However, rural areas still lag in digital access and quality. University-level research is growing, particularly in areas like agriculture, climate resilience, and public health."
)

doc8 = Document(
    page_content="Dhaka, the capital, is known for its rich Mughal architecture, street food, and traffic congestion. Despite urban issues, Dhaka is central to policymaking, trade, and tech innovation in the country."
)

documents = [doc1, doc2, doc3, doc4, doc5, doc6, doc7, doc8]

In [38]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [37]:
# Step 1: build the base retriever
vectorStore = FAISS.from_documents(
    documents=documents,
    embedding=embeddings
)

base_retriever = vectorStore.as_retriever(
    search_kwargs={
        "k": 5
    }
)

## ***Now Build the Contextual Compression Retriever***

In [39]:
compressor = LLMChainExtractor.from_llm(llm=llm)

In [40]:
compression_retriever = ContextualCompressionRetriever(
    base_retriever=base_retriever,
    base_compressor=compressor
)

In [41]:
query = "How is Bangladesh tackling climate-related challenges?"

In [42]:
compressed_result = compression_retriever.invoke(query)

In [43]:
for i, doc in enumerate(compressed_result):
    print(f"\n---- Results {i+1} ------")
    print(f"{doc.page_content}")


---- Results 1 ------
Extracted relevant parts:

Bangladesh has a tropical monsoon climate characterized by wide seasonal variations in rainfall, high temperatures, and humidity.

---- Results 2 ------
>>>
Bangladesh is a riverine country with over 700 rivers flowing throughout its territory. Although many rivers have ecological and cultural importance, frequent flooding poses major threats to agriculture and infrastructure.

---- Results 3 ------
>>>
University-level research is growing, particularly in areas like agriculture, climate resilience, and public health.
>>>

(Note: These extracted parts are relevant to answer the question "How is Bangladesh tackling climate-related challenges?")


## To learn more about retrievers, see the LangChain documentation

[LangChain-Retriever Documentation](https://python.langchain.com/docs/integrations/retrievers/)