# Environment Setup with dotenv

We'll use `python-dotenv` to manage our environment variables securely. Create a `.env` file in your project root with:

```
GOOGLE_API_KEY=your_google_api_key_here
```

This keeps sensitive information out of your code and makes it easier to manage different environments.

In [1]:
# Load environment variables
from dotenv import load_dotenv, find_dotenv
import os

# Try to find and load the .env file
env_path = find_dotenv()
if env_path:
    print(f"Found .env file at: {env_path}")
    load_dotenv(env_path, override=True)  # override=True to ensure variables are updated
else:
    print("❌ No .env file found")
    
# Verify environment variables are loaded
google_api_key = os.getenv("GOOGLE_API_KEY")
if google_api_key:
    print("✅ GOOGLE_API_KEY loaded successfully")
    # Mask the key for security
    masked_key = google_api_key[:4] + "..." + google_api_key[-4:]
    print(f"API Key (masked): {masked_key}")
else:
    print("❌ GOOGLE_API_KEY not found in environment variables")

Found .env file at: /media/rgukt/data/RAG/.env
✅ GOOGLE_API_KEY loaded successfully
API Key (masked): AIza...5wK8
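
The masking above slices the first and last four characters, which would print a very short value in full. A minimal sketch of a more defensive variant follows; `mask_secret` is a hypothetical helper name (not part of `python-dotenv`), and the key in the example is made up.

```python
def mask_secret(value: str, visible: int = 4) -> str:
    """Return a display-safe version of a secret (hypothetical helper)."""
    # If the secret is too short to leave a hidden middle part, hide all of it.
    if len(value) <= visible * 2:
        return "*" * len(value)
    return f"{value[:visible]}...{value[-visible:]}"


print(mask_secret("AIzaSyExampleExampleExample5wK8"))  # invented key, for illustration only
print(mask_secret("short"))
```

In the cell above, `masked_key = mask_secret(google_api_key)` could replace the manual slicing.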


In [None]:
# Install required packages
%pip install -q langchain chromadb langchain-google-genai google-generativeai python-dotenv langchain-community wikipedia

In [3]:
!which python

/media/rgukt/data/RAG/venv/bin/python


## Wikipedia Retriever

In [4]:
from langchain_community.retrievers import WikipediaRetriever


In [5]:
retriver = WikipediaRetriever(top_k_results=2, lang='en')

In [6]:
query = "The history of the how the sequence to sequence models are improved"

docs = retriver.invoke(query)

In [7]:
docs

[Document(metadata={'title': 'The Simpsons opening sequence', 'summary': 'The Simpsons opening sequence is the title sequence of the American animated television series The Simpsons. It is accompanied by "The Simpsons Theme". The first episode to use this introduction was the series\' second episode "Bart the Genius".\nEach episode has the same basic sequence of events: the camera zooms through cumulus clouds, through the show\'s title towards the town of Springfield. The camera then follows the members of the Simpson family on their way home. Upon entering their house, the Simpsons settle down on their couch to watch television. One of the most distinctive aspects of the opening is that three of its elements change from episode to episode: Bart writes different phrases on the school chalkboard, Lisa plays different solos on her saxophone (or occasionally a different instrument), and different visual gags accompany the family as they enter their living room to sit on the couch.\nThe st

In [8]:

# Define your query
query = "the geopolitical history of india and pakistan from the perspective of a chinese"

# Get relevant Wikipedia documents

In [9]:
docs2 = retriver.invoke(query)

In [10]:
print(type(docs2))
print(docs2[0])

<class 'list'>
page_content='The United States has been providing military aid and economic assistance to Pakistan for various purposes since 1948. In 2017, the U.S. stopped military aid to Pakistan, which was about US$2 billion per year. With U.S. military assistance suspended in 2018 and civilian aid reduced to about $300 million for 2022, Pakistani authorities have turned to other countries for help.


== History ==
From 1947 to 1958, under civilian leadership, the United States provided Pakistan with modest economic aid and limited military assistance. During this period, Pakistan became a member of the South East Asian Treaty Organization (SEATO) and the Central Treaty Organization (CENTO), after a Mutual Defence Assistance Agreement signed in May 1954, which facilitated increased levels of both economic and military aid from the U.S.
In 1958, Ayub Khan led Pakistan's first military coup, becoming Chief Martial Law Administrator (CMLA) and later President until 1969. During his te

In [11]:
# Print retrieved content
for i, doc in enumerate(docs):
    print(f"\n--- Result {i+1} ---")
    print(f"Content:\n{doc.page_content}...")  # prints the full page content; nothing is truncated here


--- Result 1 ---
Content:
The Simpsons opening sequence is the title sequence of the American animated television series The Simpsons. It is accompanied by "The Simpsons Theme". The first episode to use this introduction was the series' second episode "Bart the Genius".
Each episode has the same basic sequence of events: the camera zooms through cumulus clouds, through the show's title towards the town of Springfield. The camera then follows the members of the Simpson family on their way home. Upon entering their house, the Simpsons settle down on their couch to watch television. One of the most distinctive aspects of the opening is that three of its elements change from episode to episode: Bart writes different phrases on the school chalkboard, Lisa plays different solos on her saxophone (or occasionally a different instrument), and different visual gags accompany the family as they enter their living room to sit on the couch.
The standard opening has had two major revisions. The fir

In [12]:
# Print retrieved content
for i, doc in enumerate(docs2):
    print(f"\n--- Result {i+1} ---")
    print(f"Content:\n{doc.page_content}...")  # prints the full page content; nothing is truncated here


--- Result 1 ---
Content:
The United States has been providing military aid and economic assistance to Pakistan for various purposes since 1948. In 2017, the U.S. stopped military aid to Pakistan, which was about US$2 billion per year. With U.S. military assistance suspended in 2018 and civilian aid reduced to about $300 million for 2022, Pakistani authorities have turned to other countries for help.


== History ==
From 1947 to 1958, under civilian leadership, the United States provided Pakistan with modest economic aid and limited military assistance. During this period, Pakistan became a member of the South East Asian Treaty Organization (SEATO) and the Central Treaty Organization (CENTO), after a Mutual Defence Assistance Agreement signed in May 1954, which facilitated increased levels of both economic and military aid from the U.S.
In 1958, Ayub Khan led Pakistan's first military coup, becoming Chief Martial Law Administrator (CMLA) and later President until 1969. During his tenu
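
Wikipedia pages can run to thousands of characters, so printing `page_content` in full quickly floods the notebook. The following is a small sketch of a preview helper; the name `preview` and the default limit of 300 characters are assumptions chosen for illustration.

```python
def preview(text: str, limit: int = 300) -> str:
    """Shorten text to roughly `limit` characters, cutting at a word boundary when possible."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # Prefer to cut at the last space so a word is not split in half.
    last_space = cut.rfind(" ")
    if last_space > 0:
        cut = cut[:last_space]
    return cut.rstrip() + "..."


sample = "The United States has been providing military aid and economic assistance to Pakistan."
print(preview(sample, limit=40))
```

Inside the loops above, `print(preview(doc.page_content))` would show only the opening of each page.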

## Vector store retriever

# Vector Store Retriever with Chroma

## Prerequisites:
1. Google API key available as `GOOGLE_API_KEY`
2. Installed dependencies (`chromadb`, `langchain-google-genai`)
3. Valid documents with content
4. Working embedding model

## Common Issues:
1. Google API key not set or invalid
2. Embedding model initialization failed
3. Documents not in the correct format
4. Chroma persistence directory issues

# Using Google's Gemini Embeddings

## Advantages of Gemini Embeddings:
1. High-quality text representations
2. Optimized for different tasks (retrieval, classification, etc.)
3. Cost-effective compared to other options
4. Good performance on multilingual content

## Model Details:
- Model: `gemini-embedding-001`
- Task Type: `retrieval_document`
- Output: High-dimensional vectors (suitable for semantic search)
- Integration: Seamless with LangChain and ChromaDB

## Requirements:
1. Google API key in environment variables
2. `langchain-google-genai` package installed
3. `google-generativeai` Python package
4. Proper task type configuration for embeddings
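
"Suitable for semantic search" means, in practice, that texts with similar meaning map to vectors pointing in similar directions, and a vector store ranks documents by a measure such as cosine similarity. The sketch below illustrates the idea with invented 3-dimensional vectors; they are not real Gemini embeddings, which have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention used here for zero vectors
    return dot / (norm_a * norm_b)


# Toy "embeddings" with invented values, for illustration only.
query_vec = [1.0, 0.0, 1.0]
related_doc_vec = [0.9, 0.1, 0.8]
unrelated_doc_vec = [0.0, 1.0, 0.0]

print(cosine_similarity(query_vec, related_doc_vec))    # close to 1: similar direction
print(cosine_similarity(query_vec, unrelated_doc_vec))  # 0: orthogonal vectors
```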

In [13]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from dotenv import load_dotenv
load_dotenv()

True

In [14]:
# Step 1: Your source documents
documents = [
    Document(page_content="LangChain helps developers build LLM applications easily."),
    Document(page_content="Chroma is a vector database optimized for LLM-based search."),
    Document(page_content="Embeddings convert text into high-dimensional vectors."),
    Document(page_content="OpenAI provides powerful embedding models."),
]

In [15]:
# Step 2: Initialize the Gemini embedding model
embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")

E0000 00:00:1760084985.663186   22361 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


In [16]:
# Step 3: Create Chroma vector store in memory
vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    collection_name="my_collection"
)

In [17]:
# Step 4: Convert vectorstore into a retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

In [18]:
query = "What is Chroma used for?"
results = retriever.invoke(query)

In [19]:
for i, doc in enumerate(results):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content)


--- Result 1 ---
Chroma is a vector database optimized for LLM-based search.

--- Result 2 ---
LangChain helps developers build LLM applications easily.


In [20]:
results = vectorstore.similarity_search(query, k=2)

In [21]:
for i, doc in enumerate(results):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content)


--- Result 1 ---
Chroma is a vector database optimized for LLM-based search.

--- Result 2 ---
LangChain helps developers build LLM applications easily.


## MMR

In [22]:
# Sample documents
docs = [
    Document(page_content="LangChain makes it easy to work with LLMs."),
    Document(page_content="LangChain is used to build LLM based applications."),
    Document(page_content="Chroma is used to store and search document embeddings."),
    Document(page_content="Embeddings are vector representations of text."),
    Document(page_content="MMR helps you get diverse results when doing similarity search."),
    Document(page_content="LangChain supports Chroma, FAISS, Pinecone, and more."),
]

In [23]:
from langchain_community.vectorstores import FAISS

# Initialize Gemini embeddings
embedding_model = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")
# Create the FAISS vector store from documents
vectorstore = FAISS.from_documents(
    documents=docs,
    embedding=embedding_model
)



In [34]:
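# Enable MMR in the retriever
retriever = vectorstore.as_retriever(
    search_type="mmr",  # <-- This enables MMR
    # k = number of results; lambda_mult in [0, 1] balances relevance and diversity.
    # 1 = pure relevance (no diversity penalty); lower values favor more diverse results.
    search_kwargs={"k": 3, "lambda_mult": 1},
)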
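
Maximal marginal relevance (MMR) picks results one at a time, scoring each candidate by its relevance to the query minus a penalty for resembling documents already selected. The sketch below is a minimal pure-Python illustration under that textbook formulation, not LangChain's actual implementation; `mmr_select` and the 2-D vectors are invented for the example. With `lambda_mult=1` the penalty vanishes and the ranking collapses to plain similarity, which is consistent with all three results in this section being LangChain documents.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return the indices of up to k documents chosen by maximal marginal relevance."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            # Penalty: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy 2-D vectors (invented): documents 0 and 1 are near-duplicates, document 2 differs.
toy_query = [1.0, 0.2]
toy_vecs = [[1.0, 0.12], [1.0, 0.1], [0.6, 0.8]]

print(mmr_select(toy_query, toy_vecs, k=2, lambda_mult=1.0))  # relevance only
print(mmr_select(toy_query, toy_vecs, k=2, lambda_mult=0.3))  # diversity weighted in
```

With the lower `lambda_mult`, the second pick switches from the near-duplicate to the dissimilar document.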

In [35]:
query = "what is langchain?"
results = retriever.invoke(query)

In [36]:
print(type(results))
results

<class 'list'>


[Document(id='d9bc7c92-6b30-40a2-8227-0b5ac95569c1', metadata={}, page_content='LangChain is used to build LLM based applications.'),
 Document(id='f8538277-5d61-4849-a2e4-3127df1bd697', metadata={}, page_content='LangChain makes it easy to work with LLMs.'),
 Document(id='00f62938-7097-4a95-b280-cf7ccbbe5faf', metadata={}, page_content='LangChain supports Chroma, FAISS, Pinecone, and more.')]

In [37]:
print(results[0])

page_content='LangChain is used to build LLM based applications.'


In [38]:
for doc in results:
    print(doc.page_content)

LangChain is used to build LLM based applications.
LangChain makes it easy to work with LLMs.
LangChain supports Chroma, FAISS, Pinecone, and more.


In [39]:
for i, doc in enumerate(results):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content)


--- Result 1 ---
LangChain is used to build LLM based applications.

--- Result 2 ---
LangChain makes it easy to work with LLMs.

--- Result 3 ---
LangChain supports Chroma, FAISS, Pinecone, and more.


## Multiquery Retriever

In [40]:
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAIEmbeddings

from langchain_core.documents import Document

from langchain_google_genai import ChatGoogleGenerativeAI

from langchain.retrievers.multi_query import MultiQueryRetriever



In [41]:
# Health & wellness documents (sources H1-H5) mixed with unrelated distractors (sources I1-I5)
all_docs = [
    Document(page_content="Regular walking boosts heart health and can reduce symptoms of depression.", metadata={"source": "H1"}),
    Document(page_content="Consuming leafy greens and fruits helps detox the body and improve longevity.", metadata={"source": "H2"}),
    Document(page_content="Deep sleep is crucial for cellular repair and emotional regulation.", metadata={"source": "H3"}),
    Document(page_content="Mindfulness and controlled breathing lower cortisol and improve mental clarity.", metadata={"source": "H4"}),
    Document(page_content="Drinking sufficient water throughout the day helps maintain metabolism and energy.", metadata={"source": "H5"}),
    Document(page_content="The solar energy system in modern homes helps balance electricity demand.", metadata={"source": "I1"}),
    Document(page_content="Python balances readability with power, making it a popular system design language.", metadata={"source": "I2"}),
    Document(page_content="Photosynthesis enables plants to produce energy by converting sunlight.", metadata={"source": "I3"}),
    Document(page_content="The 2022 FIFA World Cup was held in Qatar and drew global energy and excitement.", metadata={"source": "I4"}),
    Document(page_content="Black holes bend spacetime and store immense gravitational energy.", metadata={"source": "I5"}),
]

In [43]:
embedding_model = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")



In [44]:
vectorstore = FAISS.from_documents(
    documents=all_docs,
    embedding=embedding_model
)

In [45]:
# Create retrievers
similarity_retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 5})

In [46]:
multiquery_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=ChatGoogleGenerativeAI(model="gemini-2.5-flash")
)



In [47]:
# Query

query = "How to improve energy levels and maintain balance?"

In [48]:
# Retrieve results
similarity_results = similarity_retriever.invoke(query)
multiquery_results = multiquery_retriever.invoke(query)

In [49]:
for i, doc in enumerate(similarity_results):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content)

print("*"*150)

for i, doc in enumerate(multiquery_results):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content)


--- Result 1 ---
Drinking sufficient water throughout the day helps maintain metabolism and energy.

--- Result 2 ---
The solar energy system in modern homes helps balance electricity demand.

--- Result 3 ---
Mindfulness and controlled breathing lower cortisol and improve mental clarity.

--- Result 4 ---
Consuming leafy greens and fruits helps detox the body and improve longevity.

--- Result 5 ---
Regular walking boosts heart health and can reduce symptoms of depression.
******************************************************************************************************************************************************

--- Result 1 ---
Drinking sufficient water throughout the day helps maintain metabolism and energy.

--- Result 2 ---
Mindfulness and controlled breathing lower cortisol and improve mental clarity.

--- Result 3 ---
Consuming leafy greens and fruits helps detox the body and improve longevity.

--- Result 4 ---
Regular walking boosts heart health and can reduce sympt
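
`MultiQueryRetriever` asks the LLM for several rephrasings of the query, runs the base retriever once per rephrasing, and returns the de-duplicated union of the results. The sketch below illustrates only that merge step; the query variants and their results are hard-coded stand-ins for what an LLM and a retriever might produce, and `unique_union` is a name chosen for this example.

```python
def unique_union(result_lists):
    """Merge several ranked result lists, keeping the first occurrence of each item."""
    seen = set()
    merged = []
    for results in result_lists:
        for text in results:
            if text not in seen:
                seen.add(text)
                merged.append(text)
    return merged


# Invented query variants mapped to invented retrieval results.
results_per_variant = {
    "How can I boost my energy?": ["water", "walking"],
    "Ways to stay balanced and energetic": ["mindfulness", "water"],
    "Habits for steady energy": ["sleep", "walking"],
}

print(unique_union(results_per_variant.values()))
```

Because each variant pulls in its own neighbours, the merged list can cover aspects of the question that a single phrasing misses.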

## ContextualCompressionRetriever

In [55]:
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_core.documents import Document

In [51]:
# Recreate the document objects from the previous data
docs = [
    Document(page_content=(
        """The Grand Canyon is one of the most visited natural wonders in the world.
        Photosynthesis is the process by which green plants convert sunlight into energy.
        Millions of tourists travel to see it every year. The rocks date back millions of years."""
    ), metadata={"source": "Doc1"}),

    Document(page_content=(
        """In medieval Europe, castles were built primarily for defense.
        The chlorophyll in plant cells captures sunlight during photosynthesis.
        Knights wore armor made of metal. Siege weapons were often used to breach castle walls."""
    ), metadata={"source": "Doc2"}),

    Document(page_content=(
        """Basketball was invented by Dr. James Naismith in the late 19th century.
        It was originally played with a soccer ball and peach baskets. NBA is now a global league."""
    ), metadata={"source": "Doc3"}),

    Document(page_content=(
        """The history of cinema began in the late 1800s. Silent films were the earliest form.
        Thomas Edison was among the pioneers. Photosynthesis does not occur in animal cells.
        Modern filmmaking involves complex CGI and sound design."""
    ), metadata={"source": "Doc4"})
]

In [56]:
embedding_model = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")



In [53]:
# Create a FAISS vector store from the documents
vectorstore = FAISS.from_documents(docs, embedding_model)

In [54]:
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

In [57]:
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash")



In [58]:
compressor = LLMChainExtractor.from_llm(llm)

In [59]:
compression_retriever = ContextualCompressionRetriever(
    base_retriever=base_retriever,
    base_compressor=compressor
)

In [60]:
# Query the retriever
query = "What is photosynthesis?"
compressed_results = compression_retriever.invoke(query)

In [61]:
for i, doc in enumerate(compressed_results):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content)


--- Result 1 ---
Photosynthesis is the process by which green plants convert sunlight into energy.

--- Result 2 ---
The chlorophyll in plant cells captures sunlight during photosynthesis.
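
`LLMChainExtractor` sends each retrieved document to the LLM together with the query and keeps only the passages the model judges relevant; documents with nothing relevant are dropped. The sketch below imitates that behaviour with a crude keyword filter instead of an LLM, so it is a stand-in for the idea rather than the library's method. `extract_relevant` and `compress` are hypothetical names, and an LLM may judge differently (the output above, for example, omits the Doc4 sentence that a keyword match would keep).

```python
import re


def extract_relevant(text: str, keyword: str) -> str:
    """Keep only the sentences of `text` that mention `keyword` (case-insensitive)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [s.strip() for s in sentences if keyword.lower() in s.lower()]
    return " ".join(kept)


def compress(texts, keyword):
    """Apply extract_relevant to each text and drop texts with nothing left."""
    compressed = (extract_relevant(t, keyword) for t in texts)
    return [c for c in compressed if c]


sample_texts = [
    "The Grand Canyon is one of the most visited natural wonders in the world. "
    "Photosynthesis is the process by which green plants convert sunlight into energy. "
    "Millions of tourists travel to see it every year.",
    "Basketball was invented by Dr. James Naismith in the late 19th century. "
    "NBA is now a global league.",
]

print(compress(sample_texts, "photosynthesis"))
```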
