# 14 — Using a Vector Store as a Retriever

A **retriever** is a unified interface that, given a query, returns the **most relevant Documents** from an index — usually a **vector store** like Chroma.

Instead of calling `similarity_search()` manually, we can convert a vector store into a retriever with `.as_retriever()` and use it as part of a RAG (Retrieval-Augmented Generation) pipeline.

✅ **Advantage:** you can easily change search strategy (similarity, MMR, threshold, filters) without rewriting logic.

In [4]:
# ╔══════════════════════════════════════════════════════╗
# ║ Setup: Load environment variables & import libraries ║
# ╚══════════════════════════════════════════════════════╝

import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

print("✅ Environment loaded.")

✅ Environment loaded.


## 1️⃣ Build or reuse a vector store

We'll reuse the logic from the previous notebook (embedding and storing documents in Chroma).

In [5]:
# Dummy document fallback (if not loaded before)
from langchain_core.documents import Document
docs = [
    Document(page_content=(
        "Solar energy is one of the most promising renewable energy sources today. "
        "It converts sunlight directly into electricity using photovoltaic cells. "
        "Wind energy uses turbines to transform air currents into mechanical power. "
        "Governments are investing heavily in renewables to reach carbon neutrality by 2050. "
        "Research also focuses on improving energy storage systems such as lithium-ion batteries."
    ), metadata={"source": "renewable_energy.txt"})
]

embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embeddings)

print("Vector store ready with 1 document.")

Vector store ready with 1 document.


## 2️⃣ Create a retriever

The retriever wraps the vector store and standardizes access to relevant documents.
By default, it uses **similarity search** and returns up to 4 documents (`k=4`); a store holding fewer chunks returns fewer.

In [6]:
retriever = vectorstore.as_retriever(
    search_type="similarity",   # 'similarity' | 'mmr' | 'similarity_score_threshold'
    search_kwargs={"k": 4}      # number of documents to return
)

query = "What technologies are used to generate renewable energy?"
response = retriever.invoke(query)  # separate name so the input `docs` list is not overwritten

print(f"Retrieved {len(response)} documents.")
print(response[0].page_content)
print(response[0].metadata)

Retrieved 2 documents.
Solar energy is one of the most promising renewable energy sources today. It converts sunlight directly into electricity using photovoltaic cells. Wind energy uses turbines to transform air currents into mechanical power. Governments are investing heavily in renewables to reach carbon neutrality by 2050. Research also focuses on improving energy storage systems such as lithium-ion batteries.
{'source': 'renewable_energy.txt'}


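If the count is higher than the number of documents you added (here, 2 for a single input document), the most likely cause is that the cell building the vector store ran more than once in the same session, so the in-memory Chroma collection holds duplicate copies.

Each element in the response is a **Document** with: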
- `page_content` → the text chunk.
- `metadata` → info such as file path, page, section, etc.
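
These two fields are enough to build a context block with citations for a prompt. The sketch below is one possible way to do it; `SimpleDoc` and `format_context` are hypothetical names introduced here, and `SimpleDoc` is only a stand-in so the example runs without LangChain. A real `Document` exposes the same two attributes.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a LangChain Document (same two attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(documents):
    """Join chunks into one string, tagging each chunk with its source."""
    parts = []
    for i, doc in enumerate(documents, start=1):
        source = doc.metadata.get("source", "unknown")
        parts.append(f"[{i}] (source: {source})\n{doc.page_content}")
    return "\n\n".join(parts)


retrieved = [
    SimpleDoc(
        "Solar energy converts sunlight into electricity.",
        {"source": "renewable_energy.txt"},
    )
]
print(format_context(retrieved))
```

Numbering each chunk lets the LLM cite `[1]`, `[2]`, and so on, which you can map back to `metadata["source"]`.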

## 3️⃣ Adjusting retrieval strategy

### a) Return more (or fewer) chunks
```python
retriever = vectorstore.as_retriever(search_kwargs={"k": 6})
```

### b) Use MMR (Maximal Marginal Relevance)
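This strategy increases **diversity** by avoiding redundant results: it first fetches `fetch_k` candidates by similarity, then selects `k` of them that are relevant to the query but dissimilar to each other.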
```python
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 6, "fetch_k": 20}
)
```
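
To make the idea behind MMR concrete, here is a minimal pure-Python sketch of the greedy selection step. It is an illustration, not LangChain's implementation: `cosine` and `mmr_select` are hypothetical helpers, the 2-D vectors are toy "embeddings", and `lambda_mult` is used here as the relevance/diversity weight (1.0 means pure relevance).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, cand_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k candidate indices, trading relevance against redundancy."""
    selected = []
    remaining = list(range(len(cand_vecs)))

    def mmr_score(i):
        relevance = cosine(query_vec, cand_vecs[i])
        # Redundancy = similarity to the closest already-selected candidate.
        redundancy = max(
            (cosine(cand_vecs[i], cand_vecs[j]) for j in selected), default=0.0
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 points elsewhere.
query_vec = [1.0, 1.0]
candidates = [[1.0, 0.3], [1.0, 0.29], [0.2, 1.0]]

print(mmr_select(query_vec, candidates, k=2, lambda_mult=0.5))  # → [0, 2]
print(mmr_select(query_vec, candidates, k=2, lambda_mult=1.0))  # → [0, 1]
```

With a balanced weight the near-duplicate (index 1) is skipped in favor of the more distinct candidate; with pure relevance the two near-duplicates are both returned.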

### c) Use similarity with a minimum score threshold
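Discard results whose relevance score (normalized to 0–1, higher is closer) falls below `score_threshold`, even if that leaves fewer than `k` results.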
```python
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.2, "k": 8}
)
```
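
The filtering logic can be sketched in plain Python as follows. This is a simplified illustration under the assumption that scores are relevance scores in the 0–1 range where higher means closer; `filter_by_threshold` and the sample scores are hypothetical.

```python
def filter_by_threshold(scored_docs, score_threshold, k):
    """Take the top-k (text, score) pairs, then drop those below the threshold."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [(text, score) for text, score in ranked[:k] if score >= score_threshold]


# Made-up relevance scores for illustration only.
scored = [
    ("cooking recipes", 0.11),
    ("solar panels", 0.82),
    ("wind turbines", 0.64),
]
kept = filter_by_threshold(scored, score_threshold=0.2, k=8)
print(kept)  # → [('solar panels', 0.82), ('wind turbines', 0.64)]
```

Note that `k` acts as an upper bound: the threshold can only shrink the result list, never grow it.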

### d) Filter by metadata (when supported)
```python
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"source": "renewable_energy.txt"}
    }
)
```

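## 4️⃣ Typical retriever output

Assuming `response = retriever.invoke(query)` with `k=4` and a store that holds at least four chunks: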

```python
len(response)   # → 4
response[0]     # → Document(...)
response        # → [Document(...), Document(...), Document(...), Document(...)]
```

Example of one Document:
```python
Document(
    page_content="Solar energy is one of the most promising renewable energy sources today...",
    metadata={'source': 'renewable_energy.txt'}
)
```

The retriever found the relevant text **by meaning**, not by exact words.

## 5️⃣ Why use retrievers instead of calling `.similarity_search()` directly?

- The retriever gives you a **standard interface** that LangChain chains and tools expect.
- You can **swap strategies** (similarity, MMR, threshold) easily.
- It integrates natively with RAG chains, evaluators, and agents.

In short → `retriever = vectorstore.as_retriever()` is a lightweight adapter layer.
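
To illustrate what "adapter layer" means, here is a toy sketch of the pattern. `ToyStore` and `ToyRetriever` are hypothetical classes written for this example (the store ranks by simple word overlap, not embeddings); they only mimic the shape of the real interface, where calling code depends on `invoke()` and nothing else.

```python
class ToyStore:
    """Hypothetical in-memory store that ranks texts by word overlap with the query."""

    def __init__(self, texts):
        self.texts = texts

    def similarity_search(self, query, k=4):
        words = set(query.lower().split())
        ranked = sorted(
            self.texts,
            key=lambda text: len(words & set(text.lower().split())),
            reverse=True,
        )
        return ranked[:k]


class ToyRetriever:
    """Adapter: hides the store's method and search settings behind invoke()."""

    def __init__(self, store, search_kwargs=None):
        self.store = store
        self.search_kwargs = search_kwargs or {}

    def invoke(self, query):
        return self.store.similarity_search(query, **self.search_kwargs)


store = ToyStore([
    "solar panels use sunlight",
    "wind turbines use air currents",
    "batteries store energy",
])
toy_retriever = ToyRetriever(store, search_kwargs={"k": 1})
print(toy_retriever.invoke("how do wind turbines work"))
# → ['wind turbines use air currents']
```

Because downstream code only calls `invoke(query)`, changing `search_kwargs` or swapping the store does not require touching that code.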

## 🔧 Best practices

- **Chunk size:** 700–1200 chars with ~200 overlap is a reasonable starting point for long docs; tune it against your own corpus and queries.
- **Metadata:** always store `source`, `page`, or `section` → useful for citations.
- **Inspect results:** check a few `.page_content` outputs to validate relevance.
- **MMR:** improves result diversity (less redundancy), at some cost to raw relevance.
- **Thresholds:** tune `score_threshold` to filter low-confidence matches.
- **Compression (advanced):** use a *context compressor* to shorten retrieved chunks before sending them to the LLM (useful when token limits are tight).
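
As a rough illustration of the chunk-size and overlap advice above, here is a minimal fixed-width splitter in plain Python. `chunk_text` is a hypothetical helper; in practice a text splitter such as LangChain's `RecursiveCharacterTextSplitter` is usually a better choice because it tries to break on natural separators instead of at arbitrary character offsets.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size chunks where consecutive chunks share `overlap` chars."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


# Synthetic 2500-character text, for illustration only.
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample, chunk_size=1000, overlap=200)
print([len(c) for c in chunks])  # → [1000, 1000, 900]
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighboring chunk, as long as it is shorter than the overlap.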

## ✅ Summary

- A **retriever** wraps your vector store to return relevant Documents for a query.
- It’s flexible, composable, and plug-and-play for RAG chains.
- You can change retrieval logic (similarity, MMR, thresholds) without touching LLM code.
- The result is a list of Documents with text + metadata → ready to feed into a prompt.