# 13 — Vector Stores (Chroma) — Basics

A **Vector Store** (vector database) stores **embeddings** (numerical vectors) and lets you search them efficiently.

Instead of matching literal words, it compares **semantic vectors**:
- Similar meanings → vectors are close together.
- Different meanings → vectors are far apart.
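
To make "close" and "far" concrete, here is a minimal sketch using cosine similarity, one common way to compare vectors. The three-dimensional vectors are invented for illustration (real embeddings have hundreds or thousands of dimensions), and a vector store may be configured with a different distance metric.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", made up for this example only.
solar_panel = [0.9, 0.1, 0.0]
photovoltaic_cell = [0.8, 0.2, 0.1]
banana_bread = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(solar_panel, photovoltaic_cell)
sim_unrelated = cosine_similarity(solar_panel, banana_bread)

print(f"related:   {sim_related:.3f}")
print(f"unrelated: {sim_unrelated:.3f}")
```

A score near 1 means the vectors point in almost the same direction; a score near 0 means they have little in common.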

This is great for:
- **Semantic retrieval** (not just keyword matching).
- Finding relevant text chunks in large corpora.
- Powering assistants and chatbots with dynamic context.

In [3]:
# ╔══════════════════════════════════════════════════════╗
# ║ Setup: Load environment variables & initialize model ║
# ╚══════════════════════════════════════════════════════╝

import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

from langchain_openai import ChatOpenAI
# from langchain_groq import ChatGroq

# Not strictly needed for this notebook, but kept for consistency across the course
chat_model = ChatOpenAI(model="gpt-4o-mini")
# chat_model = ChatGroq(model="llama-3.1-70b-versatile")

print("✅ Environment loaded.")

✅ Environment loaded.


## Workflow (step by step)

1. **Load** a document (e.g., `renewable_energy.txt`).
2. **Split** text into manageable chunks.
3. **Embed** each chunk with a model (e.g., `OpenAIEmbeddings`).
4. **Index** embeddings in a vector store (here: **Chroma**).
5. **Query** with semantic similarity and inspect the top results.

### Dependencies
You will need the following (already included in your environment if you followed the course setup):
- `langchain-text-splitters`
- `langchain-community`
- `langchain-chroma`
- `chromadb`
- `langchain-openai`

In [4]:
# 1) Load a document (with a safe fallback if the file is missing)
from langchain_community.document_loaders import TextLoader

filepath = "data/renewable_energy.txt"
docs = []
if os.path.exists(filepath):
    print(f"Loading file: {filepath}")
    docs = TextLoader(filepath, encoding="utf-8").load()
else:
    print("⚠️ File not found. Using a small fallback text instead.")
    fallback_text = (
        "Solar energy converts sunlight directly into electricity using photovoltaic cells. "
        "Wind energy uses turbines to harness the kinetic power of air currents. "
        "Both sources help reduce carbon emissions, and research is improving battery storage "
        "to handle the intermittent nature of renewables."
    )
    from langchain_core.documents import Document
    docs = [Document(page_content=fallback_text)]

print(f"Loaded {len(docs)} document(s). Preview:\n", docs[0].page_content[:200], "...\n")

Loading file: data/renewable_energy.txt
Loaded 1 document(s). Preview:
 Solar energy is one of the most promising renewable energy sources today.
It converts sunlight directly into electricity using photovoltaic cells.
Over the last decade, the cost of solar panels has dr ...



In [None]:
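# 2) Split into chunks
from langchain_text_splitters import CharacterTextSplitter

# chunk_size and chunk_overlap are measured in characters.
# CharacterTextSplitter splits only on its separator ("\n\n" by default),
# so a chunk can exceed chunk_size if the text has no such separator.
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20)
chunked_docs = text_splitter.split_documents(docs)
print(f"Created {len(chunked_docs)} chunk(s).")
print("First chunk preview:\n", chunked_docs[0].page_content[:200], "...\n")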

Created 1 chunk(s).
First chunk preview:
 Solar energy is one of the most promising renewable energy sources today.
It converts sunlight directly into electricity using photovoltaic cells.
Over the last decade, the cost of solar panels has dr ...
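
To see what `chunk_size` and `chunk_overlap` mean in isolation, here is a simplified sketch of a fixed-size sliding-window chunker. It is not how `CharacterTextSplitter` works internally (that class splits on a separator and then merges pieces); `chunk_text` is a hypothetical helper written only for this illustration.

```python
def chunk_text(text, chunk_size=200, chunk_overlap=20):
    """Cut text into character windows of at most chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so consecutive windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# Repeated sentence used as sample input, made up for this example.
sample = "Solar energy converts sunlight into electricity. " * 10
chunks = chunk_text(sample, chunk_size=200, chunk_overlap=20)

print(f"{len(chunks)} chunk(s)")
print("End of chunk 1:  ", repr(chunks[0][-20:]))
print("Start of chunk 2:", repr(chunks[1][:20]))
```

The overlap repeats the tail of one chunk at the head of the next, which reduces the chance that a sentence cut at a boundary loses its context.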



In [6]:
# 3) Create embeddings
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
print("Embeddings model ready.")

Embeddings model ready.


In [7]:
# 4) Index in a vector store (Chroma)
from langchain_chroma import Chroma

# This index lives in memory only; pass persist_directory to from_documents to save it to disk
vector_db = Chroma.from_documents(documents=chunked_docs, embedding=embeddings)
print("Chroma index built.")

Chroma index built.


## 5) Similarity search

When you search, the **query** is embedded and compared against the stored vectors. The store returns the **most similar** chunks.
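
The embed, compare, rank mechanics can be sketched without any external service. The example below is an assumption-heavy toy: `embed` is a stand-in that counts words, so it only matches shared vocabulary and does not capture meaning the way a real embedding model does, and `top_k` is a hypothetical helper, not Chroma's implementation.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return Counter(w for w in words if w)

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, index, k=3):
    """Embed the query, score every stored vector, return the k best texts."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Sample chunks made up for this example.
chunks = [
    "Solar energy converts sunlight into electricity using photovoltaic cells.",
    "Wind energy uses turbines to harness air currents.",
    "Lithium-ion batteries store electricity for later use.",
]

# "Indexing" here is just storing each chunk next to its vector.
index = [(chunk, embed(chunk)) for chunk in chunks]

top = top_k("How do turbines capture wind energy?", index, k=2)
for rank, text in enumerate(top, 1):
    print(f"[{rank}] {text}")
```

A real vector store follows the same outline but replaces the word counts with dense embeddings and the full scan with an index built for nearest-neighbor lookup.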

In [8]:
question = "What technologies are used to generate renewable energy?"
results = vector_db.similarity_search(question, k=3)

for i, doc in enumerate(results, 1):
    print(f"[{i}]\n{doc.page_content}\n---")

[1]
Solar energy is one of the most promising renewable energy sources today.
It converts sunlight directly into electricity using photovoltaic cells.
Over the last decade, the cost of solar panels has dropped by nearly 80%.

Wind energy, on the other hand, uses turbines to transform the kinetic energy
of air currents into mechanical power, which can then be converted into electricity.

Both solar and wind energy play a crucial role in reducing greenhouse gas emissions.
Governments around the world are investing heavily in renewable technologies
to achieve carbon neutrality by 2050.

However, renewable energy sources are intermittent. This is why new research
focuses on improving energy storage systems, such as lithium-ion and solid-state batteries.
---


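### Interpreting the result

For the query:

> *What technologies are used to generate renewable energy?*

the run above returned a single result even though `k=3` was requested. The store holds only one chunk (the whole document), so a search can return at most one result. `CharacterTextSplitter` splits only on its separator (`"\n\n"` by default), so the number of chunks depends on how the file is formatted, not only on `chunk_size`.

With finer-grained chunks, a typical retrieved passage might be:

```
Solar energy converts sunlight directly into electricity using photovoltaic cells.
Wind energy uses turbines to harness air currents and generate mechanical power.
Both technologies play a key role in reducing greenhouse gas emissions worldwide.
```

Such a passage is **semantically relevant** to the query: retrieval ranks chunks by meaning rather than by exact keyword overlap.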

## Why a Vector Store is useful

- **Semantic search:** Find what the text *means*, not only exact words.
- **Scale:** Handles thousands to millions of chunks.
- **Speed:** Uses indexes built for nearest-neighbor lookup over vectors, so queries stay fast as the collection grows.
- **Use cases:**
  - Chatbots with document memory.
  - RAG (Retrieval-Augmented Generation) pipelines.
  - Internal search engines.
  - Knowledge analysis at scale.

## Summary

- Vector stores keep **embeddings** and support **fast semantic search**.
- The basic pipeline is: **load → split → embed → index → search**.
- You’ve built a tiny but complete retrieval flow using **Chroma**.

Next steps: wire this retrieval step into a **RAG** chain and feed retrieved chunks into an LLM prompt.
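
As a rough sketch of that next step, the snippet below assembles retrieved chunks into a prompt string. `build_rag_prompt`, the prompt wording, and the sample chunks are assumptions made for illustration; they are not a LangChain API.

```python
def build_rag_prompt(question, chunks):
    """Join retrieved chunks into a numbered context block and append the question."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, 1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Sample chunks made up for this example; in the notebook they would come from the vector store.
retrieved = [
    "Solar energy converts sunlight into electricity using photovoltaic cells.",
    "Wind energy uses turbines to harness air currents.",
]

prompt = build_rag_prompt("What technologies generate renewable energy?", retrieved)
print(prompt)
```

In this notebook, `retrieved` could be built as `[doc.page_content for doc in results]`, and the resulting string passed to `chat_model.invoke(prompt)`.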