## LangChain Indexing API - A Complete Guide

### What is the Indexing API?

The **LangChain Indexing API** helps you efficiently sync your documents into a vector store. It solves critical problems when working with document embeddings:

1. **Avoids duplicate content** - Prevents writing the same document multiple times
2. **Saves embedding costs** - Skips re-computing embeddings for unchanged content
3. **Handles updates intelligently** - Only re-embeds content that has actually changed
4. **Manages deletions** - Removes stale documents that no longer exist in your source
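
Points 2 and 3 rest on fingerprinting each document so that unchanged content can be recognized. Here is a minimal sketch of that idea, assuming a SHA-256 hash over content plus metadata. The helpers `fingerprint` and `plan_sync` are hypothetical names for illustration, and LangChain's own hashing scheme may differ:

```python
import hashlib
import json


def fingerprint(page_content: str, metadata: dict) -> str:
    """Hypothetical helper: stable hash of a document's text and metadata."""
    payload = json.dumps(
        {"page_content": page_content, "metadata": metadata}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def plan_sync(docs: list, known_hashes: set) -> dict:
    """Split (text, metadata) pairs into those to embed and those to skip."""
    to_embed, skipped = [], []
    seen = set()
    for text, meta in docs:
        h = fingerprint(text, meta)
        if h in known_hashes or h in seen:
            skipped.append(text)   # already indexed, or repeated in this batch
        else:
            to_embed.append(text)  # new or edited content
            seen.add(h)
    return {"to_embed": to_embed, "skipped": skipped}


already_indexed = {fingerprint("hours: 11-23", {"source": "faq.txt"})}
plan = plan_sync(
    [
        ("hours: 11-23", {"source": "faq.txt"}),  # unchanged
        ("hours: 12-22", {"source": "faq.txt"}),  # edited, so the hash differs
    ],
    already_indexed,
)
print(plan)
```

Only the edited text lands in `to_embed`, which is where the embedding cost savings come from.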

### Key Components

| Component | Purpose |
|-----------|---------|
| `RecordManager` | Tracks which documents have been indexed (stores hashes & timestamps) |
| `VectorStore` | Stores the actual document embeddings |
| `index()` function | Orchestrates the sync process between source documents and vector store |

### Cleanup Modes

The `cleanup` parameter controls how the indexer handles document changes:

| Mode | Behavior |
|------|----------|
| `None` | Only adds new documents, never deletes anything |
| `"incremental"` | Deletes outdated versions when source documents are re-indexed |
| `"full"` | Deletes ALL documents not in the current batch (complete sync) |
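
To make the three modes concrete, here is a toy model of the sync step in plain Python. It is an illustration under simplifying assumptions (content strings stand in for hashes, and a dict stands in for both the record manager and the vector store), not LangChain's implementation. `toy_index` is a hypothetical name:

```python
def toy_index(batch, store, cleanup=None):
    """Toy sync: `batch` is a list of (content, source); `store` maps content -> source."""
    touched = set()
    added = skipped = 0
    for content, source in batch:
        if content in store:
            skipped += 1
        else:
            store[content] = source
            added += 1
        touched.add(content)

    if cleanup == "incremental":
        # Only sources that appear in this batch are cleaned up.
        batch_sources = {source for _, source in batch}
        stale = [c for c, s in store.items() if s in batch_sources and c not in touched]
    elif cleanup == "full":
        # Everything not in this batch is removed, whatever its source.
        stale = [c for c in store if c not in touched]
    else:
        stale = []
    for content in stale:
        del store[content]
    return {"num_added": added, "num_skipped": skipped, "num_deleted": len(stale)}


outcomes = {}
for mode in (None, "incremental", "full"):
    store = {}
    toy_index([("a v1", "a.txt"), ("b v1", "b.txt")], store)      # initial sync
    result = toy_index([("a v2", "a.txt")], store, cleanup=mode)  # a.txt was edited
    outcomes[mode] = (result["num_deleted"], sorted(store))
    print(mode, outcomes[mode])
```

With `None` the old `a v1` stays behind, with `"incremental"` only `a v1` is removed, and with `"full"` the untouched `b v1` is removed as well.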

### Step 1: Environment Setup

Load environment variables (like `OPENAI_API_KEY`) from a `.env` file. This keeps sensitive credentials out of your code.


In [1]:
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

True

### Step 2: Initialize Vector Store and Embeddings

We'll use:
- **OpenAIEmbeddings**: Converts text into numerical vectors using OpenAI's embedding model
- **PGVector**: A PostgreSQL-based vector store that supports similarity search

> **Note**: Make sure PostgreSQL with the pgvector extension is running. You can use the provided `docker-compose.yaml` to spin up a local instance.


In [None]:
# Import required components
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter  # current home of the text splitters
from langchain_community.document_loaders import TextLoader
from langchain_postgres.vectorstores import PGVector

# Initialize the embedding model (uses OpenAI's text-embedding-ada-002 by default)
embeddings = OpenAIEmbeddings()

# PostgreSQL connection string format: postgresql+psycopg://user:password@host:port/database
CONNECTION_STRING = "postgresql+psycopg://admin:admin@127.0.0.1:5432/vectordb"

# Collection name acts like a "table" within the vector database
COLLECTION_NAME = "vectordb"

# Initialize PGVector - this creates the necessary tables if they don't exist
vectorstore = PGVector(
    collection_name=COLLECTION_NAME,  # Logical grouping of vectors
    connection=CONNECTION_STRING,      # Database connection
    embeddings=embeddings,             # Embedding model to use
)
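
The connection string above hardcodes the credentials of the local demo database. Since Step 1 already loads a `.env` file, one option is to assemble the URL from environment variables instead. This is a sketch: the variable names (`PGUSER`, `PGPASSWORD`, and so on) and the helper `build_connection_string` are assumptions, not something this notebook defines:

```python
import os
from urllib.parse import quote_plus


def build_connection_string(env=os.environ) -> str:
    """Assemble a postgresql+psycopg URL from assumed PG* variables."""
    user = env.get("PGUSER", "admin")
    # Escape characters such as "@" or ":" that would break the URL.
    password = quote_plus(env.get("PGPASSWORD", "admin"))
    host = env.get("PGHOST", "127.0.0.1")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "vectordb")
    return f"postgresql+psycopg://{user}:{password}@{host}:{port}/{database}"


# A plain dict stands in for the environment in this demo.
print(build_connection_string({"PGPASSWORD": "p@ss:word"}))
```

The defaults reproduce the demo values, so the rest of the notebook behaves the same when no variables are set.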

### Step 3: Load and Split Documents

Before indexing, we need to:
1. **Load** the source document (a text file about "Bella Vista" restaurant)
2. **Split** it into smaller chunks for better retrieval

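The `CharacterTextSplitter` splits on a single separator (a blank line, `"\n\n"`, by default) and merges the resulting pieces up to `chunk_size`. A piece that is already longer than `chunk_size` is kept whole, which is why the warnings below report chunks larger than 150 characters.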
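
A simplified sketch of split-then-merge shows why a chunk can end up longer than `chunk_size`. It ignores `chunk_overlap`, and `simple_split` is a hypothetical stand-in rather than the library's code:

```python
def simple_split(text: str, chunk_size: int, separator: str = "\n\n") -> list:
    """Split on `separator`, then greedily merge pieces up to `chunk_size`.

    A piece longer than `chunk_size` is kept whole, because there is no
    separator inside it to split on.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate          # still fits, keep merging
        else:
            if current:
                chunks.append(current)   # close the chunk built so far
            current = piece              # start a new chunk, even if oversized
    if current:
        chunks.append(current)
    return chunks


text = "short one\n\nshort two\n\n" + "x" * 40
chunks = simple_split(text, chunk_size=25)
print([len(c) for c in chunks])  # [20, 40]
```

The two short paragraphs are merged into one 20-character chunk, while the 40-character paragraph exceeds the limit of 25 and is emitted as is.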

In [19]:
# Load the source document
loader = TextLoader("./bella_vista.txt")
documents = loader.load()

# Split into chunks:
# - chunk_size=150: Target size for each chunk (in characters)
# - chunk_overlap=20: Characters shared between adjacent chunks (helps maintain context)
text_splitter = CharacterTextSplitter(chunk_size=150, chunk_overlap=20)
docs = text_splitter.split_documents(documents)

# See how many chunks were created
print(f"Document split into {len(docs)} chunks")

Created a chunk of size 177, which is longer than the specified 150
Created a chunk of size 229, which is longer than the specified 150
Created a chunk of size 233, which is longer than the specified 150
Created a chunk of size 206, which is longer than the specified 150
Created a chunk of size 203, which is longer than the specified 150
Created a chunk of size 299, which is longer than the specified 150


Document split into 7 chunks


In [21]:
for elem in docs:
    print(elem)

page_content='Q: What are the hours of operation for Bella Vista?
A: Bella Vista is open from 11 a.m. to 11 p.m. from Monday to Saturday. On Sundays, we welcome guests from 12 p.m. to 10 p.m.' metadata={'source': './bella_vista.txt'}
page_content='Q: What type of cuisine does Bella Vista serve?
A: Bella Vista offers a delightful blend of Mediterranean and contemporary American cuisine. We pride ourselves on using the freshest ingredients, many of which are sourced locally.' metadata={'source': './bella_vista.txt'}
page_content='Q: Do you offer vegetarian or vegan options at Bella Vista?
A: Absolutely! Bella Vista boasts a diverse menu that includes a variety of vegetarian and vegan dishes. Our chefs are also happy to customize dishes based on dietary needs.' metadata={'source': './bella_vista.txt'}
page_content='Q: Is Bella Vista family-friendly? sdoasdokasdoaskodosa
A: Yes, Bella Vista is a family-friendly establishment. We have a dedicated kids' menu and offer high chairs and booster

### Step 4: Set Up the Record Manager

The **Record Manager** is the brain of the Indexing API. It:
- Maintains a registry of all indexed documents
- Stores content hashes to detect changes
- Tracks timestamps to determine document freshness

We use `SQLRecordManager`, which stores this metadata in a SQL database (the same PostgreSQL instance in this case).
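
As a rough picture of how hashes and timestamps work together, the sketch below keeps a tracking table in an in-memory SQLite database. The table name, the column names, and the helpers `upsert` and `stale_keys` are illustrative assumptions and not necessarily what `SQLRecordManager` creates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE records (
        key        TEXT PRIMARY KEY,  -- content hash of the chunk
        group_id   TEXT,              -- source the chunk came from
        updated_at REAL               -- time of the run that last saw it
    )
    """
)


def upsert(keys_and_groups, run_time):
    """Insert new keys, or refresh the timestamp of keys seen again."""
    conn.executemany(
        "INSERT OR REPLACE INTO records (key, group_id, updated_at) VALUES (?, ?, ?)",
        [(key, group, run_time) for key, group in keys_and_groups],
    )


def stale_keys(group_ids, before):
    """Keys from the given sources that were not touched since `before`."""
    marks = ",".join("?" for _ in group_ids)
    rows = conn.execute(
        f"SELECT key FROM records WHERE group_id IN ({marks}) "
        "AND updated_at < ? ORDER BY key",
        [*group_ids, before],
    )
    return [row[0] for row in rows]


upsert([("hash-a", "faq.txt"), ("hash-b", "faq.txt")], run_time=1.0)  # first run
upsert([("hash-a", "faq.txt"), ("hash-c", "faq.txt")], run_time=2.0)  # second run
print(stale_keys(["faq.txt"], before=2.0))  # ['hash-b']
```

`hash-b` was not seen by the second run, so its timestamp is older than the run and it becomes a candidate for cleanup.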


In [None]:
# Import the Indexing API components
from langchain.indexes import SQLRecordManager, index
# SQLRecordManager: Tracks document state in a SQL database
# index: The main function that syncs documents to the vector store

In [None]:
# Namespace uniquely identifies this index
# Format: "vectorstore_type/collection_name" - helps manage multiple indices
namespace = f"pgvector/{COLLECTION_NAME}"

# Create the record manager
# It will create a table to track document hashes, sources, and timestamps
record_manager = SQLRecordManager(
    namespace,                    # Unique identifier for this index
    db_url=CONNECTION_STRING      # Where to store the tracking metadata
)

In [22]:
# IMPORTANT: Create the database schema (tables) for the record manager
# This only needs to be done once - it's safe to call multiple times
# Creates a table that stores, per record: key (the document hash), namespace, group_id (the source), updated_at
record_manager.create_schema()

### Step 5: Initial Indexing with `cleanup=None`

Now let's index our documents! We'll start with `cleanup=None` mode.

**What happens with `cleanup=None`:**
- New documents are added to the vector store
- Existing documents (same content hash) are skipped
- **Nothing is ever deleted** - old/stale documents remain

This mode is useful when you only want to add new content without affecting existing data.

In [None]:
# Reload and re-split the document to see what we're working with
loader = TextLoader("./bella_vista.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=150, chunk_overlap=20)
docs = text_splitter.split_documents(documents)

# Preview each chunk - notice the 'source' in metadata
# The source is critical for incremental cleanup mode!
for i, doc in enumerate(docs):
    print(f"--- Chunk {i+1} ---")
    print(doc)
    print()

Created a chunk of size 177, which is longer than the specified 150
Created a chunk of size 229, which is longer than the specified 150
Created a chunk of size 233, which is longer than the specified 150
Created a chunk of size 206, which is longer than the specified 150
Created a chunk of size 203, which is longer than the specified 150
Created a chunk of size 299, which is longer than the specified 150



**First Index Run** - Adding 7 new documents with `cleanup=None`


In [23]:
# Index the documents with cleanup=None
result = index(
    docs,                    # Documents to index
    record_manager,          # Tracks what's been indexed
    vectorstore,             # Where to store embeddings
    cleanup=None,            # No deletion - only add new docs
    source_id_key="source",  # Metadata key that identifies the document source
)

# Result shows what happened:
# - num_added: New documents embedded and stored (changed content gets a new hash, so it counts here too)
# - num_updated: Already-indexed documents that were rewritten anyway (e.g. with force_update=True)
# - num_skipped: Unchanged documents (already indexed)
# - num_deleted: Removed documents (only with cleanup modes)
print(f"Result: {result}")

Result: {'num_added': 7, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}


### Step 6: Simulating Document Changes

Let's simulate real-world scenarios where documents change:
1. **Modify** an existing document (update content)
2. **Delete** a document from the source
3. **Add** a new document

This demonstrates how the Indexing API handles each case.


In [24]:
from langchain_core.documents import Document

# Simulate document changes:

# 1. UPDATE: Modify the content of an existing document
docs[1].page_content = "updated"

# 2. DELETE: Remove a document from our source (index 6)
del docs[6]

# 3. ADD: Create a brand new document with a different source
docs.append(Document(
    page_content="new content", 
    metadata={"source": "important"}  # Note: different source!
))

print(f"Now we have {len(docs)} documents (1 deleted, 1 added = net 0 change)")

Now we have 7 documents (1 deleted, 1 added = net 0 change)


**Second Index Run** - Re-indexing after changes with `cleanup=None`

Notice: 
- 2 added (1 updated doc + 1 new doc - both have new content hashes)
- 5 skipped (unchanged documents)
- 0 deleted (cleanup=None never deletes!)

**Problem**: The old version of `docs[1]` and the chunk removed with `del docs[6]` are still in the vector store!


In [25]:
# Re-index with cleanup=None - observe what happens
index(
    docs,
    record_manager,
    vectorstore,
    cleanup=None,            # Still no deletions
    source_id_key="source",
)
# Expected: num_added=2, num_skipped=5, num_deleted=0
# The old/deleted docs are STILL in the vector store, so searches can return stale content

{'num_added': 2, 'num_updated': 0, 'num_skipped': 5, 'num_deleted': 0}

### Step 7: Using `cleanup="incremental"` Mode

Now let's try **incremental cleanup** - this is the recommended mode for most use cases!

**How incremental cleanup works:**
- When a document with a given `source` is re-indexed, all OLD versions from that source are deleted
- Only affects documents from sources that appear in the current batch
- Documents from sources NOT in the batch are left untouched

This is perfect for syncing specific source files without affecting others.


In [26]:
# Make more changes to simulate ongoing document updates:

# Update doc[1] again with new content
docs[1].page_content = "updated again"

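# Delete three more documents. Each `del` shifts the remaining items left,
# so the positions below do not refer to the original chunk numbers.
del docs[2]  # Removes original chunk 2
del docs[3]  # Removes original chunk 4 (indices have shifted by one)
del docs[4]  # Removes the "new content" document added in Step 6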

# Add another new document (same source as the previous new doc)
docs.append(Document(
    page_content="more new content", 
    metadata={"source": "important"}  # Same source as before
))

print(f"After changes: {len(docs)} documents remain")

After changes: 5 documents remain
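
Because each `del` shifts the remaining items to the left, the three deletions in this cell do not remove the items that originally sat at positions 2, 3, and 4. A small replay makes this visible, with labels standing in for the documents (the labels are illustrative, not the real chunk contents):

```python
# Labels stand in for the list as it looked at the end of Step 6.
docs = ["chunk0", "chunk1", "chunk2", "chunk3", "chunk4", "chunk5", "new content"]

removed = []
for position in (2, 3, 4):
    removed.append(docs[position])  # record what is about to be deleted
    del docs[position]

docs.append("more new content")

print(removed)  # ['chunk2', 'chunk4', 'new content']
print(docs)     # ['chunk0', 'chunk1', 'chunk3', 'chunk5', 'more new content']
```

The third deletion hits the `"new content"` document from Step 6, so the batch for the next run no longer contains it.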


**Third Index Run** - Using `cleanup="incremental"`

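Now we'll see deletions! The indexer will:
1. Add the 2 new documents (`"updated again"` and `"more new content"`)
2. Skip the 3 unchanged chunks
3. Delete the 6 tracked documents that are missing from the batch: both earlier versions of `docs[1]`, the three original chunks removed from the list, and the `"new content"` document

Notice: Only documents from sources present in this batch are cleaned up.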


In [27]:
# Index with incremental cleanup - THIS IS THE RECOMMENDED MODE
result = index(
    docs,
    record_manager,
    vectorstore,
    cleanup="incremental",   # Delete old versions from same sources
    source_id_key="source",  # Groups documents by source for cleanup
)

print(f"Result: {result}")
# num_added: New/updated documents
# num_deleted: Old versions removed (from sources in this batch only!)

Result: {'num_added': 2, 'num_updated': 0, 'num_skipped': 3, 'num_deleted': 6}
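
The counts in this result can be reconstructed with set arithmetic over what was tracked before the run and what the batch contains. The sketch below uses labels in place of content hashes. It models the rule described above and does not call LangChain:

```python
SRC = "./bella_vista.txt"

# Records tracked after the second run, as label -> source.
tracked = {f"chunk{i}": SRC for i in range(7)}  # first run: 7 original chunks
tracked["updated"] = SRC                        # second run: edited chunk 1
tracked["new content"] = "important"            # second run: new document

# Batch passed to the third run, as label -> source.
batch = {
    "chunk0": SRC,
    "updated again": SRC,
    "chunk3": SRC,
    "chunk5": SRC,
    "more new content": "important",
}

added = batch.keys() - tracked.keys()
skipped = batch.keys() & tracked.keys()
batch_sources = set(batch.values())
# Incremental cleanup: tracked records from a batch source that the batch no longer contains.
deleted = {
    label
    for label, source in tracked.items()
    if source in batch_sources and label not in batch
}

print(len(added), len(skipped), len(deleted))  # 2 3 6
print(sorted(deleted))
```

Both sources appear in the batch, so stale records from `./bella_vista.txt` and from `important` are removed.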


### Important: Incremental Cleanup with Empty Batch

What happens if we pass an **empty list** with `cleanup="incremental"`?

**Answer**: Nothing gets deleted! Because no sources are in the batch, no sources get cleaned up.

This is a KEY DIFFERENCE from `cleanup="full"` mode.


In [28]:
# Pass an empty list with incremental cleanup
result = index(
    [],                      # Empty document list!
    record_manager,
    vectorstore,
    cleanup="incremental",   # Incremental mode
    source_id_key="source",
)

print(f"Result: {result}")
# All zeros! No sources in the batch = no cleanup happens
# Existing documents are safe because no source was "re-indexed"

Result: {'num_added': 0, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}


### Step 8: Using `cleanup="full"` Mode

**Full cleanup** is the nuclear option - use with caution!

**How it works:**
- Deletes ALL documents tracked by the record manager (in this namespace) that are NOT in the current batch
- Regardless of source - if it's not in the batch, it's deleted

**When to use:**
- Complete re-sync from scratch
- You have ALL your documents in the batch
- You want to ensure the vector store exactly matches your source

**Warning**: Passing an empty list with `cleanup="full"` will DELETE EVERYTHING!
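
One way to guard against an accidental wipe is a thin wrapper that refuses to run a full cleanup on an empty batch unless explicitly allowed. This is a sketch: `guarded_full_index` is a hypothetical helper, it assumes `docs` is a plain iterable of documents, and `fake_index` exists only so the demo runs without a database:

```python
def guarded_full_index(index_fn, docs, *args, allow_empty=False, **kwargs):
    """Run `index_fn` with cleanup="full", but reject an empty batch by default.

    `index_fn` is whatever indexing callable you use (for example LangChain's
    `index`); it is passed in so the guard itself has no dependencies.
    """
    docs = list(docs)
    if not docs and not allow_empty:
        raise ValueError("Refusing cleanup='full' with an empty batch")
    return index_fn(docs, *args, cleanup="full", **kwargs)


def fake_index(docs, cleanup=None):
    # Stand-in used only to demonstrate the guard.
    return {"num_docs": len(docs), "cleanup": cleanup}


print(guarded_full_index(fake_index, ["doc"]))
try:
    guarded_full_index(fake_index, [])
except ValueError as exc:
    print(exc)
```

Passing `allow_empty=True` keeps the intentional "delete everything" case available while making it an explicit decision.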


In [29]:
# DANGER: Full cleanup with empty list = DELETE EVERYTHING
result = index(
    [],                      # Empty list
    record_manager, 
    vectorstore, 
    cleanup="full",          # Full sync mode
    source_id_key="source"
)

print(f"Result: {result}")
# num_deleted will show ALL remaining documents were removed!
# The vector store is now empty - use with extreme caution!

Result: {'num_added': 0, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 5}


## Summary: Choosing the Right Cleanup Mode

| Cleanup Mode | Best For | Deletes? | Safe with Empty Batch? |
|--------------|----------|----------|------------------------|
| `None` | Append-only workloads | Never | ✅ Yes |
| `"incremental"` | Regular updates (recommended) | Only from sources in batch | ✅ Yes |
| `"full"` | Complete re-sync | Everything not in batch | ❌ No (deletes all!) |

### Key Takeaways

1. **Always use `source_id_key`** - It's required for `incremental` cleanup and helps track document provenance
2. **Start with `cleanup="incremental"`** - It keeps re-indexed sources in sync without touching sources outside the batch, which suits most production use cases
3. **Use `cleanup="full"` carefully** - Only when you have ALL documents and want a complete sync
4. **The Record Manager is essential** - It stores hashes to detect content changes efficiently

### Production Tips

- Run the indexer on a schedule (cron job) to keep vectors in sync
- Use meaningful source identifiers (file paths, URLs, database IDs)
- Monitor the return values to track indexing health
- Consider using separate namespaces for different document collections
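
For monitoring the return values, a small check over the result dict can flag suspicious runs. This is a sketch: `check_index_result` is a hypothetical helper and the 0.5 threshold is an arbitrary example, not a recommended value:

```python
def check_index_result(result: dict, max_deleted_ratio: float = 0.5) -> list:
    """Return human-readable warnings for an index() result dict."""
    warnings = []
    touched = result["num_added"] + result["num_updated"] + result["num_skipped"]
    deleted = result["num_deleted"]
    if touched == 0 and deleted > 0:
        warnings.append(f"{deleted} documents deleted by a run that indexed nothing")
    elif touched and deleted / touched > max_deleted_ratio:
        warnings.append(f"high delete ratio: {deleted} deleted vs {touched} indexed")
    return warnings


# Results copied from the runs in this notebook.
first_run = {"num_added": 7, "num_updated": 0, "num_skipped": 0, "num_deleted": 0}
third_run = {"num_added": 2, "num_updated": 0, "num_skipped": 3, "num_deleted": 6}
full_empty = {"num_added": 0, "num_updated": 0, "num_skipped": 0, "num_deleted": 5}

for name, res in [("first", first_run), ("third", third_run), ("full/empty", full_empty)]:
    print(name, check_index_result(res))
```

The first run passes cleanly, while the third run and the empty full cleanup each produce a warning that a scheduled job could log or alert on.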
