# PGVector

- Author: [Min-su Jung](https://github.com/effort-type), [Joonha Jeon](https://github.com/realjoonha)
- Design: 
- Peer Review : 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/07-PGVector.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/07-PGVector.ipynb)

## Overview  

[PGVector](https://github.com/pgvector/pgvector) is an open-source extension for PostgreSQL that allows you to store and search vector data alongside your regular database information.

This notebook shows how to use functionality related to `PGVector`, implementing LangChain vectorstore abstraction using postgres as the backend and utilizing the pgvector extension.

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [What is PGVector?](#what-is-pgvector)
    - [Set up PGVector](#set-up-pgvector)
- [Initialization](#initialization)
    - [Select Embeddings model](#select-embeddings-model)
    - [Create collections](#create-collections)
    - [Manage collections](#manage-collections)
    - [List collections](#list-collections)
    - [Delete collections](#delete-collections)
- [Manage vector store](#manage-vector-store)
    - [Add items to vector store](#add-items-to-vector-store)
    - [Delete items to vector store](#delete-items-from-vector-store)
    - [Upsert items to vector store](#upsert-items-to-vector-store)
- [Query vector store](#query-vector-store)
    - [Query directly](#query-directly)
    - [Query with filters](#query-with-filters)
    - [Similarity search with score](#similarity-search-with-score)
    - [Query by turning into retriever](#query-by-turning-into-retreiver)


### References

- [langchain-postgres](https://github.com/langchain-ai/langchain-postgres/)
- [pgvector](https://github.com/pgvector/pgvector)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. 
- You can checkout the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [None]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [1]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain_postgres",
        "langchain_openai",
        "psycopg[binary,pool]",
    ],
    verbose=False,
    upgrade=False,
)

In [2]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "PGVector",
        "OPENAI_API_KEY": "",
    }
)

Environment variables have been set successfully.


In [4]:
from dotenv import load_dotenv

load_dotenv(override=True)

False

## What is PGVector?

`PGVector` is a PostgreSQL extension that enables vector similarity search directly within your PostgreSQL database, making it ideal for AI applications, semantic search, and recommendation systems.

This is particularly valuable for who already use PostgreSQL who want to add vector search capabilities without managing separate infrastructure or learning new query languages.

**Features** :
1. Native PostgreSQL integration with standard SQL queries
2. Multiple similarity search methods including L2, Inner Product, Cosine
3. Several indexing options including HNSW and IVFFlat
4. Support for up to 2,000 dimensions per vector
5. ACID compliance inherited from PostgreSQL

**Advantages** :

1. Free and open-source
2. Easy integration with existing PostgreSQL databases
3. Full SQL functionality and transactional support
4. No additional infrastructure needed
5. Supports hybrid searches combining vector and traditional SQL queries

**Disadvantages** :
1. Performance limitations with very large datasets (billions of vectors)
2. Limited to single-node deployment
3. Memory-intensive for large vector dimensions
4. Requires manual optimization for best performance
5. Less specialized features compared to dedicated vector databases

### Set up PGVector

You can easily set up `PGVector` by running the following command that spins up a docker container:

```bash
docker run --name pgvector-container -e POSTGRES_USER=langchain -e POSTGRES_PASSWORD=langchain -e POSTGRES_DB=langchain -p 6024:5432 -d pgvector/pgvector:pg16
```

For more detailed instructions, please refer to [the official documentation](https://github.com/pgvector/pgvector) 

## Initialization

Once setting up an instance of postgres with pgvector enabled, you can directly instantiate a `PGVector` vector store to store embedded data and perform similarity search.

### Select Embeddings model

You should define an embedding model to use before instantiating `PGVector`.

In this subsection we use ```text-embedding-3-large``` model of OpenAI here.

In [24]:
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

### Create collections

You can create a collection to use by instantiating `PGVector` with a collection name. Note that the default value is `langchain`, and it is recommended to define your own to manage multiple collections. 

In [25]:
from langchain_core.documents import Document
from langchain_postgres import PGVector


# See docker command above to launch a postgres instance with pgvector enabled.
connection = "postgresql+psycopg://langchain:langchain@localhost:6024/langchain"  # Uses psycopg3!
collection_name = "my_docs"

vector_store = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)

### Manage collections

As postgres is basically a relational DB even with an extension of pgvector, the data management is quite different with other vector DBs. You can see that instantiating `PGVector` makes two default tables below `langchain` database.

- `langchain_pg_collection`: stores metadata of collections
- `langchain_pg_embedding`: stores actual data including document and embeddings

In [18]:
import psycopg

# Connection parameters
conn_params = {
    "dbname": "langchain",
    "user": "langchain",
    "password": "langchain",
    "host": "localhost",
    "port": "6024",
}

with psycopg.connect(**conn_params) as conn:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT table_name 
            FROM information_schema.tables 
            WHERE table_schema = 'public'
            AND table_type = 'BASE TABLE';
        """
        )

        tables = cur.fetchall()

        print("Tables in the database:")
        for table in tables:
            print(table[0])

Tables in the database:
langchain_pg_collection
langchain_pg_embedding


### List collections

You can list all of the collections that are created in a dedicated database (`langchain`)

In [None]:
from psycopg.rows import dict_row

with psycopg.connect(**conn_params) as conn:
    with conn.cursor(row_factory=dict_row) as cur:
        cur.execute("SELECT name FROM langchain_pg_collection;")

        rows = cur.fetchall()
        names = [row["name"] for row in rows]

        print(names)

['my_docs']


### Delete collections

You can use below method to delete a collection with its name

In [None]:
def delete_collection_and_embeddings(collection_name):
    with psycopg.connect(**conn_params) as conn:
        with conn.cursor() as cur:
            # First, delete the corresponding embeddings
            cur.execute(
                """
                DELETE FROM langchain_pg_embedding
                WHERE collection_id IN (
                    SELECT uuid 
                    FROM langchain_pg_collection 
                    WHERE name = %s
                );
            """,
                (collection_name,),
            )

            embeddings_deleted = cur.rowcount

            # Then, delete the collection
            cur.execute(
                """
                DELETE FROM langchain_pg_collection
                WHERE name = %s;
            """,
                (collection_name,),
            )

            collections_deleted = cur.rowcount

        conn.commit()

    return collections_deleted, embeddings_deleted


# Usage
collection_name_to_delete = "your_collection_name"
collections, embeddings = delete_collection_and_embeddings(collection_name_to_delete)

print(f"Deleted {collections} collection(s) and {embeddings} related embedding(s).")

## Manage vector store

Once you have instantiated your vector store, we can interact with it by adding and deleting different items.

### Add items to vector store

We can add items to our vector store by using the add_documents function.

In this tutorial, we will store **the little prince** by Saiot-Exupery.

You can find the raw text file in data directory.

In [None]:
# This is a long document we can split up.
data_path = "./data/the_little_prince.txt"
with open(data_path, encoding="utf8") as f:
    raw_text = f.read()

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from uuid import uuid4

# define text splitter
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

# split raw text by splitter.
split_docs = text_splitter.create_documents([raw_text])

# print one of documents to check its structure
print(split_docs[0])

In [None]:
# define document preprocessor
def preprocess_documents(
    split_docs, metadata_keys, min_length, use_basename=False, **kwargs
):
    metadata = kwargs

    if use_basename:
        assert metadata.get("source", None) is not None, "source must be provided"
        metadata["source"] = metadata["source"].split("/")[-1]

    result_docs = []
    for idx, doc in enumerate(split_docs):
        if len(doc.page_content) < min_length:
            continue
        for k in metadata_keys:
            doc.metadata.update({k: metadata.get(k, "")})
        doc.metadata.update({"page": idx + 1, "id": str(uuid4())})
        result_docs.append(doc)

    return result_docs

In [None]:
# preprocess raw documents
processed_docs = preprocess_documents(
    split_docs=split_docs,
    metadata_keys=["source", "page", "author"],
    min_length=5,
    use_basename=True,
    source=data_path,
    author="Saiot-Exupery",
)

# print one of preprocessed document to chekc its structure
print(processed_docs[0])

Now we have processed documents (or chunks) with unique **id**.

To use it later, we will store the ids and pass it to ```add_documents``` method.

**Note**

If one did not pass the ids, randomly created id will be assigned for each items.

In [None]:
#
uuids = [doc.metadata["id"] for doc in print(processed_docs[0])]
vector_store.add_documents(print(processed_docs[0]), ids=uuids)

['da61d994-7cd8-4de7-86ad-e8dc3124ce67',
 '3b7eda28-21be-4d84-85fc-e5a7120c03e2',
 '8bb2273a-f7d2-42d7-85d4-8b80235845c4',
 '959886e7-bd55-4ea3-91f9-80cd7ba13132',
 '0cb6c40a-d948-41db-983a-4ecc35a1120b',
 '36342e32-f07c-4a11-999d-aabfba674c1c',
 '13a1a431-2f83-4fc4-ba93-ab249168b935',
 '8b2ce43e-a858-40fa-892b-b4f7411548a0',
 'cf5a8530-a71d-4dd2-a498-ca7bfcfb758c',
 '9b8e364f-db57-46aa-9cde-62f56aff1ac5']

### Delete items from vector store

In [47]:
vector_store.delete(ids=[uuids[2]])

### Upsert items to vector store

You can upsert (update and insert) item by adding documents with ID that matches with an existing document's ID by over-writing.

In [48]:
id_to_update = uuids[-1]
new_doc = Document(
    page_content="cooking classes for beginners and novices are offered at the community center",
    metadata={"id": id_to_update, "location": "community center", "topic": "classes"},
)

In [49]:
vector_store.add_documents([new_doc], ids=[id_to_update])

['9b8e364f-db57-46aa-9cde-62f56aff1ac5']

In [50]:
print(vector_store.get_by_ids([id_to_update])[0].page_content)

cooking classes for beginners and novices are offered at the community center


## Query vector store

Once your vector store has been created and the relevant documents have been added you will most likely wish to query it during the running of your chain or agent.

### Query directly

Performing a simple similarity search can be done as follows:

In [51]:
results = vector_store.similarity_search("kitty", k=10)
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")

* there are cats in the pond [{'id': 'da61d994-7cd8-4de7-86ad-e8dc3124ce67', 'topic': 'animals', 'location': 'pond'}]
* the book club meets at the library [{'id': '8b2ce43e-a858-40fa-892b-b4f7411548a0', 'topic': 'reading', 'location': 'library'}]
* the library hosts a weekly story time for kids [{'id': 'cf5a8530-a71d-4dd2-a498-ca7bfcfb758c', 'topic': 'reading', 'location': 'library'}]
* ducks are also found in the pond [{'id': '3b7eda28-21be-4d84-85fc-e5a7120c03e2', 'topic': 'animals', 'location': 'pond'}]
* a new coffee shop opened on Main Street [{'id': '13a1a431-2f83-4fc4-ba93-ab249168b935', 'topic': 'food', 'location': 'Main Street'}]
* the new art exhibit is fascinating [{'id': '0cb6c40a-d948-41db-983a-4ecc35a1120b', 'topic': 'art', 'location': 'museum'}]
* a sculpture exhibit is also at the museum [{'id': '36342e32-f07c-4a11-999d-aabfba674c1c', 'topic': 'art', 'location': 'museum'}]
* the market also sells fresh oranges [{'id': '959886e7-bd55-4ea3-91f9-80cd7ba13132', 'topic': 'fo

### Query with filters

The vectorstore supports a set of filters that can be applied against the metadata fields of the documents.

You can find a list of filtering operators:

| Operator | Meaning/Category        |
|----------|-------------------------|
| \$eq      | Equality (==)           |
| \$ne      | Inequality (!=)         |
| \$lt      | Less than (&lt;)           |
| \$lte     | Less than or equal (&lt;=) |
| \$gt      | Greater than (>)        |
| \$gte     | Greater than or equal (>=) |
| \$in      | Special Cased (in)      |
| \$nin     | Special Cased (not in)  |
| \$between | Special Cased (between) |
| \$like    | Text (like)             |
| \$ilike   | Text (case-insensitive like) |
| \$and     | Logical (and)           |
| \$or      | Logical (or)            |

In [52]:
vector_store.similarity_search(
    "ducks",
    k=10,
    filter={"location": {"$in": ["pond", "market"]}},
)

[Document(id='3b7eda28-21be-4d84-85fc-e5a7120c03e2', metadata={'id': '3b7eda28-21be-4d84-85fc-e5a7120c03e2', 'topic': 'animals', 'location': 'pond'}, page_content='ducks are also found in the pond'),
 Document(id='da61d994-7cd8-4de7-86ad-e8dc3124ce67', metadata={'id': 'da61d994-7cd8-4de7-86ad-e8dc3124ce67', 'topic': 'animals', 'location': 'pond'}, page_content='there are cats in the pond'),
 Document(id='959886e7-bd55-4ea3-91f9-80cd7ba13132', metadata={'id': '959886e7-bd55-4ea3-91f9-80cd7ba13132', 'topic': 'food', 'location': 'market'}, page_content='the market also sells fresh oranges')]

### Similarity search with score

You can also search with score:

In [53]:
results = vector_store.similarity_search_with_score(query="cats", k=1)
for doc, score in results:
    print(f"* [SIM={score:3f}] {doc.page_content} [{doc.metadata}]")

* [SIM=0.554739] there are cats in the pond [{'id': 'da61d994-7cd8-4de7-86ad-e8dc3124ce67', 'topic': 'animals', 'location': 'pond'}]


### Query by turning into retreiver
You can also transform the vector store into a retriever for easier usage in your chains.

In [54]:
retriever = vector_store.as_retriever(search_type="mmr", search_kwargs={"k": 1})
retriever.invoke("kitty")

[Document(id='da61d994-7cd8-4de7-86ad-e8dc3124ce67', metadata={'id': 'da61d994-7cd8-4de7-86ad-e8dc3124ce67', 'topic': 'animals', 'location': 'pond'}, page_content='there are cats in the pond')]