Postgres: pgvector implementation #1926

makkarss929 · 2024-03-31T09:39:44Z

Description

Worked on this enhancement feature #1558
Created PostgresDataBackend from Ibis.DataBackend
Created PostgresVectorSearcher class inspired by MongoAtlasVectorSearcher.

Next steps :

Testing with changes

Related Issues

Checklist

Is this code covered by new or existing unit tests or integration tests?
Did you run make unit-testing and make integration-testing successfully?
Do new classes, functions, methods and parameters all have docstrings?
Were existing docstrings updated, if necessary?
Was external documentation updated, if necessary?

Additional Notes or Comments

jieguangzhou

Here is some advice:

superduperdb/backends/postgres/data_backend.py

superduperdb/base/build.py

superduperdb/vector_search/postgres.py

superduperdb/base/build.py

blythed

This is a good start. What about indexing? It would be very useful if we could use that to speed up the calculations: https://github.com/pgvector/pgvector?tab=readme-ov-file#indexing.
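As a rough illustration of the indexing options described in the linked pgvector README, the sketch below builds `CREATE INDEX` statements for the `hnsw` and `ivfflat` methods. The helper name `build_index_sql`, its parameters, and the default column name `output` are assumptions made for illustration and are not part of this PR; only the SQL syntax and the operator class names come from pgvector.

```python
# Hypothetical helper (not from this PR): builds pgvector CREATE INDEX
# statements as plain strings, so it can be inspected without a database.

# Operator classes documented by pgvector for each distance measure.
OPERATOR_CLASSES = {
    "l2": "vector_l2_ops",
    "dot": "vector_ip_ops",
    "cosine": "vector_cosine_ops",
}


def quote_ident(name):
    """Double-quote an SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_index_sql(table, column="output", method="hnsw",
                    measure="cosine", lists=100):
    """Return a CREATE INDEX statement for a pgvector column."""
    if method not in ("hnsw", "ivfflat"):
        raise ValueError(f"unsupported index method: {method}")
    if measure not in OPERATOR_CLASSES:
        raise ValueError(f"unsupported measure: {measure}")
    sql = (
        f"CREATE INDEX ON {quote_ident(table)} "
        f"USING {method} ({quote_ident(column)} {OPERATOR_CLASSES[measure]})"
    )
    # ivfflat takes the number of lists as a storage parameter.
    if method == "ivfflat":
        sql += f" WITH (lists = {int(lists)})"
    return sql + ";"


print(build_index_sql("_outputs.listener1::0"))
print(build_index_sql("_outputs.listener1::0", method="ivfflat", measure="l2"))
```

The resulting string could be passed to `cursor.execute` after the table has been populated; whether `hnsw` or `ivfflat` suits a given workload is a trade-off that the README discusses.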

makkarss929 · 2024-04-18T06:10:23Z

@jieguangzhou @blythed @kartik4949 I have made all the requested changes. Please review my PR. Thanks.

jieguangzhou

Thanks @makkarss929
Some suggestions for modifications: mainly, we don't need to change this much, because our architecture can already handle this logic.

Please also clean the outputs of the notebook.

jieguangzhou · 2024-04-18T12:38:08Z

superduperdb/backends/ibis/query.py

+    if CFG.cluster.vector_search.type == 'in_memory':
+        if flatten:
+            raise NotImplementedError('Flatten not yet supported for ibis')

-    if not outputs:
-        return
-
-    table_records = []
-    for ix in range(len(outputs)):
-        d = {
-            '_input_id': str(ids[ix]),
-            'output': outputs[ix],
-        }
-        table_records.append(d)
+        if not outputs:
+            return

-    for r in table_records:
-        if isinstance(r['output'], dict) and '_content' in r['output']:
-            r['output'] = r['output']['_content']['bytes']
+        table_records = []
+        for ix in range(len(outputs)):
+            d = {
+                '_input_id': str(ids[ix]),
+                'output': outputs[ix],
+            }
+            table_records.append(d)
+
+        for r in table_records:
+            if isinstance(r['output'], dict) and '_content' in r['output']:
+                r['output'] = r['output']['_content']['bytes']
+
+        db.databackend.insert(f'_outputs.{predict_id}', table_records)
+
+    elif CFG.cluster.vector_search.type == 'pg_vector':
+        # Connect to your PostgreSQL database
+        conn = psycopg2.connect(CFG.cluster.vector_search.uri)
+        table_name = f'_outputs.{predict_id}'
+        with conn.cursor() as cursor:
+            cursor.execute('CREATE EXTENSION IF NOT EXISTS vector')
+            cursor.execute(f"""DROP TABLE IF EXISTS "{table_name}";""")
+            cursor.execute(
+                f"""CREATE TABLE "{table_name}" (
+                    _input_id VARCHAR PRIMARY KEY,
+                    output vector(1024),
+                    _fold VARCHAR
+                );
+                """
+            )
+            for ix in range(len(outputs)):
+                try:
+                    cursor.execute(
+                        f"""INSERT INTO "{table_name}" (_input_id, output) VALUES (%s, %s);""",
+                        [str(ids[ix]), outputs[ix]]
+                    )
+                except:
+                    pass

-    db.databackend.insert(f'_outputs.{predict_id}', table_records)
+        # Commit the transaction
+        conn.commit()
+        # Close the connection
+        conn.close()


We don't need to insert them into the table here. When vector search is actually launched, the add method of PostgresVectorSearcher will be called, and at that point the corresponding vectors can be received and added to the table. Duplicate items need to be handled carefully, as commented previously.

The results of the model are saved separately from the vector search service.

@jieguangzhou Don't worry about duplication: _input_id is declared as the PRIMARY KEY when the vector table is created (for example, _outputs.listener1::0). I have checked that there is no duplication.
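One possible way to handle duplicates explicitly, instead of relying on a failed insert being swallowed, is an upsert on the primary key. The sketch below only builds the SQL and parameters; the function names `to_vector_literal` and `build_upsert` are hypothetical, and the column names `_input_id` and `output` are taken from the diff above. It assumes the vector is sent in pgvector's text form (`[1.0,2.0]`).

```python
# Hypothetical helpers (not from this PR) that prepare an upsert so that
# re-adding an existing _input_id overwrites the stored vector.


def to_vector_literal(vector):
    """Render a sequence of numbers in pgvector's text format, e.g. [1.0,2.0]."""
    return "[" + ",".join(str(float(x)) for x in vector) + "]"


def build_upsert(table, items):
    """Return (sql, params) for inserting (id, vector) pairs with ON CONFLICT."""
    quoted = '"' + table.replace('"', '""') + '"'
    sql = (
        f"INSERT INTO {quoted} (_input_id, output) VALUES (%s, %s) "
        "ON CONFLICT (_input_id) DO UPDATE SET output = EXCLUDED.output;"
    )
    params = [(str(id_), to_vector_literal(vec)) for id_, vec in items]
    return sql, params


sql, params = build_upsert(
    "_outputs.listener1::0",
    [("doc-1", [0.1, 0.2]), ("doc-1", [0.3, 0.4])],
)
print(sql)
print(params)
```

With psycopg2, the result could be run as `cursor.executemany(sql, params)`; the second row for `doc-1` would then replace the first rather than raise a unique-violation error.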

jieguangzhou · 2024-04-18T12:39:13Z

superduperdb/base/build.py

+    elif uri.startswith('postgres://') or uri.startswith("postgresql://"):
+        name = uri.split('//')[0]
+        if type == 'data_backend':
+            ibis_conn = ibis.connect(uri)
+            return mapping['ibis'](ibis_conn, name)
+        else:
+            assert type == 'metadata'
+            from sqlalchemy import create_engine
+
+            sql_conn = create_engine(uri)
+            return mapping['sqlalchemy'](sql_conn, name)


Please remove this; we can load the URI directly with ibis.

I will remove this

jieguangzhou · 2024-04-18T12:39:27Z

superduperdb/base/superduper.py

+    elif item.startswith('postgres://') or item.startswith('postgresql://'):
+        kwargs['data_backend'] = item


Please remove this, for the same reason.

Will remove this.

jieguangzhou · 2024-04-18T12:40:50Z

superduperdb/vector_search/interface.py

+            if CFG.cluster.vector_search.type != 'pg_vector':
+                if not db.server_mode:
+                    request_server(
+                        service='vector_search',
+                        endpoint='create/search',
+                        args={
+                            'vector_index': self.vector_index,
+                        },
+                        type='get',
+                    )


We don't need to change this. This code is for an independent vector search service, and pgvector will logically be used inside that service.

@jieguangzhou It was throwing an error; that's why I did this.

What error? Could you please put the error message here?

jieguangzhou · 2024-04-18T12:41:00Z

superduperdb/vector_search/interface.py

+            if CFG.cluster.vector_search.type != 'pg_vector':
+                response = request_server(
+                    service='vector_search',
+                    data=h,
+                    endpoint='query/search',
+                    args={'vector_index': self.vector_index, 'n': n},
+                )
+                return response['ids'], response['scores']


The same here

@jieguangzhou It was throwing an error; that's why I did this.

jieguangzhou · 2024-04-18T12:41:29Z

superduperdb/vector_search/update_tasks.py

+    if CFG.cluster.vector_search.type != 'pg_vector':
+        vi = db.vector_indices[vector_index]
+        if isinstance(query, dict):
+            # ruff: noqa: E501
+            query: CompoundSelect = Serializable.decode(query)  # type: ignore[no-redef]
+        assert isinstance(query, CompoundSelect)
+        if not ids:
+            select = query
+        else:
+            select = query.select_using_ids(ids)
+        docs = db.select(select)
+        docs = [doc.unpack() for doc in docs]
+        key = vi.indexing_listener.key
+        if '_outputs.' in key:
+            key = key.split('.')[1]
+        # TODO: Refactor the below logic
+        vectors = []
+        if isinstance(db.databackend, MongoDataBackend):
+            vectors = [
+                {
+                    'vector': MongoStyleDict(doc)[
+                        f'_outputs.{vi.indexing_listener.predict_id}'
+                    ],
+                    'id': str(doc['_id']),
+                }
+                for doc in docs
+            ]
+        elif isinstance(db.databackend, IbisDataBackend):
+            docs = db.execute(select.outputs(vi.indexing_listener.predict_id))
+            from superduperdb.backends.ibis.data_backend import INPUT_KEY

-        vectors = [
-            {
-                'vector': doc[f'_outputs.{vi.indexing_listener.predict_id}'],
-                'id': str(doc[INPUT_KEY]),
-            }
-            for doc in docs
-        ]
-    for r in vectors:
-        if hasattr(r['vector'], 'numpy'):
-            r['vector'] = r['vector'].numpy()
+            vectors = [
+                {
+                    'vector': doc[f'_outputs.{vi.indexing_listener.predict_id}'],
+                    'id': str(doc[INPUT_KEY]),
+                }
+                for doc in docs
+            ]
+        for r in vectors:
+            if hasattr(r['vector'], 'numpy'):
+                r['vector'] = r['vector'].numpy()

-    if vectors:
-        db.fast_vector_searchers[vi.identifier].add(
-            [VectorItem(**vector) for vector in vectors]
-        )
+        if vectors:
+            db.fast_vector_searchers[vi.identifier].add(
+                [VectorItem(**vector) for vector in vectors]
+            )


The same here

@jieguangzhou Remember, we discussed in the meeting that Ibis is not compatible with pgvector, whereas psycopg2 supports it. Ibis throws an error when it tries to access a table with vector embeddings created with psycopg2.

We should have two tables when using pgvector search:

The model output table (table1): all the embedding vector results are saved here.

The pgvector search table (table2): saves the vectors and builds an index on them.

The workflow for building a vector search index is as follows:

Query the vectors from table1.

Use these vectors to build the vector search (the add method of PostgresVectorSearcher).

As you can see here, we get the vectors and then call db.fast_vector_searchers[vi.identifier].add.
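The two-step workflow above can be sketched in miniature as follows. Everything here is a stand-in: `InMemorySearcher` plays the role of PostgresVectorSearcher, a list of dicts plays the role of table1, and `build_vector_index` is a hypothetical name. No database is involved, and the real implementation may differ.

```python
from dataclasses import dataclass


@dataclass
class VectorItem:
    """Minimal stand-in for the (id, vector) pair passed to a searcher."""
    id: str
    vector: list


class InMemorySearcher:
    """Stand-in for PostgresVectorSearcher; table2 is a dict keyed by id."""

    def __init__(self):
        self.table2 = {}

    def add(self, items):
        # Keying by id means re-adding an item overwrites it (no duplicates).
        for item in items:
            self.table2[item.id] = item.vector


def build_vector_index(table1, searcher, ids=None):
    """Step 1: read vectors from table1. Step 2: hand them to searcher.add."""
    wanted = None if ids is None else set(ids)
    rows = [r for r in table1 if wanted is None or r["_input_id"] in wanted]
    items = [VectorItem(id=str(r["_input_id"]), vector=r["output"]) for r in rows]
    if items:
        searcher.add(items)
    return len(items)


table1 = [
    {"_input_id": "1", "output": [0.1, 0.2]},
    {"_input_id": "2", "output": [0.3, 0.4]},
]
searcher = InMemorySearcher()
print(build_vector_index(table1, searcher))
```

The point of the separation is that the searcher only ever sees `VectorItem` objects, so the storage behind `add` can change without touching how model outputs are saved.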

Hi, so you want me to create 2 tables:

outputs.listener1::0, which is created by ibis.

outputs.listener1::0_pgvector, which will be created by psycopg2 with the add method of PostgresVectorSearcher.

No, just one table: the pgvector table created by psycopg2.

The first one is created automatically by the ibis data_backend.

@jieguangzhou why do we need 2 tables? Why can't we just use the original listener table? My thought was that the table would somehow be created as a pg_vector table when the listener is created. Copying data into another table sounds wasteful.

This needs to be viewed in conjunction with comment #1926 (comment); here are my thoughts:

The saving behavior of model outputs should not be constrained by downstream applications (such as vector search). If saving has to be made compatible with vector search separately, that increases the complexity of saving query results. This should be a very pure interface. If vector search is to be built, then an index should be built on this table on the vector search side (if an index cannot be built, a separate table needs to be created).

The vector search component should be independent. Ideally, we should be able to switch the underlying engine of the vector search at any time. If we couple vector search with saving, extensibility decreases. For example, a setup in which vector search uses a different Postgres database from the databackend (one for data saving and one for vector search calculations) could not be supported.

Ideally, if the PostgresVectorSearcher can determine that the URI being used is the same as the databackend's URI, it should first attempt to build an index on the model output table. If building an index is not possible, then a separate table needs to be created.

WDYT? @blythed
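The decision described in the last point could look roughly like the sketch below. The function names `same_database` and `choose_index_target`, the `can_index_in_place` flag, and the default port of 5432 are assumptions for illustration; the `_pgvector` suffix follows the table naming floated earlier in this thread.

```python
from urllib.parse import urlsplit


def same_database(uri_a, uri_b):
    """Compare two Postgres URIs by host, port, and database name.

    The scheme is ignored so that postgres:// and postgresql:// match.
    Assumes the default port 5432 when none is given.
    """
    def key(uri):
        parts = urlsplit(uri)
        return (parts.hostname, parts.port or 5432, parts.path.strip("/"))

    return key(uri_a) == key(uri_b)


def choose_index_target(searcher_uri, databackend_uri, output_table,
                        can_index_in_place):
    """Pick the table on which the vector index should be built.

    If both URIs point at the same database and the output table can be
    indexed directly, reuse it; otherwise fall back to a separate table.
    """
    if same_database(searcher_uri, databackend_uri) and can_index_in_place:
        return output_table
    return output_table + "_pgvector"


print(choose_index_target(
    "postgresql://localhost/test_db",
    "postgres://user:secret@localhost:5432/test_db",
    "_outputs.listener1::0",
    can_index_in_place=True,
))
```

How `can_index_in_place` would be determined (for example, by checking the column type of the output table) is left open here, as it is in the discussion.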

I understand that developers could potentially use pg_vector as an external vector store. However, most users will most likely be Postgres users.

By hosting the vector search with pg_vector, we avoid the problem of mapping the data to a new table or database. This is exactly like MongoDB Atlas vector search.

In the configuration we have CFG.cluster.vector_search.type = 'native' for such cases. If that is set, then the jobs that copy the vectors across are skipped.

I just conducted a basic test with pgvector:

If a vector table is created using pgvector, ibis currently cannot directly read the data inside, as it throws an error for not being able to parse the vector type data of pgvector. Adaptations are necessary; otherwise, our data backend will no longer be able to access this table. I suspect that it would require activating the vector feature specifically for the ibis backend’s connection, which could be quite troublesome.

To create and access vector-type data in pgvector, each connection needs to activate the vector feature to enable data interchange.
I suggest splitting this into two tables, as ibis does not seem to be well compatible with pgvector at the moment.
If merging into one table is considered, the following tests need to be conducted:

Test with two types of SQL databases (Postgres and SQLite) to ensure that non-Postgres vector implementations remain compatible and that the original SQL functionality is unaffected. Since SQLite already has unit tests, only Postgres needs to be tested.

Conduct application tests with the vector search type set to pgvector, covering both vector searches and non-vector searches, to ensure that non-vector search functions still work normally. These should be integration tests.

@makkarss929 @blythed

ibis-project/ibis#9025

blythed · 2024-04-22T08:53:39Z

superduperdb/components/vector_index.py

@@ -54,6 +61,26 @@ def on_load(self, db: Datalayer) -> None:
            self.compatible_listener = t.cast(
                Listener, db.load('listener', self.compatible_listener)
            )
+        if CFG.cluster.vector_search.type == "pg_vector":


This logic should be isolated in the PgVector class (vector_searcher).

@blythed you want me to move this logic to the PostgresVectorSearcher class and then call it from here, right?

@blythed working on it

blythed · 2024-04-22T08:54:38Z

superduperdb/vector_search/update_tasks.py

-    elif isinstance(db.databackend, IbisDataBackend):
-        docs = db.execute(select.outputs(vi.indexing_listener.predict_id))
-        from superduperdb.backends.ibis.data_backend import INPUT_KEY
+    if CFG.cluster.vector_search.type != 'pg_vector':


Because of the same error: Ibis doesn't support pgvector and the vector datatype.

blythed · 2024-05-23T08:29:38Z

@makkarss929 the ibis project has fixed the data types for pgvector. Would you like to finish your PR? Otherwise we can take this over.

Creating PostgresDataBackend from IbisDataBackend for pgvector implementation

ee8be5c

makkarss929 marked this pull request as draft March 31, 2024 09:39

makkarss929 marked this pull request as ready for review March 31, 2024 10:38

makkarss929 marked this pull request as draft March 31, 2024 10:38

adding PostgresVectorSearcher

66cb314

jieguangzhou requested changes Apr 1, 2024

jieguangzhou requested a review from kartik4949 April 1, 2024 12:31

makding requested changes

3404153

makkarss929 requested a review from jieguangzhou April 2, 2024 10:42

blythed reviewed Apr 2, 2024

add hnsw, ivfflat indexing methods

e95f5d7

makkarss929 requested a review from blythed April 6, 2024 20:39

makkarss929 added 4 commits April 11, 2024 14:32

removing PostgresDatabackend

a70f95f

removing unnecessary statements

2e344aa

integration testing

009bdfa

testing and adding pgvector.ipynb notebook

4bf72be

makkarss929 marked this pull request as ready for review April 17, 2024 21:47

makkarss929 changed the title ~~Creating PostgresDataBackend from IbisDataBackend for pgvector implem…~~ pgvector implemenation Apr 17, 2024

makkarss929 changed the title ~~pgvector implemenation~~ Postgres : pgvector implemenation Apr 18, 2024

jieguangzhou requested changes Apr 18, 2024

makkarss929 requested a review from jieguangzhou April 18, 2024 14:48

hnsw, ivfflat

a9f9a23

blythed reviewed Apr 22, 2024

blythed force-pushed the main branch from e26c486 to ec67009 Compare May 2, 2024 13:09

blythed force-pushed the main branch 2 times, most recently from b8748a3 to 04010cf Compare May 22, 2024 08:34
