
chore: add embeddings table migration #9372

Merged: 5 commits merged into master from feat-embedder-migration on Feb 13, 2024

Conversation

@jordanh commented Jan 21, 2024

Description

After getting some feedback from @mattkrick, I've split the EmbeddingsIndex into two tables with two distinct concerns:

  • EmbeddingsIndex – contains all the meta information for which we've generated indexes
  • EmbeddingsJobQueue – contains a list of items pending embedding, or the reason(s) why creating the embedding failed

An outline of how these tables will be used (a rough schema sketch follows this list):

  • For a particular embedding model, a service called embedder will...
    • ...add historical objects to be embedded to the EmbeddingsIndex
    • ...add items whose EmbeddingsIndex.model array lacks the target model to the EmbeddingsJobQueue
    • ...shift the state from EmbeddingsJobQueue.queued -> embedding when calculating an embedding for a particular model
    • ...remove the row from the EmbeddingsJobQueue if calculating the embedding was successful, and update the EmbeddingsIndex.model with the appropriate model name (A row will also be created in Embeddings_<MODEL> that can be joined to EmbeddingsIndex)
      • else, ...shift the state from queued -> failed if calculating the embedding fails
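
For orientation, here is a rough sketch of the two tables. Only objectType, refId, refDateTime, the model array, and the queued/embedding/failed states come from this PR; every other name, type, and size is an assumption for illustration:

// illustrative migration sketch, not the PR's actual schema
import {Client} from 'pg'

export const up = async (pg: Client) => {
  // assumes the EmbeddingsObjectTypeEnum type already exists
  await pg.query(`
    CREATE TABLE "EmbeddingsIndex" (
      "objectType" "EmbeddingsObjectTypeEnum" NOT NULL,
      "refId" VARCHAR(100),
      "refDateTime" TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
      "models" VARCHAR(255)[], -- the EmbeddingsIndex.model array from the description
      UNIQUE("objectType", "refId")
    );
    CREATE TABLE "EmbeddingsJobQueue" (
      "id" SERIAL PRIMARY KEY,
      "objectType" "EmbeddingsObjectTypeEnum" NOT NULL,
      "refId" VARCHAR(100),
      "model" VARCHAR(255) NOT NULL,
      "state" VARCHAR(50) NOT NULL DEFAULT 'queued', -- queued -> embedding, or failed
      "stateMessage" TEXT -- reason(s) why creating the embedding failed
    );
  `)
}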

Testing scenarios


  • Migrate up!
  • Migrate down!

Final checklist

  • I checked the code review guidelines
  • I have added Metrics Representative as reviewer(s) if my PR involves metrics/data/analytics related changes
  • I have performed a self-review of my code, the same way I'd do it for any other team member
  • I have tested all cases I listed in the testing scenarios and I haven't found any issues or regressions
  • Whenever I took a non-obvious choice I added a comment explaining why I did it this way
  • I added the label Skip Maintainer Review (indicating the PR only requires reviewer review and can be merged right after approval) if the PR introduces only minor changes, does not contain any architectural changes, does not introduce any new patterns, and I think one review is sufficient
  • PR title is human readable and could be used in changelog

@jordanh jordanh requested review from mattkrick and removed request for tianrunhe January 21, 2024 01:27
@jordanh commented Jan 21, 2024

@mattkrick in general, I think separating the two concerns of the table is better than the original implementation in #9300.

The only drawback I can think of: when adding items to the EmbeddingsJobQueue from the EmbeddingsIndex, I'll have to query the EmbeddingsIndex and join the EmbeddingsJobQueue to filter out previously failed jobs so we don't keep retrying failed embeddings in perpetuity.
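
Roughly, that query would look something like the following anti-join (names match the sketch in the description; all of it is illustrative):

import {Client} from 'pg'

const findItemsToQueue = async (pg: Client, targetModel: string) => {
  // index rows that lack the target model AND have no existing job row
  // (queued, embedding, or failed) for that model
  const {rows} = await pg.query(
    `SELECT i."objectType", i."refId"
       FROM "EmbeddingsIndex" i
       LEFT JOIN "EmbeddingsJobQueue" q
         ON q."objectType" = i."objectType"
        AND q."refId" = i."refId"
        AND q."model" = $1
      WHERE $1 <> ALL(COALESCE(i."models", '{}')) -- target model not embedded yet
        AND q."id" IS NULL                        -- anti-join: no job row, failed or otherwise`,
    [targetModel]
  )
  return rows
}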

Definitely open to suggestions for any improvements/changes

"objectType" "EmbeddingsObjectTypeEnum" NOT NULL,
"refId" VARCHAR(100),
UNIQUE("objectType", "refId"),
"refDateTime" TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
Member:

-1 not sure what this is. if this is the time the row is created could we call this createdAt?

if it's when the reference is created could we call it refCreatedAt or leave it normalized on the table related to objectType?

Contributor Author:

The intent is to have a time on the index for "what time the embedded object represents". The use case would be filtering by a date range; this denormalizes that contextual time onto the index. Job story:

When I want to search for discussions that only occurred between two dates, I want a set of filter controls that return only the results I am interested in.
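
For example, a hypothetical date-range filter over that denormalized column (only refDateTime and the objectType value come from this PR):

import {Client} from 'pg'

const findDiscussionsBetween = async (pg: Client, start: Date, end: Date) => {
  const {rows} = await pg.query(
    `SELECT "objectType", "refId"
       FROM "EmbeddingsIndex"
      WHERE "objectType" = 'retrospectiveDiscussionTopic'
        AND "refDateTime" BETWEEN $1 AND $2`,
    [start, end]
  )
  return rows
}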

It's debatable whether we need this now. I'd be ok to drop it until we need it.

What say you?

Member:

ah! that's a super useful feature. so this isn't necessarily when the embedding is created, but when the reference was last updated? if that entity is mutable & gets updated, then we update this column + the embed text? if that's the case, i might suggest something like refUpdatedAt to make a little more clear, but i really like having the date here

Contributor Author:

Ok, will do.

@mattkrick commented Jan 22, 2024

> The only drawback I can think of: when adding items to the EmbeddingsJobQueue from the EmbeddingsIndex, I'll have to query the EmbeddingsIndex and join the EmbeddingsJobQueue to filter out previously failed jobs so we don't keep retrying failed embeddings in perpetuity.

I totally agree. Gave this some extra thought today. Since the job queue can be calculated by determining which objects don't have an embedding, it'd be great to leverage that into something simpler. From your PR, I think we can break down the functionality into the following:

  • IndexBuilder (populates the EmbeddingsIndex table)
  • QueueBuilder (populates the queue, a single worker per model/objectType)
  • ObjectBuilder (turns a meeting into many discussion objects)
  • LLM Endpoint (could be 1 or many behind a LB)
  • Embedder to pop the queue & call the endpoint (could be 1 or many)

The real bottleneck here is going to be the LLM endpoint. There's no easy way to distribute the work of building the queue, but we can break it up by objectType+model to distribute the load a bit.

Given those constraints, here's my proposal (a consumer sketch follows the list):

  • On startup, Embedder queries EmbeddingsIndex to see if at least 1 record exists for each model for each objectType
  • If an objectType for a model isn't found, it seeks a lock on redis key objectBuilder:${objectType}:${model}. If it's achieved, that embedder becomes IndexBuilder. If it's not achieved, it listens for redis to tell it the queue is created by subscribing to objectBuilder:${objectType}:${model}:complete
  • IndexBuilder iterates through EmbeddingsObjectTypeEnum.
    • It has a handler for each, e.g. for retrospectiveDiscussionTopic
    • it batches through NewMeeting extracting the id of each meeting from r.minval to r.now() (new items ignored, they are handled by GQL Executor).
    • It creates a row in EmbeddingsIndex for each item. on conflict it does nothing.
  • When complete, IndexBuilder becomes QueueBuilder
    • it left joins to Embeddings_% and filters for null to create a list of objects without embeddings. This job queue is pushed to a redis stream via xadd
    • On crash, an embedder can skip IndexBuilder and become QueueBuilder if prompted (Out of scope, but this supports redis without persistence if we want to build it later)
    • it publishes objectBuilder:${objectType}:${model}:complete to alert all Embedders to begin work. Those embedders do not start work until they exit QueueBuilder mode.
  • finally, Embedders subscribe to the queue via .xreadgroup(). They use xautoclaim to support failover. they ack when the job is done. they call xclaim if a task is taking longer than expected. see https://cschleiden.dev/blog/2022-04-08-task-queue-with-redis
  • Once Embedder has an item, it acts as ObjectBuilder & computes embedText & writes it to the DB table
  • After computing embedText, it sends it to the LLM endpoint to get embeddings. on return, it writes that to the Embeddings_% table
  • Finally, it calls ACK to redis to report the job as complete
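
A minimal consumer sketch for those last steps using ioredis; the stream and group names are assumptions, and error handling is elided:

import Redis from 'ioredis'

const redis = new Redis()
const STREAM = 'embedder:jobs' // assumed stream name
const GROUP = 'embedders'

const consumeOne = async (consumerId: string) => {
  // take over messages a crashed embedder left pending for > 60s (failover)
  await redis.xautoclaim(STREAM, GROUP, consumerId, 60_000, '0')
  // pop the next job for this consumer group
  const res = (await redis.xreadgroup(
    'GROUP', GROUP, consumerId,
    'COUNT', 1,
    'STREAMS', STREAM, '>'
  )) as [string, [string, string[]][]][] | null
  if (!res) return
  for (const [, messages] of res) {
    for (const [id] of messages) {
      // ObjectBuilder step: compute embedText, write it to the DB table,
      // send it to the LLM endpoint, write the result to Embeddings_<MODEL>
      await redis.xack(STREAM, GROUP, id) // ack when the job is done
    }
  }
}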

@jordanh commented Jan 23, 2024

What you propose is strikingly similar to what I've written, except for a few details. Let's work to get this PR merged, and then I'll make some updates to the service I wrote to match the new nomenclature...and we can continue. Sound ok?

Awaiting your comments on some of the +1s

@mattkrick commented:

sounds great! Yep, I think we've largely converged on the embedder layout, i still gotta do a deeper dive on your PR to make sure i understand everything. redis vs. pg's SKIP LOCKED is a good area for convo, i haven't read enough to make an informed decision.

@jordanh commented Jan 24, 2024

> sounds great! Yep, I think we've largely converged on the embedder layout, i still gotta do a deeper dive on your PR to make sure i understand everything. redis vs. pg's SKIP LOCKED is a good area for convo, i haven't read enough to make an informed decision.

We can definitely kick this conversation to the next PR :)

@jordanh commented Jan 26, 2024

@mattkrick, I've made all the changes we discussed.

Would you mind giving this a migrate up/down test? I'd do it, but I don't want to lose all the data I've calculated :)

@jordanh commented Feb 1, 2024

@dbumblis-parabol this is the PR that is going to land soon, FYI

@dbumblis-parabol commented Feb 1, 2024

There may be a challenge in the near future where we cannot control pgvector being installed in the ironbank container image used for postgres in the pubsec deployment of the application. Is there any way that the migration itself can be gated? Or, how could we address a situation where the postgresql db does not have pgvector installed?

@jordanh commented Feb 2, 2024

> Or, how could we address a situation where the postgresql db does not have pgvector installed?

@mattkrick now here's an interesting case. To rephrase what @dbumblis-parabol is saying:

  • If we merge this and cut a new IronBank image...
  • ...when the migration runs, it may fail because pgvector may not be available in the destination environment

Blarg. That would stink.

What would be a reasonable pattern here? How could we "soft require" pgvector?

@Dschoordsch commented:

Do we want to make this all optional, i.e. put the vector data in a different database? Then we couldn't easily join it with the rest of the data, but I don't see any immediate negative consequences from this.
If it were a separate DB, it would be easier to gate migrations and ensure fewer accidents if we want the option of not using AI in certain offerings.

@mattkrick commented:

if usage of pgvector is dependent on an environment & that decision could change over time then i think an env var is probably the cleanest solution.

// .env
USE_PGVECTOR=true

// preDeploy
if (process.env.USE_PGVECTOR === 'true') {
  // note: pgvector registers itself in pg_extension under the name 'vector'
  const enabled = await pg.query(`SELECT 1 FROM pg_extension WHERE extname = 'vector'`).executeAndTakeFirst()
  if (!enabled) {
    await pg.query(`CREATE EXTENSION vector;`).execute()
  }
}

I'll have to hear more about @Dschoordsch's idea. A 2nd DB would mean managing a 2nd pool of PG connections, so I'd want to make sure that we're getting something worthwhile for the complexity.

@jordanh commented Feb 5, 2024


@mattkrick I like this.

What to do about the migration in packages/server/postgres/migrations/1703031300000_addEmbeddingTables.ts tho? The up migration will fail if it is run and the vector extension isn't available.

Here's a bad option I can think of:

  1. Rewrite the migration to skip adding the columns that rely on the vector type if the extension is not loaded
  2. Update the preDeploy script to add the extension and the column if the USE_PGVECTOR env var is true

Do you have a better idea than this? If not, I can go ahead and make those changes and we can get this puppy merged

@mattkrick commented:

Interesting problem because 2 different envs will have 2 different DB schemas. this could get pretty hairy!

i like your solution. the alternative i see is having the migration add the column as type bytea and then changing that column type to vector in predeploy. since the table won't have rows yet, changing the column type won't be expensive. It will be a little closer to the truth because the column will exist as the wrong type vs. not existing at all.
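
a sketch of that alternative in preDeploy (the table and column names are hypothetical; USING is required since there's no implicit bytea -> vector cast, and it's cheap because the table has no rows yet):

import {Client} from 'pg'

const retypeEmbeddingColumn = async (pg: Client) => {
  await pg.query(
    `ALTER TABLE "Embeddings_ada_002"
     ALTER COLUMN "embedding" TYPE vector(1536) USING NULL::vector`
  )
}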

I'm OK with either!

@github-actions github-actions bot added size/m and removed size/s labels Feb 6, 2024
@jordanh commented Feb 6, 2024

@mattkrick and I discussed via Slack: since the migration doesn't rely on any vector column data types (they are all added dynamically by the embedder process), updating the preDeploy script to create the extension when an env var is set will be the way to go for now.

I've created a boolean env var called POSTGRES_USE_PGVECTOR to gate this behavior.

@jordanh commented Feb 6, 2024

@mattkrick I think this is GTG now

@Dschoordsch commented:

> I'll have to hear more about @Dschoordsch's idea. A 2nd DB would mean managing a 2nd pool of PG connections, so I'd want to make sure that we're getting something worthwhile for the complexity.

That was just an idea I had during our call yesterday. I thought it would be easier to skip the entire DB and embedding module if the environment variable is not present. In that case our normal migrations wouldn't depend on environment variables, apart from connection strings. I think it would have no technical benefit, but it might keep developers aware that they have to check for the absence of that data.
It also introduces more complexity in the migrations, as we might (but probably won't) run into the same ordering issue we had between rethink and postgres.
In both cases the logic to generate and consume the embeddings needs to be guarded by an environment variable, so there is no benefit there.

So I think just changing the type and not populating the data is way easier. We could still run into issues when querying with the non-existent type and not catch it before it's deployed to ironbank users.

@jordanh commented Feb 6, 2024

@Dschoordsch I think as this PR now stands we have the best of all worlds:

  • The POSTGRES_USE_PGVECTOR env var gates whether or not the preDeploy will try to enable the pgvector extension

And in the next PR:

  • The embedder's AI_EMBEDDER_ENABLED env var will gate whether or not the embedder will even run, which determines whether or not the tables that contain the vector type will even be created

This should let us run flexibly in our SaaS or in a self-hosted model with or without pgvector
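
An illustrative startup guard for the embedder (startEmbedder and its module path are hypothetical; only the env var name comes from this thread):

const main = async () => {
  if (process.env.AI_EMBEDDER_ENABLED === 'true') {
    const {startEmbedder} = await import('./embedder') // assumed entry point
    await startEmbedder() // creates the vector-typed Embeddings_<MODEL> tables on demand
  }
}
main()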

@jordanh commented Feb 8, 2024

Ooookay @mattkrick, now I think this is ready :)

@ParabolInc ParabolInc deleted a comment from github-actions bot Feb 8, 2024
feat: auto-add pgvector extension in production

Also:
   - fix: corrects types in standaloneMigrations.ts
   - fix: silly things I missed in the addEmbeddingTables migration

export default async () => {
  console.log('🔩 Postgres Extension Checks Started')
  if (process.env.POSTGRES_USE_PGVECTOR) {
Member:
you'll have to write === 'true' since our dotenv parser does not parse the string "true" to the boolean true.

Contributor Author:

Ugh – color me embarrassed. Coming right up sir

@mattkrick mattkrick merged commit 012ca77 into master Feb 13, 2024
5 checks passed
@mattkrick mattkrick deleted the feat-embedder-migration branch February 13, 2024 01:38
@github-actions github-actions bot mentioned this pull request Feb 15, 2024
nickoferrall added a commit that referenced this pull request Feb 19, 2024
* chore: Add more Atlassian logging (#9405)

* chore: Add more Atlassian logging

Mainly associate the logs with the traces so it's possible to check
which GraphQL request caused a certain debug output.

* Less code shuffling

* Cleanup log directories

Put them in the dev folder so they're out of sight.

* fix: fix kudos in standups in nested lists (#9412)

* Fix: fix kudos in standups in nested lists

* fix test

* chore(release): release v7.15.2 (#9414)

Co-authored-by: parabol-release-bot[bot] <150284312+parabol-release-bot[bot]@users.noreply.github.com>

* chore: update 3d secure card number in release_test.md (#9394)

Previous 3d secure card number does not work anymore and I don't see it in the stripe docs https://stripe.com/docs/testing?testing-method=card-numbers#regulatory-cards

* chore: bump node to v20.11.0 (#9410)

Signed-off-by: Matt Krick <matt.krick@gmail.com>

* chore: add embeddings table migration (#9372)

* chore: add embeddings table migration

* chore: code review changes

* feat: auto-add pgvector extension in production

Also:
   - fix: corrects types in standaloneMigrations.ts
   - fix: silly things I missed in the addEmbeddingTables migration

* fix: check for POSTGRES_USE_PGVECTOR

* fix: POSTGRES_USE_PGVECTOR strict check for === 'true'

* feat: speed up ai search (#9421)

* fix: not all jira projects are displayed in the list if there are a lot of them (#9422)

* chore(release): release v7.16.0 (#9419)

Co-authored-by: parabol-release-bot[bot] <150284312+parabol-release-bot[bot]@users.noreply.github.com>

---------

Signed-off-by: Matt Krick <matt.krick@gmail.com>
Co-authored-by: Georg Bremer <github@dschoordsch.de>
Co-authored-by: Igor Lesnenko <igor.lesnenko@gmail.com>
Co-authored-by: parabol-release-bot[bot] <150284312+parabol-release-bot[bot]@users.noreply.github.com>
Co-authored-by: Matt Krick <matt.krick@gmail.com>
Co-authored-by: Jordan Husney <jordan.husney@gmail.com>
Co-authored-by: Nick O'Ferrall <nickoferrall@gmail.com>