
chore: add embeddings table migration #9372

Merged: 5 commits merged into master from feat-embedder-migration on Feb 13, 2024

Conversation

@jordanh commented Jan 21, 2024

Description

After getting some feedback from @mattkrick, I've split the EmbeddingsIndex into two tables with two distinct concerns:

  • EmbeddingsIndex – contains all the meta information for which we've generated indexes
  • EmbeddingsJobQueue – contains a list of items pending embedding, or the reason(s) why creating the embedding failed

An outline of how these tables will be used (a rough schema sketch follows this list):

  • For a particular embedding model, a service called embedder will...
    • ...add historical objects to be embedded to the EmbeddingsIndex
    • ...add items whose EmbeddingsIndex.model array lacks the target model to the EmbeddingsJobQueue
    • ...shift the state from EmbeddingsJobQueue.queued -> embedding when calculating an embedding for a particular model
    • ...remove the row from the EmbeddingsJobQueue if calculating the embedding was successful, and update the EmbeddingsIndex.model with the appropriate model name (A row will also be created in Embeddings_<MODEL> that can be joined to EmbeddingsIndex)
      • else, ...shift the state from queued -> failed if calculating the embedding fails
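
For orientation, here is a rough sketch of the two tables. Only objectType, refId, refDateTime, the model array, and the queued/embedding/failed states come from this PR; every other name, type, and size is an assumption for illustration:

// illustrative migration sketch, not the PR's actual schema
import {Client} from 'pg'

export const up = async (pg: Client) => {
  // assumes the EmbeddingsObjectTypeEnum type already exists
  await pg.query(`
    CREATE TABLE "EmbeddingsIndex" (
      "objectType" "EmbeddingsObjectTypeEnum" NOT NULL,
      "refId" VARCHAR(100),
      "refDateTime" TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
      "models" VARCHAR(255)[], -- the EmbeddingsIndex.model array from the description
      UNIQUE("objectType", "refId")
    );
    CREATE TABLE "EmbeddingsJobQueue" (
      "id" SERIAL PRIMARY KEY,
      "objectType" "EmbeddingsObjectTypeEnum" NOT NULL,
      "refId" VARCHAR(100),
      "model" VARCHAR(255) NOT NULL,
      "state" VARCHAR(50) NOT NULL DEFAULT 'queued', -- queued -> embedding, or failed
      "stateMessage" TEXT -- reason(s) why creating the embedding failed
    );
  `)
}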

Testing scenarios


  • Migrate up!
  • Migrate down!

Final checklist

  • I checked the code review guidelines
  • I have added Metrics Representative as reviewer(s) if my PR involves metrics/data/analytics related changes
  • I have performed a self-review of my code, the same way I'd do it for any other team member
  • I have tested all cases I listed in the testing scenarios and I haven't found any issues or regressions
  • Whenever I took a non-obvious choice I added a comment explaining why I did it this way
  • I added the label Skip Maintainer Review (indicating the PR only requires reviewer review and can be merged right after approval) if the PR introduces only minor changes, does not contain any architectural changes, does not introduce any new patterns, and I think one review is sufficient
  • PR title is human readable and could be used in changelog

@jordanh jordanh requested review from mattkrick and removed request for tianrunhe January 21, 2024 01:27
@jordanh commented Jan 21, 2024

@mattkrick in general, I think separating the two concerns of the table is better than the original implementation in #9300.

The only drawback I can think of: when adding items to the EmbeddingsJobQueue from the EmbeddingsIndex, I'll have to query the EmbeddingsIndex and join the EmbeddingsJobQueue to filter out previously failed jobs so we don't keep retrying failed embeddings in perpetuity.
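
Roughly, that query would look something like the following anti-join (names match the sketch in the description; all of it is illustrative):

import {Client} from 'pg'

const findItemsToQueue = async (pg: Client, targetModel: string) => {
  // index rows that lack the target model AND have no existing job row
  // (queued, embedding, or failed) for that model
  const {rows} = await pg.query(
    `SELECT i."objectType", i."refId"
       FROM "EmbeddingsIndex" i
       LEFT JOIN "EmbeddingsJobQueue" q
         ON q."objectType" = i."objectType"
        AND q."refId" = i."refId"
        AND q."model" = $1
      WHERE $1 <> ALL(COALESCE(i."models", '{}')) -- target model not embedded yet
        AND q."id" IS NULL                        -- anti-join: no job row, failed or otherwise`,
    [targetModel]
  )
  return rows
}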

Definitely open to suggestions for any improvements/changes

"objectType" "EmbeddingsObjectTypeEnum" NOT NULL,
"refId" VARCHAR(100),
UNIQUE("objectType", "refId"),
"refDateTime" TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
Member:

-1 not sure what this is. if this is the time the row is created could we call this createdAt?

if it's when the reference is created could we call it refCreatedAt or leave it normalized on the table related to objectType?

Contributor Author:

The intent is to have a time on the index for "what time the embedded object represents". The use case would be filtering by a date range; this denormalizes that contextual time onto the index. Job story:

When I want to search for discussions that only occurred between two dates, I want a set of filter controls that return only the results I am interested in.
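
For example, a hypothetical date-range filter over that denormalized column (only refDateTime and the objectType value come from this PR):

import {Client} from 'pg'

const findDiscussionsBetween = async (pg: Client, start: Date, end: Date) => {
  const {rows} = await pg.query(
    `SELECT "objectType", "refId"
       FROM "EmbeddingsIndex"
      WHERE "objectType" = 'retrospectiveDiscussionTopic'
        AND "refDateTime" BETWEEN $1 AND $2`,
    [start, end]
  )
  return rows
}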

It's debatable whether we need this now. I'd be ok to drop it until we need it.

What say you?

Member:

ah! that's a super useful feature. so this isn't necessarily when the embedding is created, but when the reference was last updated? if that entity is mutable & gets updated, then we update this column + the embed text? if that's the case, i might suggest something like refUpdatedAt to make a little more clear, but i really like having the date here

Contributor Author:

Ok, will do.

@mattkrick commented Jan 22, 2024

> The only drawback I can think of: when adding items to the EmbeddingsJobQueue from the EmbeddingsIndex, I'll have to query the EmbeddingsIndex and join the EmbeddingsJobQueue to filter out previously failed jobs so we don't keep retrying failed embeddings in perpetuity.

I totally agree. Gave this some extra thought today. Since the job queue can be calculated by determining which objects don't have an embedding, it'd be great to leverage that into something simpler. From your PR, I think we can break down the functionality into the following:

  • IndexBuilder (populates the EmbeddingsIndex table)
  • QueueBuilder (populates the queue, a single worker per model/objectType)
  • ObjectBuilder (turns a meeting into many discussion objects)
  • LLM Endpoint (could be 1 or many behind a LB)
  • Embedder to pop the queue & call the endpoint (could be 1 or many)

The real bottleneck here is going to be the LLM endpoint. There's no easy way to distribute the work of building the queue, but we can break it up by objectType+model to distribute the load a bit.

Given those constraints, here's my proposal (a consumer sketch follows the list):

  • On startup, Embedder queries EmbeddingsIndex to see if at least 1 record exists for each model for each objectType
  • If an objectType for a model isn't found, it seeks a lock on redis key objectBuilder:${objectType}:${model}. If it's achieved, that embedder becomes IndexBuilder. If it's not achieved, it listens for redis to tell it the queue is created by subscribing to objectBuilder:${objectType}:${model}:complete
  • IndexBuilder iterates through EmbeddingsObjectTypeEnum.
    • It has a handler for each, e.g. for retrospectiveDiscussionTopic
    • it batches through NewMeeting extracting the id of each meeting from r.minval to r.now() (new items ignored, they are handled by GQL Executor).
    • It creates a row in EmbeddingsIndex for each item. on conflict it does nothing.
  • When complete, IndexBuilder becomes QueueBuilder
    • it left joins to Embeddings_% and filters for null to create a list of objects without embeddings. This job queue is pushed to a redis stream via xadd
    • On crash, an embedder can skip IndexBuilder and become QueueBuilder if prompted (Out of scope, but this supports redis without persistence if we want to build it later)
    • it publishes objectBuilder:${objectType}:${model}:complete to alert all Embedders to begin work. Those embedders do not start work until they exit QueueBuilder mode.
  • finally, Embedders subscribe to the queue via .xreadgroup(). They use xautoclaim to support failover. they ack when the job is done. they call xclaim if a task is taking longer than expected. see https://cschleiden.dev/blog/2022-04-08-task-queue-with-redis
  • Once Embedder has an item, it acts as ObjectBuilder & computes embedText & writes it to the DB table
  • After computing embedText, it sends it to the LLM endpoint to get embeddings. on return, it writes that to the Embeddings_% table
  • Finally, it calls ACK to redis to report the job as complete
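
A minimal consumer sketch for those last steps using ioredis; the stream and group names are assumptions, and error handling is elided:

import Redis from 'ioredis'

const redis = new Redis()
const STREAM = 'embedder:jobs' // assumed stream name
const GROUP = 'embedders'

const consumeOne = async (consumerId: string) => {
  // take over messages a crashed embedder left pending for > 60s (failover)
  await redis.xautoclaim(STREAM, GROUP, consumerId, 60_000, '0')
  // pop the next job for this consumer group
  const res = (await redis.xreadgroup(
    'GROUP', GROUP, consumerId,
    'COUNT', 1,
    'STREAMS', STREAM, '>'
  )) as [string, [string, string[]][]][] | null
  if (!res) return
  for (const [, messages] of res) {
    for (const [id] of messages) {
      // ObjectBuilder step: compute embedText, write it to the DB table,
      // send it to the LLM endpoint, write the result to Embeddings_<MODEL>
      await redis.xack(STREAM, GROUP, id) // ack when the job is done
    }
  }
}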

@jordanh commented Jan 23, 2024

What you propose is strikingly similar to what I've written, except for a few details. Let's work to get this PR merged, and then I'll make some updates to the service I wrote to match the new nomenclature...and we can continue. Sound ok?

Awaiting your comments on some of the +1s

@mattkrick commented:

sounds great! Yep, I think we've largely converged on the embedder layout, i still gotta do a deeper dive on your PR to make sure i understand everything. redis vs. pg's SKIP LOCKED is a good area for convo, i haven't read enough to make an informed decision.

@jordanh commented Jan 24, 2024

> sounds great! Yep, I think we've largely converged on the embedder layout, i still gotta do a deeper dive on your PR to make sure i understand everything. redis vs. pg's SKIP LOCKED is a good area for convo, i haven't read enough to make an informed decision.

We can definitely kick this conversation to the next PR :)

@jordanh commented Jan 26, 2024

@mattkrick, I've made all the changes we discussed.

Would you mind giving this a migrate up/down test? I'd do it, but I don't want to lose all the data I've calculated :)

@jordanh commented Feb 1, 2024

@dbumblis-parabol this is the PR that is going to land soon, FYI

@dbumblis-parabol commented Feb 1, 2024

There may be a challenge in the near future where we cannot control pgvector being installed in the ironbank container image used for postgres in the pubsec deployment of the application. Is there any way that the migration itself can be gated? Or, how could we address a situation where the postgresql db does not have pgvector installed?

@jordanh commented Feb 2, 2024

> Or, how could we address a situation where the postgresql db does not have pgvector installed?

@mattkrick now here's an interesting case. To rephrase what @dbumblis-parabol is saying:

  • If we merge this and cut a new IronBank image...
  • ...when the migration runs, it may fail because pgvector may not be available in the destination environment

Blarg. That would stink.

What would be a reasonable pattern here? How could we "soft require" pgvector?

@Dschoordsch commented:

Do we want to make this all optional, i.e. put the vector data in a different database? Then we couldn't easily join it with the rest of the data, but I don't see any immediate negative consequences from this.
If it were a separate DB, it would be easier to gate migrations and ensure fewer accidents if we want the option of not using AI in certain offerings.

@mattkrick commented:

if usage of pgvector is dependent on an environment & that decision could change over time then i think an env var is probably the cleanest solution.

// .env
USE_PGVECTOR=true

// preDeploy
if (process.env.USE_PGVECTOR === 'true') {
  // note: pgvector registers itself in pg_extension under the name 'vector'
  const enabled = await pg.query(`SELECT 1 FROM pg_extension WHERE extname = 'vector'`).executeAndTakeFirst()
  if (!enabled) {
    await pg.query(`CREATE EXTENSION vector;`).execute()
  }
}

I'll have to hear more about @Dschoordsch's idea. A 2nd DB would mean managing a 2nd pool of PG connections, so I'd want to make sure that we're getting something worthwhile for the complexity.

@jordanh commented Feb 5, 2024


@mattkrick I like this.

What to do about the migration in packages/server/postgres/migrations/1703031300000_addEmbeddingTables.ts tho? The up migration will fail if it is run and the vector extension isn't available.

Here's a bad option I can think of:

  1. Rewrite the migration to skip adding the columns that rely on the vector type if the extension is not loaded
  2. Update the preDeploy script to add the extension and the column if the USE_PGVECTOR env var is true

Do you have a better idea than this? If not, I can go ahead and make those changes and we can get this puppy merged

@mattkrick commented:

Interesting problem because 2 different envs will have 2 different DB schemas. this could get pretty hairy!

i like your solution. the alternative i see is having the migration add the column as type bytea and then changing that column type to vector in predeploy. since the table won't have rows yet, changing the column type won't be expensive. It will be a little closer to the truth because the column will exist as the wrong type vs. not existing at all.
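
a sketch of that alternative in preDeploy (the table and column names are hypothetical; USING is required since there's no implicit bytea -> vector cast, and it's cheap because the table has no rows yet):

import {Client} from 'pg'

const retypeEmbeddingColumn = async (pg: Client) => {
  await pg.query(
    `ALTER TABLE "Embeddings_ada_002"
     ALTER COLUMN "embedding" TYPE vector(1536) USING NULL::vector`
  )
}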

I'm OK with either!

@github-actions github-actions bot added size/m and removed size/s labels Feb 6, 2024
@jordanh commented Feb 6, 2024

@mattkrick and I discussed via Slack: since the migration doesn't rely on any vector column data types (they are all added dynamically by the embedder process), updating the preDeploy script to create the extension when an env var is set will be the way to go for now.

I've created a boolean env var called POSTGRES_USE_PGVECTOR to gate this behavior.

@jordanh commented Feb 6, 2024

@mattkrick I think this is GTG now

@Dschoordsch commented:

> I'll have to hear more about @Dschoordsch's idea. A 2nd DB would mean managing a 2nd pool of PG connections, so I'd want to make sure that we're getting something worthwhile for the complexity.

That was just an idea I had during our call yesterday. I thought it would be easier to skip the entire DB and embedding module if the environment variable is not present. In that case our normal migrations wouldn't depend on environment variables, apart from connection strings. I think it would have no technical benefit, but it might keep developers aware that they have to check for the absence of that data.
It also introduces more complexity in the migrations, as we might (but probably won't) run into the same ordering issue we had between rethink and postgres.
In both cases the logic to generate and consume the embeddings needs to be guarded by an environment variable, so there is no benefit there.

So I think just changing the type and not populating the data is way easier. We could still run into issues when querying with the non-existent type and not catch it before it's deployed to ironbank users.

@jordanh commented Feb 6, 2024

@Dschoordsch I think as this PR now stands we have the best of all worlds:

  • The POSTGRES_USE_PGVECTOR env var gates whether or not the preDeploy will try to enable the pgvector extension

And in the next PR:

  • The embedder's AI_EMBEDDER_ENABLED env var will gate whether or not the embedder will even run, which determines whether or not the tables that contain the vector type will even be created

This should let us run flexibly in our SaaS or in a self-hosted model with or without pgvector
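
An illustrative startup guard for the embedder (startEmbedder and its module path are hypothetical; only the env var name comes from this thread):

const main = async () => {
  if (process.env.AI_EMBEDDER_ENABLED === 'true') {
    const {startEmbedder} = await import('./embedder') // assumed entry point
    await startEmbedder() // creates the vector-typed Embeddings_<MODEL> tables on demand
  }
}
main()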

@jordanh commented Feb 8, 2024

Ooookay @mattkrick, now I think this is ready :)

@ParabolInc ParabolInc deleted a comment from github-actions bot Feb 8, 2024
feat: auto-add pgvector extension in production

Also:
   - fix: corrects types in standaloneMigrations.ts
   - fix: silly things I missed in the addEmbeddingTables migration

export default async () => {
  console.log('🔩 Postgres Extension Checks Started')
  if (process.env.POSTGRES_USE_PGVECTOR) {
Member:
you'll have to write === 'true' since our dotenv parser does not parse the string "true" to the boolean true.

Contributor Author:

Ugh – color me embarrassed. Coming right up sir

@mattkrick mattkrick merged commit 012ca77 into master Feb 13, 2024
5 checks passed
@mattkrick mattkrick deleted the feat-embedder-migration branch February 13, 2024 01:38
@github-actions github-actions bot mentioned this pull request Feb 15, 2024
nickoferrall added a commit that referenced this pull request Feb 19, 2024
* chore: Add more Atlassian logging (#9405)

* chore: Add more Atlassian logging

Mainly associate the logs with the traces so it's possible to check
which GraphQL request caused a certain debug output.

* Less code shuffling

* Cleanup log directories

Put them in the dev folder so they're out of sight.

* fix: fix kudos in standups in nested lists (#9412)

* Fix: fix kudos in standups in nested lists

* fix test

* chore(release): release v7.15.2 (#9414)

Co-authored-by: parabol-release-bot[bot] <150284312+parabol-release-bot[bot]@users.noreply.github.com>

* chore: update 3d secure card number in release_test.md (#9394)

Previous 3d secure card number does not work anymore and I don't see it in the stripe docs https://stripe.com/docs/testing?testing-method=card-numbers#regulatory-cards

* chore: bump node to v20.11.0 (#9410)

Signed-off-by: Matt Krick <matt.krick@gmail.com>

* chore: add embeddings table migration (#9372)

* chore: add embeddings table migration

* chore: code review changes

* feat: auto-add pgvector extension in production

Also:
   - fix: corrects types in standaloneMigrations.ts
   - fix: silly things I missed in the addEmbeddingTables migration

* fix: check for POSTGRES_USE_PGVECTOR

* fix: POSTGRES_USE_PGVECTOR strict check for === 'true'

* feat: speed up ai search (#9421)

* fix: not all jira projects are displayed in the list if there are a lot of them (#9422)

* chore(release): release v7.16.0 (#9419)

Co-authored-by: parabol-release-bot[bot] <150284312+parabol-release-bot[bot]@users.noreply.github.com>

---------

Signed-off-by: Matt Krick <matt.krick@gmail.com>
Co-authored-by: Georg Bremer <github@dschoordsch.de>
Co-authored-by: Igor Lesnenko <igor.lesnenko@gmail.com>
Co-authored-by: parabol-release-bot[bot] <150284312+parabol-release-bot[bot]@users.noreply.github.com>
Co-authored-by: Matt Krick <matt.krick@gmail.com>
Co-authored-by: Jordan Husney <jordan.husney@gmail.com>
Co-authored-by: Nick O'Ferrall <nickoferrall@gmail.com>