chore: add embeddings table migration #9372
Conversation
@mattkrick in general, I think separating the two concerns of the table over the original implementation in #9300 is better. The only drawback I can think of is, when adding items to the […]

Definitely open to suggestions for any improvements/changes
packages/server/postgres/migrations/1703031300000_addEmbeddingTables.ts
"objectType" "EmbeddingsObjectTypeEnum" NOT NULL, | ||
"refId" VARCHAR(100), | ||
UNIQUE("objectType", "refId"), | ||
"refDateTime" TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL, |
-1 not sure what this is. if this is the time the row is created could we call this `createdAt`? if it's when the reference is created could we call it `refCreatedAt`, or leave it normalized on the table related to objectType?
The intent is to have a time on the index for "what time the embedded object represents". Use case would be filtering by a date range. This intends to denormalize that contextual time in the index. Job story:
When I want to search for discussions that only occurred between two dates, I want a set of filter controls that return only the results I am interested in.
It's debatable whether we need this now. I'd be ok to drop it until we need it.
What say you?
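To make the use case above concrete, here is a minimal sketch of the kind of date-range filter a denormalized timestamp on the index enables. The row shape and data are hypothetical, not the actual schema; in SQL this would just be `WHERE "refDateTime" BETWEEN $1 AND $2`.

```typescript
// Hypothetical index row: refDateTime is the contextual time the
// embedded object represents, denormalized onto the index.
interface IndexRow {
  objectType: string
  refId: string
  refDateTime: Date
}

const rows: IndexRow[] = [
  {objectType: 'discussion', refId: 'd1', refDateTime: new Date('2023-11-01')},
  {objectType: 'discussion', refId: 'd2', refDateTime: new Date('2023-12-15')},
  {objectType: 'discussion', refId: 'd3', refDateTime: new Date('2024-01-20')}
]

// Equivalent of: WHERE "refDateTime" BETWEEN $1 AND $2
const betweenDates = (rows: IndexRow[], from: Date, to: Date) =>
  rows.filter((r) => r.refDateTime >= from && r.refDateTime <= to)

const hits = betweenDates(rows, new Date('2023-12-01'), new Date('2023-12-31'))
```

The point of the denormalization is that this filter never has to join back to the source table for the date.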
ah! that's a super useful feature. so this isn't necessarily when the embedding is created, but when the reference was last updated? if that entity is mutable & gets updated, then we update this column + the embed text? if that's the case, i might suggest something like `refUpdatedAt` to make it a little more clear, but i really like having the date here
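The mutable-entity flow being discussed can be sketched as follows; `refUpdatedAt` comes from the thread, while the `embedText` field and row shape are assumptions for illustration:

```typescript
// Hypothetical index row: when the source entity mutates, the embedded
// text and refUpdatedAt are refreshed together so they stay consistent.
interface EmbeddingsIndexRow {
  refId: string
  embedText: string
  refUpdatedAt: Date
}

const onEntityUpdated = (
  row: EmbeddingsIndexRow,
  newText: string,
  updatedAt: Date
): EmbeddingsIndexRow => ({
  ...row,
  embedText: newText,
  refUpdatedAt: updatedAt
})

const before: EmbeddingsIndexRow = {
  refId: 'd1',
  embedText: 'old text',
  refUpdatedAt: new Date('2023-12-01')
}
const after = onEntityUpdated(before, 'new text', new Date('2023-12-20'))
```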
Ok, will do.
I totally agree. Gave this some extra thought today. Since the job queue can be calculated by determining which objects don't have an embedding, it'd be great to leverage that into something simpler. From your PR, I think we can break down the functionality into the following: […]

The real bottleneck here is going to be the LLM endpoint. There's no easy way to distribute the work of building the queue, but we can break it up by objectType+model to distribute the load a bit. Given those constraints, here's my proposal: […]
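The objectType+model split mentioned above can be pictured as a simple bucketing step; the job shape and values here are hypothetical, not the actual queue schema:

```typescript
// Hypothetical pending-embedding job
interface Job {
  objectType: string
  model: string
  refId: string
}

// Group jobs by objectType + model so each bucket can be worked
// independently, spreading load across the LLM endpoint.
const partition = (jobs: Job[]) => {
  const buckets = new Map<string, Job[]>()
  for (const job of jobs) {
    const key = `${job.objectType}:${job.model}`
    const bucket = buckets.get(key) ?? []
    bucket.push(job)
    buckets.set(key, bucket)
  }
  return buckets
}

const buckets = partition([
  {objectType: 'discussion', model: 'llama', refId: 'a'},
  {objectType: 'discussion', model: 'llama', refId: 'b'},
  {objectType: 'meeting', model: 'llama', refId: 'c'}
])
```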
What you propose is strikingly similar to what I've written, except for a few details. Let's work to get this PR merged, and then I'll make some updates to the service I wrote to match the new nomenclature... and we can continue. Sound ok? Awaiting your comments on some of the +1s
sounds great! Yep, I think we've largely converged on the embedder layout, i still gotta do a deeper dive on your PR to make sure i understand everything. redis vs. pg's […]
We can definitely kick this conversation to the next PR :)
@mattkrick, I've made all the changes we discussed. Would you mind giving it a migrate up/down test? I'd do it, but I don't want to lose all the data I've calculated :)
@dbumblis-parabol this is the PR that is going to land soon, FYI
There may be a challenge in the near future where we cannot control pgvector being installed in the ironbank container image used for postgres in the pubsec deployment of the application. Is there any way that the migration itself can be gated? Or, how could we address a situation where the postgresql db does not have pgvector installed?
@mattkrick now here's an interesting case. To rephrase what @dbumblis-parabol is saying: […]

Blarg. That would stink. What would be a reasonable pattern here? How could we "soft require" pgvector?
Do we want to make this all optional, i.e. put the vector data in a different database? Then we could not easily merge it with the rest of the data, but I don't see any immediate negative consequences from this.
if usage of pgvector is dependent on an environment & that decision could change over time then i think an env var is probably the cleanest solution.

```typescript
// .env
// USE_PGVECTOR=true

// preDeploy
if (process.env.USE_PGVECTOR === 'true') {
  // note: pgvector registers itself as 'vector' in pg_extension
  const enabled = await pg
    .query(`SELECT 1 FROM pg_extension WHERE extname = 'vector'`)
    .executeAndTakeFirst()
  if (!enabled) {
    await pg.query(`CREATE EXTENSION vector;`).execute()
  }
}
```

I'll have to hear more about @Dschoordsch's idea. A 2nd DB would mean managing a 2nd pool of PG connections, so I'd want to make sure that we're getting something worthwhile for the complexity.
@mattkrick I like this. What to do about the migration in […]? Here's a bad option I can think of: […]

Do you have a better idea than this? If not, I can go ahead and make those changes and we can get this puppy merged
Interesting problem because 2 different envs will have 2 different DB schemas. this could get pretty hairy! i like your solution. the alternative i see is having the migration add the column of type […]. I'm OK with either!
@mattkrick and I discussed via Slack and since the migration relies on no […], I've created a bool env var called `POSTGRES_USE_PGVECTOR`
@mattkrick I think this is GTG now
That was just an idea I had during our call yesterday. I thought it would be easier to skip the entire db and embedding module if the environment is not present. In that case our normal migrations wouldn't depend on environment variables, apart from connection strings. I think it would have no technical benefit, but it might keep developers aware that they have to check for the absence of that data. So I think just changing the type and not populating the data is way easier. We could run into issues when querying with the non-existent type and not catch it before it's deployed to ironbank users.
@Dschoordsch I think as this PR now stands we have the best of all worlds: […]

And in the next PR: […]

This should let us run flexibly in our SaaS or in a self-hosted model with or without pgvector
Ooookay @mattkrick, now I think this is ready :)
Also:
- fix: corrects types in standaloneMigrations.ts
- fix: silly things I missed in the addEmbeddingTables migration
```typescript
export default async () => {
  console.log('🔩 Postgres Extension Checks Started')
  if (process.env.POSTGRES_USE_PGVECTOR) {
```
you'll have to write `=== 'true'` since our dotenv parser does not parse the string "true" to the boolean type `true`.
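To illustrate why the strict comparison matters: env values are always strings, so a bare truthiness check would treat any non-empty value, including "false", as set.

```typescript
// Env values arrive as string | undefined, never boolean.
const flagIsSet = (value: string | undefined) => value === 'true'

// A plain truthiness check mis-handles the string "false":
const truthy = Boolean('false') // any non-empty string is truthy
```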
Ugh – color me embarrassed. Coming right up sir
* chore: Add more Atlassian logging (#9405) — mainly associate the logs with the traces so it's possible to check which GraphQL request caused a certain debug output; less code shuffling; cleanup log directories (put them in the dev folder so they're out of sight)
* fix: fix kudos in standups in nested lists (#9412) — fix kudos in standups in nested lists; fix test
* chore(release): release v7.15.2 (#9414)
* chore: update 3d secure card number in release_test.md (#9394) — the previous 3d secure card number does not work anymore and does not appear in the stripe docs: https://stripe.com/docs/testing?testing-method=card-numbers#regulatory-cards
* chore: bump node to v20.11.0 (#9410)
* chore: add embeddings table migration (#9372) — add embeddings table migration; code review changes; feat: auto-add pgvector extension in production; fix: corrects types in standaloneMigrations.ts; fix: silly things I missed in the addEmbeddingTables migration; fix: check for POSTGRES_USE_PGVECTOR; fix: POSTGRES_USE_PGVECTOR strict check for === 'true'
* feat: speed up ai search (#9421)
* fix: not all jira projects are displayed in the list if there are a lot of them (#9422)
* chore(release): release v7.16.0 (#9419)

Signed-off-by: Matt Krick <matt.krick@gmail.com>
Co-authored-by: Georg Bremer <github@dschoordsch.de>
Co-authored-by: Igor Lesnenko <igor.lesnenko@gmail.com>
Co-authored-by: parabol-release-bot[bot] <150284312+parabol-release-bot[bot]@users.noreply.github.com>
Co-authored-by: Matt Krick <matt.krick@gmail.com>
Co-authored-by: Jordan Husney <jordan.husney@gmail.com>
Co-authored-by: Nick O'Ferrall <nickoferrall@gmail.com>
Description
After getting some feedback from @mattkrick, I've split the `EmbeddingsIndex` into two tables with two distinct concerns:

- `EmbeddingsIndex` – contains all the meta information for which we've generated indexes
- `EmbeddingsJobQueue` – contains a list of items pending embedding, or the reason(s) why creating the embedding failed

An outline of how these tables will be used. The `embedder` will:

- […] `EmbeddingsIndex`
- […] `EmbeddingsIndex.model` array to the `EmbeddingsJobQueue`
- […] `EmbeddingsJobQueue` `queued` -> `embedding` when calculating an embedding for a particular model
- […] `EmbeddingsJobQueue` if calculating the embedding was successful, and update the `EmbeddingsIndex.model` with the appropriate model name (a row will also be created in `Embeddings_<MODEL>` that can be joined to `EmbeddingsIndex`)
- `queued` -> `failed` if calculating the embedding should fail

Testing scenarios
[Please list all the testing scenarios a reviewer has to check before approving the PR]
Final checklist