SQL Generation Speed: Store pickled query AST #1280

Merged
shangyian merged 8 commits into DataJunction:main from
shangyian:pickle-query-ast
Jan 27, 2025

Collaborator

@shangyian shangyian commented Jan 21, 2025

Summary

This PR significantly speeds up SQL generation by storing and loading pickled query ASTs rather than recompiling a node query every time we need it. The primary changes include:

  • A new query_ast column on the noderevision table to store compressed pickled ASTs.
  • Every time a transform or dimension node is created, updated, or revalidated, we refresh the stored query AST.
  • When we need to load a compiled query AST, instead of parsing and compiling, which can be expensive for large/complex queries, we unpickle the object from the database and use that instead.

Our queries are structured so that these compiled query ASTs can easily be reused across many query builds.
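
The store-and-load flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: a nested dict stands in for the real AST objects, `compile_query`, `dump_ast`, and `load_ast` are hypothetical helpers, a plain dict stands in for the `query_ast` column, and `zlib` is an assumed compression codec (the summary only says the pickles are compressed).

```python
import pickle
import zlib


def compile_query(columns, table):
    """Stand-in for the expensive parse + compile step (hypothetical helper)."""
    return {
        "select": [{"name": c, "type": "bigint"} for c in columns],
        "from": table,
    }


def dump_ast(ast) -> bytes:
    """Pickle, then compress. zlib is an assumption; the PR does not name a codec."""
    return zlib.compress(pickle.dumps(ast, protocol=pickle.HIGHEST_PROTOCOL))


def load_ast(blob: bytes):
    """Decompress, then unpickle. Only do this with bytes your own service wrote."""
    return pickle.loads(zlib.decompress(blob))


# A dict stands in for the query_ast column on noderevision.
query_ast_column = {}

# On create / update / revalidate: compile once and refresh the stored AST.
original = compile_query(["order_id", "amount"], "repair_orders")
blob = dump_ast(original)
query_ast_column["default.repair_orders"] = blob

# On SQL generation: skip parse + compile and load the stored AST instead.
restored = load_ast(query_ast_column["default.repair_orders"])
```

Each load returns a fresh object equal to the one that was stored, so a query build can modify its copy without affecting other builds that load the same stored AST.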

Test Plan

Local testing + unit tests.

Deployment Plan

We should do some additional testing prior to release.

netlify bot commented Jan 21, 2025

Deploy Preview for thriving-cassata-78ae72 canceled.

Name Link
🔨 Latest commit 587d5b4
🔍 Latest deploy log https://app.netlify.com/sites/thriving-cassata-78ae72/deploys/679522cebfe49400087fba3b

@shangyian shangyian marked this pull request as ready for review January 21, 2025 20:12
@shangyian shangyian marked this pull request as draft January 22, 2025 04:54
@shangyian shangyian marked this pull request as ready for review January 23, 2025 02:55
dimension_node_names,
options=[
joinedload(Node.current).options(
selectinload(NodeRevision.columns).options(
Collaborator Author

These extra joins are fairly expensive and not used in SQL building.

measure_columns.append(expr)
expr.set_semantic_type(SemanticType.MEASURE) # type: ignore
await parent_ast.compile(context)
dependencies, _ = await parent_ast.extract_dependencies(
Collaborator Author

By extracting dependencies at this stage, we avoid another query.compile() call on the final query, which is expensive and brings little benefit (all it does is add type inference).

Member

That's one fallout of imprecise function names. Every time I see build or compile, I need to go back to the source and remind myself what it means.

Collaborator Author

Agreed... compile is kind of like a "hydrate query with metadata" stage, and build actually changes the structure of the query. If you can come up with a better function name that represents the concept, I'm happy to rename compile! 😅

def type(self) -> ColumnType:
    return self.select.type

async def build( # pylint: disable=R0913,C0415
Collaborator Author

Legacy query build that is now completely unused.

Member

@agorajek agorajek left a comment

Awesome code!


def upgrade():
    with op.batch_alter_table("noderevision", schema=None) as batch_op:
        batch_op.add_column(sa.Column("query_ast", sa.PickleType(), nullable=True))
Member

Wow, SQLAlchemy has an answer for any type these days...

Collaborator Author

Yeah, this one is convenient!

@shangyian shangyian merged commit fcc785d into DataJunction:main Jan 27, 2025
@shangyian shangyian deleted the pickle-query-ast branch January 27, 2025 17:17