Skip to content

Speed up the recursive dimensions graph CTE query#1385

Merged
shangyian merged 11 commits intoDataJunction:mainfrom
shangyian:faster-dim-dag
May 22, 2025
Merged

Speed up the recursive dimensions graph CTE query#1385
shangyian merged 11 commits intoDataJunction:mainfrom
shangyian:faster-dim-dag

Conversation

@shangyian
Copy link
Copy Markdown
Collaborator

@shangyian shangyian commented May 21, 2025

Summary

The existing dimensions graph query is a recursive CTE that is increasingly slow on even fairly small graphs. I had a closer look at the recursive CTE query we're using, and it looks like it's exhaustively tracing the graph in a way that finds every possible path to a dimension node, rather than just picking up the shortest paths.

This PR changes things so that the dimensions graph function does the following:

  1. Discovers all dimension nodes in the graph via a recursive CTE (this is the same structure as the node upstreams or downstreams query and is reasonably fast).
  2. Uses each dimension node's columns to augment with reference links.
  3. Augments with any local dimensions from the original node.

Test Plan

  • PR has an associated issue: #
  • make check passes
  • make test shows 100% unit test coverage

Deployment Plan

@netlify
Copy link
Copy Markdown

netlify bot commented May 21, 2025

Deploy Preview for thriving-cassata-78ae72 ready!

Name Link
🔨 Latest commit cc1f254
🔍 Latest deploy log https://app.netlify.com/projects/thriving-cassata-78ae72/deploys/682ebad1ed10ae000872943e
😎 Deploy Preview https://deploy-preview-1385--thriving-cassata-78ae72.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@shangyian shangyian changed the title WIP Speed up the recursive dimensions graph CTE query Speed up the recursive dimensions graph CTE query May 21, 2025
@shangyian shangyian changed the title Speed up the recursive dimensions graph CTE query WIP: Speed up the recursive dimensions graph CTE query May 21, 2025
@shangyian shangyian changed the title WIP: Speed up the recursive dimensions graph CTE query Speed up the recursive dimensions graph CTE query May 21, 2025
Copy link
Copy Markdown
Member

@agorajek agorajek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you just added local_dimensions. This is super cool. I am going to test it locally as well and then respond again.

type=str(col.type),
path=path,
)
return None
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed in Python.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I just had an annoying linter flag this function for not having returns in every case 😱


paths = dag.union_all(
select(
dag.c.node_revision_id,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is .c ?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's how you reference a column on a SQLAlchemy query or table object

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow. What a shortcut... Show how much I know ORMs :)

@shangyian shangyian marked this pull request as ready for review May 22, 2025 04:12
type=str(col.type),
path=[],
)
for col in node.current.columns
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we skip some numerical columns here, or just let the user hurt themselves?

Copy link
Copy Markdown
Collaborator Author

@shangyian shangyian May 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by numerical columns? Never mind I see what you mean!

)
for col in node.current.columns
]
local_dimensions = [
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh never mind, here is the filtering I asked about above. We could probably merge these two statements, cause there may be some unused work in the first one. But up to you.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohhh good catch, let me merge these two.

key=lambda x: x[0],
) == sorted(
[
("default.users.account_type", ["default.events", "default.users"]),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to figure out why the default.users item was added here. Any ideas?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it seems like it was added in more unit tests in the output.

Copy link
Copy Markdown
Member

@agorajek agorajek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so much better. I think the stages of building the dimension list are much clearer now.

@shangyian shangyian merged commit a5fcaf1 into DataJunction:main May 22, 2025
16 checks passed
@shangyian shangyian deleted the faster-dim-dag branch May 22, 2025 18:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants