Skip to content

Add additional database indexes for performance#1309

Merged
shangyian merged 9 commits intoDataJunction:mainfrom
shangyian:db-indexes
Jun 23, 2025
Merged

Add additional database indexes for performance#1309
shangyian merged 9 commits intoDataJunction:mainfrom
shangyian:db-indexes

Conversation

@shangyian
Copy link
Copy Markdown
Collaborator

@shangyian shangyian commented Feb 12, 2025

Summary

This migration adds missing indexes across several tables to improve query performance, particularly for recursive CTE queries and other common join patterns in the metadata database.

Specifically, indexes were added on foreign keys and frequently filtered columns across the following tables: cube, dimensionlink, nodecolumns, and noderelationship.

Benchmark Results (100 runs each) on /sql/measures

Configuration Average Duration (ms) Std Dev (ms)
Before (w/o indexes) 3035.37 528.03
After (with indexes) 2712.10 503.81

Benchmark Results (100 runs each) on /metrics/common/dimensions

Configuration Average Duration (ms) Std Dev (ms)
Before (w/o indexes) 4566.99 494.04
After (with indexes) 4415.48 479.55

A two-sample t-test confirms the difference is statistically significant for both endpoints.

Test Plan

  • PR has an associated issue: #
  • make check passes
  • make test shows 100% unit test coverage

Deployment Plan

@netlify
Copy link
Copy Markdown

netlify bot commented Feb 12, 2025

Deploy Preview for thriving-cassata-78ae72 canceled.

Name Link
🔨 Latest commit 303dba6
🔍 Latest deploy log https://app.netlify.com/sites/thriving-cassata-78ae72/deploys/67ac3ebec5f70e00081a0b53

@netlify
Copy link
Copy Markdown

netlify bot commented Apr 28, 2025

Deploy Preview for thriving-cassata-78ae72 canceled.

Name Link
🔨 Latest commit 95ae66d
🔍 Latest deploy log https://app.netlify.com/sites/thriving-cassata-78ae72/deploys/680fe756f5904b0007bc1665

__table_args__ = (UniqueConstraint("attribute_type_id", "column_id"),)
__table_args__ = (
UniqueConstraint("attribute_type_id", "column_id"),
Index("idx_columnattribute_column_id", "column_id"),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that if the earlier UniqueConstraint had the column order switched then we wouldn't need this new one at all.
If it's easier to add then to change the other one then go for it.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up removing this because I'm not sure it makes much of a difference

@shangyian shangyian marked this pull request as ready for review April 28, 2025 20:42
"""

__tablename__ = "nodeavailabilitystate"
__table_args__ = (Index("idx_nodeavailabilitystate_node_id", "node_id"),)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

columns: Mapped[List["Column"]] = relationship(
back_populates="measure",
lazy="joined",
order_by="Column.name",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure we need this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't, removed!

"""

__tablename__ = "noderelationship"
__table_args__ = (
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++1

foreign_keys="History.entity_name",
)

__table_args__ = (
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why all this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed these changes


__tablename__ = "partition"
__table_args__ = (
Index("idx_partition_column_id", "column_id", postgresql_using="btree"),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think btree is the default, so probably no need to specify it everywhere.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, also removed these indexes because I don't think they make a difference for the specific slow endpoints I'm looking at

@netlify
Copy link
Copy Markdown

netlify bot commented Jun 23, 2025

Deploy Preview for thriving-cassata-78ae72 canceled.

Name Link
🔨 Latest commit 2ecbc9e
🔍 Latest deploy log https://app.netlify.com/projects/thriving-cassata-78ae72/deploys/68598a74f4fc6c0008d180d0

@shangyian
Copy link
Copy Markdown
Collaborator Author

@agorajek I updated this with a smaller subset of indexes, which do show a stat sig increase in endpoint speed (see the updated PR description for details).

@shangyian shangyian merged commit 583c54c into DataJunction:main Jun 23, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants