Add index for track owner id and update plays mat view to concurrent #865
Conversation
Overall the code looks good, just a few minor questions. One more question I have: if we're only inserting ~13 rows per 10-second sweep, is it worth running every 10 seconds? We could run once a minute and more easily eat this cost.
@@ -33,8 +33,7 @@ def get_track_plays(self, db):
     most_recent_play_date = session.query(
         Play.updated_at
     ).order_by(
-        desc(Play.updated_at),
-        desc(Play.id)
+        desc(Play.updated_at)
did we have the secondary sort here for a reason?
I wrote this the first time, I believe, and can't think of a reason it would need a secondary sort. It's just getting the most recent date.
Actually, I think this query was pretty slow on prod, so a couple weeks ago we put an index on play.id and play.updated_at together. But that index is not in our migrations. There is also an index in the migrations on just play.updated_at, which should keep this query fast.
what if two have the same updated_at? won't this be non-deterministic then?
@jowlee does having the secondary sort affect performance? If not, we should just not touch it.
It didn't seem to, but I believe that's because of the index on both cols that isn't in the migrations. We can just leave it as is.
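The non-determinism concern above can be illustrated with a small standalone sketch (plain Python with hypothetical data, not the project's actual models): ordering by `updated_at` alone leaves rows with equal timestamps in whatever order the database happens to return them, while a secondary sort on `id` (mirroring `ORDER BY updated_at DESC, id DESC`) makes the order fully determined.

```python
from datetime import datetime

# Hypothetical play rows as (id, updated_at); ids 2 and 3 share a timestamp.
plays = [
    (1, datetime(2021, 3, 1, 12, 0, 0)),
    (3, datetime(2021, 3, 1, 12, 0, 5)),  # tie with id=2
    (2, datetime(2021, 3, 1, 12, 0, 5)),  # tie with id=3
]

# Ordering by updated_at alone: the relative order of the two tied rows
# is whatever the input happened to be -- in SQL, not guaranteed at all.
by_date_only = sorted(plays, key=lambda p: p[1], reverse=True)

# Adding id as a secondary key breaks ties deterministically.
by_date_then_id = sorted(plays, key=lambda p: (p[1], p[0]), reverse=True)

print([p[0] for p in by_date_then_id])  # → [3, 2, 1], a fully determined order
```

Since this query only takes the single most recent `updated_at` value (not a specific row), the tie-breaker arguably doesn't change the result here, which matches the decision to leave it as is.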
# Update the index on the aggregate_plays materialized view to be unique for concurrent updates
connection = op.get_bind()
connection.execute('''
have you run this on a db dump?
Yeah, I ran this on a prod snapshot I booted up.
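The body of the `connection.execute('''` call is truncated in this view. Based on the PR description (making the `aggregate_plays` index on `play_item_id` unique so the view can be refreshed concurrently), the migration presumably looks something like the following hypothetical sketch; the index name and `IF EXISTS` guards are assumptions, not the actual migration contents:

```python
# Hypothetical reconstruction of the truncated migration body above.
# Postgres requires at least one UNIQUE index on a materialized view
# before REFRESH MATERIALIZED VIEW CONCURRENTLY is allowed.
connection = op.get_bind()
connection.execute('''
    DROP INDEX IF EXISTS play_item_id_idx;
    CREATE UNIQUE INDEX play_item_id_idx
        ON aggregate_plays (play_item_id);
''')
```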
Yeah, I also think we should index plays on a larger interval. It was written that way because, on startup, I wanted it to index plays from scratch at a fast pace; done every 60 seconds it would have taken around 6 hrs instead of 1 hr. Thoughts on increasing it?
Branch updated from 3edc8fc to 02b8fe4.
I think it's probably okay to increase the index plays interval? When we first rolled this thing out we wanted to minimize impact, so 10 sec made sense then. But with several up-to-date, healthy discprovs to query from now, we can probably increase the interval. Not to mention most SPs are restoring DP from an S3 snapshot we take daily, so they won't be indexing from scratch either. Thoughts @raymondjacobson?
@raymondjacobson bump about increasing play indexing to 60 sec intervals
It's definitely better for us the more frequent this is, but a minute is probably fine. I'm thinking about implications for something like moving the listen history page over to use discprov entirely: if you listen to a track and have to wait 60s for it to be persisted, that's annoying. We can cache things on the client to get around it for the most part, though. 60s is fine w/ me.
Trello Card Link
https://trello.com/c/w1WShtHE/1590-dp-indexes-for-optimizations
Description
In discovery provider:
aggregate_plays: make the index on play_item_id unique so that the mat view refresh can be called with the concurrently option. This should speed it up when there are only a few changes. It should also unblock queries to this mat view.
I wrote up a notion doc with some DB insights.
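The "concurrently option" referenced here corresponds to Postgres's `REFRESH MATERIALIZED VIEW CONCURRENTLY`, which refreshes the view without taking the exclusive lock that would block concurrent `SELECT`s; this is also why the index must be unique. A hypothetical sketch of how the refresh might be issued (the session/connection usage is an assumption, not the project's actual refresh code):

```python
# CONCURRENTLY avoids blocking readers during the refresh, at the cost of
# requiring a UNIQUE index on the materialized view (added by this PR).
session.execute(
    'REFRESH MATERIALIZED VIEW CONCURRENTLY aggregate_plays'
)
```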
Services
Discovery Provider
Does it touch a critical flow like Discovery indexing, Creator Node track upload, Creator Node gateway, or Creator Node file system?
How Has This Been Tested?
I booted an RDS instance from a prod snapshot and ran the DB migrations. From there I tested the queries for speed.
I ran the DP locally against the snapshot and made several queries for users and tracks to make sure everything was still working.
I also ran the system locally and simulated plays to test the change to refresh the materialized view concurrently, and it worked fine.