
Add index for track owner id and update plays mat view to concurrent #865

Merged
merged 2 commits into master Oct 5, 2020

Conversation

@jowlee (Contributor) commented Sep 29, 2020

Trello Card Link

https://trello.com/c/w1WShtHE/1590-dp-indexes-for-optimizations

Description

In discovery provider:

  • Update the aggregate_plays mat view index on play_item_id to be unique, so that the mat view refresh can be called with the CONCURRENTLY option. This should speed up the refresh when there are only a few changes, and stop the refresh from blocking queries against the view.
  • Update a few timing logs in the index_plays job.
  • Add an index on the tracks table's owner_id column. This reduced a hot query that accounts for a good amount of load from ~35 ms to a fraction of a millisecond.
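The owner_id speedup can be reproduced in miniature. Below is a sketch using SQLite's query planner (the real migration targets Postgres, and the table/index names here are assumptions, not the repo's actual identifiers): before the index, a lookup by owner is a full-table scan; after, it's an index search.

```python
import sqlite3

# Toy stand-in for the discovery provider's tracks table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tracks (track_id INTEGER PRIMARY KEY, owner_id INTEGER, title TEXT)"
)
conn.executemany(
    "INSERT INTO tracks (owner_id, title) VALUES (?, ?)",
    [(i % 100, "track %d" % i) for i in range(1000)],
)

# Without an index, filtering by owner_id scans every row.
before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM tracks WHERE owner_id = 7"
).fetchone()[3]

conn.execute("CREATE INDEX ix_tracks_owner_id ON tracks (owner_id)")

# With the index, the planner switches to an index search.
after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM tracks WHERE owner_id = 7"
).fetchone()[3]

print(before)  # e.g. "SCAN tracks"
print(after)   # e.g. "SEARCH tracks USING INDEX ix_tracks_owner_id (owner_id=?)"
```

The ~35 ms → sub-millisecond numbers from the description come from the prod snapshot, not this toy; the point is only that the plan changes from a scan to an index search.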

I wrote up a notion doc with some DB insights

Services

Discovery Provider

Does it touch a critical flow like Discovery indexing, Creator Node track upload, Creator Node gateway, or Creator Node file system?


  • ✅ Nope

How Has This Been Tested?

I booted an RDS instance from a prod snapshot and ran the DB migrations, then tested the queries for speed.
I ran the DP locally against the snapshot and made several queries for users and tracks to make sure everything still worked.

I also ran the system locally and simulated plays to test the change to REFRESH MATERIALIZED VIEW CONCURRENTLY; it worked fine.

@dmanjunath (Contributor) left a comment


Overall the code looks good; a few minor questions. One more: if we're only inserting ~13 rows per 10-second sweep, is it worth running every 10 seconds? We could run once a minute and more easily eat this cost.

@@ -33,8 +33,7 @@ def get_track_plays(self, db):
 most_recent_play_date = session.query(
     Play.updated_at
 ).order_by(
-    desc(Play.updated_at),
-    desc(Play.id)
+    desc(Play.updated_at)
Contributor

did we have the secondary sort here for a reason?

Contributor Author

I wrote this the first time, I believe, and can't think of a reason it would need a secondary sort; it's just getting the most recent date.
Actually, this query was pretty slow on prod, so a couple weeks ago we put a composite index on play.id and play.updated_at together. But that index is not in our migrations. There is also an index in the migrations on just play.updated_at, which should keep this query fast.

Member

What if two have the same updated_at? Won't this be non-deterministic then?

Contributor

@jowlee does having the secondary sort affect performance? If not, we should just not touch it.

Contributor Author

It didn't seem to, but I believe that's because of the index on both columns that isn't in the migrations. We can just leave it as is.
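To make the tie concern above concrete, here is a small plain-Python illustration (hypothetical rows, not the actual Play model): Python's sort is deterministic here, but in SQL an ORDER BY on updated_at alone leaves the winner among tied rows unspecified, while a secondary key on id pins it down.

```python
# Hypothetical play rows; the first two share the same updated_at.
plays = [
    {"id": 1, "updated_at": "2020-09-29T12:00:00"},
    {"id": 2, "updated_at": "2020-09-29T12:00:00"},
    {"id": 3, "updated_at": "2020-09-29T11:59:00"},
]

# Ordering by updated_at alone: rows 1 and 2 tie, and SQL gives no
# guarantee which comes first. Adding id as a tie-breaker (the removed
# desc(Play.id) term in the diff) makes the choice deterministic.
most_recent = max(plays, key=lambda p: (p["updated_at"], p["id"]))
print(most_recent["id"])  # 2: latest timestamp, highest id breaks the tie
```

As the thread notes, the job only reads the timestamp, so either tied row gives the same most_recent_play_date; the tie-breaker matters only if callers start depending on which row wins.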


# Update the index on the aggregate_plays materialized view to be unique for concurrent updates
connection = op.get_bind()
connection.execute('''
Contributor

have you run this on a db dump?

Contributor Author

Yeah, I ran this on a prod snapshot I booted up.
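The migration snippet above is truncated. A hypothetical Alembic upgrade body in the same shape might look like the following; the index and view names are assumptions, not the repo's actual identifiers.

```python
# Sketch only: assumes Postgres, where REFRESH MATERIALIZED VIEW ... CONCURRENTLY
# requires at least one UNIQUE index on the view.
from alembic import op


def upgrade():
    # Update the index on the aggregate_plays materialized view to be
    # unique for concurrent updates.
    connection = op.get_bind()
    connection.execute('''
        DROP INDEX IF EXISTS play_item_id_idx;
        CREATE UNIQUE INDEX play_item_id_idx
            ON aggregate_plays (play_item_id);
    ''')
```

With the unique index in place, the refresh job can run `REFRESH MATERIALIZED VIEW CONCURRENTLY aggregate_plays;`, which rebuilds the view without taking the exclusive lock that blocks readers.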

@jowlee (Contributor, Author) commented Sep 29, 2020

> Overall the code looks good; a few minor questions. One more: if we're only inserting ~13 rows per 10-second sweep, is it worth running every 10 seconds? We could run once a minute and more easily eat this cost.

Yeah, I also think we should index plays on a larger interval. It was written that way because, on startup, I wanted it to index plays from scratch at a fast pace; at 60-second intervals that would have taken around 6 hours instead of 1. Thoughts on increasing it?

@dmanjunath (Contributor) commented Sep 30, 2020

I think it's probably okay to increase the index plays interval. When we first rolled this out we wanted to minimize impact, so 10 sec made sense then; but with several up-to-date, healthy discprovs to query from now, we can probably increase it. Not to mention most SPs are restoring DP from an S3 snapshot we take daily, so they won't be indexing from scratch either. Thoughts @raymondjacobson?


@dmanjunath (Contributor)

> I think it's probably okay to increase the index plays interval. When we first rolled this out we wanted to minimize impact, so 10 sec made sense then; but with several up-to-date, healthy discprovs to query from now, we can probably increase it. Not to mention most SPs are restoring DP from an S3 snapshot we take daily, so they won't be indexing from scratch either. Thoughts @raymondjacobson?

@raymondjacobson bump about increasing play indexing to 60 sec intervals

@raymondjacobson (Member) commented Oct 2, 2020

> I think it's probably okay to increase the index plays interval. When we first rolled this out we wanted to minimize impact, so 10 sec made sense then; but with several up-to-date, healthy discprovs to query from now, we can probably increase it. Not to mention most SPs are restoring DP from an S3 snapshot we take daily, so they won't be indexing from scratch either. Thoughts @raymondjacobson?
>
> @raymondjacobson bump about increasing play indexing to 60 sec intervals

It's definitely better for us the more frequent this is, but a minute is probably fine. I'm thinking about implications for something like moving the listen-history page over to use discprov entirely: if you listen to a track and have to wait 60 s for it to be persisted, that's annoying. We can mostly cache things on the client to get around it, though.

60s is fine w/ me

@jowlee jowlee merged commit 5ae9faf into master Oct 5, 2020
@jowlee jowlee deleted the jowlee-dp-indexes branch October 5, 2020 19:33