
Add index for track owner id and update plays mat view to concurrent #865

Merged
merged 2 commits into master Oct 5, 2020

Conversation

@jowlee (Contributor) commented Sep 29, 2020

Trello Card Link

https://trello.com/c/w1WShtHE/1590-dp-indexes-for-optimizations

Description

In discovery provider:

  • Update the aggregate_plays mat view index on play_item_id to be unique, so that the mat view refresh can be called with the CONCURRENTLY option. This should speed up the refresh when there are only a few changes, and stop the refresh from blocking queries against the view.
  • Update a few timing logs in the index_plays job.
  • Add an index on the tracks table's owner_id column. This reduced a hot query that accounts for a good amount of load from ~35 ms to a fraction of a millisecond.
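The owner_id speedup can be reproduced in miniature. Below is a sketch using SQLite's query planner (the real migration targets Postgres, and the table/index names here are assumptions, not the repo's actual identifiers): before the index, a lookup by owner is a full-table scan; after, it's an index search.

```python
import sqlite3

# Toy stand-in for the discovery provider's tracks table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tracks (track_id INTEGER PRIMARY KEY, owner_id INTEGER, title TEXT)"
)
conn.executemany(
    "INSERT INTO tracks (owner_id, title) VALUES (?, ?)",
    [(i % 100, "track %d" % i) for i in range(1000)],
)

# Without an index, filtering by owner_id scans every row.
before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM tracks WHERE owner_id = 7"
).fetchone()[3]

conn.execute("CREATE INDEX ix_tracks_owner_id ON tracks (owner_id)")

# With the index, the planner switches to an index search.
after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM tracks WHERE owner_id = 7"
).fetchone()[3]

print(before)  # e.g. "SCAN tracks"
print(after)   # e.g. "SEARCH tracks USING INDEX ix_tracks_owner_id (owner_id=?)"
```

The ~35 ms → sub-millisecond numbers from the description come from the prod snapshot, not this toy; the point is only that the plan changes from a scan to an index search.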

I wrote up a notion doc with some DB insights

Services

Discovery Provider

Does it touch a critical flow like Discovery indexing, Creator Node track upload, Creator Node gateway, or Creator Node file system?


  • ✅ Nope

How Has This Been Tested?

I booted an RDS instance from a prod snapshot and ran the DB migrations, then tested the queries for speed.
I ran the DP locally against the snapshot and made several queries for users and tracks to make sure everything still worked.

I also ran the system locally and simulated plays to test the change to REFRESH MATERIALIZED VIEW CONCURRENTLY; it worked fine.

@dmanjunath (Contributor) left a comment


Overall the code looks good; a few minor questions. One more: if we're only inserting ~13 rows per 10-second sweep, is it worth running every 10 seconds? We could run once a minute and more easily eat this cost.

@@ -33,8 +33,7 @@ def get_track_plays(self, db):
 most_recent_play_date = session.query(
     Play.updated_at
 ).order_by(
-    desc(Play.updated_at),
-    desc(Play.id)
+    desc(Play.updated_at)
Contributor

did we have the secondary sort here for a reason?

Contributor Author

I wrote this the first time, I believe, and can't think of a reason it would need a secondary sort; it's just getting the most recent date.
Actually, this query was pretty slow on prod, so a couple weeks ago we put a composite index on play.id and play.updated_at together. But that index is not in our migrations. There is also an index in the migrations on just play.updated_at, which should keep this query fast.

Member

What if two have the same updated_at? Won't this be non-deterministic then?

Contributor

@jowlee does having the secondary sort affect performance? If not, we should just not touch it.

Contributor Author

It didn't seem to, but I believe that's because of the index on both columns that isn't in the migrations. We can just leave it as is.
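To make the tie concern above concrete, here is a small plain-Python illustration (hypothetical rows, not the actual Play model): Python's sort is deterministic here, but in SQL an ORDER BY on updated_at alone leaves the winner among tied rows unspecified, while a secondary key on id pins it down.

```python
# Hypothetical play rows; the first two share the same updated_at.
plays = [
    {"id": 1, "updated_at": "2020-09-29T12:00:00"},
    {"id": 2, "updated_at": "2020-09-29T12:00:00"},
    {"id": 3, "updated_at": "2020-09-29T11:59:00"},
]

# Ordering by updated_at alone: rows 1 and 2 tie, and SQL gives no
# guarantee which comes first. Adding id as a tie-breaker (the removed
# desc(Play.id) term in the diff) makes the choice deterministic.
most_recent = max(plays, key=lambda p: (p["updated_at"], p["id"]))
print(most_recent["id"])  # 2: latest timestamp, highest id breaks the tie
```

As the thread notes, the job only reads the timestamp, so either tied row gives the same most_recent_play_date; the tie-breaker matters only if callers start depending on which row wins.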


# Update the index on the aggregate_plays materialized view to be unique for concurrent updates
connection = op.get_bind()
connection.execute('''
Contributor

have you run this on a db dump?

Contributor Author

Yeah, I ran this on a prod snapshot I booted up.
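The migration snippet above is truncated. A hypothetical Alembic upgrade body in the same shape might look like the following; the index and view names are assumptions, not the repo's actual identifiers.

```python
# Sketch only: assumes Postgres, where REFRESH MATERIALIZED VIEW ... CONCURRENTLY
# requires at least one UNIQUE index on the view.
from alembic import op


def upgrade():
    # Update the index on the aggregate_plays materialized view to be
    # unique for concurrent updates.
    connection = op.get_bind()
    connection.execute('''
        DROP INDEX IF EXISTS play_item_id_idx;
        CREATE UNIQUE INDEX play_item_id_idx
            ON aggregate_plays (play_item_id);
    ''')
```

With the unique index in place, the refresh job can run `REFRESH MATERIALIZED VIEW CONCURRENTLY aggregate_plays;`, which rebuilds the view without taking the exclusive lock that blocks readers.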

@jowlee (Contributor, Author) commented Sep 29, 2020

> Overall the code looks good; a few minor questions. One more: if we're only inserting ~13 rows per 10-second sweep, is it worth running every 10 seconds? We could run once a minute and more easily eat this cost.

Yeah, I also think we should index plays on a larger interval. It was written that way because, on startup, I wanted it to index plays from scratch at a fast pace; at 60-second intervals that would have taken around 6 hours instead of 1. Thoughts on increasing it?

@dmanjunath (Contributor) commented Sep 30, 2020

I think it's probably okay to increase the index plays interval. When we first rolled this out we wanted to minimize impact, so 10 sec made sense then; but with several up-to-date, healthy discprovs to query from now, we can probably increase it. Not to mention most SPs are restoring DP from an S3 snapshot we take daily, so they won't be indexing from scratch either. Thoughts @raymondjacobson?


@dmanjunath (Contributor)

> I think it's probably okay to increase the index plays interval. When we first rolled this out we wanted to minimize impact, so 10 sec made sense then; but with several up-to-date, healthy discprovs to query from now, we can probably increase it. Not to mention most SPs are restoring DP from an S3 snapshot we take daily, so they won't be indexing from scratch either. Thoughts @raymondjacobson?

@raymondjacobson bump about increasing play indexing to 60 sec intervals

@raymondjacobson (Member) commented Oct 2, 2020

> I think it's probably okay to increase the index plays interval. When we first rolled this out we wanted to minimize impact, so 10 sec made sense then; but with several up-to-date, healthy discprovs to query from now, we can probably increase it. Not to mention most SPs are restoring DP from an S3 snapshot we take daily, so they won't be indexing from scratch either. Thoughts @raymondjacobson?
>
> @raymondjacobson bump about increasing play indexing to 60 sec intervals

It's definitely better for us the more frequent this is, but a minute is probably fine. I'm thinking about implications for something like moving the listen-history page over to use discprov entirely: if you listen to a track and have to wait 60 s for it to be persisted, that's annoying. We can mostly cache things on the client to get around it, though.

60s is fine w/ me

@jowlee jowlee merged commit 5ae9faf into master Oct 5, 2020
@jowlee jowlee deleted the jowlee-dp-indexes branch October 5, 2020 19:33