
Hot rank update batching + deadlock avoidance #3175

Merged
1 commit merged into LemmyNet:main on Jun 27, 2023

Conversation

@sunaurus (Collaborator) commented Jun 18, 2023

This PR contains the following changes:

  1. Ensure that comment_aggregates table rows are always updated in the same order, to help avoid deadlocks
  2. Reduce the frequency of hot_rank updates from every 5 minutes to every 15 minutes
    • If this is controversial, then I will happily revert it, but my reasoning was that it gives the hot rank scheduled task much more breathing room in case the number of comments or posts on Lemmy increases exponentially in the near future. At the same time, 15 minutes is still frequent enough to keep sorting feeling quite alive.
  3. Instead of doing hot rank updates in one big UPDATE statement, which can lock the whole table for 10+ seconds, the updates are now done in batches of 5000 rows (see the sketch below this list)
    • This is done with the help of a new hot_rank_updated column: we can use it to ensure that we process each row only once per scheduled task run.
    • I tried a few different batch sizes, including 100, 1000, 5000 and 10,000. On my instance, we have ~170k rows in comment_aggregates and ~28k rows in post_aggregates for the last week. Updating hot ranks for all of them with batches of 5000 takes <30 seconds total. For comparison, with batches of 100, it took about 4 minutes. (Different batch sizes may work better on different hardware.)
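
A minimal SQL sketch of the batching in point 3 (assuming a hot_rank(score, published) helper function; column names follow this PR's first iteration):

```sql
-- Sketch only: update one batch of 5000 rows, marking each row via
-- hot_rank_updated so a scheduled run processes every row exactly once.
UPDATE comment_aggregates a
SET hot_rank = hot_rank(a.score, a.published),
    hot_rank_updated = now()
WHERE a.comment_id IN (
    SELECT comment_id
    FROM comment_aggregates
    WHERE hot_rank_updated < now() - interval '15 minutes'
    ORDER BY comment_id -- deterministic order (point 1) to avoid deadlocks
    LIMIT 5000);
```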

I have been running this code (cherry-picked onto 0.17.4) live on lemm.ee for the past hour (starting from 05:15 on the graph below), and I have not seen a single deadlock since deploying these changes:

[graph: deadlocks stop after the 05:15 deploy]


This should fix #3076.

Thanks to @phiresky for your input about all this here

@sunaurus force-pushed the hot_rank_update_batches branch 2 times, most recently from 4b4b433 to 68eab28 on June 18, 2023 at 11:47
@phiresky (Collaborator) commented Jun 20, 2023

Great! But I think you have an issue with your query performance, because you're filtering by a different column than you're ordering by. For your selects to be performant, you have to filter and order by the same column, and that column must be indexed.

To verify, try running:

  1. `explain analyze select comment_id from comment_aggregates where hot_rank_updated < '2024-01-01' order by comment_id asc limit 10`. This should do a slow scan via the comment_id index, filtering out rows one by one. It will get slower and slower as more comments are already updated, because it always scans from the left end of the ids. Try it with the WHERE threshold set so that only 1% of rows are left to update: it will need to scan 100x as many rows as it returns.
  2. Make sure you have a `create index on comment_aggregates (hot_rank_updated);`. Then run `explain analyze select comment_id from comment_aggregates where hot_rank_updated < '2024-01-01' order by hot_rank_updated asc limit 10`. This should do an index scan on hot_rank_updated and only ever look at 10 rows total. Even better, leave the ORDER BY out completely and just let pg return the first rows it finds however it wants.

I know I said you have to do deterministic ordering to fix deadlocks, but you can still do that after selecting; the rows only have to be ordered by comment_id within each batch (see the sketch below).

(I don't have an instance to verify what I'm saying, just looking at the code.)
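
A hypothetical illustration of that "order after selecting" idea: filter the batch cheaply via the indexed column, then lock the selected rows in deterministic comment_id order:

```sql
-- Sketch: the inner query picks a batch via the hot_rank_updated index;
-- the outer query orders and locks only those rows by comment_id, so
-- concurrent transactions acquire row locks in a consistent order.
SELECT comment_id
FROM comment_aggregates
WHERE comment_id IN (
    SELECT comment_id
    FROM comment_aggregates
    WHERE hot_rank_updated < '2024-01-01'
    LIMIT 1000)
ORDER BY comment_id
FOR UPDATE;
```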

@sunaurus (Collaborator, Author) commented

I pushed a new version.

  • Performance is now improved compared to the original, and no additional db columns are used
  • Batches are now constructed based on the existing published column
  • Each batch is selected and updated in a single DB query
  • Deadlocks are mitigated with SELECT ... FOR UPDATE SKIP LOCKED
  • Batch size reduced to 1000, as bigger batches were not using indexes
  • Added indexes on published for comment_aggregates and community_aggregates (post_aggregates already had it)

The main benefit of a fixed batch size is that this approach will scale linearly with the number of comments, which I believe might be critically important for Lemmy going forward.

Caveat: I unfortunately had to use sql_query for this, due to Diesel not supporting what we need here. I also wrote an alternate version with Diesel, using two separate queries in a transaction; despite that version having roughly the same query plan, it performed about 4-5x worse on average than this sql_query-based version (presumably due to serialization between the two queries). So it seems the trade-off of using sql_query is worth it in this case.
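
The rough shape of that single batched query, as a sketch (reconstructed from the diff fragment quoted below; $1 is the previous batch's last published value, $2 the batch size, and hot_rank(score, published) is assumed to be the existing SQL helper):

```sql
-- Sketch: select one batch by the indexed published column, skip rows
-- another transaction already holds locks on, update their hot ranks,
-- and return the published cursor for the next batch.
UPDATE comment_aggregates a
SET hot_rank = hot_rank(a.score, a.published)
WHERE a.id IN (
    SELECT a.id
    FROM comment_aggregates a
    WHERE a.published > $1
    ORDER BY a.published
    LIMIT $2
    FOR UPDATE SKIP LOCKED)
RETURNING a.published;
```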

Btw, thanks again to @phiresky for additional help!

Review thread on src/scheduled_tasks.rs:

```sql
WHERE a.published > $1
ORDER BY a.published
LIMIT $2
FOR UPDATE SKIP LOCKED)
```
A Member commented:
Nice, had no idea about that one. I wonder if diesel has this available so we can use it on other scheduled jobs.

@sunaurus (Collaborator, Author) replied on Jun 26, 2023:

Diesel has .for_update() and .skip_locked(), but they have at least one significant limitation: they can't be used together with .into_boxed()

A Member replied:
Gotcha. @Nutomic would these be potentially useful in any apub jobs?

A Member replied:

This seems to be useful when using a table as a kind of job queue? Might be useful for #2142 once we implement that.
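
For reference, a generic sketch of that job-queue pattern (hypothetical jobs table, not Lemmy's schema): each worker atomically claims one pending row, and SKIP LOCKED makes concurrent workers pass over rows that are already claimed instead of blocking:

```sql
-- Hypothetical job-queue claim with FOR UPDATE SKIP LOCKED: concurrent
-- workers each lock a different pending row rather than waiting on the
-- same one.
UPDATE jobs
SET state = 'running'
WHERE id = (
    SELECT id
    FROM jobs
    WHERE state = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED)
RETURNING id;
```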

@dessalines (Member) left a comment:

Thanks a ton for this, this is gonna work so much better than what's currently there for bigger instances.

Review thread on src/scheduled_tasks.rs:

```rust
Ok(updated_rows) => previous_batch_result = updated_rows.last().map(|row| row.published),
Err(e) => {
  error!("Failed to update {} hot_ranks: {}", table_name, e);
  break;
```
A Member commented:
Should it really stop processing new batches if any one batch threw an error? Seems unnecessary.

@Nutomic merged commit 211e76d into LemmyNet:main on Jun 27, 2023. 1 check passed.
Successfully merging this pull request may close these issues:

  • Scheduled tasks thread permanently crashes due to database deadlocks (causes "hot" to stop updating)