Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheduled tasks thread permanently crashes due to database deadlocks (causes "hot" to stop updating) #3076

Closed
4 tasks done
sunaurus opened this issue Jun 13, 2023 · 11 comments · Fixed by #3175
Closed
4 tasks done
Labels
bug Something isn't working

Comments

@sunaurus
Copy link
Collaborator

sunaurus commented Jun 13, 2023

  • Did you check to see if this issue already exists?
  • Is this only a single bug? Do not put multiple bugs in one issue.
  • Is this a question or discussion? Don't use this, use https://lemmy.ml/c/lemmy_support .
  • Is this a UI / front end issue? Use the lemmy-ui repo.

Issue Summary

In 0.17.4, scheduled tasks can cause deadlocks like this:

2023-06-13 21:05:00	2023-06-13T18:05:00.447695Z  INFO lemmy_server::scheduled_tasks: Updating hot ranks for last week...
2023-06-13 21:05:12	thread '<unnamed>' panicked at 'update comment_aggregate hot_ranks: DatabaseError(Unknown, "deadlock detected")', src/scheduled_tasks.rs:82:6

There are two key issues here:

  1. The general issue that scheduled tasks thread can't recover from database errors
  2. Whatever specific set of queries is causing the deadlocks

One major result of this issue is that the "hot" tab stops updating, because the hot_rank calculation just stops completely.

Steps to Reproduce

  1. Start two lemmy_server processes UPDATE: also happens with a single lemmy_server process
  2. Wait for scheduled tasks to execute for a few times (sometimes a deadlock won't happen on the first run)
  3. ...
  4. Eventually, there will be a database deadlock

The only way to mitigate this issue for instance admins on 0.17.4 right now is to regularly restart lemmy_server

@sunaurus sunaurus added the bug Something isn't working label Jun 13, 2023
@amassicci2
Copy link

Any interest in incorporating something to provide general consensus (etcd or zookeeper)? Would provide a solution here and help where it might be needed in the future

@sunaurus
Copy link
Collaborator Author

After further investigation, I am unfortunately starting to think that this issue is not caused by multiple concurrent scheduled task executions (at least not exclusively).

I have been running the code from #3077 on https://lemm.ee since last night, with only one node having scheduled tasks enabled, and I've still seen deadlocks crash the scheduled tasks thread several times on the sole node running scheduled tasks.

I will investigate this further.

@sunaurus sunaurus changed the title Scheduled tasks create database deadlocks when running multiple loadbalanced lemmy_server processes Scheduled tasks thread permanently crashes due to database deadlocks Jun 14, 2023
@sunaurus
Copy link
Collaborator Author

sunaurus commented Jun 14, 2023

I just received confirmation that I am not the only one experiencing this, and it's also happening on an instance with a single lemmy_server process.

So far, I have only seen deadlocks in the comment_aggregates table:

Jun 13 20:50:13 lemmee-app-one lemmy_server[31307]: thread '<unnamed>' panicked at 'update comment_aggregate hot_ranks: DatabaseError(Unknown, "deadlock detected")', src/scheduled_tasks.rs:82:6
Jun 13 21:25:11 lemmee-app-one lemmy_server[34090]: thread '<unnamed>' panicked at 'update comment_aggregate hot_ranks: DatabaseError(Unknown, "deadlock detected")', src/scheduled_tasks.rs:82:6
Jun 13 22:20:12 lemmee-app-one lemmy_server[37191]: thread '<unnamed>' panicked at 'update comment_aggregate hot_ranks: DatabaseError(Unknown, "deadlock detected")', src/scheduled_tasks.rs:82:6
Jun 14 05:40:12 lemmee-app-one lemmy_server[38943]: thread '<unnamed>' panicked at 'update comment_aggregate hot_ranks: DatabaseError(Unknown, "deadlock detected")', src/scheduled_tasks.rs:82:6

There is a long gap between the 3rd and 4th deadlock, because the scheduled tasks thread was stopped for that whole period as I was asleep - generally, the deadlock seems to occur within an hour of restarting lemmy_server on lemm.ee.

@sunaurus
Copy link
Collaborator Author

My current theory is that this deadlock is caused by Comment::create here:

pub async fn create(

This function also updates the comment_aggregates table, and it is being invoked constantly even by federated comments

@sunaurus sunaurus changed the title Scheduled tasks thread permanently crashes due to database deadlocks Scheduled tasks thread permanently crashes due to database deadlocks (causes "hot" to stop updating) Jun 14, 2023
@Nutomic
Copy link
Member

Nutomic commented Jun 14, 2023

These tasks are run from src/scheduled_tasks.rs. That file uses expect in a lot of places, meaning any error will cause the thread to crash with given error message. To fix it the entire file should be changed to use Result and ? operator to handle errors gracefully. Would be good if someone can make a PR for this.

@sunaurus
Copy link
Collaborator Author

I'm already creating exactly that PR, will submit for review soon

@RocketDerp

This comment was marked as abuse.

@RocketDerp

This comment was marked as abuse.

@phiresky
Copy link
Collaborator

phiresky commented Jun 16, 2023

Deadlocks in PG aren't really critical at all - on deadlock the code should retry the transactions until they go through.

But they can also be prevented. Deadlocks only happen when multiple transactions update overlapping rows in a different order. So to prevent them, all things that lock multiple rows (updates) just need to be ordered.

Looks like the comment_aggregates table is updated in three tables:

  • the comment_aggregates_score trigger. this only updates a single row so isn't relevant for deadlocks
  • the query in Comment::create as @sunaurus mentioned
  • the scheduled update task

So to fix this deadlock, something like this should work (untested):

diff --git a/crates/db_schema/src/impls/comment.rs b/crates/db_schema/src/impls/comment.rs
index 46045cd1..49ac2407 100644
--- a/crates/db_schema/src/impls/comment.rs
+++ b/crates/db_schema/src/impls/comment.rs
@@ -111,6 +111,7 @@ from (
   join comment c2 on c2.path <@ c.path and c2.path != c.path
   and c.path <@ '{top_parent}'
   group by c.id
+  order by c.id
 ) as c
 where ca.comment_id = c.id"
           );
diff --git a/src/scheduled_tasks.rs b/src/scheduled_tasks.rs
index 0f75fdba..85518dbb 100644
--- a/src/scheduled_tasks.rs
+++ b/src/scheduled_tasks.rs
@@ -72,7 +72,7 @@ pub fn setup(db_url: String, user_agent: String) -> Result<(), LemmyError> {
 /// Update the hot_rank columns for the aggregates tables
 fn update_hot_ranks(conn: &mut PgConnection, last_week_only: bool) {
   let mut post_update = diesel::update(post_aggregates::table).into_boxed();
-  let mut comment_update = diesel::update(comment_aggregates::table).into_boxed();
+  let mut comment_select = comment_aggregates::table.select(comment_aggregates::comment_id).order(comment_aggregates::comment_id.asc()).into_boxed();
   let mut community_update = diesel::update(community_aggregates::table).into_boxed();
 
   // Only update for the last week of content
@@ -81,7 +81,7 @@ fn update_hot_ranks(conn: &mut PgConnection, last_week_only: bool) {
     let last_week = now - diesel::dsl::IntervalDsl::weeks(1);
 
     post_update = post_update.filter(post_aggregates::published.gt(last_week));
-    comment_update = comment_update.filter(comment_aggregates::published.gt(last_week));
+    comment_select = comment_select.filter(comment_aggregates::published.gt(last_week));
     community_update = community_update.filter(community_aggregates::published.gt(last_week));
   } else {
     info!("Updating hot ranks for all history...");
@@ -103,11 +103,12 @@ fn update_hot_ranks(conn: &mut PgConnection, last_week_only: bool) {
     }
   }
 
-  match comment_update
+  match diesel::update(comment_aggregates::table)
     .set(comment_aggregates::hot_rank.eq(hot_rank(
       comment_aggregates::score,
       comment_aggregates::published,
     )))
+    .filter(comment_aggregates::comment_id.eq_any(comment_select))
     .execute(conn)
   {
     Ok(_) => {}

In addition, it would be better for scalability if this update was done in batches (e.g. 100 comments at a time) because otherwise all comments are still locked simultaneously and every other operation on comment_aggregates has to wait. That's unrelated to the deadlocks though.

@RocketDerp

This comment was marked as abuse.

@sunaurus
Copy link
Collaborator Author

sunaurus commented Jun 25, 2023

The deadlocks crashing the scheduled tasks thread was fixed in #3090, which is included in 0.18
The deadlocks themselves should be significantly reduced (in practice, on lemm.ee, completely resolved) in #3175, which has not been merged yet

@Nutomic Nutomic closed this as completed Jun 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants