Adding a scaled sort, to boost smaller communities. #3907

dessalines · 2023-08-23T21:55:11Z

Previously referred to as *best*.
Fixes #3622 (Rework "Hot" sorting to show posts from more varied communities)
Fixes #1026 (The rank of a post in the aggregated feed should be inversely proportional to the size of the community)

Still needs lots of testing with prod data, as well as verifying the scheduled_tasks changes are working correctly.

An example of some ranks:

select
  pa.post_id,
  now() - pa.published as time_diff,
  ca.users_active_month,
  pa.hot_rank,
  pa.scaled_rank
from post_aggregates pa
  inner join community_aggregates ca
  on pa.community_id = ca.community_id
order by pa.scaled_rank desc, pa.published desc
limit 20;

dessalines · 2023-08-23T21:57:48Z

crates/db_schema/src/aggregates/post_aggregates.rs

+    // Diesel can't update based on a join, which is necessary for the scaled_rank
+    // https://github.com/diesel-rs/diesel/issues/1478
+    // Just select the users_active_month manually for now, since its a single post anyway
+    let users_active_month = community_aggregates::table


I can probably move this query down into the update statement, to avoid the round-trip cost.

edit: tried and failed at this.

phiresky · 2023-08-23T22:57:31Z

crates/db_schema/src/aggregates/structs.rs

@@ -100,6 +100,8 @@ pub struct PostAggregates {
  pub community_id: CommunityId,
  pub creator_id: PersonId,
  pub controversy_rank: f64,
+  /// A rank that amplifies smaller communities
+  pub scaled_rank: i32,


I think this value should be a float. hot_rank, for example, goes to zero really quickly (mostly after just ~2 days), and all information about values between 0 and 1, or between 1 and 2, is lost.

For example, right now select scaled_rank(80, '2023-08-20', 10000); and select scaled_rank(1000, '2023-08-20', 10000); both return the exact same rank, even though one has more than 10 times the votes!

Really, hot_rank should also be a float imo, but that's maybe out of scope here.
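To make the truncation concern concrete, here is a minimal sketch. It assumes the hot rank has the form given in the ranking docs linked in this thread (a scale factor times the log of the score, divided by a power of the post's age); the function names, the constants, and the choice of base-10 log are illustrative assumptions, not the actual migration code.

```rust
/// Assumed hot rank shape: SCALE * log10(max(1, score + 3)) / (hours + 2)^GRAVITY.
/// SCALE and GRAVITY are assumptions taken from the linked ranking docs.
fn hot_rank_float(score: i64, hours_since_published: f64) -> f64 {
    const SCALE: f64 = 10_000.0;
    const GRAVITY: f64 = 1.8;
    let votes = (score + 3).max(1) as f64;
    SCALE * votes.log10() / (hours_since_published + 2.0).powf(GRAVITY)
}

/// The same rank stored in an integer column: everything after the
/// decimal point is dropped.
fn hot_rank_int(score: i64, hours_since_published: f64) -> i32 {
    hot_rank_float(score, hours_since_published).floor() as i32
}
```

Under these assumptions, once a post is old enough that its rank falls below 1, `hot_rank_int(80, h)` and `hot_rank_int(1000, h)` both collapse to 0, while `hot_rank_float` still orders the higher-scored post first.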

That's true, these should all probably be made into floats (not sure why I didn't want to use floats originally). Could you open an issue for that? Let's do that separately.

Also, since this is a "news"-type sort, the time decay is supposed to go to zero after ~2 days, after which published descending takes over.

https://join-lemmy.org/docs/contributors/07-ranking-algo.html

https://medium.com/hacking-and-gonzo/how-hacker-news-ranking-algorithm-works-1d9b0cf2c08d

Mh, that's interesting. Then maybe we can just keep hot_rank as is but make the new scaled_rank a float? I think it would make sense to make the new scaled_rank function and column a float from the start instead of migrating later.

There's no code change required for making this a float but keeping hot_rank an int

Maybe I should just change hot_rank to a float then as a part of this, since scaled_rank depends on it.

Sounds good. Note that it will reduce the performance of the scheduled task a fair bit, since afterwards almost no post will have a hot rank of 0 and be filtered out (ideas, @sunaurus?)

That's def a concern... We could either use the published timestamp as a filter, or minimum hot_rank threshold.

I haven't had a chance to look at the implementation yet, so sorry if this is a dumb idea, but can we not just update the rank functions to set the rank to 0 after it decays to some threshold, to guarantee that the majority of posts will still have a 0 value for rank and thus get filtered out during updates?

I mean something like if the rank is below 1.0 (or perhaps even higher!) then just set it to 0

Yes, that should work. But it would need careful consideration what the threshold should be both for hot_rank and scaled_rank. According to dess above, the published sort is supposed to take over after a few days. So maybe it would be better to make the function return zero if published is more than 7 days in the past? The hot_rank function gets published as a parameter anyways.

So that change would just be to replace IF (hours_diff > 0) THEN with IF (hours_diff > 0 AND hours_diff < 24 * 7) THEN

dessalines · 2023-08-23T23:24:22Z

migrations/2023-08-23-182533_scaled_rank/up.sql

+    AS $$
+BEGIN
+    -- Add 2 to avoid divide by zero errors
+    -- Use 0.1 to lessen the initial sharp decline at a hot_rank ~ 300


This 0.1 scale factor is the "I made it up on the spot" part of this PR, and I don't know if it makes any sense.

I tried factors of 1, 0.1, and 0.01, using a graphing calculator and some regular hot ranks, and compared them across various community sizes. Smaller factors soften the initial sharp decline.
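The effect of the factor can be sketched as follows. This assumes the shape mentioned later in the thread, hot_rank / log(2 + X * users_active_month), and assumes a base-10 log; the function name and signature are hypothetical and only meant for comparing factors.

```rust
/// Hypothetical sketch: divide the hot rank by a log of the community's
/// monthly active users, so smaller communities are divided by less.
/// `factor` is the X being discussed (1, 0.1, 0.01, ...).
fn scaled_rank(hot_rank: f64, users_active_month: u32, factor: f64) -> f64 {
    // Adding 2 keeps the logarithm strictly positive, even for a community
    // with 0 active users, so there is no division by zero.
    let divisor = (2.0 + factor * f64::from(users_active_month)).log10();
    hot_rank / divisor
}

/// How much a rank shrinks when a community grows from 0 to `users`
/// active users, for a given factor.
fn initial_decline(users: u32, factor: f64) -> f64 {
    scaled_rank(1.0, 0, factor) / scaled_rank(1.0, users, factor)
}
```

With this form, `initial_decline(10, 0.1)` is smaller than `initial_decline(10, 1.0)`, which is one way to read "lessening the initial sharp decline". Since the divisor is always positive, the scaled value is 0 only when the hot rank itself is 0.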

phiresky · 2023-08-28T14:33:39Z

I can do the int -> float conversion for hot_rank and scaled_rank if you don't have time @dessalines

dessalines · 2023-08-29T14:43:40Z

@phiresky No that's okay, I'll work on that today.

dessalines · 2023-08-29T17:59:20Z

Okay, that's updated now. I also removed the 10k scaling factor from the hot_rank function, as it's pointless now that they're floats.

Die4Ever · 2023-08-29T19:28:54Z

Will this mean that bot communities like Lemmit will dominate my feed? I like Lemmit but I wouldn't want it to be all of the top posts.

dessalines · 2023-08-29T22:50:13Z

This has nothing to do with bots, but communities with few active users.

You can already block bot accounts universally in your user profile settings, as well as block their users or communities if you wish.

Die4Ever · 2023-08-29T22:57:02Z

> This has nothing to do with bots, but communities with few subscribers.
>
> You can already block bot accounts universally in your user profile settings, as well as block their users or communities if you wish.

Yes but I enjoy the Lemmit bot, I just don't want the top 100 posts on my feed to be entirely Lemmit.

I'm glad to know it's based on subscribers and not active users, but I still feel like posts from bot-only communities might be strongly favored by this sorting. Maybe I'll just be toggling the show bots option on and off each day.

dessalines · 2023-08-29T23:10:48Z

Sry I misspoke above, this is absolutely based on active monthly users, not subscribers, as subscribers is really a pointless metric, considering how communities can have a ton of subscribers but little to no activity.

Bot communities have nothing to do with this, this will boost any community with few active users.

This won't do a spread either (IE picking a single post from a lot of communities, as that's too slow to do performance-wise).

Die4Ever · 2023-08-29T23:15:37Z

> Sry I misspoke above, this is absolutely based on active users, not subscribers, as subscribers is really a pointless metric, considering how communities can have a ton of subscribers but little to no activity.
>
> Bot communities have nothing to do with this, this will boost any community with few active users.
>
> This won't do a spread either (IE picking a single post from a lot of communities, as that's too slow to do performance-wise).

Right it'll be something to get used to. Lemmit communities are mostly just a single active user (the bot itself) since they don't allow posts from other users.

phiresky · 2023-08-30T10:29:22Z

My suggestion to both improve the performance and to go in the direction of "published sort should take over after a few days" would be to replace IF (hours_diff > 0) THEN with IF (hours_diff > 0 AND hours_diff < 24 * 7) THEN in the hot_rank function. That way the function will predictably go to 0 after 7 days and published sort will take over. It will also mean the scheduled tasks only have to scan 7 days of content.
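A minimal sketch of that cutoff, written in Rust rather than the PL/pgSQL of the actual function. The decay formula, the gravity constant, and the function name are assumptions for illustration; only the `hours_diff > 0 AND hours_diff < 24 * 7` condition comes from the suggestion above.

```rust
/// Hypothetical float hot rank with a hard 7-day cutoff.
/// Outside the window (0, 168) hours, the rank is exactly 0.
fn hot_rank_with_cutoff(score: i64, hours_diff: f64) -> f64 {
    const GRAVITY: f64 = 1.8; // assumed decay exponent
    const CUTOFF_HOURS: f64 = 24.0 * 7.0;

    if hours_diff > 0.0 && hours_diff < CUTOFF_HOURS {
        // Assumed decay shape: log of the score over a power of the age.
        let votes = (score + 3).max(1) as f64;
        votes.log10() / (hours_diff + 2.0).powf(GRAVITY)
    } else {
        // Older posts get 0, so published-descending ordering takes over
        // and a scheduled task can skip them.
        0.0
    }
}
```

Because the result is exactly 0 past the window, a filter such as `hot_rank != 0` keeps working as a cheap way to restrict updates to recent posts, even though the column is now a float.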

phiresky · 2023-08-30T10:39:50Z

src/scheduled_tasks.rs

+      r#"WITH batch AS (SELECT pa.id
+               FROM post_aggregates pa
+               WHERE pa.published > $1
+               AND (pa.hot_rank != 0 OR pa.hot_rank_active != 0 OR pa.scaled_rank != 0)


For performance, either the idx_post_aggregates_nonzero_hotrank index should be replaced with one conditioned on this same predicate, or the OR pa.scaled_rank != 0 clause should be removed.

Removing the scaled_rank != 0 check should not affect the output, I think: now that these are floats, scaled_rank will only be 0 if hot_rank is also 0.

That sounds right, and you're correct that it does seem pointless to change that index.

"idx_post_aggregates_nonzero_hotrank" btree (published DESC) WHERE hot_rank::double precision <> 0::double precision OR hot_rank_active::double precision <> 0::double precision

dessalines · 2023-08-30T14:48:48Z

> IF (hours_diff > 0 AND hours_diff < 24 * 7)

Good call, I'll do that now.

Nutomic · 2023-08-31T12:15:44Z

Instead of adding another sort option, I would rather use this scaling logic for the existing Active and Hot sorts. That way users will actually notice an improvement immediately, instead of us having to tell everyone to change sorts manually. If this works well it will be a clear improvement, and there should be no need for the current Active/Hot sorts. If it doesn't work, we can fine-tune during the RC process.

dessalines · 2023-08-31T13:15:07Z

I don't feel too comfortable with replacing them, because they are completely different sorts. One will show posts from popular communities, the other will show posts from unpopular communities.

I'm open to making it the new default tho, if it ends up looking good with production data.

EDIT: I'm converting this to a draft, so I can test with some local production data, to make sure things look okay.

dessalines · 2023-09-02T16:39:46Z

Okay, it's ready for review. The only part I'm unsure about is the scale factor X in .../(log(2 + X*users_active_month)). I currently have X=1, and added a comment about this.

I tested this with production data, and the scaled_rank values seem fine. But it's something that we'll really have to see in practice. We can always alter that function later on to get gentler log effects.

phiresky · 2023-09-02T21:50:15Z

src/scheduled_tasks.rs

+           hot_rank_active = hot_rank(pa.score, pa.newest_comment_time_necro),
+           scaled_rank = scaled_rank(pa.score, pa.published, ca.users_active_month)
+         FROM batch, community_aggregates ca
+         WHERE pa.id = batch.id and pa.community_id = ca.community_id RETURNING pa.published;


Just as a note: with this change it might make sense to do the post ordering / batch selection with ORDER BY (community_id, published) so that the join on community_aggregates is less expensive. But it's probably not worth the extra pagination complexity if we assume that the whole communities table and its indexes will be in memory in any case.

phiresky

LGTM. This will require a new index as well if #3872 is merged.

Adding a scaled sort, to boost smaller communities.

c0cad89

dessalines commented Aug 23, 2023

phiresky reviewed Aug 23, 2023

Fixing scheduled task update.

d40ee4b

dessalines commented Aug 23, 2023

dessalines marked this pull request as ready for review August 23, 2023 23:26

dessalines requested a review from Nutomic as a code owner August 23, 2023 23:26

dessalines marked this pull request as draft August 24, 2023 21:34

dessalines added 2 commits August 29, 2023 11:30

Merge branch 'main' into scaled_sort

978a7d3

Converting hot_rank integers to floats.

d03b4f8

dessalines requested a review from phiresky August 29, 2023 18:07

dessalines marked this pull request as ready for review August 29, 2023 18:07

phiresky reviewed Aug 30, 2023

dessalines added 2 commits August 30, 2023 10:51

Merge remote-tracking branch 'origin/main' into scaled_sort

384f430

Altering hot_rank psql function to default to zero after a week.

c964b1c

dessalines marked this pull request as draft August 31, 2023 13:15

dessalines and others added 2 commits August 31, 2023 09:16

Merge branch 'main' into scaled_sort

776347f

Merge remote-tracking branch 'origin/main' into scaled_sort

83fb70f

Setting scaled_rank to zero, where hot_rank is zero.

42e128d

dessalines marked this pull request as ready for review September 2, 2023 16:37

dessalines requested a review from phiresky September 2, 2023 16:37

phiresky reviewed Sep 2, 2023

phiresky approved these changes Sep 2, 2023

Nutomic reviewed Sep 4, 2023

Nutomic approved these changes Sep 4, 2023

dessalines mentioned this pull request Sep 6, 2023

Adding documentation for scaled rank. LemmyNet/lemmy-docs#267

Merged

dessalines added 2 commits September 6, 2023 13:14

Merge branch 'main' into scaled_sort

f163fe5

Adding image_upload table.

4ffa1d3

dessalines enabled auto-merge (squash) September 6, 2023 17:21

dessalines merged commit 9785b20 into main Sep 6, 2023
2 checks passed

This was referenced Sep 28, 2023

Balance Post Scores Based on Instance Monthly Active Users #3642

Closed

Weighted Community Subscription #3518

Closed

mormaer mentioned this pull request Sep 30, 2023

[0.19.0] - Scaled sort option mlemgroup/mlem#673

Closed

SrivatsanSenthilkumar mentioned this pull request Sep 30, 2023

Breaking changes to API in lemmy version 0.19 aeharding/voyager#745

Closed

Nutomic mentioned this pull request Oct 16, 2023

Prevent communities and server from flooding the "new" feed #3954

Closed

4 tasks

This was referenced Dec 11, 2023

Adding scaled sort to UI. Fixes #2156 LemmyNet/lemmy-ui#2169

Merged

Rename "Scaled" sort to "Balanced" Or "Mixed" or something more descriptive LemmyNet/lemmy-ui#2280

Open

jmcharter mentioned this pull request Jan 17, 2024

Feature Request: Add 'scaled' option to sort options sheodox/alexandrite#92

Open

Nutomic mentioned this pull request Feb 7, 2024

Counteract Recency Bias on Lemmy Sorting Algorithm #4432

Closed

4 tasks
