Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding a scaled sort, to boost smaller communities. #3907

Merged
merged 11 commits into from Sep 6, 2023
Merged

Conversation

dessalines
Copy link
Member

@dessalines dessalines commented Aug 23, 2023

Still needs lots of testing with prod data, as well as verifying the scheduled_tasks changes are working correctly.

An example of some ranks:

select 
  pa.post_id, 
  now() - pa.published as time_diff, 
  ca.users_active_month, pa.hot_rank, 
  pa.scaled_rank 
from post_aggregates pa 
  inner join community_aggregates ca 
  on pa.community_id = ca.community_id 
order by pa.scaled_rank desc, pa.published desc 
limit 20;

Screenshot_2023-08-23-19-33-00-798_com termux

- Previously referred to as *best* .
- Fixes #3622
// Diesel can't update based on a join, which is necessary for the scaled_rank
// https://github.com/diesel-rs/diesel/issues/1478
// Just select the users_active_month manually for now, since its a single post anyway
let users_active_month = community_aggregates::table
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can probably move this query down into the update statement, to avoid the round-trip cost.

edit: tried and failed at this.

@@ -100,6 +100,8 @@ pub struct PostAggregates {
pub community_id: CommunityId,
pub creator_id: PersonId,
pub controversy_rank: f64,
/// A rank that amplifies smaller communities
pub scaled_rank: i32,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this value should be a float? Hot_rank for example really quickly goes to zero (after just ~ 2 days mostly) and all information of values between 0 and 1 and 1 and 2 is lost.

For example, right now select scaled_rank(80, '2023-08-20', 10000); and select scaled_rank(1000, '2023-08-20', 10000); both have the exact same rank even though one has > 10 times the votes!

Really hot rank should also be a float imo but that's maybe out of scope here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, these all should probably be made into floats (not sure why I didn't want to use floats originally), could you open an issue for that? Lets do that as a separate issue.

Also being a "news" type sort, the time decay is supposed to go to zero after ~ 2 days, after which published descending takes over.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mh, that's interesting. Then maybe we can just keep hot_rank as is but make the new scaled_rank a float? I think it would make sense to make the new scaled_rank function and column float from the start instead of migration later.

There's no code change required for making this a float but keeping hot_rank an int

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I should just change hot_rank to a float then as a part of this, since scaled_rank depends on it.

Copy link
Collaborator

@phiresky phiresky Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Note it will reduce the performance of the scheduled task a fair bit since afterwards almost no post will have a hot rank of 0 and be filtered out (ideas @sunaurus? )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's def a concern... We could either use the published timestamp as a filter, or minimum hot_rank threshold.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't had a chance to look at the implementation yet, so sorry if this is a dumb idea, but can we not just update the rank functions to set the rank 0 after it decays to some threshold, to guarantee that the majority of posts will still have a 0 value for rank and thus get filtered out during updates?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean something like if the rank is below 1.0 (or perhaps even higher!) then just set it to 0

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that should work. But it would need careful consideration what the threshold should be both for hot_rank and scaled_rank. According to dess above, the published sort is supposed to take over after a few days. So maybe it would be better to make the function return zero if published is more than 7 days in the past? The hot_rank function gets published as a parameter anyways.

So that change would just be to replace IF (hours_diff > 0) THEN with IF (hours_diff > 0 AND hours_diff < 24 * 7) THEN

AS $$
BEGIN
-- Add 2 to avoid divide by zero errors
-- Use 0.1 to lessen the initial sharp decline at a hot_rank ~ 300
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This 0.1 scale factor is the "I made it up on the spot" part of this PR, and I don't know if it makes any sense.

I tried factors of 1, 0.1, 0.01, using a graphing calculator and some regular hot ranks, and compared them with various community sizes. Using smaller numbers lessens the initial sharp decline sensitivity.

@dessalines dessalines marked this pull request as ready for review August 23, 2023 23:26
@dessalines dessalines marked this pull request as draft August 24, 2023 21:34
@phiresky
Copy link
Collaborator

I can do the int-> float convert for hot_rank and scaled_rank if you don't have time @dessalines

@dessalines
Copy link
Member Author

@phiresky No that's okay, I'll work on that today.

@dessalines
Copy link
Member Author

Okay that's updated now. I also removed the 10k scaling factor from the hot_rank function, as its pointless now that they're floats.

@dessalines dessalines marked this pull request as ready for review August 29, 2023 18:07
@Die4Ever
Copy link

Die4Ever commented Aug 29, 2023

Will this mean that bot communities like Lemmit will dominate my feed? I like Lemmit but I wouldn't want it to be all of the top posts.

@dessalines
Copy link
Member Author

dessalines commented Aug 29, 2023

This has nothing to do with bots, but communities with few active users.

You can already block bot accounts universally in your user profile settings, as well as block their users or communities if you wish.

@Die4Ever
Copy link

Die4Ever commented Aug 29, 2023

This has nothing to do with bots, but communities with few subscribers.

You can already block bot accounts universally in your user profile settings, as well as block their users or communities if you wish.

Yes but I enjoy the Lemmit bot, I just don't want the top 100 posts on my feed to be entirely Lemmit.

I'm glad to know it's based on subscribers and not active users, but I still feel like posts from bot-only communities might be strongly favored by this sorting. Maybe I'll just be toggling the show bots option on and off each day.

@dessalines
Copy link
Member Author

dessalines commented Aug 29, 2023

Sry I misspoke above, this is absolutely based on active monthly users, not subscribers, as subscribers is really a pointless metric, considering how communities can have a ton of subscribers but little to no activity.

Bot communities have nothing to do with this, this will boost any community with few active users.

This won't do a spread either (IE picking a single post from a lot of communities, as that's too slow to do performance-wise).

@Die4Ever
Copy link

Sry I misspoke above, this is absolutely based on active users, not subscribers, as subscribers is really a pointless metric, considering how communities can have a ton of subscribers but little to no activity.

Bot communities have nothing to do with this, this will boost any community with few active users.

This won't do a spread either (IE picking a single post from a lot of communities, as that's too slow to do performance-wise).

Right it'll be something to get used to. Lemmit communities are mostly just a single active user (the bot itself) since they don't allow posts from other users.

@phiresky
Copy link
Collaborator

My suggestion to both improve the performance and to go in the direction of "published sort should take over after a few days" would be to replace IF (hours_diff > 0) THEN with IF (hours_diff > 0 AND hours_diff < 24 * 7) THEN in the hot_rank function. That way the function will predictably go to 0 after 7 days and published sort will take over. It will also mean the scheduled tasks only have to scan 7 days of content.

r#"WITH batch AS (SELECT pa.id
FROM post_aggregates pa
WHERE pa.published > $1
AND (pa.hot_rank != 0 OR pa.hot_rank_active != 0 OR pa.scaled_rank != 0)
Copy link
Collaborator

@phiresky phiresky Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for performance, either the index on idx_post_aggregates_nonzero_hotrank should be replaced and conditioned on this condition, or the OR pa.scaled_rank != 0 be removed

removing the scaled_rank != 0 check should not impact the output I think because scaled_rank with it being floats scaled_rank will only be 0 if hot_rank is also 0

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds right, and you're correct that it does seem pointless to change that index.

"idx_post_aggregates_nonzero_hotrank" btree (published DESC) WHERE hot_rank::double precision <> 0::double precision OR hot_rank_active::double precision <> 0::double precision

@dessalines
Copy link
Member Author

IF (hours_diff > 0 AND hours_diff < 24 * 7)

Good call, I'll do that now.

@Nutomic
Copy link
Member

Nutomic commented Aug 31, 2023

Instead of adding another sort option I would rather use this scaling logic for the existing Active and Hot sorts. That way users will actually notice an improvement immediately, instead having to tell everyone to change sorts manually. If this works well it will be a clear improvement and there should be no need for the current Active/Hot sorts. If it doesnt work, we can finetune during rc process.

@dessalines
Copy link
Member Author

dessalines commented Aug 31, 2023

I don't feel too comfortable with replacing them, because they are completely different sorts. One will show posts from popular communities, the other will show sorts from unpopular communities.

I'm open to making it the new default tho, if it ends up looking good with production data.

EDIT: I'm converting this to a draft, so I can test with some local production data, to make sure things look okay.

@dessalines dessalines marked this pull request as draft August 31, 2023 13:15
@dessalines dessalines marked this pull request as ready for review September 2, 2023 16:37
@dessalines
Copy link
Member Author

dessalines commented Sep 2, 2023

Okay its ready for review. The only part I'm unsure about, is a scale factor added to .../(log(2 + X*users_active_month)) . I currently have X=1, and added a comment about this.

I tested this with production data, and the scaled_rank values seem fine. But its something that we'll really have to see in practice. We can always alter that function later on to suit gentler log effects.

hot_rank_active = hot_rank(pa.score, pa.newest_comment_time_necro),
scaled_rank = scaled_rank(pa.score, pa.published, ca.users_active_month)
FROM batch, community_aggregates ca
WHERE pa.id = batch.id and pa.community_id = ca.community_id RETURNING pa.published;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as a note: with this change it might make sense to do the post ordering / batch selection with ORDER BY (community_id, published) so that the join on community_aggregates is less expensive. But it's probably not a worth the extra pagination complexity if we assume that the whole communities table+indexes will be in memory in any case.

Copy link
Collaborator

@phiresky phiresky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. This will require a new index as well if #3872 is merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants