Reduce size of received_activity table #4110

Closed
4 tasks done
Nutomic opened this issue Oct 26, 2023 · 2 comments

Comments

@Nutomic
Member

Nutomic commented Oct 26, 2023

Requirements

  • Is this a feature request? For questions or discussions use https://lemmy.ml/c/lemmy_support
  • Did you check to see if this issue already exists?
  • Is this only a feature request? Do not put multiple feature requests in one issue.
  • Is this a backend issue? Use the lemmy-ui repo for UI / frontend issues.

Is your proposal related to a problem?

The received_activity table on lemmy.ml currently takes up 11 GB. Its purpose is to check, whenever a new activity is received, whether we have received the same activity before. If so, the activity is rejected. If it's a new activity, it is inserted into the table and processed. Rows are removed after three months. This is really basic functionality and shouldn't take so much space.
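To make the check-then-insert flow concrete, here is a minimal in-memory sketch. It is not Lemmy's implementation: the class and method names are hypothetical, a Python dict stands in for the table, and 90 days is used as an approximation of the three-month interval.

```python
from datetime import datetime, timedelta


class ReceivedActivityLog:
    """In-memory stand-in for the received_activity table (illustrative only)."""

    def __init__(self, retention):
        # retention is a timedelta; the issue mentions roughly three months.
        self.retention = retention
        # Maps ap_id -> time the activity was first received.
        self.seen = {}

    def should_process(self, ap_id, now):
        """Record ap_id and return True if it is new; return False for a duplicate."""
        if ap_id in self.seen:
            return False
        self.seen[ap_id] = now
        return True

    def prune(self, now):
        """Drop entries older than the retention window; return how many were removed."""
        cutoff = now - self.retention
        expired = [ap_id for ap_id, ts in self.seen.items() if ts < cutoff]
        for ap_id in expired:
            del self.seen[ap_id]
        return len(expired)


log = ReceivedActivityLog(retention=timedelta(days=90))
first_seen = datetime(2023, 10, 26)
log.should_process("https://example.com/activities/1", first_seen)  # new: process it
log.should_process("https://example.com/activities/1", first_seen)  # duplicate: reject
```

The storage cost scales with the number of retained rows, so the retention window has a direct effect on table size.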

The table is currently defined like this:

  Column   |            Type             | Collation | Nullable |                    Default
-----------+-----------------------------+-----------+----------+-----------------------------------------------
 id        | bigint                      |           | not null | nextval('received_activity_id_seq'::regclass)
 ap_id     | text                        |           | not null |
 published | timestamp without time zone |           | not null | now()

Here are a few ideas we can use to save space:

  • Get rid of the id column; it's unnecessary.
  • Change published from a full datetime with microsecond precision to only a date or a unix timestamp (integer).
  • ap_id never needs to be read, only checked to see whether a given value exists. This is an ideal use case for bloom filters, which are supported in Postgres.
  • Further reduce the removal interval. If an activity is delivered multiple times, it is probably within minutes or hours of the first send, not months later. Additionally, it's probably not a big deal if the same activity is received twice, but this hasn't been tested.
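As an illustration of the bloom filter idea in the list above, here is a small standalone sketch. It shows the data structure in general, not the Postgres bloom index; the class name, the sizing parameters, and the use of SHA-256 with double hashing are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Fixed-size set membership sketch: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive num_hashes bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd, so the stride is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # True means "possibly seen before"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))


seen = BloomFilter(num_bits=1024, num_hashes=4)
seen.add("https://example.com/activities/1")
seen.might_contain("https://example.com/activities/1")  # True
```

Two trade-offs matter for this use case. A false positive would cause a genuinely new activity to be rejected as a duplicate, and a plain bloom filter cannot delete individual entries, so a cleanup interval would have to be handled by rotating or rebuilding filters.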

cc @dessalines @phiresky

@dessalines
Member

The simplest is probably the best here, which would be to reduce the removal interval to maybe > 3 days?

Other than that, @dullbananas is doing some work to remove some of the pointless integer primary keys, but I really doubt that's taking up that much. The number of pointless rows is the issue.

@phiresky
Collaborator

phiresky commented Oct 27, 2023

Receiving double events "normally" shouldn't happen at all anymore in 0.19 with the persistent queue (normal server restarts will not resend any activities, just crashes), and if it does it will only affect the most recent ~100 events max.

For non-lemmy AP instances, idk, but it's probably still in the range of minutes / hours.

If there's a bug or, more likely, someone manually modifies / deletes the federation_queue_state table, it will resend all activities, however long ago they happened. I see this as somewhat likely to happen but idk if we care.

The linked bloom PG index isn't really relevant for space saving since it's just an index on top of a table (so the data still needs to be in the table).

Nutomic added a commit that referenced this issue Nov 3, 2023
By storing only a partial hash of ap_id instead of the full url,
as well as dropping id column, the size of each row is reduced by
half. Also reduce cleanup interval from 3 months to 1 month.
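
A rough sketch of what "storing only a partial hash of ap_id" could look like. The commit message does not say which hash function or how many bits are kept, so SHA-256, the 8-byte truncation, and the function name ap_id_hash are assumptions chosen for illustration (8 bytes fits a signed 64-bit column such as bigint).

```python
import hashlib


def ap_id_hash(ap_id):
    """Return the first 8 bytes of SHA-256(ap_id) as a signed 64-bit integer.

    Assumption: hash function and truncation length are illustrative only.
    """
    digest = hashlib.sha256(ap_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


key = ap_id_hash("https://example.com/activities/1")
```

A fixed-width key replaces a variable-length URL, which is where the per-row saving comes from. The cost is that two different URLs can map to the same truncated hash, in which case a new activity would be treated as a duplicate; with 64 bits that is very unlikely but not impossible.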
dullbananas pushed a commit to dullbananas/lemmy that referenced this issue Nov 7, 2023
* Also order reports by oldest first (ref LemmyNet#4123) (LemmyNet#4129)

* Support signed fetch for federation (fixes LemmyNet#868) (LemmyNet#4125)

* Support signed fetch for federation (fixes LemmyNet#868)

* taplo

* add federation queue state to get_federated_instances api (LemmyNet#4104)

* add federation queue state to get_federated_instances api

* feature gate

* move retry sleep function

* move stuff around

* Add UI setting for collapsing bot comments. Fixes LemmyNet#3838 (LemmyNet#4098)

* Add UI setting for collapsing bot comments. Fixes LemmyNet#3838

* Fixing clippy check.

* Only keep sent and received activities for 7 days (fixes LemmyNet#4113, fixes LemmyNet#4110) (LemmyNet#4131)

* Only check auth secure on release mode. (LemmyNet#4127)

* Only check auth secure on release mode.

* Fixing wrong js-client.

* Adding is_debug_mode var.

* Fixing the desktop image on the README. (LemmyNet#4135)

* Delete dupes and add possibly missing unique constraint on person_aggregates.

* Fixing clippy lints.

---------

Co-authored-by: Nutomic <me@nutomic.com>
Co-authored-by: phiresky <phireskyde+git@gmail.com>
dessalines added a commit that referenced this issue Nov 13, 2023
* post_saved

* fmt

* remove unique and not null

* put person_id first in primary key and remove index

* use post_saved.find

* change captcha_answer

* remove removal of not null

* comment_aggregates

* comment_like

* comment_saved

* aggregates

* remove "\"

* deduplicate site_aggregates

* person_post_aggregates

* community_moderator

* community_block

* community_person_ban

* custom_emoji_keyword

* federation allow/block list

* federation_queue_state

* instance_block

* local_site_rate_limit, local_user_language, login_token

* person_ban, person_block, person_follower, post_like, post_read, received_activity

* community_follower, community_language, site_language

* fmt

* image_upload

* remove unused newtypes

* remove more indexes

* use .find

* merge

* fix site_aggregates_site function

* fmt

* Primary keys dess (#17)

* Also order reports by oldest first (ref #4123) (#4129)

* Support signed fetch for federation (fixes #868) (#4125)

* Support signed fetch for federation (fixes #868)

* taplo

* add federation queue state to get_federated_instances api (#4104)

* add federation queue state to get_federated_instances api

* feature gate

* move retry sleep function

* move stuff around

* Add UI setting for collapsing bot comments. Fixes #3838 (#4098)

* Add UI setting for collapsing bot comments. Fixes #3838

* Fixing clippy check.

* Only keep sent and received activities for 7 days (fixes #4113, fixes #4110) (#4131)

* Only check auth secure on release mode. (#4127)

* Only check auth secure on release mode.

* Fixing wrong js-client.

* Adding is_debug_mode var.

* Fixing the desktop image on the README. (#4135)

* Delete dupes and add possibly missing unique constraint on person_aggregates.

* Fixing clippy lints.

---------

Co-authored-by: Nutomic <me@nutomic.com>
Co-authored-by: phiresky <phireskyde+git@gmail.com>

* fmt

* Update community_block.rs

* Update instance_block.rs

* Update person_block.rs

* Update person_block.rs

---------

Co-authored-by: Dessalines <dessalines@users.noreply.github.com>
Co-authored-by: Nutomic <me@nutomic.com>
Co-authored-by: phiresky <phireskyde+git@gmail.com>