Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Add truncation task #27

Open
unclecheese opened this issue Mar 29, 2020 · 3 comments
Open

RFC: Add truncation task #27

unclecheese opened this issue Mar 29, 2020 · 3 comments

Comments

@unclecheese
Copy link

unclecheese commented Mar 29, 2020

Summary

As we scale this module out, one of the hurdles we're bound to encounter is the unbound expansion of the database it implies. Because the two snapshot tables every edit in the ownership chain, it at least doubles the size of the _versions tables. Unlike _versions, however, the tables are monolithic, so there are no writes to ancestral tables. All in all, "finger in the air" analysis, snapshots have at least an equal impact on database size as versions, and we're effectively doubling the problem.

Example scenario

In the typical set up we base our tests on:

BlockPage:
  Block:
     BlockGallery:
        BlockImage

Versions only paradigm

  • Save/Publish page: New rows in BlockPage_versions, Page_versions, SiteTree_versions (3)
  • Save/publish block: New rows in Block_versions, Element_versions (2)
  • Save/publish BlockGallery: New row in BlockGallery_versions (1)
  • Save/publish BlockImage: New row in BlockImage_versions (1)

Roughly 7 new rows over this hypothetical editing timeline.

Snapshot paradigm

  • Save/Publish page:
    • New rows in BlockPage_versions, Page_versions, SiteTree_versions (3)
    • New row in Snapshot (1)
    • New row in SnapshotItem (1)
  • Save/publish block:
    • New rows in Block_versions, Element_versions (2)
    • New row in Snapshot (1)
    • New rows in SnapshotItem (Block, BlockPage) (2)
  • Save/publish BlockGallery:
    • New row in BlockGallery_versions (1)
    • New row in Snapshot (1)
    • New rows in SnapshotItem (BlockGallery, Block, BlockPage) (3)
  • Save/publish BlockImage:
    • New row in BlockImage_versions (1)
    • New row in Snapshot (1)
    • New rows in SnapshotItem (BlockImage, BlockGallery, Block, BlockPage) (4)

Roughly 21 new rows over the same hypothetical editing timeline.

Impact

Database size, when left unchecked, can amount to a critical issue culminating in a loss of service for a client. While mitigations are marginally cheap to implement (e.g. more disk space), it requires a level of management and proactivity for the service provider that in theory never ends. Further, large databases have significant impacts on developer experience, as they are less portable, and in some cases, impossible to export.

It is unlikely that there are any significant runtime performance impacts to having a large number of snapshots, as all the queries are against indexed columns, but it's not impossible to imagine that it could become a problem given enough volume.

Possible solution

Ideally, we find a safe way to truncate the Snapshots tables in a way that:

  • Does not break core Versioned functionality
  • Does not obfuscate or hinder important information provided by snapshots.

Earlier investigations into truncating _versions tables proved fruitless as the implicit dependency graph across multiple unrelated tables was impossible to predict at the database level. Fortunately, with the monolithic table approach we use in Snapshots, it seems possible that we could truncate history, albeit with a less-performant executable task rather than a simple query.

Suggested implementation

  • New task SnapshotTruncationTask
  • Config setting for truncate_irrelevant_snapshots_after (name TBD)
    • "irrelevant" means the snapshot whose unpublished content at the time is now fully published
    • "after" means the number of days old it must be
  • Task would be fairly slow, so should run nightly

Pseudo-code:

foreach ($oldSnapshot as $snapshot) {
  $items = $snapshot->Items();
  $supersededVersions = 0;
  foreach ($items as $item) {
    if ($item->Version < Versioned::get_version_number_by_stage($class, Versioned::LIVE, $id)) {. 
      $supersededVersions++;
    }
    if ($supersededVersions === $snapshot->Items()->count()) $snapshot->delete();
}
@unclecheese
Copy link
Author

Another aspect of this is how it aligns with the long-term goals of Activity Feed. The designs for activity feed use snapshots in a very targeted way, which is to show the latest unpublished changes. Previously unpublished snapshots are a secondary feature. With a truncation task active, the power-user feature of old activity would only be available on a small set of versions, and that's probably fine. If anyone complains, they can just crank up the config setting.

@chillu
Copy link
Member

chillu commented Mar 30, 2020

Task would be fairly slow, so should run nightly

Can we express this in SQL? Even a prefiltered list of "old snapshot" (by date) might accumulate a few entries where drafts are left unpublished for a long time

FYI, in terms of database storage, it's $0.276 per GB-month for a MySQL RDS in Sydney region. Plus $0.095 per GiB-month for additional backup storage. So ... nothing. The impact of additional required write capacity is probably higher, since it might push you in the next database tier, which can often add ~50% to the cost of the database component (dozens to hundreds of dollars). But unless you're doing something funky around batch publishing, the writes should be sporadic peaks only based on author activity.

@Mossman1215
Copy link

storage isn't expensive but the outage when the db locks up due to lack of disk space is.
generally if disk space is low out of hours it can take 45 mins to get a new RDS server to replace the one with less disk space. if we can reduce the growth rate of the database that would be preferred to out of hours callouts from sites generating lots of history.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants