feat(swing-store): budget-limited deletion of snapshot and transcripts

Both `snapStore.deleteVatSnapshots()` and `transcriptStore.deleteVatTranscripts()` now take a numeric `budget=` argument, which will limit the number of snapshots or transcript spans deleted in each call. Both return a `{ done, cleanups }` record so the caller knows when to stop calling. This enables the slow deletion of large vats (lots of transcript spans or snapshots), a small number of items at a time. Recommended budget is 5, which (given SwingSet's `snapInterval=200` default) will cause the deletion of 1000 rows from the `transcriptItems` table each call, which shouldn't take more than 100ms.

Without this, the kernel's attempt to slowly delete a terminated vat would succeed in slowly draining the kvStore, but would trigger a gigantic SQL transaction at the end, as it deleted every transcript item in the vat's history. The worst-case example I found would be the mainnet chain's v43-walletFactory, which (as of apr-2024) has 8.2M transcript items in 40k spans. A fast machine takes two seconds just to count all the items, and deletion took 22 *minutes*, with a `swingstore.wal` file that peaked at 27 GiB. This would cause an enormous chain stall at some surprising point in time weeks or months after the vat was first terminated.

In addition, both the transcript spans and the snapshot records are shadowed into IAVL (via `export-data`) for integrity, and deleting 40k+40k=80k IAVL records in a single block might cause some significant churn too.

The kernel should call `transcriptStore.stopUsingTranscript()` and `snapStore.stopUsingLastSnapshot()` as soon as the vat is terminated, to make exports smaller right away (by omitting all transcript/snapshot artifacts for the given vat, even before those DB rows or their export-data records have been deleted).

New swing-store documentation was added.

refs #8928
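As a rough check of the numbers in this message, the sketch below recomputes the per-call row count and the number of calls needed to drain the worst-case vat. It assumes every span holds exactly `snapInterval` transcript items, which is only approximately true; the variable names are illustrative and not part of swing-store.

```javascript
// Back-of-envelope arithmetic for budget-limited transcript deletion.
// Assumption: each span contains exactly `snapInterval` transcript items.
const snapInterval = 200; // SwingSet default cited in the commit message
const budget = 5; // recommended number of spans deleted per call

// Rows removed from the transcriptItems table by one call.
const rowsPerCall = budget * snapInterval;

// Calls needed to drain a vat with 40k spans (the v43-walletFactory figure).
const spans = 40000;
const callsToDrain = Math.ceil(spans / budget);

console.log({ rowsPerCall, callsToDrain });
```

Under these assumptions each call touches 1000 rows and the worst-case vat needs 8000 calls, so the work is spread over many small transactions instead of one very large one.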
Agoric · Jun 14, 2024 · 85c31ab · 85c31ab
1 parent 796a2d3
Showing 9 changed files with 1,034 additions and 17 deletions.
diff --git a/packages/swing-store/docs/bundlestore.md b/packages/swing-store/docs/bundlestore.md
@@ -0,0 +1,30 @@
+# BundleStore
+
+The `kernelStorage.bundleStore` sub-store manages code bundles. These can be used to hold vat-worker supervisor code (eg the `@endo/lockdown` bundle, or the `@agoric/swingset-xsnap-supervisor` package, which incorporates liveslots), or the initial vat code bundles (both kernel-defined bundles like vat-comms or vat-timer, and application-defined bundles like vat-zoe or the ZCF code). They can also hold bundles that will be loaded by userspace vat code later, like contract bundles.
+
+Each bundle is defined by a secure BundleID, which contains a version integer and a hash, with a format like `b0-123abc456def` or `b1-789ghi012` (but longer). This contains enough information to securely define the behavior of the code inside the bundle, and to identify the tools needed to load/evaluate it.
+
+The bundleStore provides a simple add/get/remove API to the kernel. The kernel adds its own bundles during initialization, and provides the host application with an API to load additional ones later. The kernel code that creates new vats will read bundles from the bundleStore when necessary, as vats are created. Userspace can get access to "BundleCap" objects that represent bundles, to keep the large bundle blobs out of RAM as much as possible.
+
+## Data Model
+
+Bundles are actually JavaScript objects: records of at least `{ moduleFormat }`, plus some format-specific fields like `endoZipBase64` and `endoZipBase64Sha512`. They are created by the `@endo/bundle-source` package. Many are consumed by `@endo/import-bundle`, but the simpler bundles can be loaded with some simple string manipulation and a call to `eval()` (which is how supervisor bundles are injected into new vat workers, before `@endo/import-bundle` is available).
+
+The bundleStore database treats each bundle as a BundleID and a blob of contents. The SQLite table is just `(bundleID, bundle)`. The bundleStore knows about each `moduleFormat` and how to extract the meaningful data and compress it into a blob, and how to produce the Bundle object during retrieval.
+
+The bundleStore also knows about the BundleID computation rules, and the import process can verify that the contents of each alleged Bundle matches the claimed BundleID, to prevent corruption during export+import. Note that the normal `addBundle()` API does not verify the contents, and relies upon the kernel to perform validation.
+
+The kernel is expected to keep track of which bundles are needed and when (with reference counts), and to not delete a bundle unless it is really unneeded. Currently, this means all bundles are retained forever.
+
+Unlike the `snapStore`, there is no notion of pruning bundles: either the bundle is present (with all its data), or there is no record of the BundleID at all.
+
+## Export Model
+
+Each bundle gets a single export-data entry, whose name is `bundle.${bundleID}`, and whose value is just `${bundleID}`. Each bundle also gets a single export artifact, whose name is `bundle.${bundleID}`, and whose contents are the compressed BLOB from the database (from which a Bundle record can be reconstructed).
+
+## Slow Deletion
+
+Since bundles are not owned by vats, there is nothing to delete when a vat is terminated. So unlike `transcriptStore` and `snapStore`, there is no concept of "slow deletion", and no APIs to support it.
+
+When a bundle is deleted by `bundleStore.deleteBundle()`, its export-data item is deleted immediately, and subsequent exports will omit the corresponding artifact.
+
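A minimal sketch of the naming rules from the bundlestore.md "Export Model" section above. The helper names `parseBundleID` and `bundleExportNames` are hypothetical and do not exist in swing-store, and the regular expression assumes the hash portion of a BundleID is lowercase hex.

```javascript
// Hypothetical helper: split a BundleID into its version integer and hash.
// Assumption: the hash portion is lowercase hexadecimal.
function parseBundleID(bundleID) {
  const match = /^b(\d+)-([0-9a-f]+)$/.exec(bundleID);
  if (!match) {
    throw new Error(`malformed BundleID: ${bundleID}`);
  }
  return { version: Number(match[1]), hash: match[2] };
}

// Hypothetical helper: derive the export-data entry and the artifact name
// for one bundle, following the `bundle.${bundleID}` convention.
function bundleExportNames(bundleID) {
  parseBundleID(bundleID); // reject malformed IDs before building names
  const name = `bundle.${bundleID}`;
  return {
    exportData: [name, bundleID], // [name, value] pair
    artifactName: name, // the artifact shares the export-data name
  };
}
```

For an ID such as `b1-123abc456def`, the export-data name and the artifact name are both `bundle.b1-123abc456def`, and the export-data value is the BundleID itself.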
diff --git a/packages/swing-store/docs/kvstore.md b/packages/swing-store/docs/kvstore.md
@@ -0,0 +1,33 @@
+# KVStore
+
+The `kernelStorage.kvStore` sub-store manages a table of arbitrary key-value (string-to-string) pairs. It provides the usual get/set/has/delete APIs, plus a `getNextKey` call to support lexicographic iteration.
+
+There are three separate sections of the namespace. The normal one is the "consensus" section.  Each value written here will be given an export-data row, and incorporated into the "crankhash" (described below).
+
+The second is "local", and includes any key which is prefixed with `local.`. These keys are *not* given export-data rows, nor are they included in the crankhash.
+
+The third is "host", and includes any key which is prefixed with `host.`. This is not available to `kernelStorage.kvStore` at all: it is only accessed by methods on `hostStorage.kvStore` (the `kernelStorage` methods will throw an error if given a key like `host.foo`, and the `hostStorage` methods will throw *unless* given a key like `host.foo`). These are also excluded from export-data and the crankhash. Host keys are reserved for the host application, and are generally used to keep track of things like which block has been executed, to manage consistency between a separate host database (eg IAVL) and the swingstore. The host can record "I told the kernel to execute the contents of block 56" into `hostStorage.kvStore`, and then do `hostStorage.commit()`, and then it can record "I processed the rest of block 56" into its own DB, and then commit its own DB. If, upon startup, it observes a discrepancy between the `hostStorage.kvStore` record and its own DB, it knows it got interrupted between these two commit points, which can trigger recovery code.
+
+## CrankHash and ActivityHash
+
+Swingset kernels are frequently run in a consensus mode, where multiple instances of the kernel (on different machines) are expected to execute the same deliveries in lock-step. In this mode, every kernel is expected to do exactly the same computation, and any divergence indicates a failure (or attempt at malice). We want to detect such variations quickly, so the diverging/failing member can "fall out of consensus" promptly.
+
+The swingstore hashes all changes to the "consensus" portion of the kvStore into the "crank hash". This hash covers every change since the beginning of the current crank, and the kernel logs the result at the end of each crank, at which point the crankhash is reset.
+
+Each crank also updates a value called the "activity hash", by hashing the previous activityhash and the latest crankhash together. This records a chain of changes, and is logged at the end of each crank too.
+
+The host application can record the activityhash into its own consensus-tracking database (eg IAVL) at the end of each run, to ensure that any internal divergence of swingset behavior is escalated to a proper consensus failure. Without this, one instance of the kernel might "think differently" than the others, but still "act" the same (in terms of IO or externally-visible messages) without triggering a failure, which would be a lurking problem.
+
+Together, these logs improve our ability to diagnose consensus failures. By comparing logs between a "good" machine and a "bad" (diverging) one, we can quickly determine which crank caused the problem, and usually compare slogfile delivery/syscall records to narrow it down to a specific syscall.
+
+kvStore changes are also recorded by the export-data, but these are too voluminous to be logged, and do not capture multiple changes to the same key. And not all host applications use exports, so there might not be anything watching export data.
+
+## Data Model
+
+The kvStore holds a simple string-to-string key/value store. The SQLite schema is simply `(key, value)`, both of which are TEXT columns.
+
+## Export Model
+
+To ensure that every key/value pair is correctly validatable, *all* in-consensus kvStore rows get their own export-data item. The name is just `kv.${key}`, and the value is just the value. `kvStore.delete(key)` will delete the export-data item. There are no artifacts.
+
+These make up the vast majority of the export-data items, both by count and by "churn" (the number of export-data items changed in a single crank). In the future, we would prefer to keep the kvStore in some sort of Merkle-tree data structure, and emit only a handful of export-data rows that contain hashes (perhaps just a single root hash). In this approach, the actual data would be exported in one or more artifacts. However, our SQLite backend does not provide the same kind of automatic Merkleization as IAVL, and only holds a single version of data at a time, making this impractical.
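A minimal sketch of the three key sections and the crankhash/activityhash chaining described in kvstore.md above. The names `sectionOf` and `makeHashTracker` are hypothetical, and the byte encoding fed into SHA-256 here is an illustrative assumption, not the encoding swing-store actually uses.

```javascript
const { createHash } = require('node:crypto');

// Classify a kvStore key into one of the three namespace sections.
function sectionOf(key) {
  if (key.startsWith('local.')) return 'local';
  if (key.startsWith('host.')) return 'host';
  return 'consensus';
}

// Hypothetical tracker: hashes consensus-section changes into a per-crank
// hash, then chains each crank hash into a running activity hash.
function makeHashTracker() {
  let crankHasher = createHash('sha256');
  let activityHash = '';

  // Record one change; local.* and host.* keys never touch the crank hash.
  const record = (op, key, value = '') => {
    if (sectionOf(key) !== 'consensus') return;
    crankHasher.update(`${op}\n${key}\n${value}\n`); // illustrative encoding
  };

  // Finish the crank: emit the crank hash, reset it, extend the chain.
  const endCrank = () => {
    const crankHash = crankHasher.digest('hex');
    crankHasher = createHash('sha256');
    activityHash = createHash('sha256')
      .update(`${activityHash}\n${crankHash}\n`)
      .digest('hex');
    return { crankHash, activityHash };
  };

  return { record, endCrank };
}
```

Two trackers fed the same consensus writes produce identical hashes even if their `local.` writes differ, while a single differing consensus write changes that crank's hash and every activity hash after it.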
diff --git a/packages/swing-store/docs/snapstore.md b/packages/swing-store/docs/snapstore.md
@@ -0,0 +1,45 @@
+# SnapStore
+
+The `kernelStorage.snapStore` sub-store tracks vat heap snapshots. These blobs capture the state of an XS JavaScript engine, between deliveries, to enable replay-based persistence to run faster. The kernel can start a vat worker from a recent heap snapshot, and then it only needs to replay a handful of transcript items (deliveries), instead of replaying every delivery since the beginning of the incarnation.
+
+The XS / `xsnap` engine defines the heap snapshot format. It consists of a large table of "slots", which are linked together to form JavaScript objects, strings, Maps, functions, etc. The snapshot also includes "chunks" for large data fields (like strings and BigInts), a stack, and some other supporting tables. The snapStore doesn't care about any of the internal details: it just gets a big blob of bytes.
+
+## Data Model
+
+Each snapshot is compressed and stored in the SQLite row as a BLOB. The snapStore has a single table, with a schema of `(vatID, snapPos, inUse, hash, uncompressedSize, compressedSize, compressedSnapshot)`.
+
+The kernel has a scheduler which decides when to take a heap snapshot for each vat. There is a tradeoff between the immediate cost of creating the snapshot, versus the expected future savings of having a shorter transcript to replay. More frequent snapshots save time later, at the cost of time spent now. The kernel currently uses a very simple scheduler, which takes a snapshot every 200 deliveries, plus an extra one a few deliveries into the new incarnation (to avoid replaying expensive contract startup code).
+
+The swingstore is unaware of the kernel's scheduler details. Every once in a while, the kernel tells the snapStore about a new snapshot, and the snapStore updates its data.
+
+As with the transcriptStore, the snapStore retains a hash of older snapshot records, even after it prunes the snapshot data itself. There is at most one `inUse = 1` record for each vatID, and it will always have the highest `snapPos` value. When a particular vatID's active snapshot is replaced, the `inUse` flag is cleared (set to NULL), and the `compressedSnapshot` field is set to NULL.
+
+## Export Model
+
+Each snapshot, both current and historic, gets an export-data entry. The name is `snapshot.${vatID}.${position}`, where `position` is the latest delivery (eg highest delivery number) that was included in the heap state captured by the snapshot. The value is a JSON-serialized record of `{ vatID, snapPos, hash, inUse }`.
+
+If there is a "current" snapshot, there will be one additional export-data record, whose name is `snapshot.${vatID}.current`, and whose value is `snapshot.${vatID}.${position}`. This value is the same as the name of the latest export-data record, and is meant as a convenient pointer to find that latest snapshot.
+
+The export *artifacts* will generally only include the current snapshot for each vat. Only the `debug` mode will include historical snapshots (and only if the swingstore was retaining them in the first place).
+
+## Slow Deletion
+
+As soon as a vat is terminated, the kernel will call `snapStore.stopUsingLastSnapshot()`, after which the vat becomes invisible to exports, and non-loadable by the kernel. The DB is updated to clear the `inUse` flag of the latest snapshot, leaving no rows with `inUse = 1`.
+
+This modifies the latest `snapshot.${vatID}.${snapPos}` export-data record, to change `inUse` to 0.  It also removes the `snapshot.${vatID}.current` export-data record. The modification and deletion are added to the export-data callback queue, so the host-app can learn about them after the next commit. And any subsequent `getExportData()` calls will observe the changed record and omit the `.current` record.
+
+Later, as the kernel performs cleanup work for this vatID, each cleanup call will delete up to `budget` DB rows (one row per unit of budget). Each deleted row also removes one export-data record (which feeds the callback queue, and is reflected in the full `getExportData()` results).
+
+Eventually, the snapStore runs out of rows to delete, and `deleteVatSnapshots(budget)` returns `{ done: true }`, so the kernel can finally rest.
+
+### SnapStore Vat Lifetime
+
+The SnapStore doesn't have an explicit API to call when a vat is first created. The kernel just calls `saveSnapshot()` for both the first and all subsequent snapshots. Each `saveSnapshot()` marks the previous snapshot as unused, so there is at most one `inUse = 1` snapshot at any time (until the first delivery of each incarnation, there are zero in-use snapshots, even though the vat is not terminated).
+
+When terminating a vat, the kernel should first call `snapStore.stopUsingLastSnapshot(vatID)`, the same call it would make at the end of an incarnation, to indicate that we're no longer using the last snapshot. This means there are zero in-use snapshots, so exports (except for `mode = debug`) will ignore this VatID entirely.
+
+Then, the kernel must either call `snapStore.deleteVatSnapshots(vatID, undefined)` to delete everything at once, or make a series of calls (spread out over time/blocks) to `snapStore.deleteVatSnapshots(vatID, budget)`. Each will return `{ done, cleanups }`, which can be used to manage the rate-limiting and know when the process is finished.
+
+The `stopUsingLastSnapshot()` call is a performance improvement, but is not mandatory. If omitted, exports will continue to include the vat's snapshot artifacts until the first call to `deleteVatSnapshots()`, after which they will go away. Snapshots are deleted in descending order, so the first call will delete the only `inUse = 1` snapshot, after which exports will omit all artifacts for the vatID. `stopUsingLastSnapshot()` is idempotent, and extra calls will leave the DB unchanged.
+
+The kernel must keep calling `deleteVatSnapshots(vatID, budget)` until the `done` field of the return value is `true`. It is safe to call it again after that point; the function will keep returning `{ done: true }`. But note that each call costs one DB txn, so it may be cheaper for the kernel to somehow remember that it has reached the end.
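A minimal sketch of the termination sequence from the "SnapStore Vat Lifetime" section above, run against a hypothetical in-memory stand-in for the snapStore. Only the method names and the `{ done, cleanups }` return shape come from the documentation; `makeMockSnapStore`, `cleanupVatSnapshots`, and the exact rule for when `done` becomes true are assumptions made for illustration.

```javascript
// Hypothetical in-memory stand-in for snapStore, holding rows shaped like
// { vatID, snapPos, inUse }. Not the real SQLite-backed implementation.
function makeMockSnapStore(rows) {
  // Clear the inUse flag for the vat; calling it again changes nothing.
  const stopUsingLastSnapshot = vatID => {
    for (const row of rows) {
      if (row.vatID === vatID) row.inUse = null;
    }
  };

  // Delete up to `budget` rows, highest snapPos first. An undefined budget
  // deletes everything at once.
  const deleteVatSnapshots = (vatID, budget = undefined) => {
    const mine = rows
      .filter(row => row.vatID === vatID)
      .sort((a, b) => b.snapPos - a.snapPos);
    const limit = budget === undefined ? mine.length : budget;
    const doomed = mine.slice(0, limit);
    for (const row of doomed) {
      rows.splice(rows.indexOf(row), 1);
    }
    // Assumption: done means no rows remain for this vat after the call.
    return { done: doomed.length === mine.length, cleanups: doomed.length };
  };

  return { stopUsingLastSnapshot, deleteVatSnapshots };
}

// Drive the cleanup. A real kernel would spread these calls across blocks;
// the loop here makes them back to back for brevity.
function cleanupVatSnapshots(snapStore, vatID, budget) {
  snapStore.stopUsingLastSnapshot(vatID);
  let calls = 0;
  let total = 0;
  for (;;) {
    const { done, cleanups } = snapStore.deleteVatSnapshots(vatID, budget);
    calls += 1;
    total += cleanups;
    if (done) return { calls, total };
  }
}
```

With 12 snapshot rows and a budget of 5, this mock finishes in three calls (5, 5, then 2 rows), and a further call reports `{ done: true, cleanups: 0 }`.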