terminate vats slowly #8928

warner · 2024-02-15T21:53:32Z

What is the Problem Being Solved?

We've been brainstorming ways to address the large price-feed vats which are holding onto a lot of virtual objects (#8400, plus their contribution to #8401). While the fixes to stop their growth are coming along nicely, we will still eventually need a way to delete all the old data. The simplest (and most desirable) approach is to terminate those vats. However, with the current kernel, that would perform an unacceptably large amount of work and trigger some O(N^2) performance bugs in the kernel GC handling code, which would probably crash the chain.

#8877 is a workaround to have the vat release objects slowly, to avoid these problems. But a better / longer-term solution is to allow the kernel to safely terminate large vats, yet still perform the cleanup work slowly, over time, so it doesn't interfere with normal operations.

This ticket is about an approach which would fix all of the problems with the price-feed vats, as well as any vat which performed a "malicious large termination" (aka the "I'm too big to fail, hahaha" attack). The kernel would still be vulnerable to vats which call collection.clear() on, or delete, a large collection (for which #8417 would be the fix).

The rough idea is to change terminateVat to merely mark the vat as terminated, and not immediately delete all the late vat's data (delete vatstore entries, delete clist entries and propagate decrefs, abandon exported objects). We would defer that work until later "cleanup" steps, performed when the runPolicy says it's a good time (eg if/when there is remaining budget at the end of a block), and only do a small amount of work at a time. This work could be spaced out over months.

The kernel would still reject all the vat's outstanding Promises at the time of termination, so that their downstream effects are felt promptly. This isn't a problem with our current vats (v29-ATOM-USD_price_feed only decides three promises), but could be a too-big-to-fail attack vector in the future. The kernel would also immediately kill the worker process.

Description of the Design

the old design

When terminateVat() (kernel.js) is called, it does the following steps immediately:

A: use kernelKeeper.enumeratePromisesByDecider() to find all the promises decided by this vat, and then call resolveToError() on all of them
B: call kernelKeeper.cleanupAfterTerminatedVat() to delete the vat's state
- B1: first, this does vatKeeper.deleteSnapshotsAndTranscript() to delete historical/replay-enabling state
- B2: then walk the clist for c.o+ exports, and orphanKernelObject() each of them
- B3: then walk the clist for c.o- imports, and deleteCListEntry each, which decrements refcounts
- B4: then walk and delete all remaining vat state, like the vatstore data
- B5: finally delete the vatID from vat.dynamicIDs and/or vat.names
C: call notifyTermination() to let vat-admin fire the done() promise, so userspace (eg parent vat) can react
D: call vatWarehouse.stopWorker() to shut down the xsnap worker

When the vat has a large number of non-exported virtual objects (eg most of the QuotePayments in #8400), step B4 takes a long time. When those objects are exported, B2 takes a long time (eg the QuotePayments that are weakly referenced by vat-board).

When the vat is importing a lot of objects, such as the #8401 cycles, B3 takes a long time, and also adds a lot of krefs to maybeFreeKrefs, and at end-of-crank, the processRefCounts() call will add a lot of actions to gcActions. At the start of the next crank, processGCActionSet will deliver a large GC delivery to the upstream vat (like dispatch.dropImports() with 100k krefs into v9-zoe), which will trigger a very long delivery, with some multiple of 100k syscalls, all of which must get added into the transcript entry (causing bad memory behavior and JSON.stringify serialization time). Then, the next time v9-zoe does a BOYD, it will do some other multiple of 100k syscalls (to check on the status of all those entries, and probably free up additional objects as the cycles unwind), causing the same many-syscalls transcript problem, after which it will do its own very large syscall.dropImports or syscall.retireExports call with 100k vrefs, triggering more work.

the new design

We'll introduce a new kvStore key named vats.terminated, whose value is a simple JSON-serialized array of VatID strings. KernelKeeper will keep an in-RAM cache of it as a Set named terminatedVats, write a sorted version to kvStore every time we change it, and populate the cache at kernel boot (and write an empty list if the key did not already exist, which is our backwards-compatibility/upgrade point).
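
As a rough illustration of the cache described above, here is a minimal sketch. It assumes a plain `Map` standing in for kvStore, and `makeTerminatedVatsTracker` and its method names are hypothetical, not the real KernelKeeper API.

```javascript
// Sketch only: `kvStore` is a plain Map standing in for the real kvStore, and
// makeTerminatedVatsTracker is a hypothetical helper name.
function makeTerminatedVatsTracker(kvStore) {
  const KEY = 'vats.terminated';
  // Upgrade point: an older kernel never wrote this key, so initialize it.
  if (!kvStore.has(KEY)) {
    kvStore.set(KEY, JSON.stringify([]));
  }
  // Populate the in-RAM cache once, at kernel boot.
  const terminatedVats = new Set(JSON.parse(kvStore.get(KEY)));

  // Write a sorted copy so the stored value does not depend on insertion order.
  const save = () =>
    kvStore.set(KEY, JSON.stringify([...terminatedVats].sort()));

  return {
    isTerminated: vatID => terminatedVats.has(vatID),
    markTerminated: vatID => {
      terminatedVats.add(vatID);
      save();
    },
    forget: vatID => {
      terminatedVats.delete(vatID);
      save();
    },
  };
}

const kvStore = new Map();
const tracker = makeTerminatedVatsTracker(kvStore);
tracker.markTerminated('v68');
tracker.markTerminated('v29');
console.log(kvStore.get('vats.terminated')); // ["v29","v68"]
```

Sorting before each write keeps the stored string deterministic, which matters when every validator must compute the same kvStore contents.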

We'll change kernelKeeper.ownerOfKernelObject to consult both ${kref}.owner and terminatedVats, so an object whose .owner is still present, but whose owning vatID is terminated, will be reported as orphaned. We also change kernelKeeper.vatIsAlive to report false for terminated vats (either by having it consult terminatedVats too, or by preemptively deleting the ${vatID}.o.nextID key that it looks for).
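
The lookup change could look roughly like the following sketch. It assumes a `Map` in place of kvStore and takes the first of the two vatIsAlive options (consulting terminatedVats); `makeOwnerLookup` is a hypothetical wrapper used only to keep the example self-contained.

```javascript
// Sketch only: a Map stands in for kvStore, and terminatedVats is passed in directly.
function makeOwnerLookup(kvStore, terminatedVats) {
  // Returns the owning vatID, or undefined if the object is orphaned.
  function ownerOfKernelObject(kref) {
    const owner = kvStore.get(`${kref}.owner`);
    if (owner === undefined || terminatedVats.has(owner)) {
      return undefined; // no owner record, or the owner is dead: orphaned
    }
    return owner;
  }
  // Assumed variant: consult terminatedVats in addition to the nextID key.
  function vatIsAlive(vatID) {
    return kvStore.has(`${vatID}.o.nextID`) && !terminatedVats.has(vatID);
  }
  return { ownerOfKernelObject, vatIsAlive };
}

const kvStore = new Map([
  ['ko21.owner', 'v29'],
  ['ko30.owner', 'v9'],
  ['v29.o.nextID', '50'],
  ['v9.o.nextID', '80'],
]);
const terminatedVats = new Set(['v29']);
const { ownerOfKernelObject, vatIsAlive } = makeOwnerLookup(kvStore, terminatedVats);
console.log(ownerOfKernelObject('ko21')); // undefined (owner v29 is terminated)
console.log(ownerOfKernelObject('ko30')); // v9
console.log(vatIsAlive('v29')); // false
```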

Next, we change terminateVat() to add the vatID to terminatedVats, and skip steps B2, B3, and B4. This will leave the bulk of the vat's data (and the cleanup work) for later, allowing terminateVat() to finish quickly (and without provoking immediate large work).

Now, we introduce a new internal kernel.doSomeCleanup(vatID, budget) method which performs up to budget total steps of the deferred work, renumbered here as B1 through B3 (these are steps B2, B3, and B4 of the old design):

B1 (orphanKernelObject())
B2 (deleteCListEntry())
B3 (delete other vat state)
(i.e. we only do B2 if there were no c.o+ export entries left to perform B1 on, and only do B3 once no c-list imports are left for B2)
if no B1/B2/B3 work is left to do, delete the vatID from terminatedVats
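
The phase ordering and budget accounting can be sketched as follows. This is only an illustration: the `vat` record with its three arrays is a hypothetical in-memory stand-in for the dead vat's remaining state, and the function takes that record rather than a vatID to stay self-contained.

```javascript
// Sketch only: `vat` is a hypothetical in-memory stand-in for the dead vat's
// remaining kernel state; the real kernel would walk kvStore key ranges instead.
function doSomeCleanup(vat, budget, terminatedVats) {
  let remaining = budget;
  const log = [];
  // Phases run strictly in order: a later phase starts only once the earlier
  // one has nothing left to do.
  const phases = [
    ['orphanKernelObject', vat.exports], // B1: c.o+ exports
    ['deleteCListEntry', vat.imports], // B2: c.o- imports (decrefs)
    ['deleteVatState', vat.otherKeys], // B3: vatstore and other leftovers
  ];
  for (const [action, items] of phases) {
    while (remaining > 0 && items.length > 0) {
      log.push(`${action} ${items.shift()}`);
      remaining -= 1;
    }
  }
  const done = phases.every(([_action, items]) => items.length === 0);
  if (done) {
    terminatedVats.delete(vat.vatID); // nothing left: forget the vat entirely
  }
  return { done, log };
}

const terminatedVats = new Set(['v29']);
const vat = {
  vatID: 'v29',
  exports: ['ko21', 'ko22'],
  imports: ['ko7'],
  otherKeys: ['vs.a', 'vs.b'],
};
// First call: two orphanings, then one c-list deletion, and the budget is spent.
console.log(doSomeCleanup(vat, 3, terminatedVats).log);
// Second call: the two leftover keys are deleted and the vat is forgotten.
console.log(doSomeCleanup(vat, 3, terminatedVats).done); // true
```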

To get this work scheduled, we change the runPolicy to add an allowCleanup() method. If present, this can return { budget }, or can return {} to mean "do all the cleanup now". If it returns something falsy, that tells the kernel that no cleanup work should be done at this time. For backwards compatibility, if the method is missing, the kernel acts as if it returned {} (unlimited cleanup).
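
One way to read those three cases is as a single numeric budget. The helper below is a hypothetical sketch (`getCleanupBudget` is not a real kernel function); it maps "missing method" and `{}` to unlimited, and any falsy return to zero.

```javascript
// Sketch only: getCleanupBudget is a hypothetical helper name.
// Returns 0 for "no cleanup now", Infinity for "unlimited", else the budget.
function getCleanupBudget(runPolicy) {
  if (typeof runPolicy.allowCleanup !== 'function') {
    return Infinity; // method missing: backwards-compatible unlimited cleanup
  }
  const allowed = runPolicy.allowCleanup();
  if (!allowed) {
    return 0; // falsy: no cleanup work at this time
  }
  return allowed.budget === undefined ? Infinity : allowed.budget;
}

console.log(getCleanupBudget({})); // Infinity
console.log(getCleanupBudget({ allowCleanup: () => false })); // 0
console.log(getCleanupBudget({ allowCleanup: () => ({ budget: 5 }) })); // 5
console.log(getCleanupBudget({ allowCleanup: () => ({}) })); // Infinity
```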

Then we add a new pseudo-run-queue message { type: 'cleanup', vatID, budget }, a sibling to GC actions and reap-actions and actual run-queue messages like deliver and notify. We change the getNextMessageAndProcessor() function to take runPolicy as an argument, and change its priority list to consult terminatedVats before looking at anything else (but only if the policy allows cleanup right now). So its sequence will look like:

budget = runPolicy.allowCleanup()
if budget then return cleanup/vatID/budget
else if acceptance-queue is non-empty, return acceptance-queue message
else check processGCActionSet()
else check nextReapAction()
else check getNextRunQueueMsg()
else idle
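
The priority list above can be sketched as a small function. The `queues` object and its field names are hypothetical stand-ins for the kernel's real queues, and picking the first vatID in sorted order is an assumption made only so the example is deterministic.

```javascript
// Sketch only: `queues` is a hypothetical object whose fields stand in for the
// kernel's acceptance queue, GC action set, reap queue, and run-queue.
function getNextMessage(queues, terminatedVats, runPolicy) {
  // A missing allowCleanup() is treated like {} (unlimited cleanup).
  const allowed = runPolicy.allowCleanup ? runPolicy.allowCleanup() : {};
  if (allowed && terminatedVats.size > 0) {
    // Assumption: clean up the lowest-sorted terminated vat first.
    const [vatID] = [...terminatedVats].sort();
    return { type: 'cleanup', vatID, budget: allowed.budget };
  }
  for (const name of ['acceptanceQueue', 'gcActions', 'reapActions', 'runQueue']) {
    if (queues[name].length > 0) {
      return queues[name].shift();
    }
  }
  return undefined; // idle
}

const queues = {
  acceptanceQueue: [],
  gcActions: [{ type: 'dropExports' }],
  reapActions: [],
  runQueue: [{ type: 'deliver' }],
};
const terminatedVats = new Set(['v29']);
const allowing = { allowCleanup: () => ({ budget: 5 }) };
const refusing = { allowCleanup: () => false };
console.log(getNextMessage(queues, terminatedVats, allowing).type); // cleanup
console.log(getNextMessage(queues, terminatedVats, refusing).type); // dropExports
```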

This provides a default behavior that nearly matches the old one: vat termination work runs to completion on the step just after the termination happened, but still before any other work gets done.

But, the final step is to change cosmic-swingset to provide a runPolicy.allowCleanup = () => false for all the existing runs, then to perform one final run (with => ({ budget })) iff there is still room left in the block (eg if the previous runs did not exhaust the computron budget). I think budget = 10 or 100 is likely.

So assuming the block that terminates the vat has free space, the final run will orphan eg 100 objects of the dead vat. If we assume that all these objects were also recognized by eg vat-board (and #7212 gets fixed), then this will push dispatch.retireImports() GC actions onto the queue, and these will be delivered as the second crank of this cleanup run. That may cause the recognizing vat to delete some collection entries, which incur syscalls and might create more garbage in those vats, but no more than 100 at a time.

The next non-full block will orphan 100 more. This will repeat, 100 krefs per block, until all the exports have been orphaned. The following block will begin on the c-list imports. Each block will decref 100 krefs, which will immediately push a 100-kref dispatch.dropExports call onto the GC action queue, as well as a similar retireExports call. Then the dropExports will be delivered, causing eg v9-zoe to do syscalls (change export-status to recognizable) on all 100, and accumulating 100 vrefs in its internal liveslots possiblyDeadSet. Next, the retireExports will be delivered, which will trigger syscalls to delete the export status. Each block will process 100 krefs in this way.

Eventually, v9-zoe will get enough deliveries to trigger a BOYD, which will perform more vatstore syscalls to check for remaining reference counts, and when those counts come up empty, liveslots will delete those virtual objects, which will trigger more deletions (unwinding the cycles). The amount of work will be bounded by the number of accumulated deleted objects, which only happen 100 at a time.

Once all the terminated vat's imports are deleted, the next non-full block will start on B3, and will delete 100 vatstore entries. This doesn't cause any activity in other vats, so the only cost is kvStore activity (which also causes IAVL churn, but only 100 entries at a time). This continues until all remaining vat state is deleted.

Then the final non-full block can remove the vatID from terminatedVats, and the kernel has finally completely forgotten about the vat. The next call to getNextMessageAndProcessor() will not find any cleanup work to do.

The final requirement is that we run BOYDs frequently enough to avoid aggregating our carefully-spaced-out cleanup work back into a single large delivery. Our current scheduler has two ways to trigger a BOYD:

${vatID}.reapInterval triggers a BOYD every 1000 deliveries
kernel.snapshotInterval triggers a heap snapshot (and associated BOYD) every 200 deliveries
- plus an extra one after the third delivery of the incarnation, to avoid replaying the big startVat

If we didn't change that, and we assume that eg v9-zoe is otherwise idle, we'd observe 100 pairs of dropExports(100 krefs) and retireExports(100 krefs) before a BOYD was triggered, so v9-zoe would be doing cleanup on 10k krefs at a time. This might be larger than we want (certainly better than doing all 200k at once, but we'd really prefer to keep things limited to the budget, not budget * snapshotInterval).

So the last step is to change the way we schedule BOYDs. As @mhofman has recommended, the scheduler should pay attention to more than just the number of deliveries. We can count computrons, deliveries, and also the number of krefs that have been dropped or retired. When any of these values grows above some threshold, we can trigger a BOYD and heap snapshot.

(the trick is to avoid doing multiple heap snapshots in a single block, because that's just wasted effort)

Security Considerations

We must continue to maintain some invariants.

Dead vats tell no tales

We stop the vat worker, and delete the metadata keys that would allow vat-warehouse to start a new one. The dead vat will therefore never again get agency, and never gets a chance to emit new messages.

You can't speak to the dead (vat)

All messages and deliveries to the now-dead vat must go splat. We ensure this by having all message deliveries check for the vatID in terminatedVats first.

other security-ish considerations

We don't want other vat code to be able to tell that termination work is being spread out over time. By rejecting all the promises immediately (and firing done() right away), we provoke the only userspace-visible consequences of termination. The remainder are GC effects (such as WeakMap entries being deleted), which we've been careful to hide from userspace.

This change should close off a significant portion of attacks in which a malicious vat attempts to overload the kernel with cleanup work. It does not prevent high-cardinality syscall.dropImports(): those are better managed with a kernel-side policy to terminate vats which try to drop too much (just like a too-many-computrons-per-delivery policy), combined with an #8417-style liveslots change to avoid getting terminated.

It also does not prevent the threat of lots of outstanding Promises. We could choose to address that by only rejecting a few of them each time through (moving step A into the slow path), trading off kernel robustness against letting other vats react promptly to the termination event.

Scaling Considerations

This should enable the kernel to survive the termination of large-scale vats.

It does, of course, cost more time/CPU overall. The data structures it accesses are all constant-time, so the overhead comes from the extra per-block checks (like the kvStore.getNextKey() to see if there is more work to be done). But I think the number of extra calls will be minimal.

Test Plan

Unit tests on the kernel to terminate a vat with lots of state and observe that the work is spread out over many calls. Unit tests on the new runPolicy allowCleanup() method and its effects.

Before deployment (specifically before we trigger a vat termination), we'll need a main-fork simulation of deleting both large price-feed vats. I'm particularly interested/concerned with the shadow-into-IAVL overhead of the DB churn. I think this approach gives us the best chance of reducing that churn to a manageable size, but I still want to know exactly how long the real chain will take to perform this cleanup, and especially how the eventual cosmos-sdk state pruning event plays out.

Upgrade Considerations

The only swing-store (kvStore) change is the new vats.terminated key. This will be missing in vN-1, and when vN starts, the kernel will notice the missing key and initialize it to an empty list. No vats will be in the partially-terminated state upon entry to vN, so we don't need to accommodate any existing state other than the "missing key" one.

warner · 2024-02-24T01:08:18Z

Ok, so runPolicy() will provide a way for the host app to limit how much termination work gets done, and when. For cosmic-swingset, I think we'll add one run to each block, at the end, which only gets performed if no other run spent any computrons. This run will allow a single round of termination-cleanup (by having allowCleanup() return false on all other calls than the first one), and will limit that round to only doing 5 items (by having it return 5 on the first call). For the price-feed vats that we're looking at, that will either:

delete 5 c-list exports to zoe, causing 5 drop+retires into vat-zoe, which will add 250 syscalls to the next vat-zoe BOYD
delete 5 c-list imports, causing vat-board to get retires, adding maybe 10 syscalls to the next vat-board BOYD
delete 5 kvStore keys, triggering no other work
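
A policy that allows exactly one bounded round of cleanup per run might be sketched like this. `makeOneShotCleanupPolicy` is a hypothetical name, the other runPolicy methods are omitted, and the budget is returned as `{ budget }` (the shape used in the design above) rather than as a bare number.

```javascript
// Sketch only: makeOneShotCleanupPolicy is a hypothetical name; a real
// runPolicy has other methods too, which are omitted here.
function makeOneShotCleanupPolicy(budget) {
  let used = false;
  return {
    allowCleanup() {
      if (used) {
        return false; // every later call in this run: no more cleanup
      }
      used = true;
      return { budget }; // first call only: one bounded round of cleanup
    },
  };
}

const policy = makeOneShotCleanupPolicy(5);
console.log(policy.allowCleanup()); // { budget: 5 }
console.log(policy.allowCleanup()); // false
```

Because the policy object carries its own state, a fresh one would need to be created for each block's cleanup run.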

The garbage in run-21 (08-feb-2024) contains:

282k zoe cycles
169k weak imports in v7-board
1.17M quote payments

Our remediation will just delete (slowly) the two large price-feed vats: v29 (ATOM-USD) and v68 (stATOM-USD). Most of the v7-board weak imports are coming from those two vats, as are most of the zoe cycles, and most of the quote payments are virtual objects in those vats.

Then, by implementing #8980 and setting the threshold to trigger a BOYD every 20 krefs:

while deleting c-list imports which feed zoe cycles:
- every (empty) block will drop and retire 5 krefs, which eventually break a zoe cycle
- every other (empty) block will reach the threshold to trigger a vat-zoe BOYD
- this BOYD will do about 500 syscalls, and free everything involved in 10 cycles
- so every two blocks (about 12s) will cause three deliveries to vat-zoe (dropImports, retireImports, BOYD)
- the snapshotInterval remains 200 deliveries, so 200/3 = 67 cycles will trigger a snapshot
- so every 133 blocks (about 13min) we'll trigger a Zoe snapshot
- if the BOYD takes 0.5s, the worst-case replay time (where a validator is restarted just before the vat-zoe snapshot) will replay about 66 BOYDs, taking about 33s to start up. The average restart time is half that, about 16s.
- the 282k zoe cycles (as of run-21, 08-feb-2024) will take about 4 days to clear
- (if we delete the imports first, zoe's syscall.retireImports will probably delete the entangled exports, so they won't be around by the time we reach the exports phase)
while deleting c-list imports which are recognizable by v7-board:
- each (empty) block will delete 5 entries
- every 4th (empty) block will reach the threshold to trigger a vat-board BOYD
- that BOYD will do about 20*10=200 syscalls, and will happen about once every 24s
- after 200 empty blocks, we'll have sent 200 retireExports into vat-board, and will trigger a snapshot
- the snapshot will happen about once every 20 minutes
- it will take about 2.3 days to clear all the c-list entries
while deleting vatstore data:
- each quote payment involves about 10 vatstore keys (refcounts, export status, collection keys, etc)
- each (empty) block will delete 5 vatstore entries
- every 2nd (empty) block will finish deleting a quote payment's data
- no other vats will be invoked or triggered, vatstore is entirely vat-local
- it will take about 16 days to finish deleting all the vatstore data

That points to about 23 days to finish the remediation, assuming that the chain is mostly idle (so every block gets to do a little bit of cleanup).

We could speed that up fairly safely by making a fancier scheduler which knows the difference between termination work that triggers GC actions, and work (vatstore deletions) which does not. For example, if we said that each budget=1 allows one c-list entry or two vatstore entries to be processed, then the vatstore-deletion phase would run twice as fast (8 days total) and delete 10 entries per block.

`dispatch.bringOutYourDead()`, aka "reap", triggers garbage collection inside a vat, and gives it a chance to drop imported c-list vrefs that are no longer referenced by anything inside the vat.

Previously, each vat has a configurable parameter named `reapInterval`, which defaults to a kernel-wide `defaultReapInterval` (but can be set separately for each vat). This defaults to 1, mainly for unit testing, but real applications set it to something like 200.

This caused BOYD to happen once every 200 deliveries, plus an extra BOYD just before we save an XS heap-state snapshot.

This commit switches to a "dirt"-based BOYD scheduler, wherein we consider the vat to get more and more dirty as it does work, and eventually it reaches a `reapDirtThreshold` that triggers the BOYD (which resets the dirt counter).

We continue to track `dirt.deliveries` as before, with the same defaults. But we add a new `dirt.gcKrefs` counter, which is incremented by the krefs we submit to the vat in GC deliveries. For example, calling `dispatch.dropImports([kref1, kref2])` would increase `dirt.gcKrefs` by two.

The `reapDirtThreshold.gcKrefs` limit defaults to 20. For normal use patterns, this will trigger a BOYD after ten krefs have been dropped and retired. We choose this value to allow the #8928 slow vat termination process to trigger BOYD frequently enough to keep the BOYD cranks small: since these will be happening constantly (in the "background"), we don't want them to take more than 500ms or so. Given the current size of the large vats that #8928 seeks to terminate, 10 krefs seems like a reasonable limit. And of course we don't want to perform too many BOYDs, so `gcKrefs: 20` is about the smallest threshold we'd want to use.

External APIs continue to accept `reapInterval`, and now also accept `reapGCKrefs`.

* kernel config record
  * takes `config.defaultReapInterval` and `defaultReapGCKrefs`
  * takes `vat.NAME.creationOptions.reapInterval` and `.reapGCKrefs`
* `controller.changeKernelOptions()` still takes `defaultReapInterval` but now also accepts `defaultReapGCKrefs`

The APIs available to userspace code (through `vatAdminSvc`) are unchanged (partially due to upgrade/backwards-compatibility limitations), and continue to only support setting `reapInterval`. Internally, this just modifies `reapDirtThreshold.deliveries`.

* `E(vatAdminSvc).createVat(bcap, { reapInterval })`
* `E(adminNode).upgrade(bcap, { reapInterval })`
* `E(adminNode).changeOptions({ reapInterval })`

Internally, the kernel-wide state records `defaultReapDirtThreshold` instead of `defaultReapInterval`, and each vat records `.reapDirtThreshold` in their `vNN.options` key instead of `vNN.reapInterval`. The current dirt level is recorded in `vNN.reapDirt`. The kernel will automatically upgrade both the kernel-wide and the per-vat state upon the first reboot with the new kernel code. The old `reapCountdown` value is used to initialize the vat's `reapDirt.deliveries` counter, so the upgrade shouldn't disrupt the existing schedule. Vats which used `reapInterval = 'never'` (eg comms) will get a `reapDirtThreshold` of all 'never' values, so they continue to inhibit BOYD. Otherwise, all vats get a `threshold.gcKrefs` of 20.

We do not track dirt when the corresponding threshold is 'never', to avoid incrementing the comms dirt counters forever.

This design leaves room for adding `.computrons` to the dirt record, as well as tracking a separate `snapshotDirt` counter (to trigger XS heap snapshots, ala #6786). We add `reapDirtThreshold.computrons`, but do not yet expose an API to set it.

Future work includes:

* upgrade vat-vat-admin to let userspace set `reapDirtThreshold`

New tests were added to exercise the upgrade process, and other tests were updated to match the new internal initialization pattern.

We now reset the dirt counter upon any BOYD, so this also happens to help with #8665 (doing a `reapAllVats()` resets the delivery counters, so future BOYDs will be delayed, which is what we want). But we should still change `controller.reapAllVats()` to avoid BOYDs on vats which haven't received any deliveries.

closes #8980
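
To make the dirt accounting concrete, here is a minimal sketch. `makeDirtTracker` and its method names are hypothetical; only the `deliveries` and `gcKrefs` counters, the default of 20, and the 'never' convention come from the commit message, and comparing with "greater than or equal" is an assumption chosen to match "a BOYD after ten krefs have been dropped and retired".

```javascript
// Sketch only: makeDirtTracker and its method names are hypothetical.
function makeDirtTracker(threshold) {
  const dirt = {};
  function add(kind, amount) {
    if (threshold[kind] === 'never' || threshold[kind] === undefined) {
      return; // untracked, so the counter cannot grow forever
    }
    dirt[kind] = (dirt[kind] || 0) + amount;
  }
  return {
    // Record one delivery; gcKrefs is the kref count of a GC delivery (else 0).
    noteDelivery(gcKrefs = 0) {
      add('deliveries', 1);
      add('gcKrefs', gcKrefs);
    },
    // True once any tracked counter has reached its threshold (assumed >=).
    needsBOYD: () =>
      Object.keys(dirt).some(kind => dirt[kind] >= threshold[kind]),
    // A BOYD resets the dirt counters.
    reset() {
      for (const kind of Object.keys(dirt)) {
        delete dirt[kind];
      }
    },
    current: () => ({ ...dirt }),
  };
}

const zoe = makeDirtTracker({ deliveries: 200, gcKrefs: 20 });
// Two blocks of cleanup: each sends one 5-kref drop and one 5-kref retire.
for (let block = 1; block <= 2; block += 1) {
  zoe.noteDelivery(5);
  zoe.noteDelivery(5);
  console.log(block, zoe.needsBOYD()); // block 1: false, block 2: true
}
```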

Both `snapStore.deleteVatSnapshots()` and `transcriptStore.deleteVatTranscripts()` now take a numeric `budget=` argument, which will limit the number of snapshots or spans deleted in each call. Both return a `{ done, cleanups }` record so the caller knows when to stop calling. This enables the slow deletion of large vats (lots of transcript spans or snapshots), a small number of items at a time. Recommended budget is 5, which (given SwingSet's `snapInterval=200` default) will cause the deletion of 1000 rows from the `transcriptItems` table each call, which shouldn't take more than 100ms. refs #8928
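
A caller would presumably loop on the `{ done, cleanups }` record, one budgeted call at a time. The sketch below uses an in-memory Map of span arrays in place of the SQL-backed store; `makeTranscriptStore` is hypothetical, and the `(vatID, budget)` argument order is an assumption.

```javascript
// Sketch only: a Map of arrays stands in for the SQL-backed transcript store,
// and the (vatID, budget) argument order is an assumption.
function makeTranscriptStore(spansByVat) {
  // Deletes at most `budget` spans; omit budget to delete everything.
  function deleteVatTranscripts(vatID, budget = Infinity) {
    const spans = spansByVat.get(vatID) || [];
    let cleanups = 0;
    while (cleanups < budget && spans.length > 0) {
      spans.pop(); // the real store would delete the span's rows here
      cleanups += 1;
    }
    return { done: spans.length === 0, cleanups };
  }
  return { deleteVatTranscripts };
}

const twelveSpans = Array.from({ length: 12 }, (_, i) => i);
const store = makeTranscriptStore(new Map([['v43', twelveSpans]]));
let result;
do {
  // In the kernel, each call would happen in a separate cleanup step.
  result = store.deleteVatTranscripts('v43', 5);
  console.log(result.cleanups, result.done); // 5 false, 5 false, 2 true
} while (!result.done);
```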

This introduces new `runPolicy()` controls which enable "slow termination" of vats. When configured, terminated vats are immediately dead (all promises are rejected, all new messages go splat, they never run again), however the vat's state is deleted slowly, one piece at a time. This makes it safe to terminate large vats, with a long history, lots of c-list imports/exports, or large vatstore tables, without fear of causing an overload (by e.g. dropping 100k references all in a single crank). See docs/run-policy.md for details and configuration instructions. refs #8928

warner · 2024-04-13T04:14:01Z

PR #9227 implements the swingset side of this. We'll still need changes to cosmic-swingset to provide a suitable runPolicy. I'm thinking that we only do cleanup work in empty blocks (so have allowCleanup() return false in all runs, then look at the total used computrons, and iff that was zero, do an additional run which allows budget=5 cleanup and up to 65M computrons). Eventually we might want to change that to allow some cleanup work in non-empty-but-non-full blocks, but I think we should get some experience with the slower form before we enable a faster rate.
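
The empty-block rule could be sketched as below. `runKernel` is a hypothetical stand-in for whatever executes one kernel run under a policy and reports the computrons it consumed; the 65M computron cap on the cleanup run is left out to keep the example short.

```javascript
// Sketch only: runKernel(policy) is a hypothetical callback that performs one
// kernel run and returns the number of computrons it used.
function runBlock(runKernel, normalRunCount, cleanupBudget = 5) {
  const noCleanup = { allowCleanup: () => false };
  let usedComputrons = 0;
  for (let i = 0; i < normalRunCount; i += 1) {
    usedComputrons += runKernel(noCleanup);
  }
  if (usedComputrons !== 0) {
    return false; // the block did real work: skip cleanup this time
  }
  // Empty block: one extra run that permits a bounded amount of cleanup.
  runKernel({ allowCleanup: () => ({ budget: cleanupBudget }) });
  return true;
}

const idleKernel = policy => {
  console.log('allowCleanup ->', policy.allowCleanup());
  return 0;
};
console.log(runBlock(idleKernel, 2)); // true: the third run allowed budget 5
```

Loosening the `usedComputrons !== 0` check to "less than the block limit" would give the non-empty-but-non-full variant mentioned above.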

Both `snapStore.deleteVatSnapshots()` and `transcriptStore.deleteVatTranscripts()` now take a numeric `budget=` argument, which will limit the number of snapshots or transcript spans deleted in each call. Both return a `{ done, cleanups }` record so the caller knows when to stop calling. This enables the slow deletion of large vats (lots of transcript spans or snapshots), a small number of items at a time. Recommended budget is 5, which (given SwingSet's `snapInterval=200` default) will cause the deletion of 1000 rows from the `transcriptItems` table each call, which shouldn't take more than 100ms.

Without this, the kernel's attempt to slowly delete a terminated vat would succeed in slowly draining the kvStore, but would trigger a gigantic SQL transaction at the end, as it deleted every transcript item in the vat's history. The worst-case example I found would be the mainnet chain's v43-walletFactory, which (as of apr-2024) has 8.2M transcript items in 40k spans. A fast machine takes two seconds just to count all the items, and deletion took 22 *minutes*, with a `swingstore.wal` file that peaked at 27 GiB. This would cause an enormous chain stall at some surprising point in time weeks or months after the vat was first terminated.

In addition, both the transcript spans and the snapshot records are shadowed into IAVL (via `export-data`) for integrity, and deleting 40k+40k=80k IAVL records in a single block might cause some significant churn too.

refs #8928
