-
Notifications
You must be signed in to change notification settings - Fork 206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
terminate vats slowly #8928
Comments
Ok, so
The garbage in run-21 (08-feb-2024) contains:
Our remediation will just delete (slowly) the two large price-feed vats: v29 (ATOM-USD) and v68 (stATOM-USD). Most of the v7-board weak imports are coming from those two vats, as are most of the zoe cycles, and most of the Then, by implementing #8980 and setting the threshold to trigger a BOYD every 20 krefs:
That points to about 23 days to finish the remediation, assuming that the chain is mostly idle (so every block gets to do a little bit of cleanup). We could speed that up fairly safely by making a fancier scheduler which knows the difference between termination work that triggers GC actions, and work (vatstore deletions) which does not. For example, if we said that each |
`dispatch.bringOutYourDead()`, aka "reap", triggers garbage collection inside a vat, and gives it a chance to drop imported c-list vrefs that are no longer referenced by anything inside the vat. Previously, each vat has a configurable parameter named `reapInterval`, which defaults to a kernel-wide `defaultReapInterval` (but can be set separately for each vat). This defaults to 1, mainly for unit testing, but real applications set it to something like 200. This caused BOYD to happen once every 200 deliveries, plus an extra BOYD just before we save an XS heap-state snapshot. This commit switches to a "dirt"-based BOYD scheduler, wherein we consider the vat to get more and more dirty as it does work, and eventually it reaches a `reapDirtThreshold` that triggers the BOYD (which resets the dirt counter). We continue to track `dirt.deliveries` as before, with the same defaults. But we add a new `dirt.gcKrefs` counter, which is incremented by the krefs we submit to the vat in GC deliveries. For example, calling `dispatch.dropImports([kref1, kref2])` would increase `dirt.gcKrefs` by two. The `reapDirtThreshold.gcKrefs` limit defaults to 20. For normal use patterns, this will trigger a BOYD after ten krefs have been dropped and retired. We choose this value to allow the #8928 slow vat termination process to trigger BOYD frequently enough to keep the BOYD cranks small: since these will be happening constantly (in the "background"), we don't want them to take more than 500ms or so. Given the current size of the large vats that #8928 seeks to terminate, 10 krefs seems like a reasonable limit. And of course we don't want to perform too many BOYDs, so `gcKrefs: 20` is about the smallest threshold we'd want to use. External APIs continue to accept `reapInterval`, and now also accept `reapGCKrefs`. * kernel config record * takes `config.defaultReapInterval` and `defaultReapGCKrefs` * takes `vat.NAME.creationOptions.reapInterval` and `.reapGCKrefs` * `controller.changeKernelOptions()` still takes `defaultReapInterval` but now also accepts `defaultReapGCKrefs` The APIs available to userspace code (through `vatAdminSvc`) are unchanged (partially due to upgrade/backwards-compatibility limitations), and continue to only support setting `reapInterval`. Internally, this just modifies `reapDirtThreshold.deliveries`. * `E(vatAdminSvc).createVat(bcap, { reapInterval })` * `E(adminNode).upgrade(bcap, { reapInterval })` * `E(adminNode).changeOptions({ reapInterval })` Internally, the kernel-wide state records `defaultReapDirtThreshold` instead of `defaultReapInterval`, and each vat records `.reapDirtThreshold` in their `vNN.options` key instead of `vNN.reapInterval`. The current dirt level is recorded in `vNN.reapDirt`. The kernel will automatically upgrade both the kernel-wide and the per-vat state upon the first reboot with the new kernel code. The old `reapCountdown` value is used to initialize the vat's `reapDirt.deliveries` counter, so the upgrade shouldn't disrupt the existing schedule. Vats which used `reapInterval = 'never'` (eg comms) will get a `reapDirtThreshold` of all 'never' values, so they continue to inhibit BOYD. Otherwise, all vats get a `threshold.gcKrefs` of 20. We do not track dirt when the corresponding threshold is 'never', to avoid incrementing the comms dirt counters forever. This design leaves room for adding `.computrons` to the dirt record, as well as tracking a separate `snapshotDirt` counter (to trigger XS heap snapshots, ala #6786). We add `reapDirtThreshold.computrons`, but do not yet expose an API to set it. Future work includes: * upgrade vat-vat-admin to let userspace set `reapDirtThreshold` New tests were added to exercise the upgrade process, and other tests were updated to match the new internal initialization pattern. We now reset the dirt counter upon any BOYD, so this also happens to help with #8665 (doing a `reapAllVats()` resets the delivery counters, so future BOYDs will be delayed, which is what we want). But we should still change `controller.reapAllVats()` to avoid BOYDs on vats which haven't received any deliveries. closes #8980
`dispatch.bringOutYourDead()`, aka "reap", triggers garbage collection inside a vat, and gives it a chance to drop imported c-list vrefs that are no longer referenced by anything inside the vat. Previously, each vat has a configurable parameter named `reapInterval`, which defaults to a kernel-wide `defaultReapInterval` (but can be set separately for each vat). This defaults to 1, mainly for unit testing, but real applications set it to something like 200. This caused BOYD to happen once every 200 deliveries, plus an extra BOYD just before we save an XS heap-state snapshot. This commit switches to a "dirt"-based BOYD scheduler, wherein we consider the vat to get more and more dirty as it does work, and eventually it reaches a `reapDirtThreshold` that triggers the BOYD (which resets the dirt counter). We continue to track `dirt.deliveries` as before, with the same defaults. But we add a new `dirt.gcKrefs` counter, which is incremented by the krefs we submit to the vat in GC deliveries. For example, calling `dispatch.dropImports([kref1, kref2])` would increase `dirt.gcKrefs` by two. The `reapDirtThreshold.gcKrefs` limit defaults to 20. For normal use patterns, this will trigger a BOYD after ten krefs have been dropped and retired. We choose this value to allow the #8928 slow vat termination process to trigger BOYD frequently enough to keep the BOYD cranks small: since these will be happening constantly (in the "background"), we don't want them to take more than 500ms or so. Given the current size of the large vats that #8928 seeks to terminate, 10 krefs seems like a reasonable limit. And of course we don't want to perform too many BOYDs, so `gcKrefs: 20` is about the smallest threshold we'd want to use. External APIs continue to accept `reapInterval`, and now also accept `reapGCKrefs`. * kernel config record * takes `config.defaultReapInterval` and `defaultReapGCKrefs` * takes `vat.NAME.creationOptions.reapInterval` and `.reapGCKrefs` * `controller.changeKernelOptions()` still takes `defaultReapInterval` but now also accepts `defaultReapGCKrefs` The APIs available to userspace code (through `vatAdminSvc`) are unchanged (partially due to upgrade/backwards-compatibility limitations), and continue to only support setting `reapInterval`. Internally, this just modifies `reapDirtThreshold.deliveries`. * `E(vatAdminSvc).createVat(bcap, { reapInterval })` * `E(adminNode).upgrade(bcap, { reapInterval })` * `E(adminNode).changeOptions({ reapInterval })` Internally, the kernel-wide state records `defaultReapDirtThreshold` instead of `defaultReapInterval`, and each vat records `.reapDirtThreshold` in their `vNN.options` key instead of `vNN.reapInterval`. The current dirt level is recorded in `vNN.reapDirt`. The kernel will automatically upgrade both the kernel-wide and the per-vat state upon the first reboot with the new kernel code. The old `reapCountdown` value is used to initialize the vat's `reapDirt.deliveries` counter, so the upgrade shouldn't disrupt the existing schedule. Vats which used `reapInterval = 'never'` (eg comms) will get a `reapDirtThreshold` of all 'never' values, so they continue to inhibit BOYD. Otherwise, all vats get a `threshold.gcKrefs` of 20. We do not track dirt when the corresponding threshold is 'never', to avoid incrementing the comms dirt counters forever. This design leaves room for adding `.computrons` to the dirt record, as well as tracking a separate `snapshotDirt` counter (to trigger XS heap snapshots, ala #6786). We add `reapDirtThreshold.computrons`, but do not yet expose an API to set it. Future work includes: * upgrade vat-vat-admin to let userspace set `reapDirtThreshold` New tests were added to exercise the upgrade process, and other tests were updated to match the new internal initialization pattern. We now reset the dirt counter upon any BOYD, so this also happens to help with #8665 (doing a `reapAllVats()` resets the delivery counters, so future BOYDs will be delayed, which is what we want). But we should still change `controller.reapAllVats()` to avoid BOYDs on vats which haven't received any deliveries. closes #8980
Both `snapStore.deleteVatSnapshots()` and `transcriptStore.deleteVatTranscripts()` now take a numeric `budget=` argument, which will limit the number of snapshots or spans deleted in each call. Both return a `{ done, cleanups }` record so the caller knows when to stop calling. This enables the slow deletion of large vats (lots of transcript spans or snapshots), a small number of items at a time. Recommended budget is 5, which (given SwingSet's `snapInterval=200` default) will cause the deletion of 1000 rows from the `transcriptItems` table each call, which shouldn't take more than 100ms. refs #8928
This introduces new `runPolicy()` controls which enable "slow termination" of vats. When configured, terminated vats are immediately dead (all promises are rejected, all new messages go splat, they never run again), however the vat's state is deleted slowly, one piece at a time. This makes it safe to terminate large vats, with a long history, lots of c-list imports/exports, or large vatstore tables, without fear of causing an overload (by e.g. dropping 100k references all in a single crank). See docs/run-policy.md for details and configuration instructions. refs #8928
This introduces new `runPolicy()` controls which enable "slow termination" of vats. When configured, terminated vats are immediately dead (all promises are rejected, all new messages go splat, they never run again), however the vat's state is deleted slowly, one piece at a time. This makes it safe to terminate large vats, with a long history, lots of c-list imports/exports, or large vatstore tables, without fear of causing an overload (by e.g. dropping 100k references all in a single crank). See docs/run-policy.md for details and configuration instructions. refs #8928
This introduces new `runPolicy()` controls which enable "slow termination" of vats. When configured, terminated vats are immediately dead (all promises are rejected, all new messages go splat, they never run again), however the vat's state is deleted slowly, one piece at a time. This makes it safe to terminate large vats, with a long history, lots of c-list imports/exports, or large vatstore tables, without fear of causing an overload (by e.g. dropping 100k references all in a single crank). See docs/run-policy.md for details and configuration instructions. refs #8928
Both `snapStore.deleteVatSnapshots()` and `transcriptStore.deleteVatTranscripts()` now take a numeric `budget=` argument, which will limit the number of snapshots or spans deleted in each call. Both return a `{ done, cleanups }` record so the caller knows when to stop calling. This enables the slow deletion of large vats (lots of transcript spans or snapshots), a small number of items at a time. Recommended budget is 5, which (given SwingSet's `snapInterval=200` default) will cause the deletion of 1000 rows from the `transcriptItems` table each call, which shouldn't take more than 100ms. refs #8928
This introduces new `runPolicy()` controls which enable "slow termination" of vats. When configured, terminated vats are immediately dead (all promises are rejected, all new messages go splat, they never run again), however the vat's state is deleted slowly, one piece at a time. This makes it safe to terminate large vats, with a long history, lots of c-list imports/exports, or large vatstore tables, without fear of causing an overload (by e.g. dropping 100k references all in a single crank). See docs/run-policy.md for details and configuration instructions. refs #8928
PR #9227 implements the swingset side of this. We'll still need changes to cosmic-swingset to provide a suitable |
Both `snapStore.deleteVatSnapshots()` and `transcriptStore.deleteVatTranscripts()` now take a numeric `budget=` argument, which will limit the number of snapshots or transcript spans deleted in each call. Both return a `{ done, cleanups }` record so the caller knows when to stop calling. This enables the slow deletion of large vats (lots of transcript spans or snapshots), a small number of items at a time. Recommended budget is 5, which (given SwingSet's `snapInterval=200` default) will cause the deletion of 1000 rows from the `transcriptItems` table each call, which shouldn't take more than 100ms. Without this, the kernel's attempt to slowly delete a terminated vat would succeed in slowly draining the kvStore, but would trigger a gigantic SQL transaction at the end, as it deleted every transcript item in the vat's history. The worst-case example I found would be the mainnet chain's v43-walletFactory, which (as of apr-2024) has 8.2M transcript items in 40k spans. A fast machine takes two seconds just to count all the items, and deletion took 22 *minutes*, with a `swingstore.wal` file that peaked at 27 GiB. This would cause an enormous chain stall at some surprising point in time weeks or months after the vat was first terminated. In addition, both the transcript spans and the snapshot records are shadowed into IAVL (via `export-data`) for integrity, and deleting 40k+40k=80k IAVL records in a single block might cause some significant churn too. refs #8928
This introduces new `runPolicy()` controls which enable "slow termination" of vats. When configured, terminated vats are immediately dead (all promises are rejected, all new messages go splat, they never run again), however the vat's state is deleted slowly, one piece at a time. This makes it safe to terminate large vats, with a long history, lots of c-list imports/exports, or large vatstore tables, without fear of causing an overload (by e.g. dropping 100k references all in a single crank). See docs/run-policy.md for details and configuration instructions. refs #8928
…eanup ...in otherwise empty blocks Fixes #8928
…eanup ...in otherwise empty blocks Fixes #8928
…eanup ...in otherwise empty blocks Fixes #8928
…eanup ...in otherwise empty blocks Fixes #8928
…eanup ...in otherwise empty blocks Fixes #8928
…eanup ...in otherwise empty blocks Fixes #8928
…eanup ...in otherwise empty blocks Fixes #8928
…eanup ...in otherwise empty blocks Fixes #8928
…eanup ...in otherwise empty blocks Fixes #8928
What is the Problem Being Solved?
We've been brainstorming ways to address the large price-feed vats which are holding onto a lot of virtual objects (#8400 , plus their contribution to #8401). While the fixes to stop their growth are coming along nicely, we will still eventually need a way to delete all the old data. The simplest (and most desirable) approach is to simply terminate those vats, however with the current kernel, that would perform an unacceptably large amount of work, and trigger some O(N^2) performance bugs in the kernel GC handling code, which would probably crash the chain.
#8877 is a workaround to have the vat release objects slowly, to avoid these problems. But a better / longer-term solution is to allow the kernel to safely terminate large vats, yet still perform the cleanup work slowly, over time, so it doesn't interfere with normal operations.
This ticket is about an approach which would fix all of the problems with the price-feed vats, as well as any vat which performed a "malicious large termination" (aka the "I'm too big to fail, hahaha" attack). It would still be vulnerable to vats which
collection.clear()
or delete a large collection (for which #8417 would be the fix).The rough idea is to change
terminateVat
to merely mark the vat as terminated, and not immediately delete all the late vat's data (delete vatstore entries, delete clist entries and propagate decrefs, abandon exported objects). We would defer that work until later "cleanup" steps, performed when therunPolicy
says it's a good time (eg if/when there is remaining budget at the end of a block), and only do a small amount of work at a time. This work could be spaced out over months.The kernel would still reject all the vat's outstanding Promises at the time of termination, so that their downstream effects are felt promptly. This isn't a problem with our current vats (
v29-ATOM-USD_price_feed
only decides three promises), but could be a too-big-to-fail attack vector in the future. The kernel would also immediately kill the worker process.Description of the Design
the old design
When
terminateVat()
(kernel.js) is called, it does the following steps immediately:kernelKeeper.enumeratePromisesByDecider()
to find all the promises decided by this vat, and then callresolveToError()
on all of themkernelKeeper.cleanupAfterTerminatedVat()
to delete the vat's statevatKeeper.deleteSnapshotsAndTranscript()
to delete historical/replay-enabling statec.o+
exports, andorphanKernelObject()
each of themc.o-
imports, anddeleteCListEntry
each, which decrements refcountsvat.dynamicIDs
and/orvat.names
notifyTermination()
to let vat-admin fire thedone()
promise, so userspace (eg parent vat) can reactvatWarehouse.stopWorker()
to shut down the xsnap workerWhen the vat has a large number of non-exported virtual objects (eg most of the QuotePayments in #8400), step B4 takes a long time. When those objects are exported, B2 takes a long time (eg the QuotePayments that are weakly referenced by vat-board).
When the vat is importing a lot of objects, such as the #8401 cycles, B3 takes a long time, and also adds a lot of krefs to
maybeFreeKrefs
, and at end-of-crank, theprocessRefCounts()
call will add a lot of actions togcActions
. At the start of the next crank,processGCActionSet
will deliver a large GC delivery to the upstream vat (likedispatch.dropImports()
with 100k krefs into v9-zoe), which will trigger a very long delivery (with some multiple of 100k syscalls, all of which must get added into the transcript entry (causing bad memory behavior and JSON.stringify serialization time). Then, the next time v9-zoe does a BOYD, it will do some other multiple of 100k syscalls (to check on the status of all those entries, and probably free up additional objects as the cycles unwind), causing the same many-syscalls transcript problem, after which it will do its own very largesyscall.dropImports
orsyscall.retireExports
call with 100k vrefs, triggering more work.the new design
We'll introduce a new
kvStore
key namedvats.terminated
, whose value is a simple JSON-serialized array of VatID strings.KernelKeeper
will keep an in-RAM cache of it as a Set namedterminatedVats
, write a sorted version tokvStore
every time we change it, and populate the cache at kernel boot (and write an empty list if the key did not already exist, which is our backwards-compatibility/upgrade point).We'll change
kernelKeeper.ownerOfKernelObject
to consult both${kref}.owner
andterminatedVats
, so an object whose.owner
is still present, but that vatID is terminated, will report the object as orphaned. We also changekernelKeeper.vatIsAlive
to reportfalse
for terminated vats (either by having it consultterminatedVats
too, or by preemptively deleting the${vatID.o.nextID}
key that it looks for).Next, we change
terminateVat()
to add the vatID toterminatedVats
, and skip steps B2, B3, and B4. This will leave the bulk of the vat's data (and the cleanup work) for later, allowingterminateVat()
to finish quickly (and without provoking immediate large work).Now, we introduce a new internal
kernel.doSomeCleanup(vatID, budget)
method which performs up tobudget
total steps of:orphanKernelObject()
)deleteCListEntry()
)c.o+
export entries left to perform B1 on)terminatedVats
To get this work scheduled, we change the
runPolicy
to add anallowCleanup()
method. If present, this can return{ budget }
, or can return{}
to mean "do all the cleanup now". If it returns something falsy, that tells the kernel that no cleanup work should be done at this time. For backwards compatibility, if the method is missing, the kernel acts as if it returned{}
(unlimited cleanup).Then we add a new pseudo-run-queue message
{ type: 'cleanup', vatID, budget }
, a sibling to GC actions and reap-actions and actual run-queue messages likedeliver
andnotify
. We changegetNextMessageAndProcessor()
function to takerunPolicy
as an argument, and change its priority list to consultterminatedVats
before looking at anything else (but only if the policy allows cleanup right now). So its sequence will look like:budget = policyInput.allowCleanup()
This provides a default behavior that nearly matches the old one: vat termination work runs to completion on the step just after the termination happened, but still before any other work gets done.
But, the final step is to change cosmic-swingset to provide a
runPolicy.allowCleanup = () => false
for all the existing runs, then to perform one final run (with=> ({ budget })
) iff there is still room left in the block (eg if the previous runs did not exhaust the computron budget). I thinkbudget = 10
or100
is likely.So assuming the block that terminates the vat has free space, the final run will orphan eg 100 objects of the dead vat. If we assume that all these objects were also recognized by eg vat-board (and #7212 gets fixed), then this will push
dispatch.retireImports()
GC actions onto the queue, and these will be delivered as the second crank of this cleanup run. That may cause the recognzing vat to delete some collection entries, which incur syscalls and might create more garbage in those vats, but no more than 100 at a time.The next non-full block will orphan 100 more. This will repeat, 100 krefs per block, until all the exports have been orphaned. The following block will begin on the c-list imports. Each block will decref 100 krefs, which will immediately push a 100-kref
dispatch.dropExports
call onto the GC action queue, as well as a similarretireExports
call. Then thedropExports
will be delivered, causing eg v9-zoe to do syscalls (change export-status torecognizable
) on all 100, and accumulating 100 vrefs in its internal liveslotspossiblyDeadSet
. Next, theretireExports
will be delivered, which will trigger syscalls to delete the export status. Each block will process 100 krefs in this way.Eventually, v9-zoe will get enough deliveries to trigger a BOYD, which will perform more vatstore syscalls to check for remaining reference counts, and when those counts come up empty, liveslots will delete those virtual objects, which will trigger more deletions (unwinding the cycles). The amount of work will be bounded by the number of accumulated deleted objects, which only happen 100 at a time.
Once all the terminated vat's imports are deleted, the next non-full block will start on B3, and will delete 100 vatstore entries. This doesn't cause any activity in other vats, so the only cost is
kvStore
activity (which also causes IAVL churn, but only 100 entries at a time). This continues until all remaining vat state is deleted.Then the final non-full block can remove the vatID from
terminatedVats
, and the kernel has finally completely forgotten about the vat. The next call togetNextMessageAndProcessor()
will not find any cleanup work to do.The final requirement is that we run BOYDs frequently enough to avoid aggregating our carefully-spaced-out cleanup work back into a single large delivery. Our current scheduler has two ways to trigger a BOYD:
${vatID.reapInterval}
triggers a BOYD every 1000 deliverieskernel.snapshotInterval
triggers a heap snapshot (and associated BOYD) every 200 deliveriesstartVat
If we didn't change that, and we assume that eg v9-zoe is otherwise idle, we'd observe 100 pairs of
dropExports(100 krefs)
andretireExports(100 krefs)
before a BOYD was triggered, so v9-zoe would be doing cleanup on 10k krefs at a time. This might be larger than we want (certainly better than doing all 200k at once, but we'd really prefer to keep things limited to thebudget
, notbudget * snapshotInterval
).So the last step is to change the way we schedule BOYDs. As @mhofman has recommended, the scheduler should pay attention to more than just the number of deliveries. We can count computrons, deliveries, and also the number of krefs that have been dropped or retired. When any of these values grows above some threshold, we can trigger a BOYD and heap snapshot.
(the trick is to avoid doing multiple heap snapshots in a single block, because that's just wasted effort)
Security Considerations
We must continue to maintain some invariants.
Dead vats tell no tales
We stop the vat worker, and delete the metadata keys that would allow vat-warehouse to start a new one. So the dead vat will never again get agency, so it never gets a chance to emit new messages.
You can't speak to the dead (vat)
All messages and deliveries to the now-dead vat must go splat. We ensure this by having all message deliveries check for the vatID in
terminatedVats
first.other security-ish considerations
We don't want other vat code to be able to tell that termination work is being spread out over time. By rejecting all the promises immediately (and firing
done()
right away), we provoke the only userspace-visible consequences of termination. The remainder are GC effects (such as WeakMap entries being deleted), which we've been careful to hide from userspace.This change should close off a significant portion of attacks in which a malicious vat attempts to overload the kernel with cleanup work. It does not prevent high-cardinality
syscall.dropImports()
: those are better managed with a kernel-side policy to terminate vats which try to drop too much (just like a too-many-computrons-per-delivery policy), combined with an #8417 -style liveslots change to avoid getting terminated.It also does not prevent the threat of lots of outstanding Promises. We could choose to address that by only reject a few of them each time through (moving step A into the slow path), trading off kernel robustness against letting other vats react promptly to the termination event.
Scaling Considerations
This should enable the kernel to survive the termination of large-scale vats.
It does, of course, cost more time/CPU overall. The data structures it accesses are all constant-time, so the overhead comes from the extra per-block checks (like the
kvStore.getNextKey()
to see if there is more work to be done). But I think the number of extra calls will be minimal.Test Plan
Unit tests on the kernel to terminate a vat with lots of state and observe that the work is spread out over many calls. Unit tests on the new runPolicy() method and its effects.
Before deployment (specifically before we trigger a vat termination), we'll need a main-fork simulation of deleting both large price-feed vats. I'm particularly interested/concerned with the shadow-into-IAVL overhead of the DB churn. I think this approach gives us the best chance of reducing that churn to a managable size, but I still want to know exactly how long the real chain will take to perform this cleanup, ane especially how the eventual cosmos-sdk state pruning event plays out.
Upgrade Considerations
The only swing-store (kvStore) change is the new
vats.terminated
key. This will be missing in vN-1, and when vN starts, the kernel will notice the missing key and initialize it to an empty list. No vats will be in the partially-terminated state upon entry to vN, so we don't need to accomodate any existing state over than the "missing key" one.The text was updated successfully, but these errors were encountered: