feat(swingset): allow slow termination/deletion of vats

This introduces new `runPolicy()` controls which enable "slow termination" of vats. When configured, terminated vats are immediately dead (all promises are rejected, all new messages go splat, they never run again), but the vat's state is deleted slowly, one piece at a time. This makes it safe to terminate large vats (those with a long history, lots of c-list imports/exports, or large vatstore tables) without fear of causing an overload (e.g. by dropping 100k references all in a single crank). See docs/run-policy.md for details and configuration instructions. refs #8928
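
As a rough illustration of how a host might use the new controls, here is a minimal sketch, not taken from the commit. The names `makeBudgetedPolicy`, `simulateRun`, and `TYPICAL_COMPUTRONS` are hypothetical, `Object.freeze` stands in for `harden` so the snippet runs outside SES, and `simulateRun` is a toy loop that only imitates the allowCleanup/didCleanup handshake described in docs/run-policy.md (it assumes at most two cleanups per step). It is not the kernel.

```javascript
// Assumed "typical" cost used when a crank reports no computrons.
const TYPICAL_COMPUTRONS = 100_000n;

// Hypothetical policy: stop deliveries at a computron limit (BigInt) and
// allow at most `cleanupBudget` cleanups (plain Number) per run.
function makeBudgetedPolicy(computronLimit, cleanupBudget) {
  let total = 0n;
  let remaining = cleanupBudget;
  const underLimit = () => total < computronLimit;
  return Object.freeze({
    vatCreated() {
      total += 1_000_000n; // pretend vat creation takes 1M computrons
      return underLimit();
    },
    crankComplete(details = {}) {
      total += details.computrons ?? TYPICAL_COMPUTRONS;
      return underLimit();
    },
    crankFailed() {
      total += 65_000_000n;
      return underLimit();
    },
    emptyCrank: () => true,
    // Falsy prohibits cleanup; otherwise offer whatever budget is left.
    allowCleanup: () => (remaining > 0 ? { budget: remaining } : false),
    didCleanup({ cleanups }) {
      remaining -= cleanups;
      return true; // keep running; allowCleanup() enforces the budget
    },
  });
}

// Toy stand-in for the kernel loop: `pending` counts cleanup items left.
function simulateRun(policy, pending) {
  let done = 0;
  for (;;) {
    // A missing allowCleanup() means unlimited cleanup (old behavior).
    const allow = policy.allowCleanup ? policy.allowCleanup() : {};
    if (!allow || pending === 0) break;
    const budget = allow.budget ?? Infinity;
    const cleanups = Math.min(budget, pending, 2); // assume 2 per step
    pending -= cleanups;
    done += cleanups;
    // A missing didCleanup() is treated as "keep running".
    const { didCleanup = () => true } = policy;
    if (!didCleanup({ cleanups })) break;
  }
  return done;
}
```

With a budget of 5 and 12 pending items, the toy loop stops once the policy has been told about 5 cleanups, leaving the rest for a later run with a fresh policy object.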
Agoric · Jun 14, 2024 · dab29e7
1 parent 0126e7d
Showing 15 changed files with 808 additions and 32 deletions.
diff --git a/packages/SwingSet/docs/run-policy.md b/packages/SwingSet/docs/run-policy.md
@@ -35,7 +35,12 @@ The kernel will invoke the following methods on the policy object (so all must e
 * `policy.crankFailed()`
 * `policy.emptyCrank()`
 
-All methods should return `true` if the kernel should keep running, or `false` if it should stop.
+All those methods should return `true` if the kernel should keep running, or `false` if it should stop.
+
+The following methods are optional (for backwards compatibility with policy objects created for older kernels):
+
+* `policy.allowCleanup()` : may return budget, see "Terminated-Vat Cleanup" below
+* `policy.didCleanup({ cleanups })` (if missing, kernel pretends it returned `true` to keep running)
 
 The `computrons` argument may be `undefined` (e.g. if the crank was delivered to a non-`xs worker`-based vat, such as the comms vat). The policy should probably treat this as equivalent to some "typical" number of computrons.
 
@@ -53,6 +58,27 @@ More arguments may be added in the future, such as:
 
 The run policy should be provided as the first argument to `controller.run()`. If omitted, the kernel defaults to `forever`, a policy that runs until the queue is empty.
 
+## Terminated-Vat Cleanup
+
+Some vats may grow very large (i.e. large c-lists with lots of imported/exported objects, or lots of vatstore entries). If/when these are terminated, the burst of cleanup work might overwhelm the kernel, especially when processing all the dropped imports (which trigger GC messages to other vats).
+
+To protect the system against these bursts, the run policy can be configured to terminate vats slowly. Instead of doing all the cleanup work immediately, the policy allows the kernel to do a little bit of work each time `controller.run()` is called (e.g. once per block, for kernels hosted inside a blockchain).
+
+There are two RunPolicy methods which control this. The first is `runPolicy.allowCleanup()`. This will be invoked many times during `controller.run()`, each time the kernel tries to decide what to do next (once per step). The return value will enable (or not) a fixed amount of cleanup work. The second is `runPolicy.didCleanup({ cleanups })`, which is called later, to inform the policy of how much cleanup work was actually done. The policy can count the cleanups and switch `allowCleanup()` to return `false` when it reaches a threshold. (We need the pre-check `allowCleanup` method because the simple act of looking for cleanup work is itself a cost that we might not be able to afford.)
+
+If `allowCleanup()` exists, it must either return a falsy value, or an object. This object may have a `budget` property, which must be a number.
+
+A falsy return value (e.g. `allowCleanup: () => false`) prohibits cleanup work. This can be useful in an "only clean up during idle blocks" approach (see below), but should not be the only policy used, otherwise vat cleanup would never happen.
+
+A numeric `budget` limits how many cleanups are allowed to happen (if any are needed). One "cleanup" will delete one vatstore row, or one c-list entry (note that c-list deletion may trigger GC work), or one heap snapshot record, or one transcript span (and its populated transcript items). Using `{ budget: 5 }` seems to be a reasonable limit on each call, balancing overhead against doing sufficiently small units of work that we can limit the total work performed.
+
+If `budget` is missing or `undefined`, the kernel will perform unlimited cleanup work. This also happens if `allowCleanup()` is missing entirely, which maintains the old behavior for host applications that haven't been updated to make new policy objects. Note that cleanup is higher priority than anything else, followed by GC work, then BringOutYourDead, then message delivery.
+
+`didCleanup({ cleanups })` is called when the kernel actually performed some vat-termination cleanup, and the `cleanups` property is a number with the count of cleanups that took place. Each query to `allowCleanup()` might (or might not) be followed by a call to `didCleanup`, with a `cleanups` value that does not exceed the specified budget.
+
+To limit the work done per block (for blockchain-based applications) the host's RunPolicy objects must keep track of how many cleanups were reported, and change the behavior of `allowCleanup()` when it reaches a per-block threshold. See below for examples.
+
+
 ## Typical Run Policies
 
 A basic policy might simply limit the block to 100 cranks with deliveries and two vat creations:
@@ -78,6 +104,7 @@ function make100CrankPolicy() {
       return true;
     },
   });
+  return policy;
 }
 ```
 
@@ -95,15 +122,15 @@ while(1) {
 
 Note that a new policy object should be provided for each call to `run()`.
 
-A more sophisticated one would count computrons. Suppose that experiments suggest that one million computrons take about 5 seconds to execute. The policy would look like:
+A more sophisticated one would count computrons. Suppose that experiments suggest that sixty-five million computrons take about 5 seconds to execute. The policy would look like:
 
 
 ```js
 function makeComputronCounterPolicy(limit) {
-  let total = 0;
+  let total = 0n;
   const policy = harden({
     vatCreated() {
-      total += 100000; // pretend vat creation takes 100k computrons
+      total += 1_000_000n; // pretend vat creation takes 1M computrons
       return (total < limit);
     },
     crankComplete(details) {
@@ -112,18 +139,119 @@ function makeComputronCounterPolicy(limit) {
       return (total < limit);
     },
     crankFailed() {
-      total += 1000000; // who knows, 1M is as good as anything
+      total += 65_000_000n; // who knows, 65M is as good as anything
       return (total < limit);
     },
     emptyCrank() {
       return true;
     }
   });
+  return policy;
 }
 ```
 
 See `src/runPolicies.js` for examples.
 
+To slowly terminate vats, limiting each block to 5 cleanups, the policy should start with a budget of 5, return the remaining `{ budget }` from `allowCleanup()`, and decrement it as `didCleanup` reports that budget being consumed:
+
+```js
+function makeSlowTerminationPolicy() {
+  let cranks = 0;
+  let vats = 0;
+  let cleanups = 5;
+  const policy = harden({
+    vatCreated() {
+      vats += 1;
+      return (vats < 2);
+    },
+    crankComplete(details) {
+      cranks += 1;
+      return (cranks < 100);
+    },
+    crankFailed() {
+      cranks += 1;
+      return (cranks < 100);
+    },
+    emptyCrank() {
+      return true;
+    },
+    allowCleanup() {
+      if (cleanups > 0) {
+        return { budget: cleanups };
+      } else {
+        return false;
+      }
+    },
+    didCleanup(spent) {
+      cleanups -= spent.cleanups;
+      return true; // keep running; allowCleanup() enforces the budget
+    },
+  });
+  return policy;
+}
+```
+
+A more conservative approach might only allow cleanup in otherwise-empty blocks. To accomplish this, use two separate policy objects, and two separate "runs". The first run only performs deliveries, and prohibits all cleanups:
+
+```js
+function makeDeliveryOnlyPolicy() {
+  let empty = true;
+  const didWork = () => { empty = false; return true; };
+  const policy = harden({
+    vatCreated: didWork,
+    crankComplete: didWork,
+    crankFailed: didWork,
+    emptyCrank: didWork,
+    allowCleanup: () => false,
+  });
+  const wasEmpty = () => empty;
+  return [ policy, wasEmpty ];
+}
+```
+
+The second only performs cleanup, with a limited budget, stopping the run after any deliveries occur (such as GC actions):
+
+```js
+function makeCleanupOnlyPolicy() {
+  let cleanups = 5;
+  const stop = () => false;
+  const policy = harden({
+    vatCreated: stop,
+    crankComplete: stop,
+    crankFailed: stop,
+    emptyCrank: stop,
+    allowCleanup() {
+      if (cleanups > 0) {
+        return { budget: cleanups };
+      } else {
+        return false;
+      }
+    },
+    didCleanup(spent) {
+      cleanups -= spent.cleanups;
+      return true; // keep running; allowCleanup() enforces the budget
+    },
+  });
+  return policy;
+}
+```
+
+On each block, the host should only perform the second (cleanup) run if the first policy reports that the block was empty:
+
+```js
+async function doBlock() {
+  const [ firstPolicy, wasEmpty ] = makeDeliveryOnlyPolicy();
+  await controller.run(firstPolicy);
+  if (wasEmpty()) {
+    const secondPolicy = makeCleanupOnlyPolicy();
+    await controller.run(secondPolicy);
+  }
+}
+```
+
+Note that regardless of whatever computron/delivery budget is imposed by the first policy, the second policy will allow one additional delivery to be made (we do not yet have an `allowDelivery()` pre-check method that might inhibit this). The cleanup work, which may or may not happen, will sometimes trigger a GC delivery like `dispatch.dropExports`, but at most one such delivery will be made before the second policy returns `false` and stops `controller.run()`. If cleanup does not trigger such a delivery, or if no cleanup work needs to be done, then one normal run-queue delivery will be performed before the policy has a chance to say "stop". All other cleanup-triggered GC work will be deferred until the first run of the next block.
+
+Also note that `budget` and `cleanups` are plain `Number`s, whereas `computrons` is a `BigInt`.
+
+
 ## Non-Consensus Wallclock Limits
 
 If the SwingSet kernel is not being operated in consensus mode, then it is safe to use wallclock time as a block limit:

diff --git a/packages/SwingSet/src/kernel/kernel.js b/packages/SwingSet/src/kernel/kernel.js
@@ -266,12 +266,17 @@ export default function buildKernel(
       // (#9157). The fix will add .critical to CrankResults, populated by a
       // getOptions query in deliveryCrankResults() or copied from
       // dynamicOptions in processCreateVat.
-      critical = kernelKeeper.provideVatKeeper(vatID).getOptions().critical;
+      const vatKeeper = kernelKeeper.provideVatKeeper(vatID);
+      critical = vatKeeper.getOptions().critical;
 
       // Reject all promises decided by the vat, making sure to capture the list
       // of kpids before that data is deleted.
       const deadPromises = [...kernelKeeper.enumeratePromisesByDecider(vatID)];
-      kernelKeeper.cleanupAfterTerminatedVat(vatID);
+      // remove vatID from the list of live vats, and mark for deletion
+      kernelKeeper.deleteVatID(vatID);
+      kernelKeeper.addTerminatedVat(vatID);
+      // remove vat from swing-store exports
+      kernelKeeper.removeVat(vatID);
       for (const kpid of deadPromises) {
         resolveToError(kpid, makeError('vat terminated'), vatID);
       }
@@ -378,7 +383,8 @@ export default function buildKernel(
    *    abort?: boolean, // changes should be discarded, not committed
    *    consumeMessage?: boolean, // discard the aborted delivery
    *    didDelivery?: VatID, // we made a delivery to a vat, for run policy and save-snapshot
-   *    computrons?: BigInt, // computron count for run policy
+   *    computrons?: bigint, // computron count for run policy
+   *    cleanups?: number, // cleanup budget spent
    *    meterID?: string, // deduct those computrons from a meter
    *    measureDirt?: { vatID: VatID, dirt: Dirt }, // the dirt counter should increment
    *    terminate?: { vatID: VatID, reject: boolean, info: SwingSetCapData }, // terminate vat, notify vat-admin
@@ -642,16 +648,40 @@ export default function buildKernel(
     if (!vatWarehouse.lookup(vatID)) {
       return NO_DELIVERY_CRANK_RESULTS; // can't collect from the dead
     }
-    const vatKeeper = kernelKeeper.provideVatKeeper(vatID);
     /** @type { KernelDeliveryBringOutYourDead } */
     const kd = harden([type]);
     const vd = vatWarehouse.kernelDeliveryToVatDelivery(vatID, kd);
     const status = await deliverAndLogToVat(vatID, kd, vd);
-    vatKeeper.clearReapDirt(); // BOYD zeros out the when-to-BOYD counters
     // no gcKrefs, BOYD clears them anyways
     return deliveryCrankResults(vatID, status, false); // no meter, BOYD clears dirt
   }
 
+  /**
+   * Perform a small (budget-limited) amount of dead-vat cleanup work.
+   *
+   * @param {RunQueueEventCleanupTerminatedVat} message
+   *     'message' is the run-queue cleanup action, which includes a vatID and budget.
+   *     A budget of 'undefined' allows unlimited work. Otherwise, the budget is a Number,
+   *     and cleanup should not touch more than maybe 5*budget DB rows.
+   * @returns {Promise<CrankResults>}
+   */
+  async function processCleanupTerminatedVat(message) {
+    const { vatID, budget } = message;
+    const { done, cleanups } = kernelKeeper.cleanupAfterTerminatedVat(
+      vatID,
+      budget,
+    );
+    if (done) {
+      kernelKeeper.deleteTerminatedVat(vatID);
+      kernelSlog.write({ type: 'vat-cleanup-complete', vatID });
+    }
+    // We don't perform any deliveries here, so there are no computrons to
+    // report, but we do let the runPolicy know how much kernel-side DB
+    // work we did, so it can decide how much was too much.
+    const computrons = 0n;
+    return harden({ computrons, cleanups });
+  }
+
   /**
    * The 'startVat' event is queued by `initializeKernel` for all static vats,
    * so that we execute their bundle imports and call their `buildRootObject`
@@ -903,7 +933,6 @@ export default function buildKernel(
     const boydVD = vatWarehouse.kernelDeliveryToVatDelivery(vatID, boydKD);
     const boydStatus = await deliverAndLogToVat(vatID, boydKD, boydVD);
     const boydResults = deliveryCrankResults(vatID, boydStatus, false);
-    vatKeeper.clearReapDirt();
 
     // we don't meter bringOutYourDead since no user code is running, but we
     // still report computrons to the runPolicy
@@ -1159,6 +1188,7 @@ export default function buildKernel(
    * @typedef { import('../types-internal.js').RunQueueEventRetireImports } RunQueueEventRetireImports
    * @typedef { import('../types-internal.js').RunQueueEventNegatedGCAction } RunQueueEventNegatedGCAction
    * @typedef { import('../types-internal.js').RunQueueEventBringOutYourDead } RunQueueEventBringOutYourDead
+   * @typedef { import('../types-internal.js').RunQueueEventCleanupTerminatedVat } RunQueueEventCleanupTerminatedVat
    * @typedef { import('../types-internal.js').RunQueueEvent } RunQueueEvent
    */
 
@@ -1226,6 +1256,8 @@ export default function buildKernel(
     } else if (message.type === 'negated-gc-action') {
       // processGCActionSet pruned some negated actions, but had no GC
       // action to perform. Record the DB changes in their own crank.
+    } else if (message.type === 'cleanup-terminated-vat') {
+      deliverP = processCleanupTerminatedVat(message);
     } else if (gcMessages.includes(message.type)) {
       deliverP = processGCMessage(message);
     } else {
@@ -1285,6 +1317,10 @@ export default function buildKernel(
         // sometimes happens randomly because of vat eviction policy
         // which should not affect the in-consensus policyInput)
         policyInput = ['create-vat', {}];
+      } else if (message.type === 'cleanup-terminated-vat') {
+        const { cleanups } = crankResults;
+        assert(cleanups !== undefined);
+        policyInput = ['cleanup', { cleanups }];
       } else {
         policyInput = ['crank', {}];
       }
@@ -1318,7 +1354,9 @@ export default function buildKernel(
     const { computrons, meterID } = crankResults;
     if (computrons) {
       assert.typeof(computrons, 'bigint');
-      policyInput[1].computrons = BigInt(computrons);
+      if (policyInput[0] !== 'cleanup') {
+        policyInput[1].computrons = BigInt(computrons);
+      }
       if (meterID) {
         const notify = kernelKeeper.deductMeter(meterID, computrons);
         if (notify) {
@@ -1745,20 +1783,30 @@ export default function buildKernel(
    * Pulls the next message from the highest-priority queue and returns it
    * along with a corresponding processor.
    *
+   * @param {RunPolicy} [policy] - a RunPolicy to limit the work being done
    * @returns {{
    *   message: RunQueueEvent | undefined,
    *   processor: (message: RunQueueEvent) => Promise<PolicyInput>,
    * }}
    */
-  function getNextMessageAndProcessor() {
+  function getNextMessageAndProcessor(policy) {
     const acceptanceMessage = kernelKeeper.getNextAcceptanceQueueMsg();
     if (acceptanceMessage) {
       return {
         message: acceptanceMessage,
         processor: processAcceptanceMessage,
       };
     }
+    const allowCleanup = policy?.allowCleanup ? policy.allowCleanup() : {};
+    // false, or an object with optional .budget
+    if (allowCleanup) {
+      assert.typeof(allowCleanup, 'object');
+      if (allowCleanup.budget) {
+        assert.typeof(allowCleanup.budget, 'number');
+      }
+    }
     const message =
+      kernelKeeper.nextCleanupTerminatedVatAction(allowCleanup) ||
       processGCActionSet(kernelKeeper) ||
       kernelKeeper.nextReapAction() ||
       kernelKeeper.getNextRunQueueMsg();
@@ -1834,7 +1882,8 @@ export default function buildKernel(
     await null;
     try {
       kernelKeeper.establishCrankSavepoint('start');
-      const { processor, message } = getNextMessageAndProcessor();
+      const { processor, message } =
+        getNextMessageAndProcessor(foreverPolicy());
       // process a single message
       if (message) {
         await tryProcessMessage(processor, message);
@@ -1870,7 +1919,7 @@ export default function buildKernel(
       kernelKeeper.startCrank();
       try {
         kernelKeeper.establishCrankSavepoint('start');
-        const { processor, message } = getNextMessageAndProcessor();
+        const { processor, message } = getNextMessageAndProcessor(policy);
         if (!message) {
           break;
         }
@@ -1892,6 +1941,11 @@ export default function buildKernel(
           case 'crank-failed':
             policyOutput = policy.crankFailed(policyInput[1]);
             break;
+          case 'cleanup': {
+            const { didCleanup = () => true } = policy;
+            policyOutput = didCleanup(policyInput[1]);
+            break;
+          }
           case 'none':
             policyOutput = policy.emptyCrank();
             break;