
terminate old mainnet price-feed vats #9483

Open
5 of 6 tasks
warner opened this issue Jun 11, 2024 · 5 comments
Labels
cosmic-swingset package: cosmic-swingset enhancement New feature or request

Comments


warner commented Jun 11, 2024

What is the Problem Being Solved?

#8400 identified misbehavior in the mainnet price-feed vats which caused their storage and c-list use to grow constantly over time. #8401 identified misbehavior in zoe/contract interactions that caused similar symptoms. The cause of these problems has been fixed, and these fixes will be deployed in mainnet upgrade-16, in #9370. This will upgrade Zoe to stop producing most #8401 cycles, and will introduce new price-feed vats that do not suffer from the #8400 growth.

Unfortunately that deployment will (necessarily) leave the old bloated price-feed vats in place, which will retain the troublesome storage and c-list entries (both in the price-feed vats themselves, and in the upgraded vat-zoe because of the cycles). Terminating the price-feed vats would free up that storage, but until the changes of #8928 are deployed, deleting that much data in a single step would crash the kernel, or at least cause a multi-hour-long stall.

So the task, after #8928 is deployed, is to deploy an action to the chain that will terminate the old price-feed vats.

Description of the Design

  • first, wait until the "price-feeds core-eval" is deployed, which introduces new price-feed vats (and, most importantly, switches the Inter-protocol vats to pay attention to them, and stop paying attention to the old ones)
  • then, wait until the code changes from terminate vats slowly #8928 are deployed, which requires a chain-halting kernel upgrade
  • then, deploy something that will terminate all the old price-feed vats:
    • v29-ATOM-USD_price_feed
    • v46-scaledPriceAuthority-ATOM
    • v68-stATOM-USD_price_feed
    • v69-scaledPriceAuthority-stATOM
    • v98-stOSMO-USD_price_feed
    • v99-scaledPriceAuthority-stOSMO
    • v104-stTIA-USD_price_feed
    • v105-scaledPriceAuthority-stTIA
    • v111-stkATOM-USD_price_feed
    • v112-scaledPriceAuthority-stkATOM
    • (although, in practice, we should probably terminate a single vat first, probably v111 or v112 because they're the smallest, and get some hands-on experience with the load this causes, before we start more aggressive cleanup)

I think @Chris-Hibbert told me that Zoe retains the adminNode necessary to call E(adminNode).terminateWithFailure(), but does not expose it to the bootstrap/core-eval environment, which means a single core-eval would not be sufficient to get these vats terminated.

One option is to wait for #8687 to be deployed to mainnet (also in a chain-halting kernel upgrade), and then have the outside-the-kernel upgrade handler call controller.terminateVat() on each of these vatIDs. These two steps could happen in the same chain-halting upgrade.
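To make the first option concrete, here is a minimal sketch of the loop such an upgrade handler might run. A `terminateVat(vatID, reason)` shape is assumed for the #8687 API, and a mock controller stands in for the real one (which only exists inside the host application), so this is an illustration rather than deployment code:

```javascript
// Sketch (not real deployment code): the loop an outside-the-kernel upgrade
// handler might run once #8687's controller.terminateVat() is available.
// The vatIDs are the old price-feed vats listed above; the reason string
// and the terminateVat(vatID, reason) signature are assumptions.
const oldPriceFeedVatIDs = [
  'v29', 'v46', 'v68', 'v69', 'v98',
  'v99', 'v104', 'v105', 'v111', 'v112',
];

const terminateOldVats = (controller, vatIDs) => {
  for (const vatID of vatIDs) {
    controller.terminateVat(vatID, 'terminating old mainnet price-feed vat');
  }
};

// mock controller, for illustration only
const terminated = [];
const mockController = {
  terminateVat: (vatID, _reason) => terminated.push(vatID),
};
terminateOldVats(mockController, oldPriceFeedVatIDs);
console.log(terminated.length); // 10
```

In practice (per the note above) we would probably run this for a single small vat first, rather than all ten at once.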

Another option is to change Zoe somehow to allow governance to trigger terminateWithFailure(), and then perform governance actions to get the vats terminated. This would be better from an authority point of view, but would require more upgrade steps to get there, and might or might not fit our long-term model of how authority should be distributed.

Security Considerations

The ability to terminate a vat must be closely held, as continued operation of a contract is a fundamental part of what the chain platform promises. #8687 is not an authority violation because it is not reachable by any vat (it must be invoked by the host application, outside the kernel entirely), but that doesn't mean it's a comfortable superpower to exercise. It'd be nice to find a good "in-band" means to get these old vats shut down.

Scaling Considerations

We must be careful to not overload the chain. The #8928 slow-termination feature enables this, and introduces a rate limit to control how much cleanup work is allowed during each idle block. We need to carefully test and closely monitor the cleanup rate to make sure the "background work" does not cause problems with validators. We know that doing significant swingset work causes the slower validators to miss the voting window (the number of votes per block drops significantly for a block or two after non-trivial swingset runs). The cleanup parameters are chosen to keep the cleanup work down to about 100ms on my local machine, but we don't know exactly how long that will take on the slower validators. There will always be a slowest validator, and giving them work to do will always have the potential to knock them out of the pool, but we should try to avoid too much churn, especially because cleanup will be ongoing for weeks or months, and there won't be any idle blocks during that time, so they will not have the same chance to catch up as they would normally have.

If multiple vats have been terminated-but-not-deleted, #8928 only does cleanup on one vat at a time. So (in theory) it should be safe to delete all the vats at once, and the rate-limiting code will slowly crawl through all of the first vat's garbage before it eventually moves on to the second, etc. But this needs to be tested carefully.
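The one-vat-at-a-time behavior can be illustrated with a toy model (plain JavaScript, not actual kernel code; the garbage counts and budget are made up):

```javascript
// Toy model of #8928-style cleanup: `queue` holds per-vat garbage counts for
// terminated-but-not-deleted vats; each "block" spends at most `budget` units
// of cleanup work, draining the first vat completely before touching the next.
const runCleanupBlock = (queue, budget) => {
  let remaining = budget;
  while (remaining > 0 && queue.length > 0) {
    const used = Math.min(remaining, queue[0].garbage);
    queue[0].garbage -= used;
    remaining -= used;
    if (queue[0].garbage === 0) queue.shift(); // vat fully deleted
  }
  return budget - remaining; // work actually done this block
};

// e.g. two terminated vats and a budget of 5 units per block
const queue = [
  { vatID: 'v111', garbage: 7 },
  { vatID: 'v112', garbage: 4 },
];
let blocks = 0;
while (queue.length > 0) {
  runCleanupBlock(queue, 5);
  blocks += 1;
}
console.log(blocks); // 3 blocks: 5 units, then 2+3, then the final 1
```

The real rate limiter counts specific categories of work (exports, imports, kv entries, etc.) rather than a single number, but the ordering property is the same: the second vat's garbage is untouched until the first vat's is gone.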

Test Plan

We definitely need an a3p-like test to exercise whatever API we end up using.

In addition, we need to find some way to simulate and measure the actual rate-limiting. I intend to experiment with mainfork-based clones of the mainnet state, deploy a hacked-up version of this ticket's changes, and watch how cleanup unfolds on real mainnet-sized data.

Upgrade Considerations

This is all about performing a chain upgrade to trigger this deletion.

Tasks

  1. Zoe automerge:rebase performance resource-exhaustion
    Chris-Hibbert
  2. SwingSet automerge:rebase swing-store
    warner
  3. SwingSet automerge:rebase
    warner
  4. Zoe enhancement
    Chris-Hibbert
  5. SwingSet enhancement performance
    gibson042 warner
  6. Vaults auction contract-upgrade oracle
    Chris-Hibbert
@warner warner added the enhancement New feature or request label Jun 11, 2024
@Chris-Hibbert

I think @Chris-Hibbert told me that Zoe retains the adminNode necessary to call E(adminNode).terminateWithFailure(), but does not expose it to the bootstrap/core-eval environment,

Zoe has it, and uses it to terminate in various circumstances, to return the done() promise, and a few other things, but it doesn't provide access to it externally. Zcf has it, so the contract can terminate itself. Zoe doesn't have any way to recognize a governor, so it can't provide access, but the contractGovernor inside the contract has access to ZCF.

We ought to be able to (in the future) give governance the ability to terminate the contract, but existing contracts would require both their governor and the contract to be upgraded to take advantage of this.


warner commented Aug 21, 2024

@toliaqat and I were talking about the full set of development/testing tasks necessary before we trigger this deletion, not all of which have issues. The tasks on my mind are:

  • land the slow-deletion code (terminate vats slowly #8928), which includes:
    • kernel-side implementation (done)
    • cosmic-swingset integration (in progress)
    • cosmic-swingset agd tx gov to set params (in progress)
    • unit test to show agd tx gov works (not started)
  • some test to show that agd tx gov params actually make it to swingset
  • give Zoe the ability to terminate contract vats (can/should zoe expose vat-termination on the instance admin facet? #9716)
  • a3p test to:
    • create a throwaway vat
    • give it enough traffic to make it large enough to not get deleted in a single block
    • trigger deletion (using core-eval and can/should zoe expose vat-termination on the instance admin facet? #9716)
    • wait for the right number of blocks to pass
    • examine the slogfile for evidence that deletion was slow
    • examine the DB for evidence that it was deleted
  • deploy replacement price-feed vats to mainnet, stop using old ones (upcoming core-eval?)
  • deploy slow-deletion kernel/cosmic-swingset to mainnet (chain-halting upgrade)
  • deploy can/should zoe expose vat-termination on the instance admin facet? #9716 to mainnet (core-eval which upgrades vat-zoe)
  • perform a mainfork test which deletes a small pricefeed vat, confirm predicted deletion rate
  • build many-diverse-validator testnet, build up a pricefeed vat, terminate it, measure deletion load

Once all those are in place, we can perform a core-eval that will terminate the first/smallest of the unused price-feed vats (probably v111-stkATOM). We'll need to watch the chain carefully when that activates, and be prepared to ask for a governance vote to reduce the deletion rate if it looks like it's causing problems. v111 will probably take a week or so to finish deletion. If that goes well, we can do another core-eval to delete the matching scaledPriceAuthority (v112, which is also pretty small). We might also increase the deletion rate if it seems safe. Eventually we might do a core-eval that terminates multiple vats, to reduce the number of votes needed to delete all ten (2 vats each for the 5 denoms: ATOM, stATOM, stTIA, stOSMO, stkATOM).

many-diverse-validator testnet

We can't afford to build a full mainfork-cloned network with 100 validators, that's just too much compute and disk IO and disk space. But we'd like to build confidence that our deletion load is not going to cause validators to fall behind.

So we might build a testnet with many validators (maybe even 100), but with a mostly empty state. We could get virtual machines of varying CPU speeds, but we'll probably also want to introduce deliberate slowdowns to mimic what's happening on mainnet. Just a sleep() in the cosmic-swingset code before it starts doing kernel processing might suffice, maybe using a magic number in the hostname to decide how much to delay.
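A sketch of the hostname-based delay idea (entirely hypothetical; the naming scheme and multiplier are illustrative, not an existing convention in cosmic-swingset):

```javascript
// Hypothetical helper: derive an artificial per-block delay (in ms) from a
// "magic number" embedded in a validator's hostname, e.g. "val-slow7" => 700ms.
// cosmic-swingset would sleep for this long before starting kernel processing.
const delayMsFromHostname = hostname => {
  const match = /slow(\d+)/.exec(hostname);
  return match ? Number(match[1]) * 100 : 0;
};

console.log(delayMsFromHostname('val-slow7.testnet')); // 700
console.log(delayMsFromHostname('val-fast.testnet')); // 0
```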

Then we experiment with various delays, trying to create a mix of validators that roughly matches what we currently see on mainnet. Currently, a PushPrice that takes 1.5s doesn't have a large effect on validators (we might only miss one or two signatures on the next block, and get back to the full set of 100 votes by the second block). But a block that spends 3 or more seconds doing compute will cause more validators to miss the voting window.

The intention is to convince ourselves that our synthetic variety of validators is suffering the same missing votes (due to large computation) as our real mainnet validators are. Then, we build up a vat to be large enough that deleting its state takes a couple of hours. Then we trigger that termination/deletion, and watch the resulting block times and vote counts carefully.

The goal is to be confident that deleting a vat at a given rate/budget will not cause unpredicted/undue disruption to the chain. I picked a starting budget of 5 exports/imports per block, and that translates fairly precisely into a certain number of DB changes (and thus IAVL changes) in each block. But the actual impact of that computation/DB-churn depends upon the host's IO bandwidth and other things that probably vary from one machine to another. I can't measure that impact exactly without actually triggering the deletion on mainnet, and if that is disruptive, it would cause validators to miss votes, block times to increase, and other transactions to be impacted. So I want to find a way to simulate it as closely as possible beforehand. A testnet with just a few validators won't cut it, nor will a testnet where all of the validators run at the same speed. Hence the hope to have enough validators, and enough different validators, to see the same kind of "slow computation causes missed votes" impact as we see on mainnet, so we can then test to see if the slowness of vat-deletion cranks is enough to cause the same problem.


warner commented Oct 4, 2024

adminNodes, zoe adminFacets, and how to reach them

Vats are terminated by using E(adminNode).terminateWithFailure(reason) on the adminNode, which is part of the return value of E(vatAdminService).createVat(). These adminNodes are exported by vat-vat-admin and encapsulate the vatID.

For contract vats, Zoe holds onto this vat-admin adminNode in four durable objects: InstanceStorageManager, instanceAdmin, zoeInstanceAdmin, and adminFacet. Of these, the first two are not exported, and the zoeInstanceAdmin is only shared with the ZCF agent inside the contract vat. The adminNode itself does not leave vat-zoe, so our only way to terminate the vat is to ask Zoe to do it for us. #9716 is about adding a terminate() method to the zoe adminFacet, which transforms our requirement into getting access to this adminFacet.

The zoe adminFacet is returned as part of the result of E(zoe).startInstance(). Typically this arrives on v1-bootstrap in the core-eval or core-proposal code that creates the contract instance. Whether, and how, the adminFacet is stored depends upon how the core-eval/proposal was written. This has changed over time, so what matters for us is how the proposal that created these old vats was written.

Scanning the DB to trace object references

I've been scanning a recent swingstore snapshot to understand where the adminFacet is being held. Here are some notes on how I did this scan and what I found. As background, here are some of the vats we'll be looking at:

  • v1-bootstrap
  • v2-vat-admin
  • v6-agoricNames
  • v7-board
  • v9-zoe
  • v43-walletFactory

We'll be looking at swingset's internal representation of durable object state and durable collections, for which the SwingSet vatstore-usage.md document will be helpful.

  • durable object Kinds are defined by keys like v9.vs.vom.dkind.13.descriptor (for VatID v9 and durable KindID 13), which includes a tag: field that records the name of the Kind/Exo
  • durable object instances have their state recorded in keys like v9.vs.vom.o+d13/4 (for KindID 13 and instance number 4)
  • durable collections are defined by keys like v9.vs.vc.7.|schemata, which has a label: field that records any name assigned to the collection, and sometimes has a keyShape or valueShape that can help explain what the collection is for
    • the actual type of the collection (weak vs strong, set vs map, durable vs merely-virtual) must be deduced by finding the vref of the collection itself, see vatstore-usage.md for details
  • the items inside a durable collection are stored in keys like v9.vs.vc.7.sStringKey (for collection7.get('stringKey')), or v9.vs.vc.7.r0000000013:o-2514 (using a Presence with vref o-2514 as a key)

We will also be looking at vat c-lists, which are stored in keys like v9.c.o-2514 (e.g. ${vatID}.c.${vref}, whose value is a kref, and some refcount info), and v9.c.ko2828 (e.g. ${vatID}.c.${kref}, whose value is a vref).
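For illustration, a small helper (not an existing tool) that classifies c-list keys of the two forms just described. It assumes krefs look like `ko123`/`kd45` and vrefs look like `o-112`/`o+d13/4`:

```javascript
// Illustrative parser for swingstore c-list keys: `${vatID}.c.${vref}` entries
// map a vref to a kref, while `${vatID}.c.${kref}` entries map back the other
// way. We tell the two apart by whether the trailing ref looks like a kref.
const parseClistKey = key => {
  const match = /^(v\d+)\.c\.(.*)$/.exec(key);
  if (!match) return undefined; // not a c-list key
  const [, vatID, ref] = match;
  const direction = /^k[od]\d+$/.test(ref) ? 'kref->vref' : 'vref->kref';
  return { vatID, ref, direction };
};

console.log(parseClistKey('v9.c.o-2514'));
// { vatID: 'v9', ref: 'o-2514', direction: 'vref->kref' }
console.log(parseClistKey('v9.c.ko2828'));
// { vatID: 'v9', ref: 'ko2828', direction: 'kref->vref' }
```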

We'll reduce the full key-value table by filtering it into some subsets, like this:

$ sqlite3 swingstore.sqlite 'SELECT key,value FROM kvStore' |sort >all-kv.txt
$ grep '^v2\.' all-kv.txt >v2
$ grep '^v2\.vs\.' v2 >v2.vs
$ grep '^v2\.c\.' v2 >v2.c

Note that sqlite3 defaults to a pipe (|) character as a separator between the key and the value. This collides with some of our keys (the durable-collection |schemata key in particular), which would interfere with automated analysis tools (which should really import better-sqlite3 and access the DB directly), but which doesn't cause a problem for manual analysis.

We happen to know that v2-vat-admin implements adminNodes as durable objects, whose state.vatID holds the associated vatID.

v24-economicCommittee

We are planning a proposal to replace v24-economicCommittee with a new vat. This will leave v24 ready to be deleted. As the smallest, it is a good candidate to be the first to be deleted, so if the slow-deletion code runs faster than expected, we should still be able to handle the DB load.

  • grep v2.vs for the adminNode that records v24 in its state
% grep v24 v2.vs
v2.vs.vc.5.sv24|{"body":"#null","slots":[]}
v2.vs.vom.o+d13/19|{"vatID":{"body":"#\"v24\"","slots":[]}}

The vc.5 collection holds a Set of managed vats, indexed by VatID. The o+d13/19 durable object is the adminNode.

  • find the kref under which that vref was exported

The \b is to avoid accidentally matching on e.g. o+d13/192.

% grep 'o+d13/19\b' v2.c
v2.c.ko261|R o+d13/19
v2.c.o+d13/19|ko261

So ko261 is the kref for the v24 adminNode.

  • find all importers of that kref
% grep 'ko261\b' all-kv.txt
ko261.owner|v2
ko261.refCount|1,1
v2.c.ko261|R o+d13/19
v2.c.o+d13/19|ko261
v9.c.ko261|R o-112
v9.c.o-112|ko261

This shows v2 exporting the object, and v9 (zoe) importing it as vref o-112

  • find all references to the vref inside v9-zoe
% grep 'o-112\b' v9.vs
v9.vs.vom.o+d26/12|{"instanceState":{"body":"#\"$0.Alleged: InstanceRecord\"","slots":["o+d27/12"]},"adminNode":{"body":"#\"$0.Alleged: adminNode\"","slots":["o-112"]},"root":{"body":"#\"$0.Alleged: undefined\"","slots":["o-113"]},"functions":{"body":"#\"#undefined\"","slots":[]}}
v9.vs.vom.o+d33/12|{"offerFilterStrings":{"body":"#[]","slots":[]},"publicFacet":{"body":"#\"$0.Alleged: Committee publicFacet\"","slots":["o-125"]},"handleOfferObj":{"body":"#\"$0.Alleged: handleOfferObj\"","slots":["o-124"]},"zoeInstanceStorageManager":{"body":"#\"$0.Alleged: InstanceStorageManager instanceStorageManager\"","slots":["o+d26/12:1"]},"seatHandleToZoeSeatAdmin":{"body":"#\"$0.Alleged: weakMapStore\"","slots":["o+d7/24"]},"instanceHandle":{"body":"#\"$0.Alleged: InstanceHandle\"","slots":["o+d29/12"]},"acceptingOffers":{"body":"#true","slots":[]},"zoeSeatAdmins":{"body":"#\"$0.Alleged: setStore\"","slots":["o+d8/51"]},"adminNode":{"body":"#\"$0.Alleged: adminNode\"","slots":["o-112"]}}
v9.vs.vom.o+d34/12|{"instanceStorage":{"body":"#\"$0.Alleged: InstanceStorageManager instanceStorageManager\"","slots":["o+d26/12:1"]},"instanceAdmin":{"body":"#\"$0.Alleged: instanceAdmin instanceAdmin\"","slots":["o+d33/12:1"]},"seatHandleToSeatAdmin":{"body":"#\"$0.Alleged: weakMapStore\"","slots":["o+d7/24"]},"adminNode":{"body":"#\"$0.Alleged: adminNode\"","slots":["o-112"]}}
v9.vs.vom.o+d37/12|{"adminNode":{"body":"#\"$0.Alleged: adminNode\"","slots":["o-112"]},"zcfBundleCap":{"body":"#\"$0.Alleged: device node\"","slots":["d-70"]},"contractBundleCap":{"body":"#\"$0.Alleged: device node\"","slots":["d-76"]}}
v9.vs.vom.rc.o-112|4

This shows our adminNode being held in four durable objects (and having a vom.rc.${vref} refcount of 4):

  • o+d26/12
  • o+d33/12
  • o+d34/12
  • o+d37/12

We can figure out what these durable object Kinds (/ Exos) are by looking at their labels:

% grep descriptor v9.vs
..
v9.vs.vom.dkind.26.descriptor|{"kindID":"26","tag":"InstanceStorageManager","facets":["helpers","instanceStorageManager","withdrawFacet"],"stateShapeCapData":{"body":"#\"#undefined\"","slots":[]}}
v9.vs.vom.dkind.33.descriptor|{"kindID":"33","tag":"instanceAdmin","facets":["helper","instanceAdmin"],"stateShapeCapData":{"body":"#\"#undefined\"","slots":[]}}
v9.vs.vom.dkind.34.descriptor|{"kindID":"34","tag":"zoeInstanceAdmin","unfaceted":true,"stateShapeCapData":{"body":"#{\"adminNode\":{\"#tag\":\"match:remotable\",\"payload\":{\"label\":\"adminNode\"}},\"instanceAdmin\":{\"#tag\":\"match:remotable\",\"payload\":{\"label\":\"InstanceAdmin\"}},\"instanceStorage\":{\"#tag\":\"match:remotable\",\"payload\":{\"label\":\"ZoeInstanceStorageManager\"}},\"seatHandleToSeatAdmin\":{\"#tag\":\"match:kind\",\"payload\":\"remotable\"}}","slots":[]}}
v9.vs.vom.dkind.37.descriptor|{"kindID":"37","tag":"adminFacet","unfaceted":true,"stateShapeCapData":{"body":"#\"#undefined\"","slots":[]}}

So dkind.26 is an InstanceStorageManager, dkind.33 is an instanceAdmin, dkind.34 is zoeInstanceAdmin, and dkind.37 is the adminFacet.

and we can determine which (if any) of these are exported by looking for their vrefs in the v9 c-list:

% grep 'o+d26/12\b' v9.c
% grep 'o+d33/12\b' v9.c
% grep 'o+d34/12\b' v9.c
v9.c.ko266|R o+d34/12
v9.c.o+d34/12|ko266
% grep 'o+d37/12\b' v9.c
v9.c.ko293|R o+d37/12
v9.c.o+d37/12|ko293

which shows that the first two are not exported at all, the o+d34/12 zoeInstanceAdmin is exported as kref ko266, and the o+d37/12 adminFacet is exported as kref ko293.

Then we check to see who else imports those krefs

% grep 'ko266\b' all-kv.txt
ko266.owner|v9
ko266.refCount|1,1
v24.c.ko266|R o-52
v24.c.o-52|ko266
v9.c.ko266|R o+d34/12
v9.c.o+d34/12|ko266

% grep 'ko293\b' all-kv.txt
ko293.owner|v9
ko293.refCount|1,1
v1.c.ko293|R o-217
v1.c.o-217|ko293
v9.c.ko293|R o+d37/12
v9.c.o+d37/12|ko293

The o+d34/12 zoeInstanceAdmin is imported by v24-economicCommittee itself (held by the ZCF agent inside that contract vat), which allows the vat to self-terminate, but doesn't help us terminate it from the outside.

The o+d37/12 adminFacet is imported by v1-bootstrap as o-217, which is more promising. Let's look inside that vat to see where it is held:

% grep 'o-217\b' v1.vs
v1.vs.vc.5.seconomicCommitteeKit|{"body":"#{\"adminFacet\":\"$0.Alleged: adminFacet\",\"creatorFacet\":\"$1.Alleged: Committee creatorFacet\",\"creatorInvitation\":\"#undefined\",\"instance\":\"$2.Alleged: InstanceHandle\",\"label\":\"economicCommittee\",\"publicFacet\":\"$3.Alleged: Committee publicFacet\"}","slots":["o-217","o-218","o-219","o-220"]}
v1.vs.vc.7.r0000000009:o-219|{"body":"#{\"adminFacet\":\"$0.Alleged: adminFacet\",\"creatorFacet\":\"$1.Alleged: Committee creatorFacet\",\"creatorInvitation\":\"#undefined\",\"instance\":\"$2.Alleged: InstanceHandle\",\"label\":\"economicCommittee\",\"publicFacet\":\"$3.Alleged: Committee publicFacet\"}","slots":["o-217","o-218","o-219","o-220"]}
v1.vs.vom.rc.o-217|2

This shows that collection vc.5 is holding it (along with creatorFacet and creatorInvitation) as a value, indexed by the string economicCommitteeKit. That is quite promising. It is also being held as a value of vc.7, indexed by a Presence for vref o-219. We can look at the collection labels to find out what vc.5 and vc.7 are:

% grep schemata v1.vs
..
v1.vs.vc.5.|schemata|{"body":"#{\"keyShape\":{\"#tag\":\"match:scalar\",\"payload\":\"#undefined\"},\"label\":\"Bootstrap Powers\"}","slots":[]}
v1.vs.vc.7.|schemata|{"body":"#{\"keyShape\":{\"#tag\":\"match:scalar\",\"payload\":\"#undefined\"},\"label\":\"ContractKits\"}","slots":[]}

We can also figure out the types of these collections by finding a copy of the right vref, extracting the KindID, and comparing it against the collection types:

% grep '/7\b' v1.vs
v1.vs.vc.1.sContractKits|{"body":"#\"$0.Alleged: mapStore\"","slots":["o+d6/7"]}
v1.vs.vc.5.scontractKits|{"body":"#\"$0.Alleged: mapStore\"","slots":["o+d6/7"]}
v1.vs.vom.rc.o+d6/7|2
# so o+d6/7 is the vref for vc.7, and it is of type 6
% grep storeKindIDTable v1.vs
v1.vs.storeKindIDTable|{"scalarMapStore":2,"scalarWeakMapStore":3,"scalarSetStore":4,"scalarWeakSetStore":5,"scalarDurableMapStore":6,"scalarDurableWeakMapStore":7,"scalarDurableSetStore":8,"scalarDurableWeakSetStore":9}
# type 6 is scalarDurableMapStore
% grep '/5\b' v1.vs |grep vom
v1.vs.vom.rc.o+d6/5|2

Both vc.5 and vc.7 are scalarDurableMapStores, which means they are strong maps (rather than weak), so their keys could be enumerated. This doesn't matter here, but might be relevant if the only reference to the object we need is living in the key of a MapStore, rather than in its value.

We can find out where vc.5 (i.e. o+d6/5) is being held:

% grep 'o+d6/5\b' v1.vs
v1.vs.vc.1.sBootstrap Powers|{"body":"#\"$0.Alleged: mapStore\"","slots":["o+d6/5"]}
v1.vs.vc.5.spowerStore|{"body":"#\"$0.Alleged: mapStore\"","slots":["o+d6/5"]}
v1.vs.vom.rc.o+d6/5|2

This shows it is available in the vc.1 collection under the string key Bootstrap Powers. vc.1 happens to be "baggage". So the following code, executed in the v1-bootstrap namespace, would return the v24-economicCommittee zoe adminFacet:

baggage.get('Bootstrap Powers').get('economicCommitteeKit').adminFacet

Our core-evals don't quite run in that environment, but I'm pretty sure that Bootstrap Powers are somehow available to them.

v29-ATOM-USD_price_feed

  • adminNode is v2 o+d13/24 = ko362 = v9-zoe o-166
  • held in adminFacet o+d37/13 = ko401 = v1-bootstrap o-279 = v26-ATOM-USD_price_feed-governor o-69
  • v1-bootstrap holds o-279 in vc.8 under key o-276 = ko366
    • vc.8 is another scalarDurableMapStore (type 6) with label GovernedContractKits
    • and vc.8 is stored in baggage (vc.1) under the key GovernedContractKits
    • as well as in vc.5 ("Bootstrap Powers")
    • the o-276 key is an InstanceHandle, exported by zoe, and frequently used to name a contract instance without granting authority over it
  • ko366 has refCount 7: v15:o-62, v26:o-71, v43:o-120, v6:o-121, v7:o-124, v9:o+d29/17, v29:o-54, v1:o-276
  • v6-agoricNames holds it under vc.16 in a key named 'ATOM-USD price feed'
  • v7-board holds it under vc.5 in a key named 'board02963'

So we can probably retrieve the adminFacet with:

const handle = E(board).getValue('board02963');
const adminFacet = baggage.get('GovernedContractKits').get(handle).adminFacet;

v26-ATOM-USD_price_feed-governor

  • adminNode: v2 o+d13/21 = ko338 = v9-zoe o-148
  • zoe adminFacet: v9 o+d37/14 = ko410 = v1-bootstrap o-272
  • v1-bootstrap holds o-272 in value of vc.8, keyed by o-276
  • InstanceHandle: v1 o-276 = ko366, imported by 7 vats
  • v1 o-276 = v15 o-62 = v26 o-71 = v29 o-54 = v43 o-120 = v6 o-121 = v7 o-124 = v9 o+d29/17
  • v7-board holds it under 'board02963'
  • v6-agoricNames holds in vc.16 under 'ATOM-USD price feed'

So we should be able to use the same approach as above.

v46-scaledPriceAuthority-ATOM

  • adminNode: v2 o+d13/41 = ko2826 = v9-zoe o-338
  • zoe adminFacet: v9 o+d37/33 = ko2843 = v1 o-2512
  • v1-bootstrap holds o-2512 in vc.7 ("ContractKits") under key o-2514 (InstanceHandle)
  • InstanceHandle: v1 o-2514 = ko2828 = v46 o-54 = v9 o+d29/34

So the InstanceHandle is only imported by v1 (for use as a key) and v46 (used by ZCF). This makes it awkward to retrieve, but fortunately v1-bootstrap's vc.7 ContractKits is a strong MapStore, which means we can iterate through its keys. So we can probably do something like:

let adminFacet;
for (const [, value] of baggage.get('ContractKits').entries()) {
  if (value.label === 'scaledPriceAuthority-ATOM') {
    adminFacet = value.adminFacet;
  }
}

This isn't as safe/satisfying as using a board ID to reach an InstanceHandle and onward to the adminFacet, and we should confirm our assumption that there is only one such entry in ContractKits, but it should work.

v45-auctioneer and v44-auctioneer.governor

The auctioneer vat has a less happy story.

  • adminNode: v2 o+d13/40 = ko644 = v9 o-322
  • zoe adminFacet: v9 o+d37/32 = ko2842 = v44 o-69
  • adminFacet is not imported by v1-bootstrap, only by v44-auctioneer.governor

Note that we replaced this vat in gov76 (04-sep-2024) with v157-auctioneer and an associated v156-auctioneer.governor, leaving the original vats running but now unused. The replacement process might have overwritten some storage (probably not, but we should keep it in mind).

More importantly, the zoe adminFacet for this vat is not currently imported by v1-bootstrap. It is held by the governor vat (v44), but that's the only import.

@Chris-Hibbert found the code that was meant to store the auctioneer's adminFacet:

  auctioneerKit.resolve(
    harden({
      label: 'auctioneer',
      creatorFacet: governedCreatorFacet,
      adminFacet: governorStartResult.adminFacet,
      publicFacet: governedPublicFacet,
      instance: governedInstance,

      governor: governorStartResult.instance,
      governorCreatorFacet: governorStartResult.creatorFacet,
      governorAdminFacet: governorStartResult.adminFacet,
    }),
  );

Note the subtle typo in the adminFacet: line: it records governorStartResult.adminFacet rather than the governed contract's adminFacet. That means it recorded the governor's adminFacet twice, under two different names, when it was intended to record the adminFacets of the auctioneer and its governor separately.

We can confirm this by finding the v44-auctioneer.governor 's adminFacet and looking at the v1-bootstrap record which holds it:

  • adminNode: v2 o+d13/39 = ko603 = v9 o-293
  • zoe adminFacet: v9 o+d37/34 = ko2582 = v1 o-2518
% grep 'o-2518\b' v1
v1.c.ko2852|R o-2518
v1.c.o-2518|ko2852
v1.vs.vc.8.r0000000003:o-2522|{"body":"#{\"adminFacet\":\"$0.Alleged: adminFacet\",\"creatorFacet\":\"$1.Alleged: Auctioneer creatorFacet\",\"governor\":\"$2.Alleged: InstanceHandle\",\"governorAdminFacet\":\"$0\",\"governorCreatorFacet\":\"$3.Alleged: ContractGovernorKit creator\",\"instance\":\"$4.Alleged: InstanceHandle\",\"label\":\"auctioneer\",\"publicFacet\":\"$5.Alleged: publicFacet\"}","slots":["o-2518","o-2523","o-2520","o-2519","o-2522","o-2524"]}
v1.vs.vom.rc.o-2518|1

That shows a record which maps strings to Presences:

{ adminFacet: o-2518,
  creatorFacet: o-2523,
  governor: o-2520,
  governorAdminFacet: o-2518, // note the duplicate $0
  governorCreatorFacet: o-2519,
  instance: o-2522,
  label: 'auctioneer',
  publicFacet: o-2524,
}

So we accidentally dropped the auctioneer's adminFacet. Oops. It turns out we did this for the replacement auctioneer (v157) as well.

Our best thought here is to update the governor to expose a power (to the governing committee) to terminate the contract vat.
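A hedged sketch of what that might look like: `zcf.shutdownWithFailure(reason)` is a real ZCF method (it's how a contract self-terminates), but the governor wiring shown here is hypothetical, and a mock ZCF stands in so the sketch is self-contained:

```javascript
// Hypothetical wiring: give the governor a power (to be exposed only to the
// governing committee) that terminates the governed contract via its ZCF.
// zcf.shutdownWithFailure(reason) exists in ZCF; the shape of the terminator
// object and its method name are illustrative only.
const makeGovernorTerminator = zcf => ({
  terminate: reason => zcf.shutdownWithFailure(reason),
});

// mock ZCF, for illustration
let shutdownReason;
const mockZcf = {
  shutdownWithFailure: reason => {
    shutdownReason = reason;
  },
};
makeGovernorTerminator(mockZcf).terminate(Error('replaced by new auctioneer'));
console.log(shutdownReason instanceof Error); // true
```

As noted above, existing contracts would need both their governor and the contract itself upgraded before any such power could reach the old vats.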


warner commented Oct 4, 2024

voteCounter vats

We also have about 100 voteCounter vats that I'd like to get rid of. We're keeping these around because they each hold the only copy of the results, but once we figure out how to move that into some (centralized) longer-lived location, it would be nice to delete the unnecessary vats.

Unfortunately, it looks like nothing is holding onto the zoe adminFacet for those vats, and it was garbage-collected. Looking at the oldest such vat, v49-voteCounter:

% grep v49 v2.vs
v2.vs.vom.o+d13/44|{"vatID":{"body":"#\"v49\"","slots":[]}}
% grep o+d13/44 v2.c
v2.c.o+d13/44|ko6208
% grep ko6208 all-kv.txt
ko6208.refCount|1,1
v9.c.ko6208|R o-1161
% grep o-1161 v9.vs
v9.vs.vom.o+d26/37|{"instanceState": ...
v9.vs.vom.o+d33/37|{"offerFilterStrings": ...
v9.vs.vom.o+d34/37|{"instanceStorage": ...
v9.vs.vom.rc.o-1161|3

Note that there is no o+d37/* object holding o-1161, indicating that the adminFacet has been deleted entirely. These vats lack governors, so it seems likely that nothing held onto the adminFacet after creation, and the objects were simply garbage-collected and deleted upstream.

In fact if we look at the number of o+d26/* "InstanceStorageManager" objects, vs that of o+d37/* "adminFacet" objects:

% grep 'vom\.o+d26/' v9.vs |wc -l
151

% grep 'vom\.o+d37/' v9.vs |wc -l
60

then it is clear that we've dropped a significant number of adminFacet objects: only 60 remain out of 151 instances, so we've lost 91. In this DB snapshot, we have 84 voteCounter vats, which accounts for most of them.

The vat does have an InstanceHandle, and it is registered in v7-board as 'board02575', but those handles are powerless.

So to delete the voteCounter vats, we're probably going to have to use controller.terminateVat(). We could also make some change to Zoe's internals to let it be instructed to find these vats and terminate them, but it's not clear what a safe API could possibly look like.


warner commented Oct 4, 2024

@michaelfig pointed out that:

consume.startUpgradable in agoric-sdk/vats/src/core/basic-behaviors.js is a promise for a function that stashes all the pieces returned by E(zoe).startInstance in consume.contractKits, a Promise<MapStore<Instance, StartedInstanceKitWithLabel>>

so modern deployment scripts should be better at holding on to everything.

mergify bot added a commit that referenced this issue Oct 9, 2024
closes: #9584
closes: #9928
refs: #9827 
refs: #9748 
refs: #9382
closes: #10031

## Description

We added an upgrade of the scaledPriceAuthority to the steps for upgrading vaults, auctions, and priceFeeds, and didn't notice that it broke things. The problem turned out to be that the "priceAuthority" object registered with the priceFeedRegistry was an ephemeral object that was not upgraded. This fixes that by re-registering the new priceAuthority.

Then, to simplify the process of cleaning up the uncollected cycles reported in #9483, we switched to replacing the scaledPriceAuthorities rather than upgrading them.

We also realized that we would need different coreEvals in different environments, since the Oracle addresses and particular assets vary for each (test and mainNet) chain environment.

#9748 addressed some of the issues in the original coreEval. #9999 showed what was needed for upgrading priceFeeds, which was completed by #9827.  #10021 added some details on replacing scaledPriceAuthorities.

### Security Considerations

N/A

### Scaling Considerations

Addresses one of our biggest scaling issues.

### Documentation Considerations

N/A

### Testing Considerations

Thorough testing in a3p, and our testnets.  #9886 discusses some testing to ensure Oracles will work with the upgrade.

### Upgrade Considerations

See above