# terminate old mainnet price-feed vats #9483
Zoe has it, and uses it to terminate the contract vat in various circumstances. We ought to be able to (in the future) give governance the ability to terminate the contract, but existing contracts would require both their governor and the contract to be upgraded to take advantage of this.
@toliaqat and I were talking about the full set of development/testing tasks necessary before we trigger this deletion, not all of which have issues. The tasks on my mind are:
Once all those are in place, we can perform a core-eval that will terminate the first/smallest of the unused price-feed vats (probably v111-stkATOM). We'll need to watch the chain carefully when that activates, and be prepared to ask for a governance vote to reduce the deletion rate if it looks like it's causing problems. v111 will probably take a week or so to finish deletion. If that goes well, we can do another core-eval next to delete the matching scaledPriceAuthority (v112, which is also pretty small). We might also increase the deletion rate if it seems safe. Eventually we might do a core-eval that terminates multiple vats, to reduce the number of votes needed to delete all ten (2 vats each for the 5 denoms: ATOM, stATOM, stTIA, stOSMO, stkATOM).

### many-diverse-validator testnet

We can't afford to build a full mainfork-cloned network with 100 validators: that's just too much compute, disk IO, and disk space. But we'd like to build confidence that our deletion load is not going to cause validators to fall behind. So we might build a testnet with many validators (maybe even 100), but with a mostly empty state. We could get virtual machines of varying CPU speeds, but we'll probably also want to introduce deliberate slowdowns to mimic what's happening on mainnet.

Then we experiment with various delays, trying to create a mix of validators that roughly matches what we currently see on mainnet. Currently, a PushPrice that takes 1.5s doesn't have a large effect on validators (we might only miss one or two signatures on the next block, and get back to the full set of 100 votes by the second block). But a block that spends 3 or more seconds doing compute will cause more validators to miss the voting window. The intention is to convince ourselves that our synthetic variety of validators suffers the same missing votes (due to large computation) as our real mainnet validators do.

Then we build up a vat to be large enough that deleting its state takes a couple of hours, trigger that termination/deletion, and watch the resulting block times and vote counts carefully. The goal is to be confident that deleting a vat at a given rate/budget will not cause unpredicted/undue disruption to the chain.

I picked a starting budget of 5 exports/imports per block, and that translates fairly precisely into a certain number of DB changes (and thus IAVL changes) in each block. But the actual impact of that computation/DB-churn depends upon the host's IO bandwidth and other things that probably vary from one machine to another. I can't measure that impact exactly without actually triggering the deletion on mainnet, and if that turns out to be disruptive, it will cause validators to miss votes, block times to increase, and other transactions to be impacted. So I want to find a way to simulate it as closely as possible beforehand. A testnet with just a few validators won't cut it, nor will a testnet where all of the validators run at the same speed. Hence the hope to have enough validators, and enough different validators, to see the same kind of "slow computation causes missed votes" impact as we see on mainnet, so we can then test whether the slowness of vat-deletion cranks is enough to cause the same problem.
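For a rough sense of scale, the cleanup budget translates into a total duration like this (a sketch only: the entry count and block time below are assumptions, not measured values for v111):

```js
// Back-of-the-envelope estimate of slow-deletion time at a given cleanup budget.
// The numbers here are placeholders, not measurements.
const entriesToClean = 500_000; // assumed c-list/kv entries in the terminated vat
const budgetPerBlock = 5; // cleanups allowed per (otherwise idle) block
const secondsPerBlock = 6; // approximate mainnet block time

const blocks = Math.ceil(entriesToClean / budgetPerBlock);
const days = (blocks * secondsPerBlock) / (24 * 60 * 60);
console.log(`${blocks} blocks, about ${days.toFixed(1)} days of background cleanup`);
```

Plugging in different budgets gives a feel for how much a governance vote to raise or lower the rate would change the overall duration.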
### adminNodes, zoe adminFacets, and how to reach them

Vats are terminated by using the vat-admin `adminNode` (e.g. `E(adminNode).terminateWithFailure()`). For contract vats, Zoe holds onto these vat-admin adminNodes itself, and hands the party that started the instance a zoe `adminFacet`. The zoe adminFacet is therefore the thing a core-eval would need to reach to get one of these contract vats terminated.

### Scanning the DB to trace object references

I've been scanning a recent swingstore snapshot to understand where the relevant adminNodes and adminFacets are referenced.
We'll be looking at swingset's internal representation of durable object state and durable collections, for which the SwingSet vatstore-usage.md document will be helpful.
We will also be looking at vat c-lists, which are stored in per-vat keys in the kernel's key-value store. We'll reduce the full key-value table by filtering it into some subsets; a sketch of that kind of filtering is below.
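This can be done with a small Node script against the snapshot (a sketch only: it assumes the snapshot is the usual `swingstore.sqlite` file with a `kvStore` table of key/value pairs, and uses `better-sqlite3`):

```js
// Sketch: pull a few useful subsets out of a swingstore snapshot.
import Database from 'better-sqlite3';

const db = new Database('swingstore.sqlite', { readonly: true });

// c-list entries for one vat: kref<->vref mappings under keys prefixed 'v9.c.'
const v9Clist = db
  .prepare(`SELECT key, value FROM kvStore WHERE key LIKE 'v9.c.%'`)
  .all();

// vatstore entries for one vat: durable object state, collections, etc. ('v9.vs.')
const v9Vatstore = db
  .prepare(`SELECT key, value FROM kvStore WHERE key LIKE 'v9.vs.%'`)
  .all();

console.log(`v9 c-list entries: ${v9Clist.length}`);
console.log(`v9 vatstore entries: ${v9Vatstore.length}`);
```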
Note that we happen to know that v2-vat-admin is the vat which implements the vat-admin service, so the adminNodes are objects exported by v2.

### v24-economicCommittee

We are planning a proposal to replace the economicCommittee, so let's trace where its adminFacet lives.
The vc.5 collection holds a Set of managed vats, indexed by VatID. From there we can find the kref of the economicCommittee's adminNode and look for it in the various c-lists:

This shows v2 exporting the object, and v9 (zoe) importing it under a vref of its own. Searching v9's vatstore for that vref:

This shows our adminNode being held in the state of some of zoe's durable objects.
We can figure out what these durable object Kinds (/ Exos) are by looking at their labels:
So dkind.26 is an adminFacet, and we can determine which (if any) of these are exported by looking for their vrefs in the v9 c-list:

which shows that the first two are not exported at all, while the rest are. Then we check to see who else imports those krefs:
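(The check itself is just a scan of every vat's c-list for that kref; a sketch, using a placeholder kref and the same assumed kvStore layout as above:)

```js
import Database from 'better-sqlite3';

const db = new Database('swingstore.sqlite', { readonly: true });
const kref = 'ko9999'; // placeholder: substitute the kref found in the scan above

// every c-list entry for this kernel object, i.e. every vat that knows about it
const rows = db
  .prepare(`SELECT key, value FROM kvStore WHERE key LIKE '%.c.' || ?`)
  .all(kref);
for (const { key, value } of rows) {
  console.log(`${key.split('.')[0]}: ${value}`);
}
```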
The kref turns out to be imported by v1-bootstrap, and the v1 c-list gives us the vref under which v1 knows it. Searching v1's vatstore for that vref:

This shows that v1's collection vc.5 is holding it (along with the other facets of the same kit).
We can also figure out the types of these collections by finding a copy of the right vref, extracting the KindID, and comparing it against the collection types:
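(A sketch of that comparison, using a placeholder vref; it assumes the vref shapes and the per-vat `storeKindIDTable` vatstore key described in vatstore-usage.md:)

```js
import Database from 'better-sqlite3';

const db = new Database('swingstore.sqlite', { readonly: true });

// Placeholder collection vref; real ones come from the scan above.
// Durable vrefs look like 'o+d6/5': KindID 6, instance/collection ID 5.
const vref = 'o+d6/5';
const kindID = vref.match(/^o\+[vd]?(\d+)\//)[1];

// Each vat's vatstore records which KindIDs correspond to which store types.
const row = db
  .prepare(`SELECT value FROM kvStore WHERE key = 'v1.vs.storeKindIDTable'`)
  .get();
const kindTable = JSON.parse(row.value); // e.g. { scalarDurableMapStore: 6, ... }
const typeName = Object.keys(kindTable).find(n => String(kindTable[n]) === kindID);
console.log(`kind ${kindID} is a ${typeName}`);
```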
Both vc.5 and vc.7 are (durable) MapStores. We can find out where vc.5 (i.e. the collection holding our adminFacet) is itself referenced by searching for its vref the same way:

This shows it is available in the vc.1 collection (the vat's baggage) under the string key 'Bootstrap Powers', which means code running in v1-bootstrap could do `baggage.get('Bootstrap Powers').get('economicCommitteeKit').adminFacet`. Our core-evals don't quite run in that environment, but I'm pretty sure that a core-eval can reach the same place.

### v29-ATOM-USD_price_feed
So we can probably retrieve the adminFacet with something like:

```js
const handle = E(board).getValue('board02963');
const adminFacet = baggage.get('Bootstrap Powers').get('Bootstrap Powers').get(handle).adminFacet;
```

### v26-ATOM-USD_price_feed-governor
So we should be able to use the same approach as above.

### v46-scaledPriceAuthority-ATOM
So the InstanceHandle is only imported by v1 (for use as a key) and v46 (used by ZCF). This makes it awkward to retrieve, but fortunately v1-bootstrap's vc.7 ContractKits is a strong MapStore, which means we can iterate through its keys. So we can probably do something like:

```js
let adminFacet;
for (const [key, value] of baggage.get('ContractKits').entries()) {
  if (value.label === 'scaledPriceAuthority-ATOM') {
    adminFacet = value.adminFacet;
  }
}
```
This isn't as safe/satisfying as using a board ID to reach an InstanceHandle and onward to the adminFacet, and we need to ask why we think there is only one such entry in ContractKits, but it should work.

### v45-auctioneer and v44-auctioneer.governor

The auctioneer vat has a less happy story.
Note that we replaced this vat in gov76 (04-sep-2024) with v157-auctioneer and an associated v156-auctioneer.governor, leaving the original vats running but now unused. The replacement process might have overwritten some storage (probably not, but we should keep it in mind). More importantly, the zoe adminFacet for the original auctioneer appears to have been lost. @Chris-Hibbert found the code that was meant to store the auctioneer's adminFacet:

```js
auctioneerKit.resolve(
  harden({
    label: 'auctioneer',
    creatorFacet: governedCreatorFacet,
    adminFacet: governorStartResult.adminFacet,
    publicFacet: governedPublicFacet,
    instance: governedInstance,
    governor: governorStartResult.instance,
    governorCreatorFacet: governorStartResult.creatorFacet,
    governorAdminFacet: governorStartResult.adminFacet,
  }),
);
```

Note the subtle typo in the `adminFacet:` line: it stores `governorStartResult.adminFacet` (the governor's own adminFacet, the same value stored under `governorAdminFacet`), rather than the adminFacet of the governed auctioneer contract. We can confirm this by finding the v44-auctioneer.governor's adminFacet and looking at the v1-bootstrap record which holds it:
That shows a record which maps strings to Presences:

```
{
  adminFacet: o-2518,
  creatorFacet: o-2523,
  governor: o-2520,
  governorAdminFacet: o-2518, // note the duplicate $0
  governorCreatorFacet: o-2519,
  instance: o-2522,
  label: 'auctioneer',
  publicFacet: o-2524,
}
```
So we accidentally dropped the auctioneer's own adminFacet, keeping two references to the governor's adminFacet instead. Our best thought here is to update the governor to expose a power (to the governing committee) to terminate the contract vat.
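To make that concrete, the shape might be something like this (a purely hypothetical sketch: `terminate` stands in for whatever termination method a future Zoe change puts on the adminFacet, and the wiring of this power through the contractGovernor does not exist today):

```js
import { E, Far } from '@endo/far';

// Hypothetical: a facet an upgraded governor could hold and offer to the
// governing committee, forwarding to an assumed termination method on zoe's
// adminFacet (the Zoe change discussed above).
const makeGovernedTerminator = adminFacet =>
  Far('GovernedTerminator', {
    terminate: reason => E(adminFacet).terminate(reason), // assumed method
  });
```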
### voteCounter vats

We also have about 100 voteCounter vats left over from completed governance votes. Unfortunately, it looks like nothing is holding onto the zoe adminFacets for these:
note that there is no adminFacet in that record. In fact, if we look at the numbers, it is clear that we've dropped a significant number of these adminFacets. The vat does have an InstanceHandle, and it is registered in v7-board. So to delete the voteCounter vats, we're probably going to have to use the out-of-band `controller.terminateVat()` approach described below; a sketch follows.
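For reference, that out-of-band path would look roughly like this (a sketch only: it assumes `controller` is the host application's swingset controller, that #8687's `controller.terminateVat()` lands roughly as described in this issue, and that the reason argument shape is a guess):

```js
// Sketch of an outside-the-kernel upgrade handler terminating specific vats.
const vatIDsToTerminate = [/* voteCounter vatIDs found in the scan */];
for (const vatID of vatIDsToTerminate) {
  controller.terminateVat(vatID, 'terminated by chain-software upgrade'); // assumed signature
}
// the slow #8928 cleanup then proceeds, rate-limited, over subsequent blocks
await controller.run();
```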
@michaelfig pointed out that:
so modern deployment scripts should be better at holding on to everything.
closes: #9584
closes: #9928
refs: #9827
refs: #9748
refs: #9382
closes: #10031

## Description

We added upgrading the scaledPriceAuthority to the steps in upgrading vaults, auctions, and priceFeeds, and didn't notice that it broke things. The problem turned out to be that the "priceAuthority" object registered with the priceFeedRegistry was an ephemeral object that was not upgraded. This fixes that by re-registering the new priceAuthority. Then, to simplify the process of cleaning up the uncollected cycles reported in #9483, we switched to replacing the scaledPriceAuthorities rather than upgrading them.

We also realized that we would need different coreEvals in different environments, since the Oracle addresses and particular assets vary for each (test and mainNet) chain environment.

#9748 addressed some of the issues in the original coreEval. #9999 showed what was needed for upgrading priceFeeds, which was completed by #9827. #10021 added some details on replacing scaledPriceAuthorities.

### Security Considerations

N/A

### Scaling Considerations

Addresses one of our biggest scaling issues.

### Documentation Considerations

N/A

### Testing Considerations

Thorough testing in a3p, and our testnets. #9886 discusses some testing to ensure Oracles will work with the upgrade.

### Upgrade Considerations

See above.
## What is the Problem Being Solved?
#8400 identified misbehavior in the mainnet price-feed vats which caused their storage and c-list use to grow constantly over time. #8401 identified misbehavior in zoe/contract interactions that caused similar symptoms. The cause of these problems has been fixed, and these fixes will be deployed in mainnet upgrade-16, in #9370. This will upgrade Zoe to stop producing most #8401 cycles, and will introduce new price-feed vats that do not suffer from the #8400 growth.
Unfortunately that deployment will (necessarily) leave the old bloated price-feed vats in place, which will retain the troublesome storage and c-list entries (both in the price-feed vats themselves, and in the upgraded vat-zoe because of the cycles). Terminating the price-feed vats would free up that storage, but until the changes of #8928 are deployed, deleting that much data in a single step would crash the kernel, or at least cause a multi-hour-long stall.
So the task, after #8928 is deployed, is to deploy an action to the chain that will terminate the old price-feed vats.
## Description of the Design
I think @Chris-Hibbert told me that Zoe retains the `adminNode` necessary to call `E(adminNode).terminateWithFailure()`, but does not expose it to the bootstrap/core-eval environment, which means a single core-eval would not be sufficient to get these vats terminated.

One option is to wait for #8687 to be deployed to mainnet (also in a chain-halting kernel upgrade), and then have the outside-the-kernel upgrade handler call `controller.terminateVat()` on each of these vatIDs. These two steps could happen in the same chain-halting upgrade.

Another option is to change Zoe somehow to allow governance to trigger `terminateWithFailure()`, and then perform governance actions to get the vats terminated. This would be better from an authority point of view, but would require more upgrade steps to get there, and might or might not fit our long-term model of how authority should be distributed.

## Security Considerations
The ability to terminate a vat must be pretty closely held, as continued operation of a contract is a pretty fundamental part of what the chain platform promises. #8687 is not an authority violation because it is not reachable by any vat (it must be invoked by the host application, outside the kernel entirely), but that doesn't mean it's a very comfortable superpower to exercise. It'd be nice to find a good "in-band" means to get these old vats shut down.
## Scaling Considerations
We must be careful to not overload the chain. The #8928 slow-termination feature enables this, and introduces a rate limit to control how much cleanup work is allowed during each idle block. We need to carefully test and closely monitor the cleanup rate to make sure the "background work" does not cause problems with validators. We know that doing significant swingset work causes the slower validators to miss the voting window (the number of votes per block drops significantly for a block or two after non-trivial swingset runs). The cleanup parameters are chosen to keep the cleanup work down to about 100ms on my local machine, but we don't know exactly how long that will take on the slower validators. There will always be a slowest validator, and giving them work to do will always have the potential to knock them out of the pool, but we should try to avoid too much churn, especially because cleanup will be ongoing for weeks or months, and there won't be any idle blocks during that time, so they will not have the same chance to catch up as they would normally have.
If multiple vats have been terminated-but-not-deleted, #8928 only does cleanup on one vat at a time. So (in theory) it should be safe to delete all the vats at once, and the rate-limiting code will slowly crawl through all of the first vat's garbage before it eventually moves on to the second, etc. But this needs to be tested carefully.
## Test Plan
We definitely need an a3p-like test to exercise whatever API we end up using.
In addition, we need to find some way to simulate and measure the actual rate-limiting. I intend to experiment with mainfork-based clones of the mainnet state, and deploy some hacked up version of this ticket, to watch how cleanup unfolds on real mainnet-sized data.
## Upgrade Considerations
This is all about performing a chain upgrade to trigger this deletion.
## Tasks