[keyring-controller] Lock controller mutex on write operations #4182

mikesposito · 2024-04-18T14:44:56Z

Explanation

Since all changes to the state of any keyring held by KeyringController should result in an update to the vault, we need to ensure that what we save into the vault is the result of the latest executed operation. In other words, we need to:

Ensure that each operation that changes the state is mutually exclusive. That is, if an operation is made of multiple steps, these steps are guaranteed to be executed before any other operation is started
Ensure that each operation is persisted to the vault before another operation can start

Visually:

sequenceDiagram
  create participant Client
  create participant aPublicMethod()
  Client->>aPublicMethod(): Calls a method that changes state
  create participant ControllerMutex
  aPublicMethod()->>ControllerMutex: Acquire Controller Lock
  loop Controller Lock
    ControllerMutex-->ControllerMutex: Wait controller to be available
  end
  rect rgb(230,245,255)
    note right of aPublicMethod(): Exclusive Controller OP
    participant KeyringController State
    destroy KeyringController State
    aPublicMethod()->>KeyringController State: Update Controller State
    create participant VaultMutex
    aPublicMethod()->>VaultMutex: Acquire Vault Lock
    loop Vault Lock
      VaultMutex-->VaultMutex: Wait vault to be available
    end
    rect rgb(190,225,235)
      note right of aPublicMethod(): Exclusive Vault OP
      participant Vault
      destroy Vault
      aPublicMethod()->>Vault: Persist state changes
    end
    destroy VaultMutex
    VaultMutex-->>aPublicMethod(): Release Vault lock
  end
    destroy ControllerMutex
    ControllerMutex-->>aPublicMethod(): Release Controller lock
    destroy aPublicMethod()
    aPublicMethod()-->>Client: Return value

To do this, this PR adds the KeyringController.#withControllerLock helper function, to be used on any write operation that can potentially mutate the controller state (e.g. any function that calls this.persistAllKeyrings at the end).

References

Related to [keyring-controller] Update state in single call when persisting or unlocking #4154

Changelog

`@metamask/keyring-controller`

CHANGED: KeyringController method calls that change state are now mutually exclusive

Checklist

I've updated the test suite for new or updated code as appropriate
I've updated documentation (JSDoc, Markdown, etc.) for new or updated code as appropriate
I've highlighted breaking changes using the "BREAKING" category above as appropriate

mikesposito · 2024-04-18T16:13:52Z

packages/keyring-controller/src/KeyringController.ts

-    } finally {
-      releaseLock();
+    if (!this.#controllerOperationMutex.isLocked()) {
+      throw new Error(KeyringControllerError.ControllerLockRequired);


This should ensure that all calls to #updateVault are done after locking on the controller mutex, enforcing mutually exclusive operations in mutable methods

mcmire

Makes sense, but it might be worth encoding the knowledge that we have about the bug somewhere, perhaps in the docs for withControllerLock or somewhere else.

mcmire · 2024-04-19T19:37:23Z

packages/keyring-controller/src/KeyringController.ts

+  /**
+   * Lock the controller mutex before executing the given function,
+   * and release it after the function is resolved or after an
+   * error is thrown.


Is it worth explaining why this method exists and what problem it is intended to solve?

Definitely! added some details in ab38237

Gudahtt · 2024-04-23T11:50:13Z

I was thinking that this would cause a deadlock if anyone listened for state change events, then called a write operation in response, e.g.

messenger.subscribe('KeyringController:stateChange',(state) => {
 messenger.call('KeyringController:persistAllKeyrings');
});

But no, this should be OK because all write operations are async. The triggered operation would just wait for the first to complete. Is that correct?

mikesposito · 2024-04-23T12:02:20Z

I was thinking that this would cause a deadlock if anyone listened for state change events, then called a write operation in response, e.g.
messenger.subscribe('KeyringController:stateChange',(state) => {
 messenger.call('KeyringController:persistAllKeyrings');
});
But no, this should be OK because all write operations are async. The triggered operation would just wait for the first to complete. Is that correct?

@Gudahtt In your example, if messenger.call is called with await then it would cause a deadlock. But I would say that it is a "feature" since that particular call would probably cause an invalid state to persist, so the controller is acting as a safeguard against bad usage.

This is also part of the reason why methods and actions like persistAllKeyrings are being deprecated in upcoming PRs related to withKeyring: if we remove the necessity of doing something like your example by providing a safe alternative, we can easily avoid both deadlocks and corrupted states

Gudahtt · 2024-04-23T12:04:17Z

Hmm. Event publishing is synchronous, so I do not see how an await could cause a deadlock. The code making the state update would not await async operations, as state updates themselves are synchronous.

mikesposito · 2024-04-23T12:07:37Z

In this example test case the persistAllKeyrings call is executed before the end of submitPassword, even if the event publishing is synchronous and the method call isn't. I think that with the introduction of locks, that test case would cause a deadlock since it would await on a locked mutex

legobeat · 2024-04-23T12:08:20Z

I was thinking that this would cause a deadlock if anyone listened for state change events, then called a write operation in response, e.g.
messenger.subscribe('KeyringController:stateChange',(state) => {
 messenger.call('KeyringController:persistAllKeyrings');
});
But no, this should be OK because all write operations are async. The triggered operation would just wait for the first to complete. Is that correct?
@Gudahtt In your example, if messenger.call is called with await then it would cause a deadlock. But I would say that it is a "feature" since that particular call would probably cause an invalid state to persist, so the controller is acting as a safeguard against bad usage.

This is also part of the reason why methods and actions like persistAllKeyrings are being deprecated in upcoming PRs related to withKeyring: if we remove the necessity of doing something like your example by providing a safe alternative, we can easily avoid both deadlocks and corrupted states

In either case, perhaps warranted to add an explicit test-case with that to capture current behavior?

mikesposito · 2024-04-23T12:09:11Z

Ah I see what you are saying. It would wait for the lock to be released, and just executed afterwards 🤔

Gudahtt · 2024-04-23T12:23:12Z

With the introduction of this new lock, is the vault mutex still useful? It now seems redundant to me.

Not a blocker - we can remove it in a later PR, but if we can remove it we should, because it would simplify the controller a great deal and eliminate the chances of a new deadlock involving that mutex being introduced in the future.

Gudahtt · 2024-04-23T12:51:34Z

Hmm, it still protects internally against two vault accesses in the same controller operation. I would say that it is a protection against bad controller development, while the controller mutex is a protection against bad controller usage

Hmm. It would protect against two concurrent write operations to the vault I guess, as sequential changes would be fine.

I get the impression that the lock was originally added to prevent concurrent vault operations caused by concurrent calls to write operations, rather than internal calls. I see how it could be a useful protection measure internally, but it comes at a non-trivial cost in complexity and deadlock risk, and I don't see a reason to suspect concurrent write operations are especially likely to occur. We have no similar protection against concurrent keyring updates, which would be just as problematic.

mikesposito · 2024-04-23T13:51:47Z

There are a few private methods that are write operations but have no lock, because they are intended to be called from a method that is locked. Thoughts on adding a guard to these to ensure they're always called from within a lock? e.g.
if (!this.#controllerOperationMutex.isLocked()) {
  throw new Error(KeyringControllerError.ControllerLockRequired);
}
(just like the one in #withVaultLock)

The methods I saw that would qualify here are #addQRKeyring, #createNewVaultWithKeyring, #createKeyringWithFirstAccount, #newKeyring, #restoreKeyring, #removeEmptyKeyrings, and #setUnlocked

Good point! I added assertions for all those methods 9780406

mikesposito · 2024-04-23T13:59:52Z

I get the impression that the lock was originally added to prevent concurrent vault operations caused by concurrent calls to write operations, rather than internal calls. I see how it could be a useful protection measure internally, but it comes at a non-trivial cost in complexity and deadlock risk, and I don't see a reason to suspect concurrent write operations are especially likely to occur. We have no similar protection against concurrent keyring updates, which would be just as problematic.

True, there's a high cost in terms of additional complexity. Perhaps we can remove the vault lock before #4192

Gudahtt · 2024-04-23T15:00:26Z

Good point! I added assertions for all those methods 9780406

Thanks!

It looks like there is still one method missing this assertion: #createKeyringWithFirstAccount. Thoughts on adding it there as well?

Gudahtt

One pending non-blocking suggestion (about adding one last lock assertion), but overall this LGTM!

Gudahtt · 2024-04-23T15:01:38Z

packages/keyring-controller/CHANGELOG.md

@@ -7,6 +7,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

+### Changed
+
+- **BREAKING**: `getAccounts` return type changed from `Promise<string>` to `string` ([#4182](https://github.com/MetaMask/core/pull/4182))


We could consider preserving async on getAccounts, to avoid this breaking change. But I am OK with this changing, it's very easy to accommodate this change.

Good point, it's very easy to accommodate but we'll have to refactor a lot of code already when releasing the complete set of atomic/exclusive operations on clients. Since it's functionally equivalent I'd say that we can come back at this later and remove the async.

I reverted the async removal part in 14ab92e

Hmm, I'm thinking that if we intend to release #4192 on clients along with these changes, then we'll have to release #4199 too - which includes a bunch of breaking changes already, so changing getAccounts too should not change much in terms of update complexity since we'll have to release a major version anyway

Agreed! That is why I was ambivalent about this

mcmire · 2024-04-23T22:22:54Z

packages/keyring-controller/src/KeyringController.ts

@@ -1540,6 +1570,8 @@ export class KeyringController extends BaseController<
   * when initializing the controller
   */
  async #addQRKeyring(): Promise<QRKeyring> {
+    this.#assertControllerMutexIsLocked();


It makes sense that some methods have steps that need to be performed atomically and exclusively — they can only run as a whole once the mutex is unlocked — but it's interesting to me that we would go the other way and restrict some methods from being performed outside of a mutex (they have to be a step in an atomic method). Why is that? Or am I not understanding this correctly?

The main reason for that is to avoid the addition of methods in the future that could unintentionally call these methods internally without locking: if it's true that all KeyringController mutable operations must be atomic, then should be fair to assert that all mutable methods (including # methods) must be behind the mutex.

mcmire · 2024-04-23T22:26:48Z

packages/keyring-controller/src/KeyringController.ts


-    this.messagingSystem.publish(`${name}:lock`);
+      this.messagingSystem.publish(`${name}:lock`);


I notice that for removeAccount, the accountRemoved is emitted after the lock is released, but in this case the lock event is emitted before the lock is released. Do these need to behave the same or does it matter?

Good point! I moved that out of the lock in removeAccount for two reasons mainly:

I initially had a deadlock concern for side effects running while the mutex was still locked, but this has been proven not to be the case

In the subsequent PR for atomic operations removeAccount will be wrapped in this.#persistOrRollback, and since the state and vault update will be delegated to the wrapper, this function would emit the :accountRemoved event before removing the account from the state, so a consumer listening would receive the event while having the controller on its original state

(Related to (2)) setLocked does not persist to the vault, it just clears the keyrings array, so it will not be wrapped in the this.#persistOrRollback function

mikesposito · 2024-04-24T10:46:49Z

It looks like there is still one method missing this assertion: #createKeyringWithFirstAccount. Thoughts on adding it there as well?

Good point! I also added it to #clearKeyrings. With these two we should have all methods interacting with this.#keyrings covered now

This reverts commit 5d8bd35.

## Explanation This PR is an intermediate refactor needed for #4192. Since operations are eventually rolled back, the internal method will not be able to return the last controller state, risking returning a stale one. Moreover, function returns make more sense now. This PR needs these changes to be merged **first**: - [x] #4182  ## References  ### `@metamask/keyring-controller` - **BREAKING**: Change various `KeyringController` methods so they no longer return the controller state - Changed `addNewAccount` return type to `Promise<string>` - Changed `addNewAccountWithoutUpdate` return type to `Promise<string>` - Changed `createNewVaultAndKeychain` return type to `Promise<void>` - Changed `createNewVaultAndRestore` return type to `Promise<void>` - Changed `importAccountWithStrategy` return type to `Promise<string>` - Changed `removeAccount` return type to `Promise<void>` - Changed `setLocked` return type to `Promise<void>` - Changed `submitEncryptionKey` return type to `Promise<void>` - Changed `submitPassword` return type to `Promise<void>` ## Checklist - [ ] I've updated the test suite for new or updated code as appropriate - [ ] I've updated documentation (JSDoc, Markdown, etc.) for new or updated code as appropriate - [ ] I've highlighted breaking changes using the "BREAKING" category above as appropriate --------- Co-authored-by: Elliot Winkler <elliot.winkler@gmail.com>

## Explanation Part of `KeyringController` responsibilies is ensuring each operation is [mutually exclusive](#4182) and [atomic](#4192), updating keyring instances and the vault (or rolling them back) in the same mutex lock. However, the ability of clients to have direct access to a keyring instance represents a loophole, as with the current implementation they don’t have to comply with the rules enforced by the controller: we should provide a way for clients to interact with a keyring instance through safeguards provided by KeyringController. The current behavior is this one: 1. Client obtains a keyring instance through `getKeyringForAccount` 2. Client interacts with the instance 3. Client calls `persistAllKeyrings` We should, instead, have something like this: 1. Client calls a `withKeyring` method, passing a _keyring selector_ and a callback 2. KeyringController selects the keyring instance and calls the callback with it, after locking the controller operation mutex 3. Client interacts with the keyring instance safely, inside the callback 4. KeyringController, after the callback execution, internally updates the vault or rolls back changes in case of error, and then releases the mutex lock ## References  * Fixes #4198 * Related to #4192 ## Changelog  ### `@metamask/keyring-controller` - **ADDED**: Added `withKeyring` method - **DEPRECATED**: Deprecated `persistAllKeyrings` method - Use `withKeyring` instead ## Checklist - [ ] I've updated the test suite for new or updated code as appropriate - [ ] I've updated documentation (JSDoc, Markdown, etc.) for new or updated code as appropriate - [ ] I've highlighted breaking changes using the "BREAKING" category above as appropriate

This was referenced Apr 18, 2024

[keyring-controller] Lock controller mutex on write operations #4156

Closed

[keyring-controller] Avoid calling mutable public functions internally #4181

Merged

mikesposito force-pushed the refactor/keyring-controller-deadlock-prevention branch from 2e46463 to ef8943d Compare April 18, 2024 15:50

mikesposito commented Apr 18, 2024

View reviewed changes

mikesposito force-pushed the feat/keyring-controller-lock-v2 branch from 6a04825 to b5b0c02 Compare April 18, 2024 17:20

mikesposito force-pushed the refactor/keyring-controller-deadlock-prevention branch from ef8943d to a2b2119 Compare April 19, 2024 09:50

mikesposito force-pushed the feat/keyring-controller-lock-v2 branch from b5b0c02 to 4d0136a Compare April 19, 2024 10:04

mikesposito force-pushed the refactor/keyring-controller-deadlock-prevention branch from a2b2119 to 4f1cab2 Compare April 19, 2024 12:02

mikesposito force-pushed the feat/keyring-controller-lock-v2 branch from 4d0136a to cacb308 Compare April 19, 2024 12:08

mikesposito force-pushed the refactor/keyring-controller-deadlock-prevention branch from 4f1cab2 to 459273c Compare April 19, 2024 12:11

mikesposito force-pushed the feat/keyring-controller-lock-v2 branch from cacb308 to 57ba781 Compare April 19, 2024 12:13

mcmire reviewed Apr 19, 2024

View reviewed changes

mikesposito force-pushed the refactor/keyring-controller-deadlock-prevention branch from 459273c to 4b382ae Compare April 19, 2024 19:58

Base automatically changed from refactor/keyring-controller-deadlock-prevention to main April 19, 2024 20:08

mikesposito force-pushed the feat/keyring-controller-lock-v2 branch 2 times, most recently from 029f2c0 to ab38237 Compare April 22, 2024 08:43

mikesposito marked this pull request as ready for review April 22, 2024 08:43

mikesposito requested a review from a team as a code owner April 22, 2024 08:43

mikesposito requested a review from mcmire April 22, 2024 16:05

mikesposito force-pushed the feat/keyring-controller-lock-v2 branch from ca03cc4 to e29dbfe Compare April 23, 2024 13:43

mikesposito force-pushed the feat/keyring-controller-lock-v2 branch from 9780406 to b6875dd Compare April 23, 2024 14:43

Gudahtt previously approved these changes Apr 23, 2024

View reviewed changes

mcmire reviewed Apr 23, 2024

View reviewed changes

mikesposito dismissed Gudahtt’s stale review via 14ab92e April 24, 2024 10:42

Gudahtt approved these changes Apr 24, 2024

View reviewed changes

mikesposito added 10 commits April 24, 2024 16:12

feat: controller lock

ad6bd78

test: adjust coverage

d80911a

feat: add lock to getAccounts

20f5116

refactor: emit accountRemoved event after releasing the lock

022b68c

docs: add details on #withControllerLock

494f1fc

test: should not cause deadlock with state subscriptions

d5ea410

refactor: getAccounts reads from controller state

6b4362a

refactor: assert mutex is locked in internal functions

a43ed42

fix: add missing mutex-is-locked assertions

e7ccd2a

Revert "refactor: getAccounts reads from controller state"

952aa96

This reverts commit 5d8bd35.

mikesposito force-pushed the feat/keyring-controller-lock-v2 branch from 14ab92e to 952aa96 Compare April 24, 2024 14:12

mikesposito merged commit 2a4d933 into main Apr 24, 2024
139 checks passed

mikesposito deleted the feat/keyring-controller-lock-v2 branch April 24, 2024 14:24

This was referenced Apr 25, 2024

fix: hd keyring logic in addNewAccountForKeyring #4203

Open

[keyring-controller] Mitigate run conditions with >^16.x MetaMask/metamask-extension#24276

Open

[keyring-controller] Mitigate run conditions with >^16.x MetaMask/metamask-mobile#9432

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[keyring-controller] Lock controller mutex on write operations #4182

[keyring-controller] Lock controller mutex on write operations #4182

mikesposito commented Apr 18, 2024 •

edited

mikesposito Apr 18, 2024

mcmire left a comment

mcmire Apr 19, 2024

mikesposito Apr 22, 2024

Gudahtt commented Apr 23, 2024

mikesposito commented Apr 23, 2024

Gudahtt commented Apr 23, 2024

mikesposito commented Apr 23, 2024 •

edited

legobeat commented Apr 23, 2024 •

edited

mikesposito commented Apr 23, 2024

Gudahtt commented Apr 23, 2024

Gudahtt commented Apr 23, 2024

mikesposito commented Apr 23, 2024

mikesposito commented Apr 23, 2024

Gudahtt commented Apr 23, 2024

Gudahtt left a comment

Gudahtt Apr 23, 2024

mikesposito Apr 24, 2024

mikesposito Apr 24, 2024 •

edited

Gudahtt Apr 24, 2024

mcmire Apr 23, 2024

mikesposito Apr 24, 2024

mcmire Apr 23, 2024

mikesposito Apr 24, 2024 •

edited

mikesposito commented Apr 24, 2024 •

edited


		this.messagingSystem.publish(`${name}:lock`);
		this.messagingSystem.publish(`${name}:lock`);

[keyring-controller] Lock controller mutex on write operations #4182

[keyring-controller] Lock controller mutex on write operations #4182

Conversation

mikesposito commented Apr 18, 2024 • edited

Explanation

References

Changelog

@metamask/keyring-controller

Checklist

Choose a reason for hiding this comment

mcmire left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Gudahtt commented Apr 23, 2024

mikesposito commented Apr 23, 2024

Gudahtt commented Apr 23, 2024

mikesposito commented Apr 23, 2024 • edited

legobeat commented Apr 23, 2024 • edited

mikesposito commented Apr 23, 2024

Gudahtt commented Apr 23, 2024

Gudahtt commented Apr 23, 2024

mikesposito commented Apr 23, 2024

mikesposito commented Apr 23, 2024

Gudahtt commented Apr 23, 2024

Gudahtt left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mikesposito Apr 24, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mikesposito Apr 24, 2024 • edited

Choose a reason for hiding this comment

mikesposito commented Apr 24, 2024 • edited

mikesposito commented Apr 18, 2024 •

edited

`@metamask/keyring-controller`

mikesposito commented Apr 23, 2024 •

edited

legobeat commented Apr 23, 2024 •

edited

mikesposito Apr 24, 2024 •

edited

mikesposito Apr 24, 2024 •

edited

mikesposito commented Apr 24, 2024 •

edited