Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[keyring-controller] Lock controller mutex on write operations #4182

Merged
merged 10 commits into from
Apr 24, 2024

Conversation

mikesposito
Copy link
Member

@mikesposito mikesposito commented Apr 18, 2024

Explanation

Since all changes to the state of any keyring held by KeyringController should result in an update to the vault, we need to ensure that what we save into the vault is the result of the latest executed operation. In other words, we need to:

  1. Ensure that each operation that changes the state is mutually exclusive. That is, if an operation is made of multiple steps, these steps are guaranteed to be executed before any other operation is started
  2. Ensure that each operation is persisted to the vault before another operation can start

Visually:

sequenceDiagram
  create participant Client
  create participant aPublicMethod()
  Client->>aPublicMethod(): Calls a method that changes state
  create participant ControllerMutex
  aPublicMethod()->>ControllerMutex: Acquire Controller Lock
  loop Controller Lock
    ControllerMutex-->ControllerMutex: Wait controller to be available
  end
  rect rgb(230,245,255)
    note right of aPublicMethod(): Exclusive Controller OP
    participant KeyringController State
    destroy KeyringController State
    aPublicMethod()->>KeyringController State: Update Controller State
    create participant VaultMutex
    aPublicMethod()->>VaultMutex: Acquire Vault Lock
    loop Vault Lock
      VaultMutex-->VaultMutex: Wait vault to be available
    end
    rect rgb(190,225,235)
      note right of aPublicMethod(): Exclusive Vault OP
      participant Vault
      destroy Vault
      aPublicMethod()->>Vault: Persist state changes
    end
    destroy VaultMutex
    VaultMutex-->>aPublicMethod(): Release Vault lock
  end
    destroy ControllerMutex
    ControllerMutex-->>aPublicMethod(): Release Controller lock
    destroy aPublicMethod()
    aPublicMethod()-->>Client: Return value

To do this, this PR adds the KeyringController.#withControllerLock helper function, to be used on any write operation that can potentially mutate the controller state (e.g. any function that calls this.persistAllKeyrings at the end).

References

Changelog

@metamask/keyring-controller

  • CHANGED: KeyringController method calls that change state are now mutually exclusive

Checklist

  • I've updated the test suite for new or updated code as appropriate
  • I've updated documentation (JSDoc, Markdown, etc.) for new or updated code as appropriate
  • I've highlighted breaking changes using the "BREAKING" category above as appropriate

} finally {
releaseLock();
if (!this.#controllerOperationMutex.isLocked()) {
throw new Error(KeyringControllerError.ControllerLockRequired);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should ensure that all calls to #updateVault are done after locking on the controller mutex, enforcing mutually exclusive operations in mutable methods

@mikesposito mikesposito force-pushed the feat/keyring-controller-lock-v2 branch from 6a04825 to b5b0c02 Compare April 18, 2024 17:20
@mikesposito mikesposito force-pushed the refactor/keyring-controller-deadlock-prevention branch from ef8943d to a2b2119 Compare April 19, 2024 09:50
@mikesposito mikesposito force-pushed the feat/keyring-controller-lock-v2 branch from b5b0c02 to 4d0136a Compare April 19, 2024 10:04
@mikesposito mikesposito force-pushed the refactor/keyring-controller-deadlock-prevention branch from a2b2119 to 4f1cab2 Compare April 19, 2024 12:02
@mikesposito mikesposito force-pushed the feat/keyring-controller-lock-v2 branch from 4d0136a to cacb308 Compare April 19, 2024 12:08
@mikesposito mikesposito force-pushed the refactor/keyring-controller-deadlock-prevention branch from 4f1cab2 to 459273c Compare April 19, 2024 12:11
@mikesposito mikesposito force-pushed the feat/keyring-controller-lock-v2 branch from cacb308 to 57ba781 Compare April 19, 2024 12:13
Copy link
Contributor

@mcmire mcmire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, but it might be worth encoding the knowledge that we have about the bug somewhere, perhaps in the docs for withControllerLock or somewhere else.

/**
* Lock the controller mutex before executing the given function,
* and release it after the function is resolved or after an
* error is thrown.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth explaining why this method exists and what problem it is intended to solve?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely! added some details in ab38237

@mikesposito mikesposito force-pushed the refactor/keyring-controller-deadlock-prevention branch from 459273c to 4b382ae Compare April 19, 2024 19:58
Base automatically changed from refactor/keyring-controller-deadlock-prevention to main April 19, 2024 20:08
@mikesposito mikesposito force-pushed the feat/keyring-controller-lock-v2 branch 2 times, most recently from 029f2c0 to ab38237 Compare April 22, 2024 08:43
@mikesposito mikesposito marked this pull request as ready for review April 22, 2024 08:43
@mikesposito mikesposito requested a review from a team as a code owner April 22, 2024 08:43
@mikesposito mikesposito requested a review from mcmire April 22, 2024 16:05
@Gudahtt
Copy link
Member

Gudahtt commented Apr 23, 2024

I was thinking that this would cause a deadlock if anyone listened for state change events, then called a write operation in response, e.g.

messenger.subscribe('KeyringController:stateChange',(state) => {
 messenger.call('KeyringController:persistAllKeyrings');
});

But no, this should be OK because all write operations are async. The triggered operation would just wait for the first to complete. Is that correct?

@mikesposito
Copy link
Member Author

I was thinking that this would cause a deadlock if anyone listened for state change events, then called a write operation in response, e.g.

messenger.subscribe('KeyringController:stateChange',(state) => {
 messenger.call('KeyringController:persistAllKeyrings');
});

But no, this should be OK because all write operations are async. The triggered operation would just wait for the first to complete. Is that correct?

@Gudahtt In your example, if messenger.call is called with await then it would cause a deadlock. But I would say that it is a "feature" since that particular call would probably cause an invalid state to persist, so the controller is acting as a safeguard against bad usage.

This is also part of the reason why methods and actions like persistAllKeyrings are being deprecated in upcoming PRs related to withKeyring: if we remove the necessity of doing something like your example by providing a safe alternative, we can easily avoid both deadlocks and corrupted states

@Gudahtt
Copy link
Member

Gudahtt commented Apr 23, 2024

Hmm. Event publishing is synchronous, so I do not see how an await could cause a deadlock. The code making the state update would not await async operations, as state updates themselves are synchronous.

@mikesposito
Copy link
Member Author

mikesposito commented Apr 23, 2024

In this example test case the persistAllKeyrings call is executed before the end of submitPassword, even if the event publishing is synchronous and the method call isn't. I think that with the introduction of locks, that test case would cause a deadlock since it would await on a locked mutex

@legobeat
Copy link
Contributor

legobeat commented Apr 23, 2024

I was thinking that this would cause a deadlock if anyone listened for state change events, then called a write operation in response, e.g.

messenger.subscribe('KeyringController:stateChange',(state) => {
 messenger.call('KeyringController:persistAllKeyrings');
});

But no, this should be OK because all write operations are async. The triggered operation would just wait for the first to complete. Is that correct?

@Gudahtt In your example, if messenger.call is called with await then it would cause a deadlock. But I would say that it is a "feature" since that particular call would probably cause an invalid state to persist, so the controller is acting as a safeguard against bad usage.

This is also part of the reason why methods and actions like persistAllKeyrings are being deprecated in upcoming PRs related to withKeyring: if we remove the necessity of doing something like your example by providing a safe alternative, we can easily avoid both deadlocks and corrupted states

In either case, perhaps warranted to add an explicit test-case with that to capture current behavior?

@mikesposito
Copy link
Member Author

Ah I see what you are saying. It would wait for the lock to be released, and just executed afterwards 🤔

@Gudahtt
Copy link
Member

Gudahtt commented Apr 23, 2024

With the introduction of this new lock, is the vault mutex still useful? It now seems redundant to me.

Not a blocker - we can remove it in a later PR, but if we can remove it we should, because it would simplify the controller a great deal and eliminate the chances of a new deadlock involving that mutex being introduced in the future.

@Gudahtt
Copy link
Member

Gudahtt commented Apr 23, 2024

Hmm, it still protects internally against two vault accesses in the same controller operation. I would say that it is a protection against bad controller development, while the controller mutex is a protection against bad controller usage

Hmm. It would protect against two concurrent write operations to the vault I guess, as sequential changes would be fine.

I get the impression that the lock was originally added to prevent concurrent vault operations caused by concurrent calls to write operations, rather than internal calls. I see how it could be a useful protection measure internally, but it comes at a non-trivial cost in complexity and deadlock risk, and I don't see a reason to suspect concurrent write operations are especially likely to occur. We have no similar protection against concurrent keyring updates, which would be just as problematic.

@mikesposito mikesposito force-pushed the feat/keyring-controller-lock-v2 branch from ca03cc4 to e29dbfe Compare April 23, 2024 13:43
@mikesposito
Copy link
Member Author

There are a few private methods that are write operations but have no lock, because they are intended to be called from a method that is locked. Thoughts on adding a guard to these to ensure they're always called from within a lock? e.g.

if (!this.#controllerOperationMutex.isLocked()) {
  throw new Error(KeyringControllerError.ControllerLockRequired);
}

(just like the one in #withVaultLock)

The methods I saw that would qualify here are #addQRKeyring, #createNewVaultWithKeyring, #createKeyringWithFirstAccount, #newKeyring, #restoreKeyring, #removeEmptyKeyrings, and #setUnlocked

Good point! I added assertions for all those methods 9780406

@mikesposito
Copy link
Member Author

I get the impression that the lock was originally added to prevent concurrent vault operations caused by concurrent calls to write operations, rather than internal calls. I see how it could be a useful protection measure internally, but it comes at a non-trivial cost in complexity and deadlock risk, and I don't see a reason to suspect concurrent write operations are especially likely to occur. We have no similar protection against concurrent keyring updates, which would be just as problematic.

True, there's a high cost in terms of additional complexity. Perhaps we can remove the vault lock before #4192

@mikesposito mikesposito force-pushed the feat/keyring-controller-lock-v2 branch from 9780406 to b6875dd Compare April 23, 2024 14:43
@Gudahtt
Copy link
Member

Gudahtt commented Apr 23, 2024

Good point! I added assertions for all those methods 9780406

Thanks!

It looks like there is still one method missing this assertion: #createKeyringWithFirstAccount. Thoughts on adding it there as well?

Gudahtt
Gudahtt previously approved these changes Apr 23, 2024
Copy link
Member

@Gudahtt Gudahtt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One pending non-blocking suggestion (about adding one last lock assertion), but overall this LGTM!

@@ -7,6 +7,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Changed

- **BREAKING**: `getAccounts` return type changed from `Promise<string>` to `string` ([#4182](https://github.com/MetaMask/core/pull/4182))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could consider preserving async on getAccounts, to avoid this breaking change. But I am OK with this changing, it's very easy to accommodate this change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, it's very easy to accommodate but we'll have to refactor a lot of code already when releasing the complete set of atomic/exclusive operations on clients. Since it's functionally equivalent I'd say that we can come back at this later and remove the async.

I reverted the async removal part in 14ab92e

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm thinking that if we intend to release #4192 on clients along with these changes, then we'll have to release #4199 too - which includes a bunch of breaking changes already, so changing getAccounts too should not change much in terms of update complexity since we'll have to release a major version anyway

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! That is why I was ambivalent about this

@@ -1540,6 +1570,8 @@ export class KeyringController extends BaseController<
* when initializing the controller
*/
async #addQRKeyring(): Promise<QRKeyring> {
this.#assertControllerMutexIsLocked();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense that some methods have steps that need to be performed atomically and exclusively — they can only run as a whole once the mutex is unlocked — but it's interesting to me that we would go the other way and restrict some methods from being performed outside of a mutex (they have to be a step in an atomic method). Why is that? Or am I not understanding this correctly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason for that is to avoid the addition of methods in the future that could unintentionally call these methods internally without locking: if it's true that all KeyringController mutable operations must be atomic, then should be fair to assert that all mutable methods (including # methods) must be behind the mutex.


this.messagingSystem.publish(`${name}:lock`);
this.messagingSystem.publish(`${name}:lock`);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice that for removeAccount, the accountRemoved is emitted after the lock is released, but in this case the lock event is emitted before the lock is released. Do these need to behave the same or does it matter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I moved that out of the lock in removeAccount for two reasons mainly:

  1. I initially had a deadlock concern for side effects running while the mutex was still locked, but this has been proven not to be the case
  2. In the subsequent PR for atomic operations removeAccount will be wrapped in this.#persistOrRollback, and since the state and vault update will be delegated to the wrapper, this function would emit the :accountRemoved event before removing the account from the state, so a consumer listening would receive the event while having the controller on its original state
  3. (Related to (2)) setLocked does not persist to the vault, it just clears the keyrings array, so it will not be wrapped in the this.#persistOrRollback function

@mikesposito
Copy link
Member Author

mikesposito commented Apr 24, 2024

It looks like there is still one method missing this assertion: #createKeyringWithFirstAccount. Thoughts on adding it there as well?

Good point! I also added it to #clearKeyrings. With these two we should have all methods interacting with this.#keyrings covered now

@mikesposito mikesposito force-pushed the feat/keyring-controller-lock-v2 branch from 14ab92e to 952aa96 Compare April 24, 2024 14:12
@mikesposito mikesposito merged commit 2a4d933 into main Apr 24, 2024
139 checks passed
@mikesposito mikesposito deleted the feat/keyring-controller-lock-v2 branch April 24, 2024 14:24
mikesposito added a commit that referenced this pull request Apr 24, 2024
## Explanation

This PR is an intermediate refactor needed for #4192.

Since operations are eventually rolled back, the internal method will
not be able to return the last controller state, risking returning a
stale one. Moreover, function returns make more sense now.

This PR needs these changes to be merged **first**:
- [x] #4182

<!--
Thanks for your contribution! Take a moment to answer these questions so
that reviewers have the information they need to properly understand
your changes:

* What is the current state of things and why does it need to change?
* What is the solution your changes offer and how does it work?
* Are there any changes whose purpose might not obvious to those
unfamiliar with the domain?
* If your primary goal was to update one package but you found you had
to update another one along the way, why did you do so?
* If you had to upgrade a dependency, why did you do so?
-->

## References

<!--
Are there any issues that this pull request is tied to? Are there other
links that reviewers should consult to understand these changes better?

For example:

* Fixes #12345
* Related to #67890
--
* Related to #4192

## Changelog

<!--
If you're making any consumer-facing changes, list those changes here as
if you were updating a changelog, using the template below as a guide.

(CATEGORY is one of BREAKING, ADDED, CHANGED, DEPRECATED, REMOVED, or
FIXED. For security-related issues, follow the Security Advisory
process.)

Please take care to name the exact pieces of the API you've added or
changed (e.g. types, interfaces, functions, or methods).

If there are any breaking changes, make sure to offer a solution for
consumers to follow once they upgrade to the changes.

Finally, if you're only making changes to development scripts or tests,
you may replace the template below with "None".
-->

### `@metamask/keyring-controller`

- **BREAKING**: Change various `KeyringController` methods so they no
longer return the controller state
  - Changed `addNewAccount` return type to `Promise<string>` 
- Changed `addNewAccountWithoutUpdate` return type to `Promise<string>`
  - Changed `createNewVaultAndKeychain` return type to `Promise<void>`
  - Changed `createNewVaultAndRestore` return type to `Promise<void>`
  - Changed `importAccountWithStrategy` return type to `Promise<string>`
  - Changed `removeAccount` return type to `Promise<void>`
  - Changed `setLocked` return type to `Promise<void>`
  - Changed `submitEncryptionKey` return type to `Promise<void>`
  - Changed `submitPassword` return type to `Promise<void>`

## Checklist

- [ ] I've updated the test suite for new or updated code as appropriate
- [ ] I've updated documentation (JSDoc, Markdown, etc.) for new or
updated code as appropriate
- [ ] I've highlighted breaking changes using the "BREAKING" category
above as appropriate

---------

Co-authored-by: Elliot Winkler <elliot.winkler@gmail.com>
mikesposito added a commit that referenced this pull request Apr 30, 2024
## Explanation

Part of `KeyringController` responsibilies is ensuring each operation is
[mutually exclusive](#4182) and
[atomic](#4192), updating keyring
instances and the vault (or rolling them back) in the same mutex lock.

However, the ability of clients to have direct access to a keyring
instance represents a loophole, as with the current implementation they
don’t have to comply with the rules enforced by the controller: we
should provide a way for clients to interact with a keyring instance
through safeguards provided by KeyringController.

The current behavior is this one:
1. Client obtains a keyring instance through `getKeyringForAccount`
2. Client interacts with the instance
3. Client calls `persistAllKeyrings` 

We should, instead, have something like this:
1. Client calls a `withKeyring` method, passing a _keyring selector_ and
a callback
2. KeyringController selects the keyring instance and calls the callback
with it, after locking the controller operation mutex
3. Client interacts with the keyring instance safely, inside the
callback
4. KeyringController, after the callback execution, internally updates
the vault or rolls back changes in case of error, and then releases the
mutex lock

## References

<!--
Are there any issues that this pull request is tied to? Are there other
links that reviewers should consult to understand these changes better?

For example:

* Fixes #12345
* Related to #67890
-->
* Fixes #4198 
* Related to #4192 

## Changelog

<!--
If you're making any consumer-facing changes, list those changes here as
if you were updating a changelog, using the template below as a guide.

(CATEGORY is one of BREAKING, ADDED, CHANGED, DEPRECATED, REMOVED, or
FIXED. For security-related issues, follow the Security Advisory
process.)

Please take care to name the exact pieces of the API you've added or
changed (e.g. types, interfaces, functions, or methods).

If there are any breaking changes, make sure to offer a solution for
consumers to follow once they upgrade to the changes.

Finally, if you're only making changes to development scripts or tests,
you may replace the template below with "None".
-->

### `@metamask/keyring-controller`

- **ADDED**: Added `withKeyring` method
- **DEPRECATED**: Deprecated `persistAllKeyrings` method
  - Use `withKeyring` instead

## Checklist

- [ ] I've updated the test suite for new or updated code as appropriate
- [ ] I've updated documentation (JSDoc, Markdown, etc.) for new or
updated code as appropriate
- [ ] I've highlighted breaking changes using the "BREAKING" category
above as appropriate
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants