Rework `ElectrumClient` #512

t-bast · 2023-07-12T10:52:08Z

Our ElectrumClient was poorly documented and had unexpected behavior in various non-trivial edge cases (e.g. disconnections while RPC calls are pending). Since we rely on untrusted electrum servers, we need clearer requirements on what to do in case of failures, and a mechanism to automatically disconnect from malicious-looking or buggy servers.

The architecture of the coroutines spawned by the ElectrumClient was also confusing: for example, the `cancel()` method was sometimes called on child jobs and sometimes on the main scope, which was really hard to reason about since supervision wasn't explicit.

I highly recommend reviewing commits independently, as they are all narrowly scoped to fix a specific issue. We should also spend some time testing this in Phoenix to see how it behaves in practice.

I removed the CoroutineExceptionHandler installed on TLS sockets, but that is unfortunately premature. While this new version of the ElectrumClient doesn't crash when the socket is remotely closed in unit tests (while the previous version did crash), the behavior is different on Android and the application will still crash: we need ktorio/ktor#3690 to be integrated into ktor before we can safely get rid of it. But since it may also have unintended side-effects, I'm not sure yet whether we should keep it or not in this PR.

Move socketBuilder to connect(): this is only used when connecting, so it doesn't need to be an argument of the `ElectrumClient`.

TCP sockets with electrum servers require some non-trivial code to reconstruct individual messages from the bytes received on the socket. We process bytes in chunks, then reconstruct utf8 strings, and finally split those strings at newline characters into individual messages. We document that code, rename fields to make it easier to understand, and add unit tests. We remove it from the TCP socket abstraction, since this is specific to electrum connections. We also change the behavior in case the socket is closed while we have buffered a partial message: it doesn't make sense to emit it, as listeners won't be able to decode it.
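As an illustration of the framing described above, here is a minimal sketch. `LineFramer`, `feed` and `close` are hypothetical names, not the classes from this PR. Note one deliberate difference: this sketch splits on the newline byte before decoding UTF-8 (rather than decoding first), which sidesteps multi-byte characters straddling two chunks.

```kotlin
/** Hypothetical helper (not the library's API): buffers socket bytes and emits complete newline-terminated messages. */
class LineFramer {
    // Bytes received so far that do not yet form a complete message.
    private var pending = ByteArray(0)

    /** Feeds one chunk and returns the messages it completed, without their trailing newline. */
    fun feed(chunk: ByteArray): List<String> {
        val data = pending + chunk
        val messages = mutableListOf<String>()
        var start = 0
        for (i in data.indices) {
            // 0x0A never appears inside a multi-byte UTF-8 sequence, so splitting on it is safe.
            if (data[i] == '\n'.code.toByte()) {
                messages.add(data.decodeToString(start, i))
                start = i + 1
            }
        }
        pending = data.copyOfRange(start, data.size)
        return messages
    }

    /** Called when the socket closes: the partial message is dropped, not emitted. Returns the number of dropped bytes. */
    fun close(): Int {
        val dropped = pending.size
        pending = ByteArray(0)
        return dropped
    }
}

fun main() {
    val framer = LineFramer()
    println(framer.feed("{\"id\":1}\n{\"id\"".encodeToByteArray()))
    println(framer.feed(":2}\n{\"id".encodeToByteArray()))
    println("dropped ${framer.close()} bytes of a partial message")
}
```

Dropping the buffered remainder in `close()` mirrors the behavior change in this commit: listeners would not be able to decode a truncated message anyway.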

Remove duplicate connectionState flow: this is a pure 1:1 mapping from `connectionStatus`, so there is no reason for that duplication. Clients can trivially migrate or `map` using the `toConnectionState` helper function.

We clean up the coroutine hierarchy: the `ElectrumClient` has a single internal coroutine that is launched once the connection has been established with the electrum server. This coroutine launches three child coroutines and uses supervision to gracefully stop if the connection is closed or an error is received. Connection establishment happens in the context of the job that calls `connect()` and doesn't throw exceptions. We revert the introduction of a `CoroutineExceptionHandler` inside the JVM socket code, as it shouldn't be needed anymore.

Add timeout to connection establishment: otherwise we may be stuck in the `Connecting` state.

Remove unnecessary RPC calls exposed: we just keep `getHeader` / `getHeaders`, which can be handy.

On most RPC calls, we can gracefully handle server errors. Since we cannot trust the electrum server anyway, and they may lie to us by omission, this doesn't downgrade the security model. The previous behavior was to throw an exception, which was never properly handled and would just result in a crash of the wallet application.
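To make the "gracefully handle server errors" idea concrete, here is a small sketch of how a caller might turn an `Either<ServerError, T>` result into a nullable value or an empty list instead of throwing. The `Either`, `ServerError` and `HistoryItem` types below are simplified stand-ins defined for this example, and `historyOrEmpty` / `valueOrNull` are hypothetical helper names, not functions from this PR.

```kotlin
// Simplified stand-ins for illustration only; the real types live in the library.
sealed class Either<out L, out R> {
    data class Left<L>(val value: L) : Either<L, Nothing>()
    data class Right<R>(val value: R) : Either<Nothing, R>()
}

data class ServerError(val message: String)
data class HistoryItem(val txid: String, val blockHeight: Int)

/** A server error on a history request degrades to "no history" instead of crashing the wallet. */
fun historyOrEmpty(response: Either<ServerError, List<HistoryItem>>): List<HistoryItem> = when (response) {
    is Either.Left -> emptyList()
    is Either.Right -> response.value
}

/** For single-value requests, a server error becomes null and the type checker forces callers to handle it. */
fun <T> valueOrNull(response: Either<ServerError, T>): T? = when (response) {
    is Either.Left -> null
    is Either.Right -> response.value
}

fun main() {
    println(historyOrEmpty(Either.Left(ServerError("internal error"))))
    println(historyOrEmpty(Either.Right(listOf(HistoryItem("txid-placeholder", 800_000)))))
    println(valueOrNull(Either.Left(ServerError("unknown method"))))
}
```

Since the server can already lie by omission, treating an explicit error like a missing answer does not give it any new power.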

We add an explicit timeout to RPC calls and a retry. If that retry also fails, two strategies are available:

- handle the error and gracefully degrade (non-critical RPC calls)
- disconnect and retry with a new electrum server (critical RPCs such as subscriptions)

The timeout can be updated by the application, for example when a slow network is detected or when Tor is activated.
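The timeout-plus-single-retry behavior can be sketched roughly as follows, assuming `kotlinx.coroutines` is on the classpath. `callWithTimeoutAndRetry` is a hypothetical helper written for this example and is not the `rpcCall` implementation from this PR; returning `null` stands in for "both attempts failed", after which the caller picks one of the two strategies.

```kotlin
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withTimeoutOrNull

/**
 * Hypothetical helper: runs [call] with a timeout, retries once if it timed out,
 * and returns null if the retry also timed out.
 */
suspend fun <T : Any> callWithTimeoutAndRetry(timeoutMillis: Long, call: suspend () -> T): T? {
    repeat(2) {
        // withTimeoutOrNull cancels the block and returns null when the timeout elapses.
        val result = withTimeoutOrNull(timeoutMillis) { call() }
        if (result != null) return result
    }
    return null
}

fun main(): Unit = runBlocking {
    var attempts = 0
    val result = callWithTimeoutAndRetry(timeoutMillis = 100) {
        attempts += 1
        if (attempts == 1) delay(5_000) // the first attempt does not answer in time
        "pong"
    }
    println("result=$result after $attempts attempts")

    // When both attempts fail, the caller decides: degrade gracefully (non-critical call)
    // or disconnect and try another server (critical call such as a subscription).
    val missing = callWithTimeoutAndRetry<String>(timeoutMillis = 50) { delay(5_000); "late" }
    println(if (missing == null) "giving up after retry" else "unexpected: $missing")
}
```

Passing the timeout as a parameter keeps it adjustable by the application, for example to use a larger value over Tor.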

sstone

As much as possible there should be a more specific description of the problems we are trying to solve; otherwise it's hard to evaluate how much better these changes are (which is not saying that the electrum client could not use some refactoring).
It may also be easier to decompose this into a small set of PRs that build upon each other: for example, iOS tests fail and it's difficult to see why or when they got broken.

src/commonMain/kotlin/fr/acinq/lightning/blockchain/electrum/ElectrumClient.kt

src/commonMain/kotlin/fr/acinq/lightning/utils/strings.kt

sstone · 2023-07-20T20:49:57Z

src/commonMain/kotlin/fr/acinq/lightning/blockchain/electrum/ElectrumClient.kt

    }

-    suspend inline fun <reified T : ElectrumResponse> rpcCall(request: ElectrumRequest): T {
+    /** Send a request using timeouts, retrying once and giving up after that retry. */
+    private suspend inline fun <reified T : ElectrumResponse> rpcCall(request: ElectrumRequest): Either<ServerError, T> {


Adding a default timeout is a very good idea but I'm not that sure about automated retries.

t-bast · 2023-07-25

We don't really have a choice: I think we have to retry to mitigate the case where we failed simply because we got disconnected and reconnected (as described in the comment of the `rpcCallWithTimeout` function). Otherwise we'll have a lot of easily avoidable failures (the behavior we have today is that the caller will hang forever waiting for a response).

t-bast · 2023-07-25T13:17:17Z

> As much as possible there should be a more specific description of the problems we are trying to solve otherwise it's hard to evaluate how much better these changes are

But that's exactly what I did in each commit message? I spent a lot of time creating a clean commit history, where each commit fixes specific issues. Can you tell me which commit is lacking details?

> It may also be easier to decompose this into a small set of PRs that build upon each other: for example, iOS tests fail and it's difficult to see why or when they got broken.

We could, but the split in many commits also achieves that. It's easy to locally test specific commits to see at what point tests started failing.

Note that the only tests that were failing are new tests that were added in this PR. I tried porting them to master, and they also fail there.

When connecting to an invalid address in tests on iOS, the error isn't reported by the native socket, which just hangs until the test times out. We lower the timeout on the `connect` call, which previously had the same value as the test timeout. We thus fail the connection attempt after 1 second and are able to reconnect to a different server.

The very low RPC timeout was flaky on iOS: because the clock used depends on the actual dispatcher, it sometimes triggered only after receiving a valid response from the server, which made the test fail. Setting it to 0ms ensures that the RPC call is immediately cancelled with a timeout exception regardless of the dispatcher used.

Add CoroutineExceptionHandler on tls sockets: we actually still need it until ktorio/ktor#3690 is integrated into a ktor release.

t-bast · 2023-07-28T13:00:33Z

This is now ready for review and e2e testing!

src/commonMain/kotlin/fr/acinq/lightning/blockchain/electrum/ElectrumClient.kt

src/commonMain/kotlin/fr/acinq/lightning/utils/strings.kt

src/jvmMain/kotlin/fr/acinq/lightning/io/JvmTcpSocket.kt

Remove remaining test delay: this was forgotten after e2e tests.

Close TCP socket when race with timeout: since `withTimeout` runs concurrently, it may throw the timeout exception after we've established the TCP connection, so we need to release the socket if we can (if we have a pointer to it).
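The race described here can be sketched in a simplified form, assuming `kotlinx.coroutines` is available. `FakeSocket` and `connectWithTimeout` are hypothetical names used only for this example, and the slow `handshake` step is a stand-in that makes the timeout fire after the socket already exists; the actual code in the PR may be structured differently.

```kotlin
import kotlinx.coroutines.TimeoutCancellationException
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withTimeout

/** Hypothetical stand-in for a TCP socket; it only records whether it was closed. */
class FakeSocket {
    var closed = false
        private set

    fun close() {
        closed = true
    }
}

/**
 * Opens a socket under a timeout. If the timeout fires after the socket was
 * created, the socket is closed before returning null so that it does not leak.
 */
suspend fun connectWithTimeout(
    timeoutMillis: Long,
    open: suspend () -> FakeSocket,
    handshake: suspend (FakeSocket) -> Unit
): FakeSocket? {
    // Kept outside the timeout block so that the catch clause has a pointer to release.
    var socket: FakeSocket? = null
    return try {
        withTimeout(timeoutMillis) {
            val s = open()
            socket = s
            handshake(s)
            s
        }
    } catch (e: TimeoutCancellationException) {
        socket?.close()
        null
    }
}

fun main(): Unit = runBlocking {
    val created = mutableListOf<FakeSocket>()
    val result = connectWithTimeout(
        timeoutMillis = 50,
        open = { FakeSocket().also { created.add(it) } },
        handshake = { delay(5_000) } // slower than the timeout
    )
    println("connected=${result != null}, socket released=${created.single().closed}")
}
```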

sstone

I tested this on Android without problems.

src/commonMain/kotlin/fr/acinq/lightning/blockchain/electrum/ElectrumClient.kt

pm47

It was high time for cleaning up the coroutine hierarchy, and relying on simple suspend functions at connection establishment is much simpler to reason about.

My main concern is the potential side effects of the new behavior, especially timeout errors when making calls to ElectrumClient. But the calls do not throw, they either:

- return a nullable value: type checks should protect us here
- for `getScriptHashHistory`/`getScriptHashUnspents`: return an empty list, which seems the appropriate thing to do.

So, I think we are fine.

There are a few things that have an impact on phoenix integration @dpad85:

- move `socketBuilder` to `connect()`
- remove old `connectionState` flow
- introduction of timeouts (15s by default)
- `getConfirmations` now takes `currentMinDepth`: that one seems the most annoying to me, as it makes the API (public and also internal) less friendly

src/commonMain/kotlin/fr/acinq/lightning/utils/strings.kt

src/commonMain/kotlin/fr/acinq/lightning/blockchain/electrum/ElectrumClientExtensions.kt

src/commonMain/kotlin/fr/acinq/lightning/blockchain/electrum/ElectrumClient.kt

Remove currentBlockHeight from extensions methods: we can actually get it from the connection status, which simplifies the way callers use those functions.

t-bast added 9 commits July 11, 2023 15:16

Remove unused ElectrumClientCommand

813b520

Move socketBuilder to connect()

94f2b90

Remove duplicate connectionState flow

e66b313

Add timeout to connection establishment

e914f3e

Remove unnecessary RPC calls exposed

0377d0d

t-bast requested review from pm47, sstone and dpad85 July 12, 2023 10:52

sstone reviewed Jul 20, 2023

t-bast mentioned this pull request Jul 24, 2023

Get transaction's confirmation information (including block timestamp) #513

Open

t-bast force-pushed the electrum-client-rework branch 2 times, most recently from 4676981 to 885d82f Compare July 27, 2023 08:03

t-bast added 2 commits July 28, 2023 13:57

Add CoroutineExceptionHandler on tls sockets

2b70f09

t-bast force-pushed the electrum-client-rework branch from 99d635d to 2b70f09 Compare July 28, 2023 11:58

t-bast marked this pull request as ready for review July 28, 2023 13:00

sstone reviewed Aug 7, 2023

src/commonMain/kotlin/fr/acinq/lightning/blockchain/electrum/ElectrumClient.kt

src/commonMain/kotlin/fr/acinq/lightning/utils/strings.kt

src/jvmMain/kotlin/fr/acinq/lightning/io/JvmTcpSocket.kt

t-bast added 2 commits August 8, 2023 15:55

Remove remaining test delay

d691a0b

Close TCP socket when race with timeout

4502d86

sstone previously approved these changes Aug 10, 2023

src/commonMain/kotlin/fr/acinq/lightning/blockchain/electrum/ElectrumClient.kt

src/commonMain/kotlin/fr/acinq/lightning/blockchain/electrum/ElectrumClient.kt

Add more comments on mailbox and Supervisation

b2f72d3

t-bast dismissed sstone’s stale review via b2f72d3 August 11, 2023 08:32

s/block/suspend

ad86714

sstone previously approved these changes Aug 14, 2023

t-bast mentioned this pull request Aug 18, 2023

App crashes when Tor is quickly turned on and off ACINQ/phoenix#396

Closed

pm47 reviewed Sep 5, 2023

Remove currentBlockHeight from extensions methods

75ab436

t-bast dismissed sstone’s stale review via 75ab436 September 6, 2023 13:35

pm47 mentioned this pull request Sep 11, 2023

Fix race condition in connection to Peer #526

Merged

pm47 approved these changes Sep 11, 2023

t-bast merged commit fdba64a into master Sep 11, 2023
2 checks passed

t-bast deleted the electrum-client-rework branch September 11, 2023 13:11

dpad85 mentioned this pull request Sep 13, 2023

Update for ElectrumClient rework ACINQ/phoenix#418

Merged
