eth1 endpoints validation #3869

tbenr · 2021-04-18T20:02:19Z

PR Description

Implements Eth1 Provider validation, completing #3832

TODO

implement tests in Web3jEth1MonitorableProviderTest

Documentation

I thought about documentation and added the documentation label to this PR if updates are required.

Changelog

I thought about adding a changelog entry, and added one if I deemed necessary.

ajsutton

I think this is looking good. Mostly just need to think about thread safety in terms of the structure. We'll also need to test it out a bit and make sure that we don't wind up logging too many errors when nodes are down or in the wrong state - eth1 can very easily fill up the logs with errors but it's not actually that big a problem if the eth1 endpoint is unavailable.

pow/src/main/java/tech/pegasys/teku/pow/AbstractMonitorableEth1Provider.java

ajsutton · 2021-04-18T22:26:24Z

pow/src/main/java/tech/pegasys/teku/pow/FallbackAwareEth1Provider.java

                .exceptionallyCompose(
                    err -> {
+                      LOG.warn("Retrying with next eth1 endpoint", err);


This WARN and the error below will wind up being excessively noisy when a provider is unavailable. We probably need to keep them at debug level until we address #3855, which can take a more holistic approach to it.

This should not be the case, since I'm skipping non-working endpoints until the next validation check.

pow/src/main/java/tech/pegasys/teku/pow/Web3jEth1Provider.java

services/powchain/src/main/java/tech/pegasys/teku/services/powchain/Eth1ProviderMonitor.java

tbenr · 2021-04-19T06:40:57Z

@ajsutton another thing to improve is the startup phase. I was thinking of making the retry in FallbackAwareEth1Provider::run silent until the Monitor has completed a validation cycle.

I'll fix broken tests too :)

ajsutton · 2021-04-19T22:10:00Z

@ajsutton another thing to improve is the startup phase. I was thinking of making the retry in FallbackAwareEth1Provider::run silent until the Monitor has completed a validation cycle.

I'll fix broken tests too :)

We got the build running for you automatically now. :) Btw, if you're running acceptance tests locally - they use the Docker image, so run ./gradlew distDocker to "deploy" any changes to production code. You can then just run the test itself in IntelliJ, which makes life easy.

ajsutton · 2021-04-19T22:16:55Z

@ajsutton another thing to improve is the startup phase. I was thinking of making the retry in FallbackAwareEth1Provider::run silent until the Monitor has completed a validation cycle.

Forgot to put the response to this bit in. I think that makes a lot of sense - startup is certainly a bit of a special case. It would be nice to be able to handle it cleanly where that first validation cycle returns a SafeFuture so we can just wait for it to complete before sending requests to that node rather than having to retry, but it's not a big deal.

ajsutton · 2021-04-20T00:32:50Z

Btw, my guess for the failing tests is that the chain ID we're expecting for the Eth1 node doesn't match the actual chain ID, so we now ignore the endpoint entirely, whereas previously we just logged a warning.

20:30:36.046 ERROR - PLEASE CHECK YOUR ETH1 NODE (endpoint [besunode5:8545 [1]])| Wrong Eth1 chain id (expected=5, actual=2018)

The simplest solution is probably just to set the chain ID in the Besu genesis file: https://github.com/ConsenSys/teku/blob/74ab474b984de7ffe338e87a6e8ab0d7874a7429/acceptance-tests/src/testFixtures/resources/besu/depositContractGenesis.json#L3

tbenr · 2021-04-20T04:47:55Z

@ajsutton another thing to improve is the startup phase. I was thinking of making the retry in FallbackAwareEth1Provider::run silent until the Monitor has completed a validation cycle.

I'll fix broken tests too :)

We got the build running for you automatically now. :) Btw, if you're running acceptance tests locally - they use the Docker image, so run ./gradlew distDocker to "deploy" any changes to production code. You can then just run the test itself in IntelliJ, which makes life easy.

I was already doing that, but that time I forgot it 😁

tbenr · 2021-04-20T04:54:54Z

@ajsutton another thing to improve is the startup phase. I was thinking of making the retry in FallbackAwareEth1Provider::run silent until the Monitor has completed a validation cycle.

Forgot to put the response to this bit in. I think that makes a lot of sense - startup is certainly a bit of a special case. It would be nice to be able to handle it cleanly where that first validation cycle returns a SafeFuture so we can just wait for it to complete before sending requests to that node rather than having to retry, but it's not a big deal.

I already did it (not exactly as you described, which is far better), but I want to improve it by notifying at the first success or at the end, so we can start as soon as a valid endpoint is found (letting timeouts go their way).

tbenr · 2021-04-20T20:44:43Z

@ajsutton I implemented everything I wanted to. I'm overall satisfied!

ajsutton

This is really looking good. I've left a bunch of comments but they're mostly small. I'm going to spend some time doing some manual testing locally as well and pay attention to what logs come out etc but this is great work.

ajsutton · 2021-04-20T23:05:15Z

pow/src/main/java/tech/pegasys/teku/pow/AbstractMonitorableEth1Provider.java

+    success,
+    failed


nit: by convention enum names are generally all upper case.

ajsutton · 2021-04-20T23:07:07Z

pow/src/main/java/tech/pegasys/teku/pow/AbstractMonitorableEth1Provider.java

+        }
+    }
+    // should never occur
+    return true;


nit: I'd make this a default case that throws an exception so we get a very loud error if a new enum variant is added for some reason.

Suggested change

-        }
-    }
-    // should never occur
-    return true;
+          default:
+            throw new IllegalStateException("Unknown result type: " + lastValidationResult);
+        }
+        default:
+          throw new IllegalStateException("Unknown result type: " + lastCallResult);
+    }

I'm not entirely sure I have that suggested change right but it should show the idea. :)
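To make the two nits above concrete, here is a minimal, self-contained sketch of an enum with upper-case constants and a switch whose default case throws. The names `ValidationResult` and `ValidationState` are hypothetical and are not the ones used in the PR; only the shape of the switch is the point.

```java
// Hypothetical enum; constants are upper case by convention.
enum ValidationResult {
  SUCCESS,
  FAILED
}

class ValidationState {
  // Maps the last validation result to a boolean. The default case fails
  // loudly if a new constant is added without updating this switch.
  static boolean isValid(final ValidationResult lastValidationResult) {
    switch (lastValidationResult) {
      case SUCCESS:
        return true;
      case FAILED:
        return false;
      default:
        throw new IllegalStateException("Unknown result type: " + lastValidationResult);
    }
  }
}
```

With this shape there is no need for a trailing "should never occur" return, because every path either returns or throws.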

ajsutton · 2021-04-20T23:11:09Z

pow/src/main/java/tech/pegasys/teku/pow/Eth1Provider.java

+    String hostname;
+    try {
+      String tmp = Splitter.on("://").splitToList(endpoint).get(1);
+      hostname = Splitter.on("/").splitToList(tmp).get(0);
+    } catch (Exception e) {
+      hostname = "unknown";
+    }


We probably shouldn't reinvent URL parsing here. I'd suggest something like:

Suggested change

-    String hostname;
-    try {
-      String tmp = Splitter.on("://").splitToList(endpoint).get(1);
-      hostname = Splitter.on("/").splitToList(tmp).get(0);
-    } catch (Exception e) {
-      hostname = "unknown";
-    }
+    String hostname;
+    try {
+      final URI uri = new URI(endpoint);
+      if (uri.getPort() != -1) {
+        hostname = uri.getHost() + ":" + uri.getPort();
+      } else {
+        hostname = uri.getHost();
+      }
+    } catch (URISyntaxException e) {
+      hostname = "unknown";
+    }

yeah. don't know why I did that :)
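For reference, a standalone sketch of the `java.net.URI` approach suggested above, wrapped in a hypothetical helper (`EndpointLabel.hostnameOf` is not a name from the PR). It adds one assumption on top of the suggestion: `URI.getHost()` returns null when the input has no parseable authority (for example an endpoint given without a scheme), so the sketch maps that case to "unknown" as well.

```java
import java.net.URI;
import java.net.URISyntaxException;

class EndpointLabel {
  // Returns "host" or "host:port" for logging, or "unknown" if no host can be parsed.
  static String hostnameOf(final String endpoint) {
    try {
      final URI uri = new URI(endpoint);
      final String host = uri.getHost();
      if (host == null) {
        // e.g. "localhost:8545" parses as scheme "localhost" with no authority part
        return "unknown";
      }
      return uri.getPort() != -1 ? host + ":" + uri.getPort() : host;
    } catch (URISyntaxException e) {
      return "unknown";
    }
  }
}
```

Only the host and port are kept, so any path or query part of the endpoint URL stays out of the log label.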

ajsutton · 2021-04-20T23:13:43Z

pow/src/main/java/tech/pegasys/teku/pow/Eth1ProviderSelector.java

+    this.initialValidationCompleted = new SafeFuture<>();
+  }
+
+  public class ValidEth1ProviderIterator {


nit: Typically we put internal classes at the bottom of the file.

ajsutton · 2021-04-20T23:18:53Z

pow/src/main/java/tech/pegasys/teku/pow/Web3jEth1Provider.java

+                if (chainId.intValueExact() != Constants.DEPOSIT_CHAIN_ID) {
+                  STATUS_LOG.eth1DepositChainIdMismatch(
+                      Constants.DEPOSIT_CHAIN_ID, chainId.intValueExact(), this.id);
+                  throw new RuntimeException("Wrong Chainid");


Should this throw an exception or just return false?
And should it update the last validation result?

well, ~~the idea of throwing there was to delegate everything to the handleComposed~~

ajsutton · 2021-04-20T23:41:33Z

services/powchain/src/main/java/tech/pegasys/teku/services/powchain/Eth1ProviderMonitor.java

+    // let's prepare a parallel validation stream
+    Stream<SafeFuture<Boolean>> validationStream =
+        eth1ProviderSelector
+            .getProviders()
+            .parallelStream()
+            .filter(MonitorableProvider::needsToBeValidated)
+            .map(MonitorableProvider::validate);
+
+    if (eth1ProviderSelector.isInitialValidationCompleted()) {
+      // if we already notified a completion, just execute all validations.
+      validationStream.forEach(isValidFuture -> isValidFuture.always(() -> {}));
+    } else {
+      // otherwise let's notify a validation completion as soon as we have a valid endpoint or in
+      // any case at the end of all validations.
+      SafeFuture.allOf(
+              validationStream
+                  .map(
+                      isValidFuture ->
+                          isValidFuture.thenApply(
+                              (isValid) -> {
+                                if (isValid) {
+                                  eth1ProviderSelector.notifyValidationCompletion();
+                                }
+                                return null;
+                              }))
+                  .toArray(SafeFuture[]::new))
+          .always(eth1ProviderSelector::notifyValidationCompletion);
+    }


Given that it's safe to complete a future multiple times, it's ok if we call eth1ProviderSelector::notifyValidationCompletion multiple times as well. So I think I'd remove this if and just always use the else case.

We can also simplify a little by using SafeFuture.thenPeek.

And given that all the requests are made async anyway, I would just use a normal .stream() rather than making it parallel - involving more threads won't help and may actually be slower due to the overhead of moving across threads.

    SafeFuture.allOf(
            eth1ProviderSelector.getProviders().stream()
                .filter(MonitorableProvider::needsToBeValidated)
                .map(MonitorableProvider::validate)
                .map(
                    isValidFuture ->
                        isValidFuture.thenPeek(
                            isValid -> {
                              if (isValid) {
                                eth1ProviderSelector.notifyValidationCompletion();
                              }
                            }))
                .toArray(SafeFuture[]::new))
        .alwaysRun(eth1ProviderSelector::notifyValidationCompletion)
        .finish(error -> LOG.error("Unexpected error while validating eth1 endpoints", error));

The .alwaysRun(...).finish() at the end just ensures that if there's an exception thrown we do wind up logging it with some context about what was happening. We could potentially use .reportExceptions() but it doesn't provide as much useful context and tends to be a lot noisier.

So I think I'd remove this if and just always use the else case.

Since we take the else branch only once, I leaned towards saving an array of SafeFutures and some needless lambda executions, at the cost of code neatness.

And given that all the requests are made async anyway, I would just use a normal .stream()

yeah very good point.

The .alwaysRun(...).finish() at the end just ensures that if there's an exception thrown we do wind up logging it with some context about what was happening. We could potentially use .reportExceptions() but it doesn't provide as much useful context and tends to be a lot noisier.

👍 👍

yeah it's a tiny bit wasteful to notify validation complete multiple times but given we run this so rarely and the cost of those calls is so low I don't think you could notice the difference either way. The benefit of clearer code is definitely worth it in this case (as it usually is - generally the simplest code is also the fastest code).
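The "notify on the first valid endpoint, or at the end in any case" idea discussed in this thread can be sketched with the JDK's `CompletableFuture` standing in for Teku's `SafeFuture`. This is an illustration only: the class and method names are hypothetical, and `thenAccept` / `whenComplete` are used here as rough stand-ins for `thenPeek` / `alwaysRun`.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

class FirstValidNotifier {
  // Hypothetical stand-in for the selector's "initial validation completed" future.
  final CompletableFuture<Void> initialValidationCompleted = new CompletableFuture<>();

  // Safe to call repeatedly: complete() has no effect after the first call.
  void notifyValidationCompletion() {
    initialValidationCompleted.complete(null);
  }

  // Notifies as soon as any validation reports true, and again once all have finished.
  CompletableFuture<Void> validateAll(final List<CompletableFuture<Boolean>> validations) {
    final CompletableFuture<?>[] peeked =
        validations.stream()
            .map(
                validation ->
                    validation.thenAccept(
                        isValid -> {
                          if (isValid) {
                            notifyValidationCompletion();
                          }
                        }))
            .toArray(CompletableFuture[]::new);
    return CompletableFuture.allOf(peeked)
        .whenComplete((ignored, error) -> notifyValidationCompletion());
  }
}
```

Because completing an already-completed future is a no-op, the same code path can run on every validation cycle without a separate "already notified" branch.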

ajsutton · 2021-04-20T23:46:33Z

pow/src/main/java/tech/pegasys/teku/pow/Web3jEth1Provider.java

+                validating.set(false);
+                return futureReturn;
+              });


For safety, we should set validating to false in an .alwaysRun block. That way we guarantee validating gets set back to false even if an unexpected exception gets thrown in this handling code.

Suggested change

-                validating.set(false);
-                return futureReturn;
-              });
+                return futureReturn;
+              })
+          .alwaysRun(() -> validating.set(false));

.alwaysRun here is like a finally block in a try/catch.
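A small sketch of the "finally for futures" idea, using the JDK's `CompletableFuture.whenComplete` as a rough analogue of `SafeFuture.alwaysRun` (the class `GuardedValidation` and its `check` parameter are hypothetical). The flag is reset whether the check future completes normally or exceptionally.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

class GuardedValidation {
  final AtomicBoolean validating = new AtomicBoolean(false);
  private volatile boolean lastValid = false;

  // Runs the supplied check unless one is already in flight.
  CompletableFuture<Boolean> validate(final Supplier<CompletableFuture<Boolean>> check) {
    if (!validating.compareAndSet(false, true)) {
      // Already validating: report the last known state instead of starting another check.
      return CompletableFuture.completedFuture(lastValid);
    }
    return check
        .get()
        .exceptionally(error -> false) // treat a failed check as "invalid"
        .thenApply(
            isValid -> {
              lastValid = isValid;
              return isValid;
            })
        // Runs on both normal and exceptional completion, like a finally block.
        .whenComplete((result, error) -> validating.set(false));
  }
}
```

Placing the reset in the last stage means an unexpected exception in the earlier stages cannot leave `validating` stuck at true.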

ajsutton · 2021-04-20T23:47:49Z

pow/src/main/java/tech/pegasys/teku/pow/Web3jEth1Provider.java

+  @Override
+  public SafeFuture<Boolean> validate() {
+    if (validating.compareAndSet(false, true)) {
+      LOG.info("Validating endpoint {} ...", this.id);


This is a routine thing to have happen so probably just log at debug level.

ajsutton · 2021-04-20T23:47:59Z

pow/src/main/java/tech/pegasys/teku/pow/Web3jEth1Provider.java

+                  updateLastValidation(Result.failed);
+                  futureReturn.complete(Boolean.FALSE);
+                } else {
+                  LOG.info("Endpoint {} is VALID", this.id);


This is also routine so can just be debug level.

ajsutton · 2021-04-21T00:08:44Z

pow/src/main/java/tech/pegasys/teku/pow/Web3jEth1Provider.java

@@ -162,4 +190,46 @@ public Web3jEth1Provider(final Web3j web3j, final AsyncRunner asyncRunner) {
              return (List<EthLog.LogResult<?>>) (List) logs;
            });
  }
+
+  @Override
+  public SafeFuture<Boolean> validate() {


Spent a bit of time working out how to pull all the advice below together, and this is what I came up with to replace this whole method:

    @Override
    public SafeFuture<Boolean> validate() {
      if (validating.compareAndSet(false, true)) {
        LOG.debug("Validating endpoint {} ...", this.id);
        return validateChainId()
            .thenCompose(
                result -> {
                  if (result == Result.failed) {
                    return SafeFuture.completedFuture(result);
                  } else {
                    return validateSyncing();
                  }
                })
            .thenApply(
                result -> {
                  updateLastValidation(result);
                  return result == Result.success;
                })
            .exceptionally(
                error -> {
                  LOG.warn(
                      "Endpoint {} is INVALID | {}",
                      this.id,
                      Throwables.getRootCause(error).getMessage());
                  updateLastValidation(Result.failed);
                  return false;
                })
            .alwaysRun(() -> validating.set(false));
      } else {
        LOG.debug("Already validating");
        return SafeFuture.completedFuture(isValid());
      }
    }

    private SafeFuture<Result> validateChainId() {
      return getChainId()
          .thenApply(
              chainId -> {
                if (chainId.intValueExact() != Constants.DEPOSIT_CHAIN_ID) {
                  STATUS_LOG.eth1DepositChainIdMismatch(
                      Constants.DEPOSIT_CHAIN_ID, chainId.intValueExact(), this.id);
                  return Result.failed;
                }
                return Result.success;
              });
    }

    private SafeFuture<Result> validateSyncing() {
      return ethSyncing()
          .thenApply(
              syncing -> {
                if (syncing) {
                  LOG.warn("Endpoint {} is INVALID | Still syncing", this.id);
                  updateLastValidation(Result.failed);
                  return Result.failed;
                } else {
                  LOG.debug("Endpoint {} is VALID", this.id);
                  updateLastValidation(Result.success);
                  return Result.success;
                }
              });
    }

good point on splitting things up..

tbenr · 2021-04-21T07:18:20Z

Thanks @ajsutton for the valuable comments!
Going to study and merge them during the day

updated chainid for acceptance tests

ajsutton

LGTM. This is really excellent work. Thanks so much for contributing this.

ajsutton reviewed Apr 18, 2021

stefa2k mentioned this pull request Apr 19, 2021

Failover for eth1 node stereum-dev/ethereum2-docker-compose#31

tbenr force-pushed the validate_eth1_endpoints branch from a4bcd22 to 9ec358a on April 19, 2021 18:39

tbenr force-pushed the validate_eth1_endpoints branch from 00f56d8 to e3cf210 on April 20, 2021 16:49

ajsutton reviewed Apr 21, 2021

tbenr added 7 commits April 21, 2021 18:44

eth1 endpoints validation

89dd786

various improvements

b29576c

make "no available endpoints" log to debug level

f5163eb

improvements on startup management

6e18c66

updated chainid for acceptance tests

more improvements and Web3jEth1MonitorableProvider test implementation

3cdc812

minor things

e0656c5

integrate adrian's suggestions

54f6540

tbenr force-pushed the validate_eth1_endpoints branch from 988fae5 to 54f6540 on April 21, 2021 16:58

Merge branch 'master' into validate_eth1_endpoints

4629bfb

ajsutton approved these changes Apr 21, 2021

ajsutton merged commit a2fdfff into Consensys:master Apr 21, 2021

tbenr deleted the validate_eth1_endpoints branch April 22, 2021 07:20
