Fix set router epoch on runtime invalidate. #1046

Merged: 2 commits, Dec 12, 2017

Member

@zalokhan zalokhan commented Dec 6, 2017

Overview

Description: In CorfuRuntime, fetchLayout fetches the latest layout from a random layout server and then loops over the client routers, setting each router's epoch to the newer, higher epoch number.
However, if the first router in the loop throws a network exception, the loop is aborted and the epochs of the remaining routers are never set.
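
A minimal sketch of the failure mode described above and of a per-router fix, assuming hypothetical stand-in types (`Router`, `SimulatedNetworkException`) and method names (`updateAllOrAbort`, `updateEach`). None of these names come from CorfuRuntime, and the actual change in this PR may differ:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; these types are illustrative and are not CorfuDB's API. */
class EpochUpdateSketch {

    /** Stand-in for a network failure raised while reaching an endpoint. */
    static class SimulatedNetworkException extends RuntimeException {
        SimulatedNetworkException(String endpoint) {
            super("cannot reach " + endpoint);
        }
    }

    /** Stand-in for a client router that remembers the epoch it was given. */
    static class Router {
        final String endpoint;
        final boolean reachable;
        long epoch; // starts at 0

        Router(String endpoint, boolean reachable) {
            this.endpoint = endpoint;
            this.reachable = reachable;
        }

        void setEpoch(long newEpoch) {
            if (!reachable) {
                throw new SimulatedNetworkException(endpoint);
            }
            epoch = newEpoch;
        }
    }

    /** Buggy shape: the first failure escapes the loop, so later routers keep the old epoch. */
    static void updateAllOrAbort(List<Router> routers, long newEpoch) {
        for (Router router : routers) {
            router.setEpoch(newEpoch);
        }
    }

    /** Fixed shape: failures are handled per router. Returns the endpoints that failed. */
    static List<String> updateEach(List<Router> routers, long newEpoch) {
        List<String> failed = new ArrayList<>();
        for (Router router : routers) {
            try {
                router.setEpoch(newEpoch);
            } catch (SimulatedNetworkException e) {
                // Record the failure and keep going so the other routers are still updated.
                failed.add(router.endpoint);
            }
        }
        return failed;
    }
}
```

With one unreachable router at the head of the list, `updateAllOrAbort` leaves the later routers at their old epoch, while `updateEach` still advances them and reports the endpoint that failed.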

Why should this be merged: To avoid wrong epoch exceptions and set the router epochs correctly.

Related issue(s) (if applicable): #1044

Checklist (Definition of Done):

  • There are no TODOs left in the code
  • Coding conventions (e.g. for logging, unit tests) have been followed
  • Change is covered by automated tests
  • Public API has Javadoc

@coveralls

Coverage Status

Changes Unknown when pulling d1fc0f5 on zalokhan:runtimeBugFix into CorfuDB:master.

Contributor

@rogermichoud rogermichoud left a comment

Maybe a small test?

@corfudb-performance
Collaborator

Results automatically generated by CorfuDB Benchmark Framework to assess the performance of this pull request for commit d1fc0f5.

*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 10 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 10 threads, Disk mode

An interactive dashboard with Pull Request Performance Metrics for ALL cluster types and numbers of threads in run, is available at:
Pull Request #1046 Graphs

@zalokhan force-pushed the runtimeBugFix branch 2 times, most recently from b507ff2 to dfa6fed, December 6, 2017 21:21
@zalokhan
Member Author

zalokhan commented Dec 6, 2017

@rogermichoud Added a test.
Verification: The test fails on master.

@coveralls

Coverage Status

Changes Unknown when pulling 0ec14b0 on zalokhan:runtimeBugFix into CorfuDB:master.

Member

@no2chem no2chem left a comment

See comments

final AtomicReference<String> failedNode = new AtomicReference<>();

CorfuRuntime.overrideGetRouterFunction = (corfuRuntime, endpoint) -> {
    if (failedNode.get() != null && endpoint.equals(failedNode.get())) {
Member

Can you just move this logic to AbstractViewTest::getRouterFunction?

Member Author

Won't that complicate the usage? Too many control knobs?

Member

Ah, I see what is happening here. Is there a reason to throw these exceptions? It seems like they aren't picked up in your test.

Member

Never mind, I see the logic. It's an unfortunate side effect of the test router not being able to simulate a closed connection. Add that to the comments (that this is simulating a failed node), and I think we'll be okay.

Member Author

@zalokhan zalokhan Dec 6, 2017

Firstly, the test routers never throw network exceptions, and I needed to test that case specifically.
These exceptions are caught by the CorfuRuntime while it updates the router epochs.

Secondly, this is very specific to this test and I wonder if anyone else would ever want to throw this exception at this point in the code.

Contributor

I introduced a "simulated network exception" at the test router level. Could you leverage that? It is much less elaborate than your approach, so I'm not sure it serves your purpose, but it does exactly this: it throws a network exception from the router.

The method is simulateDisconnectedEndpoint(), in TestClientRouter.

Member Author

Sounds good. I'll add a dependency on #982 and use your test addition.

@@ -350,6 +350,16 @@ public String getEndpoint(int port) {
        return "test:" + port;
    }

/**
Member

If you move the logic as suggested in the comment above, this function is no longer necessary.

Contributor

@rogermichoud rogermichoud left a comment

Testing suggestion.

@corfudb-performance
Collaborator

Results automatically generated by CorfuDB Benchmark Framework to assess the performance of this pull request for commit 0ec14b0.

*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 10 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 10 threads, Disk mode

An interactive dashboard with Pull Request Performance Metrics for ALL cluster types and numbers of threads in run, is available at:
Pull Request #1046 Graphs

@zalokhan
Member Author

@rogermichoud @no2chem
Sorry for realizing this so late.
I am not able to reproduce the error using the new simulateEndpointDisconnect feature in the TestClientRouter.
This is because the NettyClientRouter, unlike the TestClientRouter, calls the start() method in its constructor, which can at times throw a NetworkException if the connection cannot be made.
Simulating this will require a different approach.

So for now, I'll stick to my existing test and make a few changes which @no2chem suggested in the review.
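
The existing test's approach (a router-lookup override that throws for one endpoint, standing in for a connection failure at construction time) might look roughly like the sketch below. `FailedNodeSketch`, `Router`, `SimulatedNetworkException`, and `failingLookup` are hypothetical names for illustration only, and the real `overrideGetRouterFunction` takes a runtime and an endpoint rather than the single-argument function assumed here:

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

/** Hypothetical sketch; the names below are illustrative, not CorfuDB's test API. */
class FailedNodeSketch {

    /** Stand-in for the exception a router constructor may throw when it cannot connect. */
    static class SimulatedNetworkException extends RuntimeException {
        SimulatedNetworkException(String endpoint) {
            super("simulated connection failure to " + endpoint);
        }
    }

    /** Stand-in for a router; only the endpoint matters for this sketch. */
    static class Router {
        final String endpoint;

        Router(String endpoint) {
            this.endpoint = endpoint;
        }
    }

    /**
     * Wraps a router lookup so that requests for the endpoint currently held in
     * failedNode throw, mimicking a connection failure at construction time.
     * While failedNode holds null, every lookup is passed through to the delegate.
     */
    static Function<String, Router> failingLookup(
            AtomicReference<String> failedNode,
            Function<String, Router> delegate) {
        return endpoint -> {
            // Simulate a failed node: refuse to hand out a router for it.
            if (endpoint.equals(failedNode.get())) {
                throw new SimulatedNetworkException(endpoint);
            }
            return delegate.apply(endpoint);
        };
    }
}
```

Setting `failedNode` to an endpoint makes later lookups for that endpoint throw, and resetting it to `null` restores normal lookups, so a test can fail and recover a node without touching the other routers.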

@coveralls

Coverage Status

Coverage increased (+0.09%) to 70.634% when pulling c51489f on zalokhan:runtimeBugFix into 0c92e3c on CorfuDB:master.

@corfudb-performance
Collaborator

Results automatically generated by CorfuDB Benchmark Framework to assess the performance of this pull request for commit c51489f.

*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 10 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 10 threads, Disk mode

An interactive dashboard with Pull Request Performance Metrics for ALL cluster types and numbers of threads in run, is available at:
Pull Request #1046 Graphs

@no2chem
Member

no2chem commented Dec 12, 2017

@zalokhan did you update this? I think it needs a comment describing what you're doing in CorfuRuntimeTest (simulating failed connections).

@zalokhan
Member Author

@no2chem Added the comment.

no2chem previously approved these changes Dec 12, 2017
Member

@no2chem no2chem left a comment

LGTM

@codecov

codecov bot commented Dec 12, 2017

Codecov Report

Merging #1046 into master will decrease coverage by 0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1046      +/-   ##
==========================================
- Coverage   66.65%   66.61%   -0.04%     
==========================================
  Files         201      201              
  Lines        9593     9594       +1     
  Branches      969      970       +1     
==========================================
- Hits         6394     6391       -3     
- Misses       2825     2826       +1     
- Partials      374      377       +3
Impacted Files Coverage Δ
...rc/main/java/org/corfudb/runtime/CorfuRuntime.java 72.51% <100%> (+0.6%) ⬆️
...udb/runtime/object/CorfuCompileWrapperBuilder.java 85.71% <0%> (-4.77%) ⬇️
...in/java/org/corfudb/runtime/view/AbstractView.java 55.76% <0%> (-3.85%) ⬇️
.../org/corfudb/protocols/wireprotocol/IMetadata.java 84.05% <0%> (-2.9%) ⬇️
...a/org/corfudb/infrastructure/ManagementServer.java 75.48% <0%> (-2.34%) ⬇️
...va/org/corfudb/protocols/wireprotocol/LogData.java 84.16% <0%> (-0.84%) ⬇️
...src/main/java/org/corfudb/runtime/view/Layout.java 60.13% <0%> (-0.7%) ⬇️
...udb/runtime/view/stream/BackpointerStreamView.java 82.32% <0%> (-0.56%) ⬇️
...org/corfudb/runtime/clients/NettyClientRouter.java 73.88% <0%> (-0.38%) ⬇️
... and 5 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 87cc9e9...2e45aae.

@no2chem no2chem dismissed stale reviews from rogermichoud and themself December 12, 2017 22:07

See @zalokhan's comment

Member

@no2chem no2chem left a comment

LGTM

@no2chem no2chem merged commit ffff81f into CorfuDB:master Dec 12, 2017
@zalokhan zalokhan deleted the runtimeBugFix branch December 12, 2017 22:10

5 participants