-
Notifications
You must be signed in to change notification settings - Fork 123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix set router epoch on runtime invalidate. #1046
Conversation
5d2f6ff
to
d1fc0f5
Compare
Changes Unknown when pulling d1fc0f5 on zalokhan:runtimeBugFix into ** on CorfuDB:master**. |
Changes Unknown when pulling d1fc0f5 on zalokhan:runtimeBugFix into ** on CorfuDB:master**. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a small test?
Results automatically generated by CorfuDB Benchmark Framework to assess the performance of this pull request for commit d1fc0f5. *** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 1 threads, Disk mode An interactive dashboard with Pull Request Performance Metrics for ALL cluster types and numbers of threads in run, is available at: |
b507ff2
to
dfa6fed
Compare
@rogermichoud Added a test. |
dfa6fed
to
0ec14b0
Compare
Changes Unknown when pulling 0ec14b0 on zalokhan:runtimeBugFix into ** on CorfuDB:master**. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments
final AtomicReference<String> failedNode = new AtomicReference<>(); | ||
|
||
CorfuRuntime.overrideGetRouterFunction = (corfuRuntime, endpoint) -> { | ||
if (failedNode.get() != null && endpoint.equals(failedNode.get())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you just move this logic to AbstractViewTest::getRouterFunction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't that complicate the usage? Too many control knobs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see what is happening here. Is there a reason to throw these exceptions (it seems like they aren't picked up in your test?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind, I see the logic. It's an unfortunate side effect of the test router not being able to simulate a closed connection. Add that to the comments (that this is simulating a failed node), and I think we'll be okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Firstly the test routers never throw Network Exceptions and I needed to test them specifically.
These exceptions are caught by the CorfuRuntime while updating the router epochs.
Secondly, this is very specific to this test and I wonder if anyone else would ever want to throw this exception at this point in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I introduced a "simulated network exception" at the test router level. Can you not leverage this? It is way less fancy than that, so not sure it serves your purpose. That is what this method does, throwing a network exception from the router.
The method is simulateDisconnectedEndpoint(), in TestClientRouter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good I'll add a dependency on #982 and consume your test addition.
@@ -350,6 +350,16 @@ public String getEndpoint(int port) { | |||
return "test:" + port; | |||
} | |||
|
|||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you move the logic as suggested in the comment above, this function is no longer necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing suggestion.
final AtomicReference<String> failedNode = new AtomicReference<>(); | ||
|
||
CorfuRuntime.overrideGetRouterFunction = (corfuRuntime, endpoint) -> { | ||
if (failedNode.get() != null && endpoint.equals(failedNode.get())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I introduced a "simulated network exception" at the test router level. Can you not leverage this? It is way less fancy than that, so not sure it serves your purpose. That is what this method does, throwing a network exception from the router.
The method is simulateDisconnectedEndpoint(), in TestClientRouter.
Results automatically generated by CorfuDB Benchmark Framework to assess the performance of this pull request for commit 0ec14b0. *** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 1 threads, Disk mode An interactive dashboard with Pull Request Performance Metrics for ALL cluster types and numbers of threads in run, is available at: |
@rogermichoud @no2chem So for now, I'll stick to my existing test and make a few changes which @no2chem suggested in the review. |
0ec14b0
to
c51489f
Compare
Results automatically generated by CorfuDB Benchmark Framework to assess the performance of this pull request for commit c51489f. *** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 1 threads, Disk mode An interactive dashboard with Pull Request Performance Metrics for ALL cluster types and numbers of threads in run, is available at: |
@zalokhan did you update this? I think it needs a comment describing what you're doing in CorfuRuntimeTest (simulating failed connections). |
c51489f
to
706c3c3
Compare
706c3c3
to
e08579a
Compare
@no2chem Added the comment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Codecov Report
@@ Coverage Diff @@
## master #1046 +/- ##
==========================================
- Coverage 66.65% 66.61% -0.04%
==========================================
Files 201 201
Lines 9593 9594 +1
Branches 969 970 +1
==========================================
- Hits 6394 6391 -3
- Misses 2825 2826 +1
- Partials 374 377 +3
Continue to review full report at Codecov.
|
See @zalokhan 's comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Overview
Description: CorfuRuntime: fetchLayout fetches the latest layout from a random layout server and sets the client router epochs to the newer, higher epoch number in a loop.
However, if the first router in the loop throws a network exception the loop is aborted and the other router epochs remain unset.
Why should this be merged: To avoid wrong epoch exceptions and set the router epochs correctly.
Related issue(s) (if applicable): #1044
Checklist (Definition of Done):