Prevent NetworkException from causing a double-write #982
Good start, but the handler should decide what to do, not hardcoded timeouts.
So I would remove all the timeout fields. When a NetworkException is thrown, execute the handler. If the handler throws an exception, re-throw it, otherwise continue retrying forever.
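A minimal sketch of the retry policy proposed above, assuming a stand-in `NetworkException` and a hypothetical `callWithRetry` helper (neither is the actual CorfuDB code). The handler runs on every `NetworkException`; if it throws, that exception reaches the caller, otherwise the loop retries. A real loop would presumably also sleep between attempts.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for the runtime's NetworkException. */
class NetworkException extends RuntimeException {
    NetworkException(String message) {
        super(message);
    }
}

final class RetryWithHandler {
    private RetryWithHandler() {}

    /**
     * Runs the RPC until it succeeds. On each NetworkException the handler
     * is executed: if the handler throws, its exception propagates to the
     * caller; if it returns, the RPC is retried.
     */
    static <T> T callWithRetry(Supplier<T> rpc, Runnable systemDownHandler) {
        while (true) {
            try {
                return rpc.get();
            } catch (NetworkException ne) {
                // The handler decides: returning means "keep retrying",
                // throwing means "give up with this exception".
                systemDownHandler.run();
            }
        }
    }
}
```

With the default no-op handler (`() -> {}`) this retries forever; an application that wants to give up registers a handler that throws.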
} catch (InterruptedException e) {
    log.warn("Interrupted Exception in layout helper.", e);
}
log.warn("System seems unavailable");
Correct format is getCurrentLayout: x
@no2chem What I can do instead is pass the timeout parameter along with the handler. That way, anyone who wants to register a handler that fires immediately can pass -1 for the timeout.
@rogermichoud I'm not sure why threadlocals are necessary. By default, there should be no handler (i.e., the handler should always return).
@no2chem
Yes, the default behavior should be to retry (not hang) forever. If an application has a specific behavior requirement (e.g., mp), it may specify it using the handler.
64cf8cf to aa8da61
aa8da61 to e0d0dd2
48daa76 to 72201ff
02ba039 to b2a7b9f
Great - much better - mainly see the comment about UnknownResultException. We definitely don't want the PR to introduce data consistency issues due to trim (but it's ok not to handle it now).
@@ -97,9 +113,23 @@ public Layout getCurrentLayout() {
        log.warn("Got a wrong epoch exception, updating epoch to {} and "
                + "invalidate view", we.getCorrectEpoch());
        runtime.invalidateLayout();
    } else if (re instanceof NetworkException) {
        log.warn("layoutHelper: System seems unavailable");
Print the exception too:
log.warn("layoutHelper: System seems unavailable", re);
class TimeoutHandler {
    CorfuRuntime rt;
    long maxTimeout;
    ThreadLocal<Long> localTimeStart = new ThreadLocal<>();
Do we want this on a per-thread basis?
Well this is an example, so I guess there is some freedom on how to implement it. I also think different threads could be interacting with different routers. If one thread is only reading, its queries will only go to the chain tail, and it will reset the timer each time. This could mask a problem with writes that could never go in.
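One way to picture the per-thread timer discussed here is the sketch below. It is an illustration under assumptions, not the handler from the PR: `PerThreadTimeoutHandler`, its injected clock, and the `reset()` method are hypothetical, and `SystemUnavailableException` is a local stand-in. Each thread tracks when its own failures started, so a thread whose reads keep succeeding cannot reset the timer of a thread whose writes keep failing.

```java
import java.util.function.LongSupplier;

/** Hypothetical stand-in for the exception thrown when the timeout expires. */
class SystemUnavailableException extends RuntimeException {
    SystemUnavailableException(String message) {
        super(message);
    }
}

/** Hypothetical handler that keeps one failure timer per calling thread. */
final class PerThreadTimeoutHandler implements Runnable {
    private final long maxTimeoutMillis;
    private final LongSupplier clock; // injected so the sketch is testable
    private final ThreadLocal<Long> localTimeStart = new ThreadLocal<>();

    PerThreadTimeoutHandler(long maxTimeoutMillis, LongSupplier clock) {
        this.maxTimeoutMillis = maxTimeoutMillis;
        this.clock = clock;
    }

    /** Called by a thread each time one of its requests hits a network failure. */
    @Override
    public void run() {
        long now = clock.getAsLong();
        Long start = localTimeStart.get();
        if (start == null) {
            // First failure seen by this thread: start its timer and keep retrying.
            localTimeStart.set(now);
            return;
        }
        if (now - start >= maxTimeoutMillis) {
            localTimeStart.remove();
            throw new SystemUnavailableException("no progress for " + (now - start) + " ms");
        }
    }

    /** Called by a thread after one of its own requests succeeds. */
    void reset() {
        localTimeStart.remove();
    }
}
```

In production code the clock would typically be `System::currentTimeMillis`.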
} catch (TrimmedException te) {
    // We cannot know if the write went through or not
    // This will rewrite
    throw new OverwriteException();
I think we should throw a new exception type, "UnknownResultException".
An overwrite exception will cause a data inconsistency issue, which will not be nice.
For a transaction, we should abort. For an object, we should recover from a checkpoint(?), but we still won't know whether the operation committed. We can deal with handling the exception in a separate PR, but we absolutely don't want to introduce data consistency issues.
I want to throw your UnrecoverableCorfuException. That will just create a dependency on your PR, which is fine. The UnrecoverableCorfuException should go in pretty soon.
Sounds good.. It's UnrecoverableCorfuError now :). Can you create a dependency on ZenHub so we can track it?
b2a7b9f to 149e859
@@ -93,14 +110,38 @@ public Layout getCurrentLayout() {
        log.warn("Got a wrong epoch exception, updating epoch to {} and "
                + "invalidate view", we.getCorrectEpoch());
        runtime.invalidateLayout();
    } else if (re instanceof NetworkException) {

    } else {
        throw re;
    }
    if (rethrowAllExceptions) {
        throw new RuntimeException(re);

    if (ex.getCause() instanceof SystemUnavailableException) {
        throw (SystemUnavailableException) ex.getCause();
    }
    runtime.invalidateLayout();
    Utils.sleepUninterruptibly(runtime.retryRate * 1000);

     * @return The return value of the function.
     */
    public <T, A extends RuntimeException, B extends RuntimeException, C extends RuntimeException,
            D extends RuntimeException> T layoutHelper(LayoutFunction<Layout, T, A, B, C, D>
            function)
            function,
            boolean rethrowAllExceptions)
            throws A, B, C, D {
Remove the declaration of thrown exception 'A' which is a runtime exception.
Remove the declaration of thrown exception 'B' which is a runtime exception.
Remove the declaration of thrown exception 'C' which is a runtime exception.
Remove the declaration of thrown exception 'D' which is a runtime exception.
 * test/src/test/java/org/corfudb/runtime/CorfuRuntimeTest.java
 *
 */
public Runnable beforeRpcHandler = () -> {};

 *
 */
public Runnable beforeRpcHandler = () -> {};
public Runnable systemDownHandler = () -> {};

} catch (RuntimeException re) {
    // These two exceptions are pass through. Both of them are already too late for trying
    // to validate the state of the write, we know that it didn't went through.
    if (re instanceof SystemUnavailableException || re instanceof OverwriteException) {

    if (re instanceof SystemUnavailableException || re instanceof OverwriteException) {
        throw re;
    }
    re.printStackTrace();
Remove this printStackTrace
}, CLIENT_DELAY_POST_SHUTDOWN, TimeUnit.MILLISECONDS);
offline.shutdown();
Thread.sleep(CORFU_SERVER_DOWN_TIME);
SonarQube analysis reported 16 issues. Watch the comments in this conversation to review them.
1 extra issue
Note: The following issues were found on lines that were not modified in the pull request. Because these issues can't be reported as line comments, they are summarized here:
Results automatically generated by CorfuDB Benchmark Framework to assess the performance of this pull request for commit 149e859.
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 1 threads, Disk mode
An interactive dashboard with Pull Request Performance Metrics for ALL cluster types and numbers of threads in run, is available at:
Looks good once you remove the printStackTrace
@no2chem I had to tweak some tests and make sure that a custom handler can also break out of fetchLayout if we cannot reach any layout. This should not impact the default path. I also wanted to be sure we handle both cases correctly for custom handlers. If the changes are OK with you, we can merge.
149e859 to 030e579
@no2chem
Hm - Also, weren't you going to have SystemUnavailableException extend UnrecoverableCorfuError?
Well, I wasn't but I will :-)
ceed3a4 to 8e2758b
Just 2 small nits left...
@@ -45,6 +45,8 @@ public Layout getCurrentLayout() {
    while (true) {
        try {
            return runtime.layout.get();
        } catch (SystemUnavailableException sue) {
This shouldn't be necessary, the error will not be caught by the exception catch block below
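The reasoning can be illustrated with a small sketch. It assumes that UnrecoverableCorfuError derives from `java.lang.Error` (the thread does not show its declaration), and the classes below are local stand-ins rather than the CorfuDB types. A `catch (Exception e)` clause does not match `Error` subclasses, so no dedicated pass-through catch is needed in front of it.

```java
/** Stand-in for UnrecoverableCorfuError; assumed here to extend java.lang.Error. */
class UnrecoverableCorfuError extends Error {
    UnrecoverableCorfuError(String message) {
        super(message);
    }
}

/** Stand-in for the renamed SystemUnavailableError. */
class SystemUnavailableError extends UnrecoverableCorfuError {
    SystemUnavailableError(String message) {
        super(message);
    }
}

final class ErrorPropagationDemo {
    private ErrorPropagationDemo() {}

    /**
     * Runs the operation once. Exceptions are swallowed and reported as
     * "retry"; Errors are not matched by the catch clause and reach the caller.
     */
    static String attempt(Runnable operation) {
        try {
            operation.run();
            return "ok";
        } catch (Exception e) {
            // Only Exception subclasses land here; SystemUnavailableError does not.
            return "retry";
        }
    }
}
```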
/**
 * Created by rmichoud on 10/31/17.
 */
public class SystemUnavailableException extends UnrecoverableCorfuError {
SystemUnavailableException -> SystemUnavailableError. Also you can move it to the Unrecoverable package.
8e2758b to 107d0fa
A NetworkException within the layoutHelper function will call the handler. The default handler does nothing and just lets layoutHelper retry the request. For writes (transactions and single updates), NetworkExceptions are passed to the view layer (StreamsView and BackPointerView). After a NetworkException, the layer forces a read to see whether the record was persisted. If it was persisted, the request returns the correct value. If not, the retry logic is implemented in the view.
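The write path described above might look roughly like the following sketch. Everything in it is hypothetical: `FlakyLog` is an in-memory stand-in that persists a write but can lose the acknowledgement, and `VerifiedAppend.append` is not the StreamsView or BackPointerView code. Comparing payloads is a simplification; a real implementation would more likely compare a unique write identifier.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the runtime's NetworkException. */
class NetworkException extends RuntimeException {
    NetworkException(String message) {
        super(message);
    }
}

/** Hypothetical in-memory log that can persist a write yet lose its acknowledgement. */
class FlakyLog {
    private final Map<Long, String> entries = new HashMap<>();
    private long next = 0;
    boolean dropNextAck = false;

    long nextAddress() {
        return next++;
    }

    void write(long address, String data) {
        entries.put(address, data);
        if (dropNextAck) {
            dropNextAck = false;
            throw new NetworkException("acknowledgement lost");
        }
    }

    String read(long address) {
        return entries.get(address);
    }

    int size() {
        return entries.size();
    }
}

final class VerifiedAppend {
    private VerifiedAppend() {}

    /** Appends data and returns its address without writing the record twice. */
    static long append(FlakyLog log, String data) {
        while (true) {
            long address = log.nextAddress();
            try {
                log.write(address, data);
                return address;
            } catch (NetworkException ne) {
                // Outcome unknown: read the address back before retrying.
                if (data.equals(log.read(address))) {
                    return address; // the write did go through
                }
                // Not persisted: loop and retry at a fresh address.
            }
        }
    }
}
```

The read-back is what prevents the double write: without it, a lost acknowledgement would cause the same record to be appended again at a new address.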
Excellent, thanks!
The Runtime has a settable timeout for how long it will wait for
an unavailable system. If we are not able to make progress (due to NetworkException)
during the timeout, the handler provided by the client (or the default handler,
which shuts down the Runtime) will be triggered.
Fixes #974