Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent NetworkException from causing a double-write #982

Merged
merged 1 commit into from
Dec 11, 2017

Conversation

rogermichoud
Copy link
Contributor

@rogermichoud rogermichoud commented Nov 3, 2017

The Runtime has a settable timeout for how long it will wait for
an unavailable system. If we are not able to progress (due to NetworkExcpetion)
during the timeout, the handler provided by the client (or default handler,
that shutdown the Runtime) will be triggered.

Fixes #974

Copy link
Member

@no2chem no2chem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start, but the handler should decide what to do, not hardcoded timeouts.

So I would remove all the timeout fields. When a NetworkException is thrown, execute the handler. If the handler throws an exception, re-throw it, otherwise continue retrying forever.

} catch (InterruptedException e) {
log.warn("Interrupted Exception in layout helper.", e);
}
log.warn("System seems unavailable");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct format is getCurrentLayout: x

@rogermichoud
Copy link
Contributor Author

@no2chem
The problem is that there are many threads that are executing AbstractView layoutHelper. In order to implement a timeout crafted in the handler, I would need to use ThreadLocals and add variables to my default implementation of the handler in CorfuRuntime. Doesn't look that clean to me.

What I can do instead is pass the timeout parameter with the handler. Like this, if somebody wants to register a handler that fires directly, he can pass a -1 for the timeout.

@no2chem
Copy link
Member

no2chem commented Nov 13, 2017

@rogermichoud I'm not sure why threadlocals are necessary.
Timeouts should still be decided globally on a per-request basis. The handler should get the NetworkException, and decide what to do; this doesn't need to be on a per-thread basis, in fact, it shouldn't be --- it should be on a per-endpoint basis.

By default, there should be no handler (i.e., the handler should always return).

@rogermichoud
Copy link
Contributor Author

@no2chem
You want the default behavior to hang forever? I thought the consensus was having the default behavior to shutdown the runtime after a timeout (SystemUnavailable state).

@no2chem
Copy link
Member

no2chem commented Nov 13, 2017

Yes, the default behavior should be to (retry, not hang) forever. If there is an application which has a specific behavior requirement (e.g., mp, they may specify it using the handler).

@Maithem Maithem self-requested a review November 20, 2017 21:39
@rogermichoud rogermichoud force-pushed the systemUnavailable branch 2 times, most recently from 48daa76 to 72201ff Compare November 28, 2017 19:43
@rogermichoud rogermichoud changed the title [WIP] A handler is provided in case of system unavailable. A handler is provided in case of system unavailable. Nov 28, 2017
@rogermichoud rogermichoud force-pushed the systemUnavailable branch 2 times, most recently from 02ba039 to b2a7b9f Compare November 29, 2017 01:47
Copy link
Member

@no2chem no2chem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great - much better - mainly see the comment about UnknownResultException. We definitely don't want the PR to introduce data consistency issues due to trim (but it's ok not to handle it now).

@@ -97,9 +113,23 @@ public Layout getCurrentLayout() {
log.warn("Got a wrong epoch exception, updating epoch to {} and "
+ "invalidate view", we.getCorrectEpoch());
runtime.invalidateLayout();
} else if (re instanceof NetworkException) {
log.warn("layoutHelper: System seems unavailable");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Print the exception too:

log.warn("layoutHelper: System seems unavailable", re);

class TimeoutHandler {
CorfuRuntime rt;
long maxTimeout;
ThreadLocal<Long> localTimeStart = new ThreadLocal<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want this on a per-thread basis?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well this is an example, so I guess there is some freedom on how to implement it. I also think different threads could be interacting with different routers. If one thread is only reading, its queries will only go to the chain tail, and it will reset the timer each time. This could mask a problem with writes that could never go in.

} catch (TrimmedException te) {
// We cannot know if the write went through or not
// This will rewrite
throw new OverwriteException();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should throw a new exception type, "UnknownResultException"
An overwrite exception will cause a data inconsistency issue, which will not be nice....

For a transaction, we should abort. For a object, we should recover from a checkpoint(?) - but we still won't know if the operation committed or not. We can deal with handling the exception in a separate PR, but we absolutely don't want to introduce data consistency issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to throw your UnrecoverableCorfuException. That will just create a dependency on your PR. Which is fine. The UnreceoverableCorfruExcpetion should go in pretty soon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.. It's UnrecoverableCorfuError now :). Can you create a dependency on ZenHub so we can track it?

@no2chem no2chem changed the title A handler is provided in case of system unavailable. Prevent NetworkException from causing a double-write Nov 29, 2017
@no2chem no2chem added this to the 0.2.0 milestone Dec 4, 2017
@no2chem no2chem added the bug label Dec 4, 2017
@@ -93,14 +110,38 @@ public Layout getCurrentLayout() {
log.warn("Got a wrong epoch exception, updating epoch to {} and "
+ "invalidate view", we.getCorrectEpoch());
runtime.invalidateLayout();
} else if (re instanceof NetworkException) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAJOR Replace the usage of the "instanceof" operator by a catch block. rule

} else {
throw re;
}
if (rethrowAllExceptions) {
throw new RuntimeException(re);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAJOR Define and throw a dedicated exception instead of using a generic one. rule

if (ex.getCause() instanceof SystemUnavailableException) {
throw (SystemUnavailableException) ex.getCause();
}

runtime.invalidateLayout();
Utils.sleepUninterruptibly(runtime.retryRate * 1000);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MINOR Cast one of the operands of this multiplication operation to a "long". rule

* @return The return value of the function.
*/
public <T, A extends RuntimeException, B extends RuntimeException, C extends RuntimeException,
D extends RuntimeException> T layoutHelper(LayoutFunction<Layout, T, A, B, C, D>
function)
function,
boolean rethrowAllExceptions)
throws A, B, C, D {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MINOR Remove the declaration of thrown exception 'A' which is a runtime exception. rule
MINOR Remove the declaration of thrown exception 'B' which is a runtime exception. rule
MINOR Remove the declaration of thrown exception 'C' which is a runtime exception. rule
MINOR Remove the declaration of thrown exception 'D' which is a runtime exception. rule

* test/src/test/java/org/corfudb/runtime/CorfuRuntimeTest.java
*
*/
public Runnable beforeRpcHandler = () -> {};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAJOR Either remove or fill this block of code. rule
MINOR Make beforeRpcHandler a static final constant or non-public and provide accessors if needed. rule

*
*/
public Runnable beforeRpcHandler = () -> {};
public Runnable systemDownHandler = () -> {};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAJOR Either remove or fill this block of code. rule
MINOR Make systemDownHandler a static final constant or non-public and provide accessors if needed. rule

} catch (RuntimeException re) {
// These two exceptions are pass through. Both of them are already too late for trying
// to validate the state of the write, we know that it didn't went through.
if (re instanceof SystemUnavailableException || re instanceof OverwriteException) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAJOR Replace the usage of the "instanceof" operator by a catch block. rule
MAJOR Replace the usage of the "instanceof" operator by a catch block. rule

if (re instanceof SystemUnavailableException || re instanceof OverwriteException) {
throw re;
}
re.printStackTrace();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MINOR Use a logger to log this exception. rule

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this printStackTrace

},CLIENT_DELAY_POST_SHUTDOWN, TimeUnit.MILLISECONDS);
offline.shutdown();

Thread.sleep(CORFU_SERVER_DOWN_TIME);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAJOR Remove this use of "Thread.sleep()". rule

@corfu-repo
Copy link
Collaborator

SonarQube analysis reported 16 issues

  • MAJOR 8 major
  • MINOR 8 minor

Watch the comments in this conversation to review them.

1 extra issue

Note: The following issues were found on lines that were not modified in the pull request. Because these issues can't be reported as line comments, they are summarized here:

  1. MAJOR CorfuRuntime.java#L586: Either re-interrupt this method or rethrow the "InterruptedException". rule

@coveralls
Copy link

Coverage Status

Coverage increased (+0.06%) to 70.702% when pulling 149e859 on systemUnavailable into 0174570 on master.

@corfudb-performance
Copy link
Collaborator

Results automatically generated by CorfuDB Benchmark Framework to assess the performance of this pull request for commit 149e859.

*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 10 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 10 threads, Disk mode

An interactive dashboard with Pull Request Performance Metrics for ALL cluster types and numbers of threads in run, is available at:
Pull Request #982 Graphs

Copy link
Member

@no2chem no2chem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good once you remove the printStackTrace

@rogermichoud
Copy link
Contributor Author

@no2chem I had to do some tweaks to tests and also be sure that the custom handler would get out of fetchLayout as well if we cannot reach any layout. Should not impact the default path.

Also I wanted to be sure we handle both cases correctly for custom handlers:
1. We cannot write because one (or more) server of the logunit chain is not reachable but we can get a layout (some or all layout servers are up).
2. We cannot write and we cannot have a layout. In that case, we are stuck in fetchLayout.

if the changes are ok for you, we can merge.

@rogermichoud
Copy link
Contributor Author

@no2chem
Also removed the printStackTrace

@no2chem
Copy link
Member

no2chem commented Dec 11, 2017

Hm - Also, weren't you going to have SystemUnavailableException extend UnrecoverableCorfuError?

@rogermichoud
Copy link
Contributor Author

Well, I wasn't but I will :-)

@rogermichoud rogermichoud force-pushed the systemUnavailable branch 2 times, most recently from ceed3a4 to 8e2758b Compare December 11, 2017 20:16
Copy link
Member

@no2chem no2chem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just 2 small nits left...

@@ -45,6 +45,8 @@ public Layout getCurrentLayout() {
while (true) {
try {
return runtime.layout.get();
} catch (SystemUnavailableException sue) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be necessary, the error will not be caught by the exception catch block below

/**
* Created by rmichoud on 10/31/17.
*/
public class SystemUnavailableException extends UnrecoverableCorfuError {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SystemUnavailableException -> SystemUnavailableError. Also you can move it to the Unrecoverable package.

NetworkExcpetion within the layoutHelper function will call the handler. The default
handler will do nothing and just let layoutHelper retry the request.

For writes(Transaction and single update), NetworkExceptions are passed to the
view layer (StreamsView and BackPointerView). After a NetworkException, the layer
will force a read to see if we persisted the record or not. If it was persisted,
the request return the correct value. If not, the retry logic is implemented in the view.
Copy link
Member

@no2chem no2chem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent, thanks!

@no2chem no2chem merged commit 0c92e3c into master Dec 11, 2017
no2chem pushed a commit that referenced this pull request Dec 14, 2017
NetworkExcpetion within the layoutHelper function will call the handler. The default
handler will do nothing and just let layoutHelper retry the request.

For writes(Transaction and single update), NetworkExceptions are passed to the
view layer (StreamsView and BackPointerView). After a NetworkException, the layer
will force a read to see if we persisted the record or not. If it was persisted,
the request return the correct value. If not, the retry logic is implemented in the view.
@Maithem Maithem deleted the systemUnavailable branch December 15, 2017 10:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants