Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reset logunit server #1052

Merged
merged 1 commit into from
Dec 8, 2017
Merged

Reset logunit server #1052

merged 1 commit into from
Dec 8, 2017

Conversation

zalokhan
Copy link
Member

@zalokhan zalokhan commented Dec 7, 2017

Overview

Description: Resets the log unit server by clearing all data and persisted state.

Why should this be merged: This is a requirement to heal failed nodes.
The healing nodes need to be added back to the chain and this can only be done once the state
on the log unit server on the healing node is reset.

Checklist (Definition of Done):

  • There are no TODOs left in the code
  • Coding conventions (e.g. for logging, unit tests) have been followed
  • Change is covered by automated tests
  • Public API has Javadoc

@no2chem
Copy link
Member

no2chem commented Dec 7, 2017

Hm. Can you clarify why healing can only be done if the server is reset? Can't the healing node take an intersection of the current log state and it's state?

@no2chem no2chem added this to the 0.2.0 milestone Dec 7, 2017
Copy link
Member

@no2chem no2chem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks fine, but I wonder if (1) we really need to reset to heal a node, and (2) if we should find a way to "protect" this API.

Maybe make it only accessible from a special administrative epoch?


@Override
public void reset() {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra space?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling b3befa0 on zalokhan:resetLogunit into ** on CorfuDB:master**.

@zalokhan
Copy link
Member Author

zalokhan commented Dec 7, 2017

@no2chem There can be a case where the head of the chain (and also primary sequencer) was ahead than the others and crashed (address 100).
The backup sequencer is bootstrapped with a token from the maximum address seen by the remaining log units (address 50). Now clients can write a different set of data to the new chain for addresses (50 - 100)

The crashed head now tries to recover.
The intersection of the log addresses from the healing node and the current state will be 0-100 but have inconsistent data.

Let me know if I understood and answered your question correctly.

@no2chem
Copy link
Member

no2chem commented Dec 7, 2017 via email

@zalokhan
Copy link
Member Author

zalokhan commented Dec 7, 2017

But how do you determine till where do we drop the log entries?
You would have to do a deep reading of all the entries to get to know where the log stream branches off. I guess this would require access tot he deserializers and would not be feasible.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling c40ecc9 on zalokhan:resetLogunit into ** on CorfuDB:master**.

@no2chem
Copy link
Member

no2chem commented Dec 7, 2017 via email

@no2chem no2chem merged commit 8e63b38 into CorfuDB:master Dec 8, 2017
@zalokhan zalokhan deleted the resetLogunit branch December 8, 2017 00:18
@Maithem
Copy link
Contributor

Maithem commented Dec 8, 2017

@no2chem don't merge right away, there are other reviewers looking at this PR.

Copy link
Contributor

@Maithem Maithem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think resetLogUnit should block all other calls, right now, writes can interleave with a reset which result in a dirty state. One way of achieving this is by setting the logging unit's state to not ready and then doing the reset.

@zalokhan
Copy link
Member Author

zalokhan commented Dec 8, 2017

@Maithem To achieve this, should the reset call go through the batchwriter? This can ensure synchronization.

no2chem added a commit that referenced this pull request Dec 8, 2017
@no2chem
Copy link
Member

no2chem commented Dec 8, 2017

@Maithem sorry about the merge. But concurrency shouldn't be an issue here. The logunit epoch should be sealed while this is happening

@zalokhan zalokhan restored the resetLogunit branch December 8, 2017 00:33
@Maithem
Copy link
Contributor

Maithem commented Dec 8, 2017

@no2chem No worries.

Let's consider the system as a whole, I think in general we need to minimize work that needs to happen between a seal and layout changes. In the case of chain replication, I think this is fine since the whole chain has to be ready for writes to go through, but in the case of a quorum, why block the whole system, if the system can still can accept requests?

@Maithem
Copy link
Contributor

Maithem commented Dec 8, 2017

@zalokhan Readers are not blocked by the writer thread, so I don't think that would work.

@no2chem
Copy link
Member

no2chem commented Dec 8, 2017 via email

@Maithem
Copy link
Contributor

Maithem commented Dec 8, 2017

Either way you're reconfiguring to reset hopefully, so a seal should occur...

I'm not saying we shouldn't do a seal, i'm saying the work after the seal should be minimized.

Ok, we can think more about the quorum case later.

I think there is another issue, consider the case where the batch writer is writing while a reset occurs, this is a race condition that would leave the logging in a bad state. I think the before the reset happens, we either need to wait for all writes to succeed, or just cancel all pending writes. Essentially, the LU pipelines need to be flushed before a reset is issued.

Moreover, I think this is a dangerous operation and the API should be protected some how.

zalokhan added a commit that referenced this pull request Dec 11, 2017
@zalokhan zalokhan deleted the resetLogunit branch January 11, 2018 20:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants