Reset logunit server #1052
Conversation
Hm. Can you clarify why healing can only be done if the server is reset? Can't the healing node take an intersection of the current log state and its state?
Overall, looks fine, but I wonder if (1) we really need to reset to heal a node, and (2) if we should find a way to "protect" this API.
Maybe make it only accessible from a special administrative epoch?
@Override
public void reset() {
extra space?
Fixed.
Changes Unknown when pulling b3befa0 on zalokhan:resetLogunit into CorfuDB:master.
@no2chem There can be a case where the head of the chain (and also the primary sequencer) was ahead of the others and crashed (address 100). The backup sequencer is bootstrapped with a token from the maximum address seen by the remaining log units (address 50). Now clients can write a different set of data to the new chain for addresses 50-100. The crashed head then tries to recover. The intersection of the log addresses from the healing node and the current state will be 0-100 but will contain inconsistent data. Let me know if I understood and answered your question correctly.
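To make the divergence concrete, here is a small, hypothetical Java illustration (not CorfuDB code; the address ranges and payloads are made up to mirror the example above). It shows that the healed head and the current chain overlap on addresses 51-100 yet hold different data there, so intersecting address ranges alone is unsafe.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; these maps stand in for per-address log state.
public class DivergenceExample {
    public static void main(String[] args) {
        // Log of the crashed head before it failed: it had written addresses 0-100.
        Map<Long, String> crashedHead = new HashMap<>();
        for (long addr = 0; addr <= 100; addr++) {
            crashedHead.put(addr, "old-" + addr);
        }

        // Log of the surviving chain: it only saw up to address 50, and the
        // backup sequencer re-issued tokens 51-100 to clients writing new data.
        Map<Long, String> survivingChain = new HashMap<>();
        for (long addr = 0; addr <= 50; addr++) {
            survivingChain.put(addr, "old-" + addr);
        }
        for (long addr = 51; addr <= 100; addr++) {
            survivingChain.put(addr, "new-" + addr);
        }

        // The address sets intersect on 0-100, but the payloads diverge at 51,
        // so keeping the healing node's entries for the intersection is unsafe.
        for (long addr = 0; addr <= 100; addr++) {
            if (!survivingChain.get(addr).equals(crashedHead.get(addr))) {
                System.out.println("First conflicting address: " + addr); // prints 51
                break;
            }
        }
    }
}
```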
b3befa0 to c40ecc9 (Compare)
Ok, that makes much more sense now. But wouldn't a partial reset be better than a complete reset? I guess that's just an optimization (drop only 50 to the tail), but it seems quite important for performance. We can do that in a separate PR.
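For reference, a partial reset along those lines might look something like the hypothetical sketch below (it assumes an in-memory, address-ordered map as the log store; the class and field names are illustrative, not the actual log unit implementation). The hard part, as the following comments discuss, is determining the trim point.

```java
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch of a partial reset: drop only addresses >= trimPoint.
public class PartialResetSketch {
    private final ConcurrentNavigableMap<Long, byte[]> addressSpace =
            new ConcurrentSkipListMap<>();

    /** Drops every entry from {@code trimPoint} (inclusive) to the tail of the log. */
    public void resetFrom(long trimPoint) {
        // e.g. resetFrom(51) keeps addresses 0-50 and drops 51 through the tail.
        addressSpace.tailMap(trimPoint, true).clear();
    }
}
```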
But how do you determine up to which address we should drop the log entries? You would have to do a deep reading of all the entries to figure out where the log stream branches off. I guess this would require access to the deserializers and would not be feasible.
Changes Unknown when pulling c40ecc9 on zalokhan:resetLogunit into CorfuDB:master.
I see. The problem is the lack of a lease. If the previous sequencer had a lease of, say, 10k entries, then this problem would go away.
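A rough sketch of the lease idea (hypothetical, not an existing sequencer API; the names and lease size are assumptions): if the primary sequencer holds a lease on a fixed block of addresses, the backup sequencer can skip past the entire leased block when it bootstraps, so no address the crashed head may have written is ever re-issued.

```java
// Hypothetical sketch of a leased sequencer bootstrap; not CorfuDB code.
public class LeasedSequencerBootstrap {
    private static final long LEASE_SIZE = 10_000; // e.g. a 10k-entry lease, as suggested above

    private long nextToken;

    /**
     * @param maxAddressSeenByLogUnits highest address the surviving log units report (e.g. 50)
     * @param leaseStartOfFailedPrimary first address of the failed primary's current lease
     */
    public LeasedSequencerBootstrap(long maxAddressSeenByLogUnits, long leaseStartOfFailedPrimary) {
        // Start past the whole leased range, not just past what the log units saw,
        // so addresses the crashed head may have written are never handed out again.
        this.nextToken = Math.max(maxAddressSeenByLogUnits + 1,
                                  leaseStartOfFailedPrimary + LEASE_SIZE);
    }

    public synchronized long nextToken() {
        return nextToken++;
    }
}
```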
@no2chem don't merge right away, there are other reviewers looking at this PR.
I think resetLogUnit should block all other calls; right now, writes can interleave with a reset, which can result in a dirty state. One way of achieving this is by setting the logging unit's state to not ready and then doing the reset.
@Maithem To achieve this, should the reset call go through the batchwriter? This can ensure synchronization.
@Maithem sorry about the merge. But concurrency shouldn't be an issue here. The logunit epoch should be sealed while this is happening.
@no2chem No worries. Let's consider the system as a whole; I think in general we need to minimize the work that needs to happen between a seal and layout changes. In the case of chain replication, I think this is fine since the whole chain has to be ready for writes to go through, but in the case of a quorum, why block the whole system if it can still accept requests?
@zalokhan Readers are not blocked by the writer thread, so I don't think that would work.
In the case of a quorum we probably need leases, I suspect. Either way, you're hopefully reconfiguring in order to reset, so a seal should occur...
I'm not saying we shouldn't do a seal; I'm saying the work after the seal should be minimized. Ok, we can think more about the quorum case later. I think there is another issue: consider the case where the batch writer is writing while a reset occurs. This is a race condition that would leave the logging unit in a bad state. I think before the reset happens, we either need to wait for all writes to succeed, or just cancel all pending writes. Essentially, the LU pipelines need to be flushed before a reset is issued. Moreover, I think this is a dangerous operation and the API should be protected somehow.
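As a rough illustration of the "flush the pipeline, then reset" ordering, and of gating the call on the sealed epoch, here is a hypothetical sketch; the class, field, and method names are made up and this is not the actual batch writer or log unit server code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: drain pending writes and check the epoch before resetting.
public class GuardedReset {
    private final ConcurrentLinkedQueue<CompletableFuture<Void>> pendingWrites =
            new ConcurrentLinkedQueue<>();
    private volatile long serverEpoch;

    /** Resets only if the request carries the current (sealed) epoch, after flushing writes. */
    public void reset(long requestEpoch) {
        if (requestEpoch != serverEpoch) {
            throw new IllegalStateException("Stale epoch, refusing reset: " + requestEpoch);
        }
        // Flush the pipeline: wait for every outstanding write to complete first.
        CompletableFuture<Void> pending;
        while ((pending = pendingWrites.poll()) != null) {
            pending.join();
        }
        clearDataAndPersistedState(); // the actual reset work happens only after the flush
    }

    private void clearDataAndPersistedState() {
        // Placeholder for dropping the in-memory cache and deleting on-disk segments.
    }
}
```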
Overview
Description: Resets the log unit server by clearing all data and persisted state.
Why should this be merged: This is a requirement to heal failed nodes.
The healing nodes need to be added back to the chain, and this can only be done once the state on the log unit server of the healing node is reset.
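For context, "clearing all data and persisted state" might look roughly like the following hypothetical sketch; the cache field, directory location, and class name are assumptions for illustration, not the actual server implementation.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a log unit reset: wipe the in-memory cache and on-disk segments.
public class LogUnitResetSketch {
    private final Map<Long, byte[]> dataCache = new ConcurrentHashMap<>();
    private final Path logDir = Paths.get("/tmp/corfu-log"); // assumed log directory

    public void reset() throws IOException {
        dataCache.clear(); // drop all cached entries
        try (DirectoryStream<Path> segments = Files.newDirectoryStream(logDir)) {
            for (Path segment : segments) {
                Files.deleteIfExists(segment); // remove persisted log segments
            }
        }
    }
}
```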
Checklist (Definition of Done):