
HOTFIX: safely clear all active state in onPartitionsLost #7691

Merged
merged 11 commits into apache:2.4 on Nov 19, 2019

Conversation

@ableegoldman (Contributor) commented Nov 14, 2019

After a number of last-minute bugs were found stemming from the incremental closing of lost tasks in StreamsRebalanceListener#onPartitionsLost, a safer approach to this edge case seems warranted. We initially wanted to be as "future-proof" as possible and avoid baking further protocol assumptions into the code that might be broken as the protocol evolves. This meant that rather than simply closing all active tasks and clearing all associated state in #onPartitionsLost(lostPartitions), we would loop through the lostPartitions/lost tasks, remove them one by one from the various data structures/assignments, and then verify that everything was empty in the end. This verification in particular has caused us significant trouble, as it turns out to be nontrivial to determine what should in fact be empty, and whether it is in fact being correctly updated.

Therefore, before worrying about being "future-proof", it seems we should first make sure the code is "present-day-proof" and implement this callback in the safest possible way: by blindly closing all active tasks and clearing all associated state. We log all the relevant state (at debug level) before clearing it, so we can at least tell from the logs whether, and which, emptiness checks were being violated.
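As a rough illustration of the "blindly clear everything" approach described above (a minimal sketch only -- the field names, the allTasks() helper, and the close(clean, isZombie) signature are assumptions, not the actual 2.4 code):

void closeAllLostTasks() {
    // log all relevant state (at debug level) before clearing it, so the logs reveal
    // which emptiness checks would have been violated
    log.debug("Closing all active tasks as lost. Created: {}, restoring: {}, running: {}, suspended: {}",
              created.keySet(), restoring.keySet(), running.keySet(), suspended.keySet());

    for (final StreamTask task : allTasks()) {
        try {
            task.close(false, true);  // clean = false, isZombie = true
        } catch (final RuntimeException e) {
            log.error("Failed to close task {} as a zombie:", task.id(), e);
        }
    }

    created.clear();
    restoring.clear();
    running.clear();
    suspended.clear();
}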

Note that this is targeted at 2.4 (not trunk), and that I also cherry-picked over the minor fix from #7686.


@mjsax added the streams label Nov 14, 2019
@ableegoldman (Contributor Author) commented Nov 14, 2019

Kicked off some sets of system tests -->
all streams tests:
https://jenkins.confluent.io/job/system-test-kafka-branch-builder/3511/ -- REBUILDING
broker bounce (x5 repeats):
https://jenkins.confluent.io/job/system-test-kafka-branch-builder/3512/ -- REBUILDING
version probing (x30 repeats):
https://jenkins.confluent.io/job/system-test-kafka-branch-builder/3513/ -- PASSED
cooperative upgrade (x3 repeats):
https://jenkins.confluent.io/job/system-test-kafka-branch-builder/3514/ -- PASSED

edit: version probing and cooperative upgrade passed, broker bounce and "all streams tests" need to be rerun after fixing a stupid UnsupportedOperationException

@ableegoldman (Contributor Author) commented Nov 14, 2019

New system test runs:
all streams tests:
https://jenkins.confluent.io/job/system-test-kafka-branch-builder/3519 -- PASSED
broker bounce (x5 repeats):
https://jenkins.confluent.io/job/system-test-kafka-branch-builder/3520 -- PASSED

edit: all system tests green

@guozhangwang (Contributor) left a comment

One more question, otherwise LGTM.

log.debug("Closing the zombie suspended stream task {}.", id);
firstException.compareAndSet(null, closeSuspended(true, suspended.get(id)));
for (final TaskId id : allAssignedTaskIds()) {
    if (running.containsKey(id)) {
Contributor:

Question: I remember running is a superset of suspended, so if we put this condition first, before line 304, then line 304 would never trigger, right?

@ableegoldman (Contributor Author):

Ah, good question. I found another way to solve that issue besides making running a superset of suspended -- now all maps are completely disjoint, and any given task should be contained in exactly one.
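As a rough illustration of that invariant (the map names are taken from the snippets in this thread, but the helper itself is hypothetical, not the actual patch):

// moving a task into `running` removes it from every other state map in the same step,
// so the maps stay disjoint and any given task lives in exactly one of them
private void transitionToRunning(final StreamTask task) {
    final TaskId id = task.id();
    created.remove(id);
    suspended.remove(id);
    restoring.remove(id);
    running.put(id, task);
}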

@bbejeck (Contributor) left a comment

Changes here LGTM, I just have a few questions overall.

boolean hasRestoringTasks() {
    if (restoring.isEmpty()) {
Contributor:

It seems a bit awkward for a hasXXX call to have side effects. Although I do understand the motivation here, I guess this is something to revisit in a subsequent refactoring/follow-on PR.

@ableegoldman (Contributor Author):

Yeah, that's fair (Guozhang said the same thing on the trunk PR 😄) -- I'll think of a better way to do this so that it's clear what's going on without needing a comment, and I'll update the PR(s) by tonight.

@ableegoldman (Contributor Author):

Alright, I rewrote this clearing functionality to be an explicit method call rather than a hidden side effect.
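For illustration, a minimal before/after sketch of that change (the restoredPartitions field and the exact side effect are assumptions based on the snippet above, not the actual patch):

// before: querying the state also silently mutated it
boolean hasRestoringTasks() {
    if (restoring.isEmpty()) {
        restoredPartitions.clear();   // hidden side effect
        return false;
    }
    return true;
}

// after: the query is pure, and the cleanup is its own explicitly named call
boolean hasRestoringTasks() {
    return !restoring.isEmpty();
}

void clearRestoredPartitions() {
    restoredPartitions.clear();
}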

removeTaskFromRunning(task);
closedTaskChangelogs.addAll(task.changelogPartitions());
Contributor:

Why did we remove this line here from closeRunning, but keep the same call in suspendRunningTasks? More a question for my own education; I wouldn't hold up merging for this.

@ableegoldman (Contributor Author):

We no longer need to keep track of which changelog partitions were lost after an onPartitionsLost, since we just clear everything. #closeRunning is actually only called on zombie tasks, so we can just remove it entirely here (non-zombie running tasks are closed by first suspending them and then closing them as suspended).


// With the current rebalance protocol, there should not be any running tasks left as they were all lost
Contributor:

Is this part of the "strict assumptions" referred to that needed to be removed for now?

@ableegoldman (Contributor Author):

Yes -- we now just clear everything. But we log the entire state of all relevant data structures so that, for our own debugging sake, we should be able to tell from the logs whether the state was actually being updated properly and the "blindly clear everything" safety mechanism was unnecessary. And if not, we can figure out what isn't being properly updated and fix that in trunk without having broken 2.4 :)

@ableegoldman (Contributor Author):

Please check out the latest commit -- I added another, hopefully unnecessary, safety mechanism: we now remove a task from ALL state maps (e.g. created, running, etc.) when closing it or moving it to another. The idea is to ensure that a task can never end up in two state maps at once -- this has never seemed to happen or to have caused any test failures so far, but it has been pointed out (and it's true) that it is difficult to feel confident, just from reading the code, that these sets never overlap. This may be overkill, but the goal is to make 2.4 as stable as possible and to improve our confidence in the correctness and assumptions of the code (see the sketch below).
cc/ @bbejeck @guozhangwang @abbccdda @mjsax @vvcephei
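A hedged sketch of that mechanism (simplified: the map fields are assumed from the snippets in this thread, the parameter is named per the rename discussed below, and the real patch also cleans up the partition-keyed structures, as the review hunks below show):

void removeTaskFromAllOldMaps(final StreamTask task, final Map<TaskId, StreamTask> currentStateMap) {
    final TaskId id = task.id();
    // remove the task from every state map except its destination,
    // so it can never be present in two maps at once
    for (final Map<TaskId, StreamTask> map : java.util.Arrays.asList(created, suspended, restoring, running)) {
        if (map != currentStateMap) {
            map.remove(id);
        }
    }
}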

@ableegoldman (Contributor Author) commented Nov 15, 2019

Kicked off more system tests on the latest changes:

edit: adding completed system test results

@bbejeck (Contributor) left a comment

Thanks for the quick turnaround @ableegoldman, this LGTM. I only have a couple of minor questions.

@@ -372,6 +378,24 @@ void updateRestored(final Collection<TopicPartition> restored) {
        }
    }

+   @Override
+   void removeTaskFromAllOldMaps(final StreamTask task, final Map<TaskId, StreamTask> newState) {
Contributor:

This took a minute to figure out based on the name newState. What about currentStateMap? Though I'm not sure that's any better.

@ableegoldman (Contributor Author):

Yeah, I agonized over the naming of this method and the "newMap" parameter ... I like currentStateMap though (also renaming the method and adding javadocs for what it does)

final Set<TopicPartition> taskPartitions = new HashSet<>(task.partitions());
taskPartitions.addAll(task.changelogPartitions());

if (newState != restoring) {
Contributor:

Not sure if this matters, but I don't see where removeTaskFromAllOldMaps is ever called with restoring passed in as the new state.

@ableegoldman (Contributor Author):

It isn't, but I felt it was best for the method to follow the same behavior for all possible input, and do what it says it will (remove from everything except the passed-in map).

Contributor:

Makes sense, I was thinking that was your reasoning, but I wanted to confirm.

final Set<TopicPartition> taskPartitions = new HashSet<>(task.partitions());
taskPartitions.addAll(task.changelogPartitions());

if (newState != running) {
Contributor:

Same thing here: I don't see where running is passed for the newState.

@ableegoldman (Contributor Author):

ditto above, let me know if you don't agree with the reasoning

@@ -79,7 +79,7 @@ int commit() {
    } catch (final RuntimeException e) {
        log.error("Closing the standby task {} failed due to the following error:", task.id(), e);
    } finally {
-       removeTaskFromRunning(task);
+       removeTaskFromAllOldMaps(task, null);
Contributor:
Minor suggestion -- instead of null, what about Collections.emptyMap()? This suggestion is subjective, so feel free to ignore. Applies here and below.

@ableegoldman (Contributor Author):

Ah, yeah that's a good suggestion -- will do
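For illustration, the suggested call would then read (a sketch, assuming java.util.Collections is imported):

removeTaskFromAllOldMaps(task, Collections.emptyMap());   // empty destination: remove the task from every map

Passing an empty map instead of null keeps the "remove from everything except the destination" contract without a null special case.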

@ableegoldman (Contributor Author) left a comment

Thanks for the review!


@guozhangwang (Contributor) left a comment

Reviewed the PR again. Honestly, I'm a bit concerned about the overkill we are executing here, since it may "hide" some other bugs that would otherwise be exposed. On the other hand, I think I buy the argument for making the 2.4 release as stable as possible, and since this is only in 2.4 for now it may be okay.

Let's try to do the cleanup in trunk ASAP, cherry-pick it into 2.4, and replace this overkill mechanism.

@bbejeck (Contributor) commented Nov 16, 2019

@ableegoldman these failures seem relevant: org.apache.kafka.streams.processor.internals.StoreChangelogReaderTest.shouldRestoreMessagesFromCheckpoint

@ableegoldman (Contributor Author) commented Nov 17, 2019

The failures in the second* round of system tests were due to a bug introduced in the 73513f6 commit. Kicking off a third round of system tests with this issue fixed:

*the first round of system tests (run before the guilty 73513f6) all passed

edit: all tests are passing again

@bbejeck bbejeck changed the base branch from 2.4 to trunk November 19, 2019 17:47
@bbejeck bbejeck changed the base branch from trunk to 2.4 November 19, 2019 17:48
@bbejeck (Contributor) commented Nov 19, 2019

In the previous build, both Java 11/2.12 and Java 11/2.13 passed, but Java 8 failed.

Running ./gradlew test locally, the build passed.

Merging this now.

Edit: irrelevant, as all 3 PR builds passed.

@bbejeck merged commit cbc9f57 into apache:2.4 Nov 19, 2019
@bbejeck (Contributor) commented Nov 19, 2019

Merged #7691 into 2.4

@bbejeck (Contributor) commented Nov 19, 2019

Thanks for the fix @ableegoldman!

bbejeck pushed a commit that referenced this pull request Nov 19, 2019

Reviewers: Guozhang Wang <wangguoz@gmail.com>, Bill Bejeck <bbejeck@gmail.com>, Andrew Choi <andchoi@linkedin.com>
@bbejeck (Contributor) commented Nov 19, 2019

cherry-picked to trunk
