New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent 2 instances of ProjectionCoreCoordinator & its subcomponents from running in parallel #1466

Merged
merged 1 commit into from Oct 18, 2017

Conversation

2 participants
@shaan1337
Member

shaan1337 commented Oct 16, 2017

It is possible for 2 instances of Projection Core Coordinator subcomponents (EventReaderCoreService/ProjectionCoreService/ProjectionCoreServiceCommandReader) to be running in parallel if a node goes from Master to Unknown state and quickly back to Master state.
If the subcomponents have not been stopped in time when the node moves to Unknown state, there will be 2 instances of the subcomponent running when the node moves back to master.

This can cause problems where projections gets stuck after an election

Example:
If a second instance of the command reader starts running (when receiving ProjectionCoreServiceMessage.StartCore message with another Epoch ID) before the first one is stopped, the command reader is stopped when the StopCore message for the first instance arrives.

[PID:18392:012 2017.10.11 16:20:24.644 INFO  ClusterVNodeControll] === NO QUORUM EMERGED WITHIN TIMEOUT... RETIRING...
[PID:18392:012 2017.10.11 16:20:24.644 INFO  ClusterVNodeControll] ========== [127.0.0.1:2112] IS UNKNOWN...
[PID:18392:012 2017.10.11 16:20:24.644 DEBUG ElectionsService    ] ELECTIONS: STARTING ELECTIONS.
[PID:18392:012 2017.10.11 16:20:24.644 DEBUG ElectionsService    ] ELECTIONS: (V=4) SHIFT TO LEADER ELECTION.
[PID:18392:012 2017.10.11 16:20:24.644 DEBUG ElectionsService    ] ELECTIONS: (V=4) VIEWCHANGE FROM [127.0.0.1:2112, {26650f98-b103-4fec-ac03-34673619413a}].
[PID:18392:018 2017.10.11 16:20:24.644 DEBUG ProjectionCoreCoordi] PROJECTIONS: Stopping Projections Core Coordinator. (Node State : Unknown)
---
Stopping projection manager
[PID:18392:018 2017.10.11 16:20:24.645 DEBUG ProjectionManager   ] PROJECTIONS: Stopping Projections Manager. (Node State : Unknown)
---
[PID:18392:015 2017.10.11 16:20:24.648 DEBUG PersistentSubscripti] Subscriptions received state change to Unknown stopping listening.
[PID:18392:012 2017.10.11 16:20:24.665 DEBUG ElectionsService    ] ELECTIONS: (V=4) VIEWCHANGE FROM [127.0.0.1:2122, {df52b02e-e51c-4941-9e2d-0db5cd2bed76}].
[PID:18392:012 2017.10.11 16:20:24.665 DEBUG ElectionsService    ] ELECTIONS: (V=4) MAJORITY OF VIEWCHANGE.
[PID:18392:012 2017.10.11 16:20:24.670 DEBUG ElectionsService    ] ELECTIONS: (V=4) PREPARE FROM [127.0.0.1:2122, {df52b02e-e51c-4941-9e2d-0db5cd2bed76}].
[PID:18392:012 2017.10.11 16:20:24.670 DEBUG ElectionsService    ] ELECTIONS: (V=4) SHIFT TO REG_NONLEADER.
[PID:18392:012 2017.10.11 16:20:24.693 DEBUG ElectionsService    ] ELECTIONS: (V=4) PROPOSAL FROM [127.0.0.1:2122,{df52b02e-e51c-4941-9e2d-0db5cd2bed76}] M=[127.0.0.1:2112,{26650f98-b103-4fec-ac03-34673619413a}](L=108911,W=109516,C=109516,E6@63525:{958f8051-057a-4b49-b126-f3f237b6abbb}). ME=[127.0.0.1:2112,{26650f98-b103-4fec-ac03-34673619413a}](L=108911,W=109516,C=109516,E6@63525:{958f8051-057a-4b49-b126-f3f237b6abbb}).
[PID:18392:012 2017.10.11 16:20:24.693 DEBUG ElectionsService    ] ELECTIONS: (V=4) ACCEPT FROM [127.0.0.1:2122,{df52b02e-e51c-4941-9e2d-0db5cd2bed76}] M=[127.0.0.1:2112,{26650f98-b103-4fec-ac03-34673619413a}]).
[PID:18392:012 2017.10.11 16:20:24.693 DEBUG ElectionsService    ] ELECTIONS: (V=4) ACCEPT FROM [127.0.0.1:2112,{26650f98-b103-4fec-ac03-34673619413a}] M=[127.0.0.1:2112,{26650f98-b103-4fec-ac03-34673619413a}]).
[PID:18392:012 2017.10.11 16:20:24.693 INFO  ElectionsService    ] ELECTIONS: (V=4) DONE. ELECTED MASTER = [127.0.0.1:2112,{26650f98-b103-4fec-ac03-34673619413a}](L=108911,W=109516,C=109516,E6@63525:{958f8051-057a-4b49-b126-f3f237b6abbb}). ME=[127.0.0.1:2112,{26650f98-b103-4fec-ac03-34673619413a}](L=108911,W=109516,C=109516,E6@63525:{958f8051-057a-4b49-b126-f3f237b6abbb}).
[PID:18392:012 2017.10.11 16:20:24.693 INFO  ClusterVNodeControll] ========== [127.0.0.1:2112] PRE-MASTER STATE, WAITING FOR CHASER TO CATCH UP...
[PID:18392:022 2017.10.11 16:20:24.694 DEBUG PersistentSubscripti] Subscriptions received state change to PreMaster stopping listening.
[PID:18392:012 2017.10.11 16:20:24.699 INFO  ClusterVNodeControll] ========== [127.0.0.1:2112] IS MASTER... SPARTA!
[PID:18392:009 2017.10.11 16:20:24.699 DEBUG EpochManager        ] === Writing E7@109516:{53449de7-e378-45f8-902a-b73a392597be} (previous epoch at 63525).
[PID:18392:006 2017.10.11 16:20:24.699 DEBUG PersistentSubscripti] Subscriptions Became Master so now handling subscriptions
[PID:18392:009 2017.10.11 16:20:24.705 DEBUG EpochManager        ] === Update Last Epoch E7@109516:{53449de7-e378-45f8-902a-b73a392597be} (previous epoch at 63525).
[PID:18392:018 2017.10.11 16:20:24.705 DEBUG ProjectionCoreCoordi] PROJECTIONS: Starting Projections Core Coordinator. (Node State : Master)
[PID:18392:018 2017.10.11 16:20:24.705 DEBUG ProjectionManager   ] PROJECTIONS: Starting Projections Manager. (Node State : Master)
[PID:18392:018 2017.10.11 16:20:24.705 DEBUG ProjectionManagerRes] PROJECTIONS: There was an active cancellation scope, cancelling now
[PID:18392:018 2017.10.11 16:20:24.705 DEBUG ProjectionManagerRes] PROJECTIONS: Starting Projection Manager Response Reader (reads from $projections-$master)
[PID:18392:018 2017.10.11 16:20:24.705 DEBUG MultiStreamMessageWr] PROJECTIONS: Resetting Worker Writer
[PID:18392:021 2017.10.11 16:20:24.714 INFO  ProjectionCoreServic] _projections is not empty after all the projections have been killed
[PID:18392:021 2017.10.11 16:20:24.714 TRACE InMemoryBus         ] SLOW BUS MSG [bus]: StopCore - 68ms. Handler: ProjectionCoreService.
[PID:18392:021 2017.10.11 16:20:24.714 DEBUG ProjectionCoreServic] PROJECTIONS: Stopping Projection Core Reader (34d3170a1d1b4d4bb2481bc0999924d0)
[PID:18392:021 2017.10.11 16:20:24.714 TRACE QueuedHandlerAutoRes] SLOW QUEUE MSG [Projection Core #2]: StopCore - 68ms. Q: 1/4.
[PID:18392:020 2017.10.11 16:20:24.715 INFO  ProjectionCoreServic] _projections is not empty after all the projections have been killed
[PID:18392:020 2017.10.11 16:20:24.715 TRACE InMemoryBus         ] SLOW BUS MSG [bus]: StopCore - 68ms. Handler: ProjectionCoreService.
[PID:18392:020 2017.10.11 16:20:24.715 DEBUG ProjectionCoreServic] PROJECTIONS: Stopping Projection Core Reader (d614795683174951b061bfee08f63ae5)
[PID:18392:020 2017.10.11 16:20:24.715 TRACE QueuedHandlerAutoRes] SLOW QUEUE MSG [Projection Core #1]: StopCore - 68ms. Q: 1/7.
[PID:18392:020 2017.10.11 16:20:24.715 INFO  ProjectionCoreServic] _subscriptions is not empty after all the projections have been killed
[PID:18392:020 2017.10.11 16:20:24.716 INFO  ProjectionCoreServic] _eventReaders is not empty after all the projections have been killed
[PID:18392:020 2017.10.11 16:20:24.716 INFO  ProjectionCoreServic] _subscriptionEventReaders is not empty after all the projections have been killed
[PID:18392:020 2017.10.11 16:20:24.718 DEBUG ResponseWriter      ] PROJECTIONS: Scheduling the writing of $stopped to $projections-$master. Current status of Writer: Busy: True
[PID:18392:021 2017.10.11 16:20:24.718 INFO  ProjectionCoreServic] _eventReaders is not empty after all the projections have been killed
---
New instances of Projection Core Reader started:
[PID:18392:021 2017.10.11 16:20:24.718 DEBUG ProjectionCoreServic] PROJECTIONS: Starting Projection Core Reader (reads from $projections-$34d3170a1d1b4d4bb2481bc0999924d0)
[PID:18392:021 2017.10.11 16:20:24.718 DEBUG ResponseWriter      ] PROJECTIONS: Resetting Master Writer
[PID:18392:020 2017.10.11 16:20:24.718 DEBUG ProjectionCoreServic] PROJECTIONS: Starting Projection Core Reader (reads from $projections-$d614795683174951b061bfee08f63ae5)
[PID:18392:020 2017.10.11 16:20:24.719 DEBUG ResponseWriter      ] PROJECTIONS: Resetting Master Writer
---
[PID:18392:019 2017.10.11 16:20:24.719 INFO  ProjectionCoreServic] _projections is not empty after all the projections have been killed
[PID:18392:019 2017.10.11 16:20:24.719 TRACE InMemoryBus         ] SLOW BUS MSG [bus]: StopCore - 73ms. Handler: ProjectionCoreService.
---
StopCore message received after other instances started :
[PID:18392:019 2017.10.11 16:20:24.720 DEBUG ProjectionCoreServic] PROJECTIONS: Stopping Projection Core Reader (025c5717850c452c971a73761d394851)
[PID:18392:019 2017.10.11 16:20:24.720 TRACE QueuedHandlerAutoRes] SLOW QUEUE MSG [Projection Core #0]: StopCore - 73ms. Q: 1/4.
---
[PID:18392:019 2017.10.11 16:20:24.720 INFO  ProjectionCoreServic] _eventReaders is not empty after all the projections have been killed
[PID:18392:019 2017.10.11 16:20:24.720 DEBUG ProjectionCoreServic] PROJECTIONS: Starting Projection Core Reader (reads from $projections-$025c5717850c452c971a73761d394851)
[PID:18392:019 2017.10.11 16:20:24.721 DEBUG ResponseWriter      ] PROJECTIONS: Resetting Master Writer
[PID:18392:017 2017.10.11 16:20:24.726 INFO  TcpService          ] Internal TCP connection accepted: [Normal, 127.0.0.1:34427, L127.0.0.1:2110, {dea512e1-81fd-425a-a7b4-c8216607e01a}].
[PID:18392:012 2017.10.11 16:20:24.792 TRACE GossipServiceBase   ] CLUSTER HAS CHANGED (gossip received from [127.0.0.1:2122])

Projections get stuck at this point

Resolution:
Keep track of subcomponent starts and stops.
During a node transition to Master, only start the ProjectionCoreCoordinator when all subcomponents (from the previous running instance) have been stopped.

Prevent 2 instances of ProjectionCoreCoordinator & its subcomponents …
…from running in parallel

Fix bug: ProjectionCoreCoordinator, Wrong order of ReaderCoreServiceMessage.StopReader and ProjectionCoreServiceMessage.StopCore in Stop()

@hayley-jean hayley-jean merged commit ac7f09c into release-v4.0.4 Oct 18, 2017

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
wercker/build-mono4 Wercker pipeline passed
Details

@hayley-jean hayley-jean deleted the fix-parallel-projection-core-run branch Oct 18, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment