Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug when two instances of command reader are running and receiving ProjectionCoreServiceMessage.StopCore #1464

Closed
wants to merge 1 commit into from

Conversation

shaan1337
Copy link
Member

@shaan1337 shaan1337 commented Oct 12, 2017

Symptoms: Projection gets stuck after election

If a second instance of the command reader starts running (when receiving ProjectionCoreServiceMessage.StartCore message with another Epoch ID) before the first one is stopped, the command reader is stopped when the StopCore message for the first instance arrives instead of continuing to run.

Example:

[PID:18392:012 2017.10.11 16:20:24.644 INFO  ClusterVNodeControll] === NO QUORUM EMERGED WITHIN TIMEOUT... RETIRING...
[PID:18392:012 2017.10.11 16:20:24.644 INFO  ClusterVNodeControll] ========== [127.0.0.1:2112] IS UNKNOWN...
[PID:18392:012 2017.10.11 16:20:24.644 DEBUG ElectionsService    ] ELECTIONS: STARTING ELECTIONS.
[PID:18392:012 2017.10.11 16:20:24.644 DEBUG ElectionsService    ] ELECTIONS: (V=4) SHIFT TO LEADER ELECTION.
[PID:18392:012 2017.10.11 16:20:24.644 DEBUG ElectionsService    ] ELECTIONS: (V=4) VIEWCHANGE FROM [127.0.0.1:2112, {26650f98-b103-4fec-ac03-34673619413a}].
[PID:18392:018 2017.10.11 16:20:24.644 DEBUG ProjectionCoreCoordi] PROJECTIONS: Stopping Projections Core Coordinator. (Node State : Unknown)
---
Stopping projection manager
[PID:18392:018 2017.10.11 16:20:24.645 DEBUG ProjectionManager   ] PROJECTIONS: Stopping Projections Manager. (Node State : Unknown)
---
[PID:18392:015 2017.10.11 16:20:24.648 DEBUG PersistentSubscripti] Subscriptions received state change to Unknown stopping listening.
[PID:18392:012 2017.10.11 16:20:24.665 DEBUG ElectionsService    ] ELECTIONS: (V=4) VIEWCHANGE FROM [127.0.0.1:2122, {df52b02e-e51c-4941-9e2d-0db5cd2bed76}].
[PID:18392:012 2017.10.11 16:20:24.665 DEBUG ElectionsService    ] ELECTIONS: (V=4) MAJORITY OF VIEWCHANGE.
[PID:18392:012 2017.10.11 16:20:24.670 DEBUG ElectionsService    ] ELECTIONS: (V=4) PREPARE FROM [127.0.0.1:2122, {df52b02e-e51c-4941-9e2d-0db5cd2bed76}].
[PID:18392:012 2017.10.11 16:20:24.670 DEBUG ElectionsService    ] ELECTIONS: (V=4) SHIFT TO REG_NONLEADER.
[PID:18392:012 2017.10.11 16:20:24.693 DEBUG ElectionsService    ] ELECTIONS: (V=4) PROPOSAL FROM [127.0.0.1:2122,{df52b02e-e51c-4941-9e2d-0db5cd2bed76}] M=[127.0.0.1:2112,{26650f98-b103-4fec-ac03-34673619413a}](L=108911,W=109516,C=109516,E6@63525:{958f8051-057a-4b49-b126-f3f237b6abbb}). ME=[127.0.0.1:2112,{26650f98-b103-4fec-ac03-34673619413a}](L=108911,W=109516,C=109516,E6@63525:{958f8051-057a-4b49-b126-f3f237b6abbb}).
[PID:18392:012 2017.10.11 16:20:24.693 DEBUG ElectionsService    ] ELECTIONS: (V=4) ACCEPT FROM [127.0.0.1:2122,{df52b02e-e51c-4941-9e2d-0db5cd2bed76}] M=[127.0.0.1:2112,{26650f98-b103-4fec-ac03-34673619413a}]).
[PID:18392:012 2017.10.11 16:20:24.693 DEBUG ElectionsService    ] ELECTIONS: (V=4) ACCEPT FROM [127.0.0.1:2112,{26650f98-b103-4fec-ac03-34673619413a}] M=[127.0.0.1:2112,{26650f98-b103-4fec-ac03-34673619413a}]).
[PID:18392:012 2017.10.11 16:20:24.693 INFO  ElectionsService    ] ELECTIONS: (V=4) DONE. ELECTED MASTER = [127.0.0.1:2112,{26650f98-b103-4fec-ac03-34673619413a}](L=108911,W=109516,C=109516,E6@63525:{958f8051-057a-4b49-b126-f3f237b6abbb}). ME=[127.0.0.1:2112,{26650f98-b103-4fec-ac03-34673619413a}](L=108911,W=109516,C=109516,E6@63525:{958f8051-057a-4b49-b126-f3f237b6abbb}).
[PID:18392:012 2017.10.11 16:20:24.693 INFO  ClusterVNodeControll] ========== [127.0.0.1:2112] PRE-MASTER STATE, WAITING FOR CHASER TO CATCH UP...
[PID:18392:022 2017.10.11 16:20:24.694 DEBUG PersistentSubscripti] Subscriptions received state change to PreMaster stopping listening.
[PID:18392:012 2017.10.11 16:20:24.699 INFO  ClusterVNodeControll] ========== [127.0.0.1:2112] IS MASTER... SPARTA!
[PID:18392:009 2017.10.11 16:20:24.699 DEBUG EpochManager        ] === Writing E7@109516:{53449de7-e378-45f8-902a-b73a392597be} (previous epoch at 63525).
[PID:18392:006 2017.10.11 16:20:24.699 DEBUG PersistentSubscripti] Subscriptions Became Master so now handling subscriptions
[PID:18392:009 2017.10.11 16:20:24.705 DEBUG EpochManager        ] === Update Last Epoch E7@109516:{53449de7-e378-45f8-902a-b73a392597be} (previous epoch at 63525).
[PID:18392:018 2017.10.11 16:20:24.705 DEBUG ProjectionCoreCoordi] PROJECTIONS: Starting Projections Core Coordinator. (Node State : Master)
[PID:18392:018 2017.10.11 16:20:24.705 DEBUG ProjectionManager   ] PROJECTIONS: Starting Projections Manager. (Node State : Master)
[PID:18392:018 2017.10.11 16:20:24.705 DEBUG ProjectionManagerRes] PROJECTIONS: There was an active cancellation scope, cancelling now
[PID:18392:018 2017.10.11 16:20:24.705 DEBUG ProjectionManagerRes] PROJECTIONS: Starting Projection Manager Response Reader (reads from $projections-$master)
[PID:18392:018 2017.10.11 16:20:24.705 DEBUG MultiStreamMessageWr] PROJECTIONS: Resetting Worker Writer
[PID:18392:021 2017.10.11 16:20:24.714 INFO  ProjectionCoreServic] _projections is not empty after all the projections have been killed
[PID:18392:021 2017.10.11 16:20:24.714 TRACE InMemoryBus         ] SLOW BUS MSG [bus]: StopCore - 68ms. Handler: ProjectionCoreService.
[PID:18392:021 2017.10.11 16:20:24.714 DEBUG ProjectionCoreServic] PROJECTIONS: Stopping Projection Core Reader (34d3170a1d1b4d4bb2481bc0999924d0)
[PID:18392:021 2017.10.11 16:20:24.714 TRACE QueuedHandlerAutoRes] SLOW QUEUE MSG [Projection Core #2]: StopCore - 68ms. Q: 1/4.
[PID:18392:020 2017.10.11 16:20:24.715 INFO  ProjectionCoreServic] _projections is not empty after all the projections have been killed
[PID:18392:020 2017.10.11 16:20:24.715 TRACE InMemoryBus         ] SLOW BUS MSG [bus]: StopCore - 68ms. Handler: ProjectionCoreService.
[PID:18392:020 2017.10.11 16:20:24.715 DEBUG ProjectionCoreServic] PROJECTIONS: Stopping Projection Core Reader (d614795683174951b061bfee08f63ae5)
[PID:18392:020 2017.10.11 16:20:24.715 TRACE QueuedHandlerAutoRes] SLOW QUEUE MSG [Projection Core #1]: StopCore - 68ms. Q: 1/7.
[PID:18392:020 2017.10.11 16:20:24.715 INFO  ProjectionCoreServic] _subscriptions is not empty after all the projections have been killed
[PID:18392:020 2017.10.11 16:20:24.716 INFO  ProjectionCoreServic] _eventReaders is not empty after all the projections have been killed
[PID:18392:020 2017.10.11 16:20:24.716 INFO  ProjectionCoreServic] _subscriptionEventReaders is not empty after all the projections have been killed
[PID:18392:020 2017.10.11 16:20:24.718 DEBUG ResponseWriter      ] PROJECTIONS: Scheduling the writing of $stopped to $projections-$master. Current status of Writer: Busy: True
[PID:18392:021 2017.10.11 16:20:24.718 INFO  ProjectionCoreServic] _eventReaders is not empty after all the projections have been killed
---
New instances of Projection Core Reader started:
[PID:18392:021 2017.10.11 16:20:24.718 DEBUG ProjectionCoreServic] PROJECTIONS: Starting Projection Core Reader (reads from $projections-$34d3170a1d1b4d4bb2481bc0999924d0)
[PID:18392:021 2017.10.11 16:20:24.718 DEBUG ResponseWriter      ] PROJECTIONS: Resetting Master Writer
[PID:18392:020 2017.10.11 16:20:24.718 DEBUG ProjectionCoreServic] PROJECTIONS: Starting Projection Core Reader (reads from $projections-$d614795683174951b061bfee08f63ae5)
[PID:18392:020 2017.10.11 16:20:24.719 DEBUG ResponseWriter      ] PROJECTIONS: Resetting Master Writer
---
[PID:18392:019 2017.10.11 16:20:24.719 INFO  ProjectionCoreServic] _projections is not empty after all the projections have been killed
[PID:18392:019 2017.10.11 16:20:24.719 TRACE InMemoryBus         ] SLOW BUS MSG [bus]: StopCore - 73ms. Handler: ProjectionCoreService.
---
StopCore message received after other instances started :
[PID:18392:019 2017.10.11 16:20:24.720 DEBUG ProjectionCoreServic] PROJECTIONS: Stopping Projection Core Reader (025c5717850c452c971a73761d394851)
[PID:18392:019 2017.10.11 16:20:24.720 TRACE QueuedHandlerAutoRes] SLOW QUEUE MSG [Projection Core #0]: StopCore - 73ms. Q: 1/4.
---
[PID:18392:019 2017.10.11 16:20:24.720 INFO  ProjectionCoreServic] _eventReaders is not empty after all the projections have been killed
[PID:18392:019 2017.10.11 16:20:24.720 DEBUG ProjectionCoreServic] PROJECTIONS: Starting Projection Core Reader (reads from $projections-$025c5717850c452c971a73761d394851)
[PID:18392:019 2017.10.11 16:20:24.721 DEBUG ResponseWriter      ] PROJECTIONS: Resetting Master Writer
[PID:18392:017 2017.10.11 16:20:24.726 INFO  TcpService          ] Internal TCP connection accepted: [Normal, 127.0.0.1:34427, L127.0.0.1:2110, {dea512e1-81fd-425a-a7b4-c8216607e01a}].
[PID:18392:012 2017.10.11 16:20:24.792 TRACE GossipServiceBase   ] CLUSTER HAS CHANGED (gossip received from [127.0.0.1:2122])

Projections get stuck at this point

Resolution: Keep track of number of active starts with a variable: _starts

…eiving StartCore msg with another Epoch ID) before the 1st one is stopped, the command reader is stopped when the StopCore message for the first instance arrives instead of continuing to run
@shaan1337
Copy link
Member Author

Closing this, a better solution implemented in PR #1466

@shaan1337 shaan1337 closed this Oct 16, 2017
@shaan1337 shaan1337 deleted the fix-stopcore-bug branch December 14, 2017 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant