
Persistent subscriptions are quietly stopping #1392

Closed
mpielikis opened this issue Aug 18, 2017 · 43 comments · Fixed by #2214
Labels: action/investigate (For future investigation), detectability/hard (Negative impact will be hard to detect), impact/high (High prospect of negative impact), kind/bug (Issues which are a software defect), subsystem/persistent subs (Issues relating to persistent subscriptions)

Comments

@mpielikis (Contributor)

Since the 3.8 versions and up to the recent 4.0.1.4 version we have seen occasional (roughly once or twice a week) stopping/freezing of persistent subscriptions on a three-node cluster. We can't detect the problem from the logs or from the index.html#/subscriptions page. We see that the stream is not being consumed, and restarting the client does not help. The only way to wake these subscriptions up is to restart the EventStore master. After the master restart/failover, all the subscriptions instantly become active again.

The last incident happened last night, after an upgrade from 3.9.2 to 4.0.1.4. The subscription stopped around 2017.08.17 23:21:00 UTC.

Master log at that time

[PID:07240:054 2017.08.17 23:20:56.201 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#1]: ReadAllEventsForward - 218ms. Q: 0/0.
[PID:07240:049 2017.08.17 23:20:56.482 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#2]: ReadAllEventsForward - 218ms. Q: 0/1.
[PID:07240:054 2017.08.17 23:20:56.732 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#4]: ReadAllEventsForward - 218ms. Q: 0/1.
[PID:07240:049 2017.08.17 23:20:56.998 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#1]: ReadAllEventsForward - 249ms. Q: 0/1.
[PID:07240:052 2017.08.17 23:20:57.248 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#2]: ReadAllEventsForward - 234ms. Q: 0/1.
[PID:07240:049 2017.08.17 23:20:57.529 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#3]: ReadAllEventsForward - 234ms. Q: 0/1.
[PID:07240:056 2017.08.17 23:20:57.810 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#4]: ReadAllEventsForward - 265ms. Q: 0/1.
[PID:07240:052 2017.08.17 23:20:58.076 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#2]: ReadAllEventsForward - 234ms. Q: 0/0.
[PID:07240:043 2017.08.17 23:20:58.342 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#3]: ReadAllEventsForward - 218ms. Q: 0/0.
[PID:07240:056 2017.08.17 23:20:58.623 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#3]: ReadAllEventsForward - 265ms. Q: 0/1.
[PID:07240:043 2017.08.17 23:20:58.857 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#1]: ReadAllEventsForward - 203ms. Q: 0/1.
[PID:07240:054 2017.08.17 23:20:59.107 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#3]: ReadAllEventsForward - 218ms. Q: 0/1.
[PID:07240:034 2017.08.17 23:20:59.389 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#1]: ReadAllEventsForward - 234ms. Q: 0/1.
[PID:07240:034 2017.08.17 23:20:59.623 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#2]: ReadAllEventsForward - 218ms. Q: 0/1.
[PID:07240:056 2017.08.17 23:20:59.858 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#4]: ReadAllEventsForward - 203ms. Q: 0/1.
[PID:07240:042 2017.08.17 23:21:00.065 DEBUG PersistentSubscripti] Publishing checkpoint for $persistentsubscription-MeterEventBatchProcessor::meter-event-batch-processor-checkpoint: 512134
[PID:07240:018 2017.08.17 23:21:00.065 DEBUG PersistentSubscripti] Checkpoint write successful for $persistentsubscription-MeterEventBatchProcessor::meter-event-batch-processor-checkpoint
[PID:07240:034 2017.08.17 23:21:00.128 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#1]: ReadAllEventsForward - 218ms. Q: 0/2.
[PID:07240:046 2017.08.17 23:21:00.393 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#2]: ReadAllEventsForward - 234ms. Q: 0/1.
[PID:07240:054 2017.08.17 23:21:00.644 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#3]: ReadAllEventsForward - 203ms. Q: 0/0.
[PID:07240:034 2017.08.17 23:21:00.894 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#4]: ReadAllEventsForward - 218ms. Q: 0/1.
[PID:07240:031 2017.08.17 23:21:01.175 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#2]: ReadAllEventsForward - 218ms. Q: 0/2.
[PID:07240:034 2017.08.17 23:21:01.440 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#1]: ReadAllEventsForward - 234ms. Q: 0/1.
[PID:07240:031 2017.08.17 23:21:01.690 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#2]: ReadAllEventsForward - 218ms. Q: 0/1.
[PID:07240:034 2017.08.17 23:21:02.222 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#4]: ReadAllEventsForward - 281ms. Q: 0/1.
[PID:07240:006 2017.08.17 23:21:02.487 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#1]: ReadAllEventsForward - 234ms. Q: 0/1.
[PID:07240:026 2017.08.17 23:21:02.769 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#3]: ReadAllEventsForward - 250ms. Q: 0/1.
[PID:07240:031 2017.08.17 23:21:03.066 DEBUG PersistentSubscripti] Publishing checkpoint for $persistentsubscription-ChannelsPartition-5::channelProcessor-group-checkpoint: 315683
[PID:07240:031 2017.08.17 23:21:03.066 DEBUG PersistentSubscripti] Publishing checkpoint for $persistentsubscription-ChannelsPartition-10::channelProcessor-group-checkpoint: 315049
[PID:07240:031 2017.08.17 23:21:03.066 DEBUG PersistentSubscripti] Publishing checkpoint for $persistentsubscription-ChannelsPartition-11::channelProcessor-group-checkpoint: 305759
[PID:07240:031 2017.08.17 23:21:03.066 DEBUG PersistentSubscripti] Publishing checkpoint for $persistentsubscription-ChannelsPartition-13::channelProcessor-group-checkpoint: 336682
[PID:07240:031 2017.08.17 23:21:03.066 DEBUG PersistentSubscripti] Publishing checkpoint for $persistentsubscription-ChannelsPartition-15::channelProcessor-group-checkpoint: 335854
[PID:07240:031 2017.08.17 23:21:03.066 DEBUG PersistentSubscripti] Publishing checkpoint for $persistentsubscription-ChannelsPartition-18::channelProcessor-group-checkpoint: 304294
[PID:07240:031 2017.08.17 23:21:03.066 DEBUG PersistentSubscripti] Publishing checkpoint for $persistentsubscription-ChannelsPartition-30::channelProcessor-group-checkpoint: 308246
[PID:07240:043 2017.08.17 23:21:03.066 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#2]: ReadAllEventsForward - 249ms. Q: 0/3.
[PID:07240:018 2017.08.17 23:21:03.066 DEBUG PersistentSubscripti] Checkpoint write successful for $persistentsubscription-ChannelsPartition-11::channelProcessor-group-checkpoint
[PID:07240:018 2017.08.17 23:21:03.081 DEBUG PersistentSubscripti] Checkpoint write successful for $persistentsubscription-ChannelsPartition-15::channelProcessor-group-checkpoint
[PID:07240:018 2017.08.17 23:21:03.081 DEBUG PersistentSubscripti] Checkpoint write successful for $persistentsubscription-ChannelsPartition-30::channelProcessor-group-checkpoint
[PID:07240:018 2017.08.17 23:21:03.081 DEBUG PersistentSubscripti] Checkpoint write successful for $persistentsubscription-ChannelsPartition-5::channelProcessor-group-checkpoint
[PID:07240:018 2017.08.17 23:21:03.081 DEBUG PersistentSubscripti] Checkpoint write successful for $persistentsubscription-ChannelsPartition-10::channelProcessor-group-checkpoint
[PID:07240:018 2017.08.17 23:21:03.081 DEBUG PersistentSubscripti] Checkpoint write successful for $persistentsubscription-ChannelsPartition-18::channelProcessor-group-checkpoint
[PID:07240:018 2017.08.17 23:21:03.128 DEBUG PersistentSubscripti] Checkpoint write successful for $persistentsubscription-ChannelsPartition-13::channelProcessor-group-checkpoint
[PID:07240:054 2017.08.17 23:21:03.316 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#2]: ReadAllEventsForward - 218ms. Q: 0/1.
[PID:07240:046 2017.08.17 23:21:03.612 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#3]: ReadAllEventsForward - 265ms. Q: 0/0.
[PID:07240:054 2017.08.17 23:21:03.909 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#4]: ReadAllEventsForward - 250ms. Q: 0/0.
[PID:07240:006 2017.08.17 23:21:03.956 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#3]: ReadEvent - 296ms. Q: 0/0.
[PID:07240:046 2017.08.17 23:21:04.159 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#1]: ReadAllEventsForward - 218ms. Q: 0/0.
[PID:07240:010 2017.08.17 23:21:04.566 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#1]: ReadAllEventsForward - 375ms. Q: 0/1.
[PID:07240:034 2017.08.17 23:21:04.909 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#4]: ReadAllEventsForward - 250ms. Q: 0/1.
[PID:07240:018 2017.08.17 23:21:05.034 TRACE TwoPhaseRequestManag] IDEMPOTENT WRITE TO STREAM ClientCorrelationID a8f62bd6-2af3-427b-8706-f0a457709f82, EventStreamId: System, CorrelationId: ffd7873b-9841-4891-9107-4cf75a9947b1, FirstEventNumber: 1228749, LastEventNumber: 1228749.
[PID:07240:054 2017.08.17 23:21:05.253 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#1]: ReadAllEventsForward - 296ms. Q: 0/2.
[PID:07240:018 2017.08.17 23:21:05.300 TRACE TwoPhaseRequestManag] IDEMPOTENT WRITE TO STREAM ClientCorrelationID d780d587-1f1a-4716-baca-5d91807ff205, EventStreamId: System, CorrelationId: 37da3885-4bc0-4f34-880d-2220e235bbda, FirstEventNumber: 1228749, LastEventNumber: 1228749.
[PID:07240:025 2017.08.17 23:21:05.534 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#4]: ReadAllEventsForward - 218ms. Q: 0/1.
[PID:07240:054 2017.08.17 23:21:05.816 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#1]: ReadAllEventsForward - 250ms. Q: 0/1.
[PID:07240:025 2017.08.17 23:21:06.113 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#2]: ReadAllEventsForward - 250ms. Q: 0/1.
[PID:07240:054 2017.08.17 23:21:06.441 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#2]: ReadAllEventsForward - 281ms. Q: 0/1.
[PID:07240:025 2017.08.17 23:21:06.706 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#3]: ReadAllEventsForward - 234ms. Q: 0/1.
[PID:07240:046 2017.08.17 23:21:06.956 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#1]: ReadAllEventsForward - 218ms. Q: 0/1.
[PID:07240:010 2017.08.17 23:21:07.285 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#3]: ReadAllEventsForward - 265ms. Q: 0/0.
[PID:07240:043 2017.08.17 23:21:07.597 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#2]: ReadAllEventsForward - 281ms. Q: 0/1.
[PID:07240:034 2017.08.17 23:21:07.894 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#4]: ReadAllEventsForward - 218ms. Q: 0/0.
[PID:07240:034 2017.08.17 23:21:08.316 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#4]: ReadAllEventsForward - 218ms. Q: 0/1.
[PID:07240:025 2017.08.17 23:21:08.597 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#2]: ReadAllEventsForward - 218ms. Q: 0/1.
[PID:07240:043 2017.08.17 23:21:08.894 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#3]: ReadAllEventsForward - 265ms. Q: 0/0.
[PID:07240:034 2017.08.17 23:21:09.191 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#4]: ReadAllEventsForward - 265ms. Q: 0/1.
[PID:07240:043 2017.08.17 23:21:09.472 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#1]: ReadAllEventsForward - 234ms. Q: 0/1.
[PID:07240:043 2017.08.17 23:21:09.754 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#2]: ReadAllEventsForward - 265ms. Q: 0/1.
[PID:07240:046 2017.08.17 23:21:10.019 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#3]: ReadAllEventsForward - 234ms. Q: 0/1.
[PID:07240:031 2017.08.17 23:21:10.394 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#1]: ReadAllEventsForward - 296ms. Q: 0/0.
[PID:07240:043 2017.08.17 23:21:10.707 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue EventStore/EventStore#2]: ReadAllEventsForward - 281ms. Q: 0/0.

Cluster stats

{
  "proc": {
    "startTime": "2017-08-17T14:37:47.4349457Z",
    "id": 18328,
    "mem": 1070555136,
    "cpu": 233.079865,
    "cpuScaled": 29.1349831,
    "threadsCount": 49,
    "contentionsRate": 0.0,
    "thrownExceptionsRate": 0.0,
    "diskIo": {
      "readBytes": 12001665746853,
      "writtenBytes": 28417119922,
      "readOps": 1460891441,
      "writeOps": 2845510
    },
    "tcp": {
      "connections": 19,
      "receivingSpeed": 21938.458267457536,
      "sendingSpeed": 2506481.8386726012,
      "inSend": 0,
      "measureTime": "00:00:01.0591902",
      "pendingReceived": 0,
      "pendingSend": 0,
      "receivedBytesSinceLastRun": 23237,
      "receivedBytesTotal": 5612295687,
      "sentBytesSinceLastRun": 2654841,
      "sentBytesTotal": 81248599794
    },
    "gc": {
      "allocationSpeed": 1.55199E+08,
      "gen0ItemsCount": 40163,
      "gen0Size": 6291456,
      "gen1ItemsCount": 9285,
      "gen1Size": 1296648,
      "gen2ItemsCount": 608,
      "gen2Size": 257035464,
      "largeHeapSize": 324430424,
      "timeInGc": 2.8402946,
      "totalBytesInHeaps": 582762536
    }
  },
  "sys": {
    "cpu": 66.01876,
    "freeMem": 2854817792,
    "drive": {
      "c": {
        "availableBytes": 83237036032,
        "totalBytes": 536500760576,
        "usage": "84%",
        "usedBytes": 453263724544
      }
    }
  },
  "es": {
    "checksum": 1084824927495,
    "checksumNonFlushed": 1084824927495,
    "queue": {
      "mainQueue": {
        "queueName": "MainQueue",
        "groupName": "",
        "avgItemsPerSecond": 359,
        "avgProcessingTime": 0.049526302083333335,
        "currentIdleTime": "0:00:00:00.0012368",
        "currentItemProcessingTime": null,
        "idleTimePercent": 98.762716877636819,
        "length": 0,
        "lengthCurrentTryPeak": 9,
        "lengthLifetimePeak": 4677,
        "totalItemsProcessed": 10138387,
        "inProgressMessage": "<none>",
        "lastProcessedMessage": "ReadStreamEventsForward"
      },
      "master Replication Service": {
        "queueName": "Master Replication Service",
        "groupName": "",
        "avgItemsPerSecond": 48,
        "avgProcessingTime": 0.09857692307692309,
        "currentIdleTime": "0:00:00:00.0531661",
        "currentItemProcessingTime": null,
        "idleTimePercent": 99.523521537686847,
        "length": 2,
        "lengthCurrentTryPeak": 2,
        "lengthLifetimePeak": 2,
        "totalItemsProcessed": 372799,
        "inProgressMessage": "<none>",
        "lastProcessedMessage": "SendReplicationData"
      },
      "monitoringQueue": {
        "queueName": "MonitoringQueue",
        "groupName": "",
        "avgItemsPerSecond": 2,
        "avgProcessingTime": 16.0854,
        "currentIdleTime": null,
        "currentItemProcessingTime": "0:00:00:00.0414227",
        "idleTimePercent": 95.478182602181789,
        "length": 1,
        "lengthCurrentTryPeak": 0,
        "lengthLifetimePeak": 4,
        "totalItemsProcessed": 7702,
        "inProgressMessage": "GetFreshStats",
        "lastProcessedMessage": "WriteEventsCompleted"
      },
      "projection Core #0": {
        "queueName": "Projection Core #0",
        "groupName": "Projection Core",
        "avgItemsPerSecond": 0,
        "avgProcessingTime": 0.0,
        "currentIdleTime": "0:01:09:48.3115636",
        "currentItemProcessingTime": null,
        "idleTimePercent": 100.0,
        "length": 0,
        "lengthCurrentTryPeak": 0,
        "lengthLifetimePeak": 0,
        "totalItemsProcessed": 3,
        "inProgressMessage": "<none>",
        "lastProcessedMessage": "StartReader"
      },
      "projections Master": {
        "queueName": "Projections Master",
        "groupName": "",
        "avgItemsPerSecond": 0,
        "avgProcessingTime": 0.0,
        "currentIdleTime": "0:01:09:48.2137472",
        "currentItemProcessingTime": null,
        "idleTimePercent": 100.0,
        "length": 0,
        "lengthCurrentTryPeak": 0,
        "lengthLifetimePeak": 10,
        "totalItemsProcessed": 25,
        "inProgressMessage": "<none>",
        "lastProcessedMessage": "RegularTimeout"
      },
      "storage Chaser": {
        "queueName": "Storage Chaser",
        "groupName": "",
        "avgItemsPerSecond": 96,
        "avgProcessingTime": 0.34431941747572814,
        "currentIdleTime": "0:00:00:00.0022446",
        "currentItemProcessingTime": null,
        "idleTimePercent": 96.696184178477466,
        "length": 0,
        "lengthCurrentTryPeak": 0,
        "lengthLifetimePeak": 0,
        "totalItemsProcessed": 6461417,
        "inProgressMessage": "<none>",
        "lastProcessedMessage": "ChaserCheckpointFlush"
      },
      "storageReaderQueue EventStore/EventStore#1": {
        "queueName": "StorageReaderQueue EventStore/EventStore#1",
        "groupName": "StorageReaderQueue",
        "avgItemsPerSecond": 54,
        "avgProcessingTime": 13.799558620689655,
        "currentIdleTime": "0:00:00:00.0046469",
        "currentItemProcessingTime": null,
        "idleTimePercent": 24.998479444289135,
        "length": 0,
        "lengthCurrentTryPeak": 10,
        "lengthLifetimePeak": 108,
        "totalItemsProcessed": 284703,
        "inProgressMessage": "<none>",
        "lastProcessedMessage": "ReadStreamEventsForward"
      },
      "storageReaderQueue EventStore/EventStore#2": {
        "queueName": "StorageReaderQueue EventStore/EventStore#2",
        "groupName": "StorageReaderQueue",
        "avgItemsPerSecond": 59,
        "avgProcessingTime": 8.7366285714285716,
        "currentIdleTime": null,
        "currentItemProcessingTime": "0:00:00:00.0065817",
        "idleTimePercent": 48.427923514397868,
        "length": 0,
        "lengthCurrentTryPeak": 13,
        "lengthLifetimePeak": 48,
        "totalItemsProcessed": 256760,
        "inProgressMessage": "ReadEvent",
        "lastProcessedMessage": "ReadStreamEventsForward"
      },
      "storageReaderQueue EventStore/EventStore#3": {
        "queueName": "StorageReaderQueue EventStore/EventStore#3",
        "groupName": "StorageReaderQueue",
        "avgItemsPerSecond": 34,
        "avgProcessingTime": 18.205637837837838,
        "currentIdleTime": null,
        "currentItemProcessingTime": "0:00:00:00.0664873",
        "idleTimePercent": 36.874348967180211,
        "length": 8,
        "lengthCurrentTryPeak": 15,
        "lengthLifetimePeak": 461,
        "totalItemsProcessed": 490572,
        "inProgressMessage": "ReadStreamEventsBackward",
        "lastProcessedMessage": "ReadStreamEventsForward"
      },
      "storageReaderQueue EventStore/EventStore#4": {
        "queueName": "StorageReaderQueue EventStore/EventStore#4",
        "groupName": "StorageReaderQueue",
        "avgItemsPerSecond": 38,
        "avgProcessingTime": 15.587243902439024,
        "currentIdleTime": null,
        "currentItemProcessingTime": "0:00:00:00.3773847",
        "idleTimePercent": 40.115020721891995,
        "length": 7,
        "lengthCurrentTryPeak": 12,
        "lengthLifetimePeak": 44,
        "totalItemsProcessed": 232627,
        "inProgressMessage": "ReadAllEventsForward",
        "lastProcessedMessage": "ReadStreamEventsForward"
      },
      "storageWriterQueue": {
        "queueName": "StorageWriterQueue",
        "groupName": "",
        "avgItemsPerSecond": 15,
        "avgProcessingTime": 5.0324588235294119,
        "currentIdleTime": "0:00:00:00.0537118",
        "currentItemProcessingTime": null,
        "idleTimePercent": 91.984726707915627,
        "length": 0,
        "lengthCurrentTryPeak": 2,
        "lengthLifetimePeak": 1615,
        "totalItemsProcessed": 1041029,
        "inProgressMessage": "<none>",
        "lastProcessedMessage": "WritePrepares"
      },
      "subscriptions": {
        "queueName": "Subscriptions",
        "groupName": "",
        "avgItemsPerSecond": 16,
        "avgProcessingTime": 0.023238888888888889,
        "currentIdleTime": "0:00:00:00.0516671",
        "currentItemProcessingTime": null,
        "idleTimePercent": 99.971482671482349,
        "length": 0,
        "lengthCurrentTryPeak": 1,
        "lengthLifetimePeak": 2197,
        "totalItemsProcessed": 2327515,
        "inProgressMessage": "<none>",
        "lastProcessedMessage": "EventCommitted"
      },
      "timer": {
        "queueName": "Timer",
        "groupName": "",
        "avgItemsPerSecond": 31,
        "avgProcessingTime": 0.049658823529411769,
        "currentIdleTime": "0:00:00:00.0056987",
        "currentItemProcessingTime": null,
        "idleTimePercent": 99.876878624796646,
        "length": 32,
        "lengthCurrentTryPeak": 32,
        "lengthLifetimePeak": 38,
        "totalItemsProcessed": 979770,
        "inProgressMessage": "<none>",
        "lastProcessedMessage": "ExecuteScheduledTasks"
      },
      "worker EventStore/EventStore#1": {
        "queueName": "Worker EventStore/EventStore#1",
        "groupName": "Workers",
        "avgItemsPerSecond": 6,
        "avgProcessingTime": 0.17930000000000001,
        "currentIdleTime": "0:00:00:00.0148259",
        "currentItemProcessingTime": null,
        "idleTimePercent": 99.884573177531649,
        "length": 0,
        "lengthCurrentTryPeak": 0,
        "lengthLifetimePeak": 721,
        "totalItemsProcessed": 1173046,
        "inProgressMessage": "<none>",
        "lastProcessedMessage": "IncomingHttpRequestMessage"
      },
      "worker EventStore/EventStore#2": {
        "queueName": "Worker EventStore/EventStore#2",
        "groupName": "Workers",
        "avgItemsPerSecond": 29,
        "avgProcessingTime": 0.21633225806451614,
        "currentIdleTime": "0:00:00:00.0090712",
        "currentItemProcessingTime": null,
        "idleTimePercent": 99.376988350007323,
        "length": 0,
        "lengthCurrentTryPeak": 1,
        "lengthLifetimePeak": 37,
        "totalItemsProcessed": 468189,
        "inProgressMessage": "<none>",
        "lastProcessedMessage": "TcpSend"
      },
      "worker EventStore/EventStore#3": {
        "queueName": "Worker EventStore/EventStore#3",
        "groupName": "Workers",
        "avgItemsPerSecond": 59,
        "avgProcessingTime": 0.197975,
        "currentIdleTime": "0:00:00:00.0531228",
        "currentItemProcessingTime": null,
        "idleTimePercent": 98.822102228852884,
        "length": 0,
        "lengthCurrentTryPeak": 1,
        "lengthLifetimePeak": 452,
        "totalItemsProcessed": 1881443,
        "inProgressMessage": "<none>",
        "lastProcessedMessage": "TcpSend"
      },
      "worker EventStore/EventStore#4": {
        "queueName": "Worker EventStore/EventStore#4",
        "groupName": "Workers",
        "avgItemsPerSecond": 138,
        "avgProcessingTime": 0.60060405405405415,
        "currentIdleTime": "0:00:00:00.0043397",
        "currentItemProcessingTime": null,
        "idleTimePercent": 91.693977436424916,
        "length": 0,
        "lengthCurrentTryPeak": 5,
        "lengthLifetimePeak": 342,
        "totalItemsProcessed": 746591,
        "inProgressMessage": "<none>",
        "lastProcessedMessage": "TcpSend"
      },
      "worker EventStore/EventStore#5": {
        "queueName": "Worker EventStore/EventStore#5",
        "groupName": "Workers",
        "avgItemsPerSecond": 9,
        "avgProcessingTime": 0.12348,
        "currentIdleTime": "0:00:00:00.0092776",
        "currentItemProcessingTime": null,
        "idleTimePercent": 99.887002737779909,
        "length": 0,
        "lengthCurrentTryPeak": 1,
        "lengthLifetimePeak": 63,
        "totalItemsProcessed": 319862,
        "inProgressMessage": "<none>",
        "lastProcessedMessage": "TcpSend"
      }
    },
    "writer": {
      "lastFlushSize": 899,
      "lastFlushDelayMs": 2.8726079061321412,
      "meanFlushSize": 990,
      "meanFlushDelayMs": 8.047793519848268,
      "maxFlushSize": 3251632,
      "maxFlushDelayMs": 4087.7333426456207,
      "queuedFlushMessages": 0
    },
    "readIndex": {
      "cachedRecord": 3132382,
      "notCachedRecord": 44531547,
      "cachedStreamInfo": 150486727,
      "notCachedStreamInfo": 2843415,
      "cachedTransInfo": 0,
      "notCachedTransInfo": 0
    }
  }
}
@hayley-jean added the kind/bug (Issues which are a software defect) and investigate labels on Aug 18, 2017
@hayley-jean (Member)

Thank you for submitting this, @mpielikis

Would you be able to provide the logs for the master node at the time? If you don't want to share them here, you can email them to hayley at geteventstore.com

@Faq commented Aug 27, 2017

Any follow-up here?

@shaan1337 (Member)

Hello @mpielikis

Can you please let us know if you're still facing the same issue in the latest version of EventStore? (if you've upgraded)

Do you remember if all the subscriptions that were stuck were subscribing from the same or different streams?

Thanks
Shaan

@mpielikis (Contributor, Author)

Hello @shaan1337

Can you please let us know if you're still facing the same issue in the latest version of EventStore? (if you've upgraded)

Yes, it still happens. Additional debug-level logging of persistent subscriptions could bring more details, IMO. The current version is 4.0.3.0.

Do you remember if all the subscriptions that were stuck were subscribing from the same or different streams?

Different subscriptions for different streams. We could not find any correlation. Every time, the frozen subscriptions were different, one or many. We mostly use the DispatchToSingle type of subscription. Three clients connect to many DispatchToSingle subscriptions (37) and to a couple of RoundRobin ones.
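
For readers unfamiliar with these groups: a minimal sketch, assuming the EventStore.ClientAPI .NET client, of how a DispatchToSingle persistent subscription group like the ones mentioned above is typically created. The endpoint, stream name, group name and credentials below are placeholders, not values from this issue.

// Minimal sketch, assuming EventStore.ClientAPI (the 4.x-era .NET client); all names are placeholders.
// using System; using EventStore.ClientAPI; using EventStore.ClientAPI.SystemData;
// ...inside an async method:
var conn = EventStoreConnection.Create(new Uri("tcp://admin:changeit@localhost:1113"));
await conn.ConnectAsync();

PersistentSubscriptionSettings settings = PersistentSubscriptionSettings.Create()
    .PreferDispatchToSingle()                        // "namedConsumerStrategy": "DispatchToSingle"
    .StartFromCurrent()                              // "startFrom": -1
    .WithMessageTimeoutOf(TimeSpan.FromSeconds(30))  // "messageTimeoutMilliseconds": 30000
    .CheckPointAfter(TimeSpan.FromSeconds(2))        // "checkPointAfterMilliseconds": 2000
    .MinimumCheckPointCountOf(10)
    .MaximumCheckPointCountOf(1000);

await conn.CreatePersistentSubscriptionAsync(
    "ChannelsPartition-0",        // stream (placeholder)
    "channelProcessor-group",     // group name (placeholder)
    settings,
    new UserCredentials("admin", "changeit"));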

@gregoryyoung (Contributor)

Just from looking at the log messages, is it possible that a timeout is occurring on the checkpoint write?

@mpielikis (Contributor, Author)

I was analyzing the log and didn't find any suspicious errors or timeouts, just the usual messages. If there is a timeout, shouldn't there be a retry?

@gregoryyoung (Contributor)

There is a retry, up to a certain depth.

@mpielikis (Contributor, Author)

I also noticed that when a subscription "freezes", the number of clients connected to a DispatchToSingle subscription becomes 0 or more than 1, when it should normally be exactly one. It seems as if the client socket connections become inactive or stalled, and the only way to reset them is to restart the server.

@mpielikis (Contributor, Author) commented Dec 28, 2017

the last "freeze". As you see there is 7 identical connections

curl -i "http://10.12.77.137:2114/subscriptions/ChannelsPartition-23/channelProcessor-group/info" -u ***:***> out.txt
HTTP/1.1 200 OK
Content-Length: 3597
Content-Type: application/json; charset=utf-8
Server: Microsoft-HTTPAPI/2.0
Access-Control-Allow-Methods: GET, GET, OPTIONS
Access-Control-Allow-Headers: Content-Type, X-Requested-With, X-Forwarded-Host, X-Forwarded-Prefix, X-PINGOTHER, Authorization, ES-LongPoll, ES-ExpectedVersion, ES-EventId, ES-EventType, ES-RequiresMaster, ES-HardDelete, ES-ResolveLinkTos
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Location, ES-Position, ES-CurrentVersion
Date: Thu, 28 Dec 2017 17:47:48 GMT

{
  "lastPushClientsTime": "2017-12-28T19:47:48.1759168+02:00",
  "lastPushMessageTime": "2017-12-28T19:47:33.1473113+02:00",
  "lastReadBatchTime": "2017-12-28T19:31:07.6428082+02:00",
  "lastPushResult": "Sent",
  "links": [
    {
      "href": "http://10.12.77.137:2114/subscriptions/ChannelsPartition-23/channelProcessor-group/info",
      "rel": "detail"
    },
    {
      "href": "http://10.12.77.137:2114/subscriptions/ChannelsPartition-23/channelProcessor-group/replayParked",
      "rel": "replayParked"
    }
  ],
  "config": {
    "resolveLinktos": false,
    "startFrom": -1,
    "messageTimeoutMilliseconds": 30000,
    "extraStatistics": false,
    "maxRetryCount": 10,
    "liveBufferSize": 500,
    "bufferSize": 500,
    "readBatchSize": 20,
    "preferRoundRobin": false,
    "checkPointAfterMilliseconds": 2000,
    "minCheckPointCount": 10,
    "maxCheckPointCount": 1000,
    "maxSubscriberCount": 0,
    "namedConsumerStrategy": "DispatchToSingle"
  },
  "eventStreamId": "ChannelsPartition-23",
  "groupName": "channelProcessor-group",
  "status": "Live",
  "averageItemsPerSecond": 0.0,
  "parkedMessageUri": "http://10.12.77.137:2114/streams/%24persistentsubscription-ChannelsPartition-23::channelProcessor-group-parked",
  "getMessagesUri": "http://10.12.77.137:2114/subscriptions/ChannelsPartition-23/channelProcessor-group/1",
  "totalItemsProcessed": 4400,
  "countSinceLastMeasurement": 5,
  "lastProcessedEventNumber": 428327,
  "lastKnownEventNumber": 428340,
  "readBufferCount": 0,
  "liveBufferCount": 500,
  "retryBufferCount": 0,
  "totalInFlightMessages": 5,
  "connections": [
    {
      "from": "10.12.77.139:49998",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.12.77.139:49998",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.12.77.139:49998",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.12.77.139:49998",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 151,
      "countSinceLastMeasurement": 5,
      "extraStatistics": [],
      "availableSlots": 5,
      "inFlightMessages": 5
    },
    {
      "from": "10.12.77.139:49998",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.12.77.139:49998",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.12.77.139:49998",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    }
  ]
}

@mpielikis (Contributor, Author)

These are the changes I've made to 4.0.3.0 to debug this issue:
mpielikis@8e6470d

@shaan1337 (Member)

Thanks @mpielikis for all the details, it definitely helps!

@gregoryyoung (Contributor)

There are retries.

@gregoryyoung (Contributor)

One thing that looks odd in the status:

"totalInFlightMessages": 5,

But none of the connections have any in-flight messages?

@mpielikis (Contributor, Author)

No, there is one:

{
      "from": "10.12.77.139:49998",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 151,
      "countSinceLastMeasurement": 5,
      "extraStatistics": [],
      "availableSlots": 5,
      "inFlightMessages": 5
    },

@gregoryyoung (Contributor)

Are they being ack'ed?

@mpielikis (Contributor, Author)

You mean TCP ACK? I don't know. What I know is that these connections remain even after restarting that client. I can bring more details on the next "freeze"; just please tell me what I should check for, or what actions I should take, to help investigate this.

Another interesting part is this:

 "lastPushClientsTime": "2017-12-28T19:47:48.1759168+02:00",
  "lastPushMessageTime": "2017-12-28T19:47:33.1473113+02:00",
  "lastReadBatchTime": "2017-12-28T19:31:07.6428082+02:00",
  "lastPushResult": "Sent",

Messages are being pushed, but event batch reading has stopped.

@gregoryyoung (Contributor)

No, there is an explicit ACK/NAK message from the client on processing of an event (it happens automatically by default, but you can also do it manually).

Event batch reading only happens if the subscription is not caught up.

@mpielikis (Contributor, Author)

The last result is [Sent]. Does that count as an ACK?

mpielikis@8e6470d#diff-09aba1e9e59172134f684c6055cdd727L181

@mpielikis (Contributor, Author)

If you mean on the client side, then yes, there is an explicit s.Acknowledge(e); call in the persistent subscription handler.

@gregoryyoung (Contributor)

The reason I ask is that there are outstanding messages on the subscription. Is it possible that an exception etc. is being thrown in the client and the messages are not being ack'ed/nak'ed because of it? I am guessing that it's not sending because the buffer size is 5 and it's full:

{
  "from": "10.12.77.139:49998",
  "username": "admin",
  "averageItemsPerSecond": 0.0,
  "totalItemsProcessed": 151,
  "countSinceLastMeasurement": 5,
  "extraStatistics": [],
  "availableSlots": 5,
  "inFlightMessages": 5
},

If a message is not ack'ed/nak'ed, the buffer will fill up and no more messages will be sent.

On a side note, you really don't want a buffer size of 5 unless these messages take around 1 second each to process. A more likely value would be two to three orders of magnitude larger (500-5000).
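
For context, the buffer Greg refers to is the bufferSize argument passed to ConnectToPersistentSubscription. A rough sketch of a connect call with a larger buffer and explicit acknowledgement, assuming the same EventStore.ClientAPI client as the other snippets in this thread (names are placeholders, not from this issue):

// Sketch only; eventAppeared is a handler like the one shown in the next comment.
var subscription = conn.ConnectToPersistentSubscription(
    "ChannelsPartition-0",
    "channelProcessor-group",
    eventAppeared,
    subscriptionDropped: (sub, reason, ex) => Console.WriteLine($"Dropped: {reason} {ex}"),
    userCredentials: new UserCredentials("admin", "changeit"),
    bufferSize: 500,   // max unacknowledged messages the server keeps in flight for this client
    autoAck: false);   // with autoAck disabled the handler must call Acknowledge/Fail itself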

@mpielikis (Contributor, Author)

The handler on the client looks like this:

Action<EventStorePersistentSubscriptionBase, ResolvedEvent> eventAppeared = (s, e) =>
{
    try
    {
        // hand the event to the application handler
        handler(s, e);

        try
        {
            // acknowledge only after the handler has accepted the event
            s.Acknowledge(e);
        }
        catch (Exception exc)
        {
            log.Error("Missing ACK event because of " + exc.ToString());

            Environment.Exit(100);
        }
    }
    catch (Exception exc)
    {
        // the handler threw: NAK the event and park it
        log.Error(exc);

        try
        {
            s.Fail(e, PersistentSubscriptionNakEventAction.Park, exc.Message);
        }
        catch (Exception excf)
        {
            log.Error("Missing Fail event because of " + excf.ToString());

            Environment.Exit(100);
        }
    }
};

Even if an exception happened on a client, it shouldn't block the subscription. Another interesting fact is that it is not one but many subscriptions that freeze at the same time, with different clients from different machines.

The handlers are really lightweight and just push messages onto an in-memory queue.
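
An illustrative sketch of the pattern described above: an eventAppeared handler that only enqueues the event onto an in-memory queue while a separate consumer does the real work. The queue, its bound, and Process(...) are assumptions for illustration, not code from this project.

// Sketch only; requires System.Collections.Concurrent and System.Threading.Tasks.
var queue = new BlockingCollection<ResolvedEvent>(boundedCapacity: 10000);

Action<EventStorePersistentSubscriptionBase, ResolvedEvent> lightweightHandler = (s, e) =>
{
    queue.Add(e);       // cheap: no real work happens on the subscription callback
    s.Acknowledge(e);   // ack once the event is safely queued
};

// A separate consumer drains the queue and does the actual processing.
Task.Run(() =>
{
    foreach (var evt in queue.GetConsumingEnumerable())
        Process(evt);   // Process(...) stands in for the application's real work
});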

@shaan1337 (Member)

Hello @mpielikis

Can you please send us 4-5 snapshots (at intervals of around 10-15 seconds) of the query you ran when there is a freeze:
curl -i "http://10.12.77.137:2114/subscriptions/ChannelsPartition-23/channelProcessor-group/info" -u ***:***> out.txt

It'll help us know which of the values are changing.

Thanks

@shaan1337 (Member)

Hi @mpielikis

I've gone through a review of the persistent subscription code.

Based on the data you provided, it looks like it can be narrowed down to the following 2 possibilities:

  1. Due to this issue, multiple subscriptions can exist on the same connection. If, in your client, you call ConnectToPersistentSubscription from the same IEventStoreConnection object multiple times, multiple working subscriptions can exist simultaneously using the same TCP socket. They can each receive events independently although they use the same TCP socket (I've tested this). So if you're using the DispatchToSingle strategy, only one among those multiple connections will receive the events. Thus, it could be possible that the events are being pushed to one of those connections and ACKed successfully on the client side, but not handled.

  2. NotifyLiveSubscriptionMessage is not being called for certain subscriptions when a live message comes in from here (maybe they are no longer present in _subscriptionTopics for some reason; that would explain why multiple subscriptions freeze at the same time).

For no. 1, can you please verify whether it's possible that your client code is calling ConnectToPersistentSubscription multiple times? Maybe it happens under some special circumstance (on reconnection, for example). That "special circumstance" would need to happen across multiple machines at the same time (because many of your subscriptions freeze at the same time), so it could be related to the network.
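
A minimal sketch of the kind of guard that check implies (everything here is hypothetical, not code from this project): track the live subscription per stream/group and stop the previous one before subscribing again, so a reconnect path cannot stack several subscriptions onto the same IEventStoreConnection.

// Hypothetical guard, sketch only; requires System.Collections.Generic.
private readonly object _gate = new object();
private readonly Dictionary<string, EventStorePersistentSubscriptionBase> _active =
    new Dictionary<string, EventStorePersistentSubscriptionBase>();

void ConnectOnce(IEventStoreConnection conn, string stream, string group,
                 Action<EventStorePersistentSubscriptionBase, ResolvedEvent> eventAppeared)
{
    var key = stream + "::" + group;
    lock (_gate)
    {
        if (_active.TryGetValue(key, out var previous))
            previous.Stop(TimeSpan.FromSeconds(5));   // drop the old subscription before re-subscribing

        _active[key] = conn.ConnectToPersistentSubscription(stream, group, eventAppeared, autoAck: false);
    }
}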

If no. 1 isn't fruitful, then for no. 2, can you please add a LastNotifyLiveMessageTime to your fork, in the NotifyLiveSubscriptionMessage function? The stats snapshots described in my previous comment would also be helpful in this case.

Thanks

@shaan1337 (Member)

Hi @mpielikis

Do you have any feedback regarding the above, please?

Thanks

@mpielikis (Contributor, Author)

Hi @shaan1337

Yes. I was just waiting for this freeze to appear, and now I've got it.

I made four snapshots of 32 subscriptions: the first when there was a freeze, the second when the clients were closed, the third when the clients were up again, and the fourth after an ES restart. Netstat -na output from all machines is also included. IP addresses are masked.

For example, here are the states of one subscription:

On Freeze

Content-Length: 5313
Content-Type: application/json; charset=utf-8
Server: Microsoft-HTTPAPI/2.0
Access-Control-Allow-Methods: GET, GET, OPTIONS
Access-Control-Allow-Headers: Content-Type, X-Requested-With, X-Forwarded-Host, X-Forwarded-Prefix, X-PINGOTHER, Authorization, ES-LongPoll, ES-ExpectedVersion, ES-EventId, ES-EventType, ES-RequiresMaster, ES-HardDelete, ES-ResolveLinkTos
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Location, ES-Position, ES-CurrentVersion
Date: Mon, 22 Jan 2018 10:09:57 GMT

{
  "lastPushClientsTime": "2018-01-22T12:09:57.3597618+02:00",
  "lastPushMessageTime": "2018-01-20T07:07:01.084668+02:00",
  "lastReadBatchTime": "2018-01-20T07:07:01.084668+02:00",
  "lastNotifyLiveMessageTime": "2018-01-20T07:07:01.084668+02:00",
  "lastPushResult": "Sent",
  "links": [
    {
      "href": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-5/channelProcessor-group/info",
      "rel": "detail"
    },
    {
      "href": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-5/channelProcessor-group/replayParked",
      "rel": "replayParked"
    }
  ],
  "config": {
    "resolveLinktos": false,
    "startFrom": -1,
    "messageTimeoutMilliseconds": 30000,
    "extraStatistics": false,
    "maxRetryCount": 10,
    "liveBufferSize": 500,
    "bufferSize": 500,
    "readBatchSize": 20,
    "preferRoundRobin": false,
    "checkPointAfterMilliseconds": 2000,
    "minCheckPointCount": 10,
    "maxCheckPointCount": 1000,
    "maxSubscriberCount": 0,
    "namedConsumerStrategy": "DispatchToSingle"
  },
  "eventStreamId": "ChannelsPartition-5",
  "groupName": "channelProcessor-group",
  "status": "Live",
  "averageItemsPerSecond": 0.0,
  "parkedMessageUri": "http://10.98.40.137:2114/streams/%24persistentsubscription-ChannelsPartition-5::channelProcessor-group-parked",
  "getMessagesUri": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-5/channelProcessor-group/1",
  "totalItemsProcessed": 2959,
  "countSinceLastMeasurement": 0,
  "lastProcessedEventNumber": 443201,
  "lastKnownEventNumber": 443208,
  "readBufferCount": 0,
  "liveBufferCount": 500,
  "retryBufferCount": 0,
  "totalInFlightMessages": 0,
  "connections": [
    {
      "from": "10.98.40.137:62433",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:61476",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:62437",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:62441",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:62453",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:18699",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:25578",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:25566",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:25583",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:29078",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:14764",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:16008",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:30779",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    }
  ]
}

On Clients Closed (12 non-existent connections according to netstat)

HTTP/1.1 200 OK
Content-Length: 5037
Content-Type: application/json; charset=utf-8
Server: Microsoft-HTTPAPI/2.0
Access-Control-Allow-Methods: GET, GET, OPTIONS
Access-Control-Allow-Headers: Content-Type, X-Requested-With, X-Forwarded-Host, X-Forwarded-Prefix, X-PINGOTHER, Authorization, ES-LongPoll, ES-ExpectedVersion, ES-EventId, ES-EventType, ES-RequiresMaster, ES-HardDelete, ES-ResolveLinkTos
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Location, ES-Position, ES-CurrentVersion
Date: Mon, 22 Jan 2018 10:12:16 GMT

{
  "lastPushClientsTime": "2018-01-22T12:12:17.3530781+02:00",
  "lastPushMessageTime": "2018-01-20T07:07:01.084668+02:00",
  "lastReadBatchTime": "2018-01-20T07:07:01.084668+02:00",
  "lastNotifyLiveMessageTime": "2018-01-20T07:07:01.084668+02:00",
  "lastPushResult": "Sent",
  "links": [
    {
      "href": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-5/channelProcessor-group/info",
      "rel": "detail"
    },
    {
      "href": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-5/channelProcessor-group/replayParked",
      "rel": "replayParked"
    }
  ],
  "config": {
    "resolveLinktos": false,
    "startFrom": -1,
    "messageTimeoutMilliseconds": 30000,
    "extraStatistics": false,
    "maxRetryCount": 10,
    "liveBufferSize": 500,
    "bufferSize": 500,
    "readBatchSize": 20,
    "preferRoundRobin": false,
    "checkPointAfterMilliseconds": 2000,
    "minCheckPointCount": 10,
    "maxCheckPointCount": 1000,
    "maxSubscriberCount": 0,
    "namedConsumerStrategy": "DispatchToSingle"
  },
  "eventStreamId": "ChannelsPartition-5",
  "groupName": "channelProcessor-group",
  "status": "Live",
  "averageItemsPerSecond": 0.0,
  "parkedMessageUri": "http://10.98.40.137:2114/streams/%24persistentsubscription-ChannelsPartition-5::channelProcessor-group-parked",
  "getMessagesUri": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-5/channelProcessor-group/1",
  "totalItemsProcessed": 2959,
  "countSinceLastMeasurement": 0,
  "lastProcessedEventNumber": 443201,
  "lastKnownEventNumber": 443208,
  "readBufferCount": 0,
  "liveBufferCount": 500,
  "retryBufferCount": 0,
  "totalInFlightMessages": 0,
  "connections": [
    {
      "from": "10.98.40.137:62433",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:61476",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:62437",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:62441",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:62453",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:18699",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:25578",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:25566",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:25583",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:29078",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:14764",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:16008",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    }
  ]
}

On Clients Up (12 non-existent connections and 1 existing connection, 10.98.40.137:40570, according to netstat)

HTTP/1.1 200 OK
Content-Length: 5313
Content-Type: application/json; charset=utf-8
Server: Microsoft-HTTPAPI/2.0
Access-Control-Allow-Methods: GET, GET, OPTIONS
Access-Control-Allow-Headers: Content-Type, X-Requested-With, X-Forwarded-Host, X-Forwarded-Prefix, X-PINGOTHER, Authorization, ES-LongPoll, ES-ExpectedVersion, ES-EventId, ES-EventType, ES-RequiresMaster, ES-HardDelete, ES-ResolveLinkTos
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Location, ES-Position, ES-CurrentVersion
Date: Mon, 22 Jan 2018 10:28:25 GMT

{
  "lastPushClientsTime": "2018-01-22T12:28:25.0661943+02:00",
  "lastPushMessageTime": "2018-01-20T07:07:01.084668+02:00",
  "lastReadBatchTime": "2018-01-20T07:07:01.084668+02:00",
  "lastNotifyLiveMessageTime": "2018-01-20T07:07:01.084668+02:00",
  "lastPushResult": "Sent",
  "links": [
    {
      "href": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-5/channelProcessor-group/info",
      "rel": "detail"
    },
    {
      "href": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-5/channelProcessor-group/replayParked",
      "rel": "replayParked"
    }
  ],
  "config": {
    "resolveLinktos": false,
    "startFrom": -1,
    "messageTimeoutMilliseconds": 30000,
    "extraStatistics": false,
    "maxRetryCount": 10,
    "liveBufferSize": 500,
    "bufferSize": 500,
    "readBatchSize": 20,
    "preferRoundRobin": false,
    "checkPointAfterMilliseconds": 2000,
    "minCheckPointCount": 10,
    "maxCheckPointCount": 1000,
    "maxSubscriberCount": 0,
    "namedConsumerStrategy": "DispatchToSingle"
  },
  "eventStreamId": "ChannelsPartition-5",
  "groupName": "channelProcessor-group",
  "status": "Live",
  "averageItemsPerSecond": 0.0,
  "parkedMessageUri": "http://10.98.40.137:2114/streams/%24persistentsubscription-ChannelsPartition-5::channelProcessor-group-parked",
  "getMessagesUri": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-5/channelProcessor-group/1",
  "totalItemsProcessed": 2959,
  "countSinceLastMeasurement": 0,
  "lastProcessedEventNumber": 443201,
  "lastKnownEventNumber": 443208,
  "readBufferCount": 0,
  "liveBufferCount": 500,
  "retryBufferCount": 0,
  "totalInFlightMessages": 0,
  "connections": [
    {
      "from": "10.98.40.137:62433",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:61476",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:62437",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:62441",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:62453",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:18699",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:25578",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:25566",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:25583",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:29078",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:14764",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:16008",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.137:40570",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    }
  ]
}

On ES restart (one healthy connection 10.98.40.137:42525)

HTTP/1.1 200 OK
Content-Length: 1985
Content-Type: application/json; charset=utf-8
Server: Microsoft-HTTPAPI/2.0
Access-Control-Allow-Methods: GET, GET, OPTIONS
Access-Control-Allow-Headers: Content-Type, X-Requested-With, X-Forwarded-Host, X-Forwarded-Prefix, X-PINGOTHER, Authorization, ES-LongPoll, ES-ExpectedVersion, ES-EventId, ES-EventType, ES-RequiresMaster, ES-HardDelete, ES-ResolveLinkTos
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Location, ES-Position, ES-CurrentVersion
Date: Mon, 22 Jan 2018 11:29:52 GMT

{
  "lastPushClientsTime": "2018-01-22T13:29:52.9858258+02:00",
  "lastPushMessageTime": "2018-01-22T13:07:05.0997138+02:00",
  "lastReadBatchTime": "2018-01-22T13:07:06.8028796+02:00",
  "lastNotifyLiveMessageTime": "0001-01-01T00:00:00",
  "lastPushResult": "Sent",
  "links": [
    {
      "href": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-5/channelProcessor-group/info",
      "rel": "detail"
    },
    {
      "href": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-5/channelProcessor-group/replayParked",
      "rel": "replayParked"
    }
  ],
  "config": {
    "resolveLinktos": false,
    "startFrom": -1,
    "messageTimeoutMilliseconds": 30000,
    "extraStatistics": false,
    "maxRetryCount": 10,
    "liveBufferSize": 500,
    "bufferSize": 500,
    "readBatchSize": 20,
    "preferRoundRobin": false,
    "checkPointAfterMilliseconds": 2000,
    "minCheckPointCount": 10,
    "maxCheckPointCount": 1000,
    "maxSubscriberCount": 0,
    "namedConsumerStrategy": "DispatchToSingle"
  },
  "eventStreamId": "ChannelsPartition-5",
  "groupName": "channelProcessor-group",
  "status": "Live",
  "averageItemsPerSecond": 0.0,
  "parkedMessageUri": "http://10.98.40.137:2114/streams/%24persistentsubscription-ChannelsPartition-5::channelProcessor-group-parked",
  "getMessagesUri": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-5/channelProcessor-group/1",
  "totalItemsProcessed": 7,
  "countSinceLastMeasurement": 0,
  "lastProcessedEventNumber": 443201,
  "lastKnownEventNumber": 443208,
  "readBufferCount": 0,
  "liveBufferCount": 0,
  "retryBufferCount": 0,
  "totalInFlightMessages": 0,
  "connections": [
    {
      "from": "10.98.40.137:42525",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 7,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    }
  ]
}

I can send the full snapshot zip file.
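
For anyone hitting the same thing: the stuck state is visible from the outside by polling the info endpoint shown in the links above and checking whether lastProcessedEventNumber keeps up with lastKnownEventNumber. A minimal C# watchdog sketch; the URL, credentials and polling interval below are placeholders taken from this report, not anything prescribed by EventStore:

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

class SubscriptionWatch
{
    // Endpoint copied from the "detail" link in the stats above; adjust host/stream/group as needed.
    const string InfoUrl =
        "http://10.98.40.137:2114/subscriptions/ChannelsPartition-5/channelProcessor-group/info";

    static async Task Main()
    {
        using var http = new HttpClient();
        // The HTTP API needs credentials; "admin:changeit" is only a placeholder.
        var token = Convert.ToBase64String(Encoding.ASCII.GetBytes("admin:changeit"));
        http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Basic", token);

        long? lastProcessed = null;
        while (true)
        {
            using var doc = JsonDocument.Parse(await http.GetStringAsync(InfoUrl));
            var root = doc.RootElement;
            long processed = root.GetProperty("lastProcessedEventNumber").GetInt64();
            long known = root.GetProperty("lastKnownEventNumber").GetInt64();

            // If the subscription is behind and makes no progress between polls,
            // it is likely in the "quietly stopped" state described in this issue.
            if (lastProcessed.HasValue && processed == lastProcessed && known > processed)
                Console.WriteLine($"Possible frozen subscription: processed={processed}, known={known}");

            lastProcessed = processed;
            await Task.Delay(TimeSpan.FromSeconds(30));
        }
    }
}

Alerting on this condition would at least remove the need to notice the freeze by hand before restarting the master.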

@gregoryyoung
Contributor

What do the ES logs say about connections?

@mpielikis
Contributor Author

mpielikis commented Jan 23, 2018

Three grepped connection log entries from a 5-day log:

10.98.40.137:61476

[PID:03108:024 2018.01.20 05:51:04.649 INFO  TcpService          ] External TCP connection accepted: [Normal, 10.98.40.137:61476, L10.98.40.137:1112, {b518403e-d2c0-4122-8a47-f18e9db9287d}].
[PID:03108:033 2018.01.20 05:51:04.649 INFO  TcpConnection       ] ES TcpConnection closed [05:51:04.649: N10.98.40.137:61476, L10.98.40.137:1112, {b518403e-d2c0-4122-8a47-f18e9db9287d}]:Received bytes: 978, Sent bytes: 0
[PID:03108:033 2018.01.20 05:51:04.649 INFO  TcpConnection       ] ES TcpConnection closed [05:51:04.649: N10.98.40.137:61476, L10.98.40.137:1112, {b518403e-d2c0-4122-8a47-f18e9db9287d}]:Send calls: 0, callbacks: 0
[PID:03108:033 2018.01.20 05:51:04.649 INFO  TcpConnection       ] ES TcpConnection closed [05:51:04.649: N10.98.40.137:61476, L10.98.40.137:1112, {b518403e-d2c0-4122-8a47-f18e9db9287d}]:Receive calls: 2, callbacks: 2
[PID:03108:033 2018.01.20 05:51:04.649 INFO  TcpConnection       ] ES TcpConnection closed [05:51:04.649: N10.98.40.137:61476, L10.98.40.137:1112, {b518403e-d2c0-4122-8a47-f18e9db9287d}]:Close reason: [Success] Socket closed
[PID:03108:024 2018.01.20 05:51:04.649 INFO  TcpConnectionManager] Connection 'external-normal' [10.98.40.137:61476, {b518403e-d2c0-4122-8a47-f18e9db9287d}] closed: Success.
[PID:03108:057 2018.01.20 05:51:04.977 DEBUG PersistentSubscripti] Lost connection from 10.98.40.137:61476

10.98.40.137:62433

[PID:03108:013 2018.01.20 06:31:37.256 INFO  TcpService          ] External TCP connection accepted: [Normal, 10.98.40.137:62433, L10.98.40.137:1112, {7f4f45e9-c448-4cac-8ae7-465e8dd7729f}].
[PID:03108:053 2018.01.20 06:31:37.272 INFO  TcpConnection       ] ES TcpConnection closed [06:31:37.272: N10.98.40.137:62433, L10.98.40.137:1112, {7f4f45e9-c448-4cac-8ae7-465e8dd7729f}]:Received bytes: 978, Sent bytes: 0
[PID:03108:053 2018.01.20 06:31:37.272 INFO  TcpConnection       ] ES TcpConnection closed [06:31:37.272: N10.98.40.137:62433, L10.98.40.137:1112, {7f4f45e9-c448-4cac-8ae7-465e8dd7729f}]:Send calls: 0, callbacks: 0
[PID:03108:013 2018.01.20 06:31:37.272 INFO  TcpConnectionManager] Connection 'external-normal' [10.98.40.137:62433, {7f4f45e9-c448-4cac-8ae7-465e8dd7729f}] closed: Success.
[PID:03108:053 2018.01.20 06:31:37.272 INFO  TcpConnection       ] ES TcpConnection closed [06:31:37.272: N10.98.40.137:62433, L10.98.40.137:1112, {7f4f45e9-c448-4cac-8ae7-465e8dd7729f}]:Receive calls: 2, callbacks: 2
[PID:03108:053 2018.01.20 06:31:37.272 INFO  TcpConnection       ] ES TcpConnection closed [06:31:37.272: N10.98.40.137:62433, L10.98.40.137:1112, {7f4f45e9-c448-4cac-8ae7-465e8dd7729f}]:Close reason: [Success] Socket closed
[PID:03108:022 2018.01.20 06:31:38.131 DEBUG PersistentSubscripti] Lost connection from 10.98.40.137:62433

10.98.40.137:62441

[PID:03108:053 2018.01.20 06:31:38.131 INFO  TcpConnection       ] ES TcpConnection closed [06:31:38.131: N10.98.40.137:62441, L10.98.40.137:1112, {5781d10b-f697-433e-98e1-f8b7764471bb}]:Received bytes: 978, Sent bytes: 0
[PID:03108:053 2018.01.20 06:31:38.131 INFO  TcpConnection       ] ES TcpConnection closed [06:31:38.131: N10.98.40.137:62441, L10.98.40.137:1112, {5781d10b-f697-433e-98e1-f8b7764471bb}]:Send calls: 0, callbacks: 0
[PID:03108:053 2018.01.20 06:31:38.131 INFO  TcpConnection       ] ES TcpConnection closed [06:31:38.131: N10.98.40.137:62441, L10.98.40.137:1112, {5781d10b-f697-433e-98e1-f8b7764471bb}]:Receive calls: 2, callbacks: 2
[PID:03108:053 2018.01.20 06:31:38.131 INFO  TcpConnection       ] ES TcpConnection closed [06:31:38.131: N10.98.40.137:62441, L10.98.40.137:1112, {5781d10b-f697-433e-98e1-f8b7764471bb}]:Close reason: [Success] Socket closed
[PID:03108:032 2018.01.20 06:31:38.131 INFO  TcpService          ] External TCP connection accepted: [Normal, 10.98.40.137:62441, L10.98.40.137:1112, {5781d10b-f697-433e-98e1-f8b7764471bb}].
[PID:03108:032 2018.01.20 06:31:38.163 INFO  TcpConnectionManager] Connection 'external-normal' [10.98.40.137:62441, {5781d10b-f697-433e-98e1-f8b7764471bb}] closed: Success.
[PID:03108:022 2018.01.20 06:31:38.272 DEBUG PersistentSubscripti] Lost connection from 10.98.40.137:62441

@mpielikis
Contributor Author

A new freeze here.

Four states from one of the stuck persistent subscriptions:

On Freeze

2 connections: 1 lost and 1 active

HTTP/1.1 200 OK
Content-Length: 2288
Content-Type: application/json; charset=utf-8
Server: Microsoft-HTTPAPI/2.0
Access-Control-Allow-Methods: GET, GET, OPTIONS
Access-Control-Allow-Headers: Content-Type, X-Requested-With, X-Forwarded-Host, X-Forwarded-Prefix, X-PINGOTHER, Authorization, ES-LongPoll, ES-ExpectedVersion, ES-EventId, ES-EventType, ES-RequiresMaster, ES-HardDelete, ES-ResolveLinkTos
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Location, ES-Position, ES-CurrentVersion
Date: Wed, 24 Jan 2018 07:30:41 GMT

{
  "lastPushClientsTime": "2018-01-24T09:30:41.376861+02:00",
  "lastPushMessageTime": "2018-01-24T09:30:21.2554694+02:00",
  "lastReadBatchTime": "2018-01-23T14:20:55.2506459+02:00",
  "lastNotifyLiveMessageTime": "2018-01-24T09:29:48.3890323+02:00",
  "lastPushResult": "Sent",
  "links": [
    {
      "href": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-15/channelProcessor-group/info",
      "rel": "detail"
    },
    {
      "href": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-15/channelProcessor-group/replayParked",
      "rel": "replayParked"
    }
  ],
  "config": {
    "resolveLinktos": false,
    "startFrom": -1,
    "messageTimeoutMilliseconds": 30000,
    "extraStatistics": false,
    "maxRetryCount": 10,
    "liveBufferSize": 500,
    "bufferSize": 500,
    "readBatchSize": 20,
    "preferRoundRobin": false,
    "checkPointAfterMilliseconds": 2000,
    "minCheckPointCount": 10,
    "maxCheckPointCount": 1000,
    "maxSubscriberCount": 0,
    "namedConsumerStrategy": "DispatchToSingle"
  },
  "eventStreamId": "ChannelsPartition-15",
  "groupName": "channelProcessor-group",
  "status": "Live",
  "averageItemsPerSecond": 0.0,
  "parkedMessageUri": "http://10.98.40.137:2114/streams/%24persistentsubscription-ChannelsPartition-15::channelProcessor-group-parked",
  "getMessagesUri": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-15/channelProcessor-group/1",
  "totalItemsProcessed": 22996,
  "countSinceLastMeasurement": 0,
  "lastProcessedEventNumber": 473829,
  "lastKnownEventNumber": 473834,
  "readBufferCount": 0,
  "liveBufferCount": 500,
  "retryBufferCount": 0,
  "totalInFlightMessages": 5,
  "connections": [
    {
      "from": "10.98.40.138:37893",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.138:50342",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 22360,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 2,
      "inFlightMessages": 5
    }
  ]
}

netstat on 10.98.40.138

netstat -na | find "37893"
  TCP    10.98.40.138:37893      10.98.40.137:1112       ESTABLISHED
netstat -na | find "50342"
  no result

netstat on 10.98.40.137

netstat -na | find "37893"
  TCP    10.98.40.137:1112       10.98.40.138:37893      ESTABLISHED
netstat -na | find "50342"
  no result

On Client Close

1 lost connection

HTTP/1.1 200 OK
Content-Length: 2012
Content-Type: application/json; charset=utf-8
Server: Microsoft-HTTPAPI/2.0
Access-Control-Allow-Methods: GET, GET, OPTIONS
Access-Control-Allow-Headers: Content-Type, X-Requested-With, X-Forwarded-Host, X-Forwarded-Prefix, X-PINGOTHER, Authorization, ES-LongPoll, ES-ExpectedVersion, ES-EventId, ES-EventType, ES-RequiresMaster, ES-HardDelete, ES-ResolveLinkTos
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Location, ES-Position, ES-CurrentVersion
Date: Wed, 24 Jan 2018 07:33:57 GMT

{
  "lastPushClientsTime": "2018-01-24T09:33:56.862183+02:00",
  "lastPushMessageTime": "2018-01-24T09:33:51.8164518+02:00",
  "lastReadBatchTime": "2018-01-23T14:20:55.2506459+02:00",
  "lastNotifyLiveMessageTime": "2018-01-24T09:33:05.2623793+02:00",
  "lastPushResult": "Sent",
  "links": [
    {
      "href": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-15/channelProcessor-group/info",
      "rel": "detail"
    },
    {
      "href": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-15/channelProcessor-group/replayParked",
      "rel": "replayParked"
    }
  ],
  "config": {
    "resolveLinktos": false,
    "startFrom": -1,
    "messageTimeoutMilliseconds": 30000,
    "extraStatistics": false,
    "maxRetryCount": 10,
    "liveBufferSize": 500,
    "bufferSize": 500,
    "readBatchSize": 20,
    "preferRoundRobin": false,
    "checkPointAfterMilliseconds": 2000,
    "minCheckPointCount": 10,
    "maxCheckPointCount": 1000,
    "maxSubscriberCount": 0,
    "namedConsumerStrategy": "DispatchToSingle"
  },
  "eventStreamId": "ChannelsPartition-15",
  "groupName": "channelProcessor-group",
  "status": "Live",
  "averageItemsPerSecond": 0.0,
  "parkedMessageUri": "http://10.98.40.137:2114/streams/%24persistentsubscription-ChannelsPartition-15::channelProcessor-group-parked",
  "getMessagesUri": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-15/channelProcessor-group/1",
  "totalItemsProcessed": 23030,
  "countSinceLastMeasurement": 0,
  "lastProcessedEventNumber": 473829,
  "lastKnownEventNumber": 473837,
  "readBufferCount": 0,
  "liveBufferCount": 500,
  "retryBufferCount": 0,
  "totalInFlightMessages": 5,
  "connections": [
    {
      "from": "10.98.40.138:50342",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 22394,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 2,
      "inFlightMessages": 5
    }
  ]
}

netstat on 10.98.40.138

netstat -na | find "50342"
  no result

On Client Up

2 connections: 1 lost and 1 active

HTTP/1.1 200 OK
Content-Length: 2288
Content-Type: application/json; charset=utf-8
Server: Microsoft-HTTPAPI/2.0
Access-Control-Allow-Methods: GET, GET, OPTIONS
Access-Control-Allow-Headers: Content-Type, X-Requested-With, X-Forwarded-Host, X-Forwarded-Prefix, X-PINGOTHER, Authorization, ES-LongPoll, ES-ExpectedVersion, ES-EventId, ES-EventType, ES-RequiresMaster, ES-HardDelete, ES-ResolveLinkTos
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Location, ES-Position, ES-CurrentVersion
Date: Wed, 24 Jan 2018 07:58:01 GMT

{
  "lastPushClientsTime": "2018-01-24T09:58:01.609859+02:00",
  "lastPushMessageTime": "2018-01-24T09:57:51.5254373+02:00",
  "lastReadBatchTime": "2018-01-23T14:20:55.2506459+02:00",
  "lastNotifyLiveMessageTime": "2018-01-24T09:57:49.1240758+02:00",
  "lastPushResult": "Sent",
  "links": [
    {
      "href": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-15/channelProcessor-group/info",
      "rel": "detail"
    },
    {
      "href": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-15/channelProcessor-group/replayParked",
      "rel": "replayParked"
    }
  ],
  "config": {
    "resolveLinktos": false,
    "startFrom": -1,
    "messageTimeoutMilliseconds": 30000,
    "extraStatistics": false,
    "maxRetryCount": 10,
    "liveBufferSize": 500,
    "bufferSize": 500,
    "readBatchSize": 20,
    "preferRoundRobin": false,
    "checkPointAfterMilliseconds": 2000,
    "minCheckPointCount": 10,
    "maxCheckPointCount": 1000,
    "maxSubscriberCount": 0,
    "namedConsumerStrategy": "DispatchToSingle"
  },
  "eventStreamId": "ChannelsPartition-15",
  "groupName": "channelProcessor-group",
  "status": "Live",
  "averageItemsPerSecond": 0.0,
  "parkedMessageUri": "http://10.98.40.137:2114/streams/%24persistentsubscription-ChannelsPartition-15::channelProcessor-group-parked",
  "getMessagesUri": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-15/channelProcessor-group/1",
  "totalItemsProcessed": 23299,
  "countSinceLastMeasurement": 0,
  "lastProcessedEventNumber": 473849,
  "lastKnownEventNumber": 473862,
  "readBufferCount": 0,
  "liveBufferCount": 500,
  "retryBufferCount": 0,
  "totalInFlightMessages": 6,
  "connections": [
    {
      "from": "10.98.40.138:29134",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 0,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 10,
      "inFlightMessages": 0
    },
    {
      "from": "10.98.40.138:50342",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 22663,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 1,
      "inFlightMessages": 6
    }
  ]
}

netstat on 10.98.40.138

netstat -na | find "29134"
  TCP    10.98.40.138:29134      10.98.40.137:1112       ESTABLISHED
netstat -na | find "50342"
  no result

After EventStore restart

1 live connection

HTTP/1.1 200 OK
Content-Length: 2006
Content-Type: application/json; charset=utf-8
Server: Microsoft-HTTPAPI/2.0
Access-Control-Allow-Methods: GET, GET, OPTIONS
Access-Control-Allow-Headers: Content-Type, X-Requested-With, X-Forwarded-Host, X-Forwarded-Prefix, X-PINGOTHER, Authorization, ES-LongPoll, ES-ExpectedVersion, ES-EventId, ES-EventType, ES-RequiresMaster, ES-HardDelete, ES-ResolveLinkTos
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Location, ES-Position, ES-CurrentVersion
Date: Wed, 24 Jan 2018 08:33:02 GMT

{
  "lastPushClientsTime": "2018-01-24T10:33:02.9302029+02:00",
  "lastPushMessageTime": "2018-01-24T10:30:54.0145478+02:00",
  "lastReadBatchTime": "2018-01-24T10:30:54.0614057+02:00",
  "lastNotifyLiveMessageTime": "2018-01-24T10:30:54.0145478+02:00",
  "lastPushResult": "Sent",
  "links": [
    {
      "href": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-15/channelProcessor-group/info",
      "rel": "detail"
    },
    {
      "href": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-15/channelProcessor-group/replayParked",
      "rel": "replayParked"
    }
  ],
  "config": {
    "resolveLinktos": false,
    "startFrom": -1,
    "messageTimeoutMilliseconds": 30000,
    "extraStatistics": false,
    "maxRetryCount": 10,
    "liveBufferSize": 500,
    "bufferSize": 500,
    "readBatchSize": 20,
    "preferRoundRobin": false,
    "checkPointAfterMilliseconds": 2000,
    "minCheckPointCount": 10,
    "maxCheckPointCount": 1000,
    "maxSubscriberCount": 0,
    "namedConsumerStrategy": "DispatchToSingle"
  },
  "eventStreamId": "ChannelsPartition-15",
  "groupName": "channelProcessor-group",
  "status": "Live",
  "averageItemsPerSecond": 0.0,
  "parkedMessageUri": "http://10.98.40.137:2114/streams/%24persistentsubscription-ChannelsPartition-15::channelProcessor-group-parked",
  "getMessagesUri": "http://10.98.40.137:2114/subscriptions/ChannelsPartition-15/channelProcessor-group/1",
  "totalItemsProcessed": 47,
  "countSinceLastMeasurement": 0,
  "lastProcessedEventNumber": 473895,
  "lastKnownEventNumber": 473895,
  "readBufferCount": 0,
  "liveBufferCount": 31,
  "retryBufferCount": 0,
  "totalInFlightMessages": 0,
  "connections": [
    {
      "from": "10.98.40.138:31607",
      "username": "admin",
      "averageItemsPerSecond": 0.0,
      "totalItemsProcessed": 47,
      "countSinceLastMeasurement": 0,
      "extraStatistics": [],
      "availableSlots": 9,
      "inFlightMessages": 0
    }
  ]
}

netstat on 10.98.40.138

netstat -na | find "31607"
  TCP    10.98.40.138:31607      10.98.40.137:1112       ESTABLISHED

@shaan1337
Member

Hello @mpielikis ,

Sorry for the delay in getting back to you, thanks for all the details.
I'm continuing the analysis based on the information you provided.

Can you please send me your log files related to the above runs to shaan [at] geteventstore.com?

Thanks

@shaan1337
Member

Hello @mpielikis

For the new freeze, it seems that the last notify live message time is stuck at: 2018-01-20T07:07:01.084668+02:00, so our second hypothesis seems to hold true (NotifyLiveSubscriptionMessage is not being called when a live message comes in for certain subscriptions)

  "lastPushClientsTime": "2018-01-22T12:09:57.3597618+02:00",
  "lastPushMessageTime": "2018-01-20T07:07:01.084668+02:00",
  "lastReadBatchTime": "2018-01-20T07:07:01.084668+02:00",
  "lastNotifyLiveMessageTime": "2018-01-20T07:07:01.084668+02:00",

This is a few minutes after some of the connections were lost: 10.98.40.137:61476, 10.98.40.137:62433, 10.98.40.137:62441 on the 20th (as you sent earlier)

Can you please also send me the logs for the 20th of January?

Thanks

@haf

haf commented May 30, 2018

I think we're experiencing exactly the same problem. It has caused massive problems for us on version 3.9.4 of ES.

@kbrownhbs

Has this issue been addressed? I've been playing with persistent subscriptions and wonder if it is ready for prime time.

@DavidBM

DavidBM commented Aug 30, 2018

Hello, any update on this issue? We want to use EventStore, but this could be a blocker.

@jen20 removed the investigate label Aug 30, 2018
@mpielikis
Contributor Author

We still struggle with this issue in production about once a week. It is the major problem we have. We hope PR #1640 will address it, as there are some similar symptoms.

@haf

haf commented Sep 1, 2018

This is likely due to the connection ending up in an invalid state; the ES connection doesn't always reconnect successfully after network splits. If you want to triage and debug this huge issue, I'd focus on introducing a spotty network with alternating partitions and a single master ES server that you bring up and down, while tracking the events fired and the TCP connection state of the ES connection implementation.

It's not the server's connection; it's the client's connection that's buggy. If you set up a continuous restart of the server, you'll worsen the problem because the client will get into invalid states and throw internal exceptions, such as ObjectDisposedException, or just silently stop sending data to the application.

And for us the issue is primarily with catch-up subscriptions, of which we have about 40 in a single process, not persistent subscriptions.

Also noteworthy is that the ES server's logs are useless for debugging this. The ES client's logs, unless their verbosity is increased, are also useless. We're using the latest, untagged (!), Docker version of the server now, and the .NET Core client.

We've also had race conditions in the client connection's authenticate call, despite using TCP; and if it's ever unauthenticated during runtime, we crash our service and let k8s restart it.
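
A minimal sketch of that fail-fast approach, assuming the classic EventStore.ClientAPI TCP client and its connection events; the connection string, log messages and exit codes are placeholders, and the available events may differ between client versions:

using System;
using EventStore.ClientAPI;

class ConnectionWatchdog
{
    static IEventStoreConnection Connect()
    {
        // Placeholder connection string; substitute your own cluster settings.
        var conn = EventStoreConnection.Create("ConnectTo=tcp://admin:changeit@10.98.40.137:1113");

        // Log state transitions so a stuck client connection is at least visible.
        conn.Connected     += (_, e) => Console.WriteLine($"Connected to {e.RemoteEndPoint}");
        conn.Disconnected  += (_, e) => Console.WriteLine($"Disconnected from {e.RemoteEndPoint}");
        conn.Reconnecting  += (_, e) => Console.WriteLine("Reconnecting...");
        conn.ErrorOccurred += (_, e) => Console.WriteLine($"Connection error: {e.Exception}");

        // Fail fast: if authentication fails or the connection closes for good,
        // exit the process and let the orchestrator (e.g. Kubernetes) restart it.
        conn.AuthenticationFailed += (_, e) => { Console.Error.WriteLine($"Auth failed: {e.Reason}"); Environment.Exit(1); };
        conn.Closed               += (_, e) => { Console.Error.WriteLine($"Connection closed: {e.Reason}"); Environment.Exit(2); };

        conn.ConnectAsync().Wait();
        return conn;
    }
}

Crashing on Closed/AuthenticationFailed trades availability for correctness: the process restart forces a clean reconnect instead of trusting the client to recover on its own.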

We've tried to notify ES of these problems, but they say they can't reproduce, despite having been given full access to our environment and code-base, and despite us having purchased commercial support. Even when we give them access to Stackdriver logging with ALL logs, neatly ordered by timestamp in GCP (synchronised by Google's NTP servers), they still insist on us mounting a persistent volume and shipping text-file-based logs to them over e-mail.

The latest bid from them is to ask us to manually handle the TCP connection state of the ES connection by "avoiding connecting to subscriptions while it's disconnected"; however, since the TCP handling runs on another thread, it's impossible to know for sure from the caller's thread what state the TCP socket is in. Pointing out that any assumption about the socket's state from another thread is therefore a race condition is met with repeated assertions to "just do as I say and handle it manually and it will be fine".

As for the unauthenticated issues that occur every other launch, they simply say they cannot reproduce, despite full access to logs and machines.

When asked how to monitor the EventStore server with Grafana/Prometheus/Influx or another modern stack, there's no real response either (no Prometheus endpoint, no metric shipper, no real answer), so there are no metrics for ES that let us debug the problem. This is particularly funny because I remember @gregoryyoung talking at Öredev in 2009, bragging about how, when his self-proclaimed SLA had been violated, he was the one to tell the surprised customer, and that this was possible because he had metrics. And yet, he tweets about completely different things rather than fixing his database's client lib.

@shaan1337
Member

@mpielikis By any chance, are you using ResolveLinkTos = true on any of your persistent subscriptions?

@mpielikis
Contributor Author

No, we don't. The freeze seems to happen at moments that combine slow stream reading with high persistent subscription usage. We sometimes saw high peaks in the StorageReaderQueue, with several thousand items in the queue, and then the freeze happened. To lower the impact of this issue we divided the tasks between two EventStore instances with separate databases, so two disks were used for reading instead of one.
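
One way to watch for those reader queue peaks from the outside is to poll the node's /stats endpoint. A rough C# sketch; the port, the queue property names and the presence of a "length" field are assumptions based on the log lines above, and the exact stats shape may differ between versions:

using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

class QueueWatch
{
    static async Task Main()
    {
        using var http = new HttpClient();
        // Placeholder node address; /stats is served on the node's HTTP port.
        var json = await http.GetStringAsync("http://10.98.40.137:2114/stats");
        using var doc = JsonDocument.Parse(json);
        PrintReaderQueues(doc.RootElement);
    }

    // Walk the stats document and print anything that looks like a storage reader queue.
    static void PrintReaderQueues(JsonElement element)
    {
        if (element.ValueKind != JsonValueKind.Object) return;
        foreach (var prop in element.EnumerateObject())
        {
            if (prop.Name.Contains("StorageReader") &&
                prop.Value.ValueKind == JsonValueKind.Object &&
                prop.Value.TryGetProperty("length", out var len))
            {
                Console.WriteLine($"{prop.Name}: length={len}");
            }
            PrintReaderQueues(prop.Value);
        }
    }
}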

@shaan1337
Member

shaan1337 commented Nov 17, 2018

Thanks for the details @mpielikis. I saw earlier that you implemented a restart command in your fork which restarts the persistent subscription service, and I (maybe wrongly) assumed that when you called the command, your subscriptions would "unfreeze". Can you please confirm whether that's the case?

I've also done a code review again a few days ago and can deduce the following:

  1. The persistent subscription must still be present in _subscriptionByTopics since you are able to retrieve its stats through <stream>/<group>/info which calls Handle(MonitoringMessage.GetPersistentSubscriptionStats message)
  2. The persistent subscription must still be present in _subscriptionsById since from your logs provided, NotifyClockTick -> TryPushingMessagesToClients is being called on the subscriptions from WakeSubscriptions (lastPushClientsTime was being updated properly)
  3. From these logs for ChannelsPartition-5/channelProcessor-group
  "lastPushClientsTime": "2018-01-22T12:09:57.3597618+02:00",
  "lastPushMessageTime": "2018-01-20T07:07:01.084668+02:00",
  "lastReadBatchTime": "2018-01-20T07:07:01.084668+02:00",
  "lastNotifyLiveMessageTime": "2018-01-20T07:07:01.084668+02:00",

I assumed that the persistent subscription was live and that due to the freeze, NotifyLiveSubscriptionMessage was not being called since the lastNotifyLiveMessageTime timestamp was stuck at 2018-01-20T07:07. (can you please confirm that live events should have been received here?)

  4. I assume that you are still seeing events being added to your streams from the UI - so it means that StorageMessage.EventCommitted messages are properly being generated when events are written.

  5. The persistent subscription queue must be active since you can retrieve stats through <stream>/<group>/info. If the PS queue was stuck at a particular message, you should receive 408 - Request timed out messages but I don't think that's the situation in your case.

Based on the above, I think that one of the premises #1 to #5 must be false because it seems to lead to an impossible situation (or I'm missing something). Can you please confirm if #3 was true?

We could also set some additional (manually activated) logging between PersistentSubscriptionService -> Handle(StorageMessage.EventCommitted message) and NotifyLiveSubscriptionMessage(ResolvedEvent resolvedEvent) to see if live events are going through at this point.

Thanks

@shaan1337 added the subsystem/persistent subs, impact/high and detectability/hard labels Mar 13, 2019
@djowatts

We've been having very similar problems to this ourselves. We have found that editing the subscription through the UI, changing nothing, and saving it awakens the subscription again. Any movement on this since November?
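
For what it's worth, that "save without changing anything" workaround can also be scripted with the .NET client by re-applying the group's existing settings. A rough sketch, assuming EventStore.ClientAPI's UpdatePersistentSubscriptionAsync and settings builder (the connection string and credentials are placeholders, the values are copied from the config sections earlier in this thread, and builder method names may differ between client versions):

using System;
using System.Threading.Tasks;
using EventStore.ClientAPI;
using EventStore.ClientAPI.SystemData;

class ResaveSubscription
{
    static async Task Main()
    {
        // Placeholder connection string and credentials.
        var conn = EventStoreConnection.Create("ConnectTo=tcp://admin:changeit@10.98.40.137:1113");
        await conn.ConnectAsync();

        // Re-apply the same settings as the existing group (taken from the "config"
        // section of the stats above) to nudge the subscription awake.
        PersistentSubscriptionSettings settings = PersistentSubscriptionSettings.Create()
            .DoNotResolveLinkTos()
            .StartFromCurrent()
            .WithMessageTimeoutOf(TimeSpan.FromSeconds(30))
            .WithMaxRetriesOf(10)
            .WithLiveBufferSizeOf(500)
            .WithReadBatchOf(20)
            .CheckPointAfter(TimeSpan.FromSeconds(2))
            .MinimumCheckPointCountOf(10)
            .MaximumCheckPointCountOf(1000);

        await conn.UpdatePersistentSubscriptionAsync(
            "ChannelsPartition-15", "channelProcessor-group",
            settings, new UserCredentials("admin", "changeit"));
    }
}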

@condron
Contributor

condron commented Jun 6, 2019

This related fix may address some of the problems here: #1936

@alexeyzimarev
Member

We got the same issue in production today. We restarted the master node and it resolved the issue.

@hayley-jean
Member

Thank you for letting us know @alexeyzimarev
What version of Event Store did you see this on?

@alexeyzimarev
Member

@hayley-jean it's 5.0.2 on CentOS
