Persistent subscriptions are quietly stopping #1392
Comments
Thank you for submitting this, @mpielikis. Would you be able to provide the logs for the master node at the time? If you don't want to share them here, you can email them to hayley at geteventstore.com
Any follow-up here?
Hello @mpielikis, can you please let us know if you're still facing the same issue in the latest version of EventStore (if you've upgraded)? Do you remember whether all the subscriptions that were stuck were subscribing from the same or different streams? Thanks
Hello @shaan1337
Yes, it still happens. Additional debug-level logging of persistent subscriptions could bring more details, IMO. The current version is 4.0.3.0.
Different subscriptions for different streams. I could not find any correlations. Every time, the frozen subscriptions were different, one or many. We use the DispatchToSingle type of subscription the most. Three clients connect to many DispatchToSingle subscriptions (37) and to a couple of RoundRobin ones.
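For readers landing here later: a minimal sketch of how such a group is typically created with the EventStore.ClientAPI .NET client, assuming the DispatchToSingle consumer strategy mentioned above. The stream/group names, connection string, and credentials are placeholders (not from this issue), and the builder method names are as I recall them from the client:

```csharp
using System;
using System.Threading.Tasks;
using EventStore.ClientAPI;
using EventStore.ClientAPI.SystemData;

class SubscriptionSetup
{
    // Sketch only: names and connection details are placeholders, not taken from the issue.
    static async Task CreateGroupAsync()
    {
        var conn = EventStoreConnection.Create(new Uri("tcp://admin:changeit@localhost:1113"));
        await conn.ConnectAsync();

        PersistentSubscriptionSettings settings = PersistentSubscriptionSettings.Create()
            .ResolveLinkTos()
            .StartFromCurrent()
            .PreferDispatchToSingle();   // strategy discussed above; .PreferRoundRobin() is the alternative

        await conn.CreatePersistentSubscriptionAsync(
            "my-stream", "my-group", settings,
            new UserCredentials("admin", "changeit"));
    }
}
```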
Just from looking at the log messages, is it possible that a timeout is occurring when writing the checkpoint?
I was analyzing a log and didn't find any suspicious errors or timeouts, just the usual messages. If there is a timeout, shouldn't there be a retry?
There is a retry, up to a certain depth.
I also noticed that when a subscription "freezes", the number of clients connected to a DispatchToSingle subscription becomes 0 or more than 1, when it should normally be exactly one. It seems as if the client socket connections become inactive or stalled, and the only way to reset them is to restart the server.
the last "freeze". As you see there is 7 identical connections
|
These are the changes I've made to 4.0.3.0 to debug this issue:
Thanks @mpielikis for all the details, it definitely helps!
There are retries.
One thing that looks odd in the status is "totalInFlightMessages": 5, but none of the connections have any in-flight messages?
No, there is one.
Are they being ack'ed?
You mean TCP ACK? I don't know. What I know is that restarting that client keeps these connections. I can bring more details on the next "freeze"; just tell me what I should check for or what actions I should take to help investigate this. Also, the interesting part is this:
Messages are pushed, but event batch reading has stopped.
No, there is an explicit ack/nak message from the client on processing of an event (it happens automatically by default, but you can also do it manually). Event batch reading only happens if the subscription is not caught up.
The last result is [Sent]. Does it ACK?
If you mean on the client side, then yes, there is an explicit ack.
The reason I ask is that there are outstanding messages on the subscription. Is it possible that an exception etc. is being thrown in the client and the messages are not being ack'ed/nak'ed because of it? I am guessing that it's not sending because the buffer size is 5 and it's full.
If a message is not ack'ed/nak'ed, the buffer will fill and no more messages will be sent. On a side note, you really don't want a buffer size of 5 unless these messages take around 1 second each to process. A more likely value would be two to three orders of magnitude larger (500-5000).
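To make the buffer-size advice concrete: assuming the EventStore.ClientAPI .NET client (exact delegate signatures vary slightly between client versions), both the buffer size and the auto-ack behaviour are parameters of the connect call. A sketch applying the suggestion above, where `conn`, `HandleEvent`, and the stream/group names are placeholders:

```csharp
// Sketch: bufferSize raised well above the default, autoAck disabled so the
// handler acks/naks explicitly. Names and credentials are placeholders.
var subscription = conn.ConnectToPersistentSubscription(
    "my-stream",
    "my-group",
    (sub, resolvedEvent) => HandleEvent(sub, resolvedEvent),   // eventAppeared
    (sub, reason, exception) =>                                // subscriptionDropped
        Console.WriteLine($"Subscription dropped: {reason} {exception}"),
    new UserCredentials("admin", "changeit"),
    bufferSize: 500,    // two to three orders of magnitude larger than 5, per the advice above
    autoAck: false);
```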
The handler on a client looks like this (a sketch follows below):
Even if an exception happened on a client, it shouldn't block the subscription. Another interesting fact is that not one but many subscriptions freeze at the same time, with different clients from different machines. The handlers are really lightweight and just push messages onto an in-memory queue.
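The handler snippet referenced above didn't survive into this thread. Purely as an illustration (not the author's actual code), a lightweight handler that just pushes onto an in-memory queue and acks explicitly might look roughly like this:

```csharp
using System;
using System.Collections.Concurrent;
using EventStore.ClientAPI;

class LightweightHandler
{
    // Hypothetical reconstruction: the real handler is not shown in this thread.
    static readonly ConcurrentQueue<ResolvedEvent> Pending = new ConcurrentQueue<ResolvedEvent>();

    static void HandleEvent(EventStorePersistentSubscriptionBase sub, ResolvedEvent evt)
    {
        try
        {
            Pending.Enqueue(evt);   // lightweight: hand off to an in-memory queue for later processing
            sub.Acknowledge(evt);   // explicit ack (only needed when autoAck is false)
        }
        catch (Exception ex)
        {
            // nak with Retry so the message doesn't occupy the in-flight buffer indefinitely
            sub.Fail(evt, PersistentSubscriptionNakEventAction.Retry, ex.Message);
        }
    }
}
```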
Hello @mpielikis, can you please send us 4-5 snapshots (at intervals of around 10-15 seconds) of the query you ran when there is a freeze? It'll help us know which of the values are changing. Thanks
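The exact query isn't shown above, but the "totalInFlightMessages" value quoted earlier looks like the per-subscription info payload from the HTTP API, so the snapshots could be taken with something like the following. The host, stream, and group names are placeholders, and the /subscriptions/{stream}/{group}/info path is my assumption about the query used:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class SubscriptionSnapshots
{
    // Assumption: the per-subscription info endpoint of the HTTP API; adjust host/stream/group as needed.
    static async Task TakeSnapshotsAsync()
    {
        using (var http = new HttpClient())
        {
            for (var i = 0; i < 5; i++)
            {
                var json = await http.GetStringAsync(
                    "http://localhost:2113/subscriptions/my-stream/my-group/info");
                Console.WriteLine($"{DateTime.UtcNow:O} {json}");
                await Task.Delay(TimeSpan.FromSeconds(15));   // 4-5 snapshots, 10-15 seconds apart
            }
        }
    }
}
```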
Hi @mpielikis, I've gone through a review of the persistent subscription code. Based on the data you provided, it looks like it can be narrowed down to the following 2 possibilities:
For no. 1, can you please verify if it's possible that your client code is calling If no. 1 isn't fruitful, for no. 2, can you please add a Thanks
Hi @mpielikis, do you have any feedback regarding the above, please? Thanks
Hi @shaan1337, yes. I was just waiting for this freeze to appear, and now I got it. I made four snapshots of 32 subscriptions: the first when there was a freeze, the second when the clients were closed, the third when the clients were up again, and the fourth after the ES restart. netstat -na output is also included from all machines. IP addresses are masked. For example, the states of one subscription:
On Freeze
On Clients Closed (12 non-existent connections according to netstat)
On Clients Up (12 non-existent and 1 existing connection, 10.98.40.137:40570, according to netstat)
On ES Restart (one healthy connection 10.98.40.137:42525)
I can send the full snapshot zip file.
What do the ES logs say about connections?
Three grepped connection log entries from a 5-day log:
10.98.40.137:61476
10.98.40.137:62433
10.98.40.137:62441
A new freeze here. Four states from one of the stuck persistent subscriptions.
On Freeze: 2 connections, 1 lost and 1 active
netstat on 10.98.40.138
netstat on 10.98.40.137
On Client Close: 1 lost connection
netstat on 10.98.40.138
On Client Up: 2 connections, 1 lost and 1 active
netstat on 10.98.40.138
After EventStore Restart: 1 live connection
netstat on 10.98.40.138
Hello @mpielikis, sorry for the delay in getting back to you, and thanks for all the details. Can you please send me your log files related to the above runs to shaan [at] geteventstore.com? Thanks
Hello @mpielikis, for the new freeze, it seems that the last notify live message time is stuck at:
This is a few minutes after some of the connections were lost: 10.98.40.137:61476, 10.98.40.137:62433, 10.98.40.137:62441 on the 20th (as you sent earlier). Can you please also send me the logs for the 20th of January? Thanks
I think we're experiencing exactly the same problem. It has caused massive problems for us on version 3.9.4 of ES.
Has this issue been addressed? I've been playing with persistent subscriptions and wonder if it is ready for prime time.
Hello, any update on this issue? We want to use EventStore, but this could be a blocker.
We still struggle with this issue in production about once a week. It is the major problem we have. We hope PR #1640 will address it; there are some similar symptoms.
This is likely due to the connection ending up in an invalid state; the ES connection doesn't always reconnect successfully after network splits. If you want to triage and debug this huge issue, I'd focus on introducing a spotty network with alternating partitions and a single master ES server that you bring up and down, while tracking the events fired and the TCP connection state of the ES connection implementation. It's not the server's connection that's buggy; it's the client's. If you set up a continuous restart of the server, you'll worsen the problem, because the client will get into invalid states and throw internal exceptions, such as ObjectDisposedException, or just silently stop sending data to the application. For us the issue is primarily catch-up subscriptions, of which we have about 40 in a single process, not persistent subscriptions.

Also noteworthy is that the ES server's logs are useless for debugging this. The ES client's logs, unless the verbosity is increased, are also useless. We're using the latest, untagged (!), Docker version of the server now, and the .NET Core client. We've also had race conditions in the client connection's authenticate call, despite using TCP; and if it's ever unauthenticated at runtime, we crash our service and let k8s restart it.

We've tried to notify ES of these problems, but they say they can't reproduce them, despite having been given full access to our environment and code base, and despite us having purchased commercial support. If we give them access to Stackdriver logging with ALL logs, neatly ordered by timestamp in GCP (synchronised by Google's NTP servers), they still insist on us mounting a persistent volume and shipping text-file-based logs to them over e-mail. Their latest suggestion is to ask us to manually handle the TCP connection state of the ES connection by "avoiding connecting to subscriptions while it's disconnected"; however, since the TCP state is managed on another thread, it's impossible to know for sure from the caller's thread what state the TCP socket is in. Pointing out that the TCP handling runs on another thread, and that any assumption about the socket's state made from another thread is therefore a race condition, is met with repeated assertions to "just do as I say and handle it manually and it will be fine". As for the unauthenticated issues that occur every other launch, they simply say they cannot reproduce them, despite full access to logs and machines.

When asked how to monitor the EventStore server with Grafana/Prometheus/Influx or another modern stack, there's no real response either (no Prometheus endpoint, no metrics shipper, no real answer), so there are no metrics for ES that let us debug the problem. This is particularly funny because I remember @gregoryyoung talking at Öredev in 2009, bragging about how, when his self-proclaimed SLA had been violated, he told the surprised customer: and this was possible because he had metrics. And yet, he tweets about completely different things other than fixing his database's client lib.
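Not an official fix, but for anyone hitting the client-side reconnect behaviour described above: one possible mitigation is to wire up the connection's lifecycle events and recreate catch-up subscriptions explicitly when they are dropped, instead of trusting the client to heal itself. Event and method names below are from the EventStore.ClientAPI .NET client as I recall them (delegate signatures differ slightly in newer client versions); the checkpoint handling and Handle method are application-specific placeholders:

```csharp
using System;
using EventStore.ClientAPI;

class CatchUpWatchdog
{
    // Sketch only: _checkpoint persistence and Handle(...) are placeholders, not a definitive implementation.
    long? _checkpoint;

    public void Wire(IEventStoreConnection conn, string stream)
    {
        conn.Disconnected += (s, e) => Console.WriteLine($"Disconnected from {e.RemoteEndPoint}");
        conn.Reconnecting += (s, e) => Console.WriteLine("Reconnecting...");
        conn.AuthenticationFailed += (s, e) => Environment.Exit(1); // crash and let the orchestrator restart us
        conn.Closed += (s, e) => Console.WriteLine($"Connection closed: {e.Reason}");
        Subscribe(conn, stream);
    }

    void Subscribe(IEventStoreConnection conn, string stream)
    {
        conn.SubscribeToStreamFrom(stream, _checkpoint, CatchUpSubscriptionSettings.Default,
            (sub, evt) =>
            {
                Handle(evt);
                _checkpoint = evt.OriginalEventNumber;   // remember progress for resubscription
            },
            subscriptionDropped: (sub, reason, ex) => Subscribe(conn, stream));   // recreate from the last checkpoint
    }

    void Handle(ResolvedEvent evt) { /* application-specific processing */ }
}
```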
@mpielikis By any chance, are you using
No, we don't. The freeze likely happens at a moment that combines slow stream reading with high persistent subscription usage. We sometimes saw high peaks in the StorageReaderQueue, with several thousand items in the queue, and then the freeze happened. To lower the impact of this issue, we divided the tasks between two EventStore instances with separate databases, so two disks were used for reading instead of one.
Thanks for the details @mpielikis. I earlier saw that you implemented a I've also done a code review again a few days ago and can deduce the following:
I assumed that the persistent subscription was live and that, due to the freeze,
Based on the above, I think that one of the premises #1 to #5 must be false, because it seems to lead to an impossible situation (or I'm missing something). Can you please confirm whether #3 was true? We could also set some additional (manually activated) logging between Thanks
We've been having very similar problems ourselves. We have found that editing the subscription through the UI, changing nothing, and just saving it wakes the subscription up again. Any movement on this since November?
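For what it's worth, the "edit, change nothing, save" workaround above can presumably also be scripted. Assuming the EventStore.ClientAPI .NET client, re-applying the group's current settings through the update call should have the same effect; the settings and names below are placeholders and should mirror whatever the group is actually configured with:

```csharp
// Sketch: nudging a stuck group by re-applying its (unchanged) settings.
// Call this from an async method; conn is an existing IEventStoreConnection.
PersistentSubscriptionSettings settings = PersistentSubscriptionSettings.Create()
    .ResolveLinkTos()
    .StartFromCurrent();

await conn.UpdatePersistentSubscriptionAsync(
    "my-stream", "my-group", settings,
    new UserCredentials("admin", "changeit"));
```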
This related fix may address some of the problems here: #1936
We got the same issue in production today. We restarted the master node and it resolved the issue.
Thank you for letting us know @alexeyzimarev
@hayley-jean it's 5.0.2 on CentOS
From the 3.8 versions to the recent 4.0.1.4 version, we see accidental (roughly once or twice a week) stopping/freezing of persistent subscriptions on a three-node cluster. We can't detect from the logs or from the index.html#/subscriptions page that there is a problem. We see that the stream is not being consumed, and restarting a client does not help. The only way to wake up these subscriptions is to restart the EventStore master. After the master reset/failover, all the subscriptions become active again instantly.
The last incident happened tonight, after an upgrade from 3.9.2 to 4.0.1.4. The subscription stopped around 2017.08.17 23:21:00 UTC.
Master log at that time
Cluster stats