Subscribers do not get notification if a scc-broker node restarts while processing messages #435

Closed
dkruchinin opened this issue Jul 23, 2018 · 5 comments

@dkruchinin

dkruchinin commented Jul 23, 2018

I was testing different failure scenarios in SocketCluster and found that subscribers are not notified properly when an scc-broker node is restarted while it's still streaming messages. Instead of terminating the clients' WebSocket connections related to the subscriptions handled by the restarted scc-broker instance, a worker just re-initializes the connection to the newly launched scc-broker, leaving the clients unaware of the problem. They end up getting only part of the messages sent by the publisher. Even if all the messages are persisted by the workers, clients still don't know that something went wrong and can't ask the workers to send them the messages they've missed.

I created a simple test environment that spawns an HAProxy load balancer, two default SocketCluster workers, two scc-brokers, an scc-state instance, and three SocketCluster clients (one publisher and two subscribers). You can find a detailed description of how to reproduce the issue here: https://github.com/dkruchinin/socketcluster-sandbox

Is it a bug or a feature? If it's the latter, is there a way to propagate the error back to the clients through a middleware somehow?

dkruchinin changed the title from "Subscribers do not get notification if a scc-broker node restarts while streaming messages" to "Subscribers do not get notification if a scc-broker node restarts while processing messages" on Jul 23, 2018
@jondubois
Member

jondubois commented Jul 23, 2018

@dkruchinin This is a feature ;p SocketCluster doesn't do delivery guarantees but you can implement your own mechanism on top.

Telling a client if a message failed to be delivered is tricky because there are multiple scenarios to account for. For example, if a channel has 1000 subscribers and only 999 of them receive the message successfully; should we tell the publisher client that the publish operation was a success or a failure?

The publish operation currently only tells you if the message reached the front-facing server; beyond that it doesn't track the delivery to individual subscribers.

If you want to track if specific subscribers have received specific messages, then you can create special receipt/acknowledgement channels which subscribers can use to inform publishers whenever they receive certain messages.

It would be nice to write a client-side plugin though which could implement this guaranteed pub/sub receipt/ack feature; it shouldn't be too difficult.
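
The receipt/acknowledgement channel idea above can be sketched roughly as follows. This is an in-memory illustration, not SocketCluster code: `Bus` stands in for the pub/sub layer, and every name in it (`Bus`, `receiptChannel`, `subscribeWithReceipts`, `ReceiptTracker`) is a hypothetical helper invented for the example. It assumes each message carries a unique `id` and that the publisher knows which subscriber ids it expects.

```typescript
// Minimal in-memory pub/sub standing in for SocketCluster channels.
// All names below are hypothetical and are not SocketCluster APIs.
type Handler = (data: unknown) => void;

class Bus {
  private handlers: Record<string, Handler[]> = {};

  subscribe(channel: string, handler: Handler): void {
    (this.handlers[channel] = this.handlers[channel] || []).push(handler);
  }

  publish(channel: string, data: unknown): void {
    for (const handler of this.handlers[channel] || []) {
      handler(data);
    }
  }
}

interface Message { id: string; body: string; }
interface Receipt { messageId: string; subscriberId: string; }

// Convention assumed for this sketch: receipts travel on "<channel>/receipts".
const receiptChannel = (channel: string): string => channel + "/receipts";

// Subscriber side: handle the message, then publish a receipt for it.
function subscribeWithReceipts(
  bus: Bus,
  channel: string,
  subscriberId: string,
  onMessage: (message: Message) => void
): void {
  bus.subscribe(channel, (data) => {
    const message = data as Message;
    onMessage(message);
    const receipt: Receipt = { messageId: message.id, subscriberId };
    bus.publish(receiptChannel(channel), receipt);
  });
}

// Publisher side: record which subscribers confirmed which message.
class ReceiptTracker {
  private confirmed: Record<string, string[]> = {};

  constructor(bus: Bus, channel: string) {
    bus.subscribe(receiptChannel(channel), (data) => {
      const receipt = data as Receipt;
      const list = (this.confirmed[receipt.messageId] =
        this.confirmed[receipt.messageId] || []);
      list.push(receipt.subscriberId);
    });
  }

  // Expected subscribers that have not confirmed the given message yet.
  missing(messageId: string, expected: string[]): string[] {
    const seen = this.confirmed[messageId] || [];
    return expected.filter((id) => seen.indexOf(id) === -1);
  }
}

// Usage: "sub-a" is subscribed and confirms; "sub-b" never saw the message.
const bus = new Bus();
const tracker = new ReceiptTracker(bus, "news");
const received: string[] = [];
subscribeWithReceipts(bus, "news", "sub-a", (m) => received.push(m.body));
bus.publish("news", { id: "m1", body: "hello" });
const unconfirmed = tracker.missing("m1", ["sub-a", "sub-b"]); // ["sub-b"]
```

The publisher can then decide for itself what counts as success (all expected subscribers, a quorum, or a timeout), which sidesteps the 999-out-of-1000 question at the framework level.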

@dkruchinin
Author

dkruchinin commented Jul 24, 2018

@jondubois

Thank you for the fast reply.

> For example, if a channel has 1000 subscribers and only 999 of them receive the message successfully; should we tell the publisher client that the publish operation was a success or a failure?

I don't think the publisher has to know anything about how its messages get distributed by SocketCluster and how many subscribers receive it. Delivery guarantees are good enough as long as I can assume that the messages at least hit the server-side handler that can be modified to persist them. In your example I would care more about building a mechanism that would notify the publisher once the message it sent is persisted by the server.

> If you want to track if specific subscribers have received specific messages, then you can create special receipt/acknowledgement channels which subscribers can use to inform publishers whenever they receive certain messages.

Ideally, I don't want subscribers to deal with re-sending messages to publishers; I would prefer this to be resolved on the server side. It is perfectly fine if the publisher sends a bunch of messages and then goes offline. If the server managed to receive all those messages, it can take care of propagating them to subscribers and provide some delivery guarantees, such as letting the subscribers know that the stream of messages was interrupted by an scc-broker failure and that they have to reconnect and pull the messages they've missed.

Basically, I'm trying to understand if SocketCluster is a good option for what I want to achieve, namely:

  • the server acknowledges publishers once their messages are written to a persistent store
  • in case of a hardware failure affecting one of the core components of the cluster responsible for publishing messages to subscribers (like an scc-broker or a worker), all client connections directly or indirectly managed by the failed component are interrupted, so that the clients can reinitialize the connection and figure out what they missed.
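
One way to make the two requirements above concrete is a persisted, sequence-numbered log. The sketch below is only an illustration under stated assumptions, not something SocketCluster provides: `MessageLog` and `GapAwareSubscriber` are hypothetical names, the array stands in for a persistent store, and messages are assumed to be persisted before they are broadcast. Note that it detects a missed message through a gap in sequence numbers rather than through a dropped connection, which is a substitute for the connection-interruption behaviour described in the second bullet.

```typescript
// Hypothetical names throughout; this is not a SocketCluster API.
interface StoredMessage { seq: number; body: string; }

// Server side: append-only log for one channel. The returned sequence number
// doubles as the acknowledgement sent back to the publisher.
class MessageLog {
  private entries: StoredMessage[] = [];

  // Persist first, then acknowledge by returning the assigned sequence number.
  append(body: string): number {
    const seq = this.entries.length + 1;
    this.entries.push({ seq, body });
    return seq;
  }

  // Messages with seq strictly greater than `afterSeq`, in order.
  since(afterSeq: number): StoredMessage[] {
    return this.entries.filter((m) => m.seq > afterSeq);
  }
}

// Subscriber side: track the last sequence number seen and backfill on a gap.
class GapAwareSubscriber {
  readonly delivered: string[] = [];
  private lastSeq = 0;
  private log: MessageLog;

  constructor(log: MessageLog) {
    this.log = log;
  }

  // Called for each live message pushed over the (possibly lossy) channel.
  onLive(message: StoredMessage): void {
    if (message.seq <= this.lastSeq) return; // duplicate or already backfilled
    if (message.seq > this.lastSeq + 1) {
      this.catchUp(); // gap detected: pull everything missed from the store
      return;
    }
    this.accept(message);
  }

  // Pull everything after the last seen message; also useful after a reconnect.
  catchUp(): void {
    for (const missed of this.log.since(this.lastSeq)) {
      this.accept(missed);
    }
  }

  private accept(message: StoredMessage): void {
    this.delivered.push(message.body);
    this.lastSeq = message.seq;
  }
}

// Usage: message "b" is persisted but its live delivery is lost.
const log = new MessageLog();
const subscriber = new GapAwareSubscriber(log);
subscriber.onLive({ seq: log.append("a"), body: "a" });
log.append("b"); // never reaches the subscriber live
subscriber.onLive({ seq: log.append("c"), body: "c" }); // gap triggers backfill
// subscriber.delivered is now ["a", "b", "c"]
```

A gap is only noticed when a later message arrives, so a subscriber that cares about the tail of a stream would also need to call `catchUp()` periodically or on reconnect.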

@jondubois
Member

jondubois commented Jul 24, 2018

@dkruchinin Right now, when publishing, you can only verify that the message reached the server that you are directly connected to; beyond that, there is no guarantee that the message has successfully propagated through the rest of the cluster.

There is some work being done right now which will offer delivery guarantees at the back-end/cluster level (e.g. it will retry failed deliveries which did not reach other nodes on the back end). That feature could be a couple of months away from completion, though.

Ideally, when this feature is completed, you should be able to configure your SCC nodes to enable or disable delivery guarantees.

@dkruchinin
Author

@jondubois Thank you! Is there already a branch with a prototype I can keep an eye on?

@jondubois
Member

jondubois commented Jul 24, 2018

@dkruchinin There is a branch by @BenV which supports adding custom middleware functions to a regular SC instance's broker process; see BenV/sc-broker@43adbe4. This will allow us to do things like delay the completion of a publish operation until it has fully propagated throughout the rest of the cluster (it will allow us to retry publishing multiple times behind the scenes in case an scc-broker instance fails).

The changes by @BenV are the first step. Then we'll also need to make changes to https://github.com/SocketCluster/scc-broker-client (this is the client which connects each scc-worker instance to back end scc-brokers) and to scc-broker (https://github.com/SocketCluster/scc-broker). I guess the expected behaviour would be to retry sending a message if it doesn't reach the other instance.

I think it should be configurable (can be enabled or disabled) because not all systems require delivery guarantees.
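
As a rough sketch of what "retry if it doesn't reach the other instance, but make it configurable" could look like: the option names below (`guaranteeDelivery`, `maxAttempts`) and the `deliver` and `flakyLink` helpers are made up for illustration and are not actual SCC settings or functions. The example is synchronous to keep it short; a real implementation would presumably be asynchronous and wait between attempts. Bounded retries also only raise the odds of delivery, they do not strictly guarantee it.

```typescript
// Hypothetical option names; these are not actual SCC configuration keys.
interface DeliveryOptions {
  guaranteeDelivery: boolean; // when false, send once and ignore failures
  maxAttempts: number;        // upper bound on attempts when retries are enabled
}

interface DeliveryResult { delivered: boolean; attempts: number; }

// `send` stands in for forwarding a message to another node.
// It returns true when the other node confirmed receipt.
function deliver(send: () => boolean, options: DeliveryOptions): DeliveryResult {
  const limit = options.guaranteeDelivery ? Math.max(1, options.maxAttempts) : 1;
  for (let attempt = 1; attempt <= limit; attempt++) {
    if (send()) {
      return { delivered: true, attempts: attempt };
    }
  }
  return { delivered: false, attempts: limit };
}

// Simulated link that fails a fixed number of times (e.g. while a broker
// restarts) and then recovers.
function flakyLink(failures: number): () => boolean {
  let remaining = failures;
  return () => {
    if (remaining > 0) {
      remaining--;
      return false;
    }
    return true;
  };
}

// With retries enabled, the third attempt succeeds.
const withRetries = deliver(flakyLink(2), { guaranteeDelivery: true, maxAttempts: 5 });
// With retries disabled, the single attempt fails and the message is dropped.
const fireAndForget = deliver(flakyLink(2), { guaranteeDelivery: false, maxAttempts: 5 });
```

Returning the attempt count alongside the outcome lets the caller log or surface how close a publish came to being dropped.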
