
JetStream KV cluster losing data after node restarts #5441

Open
grrrvahrrr opened this issue May 17, 2024 · 4 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@grrrvahrrr

Observed behavior

We start a 5-node cluster with the following configuration and begin continuously putting data into the KV (a sketch of the write loop follows the config):

port: 4222
http_port: 8222

cluster {
    name: js_kv
    listen: 0.0.0.0:6222
    connect_retries: -1
    pool_size: 9
    authorization {
        user: user
        password: password
        timeout: 0.5
    }
    routes = [
        nats-route://user:password@js_kv_node01:6222,
        nats-route://user:password@js_kv_node02:6222,
        nats-route://user:password@js_kv_node03:6222,
        nats-route://user:password@js_kv_node04:6222,
        nats-route://user:password@js_kv_node05:6222
    ]
    compression: {
      mode: s2_auto
      rtt_thresholds: [10ms, 50ms, 100ms]
    }
}

jetstream {
  store_dir: /data/jetstream
  max_file_store: 10737418240
}
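
For reference, a minimal sketch of the write loop using nats.go; the bucket name, key, write rate, and the 5-replica setting are illustrative assumptions, not the exact test code:

package main

import (
    "fmt"
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    // Connect to any node of the 5-node cluster.
    nc, err := nats.Connect("nats://js_kv_node01:4222")
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Drain()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // Bucket name and replica count are assumptions for this sketch;
    // R=5 means every node holds a replica of the backing stream.
    kv, err := js.CreateKeyValue(&nats.KeyValueConfig{Bucket: "test", Replicas: 5})
    if err != nil {
        log.Fatal(err)
    }

    // Continuously put data into the KV.
    for i := 0; ; i++ {
        if _, err := kv.Put("counter", []byte(fmt.Sprintf("%d", i))); err != nil {
            log.Printf("put failed: %v", err)
        }
        time.Sleep(10 * time.Millisecond)
    }
}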

To test resilience, we turn off two random nodes, wait 5-10 minutes, and then turn them back on.
After the restart we see a difference in the last sequence on the nodes that were turned off, and over time the gap keeps growing.
[screenshot: last sequence diverging between nodes after restart]
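
One way to compare the per-node view is to query each node's monitoring endpoint (http_port 8222 above) at /jsz with stream details enabled; the sketch below does that from Go, with the JSON field names assumed from the /jsz output:

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

// Minimal subset of the /jsz response; field names are assumed
// from the server's monitoring output.
type jszResp struct {
    AccountDetails []struct {
        Name    string `json:"name"`
        Streams []struct {
            Name  string `json:"name"`
            State struct {
                LastSeq uint64 `json:"last_seq"`
            } `json:"state"`
        } `json:"stream_detail"`
    } `json:"account_details"`
}

func main() {
    nodes := []string{"js_kv_node01", "js_kv_node02", "js_kv_node03", "js_kv_node04", "js_kv_node05"}
    for _, n := range nodes {
        resp, err := http.Get(fmt.Sprintf("http://%s:8222/jsz?accounts=true&streams=true", n))
        if err != nil {
            log.Printf("%s: %v", n, err)
            continue
        }
        var jsz jszResp
        if err := json.NewDecoder(resp.Body).Decode(&jsz); err != nil {
            log.Printf("%s: decode: %v", n, err)
        }
        resp.Body.Close()
        // Print each node's local view of every stream's last sequence.
        for _, acc := range jsz.AccountDetails {
            for _, s := range acc.Streams {
                fmt.Printf("%s %s last_seq=%d\n", n, s.Name, s.State.LastSeq)
            }
        }
    }
}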

We also tried running a durable consumer on the test cluster. If the consumer leader switches from a healthy node to a node with sequence loss, the consumer stops delivering data due to the unexpected sequence difference.
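
Roughly, the durable consumer test looks like the sketch below: a pull consumer bound to the KV bucket's backing stream (for a bucket named "test" the stream is KV_test and its subjects are $KV.test.>); the bucket and durable names are illustrative:

package main

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://js_kv_node01:4222")
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Drain()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // Durable pull consumer on the KV bucket's backing stream
    // (bucket "test" -> stream "KV_test", subjects "$KV.test.>").
    sub, err := js.PullSubscribe("$KV.test.>", "kv-test-durable")
    if err != nil {
        log.Fatal(err)
    }
    for {
        msgs, err := sub.Fetch(10, nats.MaxWait(2*time.Second))
        if err != nil && err != nats.ErrTimeout {
            log.Printf("fetch: %v", err)
            continue
        }
        for _, m := range msgs {
            meta, _ := m.Metadata()
            log.Printf("stream seq %d: %s", meta.Sequence.Stream, m.Subject)
            _ = m.Ack()
        }
    }
}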

This testing shows that in an unstable server environment the KV cannot guarantee data safety, and reads can return unreliable results depending on which node is the leader at a given moment.

Expected behavior

We expect that after the nodes are turned back on, all 5 nodes report the same last sequence, and that consumers either continue delivering data or at least signal that they have stopped.

Server and client version

tested on both v2.10.14 and nightly-20240513
nats.go v1.34.1

Host environment

No response

Steps to reproduce

No response

@grrrvahrrr grrrvahrrr added the defect Suspected defect such as a bug or regression label May 17, 2024
@derekcollison
Member

Can you share underlying stream info for the KV?

Could you test with latest main/nightly as well?

@timurguseynov

timurguseynov commented May 26, 2024

@derekcollison There also seems to be a problem with the docker swarm node drain command. When it is called without running the Linux sync command right before the drain, there is roughly a 90% chance of the node entering a non-recoverable mismatch state on the second run, even with the SyncAlways setting turned on. Do you think the shutdown and sync behavior could be changed so that it does not cause issues with docker swarm?

@ripienaar
Contributor

ripienaar commented May 27, 2024

Do you set --stop-grace-period on swarm? The NATS lame duck mode window is 2 minutes; you should align swarm's grace period with that.

At the start of the grace period NATS will get a TERM, and at the end a KILL if it has not stopped already, so you want to set swarm's grace period to be longer than NATS's shutdown time.
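
As a rough illustration of that alignment (the values shown are the server defaults, used here only as an example):

# nats-server config: lame duck shutdown window (defaults shown explicitly)
lame_duck_duration: "2m"
lame_duck_grace_period: "10s"

# the swarm service's stop grace period should then be longer than that, e.g.:
#   docker service update --stop-grace-period=150s <service>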

@timurguseynov

@ripienaar thank you so much for your answer, I hope we will try it out soon.
