New replicas start accepting traffic before data/tables are created #561
When increasing the replicas of an existing cluster, the new replicas are opened to traffic before the replicated tables and data have been created on the node.

Steps to reproduce using docker-for-mac Kubernetes:

Results:

This is an example with a single table and a small data set, but this problem causes significant outages for us when adding a replica.
Comments
Similarly, it looks like when replicas are removed, the tables are deleted before the pods are removed from the cluster service. Logs from the operator:

Content of out.txt:
Similar to ClickHouse/ClickHouse#10963
@mcgrawia, we know about that. Some improvements have been made in the 0.12.0 release, and we plan to implement more careful handling of readiness checks in the next release.
Should be fixed in https://github.com/Altinity/clickhouse-operator/releases/tag/0.13.0
Actually, it is not completely fixed. The problem is that the operator uses the service in order to connect to the pod, and external users use the same service. So as soon as a pod is live and ready, it is accessible both to the operator and to external users. My test results are below. The test tries to repeat the original scenario, but also adds a distributed table. It then makes the following queries in a loop, with a 1-second sleep between tries:
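The exact queries did not survive in this copy of the thread; a minimal sketch of such a polling loop, assuming hypothetical table names (events_local for the replicated table, events_dist for the distributed table) and a hypothetical cluster service address, could look like this:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/ClickHouse/clickhouse-go/v2"
)

func main() {
	// Hypothetical service address and table names; the real test used the
	// cluster service that the operator creates for the installation.
	conn, err := clickhouse.Open(&clickhouse.Options{
		Addr: []string{"clickhouse-demo.default.svc.cluster.local:9000"},
	})
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	ctx := context.Background()

	for {
		for _, table := range []string{"events_local", "events_dist"} {
			var count uint64
			if err := conn.QueryRow(ctx, "SELECT count() FROM "+table).Scan(&count); err != nil {
				// On a freshly added replica the tables may not exist yet,
				// even though the pod is already serving traffic.
				fmt.Printf("%s %s: %v\n", time.Now().Format(time.RFC3339), table, err)
				continue
			}
			fmt.Printf("%s %s: %d rows\n", time.Now().Format(time.RFC3339), table, count)
		}
		time.Sleep(time.Second)
	}
}
```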
Here is the relevant part of the log:
So for 5 seconds the tables were missing, but as soon as the tables are created, the distributed table returns the full data. The replicated table requires 5 more seconds to catch up. We planned to use more aggressive readiness checks, but that would not solve the problem, since not-ready pods are not accessible through the service either. Ideas are welcome. One possible option is to connect to the pod directly, without going through the service. This could be done by integrating clickhouse-go, or by running some dummy container with clickhouse-client inside the operator itself. CC @mcgrawia
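As an illustration of that direct-connection option (a sketch, not the operator's actual code): with clickhouse-go, the operator could dial a pod's per-pod DNS name from the StatefulSet's headless service instead of the load-balanced service. All names below are hypothetical, and reaching a not-yet-Ready pod this way assumes the headless service publishes not-ready addresses:

```go
package main

import (
	"context"
	"fmt"

	"github.com/ClickHouse/clickhouse-go/v2"
)

// tablesReady probes a single pod directly, bypassing the load-balanced
// service. podAddr is the pod's per-pod DNS name from the StatefulSet's
// headless service; the name pattern shown in main is hypothetical.
func tablesReady(ctx context.Context, podAddr, database, table string) (bool, error) {
	conn, err := clickhouse.Open(&clickhouse.Options{Addr: []string{podAddr}})
	if err != nil {
		return false, err
	}
	defer conn.Close()

	// Check whether the expected table already exists on this replica.
	var n uint64
	err = conn.QueryRow(ctx,
		"SELECT count() FROM system.tables WHERE database = ? AND name = ?",
		database, table,
	).Scan(&n)
	if err != nil {
		return false, err
	}
	return n > 0, nil
}

func main() {
	ok, err := tablesReady(context.Background(),
		"chi-demo-0-1-0.chi-demo-0-1.default.svc:9000", "default", "events_local")
	fmt.Println(ok, err)
}
```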
@mcgrawia, here is a log from the 0.15.0 operator version:
It is still possible that replication will lag on big tables, but the chance is significantly lower.
Hi @alex-zaitsev, thanks for the update! I am not up to speed yet on the 0.15.0 replica-addition algorithm, but my understanding of the 0.14.0 one is as follows:
Would it be possible to add a step between 4 and 5 that waits for the …
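The end of the question is cut off in this copy of the thread, so the exact wait condition is unknown. One plausible shape for such a step, assuming the check is on absolute_delay from system.replicas, is sketched below:

```go
// Sketch only: not the operator's actual code.
package readinesswait

import (
	"context"
	"time"

	"github.com/ClickHouse/clickhouse-go/v2/lib/driver"
)

// waitForReplica polls the new replica until its replication delay drops
// below maxDelay seconds before the pod is exposed through the cluster
// service. The concrete condition is an assumption, since the original
// question above is truncated.
func waitForReplica(ctx context.Context, conn driver.Conn, database, table string, maxDelay uint64) error {
	for {
		var delay uint64
		err := conn.QueryRow(ctx,
			"SELECT absolute_delay FROM system.replicas WHERE database = ? AND table = ?",
			database, table,
		).Scan(&delay)
		if err == nil && delay <= maxDelay {
			return nil // the table exists on this replica and it has caught up
		}
		// Either the table is not created yet (the query errors) or the
		// replica is still behind; retry after a short sleep.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}
```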
Hi @mcgrawia,

Alternatives:
Thanks @alex-zaitsev, that makes sense. I'll look into the Distributed table and see how we can use that as well. |