New replicas start accepting traffic before data/tables are created #561

Closed
mcgrawia opened this issue Oct 6, 2020 · 9 comments
Labels
work in progress (This feature is not completed yet)

Comments

@mcgrawia
Contributor

mcgrawia commented Oct 6, 2020

When increasing the replica count of an existing cluster, the new replicas are opened to traffic before the replicated tables and data exist on the new node.

Steps to reproduce using docker-for-mac Kubernetes:

  1. Apply a single-replica ClickHouseInstallation with ZooKeeper settings (set up ZooKeeper first):
apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: test-scaling
spec:
  configuration:
    users:
      default/networks/ip: ::/0
    clusters:
      - name: test-scaling
        layout:
          shardsCount: 1
          replicasCount: 1
    zookeeper:
      nodes:
        - host: scaling-clickhouse-zookeeper
          port: 2181
  2. Start some load on the cluster service:
while true
do
  echo 'select count() from test_scaling' | curl 'http://localhost:8123/' --data-binary @- >> out.txt
  sleep 0.5
done
  3. Write some data to the single replica:
CREATE TABLE test_scaling (num UInt32)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/test_scaling', '{replica}') ORDER BY tuple();

INSERT INTO test_scaling
SELECT * FROM system.numbers
LIMIT 100000000;
  4. Apply a YAML update to increase replicasCount to 2:
apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: test-scaling
spec:
  configuration:
    users:
      default/networks/ip: ::/0
    clusters:
      - name: test-scaling
        layout:
          shardsCount: 1
          replicasCount: 2
    zookeeper:
      nodes:
        - host: scaling-clickhouse-zookeeper
          port: 2181
  5. View the output:
cat out.txt

Results:

# section where initial table did not exist yet
Code: 60, e.displayText() = DB::Exception: Table default.test_scaling doesn't exist. (version 20.9.2.20 (official build))
Code: 60, e.displayText() = DB::Exception: Table default.test_scaling doesn't exist. (version 20.9.2.20 (official build))
...
# section where table was created and populated with one replica
0
0
0
0
0
0
0
9436905
29359260
51378705
75495240
100000000
100000000
100000000
100000000
...
100000000
100000000
100000000
100000000
100000000
100000000
# second replica came online and started serving traffic before all the tables and data were copied. 
Code: 60, e.displayText() = DB::Exception: Table default.test_scaling doesn't exist. (version 20.9.2.20 (official build))
100000000
100000000
Code: 60, e.displayText() = DB::Exception: Table default.test_scaling doesn't exist. (version 20.9.2.20 (official build))
Code: 60, e.displayText() = DB::Exception: Table default.test_scaling doesn't exist. (version 20.9.2.20 (official build))
Code: 60, e.displayText() = DB::Exception: Table default.test_scaling doesn't exist. (version 20.9.2.20 (official build))
Code: 60, e.displayText() = DB::Exception: Table default.test_scaling doesn't exist. (version 20.9.2.20 (official build))
100000000
0
9825130
26601850
100000000
100000000
100000000
100000000
100000000
100000000
100000000
100000000

This example uses a single table and a small data set, but the problem causes significant outages for us when adding a replica.

mcgrawia changed the title from "New replica starts accepting traffic before data/tables are created" to "New replicas starts accepting traffic before data/tables are created" on Oct 6, 2020
mcgrawia changed the title from "New replicas starts accepting traffic before data/tables are created" to "New replicas start accepting traffic before data/tables are created" on Oct 6, 2020
@mcgrawia
Contributor Author

mcgrawia commented Oct 7, 2020

Similarly, it looks like when replicas are removed, the tables are deleted before the pods are removed from the cluster service:

Logs from the operator:

I1007 01:55:28.417079       1 announcer.go:99] Delete host test-scaling/0-1 - started
I1007 01:55:28.814533       1 schemer.go:74] Run query on: chi-test-scaling-test-scaling-0-1.default.svc.cluster.local of [chi-test-scaling-test-scaling-0-1.default.svc.cluster.local]
I1007 01:55:28.825744       1 schemer.go:277] Drop tables: [test_scaling] as [DROP TABLE IF EXISTS "default"."test_scaling"]
I1007 01:55:29.805738       1 announcer.go:99] Deleted tables on host 0-1 replica 1 to shard 0 in cluster test-scaling
I1007 01:55:29.822884       1 deleter.go:40] Controller delete host started test-scaling/0-1

content of out.txt:

# reverted back to 1-replica yaml here
1000388226
1000388226
1000388226
1000388226
1000388226
Code: 60, e.displayText() = DB::Exception: Table default.test_scaling doesn't exist. (version 20.9.2.20 (official build))
1000388226
1000388226
1000388226
Code: 60, e.displayText() = DB::Exception: Table default.test_scaling doesn't exist. (version 20.9.2.20 (official build))
1000388226
1000388226
1000388226

@filimonov
Member

Similar to ClickHouse/ClickHouse#10963

@alex-zaitsev
Member

@mcgrawia, we know about that. Some improvements were made in the 0.12.0 release, and we plan to implement more careful handling of readiness checks in the next release.

@alex-zaitsev
Member

Should be fixed in https://github.com/Altinity/clickhouse-operator/releases/tag/0.13.0

@alex-zaitsev
Member

Actually, it is not completely fixed. The problem is that the operator uses the service to connect to the pod, and external users use the same service. So as soon as the pod is live and ready, it is accessible both to the operator and to external users.
There is a workaround, however: use a Distributed table on top of the replicated one. The operator adds a new replica to the cluster definition (remote_servers) only after the schema is created, so querying through a Distributed table mitigates the problem almost entirely.
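
For the original repro, a rough sketch of that workaround could look like the following (the test_scaling_distr table name and the 'test-scaling' cluster name are assumptions based on the CHI above, not taken from an actual run):

# Assumed sketch: create a Distributed table over the replicated test_scaling
# table, then point the load loop at the Distributed table instead.
echo "CREATE TABLE test_scaling_distr AS test_scaling
      ENGINE = Distributed('test-scaling', 'default', 'test_scaling')" \
  | curl 'http://localhost:8123/' --data-binary @-

echo 'select count() from test_scaling_distr' \
  | curl 'http://localhost:8123/' --data-binary @- >> out.txt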

My test results are below. The test repeats the original scenario but also adds a Distributed table. Then it runs the following queries in a loop with a 1-second sleep between tries:

            cnt_local = clickhouse.query_with_error(chi, "select count() from test_local", "chi-test-025-rescaling-default-0-1.test.svc.cluster.local")
            cnt_distr = clickhouse.query_with_error(chi, "select count() from test_distr", "chi-test-025-rescaling-default-0-1.test.svc.cluster.local")

Here is part of the log:

      When Add one more replica, but do not wait for completion
        And /Users/alz/go/src/github.com/altinity/clickhouse-operator/tests/configs/test-025-rescaling-2.yaml is applied
        OK
        Then Waiting for: 2 statefulsets, 2 pods and 3 services to be available
          And Not ready yet. [ statefulset: 1 pod: 1 service: 2 ]. Wait for 5 seconds
          OK
          And Not ready yet. [ statefulset: 2 pod: 2 service: 2 ]. Wait for 10 seconds
          OK
          And Not ready yet. [ statefulset: 2 pod: 2 service: 2 ]. Wait for 15 seconds
          OK
        OK
        Then chi test-025-rescaling .status.status should be InProgress
        OK
      OK
      Then Query second pod using service as soon as it is available
local: Received exception from server (version 20.8.7):
Code: 60. DB::Exception: Received from chi-test-025-rescaling-default-0-1.test.svc.cluster.local:9000. DB::Exception: Table default.test_local doesn't exist.. 
command terminated with exit code 60, distr: Received exception from server (version 20.8.7):
Code: 60. DB::Exception: Received from chi-test-025-rescaling-default-0-1.test.svc.cluster.local:9000. DB::Exception: Table default.test_distr doesn't exist.. 
command terminated with exit code 60
local: Received exception from server (version 20.8.7):
Code: 60. DB::Exception: Received from chi-test-025-rescaling-default-0-1.test.svc.cluster.local:9000. DB::Exception: Table default.test_local doesn't exist.. 
command terminated with exit code 60, distr: Received exception from server (version 20.8.7):
Code: 60. DB::Exception: Received from chi-test-025-rescaling-default-0-1.test.svc.cluster.local:9000. DB::Exception: Table default.test_distr doesn't exist.. 
command terminated with exit code 60
local: Received exception from server (version 20.8.7):
Code: 60. DB::Exception: Received from chi-test-025-rescaling-default-0-1.test.svc.cluster.local:9000. DB::Exception: Table default.test_local doesn't exist.. 
command terminated with exit code 60, distr: Received exception from server (version 20.8.7):
Code: 60. DB::Exception: Received from chi-test-025-rescaling-default-0-1.test.svc.cluster.local:9000. DB::Exception: Table default.test_distr doesn't exist.. 
command terminated with exit code 60
local: Received exception from server (version 20.8.7):
Code: 60. DB::Exception: Received from chi-test-025-rescaling-default-0-1.test.svc.cluster.local:9000. DB::Exception: Table default.test_local doesn't exist.. 
command terminated with exit code 60, distr: Received exception from server (version 20.8.7):
Code: 60. DB::Exception: Received from chi-test-025-rescaling-default-0-1.test.svc.cluster.local:9000. DB::Exception: Table default.test_distr doesn't exist.. 
command terminated with exit code 60
local: 0, distr: 100000000
local: 388225, distr: 100000000
local: 24504760, distr: 100000000
local: 24504760, distr: 100000000
local: 100000000, distr: 100000000
      OK

So for about 5 seconds the tables were missing, but as soon as the tables are created the Distributed table returns the full data. The replicated table requires 5 more seconds to catch up.

We planned to use more aggressive readiness checks, but that would not solve the problem, since not-ready pods are not accessible through a service either.

Ideas are welcome. One possible option is to connect to the pod directly, without using a service. That could be done by integrating clickhouse-go, or by using some dummy container with clickhouse-client inside the operator itself.
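
As a rough illustration of that idea, a manual equivalent would be to exec into the pod and run clickhouse-client there, bypassing the services (pod name and namespace taken from the log above; the operator itself would need clickhouse-go or a sidecar client to do the same):

# Manual sketch: query the new replica directly inside its pod,
# without going through any Kubernetes service.
kubectl -n test exec chi-test-025-rescaling-default-0-1-0 -- \
  clickhouse-client --query 'SELECT count() FROM default.test_local'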

CC @mcgrawia

sunsingerus added the "work in progress" label on Jan 27, 2021
@alex-zaitsev
Member

@mcgrawia, here is a log with the 0.15.0 operator version:

      When Add one more replica, but do not wait for completion
        And /Users/alz/go/src/github.com/altinity/clickhouse-operator/tests/configs/test-025-rescaling-2.yaml is applied
        OK
        Then 2 pod(s)  should be created
          And Not ready. Wait for 5 seconds
          OK
        OK
        Then chi test-025-rescaling .status.status should be InProgress
        OK
      OK
      Then Query second pod using service as soon as pod is in ready state
        And pod chi-test-025-rescaling-default-0-1-0 .metadata.labels."clickhouse\.altinity\.com/ready" should be yes
          And Not ready. Wait for 1 seconds
          OK
          And Not ready. Wait for 2 seconds
          OK
          And Not ready. Wait for 3 seconds
          OK
          And Not ready. Wait for 4 seconds
          OK
          And Not ready. Wait for 5 seconds
          OK
          And Not ready. Wait for 6 seconds
          OK
        OK
local via loadbalancer: 100000000, distributed via loadbalancer: 100000000
local: 100000000, distr: 100000000
Tables not ready: 0s, data not ready: 0s
        Then Query to the distributed table via load balancer should never fail
        OK
        And Query to the local table via load balancer should never fail
        OK
      OK

It is still possible that replication will lag on big tables, but the chance is significantly lower.

@mcgrawia
Contributor Author

Hi @alex-zaitsev, thanks for the update! I am not up to speed yet on the 0.15.0 replica-addition algorithm, but my understanding of 0.14.0 is as follows:

  1. Wait for pod to be Ready
    • When it’s ready, K8s adds it to the host service
  2. Query other hosts for tables
  3. Copy tables, via host service address
  4. Add host to cluster config replicas
  5. Add Ready: yes label to pod (exposes it in cluster service)

Would it be possible to add a step between 4 and 5 that waits for /replicas_status to return a 200 before adding the Ready: yes label? This should be doable because at step 4 the host is already exposed via the host service, so it should be possible to issue an HTTP request to http://chi-test-0-1:8123/replicas_status. Then, once that returns 200, adding the Ready: yes label would expose the host to customers via the cluster service.
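
A rough sketch of that extra step, as a shell loop against the per-host service (host name and polling interval are illustrative):

# Hypothetical wait between steps 4 and 5: poll /replicas_status until it
# returns HTTP 200, i.e. replication lag is within the configured threshold.
until [ "$(curl -s -o /dev/null -w '%{http_code}' http://chi-test-0-1:8123/replicas_status)" = "200" ]
do
  sleep 1
done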

@alex-zaitsev
Member

Hi @mcgrawia,
We do not want to add that by default, because the replica may never become ready (e.g. on highly loaded systems).

Alternatives:

  1. Add a special option that would wait for replication to complete.
  2. Use a Distributed table to query the replicated table. In that case you may rely on https://clickhouse.tech/docs/en/operations/settings/settings/#settings-max_replica_delay_for_distributed_queries (see the sketch below).
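
A minimal sketch of option 2, reusing the hypothetical test_scaling_distr Distributed table from the earlier comment (the 30-second threshold is arbitrary):

# Skip replicas lagging more than 30 seconds behind, and fail the query
# instead of falling back to stale replicas.
echo "SELECT count() FROM test_scaling_distr
      SETTINGS max_replica_delay_for_distributed_queries = 30,
               fallback_to_stale_replicas_for_distributed_queries = 0" \
  | curl 'http://localhost:8123/' --data-binary @-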

@mcgrawia
Contributor Author

Thanks @alex-zaitsev, that makes sense. I'll look into the Distributed table and see how we can use that as well.
