New replicas start accepting traffic before data/tables are created #561

Closed
mcgrawia opened this issue Oct 6, 2020 · 9 comments
Labels
work in progress (This feature is not completed yet)

Comments

@mcgrawia
Contributor

mcgrawia commented Oct 6, 2020

When increasing the replica count of an existing cluster, the new replicas are opened to traffic before the replicated tables and data exist on the new node.

Steps to reproduce using docker-for-mac Kubernetes:

  1. Apply a single-replica ClickHouseInstallation with ZooKeeper settings (set up ZooKeeper first):
apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: test-scaling
spec:
  configuration:
    users:
      default/networks/ip: ::/0
    clusters:
      - name: test-scaling
        layout:
          shardsCount: 1
          replicasCount: 1
    zookeeper:
      nodes:
        - host: scaling-clickhouse-zookeeper
          port: 2181
  2. Start some load on the cluster service:
while true
do
  echo 'select count() from test_scaling' | curl 'http://localhost:8123/' --data-binary @- >> out.txt
  sleep 0.5
done
  3. Write some data to the single replica:
CREATE TABLE test_scaling (num UInt32)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/test_scaling', '{replica}') ORDER BY tuple();

INSERT INTO test_scaling
SELECT * FROM system.numbers
LIMIT 100000000;
  4. Apply a YAML update to increase replicasCount to 2:
apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: test-scaling
spec:
  configuration:
    users:
      default/networks/ip: ::/0
    clusters:
      - name: test-scaling
        layout:
          shardsCount: 1
          replicasCount: 2
    zookeeper:
      nodes:
        - host: scaling-clickhouse-zookeeper
          port: 2181
  5. View the output:
cat out.txt

Results:

# section where initial table did not exist yet
Code: 60, e.displayText() = DB::Exception: Table default.test_scaling doesn't exist. (version 20.9.2.20 (official build))
Code: 60, e.displayText() = DB::Exception: Table default.test_scaling doesn't exist. (version 20.9.2.20 (official build))
...
# section where table was created and populated with one replica
0
0
0
0
0
0
0
9436905
29359260
51378705
75495240
100000000
100000000
100000000
100000000
...
100000000
100000000
100000000
100000000
100000000
100000000
# second replica came online and started serving traffic before all the tables and data were copied. 
Code: 60, e.displayText() = DB::Exception: Table default.test_scaling doesn't exist. (version 20.9.2.20 (official build))
100000000
100000000
Code: 60, e.displayText() = DB::Exception: Table default.test_scaling doesn't exist. (version 20.9.2.20 (official build))
Code: 60, e.displayText() = DB::Exception: Table default.test_scaling doesn't exist. (version 20.9.2.20 (official build))
Code: 60, e.displayText() = DB::Exception: Table default.test_scaling doesn't exist. (version 20.9.2.20 (official build))
Code: 60, e.displayText() = DB::Exception: Table default.test_scaling doesn't exist. (version 20.9.2.20 (official build))
100000000
0
9825130
26601850
100000000
100000000
100000000
100000000
100000000
100000000
100000000
100000000

This example uses a single table and a small data set, but the problem causes significant outages for us when adding a replica.

mcgrawia changed the title from "New replica starts accepting traffic before data/tables are created" to "New replicas starts accepting traffic before data/tables are created" on Oct 6, 2020
mcgrawia changed the title from "New replicas starts accepting traffic before data/tables are created" to "New replicas start accepting traffic before data/tables are created" on Oct 6, 2020
@mcgrawia
Contributor Author

mcgrawia commented Oct 7, 2020

Similarly, it looks like when replicas are removed, the tables are deleted before the pods are removed from the cluster service:

Logs from the operator:

I1007 01:55:28.417079       1 announcer.go:99] Delete host test-scaling/0-1 - started
I1007 01:55:28.814533       1 schemer.go:74] Run query on: chi-test-scaling-test-scaling-0-1.default.svc.cluster.local of [chi-test-scaling-test-scaling-0-1.default.svc.cluster.local]
I1007 01:55:28.825744       1 schemer.go:277] Drop tables: [test_scaling] as [DROP TABLE IF EXISTS "default"."test_scaling"]
I1007 01:55:29.805738       1 announcer.go:99] Deleted tables on host 0-1 replica 1 to shard 0 in cluster test-scaling
I1007 01:55:29.822884       1 deleter.go:40] Controller delete host started test-scaling/0-1

content of out.txt:

# reverted back to 1-replica yaml here
1000388226
1000388226
1000388226
1000388226
1000388226
Code: 60, e.displayText() = DB::Exception: Table default.test_scaling doesn't exist. (version 20.9.2.20 (official build))
1000388226
1000388226
1000388226
Code: 60, e.displayText() = DB::Exception: Table default.test_scaling doesn't exist. (version 20.9.2.20 (official build))
1000388226
1000388226
1000388226

@filimonov
Member

Similar to ClickHouse/ClickHouse#10963

@alex-zaitsev
Member

@mcgrawia, we know about that. Some improvements were made in the 0.12.0 release, and we plan to implement more careful handling of readiness checks in the next release.

@alex-zaitsev
Member

Should be fixed in https://github.com/Altinity/clickhouse-operator/releases/tag/0.13.0

@alex-zaitsev
Member

Actually, it is not completely fixed. The problem is that the operator uses the service to connect to the pod, and external users use the same service. So as soon as the pod is live and ready, it is accessible both to the operator and to external users.
There is a workaround, however: use a Distributed table on top of the replicated one. The operator adds a new replica to the cluster definition (remote_servers) only after the schema is created, so querying through a Distributed table mitigates the problem almost entirely.
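
For the original repro, a rough sketch of that workaround could look like the following (the test_scaling_distr table name and the 'test-scaling' cluster name are assumptions based on the CHI above, not taken from an actual run):

# Assumed sketch: create a Distributed table over the replicated test_scaling
# table, then point the load loop at the Distributed table instead.
echo "CREATE TABLE test_scaling_distr AS test_scaling
      ENGINE = Distributed('test-scaling', 'default', 'test_scaling')" \
  | curl 'http://localhost:8123/' --data-binary @-

echo 'select count() from test_scaling_distr' \
  | curl 'http://localhost:8123/' --data-binary @- >> out.txt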

My test results are below. The test repeats the original scenario but also adds a Distributed table. Then it runs the following queries in a loop with a 1-second sleep between tries:

            cnt_local = clickhouse.query_with_error(chi, "select count() from test_local", "chi-test-025-rescaling-default-0-1.test.svc.cluster.local")
            cnt_distr = clickhouse.query_with_error(chi, "select count() from test_distr", "chi-test-025-rescaling-default-0-1.test.svc.cluster.local")

Here is part of the log:

      When Add one more replica, but do not wait for completion
        And /Users/alz/go/src/github.com/altinity/clickhouse-operator/tests/configs/test-025-rescaling-2.yaml is applied
        OK
        Then Waiting for: 2 statefulsets, 2 pods and 3 services to be available
          And Not ready yet. [ statefulset: 1 pod: 1 service: 2 ]. Wait for 5 seconds
          OK
          And Not ready yet. [ statefulset: 2 pod: 2 service: 2 ]. Wait for 10 seconds
          OK
          And Not ready yet. [ statefulset: 2 pod: 2 service: 2 ]. Wait for 15 seconds
          OK
        OK
        Then chi test-025-rescaling .status.status should be InProgress
        OK
      OK
      Then Query second pod using service as soon as it is available
local: Received exception from server (version 20.8.7):
Code: 60. DB::Exception: Received from chi-test-025-rescaling-default-0-1.test.svc.cluster.local:9000. DB::Exception: Table default.test_local doesn't exist.. 
command terminated with exit code 60, distr: Received exception from server (version 20.8.7):
Code: 60. DB::Exception: Received from chi-test-025-rescaling-default-0-1.test.svc.cluster.local:9000. DB::Exception: Table default.test_distr doesn't exist.. 
command terminated with exit code 60
local: Received exception from server (version 20.8.7):
Code: 60. DB::Exception: Received from chi-test-025-rescaling-default-0-1.test.svc.cluster.local:9000. DB::Exception: Table default.test_local doesn't exist.. 
command terminated with exit code 60, distr: Received exception from server (version 20.8.7):
Code: 60. DB::Exception: Received from chi-test-025-rescaling-default-0-1.test.svc.cluster.local:9000. DB::Exception: Table default.test_distr doesn't exist.. 
command terminated with exit code 60
local: Received exception from server (version 20.8.7):
Code: 60. DB::Exception: Received from chi-test-025-rescaling-default-0-1.test.svc.cluster.local:9000. DB::Exception: Table default.test_local doesn't exist.. 
command terminated with exit code 60, distr: Received exception from server (version 20.8.7):
Code: 60. DB::Exception: Received from chi-test-025-rescaling-default-0-1.test.svc.cluster.local:9000. DB::Exception: Table default.test_distr doesn't exist.. 
command terminated with exit code 60
local: Received exception from server (version 20.8.7):
Code: 60. DB::Exception: Received from chi-test-025-rescaling-default-0-1.test.svc.cluster.local:9000. DB::Exception: Table default.test_local doesn't exist.. 
command terminated with exit code 60, distr: Received exception from server (version 20.8.7):
Code: 60. DB::Exception: Received from chi-test-025-rescaling-default-0-1.test.svc.cluster.local:9000. DB::Exception: Table default.test_distr doesn't exist.. 
command terminated with exit code 60
local: 0, distr: 100000000
local: 388225, distr: 100000000
local: 24504760, distr: 100000000
local: 24504760, distr: 100000000
local: 100000000, distr: 100000000
      OK

So for about 5 seconds the tables were missing, but as soon as the tables are created the Distributed table returns the full data. The replicated table requires 5 more seconds to catch up.

We planned to use more aggressive readiness checks, but that would not solve the problem, since not-ready pods are not accessible through a service either.

Ideas are welcome. One possible option is to connect to the pod directly, without using a service. That could be done by integrating clickhouse-go, or by using some dummy container with clickhouse-client inside the operator itself.
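
As a rough illustration of that idea, a manual equivalent would be to exec into the pod and run clickhouse-client there, bypassing the services (pod name and namespace taken from the log above; the operator itself would need clickhouse-go or a sidecar client to do the same):

# Manual sketch: query the new replica directly inside its pod,
# without going through any Kubernetes service.
kubectl -n test exec chi-test-025-rescaling-default-0-1-0 -- \
  clickhouse-client --query 'SELECT count() FROM default.test_local'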

CC @mcgrawia

sunsingerus added the "work in progress" label on Jan 27, 2021
@alex-zaitsev
Member

@mcgrawia, here is a log with the 0.15.0 operator version:

      When Add one more replica, but do not wait for completion
        And /Users/alz/go/src/github.com/altinity/clickhouse-operator/tests/configs/test-025-rescaling-2.yaml is applied
        OK
        Then 2 pod(s)  should be created
          And Not ready. Wait for 5 seconds
          OK
        OK
        Then chi test-025-rescaling .status.status should be InProgress
        OK
      OK
      Then Query second pod using service as soon as pod is in ready state
        And pod chi-test-025-rescaling-default-0-1-0 .metadata.labels."clickhouse\.altinity\.com/ready" should be yes
          And Not ready. Wait for 1 seconds
          OK
          And Not ready. Wait for 2 seconds
          OK
          And Not ready. Wait for 3 seconds
          OK
          And Not ready. Wait for 4 seconds
          OK
          And Not ready. Wait for 5 seconds
          OK
          And Not ready. Wait for 6 seconds
          OK
        OK
local via loadbalancer: 100000000, distributed via loadbalancer: 100000000
local: 100000000, distr: 100000000
Tables not ready: 0s, data not ready: 0s
        Then Query to the distributed table via load balancer should never fail
        OK
        And Query to the local table via load balancer should never fail
        OK
      OK

It is still possible that replication will lag on big tables, but the chance is significantly lower.

@mcgrawia
Contributor Author

Hi @alex-zaitsev, thanks for the update! I am not up to speed yet on the 0.15.0 replica-addition algorithm, but my understanding of 0.14.0 is as follows:

  1. Wait for pod to be Ready
    • When it’s ready, K8s adds it to the host service
  2. Query other hosts for tables
  3. Copy tables, via host service address
  4. Add host to cluster config replicas
  5. Add Ready: yes label to pod (exposes it in cluster service)

Would it be possible to add a step between 4 and 5 that waits for /replicas_status to return a 200 before adding the Ready: yes label? This should be doable because at step 4 the host is already exposed via the host service, so it should be possible to issue an HTTP request to http://chi-test-0-1:8123/replicas_status. Then, once that returns 200, adding the Ready: yes label would expose the host to customers via the cluster service.
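
A rough sketch of that extra step, as a shell loop against the per-host service (host name and polling interval are illustrative):

# Hypothetical wait between steps 4 and 5: poll /replicas_status until it
# returns HTTP 200, i.e. replication lag is within the configured threshold.
until [ "$(curl -s -o /dev/null -w '%{http_code}' http://chi-test-0-1:8123/replicas_status)" = "200" ]
do
  sleep 1
done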

@alex-zaitsev
Member

Hi @mcgrawia,
We do not want to add that by default, because the replica may never become ready (e.g. on highly loaded systems).

Alternatives:

  1. Add a special option that would wait for replication to complete.
  2. Use a Distributed table to query the replicated table. In that case you may rely on https://clickhouse.tech/docs/en/operations/settings/settings/#settings-max_replica_delay_for_distributed_queries (see the sketch below).
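
A minimal sketch of option 2, reusing the hypothetical test_scaling_distr Distributed table from the earlier comment (the 30-second threshold is arbitrary):

# Skip replicas lagging more than 30 seconds behind, and fail the query
# instead of falling back to stale replicas.
echo "SELECT count() FROM test_scaling_distr
      SETTINGS max_replica_delay_for_distributed_queries = 30,
               fallback_to_stale_replicas_for_distributed_queries = 0" \
  | curl 'http://localhost:8123/' --data-binary @-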

@mcgrawia
Contributor Author

Thanks @alex-zaitsev, that makes sense. I'll look into the Distributed table and see how we can use that as well.
