
Clickhouse Operator exception causes clickhouse cluster exception #890

Closed
czhfe opened this issue Feb 23, 2022 · 29 comments

czhfe commented Feb 23, 2022

Environment
Clickhouse Operator version: 0.17.0

Question
A ClickHouse Operator restart (OOM, update, etc.) causes exceptions in the ClickHouse cluster. The error is as follows:

2022.02.23 11:14:56.675790 [ 10484 ] {52fafb46-2a44-4028-9a3b-0e8d82237a02} <Error> TCPHandler: Code: 170, e.displayText() = DB::Exception: Requested cluster 'skyline' not found, Stack trace:


Can Clickhouse Operator support running multiple copies?

Slach (Collaborator) commented Feb 23, 2022

Could you share your kind: ClickHouseInstallation manifest?
Which layout do you use?

Can Clickhouse Operator support running multiple copies?

Multiple copies of what? Do you mean multiple clickhouse-server instances?

All clusters are defined in a ConfigMap and mounted as the file /etc/clickhouse-server/config.d/chop-generated-remote_servers.xml.
I'm not sure an OOM is the root cause of this exception.
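
A minimal way to inspect that generated file on a running pod (pod name and namespace are placeholders):

# Inspect the cluster definitions the operator generated inside a ClickHouse pod
kubectl exec <pod-name> -n <namespace> -- \
  cat /etc/clickhouse-server/config.d/chop-generated-remote_servers.xml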

czhfe (Author) commented Feb 23, 2022

Could you share your kind: ClickHouseInstallation manifest? Which layout do you use?

Can Clickhouse Operator support running multiple copies?

Multiple copies of what? Do you mean multiple clickhouse-server instances?

All clusters are defined in a ConfigMap and mounted as the file /etc/clickhouse-server/config.d/chop-generated-remote_servers.xml. I'm not sure an OOM is the root cause of this exception.

1. ClickHouseInstallation manifest

apiVersion: "clickhouse.altinity.com/v1"
kind: ClickHouseInstallation
metadata:
  name: clickhouse
spec:
  defaults:
    templates: 
      dataVolumeClaimTemplate: clickhouse-data
      podTemplate: clickhouse
      serviceTemplate: clickhouse-default
  configuration:
    zookeeper:
      nodes:
        - host: zookeeper-0.zookeeper-headless
          port: 2181
        - host: zookeeper-1.zookeeper-headless
          port: 2181
        - host: zookeeper-2.zookeeper-headless
          port: 2181
    clusters:
      - name: czhfe
        layout:
          shardsCount: 2
          replicasCount: 2
    profiles:
      default/distributed_aggregation_memory_efficient: "1"
      default/max_bytes_before_external_sort: "6184752906"
      default/max_bytes_before_external_group_by: "3865470566"
    settings:
      disable_internal_dns_cache: "1"
      max_server_memory_usage: "7730941133"
      prometheus/asynchronous_metrics: "true"
      prometheus/endpoint: /metrics
      prometheus/events: "true"
      prometheus/metrics: "true"
      prometheus/port: "8001"
      prometheus/status_info: "true"
    users:
      clickhouse_admin/networks/ip: "::/0"
      clickhouse_admin/password: "xxxxxxxxxx"
      clickhouse_admin/profile: default
      clickhouse_admin/access_management: 1
  templates:
    podTemplates:
      - name: clickhouse
        podDistribution:
          - type: ShardAntiAffinity
            scope: Shard
        spec:
          containers:
            - name: clickhouse-pod
              image: clickhouse/clickhouse-server:21.8.14.5
              ports:
              - name: metrics
                containerPort: 8001
              resources:
                requests:
                  memory: "8Gi"
                  cpu: "1"
                limits:
                  memory: "8Gi"
                  cpu: "1"
              env:
                - name: TZ
                  value: "Asia/Shanghai"
              lifecycle:
                preStop:
                  exec:
                    command: [ "/bin/sh","-c","clickhouse stop" ]
              livenessProbe:
                initialDelaySeconds: 10
                failureThreshold: 3
                periodSeconds: 10
                successThreshold: 1
                timeoutSeconds: 10
                httpGet:
                  port: http
                  scheme: HTTP
                  path: /ping
              readinessProbe:
                initialDelaySeconds: 5
                failureThreshold: 3
                periodSeconds: 10
                successThreshold: 1
                timeoutSeconds: 10
                httpGet:
                  port: http
                  scheme: HTTP
                  path: /ping
              startupProbe:
                initialDelaySeconds: 20
                failureThreshold: 30
                periodSeconds: 10
                successThreshold: 1
                timeoutSeconds: 10
                httpGet:
                  port: http
                  scheme: HTTP
                  path: /ping
          terminationGracePeriodSeconds: 60
    volumeClaimTemplates:
      - name: clickhouse-data
        reclaimPolicy: Retain
        spec:
          storageClassName: csi-disk-ssd
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 50Gi
    serviceTemplates:
      - name: clickhouse-default
        generateName: clickhouse-server
        spec:
          ports:
            - name: http
              port: 8123
            - name: tcp
              port: 9000
          type: ClusterIP

2. This refers to whether the clickhouse-operator itself can run multiple copies, not clickhouse-server.

3. This problem may occur whenever the clickhouse-operator restarts (a way to check for restarts is sketched below).
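
One way to confirm whether the operator pod was restarted or OOM-killed (namespace and label follow the default install; adjust if yours differ):

# Show the restart count of the operator pod
kubectl get pods -n kube-system -l app=clickhouse-operator
# Show the last termination reason of the operator container (e.g. OOMKilled)
kubectl get pods -n kube-system -l app=clickhouse-operator \
  -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}'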

Slach (Collaborator) commented Feb 23, 2022

1. I don't see a skyline cluster in your spec, only the czhfe cluster:

    clusters:
      - name: czhfe
        layout:
          shardsCount: 2
          replicasCount: 2

Could you share the following query result?

SELECT hostName() h, * FROM cluster('all-sharded',system.query_log) WHERE query_id='52fafb46-2a44-4028-9a3b-0e8d82237a02' FORMAT Vertical

This refers to whether the clickhouse-operator itself can run multiple copies, not clickhouse-server.

It's possible, but only with the same version of clickhouse-operator, and each clickhouse-operator instance shall watch separate namespaces (if you install clickhouse-operator in kube-system it will watch all namespaces; if you install it into any other namespace, it will watch only the namespace where it is installed).

If you have multiple clickhouse-operator instances, you could get race conditions.
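
To see which namespaces an operator instance watches, you can check its generated configuration; a sketch assuming the standard install's ConfigMap name and the watchNamespaces field (both may differ depending on operator version and install method):

# An empty watchNamespaces list means the operator watches all namespaces
kubectl get configmap etc-clickhouse-operator-files -n kube-system \
  -o yaml | grep -A 5 watchNamespaces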

Could you share the output of the following command?

kubectl get deploy --all-namespaces -l app=clickhouse-operator

This problem may occur whenever the clickhouse-operator restarts

I'm not sure about that.

czhfe (Author) commented Feb 23, 2022

1. I don't see a skyline cluster in your spec, only the czhfe cluster

Regarding the point above: the manifest is an example, so the cluster name is written as czhfe.

Could you share the following query result?

……
exception_code:                      170
exception:                           Code: 170, e.displayText() = DB::Exception: Requested cluster 'skyline' not found (version 21.6.5.37 (official build))
stack_trace:                         0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0x8b6cbba in /usr/bin/clickhouse
1. DB::Context::getCluster(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) const @ 0xf528752 in /usr/bin/clickhouse
2. DB::StorageDistributed::getCluster() const @ 0xfdf01b7 in /usr/bin/clickhouse
3. DB::StorageDistributed::write(std::__1::shared_ptr<DB::IAST> const&, std::__1::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::__1::shared_ptr<DB::Context>) @ 0xfdf2c6c in /usr/bin/clickhouse
4. DB::PushingToViewsBlockOutputStream::PushingToViewsBlockOutputStream(std::__1::shared_ptr<DB::IStorage> const&, std::__1::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::__1::shared_ptr<DB::Context>, std::__1::shared_ptr<DB::IAST> const&, bool) @ 0xf8b85da in /usr/bin/clickhouse
5. DB::InterpreterInsertQuery::execute() @ 0xf8b2e2b in /usr/bin/clickhouse
6. DB::PushingToViewsBlockOutputStream::PushingToViewsBlockOutputStream(std::__1::shared_ptr<DB::IStorage> const&, std::__1::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::__1::shared_ptr<DB::Context>, std::__1::shared_ptr<DB::IAST> const&, bool) @ 0xf8b7dbb in /usr/bin/clickhouse
7. DB::InterpreterInsertQuery::execute() @ 0xf8b2e2b in /usr/bin/clickhouse
8. ? @ 0xfc0a4a1 in /usr/bin/clickhouse
9. DB::executeQuery(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum, bool) @ 0xfc08b23 in /usr/bin/clickhouse
10. DB::TCPHandler::runImpl() @ 0x1042e932 in /usr/bin/clickhouse
11. DB::TCPHandler::run() @ 0x10441839 in /usr/bin/clickhouse
12. Poco::Net::TCPServerConnection::start() @ 0x12a3fd4f in /usr/bin/clickhouse
13. Poco::Net::TCPServerDispatcher::run() @ 0x12a417da in /usr/bin/clickhouse
14. Poco::PooledThread::run() @ 0x12b7ab39 in /usr/bin/clickhouse
15. Poco::ThreadImpl::runnableEntry(void*) @ 0x12b76b2a in /usr/bin/clickhouse
16. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
17. clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
……

It's possible, but only with the same version of clickhouse-operator, and each clickhouse-operator instance shall watch separate namespaces (if you install clickhouse-operator in kube-system it will watch all namespaces; if you install it into any other namespace, it will watch only the namespace where it is installed).

I don't know whether clickhouse-operator can run with more than one replica (i.e. the Deployment's replicas in Kubernetes set greater than 1), and I'm not sure whether running more than one replica has any impact.

Slach (Collaborator) commented Feb 23, 2022

The shared results are incomplete; you skipped the original query (you shared only the exception-related fields), so I don't know what exactly happened.

czhfe (Author) commented Feb 24, 2022

The shared results are incomplete; you skipped the original query (you shared only the exception-related fields), so I don't know what exactly happened.

kubectl get deploy --all-namespaces -l app=clickhouse-operator

NAMESPACE     NAME                                               READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   clickhouse-operator-altinity-clickhouse-operator   1/1     1            1           225d

SELECT hostName() h, * FROM cluster('all-sharded',system.query_log) WHERE query_id='52fafb46-2a44-4028-9a3b-0e8d82237a02' FORMAT Vertical

Row 1:
──────
h:                                   chi-clickhouse-skyline-0-0-0
type:                                ExceptionBeforeStart
event_date:                          2022-02-23
event_time:                          2022-02-23 11:14:56
event_time_microseconds:             2022-02-23 11:14:56.674389
query_start_time:                    2022-02-23 11:14:56
query_start_time_microseconds:       2022-02-23 11:14:56.674389
query_duration_ms:                   0
read_rows:                           0
read_bytes:                          0
written_rows:                        0
written_bytes:                       0
result_rows:                         0
result_bytes:                        0
memory_usage:                        0
current_database:                    default
query:                               INSERT INTO `482239998746169344`.`log_detail`(trace_id,product_code,app_code,service_name,service_version,service_instance,tracker_version,tracker_lang,env_code,tenant_code,user_code,parent_span_id,span_id,span_type,span_layer,start_time,end_time,duration,component,operation_name,peer,is_error,tags,tags.api_domain_without_protocol,tags.api_fingerprint,tags.db_type,tags.db_instance,tags.db_statement,tags.url,tags.http_method,tags.status_code,tags.cache_type,tags.mq_type,tags.mq_topic,tags.event,tags.error_kind,tags.message,tags.rpc_method,tags.state_code,logs,fingerprint,__time__,__event_time__,__topic__,__content__,tags.sql_fingerprint,tags.sql_fingerprint_hash,tags.db_params) VALUES
normalized_query_hash:               5927681148878365672
query_kind:
databases:                           []
tables:                              []
columns:                             []
projections:                         []
exception_code:                      170
exception:                           Code: 170, e.displayText() = DB::Exception: Requested cluster 'skyline' not found (version 21.6.5.37 (official build))
stack_trace:                         0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0x8b6cbba in /usr/bin/clickhouse
1. DB::Context::getCluster(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) const @ 0xf528752 in /usr/bin/clickhouse
2. DB::StorageDistributed::getCluster() const @ 0xfdf01b7 in /usr/bin/clickhouse
3. DB::StorageDistributed::write(std::__1::shared_ptr<DB::IAST> const&, std::__1::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::__1::shared_ptr<DB::Context>) @ 0xfdf2c6c in /usr/bin/clickhouse
4. DB::PushingToViewsBlockOutputStream::PushingToViewsBlockOutputStream(std::__1::shared_ptr<DB::IStorage> const&, std::__1::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::__1::shared_ptr<DB::Context>, std::__1::shared_ptr<DB::IAST> const&, bool) @ 0xf8b85da in /usr/bin/clickhouse
5. DB::InterpreterInsertQuery::execute() @ 0xf8b2e2b in /usr/bin/clickhouse
6. DB::PushingToViewsBlockOutputStream::PushingToViewsBlockOutputStream(std::__1::shared_ptr<DB::IStorage> const&, std::__1::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::__1::shared_ptr<DB::Context>, std::__1::shared_ptr<DB::IAST> const&, bool) @ 0xf8b7dbb in /usr/bin/clickhouse
7. DB::InterpreterInsertQuery::execute() @ 0xf8b2e2b in /usr/bin/clickhouse
8. ? @ 0xfc0a4a1 in /usr/bin/clickhouse
9. DB::executeQuery(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum, bool) @ 0xfc08b23 in /usr/bin/clickhouse
10. DB::TCPHandler::runImpl() @ 0x1042e932 in /usr/bin/clickhouse
11. DB::TCPHandler::run() @ 0x10441839 in /usr/bin/clickhouse
12. Poco::Net::TCPServerConnection::start() @ 0x12a3fd4f in /usr/bin/clickhouse
13. Poco::Net::TCPServerDispatcher::run() @ 0x12a417da in /usr/bin/clickhouse
14. Poco::PooledThread::run() @ 0x12b7ab39 in /usr/bin/clickhouse
15. Poco::ThreadImpl::runnableEntry(void*) @ 0x12b76b2a in /usr/bin/clickhouse
16. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
17. clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so

is_initial_query:                    1
user:                                fast
query_id:                            52fafb46-2a44-4028-9a3b-0e8d82237a02
address:                             ::ffff:10.16.0.113
port:                                57996
initial_user:                        fast
initial_query_id:                    52fafb46-2a44-4028-9a3b-0e8d82237a02
initial_address:                     ::ffff:10.16.0.113
initial_port:                        57996
interface:                           1
os_user:                             data-bus-skyline-482239998746169344-7f4f54d4d6-8dxwb
client_hostname:                     data-bus-skyline-482239998746169344-7f4f54d4d6-8dxwb
client_name:                         Golang SQLDriver
client_revision:                     54213
client_version_major:                1
client_version_minor:                1
client_version_patch:                54213
http_method:                         0
http_user_agent:
http_referer:
forwarded_for:
quota_key:
revision:                            54451
log_comment:
thread_ids:                          []
ProfileEvents.Names:                 []
ProfileEvents.Values:                []
Settings.Names:                      []
Settings.Values:                     []
used_aggregate_functions:            []
used_aggregate_function_combinators: []
used_database_engines:               []
used_data_type_families:             []
used_dictionaries:                   []
used_formats:                        []
used_functions:                      []
used_storages:                       []
used_table_functions:                []

Slach (Collaborator) commented Feb 24, 2022

OK. So now you have only one clickhouse-operator instance, which watches all namespaces.

Could you share the results of the following commands?

kubectl get chi --all-namespaces

kubectl exec chi-clickhouse-skyline-0-0-0 -n <namespace_where_chi_installed> -- clickhouse-client -mn -q "SHOW CREATE TABLE \`482239998746169344\`.\`log_detail\` FORMAT Vertical; SELECT * FROM system.clusters FORMAT Vertical"

czhfe (Author) commented Feb 24, 2022

OK. So now you have only one clickhouse-operator instance, which watches all namespaces.

Could you share the results of the following commands?

kubectl get chi --all-namespaces

kubectl exec chi-clickhouse-skyline-0-0-0 -n <namespace_where_chi_installed> -- clickhouse-client -mn -q "SHOW CREATE TABLE \`482239998746169344\`.\`log_detail\` FORMAT Vertical; SELECT * FROM system.clusters FORMAT Vertical"

Sorry, some of the table structures are not easy to share. Can clickhouse-operator run multiple copies? Does that have any impact?

Slach (Collaborator) commented Feb 24, 2022

Sorry, some of the table structures are not easy to share. Can clickhouse-operator run multiple copies? Does that have any impact?

I already asked and you didn't answer:
#890 (comment)

Multiple copies of what?

czhfe (Author) commented Feb 24, 2022

Sorry, some of the table structures are not easy to share. Can clickhouse-operator run multiple copies? Does that have any impact?

I already asked and you didn't answer #890 (comment)

Multiple copies of what?

clickhouse-operator. I'm referring to whether the replicas of the clickhouse-operator Deployment in Kubernetes can be set greater than 1 (i.e. multiple replicas).

Slach (Collaborator) commented Feb 24, 2022

Yes, clickhouse-operator can be run in different namespaces.
No, clickhouse-operator shall have replicas: 1 in its Deployment.
Did you change replicas in the clickhouse-operator Deployment?

You shared the output of the following command:

kubectl get deploy --all-namespaces -l app=clickhouse-operator
NAMESPACE     NAME                                               READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   clickhouse-operator-altinity-clickhouse-operator   1/1     1            1           225d

It means you have only ONE copy of clickhouse-operator.
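
You can also confirm the configured replica count directly (deployment name and namespace taken from the output above):

# Should print 1; clickhouse-operator is not designed to run with more replicas
kubectl get deploy clickhouse-operator-altinity-clickhouse-operator -n kube-system \
  -o jsonpath='{.spec.replicas}'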

So, I ask again:
You shared a manifest with no skyline cluster.

Please share the kind: ClickHouseInstallation manifest for skyline,
and the result of SELECT engine_full FROM system.tables WHERE database='482239998746169344' AND table='log_detail'

czhfe (Author) commented Feb 25, 2022

Yes, what I want to confirm now is whether the clickhouse-operator Deployment in a single namespace can run multiple copies.

czhfe (Author) commented Feb 25, 2022

An abnormal restart of the clickhouse-operator causes it to reconcile the installation again after restarting; isn't that the same problem mentioned in issue #855?

chanadian commented Feb 28, 2022

I recently ran into the same symptoms after upgrading the operator from 0.13.5 to 0.18.2, but with 1 operator in a namespace. The cluster was removed even though the CRD was upgraded. It seems the root cause is that if the install spec is invalid (see below), the cluster may be removed. Here are the clickhouse-operator logs:

"ClickHouseInstallation.clickhouse.altinity.com \"app\" is invalid: spec.templating.policy: Unsupported value: \"\": supported values: \"auto\", \"manual\""

We had to revert and manually add "" to templating.policy in the CHI install template clickhouseinstallationtemplates.clickhouse.altinity.com in order to get our cluster back. We have held off from upgrading for this reason

Slach (Collaborator) commented Mar 1, 2022

@chanadian, could you provide the sequence of your actions?
How exactly did you upgrade the CRDs and the operator?
How much time passed between upgrading the CRDs and upgrading clickhouse-operator?

chanadian commented:
I followed the upgrade instructions listed here: https://github.com/Altinity/clickhouse-operator/blob/master/docs/operator_upgrade.md

Sequence:

  1. kubectl delete deploy -n kube-system clickhouse-operator
  2. Waited for operator to be terminated
  3. kubectl apply -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/0.18.2/deploy/operator/parts/crd.yaml
  4. Deployed new version of ch operator
  5. New operator came up, restarted pod 0.
  6. Operator logs complained about an invalid spec and the cluster was removed from system.clusters

How much time passed between upgrading the CRDs and upgrading clickhouse-operator?

Steps 3 and 4 were done back to back, almost simultaneously. Should there have been a wait?

Note:

  • I did not change anything in our CHI as that seemed compatible
  • Would I need to delete the clickhouseinstallationtemplate crd too?

czhfe (Author) commented Mar 4, 2022

I recently ran into the same symptoms after upgrading the operator from 0.13.5 to 0.18.2, but with 1 operator in a namespace. The cluster was removed even though the CRD was upgraded. It seems the root cause is that if the install spec is invalid (see below), the cluster may be removed. Here are the clickhouse-operator logs:

"ClickHouseInstallation.clickhouse.altinity.com \"app\" is invalid: spec.templating.policy: Unsupported value: \"\": supported values: \"auto\", \"manual\""

We had to revert and manually add "" to templating.policy in the CHI install template clickhouseinstallationtemplates.clickhouse.altinity.com in order to get our cluster back. We have held off from upgrading for this reason

You should first update /spec/templating/policy to the new standard, i.e. set it to auto or manual (before this change it defaults to ""), then update the CRDs, and finally update clickhouse-operator, to avoid this problem (this should be considered a bug in clickhouse-operator). I had this problem before: #842. For example:

kubectl patch clickhouseinstallations {chi} \
-n {namespace} \
--type='json' \
-p='[{"op": "replace", "path": "/spec/templating/policy", "value": "manual"}]'

alex-zaitsev (Member) commented:
@chanadian, yes, the template CRD needs to be updated as well! Thanks for the catch!
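
The patch approach from the previous comment can be applied to the template objects as well; a sketch assuming the templates carry the same /spec/templating/policy field (as the report above suggests) and that the field already exists in the object:

# List existing installation templates, then patch templating.policy the same way as for the CHI
kubectl get clickhouseinstallationtemplates --all-namespaces
# JSON-patch "replace" requires the path to exist already; use an "add" op otherwise
kubectl patch clickhouseinstallationtemplates {chit} \
  -n {namespace} \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/templating/policy", "value": "manual"}]'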

cw9 commented Sep 1, 2022

Hi, we are running into a similar issue with operator v0.18.4: during a scale-up, when we ran out of resources for the new CH nodes, the operator got stuck in a bad state and lost some clusters from the settings, which caused queries to return errors with Requested cluster 'XXX' not found. Any advice on how to avoid this?

cw9 commented Sep 1, 2022

E.g. the operator is managing 3 CH clusters in a namespace: cluster1, cluster2 and cluster3. When it got stuck scaling up cluster1, the cluster2 config was removed from cluster3, so cluster3 could no longer communicate with cluster2, even though both clusters were running fine with their existing capacity.
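
A quick way to see which clusters a given node still has configured, e.g. on a cluster3 pod (pod name and namespace are placeholders):

# List the cluster names this node currently sees in its configuration
kubectl exec <cluster3-pod> -n <namespace> -- \
  clickhouse-client -q "SELECT DISTINCT cluster FROM system.clusters"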

Slach (Collaborator) commented Sep 2, 2022

@cw9, do you mean you have 3 separate CHIs in different namespaces and 1 clickhouse-operator in kube-system?

cw9 commented Sep 2, 2022

Nope, I meant 3 ClickHouse clusters in one k8s namespace operated by one operator

Slach (Collaborator) commented Sep 2, 2022

@cw9, spec.configuration.clusters in kind: ClickHouseInstallation can contain multiple clusters.

So could you clarify, and share:
kubectl get chi --all-namespaces

cw9 commented Sep 4, 2022

kubectl get chi --all-namespaces
NAMESPACE   NAME         CLUSTERS   HOSTS   STATUS      AGE
xxx      my-use-case-v1   3          ??      Completed   9d

in that namespace I have the following clusters:

spec:
  configuration:
    clusters:
      - name: cluster1
        # other config
      - name: cluster2
        # other config
      - name: cluster3
        # other config

Slach (Collaborator) commented Sep 5, 2022

OK. Could you share:

kubectl get deployment -l app=clickhouse-operator --all-namespaces

Slach (Collaborator) commented Sep 5, 2022

@cw9, separate clusters shall be deployed on separate pods.
Do you mean you have the same /etc/clickhouse-server/config.d/chop-generated-remote_servers.xml on all clusters during the scale-up?

Could you share:

kubectl logs -n <your-clickhouse-operator-namespace> -l app=clickhouse-operator -c clickhouse-operator --since=24h

cw9 commented Sep 9, 2022

kubectl get deployment -l app=clickhouse-operator --all-namespaces
NAMESPACE   NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
xxx      clickhouse-operator   1/1     1            1           79d

cw9 commented Sep 9, 2022

@Slach, yes, the separate clusters are all deployed on separate pods; each node should have the same chop-generated-remote_servers.xml across all clusters during the scale-up.

The nodes can see all the clusters. We usually use cluster3 as query-federation nodes to combine data from cluster1 and cluster2. But during this incident, while cluster1 was failing to scale up, cluster2 went missing from remote_servers.xml even though it was up and running normally, which left cluster3 able to query only cluster1.
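
For illustration, a federated query of this kind (database and table names are hypothetical) only works while the referenced cluster name is present in the generated remote_servers.xml:

# Run from a cluster3 node; some_db/some_table are placeholders
kubectl exec <cluster3-pod> -n <namespace> -- clickhouse-client -q \
  "SELECT count() FROM cluster('cluster2', some_db, some_table)"
# If 'cluster2' has been dropped from chop-generated-remote_servers.xml, this fails with
# Code: 170 ... Requested cluster 'cluster2' not found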

alex-zaitsev (Member) commented:
This should be fixed in earlier releases; additional rules were added in https://github.com/Altinity/clickhouse-operator/releases/tag/release-0.22.1
