
Refactor ClickHouse monitor implementation #3498

Merged
merged 4 commits into antrea-io:main from single-pod-clickhouse-monitor on Mar 25, 2022

Conversation

yanjunz97 (Contributor)

This PR implements the flow visibility ClickHouse monitor as a long-running Pod instead of the CronJob used previously. This implementation brings the following advantages:

  • It reduces the overhead of creating and destroying a new Pod every time the monitor executes.
  • It avoids reading the K8s logs to check the last execution state.
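For illustration only, here is a minimal sketch of what such a long-running monitor loop could look like; the interval, function names, and logging are assumptions made for the example and are not taken from this PR:

```go
package main

import (
	"log"
	"time"
)

// monitorExecInterval replaces the CronJob schedule; the value here is an
// assumption for this sketch.
const monitorExecInterval = 1 * time.Minute

// checkUsageAndTrim is a hypothetical placeholder for the monitor logic,
// e.g. checking ClickHouse disk usage and deleting the oldest records when
// a threshold is exceeded.
func checkUsageAndTrim() error {
	return nil
}

func main() {
	ticker := time.NewTicker(monitorExecInterval)
	defer ticker.Stop()
	// The Pod keeps running between rounds, so there is no Pod creation
	// overhead and no need to read K8s logs to recover the last state.
	for range ticker.C {
		if err := checkUsageAndTrim(); err != nil {
			log.Printf("monitor round failed: %v", err)
		}
	}
}
```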

Signed-off-by: Yanjun Zhou <zhouya@vmware.com>

@codecov-commenter commented Mar 21, 2022

Codecov Report

Merging #3498 (d730119) into main (6c4e5a3) will decrease coverage by 11.08%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##             main    #3498       +/-   ##
===========================================
- Coverage   65.59%   54.50%   -11.09%     
===========================================
  Files         268      383      +115     
  Lines       26780    42043    +15263     
===========================================
+ Hits        17567    22917     +5350     
- Misses       7314    16776     +9462     
- Partials     1899     2350      +451     
Flag Coverage Δ
integration-tests 35.83% <ø> (?)
kind-e2e-tests 54.01% <ø> (-1.83%) ⬇️
unit-tests 43.06% <ø> (+0.48%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...g/agent/apiserver/handlers/featuregates/handler.go 0.00% <0.00%> (-82.36%) ⬇️
...kg/apiserver/registry/system/supportbundle/rest.go 20.45% <0.00%> (-54.55%) ⬇️
pkg/support/dump.go 8.19% <0.00%> (-49.19%) ⬇️
...egator/apiserver/handlers/recordmetrics/handler.go 0.00% <0.00%> (-44.45%) ⬇️
pkg/support/dump_others.go 0.00% <0.00%> (-44.00%) ⬇️
...g/agent/apiserver/handlers/addressgroup/handler.go 0.00% <0.00%> (-40.00%) ⬇️
...agent/apiserver/handlers/appliedtogroup/handler.go 0.00% <0.00%> (-40.00%) ⬇️
...gregator/apiserver/handlers/flowrecords/handler.go 0.00% <0.00%> (-40.00%) ⬇️
pkg/apiserver/handlers/loglevel/handler.go 0.00% <0.00%> (-38.47%) ⬇️
pkg/ovs/ovsctl/ofctl.go 19.10% <0.00%> (-17.98%) ⬇️
... and 171 more

@@ -4829,6 +4784,44 @@ spec:
---
apiVersion: apps/v1
kind: Deployment
Contributor

I thought we were going to use a new container within the same Deployment, and not a new Deployment?

There is value in keeping the number of Pods down.

Contributor Author

We are considering deploying a ClickHouse cluster in the future, which may have multiple ClickHouse Pods, but we only need one monitor for the whole cluster. That is why we split the monitor Deployment from the ClickHouse one.

Contributor

The fact is that at the moment there is a single replica, so there is no reason not to go with the simpler solution.

Looking at the ClickHouse operator in more detail, it is very easy to add a container. And because the ClickHouse servers run as a StatefulSet, and not as a Deployment, if we ever have more than one replica, it will be easy to run the monitor for the first replica only (only one container will do the work; the others will not do anything).
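As a side note, here is a minimal sketch of how a monitor container could restrict itself to the first StatefulSet replica; deriving the ordinal from the Pod hostname and the "-0" suffix check are assumptions made for this example, not code from the PR:

```go
package main

import (
	"log"
	"os"
	"strings"
)

// isFirstReplica reports whether this Pod is ordinal 0 of its StatefulSet,
// based on the ordinal suffix that StatefulSets append to Pod names
// (an assumption for this sketch).
func isFirstReplica() bool {
	hostname, err := os.Hostname()
	if err != nil {
		return false
	}
	return strings.HasSuffix(hostname, "-0")
}

func main() {
	if !isFirstReplica() {
		// Not the first replica: idle forever so the container stays up
		// without doing any monitor work.
		log.Println("not the first replica, monitor stays idle")
		select {}
	}
	// ... run the monitor loop here ...
}
```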

Contributor Author

Got it. Thanks Antonin, I updated the code to move the monitor to the ClickHouse pod.

build/yamls/flow-visibility.yml (resolved review thread)
// Retry connection to ClickHouse every 5 seconds if it fails.
connRetryInterval = 5 * time.Second
// Retry connection to ClickHouse every 10 seconds if it fails.
connRetryInterval = 10 * time.Second
Contributor

Just curious why we changed the retry interval this time?

Contributor Author

I observed 4 to 5 retries when the monitor started, as the ClickHouse server was not ready yet, and thought it would make sense to slightly lower the retry frequency.
I also considered adding some sleep time before the monitor starts, but it does not make sense to sleep first if the monitor is recovering from a crash.
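For illustration, here is a minimal sketch of a connection-retry loop built around this interval; the environment variable names, retry count, and DSN format are assumptions made for the example, not taken from the PR:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"os"
	"time"

	// Assumed ClickHouse SQL driver; it registers the "clickhouse" driver name.
	_ "github.com/ClickHouse/clickhouse-go"
)

const connRetryInterval = 10 * time.Second

// connectClickHouse retries until the ClickHouse server answers a ping,
// covering the case where the server container is not ready yet when the
// monitor starts (or restarts after a crash).
func connectClickHouse() (*sql.DB, error) {
	// DB_URL is expected to look like "tcp://localhost:9000" (assumption).
	dsn := fmt.Sprintf("%s?username=%s&password=%s",
		os.Getenv("DB_URL"), os.Getenv("CLICKHOUSE_USERNAME"), os.Getenv("CLICKHOUSE_PASSWORD"))
	var lastErr error
	for i := 0; i < 10; i++ {
		db, err := sql.Open("clickhouse", dsn)
		if err == nil {
			err = db.Ping()
			if err == nil {
				return db, nil
			}
			db.Close()
		}
		lastErr = err
		time.Sleep(connRetryInterval)
	}
	return nil, fmt.Errorf("failed to connect to ClickHouse: %v", lastErr)
}

func main() {
	db, err := connectClickHouse()
	if err != nil {
		log.Fatalf("giving up: %v", err)
	}
	defer db.Close()
	log.Println("connected to ClickHouse")
}
```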

@yanjunz97 force-pushed the single-pod-clickhouse-monitor branch from 4924745 to 100cd4d on March 23, 2022 at 01:34
key: password
name: clickhouse-secret
- name: DB_URL
value: tcp://clickhouse-clickhouse.flow-visibility.svc:9000
Contributor

There is no need to use the Service DNS name anymore. The containers are in the same network namespace now, so you can just use tcp://localhost:9000.

Another question, related to @dreamtalen's comment above: I don't see a way to change the port to something other than 9000 for the ClickHouse server in this manifest. How would a user be able to do that?

Contributor Author

Updated the DNS.

Contributor Author

Previously we did not provide a clear way to change the port. I updated the code to allow it, but currently the user needs multiple steps to complete the change.

I'm not sure whether there are use cases where the ports exposed only inside the ClickHouse Pod need to be changed, but if the user would like to change the port exposed in the Pod, they need to update:

  • The corresponding port in the hostTemplates in clickhouse.yml
  • The DB_URL env used by the monitor in clickhouse.yml
  • The port used to connect to the database when initializing the ClickHouse server in create_table.sh

If the user would like to change the port exposed outside the Pod, i.e. by the clickhouse-clickhouse Service, they need to update:

  • The corresponding port in the serviceTemplates in clickhouse.yml
  • The port used by Grafana, defined in datasource_provider.yml
  • The databaseURL used by the Flow Aggregator, defined in flow-aggregator.conf, introduced in PR Add ClickHouse Client #3196

I think it is not convenient for users to make all these changes just to use a port other than 9000. I would like to hear suggestions from @heanlan on what we can do to make the change easier.

Contributor

Thanks for the investigation. It looks fine to me for now. @antoninbas previously mentioned that we'll support Helm later. With Helm, port values in different places can be configured easily by sharing a common values.yaml.

plugins/flow-visibility/clickhouse-monitor/main.go (3 outdated review threads, resolved)
Signed-off-by: Yanjun Zhou <zhouya@vmware.com>
Signed-off-by: Yanjun Zhou <zhouya@vmware.com>
@yanjunz97 force-pushed the single-pod-clickhouse-monitor branch from 43e3e53 to 9f6fcb8 on March 24, 2022 at 01:36
@tnqn removed this from the Antrea v1.6 release milestone on Mar 24, 2022
Signed-off-by: Yanjun Zhou <zhouya@vmware.com>
@yanjunz97 (Contributor Author)

After some more discussion with @heanlan, we could not come up with a use case where a user would want to change the port exposed only inside the ClickHouse server container, and I think enabling this change may introduce some confusion. Thus I only keep the way to change the clickhouse-clickhouse Service ports, as mentioned in the second part of #3498 (comment). I can add it back if anyone thinks it is necessary. cc @dreamtalen @wsquan171

port: 8123
- name: tcp
port: 9000
type: LoadBalancer
Contributor

I don't think this needs to be a LoadBalancer Service. It's not accessed from outside the cluster; the typical deployment at the moment is to deploy all the flow visibility components in the same cluster as the workloads. If the cluster doesn't support LoadBalancer Services, the external IP will always show as pending.

Contributor Author

Thanks for pointing this out. It makes sense to me to use ClusterIP for now. Updated.

Signed-off-by: Yanjun Zhou <zhouya@vmware.com>
@yanjunz97 force-pushed the single-pod-clickhouse-monitor branch from 1bc8958 to d730119 on March 24, 2022 at 22:52
@antoninbas (Contributor) left a comment

LGTM

@antoninbas (Contributor)

/skip-all

can skip tests based on code changes

@antoninbas merged commit a2c6a1f into antrea-io:main on Mar 25, 2022
@dreamtalen mentioned this pull request on Mar 28, 2022