
Refactor ClickHouse monitor implementation #3498

Merged
merged 4 commits into antrea-io:main from single-pod-clickhouse-monitor on Mar 25, 2022

Conversation

yanjunz97 (Contributor)

This PR implements the flow visibility ClickHouse monitor as a long-running Pod instead of the CronJob used previously. This implementation brings the following advantages:

  • It reduces the overhead of creating and destroying a new Pod every time the monitor executes.
  • It avoids reading the K8s logs to check the last execution state.
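For illustration only, here is a minimal sketch of what such a long-running monitor loop could look like; the interval, function names, and logging are assumptions made for the example and are not taken from this PR:

```go
package main

import (
	"log"
	"time"
)

// monitorExecInterval replaces the CronJob schedule; the value here is an
// assumption for this sketch.
const monitorExecInterval = 1 * time.Minute

// checkUsageAndTrim is a hypothetical placeholder for the monitor logic,
// e.g. checking ClickHouse disk usage and deleting the oldest records when
// a threshold is exceeded.
func checkUsageAndTrim() error {
	return nil
}

func main() {
	ticker := time.NewTicker(monitorExecInterval)
	defer ticker.Stop()
	// The Pod keeps running between rounds, so there is no Pod creation
	// overhead and no need to read K8s logs to recover the last state.
	for range ticker.C {
		if err := checkUsageAndTrim(); err != nil {
			log.Printf("monitor round failed: %v", err)
		}
	}
}
```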

Signed-off-by: Yanjun Zhou <zhouya@vmware.com>

@codecov-commenter commented Mar 21, 2022

Codecov Report

Merging #3498 (d730119) into main (6c4e5a3) will decrease coverage by 11.08%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##             main    #3498       +/-   ##
===========================================
- Coverage   65.59%   54.50%   -11.09%     
===========================================
  Files         268      383      +115     
  Lines       26780    42043    +15263     
===========================================
+ Hits        17567    22917     +5350     
- Misses       7314    16776     +9462     
- Partials     1899     2350      +451     
Flag Coverage Δ
integration-tests 35.83% <ø> (?)
kind-e2e-tests 54.01% <ø> (-1.83%) ⬇️
unit-tests 43.06% <ø> (+0.48%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...g/agent/apiserver/handlers/featuregates/handler.go 0.00% <0.00%> (-82.36%) ⬇️
...kg/apiserver/registry/system/supportbundle/rest.go 20.45% <0.00%> (-54.55%) ⬇️
pkg/support/dump.go 8.19% <0.00%> (-49.19%) ⬇️
...egator/apiserver/handlers/recordmetrics/handler.go 0.00% <0.00%> (-44.45%) ⬇️
pkg/support/dump_others.go 0.00% <0.00%> (-44.00%) ⬇️
...g/agent/apiserver/handlers/addressgroup/handler.go 0.00% <0.00%> (-40.00%) ⬇️
...agent/apiserver/handlers/appliedtogroup/handler.go 0.00% <0.00%> (-40.00%) ⬇️
...gregator/apiserver/handlers/flowrecords/handler.go 0.00% <0.00%> (-40.00%) ⬇️
pkg/apiserver/handlers/loglevel/handler.go 0.00% <0.00%> (-38.47%) ⬇️
pkg/ovs/ovsctl/ofctl.go 19.10% <0.00%> (-17.98%) ⬇️
... and 171 more

@@ -4829,6 +4784,44 @@ spec:
---
apiVersion: apps/v1
kind: Deployment
Contributor

I thought we were going to use a new container within the same Deployment, and not a new Deployment?

There is value in keeping the number of Pods down.

Contributor Author

We are considering deploying a ClickHouse cluster in the future, which may have multiple ClickHouse Pods, but we only need one monitor for the whole cluster. That is why we split the monitor Deployment from the ClickHouse one.

Contributor

The fact is that at the moment there is a single replica, so there is no reason not to go with the simpler solution.

Looking at the ClickHouse operator in more detail, it is very easy to add a container. And because the ClickHouse servers run as a StatefulSet, and not as a Deployment, if we ever have more than one replica, it will be easy to run the monitor for the first replica only (only one container will do the work; the others will not do anything).
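As a side note, here is a minimal sketch of how a monitor container could restrict itself to the first StatefulSet replica; deriving the ordinal from the Pod hostname and the "-0" suffix check are assumptions made for this example, not code from the PR:

```go
package main

import (
	"log"
	"os"
	"strings"
)

// isFirstReplica reports whether this Pod is ordinal 0 of its StatefulSet,
// based on the ordinal suffix that StatefulSets append to Pod names
// (an assumption for this sketch).
func isFirstReplica() bool {
	hostname, err := os.Hostname()
	if err != nil {
		return false
	}
	return strings.HasSuffix(hostname, "-0")
}

func main() {
	if !isFirstReplica() {
		// Not the first replica: idle forever so the container stays up
		// without doing any monitor work.
		log.Println("not the first replica, monitor stays idle")
		select {}
	}
	// ... run the monitor loop here ...
}
```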

Contributor Author

Got it. Thanks Antonin, I updated the code to move the monitor to the ClickHouse pod.

build/yamls/flow-visibility.yml (resolved review thread)
// Retry connection to ClickHouse every 5 seconds if it fails.
connRetryInterval = 5 * time.Second
// Retry connection to ClickHouse every 10 seconds if it fails.
connRetryInterval = 10 * time.Second
Contributor

Just curious why we changed the retry interval this time?

Contributor Author

I observed 4 to 5 retries when the monitor started, as the ClickHouse server was not ready yet, and thought it would make sense to slightly lower the retry frequency.
I also considered adding some sleep time before the monitor starts, but it does not make sense to sleep first if the monitor is recovering from a crash.
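For illustration, here is a minimal sketch of a connection-retry loop built around this interval; the environment variable names, retry count, and DSN format are assumptions made for the example, not taken from the PR:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"os"
	"time"

	// Assumed ClickHouse SQL driver; it registers the "clickhouse" driver name.
	_ "github.com/ClickHouse/clickhouse-go"
)

const connRetryInterval = 10 * time.Second

// connectClickHouse retries until the ClickHouse server answers a ping,
// covering the case where the server container is not ready yet when the
// monitor starts (or restarts after a crash).
func connectClickHouse() (*sql.DB, error) {
	// DB_URL is expected to look like "tcp://localhost:9000" (assumption).
	dsn := fmt.Sprintf("%s?username=%s&password=%s",
		os.Getenv("DB_URL"), os.Getenv("CLICKHOUSE_USERNAME"), os.Getenv("CLICKHOUSE_PASSWORD"))
	var lastErr error
	for i := 0; i < 10; i++ {
		db, err := sql.Open("clickhouse", dsn)
		if err == nil {
			err = db.Ping()
			if err == nil {
				return db, nil
			}
			db.Close()
		}
		lastErr = err
		time.Sleep(connRetryInterval)
	}
	return nil, fmt.Errorf("failed to connect to ClickHouse: %v", lastErr)
}

func main() {
	db, err := connectClickHouse()
	if err != nil {
		log.Fatalf("giving up: %v", err)
	}
	defer db.Close()
	log.Println("connected to ClickHouse")
}
```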

@yanjunz97 force-pushed the single-pod-clickhouse-monitor branch from 4924745 to 100cd4d on March 23, 2022 at 01:34
key: password
name: clickhouse-secret
- name: DB_URL
value: tcp://clickhouse-clickhouse.flow-visibility.svc:9000
Contributor

There is no need to use the Service DNS name anymore. The containers are in the same network namespace now, so you can just use tcp://localhost:9000.

Another question, related to @dreamtalen's comment above: I don't see a way to change the port to something other than 9000 for the ClickHouse server in this manifest. How would a user be able to do that?

Contributor Author

Updated the DNS.

Contributor Author

Previously we did not provide a clear way to change the port. I updated the code to allow it, but currently the user needs multiple steps to complete the change.

I'm not sure whether there are use cases where the ports exposed only inside the ClickHouse Pod need to be changed, but if the user would like to change the port exposed in the Pod, they need to update:

  • The corresponding port in the hostTemplates in clickhouse.yml
  • The DB_URL env used by the monitor in clickhouse.yml
  • The port used to connect to the database when initializing the ClickHouse server in create_table.sh

If the user would like to change the port exposed outside the Pod, i.e. by the clickhouse-clickhouse Service, they need to update:

  • The corresponding port in the serviceTemplates in clickhouse.yml
  • The port used by Grafana, defined in datasource_provider.yml
  • The databaseURL used by the Flow Aggregator, defined in flow-aggregator.conf, introduced in PR Add ClickHouse Client #3196

I think it is not convenient for users to make all these changes just to use a port other than 9000. I would like to hear suggestions from @heanlan on what we can do to make the change easier.

Contributor

Thanks for the investigation. It looks fine to me for now. @antoninbas previously mentioned that we'll support Helm later. With Helm, port values in different places can be configured easily by sharing a common values.yaml.

plugins/flow-visibility/clickhouse-monitor/main.go (3 outdated review threads, resolved)
Signed-off-by: Yanjun Zhou <zhouya@vmware.com>
Signed-off-by: Yanjun Zhou <zhouya@vmware.com>
@yanjunz97 force-pushed the single-pod-clickhouse-monitor branch from 43e3e53 to 9f6fcb8 on March 24, 2022 at 01:36
@tnqn removed this from the Antrea v1.6 release milestone on Mar 24, 2022
Signed-off-by: Yanjun Zhou <zhouya@vmware.com>
@yanjunz97 (Contributor Author)

After some more discussion with @heanlan, we could not come up with a use case where a user would want to change the port exposed only inside the ClickHouse server container, and I think enabling this change may introduce some confusion. Thus I only keep the way to change the clickhouse-clickhouse Service ports, as mentioned in the second part of #3498 (comment). I can add it back if anyone thinks it is necessary. cc @dreamtalen @wsquan171

port: 8123
- name: tcp
port: 9000
type: LoadBalancer
Contributor

I don't think this needs to be a LoadBalancer Service. It's not accessed from outside the cluster; the typical deployment at the moment is to deploy all the flow visibility components in the same cluster as the workloads. If the cluster doesn't support LoadBalancer Services, the external IP will always show as pending.

Contributor Author

Thanks for pointing this out. It makes sense to me to use ClusterIP for now. Updated.

Signed-off-by: Yanjun Zhou <zhouya@vmware.com>
@yanjunz97 force-pushed the single-pod-clickhouse-monitor branch from 1bc8958 to d730119 on March 24, 2022 at 22:52
@antoninbas (Contributor) left a comment

LGTM

@antoninbas (Contributor)

/skip-all

can skip tests based on code changes

@antoninbas merged commit a2c6a1f into antrea-io:main on Mar 25, 2022
@dreamtalen mentioned this pull request on Mar 28, 2022