
[Flow Visibility] Add data retention methods for in-memory Clickhouse deployment #3244

Merged
merged 1 commit into antrea-io:main from the clickhouse branch on Mar 21, 2022

Conversation

yanjunz97
Contributor

This PR updates the Clickhouse in-memory deployment, which runs with restricted memory storage. It adds a TTL mechanism and an independent monitor as strategies to enforce data retention.
The TTL mechanism is provided by the Clickhouse MergeTree engine, which deletes expired data periodically. The monitor is designed to handle bursts of data insertion: it runs periodically as a Kubernetes CronJob and deletes records when the Clickhouse server memory usage exceeds a threshold.
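
To make the two strategies concrete, below is a rough Go sketch of the memory-based retention check; it is not the plugin code in this PR, and the flows table name, the THRESHOLD and STORAGE_SIZE_BYTES environment variables, and the batch size are all hypothetical. The TTL strategy itself is just a clause on the MergeTree table definition, shown in the leading comment.

// Hypothetical sketch, not the code in this PR.
// The TTL strategy lives in the table DDL, for example:
//   CREATE TABLE flows (..., timeInserted DateTime) ENGINE = MergeTree()
//   ORDER BY timeInserted
//   TTL timeInserted + INTERVAL 1 HOUR
// The monitor strategy below deletes the oldest rows once the table grows past a threshold.
package main

import (
	"database/sql"
	"fmt"
	"os"
	"strconv"
	"time"

	_ "github.com/ClickHouse/clickhouse-go" // registers the "clickhouse" driver
)

const rowsToDelete = 10000 // hypothetical batch size for one deletion round

func checkAndTrim(db *sql.DB) error {
	// THRESHOLD is a fraction of the allocated storage, e.g. "0.5" (hypothetical variable names).
	threshold, err := strconv.ParseFloat(os.Getenv("THRESHOLD"), 64)
	if err != nil {
		return fmt.Errorf("invalid THRESHOLD: %v", err)
	}
	allocated, err := strconv.ParseUint(os.Getenv("STORAGE_SIZE_BYTES"), 10, 64)
	if err != nil {
		return fmt.Errorf("invalid STORAGE_SIZE_BYTES: %v", err)
	}
	// system.parts reports the bytes held by each active part of the table.
	var used uint64
	if err := db.QueryRow(
		"SELECT SUM(bytes) FROM system.parts WHERE active AND table = 'flows'").Scan(&used); err != nil {
		return err
	}
	if float64(used) < threshold*float64(allocated) {
		return nil // still below the threshold, nothing to delete this round
	}
	// Delete the oldest rowsToDelete rows: find the timestamp at that offset and
	// remove everything inserted before it (an ALTER TABLE ... DELETE mutation).
	var cutoff time.Time
	if err := db.QueryRow(fmt.Sprintf(
		"SELECT timeInserted FROM flows ORDER BY timeInserted LIMIT 1 OFFSET %d", rowsToDelete)).Scan(&cutoff); err != nil {
		return err
	}
	_, err = db.Exec(fmt.Sprintf(
		"ALTER TABLE flows DELETE WHERE timeInserted < toDateTime(%d)", cutoff.Unix()))
	return err
}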

@yanjunz97
Contributor Author

This PR cherry-picks PR #3063 for the basic Clickhouse deployment. It is expected to be merged after PR #3063 is merged.

@codecov-commenter

codecov-commenter commented Jan 27, 2022

Codecov Report

Merging #3244 (02b8a92) into main (f7e980e) will decrease coverage by 17.29%.
The diff coverage is n/a.

❗ Current head 02b8a92 differs from pull request most recent head 2fb140d. Consider uploading reports for the commit 2fb140d to get more accurate results

Impacted file tree graph

@@             Coverage Diff             @@
##             main    #3244       +/-   ##
===========================================
- Coverage   65.56%   48.27%   -17.30%     
===========================================
  Files         268      345       +77     
  Lines       26909    48810    +21901     
===========================================
+ Hits        17643    23562     +5919     
- Misses       7354    22983    +15629     
- Partials     1912     2265      +353     
Flag                 Coverage Δ
e2e-tests            53.54% <ø> (?)
integration-tests    35.84% <ø> (?)
kind-e2e-tests       ?
unit-tests           ?

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pkg/controller/egress/controller.go 0.00% <0.00%> (-88.45%) ⬇️
pkg/controller/networkpolicy/endpoint_querier.go 4.58% <0.00%> (-86.85%) ⬇️
pkg/controller/ipam/validate.go 0.00% <0.00%> (-82.26%) ⬇️
pkg/agent/util/iptables/lock.go 0.00% <0.00%> (-81.82%) ⬇️
pkg/controller/ipam/antrea_ipam_controller.go 0.00% <0.00%> (-80.29%) ⬇️
pkg/agent/cniserver/ipam/antrea_ipam_controller.go 0.00% <0.00%> (-79.52%) ⬇️
pkg/controller/externalippool/validate.go 0.00% <0.00%> (-76.20%) ⬇️
pkg/agent/cniserver/ipam/antrea_ipam.go 3.47% <0.00%> (-75.70%) ⬇️
pkg/cni/client.go 0.00% <0.00%> (-75.52%) ⬇️
.../registry/networkpolicy/clustergroupmember/rest.go 11.11% <0.00%> (-73.21%) ⬇️
... and 364 more

@yanjunz97 yanjunz97 force-pushed the clickhouse branch 2 times, most recently from 989108b to 35ea5e7 on January 28, 2022 02:14
@yanjunz97 yanjunz97 marked this pull request as ready for review January 31, 2022 18:11
@dreamtalen
Contributor

dreamtalen commented Feb 2, 2022

Thanks Yanjun, could you move your monitor files under /plugins/flow-visibility/clickhouse-monitor to be consistent with the /plugins/flow-visibility/policy-recommendation directory?

For the dockerfile, I think build/images/flow-visibility/Dockerfile.clickhouse.monitor.ubuntu would be a better place.

@yanjunz97
Contributor Author

Thanks Yanjun, could you move your monitor files under /plugins/flow-visibility/clickhouse-monitor to be consistent with the /plugins/flow-visibility/policy-recommendation directory?

For the dockerfile, I think build/images/flow-visibility/Dockerfile.clickhouse.monitor.ubuntu would be a better place.

Thanks Yongming for reviewing. I've moved the monitor files as suggested. The dockerfile is still kept in its original place, to be consistent with the Octant one, as we discussed in the meeting.

build/images/Dockerfile.clickhouse.monitor.ubuntu (outdated review thread, resolved)
plugins/flow-visibility/clickhouse-monitor/main.go (6 outdated review threads, resolved)
@heanlan
Contributor

heanlan commented Feb 24, 2022

Hi @yanjunz97, you may also want to make the corresponding changes:

  • Makefile - .PHONY: manifest add --mode dev
  • update the generated YAML manifest with the clickhouse monitor resources

@yanjunz97
Contributor Author

Hi @yanjunz97, you may also want to make the corresponding changes:

  • Makefile - .PHONY: manifest add --mode dev
  • update the generated YAML manifest with the clickhouse monitor resources

Thanks Anlan! Updated.

@yanjunz97 yanjunz97 force-pushed the clickhouse branch 2 times, most recently from 494a9ed to bc22847 on March 1, 2022 02:06
docs/network-flow-visibility.md (3 outdated review threads, resolved)
@yanjunz97
Contributor Author

Hi @antoninbas @salv-orlando, could you take a look at this PR?

name: clickhouse-monitor
---
apiVersion: batch/v1
kind: CronJob
Contributor

Why does this run as a CronJob, which I think will schedule a new Pod each time, instead of as an additional container in the Clickhouse Pod?

Contributor Author

As the CronJob is scheduled by Kubernetes, we think it provides better lifecycle management.
We are also considering deploying a Clickhouse cluster in the future, which may have multiple Clickhouse Pods, but we would only need one monitor for the whole Clickhouse cluster.

Contributor

@yanjunz97 I agree with your point about not running the monitor in the Clickhouse Pod. On the other hand, @antoninbas also correctly brings up that a CronJob has the additional overhead of creating/destroying a Pod every time it is executed.

Personally, I'm OK with the CronJob - as making a change would also have an impact on the monitor implementation - but I also think it might be fine to have the monitor as a continuously running Pod in the future.

Contributor Author

I see the concern about the overhead. We did have some discussion in a previous team meeting about the plan to use Linux crontab in a continuously running Pod.

We think the CronJob might be more reliable. For example, if the continuously running Pod goes wrong, the monitor won't work any more, whereas with the CronJob each Pod runs separately, so the failure of one Pod does not affect the others. This might matter more when the cluster is large. I'm still not sure whether we should reconsider the crontab design.

Contributor Author

After some more discussion with @salv-orlando offline, I understand the reliability and overhead savings of the single continuously-running-Pod monitor solution. I plan to switch the current CronJob monitor to a continuously running Pod in a follow-up PR.

Contributor

Agreed. The current implementation assumes something like a CronJob is running, so it's better to keep using it. We can iterate on it in the next release (with the caveat that we will need to find a way to terminate the CronJob, or document that it must be terminated).

@@ -182,3 +182,20 @@ jobs:
run: |
echo "$DOCKER_PASSWORD" | docker login -u "$DOCKER_USERNAME" --password-stdin
docker push antrea/flow-aggregator:latest

build-flow-visibility-clickhouse-monitor:
needs: check-changes
Contributor

I don't think you should use the same check-changes step here to determine if you need to build the image.
You should look for changes in plugins/flow-visibility/clickhouse-monitor/ instead IMO.

Contributor Author

@yanjunz97 yanjunz97 Mar 9, 2022

Thanks Antonin for the review! I think the changes I need to look for are:

  • plugins/flow-visibility/clickhouse-monitor/
  • build/images/Dockerfile.clickhouse.monitor.ubuntu

It seems has-changes accepts excluded paths rather than included ones. I tried to exclude most of the paths, but I'm not sure whether that is enough, or whether there is a better way to do this.

build/yamls/flow-visibility/base/clickhouse.yml (outdated review thread, resolved)
docs/network-flow-visibility.md (review thread, resolved)
plugins/flow-visibility/clickhouse-monitor/main.go (outdated review thread, resolved)
spec:
containers:
- name: clickhouse-monitor
imagePullPolicy: IfNotPresent
Contributor

I am simply curious about why we force imagePullPolicy to IfNotPresent in "dev" mode...

Contributor Author

I think it is because in development we prefer to use the image built locally instead of the one from the repo.

@@ -0,0 +1 @@
# placeholder
Contributor

Is this file needed for this PR? Not a big deal, I just don't fully understand why it's needed in the first place

Contributor Author

It seems to work without this file, but I added it to keep consistency with the flow-aggregator and antrea patches.

// Checks the k8s log for the number of rounds to skip.
// Returns true when the monitor needs to skip more rounds and log the rest number of rounds to skip.
func skipRound() bool {
logString, err := getPodLogs()
Contributor

@antoninbas, @yanjunz97 I think that if we move away from the CronJob we will need to reconsider this logic, and that might not be trivial. I think we should do that, but perhaps as a follow-up to this PR.

Contributor Author

Sure, if we move to that solution, we may use logic that reads the cron log instead of the K8s log.

Contributor

I think this is actually an argument in favor of not using a CronJob in the first place. If your logic needs some state (which is the case here), it is better to just have a long-running Pod.

Contributor Author

I see. Is it fine to have a follow-up PR to switch it to a long-running Pod, targeting the next release?

Contributor

Fine by me... I think doing that will require a bit of refactoring...
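
For readers following this thread: below is a rough client-go sketch of what a CronJob-based monitor has to do to recover state from the previous run's Pod, which is the awkwardness being discussed. The label selector, the SKIP_ROUNDS= marker, and the function name are assumptions for illustration, not taken from this PR.

// Hypothetical sketch: a new CronJob Pod recovering the "rounds to skip" count
// from the logs of a previous monitor Pod.
package main

import (
	"bufio"
	"bytes"
	"context"
	"strconv"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const skipRoundsMarker = "SKIP_ROUNDS=" // hypothetical marker written by the previous run

func roundsToSkip(namespace, labelSelector string) (int, error) {
	config, err := rest.InClusterConfig()
	if err != nil {
		return 0, err
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		return 0, err
	}
	// List monitor Pods by label; a real implementation would pick the most recent completed one.
	pods, err := clientset.CoreV1().Pods(namespace).List(context.TODO(),
		metav1.ListOptions{LabelSelector: labelSelector})
	if err != nil || len(pods.Items) == 0 {
		// No previous Pod (or an error): nothing to skip.
		return 0, err
	}
	raw, err := clientset.CoreV1().Pods(namespace).
		GetLogs(pods.Items[0].Name, &corev1.PodLogOptions{}).DoRaw(context.TODO())
	if err != nil {
		return 0, err
	}
	// Scan the log for the last marker line and parse the remaining rounds to skip.
	rounds := 0
	scanner := bufio.NewScanner(bytes.NewReader(raw))
	for scanner.Scan() {
		line := scanner.Text()
		if idx := strings.Index(line, skipRoundsMarker); idx >= 0 {
			if n, err := strconv.Atoi(strings.TrimSpace(line[idx+len(skipRoundsMarker):])); err == nil {
				rounds = n
			}
		}
	}
	return rounds, nil
}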

// Clickhouse configuration
userName := os.Getenv("CH_USERNAME")
password := os.Getenv("CH_PASSWORD")
host, port := os.Getenv("SVC_HOST"), os.Getenv("SVC_PORT")
Contributor

nit: In the clickhouse client PR (#3196) the DB URI is passed in its entirety, whereas in this PR it is built from host and port. Not sure if we want to consider a common approach. (This is not something we necessarily have to address in this PR; it's just a suggestion which can be ignored or taken up as future work.)

Contributor Author

Thanks Salvatore for the review! It makes sense to me to use the DB URI. I have updated the env variables in the monitor.
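
As a minimal sketch of the single-URI approach mentioned above (the DB_URL variable name and the example DSN are assumptions, not necessarily what the updated monitor uses):

// Hypothetical sketch: connect using one DSN-style environment variable
// instead of separate username/password/host/port variables.
package main

import (
	"database/sql"
	"os"

	_ "github.com/ClickHouse/clickhouse-go" // registers the "clickhouse" driver
	"k8s.io/klog/v2"
)

func connectToClickHouse() (*sql.DB, error) {
	// e.g. "tcp://clickhouse-clickhouse.flow-visibility.svc:9000?username=...&password=..."
	dataSourceName := os.Getenv("DB_URL")
	connect, err := sql.Open("clickhouse", dataSourceName)
	if err != nil {
		klog.ErrorS(err, "Failed to connect to Clickhouse")
		return nil, err
	}
	return connect, nil
}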

if err != nil {
klog.ErrorS(err, "Failed to connect to Clickhouse")
}
if err := connect.Ping(); err != nil {
Contributor

Just a curiosity: is the ping operation necessary to assess whether the connection is healthy?

Contributor Author

This is suggested by the clickhouse-go examples.
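
For reference, the pattern from the clickhouse-go (v1, database/sql) examples looks roughly like the sketch below. sql.Open alone does not dial the server, so Ping is what actually verifies the connection; the Exception type assertion is optional but surfaces Clickhouse-specific error details.

// Hypothetical sketch of the ping check suggested by the clickhouse-go examples.
package main

import (
	"database/sql"

	clickhouse "github.com/ClickHouse/clickhouse-go"
	"k8s.io/klog/v2"
)

func checkConnection(connect *sql.DB) bool {
	// Ping forces a round trip to the server and surfaces auth/network errors early.
	if err := connect.Ping(); err != nil {
		if exception, ok := err.(*clickhouse.Exception); ok {
			// Clickhouse-specific error details returned by the server.
			klog.ErrorS(err, "Clickhouse exception", "code", exception.Code, "message", exception.Message)
		} else {
			klog.ErrorS(err, "Failed to ping Clickhouse")
		}
		return false
	}
	return true
}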

@yanjunz97
Contributor Author

Hi @antoninbas @salv-orlando, could you take another look at this PR? Thanks!

.github/workflows/build.yml (outdated review thread, resolved)
build/images/Dockerfile.clickhouse.monitor.ubuntu (2 outdated review threads, resolved)
plugins/flow-visibility/clickhouse-monitor/main.go (3 outdated review threads, resolved)
Comment on lines 56 to 65
var (
// The name of the table to store the flow records
tableName = os.Getenv("TABLE_NAME")
// The names of the materialized views
mvNames = strings.Split(os.Getenv("MV_NAMES"), " ")
// The namespace of the Clickhouse server
namespace = os.Getenv("NAMESPACE")
// The clickhouse monitor label
monitorLabel = os.Getenv("MONITOR_LABEL")
)
Contributor

it would be good to fail early in main() if one of the required environment variables is missing. What do you think?

Contributor Author

Added a check at the beginning of main()
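
A minimal sketch of such a check, assuming the environment variables from the snippet above (the exact set validated in the final code may differ):

// Hypothetical sketch: fail fast in main() when a required environment variable is unset.
package main

import (
	"os"

	"k8s.io/klog/v2"
)

func checkEnv() bool {
	for _, name := range []string{"TABLE_NAME", "MV_NAMES", "NAMESPACE", "MONITOR_LABEL"} {
		if os.Getenv(name) == "" {
			klog.ErrorS(nil, "Required environment variable is not set", "name", name)
			return false
		}
	}
	return true
}

func main() {
	if !checkEnv() {
		os.Exit(1)
	}
	// ... the rest of the monitor logic runs only when all variables are present
}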

plugins/flow-visibility/clickhouse-monitor/main.go (outdated review thread, resolved)

salv-orlando
salv-orlando previously approved these changes Mar 18, 2022
Contributor

@salv-orlando salv-orlando left a comment

I'm approving - we will move away from the CronJob, but we'll do that iteratively instead of doing everything in this PR.

@yanjunz97
Contributor Author

/test-all

@yanjunz97
Contributor Author

Thanks @salv-orlando for the review. Just squashed and rebased the code. I would like to have @antoninbas take another look.

antoninbas
antoninbas previously approved these changes Mar 18, 2022
Contributor

@antoninbas antoninbas left a comment

Let's merge this.
If you have time to move away from the CronJob before the next release, it would be ideal IMO.

Signed-off-by: Yanjun Zhou <zhouya@vmware.com>
@yanjunz97
Contributor Author

Fixed a bug in the retry logic. Running the tests again.
/test-all

@yanjunz97
Contributor Author

/test-e2e

@yanjunz97
Contributor Author

All tests passed. This PR can be merged if there are no other comments.

@antoninbas antoninbas merged commit 6c4e5a3 into antrea-io:main Mar 21, 2022
@dreamtalen dreamtalen mentioned this pull request Mar 28, 2022