
[Flow Visibility] Add data retention methods for in-memory Clickhouse deployment #3244

Merged
merged 1 commit into antrea-io:main from the clickhouse branch on Mar 21, 2022

Conversation

yanjunz97
Contributor

This PR updates the Clickhouse in-memory deployment, which runs with restricted memory storage. It adds a TTL mechanism and an independent monitor as strategies to enforce data retention.
The TTL mechanism is provided by the Clickhouse MergeTree engine, which deletes expired data periodically. The monitor is designed to handle bursts of data insertion: it runs periodically as a Kubernetes CronJob and deletes records when the Clickhouse server memory usage exceeds a threshold.
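
To make the two strategies concrete, below is a rough Go sketch of the memory-based retention check; it is not the plugin code in this PR, and the flows table name, the THRESHOLD and STORAGE_SIZE_BYTES environment variables, and the batch size are all hypothetical. The TTL strategy itself is just a clause on the MergeTree table definition, shown in the leading comment.

// Hypothetical sketch, not the code in this PR.
// The TTL strategy lives in the table DDL, for example:
//   CREATE TABLE flows (..., timeInserted DateTime) ENGINE = MergeTree()
//   ORDER BY timeInserted
//   TTL timeInserted + INTERVAL 1 HOUR
// The monitor strategy below deletes the oldest rows once the table grows past a threshold.
package main

import (
	"database/sql"
	"fmt"
	"os"
	"strconv"
	"time"

	_ "github.com/ClickHouse/clickhouse-go" // registers the "clickhouse" driver
)

const rowsToDelete = 10000 // hypothetical batch size for one deletion round

func checkAndTrim(db *sql.DB) error {
	// THRESHOLD is a fraction of the allocated storage, e.g. "0.5" (hypothetical variable names).
	threshold, err := strconv.ParseFloat(os.Getenv("THRESHOLD"), 64)
	if err != nil {
		return fmt.Errorf("invalid THRESHOLD: %v", err)
	}
	allocated, err := strconv.ParseUint(os.Getenv("STORAGE_SIZE_BYTES"), 10, 64)
	if err != nil {
		return fmt.Errorf("invalid STORAGE_SIZE_BYTES: %v", err)
	}
	// system.parts reports the bytes held by each active part of the table.
	var used uint64
	if err := db.QueryRow(
		"SELECT SUM(bytes) FROM system.parts WHERE active AND table = 'flows'").Scan(&used); err != nil {
		return err
	}
	if float64(used) < threshold*float64(allocated) {
		return nil // still below the threshold, nothing to delete this round
	}
	// Delete the oldest rowsToDelete rows: find the timestamp at that offset and
	// remove everything inserted before it (an ALTER TABLE ... DELETE mutation).
	var cutoff time.Time
	if err := db.QueryRow(fmt.Sprintf(
		"SELECT timeInserted FROM flows ORDER BY timeInserted LIMIT 1 OFFSET %d", rowsToDelete)).Scan(&cutoff); err != nil {
		return err
	}
	_, err = db.Exec(fmt.Sprintf(
		"ALTER TABLE flows DELETE WHERE timeInserted < toDateTime(%d)", cutoff.Unix()))
	return err
}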

@yanjunz97
Contributor Author

This PR cherry-picks PR #3063 for the basic Clickhouse deployment. It is expected to be merged after PR #3063 is merged.

@codecov-commenter

codecov-commenter commented Jan 27, 2022

Codecov Report

Merging #3244 (02b8a92) into main (f7e980e) will decrease coverage by 17.29%.
The diff coverage is n/a.

❗ Current head 02b8a92 differs from pull request most recent head 2fb140d. Consider uploading reports for the commit 2fb140d to get more accurate results

Impacted file tree graph

@@             Coverage Diff             @@
##             main    #3244       +/-   ##
===========================================
- Coverage   65.56%   48.27%   -17.30%     
===========================================
  Files         268      345       +77     
  Lines       26909    48810    +21901     
===========================================
+ Hits        17643    23562     +5919     
- Misses       7354    22983    +15629     
- Partials     1912     2265      +353     
Flag                 Coverage Δ
e2e-tests            53.54% <ø> (?)
integration-tests    35.84% <ø> (?)
kind-e2e-tests       ?
unit-tests           ?

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pkg/controller/egress/controller.go 0.00% <0.00%> (-88.45%) ⬇️
pkg/controller/networkpolicy/endpoint_querier.go 4.58% <0.00%> (-86.85%) ⬇️
pkg/controller/ipam/validate.go 0.00% <0.00%> (-82.26%) ⬇️
pkg/agent/util/iptables/lock.go 0.00% <0.00%> (-81.82%) ⬇️
pkg/controller/ipam/antrea_ipam_controller.go 0.00% <0.00%> (-80.29%) ⬇️
pkg/agent/cniserver/ipam/antrea_ipam_controller.go 0.00% <0.00%> (-79.52%) ⬇️
pkg/controller/externalippool/validate.go 0.00% <0.00%> (-76.20%) ⬇️
pkg/agent/cniserver/ipam/antrea_ipam.go 3.47% <0.00%> (-75.70%) ⬇️
pkg/cni/client.go 0.00% <0.00%> (-75.52%) ⬇️
.../registry/networkpolicy/clustergroupmember/rest.go 11.11% <0.00%> (-73.21%) ⬇️
... and 364 more

@yanjunz97 yanjunz97 force-pushed the clickhouse branch 2 times, most recently from 989108b to 35ea5e7 on January 28, 2022 02:14
@yanjunz97 yanjunz97 marked this pull request as ready for review January 31, 2022 18:11
@dreamtalen
Contributor

dreamtalen commented Feb 2, 2022

Thanks Yanjun, could you move your monitor files under /plugins/flow-visibility/clickhouse-monitor to be consistent with the /plugins/flow-visibility/policy-recommendation directory?

For the dockerfile, I think build/images/flow-visibility/Dockerfile.clickhouse.monitor.ubuntu would be a better place.

@yanjunz97
Contributor Author

Thanks Yanjun, could you move your monitor files under /plugins/flow-visibility/clickhouse-monitor to be consistent with the /plugins/flow-visibility/policy-recommendation directory?

For the dockerfile, I think build/images/flow-visibility/Dockerfile.clickhouse.monitor.ubuntu would be a better place.

Thanks Yongming for reviewing. I've moved the monitor files as suggested. The dockerfile is still kept in its original place, to be consistent with the Octant one, as we discussed in the meeting.

build/images/Dockerfile.clickhouse.monitor.ubuntu (outdated review thread, resolved)
plugins/flow-visibility/clickhouse-monitor/main.go (6 outdated review threads, resolved)
@heanlan
Contributor

heanlan commented Feb 24, 2022

Hi @yanjunz97, you may also want to make the corresponding changes:

  • Makefile - .PHONY: manifest add --mode dev
  • update the generated YAML manifest with the clickhouse monitor resources

@yanjunz97
Contributor Author

Hi @yanjunz97, you may also want to make the corresponding changes:

  • Makefile - .PHONY: manifest add --mode dev
  • update the generated YAML manifest with the clickhouse monitor resources

Thanks Anlan! Updated.

@yanjunz97 yanjunz97 force-pushed the clickhouse branch 2 times, most recently from 494a9ed to bc22847 on March 1, 2022 02:06
docs/network-flow-visibility.md (3 outdated review threads, resolved)
@yanjunz97
Contributor Author

Hi @antoninbas @salv-orlando, could you take a look at this PR?

name: clickhouse-monitor
---
apiVersion: batch/v1
kind: CronJob
Contributor

Why does this run as a CronJob, which I think will schedule a new Pod each time, instead of as an additional container in the Clickhouse Pod?

Contributor Author

As the CronJob is scheduled by Kubernetes, we think it provides better lifecycle management.
We are also considering deploying a Clickhouse cluster in the future, which may have multiple Clickhouse Pods, but we would only need one monitor for the whole Clickhouse cluster.

Contributor

@yanjunz97 I agree with your point about not running the monitor in the Clickhouse Pod. On the other hand, @antoninbas also correctly brings up that a CronJob has the additional overhead of creating/destroying a Pod every time it is executed.

Personally, I'm OK with the CronJob - as making a change would also have an impact on the monitor implementation - but I also think it might be fine to have the monitor as a continuously running Pod in the future.

Contributor Author

I see the concern about the overhead. We did have some discussion in a previous team meeting about the plan to use Linux crontab in a continuously running Pod.

We think the CronJob might be more reliable. For example, if the continuously running Pod goes wrong, the monitor won't work any more, whereas with the CronJob each Pod runs separately, so the failure of one Pod does not affect the others. This might matter more when the cluster is large. I'm still not sure whether we should reconsider the crontab design.

Contributor Author

After some more discussion with @salv-orlando offline, I understand the reliability and overhead savings of the single continuously-running-Pod monitor solution. I plan to switch the current CronJob monitor to a continuously running Pod in a follow-up PR.

Contributor

Agreed. The current implementation assumes something like a CronJob is running, so it's better to keep using it. We can iterate on it in the next release (with the caveat that we will need to find a way to terminate the CronJob, or document that it must be terminated).

@@ -182,3 +182,20 @@ jobs:
run: |
echo "$DOCKER_PASSWORD" | docker login -u "$DOCKER_USERNAME" --password-stdin
docker push antrea/flow-aggregator:latest

build-flow-visibility-clickhouse-monitor:
needs: check-changes
Contributor

I don't think you should use the same check-changes step here to determine if you need to build the image.
You should look for changes in plugins/flow-visibility/clickhouse-monitor/ instead IMO.

Contributor Author

@yanjunz97 yanjunz97 Mar 9, 2022

Thanks Antonin for the review! I think the changes I need to look for are:

  • plugins/flow-visibility/clickhouse-monitor/
  • build/images/Dockerfile.clickhouse.monitor.ubuntu

It seems has-changes accepts excluded paths rather than included ones. I tried to exclude most of the paths, but I'm not sure whether that is enough, or whether there is a better way to do this.

build/yamls/flow-visibility/base/clickhouse.yml (outdated review thread, resolved)
docs/network-flow-visibility.md (review thread, resolved)
plugins/flow-visibility/clickhouse-monitor/main.go (outdated review thread, resolved)
spec:
containers:
- name: clickhouse-monitor
imagePullPolicy: IfNotPresent
Contributor

I am simply curious about why we force imagePullPolicy to IfNotPresent in "dev" mode...

Contributor Author

I think it is because in development we prefer to use the image built locally instead of the one from the repo.

@@ -0,0 +1 @@
# placeholder
Contributor

Is this file needed for this PR? Not a big deal, I just don't fully understand why it's needed in the first place

Contributor Author

It seems to work without this file, but I added it to keep consistency with the flow-aggregator and antrea patches.

// Checks the k8s log for the number of rounds to skip.
// Returns true when the monitor needs to skip more rounds and log the rest number of rounds to skip.
func skipRound() bool {
logString, err := getPodLogs()
Contributor

@antoninbas, @yanjunz97 I think that if we move away from the CronJob we will need to reconsider this logic, and that might not be trivial. I think we should do that, but perhaps as a follow-up to this PR.

Contributor Author

Sure, if we move to that solution, we may use logic that reads the cron log instead of the K8s log.

Contributor

I think this is actually an argument in favor of not using a CronJob in the first place. If your logic needs some state (which is the case here), it is better to just have a long-running Pod.

Contributor Author

I see. Is it fine to have a follow-up PR to switch it to a long-running Pod, targeting the next release?

Contributor

Fine by me... I think doing that will require a bit of refactoring...
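
For readers following this thread: below is a rough client-go sketch of what a CronJob-based monitor has to do to recover state from the previous run's Pod, which is the awkwardness being discussed. The label selector, the SKIP_ROUNDS= marker, and the function name are assumptions for illustration, not taken from this PR.

// Hypothetical sketch: a new CronJob Pod recovering the "rounds to skip" count
// from the logs of a previous monitor Pod.
package main

import (
	"bufio"
	"bytes"
	"context"
	"strconv"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const skipRoundsMarker = "SKIP_ROUNDS=" // hypothetical marker written by the previous run

func roundsToSkip(namespace, labelSelector string) (int, error) {
	config, err := rest.InClusterConfig()
	if err != nil {
		return 0, err
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		return 0, err
	}
	// List monitor Pods by label; a real implementation would pick the most recent completed one.
	pods, err := clientset.CoreV1().Pods(namespace).List(context.TODO(),
		metav1.ListOptions{LabelSelector: labelSelector})
	if err != nil || len(pods.Items) == 0 {
		// No previous Pod (or an error): nothing to skip.
		return 0, err
	}
	raw, err := clientset.CoreV1().Pods(namespace).
		GetLogs(pods.Items[0].Name, &corev1.PodLogOptions{}).DoRaw(context.TODO())
	if err != nil {
		return 0, err
	}
	// Scan the log for the last marker line and parse the remaining rounds to skip.
	rounds := 0
	scanner := bufio.NewScanner(bytes.NewReader(raw))
	for scanner.Scan() {
		line := scanner.Text()
		if idx := strings.Index(line, skipRoundsMarker); idx >= 0 {
			if n, err := strconv.Atoi(strings.TrimSpace(line[idx+len(skipRoundsMarker):])); err == nil {
				rounds = n
			}
		}
	}
	return rounds, nil
}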

// Clickhouse configuration
userName := os.Getenv("CH_USERNAME")
password := os.Getenv("CH_PASSWORD")
host, port := os.Getenv("SVC_HOST"), os.Getenv("SVC_PORT")
Contributor

nit: In the clickhouse client PR (#3196) the DB URI is passed in its entirety, whereas in this PR it is built from host and port. Not sure if we want to consider a common approach. (This is not something we necessarily have to address in this PR; it's just a suggestion which can be ignored or taken up as future work.)

Contributor Author

Thanks Salvatore for the review! It makes sense to me to use the DB URI. I have updated the env variables in the monitor.
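
As a minimal sketch of the single-URI approach mentioned above (the DB_URL variable name and the example DSN are assumptions, not necessarily what the updated monitor uses):

// Hypothetical sketch: connect using one DSN-style environment variable
// instead of separate username/password/host/port variables.
package main

import (
	"database/sql"
	"os"

	_ "github.com/ClickHouse/clickhouse-go" // registers the "clickhouse" driver
	"k8s.io/klog/v2"
)

func connectToClickHouse() (*sql.DB, error) {
	// e.g. "tcp://clickhouse-clickhouse.flow-visibility.svc:9000?username=...&password=..."
	dataSourceName := os.Getenv("DB_URL")
	connect, err := sql.Open("clickhouse", dataSourceName)
	if err != nil {
		klog.ErrorS(err, "Failed to connect to Clickhouse")
		return nil, err
	}
	return connect, nil
}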

if err != nil {
klog.ErrorS(err, "Failed to connect to Clickhouse")
}
if err := connect.Ping(); err != nil {
Contributor

Just a curiosity: is the ping operation necessary to assess whether the connection is healthy?

Contributor Author

This is suggested by the clickhouse-go examples.
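
For reference, the pattern from the clickhouse-go (v1, database/sql) examples looks roughly like the sketch below. sql.Open alone does not dial the server, so Ping is what actually verifies the connection; the Exception type assertion is optional but surfaces Clickhouse-specific error details.

// Hypothetical sketch of the ping check suggested by the clickhouse-go examples.
package main

import (
	"database/sql"

	clickhouse "github.com/ClickHouse/clickhouse-go"
	"k8s.io/klog/v2"
)

func checkConnection(connect *sql.DB) bool {
	// Ping forces a round trip to the server and surfaces auth/network errors early.
	if err := connect.Ping(); err != nil {
		if exception, ok := err.(*clickhouse.Exception); ok {
			// Clickhouse-specific error details returned by the server.
			klog.ErrorS(err, "Clickhouse exception", "code", exception.Code, "message", exception.Message)
		} else {
			klog.ErrorS(err, "Failed to ping Clickhouse")
		}
		return false
	}
	return true
}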

@yanjunz97
Contributor Author

Hi @antoninbas @salv-orlando, could you take another look at this PR? Thanks!

.github/workflows/build.yml (outdated review thread, resolved)
build/images/Dockerfile.clickhouse.monitor.ubuntu (2 outdated review threads, resolved)
plugins/flow-visibility/clickhouse-monitor/main.go (3 outdated review threads, resolved)
Comment on lines 56 to 65
var (
// The name of the table to store the flow records
tableName = os.Getenv("TABLE_NAME")
// The names of the materialized views
mvNames = strings.Split(os.Getenv("MV_NAMES"), " ")
// The namespace of the Clickhouse server
namespace = os.Getenv("NAMESPACE")
// The clickhouse monitor label
monitorLabel = os.Getenv("MONITOR_LABEL")
)
Contributor

it would be good to fail early in main() if one of the required environment variables is missing. What do you think?

Contributor Author

Added a check at the beginning of main()
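
A minimal sketch of such a check, assuming the environment variables from the snippet above (the exact set validated in the final code may differ):

// Hypothetical sketch: fail fast in main() when a required environment variable is unset.
package main

import (
	"os"

	"k8s.io/klog/v2"
)

func checkEnv() bool {
	for _, name := range []string{"TABLE_NAME", "MV_NAMES", "NAMESPACE", "MONITOR_LABEL"} {
		if os.Getenv(name) == "" {
			klog.ErrorS(nil, "Required environment variable is not set", "name", name)
			return false
		}
	}
	return true
}

func main() {
	if !checkEnv() {
		os.Exit(1)
	}
	// ... the rest of the monitor logic runs only when all variables are present
}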

plugins/flow-visibility/clickhouse-monitor/main.go (outdated review thread, resolved)

salv-orlando
salv-orlando previously approved these changes Mar 18, 2022
Contributor

@salv-orlando salv-orlando left a comment

I'm approving - we will move away from the CronJob, but we'll do that iteratively instead of doing everything in this PR.

@yanjunz97
Contributor Author

/test-all

@yanjunz97
Contributor Author

Thanks @salv-orlando for the review. Just squashed and rebased the code. I would like to have @antoninbas take another look.

antoninbas
antoninbas previously approved these changes Mar 18, 2022
Contributor

@antoninbas antoninbas left a comment

Let's merge this.
If you have time to move away from the CronJob before the next release, it would be ideal IMO.

Signed-off-by: Yanjun Zhou <zhouya@vmware.com>
@yanjunz97
Contributor Author

Fixed a bug in the retry logic. Running the tests again.
/test-all

@yanjunz97
Contributor Author

/test-e2e

@yanjunz97
Contributor Author

All tests passed. This PR can be merged if there are no other comments.

@antoninbas antoninbas merged commit 6c4e5a3 into antrea-io:main Mar 21, 2022
@dreamtalen dreamtalen mentioned this pull request Mar 28, 2022