feat(webserver): implement metrics endpoint #3140

sgaist · 2024-03-15T21:36:55Z

What type of PR is this?

Uncomment one (or more) /kind <> lines:

/kind bug

/kind cleanup

/kind design

/kind documentation

/kind failing-test

/kind feature

/kind release

Any specific area of the project related to this PR?

Uncomment one (or more) /area <> lines:

/area build

/area engine

/area tests

/area proposals

/area CI

What this PR does / why we need it:

This PR implements the /metrics endpoint in the webserver providing currently prometheus metrics.

Which issue(s) this PR fixes:

Fixes #3131
Fixes #1772

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

feat(webserver): a metrics endpoint has been added providing prometheus metrics.

It can be optionally enabled using the new `metrics.prometheus_enabled` configuration option.
It will only be activated if the `metrics.enabled` is true as well.

incertum

🚀 nice thanks for getting this up so quickly!

incertum · 2024-03-15T22:03:23Z

userspace/falco/webserver.cpp

+                    metrics_collector.snapshot();
+                    auto metrics_snapshot = metrics_collector.get_metrics();
+
+                    for (auto& metric: metrics_snapshot)


Could we also add all the wrapper metrics already in, see for example #3131 (comment) (see the libs unit tests) or the website https://falco.org/docs/metrics/falco-metrics/ or run Falco w/ the output rule metrics.

Or would you just want to do that in another PR? Also ok for me.

Some more notes:

For Prometheus we do not need the _prev metrics values and even the drop percentage is not needed as it can be calculated via Prometheus queries, hence not a 1:1 translation in terms of the wrapper or additional metrics fields.

In addition, we distinguish between falco. and scap. prefixes -- probably best to preserve that and as namespace I personally would prefer falcosecurity.

From a code organization point of view we can also iterate as it would be nice to not add metrics logic into the webserver file itself and rather maybe create a more formal falco_metrics_collector class or something else.

Or maybe you planned that already in a follow up PR.

Additional note I forgot: It appears we need to move the code logic to the stats_writer::collector::collector so that the metrics get refreshed and reposted to /metrics every x interval.

In addition, we distinguish between falco. and scap. prefixes -- probably best to preserve that and as namespace I personally would prefer falcosecurity.

Agree with falcosecurity.

From what i see, we are refreshing the stats whenever the /metrics endpoint is called, instead of passing through the stats_writer::collector::collector.
This ensures that we follow Prometheus settings (ie: if one setups prometheus to scrape every 2s, we honor that even if metrics.interval is higher).
Don't know if this is the behavior we want though.
Otherwise, we might still follow metrics.interval and then when prometheus tries to scrape more often, we will offer stale data.

I can see the dilemma. At the same time I also don't see a problem with just recommending to having the same Falco metrics interval as the Prometheus scrape interval? How is this solved in other projects? At the end of the day metrics time series are always discrete ... I would also be fine to refresh the metrics in the Prometheus case whenever /metrics was called as @FedeDP suggested.

One thing that comes to mind, doesn't the stats_writer class contain the required metrics generation logic ?

The metrics wrapper fields (the ones not part of the libs_metrics_collector) are easy to duplicate if needed, or we create some sort of more formal class falco-metrics_collector ...

At the same time I also don't see a problem with just recommending to having the same Falco metrics interval as the Prometheus scrape interval?

Yep, but:

/metrics endpoint is the only "interactive" metrics generator we have, since output_rule and output_file are passive. Therefore, it makes sense that is up to the caller to decide the frequency.

also, consider the case where prometheus is setup to scrape every 5seconds; consider also that output_file is enabled too. This has a cost in terms of IO perf if user has to set 5s metrics.interval in the config file

At the same time, i quite don't like having a metric that does not obey to metrics.interval though...

I would love to gather more feedback! cc @falcosecurity/falco-maintainers

Maybe we need to just make everything configurable, for example, we could add sub metrics configs similar to below:

metrics: enabled: false interval: 1h # Typically, in production, you only use `output_rule` or `output_file`, but not both. # However, if you have a very unique use case, you can use both together. output_rule: true # output_file: /tmp/falco_stats.jsonl output_prometheus: # Will only be enabled when `metrics.enabled` is set to true enabled: false # Take a new metrics snapshot automatically after each Prometheus server scrape auto_refresh_enabled: false # Alternative to `auto_refresh_enabled` refreshes /metrics with new metrics snapshot every x interval; supersedes and is separate from the global `metrics.interval` setting refresh_interval: 5m resource_utilization_enabled: true state_counters_enabled: true kernel_event_counters_enabled: true libbpf_stats_enabled: true convert_memory_to_mb: true include_empty_values: false

userspace/falco/webserver.cpp

incertum · 2024-03-17T06:09:46Z

/milestone 0.38.0

userspace/falco/app/actions/start_webserver.cpp

FedeDP · 2024-03-21T09:56:00Z

falco.yaml

@@ -982,6 +982,7 @@ metrics:
  libbpf_stats_enabled: true
  convert_memory_to_mb: true
  include_empty_values: false
+  prometheus_enabled: false


Given that this is indeed exposed by the webserver, shouldn't it live under webserver:? perhaps we might want to specify that it is useful only when metrics are enabled.

Good point and agreed

One related question: should the endpoint name also be configurable ? It usually is /metrics but...

I'd leave /metrics for now. We can always make int configurable in the future if there is demand for it!

Right now we only have the heathz endpoint in our configuration 👇

falco/falco.yaml

Lines 662 to 699 in 12cd72a

# [Stable] `webserver`

#

# Falco supports an embedded webserver that runs within the Falco process,

# providing a lightweight and efficient way to expose web-based functionalities

# without the need for an external web server. The following endpoints are

# exposed:

# - /healthz: designed to be used for checking the health and availability of

# the Falco application (the name of the endpoint is configurable).

# - /versions: responds with a JSON object containing the version numbers of the

# internal Falco components (similar output as `falco --version -o

# json_output=true`).

#

# Please note that the /versions endpoint is particularly useful for other Falco

# services, such as `falcoctl`, to retrieve information about a running Falco

# instance. If you plan to use `falcoctl` locally or with Kubernetes, make sure

# the Falco webserver is enabled.

#

# The behavior of the webserver can be controlled with the following options,

# which are enabled by default:

#

# The `ssl_certificate` option specifies a combined SSL certificate and

# corresponding key that are contained in a single file. You can generate a

# key/cert as follows:

#

# $ openssl req -newkey rsa:2048 -nodes -keyout key.pem -x509 -days 365 -out

# certificate.pem $ cat certificate.pem key.pem > falco.pem $ sudo cp falco.pem

# /etc/falco/falco.pem

webserver:

enabled: true

# When the `threadiness` value is set to 0, Falco will automatically determine

# the appropriate number of threads based on the number of online cores in the system.

threadiness: 0

listen_port: 8765

# Can be an IPV4 or IPV6 address, defaults to IPV4

listen_address: 0.0.0.0

k8s_healthz_endpoint: /healthz

ssl_enabled: false

ssl_certificate: /etc/falco/falco.pem

I believe we have to reaudit our configuration before making a decision, so I agree to leave the configuration part out of this PR. We can add it later, eventually.

sgaist · 2024-03-26T22:16:26Z

There's something that is escaping me with regard to the sinsp object initialization. It seems that when the webserver is started, which happens very late in the process, said sinsp objects are not fully initialized.

incertum · 2024-03-27T18:46:31Z

Samuel, pulled your branch and wanted to test drive it, but it doesn't compile. Mind fixing the build? Thank you! I'll review the changes then. In addition, we still need to agree on the design wrt refreshing the metrics, @leogr to compile a final proposal based on suggestions we have so far. Thanks.

This endpoint currently returns only prometheus metrics. Signed-off-by: Samuel Gaist <samuel.gaist@idiap.ch>

… documentation The cross reference makes it easier to pair the web server and the metrics configuration elements. Signed-off-by: Samuel Gaist <samuel.gaist@idiap.ch>

…otonic Signed-off-by: Samuel Gaist <samuel.gaist@idiap.ch>

incertum

/approve

Let's defer further changes to a follow up PR. Thanks @sgaist!

poiana · 2024-04-24T09:43:23Z

LGTM label has been added.

Git tree hash: 7266865d0b743e48936d17fda8c04dabcbaf4973

incertum · 2024-04-24T09:45:14Z

Now just need to check CI fixes.

userspace/falco/falco_metrics.cpp

The textual content was fixed but not the metrics name. Co-authored-by: Melissa Kilby <melissa.kilby.oss@gmail.com> Signed-off-by: Samuel Gaist <samuel.gaist@idiap.ch>

incertum

/approve

poiana · 2024-04-29T08:15:14Z

LGTM label has been added.

Git tree hash: b0bf6a5ca9cc69fc00fd9bae353af6d4009cb5fd

leogr

SGTM too.

I am waiting for someone of the previous reviewer to take another look before giving the final approval

leogr · 2024-04-30T16:10:36Z

/assign

FedeDP · 2024-05-02T07:07:56Z

falco.yaml

@@ -695,6 +695,9 @@ webserver:
  # Can be an IPV4 or IPV6 address, defaults to IPV4
  listen_address: 0.0.0.0
  k8s_healthz_endpoint: /healthz
+  # Enable the metrics endpoint providing Prometheus values
+  # It will only have an effect if metrics.enabled is set to true as well.
+  prometheus_metrics_enabled: false


@leogr do we need to enforce a feat status for this? Like Incubating ?

It would be preferable, IMO. It is not a blocker for this PR anyway. We may reaudit all options later, but still before the 0.38 release. Does it make sense?

cc @falcosecurity/falco-maintainers

@leogr proposing to merge it as is and address the follow up items in a new PR. I believe we need another touch up PR anyways.

FedeDP

/approve

poiana · 2024-05-03T09:22:52Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: FedeDP, incertum, sgaist

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [FedeDP,incertum]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

FedeDP · 2024-05-03T09:23:08Z

Thank you very much Samuel! You did a terrific job!

leogr · 2024-05-08T08:39:04Z

👏

poiana added release-note dco-signoff: yes kind/feature area/engine labels Mar 15, 2024

poiana requested review from Kaizhe and leogr March 15, 2024 21:37

poiana added the size/M label Mar 15, 2024

incertum reviewed Mar 15, 2024

View reviewed changes

poiana added this to the 0.38.0 milestone Mar 17, 2024

FedeDP reviewed Mar 19, 2024

View reviewed changes

userspace/falco/app/actions/start_webserver.cpp Outdated Show resolved Hide resolved

poiana added dco-signoff: no and removed dco-signoff: yes labels Mar 19, 2024

sgaist force-pushed the 3131_implement_metrics_endpoint branch from 885cefa to d3775ae Compare March 19, 2024 22:43

poiana added dco-signoff: yes and removed dco-signoff: no labels Mar 19, 2024

FedeDP reviewed Mar 21, 2024

View reviewed changes

poiana added dco-signoff: no size/XL and removed dco-signoff: yes size/M labels Mar 26, 2024

sgaist force-pushed the 3131_implement_metrics_endpoint branch from 37b8085 to 667b84b Compare March 26, 2024 22:10

poiana added dco-signoff: yes and removed dco-signoff: no labels Mar 26, 2024

sgaist force-pushed the 3131_implement_metrics_endpoint branch from 667b84b to 5374769 Compare March 28, 2024 08:27

poiana added dco-signoff: no and removed dco-signoff: yes labels Mar 28, 2024

feat(webserver): implement metrics endpoint

aa4ae6f

This endpoint currently returns only prometheus metrics. Signed-off-by: Samuel Gaist <samuel.gaist@idiap.ch>

sgaist added 2 commits April 24, 2024 10:46

chore(configuration): add reference to Prometheus endpoint in metrics…

b587754

… documentation The cross reference makes it easier to pair the web server and the metrics configuration elements. Signed-off-by: Samuel Gaist <samuel.gaist@idiap.ch>

fix(falco_metrics): make duration_sec and outputs_queue_num_drops mon…

815af40

…otonic Signed-off-by: Samuel Gaist <samuel.gaist@idiap.ch>

sgaist force-pushed the 3131_implement_metrics_endpoint branch from 4e38520 to 815af40 Compare April 24, 2024 08:51

incertum previously approved these changes Apr 24, 2024

View reviewed changes

poiana assigned incertum Apr 24, 2024

poiana added the lgtm label Apr 24, 2024

poiana added the approved label Apr 24, 2024

incertum reviewed Apr 27, 2024

View reviewed changes

userspace/falco/falco_metrics.cpp Outdated Show resolved Hide resolved

fix(falco_metrics): remove falco_ prefix for version

77a6a71

The textual content was fixed but not the metrics name. Co-authored-by: Melissa Kilby <melissa.kilby.oss@gmail.com> Signed-off-by: Samuel Gaist <samuel.gaist@idiap.ch>

sgaist dismissed incertum’s stale review via 77a6a71 April 29, 2024 07:44

poiana removed the lgtm label Apr 29, 2024

poiana requested a review from incertum April 29, 2024 07:44

incertum approved these changes Apr 29, 2024

View reviewed changes

poiana added the lgtm label Apr 29, 2024

leogr reviewed Apr 30, 2024

View reviewed changes

poiana assigned leogr Apr 30, 2024

FedeDP reviewed May 2, 2024

View reviewed changes

FedeDP approved these changes May 3, 2024

View reviewed changes

poiana assigned FedeDP May 3, 2024

poiana merged commit cbfe77d into falcosecurity:master May 3, 2024
33 checks passed

This was referenced May 14, 2024

new(metrics): add rules_counters_enabled option #3192

Merged

[TRACKING] metrics framework future refactors / cleanups #3194

Open

sgaist deleted the 3131_implement_metrics_endpoint branch May 17, 2024 07:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(webserver): implement metrics endpoint #3140

feat(webserver): implement metrics endpoint #3140

sgaist commented Mar 15, 2024 •

edited by leogr

incertum left a comment

incertum Mar 15, 2024

incertum Mar 15, 2024

incertum Mar 17, 2024

FedeDP Mar 19, 2024 •

edited

incertum Mar 19, 2024

sgaist Mar 19, 2024

incertum Mar 19, 2024

FedeDP Mar 20, 2024

incertum Mar 20, 2024

incertum commented Mar 17, 2024

FedeDP Mar 21, 2024 •

edited

sgaist Mar 21, 2024

FedeDP Mar 22, 2024

leogr Mar 22, 2024

sgaist commented Mar 26, 2024

incertum commented Mar 27, 2024

incertum left a comment

poiana commented Apr 24, 2024

incertum commented Apr 24, 2024

incertum left a comment

poiana commented Apr 29, 2024

leogr left a comment

leogr commented Apr 30, 2024

FedeDP May 2, 2024

leogr May 2, 2024

incertum May 2, 2024

FedeDP left a comment

poiana commented May 3, 2024

FedeDP commented May 3, 2024

leogr commented May 8, 2024

	# [Stable] `webserver`
	#
	# Falco supports an embedded webserver that runs within the Falco process,
	# providing a lightweight and efficient way to expose web-based functionalities
	# without the need for an external web server. The following endpoints are
	# exposed:
	# - /healthz: designed to be used for checking the health and availability of
	# the Falco application (the name of the endpoint is configurable).
	# - /versions: responds with a JSON object containing the version numbers of the
	# internal Falco components (similar output as `falco --version -o
	# json_output=true`).
	#
	# Please note that the /versions endpoint is particularly useful for other Falco
	# services, such as `falcoctl`, to retrieve information about a running Falco
	# instance. If you plan to use `falcoctl` locally or with Kubernetes, make sure
	# the Falco webserver is enabled.
	#
	# The behavior of the webserver can be controlled with the following options,
	# which are enabled by default:
	#
	# The `ssl_certificate` option specifies a combined SSL certificate and
	# corresponding key that are contained in a single file. You can generate a
	# key/cert as follows:
	#
	# $ openssl req -newkey rsa:2048 -nodes -keyout key.pem -x509 -days 365 -out
	# certificate.pem $ cat certificate.pem key.pem > falco.pem $ sudo cp falco.pem
	# /etc/falco/falco.pem
	webserver:
	enabled: true
	# When the `threadiness` value is set to 0, Falco will automatically determine
	# the appropriate number of threads based on the number of online cores in the system.
	threadiness: 0
	listen_port: 8765
	# Can be an IPV4 or IPV6 address, defaults to IPV4
	listen_address: 0.0.0.0
	k8s_healthz_endpoint: /healthz
	ssl_enabled: false
	ssl_certificate: /etc/falco/falco.pem

feat(webserver): implement metrics endpoint #3140

feat(webserver): implement metrics endpoint #3140

Conversation

sgaist commented Mar 15, 2024 • edited by leogr

incertum left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

FedeDP Mar 19, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

incertum commented Mar 17, 2024

FedeDP Mar 21, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sgaist commented Mar 26, 2024

incertum commented Mar 27, 2024

incertum left a comment

Choose a reason for hiding this comment

poiana commented Apr 24, 2024

incertum commented Apr 24, 2024

incertum left a comment

Choose a reason for hiding this comment

poiana commented Apr 29, 2024

leogr left a comment

Choose a reason for hiding this comment

leogr commented Apr 30, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

FedeDP left a comment

Choose a reason for hiding this comment

poiana commented May 3, 2024

FedeDP commented May 3, 2024

leogr commented May 8, 2024

sgaist commented Mar 15, 2024 •

edited by leogr

FedeDP Mar 19, 2024 •

edited

FedeDP Mar 21, 2024 •

edited