Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(webserver): implement metrics endpoint #3140

Merged
merged 22 commits into from May 3, 2024

Conversation

sgaist
Copy link
Contributor

@sgaist sgaist commented Mar 15, 2024

What type of PR is this?

Uncomment one (or more) /kind <> lines:

/kind bug

/kind cleanup

/kind design

/kind documentation

/kind failing-test

/kind feature

/kind release

Any specific area of the project related to this PR?

Uncomment one (or more) /area <> lines:

/area build

/area engine

/area tests

/area proposals

/area CI

What this PR does / why we need it:

This PR implements the /metrics endpoint in the webserver providing currently prometheus metrics.

Which issue(s) this PR fixes:

Fixes #3131
Fixes #1772

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

feat(webserver): a metrics endpoint has been added providing prometheus metrics.

It can be optionally enabled using the new `metrics.prometheus_enabled` configuration option.
It will only be activated if the `metrics.enabled` is true as well.

Copy link
Contributor

@incertum incertum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 nice thanks for getting this up so quickly!

metrics_collector.snapshot();
auto metrics_snapshot = metrics_collector.get_metrics();

for (auto& metric: metrics_snapshot)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also add all the wrapper metrics already in, see for example #3131 (comment) (see the libs unit tests) or the website https://falco.org/docs/metrics/falco-metrics/ or run Falco w/ the output rule metrics.

Or would you just want to do that in another PR? Also ok for me.

Some more notes:

For Prometheus we do not need the _prev metrics values and even the drop percentage is not needed as it can be calculated via Prometheus queries, hence not a 1:1 translation in terms of the wrapper or additional metrics fields.

In addition, we distinguish between falco. and scap. prefixes -- probably best to preserve that and as namespace I personally would prefer falcosecurity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a code organization point of view we can also iterate as it would be nice to not add metrics logic into the webserver file itself and rather maybe create a more formal falco_metrics_collector class or something else.

Or maybe you planned that already in a follow up PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional note I forgot: It appears we need to move the code logic to the stats_writer::collector::collector so that the metrics get refreshed and reposted to /metrics every x interval.

Copy link
Contributor

@FedeDP FedeDP Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition, we distinguish between falco. and scap. prefixes -- probably best to preserve that and as namespace I personally would prefer falcosecurity.

Agree with falcosecurity.

From what i see, we are refreshing the stats whenever the /metrics endpoint is called, instead of passing through the stats_writer::collector::collector.
This ensures that we follow Prometheus settings (ie: if one setups prometheus to scrape every 2s, we honor that even if metrics.interval is higher).
Don't know if this is the behavior we want though.
Otherwise, we might still follow metrics.interval and then when prometheus tries to scrape more often, we will offer stale data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see the dilemma. At the same time I also don't see a problem with just recommending to having the same Falco metrics interval as the Prometheus scrape interval? How is this solved in other projects? At the end of the day metrics time series are always discrete ... I would also be fine to refresh the metrics in the Prometheus case whenever /metrics was called as @FedeDP suggested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that comes to mind, doesn't the stats_writer class contain the required metrics generation logic ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metrics wrapper fields (the ones not part of the libs_metrics_collector) are easy to duplicate if needed, or we create some sort of more formal class falco-metrics_collector ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the same time I also don't see a problem with just recommending to having the same Falco metrics interval as the Prometheus scrape interval?

Yep, but:

  • /metrics endpoint is the only "interactive" metrics generator we have, since output_rule and output_file are passive. Therefore, it makes sense that is up to the caller to decide the frequency.
  • also, consider the case where prometheus is setup to scrape every 5seconds; consider also that output_file is enabled too. This has a cost in terms of IO perf if user has to set 5s metrics.interval in the config file

At the same time, i quite don't like having a metric that does not obey to metrics.interval though...

I would love to gather more feedback! cc @falcosecurity/falco-maintainers

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to just make everything configurable, for example, we could add sub metrics configs similar to below:

metrics:
  enabled: false
  interval: 1h
  # Typically, in production, you only use `output_rule` or `output_file`, but not both. 
  # However, if you have a very unique use case, you can use both together.
  output_rule: true
  # output_file: /tmp/falco_stats.jsonl
  output_prometheus:
    # Will only be enabled when `metrics.enabled` is set to true
    enabled: false
    # Take a new metrics snapshot automatically after each Prometheus server scrape
    auto_refresh_enabled: false
    # Alternative to `auto_refresh_enabled` refreshes /metrics with new metrics snapshot every x interval; supersedes and is separate from the global `metrics.interval` setting
    refresh_interval: 5m
  resource_utilization_enabled: true
  state_counters_enabled: true
  kernel_event_counters_enabled: true
  libbpf_stats_enabled: true
  convert_memory_to_mb: true
  include_empty_values: false

userspace/falco/webserver.cpp Outdated Show resolved Hide resolved
@incertum
Copy link
Contributor

/milestone 0.38.0

@poiana poiana added this to the 0.38.0 milestone Mar 17, 2024
falco.yaml Outdated
@@ -982,6 +982,7 @@ metrics:
libbpf_stats_enabled: true
convert_memory_to_mb: true
include_empty_values: false
prometheus_enabled: false
Copy link
Contributor

@FedeDP FedeDP Mar 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this is indeed exposed by the webserver, shouldn't it live under webserver:? perhaps we might want to specify that it is useful only when metrics are enabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point and agreed

One related question: should the endpoint name also be configurable ? It usually is /metrics but...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd leave /metrics for now. We can always make int configurable in the future if there is demand for it!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now we only have the heathz endpoint in our configuration 👇

falco/falco.yaml

Lines 662 to 699 in 12cd72a

# [Stable] `webserver`
#
# Falco supports an embedded webserver that runs within the Falco process,
# providing a lightweight and efficient way to expose web-based functionalities
# without the need for an external web server. The following endpoints are
# exposed:
# - /healthz: designed to be used for checking the health and availability of
# the Falco application (the name of the endpoint is configurable).
# - /versions: responds with a JSON object containing the version numbers of the
# internal Falco components (similar output as `falco --version -o
# json_output=true`).
#
# Please note that the /versions endpoint is particularly useful for other Falco
# services, such as `falcoctl`, to retrieve information about a running Falco
# instance. If you plan to use `falcoctl` locally or with Kubernetes, make sure
# the Falco webserver is enabled.
#
# The behavior of the webserver can be controlled with the following options,
# which are enabled by default:
#
# The `ssl_certificate` option specifies a combined SSL certificate and
# corresponding key that are contained in a single file. You can generate a
# key/cert as follows:
#
# $ openssl req -newkey rsa:2048 -nodes -keyout key.pem -x509 -days 365 -out
# certificate.pem $ cat certificate.pem key.pem > falco.pem $ sudo cp falco.pem
# /etc/falco/falco.pem
webserver:
enabled: true
# When the `threadiness` value is set to 0, Falco will automatically determine
# the appropriate number of threads based on the number of online cores in the system.
threadiness: 0
listen_port: 8765
# Can be an IPV4 or IPV6 address, defaults to IPV4
listen_address: 0.0.0.0
k8s_healthz_endpoint: /healthz
ssl_enabled: false
ssl_certificate: /etc/falco/falco.pem

I believe we have to reaudit our configuration before making a decision, so I agree to leave the configuration part out of this PR. We can add it later, eventually.

@sgaist
Copy link
Contributor Author

sgaist commented Mar 26, 2024

There's something that is escaping me with regard to the sinsp object initialization. It seems that when the webserver is started, which happens very late in the process, said sinsp objects are not fully initialized.

@incertum
Copy link
Contributor

Samuel, pulled your branch and wanted to test drive it, but it doesn't compile. Mind fixing the build? Thank you! I'll review the changes then. In addition, we still need to agree on the design wrt refreshing the metrics, @leogr to compile a final proposal based on suggestions we have so far. Thanks.

This endpoint currently returns only prometheus metrics.

Signed-off-by: Samuel Gaist <samuel.gaist@idiap.ch>
… documentation

The cross reference makes it easier to pair the web server and the
metrics configuration elements.

Signed-off-by: Samuel Gaist <samuel.gaist@idiap.ch>
…otonic

Signed-off-by: Samuel Gaist <samuel.gaist@idiap.ch>
@sgaist sgaist force-pushed the 3131_implement_metrics_endpoint branch from 4e38520 to 815af40 Compare April 24, 2024 08:51
incertum
incertum previously approved these changes Apr 24, 2024
Copy link
Contributor

@incertum incertum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Let's defer further changes to a follow up PR. Thanks @sgaist!

@poiana
Copy link

poiana commented Apr 24, 2024

LGTM label has been added.

Git tree hash: 7266865d0b743e48936d17fda8c04dabcbaf4973

@incertum
Copy link
Contributor

Now just need to check CI fixes.

The textual content was fixed but not the metrics name.

Co-authored-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
Signed-off-by: Samuel Gaist <samuel.gaist@idiap.ch>
Copy link
Contributor

@incertum incertum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@poiana poiana added the lgtm label Apr 29, 2024
@poiana
Copy link

poiana commented Apr 29, 2024

LGTM label has been added.

Git tree hash: b0bf6a5ca9cc69fc00fd9bae353af6d4009cb5fd

Copy link
Member

@leogr leogr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM too.

I am waiting for someone of the previous reviewer to take another look before giving the final approval

@leogr
Copy link
Member

leogr commented Apr 30, 2024

/assign

@@ -695,6 +695,9 @@ webserver:
# Can be an IPV4 or IPV6 address, defaults to IPV4
listen_address: 0.0.0.0
k8s_healthz_endpoint: /healthz
# Enable the metrics endpoint providing Prometheus values
# It will only have an effect if metrics.enabled is set to true as well.
prometheus_metrics_enabled: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leogr do we need to enforce a feat status for this? Like Incubating ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be preferable, IMO. It is not a blocker for this PR anyway. We may reaudit all options later, but still before the 0.38 release. Does it make sense?

cc @falcosecurity/falco-maintainers

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leogr proposing to merge it as is and address the follow up items in a new PR. I believe we need another touch up PR anyways.

Copy link
Contributor

@FedeDP FedeDP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@poiana
Copy link

poiana commented May 3, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: FedeDP, incertum, sgaist

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@poiana poiana merged commit cbfe77d into falcosecurity:master May 3, 2024
33 checks passed
@FedeDP
Copy link
Contributor

FedeDP commented May 3, 2024

Thank you very much Samuel! You did a terrific job!

@leogr
Copy link
Member

leogr commented May 8, 2024

👏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
8 participants