Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Infrastructure UI] New Hosts API #154179

Closed
wants to merge 25 commits into from

Conversation

crespocarlos
Copy link
Contributor

@crespocarlos crespocarlos commented Mar 31, 2023

closes: #152103
fixes: #151768

Summary

This PR adds a new API to return host metrics for the Hosts View page. There are some constraints to what Elasticsearch can do in terms of sorting. The main query runs a terms aggregation, whose size is dynamic.

The caveat of this solution is that instead of sorting by the actual metric aggregation, it sorts by either a min or max aggregation of a field. The doc mentions that this is the safest way to sort by a sub-aggregation. This could, however, potentially return a wrong order, but due to Elasticsearch limitations, it's much more likely to return wrong results sorting by an avg aggregation.

Comparison

The comparison below was done with a load of 5 days worth of data, containing 500 hosts on each day, on my local env. For short date ranges, Snapshot API performs consistently better, and the new approach performs better with a large amount of data.

Snapshot API

Returns all 500 hosts

15 minutes
image

1 day
image

1 week
image

Profiler

The profiler results was taken running a query with a 1 day date range.

image image image image

Hosts API

Returns up to 100 hosts

15 minutes
image

1 day
image

1 week
image

Profiler

The profiler results was taken running a query with a 1 day date range.

image image

How to test

curl --location -u elastic:changeme 'http://0.0.0.0:5601/ftw/api/metrics/hosts' \
--header 'kbn-xsrf: xxxx' \
--header 'Content-Type: application/json' \
--data '{
    "limit": 100,
    "metrics": [
        {
            "type": "rx"
        },
        {
            "type": "tx"
        },
        {
            "type": "memory"
        },
        {
            "type": "cpu"
        },
        {
            "type": "diskLatency"
        },
        {
            "type": "memoryTotal"
        }
    ],
    "query": {
        "bool": {
            "must": [],
            "filter": [],
            "should": [],
            "must_not": []
        }
    },
    "timeRange": {
        "from": 1680040800000,
        "to":   1680645600000
    },
    "sortField": "rx",
    "sortDirection": "desc",
    "sourceId": "default"
}'

@apmmachine
Copy link
Contributor

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@crespocarlos crespocarlos marked this pull request as ready for review April 5, 2023 12:57
@crespocarlos crespocarlos requested review from a team as code owners April 5, 2023 12:57
@crespocarlos crespocarlos added Feature:Metrics UI Metrics UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services release_note:skip Skip the PR/issue when compiling release notes backport:skip This commit does not require backporting Feature:ObsHosts Hosts feature within Observability v8.8.0 labels Apr 5, 2023
@elasticmachine
Copy link
Contributor

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@crespocarlos crespocarlos changed the title 152103 new hosts api [Infrastructure UI] New Hosts API Apr 5, 2023
Copy link
Member

@jbudz jbudz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ftr_configs.yml

@@ -0,0 +1,53 @@
/*
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crespocarlos
Copy link
Contributor Author

Hey @tonyghiani ! I'll answer these questions before addressing your comments.

Changing sourceId to something meaningless doesn’t break the request and silently returns an empty array. Shouldn’t it be a 400 Bad request informing about a missing source id? Should the sourceId be a path param instead of a body param?

Hm, it'd be confusing to have this as path params, but it's fair to say that returning a 400 if sourceId Is invalid.

Timerange only accepts timestamps? Can’t we make it flexible to accept also dates and relative formats as other APIs do? It could be easier for us to deal with it at the moment of passing relative dates.

I rather not give so much flexibility here. If the client passes relative dates e.g: gte: now-15m, lte: now, elasticsearch won't be able to cache the results, because now will change each time the query is executed (see this). Caching will be particularly importing if you sort by any column, we want elasticsearch to cache results as much as possible.

In metadata host.os.name is null, but cloud.provider is an empty string. Shouldn’t be consistent and return null as well if missing?

yeah. makes sense

I had some first issues using the API before reading the code, I haven't clear what is the format type of each returned metric and what params I could use. Would be awesome if you could add documentation for the endpoint and some usage examples.

Can’t sort by host.os.name? This is currently a feature we have in the client, would it break if we can't sort by this field?

Unfortunately, it's not possible because it's a top_metric aggregation. This column will have to be removed from the table, or at least not let the user sort by it.

Looks like the filtering is not working with some hosts. I tried on a lite-edge cluster to filter by one of the hosts that are currently in the table (opbeans-php-d994b56-5ndtq) but it returned an empty array as if this hostname didn't exist. For other hosts it worked correctly.

I'll try to reproduce the issue.

@crespocarlos
Copy link
Contributor Author

crespocarlos commented Apr 6, 2023

I’m wondering if we could sort by multiple fields if necessary. Having an API similar to the ES one, we could potentially sort the results by multiple fields easily. Something like a param sort: [{rx: “desc”}, {tx: “asc”}]

I don't think terms aggs supports this. Besides, it would complicate things on the query side and I don't even know the performance implications of doing that. If this turns out to be a use-case in the future, we can look into whether or not that's possible.

Removing the query body option fails the request. Shouldn’t be optional and return all the limited results if no query is applied?

Yeah. this is a valid point

@tonyghiani
Copy link
Contributor

Hey @crespocarlos thanks for clarifying all the previous questions!
Just one more doubt regarding

I rather not give so much flexibility here. If the client passes relative dates e.g: gte: now-15m, lte: now, elasticsearch won't be able to cache the results, because now will change each time the query is executed (see elastic/elasticsearch#4947). Caching will be particularly importing if you sort by any column, we want elasticsearch to cache results as much as possible.

When the timerange is relative (gte: now-15m, lte: now) when still set from the client timestamps differents each time the request is performed, so the query won't be cached anyway right? I guess what elasticsearch caches is absolute dates, while for the relative ones won't be able to do it, but this should be reflected anyway on how we pass the time range from the client. However, it was mostly for having more flexibility on this front, not a blocking feature request for me 👌

@crespocarlos
Copy link
Contributor Author

crespocarlos commented Apr 6, 2023

When the timerange is relative (gte: now-15m, lte: now) when still set from the client timestamps differents each time the request is performed

Indeed, but if you sort by a column, we wouldn't necessarily pass a new timestamp. We would use what's stored in the state and just pass the sortField and sortDirection. The first time a query with a new sortDirection is executed, it won't return a cached result, but if the request is repeated, the server would return what's cached.

On the usability side, it's not great that we need to calculate the timestamp on the frontend. But we would need at least a format that looks like strict_date_optional_time

@crespocarlos
Copy link
Contributor Author

Looks like the filtering is not working with some hosts. I tried on a lite-edge cluster to filter by one of the hosts that are currently in the table (opbeans-php-d994b56-5ndtq) but it returned an empty array as if this hostname didn't exist. For other hosts it worked correctly.

The filter was filtering too much. There was a clause to filter event.module: "system"

@crespocarlos
Copy link
Contributor Author

Changing sourceId to something meaningless doesn’t break the request and silently returns an empty array. Shouldn’t it be a 400 Bad request informing about a missing source id? Should the sourceId be a path param instead of a body param?

Passing a random sourceId is not invalid per se. If InfraSources has trouble returning a source configuration based on the sourceId, it in the end returns a fallback object, which might work in some cases. I'd probably not return a 404 in this case.

@tonyghiani
Copy link
Contributor

Passing a random sourceId is not invalid per se. If InfraSources has trouble returning a source configuration based on the sourceId, it in the end returns a fallback object, which might work in some cases. I'd probably not return a 404 in this case.

This is why I also thought about passing the sourceId in the path. Similar to how the GET /api/metrics/source/${sourceId} works, when a passed sourceId doesn't match any source it returns a 404 because the resources don't exist.
It is pretty unopinionated, I do see this endpoint as "given a sourceId, retrieve hosts metrics for the related source", which is why I'm thinking to pass it as a param and return 404 in case there is no match, but we could use some other opinion by the team.

@crespocarlos
Copy link
Contributor Author

crespocarlos commented Apr 6, 2023

This is why I also thought about passing the sourceId in the path. Similar to how the GET /api/metrics/source/${sourceId} works, when a passed sourceId doesn't match any source it returns a 404 because the resources don't exist.
It is pretty unopinionated, I do see this endpoint as "given a sourceId, retrieve hosts metrics for the related source", which is why I'm thinking to pass it as a param and return 404 in case there is no match, but we could use some other opinion by the team.

I'm not really familiar with this endpoint, but isn't it supposed to return the configuration based on the sourceId - meaning that we're looking at the source collection and returning it by an id?

In this case, I wouldn't say hosts is a resource from source collection for it to be in a path param. Of course, this is open to interpretation, but IMO it would also mean that every other endpoint that receives a sourceId in infra would be part of the source collection.

You could on the other hand hypothetically have something like /api/metrics/hosts/host-0 - which seems to be the same idea in the /metrics/source endpoint.

Getting back to the fallback object used when the sourceId is invalid. It returns a default indexPattern value and it might as well work. So at first glance, it seems that it would require some validation to be able to tell the user what went wrong.

@tonyghiani
Copy link
Contributor

I'm not really familiar with this endpoint, but isn't it supposed to return the configuration based on the sourceId - meaning that we're looking at the source collection and returning it by an id?

In this case, I wouldn't say hosts is a resource from source collection for it to be in a path param. Of course, this is open to interpretation, but IMO it would also mean that every other endpoint that receives a sourceId in infra would be part of the source collection.

You could on the other hand hypothetically have something like /api/metrics/hosts/host-0 - which seems to be the same idea in the /metrics/source endpoint.

Yep agree, I guess what concern me is not explicitly seeing why I'm not getting any metric if the sourceId is misspelt or missing

@@ -0,0 +1,92 @@
# Infra Hosts API
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish we had open-api docs in place

@crespocarlos
Copy link
Contributor Author

Yep agree, I guess what concern me is not explicitly seeing why I'm not getting any metric if the sourceId is misspelt or missing

yeah. I get it. It's frustrating. My first thought to solve that is to let the client send what's the index pattern the API needs to use - which could have typos too or refer to a non-existing index. It will make the problem more noticeable, maybe?

I'll try to see if there is any combination of conditions we could use to return a 404.

@crespocarlos
Copy link
Contributor Author

@elasticmachine merge upstream

@crespocarlos
Copy link
Contributor Author

@elasticmachine merge upstream

@crespocarlos
Copy link
Contributor Author

@elasticmachine merge upstream

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
apm 1275 1278 +3
infra 1328 1330 +2
profiling 138 141 +3
ux 142 145 +3
total +11

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
@kbn/io-ts-utils 24 32 +8

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
apm 3.4MB 3.4MB +3.7KB
infra 2.0MB 2.0MB +6.0B
profiling 269.3KB 273.0KB +3.7KB
ux 161.0KB 164.7KB +3.7KB
total +11.1KB
Unknown metric groups

API count

id before after diff
@kbn/io-ts-utils 24 32 +8

ESLint disabled line counts

id before after diff
securitySolution 433 436 +3

Total ESLint disabled count

id before after diff
securitySolution 512 515 +3

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@crespocarlos
Copy link
Contributor Author

Closing this PR to rethink the solution

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting Feature:Metrics UI Metrics UI feature Feature:ObsHosts Hosts feature within Observability release_note:skip Skip the PR/issue when compiling release notes Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services v8.8.0
Projects
None yet
7 participants