chore: update BuilderQuery struct and add PrepareTimeseriesFilterQuery #4165
Conversation
I added a comprehensive description that sets the stage for the dozen PRs I have in the pipeline for the new metrics builder changes. The examples illustrate what happens in the background for a small chosen set of raw data, which should help you understand how metrics work. Please go through it, and let me know if there are any questions on anything, not just the changes in this PR. A part of this description will also go into the docs. My goal is to make you understand first, since you are one of the end users.
Can there be different temporalities for the same metric_name? If yes, should we move temporality to be the 1st sorting key? Regarding the INNER JOIN: say we do A (1000 rows) INNER JOIN B (100 rows); the intersection would be 100 rows. If we are interested in those 100 rows, I think A will have many extra fingerprints, since label filtering is not available in the samples table. Does it affect performance? cc @dhawal1248 For the above query, can we use:
instead of below
It is configurable today. It was planned to be configurable from the frontend as well, but due to a bug in the frontend we disabled it in the UI.
Usually no; the exception is when someone is transitioning from one temporality to the other. They could send the same metrics with different temporalities to be backfilled, and then eventually send only one temporality. We are going to do the same for span metrics; see SigNoz/charts#355.
It does affect performance; ClickHouse doesn't shine at JOINs. The "ClickHouse way" of doing things is using wide tables. The temporality is part of the ORDER BY for the v3 table: https://github.com/SigNoz/signoz-otel-collector/blob/1fe5faae2cfef2e32ee0f5021a532c10436f7a5b/migrationmanager/migrators/metrics/migrations/000001_init_db.up.sql#L43-L53. We are going to move to this table soon.
No, when there is a group by, the result should include the group-by labels. That's not possible with IN because there are no labels on the samples table. This is an invalid query.
@srikanthccv can you share the code link for where we start this query prep?
By this I assume you are asking what exists in production today. This is the entry point.
I am going to merge this, but you can review and ask any questions.
Summary
Part 1 of #4016
Overview
Metric types
The primary metric types supported are:
Counter: A counter is a (cumulative/delta) metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.
Gauge: A gauge is a metric that represents a single numerical value that can arbitrarily change. It can go up and down. Gauges are typically used for measured values like temperatures or current memory usage, but also "counts" that can go up and down, like the number of concurrent requests.
Histogram: A histogram samples observations (usually things like request durations) and (cumulative/delta) counts them in configurable buckets. This allows for aggregatable calculation of quantiles.
Temporality
Temporality is the state of existing within, or having some relationship with, time. In the context of metrics, it describes how the metric value relates to time. There are two types of temporality:
Cumulative: Cumulative metrics represent a monotonically increasing value. Cumulative metrics are always non-negative floating-point numbers and are only reset when the process restarts.
Delta: Delta metrics are the difference between the current value and the previous value. Delta metrics are always non-negative floating-point numbers.
Both cumulative and delta metrics are supported by the metrics service. We strongly recommend using delta temporality whenever possible.
Cumulative Counter
A cumulative counter represents a monotonically increasing count over time, reset only on restart.
Example: Total number of requests served.
In this table, each row after 00:00 shows the cumulative count of requests served since the 00:00 report. For instance, at 00:20, there were 12 requests served since the 00:00 report.
Delta Counter
A delta counter shows the difference in count since the last report.
Example: Number of new requests served since last report.
In this table, each row after 00:00 shows the count of new requests served since the last report. For instance, at 00:20, there were 7 new requests served since the 00:10 report.
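The relationship between the two temporalities can be sketched in a few lines of Python. The sample values mirror the request counts quoted above (0, 5, and 12 at 00:00, 00:10, and 00:20); `cumulative_to_delta` is an illustrative helper, not SigNoz code:

```python
# Cumulative counter readings at 00:00, 00:10, 00:20
# (requests served since the 00:00 report).
cumulative = [0, 5, 12]

def cumulative_to_delta(points):
    """Difference between consecutive cumulative readings.

    A drop in the cumulative value indicates a counter reset
    (process restart); in that case the new reading itself is
    taken as the delta.
    """
    deltas = []
    for prev, curr in zip(points, points[1:]):
        deltas.append(curr - prev if curr >= prev else curr)
    return deltas

print(cumulative_to_delta(cumulative))  # [5, 7]
```

The two deltas match the delta-counter example: 5 new requests at 00:10 and 7 new requests at 00:20.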
Gauge
A gauge represents a value that can increase or decrease over time.
Example: Current number of active sessions.
In this table, each row after 00:00 shows the current number of active sessions. For instance, at 00:20, there were 5 active sessions.
Cumulative Histogram
A cumulative histogram represents a monotonically increasing count of observations over time, reset only on restart.
Example: Response times categorized in buckets (e.g., <100ms, 100-200ms, 200-300ms, >300ms).
In this table, each row after 00:00 shows the cumulative count of observations in each response time bucket. For instance, at 00:20, there were 10 observations with response times under 100ms, 2 observations with response times between 100-200ms, and 0 observations with response times between 200-300ms since the 00:00 report.
Delta Histogram
A delta histogram also counts observations in buckets, but the counts are the difference since the last report.
Example: New response times in the same buckets.
In this table, each row after 00:00 shows the count of new observations in each response time bucket since the last report. For instance, at 00:20, there were 5 new observations with response times under 100ms and 2 new observations with response times between 100-200ms since the 00:10 report.
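The same cumulative-to-delta relationship holds per bucket. A small sketch, using the bucket counts implied by the figures above (at 00:10 the cumulative counts work out to 5, 0, and 0; the helper name is illustrative, not SigNoz code):

```python
# Cumulative bucket counts at 00:10 and 00:20 for the
# response-time histogram (<100ms, 100-200ms, 200-300ms).
at_00_10 = {"<100ms": 5, "100-200ms": 0, "200-300ms": 0}
at_00_20 = {"<100ms": 10, "100-200ms": 2, "200-300ms": 0}

def bucket_deltas(prev, curr):
    """Per-bucket difference between two cumulative reports."""
    return {bucket: curr[bucket] - prev[bucket] for bucket in curr}

print(bucket_deltas(at_00_10, at_00_20))
# {'<100ms': 5, '100-200ms': 2, '200-300ms': 0}
```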
Time and Spatial Aggregation Explained for Metrics Data
This document clarifies the concepts of time and spatial aggregation in the context of metrics data analysis.
Time Aggregation
Time aggregation reduces each individual time series along the time axis: the raw data points in each time window are combined into a single representative value using an operator such as `avg`, `sum`, `min`, `max`, `count`, etc.
Spatial Aggregation
Spatial aggregation reduces across time series: the values of several series are combined, grouped by a subset of labels, using an operator such as `avg`, `sum`, `min`, `max`, `count`, etc.
The following table represents the metrics data from five hosts `h1`, `h2`, `h3`, `h4`, and `h5` spread across the `r1`, `r2`, and `r3` regions. Assume the reported value is memory usage in MB for each host. The timestamp is in mm:ss (minute:second) format and ranges from the 10th minute 00 seconds to the 12th minute 30 seconds, with a collection interval of 10 seconds. Region `r1` has two hosts, `h1` and `h2`; region `r2` has one host, `h3`; and region `r3` has two hosts, `h4` and `h5`. The metrics data is collected for 150 seconds.
We can't display the raw data since it is too big, so we first perform the aggregation on the time axis for each unique series. There are 5 time series in the above table. We could use the aggregation operator `avg` to get a representative value for each 30-second window. The aggregation result is shown in the following table.
Even this table could be too big to display if there were hundreds of hosts. The spatial aggregation is performed on the result of the time aggregation. We could use the aggregation operator `sum` to get the total memory usage.
Total Memory Usage for Each Region
Total Memory Usage for Each Host
Note: this table is the same as the time aggregation result because each host is unique and there are no sub time series for each host. Other metrics, such as disk usage, could have sub time series for each host (usage from each partition); in that case, the spatial aggregation result could differ from the time aggregation result.
Total Memory Usage from All Hosts
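Putting the two stages together, a minimal sketch in Python (the host/region layout matches the example above; the memory-usage readings themselves are made up for illustration):

```python
# Region label per host, as in the example above.
region = {"h1": "r1", "h2": "r1", "h3": "r2", "h4": "r3", "h5": "r3"}

# Raw samples per host: illustrative memory usage in MB,
# one reading every 10s over a single 30s window.
raw = {
    "h1": [100, 110, 120],
    "h2": [200, 210, 220],
    "h3": [300, 300, 300],
    "h4": [50, 60, 70],
    "h5": [80, 80, 80],
}

# Step 1: time aggregation -- one `avg` value per series per 30s window.
time_agg = {host: sum(vals) / len(vals) for host, vals in raw.items()}

# Step 2: spatial aggregation -- `sum` the per-host values by region.
by_region = {}
for host, value in time_agg.items():
    by_region[region[host]] = by_region.get(region[host], 0) + value

print(time_agg["h1"])   # 110.0
print(by_region["r1"])  # 320.0
```

Summing `time_agg` over all hosts instead of per region would give the "Total Memory Usage from All Hosts" variant.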
Default aggregation operators
Based on the metric type, the default time and space aggregation operators are chosen. The following table shows the default aggregation operators for each metric type.
Histograms are a special case because the value isn't a single number but a group of numbers. The most common use case is to calculate the quantiles. The current implementation supports the following quantiles: 0.5, 0.9, 0.95, 0.99. The time and space aggregation produces the distribution of observations in each bucket. The quantiles are calculated from the distribution.
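As a rough illustration of that last step, here is a simplified sketch of estimating a quantile from bucket counts by linear interpolation within the bucket where the quantile falls. This is the usual histogram-quantile approach, not the SigNoz implementation (which computes it in ClickHouse); the bucket bounds and counts are made up:

```python
def quantile_from_buckets(bounds, counts, q):
    """Estimate the q-quantile from per-bucket observation counts.

    `bounds[i]` is the upper bound of bucket i; the first bucket is
    assumed to start at 0. The value is linearly interpolated inside
    the bucket containing the quantile rank.
    """
    total = sum(counts)
    rank = q * total      # number of observations below the quantile
    seen = 0.0
    lower = 0.0
    for upper, count in zip(bounds, counts):
        if seen + count >= rank and count > 0:
            # Interpolate within this bucket.
            return lower + (upper - lower) * (rank - seen) / count
        seen += count
        lower = upper
    return float(bounds[-1])

# Response-time buckets (ms): <=100, <=200, <=300, with made-up counts.
bounds = [100, 200, 300]
counts = [50, 30, 20]
print(quantile_from_buckets(bounds, counts, 0.5))  # 100.0
print(quantile_from_buckets(bounds, counts, 0.9))  # 250.0
```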
Implementation Details
The schema of the metrics database tables is as follows:
We are using the hash of the labels to generate the fingerprint.
be one of the following values:
Query preparation
As there are two tables for metrics, any query on metrics will need to join them. First, we get the fingerprints of the series that match the query criteria from the `time_series_v2` table. Then, we join the `samples_v2` table with the `time_series_v2` table to get the actual metric values. The query preparation involves three to four steps; at its core, we:
1. Get the fingerprints of the matching series from the `time_series_v2` table.
2. Join the `samples_v2` table with the `time_series_v2` table to get the actual metric values.
A typical query looks like the following:
The query can be broken down into the same steps: filter the `time_series_v2` table for fingerprints, then join the `samples_v2` table against the result. This is a simple example; things get a little complicated when we need to compute rates and percentiles.
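The two-step preparation can be sketched as a query-string builder. This is an illustrative sketch only, not the actual `PrepareTimeseriesFilterQuery` implementation: the table and column names (`signoz_metrics.time_series_v2` with `metric_name`, `fingerprint`, `labels`; `signoz_metrics.samples_v2` with `timestamp_ms`, `value`) follow the schema described above, and real code must use parameterized values rather than string interpolation:

```python
def prepare_query(metric_name, label_filters, start_ms, end_ms):
    """Sketch of the two-step query described above: filter the
    time series table for matching fingerprints, then join the
    samples table against that subquery. Not the SigNoz builder."""
    conditions = " AND ".join(
        f"JSONExtractString(labels, '{key}') = '{value}'"
        for key, value in label_filters.items()
    )
    # Step 1: subquery over the time series table. Label filtering
    # happens here, since the samples table carries no labels.
    filter_subquery = (
        "SELECT fingerprint, labels FROM signoz_metrics.time_series_v2"
        f" WHERE metric_name = '{metric_name}'"
        + (f" AND {conditions}" if conditions else "")
    )
    # Step 2: join the samples table with the filtered series.
    return (
        "SELECT ts.labels, s.timestamp_ms, s.value"
        " FROM signoz_metrics.samples_v2 AS s"
        f" INNER JOIN ({filter_subquery}) AS ts USING (fingerprint)"
        f" WHERE s.metric_name = '{metric_name}'"
        f" AND s.timestamp_ms BETWEEN {start_ms} AND {end_ms}"
    )

print(prepare_query("http_requests_total", {"region": "r1"}, 0, 60000))
```

The time and spatial aggregation operators discussed earlier would then be applied on top of this joined result.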
The major changes in the metrics builder improvements are:
I will be sending a series of PRs to implement these changes. This PR is the first one in the series.