diff --git a/docs/reference/sql/functions/anomaly.md b/docs/reference/sql/functions/anomaly.md new file mode 100644 index 000000000..05b15c010 --- /dev/null +++ b/docs/reference/sql/functions/anomaly.md @@ -0,0 +1,235 @@ +--- +keywords: [anomaly detection, anomaly score, window functions, zscore, MAD, IQR, statistical functions] +description: Lists and describes the anomaly detection window functions available in GreptimeDB, including Z-Score, MAD, and IQR-based scoring. +--- + +# Anomaly Detection Functions + +GreptimeDB provides a set of statistical anomaly-scoring **window functions** that compute a numeric score reflecting how anomalous each row is relative to its window of values. +All three functions must be used with an `OVER` clause (window function syntax). + +:::tip +These functions return `NULL` when the window does not contain enough valid (non-NULL) data points. +A score of `0.0` means the value is not anomalous; a larger value indicates a stronger anomaly. +When the spread (stddev / MAD / IQR) is zero but the current value deviates from the window center, the returned score is `+inf`, meaning the deviation is infinitely anomalous. +::: + +## `anomaly_score_zscore` + +Computes a Z-Score-based anomaly score for each row in a window. + +**Formula:** `|x − mean| / stddev` + +**Minimum valid samples:** 2 (uses population stddev, i.e. dividing by n) + +```sql +anomaly_score_zscore(value) OVER (window_spec) +``` + +**Arguments:** + +- **value**: A numeric column or expression to evaluate. 
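As an illustration of the formula and minimum-sample rule above, here is a minimal Python sketch of the scoring logic. It is not GreptimeDB's implementation: the function name `zscore_anomaly` and its signature are hypothetical, and the sketch takes the window's values and the current row's value as separate arguments. The handling of `NULL` and zero spread follows the degenerate cases documented on this page.

```python
import math
from typing import Optional, Sequence


def zscore_anomaly(window: Sequence[Optional[float]], value: float) -> Optional[float]:
    """Hypothetical reference sketch of |x - mean| / population stddev."""
    # NULL (None) entries do not count as valid samples.
    valid = [v for v in window if v is not None]
    if len(valid) < 2:
        return None  # fewer than 2 valid points -> NULL

    mean = sum(valid) / len(valid)
    # Population variance: divide by n, not n - 1.
    variance = sum((v - mean) ** 2 for v in valid) / len(valid)
    stddev = math.sqrt(variance)
    deviation = abs(value - mean)

    if stddev == 0.0:
        # Zero spread: 0.0 if the value sits on the mean, +inf otherwise.
        return 0.0 if deviation == 0.0 else math.inf
    return deviation / stddev


print(zscore_anomaly([10.0], 10.0))        # None
print(zscore_anomaly([10.0, 11.0], 11.0))  # 1.0
print(zscore_anomaly([5.0, 5.0], 7.0))     # inf
```

The sketch compares the spread with exactly `0.0`, which is enough for these inputs; how a real engine treats near-zero spread caused by floating-point rounding is outside what this sketch assumes.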
+ +**Return type:** `DOUBLE` + +**Degenerate cases:** + +| Condition | Result | +|---|---| +| Fewer than 2 valid points in window | `NULL` | +| `stddev = 0` and `value = mean` | `0.0` | +| `stddev = 0` and `value ≠ mean` | `+inf` | +| Normal case | Finite non-negative `DOUBLE` | + +**Example:** + +```sql +SELECT + ts, + val, + anomaly_score_zscore(val) OVER ( + ORDER BY ts + ROWS BETWEEN 4 PRECEDING AND CURRENT ROW + ) AS zscore +FROM metrics +ORDER BY ts; +``` + +## `anomaly_score_mad` + +Computes a Median Absolute Deviation (MAD)-based anomaly score for each row in a window. +The MAD score is more robust than the Z-Score because the median and MAD are far less sensitive to extreme outliers than the mean and standard deviation. + +**Formula:** `|x − median| / (MAD × 1.4826)` + +The constant 1.4826 is a consistency factor that makes the MAD-based score asymptotically equivalent to the Z-Score for normally distributed data. + +**Minimum valid samples:** 3 (with ≤ 2 samples, MAD is almost always 0, which yields spurious `+inf` scores) + +```sql +anomaly_score_mad(value) OVER (window_spec) +``` + +**Arguments:** + +- **value**: A numeric column or expression to evaluate. + +**Return type:** `DOUBLE` + +**Degenerate cases:** + +| Condition | Result | +|---|---| +| Fewer than 3 valid points in window | `NULL` | +| `MAD = 0` and `value = median` | `0.0` | +| `MAD = 0` and `value ≠ median` | `+inf` | +| Normal case | Finite non-negative `DOUBLE` | + +**Example:** + +```sql +SELECT + ts, + val, + anomaly_score_mad(val) OVER ( + PARTITION BY host + ORDER BY ts + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + ) AS mad_score +FROM metrics +ORDER BY host, ts; +``` + +## `anomaly_score_iqr` + +Computes an IQR (Interquartile Range / Tukey Fences)-based anomaly score for each row in a window. +The score measures the distance of the value beyond the lower fence (`Q1 − k × IQR`) or upper fence (`Q3 + k × IQR`). +Values within the fences receive a score of `0.0`. 
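The fence-based score described above can be sketched in Python as follows. This is an illustration under stated assumptions, not GreptimeDB's code: the helper names `quantile` and `iqr_anomaly` are hypothetical, and the quartiles are assumed to be linearly interpolated at position `q × (n − 1)` in the sorted window. The 3-sample minimum and the `NULL` result for a negative `k` are taken from this page.

```python
import math
from typing import Optional, Sequence


def quantile(sorted_vals: Sequence[float], q: float) -> float:
    """Linear-interpolated quantile of already-sorted values (assumed method)."""
    pos = q * (len(sorted_vals) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def iqr_anomaly(window: Sequence[Optional[float]], value: float, k: float) -> Optional[float]:
    """Hypothetical reference sketch of the Tukey-fence distance, scaled by IQR."""
    if k < 0:
        return None  # negative multiplier -> NULL
    valid = sorted(v for v in window if v is not None)
    if len(valid) < 3:
        return None  # fewer than 3 valid points -> NULL

    q1, q3 = quantile(valid, 0.25), quantile(valid, 0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr

    if value < lower:
        excess = lower - value
    elif value > upper:
        excess = value - upper
    else:
        return 0.0  # inside the fences
    # Outside the fences: +inf when the window has no spread at all.
    return math.inf if iqr == 0.0 else excess / iqr


print(iqr_anomaly([10.0, 11.0, 10.5, 10.8, 80.0], 80.0, 1.5))  # 136.5
print(iqr_anomaly([10.0, 11.0, 10.5], 10.5, 1.5))              # 0.0
```

With the window `10, 10.5, 10.8, 11, 80`, the sketch gives `Q1 = 10.5`, `Q3 = 11`, and an upper fence of `11.75`, so the value `80` scores `(80 − 11.75) / 0.5 = 136.5`.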
+ +**Formula:** +- If `value < Q1 − k × IQR`: score = `(Q1 − k × IQR − value) / IQR` +- If `value > Q3 + k × IQR`: score = `(value − Q3 − k × IQR) / IQR` +- Otherwise: score = `0.0` + +**Minimum valid samples:** 3 (linear-interpolated Q1 ≠ Q3 is only possible at n ≥ 3) + +```sql +anomaly_score_iqr(value, k) OVER (window_spec) +``` + +**Arguments:** + +- **value**: A numeric column or expression to evaluate. +- **k**: A non-negative `DOUBLE` multiplier for the IQR fences (e.g., `1.5` for standard Tukey fences, `3.0` for far-out fences). Returns `NULL` if `k < 0`. + +**Return type:** `DOUBLE` + +**Degenerate cases:** + +| Condition | Result | +|---|---| +| Fewer than 3 valid points in window | `NULL` | +| `IQR = 0` and value is within fences | `0.0` | +| `IQR = 0` and value is outside fences | `+inf` | +| Normal case | Finite non-negative `DOUBLE` | + +**Example:** + +```sql +SELECT + ts, + val, + anomaly_score_iqr(val, 1.5) OVER ( + PARTITION BY host + ORDER BY ts + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + ) AS iqr_score +FROM metrics +ORDER BY host, ts; +``` + +## Full Usage Example + +This example creates a sample table, inserts time-series data with an injected outlier, and then uses all three anomaly functions together. 
+ +```sql +CREATE TABLE sensor_data ( + host STRING, + val DOUBLE, + ts TIMESTAMP TIME INDEX, + PRIMARY KEY (host) +); + +INSERT INTO sensor_data VALUES + ('web-1', 10.0, '2025-01-01 00:00:00'), + ('web-1', 11.0, '2025-01-01 00:01:00'), + ('web-1', 10.5, '2025-01-01 00:02:00'), + ('web-1', 10.8, '2025-01-01 00:03:00'), + ('web-1', 80.0, '2025-01-01 00:04:00'), -- outlier + ('web-1', 10.3, '2025-01-01 00:05:00'), + ('web-1', 11.2, '2025-01-01 00:06:00'); +``` + +Use a shared named window and round results for readability: + +```sql +SELECT + ts, + val, + ROUND(anomaly_score_zscore(val) OVER w, 2) AS zscore, + ROUND(anomaly_score_mad(val) OVER w, 2) AS mad, + ROUND(anomaly_score_iqr(val, 1.5) OVER w, 2) AS iqr +FROM sensor_data +WINDOW w AS ( + PARTITION BY host + ORDER BY ts + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW +) +ORDER BY ts; +``` + +Expected output — the outlier row (`val = 80.0`) produces significantly higher scores across all three metrics: + +``` ++---------------------+------+--------+--------+-------+ +| ts | val | zscore | mad | iqr | ++---------------------+------+--------+--------+-------+ +| 2025-01-01 00:00:00 | 10 | NULL | NULL | NULL | +| 2025-01-01 00:01:00 | 11 | 1 | NULL | NULL | +| 2025-01-01 00:02:00 | 10.5 | 0 | 0 | 0 | +| 2025-01-01 00:03:00 | 10.8 | 0.6 | 0.4 | 0 | +| 2025-01-01 00:04:00 | 80 | 2 | 155.58 | 136.5 | +| 2025-01-01 00:05:00 | 10.3 | 0.46 | 0.67 | 0 | +| 2025-01-01 00:06:00 | 11.2 | 0.38 | 0.67 | 0 | ++---------------------+------+--------+--------+-------+ +``` + +### Filter Anomalous Rows + +You can wrap the window query in a subquery to keep only rows whose score exceeds a threshold: + +```sql +SELECT * FROM ( + SELECT + host, + ts, + val, + ROUND(anomaly_score_mad(val) OVER ( + PARTITION BY host + ORDER BY ts + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + ), 2) AS mad + FROM sensor_data +) WHERE mad > 3.0 +ORDER BY host, ts; +``` + +Expected output: + +``` ++-------+---------------------+------+--------+ +| 
host | ts | val | mad | ++-------+---------------------+------+--------+ +| web-1 | 2025-01-01 00:04:00 | 80 | 155.58 | ++-------+---------------------+------+--------+ +``` diff --git a/docs/reference/sql/functions/overview.md b/docs/reference/sql/functions/overview.md index f89db2dcb..2a69fc702 100644 --- a/docs/reference/sql/functions/overview.md +++ b/docs/reference/sql/functions/overview.md @@ -17,3 +17,4 @@ Use this page as a quick index to GreptimeDB function references. - [JSON Functions](./json.md) - [Vector Functions](./vector.md) - [Approximate Functions](./approximate.md) + - [Anomaly Detection Functions](./anomaly.md) diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/functions/anomaly.md b/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/functions/anomaly.md new file mode 100644 index 000000000..6d4d85fe2 --- /dev/null +++ b/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/functions/anomaly.md @@ -0,0 +1,235 @@ +--- +keywords: [异常检测, 异常评分, 窗口函数, zscore, MAD, IQR, 统计函数] +description: GreptimeDB 异常检测窗口函数:Z-Score、MAD、IQR 评分。 +--- + +# 异常检测函数 + +GreptimeDB 提供三个统计异常评分**窗口函数**,为窗口中的每一行计算异常分数。 +使用时必须带 `OVER` 子句。 + +:::tip +窗口内有效(非 NULL)数据点不够时返回 `NULL`。 +分数 `0.0` 表示正常;分数越大,异常程度越高。 +如果离散度(stddev / MAD / IQR)为零,而当前值又偏离中心,则返回 `+inf`——统计意义上的"无穷异常"。 +::: + +## `anomaly_score_zscore` + +基于 Z-Score 的异常评分。 + +**公式:** `|x − mean| / stddev` + +**最少有效样本数:** 2(使用总体标准差,即除以 n) + +```sql +anomaly_score_zscore(value) OVER (window_spec) +``` + +**参数:** + +- **value**:数值列或表达式。 + +**返回类型:** `DOUBLE` + +**退化情况:** + +| 条件 | 结果 | +|---|---| +| 窗口内有效点少于 2 | `NULL` | +| `stddev = 0` 且 `value = mean` | `0.0` | +| `stddev = 0` 且 `value ≠ mean` | `+inf` | +| 正常情况 | 有限非负 `DOUBLE` | + +**示例:** + +```sql +SELECT + ts, + val, + anomaly_score_zscore(val) OVER ( + ORDER BY ts + ROWS BETWEEN 4 PRECEDING AND CURRENT ROW + ) AS zscore +FROM metrics +ORDER BY ts; +``` + +## `anomaly_score_mad` + +基于中位绝对偏差(MAD)的异常评分。 +MAD 对极端离群值不敏感,比 Z-Score 更稳健。 + 
+**公式:** `|x − median| / (MAD × 1.4826)` + +其中 1.4826 是正态分布一致性常数,保证在正态假设下 MAD 评分与 Z-Score 渐近等价。 + +**最少有效样本数:** 3(≤ 2 个样本时 MAD 几乎总为 0,会产生虚假的 `+inf`) + +```sql +anomaly_score_mad(value) OVER (window_spec) +``` + +**参数:** + +- **value**:数值列或表达式。 + +**返回类型:** `DOUBLE` + +**退化情况:** + +| 条件 | 结果 | +|---|---| +| 窗口内有效点少于 3 | `NULL` | +| `MAD = 0` 且 `value = median` | `0.0` | +| `MAD = 0` 且 `value ≠ median` | `+inf` | +| 正常情况 | 有限非负 `DOUBLE` | + +**示例:** + +```sql +SELECT + ts, + val, + anomaly_score_mad(val) OVER ( + PARTITION BY host + ORDER BY ts + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + ) AS mad_score +FROM metrics +ORDER BY host, ts; +``` + +## `anomaly_score_iqr` + +基于四分位距(IQR / Tukey Fences)的异常评分。 +分数反映当前值超出下围栏(`Q1 − k × IQR`)或上围栏(`Q3 + k × IQR`)多远; +在围栏以内的值分数为 `0.0`。 + +**公式:** +- 若 `value < Q1 − k × IQR`:score = `(Q1 − k × IQR − value) / IQR` +- 若 `value > Q3 + k × IQR`:score = `(value − Q3 − k × IQR) / IQR` +- 否则:score = `0.0` + +**最少有效样本数:** 3(线性插值下 Q1 ≠ Q3 至少需要 3 个点) + +```sql +anomaly_score_iqr(value, k) OVER (window_spec) +``` + +**参数:** + +- **value**:数值列或表达式。 +- **k**:围栏倍数,非负 `DOUBLE`(`1.5` 为标准 Tukey 围栏,`3.0` 为远端围栏)。`k < 0` 时返回 `NULL`。 + +**返回类型:** `DOUBLE` + +**退化情况:** + +| 条件 | 结果 | +|---|---| +| 窗口内有效点少于 3 | `NULL` | +| `IQR = 0` 且值在围栏内 | `0.0` | +| `IQR = 0` 且值在围栏外 | `+inf` | +| 正常情况 | 有限非负 `DOUBLE` | + +**示例:** + +```sql +SELECT + ts, + val, + anomaly_score_iqr(val, 1.5) OVER ( + PARTITION BY host + ORDER BY ts + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + ) AS iqr_score +FROM metrics +ORDER BY host, ts; +``` + +## 完整示例 + +建表、写入带离群点的时序数据,然后用三个异常函数同时打分。 + +```sql +CREATE TABLE sensor_data ( + host STRING, + val DOUBLE, + ts TIMESTAMP TIME INDEX, + PRIMARY KEY (host) +); + +INSERT INTO sensor_data VALUES + ('web-1', 10.0, '2025-01-01 00:00:00'), + ('web-1', 11.0, '2025-01-01 00:01:00'), + ('web-1', 10.5, '2025-01-01 00:02:00'), + ('web-1', 10.8, '2025-01-01 00:03:00'), + ('web-1', 80.0, '2025-01-01 00:04:00'), -- 离群点 + ('web-1', 10.3, 
'2025-01-01 00:05:00'), + ('web-1', 11.2, '2025-01-01 00:06:00'); +``` + +用命名窗口让三个函数共享同一窗口,ROUND 取两位小数方便阅读: + +```sql +SELECT + ts, + val, + ROUND(anomaly_score_zscore(val) OVER w, 2) AS zscore, + ROUND(anomaly_score_mad(val) OVER w, 2) AS mad, + ROUND(anomaly_score_iqr(val, 1.5) OVER w, 2) AS iqr +FROM sensor_data +WINDOW w AS ( + PARTITION BY host + ORDER BY ts + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW +) +ORDER BY ts; +``` + +输出如下,`val = 80.0` 那行三个分数都远高于其他行: + +``` ++---------------------+------+--------+--------+-------+ +| ts | val | zscore | mad | iqr | ++---------------------+------+--------+--------+-------+ +| 2025-01-01 00:00:00 | 10 | NULL | NULL | NULL | +| 2025-01-01 00:01:00 | 11 | 1 | NULL | NULL | +| 2025-01-01 00:02:00 | 10.5 | 0 | 0 | 0 | +| 2025-01-01 00:03:00 | 10.8 | 0.6 | 0.4 | 0 | +| 2025-01-01 00:04:00 | 80 | 2 | 155.58 | 136.5 | +| 2025-01-01 00:05:00 | 10.3 | 0.46 | 0.67 | 0 | +| 2025-01-01 00:06:00 | 11.2 | 0.38 | 0.67 | 0 | ++---------------------+------+--------+--------+-------+ +``` + +### 过滤异常行 + +用子查询只留分数超过阈值的行: + +```sql +SELECT * FROM ( + SELECT + host, + ts, + val, + ROUND(anomaly_score_mad(val) OVER ( + PARTITION BY host + ORDER BY ts + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + ), 2) AS mad + FROM sensor_data +) WHERE mad > 3.0 +ORDER BY host, ts; +``` + +输出: + +``` ++-------+---------------------+------+--------+ +| host | ts | val | mad | ++-------+---------------------+------+--------+ +| web-1 | 2025-01-01 00:04:00 | 80 | 155.58 | ++-------+---------------------+------+--------+ +``` diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/functions/overview.md b/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/functions/overview.md index 9889d4b62..de5c066a2 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/functions/overview.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/functions/overview.md @@ -17,3 +17,4 @@ description: 提供了 
GreptimeDB 中函数的概述,包括函数的分类、 - [JSON 函数](./json.md) - [向量函数](./vector.md) - [近似函数](./approximate.md) + - [异常检测函数](./anomaly.md) diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-1.0/reference/sql/functions/anomaly.md b/i18n/zh/docusaurus-plugin-content-docs/version-1.0/reference/sql/functions/anomaly.md new file mode 100644 index 000000000..6d4d85fe2 --- /dev/null +++ b/i18n/zh/docusaurus-plugin-content-docs/version-1.0/reference/sql/functions/anomaly.md @@ -0,0 +1,235 @@ +--- +keywords: [异常检测, 异常评分, 窗口函数, zscore, MAD, IQR, 统计函数] +description: GreptimeDB 异常检测窗口函数:Z-Score、MAD、IQR 评分。 +--- + +# 异常检测函数 + +GreptimeDB 提供三个统计异常评分**窗口函数**,为窗口中的每一行计算异常分数。 +使用时必须带 `OVER` 子句。 + +:::tip +窗口内有效(非 NULL)数据点不够时返回 `NULL`。 +分数 `0.0` 表示正常;分数越大,异常程度越高。 +如果离散度(stddev / MAD / IQR)为零,而当前值又偏离中心,则返回 `+inf`——统计意义上的"无穷异常"。 +::: + +## `anomaly_score_zscore` + +基于 Z-Score 的异常评分。 + +**公式:** `|x − mean| / stddev` + +**最少有效样本数:** 2(使用总体标准差,即除以 n) + +```sql +anomaly_score_zscore(value) OVER (window_spec) +``` + +**参数:** + +- **value**:数值列或表达式。 + +**返回类型:** `DOUBLE` + +**退化情况:** + +| 条件 | 结果 | +|---|---| +| 窗口内有效点少于 2 | `NULL` | +| `stddev = 0` 且 `value = mean` | `0.0` | +| `stddev = 0` 且 `value ≠ mean` | `+inf` | +| 正常情况 | 有限非负 `DOUBLE` | + +**示例:** + +```sql +SELECT + ts, + val, + anomaly_score_zscore(val) OVER ( + ORDER BY ts + ROWS BETWEEN 4 PRECEDING AND CURRENT ROW + ) AS zscore +FROM metrics +ORDER BY ts; +``` + +## `anomaly_score_mad` + +基于中位绝对偏差(MAD)的异常评分。 +MAD 对极端离群值不敏感,比 Z-Score 更稳健。 + +**公式:** `|x − median| / (MAD × 1.4826)` + +其中 1.4826 是正态分布一致性常数,保证在正态假设下 MAD 评分与 Z-Score 渐近等价。 + +**最少有效样本数:** 3(≤ 2 个样本时 MAD 几乎总为 0,会产生虚假的 `+inf`) + +```sql +anomaly_score_mad(value) OVER (window_spec) +``` + +**参数:** + +- **value**:数值列或表达式。 + +**返回类型:** `DOUBLE` + +**退化情况:** + +| 条件 | 结果 | +|---|---| +| 窗口内有效点少于 3 | `NULL` | +| `MAD = 0` 且 `value = median` | `0.0` | +| `MAD = 0` 且 `value ≠ median` | `+inf` | +| 正常情况 | 有限非负 `DOUBLE` | + +**示例:** + +```sql +SELECT + ts, + val, + 
anomaly_score_mad(val) OVER ( + PARTITION BY host + ORDER BY ts + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + ) AS mad_score +FROM metrics +ORDER BY host, ts; +``` + +## `anomaly_score_iqr` + +基于四分位距(IQR / Tukey Fences)的异常评分。 +分数反映当前值超出下围栏(`Q1 − k × IQR`)或上围栏(`Q3 + k × IQR`)多远; +在围栏以内的值分数为 `0.0`。 + +**公式:** +- 若 `value < Q1 − k × IQR`:score = `(Q1 − k × IQR − value) / IQR` +- 若 `value > Q3 + k × IQR`:score = `(value − Q3 − k × IQR) / IQR` +- 否则:score = `0.0` + +**最少有效样本数:** 3(线性插值下 Q1 ≠ Q3 至少需要 3 个点) + +```sql +anomaly_score_iqr(value, k) OVER (window_spec) +``` + +**参数:** + +- **value**:数值列或表达式。 +- **k**:围栏倍数,非负 `DOUBLE`(`1.5` 为标准 Tukey 围栏,`3.0` 为远端围栏)。`k < 0` 时返回 `NULL`。 + +**返回类型:** `DOUBLE` + +**退化情况:** + +| 条件 | 结果 | +|---|---| +| 窗口内有效点少于 3 | `NULL` | +| `IQR = 0` 且值在围栏内 | `0.0` | +| `IQR = 0` 且值在围栏外 | `+inf` | +| 正常情况 | 有限非负 `DOUBLE` | + +**示例:** + +```sql +SELECT + ts, + val, + anomaly_score_iqr(val, 1.5) OVER ( + PARTITION BY host + ORDER BY ts + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + ) AS iqr_score +FROM metrics +ORDER BY host, ts; +``` + +## 完整示例 + +建表、写入带离群点的时序数据,然后用三个异常函数同时打分。 + +```sql +CREATE TABLE sensor_data ( + host STRING, + val DOUBLE, + ts TIMESTAMP TIME INDEX, + PRIMARY KEY (host) +); + +INSERT INTO sensor_data VALUES + ('web-1', 10.0, '2025-01-01 00:00:00'), + ('web-1', 11.0, '2025-01-01 00:01:00'), + ('web-1', 10.5, '2025-01-01 00:02:00'), + ('web-1', 10.8, '2025-01-01 00:03:00'), + ('web-1', 80.0, '2025-01-01 00:04:00'), -- 离群点 + ('web-1', 10.3, '2025-01-01 00:05:00'), + ('web-1', 11.2, '2025-01-01 00:06:00'); +``` + +用命名窗口让三个函数共享同一窗口,ROUND 取两位小数方便阅读: + +```sql +SELECT + ts, + val, + ROUND(anomaly_score_zscore(val) OVER w, 2) AS zscore, + ROUND(anomaly_score_mad(val) OVER w, 2) AS mad, + ROUND(anomaly_score_iqr(val, 1.5) OVER w, 2) AS iqr +FROM sensor_data +WINDOW w AS ( + PARTITION BY host + ORDER BY ts + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW +) +ORDER BY ts; +``` + +输出如下,`val = 80.0` 那行三个分数都远高于其他行: + +``` 
++---------------------+------+--------+--------+-------+ +| ts | val | zscore | mad | iqr | ++---------------------+------+--------+--------+-------+ +| 2025-01-01 00:00:00 | 10 | NULL | NULL | NULL | +| 2025-01-01 00:01:00 | 11 | 1 | NULL | NULL | +| 2025-01-01 00:02:00 | 10.5 | 0 | 0 | 0 | +| 2025-01-01 00:03:00 | 10.8 | 0.6 | 0.4 | 0 | +| 2025-01-01 00:04:00 | 80 | 2 | 155.58 | 136.5 | +| 2025-01-01 00:05:00 | 10.3 | 0.46 | 0.67 | 0 | +| 2025-01-01 00:06:00 | 11.2 | 0.38 | 0.67 | 0 | ++---------------------+------+--------+--------+-------+ +``` + +### 过滤异常行 + +用子查询只留分数超过阈值的行: + +```sql +SELECT * FROM ( + SELECT + host, + ts, + val, + ROUND(anomaly_score_mad(val) OVER ( + PARTITION BY host + ORDER BY ts + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + ), 2) AS mad + FROM sensor_data +) WHERE mad > 3.0 +ORDER BY host, ts; +``` + +输出: + +``` ++-------+---------------------+------+--------+ +| host | ts | val | mad | ++-------+---------------------+------+--------+ +| web-1 | 2025-01-01 00:04:00 | 80 | 155.58 | ++-------+---------------------+------+--------+ +``` diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-1.0/reference/sql/functions/overview.md b/i18n/zh/docusaurus-plugin-content-docs/version-1.0/reference/sql/functions/overview.md index 9889d4b62..de5c066a2 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/version-1.0/reference/sql/functions/overview.md +++ b/i18n/zh/docusaurus-plugin-content-docs/version-1.0/reference/sql/functions/overview.md @@ -17,3 +17,4 @@ description: 提供了 GreptimeDB 中函数的概述,包括函数的分类、 - [JSON 函数](./json.md) - [向量函数](./vector.md) - [近似函数](./approximate.md) + - [异常检测函数](./anomaly.md) diff --git a/sidebars.ts b/sidebars.ts index b4be4930a..d4763a317 100644 --- a/sidebars.ts +++ b/sidebars.ts @@ -685,6 +685,7 @@ const sidebars: SidebarsConfig = { 'reference/sql/functions/json', 'reference/sql/functions/vector', 'reference/sql/functions/approximate', + 'reference/sql/functions/anomaly', ], }, 
'reference/sql/functions/df-functions', diff --git a/versioned_docs/version-1.0/reference/sql/functions/anomaly.md b/versioned_docs/version-1.0/reference/sql/functions/anomaly.md new file mode 100644 index 000000000..05b15c010 --- /dev/null +++ b/versioned_docs/version-1.0/reference/sql/functions/anomaly.md @@ -0,0 +1,235 @@ +--- +keywords: [anomaly detection, anomaly score, window functions, zscore, MAD, IQR, statistical functions] +description: Lists and describes the anomaly detection window functions available in GreptimeDB, including Z-Score, MAD, and IQR-based scoring. +--- + +# Anomaly Detection Functions + +GreptimeDB provides a set of statistical anomaly-scoring **window functions** that compute a numeric score reflecting how anomalous each row is relative to its window of values. +All three functions must be used with an `OVER` clause (window function syntax). + +:::tip +These functions return `NULL` when the window does not contain enough valid (non-NULL) data points. +A score of `0.0` means the value is not anomalous; a larger value indicates a stronger anomaly. +When the spread (stddev / MAD / IQR) is zero but the current value deviates from the window center, the returned score is `+inf`, meaning the deviation is infinitely anomalous. +::: + +## `anomaly_score_zscore` + +Computes a Z-Score-based anomaly score for each row in a window. + +**Formula:** `|x − mean| / stddev` + +**Minimum valid samples:** 2 (uses population stddev, i.e. dividing by n) + +```sql +anomaly_score_zscore(value) OVER (window_spec) +``` + +**Arguments:** + +- **value**: A numeric column or expression to evaluate. 
+ +**Return type:** `DOUBLE` + +**Degenerate cases:** + +| Condition | Result | +|---|---| +| Fewer than 2 valid points in window | `NULL` | +| `stddev = 0` and `value = mean` | `0.0` | +| `stddev = 0` and `value ≠ mean` | `+inf` | +| Normal case | Finite non-negative `DOUBLE` | + +**Example:** + +```sql +SELECT + ts, + val, + anomaly_score_zscore(val) OVER ( + ORDER BY ts + ROWS BETWEEN 4 PRECEDING AND CURRENT ROW + ) AS zscore +FROM metrics +ORDER BY ts; +``` + +## `anomaly_score_mad` + +Computes a Median Absolute Deviation (MAD)-based anomaly score for each row in a window. +The MAD score is more robust than the Z-Score because the median and MAD are far less sensitive to extreme outliers than the mean and standard deviation. + +**Formula:** `|x − median| / (MAD × 1.4826)` + +The constant 1.4826 is a consistency factor that makes the MAD-based score asymptotically equivalent to the Z-Score for normally distributed data. + +**Minimum valid samples:** 3 (with ≤ 2 samples, MAD is almost always 0, which yields spurious `+inf` scores) + +```sql +anomaly_score_mad(value) OVER (window_spec) +``` + +**Arguments:** + +- **value**: A numeric column or expression to evaluate. + +**Return type:** `DOUBLE` + +**Degenerate cases:** + +| Condition | Result | +|---|---| +| Fewer than 3 valid points in window | `NULL` | +| `MAD = 0` and `value = median` | `0.0` | +| `MAD = 0` and `value ≠ median` | `+inf` | +| Normal case | Finite non-negative `DOUBLE` | + +**Example:** + +```sql +SELECT + ts, + val, + anomaly_score_mad(val) OVER ( + PARTITION BY host + ORDER BY ts + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + ) AS mad_score +FROM metrics +ORDER BY host, ts; +``` + +## `anomaly_score_iqr` + +Computes an IQR (Interquartile Range / Tukey Fences)-based anomaly score for each row in a window. +The score measures the distance of the value beyond the lower fence (`Q1 − k × IQR`) or upper fence (`Q3 + k × IQR`). +Values within the fences receive a score of `0.0`. 
+ +**Formula:** +- If `value < Q1 − k × IQR`: score = `(Q1 − k × IQR − value) / IQR` +- If `value > Q3 + k × IQR`: score = `(value − Q3 − k × IQR) / IQR` +- Otherwise: score = `0.0` + +**Minimum valid samples:** 3 (linear-interpolated Q1 ≠ Q3 is only possible at n ≥ 3) + +```sql +anomaly_score_iqr(value, k) OVER (window_spec) +``` + +**Arguments:** + +- **value**: A numeric column or expression to evaluate. +- **k**: A non-negative `DOUBLE` multiplier for the IQR fences (e.g., `1.5` for standard Tukey fences, `3.0` for far-out fences). Returns `NULL` if `k < 0`. + +**Return type:** `DOUBLE` + +**Degenerate cases:** + +| Condition | Result | +|---|---| +| Fewer than 3 valid points in window | `NULL` | +| `IQR = 0` and value is within fences | `0.0` | +| `IQR = 0` and value is outside fences | `+inf` | +| Normal case | Finite non-negative `DOUBLE` | + +**Example:** + +```sql +SELECT + ts, + val, + anomaly_score_iqr(val, 1.5) OVER ( + PARTITION BY host + ORDER BY ts + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + ) AS iqr_score +FROM metrics +ORDER BY host, ts; +``` + +## Full Usage Example + +This example creates a sample table, inserts time-series data with an injected outlier, and then uses all three anomaly functions together. 
+ +```sql +CREATE TABLE sensor_data ( + host STRING, + val DOUBLE, + ts TIMESTAMP TIME INDEX, + PRIMARY KEY (host) +); + +INSERT INTO sensor_data VALUES + ('web-1', 10.0, '2025-01-01 00:00:00'), + ('web-1', 11.0, '2025-01-01 00:01:00'), + ('web-1', 10.5, '2025-01-01 00:02:00'), + ('web-1', 10.8, '2025-01-01 00:03:00'), + ('web-1', 80.0, '2025-01-01 00:04:00'), -- outlier + ('web-1', 10.3, '2025-01-01 00:05:00'), + ('web-1', 11.2, '2025-01-01 00:06:00'); +``` + +Use a shared named window and round results for readability: + +```sql +SELECT + ts, + val, + ROUND(anomaly_score_zscore(val) OVER w, 2) AS zscore, + ROUND(anomaly_score_mad(val) OVER w, 2) AS mad, + ROUND(anomaly_score_iqr(val, 1.5) OVER w, 2) AS iqr +FROM sensor_data +WINDOW w AS ( + PARTITION BY host + ORDER BY ts + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW +) +ORDER BY ts; +``` + +Expected output — the outlier row (`val = 80.0`) produces significantly higher scores across all three metrics: + +``` ++---------------------+------+--------+--------+-------+ +| ts | val | zscore | mad | iqr | ++---------------------+------+--------+--------+-------+ +| 2025-01-01 00:00:00 | 10 | NULL | NULL | NULL | +| 2025-01-01 00:01:00 | 11 | 1 | NULL | NULL | +| 2025-01-01 00:02:00 | 10.5 | 0 | 0 | 0 | +| 2025-01-01 00:03:00 | 10.8 | 0.6 | 0.4 | 0 | +| 2025-01-01 00:04:00 | 80 | 2 | 155.58 | 136.5 | +| 2025-01-01 00:05:00 | 10.3 | 0.46 | 0.67 | 0 | +| 2025-01-01 00:06:00 | 11.2 | 0.38 | 0.67 | 0 | ++---------------------+------+--------+--------+-------+ +``` + +### Filter Anomalous Rows + +You can wrap the window query in a subquery to keep only rows whose score exceeds a threshold: + +```sql +SELECT * FROM ( + SELECT + host, + ts, + val, + ROUND(anomaly_score_mad(val) OVER ( + PARTITION BY host + ORDER BY ts + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + ), 2) AS mad + FROM sensor_data +) WHERE mad > 3.0 +ORDER BY host, ts; +``` + +Expected output: + +``` ++-------+---------------------+------+--------+ +| 
host | ts | val | mad | ++-------+---------------------+------+--------+ +| web-1 | 2025-01-01 00:04:00 | 80 | 155.58 | ++-------+---------------------+------+--------+ +``` diff --git a/versioned_docs/version-1.0/reference/sql/functions/overview.md b/versioned_docs/version-1.0/reference/sql/functions/overview.md index f89db2dcb..2a69fc702 100644 --- a/versioned_docs/version-1.0/reference/sql/functions/overview.md +++ b/versioned_docs/version-1.0/reference/sql/functions/overview.md @@ -17,3 +17,4 @@ Use this page as a quick index to GreptimeDB function references. - [JSON Functions](./json.md) - [Vector Functions](./vector.md) - [Approximate Functions](./approximate.md) + - [Anomaly Detection Functions](./anomaly.md) diff --git a/versioned_sidebars/version-1.0-sidebars.json b/versioned_sidebars/version-1.0-sidebars.json index 4b598e6f8..c1a34a753 100644 --- a/versioned_sidebars/version-1.0-sidebars.json +++ b/versioned_sidebars/version-1.0-sidebars.json @@ -685,7 +685,8 @@ "reference/sql/functions/ip", "reference/sql/functions/json", "reference/sql/functions/vector", - "reference/sql/functions/approximate" + "reference/sql/functions/approximate", + "reference/sql/functions/anomaly" ] }, "reference/sql/functions/df-functions"