diff --git a/.typos.toml b/.typos.toml index 0b8414e39..5b39d2710 100644 --- a/.typos.toml +++ b/.typos.toml @@ -15,5 +15,6 @@ flate = "flate" [files] extend-exclude = [ "pnpm-lock.yaml", - "*/**/df-functions.md" + "*/**/df-functions.md", + "**/*.svg" ] diff --git a/docs/faq-and-others/faq.md b/docs/faq-and-others/faq.md index 5c80b3085..7bc97261e 100644 --- a/docs/faq-and-others/faq.md +++ b/docs/faq-and-others/faq.md @@ -219,7 +219,7 @@ Learn more about indexing: [Index Management](/user-guide/manage-data/data-index **Real-Time Processing**: - **[Flow Engine](/user-guide/flow-computation/overview.md)**: Real-time stream processing system that enables continuous, incremental computation on streaming data with automatic result table updates -- **[Pipeline](/user-guide/logs/pipeline-config.md)**: Data parsing and transformation mechanism for processing incoming data in real-time, with configurable processors for field extraction and data type conversion across multiple data formats +- **[Pipeline](/reference/pipeline/pipeline-config.md)**: Data parsing and transformation mechanism for processing incoming data in real-time, with configurable processors for field extraction and data type conversion across multiple data formats - **Output Tables**: Persist processed results for analysis diff --git a/docs/getting-started/quick-start.md b/docs/getting-started/quick-start.md index ec86668d7..fdbac53cf 100644 --- a/docs/getting-started/quick-start.md +++ b/docs/getting-started/quick-start.md @@ -237,7 +237,7 @@ ORDER BY +---------------------+-------+------------------+-----------+--------------------+ ``` -The `@@` operator is used for [term searching](/user-guide/logs/query-logs.md). +The `@@` operator is used for [term searching](/user-guide/logs/fulltext-search.md). ### Range query diff --git a/docs/greptimecloud/integrations/fluent-bit.md b/docs/greptimecloud/integrations/fluent-bit.md index fa92db437..cb3f5db26 100644 --- a/docs/greptimecloud/integrations/fluent-bit.md +++ b/docs/greptimecloud/integrations/fluent-bit.md @@ -28,7 +28,7 @@ Fluent Bit can be configured to send logs to GreptimeCloud using the HTTP protoc http_Passwd ``` -In this example, the `http` output plugin is used to send logs to GreptimeCloud. For more information, and extra options, refer to the [Logs HTTP API](https://docs.greptime.com/user-guide/logs/write-logs#http-api) guide. +In this example, the `http` output plugin is used to send logs to GreptimeCloud. For more information, and extra options, refer to the [Logs HTTP API](https://docs.greptime.com/reference/pipeline/write-log-api/#http-api) guide. ## Prometheus Remote Write diff --git a/docs/greptimecloud/integrations/kafka.md b/docs/greptimecloud/integrations/kafka.md index c20c10fab..2ee9ff1aa 100644 --- a/docs/greptimecloud/integrations/kafka.md +++ b/docs/greptimecloud/integrations/kafka.md @@ -13,7 +13,7 @@ Here we are using Vector as the tool to transport data from Kafka to GreptimeDB. ## Logs A sample configuration. Note that you will need to [create your -pipeline](https://docs.greptime.com/user-guide/logs/pipeline-config/) for log +pipeline](https://docs.greptime.com/user-guide/logs/use-custom-pipelines/) for log parsing. 
```toml diff --git a/docs/reference/pipeline/built-in-pipelines.md b/docs/reference/pipeline/built-in-pipelines.md new file mode 100644 index 000000000..efda2d1f9 --- /dev/null +++ b/docs/reference/pipeline/built-in-pipelines.md @@ -0,0 +1,176 @@ +--- +keywords: [built-in pipelines, greptime_identity, JSON logs, log processing, time index, pipeline, GreptimeDB] +description: Learn about GreptimeDB's built-in pipelines, including the greptime_identity pipeline for processing JSON logs with automatic schema creation, type conversion, and time index configuration. +--- + +# Built-in Pipelines + +GreptimeDB offers built-in pipelines for common log formats, allowing you to use them directly without creating new pipelines. + +Note that the built-in pipelines are not editable. +Additionally, the "greptime_" prefix of the pipeline name is reserved. + +## `greptime_identity` + +The `greptime_identity` pipeline is designed for writing JSON logs and automatically creates columns for each field in the JSON log. + +- The first-level keys in the JSON log are used as column names. +- An error is returned if the same field has different types. +- Fields with `null` values are ignored. +- If time index is not specified, an additional column, `greptime_timestamp`, is added to the table as the time index to indicate when the log was written. + +### Type conversion rules + +- `string` -> `string` +- `number` -> `int64` or `float64` +- `boolean` -> `bool` +- `null` -> ignore +- `array` -> `json` +- `object` -> `json` + + +For example, if we have the following json data: + +```json +[ + {"name": "Alice", "age": 20, "is_student": true, "score": 90.5,"object": {"a":1,"b":2}}, + {"age": 21, "is_student": false, "score": 85.5, "company": "A" ,"whatever": null}, + {"name": "Charlie", "age": 22, "is_student": true, "score": 95.5,"array":[1,2,3]} +] +``` + +We'll merge the schema for each row of this batch to get the final schema. The table schema will be: + +```sql +mysql> desc pipeline_logs; ++--------------------+---------------------+------+------+---------+---------------+ +| Column | Type | Key | Null | Default | Semantic Type | ++--------------------+---------------------+------+------+---------+---------------+ +| age | Int64 | | YES | | FIELD | +| is_student | Boolean | | YES | | FIELD | +| name | String | | YES | | FIELD | +| object | Json | | YES | | FIELD | +| score | Float64 | | YES | | FIELD | +| company | String | | YES | | FIELD | +| array | Json | | YES | | FIELD | +| greptime_timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP | ++--------------------+---------------------+------+------+---------+---------------+ +8 rows in set (0.00 sec) +``` + +The data will be stored in the table as follows: + +```sql +mysql> select * from pipeline_logs; ++------+------------+---------+---------------+-------+---------+---------+----------------------------+ +| age | is_student | name | object | score | company | array | greptime_timestamp | ++------+------------+---------+---------------+-------+---------+---------+----------------------------+ +| 22 | 1 | Charlie | NULL | 95.5 | NULL | [1,2,3] | 2024-10-18 09:35:48.333020 | +| 21 | 0 | NULL | NULL | 85.5 | A | NULL | 2024-10-18 09:35:48.333020 | +| 20 | 1 | Alice | {"a":1,"b":2} | 90.5 | NULL | NULL | 2024-10-18 09:35:48.333020 | ++------+------------+---------+---------------+-------+---------+---------+----------------------------+ +3 rows in set (0.01 sec) +``` + +### Specify time index + +A time index is necessary in GreptimeDB. 
Since the `greptime_identity` pipeline does not require a YAML configuration, you must set the time index in the query parameters if you want to use the timestamp from the log data instead of the automatically generated timestamp when the data arrives. + +Example of Incoming Log Data: +```JSON +[ + {"action": "login", "ts": 1742814853} +] +``` + +To instruct the server to use ts as the time index, set the following query parameter in the HTTP header: +```shell +curl -X "POST" "http://localhost:4000/v1/ingest?db=public&table=pipeline_logs&pipeline_name=greptime_identity&custom_time_index=ts;epoch;s" \ + -H "Content-Type: application/json" \ + -H "Authorization: Basic {{authentication}}" \ + -d $'[{"action": "login", "ts": 1742814853}]' +``` + +The `custom_time_index` parameter accepts two formats, depending on the input data format: +- Epoch number format: `;epoch;` + - The field can be an integer or a string. + - The resolution must be one of: `s`, `ms`, `us`, or `ns`. +- Date string format: `;datestr;` + - For example, if the input data contains a timestamp like `2025-03-24 19:31:37+08:00`, the corresponding format should be `%Y-%m-%d %H:%M:%S%:z`. + +With the configuration above, the resulting table will correctly use the specified log data field as the time index. +```sql +DESC pipeline_logs; +``` +```sql ++--------+-----------------+------+------+---------+---------------+ +| Column | Type | Key | Null | Default | Semantic Type | ++--------+-----------------+------+------+---------+---------------+ +| ts | TimestampSecond | PRI | NO | | TIMESTAMP | +| action | String | | YES | | FIELD | ++--------+-----------------+------+------+---------+---------------+ +2 rows in set (0.02 sec) +``` + +Here are some example of using `custom_time_index` assuming the time variable is named `input_ts`: +- 1742814853: `custom_time_index=input_ts;epoch;s` +- 1752749137000: `custom_time_index=input_ts;epoch;ms` +- "2025-07-17T10:00:00+0800": `custom_time_index=input_ts;datestr;%Y-%m-%dT%H:%M:%S%z` +- "2025-06-27T15:02:23.082253908Z": `custom_time_index=input_ts;datestr;%Y-%m-%dT%H:%M:%S%.9f%#z` + + +### Flatten JSON objects + +If flattening a JSON object into a single-level structure is needed, add the `x-greptime-pipeline-params` header to the request and set `flatten_json_object` to `true`. + +Here is a sample request: + +```shell +curl -X "POST" "http://localhost:4000/v1/ingest?db=&table=&pipeline_name=greptime_identity&version=" \ + -H "Content-Type: application/x-ndjson" \ + -H "Authorization: Basic {{authentication}}" \ + -H "x-greptime-pipeline-params: flatten_json_object=true" \ + -d "$" +``` + +With this configuration, GreptimeDB will automatically flatten each field of the JSON object into separate columns. 
For example: + +```JSON +{ + "a": { + "b": { + "c": [1, 2, 3] + } + }, + "d": [ + "foo", + "bar" + ], + "e": { + "f": [7, 8, 9], + "g": { + "h": 123, + "i": "hello", + "j": { + "k": true + } + } + } +} +``` + +Will be flattened to: + +```json +{ + "a.b.c": [1,2,3], + "d": ["foo","bar"], + "e.f": [7,8,9], + "e.g.h": 123, + "e.g.i": "hello", + "e.g.j.k": true +} +``` + + + diff --git a/docs/user-guide/logs/pipeline-config.md b/docs/reference/pipeline/pipeline-config.md similarity index 99% rename from docs/user-guide/logs/pipeline-config.md rename to docs/reference/pipeline/pipeline-config.md index 324595be7..522b8f1b1 100644 --- a/docs/user-guide/logs/pipeline-config.md +++ b/docs/reference/pipeline/pipeline-config.md @@ -51,10 +51,10 @@ The above plain text data will be converted to the following equivalent form: In other words, when the input is in plain text format, you need to use `message` to refer to the content of each line when writing `Processor` and `Transform` configurations. -## Overall structure +## Pipeline Configuration Structure Pipeline consists of four parts: Processors, Dispatcher, Transform, and Table suffix. -Processors pre-processes input log data. +Processors pre-process input log data. Dispatcher forwards pipeline execution context onto different subsequent pipeline. Transform decides the final datatype and table structure in the database. Table suffix allows storing the data into different tables. @@ -827,6 +827,8 @@ Some notes regarding the `vrl` processor: 2. The returning value of the vrl script should not contain any regex-type variables. They can be used in the script, but have to be `del`ed before returning. 3. Due to type conversion between pipeline's value type and vrl's, the value type that comes out of the vrl script will be the ones with max capacity, meaning `i64`, `f64`, and `Timestamp::nanoseconds`. +You can use `vrl` processor to set [table options](./write-log-api.md#set-table-options) while writing logs. + ### `filter` The `filter` processor can filter out unneeded lines when the condition is meet. @@ -1013,7 +1015,7 @@ Specify which field uses the inverted index. Refer to the [Transform Example](#t #### The Fulltext Index -Specify which field will be used for full-text search using `index: fulltext`. This index greatly improves the performance of [log search](./query-logs.md). Refer to the [Transform Example](#transform-example) below for syntax. +Specify which field will be used for full-text search using `index: fulltext`. This index greatly improves the performance of [log search](/user-guide/logs/fulltext-search.md). Refer to the [Transform Example](#transform-example) below for syntax. #### The Skipping Index @@ -1159,4 +1161,4 @@ table_suffix: _${type} These three lines of input log will be inserted into three tables: 1. `persist_app_db` 2. `persist_app_http` -3. `persist_app`, for it doesn't have a `type` field, thus the default table name will be used. \ No newline at end of file +3. `persist_app`, for it doesn't have a `type` field, thus the default table name will be used. diff --git a/docs/reference/pipeline/write-log-api.md b/docs/reference/pipeline/write-log-api.md new file mode 100644 index 000000000..04f22faad --- /dev/null +++ b/docs/reference/pipeline/write-log-api.md @@ -0,0 +1,160 @@ +--- +keywords: [write logs, HTTP interface, log formats, request parameters, JSON logs] +description: Describes how to write logs to GreptimeDB using a pipeline via the HTTP interface, including supported formats and request parameters. 
+--- + +# APIs for Writing Logs + +Before writing logs, please read the [Pipeline Configuration](/user-guide/logs/use-custom-pipelines.md#upload-pipeline) to complete the configuration setup and upload. + +## HTTP API + +You can use the following command to write logs via the HTTP interface: + +```shell +curl -X "POST" "http://localhost:4000/v1/ingest?db=&table=&pipeline_name=&version=&skip_error=" \ + -H "Content-Type: application/x-ndjson" \ + -H "Authorization: Basic {{authentication}}" \ + -d "$" +``` + +### Request parameters + +This interface accepts the following parameters: + +- `db`: The name of the database. +- `table`: The name of the table. +- `pipeline_name`: The name of the [pipeline](./pipeline-config.md). +- `version`: The version of the pipeline. Optional, default use the latest one. +- `skip_error`: Whether to skip errors when writing logs. Optional, defaults to `false`. When set to `true`, GreptimeDB will skip individual log entries that encounter errors and continue processing the remaining logs. This prevents the entire request from failing due to a single problematic log entry. + +### `Content-Type` and body format + +GreptimeDB uses `Content-Type` header to decide how to decode the payload body. Currently the following two format is supported: +- `application/json`: this includes normal JSON format and NDJSON format. +- `application/x-ndjson`: specifically uses NDJSON format, which will try to split lines and parse for more accurate error checking. +- `text/plain`: multiple log lines separated by line breaks. + +#### `application/json` and `application/x-ndjson` format + +Here is an example of JSON format body payload + +```JSON +[ + {"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""}, + {"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""}, + {"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""}, + {"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""} +] +``` + +Note the whole JSON is an array (log lines). Each JSON object represents one line to be processed by Pipeline engine. + +The name of the key in JSON objects, which is `message` here, is used as field name in Pipeline processors. 
For example: + +```yaml +processors: + - dissect: + fields: + # `message` is the key in JSON object + - message + patterns: + - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"' + ignore_missing: true + +# rest of the file is ignored +``` + +We can also rewrite the payload into NDJSON format like following: + +```JSON +{"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""} +{"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""} +{"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""} +{"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""} +``` + +Note the outer array is eliminated, and lines are separated by line breaks instead of `,`. + +#### `text/plain` format + +Log in plain text format is widely used throughout the ecosystem. GreptimeDB also supports `text/plain` format as log data input, enabling ingesting logs first hand from log producers. + +The equivalent body payload of previous example is like following: + +```plain +127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" +192.168.1.1 - - [25/May/2024:20:17:37 +0000] "POST /api/login HTTP/1.1" 200 1784 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36" +10.0.0.1 - - [25/May/2024:20:18:37 +0000] "GET /images/logo.png HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0" +172.16.0.1 - - [25/May/2024:20:19:37 +0000] "GET /contact HTTP/1.1" 404 162 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1" +``` + +Sending log ingestion request to GreptimeDB requires only modifying the `Content-Type` header to be `text/plain`, and you are good to go! + +Please note that, unlike JSON format, where the input data already have key names as field names to be used in Pipeline processors, `text/plain` format just gives the whole line as input to the Pipeline engine. In this case we use `message` as the field name to refer to the input line, for example: + +```yaml +processors: + - dissect: + fields: + # use `message` as the field name + - message + patterns: + - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"' + ignore_missing: true + +# rest of the file is ignored +``` + +It is recommended to use `dissect` or `regex` processor to split the input line into fields first and then process the fields accordingly. + +## Set Table Options + +The table options need to be set in the pipeline configurations. 
+Starting from `v0.15`, the pipeline engine recognizes certain variables, and can set corresponding table options based on the value of the variables. +Combined with the `vrl` processor, it's now easy to create and set table options during the pipeline execution based on input data. + +Here is a list of supported common table option variables: +- `greptime_auto_create_table` +- `greptime_ttl` +- `greptime_append_mode` +- `greptime_merge_mode` +- `greptime_physical_table` +- `greptime_skip_wal` + +Please refer to [table options](/reference/sql/create.md#table-options) for the detailed explanation of each option. + +Here are some pipeline specific variables: +- `greptime_table_suffix`: add suffix to the destined table name. + +Let's use the following pipeline file to demonstrate: +```YAML +processors: + - date: + field: time + formats: + - "%Y-%m-%d %H:%M:%S%.3f" + ignore_missing: true + - vrl: + source: | + .greptime_table_suffix, err = "_" + .id + .greptime_table_ttl = "1d" + . +``` + +In the vrl script, we set the table suffix variable with the input field `.id`(leading with an underscore), and set the ttl to `1d`. +Then we run the ingestion using the following JSON data. + +```JSON +{ + "id": "2436", + "time": "2024-05-25 20:16:37.217" +} +``` + +Assuming the given table name being `d_table`, the final table name would be `d_table_2436` as we would expected. +The table is also set with a ttl of 1 day. + +## Examples + +Please refer to the "Writing Logs" section in the [Quick Start](/user-guide/logs/quick-start.md#direct-http-ingestion) and [Using Custom Pipelines](/user-guide/logs/use-custom-pipelines.md#write-logs) guide for examples. diff --git a/docs/reference/sql/alter.md b/docs/reference/sql/alter.md index af18f9d61..5b1676682 100644 --- a/docs/reference/sql/alter.md +++ b/docs/reference/sql/alter.md @@ -194,7 +194,7 @@ You can specify the following options using `FULLTEXT INDEX WITH` when enabling - `granularity`: (For `bloom` backend) The size of data chunks covered by each filter. A smaller granularity improves filtering but increases index size. Default is `10240`. - `false_positive_rate`: (For `bloom` backend) The probability of misidentifying a block. A lower rate improves accuracy (better filtering) but increases index size. Value is a float between `0` and `1`. Default is `0.01`. -For more information on full-text index configuration and performance comparison, refer to the [Full-Text Index Configuration Guide](/user-guide/logs/fulltext-index-config.md). +For more information on full-text index configuration and performance comparison, refer to the [Full-Text Index Configuration Guide](/user-guide/manage-data/data-index.md#fulltext-index). If `WITH ` is not specified, `FULLTEXT INDEX` will use the default values. diff --git a/docs/reference/sql/functions/overview.md b/docs/reference/sql/functions/overview.md index 50b74ba21..dbb94e9f8 100644 --- a/docs/reference/sql/functions/overview.md +++ b/docs/reference/sql/functions/overview.md @@ -50,7 +50,7 @@ DataFusion [String Function](./df-functions.md#string-functions). GreptimeDB provides: * `matches_term(expression, term)` for full text search. -For details, read the [Query Logs](/user-guide/logs/query-logs.md). +For details, read the [Fulltext Search](/user-guide/logs/fulltext-search.md). 
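As a quick illustration, `matches_term` can be used directly in a `WHERE` clause. The following sketch assumes a hypothetical `app_logs` table with a `message` column:

```sql
-- Match rows whose `message` contains the whole term "timeout".
-- Term matching is exact: a longer word such as "timeouts" is not matched.
SELECT * FROM app_logs WHERE matches_term(message, 'timeout');
```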
### Math Functions diff --git a/docs/reference/sql/where.md b/docs/reference/sql/where.md index 421ef5815..3b76d2cf0 100644 --- a/docs/reference/sql/where.md +++ b/docs/reference/sql/where.md @@ -77,4 +77,4 @@ SELECT * FROM go_info WHERE instance LIKE 'localhost:____'; ``` -For searching terms in logs, please read [Query Logs](/user-guide/logs/query-logs.md). \ No newline at end of file +For searching terms in logs, please read [Fulltext Search](/user-guide/logs/fulltext-search.md). diff --git a/docs/user-guide/ingest-data/for-observability/fluent-bit.md b/docs/user-guide/ingest-data/for-observability/fluent-bit.md index adf5b364a..59cb82a30 100644 --- a/docs/user-guide/ingest-data/for-observability/fluent-bit.md +++ b/docs/user-guide/ingest-data/for-observability/fluent-bit.md @@ -43,7 +43,7 @@ In params Uri, - `table` is the table name you want to write logs to. - `pipeline_name` is the pipeline name you want to use for processing logs. -In this example, the [Logs Http API](/user-guide/logs/write-logs.md#http-api) interface is used. For more information, refer to the [Write Logs](/user-guide/logs/write-logs.md) guide. +In this example, the [Logs Http API](/reference/pipeline/write-log-api.md#http-api) interface is used. For more information, refer to the [Write Logs](/user-guide/logs/use-custom-pipelines.md#ingest-logs-using-the-pipeline) guide. ## OpenTelemetry diff --git a/docs/user-guide/ingest-data/for-observability/kafka.md b/docs/user-guide/ingest-data/for-observability/kafka.md index 5d5b9c61b..8238d7e88 100644 --- a/docs/user-guide/ingest-data/for-observability/kafka.md +++ b/docs/user-guide/ingest-data/for-observability/kafka.md @@ -128,8 +128,8 @@ For logs in text format, such as the access log format below, you'll need to cre #### Create a pipeline To create a custom pipeline, -please refer to the [Create Pipeline](/user-guide/logs/quick-start.md#create-a-pipeline) -and [Pipeline Configuration](/user-guide/logs/pipeline-config.md) documentation for detailed instructions. +please refer to the [using custom pipelines](/user-guide/logs/use-custom-pipelines.md) +documentation for detailed instructions. #### Ingest data diff --git a/docs/user-guide/ingest-data/for-observability/loki.md b/docs/user-guide/ingest-data/for-observability/loki.md index ddacc2650..fb61b4bdb 100644 --- a/docs/user-guide/ingest-data/for-observability/loki.md +++ b/docs/user-guide/ingest-data/for-observability/loki.md @@ -184,7 +184,7 @@ transform: ``` The pipeline content is straightforward: we use `vrl` processor to parse the line into a JSON object, then extract the fields to the root level. -`log_time` is specified as the time index in the transform section, other fields will be auto-inferred by the pipeline engine, see [pipeline version 2](/user-guide/logs/pipeline-config.md#transform-in-version-2) for details. +`log_time` is specified as the time index in the transform section, other fields will be auto-inferred by the pipeline engine, see [pipeline version 2](/reference/pipeline/pipeline-config.md#transform-in-version-2) for details. Note that the input field name is `loki_line`, which contains the original log line from Loki. @@ -264,4 +264,4 @@ log_source: application This output demonstrates that the pipeline engine has successfully parsed the original JSON log lines and extracted the structured data into separate columns. -For more details about pipeline configuration and features, refer to the [pipeline documentation](/user-guide/logs/pipeline-config.md). 
+For more details about pipeline configuration and features, refer to the [pipeline documentation](/reference/pipeline/pipeline-config.md). diff --git a/docs/user-guide/ingest-data/for-observability/prometheus.md b/docs/user-guide/ingest-data/for-observability/prometheus.md index 889282fd0..82508948b 100644 --- a/docs/user-guide/ingest-data/for-observability/prometheus.md +++ b/docs/user-guide/ingest-data/for-observability/prometheus.md @@ -303,7 +303,7 @@ mysql> select * from `go_memstats_mcache_inuse_bytes`; 2 rows in set (0.01 sec) ``` -You can refer to the [pipeline's documentation](/user-guide/logs/pipeline-config.md) for more details. +You can refer to the [pipeline's documentation](/user-guide/logs/use-custom-pipelines.md) for more details. ## Performance tuning diff --git a/docs/user-guide/logs/fulltext-index-config.md b/docs/user-guide/logs/fulltext-index-config.md deleted file mode 100644 index a0f244399..000000000 --- a/docs/user-guide/logs/fulltext-index-config.md +++ /dev/null @@ -1,131 +0,0 @@ ---- -keywords: [fulltext index, tantivy, bloom, analyzer, case_sensitive, configuration] -description: Comprehensive guide for configuring full-text index in GreptimeDB, including backend selection and other configuration options. ---- - -# Full-Text Index Configuration - -This document provides a comprehensive guide for configuring full-text index in GreptimeDB, including backend selection and other configuration options. - -## Overview - -GreptimeDB provides full-text indexing capabilities to accelerate text search operations. You can configure full-text index when creating or altering tables, with various options to optimize for different use cases. For a general introduction to different types of indexes in GreptimeDB, including inverted index and skipping index, please refer to the [Data Index](/user-guide/manage-data/data-index) guide. - -## Configuration Options - -When creating or modifying a full-text index, you can specify the following options using `FULLTEXT INDEX WITH`: - -### Basic Options - -- `analyzer`: Sets the language analyzer for the full-text index - - Supported values: `English`, `Chinese` - - Default: `English` - - Note: The Chinese analyzer requires significantly more time to build the index due to the complexity of Chinese text segmentation. Consider using it only when Chinese text search is a primary requirement. - -- `case_sensitive`: Determines whether the full-text index is case-sensitive - - Supported values: `true`, `false` - - Default: `false` - - Note: Setting to `true` may slightly improve performance for case-sensitive queries, but will degrade performance for case-insensitive queries. This setting does not affect the results of `matches_term` queries. - -- `backend`: Sets the backend for the full-text index - - Supported values: `bloom`, `tantivy` - - Default: `bloom` - -- `granularity`: (For `bloom` backend) The size of data chunks covered by each filter. A smaller granularity improves filtering but increases index size. - - Supported values: positive integer - - Default: `10240` - -- `false_positive_rate`: (For `bloom` backend) The probability of misidentifying a block. A lower rate improves accuracy (better filtering) but increases index size. - - Supported values: float between `0` and `1` - - Default: `0.01` - -### Backend Selection - -GreptimeDB provides two full-text index backends for efficient log searching: - -1. 
**Bloom Backend** - - Best for: General-purpose log searching - - Features: - - Uses Bloom filter for efficient filtering - - Lower storage overhead - - Consistent performance across different query patterns - - Limitations: - - Slightly slower for high-selectivity queries - - Storage Cost Example: - - Original data: ~10GB - - Bloom index: ~1GB - -2. **Tantivy Backend** - - Best for: High-selectivity queries (e.g., unique values like TraceID) - - Features: - - Uses inverted index for fast exact matching - - Excellent performance for high-selectivity queries - - Limitations: - - Higher storage overhead (close to original data size) - - Slower performance for low-selectivity queries - - Storage Cost Example: - - Original data: ~10GB - - Tantivy index: ~10GB - -### Performance Comparison - -The following table shows the performance comparison between different query methods (using Bloom as baseline): - -| Query Type | High Selectivity (e.g., TraceID) | Low Selectivity (e.g., "HTTP") | -|------------|----------------------------------|--------------------------------| -| LIKE | 50x slower | 1x | -| Tantivy | 5x faster | 5x slower | -| Bloom | 1x (baseline) | 1x (baseline) | - -Key observations: -- For high-selectivity queries (e.g., unique values), Tantivy provides the best performance -- For low-selectivity queries, Bloom offers more consistent performance -- Bloom has significant storage advantage over Tantivy (1GB vs 10GB in test case) - -## Configuration Examples - -### Creating a Table with Full-Text Index - -```sql --- Using Bloom backend (recommended for most cases) -CREATE TABLE logs ( - timestamp TIMESTAMP(9) TIME INDEX, - message STRING FULLTEXT INDEX WITH ( - backend = 'bloom', - analyzer = 'English', - case_sensitive = 'false' - ) -); - --- Using Tantivy backend (for high-selectivity queries) -CREATE TABLE logs ( - timestamp TIMESTAMP(9) TIME INDEX, - message STRING FULLTEXT INDEX WITH ( - backend = 'tantivy', - analyzer = 'English', - case_sensitive = 'false' - ) -); -``` - -### Modifying an Existing Table - -```sql --- Enable full-text index on an existing column -ALTER TABLE monitor -MODIFY COLUMN load_15 -SET FULLTEXT INDEX WITH ( - analyzer = 'English', - case_sensitive = 'false', - backend = 'bloom' -); - --- Change full-text index configuration -ALTER TABLE logs -MODIFY COLUMN message -SET FULLTEXT INDEX WITH ( - analyzer = 'English', - case_sensitive = 'false', - backend = 'tantivy' -); -``` diff --git a/versioned_docs/version-0.17/user-guide/logs/query-logs.md b/docs/user-guide/logs/fulltext-search.md similarity index 99% rename from versioned_docs/version-0.17/user-guide/logs/query-logs.md rename to docs/user-guide/logs/fulltext-search.md index cc994f1eb..466392abd 100644 --- a/versioned_docs/version-0.17/user-guide/logs/query-logs.md +++ b/docs/user-guide/logs/fulltext-search.md @@ -3,12 +3,10 @@ keywords: [query logs, pattern matching, matches_term, query statements, log ana description: Provides a guide on using GreptimeDB's query language for effective searching and analysis of log data, including pattern matching and query statements. --- -# Query Logs +# Full-Text Search This document provides a guide on how to use GreptimeDB's query language for effective searching and analysis of log data. -## Overview - GreptimeDB allows for flexible querying of data using SQL statements. This section introduces specific search functions and query statements designed to enhance your log querying capabilities. 
## Pattern Matching Using the `matches_term` Function diff --git a/docs/user-guide/logs/manage-pipelines.md b/docs/user-guide/logs/manage-pipelines.md index 7e2a6901b..b870c5e40 100644 --- a/docs/user-guide/logs/manage-pipelines.md +++ b/docs/user-guide/logs/manage-pipelines.md @@ -7,14 +7,14 @@ description: Guides on creating, deleting, and managing pipelines in GreptimeDB In GreptimeDB, each `pipeline` is a collection of data processing units used for parsing and transforming the ingested log content. This document provides guidance on creating and deleting pipelines to efficiently manage the processing flow of log data. -For specific pipeline configurations, please refer to the [Pipeline Configuration](pipeline-config.md) documentation. +For specific pipeline configurations, please refer to the [Pipeline Configuration](/reference/pipeline/pipeline-config.md) documentation. ## Authentication The HTTP API for managing pipelines requires authentication. For more information, see the [Authentication](/user-guide/protocols/http.md#authentication) documentation. -## Create a Pipeline +## Upload a Pipeline GreptimeDB provides a dedicated HTTP interface for creating pipelines. Assuming you have prepared a pipeline configuration file `pipeline.yaml`, use the following command to upload the configuration file, where `test` is the name you specify for the pipeline: @@ -28,6 +28,23 @@ curl -X "POST" "http://localhost:4000/v1/pipelines/test" \ The created Pipeline is shared for all databases. +## Pipeline Versions + +You can upload multiple versions of a pipeline with the same name. +Each time you upload a pipeline with an existing name, a new version is created automatically. +You can specify which version to use when [ingesting logs](/reference/pipeline/write-log-api.md#http-api), [querying](#query-pipelines), or [deleting](#delete-a-pipeline) a pipeline. +The last uploaded version is used by default if no version is specified. + +After successfully uploading a pipeline, the response will include version information: + +```json +{"name":"nginx_pipeline","version":"2024-06-27 12:02:34.257312110Z"} +``` + +The version is a timestamp in UTC format that indicates when the pipeline was created. +This timestamp serves as a unique identifier for each pipeline version. + + ## Delete a Pipeline You can use the following HTTP interface to delete a pipeline: diff --git a/docs/user-guide/logs/overview.md b/docs/user-guide/logs/overview.md index bf424898e..9c9e2aa1e 100644 --- a/docs/user-guide/logs/overview.md +++ b/docs/user-guide/logs/overview.md @@ -1,16 +1,108 @@ --- keywords: [log service, quick start, pipeline configuration, manage pipelines, query logs] -description: Provides links to various guides on using GreptimeDB's log service, including quick start, pipeline configuration, managing pipelines, writing logs, querying logs, and full-text index configuration. +description: Comprehensive guide to GreptimeDB's log management capabilities, covering log collection architecture, pipeline processing, integration with popular collectors like Vector and Kafka, and advanced querying with full-text search. --- # Logs -In this chapter, we will walk-through GreptimeDB's features for logs support, -from basic ingestion/query, to advanced transformation, full-text index topics. +GreptimeDB provides a comprehensive log management solution designed for modern observability needs. 
+It offers seamless integration with popular log collectors, +flexible pipeline processing, +and powerful querying capabilities, including full-text search. -- [Quick Start](./quick-start.md): Provides an introduction on how to quickly get started with GreptimeDB log service. -- [Pipeline Configuration](./pipeline-config.md): Provides in-depth information on each specific configuration of pipelines in GreptimeDB. +Key features include: + +- **Unified Storage**: Store logs alongside metrics and traces in a single database +- **Pipeline Processing**: Transform and enrich raw logs with customizable pipelines, supporting various log collectors and formats +- **Advanced Querying**: SQL-based analysis with full-text search capabilities +- **Real-time Processing**: Process and query logs in real-time for monitoring and alerting + + +## Log Collection Flow + +![log-collection-flow](/log-collection-flow.drawio.svg) + +The diagram above illustrates the comprehensive log collection architecture, +which follows a structured four-stage process: Log Sources, Log Collectors, Pipeline Processing, and Storage in GreptimeDB. + +### Log Sources + +Log sources represent the foundational layer where log data originates within your infrastructure. +GreptimeDB supports ingestion from diverse source types to accommodate comprehensive observability requirements: + +- **Applications**: Application-level logs from microservices architectures, web applications, mobile applications, and custom software components +- **IoT Devices**: Device logs, sensor event logs, and operational status logs from Internet of Things ecosystems +- **Infrastructure**: Cloud platform logs, container orchestration logs (Kubernetes, Docker), load balancer logs, and network infrastructure component logs +- **System Components**: Operating system logs, kernel events, system daemon logs, and hardware monitoring logs +- **Custom Sources**: Any other log sources specific to your environment or applications + +### Log Collectors + +Log collectors are responsible for efficiently gathering log data from diverse sources and reliably forwarding it to the storage backend. GreptimeDB seamlessly integrates with industry-standard log collectors, +including Vector, Fluent Bit, Apache Kafka, OpenTelemetry Collector and more. + +GreptimeDB functions as a powerful sink backend for these collectors, +providing robust data ingestion capabilities. +During the ingestion process, +GreptimeDB's pipeline system enables real-time transformation and enrichment of log data, +ensuring optimal structure and quality before storage. 
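For example, pointing Vector at GreptimeDB takes only a small sink configuration using the `greptimedb_logs` sink. The sketch below uses placeholder values for the source id, endpoint, table name, and credentials; see the integration guides later in this page for complete options:

```toml
[sinks.greptimedb_sink]
type = "greptimedb_logs"
inputs = ["my_source_id"]        # replace with your Vector source id
dbname = "public"
endpoint = "http://<greptimedb-host>:4000"
pipeline_name = "greptime_identity"
table = "<table-name>"
username = "<username>"
password = "<password>"
```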
+ +### Pipeline Processing + +GreptimeDB's pipeline mechanism transforms raw logs into structured, queryable data: + +- **Parse**: Extract structured data from unstructured log messages +- **Transform**: Enrich logs with additional context and metadata +- **Index**: Configure indexes to optimize query performance and enable efficient searching, including full-text indexes, time indexes, and more + +### Storage in GreptimeDB + +After processing through the pipeline, +the logs are stored in GreptimeDB enabling flexible analysis and visualization: + +- **SQL Querying**: Use familiar SQL syntax to analyze log data +- **Time-based Analysis**: Leverage time-series capabilities for temporal analysis +- **Full-text Search**: Perform advanced text searches across log messages +- **Real-time Analytics**: Query logs in real-time for monitoring and alerting + +## Quick Start + +You can quickly get started by using the built-in `greptime_identity` pipeline for log ingestion. +For more information, please refer to the [Quick Start](./quick-start.md) guide. + +## Integrate with Log Collectors + +GreptimeDB integrates seamlessly with various log collectors to provide a comprehensive logging solution. The integration process follows these key steps: + +1. **Select Appropriate Log Collectors**: Choose collectors based on your infrastructure requirements, data sources, and performance needs +2. **Analyze Output Format**: Understand the log format and structure produced by your chosen collector +3. **Configure Pipeline**: Create and configure pipelines in GreptimeDB to parse, transform, and enrich the incoming log data +4. **Store and Query**: Efficiently store processed logs in GreptimeDB for real-time analysis and monitoring + +To successfully integrate your log collector with GreptimeDB, you'll need to: +- First understand how pipelines work in GreptimeDB +- Then configure the sink settings in your log collector to send data to GreptimeDB + +Please refer to the following guides for detailed instructions on integrating GreptimeDB with log collectors: + +- [Vector](/user-guide/ingest-data/for-observability/vector.md#using-greptimedb_logs-sink-recommended) +- [Kafka](/user-guide/ingest-data/for-observability/kafka.md#logs) +- [Fluent Bit](/user-guide/ingest-data/for-observability/fluent-bit.md#http) +- [OpenTelemetry Collector](/user-guide/ingest-data/for-observability/otel-collector.md) +- [Loki](/user-guide/ingest-data/for-observability/loki.md#using-pipeline-with-loki-push-api) + +## Learn More About Pipelines + +- [Using Custom Pipelines](./use-custom-pipelines.md): Explains how to create and use custom pipelines for log ingestion. - [Managing Pipelines](./manage-pipelines.md): Explains how to create and delete pipelines. -- [Writing Logs with Pipelines](./write-logs.md): Provides detailed instructions on efficiently writing log data by leveraging the pipeline mechanism. -- [Query Logs](./query-logs.md): Describes how to query logs using the GreptimeDB SQL interface. -- [Full-Text Index Configuration](./fulltext-index-config.md): Describes how to configure full-text index in GreptimeDB. + +## Query Logs + +- [Full-Text Search](./fulltext-search.md): Guide on using GreptimeDB's query language for effective searching and analysis of log data. + +## Reference + +- [Built-in Pipelines](/reference/pipeline/built-in-pipelines.md): Lists and describes the details of the built-in pipelines provided by GreptimeDB for log ingestion. 
+- [APIs for Writing Logs](/reference/pipeline/write-log-api.md): Describes the HTTP API for writing logs to GreptimeDB. +- [Pipeline Configuration](/reference/pipeline/pipeline-config.md): Provides in-depth information on each specific configuration of pipelines in GreptimeDB. + diff --git a/docs/user-guide/logs/quick-start.md b/docs/user-guide/logs/quick-start.md index dfb42b7d3..593d6daf1 100644 --- a/docs/user-guide/logs/quick-start.md +++ b/docs/user-guide/logs/quick-start.md @@ -1,333 +1,123 @@ --- -keywords: [quick start, write logs, query logs, pipeline, structured data, log ingestion, log collection, log management tools] -description: A comprehensive guide to quickly writing and querying logs in GreptimeDB, including direct log writing and using pipelines for structured data. +keywords: [logs, log service, pipeline, greptime_identity, quick start, json logs] +description: Quick start guide for GreptimeDB log service, including basic log ingestion using the built-in greptime_identity pipeline and integration with log collectors. --- # Quick Start -This guide provides step-by-step instructions for quickly writing and querying logs in GreptimeDB. +This guide will walk you through the essential steps to get started with GreptimeDB's log service. +You'll learn how to ingest logs using the built-in `greptime_identity` pipeline and integrate with log collectors. -GreptimeDB supports a pipeline mechanism to parse and transform structured log messages into multiple columns for efficient storage and querying. +GreptimeDB provides a powerful pipeline-based log ingestion system. +For quick setup with JSON-formatted logs, +you can use the built-in `greptime_identity` pipeline, which: -For unstructured logs, you can write them directly into a table without utilizing a pipeline. +- Automatically handles field mapping from JSON to table columns +- Creates tables automatically if they don't exist +- Supports flexible schemas for varying log structures +- Requires minimal configuration to get started -## Write logs by Pipeline +## Direct HTTP Ingestion -Pipelines enable automatic parsing and transformation of log messages into multiple columns, -as well as automatic table creation and alteration. +The simplest way to ingest logs into GreptimeDB is through a direct HTTP request using the `greptime_identity` pipeline. -### Write JSON logs using the built-in `greptime_identity` Pipeline - -GreptimeDB offers a built-in pipeline, `greptime_identity`, for handling JSON log formats. This pipeline simplifies the process of writing JSON logs. 
+For example, you can use `curl` to send a POST request with JSON log data: ```shell curl -X POST \ - "http://localhost:4000/v1/ingest?db=public&table=pipeline_logs&pipeline_name=greptime_identity" \ + "http://localhost:4000/v1/ingest?db=public&table=demo_logs&pipeline_name=greptime_identity" \ -H "Content-Type: application/json" \ -H "Authorization: Basic {{authentication}}" \ -d '[ { - "name": "Alice", - "age": 20, - "is_student": true, - "score": 90.5, - "object": { "a": 1, "b": 2 } - }, - { - "age": 21, - "is_student": false, - "score": 85.5, - "company": "A", - "whatever": null + "timestamp": "2024-01-15T10:30:00Z", + "level": "INFO", + "service": "web-server", + "message": "User login successful", + "user_id": 12345, + "ip_address": "192.168.1.100" }, { - "name": "Charlie", - "age": 22, - "is_student": true, - "score": 95.5, - "array": [1, 2, 3] + "timestamp": "2024-01-15T10:31:00Z", + "level": "ERROR", + "service": "database", + "message": "Connection timeout occurred", + "error_code": 500, + "retry_count": 3 } ]' ``` -- [`Authorization`](/user-guide/protocols/http.md#authentication) header. -- `pipeline_name=greptime_identity` specifies the built-in pipeline. -- `table=pipeline_logs` specifies the target table. If the table does not exist, it will be created automatically. - -The `greptime_identity` pipeline automatically creates columns for each field in the JSON log. -A successful command execution returns: - -```json -{"output":[{"affectedrows":3}],"execution_time_ms":9} -``` - -For more details about the `greptime_identity` pipeline, please refer to the [Write Logs](write-logs.md#greptime_identity) document. - -### Write logs using a custom Pipeline - -Custom pipelines allow you to parse and transform log messages into multiple columns based on specific patterns, -and automatically create tables. - -#### Create a Pipeline - -GreptimeDB provides an HTTP interface for creating pipelines. -Here is how to do it: - -First, create a pipeline file, for example, `pipeline.yaml`. - -```yaml -version: 2 -processors: - - dissect: - fields: - - message - patterns: - - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"' - ignore_missing: true - - date: - fields: - - timestamp - formats: - - "%d/%b/%Y:%H:%M:%S %z" - - select: - type: exclude - fields: - - message - -transform: - - fields: - - ip_address - type: string - index: inverted - tag: true - - fields: - - status_code - type: int32 - index: inverted - tag: true - - fields: - - request_line - - user_agent - type: string - index: fulltext - - fields: - - response_size - type: int32 - - fields: - - timestamp - type: time - index: timestamp -``` - -The pipeline splits the message field using the specified pattern to extract the `ip_address`, `timestamp`, `http_method`, `request_line`, `status_code`, `response_size`, and `user_agent`. -It then parses the `timestamp` field using the format` %d/%b/%Y:%H:%M:%S %z` to convert it into a proper timestamp format that the database can understand. -Finally, it converts each field to the appropriate datatype and indexes it accordingly. -Note at the beginning the pipeline is using version 2 format, see [here](./pipeline-config.md#transform-in-version-2) for more details. -In short, the version 2 indicates the pipeline engine to find fields that are not specified in the transform section, and persist them using the default datatype. 
-You can see in the [later section](#differences-between-using-a-pipeline-and-writing-unstructured-logs-directly) that although the `http_method` is not specified in the transform, it is persisted as well. -Also, a `select` processor is used to filter out the original `message` field. -It is worth noting that the `request_line` and `user_agent` fields are indexed as `fulltext` to optimize full-text search queries. -And there must be one time index column specified by the `timestamp`. - -Execute the following command to upload the configuration file: - -```shell -curl -X "POST" \ - "http://localhost:4000/v1/pipelines/nginx_pipeline" \ - -H 'Authorization: Basic {{authentication}}' \ - -F "file=@pipeline.yaml" -``` - -After successfully executing this command, a pipeline named `nginx_pipeline` will be created, and the result will be returned as: - -```json -{"name":"nginx_pipeline","version":"2024-06-27 12:02:34.257312110Z"}. -``` - -You can create multiple versions for the same pipeline name. -All pipelines are stored at the `greptime_private.pipelines` table. -Please refer to [Query Pipelines](manage-pipelines.md#query-pipelines) to view the pipeline data in the table. - -#### Write logs - -The following example writes logs to the `custom_pipeline_logs` table and uses the `nginx_pipeline` pipeline to format and transform the log messages. - -```shell -curl -X POST \ - "http://localhost:4000/v1/ingest?db=public&table=custom_pipeline_logs&pipeline_name=nginx_pipeline" \ - -H "Content-Type: application/json" \ - -H "Authorization: Basic {{authentication}}" \ - -d '[ - { - "message": "127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\"" - }, - { - "message": "192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\"" - }, - { - "message": "10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\"" - }, - { - "message": "172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\"" - } - ]' -``` +The key parameters are: -You will see the following output if the command is successful: +- `db=public`: Target database name (use your database name) +- `table=demo_logs`: Target table name (created automatically if it doesn't exist) +- `pipeline_name=greptime_identity`: Uses `greptime_identity` identity pipeline for JSON processing +- `Authorization` header: Basic authentication with base64-encoded `username:password`, see the [HTTP Authentication Guide](/user-guide/protocols/http.md#authentication) +A successful request returns: ```json -{"output":[{"affectedrows":4}],"execution_time_ms":79} -``` - -## Write unstructured logs directly - -When your log messages are unstructured text, -you can write them directly to the database. -However, this method limits the ability to perform high-performance analysis. - -### Create a table for unstructured logs - -You need to create a table to store the logs before inserting. 
-Use the following SQL statement to create a table named `origin_logs`: - -* The `FULLTEXT INDEX` on the `message` column optimizes text search queries -* Setting `append_mode` to `true` optimizes log insertion by only appending new rows to the table - -```sql -CREATE TABLE `origin_logs` ( - `message` STRING FULLTEXT INDEX, - `time` TIMESTAMP TIME INDEX -) WITH ( - append_mode = 'true' -); -``` - -### Insert logs - -#### Write logs using the SQL protocol - -Use the `INSERT` statement to insert logs into the table. - -```sql -INSERT INTO origin_logs (message, time) VALUES -('127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"', '2024-05-25 20:16:37.217'), -('192.168.1.1 - - [25/May/2024:20:17:37 +0000] "POST /api/login HTTP/1.1" 200 1784 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"', '2024-05-25 20:17:37.217'), -('10.0.0.1 - - [25/May/2024:20:18:37 +0000] "GET /images/logo.png HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"', '2024-05-25 20:18:37.217'), -('172.16.0.1 - - [25/May/2024:20:19:37 +0000] "GET /contact HTTP/1.1" 404 162 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1"', '2024-05-25 20:19:37.217'); -``` - -The above SQL inserts the entire log text into a single column, -and you must add an extra timestamp for each log. - -#### Write logs using the gRPC protocol - -You can also write logs using the gRPC protocol, which is a more efficient method. - -Refer to [Write Data Using gRPC](/user-guide/ingest-data/for-iot/grpc-sdks/overview.md) to learn how to write logs using the gRPC protocol. - -## Differences between using a pipeline and writing unstructured logs directly - -In the above examples, the table `custom_pipeline_logs` is automatically created by writing logs using pipeline, -and the table `origin_logs` is created by writing logs directly. -Let's explore the differences between these two tables. - -```sql -DESC custom_pipeline_logs; -``` - -```sql -+---------------+---------------------+------+------+---------+---------------+ -| Column | Type | Key | Null | Default | Semantic Type | -+---------------+---------------------+------+------+---------+---------------+ -| ip_address | String | PRI | YES | | TAG | -| status_code | Int32 | PRI | YES | | TAG | -| request_line | String | | YES | | FIELD | -| user_agent | String | | YES | | FIELD | -| response_size | Int32 | | YES | | FIELD | -| timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP | -| http_method | String | | YES | | FIELD | -+---------------+---------------------+------+------+---------+---------------+ -7 rows in set (0.00 sec) +{ + "output": [{"affectedrows": 2}], + "execution_time_ms": 15 +} ``` +After successful ingestion, +the corresponding table `demo_logs` is automatically created with columns based on the JSON fields. 
+The schema is as follows: ```sql -DESC origin_logs; ++--------------------+---------------------+------+------+---------+---------------+ +| Column | Type | Key | Null | Default | Semantic Type | ++--------------------+---------------------+------+------+---------+---------------+ +| greptime_timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP | +| ip_address | String | | YES | | FIELD | +| level | String | | YES | | FIELD | +| message | String | | YES | | FIELD | +| service | String | | YES | | FIELD | +| timestamp | String | | YES | | FIELD | +| user_id | Int64 | | YES | | FIELD | +| error_code | Int64 | | YES | | FIELD | +| retry_count | Int64 | | YES | | FIELD | ++--------------------+---------------------+------+------+---------+---------------+ ``` -```sql -+---------+----------------------+------+------+---------+---------------+ -| Column | Type | Key | Null | Default | Semantic Type | -+---------+----------------------+------+------+---------+---------------+ -| message | String | | YES | | FIELD | -| time | TimestampMillisecond | PRI | NO | | TIMESTAMP | -+---------+----------------------+------+------+---------+---------------+ -``` - -From the table structure, you can see that the `origin_logs` table has only two columns, -with the entire log message stored in a single column. -The `custom_pipeline_logs` table stores the log message in multiple columns. - -It is recommended to use the pipeline method to split the log message into multiple columns, which offers the advantage of explicitly querying specific values within certain columns. Column matching query proves superior to full-text searching for several key reasons: - -- **Performance Efficiency**: Column matching query is typically faster than full-text searching. -- **Resource Consumption**: Due to GreptimeDB's columnar storage engine, structured data is more conducive to compression. Additionally, the inverted index used for tag matching query typically consumes significantly fewer resources than a full-text index, especially in terms of storage size. -- **Maintainability**: Tag matching query is straightforward and easier to understand, write, and debug. - -Of course, if you need keyword searching within large text blocks, you must use full-text searching as it is specifically designed for that purpose. - -## Query logs +## Integration with Log Collectors -We use the `custom_pipeline_logs` table as an example to query logs. +For production environments, +you'll typically use log collectors to automatically forward logs to GreptimeDB. +Here is an example about how to configure Vector to send logs to GreptimeDB using the `greptime_identity` pipeline: -### Query logs by tags - -With the multiple tag columns in `custom_pipeline_logs`, -you can query data by tags flexibly. -For example, query the logs with `status_code` 200 and `http_method` GET. 
- -```sql -SELECT * FROM custom_pipeline_logs WHERE status_code = 200 AND http_method = 'GET'; -``` - -```sql -+------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -| ip_address | status_code | request_line | user_agent | response_size | timestamp | http_method | -+------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -| 127.0.0.1 | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 | GET | -+------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -1 row in set (0.02 sec) +```toml +[sinks.my_sink_id] +type = "greptimedb_logs" +dbname = "public" +endpoint = "http://:4000" +pipeline_name = "greptime_identity" +table = "" +username = "" +password = "" +# Additional configurations as needed ``` -### Full-Text Search +The key configuration parameters are: +- `type = "greptimedb_logs"`: Specifies the GreptimeDB logs sink +- `dbname`: Target database name +- `endpoint`: GreptimeDB HTTP endpoint +- `pipeline_name`: Uses `greptime_identity` pipeline for JSON processing +- `table`: Target table name (created automatically if it doesn't exist) +- `username` and `password`: Credentials for HTTP Basic Authentication -For the text fields `request_line` and `user_agent`, you can use `matches_term` function to search logs. -Remember, we created the full-text index for these two columns when [creating a pipeline](#create-a-pipeline). \ -This allows for high-performance full-text searches. +For details about the Vector configuration and options, +refer to the [Vector Integration Guide](/user-guide/ingest-data/for-observability/vector.md#using-greptimedb_logs-sink-recommended). -For example, query the logs with `request_line` containing `/index.html` or `/api/login`. 
-```sql -SELECT * FROM custom_pipeline_logs WHERE matches_term(request_line, '/index.html') OR matches_term(request_line, '/api/login'); -``` - -```sql -+-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -| ip_address | status_code | request_line | user_agent | response_size | timestamp | http_method | -+-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -| 127.0.0.1 | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 | GET | -| 192.168.1.1 | 200 | /api/login HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 | 1784 | 2024-05-25 20:17:37 | POST | -+-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -2 rows in set (0.00 sec) -``` +## Next Steps -You can refer to the [Query Logs](query-logs.md) document for detailed usage of the `matches_term` function. +You've successfully ingested your first logs, here are the recommended next steps: -## Next steps +- **Learn more about the behaviours of built-in Pipelines**: Refer to the [Built-in Pipelines](/reference/pipeline/built-in-pipelines.md) guide for detailed information on available built-in pipelines and their configurations. +- **Integrate with Popular Log Collectors**: For detailed instructions on integrating GreptimeDB with various log collectors like Fluent Bit, Fluentd, and others, refer to the [Integrate with Popular Log Collectors](./overview.md#integrate-with-log-collectors) section in the [Logs Overview](./overview.md) guide. +- **Using Custom Pipelines**: To learn more about creating custom pipelines for advanced log processing and transformation, refer to the [Using Custom Pipelines](./use-custom-pipelines.md) guide. -You have now experienced GreptimeDB's logging capabilities. -You can explore further by following the documentation below: -- [Pipeline Configuration](./pipeline-config.md): Provides in-depth information on each specific configuration of pipelines in GreptimeDB. -- [Managing Pipelines](./manage-pipelines.md): Explains how to create and delete pipelines. -- [Writing Logs with Pipelines](./write-logs.md): Provides detailed instructions on efficiently writing log data by leveraging the pipeline mechanism. -- [Query Logs](./query-logs.md): Describes how to query logs using the GreptimeDB SQL interface. diff --git a/docs/user-guide/logs/use-custom-pipelines.md b/docs/user-guide/logs/use-custom-pipelines.md new file mode 100644 index 000000000..951934823 --- /dev/null +++ b/docs/user-guide/logs/use-custom-pipelines.md @@ -0,0 +1,317 @@ +--- +keywords: [quick start, write logs, query logs, pipeline, structured data, log ingestion, log collection, log management tools] +description: A comprehensive guide to quickly writing and querying logs in GreptimeDB, including direct log writing and using pipelines for structured data. 
+---
+
+# Using Custom Pipelines
+
+GreptimeDB automatically parses and transforms logs into structured,
+multi-column data based on your pipeline configuration.
+When built-in pipelines cannot handle your specific log format,
+you can create custom pipelines to define exactly how your log data should be parsed and transformed.
+
+## Identify Your Original Log Format
+
+Before creating a custom pipeline, it's essential to understand the format of your original log data.
+If you're using log collectors and aren't sure about the log format,
+there are two ways to examine your logs:
+
+1. **Read the collector's official documentation**: Configure your collector to output data to the console or a file to inspect the log format.
+2. **Use the `greptime_identity` pipeline**: Ingest sample logs directly into GreptimeDB using the built-in `greptime_identity` pipeline.
+   The `greptime_identity` pipeline treats the entire text log as a single `message` field,
+   which makes it easy to inspect the raw log content directly.
+
+Once you understand the log format you want to process,
+you can create a custom pipeline.
+This document uses the following Nginx access log entry as an example:
+
+```txt
+127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
+```
+
+## Create a Custom Pipeline
+
+GreptimeDB provides an HTTP interface for creating pipelines.
+Here's how to create one.
+
+First, create a pipeline configuration file to process Nginx access logs
+and save it as `pipeline.yaml`:
+
+```yaml
+version: 2
+processors:
+  - dissect:
+      fields:
+        - message
+      patterns:
+        - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"'
+      ignore_missing: true
+  - date:
+      fields:
+        - timestamp
+      formats:
+        - "%d/%b/%Y:%H:%M:%S %z"
+  - select:
+      type: exclude
+      fields:
+        - message
+  - vrl:
+      source: |
+        .greptime_table_ttl = "7d"
+        .
+
+transform:
+  - fields:
+      - ip_address
+    type: string
+    index: inverted
+    tag: true
+  - fields:
+      - status_code
+    type: int32
+    index: inverted
+    tag: true
+  - fields:
+      - request_line
+      - user_agent
+    type: string
+    index: fulltext
+  - fields:
+      - response_size
+    type: int32
+  - fields:
+      - timestamp
+    type: time
+    index: timestamp
+```
+
+The pipeline configuration above uses the [version 2](/reference/pipeline/pipeline-config.md#transform-in-version-2) format
+and contains `processors` and `transform` sections that work together to structure your log data:
+
+**Processors**: Used to preprocess log data before transformation (an illustrative snapshot of the extracted fields follows this list):
+- **Data Extraction**: The `dissect` processor uses pattern matching to parse the `message` field and extract structured data including `ip_address`, `timestamp`, `http_method`, `request_line`, `status_code`, `response_size`, and `user_agent`.
+- **Timestamp Processing**: The `date` processor parses the extracted `timestamp` field using the format `%d/%b/%Y:%H:%M:%S %z` and converts it to a proper timestamp data type.
+- **Field Selection**: The `select` processor excludes the original `message` field from the final output while retaining all other fields.
+- **Table Options**: The `vrl` processor sets table options during pipeline execution, such as adding a suffix to the table name or setting the TTL. In this example, the `greptime_table_ttl = "7d"` line configures the table data with a time-to-live of 7 days.
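+
+To make the processor stage concrete, here is a rough sketch of what the pipeline context could hold for the first sample log line
+after all processors have run but before the `transform` stage
+(the JSON rendering and the formatting of the parsed timestamp are illustrative assumptions, not actual GreptimeDB output):
+
+```json
+{
+  "ip_address": "127.0.0.1",
+  "timestamp": "2024-05-25 20:16:37+00:00",
+  "http_method": "GET",
+  "request_line": "/index.html HTTP/1.1",
+  "status_code": "200",
+  "response_size": "612",
+  "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
+  "greptime_table_ttl": "7d"
+}
+```
+
+At this point the original `message` field has already been dropped by the `select` processor,
+most extracted values are still plain strings (the `date` processor has converted `timestamp`),
+and the `vrl` processor has attached the `greptime_table_ttl` hint.
+The `transform` stage described next casts these values to their final column types and applies the index settings.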
+
+**Transform**: Defines how to convert and index the extracted fields:
+- **Field Transformation**: Each extracted field is converted to its appropriate data type with specific indexing configurations. Fields like `http_method` retain their default data types when no explicit configuration is provided.
+- **Indexing Strategy**:
+  - `ip_address` and `status_code` use inverted indexing as tags for fast filtering
+  - `request_line` and `user_agent` use full-text indexing for optimal text search capabilities
+  - `timestamp` serves as the required time index column
+
+For detailed information about pipeline configuration options,
+please refer to the [Pipeline Configuration](/reference/pipeline/pipeline-config.md) documentation.
+
+## Upload the Pipeline
+
+Execute the following command to upload the pipeline configuration:
+
+```shell
+curl -X "POST" \
+  "http://localhost:4000/v1/pipelines/nginx_pipeline" \
+  -H 'Authorization: Basic {{authentication}}' \
+  -F "file=@pipeline.yaml"
+```
+
+After successful execution, a pipeline named `nginx_pipeline` is created and the following result is returned:
+
+```json
+{"name":"nginx_pipeline","version":"2024-06-27 12:02:34.257312110Z"}
+```
+
+You can create multiple versions for the same pipeline name.
+All pipelines are stored in the `greptime_private.pipelines` table.
+Refer to [Query Pipelines](manage-pipelines.md#query-pipelines) to view pipeline data.
+
+## Ingest Logs Using the Pipeline
+
+The following example writes logs to the `custom_pipeline_logs` table using the `nginx_pipeline` pipeline to format and transform the log messages:
+
+```shell
+curl -X POST \
+  "http://localhost:4000/v1/ingest?db=public&table=custom_pipeline_logs&pipeline_name=nginx_pipeline" \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Basic {{authentication}}" \
+  -d '[
+    {
+      "message": "127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""
+    },
+    {
+      "message": "192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""
+    },
+    {
+      "message": "10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""
+    },
+    {
+      "message": "172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""
+    }
+  ]'
+```
+
+The command will return the following output upon success:
+
+```json
+{"output":[{"affectedrows":4}],"execution_time_ms":79}
+```
+
+The `custom_pipeline_logs` table is automatically created based on the pipeline configuration and contains the ingested data:
+
+```sql
++-------------+-------------+-------------+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+
+| ip_address  | http_method | status_code | request_line              | user_agent                                                                                                                                  | response_size | timestamp           |
++-------------+-------------+-------------+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+
+| 10.0.0.1    | GET         | 304         | /images/logo.png HTTP/1.1 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0                                                                | 0             | 2024-05-25 20:18:37 |
+| 127.0.0.1   | GET         | 200         | /index.html HTTP/1.1      | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36                         | 612           | 2024-05-25 20:16:37 |
+| 172.16.0.1  | GET         | 404         | /contact HTTP/1.1         | Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1     | 162           | 2024-05-25 20:19:37 |
+| 192.168.1.1 | POST        | 200         | /api/login HTTP/1.1       | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36                    | 1784          | 2024-05-25 20:17:37 |
++-------------+-------------+-------------+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+
+```
+For more detailed information about the log ingestion API endpoint `/ingest`,
+including additional parameters and configuration options,
+please refer to the [APIs for Writing Logs](/reference/pipeline/write-log-api.md) documentation.
+
+## Query Logs
+
+We use the `custom_pipeline_logs` table as an example to query logs.
+
+### Query Logs by Tags
+
+With the multiple tag columns in `custom_pipeline_logs`,
+you can flexibly query data by tags.
+For example, query the logs with `status_code` 200 and `http_method` GET.
+
+```sql
+SELECT * FROM custom_pipeline_logs WHERE status_code = 200 AND http_method = 'GET';
+```
+
+```sql
++------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+
+| ip_address | status_code | request_line         | user_agent                                                                                                            | response_size | timestamp           | http_method |
++------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+
+| 127.0.0.1  | 200         | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36   | 612           | 2024-05-25 20:16:37 | GET         |
++------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+
+1 row in set (0.02 sec)
+```
+
+### Full-Text Search
+
+For the text fields `request_line` and `user_agent`, you can use the `matches_term` function to search logs.
+Remember, we created the full-text index for these two columns when [creating the pipeline](#create-a-custom-pipeline).
+This allows for high-performance full-text searches.
+
+For example, query the logs with `request_line` containing `/index.html` or `/api/login`.
+ +```sql +SELECT * FROM custom_pipeline_logs WHERE matches_term(request_line, '/index.html') OR matches_term(request_line, '/api/login'); +``` + +```sql ++-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ +| ip_address | status_code | request_line | user_agent | response_size | timestamp | http_method | ++-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ +| 127.0.0.1 | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 | GET | +| 192.168.1.1 | 200 | /api/login HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 | 1784 | 2024-05-25 20:17:37 | POST | ++-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ +2 rows in set (0.00 sec) +``` + +You can refer to the [Full-Text Search](fulltext-search.md) document for detailed usage of the `matches_term` function. + + +## Benefits of Using Pipelines + +Using pipelines to process logs provides structured data and automatic field extraction, +enabling more efficient querying and analysis. + +You can also write logs directly to the database without pipelines, +but this approach limits high-performance analysis capabilities. + +### Direct Log Insertion (Without Pipeline) + +For comparison, you can create a table to store original log messages: + +```sql +CREATE TABLE `origin_logs` ( + `message` STRING FULLTEXT INDEX, + `time` TIMESTAMP TIME INDEX +) WITH ( + append_mode = 'true' +); +``` + +Use the `INSERT` statement to insert logs into the table. +Note that you need to manually add a timestamp field for each log: + +```sql +INSERT INTO origin_logs (message, time) VALUES +('127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"', '2024-05-25 20:16:37.217'), +('192.168.1.1 - - [25/May/2024:20:17:37 +0000] "POST /api/login HTTP/1.1" 200 1784 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"', '2024-05-25 20:17:37.217'), +('10.0.0.1 - - [25/May/2024:20:18:37 +0000] "GET /images/logo.png HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"', '2024-05-25 20:18:37.217'), +('172.16.0.1 - - [25/May/2024:20:19:37 +0000] "GET /contact HTTP/1.1" 404 162 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1"', '2024-05-25 20:19:37.217'); +``` + +### Schema Comparison: Pipeline vs Raw + +In the above examples, the table `custom_pipeline_logs` is automatically created by writing logs using pipeline, +and the table `origin_logs` is created by writing logs directly. +Let's explore the differences between these two tables. 
+ +```sql +DESC custom_pipeline_logs; +``` + +```sql ++---------------+---------------------+------+------+---------+---------------+ +| Column | Type | Key | Null | Default | Semantic Type | ++---------------+---------------------+------+------+---------+---------------+ +| ip_address | String | PRI | YES | | TAG | +| status_code | Int32 | PRI | YES | | TAG | +| request_line | String | | YES | | FIELD | +| user_agent | String | | YES | | FIELD | +| response_size | Int32 | | YES | | FIELD | +| timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP | +| http_method | String | | YES | | FIELD | ++---------------+---------------------+------+------+---------+---------------+ +7 rows in set (0.00 sec) +``` + +```sql +DESC origin_logs; +``` + +```sql ++---------+----------------------+------+------+---------+---------------+ +| Column | Type | Key | Null | Default | Semantic Type | ++---------+----------------------+------+------+---------+---------------+ +| message | String | | YES | | FIELD | +| time | TimestampMillisecond | PRI | NO | | TIMESTAMP | ++---------+----------------------+------+------+---------+---------------+ +``` + +Comparing the table structures shows the key differences: + +The `custom_pipeline_logs` table (created with pipeline) automatically structures log data into multiple columns: +- `ip_address`, `status_code` as indexed tags for fast filtering +- `request_line`, `user_agent` with full-text indexing for text search +- `response_size`, `http_method` as regular fields +- `timestamp` as the time index + +The `origin_logs` table (direct insertion) stores everything in a single `message` column. + +### Why Use Pipelines? + +It is recommended to use the pipeline method to split the log message into multiple columns, +which offers the advantage of explicitly querying specific values within certain columns. +Column matching query proves superior to full-text searching for several key reasons: + +- **Performance**: Column-based queries are typically faster than full-text searches +- **Storage Efficiency**: GreptimeDB's columnar storage compresses structured data better; inverted indexes for tags consume less storage than full-text indexes +- **Query Simplicity**: Tag-based queries are easier to write, understand, and debug + +## Next Steps + +- **Full-Text Search**: Explore the [Full-Text Search](fulltext-search.md) guide to learn advanced text search capabilities and query techniques in GreptimeDB +- **Pipeline Configuration**: Explore the [Pipeline Configuration](/reference/pipeline/pipeline-config.md) documentation to learn more about creating and customizing pipelines for various log formats and processing needs + diff --git a/docs/user-guide/logs/write-logs.md b/docs/user-guide/logs/write-logs.md deleted file mode 100644 index 61c91542d..000000000 --- a/docs/user-guide/logs/write-logs.md +++ /dev/null @@ -1,347 +0,0 @@ ---- -keywords: [write logs, HTTP interface, log formats, request parameters, JSON logs] -description: Describes how to write logs to GreptimeDB using a pipeline via the HTTP interface, including supported formats and request parameters. ---- - -# Writing Logs Using a Pipeline - -This document describes how to write logs to GreptimeDB by processing them through a specified pipeline using the HTTP interface. - -Before writing logs, please read the [Pipeline Configuration](pipeline-config.md) and [Managing Pipelines](manage-pipelines.md) documents to complete the configuration setup and upload. 
- -## HTTP API - -You can use the following command to write logs via the HTTP interface: - -```shell -curl -X "POST" "http://localhost:4000/v1/ingest?db=&table=&pipeline_name=&version=" \ - -H "Content-Type: application/x-ndjson" \ - -H "Authorization: Basic {{authentication}}" \ - -d "$" -``` - -## Request parameters - -This interface accepts the following parameters: - -- `db`: The name of the database. -- `table`: The name of the table. -- `pipeline_name`: The name of the [pipeline](./pipeline-config.md). -- `version`: The version of the pipeline. Optional, default use the latest one. - -## `Content-Type` and body format - -GreptimeDB uses `Content-Type` header to decide how to decode the payload body. Currently the following two format is supported: -- `application/json`: this includes normal JSON format and NDJSON format. -- `application/x-ndjson`: specifically uses NDJSON format, which will try to split lines and parse for more accurate error checking. -- `text/plain`: multiple log lines separated by line breaks. - -### `application/json` and `application/x-ndjson` format - -Here is an example of JSON format body payload - -```JSON -[ - {"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""}, - {"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""}, - {"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""}, - {"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""} -] -``` - -Note the whole JSON is an array (log lines). Each JSON object represents one line to be processed by Pipeline engine. - -The name of the key in JSON objects, which is `message` here, is used as field name in Pipeline processors. 
For example: - -```yaml -processors: - - dissect: - fields: - # `message` is the key in JSON object - - message - patterns: - - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"' - ignore_missing: true - -# rest of the file is ignored -``` - -We can also rewrite the payload into NDJSON format like following: - -```JSON -{"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""} -{"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""} -{"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""} -{"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""} -``` - -Note the outer array is eliminated, and lines are separated by line breaks instead of `,`. - -### `text/plain` format - -Log in plain text format is widely used throughout the ecosystem. GreptimeDB also supports `text/plain` format as log data input, enabling ingesting logs first hand from log producers. - -The equivalent body payload of previous example is like following: - -```plain -127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" -192.168.1.1 - - [25/May/2024:20:17:37 +0000] "POST /api/login HTTP/1.1" 200 1784 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36" -10.0.0.1 - - [25/May/2024:20:18:37 +0000] "GET /images/logo.png HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0" -172.16.0.1 - - [25/May/2024:20:19:37 +0000] "GET /contact HTTP/1.1" 404 162 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1" -``` - -Sending log ingestion request to GreptimeDB requires only modifying the `Content-Type` header to be `text/plain`, and you are good to go! - -Please note that, unlike JSON format, where the input data already have key names as field names to be used in Pipeline processors, `text/plain` format just gives the whole line as input to the Pipeline engine. In this case we use `message` as the field name to refer to the input line, for example: - -```yaml -processors: - - dissect: - fields: - # use `message` as the field name - - message - patterns: - - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"' - ignore_missing: true - -# rest of the file is ignored -``` - -It is recommended to use `dissect` or `regex` processor to split the input line into fields first and then process the fields accordingly. - -## Built-in Pipelines - -GreptimeDB offers built-in pipelines for common log formats, allowing you to use them directly without creating new pipelines. - -Note that the built-in pipelines are not editable. 
Additionally, the "greptime_" prefix of the pipeline name is reserved. - -### `greptime_identity` - -The `greptime_identity` pipeline is designed for writing JSON logs and automatically creates columns for each field in the JSON log. - -- The first-level keys in the JSON log are used as column names. -- An error is returned if the same field has different types. -- Fields with `null` values are ignored. -- If time index is not specified, an additional column, `greptime_timestamp`, is added to the table as the time index to indicate when the log was written. - -#### Type conversion rules - -- `string` -> `string` -- `number` -> `int64` or `float64` -- `boolean` -> `bool` -- `null` -> ignore -- `array` -> `json` -- `object` -> `json` - - -For example, if we have the following json data: - -```json -[ - {"name": "Alice", "age": 20, "is_student": true, "score": 90.5,"object": {"a":1,"b":2}}, - {"age": 21, "is_student": false, "score": 85.5, "company": "A" ,"whatever": null}, - {"name": "Charlie", "age": 22, "is_student": true, "score": 95.5,"array":[1,2,3]} -] -``` - -We'll merge the schema for each row of this batch to get the final schema. The table schema will be: - -```sql -mysql> desc pipeline_logs; -+--------------------+---------------------+------+------+---------+---------------+ -| Column | Type | Key | Null | Default | Semantic Type | -+--------------------+---------------------+------+------+---------+---------------+ -| age | Int64 | | YES | | FIELD | -| is_student | Boolean | | YES | | FIELD | -| name | String | | YES | | FIELD | -| object | Json | | YES | | FIELD | -| score | Float64 | | YES | | FIELD | -| company | String | | YES | | FIELD | -| array | Json | | YES | | FIELD | -| greptime_timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP | -+--------------------+---------------------+------+------+---------+---------------+ -8 rows in set (0.00 sec) -``` - -The data will be stored in the table as follows: - -```sql -mysql> select * from pipeline_logs; -+------+------------+---------+---------------+-------+---------+---------+----------------------------+ -| age | is_student | name | object | score | company | array | greptime_timestamp | -+------+------------+---------+---------------+-------+---------+---------+----------------------------+ -| 22 | 1 | Charlie | NULL | 95.5 | NULL | [1,2,3] | 2024-10-18 09:35:48.333020 | -| 21 | 0 | NULL | NULL | 85.5 | A | NULL | 2024-10-18 09:35:48.333020 | -| 20 | 1 | Alice | {"a":1,"b":2} | 90.5 | NULL | NULL | 2024-10-18 09:35:48.333020 | -+------+------------+---------+---------------+-------+---------+---------+----------------------------+ -3 rows in set (0.01 sec) -``` - -#### Specify time index - -A time index is necessary in GreptimeDB. Since the `greptime_identity` pipeline does not require a YAML configuration, you must set the time index in the query parameters if you want to use the timestamp from the log data instead of the automatically generated timestamp when the data arrives. 
- -Example of Incoming Log Data: -```JSON -[ - {"action": "login", "ts": 1742814853} -] -``` - -To instruct the server to use ts as the time index, set the following query parameter in the HTTP header: -```shell -curl -X "POST" "http://localhost:4000/v1/ingest?db=public&table=pipeline_logs&pipeline_name=greptime_identity&custom_time_index=ts;epoch;s" \ - -H "Content-Type: application/json" \ - -H "Authorization: Basic {{authentication}}" \ - -d $'[{"action": "login", "ts": 1742814853}]' -``` - -The `custom_time_index` parameter accepts two formats, depending on the input data format: -- Epoch number format: `;epoch;` - - The field can be an integer or a string. - - The resolution must be one of: `s`, `ms`, `us`, or `ns`. -- Date string format: `;datestr;` - - For example, if the input data contains a timestamp like `2025-03-24 19:31:37+08:00`, the corresponding format should be `%Y-%m-%d %H:%M:%S%:z`. - -With the configuration above, the resulting table will correctly use the specified log data field as the time index. -```sql -DESC pipeline_logs; -``` -```sql -+--------+-----------------+------+------+---------+---------------+ -| Column | Type | Key | Null | Default | Semantic Type | -+--------+-----------------+------+------+---------+---------------+ -| ts | TimestampSecond | PRI | NO | | TIMESTAMP | -| action | String | | YES | | FIELD | -+--------+-----------------+------+------+---------+---------------+ -2 rows in set (0.02 sec) -``` - -Here are some example of using `custom_time_index` assuming the time variable is named `input_ts`: -- 1742814853: `custom_time_index=input_ts;epoch;s` -- 1752749137000: `custom_time_index=input_ts;epoch;ms` -- "2025-07-17T10:00:00+0800": `custom_time_index=input_ts;datestr;%Y-%m-%dT%H:%M:%S%z` -- "2025-06-27T15:02:23.082253908Z": `custom_time_index=input_ts;datestr;%Y-%m-%dT%H:%M:%S%.9f%#z` - - -#### Flatten JSON objects - -If flattening a JSON object into a single-level structure is needed, add the `x-greptime-pipeline-params` header to the request and set `flatten_json_object` to `true`. - -Here is a sample request: - -```shell -curl -X "POST" "http://localhost:4000/v1/ingest?db=&table=&pipeline_name=greptime_identity&version=" \ - -H "Content-Type: application/x-ndjson" \ - -H "Authorization: Basic {{authentication}}" \ - -H "x-greptime-pipeline-params: flatten_json_object=true" \ - -d "$" -``` - -With this configuration, GreptimeDB will automatically flatten each field of the JSON object into separate columns. For example: - -```JSON -{ - "a": { - "b": { - "c": [1, 2, 3] - } - }, - "d": [ - "foo", - "bar" - ], - "e": { - "f": [7, 8, 9], - "g": { - "h": 123, - "i": "hello", - "j": { - "k": true - } - } - } -} -``` - -Will be flattened to: - -```json -{ - "a.b.c": [1,2,3], - "d": ["foo","bar"], - "e.f": [7,8,9], - "e.g.h": 123, - "e.g.i": "hello", - "e.g.j.k": true -} -``` - - - -## Variable hints in the pipeline context - -Starting from `v0.15`, the pipeline engine now recognizes certain variables, and can set corresponding table options based on the value of the variables. -Combined with the `vrl` processor, it's now easy to create and set table options during the pipeline execution based on input data. - -Here is a list of supported common table option variables: -- `greptime_auto_create_table` -- `greptime_ttl` -- `greptime_append_mode` -- `greptime_merge_mode` -- `greptime_physical_table` -- `greptime_skip_wal` -You can find the explanation [here](/reference/sql/create.md#table-options). 
- -Here are some pipeline specific variables: -- `greptime_table_suffix`: add suffix to the destined table name. - -Let's use the following pipeline file to demonstrate: -```YAML -processors: - - date: - field: time - formats: - - "%Y-%m-%d %H:%M:%S%.3f" - ignore_missing: true - - vrl: - source: | - .greptime_table_suffix, err = "_" + .id - .greptime_table_ttl = "1d" - . -``` - -In the vrl script, we set the table suffix variable with the input field `.id`(leading with an underscore), and set the ttl to `1d`. -Then we run the ingestion using the following JSON data. - -```JSON -{ - "id": "2436", - "time": "2024-05-25 20:16:37.217" -} -``` - -Assuming the given table name being `d_table`, the final table name would be `d_table_2436` as we would expected. -The table is also set with a ttl of 1 day. - -## Examples - -Please refer to the "Writing Logs" section in the [Quick Start](quick-start.md#write-logs) guide for examples. - -## Append Only - -By default, logs table created by HTTP ingestion API are in [append only -mode](/user-guide/deployments-administration/performance-tuning/design-table.md#when-to-use-append-only-tables). - -## Skip Errors with skip_error - -If you want to skip errors when writing logs, you can add the `skip_error` parameter to the HTTP request's query params. For example: - -```shell -curl -X "POST" "http://localhost:4000/v1/ingest?db=&table=&pipeline_name=&version=&skip_error=true" \ - -H "Content-Type: application/x-ndjson" \ - -H "Authorization: Basic {{authentication}}" \ - -d "$" -``` - -With this, GreptimeDB will skip the log entry when an error is encountered and continue processing the remaining logs. The entire request will not fail due to an error in a single log entry. \ No newline at end of file diff --git a/docs/user-guide/manage-data/data-index.md b/docs/user-guide/manage-data/data-index.md index c50ebacba..c1c38a58c 100644 --- a/docs/user-guide/manage-data/data-index.md +++ b/docs/user-guide/manage-data/data-index.md @@ -1,6 +1,6 @@ --- -keywords: [index, inverted index, skipping index, fulltext index, query performance] -description: Learn about different types of indexes in GreptimeDB, including inverted index, skipping index, and fulltext index, and how to use them effectively to optimize query performance. +keywords: [index, inverted index, skipping index, full-text index, query performance] +description: Learn about different types of indexes in GreptimeDB, including inverted index, skipping index, and full-text index, and how to use them effectively to optimize query performance. --- # Data Index @@ -75,11 +75,11 @@ CREATE TABLE sensor_data ( ); ``` -Skipping index can't handle complex filter conditions, and usually has a lower filtering performance compared to inverted index or fulltext index. +Skipping index can't handle complex filter conditions, and usually has a lower filtering performance compared to inverted index or full-text index. -### Fulltext Index +### Full-Text Index -Fulltext index is designed for text search operations on string columns. It enables efficient searching of text content using word-based matching and text search capabilities. You can query text data with flexible keywords, phrases, or pattern matching queries. +Full-text index is designed for text search operations on string columns. It enables efficient searching of text content using word-based matching and text search capabilities. You can query text data with flexible keywords, phrases, or pattern matching queries. 
**Use Cases:** - Text search operations @@ -95,20 +95,120 @@ CREATE TABLE logs ( ); ``` -Fulltext index supports options by `WITH`: -* `analyzer`: Sets the language analyzer for the fulltext index. Supported values are `English` and `Chinese`. Default to `English`. -* `case_sensitive`: Determines whether the fulltext index is case-sensitive. Supported values are `true` and `false`. Default to `false`. -* `backend`: Sets the backend for the fulltext index. Supported values are `bloom` and `tantivy`. Default to `bloom`. -* `granularity`: (For `bloom` backend) The size of data chunks covered by each filter. A smaller granularity improves filtering but increases index size. Default is `10240`. -* `false_positive_rate`: (For `bloom` backend) The probability of misidentifying a block. A lower rate improves accuracy (better filtering) but increases index size. Value is a float between `0` and `1`. Default is `0.01`. +#### Configuration Options -For example: +When creating or modifying a full-text index, you can specify the following options using `FULLTEXT INDEX WITH`: + +- `analyzer`: Sets the language analyzer for the full-text index + - Supported values: `English`, `Chinese` + - Default: `English` + - Note: The Chinese analyzer requires significantly more time to build the index due to the complexity of Chinese text segmentation. Consider using it only when Chinese text search is a primary requirement. + +- `case_sensitive`: Determines whether the full-text index is case-sensitive + - Supported values: `true`, `false` + - Default: `false` + - Note: Setting to `true` may slightly improve performance for case-sensitive queries, but will degrade performance for case-insensitive queries. This setting does not affect the results of `matches_term` queries. + +- `backend`: Sets the backend for the full-text index + - Supported values: `bloom`, `tantivy` + - Default: `bloom` + +- `granularity`: (For `bloom` backend) The size of data chunks covered by each filter. A smaller granularity improves filtering but increases index size. + - Supported values: positive integer + - Default: `10240` + +- `false_positive_rate`: (For `bloom` backend) The probability of misidentifying a block. A lower rate improves accuracy (better filtering) but increases index size. + - Supported values: float between `0` and `1` + - Default: `0.01` + +#### Backend Selection + +GreptimeDB provides two full-text index backends for efficient log searching: + +1. **Bloom Backend** + - Best for: General-purpose log searching + - Features: + - Uses Bloom filter for efficient filtering + - Lower storage overhead + - Consistent performance across different query patterns + - Limitations: + - Slightly slower for high-selectivity queries + - Storage Cost Example: + - Original data: ~10GB + - Bloom index: ~1GB + +2. 
**Tantivy Backend** + - Best for: High-selectivity queries (e.g., unique values like TraceID) + - Features: + - Uses inverted index for fast exact matching + - Excellent performance for high-selectivity queries + - Limitations: + - Higher storage overhead (close to original data size) + - Slower performance for low-selectivity queries + - Storage Cost Example: + - Original data: ~10GB + - Tantivy index: ~10GB + +#### Performance Comparison + +The following table shows the performance comparison between different query methods (using Bloom as baseline): + +| Query Type | High Selectivity (e.g., TraceID) | Low Selectivity (e.g., "HTTP") | +|------------|----------------------------------|--------------------------------| +| LIKE | 50x slower | 1x | +| Tantivy | 5x faster | 5x slower | +| Bloom | 1x (baseline) | 1x (baseline) | + +Key observations: +- For high-selectivity queries (e.g., unique values), Tantivy provides the best performance +- For low-selectivity queries, Bloom offers more consistent performance +- Bloom has significant storage advantage over Tantivy (1GB vs 10GB in test case) + +#### Examples + +**Creating a Table with Full-Text Index** ```sql +-- Using Bloom backend (recommended for most cases) CREATE TABLE logs ( - message STRING FULLTEXT INDEX WITH(analyzer='English', case_sensitive='true', backend='bloom', granularity=1024, false_positive_rate=0.01), - `level` STRING PRIMARY KEY, - `timestamp` TIMESTAMP TIME INDEX, + timestamp TIMESTAMP(9) TIME INDEX, + message STRING FULLTEXT INDEX WITH ( + backend = 'bloom', + analyzer = 'English', + case_sensitive = 'false' + ) +); + +-- Using Tantivy backend (for high-selectivity queries) +CREATE TABLE logs ( + timestamp TIMESTAMP(9) TIME INDEX, + message STRING FULLTEXT INDEX WITH ( + backend = 'tantivy', + analyzer = 'English', + case_sensitive = 'false' + ) +); +``` + +**Modifying an Existing Table** + +```sql +-- Enable full-text index on an existing column +ALTER TABLE monitor +MODIFY COLUMN load_15 +SET FULLTEXT INDEX WITH ( + analyzer = 'English', + case_sensitive = 'false', + backend = 'bloom' +); + +-- Change full-text index configuration +ALTER TABLE logs +MODIFY COLUMN message +SET FULLTEXT INDEX WITH ( + analyzer = 'English', + case_sensitive = 'false', + backend = 'tantivy' ); ``` @@ -118,9 +218,7 @@ Fulltext index usually comes with following drawbacks: - Increased flush and compaction latency as each text document needs to be tokenized and indexed - May not be optimal for simple prefix or suffix matching operations -Consider using fulltext index only when you need advanced text search capabilities and flexible query patterns. - -For more detailed information about fulltext index configuration and backend selection, please refer to the [Full-Text Index Configuration](/user-guide/logs/fulltext-index-config) guide. +Consider using full-text index only when you need advanced text search capabilities and flexible query patterns. ## Modify indexes diff --git a/docs/user-guide/migrate-to-greptimedb/migrate-from-clickhouse.md b/docs/user-guide/migrate-to-greptimedb/migrate-from-clickhouse.md index b926db148..4cf6a8892 100644 --- a/docs/user-guide/migrate-to-greptimedb/migrate-from-clickhouse.md +++ b/docs/user-guide/migrate-to-greptimedb/migrate-from-clickhouse.md @@ -117,7 +117,7 @@ CREATE TABLE logs ( **Notes:** - `host` and `service` serve as common query filters and are included in the primary key to optimize filtering. 
If there are very many hosts, you might not want to include `host` in the primary key but instead create a skip index. -- `log_message` is treated as raw content with a full-text index created. If you want the full-text index to take effect during queries, you also need to adjust your SQL query syntax. Please refer to [the log query documentation](/user-guide/logs/query-logs.md) for details +- `log_message` is treated as raw content with a full-text index created. If you want the full-text index to take effect during queries, you also need to adjust your SQL query syntax. Please refer to [the log query documentation](/user-guide/logs/fulltext-search.md) for details - Since `trace_id` and `span_id` are mostly high-cardinality fields, it is not recommended to use them in the primary key, but skip indexes have been added. --- @@ -228,7 +228,7 @@ Alternatively, you can convert the CSV to standard INSERT statements for batch i ## Frequently Asked Questions and Optimization Tips ### What if SQL/types are incompatible? - Before migration, audit all query SQL and rewrite or translate as necessary, referring to the [official documentation](/user-guide/query-data/sql.md) (especially for [log query](/user-guide/logs/query-logs.md)) for any incompatible syntax or data types. + Before migration, audit all query SQL and rewrite or translate as necessary, referring to the [official documentation](/user-guide/query-data/sql.md) (especially for [log query](/user-guide/logs/fulltext-search.md)) for any incompatible syntax or data types. ### How do I efficiently import very large datasets in batches? For large tables or full historical data, export and import by partition or shard as appropriate. Monitor write speed and import progress closely. diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/faq-and-others/faq.md b/i18n/zh/docusaurus-plugin-content-docs/current/faq-and-others/faq.md index 14bb67c48..a2afd9112 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/faq-and-others/faq.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/faq-and-others/faq.md @@ -210,7 +210,7 @@ GreptimeDB 提供多种灾备策略以满足不同的可用性需求: **实时处理**: - **[Flow Engine](/user-guide/flow-computation/overview.md)**:实时流数据处理系统,对流式数据进行连续增量计算,自动更新结果表 -- **[Pipeline](/user-guide/logs/pipeline-config.md)**:实时数据解析转换机制,通过可配置处理器对各种入库数据进行字段提取和数据类型转换 +- **[Pipeline](/user-guide/logs/use-custom-pipelines.md)**:实时数据解析转换机制,通过可配置处理器对各种入库数据进行字段提取和数据类型转换 - **输出表**:持久化处理结果用于分析 ### GreptimeDB 的可扩展性特征是什么? 
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/getting-started/quick-start.md b/i18n/zh/docusaurus-plugin-content-docs/current/getting-started/quick-start.md index a5bbb604d..7087af6d8 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/getting-started/quick-start.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/getting-started/quick-start.md @@ -236,7 +236,7 @@ ORDER BY +---------------------+-------+------------------+-----------+--------------------+ ``` -`@@` 操作符用于[短语搜索](/user-guide/logs/query-logs.md)。 +`@@` 操作符用于[短语搜索](/user-guide/logs/fulltext-search.md)。 ### Range query diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/greptimecloud/integrations/fluent-bit.md b/i18n/zh/docusaurus-plugin-content-docs/current/greptimecloud/integrations/fluent-bit.md index caac4197a..696f4fa77 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/greptimecloud/integrations/fluent-bit.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/greptimecloud/integrations/fluent-bit.md @@ -27,7 +27,7 @@ Fluent Bit 可以配置为使用 HTTP 协议将日志发送到 GreptimeCloud。 http_Passwd ``` -在此示例中,使用 `http` 输出插件将日志发送到 GreptimeCloud。有关更多信息和额外选项,请参阅 [Logs HTTP API](https://docs.greptime.cn/user-guide/logs/write-logs#http-api) 指南。 +在此示例中,使用 `http` 输出插件将日志发送到 GreptimeCloud。有关更多信息和额外选项,请参阅 [Logs HTTP API](https://docs.greptime.cn/reference/pipeline/write-log-api.md#http-api) 指南。 ## Prometheus Remote Write diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/greptimecloud/integrations/kafka.md b/i18n/zh/docusaurus-plugin-content-docs/current/greptimecloud/integrations/kafka.md index 3325559db..20d6c3788 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/greptimecloud/integrations/kafka.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/greptimecloud/integrations/kafka.md @@ -11,7 +11,7 @@ description: 介绍如何使用 Kafka 将数据传输到 GreptimeCloud,并提 ## Logs 以下是一个示例配置。请注意,您需要创建您的 -[Pipeline](https://docs.greptime.cn/user-guide/logs/pipeline-config/) 用于日志 +[Pipeline](https://docs.greptime.cn/user-guide/logs/use-custom-pipelines/) 用于日志 解析。 ```toml diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/reference/pipeline/built-in-pipelines.md b/i18n/zh/docusaurus-plugin-content-docs/current/reference/pipeline/built-in-pipelines.md new file mode 100644 index 000000000..9eac5023e --- /dev/null +++ b/i18n/zh/docusaurus-plugin-content-docs/current/reference/pipeline/built-in-pipelines.md @@ -0,0 +1,172 @@ +--- +keywords: [内置 pipeline, greptime_identity, JSON 日志, 日志处理, 时间索引, pipeline, GreptimeDB] +description: 了解 GreptimeDB 的内置 pipeline,包括用于处理 JSON 日志的 greptime_identity pipeline,具有自动 schema 创建、类型转换和时间索引配置功能。 +--- + +# 内置 Pipeline + +GreptimeDB 提供了常见日志格式的内置 Pipeline,允许你直接使用而无需创建新的 Pipeline。 + +请注意,内置 Pipeline 的名称以 "greptime_" 为前缀,不可编辑。 + +## `greptime_identity` + +`greptime_identity` Pipeline 适用于写入 JSON 日志,并自动为 JSON 日志中的每个字段创建列。 + +- JSON 日志中的第一层级的 key 是表中的列名。 +- 如果相同字段包含不同类型的数据,则会返回错误。 +- 值为 `null` 的字段将被忽略。 +- 如果没有手动指定,一个作为时间索引的额外列 `greptime_timestamp` 将被添加到表中,以指示日志写入的时间。 + +### 类型转换规则 + +- `string` -> `string` +- `number` -> `int64` 或 `float64` +- `boolean` -> `bool` +- `null` -> 忽略 +- `array` -> `json` +- `object` -> `json` + +例如,如果我们有以下 JSON 数据: + +```json +[ + {"name": "Alice", "age": 20, "is_student": true, "score": 90.5,"object": {"a":1,"b":2}}, + {"age": 21, "is_student": false, "score": 85.5, "company": "A" ,"whatever": null}, + {"name": "Charlie", "age": 22, "is_student": true, "score": 95.5,"array":[1,2,3]} +] +``` + +我们将合并每个批次的行结构以获得最终 schema。表 schema 如下所示: + +```sql +mysql> desc 
pipeline_logs; ++--------------------+---------------------+------+------+---------+---------------+ +| Column | Type | Key | Null | Default | Semantic Type | ++--------------------+---------------------+------+------+---------+---------------+ +| age | Int64 | | YES | | FIELD | +| is_student | Boolean | | YES | | FIELD | +| name | String | | YES | | FIELD | +| object | Json | | YES | | FIELD | +| score | Float64 | | YES | | FIELD | +| company | String | | YES | | FIELD | +| array | Json | | YES | | FIELD | +| greptime_timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP | ++--------------------+---------------------+------+------+---------+---------------+ +8 rows in set (0.00 sec) +``` + +数据将存储在表中,如下所示: + +```sql +mysql> select * from pipeline_logs; ++------+------------+---------+---------------+-------+---------+---------+----------------------------+ +| age | is_student | name | object | score | company | array | greptime_timestamp | ++------+------------+---------+---------------+-------+---------+---------+----------------------------+ +| 22 | 1 | Charlie | NULL | 95.5 | NULL | [1,2,3] | 2024-10-18 09:35:48.333020 | +| 21 | 0 | NULL | NULL | 85.5 | A | NULL | 2024-10-18 09:35:48.333020 | +| 20 | 1 | Alice | {"a":1,"b":2} | 90.5 | NULL | NULL | 2024-10-18 09:35:48.333020 | ++------+------------+---------+---------------+-------+---------+---------+----------------------------+ +3 rows in set (0.01 sec) +``` + +### 自定义时间索引列 + +每个 GreptimeDB 表中都必须有时间索引列。`greptime_identity` pipeline 不需要额外的 YAML 配置,如果你希望使用写入数据中自带的时间列(而不是日志数据到达服务端的时间戳)作为表的时间索引列,则需要通过参数进行指定。 + +假设这是一份待写入的日志数据: +```JSON +[ + {"action": "login", "ts": 1742814853} +] +``` + +设置如下的 URL 参数来指定自定义时间索引列: +```shell +curl -X "POST" "http://localhost:4000/v1/ingest?db=public&table=pipeline_logs&pipeline_name=greptime_identity&custom_time_index=ts;epoch;s" \ + -H "Content-Type: application/json" \ + -H "Authorization: Basic {{authentication}}" \ + -d $'[{"action": "login", "ts": 1742814853}]' +``` + +取决于数据的格式,`custom_time_index` 参数接受两种格式的配置值: +- Unix 时间戳: `<字段名>;epoch;<精度>` + - 该字段需要是整数或者字符串 + - 精度为这四种选项之一: `s`, `ms`, `us`, or `ns`. 
+- 时间戳字符串: `<字段名>;datestr;<字符串解析格式>` + - 例如输入的时间字段值为 `2025-03-24 19:31:37+08:00`,则对应的字符串解析格式为 `%Y-%m-%d %H:%M:%S%:z` + +通过上述配置,结果表就能正确使用输入字段作为时间索引列 +```sql +DESC pipeline_logs; +``` +```sql ++--------+-----------------+------+------+---------+---------------+ +| Column | Type | Key | Null | Default | Semantic Type | ++--------+-----------------+------+------+---------+---------------+ +| ts | TimestampSecond | PRI | NO | | TIMESTAMP | +| action | String | | YES | | FIELD | ++--------+-----------------+------+------+---------+---------------+ +2 rows in set (0.02 sec) +``` + +假设时间变量名称为 `input_ts`,以下是一些使用 `custom_time_index` 的示例: +- 1742814853: `custom_time_index=input_ts;epoch;s` +- 1752749137000: `custom_time_index=input_ts;epoch;ms` +- "2025-07-17T10:00:00+0800": `custom_time_index=input_ts;datestr;%Y-%m-%dT%H:%M:%S%z` +- "2025-06-27T15:02:23.082253908Z": `custom_time_index=input_ts;datestr;%Y-%m-%dT%H:%M:%S%.9f%#z` + + +### 展开 json 对象 + +如果你希望将 JSON 对象展开为单层结构,可以在请求的 header 中添加 `x-greptime-pipeline-params` 参数,设置 `flatten_json_object` 为 `true`。 + +以下是一个示例请求: + +```shell +curl -X "POST" "http://localhost:4000/v1/ingest?db=&table=&pipeline_name=greptime_identity&version=" \ + -H "Content-Type: application/x-ndjson" \ + -H "Authorization: Basic {{authentication}}" \ + -H "x-greptime-pipeline-params: flatten_json_object=true" \ + -d "$" +``` + +这样,GreptimeDB 将自动将 JSON 对象的每个字段展开为单独的列。比如 + +```JSON +{ + "a": { + "b": { + "c": [1, 2, 3] + } + }, + "d": [ + "foo", + "bar" + ], + "e": { + "f": [7, 8, 9], + "g": { + "h": 123, + "i": "hello", + "j": { + "k": true + } + } + } +} +``` + +将被展开为: + +```json +{ + "a.b.c": [1,2,3], + "d": ["foo","bar"], + "e.f": [7,8,9], + "e.g.h": 123, + "e.g.i": "hello", + "e.g.j.k": true +} +``` + diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/pipeline-config.md b/i18n/zh/docusaurus-plugin-content-docs/current/reference/pipeline/pipeline-config.md similarity index 99% rename from i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/pipeline-config.md rename to i18n/zh/docusaurus-plugin-content-docs/current/reference/pipeline/pipeline-config.md index 3b40d9c7e..ef9f942bb 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/pipeline-config.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/reference/pipeline/pipeline-config.md @@ -1032,7 +1032,7 @@ GreptimeDB 支持以下四种字段的索引类型: #### Fulltext 索引 -通过 `index: fulltext` 指定在哪个列上建立全文索引,该索引可大大提升 [日志搜索](./query-logs.md) 的性能,写法请参考下方的 [Transform 示例](#transform-示例)。 +通过 `index: fulltext` 指定在哪个列上建立全文索引,该索引可大大提升 [日志搜索](/user-guide/logs/fulltext-search.md) 的性能,写法请参考下方的 [Transform 示例](#transform-示例)。 #### Skipping 索引 diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/reference/pipeline/write-log-api.md b/i18n/zh/docusaurus-plugin-content-docs/current/reference/pipeline/write-log-api.md new file mode 100644 index 000000000..0c23d9d4d --- /dev/null +++ b/i18n/zh/docusaurus-plugin-content-docs/current/reference/pipeline/write-log-api.md @@ -0,0 +1,161 @@ +--- +keywords: [日志写入, HTTP 接口, Pipeline 配置, 数据格式, 请求参数] +description: 介绍如何通过 HTTP 接口使用指定的 Pipeline 将日志写入 GreptimeDB,包括请求参数、数据格式和示例。 +--- + +# 写入日志的 API + +在写入日志之前,请先阅读 [Pipeline 配置](/user-guide/logs/use-custom-pipelines.md#上传-pipeline)完成配置的设定和上传。 + +## HTTP API + +你可以使用以下命令通过 HTTP 接口写入日志: + +```shell +curl -X "POST" "http://localhost:4000/v1/ingest?db=&table=&pipeline_name=&version=&skip_error=" \ + -H "Content-Type: application/x-ndjson" \ + -H "Authorization: Basic {{authentication}}" \ + -d "$" +``` + + +### 请求参数 + 
+此接口接受以下参数: + +- `db`:数据库名称。 +- `table`:表名称。 +- `pipeline_name`:[Pipeline](./pipeline-config.md) 名称。 +- `version`:Pipeline 版本号。可选,默认使用最新版本。 +- `skip_error`:写入日志时是否跳过错误。可选,默认为 `false`。当设置为 `true` 时,GreptimeDB 会跳过遇到错误的单条日志项并继续处理剩余的日志,不会因为一条日志项的错误导致整个请求失败。 + +### `Content-Type` 和 Body 数据格式 + +GreptimeDB 使用 `Content-Type` header 来决定如何解码请求体内容。目前我们支持以下两种格式: +- `application/json`: 包括普通的 JSON 格式和 NDJSON 格式。 +- `application/x-ndjson`: 指定 NDJSON 格式,会尝试先分割行再进行解析,可以达到精确的错误检查。 +- `text/plain`: 通过换行符分割的多行日志文本行。 + +#### `application/json` 和 `application/x-ndjson` 格式 + +以下是一份 JSON 格式请求体内容的示例: + +```JSON +[ + {"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""}, + {"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""}, + {"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""}, + {"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""} +] +``` + +请注意整个 JSON 是一个数组(包含多行日志)。每个 JSON 对象代表即将要被 Pipeline 引擎处理的一行日志。 + +JSON 对象中的 key 名,也就是这里的 `message`,会被用作 Pipeline processor 处理时的 field 名称。比如: + +```yaml +processors: + - dissect: + fields: + # `message` 是 JSON 对象中的 key 名 + - message + patterns: + - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"' + ignore_missing: true + +# pipeline 文件的剩余部分在这里省略 +``` + +我们也可以将这个请求体内容改写成 NDJSON 的格式,如下所示: + +```JSON +{"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""} +{"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""} +{"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""} +{"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""} +``` + +注意到最外层的数组符被消去了,现在每个 JSON 对象通过换行符分割而不是 `,`。 + +#### `text/plain` 格式 + +纯文本日志在整个生态系统中被广泛应用。GreptimeDB 同样支持日志数据以 `text/plain` 格式进行输入,使得我们可以直接从日志产生源进行写入。 + +以下是一份和上述样例请求体内容等价的文本请求示例: + +```plain +127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" +192.168.1.1 - - [25/May/2024:20:17:37 +0000] "POST /api/login HTTP/1.1" 200 1784 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36" +10.0.0.1 - - [25/May/2024:20:18:37 +0000] "GET /images/logo.png HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; 
Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0" +172.16.0.1 - - [25/May/2024:20:19:37 +0000] "GET /contact HTTP/1.1" 404 162 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1" +``` + +仅需要将 `Content-Type` header 设置成 `text/plain`,即可将纯文本请求发送到 GreptimeDB。 + +主要注意的是,和 JSON 格式自带 key 名可以被 Pipeline processor 识别和处理不同,`text/plain` 格式直接将整行文本输入到 Pipeline engine。在这种情况下我们可以使用 `message` 来指代整行输入文本,例如: + +```yaml +processors: + - dissect: + fields: + # 使用 `message` 作为 field 名称 + - message + patterns: + - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"' + ignore_missing: true + +# pipeline 文件的剩余部分在这里省略 +``` + +对于 `text/plain` 格式的输入,推荐首先使用 `dissect` 或者 `regex` processor 将整行文本分割成不同的字段,以便进行后续的处理。 + +## 设置表选项 + +写入日志的表选项需要在 pipeline 中配置。 +从 `v0.15` 开始,pipeline 引擎可以识别特定的变量名称,并且通过这些变量对应的值设置相应的建表选项。 +通过与 `vrl` 处理器的结合,现在可以非常轻易地通过输入的数据在 pipeline 的执行过程中设置建表选项。 + +以下是支持的表选项变量名: +- `greptime_auto_create_table` +- `greptime_ttl` +- `greptime_append_mode` +- `greptime_merge_mode` +- `greptime_physical_table` +- `greptime_skip_wal` + +请前往[表选项](/reference/sql/create.md#表选项)文档了解每一个选项的详细含义。 + +以下是 pipeline 特有的变量: +- `greptime_table_suffix`: 在给定的目标表后增加后缀 + +以如下 pipeline 文件为例 +```YAML +processors: + - date: + field: time + formats: + - "%Y-%m-%d %H:%M:%S%.3f" + ignore_missing: true + - vrl: + source: | + .greptime_table_suffix, err = "_" + .id + .greptime_table_ttl = "1d" + . +``` + +在这份 vrl 脚本中,我们将表后缀变量设置为输入字段中的 `id`(通过一个下划线连接),然后将 ttl 设置成 `1d`。 +然后我们使用如下数据执行写入。 + +```JSON +{ + "id": "2436", + "time": "2024-05-25 20:16:37.217" +} +``` + +假设给定的表名为 `d_table`,那么最终的表名就会按照预期被设置成 `d_table_2436`。这个表同样的 ttl 同样会被设置成 1 天。 + +## 示例 + +请参考[快速开始](/user-guide/logs/quick-start.md)和[使用自定义 pipeline 中的](/user-guide/logs/use-custom-pipelines.md#使用-pipeline-写入日志)写入日志部分的文档。 + diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/alter.md b/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/alter.md index 6d2e811bf..c9a857a33 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/alter.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/alter.md @@ -194,7 +194,7 @@ ALTER TABLE monitor MODIFY COLUMN load_15 SET FULLTEXT INDEX WITH (analyzer = 'E - `granularity`:(适用于 `bloom` 后端)每个过滤器覆盖的数据块大小。粒度越小,过滤效果越好,但索引大小会增加。默认为 `10240`。 - `false_positive_rate`:(适用于 `bloom` 后端)错误识别块的概率。该值越低,准确性越高(过滤效果越好),但索引大小会增加。该值为介于 `0` 和 `1` 之间的浮点数。默认为 `0.01`。 -更多关于全文索引配置和性能对比的信息,请参考[全文索引配置指南](/user-guide/logs/fulltext-index-config.md)。 +更多关于全文索引配置和性能对比的信息,请参考[全文索引配置指南](/user-guide/manage-data/data-index.md#全文索引)。 与 `CREATE TABLE` 一样,可以不带 `WITH` 选项,全部使用默认值。 diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/functions/overview.md b/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/functions/overview.md index 838b282a2..4d1cd63be 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/functions/overview.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/functions/overview.md @@ -49,7 +49,7 @@ DataFusion [字符串函数](./df-functions.md#string-functions)。 GreptimeDB 提供: * `matches_term(expression, term)` 用于全文检索。 -阅读[查询日志](/user-guide/logs/query-logs.md)文档获取更多详情。 +阅读[查询日志](/user-guide/logs/fulltext-search.md)文档获取更多详情。 ### 数学函数 diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/where.md b/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/where.md index 
22dc76456..0293da10e 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/where.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/where.md @@ -75,4 +75,4 @@ SELECT * FROM go_info WHERE instance LIKE 'localhost:____'; ``` -有关在日志中搜索关键字,请阅读[查询日志](/user-guide/logs/query-logs.md)。 \ No newline at end of file +有关在日志中搜索关键字,请阅读[查询日志](/user-guide/logs/fulltext-search.md)。 diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/ingest-data/for-observability/fluent-bit.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/ingest-data/for-observability/fluent-bit.md index 250895e88..ee0d29f82 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/ingest-data/for-observability/fluent-bit.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/ingest-data/for-observability/fluent-bit.md @@ -42,7 +42,7 @@ description: 将 GreptimeDB 与 Fluent bit 集成以实现 Prometheus Remote Wri - `table` 是您要写入日志的表名称。 - `pipeline_name` 是您要用于处理日志的管道名称。 -本示例中,使用的是 [Logs Http API](/user-guide/logs/write-logs.md#http-api) 接口。如需更多信息,请参阅 [写入日志](/user-guide/logs/write-logs.md) 文档。 +本示例中,使用的是 [Logs Http API](/reference/pipeline/write-log-api.md#http-api) 接口。如需更多信息,请参阅 [写入日志](/user-guide/logs/use-custom-pipelines.md#使用-pipeline-写入日志) 文档。 ## OpenTelemetry diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/ingest-data/for-observability/kafka.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/ingest-data/for-observability/kafka.md index 1cd1a3667..0940c7849 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/ingest-data/for-observability/kafka.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/ingest-data/for-observability/kafka.md @@ -127,7 +127,7 @@ pipeline_name = "greptime_identity" #### 创建 pipeline 要创建自定义 pipeline, -请参阅[创建 pipeline](/user-guide/logs/quick-start.md#创建-pipeline) 和 [pipeline 配置](/user-guide/logs/pipeline-config.md)文档获取详细说明。 +请参阅[使用自定义 pipeline](/user-guide/logs/use-custom-pipelines.md)文档获取详细说明。 #### 写入数据 diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/ingest-data/for-observability/loki.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/ingest-data/for-observability/loki.md index 29bc29dc3..dd398e932 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/ingest-data/for-observability/loki.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/ingest-data/for-observability/loki.md @@ -184,7 +184,7 @@ transform: ``` pipeline 的配置相对直观: 使用 `vrl` 处理器将日志行解析为 JSON 对象,然后将其中的字段提取到根目录。 -`log_time` 在 transform 部分中被指定为时间索引,其他字段将由 pipeline 引擎自动推导,详见 [pipeline version 2](/user-guide/logs/pipeline-config.md#版本-2-中的-transform)。 +`log_time` 在 transform 部分中被指定为时间索引,其他字段将由 pipeline 引擎自动推导,详见 [pipeline version 2](/reference/pipeline/pipeline-config.md#版本-2-中的-transform)。 请注意,输入字段名为 `loki_line`,它包含来自 Loki 的原始日志行。 @@ -264,4 +264,4 @@ log_source: application 此输出演示了 pipeline 引擎已成功解析原始 JSON 日志行,并将结构化数据提取到单独的列中。 -有关 pipeline 配置和功能的更多详细信息,请参考[pipeline 文档](/user-guide/logs/pipeline-config.md)。 +有关 pipeline 配置和功能的更多详细信息,请参考[pipeline 文档](/reference/pipeline/pipeline-config.md)。 diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/ingest-data/for-observability/prometheus.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/ingest-data/for-observability/prometheus.md index 44a521c50..466596354 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/ingest-data/for-observability/prometheus.md +++ 
b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/ingest-data/for-observability/prometheus.md @@ -289,7 +289,7 @@ mysql> select * from `go_memstats_mcache_inuse_bytes`; 2 rows in set (0.01 sec) ``` -更多配置详情请参考 [pipeline 相关文档](/user-guide/logs/pipeline-config.md)。 +更多配置详情请参考 [pipeline 相关文档](/reference/pipeline/pipeline-config.md)。 ## 性能优化 diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/fulltext-index-config.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/fulltext-index-config.md deleted file mode 100644 index e1448a274..000000000 --- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/fulltext-index-config.md +++ /dev/null @@ -1,131 +0,0 @@ ---- -keywords: [全文索引, tantivy, bloom, 分析器, 大小写敏感, 配置] -description: GreptimeDB 全文索引配置的完整指南,包括后端选择和其他配置选项。 ---- - -# 全文索引配置 - -本文档提供了 GreptimeDB 全文索引配置的完整指南,包括后端选择和其他配置选项。 - -## 概述 - -GreptimeDB 提供全文索引功能以加速文本搜索操作。您可以在创建或修改表时配置全文索引,并提供各种选项以针对不同用例进行优化。有关 GreptimeDB 中不同类型索引(包括倒排索引和跳数索引)的概述,请参考[数据索引](/user-guide/manage-data/data-index)指南。 - -## 配置选项 - -在创建或修改全文索引时,您可以使用 `FULLTEXT INDEX WITH` 指定以下选项: - -### 基本选项 - -- `analyzer`:设置全文索引的语言分析器 - - 支持的值:`English`、`Chinese` - - 默认值:`English` - - 注意:由于中文文本分词的复杂性,中文分析器构建索引需要的时间显著更长。建议仅在中文文本搜索是主要需求时使用。 - -- `case_sensitive`:决定全文索引是否区分大小写 - - 支持的值:`true`、`false` - - 默认值:`false` - - 注意:设置为 `true` 可能会略微提高区分大小写查询的性能,但会降低不区分大小写查询的性能。此设置不会影响 `matches_term` 查询的结果。 - -- `backend`:设置全文索引的后端实现 - - 支持的值:`bloom`、`tantivy` - - 默认值:`bloom` - -- `granularity`:(适用于 `bloom` 后端)每个过滤器覆盖的数据块大小。粒度越小,过滤效果越好,但索引大小会增加。 - - 支持的值:正整数 - - 默认值:`10240` - -- `false_positive_rate`:(适用于 `bloom` 后端)错误识别块的概率。该值越低,准确性越高(过滤效果越好),但索引大小会增加。该值为介于 `0` 和 `1` 之间的浮点数。 - - 支持的值:介于 `0` 和 `1` 之间的浮点数 - - 默认值:`0.01` - -### 后端选择 - -GreptimeDB 提供两种全文索引后端用于高效日志搜索: - -1. **Bloom 后端** - - 最适合:通用日志搜索 - - 特点: - - 使用 Bloom 过滤器进行高效过滤 - - 存储开销较低 - - 在不同查询模式下性能稳定 - - 限制: - - 对于高选择性查询稍慢 - - 存储成本示例: - - 原始数据:约 10GB - - Bloom 索引:约 1GB - -2. 
**Tantivy 后端** - - 最适合:高选择性查询(如 TraceID 等唯一值) - - 特点: - - 使用倒排索引实现快速精确匹配 - - 对高选择性查询性能优异 - - 限制: - - 存储开销较高(接近原始数据大小) - - 对低选择性查询性能较慢 - - 存储成本示例: - - 原始数据:约 10GB - - Tantivy 索引:约 10GB - -### 性能对比 - -下表显示了不同查询方法之间的性能对比(以 Bloom 为基准): - -| 查询类型 | 高选择性(如 TraceID) | 低选择性(如 "HTTP") | -|------------|----------------------------------|--------------------------------| -| LIKE | 慢 50 倍 | 1 倍 | -| Tantivy | 快 5 倍 | 慢 5 倍 | -| Bloom | 1 倍(基准) | 1 倍(基准) | - -主要观察结果: -- 对于高选择性查询(如唯一值),Tantivy 提供最佳性能 -- 对于低选择性查询,Bloom 提供更稳定的性能 -- Bloom 在存储方面比 Tantivy 有明显优势(测试案例中为 1GB vs 10GB) - -## 配置示例 - -### 创建带全文索引的表 - -```sql --- 使用 Bloom 后端(大多数情况推荐) -CREATE TABLE logs ( - timestamp TIMESTAMP(9) TIME INDEX, - message STRING FULLTEXT INDEX WITH ( - backend = 'bloom', - analyzer = 'English', - case_sensitive = 'false' - ) -); - --- 使用 Tantivy 后端(用于高选择性查询) -CREATE TABLE logs ( - timestamp TIMESTAMP(9) TIME INDEX, - message STRING FULLTEXT INDEX WITH ( - backend = 'tantivy', - analyzer = 'English', - case_sensitive = 'false' - ) -); -``` - -### 修改现有表 - -```sql --- 在现有列上启用全文索引 -ALTER TABLE monitor -MODIFY COLUMN load_15 -SET FULLTEXT INDEX WITH ( - analyzer = 'English', - case_sensitive = 'false', - backend = 'bloom' -); - --- 更改全文索引配置 -ALTER TABLE logs -MODIFY COLUMN message -SET FULLTEXT INDEX WITH ( - analyzer = 'English', - case_sensitive = 'false', - backend = 'tantivy' -); -``` diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/query-logs.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/fulltext-search.md similarity index 90% rename from i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/query-logs.md rename to i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/fulltext-search.md index 24ea1454d..0ed21f74f 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/query-logs.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/fulltext-search.md @@ -3,17 +3,15 @@ keywords: [日志查询, GreptimeDB 查询语言, matches_term, 模式匹配, description: 详细介绍如何利用 GreptimeDB 的查询语言对日志数据进行高效搜索和分析,包括使用 matches_term 函数进行精确匹配。 --- -# 日志查询 +# 全文搜索 本文档详细介绍如何利用 GreptimeDB 的查询语言对日志数据进行高效搜索和分析。 -## 概述 - -GreptimeDB 支持通过 SQL 语句灵活查询数据。本节将介绍特定的搜索功能和查询语句,帮助您提升日志查询效率。 +GreptimeDB 支持通过 SQL 语句灵活查询数据。本节将介绍特定的搜索功能和查询语句,帮助你提升日志查询效率。 ## 使用 `matches_term` 函数进行精确匹配 -在 SQL 查询中,您可以使用 `matches_term` 函数执行精确的词语/短语匹配,这在日志分析中尤其实用。`matches_term` 函数支持对 `String` 类型列进行精确匹配。您也可以使用 `@@` 操作符作为 `matches_term` 的简写形式。下面是一个典型示例: +在 SQL 查询中,你可以使用 `matches_term` 函数执行精确的词语/短语匹配,这在日志分析中尤其实用。`matches_term` 函数支持对 `String` 类型列进行精确匹配。你也可以使用 `@@` 操作符作为 `matches_term` 的简写形式。下面是一个典型示例: ```sql -- 使用 matches_term 函数 @@ -45,7 +43,7 @@ SELECT * FROM logs WHERE matches_term(message, 'error'); SELECT * FROM logs WHERE message @@ 'error'; ``` -此查询将返回所有 `message` 列中包含完整词语 "error" 的记录。该函数确保您不会得到部分匹配或词语内的匹配。 +此查询将返回所有 `message` 列中包含完整词语 "error" 的记录。该函数确保你不会得到部分匹配或词语内的匹配。 匹配和不匹配的示例: - ✅ "An error occurred!" 
- 匹配,因为 "error" 是一个完整词语 @@ -57,7 +55,7 @@ SELECT * FROM logs WHERE message @@ 'error'; ### 多关键词搜索 -您可以使用 `OR` 运算符组合多个 `matches_term` 条件来搜索包含多个关键词中任意一个的日志。当您想要查找可能包含不同错误变体或不同类型问题的日志时,这很有用。 +你可以使用 `OR` 运算符组合多个 `matches_term` 条件来搜索包含多个关键词中任意一个的日志。当你想要查找可能包含不同错误变体或不同类型问题的日志时,这很有用。 ```sql -- 使用 matches_term 函数 @@ -78,7 +76,7 @@ SELECT * FROM logs WHERE message @@ 'critical' OR message @@ 'error'; ### 排除条件搜索 -您可以使用 `NOT` 运算符与 `matches_term` 结合来从搜索结果中排除某些词语。当您想要查找包含一个词语但不包含另一个词语的日志时,这很有用。 +你可以使用 `NOT` 运算符与 `matches_term` 结合来从搜索结果中排除某些词语。当你想要查找包含一个词语但不包含另一个词语的日志时,这很有用。 ```sql -- 使用 matches_term 函数 @@ -97,7 +95,7 @@ SELECT * FROM logs WHERE message @@ 'error' AND NOT message @@ 'critical'; ### 多条件必要搜索 -您可以使用 `AND` 运算符要求日志消息中必须存在多个词语。这对于查找包含特定词语组合的日志很有用。 +你可以使用 `AND` 运算符要求日志消息中必须存在多个词语。这对于查找包含特定词语组合的日志很有用。 ```sql -- 使用 matches_term 函数 @@ -136,7 +134,7 @@ SELECT * FROM logs WHERE message @@ 'system failure'; ### 不区分大小写匹配 -虽然 `matches_term` 默认区分大小写,但您可以通过在匹配前将文本转换为小写来实现不区分大小写的匹配。 +虽然 `matches_term` 默认区分大小写,但你可以通过在匹配前将文本转换为小写来实现不区分大小写的匹配。 ```sql -- 使用 matches_term 函数 diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/manage-pipelines.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/manage-pipelines.md index d2bb9abba..db41a932c 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/manage-pipelines.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/manage-pipelines.md @@ -8,14 +8,14 @@ description: 介绍如何在 GreptimeDB 中管理 Pipeline,包括创建、删 在 GreptimeDB 中,每个 `pipeline` 是一个数据处理单元集合,用于解析和转换写入的日志内容。本文档旨在指导你如何创建和删除 Pipeline,以便高效地管理日志数据的处理流程。 -有关 Pipeline 的具体配置,请阅读 [Pipeline 配置](pipeline-config.md)。 +有关 Pipeline 的具体配置,请阅读 [Pipeline 配置](/reference/pipeline/pipeline-config.md)。 ## 鉴权 在使用 HTTP API 进行 Pipeline 管理时,你需要提供有效的鉴权信息。 请参考[鉴权](/user-guide/protocols/http.md#鉴权)文档了解详细信息。 -## 创建 Pipeline +## 上传 Pipeline GreptimeDB 提供了专用的 HTTP 接口用于创建 Pipeline。 假设你已经准备好了一个 Pipeline 配置文件 pipeline.yaml,使用以下命令上传配置文件,其中 `test` 是你指定的 Pipeline 的名称: @@ -29,6 +29,22 @@ curl -X "POST" "http://localhost:4000/v1/pipelines/test" \ 你可以在所有 Database 中使用创建的 Pipeline。 +## Pipeline 版本 + +你可以使用相同的名称上传多个版本的 pipeline。 +每次你使用现有名称上传 pipeline 时,都会自动创建一个新版本。 +你可以在[写入日志](/reference/pipeline/write-log-api.md#http-api)、[查询](#查询-pipeline)或[删除](#删除-pipeline) pipeline 时指定要使用的版本。 +如果未指定版本,默认使用最后上传的版本。 + +成功上传 pipeline 后,响应将包含版本信息: + +```json +{"name":"nginx_pipeline","version":"2024-06-27 12:02:34.257312110Z"} +``` + +版本是 UTC 格式的时间戳,表示 pipeline 的创建时间。 +此时间戳作为每个 pipeline 版本的唯一标识符。 + ## 删除 Pipeline 可以使用以下 HTTP 接口删除 Pipeline: diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/overview.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/overview.md index 6531d9093..e59f8d3a5 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/overview.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/overview.md @@ -1,16 +1,105 @@ --- -keywords: [日志, GreptimeDB, 日志写入, 日志配置, 查询日志] -description: 提供了使用 GreptimeDB 日志服务的各种指南,包括快速开始、Pipeline 配置、管理 Pipeline、写入日志、查询日志和全文索引配置。 +keywords: [log service, quick start, pipeline configuration, manage pipelines, query logs] +description: GreptimeDB 日志管理功能的综合指南,包括日志收集架构、Pipeline 处理、与 Vector 和 Kafka 等流行的日志收集器的集成以及使用全文搜索的高级查询。 --- # 日志 -本节内容将涵盖 GreptimeDB 针对日志的功能介绍,从基本的写入查询,到高级功能,诸如 -数据变换、全文索引等。 +GreptimeDB 提供了专为满足现代可观测需求而设计的日志管理解决方案, +它可以和主流日志收集器无缝集成, +提供了灵活的使用 pipeline 转换日志的功能 +和包括全文搜索的查询功能。 + +核心功能点包括: + +- **统一存储**:将日志与指标和 Trace 数据一起存储在单个数据库中 +- **Pipeline 处理数据**:使用可自定义的 
pipeline 转换和丰富原始日志,支持多种日志收集器和格式 +- **高级查询**:基于 SQL 的分析,并具有全文搜索功能 +- **实时数据处理**:实时处理和查询日志以进行监控和警报 + +## 日志收集流程 + +![log-collection-flow](/log-collection-flow.drawio.svg) + +上图展示了日志收集的整体架构, +它包括四阶段流程:日志源、日志收集器、Pipeline 处理和在存储到 GreptimeDB 中。 + +### 日志源 + +日志源是基础设施中产生日志数据的基础层。 +GreptimeDB 支持从各种源写入数据以满足全面的可观测性需求: + +- **应用程序**:来自微服务架构、Web 应用程序、移动应用程序和自定义软件组件的应用程序级日志 +- **IoT 设备**:来自物联网生态系统的设备日志、传感器事件日志和运行状态日志 +- **基础设施**:云平台日志、容器编排日志(Kubernetes、Docker)、负载均衡器日志以及网络基础设施组件日志 +- **系统组件**:操作系统日志、内核事件、系统守护进程日志以及硬件监控日志 +- **自定义源**:特定于你环境或应用程序的任何其他日志源 + +### 日志收集器 + +日志收集器负责高效地从各种源收集日志数据并转发到存储后端。 +GreptimeDB 可以与行业标准的日志收集器无缝集成, +包括 Vector、Fluent Bit、Apache Kafka、OpenTelemetry Collector 等。 + +GreptimeDB 作为这些收集器的 sink 后端, +提供强大的数据写入能力。 +在写入过程中,GreptimeDB 的 pipeline 系统能够实时转换和丰富日志数据, +确保在存储前获得最佳的结构和质量。 + +### Pipeline 处理 + +GreptimeDB 的 pipeline 机制将原始日志转换为结构化、可查询的数据: + +- **解析**:从非结构化日志消息中提取结构化数据 +- **转换**:使用额外的上下文和元数据丰富日志 +- **索引**:配置必要的索引以提升查询性能,例如全文索引、时间索引等 + +### 存储日志到 GreptimeDB + +通过 pipeline 处理后,日志存储在 GreptimeDB 中,支持灵活的分析和可视化: + +- **SQL 查询**:使用熟悉的 SQL 语法分析日志数据 +- **基于时间的分析**:利用时间序列功能进行时间分析 +- **全文搜索**:在日志消息中执行高级文本搜索 +- **实时分析**:实时查询日志进行监控和告警 + +## 快速开始 + +你可以使用内置的 `greptime_identity` pipeline 快速开始日志写入。更多信息请参考[快速开始](./quick-start.md)指南。 + +## 集成到日志收集器 + +GreptimeDB 与各种日志收集器无缝集成,提供全面的日志记录解决方案。集成过程包括以下关键步骤: + +1. **选择合适的日志收集器**:根据你的基础设施要求、数据源和性能需求选择收集器 +2. **分析输出格式**:了解你选择的收集器产生的日志格式和结构 +3. **配置 Pipeline**:在 GreptimeDB 中创建和配置 pipeline 来解析、转换和丰富传入的日志数据 +4. **存储和查询**:在 GreptimeDB 中高效存储处理后的日志,用于实时分析和监控 + +要成功将你的日志收集器与 GreptimeDB 集成,你需要: +- 首先了解 pipeline 在 GreptimeDB 中的工作方式 +- 然后在你的日志收集器中配置 sink 设置,将数据发送到 GreptimeDB + +请参考以下指南获取将 GreptimeDB 集成到日志收集器的详细说明: + +- [Vector](/user-guide/ingest-data/for-observability/vector.md#using-greptimedb_logs-sink-recommended) +- [Kafka](/user-guide/ingest-data/for-observability/kafka.md#logs) +- [Fluent Bit](/user-guide/ingest-data/for-observability/fluent-bit.md#http) +- [OpenTelemetry Collector](/user-guide/ingest-data/for-observability/otel-collector.md) +- [Loki](/user-guide/ingest-data/for-observability/loki.md#using-pipeline-with-loki-push-api) + +## 了解更多关于 Pipeline 的信息 + +- [使用自定义 Pipeline](./use-custom-pipelines.md):解释如何创建和使用自定义 pipeline 进行日志写入。 +- [管理 Pipeline](./manage-pipelines.md):解释如何创建和删除 pipeline。 + +## 查询日志 + +- [全文搜索](./fulltext-search.md):使用 GreptimeDB 查询语言有效搜索和分析日志数据的指南。 + +## 参考 + +- [内置 Pipeline](/reference/pipeline/built-in-pipelines.md):GreptimeDB 为日志写入提供的内置 pipeline 详细信息。 +- [写入日志的 API](/reference/pipeline/write-log-api.md):描述向 GreptimeDB 写入日志的 HTTP API。 +- [Pipeline 配置](/reference/pipeline/pipeline-config.md):提供 GreptimeDB 中 pipeline 各项具体配置的信息。 -- [快速开始](./quick-start.md):介绍了如何快速开始使用 GreptimeDB 日志服务。 -- [Pipeline 配置](./pipeline-config.md):深入介绍 GreptimeDB 中的 Pipeline 的每项具体配置。 -- [管理 Pipeline](./manage-pipelines.md):介绍了如何创建、删除 Pipeline。 -- [配合 Pipeline 写入日志](./write-logs.md): 详细说明了如何结合 Pipeline 机制高效写入日志数据。 -- [查询日志](./query-logs.md):描述了如何使用 GreptimeDB SQL 接口查询日志。 -- [全文索引配置](./fulltext-index-config.md):介绍了如何配置全文索引。 diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/quick-start.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/quick-start.md index 157a24596..c365e6d1d 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/quick-start.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/quick-start.md @@ -1,326 +1,122 @@ --- -keywords: [快速开始, 写入日志, 查询日志, 直接写入, 使用 Pipeline, 创建表, 插入日志, gRPC 协议, JSON 日志, 自定义 Pipeline] -description: 介绍如何快速开始写入和查询日志,包括直接写入日志和使用 Pipeline 
写入日志的方法,以及两者的区别。 +keywords: [logs, log service, pipeline, greptime_identity, quick start, JSON logs] +description: GreptimeDB 日志服务快速入门指南,包括使用内置 greptime_identity pipeline 的基本日志写入和与日志收集器的集成。 --- -# 快速开始 +# 快速入门 -本指南逐步讲解如何在 GreptimeDB 中快速写入和查询日志。 +本指南将引导你完成使用 GreptimeDB 日志服务的基本步骤。 +你将学习如何使用内置的 `greptime_identity` pipeline 写入日志并集成日志收集器。 -GreptimeDB 支持可以将结构化日志消息解析并转换为多列的 Pipeline 机制, -以实现高效的存储和查询。 +GreptimeDB 提供了强大的基于 pipeline 的日志写入系统。 +你可以使用内置的 `greptime_identity` pipeline 快速写入 JSON 格式的日志, +该 pipeline 具有以下特点: -对于非结构化的日志,你可以不使用 Pipeline,直接将日志写入表。 +- 自动处理从 JSON 到表列的字段映射 +- 如果表不存在则自动创建表 +- 灵活支持变化的日志结构 +- 需要最少的配置即可开始使用 -## 使用 Pipeline 写入日志 +## 直接通过 HTTP 写入日志 -使用 pipeline 可以自动将日志消息格式化并转换为多个列,并自动创建和修改表结构。 +GreptimeDB 日志写入最简单的方法是通过使用 `greptime_identity` pipeline 发送 HTTP 请求。 -### 使用内置 Pipeline 写入 JSON 日志 - -GreptimeDB 提供了一个内置 pipeline `greptime_identity` 用于处理 JSON 日志格式。该 pipeline 简化了写入 JSON 日志的过程。 +例如,你可以使用 `curl` 发送带有 JSON 日志数据的 POST 请求: ```shell curl -X POST \ - "http://localhost:4000/v1/ingest?db=public&table=pipeline_logs&pipeline_name=greptime_identity" \ + "http://localhost:4000/v1/ingest?db=public&table=demo_logs&pipeline_name=greptime_identity" \ -H "Content-Type: application/json" \ -H "Authorization: Basic {{authentication}}" \ -d '[ { - "name": "Alice", - "age": 20, - "is_student": true, - "score": 90.5, - "object": { "a": 1, "b": 2 } - }, - { - "age": 21, - "is_student": false, - "score": 85.5, - "company": "A", - "whatever": null + "timestamp": "2024-01-15T10:30:00Z", + "level": "INFO", + "service": "web-server", + "message": "用户登录成功", + "user_id": 12345, + "ip_address": "192.168.1.100" }, { - "name": "Charlie", - "age": 22, - "is_student": true, - "score": 95.5, - "array": [1, 2, 3] + "timestamp": "2024-01-15T10:31:00Z", + "level": "ERROR", + "service": "database", + "message": "连接超时", + "error_code": 500, + "retry_count": 3 } ]' ``` -- [`鉴权`](/user-guide/protocols/http.md#鉴权) HTTP header。 -- `pipeline_name=greptime_identity` 指定了内置 pipeline。 -- `table=pipeline_logs` 指定了目标表。如果表不存在,将自动创建。 -`greptime_identity` pipeline 将自动为 JSON 日志中的每个字段创建列。成功执行命令将返回: - -```json -{"output":[{"affectedrows":3}],"execution_time_ms":9} -``` - -有关 `greptime_identity` pipeline 的更多详细信息,请参阅 [写入日志](write-logs.md#greptime_identity) 文档。 +关键参数包括: -### 使用自定义 Pipeline 写入日志 +- `db=public`:目标数据库名称(你的数据库名称) +- `table=demo_logs`:目标表名称(如果不存在则自动创建) +- `pipeline_name=greptime_identity`:使用 `greptime_identity` pipeline 进行 JSON 处理 +- `Authorization` 头:使用 base64 编码的 `username:password` 进行基本身份验证,请参阅 [HTTP 鉴权指南](/user-guide/protocols/http.md#authentication) -自定义 pipeline 允许你解析结构的日志消息并将其转换为多列,并自动创建表。 - -#### 创建 Pipeline - -GreptimeDB 提供了一个专用的 HTTP 接口来创建 pipeline。方法如下: - -首先,创建一个 pipeline 文件,例如 `pipeline.yaml`。 - -```yaml -version: 2 -processors: - - dissect: - fields: - - message - patterns: - - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"' - ignore_missing: true - - date: - fields: - - timestamp - formats: - - "%d/%b/%Y:%H:%M:%S %z" - - select: - type: exclude - fields: - - message - -transform: - - fields: - - ip_address - type: string - index: inverted - tag: true - - fields: - - status_code - type: int32 - index: inverted - tag: true - - fields: - - request_line - - user_agent - type: string - index: fulltext - - fields: - - response_size - type: int32 - - fields: - - timestamp - type: time - index: timestamp -``` - -该 pipeline 使用指定的模式拆分 `message` 字段以提取 `ip_address`、`timestamp`、`http_method`、`request_line`、`status_code`、`response_size` 和 `user_agent`。 
-然后,它使用格式 `%d/%b/%Y:%H:%M:%S %z` 解析 `timestamp` 字段,将其转换为数据库可以理解的正确时间戳格式。 -最后,它将每个字段转换为适当的数据类型并相应地建立索引。 -注意到在 pipeline 的最开始我们使用了版本 2 格式,详情请参考[这个文档](./pipeline-config.md#版本-2-中的-transform)。 -简而言之,在版本 2 下 pipeline 引擎会自动查找所有没有在 transform 模块中指定的字段,并使用默认的数据类型将他们持久化到数据库中。 -你可以在[后续章节](#使用-pipeline-与直接写入非结构化日志的区别)中看到,虽然 `http_method` 没有在 transform 模块中被指定,但它依然被写入到了数据库中。 -另外,`select` 处理器被用于过滤原始的 `message` 字段。 -需要注意的是,`request_line` 和 `user_agent` 字段被索引为 `fulltext` 以优化全文搜索查询,且表中必须有一个由 `timestamp` 指定的时间索引列。 - -执行以下命令上传配置文件: - -```shell -curl -X "POST" \ - "http://localhost:4000/v1/pipelines/nginx_pipeline" \ - -H 'Authorization: Basic {{authentication}}' \ - -F "file=@pipeline.yaml" -``` - -成功执行此命令后,将创建一个名为 `nginx_pipeline` 的 pipeline,返回的结果如下: +成功的请求返回: ```json -{"name":"nginx_pipeline","version":"2024-06-27 12:02:34.257312110Z"}. +{ + "output": [{"affectedrows": 2}], + "execution_time_ms": 15 +} ``` -你可以为同一 pipeline 名称创建多个版本。 -所有 pipeline 都存储在 `greptime_private.pipelines` 表中。 -请参阅[查询 Pipelines](manage-pipelines.md#查询-pipeline)以查看表中的 pipeline 数据。 - -#### 写入日志 - -以下示例将日志写入 `custom_pipeline_logs` 表,并使用 `nginx_pipeline` pipeline 格式化和转换日志消息。 - -```shell -curl -X POST \ - "http://localhost:4000/v1/ingest?db=public&table=custom_pipeline_logs&pipeline_name=nginx_pipeline" \ - -H "Content-Type: application/json" \ - -H "Authorization: Basic {{authentication}}" \ - -d '[ - { - "message": "127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\"" - }, - { - "message": "192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\"" - }, - { - "message": "10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\"" - }, - { - "message": "172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\"" - } - ]' -``` - -如果命令执行成功,你将看到以下输出: - -```json -{"output":[{"affectedrows":4}],"execution_time_ms":79} -``` - -## 直接写入非结构化的日志 - -如果你的日志消息是非结构化文本, -你可以将其直接写入数据库。 -但是这种方法限制了数据库执行高性能分析的能力。 - -### 创建表 - -你需要在插入日志之前创建一个表来存储日志。 -使用以下 SQL 语句创建一个名为 `origin_logs` 的表: - -* `message` 列上的 `FULLTEXT INDEX` 可优化文本搜索查询 -* 将 `append_mode` 设置为 `true` 表示以附加行的方式写入数据,不对历史数据做覆盖。 +成功写入日志后, +相应的表 `demo_logs` 会根据 JSON 字段自动创建相应的列,其 schema 如下: ```sql -CREATE TABLE `origin_logs` ( - `message` STRING FULLTEXT INDEX, - `time` TIMESTAMP TIME INDEX -) WITH ( - append_mode = 'true' -); -``` - -### 插入日志 - -#### 使用 SQL 协议写入 - -使用 `INSERT` 语句将日志插入表中。 - -```sql -INSERT INTO origin_logs (message, time) VALUES -('127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"', '2024-05-25 20:16:37.217'), -('192.168.1.1 - - [25/May/2024:20:17:37 +0000] "POST /api/login HTTP/1.1" 200 1784 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"', '2024-05-25 20:17:37.217'), -('10.0.0.1 - - [25/May/2024:20:18:37 +0000] "GET /images/logo.png HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux 
x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"', '2024-05-25 20:18:37.217'), -('172.16.0.1 - - [25/May/2024:20:19:37 +0000] "GET /contact HTTP/1.1" 404 162 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1"', '2024-05-25 20:19:37.217'); -``` - -上述 SQL 将整个日志文本插入到一个列中,除此之外,你必须为每条日志添加一个额外的时间戳。 - -#### 使用 gRPC 协议写入 - -你也可以使用 gRPC 协议写入日志,这是一个更高效的方法。 - -请参阅[使用 gRPC 写入数据](/user-guide/ingest-data/for-iot/grpc-sdks/overview.md)以了解如何使用 gRPC 协议写入日志。 - - -## 使用 Pipeline 与直接写入非结构化日志的区别 - -在上述示例中, -使用 pipeline 写入日志的方式自动创建了表 `custom_pipeline_logs`, -直接写入日志的方式创建了表 `origin_logs`, -让我们来探讨这两个表之间的区别。 - -```sql -DESC custom_pipeline_logs; -``` - -```sql -+---------------+---------------------+------+------+---------+---------------+ -| Column | Type | Key | Null | Default | Semantic Type | -+---------------+---------------------+------+------+---------+---------------+ -| ip_address | String | PRI | YES | | TAG | -| status_code | Int32 | PRI | YES | | TAG | -| request_line | String | | YES | | FIELD | -| user_agent | String | | YES | | FIELD | -| response_size | Int32 | | YES | | FIELD | -| timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP | -| http_method | String | | YES | | FIELD | -+---------------+---------------------+------+------+---------+---------------+ -7 rows in set (0.00 sec) -``` - -```sql -DESC origin_logs; -``` - -```sql -+---------+----------------------+------+------+---------+---------------+ -| Column | Type | Key | Null | Default | Semantic Type | -+---------+----------------------+------+------+---------+---------------+ -| message | String | | YES | | FIELD | -| time | TimestampMillisecond | PRI | NO | | TIMESTAMP | -+---------+----------------------+------+------+---------+---------------+ -``` - -从表结构中可以看到,`origin_logs` 表只有两列,整个日志消息存储在一个列中。 -而 `custom_pipeline_logs` 表将日志消息存储在多个列中。 - -推荐使用 pipeline 方法将日志消息拆分为多个列,这样可以精确查询某个特定列中的某个值。 -与全文搜索相比,列匹配查询在处理字符串时具有以下几个优势: - -- **性能效率**:列的匹配查询通常都比全文搜索更快。 -- **资源消耗**:由于 GreptimeDB 的存储引擎是列存,结构化的数据更利于数据的压缩,并且 Tag 匹配查询使用的倒排索引,其资源消耗通常显著少于全文索引,尤其是在存储大小方面。 -- **可维护性**:精确匹配查询简单明了,更易于理解、编写和调试。 - -当然,如果需要在大段文本中进行关键词搜索,依然需要使用全文搜索,因为它就是专门为此设计。 - -## 查询日志 - -以 `custom_pipeline_logs` 表为例查询日志。 - -### 按 Tag 查询日志 - -对于 `custom_pipeline_logs` 中的多个 Tag 列,你可以灵活地按 Tag 查询数据。 -例如,查询 `status_code` 为 `200` 且 `http_method` 为 `GET` 的日志。 - -```sql -SELECT * FROM custom_pipeline_logs WHERE status_code = 200 AND http_method = 'GET'; -``` - -```sql -+------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -| ip_address | status_code | request_line | user_agent | response_size | timestamp | http_method | -+------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -| 127.0.0.1 | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 | GET | -+------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -1 row in set (0.02 sec) -``` - -### 全文搜索 - -对于 `request_line` 和 `user_agent` 文本字段,你可以使用 
`matches_term` 函数查询日志。 -为了提高全文搜索的性能,我们在[创建 Pipeline](#创建-pipeline) 时为这两个列创建了全文索引。 - -例如,查询 `request_line` 包含 `/index.html` 或 `/api/login` 的日志。 - -```sql -SELECT * FROM custom_pipeline_logs WHERE matches_term(request_line, '/index.html') OR matches_term(request_line, '/api/login'); -``` - -```sql -+-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -| ip_address | status_code | request_line | user_agent | response_size | timestamp | http_method | -+-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -| 127.0.0.1 | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 | GET | -| 192.168.1.1 | 200 | /api/login HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 | 1784 | 2024-05-25 20:17:37 | POST | -+-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -2 rows in set (0.00 sec) -``` - -你可以参阅[全文搜索](query-logs.md)文档以获取 `matches_term` 的详细用法。 ++--------------------+---------------------+------+------+---------+---------------+ +| Column | Type | Key | Null | Default | Semantic Type | ++--------------------+---------------------+------+------+---------+---------------+ +| greptime_timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP | +| ip_address | String | | YES | | FIELD | +| level | String | | YES | | FIELD | +| message | String | | YES | | FIELD | +| service | String | | YES | | FIELD | +| timestamp | String | | YES | | FIELD | +| user_id | Int64 | | YES | | FIELD | +| error_code | Int64 | | YES | | FIELD | +| retry_count | Int64 | | YES | | FIELD | ++--------------------+---------------------+------+------+---------+---------------+ +``` + +## 与日志收集器集成 + +对于生产环境, +你通常会使用日志收集器自动将日志转发到 GreptimeDB。 +以下是如何配置 Vector 使用 `greptime_identity` pipeline 向 GreptimeDB 发送日志的示例: + +```toml +[sinks.my_sink_id] +type = "greptimedb_logs" +dbname = "public" +endpoint = "http://:4000" +pipeline_name = "greptime_identity" +table = "
" +username = "" +password = "" +# 根据需要添加其他配置 +``` + +关键配置参数包括: +- `type = "greptimedb_logs"`:指定 GreptimeDB 日志接收器 +- `dbname`:目标数据库名称 +- `endpoint`:GreptimeDB HTTP 端点 +- `pipeline_name`:使用 `greptime_identity` pipeline 进行 JSON 处理 +- `table`:目标表名称(如果不存在则自动创建) +- `username` 和 `password`:HTTP 基本身份验证的凭证 + +有关 Vector 配置和选项的详细信息, +请参阅 [Vector 集成指南](/user-guide/ingest-data/for-observability/vector.md#使用-greptimedb_logs-sink-推荐)。 ## 下一步 -你现在已经体验了 GreptimeDB 的日志记录功能,可以通过以下文档进一步探索: +你已成功写入了第一批日志,以下是推荐的后续步骤: + +- **了解更多关于内置 Pipeline 的行为**:请参阅[内置 Pipeline](/reference/pipeline/built-in-pipelines.md)指南,了解可用的内置 pipeline 及其配置的详细信息 +- **与流行的日志收集器集成**:有关将 GreptimeDB 与 Fluent Bit、Fluentd 等各种日志收集器集成的详细说明,请参阅[日志概览](./overview.md)中的[集成到日志收集器](./overview.md#集成到日志收集器)部分 +- **使用自定义 Pipeline**:要了解使用自定义 pipeline 进行高级日志处理和转换的信息,请参阅[使用自定义 Pipeline](./use-custom-pipelines.md)指南 -- [Pipeline 配置](./pipeline-config.md): 提供 GreptimeDB 中每个 pipeline 配置的深入信息。 -- [管理 Pipeline](./manage-pipelines.md): 解释如何创建和删除 pipeline。 -- [使用 Pipeline 写入日志](./write-logs.md): 介绍利用 pipeline 机制写入日志数据的详细说明。 -- [查询日志](./query-logs.md): 描述如何使用 GreptimeDB SQL 接口查询日志。 diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/use-custom-pipelines.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/use-custom-pipelines.md new file mode 100644 index 000000000..cea036a4d --- /dev/null +++ b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/use-custom-pipelines.md @@ -0,0 +1,318 @@ +--- +keywords: [快速开始, 写入日志, 查询日志, pipeline, 结构化数据, 日志写入, 日志收集, 日志管理工具] +description: 在 GreptimeDB 中快速写入和查询日志的全面指南,包括直接日志写入和使用 pipeline 处理结构化数据。 +--- + +# 使用自定义 Pipeline + +基于你的 pipeline 配置, +GreptimeDB 能够将日志自动解析和转换为多列的结构化数据, +当内置 pipeline 无法处理特定的文本日志格式时, +你可以创建自定义 pipeline 来定义如何根据你的需求解析和转换日志数据。 + +## 识别你的原始日志格式 + +在创建自定义 pipeline 之前,了解原始日志数据的格式至关重要。 +如果你正在使用日志收集器且不确定日志格式, +有两种方法可以检查你的日志: + +1. **阅读收集器的官方文档**:配置你的收集器将数据输出到控制台或文件以检查日志格式。 +2. **使用 `greptime_identity` pipeline**:使用内置的 `greptime_identity` pipeline 将示例日志直接写入到 GreptimeDB 中。 + `greptime_identity` pipeline 将整个文本日志视为单个 `message` 字段,方便你直接看到原始日志的内容。 + +一旦了解了要处理的日志格式, +你就可以创建自定义 pipeline。 +本文档使用以下 Nginx 访问日志条目作为示例: + +```txt +127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" +``` + +## 创建自定义 Pipeline + +GreptimeDB 提供 HTTP 接口用于创建 pipeline。 +以下是创建方法。 + +首先,创建一个示例 pipeline 配置文件来处理 Nginx 访问日志, +将其命名为 `pipeline.yaml`: + +```yaml +version: 2 +processors: + - dissect: + fields: + - message + patterns: + - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"' + ignore_missing: true + - date: + fields: + - timestamp + formats: + - "%d/%b/%Y:%H:%M:%S %z" + - select: + type: exclude + fields: + - message + - vrl: + source: | + .greptime_table_ttl = "7d" + . 
+ +transform: + - fields: + - ip_address + type: string + index: inverted + tag: true + - fields: + - status_code + type: int32 + index: inverted + tag: true + - fields: + - request_line + - user_agent + type: string + index: fulltext + - fields: + - response_size + type: int32 + - fields: + - timestamp + type: time + index: timestamp +``` + +上面的 pipeline 配置使用 [version 2](/reference/pipeline/pipeline-config.md#transform-in-version-2) 格式, +包含 `processors` 和 `transform` 部分来结构化你的日志数据: + +**Processors**:用于在转换前预处理日志数据: +- **数据提取**:`dissect` 处理器使用 pattern 匹配来解析 `message` 字段并提取结构化数据,包括 `ip_address`、`timestamp`、`http_method`、`request_line`、`status_code`、`response_size` 和 `user_agent`。 +- **时间戳处理**:`date` 处理器使用格式 `%d/%b/%Y:%H:%M:%S %z` 解析提取的 `timestamp` 字段并将其转换为适当的时间戳数据类型。 +- **字段选择**:`select` 处理器从最终输出中排除原始 `message` 字段,同时保留所有其他字段。 +- **表选项**:`vrl` 处理器根据提取的字段设置表选项,例如向表名添加后缀和设置 TTL。`greptime_table_ttl = "7d"` 配置表数据的保存时间为 7 天。 + +**Transform**:定义如何转换和索引提取的字段: +- **字段转换**:每个提取的字段都转换为适当的数据类型并根据需要配置相应的索引。像 `http_method` 这样的字段在没有提供显式配置时保留其默认数据类型。 +- **索引策略**: + - `ip_address` 和 `status_code` 使用倒排索引作为标签进行快速过滤 + - `request_line` 和 `user_agent` 使用全文索引以获得最佳文本搜索能力 + - `timestamp` 是必需的时间索引列 + +有关 pipeline 配置选项的详细信息, +请参考 [Pipeline 配置](/reference/pipeline/pipeline-config.md) 文档。 + +## 上传 Pipeline + +执行以下命令上传 pipeline 配置: + +```shell +curl -X "POST" \ + "http://localhost:4000/v1/pipelines/nginx_pipeline" \ + -H 'Authorization: Basic {{authentication}}' \ + -F "file=@pipeline.yaml" +``` + +成功执行后,将创建一个名为 `nginx_pipeline` 的 pipeline 并返回以下结果: + +```json +{"name":"nginx_pipeline","version":"2024-06-27 12:02:34.257312110Z"}. +``` + +你可以为同一个 pipeline 名称创建多个版本。 +所有 pipeline 都存储在 `greptime_private.pipelines` 表中。 +参考[查询 Pipeline](manage-pipelines.md#查询-pipeline) 来查看 pipeline 数据。 + +## 使用 Pipeline 写入日志 + +以下示例使用 `nginx_pipeline` pipeline 将日志写入 `custom_pipeline_logs` 表来格式化和转换日志消息: + +```shell +curl -X POST \ + "http://localhost:4000/v1/ingest?db=public&table=custom_pipeline_logs&pipeline_name=nginx_pipeline" \ + -H "Content-Type: application/json" \ + -H "Authorization: Basic {{authentication}}" \ + -d '[ + { + "message": "127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\"" + }, + { + "message": "192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\"" + }, + { + "message": "10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\"" + }, + { + "message": "172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\"" + } + ]' +``` + +命令执行成功后将返回以下输出: + +```json +{"output":[{"affectedrows":4}],"execution_time_ms":79} +``` + +`custom_pipeline_logs` 表内容根据 pipeline 配置自动创建: + +```sql ++-------------+-------------+-------------+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+ +| ip_address | http_method | status_code | request_line | user_agent | response_size | timestamp | 
++-------------+-------------+-------------+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+ +| 10.0.0.1 | GET | 304 | /images/logo.png HTTP/1.1 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0 | 0 | 2024-05-25 20:18:37 | +| 127.0.0.1 | GET | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 | +| 172.16.0.1 | GET | 404 | /contact HTTP/1.1 | Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1 | 162 | 2024-05-25 20:19:37 | +| 192.168.1.1 | POST | 200 | /api/login HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 | 1784 | 2024-05-25 20:17:37 | ++-------------+-------------+-------------+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+ +``` + +有关日志写入 API 端点 `/ingest` 的更详细信息, +包括附加参数和配置选项, +请参考[日志写入 API](/reference/pipeline/write-log-api.md) 文档。 + +## 查询日志 + +我们使用 `custom_pipeline_logs` 表作为示例来查询日志。 + +### 通过 tag 查询日志 + +通过 `custom_pipeline_logs` 中的多个 tag 列, +你可以灵活地通过 tag 查询数据。 +例如,查询 `status_code` 为 200 且 `http_method` 为 GET 的日志。 + +```sql +SELECT * FROM custom_pipeline_logs WHERE status_code = 200 AND http_method = 'GET'; +``` + +```sql ++------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ +| ip_address | status_code | request_line | user_agent | response_size | timestamp | http_method | ++------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ +| 127.0.0.1 | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 | GET | ++------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ +1 row in set (0.02 sec) +``` + +### 全文搜索 + +对于文本字段 `request_line` 和 `user_agent`,你可以使用 `matches_term` 函数来搜索日志。 +还记得我们在[创建 pipeline](#create-a-pipeline) 时为这两列创建了全文索引。 +这带来了高性能的全文搜索。 + +例如,查询 `request_line` 列包含 `/index.html` 或 `/api/login` 的日志。 + +```sql +SELECT * FROM custom_pipeline_logs WHERE matches_term(request_line, '/index.html') OR matches_term(request_line, '/api/login'); +``` + +```sql ++-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ +| ip_address | status_code | request_line | user_agent | response_size | timestamp | http_method | 
++-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ +| 127.0.0.1 | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 | GET | +| 192.168.1.1 | 200 | /api/login HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 | 1784 | 2024-05-25 20:17:37 | POST | ++-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ +2 rows in set (0.00 sec) +``` + +你可以参考[全文搜索](fulltext-search.md) 文档了解 `matches_term` 函数的详细用法。 + + +## 使用 Pipeline 的好处 + +使用 pipeline 处理日志带来了结构化的数据和自动的字段提取, +这使得查询和分析更加高效。 + +你也可以在没有 pipeline 的情况下直接将日志写入数据库, +但这种方法限制了高性能分析能力。 + +### 直接插入日志(不使用 Pipeline) + +为了比较,你可以创建一个表来存储原始日志消息: + +```sql +CREATE TABLE `origin_logs` ( + `message` STRING FULLTEXT INDEX, + `time` TIMESTAMP TIME INDEX +) WITH ( + append_mode = 'true' +); +``` + +使用 `INSERT` 语句将日志插入表中。 +注意你需要为每个日志手动添加时间戳字段: + +```sql +INSERT INTO origin_logs (message, time) VALUES +('127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"', '2024-05-25 20:16:37.217'), +('192.168.1.1 - - [25/May/2024:20:17:37 +0000] "POST /api/login HTTP/1.1" 200 1784 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"', '2024-05-25 20:17:37.217'), +('10.0.0.1 - - [25/May/2024:20:18:37 +0000] "GET /images/logo.png HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"', '2024-05-25 20:18:37.217'), +('172.16.0.1 - - [25/May/2024:20:19:37 +0000] "GET /contact HTTP/1.1" 404 162 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1"', '2024-05-25 20:19:37.217'); +``` + +### 表结构比较:Pipeline 转换后 vs 原始日志 + +在上面的示例中,表 `custom_pipeline_logs` 是通过使用 pipeline 写入日志自动创建的, +而表 `origin_logs` 是通过直接写入日志创建的。 +让我们看一看这两个表之间的差异。 + +```sql +DESC custom_pipeline_logs; +``` + +```sql ++---------------+---------------------+------+------+---------+---------------+ +| Column | Type | Key | Null | Default | Semantic Type | ++---------------+---------------------+------+------+---------+---------------+ +| ip_address | String | PRI | YES | | TAG | +| status_code | Int32 | PRI | YES | | TAG | +| request_line | String | | YES | | FIELD | +| user_agent | String | | YES | | FIELD | +| response_size | Int32 | | YES | | FIELD | +| timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP | +| http_method | String | | YES | | FIELD | ++---------------+---------------------+------+------+---------+---------------+ +7 rows in set (0.00 sec) +``` + +```sql +DESC origin_logs; +``` + +```sql ++---------+----------------------+------+------+---------+---------------+ +| Column | Type | Key | Null | Default | Semantic Type | ++---------+----------------------+------+------+---------+---------------+ +| message | String | | YES | | FIELD | +| time | TimestampMillisecond | PRI | NO | | TIMESTAMP | 
++---------+----------------------+------+------+---------+---------------+ +``` + +以上表结构显示了关键差异: + +`custom_pipeline_logs` 表(使用 pipeline 创建)自动将日志数据结构化为多列: +- `ip_address`、`status_code` 作为索引标签用于快速过滤 +- `request_line`、`user_agent` 具有全文索引用于文本搜索 +- `response_size`、`http_method` 作为常规字段 +- `timestamp` 作为时间索引 + +`origin_logs` 表(直接插入)将所有内容存储在单个 `message` 列中。 + +### 为什么使用 Pipeline? + +建议使用 pipeline 方法将日志消息拆分为多列, +这具有明确查询特定列中特定值的优势。 +有几个关键原因使得基于列的匹配查询比全文搜索更优越: + +- **性能**:基于列的查询通常比全文搜索更快 +- **存储效率**:GreptimeDB 的列式存储能更好地压缩结构化数据;标签的倒排索引比全文索引消耗更少的存储空间 +- **查询简单性**:基于标签的查询更容易编写、理解和调试 + +## 下一步 + +- **全文搜索**:阅读[全文搜索](fulltext-search.md) 指南,了解 GreptimeDB 中的高级文本搜索功能和查询技术 +- **Pipeline 配置**:阅读 [Pipeline 配置](/reference/pipeline/pipeline-config.md) 文档,了解更多关于为各种日志格式和处理需求创建和自定义 pipeline 的信息 + + diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/write-logs.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/write-logs.md deleted file mode 100644 index 3275b00e6..000000000 --- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/write-logs.md +++ /dev/null @@ -1,344 +0,0 @@ ---- -keywords: [日志写入, HTTP 接口, Pipeline 配置, 数据格式, 请求参数] -description: 介绍如何通过 HTTP 接口使用指定的 Pipeline 将日志写入 GreptimeDB,包括请求参数、数据格式和示例。 ---- - -# 使用 Pipeline 写入日志 - -本文档介绍如何通过 HTTP 接口使用指定的 Pipeline 进行处理后将日志写入 GreptimeDB。 - -在写入日志之前,请先阅读 [Pipeline 配置](pipeline-config.md)和[管理 Pipeline](manage-pipelines.md) 完成配置的设定和上传。 - -## HTTP API - -您可以使用以下命令通过 HTTP 接口写入日志: - -```shell -curl -X "POST" "http://localhost:4000/v1/ingest?db=&table=&pipeline_name=&version=" \ - -H "Content-Type: application/x-ndjson" \ - -H "Authorization: Basic {{authentication}}" \ - -d "$" -``` - - -## 请求参数 - -此接口接受以下参数: - -- `db`:数据库名称。 -- `table`:表名称。 -- `pipeline_name`:[Pipeline](./pipeline-config.md) 名称。 -- `version`:Pipeline 版本号。可选,默认使用最新版本。 - -## `Content-Type` 和 Body 数据格式 - -GreptimeDB 使用 `Content-Type` header 来决定如何解码请求体内容。目前我们支持以下两种格式: -- `application/json`: 包括普通的 JSON 格式和 NDJSON 格式。 -- `application/x-ndjson`: 指定 NDJSON 格式,会尝试先分割行再进行解析,可以达到精确的错误检查。 -- `text/plain`: 通过换行符分割的多行日志文本行。 - -### `application/json` 和 `application/x-ndjson` 格式 - -以下是一份 JSON 格式请求体内容的示例: - -```JSON -[ - {"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""}, - {"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""}, - {"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""}, - {"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""} -] -``` - -请注意整个 JSON 是一个数组(包含多行日志)。每个 JSON 对象代表即将要被 Pipeline 引擎处理的一行日志。 - -JSON 对象中的 key 名,也就是这里的 `message`,会被用作 Pipeline processor 处理时的 field 名称。比如: - -```yaml -processors: - - dissect: - fields: - # `message` 是 JSON 对象中的 key 名 - - message - patterns: - - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"' - ignore_missing: true - -# pipeline 文件的剩余部分在这里省略 -``` - -我们也可以将这个请求体内容改写成 NDJSON 的格式,如下所示: - -```JSON 
-{"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""} -{"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""} -{"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""} -{"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""} -``` - -注意到最外层的数组符被消去了,现在每个 JSON 对象通过换行符分割而不是 `,`。 - -### `text/plain` 格式 - -纯文本日志在整个生态系统中被广泛应用。GreptimeDB 同样支持日志数据以 `text/plain` 格式进行输入,使得我们可以直接从日志产生源进行写入。 - -以下是一份和上述样例请求体内容等价的文本请求示例: - -```plain -127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" -192.168.1.1 - - [25/May/2024:20:17:37 +0000] "POST /api/login HTTP/1.1" 200 1784 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36" -10.0.0.1 - - [25/May/2024:20:18:37 +0000] "GET /images/logo.png HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0" -172.16.0.1 - - [25/May/2024:20:19:37 +0000] "GET /contact HTTP/1.1" 404 162 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1" -``` - -仅需要将 `Content-Type` header 设置成 `text/plain`,即可将纯文本请求发送到 GreptimeDB。 - -主要注意的是,和 JSON 格式自带 key 名可以被 Pipeline processor 识别和处理不同,`text/plain` 格式直接将整行文本输入到 Pipeline engine。在这种情况下我们可以使用 `message` 来指代整行输入文本,例如: - -```yaml -processors: - - dissect: - fields: - # 使用 `message` 作为 field 名称 - - message - patterns: - - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"' - ignore_missing: true - -# pipeline 文件的剩余部分在这里省略 -``` - -对于 `text/plain` 格式的输入,推荐首先使用 `dissect` 或者 `regex` processor 将整行文本分割成不同的字段,以便进行后续的处理。 - -## 内置 Pipeline - -GreptimeDB 提供了常见日志格式的内置 Pipeline,允许您直接使用而无需创建新的 Pipeline。 - -请注意,内置 Pipeline 的名称以 "greptime_" 为前缀,不可编辑。 - -### `greptime_identity` - -`greptime_identity` Pipeline 适用于写入 JSON 日志,并自动为 JSON 日志中的每个字段创建列。 - -- JSON 日志中的第一层级的 key 是表中的列名。 -- 如果相同字段包含不同类型的数据,则会返回错误。 -- 值为 `null` 的字段将被忽略。 -- 如果没有手动指定,一个作为时间索引的额外列 `greptime_timestamp` 将被添加到表中,以指示日志写入的时间。 - -#### 类型转换规则 - -- `string` -> `string` -- `number` -> `int64` 或 `float64` -- `boolean` -> `bool` -- `null` -> 忽略 -- `array` -> `json` -- `object` -> `json` - -例如,如果我们有以下 JSON 数据: - -```json -[ - {"name": "Alice", "age": 20, "is_student": true, "score": 90.5,"object": {"a":1,"b":2}}, - {"age": 21, "is_student": false, "score": 85.5, "company": "A" ,"whatever": null}, - {"name": "Charlie", "age": 22, "is_student": true, "score": 95.5,"array":[1,2,3]} -] -``` - -我们将合并每个批次的行结构以获得最终 schema。表 schema 如下所示: - -```sql -mysql> desc pipeline_logs; -+--------------------+---------------------+------+------+---------+---------------+ -| Column | Type | Key | Null | Default | Semantic Type | 
-+--------------------+---------------------+------+------+---------+---------------+ -| age | Int64 | | YES | | FIELD | -| is_student | Boolean | | YES | | FIELD | -| name | String | | YES | | FIELD | -| object | Json | | YES | | FIELD | -| score | Float64 | | YES | | FIELD | -| company | String | | YES | | FIELD | -| array | Json | | YES | | FIELD | -| greptime_timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP | -+--------------------+---------------------+------+------+---------+---------------+ -8 rows in set (0.00 sec) -``` - -数据将存储在表中,如下所示: - -```sql -mysql> select * from pipeline_logs; -+------+------------+---------+---------------+-------+---------+---------+----------------------------+ -| age | is_student | name | object | score | company | array | greptime_timestamp | -+------+------------+---------+---------------+-------+---------+---------+----------------------------+ -| 22 | 1 | Charlie | NULL | 95.5 | NULL | [1,2,3] | 2024-10-18 09:35:48.333020 | -| 21 | 0 | NULL | NULL | 85.5 | A | NULL | 2024-10-18 09:35:48.333020 | -| 20 | 1 | Alice | {"a":1,"b":2} | 90.5 | NULL | NULL | 2024-10-18 09:35:48.333020 | -+------+------------+---------+---------------+-------+---------+---------+----------------------------+ -3 rows in set (0.01 sec) -``` - -#### 自定义时间索引列 - -每个 GreptimeDB 表中都必须有时间索引列。`greptime_identity` pipeline 不需要额外的 YAML 配置,如果你希望使用写入数据中自带的时间列(而不是日志数据到达服务端的时间戳)作为表的时间索引列,则需要通过参数进行指定。 - -假设这是一份待写入的日志数据: -```JSON -[ - {"action": "login", "ts": 1742814853} -] -``` - -设置如下的 URL 参数来指定自定义时间索引列: -```shell -curl -X "POST" "http://localhost:4000/v1/ingest?db=public&table=pipeline_logs&pipeline_name=greptime_identity&custom_time_index=ts;epoch;s" \ - -H "Content-Type: application/json" \ - -H "Authorization: Basic {{authentication}}" \ - -d $'[{"action": "login", "ts": 1742814853}]' -``` - -取决于数据的格式,`custom_time_index` 参数接受两种格式的配置值: -- Unix 时间戳: `<字段名>;epoch;<精度>` - - 该字段需要是整数或者字符串 - - 精度为这四种选项之一: `s`, `ms`, `us`, or `ns`. 
-- 时间戳字符串: `<字段名>;datestr;<字符串解析格式>` - - 例如输入的时间字段值为 `2025-03-24 19:31:37+08:00`,则对应的字符串解析格式为 `%Y-%m-%d %H:%M:%S%:z` - -通过上述配置,结果表就能正确使用输入字段作为时间索引列 -```sql -DESC pipeline_logs; -``` -```sql -+--------+-----------------+------+------+---------+---------------+ -| Column | Type | Key | Null | Default | Semantic Type | -+--------+-----------------+------+------+---------+---------------+ -| ts | TimestampSecond | PRI | NO | | TIMESTAMP | -| action | String | | YES | | FIELD | -+--------+-----------------+------+------+---------+---------------+ -2 rows in set (0.02 sec) -``` - -假设时间变量名称为 `input_ts`,以下是一些使用 `custom_time_index` 的示例: -- 1742814853: `custom_time_index=input_ts;epoch;s` -- 1752749137000: `custom_time_index=input_ts;epoch;ms` -- "2025-07-17T10:00:00+0800": `custom_time_index=input_ts;datestr;%Y-%m-%dT%H:%M:%S%z` -- "2025-06-27T15:02:23.082253908Z": `custom_time_index=input_ts;datestr;%Y-%m-%dT%H:%M:%S%.9f%#z` - - -#### 展开 json 对象 - -如果你希望将 JSON 对象展开为单层结构,可以在请求的 header 中添加 `x-greptime-pipeline-params` 参数,设置 `flatten_json_object` 为 `true`。 - -以下是一个示例请求: - -```shell -curl -X "POST" "http://localhost:4000/v1/ingest?db=&table=&pipeline_name=greptime_identity&version=" \ - -H "Content-Type: application/x-ndjson" \ - -H "Authorization: Basic {{authentication}}" \ - -H "x-greptime-pipeline-params: flatten_json_object=true" \ - -d "$" -``` - -这样,GreptimeDB 将自动将 JSON 对象的每个字段展开为单独的列。比如 - -```JSON -{ - "a": { - "b": { - "c": [1, 2, 3] - } - }, - "d": [ - "foo", - "bar" - ], - "e": { - "f": [7, 8, 9], - "g": { - "h": 123, - "i": "hello", - "j": { - "k": true - } - } - } -} -``` - -将被展开为: - -```json -{ - "a.b.c": [1,2,3], - "d": ["foo","bar"], - "e.f": [7,8,9], - "e.g.h": 123, - "e.g.i": "hello", - "e.g.j.k": true -} -``` - -## Pipeline 上下文中的 hint 变量 - -从 `v0.15` 开始,pipeline 引擎可以识别特定的变量名称,并且通过这些变量对应的值设置相应的建表选项。 -通过与 `vrl` 处理器的结合,现在可以非常轻易地通过输入的数据在 pipeline 的执行过程中设置建表选项。 - -以下是支持的表选项变量名: -- `greptime_auto_create_table` -- `greptime_ttl` -- `greptime_append_mode` -- `greptime_merge_mode` -- `greptime_physical_table` -- `greptime_skip_wal` -关于这些表选项的含义,可以参考[这份文档](/reference/sql/create.md#表选项)。 - -以下是 pipeline 特有的变量: -- `greptime_table_suffix`: 在给定的目标表后增加后缀 - -以如下 pipeline 文件为例 -```YAML -processors: - - date: - field: time - formats: - - "%Y-%m-%d %H:%M:%S%.3f" - ignore_missing: true - - vrl: - source: | - .greptime_table_suffix, err = "_" + .id - .greptime_table_ttl = "1d" - . -``` - -在这份 vrl 脚本中,我们将表后缀变量设置为输入字段中的 `id`(通过一个下划线连接),然后将 ttl 设置成 `1d`。 -然后我们使用如下数据执行写入。 - -```JSON -{ - "id": "2436", - "time": "2024-05-25 20:16:37.217" -} -``` - -假设给定的表名为 `d_table`,那么最终的表名就会按照预期被设置成 `d_table_2436`。这个表同样的 ttl 同样会被设置成 1 天。 - -## 示例 - -请参考快速开始中的[写入日志](quick-start.md#写入日志)部分。 - -## Append 模式 - -通过此接口创建的日志表,默认为[Append 模式](/user-guide/deployments-administration/performance-tuning/design-table.md#何时使用-append-only-表). 
- - -## 使用 skip_error 跳过错误 - -如果你希望在写入日志时跳过错误,可以在 HTTP 请求的 query params 中添加 `skip_error` 参数。比如: - -```shell -curl -X "POST" "http://localhost:4000/v1/ingest?db=&table=&pipeline_name=&version=&skip_error=true" \ - -H "Content-Type: application/x-ndjson" \ - -H "Authorization: Basic {{authentication}}" \ - -d "$" -``` - -这样,GreptimeDB 将在遇到错误时跳过该条日志,并继续处理其他日志。不会因为某一条日志的错误而导致整个请求失败。 \ No newline at end of file diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/manage-data/data-index.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/manage-data/data-index.md index 09f850f66..7c69bd838 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/manage-data/data-index.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/manage-data/data-index.md @@ -79,7 +79,7 @@ CREATE TABLE sensor_data ( ### 全文索引 -全文索引专门用于优化字符串列的文本搜索操作。它支持基于词的匹配和文本搜索功能,能够实现对文本内容的高效检索。用户可以使用灵活的关键词、短语或模式匹配来查询文本数据。 +全文索引专门用于优化字符串列的文本搜索操作。它支持基于词的匹配和文本搜索功能,能够实现对文本内容的高效检索。你可以使用灵活的关键词、短语或模式匹配来查询文本数据。 **适用场景:** - 文本内容搜索 @@ -95,32 +95,122 @@ CREATE TABLE logs ( ); ``` -全文索引通过 `WITH` 支持以下选项: -* `analyzer`:设置全文索引的语言分析器。支持的值包括 `English`(英语)和 `Chinese`(中文)。默认值为 `English`。 -* `case_sensitive`:决定全文索引是否区分大小写。支持的值为 `true`(是)和 `false`(否)。默认值为 `false`。 -* `backend`:设置全文索引的后端引擎。支持的值包括 `bloom` 和 `tantivy`。默认值为 `bloom`。 -* `granularity`: (适用于 `bloom` 后端)每个过滤器覆盖的数据块大小。粒度越小,过滤效果越好,但索引大小会增加。默认为 `10240`。 -* `false_positive_rate`: (适用于 `bloom` 后端)错误识别块的概率。该值越低,准确性越高(过滤效果越好),但索引大小会增加。该值为介于 `0` 和 `1` 之间的浮点数。默认为 `0.01`。 +#### 配置选项 -示例: +在创建或修改全文索引时,您可以使用 `FULLTEXT INDEX WITH` 指定以下选项: + +- `analyzer`:设置全文索引的语言分析器 + - 支持的值:`English`、`Chinese` + - 默认值:`English` + - 注意:由于中文文本分词的复杂性,中文分析器构建索引需要的时间显著更长。建议仅在中文文本搜索是主要需求时使用。 + +- `case_sensitive`:决定全文索引是否区分大小写 + - 支持的值:`true`、`false` + - 默认值:`false` + - 注意:设置为 `true` 可能会略微提高区分大小写查询的性能,但会降低不区分大小写查询的性能。此设置不会影响 `matches_term` 查询的结果。 + +- `backend`:设置全文索引的后端实现 + - 支持的值:`bloom`、`tantivy` + - 默认值:`bloom` + +- `granularity`:(适用于 `bloom` 后端)每个过滤器覆盖的数据块大小。粒度越小,过滤效果越好,但索引大小会增加。 + - 支持的值:正整数 + - 默认值:`10240` + +- `false_positive_rate`:(适用于 `bloom` 后端)错误识别块的概率。该值越低,准确性越高(过滤效果越好),但索引大小会增加。该值为介于 `0` 和 `1` 之间的浮点数。 + - 支持的值:介于 `0` 和 `1` 之间的浮点数 + - 默认值:`0.01` + +#### 后端选择 + +GreptimeDB 提供两种全文索引后端用于高效日志搜索: + +1. **Bloom 后端** + - 最适合:通用日志搜索 + - 特点: + - 使用 Bloom 过滤器进行高效过滤 + - 存储开销较低 + - 在不同查询模式下性能稳定 + - 限制: + - 对于高选择性查询稍慢 + - 存储成本示例: + - 原始数据:约 10GB + - Bloom 索引:约 1GB + +2. 
**Tantivy 后端** + - 最适合:高选择性查询(如 TraceID 等唯一值) + - 特点: + - 使用倒排索引实现快速精确匹配 + - 对高选择性查询性能优异 + - 限制: + - 存储开销较高(接近原始数据大小) + - 对低选择性查询性能较慢 + - 存储成本示例: + - 原始数据:约 10GB + - Tantivy 索引:约 10GB + +#### 性能对比 + +下表显示了不同查询方法之间的性能对比(以 Bloom 为基准): + +| 查询类型 | 高选择性(如 TraceID) | 低选择性(如 "HTTP") | +|------------|----------------------------------|--------------------------------| +| LIKE | 慢 50 倍 | 1 倍 | +| Tantivy | 快 5 倍 | 慢 5 倍 | +| Bloom | 1 倍(基准) | 1 倍(基准) | + +主要观察结果: +- 对于高选择性查询(如唯一值),Tantivy 提供最佳性能 +- 对于低选择性查询,Bloom 提供更稳定的性能 +- Bloom 在存储方面比 Tantivy 有明显优势(测试案例中为 1GB vs 10GB) + +#### 配置示例 + +**创建带全文索引的表** ```sql +-- 使用 Bloom 后端(大多数情况推荐) CREATE TABLE logs ( - message STRING FULLTEXT INDEX WITH(analyzer='Chinese', case_sensitive='true', backend='bloom', granularity=1024, false_positive_rate=0.01), - `level` STRING PRIMARY KEY, - `timestamp` TIMESTAMP TIME INDEX, + timestamp TIMESTAMP(9) TIME INDEX, + message STRING FULLTEXT INDEX WITH ( + backend = 'bloom', + analyzer = 'English', + case_sensitive = 'false' + ) ); -``` -使用全文索引时需要注意以下限制: +-- 使用 Tantivy 后端(用于高选择性查询) +CREATE TABLE logs ( + timestamp TIMESTAMP(9) TIME INDEX, + message STRING FULLTEXT INDEX WITH ( + backend = 'tantivy', + analyzer = 'English', + case_sensitive = 'false' + ) +); +``` -- 存储开销较大,因需要保存词条和位置信息 -- 文本分词和索引过程会增加数据刷新和压缩的延迟 -- 对于简单的前缀或后缀匹配可能不是最优选择 +**修改现有表** -建议仅在需要高级文本搜索功能和灵活查询模式时使用全文索引。 +```sql +-- 在现有列上启用全文索引 +ALTER TABLE monitor +MODIFY COLUMN load_15 +SET FULLTEXT INDEX WITH ( + analyzer = 'English', + case_sensitive = 'false', + backend = 'bloom' +); -有关全文索引配置和后端选择的更多详细信息,请参考[全文索引配置](/user-guide/logs/fulltext-index-config)指南。 +-- 更改全文索引配置 +ALTER TABLE logs +MODIFY COLUMN message +SET FULLTEXT INDEX WITH ( + analyzer = 'English', + case_sensitive = 'false', + backend = 'tantivy' +); +``` ## 修改索引 diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/migrate-to-greptimedb/migrate-from-clickhouse.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/migrate-to-greptimedb/migrate-from-clickhouse.md index c2a5e3a3d..c29e7b05a 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/migrate-to-greptimedb/migrate-from-clickhouse.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/migrate-to-greptimedb/migrate-from-clickhouse.md @@ -118,7 +118,7 @@ CREATE TABLE logs ( **说明:** - `host` 和 `service` 作为常用过滤项列入主键,如主机数量非常多,可移出主键,改为跳数索引。 -- `log_message` 作为原始文本内容建立全文索引。**若要全文索引生效,查询时 SQL 语法也需调整,详见[日志检索文档](/user-guide/logs/query-logs.md)**。 +- `log_message` 作为原始文本内容建立全文索引。**若要全文索引生效,查询时 SQL 语法也需调整,详见[日志检索文档](/user-guide/logs/fulltext-search.md)**。 - `trace_id` 和 `span_id` 通常为高基数字段,建议仅做跳数索引。 @@ -227,7 +227,7 @@ clickhouse client --query="SELECT * FROM example INTO OUTFILE 'example.csv' FORM ### SQL/类型不兼容怎么办? -迁移前需梳理所有查询 SQL 并按官方文档 ([SQL 查询](/user-guide/query-data/sql.md)、[日志检索](/user-guide/logs/query-logs.md)) 重写或翻译不兼容语法和类型。 +迁移前需梳理所有查询 SQL 并按官方文档 ([SQL 查询](/user-guide/query-data/sql.md)、[日志检索](/user-guide/logs/fulltext-search.md)) 重写或翻译不兼容语法和类型。 ### 如何高效批量导入大规模数据? 
diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.12/faq-and-others/faq.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.12/faq-and-others/faq.md index bd730e851..6402de2e6 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/version-0.12/faq-and-others/faq.md +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.12/faq-and-others/faq.md @@ -16,7 +16,7 @@ GreptimeDB 现在仅支持日志(Log)数据类型,在 v0.10 版本中引 我们计划进一步优化日志引擎,着重提升查询性能和用户体验。未来的增强功能将包括(但不限于)扩展 GreptimeDB 日志查询 DSL 的功能,并实现与部分 Elasticsearch/Loki API 的兼容,为用户提供更高效、灵活的日志查询能力。 关于如何使用 GreptimeDB 处理日志的更多信息,您可以参考以下文档: -- [日志概述](https://docs.greptime.com/user-guide/logs/overview) +- [日志概述](https://docs.greptime.cn/user-guide/logs/overview) - [OpenTelemetry 兼容性](https://docs.greptime.com/user-guide/ingest-data/for-observability/opentelemetry) - [Loki 协议兼容性](/user-guide/ingest-data/for-observability/loki.md) - [Vector 兼容性](https://docs.greptime.com/user-guide/ingest-data/for-observability/vector) @@ -163,7 +163,7 @@ GreptimeDB 是一个快速发展的开源项目,欢迎社区的反馈和贡献 ### GreptimeDB 是否可以用于存储日志? -可以,详细信息请参考[这里](https://docs.greptime.com/user-guide/logs/overview)。 +可以,详细信息请参考[这里](https://docs.greptime.cn/user-guide/logs/overview)。 ### 非主键字段的查询性能如何?是否可以设置倒排索引?与 Elasticsearch 相比,存储成本是否更低? diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.15/user-guide/manage-data/data-index.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.15/user-guide/manage-data/data-index.md index 09f850f66..ab790c4d7 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/version-0.15/user-guide/manage-data/data-index.md +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.15/user-guide/manage-data/data-index.md @@ -120,7 +120,7 @@ CREATE TABLE logs ( 建议仅在需要高级文本搜索功能和灵活查询模式时使用全文索引。 -有关全文索引配置和后端选择的更多详细信息,请参考[全文索引配置](/user-guide/logs/fulltext-index-config)指南。 +有关全文索引配置和后端选择的更多详细信息,请参考[全文索引配置](/user-guide/logs/fulltext-index-config.md)指南。 ## 修改索引 diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.16/user-guide/manage-data/data-index.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.16/user-guide/manage-data/data-index.md index 09f850f66..ab790c4d7 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/version-0.16/user-guide/manage-data/data-index.md +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.16/user-guide/manage-data/data-index.md @@ -120,7 +120,7 @@ CREATE TABLE logs ( 建议仅在需要高级文本搜索功能和灵活查询模式时使用全文索引。 -有关全文索引配置和后端选择的更多详细信息,请参考[全文索引配置](/user-guide/logs/fulltext-index-config)指南。 +有关全文索引配置和后端选择的更多详细信息,请参考[全文索引配置](/user-guide/logs/fulltext-index-config.md)指南。 ## 修改索引 diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/faq-and-others/faq.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/faq-and-others/faq.md index 14bb67c48..a2afd9112 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/faq-and-others/faq.md +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/faq-and-others/faq.md @@ -210,7 +210,7 @@ GreptimeDB 提供多种灾备策略以满足不同的可用性需求: **实时处理**: - **[Flow Engine](/user-guide/flow-computation/overview.md)**:实时流数据处理系统,对流式数据进行连续增量计算,自动更新结果表 -- **[Pipeline](/user-guide/logs/pipeline-config.md)**:实时数据解析转换机制,通过可配置处理器对各种入库数据进行字段提取和数据类型转换 +- **[Pipeline](/user-guide/logs/use-custom-pipelines.md)**:实时数据解析转换机制,通过可配置处理器对各种入库数据进行字段提取和数据类型转换 - **输出表**:持久化处理结果用于分析 ### GreptimeDB 的可扩展性特征是什么? 
diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/getting-started/quick-start.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/getting-started/quick-start.md index a5bbb604d..7087af6d8 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/getting-started/quick-start.md +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/getting-started/quick-start.md @@ -236,7 +236,7 @@ ORDER BY +---------------------+-------+------------------+-----------+--------------------+ ``` -`@@` 操作符用于[短语搜索](/user-guide/logs/query-logs.md)。 +`@@` 操作符用于[短语搜索](/user-guide/logs/fulltext-search.md)。 ### Range query diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/greptimecloud/integrations/fluent-bit.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/greptimecloud/integrations/fluent-bit.md index caac4197a..700bdcae2 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/greptimecloud/integrations/fluent-bit.md +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/greptimecloud/integrations/fluent-bit.md @@ -27,7 +27,7 @@ Fluent Bit 可以配置为使用 HTTP 协议将日志发送到 GreptimeCloud。 http_Passwd ``` -在此示例中,使用 `http` 输出插件将日志发送到 GreptimeCloud。有关更多信息和额外选项,请参阅 [Logs HTTP API](https://docs.greptime.cn/user-guide/logs/write-logs#http-api) 指南。 +在此示例中,使用 `http` 输出插件将日志发送到 GreptimeCloud。有关更多信息和额外选项,请参阅 [Logs HTTP API](https://docs.greptime.cn/reference/pipeline/write-log-api/#http-api) 指南。 ## Prometheus Remote Write diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/greptimecloud/integrations/kafka.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/greptimecloud/integrations/kafka.md index 3325559db..20d6c3788 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/greptimecloud/integrations/kafka.md +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/greptimecloud/integrations/kafka.md @@ -11,7 +11,7 @@ description: 介绍如何使用 Kafka 将数据传输到 GreptimeCloud,并提 ## Logs 以下是一个示例配置。请注意,您需要创建您的 -[Pipeline](https://docs.greptime.cn/user-guide/logs/pipeline-config/) 用于日志 +[Pipeline](https://docs.greptime.cn/user-guide/logs/use-custom-pipelines/) 用于日志 解析。 ```toml diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/reference/pipeline/built-in-pipelines.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/reference/pipeline/built-in-pipelines.md new file mode 100644 index 000000000..9eac5023e --- /dev/null +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/reference/pipeline/built-in-pipelines.md @@ -0,0 +1,172 @@ +--- +keywords: [内置 pipeline, greptime_identity, JSON 日志, 日志处理, 时间索引, pipeline, GreptimeDB] +description: 了解 GreptimeDB 的内置 pipeline,包括用于处理 JSON 日志的 greptime_identity pipeline,具有自动 schema 创建、类型转换和时间索引配置功能。 +--- + +# 内置 Pipeline + +GreptimeDB 提供了常见日志格式的内置 Pipeline,允许你直接使用而无需创建新的 Pipeline。 + +请注意,内置 Pipeline 的名称以 "greptime_" 为前缀,不可编辑。 + +## `greptime_identity` + +`greptime_identity` Pipeline 适用于写入 JSON 日志,并自动为 JSON 日志中的每个字段创建列。 + +- JSON 日志中的第一层级的 key 是表中的列名。 +- 如果相同字段包含不同类型的数据,则会返回错误。 +- 值为 `null` 的字段将被忽略。 +- 如果没有手动指定,一个作为时间索引的额外列 `greptime_timestamp` 将被添加到表中,以指示日志写入的时间。 + +### 类型转换规则 + +- `string` -> `string` +- `number` -> `int64` 或 `float64` +- `boolean` -> `bool` +- `null` -> 忽略 +- `array` -> `json` +- `object` -> `json` + +例如,如果我们有以下 JSON 数据: + +```json +[ + {"name": "Alice", "age": 20, "is_student": true, "score": 90.5,"object": {"a":1,"b":2}}, + {"age": 21, "is_student": false, "score": 85.5, "company": "A" ,"whatever": null}, + {"name": "Charlie", "age": 22, "is_student": true, "score": 95.5,"array":[1,2,3]} 
+] +``` + +我们将合并每个批次的行结构以获得最终 schema。表 schema 如下所示: + +```sql +mysql> desc pipeline_logs; ++--------------------+---------------------+------+------+---------+---------------+ +| Column | Type | Key | Null | Default | Semantic Type | ++--------------------+---------------------+------+------+---------+---------------+ +| age | Int64 | | YES | | FIELD | +| is_student | Boolean | | YES | | FIELD | +| name | String | | YES | | FIELD | +| object | Json | | YES | | FIELD | +| score | Float64 | | YES | | FIELD | +| company | String | | YES | | FIELD | +| array | Json | | YES | | FIELD | +| greptime_timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP | ++--------------------+---------------------+------+------+---------+---------------+ +8 rows in set (0.00 sec) +``` + +数据将存储在表中,如下所示: + +```sql +mysql> select * from pipeline_logs; ++------+------------+---------+---------------+-------+---------+---------+----------------------------+ +| age | is_student | name | object | score | company | array | greptime_timestamp | ++------+------------+---------+---------------+-------+---------+---------+----------------------------+ +| 22 | 1 | Charlie | NULL | 95.5 | NULL | [1,2,3] | 2024-10-18 09:35:48.333020 | +| 21 | 0 | NULL | NULL | 85.5 | A | NULL | 2024-10-18 09:35:48.333020 | +| 20 | 1 | Alice | {"a":1,"b":2} | 90.5 | NULL | NULL | 2024-10-18 09:35:48.333020 | ++------+------------+---------+---------------+-------+---------+---------+----------------------------+ +3 rows in set (0.01 sec) +``` + +### 自定义时间索引列 + +每个 GreptimeDB 表中都必须有时间索引列。`greptime_identity` pipeline 不需要额外的 YAML 配置,如果你希望使用写入数据中自带的时间列(而不是日志数据到达服务端的时间戳)作为表的时间索引列,则需要通过参数进行指定。 + +假设这是一份待写入的日志数据: +```JSON +[ + {"action": "login", "ts": 1742814853} +] +``` + +设置如下的 URL 参数来指定自定义时间索引列: +```shell +curl -X "POST" "http://localhost:4000/v1/ingest?db=public&table=pipeline_logs&pipeline_name=greptime_identity&custom_time_index=ts;epoch;s" \ + -H "Content-Type: application/json" \ + -H "Authorization: Basic {{authentication}}" \ + -d $'[{"action": "login", "ts": 1742814853}]' +``` + +取决于数据的格式,`custom_time_index` 参数接受两种格式的配置值: +- Unix 时间戳: `<字段名>;epoch;<精度>` + - 该字段需要是整数或者字符串 + - 精度为这四种选项之一: `s`, `ms`, `us`, or `ns`. 
+- 时间戳字符串: `<字段名>;datestr;<字符串解析格式>` + - 例如输入的时间字段值为 `2025-03-24 19:31:37+08:00`,则对应的字符串解析格式为 `%Y-%m-%d %H:%M:%S%:z` + +通过上述配置,结果表就能正确使用输入字段作为时间索引列 +```sql +DESC pipeline_logs; +``` +```sql ++--------+-----------------+------+------+---------+---------------+ +| Column | Type | Key | Null | Default | Semantic Type | ++--------+-----------------+------+------+---------+---------------+ +| ts | TimestampSecond | PRI | NO | | TIMESTAMP | +| action | String | | YES | | FIELD | ++--------+-----------------+------+------+---------+---------------+ +2 rows in set (0.02 sec) +``` + +假设时间变量名称为 `input_ts`,以下是一些使用 `custom_time_index` 的示例: +- 1742814853: `custom_time_index=input_ts;epoch;s` +- 1752749137000: `custom_time_index=input_ts;epoch;ms` +- "2025-07-17T10:00:00+0800": `custom_time_index=input_ts;datestr;%Y-%m-%dT%H:%M:%S%z` +- "2025-06-27T15:02:23.082253908Z": `custom_time_index=input_ts;datestr;%Y-%m-%dT%H:%M:%S%.9f%#z` + + +### 展开 json 对象 + +如果你希望将 JSON 对象展开为单层结构,可以在请求的 header 中添加 `x-greptime-pipeline-params` 参数,设置 `flatten_json_object` 为 `true`。 + +以下是一个示例请求: + +```shell +curl -X "POST" "http://localhost:4000/v1/ingest?db=&table=&pipeline_name=greptime_identity&version=" \ + -H "Content-Type: application/x-ndjson" \ + -H "Authorization: Basic {{authentication}}" \ + -H "x-greptime-pipeline-params: flatten_json_object=true" \ + -d "$" +``` + +这样,GreptimeDB 将自动将 JSON 对象的每个字段展开为单独的列。比如 + +```JSON +{ + "a": { + "b": { + "c": [1, 2, 3] + } + }, + "d": [ + "foo", + "bar" + ], + "e": { + "f": [7, 8, 9], + "g": { + "h": 123, + "i": "hello", + "j": { + "k": true + } + } + } +} +``` + +将被展开为: + +```json +{ + "a.b.c": [1,2,3], + "d": ["foo","bar"], + "e.f": [7,8,9], + "e.g.h": 123, + "e.g.i": "hello", + "e.g.j.k": true +} +``` + diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/pipeline-config.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/reference/pipeline/pipeline-config.md similarity index 99% rename from i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/pipeline-config.md rename to i18n/zh/docusaurus-plugin-content-docs/version-0.17/reference/pipeline/pipeline-config.md index 3b40d9c7e..ef9f942bb 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/pipeline-config.md +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/reference/pipeline/pipeline-config.md @@ -1032,7 +1032,7 @@ GreptimeDB 支持以下四种字段的索引类型: #### Fulltext 索引 -通过 `index: fulltext` 指定在哪个列上建立全文索引,该索引可大大提升 [日志搜索](./query-logs.md) 的性能,写法请参考下方的 [Transform 示例](#transform-示例)。 +通过 `index: fulltext` 指定在哪个列上建立全文索引,该索引可大大提升 [日志搜索](/user-guide/logs/fulltext-search.md) 的性能,写法请参考下方的 [Transform 示例](#transform-示例)。 #### Skipping 索引 diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/reference/pipeline/write-log-api.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/reference/pipeline/write-log-api.md new file mode 100644 index 000000000..0c23d9d4d --- /dev/null +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/reference/pipeline/write-log-api.md @@ -0,0 +1,161 @@ +--- +keywords: [日志写入, HTTP 接口, Pipeline 配置, 数据格式, 请求参数] +description: 介绍如何通过 HTTP 接口使用指定的 Pipeline 将日志写入 GreptimeDB,包括请求参数、数据格式和示例。 +--- + +# 写入日志的 API + +在写入日志之前,请先阅读 [Pipeline 配置](/user-guide/logs/use-custom-pipelines.md#上传-pipeline)完成配置的设定和上传。 + +## HTTP API + +你可以使用以下命令通过 HTTP 接口写入日志: + +```shell +curl -X "POST" "http://localhost:4000/v1/ingest?db=&table=&pipeline_name=&version=&skip_error=" \ + -H "Content-Type: application/x-ndjson" \ + -H "Authorization: Basic 
{{authentication}}" \ + -d "$" +``` + + +### 请求参数 + +此接口接受以下参数: + +- `db`:数据库名称。 +- `table`:表名称。 +- `pipeline_name`:[Pipeline](./pipeline-config.md) 名称。 +- `version`:Pipeline 版本号。可选,默认使用最新版本。 +- `skip_error`:写入日志时是否跳过错误。可选,默认为 `false`。当设置为 `true` 时,GreptimeDB 会跳过遇到错误的单条日志项并继续处理剩余的日志,不会因为一条日志项的错误导致整个请求失败。 + +### `Content-Type` 和 Body 数据格式 + +GreptimeDB 使用 `Content-Type` header 来决定如何解码请求体内容。目前我们支持以下两种格式: +- `application/json`: 包括普通的 JSON 格式和 NDJSON 格式。 +- `application/x-ndjson`: 指定 NDJSON 格式,会尝试先分割行再进行解析,可以达到精确的错误检查。 +- `text/plain`: 通过换行符分割的多行日志文本行。 + +#### `application/json` 和 `application/x-ndjson` 格式 + +以下是一份 JSON 格式请求体内容的示例: + +```JSON +[ + {"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""}, + {"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""}, + {"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""}, + {"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""} +] +``` + +请注意整个 JSON 是一个数组(包含多行日志)。每个 JSON 对象代表即将要被 Pipeline 引擎处理的一行日志。 + +JSON 对象中的 key 名,也就是这里的 `message`,会被用作 Pipeline processor 处理时的 field 名称。比如: + +```yaml +processors: + - dissect: + fields: + # `message` 是 JSON 对象中的 key 名 + - message + patterns: + - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"' + ignore_missing: true + +# pipeline 文件的剩余部分在这里省略 +``` + +我们也可以将这个请求体内容改写成 NDJSON 的格式,如下所示: + +```JSON +{"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""} +{"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""} +{"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""} +{"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""} +``` + +注意到最外层的数组符被消去了,现在每个 JSON 对象通过换行符分割而不是 `,`。 + +#### `text/plain` 格式 + +纯文本日志在整个生态系统中被广泛应用。GreptimeDB 同样支持日志数据以 `text/plain` 格式进行输入,使得我们可以直接从日志产生源进行写入。 + +以下是一份和上述样例请求体内容等价的文本请求示例: + +```plain +127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" +192.168.1.1 - - [25/May/2024:20:17:37 +0000] "POST /api/login HTTP/1.1" 200 1784 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36" +10.0.0.1 - - [25/May/2024:20:18:37 +0000] "GET 
/images/logo.png HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0" +172.16.0.1 - - [25/May/2024:20:19:37 +0000] "GET /contact HTTP/1.1" 404 162 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1" +``` + +仅需要将 `Content-Type` header 设置成 `text/plain`,即可将纯文本请求发送到 GreptimeDB。 + +主要注意的是,和 JSON 格式自带 key 名可以被 Pipeline processor 识别和处理不同,`text/plain` 格式直接将整行文本输入到 Pipeline engine。在这种情况下我们可以使用 `message` 来指代整行输入文本,例如: + +```yaml +processors: + - dissect: + fields: + # 使用 `message` 作为 field 名称 + - message + patterns: + - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"' + ignore_missing: true + +# pipeline 文件的剩余部分在这里省略 +``` + +对于 `text/plain` 格式的输入,推荐首先使用 `dissect` 或者 `regex` processor 将整行文本分割成不同的字段,以便进行后续的处理。 + +## 设置表选项 + +写入日志的表选项需要在 pipeline 中配置。 +从 `v0.15` 开始,pipeline 引擎可以识别特定的变量名称,并且通过这些变量对应的值设置相应的建表选项。 +通过与 `vrl` 处理器的结合,现在可以非常轻易地通过输入的数据在 pipeline 的执行过程中设置建表选项。 + +以下是支持的表选项变量名: +- `greptime_auto_create_table` +- `greptime_ttl` +- `greptime_append_mode` +- `greptime_merge_mode` +- `greptime_physical_table` +- `greptime_skip_wal` + +请前往[表选项](/reference/sql/create.md#表选项)文档了解每一个选项的详细含义。 + +以下是 pipeline 特有的变量: +- `greptime_table_suffix`: 在给定的目标表后增加后缀 + +以如下 pipeline 文件为例 +```YAML +processors: + - date: + field: time + formats: + - "%Y-%m-%d %H:%M:%S%.3f" + ignore_missing: true + - vrl: + source: | + .greptime_table_suffix, err = "_" + .id + .greptime_table_ttl = "1d" + . +``` + +在这份 vrl 脚本中,我们将表后缀变量设置为输入字段中的 `id`(通过一个下划线连接),然后将 ttl 设置成 `1d`。 +然后我们使用如下数据执行写入。 + +```JSON +{ + "id": "2436", + "time": "2024-05-25 20:16:37.217" +} +``` + +假设给定的表名为 `d_table`,那么最终的表名就会按照预期被设置成 `d_table_2436`。这个表同样的 ttl 同样会被设置成 1 天。 + +## 示例 + +请参考[快速开始](/user-guide/logs/quick-start.md)和[使用自定义 pipeline 中的](/user-guide/logs/use-custom-pipelines.md#使用-pipeline-写入日志)写入日志部分的文档。 + diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/reference/sql/alter.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/reference/sql/alter.md index 6d2e811bf..c9a857a33 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/reference/sql/alter.md +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/reference/sql/alter.md @@ -194,7 +194,7 @@ ALTER TABLE monitor MODIFY COLUMN load_15 SET FULLTEXT INDEX WITH (analyzer = 'E - `granularity`:(适用于 `bloom` 后端)每个过滤器覆盖的数据块大小。粒度越小,过滤效果越好,但索引大小会增加。默认为 `10240`。 - `false_positive_rate`:(适用于 `bloom` 后端)错误识别块的概率。该值越低,准确性越高(过滤效果越好),但索引大小会增加。该值为介于 `0` 和 `1` 之间的浮点数。默认为 `0.01`。 -更多关于全文索引配置和性能对比的信息,请参考[全文索引配置指南](/user-guide/logs/fulltext-index-config.md)。 +更多关于全文索引配置和性能对比的信息,请参考[全文索引配置指南](/user-guide/manage-data/data-index.md#全文索引)。 与 `CREATE TABLE` 一样,可以不带 `WITH` 选项,全部使用默认值。 diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/reference/sql/functions/overview.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/reference/sql/functions/overview.md index 838b282a2..4d1cd63be 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/reference/sql/functions/overview.md +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/reference/sql/functions/overview.md @@ -49,7 +49,7 @@ DataFusion [字符串函数](./df-functions.md#string-functions)。 GreptimeDB 提供: * `matches_term(expression, term)` 用于全文检索。 -阅读[查询日志](/user-guide/logs/query-logs.md)文档获取更多详情。 +阅读[查询日志](/user-guide/logs/fulltext-search.md)文档获取更多详情。 ### 数学函数 diff --git 
a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/reference/sql/where.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/reference/sql/where.md index 22dc76456..0293da10e 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/reference/sql/where.md +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/reference/sql/where.md @@ -75,4 +75,4 @@ SELECT * FROM go_info WHERE instance LIKE 'localhost:____'; ``` -有关在日志中搜索关键字,请阅读[查询日志](/user-guide/logs/query-logs.md)。 \ No newline at end of file +有关在日志中搜索关键字,请阅读[查询日志](/user-guide/logs/fulltext-search.md)。 diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/ingest-data/for-observability/fluent-bit.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/ingest-data/for-observability/fluent-bit.md index 250895e88..ee0d29f82 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/ingest-data/for-observability/fluent-bit.md +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/ingest-data/for-observability/fluent-bit.md @@ -42,7 +42,7 @@ description: 将 GreptimeDB 与 Fluent bit 集成以实现 Prometheus Remote Wri - `table` 是您要写入日志的表名称。 - `pipeline_name` 是您要用于处理日志的管道名称。 -本示例中,使用的是 [Logs Http API](/user-guide/logs/write-logs.md#http-api) 接口。如需更多信息,请参阅 [写入日志](/user-guide/logs/write-logs.md) 文档。 +本示例中,使用的是 [Logs Http API](/reference/pipeline/write-log-api.md#http-api) 接口。如需更多信息,请参阅 [写入日志](/user-guide/logs/use-custom-pipelines.md#使用-pipeline-写入日志) 文档。 ## OpenTelemetry diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/ingest-data/for-observability/kafka.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/ingest-data/for-observability/kafka.md index 1cd1a3667..0940c7849 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/ingest-data/for-observability/kafka.md +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/ingest-data/for-observability/kafka.md @@ -127,7 +127,7 @@ pipeline_name = "greptime_identity" #### 创建 pipeline 要创建自定义 pipeline, -请参阅[创建 pipeline](/user-guide/logs/quick-start.md#创建-pipeline) 和 [pipeline 配置](/user-guide/logs/pipeline-config.md)文档获取详细说明。 +请参阅[使用自定义 pipeline](/user-guide/logs/use-custom-pipelines.md)文档获取详细说明。 #### 写入数据 diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/ingest-data/for-observability/loki.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/ingest-data/for-observability/loki.md index 29bc29dc3..dd398e932 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/ingest-data/for-observability/loki.md +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/ingest-data/for-observability/loki.md @@ -184,7 +184,7 @@ transform: ``` pipeline 的配置相对直观: 使用 `vrl` 处理器将日志行解析为 JSON 对象,然后将其中的字段提取到根目录。 -`log_time` 在 transform 部分中被指定为时间索引,其他字段将由 pipeline 引擎自动推导,详见 [pipeline version 2](/user-guide/logs/pipeline-config.md#版本-2-中的-transform)。 +`log_time` 在 transform 部分中被指定为时间索引,其他字段将由 pipeline 引擎自动推导,详见 [pipeline version 2](/reference/pipeline/pipeline-config.md#版本-2-中的-transform)。 请注意,输入字段名为 `loki_line`,它包含来自 Loki 的原始日志行。 @@ -264,4 +264,4 @@ log_source: application 此输出演示了 pipeline 引擎已成功解析原始 JSON 日志行,并将结构化数据提取到单独的列中。 -有关 pipeline 配置和功能的更多详细信息,请参考[pipeline 文档](/user-guide/logs/pipeline-config.md)。 +有关 pipeline 配置和功能的更多详细信息,请参考[pipeline 文档](/reference/pipeline/pipeline-config.md)。 diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/ingest-data/for-observability/prometheus.md 
b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/ingest-data/for-observability/prometheus.md index 44a521c50..466596354 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/ingest-data/for-observability/prometheus.md +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/ingest-data/for-observability/prometheus.md @@ -289,7 +289,7 @@ mysql> select * from `go_memstats_mcache_inuse_bytes`; 2 rows in set (0.01 sec) ``` -更多配置详情请参考 [pipeline 相关文档](/user-guide/logs/pipeline-config.md)。 +更多配置详情请参考 [pipeline 相关文档](/reference/pipeline/pipeline-config.md)。 ## 性能优化 diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/fulltext-index-config.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/fulltext-index-config.md deleted file mode 100644 index e1448a274..000000000 --- a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/fulltext-index-config.md +++ /dev/null @@ -1,131 +0,0 @@ ---- -keywords: [全文索引, tantivy, bloom, 分析器, 大小写敏感, 配置] -description: GreptimeDB 全文索引配置的完整指南,包括后端选择和其他配置选项。 ---- - -# 全文索引配置 - -本文档提供了 GreptimeDB 全文索引配置的完整指南,包括后端选择和其他配置选项。 - -## 概述 - -GreptimeDB 提供全文索引功能以加速文本搜索操作。您可以在创建或修改表时配置全文索引,并提供各种选项以针对不同用例进行优化。有关 GreptimeDB 中不同类型索引(包括倒排索引和跳数索引)的概述,请参考[数据索引](/user-guide/manage-data/data-index)指南。 - -## 配置选项 - -在创建或修改全文索引时,您可以使用 `FULLTEXT INDEX WITH` 指定以下选项: - -### 基本选项 - -- `analyzer`:设置全文索引的语言分析器 - - 支持的值:`English`、`Chinese` - - 默认值:`English` - - 注意:由于中文文本分词的复杂性,中文分析器构建索引需要的时间显著更长。建议仅在中文文本搜索是主要需求时使用。 - -- `case_sensitive`:决定全文索引是否区分大小写 - - 支持的值:`true`、`false` - - 默认值:`false` - - 注意:设置为 `true` 可能会略微提高区分大小写查询的性能,但会降低不区分大小写查询的性能。此设置不会影响 `matches_term` 查询的结果。 - -- `backend`:设置全文索引的后端实现 - - 支持的值:`bloom`、`tantivy` - - 默认值:`bloom` - -- `granularity`:(适用于 `bloom` 后端)每个过滤器覆盖的数据块大小。粒度越小,过滤效果越好,但索引大小会增加。 - - 支持的值:正整数 - - 默认值:`10240` - -- `false_positive_rate`:(适用于 `bloom` 后端)错误识别块的概率。该值越低,准确性越高(过滤效果越好),但索引大小会增加。该值为介于 `0` 和 `1` 之间的浮点数。 - - 支持的值:介于 `0` 和 `1` 之间的浮点数 - - 默认值:`0.01` - -### 后端选择 - -GreptimeDB 提供两种全文索引后端用于高效日志搜索: - -1. **Bloom 后端** - - 最适合:通用日志搜索 - - 特点: - - 使用 Bloom 过滤器进行高效过滤 - - 存储开销较低 - - 在不同查询模式下性能稳定 - - 限制: - - 对于高选择性查询稍慢 - - 存储成本示例: - - 原始数据:约 10GB - - Bloom 索引:约 1GB - -2. 
**Tantivy 后端** - - 最适合:高选择性查询(如 TraceID 等唯一值) - - 特点: - - 使用倒排索引实现快速精确匹配 - - 对高选择性查询性能优异 - - 限制: - - 存储开销较高(接近原始数据大小) - - 对低选择性查询性能较慢 - - 存储成本示例: - - 原始数据:约 10GB - - Tantivy 索引:约 10GB - -### 性能对比 - -下表显示了不同查询方法之间的性能对比(以 Bloom 为基准): - -| 查询类型 | 高选择性(如 TraceID) | 低选择性(如 "HTTP") | -|------------|----------------------------------|--------------------------------| -| LIKE | 慢 50 倍 | 1 倍 | -| Tantivy | 快 5 倍 | 慢 5 倍 | -| Bloom | 1 倍(基准) | 1 倍(基准) | - -主要观察结果: -- 对于高选择性查询(如唯一值),Tantivy 提供最佳性能 -- 对于低选择性查询,Bloom 提供更稳定的性能 -- Bloom 在存储方面比 Tantivy 有明显优势(测试案例中为 1GB vs 10GB) - -## 配置示例 - -### 创建带全文索引的表 - -```sql --- 使用 Bloom 后端(大多数情况推荐) -CREATE TABLE logs ( - timestamp TIMESTAMP(9) TIME INDEX, - message STRING FULLTEXT INDEX WITH ( - backend = 'bloom', - analyzer = 'English', - case_sensitive = 'false' - ) -); - --- 使用 Tantivy 后端(用于高选择性查询) -CREATE TABLE logs ( - timestamp TIMESTAMP(9) TIME INDEX, - message STRING FULLTEXT INDEX WITH ( - backend = 'tantivy', - analyzer = 'English', - case_sensitive = 'false' - ) -); -``` - -### 修改现有表 - -```sql --- 在现有列上启用全文索引 -ALTER TABLE monitor -MODIFY COLUMN load_15 -SET FULLTEXT INDEX WITH ( - analyzer = 'English', - case_sensitive = 'false', - backend = 'bloom' -); - --- 更改全文索引配置 -ALTER TABLE logs -MODIFY COLUMN message -SET FULLTEXT INDEX WITH ( - analyzer = 'English', - case_sensitive = 'false', - backend = 'tantivy' -); -``` diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/query-logs.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/fulltext-search.md similarity index 90% rename from i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/query-logs.md rename to i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/fulltext-search.md index 24ea1454d..0ed21f74f 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/query-logs.md +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/fulltext-search.md @@ -3,17 +3,15 @@ keywords: [日志查询, GreptimeDB 查询语言, matches_term, 模式匹配, description: 详细介绍如何利用 GreptimeDB 的查询语言对日志数据进行高效搜索和分析,包括使用 matches_term 函数进行精确匹配。 --- -# 日志查询 +# 全文搜索 本文档详细介绍如何利用 GreptimeDB 的查询语言对日志数据进行高效搜索和分析。 -## 概述 - -GreptimeDB 支持通过 SQL 语句灵活查询数据。本节将介绍特定的搜索功能和查询语句,帮助您提升日志查询效率。 +GreptimeDB 支持通过 SQL 语句灵活查询数据。本节将介绍特定的搜索功能和查询语句,帮助你提升日志查询效率。 ## 使用 `matches_term` 函数进行精确匹配 -在 SQL 查询中,您可以使用 `matches_term` 函数执行精确的词语/短语匹配,这在日志分析中尤其实用。`matches_term` 函数支持对 `String` 类型列进行精确匹配。您也可以使用 `@@` 操作符作为 `matches_term` 的简写形式。下面是一个典型示例: +在 SQL 查询中,你可以使用 `matches_term` 函数执行精确的词语/短语匹配,这在日志分析中尤其实用。`matches_term` 函数支持对 `String` 类型列进行精确匹配。你也可以使用 `@@` 操作符作为 `matches_term` 的简写形式。下面是一个典型示例: ```sql -- 使用 matches_term 函数 @@ -45,7 +43,7 @@ SELECT * FROM logs WHERE matches_term(message, 'error'); SELECT * FROM logs WHERE message @@ 'error'; ``` -此查询将返回所有 `message` 列中包含完整词语 "error" 的记录。该函数确保您不会得到部分匹配或词语内的匹配。 +此查询将返回所有 `message` 列中包含完整词语 "error" 的记录。该函数确保你不会得到部分匹配或词语内的匹配。 匹配和不匹配的示例: - ✅ "An error occurred!" 
- 匹配,因为 "error" 是一个完整词语 @@ -57,7 +55,7 @@ SELECT * FROM logs WHERE message @@ 'error'; ### 多关键词搜索 -您可以使用 `OR` 运算符组合多个 `matches_term` 条件来搜索包含多个关键词中任意一个的日志。当您想要查找可能包含不同错误变体或不同类型问题的日志时,这很有用。 +你可以使用 `OR` 运算符组合多个 `matches_term` 条件来搜索包含多个关键词中任意一个的日志。当你想要查找可能包含不同错误变体或不同类型问题的日志时,这很有用。 ```sql -- 使用 matches_term 函数 @@ -78,7 +76,7 @@ SELECT * FROM logs WHERE message @@ 'critical' OR message @@ 'error'; ### 排除条件搜索 -您可以使用 `NOT` 运算符与 `matches_term` 结合来从搜索结果中排除某些词语。当您想要查找包含一个词语但不包含另一个词语的日志时,这很有用。 +你可以使用 `NOT` 运算符与 `matches_term` 结合来从搜索结果中排除某些词语。当你想要查找包含一个词语但不包含另一个词语的日志时,这很有用。 ```sql -- 使用 matches_term 函数 @@ -97,7 +95,7 @@ SELECT * FROM logs WHERE message @@ 'error' AND NOT message @@ 'critical'; ### 多条件必要搜索 -您可以使用 `AND` 运算符要求日志消息中必须存在多个词语。这对于查找包含特定词语组合的日志很有用。 +你可以使用 `AND` 运算符要求日志消息中必须存在多个词语。这对于查找包含特定词语组合的日志很有用。 ```sql -- 使用 matches_term 函数 @@ -136,7 +134,7 @@ SELECT * FROM logs WHERE message @@ 'system failure'; ### 不区分大小写匹配 -虽然 `matches_term` 默认区分大小写,但您可以通过在匹配前将文本转换为小写来实现不区分大小写的匹配。 +虽然 `matches_term` 默认区分大小写,但你可以通过在匹配前将文本转换为小写来实现不区分大小写的匹配。 ```sql -- 使用 matches_term 函数 diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/manage-pipelines.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/manage-pipelines.md index 82980c169..db41a932c 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/manage-pipelines.md +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/manage-pipelines.md @@ -5,17 +5,17 @@ description: 介绍如何在 GreptimeDB 中管理 Pipeline,包括创建、删 # 管理 Pipeline -在 GreptimeDB 中,每个 `pipeline` 是一个数据处理单元集合,用于解析和转换写入的日志内容。本文档旨在指导您如何创建和删除 Pipeline,以便高效地管理日志数据的处理流程。 +在 GreptimeDB 中,每个 `pipeline` 是一个数据处理单元集合,用于解析和转换写入的日志内容。本文档旨在指导你如何创建和删除 Pipeline,以便高效地管理日志数据的处理流程。 -有关 Pipeline 的具体配置,请阅读 [Pipeline 配置](pipeline-config.md)。 +有关 Pipeline 的具体配置,请阅读 [Pipeline 配置](/reference/pipeline/pipeline-config.md)。 ## 鉴权 在使用 HTTP API 进行 Pipeline 管理时,你需要提供有效的鉴权信息。 请参考[鉴权](/user-guide/protocols/http.md#鉴权)文档了解详细信息。 -## 创建 Pipeline +## 上传 Pipeline GreptimeDB 提供了专用的 HTTP 接口用于创建 Pipeline。 假设你已经准备好了一个 Pipeline 配置文件 pipeline.yaml,使用以下命令上传配置文件,其中 `test` 是你指定的 Pipeline 的名称: @@ -29,6 +29,22 @@ curl -X "POST" "http://localhost:4000/v1/pipelines/test" \ 你可以在所有 Database 中使用创建的 Pipeline。 +## Pipeline 版本 + +你可以使用相同的名称上传多个版本的 pipeline。 +每次你使用现有名称上传 pipeline 时,都会自动创建一个新版本。 +你可以在[写入日志](/reference/pipeline/write-log-api.md#http-api)、[查询](#查询-pipeline)或[删除](#删除-pipeline) pipeline 时指定要使用的版本。 +如果未指定版本,默认使用最后上传的版本。 + +成功上传 pipeline 后,响应将包含版本信息: + +```json +{"name":"nginx_pipeline","version":"2024-06-27 12:02:34.257312110Z"} +``` + +版本是 UTC 格式的时间戳,表示 pipeline 的创建时间。 +此时间戳作为每个 pipeline 版本的唯一标识符。 + ## 删除 Pipeline 可以使用以下 HTTP 接口删除 Pipeline: @@ -129,7 +145,7 @@ transform: SELECT * FROM greptime_private.pipelines; ``` -请注意,如果您使用 MySQL 或者 PostgreSQL 协议作为连接 GreptimeDB 的方式,查询出来的 Pipeline 时间信息精度可能有所不同,可能会丢失纳秒级别的精度。 +请注意,如果你使用 MySQL 或者 PostgreSQL 协议作为连接 GreptimeDB 的方式,查询出来的 Pipeline 时间信息精度可能有所不同,可能会丢失纳秒级别的精度。 为了解决这个问题,可以将 `created_at` 字段强制转换为 timestamp 来查看 Pipeline 的创建时间。例如,下面的查询将 `created_at` 以 `bigint` 的格式展示: @@ -319,3 +335,119 @@ curl -X "POST" "http://localhost:4000/v1/pipelines/dryrun?pipeline_name=test" \ ``` 可以看到,`1998.08` 字符串中的 `.` 已经被替换为 `-`,Pipeline 处理成功。 + +## 从 Pipeline 配置生成表的建表语句 + +使用 Pipeline 时,GreptimeDB 默认会在首次数据写入时自动创建目标表。 +但是,你可能希望预先手动创建表以添加自定义表选项,例如添加分区规则以获得更好的性能。 + +虽然自动创建的表结构对于给定的 Pipeline 配置是确定的, +但根据配置手动编写表的建表语句可能会很繁琐。`/ddl` API 简化了这一过程。 + +对于现有的 Pipeline,你可以使用 `/v1/pipelines/{pipeline_name}/ddl` 来生成建表语句。 +此 API 会检查 Pipeline 配置中的 transform 
定义并推断出相应的表结构。 +你可以在第一次写入数据之前使用此 API 来生成基础的建表语句,进行参数调整并手动建表。 +常见的调整选项包括: +- 增加[数据分区规则](/user-guide/deployments-administration/manage-data/table-sharding.md) +- 调整[索引的参数](/user-guide/manage-data/data-index.md) +- 增加其他[表选项](/reference/sql/create.md#表选项) + +以下是演示如何使用此 API 的示例。考虑以下 Pipeline 配置: +```YAML +# pipeline.yaml +processors: +- dissect: + fields: + - message + patterns: + - '%{ip_address} - %{username} [%{timestamp}] "%{http_method} %{request_line} %{protocol}" %{status_code} %{response_size}' + ignore_missing: true +- date: + fields: + - timestamp + formats: + - "%d/%b/%Y:%H:%M:%S %z" + +transform: + - fields: + - timestamp + type: time + index: timestamp + - fields: + - ip_address + type: string + index: skipping + - fields: + - username + type: string + tag: true + - fields: + - http_method + type: string + index: inverted + - fields: + - request_line + type: string + index: fulltext + - fields: + - protocol + type: string + - fields: + - status_code + type: int32 + index: inverted + tag: true + - fields: + - response_size + type: int64 + on_failure: default + default: 0 + - fields: + - message + type: string +``` + +首先,使用以下命令将 Pipeline 上传到数据库: +```bash +curl -X "POST" "http://localhost:4000/v1/pipelines/pp" -F "file=@pipeline.yaml" +``` +然后,使用以下命令查询表的建表语句: +```bash +curl -X "GET" "http://localhost:4000/v1/pipelines/pp/ddl?table=test_table" +``` +API 返回以下 JSON 格式的输出: +```JSON +{ + "sql": { + "sql": "CREATE TABLE IF NOT EXISTS `test_table` (\n `timestamp` TIMESTAMP(9) NOT NULL,\n `ip_address` STRING NULL SKIPPING INDEX WITH(false_positive_rate = '0.01', granularity = '10240', type = 'BLOOM'),\n `username` STRING NULL,\n `http_method` STRING NULL INVERTED INDEX,\n `request_line` STRING NULL FULLTEXT INDEX WITH(analyzer = 'English', backend = 'bloom', case_sensitive = 'false', false_positive_rate = '0.01', granularity = '10240'),\n `protocol` STRING NULL,\n `status_code` INT NULL INVERTED INDEX,\n `response_size` BIGINT NULL,\n `message` STRING NULL,\n TIME INDEX (`timestamp`),\n PRIMARY KEY (`username`, `status_code`)\n)\nENGINE=mito\nWITH(\n append_mode = 'true'\n)" + }, + "execution_time_ms": 3 +} +``` +格式化响应中的 `sql` 字段后,你可以看到推断出的表结构: +```SQL +CREATE TABLE IF NOT EXISTS `test_table` ( + `timestamp` TIMESTAMP(9) NOT NULL, + `ip_address` STRING NULL SKIPPING INDEX WITH(false_positive_rate = '0.01', granularity = '10240', type = 'BLOOM'), + `username` STRING NULL, + `http_method` STRING NULL INVERTED INDEX, + `request_line` STRING NULL FULLTEXT INDEX WITH(analyzer = 'English', backend = 'bloom', case_sensitive = 'false', false_positive_rate = '0.01', granularity = '10240'), + `protocol` STRING NULL, + `status_code` INT NULL INVERTED INDEX, + `response_size` BIGINT NULL, + `message` STRING NULL, + TIME INDEX (`timestamp`), + PRIMARY KEY (`username`, `status_code`) + ) +ENGINE=mito +WITH( + append_mode = 'true' +) +``` + +你可以将推断出的表的建表语句作为起点。 +根据你的需求自定义建表语句后,在通过 Pipeline 写入数据之前手动执行它。 + +**注意事项:** +1. 该 API 仅从 Pipeline 配置推断表结构;它不会检查表是否已存在。 +2. 
该 API 不考虑表后缀。如果你在 Pipeline 配置中使用 `dispatcher`、`table_suffix` 或表后缀 hint,你需要手动调整表名。
diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/overview.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/overview.md
index 6531d9093..e59f8d3a5 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/overview.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/overview.md
@@ -1,16 +1,105 @@
 ---
-keywords: [日志, GreptimeDB, 日志写入, 日志配置, 查询日志]
-description: 提供了使用 GreptimeDB 日志服务的各种指南,包括快速开始、Pipeline 配置、管理 Pipeline、写入日志、查询日志和全文索引配置。
+keywords: [log service, quick start, pipeline configuration, manage pipelines, query logs]
+description: GreptimeDB 日志管理功能的综合指南,包括日志收集架构、Pipeline 处理、与 Vector 和 Kafka 等流行的日志收集器的集成以及使用全文搜索的高级查询。
 ---
 
 # 日志
 
-本节内容将涵盖 GreptimeDB 针对日志的功能介绍,从基本的写入查询,到高级功能,诸如
-数据变换、全文索引等。
+GreptimeDB 提供了专为满足现代可观测需求而设计的日志管理解决方案,
+它可以和主流日志收集器无缝集成,
+提供了使用 pipeline 灵活转换日志的能力,
+以及包括全文搜索在内的查询功能。
+
+核心功能点包括:
+
+- **统一存储**:将日志与指标和 Trace 数据一起存储在单个数据库中
+- **Pipeline 处理数据**:使用可自定义的 pipeline 转换和丰富原始日志,支持多种日志收集器和格式
+- **高级查询**:基于 SQL 的分析,并具有全文搜索功能
+- **实时数据处理**:实时处理和查询日志以进行监控和告警
+
+## 日志收集流程
+
+![log-collection-flow](/log-collection-flow.drawio.svg)
+
+上图展示了日志收集的整体架构,
+它包括四个阶段:日志源、日志收集器、Pipeline 处理和存储到 GreptimeDB。
+
+### 日志源
+
+日志源是基础设施中产生日志数据的基础层。
+GreptimeDB 支持从各种源写入数据以满足全面的可观测性需求:
+
+- **应用程序**:来自微服务架构、Web 应用程序、移动应用程序和自定义软件组件的应用程序级日志
+- **IoT 设备**:来自物联网生态系统的设备日志、传感器事件日志和运行状态日志
+- **基础设施**:云平台日志、容器编排日志(Kubernetes、Docker)、负载均衡器日志以及网络基础设施组件日志
+- **系统组件**:操作系统日志、内核事件、系统守护进程日志以及硬件监控日志
+- **自定义源**:特定于你环境或应用程序的任何其他日志源
+
+### 日志收集器
+
+日志收集器负责高效地从各种源收集日志数据并转发到存储后端。
+GreptimeDB 可以与行业标准的日志收集器无缝集成,
+包括 Vector、Fluent Bit、Apache Kafka、OpenTelemetry Collector 等。
+
+GreptimeDB 作为这些收集器的 sink 后端,
+提供强大的数据写入能力。
+在写入过程中,GreptimeDB 的 pipeline 系统能够实时转换和丰富日志数据,
+确保在存储前获得最佳的结构和质量。
+
+### Pipeline 处理
+
+GreptimeDB 的 pipeline 机制将原始日志转换为结构化、可查询的数据:
+
+- **解析**:从非结构化日志消息中提取结构化数据
+- **转换**:使用额外的上下文和元数据丰富日志
+- **索引**:配置必要的索引以提升查询性能,例如全文索引、时间索引等
+
+### 存储日志到 GreptimeDB
+
+通过 pipeline 处理后,日志存储在 GreptimeDB 中,支持灵活的分析和可视化:
+
+- **SQL 查询**:使用熟悉的 SQL 语法分析日志数据
+- **基于时间的分析**:利用时间序列功能进行时间分析
+- **全文搜索**:在日志消息中执行高级文本搜索
+- **实时分析**:实时查询日志进行监控和告警
+
+## 快速开始
+
+你可以使用内置的 `greptime_identity` pipeline 快速开始日志写入。更多信息请参考[快速开始](./quick-start.md)指南。
+
+## 集成到日志收集器
+
+GreptimeDB 与各种日志收集器无缝集成,提供全面的日志记录解决方案。集成过程包括以下关键步骤:
+
+1. **选择合适的日志收集器**:根据你的基础设施要求、数据源和性能需求选择收集器
+2. **分析输出格式**:了解你选择的收集器产生的日志格式和结构
+3. **配置 Pipeline**:在 GreptimeDB 中创建和配置 pipeline 来解析、转换和丰富传入的日志数据
+4. 
**存储和查询**:在 GreptimeDB 中高效存储处理后的日志,用于实时分析和监控 + +要成功将你的日志收集器与 GreptimeDB 集成,你需要: +- 首先了解 pipeline 在 GreptimeDB 中的工作方式 +- 然后在你的日志收集器中配置 sink 设置,将数据发送到 GreptimeDB + +请参考以下指南获取将 GreptimeDB 集成到日志收集器的详细说明: + +- [Vector](/user-guide/ingest-data/for-observability/vector.md#using-greptimedb_logs-sink-recommended) +- [Kafka](/user-guide/ingest-data/for-observability/kafka.md#logs) +- [Fluent Bit](/user-guide/ingest-data/for-observability/fluent-bit.md#http) +- [OpenTelemetry Collector](/user-guide/ingest-data/for-observability/otel-collector.md) +- [Loki](/user-guide/ingest-data/for-observability/loki.md#using-pipeline-with-loki-push-api) + +## 了解更多关于 Pipeline 的信息 + +- [使用自定义 Pipeline](./use-custom-pipelines.md):解释如何创建和使用自定义 pipeline 进行日志写入。 +- [管理 Pipeline](./manage-pipelines.md):解释如何创建和删除 pipeline。 + +## 查询日志 + +- [全文搜索](./fulltext-search.md):使用 GreptimeDB 查询语言有效搜索和分析日志数据的指南。 + +## 参考 + +- [内置 Pipeline](/reference/pipeline/built-in-pipelines.md):GreptimeDB 为日志写入提供的内置 pipeline 详细信息。 +- [写入日志的 API](/reference/pipeline/write-log-api.md):描述向 GreptimeDB 写入日志的 HTTP API。 +- [Pipeline 配置](/reference/pipeline/pipeline-config.md):提供 GreptimeDB 中 pipeline 各项具体配置的信息。 -- [快速开始](./quick-start.md):介绍了如何快速开始使用 GreptimeDB 日志服务。 -- [Pipeline 配置](./pipeline-config.md):深入介绍 GreptimeDB 中的 Pipeline 的每项具体配置。 -- [管理 Pipeline](./manage-pipelines.md):介绍了如何创建、删除 Pipeline。 -- [配合 Pipeline 写入日志](./write-logs.md): 详细说明了如何结合 Pipeline 机制高效写入日志数据。 -- [查询日志](./query-logs.md):描述了如何使用 GreptimeDB SQL 接口查询日志。 -- [全文索引配置](./fulltext-index-config.md):介绍了如何配置全文索引。 diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/quick-start.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/quick-start.md index 157a24596..c365e6d1d 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/quick-start.md +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/quick-start.md @@ -1,326 +1,122 @@ --- -keywords: [快速开始, 写入日志, 查询日志, 直接写入, 使用 Pipeline, 创建表, 插入日志, gRPC 协议, JSON 日志, 自定义 Pipeline] -description: 介绍如何快速开始写入和查询日志,包括直接写入日志和使用 Pipeline 写入日志的方法,以及两者的区别。 +keywords: [logs, log service, pipeline, greptime_identity, quick start, JSON logs] +description: GreptimeDB 日志服务快速入门指南,包括使用内置 greptime_identity pipeline 的基本日志写入和与日志收集器的集成。 --- -# 快速开始 +# 快速入门 -本指南逐步讲解如何在 GreptimeDB 中快速写入和查询日志。 +本指南将引导你完成使用 GreptimeDB 日志服务的基本步骤。 +你将学习如何使用内置的 `greptime_identity` pipeline 写入日志并集成日志收集器。 -GreptimeDB 支持可以将结构化日志消息解析并转换为多列的 Pipeline 机制, -以实现高效的存储和查询。 +GreptimeDB 提供了强大的基于 pipeline 的日志写入系统。 +你可以使用内置的 `greptime_identity` pipeline 快速写入 JSON 格式的日志, +该 pipeline 具有以下特点: -对于非结构化的日志,你可以不使用 Pipeline,直接将日志写入表。 +- 自动处理从 JSON 到表列的字段映射 +- 如果表不存在则自动创建表 +- 灵活支持变化的日志结构 +- 需要最少的配置即可开始使用 -## 使用 Pipeline 写入日志 +## 直接通过 HTTP 写入日志 -使用 pipeline 可以自动将日志消息格式化并转换为多个列,并自动创建和修改表结构。 +GreptimeDB 日志写入最简单的方法是通过使用 `greptime_identity` pipeline 发送 HTTP 请求。 -### 使用内置 Pipeline 写入 JSON 日志 - -GreptimeDB 提供了一个内置 pipeline `greptime_identity` 用于处理 JSON 日志格式。该 pipeline 简化了写入 JSON 日志的过程。 +例如,你可以使用 `curl` 发送带有 JSON 日志数据的 POST 请求: ```shell curl -X POST \ - "http://localhost:4000/v1/ingest?db=public&table=pipeline_logs&pipeline_name=greptime_identity" \ + "http://localhost:4000/v1/ingest?db=public&table=demo_logs&pipeline_name=greptime_identity" \ -H "Content-Type: application/json" \ -H "Authorization: Basic {{authentication}}" \ -d '[ { - "name": "Alice", - "age": 20, - "is_student": true, - "score": 90.5, - "object": { "a": 1, "b": 2 } - }, - { - "age": 21, - "is_student": false, - "score": 85.5, - "company": "A", - "whatever": 
null + "timestamp": "2024-01-15T10:30:00Z", + "level": "INFO", + "service": "web-server", + "message": "用户登录成功", + "user_id": 12345, + "ip_address": "192.168.1.100" }, { - "name": "Charlie", - "age": 22, - "is_student": true, - "score": 95.5, - "array": [1, 2, 3] + "timestamp": "2024-01-15T10:31:00Z", + "level": "ERROR", + "service": "database", + "message": "连接超时", + "error_code": 500, + "retry_count": 3 } ]' ``` -- [`鉴权`](/user-guide/protocols/http.md#鉴权) HTTP header。 -- `pipeline_name=greptime_identity` 指定了内置 pipeline。 -- `table=pipeline_logs` 指定了目标表。如果表不存在,将自动创建。 -`greptime_identity` pipeline 将自动为 JSON 日志中的每个字段创建列。成功执行命令将返回: - -```json -{"output":[{"affectedrows":3}],"execution_time_ms":9} -``` - -有关 `greptime_identity` pipeline 的更多详细信息,请参阅 [写入日志](write-logs.md#greptime_identity) 文档。 +关键参数包括: -### 使用自定义 Pipeline 写入日志 +- `db=public`:目标数据库名称(你的数据库名称) +- `table=demo_logs`:目标表名称(如果不存在则自动创建) +- `pipeline_name=greptime_identity`:使用 `greptime_identity` pipeline 进行 JSON 处理 +- `Authorization` 头:使用 base64 编码的 `username:password` 进行基本身份验证,请参阅 [HTTP 鉴权指南](/user-guide/protocols/http.md#authentication) -自定义 pipeline 允许你解析结构的日志消息并将其转换为多列,并自动创建表。 - -#### 创建 Pipeline - -GreptimeDB 提供了一个专用的 HTTP 接口来创建 pipeline。方法如下: - -首先,创建一个 pipeline 文件,例如 `pipeline.yaml`。 - -```yaml -version: 2 -processors: - - dissect: - fields: - - message - patterns: - - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"' - ignore_missing: true - - date: - fields: - - timestamp - formats: - - "%d/%b/%Y:%H:%M:%S %z" - - select: - type: exclude - fields: - - message - -transform: - - fields: - - ip_address - type: string - index: inverted - tag: true - - fields: - - status_code - type: int32 - index: inverted - tag: true - - fields: - - request_line - - user_agent - type: string - index: fulltext - - fields: - - response_size - type: int32 - - fields: - - timestamp - type: time - index: timestamp -``` - -该 pipeline 使用指定的模式拆分 `message` 字段以提取 `ip_address`、`timestamp`、`http_method`、`request_line`、`status_code`、`response_size` 和 `user_agent`。 -然后,它使用格式 `%d/%b/%Y:%H:%M:%S %z` 解析 `timestamp` 字段,将其转换为数据库可以理解的正确时间戳格式。 -最后,它将每个字段转换为适当的数据类型并相应地建立索引。 -注意到在 pipeline 的最开始我们使用了版本 2 格式,详情请参考[这个文档](./pipeline-config.md#版本-2-中的-transform)。 -简而言之,在版本 2 下 pipeline 引擎会自动查找所有没有在 transform 模块中指定的字段,并使用默认的数据类型将他们持久化到数据库中。 -你可以在[后续章节](#使用-pipeline-与直接写入非结构化日志的区别)中看到,虽然 `http_method` 没有在 transform 模块中被指定,但它依然被写入到了数据库中。 -另外,`select` 处理器被用于过滤原始的 `message` 字段。 -需要注意的是,`request_line` 和 `user_agent` 字段被索引为 `fulltext` 以优化全文搜索查询,且表中必须有一个由 `timestamp` 指定的时间索引列。 - -执行以下命令上传配置文件: - -```shell -curl -X "POST" \ - "http://localhost:4000/v1/pipelines/nginx_pipeline" \ - -H 'Authorization: Basic {{authentication}}' \ - -F "file=@pipeline.yaml" -``` - -成功执行此命令后,将创建一个名为 `nginx_pipeline` 的 pipeline,返回的结果如下: +成功的请求返回: ```json -{"name":"nginx_pipeline","version":"2024-06-27 12:02:34.257312110Z"}. 
+{ + "output": [{"affectedrows": 2}], + "execution_time_ms": 15 +} ``` -你可以为同一 pipeline 名称创建多个版本。 -所有 pipeline 都存储在 `greptime_private.pipelines` 表中。 -请参阅[查询 Pipelines](manage-pipelines.md#查询-pipeline)以查看表中的 pipeline 数据。 - -#### 写入日志 - -以下示例将日志写入 `custom_pipeline_logs` 表,并使用 `nginx_pipeline` pipeline 格式化和转换日志消息。 - -```shell -curl -X POST \ - "http://localhost:4000/v1/ingest?db=public&table=custom_pipeline_logs&pipeline_name=nginx_pipeline" \ - -H "Content-Type: application/json" \ - -H "Authorization: Basic {{authentication}}" \ - -d '[ - { - "message": "127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\"" - }, - { - "message": "192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\"" - }, - { - "message": "10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\"" - }, - { - "message": "172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\"" - } - ]' -``` - -如果命令执行成功,你将看到以下输出: - -```json -{"output":[{"affectedrows":4}],"execution_time_ms":79} -``` - -## 直接写入非结构化的日志 - -如果你的日志消息是非结构化文本, -你可以将其直接写入数据库。 -但是这种方法限制了数据库执行高性能分析的能力。 - -### 创建表 - -你需要在插入日志之前创建一个表来存储日志。 -使用以下 SQL 语句创建一个名为 `origin_logs` 的表: - -* `message` 列上的 `FULLTEXT INDEX` 可优化文本搜索查询 -* 将 `append_mode` 设置为 `true` 表示以附加行的方式写入数据,不对历史数据做覆盖。 +成功写入日志后, +相应的表 `demo_logs` 会根据 JSON 字段自动创建相应的列,其 schema 如下: ```sql -CREATE TABLE `origin_logs` ( - `message` STRING FULLTEXT INDEX, - `time` TIMESTAMP TIME INDEX -) WITH ( - append_mode = 'true' -); -``` - -### 插入日志 - -#### 使用 SQL 协议写入 - -使用 `INSERT` 语句将日志插入表中。 - -```sql -INSERT INTO origin_logs (message, time) VALUES -('127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"', '2024-05-25 20:16:37.217'), -('192.168.1.1 - - [25/May/2024:20:17:37 +0000] "POST /api/login HTTP/1.1" 200 1784 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"', '2024-05-25 20:17:37.217'), -('10.0.0.1 - - [25/May/2024:20:18:37 +0000] "GET /images/logo.png HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"', '2024-05-25 20:18:37.217'), -('172.16.0.1 - - [25/May/2024:20:19:37 +0000] "GET /contact HTTP/1.1" 404 162 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1"', '2024-05-25 20:19:37.217'); -``` - -上述 SQL 将整个日志文本插入到一个列中,除此之外,你必须为每条日志添加一个额外的时间戳。 - -#### 使用 gRPC 协议写入 - -你也可以使用 gRPC 协议写入日志,这是一个更高效的方法。 - -请参阅[使用 gRPC 写入数据](/user-guide/ingest-data/for-iot/grpc-sdks/overview.md)以了解如何使用 gRPC 协议写入日志。 - - -## 使用 Pipeline 与直接写入非结构化日志的区别 - -在上述示例中, -使用 pipeline 写入日志的方式自动创建了表 `custom_pipeline_logs`, -直接写入日志的方式创建了表 `origin_logs`, -让我们来探讨这两个表之间的区别。 - -```sql -DESC custom_pipeline_logs; -``` - -```sql -+---------------+---------------------+------+------+---------+---------------+ -| Column | Type | 
Key | Null | Default | Semantic Type | -+---------------+---------------------+------+------+---------+---------------+ -| ip_address | String | PRI | YES | | TAG | -| status_code | Int32 | PRI | YES | | TAG | -| request_line | String | | YES | | FIELD | -| user_agent | String | | YES | | FIELD | -| response_size | Int32 | | YES | | FIELD | -| timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP | -| http_method | String | | YES | | FIELD | -+---------------+---------------------+------+------+---------+---------------+ -7 rows in set (0.00 sec) -``` - -```sql -DESC origin_logs; -``` - -```sql -+---------+----------------------+------+------+---------+---------------+ -| Column | Type | Key | Null | Default | Semantic Type | -+---------+----------------------+------+------+---------+---------------+ -| message | String | | YES | | FIELD | -| time | TimestampMillisecond | PRI | NO | | TIMESTAMP | -+---------+----------------------+------+------+---------+---------------+ -``` - -从表结构中可以看到,`origin_logs` 表只有两列,整个日志消息存储在一个列中。 -而 `custom_pipeline_logs` 表将日志消息存储在多个列中。 - -推荐使用 pipeline 方法将日志消息拆分为多个列,这样可以精确查询某个特定列中的某个值。 -与全文搜索相比,列匹配查询在处理字符串时具有以下几个优势: - -- **性能效率**:列的匹配查询通常都比全文搜索更快。 -- **资源消耗**:由于 GreptimeDB 的存储引擎是列存,结构化的数据更利于数据的压缩,并且 Tag 匹配查询使用的倒排索引,其资源消耗通常显著少于全文索引,尤其是在存储大小方面。 -- **可维护性**:精确匹配查询简单明了,更易于理解、编写和调试。 - -当然,如果需要在大段文本中进行关键词搜索,依然需要使用全文搜索,因为它就是专门为此设计。 - -## 查询日志 - -以 `custom_pipeline_logs` 表为例查询日志。 - -### 按 Tag 查询日志 - -对于 `custom_pipeline_logs` 中的多个 Tag 列,你可以灵活地按 Tag 查询数据。 -例如,查询 `status_code` 为 `200` 且 `http_method` 为 `GET` 的日志。 - -```sql -SELECT * FROM custom_pipeline_logs WHERE status_code = 200 AND http_method = 'GET'; -``` - -```sql -+------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -| ip_address | status_code | request_line | user_agent | response_size | timestamp | http_method | -+------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -| 127.0.0.1 | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 | GET | -+------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -1 row in set (0.02 sec) -``` - -### 全文搜索 - -对于 `request_line` 和 `user_agent` 文本字段,你可以使用 `matches_term` 函数查询日志。 -为了提高全文搜索的性能,我们在[创建 Pipeline](#创建-pipeline) 时为这两个列创建了全文索引。 - -例如,查询 `request_line` 包含 `/index.html` 或 `/api/login` 的日志。 - -```sql -SELECT * FROM custom_pipeline_logs WHERE matches_term(request_line, '/index.html') OR matches_term(request_line, '/api/login'); -``` - -```sql -+-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -| ip_address | status_code | request_line | user_agent | response_size | timestamp | http_method | 
-+-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -| 127.0.0.1 | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 | GET | -| 192.168.1.1 | 200 | /api/login HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 | 1784 | 2024-05-25 20:17:37 | POST | -+-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -2 rows in set (0.00 sec) -``` - -你可以参阅[全文搜索](query-logs.md)文档以获取 `matches_term` 的详细用法。 ++--------------------+---------------------+------+------+---------+---------------+ +| Column | Type | Key | Null | Default | Semantic Type | ++--------------------+---------------------+------+------+---------+---------------+ +| greptime_timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP | +| ip_address | String | | YES | | FIELD | +| level | String | | YES | | FIELD | +| message | String | | YES | | FIELD | +| service | String | | YES | | FIELD | +| timestamp | String | | YES | | FIELD | +| user_id | Int64 | | YES | | FIELD | +| error_code | Int64 | | YES | | FIELD | +| retry_count | Int64 | | YES | | FIELD | ++--------------------+---------------------+------+------+---------+---------------+ +``` + +## 与日志收集器集成 + +对于生产环境, +你通常会使用日志收集器自动将日志转发到 GreptimeDB。 +以下是如何配置 Vector 使用 `greptime_identity` pipeline 向 GreptimeDB 发送日志的示例: + +```toml +[sinks.my_sink_id] +type = "greptimedb_logs" +dbname = "public" +endpoint = "http://:4000" +pipeline_name = "greptime_identity" +table = "
" +username = "" +password = "" +# 根据需要添加其他配置 +``` + +关键配置参数包括: +- `type = "greptimedb_logs"`:指定 GreptimeDB 日志接收器 +- `dbname`:目标数据库名称 +- `endpoint`:GreptimeDB HTTP 端点 +- `pipeline_name`:使用 `greptime_identity` pipeline 进行 JSON 处理 +- `table`:目标表名称(如果不存在则自动创建) +- `username` 和 `password`:HTTP 基本身份验证的凭证 + +有关 Vector 配置和选项的详细信息, +请参阅 [Vector 集成指南](/user-guide/ingest-data/for-observability/vector.md#使用-greptimedb_logs-sink-推荐)。 ## 下一步 -你现在已经体验了 GreptimeDB 的日志记录功能,可以通过以下文档进一步探索: +你已成功写入了第一批日志,以下是推荐的后续步骤: + +- **了解更多关于内置 Pipeline 的行为**:请参阅[内置 Pipeline](/reference/pipeline/built-in-pipelines.md)指南,了解可用的内置 pipeline 及其配置的详细信息 +- **与流行的日志收集器集成**:有关将 GreptimeDB 与 Fluent Bit、Fluentd 等各种日志收集器集成的详细说明,请参阅[日志概览](./overview.md)中的[集成到日志收集器](./overview.md#集成到日志收集器)部分 +- **使用自定义 Pipeline**:要了解使用自定义 pipeline 进行高级日志处理和转换的信息,请参阅[使用自定义 Pipeline](./use-custom-pipelines.md)指南 -- [Pipeline 配置](./pipeline-config.md): 提供 GreptimeDB 中每个 pipeline 配置的深入信息。 -- [管理 Pipeline](./manage-pipelines.md): 解释如何创建和删除 pipeline。 -- [使用 Pipeline 写入日志](./write-logs.md): 介绍利用 pipeline 机制写入日志数据的详细说明。 -- [查询日志](./query-logs.md): 描述如何使用 GreptimeDB SQL 接口查询日志。 diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/use-custom-pipelines.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/use-custom-pipelines.md new file mode 100644 index 000000000..cea036a4d --- /dev/null +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/use-custom-pipelines.md @@ -0,0 +1,318 @@ +--- +keywords: [快速开始, 写入日志, 查询日志, pipeline, 结构化数据, 日志写入, 日志收集, 日志管理工具] +description: 在 GreptimeDB 中快速写入和查询日志的全面指南,包括直接日志写入和使用 pipeline 处理结构化数据。 +--- + +# 使用自定义 Pipeline + +基于你的 pipeline 配置, +GreptimeDB 能够将日志自动解析和转换为多列的结构化数据, +当内置 pipeline 无法处理特定的文本日志格式时, +你可以创建自定义 pipeline 来定义如何根据你的需求解析和转换日志数据。 + +## 识别你的原始日志格式 + +在创建自定义 pipeline 之前,了解原始日志数据的格式至关重要。 +如果你正在使用日志收集器且不确定日志格式, +有两种方法可以检查你的日志: + +1. **阅读收集器的官方文档**:配置你的收集器将数据输出到控制台或文件以检查日志格式。 +2. **使用 `greptime_identity` pipeline**:使用内置的 `greptime_identity` pipeline 将示例日志直接写入到 GreptimeDB 中。 + `greptime_identity` pipeline 将整个文本日志视为单个 `message` 字段,方便你直接看到原始日志的内容。 + +一旦了解了要处理的日志格式, +你就可以创建自定义 pipeline。 +本文档使用以下 Nginx 访问日志条目作为示例: + +```txt +127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" +``` + +## 创建自定义 Pipeline + +GreptimeDB 提供 HTTP 接口用于创建 pipeline。 +以下是创建方法。 + +首先,创建一个示例 pipeline 配置文件来处理 Nginx 访问日志, +将其命名为 `pipeline.yaml`: + +```yaml +version: 2 +processors: + - dissect: + fields: + - message + patterns: + - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"' + ignore_missing: true + - date: + fields: + - timestamp + formats: + - "%d/%b/%Y:%H:%M:%S %z" + - select: + type: exclude + fields: + - message + - vrl: + source: | + .greptime_table_ttl = "7d" + . 
+ +transform: + - fields: + - ip_address + type: string + index: inverted + tag: true + - fields: + - status_code + type: int32 + index: inverted + tag: true + - fields: + - request_line + - user_agent + type: string + index: fulltext + - fields: + - response_size + type: int32 + - fields: + - timestamp + type: time + index: timestamp +``` + +上面的 pipeline 配置使用 [version 2](/reference/pipeline/pipeline-config.md#transform-in-version-2) 格式, +包含 `processors` 和 `transform` 部分来结构化你的日志数据: + +**Processors**:用于在转换前预处理日志数据: +- **数据提取**:`dissect` 处理器使用 pattern 匹配来解析 `message` 字段并提取结构化数据,包括 `ip_address`、`timestamp`、`http_method`、`request_line`、`status_code`、`response_size` 和 `user_agent`。 +- **时间戳处理**:`date` 处理器使用格式 `%d/%b/%Y:%H:%M:%S %z` 解析提取的 `timestamp` 字段并将其转换为适当的时间戳数据类型。 +- **字段选择**:`select` 处理器从最终输出中排除原始 `message` 字段,同时保留所有其他字段。 +- **表选项**:`vrl` 处理器根据提取的字段设置表选项,例如向表名添加后缀和设置 TTL。`greptime_table_ttl = "7d"` 配置表数据的保存时间为 7 天。 + +**Transform**:定义如何转换和索引提取的字段: +- **字段转换**:每个提取的字段都转换为适当的数据类型并根据需要配置相应的索引。像 `http_method` 这样的字段在没有提供显式配置时保留其默认数据类型。 +- **索引策略**: + - `ip_address` 和 `status_code` 使用倒排索引作为标签进行快速过滤 + - `request_line` 和 `user_agent` 使用全文索引以获得最佳文本搜索能力 + - `timestamp` 是必需的时间索引列 + +有关 pipeline 配置选项的详细信息, +请参考 [Pipeline 配置](/reference/pipeline/pipeline-config.md) 文档。 + +## 上传 Pipeline + +执行以下命令上传 pipeline 配置: + +```shell +curl -X "POST" \ + "http://localhost:4000/v1/pipelines/nginx_pipeline" \ + -H 'Authorization: Basic {{authentication}}' \ + -F "file=@pipeline.yaml" +``` + +成功执行后,将创建一个名为 `nginx_pipeline` 的 pipeline 并返回以下结果: + +```json +{"name":"nginx_pipeline","version":"2024-06-27 12:02:34.257312110Z"}. +``` + +你可以为同一个 pipeline 名称创建多个版本。 +所有 pipeline 都存储在 `greptime_private.pipelines` 表中。 +参考[查询 Pipeline](manage-pipelines.md#查询-pipeline) 来查看 pipeline 数据。 + +## 使用 Pipeline 写入日志 + +以下示例使用 `nginx_pipeline` pipeline 将日志写入 `custom_pipeline_logs` 表来格式化和转换日志消息: + +```shell +curl -X POST \ + "http://localhost:4000/v1/ingest?db=public&table=custom_pipeline_logs&pipeline_name=nginx_pipeline" \ + -H "Content-Type: application/json" \ + -H "Authorization: Basic {{authentication}}" \ + -d '[ + { + "message": "127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\"" + }, + { + "message": "192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\"" + }, + { + "message": "10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\"" + }, + { + "message": "172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\"" + } + ]' +``` + +命令执行成功后将返回以下输出: + +```json +{"output":[{"affectedrows":4}],"execution_time_ms":79} +``` + +`custom_pipeline_logs` 表内容根据 pipeline 配置自动创建: + +```sql ++-------------+-------------+-------------+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+ +| ip_address | http_method | status_code | request_line | user_agent | response_size | timestamp | 
++-------------+-------------+-------------+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+ +| 10.0.0.1 | GET | 304 | /images/logo.png HTTP/1.1 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0 | 0 | 2024-05-25 20:18:37 | +| 127.0.0.1 | GET | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 | +| 172.16.0.1 | GET | 404 | /contact HTTP/1.1 | Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1 | 162 | 2024-05-25 20:19:37 | +| 192.168.1.1 | POST | 200 | /api/login HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 | 1784 | 2024-05-25 20:17:37 | ++-------------+-------------+-------------+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+ +``` + +有关日志写入 API 端点 `/ingest` 的更详细信息, +包括附加参数和配置选项, +请参考[日志写入 API](/reference/pipeline/write-log-api.md) 文档。 + +## 查询日志 + +我们使用 `custom_pipeline_logs` 表作为示例来查询日志。 + +### 通过 tag 查询日志 + +通过 `custom_pipeline_logs` 中的多个 tag 列, +你可以灵活地通过 tag 查询数据。 +例如,查询 `status_code` 为 200 且 `http_method` 为 GET 的日志。 + +```sql +SELECT * FROM custom_pipeline_logs WHERE status_code = 200 AND http_method = 'GET'; +``` + +```sql ++------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ +| ip_address | status_code | request_line | user_agent | response_size | timestamp | http_method | ++------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ +| 127.0.0.1 | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 | GET | ++------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ +1 row in set (0.02 sec) +``` + +### 全文搜索 + +对于文本字段 `request_line` 和 `user_agent`,你可以使用 `matches_term` 函数来搜索日志。 +还记得我们在[创建 pipeline](#create-a-pipeline) 时为这两列创建了全文索引。 +这带来了高性能的全文搜索。 + +例如,查询 `request_line` 列包含 `/index.html` 或 `/api/login` 的日志。 + +```sql +SELECT * FROM custom_pipeline_logs WHERE matches_term(request_line, '/index.html') OR matches_term(request_line, '/api/login'); +``` + +```sql ++-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ +| ip_address | status_code | request_line | user_agent | response_size | timestamp | http_method | 
++-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ +| 127.0.0.1 | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 | GET | +| 192.168.1.1 | 200 | /api/login HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 | 1784 | 2024-05-25 20:17:37 | POST | ++-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ +2 rows in set (0.00 sec) +``` + +你可以参考[全文搜索](fulltext-search.md) 文档了解 `matches_term` 函数的详细用法。 + + +## 使用 Pipeline 的好处 + +使用 pipeline 处理日志带来了结构化的数据和自动的字段提取, +这使得查询和分析更加高效。 + +你也可以在没有 pipeline 的情况下直接将日志写入数据库, +但这种方法限制了高性能分析能力。 + +### 直接插入日志(不使用 Pipeline) + +为了比较,你可以创建一个表来存储原始日志消息: + +```sql +CREATE TABLE `origin_logs` ( + `message` STRING FULLTEXT INDEX, + `time` TIMESTAMP TIME INDEX +) WITH ( + append_mode = 'true' +); +``` + +使用 `INSERT` 语句将日志插入表中。 +注意你需要为每个日志手动添加时间戳字段: + +```sql +INSERT INTO origin_logs (message, time) VALUES +('127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"', '2024-05-25 20:16:37.217'), +('192.168.1.1 - - [25/May/2024:20:17:37 +0000] "POST /api/login HTTP/1.1" 200 1784 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"', '2024-05-25 20:17:37.217'), +('10.0.0.1 - - [25/May/2024:20:18:37 +0000] "GET /images/logo.png HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"', '2024-05-25 20:18:37.217'), +('172.16.0.1 - - [25/May/2024:20:19:37 +0000] "GET /contact HTTP/1.1" 404 162 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1"', '2024-05-25 20:19:37.217'); +``` + +### 表结构比较:Pipeline 转换后 vs 原始日志 + +在上面的示例中,表 `custom_pipeline_logs` 是通过使用 pipeline 写入日志自动创建的, +而表 `origin_logs` 是通过直接写入日志创建的。 +让我们看一看这两个表之间的差异。 + +```sql +DESC custom_pipeline_logs; +``` + +```sql ++---------------+---------------------+------+------+---------+---------------+ +| Column | Type | Key | Null | Default | Semantic Type | ++---------------+---------------------+------+------+---------+---------------+ +| ip_address | String | PRI | YES | | TAG | +| status_code | Int32 | PRI | YES | | TAG | +| request_line | String | | YES | | FIELD | +| user_agent | String | | YES | | FIELD | +| response_size | Int32 | | YES | | FIELD | +| timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP | +| http_method | String | | YES | | FIELD | ++---------------+---------------------+------+------+---------+---------------+ +7 rows in set (0.00 sec) +``` + +```sql +DESC origin_logs; +``` + +```sql ++---------+----------------------+------+------+---------+---------------+ +| Column | Type | Key | Null | Default | Semantic Type | ++---------+----------------------+------+------+---------+---------------+ +| message | String | | YES | | FIELD | +| time | TimestampMillisecond | PRI | NO | | TIMESTAMP | 
++---------+----------------------+------+------+---------+---------------+ +``` + +以上表结构显示了关键差异: + +`custom_pipeline_logs` 表(使用 pipeline 创建)自动将日志数据结构化为多列: +- `ip_address`、`status_code` 作为索引标签用于快速过滤 +- `request_line`、`user_agent` 具有全文索引用于文本搜索 +- `response_size`、`http_method` 作为常规字段 +- `timestamp` 作为时间索引 + +`origin_logs` 表(直接插入)将所有内容存储在单个 `message` 列中。 + +### 为什么使用 Pipeline? + +建议使用 pipeline 方法将日志消息拆分为多列, +这具有明确查询特定列中特定值的优势。 +有几个关键原因使得基于列的匹配查询比全文搜索更优越: + +- **性能**:基于列的查询通常比全文搜索更快 +- **存储效率**:GreptimeDB 的列式存储能更好地压缩结构化数据;标签的倒排索引比全文索引消耗更少的存储空间 +- **查询简单性**:基于标签的查询更容易编写、理解和调试 + +## 下一步 + +- **全文搜索**:阅读[全文搜索](fulltext-search.md) 指南,了解 GreptimeDB 中的高级文本搜索功能和查询技术 +- **Pipeline 配置**:阅读 [Pipeline 配置](/reference/pipeline/pipeline-config.md) 文档,了解更多关于为各种日志格式和处理需求创建和自定义 pipeline 的信息 + + diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/write-logs.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/write-logs.md deleted file mode 100644 index 3275b00e6..000000000 --- a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/logs/write-logs.md +++ /dev/null @@ -1,344 +0,0 @@ ---- -keywords: [日志写入, HTTP 接口, Pipeline 配置, 数据格式, 请求参数] -description: 介绍如何通过 HTTP 接口使用指定的 Pipeline 将日志写入 GreptimeDB,包括请求参数、数据格式和示例。 ---- - -# 使用 Pipeline 写入日志 - -本文档介绍如何通过 HTTP 接口使用指定的 Pipeline 进行处理后将日志写入 GreptimeDB。 - -在写入日志之前,请先阅读 [Pipeline 配置](pipeline-config.md)和[管理 Pipeline](manage-pipelines.md) 完成配置的设定和上传。 - -## HTTP API - -您可以使用以下命令通过 HTTP 接口写入日志: - -```shell -curl -X "POST" "http://localhost:4000/v1/ingest?db=&table=&pipeline_name=&version=" \ - -H "Content-Type: application/x-ndjson" \ - -H "Authorization: Basic {{authentication}}" \ - -d "$" -``` - - -## 请求参数 - -此接口接受以下参数: - -- `db`:数据库名称。 -- `table`:表名称。 -- `pipeline_name`:[Pipeline](./pipeline-config.md) 名称。 -- `version`:Pipeline 版本号。可选,默认使用最新版本。 - -## `Content-Type` 和 Body 数据格式 - -GreptimeDB 使用 `Content-Type` header 来决定如何解码请求体内容。目前我们支持以下两种格式: -- `application/json`: 包括普通的 JSON 格式和 NDJSON 格式。 -- `application/x-ndjson`: 指定 NDJSON 格式,会尝试先分割行再进行解析,可以达到精确的错误检查。 -- `text/plain`: 通过换行符分割的多行日志文本行。 - -### `application/json` 和 `application/x-ndjson` 格式 - -以下是一份 JSON 格式请求体内容的示例: - -```JSON -[ - {"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""}, - {"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""}, - {"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""}, - {"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""} -] -``` - -请注意整个 JSON 是一个数组(包含多行日志)。每个 JSON 对象代表即将要被 Pipeline 引擎处理的一行日志。 - -JSON 对象中的 key 名,也就是这里的 `message`,会被用作 Pipeline processor 处理时的 field 名称。比如: - -```yaml -processors: - - dissect: - fields: - # `message` 是 JSON 对象中的 key 名 - - message - patterns: - - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"' - ignore_missing: true - -# pipeline 文件的剩余部分在这里省略 -``` - -我们也可以将这个请求体内容改写成 NDJSON 的格式,如下所示: - 
-```JSON -{"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""} -{"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""} -{"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""} -{"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""} -``` - -注意到最外层的数组符被消去了,现在每个 JSON 对象通过换行符分割而不是 `,`。 - -### `text/plain` 格式 - -纯文本日志在整个生态系统中被广泛应用。GreptimeDB 同样支持日志数据以 `text/plain` 格式进行输入,使得我们可以直接从日志产生源进行写入。 - -以下是一份和上述样例请求体内容等价的文本请求示例: - -```plain -127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" -192.168.1.1 - - [25/May/2024:20:17:37 +0000] "POST /api/login HTTP/1.1" 200 1784 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36" -10.0.0.1 - - [25/May/2024:20:18:37 +0000] "GET /images/logo.png HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0" -172.16.0.1 - - [25/May/2024:20:19:37 +0000] "GET /contact HTTP/1.1" 404 162 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1" -``` - -仅需要将 `Content-Type` header 设置成 `text/plain`,即可将纯文本请求发送到 GreptimeDB。 - -主要注意的是,和 JSON 格式自带 key 名可以被 Pipeline processor 识别和处理不同,`text/plain` 格式直接将整行文本输入到 Pipeline engine。在这种情况下我们可以使用 `message` 来指代整行输入文本,例如: - -```yaml -processors: - - dissect: - fields: - # 使用 `message` 作为 field 名称 - - message - patterns: - - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"' - ignore_missing: true - -# pipeline 文件的剩余部分在这里省略 -``` - -对于 `text/plain` 格式的输入,推荐首先使用 `dissect` 或者 `regex` processor 将整行文本分割成不同的字段,以便进行后续的处理。 - -## 内置 Pipeline - -GreptimeDB 提供了常见日志格式的内置 Pipeline,允许您直接使用而无需创建新的 Pipeline。 - -请注意,内置 Pipeline 的名称以 "greptime_" 为前缀,不可编辑。 - -### `greptime_identity` - -`greptime_identity` Pipeline 适用于写入 JSON 日志,并自动为 JSON 日志中的每个字段创建列。 - -- JSON 日志中的第一层级的 key 是表中的列名。 -- 如果相同字段包含不同类型的数据,则会返回错误。 -- 值为 `null` 的字段将被忽略。 -- 如果没有手动指定,一个作为时间索引的额外列 `greptime_timestamp` 将被添加到表中,以指示日志写入的时间。 - -#### 类型转换规则 - -- `string` -> `string` -- `number` -> `int64` 或 `float64` -- `boolean` -> `bool` -- `null` -> 忽略 -- `array` -> `json` -- `object` -> `json` - -例如,如果我们有以下 JSON 数据: - -```json -[ - {"name": "Alice", "age": 20, "is_student": true, "score": 90.5,"object": {"a":1,"b":2}}, - {"age": 21, "is_student": false, "score": 85.5, "company": "A" ,"whatever": null}, - {"name": "Charlie", "age": 22, "is_student": true, "score": 95.5,"array":[1,2,3]} -] -``` - -我们将合并每个批次的行结构以获得最终 schema。表 schema 如下所示: - -```sql -mysql> desc pipeline_logs; -+--------------------+---------------------+------+------+---------+---------------+ -| Column | Type | Key | Null | Default | Semantic Type | 
-+--------------------+---------------------+------+------+---------+---------------+ -| age | Int64 | | YES | | FIELD | -| is_student | Boolean | | YES | | FIELD | -| name | String | | YES | | FIELD | -| object | Json | | YES | | FIELD | -| score | Float64 | | YES | | FIELD | -| company | String | | YES | | FIELD | -| array | Json | | YES | | FIELD | -| greptime_timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP | -+--------------------+---------------------+------+------+---------+---------------+ -8 rows in set (0.00 sec) -``` - -数据将存储在表中,如下所示: - -```sql -mysql> select * from pipeline_logs; -+------+------------+---------+---------------+-------+---------+---------+----------------------------+ -| age | is_student | name | object | score | company | array | greptime_timestamp | -+------+------------+---------+---------------+-------+---------+---------+----------------------------+ -| 22 | 1 | Charlie | NULL | 95.5 | NULL | [1,2,3] | 2024-10-18 09:35:48.333020 | -| 21 | 0 | NULL | NULL | 85.5 | A | NULL | 2024-10-18 09:35:48.333020 | -| 20 | 1 | Alice | {"a":1,"b":2} | 90.5 | NULL | NULL | 2024-10-18 09:35:48.333020 | -+------+------------+---------+---------------+-------+---------+---------+----------------------------+ -3 rows in set (0.01 sec) -``` - -#### 自定义时间索引列 - -每个 GreptimeDB 表中都必须有时间索引列。`greptime_identity` pipeline 不需要额外的 YAML 配置,如果你希望使用写入数据中自带的时间列(而不是日志数据到达服务端的时间戳)作为表的时间索引列,则需要通过参数进行指定。 - -假设这是一份待写入的日志数据: -```JSON -[ - {"action": "login", "ts": 1742814853} -] -``` - -设置如下的 URL 参数来指定自定义时间索引列: -```shell -curl -X "POST" "http://localhost:4000/v1/ingest?db=public&table=pipeline_logs&pipeline_name=greptime_identity&custom_time_index=ts;epoch;s" \ - -H "Content-Type: application/json" \ - -H "Authorization: Basic {{authentication}}" \ - -d $'[{"action": "login", "ts": 1742814853}]' -``` - -取决于数据的格式,`custom_time_index` 参数接受两种格式的配置值: -- Unix 时间戳: `<字段名>;epoch;<精度>` - - 该字段需要是整数或者字符串 - - 精度为这四种选项之一: `s`, `ms`, `us`, or `ns`. 
-- 时间戳字符串: `<字段名>;datestr;<字符串解析格式>` - - 例如输入的时间字段值为 `2025-03-24 19:31:37+08:00`,则对应的字符串解析格式为 `%Y-%m-%d %H:%M:%S%:z` - -通过上述配置,结果表就能正确使用输入字段作为时间索引列 -```sql -DESC pipeline_logs; -``` -```sql -+--------+-----------------+------+------+---------+---------------+ -| Column | Type | Key | Null | Default | Semantic Type | -+--------+-----------------+------+------+---------+---------------+ -| ts | TimestampSecond | PRI | NO | | TIMESTAMP | -| action | String | | YES | | FIELD | -+--------+-----------------+------+------+---------+---------------+ -2 rows in set (0.02 sec) -``` - -假设时间变量名称为 `input_ts`,以下是一些使用 `custom_time_index` 的示例: -- 1742814853: `custom_time_index=input_ts;epoch;s` -- 1752749137000: `custom_time_index=input_ts;epoch;ms` -- "2025-07-17T10:00:00+0800": `custom_time_index=input_ts;datestr;%Y-%m-%dT%H:%M:%S%z` -- "2025-06-27T15:02:23.082253908Z": `custom_time_index=input_ts;datestr;%Y-%m-%dT%H:%M:%S%.9f%#z` - - -#### 展开 json 对象 - -如果你希望将 JSON 对象展开为单层结构,可以在请求的 header 中添加 `x-greptime-pipeline-params` 参数,设置 `flatten_json_object` 为 `true`。 - -以下是一个示例请求: - -```shell -curl -X "POST" "http://localhost:4000/v1/ingest?db=&table=&pipeline_name=greptime_identity&version=" \ - -H "Content-Type: application/x-ndjson" \ - -H "Authorization: Basic {{authentication}}" \ - -H "x-greptime-pipeline-params: flatten_json_object=true" \ - -d "$" -``` - -这样,GreptimeDB 将自动将 JSON 对象的每个字段展开为单独的列。比如 - -```JSON -{ - "a": { - "b": { - "c": [1, 2, 3] - } - }, - "d": [ - "foo", - "bar" - ], - "e": { - "f": [7, 8, 9], - "g": { - "h": 123, - "i": "hello", - "j": { - "k": true - } - } - } -} -``` - -将被展开为: - -```json -{ - "a.b.c": [1,2,3], - "d": ["foo","bar"], - "e.f": [7,8,9], - "e.g.h": 123, - "e.g.i": "hello", - "e.g.j.k": true -} -``` - -## Pipeline 上下文中的 hint 变量 - -从 `v0.15` 开始,pipeline 引擎可以识别特定的变量名称,并且通过这些变量对应的值设置相应的建表选项。 -通过与 `vrl` 处理器的结合,现在可以非常轻易地通过输入的数据在 pipeline 的执行过程中设置建表选项。 - -以下是支持的表选项变量名: -- `greptime_auto_create_table` -- `greptime_ttl` -- `greptime_append_mode` -- `greptime_merge_mode` -- `greptime_physical_table` -- `greptime_skip_wal` -关于这些表选项的含义,可以参考[这份文档](/reference/sql/create.md#表选项)。 - -以下是 pipeline 特有的变量: -- `greptime_table_suffix`: 在给定的目标表后增加后缀 - -以如下 pipeline 文件为例 -```YAML -processors: - - date: - field: time - formats: - - "%Y-%m-%d %H:%M:%S%.3f" - ignore_missing: true - - vrl: - source: | - .greptime_table_suffix, err = "_" + .id - .greptime_table_ttl = "1d" - . -``` - -在这份 vrl 脚本中,我们将表后缀变量设置为输入字段中的 `id`(通过一个下划线连接),然后将 ttl 设置成 `1d`。 -然后我们使用如下数据执行写入。 - -```JSON -{ - "id": "2436", - "time": "2024-05-25 20:16:37.217" -} -``` - -假设给定的表名为 `d_table`,那么最终的表名就会按照预期被设置成 `d_table_2436`。这个表同样的 ttl 同样会被设置成 1 天。 - -## 示例 - -请参考快速开始中的[写入日志](quick-start.md#写入日志)部分。 - -## Append 模式 - -通过此接口创建的日志表,默认为[Append 模式](/user-guide/deployments-administration/performance-tuning/design-table.md#何时使用-append-only-表). 
- - -## 使用 skip_error 跳过错误 - -如果你希望在写入日志时跳过错误,可以在 HTTP 请求的 query params 中添加 `skip_error` 参数。比如: - -```shell -curl -X "POST" "http://localhost:4000/v1/ingest?db=&table=&pipeline_name=&version=&skip_error=true" \ - -H "Content-Type: application/x-ndjson" \ - -H "Authorization: Basic {{authentication}}" \ - -d "$" -``` - -这样,GreptimeDB 将在遇到错误时跳过该条日志,并继续处理其他日志。不会因为某一条日志的错误而导致整个请求失败。 \ No newline at end of file diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/manage-data/data-index.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/manage-data/data-index.md index 09f850f66..6ec6803f6 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/manage-data/data-index.md +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/manage-data/data-index.md @@ -120,7 +120,7 @@ CREATE TABLE logs ( 建议仅在需要高级文本搜索功能和灵活查询模式时使用全文索引。 -有关全文索引配置和后端选择的更多详细信息,请参考[全文索引配置](/user-guide/logs/fulltext-index-config)指南。 +有关全文索引配置和后端选择的更多详细信息,请参考[全文索引配置](/user-guide/manage-data/data-index.md#全文索引)指南。 ## 修改索引 diff --git a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/migrate-to-greptimedb/migrate-from-clickhouse.md b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/migrate-to-greptimedb/migrate-from-clickhouse.md index c2a5e3a3d..c29e7b05a 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/migrate-to-greptimedb/migrate-from-clickhouse.md +++ b/i18n/zh/docusaurus-plugin-content-docs/version-0.17/user-guide/migrate-to-greptimedb/migrate-from-clickhouse.md @@ -118,7 +118,7 @@ CREATE TABLE logs ( **说明:** - `host` 和 `service` 作为常用过滤项列入主键,如主机数量非常多,可移出主键,改为跳数索引。 -- `log_message` 作为原始文本内容建立全文索引。**若要全文索引生效,查询时 SQL 语法也需调整,详见[日志检索文档](/user-guide/logs/query-logs.md)**。 +- `log_message` 作为原始文本内容建立全文索引。**若要全文索引生效,查询时 SQL 语法也需调整,详见[日志检索文档](/user-guide/logs/fulltext-search.md)**。 - `trace_id` 和 `span_id` 通常为高基数字段,建议仅做跳数索引。 @@ -227,7 +227,7 @@ clickhouse client --query="SELECT * FROM example INTO OUTFILE 'example.csv' FORM ### SQL/类型不兼容怎么办? -迁移前需梳理所有查询 SQL 并按官方文档 ([SQL 查询](/user-guide/query-data/sql.md)、[日志检索](/user-guide/logs/query-logs.md)) 重写或翻译不兼容语法和类型。 +迁移前需梳理所有查询 SQL 并按官方文档 ([SQL 查询](/user-guide/query-data/sql.md)、[日志检索](/user-guide/logs/fulltext-search.md)) 重写或翻译不兼容语法和类型。 ### 如何高效批量导入大规模数据? diff --git a/sidebars.ts b/sidebars.ts index a26b8a55f..a297d28e9 100644 --- a/sidebars.ts +++ b/sidebars.ts @@ -234,11 +234,9 @@ const sidebars: SidebarsConfig = { label: 'Overview', }, 'user-guide/logs/quick-start', - 'user-guide/logs/pipeline-config', + 'user-guide/logs/use-custom-pipelines', + 'user-guide/logs/fulltext-search', 'user-guide/logs/manage-pipelines', - 'user-guide/logs/write-logs', - 'user-guide/logs/query-logs', - 'user-guide/logs/fulltext-index-config', ], }, { @@ -716,6 +714,15 @@ const sidebars: SidebarsConfig = { }, ], }, + { + type: 'category', + label: 'Pipeline', + items: [ + 'reference/pipeline/built-in-pipelines', + 'reference/pipeline/write-log-api', + 'reference/pipeline/pipeline-config', + ], + }, 'reference/http-endpoints', 'reference/telemetry', 'reference/gtctl', diff --git a/static/log-collection-flow.drawio.svg b/static/log-collection-flow.drawio.svg new file mode 100644 index 000000000..2dcf914a5 --- /dev/null +++ b/static/log-collection-flow.drawio.svg @@ -0,0 +1,401 @@ + + + + + + + + + + + + + + +
+ [draw.io SVG markup omitted — diagram labels: Log Sources (Applications, IoT, Infrastructure, System, ...) → Log Collectors (Vector, Fluent Bit, Kafka, OpenTelemetry Collector, ...) → "Original Logs" / "Collector Formatted Logs" → Pipeline → GreptimeDB; fallback text: "Text is not SVG - cannot display"]
\ No newline at end of file diff --git a/versioned_docs/version-0.15/user-guide/manage-data/data-index.md b/versioned_docs/version-0.15/user-guide/manage-data/data-index.md index c50ebacba..9720d55fd 100644 --- a/versioned_docs/version-0.15/user-guide/manage-data/data-index.md +++ b/versioned_docs/version-0.15/user-guide/manage-data/data-index.md @@ -120,7 +120,7 @@ Fulltext index usually comes with following drawbacks: Consider using fulltext index only when you need advanced text search capabilities and flexible query patterns. -For more detailed information about fulltext index configuration and backend selection, please refer to the [Full-Text Index Configuration](/user-guide/logs/fulltext-index-config) guide. +For more detailed information about fulltext index configuration and backend selection, please refer to the [Full-Text Index Configuration](/user-guide/logs/fulltext-index-config.md) guide. ## Modify indexes diff --git a/versioned_docs/version-0.16/user-guide/manage-data/data-index.md b/versioned_docs/version-0.16/user-guide/manage-data/data-index.md index c50ebacba..9720d55fd 100644 --- a/versioned_docs/version-0.16/user-guide/manage-data/data-index.md +++ b/versioned_docs/version-0.16/user-guide/manage-data/data-index.md @@ -120,7 +120,7 @@ Fulltext index usually comes with following drawbacks: Consider using fulltext index only when you need advanced text search capabilities and flexible query patterns. -For more detailed information about fulltext index configuration and backend selection, please refer to the [Full-Text Index Configuration](/user-guide/logs/fulltext-index-config) guide. +For more detailed information about fulltext index configuration and backend selection, please refer to the [Full-Text Index Configuration](/user-guide/logs/fulltext-index-config.md) guide. ## Modify indexes diff --git a/versioned_docs/version-0.17/faq-and-others/faq.md b/versioned_docs/version-0.17/faq-and-others/faq.md index 5c80b3085..7bc97261e 100644 --- a/versioned_docs/version-0.17/faq-and-others/faq.md +++ b/versioned_docs/version-0.17/faq-and-others/faq.md @@ -219,7 +219,7 @@ Learn more about indexing: [Index Management](/user-guide/manage-data/data-index **Real-Time Processing**: - **[Flow Engine](/user-guide/flow-computation/overview.md)**: Real-time stream processing system that enables continuous, incremental computation on streaming data with automatic result table updates -- **[Pipeline](/user-guide/logs/pipeline-config.md)**: Data parsing and transformation mechanism for processing incoming data in real-time, with configurable processors for field extraction and data type conversion across multiple data formats +- **[Pipeline](/reference/pipeline/pipeline-config.md)**: Data parsing and transformation mechanism for processing incoming data in real-time, with configurable processors for field extraction and data type conversion across multiple data formats - **Output Tables**: Persist processed results for analysis diff --git a/versioned_docs/version-0.17/getting-started/quick-start.md b/versioned_docs/version-0.17/getting-started/quick-start.md index ec86668d7..fdbac53cf 100644 --- a/versioned_docs/version-0.17/getting-started/quick-start.md +++ b/versioned_docs/version-0.17/getting-started/quick-start.md @@ -237,7 +237,7 @@ ORDER BY +---------------------+-------+------------------+-----------+--------------------+ ``` -The `@@` operator is used for [term searching](/user-guide/logs/query-logs.md). 
+The `@@` operator is used for [term searching](/user-guide/logs/fulltext-search.md). ### Range query diff --git a/versioned_docs/version-0.17/greptimecloud/integrations/fluent-bit.md b/versioned_docs/version-0.17/greptimecloud/integrations/fluent-bit.md index fa92db437..b8643face 100644 --- a/versioned_docs/version-0.17/greptimecloud/integrations/fluent-bit.md +++ b/versioned_docs/version-0.17/greptimecloud/integrations/fluent-bit.md @@ -28,7 +28,7 @@ Fluent Bit can be configured to send logs to GreptimeCloud using the HTTP protoc http_Passwd ``` -In this example, the `http` output plugin is used to send logs to GreptimeCloud. For more information, and extra options, refer to the [Logs HTTP API](https://docs.greptime.com/user-guide/logs/write-logs#http-api) guide. +In this example, the `http` output plugin is used to send logs to GreptimeCloud. For more information, and extra options, refer to the [Logs HTTP API](https://docs.greptime.com/reference/pipeline/write-log-api.md#http-api) guide. ## Prometheus Remote Write diff --git a/versioned_docs/version-0.17/greptimecloud/integrations/kafka.md b/versioned_docs/version-0.17/greptimecloud/integrations/kafka.md index c20c10fab..569776111 100644 --- a/versioned_docs/version-0.17/greptimecloud/integrations/kafka.md +++ b/versioned_docs/version-0.17/greptimecloud/integrations/kafka.md @@ -13,7 +13,7 @@ Here we are using Vector as the tool to transport data from Kafka to GreptimeDB. ## Logs A sample configuration. Note that you will need to [create your -pipeline](https://docs.greptime.com/user-guide/logs/pipeline-config/) for log +pipeline](https://docs.greptime.com/reference/pipeline/pipeline-config/) for log parsing. ```toml diff --git a/versioned_docs/version-0.17/reference/pipeline/built-in-pipelines.md b/versioned_docs/version-0.17/reference/pipeline/built-in-pipelines.md new file mode 100644 index 000000000..efda2d1f9 --- /dev/null +++ b/versioned_docs/version-0.17/reference/pipeline/built-in-pipelines.md @@ -0,0 +1,176 @@ +--- +keywords: [built-in pipelines, greptime_identity, JSON logs, log processing, time index, pipeline, GreptimeDB] +description: Learn about GreptimeDB's built-in pipelines, including the greptime_identity pipeline for processing JSON logs with automatic schema creation, type conversion, and time index configuration. +--- + +# Built-in Pipelines + +GreptimeDB offers built-in pipelines for common log formats, allowing you to use them directly without creating new pipelines. + +Note that the built-in pipelines are not editable. +Additionally, the "greptime_" prefix of the pipeline name is reserved. + +## `greptime_identity` + +The `greptime_identity` pipeline is designed for writing JSON logs and automatically creates columns for each field in the JSON log. + +- The first-level keys in the JSON log are used as column names. +- An error is returned if the same field has different types. +- Fields with `null` values are ignored. +- If time index is not specified, an additional column, `greptime_timestamp`, is added to the table as the time index to indicate when the log was written. 
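+
+For example, a minimal ingestion request using this pipeline might look like the following (a sketch assuming a local instance at `localhost:4000` and a target table named `pipeline_logs`; adjust the values to your deployment):
+
+```shell
+# Write a batch of JSON logs through the built-in greptime_identity pipeline;
+# each first-level key in the JSON objects becomes a column in `pipeline_logs`.
+curl -X "POST" "http://localhost:4000/v1/ingest?db=public&table=pipeline_logs&pipeline_name=greptime_identity" \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Basic {{authentication}}" \
+  -d '[{"name": "Alice", "age": 20, "is_student": true, "score": 90.5}]'
+```
+
+How each JSON value maps to a column type follows the conversion rules below.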
+ +### Type conversion rules + +- `string` -> `string` +- `number` -> `int64` or `float64` +- `boolean` -> `bool` +- `null` -> ignore +- `array` -> `json` +- `object` -> `json` + + +For example, if we have the following json data: + +```json +[ + {"name": "Alice", "age": 20, "is_student": true, "score": 90.5,"object": {"a":1,"b":2}}, + {"age": 21, "is_student": false, "score": 85.5, "company": "A" ,"whatever": null}, + {"name": "Charlie", "age": 22, "is_student": true, "score": 95.5,"array":[1,2,3]} +] +``` + +We'll merge the schema for each row of this batch to get the final schema. The table schema will be: + +```sql +mysql> desc pipeline_logs; ++--------------------+---------------------+------+------+---------+---------------+ +| Column | Type | Key | Null | Default | Semantic Type | ++--------------------+---------------------+------+------+---------+---------------+ +| age | Int64 | | YES | | FIELD | +| is_student | Boolean | | YES | | FIELD | +| name | String | | YES | | FIELD | +| object | Json | | YES | | FIELD | +| score | Float64 | | YES | | FIELD | +| company | String | | YES | | FIELD | +| array | Json | | YES | | FIELD | +| greptime_timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP | ++--------------------+---------------------+------+------+---------+---------------+ +8 rows in set (0.00 sec) +``` + +The data will be stored in the table as follows: + +```sql +mysql> select * from pipeline_logs; ++------+------------+---------+---------------+-------+---------+---------+----------------------------+ +| age | is_student | name | object | score | company | array | greptime_timestamp | ++------+------------+---------+---------------+-------+---------+---------+----------------------------+ +| 22 | 1 | Charlie | NULL | 95.5 | NULL | [1,2,3] | 2024-10-18 09:35:48.333020 | +| 21 | 0 | NULL | NULL | 85.5 | A | NULL | 2024-10-18 09:35:48.333020 | +| 20 | 1 | Alice | {"a":1,"b":2} | 90.5 | NULL | NULL | 2024-10-18 09:35:48.333020 | ++------+------------+---------+---------------+-------+---------+---------+----------------------------+ +3 rows in set (0.01 sec) +``` + +### Specify time index + +A time index is necessary in GreptimeDB. Since the `greptime_identity` pipeline does not require a YAML configuration, you must set the time index in the query parameters if you want to use the timestamp from the log data instead of the automatically generated timestamp when the data arrives. + +Example of Incoming Log Data: +```JSON +[ + {"action": "login", "ts": 1742814853} +] +``` + +To instruct the server to use ts as the time index, set the following query parameter in the HTTP header: +```shell +curl -X "POST" "http://localhost:4000/v1/ingest?db=public&table=pipeline_logs&pipeline_name=greptime_identity&custom_time_index=ts;epoch;s" \ + -H "Content-Type: application/json" \ + -H "Authorization: Basic {{authentication}}" \ + -d $'[{"action": "login", "ts": 1742814853}]' +``` + +The `custom_time_index` parameter accepts two formats, depending on the input data format: +- Epoch number format: `;epoch;` + - The field can be an integer or a string. + - The resolution must be one of: `s`, `ms`, `us`, or `ns`. +- Date string format: `;datestr;` + - For example, if the input data contains a timestamp like `2025-03-24 19:31:37+08:00`, the corresponding format should be `%Y-%m-%d %H:%M:%S%:z`. + +With the configuration above, the resulting table will correctly use the specified log data field as the time index. 
+```sql +DESC pipeline_logs; +``` +```sql ++--------+-----------------+------+------+---------+---------------+ +| Column | Type | Key | Null | Default | Semantic Type | ++--------+-----------------+------+------+---------+---------------+ +| ts | TimestampSecond | PRI | NO | | TIMESTAMP | +| action | String | | YES | | FIELD | ++--------+-----------------+------+------+---------+---------------+ +2 rows in set (0.02 sec) +``` + +Here are some example of using `custom_time_index` assuming the time variable is named `input_ts`: +- 1742814853: `custom_time_index=input_ts;epoch;s` +- 1752749137000: `custom_time_index=input_ts;epoch;ms` +- "2025-07-17T10:00:00+0800": `custom_time_index=input_ts;datestr;%Y-%m-%dT%H:%M:%S%z` +- "2025-06-27T15:02:23.082253908Z": `custom_time_index=input_ts;datestr;%Y-%m-%dT%H:%M:%S%.9f%#z` + + +### Flatten JSON objects + +If flattening a JSON object into a single-level structure is needed, add the `x-greptime-pipeline-params` header to the request and set `flatten_json_object` to `true`. + +Here is a sample request: + +```shell +curl -X "POST" "http://localhost:4000/v1/ingest?db=&table=&pipeline_name=greptime_identity&version=" \ + -H "Content-Type: application/x-ndjson" \ + -H "Authorization: Basic {{authentication}}" \ + -H "x-greptime-pipeline-params: flatten_json_object=true" \ + -d "$" +``` + +With this configuration, GreptimeDB will automatically flatten each field of the JSON object into separate columns. For example: + +```JSON +{ + "a": { + "b": { + "c": [1, 2, 3] + } + }, + "d": [ + "foo", + "bar" + ], + "e": { + "f": [7, 8, 9], + "g": { + "h": 123, + "i": "hello", + "j": { + "k": true + } + } + } +} +``` + +Will be flattened to: + +```json +{ + "a.b.c": [1,2,3], + "d": ["foo","bar"], + "e.f": [7,8,9], + "e.g.h": 123, + "e.g.i": "hello", + "e.g.j.k": true +} +``` + + + diff --git a/versioned_docs/version-0.17/user-guide/logs/pipeline-config.md b/versioned_docs/version-0.17/reference/pipeline/pipeline-config.md similarity index 99% rename from versioned_docs/version-0.17/user-guide/logs/pipeline-config.md rename to versioned_docs/version-0.17/reference/pipeline/pipeline-config.md index 324595be7..522b8f1b1 100644 --- a/versioned_docs/version-0.17/user-guide/logs/pipeline-config.md +++ b/versioned_docs/version-0.17/reference/pipeline/pipeline-config.md @@ -51,10 +51,10 @@ The above plain text data will be converted to the following equivalent form: In other words, when the input is in plain text format, you need to use `message` to refer to the content of each line when writing `Processor` and `Transform` configurations. -## Overall structure +## Pipeline Configuration Structure Pipeline consists of four parts: Processors, Dispatcher, Transform, and Table suffix. -Processors pre-processes input log data. +Processors pre-process input log data. Dispatcher forwards pipeline execution context onto different subsequent pipeline. Transform decides the final datatype and table structure in the database. Table suffix allows storing the data into different tables. @@ -827,6 +827,8 @@ Some notes regarding the `vrl` processor: 2. The returning value of the vrl script should not contain any regex-type variables. They can be used in the script, but have to be `del`ed before returning. 3. Due to type conversion between pipeline's value type and vrl's, the value type that comes out of the vrl script will be the ones with max capacity, meaning `i64`, `f64`, and `Timestamp::nanoseconds`. 
+You can use `vrl` processor to set [table options](./write-log-api.md#set-table-options) while writing logs. + ### `filter` The `filter` processor can filter out unneeded lines when the condition is meet. @@ -1013,7 +1015,7 @@ Specify which field uses the inverted index. Refer to the [Transform Example](#t #### The Fulltext Index -Specify which field will be used for full-text search using `index: fulltext`. This index greatly improves the performance of [log search](./query-logs.md). Refer to the [Transform Example](#transform-example) below for syntax. +Specify which field will be used for full-text search using `index: fulltext`. This index greatly improves the performance of [log search](/user-guide/logs/fulltext-search.md). Refer to the [Transform Example](#transform-example) below for syntax. #### The Skipping Index @@ -1159,4 +1161,4 @@ table_suffix: _${type} These three lines of input log will be inserted into three tables: 1. `persist_app_db` 2. `persist_app_http` -3. `persist_app`, for it doesn't have a `type` field, thus the default table name will be used. \ No newline at end of file +3. `persist_app`, for it doesn't have a `type` field, thus the default table name will be used. diff --git a/versioned_docs/version-0.17/reference/pipeline/write-log-api.md b/versioned_docs/version-0.17/reference/pipeline/write-log-api.md new file mode 100644 index 000000000..04f22faad --- /dev/null +++ b/versioned_docs/version-0.17/reference/pipeline/write-log-api.md @@ -0,0 +1,160 @@ +--- +keywords: [write logs, HTTP interface, log formats, request parameters, JSON logs] +description: Describes how to write logs to GreptimeDB using a pipeline via the HTTP interface, including supported formats and request parameters. +--- + +# APIs for Writing Logs + +Before writing logs, please read the [Pipeline Configuration](/user-guide/logs/use-custom-pipelines.md#upload-pipeline) to complete the configuration setup and upload. + +## HTTP API + +You can use the following command to write logs via the HTTP interface: + +```shell +curl -X "POST" "http://localhost:4000/v1/ingest?db=&table=&pipeline_name=&version=&skip_error=" \ + -H "Content-Type: application/x-ndjson" \ + -H "Authorization: Basic {{authentication}}" \ + -d "$" +``` + +### Request parameters + +This interface accepts the following parameters: + +- `db`: The name of the database. +- `table`: The name of the table. +- `pipeline_name`: The name of the [pipeline](./pipeline-config.md). +- `version`: The version of the pipeline. Optional, default use the latest one. +- `skip_error`: Whether to skip errors when writing logs. Optional, defaults to `false`. When set to `true`, GreptimeDB will skip individual log entries that encounter errors and continue processing the remaining logs. This prevents the entire request from failing due to a single problematic log entry. + +### `Content-Type` and body format + +GreptimeDB uses `Content-Type` header to decide how to decode the payload body. Currently the following two format is supported: +- `application/json`: this includes normal JSON format and NDJSON format. +- `application/x-ndjson`: specifically uses NDJSON format, which will try to split lines and parse for more accurate error checking. +- `text/plain`: multiple log lines separated by line breaks. 
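+
+Before looking at each format in detail, here is a fully filled-in request for reference. It is only a sketch: it reuses the `nginx_pipeline` and `custom_pipeline_logs` names from the custom-pipeline guide, and the payload is a single NDJSON line.
+
+```shell
+# Hypothetical example: send one NDJSON log line through a previously created pipeline,
+# skipping any entries that fail to parse instead of failing the whole request.
+curl -X "POST" "http://localhost:4000/v1/ingest?db=public&table=custom_pipeline_logs&pipeline_name=nginx_pipeline&skip_error=true" \
+  -H "Content-Type: application/x-ndjson" \
+  -H "Authorization: Basic {{authentication}}" \
+  -d '{"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0\""}'
+```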
+ +#### `application/json` and `application/x-ndjson` format + +Here is an example of JSON format body payload + +```JSON +[ + {"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""}, + {"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""}, + {"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""}, + {"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""} +] +``` + +Note the whole JSON is an array (log lines). Each JSON object represents one line to be processed by Pipeline engine. + +The name of the key in JSON objects, which is `message` here, is used as field name in Pipeline processors. For example: + +```yaml +processors: + - dissect: + fields: + # `message` is the key in JSON object + - message + patterns: + - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"' + ignore_missing: true + +# rest of the file is ignored +``` + +We can also rewrite the payload into NDJSON format like following: + +```JSON +{"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""} +{"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""} +{"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""} +{"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""} +``` + +Note the outer array is eliminated, and lines are separated by line breaks instead of `,`. + +#### `text/plain` format + +Log in plain text format is widely used throughout the ecosystem. GreptimeDB also supports `text/plain` format as log data input, enabling ingesting logs first hand from log producers. 
+ +The equivalent body payload of previous example is like following: + +```plain +127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" +192.168.1.1 - - [25/May/2024:20:17:37 +0000] "POST /api/login HTTP/1.1" 200 1784 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36" +10.0.0.1 - - [25/May/2024:20:18:37 +0000] "GET /images/logo.png HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0" +172.16.0.1 - - [25/May/2024:20:19:37 +0000] "GET /contact HTTP/1.1" 404 162 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1" +``` + +Sending log ingestion request to GreptimeDB requires only modifying the `Content-Type` header to be `text/plain`, and you are good to go! + +Please note that, unlike JSON format, where the input data already have key names as field names to be used in Pipeline processors, `text/plain` format just gives the whole line as input to the Pipeline engine. In this case we use `message` as the field name to refer to the input line, for example: + +```yaml +processors: + - dissect: + fields: + # use `message` as the field name + - message + patterns: + - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"' + ignore_missing: true + +# rest of the file is ignored +``` + +It is recommended to use `dissect` or `regex` processor to split the input line into fields first and then process the fields accordingly. + +## Set Table Options + +The table options need to be set in the pipeline configurations. +Starting from `v0.15`, the pipeline engine recognizes certain variables, and can set corresponding table options based on the value of the variables. +Combined with the `vrl` processor, it's now easy to create and set table options during the pipeline execution based on input data. + +Here is a list of supported common table option variables: +- `greptime_auto_create_table` +- `greptime_ttl` +- `greptime_append_mode` +- `greptime_merge_mode` +- `greptime_physical_table` +- `greptime_skip_wal` + +Please refer to [table options](/reference/sql/create.md#table-options) for the detailed explanation of each option. + +Here are some pipeline specific variables: +- `greptime_table_suffix`: add suffix to the destined table name. + +Let's use the following pipeline file to demonstrate: +```YAML +processors: + - date: + field: time + formats: + - "%Y-%m-%d %H:%M:%S%.3f" + ignore_missing: true + - vrl: + source: | + .greptime_table_suffix, err = "_" + .id + .greptime_table_ttl = "1d" + . +``` + +In the vrl script, we set the table suffix variable with the input field `.id`(leading with an underscore), and set the ttl to `1d`. +Then we run the ingestion using the following JSON data. + +```JSON +{ + "id": "2436", + "time": "2024-05-25 20:16:37.217" +} +``` + +Assuming the given table name being `d_table`, the final table name would be `d_table_2436` as we would expected. +The table is also set with a ttl of 1 day. + +## Examples + +Please refer to the "Writing Logs" section in the [Quick Start](/user-guide/logs/quick-start.md#direct-http-ingestion) and [Using Custom Pipelines](/user-guide/logs/use-custom-pipelines.md#write-logs) guide for examples. 
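+
+As an end-to-end sketch of the table-option flow above (the pipeline name `table_option_pipeline` is hypothetical, and the YAML is assumed to be the table-option pipeline shown earlier saved as `pipeline.yaml`):
+
+```shell
+# 1. Upload the pipeline that sets greptime_table_suffix and greptime_table_ttl.
+curl -X "POST" "http://localhost:4000/v1/pipelines/table_option_pipeline" \
+  -H "Authorization: Basic {{authentication}}" \
+  -F "file=@pipeline.yaml"
+
+# 2. Ingest a log line. With id = "2436", the row lands in table `d_table_2436` with a 1-day TTL.
+curl -X "POST" "http://localhost:4000/v1/ingest?db=public&table=d_table&pipeline_name=table_option_pipeline" \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Basic {{authentication}}" \
+  -d '[{"id": "2436", "time": "2024-05-25 20:16:37.217"}]'
+```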
diff --git a/versioned_docs/version-0.17/reference/sql/alter.md b/versioned_docs/version-0.17/reference/sql/alter.md index af18f9d61..5b1676682 100644 --- a/versioned_docs/version-0.17/reference/sql/alter.md +++ b/versioned_docs/version-0.17/reference/sql/alter.md @@ -194,7 +194,7 @@ You can specify the following options using `FULLTEXT INDEX WITH` when enabling - `granularity`: (For `bloom` backend) The size of data chunks covered by each filter. A smaller granularity improves filtering but increases index size. Default is `10240`. - `false_positive_rate`: (For `bloom` backend) The probability of misidentifying a block. A lower rate improves accuracy (better filtering) but increases index size. Value is a float between `0` and `1`. Default is `0.01`. -For more information on full-text index configuration and performance comparison, refer to the [Full-Text Index Configuration Guide](/user-guide/logs/fulltext-index-config.md). +For more information on full-text index configuration and performance comparison, refer to the [Full-Text Index Configuration Guide](/user-guide/manage-data/data-index.md#fulltext-index). If `WITH ` is not specified, `FULLTEXT INDEX` will use the default values. diff --git a/versioned_docs/version-0.17/reference/sql/functions/overview.md b/versioned_docs/version-0.17/reference/sql/functions/overview.md index 50b74ba21..467247d5b 100644 --- a/versioned_docs/version-0.17/reference/sql/functions/overview.md +++ b/versioned_docs/version-0.17/reference/sql/functions/overview.md @@ -50,7 +50,7 @@ DataFusion [String Function](./df-functions.md#string-functions). GreptimeDB provides: * `matches_term(expression, term)` for full text search. -For details, read the [Query Logs](/user-guide/logs/query-logs.md). +For details, read the [Query Logs](/user-guide/logs/fulltext-search.md). ### Math Functions diff --git a/versioned_docs/version-0.17/reference/sql/where.md b/versioned_docs/version-0.17/reference/sql/where.md index 421ef5815..ca6a48659 100644 --- a/versioned_docs/version-0.17/reference/sql/where.md +++ b/versioned_docs/version-0.17/reference/sql/where.md @@ -77,4 +77,4 @@ SELECT * FROM go_info WHERE instance LIKE 'localhost:____'; ``` -For searching terms in logs, please read [Query Logs](/user-guide/logs/query-logs.md). \ No newline at end of file +For searching terms in logs, please read [Query Logs](/user-guide/logs/fulltext-search.md). diff --git a/versioned_docs/version-0.17/user-guide/ingest-data/for-observability/fluent-bit.md b/versioned_docs/version-0.17/user-guide/ingest-data/for-observability/fluent-bit.md index adf5b364a..59cb82a30 100644 --- a/versioned_docs/version-0.17/user-guide/ingest-data/for-observability/fluent-bit.md +++ b/versioned_docs/version-0.17/user-guide/ingest-data/for-observability/fluent-bit.md @@ -43,7 +43,7 @@ In params Uri, - `table` is the table name you want to write logs to. - `pipeline_name` is the pipeline name you want to use for processing logs. -In this example, the [Logs Http API](/user-guide/logs/write-logs.md#http-api) interface is used. For more information, refer to the [Write Logs](/user-guide/logs/write-logs.md) guide. +In this example, the [Logs Http API](/reference/pipeline/write-log-api.md#http-api) interface is used. For more information, refer to the [Write Logs](/user-guide/logs/use-custom-pipelines.md#ingest-logs-using-the-pipeline) guide. 
## OpenTelemetry diff --git a/versioned_docs/version-0.17/user-guide/ingest-data/for-observability/kafka.md b/versioned_docs/version-0.17/user-guide/ingest-data/for-observability/kafka.md index 5d5b9c61b..8238d7e88 100644 --- a/versioned_docs/version-0.17/user-guide/ingest-data/for-observability/kafka.md +++ b/versioned_docs/version-0.17/user-guide/ingest-data/for-observability/kafka.md @@ -128,8 +128,8 @@ For logs in text format, such as the access log format below, you'll need to cre #### Create a pipeline To create a custom pipeline, -please refer to the [Create Pipeline](/user-guide/logs/quick-start.md#create-a-pipeline) -and [Pipeline Configuration](/user-guide/logs/pipeline-config.md) documentation for detailed instructions. +please refer to the [using custom pipelines](/user-guide/logs/use-custom-pipelines.md) +documentation for detailed instructions. #### Ingest data diff --git a/versioned_docs/version-0.17/user-guide/ingest-data/for-observability/loki.md b/versioned_docs/version-0.17/user-guide/ingest-data/for-observability/loki.md index ddacc2650..fb61b4bdb 100644 --- a/versioned_docs/version-0.17/user-guide/ingest-data/for-observability/loki.md +++ b/versioned_docs/version-0.17/user-guide/ingest-data/for-observability/loki.md @@ -184,7 +184,7 @@ transform: ``` The pipeline content is straightforward: we use `vrl` processor to parse the line into a JSON object, then extract the fields to the root level. -`log_time` is specified as the time index in the transform section, other fields will be auto-inferred by the pipeline engine, see [pipeline version 2](/user-guide/logs/pipeline-config.md#transform-in-version-2) for details. +`log_time` is specified as the time index in the transform section, other fields will be auto-inferred by the pipeline engine, see [pipeline version 2](/reference/pipeline/pipeline-config.md#transform-in-version-2) for details. Note that the input field name is `loki_line`, which contains the original log line from Loki. @@ -264,4 +264,4 @@ log_source: application This output demonstrates that the pipeline engine has successfully parsed the original JSON log lines and extracted the structured data into separate columns. -For more details about pipeline configuration and features, refer to the [pipeline documentation](/user-guide/logs/pipeline-config.md). +For more details about pipeline configuration and features, refer to the [pipeline documentation](/reference/pipeline/pipeline-config.md). diff --git a/versioned_docs/version-0.17/user-guide/ingest-data/for-observability/prometheus.md b/versioned_docs/version-0.17/user-guide/ingest-data/for-observability/prometheus.md index 889282fd0..82508948b 100644 --- a/versioned_docs/version-0.17/user-guide/ingest-data/for-observability/prometheus.md +++ b/versioned_docs/version-0.17/user-guide/ingest-data/for-observability/prometheus.md @@ -303,7 +303,7 @@ mysql> select * from `go_memstats_mcache_inuse_bytes`; 2 rows in set (0.01 sec) ``` -You can refer to the [pipeline's documentation](/user-guide/logs/pipeline-config.md) for more details. +You can refer to the [pipeline's documentation](/user-guide/logs/use-custom-pipelines.md) for more details. 
## Performance tuning diff --git a/versioned_docs/version-0.17/user-guide/logs/fulltext-index-config.md b/versioned_docs/version-0.17/user-guide/logs/fulltext-index-config.md deleted file mode 100644 index a0f244399..000000000 --- a/versioned_docs/version-0.17/user-guide/logs/fulltext-index-config.md +++ /dev/null @@ -1,131 +0,0 @@ ---- -keywords: [fulltext index, tantivy, bloom, analyzer, case_sensitive, configuration] -description: Comprehensive guide for configuring full-text index in GreptimeDB, including backend selection and other configuration options. ---- - -# Full-Text Index Configuration - -This document provides a comprehensive guide for configuring full-text index in GreptimeDB, including backend selection and other configuration options. - -## Overview - -GreptimeDB provides full-text indexing capabilities to accelerate text search operations. You can configure full-text index when creating or altering tables, with various options to optimize for different use cases. For a general introduction to different types of indexes in GreptimeDB, including inverted index and skipping index, please refer to the [Data Index](/user-guide/manage-data/data-index) guide. - -## Configuration Options - -When creating or modifying a full-text index, you can specify the following options using `FULLTEXT INDEX WITH`: - -### Basic Options - -- `analyzer`: Sets the language analyzer for the full-text index - - Supported values: `English`, `Chinese` - - Default: `English` - - Note: The Chinese analyzer requires significantly more time to build the index due to the complexity of Chinese text segmentation. Consider using it only when Chinese text search is a primary requirement. - -- `case_sensitive`: Determines whether the full-text index is case-sensitive - - Supported values: `true`, `false` - - Default: `false` - - Note: Setting to `true` may slightly improve performance for case-sensitive queries, but will degrade performance for case-insensitive queries. This setting does not affect the results of `matches_term` queries. - -- `backend`: Sets the backend for the full-text index - - Supported values: `bloom`, `tantivy` - - Default: `bloom` - -- `granularity`: (For `bloom` backend) The size of data chunks covered by each filter. A smaller granularity improves filtering but increases index size. - - Supported values: positive integer - - Default: `10240` - -- `false_positive_rate`: (For `bloom` backend) The probability of misidentifying a block. A lower rate improves accuracy (better filtering) but increases index size. - - Supported values: float between `0` and `1` - - Default: `0.01` - -### Backend Selection - -GreptimeDB provides two full-text index backends for efficient log searching: - -1. **Bloom Backend** - - Best for: General-purpose log searching - - Features: - - Uses Bloom filter for efficient filtering - - Lower storage overhead - - Consistent performance across different query patterns - - Limitations: - - Slightly slower for high-selectivity queries - - Storage Cost Example: - - Original data: ~10GB - - Bloom index: ~1GB - -2. 
**Tantivy Backend** - - Best for: High-selectivity queries (e.g., unique values like TraceID) - - Features: - - Uses inverted index for fast exact matching - - Excellent performance for high-selectivity queries - - Limitations: - - Higher storage overhead (close to original data size) - - Slower performance for low-selectivity queries - - Storage Cost Example: - - Original data: ~10GB - - Tantivy index: ~10GB - -### Performance Comparison - -The following table shows the performance comparison between different query methods (using Bloom as baseline): - -| Query Type | High Selectivity (e.g., TraceID) | Low Selectivity (e.g., "HTTP") | -|------------|----------------------------------|--------------------------------| -| LIKE | 50x slower | 1x | -| Tantivy | 5x faster | 5x slower | -| Bloom | 1x (baseline) | 1x (baseline) | - -Key observations: -- For high-selectivity queries (e.g., unique values), Tantivy provides the best performance -- For low-selectivity queries, Bloom offers more consistent performance -- Bloom has significant storage advantage over Tantivy (1GB vs 10GB in test case) - -## Configuration Examples - -### Creating a Table with Full-Text Index - -```sql --- Using Bloom backend (recommended for most cases) -CREATE TABLE logs ( - timestamp TIMESTAMP(9) TIME INDEX, - message STRING FULLTEXT INDEX WITH ( - backend = 'bloom', - analyzer = 'English', - case_sensitive = 'false' - ) -); - --- Using Tantivy backend (for high-selectivity queries) -CREATE TABLE logs ( - timestamp TIMESTAMP(9) TIME INDEX, - message STRING FULLTEXT INDEX WITH ( - backend = 'tantivy', - analyzer = 'English', - case_sensitive = 'false' - ) -); -``` - -### Modifying an Existing Table - -```sql --- Enable full-text index on an existing column -ALTER TABLE monitor -MODIFY COLUMN load_15 -SET FULLTEXT INDEX WITH ( - analyzer = 'English', - case_sensitive = 'false', - backend = 'bloom' -); - --- Change full-text index configuration -ALTER TABLE logs -MODIFY COLUMN message -SET FULLTEXT INDEX WITH ( - analyzer = 'English', - case_sensitive = 'false', - backend = 'tantivy' -); -``` diff --git a/docs/user-guide/logs/query-logs.md b/versioned_docs/version-0.17/user-guide/logs/fulltext-search.md similarity index 99% rename from docs/user-guide/logs/query-logs.md rename to versioned_docs/version-0.17/user-guide/logs/fulltext-search.md index cc994f1eb..466392abd 100644 --- a/docs/user-guide/logs/query-logs.md +++ b/versioned_docs/version-0.17/user-guide/logs/fulltext-search.md @@ -3,12 +3,10 @@ keywords: [query logs, pattern matching, matches_term, query statements, log ana description: Provides a guide on using GreptimeDB's query language for effective searching and analysis of log data, including pattern matching and query statements. --- -# Query Logs +# Full-Text Search This document provides a guide on how to use GreptimeDB's query language for effective searching and analysis of log data. -## Overview - GreptimeDB allows for flexible querying of data using SQL statements. This section introduces specific search functions and query statements designed to enhance your log querying capabilities. 
## Pattern Matching Using the `matches_term` Function diff --git a/versioned_docs/version-0.17/user-guide/logs/manage-pipelines.md b/versioned_docs/version-0.17/user-guide/logs/manage-pipelines.md index 7fe9d931a..b870c5e40 100644 --- a/versioned_docs/version-0.17/user-guide/logs/manage-pipelines.md +++ b/versioned_docs/version-0.17/user-guide/logs/manage-pipelines.md @@ -7,14 +7,14 @@ description: Guides on creating, deleting, and managing pipelines in GreptimeDB In GreptimeDB, each `pipeline` is a collection of data processing units used for parsing and transforming the ingested log content. This document provides guidance on creating and deleting pipelines to efficiently manage the processing flow of log data. -For specific pipeline configurations, please refer to the [Pipeline Configuration](pipeline-config.md) documentation. +For specific pipeline configurations, please refer to the [Pipeline Configuration](/reference/pipeline/pipeline-config.md) documentation. ## Authentication The HTTP API for managing pipelines requires authentication. For more information, see the [Authentication](/user-guide/protocols/http.md#authentication) documentation. -## Create a Pipeline +## Upload a Pipeline GreptimeDB provides a dedicated HTTP interface for creating pipelines. Assuming you have prepared a pipeline configuration file `pipeline.yaml`, use the following command to upload the configuration file, where `test` is the name you specify for the pipeline: @@ -28,6 +28,23 @@ curl -X "POST" "http://localhost:4000/v1/pipelines/test" \ The created Pipeline is shared for all databases. +## Pipeline Versions + +You can upload multiple versions of a pipeline with the same name. +Each time you upload a pipeline with an existing name, a new version is created automatically. +You can specify which version to use when [ingesting logs](/reference/pipeline/write-log-api.md#http-api), [querying](#query-pipelines), or [deleting](#delete-a-pipeline) a pipeline. +The last uploaded version is used by default if no version is specified. + +After successfully uploading a pipeline, the response will include version information: + +```json +{"name":"nginx_pipeline","version":"2024-06-27 12:02:34.257312110Z"} +``` + +The version is a timestamp in UTC format that indicates when the pipeline was created. +This timestamp serves as a unique identifier for each pipeline version. + + ## Delete a Pipeline You can use the following HTTP interface to delete a pipeline: @@ -319,3 +336,120 @@ At this point, the Pipeline processing is successful, and the output is as follo ``` It can be seen that the `.` in the string `1998.08` has been replaced with `-`, indicating a successful processing of the Pipeline. + +## Get Table DDL from a Pipeline Configuration + +When using pipelines, GreptimeDB automatically creates target tables upon first data ingestion by default. +However, you may want to manually create tables beforehand to add custom table options, +such as partition rules for better performance. + +While the auto-created table schema is deterministic for a given pipeline configuration, +manually writing the table DDL (Data Definition Language) according to the configuration can be tedious. +The `/ddl` API endpoint simplifies this process. + +For an existing pipeline, you can use the `/v1/pipelines/{pipeline_name}/ddl` endpoint to generate the `CREATE TABLE` SQL. +This API examines the transform definition in the pipeline configuration and infers the appropriate table schema. 
+You can use this API to generate the basic table DDL, fine-tune table options and manually create the table before ingesting data. Some common cases would be: +- Add [partition rules](/user-guide/deployments-administration/manage-data/table-sharding.md) +- Modify [index options](/user-guide/manage-data/data-index.md) +- Add other [table options](/reference/sql/create.md#table-options) + +Here is an example demonstrating how to use this API. Consider the following pipeline configuration: +```YAML +# pipeline.yaml +processors: +- dissect: + fields: + - message + patterns: + - '%{ip_address} - %{username} [%{timestamp}] "%{http_method} %{request_line} %{protocol}" %{status_code} %{response_size}' + ignore_missing: true +- date: + fields: + - timestamp + formats: + - "%d/%b/%Y:%H:%M:%S %z" + +transform: + - fields: + - timestamp + type: time + index: timestamp + - fields: + - ip_address + type: string + index: skipping + - fields: + - username + type: string + tag: true + - fields: + - http_method + type: string + index: inverted + - fields: + - request_line + type: string + index: fulltext + - fields: + - protocol + type: string + - fields: + - status_code + type: int32 + index: inverted + tag: true + - fields: + - response_size + type: int64 + on_failure: default + default: 0 + - fields: + - message + type: string +``` + +First, upload the pipeline to the database using the following command: +```bash +curl -X "POST" "http://localhost:4000/v1/pipelines/pp" -F "file=@pipeline.yaml" +``` +Then, query the table DDL using the following command: +```bash +curl -X "GET" "http://localhost:4000/v1/pipelines/pp/ddl?table=test_table" +``` +The API returns the following output in JSON format: +```JSON +{ + "sql": { + "sql": "CREATE TABLE IF NOT EXISTS `test_table` (\n `timestamp` TIMESTAMP(9) NOT NULL,\n `ip_address` STRING NULL SKIPPING INDEX WITH(false_positive_rate = '0.01', granularity = '10240', type = 'BLOOM'),\n `username` STRING NULL,\n `http_method` STRING NULL INVERTED INDEX,\n `request_line` STRING NULL FULLTEXT INDEX WITH(analyzer = 'English', backend = 'bloom', case_sensitive = 'false', false_positive_rate = '0.01', granularity = '10240'),\n `protocol` STRING NULL,\n `status_code` INT NULL INVERTED INDEX,\n `response_size` BIGINT NULL,\n `message` STRING NULL,\n TIME INDEX (`timestamp`),\n PRIMARY KEY (`username`, `status_code`)\n)\nENGINE=mito\nWITH(\n append_mode = 'true'\n)" + }, + "execution_time_ms": 3 +} +``` +After formatting the `sql` field in the response, you can see the inferred table schema: +```SQL +CREATE TABLE IF NOT EXISTS `test_table` ( + `timestamp` TIMESTAMP(9) NOT NULL, + `ip_address` STRING NULL SKIPPING INDEX WITH(false_positive_rate = '0.01', granularity = '10240', type = 'BLOOM'), + `username` STRING NULL, + `http_method` STRING NULL INVERTED INDEX, + `request_line` STRING NULL FULLTEXT INDEX WITH(analyzer = 'English', backend = 'bloom', case_sensitive = 'false', false_positive_rate = '0.01', granularity = '10240'), + `protocol` STRING NULL, + `status_code` INT NULL INVERTED INDEX, + `response_size` BIGINT NULL, + `message` STRING NULL, + TIME INDEX (`timestamp`), + PRIMARY KEY (`username`, `status_code`) + ) +ENGINE=mito +WITH( + append_mode = 'true' +) +``` + +You can use the inferred table DDL as a starting point. +After customizing the DDL to meet your requirements, execute it manually before ingesting data through the pipeline. + +**Notes:** +1. The API only infers the table schema from the pipeline configuration; it doesn't check if the table already exists. 
+2. The API doesn't account for table suffixes. If you're using `dispatcher`, `table_suffix`, or table suffix hints in your pipeline configuration, you'll need to adjust the table name manually. diff --git a/versioned_docs/version-0.17/user-guide/logs/overview.md b/versioned_docs/version-0.17/user-guide/logs/overview.md index bf424898e..9c9e2aa1e 100644 --- a/versioned_docs/version-0.17/user-guide/logs/overview.md +++ b/versioned_docs/version-0.17/user-guide/logs/overview.md @@ -1,16 +1,108 @@ --- keywords: [log service, quick start, pipeline configuration, manage pipelines, query logs] -description: Provides links to various guides on using GreptimeDB's log service, including quick start, pipeline configuration, managing pipelines, writing logs, querying logs, and full-text index configuration. +description: Comprehensive guide to GreptimeDB's log management capabilities, covering log collection architecture, pipeline processing, integration with popular collectors like Vector and Kafka, and advanced querying with full-text search. --- # Logs -In this chapter, we will walk-through GreptimeDB's features for logs support, -from basic ingestion/query, to advanced transformation, full-text index topics. +GreptimeDB provides a comprehensive log management solution designed for modern observability needs. +It offers seamless integration with popular log collectors, +flexible pipeline processing, +and powerful querying capabilities, including full-text search. -- [Quick Start](./quick-start.md): Provides an introduction on how to quickly get started with GreptimeDB log service. -- [Pipeline Configuration](./pipeline-config.md): Provides in-depth information on each specific configuration of pipelines in GreptimeDB. +Key features include: + +- **Unified Storage**: Store logs alongside metrics and traces in a single database +- **Pipeline Processing**: Transform and enrich raw logs with customizable pipelines, supporting various log collectors and formats +- **Advanced Querying**: SQL-based analysis with full-text search capabilities +- **Real-time Processing**: Process and query logs in real-time for monitoring and alerting + + +## Log Collection Flow + +![log-collection-flow](/log-collection-flow.drawio.svg) + +The diagram above illustrates the comprehensive log collection architecture, +which follows a structured four-stage process: Log Sources, Log Collectors, Pipeline Processing, and Storage in GreptimeDB. + +### Log Sources + +Log sources represent the foundational layer where log data originates within your infrastructure. +GreptimeDB supports ingestion from diverse source types to accommodate comprehensive observability requirements: + +- **Applications**: Application-level logs from microservices architectures, web applications, mobile applications, and custom software components +- **IoT Devices**: Device logs, sensor event logs, and operational status logs from Internet of Things ecosystems +- **Infrastructure**: Cloud platform logs, container orchestration logs (Kubernetes, Docker), load balancer logs, and network infrastructure component logs +- **System Components**: Operating system logs, kernel events, system daemon logs, and hardware monitoring logs +- **Custom Sources**: Any other log sources specific to your environment or applications + +### Log Collectors + +Log collectors are responsible for efficiently gathering log data from diverse sources and reliably forwarding it to the storage backend. 
GreptimeDB seamlessly integrates with industry-standard log collectors, +including Vector, Fluent Bit, Apache Kafka, OpenTelemetry Collector and more. + +GreptimeDB functions as a powerful sink backend for these collectors, +providing robust data ingestion capabilities. +During the ingestion process, +GreptimeDB's pipeline system enables real-time transformation and enrichment of log data, +ensuring optimal structure and quality before storage. + +### Pipeline Processing + +GreptimeDB's pipeline mechanism transforms raw logs into structured, queryable data: + +- **Parse**: Extract structured data from unstructured log messages +- **Transform**: Enrich logs with additional context and metadata +- **Index**: Configure indexes to optimize query performance and enable efficient searching, including full-text indexes, time indexes, and more + +### Storage in GreptimeDB + +After processing through the pipeline, +the logs are stored in GreptimeDB enabling flexible analysis and visualization: + +- **SQL Querying**: Use familiar SQL syntax to analyze log data +- **Time-based Analysis**: Leverage time-series capabilities for temporal analysis +- **Full-text Search**: Perform advanced text searches across log messages +- **Real-time Analytics**: Query logs in real-time for monitoring and alerting + +## Quick Start + +You can quickly get started by using the built-in `greptime_identity` pipeline for log ingestion. +For more information, please refer to the [Quick Start](./quick-start.md) guide. + +## Integrate with Log Collectors + +GreptimeDB integrates seamlessly with various log collectors to provide a comprehensive logging solution. The integration process follows these key steps: + +1. **Select Appropriate Log Collectors**: Choose collectors based on your infrastructure requirements, data sources, and performance needs +2. **Analyze Output Format**: Understand the log format and structure produced by your chosen collector +3. **Configure Pipeline**: Create and configure pipelines in GreptimeDB to parse, transform, and enrich the incoming log data +4. **Store and Query**: Efficiently store processed logs in GreptimeDB for real-time analysis and monitoring + +To successfully integrate your log collector with GreptimeDB, you'll need to: +- First understand how pipelines work in GreptimeDB +- Then configure the sink settings in your log collector to send data to GreptimeDB + +Please refer to the following guides for detailed instructions on integrating GreptimeDB with log collectors: + +- [Vector](/user-guide/ingest-data/for-observability/vector.md#using-greptimedb_logs-sink-recommended) +- [Kafka](/user-guide/ingest-data/for-observability/kafka.md#logs) +- [Fluent Bit](/user-guide/ingest-data/for-observability/fluent-bit.md#http) +- [OpenTelemetry Collector](/user-guide/ingest-data/for-observability/otel-collector.md) +- [Loki](/user-guide/ingest-data/for-observability/loki.md#using-pipeline-with-loki-push-api) + +## Learn More About Pipelines + +- [Using Custom Pipelines](./use-custom-pipelines.md): Explains how to create and use custom pipelines for log ingestion. - [Managing Pipelines](./manage-pipelines.md): Explains how to create and delete pipelines. -- [Writing Logs with Pipelines](./write-logs.md): Provides detailed instructions on efficiently writing log data by leveraging the pipeline mechanism. -- [Query Logs](./query-logs.md): Describes how to query logs using the GreptimeDB SQL interface. 
-- [Full-Text Index Configuration](./fulltext-index-config.md): Describes how to configure full-text index in GreptimeDB. + +## Query Logs + +- [Full-Text Search](./fulltext-search.md): Guide on using GreptimeDB's query language for effective searching and analysis of log data. + +## Reference + +- [Built-in Pipelines](/reference/pipeline/built-in-pipelines.md): Lists and describes the details of the built-in pipelines provided by GreptimeDB for log ingestion. +- [APIs for Writing Logs](/reference/pipeline/write-log-api.md): Describes the HTTP API for writing logs to GreptimeDB. +- [Pipeline Configuration](/reference/pipeline/pipeline-config.md): Provides in-depth information on each specific configuration of pipelines in GreptimeDB. + diff --git a/versioned_docs/version-0.17/user-guide/logs/quick-start.md b/versioned_docs/version-0.17/user-guide/logs/quick-start.md index dfb42b7d3..593d6daf1 100644 --- a/versioned_docs/version-0.17/user-guide/logs/quick-start.md +++ b/versioned_docs/version-0.17/user-guide/logs/quick-start.md @@ -1,333 +1,123 @@ --- -keywords: [quick start, write logs, query logs, pipeline, structured data, log ingestion, log collection, log management tools] -description: A comprehensive guide to quickly writing and querying logs in GreptimeDB, including direct log writing and using pipelines for structured data. +keywords: [logs, log service, pipeline, greptime_identity, quick start, json logs] +description: Quick start guide for GreptimeDB log service, including basic log ingestion using the built-in greptime_identity pipeline and integration with log collectors. --- # Quick Start -This guide provides step-by-step instructions for quickly writing and querying logs in GreptimeDB. +This guide will walk you through the essential steps to get started with GreptimeDB's log service. +You'll learn how to ingest logs using the built-in `greptime_identity` pipeline and integrate with log collectors. -GreptimeDB supports a pipeline mechanism to parse and transform structured log messages into multiple columns for efficient storage and querying. +GreptimeDB provides a powerful pipeline-based log ingestion system. +For quick setup with JSON-formatted logs, +you can use the built-in `greptime_identity` pipeline, which: -For unstructured logs, you can write them directly into a table without utilizing a pipeline. +- Automatically handles field mapping from JSON to table columns +- Creates tables automatically if they don't exist +- Supports flexible schemas for varying log structures +- Requires minimal configuration to get started -## Write logs by Pipeline +## Direct HTTP Ingestion -Pipelines enable automatic parsing and transformation of log messages into multiple columns, -as well as automatic table creation and alteration. +The simplest way to ingest logs into GreptimeDB is through a direct HTTP request using the `greptime_identity` pipeline. -### Write JSON logs using the built-in `greptime_identity` Pipeline - -GreptimeDB offers a built-in pipeline, `greptime_identity`, for handling JSON log formats. This pipeline simplifies the process of writing JSON logs. 
+For example, you can use `curl` to send a POST request with JSON log data: ```shell curl -X POST \ - "http://localhost:4000/v1/ingest?db=public&table=pipeline_logs&pipeline_name=greptime_identity" \ + "http://localhost:4000/v1/ingest?db=public&table=demo_logs&pipeline_name=greptime_identity" \ -H "Content-Type: application/json" \ -H "Authorization: Basic {{authentication}}" \ -d '[ { - "name": "Alice", - "age": 20, - "is_student": true, - "score": 90.5, - "object": { "a": 1, "b": 2 } - }, - { - "age": 21, - "is_student": false, - "score": 85.5, - "company": "A", - "whatever": null + "timestamp": "2024-01-15T10:30:00Z", + "level": "INFO", + "service": "web-server", + "message": "User login successful", + "user_id": 12345, + "ip_address": "192.168.1.100" }, { - "name": "Charlie", - "age": 22, - "is_student": true, - "score": 95.5, - "array": [1, 2, 3] + "timestamp": "2024-01-15T10:31:00Z", + "level": "ERROR", + "service": "database", + "message": "Connection timeout occurred", + "error_code": 500, + "retry_count": 3 } ]' ``` -- [`Authorization`](/user-guide/protocols/http.md#authentication) header. -- `pipeline_name=greptime_identity` specifies the built-in pipeline. -- `table=pipeline_logs` specifies the target table. If the table does not exist, it will be created automatically. - -The `greptime_identity` pipeline automatically creates columns for each field in the JSON log. -A successful command execution returns: - -```json -{"output":[{"affectedrows":3}],"execution_time_ms":9} -``` - -For more details about the `greptime_identity` pipeline, please refer to the [Write Logs](write-logs.md#greptime_identity) document. - -### Write logs using a custom Pipeline - -Custom pipelines allow you to parse and transform log messages into multiple columns based on specific patterns, -and automatically create tables. - -#### Create a Pipeline - -GreptimeDB provides an HTTP interface for creating pipelines. -Here is how to do it: - -First, create a pipeline file, for example, `pipeline.yaml`. - -```yaml -version: 2 -processors: - - dissect: - fields: - - message - patterns: - - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"' - ignore_missing: true - - date: - fields: - - timestamp - formats: - - "%d/%b/%Y:%H:%M:%S %z" - - select: - type: exclude - fields: - - message - -transform: - - fields: - - ip_address - type: string - index: inverted - tag: true - - fields: - - status_code - type: int32 - index: inverted - tag: true - - fields: - - request_line - - user_agent - type: string - index: fulltext - - fields: - - response_size - type: int32 - - fields: - - timestamp - type: time - index: timestamp -``` - -The pipeline splits the message field using the specified pattern to extract the `ip_address`, `timestamp`, `http_method`, `request_line`, `status_code`, `response_size`, and `user_agent`. -It then parses the `timestamp` field using the format` %d/%b/%Y:%H:%M:%S %z` to convert it into a proper timestamp format that the database can understand. -Finally, it converts each field to the appropriate datatype and indexes it accordingly. -Note at the beginning the pipeline is using version 2 format, see [here](./pipeline-config.md#transform-in-version-2) for more details. -In short, the version 2 indicates the pipeline engine to find fields that are not specified in the transform section, and persist them using the default datatype. 
-You can see in the [later section](#differences-between-using-a-pipeline-and-writing-unstructured-logs-directly) that although the `http_method` is not specified in the transform, it is persisted as well. -Also, a `select` processor is used to filter out the original `message` field. -It is worth noting that the `request_line` and `user_agent` fields are indexed as `fulltext` to optimize full-text search queries. -And there must be one time index column specified by the `timestamp`. - -Execute the following command to upload the configuration file: - -```shell -curl -X "POST" \ - "http://localhost:4000/v1/pipelines/nginx_pipeline" \ - -H 'Authorization: Basic {{authentication}}' \ - -F "file=@pipeline.yaml" -``` - -After successfully executing this command, a pipeline named `nginx_pipeline` will be created, and the result will be returned as: - -```json -{"name":"nginx_pipeline","version":"2024-06-27 12:02:34.257312110Z"}. -``` - -You can create multiple versions for the same pipeline name. -All pipelines are stored at the `greptime_private.pipelines` table. -Please refer to [Query Pipelines](manage-pipelines.md#query-pipelines) to view the pipeline data in the table. - -#### Write logs - -The following example writes logs to the `custom_pipeline_logs` table and uses the `nginx_pipeline` pipeline to format and transform the log messages. - -```shell -curl -X POST \ - "http://localhost:4000/v1/ingest?db=public&table=custom_pipeline_logs&pipeline_name=nginx_pipeline" \ - -H "Content-Type: application/json" \ - -H "Authorization: Basic {{authentication}}" \ - -d '[ - { - "message": "127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\"" - }, - { - "message": "192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\"" - }, - { - "message": "10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\"" - }, - { - "message": "172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\"" - } - ]' -``` +The key parameters are: -You will see the following output if the command is successful: +- `db=public`: Target database name (use your database name) +- `table=demo_logs`: Target table name (created automatically if it doesn't exist) +- `pipeline_name=greptime_identity`: Uses `greptime_identity` identity pipeline for JSON processing +- `Authorization` header: Basic authentication with base64-encoded `username:password`, see the [HTTP Authentication Guide](/user-guide/protocols/http.md#authentication) +A successful request returns: ```json -{"output":[{"affectedrows":4}],"execution_time_ms":79} -``` - -## Write unstructured logs directly - -When your log messages are unstructured text, -you can write them directly to the database. -However, this method limits the ability to perform high-performance analysis. - -### Create a table for unstructured logs - -You need to create a table to store the logs before inserting. 
-Use the following SQL statement to create a table named `origin_logs`: - -* The `FULLTEXT INDEX` on the `message` column optimizes text search queries -* Setting `append_mode` to `true` optimizes log insertion by only appending new rows to the table - -```sql -CREATE TABLE `origin_logs` ( - `message` STRING FULLTEXT INDEX, - `time` TIMESTAMP TIME INDEX -) WITH ( - append_mode = 'true' -); -``` - -### Insert logs - -#### Write logs using the SQL protocol - -Use the `INSERT` statement to insert logs into the table. - -```sql -INSERT INTO origin_logs (message, time) VALUES -('127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"', '2024-05-25 20:16:37.217'), -('192.168.1.1 - - [25/May/2024:20:17:37 +0000] "POST /api/login HTTP/1.1" 200 1784 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"', '2024-05-25 20:17:37.217'), -('10.0.0.1 - - [25/May/2024:20:18:37 +0000] "GET /images/logo.png HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"', '2024-05-25 20:18:37.217'), -('172.16.0.1 - - [25/May/2024:20:19:37 +0000] "GET /contact HTTP/1.1" 404 162 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1"', '2024-05-25 20:19:37.217'); -``` - -The above SQL inserts the entire log text into a single column, -and you must add an extra timestamp for each log. - -#### Write logs using the gRPC protocol - -You can also write logs using the gRPC protocol, which is a more efficient method. - -Refer to [Write Data Using gRPC](/user-guide/ingest-data/for-iot/grpc-sdks/overview.md) to learn how to write logs using the gRPC protocol. - -## Differences between using a pipeline and writing unstructured logs directly - -In the above examples, the table `custom_pipeline_logs` is automatically created by writing logs using pipeline, -and the table `origin_logs` is created by writing logs directly. -Let's explore the differences between these two tables. - -```sql -DESC custom_pipeline_logs; -``` - -```sql -+---------------+---------------------+------+------+---------+---------------+ -| Column | Type | Key | Null | Default | Semantic Type | -+---------------+---------------------+------+------+---------+---------------+ -| ip_address | String | PRI | YES | | TAG | -| status_code | Int32 | PRI | YES | | TAG | -| request_line | String | | YES | | FIELD | -| user_agent | String | | YES | | FIELD | -| response_size | Int32 | | YES | | FIELD | -| timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP | -| http_method | String | | YES | | FIELD | -+---------------+---------------------+------+------+---------+---------------+ -7 rows in set (0.00 sec) +{ + "output": [{"affectedrows": 2}], + "execution_time_ms": 15 +} ``` +After successful ingestion, +the corresponding table `demo_logs` is automatically created with columns based on the JSON fields. 
+The schema is as follows: ```sql -DESC origin_logs; ++--------------------+---------------------+------+------+---------+---------------+ +| Column | Type | Key | Null | Default | Semantic Type | ++--------------------+---------------------+------+------+---------+---------------+ +| greptime_timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP | +| ip_address | String | | YES | | FIELD | +| level | String | | YES | | FIELD | +| message | String | | YES | | FIELD | +| service | String | | YES | | FIELD | +| timestamp | String | | YES | | FIELD | +| user_id | Int64 | | YES | | FIELD | +| error_code | Int64 | | YES | | FIELD | +| retry_count | Int64 | | YES | | FIELD | ++--------------------+---------------------+------+------+---------+---------------+ ``` -```sql -+---------+----------------------+------+------+---------+---------------+ -| Column | Type | Key | Null | Default | Semantic Type | -+---------+----------------------+------+------+---------+---------------+ -| message | String | | YES | | FIELD | -| time | TimestampMillisecond | PRI | NO | | TIMESTAMP | -+---------+----------------------+------+------+---------+---------------+ -``` - -From the table structure, you can see that the `origin_logs` table has only two columns, -with the entire log message stored in a single column. -The `custom_pipeline_logs` table stores the log message in multiple columns. - -It is recommended to use the pipeline method to split the log message into multiple columns, which offers the advantage of explicitly querying specific values within certain columns. Column matching query proves superior to full-text searching for several key reasons: - -- **Performance Efficiency**: Column matching query is typically faster than full-text searching. -- **Resource Consumption**: Due to GreptimeDB's columnar storage engine, structured data is more conducive to compression. Additionally, the inverted index used for tag matching query typically consumes significantly fewer resources than a full-text index, especially in terms of storage size. -- **Maintainability**: Tag matching query is straightforward and easier to understand, write, and debug. - -Of course, if you need keyword searching within large text blocks, you must use full-text searching as it is specifically designed for that purpose. - -## Query logs +## Integration with Log Collectors -We use the `custom_pipeline_logs` table as an example to query logs. +For production environments, +you'll typically use log collectors to automatically forward logs to GreptimeDB. +Here is an example about how to configure Vector to send logs to GreptimeDB using the `greptime_identity` pipeline: -### Query logs by tags - -With the multiple tag columns in `custom_pipeline_logs`, -you can query data by tags flexibly. -For example, query the logs with `status_code` 200 and `http_method` GET. 
- -```sql -SELECT * FROM custom_pipeline_logs WHERE status_code = 200 AND http_method = 'GET'; -``` - -```sql -+------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -| ip_address | status_code | request_line | user_agent | response_size | timestamp | http_method | -+------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -| 127.0.0.1 | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 | GET | -+------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -1 row in set (0.02 sec) +```toml +[sinks.my_sink_id] +type = "greptimedb_logs" +dbname = "public" +endpoint = "http://:4000" +pipeline_name = "greptime_identity" +table = "
" +username = "" +password = "" +# Additional configurations as needed ``` -### Full-Text Search +The key configuration parameters are: +- `type = "greptimedb_logs"`: Specifies the GreptimeDB logs sink +- `dbname`: Target database name +- `endpoint`: GreptimeDB HTTP endpoint +- `pipeline_name`: Uses `greptime_identity` pipeline for JSON processing +- `table`: Target table name (created automatically if it doesn't exist) +- `username` and `password`: Credentials for HTTP Basic Authentication -For the text fields `request_line` and `user_agent`, you can use `matches_term` function to search logs. -Remember, we created the full-text index for these two columns when [creating a pipeline](#create-a-pipeline). \ -This allows for high-performance full-text searches. +For details about the Vector configuration and options, +refer to the [Vector Integration Guide](/user-guide/ingest-data/for-observability/vector.md#using-greptimedb_logs-sink-recommended). -For example, query the logs with `request_line` containing `/index.html` or `/api/login`. -```sql -SELECT * FROM custom_pipeline_logs WHERE matches_term(request_line, '/index.html') OR matches_term(request_line, '/api/login'); -``` - -```sql -+-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -| ip_address | status_code | request_line | user_agent | response_size | timestamp | http_method | -+-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -| 127.0.0.1 | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 | GET | -| 192.168.1.1 | 200 | /api/login HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 | 1784 | 2024-05-25 20:17:37 | POST | -+-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ -2 rows in set (0.00 sec) -``` +## Next Steps -You can refer to the [Query Logs](query-logs.md) document for detailed usage of the `matches_term` function. +You've successfully ingested your first logs, here are the recommended next steps: -## Next steps +- **Learn more about the behaviours of built-in Pipelines**: Refer to the [Built-in Pipelines](/reference/pipeline/built-in-pipelines.md) guide for detailed information on available built-in pipelines and their configurations. +- **Integrate with Popular Log Collectors**: For detailed instructions on integrating GreptimeDB with various log collectors like Fluent Bit, Fluentd, and others, refer to the [Integrate with Popular Log Collectors](./overview.md#integrate-with-log-collectors) section in the [Logs Overview](./overview.md) guide. +- **Using Custom Pipelines**: To learn more about creating custom pipelines for advanced log processing and transformation, refer to the [Using Custom Pipelines](./use-custom-pipelines.md) guide. -You have now experienced GreptimeDB's logging capabilities. 
-
-You can explore further by following the documentation below:
-- [Pipeline Configuration](./pipeline-config.md): Provides in-depth information on each specific configuration of pipelines in GreptimeDB.
-- [Managing Pipelines](./manage-pipelines.md): Explains how to create and delete pipelines.
-- [Writing Logs with Pipelines](./write-logs.md): Provides detailed instructions on efficiently writing log data by leveraging the pipeline mechanism.
-- [Query Logs](./query-logs.md): Describes how to query logs using the GreptimeDB SQL interface.
diff --git a/versioned_docs/version-0.17/user-guide/logs/use-custom-pipelines.md b/versioned_docs/version-0.17/user-guide/logs/use-custom-pipelines.md
new file mode 100644
index 000000000..951934823
--- /dev/null
+++ b/versioned_docs/version-0.17/user-guide/logs/use-custom-pipelines.md
@@ -0,0 +1,317 @@
+---
+keywords: [quick start, write logs, query logs, pipeline, structured data, log ingestion, log collection, log management tools]
+description: A comprehensive guide to quickly writing and querying logs in GreptimeDB, including direct log writing and using pipelines for structured data.
+---
+
+# Using Custom Pipelines
+
+GreptimeDB automatically parses and transforms logs into structured,
+multi-column data based on your pipeline configuration.
+When built-in pipelines cannot handle your specific log format,
+you can create custom pipelines to define exactly how your log data should be parsed and transformed.
+
+## Identify Your Original Log Format
+
+Before creating a custom pipeline, it's essential to understand the format of your original log data.
+If you're using log collectors and aren't sure about the log format,
+there are two ways to examine your logs:
+
+1. **Read the collector's official documentation**: Configure your collector to output data to the console or a file to inspect the log format.
+2. **Use the `greptime_identity` pipeline**: Ingest sample logs directly into GreptimeDB using the built-in `greptime_identity` pipeline.
+   The `greptime_identity` pipeline treats the entire text log as a single `message` field,
+   which makes it very convenient to see the raw log content directly.
+
+Once you understand the log format you want to process,
+you can create a custom pipeline.
+This document uses the following Nginx access log entry as an example:
+
+```txt
+127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
+```
+
+## Create a Custom Pipeline
+
+GreptimeDB provides an HTTP interface for creating pipelines.
+Here's how to create one.
+
+First, create an example pipeline configuration file to process Nginx access logs,
+naming it `pipeline.yaml`:
+
+```yaml
+version: 2
+processors:
+  - dissect:
+      fields:
+        - message
+      patterns:
+        - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"'
+      ignore_missing: true
+  - date:
+      fields:
+        - timestamp
+      formats:
+        - "%d/%b/%Y:%H:%M:%S %z"
+  - select:
+      type: exclude
+      fields:
+        - message
+  - vrl:
+      source: |
+        .greptime_table_ttl = "7d"
+        .
+
+transform:
+  - fields:
+      - ip_address
+    type: string
+    index: inverted
+    tag: true
+  - fields:
+      - status_code
+    type: int32
+    index: inverted
+    tag: true
+  - fields:
+      - request_line
+      - user_agent
+    type: string
+    index: fulltext
+  - fields:
+      - response_size
+    type: int32
+  - fields:
+      - timestamp
+    type: time
+    index: timestamp
+```
+
+The pipeline configuration above uses the [version 2](/reference/pipeline/pipeline-config.md#transform-in-version-2) format
+and contains `processors` and `transform` sections that work together to structure your log data:
+
+**Processors**: Used to preprocess log data before transformation:
+- **Data Extraction**: The `dissect` processor uses pattern matching to parse the `message` field and extract structured data including `ip_address`, `timestamp`, `http_method`, `request_line`, `status_code`, `response_size`, and `user_agent`.
+- **Timestamp Processing**: The `date` processor parses the extracted `timestamp` field using the format `%d/%b/%Y:%H:%M:%S %z` and converts it to a proper timestamp data type.
+- **Field Selection**: The `select` processor excludes the original `message` field from the final output while retaining all other fields.
+- **Table Options**: The `vrl` processor sets table options, optionally based on the extracted fields, such as adding a suffix to the table name or setting a TTL. Here, the `greptime_table_ttl = "7d"` line configures the table data to have a time-to-live of 7 days.
+
+**Transform**: Defines how to convert and index the extracted fields:
+- **Field Transformation**: Each extracted field is converted to its appropriate data type with specific indexing configurations. Fields like `http_method` retain their default data types when no explicit configuration is provided.
+- **Indexing Strategy**:
+  - `ip_address` and `status_code` use inverted indexing as tags for fast filtering
+  - `request_line` and `user_agent` use full-text indexing for optimal text search capabilities
+  - `timestamp` serves as the required time index column
+
+For detailed information about pipeline configuration options,
+please refer to the [Pipeline Configuration](/reference/pipeline/pipeline-config.md) documentation.
+
+## Upload the Pipeline
+
+Execute the following command to upload the pipeline configuration:
+
+```shell
+curl -X "POST" \
+  "http://localhost:4000/v1/pipelines/nginx_pipeline" \
+  -H 'Authorization: Basic {{authentication}}' \
+  -F "file=@pipeline.yaml"
+```
+
+After successful execution, a pipeline named `nginx_pipeline` will be created, and the following result will be returned:
+
+```json
+{"name":"nginx_pipeline","version":"2024-06-27 12:02:34.257312110Z"}
+```
+
+You can create multiple versions for the same pipeline name.
+All pipelines are stored in the `greptime_private.pipelines` table.
+Refer to [Query Pipelines](manage-pipelines.md#query-pipelines) to view pipeline data.
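+
+To pin a specific pipeline version at ingestion time, pass the version string returned above in the `version` query parameter of the write API. A minimal sketch, assuming the example version from the response (URL-encoded):
+
+```shell
+curl -X POST \
+  "http://localhost:4000/v1/ingest?db=public&table=custom_pipeline_logs&pipeline_name=nginx_pipeline&version=2024-06-27%2012%3A02%3A34.257312110Z" \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Basic {{authentication}}" \
+  -d '[{"message": "..."}]'
+```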
+ +## Ingest Logs Using the Pipeline + +The following example writes logs to the `custom_pipeline_logs` table using the `nginx_pipeline` pipeline to format and transform the log messages: + +```shell +curl -X POST \ + "http://localhost:4000/v1/ingest?db=public&table=custom_pipeline_logs&pipeline_name=nginx_pipeline" \ + -H "Content-Type: application/json" \ + -H "Authorization: Basic {{authentication}}" \ + -d '[ + { + "message": "127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\"" + }, + { + "message": "192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\"" + }, + { + "message": "10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\"" + }, + { + "message": "172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\"" + } + ]' +``` + +The command will return the following output upon success: + +```json +{"output":[{"affectedrows":4}],"execution_time_ms":79} +``` + +The `custom_pipeline_logs` table content is automatically created based on the pipeline configuration: + +```sql ++-------------+-------------+-------------+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+ +| ip_address | http_method | status_code | request_line | user_agent | response_size | timestamp | ++-------------+-------------+-------------+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+ +| 10.0.0.1 | GET | 304 | /images/logo.png HTTP/1.1 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0 | 0 | 2024-05-25 20:18:37 | +| 127.0.0.1 | GET | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 | +| 172.16.0.1 | GET | 404 | /contact HTTP/1.1 | Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1 | 162 | 2024-05-25 20:19:37 | +| 192.168.1.1 | POST | 200 | /api/login HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 | 1784 | 2024-05-25 20:17:37 | ++-------------+-------------+-------------+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+ +``` +For more detailed information about the log ingestion API endpoint `/ingest`, +including additional parameters and configuration options, +please refer to the [APIs for Writing Logs](/reference/pipeline/write-log-api.md) documentation. + +## Query Logs + +We use the `custom_pipeline_logs` table as an example to query logs. 
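+
+Ordinary SQL predicates also work on the structured columns. For example, a simple time-range scan over the sample data ingested above (the bounds are illustrative):
+
+```sql
+SELECT ip_address, http_method, status_code, request_line
+FROM custom_pipeline_logs
+WHERE `timestamp` >= '2024-05-25 20:16:00' AND `timestamp` < '2024-05-25 20:20:00';
+```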
+ +### Query logs by tags + +With the multiple tag columns in `custom_pipeline_logs`, +you can query data by tags flexibly. +For example, query the logs with `status_code` 200 and `http_method` GET. + +```sql +SELECT * FROM custom_pipeline_logs WHERE status_code = 200 AND http_method = 'GET'; +``` + +```sql ++------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ +| ip_address | status_code | request_line | user_agent | response_size | timestamp | http_method | ++------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ +| 127.0.0.1 | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 | GET | ++------------+-------------+----------------------+---------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ +1 row in set (0.02 sec) +``` + +### Full‑Text Search + +For the text fields `request_line` and `user_agent`, you can use `matches_term` function to search logs. +Remember, we created the full-text index for these two columns when [creating a pipeline](#create-a-pipeline). +This allows for high-performance full-text searches. + +For example, query the logs with `request_line` containing `/index.html` or `/api/login`. + +```sql +SELECT * FROM custom_pipeline_logs WHERE matches_term(request_line, '/index.html') OR matches_term(request_line, '/api/login'); +``` + +```sql ++-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ +| ip_address | status_code | request_line | user_agent | response_size | timestamp | http_method | ++-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ +| 127.0.0.1 | 200 | /index.html HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | 612 | 2024-05-25 20:16:37 | GET | +| 192.168.1.1 | 200 | /api/login HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 | 1784 | 2024-05-25 20:17:37 | POST | ++-------------+-------------+----------------------+--------------------------------------------------------------------------------------------------------------------------+---------------+---------------------+-------------+ +2 rows in set (0.00 sec) +``` + +You can refer to the [Full-Text Search](fulltext-search.md) document for detailed usage of the `matches_term` function. + + +## Benefits of Using Pipelines + +Using pipelines to process logs provides structured data and automatic field extraction, +enabling more efficient querying and analysis. + +You can also write logs directly to the database without pipelines, +but this approach limits high-performance analysis capabilities. 
+ +### Direct Log Insertion (Without Pipeline) + +For comparison, you can create a table to store original log messages: + +```sql +CREATE TABLE `origin_logs` ( + `message` STRING FULLTEXT INDEX, + `time` TIMESTAMP TIME INDEX +) WITH ( + append_mode = 'true' +); +``` + +Use the `INSERT` statement to insert logs into the table. +Note that you need to manually add a timestamp field for each log: + +```sql +INSERT INTO origin_logs (message, time) VALUES +('127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"', '2024-05-25 20:16:37.217'), +('192.168.1.1 - - [25/May/2024:20:17:37 +0000] "POST /api/login HTTP/1.1" 200 1784 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"', '2024-05-25 20:17:37.217'), +('10.0.0.1 - - [25/May/2024:20:18:37 +0000] "GET /images/logo.png HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"', '2024-05-25 20:18:37.217'), +('172.16.0.1 - - [25/May/2024:20:19:37 +0000] "GET /contact HTTP/1.1" 404 162 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1"', '2024-05-25 20:19:37.217'); +``` + +### Schema Comparison: Pipeline vs Raw + +In the above examples, the table `custom_pipeline_logs` is automatically created by writing logs using pipeline, +and the table `origin_logs` is created by writing logs directly. +Let's explore the differences between these two tables. + +```sql +DESC custom_pipeline_logs; +``` + +```sql ++---------------+---------------------+------+------+---------+---------------+ +| Column | Type | Key | Null | Default | Semantic Type | ++---------------+---------------------+------+------+---------+---------------+ +| ip_address | String | PRI | YES | | TAG | +| status_code | Int32 | PRI | YES | | TAG | +| request_line | String | | YES | | FIELD | +| user_agent | String | | YES | | FIELD | +| response_size | Int32 | | YES | | FIELD | +| timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP | +| http_method | String | | YES | | FIELD | ++---------------+---------------------+------+------+---------+---------------+ +7 rows in set (0.00 sec) +``` + +```sql +DESC origin_logs; +``` + +```sql ++---------+----------------------+------+------+---------+---------------+ +| Column | Type | Key | Null | Default | Semantic Type | ++---------+----------------------+------+------+---------+---------------+ +| message | String | | YES | | FIELD | +| time | TimestampMillisecond | PRI | NO | | TIMESTAMP | ++---------+----------------------+------+------+---------+---------------+ +``` + +Comparing the table structures shows the key differences: + +The `custom_pipeline_logs` table (created with pipeline) automatically structures log data into multiple columns: +- `ip_address`, `status_code` as indexed tags for fast filtering +- `request_line`, `user_agent` with full-text indexing for text search +- `response_size`, `http_method` as regular fields +- `timestamp` as the time index + +The `origin_logs` table (direct insertion) stores everything in a single `message` column. + +### Why Use Pipelines? + +It is recommended to use the pipeline method to split the log message into multiple columns, +which offers the advantage of explicitly querying specific values within certain columns. 
+Column matching query proves superior to full-text searching for several key reasons: + +- **Performance**: Column-based queries are typically faster than full-text searches +- **Storage Efficiency**: GreptimeDB's columnar storage compresses structured data better; inverted indexes for tags consume less storage than full-text indexes +- **Query Simplicity**: Tag-based queries are easier to write, understand, and debug + +## Next Steps + +- **Full-Text Search**: Explore the [Full-Text Search](fulltext-search.md) guide to learn advanced text search capabilities and query techniques in GreptimeDB +- **Pipeline Configuration**: Explore the [Pipeline Configuration](/reference/pipeline/pipeline-config.md) documentation to learn more about creating and customizing pipelines for various log formats and processing needs + diff --git a/versioned_docs/version-0.17/user-guide/logs/write-logs.md b/versioned_docs/version-0.17/user-guide/logs/write-logs.md deleted file mode 100644 index 61c91542d..000000000 --- a/versioned_docs/version-0.17/user-guide/logs/write-logs.md +++ /dev/null @@ -1,347 +0,0 @@ ---- -keywords: [write logs, HTTP interface, log formats, request parameters, JSON logs] -description: Describes how to write logs to GreptimeDB using a pipeline via the HTTP interface, including supported formats and request parameters. ---- - -# Writing Logs Using a Pipeline - -This document describes how to write logs to GreptimeDB by processing them through a specified pipeline using the HTTP interface. - -Before writing logs, please read the [Pipeline Configuration](pipeline-config.md) and [Managing Pipelines](manage-pipelines.md) documents to complete the configuration setup and upload. - -## HTTP API - -You can use the following command to write logs via the HTTP interface: - -```shell -curl -X "POST" "http://localhost:4000/v1/ingest?db=&table=&pipeline_name=&version=" \ - -H "Content-Type: application/x-ndjson" \ - -H "Authorization: Basic {{authentication}}" \ - -d "$" -``` - -## Request parameters - -This interface accepts the following parameters: - -- `db`: The name of the database. -- `table`: The name of the table. -- `pipeline_name`: The name of the [pipeline](./pipeline-config.md). -- `version`: The version of the pipeline. Optional, default use the latest one. - -## `Content-Type` and body format - -GreptimeDB uses `Content-Type` header to decide how to decode the payload body. Currently the following two format is supported: -- `application/json`: this includes normal JSON format and NDJSON format. -- `application/x-ndjson`: specifically uses NDJSON format, which will try to split lines and parse for more accurate error checking. -- `text/plain`: multiple log lines separated by line breaks. 
- -### `application/json` and `application/x-ndjson` format - -Here is an example of JSON format body payload - -```JSON -[ - {"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""}, - {"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""}, - {"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""}, - {"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""} -] -``` - -Note the whole JSON is an array (log lines). Each JSON object represents one line to be processed by Pipeline engine. - -The name of the key in JSON objects, which is `message` here, is used as field name in Pipeline processors. For example: - -```yaml -processors: - - dissect: - fields: - # `message` is the key in JSON object - - message - patterns: - - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"' - ignore_missing: true - -# rest of the file is ignored -``` - -We can also rewrite the payload into NDJSON format like following: - -```JSON -{"message":"127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\""} -{"message":"192.168.1.1 - - [25/May/2024:20:17:37 +0000] \"POST /api/login HTTP/1.1\" 200 1784 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\""} -{"message":"10.0.0.1 - - [25/May/2024:20:18:37 +0000] \"GET /images/logo.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0\""} -{"message":"172.16.0.1 - - [25/May/2024:20:19:37 +0000] \"GET /contact HTTP/1.1\" 404 162 \"-\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\""} -``` - -Note the outer array is eliminated, and lines are separated by line breaks instead of `,`. - -### `text/plain` format - -Log in plain text format is widely used throughout the ecosystem. GreptimeDB also supports `text/plain` format as log data input, enabling ingesting logs first hand from log producers. 
- -The equivalent body payload of previous example is like following: - -```plain -127.0.0.1 - - [25/May/2024:20:16:37 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" -192.168.1.1 - - [25/May/2024:20:17:37 +0000] "POST /api/login HTTP/1.1" 200 1784 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36" -10.0.0.1 - - [25/May/2024:20:18:37 +0000] "GET /images/logo.png HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0" -172.16.0.1 - - [25/May/2024:20:19:37 +0000] "GET /contact HTTP/1.1" 404 162 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1" -``` - -Sending log ingestion request to GreptimeDB requires only modifying the `Content-Type` header to be `text/plain`, and you are good to go! - -Please note that, unlike JSON format, where the input data already have key names as field names to be used in Pipeline processors, `text/plain` format just gives the whole line as input to the Pipeline engine. In this case we use `message` as the field name to refer to the input line, for example: - -```yaml -processors: - - dissect: - fields: - # use `message` as the field name - - message - patterns: - - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}" %{status_code} %{response_size} "-" "%{user_agent}"' - ignore_missing: true - -# rest of the file is ignored -``` - -It is recommended to use `dissect` or `regex` processor to split the input line into fields first and then process the fields accordingly. - -## Built-in Pipelines - -GreptimeDB offers built-in pipelines for common log formats, allowing you to use them directly without creating new pipelines. - -Note that the built-in pipelines are not editable. Additionally, the "greptime_" prefix of the pipeline name is reserved. - -### `greptime_identity` - -The `greptime_identity` pipeline is designed for writing JSON logs and automatically creates columns for each field in the JSON log. - -- The first-level keys in the JSON log are used as column names. -- An error is returned if the same field has different types. -- Fields with `null` values are ignored. -- If time index is not specified, an additional column, `greptime_timestamp`, is added to the table as the time index to indicate when the log was written. - -#### Type conversion rules - -- `string` -> `string` -- `number` -> `int64` or `float64` -- `boolean` -> `bool` -- `null` -> ignore -- `array` -> `json` -- `object` -> `json` - - -For example, if we have the following json data: - -```json -[ - {"name": "Alice", "age": 20, "is_student": true, "score": 90.5,"object": {"a":1,"b":2}}, - {"age": 21, "is_student": false, "score": 85.5, "company": "A" ,"whatever": null}, - {"name": "Charlie", "age": 22, "is_student": true, "score": 95.5,"array":[1,2,3]} -] -``` - -We'll merge the schema for each row of this batch to get the final schema. 
The table schema will be: - -```sql -mysql> desc pipeline_logs; -+--------------------+---------------------+------+------+---------+---------------+ -| Column | Type | Key | Null | Default | Semantic Type | -+--------------------+---------------------+------+------+---------+---------------+ -| age | Int64 | | YES | | FIELD | -| is_student | Boolean | | YES | | FIELD | -| name | String | | YES | | FIELD | -| object | Json | | YES | | FIELD | -| score | Float64 | | YES | | FIELD | -| company | String | | YES | | FIELD | -| array | Json | | YES | | FIELD | -| greptime_timestamp | TimestampNanosecond | PRI | NO | | TIMESTAMP | -+--------------------+---------------------+------+------+---------+---------------+ -8 rows in set (0.00 sec) -``` - -The data will be stored in the table as follows: - -```sql -mysql> select * from pipeline_logs; -+------+------------+---------+---------------+-------+---------+---------+----------------------------+ -| age | is_student | name | object | score | company | array | greptime_timestamp | -+------+------------+---------+---------------+-------+---------+---------+----------------------------+ -| 22 | 1 | Charlie | NULL | 95.5 | NULL | [1,2,3] | 2024-10-18 09:35:48.333020 | -| 21 | 0 | NULL | NULL | 85.5 | A | NULL | 2024-10-18 09:35:48.333020 | -| 20 | 1 | Alice | {"a":1,"b":2} | 90.5 | NULL | NULL | 2024-10-18 09:35:48.333020 | -+------+------------+---------+---------------+-------+---------+---------+----------------------------+ -3 rows in set (0.01 sec) -``` - -#### Specify time index - -A time index is necessary in GreptimeDB. Since the `greptime_identity` pipeline does not require a YAML configuration, you must set the time index in the query parameters if you want to use the timestamp from the log data instead of the automatically generated timestamp when the data arrives. - -Example of Incoming Log Data: -```JSON -[ - {"action": "login", "ts": 1742814853} -] -``` - -To instruct the server to use ts as the time index, set the following query parameter in the HTTP header: -```shell -curl -X "POST" "http://localhost:4000/v1/ingest?db=public&table=pipeline_logs&pipeline_name=greptime_identity&custom_time_index=ts;epoch;s" \ - -H "Content-Type: application/json" \ - -H "Authorization: Basic {{authentication}}" \ - -d $'[{"action": "login", "ts": 1742814853}]' -``` - -The `custom_time_index` parameter accepts two formats, depending on the input data format: -- Epoch number format: `;epoch;` - - The field can be an integer or a string. - - The resolution must be one of: `s`, `ms`, `us`, or `ns`. -- Date string format: `;datestr;` - - For example, if the input data contains a timestamp like `2025-03-24 19:31:37+08:00`, the corresponding format should be `%Y-%m-%d %H:%M:%S%:z`. - -With the configuration above, the resulting table will correctly use the specified log data field as the time index. 
-```sql -DESC pipeline_logs; -``` -```sql -+--------+-----------------+------+------+---------+---------------+ -| Column | Type | Key | Null | Default | Semantic Type | -+--------+-----------------+------+------+---------+---------------+ -| ts | TimestampSecond | PRI | NO | | TIMESTAMP | -| action | String | | YES | | FIELD | -+--------+-----------------+------+------+---------+---------------+ -2 rows in set (0.02 sec) -``` - -Here are some example of using `custom_time_index` assuming the time variable is named `input_ts`: -- 1742814853: `custom_time_index=input_ts;epoch;s` -- 1752749137000: `custom_time_index=input_ts;epoch;ms` -- "2025-07-17T10:00:00+0800": `custom_time_index=input_ts;datestr;%Y-%m-%dT%H:%M:%S%z` -- "2025-06-27T15:02:23.082253908Z": `custom_time_index=input_ts;datestr;%Y-%m-%dT%H:%M:%S%.9f%#z` - - -#### Flatten JSON objects - -If flattening a JSON object into a single-level structure is needed, add the `x-greptime-pipeline-params` header to the request and set `flatten_json_object` to `true`. - -Here is a sample request: - -```shell -curl -X "POST" "http://localhost:4000/v1/ingest?db=&table=&pipeline_name=greptime_identity&version=" \ - -H "Content-Type: application/x-ndjson" \ - -H "Authorization: Basic {{authentication}}" \ - -H "x-greptime-pipeline-params: flatten_json_object=true" \ - -d "$" -``` - -With this configuration, GreptimeDB will automatically flatten each field of the JSON object into separate columns. For example: - -```JSON -{ - "a": { - "b": { - "c": [1, 2, 3] - } - }, - "d": [ - "foo", - "bar" - ], - "e": { - "f": [7, 8, 9], - "g": { - "h": 123, - "i": "hello", - "j": { - "k": true - } - } - } -} -``` - -Will be flattened to: - -```json -{ - "a.b.c": [1,2,3], - "d": ["foo","bar"], - "e.f": [7,8,9], - "e.g.h": 123, - "e.g.i": "hello", - "e.g.j.k": true -} -``` - - - -## Variable hints in the pipeline context - -Starting from `v0.15`, the pipeline engine now recognizes certain variables, and can set corresponding table options based on the value of the variables. -Combined with the `vrl` processor, it's now easy to create and set table options during the pipeline execution based on input data. - -Here is a list of supported common table option variables: -- `greptime_auto_create_table` -- `greptime_ttl` -- `greptime_append_mode` -- `greptime_merge_mode` -- `greptime_physical_table` -- `greptime_skip_wal` -You can find the explanation [here](/reference/sql/create.md#table-options). - -Here are some pipeline specific variables: -- `greptime_table_suffix`: add suffix to the destined table name. - -Let's use the following pipeline file to demonstrate: -```YAML -processors: - - date: - field: time - formats: - - "%Y-%m-%d %H:%M:%S%.3f" - ignore_missing: true - - vrl: - source: | - .greptime_table_suffix, err = "_" + .id - .greptime_table_ttl = "1d" - . -``` - -In the vrl script, we set the table suffix variable with the input field `.id`(leading with an underscore), and set the ttl to `1d`. -Then we run the ingestion using the following JSON data. - -```JSON -{ - "id": "2436", - "time": "2024-05-25 20:16:37.217" -} -``` - -Assuming the given table name being `d_table`, the final table name would be `d_table_2436` as we would expected. -The table is also set with a ttl of 1 day. - -## Examples - -Please refer to the "Writing Logs" section in the [Quick Start](quick-start.md#write-logs) guide for examples. 
- -## Append Only - -By default, logs table created by HTTP ingestion API are in [append only -mode](/user-guide/deployments-administration/performance-tuning/design-table.md#when-to-use-append-only-tables). - -## Skip Errors with skip_error - -If you want to skip errors when writing logs, you can add the `skip_error` parameter to the HTTP request's query params. For example: - -```shell -curl -X "POST" "http://localhost:4000/v1/ingest?db=&table=&pipeline_name=&version=&skip_error=true" \ - -H "Content-Type: application/x-ndjson" \ - -H "Authorization: Basic {{authentication}}" \ - -d "$" -``` - -With this, GreptimeDB will skip the log entry when an error is encountered and continue processing the remaining logs. The entire request will not fail due to an error in a single log entry. \ No newline at end of file diff --git a/versioned_docs/version-0.17/user-guide/manage-data/data-index.md b/versioned_docs/version-0.17/user-guide/manage-data/data-index.md index c50ebacba..c1c38a58c 100644 --- a/versioned_docs/version-0.17/user-guide/manage-data/data-index.md +++ b/versioned_docs/version-0.17/user-guide/manage-data/data-index.md @@ -1,6 +1,6 @@ --- -keywords: [index, inverted index, skipping index, fulltext index, query performance] -description: Learn about different types of indexes in GreptimeDB, including inverted index, skipping index, and fulltext index, and how to use them effectively to optimize query performance. +keywords: [index, inverted index, skipping index, full-text index, query performance] +description: Learn about different types of indexes in GreptimeDB, including inverted index, skipping index, and full-text index, and how to use them effectively to optimize query performance. --- # Data Index @@ -75,11 +75,11 @@ CREATE TABLE sensor_data ( ); ``` -Skipping index can't handle complex filter conditions, and usually has a lower filtering performance compared to inverted index or fulltext index. +Skipping index can't handle complex filter conditions, and usually has a lower filtering performance compared to inverted index or full-text index. -### Fulltext Index +### Full-Text Index -Fulltext index is designed for text search operations on string columns. It enables efficient searching of text content using word-based matching and text search capabilities. You can query text data with flexible keywords, phrases, or pattern matching queries. +Full-text index is designed for text search operations on string columns. It enables efficient searching of text content using word-based matching and text search capabilities. You can query text data with flexible keywords, phrases, or pattern matching queries. **Use Cases:** - Text search operations @@ -95,20 +95,120 @@ CREATE TABLE logs ( ); ``` -Fulltext index supports options by `WITH`: -* `analyzer`: Sets the language analyzer for the fulltext index. Supported values are `English` and `Chinese`. Default to `English`. -* `case_sensitive`: Determines whether the fulltext index is case-sensitive. Supported values are `true` and `false`. Default to `false`. -* `backend`: Sets the backend for the fulltext index. Supported values are `bloom` and `tantivy`. Default to `bloom`. -* `granularity`: (For `bloom` backend) The size of data chunks covered by each filter. A smaller granularity improves filtering but increases index size. Default is `10240`. -* `false_positive_rate`: (For `bloom` backend) The probability of misidentifying a block. A lower rate improves accuracy (better filtering) but increases index size. 
Value is a float between `0` and `1`. Default is `0.01`. +#### Configuration Options -For example: +When creating or modifying a full-text index, you can specify the following options using `FULLTEXT INDEX WITH`: + +- `analyzer`: Sets the language analyzer for the full-text index + - Supported values: `English`, `Chinese` + - Default: `English` + - Note: The Chinese analyzer requires significantly more time to build the index due to the complexity of Chinese text segmentation. Consider using it only when Chinese text search is a primary requirement. + +- `case_sensitive`: Determines whether the full-text index is case-sensitive + - Supported values: `true`, `false` + - Default: `false` + - Note: Setting to `true` may slightly improve performance for case-sensitive queries, but will degrade performance for case-insensitive queries. This setting does not affect the results of `matches_term` queries. + +- `backend`: Sets the backend for the full-text index + - Supported values: `bloom`, `tantivy` + - Default: `bloom` + +- `granularity`: (For `bloom` backend) The size of data chunks covered by each filter. A smaller granularity improves filtering but increases index size. + - Supported values: positive integer + - Default: `10240` + +- `false_positive_rate`: (For `bloom` backend) The probability of misidentifying a block. A lower rate improves accuracy (better filtering) but increases index size. + - Supported values: float between `0` and `1` + - Default: `0.01` + +#### Backend Selection + +GreptimeDB provides two full-text index backends for efficient log searching: + +1. **Bloom Backend** + - Best for: General-purpose log searching + - Features: + - Uses Bloom filter for efficient filtering + - Lower storage overhead + - Consistent performance across different query patterns + - Limitations: + - Slightly slower for high-selectivity queries + - Storage Cost Example: + - Original data: ~10GB + - Bloom index: ~1GB + +2. 
**Tantivy Backend** + - Best for: High-selectivity queries (e.g., unique values like TraceID) + - Features: + - Uses inverted index for fast exact matching + - Excellent performance for high-selectivity queries + - Limitations: + - Higher storage overhead (close to original data size) + - Slower performance for low-selectivity queries + - Storage Cost Example: + - Original data: ~10GB + - Tantivy index: ~10GB + +#### Performance Comparison + +The following table shows the performance comparison between different query methods (using Bloom as baseline): + +| Query Type | High Selectivity (e.g., TraceID) | Low Selectivity (e.g., "HTTP") | +|------------|----------------------------------|--------------------------------| +| LIKE | 50x slower | 1x | +| Tantivy | 5x faster | 5x slower | +| Bloom | 1x (baseline) | 1x (baseline) | + +Key observations: +- For high-selectivity queries (e.g., unique values), Tantivy provides the best performance +- For low-selectivity queries, Bloom offers more consistent performance +- Bloom has significant storage advantage over Tantivy (1GB vs 10GB in test case) + +#### Examples + +**Creating a Table with Full-Text Index** ```sql +-- Using Bloom backend (recommended for most cases) CREATE TABLE logs ( - message STRING FULLTEXT INDEX WITH(analyzer='English', case_sensitive='true', backend='bloom', granularity=1024, false_positive_rate=0.01), - `level` STRING PRIMARY KEY, - `timestamp` TIMESTAMP TIME INDEX, + timestamp TIMESTAMP(9) TIME INDEX, + message STRING FULLTEXT INDEX WITH ( + backend = 'bloom', + analyzer = 'English', + case_sensitive = 'false' + ) +); + +-- Using Tantivy backend (for high-selectivity queries) +CREATE TABLE logs ( + timestamp TIMESTAMP(9) TIME INDEX, + message STRING FULLTEXT INDEX WITH ( + backend = 'tantivy', + analyzer = 'English', + case_sensitive = 'false' + ) +); +``` + +**Modifying an Existing Table** + +```sql +-- Enable full-text index on an existing column +ALTER TABLE monitor +MODIFY COLUMN load_15 +SET FULLTEXT INDEX WITH ( + analyzer = 'English', + case_sensitive = 'false', + backend = 'bloom' +); + +-- Change full-text index configuration +ALTER TABLE logs +MODIFY COLUMN message +SET FULLTEXT INDEX WITH ( + analyzer = 'English', + case_sensitive = 'false', + backend = 'tantivy' ); ``` @@ -118,9 +218,7 @@ Fulltext index usually comes with following drawbacks: - Increased flush and compaction latency as each text document needs to be tokenized and indexed - May not be optimal for simple prefix or suffix matching operations -Consider using fulltext index only when you need advanced text search capabilities and flexible query patterns. - -For more detailed information about fulltext index configuration and backend selection, please refer to the [Full-Text Index Configuration](/user-guide/logs/fulltext-index-config) guide. +Consider using full-text index only when you need advanced text search capabilities and flexible query patterns. ## Modify indexes diff --git a/versioned_docs/version-0.17/user-guide/migrate-to-greptimedb/migrate-from-clickhouse.md b/versioned_docs/version-0.17/user-guide/migrate-to-greptimedb/migrate-from-clickhouse.md index b926db148..4cf6a8892 100644 --- a/versioned_docs/version-0.17/user-guide/migrate-to-greptimedb/migrate-from-clickhouse.md +++ b/versioned_docs/version-0.17/user-guide/migrate-to-greptimedb/migrate-from-clickhouse.md @@ -117,7 +117,7 @@ CREATE TABLE logs ( **Notes:** - `host` and `service` serve as common query filters and are included in the primary key to optimize filtering. 
If there are very many hosts, you might not want to include `host` in the primary key but instead create a skip index. -- `log_message` is treated as raw content with a full-text index created. If you want the full-text index to take effect during queries, you also need to adjust your SQL query syntax. Please refer to [the log query documentation](/user-guide/logs/query-logs.md) for details +- `log_message` is treated as raw content with a full-text index created. If you want the full-text index to take effect during queries, you also need to adjust your SQL query syntax. Please refer to [the log query documentation](/user-guide/logs/fulltext-search.md) for details - Since `trace_id` and `span_id` are mostly high-cardinality fields, it is not recommended to use them in the primary key, but skip indexes have been added. --- @@ -228,7 +228,7 @@ Alternatively, you can convert the CSV to standard INSERT statements for batch i ## Frequently Asked Questions and Optimization Tips ### What if SQL/types are incompatible? - Before migration, audit all query SQL and rewrite or translate as necessary, referring to the [official documentation](/user-guide/query-data/sql.md) (especially for [log query](/user-guide/logs/query-logs.md)) for any incompatible syntax or data types. + Before migration, audit all query SQL and rewrite or translate as necessary, referring to the [official documentation](/user-guide/query-data/sql.md) (especially for [log query](/user-guide/logs/fulltext-search.md)) for any incompatible syntax or data types. ### How do I efficiently import very large datasets in batches? For large tables or full historical data, export and import by partition or shard as appropriate. Monitor write speed and import progress closely. diff --git a/versioned_sidebars/version-0.17-sidebars.json b/versioned_sidebars/version-0.17-sidebars.json index 3f531fd67..19a7fd1fe 100644 --- a/versioned_sidebars/version-0.17-sidebars.json +++ b/versioned_sidebars/version-0.17-sidebars.json @@ -235,11 +235,9 @@ "label": "Overview" }, "user-guide/logs/quick-start", - "user-guide/logs/pipeline-config", - "user-guide/logs/manage-pipelines", - "user-guide/logs/write-logs", - "user-guide/logs/query-logs", - "user-guide/logs/fulltext-index-config" + "user-guide/logs/use-custom-pipelines", + "user-guide/logs/fulltext-search", + "user-guide/logs/manage-pipelines" ] }, { @@ -709,6 +707,15 @@ } ] }, + { + "type": "category", + "label": "Pipeline", + "items": [ + "reference/pipeline/built-in-pipelines", + "reference/pipeline/write-log-api", + "reference/pipeline/pipeline-config" + ] + }, "reference/http-endpoints", "reference/telemetry", "reference/gtctl"