diff --git a/docs/_snippets/_gather_your_details_http.mdx b/docs/_snippets/_gather_your_details_http.mdx index 3415ec3119c..f5517f41b91 100644 --- a/docs/_snippets/_gather_your_details_http.mdx +++ b/docs/_snippets/_gather_your_details_http.mdx @@ -4,17 +4,18 @@ import Image from '@theme/IdealImage'; To connect to ClickHouse with HTTP(S) you need this information: -- The HOST and PORT: typically, the port is 8443 when using TLS or 8123 when not using TLS. +| Parameter(s) | Description | +|-------------------------|---------------------------------------------------------------------------------------------------------------| +|`HOST` and `PORT` | Typically, the port is 8443 when using TLS or 8123 when not using TLS. | +|`DATABASE NAME` | Out of the box, there is a database named `default`, use the name of the database that you want to connect to.| +|`USERNAME` and `PASSWORD`| Out of the box, the username is `default`. Use the username appropriate for your use case. | -- The DATABASE NAME: out of the box, there is a database named `default`, use the name of the database that you want to connect to. - -- The USERNAME and PASSWORD: out of the box, the username is `default`. Use the username appropriate for your use case. - -The details for your ClickHouse Cloud service are available in the ClickHouse Cloud console. Select the service that you will connect to and click **Connect**: +The details for your ClickHouse Cloud service are available in the ClickHouse Cloud console. +Select a service and click **Connect**: ClickHouse Cloud service connect button -Choose **HTTPS**, and the details are available in an example `curl` command. +Choose **HTTPS**. Connection details are displayed in an example `curl` command. ClickHouse Cloud HTTPS connection details diff --git a/docs/_snippets/_gather_your_details_native.md b/docs/_snippets/_gather_your_details_native.md index e17ff46d692..73bf893a979 100644 --- a/docs/_snippets/_gather_your_details_native.md +++ b/docs/_snippets/_gather_your_details_native.md @@ -4,13 +4,14 @@ import Image from '@theme/IdealImage'; To connect to ClickHouse with native TCP you need this information: -- The HOST and PORT: typically, the port is 9440 when using TLS, or 9000 when not using TLS. - -- The DATABASE NAME: out of the box there is a database named `default`, use the name of the database that you want to connect to. - -- The USERNAME and PASSWORD: out of the box the username is `default`. Use the username appropriate for your use case. - -The details for your ClickHouse Cloud service are available in the ClickHouse Cloud console. Select the service that you will connect to and click **Connect**: +| Parameter(s) | Description | +|---------------------------|---------------------------------------------------------------------------------------------------------------| +| `HOST` and `PORT` | Typically, the port is 9440 when using TLS, or 9000 when not using TLS. | +| `DATABASE NAME` | Out of the box there is a database named `default`, use the name of the database that you want to connect to. | +| `USERNAME` and `PASSWORD` | Out of the box the username is `default`. Use the username appropriate for your use case. | + +The details for your ClickHouse Cloud service are available in the ClickHouse Cloud console. 
+Select the service that you will connect to and click **Connect**: ClickHouse Cloud service connect button diff --git a/docs/integrations/data-ingestion/etl-tools/airbyte-and-clickhouse.md b/docs/integrations/data-ingestion/etl-tools/airbyte-and-clickhouse.md index 24fd1ee68cb..1a7e7419c67 100644 --- a/docs/integrations/data-ingestion/etl-tools/airbyte-and-clickhouse.md +++ b/docs/integrations/data-ingestion/etl-tools/airbyte-and-clickhouse.md @@ -30,7 +30,9 @@ Please note that the Airbyte source and destination for ClickHouse are currently Airbyte is an open-source data integration platform. It allows the creation of ELT data pipelines and is shipped with more than 140 out-of-the-box connectors. This step-by-step tutorial shows how to connect Airbyte to ClickHouse as a destination and load a sample dataset. -## 1. Download and run Airbyte {#1-download-and-run-airbyte} + + +## Download and run Airbyte {#1-download-and-run-airbyte} 1. Airbyte runs on Docker and uses `docker-compose`. Make sure to download and install the latest versions of Docker. @@ -50,7 +52,7 @@ Please note that the Airbyte source and destination for ClickHouse are currently Alternatively, you can signup and use Airbyte Cloud ::: -## 2. Add ClickHouse as a destination {#2-add-clickhouse-as-a-destination} +## Add ClickHouse as a destination {#2-add-clickhouse-as-a-destination} In this section, we will display how to add a ClickHouse instance as a destination. @@ -80,7 +82,7 @@ GRANT CREATE ON * TO my_airbyte_user; ``` ::: -## 3. Add a dataset as a source {#3-add-a-dataset-as-a-source} +## Add a dataset as a source {#3-add-a-dataset-as-a-source} The example dataset we will use is the New York City Taxi Data (on Github). For this tutorial, we will use a subset of this dataset which corresponds to the month of Jan 2022. @@ -98,7 +100,7 @@ The example dataset we will use is the more details). 8. Congratulations - you have successfully loaded the NYC taxi data into ClickHouse using Airbyte! + + \ No newline at end of file diff --git a/docs/integrations/data-ingestion/etl-tools/dlt-and-clickhouse.md b/docs/integrations/data-ingestion/etl-tools/dlt-and-clickhouse.md index 3a848abf988..44e39701190 100644 --- a/docs/integrations/data-ingestion/etl-tools/dlt-and-clickhouse.md +++ b/docs/integrations/data-ingestion/etl-tools/dlt-and-clickhouse.md @@ -24,7 +24,9 @@ pip install "dlt[clickhouse]" ## Setup guide {#setup-guide} -### 1. Initialize the dlt Project {#1-initialize-the-dlt-project} + + +### Initialize the dlt Project {#1-initialize-the-dlt-project} Start by initializing a new `dlt` project as follows: ```bash @@ -42,7 +44,7 @@ pip install -r requirements.txt or with `pip install dlt[clickhouse]`, which installs the `dlt` library and the necessary dependencies for working with ClickHouse as a destination. -### 2. Setup ClickHouse Database {#2-setup-clickhouse-database} +### Setup ClickHouse Database {#2-setup-clickhouse-database} To load data into ClickHouse, you need to create a ClickHouse database. Here's a rough outline of what should you do: @@ -60,7 +62,7 @@ GRANT SELECT ON INFORMATION_SCHEMA.COLUMNS TO dlt; GRANT CREATE TEMPORARY TABLE, S3 ON *.* TO dlt; ``` -### 3. Add credentials {#3-add-credentials} +### Add credentials {#3-add-credentials} Next, set up the ClickHouse credentials in the `.dlt/secrets.toml` file as shown below: @@ -78,8 +80,7 @@ secure = 1 # Set to 1 if using HTTPS, else 0. dataset_table_separator = "___" # Separator for dataset table names from dataset. 
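# The values above target a local/self-managed server. As a hypothetical sketch only, a ClickHouse Cloud
# service would typically use the same keys with its own endpoint and TLS ports (hostname is a placeholder):
#   host = "<your-service>.clickhouse.cloud"
#   port = 9440        # native protocol over TLS
#   http_port = 8443   # HTTPS interface
#   secure = 1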
``` -:::note -HTTP_PORT +:::note HTTP_PORT The `http_port` parameter specifies the port number to use when connecting to the ClickHouse server's HTTP interface. This is different from default port 9000, which is used for the native TCP protocol. You must set `http_port` if you are not using external staging (i.e. you don't set the staging parameter in your pipeline). This is because the built-in ClickHouse local storage staging uses the clickhouse content library, which communicates with ClickHouse over HTTP. @@ -94,6 +95,8 @@ You can pass a database connection string similar to the one used by the `clickh destination.clickhouse.credentials="clickhouse://dlt:Dlt*12345789234567@localhost:9000/dlt?secure=1" ``` + + ## Write disposition {#write-disposition} All [write dispositions](https://dlthub.com/docs/general-usage/incremental-loading#choosing-a-write-disposition) diff --git a/docs/integrations/data-ingestion/etl-tools/nifi-and-clickhouse.md b/docs/integrations/data-ingestion/etl-tools/nifi-and-clickhouse.md index 6678c164539..8cd22c0c30a 100644 --- a/docs/integrations/data-ingestion/etl-tools/nifi-and-clickhouse.md +++ b/docs/integrations/data-ingestion/etl-tools/nifi-and-clickhouse.md @@ -33,20 +33,23 @@ import CommunityMaintainedBadge from '@theme/badges/CommunityMaintained'; Apache NiFi is an open-source workflow management software designed to automate data flow between software systems. It allows the creation of ETL data pipelines and is shipped with more than 300 data processors. This step-by-step tutorial shows how to connect Apache NiFi to ClickHouse as both a source and destination, and to load a sample dataset. -## 1. Gather your connection details {#1-gather-your-connection-details} + + +## Gather your connection details {#1-gather-your-connection-details} + -## 2. Download and run Apache NiFi {#2-download-and-run-apache-nifi} +## Download and run Apache NiFi {#2-download-and-run-apache-nifi} -1. For a new setup, download the binary from https://nifi.apache.org/download.html and start by running `./bin/nifi.sh start` +For a new setup, download the binary from https://nifi.apache.org/download.html and start by running `./bin/nifi.sh start` -## 3. Download the ClickHouse JDBC driver {#3-download-the-clickhouse-jdbc-driver} +## Download the ClickHouse JDBC driver {#3-download-the-clickhouse-jdbc-driver} 1. Visit the ClickHouse JDBC driver release page on GitHub and look for the latest JDBC release version 2. In the release version, click on "Show all xx assets" and look for the JAR file containing the keyword "shaded" or "all", for example, `clickhouse-jdbc-0.5.0-all.jar` 3. Place the JAR file in a folder accessible by Apache NiFi and take note of the absolute path -## 4. Add `DBCPConnectionPool` Controller Service and configure its properties {#4-add-dbcpconnectionpool-controller-service-and-configure-its-properties} +## Add `DBCPConnectionPool` Controller Service and configure its properties {#4-add-dbcpconnectionpool-controller-service-and-configure-its-properties} 1. To configure a Controller Service in Apache NiFi, visit the NiFi Flow Configuration page by clicking on the "gear" button @@ -90,7 +93,7 @@ import CommunityMaintainedBadge from '@theme/badges/CommunityMaintained'; Controller Services list showing enabled ClickHouse JDBC service -## 5. Read from a table using the `ExecuteSQL` processor {#5-read-from-a-table-using-the-executesql-processor} +## Read from a table using the `ExecuteSQL` processor {#5-read-from-a-table-using-the-executesql-processor} 1. 
Add an ​`​ExecuteSQL` processor, along with the appropriate upstream and downstream processors @@ -115,7 +118,7 @@ import CommunityMaintainedBadge from '@theme/badges/CommunityMaintained'; FlowFile content viewer showing query results in formatted view -## 6. Write to a table using `MergeRecord` and `PutDatabaseRecord` processor {#6-write-to-a-table-using-mergerecord-and-putdatabaserecord-processor} +## Write to a table using `MergeRecord` and `PutDatabaseRecord` processor {#6-write-to-a-table-using-mergerecord-and-putdatabaserecord-processor} 1. To write multiple rows in a single insert, we first need to merge multiple records into a single record. This can be done using the `MergeRecord` processor @@ -153,3 +156,5 @@ import CommunityMaintainedBadge from '@theme/badges/CommunityMaintained'; Query results showing row count in the destination table 5. Congratulations - you have successfully loaded your data into ClickHouse using Apache NiFi ! + + \ No newline at end of file diff --git a/docs/integrations/data-ingestion/etl-tools/vector-to-clickhouse.md b/docs/integrations/data-ingestion/etl-tools/vector-to-clickhouse.md index ef0e58ff34e..1ee888eb43b 100644 --- a/docs/integrations/data-ingestion/etl-tools/vector-to-clickhouse.md +++ b/docs/integrations/data-ingestion/etl-tools/vector-to-clickhouse.md @@ -17,167 +17,212 @@ import PartnerBadge from '@theme/badges/PartnerBadge'; -Being able to analyze your logs in real time is critical for production applications. Have you ever wondered if ClickHouse is good at storing and analyzing log data? Just checkout Uber's experience with converting their logging infrastructure from ELK to ClickHouse. - -This guide shows how to use the popular data pipeline Vector to tail an Nginx log file and send it to ClickHouse. The steps below would be similar for tailing any type of log file. We will assume you already have ClickHouse up and running and Vector installed (no need to start it yet though). - -## 1. Create a database and table {#1-create-a-database-and-table} - -Let's define a table to store the log events: - -1. We will start with a new database named `nginxdb`: - ```sql - CREATE DATABASE IF NOT EXISTS nginxdb - ``` - -2. For starters, we are just going to insert the entire log event as a single string. Obviously this is not a great format for performing analytics on the log data, but we will figure that part out below using ***materialized views***. - ```sql - CREATE TABLE IF NOT EXISTS nginxdb.access_logs ( - message String - ) - ENGINE = MergeTree() - ORDER BY tuple() - ``` - :::note - There is not really a need for a primary key yet, so that is why **ORDER BY** is set to **tuple()**. - ::: - -## 2. Configure Nginx {#2--configure-nginx} - -We certainly do not want to spend too much time explaining Nginx, but we also do not want to hide all the details, so in this step we will provide you with enough details to get Nginx logging configured. - -1. The following `access_log` property sends logs to `/var/log/nginx/my_access.log` in the **combined** format. This value goes in the `http` section of your `nginx.conf` file: - ```bash - http { - include /etc/nginx/mime.types; - default_type application/octet-stream; - access_log /var/log/nginx/my_access.log combined; - sendfile on; - keepalive_timeout 65; - include /etc/nginx/conf.d/*.conf; - } - ``` +Being able to analyze your logs in real time is critical for production applications. 
+ClickHouse excels at storing and analyzing log data due to its excellent compression (up to [170x](https://clickhouse.com/blog/log-compression-170x) for logs) +and ability to aggregate large amounts of data quickly. + +This guide shows you how to use the popular data pipeline [Vector](https://vector.dev/docs/about/what-is-vector/) to tail an Nginx log file and send it to ClickHouse. +The steps below are similar for tailing any type of log file. + +**Prerequisites:** +- You already have ClickHouse up and running +- You have Vector installed + + + +## Create a database and table {#1-create-a-database-and-table} + +Define a table to store the log events: + +1. Begin with a new database named `nginxdb`: + +```sql +CREATE DATABASE IF NOT EXISTS nginxdb +``` + +2. Insert the entire log event as a single string. Obviously this is not a great format for performing analytics on the log data, but we will figure that part out below using ***materialized views***. + +```sql +CREATE TABLE IF NOT EXISTS nginxdb.access_logs ( + message String +) +ENGINE = MergeTree() +ORDER BY tuple() +``` + +:::note +**ORDER BY** is set to **tuple()** (an empty tuple) as there is no need for a primary key yet. +::: + +## Configure Nginx {#2--configure-nginx} + +This step shows you how to configure Nginx logging. + +1. The following `access_log` property sends logs to `/var/log/nginx/my_access.log` in the **combined** format. +This value goes in the `http` section of your `nginx.conf` file: + +```bash +http { + include /etc/nginx/mime.types; + default_type application/octet-stream; + access_log /var/log/nginx/my_access.log combined; + sendfile on; + keepalive_timeout 65; + include /etc/nginx/conf.d/*.conf; +} +``` 2. Be sure to restart Nginx if you had to modify `nginx.conf`. -3. Generate some log events in the access log by visiting pages on your web server. Logs in the **combined** format have the following format: - ```bash - 192.168.208.1 - - [12/Oct/2021:03:31:44 +0000] "GET / HTTP/1.1" 200 615 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36" - 192.168.208.1 - - [12/Oct/2021:03:31:44 +0000] "GET /favicon.ico HTTP/1.1" 404 555 "http://localhost/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36" - 192.168.208.1 - - [12/Oct/2021:03:31:49 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36" - ``` - -## 3. Configure Vector {#3-configure-vector} - -Vector collects, transforms and routes logs, metrics, and traces (referred to as **sources**) to lots of different vendors (referred to as **sinks**), including out-of-the-box compatibility with ClickHouse. Sources and sinks are defined in a configuration file named **vector.toml**. - -1. The following **vector.toml** defines a **source** of type **file** that tails the end of **my_access.log**, and it also defines a **sink** as the **access_logs** table defined above: - ```bash - [sources.nginx_logs] - type = "file" - include = [ "/var/log/nginx/my_access.log" ] - read_from = "end" - - [sinks.clickhouse] - type = "clickhouse" - inputs = ["nginx_logs"] - endpoint = "http://clickhouse-server:8123" - database = "nginxdb" - table = "access_logs" - skip_unknown_fields = true - ``` - -2. Start up Vector using the configuration above. Visit the Vector documentation for more details on defining sources and sinks.
- -3. Verify the access logs are being inserted into ClickHouse. Run the following query and you should see the access logs in your table: - ```sql - SELECT * FROM nginxdb.access_logs - ``` - View ClickHouse logs in table format - -## 4. Parse the Logs {#4-parse-the-logs} - -Having the logs in ClickHouse is great, but storing each event as a single string does not allow for much data analysis. Let's see how to parse the log events using a materialized view. - -1. A **materialized view** (MV, for short) is a new table based on an existing table, and when inserts are made to the existing table, the new data is also added to the materialized view. Let's see how to define a MV that contains a parsed representation of the log events in **access_logs**, in other words: - ```bash - 192.168.208.1 - - [12/Oct/2021:15:32:43 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36" - ``` - - There are various functions in ClickHouse to parse the string, but for starters let's take a look at **splitByWhitespace** - which parses a string by whitespace and returns each token in an array. To demonstrate, run the following command: - ```sql - SELECT splitByWhitespace('192.168.208.1 - - [12/Oct/2021:15:32:43 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"') - ``` - - Notice the response is pretty close to what we want! A few of the strings have some extra characters, and the user agent (the browser details) did not need to be parsed, but we will resolve that in the next step: - ```text - ["192.168.208.1","-","-","[12/Oct/2021:15:32:43","+0000]","\"GET","/","HTTP/1.1\"","304","0","\"-\"","\"Mozilla/5.0","(Macintosh;","Intel","Mac","OS","X","10_15_7)","AppleWebKit/537.36","(KHTML,","like","Gecko)","Chrome/93.0.4577.63","Safari/537.36\""] - ``` - -2. Similar to **splitByWhitespace**, the **splitByRegexp** function splits a string into an array based on a regular expression. Run the following command, which returns two strings. - ```sql - SELECT splitByRegexp('\S \d+ "([^"]*)"', '192.168.208.1 - - [12/Oct/2021:15:32:43 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"') - ``` - - Notice the second string returned is the user agent successfully parsed from the log: - ```text - ["192.168.208.1 - - [12/Oct/2021:15:32:43 +0000] \"GET / HTTP/1.1\" 30"," \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36\""] - ``` - -3. Before looking at the final **CREATE MATERIALIZED VIEW** command, let's view a couple more functions used to cleanup the data. For example, the `RequestMethod` looks like **"GET** with an unwanted double-quote. Run the following **trim** function, which removes the double quote: - ```sql - SELECT trim(LEADING '"' FROM '"GET') - ``` - -4. The time string has a leading square bracket, and also is not in a format that ClickHouse can parse into a date. However, if we change the separator from a colon (**:**) to a comma (**,**) then the parsing works great: - ```sql - SELECT parseDateTimeBestEffort(replaceOne(trim(LEADING '[' FROM '[12/Oct/2021:15:32:43'), ':', ' ')) - ``` - -5. We are now ready to define our materialized view. 
Our definition includes **POPULATE**, which means the existing rows in **access_logs** will be processed and inserted right away. Run the following SQL statement: - ```sql - CREATE MATERIALIZED VIEW nginxdb.access_logs_view - ( - RemoteAddr String, - Client String, - RemoteUser String, - TimeLocal DateTime, - RequestMethod String, - Request String, - HttpVersion String, - Status Int32, - BytesSent Int64, - UserAgent String - ) - ENGINE = MergeTree() - ORDER BY RemoteAddr - POPULATE AS - WITH - splitByWhitespace(message) as split, - splitByRegexp('\S \d+ "([^"]*)"', message) as referer - SELECT - split[1] AS RemoteAddr, - split[2] AS Client, - split[3] AS RemoteUser, - parseDateTimeBestEffort(replaceOne(trim(LEADING '[' FROM split[4]), ':', ' ')) AS TimeLocal, - trim(LEADING '"' FROM split[6]) AS RequestMethod, - split[7] AS Request, - trim(TRAILING '"' FROM split[8]) AS HttpVersion, - split[9] AS Status, - split[10] AS BytesSent, - trim(BOTH '"' from referer[2]) AS UserAgent - FROM - (SELECT message FROM nginxdb.access_logs) - ``` - -6. Now verify it worked. You should see the access logs nicely parsed into columns: - ```sql - SELECT * FROM nginxdb.access_logs_view - ``` - View parsed ClickHouse logs in table format - - :::note - The lesson above stored the data in two tables, but you could change the initial `nginxdb.access_logs` table to use the **Null** table engine - the parsed data will still end up in the `nginxdb.access_logs_view` table, but the raw data will not be stored in a table. - ::: - -**Summary:** By using Vector, which only required a simple install and quick configuration, we can send logs from an Nginx server to a table in ClickHouse. By using a clever materialized view, we can parse those logs into columns for easier analytics. +3. Generate some log events in the access log by visiting pages on your web server. +Logs in the **combined** format look as follows: + + ```bash + 192.168.208.1 - - [12/Oct/2021:03:31:44 +0000] "GET / HTTP/1.1" 200 615 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36" + 192.168.208.1 - - [12/Oct/2021:03:31:44 +0000] "GET /favicon.ico HTTP/1.1" 404 555 "http://localhost/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36" + 192.168.208.1 - - [12/Oct/2021:03:31:49 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36" + ``` + +## Configure Vector {#3-configure-vector} + +Vector collects, transforms and routes logs, metrics, and traces (referred to as **sources**) to many different vendors (referred to as **sinks**), including out-of-the-box compatibility with ClickHouse. +Sources and sinks are defined in a configuration file named **vector.toml**. + +1. The following **vector.toml** file defines a **source** of type **file** that tails the end of **my_access.log**, and it also defines a **sink** as the **access_logs** table defined above: + +```bash +[sources.nginx_logs] +type = "file" +include = [ "/var/log/nginx/my_access.log" ] +read_from = "end" + +[sinks.clickhouse] +type = "clickhouse" +inputs = ["nginx_logs"] +endpoint = "http://clickhouse-server:8123" +database = "nginxdb" +table = "access_logs" +skip_unknown_fields = true +``` + +2. Start Vector using the configuration above. Visit the Vector [documentation](https://vector.dev/docs/) for more details on defining sources and sinks. + +3. 
Verify that the access logs are being inserted into ClickHouse by running the following query. You should see the access logs in your table: + +```sql +SELECT * FROM nginxdb.access_logs +``` + +View ClickHouse logs in table format + +## Parse the Logs {#4-parse-the-logs} + +Having the logs in ClickHouse is great, but storing each event as a single string does not allow for much data analysis. +We'll next look at how to parse the log events using a [materialized view](/materialized-view/incremental-materialized-view). + +A **materialized view** functions similarly to an insert trigger in SQL. When rows of data are inserted into a source table, the materialized view applies a transformation to these rows and inserts the results into a target table. +The materialized view can be configured to produce a parsed representation of the log events in **access_logs**. +An example of one such log event is shown below: + +```bash +192.168.208.1 - - [12/Oct/2021:15:32:43 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36" +``` + +There are various functions in ClickHouse to parse the above string. The [`splitByWhitespace`](/sql-reference/functions/splitting-merging-functions#splitByWhitespace) function parses a string by whitespace and returns each token in an array. +To demonstrate, run the following command: + +```sql title="Query" +SELECT splitByWhitespace('192.168.208.1 - - [12/Oct/2021:15:32:43 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"') +``` + +```text title="Response" +["192.168.208.1","-","-","[12/Oct/2021:15:32:43","+0000]","\"GET","/","HTTP/1.1\"","304","0","\"-\"","\"Mozilla/5.0","(Macintosh;","Intel","Mac","OS","X","10_15_7)","AppleWebKit/537.36","(KHTML,","like","Gecko)","Chrome/93.0.4577.63","Safari/537.36\""] +``` + +A few of the strings have some extra characters, and the user agent (the browser details) did not need to be parsed, but +the resulting array is close to what is needed. + +Similar to `splitByWhitespace`, the [`splitByRegexp`](/sql-reference/functions/splitting-merging-functions#splitByRegexp) function splits a string into an array based on a regular expression. +Run the following command, which returns two strings. + +```sql +SELECT splitByRegexp('\S \d+ "([^"]*)"', '192.168.208.1 - - [12/Oct/2021:15:32:43 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"') +``` + +Notice that the second string returned is the user agent successfully parsed from the log: + +```text +["192.168.208.1 - - [12/Oct/2021:15:32:43 +0000] \"GET / HTTP/1.1\" 30"," \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36\""] +``` + +Before looking at the final `CREATE MATERIALIZED VIEW` command, let's view a couple more functions used to clean up the data. +For example, the value of `RequestMethod` is `"GET` containing an unwanted double-quote. +You can use the [`trim`](/sql-reference/functions/string-functions#trim) function to remove the double quote: + +```sql +SELECT trim(LEADING '"' FROM '"GET') +``` + +The time string has a leading square bracket, and is also not in a format that ClickHouse can parse into a date.
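A quick way to check this claim (a sketch, not part of the original guide): `parseDateTimeBestEffortOrNull` is the variant that returns `NULL` instead of throwing an error when it cannot parse its input, so it should return `NULL` for the raw time token:

```sql
-- Sketch: the OrNull variant returns NULL rather than throwing when parsing fails,
-- which should be the result here if the raw nginx time token is indeed unparseable.
SELECT parseDateTimeBestEffortOrNull(trim(LEADING '[' FROM '[12/Oct/2021:15:32:43'))
```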
+However, if we change the separator from a colon (**:**) to a space then the parsing works great: + +```sql +SELECT parseDateTimeBestEffort(replaceOne(trim(LEADING '[' FROM '[12/Oct/2021:15:32:43'), ':', ' ')) +``` + +We are now ready to define the materialized view. +The definition below includes `POPULATE`, which means the existing rows in **access_logs** will be processed and inserted right away. +Run the following SQL statement: + +```sql +CREATE MATERIALIZED VIEW nginxdb.access_logs_view +( + RemoteAddr String, + Client String, + RemoteUser String, + TimeLocal DateTime, + RequestMethod String, + Request String, + HttpVersion String, + Status Int32, + BytesSent Int64, + UserAgent String +) +ENGINE = MergeTree() +ORDER BY RemoteAddr +POPULATE AS +WITH + splitByWhitespace(message) as split, + splitByRegexp('\S \d+ "([^"]*)"', message) as referer +SELECT + split[1] AS RemoteAddr, + split[2] AS Client, + split[3] AS RemoteUser, + parseDateTimeBestEffort(replaceOne(trim(LEADING '[' FROM split[4]), ':', ' ')) AS TimeLocal, + trim(LEADING '"' FROM split[6]) AS RequestMethod, + split[7] AS Request, + trim(TRAILING '"' FROM split[8]) AS HttpVersion, + split[9] AS Status, + split[10] AS BytesSent, + trim(BOTH '"' from referer[2]) AS UserAgent +FROM + (SELECT message FROM nginxdb.access_logs) +``` + +Now verify it worked. +You should see the access logs nicely parsed into columns: + +```sql +SELECT * FROM nginxdb.access_logs_view +``` + +View parsed ClickHouse logs in table format + +:::note +The lesson above stored the data in two tables, but you could change the initial `nginxdb.access_logs` table to use the [`Null`](/engines/table-engines/special/null) table engine. +The parsed data will still end up in the `nginxdb.access_logs_view` table, but the raw data will not be stored in a table. +::: + + + +> By using Vector, which only requires a simple install and quick configuration, you can send logs from an Nginx server to a table in ClickHouse. By using a materialized view, you can parse those logs into columns for easier analytics. diff --git a/docs/integrations/data-ingestion/google-dataflow/dataflow.md b/docs/integrations/data-ingestion/google-dataflow/dataflow.md index d0560a44a85..46078d69435 100644 --- a/docs/integrations/data-ingestion/google-dataflow/dataflow.md +++ b/docs/integrations/data-ingestion/google-dataflow/dataflow.md @@ -15,10 +15,13 @@ import ClickHouseSupportedBadge from '@theme/badges/ClickHouseSupported'; [Google Dataflow](https://cloud.google.com/dataflow) is a fully managed stream and batch data processing service. It supports pipelines written in Java or Python and is built on the Apache Beam SDK. -There are two main ways to use Google Dataflow with ClickHouse, both are leveraging [`ClickHouseIO Apache Beam connector`](/integrations/apache-beam): +There are two main ways to use Google Dataflow with ClickHouse, both of which leverage the [`ClickHouseIO` Apache Beam connector](/integrations/apache-beam). +These are: +- [Java runner](#1-java-runner) +- [Predefined templates](#2-predefined-templates) -## 1. Java runner {#1-java-runner} -The [Java Runner](./java-runner) allows users to implement custom Dataflow pipelines using the Apache Beam SDK `ClickHouseIO` integration. This approach provides full flexibility and control over the pipeline logic, enabling users to tailor the ETL process to specific requirements.
+## Java runner {#1-java-runner} +The [Java runner](./java-runner) allows users to implement custom Dataflow pipelines using the Apache Beam SDK `ClickHouseIO` integration. This approach provides full flexibility and control over the pipeline logic, enabling users to tailor the ETL process to specific requirements. However, this option requires knowledge of Java programming and familiarity with the Apache Beam framework. ### Key features {#key-features} @@ -26,7 +29,7 @@ However, this option requires knowledge of Java programming and familiarity with - Ideal for complex or advanced use cases. - Requires coding and understanding of the Beam API. -## 2. Predefined templates {#2-predefined-templates} +## Predefined templates {#2-predefined-templates} ClickHouse offers [predefined templates](./templates) designed for specific use cases, such as importing data from BigQuery into ClickHouse. These templates are ready-to-use and simplify the integration process, making them an excellent choice for users who prefer a no-code solution. ### Key features {#key-features-1} diff --git a/sidebars.js b/sidebars.js index e4ca70c8456..df033e1fdde 100644 --- a/sidebars.js +++ b/sidebars.js @@ -467,6 +467,19 @@ const sidebars = { }, ], }, + { + type: "category", + label: "Formats", + collapsed: true, + collapsible: true, + link: { type: "doc", id: "interfaces/formats" }, + items: [ + { + type: "autogenerated", + dirName: "interfaces/formats", + }, + ], + }, ], integrations: [ @@ -863,19 +876,10 @@ const sidebars = { "integrations/data-ingestion/data-formats/arrow-avro-orc", "integrations/data-ingestion/data-formats/templates-regex", { - type: "category", - label: "View All Formats", - link: { - type: "doc", - id: "interfaces/formats", - }, - items: [ - { - type: "autogenerated", - dirName: "interfaces/formats", - } - ] - }, + type: "link", + label: "View all formats", + href: "/interfaces/formats", + } ], }, {