[Doc][Improve] connector-v2 clickhouse/hbase/pulsar sink support chinese (#6811)

* [Doc][Improve] connector-v2 clickhouse/hbase/pulsar sink support chinese

* update doc style

---------

Co-authored-by: fanchengbo <fanchengbo@dobest.com>
fcb-xiaobo and fanchengbo committed May 15, 2024
1 parent 4b6c13e commit 874f904
Showing 3 changed files with 469 additions and 0 deletions.
179 changes: 179 additions & 0 deletions docs/zh/connector-v2/sink/Clickhouse.md
@@ -0,0 +1,179 @@
# Clickhouse

> Clickhouse sink connector

## Supported engines

> Spark<br/>
> Flink<br/>
> SeaTunnel Zeta<br/>

## Key features

- [ ] [exactly-once](../../concept/connector-v2-features.md)
- [x] [cdc](../../concept/connector-v2-features.md)

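> The Clickhouse sink plugin can achieve exactly-once semantics through idempotent writes. This requires pairing it with an engine that supports deduplication, such as AggregatingMergeTree.

## Description

Used to write data to Clickhouse.

## Supported data source info

To use the Clickhouse connector, the following dependencies are required. They can be downloaded via install-plugin.sh or from the Maven central repository.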

| Data source | Supported versions | Dependency |
|-------------|--------------------|------------------------------------------------------------------------------------------------------------------|
| Clickhouse  | universal          | [Download](https://mvnrepository.com/artifact/org.apache.seatunnel/seatunnel-connectors-v2/connector-clickhouse) |

## 数据类型映射

| SeaTunnel 数据类型 | Clickhouse 数据类型 |
|----------------|-----------------------------------------------------------------------------------------------------------------------------------------------|
| STRING | String / Int128 / UInt128 / Int256 / UInt256 / Point / Ring / Polygon MultiPolygon |
| INT | Int8 / UInt8 / Int16 / UInt16 / Int32 |
| BIGINT | UInt64 / Int64 / IntervalYear / IntervalQuarter / IntervalMonth / IntervalWeek / IntervalDay / IntervalHour / IntervalMinute / IntervalSecond |
| DOUBLE | Float64 |
| DECIMAL | Decimal |
| FLOAT | Float32 |
| DATE | Date |
| TIME | DateTime |
| ARRAY | Array |
| MAP | Map |

## Sink options

| Name | Type | Required | Default | Description |
|---------------------------------------|---------|----------|---------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| host | String | Yes | - | `ClickHouse` cluster address in `host:port` format. Multiple `hosts` may be specified, for example `"host1:8123,host2:8123"`. |
| database | String | Yes | - | The `ClickHouse` database name. |
| table | String | Yes | - | The table name. |
| username | String | Yes | - | The `ClickHouse` user name. |
| password | String | Yes | - | The `ClickHouse` user password. |
| clickhouse.config | Map | No | | In addition to the mandatory parameters above that must be specified for `clickhouse-jdbc`, users can specify multiple optional parameters, covering all the [parameters](https://github.com/ClickHouse/clickhouse-jdbc/tree/master/clickhouse-client#configuration) provided by `clickhouse-jdbc`. |
| bulk_size | String | No | 20000 | The number of rows written through [Clickhouse-jdbc](https://github.com/ClickHouse/clickhouse-jdbc) each time. The default is 20000. |
| split_mode | String | No | false | This mode only supports `clickhouse` tables whose engine is `Distributed`, and the `internal_replication` option should be `true`. SeaTunnel splits the distributed table's data and writes directly to each shard. The shard weights defined in `clickhouse` are taken into account. |
| sharding_key | String | No | - | When `split_mode` is used, the node to send each row to must be chosen. The default is random selection, but the `sharding_key` parameter can specify the field used by the sharding algorithm. This option only takes effect when `split_mode` is `true`. |
| primary_key | String | No | - | Marks the primary key column of the `clickhouse` table; INSERT/UPDATE/DELETE operations on the `clickhouse` table are executed based on this key. |
| support_upsert | Boolean | No | false | Support upserting rows by querying the primary key. |
| allow_experimental_lightweight_delete | Boolean | No | false | Allow experimental lightweight deletes on `*MergeTree` table engines. |
| common-options | | No | - | Common sink plugin parameters. See [Sink Common Options](common-options.md) for details. |
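
As an illustration of how the options in the table combine, the sketch below points the sink at two hosts and overrides `bulk_size`. The host names, table name, credentials, and the value `10000` are placeholders chosen for this example, not recommendations.

```hocon
sink {
  Clickhouse {
    # Multiple hosts, comma-separated, each in host:port form
    host = "host1:8123,host2:8123"
    database = "default"
    table = "fake_all"        # placeholder table name
    username = "xxxxx"
    password = "xxxxx"
    # Rows written per batch; the documented default is 20000
    bulk_size = "10000"
  }
}
```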

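## How to create a Clickhouse data synchronization job

The following example demonstrates how to create a data synchronization job that writes randomly generated data to a Clickhouse database.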

```hocon
# Set the basic configuration of the task to be performed
env {
  parallelism = 1
  job.mode = "BATCH"
  checkpoint.interval = 1000
}

source {
  FakeSource {
    row.num = 2
    bigint.min = 0
    bigint.max = 10000000
    split.num = 1
    split.read-interval = 300
    schema {
      fields {
        c_bigint = bigint
      }
    }
  }
}

sink {
  Clickhouse {
    # ClickHouse HTTP interface (8123 is the default HTTP port)
    host = "127.0.0.1:8123"
    database = "default"
    table = "test"
    username = "xxxxx"
    password = "xxxxx"
  }
}
```

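### Tips

> 1. [SeaTunnel deployment document](../../start-v2/locally/deployment.md). <br/>
> 2. The table to be written to must be created before synchronization.<br/>
> 3. When writing to a ClickHouse table, you do not need to set its schema, because the connector queries ClickHouse for the current table's schema before writing.<br/>

## Clickhouse sink config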

```hocon
sink {
  Clickhouse {
    host = "localhost:8123"
    database = "default"
    table = "fake_all"
    username = "xxxxx"
    password = "xxxxx"
    clickhouse.config = {
      max_rows_to_read = "100"
      read_overflow_mode = "throw"
    }
  }
}
```

## Split mode

```hocon
sink {
  Clickhouse {
    host = "localhost:8123"
    database = "default"
    table = "fake_all"
    username = "xxxxx"
    password = "xxxxx"
    # split mode options
    split_mode = true
    sharding_key = "age"
  }
}
```

## CDC(Change data capture) Sink

```hocon
sink {
  Clickhouse {
    host = "localhost:8123"
    database = "default"
    table = "fake_all"
    username = "xxxxx"
    password = "xxxxx"
    # cdc options
    primary_key = "id"
    support_upsert = true
  }
}
```

## CDC(Change data capture) for *MergeTree engine

```hocon
sink {
  Clickhouse {
    host = "localhost:8123"
    database = "default"
    table = "fake_all"
    username = "xxxxx"
    password = "xxxxx"
    # cdc options
    primary_key = "id"
    support_upsert = true
    allow_experimental_lightweight_delete = true
  }
}
```

122 changes: 122 additions & 0 deletions docs/zh/connector-v2/sink/Hbase.md
@@ -0,0 +1,122 @@
# Hbase

> Hbase sink connector

## Description

Outputs data to hbase.

## Key features

- [ ] [exactly-once](../../concept/connector-v2-features.md)

## Options

| Name               | Type    | Required | Default         |
|--------------------|---------|----------|-----------------|
| zookeeper_quorum   | string  | yes      | -               |
| table              | string  | yes      | -               |
| rowkey_column      | list    | yes      | -               |
| family_name        | config  | yes      | -               |
| rowkey_delimiter   | string  | no       | ""              |
| version_column     | string  | no       | -               |
| null_mode          | string  | no       | skip            |
| wal_write          | boolean | no       | false           |
| write_buffer_size  | int     | no       | 8 * 1024 * 1024 |
| encoding           | string  | no       | utf8            |
| hbase_extra_config | config  | no       | -               |
| common-options     |         | no       | -               |

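### zookeeper_quorum [string]

The zookeeper cluster hosts of hbase, for example: "hadoop001:2181,hadoop002:2181,hadoop003:2181"

### table [string]

The name of the table to write to, for example: "seatunnel"

### rowkey_column [list]

The list of column names that make up the row key, for example: ["id", "uuid"]

### family_name [config]

The column family name mapping for fields. For example, a row from upstream looks like this:

| id | name | age |
|----|---------------|-----|
| 1 | tyrantlucifer | 27 |

With id as the row key and the other fields written to different column families, you can assign: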

family_name {
name = "info1"
age = "info2"
}

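This means that name will be written to the column family info1 and age will be written to the column family info2.

If you want to write the other fields to the same column family, you can assign: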

family_name {
all_columns = "info"
}

This means that all fields will be written to the column family info.
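
Putting the per-field mapping described above into a complete sink block might look like the following sketch. It assumes the upstream row shown earlier (`id`, `name`, `age`); the ZooKeeper addresses and the table name are placeholders reused from other examples in this document.

```hocon
Hbase {
  zookeeper_quorum = "hadoop001:2181,hadoop002:2181,hadoop003:2181"
  table = "seatunnel_test"
  # id is used as the row key
  rowkey_column = ["id"]
  # name and age are written to different column families
  family_name {
    name = "info1"
    age = "info2"
  }
}
```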

### rowkey_delimiter [string]

The delimiter used to join multiple row key columns, default `""`
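
A composite row key with an explicit delimiter could be configured as in this sketch. The column names `id` and `uuid` come from the `rowkey_column` example above; the `_` delimiter and the column family name `info` are arbitrary choices for illustration. Assuming the columns are joined in the listed order, a row with `id = 1` and `uuid = abc` would presumably get the row key `1_abc`.

```hocon
Hbase {
  zookeeper_quorum = "hadoop001:2181,hadoop002:2181,hadoop003:2181"
  table = "seatunnel_test"
  # Two columns together form the row key
  rowkey_column = ["id", "uuid"]
  # Illustrative delimiter placed between the row key parts
  rowkey_delimiter = "_"
  family_name {
    all_columns = "info"
  }
}
```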

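### version_column [string]

The version column name. You can use it to assign the timestamp of the hbase record.

### null_mode [string]

The mode for writing null values, supports [`skip`, `empty`], default `skip`

- skip: when a field is null, the connector does not write this field to hbase
- empty: when a field is null, the connector writes an empty value for this field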
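
To switch from the default `skip` behaviour to writing empty values, the option is set on the sink block. This is a minimal sketch; every value other than `null_mode` is a placeholder taken from the examples in this document.

```hocon
Hbase {
  zookeeper_quorum = "hadoop001:2181,hadoop002:2181,hadoop003:2181"
  table = "seatunnel_test"
  rowkey_column = ["id"]
  family_name {
    all_columns = "info"
  }
  # Write an empty value instead of skipping null fields
  null_mode = "empty"
}
```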

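### wal_write [boolean]

The WAL (write-ahead log) write flag, default `false`

### write_buffer_size [int]

The write buffer size of the hbase client, default `8 * 1024 * 1024`

### encoding [string]

The encoding of string fields, supports [`utf8`, `gbk`], default `utf8`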
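
The three write-related options above can be combined as in the sketch below. The specific values are illustrative assumptions, not tuning advice. Note that HOCON does not evaluate arithmetic, so a buffer size such as 16 * 1024 * 1024 has to be written out as the literal `16777216`.

```hocon
Hbase {
  zookeeper_quorum = "hadoop001:2181,hadoop002:2181,hadoop003:2181"
  table = "seatunnel_test"
  rowkey_column = ["id"]
  family_name {
    all_columns = "info"
  }
  # Enable the WAL write flag (default is false)
  wal_write = true
  # 16 * 1024 * 1024, written as a literal
  write_buffer_size = 16777216
  # Encode string fields as gbk instead of the default utf8
  encoding = "gbk"
}
```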

### hbase_extra_config [config]

Extra hbase configuration

### common options

Common sink plugin parameters. See [Sink Common Options](common-options.md) for details.

## Example

```hocon
Hbase {
  zookeeper_quorum = "hadoop001:2181,hadoop002:2181,hadoop003:2181"
  table = "seatunnel_test"
  rowkey_column = ["name"]
  family_name {
    all_columns = seatunnel
  }
}
```
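
For context, the sink block above would normally sit inside a full job definition. The sketch below mirrors the FakeSource setup used in the Clickhouse example of this commit; the field names `name` and `age` and the row count are assumptions made for illustration, and the target table is expected to exist already with the `seatunnel` column family.

```hocon
env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  FakeSource {
    row.num = 2
    schema {
      fields {
        # Hypothetical fields for this sketch
        name = string
        age = int
      }
    }
  }
}

sink {
  Hbase {
    zookeeper_quorum = "hadoop001:2181,hadoop002:2181,hadoop003:2181"
    table = "seatunnel_test"
    rowkey_column = ["name"]
    family_name {
      all_columns = seatunnel
    }
  }
}
```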

## Changelog

### next version

- Add hbase sink connector ([4049](https://github.com/apache/seatunnel/pull/4049))
