[Doc][Improve] connector-v2 clickhouse/hbase/pulsar sink support chinese #6811

Merged 3 commits on May 15, 2024
179 changes: 179 additions & 0 deletions docs/zh/connector-v2/sink/Clickhouse.md
@@ -0,0 +1,179 @@
# Clickhouse

> Clickhouse sink connector

## Supported Engines

> Spark<br/>
> Flink<br/>
> SeaTunnel Zeta<br/>

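## Key Features

- [ ] [exactly-once](../../concept/connector-v2-features.md)
- [x] [cdc](../../concept/connector-v2-features.md)

> The Clickhouse sink plugin can achieve exactly-once semantics through idempotent writes. This requires a table engine that supports deduplication, such as AggregatingMergeTree.

## Description

Used to write data to Clickhouse.

## Supported DataSource Info

To use the Clickhouse connector, the following dependencies are required. They can be downloaded via install-plugin.sh or from the Maven central repository.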

| Datasource | Supported Versions | Dependency                                                                                                   |
|------------|--------------------|--------------------------------------------------------------------------------------------------------------|
| Clickhouse | universal          | [Download](https://mvnrepository.com/artifact/org.apache.seatunnel/seatunnel-connectors-v2/connector-clickhouse) |

## 数据类型映射

| SeaTunnel 数据类型 | Clickhouse 数据类型 |
|----------------|-----------------------------------------------------------------------------------------------------------------------------------------------|
| STRING | String / Int128 / UInt128 / Int256 / UInt256 / Point / Ring / Polygon MultiPolygon |
| INT | Int8 / UInt8 / Int16 / UInt16 / Int32 |
| BIGINT | UInt64 / Int64 / IntervalYear / IntervalQuarter / IntervalMonth / IntervalWeek / IntervalDay / IntervalHour / IntervalMinute / IntervalSecond |
| DOUBLE | Float64 |
| DECIMAL | Decimal |
| FLOAT | Float32 |
| DATE | Date |
| TIME | DateTime |
| ARRAY | Array |
| MAP | Map |

## Sink Options

| Name | Type | Required | Default | Description |
|---------------------------------------|---------|------|-------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| host | String | Yes | - | `ClickHouse` cluster address in `host:port` format. Multiple hosts are allowed, for example `"host1:8123,host2:8123"`. |
| database | String | Yes | - | The `ClickHouse` database name. |
| table | String | Yes | - | The table name. |
| username | String | Yes | - | The `ClickHouse` user name. |
| password | String | Yes | - | The `ClickHouse` user password. |
| clickhouse.config | Map | No | | Besides the mandatory parameters above that `clickhouse-jdbc` requires, users can specify additional optional parameters, covering all the [parameters](https://github.com/ClickHouse/clickhouse-jdbc/tree/master/clickhouse-client#configuration) provided by `clickhouse-jdbc`. |
| bulk_size | String | No | 20000 | The number of rows written per batch through [Clickhouse-jdbc](https://github.com/ClickHouse/clickhouse-jdbc); defaults to 20000. |
| split_mode | String | No | false | This mode only supports `clickhouse` tables whose engine is `Distributed`, and the `internal_replication` option should be `true`. SeaTunnel splits the distributed table's data and writes directly to each shard. The shard weights defined in `clickhouse` are taken into account. |
| sharding_key | String | No | - | When `split_mode` is used, the node each row is sent to must be decided. By default a node is chosen at random, but `sharding_key` can specify the field used by the sharding algorithm. This option only takes effect when `split_mode` is `true`. |
| primary_key | String | No | - | Marks the primary key column of the `clickhouse` table; INSERT/UPDATE/DELETE are executed against the `clickhouse` table based on the primary key. |
| support_upsert | Boolean | No | false | Support upserting rows by querying the primary key. |
| allow_experimental_lightweight_delete | Boolean | No | false | Allow experimental lightweight delete based on the `MergeTree` table engine. |
| common-options | | No | - | Common sink plugin parameters; see [Sink Common Options](common-options.md) for details. |

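## How to Create a Clickhouse Data Synchronization Job

The following example demonstrates how to create a data synchronization job that writes randomly generated data to a Clickhouse database.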

```hocon
# Set the basic configuration of the task to be performed
env {
  parallelism = 1
  job.mode = "BATCH"
  checkpoint.interval = 1000
}

source {
  # FakeSource generates random rows for testing
  FakeSource {
    row.num = 2
    bigint.min = 0
    bigint.max = 10000000
    split.num = 1
    split.read-interval = 300
    schema {
      fields {
        c_bigint = bigint
      }
    }
  }
}

sink {
  Clickhouse {
    # 8123 is the ClickHouse HTTP port, as used by the other examples in this document
    host = "127.0.0.1:8123"
    database = "default"
    table = "test"
    username = "xxxxx"
    password = "xxxxx"
  }
}
```

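### Tips

> 1.[SeaTunnel Deployment Document](../../start-v2/locally/deployment.md). <br/>
> 2.The table to be written to must be created before synchronization.<br/>
> 3.When writing to a ClickHouse table, you do not need to set its schema, because the connector queries ClickHouse for the current table's schema information before writing.<br/>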

## Clickhouse Sink Config

```hocon
sink {
  Clickhouse {
    host = "localhost:8123"
    database = "default"
    table = "fake_all"
    username = "xxxxx"
    password = "xxxxx"
    clickhouse.config = {
      max_rows_to_read = "100"
      read_overflow_mode = "throw"
    }
  }
}
```
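
As a sketch of two settings from the options table that the other examples do not exercise, the sink below lists several hosts in one `host` string and sets `bulk_size` explicitly. The host names, the table name and the value 10000 are illustrative assumptions, not recommended settings.

```hocon
sink {
  Clickhouse {
    # Multiple hosts, comma-separated, each in host:port form
    host = "host1:8123,host2:8123"
    database = "default"
    table = "fake_all"
    username = "xxxxx"
    password = "xxxxx"

    # Rows written per batch (default 20000); 10000 is only an example value
    bulk_size = 10000
  }
}
```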

## Split Mode

```hocon
sink {
  Clickhouse {
    host = "localhost:8123"
    database = "default"
    table = "fake_all"
    username = "xxxxx"
    password = "xxxxx"

    # split mode options
    split_mode = true
    sharding_key = "age"
  }
}
```

## CDC(Change data capture) Sink

```hocon
sink {
  Clickhouse {
    host = "localhost:8123"
    database = "default"
    table = "fake_all"
    username = "xxxxx"
    password = "xxxxx"

    # cdc options
    primary_key = "id"
    support_upsert = true
  }
}
```

## CDC(Change data capture) for *MergeTree engine

```hocon
sink {
  Clickhouse {
    host = "localhost:8123"
    database = "default"
    table = "fake_all"
    username = "xxxxx"
    password = "xxxxx"

    # cdc options
    primary_key = "id"
    support_upsert = true
    allow_experimental_lightweight_delete = true
  }
}
```

122 changes: 122 additions & 0 deletions docs/zh/connector-v2/sink/Hbase.md
@@ -0,0 +1,122 @@
# Hbase

> Hbase sink connector

## Description

Output data to Hbase

## Key Features

- [ ] [exactly-once](../../concept/connector-v2-features.md)

## Options

| Name | Type | Required | Default Value |
|--------------------|---------|------|-----------------|
| zookeeper_quorum | string | yes | - |
| table | string | yes | - |
| rowkey_column | list | yes | - |
| family_name | config | yes | - |
| rowkey_delimiter | string | no | "" |
| version_column | string | no | - |
| null_mode | string | no | skip |
| wal_write | boolean | no | false |
| write_buffer_size | int | no | 8 * 1024 * 1024 |
| encoding | string | no | utf8 |
| hbase_extra_config | config | no | - |
| common-options | | no | - |

### zookeeper_quorum [string]

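The ZooKeeper cluster hosts of hbase, for example: "hadoop001:2181,hadoop002:2181,hadoop003:2181"

### table [string]

The name of the table to write to, for example: "seatunnel"

### rowkey_column [list]

The list of column names that form the row key, for example: ["id", "uuid"]

### family_name [config]

The column family name mapping for fields. For example, suppose an upstream row looks like this: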

| id | name | age |
|----|---------------|-----|
| 1 | tyrantlucifer | 27 |

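With id as the row key and the other fields written to different column families, you can assign

    family_name {
      name = "info1"
      age = "info2"
    }

This means that name is written to column family info1 and age is written to column family info2

If you want to write the other fields to the same column family, you can assign

    family_name {
      all_columns = "info"
    }

This means that all fields are written to column family info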
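
To show where the per-column mapping sits in a complete sink definition, here is a minimal sketch that reuses the example row above (id, name, age). The ZooKeeper hosts and table name are the sample values used elsewhere in this document. Wrapping the connector in a `sink { }` block follows the Clickhouse examples and is an assumption for Hbase.

```hocon
sink {
  Hbase {
    zookeeper_quorum = "hadoop001:2181,hadoop002:2181,hadoop003:2181"
    table = "seatunnel"
    # id becomes the row key
    rowkey_column = ["id"]
    # name goes to column family info1, age goes to column family info2
    family_name {
      name = "info1"
      age = "info2"
    }
  }
}
```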

### rowkey_delimiter [string]

The delimiter used to join multiple row key columns, default ""
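
A minimal sketch of a composite row key, assuming an upstream row that carries both `id` and `uuid` columns (the column names come from the `rowkey_column` example, and the `_` delimiter is an illustrative choice). With this configuration, the row key would presumably be the two column values joined by the delimiter, for example `1_ab12`.

```hocon
sink {
  Hbase {
    zookeeper_quorum = "hadoop001:2181,hadoop002:2181,hadoop003:2181"
    table = "seatunnel"
    # Both columns together form the row key
    rowkey_column = ["id", "uuid"]
    # Placed between the row key column values (default is "")
    rowkey_delimiter = "_"
    family_name {
      all_columns = "info"
    }
  }
}
```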

### version_column [string]

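The version column name; you can use it to assign the timestamp of the hbase record

### null_mode [string]

The mode for writing null values, supports [skip, empty], default skip

- skip: when a field is null, the connector does not write this field to hbase
- empty: when a field is null, the connector writes an empty value for this field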

### wal_write [boolean]

The WAL (write-ahead log) write flag, default false

### write_buffer_size [int]

The write buffer size of the hbase client, default 8 * 1024 * 1024

### encoding [string]

The encoding of string fields, supports [utf8, gbk], default utf8
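
The optional settings described above (`null_mode`, `wal_write`, `write_buffer_size`, `encoding`) can be combined in one sink. The sketch below overrides each default purely for illustration. The chosen values are assumptions, not tuning advice, and the `sink { }` wrapper follows the Clickhouse examples.

```hocon
sink {
  Hbase {
    zookeeper_quorum = "hadoop001:2181,hadoop002:2181,hadoop003:2181"
    table = "seatunnel"
    rowkey_column = ["id"]
    family_name {
      all_columns = "info"
    }

    # Write an empty value for null fields instead of skipping them
    null_mode = "empty"
    # Turn on the WAL write flag (default false)
    wal_write = true
    # 16 * 1024 * 1024, twice the default write buffer size
    write_buffer_size = 16777216
    # Encode string fields as gbk instead of the default utf8
    encoding = "gbk"
  }
}
```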

### hbase_extra_config [config]

Extra hbase configuration

### common options

Common sink plugin parameters; see [Sink Common Options](common-options.md) for details

## Example

```hocon
Hbase {
  zookeeper_quorum = "hadoop001:2181,hadoop002:2181,hadoop003:2181"
  table = "seatunnel_test"
  rowkey_column = ["name"]
  family_name {
    all_columns = "seatunnel"
  }
}
```
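
For context, the fragment above would normally sit inside the `sink` block of a full job. The sketch below pairs it with a `FakeSource` modeled on the Clickhouse job example. The field names `name` and `age` and the `env` values are assumptions chosen for illustration.

```hocon
env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  # FakeSource generates random rows; the schema here is hypothetical
  FakeSource {
    row.num = 2
    schema {
      fields {
        name = string
        age = int
      }
    }
  }
}

sink {
  Hbase {
    zookeeper_quorum = "hadoop001:2181,hadoop002:2181,hadoop003:2181"
    table = "seatunnel_test"
    # name is used as the row key
    rowkey_column = ["name"]
    # the remaining fields are written to column family "seatunnel"
    family_name {
      all_columns = "seatunnel"
    }
  }
}
```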

## Changelog

### next version

- Add hbase sink connector ([4049](https://github.com/apache/seatunnel/pull/4049))