
[Feature][Connector Hive] support hive savemode #6842

Open · wants to merge 2 commits into dev from hive-savemode
Conversation

@liunaijie (Contributor) commented May 11, 2024

Purpose of this pull request

Subtask of #5390.

  1. Implement the Hive savemode feature.
  2. Remove Hive metastore access and use Hive2 JDBC instead (the Hive thrift URL option is removed and replaced by a Hive2 JDBC URL).

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

liunaijie marked this pull request as ready for review on May 16, 2024 10:49
@liunaijie (Contributor Author) commented:

@EricJoy2048 @dailai @ruanwenjun Hi guys, PTAL when you have time.

@@ -66,7 +74,8 @@ public OptionRule optionRule() {

         ReadonlyConfig finalReadonlyConfig =
                 generateCurrentReadonlyConfig(readonlyConfig, catalogTable);
-        return () -> new HiveSink(finalReadonlyConfig, catalogTable);
+        CatalogTable finalCatalog = renameCatalogTable(finalReadonlyConfig, catalogTable);
liunaijie (Contributor Author) commented on the diff:

Replace with the target Hive sink table name. If we don't rename here, the source table name is passed to Hive (e.g. in a fake-to-Hive sink job), and using this catalog then fails, so the rename is done here.

String describeFormattedTableQuery = "describe formatted " + tablePath.getFullName();
try (PreparedStatement ps = connection.prepareStatement(describeFormattedTableQuery)) {
    ResultSet rs = ps.executeQuery();
    return processResult(rs, tablePath, builder, partitionKeys);
liunaijie (Contributor Author) commented on the diff:

The Hive table information is now parsed from the query result. That's not very elegant, but it works.

EricJoy2048 changed the title from [Feature] support hive savemode to [Feature][Connector Hive] support hive savemode on May 16, 2024
.withValue(
        FIELD_DELIMITER.key(),
        ConfigValueFactory.fromAnyRef(
                parameters.get("field.delim")))
liunaijie (Contributor Author) commented on the diff:

This line has an issue when field.delim is \t: ConfigValueFactory.fromAnyRef turns it into \\t, and the written data is then corrupted.
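For context, a minimal sketch (assuming the com.typesafe.config library that SeaTunnel's Config is based on) of where the escaping can creep in: the stored value is still a real tab, but the rendered form is the two-character sequence \t, so any code that consumes the rendered text without HOCON unescaping sees a literal backslash-t.

    import com.typesafe.config.Config;
    import com.typesafe.config.ConfigFactory;
    import com.typesafe.config.ConfigRenderOptions;
    import com.typesafe.config.ConfigValueFactory;

    public class DelimiterEscapeDemo {
        public static void main(String[] args) {
            // Store a real tab character under a hypothetical key.
            Config config = ConfigFactory.empty()
                    .withValue("field_delimiter", ConfigValueFactory.fromAnyRef("\t"));

            // In memory the value is still one tab character ...
            System.out.println(config.getString("field_delimiter").length()); // 1

            // ... but rendering escapes it to the two characters '\' and 't'.
            System.out.println(config.root().render(ConfigRenderOptions.concise()));
            // prints {"field_delimiter":"\t"} with a literal backslash-t
        }
    }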

@NoPr commented May 22, 2024:

Hi, the statement being created seems to have a problem:
org.apache.hadoop.hive.ql.parse.ParseException: line 1:0 cannot recognize input near '' '' ''

@liunaijie (Contributor Author) replied:

> Hi, the statement being created seems to have a problem: org.apache.hadoop.hive.ql.parse.ParseException: line 1:0 cannot recognize input near '' '' ''

Is it the statement that commits the partition info, or some other statement?

@@ -33,7 +33,7 @@ By default, we use 2PC commit to ensure `exactly-once`

 | name                          | type    | required | default value  |
 |-------------------------------|---------|----------|----------------|
 | table_name                    | string  | yes      | -              |
-| metastore_uri                 | string  | yes      | -              |
+| hive_jdbc_url                 | string  | yes      | -              |
A Contributor commented on the diff:

How to be compatible with older versions?

liunaijie (Contributor Author) replied:

It is not compatible with the old version. I didn't want to use both hive2 JDBC and the Hive metastore at the same time, so I removed the metastore and use only JDBC.

A Member replied:

> It is not compatible with the old version. I didn't want to use both hive2 JDBC and the Hive metastore at the same time, so I removed the metastore and use only JDBC.

As an open source project we have to consider feature compatibility. We know that many users are using the Hive connector, so to stay compatible with those older users I think supporting both JDBC and the metastore is the better way.

liunaijie (Contributor Author) replied:

Yes, but if we implement the savemode feature with only the Hive metastore, creating the table is difficult: the table format, bucket settings, table location, and so on would each need their own configuration parameter. So I want to use a SQL template: the user defines the template, we replace the table name and columns in it, and then run the SQL to create the table.

I can add metastore_uri back, and it would make the code easier, but then users would need to configure both JDBC and thrift for the Hive connector.
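To illustrate the template idea, a hypothetical sketch (the placeholder names ${table_name} and ${rowtype_fields} are assumptions here, not necessarily the final option names): the connector substitutes the resolved table name and column list into a user-supplied CREATE TABLE template and executes the result over JDBC.

    public class CreateTemplateSketch {
        public static void main(String[] args) {
            // User-defined template with illustrative placeholder names.
            String template =
                    "CREATE TABLE IF NOT EXISTS ${table_name} (\n"
                            + "  ${rowtype_fields}\n"
                            + ") STORED AS ORC";

            // At runtime the connector knows the target table and the upstream schema.
            String sql = template
                    .replace("${table_name}", "test_db.orders")
                    .replace("${rowtype_fields}", "id BIGINT,\n  name STRING");

            System.out.println(sql); // would be executed via the hive2 JDBC connection
        }
    }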

The Member replied:

Yes, you can add metastore_uri back and tell users that save mode can only be used when hive_jdbc_url is configured.

In the future, in SeaTunnel 2.4.x, we can remove the metastore_uri configuration; we can make some incompatible changes from 2.3.x to 2.4.x.

@NoPr commented May 22, 2024:

[screenshot of the partition statement]

> Is it the statement that commits the partition info, or some other statement?

Yes. When schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST", is save_mode_create_template required? When it is "", I get the error I described.

@liunaijie (Contributor Author) replied:

> When schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST", is save_mode_create_template required?

Yes, it is required. It is the CREATE TABLE statement to execute when your table does not exist.

@NoPr commented May 22, 2024:

> Yes, it is required. It is the CREATE TABLE statement to execute when your table does not exist.

Then if I don't know the source table's schema, is this source-to-sink config impossible to write? Is that different from the effect achieved by the MySQL schema_save_mode config?

@liunaijie (Contributor Author) replied:

> Then if I don't know the source table's schema, is this source-to-sink config impossible to write? Is that different from the effect achieved by the MySQL schema_save_mode config?

Slightly different. We can get the source table's schema and create the table from it, but other Hive settings, such as managed vs. external table, external table location, storage format, and so on, cannot be obtained. That's why this parameter was added: to let users define the DDL statement themselves.

@NoPr commented May 22, 2024:

So custom table creation for Hive requires:
1. Knowing the source table schema.
2. A custom CREATE TABLE statement passed as a sink parameter.

@NoPr commented May 23, 2024:

[screenshots of the successful run] Nice, I tried it and it works.

@liunaijie (Contributor Author) commented May 23, 2024:

The current code still has a few problems:

  1. Your statement above doesn't specify a delimiter. If a delimiter such as \t is specified, it becomes \\t after being written into the Config, which corrupts the written files.
  2. The Hive table schema is obtained via desc formatted <table_name> and parsed from the SQL result. The returned output differs slightly across versions: in 3.1.3 there is no blank line between # col_name and the actual field names, while 2.1.1 adds an extra blank line. The parsing logic needs to handle both and be cleaned up; see the sketch below.
  3. I'm not sure whether it works once Kerberos authentication is enabled. I'm using a username and password here, which can be set in the JDBC URL.
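A version-tolerant parsing sketch for point 2, assuming the result set layout of desc formatted (column name in the first result column, type in the second); the class and method names are illustrative, not the PR's actual code.

    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    public class DescFormattedParser {

        // Parses name/type pairs out of a `desc formatted <table>` result set.
        static List<String[]> parseColumns(ResultSet rs) throws SQLException {
            List<String[]> columns = new ArrayList<>();
            boolean inColumnSection = false;
            while (rs.next()) {
                String name = rs.getString(1) == null ? "" : rs.getString(1).trim();
                if (name.startsWith("# col_name")) {
                    inColumnSection = true; // header row of the column section
                    continue;
                }
                if (!inColumnSection) {
                    continue;
                }
                if (name.isEmpty()) {
                    continue; // Hive 2.1.1 emits an extra blank row here; 3.1.3 does not
                }
                if (name.startsWith("#")) {
                    break; // the next section header ends the column list
                }
                String type = rs.getString(2);
                columns.add(new String[] {name, type == null ? "" : type.trim()});
            }
            return columns;
        }
    }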

@zhilinli123 (Contributor) commented May 24, 2024.

liunaijie force-pushed the hive-savemode branch 2 times, most recently from 991eb3b to d004328 on May 27, 2024 01:46

In order to use this connector, You must ensure your spark/flink cluster already integrated hive.

If you use SeaTunnel Engine, You need put `seatunnel-hadoop3-3.1.4-uber.jar` and `hive-exec-<hive_version>.jar` and `hive-jdbc-<hive_version>.jar` and `libfb303-0.9.3.jar` in $SEATUNNEL_HOME/lib/ dir.

## Key features
A Contributor commented on the diff:

Key Features

| abort_drop_partition_metadata | boolean | no | true | Flag to decide whether to drop partition metadata from Hive Metastore during an abort operation. Note: this only affects the metadata in the metastore, the data in the partition will always be deleted(data generated during the synchronization process). |
| common-options | | no | - | Sink plugin common parameters, please refer to [Sink Common Options](common-options.md) for details |

### schema_save_mode[Enum]
A Contributor commented on the diff:

Added to options

liunaijie (Contributor Author) replied:

It is already in the options; this just adds more explanation.


In order to use this connector, You must ensure your spark/flink cluster already integrated hive.

If you use SeaTunnel Engine, You need put `seatunnel-hadoop3-3.1.4-uber.jar` and `hive-exec-<hive_version>.jar` and `libfb303-0.9.3.jar` in $SEATUNNEL_HOME/lib/ dir.

## Key features
A Contributor commented on the diff:

ditto


## Source Options

| name | type | required | default value | Description |
A Contributor commented on the diff:

capitalize the first letter

@Override
public boolean isExistsData(TablePath tablePath) {
    String tableName = tablePath.getFullName();
    String sql = String.format("select * from %s limit 1;", tableName);
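A hypothetical completion of this method (assuming the java.sql imports and a `connection` field available in this catalog class; not necessarily the PR's actual code): the table has data exactly when the limit-1 query returns a row.

    @Override
    public boolean isExistsData(TablePath tablePath) {
        String tableName = tablePath.getFullName();
        String sql = String.format("select * from %s limit 1;", tableName);
        // At least one row back means the table contains data.
        try (PreparedStatement ps = connection.prepareStatement(sql);
                ResultSet rs = ps.executeQuery()) {
            return rs.next();
        } catch (SQLException e) {
            throw new RuntimeException("Failed to check data of table " + tableName, e);
        }
    }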
A Contributor commented on the diff:

Wouldn't it be better to use show create table?

liunaijie (Contributor Author) replied:

No, this method checks whether the table has data. With show create table you can't check whether the table has data; you can only check that it exists.

@liunaijie (Contributor Author) commented:

> The current code still has a few problems: ...

Problems 1 and 2 are solved. For problem 3, I don't have that environment, so I can't verify it.
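For reference, the username/password path that was verified, as a minimal sketch using standard JDBC (host, port, and credentials are placeholders). Kerberos would instead need a principal in the URL plus a Kerberos login, which remains unverified in this PR.

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class Hive2JdbcAuthSketch {
        public static void main(String[] args) throws Exception {
            // Username/password authentication via the hive2 JDBC URL
            // (placeholder host and credentials).
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hive-server:10000/default", "hive_user", "hive_password")) {
                System.out.println(conn.isValid(5));
            }
        }
    }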

@liunaijie (Contributor Author) commented:

> Then if I don't know the source table's schema, is this source-to-sink config impossible to write?

The code has been updated. The template can now use variables that are replaced at runtime, so the upstream table schema does not need to be known in advance.

One thing to note: if the table is partitioned, an extra parameter is required to declare the partition fields, and the partition fields cannot be variables. That is, multiple upstream tables can only use the single partition definition in the template; you cannot create different partitions per table. A hypothetical template is sketched below.
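A hypothetical template illustrating that constraint (placeholder names again assumed, not final): the table name and columns are variables, while the partition definition is literal and therefore shared by every upstream table routed through this template.

    // ${table_name} and ${rowtype_fields} are resolved at runtime, but
    // PARTITIONED BY is fixed, so all upstream tables share one partition definition.
    String template =
            "CREATE TABLE IF NOT EXISTS ${table_name} (\n"
            + "  ${rowtype_fields}\n"
            + ") PARTITIONED BY (dt STRING)\n"
            + "STORED AS ORC";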
