
[Feature][Connector Hive] support hive savemode #6842

Open · wants to merge 2 commits into dev from hive-savemode
Conversation

@liunaijie (Contributor) commented May 11, 2024

Purpose of this pull request

Subtask of #5390.

  1. Implement the Hive savemode feature.
  2. Remove Hive metastore access and use Hive2 JDBC instead (the Hive thrift URL option is removed and replaced by a Hive2 JDBC URL).

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

liunaijie marked this pull request as ready for review on May 16, 2024 10:49
@liunaijie (Contributor Author) commented:

@EricJoy2048 @dailai @ruanwenjun Hi guys, PTAL when you have time.

@@ -66,7 +74,8 @@ public OptionRule optionRule() {

         ReadonlyConfig finalReadonlyConfig =
                 generateCurrentReadonlyConfig(readonlyConfig, catalogTable);
-        return () -> new HiveSink(finalReadonlyConfig, catalogTable);
+        CatalogTable finalCatalog = renameCatalogTable(finalReadonlyConfig, catalogTable);
liunaijie (Contributor Author) commented on the diff:

Replace with the target Hive sink table name. If we don't rename here, the source table name is passed to Hive (e.g. in a fake-to-Hive sink job), and using this catalog then fails, so the rename is done here.

String describeFormattedTableQuery = "describe formatted " + tablePath.getFullName();
try (PreparedStatement ps = connection.prepareStatement(describeFormattedTableQuery)) {
    ResultSet rs = ps.executeQuery();
    return processResult(rs, tablePath, builder, partitionKeys);
liunaijie (Contributor Author) commented on the diff:

The Hive table information is now parsed from the query result. That's not very elegant, but it works.

EricJoy2048 changed the title from [Feature] support hive savemode to [Feature][Connector Hive] support hive savemode on May 16, 2024
.withValue(
        FIELD_DELIMITER.key(),
        ConfigValueFactory.fromAnyRef(
                parameters.get("field.delim")))
liunaijie (Contributor Author) commented on the diff:

This line has an issue when field.delim is \t: ConfigValueFactory.fromAnyRef turns it into \\t, and the written data is then corrupted.
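For context, a minimal sketch (assuming the com.typesafe.config library that SeaTunnel's Config is based on) of where the escaping can creep in: the stored value is still a real tab, but the rendered form is the two-character sequence \t, so any code that consumes the rendered text without HOCON unescaping sees a literal backslash-t.

    import com.typesafe.config.Config;
    import com.typesafe.config.ConfigFactory;
    import com.typesafe.config.ConfigRenderOptions;
    import com.typesafe.config.ConfigValueFactory;

    public class DelimiterEscapeDemo {
        public static void main(String[] args) {
            // Store a real tab character under a hypothetical key.
            Config config = ConfigFactory.empty()
                    .withValue("field_delimiter", ConfigValueFactory.fromAnyRef("\t"));

            // In memory the value is still one tab character ...
            System.out.println(config.getString("field_delimiter").length()); // 1

            // ... but rendering escapes it to the two characters '\' and 't'.
            System.out.println(config.root().render(ConfigRenderOptions.concise()));
            // prints {"field_delimiter":"\t"} with a literal backslash-t
        }
    }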

@NoPr commented May 22, 2024:

Hi, the statement being created seems to have a problem:
org.apache.hadoop.hive.ql.parse.ParseException: line 1:0 cannot recognize input near '' '' ''

@liunaijie (Contributor Author) replied:

> Hi, the statement being created seems to have a problem: org.apache.hadoop.hive.ql.parse.ParseException: line 1:0 cannot recognize input near '' '' ''

Is it the statement that commits the partition info, or some other statement?

@@ -33,7 +33,7 @@ By default, we use 2PC commit to ensure `exactly-once`

 | name                          | type    | required | default value  |
 |-------------------------------|---------|----------|----------------|
 | table_name                    | string  | yes      | -              |
-| metastore_uri                 | string  | yes      | -              |
+| hive_jdbc_url                 | string  | yes      | -              |
A Contributor commented on the diff:

How to be compatible with older versions?

liunaijie (Contributor Author) replied:

It is not compatible with the old version. I didn't want to use both hive2 JDBC and the Hive metastore at the same time, so I removed the metastore and use only JDBC.

A Member replied:

> It is not compatible with the old version. I didn't want to use both hive2 JDBC and the Hive metastore at the same time, so I removed the metastore and use only JDBC.

As an open source project we have to consider feature compatibility. We know that many users are using the Hive connector, so to stay compatible with those older users I think supporting both JDBC and the metastore is the better way.

liunaijie (Contributor Author) replied:

Yes, but if we implement the savemode feature with only the Hive metastore, creating the table is difficult: the table format, bucket settings, table location, and so on would each need their own configuration parameter. So I want to use a SQL template: the user defines the template, we replace the table name and columns in it, and then run the SQL to create the table.

I can add metastore_uri back, and it would make the code easier, but then users would need to configure both JDBC and thrift for the Hive connector.
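To illustrate the template idea, a hypothetical sketch (the placeholder names ${table_name} and ${rowtype_fields} are assumptions here, not necessarily the final option names): the connector substitutes the resolved table name and column list into a user-supplied CREATE TABLE template and executes the result over JDBC.

    public class CreateTemplateSketch {
        public static void main(String[] args) {
            // User-defined template with illustrative placeholder names.
            String template =
                    "CREATE TABLE IF NOT EXISTS ${table_name} (\n"
                            + "  ${rowtype_fields}\n"
                            + ") STORED AS ORC";

            // At runtime the connector knows the target table and the upstream schema.
            String sql = template
                    .replace("${table_name}", "test_db.orders")
                    .replace("${rowtype_fields}", "id BIGINT,\n  name STRING");

            System.out.println(sql); // would be executed via the hive2 JDBC connection
        }
    }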

The Member replied:

Yes, you can add metastore_uri back and tell users that save mode can only be used when hive_jdbc_url is configured.

In the future, in SeaTunnel 2.4.x, we can remove the metastore_uri configuration; we can make some incompatible changes from 2.3.x to 2.4.x.

@NoPr commented May 22, 2024:

[screenshot of the partition statement]

> Is it the statement that commits the partition info, or some other statement?

Yes. When schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST", is save_mode_create_template required? When it is "", I get the error I described.

@liunaijie (Contributor Author) replied:

> When schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST", is save_mode_create_template required?

Yes, it is required. It is the CREATE TABLE statement to execute when your table does not exist.

@NoPr commented May 22, 2024:

> Yes, it is required. It is the CREATE TABLE statement to execute when your table does not exist.

Then if I don't know the source table's schema, is this source-to-sink config impossible to write? Is that different from the effect achieved by the MySQL schema_save_mode config?

@liunaijie (Contributor Author) replied:

> Then if I don't know the source table's schema, is this source-to-sink config impossible to write? Is that different from the effect achieved by the MySQL schema_save_mode config?

Slightly different. We can get the source table's schema and create the table from it, but other Hive settings, such as managed vs. external table, external table location, storage format, and so on, cannot be obtained. That's why this parameter was added: to let users define the DDL statement themselves.

@NoPr commented May 22, 2024:

So custom table creation for Hive requires:
1. Knowing the source table schema.
2. A custom CREATE TABLE statement passed as a sink parameter.

@NoPr commented May 23, 2024:

[screenshots of the successful run] Nice, I tried it and it works.

@liunaijie (Contributor Author) commented May 23, 2024:

The current code still has a few problems:

  1. Your statement above doesn't specify a delimiter. If a delimiter such as \t is specified, it becomes \\t after being written into the Config, which corrupts the written files.
  2. The Hive table schema is obtained via desc formatted <table_name> and parsed from the SQL result. The returned output differs slightly across versions: in 3.1.3 there is no blank line between # col_name and the actual field names, while 2.1.1 adds an extra blank line. The parsing logic needs to handle both and be cleaned up; see the sketch below.
  3. I'm not sure whether it works once Kerberos authentication is enabled. I'm using a username and password here, which can be set in the JDBC URL.
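A version-tolerant parsing sketch for point 2, assuming the result set layout of desc formatted (column name in the first result column, type in the second); the class and method names are illustrative, not the PR's actual code.

    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    public class DescFormattedParser {

        // Parses name/type pairs out of a `desc formatted <table>` result set.
        static List<String[]> parseColumns(ResultSet rs) throws SQLException {
            List<String[]> columns = new ArrayList<>();
            boolean inColumnSection = false;
            while (rs.next()) {
                String name = rs.getString(1) == null ? "" : rs.getString(1).trim();
                if (name.startsWith("# col_name")) {
                    inColumnSection = true; // header row of the column section
                    continue;
                }
                if (!inColumnSection) {
                    continue;
                }
                if (name.isEmpty()) {
                    continue; // Hive 2.1.1 emits an extra blank row here; 3.1.3 does not
                }
                if (name.startsWith("#")) {
                    break; // the next section header ends the column list
                }
                String type = rs.getString(2);
                columns.add(new String[] {name, type == null ? "" : type.trim()});
            }
            return columns;
        }
    }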

@zhilinli123 (Contributor) commented May 24, 2024.

liunaijie force-pushed the hive-savemode branch 2 times, most recently from 991eb3b to d004328 on May 27, 2024 01:46

In order to use this connector, You must ensure your spark/flink cluster already integrated hive.

If you use SeaTunnel Engine, You need put `seatunnel-hadoop3-3.1.4-uber.jar` and `hive-exec-<hive_version>.jar` and `hive-jdbc-<hive_version>.jar` and `libfb303-0.9.3.jar` in $SEATUNNEL_HOME/lib/ dir.

## Key features
A Contributor commented on the diff:

Key Features

| abort_drop_partition_metadata | boolean | no | true | Flag to decide whether to drop partition metadata from Hive Metastore during an abort operation. Note: this only affects the metadata in the metastore, the data in the partition will always be deleted(data generated during the synchronization process). |
| common-options | | no | - | Sink plugin common parameters, please refer to [Sink Common Options](common-options.md) for details |

### schema_save_mode[Enum]
A Contributor commented on the diff:

Added to options

liunaijie (Contributor Author) replied:

It is already in the options; this just adds more explanation.


In order to use this connector, You must ensure your spark/flink cluster already integrated hive.

If you use SeaTunnel Engine, You need put `seatunnel-hadoop3-3.1.4-uber.jar` and `hive-exec-<hive_version>.jar` and `libfb303-0.9.3.jar` in $SEATUNNEL_HOME/lib/ dir.

## Key features
A Contributor commented on the diff:

ditto


## Source Options

| name | type | required | default value | Description |
A Contributor commented on the diff:

capitalize the first letter

@Override
public boolean isExistsData(TablePath tablePath) {
    String tableName = tablePath.getFullName();
    String sql = String.format("select * from %s limit 1;", tableName);
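A hypothetical completion of this method (assuming the java.sql imports and a `connection` field available in this catalog class; not necessarily the PR's actual code): the table has data exactly when the limit-1 query returns a row.

    @Override
    public boolean isExistsData(TablePath tablePath) {
        String tableName = tablePath.getFullName();
        String sql = String.format("select * from %s limit 1;", tableName);
        // At least one row back means the table contains data.
        try (PreparedStatement ps = connection.prepareStatement(sql);
                ResultSet rs = ps.executeQuery()) {
            return rs.next();
        } catch (SQLException e) {
            throw new RuntimeException("Failed to check data of table " + tableName, e);
        }
    }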
A Contributor commented on the diff:

Wouldn't it be better to use show create table?

liunaijie (Contributor Author) replied:

No, this method checks whether the table has data. With show create table you can't check whether the table has data; you can only check that it exists.

@liunaijie (Contributor Author) commented:

> The current code still has a few problems: ...

Problems 1 and 2 are solved. For problem 3, I don't have that environment, so I can't verify it.
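For reference, the username/password path that was verified, as a minimal sketch using standard JDBC (host, port, and credentials are placeholders). Kerberos would instead need a principal in the URL plus a Kerberos login, which remains unverified in this PR.

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class Hive2JdbcAuthSketch {
        public static void main(String[] args) throws Exception {
            // Username/password authentication via the hive2 JDBC URL
            // (placeholder host and credentials).
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hive-server:10000/default", "hive_user", "hive_password")) {
                System.out.println(conn.isValid(5));
            }
        }
    }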

@liunaijie (Contributor Author) commented:

> Then if I don't know the source table's schema, is this source-to-sink config impossible to write?

The code has been updated. The template can now use variables that are replaced at runtime, so the upstream table schema does not need to be known in advance.

One thing to note: if the table is partitioned, an extra parameter is required to declare the partition fields, and the partition fields cannot be variables. That is, multiple upstream tables can only use the single partition definition in the template; you cannot create different partitions per table. A hypothetical template is sketched below.
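A hypothetical template illustrating that constraint (placeholder names again assumed, not final): the table name and columns are variables, while the partition definition is literal and therefore shared by every upstream table routed through this template.

    // ${table_name} and ${rowtype_fields} are resolved at runtime, but
    // PARTITIONED BY is fixed, so all upstream tables share one partition definition.
    String template =
            "CREATE TABLE IF NOT EXISTS ${table_name} (\n"
            + "  ${rowtype_fields}\n"
            + ") PARTITIONED BY (dt STRING)\n"
            + "STORED AS ORC";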
