feat: support insert function in offline mode #3854

Merged
tobegit3hub merged 20 commits into 4paradigm:main on Apr 15, 2024

Conversation

Matagits
Collaborator

@Matagits Matagits commented Apr 8, 2024

  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)
    feature: Enable insert function in offline mode, add corresponding test cases

  • What is the current behavior? (You can also link to an open issue here)
    Insert function is not supported in offline mode

  • What is the new behavior (if this is a feature change)?
    We can use insert in offline mode

@github-actions github-actions bot added the batch-engine (openmldb batch(offline) engine), execute-engine (hybridse sql engine), storage-engine (openmldb storage engine. nameserver & tablet), and task-manager (openmldb taskmanager) labels Apr 8, 2024
Contributor

github-actions bot commented Apr 8, 2024

SDK Test Report

102 files +1, 102 suites +1, duration 2m 12s (-1s)
357 tests +8: 343 passed +8, 14 skipped ±0, 0 failed ±0
483 runs +8: 469 passed +8, 14 skipped ±0, 0 failed ±0

Results for commit 05dbe85. ± Comparison against base commit 7f758af.

This pull request removes 30 and adds 17 tests. Note that renamed tests count towards both.
  PARTITION BY db1.t1.col2 ORDER BY db1.t1.col1
  PARTITION BY t1.col2 ORDER BY t1.col1
  ROWS_RANGE BETWEEN 3 PRECEDING AND CURRENT ROW
 ) limit 10;](1)
 ) limit 10;](2)
 ) limit 10;](3)
 FROM db1.t1
 FROM t1
 WINDOW w1 AS (
 last join db2.t2 order by db2.t2.col1
…
com._4paradigm.hybridse.sdk.SqlEngineTest ‑ sqlLastJoinWithMultipleDB[,  SELECT sum(db1.t1.col1) over w1 as sum_t1_col1, db2.t2.str1 as t2_str1
 FROM db1.t1
 last join db2.t2 order by db2.t2.col1
 on db1.t1.col1 = db2.t2.col1 and db1.t1.col2 = db2.t2.col0
 WINDOW w1 AS (
  PARTITION BY db1.t1.col2 ORDER BY db1.t1.col1
  ROWS_RANGE BETWEEN 3 PRECEDING AND CURRENT ROW
 ) limit 10;](2)
com._4paradigm.hybridse.sdk.SqlEngineTest ‑ sqlLastJoinWithMultipleDB[db1,  SELECT sum(t1.col1) over w1 as sum_t1_col1, db2.t2.str1 as t2_str1
 FROM t1
 last join db2.t2 order by db2.t2.col1
 on t1.col1 = db2.t2.col1 and t1.col2 = db2.t2.col0
 WINDOW w1 AS (
  PARTITION BY t1.col2 ORDER BY t1.col1
  ROWS_RANGE BETWEEN 3 PRECEDING AND CURRENT ROW
 ) limit 10;](1)
com._4paradigm.hybridse.sdk.SqlEngineTest ‑ sqlLastJoinWithMultipleDB[null,  SELECT sum(db1.t1.col1) over w1 as sum_t1_col1, db2.t2.str1 as t2_str1
 FROM db1.t1
 last join db2.t2 order by db2.t2.col1
 on db1.t1.col1 = db2.t2.col1 and db1.t1.col2 = db2.t2.col0
 WINDOW w1 AS (
  PARTITION BY db1.t1.col2 ORDER BY db1.t1.col1
  ROWS_RANGE BETWEEN 3 PRECEDING AND CURRENT ROW
 ) limit 10;](3)
com._4paradigm.hybridse.sdk.SqlEngineTest ‑ sqlMultipleDBErrorTest[, SELECT db2.t2.str1 as t2_str1
 FROM t1
 last join db2.t2 order by db2.t2.col1
 on t1.col1 = db2.t2.col1 and t1.col2 = db2.t2.col0;
, SQL parse error: Fail to transform data provider op: table t1 not exists in database []](4)
com._4paradigm.hybridse.sdk.SqlEngineTest ‑ sqlMultipleDBErrorTest[db1, SELECT db1.t2.str1 as t2_str1
 FROM t1
 last join db2.t2 order by db2.t2.col1
 on t1.col1 = db2.t2.col1 and t1.col2 = db2.t2.col0;
, SQL parse error: Column Not found: db1.t2.str1](2)
com._4paradigm.hybridse.sdk.SqlEngineTest ‑ sqlMultipleDBErrorTest[db1, SELECT db2.t2.str1 as t2_str1
 FROM t1
 last join db2.t2 order by db2.t2.col1
 on t1.col1 = t2.col1 and t1.col2 = db2.t2.col0;
, SQL parse error: Column Not found: .t2.col1](3)
com._4paradigm.hybridse.sdk.SqlEngineTest ‑ sqlMultipleDBErrorTest[db1, SELECT t2.str1 as t2_str1
 FROM t1
 last join db2.t2 order by db2.t2.col1
 on t1.col1 = db2.t2.col1 and t1.col2 = db2.t2.col0;
, SQL parse error: Column Not found: .t2.str1](1)
com._4paradigm.hybridse.sdk.SqlEngineTest ‑ sqlMultipleDBErrorTest[null, SELECT db2.t2.str1 as t2_str1
 FROM t1
 last join db2.t2 order by db2.t2.col1
 on t1.col1 = db2.t2.col1 and t1.col2 = db2.t2.col0;
, SQL parse error: Fail to transform data provider op: table t1 not exists in database []](5)
com._4paradigm.hybridse.sdk.SqlEngineTest ‑ sqlWindowLastJoin[ SELECT sum(t1.col1) over w1 as sum_t1_col1, t2.str1 as t2_str1
 FROM t1
 last join t2 order by t2.col1
 on t1.col1 = t2.col1 and t1.col2 = t2.col0
 WINDOW w1 AS (
  PARTITION BY t1.col2 ORDER BY t1.col1
  ROWS_RANGE BETWEEN 3 PRECEDING AND CURRENT ROW
 ) limit 10;](1)
com._4paradigm.openmldb.batch.TestInsertPlan ‑ Test column with default value
…

♻️ This comment has been updated with latest results.

Contributor

github-actions bot commented Apr 8, 2024

HybridSE Mac Test Report

68 files ±0, 256 suites ±0, duration 7m 36s (-1m 3s)
20 124 tests ±0: 20 122 passed ±0, 2 skipped ±0, 0 failed ±0

Results for commit 05dbe85. ± Comparison against base commit 7f758af.

Contributor

github-actions bot commented Apr 8, 2024

HybridSE Linux Test Report

68 files ±0, 256 suites ±0, duration 6m 21s (±0s)
20 124 tests ±0: 20 122 passed ±0, 2 skipped ±0, 0 failed ±0

Results for commit 05dbe85. ± Comparison against base commit 7f758af.

Contributor

github-actions bot commented Apr 8, 2024

Linux Test Report

57 files ±0, 244 suites ±0, duration 1h 41m 48s (+3m 26s)
12 631 tests ±0: 12 624 passed ±0, 7 skipped ±0, 0 failed ±0
17 908 runs ±0: 17 901 passed ±0, 7 skipped ±0, 0 failed ±0

Results for commit 05dbe85. ± Comparison against base commit 7f758af.


val newOfflineInfo = OfflineTableInfo
.newBuilder()
.setPath(offlineDataPath)
.setFormat("csv")
Collaborator Author

Use parquet as the default format here instead of csv.

val spark = ctx.getSparkSession
var insertDf = spark.createDataFrame(spark.sparkContext.parallelize(insertRows), insertSchema)
val schemaDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], oriSchema)
insertDf = schemaDf.unionByName(insertDf, allowMissingColumns = true)
Collaborator Author

Handle columns that have a default value.
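
The snippet under review pads the inserted rows out to the table's full schema with `unionByName(..., allowMissingColumns = true)`, which leaves omitted columns as null. As a minimal sketch of how a column default could then be applied, the Scala below fills those nulls with `na.fill`. The column names, the default value, and the `DefaultValueSketch` object are illustrative assumptions rather than code from this PR, and `allowMissingColumns` needs Spark 3.1 or later.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical helper for illustration only; not part of OpenMLDB.
object DefaultValueSketch {
  // Full table schema (assumed).
  val oriSchema: StructType = StructType(Seq(
    StructField("id", IntegerType),
    StructField("name", StringType),
    StructField("score", IntegerType)))

  // Schema of the columns actually named in the INSERT statement (assumed).
  val insertSchema: StructType = StructType(Seq(
    StructField("id", IntegerType),
    StructField("name", StringType)))

  def padWithDefaults(spark: SparkSession,
                      insertRows: Seq[Row],
                      defaults: Map[String, Any]): DataFrame = {
    val insertDf = spark.createDataFrame(spark.sparkContext.parallelize(insertRows), insertSchema)
    // An empty DataFrame that carries the full table schema.
    val schemaDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], oriSchema)
    // Columns missing from insertDf are added as null, in the table's column order.
    val padded = schemaDf.unionByName(insertDf, allowMissingColumns = true)
    // Replace nulls in the listed columns with their defaults.
    padded.na.fill(defaults)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("offline-insert-sketch").getOrCreate()
    try {
      padWithDefaults(spark, Seq(Row(1, "a"), Row(2, "b")), Map("score" -> 100)).show()
    } finally {
      spark.stop()
    }
  }
}
```

One caveat of this sketch: `na.fill` cannot tell an omitted column from an explicitly inserted NULL, so a real implementation would likely apply defaults only to the columns absent from the INSERT column list.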

val offlineDataPath = getOfflineDataPath(ctx, db, table)
val newTableInfoBuilder = tableInfo.toBuilder
val hasOfflineTableInfo = tableInfo.hasOfflineTableInfo
if (!hasOfflineTableInfo) {
Collaborator Author

If the table has a symbolic link (soft link), throw an exception immediately and reject this write.

Collaborator Author

symbolic path
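
A minimal sketch of the guard requested in this thread: refuse the offline insert up front when the table has any symbolic path. `OfflineInfoSketch` and its fields are hypothetical stand-ins for the real offline table metadata, and the exception type and message are assumptions.

```scala
// Hypothetical stand-in for the offline table metadata; field names are assumptions.
final case class OfflineInfoSketch(path: String, format: String, symbolicPaths: Seq[String])

object SymbolicPathGuard {
  // Throws if the table has any symbolic path, so the write is rejected
  // before any data is written. A table without offline info passes.
  def ensureNoSymbolicPaths(db: String, table: String, info: Option[OfflineInfoSketch]): Unit =
    info.foreach { i =>
      if (i.symbolicPaths.nonEmpty) {
        throw new IllegalArgumentException(
          s"Offline insert into $db.$table rejected: table has symbolic paths " +
            i.symbolicPaths.mkString(", "))
      }
    }

  def main(args: Array[String]): Unit = {
    // Passes: offline info exists but holds no symbolic paths.
    ensureNoSymbolicPaths("db1", "t1", Some(OfflineInfoSketch("/tmp/offline/db1/t1", "parquet", Seq.empty)))
    println("insert allowed")
  }
}
```

Checking before the write keeps the table's offline data untouched when the insert is refused.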



class TestInsertPlan extends SparkTestSuite {
var sparkSession: SparkSession = _
Collaborator Author

Also cover the case of inserting into a table that has already had data loaded with LOAD DATA.

@vagetablechicken
Collaborator

Add a description of offline insert to docs/zh/openmldb_sql/dml/INSERT_STATEMENT.md. Offline insert can accept the 'yyyy-MM-dd xx' format, but online insert currently cannot.

@github-actions github-actions bot added the documentation (Improvements or additions to documentation) label Apr 11, 2024
@Matagits
Collaborator Author

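The following changes have been made:
1. The default storage format is now parquet.
2. Columns with default values are handled.
3. Offline inserts are rejected for tables that have symbolic links.
4. Insert modes other than the default (such as INSERT OR IGNORE) are rejected.
5. Corresponding tests were added (for example, an offline insert into a table that has already had data loaded).
6. The corresponding Chinese and English docs were updated.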
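
As a rough sketch of item 4, the check below accepts only the default insert mode and reports anything else as an error. The `InsertModeSketch` type, its cases, and the error text are hypothetical names for illustration; the PR's actual plan-node types are not shown in this conversation.

```scala
// Hypothetical model of the insert modes mentioned above.
sealed trait InsertModeSketch
case object DefaultInsert extends InsertModeSketch   // plain INSERT
case object InsertOrIgnore extends InsertModeSketch  // INSERT OR IGNORE

object OfflineInsertModeCheck {
  // Right(()) when the mode is allowed offline, Left(message) otherwise.
  def validate(mode: InsertModeSketch): Either[String, Unit] = mode match {
    case DefaultInsert => Right(())
    case other => Left(s"Offline mode supports only plain INSERT, got: $other")
  }

  def main(args: Array[String]): Unit = {
    println(validate(DefaultInsert))
    println(validate(InsertOrIgnore))
  }
}
```

Returning an `Either` here is a design choice for the sketch; throwing an exception, as the symbolic-link check is described as doing, would work equally well.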

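- By default, `INSERT` does not deduplicate. `INSERT OR IGNORE` skips rows that already exist in the table, so it can be retried repeatedly.
- Offline mode supports only `INSERT`, not `INSERT OR IGNORE`.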
Collaborator

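There is one more limitation: "offline insert cannot be used on tables that have symbolic links". Because the format is unique per table, if the format is hive or similar, we cannot create a hard-copy path for the table and save the inserted data as parquet files under that path. To use insert, the user must first make sure the table has no symbolic links.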

Collaborator

@tobegit3hub tobegit3hub left a comment

LGTM

@tobegit3hub tobegit3hub merged commit 67138ef into 4paradigm:main Apr 15, 2024
29 of 31 checks passed
Labels
batch-engine (openmldb batch(offline) engine), documentation (Improvements or additions to documentation), execute-engine (hybridse sql engine), storage-engine (openmldb storage engine. nameserver & tablet), task-manager (openmldb taskmanager)

3 participants