
Bug: some SQL statements fail to compile and run with OpenMLDB-batch when batch_window_parallelization is enabled #437

@jingchen2222

Description

The following SQL fails to compile and run with OpenMLDB-batch when batch_window_parallelization is enabled:

-- t1: ["col0 string", "col1 int", "col2 int"]
-- t2: ["str0 string", "str1 string", "col0 int", "col1 int"]
SELECT sum(t1.col1) over w1 as sum_t1_col1, t2.str1 as t2_str1 FROM t1
               last join t2 order by t2.col1
               on t1.col1 = t2.col1 and t1.col2 = t2.col0
               WINDOW w1 AS (PARTITION BY t1.col2 ORDER BY t1.col1
               ROWS_RANGE BETWEEN 3 PRECEDING AND CURRENT ROW);

Expected Behavior

The SQL compiles and runs successfully.

Current Behavior

The SQL fails to compile with the following error:

Fail to find column id #48 in current schema context
    (At /Users/chenjing/work/chenjing/OpenMLDB/hybridse/src/vm/sql_compiler.cc:260)
    (At /Users/chenjing/work/chenjing/OpenMLDB/hybridse/src/vm/sql_compiler.cc:166)
    (Caused by) Fail to generate physical plan batch mode
    (At /Users/chenjing/work/chenjing/OpenMLDB/hybridse/src/vm/transform.cc:1489)
    (Caused by) Fail to generate functions for physical plan: LIMIT(limit=10)
  SIMPLE_PROJECT(sources=(sum_t1_col1, t2_str1))
    JOIN(type=kJoinTypeConcat)
      SIMPLE_PROJECT(sources=(t2.str1 -> t2_str1))
        JOIN(type=LastJoin, right_sort=(t2.col1 ASC), condition=, left_keys=(t1.col1,t1.col2), right_keys=(t2.col1,t2.col0), index_keys=)
          DATA_PROVIDER(table=t1)
          DATA_PROVIDER(table=t2)
      PROJECT(type=WindowAggregation)
        +-WINDOW(partition_keys=(t1.col2), orders=(t1.col1 ASC), range=(t1.col1, -3, 0))
        JOIN(type=LastJoin, right_sort=(t2.col1 ASC), condition=, left_keys=(t1.col1,t1.col2), right_keys=(t2.col1,t2.col0), index_keys=)
          DATA_PROVIDER(table=t1)
          DATA_PROVIDER(table=t2)
    (At /Users/chenjing/work/chenjing/OpenMLDB/hybridse/src/vm/transform.cc:197)
    (At /Users/chenjing/work/chenjing/OpenMLDB/hybridse/src/vm/transform.cc:197)
    (At /Users/chenjing/work/chenjing/OpenMLDB/hybridse/src/vm/transform.cc:197)
    (At /Users/chenjing/work/chenjing/OpenMLDB/hybridse/src/vm/transform.cc:290)
    (Caused by) Instantiate 0th native function "__internal_sql_codegen_12" failed at node:
SIMPLE_PROJECT(sources=(t2.str1 -> t2_str1))
  JOIN(type=LastJoin, right_sort=(t2.col1 ASC), condition=, left_keys=(t1.col1,t1.col2), right_keys=(t2.col1,t2.col0), index_keys=)
    DATA_PROVIDER(table=t1)
    DATA_PROVIDER(table=t2)
    (At /Users/chenjing/work/chenjing/OpenMLDB/hybridse/src/codegen/fn_let_ir_builder.cc:130)
    (Caused by) Build expr failed at 0:
+-expr[get field]
  +-input:
    +-expr[id]
      +-var: %44(row)
  +-column_id: 48
  +-column_name: str1
    (At /Users/chenjing/work/chenjing/OpenMLDB/hybridse/src/codegen/fn_let_ir_builder.cc:215)
    (Caused by) Fail to codegen project expression: #48:str1
    (At /Users/chenjing/work/chenjing/OpenMLDB/hybridse/src/codegen/expr_ir_builder.cc:162)
    (At /Users/chenjing/work/chenjing/OpenMLDB/hybridse/src/codegen/expr_ir_builder.cc:835)
    (Caused by) Fail to resolve column #48:str1 from row
    (At /Users/chenjing/work/chenjing/OpenMLDB/hybridse/src/vm/schemas_context.cc:275)
    (Caused by) Fail to find column id #48 in current schema context(#1: 0, 0),(#2: 0, 1),(#3: 0, 2),(#51: 1, 0),(#52: 1, 1),(#53: 1, 2),(#54: 1, 3),
column_name_map_:col0, col1, col2, str0, str1, 

Possible Solution

I think the problem is caused by incorrect column resolution and an inappropriate schema update.

The plan changes after we apply optimization passes to the physical plan, so the fault most likely lies in those passes.
We also observe that SIMPLE_PROJECT(sources=(t2.str1 -> t2_str1)) and PROJECT(type=WindowAggregation) share the same JOIN node. If we run the passes in post-order DFS, we might update the JOIN node first, then update SimpleProject against the new JOIN schema, then update JOIN again, and finally update the WindowAggregation Project.
Based on this observation, I strongly suspect that we apply the optimization to JOIN twice, which leaves SIMPLE_PROJECT out of date with respect to the JOIN schema. That would explain why SimpleProject fails to resolve column #48:str1, the source of t2_str1.

Solution

  1. Keep track of whether each PhysicalNode has been visited.
  2. Skip optimization for a node that has already been visited.
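
The two steps above can be sketched as follows. This is a minimal illustration, not the HybridSE implementation: PlanNode, OptimizeNaive, OptimizeOnce and BuildSharedJoinPlan are hypothetical names, and the counter times_optimized only stands in for "apply the pass and update the schema". The plan shape mirrors the one in the error output, where SimpleProject and the WindowAggregation Project share one LastJoin node.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a physical plan node (not the real HybridSE class).
struct PlanNode {
    std::string name;
    std::vector<std::shared_ptr<PlanNode>> producers;  // child nodes
    int times_optimized = 0;  // how often a pass was applied to this node
};

// Naive post-order DFS: a node reachable through two parents is processed twice.
void OptimizeNaive(const std::shared_ptr<PlanNode>& node) {
    assert(node != nullptr);
    for (const auto& child : node->producers) {
        OptimizeNaive(child);
    }
    ++node->times_optimized;  // placeholder for "apply pass, update schema"
}

// Post-order DFS that remembers visited nodes and skips them on later encounters.
void OptimizeOnce(const std::shared_ptr<PlanNode>& node,
                  std::unordered_set<const PlanNode*>* visited) {
    assert(node != nullptr && visited != nullptr);
    // insert() returns {iterator, false} if the node was already recorded.
    if (!visited->insert(node.get()).second) {
        return;
    }
    for (const auto& child : node->producers) {
        OptimizeOnce(child, visited);
    }
    ++node->times_optimized;
}

// Builds ConcatJoin -> {SimpleProject, WindowAggProject} -> shared LastJoin -> {t1, t2}.
std::shared_ptr<PlanNode> BuildSharedJoinPlan() {
    auto t1 = std::make_shared<PlanNode>(PlanNode{"DATA_PROVIDER(t1)", {}});
    auto t2 = std::make_shared<PlanNode>(PlanNode{"DATA_PROVIDER(t2)", {}});
    auto last_join = std::make_shared<PlanNode>(PlanNode{"JOIN(LastJoin)", {t1, t2}});
    auto simple_project =
        std::make_shared<PlanNode>(PlanNode{"SIMPLE_PROJECT", {last_join}});
    auto window_project =
        std::make_shared<PlanNode>(PlanNode{"PROJECT(WindowAggregation)", {last_join}});
    return std::make_shared<PlanNode>(
        PlanNode{"JOIN(kJoinTypeConcat)", {simple_project, window_project}});
}
```

With the naive traversal the shared LastJoin is processed once per parent, which matches the suspected double update. With the visited set it is processed exactly once, before either parent, so both parents see the same JOIN schema.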

