Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NO MERGE]fix: expression ownership in the temp table of subquery #168

Closed
wants to merge 3 commits into from

Conversation

KKould
Copy link
Member

@KKould KKould commented Mar 16, 2024

What problem does this PR solve?

Fix code for:#164

Code changes

  • Has Rust code change
  • Has CI related scripts change

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Note for reviewer

@crwen
Copy link
Member

crwen commented Mar 16, 2024

Have you tried this before?

CREATE TABLE t1(pk int primary key, a int, b int);
CREATE TABLE t2(pk int primary key, a int, c int);
explain SELECT b FROM t1 WHERE a in (SELECT a FROM t2 where a > 1);
                               PLAN
-------------------------------------------------------------------
 Projection [t1.b] [Project]                                      +
   LeftSemi Join On t1.a = (t1.a) as (_temp_table_1_.a) [HashJoin]+
     Scan t1 -> [a, b] [SeqScan]                                  +
     Projection [t1.a] [Project]                                  +
       Filter (t1.a > 1), Is Having: false [Filter]               +
         Scan t2 -> [] [SeqScan]

The problem is here: First bind t1 and t1.b, t1.a, then the subquery. When binding the subquery, it hit these code, and get t1.a but not t2.a.

FnckSQL/src/binder/expr.rs

Lines 291 to 299 in f8488f4

let mut got_column = None;
for table_catalog in self.context.bind_table.values() {
if let Some(column_catalog) = table_catalog.get_column_by_name(&column_name) {
got_column = Some(column_catalog);
}
if got_column.is_some() {
break;
}
}

@KKould
Copy link
Member Author

KKould commented Mar 16, 2024

You are a genius! Found so many issues and I fixed them, please check them out sir!

let _ = fnck_sql
    .run("CREATE TABLE t1(pk int primary key, a int, b int);")
    .await?;
let _ = fnck_sql
    .run("CREATE TABLE t2(pk int primary key, a int, c int);")
    .await?;
let (schema, tuples) = fnck_sql
    .run("explain SELECT b FROM t1 WHERE a in (SELECT a FROM t2 where a > 1);")
    .await?;
+-------------------------------------------------------------------+
| PLAN                                                              |
+===================================================================+
| Projection [t1.b] [Project]                                       |
|   LeftSemi Join On t1.a = (t2.a) as (_temp_table_1_.a) [HashJoin] |
|     Scan t1 -> [a, b] [SeqScan]                                   |
|     Projection [t2.a] [Project]                                   |
|       Filter (t2.a > 1), Is Having: false [Filter]                |
|         Scan t2 -> [a] [SeqScan]                                  |
+-------------------------------------------------------------------+

@KKould
Copy link
Member Author

KKould commented Mar 16, 2024

Have you tried this before?

CREATE TABLE t1(pk int primary key, a int, b int);
CREATE TABLE t2(pk int primary key, a int, c int);
explain SELECT b FROM t1 WHERE a in (SELECT a FROM t2 where a > 1);
                               PLAN
-------------------------------------------------------------------
 Projection [t1.b] [Project]                                      +
   LeftSemi Join On t1.a = (t1.a) as (_temp_table_1_.a) [HashJoin]+
     Scan t1 -> [a, b] [SeqScan]                                  +
     Projection [t1.a] [Project]                                  +
       Filter (t1.a > 1), Is Having: false [Filter]               +
         Scan t2 -> [] [SeqScan]

The problem is here: First bind t1 and t1.b, t1.a, then the subquery. When binding the subquery, it hit these code, and get t1.a but not t2.a.

@crwen
Please put this case in tests/slt/subquery.slt to avoid similar problems from recurring.

@crwen
Copy link
Member

crwen commented Mar 16, 2024

consider this also:

explain select a from t1 where a in (select t2.a from t1 as t2  where t2.pk > 10);
                               PLAN
-------------------------------------------------------------------
 Projection [t1.a] [Project]                                      +
   LeftSemi Join On t1.a = (t1.a) as (_temp_table_1_.a) [HashJoin]+
     Scan t1 -> [a] [SeqScan]                                     +
     Projection [t1.a] [Project]                                  +
       Filter (t1.pk > 10), Is Having: false [Filter]             +
         Scan t1 -> [pk, a] [SeqScan]

When calling try_reference, a alias will be eliminated and lead to a wrong pos.

pub fn unpack_alias_ref(&self) -> &ScalarExpression {
if let ScalarExpression::Alias { expr, .. } = self {
expr.unpack_alias_ref()
} else {
self
}
}
pub fn try_reference(&mut self, output_exprs: &[ScalarExpression]) {
let fn_output_column = |expr: &ScalarExpression| expr.unpack_alias_ref().output_column();
let self_column = fn_output_column(self);
if let Some((pos, _)) = output_exprs
.iter()
.find_position(|expr| self_column.summary() == fn_output_column(expr).summary())
{
let expr = Box::new(mem::replace(self, ScalarExpression::Empty));
*self = ScalarExpression::Reference { expr, pos };
return;
}

@KKould
Copy link
Member Author

KKould commented Mar 16, 2024

consider this also:

explain select a from t1 where a in (select t2.a from t1 as t2  where t2.pk > 10);
                               PLAN
-------------------------------------------------------------------
 Projection [t1.a] [Project]                                      +
   LeftSemi Join On t1.a = (t1.a) as (_temp_table_1_.a) [HashJoin]+
     Scan t1 -> [a] [SeqScan]                                     +
     Projection [t1.a] [Project]                                  +
       Filter (t1.pk > 10), Is Having: false [Filter]             +
         Scan t1 -> [pk, a] [SeqScan]

When calling try_reference, a alias will be eliminated and lead to a wrong pos.

Great, your observation skills are amazing.

I fixed the problem. Now the pos in the case you gave seems to be the same as before, but it is correct. I checked it in the new commit. The corresponding relationship of pos.

By the way, their pos may be the same, but they actually mean the positions in the respective left tuple and right tuple. This is because expr needs to be used in HashJoin to calculate the hash of the respective tuple. value, so if you need to use on_keys, please ensure that on_keys is calculated based on tuples that have not yet been merged.

left_build: https://github.com/KipData/FnckSQL/blob/main/src/execution/volcano/dql/join/hash_join.rs#L119
right_probe: https://github.com/KipData/FnckSQL/blob/main/src/execution/volcano/dql/join/hash_join.rs#L144

best_plan plan: LogicalPlan {
    operator: Project(
        ProjectOperator {
            exprs: [
                Reference {
                    expr: ColumnRef(
                        ColumnCatalog {
                            summary: ColumnSummary {
                                id: Some(
                                    1,
                                ),
                                name: "a",
                                table_name: Some(
                                    "t1",
                                ),
                            },
                            nullable: true,
                            desc: ColumnDesc {
                                column_datatype: Integer,
                                is_primary: false,
                                is_unique: false,
                                default: None,
                            },
                        },
                    ),
                    pos: 0,
                },
            ],
        },
    ),
    childrens: [
        LogicalPlan {
            operator: Join(
                JoinOperator {
                    on: On {
                        on: [
                            (
                                Reference {
                                    expr: ColumnRef(
                                        ColumnCatalog {
                                            summary: ColumnSummary {
                                                id: Some(
                                                    1,
                                                ),
                                                name: "a",
                                                table_name: Some(
                                                    "t1",
                                                ),
                                            },
                                            nullable: true,
                                            desc: ColumnDesc {
                                                column_datatype: Integer,
                                                is_primary: false,
                                                is_unique: false,
                                                default: None,
                                            },
                                        },
                                    ),
                                    pos: 0,
                                },
                                Reference {
                                    expr: Alias {
                                        expr: ColumnRef(
                                            ColumnCatalog {
                                                summary: ColumnSummary {
                                                    id: Some(
                                                        1,
                                                    ),
                                                    name: "a",
                                                    table_name: Some(
                                                        "t1",
                                                    ),
                                                },
                                                nullable: true,
                                                desc: ColumnDesc {
                                                    column_datatype: Integer,
                                                    is_primary: false,
                                                    is_unique: false,
                                                    default: None,
                                                },
                                            },
                                        ),
                                        alias: Expr(
                                            ColumnRef(
                                                ColumnCatalog {
                                                    summary: ColumnSummary {
                                                        id: Some(
                                                            1,
                                                        ),
                                                        name: "a",
                                                        table_name: Some(
                                                            "_temp_table_1_",
                                                        ),
                                                    },
                                                    nullable: true,
                                                    desc: ColumnDesc {
                                                        column_datatype: Integer,
                                                        is_primary: false,
                                                        is_unique: false,
                                                        default: None,
                                                    },
                                                },
                                            ),
                                        ),
                                    },
                                    pos: 0,
                                },
                            ),
                        ],
                        filter: None,
                    },
                    join_type: LeftSemi,
                },
            ),
            childrens: [
                LogicalPlan {
                    operator: Scan(
                        ScanOperator {
                            table_name: "t1",
                            primary_key: 0,
                            columns: [
                                (
                                    1,
                                    ColumnCatalog {
                                        summary: ColumnSummary {
                                            id: Some(
                                                1,
                                            ),
                                            name: "a",
                                            table_name: Some(
                                                "t1",
                                            ),
                                        },
                                        nullable: true,
                                        desc: ColumnDesc {
                                            column_datatype: Integer,
                                            is_primary: false,
                                            is_unique: false,
                                            default: None,
                                        },
                                    },
                                ),
                            ],
                            limit: (
                                None,
                                None,
                            ),
                            index_infos: [
                                IndexInfo {
                                    meta: IndexMeta {
                                        id: 0,
                                        column_ids: [
                                            0,
                                        ],
                                        table_name: "t1",
                                        pk_ty: Integer,
                                        name: "pk_pk",
                                        ty: PrimaryKey,
                                    },
                                    range: None,
                                },
                            ],
                        },
                    ),
                    childrens: [],
                    physical_option: Some(
                        SeqScan,
                    ),
                    _output_schema_ref: None,
                },
                LogicalPlan {
                    operator: Project(
                        ProjectOperator {
                            exprs: [
                                Alias {
                                    expr: Reference {
                                        expr: ColumnRef(
                                            ColumnCatalog {
                                                summary: ColumnSummary {
                                                    id: Some(
                                                        1,
                                                    ),
                                                    name: "a",
                                                    table_name: Some(
                                                        "t1",
                                                    ),
                                                },
                                                nullable: true,
                                                desc: ColumnDesc {
                                                    column_datatype: Integer,
                                                    is_primary: false,
                                                    is_unique: false,
                                                    default: None,
                                                },
                                            },
                                        ),
                                        pos: 0,
                                    },
                                    alias: Expr(
                                        ColumnRef(
                                            ColumnCatalog {
                                                summary: ColumnSummary {
                                                    id: Some(
                                                        1,
                                                    ),
                                                    name: "a",
                                                    table_name: Some(
                                                        "_temp_table_1_",
                                                    ),
                                                },
                                                nullable: true,
                                                desc: ColumnDesc {
                                                    column_datatype: Integer,
                                                    is_primary: false,
                                                    is_unique: false,
                                                    default: None,
                                                },
                                            },
                                        ),
                                    ),
                                },
                            ],
                        },
                    ),
                    childrens: [
                        LogicalPlan {
                            operator: Project(
                                ProjectOperator {
                                    exprs: [
                                        Reference {
                                            expr: ColumnRef(
                                                ColumnCatalog {
                                                    summary: ColumnSummary {
                                                        id: Some(
                                                            1,
                                                        ),
                                                        name: "a",
                                                        table_name: Some(
                                                            "t1",
                                                        ),
                                                    },
                                                    nullable: true,
                                                    desc: ColumnDesc {
                                                        column_datatype: Integer,
                                                        is_primary: false,
                                                        is_unique: false,
                                                        default: None,
                                                    },
                                                },
                                            ),
                                            pos: 1,
                                        },
                                    ],
                                },
                            ),
                            childrens: [
                                LogicalPlan {
                                    operator: Filter(
                                        FilterOperator {
                                            predicate: Binary {
                                                op: Gt,
                                                left_expr: Reference {
                                                    expr: ColumnRef(
                                                        ColumnCatalog {
                                                            summary: ColumnSummary {
                                                                id: Some(
                                                                    0,
                                                                ),
                                                                name: "pk",
                                                                table_name: Some(
                                                                    "t1",
                                                                ),
                                                            },
                                                            nullable: false,
                                                            desc: ColumnDesc {
                                                                column_datatype: Integer,
                                                                is_primary: true,
                                                                is_unique: false,
                                                                default: None,
                                                            },
                                                        },
                                                    ),
                                                    pos: 0,
                                                },
                                                right_expr: Constant(
                                                    Int32(0),
                                                ),
                                                ty: Boolean,
                                            },
                                            having: false,
                                        },
                                    ),
                                    childrens: [
                                        LogicalPlan {
                                            operator: Scan(
                                                ScanOperator {
                                                    table_name: "t1",
                                                    primary_key: 0,
                                                    columns: [
                                                        (
                                                            0,
                                                            ColumnCatalog {
                                                                summary: ColumnSummary {
                                                                    id: Some(
                                                                        0,
                                                                    ),
                                                                    name: "pk",
                                                                    table_name: Some(
                                                                        "t1",
                                                                    ),
                                                                },
                                                                nullable: false,
                                                                desc: ColumnDesc {
                                                                    column_datatype: Integer,
                                                                    is_primary: true,
                                                                    is_unique: false,
                                                                    default: None,
                                                                },
                                                            },
                                                        ),
                                                        (
                                                            1,
                                                            ColumnCatalog {
                                                                summary: ColumnSummary {
                                                                    id: Some(
                                                                        1,
                                                                    ),
                                                                    name: "a",
                                                                    table_name: Some(
                                                                        "t1",
                                                                    ),
                                                                },
                                                                nullable: true,
                                                                desc: ColumnDesc {
                                                                    column_datatype: Integer,
                                                                    is_primary: false,
                                                                    is_unique: false,
                                                                    default: None,
                                                                },
                                                            },
                                                        ),
                                                    ],
                                                    limit: (
                                                        None,
                                                        None,
                                                    ),
                                                    index_infos: [
                                                        IndexInfo {
                                                            meta: IndexMeta {
                                                                id: 0,
                                                                column_ids: [
                                                                    0,
                                                                ],
                                                                table_name: "t1",
                                                                pk_ty: Integer,
                                                                name: "pk_pk",
                                                                ty: PrimaryKey,
                                                            },
                                                            range: Some(
                                                                Scope {
                                                                    min: Excluded(
                                                                        Int32(0),
                                                                    ),
                                                                    max: Unbounded,
                                                                },
                                                            ),
                                                        },
                                                    ],
                                                },
                                            ),
                                            childrens: [],
                                            physical_option: Some(
                                                SeqScan,
                                            ),
                                            _output_schema_ref: None,
                                        },
                                    ],
                                    physical_option: Some(
                                        Filter,
                                    ),
                                    _output_schema_ref: None,
                                },
                            ],
                            physical_option: Some(
                                Project,
                            ),
                            _output_schema_ref: None,
                        },
                    ],
                    physical_option: Some(
                        Project,
                    ),
                    _output_schema_ref: None,
                },
            ],
            physical_option: Some(
                HashJoin,
            ),
            _output_schema_ref: None,
        },
    ],
    physical_option: Some(
        Project,
    ),
    _output_schema_ref: None,
}

@crwen
Copy link
Member

crwen commented Mar 17, 2024

It seems to project twice on temp table.

And I don't think it can handle more complex situations.

explain SELECT b FROM t1 WHERE a in (SELECT a FROM t2 where (select a from t1));
                               PLAN
-------------------------------------------------------------------
 Projection [t1.b] [Project]                                      +
   LeftSemi Join On t1.a = (t2.a) as (_temp_table_2_.a) [HashJoin]+
     Scan t1 -> [a, b] [SeqScan]                                  +
     Projection [(t2.a) as (_temp_table_2_.a)] [Project]          +
       Projection [t2.a] [Project]                                +
         Inner Join Where _temp_table_1_.a [HashJoin]             +
           Scan t2 -> [a] [SeqScan]                               +
           Projection [(t2.a) as (_temp_table_1_.a)] [Project]    +
             Projection [t2.a] [Project]                          +
               Scan t1 -> [] [SeqScan]

I think every query using their own binder context may be a solution, but it may be a big work.

@KKould
Copy link
Member Author

KKould commented Mar 17, 2024

It seems to project twice on temp table.

And I don't think it can handle more complex situations.

explain SELECT b FROM t1 WHERE a in (SELECT a FROM t2 where (select a from t1));
                               PLAN
-------------------------------------------------------------------
 Projection [t1.b] [Project]                                      +
   LeftSemi Join On t1.a = (t2.a) as (_temp_table_2_.a) [HashJoin]+
     Scan t1 -> [a, b] [SeqScan]                                  +
     Projection [(t2.a) as (_temp_table_2_.a)] [Project]          +
       Projection [t2.a] [Project]                                +
         Inner Join Where _temp_table_1_.a [HashJoin]             +
           Scan t2 -> [a] [SeqScan]                               +
           Projection [(t2.a) as (_temp_table_1_.a)] [Project]    +
             Projection [t2.a] [Project]                          +
               Scan t1 -> [] [SeqScan]

I think every query using their own binder context may be a solution, but it may be a big work.

Yes, let me create an issue for this problem to avoid affecting your PR.

@KKould
Copy link
Member Author

KKould commented Mar 17, 2024

For the two Projects, I referred to Datafusion’s TableAlias ​​and Project.

@KKould
Copy link
Member Author

KKould commented Mar 19, 2024

close by #164

@KKould KKould closed this Mar 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants