Skip to content
This repository has been archived by the owner on May 12, 2024. It is now read-only.

instruction set #12

Merged
merged 24 commits into from
Jul 17, 2022
Merged

instruction set #12

merged 24 commits into from
Jul 17, 2022

Conversation

Samyak2
Copy link
Collaborator

@Samyak2 Samyak2 commented Jun 25, 2022

PR Info

  • Dependents:

Changes

  • Completed the Register data structure to support all the designed instructions. Added all supporting structures too. Some of the types are placeholder, which can only be known once sqlparser is integrated.
    • The View variant will do most of the heaving lifting in many queries. A placeholder method is implemented for now.
  • Added structures for all instructions designed in Design the instruction set #9. Some placeholders for operators and other types are present. It is documented thoroughly.

@Samyak2 Samyak2 added the wip DO NOT MERGE label Jun 25, 2022
- Added more variants to support the newly designed instruction set
- Added data structures:
    - View: does the heavy lifting in the IC
    - InsertDef and InsertRow: for insert instructions
@Samyak2 Samyak2 mentioned this pull request Jun 26, 2022
4 tasks
@Samyak2 Samyak2 added enhancement New feature or request and removed wip DO NOT MERGE labels Jun 30, 2022
@Samyak2
Copy link
Collaborator Author

Samyak2 commented Jun 30, 2022

@tyt2y3 @billy1624 @shpun817 this is ready for review now. Please take a look.

src/ic.rs Outdated
ColumnDef {
index: RegisterIndex,
/// The column name.
name: String,
Copy link
Member

@tyt2y3 tyt2y3 Jul 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think putting a String everywhere is not a good thing, because it allocates on the heap and it's not copy friendly. I'd prefer using ArrayString with a fixed size (say 32 or 64)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL about arraystring, thanks!

Although, I'm not sure if it's needed in this project because:

  • From their benchmarks, the difference in cloning a CacheString (which is ArrayString with 63 chars) and a String is only about 3x. Both take less than 50 ns for it.
  • I chose 63 chars because that's the limit imposed in Postgres and MySQL. Some datawarehouses that also use SQL have much higher limits. For example, Snowflake and AWS Athena have 255 chars. ArrayString seems slow for those sizes.
  • Performance is a non-goal of this project.
  • ArrayString doesn't seem to very well maintained, it was last updated 2 years ago. I would rather stick with std to avoid surprises. Please let me know if I got the wrong crate, this was the first result for "rust ArrayString".

Copy link
Member

@tyt2y3 tyt2y3 Jul 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the crate we are looking for. I think being able to copy the instruction without cloning every time is a huge saver in terms of coding ergonomic. I mean we can effectively derive copy on it! Making it easier to pass around etc.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, that makes sense. I'll implement it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to derive copy for Instruction. I hit a blocker on trying to derive copy for Value. It uses String and Vec: https://github.com/SeaQL/sql-assembly/blob/539036d064f96f578a15422d9d957d0f0c4c18b6/src/value.rs#L37-L41.

I don't want to make those bounded too because DBs don't impose low limits on those either. As an example, SQLite puts a limit of 1 billion bytes: https://www.sqlite.org/limits.html#max_length. So that always has to be heap allocated.

Any suggestions here?

Copy link
Member

@tyt2y3 tyt2y3 Jul 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

um... there is no way around it. though it might hint to us that we'd want to pass values in chunk instead of a free sized blob. forget about it for now then

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

although using ArrayString on the ic is half the way there

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll leave out Copy for now but I'll keep ArrayString

Copy link
Collaborator Author

@Samyak2 Samyak2 Jul 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, some of the structs/enums from sqlparser do not implement Copy: https://github.com/SeaQL/sql-assembly/blob/78e735f8ac30d63a752371a01e36ab49ce4fb328/src/ic.rs#L130

We will have think of those too.

@tyt2y3
Copy link
Member

tyt2y3 commented Jul 1, 2022

I don't have a particular opinion on this. That said, it all pretty makes sense!
However, I would like to have some unit tests or at least some non-working snippets to illustrate how it all comes to live.
For example, can you illustrate how'd some basic SQL would be represented?

Starting from:

SELECT * FROM mytable WHERE col_1 = 1

we should have a list of these statements and write a bunch of unit tests for them (for testing the AST -> ASM conversion stage)

@Samyak2 Samyak2 changed the title wip: instruction set instruction set Jul 1, 2022
@Samyak2
Copy link
Collaborator Author

Samyak2 commented Jul 1, 2022

Definitely, that's a good idea. It will also help in assessing that we have all the instructions we need for basic queries. I will work on this today.

@tyt2y3
Copy link
Member

tyt2y3 commented Jul 1, 2022

Yeah I think a lot of the design decisions can be verified without really implementing. We can dry run in our brains given a hypothetical specification

@Samyak2
Copy link
Collaborator Author

Samyak2 commented Jul 1, 2022

Yes, I have considered quite a few queries while designing the instruction set. Although that was all in my mind. It would be good to make them concrete in code as tests.

Note: these are only examples. No actual conversion takes place since
the parser is not implemented yet.

#[test]
fn insert_statements() {
// `INSERT INTO table1 VALUES (1, 'foo', 2)`
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working on adding these too. Should be done by first half tomorrow.

src/ic.rs Outdated Show resolved Hide resolved
src/ic.rs Outdated Show resolved Hide resolved
src/ic.rs Outdated Show resolved Hide resolved
src/ic.rs Outdated Show resolved Hide resolved
src/ic.rs Show resolved Hide resolved
Copy link
Member

@tyt2y3 tyt2y3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two questions in my head.
Can you explain how View should work?
How do we distinguish between creating an empty view and create a view from a table?
And how should Project work? Does it add a column to a view?

Right now I think there are two types of project being mixed, 1 is get a column from a view, 2 is create a new column from an existing column.

The question is, how do we represent SELECT col_1 + 1 FROM my_table? (note that I use my_ here to avoid confusion)

@Samyak2
Copy link
Collaborator Author

Samyak2 commented Jul 2, 2022

Can you explain how View should work? How do we distinguish between creating an empty view and create a view from a table?

I was thinking about this too. In the current instruction set, there isn't a way to distinguish between these two cases. My idea was to add a field existing that is set to true when it's on an existing table and false when it's a new temporary table. But then we'll also need an instruction to get the type of a column from it's name to add columns to the temporary table.

I will think some more on this and update.

And how should Project work? Does it add a column to a view?

No, Project in the current instruction set is only used to restrict the columns that are returned from the view. It's the projection operation from relational algebra. Like SELECT a, b, c, FROM ....

Right now I think there are two types of project being mixed, 1 is get a column from a view, 2 is create a new column from an existing column.

The question is, how do we represent SELECT col_1 + 1 FROM my_table? (note that I use my_ here to avoid confusion)

That's a good question. It's also something I realized when writing the examples for UPDATE: https://github.com/SeaQL/sql-assembly/blob/78e735f8ac30d63a752371a01e36ab49ce4fb328/src/ic.rs#L207-L208

In the current instruction set, there isn't any support for expressions, only column names. I will think of how to add support for expressions. That will also make ProjectAggregate redundant since the aggregation function will be part of the expression.

@tyt2y3
Copy link
Member

tyt2y3 commented Jul 2, 2022

I don't think we should be lazy here, otherwise it will create a complex dependency tree, essentially becoming another AST to evaluate

@tyt2y3
Copy link
Member

tyt2y3 commented Jul 2, 2022

Would that make more sense?

@Samyak2
Copy link
Collaborator Author

Samyak2 commented Jul 2, 2022

I actually think that the IC can 'outsource' the expressions part, i.e. it takes a Fn lambda and let the caller implement the necessary transform. Same as filter, we can also allow a custom Fn to be passed in, so they can filter by any expression

Will this lambda be on a row or on a column? Aggregations will need the whole column.

Oh I think now it's being too elaborate in terms of the VM. I think may be we should simply reuse the AST from sqlparser and interpret that 'on the fly' for evaluating expressions. The gist of the VM is to handle dataflow, not evaluate expressions

I think this makes more sense. We can re-use Expr from sqlparser directly. There isn't a need for additional abstractions or transformations here.

I don't think we should be lazy here, otherwise it will create a complex dependency tree, essentially becoming another AST to evaluate

I was thinking the same thing. The View register was almost mirroring the SELECT statement's AST. Eager makes more sense. I'll change it.

@Samyak2
Copy link
Collaborator Author

Samyak2 commented Jul 2, 2022

So I have been thinking about how UPDATE will work with eager execution. Once the data has been filtered (the WHERE clause of the update query), we won't have references to the original table to update only those rows. Two solutions I could think of:

  • Keep a unique ID for each row of all tables. Regardless of whether they have a primary key or unique columns. Then the rows can be matched even after filtered. This unique ID will be hidden in actual outputs.
  • Make filters lazy for UPDATE. This will be a different instruction that stores the filters one by one and executes the update on them. No projections are involved here.

Which do you think is a better idea?

@Samyak2
Copy link
Collaborator Author

Samyak2 commented Jul 3, 2022

I prefer the first solution - store an ID for each row. The reason is that UPDATE can include tables with JOINs too - https://dev.mysql.com/doc/refman/8.0/en/update.html.

Another approach would be to execute the AST directly for UPDATE and DELETE. For reference, the AST for UPDATE is: https://docs.rs/sqlparser/0.18.0/sqlparser/ast/enum.Statement.html#variant.Update (we can ignore from since that is a Postgres exclusive feature). This avoids changing anything else in the DQL instructions for these DML statements.

@tyt2y3
Copy link
Member

tyt2y3 commented Jul 3, 2022

Keep a unique ID

This makes the most sense. I think if a table hasn't have a primary key already. Most implementations would make one implicitly.

Will this lambda be on a row or on a column? Aggregations will need the whole column.

I think the signature will be simply Fn(Value) -> Value so it is per row. I think aggregation is a different problem and that it needs to be implemented by the VM itself

@tyt2y3
Copy link
Member

tyt2y3 commented Jul 3, 2022

Ah, actually, project can be thought as map then aggregate is reduce

@Samyak2
Copy link
Collaborator Author

Samyak2 commented Jul 3, 2022

I think if a table hasn't have a primary key already. Most implementations would make one implicitly.

Makes sense.

I think the signature will be simply Fn(Value) -> Value so it is per row.

It must be per row like Fn(Row) -> Value or Fn(Vec<Value>) -> Value since expressions can include multiple columns too. For example, SELECT col1 + col2 FROM table1.

Also, the caller will need to know at which index a particular column is. I think it will be easier to just store the Expr directly.

@tyt2y3
Copy link
Member

tyt2y3 commented Jul 3, 2022

Agreed

@Samyak2
Copy link
Collaborator Author

Samyak2 commented Jul 3, 2022

Ah, actually, project can be thought as map then aggregate is reduce

I missed this comment. That makes sense, let me think about it some more. Getting the row to be addressable by column name might be a challenge though.

@Samyak2
Copy link
Collaborator Author

Samyak2 commented Jul 3, 2022

So I thought about the lambda idea:

  • The closures passed from the AST will look something like this:

    match op {
        BinOp::Plus => |row| row.get_col(col_name) + value
        // Similarly for each operator
    }

    Or an abstraction like:

    match op {
        BinOp => |row| row.get_col(col_name).apply_binary(op, value)
        // Similarly for each type of operator
    }
  • The Value itself will implement the basic operators like +, -. Here, all the operators or operator classes will be repeated again in the AST -> IC conversion.

  • I don't think this provides any value over just taking an Expr and evaluating it ourselves. There can be an expr evaluator that handles any Expr.

Maybe it will help that the executor doesn't need to deal with Expr, but I'm not sure. Thoughts?

Note: still using our custom `Expr` since we have a custom `Value`.
`sqlparser`'s `Expr` uses their `Value` which will require conversion
anyway.
src/ic.rs Outdated Show resolved Hide resolved
This unique key will be used to identify and match rows of the original
table from rows of the filtered or joined table.

Ultimately, this will be used in `UPDATE` queries - either from a simple
table or from the join/filter of multiple tables.
@tyt2y3
Copy link
Member

tyt2y3 commented Jul 4, 2022

Maybe it will help that the executor doesn't need to deal with Expr, but I'm not sure. Thoughts?

I think the goal is to free the responsibility from the VM, such that the VM only takes care execution, sort of like how users can apply custom functions in Excel.

Since it was unrelated to the VM.
@Samyak2
Copy link
Collaborator Author

Samyak2 commented Jul 5, 2022

I think the goal is to free the responsibility from the VM, such that the VM only takes care execution, sort of like how users can apply custom functions in Excel.

Yes, I agree with this architecture. That's how it's currently architected too. The VM and the intermediate code only deal with the high level execution of a query - like an execution plan in most DBs.

The VM will call a fn execute(expr: Expr, data: Row) -> Value whenever it needs to evaluate an Expr. For a Filter, it will look for Value::Bool and error otherwise. So the abstraction for a custom function/operation is the Expr. I don't see why we need yet another level of abstraction with a closure for this.

src/vm.rs Outdated
/// Filters applied over a table.
Filter(Filter),
/// An entire table.
Table(Mrc<RwLock<Table>>),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tyt2y3 I need some help here. I need to store a reference to a table in a register - it will either be:

  • An existing table. In this case, it will be held in the tables field of the schema. So an Rc or a reference is definitely needed here.
  • A new empty table. This table can be "created" => added to the schema, so Rc is needed again. An empty table by itself is useless, so it also needs to be mutable.

In both cases, I need it to be mutable because instructions like Insert, Update, AddColumn, RemoveColumn can mutate the table.

Using an Rc<Mutex> or Rc<RwLock> was the only option that I could see that satisfies both. But I don't think that's a good idea because we aren't using multi-threading and wrong usage of RwLock could lead to deadlocks.

I thought of adding a new variant MutableTable when mutability is needed. But it also has the same issue. What will the type of that variant be? How will ownership be "transferred" to the schema when the table is "created"?

Copy link
Member

@tyt2y3 tyt2y3 Jul 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely not mutex and/or locks. If you think it is essential then you have a flaw in the design.

The VM should be the sole owner of the table and any mutation on it is required to go through an IC. Definitely not some outsider to modify it freely? Ownership can be transferred in, but not out (unless it's a drop, then we'd probably can transfer it to another VM)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I think I have an idea for this, to make VM the owner instead of the schema or the register.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have implemented this in 558237a and 2505f11.

The idea is to use indices instead of references. These indices are only usable along with a reference to the VM. Since the execute fn already is on a mutable reference to the VM, these indices can be used for mutation too.
There is a problem with freeing unused indices, but that's not a big issue IMO because:

  • This is intended for short-lived tests, not for production use-cases where it will be running for days.
  • There is a relatively small number of places where new tables are created. We can ensure, with a lot of tests, that indices are correctly freed on a DROP TABLE or after the temporary table is no longer needed (to be done with a DropTable instruction at code generation time).

This approach is described better by Niko Matsakis here: http://smallcultfollowing.com/babysteps/blog/2015/04/06/modeling-graphs-in-rust-using-vector-indices/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed x 10

@tyt2y3 tyt2y3 merged commit 72faba9 into main Jul 17, 2022
@tyt2y3
Copy link
Member

tyt2y3 commented Jul 17, 2022

Let's move forward

@Samyak2 Samyak2 deleted the instruction-set branch July 17, 2022 16:55
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Design the instruction set
2 participants