instruction set #12

Samyak2 · 2022-06-25T18:21:47Z

PR Info

Closes Design the instruction set #9

Dependencies:
- Data structures improvements #11

Dependents:

Changes

Completed the Register data structure to support all the designed instructions. Added all supporting structures too. Some of the types are placeholder, which can only be known once sqlparser is integrated.
- The View variant will do most of the heaving lifting in many queries. A placeholder method is implemented for now.
Added structures for all instructions designed in Design the instruction set #9. Some placeholders for operators and other types are present. It is documented thoroughly.

- Added more variants to support the newly designed instruction set - Added data structures: - View: does the heavy lifting in the IC - InsertDef and InsertRow: for insert instructions

Samyak2 · 2022-06-30T19:06:55Z

@tyt2y3 @billy1624 @shpun817 this is ready for review now. Please take a look.

tyt2y3 · 2022-07-01T08:11:49Z

src/ic.rs

+    ColumnDef {
+        index: RegisterIndex,
+        /// The column name.
+        name: String,


I think putting a String everywhere is not a good thing, because it allocates on the heap and it's not copy friendly. I'd prefer using ArrayString with a fixed size (say 32 or 64)

TIL about arraystring, thanks!

Although, I'm not sure if it's needed in this project because:

From their benchmarks, the difference in cloning a CacheString (which is ArrayString with 63 chars) and a String is only about 3x. Both take less than 50 ns for it.

I chose 63 chars because that's the limit imposed in Postgres and MySQL. Some datawarehouses that also use SQL have much higher limits. For example, Snowflake and AWS Athena have 255 chars. ArrayString seems slow for those sizes.

Performance is a non-goal of this project.

ArrayString doesn't seem to very well maintained, it was last updated 2 years ago. I would rather stick with std to avoid surprises. Please let me know if I got the wrong crate, this was the first result for "rust ArrayString".

Yes, that's the crate we are looking for. I think being able to copy the instruction without cloning every time is a huge saver in terms of coding ergonomic. I mean we can effectively derive copy on it! Making it easier to pass around etc.

Ah I see, that makes sense. I'll implement it

I was trying to derive copy for Instruction. I hit a blocker on trying to derive copy for Value. It uses String and Vec: https://github.com/SeaQL/sql-assembly/blob/539036d064f96f578a15422d9d957d0f0c4c18b6/src/value.rs#L37-L41.

I don't want to make those bounded too because DBs don't impose low limits on those either. As an example, SQLite puts a limit of 1 billion bytes: https://www.sqlite.org/limits.html#max_length. So that always has to be heap allocated.

Any suggestions here?

um... there is no way around it. though it might hint to us that we'd want to pass values in chunk instead of a free sized blob. forget about it for now then

although using ArrayString on the ic is half the way there

Okay, I'll leave out Copy for now but I'll keep ArrayString

Also, some of the structs/enums from sqlparser do not implement Copy: https://github.com/SeaQL/sql-assembly/blob/78e735f8ac30d63a752371a01e36ab49ce4fb328/src/ic.rs#L130

We will have think of those too.

tyt2y3 · 2022-07-01T08:16:30Z

I don't have a particular opinion on this. That said, it all pretty makes sense!
However, I would like to have some unit tests or at least some non-working snippets to illustrate how it all comes to live.
For example, can you illustrate how'd some basic SQL would be represented?

Starting from:

SELECT * FROM mytable WHERE col_1 = 1

we should have a list of these statements and write a bunch of unit tests for them (for testing the AST -> ASM conversion stage)

Samyak2 · 2022-07-01T09:14:33Z

Definitely, that's a good idea. It will also help in assessing that we have all the instructions we need for basic queries. I will work on this today.

tyt2y3 · 2022-07-01T11:57:15Z

Yeah I think a lot of the design decisions can be verified without really implementing. We can dry run in our brains given a hypothetical specification

Samyak2 · 2022-07-01T14:07:26Z

Yes, I have considered quite a few queries while designing the instruction set. Although that was all in my mind. It would be good to make them concrete in code as tests.

Note: these are only examples. No actual conversion takes place since the parser is not implemented yet.

Samyak2 · 2022-07-01T19:02:29Z

src/ic.rs

+
+    #[test]
+    fn insert_statements() {
+        // `INSERT INTO table1 VALUES (1, 'foo', 2)`


I'm working on adding these too. Should be done by first half tomorrow.

src/ic.rs

tyt2y3

Two questions in my head.
Can you explain how View should work?
How do we distinguish between creating an empty view and create a view from a table?
And how should Project work? Does it add a column to a view?

Right now I think there are two types of project being mixed, 1 is get a column from a view, 2 is create a new column from an existing column.

The question is, how do we represent SELECT col_1 + 1 FROM my_table? (note that I use my_ here to avoid confusion)

Samyak2 · 2022-07-02T10:52:24Z

Can you explain how View should work? How do we distinguish between creating an empty view and create a view from a table?

I was thinking about this too. In the current instruction set, there isn't a way to distinguish between these two cases. My idea was to add a field existing that is set to true when it's on an existing table and false when it's a new temporary table. But then we'll also need an instruction to get the type of a column from it's name to add columns to the temporary table.

I will think some more on this and update.

And how should Project work? Does it add a column to a view?

No, Project in the current instruction set is only used to restrict the columns that are returned from the view. It's the projection operation from relational algebra. Like SELECT a, b, c, FROM ....

Right now I think there are two types of project being mixed, 1 is get a column from a view, 2 is create a new column from an existing column.

The question is, how do we represent SELECT col_1 + 1 FROM my_table? (note that I use my_ here to avoid confusion)

That's a good question. It's also something I realized when writing the examples for UPDATE: https://github.com/SeaQL/sql-assembly/blob/78e735f8ac30d63a752371a01e36ab49ce4fb328/src/ic.rs#L207-L208

In the current instruction set, there isn't any support for expressions, only column names. I will think of how to add support for expressions. That will also make ProjectAggregate redundant since the aggregation function will be part of the expression.

tyt2y3 · 2022-07-02T18:13:36Z

I don't think we should be lazy here, otherwise it will create a complex dependency tree, essentially becoming another AST to evaluate

tyt2y3 · 2022-07-02T18:15:58Z

Would that make more sense?

Samyak2 · 2022-07-02T18:19:49Z

I actually think that the IC can 'outsource' the expressions part, i.e. it takes a Fn lambda and let the caller implement the necessary transform. Same as filter, we can also allow a custom Fn to be passed in, so they can filter by any expression

Will this lambda be on a row or on a column? Aggregations will need the whole column.

Oh I think now it's being too elaborate in terms of the VM. I think may be we should simply reuse the AST from sqlparser and interpret that 'on the fly' for evaluating expressions. The gist of the VM is to handle dataflow, not evaluate expressions

I think this makes more sense. We can re-use Expr from sqlparser directly. There isn't a need for additional abstractions or transformations here.

I don't think we should be lazy here, otherwise it will create a complex dependency tree, essentially becoming another AST to evaluate

I was thinking the same thing. The View register was almost mirroring the SELECT statement's AST. Eager makes more sense. I'll change it.

Samyak2 · 2022-07-02T18:59:14Z

So I have been thinking about how UPDATE will work with eager execution. Once the data has been filtered (the WHERE clause of the update query), we won't have references to the original table to update only those rows. Two solutions I could think of:

Keep a unique ID for each row of all tables. Regardless of whether they have a primary key or unique columns. Then the rows can be matched even after filtered. This unique ID will be hidden in actual outputs.
Make filters lazy for UPDATE. This will be a different instruction that stores the filters one by one and executes the update on them. No projections are involved here.

Which do you think is a better idea?

Samyak2 · 2022-07-03T10:59:57Z

I prefer the first solution - store an ID for each row. The reason is that UPDATE can include tables with JOINs too - https://dev.mysql.com/doc/refman/8.0/en/update.html.

Another approach would be to execute the AST directly for UPDATE and DELETE. For reference, the AST for UPDATE is: https://docs.rs/sqlparser/0.18.0/sqlparser/ast/enum.Statement.html#variant.Update (we can ignore from since that is a Postgres exclusive feature). This avoids changing anything else in the DQL instructions for these DML statements.

tyt2y3 · 2022-07-03T11:58:28Z

Keep a unique ID

This makes the most sense. I think if a table hasn't have a primary key already. Most implementations would make one implicitly.

Will this lambda be on a row or on a column? Aggregations will need the whole column.

I think the signature will be simply Fn(Value) -> Value so it is per row. I think aggregation is a different problem and that it needs to be implemented by the VM itself

tyt2y3 · 2022-07-03T12:15:18Z

Ah, actually, project can be thought as map then aggregate is reduce

Samyak2 · 2022-07-03T12:16:05Z

I think if a table hasn't have a primary key already. Most implementations would make one implicitly.

Makes sense.

I think the signature will be simply Fn(Value) -> Value so it is per row.

It must be per row like Fn(Row) -> Value or Fn(Vec<Value>) -> Value since expressions can include multiple columns too. For example, SELECT col1 + col2 FROM table1.

Also, the caller will need to know at which index a particular column is. I think it will be easier to just store the Expr directly.

tyt2y3 · 2022-07-03T13:37:18Z

Agreed

Samyak2 · 2022-07-03T13:40:40Z

Ah, actually, project can be thought as map then aggregate is reduce

I missed this comment. That makes sense, let me think about it some more. Getting the row to be addressable by column name might be a challenge though.

Samyak2 · 2022-07-03T14:18:00Z

So I thought about the lambda idea:

The closures passed from the AST will look something like this:

match op {
    BinOp::Plus => |row| row.get_col(col_name) + value
    // Similarly for each operator
}

Or an abstraction like:

match op {
    BinOp => |row| row.get_col(col_name).apply_binary(op, value)
    // Similarly for each type of operator
}

The Value itself will implement the basic operators like +, -. Here, all the operators or operator classes will be repeated again in the AST -> IC conversion.
I don't think this provides any value over just taking an Expr and evaluating it ourselves. There can be an expr evaluator that handles any Expr.

Maybe it will help that the executor doesn't need to deal with Expr, but I'm not sure. Thoughts?

Note: still using our custom `Expr` since we have a custom `Value`. `sqlparser`'s `Expr` uses their `Value` which will require conversion anyway.

src/ic.rs

This unique key will be used to identify and match rows of the original table from rows of the filtered or joined table. Ultimately, this will be used in `UPDATE` queries - either from a simple table or from the join/filter of multiple tables.

tyt2y3 · 2022-07-04T16:31:10Z

Maybe it will help that the executor doesn't need to deal with Expr, but I'm not sure. Thoughts?

I think the goal is to free the responsibility from the VM, such that the VM only takes care execution, sort of like how users can apply custom functions in Excel.

Since it was unrelated to the VM.

Samyak2 · 2022-07-05T18:19:39Z

I think the goal is to free the responsibility from the VM, such that the VM only takes care execution, sort of like how users can apply custom functions in Excel.

Yes, I agree with this architecture. That's how it's currently architected too. The VM and the intermediate code only deal with the high level execution of a query - like an execution plan in most DBs.

The VM will call a fn execute(expr: Expr, data: Row) -> Value whenever it needs to evaluate an Expr. For a Filter, it will look for Value::Bool and error otherwise. So the abstraction for a custom function/operation is the Expr. I don't see why we need yet another level of abstraction with a closure for this.

Samyak2 · 2022-07-05T18:48:54Z

src/vm.rs

-    /// Filters applied over a table.
-    Filter(Filter),
+    /// An entire table.
+    Table(Mrc<RwLock<Table>>),


@tyt2y3 I need some help here. I need to store a reference to a table in a register - it will either be:

An existing table. In this case, it will be held in the tables field of the schema. So an Rc or a reference is definitely needed here.

A new empty table. This table can be "created" => added to the schema, so Rc is needed again. An empty table by itself is useless, so it also needs to be mutable.

In both cases, I need it to be mutable because instructions like Insert, Update, AddColumn, RemoveColumn can mutate the table.

Using an Rc<Mutex> or Rc<RwLock> was the only option that I could see that satisfies both. But I don't think that's a good idea because we aren't using multi-threading and wrong usage of RwLock could lead to deadlocks.

I thought of adding a new variant MutableTable when mutability is needed. But it also has the same issue. What will the type of that variant be? How will ownership be "transferred" to the schema when the table is "created"?

Definitely not mutex and/or locks. If you think it is essential then you have a flaw in the design.

The VM should be the sole owner of the table and any mutation on it is required to go through an IC. Definitely not some outsider to modify it freely? Ownership can be transferred in, but not out (unless it's a drop, then we'd probably can transfer it to another VM)

I see. I think I have an idea for this, to make VM the owner instead of the schema or the register.

I have implemented this in 558237a and 2505f11.

The idea is to use indices instead of references. These indices are only usable along with a reference to the VM. Since the execute fn already is on a mutable reference to the VM, these indices can be used for mutation too.
There is a problem with freeing unused indices, but that's not a big issue IMO because:

This is intended for short-lived tests, not for production use-cases where it will be running for days.

There is a relatively small number of places where new tables are created. We can ensure, with a lot of tests, that indices are correctly freed on a DROP TABLE or after the temporary table is no longer needed (to be done with a DropTable instruction at code generation time).

This approach is described better by Niko Matsakis here: http://smallcultfollowing.com/babysteps/blog/2015/04/06/modeling-graphs-in-rust-using-vector-indices/

Agreed x 10

tyt2y3 · 2022-07-17T13:03:57Z

Let's move forward

Samyak2 added the wip DO NOT MERGE label Jun 25, 2022

feat: flesh out Register for IC

1b82fb5

- Added more variants to support the newly designed instruction set - Added data structures: - View: does the heavy lifting in the IC - InsertDef and InsertRow: for insert instructions

Samyak2 force-pushed the instruction-set branch from 1b2987d to 1b82fb5 Compare June 26, 2022 09:51

build: add local cargo config to gitignore

1f801f9

Samyak2 mentioned this pull request Jun 26, 2022

Plan #5

Open

4 tasks

Samyak2 added 5 commits June 28, 2022 23:02

feat: implement structure for general instructions

a05b19c

feat: add instructions for creation statements

a47a6d5

feat: add instructions for insert and update

c4a44e6

feat: add instructions to support JOINs

f15d00f

feat: add instructions and view struct for GROUP BY and aggregations

b568713

Samyak2 added enhancement New feature or request and removed wip DO NOT MERGE labels Jun 30, 2022

tyt2y3 reviewed Jul 1, 2022

View reviewed changes

Samyak2 changed the title ~~wip: instruction set~~ instruction set Jul 1, 2022

Samyak2 added 3 commits July 1, 2022 21:02

style: run cargo fmt

f5b98cf

test: add example tests of SQL to IC conversion

eeabbe6

Note: these are only examples. No actual conversion takes place since the parser is not implemented yet.

feat: use a bounded ArrayString for identifiers

78e735f

Samyak2 commented Jul 1, 2022

View reviewed changes