- Added more variants to support the newly designed instruction set
- Added data structures:
  - View: does the heavy lifting in the IC
  - InsertDef and InsertRow: for insert instructions
@tyt2y3 @billy1624 @shpun817 this is ready for review now. Please take a look.
src/ic.rs
Outdated
ColumnDef {
    index: RegisterIndex,
    /// The column name.
    name: String,
I think putting a String everywhere is not a good thing, because it allocates on the heap and it's not copy friendly. I'd prefer using ArrayString with a fixed size (say 32 or 64)
TIL about arraystring, thanks!
Although, I'm not sure if it's needed in this project because:
- From their benchmarks, the difference in cloning a CacheString (which is ArrayString with 63 chars) and a String is only about 3x. Both take less than 50 ns for it.
- I chose 63 chars because that's the limit imposed in Postgres and MySQL. Some data warehouses that also use SQL have much higher limits. For example, Snowflake and AWS Athena have 255 chars. ArrayString seems slow for those sizes.
- Performance is a non-goal of this project.
- ArrayString doesn't seem to be very well maintained; it was last updated 2 years ago. I would rather stick with std to avoid surprises. Please let me know if I got the wrong crate, this was the first result for "rust ArrayString".
Yes, that's the crate we are looking for. I think being able to copy the instruction without cloning every time is a huge saver in terms of coding ergonomics. I mean we can effectively derive Copy on it! Making it easier to pass around etc.
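To make the ergonomics argument concrete, here is a minimal sketch of why a fixed-capacity name lets the instruction derive Copy. It uses only std: `FixedName` is a hypothetical stand-in for `ArrayString`, not the crate's actual type, and the `RegisterIndex` definition and the single-variant `Instruction` enum are assumptions made for illustration.

```rust
/// Hypothetical stand-in for `ArrayString`: a fixed-capacity name stored inline.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FixedName<const N: usize> {
    buf: [u8; N],
    len: usize,
}

impl<const N: usize> FixedName<N> {
    /// Returns `None` if `s` does not fit in `N` bytes.
    pub fn new(s: &str) -> Option<Self> {
        if s.len() > N {
            return None;
        }
        let mut buf = [0u8; N];
        buf[..s.len()].copy_from_slice(s.as_bytes());
        Some(Self { buf, len: s.len() })
    }

    pub fn as_str(&self) -> &str {
        // `buf[..len]` was copied from a valid `&str`, so this cannot fail.
        std::str::from_utf8(&self.buf[..self.len]).expect("valid UTF-8")
    }
}

/// Assumed shape of the register index; the real definition may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RegisterIndex(pub usize);

/// With no heap-allocated fields, the whole enum can derive `Copy`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Instruction {
    ColumnDef {
        index: RegisterIndex,
        /// The column name, capped at 63 bytes.
        name: FixedName<63>,
    },
}

/// Takes the instruction by value; callers keep their copy.
pub fn name_of(ins: Instruction) -> String {
    match ins {
        Instruction::ColumnDef { name, .. } => name.as_str().to_string(),
    }
}
```

Swapping `name` back to `String` would make the `Copy` derive fail to compile, which is the same wall a variant holding a `String` or `Vec` runs into.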
Ah I see, that makes sense. I'll implement it
I was trying to derive Copy for Instruction. I hit a blocker on trying to derive Copy for Value. It uses String and Vec: https://github.com/SeaQL/sql-assembly/blob/539036d064f96f578a15422d9d957d0f0c4c18b6/src/value.rs#L37-L41.
I don't want to make those bounded too because DBs don't impose low limits on those either. As an example, SQLite puts a limit of 1 billion bytes: https://www.sqlite.org/limits.html#max_length. So that always has to be heap allocated.
Any suggestions here?
Um... there is no way around it. Though it might hint to us that we'd want to pass values in chunks instead of as a free-sized blob. Forget about it for now then.
although using ArrayString on the ic is half the way there
Okay, I'll leave out Copy for now but I'll keep ArrayString
Also, some of the structs/enums from sqlparser do not implement Copy: https://github.com/SeaQL/sql-assembly/blob/78e735f8ac30d63a752371a01e36ab49ce4fb328/src/ic.rs#L130
We will have to think about those too.
I don't have a particular opinion on this. That said, it all pretty much makes sense! Starting from: SELECT * FROM mytable WHERE col_1 = 1 we should have a list of these statements and write a bunch of unit tests for them (for testing the AST -> ASM conversion stage)
Definitely, that's a good idea. It will also help in assessing that we have all the instructions we need for basic queries. I will work on this today.
Yeah I think a lot of the design decisions can be verified without really implementing. We can dry run in our brains given a hypothetical specification
Yes, I have considered quite a few queries while designing the instruction set. Although that was all in my mind. It would be good to make them concrete in code as tests.
Note: these are only examples. No actual conversion takes place since the parser is not implemented yet.
#[test]
fn insert_statements() {
    // `INSERT INTO table1 VALUES (1, 'foo', 2)`
I'm working on adding these too. Should be done by first half tomorrow.
Two questions in my head.
Can you explain how View should work? How do we distinguish between creating an empty view and create a view from a table?
And how should Project work? Does it add a column to a view?
Right now I think there are two types of project being mixed, 1 is get a column from a view, 2 is create a new column from an existing column.
The question is, how do we represent SELECT col_1 + 1 FROM my_table? (note that I use my_ here to avoid confusion)
I was thinking about this too. In the current instruction set, there isn't a way to distinguish between these two cases. My idea was to add a field I will think some more on this and update.
No,
That's a good question. It's also something I realized when writing the examples for In the current instruction set, there isn't any support for expressions, only column names. I will think of how to add support for expressions. That will also make
I don't think we should be lazy here, otherwise it will create a complex dependency tree, essentially becoming another AST to evaluate
Would that make more sense?
Will this lambda be on a row or on a column? Aggregations will need the whole column.
I think this makes more sense. We can re-use
I was thinking the same thing. The |
So I have been thinking about how
Which do you think is a better idea?
I prefer the first solution - store an ID for each row. The reason is that Another approach would be to execute the AST directly for
This makes the most sense. I think if a table doesn't already have a primary key, most implementations would make one implicitly.
I think the signature will be simply
Ah, actually, project can be thought as
Makes sense.
It must be per row like Also, the caller will need to know at which index a particular column is. I think it will be easier to just store the
Agreed
I missed this comment. That makes sense, let me think about it some more. Getting the row to be addressable by column name might be a challenge though.
So I thought about the lambda idea:
Maybe it will help that the executor doesn't need to deal with |
Note: still using our custom `Expr` since we have a custom `Value`. `sqlparser`'s `Expr` uses their `Value` which will require conversion anyway.
This unique key will be used to identify and match rows of the original table from rows of the filtered or joined table. Ultimately, this will be used in `UPDATE` queries - either from a simple table or from the join/filter of multiple tables.
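A rough sketch of the row-ID idea follows: filtering yields the IDs of matching rows, and an update is then applied to the original table through those IDs. All names here (`Row`, `KeyedTable`, `filter_ids`, `update_column`) are hypothetical, and values are simplified to `i64` for brevity; it is not the PR's actual implementation.

```rust
/// A row tagged with a stable ID that survives filtering (hypothetical layout).
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub id: u64,
    pub values: Vec<i64>,
}

#[derive(Debug, Default)]
pub struct KeyedTable {
    pub rows: Vec<Row>,
    next_id: u64,
}

impl KeyedTable {
    /// Appends a row and returns the implicit key assigned to it.
    pub fn push(&mut self, values: Vec<i64>) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.rows.push(Row { id, values });
        id
    }
}

/// The "filtered view": only the IDs of rows matching the predicate.
pub fn filter_ids(table: &KeyedTable, pred: impl Fn(&Row) -> bool) -> Vec<u64> {
    table
        .rows
        .iter()
        .filter(|r| pred(*r))
        .map(|r| r.id)
        .collect()
}

/// Sets column `col` to `value` on every row whose ID is in `ids`.
/// Returns the number of rows updated.
pub fn update_column(table: &mut KeyedTable, ids: &[u64], col: usize, value: i64) -> usize {
    let mut updated = 0;
    for row in table.rows.iter_mut().filter(|r| ids.contains(&r.id)) {
        if let Some(cell) = row.values.get_mut(col) {
            *cell = value;
            updated += 1;
        }
    }
    updated
}
```

Something like `UPDATE t SET col_1 = 99 WHERE col_0 = 1` would then map to `filter_ids` followed by `update_column`, with the IDs acting as the bridge between the filtered rows and the original table.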
I think the goal is to free the VM of that responsibility, such that the VM only takes care of execution, sort of like how users can apply custom functions in Excel.
Since it was unrelated to the VM.
Yes, I agree with this architecture. That's how it's currently architected too. The VM and the intermediate code only deal with the high level execution of a query - like an execution plan in most DBs. The VM will call a |
src/vm.rs
Outdated
/// Filters applied over a table.
Filter(Filter),
/// An entire table.
Table(Mrc<RwLock<Table>>),
@tyt2y3 I need some help here. I need to store a reference to a table in a register - it will either be:
- An existing table. In this case, it will be held in the tables field of the schema. So an Rc or a reference is definitely needed here.
- A new empty table. This table can be "created" => added to the schema, so Rc is needed again. An empty table by itself is useless, so it also needs to be mutable.
In both cases, I need it to be mutable because instructions like Insert, Update, AddColumn, RemoveColumn can mutate the table.
Using an Rc<Mutex> or Rc<RwLock> was the only option that I could see that satisfies both. But I don't think that's a good idea because we aren't using multi-threading and wrong usage of RwLock could lead to deadlocks.
I thought of adding a new variant MutableTable when mutability is needed. But it also has the same issue. What will the type of that variant be? How will ownership be "transferred" to the schema when the table is "created"?
Definitely not mutex and/or locks. If you think it is essential then you have a flaw in the design.
The VM should be the sole owner of the table, and any mutation on it is required to go through an IC. Definitely no outsider should be able to modify it freely. Ownership can be transferred in, but not out (unless it's a drop, then we could probably transfer it to another VM)
I see. I think I have an idea for this, to make VM the owner instead of the schema or the register.
I have implemented this in 558237a and 2505f11.
The idea is to use indices instead of references. These indices are only usable along with a reference to the VM. Since the execute fn already is on a mutable reference to the VM, these indices can be used for mutation too.
There is a problem with freeing unused indices, but that's not a big issue IMO because:
- This is intended for short-lived tests, not for production use-cases where it will be running for days.
- There is a relatively small number of places where new tables are created. We can ensure, with a lot of tests, that indices are correctly freed on a DROP TABLE or after the temporary table is no longer needed (to be done with a DropTable instruction at code generation time).
This approach is described better by Niko Matsakis here: http://smallcultfollowing.com/babysteps/blog/2015/04/06/modeling-graphs-in-rust-using-vector-indices/
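For readers following along, the indices-instead-of-references idea can be sketched roughly like this. The type and method names (`Vm`, `Table`, `TableIndex`, `new_table`, `insert`, `drop_table`, `row_count`) and the `Vec<Option<Table>>` storage are assumptions for illustration and may not match commits 558237a and 2505f11.

```rust
/// Hypothetical, minimal table: just rows of integers.
#[derive(Debug, Default)]
pub struct Table {
    pub rows: Vec<Vec<i64>>,
}

/// An index into the VM's table storage. It is only usable together with a
/// reference to the `Vm`, so registers can hold it without borrowing anything.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TableIndex(usize);

/// The VM is the sole owner of every table.
#[derive(Debug, Default)]
pub struct Vm {
    tables: Vec<Option<Table>>,
}

impl Vm {
    /// Creates an empty table, reusing a freed slot when one exists.
    pub fn new_table(&mut self) -> TableIndex {
        if let Some(i) = self.tables.iter().position(|slot| slot.is_none()) {
            self.tables[i] = Some(Table::default());
            return TableIndex(i);
        }
        self.tables.push(Some(Table::default()));
        TableIndex(self.tables.len() - 1)
    }

    /// Mutation goes through `&mut self`; no `Rc`, `Mutex`, or `RwLock` needed.
    /// Returns `false` if the index points at a dropped table.
    pub fn insert(&mut self, idx: TableIndex, row: Vec<i64>) -> bool {
        match self.tables.get_mut(idx.0).and_then(|slot| slot.as_mut()) {
            Some(table) => {
                table.rows.push(row);
                true
            }
            None => false,
        }
    }

    /// Frees the slot and transfers ownership of the table out of the VM.
    pub fn drop_table(&mut self, idx: TableIndex) -> Option<Table> {
        self.tables.get_mut(idx.0).and_then(|slot| slot.take())
    }

    pub fn row_count(&self, idx: TableIndex) -> Option<usize> {
        self.tables
            .get(idx.0)
            .and_then(|slot| slot.as_ref())
            .map(|table| table.rows.len())
    }
}
```

One caveat of this sketch: a stale `TableIndex` kept after `drop_table` can end up pointing at a different table once the slot is reused, which is why the freeing logic needs the tests mentioned above.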
Agreed x 10
Let's move forward
PR Info
Changes
- Register data structure to support all the designed instructions. Added all supporting structures too. Some of the types are placeholders, which can only be known once sqlparser is integrated.
- View variant will do most of the heavy lifting in many queries. A placeholder method is implemented for now.