feat: initial external table schema #2333
Conversation
Known issues:
can we mark this as a draft/wip/rfc?
…cho/external-schema
There's a bit of refactoring, but this looks like it works in the simple case:
What would struct syntax look like here? Also, why a new syntax rather than:

```sql
CREATE EXTERNAL TABLE my_table (
  x int,
  y int
) OPTIONS (...);
```
Imo this would be most useful with "schemaless" files (CSV, JSON), since inference could be unreliable. I think the next highest impact would be with the SQL-based data sources, as that could provide workarounds for completely unsupported types (e.g. just mark something as a string). I think erroring on unimplemented ones is completely reasonable. The alternative, accepting the columns and just ignoring them, would be surprising.
Table funcs provide their arguments as strings, which we could then parse, presumably, however we want. I think also, because these are effectively projections, does it make sense to have a …
I agree that this is most important for schemaless files, but I think it has applications everywhere (also, in theory, as a security feature: using glaredb to filter out sensitive columns). I'm not sure about automatically casting, or how to handle non-trivial type mismatches.
While "explicit schema" for an external table as a feature (we could also think of these as projection-only, non-materialized views) is something we could approach in different ways and with more coverage, landing this PR is pretty low risk: it's limited to some less-often-used data sources, and sets us up well for adding support to the other data sources.
(the test failure here is the huggingface outage)
@universalmind303 @vrongmeal @scsmithr could you all take another look at this?
```sql
statement ok
create external table dst from bson
options (
  location '${TMP}/external_schema.bson',
  file_type 'bson'
)
columns (
  x int,
  z int
);
```
Very interesting indenting.
lol 😂
I... see what you mean... I think at the time I didn't think of it as super different than like this:
Can we add a test for the case of trying to specify a column that doesn't exist when creating the external table? (Specifically making sure the error message is decent.)
I don't think this is an error.
If you create a column for a field that doesn't exist, how do we know until the end of processing the query? What if it only exists in some of the documents? For inputs that don't have a fixed schema, I think adding a column and putting a null in that field for every value is sort of the right answer.
For inputs that do have a fixed schema, or where it's known during table creation (which, until now, has never been the case), we would either:
- error very early when you query if the types don't match (e.g. you specify a number for a field and the upstream source has a string).
- produce a column or field with many nulls.
I think our query parsing layer (should) prevent us from querying a table that has a defined schema and accessing/interacting with data that's in the underlying source but isn't in the result set. That'd be a leak if we allowed it.
Anyway, I/we haven't implemented this for fixed schema sources and I think we should figure out what we want to do for this at some point soon?
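The null-filling semantics described above can be sketched roughly like this. This is a hypothetical illustration, not GlareDB's actual code; the `project` helper, the column names, and the early type-mismatch error are all assumptions:

```python
# Hypothetical sketch: applying an explicit column list to schemaless
# (JSON/BSON-style) documents. Declared fields that are absent from a
# given document become null; present fields are type-checked.
schema = {"x": int, "z": int}  # declared columns (names illustrative)

def project(doc, schema):
    out = {}
    for name, typ in schema.items():
        if name not in doc:
            out[name] = None  # field missing from this document -> null
        elif isinstance(doc[name], typ):
            out[name] = doc[name]
        else:
            # type mismatch: error early rather than silently coerce
            raise TypeError(f"column {name!r}: expected {typ.__name__}")
    return out

docs = [{"x": 1, "y": 2}, {"x": 3, "z": 4}]
rows = [project(d, schema) for d in docs]
print(rows)  # [{'x': 1, 'z': None}, {'x': 3, 'z': 4}]
```

Note that `y` in the first document is simply dropped: it's in the source but not in the declared schema, so it never reaches the result set.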
Ok, I will want to think a bit more on the error/non-error here. I'm looking at this feature and thinking it may extend into the more typical `CREATE EXTERNAL TABLE (a int, b int)` syntax, but I don't know what the precise behavior should be.
This kinda brings in the question of optional schemas (column names) in data sources like csv files, where the behavior of specifying the column names in the create external table may differ depending on if the file has a header line or not.
Either way, can we add a test case for specifying a column that doesn't exist to show the current behavior?
> I will want to think a bit more on the error/non-error here. I'm looking at this feature and thinking it may extend into the more typical `CREATE EXTERNAL TABLE (a int, b int)` syntax, but I don't know what the precise behavior should be.
Definitely. I think (in any case) the support for this might want to be per-datasource no matter what we decide here.
Prior to merging this we made `OPTIONS` required before option blocks so that the above syntax would become available to us.
One thing we don't presently do, but could, is error when someone attempts to specify explicit columns for a data source that doesn't support them.
> This kinda brings in the question of optional schemas (column names) in data sources like csv files, where the behavior of specifying the column names in the create external table may differ depending on if the file has a header line or not.
Yeah, the no-header CSV case (which isn't the default) might require an additional option when specifying column names (or we could just require a header). I think a reasonable solution might be: require a header, and have queries error if you run them against a file that is missing one of the explicitly specified columns.
Globbing for CSV probably makes it a bit more like JSON and BSON than a table, so maybe the strict mode should be opt-out. Actually, any kind of file where we can glob means we'd end up with a stream of input data whose schema could change over time, regardless of the kind of file (JSON and BSON can just do this within a single file).
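The header-required behavior discussed here could look something like the following. This is assumed semantics, not GlareDB's implementation, and `read_with_columns` is a made-up helper:

```python
import csv
import io

def read_with_columns(text, columns):
    # Require a header line; error if any explicitly specified column
    # is missing from it (the "strict" behavior suggested above).
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    missing = [c for c in columns if c not in header]
    if missing:
        raise ValueError(f"columns not present in file: {missing}")
    idx = [header.index(c) for c in columns]
    # Project each data row down to just the declared columns.
    return [[row[i] for i in idx] for row in data]

print(read_with_columns("x,y,z\n1,2,3\n4,5,6\n", ["x", "z"]))
# [['1', '3'], ['4', '6']]
```

Without a header, the names in the `columns` clause could only be mapped positionally, which is exactly why an extra option (or an outright error) seems necessary for that case.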
I think, at least for the moment:
- JSON and BSON (being morally identical) make the "correct" semantics pretty obvious.
- we should be able to find a solution that's workable for other "schema inference" types (CSV).
- we could not provide this for fixed-schema data sources, or, if we do, error if the input source is missing a field we need. (This would probably happen anyway: our parser would/could expand `select *` to the full list of schema columns (if it doesn't already), and we'd then query for a field that doesn't exist, so we might not have to write much.)
Will add another test.
(test added)
This is just a start, mostly for an initial discussion, or model for
talking about more. Some questions/remaining work that come to mind
instantly:
- interface for table functions. I think someone mentioned using struct syntax, which I think makes more sense than using table creation syntax, but we'd need to write that. I do like that `COLUMNS` in `create external table` mirrors schema definition in "normal" SQL.
- implementing for other data sources. I don't think we need to have this done on every data source in order to ship it, but it would be good to have a plan for how we support it everywhere. This might be some work for the SQL data sources, but the object-store data sources (other than BSON) might be harder because more of the implementation is in the upstream providers.
- error when someone specifies columns for a data source that doesn't have it implemented (yet).
- attempt it in a SQL data source to prove that it's possible.
- decide how to handle JSON/CSV/Parquet.
- testing speaks for itself, but I think is particularly important to validate some of the corners:
  - ensure that you can't compute values from fields in the source table that are excluded from the schema.
  - ensure that this doesn't otherwise hurt query and projection pushdown.
- the feature should also apply for writes in the `INSERT` path.
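The "can't compute values from excluded fields" corner case above amounts to a planning-time check on referenced column names. A minimal sketch, assuming we validate references against the declared schema before execution (function and names are made up for illustration):

```python
def check_column_references(referenced, declared):
    # Columns referenced by a query but excluded from the declared
    # external-table schema should fail at planning time, not be read
    # silently from the underlying source.
    leaked = set(referenced) - set(declared)
    if leaked:
        raise ValueError(f"column(s) not in table schema: {sorted(leaked)}")

declared = ["x", "z"]
check_column_references(["x"], declared)       # ok: subset of the schema
try:
    check_column_references(["x", "y"], declared)  # 'y' exists upstream but was excluded
except ValueError as err:
    print(err)  # column(s) not in table schema: ['y']
```

Done this way, the security use case mentioned earlier (filtering out sensitive columns) falls out for free: an excluded column is indistinguishable from a nonexistent one.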