Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: initial external table schema #2333

Merged
merged 25 commits into from
Mar 6, 2024
Merged

feat: initial external table schema #2333

merged 25 commits into from
Mar 6, 2024

Conversation

tychoish
Copy link
Collaborator

This is just a start, mostly for an initial discussion, or model for
talking about more. Some questions/remaining work that come to mind
instantly:

  • interface for table functions I think someone mentioned
    using struct syntax, which I think makes more sense than using table
    creation syntax, but we'd need to write that. I do like that
    COLUMNS in create external table mirror schema definition in
    "normal" SQL.

  • implementing for other data sources. I don't think we need to
    have this done on every data source in order to ship it, but it
    would be good to have a plan for how we support it everywhere. This
    might be some work for the SQL data sources, but the object-store
    data sources (other than BSON) might be harder because more of the
    implementation is in the upstream providers.

    • error when someone specifies columns for a data source that
      doesn't have it implemented (yet.)

    • attempt it in a SQL data source to prove that its possible.

    • decide how to handle JSON/CSV/Parquet

  • testing speaks for itself, but I think is particularly
    important to validate some of the corners

    • ensure that you can't compute values from fields in the source
      table that's excluded from the schema.

    • ensure that this doesn't hurt query and projection pushdown
      otherwise.

    • should also apply for writes in the INSERT path.

@tychoish
Copy link
Collaborator Author

Known issues:

  • fields duplicated for internal tables (???)
  • (existing): bson table queries will error if source documents have fields that aren't in the schema
  • mongodb tables project too much.

@universalmind303
Copy link
Contributor

This is just a start, mostly for an initial discussion, or model for
talking about more.

can we mark this as a draft/wip/rfc?

@tychoish tychoish marked this pull request as draft January 2, 2024 17:33
@tychoish tychoish changed the base branch from main to tycho/chore-consistent-internal-mongodb-naming January 4, 2024 21:27
Base automatically changed from tycho/chore-consistent-internal-mongodb-naming to main January 4, 2024 21:52
@tychoish
Copy link
Collaborator Author

tychoish commented Jan 5, 2024

There's a bit of refactoring, but this looks like it works in the simple case:

  • I haven't added support for this in a datasource that supports inserts, so while I think that's a useful test case to write, I don't think this should wait on that.
  • I [think] I plumbed things through in all the places that one would need to to work for hybrid execution, but I'm not positive (or sure how to test that, exactly)
  • Not sure what the best way to handle the cases where someone specifies a schema for a data source that doesn't support it.

@scsmithr
Copy link
Member

scsmithr commented Jan 5, 2024

interface for table functions I think someone mentioned
using struct syntax, which I think makes more sense than using table
creation syntax, but we'd need to write that. I do like that
COLUMNS in create external table mirror schema definition in
"normal" SQL.

What would struct syntax look like here?

Also why a new a COLUMNS keyword over doing something like this:

CREATE EXTERNAL TABLE my_table (
  x int,
  y int
) OPTIONS (...);

implementing for other data sources. I don't think we need to
have this done on every data source in order to ship it, but it
would be good to have a plan for how we support it everywhere. This
might be some work for the SQL data sources, but the object-store
data sources (other than BSON) might be harder because more of the
implementation is in the upstream providers.

Imo this would be most useful with with "schemaless" files (CSV, JSON) since inference could be unreliable. I think the next highest impact would be with the sql-based data sources as that could provide workarounds for completely unsupported types (e.g. just mark something as a string and we call the to_string method on what we get back from the source).

I think erroring on unimplemented ones is completely reasonable. The alternative being we accept the columns and just ignore them which would be surprising.

@tychoish
Copy link
Collaborator Author

tychoish commented Jan 5, 2024

https://github.com/GlareDB/glaredb/pull/2333/files#diff-c1c7559f834c783121db9ec5b2e8b19b67e0e11365ba44bf0b78e0feeb524e0c

What would struct syntax look like here?

tablefuncs provide their arguments as strings which we could then parse, presumably, however we want.

I think also because these are effectively projections, does it make sense to have a "SELECT with types` form that would be more native feeling in these cases?

Also why a new a COLUMNS keyword over doing something like this:

The OPTIONS word is already optional, so if we have two parenthetical groupings, which do we parse?

Imo this would be most useful with with "schemaless" files (CSV, JSON) since inference could be unreliable. I think the next highest impact would be with the sql-based data sources as that could provide workarounds for completely unsupported types (e.g. just mark something as a string and we call the to_string method on what we get back from the source).

I agree that this is most important for schemaless files, but I think it has applications everywhere (also in theory as a security feature, using glaredb to filter out sensitive columns.

I'm not sure about automatically casting, or how to handle non-trivial type mismatches.

@tychoish
Copy link
Collaborator Author

While as a feature "explicit schema" for an external table (we could also think about these as projection-only-non-materialized-views), is something that we could think about doing in different ways, and with more coverage, landing this PR is pretty low risk: it's limited to some less-often used data sources, and sets us up well for adding support to the other data sources.

@tychoish tychoish marked this pull request as ready for review February 28, 2024 20:23
@tychoish
Copy link
Collaborator Author

(the test failure here is the huggingface outage)

@tychoish
Copy link
Collaborator Author

@universalmind303 @vrongmeal @scsmithr could you all take another look at this.

Comment on lines +10 to +19
statement ok
create external table dst from bson
options (
location '${TMP}/external_schema.bson',
file_type 'bson'
)
columns (
x int,
z int
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very interesting indenting.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol 😂

I... see what you mean... I think at the time I didn't think of it as super different than like this:

https://github.com/glaredb/glaredb/blob/tycho/external-schema/testdata/sqllogictests_cassandra/basic.slt#L4-L12

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test for the case of trying to specify a column that doesn't exist when creating the exernal table? (Specifically making sure the error message is decent)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is an error.

If you create a column for a field that doesn't exist, how do we know until the end of processing the query? What if it only exists in some of the documents? For inputs that don't have fixed schema I think "adding a column and putting a null in that field for every value, is sort of the right answer.

For inputs that do have a fixed schema, or where it's known during table creation (which until now it has never been known,) we would either:

  • error very early when you query if the types don't match (e.g. you specify a number for a field and the upstream source has a string).
  • produce a column or field with many nulls.

I think our query parsing layer (should) prevent us from querying a table that has a defined schema and accessing/interacting with data that's in the underlying source that isn't in the result set. That'd be a leak if we allowed fields to be totally empty.

Anyway, I/we haven't implemented this for fixed schema sources and I think we should figure out what we want to do for this at some point soon?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I will want to think a bit more on the error/non-error here. I'm looking at this feature and thinking it may extend into the more typical CREATE EXTERNAL TABLE (a int, b int) syntax, but I don't know what the precise behavior should be.

This kinda brings in the question of optional schemas (column names) in data sources like csv files, where the behavior of specifying the column names in the create external table may differ depending on if the file has a header line or not.

Either way, can we add a test case for specifying a column that doesn't exist to show the current behavior?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will want to think a bit more on the error/non-error here. I'm looking at this feature and thinking it may extend into the more typical CREATE EXTERNAL TABLE (a int, b int) syntax, but I don't know what the precise behavior should be.

Definitely. I think (in any case) the support for this might want to be per-datasource no matter what we decide here.

Prior to merging this we made OPTIONS required before option blocks so that the above syntax would become available to us.

One thing we don't presently do, but could, is error for data sources don't support explicit columns if someone attempts to specify them.

This kinda brings in the question of optional schemas (column names) in data sources like csv files, where the behavior of specifying the column names in the create external table may differ depending on if the file has a header line or not.

Yeah the no-header CSV case (which isn't the default) might require an additional option when specifying column names (or could require a header.) I think a reasonable solution might be: requiring a header and having queries error if you try and run a query against a file and it is missing one of the columns that are explicitly specified.

Globbing for CSV probably makes them a bit more like JSON and BSON than a table, so maybe the strict mode should be opt-out. Actually any kind of file where we can glob means that we'd end up with a stream of input data that could have different schema over time, regardless of the kind of file (JSON and BSON just can do this within a file.)

I think, at least for the moment:

  • JSON and BSON (being morally identical) make the "correct" semantics pretty obvious.
  • we should be able to find a solution that's workable for other "schema inference" types (CSV.)
  • we could not provide this for fixed schema data sources, or if we do, error if the input source is missing the field we need (this would probably happen anyway, as our parser would/could expand select * to a list of the schema (if it doesn't already) and we'd query for a field that doesn't exist, so we might not have to write much

Will add another test.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(test added)

@tychoish tychoish requested a review from scsmithr March 6, 2024 19:12
@tychoish tychoish merged commit 03b8ffb into main Mar 6, 2024
25 checks passed
@tychoish tychoish deleted the tycho/external-schema branch March 6, 2024 20:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants