feat: initial external table schema #2333

tychoish · 2023-12-31T01:32:20Z

This is just a start, mostly for an initial discussion, or model for
talking about more. Some questions/remaining work that come to mind
instantly:

interface for table functions I think someone mentioned
using struct syntax, which I think makes more sense than using table
creation syntax, but we'd need to write that. I do like that
COLUMNS in create external table mirror schema definition in
"normal" SQL.
implementing for other data sources. I don't think we need to
have this done on every data source in order to ship it, but it
would be good to have a plan for how we support it everywhere. This
might be some work for the SQL data sources, but the object-store
data sources (other than BSON) might be harder because more of the
implementation is in the upstream providers.
- error when someone specifies columns for a data source that
  doesn't have it implemented (yet.)
- attempt it in a SQL data source to prove that its possible.
- decide how to handle JSON/CSV/Parquet
testing speaks for itself, but I think is particularly
important to validate some of the corners
- ensure that you can't compute values from fields in the source
  table that's excluded from the schema.
- ensure that this doesn't hurt query and projection pushdown
  otherwise.
- should also apply for writes in the INSERT path.

tychoish · 2023-12-31T02:06:50Z

Known issues:

fields duplicated for internal tables (???)
(existing): bson table queries will error if source documents have fields that aren't in the schema
mongodb tables project too much.

universalmind303 · 2024-01-02T14:42:53Z

This is just a start, mostly for an initial discussion, or model for
talking about more.

can we mark this as a draft/wip/rfc?

…cho/external-schema

tychoish · 2024-01-05T04:07:50Z

There's a bit of refactoring, but this looks like it works in the simple case:

I haven't added support for this in a datasource that supports inserts, so while I think that's a useful test case to write, I don't think this should wait on that.
I [think] I plumbed things through in all the places that one would need to to work for hybrid execution, but I'm not positive (or sure how to test that, exactly)
Not sure what the best way to handle the cases where someone specifies a schema for a data source that doesn't support it.

scsmithr · 2024-01-05T17:04:36Z

interface for table functions I think someone mentioned
using struct syntax, which I think makes more sense than using table
creation syntax, but we'd need to write that. I do like that
COLUMNS in create external table mirror schema definition in
"normal" SQL.

What would struct syntax look like here?

Also why a new a COLUMNS keyword over doing something like this:

CREATE EXTERNAL TABLE my_table (
  x int,
  y int
) OPTIONS (...);

implementing for other data sources. I don't think we need to
have this done on every data source in order to ship it, but it
would be good to have a plan for how we support it everywhere. This
might be some work for the SQL data sources, but the object-store
data sources (other than BSON) might be harder because more of the
implementation is in the upstream providers.

Imo this would be most useful with with "schemaless" files (CSV, JSON) since inference could be unreliable. I think the next highest impact would be with the sql-based data sources as that could provide workarounds for completely unsupported types (e.g. just mark something as a string and we call the to_string method on what we get back from the source).

I think erroring on unimplemented ones is completely reasonable. The alternative being we accept the columns and just ignore them which would be surprising.

tychoish · 2024-01-05T18:11:36Z

https://github.com/GlareDB/glaredb/pull/2333/files#diff-c1c7559f834c783121db9ec5b2e8b19b67e0e11365ba44bf0b78e0feeb524e0c

What would struct syntax look like here?

tablefuncs provide their arguments as strings which we could then parse, presumably, however we want.

I think also because these are effectively projections, does it make sense to have a "SELECT with types` form that would be more native feeling in these cases?

Also why a new a COLUMNS keyword over doing something like this:

The OPTIONS word is already optional, so if we have two parenthetical groupings, which do we parse?

Imo this would be most useful with with "schemaless" files (CSV, JSON) since inference could be unreliable. I think the next highest impact would be with the sql-based data sources as that could provide workarounds for completely unsupported types (e.g. just mark something as a string and we call the to_string method on what we get back from the source).

I agree that this is most important for schemaless files, but I think it has applications everywhere (also in theory as a security feature, using glaredb to filter out sensitive columns.

I'm not sure about automatically casting, or how to handle non-trivial type mismatches.

tychoish · 2024-02-28T20:23:08Z

While as a feature "explicit schema" for an external table (we could also think about these as projection-only-non-materialized-views), is something that we could think about doing in different ways, and with more coverage, landing this PR is pretty low risk: it's limited to some less-often used data sources, and sets us up well for adding support to the other data sources.

tychoish · 2024-02-28T22:08:51Z

(the test failure here is the huggingface outage)

tychoish · 2024-02-29T17:13:28Z

@universalmind303 @vrongmeal @scsmithr could you all take another look at this.

scsmithr · 2024-03-06T16:09:08Z

testdata/sqllogictests/external_schema.slt

+statement ok
+create external table dst from bson
+options (
+	location '${TMP}/external_schema.bson',
+	file_type 'bson'
+	)
+columns (
+	x int,
+	z int
+);


Very interesting indenting.

lol 😂

I... see what you mean... I think at the time I didn't think of it as super different than like this:

https://github.com/glaredb/glaredb/blob/tycho/external-schema/testdata/sqllogictests_cassandra/basic.slt#L4-L12

scsmithr · 2024-03-06T16:11:18Z

testdata/sqllogictests/external_schema.slt

Can we add a test for the case of trying to specify a column that doesn't exist when creating the exernal table? (Specifically making sure the error message is decent)

I don't think this is an error.

If you create a column for a field that doesn't exist, how do we know until the end of processing the query? What if it only exists in some of the documents? For inputs that don't have fixed schema I think "adding a column and putting a null in that field for every value, is sort of the right answer.

For inputs that do have a fixed schema, or where it's known during table creation (which until now it has never been known,) we would either:

error very early when you query if the types don't match (e.g. you specify a number for a field and the upstream source has a string).

produce a column or field with many nulls.

I think our query parsing layer (should) prevent us from querying a table that has a defined schema and accessing/interacting with data that's in the underlying source that isn't in the result set. That'd be a leak if we allowed fields to be totally empty.

Anyway, I/we haven't implemented this for fixed schema sources and I think we should figure out what we want to do for this at some point soon?

Ok I will want to think a bit more on the error/non-error here. I'm looking at this feature and thinking it may extend into the more typical CREATE EXTERNAL TABLE (a int, b int) syntax, but I don't know what the precise behavior should be.

This kinda brings in the question of optional schemas (column names) in data sources like csv files, where the behavior of specifying the column names in the create external table may differ depending on if the file has a header line or not.

Either way, can we add a test case for specifying a column that doesn't exist to show the current behavior?

I will want to think a bit more on the error/non-error here. I'm looking at this feature and thinking it may extend into the more typical CREATE EXTERNAL TABLE (a int, b int) syntax, but I don't know what the precise behavior should be.

Definitely. I think (in any case) the support for this might want to be per-datasource no matter what we decide here.

Prior to merging this we made OPTIONS required before option blocks so that the above syntax would become available to us.

One thing we don't presently do, but could, is error for data sources don't support explicit columns if someone attempts to specify them.

This kinda brings in the question of optional schemas (column names) in data sources like csv files, where the behavior of specifying the column names in the create external table may differ depending on if the file has a header line or not.

Yeah the no-header CSV case (which isn't the default) might require an additional option when specifying column names (or could require a header.) I think a reasonable solution might be: requiring a header and having queries error if you try and run a query against a file and it is missing one of the columns that are explicitly specified.

Globbing for CSV probably makes them a bit more like JSON and BSON than a table, so maybe the strict mode should be opt-out. Actually any kind of file where we can glob means that we'd end up with a stream of input data that could have different schema over time, regardless of the kind of file (JSON and BSON just can do this within a file.)

I think, at least for the moment:

JSON and BSON (being morally identical) make the "correct" semantics pretty obvious.

we should be able to find a solution that's workable for other "schema inference" types (CSV.)

we could not provide this for fixed schema data sources, or if we do, error if the input source is missing the field we need (this would probably happen anyway, as our parser would/could expand select * to a list of the schema (if it doesn't already) and we'd query for a field that doesn't exist, so we might not have to write much

Will add another test.

(test added)

tychoish added 4 commits December 30, 2023 19:23

feat: initial external table schema

c28042f

complete wiring

9d82be7

clippy

ef7b747

fix test compile

95c0090

tychoish mentioned this pull request Dec 31, 2023

fix: bson builder should handle schema flexibly #2334

Merged

tychoish marked this pull request as draft January 2, 2024 17:33

tychoish added 5 commits January 3, 2024 11:41

Merge remote-tracking branch 'origin/main' into tycho/external-schema

9cf7002

Merge remote-tracking branch 'origin/main' into tycho/external-schema

34699e3

bugfixes

e56c889

fixup

cb192ca

chore: mongo -> mongodb in symbol naming

8d9a2f2

tychoish changed the base branch from main to tycho/chore-consistent-internal-mongodb-naming January 4, 2024 21:27

Merge branch 'tycho/chore-consistent-internal-mongodb-naming' into ty…

52873fa

…cho/external-schema

tychoish mentioned this pull request Jan 4, 2024

chore: mongo -> mongodb in symbol naming #2343

Merged

Base automatically changed from tycho/chore-consistent-internal-mongodb-naming to main January 4, 2024 21:52

tychoish added 3 commits January 4, 2024 16:55

Merge remote-tracking branch 'origin/main' into tycho/external-schema

df705f2

more

847329b

Merge remote-tracking branch 'origin/main' into tycho/external-schema

96a6567

clippy

c36ebf3

scsmithr requested review from scsmithr, universalmind303 and vrongmeal January 5, 2024 15:15

tychoish added 3 commits February 28, 2024 10:12

update

76a9594

Merge remote-tracking branch 'origin/main' into tycho/external-schema

86dfdc2

fix compiles and imports

4bb7b72

fix test

23bae55

tychoish marked this pull request as ready for review February 28, 2024 20:23

fixup

64c0a39

tychoish and others added 2 commits February 29, 2024 10:30

Merge branch 'main' into tycho/external-schema

e13a69e

Merge remote-tracking branch 'origin/main' into tycho/external-schema

57e9115

This was referenced Mar 4, 2024

support SQLite's dynamic types #2728

Open

Chore: refactor datasources #2744

Open

scsmithr reviewed Mar 6, 2024

View reviewed changes

Merge remote-tracking branch 'origin/main' into tycho/external-schema

ced5c4f

tychoish requested a review from scsmithr March 6, 2024 19:12

tychoish added 3 commits March 6, 2024 14:14

fix compile

0e0ebdf

more cleanup

8943c95

nullable consistency

3c196a3

scsmithr approved these changes Mar 6, 2024

View reviewed changes

tychoish merged commit 03b8ffb into main Mar 6, 2024
25 checks passed

tychoish deleted the tycho/external-schema branch March 6, 2024 20:37

tychoish mentioned this pull request Mar 6, 2024

Improve MongoDB Schema Inference #2144

Closed

universalmind303 mentioned this pull request Mar 11, 2024

refactor: table options #2767

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: initial external table schema #2333

feat: initial external table schema #2333

tychoish commented Dec 31, 2023

tychoish commented Dec 31, 2023

universalmind303 commented Jan 2, 2024

tychoish commented Jan 5, 2024

scsmithr commented Jan 5, 2024 •

edited

tychoish commented Jan 5, 2024

tychoish commented Feb 28, 2024

tychoish commented Feb 28, 2024

tychoish commented Feb 29, 2024

scsmithr Mar 6, 2024

tychoish Mar 6, 2024

scsmithr Mar 6, 2024

tychoish Mar 6, 2024

scsmithr Mar 6, 2024

tychoish Mar 6, 2024

tychoish Mar 6, 2024

feat: initial external table schema #2333

feat: initial external table schema #2333

Conversation

tychoish commented Dec 31, 2023

tychoish commented Dec 31, 2023

universalmind303 commented Jan 2, 2024

tychoish commented Jan 5, 2024

scsmithr commented Jan 5, 2024 • edited

tychoish commented Jan 5, 2024

tychoish commented Feb 28, 2024

tychoish commented Feb 28, 2024

tychoish commented Feb 29, 2024

scsmithr Mar 6, 2024

Choose a reason for hiding this comment

tychoish Mar 6, 2024

Choose a reason for hiding this comment

scsmithr Mar 6, 2024

Choose a reason for hiding this comment

tychoish Mar 6, 2024

Choose a reason for hiding this comment

scsmithr Mar 6, 2024

Choose a reason for hiding this comment

tychoish Mar 6, 2024

Choose a reason for hiding this comment

tychoish Mar 6, 2024

Choose a reason for hiding this comment

scsmithr commented Jan 5, 2024 •

edited