Improve MongoDB Schema Inference #2144

Closed
tychoish opened this issue Nov 22, 2023 · 3 comments
Labels
feat 🎇 New feature or request

Comments

@tychoish
Collaborator

#2143 makes some improvements in this direction, but may not be efficient for frequent queries. There's probably no one-size-fits-all solution, so making the behavior user-specifiable would be good. Options include:

  • provide a BSON document to the catalog to use as a prototype (nice, though it's hard to get the byte sequence into the system)
  • provide the _id of a document to use as a prototype
  • provide a query or aggregation pipeline to select a document, then derive the fields and types with the current logic, giving users control over what the schema becomes
  • find one document ordered by the _id index, selecting either the highest or lowest value (in most cases, the first or last document inserted)
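For illustration, the merge step behind this kind of inference can be sketched in Python. The function names and the widen-to-"mixed" rule here are assumptions for the sketch, not GlareDB's actual logic:

```python
# Hypothetical sketch of per-document type inference and merging.
def infer_fields(doc):
    """Map each top-level field of one document to a type name."""
    return {k: type(v).__name__ for k, v in doc.items()}

def merge_schemas(schemas):
    """Union all fields; conflicting types widen to 'mixed'."""
    merged = {}
    for schema in schemas:
        for field, ty in schema.items():
            if field in merged and merged[field] != ty:
                merged[field] = "mixed"
            else:
                merged[field] = ty
    return merged

docs = [
    {"_id": 1, "name": "a", "qty": 2},
    {"_id": 2, "name": "b", "qty": 2.5},
]
schema = merge_schemas(infer_fields(d) for d in docs)
# qty widens to 'mixed': it is int in one document and float in the other
```

Whichever option above is chosen only changes which documents feed this merge step.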

Providing some way of caching the schema for a table would improve performance a good deal, especially for small queries.
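A per-table schema cache could be as simple as the following sketch; the class name, TTL approach, and field layout are all made up for illustration:

```python
import time

class SchemaCache:
    """Hypothetical per-table schema cache with a time-to-live."""

    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._entries = {}  # table name -> (schema, expiry timestamp)

    def get(self, table):
        """Return the cached schema, or None if absent or expired."""
        entry = self._entries.get(table)
        if entry is None:
            return None
        schema, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._entries[table]
            return None
        return schema

    def put(self, table, schema):
        self._entries[table] = (schema, time.monotonic() + self.ttl)
```

On a cache hit, a small query would skip sampling the collection entirely.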

I don't think we'd want to implement all of them at once, but there's a lot of room for improvement here.

As a related feature, it'd be cool to be able to write a query or aggregation pipeline (in the MongoDB query language; for other data sources, SQL makes sense) that acts as a "pre-filter". For MongoDB you could just prepend another stage to the aggregation pipeline, and I believe such a pipeline could also be used to normalize the schema of the documents.
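Concretely, a user-supplied pre-filter stage could be prepended to whatever pipeline the scan would otherwise run, with a $project stage forcing a uniform shape. The field names and the $ifNull default below are made up for illustration:

```python
# Hypothetical user-supplied pre-filter stage.
pre_filter = {"$match": {"status": "active"}}

# Hypothetical normalization stage: every document exposes the same
# fields, with a missing qty defaulting to 0 via $ifNull.
normalize = {"$project": {
    "_id": 1,
    "name": 1,
    "qty": {"$ifNull": ["$qty", 0]},
}}

# The stages the scan would run, in order.
pipeline = [pre_filter, normalize]
```

Because normalization happens server-side, every document reaching the engine already has the same fields, which also makes schema inference trivial.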

@tychoish tychoish added the feat 🎇 New feature or request label Nov 22, 2023
@universalmind303
Contributor

Some other quick wins that could improve inference:

  • allow the user to manually specify the schema. (no inference needed)
  • expose sample rate/size as configuration parameters.
  • expose an alternative algorithm that just selects the first n rows. A lot of schema inference implementations do it this way instead of sampling (likely faster, but less accurate).

Something like:

select * from mongo_scan('connection_str', infer_sample_rate => 0.1, infer_sample_size => 1000) -- customized sampling parameters
select * from mongo_scan('connection_str', schema => (a int, b utf8)) -- no inference
select * from mongo_scan('connection_str', infer_algorithm => 'sample' | 'n_rows' | '<other infer algorithm>') -- alternative inference algorithms (defaults to 'sample')
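A rough sketch of what the infer_algorithm option would control, with hypothetical names: 'sample' draws a random subset of documents, while 'n_rows' just takes the head of the collection.

```python
import random

def pick_inference_docs(docs, algorithm="sample", sample_size=1000, seed=0):
    """Choose which documents feed schema inference (illustrative only).

    'sample' - random subset: slower to fetch, more representative.
    'n_rows' - first n documents: fast, but may miss fields that only
               appear in later documents.
    """
    if algorithm == "n_rows":
        return docs[:sample_size]
    if algorithm == "sample":
        rng = random.Random(seed)
        k = min(sample_size, len(docs))
        return rng.sample(docs, k)
    raise ValueError(f"unknown inference algorithm: {algorithm}")

docs = [{"v": i} for i in range(10)]
first3 = pick_inference_docs(docs, algorithm="n_rows", sample_size=3)
```

In a real implementation the 'sample' path would presumably push a $sample stage down to the server rather than fetching documents client-side.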

@tychoish
Collaborator Author

@universalmind303, at least as I understand it, your first and third options are consistent with my first and third options. 😂

@tychoish
Collaborator Author

tychoish commented Mar 6, 2024

#2333 makes schema specification implicit, and I think this becomes closeable.

The BSON and JSON handling already let schema inference read a configurable number of documents; we could add this to MongoDB too, but don't have to.

I like @universalmind303's idea of alternate algorithms for determining the schema, and I think we could do some cool things there, but I'm probably more inclined to just infer from less data altogether.

@tychoish tychoish closed this as completed Mar 6, 2024