Improve MongoDB Schema Inference #2144

Closed
tychoish opened this issue Nov 22, 2023 · 3 comments
Labels
feat 🎇 New feature or request

Comments

@tychoish
Collaborator

#2143 makes some improvements in this direction, but may not be efficient for frequent queries. There's probably no one-size-fits-all solution, so making the behavior user-specifiable would be good. Options include:

  • provide a BSON document to the catalog to use as a prototype (nice, though it's hard to get the byte sequence into the system)
  • provide the _id of a document to use as a prototype
  • provide a query or aggregation pipeline to select a document, then derive the fields and types with the current logic, giving users control over what the schema becomes
  • find one document ordered by the _id index, selecting either the highest or lowest value (in most cases, the first or last document inserted)
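For illustration, the merge step behind this kind of inference can be sketched in Python. The function names and the widen-to-"mixed" rule here are assumptions for the sketch, not GlareDB's actual logic:

```python
# Hypothetical sketch of per-document type inference and merging.
def infer_fields(doc):
    """Map each top-level field of one document to a type name."""
    return {k: type(v).__name__ for k, v in doc.items()}

def merge_schemas(schemas):
    """Union all fields; conflicting types widen to 'mixed'."""
    merged = {}
    for schema in schemas:
        for field, ty in schema.items():
            if field in merged and merged[field] != ty:
                merged[field] = "mixed"
            else:
                merged[field] = ty
    return merged

docs = [
    {"_id": 1, "name": "a", "qty": 2},
    {"_id": 2, "name": "b", "qty": 2.5},
]
schema = merge_schemas(infer_fields(d) for d in docs)
# qty widens to 'mixed': it is int in one document and float in the other
```

Whichever option above is chosen only changes which documents feed this merge step.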

Providing some way of caching the schema for a table would improve performance a good deal, especially for small queries.
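A per-table schema cache could be as simple as the following sketch; the class name, TTL approach, and field layout are all made up for illustration:

```python
import time

class SchemaCache:
    """Hypothetical per-table schema cache with a time-to-live."""

    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._entries = {}  # table name -> (schema, expiry timestamp)

    def get(self, table):
        """Return the cached schema, or None if absent or expired."""
        entry = self._entries.get(table)
        if entry is None:
            return None
        schema, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._entries[table]
            return None
        return schema

    def put(self, table, schema):
        self._entries[table] = (schema, time.monotonic() + self.ttl)
```

On a cache hit, a small query would skip sampling the collection entirely.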

I don't think we'd want to implement all of them at once, but there's a lot of room for improvement here.

As a related feature, it'd be cool to be able to write a query or aggregation pipeline (in the MongoDB query language; for other data sources, SQL makes sense) that acts as a "pre-filter". For MongoDB you could just prepend another stage to the aggregation pipeline, and I believe such a pipeline could also be used to normalize the schema of the documents.
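Concretely, a user-supplied pre-filter stage could be prepended to whatever pipeline the scan would otherwise run, with a $project stage forcing a uniform shape. The field names and the $ifNull default below are made up for illustration:

```python
# Hypothetical user-supplied pre-filter stage.
pre_filter = {"$match": {"status": "active"}}

# Hypothetical normalization stage: every document exposes the same
# fields, with a missing qty defaulting to 0 via $ifNull.
normalize = {"$project": {
    "_id": 1,
    "name": 1,
    "qty": {"$ifNull": ["$qty", 0]},
}}

# The stages the scan would run, in order.
pipeline = [pre_filter, normalize]
```

Because normalization happens server-side, every document reaching the engine already has the same fields, which also makes schema inference trivial.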

@tychoish tychoish added the feat 🎇 New feature or request label Nov 22, 2023
@universalmind303
Contributor

Some other quick wins that could improve inference:

  • allow the user to manually specify the schema. (no inference needed)
  • expose sample rate/size as configuration parameters.
  • expose an alternative algorithm that just selects the first n rows. A lot of schema inference implementations do it this way instead of sampling (likely faster, but less accurate).

Something like:

select * from mongo_scan('connection_str', infer_sample_rate => 0.1, infer_sample_size => 1000) -- customized sampling parameters
select * from mongo_scan('connection_str', schema => (a int, b utf8)) -- no inference
select * from mongo_scan('connection_str', infer_algorithm => 'sample' | 'n_rows' | '<other infer algorithm>') -- alternative inference algorithms (defaults to 'sample')
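A rough sketch of what the infer_algorithm option would control, with hypothetical names: 'sample' draws a random subset of documents, while 'n_rows' just takes the head of the collection.

```python
import random

def pick_inference_docs(docs, algorithm="sample", sample_size=1000, seed=0):
    """Choose which documents feed schema inference (illustrative only).

    'sample' - random subset: slower to fetch, more representative.
    'n_rows' - first n documents: fast, but may miss fields that only
               appear in later documents.
    """
    if algorithm == "n_rows":
        return docs[:sample_size]
    if algorithm == "sample":
        rng = random.Random(seed)
        k = min(sample_size, len(docs))
        return rng.sample(docs, k)
    raise ValueError(f"unknown inference algorithm: {algorithm}")

docs = [{"v": i} for i in range(10)]
first3 = pick_inference_docs(docs, algorithm="n_rows", sample_size=3)
```

In a real implementation the 'sample' path would presumably push a $sample stage down to the server rather than fetching documents client-side.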

@tychoish
Collaborator Author

@universalmind303, at least as I understand it, your first and third options are consistent with my first and third options. 😂

@tychoish
Collaborator Author

tychoish commented Mar 6, 2024

#2333 makes schema specification implicit, and I think this becomes closeable.

The BSON and JSON handling already let schema inference read a configurable number of documents; we could add this to MongoDB too, but don't have to.

I like @universalmind303's idea of alternate algorithms for determining the schema, and I think we could do some cool things there, but I'm probably more inclined to just infer from less data altogether.

@tychoish tychoish closed this as completed Mar 6, 2024