#2143 makes some improvements in this direction, but may not be efficient for frequent queries. There's probably no one-size-fits-all solution, but making it user-specifiable would be good. Options include:
provide a bson document to the catalog to use as a prototype (hard to get the byte sequence into the system, but nice)
provide the _id of a document to use as a prototype
provide a query or aggregation pipeline to select the fields and types of the document, using the current inference logic but giving users control over what the schema becomes
find one document, ordered by the _id index, selecting either the highest or lowest value (the first- or last-inserted document, in most cases)
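As a rough illustration of the prototype-document options above, a single document could be walked to produce a field-to-type mapping. This is a sketch only; the helper name and the type labels are hypothetical, and a real implementation would map into the engine's own type system:

```python
# Sketch: derive a field -> type mapping from a single prototype document.
# The function name and type labels here are hypothetical placeholders.
def infer_schema_from_prototype(doc, prefix=""):
    schema = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # recurse into embedded documents, dotting the path
            schema.update(infer_schema_from_prototype(value, prefix=f"{path}."))
        elif isinstance(value, bool):
            # check bool before int: bool is a subclass of int in Python
            schema[path] = "bool"
        elif isinstance(value, int):
            schema[path] = "int64"
        elif isinstance(value, float):
            schema[path] = "float64"
        elif isinstance(value, str):
            schema[path] = "utf8"
        else:
            schema[path] = "unknown"
    return schema

proto = {"_id": "abc", "count": 3, "meta": {"score": 1.5, "ok": True}}
print(infer_schema_from_prototype(proto))
```

The upside of any prototype-based option is that inference cost becomes O(one document) regardless of collection size.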
Providing some way of caching the schema for a table would improve performance a bunch, particularly for small queries.
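A minimal sketch of what that cache could look like, assuming a per-table TTL so stale schemas eventually get re-inferred (the class name, TTL default, and eviction policy are all assumptions, not anything the engine currently has):

```python
# Sketch of a per-table schema cache with a TTL, so repeated small queries
# can skip inference. All names and the eviction policy are assumptions.
import time

class SchemaCache:
    def __init__(self, ttl_seconds=300.0):
        self.ttl = ttl_seconds
        self._entries = {}  # table name -> (schema, inserted_at)

    def get(self, table):
        entry = self._entries.get(table)
        if entry is None:
            return None
        schema, inserted_at = entry
        if time.monotonic() - inserted_at > self.ttl:
            del self._entries[table]  # expired; caller re-infers
            return None
        return schema

    def put(self, table, schema):
        self._entries[table] = (schema, time.monotonic())

cache = SchemaCache(ttl_seconds=60)
cache.put("events", {"_id": "utf8", "ts": "timestamp"})
print(cache.get("events"))
```

Invalidation is the hard part in practice: a TTL is the simplest policy, but an explicit "refresh schema" command would also be useful.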
I don't think we'd want to implement all of them at once, but there's a lot of room for improvement here.
As a related feature, it'd be cool to be able to write a query or aggregation pipeline as a "pre-filter" (in the MongoDB query language, though for other data sources SQL makes sense; for MongoDB you could just add another stage to the aggregation pipeline), and I believe that pipeline could also be used to normalize the schema of the documents.
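To make the normalization idea concrete: in MongoDB this could be an appended `$project`/`$convert` stage, but the same effect can be sketched in plain Python, coercing heterogeneous documents to a fixed target schema before they reach the scan. The target schema and casting rules here are purely illustrative:

```python
# Sketch of the "pre-filter" idea: normalize heterogeneous documents to a
# fixed schema before scanning. In MongoDB proper this would be an extra
# $project stage with $convert; this pure-Python version just illustrates it.
TARGET_SCHEMA = {"name": str, "count": int}  # hypothetical target schema

def normalize(doc):
    out = {}
    for field, typ in TARGET_SCHEMA.items():
        value = doc.get(field)
        try:
            # cast to the target type; missing fields become null
            out[field] = typ(value) if value is not None else None
        except (TypeError, ValueError):
            out[field] = None  # uncastable values become null
    return out

docs = [{"name": "a", "count": "3"}, {"name": 7, "count": 2.0, "extra": True}]
print([normalize(d) for d in docs])
```

A nice side effect is that fields not in the target schema (like `extra` above) are dropped, so inference never sees them.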
Some other quick wins that could improve inference:
allow the user to manually specify the schema. (no inference needed)
expose sample rate/size as configuration parameters.
expose an alternative algorithm that just selects the first n rows. A lot of schema inference implementations do it this way instead of sampling. (this would likely be faster, but less accurate)
something like

```sql
-- customized sampling parameters
select * from mongo_scan('connection_str', infer_sample_rate => 0.1, infer_sample_size => 1000);

-- no inference
select * from mongo_scan('connection_str', schema => (a int, b utf8));

-- alternative inference algorithms (defaults to 'sample')
select * from mongo_scan('connection_str', infer_algorithm => 'sample' | 'n_rows' | '<other infer algorithm>');
```
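The difference between the `'sample'` and `'n_rows'` strategies can be sketched in a few lines; this is illustration only, with made-up parameter names mirroring the hypothetical options above:

```python
# Sketch contrasting the two document-selection strategies for inference.
# Names mirror the hypothetical infer_sample_rate / infer_algorithm options.
import random

def select_for_inference(docs, algorithm="sample", sample_rate=0.1, n_rows=100, seed=0):
    if algorithm == "n_rows":
        # fast: a single sequential read of the head of the collection,
        # but biased toward whatever was inserted first
        return docs[:n_rows]
    if algorithm == "sample":
        # more representative, but touches documents across the collection
        k = max(1, int(len(docs) * sample_rate))
        return random.Random(seed).sample(docs, k)
    raise ValueError(f"unknown inference algorithm: {algorithm}")

docs = [{"i": i} for i in range(1000)]
print(len(select_for_inference(docs, algorithm="n_rows", n_rows=5)))
print(len(select_for_inference(docs, algorithm="sample", sample_rate=0.05)))
```

On the server side, MongoDB's `$sample` aggregation stage and a plain `limit()` map fairly directly onto these two strategies.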
#2333 makes schema specification implicit, and I think this becomes closeable.
The bson and json handling already let inference use a configurable number of documents; we could add this to mongodb too, but don't have to.
I like @universalmind303's idea of alternate algorithms for determining the schema, and I think we could do some cool things there, but I'm probably more inclined to just infer from less data altogether.