MongoDB integration refactoring #63279
@@ -203,7 +203,8 @@ TRAP(lgammal) | |||
TRAP(nftw) | |||
TRAP(nl_langinfo) | |||
TRAP(putc_unlocked) | |||
TRAP(rand) | |||
//TRAP(rand) // Used in mongo-c-driver |
The driver has to be patched.
The concerns have to be reported to MongoDB authors.
Do you mean mongo-c-driver? Could you please link the issue? Is it a small patch?
How did you handle it?
Great initiative! A couple of questions for now:
It's supposed to, but we have to test it.
It should not, but this needs to be tested.
Yes, it works now.
This will not be broken, but your example is not valid. docs
Thank you for the enormous effort you've put into this integration!
Firstly, I've concentrated on high-level aspects of the implementation, and didn't dig into details too much.
I have a few suggestions for a smoother transition to the new library.
We can maintain the current Poco-based implementation alongside the new one for some time, adding a global setting or configuration option that lets users toggle between the two. (Additionally, the existing Poco-based implementation could be used with the old analyzer, allow_experimental_analyzer = 0.)
Source files of the current implementation can be kept by appending a suffix like *PocoLegacy.
The new implementation tries to construct queries for MongoDB and throws errors for unsupported queries. Instead, we can fall back to reading all data and processing it internally, similar to the existing approach.
Let me know what you think about these suggestions.
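To make the fallback suggestion concrete, here is a minimal sketch of the idea. Everything in it is hypothetical and is not code from this PR: `Condition`, `tryBuildFilter`, `ReadPlan`, and `planRead` are illustrative names, and only `=` and `>` are treated as translatable. The point is only the control flow: when a condition cannot be expressed as a MongoDB filter, the plan switches to an empty filter and marks the WHERE clause for local evaluation instead of throwing.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for one parsed WHERE condition.
struct Condition
{
    std::string column;
    std::string op;     // "=", ">", or an operator this sketch cannot push down
    std::string value;  // literal already rendered as JSON
};

// Tries to translate one condition into a MongoDB filter fragment.
// Returns std::nullopt when the operator is not supported by this sketch.
std::optional<std::string> tryBuildFilter(const Condition & cond)
{
    assert(!cond.column.empty()); // sketch-only precondition
    if (cond.op == "=")
        return "{\"" + cond.column + "\": " + cond.value + "}";
    if (cond.op == ">")
        return "{\"" + cond.column + "\": {\"$gt\": " + cond.value + "}}";
    return std::nullopt;
}

struct ReadPlan
{
    std::string mongo_filter; // "{}" means "read the whole collection"
    bool filter_locally;      // true when WHERE must be applied after reading
};

// Pushes the whole conjunction down, or nothing at all.
ReadPlan planRead(const std::vector<Condition> & conditions)
{
    std::string joined;
    for (const auto & cond : conditions)
    {
        auto fragment = tryBuildFilter(cond);
        if (!fragment)
            return {"{}", true}; // fall back instead of throwing
        if (!joined.empty())
            joined += ", ";
        joined += *fragment;
    }
    if (joined.empty())
        return {"{}", false};
    return {"{\"$and\": [" + joined + "]}", false};
}
```

The all-or-nothing choice keeps the sketch short; a real implementation could push down the supported conditions and evaluate only the rest locally.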
add_library(ch_contrib::libmongoc ALIAS _libmongoc) | ||
target_include_directories(_libmongoc SYSTEM PUBLIC ${LIBMONGOC_SOURCE_DIR} ${COMMON_SOURCE_DIR} ${UTF8PROC_SOURCE_DIR}) | ||
target_compile_definitions(_libmongoc PRIVATE MONGOC_COMPILATION) | ||
target_link_libraries(_libmongoc ch_contrib::libbson ch_contrib::c-ares ch_contrib::zlib resolv) |
I don't see any linking with resolv elsewhere (although resolv.h is used in base/poco/Net/src/DNS.cpp). Why is it required explicitly here? Where will we take this library from?
This is a system library, which should be present on all Unix-like systems.
mongo-c-driver needs it here; without specifying it explicitly, the build will fail.
If key not found in MongoDB document, default value or null(if the column is nullable) will be inserted. | ||
|
||
## Supported clauses | ||
*Hint: you can use MongoDB table in CTE to perform any clauses, but be aware, that in some cases, performance will be significantly degraded.* |
Do you mean something like
WITH m AS (SELECT * FROM mongo_table)
SELECT ... FROM m GROUP BY ...
So we read all data from mongo_table first and then process it on the ClickHouse side? What will happen when the table is used with an unsupported clause? Will we get an error?
I think we can clarify this in the documentation here and add an example.
Yes, I mean this.
Okay, I will provide some examples.
|
||
bool exists_in_current_document = document->exists(name); | ||
if (!exists_in_current_document) | ||
if (sample_column.type->isNullable()) |
Won't we just insert NULL for a Nullable column in insertDefaultValue? Why do we need to explicitly set the value in the null map again?
Yes, that will be shorter, thanks.
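The point of this thread can be illustrated with a toy model. The sketch below is not ClickHouse code: `NullableInt64Column` and `insertField` are hypothetical stand-ins, and the sketch assumes, as the comment above suggests, that inserting a default into a nullable column already marks the row as NULL. Under that assumption the missing-key branch needs no separate null-map update.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for a Nullable(Int64) column: nested values plus a null map.
struct NullableInt64Column
{
    std::vector<int64_t> values;
    std::vector<uint8_t> null_map; // 1 marks a NULL row

    void insert(int64_t v)
    {
        values.push_back(v);
        null_map.push_back(0);
    }

    // Assumption of this sketch: the default of a nullable column is NULL,
    // so the nested default and the null flag are written together.
    void insertDefault()
    {
        values.push_back(0);
        null_map.push_back(1);
    }

    bool isNullAt(size_t row) const
    {
        assert(row < null_map.size());
        return null_map[row] != 0;
    }
};

// A missing key is modelled as std::nullopt; one call covers the NULL case.
void insertField(NullableInt64Column & column, const std::optional<int64_t> & field)
{
    if (field)
        column.insert(*field);
    else
        column.insertDefault();
}
```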
Array BSONArrayAsArray(size_t dimensions, const bsoncxx::types::b_array & array, const DataTypePtr & type, const Field & default_value, const std::string & name) | ||
{ | ||
auto arr = Array(); | ||
if (dimensions > 0) |
Should we add a sanity check for the number of dimensions? ClickHouse probably won't allow defining an array that is nested too deeply, but maybe just in case.
It's not the responsibility of the source, is it?
In the source we just check the data from Mongo and make sure it can be inserted into the given schema.
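For reference, both positions in this thread can be sketched with a toy value type. Nothing here comes from the PR: `Value`, `scalarValue`, `arrayValue`, `validateNesting`, and the `MAX_DIMENSIONS` limit of 32 are all hypothetical. The recursive check covers the "data must fit the schema" part; the first `if` is the optional guard on the declared depth that the reviewer asked about.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for a BSON value: a scalar or an array of values.
struct Value
{
    bool is_array = false;
    long long scalar = 0;
    std::vector<Value> items;
};

Value scalarValue(long long x)
{
    Value v;
    v.scalar = x;
    return v;
}

Value arrayValue(std::vector<Value> items)
{
    Value v;
    v.is_array = true;
    v.items = std::move(items);
    return v;
}

// Assumed limit, chosen only for illustration.
constexpr size_t MAX_DIMENSIONS = 32;

// Throws std::invalid_argument when the value's nesting does not match the
// declared number of array dimensions, or when that number exceeds the limit.
void validateNesting(const Value & value, size_t dimensions)
{
    assert(value.is_array || value.items.empty()); // scalars carry no items
    if (dimensions > MAX_DIMENSIONS)
        throw std::invalid_argument("too many array dimensions");
    if (!value.is_array)
    {
        if (dimensions != 0)
            throw std::invalid_argument("expected an array, got a scalar");
        return;
    }
    if (dimensions == 0)
        throw std::invalid_argument("expected a scalar, got an array");
    for (const auto & item : value.items)
        validateNesting(item, dimensions - 1);
}
```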
return json; | ||
} | ||
case bsoncxx::type::k_binary: | ||
return base64Encode(std::string(reinterpret_cast<const char*>(value.get_binary().bytes), value.get_binary().size)); |
Why do we encode it with base64? We can store raw data in ClickHouse Strings.
UPD: But that won't work when the binary is inside a document or array that we serialize to JSON. How does Mongo return this data with the native client when we request a document with a binary field?
Okay, good idea.
As a BSON binary field, the same as when it is outside a document.
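The trade-off discussed here can be shown with a small standalone sketch. It does not use the `base64Encode` helper from the diff; `rawBytesToString` and `base64EncodeSketch` are hypothetical names, and the encoder is a plain standard-alphabet implementation with `=` padding. A `std::string` keeps arbitrary bytes (including NUL) when built with an explicit length, which is why a top-level binary field could be stored raw, while a value embedded in JSON text still needs a text-safe encoding such as base64.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Raw copy: the explicit length preserves embedded NUL and non-UTF-8 bytes.
std::string rawBytesToString(const uint8_t * bytes, size_t size)
{
    assert(bytes != nullptr || size == 0);
    return std::string(reinterpret_cast<const char *>(bytes), size);
}

// Minimal base64 encoder (standard alphabet, '=' padding), for illustration only.
std::string base64EncodeSketch(const uint8_t * bytes, size_t size)
{
    static const char alphabet[]
        = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve((size + 2) / 3 * 4);
    for (size_t i = 0; i < size; i += 3)
    {
        // Pack up to three input bytes into a 24-bit chunk.
        uint32_t chunk = static_cast<uint32_t>(bytes[i]) << 16;
        if (i + 1 < size)
            chunk |= static_cast<uint32_t>(bytes[i + 1]) << 8;
        if (i + 2 < size)
            chunk |= static_cast<uint32_t>(bytes[i + 2]);
        // Emit four 6-bit symbols, padding where input bytes were missing.
        out.push_back(alphabet[(chunk >> 18) & 0x3F]);
        out.push_back(alphabet[(chunk >> 12) & 0x3F]);
        out.push_back(i + 1 < size ? alphabet[(chunk >> 6) & 0x3F] : '=');
        out.push_back(i + 2 < size ? alphabet[chunk & 0x3F] : '=');
    }
    return out;
}
```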
I think we shouldn't support the old Poco implementation; just add a setting that controls the WHERE and ORDER BY behavior, and ignore it when the new analyzer is used. |
@allmazz Feel free to throw out the mongodb in Poco once it is unused. ClickHouse maintains its own heavily-patched copy (dump) of Poco in base/poco/ and the less of it exists the smaller the burden. |
You mean we shouldn't keep the legacy implementation, right? |
Yes, Poco::MongoDB will no longer be used (I suppose) after your PR, so we can/should get rid of it. |
@vdimir thinks we should keep the old implementation too, so it will be deleted later. |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
MongoDB integration refactored: migration to the new mongocxx driver from the deprecated Poco::MongoDB, removed support for the deprecated old protocol, added support for connection by URI, support for all MongoDB types, support for WHERE and ORDER BY statements on the MongoDB side, and restriction of expressions unsupported by MongoDB.
Documentation entry for user-facing changes
The current MongoDB integration is very limited: not all types are supported, and WHERE and ORDER BY conditions are applied on the ClickHouse side, which requires reading ALL data from the MongoDB collection.