
MongoDB integration refactoring #63279

Open
wants to merge 18 commits into master
Conversation

allmazz
Contributor

@allmazz allmazz commented May 2, 2024

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

MongoDB integration refactored: migration to the new mongocxx driver from the deprecated Poco::MongoDB, removed support for the deprecated old protocol, added support for connection by URI, support for all MongoDB types, support for WHERE and ORDER BY statements executed on the MongoDB side, and restriction of expressions unsupported by MongoDB.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

The current MongoDB integration is very limited: not all types are supported, and WHERE and ORDER BY conditions are applied on the ClickHouse side, which requires reading ALL data from the MongoDB collection.
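To illustrate the difference, here is a minimal Python sketch (hypothetical helper names, not code from this PR) of the kind of WHERE/ORDER BY translation that pushes work down to MongoDB instead of scanning the whole collection:

```python
# Hypothetical illustration of pushing a WHERE/ORDER BY down to MongoDB.
# The real implementation is in the PR's C++ sources; this only shows
# the shape of the translated query documents.

def translate_where(column: str, op: str, value):
    """Map a simple SQL comparison to a MongoDB filter document."""
    mongo_ops = {">": "$gt", "<": "$lt", ">=": "$gte",
                 "<=": "$lte", "=": "$eq", "!=": "$ne"}
    return {column: {mongo_ops[op]: value}}

def translate_order_by(column: str, descending: bool = False):
    """Map an ORDER BY clause to a MongoDB sort document."""
    return {column: -1 if descending else 1}

# SELECT * FROM mongo_table WHERE age > 30 ORDER BY name
filter_doc = translate_where("age", ">", 30)
sort_doc = translate_order_by("name")
print(filter_doc)  # {'age': {'$gt': 30}}
print(sort_doc)    # {'name': 1}
```

With this translation, MongoDB itself applies the filter and sort, so only matching documents are transferred to ClickHouse.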

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

Modify your CI run

NOTE: If you merge the PR with modified CI you MUST KNOW what you are doing
NOTE: Checked options will be applied if set before the CI RunConfig/PrepareRunConfig step

Include tests (required builds will be added automatically):

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Unit tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with Analyzer
  • All with Azure
  • Add your option here

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • Add your option here

Extra options:

  • do not test (only style check)
  • disable merge-commit (no merge from master before tests)
  • disable CI cache (job reuse)

Only specified batches in multi-batch jobs:

  • 1
  • 2
  • 3
  • 4

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label May 4, 2024
@@ -203,7 +203,8 @@ TRAP(lgammal)
TRAP(nftw)
TRAP(nl_langinfo)
TRAP(putc_unlocked)
TRAP(rand)
//TRAP(rand) // Used in mongo-c-driver
Member

The driver has to be patched.

Member

The concerns have to be reported to MongoDB authors.

Member

Do you mean mongo-c-driver? Could you please refer to the issue? Is it a small patch?

Member

How did you handle it?

@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-feature Pull request with new product feature label May 4, 2024
@alexey-milovidov alexey-milovidov removed the can be tested Allows running workflows for external contributors label May 4, 2024
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the submodule changed At least one submodule changed in this PR. label May 4, 2024
@robot-clickhouse-ci-2
Contributor

robot-clickhouse-ci-2 commented May 4, 2024

This is an automated comment for commit eb6713d with a description of existing statuses. It is updated for the latest CI run.

❌ Click here to open a full report in a separate page

Check name | Description | Status
--- | --- | ---
A Sync | If it fails, ask a maintainer for help | ⏳ pending
CI running | A meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR | ❌ failure
ClickHouse build check | Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often have enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process | ⏳ pending
Fast test | Normally this is the first check that is run for a PR. It builds ClickHouse and runs most of the stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ❌ error
Mergeable Check | Checks if all other necessary checks are successful | ❌ failure

Successful checks

Check name | Description | Status
--- | --- | ---
PR Check | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success
Style check | Runs a set of checks to keep the code style clean. If some of the tests failed, see the related log from the report | ✅ success

@vdimir vdimir self-assigned this May 6, 2024
@vdimir
Member

vdimir commented May 6, 2024

Great initiative!

A couple of questions for now:

  • Does it support the mongodb+srv:// protocol? Can it be used directly with Atlas?

  • Are the connection parameters supported by the current Poco::MongoDB implementation still supported? A user may pass something like authSource=admin;ssl=true;readPreference=primaryPreferred;retryWrites=true;connectTimeoutMS=5000;socketTimeoutMS=5000 as the last storage argument, and some of these parameters are parsed by Poco. Won't this be broken?

@allmazz
Contributor Author

allmazz commented May 6, 2024

  • Does it support the mongodb+srv:// protocol? Can it be used directly with Atlas?

It's supposed to, but we have to test it.

@allmazz
Contributor Author

allmazz commented May 6, 2024

  • Won't this be broken?

It should not, but it needs to be tested.
I will check it soon.

@allmazz
Contributor Author

allmazz commented May 12, 2024

  • Does it support the mongodb+srv:// protocol? Can it be used directly with Atlas?

It's supposed to, but we have to test it.

Yes, it works now.

@allmazz
Contributor Author

allmazz commented May 12, 2024

  • Won't this be broken?

It should not, but it needs to be tested. I will check it soon.

This will not be broken, but your example is not valid, see the docs.
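For reference, a small stdlib-only Python sketch of the valid option format: MongoDB connection URIs separate options with `&` after a `?`, not with `;` (the URI below is a made-up example):

```python
from urllib.parse import urlsplit, parse_qs

# Made-up connection string; MongoDB URI options follow '?' and are
# separated by '&', e.g. ?authSource=admin&tls=true
uri = ("mongodb://user:pass@host1:27017/mydb"
       "?authSource=admin&tls=true&readPreference=primaryPreferred"
       "&connectTimeoutMS=5000&socketTimeoutMS=5000")

# Parse the query part into a plain dict of option name -> value.
options = {k: v[0] for k, v in parse_qs(urlsplit(uri).query).items()}
print(options["authSource"])      # admin
print(options["readPreference"])  # primaryPreferred
```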

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label May 12, 2024
@vdimir vdimir marked this pull request as draft May 21, 2024 09:52
@allmazz allmazz marked this pull request as ready for review May 25, 2024 23:06
@allmazz allmazz changed the title MongoDB integration refactoring: WIP MongoDB integration refactoring May 25, 2024
@allmazz allmazz requested a review from vdimir May 26, 2024 17:53
@vdimir vdimir left a comment (Member)

Thank you for the enormous effort you've put into this integration!

Firstly, I've concentrated on the high-level aspects of the implementation and didn't dig into the details too much.

I have a few suggestions regarding a smoother transition to the new library.
We can maintain the current Poco-based implementation alongside the new one for some time, adding a global setting or configuration option that allows users to toggle between the two. (Additionally, the existing Poco-based implementation could be used with the old analyzer, allow_experimental_analyzer = 0)
Source files of the current implementation can be kept with a suffix like *PocoLegacy.

The new implementation tries to construct queries for MongoDB and throws errors for unsupported queries. Instead we could fall back to reading all the data and processing it internally, similar to the existing approach.

Let me know what you think about these suggestions.

@@ -203,7 +203,8 @@ TRAP(lgammal)
TRAP(nftw)
TRAP(nl_langinfo)
TRAP(putc_unlocked)
TRAP(rand)
//TRAP(rand) // Used in mongo-c-driver
Member

How did you handle it?

contrib/mongo-cxx-driver-cmake/CMakeLists.txt (resolved)
contrib/mongo-c-driver-cmake/CMakeLists.txt (resolved)
add_library(ch_contrib::libmongoc ALIAS _libmongoc)
target_include_directories(_libmongoc SYSTEM PUBLIC ${LIBMONGOC_SOURCE_DIR} ${COMMON_SOURCE_DIR} ${UTF8PROC_SOURCE_DIR})
target_compile_definitions(_libmongoc PRIVATE MONGOC_COMPILATION)
target_link_libraries(_libmongoc ch_contrib::libbson ch_contrib::c-ares ch_contrib::zlib resolv)
Member

I don't see any linking with resolv elsewhere (but resolv.h is used in base/poco/Net/src/DNS.cpp), so why is it required explicitly here? Where will this library come from?

Contributor Author

This is a system library, which should be present on all Unix-like systems.
mongo-c-driver needs it here; without specifying it explicitly, the build will fail.

If a key is not found in the MongoDB document, the default value, or null if the column is nullable, will be inserted.

## Supported clauses
*Hint: you can use a MongoDB table in a CTE to apply any clauses, but be aware that in some cases performance will be significantly degraded.*
Member

Do you mean to use

WITH (SELECT * FROM mongo_table) as m
SELECT ... FROM m GROUP BY ...

So, we read all data from mongo_table first and then process it on the ClickHouse side? What will happen when the table is used with an unsupported clause, will we get an error?

I assume we can clarify this in the documentation here, add an example.

Contributor Author

Yes, I mean this.
Okay, I will provide some examples.
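Following up on the hint above, a sketch of what such a CTE could look like (hypothetical table and column names; the aggregation runs on the ClickHouse side after the collection is read):

```sql
-- Hypothetical example: aggregate a MongoDB-backed table on the ClickHouse side.
-- The whole collection is read first, so performance may degrade on large data.
WITH mongo_data AS
(
    SELECT * FROM mongodb_table
)
SELECT status, count()
FROM mongo_data
GROUP BY status;
```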

tests/integration/helpers/cluster.py (resolved)
src/Dictionaries/MongoDBDictionarySource.cpp (resolved)

bool exists_in_current_document = document->exists(name);
if (!exists_in_current_document)
if (sample_column.type->isNullable())
Member

Won't we just insert NULL for a Nullable column in insertDefaultValue? Why do we need to explicitly set the value in the null map again?

Contributor Author

Yes, it will be shorter, thanks.
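A hedged Python sketch (hypothetical names, not the PR's C++ code) of the simplified logic discussed in this thread: when a key is absent from a document, inserting the column default is enough, and for a Nullable column that default is simply null:

```python
from typing import Any, Optional

def value_for_column(document: dict, name: str,
                     nullable: bool, default: Any) -> Optional[Any]:
    """If the key is missing from the MongoDB document, fall back to the
    column default; for a Nullable column the default is simply None,
    so no separate null-map bookkeeping is needed."""
    if name not in document:
        return None if nullable else default
    return document[name]

doc = {"a": 1}
print(value_for_column(doc, "a", nullable=False, default=0))  # 1
print(value_for_column(doc, "b", nullable=True, default=0))   # None
print(value_for_column(doc, "b", nullable=False, default=0))  # 0
```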

Array BSONArrayAsArray(size_t dimensions, const bsoncxx::types::b_array & array, const DataTypePtr & type, const Field & default_value, const std::string & name)
{
auto arr = Array();
if (dimensions > 0)
Member

Should we add some sanity check for the number of dimensions? ClickHouse probably won't allow defining a too deeply nested array anyway, but maybe just in case.

Contributor Author

That's not the responsibility of the source, is it?
In the source we just check the data from Mongo and make sure it can be inserted into the given schema.
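For illustration, here is a Python sketch (hypothetical names, not the PR's C++ code) of the recursive conversion with the optional sanity cap on nesting depth that was suggested above:

```python
def bson_array_as_list(value, dimensions: int, default=None, max_depth: int = 32):
    """Recursively convert a (possibly nested) BSON-like array into a
    nested Python list of exactly `dimensions` levels. A shape mismatch
    is replaced with `default`; `max_depth` is a sanity cap on nesting."""
    if dimensions > max_depth:
        raise ValueError(f"too many array dimensions: {dimensions}")
    if dimensions == 0:
        # A scalar is expected at this level; a list here is a mismatch.
        return value if not isinstance(value, list) else default
    if not isinstance(value, list):
        return [default]  # shape mismatch: substitute the default
    return [bson_array_as_list(elem, dimensions - 1, default) for elem in value]

print(bson_array_as_list([[1, 2], [3]], dimensions=2))  # [[1, 2], [3]]
```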

return json;
}
case bsoncxx::type::k_binary:
return base64Encode(std::string(reinterpret_cast<const char*>(value.get_binary().bytes), value.get_binary().size));
Member

Why do we encode it with base64? We can store raw data in ClickHouse Strings.
UPD: But it won't work for the case where the binary is inside a document or array that we serialize to JSON. How does Mongo return this data with the native client when we request a document with a binary field?

Contributor Author

Okay, good idea.

As a BSON binary field, the same as when it's outside a document.
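To make the trade-off discussed here concrete, a stdlib-only Python sketch (illustrative, not the PR's code): a top-level binary value can be stored raw in a String column, but binary nested inside a JSON-serialized document needs a text encoding such as base64:

```python
import base64
import json

raw = b"\x00\xffbinary\x01"

# Top-level binary field: can go into a ClickHouse String column as-is.
as_string_column = raw

# Binary nested inside a document serialized to JSON: raw bytes are not
# valid JSON text, so a text encoding such as base64 is needed.
doc = {"payload": base64.b64encode(raw).decode("ascii")}
encoded = json.dumps(doc)

# The round trip recovers the original bytes.
decoded = base64.b64decode(json.loads(encoded)["payload"])
print(decoded == raw)  # True
```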

@allmazz
Contributor Author

allmazz commented May 28, 2024

Thank you for the enormous effort you've put into this integration!

Firstly, I've concentrated on the high-level aspects of the implementation and didn't dig into the details too much.

I have a few suggestions regarding a smoother transition to the new library. We can maintain the current Poco-based implementation alongside the new one for some time, adding a global setting or configuration option that allows users to toggle between the two. (Additionally, the existing Poco-based implementation could be used with the old analyzer, allow_experimental_analyzer = 0) Source files of the current implementation can be kept with a suffix like *PocoLegacy.

The new implementation tries to construct queries for MongoDB and throws errors for unsupported queries. Instead we could fall back to reading all the data and processing it internally, similar to the existing approach.

Let me know what you think about these suggestions.

I think we shouldn't support the old Poco implementation; let's just add a setting that controls the WHERE and ORDER BY behavior, and ignore it when the new analyzer is used.

@rschu1ze
Member

@allmazz Feel free to throw out the MongoDB code in Poco once it is unused. ClickHouse maintains its own heavily patched copy (dump) of Poco in base/poco/, and the less of it exists, the smaller the burden.

@allmazz
Contributor Author

allmazz commented May 29, 2024

@allmazz Feel free to throw out the MongoDB code in Poco once it is unused. ClickHouse maintains its own heavily patched copy (dump) of Poco in base/poco/, and the less of it exists, the smaller the burden.

You mean we shouldn't keep the legacy implementation, right?

@rschu1ze
Member

Yes, Poco::MongoDB will no longer be used (I suppose) after your PR, so we can/should get rid of it.

@allmazz
Contributor Author

allmazz commented May 29, 2024

Yes, Poco::MongoDB will no longer be used (I suppose) after your PR, so we can/should get rid of it.

@vdimir thinks we should keep the old implementation too, so it will be deleted later.

@CLAassistant

CLAassistant commented Jun 2, 2024

CLA assistant check
All committers have signed the CLA.

@allmazz allmazz force-pushed the mongodb_refactoring branch 2 times, most recently from 86c3296 to 06b0289 Compare June 2, 2024 00:25