
[mongodb] Improve mongodb quantization #463

Merged: 13 commits into 0.13-dev on Jun 20, 2018

Conversation

@pawelchcki (Contributor) commented Jun 14, 2018

This PR backports Hash Quantization from 0.13-dev, with some necessary changes to make it a bit more compatible with the previous MongoDB quantization.

It does so by adding a "truncate_arrays" option, which ensures that quantization only includes the first item from an array containing nested arrays or objects. It's an important facet of effective MongoDB quantization.

In addition, it normalizes the hash keys used in the resource string, allowing the expected merging to occur and avoiding situations where one field was included twice:

{.., :collection: "name", "collection": "name",...}

assert_equal('{"operation"=>:insert, "database"=>"test", "collection"=>"people", "documents"=>[{:name=>"?", :hobbies=>["?"]}, "?"], "ordered"=>"?"}', span.resource)
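A minimal, self-contained sketch of the two behaviors described above (illustrative only, not the ddtrace implementation; for simplicity it quantizes every scalar to "?", whereas the real integration keeps selected values such as database and collection names visible):

# Illustrative sketch only -- not the actual ddtrace code.
# Keys are normalized to strings so :collection and "collection" merge,
# and arrays containing nested structures are truncated to [first_item, "?"].
def quantize(value, truncate_arrays: true)
  case value
  when Hash
    value.each_with_object({}) do |(key, val), out|
      out[key.to_s] = quantize(val, truncate_arrays: truncate_arrays)
    end
  when Array
    if truncate_arrays && value.any? { |v| v.is_a?(Hash) || v.is_a?(Array) }
      [quantize(value.first, truncate_arrays: truncate_arrays), '?']
    else
      ['?']
    end
  else
    '?'
  end
end

quantize({ :collection => 'people', 'collection' => 'people',
           'documents' => [{ name: 'Alice', hobbies: ['chess'] }, { name: 'Bob' }] })
# => {"collection"=>"?", "documents"=>[{"name"=>"?", "hobbies"=>["?"]}, "?"]}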

TODO:

  • docs for quantization in mongo

@pawelchcki pawelchcki added bug Involves a bug integrations Involves tracing integrations feature Involves a product feature labels Jun 14, 2018
@pawelchcki pawelchcki changed the title Improve mongodb quantization [mongodb] Improve mongodb quantization Jun 14, 2018
@pawelchcki pawelchcki requested review from delner and removed request for delner June 14, 2018 15:12
@pawelchcki pawelchcki requested a review from delner June 14, 2018 15:47
@delner (Contributor) commented Jun 14, 2018

Can you explain what truncate_arrays does and why it's necessary? And the other features like indifferent equals and so on?

I'm concerned this is making the quantization more complicated than it should be, and thus reducing performance.

@pawelchcki (Contributor, Author) replied:

Expanding on the PR description:

Array truncation is something that was already used in MongoDB quantization. I reimplemented that in Hash quantization.

The case for it is that MongoDB can have quite a lot of embedded documents in its query, and that number can be variable.

Imagine a query that, depending on conditions, updates 1 to 10000 different documents.
That would mean up to 10000 different resource names created for essentially the same operation, which would convey little information to the end user. So instead of creating a resource like:

{"u"=>[{a: ?}, {a:?},{a:?},{a:?},{a:?},{a:?},{a:?},{a:?},{a:?},{a:?},{a:?},{a:?}

Array truncation will cut it down to just

{"u"=>[{a:?}, "?"]}

This is something that already exists in the current implementation of MongoDB quantization.
However, it results in the following resource name:

{"u"=>{a:?}}

Which I think hides too much information. So I tweaked it a tiny bit; however, I'm not married to that idea.

As for the performance of the changes: adding one if and a method-call overhead, performed only on arrays, should carry a minimal penalty.

When arrays are truncated, it should however give a quite measurable speedup, since we no longer have to traverse all objects in the array.
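To make that concrete, here is a hypothetical example using the illustrative quantize sketch shown under the PR description (not the actual ddtrace method):

# Hypothetical example: 10,000 update documents; only the first element
# is traversed, the rest collapse into a single "?" placeholder.
docs = Array.new(10_000) { |i| { a: i } }
quantize({ 'u' => docs })
# => {"u"=>[{"a"=>"?"}, "?"]}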

@delner (Contributor) commented Jun 15, 2018

I think in your case, where {"u"=>[{a: ?}, {a:?},{a:?},{a:?},{a:?},{a:?},{a:?},{a:?},{a:?},{a:?},{a:?},{a:?}]} repeats, it's pretty clear how it benefits. But what if there are cases where that {a:?} isn't a simple hash but a meaningful object, such as a document bulk insert? Then that wouldn't be so good, I'd think.

I think we should adopt a simple rule for each data type that makes general sense and live with some of the consequences. Otherwise I fear the quantization strategy will constantly thrash whenever the general strategy is sub-optimal for a particular schema. Hash::Quantization should be that general strategy, and if it's too sub-optimal for a specific schema (say Elasticsearch), then that integration should re-implement its own strategy. In this case, I think the generic strategy is fine for ES.

But getting back to the specifics, and applying what I said in the previous paragraph: I think your point that arrays are repetitive content and should be truncated is probably the best general fit, in which case I would suggest arrays always become ["?"] in Hash::Quantization.

However, I do like your idea of {"u"=>[{a:?}, "?"]}, since it gives a sample of what complex object would otherwise have been truncated entirely. Maybe that's a decent compromise and should be integrated somehow, or at least be an option. And the more I think about this, the more I think we need to refactor Hash::Quantization to use a strategy pattern, where an integration or some other part of the code can define the specifics of the strategy it wants to implement (like your truncation strategy); a rough sketch follows below.

But before we do that, let's do the minimal changes here to fix the bug, then pursue a better quantization strategy.
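For reference, a rough sketch of the strategy-pattern idea mentioned above; every name here is hypothetical and not part of the ddtrace API:

# Hypothetical sketch of a pluggable quantization strategy.
module Quantization
  # A strategy only needs to answer: how do I quantize an Array?
  class TruncateArrays
    def quantize_array(array, &recurse)
      [recurse.call(array.first), '?']
    end
  end

  class CollapseArrays
    def quantize_array(_array, &_recurse)
      ['?']
    end
  end

  def self.format(value, strategy)
    case value
    when Hash  then value.transform_keys(&:to_s)
                         .transform_values { |v| format(v, strategy) }
    when Array then strategy.quantize_array(value) { |v| format(v, strategy) }
    else '?'
    end
  end
end

Quantization.format({ 'u' => [{ a: 1 }, { a: 2 }, { a: 3 }] }, Quantization::TruncateArrays.new)
# => {"u"=>[{"a"=>"?"}, "?"]}
Quantization.format({ 'u' => [{ a: 1 }, { a: 2 }, { a: 3 }] }, Quantization::CollapseArrays.new)
# => {"u"=>["?"]}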

@delner delner added this to the 0.13.0 milestone Jun 15, 2018
…ation_improvements

+ backport tests into spec suite

# Conflicts:
#	lib/ddtrace/quantization/hash.rb
#	spec/ddtrace/quantization/hash_spec.rb
#	test/contrib/mongodb/client_test.rb
@pawelchcki (Contributor, Author) commented Jun 18, 2018

@delner
I've implemented your initial suggestion of using the Hash quantization module to filter out unwanted MongoDB object keys. I think it works well; to remove that I'd have to reimplement quite a bit of code that's already in the Hash quantization module.

As for {'a'=>?}, it works as before, i.e. if ? is an object and not a string, it will be recursed into to perform quantization and reflect the structure of the object. So [{a:{b:{c:1}}}, {a:{b:...}}] will be quantized to [{a:{b:{c:"?"}}}, "?"].

I've rebased this branch on 0.13-dev; once #465 is merged, I'll refresh this PR to reflect only the relevant changes.

I'll make array truncation the default, per your suggestions.

context 'given a Array with nested hashes' do
let(:hash) { [{ foo: { bar: 1 } }, { foo: { bar: 2 } }] }
it { is_expected.to eq([{ foo: { bar: '?' } }, '?']) }
end

@pawelchcki pawelchcki changed the base branch from master to 0.13-dev June 18, 2018 10:11
@pawelchcki pawelchcki changed the base branch from 0.13-dev to merge_master_into_0.13-dev June 18, 2018 10:25
@pawelchcki pawelchcki changed the base branch from merge_master_into_0.13-dev to 0.13-dev June 18, 2018 15:13
@pawelchcki (Contributor, Author):
Added this PR #467 to test the changes on ES

@delner (Contributor) left a comment


👍

@pawelchcki pawelchcki merged commit 07de906 into 0.13-dev Jun 20, 2018
@pawelchcki pawelchcki deleted the bugfix/mongodb_quantization_improvements branch June 21, 2018 10:41