Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Small microsecond spans after most $in mongo queries and mislabeling of getMore operations #719

Closed
millerick opened this issue Oct 23, 2019 · 5 comments
Labels
Milestone

Comments

@millerick
Copy link

Describe the bug
This is potentially a bug, but more of a question as to why dd-trace is behaving the way that it is behaving.

Summary:

  1. It would be nice if getMore operations were separated from the initial find operation.
  2. There are microsecond duration traces that I don't necessarily think represent a round trip to the database to retrieve information.

Details

We noticed that when a large number of documents is returned, the APM spans produced by dd-trace produce multiple different spans, one for the initial query, and then new spans for each of the getMore operations to fetch the remaining documents:
image
This find returns > 101 documents, so there is at least one getMore request involved in retrieving all of the results. The APM spans were useful for noticing this, but it would be nice if the getMore was actually labeled as a getMore instead of a find.

We thought the third microsecond duration span was odd, but hoped it would go away when we used a large enough batchSize to eliminate the subsequent getMore operations:
image
You'll notice that the microsecond duration span is still there. What does it actually represent? IMO it should be lumped together with whatever operation it is actually associated with, either the find, or subsequent getMore operations.

Currently both of these factors greatly skew the APM data and make it difficult to use them to track performance. For the example query provided above, we currently see a tri-modal distribution, where the first peak represents the microsecond response time spans, and the other two peaks represent the find and the getMore operation (not necessarily in that order):
image

Environment

  • Operation system: Ubuntu 16.04
  • Node version: 10.16.3
  • Tracer version: 0.15.4
  • Agent version: 6.13.0
@millerick millerick added the bug Something isn't working label Oct 23, 2019
@rochdev
Copy link
Member

rochdev commented Oct 23, 2019

The way the integration works right now is that we instrument any command going through the driver. This includes queries but also administrative command or any of the available commands. We do not alter how these are captured and displayed specifically to give you more visibility, but I understand that for your use case it's not necessarily the best approach.

It's unlikely this will change for the mongodb-core integration since the goal is specifically to know exactly what the driver is doing, but we could add an additional layer directly at the mongodb level (or even mongoose later on) to give more coarse-grained information about what the app did with the library, additional to the current information of what the library did with the driver. The result would be that the spans from the driver would have a parent span from the library which would span the entire operation.

Do you think this would help for your use case?

@rochdev rochdev added feature-request question Further information is requested and removed bug Something isn't working labels Oct 23, 2019
@millerick
Copy link
Author

millerick commented Oct 23, 2019

It's unlikely this will change for the mongodb-core integration since the goal is specifically to know exactly what the driver is doing

I would argue that currently the spans are not displaying exactly what the driver is doing. There is no way to differentiate between which spans are a round trip to the database, and which are not (remember, my suspicion is that the microsecond duration spans do not result in a network hop), and there is no differentiation between the initial find and the getMore paging through the cursor.

I really want to use the APM information to know what cursors are paged through to find opportunities to reduce the total number of network hops by increasing batchSize provided to the driver where appropriate. Currently the exercise of tracking these down is not straightforward because there isn't much differentiation there.

You're right that the mongodb level rather than the (former) mongodb-core level is probably the right place to instrument in order to guarantee accurate timings across a full operation. I can see arguments for having both since the mongodb level would provide for accurate timings, and if things are ironed out a little bit, the (former) mongodb-core level would give greater granularity to find and iron out performance issues.

@rochdev
Copy link
Member

rochdev commented Oct 23, 2019

Makes sense. I'll have to look into it to see how we could restructure things to isolate actual calls to Mongo.

@rochdev
Copy link
Member

rochdev commented Nov 13, 2020

You'll notice that the microsecond duration span is still there. What does it actually represent?

It likely represents the killCursors command when the cursor has been consumed entirely. As described in #892 (comment), we're in the process of rewriting this integration at a lower level so that we capture what the driver is doing directly. This will provide more accurate results with the correct command name and tags for each individual spans, which should make this a lot less confusing.

Currently both of these factors greatly skew the APM data and make it difficult to use them to track performance. For the example query provided above, we currently see a tri-modal distribution, where the first peak represents the microsecond response time spans, and the other two peaks represent the find and the getMore operation

This actually happens because they all share the same resource name, which will no longer be the case with the new logic, so the stats will be properly aggregated with this additional dimension to separate query from getMore and killCursors.

I can see arguments for having both since the mongodb level would provide for accurate timings

The problem with this approach that I didn't initially consider is that a lot of collection methods return a cursor, and it's also possible to instantiate cursors directly. Cursors don't have a clear beginning and end since they can be reinitialized, killed, rewinded, or just garbage collected when unused which would keep the server cursor open forever unless there is some sort of timeout. This means it's basically not possible to do this in a way that will not cause other issues.

@rochdev
Copy link
Member

rochdev commented Nov 19, 2020

Fixed in #1159. Each span now correctly corresponds to an actual command sent to Mongo with more relevant metadata.

@rochdev rochdev closed this as completed Nov 19, 2020
@rochdev rochdev added this to the 0.29.0 milestone Nov 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants