Small microsecond spans after most $in mongo queries and mislabeling of getMore operations #719

millerick · 2019-10-23T01:45:50Z

Describe the bug
This is potentially a bug, but more of a question as to why dd-trace is behaving the way that it is behaving.

Summary:

It would be nice if getMore operations were separated from the initial find operation.
There are microsecond duration traces that I don't necessarily think represent a round trip to the database to retrieve information.

Details

We noticed that when a large number of documents is returned, the APM spans produced by dd-trace produce multiple different spans, one for the initial query, and then new spans for each of the getMore operations to fetch the remaining documents:

This find returns > 101 documents, so there is at least one getMore request involved in retrieving all of the results. The APM spans were useful for noticing this, but it would be nice if the getMore was actually labeled as a getMore instead of a find.

We thought the third microsecond duration span was odd, but hoped it would go away when we used a large enough batchSize to eliminate the subsequent getMore operations:

You'll notice that the microsecond duration span is still there. What does it actually represent? IMO it should be lumped together with whatever operation it is actually associated with, either the find, or subsequent getMore operations.

Currently both of these factors greatly skew the APM data and make it difficult to use them to track performance. For the example query provided above, we currently see a tri-modal distribution, where the first peak represents the microsecond response time spans, and the other two peaks represent the find and the getMore operation (not necessarily in that order):

Environment

Operation system: Ubuntu 16.04
Node version: 10.16.3
Tracer version: 0.15.4
Agent version: 6.13.0

rochdev · 2019-10-23T17:43:49Z

The way the integration works right now is that we instrument any command going through the driver. This includes queries but also administrative command or any of the available commands. We do not alter how these are captured and displayed specifically to give you more visibility, but I understand that for your use case it's not necessarily the best approach.

It's unlikely this will change for the mongodb-core integration since the goal is specifically to know exactly what the driver is doing, but we could add an additional layer directly at the mongodb level (or even mongoose later on) to give more coarse-grained information about what the app did with the library, additional to the current information of what the library did with the driver. The result would be that the spans from the driver would have a parent span from the library which would span the entire operation.

Do you think this would help for your use case?

millerick · 2019-10-23T19:04:07Z

It's unlikely this will change for the mongodb-core integration since the goal is specifically to know exactly what the driver is doing

I would argue that currently the spans are not displaying exactly what the driver is doing. There is no way to differentiate between which spans are a round trip to the database, and which are not (remember, my suspicion is that the microsecond duration spans do not result in a network hop), and there is no differentiation between the initial find and the getMore paging through the cursor.

I really want to use the APM information to know what cursors are paged through to find opportunities to reduce the total number of network hops by increasing batchSize provided to the driver where appropriate. Currently the exercise of tracking these down is not straightforward because there isn't much differentiation there.

You're right that the mongodb level rather than the (former) mongodb-core level is probably the right place to instrument in order to guarantee accurate timings across a full operation. I can see arguments for having both since the mongodb level would provide for accurate timings, and if things are ironed out a little bit, the (former) mongodb-core level would give greater granularity to find and iron out performance issues.

rochdev · 2019-10-23T20:46:06Z

Makes sense. I'll have to look into it to see how we could restructure things to isolate actual calls to Mongo.

rochdev · 2020-11-13T03:29:39Z

You'll notice that the microsecond duration span is still there. What does it actually represent?

It likely represents the killCursors command when the cursor has been consumed entirely. As described in #892 (comment), we're in the process of rewriting this integration at a lower level so that we capture what the driver is doing directly. This will provide more accurate results with the correct command name and tags for each individual spans, which should make this a lot less confusing.

Currently both of these factors greatly skew the APM data and make it difficult to use them to track performance. For the example query provided above, we currently see a tri-modal distribution, where the first peak represents the microsecond response time spans, and the other two peaks represent the find and the getMore operation

This actually happens because they all share the same resource name, which will no longer be the case with the new logic, so the stats will be properly aggregated with this additional dimension to separate query from getMore and killCursors.

I can see arguments for having both since the mongodb level would provide for accurate timings

The problem with this approach that I didn't initially consider is that a lot of collection methods return a cursor, and it's also possible to instantiate cursors directly. Cursors don't have a clear beginning and end since they can be reinitialized, killed, rewinded, or just garbage collected when unused which would keep the server cursor open forever unless there is some sort of timeout. This means it's basically not possible to do this in a way that will not cause other issues.

rochdev · 2020-11-19T14:55:44Z

Fixed in #1159. Each span now correctly corresponds to an actual command sent to Mongo with more relevant metadata.

millerick added the bug Something isn't working label Oct 23, 2019

rochdev added community integrations labels Oct 23, 2019

rochdev added feature-request question Further information is requested and removed bug Something isn't working labels Oct 23, 2019

rochdev mentioned this issue Nov 13, 2020

Many spans created when .forEach is called on mongodb results #892

Closed

rochdev mentioned this issue Nov 18, 2020

fix mongodb integration not instrumenting correctly #1159

Merged

rochdev closed this as completed Nov 19, 2020

rochdev added this to the 0.29.0 milestone Nov 19, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Small microsecond spans after most $in mongo queries and mislabeling of getMore operations #719

Small microsecond spans after most $in mongo queries and mislabeling of getMore operations #719

millerick commented Oct 23, 2019

rochdev commented Oct 23, 2019

millerick commented Oct 23, 2019 •

edited

Loading

rochdev commented Oct 23, 2019

rochdev commented Nov 13, 2020

rochdev commented Nov 19, 2020

Small microsecond spans after most $in mongo queries and mislabeling of getMore operations #719

Small microsecond spans after most $in mongo queries and mislabeling of getMore operations #719

Comments

millerick commented Oct 23, 2019

rochdev commented Oct 23, 2019

millerick commented Oct 23, 2019 • edited Loading

rochdev commented Oct 23, 2019

rochdev commented Nov 13, 2020

rochdev commented Nov 19, 2020

millerick commented Oct 23, 2019 •

edited

Loading