Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalized URL as the resource name #442

Merged
merged 12 commits into from
May 24, 2019

Conversation

SammyK
Copy link
Contributor

@SammyK SammyK commented May 16, 2019

Description

This PR adds URL normalization so that that they can be used as a low-cardinality resource name. The default substitution patterns scan for UUID's, hexadecimal hashes, and int's. We can certainly add more to that list.

The DD_TRACE_RESOURCE_URI_MAPPING environment variable can be set to a CSV list of URL-to-resource-name mapping rules that can contain * and $* wildcards.

  • The * wildcard will match one or more characters to be replaced with ?
  • The $* wildcard will match one or more characters without any replacement

For example, given DD_TRACE_RESOURCE_URI_MAPPING=/user/*,/city/$*/*:

URL Normalized
/user/sammyk /user/?
/city/foo/123 /city/foo/?
/city/bar/123 /city/bar/?

Readiness checklist

@SammyK SammyK added 🏆 enhancement A new feature or improvement ☠️ do-not-merge/WIP 🍏 core Changes to the core tracing functionality 🎉 new-integration A new integration labels May 16, 2019
@SammyK SammyK added this to the 0.24.0 milestone May 16, 2019
@SammyK SammyK marked this pull request as ready for review May 16, 2019 22:06
@pawelchcki pawelchcki modified the milestones: 0.24.0, 0.25.0 May 17, 2019
@SammyK SammyK force-pushed the sammyk/uri-to-resource-normalization branch from 8f3fdcf to 4de654d Compare May 17, 2019 16:55
labbati
labbati previously approved these changes May 22, 2019
Copy link
Member

@labbati labbati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome stuff here! 😄
Just added a few comments if they make sense to you

@@ -112,6 +110,12 @@ private static function initRootSpan()
$span->setTag(Tag::HTTP_URL, $_SERVER['REQUEST_URI']);
// Status code defaults to 200, will be later on changed when http_response_code will be called
$span->setTag(Tag::HTTP_STATUS_CODE, 200);
// Normalized URL as the resource name
$normalizer = new Urls(explode(',', getenv('DD_TRACE_RESOURCE_URI_MAPPING')));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of comments here, if they make sense to you:

  • While I see our conversation about simplifying the way we read configurations, wouldn't you see more strategic to not use getenv here? What I mean is that you may just want to move this getenv() call to a method in Config and not use all the underlying infrastructure.
  • I was evaluating pros and cons of doing this here or as a post processing operation. Doing it here means that devs cannot, as an example, set this env in their index.php or the likes. In the end we need that info only at the very end, just before we send the trace. Wouldn't you think it is a good idea to push this setting to the very end?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great points!

What I mean is that you may just want to move this getenv() call to a method in Config and not use all the underlying infrastructure.

I'm certainly open to moving the getenv() to an abstraction. :) What do you see as the main advantage for abstracting this one-time env access?

Wouldn't you think it is a good idea to push this setting to the very end?

This is a great idea! I'll move it to right before the flush... maybe we should start a discussion about possibly adding span filters in a future PR? :)

@@ -0,0 +1,48 @@
<?php

namespace DDTrace\Configuration;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this class, I wonder why it is in the Configuration namespace though? :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point. Naming is hard. 😄

What do you think about adding an Obfuscation namespace? We could also move the existing obfuscation stuff to that namespace in a separate PR. And possibly organize that code a bit better. What do you think? :)

// UUID's
'|\b([0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89AB][0-9a-f]{3}-[0-9a-f]{12})\b|i',
// 16-512 bit hex hashes
'|\b([0-9a-f]{4,128})\b|i',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest I would not add this. Isn't it too risky to have false positives with a 4-128 range? WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a security standpoint, I'd rather err on the side of over-obfuscation, but I totally see your point as well.

I was thinking about valid words that are 4+ characters that contain only a-f (like cafe). One way around erroneous obfuscation would be curating a whitelist of valid words that are 4+ and a-f. Perhaps we could eventually offer a feature to add words to the whitelist in the event there is a proper noun or a made-up word that's important (like affeca). There are also i18n considerations to support other languages.

In the end, a configurable whitelist that's also very fast (maybe making use of a Trie data structure or something) would be a bit of an undertaking, but it might be worth the tradeoff to ensure better security.

If you're OK with it, I'd like to ping someone from the security department to weigh in on this one. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I've narrowed the scope of this to start at 32-bit hashes since an 8-character word with only a-f is much less likely to occur. :)

/**
* A utility class that provides methods to work on urls
*/
class Urls
{
private static $defaultPatterns = [
// UUID's
'|\b([0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89AB][0-9a-f]{3}-[0-9a-f]{12})\b|i',
Copy link
Member

@labbati labbati May 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just based on my previous experience, can we make the - optional? We slightly increase the risk to get false positives here but it is limited to only hex strings with one exact number of chars, which is perfectly acceptable in my opinion.

I really appreciate the detailed regex considering [89AB] from the spec. I would ask you though:

  • to add a comment to the section in wikipedia explaining it as it surprised me :) and I had to go through a few google links to find it 😄
  • for consistency with the other parts of the expression can we make it lowercase, the pattern in insensitive so it should make no difference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just based on my previous experience, can we make the - optional?

Good call! There's a test for this and it is obfuscated by the regex that catches "16-512 bit hex hashes". But since we may end up changing/removing that regex based on your other comment above, I'll change the regex to make the dashes optional. :)

to add a comment to the section in wikipedia explaining it

Good idea! I'll add that. :)

for consistency with the other parts of the expression can we make it lowercase

Another good idea. I'll change that. :)

*/
public function __construct(array $patternsWithWildcards = [])
{
$this->replacementPatterns = array_map(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Optional]: this is a matter of styles and this comment is totally optional. I love to Ctrl+click around to see who is using a method. While I know that UTs would detect something wrong here, I still prefer the plain old for loop. This is personal here :) don't hate me 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol - I don't mind refactoring this to a foreach loop since we aren't able to support ClassName::class. :)

public function wildcardToRegexExamples()
{
return [
['/foo/*/bar', ['|^/foo/.+/bar$|', '/foo/?/bar']],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens to ^/foo/.+/bar$ with the pattern /foo/123/legitimate/bar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wildcards are greedy by default. Do you think we should add a non-greedy wildcard like *?? :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave like this, and users can add their own specific paths. Also, if we receive feedback about it then we can iterate on it.

@SammyK SammyK force-pushed the sammyk/uri-to-resource-normalization branch 2 times, most recently from d45db31 to a5843a9 Compare May 22, 2019 17:25
@SammyK SammyK force-pushed the sammyk/uri-to-resource-normalization branch from a5843a9 to b4e06ae Compare May 23, 2019 14:59
@@ -34,7 +34,7 @@ public function load()

dd_trace('Laravel\Lumen\Application', 'dispatch', function () use ($span) {
$response = dd_trace_forward_call();
$resourceName = 'unnamed_route';
$resourceName = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look at who is sneaking in 😄 👍

Copy link
Member

@labbati labbati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey just a couple of things that I think we forgot about and then we are in awesome-land :)

@@ -0,0 +1,29 @@
<?php
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shuold we move this to Obfuscation namespace as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh! 🤦‍♂ Nice catch! 😄

public function wildcardToRegexExamples()
{
return [
['/foo/*/bar', ['|^/foo/.+/bar$|', '/foo/?/bar']],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave like this, and users can add their own specific paths. Also, if we receive feedback about it then we can iterate on it.

@@ -399,6 +404,25 @@ private function addHostnameToRootSpan()
}
}

private function addUrlAsResourceNameToRootSpan()
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only thing we miss is to make this OFF by default. This in order to prevent unexpected huge cardinality (and so bill) issues out of the blu.

Copy link
Member

@labbati labbati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love this new feature. Excellent work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🍏 core Changes to the core tracing functionality 🏆 enhancement A new feature or improvement 🎉 new-integration A new integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants