Normalized URL as the resource name #442

SammyK · 2019-05-16T21:56:56Z

Description

This PR adds URL normalization so that URLs can be used as low-cardinality resource names. The default substitution patterns scan for UUIDs, hexadecimal hashes, and integers. We can certainly add more to that list.

The DD_TRACE_RESOURCE_URI_MAPPING environment variable can be set to a CSV list of URL-to-resource-name mapping rules that can contain * and $* wildcards.

The * wildcard will match one or more characters to be replaced with ?
The $* wildcard will match one or more characters without any replacement

For example, given DD_TRACE_RESOURCE_URI_MAPPING=/user/*,/city/$*/*:

URL	Normalized
`/user/sammyk`	`/user/?`
`/city/foo/123`	`/city/foo/?`
`/city/bar/123`	`/city/bar/?`
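
To make the two wildcards concrete, here is a minimal PHP sketch of how a mapping rule could be turned into a regex and a replacement string. The function names `wildcardToRegexSketch` and `normalizeUrlSketch`, and the use of a capture group for `$*`, are assumptions for illustration; the PR's actual `WildcardToRegex` class may work differently.

```php
<?php

/**
 * Hypothetical helper (not the PR's WildcardToRegex class): converts one
 * mapping rule into a [regex, replacement] pair.
 */
function wildcardToRegexSketch($rule)
{
    $regex = '';
    $replacement = '';
    $group = 0;
    // Split on "$*" and "*", keeping the wildcard tokens in the result.
    $parts = preg_split('/(\$\*|\*)/', $rule, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    foreach ($parts as $part) {
        if ($part === '$*') {
            // Matched text is kept: capture it and reference it in the replacement.
            $group++;
            $regex .= '(.+)';
            $replacement .= '${' . $group . '}';
        } elseif ($part === '*') {
            // Matched text is replaced with a literal "?".
            $regex .= '.+';
            $replacement .= '?';
        } else {
            // Literal path text; assumed to contain no "$" or "\" characters.
            $regex .= preg_quote($part, '|');
            $replacement .= $part;
        }
    }
    return ['|^' . $regex . '$|', $replacement];
}

/**
 * Hypothetical helper: applies the first matching rule, or returns the URL unchanged.
 */
function normalizeUrlSketch($url, array $rules)
{
    foreach ($rules as $rule) {
        list($regex, $replacement) = wildcardToRegexSketch($rule);
        if (preg_match($regex, $url)) {
            return preg_replace($regex, $replacement, $url);
        }
    }
    return $url;
}

$rules = explode(',', '/user/*,/city/$*/*');
echo normalizeUrlSketch('/user/sammyk', $rules), PHP_EOL;
echo normalizeUrlSketch('/city/foo/123', $rules), PHP_EOL;
```

Under these assumptions the two calls should reproduce the first two rows of the table above.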

Readiness checklist

Changelog entry added, if necessary
Tests added for this feature/bug

labbati

Awesome stuff here! 😄
Just added a few comments if they make sense to you

labbati · 2019-05-22T09:04:45Z

src/DDTrace/Bootstrap.php

@@ -112,6 +110,12 @@ private static function initRootSpan()
            $span->setTag(Tag::HTTP_URL, $_SERVER['REQUEST_URI']);
            // Status code defaults to 200, will be later on changed when http_response_code will be called
            $span->setTag(Tag::HTTP_STATUS_CODE, 200);
+            // Normalized URL as the resource name
+            $normalizer = new Urls(explode(',', getenv('DD_TRACE_RESOURCE_URI_MAPPING')));


A couple of comments here, if they make sense to you:

While I see our conversation about simplifying the way we read configurations, wouldn't it be more strategic not to use getenv here? What I mean is that you may just want to move this getenv() call to a method in Config and not use all the underlying infrastructure.

I was evaluating the pros and cons of doing this here versus as a post-processing operation. Doing it here means that devs cannot, for example, set this env var in their index.php or the like. In the end we need that info only at the very end, just before we send the trace. Wouldn't you think it is a good idea to push this setting to the very end?

Great points!

> What I mean is that you may just want to move this getenv() call to a method in Config and not use all the underlying infrastructure.

I'm certainly open to moving the getenv() to an abstraction. :) What do you see as the main advantage for abstracting this one-time env access?

> Wouldn't you think it is a good idea to push this setting to the very end?

This is a great idea! I'll move it to right before the flush... maybe we should start a discussion about possibly adding span filters in a future PR? :)
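
As a rough illustration of the two suggestions in this thread, the sketch below wraps the env lookup in a small accessor and reads it on demand, so a value set at runtime (for example in index.php) is still picked up. The function name `resourceUriMappingSketch` is hypothetical and is not the `Config` API of the tracer.

```php
<?php

/**
 * Hypothetical Config-style accessor; the name and location are assumptions.
 */
function resourceUriMappingSketch()
{
    $raw = getenv('DD_TRACE_RESOURCE_URI_MAPPING');
    if ($raw === false || trim($raw) === '') {
        // Unset or empty: no custom rules. Note that explode(',', '') would
        // return [''] rather than an empty array.
        return [];
    }
    // Trim whitespace around each rule and drop empty entries.
    return array_values(array_filter(array_map('trim', explode(',', $raw)), 'strlen'));
}

// Because the variable is read on demand, a value set at runtime is used.
putenv('DD_TRACE_RESOURCE_URI_MAPPING=/user/*, /city/$*/*');
var_dump(resourceUriMappingSketch());
```

Calling the accessor just before the flush, rather than while the root span is created, is what would let application code define the rules after the tracer has bootstrapped.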

labbati · 2019-05-22T09:09:40Z

src/DDTrace/Configuration/WildcardToRegex.php

@@ -0,0 +1,48 @@
+<?php
+
+namespace DDTrace\Configuration;


I love this class, I wonder why it is in the Configuration namespace though? :)

Very good point. Naming is hard. 😄

What do you think about adding an Obfuscation namespace? We could also move the existing obfuscation stuff to that namespace in a separate PR. And possibly organize that code a bit better. What do you think? :)

labbati · 2019-05-22T09:14:48Z

src/DDTrace/Http/Urls.php

+        // UUID's
+        '|\b([0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89AB][0-9a-f]{3}-[0-9a-f]{12})\b|i',
+        // 16-512 bit hex hashes
+        '|\b([0-9a-f]{4,128})\b|i',


To be honest I would not add this. Isn't it too risky to have false positives with a 4-128 range? WDYT?

From a security standpoint, I'd rather err on the side of over-obfuscation, but I totally see your point as well.

I was thinking about valid words that are 4+ characters that contain only a-f (like cafe). One way around erroneous obfuscation would be curating a whitelist of valid words that are 4+ and a-f. Perhaps we could eventually offer a feature to add words to the whitelist in the event there is a proper noun or a made-up word that's important (like affeca). There are also i18n considerations to support other languages.

In the end, a configurable whitelist that's also very fast (maybe making use of a Trie data structure or something) would be a bit of an undertaking, but it might be worth the tradeoff to ensure better security.

If you're OK with it, I'd like to ping someone from the security department to weigh in on this one. :)

For now I've narrowed the scope of this to start at 32-bit hashes since an 8-character word with only a-f is much less likely to occur. :)
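
The false-positive concern can be checked with a short sketch. The helper `obfuscateHexSketch` is hypothetical, and the narrowed quantifier `{8,128}` is an assumption derived from the "32-bit" and "8-character" wording above, not copied from the final commit.

```php
<?php

/**
 * Hypothetical helper: replaces runs of hex characters that are at least
 * $minLength (and at most 128) characters long with "?".
 */
function obfuscateHexSketch($url, $minLength)
{
    $pattern = '|\b([0-9a-f]{' . (int) $minLength . ',128})\b|i';
    return preg_replace($pattern, '?', $url);
}

foreach (['/menu/cafe', '/commit/8f3fdcf1'] as $url) {
    // Compare the original 4-character minimum with an 8-character minimum.
    printf("%s  min 4: %s  min 8: %s\n", $url, obfuscateHexSketch($url, 4), obfuscateHexSketch($url, 8));
}
```

With a minimum of 4 the word `cafe` is obfuscated, while a minimum of 8 leaves it alone and still catches the 8-character hash.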

labbati · 2019-05-22T09:16:29Z

src/DDTrace/Http/Urls.php

 /**
 * A utility class that provides methods to work on urls
 */
 class Urls
 {
+    private static $defaultPatterns = [
+        // UUID's
+        '|\b([0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89AB][0-9a-f]{3}-[0-9a-f]{12})\b|i',


This is just based on my previous experience, can we make the - optional? We slightly increase the risk to get false positives here but it is limited to only hex strings with one exact number of chars, which is perfectly acceptable in my opinion.

I really appreciate the detailed regex considering [89AB] from the spec. I would ask you though:

to add a comment to the section in wikipedia explaining it as it surprised me :) and I had to go through a few google links to find it 😄

for consistency with the other parts of the expression can we make it lowercase? The pattern is case-insensitive, so it should make no difference.

> This is just based on my previous experience, can we make the - optional?

Good call! There's a test for this and it is obfuscated by the regex that catches "16-512 bit hex hashes". But since we may end up changing/removing that regex based on your other comment above, I'll change the regex to make the dashes optional. :)

> to add a comment to the section in wikipedia explaining it

Good idea! I'll add that. :)

> for consistency with the other parts of the expression can we make it lowercase

Another good idea. I'll change that. :)
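
A sketch of what the revised UUID pattern might look like once the dashes are optional and the variant class is lowercase. The exact regex that landed in the PR is not shown in this thread, so the pattern and the helper name `obfuscateUuidSketch` below are assumptions.

```php
<?php

/**
 * Hypothetical helper: replaces UUIDs (with or without dashes) with "?".
 * [1-5] is the UUID version digit and [89ab] the variant digit (RFC 4122).
 */
function obfuscateUuidSketch($url)
{
    $pattern = '|\b([0-9a-f]{8}-?[0-9a-f]{4}-?[1-5][0-9a-f]{3}-?[89ab][0-9a-f]{3}-?[0-9a-f]{12})\b|i';
    return preg_replace($pattern, '?', $url);
}

echo obfuscateUuidSketch('/orders/123e4567-e89b-12d3-a456-426614174000'), PHP_EOL;
echo obfuscateUuidSketch('/orders/123E4567E89B12D3A456426614174000'), PHP_EOL;
```

Because of the `i` flag, writing the variant class as `[89ab]` instead of `[89AB]` does not change what matches.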

labbati · 2019-05-22T09:31:00Z

src/DDTrace/Http/Urls.php

+     */
+    public function __construct(array $patternsWithWildcards = [])
+    {
+        $this->replacementPatterns = array_map(


[Optional]: this is a matter of style and this comment is totally optional. I love to Ctrl+click around to see who is using a method. While I know that unit tests would detect something wrong here, I still prefer the plain old for loop. This is a personal preference :) don't hate me 😄

Lol - I don't mind refactoring this to a foreach loop since we aren't able to support ClassName::class. :)

labbati · 2019-05-22T09:35:56Z

tests/Unit/Configuration/WildcardToRegexTest.php

+    public function wildcardToRegexExamples()
+    {
+        return [
+            ['/foo/*/bar',                  ['|^/foo/.+/bar$|', '/foo/?/bar']],


What happens to ^/foo/.+/bar$ with the pattern /foo/123/legitimate/bar?

The wildcards are greedy by default. Do you think we should add a non-greedy wildcard like *?? :)

Let's leave it like this, and users can add their own specific paths. Also, if we receive feedback about it then we can iterate on it.
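
For reference, the behaviour being discussed can be reproduced with the regex and replacement from the test case. The segment-limited pattern using `[^/]+` is a hypothetical alternative and is not part of the PR.

```php
<?php

$url = '/foo/123/legitimate/bar';

// Regex and replacement from the test case: ".+" also matches "/", so the
// two middle segments collapse into a single "?".
$greedy = preg_replace('|^/foo/.+/bar$|', '/foo/?/bar', $url);

// A lazy quantifier does not help here: the "$" anchor still forces ".+?"
// to expand across both middle segments.
$lazyMatches = preg_match('|^/foo/.+?/bar$|', $url);

// Hypothetical single-segment wildcard: "[^/]+" cannot cross a "/".
$segmentMatches = preg_match('|^/foo/[^/]+/bar$|', $url);

var_dump($greedy, $lazyMatches, $segmentMatches);
```

So a `*?` style wildcard would only change the outcome if it were translated to something that excludes the path separator, rather than to a lazy `.+?`.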


labbati · 2019-05-23T15:28:15Z

src/DDTrace/Integrations/Lumen/V5/LumenIntegrationLoader.php

@@ -34,7 +34,7 @@ public function load()

        dd_trace('Laravel\Lumen\Application', 'dispatch', function () use ($span) {
            $response = dd_trace_forward_call();
-            $resourceName = 'unnamed_route';
+            $resourceName = null;


Look at who is sneaking in 😄 👍

labbati

Hey just a couple of things that I think we forgot about and then we are in awesome-land :)

labbati · 2019-05-23T15:30:18Z

tests/Unit/Configuration/WildcardToRegexTest.php

@@ -0,0 +1,29 @@
+<?php


Should we move this to the Obfuscation namespace as well?

Doh! 🤦‍♂ Nice catch! 😄

labbati · 2019-05-23T15:31:49Z

tests/Unit/Configuration/WildcardToRegexTest.php

+    public function wildcardToRegexExamples()
+    {
+        return [
+            ['/foo/*/bar',                  ['|^/foo/.+/bar$|', '/foo/?/bar']],


Let's leave it like this, and users can add their own specific paths. Also, if we receive feedback about it then we can iterate on it.

labbati · 2019-05-23T15:33:19Z

src/DDTrace/Tracer.php

@@ -399,6 +404,25 @@ private function addHostnameToRootSpan()
        }
    }

+    private function addUrlAsResourceNameToRootSpan()
+    {


I think the only thing we miss is to make this OFF by default, in order to prevent unexpected high-cardinality (and therefore billing) issues out of the blue.
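
A minimal sketch of an opt-in check for this feature, using the `DD_TRACE_URL_AS_RESOURCE_NAMES_ENABLED` variable named in the follow-up commit. The function name and the set of values treated as "enabled" are assumptions, not the tracer's actual parsing rules.

```php
<?php

/**
 * Hypothetical check; which values count as "enabled" is an assumption.
 */
function urlAsResourceNamesEnabledSketch()
{
    $raw = getenv('DD_TRACE_URL_AS_RESOURCE_NAMES_ENABLED');
    if ($raw === false) {
        // OFF by default when the variable is not set.
        return false;
    }
    return in_array(strtolower(trim($raw)), ['1', 'true'], true);
}

putenv('DD_TRACE_URL_AS_RESOURCE_NAMES_ENABLED=true');
var_dump(urlAsResourceNamesEnabledSketch());
```

Defaulting to off means existing installations keep their current resource names until they explicitly opt in.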

labbati

Love this new feature. Excellent work!

SammyK added the 🏆 enhancement (A new feature or improvement), ☠️ do-not-merge/WIP, 🍏 core (Changes to the core tracing functionality), and 🎉 new-integration (A new integration) labels May 16, 2019

SammyK added this to the 0.24.0 milestone May 16, 2019

SammyK removed the ☠️ do-not-merge/WIP label May 16, 2019

SammyK marked this pull request as ready for review May 16, 2019 22:06

This was referenced May 16, 2019

Disable tracing for specific URIs #293

Closed

Resources on Datadog Dashboard fail to distinguish individual endpoints #427

Closed

pawelchcki modified the milestones: 0.24.0, 0.25.0 May 17, 2019

SammyK force-pushed the sammyk/uri-to-resource-normalization branch from 8f3fdcf to 4de654d Compare May 17, 2019 16:55

SammyK mentioned this pull request May 17, 2019

Lumen Integration not registering Services/Resources based on routes #438

Closed

labbati previously approved these changes May 22, 2019

SammyK dismissed labbati’s stale review via d45db31 May 22, 2019 17:22

SammyK force-pushed the sammyk/uri-to-resource-normalization branch 2 times, most recently from d45db31 to a5843a9 Compare May 22, 2019 17:25

SammyK added 9 commits May 23, 2019 10:58

Add normalized URL as the resource name

4cba7de

Update custom integration tests

70df6e1

Update CHANGELOG

515e367

Validate UUIDs without dashes and add comments on UUID regex

ee512dd

Refactor usage of array_map to foreach for better IDE support

cd25d67

Narrow the hash obfuscation to start with 32 bit minimum to avoid false-positive matches

2f12a3b

Move setting the URL as resource name to just before flush to allow setting custom URL rules at runtime

ec5a68a

Move WildcardToRegex to Obfuscation namespace

ff3962a

Update Lumen integration to fall back to default resource name

b4e06ae

SammyK force-pushed the sammyk/uri-to-resource-normalization branch from a5843a9 to b4e06ae Compare May 23, 2019 14:59

labbati reviewed May 23, 2019

labbati requested changes May 23, 2019

Disable URL-as-resource name feature by default; to enable set DD_TRACE_URL_AS_RESOURCE_NAMES_ENABLED=true

31df8e9

SammyK added 2 commits May 23, 2019 12:24

Move WildcardToRegex test to the proper namespace

da9d677

Mention DD_TRACE_URL_AS_RESOURCE_NAMES_ENABLED in CHANGELOG

60af700

labbati approved these changes May 24, 2019

labbati merged commit f1f0e2a into master May 24, 2019

pawelchcki mentioned this pull request May 27, 2019

Release 0.25.0 #453

Merged

vnwhlr mentioned this pull request May 30, 2019

URL normalization mangles path version #455

Closed

SammyK mentioned this pull request May 31, 2019

Narrow the URL normalization rules boundaries #457

Merged

2 tasks

SammyK mentioned this pull request Oct 4, 2019

Craft CMS Trace Resource all 'web.request' #605

Closed
