Surface CDT's "third party badges" as a dimension #12

Closed
igrigorik opened this issue Jan 2, 2018 · 15 comments

@igrigorik
Collaborator

Could we add new columns (name + type) in our requests table to record the 3P badges? Also, could we propagate those tags down to any child requests? E.g. if a.js is tagged as XYZ corp, and fetches b.jpg, the latter should have the same tag.
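
Something like this is what I have in mind for the propagation step (a rough sketch, only one hop deep; the tagged_requests table and the third_party_name / third_party_type / initiator_url columns are all hypothetical):

-- Rough sketch of one-hop tag propagation via the initiator; the
-- tagged_requests table and all column names here are hypothetical.
SELECT
  child.page,
  child.url,
  COALESCE(child.third_party_name, parent.third_party_name) AS third_party_name,
  COALESCE(child.third_party_type, parent.third_party_type) AS third_party_type
FROM
  tagged_requests AS child
LEFT JOIN
  tagged_requests AS parent
ON
  child.page = parent.page
  AND child.initiator_url = parent.url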

AFAIK, CDT does this analysis at runtime, so it's not available in the trace? Should it be, or do we need to integrate this into our own analysis pipeline?

/cc @paulirish @rviscomi

@pmeenan
Member

pmeenan commented Jan 2, 2018

I don't think the attribution itself is in the raw data. The best way would probably be to implement it in the WebPageTest agent and maintain a copy of the lookup table there. The dependency chains could also be followed on the agent (though if the attribution were added to the trace or DevTools events, a parallel table wouldn't have to be maintained).

Presumably we'd also want some high-level stats at the page level for breaking out 1st vs 3rd party counts and sizes.
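
For instance (just a sketch, assuming a hypothetical per-request is_third_party flag and resp_body_size column rather than the current schema), the page-level rollup could look like:

-- Sketch only: is_third_party, resp_body_size, and the requests table
-- here are hypothetical, not the current HTTP Archive schema.
SELECT
  page,
  COUNTIF(NOT is_third_party) AS first_party_requests,
  COUNTIF(is_third_party) AS third_party_requests,
  SUM(IF(NOT is_third_party, resp_body_size, 0)) AS first_party_bytes,
  SUM(IF(is_third_party, resp_body_size, 0)) AS third_party_bytes
FROM
  requests
GROUP BY
  page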

@igrigorik
Collaborator Author

@paulirish any thoughts on moving the tagging upstream in CDT, to make it part of the DevTools trace? Seems like a generally useful feature for various downstream consumers, no? :)

@pmeenan the agent approach makes sense (assuming we can't move it even further upstream). Ditto for page-level stats, though those we could also produce outside of the agent when we do our aggregations.


A counter-argument to moving this logic upstream: if the lookup tables are periodically updated and improved, having the identification logic live in our pipeline lets us rerun it and regenerate the classification. @paulirish how often (if at all) is the CDT classification updated?

@LeslieMurphy

Can I assist with implementing this? Let me know how I could get started.

My thoughts are:

  • We would do this logic inside of WPT, and have the data flow into the tables in BigQuery
  • pages - add numDomainsThirdParty, add reqTotalThirdParty, add bytesTotalThirdParty
  • requests - add isThirdParty, thirdPartyCategory

The logic I've used in the past (prior to Google creating CDT third-party badges) relied on regexp pattern matching against known third-party domains. Once I detected that a resource was third party, any subsequent requests with a referrer from that resource would "inherit" the third-party attribute.

I'm not sure if we capture the Chrome resource initiator inside of WPT, but if we do, then we could do an even better job of third-party detection.
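
To make that concrete, here's a rough BigQuery sketch of the idea; the requests table shape, the req_referer column, and the regexp (just two example patterns) are placeholders, and it only does one level of inheritance:

-- Rough sketch: regexp-based detection plus one level of referrer
-- "inheritance". Table shape, req_referer, and the patterns are placeholders.
WITH direct AS (
  SELECT
    page,
    url,
    req_referer,
    REGEXP_CONTAINS(url, r'(google-analytics\.com|doubleclick\.net)') AS is_third_party
  FROM
    requests
)
SELECT
  child.page,
  child.url,
  child.is_third_party OR IFNULL(parent.is_third_party, FALSE) AS is_third_party
FROM
  direct AS child
LEFT JOIN
  direct AS parent
ON
  child.page = parent.page
  AND child.req_referer = parent.url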

A question on the Third Party database - can it be configured to detect this resource:
https://d1z2jf7jlzjs58.cloudfront.net/code/ptrack-v1.0.3-engagedtime.js
(this is the first resource in a chain that loads additional parsely.com resources into the page -- these are for the parse.ly audience insights platform.)

Proper detection for parse.ly requires more than just looking at the domain, because there are lots of cloudfront.net resources that are part of customer websites.

/cc @rviscomi

@rviscomi rviscomi added the P1 label Jan 7, 2019
@rviscomi
Member

rviscomi commented Jan 7, 2019

Bumping the priority of this feature. @LeslieMurphy let me know if you're still interested in working on this.

@igrigorik
Collaborator Author

A question on the Third Party database - can it be configured to detect this resource:
https://d1z2jf7jlzjs58.cloudfront.net/code/ptrack-v1.0.3-engagedtime.js
(this is the first resource in a chain that loads additional parsely.com resources into the page -- these are for the parse.ly audience insights platform.)

This is an important point to get right. We want to make sure that we attribute all of the requests to the right parent/initiator, and then add another layer of smarts that tags these initiators against a set of categories like analytics, advertising, social, etc.

This could be done at runtime within WPT, or after the fact based on the dependency tree... but that also means we need high confidence that all of the edges in the dependency tree are present. Do we?
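
As a sketch of the after-the-fact option, assuming we had a clean table of initiator edges (the request_edges table with page / url / initiator_url columns is hypothetical), following the chain a couple of hops down from a known third-party root could look something like:

-- Sketch: follow initiator edges two hops down from known third-party
-- roots. Table and column names are illustrative, not real schema; a real
-- implementation would iterate or recurse until the chain is exhausted.
WITH edges AS (
  SELECT page, url, initiator_url
  FROM request_edges
),
roots AS (
  SELECT page, url
  FROM edges
  WHERE REGEXP_CONTAINS(url, r'parsely\.com')
),
hop1 AS (
  SELECT e.page, e.url
  FROM edges AS e
  JOIN roots AS r
    ON e.page = r.page
    AND e.initiator_url = r.url
),
hop2 AS (
  SELECT e.page, e.url
  FROM edges AS e
  JOIN hop1 AS h
    ON e.page = h.page
    AND e.initiator_url = h.url
)
SELECT page, url FROM roots
UNION DISTINCT
SELECT page, url FROM hop1
UNION DISTINCT
SELECT page, url FROM hop2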

@rviscomi
Member

rviscomi commented Jan 8, 2019

As of now we don't have a complete dependency tree. About 28% of requests are missing the "initiator" field in the HAR payload. I opened this thread on the WPT forum to see if this is an upstream issue. cc @pmeenan

Once we get that sorted out, it should be straightforward to follow the chain of initiators from a known third party to all of its dependent requests. We could use a technique similar to @paulcalvano's, where we join a table of known third parties and their host names with the HTTP Archive requests to better understand which requests are third parties, what type of third party they are (ads, analytics, etc.), and what they are loading/doing. This wouldn't require any pre-processing of the requests and could be done entirely in BigQuery.
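
Roughly something like this sketch (summary_requests is a real dataset, but the third_party_domains table and its domain/category columns are stand-ins for the known-third-parties data):

-- Sketch of the hostname join; third_party_domains / domain / category are
-- placeholders for the actual known-third-parties dataset.
SELECT
  tp.category,
  COUNT(0) AS requests,
  SUM(r.respBodySize) / POW(1024, 3) AS body_gb
FROM
  `httparchive.summary_requests.2018_12_15_desktop` AS r
JOIN
  third_party_domains AS tp
ON
  NET.HOST(r.url) = tp.domain
GROUP BY
  tp.category
ORDER BY
  requests DESC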

@pmeenan
Member

pmeenan commented Jan 8, 2019

As far as I know, I report all of the initiator information that DevTools collects. One thing we discussed maybe adding is to associate all unknown requests in a sub-frame with the main request for that frame, which should help with attribution for ads.
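
If we did add that, a downstream consumer could approximate the fallback with something like this sketch (the _frame_id field and the frame_documents table are assumptions here, not guaranteed to exist in the HAR):

-- Sketch of the sub-frame fallback: requests with no initiator inherit the
-- main document URL of their frame. _frame_id and frame_documents are
-- assumptions, not guaranteed HAR fields.
SELECT
  r.page,
  r.url,
  COALESCE(
    JSON_EXTRACT_SCALAR(r.payload, '$._initiator'),
    f.main_document_url
  ) AS effective_initiator
FROM
  `httparchive.requests.2018_12_15_desktop` AS r
LEFT JOIN
  frame_documents AS f
ON
  r.page = f.page
  AND JSON_EXTRACT_SCALAR(r.payload, '$._frame_id') = f.frame_id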

@rviscomi
Member

rviscomi commented Jan 9, 2019

tldr: The drop in initiator reliability correlates with M70+.

The website I included as an example in the WPT thread is www.usedtrucks.mercedes-benz.co.uk/. In the most recent crawl, only the initial HTML request is annotated with the expected initiator (empty string). The field is omitted entirely from all other requests.

When I manually test the page in Chrome (version 71.0.3578.98), I do see the expected initiator data:

[screenshot of the expected initiator data]

I just retested the page in WPT and now I actually do see initiator fields in the HAR consistently for all 3 runs: https://www.webpagetest.org/result/190109_5H_10eab806c99209dd025fc14b48f8d820/

[screenshot of the WPT result showing initiator fields]

We've been testing this particular URL in HA since July 2018, so we can see if the percent of requests with an initiator field has changed:

SELECT
  _TABLE_SUFFIX AS crawl,
  SUM(IF(JSON_EXTRACT(payload, '$._initiator') IS NOT NULL, 1, 0)) / COUNT(0) AS pct_initiators
FROM
  `httparchive.requests.*`
WHERE
  page = 'http://www.usedtrucks.mercedes-benz.co.uk/'
GROUP BY
  crawl
HAVING
  pct_initiators IS NOT NULL
ORDER BY
  crawl

Surprisingly, things took a nosedive on October 15:

date        desktop   mobile
2018_07_01  100.00%   -
2018_07_15  100.00%   98.37%
2018_08_01  100.00%   100.00%
2018_08_15  100.00%   100.00%
2018_09_01  100.00%   100.00%
2018_09_15  100.00%   100.00%
2018_10_01  100.00%   100.00%
2018_10_15  0.36%     0.36%
2018_11_01  0.34%     0.34%
2018_11_15  0.34%     0.34%
2018_12_01  0.34%     0.34%
2018_12_15  0.32%     0.32%

And when we look at initiators for all requests on all pages, things are interesting:

[chart of the initiator coverage data below]

date desktop mobile
2018_07_01 99.08% 99.01%
2018_07_15 99.00% 99.00%
2018_08_01 99.08% 99.02%
2018_08_15 99.08% 99.08%
2018_09_01 99.04% 99.06%
2018_09_15 99.04% 99.08%
2018_10_01 99.04% 99.06%
2018_10_15 59.81% 54.89%
2018_11_01 42.76% 44.94%
2018_11_15 42.46% 44.81%
2018_12_01 41.65% 43.40%
2018_12_15 72.21% 60.01%

Again, things changed globally on October 15. And December 15 was actually much better in terms of coverage than crawls since November.

Looking at the Chrome versions during this timeframe, it seems like we switched from Chrome 69 to 70. So I wonder if there were some unexpected reliability issues in M70+ with the initiator field.

date        68      69       70       71
2018_09_01  50.64%  49.36%   -        -
2018_09_15  -       100.00%  -        -
2018_10_01  -       100.00%  -        -
2018_10_15  -       24.33%   75.67%   -
2018_11_01  0.06%   0.17%    99.77%   -
2018_11_15  -       -        100.00%  -
2018_12_01  -       -        87.02%   12.98%
2018_12_15  -       -        0.07%    99.93%

One thing I can't explain is why the Mercedes example from the December 15 crawl had ~0% initiators in Chrome 71.0.3578.98, but in my ad hoc test today 100% of initiators are present in the exact same browser version.

@pmeenan
Member

pmeenan commented Jan 10, 2019 via email

@rviscomi
Member

Ok, let's wait for the 1/1 crawl to complete and see if the initiators are appearing as expected.

@rviscomi
Member

rviscomi commented Feb 7, 2019

Here's an updated table of initiator coverage with 2019_01_01:

date desktop mobile
2018_09_01 99.04% 99.06%
2018_09_15 99.04% 99.08%
2018_10_01 99.04% 99.06%
2018_10_15 59.81% 54.89%
2018_11_01 42.76% 44.94%
2018_11_15 42.46% 44.81%
2018_12_01 41.65% 43.40%
2018_12_15 72.21% 60.01%
2019_01_01 83.90% 83.90%

(yes, desktop and mobile actually come out to the same rounded value)

There's definitely been some improvement, but we're still not quite back to the normal ~99%.

@pmeenan
Member

pmeenan commented Feb 8, 2019

99% seems unrealistically high. In normal testing I see a few URLs per page that have "other" as the initiator in the raw DevTools data. I can include that if it would help, but it amounts to "unknown".

I did JUST take a look and push an improvement for cases where the initiator was a JavaScript call stack that referenced a script ID but didn't include the script URL. That can happen when a script inserts another script directly into the DOM (or does an eval), so now I monitor all script compilations and walk the call stack for every script ID to see what caused the script to be added, and use that as the initiator in those cases.

We're around 1/3 of the way into this month's crawl, so there will be a decent bump in coverage and then another small bump in March (assuming nothing in Chrome changes between now and then).

@rviscomi
Member

rviscomi commented Feb 8, 2019

Copying an image from an earlier comment for clarification:

[chart of initiator coverage over time]

Are you saying the ~99% we saw from May 2017 to October 2018 was anomalous and the ~80% before and after that range is the realistic expectation?

@pmeenan
Member

pmeenan commented Feb 8, 2019

Yep, pretty much (or the extra 19% had an empty but present initiator). At a minimum, the main document request usually doesn't have one, and AFAIK neither do iframe src URLs (in addition to a bunch of other edge cases). 80% or so sounds like a good expectation.

@rviscomi
Member

rviscomi commented Jul 7, 2020

Closing this out. We can join with @patrickhulce's third-party-web data on BigQuery to achieve the same effect. See this example query.
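
For reference, the shape of that join is roughly this sketch (the third-party table name and columns here may differ from the canonical example query linked above):

-- Sketch of joining HTTP Archive requests with third-party-web data;
-- the third-party table name and columns may differ from the linked query.
SELECT
  tp.category,
  COUNT(0) AS requests
FROM
  `httparchive.summary_requests.2020_06_01_desktop` AS r
JOIN
  `httparchive.almanac.third_parties` AS tp
ON
  NET.HOST(r.url) = tp.domain
GROUP BY
  tp.category
ORDER BY
  requests DESC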

@rviscomi rviscomi closed this as completed Jul 7, 2020