Surface CDT's "third party badges" as a dimension #12

Closed
igrigorik opened this issue Jan 2, 2018 · 15 comments

@igrigorik
Collaborator

Could we add new columns (name + type) in our requests table to record the 3P badges? Also, could we propagate those tags down to any child requests? E.g. if a.js is tagged as XYZ corp, and fetches b.jpg, the latter should have the same tag.
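
Something like this is what I have in mind for the propagation step (a rough sketch, only one hop deep; the tagged_requests table and the third_party_name / third_party_type / initiator_url columns are all hypothetical):

-- Rough sketch of one-hop tag propagation via the initiator; the
-- tagged_requests table and all column names here are hypothetical.
SELECT
  child.page,
  child.url,
  COALESCE(child.third_party_name, parent.third_party_name) AS third_party_name,
  COALESCE(child.third_party_type, parent.third_party_type) AS third_party_type
FROM
  tagged_requests AS child
LEFT JOIN
  tagged_requests AS parent
ON
  child.page = parent.page
  AND child.initiator_url = parent.url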

AFAIK, CDT does this analysis at runtime, so it's not available in the trace? Should it be, or do we need to integrate this into our own analysis pipeline?

/cc @paulirish @rviscomi

@pmeenan
Member

pmeenan commented Jan 2, 2018

I don't think the attribution itself is in the raw data. The best way would probably be to implement it in the WebPageTest agent and maintain a copy of the lookup table there. The dependency chains could also be followed on the agent (though if the attribution were added to the trace or DevTools events, a parallel table wouldn't have to be maintained).

Presumably we'd also want some high-level stats at the page level for breaking out 1st vs 3rd party counts and sizes.
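
For instance (just a sketch, assuming a hypothetical per-request is_third_party flag and resp_body_size column rather than the current schema), the page-level rollup could look like:

-- Sketch only: is_third_party, resp_body_size, and the requests table
-- here are hypothetical, not the current HTTP Archive schema.
SELECT
  page,
  COUNTIF(NOT is_third_party) AS first_party_requests,
  COUNTIF(is_third_party) AS third_party_requests,
  SUM(IF(NOT is_third_party, resp_body_size, 0)) AS first_party_bytes,
  SUM(IF(is_third_party, resp_body_size, 0)) AS third_party_bytes
FROM
  requests
GROUP BY
  page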

@igrigorik
Collaborator Author

@paulirish any thoughts on moving the tagging upstream in CDT, to make it part of the DevTools trace? Seems like a generally useful feature for various downstream consumers, no? :)

@pmeenan the agent approach makes sense (assuming we can't move it even further upstream). Ditto for page-level stats, though those we could also produce outside of the agent when we do our aggregations.


A counter-argument to moving this logic upstream: if the lookup tables are periodically updated and improved, having the identification logic live in our pipeline lets us rerun it and regenerate the classification. @paulirish how often (if at all) is the CDT classification updated?

@LeslieMurphy

Can I assist with implementing this? Let me know how I could get started.

My thoughts are:

  • We would do this logic inside of WPT, and have the data flow into the tables in BigQuery
  • pages - add numDomainsThirdParty, add reqTotalThirdParty, add bytesTotalThirdParty
  • requests - add isThirdParty, thirdPartyCategory

The logic I've used in the past (prior to Google creating CDT third-party badges) relied on regexp pattern matching against known third-party domains. Once I detected that a resource was third party, any subsequent requests with a referrer from that resource would "inherit" the third-party attribute.

I'm not sure if we capture the Chrome resource initiator inside of WPT, but if we do, then we could do an even better job of third-party detection.
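
To make that concrete, here's a rough BigQuery sketch of the idea; the requests table shape, the req_referer column, and the regexp (just two example patterns) are placeholders, and it only does one level of inheritance:

-- Rough sketch: regexp-based detection plus one level of referrer
-- "inheritance". Table shape, req_referer, and the patterns are placeholders.
WITH direct AS (
  SELECT
    page,
    url,
    req_referer,
    REGEXP_CONTAINS(url, r'(google-analytics\.com|doubleclick\.net)') AS is_third_party
  FROM
    requests
)
SELECT
  child.page,
  child.url,
  child.is_third_party OR IFNULL(parent.is_third_party, FALSE) AS is_third_party
FROM
  direct AS child
LEFT JOIN
  direct AS parent
ON
  child.page = parent.page
  AND child.req_referer = parent.url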

A question on the Third Party database - can it be configured to detect this resource:
https://d1z2jf7jlzjs58.cloudfront.net/code/ptrack-v1.0.3-engagedtime.js
(this is the first resource in a chain that loads additional parsely.com resources into the page -- these are for the parse.ly audience insights platform.)

Proper detection for parse.ly requires more than just looking at the domain, because there are lots of cloudfront.net resources that are part of customer websites.

/cc @rviscomi

@rviscomi rviscomi added the P1 label Jan 7, 2019
@rviscomi
Member

rviscomi commented Jan 7, 2019

Bumping the priority of this feature. @LeslieMurphy let me know if you're still interested in working on this.

@igrigorik
Collaborator Author

A question on the Third Party database - can it be configured to detect this resource:
https://d1z2jf7jlzjs58.cloudfront.net/code/ptrack-v1.0.3-engagedtime.js
(this is the first resource in a chain that loads additional parsely.com resources into the page -- these are for the parse.ly audience insights platform.)

This is an important point to get right. We want to make sure that we attribute all of the requests to the right parent/initiator, and then add another layer of smarts that tags these initiators against a set of categories like analytics, advertising, social, etc.

This could be done at runtime within WPT, or after the fact based on the dependency tree... but that also means we need high confidence that all of the edges in the dependency tree are present. Do we?
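
As a sketch of the after-the-fact option, assuming we had a clean table of initiator edges (the request_edges table with page / url / initiator_url columns is hypothetical), following the chain a couple of hops down from a known third-party root could look something like:

-- Sketch: follow initiator edges two hops down from known third-party
-- roots. Table and column names are illustrative, not real schema; a real
-- implementation would iterate or recurse until the chain is exhausted.
WITH edges AS (
  SELECT page, url, initiator_url
  FROM request_edges
),
roots AS (
  SELECT page, url
  FROM edges
  WHERE REGEXP_CONTAINS(url, r'parsely\.com')
),
hop1 AS (
  SELECT e.page, e.url
  FROM edges AS e
  JOIN roots AS r
    ON e.page = r.page
    AND e.initiator_url = r.url
),
hop2 AS (
  SELECT e.page, e.url
  FROM edges AS e
  JOIN hop1 AS h
    ON e.page = h.page
    AND e.initiator_url = h.url
)
SELECT page, url FROM roots
UNION DISTINCT
SELECT page, url FROM hop1
UNION DISTINCT
SELECT page, url FROM hop2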

@rviscomi
Member

rviscomi commented Jan 8, 2019

As of now we don't have a complete dependency tree. About 28% of requests are missing the "initiator" field in the HAR payload. I opened this thread on the WPT forum to see if this is an upstream issue. cc @pmeenan

Once we get that sorted out, it should be straightforward to follow the chain of initiators from a known third party to all of its dependent requests. We could use a technique similar to @paulcalvano's, where we join a table of known third parties and their host names with the HTTP Archive requests to better understand which requests are third parties, what type of third party they are (ads, analytics, etc.), and what they are loading/doing. This wouldn't require any pre-processing of the requests and could be done entirely in BigQuery.
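
Roughly something like this sketch (summary_requests is a real dataset, but the third_party_domains table and its domain/category columns are stand-ins for the known-third-parties data):

-- Sketch of the hostname join; third_party_domains / domain / category are
-- placeholders for the actual known-third-parties dataset.
SELECT
  tp.category,
  COUNT(0) AS requests,
  SUM(r.respBodySize) / POW(1024, 3) AS body_gb
FROM
  `httparchive.summary_requests.2018_12_15_desktop` AS r
JOIN
  third_party_domains AS tp
ON
  NET.HOST(r.url) = tp.domain
GROUP BY
  tp.category
ORDER BY
  requests DESC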

@pmeenan
Member

pmeenan commented Jan 8, 2019

As far as I know, I report all of the initiator information that DevTools collects. One thing we discussed maybe adding is to associate all unknown requests in a sub-frame with the main request for that frame, which should help with attribution for ads.
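
If we did add that, a downstream consumer could approximate the fallback with something like this sketch (the _frame_id field and the frame_documents table are assumptions here, not guaranteed to exist in the HAR):

-- Sketch of the sub-frame fallback: requests with no initiator inherit the
-- main document URL of their frame. _frame_id and frame_documents are
-- assumptions, not guaranteed HAR fields.
SELECT
  r.page,
  r.url,
  COALESCE(
    JSON_EXTRACT_SCALAR(r.payload, '$._initiator'),
    f.main_document_url
  ) AS effective_initiator
FROM
  `httparchive.requests.2018_12_15_desktop` AS r
LEFT JOIN
  frame_documents AS f
ON
  r.page = f.page
  AND JSON_EXTRACT_SCALAR(r.payload, '$._frame_id') = f.frame_id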

@rviscomi
Member

rviscomi commented Jan 9, 2019

tldr: The drop in initiator reliability correlates with M70+.

The website I included as an example in the WPT thread is www.usedtrucks.mercedes-benz.co.uk/. In the most recent crawl, only the initial HTML request is annotated with the expected initiator (empty string). The field is omitted entirely from all other requests.

When I manually test the page in Chrome (version 71.0.3578.98), I do see the expected initiator data:

[screenshot of the expected initiator data]

I just retested the page in WPT and now I actually do see initiator fields in the HAR consistently for all 3 runs: https://www.webpagetest.org/result/190109_5H_10eab806c99209dd025fc14b48f8d820/

[screenshot of the WPT result showing initiator fields]

We've been testing this particular URL in HA since July 2018, so we can see if the percent of requests with an initiator field has changed:

SELECT
  _TABLE_SUFFIX AS crawl,
  SUM(IF(JSON_EXTRACT(payload, '$._initiator') IS NOT NULL, 1, 0)) / COUNT(0) AS pct_initiators
FROM
  `httparchive.requests.*`
WHERE
  page = 'http://www.usedtrucks.mercedes-benz.co.uk/'
GROUP BY
  crawl
HAVING
  pct_initiators IS NOT NULL
ORDER BY
  crawl

Surprisingly, things took a nosedive on October 15:

date        desktop   mobile
2018_07_01  100.00%   -
2018_07_15  100.00%   98.37%
2018_08_01  100.00%   100.00%
2018_08_15  100.00%   100.00%
2018_09_01  100.00%   100.00%
2018_09_15  100.00%   100.00%
2018_10_01  100.00%   100.00%
2018_10_15  0.36%     0.36%
2018_11_01  0.34%     0.34%
2018_11_15  0.34%     0.34%
2018_12_01  0.34%     0.34%
2018_12_15  0.32%     0.32%

And when we look at initiators for all requests on all pages, things are interesting:

[chart of the initiator coverage data below]

date desktop mobile
2018_07_01 99.08% 99.01%
2018_07_15 99.00% 99.00%
2018_08_01 99.08% 99.02%
2018_08_15 99.08% 99.08%
2018_09_01 99.04% 99.06%
2018_09_15 99.04% 99.08%
2018_10_01 99.04% 99.06%
2018_10_15 59.81% 54.89%
2018_11_01 42.76% 44.94%
2018_11_15 42.46% 44.81%
2018_12_01 41.65% 43.40%
2018_12_15 72.21% 60.01%

Again, things changed globally on October 15. And December 15 was actually much better in terms of coverage than crawls since November.

Looking at the Chrome versions during this timeframe, it seems like we switched from Chrome 69 to 70. So I wonder if there were some unexpected reliability issues in M70+ with the initiator field.

date        68      69       70       71
2018_09_01  50.64%  49.36%   -        -
2018_09_15  -       100.00%  -        -
2018_10_01  -       100.00%  -        -
2018_10_15  -       24.33%   75.67%   -
2018_11_01  0.06%   0.17%    99.77%   -
2018_11_15  -       -        100.00%  -
2018_12_01  -       -        87.02%   12.98%
2018_12_15  -       -        0.07%    99.93%

One thing I can't explain is why the Mercedes example from the December 15 crawl had ~0% initiators in Chrome 71.0.3578.98, but in my ad hoc test today 100% of initiators are present in the exact same browser version.

@pmeenan
Member

pmeenan commented Jan 10, 2019 via email

@rviscomi
Member

Ok, let's wait for the 1/1 crawl to complete and see if the initiators are appearing as expected.

@rviscomi
Member

rviscomi commented Feb 7, 2019

Here's an updated table of initiator coverage with 2019_01_01:

date desktop mobile
2018_09_01 99.04% 99.06%
2018_09_15 99.04% 99.08%
2018_10_01 99.04% 99.06%
2018_10_15 59.81% 54.89%
2018_11_01 42.76% 44.94%
2018_11_15 42.46% 44.81%
2018_12_01 41.65% 43.40%
2018_12_15 72.21% 60.01%
2019_01_01 83.90% 83.90%

(yes, desktop and mobile actually come out to the same rounded value)

There's definitely been some improvement, but we're still not quite back to the normal ~99%.

@pmeenan
Member

pmeenan commented Feb 8, 2019

99% seems unrealistically high. In normal testing I see a few URLs per page that have "other" as the initiator in the raw DevTools data. I can include that if it would help, but it amounts to "unknown".

I did JUST take a look and push an improvement for cases where the initiator was a JavaScript call stack that referenced a script ID but didn't include the script URL. That can happen when a script inserts another script directly into the DOM (or does an eval), so now I monitor all script compilations and walk the call stack for every script ID to see what caused the script to be added, and use that as the initiator in those cases.

We're around 1/3 of the way into this month's crawl, so there will be a decent bump in coverage and then another small bump in March (assuming nothing in Chrome changes between now and then).

@rviscomi
Member

rviscomi commented Feb 8, 2019

Copying an image from an earlier comment for clarification:

[chart of initiator coverage over time]

Are you saying the ~99% we saw from May 2017 to October 2018 was anomalous and the ~80% before and after that range is the realistic expectation?

@pmeenan
Member

pmeenan commented Feb 8, 2019

Yep, pretty much (or the extra 19% had an empty but present initiator). At a minimum, the main document request usually doesn't have one, and AFAIK neither do iframe src URLs (in addition to a bunch of other edge cases). 80% or so sounds like a good expectation.

@rviscomi
Member

rviscomi commented Jul 7, 2020

Closing this out. We can join with @patrickhulce's third-party-web data on BigQuery to achieve the same effect. See this example query.
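
For reference, the shape of that join is roughly this sketch (the third-party table name and columns here may differ from the canonical example query linked above):

-- Sketch of joining HTTP Archive requests with third-party-web data;
-- the third-party table name and columns may differ from the linked query.
SELECT
  tp.category,
  COUNT(0) AS requests
FROM
  `httparchive.summary_requests.2020_06_01_desktop` AS r
JOIN
  `httparchive.almanac.third_parties` AS tp
ON
  NET.HOST(r.url) = tp.domain
GROUP BY
  tp.category
ORDER BY
  requests DESC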

@rviscomi rviscomi closed this as completed Jul 7, 2020