Surface CDT's "third party badges" as a dimension #12
Comments
I don't think the attribution itself is in the raw data. The best way would probably be to implement it in the WebPageTest agent and maintain a copy of the lookup table there. The dependency chains could also be followed on the agent (though if the attribution was added to the trace or devtools events then a parallel table wouldn't have to be maintained). Presumably we'd also want some high-level stats at the page level for breaking out 1st vs 3rd party counts and sizes. |
@paulirish any thoughts on moving the tagging upstream in CDT, to make it part of the devtools trace? Seems like a generally useful feature for various downstream consumers, no? :) @pmeenan agent approach makes sense (assuming we can't move it even further upstream). Ditto for page-level stats, but those we could produce outside of the agent too when we do our aggregations. A counter argument for moving this logic upstream is if the tables are periodically updated and improved, having the identification logic live in our pipeline allows us to rerun the pipeline and regenerate the classification. @paulirish how often (if at all) is the CDT classification updated? |
Can I assist with implementing this? Let me know how I could get started. My thoughts are:
The logic I've used in the past (before Google created the CDT third party badges) used regexp pattern matching for 3rd party domains. Once I detected that a resource was third party, any subsequent requests that had a referrer from that resource would "inherit" the third party attribute. I'm not sure if we capture the Chrome resource initiator inside of WPT, but if we do, then we could do an even better job of handling 3rd party detection. A question on the Third Party database - can it be configured to detect this resource: Proper detection for parse.ly requires more than just looking at the domain, because there are lots of cloudfront.net resources that are part of customer websites. /cc @rviscomi |
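The regexp-plus-referrer approach described above could be sketched roughly as follows. This is only an illustration: the pattern list, the `(url, referrer)` tuple shape, and the function names are assumptions, not anything taken from WPT or CDT.

```python
import re
from urllib.parse import urlsplit

# Hypothetical patterns; a real list would be much longer and curated.
THIRD_PARTY_PATTERNS = [
    re.compile(r"(^|\.)google-analytics\.com$"),
    re.compile(r"(^|\.)doubleclick\.net$"),
]


def is_known_third_party(url):
    """Match the request's host name against the known patterns."""
    host = urlsplit(url).hostname or ""
    return any(pattern.search(host) for pattern in THIRD_PARTY_PATTERNS)


def tag_third_parties(requests):
    """requests: iterable of (url, referrer) tuples in request order.

    A request is tagged when its host matches a pattern, or when its
    referrer is a URL that was already tagged (the "inherit" step).
    """
    tagged = set()
    for url, referrer in requests:
        if is_known_third_party(url) or referrer in tagged:
            tagged.add(url)
    return tagged
```

Because the patterns only look at the host name, a sketch like this would not handle the parse.ly case on cloudfront.net; that would need path-level or initiator-level rules.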
Bumping the priority of this feature. @LeslieMurphy let me know if you're still interested in working on this. |
This is an important point to get right. We want to make sure that we attribute all of the requests to the right parent/initiator, and then add another layer of smarts that tags these initiators against a set of categories like analytics, advertising, social, etc. This could be done at runtime within WPT, or after the fact based on the dependency tree.. but that also means we should have high confidence in all the edges being present for the dependency tree. Do we? |
As of now we don't have a complete dependency tree. About 28% of requests are missing the "initiator" field in the HAR payload. I opened this thread on the WPT forum to see if this is an upstream issue. cc @pmeenan Once we get that sorted out, it should be straightforward to follow the chain of initiators from a known third party to all of its dependent requests. We could use a technique similar to @paulcalvano's where we join a table of known third parties and their host names with the HTTP Archive requests to better understand which requests are third parties, what type of third party they are (ads, analytics, etc), and what they are loading/doing. This wouldn't require any pre-processing of the requests and could be done entirely in BigQuery. |
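To make the "follow the chain of initiators" idea concrete, here is a minimal Python sketch (not the BigQuery join itself). The lookup table contents, the `url -> initiator` dict shape, and the function name are all assumptions for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical lookup table: host name -> (entity, category).
KNOWN_THIRD_PARTIES = {
    "www.google-analytics.com": ("Google Analytics", "analytics"),
    "connect.facebook.net": ("Facebook", "social"),
}


def attribute(requests):
    """requests: dict mapping request URL -> initiator URL (None if missing).

    Walks each request's initiator chain upward until a known third-party
    host is found, and returns a dict of URL -> (entity, category) for the
    requests that could be attributed.
    """
    result = {}
    for url in requests:
        seen = set()  # guards against cycles in the initiator data
        current = url
        while current is not None and current not in seen:
            seen.add(current)
            host = urlsplit(current).hostname
            if host in KNOWN_THIRD_PARTIES:
                result[url] = KNOWN_THIRD_PARTIES[host]
                break
            current = requests.get(current)
    return result
```

Any request whose chain breaks on a missing initiator simply stays unattributed, which is why the coverage of the initiator field matters so much here.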
As far as I know, I report all of the initiator information that Dev tools collects. One thing we discussed maybe adding is to associate all unknown requests in a sub-frame with the main request for the frame which should help with the attribution for ads. |
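The sub-frame fallback mentioned above might look something like this sketch. The field names (`frame_id`, `initiator`) and the `frame_documents` mapping are hypothetical and are not the agent's actual data model; the mapping is assumed to contain sub-frames only.

```python
def fill_frame_initiators(requests, frame_documents):
    """requests: list of dicts with 'url', 'frame_id' and optional 'initiator'.
    frame_documents: dict of sub-frame id -> URL of that frame's main request.

    Returns copies of the requests in which a missing initiator inside a
    sub-frame is replaced by the frame's main document URL.
    """
    filled = []
    for request in requests:
        request = dict(request)  # copy so the input is left untouched
        doc_url = frame_documents.get(request.get("frame_id"))
        if not request.get("initiator") and doc_url and doc_url != request["url"]:
            request["initiator"] = doc_url
        filled.append(request)
    return filled
```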
tldr: The drop in initiator reliability correlates with M70+.
The website I included as an example in the WPT thread is www.usedtrucks.mercedes-benz.co.uk/. In the most recent crawl, only the initial HTML request is annotated with the expected initiator (empty string). The field is omitted entirely from all other requests.
When I manually test the page in Chrome (version 71.0.3578.98), I do see the expected initiator data:
I just retested the page in WPT and now I actually do see initiator fields in the HAR consistently for all 3 runs: https://www.webpagetest.org/result/190109_5H_10eab806c99209dd025fc14b48f8d820/
We've been testing this particular URL in HA since July 2018, so we can see if the percent of requests with an initiator field has changed:
SELECT
_TABLE_SUFFIX AS crawl,
SUM(IF(JSON_EXTRACT(payload, '$._initiator') IS NOT NULL, 1, 0)) / COUNT(0) AS pct_initiators
FROM
`httparchive.requests.*`
WHERE
page = 'http://www.usedtrucks.mercedes-benz.co.uk/'
GROUP BY
crawl
HAVING
pct_initiators IS NOT NULL
ORDER BY
crawl
Surprisingly, things took a nosedive on October 15.
And when we look at initiators for all requests on all pages things are interesting:
Again, things changed globally on October 15. And December 15 was actually much better in terms of coverage than crawls since November. Looking at the Chrome versions during this timeframe, it seems like we switched from Chrome 69 to 70. So I wonder if there were some unexpected reliability issues in M70+ with the initiator field.
One thing I can't explain is why the Mercedes example from the December 15 crawl had ~0% initiators in Chrome 71.0.3578.98 but from my ad hoc test today 100% of initiators are present in the exact same browser version. |
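For checking a single HAR outside of BigQuery, the same coverage ratio can be approximated in a few lines of Python. This is a sketch that assumes each entry is a dict parsed from a request payload and that a present-but-empty `_initiator` counts as covered, roughly matching the `IS NOT NULL` check in the query; the function name is made up.

```python
def initiator_coverage(har_entries):
    """har_entries: list of dicts parsed from HAR request payloads.

    Returns the fraction of entries that carry an `_initiator` value
    (an empty string still counts), or None for an empty list.
    """
    if not har_entries:
        return None
    covered = sum(1 for entry in har_entries if entry.get("_initiator") is not None)
    return covered / len(har_entries)
```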
Could be something was fixed recently in WebPageTest to handle changes from chrome 70. There were some netlog changes that were fixed last week that may have also fixed other things.
…________________________________
From: Rick Viscomi <notifications@github.com>
Sent: Wednesday, January 9, 2019 5:41 PM
To: HTTPArchive/httparchive.org
Cc: Patrick Meenan; Mention
Subject: Re: [HTTPArchive/httparchive.org] Surface CDT's "third party badges" as a dimension (#12)
tldr: The drop in initiator reliability correlates with M70+.
The website I included as an example in the WPT thread is www.usedtrucks.mercedes-benz.co.uk/. In the most recent crawl, only the initial HTML request is annotated with the expected initiator (empty string). The field is omitted entirely from all other requests.
When I manually test the page in CDT, I do see the expected initiator data:
[image]<https://user-images.githubusercontent.com/1120896/50929156-8be91100-142a-11e9-9a60-61aafa0f0201.png>
I just retested the page in WPT and now I actually do see initiator fields in the HAR consistently for all 3 runs: https://www.webpagetest.org/result/190109_5H_10eab806c99209dd025fc14b48f8d820/
[image]<https://user-images.githubusercontent.com/1120896/50930160-3a8e5100-142d-11e9-90c2-0f176a59d7b7.png>
We've been testing this particular URL in HA since July 2018, so we can see if the percent of requests with an initiator field has changed:
SELECT
_TABLE_SUFFIX AS crawl,
SUM(IF(JSON_EXTRACT(payload, '$._initiator') IS NOT NULL, 1, 0)) / COUNT(0) AS pct_initiators
FROM
`httparchive.requests.*`
WHERE
page = 'http://www.usedtrucks.mercedes-benz.co.uk/'
GROUP BY
crawl
HAVING
pct_initiators IS NOT NULL
ORDER BY
crawl
Surprisingly, things took a nosedive on October 15
date desktop mobile
2018_07_01 100.00%
2018_07_15 100.00% 98.37%
2018_08_01 100.00% 100.00%
2018_08_15 100.00% 100.00%
2018_09_01 100.00% 100.00%
2018_09_15 100.00% 100.00%
2018_10_01 100.00% 100.00%
2018_10_15 0.36% 0.36%
2018_11_01 0.34% 0.34%
2018_11_15 0.34% 0.34%
2018_12_01 0.34% 0.34%
2018_12_15 0.32% 0.32%
And when we look at initiators for all requests on all pages things are interesting:
[image]<https://user-images.githubusercontent.com/1120896/50930917-395e2380-142f-11e9-9e50-de756e0edf92.png>
date desktop mobile
2018_07_01 99.08% 99.01%
2018_07_15 99.00% 99.00%
2018_08_01 99.08% 99.02%
2018_08_15 99.08% 99.08%
2018_09_01 99.04% 99.06%
2018_09_15 99.04% 99.08%
2018_10_01 99.04% 99.06%
2018_10_15 59.81% 54.89%
2018_11_01 42.76% 44.94%
2018_11_15 42.46% 44.81%
2018_12_01 41.65% 43.40%
2018_12_15 72.21% 60.01%
Again, things changed globally on October 15. And December 15 was actually much better in terms of coverage than crawls since November.
Looking at the Chrome versions during this timeframe, it seems like we switched from Chrome 69 to 70. So I wonder if there were some unexpected reliability issues in M70+ with the initiator field.
date        68       69       70       71
2018_09_01  50.64%   49.36%   -        -
2018_09_15  -        100.00%  -        -
2018_10_01  -        100.00%  -        -
2018_10_15  -        24.33%   75.67%   -
2018_11_01  0.06%    0.17%    99.77%   -
2018_11_15  -        -        100.00%  -
2018_12_01  -        -        87.02%   12.98%
2018_12_15  -        -        0.07%    99.93%
One thing I can't explain is why the Mercedes example from the December 15 crawl<http://httparchive.webpagetest.org/result/181215_N4_199B9/1/details/#waterfall_view_step1> had ~0% initiators in Chrome 71.0.3578.98 but from my ad hoc test today<https://www.webpagetest.org/result/190109_5H_10eab806c99209dd025fc14b48f8d820/1/details/#waterfall_view_step1> 100% of initiators are present in the exact same browser version.
|
Ok, let's wait for the 1/1 crawl to complete and see if the initiators are appearing as expected. |
Here's an updated table of initiator coverage with 2019_01_01:
(yes desktop and mobile actually come out to the same rounded value) There's definitely been some improvement but still not quite back to normal ~99%. |
99% seems unrealistically high. In normal testing I see a few URLs per page that have "other" as the initiator in the raw dev tools data. I can include that if it would help but it amounts to "unknown". I did JUST take a look and push an improvement for cases where the initiator was a call stack in JavaScript that references a script ID but didn't include the script URL. That can happen when a script inserts a script directly into the dom (or does an eval) so now I monitor all script compilations and walk the call stack for every script ID to see what caused the script to get added and use that as the initiator in those cases. We're around 1/3 the way into this month's crawl so there will be a decent bump in coverage and then another small bump in March (assuming nothing in Chrome changes between now and then). |
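As a rough illustration of the script-ID walk described above (not the actual agent code), the sketch below assumes a simplified call-frame shape with `scriptId` and an optional `url`, plus a hypothetical `script_origins` map recording which call stack was active when each script was compiled.

```python
from collections import deque


def resolve_initiator(stack, script_origins):
    """stack: list of call frames, each a dict with 'scriptId' and an
    optional 'url'.
    script_origins: dict of scriptId -> call stack that was active when
    that script was compiled (e.g. inserted into the DOM or eval'd).

    Returns the first script URL found while walking the stacks, or None.
    """
    pending = deque([stack])
    visited = set()  # script IDs already followed, to avoid loops
    while pending:
        for frame in pending.popleft():
            if frame.get("url"):
                return frame["url"]
            script_id = frame.get("scriptId")
            if script_id in script_origins and script_id not in visited:
                visited.add(script_id)
                pending.append(script_origins[script_id])
    return None
```

This version prefers a URL anywhere in the current stack before following a compilation origin; a depth-first walk would be an equally reasonable choice.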
Yep, pretty much (or that the extra 19% had an empty but present initiator). At a minimum, the main document request usually doesn't have one and, AFAIK, neither do any iframe src URLs (in addition to a bunch of other edge cases). 80% ish sounds like a good expectation. |
Closing this out. We can join with @patrickhulce's third-party-web data on BigQuery to achieve the same effect. See this example query. |
Could we add new columns (name + type) in our requests table to record the 3P badges? Also, could we propagate those tags down to any child requests? E.g. if a.js is tagged as XYZ corp, and fetches b.jpg, the latter should have the same tag.
AFAIK, CDT does this analysis at runtime, so it's not available in the trace? Should it be, or do we need to integrate this into our own analysis pipeline?
/cc @paulirish @rviscomi