Finalize assignments: Chapter 5. Third parties #8
Answering the first question about categorization... In the 2019-05-01 desktop dataset, we see the following.
See Query & Raw Results

```sql
SELECT
  COUNT(*),
  COUNTIF(pageDomain = requestDomain) AS matchingDomains,
  COUNTIF(pageETld = requestETld) AS matchingETlds,
  COUNTIF(requestCanonicalDomain IS NULL) AS hasNoMatchingThirdParty,
  COUNTIF(requestDomainOver50 IS NULL) AS hasNoMatchingDomainOver50
FROM (
  SELECT
    page AS pageUrl,
    NET.HOST(page) AS pageDomain,
    REGEXP_EXTRACT(NET.HOST(page), r'([^.]+\.(?:[^.]+|(?:gov|com|co|ne)\.\w{2})$)') AS pageETld,
    url AS requestUrl,
    NET.HOST(url) AS requestDomain,
    REGEXP_EXTRACT(NET.HOST(url), r'([^.]+\.(?:[^.]+|(?:gov|com|co|ne)\.\w{2})$)') AS requestETld,
    ThirdPartyCanonicalDomainTable.canonicalDomain AS requestCanonicalDomain,
    DomainsOver50Table.requestDomain AS requestDomainOver50
  FROM
    `httparchive.requests.2019_05_01_desktop`
  LEFT JOIN
    `lighthouse-infrastructure.third_party_web.2019_05_22` AS ThirdPartyCanonicalDomainTable
  ON NET.HOST(url) = ThirdPartyCanonicalDomainTable.domain
  LEFT JOIN
    `lighthouse-infrastructure.third_party_web.2019_05_22_all_observed_domains` AS DomainsOver50Table
  ON NET.HOST(url) = DomainsOver50Table.requestDomain
)
```

```json
[
  {
    "f0_": "387194512",
    "matchingDomains": "167351992",
    "matchingETlds": "182879255",
    "hasNoMatchingThirdParty": "295164996",
    "hasNoMatchingDomainOver50": "196211986"
  }
]
```
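To make the raw counts easier to compare, here is a quick back-of-the-envelope calculation (plain Python, numbers copied from the result above) of how much of the dataset each candidate method actually classifies:

```python
# Coverage of each classification method, from the raw query counts above.
total = 387_194_512

matching_domains = 167_351_992   # request domain == page domain
matching_etlds = 182_879_255     # request eTLD == page eTLD
no_manual_match = 295_164_996    # NULL canonicalDomain (not in third-party-web list)
no_over50_match = 196_211_986    # NULL requestDomain (not seen on >50 pages)

def pct(n):
    """Return n as a percentage of all requests, rounded to one decimal."""
    return round(100 * n / total, 1)

print(f"same-domain requests:      {pct(matching_domains)}%")
print(f"same-eTLD requests:        {pct(matching_etlds)}%")
print(f"manual list coverage:      {pct(total - no_manual_match)}%")
print(f">50-pages method coverage: {pct(total - no_over50_match)}%")
```

The >50-pages method classifies roughly twice as many requests as the manual third-party-web list, which is the gist of the argument that follows.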
Given this data, I'm inclined to pick the ">50 pages" method. The domain and eTLD methods are really proxies for "is this request coming from something you control?" and aren't perfect by any means: plenty of sites host their own assets on CDNs where the root domain will differ, and plenty of sites are hosted by third parties where eTLD-matching requests aren't actually under their control. Manual listing covers a lot of the third party requests but definitely comes up very short. The only obvious failure mode I see for the >50 approach is that it will tend to exclude the smaller third party providers. This seems acceptable when doing mass classification, as the consumer of a smaller third party theoretically has greater influence over the behavior of that content anyway, i.e. it's not a "well, I absolutely have to use this in order to integrate with my X provider" situation. Metric definitions to follow shortly now that part 1 is settled :)
Definitions

- A **third party domain** is defined as a domain whose resources have been requested by at least 50 different pages in the HTTP Archive dataset.
- A **third party category** is defined as the category of the third party domain's primary usage according to third-party-web, or 'Unknown' if not found in that dataset.
- A **third party request** is defined as a request to a third party domain.
- A **third party script** is defined as a third party request whose resource type is 'script'.

Resource types:
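A minimal sketch of the classification logic those definitions imply (plain Python over hypothetical `(page, domain)` pairs; the names and threshold constant are illustrative, not from the actual analysis pipeline):

```python
from collections import defaultdict

# Per the definition above: a domain is "third party" once it is
# requested by at least 50 distinct pages in the dataset.
THIRD_PARTY_PAGE_THRESHOLD = 50

def third_party_domains(observations):
    """observations: iterable of (page_url, request_domain) pairs.
    Returns the set of domains requested by >= 50 distinct pages."""
    pages_per_domain = defaultdict(set)
    for page, domain in observations:
        pages_per_domain[domain].add(page)
    return {d for d, pages in pages_per_domain.items()
            if len(pages) >= THIRD_PARTY_PAGE_THRESHOLD}

def classify_request(domain, third_parties, categories):
    """Map a request's domain to its third party category:
    'Unknown' if third party but not in the category table,
    None if it doesn't meet the third party threshold."""
    if domain not in third_parties:
        return None
    return categories.get(domain, "Unknown")
```

The acceptable failure mode discussed above falls out of the threshold: a provider seen on only a handful of pages never enters the third party set at all.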
Third party categories:
Metrics for Almanac
@simonhearne anything else you'd like to see included?
@flowlabs I've added you as a peer reviewer here. Anything you'd add/change in @patrickhulce's list above?
This looks really good so far. A few thoughts currently:
Great feedback @flowlabs thanks for your time helping out with this section!
Good thinking! Is the primary metric you're thinking of here something along the lines of "Percentage of H/2 requests that are third party requests, broken down by resource type"?
The full list in the third party repo is
For this high-level summary it seemed better to consolidate some of these, since at a certain point we're analyzing a very small slice. (See image below for how quickly each category becomes a pretty small portion.) Maybe I'm completely missing the value of more granular categories though? Happy to hear the argument out :)
I LOVE this idea. I'm struggling a bit though with how to represent this in a way that isn't too focused on particular parties. The "average" variability doesn't seem that interesting either, maybe just the variability of script execution compared to 1st party scripts?
Great points! Have any thoughts on how we might track these? All HTTP Archive runs will be clean-slate loads AFAIK, so cookies might be trickier to assess. Good thoughts for potential future Lighthouse audits though too... 🤔
Added @jasti, who has been thinking a lot about the state of third parties, as a reviewer 🎉
@soulgalore, @zeman would either of you be interested in peer reviewing this effort too?
@patrickhulce sure thing!
@patrickhulce hoping to finalize metrics for all chapters today. This one LGTM, just want to make sure you're happy with the current list. I've updated #8 (comment) with the latest metrics. If it LGTY you can tick the checkbox and close this issue. Thanks!
Yep, I'm pretty happy with it! @flowlabs's points are good ones but thornier to execute, so we can leave those for a future iteration.
@rviscomi, sorry for reopening and the delay. Just 3 more metrics to consider if possible:
No problem, thanks for the suggestions @jasti! @patrickhulce can you add these metrics if they sound good to you?
Great suggestions @jasti, thanks! They're mostly advertising-performance-specific; I wonder if this suggests it's worth breaking advertising out into its own category entirely in the future. We could certainly dive much deeper on a lot of ad-specific topics than might make sense in an overall third-party section.
We're not really examining any sort of TTI/load metric for ad frames specifically. I'm not really aware of anything that captures this easily at all right now actually, was there a particular metric you were hoping to examine?
This is a good point though I'm not sure of the solution when we are indeed trying to look at the web as a whole. Maybe the generic version is "what do all of these metrics look like on the subset of pages that include at least one third-party resource". I expect those numbers to be pretty similar in absence of the ad-specific nature though. Idea: what if we added median page-relative % for all of the metrics above? i.e. compute the number of ad requests as a percentage of the total requests on each page and take the median of that number. (Sorry if your idea was the same and I just misunderstood 😄, hurray for secretly being on the same page then!)
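The "median page-relative %" idea above could be computed roughly like this (an illustrative Python sketch; the real analysis would be a BigQuery query, and the field names here are made up):

```python
import statistics

def median_page_relative_pct(pages):
    """pages: list of dicts with hypothetical 'ad_requests' and
    'total_requests' counts, one dict per page.

    Computes each page's ad-request share first, then takes the
    median across pages, rather than pooling all requests together
    (which would let a few heavy pages dominate the statistic)."""
    per_page_shares = [
        100 * p["ad_requests"] / p["total_requests"]
        for p in pages
        if p["total_requests"] > 0  # skip pages with no recorded requests
    ]
    return statistics.median(per_page_shares)
```

The per-page-then-median ordering is the whole point of the proposal: it answers "what does a typical page look like?" instead of "what does the typical request look like?".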
I'm not 100% sure what you mean here. Are you saying you'd like to see video advertising companies' script execution separate from other advertisers' script execution in the percentage totals? The categories proposed for the almanac don't break down to that level of granularity. (I've updated the existing metrics list to say "by third-party category by resource type" to indicate that they will be nested too 👍) Boiling all of this down, this is what I think we'd add...
Does that sound good to you, or have I missed the point?
@jasti do the metrics in @patrickhulce's comment work for you? Hoping to close this out today.
Ping for the open question to @jasti
@jasti I hope I captured what you were looking for. I've added the "Median page-relative percentage" metrics to the final list 👍
Perfect, thank you @patrickhulce!
Due date: To help us stay on schedule, please complete the action items in this issue by June 3.
To do:
Current list of metrics:
👉 AI (@patrickhulce): Finalize which metrics you might like to include in an annual "state of third parties" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.
The metrics should paint a holistic, data-driven picture of the third party landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.
Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.
Additional resources: