Finalize assignments: Chapter 5. Third parties #8
Answering the first question about categorization... In the 2019-05-01 desktop dataset, we see the following.
See Query & Raw Results

```sql
SELECT
  COUNT(*),
  COUNTIF(pageDomain = requestDomain) AS matchingDomains,
  COUNTIF(pageETld = requestETld) AS matchingETlds,
  COUNTIF(requestCanonicalDomain IS NULL) AS hasNoMatchingThirdParty,
  COUNTIF(requestDomainOver50 IS NULL) AS hasNoMatchingDomainOver50
FROM (
  SELECT
    page AS pageUrl,
    NET.HOST(page) AS pageDomain,
    REGEXP_EXTRACT(NET.HOST(page), r'([^.]+\.(?:[^.]+|(?:gov|com|co|ne)\.\w{2})$)') AS pageETld,
    url AS requestUrl,
    NET.HOST(url) AS requestDomain,
    REGEXP_EXTRACT(NET.HOST(url), r'([^.]+\.(?:[^.]+|(?:gov|com|co|ne)\.\w{2})$)') AS requestETld,
    ThirdPartyCanonicalDomainTable.canonicalDomain AS requestCanonicalDomain,
    DomainsOver50Table.requestDomain AS requestDomainOver50
  FROM
    `httparchive.requests.2019_05_01_desktop`
  LEFT JOIN
    `lighthouse-infrastructure.third_party_web.2019_05_22` AS ThirdPartyCanonicalDomainTable
  ON NET.HOST(url) = ThirdPartyCanonicalDomainTable.domain
  LEFT JOIN
    `lighthouse-infrastructure.third_party_web.2019_05_22_all_observed_domains` AS DomainsOver50Table
  ON NET.HOST(url) = DomainsOver50Table.requestDomain
)
```

```json
[
  {
    "f0_": "387194512",
    "matchingDomains": "167351992",
    "matchingETlds": "182879255",
    "hasNoMatchingThirdParty": "295164996",
    "hasNoMatchingDomainOver50": "196211986"
  }
]
```
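To make the raw counts easier to compare, here is a quick back-of-the-envelope calculation (plain Python, numbers copied from the result above) of how much of the dataset each candidate method actually classifies:

```python
# Coverage of each classification method, from the raw query counts above.
total = 387_194_512

matching_domains = 167_351_992   # request domain == page domain
matching_etlds = 182_879_255     # request eTLD == page eTLD
no_manual_match = 295_164_996    # NULL canonicalDomain (not in third-party-web list)
no_over50_match = 196_211_986    # NULL requestDomain (not seen on >50 pages)

def pct(n):
    """Return n as a percentage of all requests, rounded to one decimal."""
    return round(100 * n / total, 1)

print(f"same-domain requests:      {pct(matching_domains)}%")
print(f"same-eTLD requests:        {pct(matching_etlds)}%")
print(f"manual list coverage:      {pct(total - no_manual_match)}%")
print(f">50-pages method coverage: {pct(total - no_over50_match)}%")
```

The >50-pages method classifies roughly twice as many requests as the manual third-party-web list, which is the gist of the argument that follows.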
Given this data, I'm inclined to pick the ">50 pages" method. The domain and eTLD methods are really proxies for "is this request coming from something you control?" and aren't perfect by any means: plenty of sites host their own assets on CDNs where the root domain will differ, and plenty of sites are hosted by third parties where eTLD-matching requests aren't actually under their control. Manual listing covers a lot of the third party requests but definitely comes up very short. The only obvious failure mode I see for the >50 approach is that it will tend to exclude the smaller third party providers. This seems acceptable when doing mass classification, as the consumer of a smaller third party theoretically has greater influence over the behavior of that content anyway, i.e. it's not a "well, I absolutely have to use this in order to integrate with my X provider" situation. Metric definitions to follow shortly now that part 1 is settled :)
Definitions

- A **third party domain** is defined as a domain whose resources have been requested by at least 50 different pages in the HTTP Archive dataset.
- A **third party category** is defined as the category of the third party domain's primary usage according to third-party-web, or 'Unknown' if not found in that dataset.
- A **third party request** is defined as a request to a third party domain.
- A **third party script** is defined as a third party request whose resource type is 'script'.

Resource types:
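A minimal sketch of the classification logic those definitions imply (plain Python over hypothetical `(page, domain)` pairs; the names and threshold constant are illustrative, not from the actual analysis pipeline):

```python
from collections import defaultdict

# Per the definition above: a domain is "third party" once it is
# requested by at least 50 distinct pages in the dataset.
THIRD_PARTY_PAGE_THRESHOLD = 50

def third_party_domains(observations):
    """observations: iterable of (page_url, request_domain) pairs.
    Returns the set of domains requested by >= 50 distinct pages."""
    pages_per_domain = defaultdict(set)
    for page, domain in observations:
        pages_per_domain[domain].add(page)
    return {d for d, pages in pages_per_domain.items()
            if len(pages) >= THIRD_PARTY_PAGE_THRESHOLD}

def classify_request(domain, third_parties, categories):
    """Map a request's domain to its third party category:
    'Unknown' if third party but not in the category table,
    None if it doesn't meet the third party threshold."""
    if domain not in third_parties:
        return None
    return categories.get(domain, "Unknown")
```

The acceptable failure mode discussed above falls out of the threshold: a provider seen on only a handful of pages never enters the third party set at all.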
Third party categories:
Metrics for Almanac
@simonhearne anything else you'd like to see included?
@flowlabs I've added you as a peer reviewer here. Anything you'd add/change in @patrickhulce's list above?
This looks really good so far. A few thoughts currently:
Great feedback @flowlabs thanks for your time helping out with this section!
Good thinking! Is the primary metric you're thinking of here something along the lines of "Percentage of H/2 requests that are third party requests, broken down by resource type"?
The full list in the third party repo is
For this high-level summary it seemed better to consolidate some of these, since at a certain point we're analyzing a very small slice. (See image below for how quickly each category becomes a pretty small portion.) Maybe I'm completely missing the value of more granular categories though? Happy to hear the argument out :)
I LOVE this idea. I'm struggling a bit though with how to represent this in a way that isn't too focused on particular parties. The "average" variability doesn't seem that interesting either, maybe just the variability of script execution compared to 1st party scripts?
Great points! Have any thoughts on how we might track these? All HTTP Archive runs will be clean-slate loads AFAIK, so cookies might be trickier to assess. Good thoughts for potential future Lighthouse audits though too... 🤔
Added @jasti, who has been thinking a lot about the state of third parties, as a reviewer 🎉
@soulgalore, @zeman would either of you be interested in peer reviewing this effort too?
@patrickhulce sure thing!
@patrickhulce hoping to finalize metrics for all chapters today. This one LGTM, just want to make sure you're happy with the current list. I've updated #8 (comment) with the latest metrics. If it LGTY you can tick the checkbox and close this issue. Thanks!
Yep, I'm pretty happy with it! @flowlabs's points are good ones but thornier to execute, so we can leave those for a future iteration.
@rviscomi, sorry for reopening and the delay. Just 3 more metrics to consider if possible:
No problem, thanks for the suggestions @jasti! @patrickhulce can you add these metrics if they sound good to you?
Great suggestions @jasti, thanks! They're mostly advertising-performance-specific; I wonder if this suggests it's worth breaking advertising out into its own category entirely in the future. We could certainly dive much deeper on a lot of ad-specific topics than might make sense in an overall third-party section.
We're not really examining any sort of TTI/load metric for ad frames specifically. I'm not really aware of anything that captures this easily at all right now actually, was there a particular metric you were hoping to examine?
This is a good point though I'm not sure of the solution when we are indeed trying to look at the web as a whole. Maybe the generic version is "what do all of these metrics look like on the subset of pages that include at least one third-party resource". I expect those numbers to be pretty similar in absence of the ad-specific nature though. Idea: what if we added median page-relative % for all of the metrics above? i.e. compute the number of ad requests as a percentage of the total requests on each page and take the median of that number. (Sorry if your idea was the same and I just misunderstood 😄, hurray for secretly being on the same page then!)
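The "median page-relative %" idea above could be computed roughly like this (an illustrative Python sketch; the real analysis would be a BigQuery query, and the field names here are made up):

```python
import statistics

def median_page_relative_pct(pages):
    """pages: list of dicts with hypothetical 'ad_requests' and
    'total_requests' counts, one dict per page.

    Computes each page's ad-request share first, then takes the
    median across pages, rather than pooling all requests together
    (which would let a few heavy pages dominate the statistic)."""
    per_page_shares = [
        100 * p["ad_requests"] / p["total_requests"]
        for p in pages
        if p["total_requests"] > 0  # skip pages with no recorded requests
    ]
    return statistics.median(per_page_shares)
```

The per-page-then-median ordering is the whole point of the proposal: it answers "what does a typical page look like?" instead of "what does the typical request look like?".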
I'm not 100% sure what you mean here. Are you saying you'd like to see video advertising companies' script execution separate from other advertisers' script execution in the percentage totals? The categories proposed for the almanac don't break down to that level of granularity. (I've updated the existing metrics list to say "by third-party category by resource type" to indicate that they will be nested too 👍) Boiling all of this down, this is what I think we'd add...
Does that sound good to you, or have I missed the point?
@jasti do the metrics in @patrickhulce's comment work for you? Hoping to close this out today.
Ping for the open question to @jasti
@jasti I hope I captured what you were looking for. I've added the "Median page-relative percentage" metrics to the final list 👍
Perfect, thank you @patrickhulce!
Due date: To help us stay on schedule, please complete the action items in this issue by June 3.
To do:
Current list of metrics:
👉 AI (@patrickhulce): Finalize which metrics you might like to include in an annual "state of third parties" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.
The metrics should paint a holistic, data-driven picture of the third party landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.
Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.
Additional resources: