
Finalize assignments: Chapter 5. Third parties #8

Closed · 3 tasks done
rviscomi opened this issue May 21, 2019 · 17 comments
Comments

@rviscomi (Member) commented May 21, 2019

| Section | Chapter | Author | Reviewers |
| --- | --- | --- | --- |
| I. Page Content | 5. Third parties | @patrickhulce | @simonhearne @flowlabs @jasti @zeman |

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter expert (author)
  • Finalize peer reviewers
  • Finalize metrics

Current list of metrics:

  • Percentage of pages that include at least one third-party resource.
  • Percentage of pages that include at least one ad resource.
  • Percentage of requests that are third party requests broken down by third party category by resource type.
  • Percentage of total bytes that are from third party requests broken down by third party category by resource type.
  • Percentage of total script execution time that is from third party scripts broken down by third party category.
  • Median page-relative percentage of requests that are third party requests broken down by third party category by resource type.
  • Median page-relative percentage of total bytes that are from third party requests broken down by third party category by resource type.
  • Median page-relative percentage of total script execution time that is from third party scripts broken down by third party category.
  • Top 100 third party domains by request volume
  • Top 100 third party domains by total byte weight
  • Top 100 third party domains by total script execution time
  • Top 100 third party requests by request volume
  • Top 100 third party requests by total script execution time

👉 Action item (@patrickhulce): Finalize which metrics you might like to include in an annual "state of third parties" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the third party landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

@rviscomi rviscomi transferred this issue from HTTPArchive/httparchive.org May 21, 2019
@rviscomi rviscomi added this to the Chapter planning complete milestone May 21, 2019
@rviscomi rviscomi added this to TODO in Web Almanac 2019 via automation May 21, 2019
@rviscomi rviscomi changed the title [Web Almanac] Finalize assignments: Chapter 5. Third parties Finalize assignments: Chapter 5. Third parties May 21, 2019
@rviscomi rviscomi moved this from TODO to In Progress in Web Almanac 2019 May 21, 2019
@patrickhulce (Contributor)

Answering the first question about categorization...

In the 2019-05-01 desktop dataset, we see the following.

| Metric | Count | Percent |
| --- | --- | --- |
| Total Requests | 387,194,512 | 100% |
| Third Party Requests (domain method) | 219,842,520 | 56.8% |
| Third Party Requests (eTLD method) | 204,315,257 | 52.8% |
| Third Party Requests (>50 pages method) | 190,982,526 | 49.3% |

Query and raw results:
```sql
-- Compare third-party classification methods on the 2019-05-01 desktop
-- dataset: exact host match, eTLD+1 match, the manual third-party-web
-- list, and the ">50 pages" observed-domains list.
SELECT
  COUNT(*),
  COUNTIF(pageDomain = requestDomain) AS matchingDomains,
  COUNTIF(pageETld = requestETld) AS matchingETlds,
  COUNTIF(requestCanonicalDomain IS NULL) AS hasNoMatchingThirdParty,
  COUNTIF(requestDomainOver50 IS NULL) AS hasNoMatchingDomainOver50
FROM (
  SELECT
    page AS pageUrl,
    NET.HOST(page) AS pageDomain,
    -- Simplified eTLD+1 heuristic; a full implementation would use the
    -- Public Suffix List.
    REGEXP_EXTRACT(NET.HOST(page), r'([^.]+\.(?:[^.]+|(?:gov|com|co|ne)\.\w{2})$)') AS pageETld,
    url AS requestUrl,
    NET.HOST(url) AS requestDomain,
    REGEXP_EXTRACT(NET.HOST(url), r'([^.]+\.(?:[^.]+|(?:gov|com|co|ne)\.\w{2})$)') AS requestETld,
    ThirdPartyCanonicalDomainTable.canonicalDomain AS requestCanonicalDomain,
    DomainsOver50Table.requestDomain AS requestDomainOver50
  FROM
    `httparchive.requests.2019_05_01_desktop`
  LEFT JOIN
    `lighthouse-infrastructure.third_party_web.2019_05_22` AS ThirdPartyCanonicalDomainTable
  ON NET.HOST(url) = ThirdPartyCanonicalDomainTable.domain
  LEFT JOIN
    `lighthouse-infrastructure.third_party_web.2019_05_22_all_observed_domains` AS DomainsOver50Table
  ON NET.HOST(url) = DomainsOver50Table.requestDomain
)
```

```json
[
  {
    "f0_": "387194512",
    "matchingDomains": "167351992",
    "matchingETlds": "182879255",
    "hasNoMatchingThirdParty": "295164996",
    "hasNoMatchingDomainOver50": "196211986"
  }
]
```

Given this data, I'm inclined to pick the ">50 pages method". The domain and eTLD methods are really proxies for "is this request coming from something you control?" and aren't perfect by any means: plenty of sites host their own assets on CDNs where the root domain differs, and plenty of sites hosted by third parties make same-eTLD requests that aren't actually under their control. Manual listing covers a lot of the third party requests but definitely comes up short.

The only obvious failure mode I see for the >50 approach is that it will tend to exclude the smaller third party providers. This seems acceptable when doing mass classification, as the consumer of a smaller third party theoretically has greater influence over the behavior of that content anyway; i.e. it's not a "well, I absolutely have to use this in order to integrate with my X provider" situation.
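To make the three automated methods concrete, here's a rough Python sketch (not the actual Almanac query) of how each would classify a single request. The simplified eTLD heuristic mirrors the regex in the query above; a real implementation would use the Public Suffix List, and the `over_50_domains` set is an illustrative stand-in for the observed-domains table.

```python
# Sketch of the three automated third-party classification methods
# discussed above, applied to one page/request pair.
from urllib.parse import urlsplit

# Two-part suffixes handled by the simplified regex in the query above.
MULTI_PART_SUFFIXES = ("gov", "com", "co", "ne")

def etld_plus_one(host: str) -> str:
    """Approximate eTLD+1, e.g. 'shop.example.co.uk' -> 'example.co.uk'."""
    parts = host.split(".")
    if len(parts) >= 3 and parts[-2] in MULTI_PART_SUFFIXES and len(parts[-1]) == 2:
        return ".".join(parts[-3:])
    return ".".join(parts[-2:])

def classify(page_url: str, request_url: str, over_50_domains: set) -> dict:
    page_host = urlsplit(page_url).hostname
    req_host = urlsplit(request_url).hostname
    return {
        "third_party_by_domain": page_host != req_host,
        "third_party_by_etld": etld_plus_one(page_host) != etld_plus_one(req_host),
        "third_party_by_over50": req_host in over_50_domains,
    }

result = classify(
    "https://example.com/",
    "https://cdn.example.com/app.js",
    over_50_domains={"www.google-analytics.com"},
)
# The subdomain trips the strict host-match method but not the eTLD method,
# and the CDN host is not in the >50-pages list.
```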

Metric definitions to follow shortly now that part 1 is settled :)

@patrickhulce (Contributor) commented May 23, 2019

Definitions

A third party domain is defined as a domain whose resources have been requested by at least 50 different pages in the HTTPArchive dataset.

A third party category is defined as the category of the third party domain's primary usage according to third-party-web or 'Unknown' if not found in that dataset.

A third party request is defined as a request to a third party domain.

A third party script is defined as a third party request whose resource type is 'script'.
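These definitions can be sketched as simple classification logic. The two lookup tables below are tiny illustrative stand-ins (not real APIs): `THIRD_PARTY_DOMAINS` for the set of domains requested by 50+ pages, and `DOMAIN_CATEGORIES` for the third-party-web dataset.

```python
# Sketch of the definitions above; the lookup tables are made-up samples.
THIRD_PARTY_DOMAINS = {"www.google-analytics.com", "connect.facebook.net"}
DOMAIN_CATEGORIES = {
    "www.google-analytics.com": "Analytics",
    "connect.facebook.net": "Social",
}

def is_third_party_request(domain: str) -> bool:
    """A third party request is a request to a third party domain."""
    return domain in THIRD_PARTY_DOMAINS

def third_party_category(domain: str) -> str:
    """Category of the domain's primary usage, or 'Unknown' if unlisted."""
    return DOMAIN_CATEGORIES.get(domain, "Unknown")

def is_third_party_script(domain: str, resource_type: str) -> bool:
    """A third party script is a third party request of type 'script'."""
    return is_third_party_request(domain) and resource_type == "script"
```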

Resource types:

  • Total
  • Document
  • Script
  • Stylesheet
  • Image
  • XHR/Fetch

Third party categories:

  • Ads
  • Analytics
  • Social
  • Customer Success & Marketing
  • Hosting & Content
  • Developer Utility
  • Other

Metrics for Almanac

  • Percentage of requests that are third party requests broken down by resource type.
  • Percentage of requests that are third party requests broken down by third party category.
  • Percentage of total bytes that are from third party requests broken down by resource type.
  • Percentage of total bytes that are from third party requests broken down by third party category.
  • Percentage of total script execution time that is from third party scripts broken down by third party category.
  • Top 100 third party domains by request volume
  • Top 100 third party domains by total byte weight
  • Top 100 third party domains by total script execution time
  • Top 100 third party requests by request volume
  • Top 100 third party requests by total script execution time

@simonhearne anything else you'd like to see included?

@rviscomi (Member, Author)

@flowlabs I've added you as a peer reviewer here. Anything you'd add/change in @patrickhulce's list above?

@flowlabs

Hi @rviscomi @patrickhulce

This looks really good so far. A few thoughts currently:

  1. I see a lot of 3rd parties still not supporting H/2. This might be something to consider including, if possible?

  2. Categories for 3rd parties: is this the full list? Is it programmatically generated? If not, consider including Tag Management Systems, Retargeting, RUM (or keep this in Analytics or Developer Utility), and A/B and MVT testing (or is this part of Customer Success & Marketing?). Overall, would it make sense to make this list more granular?

  3. In my experience I've seen a number of 3rd parties that are highly variable in their performance and availability. Possibly consider capturing a variability metric or illustrating performance variability.

  4. General comment: some 3rd parties are notorious for adding volume to 1st party cookies. This causes all sorts of havoc when you breach request size limits, and it slows down the request/response cycle, especially on HTTP/1.1 requests.

  5. General comment: the overall concern about 3rd parties with my clients relates to whether or not they are blocking, or cause a single point of failure (SPoF) or degrade the customer experience. Most play better these days compared with a few years ago, but many sites have not upgraded their tag implementation, and this is where it gets problematic, e.g. tag templates by default do not update in TMSes like Tealium.

@patrickhulce (Contributor)

Great feedback @flowlabs thanks for your time helping out with this section!

> I see a lot of 3rd parties still not supporting H/2. This might be something to consider including if possible?

Good thinking! Is the primary metric you're thinking here something along the lines of, "Percentage of H/2 requests that are third party requests broken down by resource type"?

> is this the full list? Is it programmatically generated? If not, consider including Tag Management Systems, Retargeting, RUM (or keep this in Analytics or Developer Utility), A/B and MVT (or is this part of Customer success & Marketing?). Overall - would it make sense to make this list more granular?

The full list in the third party repo is:

  • Advertising
  • Analytics
  • Social
  • Video
  • Developer Utilities
  • Hosting Platforms
  • Marketing
  • Customer Success
  • Content & Publishing
  • Libraries
  • Tag Management
  • Mixed / Other

For this high-level summary it seemed better to consolidate some of these, since at a certain point we're analyzing a very small slice. (See the image below for how quickly each category becomes a pretty small portion.)

[image: breakdown of requests by third party category]

Maybe I'm completely missing the value of more granular categories though? Happy to hear the argument out :)

> I've seen in my experience a number of 3rd parties that are highly variable in their performance and availability. Possibly consider capturing a variability metric or illustrating performance variability.

I LOVE this idea. I'm struggling a bit, though, with how to represent it in a way that isn't too focused on particular parties. The "average" variability doesn't seem that interesting either; maybe just the variability of script execution compared to 1st party scripts?

> Some 3rd parties are notorious for adding volume to 1st party cookies

> whether or not they are blocking, or cause a SPoF / degrade customer experience.

Great points! Have any thoughts on how we might track these? All HTTP Archive runs will be clean-slate loads AFAIK, so cookies might be trickier to assess. Good thoughts for potential future Lighthouse audits, though, too... 🤔

@rviscomi (Member, Author)

Added @jasti, who has been thinking a lot about the state of third parties, as a reviewer 🎉

@patrickhulce (Contributor)

@soulgalore, @zeman would either of you be interested in peer reviewing this effort too?

@zeman commented May 29, 2019

@patrickhulce sure thing!

@rviscomi (Member, Author) commented Jun 4, 2019

@patrickhulce hoping to finalize metrics for all chapters today. This one LGTM, just want to make sure you're happy with the current list. I've updated #8 (comment) with the latest metrics. If it LGTY you can tick the checkbox and close this issue. Thanks!

@patrickhulce (Contributor)

Yep I'm pretty happy with it! @flowlabs's points are good ones but thornier to execute so we can leave those for a future iteration.

@rviscomi rviscomi closed this as completed Jun 4, 2019
Web Almanac 2019 automation moved this from In Progress to Done Jun 4, 2019
@jasti commented Jun 4, 2019

@rviscomi, sorry for reopening and the delay. Just 3 more metrics to consider, if possible:

  1. This is specific to ads, but one problem is the number of times an ad redirects (also known as waterfalling). It's not uncommon for an ad to waterfall 3 to 4 times. Ideally it'd be great to highlight how many times an ad is being waterfalled (whenever an ad is nested into a different iframe origin), but at the very least I want to confirm we are capturing the metrics for the entire ad waterfall chain.

  2. A common issue with understanding the impact of ads on the web is that the denominator tends to include the entire web, not just the pages that have at least one ad. Therefore, when looked at in aggregate, ads may account for, say, <10% of resources, but from an individual UX point of view the impact is significant. For example, https://www.nytimes.com/interactive/2015/10/01/business/cost-of-mobile-ads.html should be a really big deal.

  3. I see video being categorized separately, but it'd be useful to further break down video under advertising. This is by far the biggest culprit.

@jasti jasti reopened this Jun 4, 2019
Web Almanac 2019 automation moved this from Done to In Progress Jun 4, 2019
@rviscomi (Member, Author) commented Jun 4, 2019

No problem, thanks for the suggestions @jasti!

@patrickhulce can you add these metrics if they sound good to you?

@patrickhulce (Contributor)

Great suggestions @jasti, thanks! They're mostly advertising performance-specific; I wonder if this suggests it's worth breaking advertising out into its own category entirely in the future. We could certainly dive much deeper on a lot of ad-specific topics than might make sense in an overall third-party section.

> at the very least want to confirm we are capturing the metrics for the entire ad waterfall chain.

We're not really examining any sort of TTI/load metric for ad frames specifically. I'm not aware of anything that captures this easily right now, actually; was there a particular metric you were hoping to examine?

> A common issue with understanding the impact of ads on the web is that the denominator tends to include the entire web and not just the web that has at least one ad on the page

This is a good point though I'm not sure of the solution when we are indeed trying to look at the web as a whole. Maybe the generic version is "what do all of these metrics look like on the subset of pages that include at least one third-party resource". I expect those numbers to be pretty similar in absence of the ad-specific nature though.

Idea: what if we added median page-relative % for all of the metrics above? i.e. compute the number of ad requests as a percentage of the total requests on each page and take the median of that number.

(Sorry if your idea was the same and I just misunderstood 😄, hurray for secretly being on the same page then!)
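A minimal sketch of that page-relative computation, with made-up numbers: the third-party share is computed per page first, and the median is taken across pages, so a handful of request-heavy pages can't dominate the aggregate the way they do in a whole-corpus percentage.

```python
# Sketch of the "median page-relative percentage" idea; the page tuples
# are illustrative data, not HTTP Archive results.
from statistics import median

# (third_party_requests, total_requests) per page
pages = [(10, 100), (45, 50), (0, 20), (30, 60)]

per_page_pct = [100 * tp / total for tp, total in pages if total > 0]
median_page_relative = median(per_page_pct)  # median of [10.0, 90.0, 0.0, 50.0] -> 30.0
```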

> I see video being categorized separately but it'd be useful to further break down video under advertising. This is by far the biggest culprit.

I'm not 100% sure what you mean here; as in, you'd like to see video advertising companies' script execution separate from other advertisers' script execution in the percentage totals?

The categories proposed for the almanac don't break Video out separately as a third-party category, and the metrics I've seen thus far don't suggest we should (given that resource type will already be highlighting video separately fairly well). If it turns out in early metrics that the other category is significantly dominated by Video entities, then I like splitting it back out 👍

(I've updated the existing metrics list to say "by third-party category by resource type" to indicate that they will be nested too 👍 )


Boiling all of this down this is what I think we'd add...

  • p90, p99 number of redirects per resource classified as a third-party resource broken down by resource type. (This seems like it would require a custom metric of some sort to build the chains properly, I'm not 100% sure what's possible in the planned almanac scripts so we'll see)
  • Median page-relative percentage of requests that are third party requests broken down by resource type.
  • Median page-relative percentage of requests that are third party requests broken down by third party category.
  • Median page-relative percentage of total bytes that are from third party requests broken down by resource type.
  • Median page-relative percentage of total bytes that are from third party requests broken down by third party category.
  • Median page-relative percentage of total script execution time that is from third party scripts broken down by third party category.
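For the p90/p99 metric in the first bullet, the aggregation itself is straightforward once the redirect chains are reconstructed (the hard part noted above); here's a nearest-rank percentile sketch with made-up redirect counts.

```python
# Sketch of the proposed p90/p99 redirect metric. Assumes each ad
# resource's redirect chain has already been rebuilt; `redirect_counts`
# is illustrative data.
import math

def percentile(values, p):
    """Nearest-rank percentile (0 < p <= 100) of `values`."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# number of redirects observed per third-party ad resource
redirect_counts = [0, 0, 0, 1, 1, 2, 2, 3, 4, 7]
p90 = percentile(redirect_counts, 90)  # -> 4
p99 = percentile(redirect_counts, 99)  # -> 7
```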

Does that sound good to you, or have I missed the point?

@rviscomi rviscomi added the ASAP This issue is blocking progress label Jun 6, 2019
@rviscomi (Member, Author) commented Jun 7, 2019

@jasti do the metrics in @patrickhulce's comment work for you? hoping to close this out today.

@rviscomi (Member, Author)

Ping for the open question to @jasti

@patrickhulce (Contributor)

@jasti I hope I captured what you were looking for. I've added the "Median page-relative percentage" metrics to the final list 👍

@jasti commented Jun 10, 2019

Perfect, thank you @patrickhulce!

@jasti jasti closed this as completed Jun 10, 2019
Web Almanac 2019 automation moved this from In Progress to Done Jun 10, 2019
@rviscomi rviscomi removed the ASAP This issue is blocking progress label Sep 25, 2019
allemas added a commit that referenced this issue Mar 6, 2020
tunetheweb added a commit that referenced this issue Mar 6, 2020
@gregorywolf gregorywolf mentioned this issue Sep 12, 2020