Finalize assignments: Chapter 20. HTTP/2 #22

Closed
rviscomi opened this issue May 21, 2019 · 44 comments

@rviscomi rviscomi commented May 21, 2019

| Section | Chapter | Authors | Reviewers |
| --- | --- | --- | --- |
| IV. Content Distribution | 20. HTTP/2 | @bazzadp | @bagder @rmarx @dotjs |

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter expert (author)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

  • Adoption rate of HTTP/2 by site (home page only) and by requests (all requests on the page) over the years. Trend graph over all available years.
  • Measure of HTTP version negotiated (0.9, 1.0, 1.1, 2, gQUIC) for main page of all sites, and for HTTPS sites. Table for last crawl. For example:
| Version | All sites | HTTPS only sites |
| --- | --- | --- |
| HTTP/0.9 | 0% | 0% |
| HTTP/1.0 | 2% | 0% |
| HTTP/1.1 | 48% | 20% |
| HTTP/2 | 44% | 70% |
| gQUIC | 6% | 10% |

For gQUIC it will be sites that return Alt-Svc HTTP Header which starts with quic.

  • Average percentage of resources loaded over HTTP/2 (or gQUIC) versus HTTP/1.1 per site. Trend graph over all available years.
  • Number of HTTP (not HTTPS) sites which return upgrade HTTP header containing h2. Once off stat for last crawl.
  • Number of HTTPS sites using HTTP/2 which return upgrade HTTP header containing h2. Once off stat for last crawl.
  • Number of HTTPS sites not using HTTP/2 which return upgrade HTTP header containing h2. Once off stat for last crawl.
  • % of sites affected by CDN prioritization issues (H2 and served by CDN) - https://github.com/andydavies/http2-prioritization-issues#cdns--cloud-hosting-services. If not possible then maybe just list sites by CDN and can then manually vlookup from table in Andy's github issue? Once off stat for last crawl.
  • Count of HTTP/2 sites grouped by server HTTP header value but strip version numbers (e.g. Apache, Apache 2.4.28 and Apache 2.4.29 should all report as Apache, but Apache Tomcat should report as Tomcat; we will probably need to massage the results to achieve this). Once off stat for last crawl.
  • Count of non-HTTP/2 sites grouped by server HTTP header value but strip version numbers. Once off stat for last crawl.
  • Count of HTTP/2 sites which use HTTP/2 Push. Trend graph over all available years.
  • Average number of HTTP/2 Pushed resources and average bytes. Once off stat for last crawl.
  • Count and number of bytes pushed by asset type (CSS, JS, Images...etc.). Once off stat for last crawl.
  • Count of preload HTTP Headers with nopush attribute set. Once off stat for last crawl.
  • Is it possible to see HTTP/2 Pushed resources which are not used on the page load?
  • Measure number of TCP Connections per site. Average number of domains per site still going down year on year as per HTTP Archive State of the Web report? Trend graph over all available years.
  • Measure average number of TCP Connections per site for HTTP/1.1 sites versus HTTP/2 sites. Once off stat for last crawl.
  • Count of HTTP/2 sites grouped by SETTINGS_MAX_CONCURRENT_STREAMS (including HTTP/2 sites which don't set this value). Note this was added recently as per #22 (comment). Once off stat for last crawl.
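
As a rough illustration of the Server-header grouping described in the list above, here is a minimal Python sketch. The `normalize_server` helper, the alias map, and the sample header values are all hypothetical; a real analysis would more likely do this inside the query and would need a much longer alias list.

```python
import re
from collections import Counter

# Hypothetical alias map: header prefixes that should be reported under another label.
ALIASES = {"apache tomcat": "Tomcat"}


def normalize_server(value: str) -> str:
    """Reduce a Server header value to a bare product name without versions."""
    text = value.strip()
    lowered = text.lower()
    for prefix, label in ALIASES.items():
        if lowered.startswith(prefix):
            return label
    # Keep only the leading product token, dropping "/2.4.29", " 2.4.28", "(Ubuntu)", etc.
    match = re.match(r"[A-Za-z][A-Za-z_-]*", text)
    return match.group(0) if match else "(unknown)"


# Made-up sample values, for illustration only.
headers = [
    "Apache",
    "Apache 2.4.28",
    "Apache/2.4.29 (Ubuntu)",
    "Apache Tomcat/9.0.1",
    "nginx/1.13.1",
]
counts = Counter(normalize_server(h) for h in headers)
print(counts.most_common())
```

Checking the alias map before the generic rule is what keeps "Apache Tomcat" from being folded into "Apache".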

👉 AI (@bazzadp): Finalize which metrics you might like to include in an annual "state of HTTP/2" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the HTTP/2 landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

@rviscomi rviscomi commented May 21, 2019

Hey @mnot, Paul says he reached out to you and you may be interested in being the designated subject matter expert for the H2 chapter. You can learn more about the Almanac project here.

Let me know if you have any questions about the time/deliverable expectations and if you're able to commit.

@rviscomi rviscomi transferred this issue from HTTPArchive/httparchive.org May 21, 2019
@rviscomi rviscomi added this to the Chapter planning complete milestone May 21, 2019
@rviscomi rviscomi added this to TODO in Web Almanac via automation May 21, 2019
@rviscomi rviscomi changed the title [Web Almanac] Finalize assignments: Chapter 20. HTTP/2 Finalize assignments: Chapter 20. HTTP/2 May 21, 2019
@rviscomi rviscomi moved this from TODO to In Progress in Web Almanac May 21, 2019

@bazzadp bazzadp commented May 21, 2019

I’m happy to help out here if you need help in this section? Just published a book on HTTP/2 (https://www.manning.com/books/http2-in-action) so spent last couple of years digging into this topic and what it’s meant in real world since launch. I don’t know HTTP Archive or BigQuery though, but know SQL very well so sure I can learn it with a few pointers in the right direction. Or just help interpret the results others serve up, or review writing or whatever.

@rviscomi rviscomi commented May 21, 2019

Hey @bazzadp thanks for reaching out, we'd love to have you!

Sounds like you're a great fit for the subject matter expert role, driving the direction of the metrics included in the chapter and writing your interpretations. I'll put your name down.

@rviscomi rviscomi assigned bazzadp and unassigned rviscomi and paulcalvano May 21, 2019

@bazzadp bazzadp commented May 22, 2019

Potential metrics (just rough thoughts for now but will update):

  • Adoption rate (by site =~ 35% and by traffic =~ 60 %)
  • Measure of all HTTP versions (0.9, 1.0, 1.1, 2). What about gQUIC - see below?
  • How are QUIC sites (e.g. www.google.com on Chrome) reported? HTTP/2? QUIC+HTTP/2? Other?
  • % of sites affected by CDN prioritization issues (H2 and served by CDN)
  • Which CDN / server software are people using for HTTP/2? Check server HTTP header value of sites that support HTTP/2.
  • Push usage? Expect it to be low but would be good to actually quantify.
  • Domain sharding - is it becoming less common with HTTP/2? Average number of domains per site going down? Measure number of domains used by HTTP/2 sites this year and in previous years when not HTTP/2 enabled?
  • QUIC and HTTP/3 support (either h3 or quic in TLS ALPN record, or h3 or quic in Alt-Svc HTTP header)? Or one for next year? gQUIC has been here for some time and seen an uptick in CDN support for that so think we should include this year.
  • Are HTTP/2 settings data exposed in HTTP archive? E.g. max number of concurrent streams? Header Table Size?
  • Any way of measuring HPACK? Maybe link to compression chapter?

Also some stats here for validation that in right ballpark: https://http2.netray.io/stats.html but obviously as an HTTP Archive report we should use HTTP Archive stats.

@rviscomi rviscomi commented May 22, 2019

That's a great list, big +1 to everything. @pmeenan could you help answer some of Barry's questions? (also let us know if you'd be interested in reviewing this chapter)

Adoption rate (by site =~ 35% and by traffic =~ 60 %)

Did you have something in mind to weigh adoption by traffic? The HTTP Archive dataset doesn't currently include popularity signals.

@bazzadp bazzadp commented May 22, 2019

And there's where my lack of HTTP Archive knowledge comes into play! :-)

This shows sites: https://w3techs.com/technologies/details/ce-http2/all/all
This shows usage: https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&measure=HTTP_RESPONSE_VERSION

So you can say 60% of web traffic is HTTP/2 but that's dominated by the big boys. Or you can say 35% of sites are HTTP/2. Both are correct but depends what metric you want.

This is something that probably should be decided at a project level as will affect all other chapters too. And if, as you say, HTTP Archive only has one then that's an easy question to answer!

But that begs the next question (also a project level question): should we just use HTTP Archive stats? Or also include other stats like the two examples above? I can understand if want to just use HTTP Archive but thought I'd ask the question in case artificially limiting myself here!

@rviscomi rviscomi commented May 22, 2019

Great questions.

So you can say 60% of web traffic is HTTP/2 but that's dominated by the big boys. Or you can say 35% of sites are HTTP/2. Both are correct but depends what metric you want.

Let's stick with "% of websites" or "% of all requests". We include a chart of the latter in our State of the Web report on the website.

should we just use HTTP Archive stats? Or also include other stats like the two examples above? I can understand if want to just use HTTP Archive but thought I'd ask the question in case artificially limiting myself here!

I'd say let's exhaust all of the stats we can extract from the HTTP Archive dataset first, and if we still can't paint a complete picture, then it makes sense that we should pull in outside research and cite it accordingly. Things like % of sites vs traffic are just matters of perspective, but if we're missing out on a key metric then that's worth outsourcing.

@bagder bagder commented May 23, 2019

Just a note on the "say 60% of web traffic" @bazzadp: that's 60% of all browser traffic. It might be worth considering that HTTP/2 is only ever attempted when doing HTTPS, which is now on around 80% of the browser page loads so that makes the amount of HTTPS loads done by Firefox that uses HTTP/2 to be 75%.
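
The arithmetic behind that 75% figure can be spelled out in a few lines of Python. This is only a sketch using the round numbers quoted in this thread, and the variable names are illustrative.

```python
# Round figures quoted in the thread, as whole percentages of browser page loads.
h2_share_of_all_loads = 60      # loads that use HTTP/2
https_share_of_all_loads = 80   # loads that use HTTPS

# Assumption from the comment above: HTTP/2 is only attempted over HTTPS,
# so every HTTP/2 load is also an HTTPS load.
h2_share_of_https_loads = h2_share_of_all_loads / https_share_of_all_loads

print(f"{h2_share_of_https_loads:.0%} of HTTPS loads use HTTP/2")
```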

@bazzadp bazzadp commented May 23, 2019

Very valid point!

@bagder has also agreed to review this section @rviscomi so could you update the original comment here and the other matrix?

@rviscomi rviscomi commented May 23, 2019

Sounds great, I've added @bagder and sent an invitation to the @HTTPArchive/reviewers team.

@bazzadp bazzadp commented May 23, 2019

You’ve a typo in his username: @bagder Am sure he gets this a lot - I know I’ve mistyped it like that before! :-)

@bagder bagder commented May 23, 2019

I'm mr typo. 😁

@rviscomi rviscomi commented May 23, 2019

Hah! That explains the autocomplete fail :)

@andydavies andydavies commented May 23, 2019

Back in Jan about 26% of the traffic over Akamai's network was H2 - https://developer.akamai.com/blog/2019/01/31/http2-discover-performance-impacts-effective-prioritization

@bagder bagder commented May 23, 2019

Regarding "Average number of domains per site going down?", I know HTTP Archive already tracks the number of TCP connections needed, which is of course more a result of the number of domains used and/or HTTP/2-"unsharding" of them.

@bazzadp bazzadp commented May 23, 2019

Back in Jan about 26% of the traffic over Akamai's network was H2 - https://developer.akamai.com/blog/2019/01/31/http2-discover-performance-impacts-effective-prioritization

I was wondering why so low? I would have expected CDN traffic to be ahead of average, not behind, assuming it's on by default for HTTPS users. But I see they only enabled that since March (https://blogs.akamai.com/2019/03/http2-will-be-automatically-enabled-by-default-on-the-akamai-intelligent-edge-platform.html). Wonder what that percentage is now since that change?

@rmarx rmarx commented May 23, 2019

I am also interested in acting as a reviewer for this chapter :)

@pmeenan pmeenan commented May 24, 2019

  • How are QUIC sites (e.g. www.google.com on Chrome) reported? HTTP/2? QUIC+HTTP/2? Other?
  • QUIC and HTTP/3 support (either h3 or quic in TLS ALPN record, or h3 or quic in Alt-Svc HTTP header)? Or one for next year? gQUIC has been here for some time and seen an uptick in CDN support for that so think we should include this year.

I think the "alt-svc" response header is the only thing we capture that can help with this (at least without processing the raw trace files). The ALPN details for a connection aren't kept though it might be possible to add later.

  • Push usage? Expect it to be low but would be good to actually quantify.

You might be shocked and dismayed because of the automatic translation from "preload" response header to PUSH, it happens WAY more often than I'd like. It may be a small overall % of sites but would be interesting to deep dive into the distribution of number of pushed resources and bytes for those that do use push.

  • Are HTTP/2 settings data exposed in HTTP archive? E.g. max number of concurrent streams? Header Table Size?

No, not currently anyway.

  • Any way of measuring HPACK? Maybe link to compression chapter?

As in number of slots available or something else?

For the ALPN and H2 settings that might be of interest, if you file an issue with wptagent I may be able to add the connection-level protocol details (or at least whatever I can extract from the netlog).

@pmeenan pmeenan commented May 24, 2019

I just added a few connection-level fields to the WebPageTest data collection. They will only be reported for the first request on a given connection (same request that has the connect timings):

"http2_server_settings":{
  "SETTINGS_MAX_HEADER_LIST_SIZE": 16384,
  "SETTINGS_MAX_CONCURRENT_STREAMS": 100,
  "SETTINGS_INITIAL_WINDOW_SIZE": 1048576
},
"tls_resumed": "False",
"tls_next_proto": "h2",
"tls_cipher_suite": 4865,
"tls_version": "TLS 1.3",

The June 1 crawl will include the data. I didn't see any other TLS or H2 session-level settings exposed in the netlog but hopefully this helps.

@bazzadp bazzadp commented May 24, 2019

Wow thanks @pmeenan for the quick work! #10 might be interested in these extra settings if doing any TLS analysis? Was TLS version new @pmeenan - surely that was captured before?

You might be shocked and dismayed because of the automatic translation from "preload" response header to PUSH, it happens WAY more often than I'd like.

Good point! Would be interesting to see how many people use the HTTP preload header instead of the HTML preload tag. And of those that do, what percentage are pushed and, perhaps equally interesting, what percentage are not (I don't think all CDNs and servers use the preload header as a push signal). And also which ones have nopush and non-standard attributes like x-http2-push-only. Perhaps some crossover with #21 here?
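
To make the preload/nopush check concrete, here is a minimal Python sketch that pulls `rel=preload` entries out of a `Link` response header and flags the ones carrying `nopush`. The function name and sample header are hypothetical, and the parsing is deliberately naive: it splits on commas, so it would mis-handle URLs that themselves contain commas.

```python
def parse_preload_links(link_header: str):
    """Return (url, has_nopush) for each rel=preload entry in a Link header value."""
    results = []
    for entry in link_header.split(","):
        parts = [p.strip() for p in entry.split(";")]
        target = parts[0]
        # A link target must be wrapped in angle brackets, e.g. </css/site.css>.
        if not (target.startswith("<") and target.endswith(">")):
            continue
        params = [p.lower() for p in parts[1:]]
        # Accept both rel=preload and rel="preload".
        is_preload = any(
            p.replace('"', "").replace(" ", "") == "rel=preload" for p in params
        )
        if is_preload:
            results.append((target[1:-1], "nopush" in params))
    return results


# Made-up header value, for illustration only.
sample = "</css/a.css>; rel=preload; as=style, </js/b.js>; rel=preload; as=script; nopush"
for url, has_nopush in parse_preload_links(sample):
    print(url, "nopush" if has_nopush else "may be pushed")
```

Comparing the entries without `nopush` against the resources actually marked as pushed would give the "pushed versus not pushed" split asked about above.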

It may be a small overall % of sites but would be interesting to deep dive into the distribution of number of pushed resources and bytes for those that do use push.

Agreed!

Any way of measuring HPACK? Maybe link to compression chapter?

As in number of slots available or something else?

More thinking what's the saving thanks to HPACK? A bit of an update to this blog post. Can you think of any way to report on this?

Other metrics, this conversation is making me think of:

  • Upgrade header sent on HTTP connection (so HTTP/2 available but not used as did not use HTTPS).
  • Upgrade header sent on HTTP/1.1 HTTPS connection (so HTTP/2 available but not used despite HTTPS being used - Blacklisted Cipher used? Lack of ALPN support?).
  • Upgrade header sent on HTTP/2 connection (Apache does this for example and it causes problems).
  • Push resource without preload header
  • Pushed resources that won't be used. E.g. Pushed fonts or other crossorigin resources, SRI pushed resources - though not sure that last one is feasible? Unless it's possible to check for a page which pushes a resource AND also downloads it as well? Resources that pushed fine, but which page doesn't load I'm guessing are more difficult (impossible?) to track.

At some point I need to stop adding to the metric list and whittle it down to the ones we actually want to use :-) But will keep brainstorming for a little more (and others do pipe in too!) and then do that...

@pmeenan pmeenan commented May 24, 2019

Was TLS version new @pmeenan - surely that was captured before?

Yeah, the "securityDetails" included the TLS version so I guess I'm just capturing it at a different layer.

"securityDetails":{
  "certificateId":0,
  "protocol":"TLS 1.3",
  "keyExchange":"",
  "validTo":1564484100,
  "certificateTransparencyCompliance":"compliant",
  "sanList":[
    "www.google.com"
  ],
  "subjectName":"www.google.com",
  "keyExchangeGroup":"X25519",
  "validFrom":1557228737,
  "signedCertificateTimestampList":[
  {
    "status":"Verified",
    "origin":"TLS extension",
    "logDescription":"Google 'Pilot' log",
    "signatureData":"30450221009C42E3B4400B1ACB5FCB0D8BE97D6E97EA410734B4EF1A3D7E16E941F0433B1802201D53097B43B24C8EFDCD1CAD366BC98A6485C029AD75F513E60D661C123D1A80",
    "timestamp":1557230646581,
    "hashAlgorithm":"SHA-256",
    "logId":"A4B90990B418581487BB13A2CC67700A3C359804F91BDFB8E377CD0EC80DDC10",
    "signatureAlgorithm":"ECDSA"
  },
  {
    "status":"Verified",
    "origin":"TLS extension",
    "logDescription":"DigiCert Log Server",
    "signatureData":"304502207085B84006458EEDE24FD05BABBA437EE227BE1FD4A7A72BCBB7C4AE7EA5851A0221008E680BFC42F28C895EB9CB9F711212C0A36B8EA0473DC2740AF0A4A477DE5565",
    "timestamp":1557230646578,
    "hashAlgorithm":"SHA-256",
    "logId":"5614069A2FD7C2ECD3F5E1BD44B23EC74676B9BC99115CC0EF949855D689D0DD",
    "signatureAlgorithm":"ECDSA"
  }
  ],
  "cipher":"AES_128_GCM",
  "issuer":"Google Internet Authority G3"
},

The ALPN/NPN info wasn't in the security details so I just went ahead and pulled all of the TLS data reported by the connection.

@pmeenan pmeenan commented May 24, 2019

More thinking what's the saving thanks to HPACK? A bit of an update to this blog post. Can you think of any way to report on this?

If the netlog included the raw HTTP/2 frame events I could calculate the size of the HEADERS frame relative to the decoded headers but looking through the raw netlog events it doesn't look like it does. At best I could infer it from the socket bytes events right beside it but that may include other frames as well and feels too fragile.

@mnot mnot commented May 28, 2019

It looks like this is in good hands; I'm going to put some suggestions in a few other places.

@rviscomi rviscomi commented May 28, 2019

Thanks @mnot! You're still welcome to contribute to this chapter as a coauthor or reviewer if interested.

@bazzadp bazzadp commented May 28, 2019

Definitely! Or if you've any thoughts on what stats to measure then let us know.

@dotjs dotjs commented May 29, 2019

Happy to be added as a reviewer.

@rviscomi rviscomi commented May 29, 2019

Thanks @dotjs! Happy to add you as a reviewer. Is your first/last name public anywhere?

@dotjs dotjs commented May 29, 2019

Yes, Andrew Galloni - Updated my profile

@rviscomi rviscomi commented May 29, 2019

👍 👍 Thanks!

@bazzadp bazzadp commented May 30, 2019

I've updated the stats to the following:

  • Adoption rate (by site =~ 35% and by requests)
  • Measure of all HTTP versions (0.9, 1.0, 1.1, 2) for all sites.
  • Measure of all HTTP versions (0.9, 1.0, 1.1, 2) for HTTPS sites.
  • Number of HTTP (not HTTPS) sites which return upgrade HTTP header containing h2
  • Number of HTTPS sites using HTTP/2 which return upgrade HTTP header containing h2
  • Number of HTTPS sites not using HTTP/2 which return upgrade HTTP header containing h2
  • % of sites affected by CDN prioritization issues (H2 and served by CDN) - https://github.com/andydavies/http2-prioritization-issues#cdns--cloud-hosting-services
  • Count of HTTP/2 sites grouped by server HTTP header value but strip version numbers.
  • Count of non-HTTP/2 sites grouped by server HTTP header value but strip version numbers.
  • Count of HTTP/2 sites which use HTTP/2 Push.
  • Average number of HTTP/2 Pushed resources and average bytes. By Desktop and Mobile.
  • Is it possible to see HTTP/2 Pushed resources which are not used on the page load?
  • Domain sharding - is it becoming less common with HTTP/2? Measure number of TCP Connections per site. Average number of domains per site going down? Is it possible to measure the number of domains used by HTTP/2 sites this year and in previous years when not HTTP/2 enabled?
  • QUIC - % of sites which return Alt-Svc HTTP Header which starts with quic.
  • Count of HTTP/2 sites grouped by SETTINGS_MAX_CONCURRENT_STREAMS (including sites which don't set this value).
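
For the QUIC metric in the list above, the check is just a prefix test on the `Alt-Svc` response header. A minimal Python sketch follows; the `advertises_gquic` helper and the sample header dictionaries are hypothetical, and the header value only illustrates the general shape of a gQUIC advertisement.

```python
def advertises_gquic(response_headers: dict) -> bool:
    """True if any Alt-Svc header value starts with "quic" (case-insensitive)."""
    for name, value in response_headers.items():
        if name.lower() == "alt-svc" and value.strip().lower().startswith("quic"):
            return True
    return False


# Made-up home page responses, for illustration only.
sites = {
    "https://a.example/": {"Alt-Svc": 'quic=":443"; ma=2592000', "Server": "gws"},
    "https://b.example/": {"Server": "nginx"},
    "https://c.example/": {"alt-svc": 'h2="alt.example:443"'},
    "https://d.example/": {"Server": "Apache"},
}
quic_sites = [url for url, headers in sites.items() if advertises_gquic(headers)]
share = len(quic_sites) / len(sites)
print(f"{share:.0%} of sampled sites advertise gQUIC")
```

Note that this measures sites that advertise gQUIC, not pages that were actually loaded over it.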

The current HTTP Archive State of the Web lists mobile and desktop but think only the number and bytes pushed should differ between mobile and desktop.

Any other suggestions from anyone? Particularly the reviewers (@bagder , @rmarx , @dotjs)?

@rviscomi , @pmeenan - can you see any of these being a problem? Not sure I'll use all of these, depending on whether they show interesting information or not, so if any are particularly hard to get, or there's too many stats, then let me know.

@dotjs dotjs commented May 31, 2019

  • Percentage of requests over h2 compared to h1
  • Number of requests per TCP connection by type and for h2 how many hosts.
    What I would like to get a handle on is how many connections and what type are utilised for various performance measures.

Server push

  • bytes by content type. Interested in what is being pushed
  • number of client resets
  • number of no push headers sent

HPACK

  • +1 for a way to measure compression ratio for headers
  • table sizes

@mnot mnot commented May 31, 2019

Capturing the mix of h2 and h1 on a single page load would also be interesting, as would the total number of connections per page load in relation to that.

@rviscomi rviscomi commented Jun 1, 2019

Is it possible to see HTTP/2 Pushed resources which are not used on the page load?

Not familiar with server push or how it appears in WPT results. Two questions:

  1. What are the distinguishing characteristics of a pushed resource?
  2. When the client attempts to use a pushed resource, is there some kind of artifact left in the network logs, similar to how a 304 response tells the client to use what it has in cache?

Is it possible to measure the number of domains used by HTTP/2 sites this year and in previous years when not HTTP/2 enabled?

To complicate things, we've been increasing our sample size ~8x since last year, so many of the sites in today's dataset were not available last year. So this metric might not be reliable.

@bazzadp bazzadp mentioned this issue Jun 1, 2019
3 of 3 tasks complete

@bazzadp bazzadp commented Jun 1, 2019

Incorporated some of these.

@dotjs some comments on yours:

  • "number of client resets" not sure HTTP Archive would be best measure of this as it's a crawler and presumably crawls with an empty cache? Think we'd need some sort of RUM measurement here (Mozilla Firefox Telemetry?) to get meaningful stats out of this.
  • "table sizes" not sure how to measure these, or what you're trying to get out of this? Nginx for example, which doesn't use the dynamic table as I understand, doesn't send a table size of 0 in its SETTINGS frame.

@rviscomi, a pushed resource has a "SERVER PUSHED" attribute in WebPageTest as shown below.

[Screenshot: WebPageTest request details showing the "SERVER PUSHED" attribute]

When a client uses a pushed resource it sets the Initiator to "Push/Other" in Chrome Dev tools Network tab:

[Screenshot: Chrome Dev Tools Network tab showing the "Push/Other" initiator]

When an asset is pushed that is NOT actually needed by the page, it doesn't show in Dev Tools Network tab at all (but is hidden in the net-internals page). Though this is complicated if the preload header is used (which is often also used as a signal to push). In this case the very presence of the preload header means Chrome thinks it is needed by the page and so does show it in the Network tab. Sigh it's complicated...

WebPageTest seems to always show pushed resources (whether preload header is included or not), but can't see any way it indicates if an asset is pushed, but then not subsequently referenced on the page.

@pmeenan not sure if you've any thoughts on whether possible to measure unnecessarily pushed resources?

@dotjs dotjs commented Jun 3, 2019

I was considering the encoding table sizes. For nginx there is a patch that sets the default table size to 4096 https://github.com/cloudflare/sslconfig/blob/hpack_1.13.1/patches/nginx_1.13.1_http2_hpack.patch
for example.

@rviscomi rviscomi commented Jun 3, 2019

@bazzadp thanks for the context. It sounds like detecting unused pushes should be possible using the resource metadata in HTTP Archive.

  • access to headers to detect SERVER PUSHED
  • access to request initiators to detect used pushes
  • access to headers to detect preload

So we can check for resources that have been pushed but not initiated or preloaded.
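
That check could be sketched roughly as below. Every field name here (`url`, `was_pushed`, `initiator`) is hypothetical rather than an actual HTTP Archive column, and whether the crawl data really exposes these signals is still an open question in this thread.

```python
def find_unused_pushes(requests, preloaded_urls):
    """Return URLs that were pushed but neither requested by the page nor preloaded.

    Hypothetical fields on each request record:
      - "url": the resource URL
      - "was_pushed": True if the response arrived via server push
      - "initiator": what triggered the request, or None if nothing on the page did
    """
    unused = []
    for req in requests:
        if not req.get("was_pushed"):
            continue
        claimed_by_page = bool(req.get("initiator"))
        preloaded = req["url"] in preloaded_urls
        if not claimed_by_page and not preloaded:
            unused.append(req["url"])
    return unused


# Made-up records, for illustration only.
sample_requests = [
    {"url": "/app.css", "was_pushed": True, "initiator": "parser"},
    {"url": "/font.woff2", "was_pushed": True, "initiator": None},
    {"url": "/hero.jpg", "was_pushed": True, "initiator": None},
    {"url": "/app.js", "was_pushed": False, "initiator": "parser"},
]
sample_preloads = {"/hero.jpg"}
print(find_unused_pushes(sample_requests, sample_preloads))
```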

@pmeenan does this sound accurate?

@bazzadp bazzadp commented Jun 3, 2019

I was considering the encoding table sizes. For nginx there is a patch that sets the default table size to 4096 https://github.com/cloudflare/sslconfig/blob/hpack_1.13.1/patches/nginx_1.13.1_http2_hpack.patch
for example.

Yeah, as I say I was aware of that and so I tested that on an unpatched Nginx, expecting a table size of 0 in the initial connection settings but didn't see that. I presume therefore that, without this patch, Nginx handles indexed headers on incoming requests, but just doesn't use them on responses? If so there is no need to explicitly set a table size of 0 if it never uses "indexed header" type in responses, which would explain my observations: the table size is left at the default but just never used for responses. Which means it is not possible to measure this metric (though I agree it would be a good one if we could!).

There's a lot of assumptions in there, so happy to be proven wrong if someone actually knows this or can explain it better?

@bazzadp bazzadp commented Jun 3, 2019

@bazzadp thanks for the context. It sounds like detecting unused pushes should be possible using the resource metadata in HTTP Archive.

Excellent if that's the case! I've left that in there as one of the metrics. So that list in the first comment is all I can think of, so have marked the "Finalise metrics" tickbox as done.

If anyone has any other comments or suggestions in next few hours (or even if after!), then let us know.

@rviscomi should I close this issue?

@rviscomi rviscomi commented Jun 3, 2019

Woohoo, I think you're the first author to finish your metrics (even before me! 😅). Yes, we're ready to close this issue.

Next step will be for the analysts to review your metrics more carefully. That process will happen on the HTTP Archive discussion forum at https://discuss.httparchive.org. For now, it would be great for you to create an account there if you haven't already so we can @ you in the discussion if needed. I'll also be creating a new tracking issue (and corresponding spreadsheet) to monitor the progress of each metric, which I'll share with you and tag you in when ready.

@bazzadp bazzadp commented Jun 3, 2019

For now, it would be great for you to create an account there if you haven't already so we can @ you in the discussion if needed. I'll also be creating a new tracking issue (and corresponding spreadsheet) to monitor the progress of each metric, which I'll share with you and tag you in when ready.

Done. My username in there is "tunetheweb". Probably should change my GitHub username too to match what I more commonly go by nowadays, but have it referenced in a few places so would prefer not to. Hope it doesn't cause too much confusion!

@bazzadp bazzadp closed this Jun 3, 2019
Web Almanac automation moved this from In Progress to Done Jun 3, 2019

@bazzadp bazzadp commented Jun 24, 2019

Hi @paulcalvano, just had a nosey at how you were getting on triaging these metrics and wanted to clarify a few stats that you currently have down as Not feasible/Needs more info:

20.2 - Measure of all HTTP versions (0.9, 1.0, 1.1, 2, QUIC) for main page of all sites, and for HTTPS sites. Table for last crawl.
We can only see the negotiated protocol, not all of the versions supported.

Badly worded on my part so have reworded in first comment above: #22 (comment). I meant the negotiated version for all home pages crawled and not necessarily all the versions supported by that page/site (I presume we will negotiate maximum supported version and every site will support all versions beneath the negotiated version with the possible exception of 0.9). See the example table I created to show what I'm looking for:

| Version | All sites | HTTPS only sites |
| --- | --- | --- |
| HTTP/0.9 | 0% | 0% |
| HTTP/1.0 | 2% | 0% |
| HTTP/1.1 | 48% | 20% |
| HTTP/2 | 44% | 70% |
| gQUIC | 6% | 10% |

As you can see I don't list that HTTP/1.0 is probably supported by 100% of sites but only the 2% of sites that negotiate using that (these are totally made up stats btw but don't think they will be too far off). It's somewhat similar to the first stat requested (the adoption of HTTP/2 over time) but also looks at sites on older versions of HTTP and newer version (QUIC), and I also wanted to look at HTTP/2 usage by HTTP versus HTTPS. I just didn't want to cloud the first stat graph with all that noise hence why I put these in a separate second stat.
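
A table like that could be produced along these lines. This is only a Python sketch under assumed inputs: the `protocol_shares` helper and the `(url, negotiated_protocol)` row shape are hypothetical, and the sample rows are made up just like the percentages above.

```python
from collections import Counter


def protocol_shares(pages):
    """Return (all_sites, https_sites) dicts mapping protocol -> share of home pages.

    `pages` is an iterable of (url, negotiated_protocol) pairs, one per home page.
    """
    def shares(rows):
        counts = Counter(proto for _, proto in rows)
        total = sum(counts.values())
        return {proto: n / total for proto, n in counts.items()} if total else {}

    rows = list(pages)
    https_rows = [row for row in rows if row[0].startswith("https://")]
    return shares(rows), shares(https_rows)


# Made-up crawl rows, for illustration only.
sample_pages = [
    ("http://a.example/", "HTTP/1.1"),
    ("https://b.example/", "HTTP/2"),
    ("https://c.example/", "HTTP/1.1"),
    ("https://d.example/", "HTTP/2"),
]
all_sites, https_sites = protocol_shares(sample_pages)
for proto in sorted(all_sites):
    print(f"{proto}: {all_sites[proto]:.0%} of all, {https_sites.get(proto, 0):.0%} of HTTPS")
```

Only the negotiated protocol is counted, so a site that also supports older versions contributes to exactly one row.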

As per above example table, I suspect most will be HTTP/1.1 or HTTP/2 with a smaller number on gQUIC. Mozilla telemetry suggests some sites still use HTTP/1.0 but they might be internal sites or assets rather than main page so wouldn't be surprised if they don't show in our stats at all. And I don’t expect any to use HTTP/0.9.

20.7 - % of sites affected by CDN prioritization issues (H2 and served by CDN).
Not sure if this is possible with HA data.

Yeah this one wasn't mine but would be interesting to know. Maybe just list HTTP/2 sites by top CDNs (similar stats to 17.1 and 17.2?) and then can manually vlookup based on the known bad ones from Andy's github listing? Not sure how we know if a site is served by a CDN (server header? IP address range?) but if you can get it for the CDN chapter in stats 17.1 and 17.2 then presume there is some way :-)

20.14 - Is it possible to see HTTP/2 Pushed resources which are not used on the page load?
We only see the resources that were used in HA data since rejected push promises are not logged in the network panel.

Fair enough, thought this one might be difficult. There were some comments above in #22 (comment) but it sounds tricky to be honest so happy to skip.

Count of HTTP/2 sites grouped by SETTINGS_MAX_CONCURRENT_STREAMS (including sites which don't set this value). Once off stat for last crawl.
We don't have H2 frame data

@pmeenan added this stat as per #22 (comment) above so stats should be in June crawl. Note if this value is not explicitly set at connection set up like in that example, then it defaults to unlimited so will need to account for that. Also this stat should only be captured for HTTP/2 sites.
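
A rough Python sketch of that grouping, including the "not set means unlimited" case, might look like this. The field names `http2_server_settings` and `tls_next_proto` come from the example fields earlier in this thread; the `group_by_max_streams` helper, the use of `tls_next_proto == "h2"` to pick out HTTP/2 connections, and the sample records are assumptions.

```python
from collections import Counter


def group_by_max_streams(first_requests):
    """Count HTTP/2 connections by their SETTINGS_MAX_CONCURRENT_STREAMS value.

    `first_requests` holds the first request of each connection, since that is
    where the connection-level fields are reported.
    """
    counts = Counter()
    for req in first_requests:
        if req.get("tls_next_proto") != "h2":
            continue  # assumption: only h2 connections count as HTTP/2
        settings = req.get("http2_server_settings") or {}
        # If the server never sends the setting, there is no limit.
        value = settings.get("SETTINGS_MAX_CONCURRENT_STREAMS", "unlimited")
        counts[value] += 1
    return counts


# Made-up records, for illustration only.
sample = [
    {"tls_next_proto": "h2", "http2_server_settings": {"SETTINGS_MAX_CONCURRENT_STREAMS": 100}},
    {"tls_next_proto": "h2", "http2_server_settings": {"SETTINGS_MAX_CONCURRENT_STREAMS": 100}},
    {"tls_next_proto": "h2", "http2_server_settings": {"SETTINGS_INITIAL_WINDOW_SIZE": 1048576}},
    {"tls_next_proto": "http/1.1"},
]
print(group_by_max_streams(sample))
```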

Hope that clarifies some things and allows us to get some more of these. Give me a shout if anything is not clear. And of course, if they are still too difficult to get, then we can live without them.

Thanks,
Barry

@pmeenan pmeenan commented Jun 24, 2019

@bazzadp bazzadp commented Jun 24, 2019

For the prioritization issues, WebPageTest runs its CDN detection as part of the crawl and should get pretty good coverage.

This is based on identifying the CDN via these server headers being set I presume? Was always curious how this worked!

One possible issue will be what origin(s) to look at for a given page? Just checking the page's origin is probably safest but will miss the cases where the static content is served by a different CDN (like all of shopify for example).

Yeah think best we can probably do is test the website home page and accept it's not 100% accurate. Trying to figure out the "most used CDN" for a web page to identify the shopify scenario is probably overly complicated.

@pmeenan pmeenan commented Jun 24, 2019

@rviscomi rviscomi mentioned this issue Jul 23, 2019
14 of 14 tasks complete
@rviscomi rviscomi mentioned this issue Sep 25, 2019
3 of 3 tasks complete