More H2 Queries #176

Merged
merged 29 commits into from
Oct 7, 2019

paulcalvano
Contributor

@paulcalvano paulcalvano commented Sep 26, 2019

Closes #101

@rviscomi rviscomi self-requested a review September 26, 2019 05:29
@rviscomi rviscomi added the analysis (Querying the dataset) and ASAP (This issue is blocking progress) labels Sep 26, 2019
@rviscomi rviscomi added this to the Content written milestone Sep 26, 2019
COUNT(*) AS num_pages,
SUM(IF(url LIKE "https://%", 1, 0)) AS num_https_pages
FROM
`httparchive.requests.2019_07_01_*`
Member

A few suggestions:

  • help put the absolute counts of pages in context of the total number of websites by also calculating the percent of pages
  • use the almanac.requests table which is partitioned and clustered by firstHtml for efficiency
  • sort the results by the percent for easier analysis
#standardSQL
# 20.2 - Measure of all HTTP versions (0.9, 1.0, 1.1, 2, QUIC) for main page of all sites, and for HTTPS sites. Table for last crawl.
SELECT 
  client,
  JSON_EXTRACT_SCALAR(payload, "$._protocol") AS protocol,
  COUNT(0) AS num_pages,
  SUM(COUNT(0)) OVER (PARTITION BY client) AS total,
  COUNTIF(url LIKE "https://%") AS num_https_pages,
  ROUND(COUNT(0) * 100 / SUM(COUNT(0)) OVER (PARTITION BY client), 2) AS pct_pages,
  ROUND(COUNTIF(url LIKE "https://%") * 100 / SUM(COUNT(0)) OVER (PARTITION BY client), 2) AS pct_https
FROM 
   `httparchive.almanac.requests`
WHERE
   firstHtml
GROUP BY
  client,
  protocol
ORDER BY 
  num_pages / total DESC
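
As a small illustration of the percent calculation suggested above, here is a minimal sketch that runs the same window-function pattern over a few made-up rows. The `sample_requests` CTE and its values are hypothetical stand-ins for the firstHtml rows of the almanac table, not real data.

```sql
#standardSQL
# Hypothetical inline rows standing in for one firstHtml request per page.
WITH sample_requests AS (
  SELECT 'mobile' AS client, 'HTTP/2' AS protocol UNION ALL
  SELECT 'mobile', 'HTTP/2' UNION ALL
  SELECT 'mobile', 'http/1.1' UNION ALL
  SELECT 'desktop', 'HTTP/2'
)
SELECT
  client,
  protocol,
  # Pages per (client, protocol) group.
  COUNT(0) AS num_pages,
  # Window over the grouped counts: total pages per client.
  SUM(COUNT(0)) OVER (PARTITION BY client) AS total,
  ROUND(COUNT(0) * 100 / SUM(COUNT(0)) OVER (PARTITION BY client), 2) AS pct_pages
FROM
  sample_requests
GROUP BY
  client,
  protocol
ORDER BY
  client,
  pct_pages DESC
```

Because the window is partitioned by client, each client's percentages should add up to 100 independently of the other client.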

SELECT
_TABLE_SUFFIX AS client,
JSON_EXTRACT_SCALAR(payload, "$._is_base_page") AS is_base_page,
COUNT(*) AS num_requests
Member

I'd also like to see this expressed as a pct

#standardSQL
# 20.04 - Number of HTTP (not HTTPS) sites which return upgrade HTTP header containing h2.
CREATE TEMPORARY FUNCTION getUpgradeHeader(payload STRING)
RETURNS STRING
LANGUAGE js AS """
  try {
    var $ = JSON.parse(payload);
    var headers = $.response.headers;
    var st = headers.find(function(e) { 
      return e['name'].toLowerCase() === 'upgrade'
    });
    return st['value'];
  } catch (e) {
    return '';
  }
""";

SELECT
  client,
  COUNTIF(upgrade) AS freq,
  COUNT(0) AS total,
  ROUND(COUNTIF(upgrade) * 100 / COUNT(0), 2) AS pct
FROM (
  SELECT
    client,
    url LIKE "http://%" AND LOWER(getUpgradeHeader(payload)) LIKE "%h2%" AS upgrade
  FROM 
    `httparchive.almanac.requests`
  WHERE
    firstHtml)
GROUP BY
  client
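
To sanity-check the header extraction before running it over the full table, a sketch like the following could be used. It reuses the UDF from the query above; the `sample_payloads` CTE and its three payload strings are hypothetical test inputs, not rows from the dataset.

```sql
#standardSQL
CREATE TEMPORARY FUNCTION getUpgradeHeader(payload STRING)
RETURNS STRING
LANGUAGE js AS """
  try {
    var $ = JSON.parse(payload);
    var headers = $.response.headers;
    var st = headers.find(function(e) {
      return e['name'].toLowerCase() === 'upgrade';
    });
    // If no upgrade header exists, st is undefined and this throws.
    return st['value'];
  } catch (e) {
    // Malformed JSON or a missing header both fall through to ''.
    return '';
  }
""";

# Hypothetical payloads: one with an upgrade header, one without, one malformed.
WITH sample_payloads AS (
  SELECT 'with h2 upgrade' AS label,
    '{"response":{"headers":[{"name":"Upgrade","value":"h2,h2c"}]}}' AS payload UNION ALL
  SELECT 'no upgrade header',
    '{"response":{"headers":[{"name":"Content-Type","value":"text/html"}]}}' UNION ALL
  SELECT 'malformed', 'not json'
)
SELECT
  label,
  getUpgradeHeader(payload) AS upgrade_header,
  LOWER(getUpgradeHeader(payload)) LIKE '%h2%' AS has_h2
FROM
  sample_payloads
```

Only the first row should come back with `has_h2` set to true; the other two return an empty string from the `catch` branch.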

Contributor Author

@paulcalvano paulcalvano Sep 27, 2019

Seeing a discrepancy in the results between the query in my PR and the one you shared above. In the screenshot below, the query on the left processes the httparchive.requests tables and the one on the right processes the httparchive.almanac.requests table.

The one on the left shows 1,890,586 mobile HTTP/2 pages. The one on the right shows 1,996,270 mobile HTTP/2 pages. I don't see any reason for the discrepancy in the queries. Any ideas?
[Screenshot: Image 2019-09-27 12-01-24]

Member

Great catch. Seems like a difference in requests marked _is_base_page and firstHtml. The former is a WPT annotation while the latter is determined in the HA pipeline.

Looking at the number of base pages:

#standardSQL
SELECT
  _TABLE_SUFFIX AS client,
  COUNT(0)
FROM
  `httparchive.requests.2019_07_01_*`
WHERE
  JSON_EXTRACT_SCALAR(payload, '$._is_base_page') = 'true'
GROUP BY
  client

There are 5,297,105 mobile and 4,371,570 desktop base pages. Compare that to the right-side totals in your screenshot, which are 5,558,214 and 4,550,580, respectively.

The actual number of pages in the summary_pages dataset for the 2019_07 crawl is:

#standardSQL
SELECT
  _TABLE_SUFFIX AS client,
  COUNT(0)
FROM
  `httparchive.summary_pages.2019_07_01_*`
GROUP BY
  client

That returns 5,297,442 and 4,371,973, respectively, which is much closer to the base page numbers. I'm not even sure how we get ~200k more firstHtml requests than there are pages. 😖
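
For reference, the size of that gap can be worked out directly from the totals quoted in this thread. This is only a sketch: the numbers are copied from the comments above, and the column names are made up for the illustration.

```sql
#standardSQL
# Totals quoted in this thread: firstHtml requests (almanac) vs. summary_pages rows.
SELECT
  client,
  first_html_requests,
  summary_pages,
  first_html_requests - summary_pages AS excess,
  ROUND((first_html_requests - summary_pages) * 100 / summary_pages, 2) AS pct_excess
FROM
  UNNEST([
    STRUCT('mobile' AS client, 5558214 AS first_html_requests, 5297442 AS summary_pages),
    STRUCT('desktop', 4550580, 4371973)
  ])
```

The excess works out to 260,772 on mobile and 178,607 on desktop, which is where the rough "~200k" figure comes from.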

#standardSQL
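SELECT
  client,
  page,
  # Alias renamed so it no longer collides with the firstHtml column in HAVING / ORDER BY.
  COUNT(0) AS num_first_html
FROM
  `httparchive.almanac.requests`
WHERE
  firstHtml
GROUP BY
  client,
  page
HAVING
  num_first_html > 1
ORDER BY
  num_first_html DESC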

This reveals MANY pages with more than one firstHtml request, some with 400+. One example page makes over 1,000 requests, many of them to /index.php and /. Maybe the HA pipeline saw the first request for /, marked it as firstHtml, then marked all subsequent requests for that URL as firstHtml too?

In any case, this bug affects all other queries that depend on firstHtml. I think the best path forward for now is to open a bug to get this fixed in the pipeline for future crawls, overwrite all firstHtml values in the almanac dataset to match _is_base_page, rerun all queries that include it, and update all results and written analyses that depend on those queries.

Thanks for pointing this out Paul!

Member

I was about to file the bug and was querying the summary_requests dataset directly to eliminate any almanac dataset shenanigans, but it turns out that there are almanac dataset shenanigans!

The last query I shared in the previous comment produced hundreds of thousands of pages with more than 1 firstHtml request. It should have produced 0. This equivalent query of the summary_requests dataset does produce 0 results:

#standardSQL
SELECT
  _TABLE_SUFFIX AS client,
  pageid,
  COUNT(0) AS html
FROM
  `httparchive.summary_requests.2019_07_01_*`
WHERE
  firstHtml
GROUP BY
  client,
  pageid
HAVING
  html > 1
ORDER BY
  html DESC

Now I think I know what the bug is. When I generated the almanac dataset, if the summary request had firstHtml set, all requests for that page/URL were also annotated as firstHtml. I should be able to fix this, but we still need to rerun all the queries.
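
A minimal sketch of how that kind of over-annotation could arise, assuming the flag was propagated by matching on page and URL. Everything here is hypothetical: the `summary` and `flagged_urls` CTEs, the `request_id` and `is_flagged` columns, and the example URLs are invented for illustration and are not the actual dataset generation code.

```sql
#standardSQL
# Hypothetical summary rows: exactly one request per page is the real firstHtml.
WITH summary AS (
  SELECT 'mobile' AS client, 'https://example.com/' AS page,
    'https://example.com/' AS url, 1 AS request_id, TRUE AS firstHtml UNION ALL
  # A later repeat request for the same URL; not the first HTML response.
  SELECT 'mobile', 'https://example.com/', 'https://example.com/', 2, FALSE UNION ALL
  SELECT 'mobile', 'https://example.com/', 'https://example.com/app.js', 3, FALSE
),
# Assumed buggy step: remember the flag per (client, page, url), not per request.
flagged_urls AS (
  SELECT DISTINCT client, page, url, TRUE AS is_flagged
  FROM summary
  WHERE firstHtml
)
SELECT
  s.client,
  s.page,
  COUNT(0) AS requests,
  COUNTIF(s.firstHtml) AS correct_first_html,
  # Every request sharing the flagged URL inherits the flag.
  COUNTIF(f.is_flagged) AS propagated_first_html
FROM
  summary AS s
LEFT JOIN
  flagged_urls AS f
ON
  s.client = f.client AND s.page = f.page AND s.url = f.url
GROUP BY
  s.client,
  s.page
```

With these rows, the per-request flag counts one firstHtml request while the URL-level join counts two, which matches the shape of the duplicates seen in the almanac table.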

Member

@paulcalvano for now let's change these queries to use firstHtml solely for consistency. At least it'll have the same bug as all the other metrics and we can fix them all in one swoop.


SELECT
_TABLE_SUFFIX AS client,
JSON_EXTRACT_SCALAR(payload, "$._is_base_page") AS is_base_page,
Member

Ditto previous comments. Should is_base_page be a field in the output or a condition in the WHERE clause? Applies to the rest of these queries as well.

@paulcalvano
Contributor Author

Added 20.16, which includes alt-svc headers at @bazzadp's request.

@rviscomi
Member

rviscomi commented Oct 4, 2019

I've checked off all remaining metrics in #101 that are covered by this PR. Once the queries are updated to use firstHtml rather than is_base_page this should be ready for a final review to turn over to @bazzadp for writing.

@rviscomi rviscomi mentioned this pull request Oct 4, 2019
3 tasks
@tunetheweb
Member

tunetheweb commented Oct 6, 2019

@paulcalvano Can we change 20.16 to SUM rather than AVG?

  ROUND(SUM(num_requests),2) AS num_pushed_requests,
  ROUND(SUM(kb_transfered),2) AS num_kb_transfered

Also, is there a way to group these into the categories below? I've done a VLOOKUP in the results spreadsheet, but I'm not sure if there is a better way for future years. Categorising by these categories is why I want the sum rather than the average. I can approximate this (pretty well, it turns out!) by multiplying the number of pages by the average, but it would be better to have the raw stats. I got VERY different results when I just looked at the averages...

  • Video
  • HTML/Text
  • PDF
  • JavaScript
  • Image
  • Font
  • Data
  • CSS
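
One possible way to do the grouping in the query itself rather than with a spreadsheet lookup is a small mapping table joined before the SUM. This is only a sketch under assumptions: the `pushed` rows, the `type` values, the `category_map` entries, and the 'Other' fallback are all hypothetical, and only the output column names follow the snippet above.

```sql
#standardSQL
# Hypothetical per-page, per-type pushed request stats.
WITH pushed AS (
  SELECT 'page-a' AS page, 'script' AS type, 2 AS num_requests, 40.0 AS kb_transfered UNION ALL
  SELECT 'page-a', 'css', 1, 10.0 UNION ALL
  SELECT 'page-b', 'script', 4, 90.5 UNION ALL
  SELECT 'page-b', 'image', 3, 120.0
),
# Hypothetical lookup from raw type to reporting category.
category_map AS (
  SELECT 'script' AS type, 'JavaScript' AS category UNION ALL
  SELECT 'css', 'CSS' UNION ALL
  SELECT 'image', 'Image'
)
SELECT
  # Types without a mapping fall into an assumed 'Other' bucket.
  IFNULL(m.category, 'Other') AS category,
  SUM(p.num_requests) AS num_pushed_requests,
  ROUND(SUM(p.kb_transfered), 2) AS num_kb_transfered
FROM
  pushed AS p
LEFT JOIN
  category_map AS m
ON
  p.type = m.type
GROUP BY
  category
ORDER BY
  num_pushed_requests DESC
```

Summing after the join keeps the raw totals per category, so no averages need to be multiplied back out afterwards.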

@paulcalvano
Contributor Author

Hey Barry. Sure, I'm updating the queries per Rick's comment earlier (ie, switching to the almanac table and using firstHtml instead of is_base_page). I'm sorry that I'm so behind on this.

Will add a note to each sheet indicating whether the data is updated to use the almanac table.

@tunetheweb
Member

Can I also get HTTPS added to the upgrade header spreadsheet?

@paulcalvano
Contributor Author

> Can I also get HTTPS added to the upgrade header spreadsheet?

Which query is that?

@paulcalvano
Contributor Author

I updated all the data in the sheet for the queries I updated above. Will have the rest of these done later today. The numbers are very close, so the discrepancy I noted above does not appear to be too impactful and shouldn't affect the analysis.

@tunetheweb
Member

> > Can I also get HTTPS added to the upgrade header spreadsheet?
>
> Which query is that?

The 4a, 5a and 6a sheet. Find that one most useful btw so ignore the individual queries as only going to look at this combined one.

@paulcalvano
Contributor Author

All queries have been updated to use the almanac tables except for 20.15. There doesn't appear to be a summary_pages almanac table, so I left that one as is. I've updated all the data in the sheets for the queries I updated. In most cases, it resulted in only slight changes in the numbers.

For 20.11, I included both tables because the results were very different between the two queries. I trust the 2019_07_01 table results more FWIW, so we might want to keep only that query for this metric. @rviscomi wdyt?

@paulcalvano
Contributor Author

Just added the HTTPS indicators to two queries, @bazzadp. I didn't want to mess up your pivot tables, so instead of updating those in the sheet I added two extra sheets at the end with the results.

@rviscomi - this is ready for a final review.

@rviscomi rviscomi merged commit a9ef181 into HTTPArchive:master Oct 7, 2019
@rviscomi rviscomi removed the ASAP (This issue is blocking progress) label Feb 24, 2020
Labels
analysis Querying the dataset
Development

Successfully merging this pull request may close these issues.

Query metrics: Chapter 20. HTTP/2
3 participants