More H2 Queries #176
sql/2019/20_HTTP_2/20_02.sql
Outdated
  COUNT(*) AS num_pages,
  SUM(IF(url LIKE "https://%", 1, 0)) AS num_https_pages
FROM
  `httparchive.requests.2019_07_01_*`
A few suggestions:
- help put the absolute counts of pages in context of the total number of websites by also calculating the percent of pages
- use the almanac.requests table, which is partitioned and clustered by firstHtml, for efficiency
- sort the results by the percent for easier analysis
#standardSQL
# 20.2 - Measure of all HTTP versions (0.9, 1.0, 1.1, 2, QUIC) for main page of all sites, and for HTTPS sites. Table for last crawl.
SELECT
  client,
  JSON_EXTRACT_SCALAR(payload, "$._protocol") AS protocol,
  COUNT(0) AS num_pages,
  SUM(COUNT(0)) OVER (PARTITION BY client) AS total,
  COUNTIF(url LIKE "https://%") AS num_https_pages,
  ROUND(COUNT(0) * 100 / SUM(COUNT(0)) OVER (PARTITION BY client), 2) AS pct_pages,
  ROUND(COUNTIF(url LIKE "https://%") * 100 / SUM(COUNT(0)) OVER (PARTITION BY client), 2) AS pct_https
FROM
  `httparchive.almanac.requests`
WHERE
  firstHtml
GROUP BY
  client,
  protocol
ORDER BY
  num_pages / total DESC
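For anyone checking the percent logic outside BigQuery, here is a minimal JavaScript sketch of what the grouped count and the per-client window total compute. The sample rows and the function name pctByClientAndProtocol are hypothetical and used only for illustration; this is not part of the query itself.

```javascript
// Hypothetical sample: one row per first-HTML request.
const sampleRequests = [
  { client: 'desktop', protocol: 'HTTP/2' },
  { client: 'desktop', protocol: 'HTTP/2' },
  { client: 'desktop', protocol: 'http/1.1' },
  { client: 'mobile', protocol: 'HTTP/2' },
  { client: 'mobile', protocol: 'http/1.1' },
  { client: 'mobile', protocol: 'http/1.1' },
  { client: 'mobile', protocol: 'http/1.1' },
];

// Mirrors GROUP BY client, protocol plus a per-client total
// (the role played by SUM(COUNT(0)) OVER (PARTITION BY client)).
function pctByClientAndProtocol(rows) {
  const totals = new Map(); // client -> number of rows for that client
  const groups = new Map(); // [client, protocol] -> grouped count
  for (const { client, protocol } of rows) {
    totals.set(client, (totals.get(client) || 0) + 1);
    const key = JSON.stringify([client, protocol]);
    const group = groups.get(key) || { client, protocol, numPages: 0 };
    group.numPages += 1;
    groups.set(key, group);
  }
  return [...groups.values()]
    .map((g) => {
      const total = totals.get(g.client);
      // Percent of the client's pages, rounded to two decimals.
      const pctPages = Math.round((g.numPages * 100 / total) * 100) / 100;
      return { ...g, total, pctPages };
    })
    // Largest share first, like ORDER BY num_pages / total DESC.
    .sort((a, b) => b.numPages / b.total - a.numPages / a.total);
}

const pctResult = pctByClientAndProtocol(sampleRequests);
console.log(pctResult);
```

The point of the per-client denominator is that desktop and mobile shares are each computed against their own totals, so the two clients can be compared directly even though their page counts differ.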
SELECT
  _TABLE_SUFFIX AS client,
  JSON_EXTRACT_SCALAR(payload, "$._is_base_page") AS is_base_page,
  COUNT(*) AS num_requests
I'd also like to see this expressed as a pct
#standardSQL
# 20.04 - Number of HTTP (not HTTPS) sites which return upgrade HTTP header containing h2.
CREATE TEMPORARY FUNCTION getUpgradeHeader(payload STRING)
RETURNS STRING
LANGUAGE js AS """
  try {
    var $ = JSON.parse(payload);
    var headers = $.response.headers;
    var st = headers.find(function(e) {
      return e['name'].toLowerCase() === 'upgrade'
    });
    return st['value'];
  } catch (e) {
    return '';
  }
""";
SELECT
  client,
  COUNTIF(upgrade) AS freq,
  COUNT(0) AS total,
  ROUND(COUNTIF(upgrade) * 100 / COUNT(0), 2) AS pct
FROM (
  SELECT
    client,
    url LIKE "http://%" AND LOWER(getUpgradeHeader(payload)) LIKE "%h2%" AS upgrade
  FROM
    `httparchive.almanac.requests`
  WHERE
    firstHtml)
GROUP BY
  client
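The UDF body can be exercised locally before running the query. Below is a standalone sketch of the same lookup, plus a helper that mirrors the upgrade condition from the subquery. The helper name advertisesH2OverHttp and the two sample payloads are hypothetical; real payloads are full HAR entries with many more fields.

```javascript
// Standalone copy of the UDF's lookup logic, for local testing only.
function getUpgradeHeader(payload) {
  try {
    const entry = JSON.parse(payload);
    const header = entry.response.headers.find(
      (h) => h.name.toLowerCase() === 'upgrade'
    );
    // Throws when no Upgrade header exists; the catch turns that into ''.
    return header.value;
  } catch (e) {
    return '';
  }
}

// Mirrors: url LIKE "http://%" AND LOWER(getUpgradeHeader(payload)) LIKE "%h2%"
function advertisesH2OverHttp(url, payload) {
  return url.startsWith('http://') &&
    getUpgradeHeader(payload).toLowerCase().includes('h2');
}

// Hypothetical minimal payloads.
const withUpgrade = JSON.stringify({
  response: { headers: [{ name: 'Upgrade', value: 'h2,h2c' }] },
});
const withoutUpgrade = JSON.stringify({
  response: { headers: [{ name: 'Content-Type', value: 'text/html' }] },
});

console.log(advertisesH2OverHttp('http://example.com/', withUpgrade));
console.log(advertisesH2OverHttp('https://example.com/', withUpgrade));
console.log(advertisesH2OverHttp('http://example.com/', withoutUpgrade));
```

Note that the substring match on h2 also matches values such as h2c, which is the same behavior as the LIKE "%h2%" condition in the query.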
Seeing a discrepancy in the results between the query in my PR and the one you shared above. In the screenshot below, the query on the left processes the httparchive.requests tables and the one on the right processes the httparchive.almanac.requests table.
The one on the left shows 1890586 mobile HTTP/2 pages. The one on the right shows 1996270 mobile HTTP/2 pages. I don't see any reason for the discrepancy in the queries. Any ideas?
Great catch. Seems like a difference in requests marked _is_base_page and firstHtml. The former is a WPT annotation while the latter is determined in the HA pipeline.
Looking at the number of base pages:
#standardSQL
SELECT
  _TABLE_SUFFIX AS client,
  COUNT(0)
FROM
  `httparchive.requests.2019_07_01_*`
WHERE
  JSON_EXTRACT_SCALAR(payload, '$._is_base_page') = 'true'
GROUP BY
  client
There are 5,297,105 mobile and 4,371,570 desktop base pages, compared to the right-side totals in your screenshot of 5,558,214 and 4,550,580 respectively.
The actual number of pages in the summary_pages dataset for the 2019_07 crawl is:
#standardSQL
SELECT
  _TABLE_SUFFIX AS client,
  COUNT(0)
FROM
  `httparchive.summary_pages.2019_07_01_*`
GROUP BY
  client
5,297,442 and 4,371,973 respectively. So much closer to the base page numbers. I'm not even sure how we get ~200k more firstHtml requests than there are pages. 😖
#standardSQL
SELECT
  client,
  page,
  COUNT(0) AS firstHtml
FROM
  `httparchive.almanac.requests`
WHERE
  firstHtml
GROUP BY
  client,
  page
HAVING
  firstHtml > 1
ORDER BY
  firstHtml DESC
This reveals MANY pages in which there are more than 1 firstHtml value. Some with 400+. On this example page the page makes over 1000 requests, many to /index.php and /. Maybe the HA pipeline saw the first request for /, marked it as firstHtml, then marked all subsequent requests for that URL also as firstHtml?
In any case, this bug affects all other queries that depend on firstHtml. I think the best path forward for now is to open a bug to get this fixed in the pipeline for future crawls, overwrite all firstHtml values in the almanac dataset to match _is_base_page, rerun all queries that include it, and update all results and written analyses that depend on those queries.
Thanks for pointing this out Paul!
I was about to file the bug and was querying the summary_requests dataset directly to eliminate any almanac dataset shenanigans, but it turns out that there are almanac dataset shenanigans!
The last query I shared in the previous comment produced hundreds of thousands of pages with more than 1 firstHtml request. It should have produced 0. This equivalent query of the summary_requests dataset does produce 0 results:
#standardSQL
SELECT
  _TABLE_SUFFIX AS client,
  pageid,
  COUNT(0) AS html
FROM
  `httparchive.summary_requests.2019_07_01_*`
WHERE
  firstHtml
GROUP BY
  client,
  pageid
HAVING
  html > 1
ORDER BY
  html DESC
Now I think I know what the bug is. When I generated the almanac dataset, if the summary request had firstHtml, all requests for that page/url were also annotated as firstHtml. I should be able to fix this, but we still need to rerun all the queries.
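To make the failure mode concrete, here is a small JavaScript sketch of how annotating by page/url rather than by individual request would inflate the firstHtml count. It is only an illustration of the behavior described above, not the actual code used to generate the almanac dataset; the requestId field, the sample data, and the function names are all hypothetical.

```javascript
// Hypothetical requests for one page. Only request 1 is the real first HTML
// response; request 2 is a later request for the same URL.
const summaryRequests = [
  { requestId: 1, page: 'https://example.com/', url: 'https://example.com/', firstHtml: true },
  { requestId: 2, page: 'https://example.com/', url: 'https://example.com/', firstHtml: false },
  { requestId: 3, page: 'https://example.com/', url: 'https://example.com/index.php', firstHtml: false },
];

// Over-annotating variant: any request sharing a (page, url) pair with a
// firstHtml request is also flagged.
function annotateByPageUrl(requests) {
  const keyOf = (r) => JSON.stringify([r.page, r.url]);
  const firstHtmlKeys = new Set(requests.filter((r) => r.firstHtml).map(keyOf));
  return requests.map((r) => ({ ...r, firstHtml: firstHtmlKeys.has(keyOf(r)) }));
}

// Per-request variant: the flag is carried over for each request unchanged.
function annotateByRequest(requests) {
  return requests.map((r) => ({ ...r }));
}

const countFirstHtml = (requests) => requests.filter((r) => r.firstHtml).length;

console.log(countFirstHtml(annotateByPageUrl(summaryRequests)));
console.log(countFirstHtml(annotateByRequest(summaryRequests)));
```

With the per-request variant, each page keeps exactly one firstHtml request, which is what the summary_requests query above confirms for the source data.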
@paulcalvano for now let's change these queries to use firstHtml solely for consistency. At least it'll have the same bug as all the other metrics and we can fix them all in one swoop.
sql/2019/20_HTTP_2/20_06.sql
Outdated
SELECT
  _TABLE_SUFFIX AS client,
  JSON_EXTRACT_SCALAR(payload, "$._is_base_page") AS is_base_page,
Ditto previous comments. Should is_base_page be a field in the output or a condition in the WHERE clause? Applies to the rest of these queries as well.
Added 20.16, which includes alt-svc headers at @bazzadp's request.
@paulcalvano Can we change 20.16 to SUM rather than AVG?
Also is there a way to group these into the below categories? I've done a vlookup in the results spreadsheet, but not sure if there is a better way for future years? And categorising by these categories is why I want the sum, rather than the total. I can approximate this (pretty well it turns out!) by multiplying the num pages by the average, but it would be better to have the raw stats. I got VERY different results when I just looked at the averages...
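As a rough illustration of why the sum matters when regrouping rows into categories, the JavaScript sketch below compares a total rebuilt from num pages times average against a plain mean of the per-row averages. The rows, category labels, and field names are hypothetical and not taken from the 20.16 results; the rebuilt sum is only exact if the stored averages were not rounded.

```javascript
// Hypothetical per-row results: a category label, a page count, and a
// per-page average for that row.
const averagedRows = [
  { category: 'A', numPages: 1000, avg: 2 },
  { category: 'A', numPages: 10, avg: 20 },
  { category: 'B', numPages: 500, avg: 4 },
];

function summarizeByCategory(rows) {
  const out = {};
  for (const { category, numPages, avg } of rows) {
    const c = out[category] || (out[category] = { numPages: 0, sum: 0, avgs: [] });
    c.numPages += numPages;
    c.sum += numPages * avg; // rebuilds the row's sum from its average
    c.avgs.push(avg);
  }
  for (const c of Object.values(out)) {
    // Page-weighted average: equivalent to averaging over the raw pages.
    c.weightedAvg = c.sum / c.numPages;
    // Unweighted mean of row averages: treats a 10-page row like a 1000-page row.
    c.meanOfAvgs = c.avgs.reduce((a, b) => a + b, 0) / c.avgs.length;
  }
  return out;
}

const categorySummary = summarizeByCategory(averagedRows);
console.log(categorySummary);
```

In this made-up data, category A's weighted average stays close to 2 while the mean of the row averages is 11, which is the kind of gap that shows up when only averages are available.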
Hey Barry. Sure, I'm updating the queries per Rick's comment earlier (i.e., switching to the almanac table and using firstHtml instead of is_base_page). I'm sorry that I'm so behind on this. Will add a note to each sheet indicating whether the data is updated to use the almanac table.
Can I also get HTTPS added to the upgrade header spreadsheet?
Which query is that?
I updated all the data in the sheet for the queries I updated above. Will have the rest of these done later today. The numbers are very close, so the discrepancy I noted above does not appear to be too impactful and shouldn't affect the analysis.
The 4a, 5a and 6a sheet. Find that one most useful btw so ignore the individual queries as only going to look at this combined one.
All queries have been updated to use the almanac tables except for 20.15. There doesn't appear to be a summary_pages almanac table, so I left that one as is. I've updated all the data in the sheets for the queries I updated. In most cases, it resulted in only slight changes in the numbers. For 20.11, I included both tables because the results were very different between the two queries. I trust the 2019_07_01 table results more FWIW, so we might just want to keep only that query for that one. @rviscomi wdyt?
Closes #101