Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge branch 'main' of github.com:HTTPArchive/almanac.httparchive.org into production
Showing 107 changed files with 8,124 additions and 5,972 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,164 @@ | ||
# Guide | ||
|
||
## file naming | ||
File names start with the table being queried, followed by the custom metric if appropriate, then "_by_" and how the query is grouped, and finally anything more specific that is needed. | ||
|
||
Queries contain one or more metrics that match the query's grouping structure. | ||
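
The naming rule above can be sketched as a small helper. This is only an illustration: `buildQueryFileName` and its parameter names are hypothetical and not part of the repository, and joining multiple groupings with `_and_` is inferred from example names such as pages_markup_by_device_and_percentile.sql.

```javascript
// Hypothetical helper (not part of the repository): builds a query file name
// following the convention table, optional metric, "_by_", grouping, optional detail.
function buildQueryFileName({ table, metric, groupedBy, detail }) {
  const parts = [table];
  if (metric) parts.push(metric);             // custom metric, if appropriate
  parts.push('by', groupedBy.join('_and_'));  // "_by_" followed by the grouping
  if (detail) parts.push(detail);             // anything more specific
  return parts.join('_') + '.sql';
}

console.log(buildQueryFileName({ table: 'pages', metric: 'markup', groupedBy: ['device'] }));
// pages_markup_by_device.sql
console.log(buildQueryFileName({ table: 'pages', metric: 'markup', groupedBy: ['device', 'percentile'] }));
// pages_markup_by_device_and_percentile.sql
```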
|
||
## general | ||
|
||
Percents should be fractions between 0 and 1. | ||
Always segment by device (client). | ||
Use total for the denominator, freq for the numerator, and pct for the fraction. | ||
|
||
Percentiles: use the 10th, 25th, 50th, 75th and 90th. The y-axis minimum should almost always be 0, and the maximum that gets rendered automatically is usually good enough, so explicit min/max values are probably not required. | ||
|
||
Some common query patterns: | ||
|
||
## _by_device | ||
|
||
Example: pages_markup_by_device.sql | ||
|
||
### AS_PERCENT function | ||
|
||
This makes it simpler to create percent-based fields. It may be added as a shared function for all queries. | ||
|
||
``` | ||
# helper to create percent fields | ||
CREATE TEMP FUNCTION AS_PERCENT (freq FLOAT64, total FLOAT64) RETURNS FLOAT64 AS ( | ||
ROUND(SAFE_DIVIDE(freq, total), 4) | ||
); | ||
``` | ||
|
||
In use: | ||
|
||
``` | ||
AS_PERCENT(COUNTIF(element_count_info.contains_custom_element), COUNT(0)) AS pct_contains_custom_element, | ||
``` | ||
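
To check expected values by hand, the same calculation can be mirrored in JavaScript. This is a rough sketch, not part of the queries: `asPercent` is a hypothetical name, and `null` stands in for the SQL NULL that SAFE_DIVIDE returns on division by zero. For non-negative inputs `Math.round` and BigQuery's ROUND treat halfway cases the same way.

```javascript
// Local mirror of the AS_PERCENT SQL helper (hypothetical, for hand-checking only).
// Returns a 0-1 fraction rounded to 4 decimal places, or null where SQL would give NULL.
function asPercent(freq, total) {
  if (freq == null || total == null || total === 0) return null; // SAFE_DIVIDE behaviour
  return Math.round((freq / total) * 10000) / 10000;             // ROUND(..., 4)
}

console.log(asPercent(1234, 5000)); // 0.2468
console.log(asPercent(1, 3));       // 0.3333
console.log(asPercent(5, 0));       // null
```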
|
||
### custom metrics functions | ||
|
||
For speed, pass in the JSON string for the custom metric (not the whole payload) and have the function return multiple values via a STRUCT. This minimises JSON parsing in the JS, which seems to be slow. | ||
|
||
``` | ||
CREATE TEMPORARY FUNCTION get_element_count_info(element_count_string STRING) | ||
RETURNS STRUCT< | ||
count INT64, | ||
contains_custom_element BOOL, | ||
contains_obsolete_element BOOL, | ||
contains_details_element BOOL, | ||
contains_summary_element BOOL | ||
> LANGUAGE js AS ''' | ||
var result = {}; | ||
try { | ||
if (!element_count_string) return result; | ||
var element_count = JSON.parse(element_count_string); | ||
if (Array.isArray(element_count) || typeof element_count != 'object') return result; | ||
// fill result with all the values | ||
result.count = Object.values(element_count).reduce((total, freq) => total + (parseInt(freq, 10) || 0), 0); | ||
//... | ||
} catch (e) {} | ||
return result; | ||
'''; | ||
``` | ||
|
||
Make sure you do null/undefined checks in your JS code when digging into an object, so that a simple error does not cause the loss of data. Setting a value to null or undefined is not a problem, as it gets converted into a NULL. | ||
|
||
``` | ||
if (almanac.html_node) { | ||
result.html_node_lang = almanac.html_node.lang; | ||
} | ||
``` | ||
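
The effect of these guards can be tried outside BigQuery with a cut-down function run in Node. This is a minimal sketch under stated assumptions: `getAlmanacInfo` and the `videos_total` field are hypothetical names used only for illustration, and the sample inputs are made up.

```javascript
// Sketch of the guarded-access pattern. Each property is checked before it is
// dereferenced, so a missing branch does not throw and wipe out later fields.
function getAlmanacInfo(almanac_string) {
  var result = {};
  try {
    var almanac = JSON.parse(almanac_string);
    if (!almanac || Array.isArray(almanac) || typeof almanac != 'object') return result;
    if (almanac.videos && Array.isArray(almanac.videos.nodes)) {
      result.videos_total = almanac.videos.nodes.length; // hypothetical field
    }
    if (almanac.html_node) {
      result.html_node_lang = almanac.html_node.lang; // may be undefined, which is fine
    }
  } catch (e) {}
  return result;
}

// "videos" has no "nodes" array here, yet html_node_lang is still extracted.
console.log(getAlmanacInfo('{"videos": {}, "html_node": {"lang": "en"}}'));
// Invalid JSON is swallowed by the try/catch and yields an empty result.
console.log(getAlmanacInfo('not json'));
```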
|
||
To get the info, first extract the custom metric from the payload using the fast JSON_EXTRACT_SCALAR. The returned STRUCT values can then be accessed with dot notation. | ||
|
||
``` | ||
SELECT | ||
client, | ||
COUNT(0) AS total, | ||
# % of pages with custom elements ("slang") related to M242 | ||
COUNTIF(element_count_info.contains_custom_element) AS freq_contains_custom_element, | ||
AS_PERCENT(COUNTIF(element_count_info.contains_custom_element), COUNT(0)) AS pct_contains_custom_element, | ||
#... | ||
FROM | ||
( | ||
SELECT | ||
_TABLE_SUFFIX AS client, | ||
get_element_count_info(JSON_EXTRACT_SCALAR(payload, '$._element_count')) AS element_count_info # LIVE | ||
FROM | ||
`httparchive.sample_data.pages_*` # TEST | ||
) | ||
GROUP BY | ||
client | ||
``` | ||
|
||
## _by_device_and_percentile | ||
|
||
Example: pages_markup_by_device_and_percentile.sql | ||
|
||
This would typically use the exact same function to extract the data. The SELECT adds in the percentiles and uses a standard field definition to extract the data. | ||
|
||
``` | ||
SELECT | ||
percentile, | ||
client, | ||
COUNT(DISTINCT url) AS total, | ||
# Elements per page | ||
APPROX_QUANTILES(element_count_info.count, 1000)[OFFSET(percentile * 10)] AS elements_count, | ||
#... | ||
FROM ( | ||
SELECT | ||
_TABLE_SUFFIX AS client, | ||
percentile, | ||
url, | ||
get_element_count_info(JSON_EXTRACT_SCALAR(payload, '$._element_count')) AS element_count_info | ||
FROM | ||
`httparchive.sample_data.pages_*`, # TEST | ||
UNNEST([10, 25, 50, 75, 90]) AS percentile | ||
) | ||
GROUP BY | ||
percentile, | ||
client | ||
ORDER BY | ||
percentile, | ||
client | ||
``` | ||
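
The `[OFFSET(percentile * 10)]` indexing works because asking for 1000 quantiles returns 1001 boundaries, so the Nth percentile sits at index N * 10. The sketch below illustrates that mapping with an exact, in-memory stand-in. It is an assumption-laden illustration: `quantiles` is a hypothetical helper using nearest-rank selection, the data is made up, and BigQuery's APPROX_QUANTILES is approximate and may return slightly different values.

```javascript
// Exact stand-in for APPROX_QUANTILES(x, buckets): returns buckets + 1 boundaries,
// where index 0 is the minimum and index `buckets` is the maximum.
function quantiles(values, buckets) {
  const sorted = [...values].sort((a, b) => a - b);
  const out = [];
  for (let i = 0; i <= buckets; i++) {
    // nearest-rank position of the i-th boundary in the sorted data
    const idx = Math.round((i / buckets) * (sorted.length - 1));
    out.push(sorted[idx]);
  }
  return out;
}

// Made-up element counts: 0, 1, 2, ..., 100
const elementCounts = Array.from({ length: 101 }, (_, i) => i);

for (const percentile of [10, 25, 50, 75, 90]) {
  // same indexing as [OFFSET(percentile * 10)] in the query
  console.log(percentile, quantiles(elementCounts, 1000)[percentile * 10]);
}
```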
|
||
## Testing | ||
|
||
For testing, I change the start of the function and hard-code some random data, e.g. | ||
|
||
``` | ||
var result = {}; | ||
try { | ||
// var almanac = JSON.parse(almanac_string); // LIVE | ||
// TEST | ||
var almanac = { | ||
"scripts": { | ||
"total": Math.floor(Math.random()*10) | ||
} | ||
}; | ||
if (Array.isArray(almanac) || typeof almanac != 'object') return result; | ||
// ... | ||
``` | ||
|
||
To speed up test queries, the call to the function can also be faked by passing an empty string: | ||
|
||
``` | ||
get_almanac_info('') AS almanac_info # TEST | ||
#get_almanac_info(JSON_EXTRACT_SCALAR(payload, '$._almanac')) AS almanac_info # LIVE | ||
``` |
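
Another option, offered only as a sketch, is to exercise a UDF body locally in Node before running any query. The body below is a shortened, made-up version of a `get_almanac_info` style function, and the approach assumes BigQuery's JavaScript runtime behaves like Node for this code. It also shows why the faked call with `''` is cheap: `JSON.parse('')` throws, the catch swallows it, and an empty result comes back.

```javascript
// Body of a JS UDF kept as a string, as it would appear between the ''' quotes.
// Shortened, made-up example for illustration only.
const udfBody = `
  var result = {};
  try {
    var almanac = JSON.parse(almanac_string);
    if (Array.isArray(almanac) || typeof almanac != 'object') return result;
    if (almanac.scripts) {
      result.scripts_total = almanac.scripts.total;
    }
  } catch (e) {}
  return result;
`;

// new Function(paramName, body) builds a callable with the same parameter name as the UDF.
const get_almanac_info = new Function('almanac_string', udfBody);

console.log(get_almanac_info('{"scripts": {"total": 3}}')); // { scripts_total: 3 }
console.log(get_almanac_info(''));                           // {}
```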
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,110 @@ | ||
#standardSQL | ||
# pages almanac metrics grouped by device | ||
|
||
# real run estimated at $4.08 and took 48 seconds | ||
|
||
# to speed things up there is only one js function per custom metric property. It returns a STRUCT with all the data needed | ||
# current test gathers 3 bits of information from the custom metric properties | ||
# I tried to do a single js function processing the whole payload but it was very slow (50 sec) because of parsing the full payload in js | ||
# this uses JSON_EXTRACT_SCALAR to first get the custom metrics json string, and only passes those into the js functions | ||
# Estimate about twice the speed of the original code. But should scale up far better as the custom metrics are only parsed once. | ||
|
||
# helper to create percent fields | ||
CREATE TEMP FUNCTION AS_PERCENT (freq FLOAT64, total FLOAT64) RETURNS FLOAT64 AS ( | ||
ROUND(SAFE_DIVIDE(freq, total), 4) | ||
); | ||
|
||
# returns all the data we need from _almanac | ||
CREATE TEMPORARY FUNCTION get_almanac_info(almanac_string STRING) | ||
RETURNS STRUCT< | ||
scripts_total INT64, | ||
none_jsonld_scripts_total INT64, | ||
src_scripts_total INT64, | ||
inline_scripts_total INT64, | ||
good_heading_sequence BOOL, | ||
contains_videos_with_autoplay BOOL, | ||
contains_videos_without_autoplay BOOL, | ||
html_node_lang STRING | ||
> LANGUAGE js AS ''' | ||
var result = {}; | ||
try { | ||
var almanac = JSON.parse(almanac_string); | ||
if (Array.isArray(almanac) || typeof almanac != 'object') return result; | ||
if (almanac.scripts) { | ||
result.scripts_total = almanac.scripts.total; | ||
if (almanac.scripts.nodes) { | ||
result.none_jsonld_scripts_total = almanac.scripts.nodes.filter(n => !n.type || n.type.trim().toLowerCase() !== 'application/ld+json').length; | ||
result.src_scripts_total = almanac.scripts.nodes.filter(n => n.src && n.src.trim().length > 0).length; | ||
result.inline_scripts_total = result.none_jsonld_scripts_total - result.src_scripts_total; | ||
} | ||
} | ||
if (almanac.headings_order) { | ||
var good = true; | ||
var previousLevel = 0; | ||
almanac.headings_order.forEach(level => { | ||
if (previousLevel + 1 < level) { // jumped a level | ||
good = false; | ||
} | ||
previousLevel = level; | ||
}); | ||
result.good_heading_sequence = good; | ||
} | ||
if (almanac.videos && almanac.videos.nodes) { | ||
var autoplay_count = almanac.videos.nodes.filter(n => n.autoplay == "" || n.autoplay).length; // valid values are blank or autoplay. I'm just checking it exists... | ||
result.contains_videos_with_autoplay = autoplay_count > 0; | ||
result.contains_videos_without_autoplay = almanac.videos.total > autoplay_count; | ||
} | ||
if (almanac.html_node) { | ||
result.html_node_lang = almanac.html_node.lang; | ||
} | ||
} catch (e) {} | ||
return result; | ||
'''; | ||
|
||
SELECT | ||
client, | ||
COUNT(0) AS total, | ||
|
||
# has scripts that are not jsonld ones. i.e. has a none jsonld script. | ||
AS_PERCENT(COUNTIF(almanac_info.none_jsonld_scripts_total > 0), COUNT(0)) AS pct_contains_none_jsonld_scripts_m204, | ||
|
||
# has inline scripts | ||
AS_PERCENT(COUNTIF(almanac_info.inline_scripts_total > 0), COUNT(0)) AS pct_contains_inline_scripts_m206, | ||
|
||
# has src scripts | ||
AS_PERCENT(COUNTIF(almanac_info.src_scripts_total > 0), COUNT(0)) AS pct_contains_src_scripts_m208, | ||
|
||
# has no scripts | ||
AS_PERCENT(COUNTIF(almanac_info.scripts_total = 0), COUNT(0)) AS pct_contains_no_scripts_m210, | ||
|
||
# Does the heading logical sequence make any sense | ||
AS_PERCENT(COUNTIF(almanac_info.good_heading_sequence), COUNT(0)) AS pct_good_heading_sequence_m222, | ||
|
||
# pages with autoplaying video elements M310 | ||
AS_PERCENT(COUNTIF(almanac_info.contains_videos_with_autoplay), COUNT(0)) AS pct_contains_videos_with_autoplay_m310, | ||
|
||
# pages without autoplaying video elements M311 | ||
AS_PERCENT(COUNTIF(almanac_info.contains_videos_without_autoplay), COUNT(0)) AS pct_contains_videos_without_autoplay_m311, | ||
|
||
# pages with no html lang attribute M404 | ||
AS_PERCENT(COUNTIF(almanac_info.html_node_lang IS NULL OR LENGTH(almanac_info.html_node_lang) = 0), COUNT(0)) AS pct_no_html_lang_m404, | ||
|
||
FROM | ||
( | ||
SELECT | ||
_TABLE_SUFFIX AS client, | ||
get_almanac_info(JSON_EXTRACT_SCALAR(payload, '$._almanac')) AS almanac_info | ||
FROM | ||
`httparchive.pages.2020_08_01_*` | ||
) | ||
GROUP BY | ||
client | ||
|
40 changes: 40 additions & 0 deletions
sql/2020/03_Markup/pages_almanac_by_device_and_attribute_name_frequency.sql
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
#standardSQL | ||
# pages almanac metrics grouped by device and element attribute use (frequency) | ||
|
||
CREATE TEMP FUNCTION AS_PERCENT (freq FLOAT64, total FLOAT64) RETURNS FLOAT64 AS ( | ||
ROUND(SAFE_DIVIDE(freq, total), 4) | ||
); | ||
|
||
CREATE TEMPORARY FUNCTION get_almanac_attribute_info(almanac_string STRING) | ||
RETURNS ARRAY<STRUCT<name STRING, freq INT64>> LANGUAGE js AS ''' | ||
try { | ||
var almanac = JSON.parse(almanac_string); | ||
if (Array.isArray(almanac) || typeof almanac != 'object') return []; | ||
if (almanac.attributes_used_on_elements) { | ||
return Object.entries(almanac.attributes_used_on_elements).map(([name, freq]) => ({name, freq})); | ||
} | ||
} catch (e) { | ||
} | ||
return []; | ||
'''; | ||
|
||
SELECT | ||
_TABLE_SUFFIX AS client, | ||
almanac_attribute_info.name, | ||
SUM(almanac_attribute_info.freq) AS freq, # total count from all pages | ||
AS_PERCENT(SUM(almanac_attribute_info.freq), SUM(SUM(almanac_attribute_info.freq)) OVER (PARTITION BY _TABLE_SUFFIX)) AS pct_m400 | ||
FROM | ||
`httparchive.pages.2020_08_01_*`, | ||
UNNEST(get_almanac_attribute_info(JSON_EXTRACT_SCALAR(payload, '$._almanac'))) AS almanac_attribute_info | ||
GROUP BY | ||
client, | ||
almanac_attribute_info.name | ||
ORDER BY | ||
freq DESC, | ||
client | ||
LIMIT | ||
1000 |