Markup 2020 queries #1137

Merged
merged 21 commits into from
Sep 19, 2020
164 changes: 164 additions & 0 deletions sql/2020/03_Markup/README.md
@@ -0,0 +1,164 @@
# Guide

## file naming
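File names start with the table being queried, followed by the custom metric if appropriate, then "_by_" and how the query is grouped, and finally anything more specific that is needed. For example, pages_almanac_by_device.sql queries the pages table, reads the almanac custom metric, and groups by device.

Each query contains one or more metrics that match the query's grouping structure.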

## general

- Percents should be expressed as fractions between 0 and 1.
- Always segment by device (client).
- Use total for the denominator, freq for the numerator, and pct for the fraction.
- Percentile queries report the 10th, 25th, 50th, 75th and 90th percentiles. The y-axis minimum should almost always be 0, and the maximum that gets rendered automatically is usually good enough, so explicit min and max values are probably not required.

Some common query patterns:

## _by_device

Example: pages_markup_by_device.sql

### AS_PERCENT function

This helper makes it simpler to create percent-based fields. It may later be added as a shared function for all queries.

```
# helper to create percent fields
CREATE TEMP FUNCTION AS_PERCENT (freq FLOAT64, total FLOAT64) RETURNS FLOAT64 AS (
ROUND(SAFE_DIVIDE(freq, total), 4)
);
```

In use:

```
AS_PERCENT(COUNTIF(element_count_info.contains_custom_element), COUNT(0)) AS pct_contains_custom_element,
```
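
The behaviour of the helper can also be sketched outside BigQuery. The following is a minimal JavaScript stand-in, not part of the queries: `asPercent` is a hypothetical name, and it assumes non-negative inputs (for those, `Math.round` and BigQuery's `ROUND` agree on halfway cases). It shows that the result is a fraction between 0 and 1 with four decimal places, and that a zero denominator yields null rather than an error, as `SAFE_DIVIDE` does.

```javascript
// Hypothetical JS stand-in for the AS_PERCENT SQL helper (illustration only).
function asPercent(freq, total) {
  // SAFE_DIVIDE returns NULL instead of raising an error when dividing by zero.
  if (freq === null || total === null || total === 0) return null;
  // ROUND(x, 4) keeps four decimal places.
  return Math.round((freq / total) * 1e4) / 1e4;
}

console.log(asPercent(1234, 5000)); // 0.2468
console.log(asPercent(3, 0));       // null
```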

### custom metrics functions

For speed, pass the JSON string for the custom metric (not the whole payload) into the function, and have the function return multiple values via a STRUCT. This minimises JSON parsing in the JS, which seems to be slow.

```
CREATE TEMPORARY FUNCTION get_element_count_info(element_count_string STRING)
RETURNS STRUCT<
  count INT64,
  contains_custom_element BOOL,
  contains_obsolete_element BOOL,
  contains_details_element BOOL,
  contains_summary_element BOOL
> LANGUAGE js AS '''
var result = {};
try {
  if (!element_count_string) return result;

  var element_count = JSON.parse(element_count_string);

  if (Array.isArray(element_count) || typeof element_count != 'object') return result;

  // fill result with all the values

  result.count = Object.values(element_count).reduce((total, freq) => total + (parseInt(freq, 10) || 0), 0);

  //...

} catch (e) {}
return result;
''';
```

Make sure you do null/undefined checks in your JS code when digging into an object. An exception skips the rest of the try block, so one missing property can cause the loss of every value that would have been set after it. There is no issue in setting a value to null or undefined, as it gets converted into a NULL.

```
if (almanac.html_node) {
  result.html_node_lang = almanac.html_node.lang;
}
```
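
To illustrate why the checks matter, here is a small sketch that can be run in Node. The two functions are hypothetical miniatures of a custom metric function body, not code from the queries. The unguarded version loses `scripts_total` because the exception on the missing `html_node` skips the rest of the try block.

```javascript
// Hypothetical miniature of a custom metric function, without null checks.
function getInfoUnguarded(almanac) {
  var result = {};
  try {
    result.html_node_lang = almanac.html_node.lang; // throws if html_node is missing
    result.scripts_total = almanac.scripts.total;   // never reached after a throw
  } catch (e) {}
  return result;
}

// The same function with a check before each dig into the object.
function getInfoGuarded(almanac) {
  var result = {};
  try {
    if (almanac.html_node) {
      result.html_node_lang = almanac.html_node.lang;
    }
    if (almanac.scripts) {
      result.scripts_total = almanac.scripts.total;
    }
  } catch (e) {}
  return result;
}

var sample = { scripts: { total: 7 } }; // made-up data with no html_node property
console.log(getInfoUnguarded(sample)); // {}
console.log(getInfoGuarded(sample));   // { scripts_total: 7 }
```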

To get the info, the query first extracts the custom metric from the payload using the fast JSON_EXTRACT_SCALAR. The returned STRUCT values can then be accessed with dot notation.

```
SELECT
  client,
  COUNT(0) AS total,

  # % of pages with custom elements ("slang") related to M242
  COUNTIF(element_count_info.contains_custom_element) AS freq_contains_custom_element,
  AS_PERCENT(COUNTIF(element_count_info.contains_custom_element), COUNT(0)) AS pct_contains_custom_element,

  #...

FROM
  (
    SELECT
      _TABLE_SUFFIX AS client,
      get_element_count_info(JSON_EXTRACT_SCALAR(payload, '$._element_count')) AS element_count_info # LIVE
    FROM
      `httparchive.sample_data.pages_*` # TEST
  )
GROUP BY
  client
```

## _by_device_and_percentile

Example: pages_markup_by_device_and_percentile.sql

This would typically use the exact same function to extract the data. The SELECT adds the percentiles and uses a standard field definition, APPROX_QUANTILES with an OFFSET derived from the percentile, to compute each value.

```
SELECT
  percentile,
  client,
  COUNT(DISTINCT url) AS total,

  # Elements per page
  APPROX_QUANTILES(element_count_info.count, 1000)[OFFSET(percentile * 10)] AS elements_count,

  #...

FROM (
  SELECT
    _TABLE_SUFFIX AS client,
    percentile,
    url,
    get_element_count_info(JSON_EXTRACT_SCALAR(payload, '$._element_count')) AS element_count_info
  FROM
    `httparchive.sample_data.pages_*`, # TEST
    UNNEST([10, 25, 50, 75, 90]) AS percentile
)
GROUP BY
  percentile,
  client
ORDER BY
  percentile,
  client
```
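
The `OFFSET(percentile * 10)` arithmetic may be easier to follow with a small sketch. `APPROX_QUANTILES(x, 1000)` returns 1001 boundaries (minimum, 999 intermediate values, maximum), so the percentile p sits at index p * 10. The JavaScript below is only an illustration under stated assumptions: `quantiles` is a hypothetical exact nearest-rank stand-in, whereas BigQuery computes approximate values, and the input data is made up.

```javascript
// Exact nearest-rank stand-in for APPROX_QUANTILES(values, buckets) (illustration only).
function quantiles(values, buckets) {
  var sorted = values.slice().sort(function (a, b) { return a - b; });
  var out = [];
  for (var i = 0; i <= buckets; i++) {
    // Position of the i-th boundary within the sorted data.
    var idx = Math.round((i / buckets) * (sorted.length - 1));
    out.push(sorted[idx]);
  }
  return out; // buckets + 1 entries: minimum first, maximum last
}

// Made-up element counts for 101 pages: 1, 2, ..., 101.
var counts = [];
for (var n = 1; n <= 101; n++) counts.push(n);

var q = quantiles(counts, 1000);
[10, 25, 50, 75, 90].forEach(function (percentile) {
  // With 1000 buckets each percentile point spans 10 boundaries.
  console.log(percentile, q[percentile * 10]);
});
```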

## Testing

For testing, I change the start of the function to hard-code some random data, e.g.:

```
var result = {};
try {
  // var almanac = JSON.parse(almanac_string); // LIVE

  // TEST
  var almanac = {
    "scripts": {
      "total": Math.floor(Math.random()*10)
    }
  };

  if (Array.isArray(almanac) || typeof almanac != 'object') return result;

  // ...
```

To speed up test queries, the call to the function can also be faked so the custom metric is never extracted from the payload:

```
get_almanac_info('') AS almanac_info # TEST
#get_almanac_info(JSON_EXTRACT_SCALAR(payload, '$._almanac')) AS almanac_info # LIVE
```
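
The function body can also be exercised locally before running a query. The sketch below assumes Node is available and wraps a shortened, hypothetical copy of a function body with `new Function`, using the SQL parameter name as the argument name. It is not how BigQuery itself loads the function, only a quick way to check the JS logic against sample strings.

```javascript
// Shortened, hypothetical copy of the text between the ''' quotes of a function definition.
var body = [
  "var result = {};",
  "try {",
  "  var almanac = JSON.parse(almanac_string);",
  "  if (Array.isArray(almanac) || typeof almanac != 'object') return result;",
  "  if (almanac.scripts) {",
  "    result.scripts_total = almanac.scripts.total;",
  "  }",
  "} catch (e) {}",
  "return result;"
].join("\n");

// The argument name must match the parameter name declared in the SQL.
var getAlmanacInfo = new Function("almanac_string", body);

console.log(getAlmanacInfo('{"scripts": {"total": 3}}')); // { scripts_total: 3 }
console.log(getAlmanacInfo("not json"));                   // {}
console.log(getAlmanacInfo("null"));                       // {}
```

The last two calls show the try/catch returning an empty result for invalid or null input, which becomes NULL fields in the STRUCT.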
110 changes: 110 additions & 0 deletions sql/2020/03_Markup/pages_almanac_by_device.sql
@@ -0,0 +1,110 @@
#standardSQL
# pages almanac metrics grouped by device

# real run estimated at $4.08 and took 48 seconds

# To speed things up there is only one JS function per custom metric property. It returns a STRUCT with all the data needed.
# The current test gathers 3 bits of information from the custom metric properties.
# I tried a single JS function that processed the whole payload, but it was very slow (50 sec) because the full payload had to be parsed in JS.
# This version uses JSON_EXTRACT_SCALAR to first get the custom metric's JSON string, and only passes that into the JS function.
# Estimated at about twice the speed of the original code, and it should scale far better as each custom metric is only parsed once.

# helper to create percent fields
CREATE TEMP FUNCTION AS_PERCENT (freq FLOAT64, total FLOAT64) RETURNS FLOAT64 AS (
ROUND(SAFE_DIVIDE(freq, total), 4)
);

# returns all the data we need from _almanac
CREATE TEMPORARY FUNCTION get_almanac_info(almanac_string STRING)
RETURNS STRUCT<
scripts_total INT64,
none_jsonld_scripts_total INT64,
src_scripts_total INT64,
inline_scripts_total INT64,
good_heading_sequence BOOL,
contains_videos_with_autoplay BOOL,
contains_videos_without_autoplay BOOL,
html_node_lang STRING
> LANGUAGE js AS '''
var result = {};
try {
var almanac = JSON.parse(almanac_string);

if (Array.isArray(almanac) || typeof almanac != 'object') return result;

if (almanac.scripts) {
result.scripts_total = almanac.scripts.total;
if (almanac.scripts.nodes) {
result.none_jsonld_scripts_total = almanac.scripts.nodes.filter(n => !n.type || n.type.trim().toLowerCase() !== 'application/ld+json').length;
result.src_scripts_total = almanac.scripts.nodes.filter(n => n.src && n.src.trim().length > 0).length;

result.inline_scripts_total = result.none_jsonld_scripts_total - result.src_scripts_total;
}
}

if (almanac.headings_order) {
var good = true;
var previousLevel = 0;
almanac.headings_order.forEach(level => {
if (previousLevel + 1 < level) { // jumped a level
good = false;
}
previousLevel = level;
});
result.good_heading_sequence = good;
}

if (almanac.videos && almanac.videos.nodes) { // guard nodes too, so a missing array does not throw and skip the fields set below
var autoplay_count = almanac.videos.nodes.filter(n => n.autoplay == "" || n.autoplay).length; // valid values are blank or "autoplay"; this only checks that the attribute exists

result.contains_videos_with_autoplay = autoplay_count > 0;
result.contains_videos_without_autoplay = almanac.videos.total > autoplay_count;
}

if (almanac.html_node) {
result.html_node_lang = almanac.html_node.lang;
}

} catch (e) {}
return result;
''';

SELECT
client,
COUNT(0) AS total,

# has at least one script that is not a JSON-LD script
AS_PERCENT(COUNTIF(almanac_info.none_jsonld_scripts_total > 0), COUNT(0)) AS pct_contains_none_jsonld_scripts_m204,

# has inline scripts
AS_PERCENT(COUNTIF(almanac_info.inline_scripts_total > 0), COUNT(0)) AS pct_contains_inline_scripts_m206,

# has src scripts
AS_PERCENT(COUNTIF(almanac_info.src_scripts_total > 0), COUNT(0)) AS pct_contains_src_scripts_m208,

# has no scripts
AS_PERCENT(COUNTIF(almanac_info.scripts_total = 0), COUNT(0)) AS pct_contains_no_scripts_m210,

# does the logical heading sequence make sense (no heading level is skipped)
AS_PERCENT(COUNTIF(almanac_info.good_heading_sequence), COUNT(0)) AS pct_good_heading_sequence_m222,

# pages with autoplaying video elements M310
AS_PERCENT(COUNTIF(almanac_info.contains_videos_with_autoplay), COUNT(0)) AS pct_contains_videos_with_autoplay_m310,

# pages without autoplaying video elements M311
AS_PERCENT(COUNTIF(almanac_info.contains_videos_without_autoplay), COUNT(0)) AS pct_contains_videos_without_autoplay_m311,

# pages with no html lang attribute M404
AS_PERCENT(COUNTIF(almanac_info.html_node_lang IS NULL OR LENGTH(almanac_info.html_node_lang) = 0), COUNT(0)) AS pct_no_html_lang_m404,

FROM
(
SELECT
_TABLE_SUFFIX AS client,
get_almanac_info(JSON_EXTRACT_SCALAR(payload, '$._almanac')) AS almanac_info
FROM
`httparchive.pages.2020_08_01_*`
)
GROUP BY
client

@@ -0,0 +1,40 @@
#standardSQL
# pages almanac metrics grouped by device and element attribute use (frequency)

CREATE TEMP FUNCTION AS_PERCENT (freq FLOAT64, total FLOAT64) RETURNS FLOAT64 AS (
ROUND(SAFE_DIVIDE(freq, total), 4)
);

CREATE TEMPORARY FUNCTION get_almanac_attribute_info(almanac_string STRING)
RETURNS ARRAY<STRUCT<name STRING, freq INT64>> LANGUAGE js AS '''
try {
var almanac = JSON.parse(almanac_string);

if (Array.isArray(almanac) || typeof almanac != 'object') return [];

if (almanac.attributes_used_on_elements) {
return Object.entries(almanac.attributes_used_on_elements).map(([name, freq]) => ({name, freq}));
}

} catch (e) {

}
return [];
''';

SELECT
_TABLE_SUFFIX AS client,
almanac_attribute_info.name,
SUM(almanac_attribute_info.freq) AS freq, # total count from all pages
AS_PERCENT(SUM(almanac_attribute_info.freq), SUM(SUM(almanac_attribute_info.freq)) OVER (PARTITION BY _TABLE_SUFFIX)) AS pct_m400
FROM
`httparchive.pages.2020_08_01_*`,
UNNEST(get_almanac_attribute_info(JSON_EXTRACT_SCALAR(payload, '$._almanac'))) AS almanac_attribute_info
GROUP BY
client,
almanac_attribute_info.name
ORDER BY
client,
pct_m400 DESC
LIMIT
1000
@@ -0,0 +1,48 @@
#standardSQL
# pages almanac metrics grouped by device and element attributes being used (present)

CREATE TEMP FUNCTION AS_PERCENT (freq FLOAT64, total FLOAT64) RETURNS FLOAT64 AS (
ROUND(SAFE_DIVIDE(freq, total), 4)
);

CREATE TEMPORARY FUNCTION get_almanac_attribute_names(almanac_string STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS '''
try {
var almanac = JSON.parse(almanac_string);

if (Array.isArray(almanac) || typeof almanac != 'object') return [];

if (almanac.attributes_used_on_elements) {
return Object.keys(almanac.attributes_used_on_elements);
}

} catch (e) {

}
return [];
''';

SELECT
_TABLE_SUFFIX AS client,
attribute_name,
COUNT(DISTINCT url) AS pages,
total,
AS_PERCENT(COUNT(DISTINCT url), total) AS pct_m401
FROM
`httparchive.pages.2020_08_01_*`
JOIN
(SELECT _TABLE_SUFFIX, COUNT(0) AS total
FROM
`httparchive.pages.2020_08_01_*`
GROUP BY _TABLE_SUFFIX) # to get an accurate total of pages per device. also seems fast
USING (_TABLE_SUFFIX),
UNNEST(get_almanac_attribute_names(JSON_EXTRACT_SCALAR(payload, '$._almanac'))) AS attribute_name
GROUP BY
client,
total,
attribute_name
ORDER BY
pages / total DESC,
client
LIMIT
1000