Markup 2020 queries #1137

Merged: 21 commits into main from markup-sql-2020, Sep 19, 2020

Conversation

@Tiggerito (Contributor) commented Aug 1, 2020

Introduction

Metrics are grouped into the sections below.

All metrics segment their data, and there are common segments like device (mobile, desktop). Each metric indicates how its data is segmented. In some cases I will be able to merge metrics that share the same segments into a single query to reduce costs.

Some metrics are segmented by percentile. This means totals are split into the 10th, 25th, 50th, 75th and 90th percentiles (a sketch of this query shape appears at the end of section 3).

I've used some tags to help identify common segments:

[SB-D] = segmented by device
[SB-DP] = segmented by device and percentile
[SB-CUST] = custom segmenting

I'm also adding IDs to each metric so they can be referenced from the queries, the metrics sheet, etc.

Sections

1. General

2. Elements

3. Element/Attributes

  • M301 - how many button elements on a page, segmented by device and percentile [SB-DP]
    pages.payload._markup->buttons
    pages_markup_by_device_and_percentile.sql
    pages_markup_by_device_and_percentile_sample_10k

  • M302 - pages with button elements, percent, segmented by device [SB-D]
    pages.payload._markup->buttons
    pages_markup_by_device.sql
    pages_markup_by_device_sample_10k

  • M303 - pages with button elements with no type, percent, segmented by device [SB-D]
    pages.payload._markup->buttons
    pages_markup_by_device.sql
    pages_markup_by_device_sample_10k

  • M304 - top button element types, percent, segmented by device and type [SB-CUST]
    pages.payload._markup->buttons
    pages_markup_by_device_and_button_types.sql
    pages_markup_by_device_and_button_types_sample_10k

  • M305 - how many input elements with type image on a page, segmented by device and percentile [SB-DP]
    pages.payload._markup->inputs
    pages_markup_by_device_and_percentile.sql
    pages_markup_by_device_and_percentile_sample_10k

  • M306 - how many input elements with type button on a page, segmented by device and percentile [SB-DP]
    pages.payload._markup->inputs
    pages_markup_by_device_and_percentile.sql
    pages_markup_by_device_and_percentile_sample_10k

  • M307 - how many input elements with type submit on a page, segmented by device and percentile [SB-DP]
    pages.payload._markup->inputs
    pages_markup_by_device_and_percentile.sql
    pages_markup_by_device_and_percentile_sample_10k

  • M310 - pages with autoplaying video elements, percent, segmented by device [SB-D]
autoplay evaluates to true
    pages.payload._almanac->videos
    pages_almanac_by_device.sql
    pages_almanac_by_device_sample_10k

  • M311 - pages with non autoplaying video elements, percent, segmented by device [SB-D]
autoplay evaluates to false
    pages.payload._almanac->videos
    pages_almanac_by_device.sql
    pages_almanac_by_device_sample_10k

  • M312 - pages with autoplaying audio elements, percent, segmented by device [SB-D]
autoplay evaluates to true
    pages.payload._markup->audio
    pages_markup_by_device.sql
    pages_markup_by_device_sample_10k

  • M313 - pages with non autoplaying audio elements, percent, segmented by device [SB-D]
autoplay evaluates to false
    pages.payload._markup->audio
    pages_markup_by_device.sql
    pages_markup_by_device_sample_10k

  • M320 - pages that are WordPress, percent, segmented by device - ??? [SB-D]
or maybe use the technologies data, or leave this to the CMS chapter
    pages.payload._almanac->meta_nodes
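
Since many of the metrics above share the [SB-DP] shape, here is a minimal sketch of what a by-device-and-percentile query can look like. It is an illustration only, not the real pages_markup_by_device_and_percentile.sql; the UDF and the markup.js property it reads are assumptions.

CREATE TEMPORARY FUNCTION get_buttons_count(markup_string STRING)
RETURNS FLOAT64 LANGUAGE js AS '''
  try {
    var markup = JSON.parse(markup_string);
    return markup.buttons ? markup.buttons.total : 0; // hypothetical property
  } catch (e) {
    return 0;
  }
''';

SELECT
  client,
  percentile,
  APPROX_QUANTILES(buttons_count, 1000)[OFFSET(percentile * 10)] AS buttons_per_page
FROM (
  SELECT
    _TABLE_SUFFIX AS client,
    get_buttons_count(JSON_EXTRACT_SCALAR(payload, '$._markup')) AS buttons_count
  FROM
    `httparchive.sample_data.pages_*`
),
UNNEST([10, 25, 50, 75, 90]) AS percentile
GROUP BY
  client,
  percentile
ORDER BY
  client,
  percentile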

4. Attributes

@Tiggerito mentioned this pull request Aug 1, 2020
@rviscomi added the "analysis (Querying the dataset)" label Aug 1, 2020
@rviscomi added this to TODO in 2020 via automation Aug 1, 2020
@rviscomi added this to the 2020 Analysis milestone Aug 1, 2020

@Tiggerito (Contributor, Author) commented:

@rviscomi @bazzadp

I've been testing out a new way to construct these custom metrics payload queries. Here's my experiment.

It's building on the idea of using a single query to gather as much data as possible.

I was worried that all the repeated parsing of JSON would make things very slow. I eventually settled on creating one JS function per custom metric that returns a STRUCT with all the data needed, so there is only one parse of the JSON per row.

This should mean the query scales up well.

At one point I tried to parse the whole payload in a JS function, and it was sloooow: the query went from 15 seconds to 50 seconds. Using JSON_EXTRACT_SCALAR is far faster, but it does not have the power to process the JSON the way we often need. Parsing a single custom metric in JS is fine, so I get to each custom metric via JSON_EXTRACT_SCALAR and pass that to the functions.

I've only got it dealing with a few fields at the moment. Adding new ones seems to have no impact on speed.

As expected, it was about twice the speed of an older query that used two separate functions to get each value.

The current experiment query uses the function method to gather data from two custom metrics. If it does start to struggle, then doing a query per custom metric would work.

I did one query on real data. It took 39.1 seconds and cost $2.90.
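
A minimal sketch of this one-function-per-metric pattern (the STRUCT fields and the markup.js property names are illustrative assumptions, not the final schema):

# one JS function per custom metric, returning a STRUCT,
# so the metric's JSON is parsed only once per row
CREATE TEMPORARY FUNCTION get_markup_info(markup_string STRING)
RETURNS STRUCT<
  buttons_count FLOAT64,  # FLOAT64 because JS UDFs do not support INT64
  has_untyped_button BOOL
> LANGUAGE js AS '''
  var result = {};
  try {
    var markup = JSON.parse(markup_string); // the only JSON.parse per row
    if (markup && markup.buttons) {
      result.buttons_count = markup.buttons.total; // hypothetical property
      result.has_untyped_button = markup.buttons.no_type > 0; // hypothetical property
    }
  } catch (e) {}
  return result;
''';

SELECT
  client,
  COUNTIF(markup_info.buttons_count > 0) AS pages_with_buttons,
  COUNTIF(markup_info.has_untyped_button) AS pages_with_untyped_buttons,
  COUNT(0) AS freq
FROM (
  SELECT
    _TABLE_SUFFIX AS client,
    # extract the custom metric in SQL, then parse it once in the UDF
    get_markup_info(JSON_EXTRACT_SCALAR(payload, '$._markup')) AS markup_info
  FROM
    `httparchive.sample_data.pages_*`
)
GROUP BY
  client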

@rviscomi (Member) commented:

+cc @paulcalvano

@Tiggerito for readers who are browsing the finished chapters and want to see how a particular metric was calculated, would this mean that they would be linked to an SQL file that also calculates many other metrics? I think we should opt for many smaller queries rather than a few larger ones. The more narrowly focused the queries are, the easier they are for subsequent analysts (be it the 2021++ team or external readers/researchers) to grok and reuse.

To your implicit goal of reducing computational overhead, I think we should explore ways to make the custom metrics more easily accessible. For example, we could preprocess all custom metrics so that each metric has its own column in a special purpose table. Something like SELECT COUNTIF(markup.images.loading.lazy) FROM httparchive.almanac.custom_metrics where the _markup field from the pages dataset is converted into a BigQuery STRUCT object with properties corresponding to the markup.js schema.

@Tiggerito (Contributor, Author) commented:

> +cc @paulcalvano
>
> @Tiggerito for readers who are browsing the finished chapters and want to see how a particular metric was calculated, would this mean that they would be linked to an SQL file that also calculates many other metrics? I think we should opt for many smaller queries rather than a few larger ones. The more narrowly focused the queries are, the easier they are for subsequent analysts (be it the 2021++ team or external readers/researchers) to grok and reuse.
>
> To your implicit goal of reducing computational overhead, I think we should explore ways to make the custom metrics more easily accessible. For example, we could preprocess all custom metrics so that each metric has its own column in a special purpose table. Something like SELECT COUNTIF(markup.images.loading.lazy) FROM httparchive.almanac.custom_metrics where the _markup field from the pages dataset is converted into a BigQuery STRUCT object with properties corresponding to the markup.js schema.

I will do more testing on using JSON_EXTRACT_SCALAR to get the metric out of the payload. That alone seemed to save a lot of CPU time.

For mine, I may take a middle ground. Often a set of data makes sense together, so combining it in one function using a STRUCT output would make sense. I learnt this from a simple example last year that returned a few values in one go.

I'm concerned with having many smaller queries because of the maintenance involved (we have a lot to create, rough guess 100 for this chapter). That's partly why the shared AS_PERCENT helper (sketched below) would be of help. I also understand that merging into one causes other issues. Maybe a middle ground again, with appropriate/logical grouping?
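
For context, here is a hedged sketch of what such a shared AS_PERCENT helper might look like (the real helper may differ):

CREATE TEMPORARY FUNCTION AS_PERCENT(freq FLOAT64, total FLOAT64)
RETURNS FLOAT64 AS (
  # SAFE_DIVIDE avoids division-by-zero errors on empty segments
  ROUND(SAFE_DIVIDE(freq, total), 4)
);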

Pre-converting the metrics to STRUCT would be great. I'm kind of already doing that, but only for the data I need. Just having each metric in its own column would probably help a lot.

This grouping only works with the common query types: by device, and by device and percentile. It can create a lot of columns but only a few rows. I'm trying to find a way to flip rows and columns, as that would create far more readable/processable results (one possible approach is sketched below).
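
One possible way to flip a wide result, sketched with made-up column names (not what the final queries do): unpivot each metric column into its own (client, metric, value) row.

WITH wide AS (
  # stand-in for a wide one-row-per-client result
  SELECT 'desktop' AS client, 0.52 AS pct_buttons, 0.08 AS pct_audio
  UNION ALL
  SELECT 'mobile', 0.49, 0.07
)

SELECT
  client,
  metric,
  value
FROM
  wide,
  UNNEST([
    STRUCT('pct_buttons' AS metric, pct_buttons AS value),
    STRUCT('pct_audio', pct_audio)
  ])
ORDER BY
  metric,
  client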

@Tiggerito (Contributor, Author) commented:

Using JSON_EXTRACT_SCALAR is definitely faster for the payload: 5.4 seconds vs. 43 seconds for this simple query. I think it's worth at least telling people to make this change:

# payload parsed in the function to get _element_count
# 43.0 seconds for sample 10k
CREATE TEMPORARY FUNCTION parse_element_count(payload STRING)
RETURNS BOOL LANGUAGE js AS '''
    var $ = JSON.parse(payload); // payload in the function
    var elements = JSON.parse($._element_count);
    return true;
''';

SELECT
  _TABLE_SUFFIX AS client,
  COUNTIF(parse_element_count(payload)) AS parsed,
  COUNT(0) AS freq
FROM
  `httparchive.sample_data.pages_*`
GROUP BY
  client

# payload parsed in JSON_EXTRACT_SCALAR to get _element_count
# 5.4 seconds for sample 10k
CREATE TEMPORARY FUNCTION parse_element_count(element_count STRING)
RETURNS BOOL LANGUAGE js AS '''
    var elements = JSON.parse(element_count);
    return true;
''';

SELECT
  _TABLE_SUFFIX AS client,
  COUNTIF(parse_element_count(JSON_EXTRACT_SCALAR(payload, '$._element_count'))) AS parsed, # payload in JSON_EXTRACT_SCALAR
  COUNT(0) AS freq
FROM
  `httparchive.sample_data.pages_*`
GROUP BY
  client

@rviscomi (Member) commented Aug 13, 2020

> I'm concerned with having many smaller queries because of the maintenance involved (we have a lot to create, rough guess 100 for this chapter). That's partly why the shared AS_PERCENT helper would be of help. I also understand that merging into one causes other issues. Maybe a middle ground again, with appropriate/logical grouping?

Yeah, if there's a cluster of related metrics then a single query is OK. Grouping unrelated metrics would be overkill though, even if more efficient. If you run into quota issues, you can pass me the finished query, which I can run using my HTTP Archive billing account and save to your chapter's results sheet.

> Using JSON_EXTRACT_SCALAR is definitely faster for the payload.

Interesting findings! This is a good tip to share with the rest of the @HTTPArchive/analysts.

To summarize: if we let BigQuery extract the custom metric using native JSON functions before passing it into the JS UDF, that seems to speed up the query significantly. We still need UDFs for complex processing, but the basic extraction can be done more efficiently in the SQL code.

@Tiggerito (Contributor, Author) commented:

> > Using JSON_EXTRACT_SCALAR is definitely faster for the payload.
>
> Interesting findings! This is a good tip to share with the rest of the @HTTPArchive/analysts.
>
> To summarize: if we let BigQuery extract the custom metric using native JSON functions before passing it into the JS UDF, that seems to speed up the query significantly. We still need UDFs for complex processing, but the basic extraction can be done more efficiently in the SQL code.

Yep. I guess the JS UDF struggles with the large payload but can cope fine with single custom metrics.

I also suspect JSON_EXTRACT_SCALAR may cache well, meaning multiple calls on the same JSON string would also be fast. Not sure how to test that; one rough idea is sketched below.
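
As a sketch, one rough way to probe it: run these two queries separately and compare the elapsed times BigQuery reports. If the second run does not take roughly three times as long, repeated extraction from the same string is being optimized.

# run 1: a single extraction per row
SELECT
  COUNTIF(JSON_EXTRACT_SCALAR(payload, '$._markup') IS NOT NULL) AS extracted
FROM
  `httparchive.sample_data.pages_*`;

# run 2: three extractions from the same payload string per row
SELECT
  COUNTIF(JSON_EXTRACT_SCALAR(payload, '$._markup') IS NOT NULL) AS extracted_1,
  COUNTIF(JSON_EXTRACT_SCALAR(payload, '$._element_count') IS NOT NULL) AS extracted_2,
  COUNTIF(JSON_EXTRACT_SCALAR(payload, '$._almanac') IS NOT NULL) AS extracted_3
FROM
  `httparchive.sample_data.pages_*`;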

@Tiggerito (Contributor, Author) commented:

I've covered almost all of the requested data now. Only one query, pages_markup_by_device.sql, ended up quite big, at about 20 fields. There are some obvious ways I can split out some field groups if required.

I've also enhanced the README with some guidelines on the structure I used.

@Tiggerito marked this pull request as ready for review September 1, 2020 07:37

@Tiggerito (Contributor, Author) commented:

@rviscomi @paulcalvano @bazzadp

All the queries are now pointing to live tables and ready for review.

I tested a few metrics-based queries, which most of mine are. They come in at $4 each.

And I used them to practice setting up the spreadsheet:

https://docs.google.com/spreadsheets/d/1Ta7amoUeaL4pILhWzH-SCzMX9PsZeb1x_mwrX2C4eY8/edit?ts=5f4e700c#gid=150962402

Note: I had to copy the Data Viz Gallery so that I could copy the charts.

@rviscomi requested a review from a team September 5, 2020 06:17
@rviscomi moved this from TODO to In progress in 2020 Sep 5, 2020

@rviscomi (Member) left a review comment:

Similar to #1062 (comment), I'd recommend removing test code from the queries.

@Tiggerito (Contributor, Author) commented:

> Similar to #1062 (comment), I'd recommend removing test code from the queries.

Yep, I'll make the same changes here.

Commit: removed commented out code

@Tiggerito (Contributor, Author) commented:

> Similar to #1062 (comment), I'd recommend removing test code from the queries.

All test code removed.

@rviscomi (Member) left a review comment:

Early feedback. Looks great so far, just some organizational questions and a few suggestions.

Commits: "Added an order by", "comments", "total -> freq"
Tiggerito and others added 3 commits September 19, 2020 11:21: "moved some metrics to a percentile query", "removed error reporting" (Co-authored-by: Rick Viscomi <rviscomi@users.noreply.github.com>)

@Tiggerito (Contributor, Author) commented:

@rviscomi If you're happy with my comments, I think I've covered everything.

@rviscomi merged commit 3300d73 into main Sep 19, 2020
2020 automation moved this from In progress to Done Sep 19, 2020
@rviscomi deleted the markup-sql-2020 branch September 19, 2020 22:22

@rviscomi (Member) commented:

Massive effort, thanks @Tiggerito!
