Markup 2020 queries #1137

Merged: 21 commits into main from markup-sql-2020, Sep 19, 2020

Conversation

@Tiggerito (Contributor) commented Aug 1, 2020

Introduction

Metrics are grouped into the sections below.

All metrics segment their data, and there are common segments like device (mobile, desktop). Each metric indicates how its data is segmented. In some cases I will be able to merge metrics that share the same segments into a single query to reduce costs.

Some metrics are segmented by percentile. This means totals are split into the 10th, 25th, 50th, 75th and 90th percentiles (a sketch of this query shape appears at the end of section 3).

I've used some tags to help identify common segments:

[SB-D] = segmented by device
[SB-DP] = segmented by device and percentile
[SB-CUST] = custom segmenting

I'm also adding IDs to each metric so they can be referenced from the queries, the metrics sheet, etc.

Sections

1. General

2. Elements

3. Element/Attributes

  • M301 - how many button elements on a page, segmented by device and percentile [SB-DP]
    pages.payload._markup->buttons
    pages_markup_by_device_and_percentile.sql
    pages_markup_by_device_and_percentile_sample_10k

  • M302 - pages with button elements, percent, segmented by device [SB-D]
    pages.payload._markup->buttons
    pages_markup_by_device.sql
    pages_markup_by_device_sample_10k

  • M303 - pages with button elements with no type, percent, segmented by device [SB-D]
    pages.payload._markup->buttons
    pages_markup_by_device.sql
    pages_markup_by_device_sample_10k

  • M304 - top button element types, percent, segmented by device and type [SB-CUST]
    pages.payload._markup->buttons
    pages_markup_by_device_and_button_types.sql
    pages_markup_by_device_and_button_types_sample_10k

  • M305 - how many input elements with type image on a page, segmented by device and percentile [SB-DP]
    pages.payload._markup->inputs
    pages_markup_by_device_and_percentile.sql
    pages_markup_by_device_and_percentile_sample_10k

  • M306 - how many input elements with type button on a page, segmented by device and percentile [SB-DP]
    pages.payload._markup->inputs
    pages_markup_by_device_and_percentile.sql
    pages_markup_by_device_and_percentile_sample_10k

  • M307 - how many input elements with type submit on a page, segmented by device and percentile [SB-DP]
    pages.payload._markup->inputs
    pages_markup_by_device_and_percentile.sql
    pages_markup_by_device_and_percentile_sample_10k

  • M310 - pages with autoplaying video elements, percent, segmented by device [SB-D]
autoplay evaluates to true
    pages.payload._almanac->videos
    pages_almanac_by_device.sql
    pages_almanac_by_device_sample_10k

  • M311 - pages with non autoplaying video elements, percent, segmented by device [SB-D]
autoplay evaluates to false
    pages.payload._almanac->videos
    pages_almanac_by_device.sql
    pages_almanac_by_device_sample_10k

  • M312 - pages with autoplaying audio elements, percent, segmented by device [SB-D]
autoplay evaluates to true
    pages.payload._markup->audio
    pages_markup_by_device.sql
    pages_markup_by_device_sample_10k

  • M313 - pages with non autoplaying audio elements, percent, segmented by device [SB-D]
autoplay evaluates to false
    pages.payload._markup->audio
    pages_markup_by_device.sql
    pages_markup_by_device_sample_10k

  • M320 - pages that are WordPress, percent, segmented by device - ??? [SB-D]
or maybe use the technologies data, or leave this to the CMS chapter
    pages.payload._almanac->meta_nodes
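
Since many of the metrics above share the [SB-DP] shape, here is a minimal sketch of what a by-device-and-percentile query can look like. It is an illustration only, not the real pages_markup_by_device_and_percentile.sql; the UDF and the markup.js property it reads are assumptions.

CREATE TEMPORARY FUNCTION get_buttons_count(markup_string STRING)
RETURNS FLOAT64 LANGUAGE js AS '''
  try {
    var markup = JSON.parse(markup_string);
    return markup.buttons ? markup.buttons.total : 0; // hypothetical property
  } catch (e) {
    return 0;
  }
''';

SELECT
  client,
  percentile,
  APPROX_QUANTILES(buttons_count, 1000)[OFFSET(percentile * 10)] AS buttons_per_page
FROM (
  SELECT
    _TABLE_SUFFIX AS client,
    get_buttons_count(JSON_EXTRACT_SCALAR(payload, '$._markup')) AS buttons_count
  FROM
    `httparchive.sample_data.pages_*`
),
UNNEST([10, 25, 50, 75, 90]) AS percentile
GROUP BY
  client,
  percentile
ORDER BY
  client,
  percentile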

4. Attributes

@Tiggerito mentioned this pull request Aug 1, 2020
@rviscomi added the "analysis (Querying the dataset)" label Aug 1, 2020
@rviscomi added this to TODO in 2020 via automation Aug 1, 2020
@rviscomi added this to the 2020 Analysis milestone Aug 1, 2020

@Tiggerito (Contributor, Author) commented:

@rviscomi @bazzadp

I've been testing out a new way to construct these custom metrics payload queries. Here's my experiment.

It's building on the idea of using a single query to gather as much data as possible.

I was worried that all the repeated parsing of JSON would make things very slow. I eventually settled on creating one JS function per custom metric that returns a STRUCT with all the data needed, so there is only one parse of the JSON per row.

This should mean the query scales up well.

At one point I tried to parse the whole payload in a JS function, and it was sloooow: the query went from 15 seconds to 50 seconds. Using JSON_EXTRACT_SCALAR is far faster, but it does not have the power to process the JSON the way we often need. Parsing a single custom metric in JS is fine, so I get to each custom metric via JSON_EXTRACT_SCALAR and pass that to the functions.

I've only got it dealing with a few fields at the moment. Adding new ones seems to have no impact on speed.

As expected, it was about twice the speed of an older query that used two separate functions to get each value.

The current experiment query uses the function method to gather data from two custom metrics. If it does start to struggle, then doing a query per custom metric would work.

I did one query on real data. It took 39.1 seconds and cost $2.90.
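
A minimal sketch of this one-function-per-metric pattern (the STRUCT fields and the markup.js property names are illustrative assumptions, not the final schema):

# one JS function per custom metric, returning a STRUCT,
# so the metric's JSON is parsed only once per row
CREATE TEMPORARY FUNCTION get_markup_info(markup_string STRING)
RETURNS STRUCT<
  buttons_count FLOAT64,  # FLOAT64 because JS UDFs do not support INT64
  has_untyped_button BOOL
> LANGUAGE js AS '''
  var result = {};
  try {
    var markup = JSON.parse(markup_string); // the only JSON.parse per row
    if (markup && markup.buttons) {
      result.buttons_count = markup.buttons.total; // hypothetical property
      result.has_untyped_button = markup.buttons.no_type > 0; // hypothetical property
    }
  } catch (e) {}
  return result;
''';

SELECT
  client,
  COUNTIF(markup_info.buttons_count > 0) AS pages_with_buttons,
  COUNTIF(markup_info.has_untyped_button) AS pages_with_untyped_buttons,
  COUNT(0) AS freq
FROM (
  SELECT
    _TABLE_SUFFIX AS client,
    # extract the custom metric in SQL, then parse it once in the UDF
    get_markup_info(JSON_EXTRACT_SCALAR(payload, '$._markup')) AS markup_info
  FROM
    `httparchive.sample_data.pages_*`
)
GROUP BY
  client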

@rviscomi (Member) commented:

+cc @paulcalvano

@Tiggerito for readers who are browsing the finished chapters and want to see how a particular metric was calculated, would this mean that they would be linked to an SQL file that also calculates many other metrics? I think we should opt for many smaller queries rather than a few larger ones. The more narrowly focused the queries are, the easier they are for subsequent analysts (be it the 2021++ team or external readers/researchers) to grok and reuse.

To your implicit goal of reducing computational overhead, I think we should explore ways to make the custom metrics more easily accessible. For example, we could preprocess all custom metrics so that each metric has its own column in a special purpose table. Something like SELECT COUNTIF(markup.images.loading.lazy) FROM httparchive.almanac.custom_metrics where the _markup field from the pages dataset is converted into a BigQuery STRUCT object with properties corresponding to the markup.js schema.

@Tiggerito (Contributor, Author) commented:

> +cc @paulcalvano
>
> @Tiggerito for readers who are browsing the finished chapters and want to see how a particular metric was calculated, would this mean that they would be linked to an SQL file that also calculates many other metrics? I think we should opt for many smaller queries rather than a few larger ones. The more narrowly focused the queries are, the easier they are for subsequent analysts (be it the 2021++ team or external readers/researchers) to grok and reuse.
>
> To your implicit goal of reducing computational overhead, I think we should explore ways to make the custom metrics more easily accessible. For example, we could preprocess all custom metrics so that each metric has its own column in a special purpose table. Something like SELECT COUNTIF(markup.images.loading.lazy) FROM httparchive.almanac.custom_metrics where the _markup field from the pages dataset is converted into a BigQuery STRUCT object with properties corresponding to the markup.js schema.

I will do more testing on using JSON_EXTRACT_SCALAR to get the metric out of the payload. That alone seemed to save a lot of CPU time.

For mine, I may take a middle ground. Often a set of data makes sense together, so combining it in one function using a STRUCT output would make sense. I learnt this from a simple example last year that returned a few values in one go.

I'm concerned with having many smaller queries because of the maintenance involved (we have a lot to create, rough guess 100 for this chapter). That's partly why the shared AS_PERCENT helper (sketched below) would be of help. I also understand that merging into one causes other issues. Maybe a middle ground again, with appropriate/logical grouping?
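
For context, here is a hedged sketch of what such a shared AS_PERCENT helper might look like (the real helper may differ):

CREATE TEMPORARY FUNCTION AS_PERCENT(freq FLOAT64, total FLOAT64)
RETURNS FLOAT64 AS (
  # SAFE_DIVIDE avoids division-by-zero errors on empty segments
  ROUND(SAFE_DIVIDE(freq, total), 4)
);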

Pre-converting the metrics to STRUCT would be great. I'm kind of already doing that, but only for the data I need. Just having each metric in its own column would probably help a lot.

This grouping only works with the common query types: by device, and by device and percentile. It can create a lot of columns but only a few rows. I'm trying to find a way to flip rows and columns, as that would create far more readable/processable results (one possible approach is sketched below).
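
One possible way to flip a wide result, sketched with made-up column names (not what the final queries do): unpivot each metric column into its own (client, metric, value) row.

WITH wide AS (
  # stand-in for a wide one-row-per-client result
  SELECT 'desktop' AS client, 0.52 AS pct_buttons, 0.08 AS pct_audio
  UNION ALL
  SELECT 'mobile', 0.49, 0.07
)

SELECT
  client,
  metric,
  value
FROM
  wide,
  UNNEST([
    STRUCT('pct_buttons' AS metric, pct_buttons AS value),
    STRUCT('pct_audio', pct_audio)
  ])
ORDER BY
  metric,
  client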

@Tiggerito (Contributor, Author) commented:

Using JSON_EXTRACT_SCALAR is definitely faster for the payload: 5.4 seconds vs. 43 seconds for this simple query. I think it's worth at least telling people to make this change:

# payload parsed in the function to get _element_count
# 43.0 seconds for sample 10k
CREATE TEMPORARY FUNCTION parse_element_count(payload STRING)
RETURNS BOOL LANGUAGE js AS '''
    var $ = JSON.parse(payload); // payload in the function
    var elements = JSON.parse($._element_count);
    return true;
''';

SELECT
  _TABLE_SUFFIX AS client,
  COUNTIF(parse_element_count(payload)) AS parsed,
  COUNT(0) AS freq
FROM
  `httparchive.sample_data.pages_*`
GROUP BY
  client

# payload parsed in JSON_EXTRACT_SCALAR to get _element_count
# 5.4 seconds for sample 10k
CREATE TEMPORARY FUNCTION parse_element_count(element_count STRING)
RETURNS BOOL LANGUAGE js AS '''
    var elements = JSON.parse(element_count);
    return true;
''';

SELECT
  _TABLE_SUFFIX AS client,
  COUNTIF(parse_element_count(JSON_EXTRACT_SCALAR(payload, '$._element_count'))) AS parsed, # payload in JSON_EXTRACT_SCALAR
  COUNT(0) AS freq
FROM
  `httparchive.sample_data.pages_*`
GROUP BY
  client

@rviscomi (Member) commented Aug 13, 2020

> I'm concerned with having many smaller queries because of the maintenance involved (we have a lot to create, rough guess 100 for this chapter). That's partly why the shared AS_PERCENT helper would be of help. I also understand that merging into one causes other issues. Maybe a middle ground again, with appropriate/logical grouping?

Yeah, if there's a cluster of related metrics then a single query is OK. Grouping unrelated metrics would be overkill though, even if more efficient. If you run into quota issues, you can pass me the finished query, which I can run using my HTTP Archive billing account and save to your chapter's results sheet.

> Using JSON_EXTRACT_SCALAR is definitely faster for the payload.

Interesting findings! This is a good tip to share with the rest of the @HTTPArchive/analysts.

To summarize: if we let BigQuery extract the custom metric using native JSON functions before passing it into the JS UDF, that seems to speed up the query significantly. We still need UDFs for complex processing, but the basic extraction can be done more efficiently in the SQL code.

@Tiggerito (Contributor, Author) commented:

> > Using JSON_EXTRACT_SCALAR is definitely faster for the payload.
>
> Interesting findings! This is a good tip to share with the rest of the @HTTPArchive/analysts.
>
> To summarize: if we let BigQuery extract the custom metric using native JSON functions before passing it into the JS UDF, that seems to speed up the query significantly. We still need UDFs for complex processing, but the basic extraction can be done more efficiently in the SQL code.

Yep. I guess the JS UDF struggles with the large payload but can cope fine with single custom metrics.

I also suspect JSON_EXTRACT_SCALAR may cache well, meaning multiple calls on the same JSON string would also be fast. Not sure how to test that; one rough idea is sketched below.
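
As a sketch, one rough way to probe it: run these two queries separately and compare the elapsed times BigQuery reports. If the second run does not take roughly three times as long, repeated extraction from the same string is being optimized.

# run 1: a single extraction per row
SELECT
  COUNTIF(JSON_EXTRACT_SCALAR(payload, '$._markup') IS NOT NULL) AS extracted
FROM
  `httparchive.sample_data.pages_*`;

# run 2: three extractions from the same payload string per row
SELECT
  COUNTIF(JSON_EXTRACT_SCALAR(payload, '$._markup') IS NOT NULL) AS extracted_1,
  COUNTIF(JSON_EXTRACT_SCALAR(payload, '$._element_count') IS NOT NULL) AS extracted_2,
  COUNTIF(JSON_EXTRACT_SCALAR(payload, '$._almanac') IS NOT NULL) AS extracted_3
FROM
  `httparchive.sample_data.pages_*`;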

@Tiggerito (Contributor, Author) commented:

I've covered almost all of the requested data now. Only one query, pages_markup_by_device.sql, ended up quite big, at about 20 fields. There are some obvious ways I can split out some field groups if required.

I've also enhanced the README with some guidelines on the structure I used.

@Tiggerito marked this pull request as ready for review September 1, 2020 07:37

@Tiggerito (Contributor, Author) commented:

@rviscomi @paulcalvano @bazzadp

All the queries are now pointing to live tables and ready for review.

I tested a few metrics-based queries, which most of mine are. They come in at $4 each.

And I used them to practice setting up the spreadsheet:

https://docs.google.com/spreadsheets/d/1Ta7amoUeaL4pILhWzH-SCzMX9PsZeb1x_mwrX2C4eY8/edit?ts=5f4e700c#gid=150962402

Note: I had to copy the Data Viz Gallery so that I could copy the charts.

@rviscomi requested a review from a team September 5, 2020 06:17
@rviscomi moved this from TODO to In progress in 2020 Sep 5, 2020

@rviscomi (Member) left a review comment:

Similar to #1062 (comment), I'd recommend removing test code from the queries.

@Tiggerito (Contributor, Author) commented:

> Similar to #1062 (comment), I'd recommend removing test code from the queries.

Yep, I'll make the same changes here.

Commit: removed commented out code

@Tiggerito (Contributor, Author) commented:

> Similar to #1062 (comment), I'd recommend removing test code from the queries.

All test code removed.

@rviscomi (Member) left a review comment:

Early feedback. Looks great so far, just some organizational questions and a few suggestions.

Commits: "Added an order by", "comments", "total -> freq"
Tiggerito and others added 3 commits September 19, 2020 11:21: "moved some metrics to a percentile query", "removed error reporting" (Co-authored-by: Rick Viscomi <rviscomi@users.noreply.github.com>)

@Tiggerito (Contributor, Author) commented:

@rviscomi If you're happy with my comments, I think I've covered everything.

@rviscomi merged commit 3300d73 into main Sep 19, 2020
2020 automation moved this from In progress to Done Sep 19, 2020
@rviscomi deleted the markup-sql-2020 branch September 19, 2020 22:22

@rviscomi (Member) commented:

Massive effort, thanks @Tiggerito!
