Markup 2020 queries #1137
d33e3ff
to
b28f3ab
Compare
…nac.httparchive.org into markup-sql-2020
experiment on speed improvements
I've been testing a new way to construct these custom metrics payload queries. Here's my experiment. It builds on the idea of using a single query to gather as much data as possible.
I was worried that all the repeated parsing of JSON would make things very slow. I eventually settled on one JS function per custom metric that returns a STRUCT with all the data needed, so the JSON is parsed only once per row. That should mean the query scales well.
At one point I tried parsing the whole payload in a JS function, and it was slow: the query went from 15 seconds to 50 seconds. JSON_EXTRACT_SCALAR is far faster but does not have the power to process the JSON the way we often need. Parsing a single custom metric in JS is fine, so I get to each custom metric via JSON_EXTRACT_SCALAR and pass that to the functions.
I've only got it dealing with a few fields at the moment. Adding new ones seems to have no impact on speed. As expected, it was about twice the speed of an older query that used two separate functions to get each value. The current experiment query uses the function method to gather data from two custom metrics. If it does start to struggle, a query per custom metric would work.
I did one query on real data. It took 39.1 seconds and cost $2.90.
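
A minimal sketch of the one-function-per-custom-metric idea described above, not the actual query from this PR. The function name `get_wpt_bodies_info`, the two fields chosen, and the table name are assumptions for illustration; the field paths follow the `_wpt_bodies->raw_html` entries listed in the README.

```sql
#standardSQL
-- Hypothetical helper: parses the _wpt_bodies custom metric once per row
-- and returns every field the query needs as a single STRUCT.
CREATE TEMPORARY FUNCTION get_wpt_bodies_info(wpt_bodies_string STRING)
RETURNS STRUCT<
  comment_count INT64,
  conditional_comment_count INT64
> LANGUAGE js AS '''
var result = {};
try {
  var wpt_bodies = JSON.parse(wpt_bodies_string);
  if (wpt_bodies && wpt_bodies.raw_html) {
    result.comment_count = wpt_bodies.raw_html.comment_count;
    result.conditional_comment_count = wpt_bodies.raw_html.conditional_comment_count;
  }
} catch (e) {}
// Fields that were never set come back as NULL.
return result;
''';

SELECT
  client,
  COUNT(0) AS total,
  COUNTIF(wpt_bodies_info.comment_count > 0) AS pages_with_comments,
  COUNTIF(wpt_bodies_info.conditional_comment_count > 0) AS pages_with_conditional_comments
FROM (
  SELECT
    _TABLE_SUFFIX AS client,
    -- Native extraction pulls out just this custom metric before the UDF sees it.
    get_wpt_bodies_info(JSON_EXTRACT_SCALAR(payload, '$._wpt_bodies')) AS wpt_bodies_info
  FROM
    `httparchive.pages.2020_08_01_*` -- assumed table name
)
GROUP BY
  client
```

Adding another field means one more STRUCT member and one more assignment in the function; the JSON is still parsed once per row.
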
+cc @paulcalvano @Tiggerito for readers who are browsing the finished chapters and want to see how a particular metric was calculated, would this mean that they would be linked to an SQL file that also calculates many other metrics?
I think we should opt for many smaller queries rather than a few larger ones. The more narrowly focused the queries are, the easier they are for subsequent analysts (be it the 2021++ team or external readers/researchers) to grok and reuse.
To your implicit goal of reducing computational overhead, I think we should explore ways to make the custom metrics more easily accessible. For example, we could preprocess all custom metrics so that each metric has its own column in a special purpose table. Something like
I will test more about using JSON_EXTRACT_SCALAR to get the metric out of the payload. That alone seemed to save a lot of CPU time.
For mine I may take a middle ground. Often a set of data makes sense together, so combining them in one function with a STRUCT output would make sense. I learnt it from a simple example last year that returned a few values in one go.
I'm concerned about having many smaller queries because of the maintenance involved (we have a lot to create, at a rough guess 100 for this chapter). That's partly why the shared AS_PERCENT would help. I also understand that merging into one causes other issues. Maybe a middle ground again, with appropriate/logical grouping?
Pre-converting the metrics to STRUCT would be great. I'm effectively doing that after the fact, for just the data I need. Just having each in its own column would probably help a lot.
This grouping only works with the common query types: by device, and by device and percentile. It can create a lot of columns but only a few rows. I'm trying to find a way to flip rows and columns, as that would create far more readable/processable results.
Using JSON_EXTRACT_SCALAR is definitely faster for the payload: 5.4 seconds vs. 43 seconds for this simple query. I think it's worth at least telling people to make this change:
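
The snippet that originally followed is not preserved here. As a rough sketch of the kind of change being described, assuming hypothetical function names and an assumed table name, the difference is whether the JS UDF receives the whole payload or only the custom metric already extracted by BigQuery:

```sql
#standardSQL
-- Slower pattern (hypothetical function): the UDF parses the entire payload.
CREATE TEMPORARY FUNCTION get_comment_count_from_payload(payload STRING)
RETURNS INT64 LANGUAGE js AS '''
try {
  var $ = JSON.parse(payload);
  var wpt_bodies = JSON.parse($._wpt_bodies);
  return wpt_bodies.raw_html.comment_count;
} catch (e) {
  return null;
}
''';

-- Faster pattern (hypothetical function): the UDF parses one custom metric only.
CREATE TEMPORARY FUNCTION get_comment_count(wpt_bodies_string STRING)
RETURNS INT64 LANGUAGE js AS '''
try {
  var wpt_bodies = JSON.parse(wpt_bodies_string);
  return wpt_bodies.raw_html.comment_count;
} catch (e) {
  return null;
}
''';

SELECT
  _TABLE_SUFFIX AS client,
  -- Before: COUNTIF(get_comment_count_from_payload(payload) > 0)
  COUNTIF(get_comment_count(JSON_EXTRACT_SCALAR(payload, '$._wpt_bodies')) > 0) AS pages_with_comments,
  COUNT(0) AS total
FROM
  `httparchive.pages.2020_08_01_*` -- assumed table name
GROUP BY
  client
```
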
Yeah if there's a cluster of related metrics then a single query is ok. Grouping unrelated metrics would be overkill though, even if more efficient. If you run into quota issues, you can pass me the finished query which I can run using my HTTP Archive billing account and save to your chapter's results sheet.
Interesting findings! This is a good tip to share with the rest of the @HTTPArchive/analysts. To summarize: if we let BigQuery extract the custom metric using native JSON functions before passing it into the JS UDF, that seems to speed up the query significantly. We still need UDFs for complex processing, but the basic extraction can be done more efficiently in the SQL code.
Yep. I guess the JS UDF struggles with the large payload but copes fine with single custom metrics. I also suspect JSON_EXTRACT_SCALAR may cache well, meaning multiple calls on the same JSON string would also be fast. Not sure how to test that.
I've covered almost all of the requested data now. Only one query, pages_markup_by_device.sql, ended up quite big, at about 20 fields. There are some obvious ways I can split out some field groups if required. I've also enhanced the README with some guidelines on the structure I used.
@rviscomi @paulcalvano @bazzadp All the queries are now pointing to live tables and ready for review. I tested a few metrics-based queries, which most of mine are. They come in at $4 each. And I used them to practice setting up the spreadsheet: Note, I had to copy the Data Viz Gallery so that I could copy the charts.
Similar to #1062 (comment), I'd recommend removing test code from the queries.
Yep, I'll make the same changes here.
removed commented out code
All test code removed.
Early feedback. Looks great so far, just some organizational questions and a few suggestions.
sql/2020/03_Markup/pages_almanac_by_device_and_attribute_name_frequency.sql
sql/2020/03_Markup/pages_element_count_by_device_and_custom_dash_elements.sql
Added an order by comments total -> freq
sql/2020/03_Markup/pages_almanac_by_device_and_attribute_name_present.sql
moved some metrics to a percentile query removed error reporting
Co-authored-by: Rick Viscomi <rviscomi@users.noreply.github.com>
@rviscomi If you're happy with my comments, I think I've covered everything.
Massive effort, thanks @Tiggerito!
Introduction
Metrics are in sections.
All metrics segment their data and there are common segments like device (mobile, desktop). Each metric will indicate how the data is segmented. In some cases I will be able to merge metrics into queries that use the same segments to reduce costs.
Some metrics are segmented by percentile. This means totals are split into the 10th, 25th, 50th, 75th and 90th percentiles.
I've used some tags to help identify common segments:
[SB-D] = segmented by device
[SB-DP] = segmented by device and percentile
[SB-CUST] = custom segmenting
I'm also adding IDs to each metric so they can be referenced from the queries and the metrics sheet etc.
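
As an illustration of the [SB-DP] segmenting described above, here is a minimal sketch of a device-and-percentile query for a metric such as M107 (document size). It is not one of the queries in this PR; the table name is an assumption.

```sql
#standardSQL
-- Sketch of an [SB-DP] query: one row per device and percentile.
SELECT
  percentile,
  client,
  -- 1000 buckets, so percentile * 10 is the bucket boundary for that percentile.
  APPROX_QUANTILES(bytesHtml, 1000)[OFFSET(percentile * 10)] AS html_bytes
FROM (
  SELECT
    _TABLE_SUFFIX AS client,
    bytesHtml
  FROM
    `httparchive.summary_pages.2020_08_01_*` -- assumed table name
),
  UNNEST([10, 25, 50, 75, 90]) AS percentile
GROUP BY
  percentile,
  client
ORDER BY
  percentile,
  client
```

Other [SB-DP] metrics that read from the same source could be added as extra `APPROX_QUANTILES` columns in the same query, which is where the cost saving from merging metrics would come from.
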
Sections
1. General
M101 - doctype values, percent, segmented by device and doctype [SB-CUST]
summary_pages.doctype
summary_pages_by_device_and_doctype.sql
summary_pages_by_device_and_doctype_sample_10k
M102 - doctype is present, percent, segmented by device [SB-D]
summary_pages.doctype
summary_pages_by_device.sql
summary_pages_by_device_sample_10k
M103 - number of comments per page, segmented by device and percentile [SB-DP]
pages.payload._wpt_bodies->raw_html.comment_count
pages_wpt_bodies_by_device_and_percentile.sql
pages_wpt_bodies_by_device_and_percentile_sample_10k
M104 - comments are present, percent, segmented by device [SB-D]
pages.payload._wpt_bodies->raw_html.comment_count
pages_wpt_bodies_by_device.sql
pages_wpt_bodies_by_device_sample_10k
M105 - number of conditional comments per page, segmented by device and percentile [SB-DP]
pages.payload._wpt_bodies->raw_html.conditional_comment_count
pages_wpt_bodies_by_device_and_percentile.sql
pages_wpt_bodies_by_device_and_percentile_sample_10k
M106 - conditional comments are present, percent, segmented by device [SB-D]
pages.payload._wpt_bodies->raw_html.conditional_comment_count
pages_wpt_bodies_by_device.sql
pages_wpt_bodies_by_device_sample_10k
M107 - document size, segmented by device and percentile [SB-DP]
summary_pages.bytesHtml
summary_pages_by_device_and_percentile.sql
summary_pages_by_device_and_percentile_sample_10k
M108 - max document size, segmented by device [SB-D]
summary_pages.bytesHtml
summary_pages_by_device.sql
summary_pages_by_device_sample_10k
M109 - min document size, segmented by device [SB-D]
summary_pages.bytesHtml
summary_pages_by_device.sql
summary_pages_by_device_sample_10k
M110 - average document size, segmented by device [SB-D]
summary_pages.bytesHtml
summary_pages_by_device.sql
summary_pages_by_device_sample_10k
2. Elements
M201 - top 10,000 element types, full count over all pages, segmented by device and element type [SB-CUST]
pages.payload._element_count
fig 1 - frequency
pages_element_count_by_device_and_element_type_frequency.sql
pages_element_count_by_device_and_element_type_frequency_sample_10k
M202 - top 10,000 element types, percent, segmented by device and element type [SB-CUST]
I think this relates to last year and is number of pages that include an element at least once
pages.payload._element_count
fig 1 - frequency
pages_element_count_by_device_and_element_type_frequency.sql
pages_element_count_by_device_and_element_type_frequency_sample_10k
M203 - how many different element types are there on a page, segmented by device [SB-CUST]
from last year. Not requested this year.
pages.payload._element_count
pages_element_count_by_device_and_element_types_used_per_page.sql
pages_element_count_by_device_and_element_types_used_per_page_sample_10k
M204 - pages with a script tag, percent, segmented by device [SB-D]
excluding application/ld+json
pages.payload._almanac->scripts
pages_almanac_by_device.sql
pages_almanac_by_device_sample_10k
M205 - script tags per page, segmented by device and percentile [SB-DP]
excluding application/ld+json
pages.payload._almanac->scripts
pages_almanac_by_device_and_percentile.sql
pages_almanac_by_device_and_percentile_sample_10k
M206 - pages with an inline script tag, percent, segmented by device [SB-D]
excluding application/ld+json
pages.payload._almanac->scripts
pages_almanac_by_device.sql
pages_almanac_by_device_sample_10k
M207 - inline script tags per page, segmented by device and percentile [SB-DP]
excluding application/ld+json
pages.payload._almanac->scripts
pages_almanac_by_device_and_percentile.sql
pages_almanac_by_device_and_percentile_sample_10k
M208 - pages with a src script tag, percent, segmented by device [SB-D]
pages.payload._almanac->scripts
pages_almanac_by_device.sql
pages_almanac_by_device_sample_10k
M209 - src script tags per page, segmented by device and percentile [SB-DP]
excluding application/ld+json
pages.payload._almanac->scripts
pages_almanac_by_device_and_percentile.sql
pages_almanac_by_device_and_percentile_sample_10k
M210 - pages without a script tag, percent, segmented by device [SB-D]
pages.payload._almanac->scripts
pages_almanac_by_device.sql
pages_almanac_by_device_sample_10k
M211 - pages with a noscript tag, percent, segmented by device [SB-D]
pages.payload._markup->noscripts
pages_markup_by_device.sql
pages_markup_by_device_sample_10k
M212 - noscript tags per page, segmented by device and percentile [SB-DP]
pages.payload._markup->noscripts
pages_markup_by_device_and_percentile.sql
pages_markup_by_device_and_percentile_sample_10k
M213 - pages with a noscript GTM tag, percent, segmented by device [SB-D]
pages.payload._markup->noscripts
pages_markup_by_device.sql
pages_markup_by_device_sample_10k
M214 - pages with details/summary tags, percent, segmented by device [SB-D]
pages.payload._element_count
pages_element_count_by_device.sql
pages_element_count_by_device_sample_10k
M215 - How often do elements appear per document? - ???
pages.payload._element_count
Is this really M202?
Suggestion on how to do this in slack from Rick. Do percentile and group by element and limit to top elements
This seems like a good use case for percentiles, specifically the median. You could count the frequency of each element type per page, aggregate the median frequency across all pages, and limit the results to the top 10 element types by frequency.
M216 - pages using obsolete elements, percent, segmented by device and element [SB-CUST]
pages.payload._markup->obsolete_elements
pages.payload._element_count
pages_element_count_by_device_and_obsolete_elements.sql
pages_element_count_by_device_and_obsolete_elements_sample_10k
M217 - image types used for favicons, percent, segmented by device and type [SB-CUST]
This is based on file extensions.
pages.payload._almanac->link-nodes
pages_almanac_by_device_and_favicon_image_type.sql
pages_almanac_by_device_and_favicon_image_type_sample_10k
M218 - pages with a favicon, percent, segmented by device [SB-D]
pages.payload._markup->favicon
pages_markup_by_device.sql
pages_markup_by_device_sample_10k
M219 - How do people use the viewport meta element
summary_pages.meta_viewport
summary_pages_by_device_and_viewport.sql
summary_pages_by_device_and_viewport_sample_10k
M220 - pages without an h1, percent, segmented by device [SB-D]
pages.payload._wpt_bodies->headings
pages_wpt_bodies_by_device.sql
pages_wpt_bodies_by_device_sample_10k
M221 - number of h1s to h8s on a page, segmented by device, percentile and heading level [SB-CUST]
pages.payload._wpt_bodies->headings
pages_wpt_bodies_by_device_and_percentile_and_heading_level.sql
pages_wpt_bodies_by_device_and_percentile_and_heading_level_sample_10k
M222 - Does the heading logical sequence make any sense [SB-CUST]
Need more details. safe to be same or one lower or go up any amount
pages.payload._almanac->headings_order
pages_almanac_by_device.sql
pages_almanac_by_device_sample_10k
M223 - pages with an svg element, percent, segmented by device [SB-D]
pages.payload._markup->svgs
pages_markup_by_device.sql
pages_markup_by_device_sample_10k
M224 - number of svg elements on a page, segmented by device and percentile [SB-DP]
pages.payload._markup->svgs
pages_markup_by_device_and_percentile.sql
pages_markup_by_device_and_percentile_sample_10k
M225 - pages with an svg image tag, percent, segmented by device [SB-D]
pages.payload._markup->svgs
pages_markup_by_device.sql
pages_markup_by_device_sample_10k
M226 - number of svg image tags on a page, segmented by device and percentile [SB-DP]
pages.payload._markup->svgs
pages_markup_by_device_and_percentile.sql
pages_markup_by_device_and_percentile_sample_10k
M227 - pages with an svg object tag, percent, segmented by device [SB-D]
pages.payload._markup->svgs
pages_markup_by_device.sql
pages_markup_by_device_sample_10k
M228 - number of svg object tags on a page, segmented by device and percentile [SB-DP]
pages.payload._markup->svgs
pages_markup_by_device_and_percentile.sql
pages_markup_by_device_and_percentile_sample_10k
M229 - pages with an svg embed tag, percent, segmented by device [SB-D]
pages.payload._markup->svgs
pages_markup_by_device.sql
pages_markup_by_device_sample_10k
M230 - number of svg embed tags on a page, segmented by device and percentile [SB-DP]
pages.payload._markup->svgs
pages_markup_by_device_and_percentile.sql
pages_markup_by_device_and_percentile_sample_10k
M231 - pages with an svg iframe tag, percent, segmented by device [SB-D]
pages.payload._markup->svgs
pages_markup_by_device.sql
pages_markup_by_device_sample_10k
M232 - number of svg iframe tags on a page, segmented by device and percentile [SB-DP]
pages.payload._markup->svgs
pages_markup_by_device_and_percentile.sql
pages_markup_by_device_and_percentile_sample_10k
M233 - pages with an svg in them, percent, segmented by device [SB-D]
i.e. any identified method (svg_total)
pages.payload._markup->svgs
pages_markup_by_device.sql
pages_markup_by_device_sample_10k
M234 - size of the head section in characters, segmented by device and percentile [SB-DP]
pages.payload._wpt_bodies->raw_html.head_size
pages_wpt_bodies_by_device.sql
pages_wpt_bodies_by_device_sample_10k
M235 - how many pages contain a link using a specific protocol, percent, segmented by device, protocol [SB-CUST]
pages.payload._wpt_bodies->...protocols
pages_wpt_bodies_by_device_and_protocol.sql
pages_wpt_bodies_by_device_and_protocol_sample_10k
M236 - What values are used for src and srcset on img and source elements? - ???
Need more details
pages.payload._almanac->images
M241 - how many pages contain an element by element and device [SB-CUST]
from last year. Not requested this year.
fig 1 - per site
pages.payload._element_count
pages_element_count_by_device_and_element_type_present.sql
pages_element_count_by_device_and_element_type_present_sample_10k
M242 - number of pages containing element types including a dash segmented by element and device [SB-CUST]
from last year. Not requested this year.
pages.payload._element_count
pages_element_count_by_device_and_custom_dash_elements.sql
pages_element_count_by_device_and_custom_dash_elements_sample_10k
M243 - top element types including a dash segmented by element and device [SB-CUST]
from last year. Not requested this year.
pages.payload._element_count
pages_element_count_by_device_and_custom_dash_elements.sql
pages_element_count_by_device_and_custom_dash_elements_sample_10k
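
For M215, the suggestion quoted under that metric (count each element type per page, take the median across pages, limit to the top 10) might look roughly like the sketch below. The function name `get_element_counts`, the table name, and the restriction to mobile are assumptions for illustration. The median here covers only pages that contain the element at least once.

```sql
#standardSQL
-- Hypothetical helper: turns the _element_count object into (name, freq) rows.
CREATE TEMPORARY FUNCTION get_element_counts(element_count_string STRING)
RETURNS ARRAY<STRUCT<name STRING, freq INT64>> LANGUAGE js AS '''
try {
  var element_count = JSON.parse(element_count_string);
  if (element_count === null || typeof element_count != 'object' || Array.isArray(element_count)) {
    return [];
  }
  return Object.keys(element_count).map(function (name) {
    return {name: name, freq: element_count[name]};
  });
} catch (e) {
  return [];
}
''';

SELECT
  element.name AS element_type,
  COUNT(0) AS pages_with_element,
  SUM(element.freq) AS total_freq,
  -- Median per-page frequency among pages that use the element.
  APPROX_QUANTILES(element.freq, 1000)[OFFSET(500)] AS median_freq_per_page
FROM (
  SELECT
    get_element_counts(JSON_EXTRACT_SCALAR(payload, '$._element_count')) AS elements
  FROM
    `httparchive.pages.2020_08_01_*` -- assumed table name
  WHERE
    _TABLE_SUFFIX = 'mobile' -- one device only, so LIMIT 10 means ten element types
),
  UNNEST(elements) AS element
GROUP BY
  element_type
ORDER BY
  total_freq DESC
LIMIT 10
```
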
3. Element/Attributes
M301 - how many button elements on a page, segmented by device and percentile [SB-DP]
pages.payload._markup->buttons
pages_markup_by_device_and_percentile.sql
pages_markup_by_device_and_percentile_sample_10k
M302 - pages with button elements, percent, segmented by device [SB-D]
pages.payload._markup->buttons
pages_markup_by_device.sql
pages_markup_by_device_sample_10k
M303 - pages with button elements with no type, percent, segmented by device [SB-D]
pages.payload._markup->buttons
pages_markup_by_device.sql
pages_markup_by_device_sample_10k
M304 - top button element types, percent, segmented by device and type [SB-CUST]
pages.payload._markup->buttons
pages_markup_by_device_and_button_types.sql
pages_markup_by_device_and_button_types_sample_10k
M305 - how many input elements with type image on a page, segmented by device and percentile [SB-DP]
pages.payload._markup->inputs
pages_markup_by_device_and_percentile.sql
pages_markup_by_device_and_percentile_sample_10k
M306 - how many input elements with type button on a page, segmented by device and percentile [SB-DP]
pages.payload._markup->inputs
pages_markup_by_device_and_percentile.sql
pages_markup_by_device_and_percentile_sample_10k
M307 - how many input elements with type submit on a page, segmented by device and percentile [SB-DP]
pages.payload._markup->inputs
pages_markup_by_device_and_percentile.sql
pages_markup_by_device_and_percentile_sample_10k
M310 - pages with autoplaying video elements, percent, segmented by device [SB-D]
autoplay equates to true
pages.payload._almanac->videos
pages_almanac_by_device.sql
pages_almanac_by_device_sample_10k
M311 - pages with non autoplaying video elements, percent, segmented by device [SB-D]
autoplay equates to false
pages.payload._almanac->videos
pages_almanac_by_device.sql
pages_almanac_by_device_sample_10k
M312 - pages with autoplaying audio elements, percent, segmented by device [SB-D]
autoplay equates to true
pages.payload._markup->audio
pages_markup_by_device.sql
pages_markup_by_device_sample_10k
M313 - pages with non autoplaying audio elements, percent, segmented by device [SB-D]
autoplay equates to false
pages.payload._markup->audio
pages_markup_by_device.sql
pages_markup_by_device_sample_10k
M320 - pages that are WordPress, percent, segmented by device - ??? [SB-D]
or maybe use technologies or leave this up to CMS chapter
pages.payload._almanac->meta_nodes
4. Attributes
M400 - most frequently used attributes, percent, segmented by device and attribute [SB-CUST]
pages.payload._almanac->attributes_used_on_elements
pages_almanac_by_device_and_attribute_name_frequency.sql
pages_almanac_by_device_and_attribute_name_frequency_sample_10k
M401 - How often do what attributes appear per document? [SB-CUST]
Not sure how to report this. Maybe just the top attributes and a percentile for them?
pages.payload._almanac->attributes_used_on_elements
pages_almanac_by_device_and_attribute_name_present.sql
pages_almanac_by_device_and_attribute_name_present_sample_10k
M402 - most frequently used data- attributes, percent, segmented by device and attribute [SB-CUST]
pages.payload._almanac->attributes_used_on_elements
pages_almanac_by_device_and_data_attribute_name_frequency.sql
pages_almanac_by_device_and_data_attribute_name_frequency_sample_10k
M403 - pages identified as an app, percent, segmented by device [SB-D]
pages.payload._markup->app
pages_markup_by_device.sql
pages_markup_by_device_sample_10k
M404 - pages with no html lang attribute, percent, segmented by device [SB-D]
pages.payload._almanac->html_node.lang
pages_almanac_by_device.sql
pages_almanac_by_device_sample_10k
M405 - most frequently used html lang attribute, percent, segmented by device and lang [SB-CUST]
pages.payload._almanac->html_node.lang
pages_almanac_by_device_and_html_lang.sql
pages_almanac_by_device_and_html_lang_sample_10k
M410 - pages with html dir set, percent, segmented by device [SB-D]
pages.payload._markup->dirs
pages_markup_by_device.sql
pages_markup_by_device_sample_10k
M411 - pages with html dir set to ltr, percent, segmented by device [SB-D]
pages.payload._markup->dirs
pages_markup_by_device.sql
pages_markup_by_device_sample_10k
M412 - pages with html dir set to rtl, percent, segmented by device [SB-D]
pages.payload._markup->dirs
pages_markup_by_device.sql
pages_markup_by_device_sample_10k
M413 - pages with html dir set to auto, percent, segmented by device [SB-D]
pages.payload._markup->dirs
pages_markup_by_device.sql
pages_markup_by_device_sample_10k
M414 - pages with dir on other elements, percent, segmented by device [SB-D]
pages.payload._markup->dirs
pages_markup_by_device.sql
pages_markup_by_device_sample_10k
M420 - pages with all target _blanks including rel="noopener noreferrer", percent, segmented by device [SB-D]
pages.payload._wpt_bodies->anchors.target_blank
pages_wpt_bodies_by_device.sql
pages_wpt_bodies_by_device_sample_10k
M421 - pages with some target _blanks not using rel="noopener noreferrer", percent, segmented by device [SB-D]
pages.payload._wpt_bodies->anchors.target_blank
pages_wpt_bodies_by_device.sql
pages_wpt_bodies_by_device_sample_10k
M430 - pages with a link rel="amphtml", percent, segmented by device [SB-D]
pages.payload._markup->amp
pages_markup_by_device.sql
pages_markup_by_device_sample_10k