Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chapter 5: Add first third party scripting query #119

Merged
merged 2 commits into from
Aug 20, 2019
Merged

Conversation

patrickhulce
Copy link
Contributor

The first third party query for the scripting time analysis.

ref #86

SELECT
category AS thirdPartyCategory,
SAFE_CAST(JSON_EXTRACT(report,
'$.audits.bootup-time.details.items[0].scripting') AS FLOAT64) AS executionTime
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm almost positive you know a better way to do this @rviscomi than the copypasta I have here. My stumbling block was always needing a .map function in order to flatten this result and match it with the join index

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhhh yeah BigQuery's JSON function don't play well with arrays. For best results I recommend a JS UDF to do the parsing and data processing.

Here's an example of a query that creates a UDF, which parses the JSON payload, preprocesses it, and hands it back to the SQL for aggregation: https://github.com/HTTPArchive/almanac.httparchive.org/blob/master/sql/2019/03_Markup/03_03c.sql

Copy link
Member

@rviscomi rviscomi Aug 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#standardSQL
# Percentage of script execution time that are from third party requests broken down by third party category.
CREATE TEMPORARY FUNCTION getExecutionTimes(report STRING)
RETURNS ARRAY<STRUCT<url STRING, execution_time FLOAT64>> LANGUAGE js AS '''
try {
  var $ = JSON.parse(report);
  return $.audits['bootup-time'].details.items.map(item => ({
    url: item.url,
    execution_time: item.scripting
  }));
} catch (e) {
  return [];
}
''';

SELECT
  category AS third_party_category,
  SUM(item.execution_time) AS total_execution_time,
  ROUND(SUM(item.execution_time) * 100 / SUM(SUM(item.execution_time)) OVER (), 4) AS pct_execution_time
FROM
  `httparchive.lighthouse.2019_07_01_mobile`,
  UNNEST(getExecutionTimes(report)) AS item
LEFT JOIN
  `lighthouse-infrastructure.third_party_web.2019_07_01`
ON
  NET.HOST(item.url) = domain
GROUP BY
  third_party_category
ORDER BY
  pct_execution_time DESC

(65.3s elapsed, 1.22 TB processed)

third_party_category total_execution_time pct_execution_time
  3.42E+09 48.3414
ad 1.01E+09 14.2466
social 4.75E+08 6.7085
hosting 4.70E+08 6.6399
video 4.59E+08 6.4946
analytics 3.46E+08 4.8858
cdn 3.16E+08 4.4615
utility 2.76E+08 3.9068
tag-manager 1.02E+08 1.4358
customer-success 1.01E+08 1.4253
marketing 5.49E+07 0.7756
content 3.48E+07 0.4925
other 1.32E+07 0.1859

Could probably remove the total field, doesn't seem to be very insightful.

Hope that helps.

Copy link
Member

@rviscomi rviscomi Aug 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference here are the results from your original query:

(448.7s elapsed, 1.22 TB processed)

thirdPartyCategory totalExecutionTime percentExecutionTime
  3.42E+09 48.3633
ad 1.01E+09 14.2296
social 4.74E+08 6.6972
hosting 4.70E+08 6.6429
video 4.59E+08 6.4984
analytics 3.45E+08 4.8847
cdn 3.16E+08 4.4631
utility 2.76E+08 3.9067
tag-manager 1.01E+08 1.4348
customer-success 1.01E+08 1.4261
marketing 5.48E+07 0.7757
content 3.48E+07 0.4919
other 1.31E+07 0.1858

Identical results for the most part. The slight differences are probably attributed to pages with 35+ items.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh thanks Rick! didn't expect a full query from you thank you very much for the help :)

SELECT
category AS thirdPartyCategory,
SAFE_CAST(JSON_EXTRACT(report,
'$.audits.bootup-time.details.items[0].scripting') AS FLOAT64) AS executionTime
Copy link
Member

@rviscomi rviscomi Aug 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#standardSQL
# Percentage of script execution time that are from third party requests broken down by third party category.
CREATE TEMPORARY FUNCTION getExecutionTimes(report STRING)
RETURNS ARRAY<STRUCT<url STRING, execution_time FLOAT64>> LANGUAGE js AS '''
try {
  var $ = JSON.parse(report);
  return $.audits['bootup-time'].details.items.map(item => ({
    url: item.url,
    execution_time: item.scripting
  }));
} catch (e) {
  return [];
}
''';

SELECT
  category AS third_party_category,
  SUM(item.execution_time) AS total_execution_time,
  ROUND(SUM(item.execution_time) * 100 / SUM(SUM(item.execution_time)) OVER (), 4) AS pct_execution_time
FROM
  `httparchive.lighthouse.2019_07_01_mobile`,
  UNNEST(getExecutionTimes(report)) AS item
LEFT JOIN
  `lighthouse-infrastructure.third_party_web.2019_07_01`
ON
  NET.HOST(item.url) = domain
GROUP BY
  third_party_category
ORDER BY
  pct_execution_time DESC

(65.3s elapsed, 1.22 TB processed)

third_party_category total_execution_time pct_execution_time
  3.42E+09 48.3414
ad 1.01E+09 14.2466
social 4.75E+08 6.7085
hosting 4.70E+08 6.6399
video 4.59E+08 6.4946
analytics 3.46E+08 4.8858
cdn 3.16E+08 4.4615
utility 2.76E+08 3.9068
tag-manager 1.02E+08 1.4358
customer-success 1.01E+08 1.4253
marketing 5.49E+07 0.7756
content 3.48E+07 0.4925
other 1.32E+07 0.1859

Could probably remove the total field, doesn't seem to be very insightful.

Hope that helps.

@rviscomi rviscomi changed the base branch from 3p_top_domains to master August 20, 2019 15:36
@rviscomi rviscomi changed the base branch from master to 3p_top_domains August 20, 2019 15:36
@patrickhulce patrickhulce changed the base branch from 3p_top_domains to master August 20, 2019 15:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants