
Conversation

@max-ostapenko (Contributor) commented Sep 9, 2024

Schema changes

New schema: all_dev.pages_stable

Run reprocessing jobs using the all_pages_stable tag.

After reprocessing completes, replace the production table:

    DROP TABLE `all.pages`;

    CREATE TABLE `all.pages`
    COPY `all_dev.pages_stable`;

@max-ostapenko marked this pull request as ready for review September 10, 2024 22:35
@max-ostapenko (Contributor, Author) commented Sep 10, 2024

Test run

See the PR description for the resulting table.

@tunetheweb (Member) commented Sep 11, 2024

One other thing I was considering (but that would require dropping and recreating the table) was to add another clustering column on page. This is what I did for parsed_css.

I notice we only use 3 (out of a maximum of 4) columns for clustering. Adding a page column would allow me to look up individual pages more quickly (e.g. I want to look at custom metrics for https://www.example.com/ as I know they use something). At the moment the best you can do is find the rank for it and search by page and rank, but that means looking up the rank first.

But I'm not sure that's worth recreating the whole table just for that, given the volume of data (and the fact that we couldn't add it to the all.requests table anyway, as it already uses all of its clustering columns). It's annoying that you can't do this with an ALTER TABLE command :-(
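
For reference, since clustering can't be changed in place, recreating the table is the only route; a minimal sketch of what that could look like, where the partition and cluster spec are assumptions based on the current 3-column clustering, not the actual production DDL:

    -- Rewrite the table with `page` as a 4th clustering column
    -- (hitting BigQuery's maximum of 4 clustering columns).
    CREATE TABLE `all_dev.pages_clustered`
    PARTITION BY date
    CLUSTER BY client, is_root_page, rank, page
    AS
    SELECT *
    FROM `httparchive.all.pages`;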

@max-ostapenko (Contributor, Author):

Instead of ALTERing, we'll now reprocess the table.
See the updated PR description.

In the end, custom_metrics.other accounts for ~30% of all the custom metrics.
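
(A rough sketch of how that share can be estimated by serialized size; the table name, and the assumption that custom_metrics is a STRUCT with an other field in the new schema, are taken from this PR rather than a final schema:)

    SELECT
      SUM(LENGTH(TO_JSON_STRING(custom_metrics.other))) /
        SUM(LENGTH(TO_JSON_STRING(custom_metrics))) AS other_fraction
    FROM `all_dev.pages_stable`
    WHERE date = '2024-09-01';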

@tunetheweb (Member):

@rviscomi do you potentially want to add CrUX data as a column?

While I think that could be useful for some, the fact that it's only page-level data and is only available for 25% of home pages and 7% of secondary pages (by @pmeenan's previous estimates) makes me think it's more of a niche thing that people can get from the payload column. WDYT?

    "_CrUX": {
        "key": {
            "formFactor": "DESKTOP",
            "url": "https://myanimelist.net/anime/55791/Oshi_no_Ko_2nd_Season"
        },
        "metrics": {
            "navigation_types": {
                "fractions": {
                    "navigate": 0.7687,
                    "navigate_cache": 0,
                    "reload": 0.0122,
                    "restore": 0.0074,
                    "back_forward": 0.0694,
                    "back_forward_cache": 0.1422,
                    "prerender": 0
                }
            },
            "round_trip_time": {
                "percentiles": {
                    "p75": 75
                }
            },
            "cumulative_layout_shift": {
...
        "collectionPeriod": {
            "firstDate": {
                "year": 2024,
                "month": 7,
                "day": 17
            },
            "lastDate": {
                "year": 2024,
                "month": 8,
                "day": 13
            }
        }
    },
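
For example, pulling that blob out of the payload column today looks roughly like this (a sketch, assuming the custom metric lands under the $._CrUX key of the payload JSON string, as in the snippet above):

    SELECT
      page,
      -- payload is a JSON-encoded STRING, so JSON_EXTRACT returns a STRING
      JSON_EXTRACT(payload, '$._CrUX') AS crux
    FROM `httparchive.all.pages`
    WHERE
      date = '2024-09-01' AND
      client = 'desktop' AND
      is_root_page AND
      JSON_EXTRACT(payload, '$._CrUX') IS NOT NULL;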

DNS is maybe another one that could be its own column, though again, it's maybe niche enough that it's fine to have people pick it up from the payload column?:

    "_origin_dns": {
        "cname": [
            "www.ynet.co.il-v1.edgekey.net."
        ],
        "ns": [
            "usc2.akam.net.",
            "use1.akam.net.",
            "ns1-92.akam.net.",
            "ns1-61.akam.net.",
            "ns1-168.akam.net.",
            "asia2.akam.net.",
            "usw1.akam.net.",
            "eur2.akam.net."
        ],
        "mx": [
            "10 ynet-co-il.mail.protection.outlook.com."
        ],
        "txt": [
            "\"proxy-ssl.webflow.com\"",
            "\"google-site-verification=aVs1GVkIfLRmJiL3DUr64sdtVovFkK_AhftCG-Blq10\"",
            "\"facebook-domain-verification=28qy8xvpk5e6dfsh6si9wm9ecqhn5u\"",
            "\"v=spf1 ip4:192.115.83.94 ip4:192.115.83.121 ip4:62.90.250.129 ip4:192.115.80.21 ip4:192.115.80.141 ip4:192.115.80.142 ip4:192.115.80.143 include:mymarketing.co.il include:_spf.activetrail.com include:spf.protection.outlook.com ~all\"",
            "\"google-site-verification=dppKXl_LBWZo3uEOPqxwUoVTjbqTN-MkOlm01sDWH1I\"",
            "\"MS=ms11993946\""
        ],
        "soa": [
            "prddns01.yitweb.co.il. internet-grp.yit.co.il. 2024060301 600 3600 1209600 600"
        ],
        "https": [],
        "svcb": []
    },

And finally, we need to look at native JSON columns, as there were some values that couldn't be parsed natively and so had to use JavaScript JSON parsing instead. See: HTTPArchive/httparchive.org#923 (comment). Is this going to be a problem if we move to native JSON columns?

@tunetheweb (Member) commented Oct 7, 2024

@max-ostapenko here are some examples of custom_metrics which don't parse as JSON according to BigQuery (but do in JavaScript):

CREATE TEMP FUNCTION js_parsed_custom_metrics(cm STRING)
RETURNS STRING
LANGUAGE js AS """
  try {
    cm = JSON.parse(cm);
    return JSON.stringify(cm);
  } catch (e) {
    // Return null (SQL NULL) so failed parses drop out of the
    // IS NOT NULL filter below.
    return null;
  }
""";

SELECT
  page,
  SUBSTR(custom_metrics, 0, 50) AS custom_metrics,
  SAFE.PARSE_JSON(custom_metrics) AS custom_metrics_bigquery,
  SUBSTR(js_parsed_custom_metrics(custom_metrics), 0, 50) AS custom_metrics_js
FROM
  `httparchive.all.pages`
WHERE
  date = '2024-09-01' AND
  rank = 1000 AND
  is_root_page AND
  client = 'desktop' AND
  custom_metrics IS NOT NULL AND
  SAFE.PARSE_JSON(custom_metrics) IS NULL AND
  js_parsed_custom_metrics(custom_metrics) IS NOT NULL
| page | custom_metrics | custom_metrics_bigquery | custom_metrics_js |
| --- | --- | --- | --- |
| https://suumo.jp/ | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` | null | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` |
| https://www.sanook.com/ | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` | null | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` |
| https://oglobo.globo.com/ | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` | null | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` |
| https://tenki.jp/ | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` | null | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` |
| https://www.merriam-webster.com/ | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` | null | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` |
| https://www.nytimes.com/ | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` | null | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` |
| https://www.cronista.com/ | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` | null | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` |
| https://www.ikea.com/ | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` | null | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` |
| https://www.coolmathgames.com/ | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` | null | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` |

Can you check how those are handled with the conversion to JSON type columns?

@tunetheweb (Member):

Oh, just saw wide_number_mode => 'round' in the PR. Tried that, and it works now! So ignore the above comment.
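
(For anyone following along: PARSE_JSON defaults to wide_number_mode => 'exact', which errors on JSON numbers that can't be stored without loss of precision; 'round' rounds them instead. A minimal illustration:)

    SELECT
      -- 'round' rounds lossy numbers instead of returning NULL / erroring
      SAFE.PARSE_JSON(custom_metrics, wide_number_mode => 'round') AS cm
    FROM `httparchive.all.pages`
    WHERE date = '2024-09-01'
    LIMIT 10;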

@rviscomi (Member) commented Oct 7, 2024

> @rviscomi do you potentially want to add CrUX data as a column?

Yeah, filed here: #16

> While I think that could be useful for some, the fact that it's only page-level data and is only available for 25% of home pages and 7% of secondary pages (by @pmeenan's previous estimates) makes me think it's more of a niche thing that people can get from the payload column. WDYT?

It's probably more popular than some of the new top-level custom metrics 😛 Not to say that we should consider removing those, just that the standard for pulling them up in the schema should probably be more around how distinct, useful, and stable the data is. I think the CrUX data checks all those boxes.

@tunetheweb (Member):

> the standard for pulling them up in the schema should probably be more around how distinct, useful, and stable the data is

I question how "useful" it is given the low level of coverage. And I worry that by exposing it as a "top-level" bit of data, people will get unrealistic expectations of it (i.e. that it will nearly always be populated), when in many cases we should maybe point them at the origin-level data in the CrUX dataset (which, after all, is what the Web Almanac has always used despite this being available, and also what the HTTP Archive reports section uses).

But won't block if you think it really is useful and so want to add it.

@rviscomi (Member) commented Oct 7, 2024

Yeah, usefulness is subjective and you have good points.

I definitely think it's still worth considering though. 5M home pages and 1M secondary pages have data, which is especially valuable considering that no other public dataset has page-level data like this. Here's one recent example of its usefulness: HTTPArchive/cwv-tech-report#32 (comment). Moving it to its own field would also make the data more accessible to anyone who wants to query it; it'd process only 4 GB instead of 14 TB.
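
To illustrate, a query against a dedicated column could look something like this (the crux column name and JSON shape are placeholders mirroring the payload snippet above, not a final schema):

    SELECT
      page,
      -- dot-style JSON field access on a native JSON column
      crux.metrics.largest_contentful_paint.percentiles.p75 AS lcp_p75
    FROM `httparchive.all.pages`
    WHERE date = '2024-09-01' AND client = 'mobile' AND is_root_page;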

@tunetheweb (Member):

True, but this will also significantly reduce that 14 TB:

> payload trimmed of technologies, custom metrics and blink features data (closes HTTPArchive/data-pipeline#269)

WDYT about adding it to summary stats instead? That way it's cheaper to query, but not a full "top-level" column?

@rviscomi (Member) commented Oct 7, 2024

That also works for me

@tunetheweb (Member):

Hmmm, the only thing is that it makes summary bigger, which we're actively trying to make smaller!
Let's go with a separate CrUX column then. And let's leave DNS out for now.

@max-ostapenko can you make the changes above and rerun a sample into the all_dev schema?

@max-ostapenko (Contributor, Author):

Refreshed all_dev.pages_stable and all_dev.requests_stable.

@max-ostapenko merged commit af749fc into main Oct 8, 2024
3 checks passed
@max-ostapenko deleted the stable_pages branch October 8, 2024 16:20