
Conversation

@max-ostapenko (Contributor) commented Sep 9, 2024

Schema changes

New schema: all_dev.pages_stable

Run reprocessing jobs using the all_pages_stable tag.

After reprocessing completes, replace the production table:

    DROP TABLE `all.pages`;

    CREATE TABLE `all.pages`
    COPY `all_dev.pages_stable`;

@max-ostapenko marked this pull request as ready for review September 10, 2024 22:35
@max-ostapenko (Contributor, Author) commented Sep 10, 2024

Test run

See the PR description for the resulting table.

@tunetheweb (Member) commented Sep 11, 2024

One other thing I was considering (but that would require dropping and recreating the table) was to add another clustering column on page. This is what I did for parsed_css.

I notice we only use 3 (out of a maximum of 4) columns for clustering. Adding a page column would allow me to look up individual pages more quickly (e.g. I want to look at custom metrics for https://www.example.com/ as I know they use something). At the moment the best you can do is find the rank for it and search by page and rank, but that means looking up the rank first.

But I'm not sure that's worth recreating the whole table just for that, given the volume of data (and the fact that we couldn't add it to the all.requests table anyway, as it already uses all of its clustering columns). It's annoying that you can't do this with an ALTER TABLE command :-(
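
For reference, since clustering can't be changed in place, recreating the table is the only route; a minimal sketch of what that could look like, where the partition and cluster spec are assumptions based on the current 3-column clustering, not the actual production DDL:

    -- Rewrite the table with `page` as a 4th clustering column
    -- (hitting BigQuery's maximum of 4 clustering columns).
    CREATE TABLE `all_dev.pages_clustered`
    PARTITION BY date
    CLUSTER BY client, is_root_page, rank, page
    AS
    SELECT *
    FROM `httparchive.all.pages`;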

@max-ostapenko (Contributor, Author):

Instead of ALTERing, we'll now reprocess the table.
See the updated PR description.

In the end, custom_metrics.other accounts for ~30% of all the custom metrics.
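
(A rough sketch of how that share can be estimated by serialized size; the table name, and the assumption that custom_metrics is a STRUCT with an other field in the new schema, are taken from this PR rather than a final schema:)

    SELECT
      SUM(LENGTH(TO_JSON_STRING(custom_metrics.other))) /
        SUM(LENGTH(TO_JSON_STRING(custom_metrics))) AS other_fraction
    FROM `all_dev.pages_stable`
    WHERE date = '2024-09-01';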

@tunetheweb (Member):

@rviscomi do you potentially want to add CrUX data as a column?

While I think that could be useful for some, the fact that it's only page-level data and is only available for 25% of home pages and 7% of secondary pages (by @pmeenan's previous estimates) makes me think it's more of a niche thing that people can get from the payload column. WDYT?

    "_CrUX": {
        "key": {
            "formFactor": "DESKTOP",
            "url": "https://myanimelist.net/anime/55791/Oshi_no_Ko_2nd_Season"
        },
        "metrics": {
            "navigation_types": {
                "fractions": {
                    "navigate": 0.7687,
                    "navigate_cache": 0,
                    "reload": 0.0122,
                    "restore": 0.0074,
                    "back_forward": 0.0694,
                    "back_forward_cache": 0.1422,
                    "prerender": 0
                }
            },
            "round_trip_time": {
                "percentiles": {
                    "p75": 75
                }
            },
            "cumulative_layout_shift": {
...
        "collectionPeriod": {
            "firstDate": {
                "year": 2024,
                "month": 7,
                "day": 17
            },
            "lastDate": {
                "year": 2024,
                "month": 8,
                "day": 13
            }
        }
    },
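
For example, pulling that blob out of the payload column today looks roughly like this (a sketch, assuming the custom metric lands under the $._CrUX key of the payload JSON string, as in the snippet above):

    SELECT
      page,
      -- payload is a JSON-encoded STRING, so JSON_EXTRACT returns a STRING
      JSON_EXTRACT(payload, '$._CrUX') AS crux
    FROM `httparchive.all.pages`
    WHERE
      date = '2024-09-01' AND
      client = 'desktop' AND
      is_root_page AND
      JSON_EXTRACT(payload, '$._CrUX') IS NOT NULL;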

DNS is maybe another one that could be its own column, though again, it's maybe niche enough that it's fine to have people pick it up from the payload column?:

    "_origin_dns": {
        "cname": [
            "www.ynet.co.il-v1.edgekey.net."
        ],
        "ns": [
            "usc2.akam.net.",
            "use1.akam.net.",
            "ns1-92.akam.net.",
            "ns1-61.akam.net.",
            "ns1-168.akam.net.",
            "asia2.akam.net.",
            "usw1.akam.net.",
            "eur2.akam.net."
        ],
        "mx": [
            "10 ynet-co-il.mail.protection.outlook.com."
        ],
        "txt": [
            "\"proxy-ssl.webflow.com\"",
            "\"google-site-verification=aVs1GVkIfLRmJiL3DUr64sdtVovFkK_AhftCG-Blq10\"",
            "\"facebook-domain-verification=28qy8xvpk5e6dfsh6si9wm9ecqhn5u\"",
            "\"v=spf1 ip4:192.115.83.94 ip4:192.115.83.121 ip4:62.90.250.129 ip4:192.115.80.21 ip4:192.115.80.141 ip4:192.115.80.142 ip4:192.115.80.143 include:mymarketing.co.il include:_spf.activetrail.com include:spf.protection.outlook.com ~all\"",
            "\"google-site-verification=dppKXl_LBWZo3uEOPqxwUoVTjbqTN-MkOlm01sDWH1I\"",
            "\"MS=ms11993946\""
        ],
        "soa": [
            "prddns01.yitweb.co.il. internet-grp.yit.co.il. 2024060301 600 3600 1209600 600"
        ],
        "https": [],
        "svcb": []
    },

And finally, we need to look at native JSON columns, as there were some values that couldn't be parsed natively and so had to use JavaScript JSON parsing instead. See: HTTPArchive/httparchive.org#923 (comment). Is this going to be a problem if we move to native JSON columns?

@tunetheweb (Member) commented Oct 7, 2024

@max-ostapenko here are some examples of custom_metrics which don't parse as JSON according to BigQuery (but do in JavaScript):

CREATE TEMP FUNCTION js_parsed_custom_metrics(cm STRING)
RETURNS STRING
LANGUAGE js AS """
  try {
    cm = JSON.parse(cm);
    return JSON.stringify(cm);
  } catch (e) {
    // Return null (SQL NULL) so failed parses drop out of the
    // IS NOT NULL filter below.
    return null;
  }
""";

SELECT
  page,
  SUBSTR(custom_metrics, 0, 50) AS custom_metrics,
  SAFE.PARSE_JSON(custom_metrics) AS custom_metrics_bigquery,
  SUBSTR(js_parsed_custom_metrics(custom_metrics), 0, 50) AS custom_metrics_js
FROM
  `httparchive.all.pages`
WHERE
  date = '2024-09-01' AND
  rank = 1000 AND
  is_root_page AND
  client = 'desktop' AND
  custom_metrics IS NOT NULL AND
  SAFE.PARSE_JSON(custom_metrics) IS NULL AND
  js_parsed_custom_metrics(custom_metrics) IS NOT NULL
| page | custom_metrics | custom_metrics_bigquery | custom_metrics_js |
| --- | --- | --- | --- |
| https://suumo.jp/ | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` | null | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` |
| https://www.sanook.com/ | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` | null | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` |
| https://oglobo.globo.com/ | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` | null | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` |
| https://tenki.jp/ | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` | null | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` |
| https://www.merriam-webster.com/ | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` | null | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` |
| https://www.nytimes.com/ | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` | null | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` |
| https://www.cronista.com/ | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` | null | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` |
| https://www.ikea.com/ | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` | null | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` |
| https://www.coolmathgames.com/ | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` | null | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx":1,"` |

Can you check how those are handled with the conversion to JSON type columns?

@tunetheweb (Member):

Oh, just saw wide_number_mode => 'round' in the PR. Tried that, and it works now! So ignore the above comment.
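
(For anyone following along: PARSE_JSON defaults to wide_number_mode => 'exact', which errors on JSON numbers that can't be stored without loss of precision; 'round' rounds them instead. A minimal illustration:)

    SELECT
      -- 'round' rounds lossy numbers instead of returning NULL / erroring
      SAFE.PARSE_JSON(custom_metrics, wide_number_mode => 'round') AS cm
    FROM `httparchive.all.pages`
    WHERE date = '2024-09-01'
    LIMIT 10;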

@rviscomi (Member) commented Oct 7, 2024

> @rviscomi do you potentially want to add CrUX data as a column?

Yeah, filed here: #16

> While I think that could be useful for some, the fact that it's only page-level data and is only available for 25% of home pages and 7% of secondary pages (by @pmeenan's previous estimates) makes me think it's more of a niche thing that people can get from the payload column. WDYT?

It's probably more popular than some of the new top-level custom metrics 😛 Not to say that we should consider removing those, just that the standard for pulling them up in the schema should probably be more around how distinct, useful, and stable the data is. I think the CrUX data checks all those boxes.

@tunetheweb (Member):

> the standard for pulling them up in the schema should probably be more around how distinct, useful, and stable the data is

I question how "useful" it is given the low level of coverage. And I worry that by exposing it as a "top-level" bit of data, people will get unrealistic expectations of it (i.e. that it will nearly always be populated), when in many cases we should maybe point them at the origin-level data in the CrUX dataset (which, after all, is what the Web Almanac has always used despite this being available, and also what the HTTP Archive reports section uses).

But won't block if you think it really is useful and so want to add it.

@rviscomi (Member) commented Oct 7, 2024

Yeah, usefulness is subjective and you have good points.

I definitely think it's still worth considering though. 5M home pages and 1M secondary pages have data, which is especially valuable considering that no other public dataset has page-level data like this. Here's one recent example of its usefulness: HTTPArchive/cwv-tech-report#32 (comment). Moving it to its own field would also make the data more accessible to anyone who wants to query it; it'd process only 4 GB instead of 14 TB.
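
To illustrate, a query against a dedicated column could look something like this (the crux column name and JSON shape are placeholders mirroring the payload snippet above, not a final schema):

    SELECT
      page,
      -- dot-style JSON field access on a native JSON column
      crux.metrics.largest_contentful_paint.percentiles.p75 AS lcp_p75
    FROM `httparchive.all.pages`
    WHERE date = '2024-09-01' AND client = 'mobile' AND is_root_page;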

@tunetheweb (Member):

True, but this will also significantly reduce that 14 TB:

> payload trimmed of technologies, custom metrics and blink features data (closes HTTPArchive/data-pipeline#269)

WDYT about adding it to summary stats instead? That way it's cheaper to query, but not a full "top-level" column?

@rviscomi (Member) commented Oct 7, 2024

That also works for me

@tunetheweb (Member):

Hmmm, the only thing is that it makes summary bigger, which we're actively trying to make smaller!
Let's go with a separate CrUX column then. And let's leave DNS out for now.

@max-ostapenko can you make the changes above and rerun a sample into the all_dev schema?

@max-ostapenko (Contributor, Author):

Refreshed all_dev.pages_stable and all_dev.requests_stable.

@max-ostapenko merged commit af749fc into main Oct 8, 2024
3 checks passed
@max-ostapenko deleted the stable_pages branch October 8, 2024 16:20