Stable all.pages #8
Conversation
One other thing I was considering (but that would require dropping and recreating the table) was to add another cluster on … I notice we only use 3 (out of a maximum of 4) columns for clustering. Adding a … But I'm not sure that's worth recreating the whole table just for that, given the volume of data (and the fact that we couldn't add it to the …).
Instead of ALTERing, we'll now reprocess the table. Eventually …
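As context for the clustering point above, here is a minimal sketch (not the pipeline's actual DDL, and not from this thread) of what recreating the table with a fourth cluster column could look like. The existing three cluster columns are assumed to be `client`, `is_root_page` and `rank`; `page` as the fourth comes from the schema changes listed at the end of this thread.

```sql
-- Minimal sketch: BigQuery allows at most 4 clustering columns, and re-clustering
-- existing data effectively means rewriting the table, hence the full reprocess.
-- The existing cluster columns are assumed here, not confirmed.
CREATE TABLE `httparchive.all.pages_stable`
PARTITION BY date
CLUSTER BY client, is_root_page, rank, page  -- `page` assumed as the new fourth column
AS
SELECT * FROM `httparchive.all.pages`;
```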
@rviscomi you potentially want to add CrUX data as a column? While I think that could be useful for some, the fact that it's only page-level data and is only available for 25% of home pages and 7% of secondary pages (by @pmeenan's previous estimates) makes me think it's more of a niche thing that people can get from the …

```json
"_CrUX": {
  "key": {
    "formFactor": "DESKTOP",
    "url": "https://myanimelist.net/anime/55791/Oshi_no_Ko_2nd_Season"
  },
  "metrics": {
    "navigation_types": {
      "fractions": {
        "navigate": 0.7687,
        "navigate_cache": 0,
        "reload": 0.0122,
        "restore": 0.0074,
        "back_forward": 0.0694,
        "back_forward_cache": 0.1422,
        "prerender": 0
      }
    },
    "round_trip_time": {
      "percentiles": {
        "p75": 75
      }
    },
    "cumulative_layout_shift": {
      ...
  "collectionPeriod": {
    "firstDate": {
      "year": 2024,
      "month": 7,
      "day": 17
    },
    "lastDate": {
      "year": 2024,
      "month": 8,
      "day": 13
    }
  }
},
```

DNS is maybe another one that could also be its own column, though again, maybe niche enough that it's fine to have people pick that up from the …

```json
"_origin_dns": {
"cname": [
"www.ynet.co.il-v1.edgekey.net."
],
"ns": [
"usc2.akam.net.",
"use1.akam.net.",
"ns1-92.akam.net.",
"ns1-61.akam.net.",
"ns1-168.akam.net.",
"asia2.akam.net.",
"usw1.akam.net.",
"eur2.akam.net."
],
"mx": [
"10 ynet-co-il.mail.protection.outlook.com."
],
"txt": [
"\"proxy-ssl.webflow.com\"",
"\"google-site-verification=aVs1GVkIfLRmJiL3DUr64sdtVovFkK_AhftCG-Blq10\"",
"\"facebook-domain-verification=28qy8xvpk5e6dfsh6si9wm9ecqhn5u\"",
"\"v=spf1 ip4:192.115.83.94 ip4:192.115.83.121 ip4:62.90.250.129 ip4:192.115.80.21 ip4:192.115.80.141 ip4:192.115.80.142 ip4:192.115.80.143 include:mymarketing.co.il include:_spf.activetrail.com include:spf.protection.outlook.com ~all\"",
"\"google-site-verification=dppKXl_LBWZo3uEOPqxwUoVTjbqTN-MkOlm01sDWH1I\"",
"\"MS=ms11993946\""
],
"soa": [
"prddns01.yitweb.co.il. internet-grp.yit.co.il. 2024060301 600 3600 1209600 600"
],
"https": [],
"svcb": []
},
```

And finally, we need to look at native JSON columns, as there were some values that couldn't be processed natively and so we had to fall back to JavaScript JSON parsing. See: HTTPArchive/httparchive.org#923 (comment). Is this going to be a problem if we move to native JSON columns?
@max-ostapenko here's some examples of custom metrics that fail `SAFE.PARSE_JSON` but can still be parsed with JavaScript:

```sql
CREATE TEMP FUNCTION js_parsed_custom_metrics(cm STRING)
RETURNS STRING
LANGUAGE js AS """
try {
  cm = JSON.parse(cm);
  return JSON.stringify(cm);
} catch {
  return '';
}
""";

SELECT
  page,
  SUBSTR(custom_metrics, 0, 50) AS custom_metrics,
  SAFE.PARSE_JSON(custom_metrics) AS custom_metrics_bigquery,
  SUBSTR(js_parsed_custom_metrics(custom_metrics), 0, 50) AS custom_metrics_js
FROM
  `httparchive.all.pages`
WHERE
  date = '2024-09-01' AND
  rank = 1000 AND
  is_root_page AND
  client = 'desktop' AND
  custom_metrics IS NOT NULL AND
  SAFE.PARSE_JSON(custom_metrics) IS NULL AND
  js_parsed_custom_metrics(custom_metrics) IS NOT NULL
```
Can you check how those are handled with the conversion to JSON type columns?
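One hedged angle for that check, offered as a suggestion rather than something stated in the thread: `PARSE_JSON` defaults to exact number handling and fails on numeric values it can't represent losslessly, which is a common reason a string parses with JavaScript's `JSON.parse` but not with `SAFE.PARSE_JSON`. A query along these lines could show whether rounding recovers those rows:

```sql
-- Hedged check: does rounding wide numbers recover the rows that fail exact parsing?
-- wide_number_mode => 'round' tells PARSE_JSON to round such numbers instead of erroring.
SELECT
  page,
  SAFE.PARSE_JSON(custom_metrics) AS parsed_exact,
  SAFE.PARSE_JSON(custom_metrics, wide_number_mode => 'round') AS parsed_rounded
FROM
  `httparchive.all.pages`
WHERE
  date = '2024-09-01' AND
  client = 'desktop' AND
  is_root_page AND
  custom_metrics IS NOT NULL AND
  SAFE.PARSE_JSON(custom_metrics) IS NULL
LIMIT 10
```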
Oh, just saw …
Yeah, filed here: #16
It's probably more popular than some of the new top-level custom metrics 😛 That's not to say we should consider removing those, just that the standard for pulling data up in the schema should probably be more about how distinct, useful, and stable it is. I think the CrUX data checks all those boxes.
I question how "useful" it is given the low level of coverage. And I worry that by exposing it as a "top-level" bit of data, people will get unrealistic expectations of it (i.e. that it will nearly always be populated), when in many cases we should maybe point them at the origin-level data in the CrUX dataset (which, after all, is what the Web Almanac has always used despite this being available, and also what the HTTP Archive reports section uses). But I won't block this if you think it really is useful and so want to add it.
Yeah, usefulness is subjective and you have good points. I definitely think it's still worth considering though. 5M home pages and 1M secondary pages have data, which is especially useful considering that no other public dataset has page-level data like this. Here's one recent example of its usefulness: HTTPArchive/cwv-tech-report#32 (comment). Moving it to its own field would also make the data more accessible to anyone who wanted to query it; it'd process only 4 GB instead of 14 TB.
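To make the accessibility point concrete, here is a rough illustration (not from the thread) of what pulling a page-level CrUX value looks like today. The JSON path is taken from the example pasted earlier, and the assumption that the `_CrUX` object lives inside the `payload` string is mine; the point is that this query has to scan the whole multi-TB `payload` column, whereas a dedicated column would scan only itself.

```sql
-- Illustrative only: extract a page-level CrUX value from the (assumed) payload location.
SELECT
  page,
  JSON_VALUE(payload, '$._CrUX.metrics.round_trip_time.percentiles.p75') AS rtt_p75
FROM
  `httparchive.all.pages`
WHERE
  date = '2024-09-01' AND
  client = 'mobile' AND
  is_root_page
```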
True, but this will also significantly reduce that 14 TB: …
WDYT about adding it to `summary`?
That also works for me.
Hmmm, the only thing is that makes …
@max-ostapenko can you make the changes above and rerun a sample into the …?
Refreshed |
Schema changes

- `custom_metrics` column is a RECORD type with multiple metrics split separately, all of JSON type (closes "Split the custom metrics JSON into structured fields" data-pipeline#262)
- `summary.crux` (closes "Extract page-level CrUX object to top-level schema" #16)
- `payload` trimmed of technologies, custom metrics and blink features data (closes "In future trim down custom metrics from payload?" data-pipeline#269)
- `summary` column trimmed of multiple metrics (closes "The new schema and cost concerns for users" data-pipeline#149)
- `page` as a new cluster column
- `summary`, `payload`, `lighthouse`, `metadata` columns transformed from STRING into JSON type
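As a rough sketch of how the reprocessed schema might be queried (my own illustration, not a confirmed schema: the `custom_metrics.performance` sub-field name and the JSON paths are assumptions based on the list above):

```sql
-- Sketch only: assumes custom_metrics is a RECORD of JSON sub-fields and that
-- lighthouse is now a native JSON column; field names are illustrative.
SELECT
  page,
  JSON_VALUE(lighthouse, '$.categories.performance.score') AS lh_perf_score,
  custom_metrics.performance AS performance_custom_metric
FROM
  `httparchive.all.pages_stable`
WHERE
  date = '2024-09-01' AND
  client = 'desktop' AND
  is_root_page
```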
New schema

- `all.pages_stable`

Run reprocessing jobs using `all_pages_stable` tag.

After reprocessing complete

Replace the production table: