This repository has been archived by the owner on Jan 4, 2023. It is now read-only.

Legacy Website Reports are Missing Historical Data #151

Open
paulcalvano opened this issue Aug 14, 2018 · 3 comments

Comments

@paulcalvano
Contributor

When the HTTP Archive dataset was expanded in July 2018, new page ids were assigned for the newer URLs. This broke the historical reports: URLs that were tracked before the expansion have lost continuity with their earlier data.

You can see an example of this here: https://legacy.httparchive.org/viewsite.php?pageid=94191763. The legacy report continues to include the latest stats, but now only shows trends starting with July 2018:

image

@rviscomi and I believe that this can be corrected by mapping the old pages table records with the new pageids. I've assigned this to myself and will look into it.

@paulcalvano paulcalvano self-assigned this Aug 14, 2018
@Themanwithoutaplan

FWIW I worked around this on my dataset by stripping the protocol from the URL (most of the new sites are https). But I use a more normalised schema with a separate URLs table and am thus not dependent upon the page_id.
Happy to share my changes if they'd be any use.
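
The workaround described above could look roughly like the following. This is a minimal sketch, not the commenter's actual code: the function name `strip_protocol` is hypothetical, and it assumes the lookup key is simply the lower-cased host plus path (and query string, if present).

```python
from urllib.parse import urlsplit


def strip_protocol(url: str) -> str:
    """Return a lookup key for `url` that ignores the scheme (http vs https).

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # Host names are case-insensitive, so lower-case the host part.
    # An empty path is treated the same as "/".
    key = parts.netloc.lower() + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


old_url = "http://www.archive.org/"
new_url = "https://www.archive.org/"

# Both variants collapse to the same key, so they can share one history.
print(strip_protocol(old_url) == strip_protocol(new_url))
```

With a key like this stored in a separate URLs table, a switch from http to https would no longer split a site's history in two.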

@rviscomi
Member

rviscomi commented Feb 4, 2019

We have a goal in 2019 to reimplement the URL dashboard using BigQuery and Data Studio but we would still have the same continuity issues across URL corpus changes, even as we update the CrUX corpus monthly. Something to keep in mind.

One idea is to group by domain and have a line / table row for each origin. Note that some domains with user-generated content (like wordpress.com) would have many many origins.
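
As a rough illustration of the grouping idea, the sketch below buckets origins under a domain. It rests on loud assumptions: `naive_domain` is a hypothetical helper that just keeps the last two host labels, which is wrong for suffixes such as `co.uk` (a real implementation would need the Public Suffix List), and the sample origins other than `www.archive.org` are made up.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def naive_domain(origin: str) -> str:
    """Approximate the registrable domain by keeping the last two labels.

    Assumption for illustration only: this ignores multi-label public
    suffixes such as "co.uk".
    """
    host = urlsplit(origin).hostname or ""
    return ".".join(host.split(".")[-2:])


def group_origins(origins):
    """Map each domain to the list of origins seen under it."""
    groups = defaultdict(list)
    for origin in origins:
        groups[naive_domain(origin)].append(origin)
    return dict(groups)


# Hypothetical sample origins.
origins = [
    "http://www.archive.org",
    "https://www.archive.org",
    "https://blog.archive.org",
    "https://example.wordpress.com",
]

# One group per domain, one row per origin within it.
for domain, members in group_origins(origins).items():
    print(domain, len(members))
```

A domain hosting user-generated content would produce a very long member list under this scheme, which is the caveat raised above.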

@Themanwithoutaplan

Just a note: the problem isn't really with the pageids, as these are distinct for each test. The issue is mainly the change from http to https, which breaks the lookup by URL, so a site that was in as http://www.archive.org is considered distinct from https://www.archive.org. I suspect that working directly with the host name in the CrUX dataset would resolve this. This would require some minor schema changes, which MySQL often doesn't take kindly to (adding columns particularly), and there would be more work for the loader and pages. And if the idea is to replace the reports with something derived more directly from BigQuery, then it's probably best to wait for this.
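
To make the host name suggestion concrete, here is a small sketch of how test results could be stitched into one series per host. The row layout (crawl date, URL, metric value) and the numbers are invented for illustration and do not reflect the actual HTTP Archive schema; `history_by_host` is a hypothetical name.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical rows: (crawl date, URL, metric value). Values are made up.
rows = [
    ("2018-07-15", "https://www.archive.org/", 120),
    ("2018-06-15", "http://www.archive.org/", 100),
    ("2018-07-01", "https://www.archive.org/", 110),
]


def history_by_host(rows):
    """Group results by host name, ignoring the scheme, sorted by crawl date."""
    history = defaultdict(list)
    for crawl, url, value in rows:
        host = urlsplit(url).hostname
        history[host].append((crawl, value))
    for series in history.values():
        # ISO-formatted dates sort correctly as plain strings.
        series.sort()
    return dict(history)


for host, series in history_by_host(rows).items():
    print(host, series)
```

Keyed this way, the http result from before the expansion and the https results after it end up in the same series.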


3 participants