Legacy Website Reports are Missing Historical Data #151

paulcalvano · 2018-08-14T15:19:26Z

When the HTTP Archive dataset was expanded in July 2018, new page ids were assigned for the newer URLs. This has broken the historical reports: URLs that were previously tracked have lost continuity with their earlier data.

You can see an example of this here: https://legacy.httparchive.org/viewsite.php?pageid=94191763. The legacy report continues to include the latest stats, but now only shows trends starting with July 2018.

@rviscomi and I believe that this can be corrected by mapping the old pages table records to the new pageids. I've assigned this to myself and will look into it.

Themanwithoutaplan · 2018-08-14T15:26:10Z

FWIW I worked around this on my dataset by stripping the protocol from the URL (most of the new sites are https). But I use a more normalised schema with a separate URLs table and am thus not dependent upon the page_id.
Happy to share my changes if they'd be any use.
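
For illustration, a minimal sketch of this kind of workaround in Python. It is not the commenter's actual code: the function names `strip_protocol` and `link_page_ids`, and the `(page_id, url)` tuple layout, are assumptions made for the example. It derives a scheme-less key from each URL and uses that key to relate new page ids to old ones.

```python
from urllib.parse import urlsplit


def strip_protocol(url: str) -> str:
    """Return the URL without its scheme, e.g. 'http://a.org/' -> 'a.org/'."""
    parts = urlsplit(url)
    # Host names are case-insensitive, so lowercase the netloc;
    # path and query are kept exactly as given.
    key = parts.netloc.lower() + parts.path
    if parts.query:
        key += "?" + parts.query
    return key


def link_page_ids(old_pages, new_pages):
    """Map each new page id to the old page ids that share a scheme-less URL.

    Both arguments are iterables of (page_id, url) tuples (hypothetical layout).
    """
    old_by_key = {}
    for page_id, url in old_pages:
        old_by_key.setdefault(strip_protocol(url), []).append(page_id)
    return {
        page_id: old_by_key.get(strip_protocol(url), [])
        for page_id, url in new_pages
    }
```

With this key, `http://www.archive.org/` and `https://www.archive.org/` resolve to the same entry, so an old record can be associated with its newer https counterpart.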

rviscomi · 2019-02-04T23:28:26Z

We have a goal in 2019 to reimplement the URL dashboard using BigQuery and Data Studio but we would still have the same continuity issues across URL corpus changes, even as we update the CrUX corpus monthly. Something to keep in mind.

One idea is to group by domain and have a line / table row for each origin. Note that some domains with user-generated content (like wordpress.com) would have many many origins.
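
A rough sketch of the grouping idea in Python, under stated assumptions: `naive_domain` and `group_origins_by_domain` are hypothetical names, and the "domain" is approximated as the last two labels of the host name. That heuristic is wrong for suffixes such as `co.uk`; a real implementation would need a public suffix list.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def naive_domain(origin: str) -> str:
    """Approximate the domain as the last two host labels (assumption, see above)."""
    host = urlsplit(origin).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:])


def group_origins_by_domain(origins):
    """Return {domain: sorted list of origins}, one list entry per line / table row."""
    groups = defaultdict(set)
    for origin in origins:
        groups[naive_domain(origin)].add(origin)
    return {domain: sorted(members) for domain, members in groups.items()}
```

Under this grouping, the http and https origins of a site fall under one domain, while a domain like wordpress.com would collect every user subdomain as a separate origin.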

Themanwithoutaplan · 2019-02-05T08:55:31Z

Just a note: the problem isn't really with the pageids, as these are distinct for each test. The issue is mainly the change from http to https, which breaks the lookup by URL, so a site that was in as http://www.archive.org is considered distinct from https://www.archive.org. I suspect that working directly with the host name in the CrUX dataset would resolve this. This would require only minor schema changes, though MySQL often doesn't take kindly to these (adding columns particularly), and there would be more work for the loader and pages. And if the idea is to replace the reports with something derived more directly from BigQuery then it's probably best to wait for this.

paulcalvano self-assigned this Aug 14, 2018

rviscomi added the bug label Aug 14, 2018

rviscomi mentioned this issue Feb 4, 2019

Legacy website explorer limited to July 2018 #149

Closed
