httparchive.urls.* tables schema change #42

Closed
tomayac opened this issue Jun 22, 2018 · 1 comment

tomayac commented Jun 22, 2018

The schema of the httparchive.urls.* tables seems to have changed from…

I used to be able to quickly get historical ranks by querying httparchive.urls.* and extracting the date from _TABLE_SUFFIX, but this is no longer possible. Was this announced anywhere? If so, I missed it and can't find it now.

@rviscomi

For historical ranks you can assume https://bigquery.cloud.google.com/table/httparchive:urls.20170315 is the expected table. The urls dataset is not intended to be used in wildcard queries.

20171221 uses the new Alexa top 1M list, which includes subdomains and is very low quality (in my opinion).

20180620 is the intersection of the https://bigquery.cloud.google.com/table/chrome-ux-report:all.201805 Chrome UX Report dataset and the 20170315 Alexa domains. If needed, you could still infer the rank by joining the urls tables by domain, although you'd be ranking all subdomains the same.
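
To make the "join by domain" idea concrete, here is a minimal in-memory sketch in Python. It is an illustration only, not the HTTP Archive pipeline: the function name `infer_ranks`, the `ranked_domains` mapping (domain to rank, standing in for the 20170315 table), and the plain list of URLs (standing in for the 20180620 table) are all assumptions, and the real tables' column names may differ. It also shows the caveat mentioned above: every subdomain inherits the rank of its parent domain.

```python
from urllib.parse import urlsplit


def infer_ranks(ranked_domains, urls):
    """Map each URL to the rank of the closest ranked parent domain.

    ranked_domains: dict of domain -> rank (hypothetical stand-in for the
                    20170315 table).
    urls:           iterable of URL strings (hypothetical stand-in for the
                    20180620 table).
    Returns a dict of url -> rank, or None when no parent domain is ranked.
    """
    inferred = {}
    for url in urls:
        host = urlsplit(url).hostname or ""
        labels = host.split(".")
        rank = None
        # Try the full host first, then progressively shorter parent
        # domains, stopping before the bare top-level label.
        for i in range(len(labels) - 1):
            candidate = ".".join(labels[i:])
            if candidate in ranked_domains:
                rank = ranked_domains[candidate]
                break
        inferred[url] = rank
    return inferred
```

With such a mapping, `https://www.example.com/` and `https://blog.example.com/` would both receive the rank stored for `example.com`, which is exactly the "ranking all subdomains the same" limitation.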

We plan to use the 20180620 URLs in the 2018_07_01 crawl, tripling the number of URLs per crawl.

@tomayac tomayac closed this as completed Jul 19, 2018