Categorize origins #125

Open
rviscomi opened this issue Apr 11, 2017 · 9 comments

4 participants
@rviscomi
Member

commented Apr 11, 2017

Group origins by category/vertical, for example news/travel/etc. This will enable category deep dives and comparisons.

DMOZ is no longer operational but a recent data dump is available. We should look for alternate sources.

Slightly related: HTTPArchive/legacy.httparchive.org#75. Alexa is deprecating their top 1M ranking, so finding a rank+category solution would be a bonus.

@gregorywolf

Collaborator

commented Sep 12, 2017

Rick - I listened to your recent talk in NYC at Performance Meet Up about the categorizing of URLs. I think this would be a powerful feature!! Any thoughts on how this could get moved forward? I would volunteer to help the effort. Hopefully more folks will feel the same way and we could proceed before too long.

@rviscomi

Member Author

commented Sep 12, 2017

Hey Greg, thanks for volunteering! Assigning this to you :)

The next steps for this issue are:

  • survey the landscape of options: are there any other services similar to DMOZ that are regularly updated? What is the URL coverage, and how does it overlap with the Alexa 500K that we're using? Is there room for growth as we expand URL coverage? How do the options compare on category correctness, granularity, etc.?
  • plan and integrate the category info with HTTP Archive's data: what changes need to be made to the Dataflow pipeline and BigQuery schema?
  • analyze the new data and surface interesting reports on the beta site
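
A minimal sketch of how the coverage question in the first step could be checked, assuming a candidate category source can be exported as a list of hostnames. The function names and sample hosts are hypothetical and are not part of the HTTP Archive pipeline:

```python
from urllib.parse import urlsplit


def normalize_host(url):
    """Reduce a URL or bare hostname to a lowercase host without a leading 'www.'."""
    # urlsplit only recognizes the host when the string contains '//'.
    host = urlsplit(url if "//" in url else "//" + url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def coverage(crawl_urls, categorized_hosts):
    """Return (covered, total, share) for crawl hosts present in the category source."""
    hosts = {normalize_host(u) for u in crawl_urls}
    hosts.discard("")  # drop anything that could not be parsed into a host
    known = {normalize_host(h) for h in categorized_hosts}
    covered = hosts & known
    share = len(covered) / len(hosts) if hosts else 0.0
    return len(covered), len(hosts), share


# Hypothetical sample data for illustration only.
crawl = ["http://www.example.com/", "https://news.example.org/", "http://shop.example.net/"]
source = ["example.com", "news.example.org"]
covered, total, share = coverage(crawl, source)
print(f"{covered}/{total} hosts categorized ({share:.0%})")
```

Matching on normalized hostnames is only one possible choice; a real comparison might need to match on full origins or registrable domains instead.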
@paulcalvano

Contributor

commented Sep 12, 2017

I was thinking about this the other day and didn't realize there was an issue open. During my searches the only thing I was able to find was archived DMOZ data. Here's the dump I found in case it's useful - https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OMV93V

@igrigorik

Member

commented Sep 12, 2017

Sadly, DMOZ is deprecated. I don't think we should hitch our wagon to this particular dataset.

@rviscomi

Member Author

commented Sep 12, 2017

Yeah, the DMOZ dump could be used as a last resort, but it'd be preferable to find a service that's actively maintained.

Ilya also did some work on joining DMOZ data with Alexa URLs here: https://bigquery.cloud.google.com/table/httparchive:urls.20170315?tab=preview. Of the 1M URLs, only ~170K (17%) have topics/categories.
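
For illustration, the kind of join described here can be sketched as a left join that keeps every URL and leaves the category empty when the source has no entry. This is an assumption-laden sketch, not the query behind the linked table; the function names and sample data are hypothetical:

```python
from collections import Counter


def left_join_categories(urls, category_by_url):
    """Pair every URL with its category, or None when the source has no entry."""
    return [(url, category_by_url.get(url)) for url in urls]


def summarize(joined):
    """Count URLs per category and the share of URLs that have any category."""
    counts = Counter(cat for _, cat in joined if cat is not None)
    categorized = sum(counts.values())
    share = categorized / len(joined) if joined else 0.0
    return counts, share


# Hypothetical sample data for illustration only.
urls = ["a.example", "b.example", "c.example", "d.example"]
categories = {"a.example": "News", "c.example": "Travel"}
counts, share = summarize(left_join_categories(urls, categories))
print(counts, f"{share:.0%} categorized")
```

Applied to the figures above, ~170K categorized URLs out of 1M corresponds to a share of 0.17.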

@paulcalvano

Contributor

commented Sep 12, 2017

Ah, cool. I'll stop uploading that dataset to BigQuery then. Was about to do the same analysis :)

@gregorywolf

Collaborator

commented Sep 13, 2017

Rick - I'll start poking around and see what I can find. Stay tuned.

@rviscomi

Member Author

commented Nov 3, 2017

Hey @gregorywolf have you made any progress on this?

@gregorywolf

Collaborator

commented Nov 3, 2017

@rviscomi transferred this issue from HTTPArchive/legacy.httparchive.org on Feb 4, 2019
