Allow storage of arbitrary state data with an URL #14

Closed
mna opened this issue Mar 24, 2013 · 0 comments

mna commented Mar 24, 2013

When a page is crawled, some data is extracted. Sometimes the complete data for a given item is split across several pages. It may therefore be necessary to store some state when crawling a page, so that this information is available when a "child" page is crawled.

For example, a page /author is crawled and information on the author is saved in a DB with an ID. The URL /author/book1 is then enqueued, but if that page is crawled statelessly, it has no way to link its information back to the previously crawled author. (It could find the author's name on the book page, but let's pretend it's not there; even if it were, there may be many authors with the same name, or there might be a typo, etc.)
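To make the scenario concrete, here is a minimal sketch of what carrying state alongside an enqueued URL could look like. This is not gocrawl's API: `Item`, `Queue`, `AuthorState`, and `Crawl` are hypothetical names invented for illustration, and the author ID 42 stands in for whatever the DB would return.

```go
package main

import "fmt"

// Item pairs a URL with arbitrary caller-defined state.
// Hypothetical type for illustration only, not part of gocrawl.
type Item struct {
	URL   string
	State interface{}
}

// Queue is a simple FIFO of items waiting to be crawled.
type Queue struct {
	items []Item
}

// Enqueue appends a URL together with its state (nil for stateless URLs).
func (q *Queue) Enqueue(url string, state interface{}) {
	q.items = append(q.items, Item{URL: url, State: state})
}

// Dequeue removes and returns the oldest item, or false if the queue is empty.
func (q *Queue) Dequeue() (Item, bool) {
	if len(q.items) == 0 {
		return Item{}, false
	}
	it := q.items[0]
	q.items = q.items[1:]
	return it, true
}

// AuthorState is what the /author visit wants its child pages to see.
type AuthorState struct {
	AuthorID int
}

// Crawl drains the queue and returns a map of book URL to author ID.
// Fetching and parsing are faked; only the state hand-off is shown.
func Crawl(q *Queue) map[string]int {
	links := make(map[string]int)
	for {
		it, ok := q.Dequeue()
		if !ok {
			return links
		}
		switch st := it.State.(type) {
		case nil:
			// Stateless (seed) page: pretend the author was saved with ID 42,
			// then enqueue the child URL with that ID attached.
			if it.URL == "/author" {
				q.Enqueue("/author/book1", AuthorState{AuthorID: 42})
			}
		case AuthorState:
			// Child page: the author ID travels with the URL, so the book
			// can be linked without re-deriving the author from the page.
			links[it.URL] = st.AuthorID
		}
	}
}

func main() {
	q := &Queue{}
	q.Enqueue("/author", nil) // seed URL, no state
	fmt.Println(Crawl(q))
}
```

In this sketch the seed URL is enqueued with a nil state, which is one possible answer to the seed question; allowing seeds to carry state would only change the first `Enqueue` call.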

Not sure yet whether this should be managed by gocrawl. Should seed URLs also be allowed to have state? How much of a pain will it be to implement, and how much will it complicate the API?

mna closed this as completed Apr 3, 2013