Allow storage of arbitrary state data with an URL #14

mna · 2013-03-24T13:52:51Z

When a page is crawled, some data is extracted from it. Sometimes the complete data for a given item is split across several pages. It may be necessary to store some state when crawling a page so that this information is available when a "child" page is crawled later.

For example, the page /author is crawled and information about the author is saved in a DB, with an ID. The URL /author/book1 is then enqueued, but if that page is crawled in a stateless way, it has no way to link its information back to the previously crawled author. (It could find the author's name in the book page, but let's pretend it's not there. Even if it were, there may be many authors with the same name, or there might be a typo, etc.)
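
To make the idea concrete, here is a minimal sketch of a queue whose entries carry state alongside the URL. It is not gocrawl's API: the names `queuedURL`, `crawl`, and `linkBooksToAuthors` are hypothetical, pages are simulated rather than fetched, and the author ID 42 stands in for whatever the DB would return.

```go
package main

import "fmt"

// queuedURL pairs a URL with arbitrary caller-defined state.
// Hypothetical type for illustration; not part of gocrawl.
type queuedURL struct {
	URL   string
	State interface{}
}

// crawl drains the queue in FIFO order. Each visit receives the item's
// state and returns the child URLs to enqueue, each with its own state.
func crawl(seeds []queuedURL, visit func(queuedURL) []queuedURL) {
	queue := append([]queuedURL(nil), seeds...)
	for len(queue) > 0 {
		item := queue[0]
		queue = queue[1:]
		queue = append(queue, visit(item)...)
	}
}

// linkBooksToAuthors simulates the /author -> /author/book1 example and
// returns a map from book URL to the author ID carried in the state.
func linkBooksToAuthors() map[string]int {
	bookAuthor := make(map[string]int)
	visit := func(item queuedURL) []queuedURL {
		switch item.URL {
		case "/author":
			authorID := 42 // assumption: pretend this ID came back from the DB insert
			// Enqueue the child page and attach the author ID as its state.
			return []queuedURL{{URL: "/author/book1", State: authorID}}
		case "/author/book1":
			// The book page recovers the author ID without parsing the author's name.
			if id, ok := item.State.(int); ok {
				bookAuthor[item.URL] = id
			}
		}
		return nil
	}
	// The seed URL has no state here, but the same struct would allow it.
	crawl([]queuedURL{{URL: "/author"}}, visit)
	return bookAuthor
}

func main() {
	fmt.Println(linkBooksToAuthors())
}
```

In this sketch the state is an opaque `interface{}` that the crawler only passes along, so the visiting code decides what to store and how to type-assert it. Seeds use the same struct as any other queue entry, which is one possible answer to whether seed URLs could carry state too.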

Not sure yet whether this should be managed by gocrawl or not. Should seed URLs also be allowed to have state? How much of a pain will it be to implement, and how much will it complicate the API?

mna closed this as completed Apr 3, 2013
