Delete gone pages from index #253

Closed
jnioche opened this Issue Feb 13, 2016 · 3 comments

jnioche commented Feb 13, 2016

Conceptually, a GONE status is not the same as an ERROR status, which occurs for instance when a document cannot be parsed or has something wrong with it.
A document could become GONE if it has had N consecutive FETCH_ERRORs or if the server returned an HTTP 410 code. A GONE document could still be revisited based on the scheduling but should be deleted from the index.
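
To make the rule above concrete, here is a minimal sketch in plain Java. It is not StormCrawler code: `GoneTracker`, `recordFetch` and the choice to treat any non-2xx/3xx response as a fetch error are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of StormCrawler): decides when a URL counts as GONE. */
final class GoneTracker {
    // The "N" from the comment above; the actual threshold is an assumed setting.
    private final int maxConsecutiveFetchErrors;
    private final Map<String, Integer> consecutiveErrors = new HashMap<>();

    GoneTracker(int maxConsecutiveFetchErrors) {
        this.maxConsecutiveFetchErrors = maxConsecutiveFetchErrors;
    }

    /** Records one fetch attempt and returns true if the URL should now be treated as GONE. */
    boolean recordFetch(String url, int httpCode) {
        if (httpCode == 410) {
            // The server explicitly says the resource is gone.
            consecutiveErrors.remove(url);
            return true;
        }
        if (httpCode >= 200 && httpCode < 400) {
            // A successful fetch (or redirect) resets the error streak.
            consecutiveErrors.remove(url);
            return false;
        }
        // Assumption: everything else counts as a fetch error.
        int errors = consecutiveErrors.merge(url, 1, Integer::sum);
        return errors >= maxConsecutiveFetchErrors;
    }
}
```

With `new GoneTracker(3)`, a URL is reported as GONE on its third consecutive failure, or immediately on a 410.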

jnioche commented Sep 15, 2016

An ERROR status would probably have a corresponding conventional NEVER FETCH value for the nextFetchDate (epoch, 1/1/1970?). Such a value would also be useful for FETCHED documents that we never want to reprocess. At the moment we just set a ridiculously large value, such as 10 years from now.
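
A rough sketch of what such a sentinel could look like, in plain Java. `FetchSchedule`, `NEVER` and `isDue` are hypothetical names, and using the epoch as the sentinel is only the suggestion floated above, not an existing convention.

```java
import java.time.Instant;

/** Hypothetical helper (not part of StormCrawler) illustrating a NEVER FETCH sentinel. */
final class FetchSchedule {
    /** Assumed convention: a nextFetchDate equal to the epoch means "never fetch again". */
    static final Instant NEVER = Instant.EPOCH;

    /** Returns true if a URL with this nextFetchDate should be fetched at time 'now'. */
    static boolean isDue(Instant nextFetchDate, Instant now) {
        if (NEVER.equals(nextFetchDate)) {
            // Without this check the epoch would look permanently overdue.
            return false;
        }
        return !nextFetchDate.isAfter(now);
    }
}
```

One thing the sketch brings out: an epoch sentinel has to be excluded explicitly wherever URLs are selected with a plain `nextFetchDate <= now` comparison, whereas a date far in the future needs no special handling.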

jnioche commented Mar 15, 2017

See SO discussion on http://stackoverflow.com/questions/42810272/tell-stormcrawler-to-delete-pages-from-es-index-after-they-have-been-deleted-on

In the meantime we could delete the documents with an ERROR status (even if most of them would not have been added to ES). The difficulty is that the indexing is done on the default stream, i.e. on docs which have been fetched and parsed, whereas the status info gets sent to the status stream by the various bolts. We could have a bespoke bolt for deletions, which would need just the URL and nothing else. To use it, we could modify AbstractStatusUpdaterBolt so that it emits the URLs to delete onto a special stream, e.g. 'deletion', and connect our bespoke ES delete bolt to it. Does this make sense? Please feel free to contribute to the discussion at the link above. Thanks!
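
The routing idea can be sketched without any Storm or Elasticsearch dependency. Everything below is a stand-in written for illustration: `CrawlStatus`, `SketchStatusUpdater` and `SketchDeleteBolt` are hypothetical names, a `Consumer<String>` plays the role of the 'deletion' stream, and a `Set<String>` plays the role of the index.

```java
import java.util.Set;
import java.util.function.Consumer;

/** Stand-in for the crawl statuses; not the real StormCrawler enum. */
enum CrawlStatus { DISCOVERED, FETCHED, FETCH_ERROR, ERROR }

/** Stand-in for a status updater that also feeds a separate 'deletion' stream. */
final class SketchStatusUpdater {
    private final Consumer<String> deletionStream;

    SketchStatusUpdater(Consumer<String> deletionStream) {
        this.deletionStream = deletionStream;
    }

    void update(String url, CrawlStatus status) {
        // Persisting the status itself is omitted in this sketch.
        if (status == CrawlStatus.ERROR) {
            // Only the URL travels on the deletion stream, nothing else.
            deletionStream.accept(url);
        }
    }
}

/** Stand-in for the bespoke delete bolt: removes a document from the index by URL. */
final class SketchDeleteBolt implements Consumer<String> {
    private final Set<String> index;

    SketchDeleteBolt(Set<String> index) {
        this.index = index;
    }

    @Override
    public void accept(String url) {
        index.remove(url);
    }
}
```

In a real topology the wiring would go through Storm's named streams rather than a `Consumer`; the point here is only that the delete side needs nothing beyond the URL.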

@jnioche jnioche added this to the 1.5 milestone Apr 18, 2017

@jnioche jnioche changed the title from add GONE status to Delete gone pages from index Apr 19, 2017

jnioche commented Apr 19, 2017

Renamed the issue. Adding a separate status is probably not practical as the status is typically not carried through the topology. We could add a key/value pair to the metadata to indicate when a page was last successfully fetched or indexed, so that we delete only the documents that get an ERROR status but had previously been fetched. As an initial step, deleting documents even though they never got fetched is not really a problem.

Having said that, we might need to track the canonical value so that we can delete the doc based on the same URL as was used while indexing. The trouble is that we could end up deleting the canonical representation of the page even though only one of the possible variations got lost. Tricky.
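
For the metadata idea, a minimal sketch of the check could look like this. The key name `lastFetched` and the `DeletionPolicy` class are hypothetical, and the metadata is modelled as a plain `Map<String, String>` rather than the project's own metadata type. The sketch does not attempt to address the canonical URL problem.

```java
import java.util.Map;

/** Hypothetical helper (not part of StormCrawler): decides whether an ERRORed URL needs deleting. */
final class DeletionPolicy {
    /** Assumed metadata key, set whenever a page is successfully fetched or indexed. */
    static final String LAST_FETCHED_KEY = "lastFetched";

    /** Delete only documents that are in ERROR and were successfully fetched at some point. */
    static boolean shouldDelete(boolean isError, Map<String, String> metadata) {
        return isError && metadata.containsKey(LAST_FETCHED_KEY);
    }
}
```

Skipping this check, as the initial step does, simply means issuing some deletes for documents that were never in the index.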

@jnioche jnioche closed this in 7ac70e5 Apr 20, 2017

jnioche added a commit that referenced this issue Apr 20, 2017

Merge pull request #454 from DigitalPebble/deletion
Delete gone pages from index, fixes #253