BasicURLFilter to remove URLs based on path repetition and max length #368

jnioche · 2016-10-24T10:48:45Z

Filtering on path repetition is very useful: in most crawls there are inevitably sites that generate recursive URLs, which rapidly take over everything else. We had a regex-based filter inherited from Nutch, but it is very slow. Instead, we can put that functionality in a new BasicURLFilter, which could also remove URLs based on their length.

These filters are meant to be super fast and so should be used early in the filtering chain.
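To illustrate the idea, here is a minimal sketch of the two checks done without a regex. It is not the committed BasicURLFilter code: the class name `PathRepetitionSketch`, the method `filter`, and the parameters `maxPathRepetition` and `maxLength` are assumptions made for this example, as is the convention that a value of 0 or less disables a check.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch only; names and parameters are assumptions, not the project's API. */
class PathRepetitionSketch {

    /**
     * Returns the URL unchanged if it passes both checks, or null if it should be dropped.
     * A limit of 0 or less disables the corresponding check (an assumed convention).
     */
    static String filter(String url, int maxPathRepetition, int maxLength) {
        // Cheapest check first: the total length of the URL string.
        if (maxLength > 0 && url.length() > maxLength) {
            return null;
        }
        if (maxPathRepetition > 0) {
            String path;
            try {
                path = new URI(url).getPath();
            } catch (URISyntaxException e) {
                // This sketch drops URLs it cannot parse.
                return null;
            }
            if (path == null) {
                // Opaque URIs (e.g. mailto:) have no path to inspect.
                return url;
            }
            // Count how often each path segment occurs, with no regex involved.
            Map<String, Integer> counts = new HashMap<>();
            for (String segment : path.split("/")) {
                if (segment.isEmpty()) {
                    continue;
                }
                int seen = counts.merge(segment, 1, Integer::sum);
                if (seen > maxPathRepetition) {
                    return null;
                }
            }
        }
        return url;
    }

    public static void main(String[] args) {
        // A recursive-looking URL: the segments "a" and "b" each occur four times.
        System.out.println(filter("http://example.com/a/b/a/b/a/b/a/b/page.html", 3, -1));
        // An ordinary URL with no repeated segments.
        System.out.println(filter("http://example.com/a/b/c.html", 3, 100));
    }
}
```

Both checks are a single pass over the string or its path segments, which is why a filter of this kind is cheap enough to run before anything more expensive in the chain.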


jnioche added enhancement core labels Oct 24, 2016

jnioche added this to the 1.2 milestone Oct 24, 2016

jnioche self-assigned this Oct 24, 2016

jnioche added a commit that referenced this issue Oct 24, 2016

BasicURLFilter to remove URLs based on path repetition and max length #…

369cde3

…368

jnioche closed this as completed Oct 24, 2016
