BasicURLFilter to remove URLs based on path repetition and max length #368

Closed
jnioche opened this issue Oct 24, 2016 · 0 comments

jnioche commented Oct 24, 2016

The path repetition filter is very useful: in most crawls there are inevitably sites that generate recursive URLs, which rapidly take over everything else. We had a regex-based filter inherited from Nutch, but it is very slow. Instead, we can put that functionality in a new BasicURLFilter, which could also remove URLs based on their length.

These filters are meant to be super fast and so should be used early in the filtering chain.
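
To illustrate the idea, here is a minimal sketch of how such a check could work without regular expressions. It is not the actual implementation: the class name `BasicUrlFilterSketch`, the `filter` signature and the two parameters `maxPathRepetition` and `maxLength` are assumptions made for this example, as is the choice to count how often the same path segment occurs.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch only; names and semantics are assumptions, not the project's API. */
class BasicUrlFilterSketch {

    /**
     * Returns the URL unchanged if it passes both checks, or null if it should be dropped.
     * A negative limit disables the corresponding check.
     */
    static String filter(String url, int maxPathRepetition, int maxLength) {
        // Length check first: it is the cheapest test.
        if (maxLength >= 0 && url.length() > maxLength) {
            return null;
        }
        if (maxPathRepetition < 0) {
            return url;
        }

        String path;
        try {
            path = new URI(url).getPath();
        } catch (URISyntaxException e) {
            // Assumption: URLs that cannot be parsed are rejected.
            return null;
        }
        if (path == null) {
            // Opaque URIs (e.g. mailto:) have no path to inspect.
            return url;
        }

        // Count occurrences of each non-empty path segment and stop early
        // as soon as one of them appears more often than allowed.
        Map<String, Integer> counts = new HashMap<>();
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue;
            }
            int seen = counts.merge(segment, 1, Integer::sum);
            if (seen > maxPathRepetition) {
                return null;
            }
        }
        return url;
    }
}
```

With `maxPathRepetition` set to 3, a URL whose path is `/a/b/a/b/a/b/a/b` would be dropped because the segment `a` occurs four times, while `/a/b/c` would pass. The work is a single pass over the path segments, which is why a filter like this can sit at the start of the chain.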
