Filtering on path repetition is very useful: in most crawls there inevitably are sites that generate recursive URLs, which rapidly take over everything else. We had a regex-based filter inherited from Nutch, but it is very slow. Instead we can put that functionality into a new BasicURLFilter, which could also remove URLs based on their length.

These filters are meant to be super fast and so should be used early in the filtering chain.
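As a rough illustration of the idea, a sketch of such a filter might look like the following. This is not StormCrawler's actual implementation; the class name matches the proposal, but the method signature, thresholds, and repetition-counting logic are assumptions for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed BasicURLFilter: discards URLs that
// are too long or whose path repeats the same segment too many times.
// Thresholds and method names are illustrative only.
public class BasicURLFilter {

    private final int maxLength;
    private final int maxRepetitions;

    public BasicURLFilter(int maxLength, int maxRepetitions) {
        this.maxLength = maxLength;
        this.maxRepetitions = maxRepetitions;
    }

    /** Returns the URL unchanged if it passes, or null to discard it. */
    public String filter(String url) {
        // Cheap length check first: no parsing needed.
        if (url.length() > maxLength) {
            return null;
        }
        try {
            String path = new URI(url).getPath();
            if (path != null) {
                // Count occurrences of each path segment; a segment seen
                // too often suggests a recursive URL like /a/b/a/b/a/b/...
                Map<String, Integer> counts = new HashMap<>();
                for (String segment : path.split("/")) {
                    if (segment.isEmpty()) {
                        continue;
                    }
                    int c = counts.merge(segment, 1, Integer::sum);
                    if (c > maxRepetitions) {
                        return null;
                    }
                }
            }
        } catch (URISyntaxException e) {
            // Unparsable URLs are discarded as well.
            return null;
        }
        return url;
    }
}
```

Both checks are simple string and map operations with no regex, so running this early in the chain discards pathological URLs before more expensive filters see them.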