New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New URL filter implementation based on JSON file and organised per hostname or domain #578

Closed
jnioche opened this Issue Jun 6, 2018 · 8 comments

Comments

Projects
None yet
3 participants
@jnioche
Member

jnioche commented Jun 6, 2018

One benefit is that we won't need to go through all the patterns and can just check the ones for a particular domain which could be quicker. Being based on a JSON file, means that thanks to #569 and we'll even be able to refresh the resources from ES without restarting the topology.

@jnioche jnioche added this to the 1.10 milestone Jun 6, 2018

@noerw

This comment has been minimized.

noerw commented Jun 6, 2018

Even more flexibility would be given, if the filters were organized by arbitrary keys, that would be matched to metadata key/values.
This would for example allow the use case where sets of (seed) URLs are grouped by an external ID (eg metadata.crawlId), which maps to specific crawl settings / filters for that group.

@jnioche

This comment has been minimized.

Member

jnioche commented Jun 6, 2018

That's a good suggestion. I'll think about a nice way to configure per host/domain/arbitrary metadata in the same config file but if you have any suggestion, that would be welcome.

@sebastian-nagel

This comment has been minimized.

Collaborator

sebastian-nagel commented Jun 6, 2018

Maybe the urlfilter-fast plugin in CC's fork of Nutch could be a source of inspiration. Matching regular expressions only on path (and query) is also faster than matching on the longer URL string. Currently, urlfilter-fast uses a HashMap to hold rules per host/domain. That scales well up to 100,000s of hosts/domains given that most of them share the simple {{DenyPath .*}} - I'm trying to replace the hash by a trie and hope to be able to blacklist few millions of domains using only about 100-200 MB of Java heap memory.

@jnioche

This comment has been minimized.

Member

jnioche commented Jun 7, 2018

@noerw I've put an initial version in a separate branch, please have a look and let me know what you think of it

@noerw

This comment has been minimized.

noerw commented Jun 8, 2018

Amazing, thank you! This looks great, two remarks though:

  • When generating the required JSON, it would be easier to have the scope type, key and value in separate JSON fields instead of having to build strings. But thats just preference I guess.
    [
           {
    		"scope": "GLOBAL",
    		"patterns": ["DenyPathQuery \\.jpg"]
    	},
    	{
    		"scope": "DOMAIN",
    		"value": "stormcrawler.net",
    		"patterns": ["DenyPath .+"]
    	},
    	{
    		"scope": "METADATA",
    		"value": "bar",
    		"key": "foo",
    		"patterns": ["DenyPath .+"]
           }
    ]
  • To implement a whitelist, a directive AllowPath could easily be added to checkScope(), right?
@jnioche

This comment has been minimized.

Member

jnioche commented Jun 8, 2018

it would be easier to have the scope type, key and value in separate JSON fields instead of having to build strings

I considered it but thought it was simpler like that. The value (whatever it is) is part of the scope and easier to build and understand.

To implement a whitelist, a directive AllowPath could easily be added to checkScope(), right?

Could do. I'll have a look at that.

Thanks for your feedback

@jnioche

This comment has been minimized.

Member

jnioche commented Jun 8, 2018

@noerw see #581

jnioche added a commit that referenced this issue Jun 11, 2018

New URL filter implementation based on JSON file and organised per ho…
…stname or domain #578 (#581)

* Initial version of FastURLFilter - more tests needed. Implements #578

* Added AllowPath directives and license headers + minor Javadoc changes
@jnioche

This comment has been minimized.

Member

jnioche commented Jun 11, 2018

Merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment