ParseFilter to tag a document based on pattern matching on its URL #577

Closed
jnioche opened this Issue May 30, 2018 · 1 comment

jnioche commented May 30, 2018

Similar to the concept of collections in the Google Search Appliance (GSA), we can have a ParseFilter that adds a collection name to a document's metadata when its URL matches a set of patterns.

The resources can be defined in JSON like so:

{
	"collections": [
		{
			"name": "stormcrawler",
			"includePatterns": ["http://stormcrawler.net/.+"]
		},
		{
			"name": "crawler",
			"includePatterns": [".+crawler.+", ".+nutch.+"],
			"excludePatterns": [".+baby.+", ".+spider.+"]
		}
	]
}
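
To make the intended matching concrete, here is a minimal sketch in plain Java. It is not the ParseFilter itself: the class name `CollectionTagSketch`, the nested `Collection` type and the `tag` method are hypothetical, the two collections are hard-coded instead of being read from the JSON resource, and it assumes that a pattern must match the whole URL and that any matching exclude pattern overrides the include patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not the actual StormCrawler ParseFilter.
class CollectionTagSketch {

    /** One entry of the "collections" array: a name plus compiled regexes. */
    static final class Collection {
        final String name;
        final List<Pattern> includes = new ArrayList<>();
        final List<Pattern> excludes = new ArrayList<>();

        Collection(String name, List<String> includePatterns, List<String> excludePatterns) {
            this.name = name;
            for (String p : includePatterns) includes.add(Pattern.compile(p));
            for (String p : excludePatterns) excludes.add(Pattern.compile(p));
        }

        /** Assumption: full-string match; an exclude match wins over any include match. */
        boolean matches(String url) {
            for (Pattern p : excludes) {
                if (p.matcher(url).matches()) return false;
            }
            for (Pattern p : includes) {
                if (p.matcher(url).matches()) return true;
            }
            return false;
        }
    }

    // Mirrors the JSON example above; a real filter would load this from the resource file.
    static final List<Collection> COLLECTIONS = List.of(
            new Collection("stormcrawler",
                    List.of("http://stormcrawler.net/.+"),
                    List.of()),
            new Collection("crawler",
                    List.of(".+crawler.+", ".+nutch.+"),
                    List.of(".+baby.+", ".+spider.+")));

    /** Returns the names of all collections the URL belongs to, in configuration order. */
    static List<String> tag(String url) {
        List<String> names = new ArrayList<>();
        for (Collection c : COLLECTIONS) {
            if (c.matches(url)) names.add(c.name);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(tag("http://stormcrawler.net/index.html"));
        System.out.println(tag("http://nutch.apache.org/"));
        System.out.println(tag("http://example.com/babycrawler/"));
    }
}
```

Under these assumptions a URL can belong to several collections at once: `http://stormcrawler.net/index.html` matches both entries, while `http://example.com/babycrawler/` matches an include pattern of "crawler" but is rejected by the `.+baby.+` exclude.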

jnioche commented May 31, 2018

The format used differs from what the GSA supports: ours is pure regex, and GSA patterns can be converted to it fairly easily.
