RegexReferenceFilter from file #534

Open
ghost opened this issue Oct 31, 2018 · 1 comment

Comments

ghost commented Oct 31, 2018

Hi,

We have a site we want to crawl that contains a large number of subdirectories with different names, all of which we want to exclude.

With com.norconex.collector.core.filter.impl.RegexReferenceFilter, is there any way to manage this exclusion list other than with one very long regex?

Could we, for instance, have it read from a file that contains a list of regex patterns, one per line?

If that's currently not possible, would you consider it as a feature request for a future release?

Many Thanks.

essiembre (Contributor) commented

Here are a few options I can think of:

  • Create your own filter that takes a file.
  • Use the Importer ScriptFilter and either define all your regular expressions there or have it include a file.
  • Create one filter entry per regular expression.
  • Use a workaround: at launch time, pass a "variables" file that lists one regex per line, numbered like this:
myregex1 = blah.*
myregex2 = blah/again.*
...
myregex123 = blah/lastone.*

Then in your config, you can use Velocity syntax, like this (untested):

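## Untested: emits one exclude filter per numbered variable (myregex1 .. myregex123).
## The #set is intended to build the literal reference text (e.g. $myregex1);
## #evaluate then renders that text, which resolves to the variable's value.
#foreach($cnt in [1..123])
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
    #set($regex = "$myregex$cnt")
    #evaluate($regex)
  </filter>
#end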

I will mark it as a feature request nonetheless.
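
For the first option (your own filter that takes a file), the core logic could look roughly like the sketch below. This is only an illustration under a few assumptions: the class name FileRegexExcluder is hypothetical and not part of Norconex, blank lines and lines starting with "#" are treated as comments (my own convention, not a Norconex one), and a reference is excluded only when a pattern matches it in full. Wiring this into the collector's filter interface is left out on purpose.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/**
 * Hypothetical helper (not a Norconex class): loads one regex per line
 * from a file and reports whether a reference matches any of them.
 */
class FileRegexExcluder {

    private final List<Pattern> patterns = new ArrayList<>();

    FileRegexExcluder(Path patternFile) throws IOException {
        for (String line : Files.readAllLines(patternFile, StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            // Assumption: skip blank lines and "#" comment lines.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // Compile once at load time so matching stays cheap per reference.
            patterns.add(Pattern.compile(trimmed));
        }
    }

    /** Returns true if the whole reference matches at least one pattern. */
    boolean isExcluded(String reference) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(reference).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

With a file containing the single line `.*/blah/.*`, `isExcluded("http://example.com/blah/page.html")` returns true and `isExcluded("http://example.com/other/page.html")` returns false. Compiling the patterns up front also means a malformed regex fails at startup rather than in the middle of a crawl.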
