mbauhardt edited this page Sep 13, 2010 · 14 revisions

With this plugin you can maintain your URLs. All URLs you want to upload must be packed into a zip archive. Four types of URLs are supported:

  • Start Urls
  • Limit Urls
  • Exclude Urls
  • Metadata Urls

Example: fetch all URLs under http://lucene.apache.org/nutch/ except http://lucene.apache.org/nutch/bot.html

Start Urls
A start URL file contains a flat list of your start URLs. For our example we need only one start URL: “http://lucene.apache.org/nutch/index.html”.

Limit Urls
A limit URL file contains a flat list of the URLs we want to limit fetching to. For our example we need only one limit URL: “http://lucene.apache.org/nutch/”.

Exclude Urls
An exclude URL file contains a flat list of URLs we don't want to fetch. For our example we need only one exclude URL: “http://lucene.apache.org/nutch/bot.html”.
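
The interplay of limit and exclude URLs in the example can be sketched in a few lines of Python. This is only an illustration, not the plugin's implementation: it assumes that both kinds of entries are matched as plain string prefixes, and the function name `is_allowed` is hypothetical.

```python
def is_allowed(url, limit_urls, exclude_urls):
    """Return True if url lies under a limit URL and under no exclude URL.

    Assumption: entries are compared as plain string prefixes. The plugin
    itself may match differently (for example with patterns).
    """
    in_limit = any(url.startswith(prefix) for prefix in limit_urls)
    excluded = any(url.startswith(prefix) for prefix in exclude_urls)
    return in_limit and not excluded


# Values taken from the example on this page.
limit_urls = ["http://lucene.apache.org/nutch/"]
exclude_urls = ["http://lucene.apache.org/nutch/bot.html"]
```

Under this assumption, `is_allowed("http://lucene.apache.org/nutch/index.html", limit_urls, exclude_urls)` is true, while the same call with `http://lucene.apache.org/nutch/bot.html` is false.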

Metadata Urls
A metadata URL file contains a flat list of URLs with metadata. The file has the format:

url tab metaKey: tab metaValue tab metaValue .. tab metaKey: tab metaValue ..

For our example we want to give the URL “http://lucene.apache.org/nutch/apidocs-1.0/” the metadata foo:1.0 and the URL “http://lucene.apache.org/nutch/apidocs-0.9/” the metadata foo:0.9. The third line shows a key (foobar) with two values.

http://lucene.apache.org/nutch/apidocs-1.0 foo: 1.0
http://lucene.apache.org/nutch/apidocs-0.9 foo: 0.9
http://lucene.apache.org/nutch/apidocs-0.8 foobar: 0.9 1.0
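
To make the line format concrete, here is a minimal Python sketch that reads one metadata line. It is not the plugin's own parser: the function name `parse_metadata_line` is hypothetical, and the sketch assumes that fields are separated by tab characters and that any field ending in a colon starts a new key.

```python
def parse_metadata_line(line):
    """Split one metadata line into (url, {key: [values]}).

    Assumptions: fields are tab-separated, the first field is the URL,
    a field ending in ':' starts a new key, and the fields that follow
    are the values of that key.
    """
    fields = [f.strip() for f in line.rstrip("\n").split("\t") if f.strip()]
    if not fields:
        raise ValueError("empty metadata line")
    url, rest = fields[0], fields[1:]
    metadata = {}
    current_key = None
    for field in rest:
        if field.endswith(":"):
            # Start a new key; strip the trailing colon.
            current_key = field[:-1]
            metadata.setdefault(current_key, [])
        elif current_key is not None:
            metadata[current_key].append(field)
    return url, metadata
```

For the line `http://lucene.apache.org/nutch/apidocs-0.8<TAB>foobar:<TAB>0.9<TAB>1.0` this returns the URL together with `{"foobar": ["0.9", "1.0"]}`.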

Upload Url Zip Files
Now zip each of these URL files, for example

  • startUrls.txt → startUrls.zip
  • limitUrls.txt → limitUrls.zip
  • excludeUrls.txt → excludeUrls.zip
  • metadataUrls.txt → metadataUrls.zip

and upload these zip files.
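
The zipping step can also be scripted. The following Python sketch packs one URL file into an archive of the same base name, as in the list above. The helper name `zip_url_file` is hypothetical, and storing the text file at the top level of the archive is an assumption about what the upload expects.

```python
import zipfile
from pathlib import Path


def zip_url_file(txt_path):
    """Pack a single URL file into a zip archive next to it; return the archive path."""
    txt_path = Path(txt_path)
    zip_path = txt_path.with_suffix(".zip")  # startUrls.txt -> startUrls.zip
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # Store the file under its bare name, without directory components
        # (assumption: the upload expects the file at the archive's top level).
        archive.write(txt_path, arcname=txt_path.name)
    return zip_path
```

Calling `zip_url_file("startUrls.txt")` in the directory that holds the file creates `startUrls.zip` beside it. The same call works for the limit, exclude, and metadata files.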

Index Metadatas
If you want to index the metadata you have uploaded, read the section Index Metadatas.

Note: Metadata and Black White Filtering are enabled by default. You can disable them on the Admin Configuration screen.
