IBM Watson Discovery Service IndexWriter plugin for Apache Nutch

Windows

If you are running on Windows, please follow the README here

Requirements

Java (OpenJDK 8 or Oracle JDK). HBase, ZooKeeper, Nutch, Ant, and Gradle are also required, but they are installed for you when you set up the Gradle wrapper.
Set the JAVA_HOME environment variable.

On macOS, JAVA_HOME can be found at (or in a similar location): /Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home

On Linux, JAVA_HOME can be found at (or in a similar location): /usr/lib/jvm/java-8-openjdk-amd64
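
As a minimal sketch, the variable can be set for the current shell session as follows. The path is the Linux example from above and is only an assumption about where your JDK lives; substitute your own location.

```shell
# Example for Linux, using the path mentioned above; adjust to your JDK location.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH="$JAVA_HOME/bin:$PATH"

# Warn (without failing) if no java executable is found at that location.
if [ ! -x "$JAVA_HOME/bin/java" ]; then
  echo "warning: no java executable under $JAVA_HOME/bin" >&2
fi
```

To make the setting permanent, the two export lines can be added to your shell profile.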

Set up Gradle wrapper

./gradlew

Setting up HBase

You can now use the built-in Gradle task to set up HBase.

./gradlew setupHbase

This creates directories within the project directory to store HBase and ZooKeeper data, and downloads hbase-0.98.8-hadoop2 into the build directory.

Now start the HBase service. Go to the HBase bin directory, projectDir/build/hbase-0.98.8-hadoop2/bin/, and run:

./start-hbase.sh
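
To confirm that HBase came up, one option is to look for its master process in the JVM process list. The helper below is a hypothetical sketch, not part of this project; it assumes that HBase in this setup appears as a process named HMaster in the output of the JDK's jps tool.

```shell
# hmaster_listed: succeeds if the process list read from stdin
# contains an entry named HMaster (hypothetical helper for illustration).
hmaster_listed() {
  grep -qw 'HMaster'
}

# Usage (requires the JDK's jps tool on PATH):
#   jps | hmaster_listed && echo "HBase is running"
```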

Setting up Nutch

Download and extract apache-nutch-2.3.1 into the build directory:

./gradlew setupNutch

Then edit conf/nutch-discovery/nutch-site.xml to add your Discovery credentials. The values for the first three properties (endpoint, username, and password) are provided by the Discovery service. The others (configuration ID, environment ID, API version, and collection ID) are specific to your instance of the Discovery service.

Note: If your Discovery service instance requires IAM authentication, set discovery.username to apikey and discovery.password to the value of the API key.

  <property>
    <name>discovery.endpoint</name>
    <value></value>
  </property>
  <property>
    <name>discovery.username</name>
    <value></value>
  </property>
  <property>
    <name>discovery.password</name>
    <value></value>
  </property>
  <property>
    <name>discovery.configuration.id</name>
    <value></value>
  </property>
  <property>
    <name>discovery.environment.id</name>
    <value></value>
  </property>
  <property>
    <name>discovery.api.version</name>
    <value></value>
  </property>
  <property>
    <name>discovery.collection.id</name>
    <value></value>
  </property>
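
Because every property above starts out empty, it can help to check that none was missed before building. The helper below is a hypothetical sketch, not part of this project; it assumes each value sits on a single line as `<value></value>`, as in the template above.

```shell
# count_empty_values FILE: print how many <value></value> entries are still
# empty in FILE (hypothetical helper; assumes one value element per line).
count_empty_values() {
  grep -c '<value></value>' "$1" || true
}
```

Running `count_empty_values conf/nutch-discovery/nutch-site.xml` should print 0 once all seven properties are filled in, provided the file has no other empty values.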

To build the plugin, run:

./gradlew buildPlugin

This takes about 4-5 minutes to complete. Once it finishes, everything is set up to crawl websites.

Adding new Domains to crawl with Nutch

List the seed URLs, one per line, in the text file seed/urls.txt.

$ mkdir -p $projectDir/seed
$ echo "https://en.wikipedia.org/wiki/Apache_Nutch" >> $projectDir/seed/urls.txt
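
Appending with >> adds the same URL again each time the command is repeated. As a small sketch, a hypothetical helper (not part of this project) can skip URLs that are already listed:

```shell
# add_seed FILE URL: append URL to FILE unless that exact line is already there.
# Hypothetical helper for illustration; creates FILE and its directory if needed.
add_seed() {
  file="$1"
  url="$2"
  mkdir -p "$(dirname "$file")"
  touch "$file"
  # -x matches whole lines, -F treats the URL as a literal string.
  grep -qxF -- "$url" "$file" || printf '%s\n' "$url" >> "$file"
}
```

For example, `add_seed "$projectDir/seed/urls.txt" "https://en.wikipedia.org/wiki/Apache_Nutch"` leaves the file unchanged if that URL is already present.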

Crawl and Index

$projectDir/crawl

Note: On the first run, this will only crawl the injected URLs. Repeat the procedure regularly to keep the index up to date.
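
One way to repeat the crawl is a small wrapper that runs the script several times in a row. This is only a sketch: run_crawl_rounds and the CRAWL_CMD override are hypothetical names introduced here, and the crawl script is invoked without arguments, exactly as shown above.

```shell
# run_crawl_rounds N: run the crawl script N times, stopping at the first failure.
# CRAWL_CMD is a hypothetical override for illustration; by default the crawl
# script in the project directory is used.
run_crawl_rounds() {
  rounds="$1"
  cmd="${CRAWL_CMD:-${projectDir:-.}/crawl}"
  i=1
  while [ "$i" -le "$rounds" ]; do
    echo "crawl round $i of $rounds"
    "$cmd" || return 1
    i=$((i + 1))
  done
}
```

For example, `run_crawl_rounds 3` performs three consecutive rounds. For unattended use, the same call could be placed in a scheduler such as cron.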
