cwensel / bixo forked from emi/bixo

A creepy crawler

This URL has Read+Write access

Ken Krugler (author)
Mon Mar 30 15:57:37 -0700 2009
commit  2b527bf633b053140c39d724dff4ef0c2e307cb5
tree    16d5d6ee7423934dc0eee7352bd0cc2c1e511972
parent  b83b56e30308b3445f3e487723332b76b7ed5e37
bixo /
name age message
file README Loading commit data...
file build.xml
directory conf/
directory doc/
directory ivy/
directory lib/
directory src/
README
===============================
Introduction
===============================

Bixo is an open source Java crawler that runs as a series of Cascading
pipes. It is designed to be used as a tool for creating customized
crawlers, thus each Cascading pipe implements a discrete operation. By
building a customized Cascading pipe assembly, you can quickly create
specialized crawlers that are optimized for a particular use case.

Bixo borrows heavily from the Apache Nutch project, as well as many other
open source projects at Apache and elsewhere.

Bixo is released under the MIT license.

===============================
Building
===============================

To build, you first need:

1. A recent release of Cascading that works with Hadoop 0.19.0+

http://cascading.googlecode.com/files/cascading-1.0.6-hadoop-0.19.0%2B.tgz

2. A recent release of Hadoop 0.19

http://www.apache.org/dyn/closer.cgi/hadoop/core/

3. A build.properties file in the same directory as the build.xml file

This file should contain:

hadoop.home=<path to Hadoop you just downloaded>
cascading.home=<path to Cascading you just downloaded>

Note that you can't use user-relative paths here, e.g. ~/<path> won't work.
They need to be either absolute or relative to the project directory.

4. Run ant test to compile & test the code.