Description: Mirror of Apache Nutch
Clone URL: git://github.com/apache/nutch.git
commit  d3392953f0e16846f09c87b303028417befaa6b8
tree    06487bbec0ea2a99ee1cad2cfa6ec3d2dc1b4a15
parent  6d13378e222eb55d9df83a26cc5b388a257d2428
nutch /
name age message
file CHANGES.txt Tue Sep 08 06:15:03 -0700 2009 NUTCH-702 - Lazy Instanciation of Metadata in C... [Tacettin Guney]
file KEYS Thu Mar 19 14:26:52 -0700 2009 copy keys to trunk git-svn-id: https://svn.apa... [Sami Siren]
file LICENSE.txt Thu Mar 19 14:09:56 -0700 2009 NUTCH-723 git-svn-id: https://svn.apache.org/r... [Sami Siren]
file NOTICE.txt
file README.txt Sun Aug 09 06:32:25 -0700 2009 + revert first test commit [Marko Bauhardt]
directory bin/ Mon Mar 09 10:34:51 -0700 2009 NUTCH-684 - Dedup support for Solr git-svn-id... [Tacettin Guney]
file build.xml
directory conf/
file default.properties
directory docs/
file index.html Tue Mar 01 14:04:46 -0800 2005 Initial import of Nutch to Apache. git-svn-id:... [Douglass Cutting]
directory lib/ Thu Mar 19 13:27:58 -0700 2009 record license git-svn-id: https://svn.apache.... [Sami Siren]
directory site/
directory src/ Tue Sep 08 06:15:03 -0700 2009 NUTCH-702 - Lazy Instanciation of Metadata in C... [Tacettin Guney]
README.txt
Apache Nutch README

Important note: Due to licensing issues we cannot provide two libraries
(jai_core.jar, jai_codec.jar) that are normally provided with PDFBox, the
library we use for parsing PDF files. If you encounter unexpected problems
when working with PDF files, please:

1. Download the two missing libraries from:
   http://pdfbox.cvs.sourceforge.net/viewvc/pdfbox/pdfbox/external/

2. Put them in the directory src/plugin/parse-pdf/lib.
3. Follow the instructions in the file src/plugin/parse-pdf/plugin.xml.
4. Rebuild Nutch.
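
The copy in step 2 can be scripted. The following is a minimal sketch, not
part of Nutch: the function name install_jai_jars and its two parameters are
hypothetical, and it assumes the jars from step 1 are already in a local
directory. Only the jar names and the target path come from the steps above.
Steps 3 and 4 (editing plugin.xml and rebuilding) still have to be done by
hand.

```python
import shutil
from pathlib import Path

# Jar names as listed in the note above.
JAI_JARS = ("jai_core.jar", "jai_codec.jar")


def install_jai_jars(download_dir, nutch_root):
    """Copy the two JAI jars from download_dir into the parse-pdf lib directory.

    Hypothetical helper: download_dir is wherever the jars were saved,
    nutch_root is the top of the Nutch source tree.
    """
    download_dir = Path(download_dir)
    lib_dir = Path(nutch_root) / "src" / "plugin" / "parse-pdf" / "lib"

    # Fail early, before touching the source tree, if a jar is missing.
    missing = [name for name in JAI_JARS if not (download_dir / name).is_file()]
    if missing:
        raise FileNotFoundError("missing jars: " + ", ".join(missing))

    lib_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in JAI_JARS:
        target = lib_dir / name
        shutil.copy2(download_dir / name, target)
        copied.append(target)
    return copied
```

A call such as install_jai_jars("/tmp/jai", ".") from the top of the source
tree would return the two paths under src/plugin/parse-pdf/lib; the
/tmp/jai location is only an example.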



Interesting files include:


  docs/api/index.html
      Javadocs for the Nutch software.

  CHANGES.txt
      Log of changes to Nutch.


For the latest information about Nutch, please visit our website at:

   http://lucene.apache.org/nutch/

and our wiki, at:

   http://wiki.apache.org/nutch/

To get started using Nutch, read the tutorial:

   http://lucene.apache.org/nutch/tutorial.html
   
Export Control

This distribution includes cryptographic software.  The country in which you 
currently reside may have restrictions on the import, possession, use, and/or 
re-export to another country, of encryption software.  BEFORE using any encryption 
software, please check your country's laws, regulations and policies concerning the
import, possession, or use, and re-export of encryption software, to see if this is 
permitted.  See <http://www.wassenaar.org/> for more information. 

The U.S. Government Department of Commerce, Bureau of Industry and Security (BIS), has 
classified this software as Export Commodity Control Number (ECCN) 5D002.C.1, which 
includes information security software using or performing cryptographic functions with 
asymmetric algorithms.  The form and manner of this Apache Software Foundation 
distribution makes it eligible for export under the License Exception ENC Technology 
Software Unrestricted (TSU) exception (see the BIS Export Administration Regulations, 
Section 740.13) for both object code and source code.

The following provides more details on the included cryptographic software:

Apache Nutch uses the PDFBox API in its parse-pdf plugin for extracting textual content 
and metadata from encrypted PDF files. See http://incubator.apache.org/pdfbox/ for more 
details on PDFBox.