Pinned repositories

  1. behemoth

    Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.

    Java 279 61

  2. storm-crawler

    Web crawler SDK based on Apache Storm

    Java 437 165

  • Web crawler SDK based on Apache Storm

    Java 437 165 Apache-2.0 Updated Aug 15, 2018
  • Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.

    Java 279 61 Updated Apr 25, 2018
  • Crawl configuration for benchmarking StormCrawler

    FLUX 3 3 Apache-2.0 Updated Apr 9, 2018
  • A set of reusable Java components that implement functionality common to any web crawler

    Java 4 43 Apache-2.0 Updated Apr 4, 2017
  • storm

    Forked from apache/storm

    Mirror of Apache Storm

    Java 3,682 Apache-2.0 Updated Feb 27, 2017
  • WARC resources for StormCrawler

    2 1 Updated Oct 20, 2016
  • Azazello is an open source platform for large scale document analysis based on Apache Spark

    Java 7 1 Apache-2.0 Updated Apr 20, 2016
  • nutch

    Forked from apache/nutch

    Mirror of Apache Nutch

    Java 1,100 Apache-2.0 Updated Nov 25, 2015
  • Setup for crawling tescobank with SC

    Java 4 1 Apache-2.0 Updated Sep 23, 2015
  • Use cases for DigitalPebble's TextClassification API

    Java 9 2 Apache-2.0 Updated Sep 1, 2015
  • A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and can be used as a front end to various ML algorithms. libSVM and liblinear are currently embedded.

    Java 40 20 Apache-2.0 Updated Sep 1, 2015
  • Support for old (pre 2013) CommonCrawl dataset in Behemoth

    Java 4 Updated Apr 20, 2015
  • Resources for generating a corpus of docs from CC for Tika

    Shell Updated Nov 28, 2014
  • Elasticsearch real-time search and analytics natively integrated with Hadoop

    Java 1 719 Apache-2.0 Updated Sep 29, 2014
  • Resources for comparison between 1.8 and 2.x of Apache Nutch

    Java 4 Apache-2.0 Updated Jun 4, 2014
  • ElasticSearch module for Behemoth

    Java 1 Updated Feb 12, 2014
  • Module for classifying Behemoth documents with a model from our Text Classification API

    Java 1 Updated Nov 22, 2012
  • GATE Processing Resource wrapping DigitalPebble's TextClassification API

    Java 5 2 Updated Jul 12, 2012
  • Java API for querying an N-grams corpus. Uses Lucene for searching and indexing from the Google Web-1T format

    Java 4 1 Updated Apr 27, 2012