Pinned repositories

  1. behemoth

    Behemoth is an open source platform for large-scale document analysis based on Apache Hadoop.

    Java 254 55

  2. storm-crawler

    Web crawler SDK based on Apache Storm

    Java 277 105

  • Web crawler SDK based on Apache Storm

    java web-crawler distributed apache-storm

    Java 277 105 Updated Mar 28, 2017
  • Crawl configuration for benchmarking StormCrawler

    Shell Updated Mar 27, 2017
  • storm

    Forked from apache/storm

    Mirror of Apache Storm

    Java 2,997 Updated Feb 27, 2017
  • Behemoth is an open source platform for large-scale document analysis based on Apache Hadoop.

    java nlp hadoop mapreduce

    Java 254 55 Updated Nov 24, 2016
  • WARC resources for StormCrawler

    2 1 Updated Oct 20, 2016
  • Azazello is an open source platform for large-scale document analysis based on Apache Spark

    Java 7 1 Updated Apr 20, 2016
  • A set of reusable Java components that implement functionality common to any web crawler

    Java 1 26 Updated Dec 3, 2015
  • nutch

    Forked from apache/nutch

    Mirror of Apache Nutch

    Java 795 Updated Nov 25, 2015
  • Setup for crawling tescobank with StormCrawler (SC)

    Java 3 1 Updated Sep 23, 2015
  • Use cases for DigitalPebble's TextClassification API

    Java 6 2 Updated Sep 1, 2015
  • A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and can be used as a front end to various ML algorithms. libSVM and liblinear are currently embedded.

    Java 37 19 Updated Sep 1, 2015
  • Support for the old (pre-2013) CommonCrawl dataset in Behemoth

    Java 4 Updated Apr 20, 2015
  • Resources for generating a corpus of docs from CC for Tika

    Shell Updated Nov 28, 2014
  • Elasticsearch real-time search and analytics natively integrated with Hadoop

    Java 1 531 Updated Sep 29, 2014
  • A library of tools for interacting with RabbitMQ from Storm.

    Java 54 Updated Jul 15, 2014
  • Resources for comparing versions 1.8 and 2.x of Apache Nutch

    Java 2 1 Updated Jun 4, 2014
  • ElasticSearch module for Behemoth

    Java 1 Updated Feb 12, 2014
  • Module for classifying Behemoth documents with a model from our Text Classification API

    Java 1 Updated Nov 22, 2012
  • GATE Processing Resource wrapping DigitalPebble's TextClassification API

    Java 5 2 Updated Jul 12, 2012
  • Java API for querying an N-grams corpus. Uses Lucene for searching and indexing from the Google Web-1T format

    Java 4 2 Updated Apr 27, 2012