Web crawler SDK based on Apache Storm
Crawl configuration for benchmarking StormCrawler
Mirror of Apache Storm
Behemoth is an open source platform for large-scale document analysis based on Apache Hadoop.
WARC resources for StormCrawler
Azazello is an open source platform for large-scale document analysis based on Apache Spark
A set of reusable Java components that implement functionality common to any web crawler
Mirror of Apache Nutch
Setup for crawling tescobank with StormCrawler
Use cases for DigitalPebble's TextClassification API
A Text Classification API in Java, originally developed by DigitalPebble Ltd. The API is independent of the underlying ML implementations and can serve as a front end to various ML algorithms. libSVM and liblinear are currently embedded.
Support for the old (pre-2013) CommonCrawl dataset in Behemoth
Resources for generating a corpus of documents from CommonCrawl for Tika
Elasticsearch real-time search and analytics natively integrated with Hadoop
A library of tools for interacting with RabbitMQ from Storm.
Resources for comparing versions 1.8 and 2.x of Apache Nutch
Elasticsearch module for Behemoth
Module for classifying Behemoth documents with a model from our Text Classification API
GATE Processing Resource wrapping DigitalPebble's TextClassification API
Java API for querying an N-gram corpus. Uses Lucene to index and search data in the Google Web-1T format