Project Description

What is Hydra?

When working with free text search using, for example, Apache Solr, the quality of the data in the index is a key factor in the quality of the results delivered. Hydra is designed to give the search solution the tools necessary to modify the data to be indexed in an efficient and flexible way. It does this by providing a scalable and efficient pipeline that documents pass through before being indexed into the search engine.

Architecturally, Hydra sits between the search engine and the source integration. A common use case would be to use Apache ManifoldCF to crawl a folder on a file system and send the documents to Hydra, which in turn processes them and dispatches the processed documents to Solr for indexing.

Description

The pipeline framework is designed to be easily distributed, very flexible, and to allow easy testing and development. Because of its distributed data-crunching nature, we decided to name it Hydra.

Hydra is designed as a fully distributed, persistent and flexible document processing pipeline. It has one central repository, currently an instance of a MongoDB document store, that can run on a single machine or entirely in the cloud. A worker node reads a pipeline configuration from this central repository; all processing stages are packaged as separate JAR files. The stages in those JAR files are then launched by the main process as separate JVM instances for isolation purposes. This is perhaps the most controversial design decision, but it was made to ensure that problems such as Tika leaking memory and running out of heap space on a problematic document do not bring the whole pipeline to a halt. The design can of course be extended to run multiple stages in the same JVM.

All communication between the stages and the core framework happens via REST. Because of this, a processing stage can be tested during development by simply running it from any IDE, such as Eclipse, and pointing it at an active node. This eliminates the need for the time-consuming WAR packaging/deployment found in e.g. OpenPipeline.
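To illustrate how lightweight a stage is during development, below is a minimal sketch of a stage running as a plain Java process against an active node. The endpoint paths (/getDocument, /writeDocument), the stage name and the port are assumptions made for the example and are not Hydra's actual REST API; the point is only that a stage is an ordinary process talking HTTP, so it can be started straight from an IDE.

```java
import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch only: the endpoints, stage name and port below are assumptions,
// not Hydra's actual REST API. It shows a stage as a plain process that
// asks an active node for work over HTTP and hands the result back.
public class StandaloneStageSketch {

    private static final String NODE = "http://localhost:12001"; // assumed node address

    public static void main(String[] args) throws IOException {
        // Ask the node for the next document this stage should process.
        String doc = get(NODE + "/getDocument?stage=myStage");
        if (doc != null) {
            // ... modify the document here, then hand it back to the node.
            post(NODE + "/writeDocument?stage=myStage", doc);
        }
    }

    private static String get(String url) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        if (con.getResponseCode() != 200) return null;
        BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream(), "UTF-8"));
        StringBuilder sb = new StringBuilder();
        for (String line; (line = in.readLine()) != null; ) sb.append(line);
        in.close();
        return sb.toString();
    }

    private static void post(String url, String body) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        OutputStream out = con.getOutputStream();
        out.write(body.getBytes("UTF-8"));
        out.close();
        con.getResponseCode(); // force the request to be sent
    }
}
```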

Hydra was also designed to make the development of a particular pipeline easy, even allowing for test-driven development (TDD) of a pipeline, e.g. by specifying a valid document domain model that is enforced before output from the pipeline takes place.
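As a sketch of what such test-driven development could look like, the JUnit test below treats a stage as a plain function over document fields and asserts the domain-model rule it enforces. The Map-based document and the process() helper are illustrative assumptions, not Hydra's actual classes.

```java
import static org.junit.Assert.assertEquals;
import java.util.HashMap;
import java.util.Map;
import org.junit.Test;

// Sketch: a "stage" reduced to a function over document fields, so the
// domain-model rule (every outgoing document has a non-empty title) can
// be pinned down by tests before the stage is wired into the pipeline.
public class TitleNormalizerStageTest {

    static Map<String, Object> process(Map<String, Object> doc) {
        Object title = doc.get("title");
        if (title == null || title.toString().trim().isEmpty()) {
            throw new IllegalArgumentException("document violates domain model: missing title");
        }
        doc.put("title", title.toString().trim());
        return doc;
    }

    @Test
    public void trimsTitleField() {
        Map<String, Object> doc = new HashMap<String, Object>();
        doc.put("title", "  Project Hydra  ");
        assertEquals("Project Hydra", process(doc).get("title"));
    }

    @Test(expected = IllegalArgumentException.class)
    public void rejectsDocumentsOutsideTheDomainModel() {
        process(new HashMap<String, Object>());
    }
}
```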

The stages, each running in its own JVM, fetch the documents relevant to their processing purpose from MongoDB: a “Static Writer” stage would fetch any and all documents, while a more specialized stage might only fetch documents that have, or lack, a certain field. This allows the configuration of both the classic “straight” pipeline, where all documents pass through all processing stages (in order, if necessary), and asymmetrical pipelines that fork depending on the content of the document.
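The sketch below, using the MongoDB Java driver of that era, shows the kind of query a stage could use to pick up only the documents it cares about. The database, collection and field names are assumptions made for the example, not Hydra's actual schema.

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.Mongo;

// Sketch: the "pipeline" database, "documents" collection and the
// "fields.*" layout are assumptions, not Hydra's actual schema.
public class FetchByFieldSketch {
    public static void main(String[] args) throws Exception {
        DBCollection documents = new Mongo("localhost").getDB("pipeline").getCollection("documents");

        // A "Static Writer" style stage: match any document.
        DBObject anyDocument = new BasicDBObject();

        // A specialized stage: only documents that already carry raw content
        // but still lack an extracted "title" field.
        DBObject needsTitle = new BasicDBObject("fields.rawContent", new BasicDBObject("$exists", true))
                .append("fields.title", new BasicDBObject("$exists", false));

        System.out.println(documents.findOne(anyDocument));
        System.out.println(documents.findOne(needsTitle));
    }
}
```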

All administration of the pipeline, as well as traceability, is handled directly through an administrator interface communicating with the central repository (MongoDB). Adding processing power to the pipeline is as simple as starting a new worker node on another machine, pointing it at the same pipeline configuration in MongoDB.

Design Goals

Hydra's key design goals are to be:
  • scalable: the central repository as well as the number of worker nodes can scale horizontally with little to no performance loss.
  • distributed: any processing node can work on any document; a single document may be processed on any number of physical machines.
  • fail-safe: if a processing node goes down, this does not affect the documents in the pipeline, which are persisted centrally, and any other node can simply and automatically pick up where the other left off.
  • robust: all stages run in separate JVMs, which allows, for instance, Tika to crash in its own JVM and be restarted automatically, without stopping the processing pipeline for less problematic documents.
  • easy to use/configure: stages can be run from your IDE during development, allowing testing against the actual data in the repository.