Olof Nilsson edited this page Mar 12, 2014 · 1 revision
  • Follow MongoDB hardware recommendations
  • Make sure that there are no field names containing dots (MongoDB does not support names with dots)
  • If you want to run Hydra as a service, use Procrun
  • For each stage, you can specify a JVM configuration (for example to specify different RAM usage). This is done using jvm_parameters in the stage configuration file.
  • Push documents into MongoDB rather than Hydra to ensure multiple Cores can share the document load
  • Use SLF4J instead of Logger
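
As a sketch of the per-stage JVM configuration mentioned above: `jvm_parameters` is the key named on this page, but the `stageClass` value and the overall shape of the file are illustrative assumptions, so verify them against your stage configuration files.

```json
{
  "stageClass": "com.findwise.hydra.stage.ExampleStage",
  "jvm_parameters": "-Xmx512m -Xms128m"
}
```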

Scalability

  1. The simplest way to scale is to run multiple threads for the stages that are slow.
  • In the configuration file of a stage, simply add or change the numberOfThreads parameter.
  • To profile how much time each stage spends processing documents, and to identify which stages are slowing down the pipeline, process some documents and then look in MongoDB at the fetched and touched timestamps for each stage. These timestamps give you an idea of how much time each stage consumes.
  • A note about caching: multiple Hydra instances have separate caches, but multiple threads within a stage share the same cache.
  2. The next approach is to keep one MongoDB instance but run multiple Hydra instances (one per source, for example; in that case, make sure each source has its own pipeline).
  • To start a new Hydra instance, you just need to point it to the right MongoDB (change the IP address in the resource.properties file located in the same folder as hydra-core.jar).
  3. If the above solutions are not good enough, scale MongoDB out, for example by separating documents into different databases based on language.
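
The numberOfThreads parameter from step 1 could look like this in a stage configuration file (numberOfThreads is the parameter named on this page; the other keys are illustrative assumptions):

```json
{
  "stageClass": "com.findwise.hydra.stage.ExampleStage",
  "numberOfThreads": 4
}
```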

Admin interface

There is an admin service (hydra-admin) exposing a REST API and a GUI. With this service you can, for example, add stages, inspect documents, find documents stuck in the pipeline, and refetch documents for a specific stage.

Hydra core configuration

  • If you look in the directory where hydra-core.jar resides, there should be a resource.properties file. This is where you configure Hydra to run multiple instances; you can also change the name of the pipeline, the password for MongoDB, the cache timeout, and the maxcount and maxsize limits for old documents, etc.
    • You would need to restart Hydra for changes to be applied
  • How the stage configuration works:
    • Hydra automatically checks for new stages every 10 seconds (in a production environment you may not need this to happen so often)
    • A stage is reinitialised when you pass it a new configuration
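
A minimal resource.properties sketch covering the settings mentioned above might look as follows. All key names here are hypothetical stand-ins; check the file shipped next to hydra-core.jar for the actual ones.

```properties
# Hypothetical key names for illustration - verify against your resource.properties
pipeline.name = mypipeline
mongodb.host = 192.168.1.42
mongodb.password = secret
cache.timeout = 10000
```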

Handling failed documents

You might want to see which documents have failed during processing so that you can handle these cases (for example by adding the documents to a separate queue and trying to process them again later). To check which documents have failed, take a look at com.findwise.hydra.mongodb.MongoTailableIterator. Some considerations:

  • One of the reasons a document might fail is its size: the maximum document size in MongoDB is 16 MB, and if this is a worry, increase the cache timeouts. A recent fix (at the time of writing) removes the ten largest fields from an over-sized document when it is written back to olddocuments in the output stage.
  • There are two settings (configurable in the resource.properties file) for the storage of processed documents: the maximum number of old processed documents to keep (defaulting to 2000) and the maximum size in megabytes that the old processed documents are allowed to reach (defaulting to 200 MB)
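
In resource.properties, these two limits might appear roughly like this (the key names below are hypothetical; only the defaults of 2000 documents and 200 MB come from this page):

```properties
# Hypothetical key names - check the shipped resource.properties for the real ones
old.max.count = 2000
old.max.size = 200
```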