What is herd

Nate Weisz edited this page Sep 30, 2015 · 2 revisions

Herd efficiently tracks and catalogs data in a unified data repository accessible via web service APIs. The repository captures audit and data lineage information to fulfill the requirements of data-driven and highly regulated business environments. Users across an organization can programmatically access this data to enable data processing with heterogeneous processing tools.

Core Design Tenets

  • Operate in big data environment

Herd inherently handles big data by understanding partitioning schemes and capturing meta-data about formats and usage in your organization's data pipeline. Herd is aware of storage tiers and archiving, and allows the data lifecycle to be managed across these dimensions.
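To make the partitioning idea concrete, here is a minimal sketch of how partition-aware meta-data lets a catalog map a logical partition to a physical storage location. All names and the path layout below are illustrative assumptions, not herd's actual data model.

```python
# Illustrative only: maps one registered partition of a dataset to an S3
# key prefix, the way a partition-aware catalog resolves logical partitions
# to physical storage. Names and layout are hypothetical, not herd's.

def partition_prefix(namespace, object_name, partition_key, partition_value):
    """Build an S3 key prefix for one partition of a registered dataset."""
    return f"{namespace}/{object_name}/{partition_key}={partition_value}"

prefix = partition_prefix("sales", "daily-orders", "transaction-date", "2015-09-30")
# -> "sales/daily-orders/transaction-date=2015-09-30"
```

Because the catalog knows the partition key and value, consumers can locate exactly the data they need without scanning the whole dataset.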

  • Separate storage and processing

Herd was created to manage data independently of how that data is processed. This is different from data management systems designed for monolithic clusters or specific technologies. Herd is built around storing data in a system dedicated to reliability and durability, such as Amazon S3. The data is then independent of any platform or processing engine. Herd meta-data contains enough information to describe the format and allow access from any data processing platform. This encourages the use of best-of-breed processing tools in a rapidly evolving ecosystem.

  • Handle enterprise meta-data requirements

Many data-driven enterprises require special attention to handling data -- the meta-data is just as important as the data itself. Herd understands data lineage, allowing the impact of changes in upstream datasets to be managed. Herd also supports data and format versioning to capture changes over time.
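The value of recorded lineage is that downstream impact can be computed automatically. The sketch below is a hypothetical illustration of that idea (a simple graph walk), not herd's actual lineage model or API.

```python
# Illustrative sketch: given lineage links recorded at registration time,
# find every dataset affected by a change to one upstream dataset.
# The data structure is hypothetical, not herd's actual model.

def downstream_impact(lineage, changed):
    """lineage maps each dataset to the list of its upstream parents."""
    impacted = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for dataset, parents in lineage.items():
            if current in parents and dataset not in impacted:
                impacted.add(dataset)
                frontier.append(dataset)
    return impacted

lineage = {"report": ["orders-clean"], "orders-clean": ["orders-raw"]}
downstream_impact(lineage, "orders-raw")
# -> {"orders-clean", "report"}
```

A change to `orders-raw` is traced through `orders-clean` to `report`, which is exactly the kind of question lineage meta-data exists to answer.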

  • Run on any platform

Herd currently runs on AWS and utilizes S3 storage but is architected to run on other platforms in future releases.

herd Features

  • Data Catalog APIs
      • Register object definitions and formats
      • Register data
      • Determine data availability
      • Generate DDL
  • Cluster management APIs
      • Create cluster definition
      • Add cluster steps
      • Start/stop cluster
      • Get cluster status
  • Job Orchestration - BPMN workflow engine
      • Define workflow from API steps
      • Start/Stop/Signal workflow
      • Get workflow status
      • Notification of registration event can start workflow
  • Administrative APIs
      • Define storage platforms
      • Define meta-data structure elements
  • Tools to upload/register and download large sets of files
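As a flavor of how the Data Catalog APIs are used, here is a hedged sketch of constructing a data-availability request for a catalog-style REST service. The endpoint path and field names are hypothetical placeholders chosen for illustration; consult the herd REST API documentation for the real ones.

```python
# Hypothetical sketch of building a data-availability REST request.
# The URL path and JSON field names are placeholders, not herd's
# documented API.

import json

def availability_request(base_url, namespace, object_name, partition_values):
    """Build the URL and JSON body for a data-availability check."""
    url = f"{base_url}/businessObjectData/availability"
    body = {
        "namespace": namespace,
        "businessObjectDefinitionName": object_name,
        "partitionValues": partition_values,
    }
    return url, json.dumps(body)

url, body = availability_request("https://herd.example.com/rest", "sales",
                                 "daily-orders", ["2015-09-29", "2015-09-30"])
```

A consumer would POST this body, inspect which partitions are available, and only then request DDL and start processing.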

Classic herd Use Case

  1. Team A creates and registers a dataset in herd on a regular basis.
  2. Team B creates an automated workflow consisting of the following steps:
      a. Verify availability of the required data partition(s) registered by Team A.
      b. Retrieve Hive DDL for the required partitions.
      c. Start a cluster.
      d. Add Hive queries that process the data and create a new dataset.
      e. Register the new dataset, including its lineage.
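The steps in Team B's workflow above can be sketched as an ordered sequence of API-driven steps. The function names below are illustrative stand-ins for the corresponding herd API calls, not a real client library.

```python
# Minimal sketch of Team B's workflow as an ordered list of API-driven
# steps. Step names and arguments are illustrative placeholders, not
# herd's actual client API.

def run_workflow(partition_date):
    """Return the ordered steps the workflow would execute."""
    steps = []
    steps.append(("check_availability", partition_date))   # Team A's partitions
    steps.append(("generate_hive_ddl", partition_date))    # DDL for those partitions
    steps.append(("start_cluster", "processing-cluster"))  # bring up compute
    steps.append(("add_hive_step", "create_new_dataset.hql"))
    steps.append(("register_data", "new-dataset"))         # includes lineage
    return steps

run_workflow("2015-09-30")
```

In herd itself, the equivalent workflow would be defined as a BPMN process and could be triggered automatically by a notification when Team A registers new data.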