What is herd

Nate Weisz edited this page Sep 30, 2015 · 2 revisions

Herd efficiently tracks and catalogs data in a unified data repository accessible via web service APIs. The repository captures audit and data lineage information to fulfill the requirements of data-driven and highly regulated business environments. Users across an organization can programmatically access this data to enable data processing with heterogeneous processing tools.

Core Design Tenets

  • Operate in big data environment

Herd inherently handles big data by understanding partitioning schemes and capturing meta-data about formats and usage in your organization's data pipeline. Herd is aware of storage tiers and archiving, and allows the data lifecycle to be managed across these dimensions.
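To make the partitioning idea concrete, here is a minimal sketch of how partition-aware meta-data lets a catalog map a logical partition to a physical storage location. All names and the path layout below are illustrative assumptions, not herd's actual data model.

```python
# Illustrative only: maps one registered partition of a dataset to an S3
# key prefix, the way a partition-aware catalog resolves logical partitions
# to physical storage. Names and layout are hypothetical, not herd's.

def partition_prefix(namespace, object_name, partition_key, partition_value):
    """Build an S3 key prefix for one partition of a registered dataset."""
    return f"{namespace}/{object_name}/{partition_key}={partition_value}"

prefix = partition_prefix("sales", "daily-orders", "transaction-date", "2015-09-30")
# -> "sales/daily-orders/transaction-date=2015-09-30"
```

Because the catalog knows the partition key and value, consumers can locate exactly the data they need without scanning the whole dataset.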

  • Separate storage and processing

Herd was created to manage data independently of how that data is processed. This is different from data management systems designed for monolithic clusters or specific technologies. Herd is built around storing data in a system dedicated to reliability and durability, such as Amazon S3. The data is then independent of any platform or processing engine. Herd meta-data contains enough information to describe the format and allow access from any data processing platform. This encourages the use of best-of-breed processing tools in a rapidly evolving ecosystem.

  • Handle enterprise meta-data requirements

Many data-driven enterprises require special attention to handling data -- the meta-data is just as important as the data itself. Herd understands data lineage, allowing the impact of changes in upstream datasets to be managed. Herd also supports data and format versioning to capture changes over time.
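The value of recorded lineage is that downstream impact can be computed automatically. The sketch below is a hypothetical illustration of that idea (a simple graph walk), not herd's actual lineage model or API.

```python
# Illustrative sketch: given lineage links recorded at registration time,
# find every dataset affected by a change to one upstream dataset.
# The data structure is hypothetical, not herd's actual model.

def downstream_impact(lineage, changed):
    """lineage maps each dataset to the list of its upstream parents."""
    impacted = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for dataset, parents in lineage.items():
            if current in parents and dataset not in impacted:
                impacted.add(dataset)
                frontier.append(dataset)
    return impacted

lineage = {"report": ["orders-clean"], "orders-clean": ["orders-raw"]}
downstream_impact(lineage, "orders-raw")
# -> {"orders-clean", "report"}
```

A change to `orders-raw` is traced through `orders-clean` to `report`, which is exactly the kind of question lineage meta-data exists to answer.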

  • Run on any platform

Herd currently runs on AWS and utilizes S3 storage but is architected to run on other platforms in future releases.

herd Features

  • Data Catalog APIs
      • Register object definitions and formats
      • Register data
      • Determine data availability
      • Generate DDL
  • Cluster management APIs
      • Create cluster definition
      • Add cluster steps
      • Start/stop cluster
      • Get cluster status
  • Job Orchestration - BPMN workflow engine
      • Define workflow from API steps
      • Start/Stop/Signal workflow
      • Get workflow status
      • Notification of registration event can start workflow
  • Administrative APIs
      • Define storage platforms
      • Define meta-data structure elements
  • Tools to upload/register and download large sets of files
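As a flavor of how the Data Catalog APIs are used, here is a hedged sketch of constructing a data-availability request for a catalog-style REST service. The endpoint path and field names are hypothetical placeholders chosen for illustration; consult the herd REST API documentation for the real ones.

```python
# Hypothetical sketch of building a data-availability REST request.
# The URL path and JSON field names are placeholders, not herd's
# documented API.

import json

def availability_request(base_url, namespace, object_name, partition_values):
    """Build the URL and JSON body for a data-availability check."""
    url = f"{base_url}/businessObjectData/availability"
    body = {
        "namespace": namespace,
        "businessObjectDefinitionName": object_name,
        "partitionValues": partition_values,
    }
    return url, json.dumps(body)

url, body = availability_request("https://herd.example.com/rest", "sales",
                                 "daily-orders", ["2015-09-29", "2015-09-30"])
```

A consumer would POST this body, inspect which partitions are available, and only then request DDL and start processing.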

Classic herd Use Case

  1. Team A creates and registers a dataset in herd on a regular basis.
  2. Team B creates an automated workflow consisting of the following steps:
      a. Verify availability of the required data partition(s) registered by Team A.
      b. Retrieve Hive DDL for the required partitions.
      c. Start a cluster.
      d. Add Hive queries that process the data and create a new dataset.
      e. Register the new dataset, including its lineage.
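The steps in Team B's workflow above can be sketched as an ordered sequence of API-driven steps. The function names below are illustrative stand-ins for the corresponding herd API calls, not a real client library.

```python
# Minimal sketch of Team B's workflow as an ordered list of API-driven
# steps. Step names and arguments are illustrative placeholders, not
# herd's actual client API.

def run_workflow(partition_date):
    """Return the ordered steps the workflow would execute."""
    steps = []
    steps.append(("check_availability", partition_date))   # Team A's partitions
    steps.append(("generate_hive_ddl", partition_date))    # DDL for those partitions
    steps.append(("start_cluster", "processing-cluster"))  # bring up compute
    steps.append(("add_hive_step", "create_new_dataset.hql"))
    steps.append(("register_data", "new-dataset"))         # includes lineage
    return steps

run_workflow("2015-09-30")
```

In herd itself, the equivalent workflow would be defined as a BPMN process and could be triggered automatically by a notification when Team A registers new data.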