Myriad Data Generator Toolkit
Myriad is a development toolkit for scalable data generators. Generating large synthetic datasets that conform to a given schema and a set of statistical constraints is a challenging yet increasingly important task, especially for benchmarking and testing web-scale data management systems (e.g. Hadoop) and parallel RDBMSs (e.g. DB2).
The Myriad Toolkit aims to simplify this process by providing a fast and easy way to develop data generators that can generate statistically dependent data in parallel on a set of independently running nodes.
The Myriad Toolkit consists of two main components:
- a generic C++ runtime library for scalable data generation, and
- a Python prototype compiler that generates library extensions from an XML prototype specification of a user-defined data generator.
Through the use of a compact XML specification language, Myriad users can define the domain model to be generated as a family of user-defined domain types, and the associated data generation logic as a corresponding family of pseudo-random domain type generators (PRDGs).
In essence, PRDGs are functions that transform a sequence of pseudo-random numbers into a sequence of pseudo-random domain type records. PRDGs are specified as chains of setter functions, each one responsible for assigning a fixed-length substream of values to one or more record fields. The Myriad Toolkit provides a range of built-in primitive setters that realize various statistical properties (e.g. single-field value distributions or value dependencies between record fields).
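The setter-chain idea can be illustrated with a minimal Python sketch. This is not Myriad's API: the `Customer` type, the `set_age` and `set_segment` setters, the seed derivation, and the use of Python's `random.Random` as the random source are all assumptions made for illustration. Each setter draws exactly one value, which mimics a fixed-length substream per setter, and the second setter shows a value dependency between two fields.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Customer:
    """Hypothetical domain type with two generated fields."""
    position: int = 0
    age: int = 0
    segment: str = ""


# A setter reads random values and assigns one or more record fields.
Setter = Callable[[Customer, random.Random], None]


def set_age(record: Customer, rng: random.Random) -> None:
    # Single-field distribution: uniform integer age in [18, 90].
    # Consumes exactly one random value.
    record.age = 18 + int(rng.random() * 73)


def set_segment(record: Customer, rng: random.Random) -> None:
    # Value dependency: the segment probability depends on the age
    # assigned by the previous setter. Consumes exactly one random value.
    p_premium = 0.6 if record.age >= 40 else 0.2
    record.segment = "premium" if rng.random() < p_premium else "basic"


# The generator for the domain type is the ordered chain of setters.
CUSTOMER_SETTERS: List[Setter] = [set_age, set_segment]


def generate_customer(position: int, base_seed: int = 42) -> Customer:
    # Assumption: seeding a fresh generator from the record position stands
    # in for reading the record's substream of one shared random sequence.
    rng = random.Random(base_seed * 1_000_003 + position)
    record = Customer(position=position)
    for setter in CUSTOMER_SETTERS:
        setter(record, rng)
    return record
```

Calling `generate_customer(7)` twice yields the same record, because the output depends only on the position and the base seed.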
The Myriad runtime library transparently builds parallel execution support into all compiled data generators. To do so, the framework ensures that the following two conditions always hold:
- Each domain record is identified by a unique position (i.e. a concrete seed) in the generating pseudo-random number sequence.
- The sequence of pseudo-random numbers is generated by a pseudo-random number generator (PRNG) function that supports arbitrary skips to any position in the sequence in constant time.
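One way to obtain a PRNG with constant-time skips is a counter-based design, where the value at a position is computed directly from the seed and the position. The sketch below uses a SplitMix64-style mixing function. It is a swapped-in technique chosen for illustration, not necessarily the generator that Myriad ships, and the class and method names are hypothetical.

```python
MASK64 = (1 << 64) - 1
GAMMA = 0x9E3779B97F4A7C15  # SplitMix64 increment (odd 64-bit constant)


def _mix(z: int) -> int:
    """SplitMix64 output function: scrambles a 64-bit counter value."""
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


class SkippablePRNG:
    """Hypothetical counter-based PRNG with O(1) skips to any position."""

    def __init__(self, seed: int, position: int = 0) -> None:
        self.seed = seed & MASK64
        self.position = position

    def skip_to(self, position: int) -> None:
        # Skipping only changes the counter; no intermediate values
        # have to be generated.
        self.position = position

    def next(self) -> int:
        # The value at position i depends only on (seed, i).
        state = (self.seed + (self.position + 1) * GAMMA) & MASK64
        self.position += 1
        return _mix(state)


if __name__ == "__main__":
    sequential = SkippablePRNG(seed=2012)
    values = [sequential.next() for _ in range(1000)]

    jumper = SkippablePRNG(seed=2012)
    jumper.skip_to(737)
    # Jumping straight to position 737 reproduces the sequential value.
    print(jumper.next() == values[737])
```

Because a record's position fully determines which random values it sees, any node can reproduce any record without generating the records that precede it.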
These runtime-level decisions are critical for efficient parallelization. More specifically, they allow for
- (A) partitioning the generated PRDG sequences across an arbitrary number of data generator nodes in a shared-nothing environment, and
- (B) the use of function shipping (i.e. recomputing a record locally) instead of data shipping (i.e. transferring it over the network) to get the contents of a referenced record generated on a remote node.
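Points (A) and (B) can be sketched together in a few lines of Python. Everything here is an assumption made for illustration: the `customer_at` and `order_at` generators, the even range partitioning, and the position-based seeding are hypothetical and do not reflect Myriad's actual runtime interfaces.

```python
import random

BASE_SEED = 2012  # hypothetical global seed shared by all nodes


def partition(total: int, nodes: int, node_id: int) -> range:
    """(A) Contiguous, non-overlapping slice of positions for one node."""
    begin = total * node_id // nodes
    end = total * (node_id + 1) // nodes
    return range(begin, end)


def customer_at(position: int) -> dict:
    # A record is a pure function of its position, so any node can build it.
    rng = random.Random(BASE_SEED * 1_000_003 + position)
    return {"id": position, "age": 18 + int(rng.random() * 73)}


def order_at(position: int, total_customers: int) -> dict:
    rng = random.Random((BASE_SEED + 1) * 1_000_003 + position)
    customer_id = int(rng.random() * total_customers)
    # (B) Function shipping: the referenced customer may "belong" to another
    # node, but it is recomputed locally instead of fetched over the network.
    customer = customer_at(customer_id)
    return {
        "id": position,
        "customer_id": customer_id,
        "customer_age": customer["age"],
    }


def run_node(node_id: int, nodes: int, total_orders: int,
             total_customers: int) -> list:
    """Generate only the orders assigned to this node."""
    return [order_at(p, total_customers)
            for p in partition(total_orders, nodes, node_id)]


if __name__ == "__main__":
    single = run_node(0, 1, total_orders=20, total_customers=100)
    parallel = [order for node in range(4)
                for order in run_node(node, 4, 20, 100)]
    # Four independent nodes produce exactly the single-node dataset.
    print(single == parallel)
```

The nodes never communicate: each one needs only its node id, the node count, and the shared seed to produce its share of a consistent dataset.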
For a running demo of a simple data generator, see the vldb-demo package.
Here is a list of publications that describe the Myriad Toolkit:
- Myriad: Scalable and Expressive Data Generation - Alexander Alexandrov, Kostas Tzoumas, Volker Markl; PVLDB, 5(12), 2012: pp. 1890-1893
- Myriad - Parallel Data Generation on Shared-Nothing Architectures - Alexander Alexandrov, Berni Schiefer, John Poelman, Stephan Ewen, Thomas Bodner, Volker Markl; Proceedings of the First Workshop on Architectures and Systems for Big Data (ASBD), 2011
For questions about the Myriad Data Generator Toolkit or related topics, please use the mailing list.
People involved in the project:
- Prof. Dr. rer. nat. Volker Markl, FG DIMA, TU Berlin - principal investigator
- Alexander Alexandrov, FG DIMA, TU Berlin - lead developer
- Marie Hoffmann, FG DIMA, TU Berlin - general assistance
- Christoph Brücke, FG DIMA, TU Berlin - general assistance
- Thomas Bodner, FG DIMA, TU Berlin - general assistance
The Myriad Toolkit is developed as part of the Stratosphere Project at the Fachgebiet Datenbanksysteme und Informationsmanagement, TU Berlin under the supervision of Prof. Dr. rer. nat. Volker Markl.