Skip to content

RandomDataGenerator

philip-stoev edited this page Jul 17, 2012 · 1 revision

Table of Contents

The Random Data Generator

The Random Data Generator allows the creation of tables of arbitary size containing an arbitary combination of columns, filled with random data.

Installing the Data Generator

The Data Generator is available as part of the Random Query Generation framework, which is available on Launchpad. To create your own working copy of the RQG framework, execute:

 $ bzr checkout lp:randgen/2.0 randgen

or

 $ bzr branch lp:randgen

It will end up in a directory on your disk called randgen.

Invoking the Data Generator

The Data Generator can be invoked in three different ways:

  • Standalone, which allows the tables to be created on the desired server instance or to print the CREATE and INSERT statements to STDOUT.
 $ perl gendata.pl \
   --dsn=dbi:mysql:host=127.0.0.1:port=9306:user=root:database=test\
   --spec=conf/examples/example.zz
 $ perl gendata.pl \
   --dsn=dummy:print \
   --spec=conf/examples/example.zz

The DSN is in the standard format for Perl's DBI. If "dummy:print" is specified as DSN, the script prints the required SQL to create the tables to STDOUT. After trimming away the "Executing " text before each SQL statement, you can then pipe it from there into a mysql client for execution.

  • As part of the Random Query Generator
 $ perl gentest.pl\
   --dsn=dbi:mysql:host=127.0.0.1:port=9306:user=root:database=test\
   --gendata=conf/example.zz\
   --grammar=conf/example.yy

This will run the Data Generator in order to create and populate the tables as specified in conf/example.zz and will then run the random query test as described in conf/select.yy

  • As part of the full Random Query Generation framework
 $ perl runall.pl \
   --basedir=/path/to/mysql/basedir \
   --gendata=conf/example.zz \
   --grammar=conf/example.yy

This will proceed to start a mysql server, generate the tables and the data, and run the random queries in a single command.

Default behavior

If the Random Query Generator is run without specifying a Data Generator configuration script (for example by using the --gendata option), a default set of tables will be generated.

See RandomQueryGenerator#The_Random_Data_Generator for more details.

The remainder of this document is about the more advanced, customizable data generator, and not the default one.

Configuring the Data Generator

The Data Generator is driven by a configuration file which may look as follows:

 $fields = {
         types => [ 'int', 'char', 'enum', 'set', 'blob' ],
         indexes => [undef, 'unique', 'key' ],
         null => [undef, 'not null'],
         default => [undef, 'default null'],
         sign => [undef, 'unsigned'],
         charsets => ['utf8', 'latin1']
 };
 
 $data = {
         numbers => [ 'digit', 'null', undef ],
         strings => [ 'letter', 'english' ],
         blobs => [ 'data' ]
 }

Configuring the Tables

The $tables section describes the sizes and the other attributes of the tables. The example above will create 24 tables, one for each combination of storage engine, partitions and table size. This section accepts the following parameters:

  • rows - the sizes of the tables that will be created. The default is 0, 1, 2, 10, 100, which are sizes suitable for testing the Optimizer (MyISAM tables with sizes 0 and 1 will be optimized away).
  • engines - the engines to be used. If it is not specified, the default is to use the engine specified using the --engine option when the framework is called. If no such option is specified, the default storage engine for the MySQL server is used, which can be controlled using the --mysqld=--default-storage-engine= option.
  • partitions - the partitioning clause to be used. To avoid partitioning altogether, remove this clause from the configuration file. To create some unpartitioned tables, use undef as an element in the array.
  • names - tables names that are used for the generated tables. If there are more tables (i.e. combinations of size, engine, partitions, etc.) than there are names, default names will be used when the supply of user-defined names has run out. See below for details on the structure of default table names.
   names => ['A','B','C','D'],
  • charsets - specify the character sets
  • collations - specify the collations to be used. Note that a table will be created for each combination of character set and collation, which may not always be valid.
  • pk - specify the primary keys to be used. Valid values are undef for no primary key, or any integer-based PK definition. If auto_increment is specified, the script will insert NULLs in that column. If not, then a sequence of increasing integers will be used. Generation of non-integer primary keys is not yet supported.
  • views - specify what views are to be created for each base table. For example:
  views => ['ALGORITHM=MERGE', 'ALGORITHM=TEMPTABLE'],
  • merges - describe what merge tables are to be created. One merge table will be created over all base tables that are either MyISAM or without a specific engine (with the expectation that those will be created with MyISAM by default.
  merges => ['INSERT_METHOD=LAST','INSERT_METHOD=FIRST'],

Tables names are set according to the properties of the generated tables, and/or according to the names parameter if set. If the names parameter is not set, or if there are not enough names specified, the following approximate name structure is used:

 table<rows>_[engine]_[charset]_[collation]_[partition]_<pk>_<row-format>

Resulting in table names like:

 table0_int_autoinc
 table1000_innodb_int_autoinc
 table100_myisam_key_pk_parts_2_int_autoinc

If views is specified, views are named after the table upon which they are based, as v<tablename></tablename>_<sequence></sequence>, e.g.:

 vtable1000_innodb_int_autoinc_0
 vtable1000_innodb_int_autoinc_1
 vA_0

Configuring the Fields

The $fields section describes the fields and the indexes to be created in each table. One field will be created for each combination of values. For example, the resulting table will have a field int_unsigned_unique and the corresponding key UNIQUE (int_unsigned_unique).

The order of the fields within each table is pseudo-random, which may help trigger bugs which are dependent on the physical placement of the fields. Each table also has a PRIMARY KEY named, appropriately, pk.

The following settings are available:

  • type - specifies a list of MySQL types to be used. Any MySQL type name or alias can be used here. Length modifiers, e.g. char (50) are also allowed. If no length is specified, char columns are created as char (1). set and enum fields will be created to accept every letter from A to Z, unless you specify otherwise. This is so that they can then be populated with random letters. The default for this option is ['int',].
  • nullability - for all field types, specifies whether the fields will be nullable, non-nullable, or both. Valid values are 'not null' and undef. The default is [undef], meaning that all fields will be nullable.
  • sign - for numeric fields, specifies whether the field is signed or unsigned. The valid values are 'unsigned', 'signed' and undef (which is equivalent to signed). The default is to generate signed fields only.
  • index - for all field types, specifies whether the field will have an index over it. Valid values are undef, meaning no index, 'key' or 'index' meaning a standard index and 'unique' and 'fulltext' for other index types. The default is [undef,] meaning that each field will be created in both indexed and non-indexed variants. Note that some storage engines may not support all index types, so table creation may fail. Each field and index are created separately, so no compound indexes are supported at this time. blob and text fields require a length modifier for each index - the generator will create such indexes as KEY (field (255)).
  • charset - for string field types, fields will be created for each character set listed. The default is [undef] meaning no character sets will be specified.
  • collation - for string field types, fields will be created for each collation listed.

Configuring the Data

The $data section from the configuration file describes what data will be inserted into each field type: numbers, strings and blobs.

  • numbers this describes how to generate values for all integer and float field types. Valid values are:
    • 'null' - a NULL value if the column can accept it;
    • 'digit' - a single random digit from 0 to 9;
    • undef - generate a random value that will fit in field being inserted into. Note that for field types such as bigint this will produce tables with very high cardinalities, meaning that only range queries on those tables will have any significant chance of matching some records. Using 'digit' provides substantially less cardinality.
    • 'boolean' - random 0 or 1 .
    • 1,2,3 or any other literal integer will be used as is. For example numbers => [0,] will produce random even integers from 0 to 6.
    • 'tinyint', 'mediumint', or any other valid MySQL type - generate a random value that will fit in this particular type. The unsigned modifier is also supported. So, you can specify:
 numbers => ['tinyint]

to create a mixture of 50% values from 0 to 9 and 50% values from 0 to 255.

To achieve a specific proportion of values, specify a value more than once, for example:

 numbers => ['null',] 

will create 66% nulls and 33% single digits .

  • strings describes how to generate values for all string columns (char and text. Valid values are:
    • 'null'
    • 'letter' which generates a random letter from A to Z
    • 'english' which picks a random word from a list of 100 most common words in English.
    • 'string', 'string(N)' generates a random string containing letters A to Z with a random length from 1 to N characters. If N is not specified, a single random letter is generated.
    • 'ascii', 'ascii(N)' which will generate a random string containing all ascii characters (even unprintable ones) with a random length from 1 to N characters. If N is not specified, a single random ascii character is generated.
    • 'empty' returns an empty string.
    • 'one','two','three' or any other string literal will be used verbatim. For example:
 strings => ['foo',]

Will cause random data to be generated containing an even proportion between the literal words 'foo' and 'bar'.

  • blobs describes how to generate values for the blob columns. Valid values are:
    • 'data' - picks one random file from the gentest/data directory, which contains several JPG images with various sizes (8K to 1.8MB) and loads it using LOAD_DATA().
Category:RandomQueryGenerator
Something went wrong with that request. Please try again.