# Simple Example

This example demonstrates creating a small, simple dataset.

The first step is to import `randomdataset` from the parent of this directory:

In [1]:
import os
import sys

sys.path.append(os.path.abspath(".."))

import randomdataset

Next, write out the YAML schema that will be used to generate the random data:

In [6]:
%%writefile paymentschema.yaml

- typename: randomdataset.generators.CSVGenerator
  num_lines: 10
  dataset:
    name: customers
    typename: randomdataset.Dataset
    fields:
    - name: id
      typename: randomdataset.UIDFieldGen
    - name: FirstName
      typename: randomdataset.StrFieldGen
      lmin: 6
      lmax: 14
    - name: LastName
      typename: randomdataset.StrFieldGen
      lmin: 6
      lmax: 14
        
- typename: randomdataset.generators.CSVGenerator
  num_lines: 20
  dataset:
    name: payments
    typename: randomdataset.Dataset
    fields:
    - name: date
      typename: randomdataset.DateTimeFieldGen
    - name: customer_id
      typename: randomdataset.IntFieldGen
      vmin: 0
      vmax: 10
    - name: amount
      typename: randomdataset.FloatFieldGen
      vmin: 0
      vmax: 100

Overwriting paymentschema.yaml


All this schema does is instantiate Python objects. Each dictionary defines one object: the `typename` value names the type to construct, and all other key-value pairs are passed as constructor arguments. If custom classes are present in the PYTHONPATH, these can be constructed instead. Lists of dictionaries (for example under `fields`) are converted into a Python list of constructed objects. As shown here, `Dataset` requires a `name` argument containing a string and a list of `FieldGen` instances named `fields`, while `CSVGenerator` requires a `num_lines` argument for the number of lines to generate and a `Dataset` instance named `dataset`.
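The construction rule described above can be illustrated with a minimal sketch. This is not the library's actual loader: the `construct` function is a hypothetical helper written for this example, and standard-library classes stand in for the `randomdataset` types so the snippet runs on its own.

```python
import importlib


def construct(spec):
    """Recursively build objects from `typename`-keyed dictionaries (hypothetical helper)."""
    if isinstance(spec, list):
        # A list of dictionaries becomes a list of constructed objects
        return [construct(item) for item in spec]
    if not isinstance(spec, dict):
        # Plain values (strings, numbers) are passed through unchanged
        return spec
    if "typename" not in spec:
        return {key: construct(value) for key, value in spec.items()}

    # Split "package.module.ClassName" into module path and class name
    module_name, _, class_name = spec["typename"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)

    # Every other key-value pair is passed as a constructor argument
    kwargs = {key: construct(value) for key, value in spec.items() if key != "typename"}
    return cls(**kwargs)


# Standard-library stand-ins, not randomdataset classes
value = construct({"typename": "fractions.Fraction", "numerator": 3, "denominator": 4})
nested = construct([
    {"typename": "datetime.timedelta", "days": 1},
    {"typename": "datetime.timedelta", "hours": 2},
])
print(value, nested)
```

The same recursion would apply to the schema above: the inner `dataset` dictionary would be constructed first and then passed to the generator as its `dataset` argument.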

The data is generated by passing this schema to the `generate_dataset` command-line utility in the library:

```bash
$ generate_dataset paymentschema.yaml .
```

This gives the schema file as the input and writes the output to the current directory. Other directories can be specified, but they must already exist. This schema generates CSV data; generators that require a target file would expect a file name in place of the directory name.
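Because the output directory must already exist, it may be convenient to create it before running the utility. The following is a small sketch using only the standard library; the `generated/csv` path is an arbitrary name chosen for illustration, and the command is only assembled and printed here, not executed.

```python
import tempfile
from pathlib import Path

# Use a temporary base directory so the sketch leaves the working tree untouched
base = Path(tempfile.mkdtemp())

# Hypothetical output location; any directory name would do
out_dir = base / "generated" / "csv"
out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists

# Same arguments as the shell invocation above, with the new directory as target
command = ["generate_dataset", "paymentschema.yaml", str(out_dir)]
print(" ".join(command))
```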


Instead of invoking this utility, the command can be called directly through the imported library:

In [7]:
randomdataset.application.generate_dataset.callback("paymentschema.yaml",".")

Schema: 'paymentschema.yaml'
Output: '.'


The output is two CSV files. We can look at `customers.csv` to see the list of randomly generated customers:

In [8]:
!cat customers.csv

id,FirstName,LastName
0,"QDFFgv4XBd5VW","O1Odro"
1,"Gp4mYq","82IPIChjBALg"
2,"LR7KVudB","HcAPBwM"
3,"6FfWGEYS0Q","5NbspSBJk"
4,"si1Tj0xSBB2","eChYKAaW5aa8R"
5,"DYP6OMerUUFOR","pYNXUTNLqdrv"
6,"ltfnhTgrJF","2Rctye"
7,"1tAoaDl57Lo5","xMkVKt6O"
8,"1yJImoqiwf","IJICD8W6B8k"
9,"XkYgS7","8owHyjR"
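The generated file can also be inspected from Python rather than with `cat`. The sketch below uses the standard `csv` module on the first two rows of the output above, embedded as a string so the snippet is self-contained; with the real file, `open("customers.csv", newline="")` would take the place of the `io.StringIO` object. It also checks that the name lengths respect the `lmin` and `lmax` values from the schema.

```python
import csv
import io

# First rows of the customers.csv output shown above
sample = '''id,FirstName,LastName
0,"QDFFgv4XBd5VW","O1Odro"
1,"Gp4mYq","82IPIChjBALg"
'''

# DictReader uses the header line to key each row by column name
rows = list(csv.DictReader(io.StringIO(sample)))

for row in rows:
    print(row["id"], row["FirstName"], row["LastName"])

# Both name fields were declared with lmin: 6 and lmax: 14 in the schema
lengths_ok = all(
    6 <= len(row[column]) <= 14
    for row in rows
    for column in ("FirstName", "LastName")
)
print(lengths_ok)
```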
