refactor faker parsing and enable array relationships (#85)
* refactor faker parsing and enable array relationships

* add helpful error message

* update ecommerce example

* slight change to array example

* update ecommerce example

* accommodate breaking change to pass tests

* update readme

* add warning about executing user input to readme

* fix typo

* beef up examples with blog example

* bump version
chuck-alt-delete committed Mar 22, 2023
1 parent 7441dc3 commit 801d383
Showing 17 changed files with 453 additions and 306 deletions.
22 changes: 14 additions & 8 deletions README.md
@@ -95,7 +95,11 @@ See example input schema files in [examples](./examples) and [tests](/tests) folders

1. Iterate through a schema defined in SQL 10 times, but don't actually interact with Kafka or Schema Registry ("dry run"). Also, see extra output with debug mode.
```bash
-datagen --schema tests/products.sql --format avro --dry-run --debug
+datagen \
+  --schema tests/products.sql \
+  --format avro \
+  --dry-run \
+  --debug
```

1. Same as above, but actually create the schema subjects and Kafka topics, and actually produce the data. There is less output because debug mode is off.
@@ -146,7 +150,7 @@ This is particularly useful when you want to generate a small set of records with …
"topic": "mz_datagen_users"
},
"id": "iteration.index",
"name": "internet.userName",
"name": "faker.internet.userName()",
}
]
```
@@ -181,13 +185,15 @@ docker run \

You can define input schemas using JSON (`.json`), Avro (`.avsc`), or SQL (`.sql`). Within those schemas, you use the [FakerJS API](https://fakerjs.dev/api/) to define the data that is generated for each field.

-You can pass arguments to `faker` methods by escaping quotes. For example, here is [datatype.number](https://fakerjs.dev/api/datatype.html#number) with `min` and `max` arguments:
+You can pass arguments to `faker` methods. For example, here is [faker.datatype.number](https://fakerjs.dev/api/datatype.html#number) with `min` and `max` arguments:

```
"datatype.number({\"min\": 100, \"max\": 1000})"
"faker.datatype.number({min: 100, max: 1000})"
```

> :construction: Right now, JSON is the only kind of input schema that supports generating relational data.

+> :warning: Please inspect your input schema file, since `faker` methods can contain arbitrary JavaScript that `datagen` will execute.

### JSON Schema

Here is the general syntax for a JSON input schema:
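
The syntax snippet itself is collapsed in this diff view. As a rough sketch of the shape it describes (based on the blog example added in this commit, with placeholder topic and field names):

```json
[
  {
    "_meta": {
      "topic": "my_topic",
      "key": "id",
      "relationships": [
        {
          "topic": "my_child_topic",
          "parent_field": "id",
          "child_field": "parent_id",
          "records_per": 2
        }
      ]
    },
    "id": "faker.datatype.number(100)",
    "some_field": "faker.internet.userName()"
  }
]
```
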
@@ -229,10 +235,10 @@ The SQL schema option allows you to use a `CREATE TABLE` statement to define what …
```sql
CREATE TABLE "ecommerce"."products" (
"id" int PRIMARY KEY,
"name" varchar COMMENT 'internet.userName',
"merchant_id" int NOT NULL COMMENT 'datatype.number',
"price" int COMMENT 'datatype.number',
"status" int COMMENT 'datatype.boolean',
"name" varchar COMMENT 'faker.internet.userName()',
"merchant_id" int NOT NULL COMMENT 'faker.datatype.number()',
"price" int COMMENT 'faker.datatype.number()',
"status" int COMMENT 'faker.datatype.boolean()',
"created_at" datetime DEFAULT (now())
);
```
2 changes: 1 addition & 1 deletion datagen.ts
@@ -17,7 +17,7 @@ import dataGenerator from './src/dataGenerator.js';
import fs from 'fs';
import { program, Option } from 'commander';

-program.name('datagen').description('Fake Data Generator').version('0.1.4');
+program.name('datagen').description('Fake Data Generator').version('0.2.0');

program
.requiredOption('-s, --schema <char>', 'Schema file to use')
1 change: 1 addition & 0 deletions examples/README.md
@@ -6,5 +6,6 @@ This directory contains end-to-end tutorials for the `datagen` tool.
| -------- | ----------- |
| [ecommerce](ecommerce) | A tutorial for the `datagen` tool that generates data for an ecommerce website. |
| [docker-compose](docker-compose) | A `docker-compose` setup for `datagen`. |
| [blog](blog) | Sample data for a blog with users, posts, and comments. |

To request a new tutorial, please [open an issue](https://github.com/MaterializeInc/datagen/issues/new?assignees=&labels=feature%2C+enhancement&template=feature_request.md&title=Feature%3A+).
60 changes: 60 additions & 0 deletions examples/blog/README.md
@@ -0,0 +1,60 @@
# Blog Demo

This small example generates relational data for a blog where users make posts, and posts have comments by other users.

## Inspect the Schema

1. Take a moment to look at [blog.json](./blog.json) and make a prediction about what the output will look like.

## Do a Dry Run

Here is a command to do a dry run of a single iteration.

```bash
datagen \
  --dry-run \
  --debug \
  --schema examples/blog/blog.json \
  --format avro \
  --prefix mz_datagen_blog \
  --number 1
```

Notice that in a single iteration, one user is created, then 2 posts for that user, and then 2 comments for each post. Since comments are themselves made by users, 2 additional users are created as well. The value of a field in a parent record is passed down to its child records (e.g., if `users.id` is `5`, then each associated post has `posts.user_id` equal to `5`), so downstream systems can perform meaningful joins.
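
For illustration, a single iteration might emit records along these lines (all values hypothetical):

```
{ "id": 5, "name": "Jane99", "email": "Jane99@example.com" }   // users
{ "id": 17, "user_id": 5, "title": "...", "body": "..." }      // posts: user_id matches the parent users.id
{ "id": 301, "post_id": 17, "user_id": 42, "body": "..." }     // comments: post_id matches the parent posts.id
```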

Also notice that the number of unique primary keys in each collection is limited, so over time you will see each key appear multiple times. Downstream systems can interpret these repeated keys as updates.
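
For example, `users.id` is generated with `faker.datatype.number(100)`, so there are only 101 possible user keys. A consumer that treats the topic as a keyed table can apply a repeated key as an update (values hypothetical):

```
{ "id": 5, "name": "Jane99" }         // first record with key 5
{ "id": 5, "name": "Lonnie_Kuhic" }   // later record with the same key, i.e. an update
```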

## (Optional) Produce to Kafka

See [.env.example](../../.env.example) for the environment variables used to connect to your Kafka cluster.
If you use the `--format avro` option, you also need to set environment variables to connect to your Schema Registry.

After you set those, you can produce to your Kafka cluster. Press `Ctrl+C` when you are ready to stop the producer.

```bash
datagen \
  --schema examples/blog/blog.json \
  --format avro \
  --prefix mz_datagen_blog \
  --number -1
```

When you are finished, you can delete all the topics and schema subjects with the `--clean` option.

```bash
datagen \
  --schema examples/blog/blog.json \
  --format avro \
  --prefix mz_datagen_blog \
  --clean
```

## (Optional) Query in Materialize

Materialize is a [streaming database](https://materialize.com/guides/streaming-database/). You create materialized views with standard SQL, and Materialize eagerly reads from Kafka topics and Postgres tables to keep those views up to date automatically as new data arrives. It is Postgres wire compatible, so you can query your materialized views directly with the `psql` CLI or any Postgres client library.

See the [ecommerce example](../ecommerce/README.md) for a full end-to-end example where data is transformed in and served from Materialize in near real-time.

### Learn More

Check out the Materialize [docs](https://materialize.com/docs) and [blog](https://materialize.com/blog) for more!
61 changes: 61 additions & 0 deletions examples/blog/blog.json
@@ -0,0 +1,61 @@
[
{
"_meta": {
"topic": "users",
"key": "id",
"relationships": [
{
"topic": "posts",
"parent_field": "id",
"child_field": "user_id",
"records_per": 2
}
]
},
"id": "faker.datatype.number(100)",
"name": "faker.internet.userName()",
"email": "faker.internet.exampleEmail()",
"phone": "faker.phone.imei()",
"website": "faker.internet.domainName()",
"city": "faker.address.city()",
"company": "faker.company.name()"
},
{
"_meta": {
"topic": "posts",
"key": "id",
"relationships": [
{
"topic": "comments",
"parent_field": "id",
"child_field": "post_id",
"records_per": 2
}
]
},
"id": "faker.datatype.number(1000)",
"user_id": "faker.datatype.number(100)",
"title": "faker.lorem.sentence()",
"body": "faker.lorem.paragraph()"
},
{
"_meta": {
"topic": "comments",
"key": "id",
"relationships": [
{
"topic": "users",
"parent_field": "user_id",
"child_field": "id",
"records_per": 1
}
]
},
"id": "faker.datatype.number(2000)",
"user_id": "faker.datatype.number(100)",
"body": "faker.lorem.paragraph",
"post_id": "faker.datatype.number(1000)",
"views": "faker.datatype.number({min: 100, max: 1000})",
"status": "faker.datatype.number(1)"
}
]
