refactor faker parsing and enable array relationships (#85)
* refactor faker parsing and enable array relationships

* add helpful error message

* update ecommerce example

* slight change to array example

* update ecommerce example

* accommodate breaking change to pass tests

* update readme

* add warning about executing user input to readme

* fix typo

* beef up examples with blog example

* bump version
chuck-alt-delete committed Mar 22, 2023
1 parent 7441dc3 commit 801d383
Showing 17 changed files with 453 additions and 306 deletions.
22 changes: 14 additions & 8 deletions README.md
@@ -95,7 +95,11 @@ See example input schema files in [examples](./examples) and [tests](/tests) folders

1. Iterate through a schema defined in SQL 10 times, but don't actually interact with Kafka or Schema Registry ("dry run"). Also, see extra output with debug mode.
```bash
-datagen --schema tests/products.sql --format avro --dry-run --debug
+datagen \
+  --schema tests/products.sql \
+  --format avro \
+  --dry-run \
+  --debug
```

1. Same as above, but actually create the schema subjects and Kafka topics, and actually produce the data. There is less output because debug mode is off.
@@ -146,7 +150,7 @@ This is particularly useful when you want to generate a small set of records with …
"topic": "mz_datagen_users"
},
"id": "iteration.index",
"name": "internet.userName",
"name": "faker.internet.userName()",
}
]
```
@@ -181,13 +185,15 @@ docker run \

You can define input schemas using JSON (`.json`), Avro (`.avsc`), or SQL (`.sql`). Within those schemas, you use the [FakerJS API](https://fakerjs.dev/api/) to define the data that is generated for each field.

-You can pass arguments to `faker` methods by escaping quotes. For example, here is [datatype.number](https://fakerjs.dev/api/datatype.html#number) with `min` and `max` arguments:
+You can pass arguments to `faker` methods. For example, here is [faker.datatype.number](https://fakerjs.dev/api/datatype.html#number) with `min` and `max` arguments:

```
"datatype.number({\"min\": 100, \"max\": 1000})"
"faker.datatype.number({min: 100, max: 1000})"
```

> :construction: Right now, JSON is the only kind of input schema that supports generating relational data.

+> :warning: Please inspect your input schema file, since `faker` methods can contain arbitrary JavaScript that `datagen` will execute.

### JSON Schema

Here is the general syntax for a JSON input schema:
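
The syntax snippet itself is collapsed in this diff view. As a rough sketch of the shape it describes (based on the blog example added in this commit, with placeholder topic and field names):

```json
[
  {
    "_meta": {
      "topic": "my_topic",
      "key": "id",
      "relationships": [
        {
          "topic": "my_child_topic",
          "parent_field": "id",
          "child_field": "parent_id",
          "records_per": 2
        }
      ]
    },
    "id": "faker.datatype.number(100)",
    "some_field": "faker.internet.userName()"
  }
]
```
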
@@ -229,10 +235,10 @@ The SQL schema option allows you to use a `CREATE TABLE` statement to define what …
```sql
CREATE TABLE "ecommerce"."products" (
"id" int PRIMARY KEY,
"name" varchar COMMENT 'internet.userName',
"merchant_id" int NOT NULL COMMENT 'datatype.number',
"price" int COMMENT 'datatype.number',
"status" int COMMENT 'datatype.boolean',
"name" varchar COMMENT 'faker.internet.userName()',
"merchant_id" int NOT NULL COMMENT 'faker.datatype.number()',
"price" int COMMENT 'faker.datatype.number()',
"status" int COMMENT 'faker.datatype.boolean()',
"created_at" datetime DEFAULT (now())
);
```
2 changes: 1 addition & 1 deletion datagen.ts
@@ -17,7 +17,7 @@ import dataGenerator from './src/dataGenerator.js';
import fs from 'fs';
import { program, Option } from 'commander';

-program.name('datagen').description('Fake Data Generator').version('0.1.4');
+program.name('datagen').description('Fake Data Generator').version('0.2.0');

program
.requiredOption('-s, --schema <char>', 'Schema file to use')
1 change: 1 addition & 0 deletions examples/README.md
@@ -6,5 +6,6 @@ This directory contains end-to-end tutorials for the `datagen` tool.
| -------- | ----------- |
| [ecommerce](ecommerce) | A tutorial for the `datagen` tool that generates data for an ecommerce website. |
| [docker-compose](docker-compose) | A `docker-compose` setup for `datagen`. |
| [blog](blog) | Sample data for a blog with users, posts, and comments. |

To request a new tutorial, please [open an issue](https://github.com/MaterializeInc/datagen/issues/new?assignees=&labels=feature%2C+enhancement&template=feature_request.md&title=Feature%3A+).
60 changes: 60 additions & 0 deletions examples/blog/README.md
@@ -0,0 +1,60 @@
# Blog Demo

This small example generates relational data for a blog where users make posts, and posts have comments by other users.

## Inspect the Schema

1. Take a moment to look at [blog.json](./blog.json) and make a prediction about what the output will look like.

## Do a Dry Run

Here is a command to do a dry run of a single iteration.

```bash
datagen \
  --dry-run \
  --debug \
  --schema examples/blog/blog.json \
  --format avro \
  --prefix mz_datagen_blog \
  --number 1
```

Notice that in a single iteration, one user is created, then 2 posts for that user, and then 2 comments for each post. Since comments are themselves made by users, 2 additional users are created as well. The value of a field in a parent record is passed down to its child records (e.g., if `users.id` is `5`, then each associated post has `posts.user_id` equal to `5`), so downstream systems can perform meaningful joins.
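
For illustration, a single iteration might emit records along these lines (all values hypothetical):

```
{ "id": 5, "name": "Jane99", "email": "Jane99@example.com" }   // users
{ "id": 17, "user_id": 5, "title": "...", "body": "..." }      // posts: user_id matches the parent users.id
{ "id": 301, "post_id": 17, "user_id": 42, "body": "..." }     // comments: post_id matches the parent posts.id
```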

Also notice that the number of unique primary keys in each collection is limited, so over time you will see each key appear multiple times. Downstream systems can interpret these repeated keys as updates.
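
For example, `users.id` is generated with `faker.datatype.number(100)`, so there are only 101 possible user keys. A consumer that treats the topic as a keyed table can apply a repeated key as an update (values hypothetical):

```
{ "id": 5, "name": "Jane99" }         // first record with key 5
{ "id": 5, "name": "Lonnie_Kuhic" }   // later record with the same key, i.e. an update
```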

## (Optional) Produce to Kafka

See [.env.example](../../.env.example) for the environment variables used to connect to your Kafka cluster.
If you use the `--format avro` option, you also need to set environment variables to connect to your Schema Registry.

After you set those, you can produce to your Kafka cluster. Press `Ctrl+C` when you are ready to stop the producer.

```bash
datagen \
  --schema examples/blog/blog.json \
  --format avro \
  --prefix mz_datagen_blog \
  --number -1
```

When you are finished, you can delete all the topics and schema subjects with the `--clean` option.

```bash
datagen \
  --schema examples/blog/blog.json \
  --format avro \
  --prefix mz_datagen_blog \
  --clean
```

## (Optional) Query in Materialize

Materialize is a [streaming database](https://materialize.com/guides/streaming-database/). You create materialized views with standard SQL, and Materialize eagerly reads from Kafka topics and Postgres tables to keep those views up to date automatically as new data arrives. It is Postgres wire compatible, so you can query your materialized views directly with the `psql` CLI or any Postgres client library.

See the [ecommerce example](../ecommerce/README.md) for a full end-to-end example where data is transformed in and served from Materialize in near real-time.

### Learn More

Check out the Materialize [docs](https://materialize.com/docs) and [blog](https://materialize.com/blog) for more!
61 changes: 61 additions & 0 deletions examples/blog/blog.json
@@ -0,0 +1,61 @@
[
{
"_meta": {
"topic": "users",
"key": "id",
"relationships": [
{
"topic": "posts",
"parent_field": "id",
"child_field": "user_id",
"records_per": 2
}
]
},
"id": "faker.datatype.number(100)",
"name": "faker.internet.userName()",
"email": "faker.internet.exampleEmail()",
"phone": "faker.phone.imei()",
"website": "faker.internet.domainName()",
"city": "faker.address.city()",
"company": "faker.company.name()"
},
{
"_meta": {
"topic": "posts",
"key": "id",
"relationships": [
{
"topic": "comments",
"parent_field": "id",
"child_field": "post_id",
"records_per": 2
}
]
},
"id": "faker.datatype.number(1000)",
"user_id": "faker.datatype.number(100)",
"title": "faker.lorem.sentence()",
"body": "faker.lorem.paragraph()"
},
{
"_meta": {
"topic": "comments",
"key": "id",
"relationships": [
{
"topic": "users",
"parent_field": "user_id",
"child_field": "id",
"records_per": 1
}
]
},
"id": "faker.datatype.number(2000)",
"user_id": "faker.datatype.number(100)",
"body": "faker.lorem.paragraph",
"post_id": "faker.datatype.number(1000)",
"views": "faker.datatype.number({min: 100, max: 1000})",
"status": "faker.datatype.number(1)"
}
]
