### Generating Synthetic Data for Data Quality Testing
This section demonstrates how to set up and generate sample data using the `dbldatagen` library. The data includes representative PII-like fields and will later be used for data quality checks.

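We configure the synthetic data generator to produce rows with fields such as email, IP address, location, phone number, SSN, and credit card number. Templates define the format of each templated field (for example, `ddd-dd-dddd` for an SSN), while the remaining columns draw from fixed value lists or numeric ranges.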
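
To make the template idea concrete, the sketch below expands a simplified template in plain Python. It is a toy illustration, not `dbldatagen`'s implementation: the function name `expand_template` is hypothetical, and it handles only two placeholders as we understand them from the templates used here, `d` (one random digit) and `\n` (a random integer from 0 to 255), plus backslash-escaping of other characters. Everything else is copied literally.

```python
import random


def expand_template(template: str, rng: random.Random) -> str:
    """Expand a simplified template (toy version, not dbldatagen's code)."""
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "\\" and i + 1 < len(template):
            nxt = template[i + 1]
            if nxt == "n":
                # "\n" stands for a random number between 0 and 255
                out.append(str(rng.randint(0, 255)))
            else:
                # Any other escaped character is emitted literally
                out.append(nxt)
            i += 2
        elif ch == "d":
            # "d" stands for a single random digit
            out.append(str(rng.randint(0, 9)))
            i += 1
        else:
            # All remaining characters are copied unchanged
            out.append(ch)
            i += 1
    return "".join(out)


rng = random.Random(42)  # fixed seed so repeated runs give the same values
phone = expand_template(r"(ddd)-ddd-dddd", rng)
ip_addr = expand_template(r"\n.\n.\n.\n", rng)
print(phone, ip_addr)
```

The real library supports more placeholders (such as `\w` in the email template) and alternatives separated by `|`, which this sketch deliberately leaves out.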

In [0]:
# Install dbldatagen, a library for generating synthetic test data
%pip install dbldatagen

In [0]:
import dbldatagen as dg
from pyspark.sql.types import IntegerType

df_spec = (
    dg.DataGenerator(
        sparkSession=spark,
        name="test_data_set1",
        rows=100000,
        partitions=4,
        randomSeedMethod="hash_fieldname"
    )
    .withColumn(
        "email", "string",
        # \k keeps the final "k" literal; a bare k is a template placeholder
        template=r'\w.\w@\w.com|\w@\w.co.u\k'
    )
    .withColumn(
        "ip_addr", "string",
        template=r'\n.\n.\n.\n'
    )
    .withColumn(
        "location", "string",
        values=['Seattle', 'New York', 'Los Angeles', 'Chicago', 'San Francisco'],
        random=True
    )
    .withColumn(
        "phone", "string",
        template=r'(ddd)-ddd-dddd'
    )
    .withColumn(
        "ssn", "string",
        template=r'ddd-dd-dddd'
    )
    .withColumn(
        "CC_number", "string",
        template=r'dddddddddddddddd'
    )
    .withColumn(
        "product_purchased", "string",
        values=['product1', 'product2', 'product3'],
        random=True
    )
    .withColumn(
        "price", "double",
        minValue=100,
        maxValue=1000,
        random=True
    )
    .withColumn(
        "purchase_location", "string",
        values=["website", "app", "in-person"],
        random=True
    )
)

# Build DataFrame from the above specification
df = df_spec.build()
num_rows = df.count()
display(df)
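
Because templates fix the shape of each field, a quick format check on a few sampled rows can catch template mistakes before the table is used for quality checks. The following is a minimal sketch in plain Python; the `EXPECTED_FORMATS` patterns and the `find_format_violations` helper are hypothetical names chosen for illustration, and the sample rows are hard-coded rather than collected from `df`.

```python
import re

# Hypothetical patterns mirroring the templates in the generator spec
EXPECTED_FORMATS = {
    "phone": re.compile(r"\(\d{3}\)-\d{3}-\d{4}"),
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "CC_number": re.compile(r"\d{16}"),
    "ip_addr": re.compile(r"\d{1,3}(\.\d{1,3}){3}"),
}


def find_format_violations(rows):
    """Return (row_index, column, value) for values that do not match."""
    violations = []
    for idx, row in enumerate(rows):
        for column, pattern in EXPECTED_FORMATS.items():
            value = row.get(column)
            if value is None or not pattern.fullmatch(value):
                violations.append((idx, column, value))
    return violations


# Hard-coded sample rows standing in for rows collected from the DataFrame
sample_rows = [
    {"phone": "(206)-555-0100", "ssn": "123-45-6789",
     "CC_number": "4111111111111111", "ip_addr": "10.0.0.1"},
    {"phone": "206-555-0100", "ssn": "123-45-6789",
     "CC_number": "4111111111111111", "ip_addr": "10.0.0.1"},
]
violations = find_format_violations(sample_rows)
print(violations)
```

In the notebook, the sample could come from something like `[r.asDict() for r in df.limit(100).collect()]`, which keeps the check cheap even for the full 100,000-row dataset.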

### Creating a Sample Delta Table
Here we create a catalog, schema, and demo Delta Lake table to store the generated data, ready for use in later quality validation workflows.

In [0]:
# Set catalog, schema, and table for storing data
catalog = "george_test"
schema = "dqx"
table_name = "dqx_table"
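
These three names are later combined into a three-level `catalog.schema.table` identifier inside f-strings. A small helper can build that identifier once and reject names that would break the SQL. This is a sketch under the assumption that simple unquoted identifiers (letters, digits, underscores) are sufficient; `qualified_table_name` is a hypothetical helper, not part of any library.

```python
import re

# Assumption: only plain identifiers are allowed (no quoting or special characters)
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def qualified_table_name(catalog: str, schema: str, table: str) -> str:
    """Join the three parts into catalog.schema.table after validating each."""
    for part in (catalog, schema, table):
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"Unsafe identifier: {part!r}")
    return f"{catalog}.{schema}.{table}"


full_name = qualified_table_name("george_test", "dqx", "dqx_table")
print(full_name)  # george_test.dqx.dqx_table
```

Validating the parts up front matters mainly when the names come from notebook widgets or job parameters rather than literals.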

In [0]:
# Create the catalog and schema if they don't exist
spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")

# Write the generated data to a Delta table, replacing any existing rows
(
    df.write
    .mode("overwrite")
    .option("mergeSchema", "true")
    .saveAsTable(f"{catalog}.{schema}.{table_name}")
)

In [0]:
# Preview the first rows of the stored table
display(
    spark.sql(f"SELECT * FROM {catalog}.{schema}.{table_name} LIMIT 10")
)