# Synthetic Data Validation Notebook

This notebook validates the synthetic TPC-DS–like dataset generated by our project. It performs the following checks:

1. Loads all CSV files from the `data/` directory.
2. Verifies primary key uniqueness for dimension and fact tables.
3. Checks foreign key relationships between tables.
4. Verifies that the aggregated order totals from transactions match the `order_total` in the orders table.
5. Loads and displays metadata from the JSON schema file.
6. Reports any inconsistencies found.

**Note:** Adjust `data_dir` and `metadata_dir` in the first code cell if your files live elsewhere.

In [None]:
import pandas as pd
import numpy as np
import json
import yaml
from datetime import datetime

# Define file paths
data_dir = "../data/"
metadata_dir = "../metadata/"

customers_file = data_dir + "customers.csv"
products_file = data_dir + "products.csv"
stores_file = data_dir + "stores.csv"
promotions_file = data_dir + "promotions.csv"
dates_file = data_dir + "dates.csv"
orders_file = data_dir + "orders.csv"
transactions_file = data_dir + "transactions.csv"
inventory_file = data_dir + "inventory.csv"
returns_file = data_dir + "returns.csv"

schema_metadata_file = metadata_dir + "schema_metadata.json"
etl_lineage_file = metadata_dir + "generation_lineage.yaml"

## Load DataFrames

In [2]:
customers = pd.read_csv(customers_file)
products = pd.read_csv(products_file)
stores = pd.read_csv(stores_file)
promotions = pd.read_csv(promotions_file)
dates = pd.read_csv(dates_file)
orders = pd.read_csv(orders_file, parse_dates=["order_date"])
transactions = pd.read_csv(transactions_file)
inventory = pd.read_csv(inventory_file)
returns = pd.read_csv(returns_file)

print("DataFrames loaded successfully!")
print(f"Customers: {customers.shape[0]} rows")
print(f"Products: {products.shape[0]} rows")
print(f"Stores: {stores.shape[0]} rows")
print(f"Promotions: {promotions.shape[0]} rows")
print(f"Dates: {dates.shape[0]} rows")
print(f"Orders: {orders.shape[0]} rows")
print(f"Transactions: {transactions.shape[0]} rows")
print(f"Inventory: {inventory.shape[0]} rows")
print(f"Returns: {returns.shape[0]} rows")

DataFrames loaded successfully!
Customers: 10000 rows
Products: 1000 rows
Stores: 100 rows
Promotions: 100 rows
Dates: 365 rows
Orders: 10000 rows
Transactions: 30023 rows
Inventory: 10000 rows
Returns: 1000 rows
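The cell above names each file explicitly. As a minimal sketch of the "load all CSV files from a directory" step, the helper below reads every `*.csv` it finds into a dictionary keyed by file stem. The name `load_csv_dir` and the temporary demo file are hypothetical and not part of the project. Note that this generic loader does not apply per-table options such as `parse_dates=["order_date"]`.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_csv_dir(directory):
    """Read every *.csv file in `directory` into a dict keyed by file stem."""
    return {p.stem: pd.read_csv(p) for p in sorted(Path(directory).glob("*.csv"))}


# Self-contained demo on a temporary directory (hypothetical file, not project data).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "stores.csv").write_text("store_id,city\n1,Austin\n2,Boise\n")
    tables = load_csv_dir(tmp)
    row_counts = {name: len(df) for name, df in tables.items()}

print(row_counts)
```

In the notebook itself this would presumably be called as `load_csv_dir(data_dir)`, with date columns converted afterwards.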


## Primary Key Uniqueness Checks

In [None]:
def check_uniqueness(df, key, table_name):
    """Print PASS if every row of df has a distinct, non-null value in `key`, else FAIL."""
    unique_count = df[key].nunique()
    total_count = df.shape[0]
    if unique_count == total_count:
        print(f"PASS: {table_name} - All {total_count} rows have unique '{key}'.")
    else:
        print(f"FAIL: {table_name} - Only {unique_count} unique '{key}' found out of {total_count} rows.")

print("\nPrimary Key Checks:")
check_uniqueness(customers, 'customer_id', 'Customers')
check_uniqueness(products, 'product_id', 'Products')
check_uniqueness(stores, 'store_id', 'Stores')
check_uniqueness(promotions, 'promo_id', 'Promotions')
check_uniqueness(dates, 'date_id', 'Dates')
check_uniqueness(orders, 'order_id', 'Orders')
check_uniqueness(transactions, 'transaction_id', 'Transactions')
# The Returns table should also have a unique return_id, if that column exists
if 'return_id' in returns.columns:
    check_uniqueness(returns, 'return_id', 'Returns')


Primary Key Checks:
PASS: Customers - All 10000 rows have unique 'customer_id'.
PASS: Products - All 1000 rows have unique 'product_id'.
PASS: Stores - All 100 rows have unique 'store_id'.
PASS: Promotions - All 100 rows have unique 'promo_id'.
PASS: Dates - All 365 rows have unique 'date_id'.
PASS: Orders - All 10000 rows have unique 'order_id'.
PASS: Transactions - All 30023 rows have unique 'transaction_id'.
PASS: Returns - All 1000 rows have unique 'return_id'.
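`Series.nunique()` skips missing values by default, so `check_uniqueness` reports FAIL for a key column that contains nulls but does not say whether nulls or duplicates caused the failure. The sketch below separates the two cases. The helper name `describe_key_problems` and the toy table are hypothetical, used only for illustration.

```python
import pandas as pd


def describe_key_problems(df, key):
    """Return counts of null and duplicated values in a candidate key column."""
    null_count = int(df[key].isna().sum())
    non_null = df[key].dropna()
    # keep=False marks every row whose key value is shared with another row.
    duplicate_rows = int(non_null.duplicated(keep=False).sum())
    return {"nulls": null_count, "duplicate_rows": duplicate_rows}


# Hypothetical toy table, not project data: one null and one repeated key.
toy = pd.DataFrame({"customer_id": [1, 2, 2, None, 4]})
problems = describe_key_problems(toy, "customer_id")
print(problems)  # {'nulls': 1, 'duplicate_rows': 2}
```

A table passes the primary key check only when both counts are zero.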


## Foreign Key Consistency Checks

We verify that:
- Every `customer_id` in Orders exists in Customers.
- Every `store_id` in Orders exists in Stores.
- Every `order_id` in Transactions exists in Orders.
- Every `product_id` in Transactions exists in Products.
- If `promo_id` is provided in Transactions, it exists in Promotions.
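- Every `store_id` and `product_id` in Inventory exists in Stores and Products, respectively.
- Every `order_id` and `product_id` in Returns exists in Orders and Products, respectively.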

In [4]:
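def check_foreign_key(child_df, child_key, parent_df, parent_key, child_table, parent_table):
    """Print PASS if every child_df[child_key] value appears in parent_df[parent_key], else FAIL.

    Null child keys count as missing, so drop them first for optional columns.
    """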
    missing = child_df[~child_df[child_key].isin(parent_df[parent_key])]
    if missing.empty:
        print(f"PASS: All values in {child_table}.{child_key} exist in {parent_table}.{parent_key}.")
    else:
        print(f"FAIL: {child_table}.{child_key} has {missing.shape[0]} rows with values not found in {parent_table}.{parent_key}.")

print("\nForeign Key Checks:")
check_foreign_key(orders, 'customer_id', customers, 'customer_id', 'Orders', 'Customers')
check_foreign_key(orders, 'store_id', stores, 'store_id', 'Orders', 'Stores')
check_foreign_key(transactions, 'order_id', orders, 'order_id', 'Transactions', 'Orders')
check_foreign_key(transactions, 'product_id', products, 'product_id', 'Transactions', 'Products')
# For promo_id, drop nulls before checking
if 'promo_id' in transactions.columns:
    check_foreign_key(transactions.dropna(subset=['promo_id']), 'promo_id', promotions, 'promo_id', 'Transactions', 'Promotions')
check_foreign_key(inventory, 'store_id', stores, 'store_id', 'Inventory', 'Stores')
check_foreign_key(inventory, 'product_id', products, 'product_id', 'Inventory', 'Products')
check_foreign_key(returns, 'order_id', orders, 'order_id', 'Returns', 'Orders')
check_foreign_key(returns, 'product_id', products, 'product_id', 'Returns', 'Products')




Foreign Key Checks:
PASS: All values in Orders.customer_id exist in Customers.customer_id.
PASS: All values in Orders.store_id exist in Stores.store_id.
PASS: All values in Transactions.order_id exist in Orders.order_id.
PASS: All values in Transactions.product_id exist in Products.product_id.
PASS: All values in Transactions.promo_id exist in Promotions.promo_id.
PASS: All values in Inventory.store_id exist in Stores.store_id.
PASS: All values in Inventory.product_id exist in Products.product_id.
PASS: All values in Returns.order_id exist in Orders.order_id.
PASS: All values in Returns.product_id exist in Products.product_id.
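When a foreign key check fails, a row count alone does not show which keys are orphaned. As a sketch of one way to list them, the helper below performs an anti-join with `DataFrame.merge(..., indicator=True)` and returns the distinct child keys that have no parent. The name `find_orphans` and the toy tables are hypothetical and are not project data.

```python
import pandas as pd


def find_orphans(child_df, child_key, parent_df, parent_key):
    """Return the distinct non-null child key values with no match in the parent."""
    child_keys = child_df[[child_key]].dropna().drop_duplicates()
    parent_keys = parent_df[[parent_key]].drop_duplicates()
    # indicator=True adds a "_merge" column: "both" for matches, "left_only" for orphans.
    merged = child_keys.merge(
        parent_keys, how="left", left_on=child_key, right_on=parent_key, indicator=True
    )
    return merged.loc[merged["_merge"] == "left_only", child_key].tolist()


# Hypothetical toy tables: order_id 5 and 7 have no parent order.
toy_orders = pd.DataFrame({"order_id": [1, 2, 3]})
toy_transactions = pd.DataFrame({"order_id": [1, 1, 2, 5, 7]})
orphans = find_orphans(toy_transactions, "order_id", toy_orders, "order_id")
print(orphans)  # [5, 7]
```

An empty list corresponds to a PASS from `check_foreign_key`.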
