DataWizz is a data platform inspired by Databricks, Snowflake, ClickHouse Cloud, Airflow, Superset, and Metabase. It combines file ingestion, SQL exploration, Delta Lake publishing, multi-engine notebooks, low-code orchestration, and business dashboards in one local-first workspace.
The current version is intentionally built as a serious MVP rather than a toy demo:
- Upload, preview, and profile raw files
- Query raw and curated data with DuckDB
- Run notebooks with DuckDB, PySpark, and DataFusion
- Publish transformed outputs as Delta tables
- Build and validate visual pipelines
- Track runs, retries, and logs
- Create semantic datasets, charts, dashboards, and scheduled reports
- Switch between dark and light workspace themes
- Run locally with one script or as a Docker demo stack
DataWizz is designed for analytics engineering and platform demos where you want a believable lakehouse product surface without needing a full distributed stack on day one.
It is especially useful when you want to demonstrate:
- Raw-to-curated data workflows
- SQL-first transformation on local or object-backed files
- Notebook-driven prototyping across multiple execution engines
- Delta Lake publishing with metadata tracking
- Airflow-like orchestration without leaving the app
- In-app BI dashboards on top of curated outputs
- File upload, preview, schema inference, and deletion
- SQL querying over CSV, JSON, Parquet, and curated Delta tables
- Write query outputs to Delta Lake with append or overwrite modes
- Catalog browsing with metadata, freshness, ownership, tags, and lineage hints
- Theme-aware workspace shell with dark and light presentation modes
- Multi-cell saved notebooks in the Engine Lab
- Real local execution for DuckDB, PySpark, and DataFusion
- Run-all, run-single-cell, and run-from-here execution flows
- Notebook duplicate, delete, rename, and run history support
- Source-aware asset browser with one-click SQL or Python snippet insertion
- Persisted per-cell outputs so reopened notebooks restore the latest visible state
- Visual pipeline builder powered by React Flow
- File source, Delta source, filter, select, join, aggregate, SQL, validate, write, and schedule nodes
- DAG validation, node guardrails, run history, retries, and detailed logs
- Airflow DAG code generation and export
- Backend recurring scheduler for saved cron pipelines
- Semantic dataset explorer
- Dataset-driven chart builder
- Saved chart library with traceability into dashboards and report schedules
- Dashboard builder and dashboard viewer
- Report scheduler with stored artifacts and snapshot history
- Optional Superset integration surface for demo storytelling
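The DAG validation and node guardrails listed among the pipeline features can be pictured with a small sketch. This is not DataWizz's actual validator. It assumes a pipeline is a set of node ids plus (upstream, downstream) edges, and the function name `validate_pipeline` is hypothetical. It uses the standard-library `graphlib` module (Python 3.9+) to reject unknown nodes and cycles and to derive a run order.

```python
from graphlib import CycleError, TopologicalSorter


def validate_pipeline(nodes, edges):
    """Return (run_order, errors) for a pipeline graph.

    nodes: iterable of node ids (hypothetical shape, not the app's schema).
    edges: iterable of (upstream, downstream) pairs.
    """
    node_ids = set(nodes)
    errors = []

    # Guardrail: every edge must connect two known nodes.
    for upstream, downstream in edges:
        for node_id in (upstream, downstream):
            if node_id not in node_ids:
                errors.append(f"edge references unknown node: {node_id}")
    if errors:
        return [], errors

    # Map each node to the set of nodes it depends on.
    graph = {node_id: set() for node_id in node_ids}
    for upstream, downstream in edges:
        graph[downstream].add(upstream)

    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # The second element of exc.args lists the nodes forming the cycle.
        cycle = " -> ".join(str(node) for node in exc.args[1])
        return [], [f"cycle detected: {cycle}"]
    return order, []


order, errors = validate_pipeline(
    ["source", "filter", "write"],
    [("source", "filter"), ("filter", "write")],
)
print(order, errors)
```

Returning errors as a list instead of raising lets a UI show every problem at once, which fits a visual builder better than failing on the first issue.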
See the deeper system walkthrough in docs/architecture.md.
At a high level:
Users
-> React + TypeScript frontend
-> FastAPI application layer
-> DuckDB execution services
-> Delta Lake curated storage
-> PostgreSQL metadata store
-> Optional MinIO object storage
Project layout:
frontend/ React app for the workspace UI
backend/ FastAPI APIs, services, models, and migrations
docs/ Architecture, API, demo workflow, and screenshots
sample_data/ CSV fixtures and sample pipeline JSON
storage/ raw/, curated/, and temp/ runtime zones
docker-compose.yml Demo stack for frontend, backend, PostgreSQL, MinIO, Superset
run.sh One-command local launcher
From the project root:
./run.sh
This launcher:
- Reuses healthy local frontend and backend processes when they are already running
- Starts the app in local demo mode when Docker is unavailable
- Supports a Docker-based stack when Docker is installed
Local endpoints:
- App: http://localhost:5173
- API: http://localhost:8000
- API docs: http://localhost:8000/docs
- Superset setup page: http://localhost:5173/bi/superset
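After launching, a quick way to confirm the local endpoints respond is a small probe script. The sketch below uses only the Python standard library. The helper names `is_reachable` and `check_endpoints` are illustrative and not part of the project, and treating any status below 500 as "up" is an assumption.

```python
import urllib.error
import urllib.request

# Local endpoints as listed in this README.
ENDPOINTS = {
    "app": "http://localhost:5173",
    "api": "http://localhost:8000",
    "api_docs": "http://localhost:8000/docs",
}


def is_reachable(url, timeout=2.0):
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as exc:
        # 4xx still proves the server is up; 5xx counts as unhealthy.
        return exc.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return False


def check_endpoints(endpoints=ENDPOINTS, probe=is_reachable):
    """Map each endpoint name to True/False using the given probe."""
    return {name: probe(url) for name, url in endpoints.items()}
```

Calling `check_endpoints()` with no arguments probes the real URLs; passing a custom `probe` keeps the function testable without a running stack.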
Demo credentials:
- Email: admin@datawizz.local
- Password: datawizz123
./run.sh local
./run.sh local --restart
./run.sh docker
./run.sh docker superset
cd backend
python3 -m venv .venv
source .venv/bin/activate
pip install -e '.[dev]'
cp .env.example .env
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
Notes:
- The backend targets PostgreSQL by default.
- For quick local demos, the launcher can use SQLite-backed metadata automatically.
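The two notes above describe a fallback: PostgreSQL by default, SQLite for quick local demos. A minimal sketch of that selection logic follows. The variable name `DATABASE_URL`, both default URLs, and the `local_demo` flag are assumptions for illustration; this README does not show the project's real settings.

```python
import os

# Hypothetical defaults, chosen only for this sketch.
DEFAULT_POSTGRES_URL = "postgresql://localhost:5432/datawizz"
LOCAL_SQLITE_URL = "sqlite:///./datawizz_local.db"


def resolve_metadata_url(env=None, local_demo=False):
    """Pick the metadata store URL.

    An explicit DATABASE_URL (assumed name) always wins; otherwise fall
    back to SQLite in local demo mode and PostgreSQL everywhere else.
    """
    env = os.environ if env is None else env
    explicit = env.get("DATABASE_URL")
    if explicit:
        return explicit
    return LOCAL_SQLITE_URL if local_demo else DEFAULT_POSTGRES_URL


print(resolve_metadata_url(env={}, local_demo=True))
```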
cd frontend
npm install
cp .env.example .env
npm run dev
docker compose up --build
Included services:
- Frontend
- FastAPI backend
- PostgreSQL
- MinIO
- Optional Superset profile
Optional Superset:
docker compose --profile superset up --build
For a complete scripted walkthrough, see docs/demo-workflow.md.
Suggested first demo:
- Upload sample_data/sales.csv and sample_data/customers.csv
- Query raw_sales in the SQL workspace
- Write sales_curated as a Delta table
- Open the catalog and inspect the curated asset
- Open Engine Lab and run a DuckDB, Spark, or DataFusion notebook cell
- Run the sample visual pipeline
- Build charts and review the published BI dashboard
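To see what the "query raw_sales" step computes without starting the app, the sketch below runs a regional aggregation on a few made-up rows. It uses Python's built-in `sqlite3` purely as a stand-in for DuckDB, so it runs with no extra dependencies. The rows are invented for illustration and are not the contents of sample_data/sales.csv; the `region` and `revenue` columns are taken from the sample queries in this README.

```python
import sqlite3

# Made-up rows for illustration only.
ROWS = [
    ("north", 120.0),
    ("south", 80.0),
    ("north", 30.0),
]

QUERY = """
SELECT region, SUM(revenue) AS total_revenue
FROM raw_sales
GROUP BY region
ORDER BY total_revenue DESC
"""


def regional_revenue(rows):
    """Load rows into an in-memory table and total revenue per region."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE raw_sales (region TEXT, revenue REAL)")
        conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


print(regional_revenue(ROWS))  # [('north', 150.0), ('south', 80.0)]
```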
Regional revenue:
SELECT
region,
SUM(revenue) AS total_revenue
FROM raw_sales
GROUP BY region
ORDER BY total_revenue DESC;
Monthly revenue:
SELECT
strftime(order_date, '%Y-%m') AS month,
SUM(revenue) AS total_revenue
FROM raw_sales
GROUP BY 1
ORDER BY 1;
Top customers:
SELECT
customer_id,
SUM(revenue) AS total_revenue
FROM raw_sales
GROUP BY customer_id
ORDER BY total_revenue DESC
LIMIT 10;
This repo has been locally verified with:
- python3 -m compileall backend/app backend/alembic
- npm run build
- Backend smoke checks for file upload, SQL execution, Delta writes, notebook runtime flows, pipelines, BI flows, and report scheduling
- DuckDB remains the primary SQL workspace engine
- Spark and DataFusion are available through the notebook runtime surface
- Delta publishing is implemented through the backend write services
- Scheduling is now active in-app for saved cron pipelines
- Notebook outputs persist per cell and restore when a notebook is reopened
- The BI layer is intentionally lightweight and app-native; Superset remains the richer external optional path
- Authentication and RBAC hardening
- Flink streaming support
- Great Expectations quality checks
- OpenLineage integration
- Hive Metastore or Nessie-backed catalog options
- Notebook export artifacts and richer collaboration flows
- CI/CD, monitoring, and Kubernetes deployment
- Natural-language chart generation
- Dashboard sharing and permissions
- Row-level security and column masking
- Semantic metrics layer
- Alerts, subscriptions, and richer export delivery
DataWizz is built to show what a modern internal analytics platform can look like when lakehouse workflows, orchestration, and BI are treated as one cohesive product surface.






