GitHub - Rohan-flutterint/DataWizz: DataWizz is a data warehouse / lakehouse platform inspired by Databricks, Snowflake, ClickHouse Cloud, Airflow, Superset, and Metabase.

A lakehouse, orchestration, and BI workspace for modern analytics teams.

DataWizz is a data platform inspired by Databricks, Snowflake, ClickHouse Cloud, Airflow, Superset, and Metabase. It combines file ingestion, SQL exploration, Delta Lake publishing, multi-engine notebooks, low-code orchestration, and business dashboards in one local-first workspace.

The current version is intentionally built as a serious MVP rather than a toy demo:

Upload, preview, and profile raw files
Query raw and curated data with DuckDB
Run notebooks with DuckDB, PySpark, and DataFusion
Publish transformed outputs as Delta tables
Build and validate visual pipelines
Track runs, retries, and logs
Create semantic datasets, charts, dashboards, and scheduled reports
Switch between dark and light workspace themes
Run locally with one script or as a Docker demo stack

Product Tour


Workspace access: A polished dark-mode entry experience with demo credentials, platform positioning, and a more presentation-ready first impression.

Lakehouse home: Monitor files, Delta assets, pipeline health, and workspace activity from a single landing page.

SQL workspace: Query raw files and curated outputs, inspect history, and write results back to Delta Lake.

Curated catalog: Browse governed Delta assets with ownership, freshness, schema, and preview data.

Notebook runtime lab: Build saved multi-cell notebooks, switch between DuckDB, Spark, and DataFusion, insert source-aware snippets, and persist per-cell outputs.

Pipeline builder: Design low-code flows, validate graph rules, schedule recurring runs, and export Airflow-style DAGs.

BI dashboard layer: Publish chart-driven dashboards, apply shared filters, and generate JSON or mock snapshot exports for stakeholder-ready reporting surfaces.

Why DataWizz

DataWizz is designed for analytics engineering and platform demos where you want a believable lakehouse product surface without needing a full distributed stack on day one.

It is especially useful when you want to demonstrate:

Raw-to-curated data workflows
SQL-first transformation on local or object-backed files
Notebook-driven prototyping across multiple execution engines
Delta Lake publishing with metadata tracking
Airflow-like orchestration without leaving the app
In-app BI dashboards on top of curated outputs

Core Capabilities

Lakehouse

File upload, preview, schema inference, and deletion
SQL querying over CSV, JSON, Parquet, and curated Delta tables
Write query outputs to Delta Lake with append or overwrite modes
Catalog browsing with metadata, freshness, ownership, tags, and lineage hints
Theme-aware workspace shell with dark and light presentation modes
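
To make the schema inference step listed above more concrete, here is a minimal sketch of how column types could be guessed from a CSV sample. It is not DataWizz's actual implementation: the function names, the type labels, and the `order_id` column are assumptions for illustration, and only the Python standard library is used.

```python
import csv
import io
from datetime import date


def infer_type(values):
    """Return the narrowest type label that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "VARCHAR"
    # Try parsers from narrowest to widest; the labels are illustrative.
    for label, parse in (("BIGINT", int), ("DOUBLE", float), ("DATE", date.fromisoformat)):
        try:
            for value in non_empty:
                parse(value)
            return label
        except ValueError:
            continue
    return "VARCHAR"


def infer_schema(csv_text, sample_rows=100):
    """Infer a {column: type} mapping from the first rows of a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [row for _, row in zip(range(sample_rows), reader)]
    columns = reader.fieldnames or []
    # Missing cells come back as None from DictReader; treat them as empty.
    return {col: infer_type([row.get(col) or "" for row in rows]) for col in columns}


if __name__ == "__main__":
    sample = (
        "order_id,region,revenue,order_date\n"
        "1,EMEA,120.5,2024-01-05\n"
        "2,APAC,80,2024-02-11\n"
    )
    print(infer_schema(sample))
```

Sampling only the first rows keeps previews fast, at the cost of occasionally mislabeling a column whose later rows break the pattern.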

Notebook Runtime

Multi-cell saved notebooks in the Engine Lab
Real local execution for DuckDB, PySpark, and DataFusion
Run-all, run-single-cell, and run-from-here execution flows
Notebook duplicate, delete, rename, and run history support
Source-aware asset browser with one-click SQL or Python snippet insertion
Persisted per-cell outputs so reopened notebooks restore the latest visible state
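
The run-all, run-single-cell, and run-from-here flows listed above amount to choosing a slice of the notebook's cell order. The sketch below illustrates that idea only. The function name, the mode strings, and the cell IDs are hypothetical and do not reflect the DataWizz API.

```python
def cells_to_run(cell_ids, mode, selected=None):
    """Return the ordered cell IDs that a given run mode would execute."""
    if mode not in {"all", "single", "from_here"}:
        raise ValueError(f"unknown run mode: {mode}")
    if mode == "all":
        return list(cell_ids)
    if selected not in cell_ids:
        raise ValueError(f"unknown cell: {selected}")
    start = cell_ids.index(selected)
    if mode == "single":
        return [cell_ids[start]]
    # from_here: the selected cell and everything after it, in order.
    return list(cell_ids[start:])


if __name__ == "__main__":
    notebook = ["load", "clean", "aggregate", "plot"]
    print(cells_to_run(notebook, "from_here", "aggregate"))
```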

Orchestration

Visual pipeline builder powered by React Flow
File source, Delta source, filter, select, join, aggregate, SQL, validate, write, and schedule nodes
DAG validation, node guardrails, run history, retries, and detailed logs
Airflow DAG code generation and export
Backend recurring scheduler for saved cron pipelines
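
As a rough illustration of the DAG validation and node guardrails mentioned above, the sketch below checks a pipeline graph for unknown nodes, cycles, and two example rules. It uses `graphlib` from the Python standard library (3.9+). The node type names and the specific guardrails are assumptions, not the rules DataWizz actually enforces.

```python
from graphlib import CycleError, TopologicalSorter

# Assumed type names, loosely based on the node list above.
SOURCE_TYPES = {"file_source", "delta_source"}


def validate_pipeline(nodes, edges):
    """Validate a pipeline graph.

    nodes: dict mapping node ID -> node type
    edges: list of (source_id, target_id) tuples
    Returns (errors, execution_order); the order is empty if the graph is invalid.
    """
    unknown = sorted({n for edge in edges for n in edge if n not in nodes})
    if unknown:
        return [f"edge references unknown node: {n}" for n in unknown], []

    # Map every node to the set of nodes that must run before it.
    preds = {node_id: set() for node_id in nodes}
    for src, dst in edges:
        preds[dst].add(src)

    try:
        order = list(TopologicalSorter(preds).static_order())
    except CycleError as exc:
        cycle = " -> ".join(exc.args[1])
        return [f"cycle detected: {cycle}"], []

    errors = []
    for node_id, kind in nodes.items():
        # Example guardrails (assumed): sources take no input, writes need one.
        if kind in SOURCE_TYPES and preds[node_id]:
            errors.append(f"source node '{node_id}' must not have inputs")
        if kind == "write" and not preds[node_id]:
            errors.append(f"write node '{node_id}' has no input")
    return errors, order


if __name__ == "__main__":
    nodes = {"sales": "file_source", "filter_eu": "filter", "write_out": "write"}
    edges = [("sales", "filter_eu"), ("filter_eu", "write_out")]
    print(validate_pipeline(nodes, edges))
```

The topological order doubles as a valid execution order, so the same pass can both validate a pipeline and plan its run.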

BI Layer

Semantic dataset explorer
Dataset-driven chart builder
Saved chart library with traceability into dashboards and report schedules
Dashboard builder and dashboard viewer
Report scheduler with stored artifacts and snapshot history
Optional Superset integration surface for demo storytelling

Architecture

See the deeper system walkthrough in docs/architecture.md.

At a high level:

Users
  -> React + TypeScript frontend
  -> FastAPI application layer
  -> DuckDB execution services
  -> Delta Lake curated storage
  -> PostgreSQL metadata store
  -> Optional MinIO object storage

Project layout:

frontend/           React app for the workspace UI
backend/            FastAPI APIs, services, models, and migrations
docs/               Architecture, API, demo workflow, and screenshots
sample_data/        CSV fixtures and sample pipeline JSON
storage/            raw/, curated/, and temp/ runtime zones
docker-compose.yml  Demo stack for frontend, backend, PostgreSQL, MinIO, Superset
run.sh              One-command local launcher

Quick Start

One command

From the project root:

./run.sh

This launcher:

Reuses healthy local frontend and backend processes when they are already running
Starts the app in local demo mode when Docker is unavailable
Supports a Docker-based stack when Docker is installed

Local endpoints:

App: http://localhost:5173
API: http://localhost:8000
API docs: http://localhost:8000/docs
Superset setup page: http://localhost:5173/bi/superset
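
To confirm that the launcher brought the services up, a small check like the one below can probe the endpoints listed above. This is a sketch, not part of the repository: the `is_reachable` helper is hypothetical, and it only tests whether something answers at each URL, not whether the app is healthy.

```python
import urllib.error
import urllib.request

# URLs taken from the local endpoints list above.
ENDPOINTS = {
    "App": "http://localhost:5173",
    "API": "http://localhost:8000",
    "API docs": "http://localhost:8000/docs",
}


def is_reachable(url, timeout=2.0):
    """Return True if a server answers at url, even with an error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, just not with a 2xx status.
        return True
    except (OSError, ValueError):
        # Connection refused, timeout, DNS failure, or a malformed URL.
        return False


if __name__ == "__main__":
    for name, url in ENDPOINTS.items():
        status = "up" if is_reachable(url) else "down"
        print(f"{name:<10} {url:<32} {status}")
```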

Demo credentials:

Email: admin@datawizz.local
Password: datawizz123

Other launcher modes

./run.sh local
./run.sh local --restart
./run.sh docker
./run.sh docker superset

Local Development

Backend

cd backend
python3 -m venv .venv
source .venv/bin/activate
pip install -e '.[dev]'
cp .env.example .env
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Notes:

The backend targets PostgreSQL by default.
For quick local demos, the launcher can use SQLite-backed metadata automatically.

Frontend

cd frontend
npm install
cp .env.example .env
npm run dev

Docker Demo Stack

docker compose up --build

Included services:

Frontend
FastAPI backend
PostgreSQL
MinIO
Optional Superset profile

Optional Superset:

docker compose --profile superset up --build

Demo Flow

For a complete scripted walkthrough, see docs/demo-workflow.md.

Suggested first demo:

Upload sample_data/sales.csv and sample_data/customers.csv
Query raw_sales in the SQL workspace
Write sales_curated as a Delta table
Open the catalog and inspect the curated asset
Open Engine Lab and run a DuckDB, Spark, or DataFusion notebook cell
Run the sample visual pipeline
Build charts and review the published BI dashboard

Sample SQL

Regional revenue:

SELECT
  region,
  SUM(revenue) AS total_revenue
FROM raw_sales
GROUP BY region
ORDER BY total_revenue DESC;

Monthly revenue:

SELECT
  strftime(order_date, '%Y-%m') AS month,
  SUM(revenue) AS total_revenue
FROM raw_sales
GROUP BY 1
ORDER BY 1;

Top customers:

SELECT
  customer_id,
  SUM(revenue) AS total_revenue
FROM raw_sales
GROUP BY customer_id
ORDER BY total_revenue DESC
LIMIT 10;
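
To try the regional revenue and top customers queries outside the app, the sketch below runs them from Python. It makes two loud assumptions: the rows are invented rather than taken from sample_data/sales.csv, and Python's built-in `sqlite3` stands in for DuckDB so that nothing needs installing. The monthly revenue query is left out because SQLite's `strftime` takes the format string first, the reverse of the DuckDB call shown above.

```python
import sqlite3

# Invented rows for illustration only: (customer_id, region, revenue).
ROWS = [
    ("c1", "EMEA", 120.0),
    ("c2", "APAC", 80.0),
    ("c1", "EMEA", 50.0),
    ("c3", "AMER", 200.0),
]

REGIONAL_REVENUE = """
SELECT region, SUM(revenue) AS total_revenue
FROM raw_sales
GROUP BY region
ORDER BY total_revenue DESC
"""

TOP_CUSTOMERS = """
SELECT customer_id, SUM(revenue) AS total_revenue
FROM raw_sales
GROUP BY customer_id
ORDER BY total_revenue DESC
LIMIT 10
"""


def build_connection(rows):
    """Create an in-memory raw_sales table holding the given rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE raw_sales (customer_id TEXT, region TEXT, revenue REAL)")
    con.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", rows)
    return con


if __name__ == "__main__":
    con = build_connection(ROWS)
    print(con.execute(REGIONAL_REVENUE).fetchall())
    print(con.execute(TOP_CUSTOMERS).fetchall())
```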

Documentation

Verification

This repo has been locally verified with:

python3 -m compileall backend/app backend/alembic
npm run build
backend smoke checks for file upload, SQL execution, Delta writes, notebook runtime flows, pipelines, BI flows, and report scheduling

Current MVP Notes

DuckDB remains the primary SQL workspace engine
Spark and DataFusion are available through the notebook runtime surface
Delta publishing is implemented through the backend write services
Scheduling is now active in-app for saved cron pipelines
Notebook outputs persist per cell and restore when a notebook is reopened
The BI layer is intentionally lightweight and app-native; Superset remains the richer, optional external path

Roadmap

Platform

Authentication and RBAC hardening
Flink streaming support
Great Expectations quality checks
OpenLineage integration
Hive Metastore or Nessie-backed catalog options
Notebook export artifacts and richer collaboration flows
CI/CD, monitoring, and Kubernetes deployment

BI and Analytics

Natural-language chart generation
Dashboard sharing and permissions
Row-level security and column masking
Semantic metrics layer
Alerts, subscriptions, and richer export delivery

DataWizz is built to show what a modern internal analytics platform can look like when lakehouse workflows, orchestration, and BI are treated as one cohesive product surface.

