Rohan-flutterint/DataWizz

DataWizz

A lakehouse, orchestration, and BI workspace for modern analytics teams.

DataWizz is a data platform inspired by Databricks, Snowflake, ClickHouse Cloud, Airflow, Superset, and Metabase. It combines file ingestion, SQL exploration, Delta Lake publishing, multi-engine notebooks, low-code orchestration, and business dashboards in one local-first workspace.

The current version is intentionally built as a serious MVP rather than a toy demo:

  • Upload, preview, and profile raw files
  • Query raw and curated data with DuckDB
  • Run notebooks with DuckDB, PySpark, and DataFusion
  • Publish transformed outputs as Delta tables
  • Build and validate visual pipelines
  • Track runs, retries, and logs
  • Create semantic datasets, charts, dashboards, and scheduled reports
  • Switch between dark and light workspace themes
  • Run locally with one script or as a Docker demo stack

Product Tour

  • Workspace access: A dark-mode login screen with demo credentials and platform positioning.
  • Lakehouse home: Monitor files, Delta assets, pipeline health, and workspace activity from a single landing page.
  • SQL workspace: Query raw files and curated outputs, inspect history, and write results back to Delta Lake.
  • Curated catalog: Browse governed Delta assets with ownership, freshness, schema, and preview data.
  • Notebook runtime lab: Build saved multi-cell notebooks, switch between DuckDB, Spark, and DataFusion, insert source-aware snippets, and persist per-cell outputs.
  • Pipeline builder: Design low-code flows, validate graph rules, schedule recurring runs, and export Airflow-style DAGs.
  • BI dashboard layer: Publish chart-driven dashboards, apply shared filters, and generate JSON or mock snapshot exports for stakeholder reporting.

Why DataWizz

DataWizz is designed for analytics engineering and platform demos where you want a believable lakehouse product surface without needing a full distributed stack on day one.

It is especially useful when you want to demonstrate:

  • Raw-to-curated data workflows
  • SQL-first transformation on local or object-backed files
  • Notebook-driven prototyping across multiple execution engines
  • Delta Lake publishing with metadata tracking
  • Airflow-like orchestration without leaving the app
  • In-app BI dashboards on top of curated outputs

Core Capabilities

Lakehouse

  • File upload, preview, schema inference, and deletion
  • SQL querying over CSV, JSON, Parquet, and curated Delta tables
  • Write query outputs to Delta Lake with append or overwrite modes
  • Catalog browsing with metadata, freshness, ownership, tags, and lineage hints
  • Theme-aware workspace shell with dark and light presentation modes
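
To make the schema inference step concrete, here is a minimal standard-library sketch of how column types could be guessed from an uploaded CSV. It is an illustration under stated assumptions, not the backend's actual logic: the function names, the type labels, and the sample rows are all hypothetical, and the real service may rely on DuckDB's own CSV sniffing instead.

```python
import csv
import io
from datetime import date

# Illustrative rows only; column names mirror the sample SQL in this README.
SAMPLE_CSV = (
    "order_id,order_date,region,revenue\n"
    "1,2024-01-05,EMEA,120.50\n"
    "2,2024-01-06,APAC,80\n"
)


def infer_type(values: list[str]) -> str:
    """Return the first candidate type that parses every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "VARCHAR"
    candidates = (("BIGINT", int), ("DOUBLE", float), ("DATE", date.fromisoformat))
    for name, parse in candidates:
        try:
            for v in non_empty:
                parse(v)
            return name
        except ValueError:
            continue
    return "VARCHAR"


def infer_schema(csv_text: str, sample_rows: int = 100) -> dict[str, str]:
    """Infer a column -> type mapping from the first `sample_rows` rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [row for _, row in zip(range(sample_rows), reader)]
    # Missing trailing fields come back as None, so normalise them to "".
    return {
        col: infer_type([(r[col] or "") for r in rows])
        for col in reader.fieldnames or []
    }


schema = infer_schema(SAMPLE_CSV)
```

Sampling only the first rows keeps previews fast, at the cost of occasionally mis-typing a column whose later rows differ.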

Notebook Runtime

  • Multi-cell saved notebooks in the Engine Lab
  • Real local execution for DuckDB, PySpark, and DataFusion
  • Run-all, run-single-cell, and run-from-here execution flows
  • Notebook duplicate, delete, rename, and run history support
  • Source-aware asset browser with one-click SQL or Python snippet insertion
  • Persisted per-cell outputs so reopened notebooks restore the latest visible state

Orchestration

  • Visual pipeline builder powered by React Flow
  • File source, Delta source, filter, select, join, aggregate, SQL, validate, write, and schedule nodes
  • DAG validation, node guardrails, run history, retries, and detailed logs
  • Airflow DAG code generation and export
  • Backend recurring scheduler for saved cron pipelines
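
As a rough illustration of DAG validation followed by Airflow export, the sketch below checks a pipeline graph for unknown nodes and cycles, then renders Airflow-style DAG source as text. Everything here is an assumption made for the example: the pipeline shape (node id mapped to upstream ids), the function names, and the use of `EmptyOperator` placeholders. The generated code targets the Airflow 2.4+ `schedule` argument and is only rendered, never executed.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline shape: node id -> list of upstream node ids.
# Node ids must be valid Python identifiers for the export below.
PIPELINE = {
    "load_sales": [],
    "load_customers": [],
    "join_orders": ["load_sales", "load_customers"],
    "aggregate_revenue": ["join_orders"],
    "write_delta": ["aggregate_revenue"],
}


def validate_pipeline(graph: dict[str, list[str]]) -> list[str]:
    """Return node ids in a runnable order, or raise ValueError if invalid."""
    referenced = {up for ups in graph.values() for up in ups}
    unknown = sorted(referenced - set(graph))
    if unknown:
        raise ValueError(f"edges reference unknown nodes: {unknown}")
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError(f"pipeline contains a cycle: {exc.args[1]}") from exc


def to_airflow_dag(dag_id: str, graph: dict[str, list[str]], cron: str) -> str:
    """Render Airflow-style DAG source code as a string."""
    order = validate_pipeline(graph)
    lines = [
        "from datetime import datetime",
        "from airflow import DAG",
        "from airflow.operators.empty import EmptyOperator",
        "",
        f"with DAG(dag_id={dag_id!r}, schedule={cron!r},",
        "         start_date=datetime(2024, 1, 1), catchup=False) as dag:",
    ]
    # One placeholder task per node, then one dependency line per edge.
    lines += [f"    {node} = EmptyOperator(task_id={node!r})" for node in order]
    lines += [f"    {up} >> {node}" for node in order for up in graph[node]]
    return "\n".join(lines) + "\n"


dag_source = to_airflow_dag("sales_daily", PIPELINE, "0 6 * * *")
```

Validating before export means a cyclic or dangling graph fails early with a readable error instead of producing a DAG file that Airflow would reject.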

BI Layer

  • Semantic dataset explorer
  • Dataset-driven chart builder
  • Saved chart library with traceability into dashboards and report schedules
  • Dashboard builder and dashboard viewer
  • Report scheduler with stored artifacts and snapshot history
  • Optional Superset integration surface for demo storytelling

Architecture

See the deeper system walkthrough in docs/architecture.md.

At a high level:

Users
  -> React + TypeScript frontend
  -> FastAPI application layer
  -> DuckDB execution services
  -> Delta Lake curated storage
  -> PostgreSQL metadata store
  -> Optional MinIO object storage

Project layout:

frontend/           React app for the workspace UI
backend/            FastAPI APIs, services, models, and migrations
docs/               Architecture, API, demo workflow, and screenshots
sample_data/        CSV fixtures and sample pipeline JSON
storage/            raw/, curated/, and temp/ runtime zones
docker-compose.yml  Demo stack for frontend, backend, PostgreSQL, MinIO, Superset
run.sh              One-command local launcher

Quick Start

One command

From the project root:

./run.sh

This launcher:

  • Reuses healthy local frontend and backend processes when they are already running
  • Starts the app in local demo mode when Docker is unavailable
  • Supports a Docker-based stack when Docker is installed

Local endpoints:

  • App: http://localhost:5173
  • API: http://localhost:8000
  • API docs: http://localhost:8000/docs
  • Superset setup page: http://localhost:5173/bi/superset
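
To confirm the endpoints above respond after launch, a small standard-library probe can help. This is a sketch and not part of the repo: `ENDPOINTS` and `is_up` are hypothetical names, and any HTTP status below 500 is treated as "responding".

```python
import http.client
import urllib.error
import urllib.request

# URLs taken from the endpoint list in this README.
ENDPOINTS = {
    "App": "http://localhost:5173",
    "API": "http://localhost:8000",
    "API docs": "http://localhost:8000/docs",
}


def is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as exc:
        # The server answered, just not with a 2xx/3xx status.
        return exc.code < 500
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or a malformed response.
        return False
```

Calling `is_up(url)` for each entry in `ENDPOINTS` gives a quick up/down summary before starting a demo.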

Demo credentials:

  • Email: admin@datawizz.local
  • Password: datawizz123

Other launcher modes

./run.sh local
./run.sh local --restart
./run.sh docker
./run.sh docker superset

Local Development

Backend

cd backend
python3 -m venv .venv
source .venv/bin/activate
pip install -e '.[dev]'
cp .env.example .env
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Notes:

  • The backend targets PostgreSQL by default.
  • For quick local demos, the launcher can use SQLite-backed metadata automatically.

Frontend

cd frontend
npm install
cp .env.example .env
npm run dev

Docker Demo Stack

docker compose up --build

Included services:

  • Frontend
  • FastAPI backend
  • PostgreSQL
  • MinIO
  • Optional Superset profile

Optional Superset:

docker compose --profile superset up --build

Demo Flow

For a complete scripted walkthrough, see docs/demo-workflow.md.

Suggested first demo:

  1. Upload sample_data/sales.csv and sample_data/customers.csv
  2. Query raw_sales in the SQL workspace
  3. Write sales_curated as a Delta table
  4. Open the catalog and inspect the curated asset
  5. Open Engine Lab and run a DuckDB, Spark, or DataFusion notebook cell
  6. Run the sample visual pipeline
  7. Build charts and review the published BI dashboard

Sample SQL

Regional revenue:

SELECT
  region,
  SUM(revenue) AS total_revenue
FROM raw_sales
GROUP BY region
ORDER BY total_revenue DESC;

Monthly revenue:

SELECT
  strftime(order_date, '%Y-%m') AS month,
  SUM(revenue) AS total_revenue
FROM raw_sales
GROUP BY 1
ORDER BY 1;

Top customers:

SELECT
  customer_id,
  SUM(revenue) AS total_revenue
FROM raw_sales
GROUP BY customer_id
ORDER BY total_revenue DESC
LIMIT 10;
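
To try the regional and top-customer queries outside the app, the sketch below runs them against an in-memory table. It uses Python's built-in `sqlite3` purely as a stand-in for DuckDB so it needs no extra install; the rows are made up for illustration. The monthly query is left out because its `strftime(order_date, '%Y-%m')` argument order is DuckDB-specific.

```python
import sqlite3

# Made-up rows for illustration; not taken from sample_data/sales.csv.
ROWS = [
    ("C001", "EMEA", 120.0),
    ("C002", "APAC", 80.0),
    ("C001", "EMEA", 50.0),
    ("C003", "AMER", 200.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (customer_id TEXT, region TEXT, revenue REAL)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", ROWS)

# Regional revenue, highest first.
regional = conn.execute(
    """
    SELECT region, SUM(revenue) AS total_revenue
    FROM raw_sales
    GROUP BY region
    ORDER BY total_revenue DESC
    """
).fetchall()

# Top customers by revenue.
top_customers = conn.execute(
    """
    SELECT customer_id, SUM(revenue) AS total_revenue
    FROM raw_sales
    GROUP BY customer_id
    ORDER BY total_revenue DESC
    LIMIT 10
    """
).fetchall()

conn.close()
```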

Documentation

  • Architecture walkthrough: docs/architecture.md
  • Demo workflow: docs/demo-workflow.md

Verification

This repo has been locally verified with:

  • python3 -m compileall backend/app backend/alembic
  • npm run build
  • backend smoke checks for file upload, SQL execution, Delta writes, notebook runtime flows, pipelines, BI flows, and report scheduling

Current MVP Notes

  • DuckDB remains the primary SQL workspace engine
  • Spark and DataFusion are available through the notebook runtime surface
  • Delta publishing is implemented through the backend write services
  • Scheduling is now active in-app for saved cron pipelines
  • Notebook outputs persist per cell and restore when a notebook is reopened
  • The BI layer is intentionally lightweight and app-native; Superset remains the richer, optional external path
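
For readers curious how an in-app scheduler might decide that a saved cron pipeline is due, here is a minimal five-field cron matcher. It is a sketch under assumptions, not the backend's scheduler: the function names are hypothetical, only `*`, numbers, ranges, lists, and `/step` are handled, and Sunday must be written as `0`.

```python
from datetime import datetime


def _field_matches(expr: str, value: int, lo: int) -> bool:
    """Check one cron field (e.g. '*/15', '1-5', '0,30') against a value."""
    for part in expr.split(","):
        rng, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if rng == "*":
            start, end = lo, value  # wildcard: any value from the field minimum
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = end = int(rng)
        if start <= value <= end and (value - start) % step_n == 0:
            return True
    return False


def cron_matches(expr: str, when: datetime) -> bool:
    """Return True if `when` (to the minute) satisfies the cron expression."""
    minute, hour, dom, month, dow = expr.split()
    time_ok = (
        _field_matches(minute, when.minute, 0)
        and _field_matches(hour, when.hour, 0)
        and _field_matches(month, when.month, 1)
    )
    dom_ok = _field_matches(dom, when.day, 1)
    dow_ok = _field_matches(dow, when.isoweekday() % 7, 0)  # Sunday == 0
    # Classic cron rule: if both day fields are restricted, either may match.
    if dom != "*" and dow != "*":
        day_ok = dom_ok or dow_ok
    else:
        day_ok = dom_ok and dow_ok
    return time_ok and day_ok


due = cron_matches("0 6 * * *", datetime(2024, 1, 1, 6, 0))
```

A scheduler loop would call `cron_matches` once per minute for each saved pipeline and trigger a run when it returns true.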

Roadmap

Platform

  • Authentication and RBAC hardening
  • Flink streaming support
  • Great Expectations quality checks
  • OpenLineage integration
  • Hive Metastore or Nessie-backed catalog options
  • Notebook export artifacts and richer collaboration flows
  • CI/CD, monitoring, and Kubernetes deployment

BI and Analytics

  • Natural-language chart generation
  • Dashboard sharing and permissions
  • Row-level security and column masking
  • Semantic metrics layer
  • Alerts, subscriptions, and richer export delivery

DataWizz is built to show what a modern internal analytics platform can look like when lakehouse workflows, orchestration, and BI are treated as one cohesive product surface.
