# DataHelm

DataHelm is a data engineering framework focused on the following:
- Source ingestion and orchestration
- dbt transformation workflows
- Notebook-based dashboard execution
- Reusable provider connectors (SharePoint, GCS, S3, BigQuery)
- Optional local LLM analytics query scaffolding
## Table of Contents

- Core Capabilities
- High-Level Architecture
- Repository Structure
- Local Setup
- Configuration Model
- Reusable Connectors
- Local LLM Analytics Module
- Testing
- CI/CD and Branching
- Containerization
- Deployment
- Contributing and Governance
- Detailed Technical Documentation
## Core Capabilities

- Config-driven ingestion using YAML in `config/api/`
- Dagster orchestration for managing jobs, schedules, and sensors
- dbt project execution through `analytics/dbt_runner.py` and dbt configuration files
- Dashboard generation with Dagstermill notebooks
- Reusable handlers/connectors for multiple external providers
- Optional NL-to-SQL module (`analytics/nl_query/`) for local Ollama-based analytics workflows
## High-Level Architecture

The repository follows a layered responsibility structure:

- `handlers/`: provider-specific source connectors and API handlers
- `ingestion/`: ingestion factory and native ingestion implementations
- `analytics/`: dbt, dashboard, and optional NL-query modules
- `dagster_op/`: orchestration objects (jobs, schedules, repository)
- `config/`: all runtime configuration (API, dbt, dashboard, analytics metadata)
- `tests/`: unit tests for handlers, ingestion, analytics, and scripts
## Repository Structure

```
config/
  api/
  dbt/
  dashboard/
  analytics/
analytics/
  dbt_projects/
  notebooks/
  nl_query/
dagster_op/
handlers/
  api/
  sharepoint/
  gcs/
  s3/
  bigquery/
ingestion/
tests/
scripts/
docs/
```
## Local Setup

Prerequisites:

- Python 3.12+
- PostgreSQL (accessible from the local environment)
- Optional: Docker, local Ollama, dbt CLI
Run the following commands to set up the local environment:

```shell
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e .
```

Create a file named `.env` in the root of the repository with the required values, for example:
```
DB_HOST=${DB_HOST}
DB_PORT=${DB_PORT}
DB_USER=${DB_USER}
DB_PASSWORD=${DB_PASSWORD}
DB_NAME=${DB_NAME}
CLASHOFCLANS_API_TOKEN=${CLASHOFCLANS_API_TOKEN}
```
To start Dagster locally, run:

```shell
python scripts/run_dagster_dev.py
```

For a quick verification without executing jobs, run:

```shell
python scripts/run_dagster_dev.py --print-only
```

## Configuration Model

- `config/api/`: defines source-level extraction, publish targets, schedules, and column mapping. Example included: `CLASHOFCLANS_PLAYER_STATS`
- `config/dbt/`: defines dbt units, selection/exclusion rules, variables, and schedules.
- `config/dashboard/`: defines notebook path, source table mapping, chart columns, and cadence.
- `config/analytics/`: defines dataset metadata for the isolated NL-to-SQL module.
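As an illustration only, a source definition under `config/api/` might look like the following; every key name and value here is hypothetical and not taken from the actual schema:

```yaml
# Hypothetical config/api/ source definition; the real key names may differ.
source: CLASHOFCLANS_PLAYER_STATS
extraction:
  token_env: CLASHOFCLANS_API_TOKEN   # env var holding the API token
publish:
  target_table: player_stats          # destination table in PostgreSQL
schedule:
  cron: "0 6 * * *"                   # daily at 06:00
column_mapping:
  tag: player_tag                     # source field -> published column
  trophies: trophies
```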
## Reusable Connectors

The repository includes reusable connector classes under `handlers/`:

- `handlers/sharepoint/sharepoint.py`: Microsoft Graph authentication and site/file access helpers
- `handlers/gcs/gcs.py`: upload/download/list/delete/signed URL helpers
- `handlers/s3/s3.py`: upload/download/list/delete/presigned URL helpers
- `handlers/bigquery/bigquery.py`: query, row fetch, dataframe load, schema helpers
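The GCS and S3 connectors share a common upload/download/list/delete surface. As a rough sketch of that shared shape (the class and method names below are illustrative, not the actual DataHelm classes), an in-memory stand-in might look like:

```python
class InMemoryObjectStore:
    """Illustrative stand-in for the shared connector surface
    (upload/download/list/delete); not an actual DataHelm class."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def upload(self, key: str, data: bytes) -> None:
        # Store the payload under the given object key.
        self._objects[key] = data

    def download(self, key: str) -> bytes:
        # Raises KeyError if the object does not exist.
        return self._objects[key]

    def list(self, prefix: str = "") -> list[str]:
        # Return keys matching the prefix, sorted for determinism.
        return sorted(k for k in self._objects if k.startswith(prefix))

    def delete(self, key: str) -> None:
        # Deleting a missing key is a no-op, mirroring object-store semantics.
        self._objects.pop(key, None)
```

Keeping every provider behind one interface like this is what lets ingestion code swap GCS for S3 without changes.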
## Local LLM Analytics Module

`analytics/nl_query/` is an isolated module for natural-language-to-SQL generation using local Ollama:

- Semantic catalog loader
- SQL read-only safety guard
- Ollama client wrapper
- Orchestration service
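To illustrate the read-only safety guard idea, here is a minimal sketch under assumed rules (the actual `nl_query` checks may differ): accept only a single `SELECT`/`WITH` statement and reject anything containing a mutating keyword.

```python
import re

# Keywords that indicate a mutating or DDL statement; illustrative list only.
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|revoke)\b",
    re.IGNORECASE,
)


def is_read_only(sql: str) -> bool:
    """Return True only for a single SELECT/WITH statement with no
    mutating keywords. A sketch, not the actual nl_query guard."""
    stripped = sql.strip().rstrip(";").strip()
    if ";" in stripped:  # more than one statement
        return False
    if not re.match(r"^(select|with)\b", stripped, re.IGNORECASE):
        return False
    return not _FORBIDDEN.search(stripped)
```

A guard like this is the last line of defense when SQL is generated by a model rather than written by a person.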
## Testing

Run all tests with the following command:

```shell
.venv/bin/python -m pytest -q
```

The current test suite includes coverage for:
- Ingestion and handler behavior
- Analytics factory and runner logic
- Connector modules (SharePoint, GCS, S3, BigQuery)
- Script behavior
- NL-query safety and service paths
## CI/CD and Branching

Branches:

- `dev`: integration branch
- `master`: release/production branch
Workflows:
- CI: tests on development and PR flows
- Docker Release: image build/publish on master
- Deploy Release: workflow_run/manual deployment orchestration
## Containerization

The container image is defined via `Dockerfile`. The default runtime command starts the Dagster gRPC server:

```shell
python -m dagster api grpc -m dagster_op.repository
```

## Deployment

Deployment flow is workflow-based:
- Production auto-path after successful Docker release
- Manual staging/production dispatch path
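As a hedged sketch of how both paths can be wired in GitHub Actions (the job contents below are assumptions, not the repository's actual workflow files; only the "Docker Release" workflow name comes from this document):

```yaml
# Hypothetical deploy workflow; the real Deploy Release workflow may differ.
name: Deploy Release
on:
  workflow_run:
    workflows: ["Docker Release"]   # auto-path after a successful image release
    types: [completed]
  workflow_dispatch:                # manual staging/production dispatch path
    inputs:
      environment:
        type: choice
        options: [staging, production]

jobs:
  deploy:
    # Run for manual dispatches, or only when the triggering release succeeded.
    if: ${{ github.event_name == 'workflow_dispatch' || github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "Deploying to ${{ inputs.environment || 'production' }}"
```

The `workflow_run` trigger fires even when the upstream workflow fails, which is why the `if:` guard on the job checks its conclusion.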
## Contributing and Governance

- Contribution guide: `CONTRIBUTING.md`
- Code of conduct: `CODE_OF_CONDUCT.md`
- Security reporting: `SECURITY.md`
## Detailed Technical Documentation

For complete, long-form project documentation (operations, architecture, and runbook-style details), see `docs/document.md`.
