DataSQRL is a framework for building data pipelines with guaranteed data integrity. Ingest data from multiple sources, integrate, transform, store, and serve the result as data APIs, LLM tooling, or Apache Iceberg views.
Data Engineers use DataSQRL to build reliable data pipelines that ensure:
- Consistent data served through real-time data APIs,
- Accurate data delivered as tooling to LLMs and agents,
- Reliable data lakehouses with Iceberg tables and catalog views.
You define the data processing in SQL, and DataSQRL compiles the entire data infrastructure on Apache Flink, Postgres, Iceberg, a GraphQL API, and LLM tooling. It generates the glue code, schemas, mappings, and deployment artifacts that integrate and configure these components into a coherent data stack that is highly available, consistent, scalable, observable, and fast. DataSQRL supports quick local iteration, end-to-end pipeline testing, and deployment to Kubernetes or cloud-managed services.
- 🛡️ Data Integrity Guarantees: Exactly-once processing, consistent data across all outputs, automated data lineage tracking, and comprehensive testing framework.
- 🔒 Production-grade Reliability: Robust, highly available, scalable, observable data pipelines executed by trusted OSS technologies (Kafka, Flink, Postgres, DuckDB).
- 🔗 End-to-End Consistency: DataSQRL generates connectors, schemas, data mappings, SQL dialect translation, and configurations that maintain data integrity across the entire pipeline.
- 🚀 Robust Operations: Local development, CI/CD support, logging framework, reusable components, and composable architecture for reliable pipeline management.
- 🤖 AI-native with Accuracy: Support for vector embeddings, LLM invocation, and ML model inference with accurate data delivery for LLM tooling interfaces.
To learn more about DataSQRL, check out the documentation.
This example builds a data pipeline that captures user token consumption via API, exposes consumption alerts via subscription, and aggregates the data for query access.
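-- Raw token-consumption events, written through the generated GraphQL mutation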
/*+no_query */
CREATE TABLE UserTokens (
userid BIGINT NOT NULL,
tokens BIGINT NOT NULL,
request_time TIMESTAMP_LTZ(3) NOT NULL METADATA FROM 'timestamp'
);
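-- Running aggregate of total tokens and request count per user; exposed as a GraphQL query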
/*+query_by_all(userid) */
TotalUserTokens := SELECT userid, sum(tokens) as total_tokens,
count(tokens) as total_requests
FROM UserTokens GROUP BY userid;
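-- Real-time alert stream for requests consuming more than 100,000 tokens; exposed as a GraphQL subscription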
UsageAlert := SUBSCRIBE SELECT * FROM UserTokens WHERE tokens > 100000;
Create a file `usertokens.sqrl` with the content above and run it with:
docker run -it --rm -p 8888:8888 -p 8081:8081 -p 9092:9092 -v $PWD:/build datasqrl/cmd:latest run usertokens.sqrl
(Use `${PWD}` in PowerShell on Windows.)
The pipeline is exposed through a GraphQL API that you can access at http://localhost:8888/graphiql/ in your browser.
- `UserTokens` is exposed as a mutation for adding data.
- `TotalUserTokens` is exposed as a query for retrieving the aggregated data.
- `UsageAlert` is exposed as a subscription for real-time alerts.
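For example, you can exercise all three endpoints from GraphiQL. The operations below are only an illustrative sketch; the exact argument and field shapes (such as the event argument on the mutation and the userid filter on the query) depend on the GraphQL schema DataSQRL generates for this script, which you can inspect in GraphiQL:
mutation { UserTokens(event: { userid: 1, tokens: 150000 }) { userid tokens } }
query { TotalUserTokens(userid: 1) { userid total_tokens total_requests } }
subscription { UsageAlert { userid tokens } }
Since the mutation above records more than 100,000 tokens, it also triggers an event on the UsageAlert subscription.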
Once you are done, terminate the pipeline with CTRL-C.
To build the deployment assets for the data pipeline, execute:
docker run --rm -v $PWD:/build datasqrl/cmd:latest compile usertokens.sqrl
The `build/deploy` directory contains the Flink compiled plan, Kafka topic definitions, PostgreSQL schema and view definitions, server queries, and the GraphQL data model.
Read the full Getting Started tutorial or check out the DataSQRL Examples repository for additional examples covering Iceberg views, chatbots, data APIs, and more.
As data engineers, we got frustrated by the data integrity challenges in complex pipelines - inconsistent data across systems, lost data due to processing failures, and the difficulty of ensuring end-to-end correctness at scale.
Traditional data tools focus on moving data fast but often sacrifice consistency. DataSQRL prioritizes data integrity while maintaining performance, giving you confidence that your data is accurate and reliable throughout the entire pipeline.
DataSQRL compiles the SQRL scripts and data source/sink definitions into a data processing DAG (Directed Acyclic Graph) according to the configuration. The cost-based optimizer cuts the DAG into segments executed by different engines (e.g. Flink, Kafka, Postgres, Vert.x), generating the necessary physical plans, schemas, and connectors for a fully integrated, reliable, and consistent data pipeline. These deployment assets are then executed in Docker, Kubernetes, or by a managed cloud service.
DataSQRL gives you full visibility and control over the generated data pipeline and uses proven open-source technologies to execute the generated deployment assets.
Learn more about DataSQRL in the documentation.
Our goal is to simplify the development of data pipelines you can trust by compiling robust and consistent data architectures. Your feedback is invaluable in achieving this goal. Let us know what works and what doesn't by filing GitHub issues or in the DataSQRL Slack community.
We welcome code contributions. For more details, check out CONTRIBUTING.md.