Add Qrlew #2555

ngrislain

What is this Python project?

Qrlew (/ˈkɝlu/) is an open source library that rewrites SQL queries into privacy-preserving variants using Differential Privacy (DP).

Use Qrlew if you want to bring privacy guarantees to your SQL pipelines. It is:

  • SQL-to-SQL: Qrlew turns SQL queries into differentially private SQL queries that can be executed at scale on many SQL datastores, in many SQL dialects.
  • Feature-rich: Qrlew covers a broad range of SQL queries, including JOINs and nested queries.
  • Privacy-optimized: Qrlew keeps track of tight bounds and ranges throughout each step, minimizing the amount of noise needed to achieve differential privacy.
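
To illustrate why tight bounds matter, here is a minimal sketch of the textbook Laplace mechanism for a SUM. It is not Qrlew's implementation: the function names are hypothetical, the neighbouring-dataset model (one row added or removed) is an assumption, and plain floating-point sampling like this is not suitable for production DP. The point is only that the noise scale grows with the width of the known range.

```python
import random


def sum_sensitivity(lo: float, hi: float) -> float:
    """Worst-case change of a SUM when one row with a value in [lo, hi] is added or removed."""
    return max(abs(lo), abs(hi))


def laplace_scale(lo: float, hi: float, epsilon: float) -> float:
    """Scale of the Laplace noise needed for an epsilon-DP SUM over values in [lo, hi]."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sum_sensitivity(lo, hi) / epsilon


def dp_sum(values, lo: float, hi: float, epsilon: float, rng: random.Random) -> float:
    """Clip values to [lo, hi], sum them, and add Laplace noise (illustrative only)."""
    clipped = [min(max(v, lo), hi) for v in values]
    scale = laplace_scale(lo, hi, epsilon)
    if scale == 0:
        return float(sum(clipped))
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return sum(clipped) + noise


# A loose bound needs ten times more noise than a tight one for the same epsilon.
print(laplace_scale(0, 1000, 1.0))  # 1000.0
print(laplace_scale(0, 100, 1.0))   # 100.0
```

With the same privacy budget, knowing that a column lies in [0, 100] rather than [0, 1000] divides the noise scale by ten, which is the motivation for tracking ranges through every step of a query.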

What's the difference between this Python project and similar ones?

There are a few existing open-source libraries for differential privacy.

Some libraries focus on deep learning and DP-SGD, such as Opacus, TensorFlow Privacy, or Optax's DP-SGD. Qrlew has a very different goal: analytics and SQL.

GoogleDP is a library implementing many differentially private mechanisms in various languages (C++, Go and Java).
IBM's diffprivlib is also a rich library implementing a wide variety of DP primitives in Python, in particular DP versions of many classical machine learning algorithms.
These libraries provide the building blocks for experts to build DP algorithms. Qrlew takes a very different approach: it is a high-level tool designed to take queries written in SQL by a data practitioner with no expertise in privacy and rewrite them into DP equivalents able to run on any SQL-enabled data store. Qrlew implements very few DP mechanisms to date but automates the whole process of rewriting a query, while these libraries offer a rich variety of DP mechanisms and give users full control over how to use them.

Google built several higher-level tools on top of GoogleDP.
PrivacyOnBeam is a framework to run DP jobs written in Apache Beam with its Go SDK.
PipelineDP is a framework that lets analysts write Beam-like or Spark-like programs and have them run on Apache Spark or Apache Beam as a back-end. It focuses on the Beam and Spark ecosystem, while Qrlew tries to provide an SQL interface to the analyst and runs on SQL-enabled back-ends (including Spark, a variety of data warehouses, and more traditional databases).
ZetaSQL gives the user a way to write SQL-like queries and have them executed on tables using GoogleDB custom code, so it is not compatible with arbitrary SQL data stores and supports only relatively simple queries.

OpenDP is a powerful Rust library with Python bindings. It offers many ways to build complex DP computations by composing basic elements. Nonetheless, it requires both expertise in privacy and learning a new API to describe a query. Also, the computations are handled by the Rust core, so it does not integrate easily with existing data stores and may not scale well either.

Tumult Analytics shares much of the nice composable design of OpenDP, but runs on Apache Spark, making it a scalable alternative to OpenDP. Still, it requires learning a specific API (close to that of Spark) and cannot leverage any SQL back-end.

SmartNoise SQL is a library that shares some of the design choices of Qrlew. An analyst can write SQL queries, but the scope of possible queries is relatively limited: no JOINs, no sub-queries, and no CTEs (WITH), all of which Qrlew supports. Also, it does not run the full computation in the DB, so the integration with existing systems may not be straightforward.

Other systems such as PINQ and Chorus are prototypes that do not seem to be actively maintained. Chorus shares many of the design goals of Qrlew, but requires post-processing outside of the DB, which can make the integration more complex on the data-owner side (as the computation happens in two distinct places).

Beyond that, Qrlew brings unique functionalities, such as:

  • advanced automated range propagation;
  • the possibility to automatically blend in synthetic data;
  • advanced privacy unit definition capabilities across many related tables;
  • the possibility for the non-expert to simply write standard SQL, and for the DP-aware analyst to improve utility by adding WHERE x < b or WHERE x IN (1,2,3) to give hints to Qrlew;
  • all the compute happens in the DB.
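
The range propagation and WHERE hints listed above can be pictured with a small interval-arithmetic sketch. This is an assumption-laden illustration, not Qrlew's actual data model: the `Interval` class and the `where_less_than` and `where_in` helpers are hypothetical names, and strict inequalities are treated as closed bounds for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed range [lo, hi] known to contain every value of a column or expression."""
    lo: float
    hi: float

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def scale(self, k: float) -> "Interval":
        """Range of k * x when x lies in this interval (handles negative k)."""
        a, b = k * self.lo, k * self.hi
        return Interval(min(a, b), max(a, b))


def where_less_than(col: Interval, b: float) -> Interval:
    """Range of the column after WHERE col < b (upper bound kept closed for simplicity)."""
    if b <= col.lo:
        raise ValueError("filter excludes every value in the known range")
    return Interval(col.lo, min(col.hi, b))


def where_in(col: Interval, allowed) -> Interval:
    """Range of the column after WHERE col IN (allowed...)."""
    kept = [v for v in allowed if col.lo <= v <= col.hi]
    if not kept:
        raise ValueError("filter excludes every value in the known range")
    return Interval(min(kept), max(kept))


age = Interval(0, 120)                    # range declared for the column (assumed)
hinted = where_less_than(age, 65)         # analyst adds WHERE age < 65
expr = hinted.scale(2) + Interval(1, 1)   # range of the expression 2 * age + 1
category = where_in(Interval(0, 10), (1, 2, 3))
print(hinted, expr, category)
```

In this sketch the hint narrows `age` from [0, 120] to [0, 65], and the narrower range carries through to any expression built on top of it, which is what lets a rewriter calibrate less noise for the same privacy guarantee.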

Anyone who agrees with this pull request could submit an Approve review to it.
