Plaintext-speed SMPC Linear Regression #2069

iamtrask · 2019-04-15T08:46:47Z

This project is to port the following Paper/Code into PyTorch. It makes claims of plaintext-speed linear regression which could be very useful to users of PySyft.

Paper: https://arxiv.org/pdf/1901.09531.pdf
Code: https://github.com/jbloom22/DASH/

jbloom22 · 2019-04-15T11:37:16Z

Well that'd be sweet! I'll do my best to answer any questions. I also just joined the slack as jbloom.

Roughly speaking, efficient distributed algorithms are privacy-preserving algorithms because efficiency requires minimizing the number of bits serialized over the network.

In the case of linear regression, we can "quotient out" rotational invariance, factoring the problem into plaintext compression followed by network usage that is independent of n.

In the simplest case of y regressed on X, this just amounts to staring at the formula:
beta = (X^T X)^-1 (X^T y)

So:

Compute X^T X, reducing k n-vectors to (k choose 2) numbers.
Compute X^T y, reducing the n-vector to a k-vector.
Compute beta.

Steps 1 and 2 are the big data steps, but trivially parallelize over p groups of samples, e.g.X^T X = X1^T X1 + ... + Xp^T Xp. So compute these summands in plaintext, and for Step 3, just SMPC the sums and the tiny linear solve to obtain beta. The only extra bit needed for the standard error is the number y^T y.

This gets more fun when you have a ton of features that you'd like to test one-by-one, while including a common set of confounders. We implemented the "singly-distributed" version in Hail (https://hail.is) and used it to compute the largest genetic association to date in a couple hours (http://www.nealelab.is/uk-biobank/), testing 4K traits at each of ~13M locations in the genome across ~360K Brits with ~20 additional covariates, with the data partitioned into intervals of rows in the variant-by-individual matrix.*

To accommodate the growing number of individuals in these studies, we're moving to a more general block (i.e. checkerboard) partitioning strategy. The DASH note is just the natural generalization to "doubly-distributed" (on features and also samples) linear regression. Once we have block partitioning, we'll implement the non-SMPC version in a few lines of Python. We're also discussing with Hoon Cho (https://hhcho.com/) and others how to best provide a provably-secure SMPC version to the statistical genetics community, building on the great work that he and Bonnie Berger have done in the area.

To be clear, none of this is specific to genomics, and I'd love for the mathematical ideas (useful in theory) to find their way into whatever (esp. open-source!) frameworks make them useful in practice. :)

*For BLAS3 vectorization, Hail also lifts the math to process regressions in 2d groups along both the trait and variant axes. The team is now building a Kubernetes/C++ backend as an alternative to Spark/Scala, which will facilitate compiling the parts of the computational graph that are big linear algebra to GPU / TPU.

jbloom22 · 2019-04-15T12:40:38Z

And in case you're not satisfied by my cheeky reference to the Pythagorean Theorem (https://en.wikipedia.org/wiki/Plimpton_322) as the proof of the Lemma, here's a home-brewed geometric proof (there must be plenty of short algebraic proofs in the lit):

Let P be the orthogonal projection from R^N to C^perp (the orthogonal complement of the column space of C), so PC = 0. Let b and g be the fit parameters for:

y = xb + Cg + e

Then the residual e is orthogonal to both x and col(C), hence to Px which is in their span. Since e is orthogonal to col(C), we also know Pe = e. So applying P to both sides gives

Py = (Px)b + e

and this is an orthogonal decomposition (!), so by uniqueness the regression problem Py on Px has the same estimate b and the same residual vector e as the original regression problem of y on x and C. All that's changed is the number of degrees of freedom.

[explicitly, P = I - QQ^T, which leads to the formulas].

akontaxis · 2019-04-15T14:48:02Z

I'd definitely be interested in taking up this project if it's still unassigned!

kamathhrishi · 2019-04-18T17:42:19Z

@akontaxis Go for it :)

iamtrask · 2019-04-25T22:18:13Z

Hey @akontaxis how's it goin? Also, are you in the Slack (if so, what's your username?)

akontaxis · 2019-04-26T04:24:13Z

Hi @iamtrask ! Yep, I'm on slack (username is also akontaxis).

It's going alright -- I put together a basic notebook that walks through an example of the linear regression case. Securely performing the linear solve at the end is still a to do, so was planning to tackle that next, but definitely open to any guidance on scope/next steps!

jbloom22 · 2019-05-01T12:01:32Z

I've updated the demo in the repo to clarify that the matrix R need not be shared publicly:
https://github.com/jbloom22/DASH/

akontaxis · 2019-07-11T03:41:16Z

Hi guys! Apologies for the radio silence, have been busy the past few months but working on this intermittently. I just opened a PR: #2350

I kept this stuff in a notebook because I wasn't sure where it should sit in the codebase -- any feedback there would be super helpful!

kamathhrishi · 2019-08-21T13:26:12Z

I guess this issue should be closed?

jbloom22 · 2019-08-21T13:31:25Z

Hi Hrisikesh! Slack Andre Farias, winner of the OpenMined grant, for the latest if you're interested to contribute. Andrew Kontaxis is now coordinating with him, they should decide what to do with this Issue. @akontaxis @andrelmfarias

TanishB · 2020-03-01T09:18:21Z

Hello everyone! Is this issue resolved? If not, can I try it out? @iamtrask

github-actions · 2020-05-25T00:08:58Z

This issue has been marked stale because it has been open 30 days with no activity. Leave a comment or remove the stale label to unmark it. Otherwise, this will be closed in 7 days.

iamtrask added the Good first issue 🎓 Perfect for beginners, welcome to OpenMined! label Apr 15, 2019

LaRiffle changed the title ~~Plantext-speed SMPC Linear Regression~~ Plaintext-speed SMPC Linear Regression Apr 15, 2019

hereismari added the Type: New Feature ➕ Introduction of a completely new addition to the codebase label May 5, 2019

LaRiffle added the multiparty computation label Jun 7, 2019

github-actions bot added the Status: Stale 🍞 Been open for a while with no activity label May 25, 2020

github-actions bot closed this as completed May 25, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Plaintext-speed SMPC Linear Regression #2069

Plaintext-speed SMPC Linear Regression #2069

iamtrask commented Apr 15, 2019

jbloom22 commented Apr 15, 2019 •

edited

Loading

jbloom22 commented Apr 15, 2019

akontaxis commented Apr 15, 2019 •

edited

Loading

kamathhrishi commented Apr 18, 2019

iamtrask commented Apr 25, 2019

akontaxis commented Apr 26, 2019

jbloom22 commented May 1, 2019

akontaxis commented Jul 11, 2019

kamathhrishi commented Aug 21, 2019

jbloom22 commented Aug 21, 2019

TanishB commented Mar 1, 2020

github-actions bot commented May 25, 2020

Plaintext-speed SMPC Linear Regression #2069

Plaintext-speed SMPC Linear Regression #2069

Comments

iamtrask commented Apr 15, 2019

jbloom22 commented Apr 15, 2019 • edited Loading

jbloom22 commented Apr 15, 2019

akontaxis commented Apr 15, 2019 • edited Loading

kamathhrishi commented Apr 18, 2019

iamtrask commented Apr 25, 2019

akontaxis commented Apr 26, 2019

jbloom22 commented May 1, 2019

akontaxis commented Jul 11, 2019

kamathhrishi commented Aug 21, 2019

jbloom22 commented Aug 21, 2019

TanishB commented Mar 1, 2020

github-actions bot commented May 25, 2020

jbloom22 commented Apr 15, 2019 •

edited

Loading

akontaxis commented Apr 15, 2019 •

edited

Loading