
[FR] Local Differential Privacy Methods #119

Closed
3 of 5 tasks
hcngac opened this issue Nov 24, 2021 · 12 comments
Labels: enhancement (New feature or request)

Comments

hcngac (Contributor) commented Nov 24, 2021

Is your feature request related to a problem? Please describe.
Currently there is only one implementation of local differential privacy (LDP): RAPPOR [1], implemented in https://github.com/TL-System/plato/blob/main/plato/utils/unary_encoding.py, and it is not decoupled from the algorithm implementations. For example:

# PyTorch variant: features are extracted, unary-encoded and randomized in-line.
_randomize = getattr(self.trainer, "randomize", None)
for inputs, targets, *__ in data_loader:
    with torch.no_grad():
        logits = self.model.forward_to(inputs, cut_layer)
    if epsilon is not None:
        logits = logits.detach().numpy()
        logits = unary_encoding.encode(logits)
        if callable(_randomize):
            logits = self.trainer.randomize(logits, targets, epsilon)
        else:
            logits = unary_encoding.randomize(logits, epsilon)

# MindSpore variant: the same encode/randomize steps, duplicated for a different tensor type.
if epsilon is not None:
    logits = logits.asnumpy()
    logits = unary_encoding.encode(logits)
    logits = unary_encoding.randomize(logits, epsilon)
    logits = mindspore.Tensor(logits.astype('float32'))

# A third framework-specific variant of the same encode/randomize logic.
if epsilon is not None:
    logits = unary_encoding.encode(logits)
    if callable(_randomize):
        logits = self.trainer.randomize(logits, targets, epsilon)
    else:
        logits = unary_encoding.randomize(logits, epsilon)

This feature request calls for a modular LDP plugin interface and implementations of a number of other LDP methods, e.g. [2][3].

Describe the solution you'd like

  • Unified data exchange format between clients and server.
  • A modular interface for plugging in data processing modules into the server-client data exchange.
  • A config entry for enabling specific data processing modules.
  • Implementations of LDP modules.
  • Tests of the theoretical properties of the modules, i.e. the ε-LDP guarantee (see the sketch after this list).
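
For the last item, a minimal sketch of what such a test could look like, using plain binary randomized response rather than Plato's unary_encoding API (whose exact signatures are not relied on here): it empirically estimates the worst-case output probability ratio between two inputs and compares it against e^ε.

import numpy as np

def randomized_response(bit, epsilon, rng):
    # Keep the true bit with probability e^eps / (1 + e^eps), otherwise flip it.
    p_keep = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return bit if rng.random() < p_keep else 1 - bit

def empirical_ldp_ratio(epsilon, trials=200_000, seed=1):
    # Estimate max over outputs y of P(y | x=1) / P(y | x=0); eps-LDP requires this <= e^eps.
    rng = np.random.default_rng(seed)
    out1 = np.array([randomized_response(1, epsilon, rng) for _ in range(trials)])
    out0 = np.array([randomized_response(0, epsilon, rng) for _ in range(trials)])
    worst = 0.0
    for y in (0, 1):
        p1, p0 = (out1 == y).mean(), (out0 == y).mean()
        worst = max(worst, p1 / p0, p0 / p1)
    return worst

epsilon = 1.0
print(empirical_ldp_ratio(epsilon), np.exp(epsilon))  # both should be close: the exact worst-case ratio is e^eps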

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
To be filled.

Additional context
Add any other context or screenshots about the feature request here.
[1] Ú. Erlingsson, V. Pihur, and A. Korolova. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, pages 1054–1067. ACM, 2014.
[2] Differential Privacy Team, Apple. Learning with privacy at scale. 2017.
[3] B. Ding, J. Kulkarni, and S. Yekhanin. Collecting telemetry data privately. In Advances in Neural Information Processing Systems 30, December 2017.

hcngac added the enhancement (New feature or request) label Nov 24, 2021
hcngac (Contributor, Author) commented Nov 24, 2021

To combine with adding model/feature compression methods later, I also propose implementing this as a middleman or preprocessor interface that can accept a number of data processing modules applied before data is sent between clients and the server. It also calls for unifying the data transfer format between clients and the server, so that we don't have to implement the same LDP algorithm for each framework's data format, e.g. PyTorch/TensorFlow/NumPy.

baochunli (Collaborator) commented

Challenging to do but makes sense.

hcngac (Contributor, Author) commented Nov 26, 2021

Investigating the data being transferred, we can standardise two data formats:

  • features, in the form of a list of numpy arrays, with each numpy array representing a feature extracted from an input;
  • model parameters, in the form of an ordered dict mapping each layer name to a numpy array of that layer's parameters.

The Algorithm class should be responsible for converting between the framework-specific format and this standard format, in the methods extract_weights, load_weights and extract_features. Loading features on the server can be implemented in the feature.DataSource class.
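
As an illustration of the two standard formats (the layer names, shapes, and the extract_weights sketch below are placeholders, not anything prescribed by Plato):

from collections import OrderedDict
import numpy as np

# Format 1: features, as a list of numpy arrays, one array per feature extracted from an input.
features = [np.random.randn(256).astype(np.float32) for _ in range(32)]

# Format 2: model parameters, as an ordered dict mapping layer name -> numpy array of that layer's parameters.
weights = OrderedDict([
    ("conv1.weight", np.zeros((16, 3, 3, 3), dtype=np.float32)),
    ("conv1.bias", np.zeros(16, dtype=np.float32)),
    ("fc.weight", np.zeros((10, 256), dtype=np.float32)),
])

def extract_weights(model):
    # Hypothetical framework-to-standard conversion for a PyTorch model (the Algorithm class's job).
    return OrderedDict((name, param.detach().cpu().numpy()) for name, param in model.state_dict().items())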

hcngac (Contributor, Author) commented Nov 26, 2021

A new class called DataProcessor is proposed. Each DataProcessor class should have a process method that processes the data.
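
A minimal sketch of what that interface could look like (only the class name DataProcessor and the process method come from the proposal above; everything else is an illustrative assumption):

from abc import ABC, abstractmethod
from typing import Any

import numpy as np

class DataProcessor(ABC):
    """A module that transforms data exchanged between clients and the server."""

    @abstractmethod
    def process(self, data: Any) -> Any:
        """Return the processed data (e.g. randomized, compressed, or serialized)."""

class LaplaceNoiseProcessor(DataProcessor):
    """Illustrative LDP processor: add Laplace noise to a list of numpy feature arrays."""

    def __init__(self, epsilon: float, sensitivity: float = 1.0):
        self.scale = sensitivity / epsilon

    def process(self, data):
        return [x + np.random.laplace(0.0, self.scale, x.shape) for x in data]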

Two new config entries can be added for both the client and the server, one for receiving data and one for sending data. Each entry should be a list of DataProcessor class names, in the order the processors are applied.

For example, when a client receives a piece of data from the server, the data is first processed by the listed DataProcessors, in the order given in the config, and then passed on to the client's remaining handling.

Likewise, when a client is ready to send a piece of data, the data is first processed by the listed DataProcessors, in the order given in the config, and then sent to the server.

The same applies to the server; a sketch of the config entries and the processing chain follows.
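
A sketch of how the config entries and the processing order could fit together (the key names and the registry are assumptions, not Plato's actual config schema):

# Hypothetical config: one list per direction, applied in order.
client_config = {
    "outbound_processors": ["laplace_noise", "serializer"],
    "inbound_processors": ["deserializer"],
}

# name -> DataProcessor factory; assumed to be populated at startup.
registry = {}

def apply_processors(names, data):
    # Apply the configured DataProcessors to the data, in the listed order.
    for name in names:
        data = registry[name]().process(data)
    return data

# On send:    payload = apply_processors(client_config["outbound_processors"], payload)
# On receive: payload = apply_processors(client_config["inbound_processors"], payload)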

hcngac (Contributor, Author) commented Nov 26, 2021

A serializer/deserializer DataProcessor can also be mandated as the last processor on the sending side and the first on the receiving side, to support transfer encodings other than Python pickle.
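
Under the DataProcessor interface sketched above, a pickle-based serializer/deserializer pair (kept as the default transfer encoding) could look like this:

import pickle

class PickleSerializer(DataProcessor):
    """Mandated last processor on the sending side: turn the payload into bytes."""

    def process(self, data):
        return pickle.dumps(data)

class PickleDeserializer(DataProcessor):
    """Mandated first processor on the receiving side: restore the payload from bytes."""

    def process(self, data):
        return pickle.loads(data)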

baochunli (Collaborator) commented Nov 26, 2021

I like the design of the DataProcessor class. For now, however, I think we should continue to use pickle for data transfers, as this is challenging to replace -- it would require changing a lot of existing data transfer code that is field tested.

hcngac (Contributor, Author) commented Nov 29, 2021

Individual clients can have different privacy requirements and might opt to set their own LDP parameter, i.e. epsilon. Is the performance of a federated learning model under varying privacy parameters a valid research question?

Should we support per-client privacy parameters, or a randomised distribution of privacy parameters across clients?

baochunli (Collaborator) commented

Fact is, even the performance of federated learning with the same set of parameters across all clients is not well understood.

hcngac (Contributor, Author) commented Dec 1, 2021

I briefly studied the other two LDP methods [2][3]; they seem to be designed for statistical data, e.g. counters and frequencies.

I cannot determine whether they are suitable for the FL scenario, or how they could be applied to it.

baochunli (Collaborator) commented

Beyond randomized response (using unary encoding, which is already implemented in Plato), one can implement other, more mainstream mechanisms of differential privacy. These include (but are not limited to) the Laplace mechanism, the Gaussian mechanism, the exponential mechanism, and the sparse vector technique. More detailed descriptions of these differential privacy mechanisms can be found here:

Programming Differential Privacy

These mechanisms should all be fairly easy to implement. For example, the Laplace mechanism simply involves calling np.random.laplace() from NumPy.
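
For instance, a minimal sketch of the Laplace mechanism applied to a numpy array (the clipping range and sensitivity below are assumptions that would have to be justified for the actual features being released):

import numpy as np

def laplace_mechanism(values: np.ndarray, epsilon: float, sensitivity: float) -> np.ndarray:
    # Add Laplace noise with scale sensitivity / epsilon to every entry.
    return values + np.random.laplace(loc=0.0, scale=sensitivity / epsilon, size=values.shape)

features = np.random.randn(4, 8)
clipped = np.clip(features, -1.0, 1.0)  # clipping bounds each entry's contribution
noisy = laplace_mechanism(clipped, epsilon=1.0, sensitivity=2.0)  # entries range over [-1, 1]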

For more detailed coverage, you can also find a comparison between randomized response and the Laplace mechanism in this research paper:

http://ceur-ws.org/Vol-1558/paper35.pdf

hcngac (Contributor, Author) commented Dec 9, 2021

I am not sure about the applicability of the exponential mechanism and the sparse vector technique: the exponential mechanism is better suited to queries with a discrete/finite answer set, e.g. actual inference, and the sparse vector technique seems to target finding queries whose results exceed a threshold. Neither is exactly a step in feeding data into model training.

Also, does applying clipping to features and model parameters make sense? I assume it is fine for features, but I am not sure about model parameters.

baochunli (Collaborator) commented

No, I agree that the exponential mechanism and the sparse vector technique do not make much sense here.

We actually don't apply the Gaussian or Laplace mechanisms to model parameters; we apply them to gradients during training, inside the training loop (where clipping is done as well). So some of your recent code needs to be redone. I am working on this in the 'gradient_dp' branch.
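
As a rough sketch of that approach in PyTorch (not the actual gradient_dp implementation), a clip-then-noise step inside the training loop could look like this; note that a rigorous DP-SGD treatment clips per-sample gradients before averaging, whereas this only clips the batch gradient:

import torch

def clip_and_noise_gradients(model, max_norm: float, noise_multiplier: float):
    # Clip the overall gradient norm, then add Gaussian noise scaled to the clipping bound.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    for param in model.parameters():
        if param.grad is not None:
            param.grad.add_(noise_multiplier * max_norm * torch.randn_like(param.grad))

# Inside the training loop (sketch):
#     loss.backward()
#     clip_and_noise_gradients(model, max_norm=1.0, noise_multiplier=1.1)
#     optimizer.step()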
