University of Helsinki, Master's Programme in Data Science  
DATA20019 Trustworthy Machine Learning, Autumn 2019  
Antti Honkela and Razane Tajeddine  

# Project 2: Real-life privacy-preserving machine learning

Deadline for returning the solutions: 24 November 23:55.

## General instructions (IMPORTANT!)

1. This is an individual project. You can discuss the solutions with other students, but everyone needs to write their own code and answers.
2. Please return your solutions as a notebook. When returning your solutions, please leave all output in the notebook.
3. When returning your solutions, please make sure the notebook can be run cleanly using "Cell" / "Run All".
4. Please make sure there are no dependencies between solutions to different problems.
5. Please make sure that your notebook will not depend on any local files.
6. Please make sure that the solutions for each problem in your notebook will produce the same results when run multiple times, i.e. remember to seed any random number generators you use (`numpy.random.seed()`!).
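
As a minimal sketch of point 6, the global Python and NumPy generators can be seeded at the start of each problem. The helper name `seed_everything` and the seed value are illustrative assumptions. TensorFlow keeps its own random state, which would need to be seeded separately.

```python
import random

import numpy as np

SEED = 0  # arbitrary choice; any fixed integer works


def seed_everything(seed=SEED):
    """Seed the Python and NumPy global random number generators."""
    random.seed(seed)
    np.random.seed(seed)


seed_everything()
first = np.random.normal(size=3)
seed_everything()
second = np.random.normal(size=3)
# Re-seeding with the same value reproduces the same draws.
assert np.array_equal(first, second)
```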


## Task 1: Differentially private logistic regression with DP-SGD and synthetic data

The TensorFlow Privacy library (https://github.com/tensorflow/privacy) provides implementations of many differentially private optimisation algorithms for deep learning and other models. To complete these exercises, you will need to install TensorFlow Privacy and its dependencies according to the instructions given on the website.

In order to study TensorFlow Privacy, we will use logistic regression on a small synthetic data set. This will be faster to run than larger neural network models. A simple example implementation of the model is available at https://github.com/ahonkela/privacy/blob/master/tutorials/toy_lr_tutorial.py
The code has been adapted from the tutorials provided with TensorFlow Privacy.

The definition of the logistic regression model for binary classification is very straightforward in TensorFlow: it uses a single fully connected linear layer with a cross-entropy loss:
```{python}
  # Define logistic regression model using tf.keras.layers.
  logits = tf.keras.layers.Dense(2).apply(features['x'])

  # Calculate loss as a vector (to support microbatches in DP-SGD).
  vector_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
      labels=labels, logits=logits)
```
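
To make explicit what these two calls compute, here is a NumPy sketch of the same model: a linear layer producing two logits per example, followed by a per-example softmax cross-entropy. The function names and the toy inputs are illustrative assumptions and are not part of the tutorial code.

```python
import numpy as np


def dense_logits(x, weights, bias):
    """Linear layer: one row of logits per example, one column per class."""
    return x @ weights + bias


def sparse_softmax_cross_entropy(labels, logits):
    """Per-example cross-entropy loss for integer class labels."""
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]


rng = np.random.RandomState(0)
x = rng.normal(size=(4, 3))      # 4 examples with 3 features (toy values)
labels = np.array([0, 1, 1, 0])
weights = np.zeros((3, 2))       # 3 input features, 2 classes
bias = np.zeros(2)

# One loss value per example, as needed for per-example clipping in DP-SGD.
vector_loss = sparse_softmax_cross_entropy(
    labels, dense_logits(x, weights, bias))
```

With all-zero parameters both classes get probability 0.5, so every entry of `vector_loss` equals $\log 2$.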

The rest of the example provides supporting architecture. Key parameters of the algorithm are defined as `flags` at the beginning of the file. These include:
```{python}
flags.DEFINE_float('learning_rate', .05, 'Learning rate for training')
flags.DEFINE_float('noise_multiplier', 2.0,
                   'Ratio of the standard deviation to the clipping norm')
flags.DEFINE_float('l2_norm_clip', 1.0, 'Clipping norm')
flags.DEFINE_integer('batch_size', 64, 'Batch size')
flags.DEFINE_integer('epochs', 2, 'Number of epochs')
```

`learning_rate` is the initial learning rate for the Adam optimiser. A larger value means faster learning but can cause instability.  
`noise_multiplier` controls the amount of noise added in DP-SGD: a higher value means more noise. The value is defined relative to the gradient clipping norm.  
`l2_norm_clip` is the maximum norm at which per-example gradients are clipped. Smaller values mean less noise for the same level of privacy, but values that are too small can bias the results and make learning impossible.  
`batch_size` is the minibatch size, which impacts privacy via amplification from subsampling. Smaller batch sizes increase privacy for an equal number of epochs, but batches that are too small can make the learning unstable.  
`epochs` controls the length of training as a number of passes over the entire data.
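
The way these parameters interact can be sketched in NumPy as a single noisy gradient computation. This is a simplified illustration and not the TensorFlow Privacy implementation: it assumes one microbatch per example, and the function names and the data set size used below are hypothetical.

```python
import numpy as np


def dp_sgd_gradient(per_example_grads, l2_norm_clip, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, and average."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Scale down only the gradients whose norm exceeds the clipping norm.
    scale = np.minimum(1.0, l2_norm_clip / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    # Noise standard deviation is noise_multiplier times the clipping norm.
    noise = rng.normal(scale=noise_multiplier * l2_norm_clip,
                       size=per_example_grads.shape[1])
    return (clipped.sum(axis=0) + noise) / len(per_example_grads)


def training_schedule(num_examples, batch_size, epochs):
    """Return the number of steps and the sampling ratio batch_size / N."""
    steps = epochs * num_examples // batch_size
    return steps, batch_size / num_examples


rng = np.random.RandomState(0)
grads = rng.normal(size=(64, 5))  # toy per-example gradients
noisy_grad = dp_sgd_gradient(grads, l2_norm_clip=1.0,
                             noise_multiplier=2.0, rng=rng)

# Hypothetical data set size, for illustration only.
steps, sampling_ratio = training_schedule(
    num_examples=10000, batch_size=64, epochs=2)
```

The number of steps and the sampling ratio, together with `noise_multiplier`, are the quantities that determine the reported $\epsilon$ for a given $\delta$.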

Test how these parameters (clipping threshold, batch size, noise multiplier and learning rate) affect the accuracy of the classifier and the privacy. Plot all your results in a privacy ($\epsilon$) vs. accuracy plot to trace the optimal accuracy achievable under a specific level of privacy.
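
One possible way to trace the optimal accuracy is to collect an ($\epsilon$, accuracy) pair from every run and keep only the runs that beat all runs with a smaller or equal $\epsilon$. The sketch below assumes results are stored as plain tuples. The function name is hypothetical and the numbers are placeholders for illustration, not measured results.

```python
def best_accuracy_frontier(results):
    """Return the (epsilon, accuracy) pairs that beat every result
    with a smaller or equal epsilon, sorted by epsilon."""
    frontier = []
    best = float("-inf")
    # Sort by epsilon; for ties, visit the higher accuracy first.
    for eps, acc in sorted(results, key=lambda r: (r[0], -r[1])):
        if acc > best:
            frontier.append((eps, acc))
            best = acc
    return frontier


# Placeholder numbers for illustration only, not measured results.
runs = [(0.5, 0.80), (0.5, 0.78), (1.0, 0.79), (2.0, 0.85)]
frontier = best_accuracy_frontier(runs)
```

All runs can then be drawn as a scatter plot with the frontier overlaid as a step curve, for example using `matplotlib.pyplot.scatter` and `matplotlib.pyplot.step`.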

You can limit the number of experiments to keep the runtimes reasonable: it is not necessary to try every combination of parameters but you can focus on testing the effect of one variable at a time. TensorFlow can use GPUs which can speed up learning significantly.

Note: testing several hyperparameters and choosing the best has an impact on the privacy guarantees. There are methods for dealing with this (e.g. https://arxiv.org/abs/1811.07971) but the field is still under active development.

```
!python 2.py
```

```
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2019-11-24 11:49:23.976139: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
For delta=1e-5, the current epsilon is: 0.47
For delta=1e-5, the current epsilon is: 0.49
For delta=1e-5, the current epsilon is: 0.52
For delta=1e-5, the current epsilon is: 0.54
For delta=1e-5, the current epsilon is: 0.56
Test accuracy after 0.5 epochs is: 0.833
For delta=1e-5, the current epsilon is: 0.59
For delta=1e-5, the current epsilon is: 0.61
For delta=1e-5, the current epsilon is: 0.63
For delta=1e-5, the current epsilon is: 0.65
For delta=1e-5, the current epsilon is: 0.67
Test accuracy after 1.0 epochs is: 0.835
For delta=1e-5, the current epsilon is: 0.69
For delta=1e-5, the current epsilon is: 0.71
For delta=1e-5, the current epsilon is: 0.73
For delta=1e-5, the current epsilon is: 0.75
For delta=1e-5, the current epsilon is: 0
```

## Task 2: DP logistic regression on realistic data

Using the above code as a basis, build a DP logistic regression classifier for the UCI Adult data set (https://archive.ics.uci.edu/ml/datasets/Adult). (The data set is a standard benchmark data set that is available in various packages - feel free to use one of those.)

How accurate a classifier can you build to predict whether an individual has an income of at most 50k, using DP with $\epsilon=1, \delta = 10^{-5}$? Report your accuracy on the separate test set that was not used in learning.

Hint: the data set includes many categorical variables. In order to use these, you will need a one-hot encoding in which $n-1$ variables denote $n$ values: the $k$th value is represented by a 1 in the $(k-1)$st variable and zeros elsewhere, so the first value is encoded as all zeros.
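
The encoding in the hint can be sketched in plain Python as follows. The function name and the toy categories are hypothetical. Library helpers such as `pandas.get_dummies` with `drop_first=True` produce a similar encoding.

```python
def one_hot_drop_first(values, categories):
    """Encode each value with len(categories) - 1 indicator variables.

    The first category maps to all zeros; every other category sets
    exactly one indicator to 1.
    """
    index = {category: k for k, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * (len(categories) - 1)
        k = index[value]  # zero-based position of the category
        if k > 0:
            row[k - 1] = 1
        encoded.append(row)
    return encoded


# Toy categorical variable with n = 3 values, encoded with n - 1 = 2 columns.
categories = ["a", "b", "c"]
encoded = one_hot_drop_first(["a", "c", "b"], categories)
# encoded == [[0, 0], [0, 1], [1, 0]]
```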

## Task 3: Your own problem in privacy-preserving machine learning

State and solve your own problem related to privacy-preserving machine learning.

You can use code available online, as long as you cite the source.

You can for example try reproducing the results of some interesting paper using their data or your own data, try out some of the privacy attacks, or simply try the above examples using more complex models and/or on different data sets.

If your problem is based on some previous problem, it should extend it in a non-trivial manner (not just running the exact same code with new parameters or data).

The evaluation of the project will take the difficulty of your chosen problem into account.

This task is worth as much as two regular problems.