This repository has been archived by the owner on Feb 4, 2021. It is now read-only.

mozilla/rappor-aggregator

RAPPOR one-off aggregator

This is a research prototype built as part of an internship project. See the Mozilla Governance thread and bug 1386569 for context.

Analysis

This is a Python reimplementation of the analysis part of RAPPOR (paper | original repository | updated repository). Concretely, there are two different notebooks available:

  • RAPPOR-Production: Can be used to perform an entire RAPPOR analysis using all the components that we deemed useful. Additionally, test datasets can be dynamically generated using Spark.
  • RAPPOR-Prototyping: Explains the individual steps in RAPPOR in detail and experiments with different ideas. This notebook works with datasets generated from the original repository.

Both notebooks need to be run from within this repository as they require the files located in the clients/ folder. The production notebook is expected to run on the data generated by this SHIELD study.

Installing dependencies

Generally, only the usual SciPy tech stack is required.

$ pip install scipy numpy matplotlib pandas scikit-learn
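
As a quick sanity check before opening the notebooks, the following sketch reports which of the packages above cannot be imported in the current environment. The helper name `missing_modules` is made up for this example and is not part of the repository; note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names for the packages listed in the pip command above.
# scikit-learn is imported as "sklearn".
REQUIRED = ["scipy", "numpy", "matplotlib", "pandas", "sklearn"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```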

To use the same hash functions as the client part, some files were copied from the other repository into the clients/ folder.


Prototyping notebook

Generating data

Before we can start running the analysis part, we need some datasets to work on.

To generate these, clone the repository with Alejandro's updates:

$ git clone https://github.com/Alexrs95/rappor

Then, run one of the following commands, depending on which distribution you want to use:

$ ./regtest.sh run-seq 'r-zipf1-small-sim_final'
$ ./regtest.sh run-seq 'r-zipf1.5-small-sim_final'
$ ./regtest.sh run-seq 'r-gauss-small-sim_final'
$ ./regtest.sh run-seq 'r-exp-small-sim_final'
$ ./regtest.sh run-seq 'r-unif-small-sim_final'

Alternatively, all these datasets can be generated with one command:

$ ./regtest.sh run-seq 'r-zipf1-small-sim_final|r-zipf1.5-small-sim_final|r-gauss-small-sim_final|r-exp-small-sim_final|r-unif-small-sim_final'
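
The combined command simply joins the individual test-case names with `|`. A minimal sketch of building that argument programmatically is shown here; the function `regtest_pattern` and its parameter names are hypothetical, and the `r-<distribution>-<size>-<parameters>` naming scheme is inferred only from the commands listed above.

```python
# Distribution names as they appear in the commands above.
DISTRIBUTIONS = ["zipf1", "zipf1.5", "gauss", "exp", "unif"]


def regtest_pattern(distributions, size="small", params="sim_final"):
    """Join test-case names with '|', mirroring the combined command."""
    return "|".join(f"r-{dist}-{size}-{params}" for dist in distributions)


if __name__ == "__main__":
    # Print the full command line so it can be copied into a shell.
    print(f"./regtest.sh run-seq '{regtest_pattern(DISTRIBUTIONS)}'")
```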

There are many other datasets; a full list can be generated by running tests/regtest_spec.py.

Generally, we want to use the datasets with the final parameters, as listed here. Using small means we work on reports from 1,000,000 clients with 100 unique values. This is still small enough to run the analysis on a laptop. For testing, it is useful to look at all distributions. The real distribution for our use case is probably most similar to zipf1.5.
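
To get a feeling for how skewed such a dataset is, the sketch below draws true values for 1,000,000 simulated clients from a Zipf-like distribution over 100 values with exponent 1.5. This is only an illustration with NumPy: the function names are invented for this example, and the exact way the original repository defines zipf1.5 may differ.

```python
import numpy as np


def zipf_probabilities(num_values=100, exponent=1.5):
    """Probability of each value, proportional to rank ** -exponent."""
    ranks = np.arange(1, num_values + 1, dtype=float)
    weights = ranks ** -exponent
    return weights / weights.sum()


def sample_true_values(num_clients, probs, seed=0):
    """Draw one true value (an index into probs) per simulated client."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(probs), size=num_clients, p=probs)


if __name__ == "__main__":
    probs = zipf_probabilities()
    values = sample_true_values(1_000_000, probs)
    counts = np.bincount(values, minlength=len(probs))
    # Share of clients holding each of the five most common values.
    print(counts[:5] / counts.sum())
```

Under this assumption a handful of values account for most clients, while the tail values are each held by only a small fraction of them.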

To see how much the tuned parameters matter, a dataset with different parameters can be generated, either on its own or together with the datasets above:

$ ./regtest.sh run-seq 'r-gauss-small-sim_bloom_filter1_1'
$ ./regtest.sh run-seq 'r-gauss-small-sim_bloom_filter1_1|r-zipf1-small-sim_final|r-zipf1.5-small-sim_final|r-gauss-small-sim_final|r-exp-small-sim_final|r-unif-small-sim_final'

Running the analysis

At the top of the Jupyter notebook, the path to the generated data needs to be adapted. Afterwards, just run all cells to see the results. Depending on the data and parameters used, this might take a while.
