RAPPOR one-off aggregator

This is a research prototype built as part of an internship project. See the Mozilla Governance thread and bug 1386569 for context.

Analysis

This is a Python reimplementation of the analysis part of RAPPOR (paper | original repository | updated repository). Concretely, there are two different notebooks available:

RAPPOR-Production: Performs an entire RAPPOR analysis using all the components that we deemed useful. Additionally, test datasets can be generated dynamically using Spark.
RAPPOR-Prototyping: Explains the individual steps in RAPPOR in detail and experiments with different ideas. This notebook works with datasets generated from the original repository.

Both notebooks need to be run from within this repository because they require the files located in the client/ folder. The production notebook is expected to run on the data generated by this SHIELD study.

Installing dependencies

Generally, only the usual SciPy tech stack is required.

$ pip install scipy numpy matplotlib pandas scikit-learn
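As a quick sanity check after installing, the following minimal sketch reports which of the modules above cannot be found in the current environment. The helper name `missing_modules` is hypothetical and not part of this repository. Note that scikit-learn is imported under the module name `sklearn`.

```python
import importlib.util

# Import names (not pip package names): scikit-learn is imported as "sklearn".
REQUIRED_MODULES = ["scipy", "numpy", "matplotlib", "pandas", "sklearn"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Running this inside the same environment as the Jupyter kernel avoids surprises when the notebook's first cell runs.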

To use the same hash functions as the client part, some files were copied from the other repository into the client folder.

Prototyping notebook

Generating data

Before we can start running the analysis part, we need to have some datasets to work on.

To generate these, clone the repository with Alejandro's updates:

$ git clone https://github.com/Alexrs95/rappor

Then, run one of the following commands, depending on which distribution you want to use:

$ ./regtest.sh run-seq 'r-zipf1-small-sim_final'
$ ./regtest.sh run-seq 'r-zipf1.5-small-sim_final'
$ ./regtest.sh run-seq 'r-gauss-small-sim_final'
$ ./regtest.sh run-seq 'r-exp-small-sim_final'
$ ./regtest.sh run-seq 'r-unif-small-sim_final'

Alternatively, all these datasets can be generated with one command:

$ ./regtest.sh run-seq 'r-zipf1-small-sim_final|r-zipf1.5-small-sim_final|r-gauss-small-sim_final|r-exp-small-sim_final|r-unif-small-sim_final'
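The combined argument is simply the individual dataset names joined with `|`. A minimal sketch for building it, assuming the `r-<distribution>-<size>-<parameters>` naming pattern inferred from the commands above (the helper names `case_name` and `combined_pattern` are hypothetical and not part of the RAPPOR repository):

```python
# Distributions used in the commands above.
DISTRIBUTIONS = ["zipf1", "zipf1.5", "gauss", "exp", "unif"]


def case_name(distribution, size="small", params="sim_final"):
    """Build one dataset name; the naming pattern is inferred, not documented here."""
    return f"r-{distribution}-{size}-{params}"


def combined_pattern(distributions, size="small", params="sim_final"):
    """Join the dataset names with '|' so one regtest.sh call covers all of them."""
    return "|".join(case_name(d, size, params) for d in distributions)


# Print the full shell command for copy and paste.
print(f"./regtest.sh run-seq '{combined_pattern(DISTRIBUTIONS)}'")
```

This is only a convenience for assembling the command line; the data is still generated by regtest.sh itself.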

There are many other datasets; a full list can be generated by running tests/regtest_spec.py.

Generally, we want to use the datasets with the final parameters, as listed here. The small size means we work on reports from 1,000,000 clients with 100 unique values. A dataset of this size can still be analyzed on a laptop. For testing, it is useful to look at all distributions. The real distribution for our use case is probably most similar to zipf1.5.
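To get a feeling for how skewed a zipf1.5-like distribution over 100 unique values is, here is a minimal standard-library sketch. It is an illustration only and not how regtest.sh generates its data: it assumes the probability of the value at rank k is proportional to k^-1.5, and the value names (`v1`, `v2`, ...) and the function `sample_zipf_values` are hypothetical.

```python
import random
from collections import Counter


def sample_zipf_values(num_clients, num_values=100, exponent=1.5, seed=0):
    """Draw one true value per client, with P(rank k) proportional to k**-exponent."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same sample
    ranks = range(1, num_values + 1)
    weights = [rank ** -exponent for rank in ranks]
    values = [f"v{rank}" for rank in ranks]  # hypothetical value names
    return Counter(rng.choices(values, weights=weights, k=num_clients))


# A far smaller sample than the 1,000,000 clients of the "small" datasets,
# but enough to see that a few values dominate.
counts = sample_zipf_values(num_clients=10_000)
for value, count in counts.most_common(5):
    print(value, count)
```

Under this assumption the top-ranked value alone accounts for roughly 40% of all clients, so the long tail of rare values is where the analysis is hardest.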

To see how important the tuned parameters are, a dataset with different parameters can be generated, either on its own or together with the datasets above:

$ ./regtest.sh run-seq 'r-gauss-small-sim_bloom_filter1_1'
$ ./regtest.sh run-seq 'r-gauss-small-sim_bloom_filter1_1|r-zipf1-small-sim_final|r-zipf1.5-small-sim_final|r-gauss-small-sim_final|r-exp-small-sim_final|r-unif-small-sim_final'

Running the analysis

At the top of the Jupyter notebook, the path to the generated data needs to be adapted. Afterwards, just run all cells to see the results. Depending on the data and parameters used, this might take a while.
