29 changes: 14 additions & 15 deletions docs/introduction.rst

Introduction To Differential Privacy
====================================


Introduction
============
We live in a data-driven era: tons and tons of data are generated every second. A lot of this data is used to improve our own lifestyle, be it recommending the best series to watch after a tiring day at work, suggesting the best gift to buy for our best friend's birthday, or keeping our birthday party photos sorted so that we can cherish them years later. Big companies use data to gain insights into their progress, which drives their business. Machine Learning has made our lives easier and easier, but is it just about improving our lifestyle? This raises several questions: can machine learning change the way we live? Can it improve our healthcare? Can ML befriend those who are lonely and have no one to talk to? The answer is both “Yes” and “No”.

Machine Learning and Data
=========================

Machine Learning is driven by both data and research: the more data there is, the better the research on a particular topic can be. However, not all data can be released for research, because a lot of it contains private information that can be misused once leaked. For example, tackling a particular medical problem requires a large number of medical health records. These records are private information, as no person would love the fact that her/his medical records are identifiable by anyone on the internet. These are real-world issues that need immediate solutions, but the researchers' hands are tied due to the unavailability of data. So, is there a solution?

This is where “Differential Privacy” comes into the picture, a smarter way t
(Privacy Preserving AI (Andrew Trask) | MIT Deep Learning Series)

Why is Differential Privacy so important?
==========================================

The aim of any privacy algorithm is to keep one's private information safe and secure from external attacks. Differential privacy aims to keep an individual's identity secure even when their data is used in research. An easy approach to maintaining this kind of privacy is “Data Anonymization”, the process of removing personally identifiable information from a dataset. However, this approach has its cons:

Despite the fact that the dataset was anonymized (no username or movie name was

They scraped the IMDB website and, by statistical analysis of these two datasets, were able to identify the movie names and also the individual names. Ten years down the line they published yet another `research paper <https://www.cs.princeton.edu/~arvindn/publications/de-anonymization-retrospective.pdf>`_ reviewing de-anonymization of datasets in the present world. There are other instances too where such attacks have led to the leakage of private information.

Now that we have learnt how important “Differential Privacy” is, let us see how Differential Privacy is actually implemented.


How is Differential Privacy implemented?
=========================================

According to `Cynthia Dwork <https://www.microsoft.com/en-us/research/people/dwork>`_: *“Differential privacy” describes a promise, made by a data holder, or curator, to a data subject: “You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources, are available.”*

These algorithms add random noise to the queries and to the database. This is done in two ways:

* Local Differential Privacy
* Global Differential Privacy

Local Differential Privacy
--------------------------

In local differential privacy, the random noise is applied at the start of the process, i.e. at the local level, when the data is sent to the data curator/aggregator. If the data is too confidential, the data generators generally do not want to trust the curator and hence add noise to the dataset beforehand. This approach is adopted when the data curator cannot be completely trusted.

Image Credit: Google Images
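
To make this concrete, here is a minimal sketch (plain Python, not part of PyDP) of the classic randomized-response technique, in which every participant perturbs their own answer before it ever reaches the curator:

.. code:: python

    import random

    def randomized_response(true_answer: bool) -> bool:
        """Perturb a sensitive yes/no answer locally, before sharing it.

        With probability 1/2 the participant answers honestly; otherwise
        they answer uniformly at random. The curator only ever receives
        the noisy answer, never the raw one.
        """
        if random.random() < 0.5:
            return true_answer
        return random.random() < 0.5

    # Each data generator adds noise at the local level
    noisy_answers = [randomized_response(a) for a in [True, False, True, True]]

Because every individual answer is already noisy, the curator can still estimate aggregate statistics, but no single stored record reveals an individual's true value.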

Global Differential Privacy
---------------------------
In global differential privacy, the random noise is applied at the global level, i.e. when the answer to a query is returned to the user. This type of differential privacy is adopted when the data generators trust the data curator completely and leave it to the curator to decide how much noise to add to the results. This type of privacy yields more accurate results, as it involves less noise.

.. figure:: https://user-images.githubusercontent.com/19529592/91381550-4ec2d400-e845-11ea-8f63-b7a3adb3fde8.png

Image Credits: Google Images
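
As a rough sketch of the global model (plain Python with illustrative numbers; PyDP's own implementation is shown later), the trusted curator computes the true answer on the raw data and adds Laplace noise only to the released result:

.. code:: python

    import numpy as np

    def private_mean(records, epsilon, lower, upper):
        """Global model: noise is added to the query answer, not to the data."""
        clipped = np.clip(records, lower, upper)
        true_mean = clipped.mean()
        # For a mean of n values bounded in [lower, upper], one person's
        # record can change the result by at most (upper - lower) / n.
        sensitivity = (upper - lower) / len(records)
        return true_mean + np.random.laplace(0.0, sensitivity / epsilon)

    print(private_mean([4.0, 7.0, 9.0, 5.0], epsilon=0.5, lower=0, upper=10))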

Formal Definition Of Differential Privacy
=========================================

In the book “`The Algorithmic Foundations of Differential Privacy <https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf>`_” by Cynthia Dwork and Aaron Roth, Differential Privacy is formally defined as:

.. glossary::

   Differential Privacy
      A randomized algorithm *M* is *(ε, δ)*-differentially private if, for every pair of datasets *x* and *y* differing in at most one record and for every set *S* of possible outputs, Pr[M(x) ∈ S] ≤ exp(ε) · Pr[M(y) ∈ S] + δ.

The Epsilon *(ε)* and *Delta (δ)* parameters measure the threshold for leakage.

When both Epsilon and Delta are 0, it is called Perfect Privacy. The values are set in such a way that privacy is maintained; this set of values is known as the Privacy Budget.
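
For intuition, the same book shows that the classic Laplace mechanism achieves *(ε, 0)*-differential privacy by adding to the true answer of a query *f* noise calibrated to the query's sensitivity:

.. math::

   M(x) = f(x) + \mathrm{Lap}\left(\frac{\Delta f}{\varepsilon}\right),
   \qquad
   \Delta f = \max_{\lVert x - y \rVert_1 \leq 1} \lVert f(x) - f(y) \rVert_1

The smaller the Privacy Budget *ε*, the larger the noise and the stronger the guarantee, at the cost of accuracy.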

Differential Privacy In The Real World
======================================

Differential Privacy ensures the privacy of all sorts of data, which can then be used by anyone to draw insights that help them run their business. In the present world, differentially private data analysis is widely used, implemented through various libraries.

Differential Privacy is playing an important role in building Privacy-protected



Further Reading
===============

* `Secure and Private AI Course on Udacity by Andrew Trask <https://www.udacity.com/course/secure-and-private-ai--ud185>`_

134 changes: 96 additions & 38 deletions docs/readme.rst

Introduction to PyDP
====================

In today's data-driven world, more and more researchers and data
scientists use machine learning to create better models or more innovative
solutions for a better future.

These models often tend to handle sensitive or personal data, which
can cause privacy issues. For example, some AI models can memorize details about the data they've been trained on and could potentially leak these
details later on.

To help measure sensitive data leakage and reduce the possibility of
it happening, there is a mathematical framework called differential
privacy.

In 2020, OpenMined created a Python wrapper for Google's `Differential
Privacy <https://github.com/google/differential-privacy>`_ project
called PyDP. The library provides a set of ε-differentially private algorithms,
which can be used to produce aggregate statistics over numeric data sets containing
private or sensitive information. Therefore, with PyDP you can control the
privacy guarantee and accuracy of your model written in Python.

**Things to remember about PyDP:**

- :rocket: Features differentially private algorithms including: BoundedMean, BoundedSum, Max, Count Above, Percentile, Min, Median, etc.

- All the computation methods mentioned above use Laplace noise only (other noise mechanisms will be added soon! :smiley:).

- :fire: Currently supports Linux and macOS (Windows support coming soon :smiley:)
- :star: Use Python 3.x.

Installation
------------

To install PyDP, use the `pip <https://pip.pypa.io/en/stable/>`__
package manager:

.. code:: bash

    pip install python-dp

(If you have ``pip3`` separately for Python 3.x, use ``pip3 install python-dp``.)
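
As a quick sanity check (assuming ``python`` points at your Python 3.x interpreter), verify that the package imports without errors:

.. code:: bash

    python -c "import pydp"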

Examples
--------

Refer to the `curated list <https://github.com/OpenMined/PyDP/tree/dev/examples>`__ of tutorials and sample code to learn more about the PyDP library.

You can also get started with `an introduction to
PyDP <https://github.com/OpenMined/PyDP/blob/dev/examples/carrots_demo/carrots_demo.ipynb>`__ (a Jupyter notebook) and `the carrots demo <https://github.com/OpenMined/PyDP/blob/dev/examples/carrots_demo/carrots.py>`__ (a Python file).

Example: calculate the Bounded Mean

.. code:: python

    # Import PyDP
    import pydp as dp
    # Import the Bounded Mean algorithm
    from pydp.algorithms.laplacian import BoundedMean

    # Calculate the Bounded Mean
    # Structure: `BoundedMean(epsilon: double, lower: int, upper: int)`
    # `epsilon`: a Double, between 0 and 1, denoting the privacy threshold;
    # it measures the acceptable loss of privacy (with 0 meaning no loss is acceptable)
    # `lower` and `upper`: Integers, representing the lower and upper bounds, respectively
    x = BoundedMean(0.6, 1, 10)

    # If the lower and upper bounds are not specified,
    # PyDP automatically calculates these bounds
    # x = BoundedMean(epsilon: double)
    x = BoundedMean(0.6)

    # Calculate the result
    # Currently supported data types are integers and floats;
    # future versions will support additional data types
    # (Refer to https://github.com/OpenMined/PyDP/blob/dev/examples/carrots.py)
    input_data = [1, 3, 5, 7, 9]  # example data within the [1, 10] bounds
    x.quick_result(input_data)
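
As a quick end-to-end illustration (the survey data below is made up), you can compare the raw mean with the differentially private mean:

.. code:: python

    import statistics

    from pydp.algorithms.laplacian import BoundedMean

    # Hypothetical survey responses, bounded between 18 and 65
    ages = [21, 35, 42, 29, 58, 33, 47, 26]

    # The exact mean reveals more about any one individual
    print("Raw mean:", statistics.mean(ages))

    # The private mean adds calibrated Laplace noise to the result
    dp_mean = BoundedMean(0.6, 18, 65)
    print("DP mean:", dp_mean.quick_result(ages))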

Learning Resources
------------------

Go to `resources <https://github.com/OpenMined/PyDP/blob/dev/resources.md>`__ to learn more about differential privacy.

Support and Community on Slack
------------------------------

If you have questions about the PyDP library, join `OpenMined's Slack <https://slack.openmined.org>`__ and check the **#lib\_pydp** channel. To follow the code source changes, join **#code\_dp\_python**.

Contributing
------------

To contribute to the PyDP project, read the `guidelines <https://github.com/OpenMined/PyDP/blob/dev/contributing.md>`__.

Pull requests are welcome. If you want to introduce major changes,
please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.



License
-------

`Apache License 2.0 <https://choosealicense.com/licenses/apache-2.0/>`__

.. |Tests| image:: https://img.shields.io/github/workflow/status/OpenMined/PyDP/Tests
.. |Version| image:: https://img.shields.io/github/v/tag/OpenMined/PyDP?color=green&label=pypi
.. |License| image:: https://img.shields.io/github/license/OpenMined/PyDP