# Auditing a dataframe
In this notebook, we demonstrate how to use `privacypanda` to _audit_ the privacy of your data. `privacypanda` provides a simple function that reports the names of any columns that break privacy. Currently, the following kinds of private data are detected:
- Addresses
    - E.g. "10 Downing Street"; "221b Baker St"; "EC2R 8AH"
- Phone numbers (UK mobile)
    - E.g. "+447123456789"
- Email addresses
    - Ending in ".com", ".co.uk", ".org", ".edu" (to be expanded soon)
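To make the categories above concrete, here is a minimal sketch of how email and UK mobile values could be recognised with regular expressions. The patterns and the `classify` helper are assumptions made for illustration; they are not `privacypanda`'s own detection rules, and addresses are left out because they are much harder to capture with a single pattern.

```python
import re

# Illustrative patterns only (assumptions, not privacypanda's internal rules).
# Emails: restricted to the endings listed above.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w.-]+\.(?:com|co\.uk|org|edu)\b", re.IGNORECASE)
# UK mobiles in international format: "+447" followed by exactly nine digits.
UK_MOBILE_RE = re.compile(r"\+447\d{9}(?!\d)")


def classify(value):
    """Return the list of privacy categories that a single value matches."""
    text = str(value)
    found = []
    if EMAIL_RE.search(text):
        found.append("email")
    if UK_MOBILE_RE.search(text):
        found.append("phone number")
    return found


for example in ["an_email@email.com", "+447123456789", "xxxxxxxx"]:
    print(example, classify(example))
```

Values that have already been masked, such as "xxxxxxxx", match neither pattern and so return an empty list.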

In [1]:
%load_ext watermark
%watermark -n -p pandas,privacypanda -g

Sun Mar 08 2020 

pandas 1.0.1
privacypanda 0.1.0.dev0
Git hash: 7d1343dc13973da5c265a5a2bcf1915384c3e131


In [2]:
import pandas as pd
import privacypanda as pp

---
## Firstly, we need data

In [3]:
data = pd.DataFrame(
    {
        "user ID": [
            1665,
            1,
            5287,
            42,
        ],
        "User email": [
            "xxxxxxxxxxxxx",
            "xxxxxxxx",
            "I'm not giving you that",
            "an_email@email.com",
        ],
        "User address": [
            "AB1 1AB",
            "",
            "XXX XXX",
            "EC2R 8AH",
        ],
        "Likes raclette": [
            1,
            0,
            1,
            1,
        ],
    }
)

You will notice two things about this dataframe:
1. _Some_ of the data has already been anonymized, for example by replacing characters with "x"s. However, whoever collected this data was not fastidious about cleaning it, so some raw, potentially problematic private information remains. As a dataset grows, it becomes easier to miss entries that contain private information.
2. Not all columns expose private information. "Likes raclette" is fairly benign, and "user ID" is already an anonymized label for an individual. Be careful, though: many benign pieces of information can be combined into a unique fingerprint that identifies an individual. We will not worry about this for the moment.

---
## Auditing the data's privacy
As data scientists, we want a simple way to tell which columns, if any, break privacy. More importantly, _how_ a column breaks privacy determines how we deal with it. For example, emails are likely to be superfluous for analysis and can therefore be removed from the data, whereas age may be important, so we may instead wish to apply differential privacy to the dataset.
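As a minimal sketch of the first strategy, removing a superfluous column is a one-liner in plain pandas. The small `df` frame and the `columns_to_drop` list below are hypothetical and chosen by hand for illustration; nothing here relies on `privacypanda`.

```python
import pandas as pd

# A small stand-in frame; the values are made up for this sketch.
df = pd.DataFrame(
    {
        "user ID": [1665, 42],
        "User email": ["xxxxxxxx", "an_email@email.com"],
        "Likes raclette": [1, 1],
    }
)

# Hypothetical list of columns judged superfluous for the analysis.
columns_to_drop = ["User email"]

# drop() returns a new dataframe and leaves the original untouched.
cleaned = df.drop(columns=columns_to_drop)
print(list(cleaned.columns))  # ['user ID', 'Likes raclette']
```

Dropping a whole column is the bluntest option: it is safe, but it also discards every entry in that column, including those that were already anonymized.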

We can use `privacypanda`'s `report_privacy` function to see which data is problematic.

In [4]:
report = pp.report_privacy(data)
print(report)

User address: ['address']
User email: ['email']



`report_privacy` returns a `Report` object which stores the privacy issues of each column in the data.

As `privacypanda` is in active development, this is currently only a simple dictionary recording a binary "breaks"/"doesn't break" privacy verdict for each column. We aim to make this information _cell-level_, i.e. to remove or replace the information in individual cells, so that privacy is protected with less information loss.
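To illustrate what cell-level replacement could look like, here is a rough sketch that masks only the cells containing an email-like string and leaves every other cell alone. The `mask_emails` helper and its pattern are assumptions made for this example; they are not part of `privacypanda`'s current or planned API.

```python
import re

import pandas as pd

# Illustrative email pattern (an assumption, not privacypanda's own rule).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w.-]+\.(?:com|co\.uk|org|edu)\b", re.IGNORECASE)


def mask_emails(series):
    """Replace email-like substrings with x's of the same length, cell by cell."""

    def mask_cell(value):
        # Leave non-string cells (numbers, missing values) as they are.
        if not isinstance(value, str):
            return value
        return EMAIL_RE.sub(lambda match: "x" * len(match.group(0)), value)

    return series.map(mask_cell)


emails = pd.Series(["xxxxxxxx", "I'm not giving you that", "an_email@email.com"])
masked = mask_emails(emails)
print(masked.tolist())
```

Compared with dropping the whole column, this keeps the entries that were never a problem, which is the "less information loss" trade-off described above.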