# De-identification & Re-identification

Download the dataset by clicking [here](https://jnear.github.io/cs295-data-privacy/homework/adult_with_pii.csv) and placing it in the same directory as this notebook.

The dataset is based on census data. The personally identifiable information (PII) is made up.

In [2]:
import pandas as pd
import numpy as np

Set `dataset_path` in the cell below to the location of the downloaded file. <br>
Example: dataset_path = "/Users/pxu3/Desktop/FALL 2019/DS300/Lectures/DS300-master/adult_with_pii.csv"

In [3]:
dataset_path = "Add the path to the dataset here"

In [None]:
adult = pd.read_csv(dataset_path)
adult.head()

# De-identification

*De-identification* is the process of removing *identifying information* from a dataset. The term *de-identification* is sometimes used synonymously with the terms *anonymization* and *pseudonymization*.

Identifying information has no formal definition. It is usually understood to be information that could be used to identify us uniquely in the course of daily life - name, address, phone number, e-mail address, etc. As we will see later, it's *impossible* to formalize the concept of identifying information, because *all* information is identifying. The term *personally identifiable information (PII)* is often used synonymously with identifying information.

How do we de-identify information? Easy - we just remove the columns that contain identifying information!

In [None]:
adult_data = adult.drop(columns=['Name', 'SSN'])
adult_data.head()

We'll also keep a copy of the original data, including the identifying information, for later, when we'll use it as *auxiliary data* to perform a *re-identification* attack.

In [None]:
adult_pii = pd.read_csv(dataset_path)
adult_pii.head()

# Linking Attacks

Imagine we want to determine the income of a friend from our de-identified data. Names have been removed, but we happen to know some auxiliary information about our friend. Our friend's name is Karrie Trusslove, and we know Karrie's date of birth and zip code.

In [None]:
adult_pii.head(1)

## A Simple Linking Attack

To perform a simple *linking attack*, we look at the overlapping columns between the dataset we're trying to attack, and the auxiliary data we know. In this case, both datasets have dates of birth and zip codes. We look for rows in the dataset we're attacking with dates of birth and zip codes that match Karrie's date of birth and zip code. If there is only one such row, we've found Karrie's row in the dataset we're attacking. In databases, this is called a *join* of two tables, and we can do it in Pandas using `merge`.

In [None]:
karries_row = adult_pii[adult_pii['Name'] == 'Karrie Trusslove']
karries_row

In [None]:
pd.merge(karries_row, adult_data, left_on=['DOB', 'Zip'], right_on=['DOB', 'Zip'])

Indeed, there is only one row that matches. We have used auxiliary data to re-identify an individual in a de-identified dataset, and we're able to infer that Karrie's income is less than $50k.
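The join described above can also be tried without the real dataset. The following is a minimal sketch on made-up data: the names, dates, ZIP codes and the `released` and `aux` tables are all hypothetical, and only the `pd.merge` call mirrors the cell above.

```python
import pandas as pd

# Hypothetical de-identified release: names removed, DOB and Zip kept.
released = pd.DataFrame({
    "DOB": ["1/2/1980", "1/2/1980", "7/9/1975"],
    "Zip": [10001, 10002, 10001],
    "Target": ["<=50K", ">50K", "<=50K"],
})

# Hypothetical auxiliary data the attacker already knows about one person.
aux = pd.DataFrame({
    "Name": ["Alice Example"],
    "DOB": ["1/2/1980"],
    "Zip": [10002],
})

# Inner join on the columns both tables share.
matches = pd.merge(aux, released, on=["DOB", "Zip"])

# Exactly one matching row means the record is uniquely re-identified.
reidentified = len(matches) == 1
print(matches)
```

Two released rows share the date of birth and two share the ZIP code, but only one row has both, which is why the join returns a single record.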

## How Hard is it to Re-Identify Karrie?

This scenario is made up, but linking attacks are surprisingly easy to perform in practice. How easy? It turns out that in many cases, just one data point is sufficient to pinpoint a row!

In [None]:
karries_new_row = adult_pii[adult_pii['Name'] == 'Karrie Trusslove'][['Name', 'Zip']]
karries_new_row

In [None]:
pd.merge(karries_new_row, adult_data, left_on=['Zip'], right_on=['Zip'])

So ZIP code is sufficient **by itself** to allow us to re-identify Karrie. What about date of birth?

In [None]:
karries_newer_row = adult_pii[adult_pii['Name'] == 'Karrie Trusslove'][['Name', 'DOB']]
karries_newer_row

In [None]:
pd.merge(karries_newer_row, adult_data, left_on=['DOB'], right_on=['DOB'])

This time, there are three rows returned - and we don't know which one is the real Karrie. But we've still learned a lot!

- We know that there's a 2/3 chance that Karrie's income is less than $50k
- We can look at the differences between the rows to determine what additional auxiliary information would *help* us to distinguish them (e.g. sex, occupation, marital status)
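Both observations can be computed directly from the candidate rows. The sketch below uses a hypothetical three-row `candidates` table (not the actual output of the join above) and assumes that, absent other information, each candidate is equally likely to be the target.

```python
import pandas as pd

# Hypothetical candidate rows returned by a join on a single attribute.
candidates = pd.DataFrame({
    "Sex": ["Female", "Male", "Female"],
    "Occupation": ["Sales", "Sales", "Tech-support"],
    "Target": ["<=50K", "<=50K", ">50K"],
})

# Share of candidates in each income bracket (uniform prior over candidates).
income_odds = candidates["Target"].value_counts(normalize=True)

# Columns whose values differ between candidates are the ones worth
# learning next, since they can shrink the candidate set.
distinguishing = [
    col for col in candidates.columns
    if col != "Target" and candidates[col].nunique() > 1
]

print(income_odds)
print(distinguishing)
```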

## Is Karrie Special?

How hard is it to re-identify others in the dataset? Is Karrie especially easy or especially difficult to re-identify? A good way to gauge the effectiveness of this type of attack is to look at how "selective" certain pieces of data are - how good they are at narrowing down the set of potential rows which may belong to the target individual. For example, is it common for birthdates to occur more than once?

In [None]:
adult_pii['DOB'].value_counts().head(n=20)

This is encouraging - some dates of birth occur eight times! However, it's common for a few values to be represented many times, while the vast majority are actually pretty rare. We'd like to get an idea of how many dates of birth are likely to be useful in performing an attack, which we can do by looking at how common "unique" dates of birth are in the dataset.

In [None]:
adult_pii['DOB'].value_counts().hist();

We can do the same thing with ZIP codes, and we find the same results - ZIP code happens to be very selective in this dataset.

In [None]:
adult_pii['Zip'].value_counts().hist();
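The histograms above can be complemented by a single number: the fraction of rows whose value appears only once in a column. This is a rough sketch; `unique_fraction` is a hypothetical helper and `toy` is a made-up column, not part of the dataset.

```python
import pandas as pd

def unique_fraction(column: pd.Series) -> float:
    """Fraction of rows whose value appears exactly once in the column."""
    counts = column.value_counts()
    return (counts == 1).sum() / len(column)

# Hypothetical column: "a" and "d" are unique, "b" and "c" are shared.
toy = pd.Series(["a", "b", "b", "c", "c", "c", "d"])
print(unique_fraction(toy))
```

Applied to the notebook's data, the same helper could be called as `unique_fraction(adult_pii['Zip'])`; a value close to 1 would mean that the column alone pinpoints almost every row.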

## How Many People can we Re-Identify?

In this dataset, how many people can we re-identify uniquely? We can use our auxiliary information to find out! First, let's see what happens with just dates of birth:

In [None]:
attack = pd.merge(adult_pii, adult_data, left_on=['DOB'], right_on=['DOB'])
attack['Name'].value_counts().hist();

So it's not possible to re-identify a majority of individuals using *just* date of birth. But, for the vast majority of records, we get between 1 and 3 records - so it might be possible to guess which record is the right one, or collect more information to narrow things down further. If we use both date of birth and ZIP, things get much better:

In [None]:
attack = pd.merge(adult_pii, adult_data, left_on=['DOB', 'Zip'], right_on=['DOB', 'Zip'])
attack['Name'].value_counts().hist();

When we use both pieces of information, we can re-identify **essentially everyone**. This is a surprising result, since we generally assume that many people share the same birthday, and many people live in the same ZIP code. It turns out that the *combination* of these factors is **extremely** selective. According to Latanya Sweeney's work, 87% of people in the US can be uniquely re-identified by the combination of date of birth, gender, and ZIP code.

Let's just check that we've actually re-identified *everyone*:

In [None]:
attack['Name'].value_counts().head()

Looks like we missed two people! In other words, in this dataset, only **two people** share a combination of ZIP code and date of birth.
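The same check can be made explicit by counting matches per person instead of reading them off a histogram. The sketch below uses hypothetical tables and assumes that names are unique in the auxiliary data.

```python
import pandas as pd

# Hypothetical auxiliary data: C and D share both DOB and Zip.
aux = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "DOB": ["1/1/1980", "1/1/1980", "2/2/1990", "2/2/1990"],
    "Zip": [10001, 10002, 20001, 20001],
})

# Hypothetical de-identified release covering the same four people.
released = aux.drop(columns=["Name"]).assign(Age=[40, 41, 30, 31])

attack = pd.merge(aux, released, on=["DOB", "Zip"])

# Number of released rows each name was linked to.
matches_per_name = attack.groupby("Name").size()

unique_names = matches_per_name[matches_per_name == 1].index.tolist()
ambiguous_names = matches_per_name[matches_per_name > 1].index.tolist()

print(unique_names)
print(ambiguous_names)
```

People linked to exactly one row are uniquely re-identified; anyone linked to several rows shares their combination of attributes with someone else.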

# Aggregation

Another way to prevent the release of private information is to release only *aggregate* data.

In [None]:
adult['Age'].mean()

## Problem of Small Groups

A single overall average isn't very useful, though! So mostly we see aggregated results broken down along some axis.

In [None]:
adult[['Education-Num', 'Age']].groupby('Education-Num').mean()

If a group is too small, we run into problems right away: the mean of a group containing a single person is just that person's value.

In [None]:
adult[['Zip', 'Age']].groupby('Zip').mean().head()

Consider: Many census statistics are at the block level, which means it might be easy to get auxiliary information to reverse an aggregation like "mean." How big a group is "big enough"? It's not easy to say!
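One way to see the problem is to report the group size next to each mean. The sketch below uses made-up data, and `MIN_GROUP_SIZE` is an arbitrary, hypothetical threshold chosen only for illustration; filtering on it is not a privacy guarantee.

```python
import pandas as pd

# Hypothetical data: ZIP 20002 contains a single person.
people = pd.DataFrame({
    "Zip": [10001, 10001, 10001, 20002, 30003, 30003],
    "Age": [30, 40, 50, 61, 20, 24],
})

MIN_GROUP_SIZE = 3  # hypothetical cutoff, not a recommended value

# Mean age and number of people per ZIP code.
stats = people.groupby("Zip")["Age"].agg(["mean", "count"])

# Groups of one: the published "mean" is that person's exact age.
exposed = stats[stats["count"] == 1]

# Groups that would survive a naive minimum-size filter.
kept = stats[stats["count"] >= MIN_GROUP_SIZE]

print(stats)
```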

## Differencing Attacks

The problem is *much* worse when you get to design your own queries. A "sum" query over a large group might seem fine:

In [None]:
adult['Age'].sum()

We might do another query over a large group:

In [None]:
adult[adult['Name'] != 'Karrie Trusslove']['Age'].sum()

Combine them, and we're in trouble!

In [None]:
adult['Age'].sum() - adult[adult['Name'] != 'Karrie Trusslove']['Age'].sum()

This is a recurring theme.

- Releasing *data* that is useful makes ensuring *privacy* very difficult
- Distinguishing between *malicious* and *non-malicious* queries is not possible
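The differencing attack above can be reproduced on a toy table. This is a minimal sketch with hypothetical names and ages; `sum_query` is a hypothetical stand-in for an interface that only answers aggregate queries.

```python
import pandas as pd

# Hypothetical private table held by the data curator.
people = pd.DataFrame({
    "Name": ["Alice Example", "Bob Example", "Carol Example"],
    "Age": [34, 51, 27],
})

def sum_query(df: pd.DataFrame, mask: pd.Series) -> int:
    """Return only the total age of the selected rows, never the rows."""
    return int(df.loc[mask, "Age"].sum())

# Each query on its own covers several people and looks harmless.
everyone = sum_query(people, people["Name"].notna())
everyone_but_bob = sum_query(people, people["Name"] != "Bob Example")

# Their difference is exactly one person's private value.
bobs_age = everyone - everyone_but_bob
print(bobs_age)
```

Neither query is suspicious in isolation, which is why filtering out "malicious" queries one at a time does not help.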

# Summary

- A *linking attack* involves combining *auxiliary data* with *de-identified data* to *re-identify* individuals.
- In the simplest case, a linking attack can be performed via a *join* of two tables containing these datasets.
- Simple linking attacks are surprisingly effective:
  - Just a single data point is sufficient to narrow things down to a few records
  - The narrowed-down set of records helps suggest additional auxiliary data which might be helpful
  - Two data points are often good enough to re-identify a huge fraction of the population in a particular dataset
  - Three data points (gender, ZIP code, date of birth) uniquely identify 87% of people in the US