# Anonymization techniques

## Motivating examples
These examples are drawn from {cite}`dp-theory-practice-book, sweeneyKanon-2002`.

### Medical data linking 
The National Association of Health Data Organizations (NAHDO) reported
that 37 states in the USA have legislative mandates to collect hospital-level
data and that 17 states have started collecting ambulatory care data from
hospitals, physicians’ offices, clinics, and so on. The leftmost circle in
Figure 1 contains a subset of the fields of information, or attributes, that
NAHDO recommends these states collect; these attributes include the
patient’s ZIP code, birth date, gender, and ethnicity.

In Massachusetts, the Group Insurance Commission (GIC) is responsible
for purchasing health insurance for state employees. GIC collected patient-specific
data with nearly one hundred attributes per encounter, along the lines
of those shown in the leftmost circle of Figure 1, for approximately
135,000 state employees and their families. Because the data were believed to
be anonymous, GIC gave a copy of the data to researchers and sold a copy to
industry.

For twenty dollars, Sweeney purchased the voter registration list for Cambridge,
Massachusetts, and received the information on two diskettes. The
rightmost circle in Figure 1 shows that these data included the name, address,
ZIP code, birth date, and gender of each voter. This information can be linked
to the medical information using ZIP code, birth date, and gender, thereby tying diagnoses, procedures, and medications to named individuals.


```{figure} hospital-voter.png
---
height: 400px
---
Linking to re-identify data (from Latanya Sweeney, 2002).
```


For example, William Weld was governor of Massachusetts at that time,
and his medical records were in the GIC data. Governor Weld lived in
Cambridge, Massachusetts. According to the Cambridge voter list, six people
had his particular birth date; only three of them were men; and he was the
only one of those three in his 5-digit ZIP code.
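
The linking step can be sketched in a few lines of Python. This is only an illustration: every record below is invented (hypothetical names, ZIP codes, and diagnoses), and `qi_key` and `link_records` are helper names assumed for this sketch, not something taken from the cited sources. The sketch joins the two tables on the triple (ZIP code, birth date, gender) and reports the medical records that match exactly one voter.

```python
from collections import defaultdict

# Attributes present in both releases, as described in the text.
QUASI_IDENTIFIERS = ("zip", "birth_date", "gender")

# Invented "de-identified" medical rows: no names, but quasi-identifiers remain.
medical = [
    {"zip": "02138", "birth_date": "1950-01-15", "gender": "M", "diagnosis": "hypertension"},
    {"zip": "02139", "birth_date": "1962-07-04", "gender": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_date": "1971-03-22", "gender": "F", "diagnosis": "diabetes"},
]

# Invented voter list: names plus the same quasi-identifiers.
voters = [
    {"name": "Alice Example", "zip": "02139", "birth_date": "1962-07-04", "gender": "F"},
    {"name": "Bob Example", "zip": "02138", "birth_date": "1950-01-15", "gender": "M"},
    {"name": "Carol Example", "zip": "02139", "birth_date": "1971-03-22", "gender": "F"},
    {"name": "Dana Example", "zip": "02139", "birth_date": "1971-03-22", "gender": "F"},
]


def qi_key(record):
    """Project a record onto its quasi-identifier values."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link_records(medical_rows, voter_rows):
    """Return (name, diagnosis) pairs for medical rows that match exactly one voter."""
    voters_by_key = defaultdict(list)
    for voter in voter_rows:
        voters_by_key[qi_key(voter)].append(voter["name"])

    linked = []
    for row in medical_rows:
        candidates = voters_by_key.get(qi_key(row), [])
        if len(candidates) == 1:  # a unique match re-identifies the patient
            linked.append((candidates[0], row["diagnosis"]))
    return linked


reidentified = link_records(medical, voters)
print(reidentified)
```

In this toy data the third medical record matches two voters, so it stays ambiguous, while the other two are tied to a single name each. The Weld example works the same way: each additional attribute shrinks the candidate set until one person remains.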

### Web search log
Another well-known privacy incident came from publishing web search logs. In 2006, AOL
released three months of search logs involving 650,000 users. The main privacy protection technique
used was replacing user IDs with random numbers. This proved to be a failure. Two New
York Times journalists, Michael Barbaro and Tom Zeller [2006], were able to re-identify Thelma Arnold,
a 62-year-old woman living in Lilburn, GA, from the published search logs. Ms. Arnold’s search
log included her last name and names of locations near where she lived. The reporters were able to
cross-reference this information with phonebook entries. After the New York Times article was
published, AOL immediately retracted the data. Later, a class action lawsuit was
filed against AOL. The scandal led to the resignation of AOL’s CTO and the dismissal of two
employees.
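
A minimal sketch shows why replacing user IDs with random numbers offers so little protection. The log entries, the `pseudonymize` and `profiles` helpers, and the seed are all assumptions made for this illustration, not AOL's actual procedure. The point is that a consistent pseudonym still ties every query from one person together, so the query text itself becomes the identifier.

```python
import random
from collections import defaultdict

# Invented search log: (user_id, query) pairs.
raw_log = [
    ("user_a", "plumbers in springfield"),
    ("user_b", "cheap flights to boston"),
    ("user_a", "homes for sale maple street springfield"),
    ("user_a", "smith family reunion"),
    ("user_b", "weather boston"),
]


def pseudonymize(log, seed=0):
    """Replace each user id with a random number, the same one for all of a user's rows."""
    rng = random.Random(seed)
    pseudonyms = {}
    released = []
    for user_id, query in log:
        if user_id not in pseudonyms:
            candidate = rng.randrange(10**7)
            while candidate in pseudonyms.values():  # avoid accidental collisions
                candidate = rng.randrange(10**7)
            pseudonyms[user_id] = candidate
        released.append((pseudonyms[user_id], query))
    return released


def profiles(released_log):
    """Group queries by pseudonym, as anyone holding the release can do."""
    grouped = defaultdict(list)
    for pseudonym, query in released_log:
        grouped[pseudonym].append(query)
    return dict(grouped)


released = pseudonymize(raw_log)
user_profiles = profiles(released)
for pseudonym, queries in user_profiles.items():
    print(pseudonym, queries)
```

The original IDs are gone, but one pseudonym's profile still contains a surname, a town, and a street, which is the kind of combination the reporters cross-referenced with phonebook entries.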

### Netflix
In 2006, Netflix released a dataset containing the movie rating data from about 500,000 users
as part of a one-million-dollar challenge to the data mining research community to develop
effective algorithms for predicting users’ movie preferences based on their viewing history
and ratings. Although the data was anonymized to protect users’ privacy, Narayanan and
Shmatikov [2008] showed that an adversary with some knowledge of a subscriber’s movie
viewing history can easily identify the subscriber’s record if it is present in the dataset. For example,
Narayanan and Shmatikov [2008] showed that, among the profiles of 50 IMDb users, at least
two also appear in the Netflix dataset.
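
The idea of matching auxiliary knowledge against an anonymized rating table can be sketched as follows. This is a deliberately simplified scoring rule written for this section, not the algorithm from Narayanan and Shmatikov's paper; the movie titles, ratings, record IDs, and the `tolerance` and `min_gap` parameters are all hypothetical. The adversary scores each anonymized record by how many known ratings it approximately matches and accepts the best record only if it clearly stands out from the runner-up.

```python
def match_score(aux, record, tolerance=1):
    """Count auxiliary ratings that the record matches within `tolerance` stars."""
    return sum(
        1
        for movie, rating in aux.items()
        if movie in record and abs(record[movie] - rating) <= tolerance
    )


def best_match(aux, dataset, min_gap=2):
    """Return the top-scoring record id if it beats the runner-up by `min_gap`, else None."""
    scored = sorted(
        ((match_score(aux, record), record_id) for record_id, record in dataset.items()),
        reverse=True,
    )
    (top_score, top_id), (second_score, _) = scored[0], scored[1]
    return top_id if top_score - second_score >= min_gap else None


# Invented anonymized release: record id -> {movie: star rating}.
anonymized = {
    101: {"Movie A": 5, "Movie B": 2, "Movie C": 4, "Movie D": 1},
    102: {"Movie A": 4, "Movie E": 3, "Movie F": 5},
    103: {"Movie B": 5, "Movie C": 1, "Movie G": 4},
}

# What the adversary roughly knows about one person, e.g. from public reviews.
aux_knowledge = {"Movie A": 5, "Movie C": 4, "Movie D": 2}

print(best_match(aux_knowledge, anonymized))      # a confident match
print(best_match({"Movie A": 5}, anonymized))     # too little knowledge: ambiguous
```

Even imprecise knowledge of a few ratings singles out one record here, because real rating histories are sparse and few people rate the same handful of titles the same way.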

### Genome data leak
Another privacy incident involved Genome-Wide Association Studies (GWAS). These
studies normally compare the DNA sequences of two groups of participants: people with the
disease (cases) and similar people without it (controls). Each person’s DNA mutations (single-nucleotide
polymorphisms, or SNPs) at indicative locations are read, and this information is then
analyzed. Traditionally, researchers publish aggregate frequencies of SNPs for participants in the
two groups. Homer et al. [2008] proposed attacks that could tell with high confidence
whether an individual is in the case group, assuming that the individual’s DNA is known. The
attack works even if the group includes hundreds of individuals. Because of the privacy concerns
raised by such attacks, a number of institutions, including the U.S. National Institutes of Health
(NIH) and the Wellcome Trust in London, decided to restrict access to data from GWAS.
Such attacks need access to the victim’s DNA data and to a publicly available genomic database to
establish the likely SNP frequencies in the general population.
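
The intuition behind this kind of membership inference can be illustrated with a small simulation on synthetic data. The statistic below sums, over all SNPs, how much closer the victim's genotype is to the published group frequency than to the reference-population frequency; it is written in the spirit of the distance measure used by Homer et al. but is a simplified sketch, not their full statistical test. The number of SNPs, the group size, the allele-frequency range, and all function names are assumptions made for this example.

```python
import random


def simulate_genotype(rng, allele_freqs):
    """Draw one person's genotype: allele dosage in {0, 0.5, 1} for each SNP."""
    return [((rng.random() < p) + (rng.random() < p)) / 2 for p in allele_freqs]


def membership_statistic(genotype, reference_freqs, group_freqs):
    """Sum over SNPs of |y - reference| - |y - group|; larger values suggest membership."""
    return sum(
        abs(y - ref) - abs(y - grp)
        for y, ref, grp in zip(genotype, reference_freqs, group_freqs)
    )


rng = random.Random(42)
NUM_SNPS = 20_000   # assumed number of SNPs with published frequencies
GROUP_SIZE = 100    # assumed size of the case group

# Assumed reference-population allele frequencies, available to the attacker.
reference_freqs = [rng.uniform(0.1, 0.9) for _ in range(NUM_SNPS)]

# Synthetic case group and the aggregate frequencies a study would publish.
case_group = [simulate_genotype(rng, reference_freqs) for _ in range(GROUP_SIZE)]
group_freqs = [sum(column) / GROUP_SIZE for column in zip(*case_group)]

member = case_group[0]                                 # someone inside the study
outsider = simulate_genotype(rng, reference_freqs)     # someone not in the study

member_stat = membership_statistic(member, reference_freqs, group_freqs)
outsider_stat = membership_statistic(outsider, reference_freqs, group_freqs)
print(f"member: {member_stat:.1f}, outsider: {outsider_stat:.1f}")
```

Each member nudges the published frequencies slightly toward their own genotype. The nudge is tiny at any single SNP, but it accumulates across thousands of SNPs, which is why publishing only aggregates does not hide participation.
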

### Location-based social networks
Another example of the failure of naive “data anonymization” is location-based social
networks that provide a friend-discovery feature based on location proximity, documented by Li et al.
[2014b]. These social networks try to provide some privacy protection by using a number of
location-obfuscating techniques. One technique is to show a user only the relative distances of
other users to her, instead of their location coordinates. Another is to limit the precision
of the reported information. For example, Skout sets its localization accuracy to 1 mile, i.e., users
are located with an accuracy no better than 1 mile. Similarly, WeChat and Momo set 100 m and 10 m as their localization accuracy limits. Yet another technique is to restrict a user’s view
to within a certain distance of the user, or to no more than a certain number of users. Li et al.
[2014b] demonstrated the effectiveness of attacks that issue many queries from a large number of
fake locations and then aggregate the responses to infer the exact locations of users.
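
The aggregation step can be sketched as a least-squares multilateration. This is an illustrative reconstruction under stated assumptions, not the procedure of Li et al.: `reported_distance` is a hypothetical stand-in for the service's response, distances are assumed to be rounded to 100 m, the plane is treated as flat with coordinates in metres, and the victim's position and the number of fake locations are invented.

```python
import math
import random

PRECISION_M = 100.0  # assumed reporting precision: distances rounded to 100 m


def reported_distance(victim, probe):
    """Hypothetical service response: distance to the victim, rounded to PRECISION_M."""
    return round(math.dist(victim, probe) / PRECISION_M) * PRECISION_M


def estimate_location(probes, distances):
    """Least-squares position estimate from many (fake location, reported distance) pairs."""
    n = len(probes)
    mean_x = sum(px for px, _ in probes) / n
    mean_y = sum(py for _, py in probes) / n
    # d^2 = (x - px)^2 + (y - py)^2 implies that c = d^2 - px^2 - py^2 equals
    # (x^2 + y^2) - 2*px*x - 2*py*y. Centering removes the unknown constant x^2 + y^2.
    c = [d * d - px * px - py * py for (px, py), d in zip(probes, distances)]
    mean_c = sum(c) / n

    sxx = sxy = syy = sxb = syb = 0.0
    for (px, py), ci in zip(probes, c):
        ax, ay, b = -2 * (px - mean_x), -2 * (py - mean_y), ci - mean_c
        sxx += ax * ax
        sxy += ax * ay
        syy += ay * ay
        sxb += ax * b
        syb += ay * b

    # Solve the 2x2 normal equations for (x, y).
    det = sxx * syy - sxy * sxy
    return ((sxb * syy - syb * sxy) / det, (syb * sxx - sxb * sxy) / det)


rng = random.Random(7)
victim = (1234.0, -567.0)  # true position in metres, unknown to the attacker
probes = [(rng.uniform(-5000, 5000), rng.uniform(-5000, 5000)) for _ in range(400)]
distances = [reported_distance(victim, probe) for probe in probes]

estimate = estimate_location(probes, distances)
error_m = math.dist(victim, estimate)
print(f"estimate: ({estimate[0]:.1f}, {estimate[1]:.1f}), error: {error_m:.1f} m")
```

Each rounded distance is individually coarse, but the rounding errors largely cancel when hundreds of responses are combined, so the estimate ends up far more precise than the advertised accuracy limit.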


---
**References**
```{bibliography}
```