# Other anonymization techniques

## Limitations of k-anonymity

There are multiple possible attacks on a k-anonymized dataset. Below, we explain two, with examples taken from {cite}`Machanavajjhala et al.<l-diversity>`. We will use the following table and its 4-anonymized version.

![figure](l-diversity-example.png)


## Homogeneity attack

Alice and Bob are antagonistic neighbors. One day Bob falls ill and is taken by ambulance to the hospital. Having seen the ambulance, Alice sets out to discover what disease Bob is suffering from. Alice discovers the 4-anonymous table of current inpatient records published by the hospital, and so she knows that one of the records in this table contains Bob’s data. Since Alice is Bob’s neighbor, she knows that Bob is a 31-year old American male who lives in the zip code 13053. Therefore, Alice knows that Bob’s record number is 9, 10, 11, or 12. All of those patients have the same medical condition (cancer), and so Alice concludes that Bob has cancer.
Thus, k-Anonymity can create groups that leak information due to lack of diversity in the sensitive attribute.

```{note}
k-anonymous tables can contain groups of k people (equivalence classes) who all have the same value for the sensitive attribute (i.e., the group is homogeneous). Identifying membership in such a group leaks sensitive data.
```

Such leaks are common. Consider a health dataset of 60,000 people where the health condition (the sensitive attribute) can take three distinct, equally likely values and is uncorrelated with the nonsensitive attributes. A 5-anonymized version of this table will have about 12,000 groups. On average, in 1 out of every 81 groups ($3 \times (1/3)^5 = 1/81$), all five members will have the same health condition (no diversity). That gives about 148 groups with no diversity, and therefore about 740 people whose data leaks through a homogeneity attack.
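The arithmetic in this estimate can be reproduced with a few lines of Python. This is a minimal sketch that assumes the three values are equally likely and independent across records; the function name `homogeneous_group_probability` is ours and is used only for illustration.

```python
def homogeneous_group_probability(num_values: int, group_size: int) -> float:
    """Probability that every member of a group shares one sensitive value.

    Assumes (for illustration) that values are uniform and independent.
    """
    return num_values * (1 / num_values) ** group_size


n_people = 60_000
k = 5                                  # group size in the 5-anonymized table
n_groups = n_people // k               # about 12,000 groups

p = homogeneous_group_probability(num_values=3, group_size=k)
expected_groups = n_groups * p         # groups with no diversity, on average
affected_people = round(expected_groups) * k

print(f"1 in {round(1 / p)} groups is homogeneous")
print(f"about {round(expected_groups)} groups, {affected_people} people")
```

Changing `num_values` or `k` shows how quickly the risk grows when the sensitive attribute has few distinct values or the groups are small.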

The remedy to this problem is obvious: k-anonymized tables should also ensure diversity within each equivalence class in terms of the values of the sensitive attribute. That is, all tuples with the same values for their quasi-identifiers should have diverse values for their sensitive attributes. But how do we achieve this diversity?

## Background knowledge attack
Alice has a pen-friend named Umeko who is admitted to the same hospital as Bob and whose patient records also appear in the table shown above. Alice knows that Umeko is a 21-year-old Japanese female who currently lives in zip code 13068. Based on this information, Alice learns that Umeko’s information is contained in record number 1, 2, 3, or 4. Without additional information, Alice is not sure whether Umeko caught a virus or has heart disease. However, it is well known that Japanese have an extremely low incidence of heart disease. Therefore, Alice concludes with near certainty that Umeko has a viral infection. **Thus, k-Anonymity does not protect against attacks based on background knowledge.** Access to such background knowledge is also common. For example, employers might be required to publish vaccination or medical history of all state employees.

Having diverse values in the anonymized tables can tackle both attacks. In particular, we will look at $l-$diversity.

(sec_l_div)=
## $l-$diversity
An equivalence class ($EC$) is $l-$diverse if it contains at least $l$ *well-represented* values for the sensitive attribute $S$. A table is $l-$diverse if every $EC$ is $l-$diverse.

Note the stress on *well-represented*. What does it mean, and how do we ensure this property? We will see variants of $l-$diversity based on different definitions of well-representedness.

### Distinct $l-$diversity
This variant ensures that each equivalence class has at least $l$ distinct values for the sensitive attribute. The following 4-anonymous table has been converted to a $3-$diverse table ($l=3$), since each $EC$ has at least three distinct values in the `Condition` column.

```{figure} 3-diversity.png
---
height: 250px
name: 3-diversity
---
```
Unfortunately, distinct $l-$diversity does not prevent probabilistic inference attacks that exploit skewness in the sensitive attribute. Because distinct $l-$diversity only requires that there be $l$ different values, and does not constrain their (relative) frequencies, an equivalence class may have one value that appears much more frequently than the others. For example, consider a scenario where Bob's medical record was published in a 2-diverse table, but the equivalence class he belongs to (with $N=100$ records) is extremely skewed: all but one record have a pre-existing medical condition (so it still satisfies the 2-diversity condition). By looking at this table, an insurance company may infer that Bob has that condition (since $99\%$ of the people in this class have it), regardless of his actual health status. This potentially wrong inference can violate Bob's privacy and subject him to a higher insurance premium.
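The skewed scenario can be made concrete with a small check. The sketch below is illustrative only: the helper `is_distinct_l_diverse` and the record labels are hypothetical names, and the class is the 99-to-1 example described in this section.

```python
from collections import Counter


def is_distinct_l_diverse(sensitive_values, l):
    """True if the class contains at least l distinct sensitive values."""
    return len(set(sensitive_values)) >= l


# Hypothetical equivalence class of 100 records: 99 with the condition, 1 without.
skewed_class = ["condition"] * 99 + ["no condition"]

counts = Counter(skewed_class)
most_common_value, most_common_count = counts.most_common(1)[0]

# What an observer would guess about any member of this class.
attacker_confidence = most_common_count / len(skewed_class)

print(is_distinct_l_diverse(skewed_class, 2))   # the class passes the 2-diversity test
print(most_common_value, attacker_confidence)
```

The class passes the distinct 2-diversity test even though an observer can guess the sensitive value of any member with 99% confidence.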

---
### Entropy $l-$diversity
Another notion capturing the idea of a well-represented class is based on entropy, which measures the *uncertainty* of the values in a class. Formally, the entropy of an equivalence class $EC$ is defined as\
$H(EC) = - \sum_{s\in S} p(EC,s) \log{(p(EC,s))}$,\
where $S$ is the domain of the sensitive attribute and $p(EC, s)$ is the fraction of records in $EC$ that have sensitive value $s$.\
An equivalence class ($EC$) is *entropy $l-$diverse* if $H(EC)\geq \log(l)$.\
A table is entropy $l-$diverse if each $EC$ is entropy $l-$diverse.

Intuitively, since having more diverse values increases entropy (uncertainty), this notion indirectly encourages diversity of sensitive attribute values within equivalence classes.
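To see how the entropy condition behaves, here is a minimal Python sketch. The function names and the two example classes are hypothetical; the natural logarithm is used on both sides of the inequality, and a tiny tolerance is added as an implementation assumption to absorb floating-point rounding.

```python
import math
from collections import Counter


def entropy(sensitive_values):
    """Entropy H(EC) of the sensitive values in one equivalence class."""
    n = len(sensitive_values)
    return -sum((c / n) * math.log(c / n)
                for c in Counter(sensitive_values).values())


def is_entropy_l_diverse(sensitive_values, l, tol=1e-12):
    """True if H(EC) >= log(l); tol guards against rounding error."""
    return entropy(sensitive_values) >= math.log(l) - tol


# Hypothetical classes, each with three distinct values.
balanced = ["flu"] * 4 + ["cancer"] * 3 + ["heart disease"] * 3
skewed = ["flu"] * 8 + ["cancer"] + ["heart disease"]

for name, ec in [("balanced", balanced), ("skewed", skewed)]:
    print(name, round(entropy(ec), 3),
          is_entropy_l_diverse(ec, 2), is_entropy_l_diverse(ec, 3))
```

Both classes are distinct 3-diverse, but only the balanced one is entropy 2-diverse, which shows how the entropy variant penalizes skew that the distinct variant ignores.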

```{note}
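Constraint: for a table to be entropy $l-$diverse, the full table must itself have $H(T)\geq \log(l)$, because merging any two equivalence classes ($E_a, E_b$) gives an entropy of at least $\min(H(E_a), H(E_b))$, so the entropy of the whole table is at least that of its lowest-entropy class.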
```

Because of the above constraint, entropy $l-$diversity is considered too strict and often cannot be satisfied. Below is another variant that is less strict.

### Recursive $(c,l)-$diversity 

Let $m$ be the number of distinct sensitive values in an equivalence class $EC$, and let $r_i$ ($1\leq i \leq m$) be the number of times that the $i$-th most frequent sensitive value appears in $EC$.
Then, $EC$ has recursive $(c, l)-$diversity if $r_1 < c(r_l+ r_{(l+1)}+ \dots + r_m)$ where $c$ is a constant.

A table is said to have recursive $(c, l)-$diversity if all of its equivalence classes have recursive  $(c, l)-$diversity. 

Intuitively, this definition requires that the count of the most frequent value be less than a constant multiple of the total count of the $m-l+1$ least frequent values ($r_l, \dots, r_m$). This condition makes sure that the most frequent value does not appear too frequently, and the less frequent values do not appear too rarely. We control how strictly this condition is applied by selecting $c$ and $l$ appropriately: larger $l$ and smaller $c$ values correspond to tighter bounds.

Consider the second $EC$ (rows 5 to 8) in the following table.

![figure](4-anon-table.png)

Here,

- there are three different values of the sensitive attribute (`{cancer, flu, heart disease}`), so $m=3$;
- the most frequent value, `Flu`, appears twice, so $r_1 = 2$;
- the second and third most frequent values, `cancer` and `heart disease`, each appear once, so $r_2 = r_3 = 1$.

Then, $EC$ has recursive $(c, 2)-$diversity if $r_1 < c(r_2 + r_3)$, that is, if $2 < 2c$. This holds only when $c>1$; any smaller value of $c$ makes the condition impossible to satisfy for this class. Note also that a larger $l$ tightens the condition: with $l=3$, we need $r_1 < c\,r_3$, that is, $2 < c$, so $c$ must be increased further to satisfy it. The parameter $l$ decides how many of the *least* frequent counts to consider: the larger $l$ is, the smaller their total ($r_l+\dots +r_m$) will be, and the stricter the condition becomes. The parameter $c$ can be increased to relax the condition.
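The worked example can be checked mechanically. The following sketch implements the recursive $(c, l)-$diversity inequality from the definition; the function name `is_recursive_cl_diverse` is hypothetical, and the class contents are the counts stated in the text (one value twice, two values once each).

```python
from collections import Counter


def is_recursive_cl_diverse(sensitive_values, c, l):
    """Check r_1 < c * (r_l + r_(l+1) + ... + r_m) for one equivalence class."""
    # r[0] is r_1 (most frequent), r[-1] is r_m (least frequent).
    r = sorted(Counter(sensitive_values).values(), reverse=True)
    if len(r) < l:
        return False  # fewer than l distinct values: the tail sum is empty
    return r[0] < c * sum(r[l - 1:])


# Counts from the worked example: r_1 = 2, r_2 = r_3 = 1.
ec = ["flu", "flu", "cancer", "heart disease"]

print(is_recursive_cl_diverse(ec, c=1.0, l=2))   # 2 < 1.0 * 2 fails
print(is_recursive_cl_diverse(ec, c=1.5, l=2))   # 2 < 1.5 * 2 holds
print(is_recursive_cl_diverse(ec, c=2.0, l=3))   # 2 < 2.0 * 1 fails
print(is_recursive_cl_diverse(ec, c=2.5, l=3))   # 2 < 2.5 * 1 holds
```

Raising $l$ from 2 to 3 shrinks the tail sum from 2 to 1, which is why a larger $c$ is needed to satisfy the same class.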

### Limitations of $l-$diversity

#### Difficult to achieve or drastic loss of data utility
Consider a dataset containing a binary sensitive attribute (Y or N) that records whether someone was infected with a rare disease, where the value is `Y` for only $1\%$ of the entries ($N=100{,}000$). In this case, an entry with the value `N` is not sensitive, which is true for $99\%$ of the entries. Thus, achieving $l-$diversity is unnecessary for equivalence classes containing only `N` values. But if one wants 2-diversity, then there can be at most 1,000 equivalence classes, because there are only 1,000 entries with the value `Y` ($1\%$ of 100,000) and every equivalence class needs at least one of them. Such restrictions can also drastically reduce data utility. Consider the following table with only one quasi-identifier, Zip:

![figure](l-diversity-application.png)

The middle table shows equivalence classes with two entries each, all of which satisfy 2-diversity. But the zip code values have to be almost completely removed before publishing. If we wanted to apply entropy $l-$diversity, we would have to choose a small $l$, since the total entropy of the table is small (remember that $H(T)\geq \log(l)$ is required).

#### $l-$diversity cannot prevent attribute disclosure
**Skewness Attack:** If the overall distribution of a sensitive attribute in a table is skewed, satisfying $l-$diversity does not prevent attribute disclosure. Consider again the example table with 100,000 entries where $1\%$ have positive values for a disease. Suppose that one equivalence class has an equal number of positive and negative records. Even though it satisfies distinct 2-diversity, entropy 2-diversity, and recursive $(c, 2)-$diversity for any $c>1$, it presents a serious privacy risk, because anyone in that $EC$ would be considered to have a $50\%$ probability of being positive, compared with $1\%$ in the overall population. Consider another equivalence class that has 49 positive records and only 1 negative record. It is distinct 2-diverse and has a larger entropy than the overall table, so it satisfies any entropy $l-$diversity requirement that can be imposed, even though anyone in the equivalence class would be considered $98\%$ likely to be positive, rather than $1\%$. In fact, this equivalence class has exactly the same diversity as a class that has 1 positive and 49 negative records, even though the two classes present very different levels of privacy risk.
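The entropy comparison in the skewness attack can be verified numerically. This is a small sketch using the counts from the example (1% positive out of 100,000 overall, and classes of 49 versus 1); the helper name `entropy_from_counts` is ours.

```python
import math


def entropy_from_counts(counts):
    """Entropy (natural log) of a distribution given as raw counts."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


table = entropy_from_counts([1_000, 99_000])   # whole table: 1% positive
mostly_positive = entropy_from_counts([49, 1]) # 49 positive, 1 negative
mostly_negative = entropy_from_counts([1, 49]) # 1 positive, 49 negative

print(f"table entropy:           {table:.3f}")
print(f"49 positive, 1 negative: {mostly_positive:.3f}")
print(f"1 positive, 49 negative: {mostly_negative:.3f}")
print(f"P(positive) in the two classes: {49 / 50:.2f} vs {1 / 50:.2f}")
```

The two classes have the same entropy, and both exceed the entropy of the whole table, yet the inferred probability of being positive differs by a factor of 49. Entropy alone cannot tell these classes apart.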

**Similarity Attack:** If the sensitive attribute values in an equivalence class are distinct but semantically similar, an adversary can still learn important information. Consider the original table on the left and the 3-anonymized version on the right, which also satisfies 3-diversity for the `Disease` attribute. All entries in the first $EC$ have similar types of values, so anyone who knows that a person falls in that class can conclude that the person suffers from a stomach-related issue (this attack is similar to the homogeneity attack we saw earlier).

![figure](similarity-attack.png)

```{note}
**Intuition behind the limitations of $l-$diversity**

Intuitively, distributions that have the same level of (syntactic) diversity may provide very different levels of privacy, because 

1. There are semantic relationships among the attribute values.
2. Different values have very different levels of sensitivity, e.g., a headache is far less sensitive than cancer or diabetes.
3. Differences between the distribution within an equivalence class and the distribution in the whole population can reveal information.

```

## $t-$closeness

Intuitively, the loss of privacy can be measured by how much information an attacker gains after observing a dataset (compared to before observing it). $t-$closeness captures this notion and attempts to minimize the information gain. Let's try to understand this concept with examples.

Before seeing any data, an attacker might have some prior belief about someone's health condition. We can represent this prior belief with a probability distribution over the possible conditions. Let's assume that there are 100 possible conditions: `{heart disease, viral infection, cancer, flu, ...}` and that the prior belief $\alpha$ is a uniform probability distribution over this set. Now, if a health research center publishes the number of people suffering from each condition (and no other information), then the attacker's belief will change after seeing this data. For example, the attacker previously believed that the probability of someone suffering from cancer is $1\%$ (that is, $1/100$), but in reality, cancer might be much more frequent (unfortunately). Thus, the attacker's belief about an individual will also change. Let's call this new belief $\beta$ (and the population-level distribution of health conditions $Q$). The information gain is then $\beta - \alpha$. Note that this information gain is about the whole population: the attacker learns new information at the population level, but does not learn anything new about any specific individual.

Now consider that the attacker's belief will again change after seeing the anonymized table shown below:

![figure](4-anon-table.png)

If the new belief is $\gamma$, then the information gain is $\gamma - \beta$. This gain is about individual people. For example, if someone belongs to the second equivalence class (entries 5 to 8), the probability of that person suffering from cancer is $1/4$, heart disease is $1/2$, viral infection is $0$, and so on. If the health condition distribution in this equivalence class is represented by $P$, then $\gamma - \beta$ depends on the distance between $Q$ and $P$: the farther apart they are, the more the attacker learns about individuals in that class.

$l-$diversity attempted to diversify the (syntactic) values of a sensitive attribute within an equivalence class, but did not consider the distribution of those values in a class relative to their distribution in the whole table. In other words, it attempted to limit what an attacker could learn after looking at an $EC$ relative to their initial prior belief (i.e., $\gamma - \alpha$), without accounting for the population-level knowledge $\beta$. It tried to achieve this by increasing the entropy of each $EC$ through more diverse values (the uniform distribution has the highest entropy). But if the values within an $EC$ are not semantically diverse, then privacy can still be violated, as we saw above.

To tackle these cases, another measure of privacy, $t-$closeness ({cite}`$t-$closeness<t-closeness>`), was proposed. It achieves privacy by requiring that the distribution of values within each equivalence class follow the population distribution. For example, if `Cancer` is found in $20\%$ of the entries in the whole dataset, then it should appear in roughly $20\%$ of the entries in each class. Formally,

**$t-$closeness:** An equivalence class has $t-$closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold $t$. In other words, $d(P,Q)\leq t$, where $P$ and $Q$ are the distributions of values in the equivalence class and in the whole table, respectively, and $d$ is a distance function between two distributions. A table satisfies $t-$closeness if all of its equivalence classes satisfy $t-$closeness.

```{note}
$t-$closeness aims to keep the value distribution in each equivalence class **close** to the population distribution.
```

Minimizing this distance limits the amount of individual-specific information an observer can learn, even if the population-level distribution is publicly available. Thus, $t-$closeness minimizes $\gamma - \beta$ (as opposed to $\gamma - \alpha$, which is what $l-$diversity targets).

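How do we check that $d(P,Q)\leq t$? This requires some measure of how close $P$ and $Q$ are. We will use the *Earth mover's distance* (or EMD) for this purpose. For a numeric attribute with $m$ distinct values sorted in increasing order, let $p_i$ and $q_i$ be the probabilities of the $i$-th value under $P$ and $Q$. Then\
$D(P, Q) = \frac{1}{m-1} (|r_1| + |r_1+r_2| + \dots + |r_1+r_2 + \dots + r_{m-1}|)$,\
where $r_i=p_i-q_i$ $(i = 1, 2, \dots, m)$.

For example, suppose the whole table and two equivalence classes contain the following values:

$Q= \{3k, 4k, 5k, 6k, 7k, 8k, 9k, 10k, 11k\}$\
$P_1 = \{3k, 4k, 5k\}$, and $P_2 = \{6k, 8k, 11k\}$

Here $m=9$, each value has probability $1/9$ under $Q$, and each value in $P_1$ has probability $1/3$, so $r_1=r_2=r_3=2/9$ and $r_4=\dots=r_9=-1/9$. Therefore\
$D(P_1, Q) = \frac{1}{8}\cdot\frac{2+4+6+5+4+3+2+1}{9} = \frac{3}{8} = 0.375$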
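The ordered-distance EMD formula can be evaluated directly for this example. The sketch below is one possible implementation under two assumptions: the values `3k` to `11k` are represented by the integers 3 to 11, and each listed value occurs once. The function name `ordered_emd` and the threshold `t` are hypothetical. Exact fractions are used to avoid rounding issues.

```python
from collections import Counter
from fractions import Fraction


def ordered_emd(p_values, q_values):
    """EMD between two empirical distributions over an ordered numeric domain.

    Implements D(P, Q) = 1/(m-1) * sum_i |r_1 + ... + r_i|, with r_i = p_i - q_i.
    """
    domain = sorted(set(p_values) | set(q_values))
    m = len(domain)
    p_counts, q_counts = Counter(p_values), Counter(q_values)

    running = Fraction(0)   # r_1 + ... + r_i
    total = Fraction(0)     # sum of absolute running sums
    for value in domain[:-1]:          # i = 1 .. m-1
        running += (Fraction(p_counts[value], len(p_values))
                    - Fraction(q_counts[value], len(q_values)))
        total += abs(running)
    return total / (m - 1)


Q = [3, 4, 5, 6, 7, 8, 9, 10, 11]      # whole table (values in thousands)
P1 = [3, 4, 5]
P2 = [6, 8, 11]

d1, d2 = ordered_emd(P1, Q), ordered_emd(P2, Q)
print(float(d1), float(d2))

t = Fraction(1, 5)                      # hypothetical threshold
print(d1 <= t, d2 <= t)                 # does each class satisfy t-closeness?
```

$P_2$ is spread across the range of $Q$ and ends up much closer to it than $P_1$, which is concentrated at the low end. With the hypothetical threshold $t=0.2$, only the class with distribution $P_2$ would satisfy $t-$closeness.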

---
**Other resources**
- [YouTube video on $t-$closeness](https://www.youtube.com/watch?v=Upb8jqlsbFM)

---
**References**
```{bibliography} ../../references.bib
:filter: docname in docnames
```