# Data utility

Data is collected and released because it is useful. Privacy-preserving mechanisms such as k-anonymity remove parts of the data and may therefore lower its utility. But how much utility is retained in a privacy-enhanced version of a dataset? There are two ways to measure this: general data utility metrics, or usefulness metrics of something derived from the data (e.g., the accuracy of a machine learning model trained on the dataset in question). We will look at the first category in this chapter.

## Generalization or Suppression Counting

One of the most intuitive measures of utility loss is the number of privacy operations (generalization and suppression) that have been performed, since each operation makes the data more generic and removes some utility. Consider the generalization steps on Race and Zip for the tuple $\langle Asian, 94141 \rangle \rightarrow_{[1,1]} \langle person, 9414* \rangle$. The number of operations here is 2, which can also be considered proportional to the information that was lost. If only generalization is used, this loss is also equal to the height of the node in the tuple domain hierarchy.

One problem with this measure is that it treats the information loss of all operations as equal. Generalizing Race once completely removes any information about race ($Asian\rightarrow person$), whereas doing so for Zip removes only some information (we can still guess the larger area from the remaining four digits). Even for the same attribute, repeated generalization operations can have different impacts (compare 9414* to 941**).
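The counting measure can be sketched in a few lines. This is only an illustration: the helper names `generalization_count` and `generalize_zip` are hypothetical, and masking one trailing digit per level is an assumption about how the Zip hierarchy is built.

```python
def generalization_count(levels):
    """Return the total number of generalization steps applied to a tuple.

    `levels` holds one generalization level per quasi-identifier,
    e.g. [1, 1] for one step on Race and one step on Zip.
    """
    return sum(levels)


def generalize_zip(zip_code, level):
    """Mask the last `level` digits of a zip code with '*' (assumed hierarchy)."""
    keep = len(zip_code) - level
    return zip_code[:keep] + "*" * level


print(generalization_count([1, 1]))   # one step on each of the two attributes
print(generalize_zip("94141", 1))     # first generalization step
print(generalize_zip("94141", 2))     # second generalization step
```

Both Zip steps add exactly 1 to the count, although the second step hides a larger area than the first. That is the weakness described above.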

## Average size of equivalence classes

## Loss Metric

This metric was proposed by Vijay Iyengar {cite}`iyengarLossMetric` and takes into account the number of levels and values in the generalization hierarchy of an attribute. Consider the attribute `employment type`, which has the following hierarchy of values. In a dataset, the values in the leaf nodes are specified, and intermediate node values can be used when generalizing (e.g., `Local government employee` can be replaced by `Government employee`).

```{figure} emp-tree.png
---
height: 400px
name: emp-tree
---
Example of value hierarchy of an attribute.
```
If the generalized value corresponds to node `P` in the taxonomy tree `T`, the total number of leaf nodes in `T` is $M$, and the number of leaf nodes in the subtree rooted at `P` is $M_P$, then the loss metric is $LM = \frac{M_P-1}{M-1}$. For the above example, $LM = \frac{2}{7}$. The loss for a suppressed entry is the same as the loss when the generalized value corresponds to the root of the tree. Intuitively, this loss value is proportional to the *width* of the node `P`.
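The computation can be sketched as follows. The hierarchy `TREE` below is a hypothetical one with eight leaves, chosen so that `Government employee` covers three of them; it is not necessarily the hierarchy in the figure. The function names are also illustrative.

```python
from fractions import Fraction

# Hypothetical taxonomy (parent -> children); an assumption for illustration.
TREE = {
    "Any": ["Government employee", "Private employee",
            "Self-employed", "Unemployed"],
    "Government employee": ["Federal government employee",
                            "State government employee",
                            "Local government employee"],
    "Self-employed": ["Incorporated", "Not incorporated"],
    "Unemployed": ["Without pay", "Never worked"],
}


def count_leaves(node, tree):
    """Count the leaf nodes in the subtree rooted at `node`."""
    children = tree.get(node, [])
    if not children:          # a node without children is itself a leaf
        return 1
    return sum(count_leaves(child, tree) for child in children)


def loss_metric(node, tree, root="Any"):
    """LM = (M_P - 1) / (M - 1) for a value generalized to `node`."""
    m = count_leaves(root, tree)
    m_p = count_leaves(node, tree)
    return Fraction(m_p - 1, m - 1)


print(loss_metric("Government employee", TREE))        # 2/7
print(loss_metric("Local government employee", TREE))  # 0 (not generalized)
print(loss_metric("Any", TREE))                        # 1 (same as suppression)
```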

Numeric attributes are generalized using ranges (e.g., replacing an age of 62 with the age range [55-64]). The loss metric for such a column is $LM = \frac{U_i-L_i}{U-L}$, where $U_i$ and $L_i$ are the upper and lower ends of the new interval, and $U$ and $L$ are the highest and lowest values of that column in the associated dataset.

The total loss for an attribute $A$ is the average loss of $A$ for all tuples in the table. The loss metric for the whole table is just the (possibly weighted) sum of loss metrics for each attribute.
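A minimal sketch of the numeric case and the two aggregation steps is given below. The function names, the age bounds of 17 and 90, and the per-attribute loss values are made up for illustration. Equal weights are assumed when none are supplied.

```python
def numeric_loss(lower_i, upper_i, lower, upper):
    """LM for a numeric value generalized to the interval [lower_i, upper_i]."""
    return (upper_i - lower_i) / (upper - lower)


def attribute_loss(cell_losses):
    """Average the per-tuple losses of one attribute."""
    return sum(cell_losses) / len(cell_losses)


def table_loss(attribute_losses, weights=None):
    """(Possibly weighted) sum of the per-attribute losses."""
    if weights is None:
        # Assumption: every attribute gets weight 1 by default.
        weights = {name: 1.0 for name in attribute_losses}
    return sum(weights[name] * loss for name, loss in attribute_losses.items())


# Age 62 replaced by [55-64], with hypothetical column bounds 17 and 90.
age_cell = numeric_loss(55, 64, 17, 90)

# Hypothetical per-tuple losses for two attributes of a four-row table.
age_losses = [age_cell, age_cell, 0.0, 0.0]
job_losses = [2 / 7, 2 / 7, 2 / 7, 1.0]

per_attribute = {
    "age": attribute_loss(age_losses),
    "employment type": attribute_loss(job_losses),
}
print(per_attribute)
print(table_loss(per_attribute))
```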

## Discernibility metric

Another loss metric is the discernibility metric, defined as $DM = n \cdot S + \sum_{i=1}^{NEQ} {|EQ_i|}^2$ \
where,\
$n= |T|$ \
$S=$ number of suppressed tuples\
$NEQ=$ number of equivalence classes\
$|EQ_i| =$ size of the *i*th equivalence class $EQ_i$

Thus, $DM$ assigns a penalty to each tuple based on how many other tuples in the database are indistinguishable from it (i.e., the sizes of the equivalence classes). For a database of size $n$, $DM$ assigns a penalty of $n$ for each suppressed tuple. If a tuple is not suppressed, the penalty it receives is the total number of tuples in the equivalence class it belongs to. Intuitively, overly large equivalence classes (i.e., too much generalization) increase $DM$.
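The definition translates directly into code. The sketch below assumes that $n$ counts both suppressed and unsuppressed tuples and that suppressed tuples are passed separately as a count. The function name and the sample rows are hypothetical.

```python
from collections import Counter


def discernibility(qi_rows, n_suppressed=0):
    """Compute DM = n * S + sum of squared equivalence class sizes.

    `qi_rows` holds the quasi-identifier values of the unsuppressed tuples;
    rows with identical values form one equivalence class.
    """
    n = len(qi_rows) + n_suppressed          # assumed: n includes suppressed tuples
    class_sizes = Counter(qi_rows).values()  # one size per equivalence class
    return n * n_suppressed + sum(size ** 2 for size in class_sizes)


rows = [("*", "Black"), ("*", "Black"),
        ("*", "Asian"), ("*", "Asian"),
        ("*", "White"), ("*", "White")]

print(discernibility(rows))                  # 2^2 + 2^2 + 2^2 = 12
print(discernibility(rows, n_suppressed=1))  # 7 * 1 + 12 = 19
```

Each unsuppressed tuple contributes the size of its class, so a class of size $k$ contributes $k^2$ in total.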

## Distortion

This is an information-theoretic measure and, unlike the previous ones, it takes into account the distribution of the attribute in question. It is defined as

$DIST = \frac{H(QI_{pre}) - H(QI_{post})}{\log_2{|N|}}$\
where,\
$|N|=$ number of tuples in the anonymized table\
$H(D)=$ the entropy of a distribution $D$\
$QI_{pre}=$ the distribution of the quasi-identifiers *before* anonymization\
$QI_{post}=$ the distribution of the quasi-identifiers *after* anonymization

Intuitively, it measures how much entropy was lost (i.e., loss in uniqueness of the QI values), normalized by $\log_2{|N|}$, the maximum possible entropy for a table of $|N|$ tuples.

For the following anonymization (`QI={Race, Gender}`), we have $QI_{pre}=\{\{M, Black\}, \{M, Asian\}, \{F, Asian\}, \{M, White\}, \{F, White\}, \{F, Black\}\}$ and $QI_{post}=\{\{*, Black\}, \{*, Asian\}, \{*, White\}\}$. So,\
$H(QI_{pre})=-6 \cdot \frac{1}{6} \cdot \log_2(1/6)\approx 2.585$\
$H(QI_{post})=-3 \cdot \frac{1}{3} \cdot \log_2(1/3)\approx 1.585$\
$DIST=\frac{2.585-1.585}{\log_2{6}}=\frac{1}{\log_2{6}}\approx 0.387$

```{figure} distortion.png
---
height: 250px
name: distortion
---
```
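The worked example can be reproduced with a short sketch. It assumes the table has six tuples, one per gender and race combination, which matches the uniform probabilities used in the calculation above. The function names are illustrative.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    total = len(values)
    counts = Counter(values).values()
    return -sum((c / total) * math.log2(c / total) for c in counts)


def distortion(qi_pre, qi_post):
    """DIST = (H(QI_pre) - H(QI_post)) / log2(|N|)."""
    n = len(qi_post)  # number of tuples in the anonymized table
    return (entropy(qi_pre) - entropy(qi_post)) / math.log2(n)


# Assumed table: each (Gender, Race) combination occurs exactly once.
qi_pre = [("M", "Black"), ("M", "Asian"), ("F", "Asian"),
          ("M", "White"), ("F", "White"), ("F", "Black")]

# Anonymized version: Gender is suppressed to '*'.
qi_post = [("*", race) for _, race in qi_pre]

print(round(entropy(qi_pre), 3))              # 2.585
print(round(entropy(qi_post), 3))             # 1.585
print(round(distortion(qi_pre, qi_post), 3))  # 0.387
```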