## Fairness, Accountability, Transparency and Ethics Course (FATE)
## Universitat Pompeu Fabra (UPF)
### Academic Year 2025-2026
### Author: Ashwin Singh (ashwin.singh01@estudiant.upf.edu)
*** Partially based on the original exercises made by David Solans (david.solans@upf.edu) ***

Submission date: 03/02/2025 at 23:59 on Aula Global

Please work on this notebook **individually**.

# LAB 1: Data anonymization with Python

# Getting Started

## Imports

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans
from scipy.stats import pearsonr

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import random

In [None]:
import warnings

warnings.filterwarnings(action="ignore")

## Loading Dataset
In this section we will use a synthetic dataset containing medical records of 100 users.

In [None]:
dataset_url = "https://raw.githubusercontent.com/Ashwin-19/UPF-FATE-2025/refs/heads/main/Lab%20I%20-%20Data%20Anonymization/Data/health_dataset.csv"
health_data = pd.read_csv(dataset_url)
health_data.head()

## Types of Identifiers and Attributes [ 1 pts ]



**[ 0.5 pts ]** List the identifiers and quasi-identifiers in the dataset:

**[ Answer ]**

**[ 0.5 pts ]** List the confidential attributes in the dataset.

**[ Answer ]**

## Data Filtering [ 0.5 pts ]

**[ 0.5 pts ]** Remove/Suppress the columns including personally identifiable information from the dataset.

**[ Discuss ]** Why do we remove/suppress them?

In [None]:
""" your code here """
# pii_columns = []
# health_data = health_data.drop(columns=pii_columns)
# health_data.head()

# K-Anonymity

## Checking K-Anonymity [ 2 pts ]

**[ 1.5 pts ]** Create a function which checks whether the data satisfies k-anonymity given some quasi-identifiers. The function should also return the maximum k for which the quasi-identifiers satisfy k-anonymity.

In [None]:
def is_k_anonymous(data, quasi_identifiers):
    """
    Determines whether the data is k-anonymous for the given
    quasi-identifiers, and the maximum k for which k-anonymity is satisfied.

    Args:
        data (pandas.DataFrame): microdata
        quasi_identifiers (list): quasi-identifiers in microdata

    Returns:
        bool: indicating whether the data is k-anonymous
        int: maximum k for which k-anonymity is satisfied
    """
    # Your code here
    return True, 1
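
As a rough illustration of the idea behind this check (not a reference solution), the maximum $k$ equals the size of the smallest equivalence class formed by the quasi-identifiers. The toy table, its column names, and the helper `smallest_group_size` below are hypothetical and are not part of the lab dataset.

```python
import pandas as pd

# Hypothetical toy microdata; column names are illustrative only.
toy = pd.DataFrame({
    "ZipCode": ["08001", "08001", "08002", "08002", "08002"],
    "Gender": ["F", "F", "M", "M", "M"],
    "Condition": ["flu", "asthma", "flu", "flu", "diabetes"],
})

def smallest_group_size(data, quasi_identifiers):
    """Size of the smallest equivalence class over the quasi-identifiers."""
    # groupby(...).size() counts the rows in every observed combination.
    return int(data.groupby(quasi_identifiers).size().min())

max_k = smallest_group_size(toy, ["ZipCode", "Gender"])
print(max_k)
```

Note that `groupby` drops rows with missing keys by default, so a dataset with missing quasi-identifier values may need `dropna=False` or explicit handling.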

**[ 0.25 pts ]** Is the data k-anonymous for BirthDate and ZipCode? If so, what is the maximum k for which k-anonymity is satisfied?

In [None]:
# Your code here

**[ 0.25 pts ]** Is the data k-anonymous for Gender and ZipCode? If so, what is the maximum k for which k-anonymity is satisfied?

In [None]:
# Your code here

## Testing K-Anonymity [ 1 pts ]

To test our function, we will use the example dataset from the Wikipedia [K-Anonymity article](https://en.wikipedia.org/wiki/K-anonymity), which satisfies 2-anonymity for any combination of the attributes ``Age``, ``Gender`` and ``State of domicile``.

In [None]:
wiki_data = pd.read_html("https://en.wikipedia.org/wiki/K-anonymity",header=0)[1]
wiki_data

**[ 1 pts ]** Test k-anonymity for the three attributes together and for their different combinations, picking two attributes at a time. For each combination, find the maximum k satisfying k-anonymity.

In [None]:
## Your code here
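
One possible way to enumerate the attribute combinations is `itertools.combinations` from the standard library. The sketch below only builds the lists of column names; each list could then be passed to a k-anonymity check. The variable names are illustrative.

```python
from itertools import combinations

attributes = ["Age", "Gender", "State of domicile"]

# All ways of picking two attributes at a time, in input order.
pairs = [list(pair) for pair in combinations(attributes, 2)]

# The full set of three attributes plus every pair.
candidate_sets = [attributes] + pairs

for columns in candidate_sets:
    print(columns)
```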

## Properties of K-Anonymity [ 1 pts ]
> You may use examples from the previous exercises.

**[ 0.5 pts ]** If a dataset is $k_1$-anonymous for a set of attributes, and $k_2$-anonymous for any subset of those attributes, what is the relationship between $k_1$ and $k_2$? Can you show an example?

**[ Answer ]**

**[ 0.5 pts ]** If a dataset is $k_1$-anonymous given an attribute $a_1$, and $k_2$-anonymous given an attribute $a_2$, is the dataset always $k$-anonymous given attributes $\{a_1,a_2\}$ ? If not, show an example, and find the maximum $k$ for which the dataset can be $k$-anonymous.

**[ Answer ]**

## Methods for K-Anonymity [ 5 pts ]

### GLOBAL Recoding (Generalization)

**[ 0.25 pts ]** Create a function which applies **global recoding** to an attribute in the data up to $n$ digits.

In [None]:
def global_recoding(data, attribute, n):
    """
    applies global recoding to an attribute in the data for n digits

    Args:
        data (pandas.DataFrame): data for anonymization
        attribute (str): attribute to be recoded
        n (int): number of digits to be recoded

    Returns:
        pandas.DataFrame: globally recoded data
    """
    # your code here
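
A minimal sketch of one way to generalize a value, assuming that "recoding $n$ digits" means masking the last $n$ characters with `*`. This interpretation, the helper name `mask_last_digits`, and the sample zip codes are assumptions made for illustration only.

```python
import pandas as pd

def mask_last_digits(values, n):
    """Replace the last n characters of each value with '*'."""
    as_text = values.astype(str)
    if n <= 0:
        # Nothing to mask; return the values as strings.
        return as_text
    # Keep everything except the last n characters, then pad with '*'.
    return as_text.str[:-n] + "*" * n

zips = pd.Series(["08001", "08002", "08015"])
print(mask_last_digits(zips, 1).tolist())
print(mask_last_digits(zips, 2).tolist())
```

Masking more characters merges more records into the same equivalence class, which raises $k$ at the cost of coarser information.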

**[ 0.75 pts ]** Use the function to list all possible ways of 3-anonymizing the dataset given attributes ``BirthDate`` and ``ZipCode``. What is the best way of 3-anonymizing the dataset if we want to preserve maximum information about the health conditions across different zip codes?

**[ Answer ]**

In [None]:
# your code here

**[ 1 pts ]** For the best way of $3$-anonymizing the dataset from the previous question, comment on the sufficiency of $k$-anonymity against **homogeneity** and **internal** attacks aimed at inferring the health condition of patients. If you find $k$-anonymity to be insufficient, you must show the quasi-identifier equivalence classes (groups) susceptible to the attacks.

In [None]:
# your code here

In [None]:
# your code here

### Microaggregation

**[ 2 pts ]** Create a function which microaggregates continuous attributes (``height, weight``) to achieve $k$-anonymity, while minimizing the loss in information across both attributes. Return the best-possible $k$-anonymity and corresponding loss.

**Recommended Approach**: Experiment with a clustering algorithm for $n \leq \frac{|Data|}{k}$ clusters to achieve $k$-anonymity, and use sum of squared errors between original and anonymized data for loss of information.

In [None]:
def microaggregation(data,attributes,k):
    """
    applies microaggregation to attributes in the data
    to achieve k'-anonymity, k' >= k

    Args:
        data (pandas.DataFrame): data for anonymization
        attributes (list): attributes to be microaggregated
        k (int): lower bound of k-anonymity to achieve

    Returns:
        tuple:
            float: loss of information to achieve k' anonymity
            int: k' corresponding to k'-anonymity achieved
    """
    # your code here
    loss, observed_k = 0.0, k  # placeholders; replace with your computed values
    return loss, observed_k
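
To make the notions of microaggregation and information loss concrete, here is a deliberately simpler stand-in for the recommended clustering approach: fixed-size microaggregation of a single attribute by sorting. It is not the clustering method suggested above. The helper name and the sample heights are hypothetical, and the sketch assumes there are at least `k` records.

```python
import numpy as np
import pandas as pd

def fixed_size_microaggregate(values, k):
    """Sort values, group consecutive runs of k, replace each by its group mean."""
    values = pd.Series(values, dtype=float).reset_index(drop=True)
    order = np.argsort(values.to_numpy(), kind="stable")
    n_groups = max(len(values) // k, 1)
    # Group id for each rank; leftover records join the last group,
    # so every group holds at least k records.
    group_of_rank = np.minimum(np.arange(len(values)) // k, n_groups - 1)
    groups = np.empty(len(values), dtype=int)
    groups[order] = group_of_rank
    aggregated = values.groupby(groups).transform("mean")
    # Information loss as the sum of squared errors.
    sse = float(((values - aggregated) ** 2).sum())
    return aggregated, sse

heights = [150, 152, 160, 162, 170]
aggregated, sse = fixed_size_microaggregate(heights, k=2)
print(aggregated.tolist(), sse)
```

With several attributes, a clustering algorithm takes the place of the sort, and the smallest cluster size gives the $k$ actually achieved.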

**[ 1 pts ]** Plot the relationship between the $k$-anonymity achieved and the loss of information for $k \in \{2, 3, 4, \dots, 10\}$. What do you observe?

In [None]:
# your code here

# L-Diversity

## Checking L-Diversity [ 1.5 pts ]

**[ 0.5 pts ]** Create a function which checks whether a dataset is l-diverse given a sensitive attribute and quasi-identifiers. If so, return the maximum l for which the dataset satisfies l-diversity.

In [None]:
def is_l_diverse(data, sensitive, quasi_identifiers):
    """
    checks whether the data is l-diverse for a sensitive
    attribute given quasi-identifiers, and returns the
    maximum l for which the data satisfies l-diversity

    Args:
        data (pandas.DataFrame): data to check l-diversity for
        sensitive (str): sensitive attribute for l-diversity
        quasi_identifiers (list): list of quasi-identifiers

    Returns:
        tuple:
            bool: indicating whether the data is l-diverse
            int: maximum l for which the data satisfies l-diversity
    """
    # your code here
    l = 1  # placeholder; replace with the computed maximum l
    return True, l
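
As a sketch of the underlying idea, assuming the *distinct* variant of l-diversity (other variants, such as entropy l-diversity, exist), the maximum $l$ is the smallest number of distinct sensitive values found in any equivalence class. The toy table and the helper `distinct_l` are hypothetical.

```python
import pandas as pd

# Hypothetical, already generalized toy data.
toy = pd.DataFrame({
    "ZipCode": ["080**"] * 3 + ["081**"] * 3,
    "Condition": ["flu", "asthma", "flu", "flu", "flu", "flu"],
})

def distinct_l(data, sensitive, quasi_identifiers):
    """Smallest number of distinct sensitive values in any equivalence class."""
    return int(data.groupby(quasi_identifiers)[sensitive].nunique().min())

max_l = distinct_l(toy, "Condition", ["ZipCode"])
print(max_l)
```

In this toy table, the `081**` class contains a single condition, so the data is 3-anonymous but only 1-diverse, which is exactly the situation a homogeneity attack exploits.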

> Answer the following questions for both the original dataset and the globally recoded dataset.

**[ 0.5 pts ]** Is the dataset l-diverse given ``Health Condition`` as sensitive attribute and ``ZipCode, BirthDate`` as quasi-identifiers?

In [None]:
# your code here

In [None]:
# your code here

**[ 0.5 pts ]** Is the dataset l-diverse given ``Gender`` as sensitive attribute and ``ZipCode, BirthDate`` as quasi-identifiers?

In [None]:
# your code here

In [None]:
# your code here

## L-Diversity via Local Suppression [ 2 pts ]

**[ 1 pts ]** Create a function to apply local suppression to groups within a $k$-anonymized dataset which violate $l$-diversity given a sensitive attribute, two quasi-identifiers, and $l$. If this is not feasible, raise an error.


In [None]:
def local_suppression(data,sensitive,quasi_identifiers,l):
    """
    locally suppresses quasi-identifier equivalence classes
    in anonymized data which violate l-diversity so that the
    overall data satisfies l-diversity

    Args:
        data (pandas.DataFrame): data for local suppression
        sensitive (str): sensitive attribute for l-diversity
        quasi_identifiers (list): quasi-identifiers to suppress
        l (int): l-diversity constraint to satisfy

    Returns:
        pandas.DataFrame: locally suppressed data satisfying l-diversity
    """
    # your code here
    return data
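
One possible shape for such a function, shown on toy data: replace the quasi-identifier values of every violating group with a suppression token, then confirm that the merged suppressed group is itself diverse enough. The token `*`, the helper name, the feasibility rule, and the toy table are assumptions made for illustration, not the expected solution.

```python
import pandas as pd

def suppress_low_diversity_groups(data, sensitive, quasi_identifiers, l, token="*"):
    """Mask quasi-identifiers of groups with fewer than l distinct sensitive values."""
    out = data.astype({column: object for column in quasi_identifiers})
    # Number of distinct sensitive values in each row's equivalence class.
    diversity = out.groupby(quasi_identifiers)[sensitive].transform("nunique")
    violating = diversity < l
    out.loc[violating, quasi_identifiers] = token
    # All suppressed rows now form one group; it must be l-diverse as well.
    if violating.any() and out.loc[violating, sensitive].nunique() < l:
        raise ValueError("suppressed rows cannot form an l-diverse group")
    return out

toy = pd.DataFrame({
    "ZipCode": ["080**", "080**", "081**", "081**", "082**", "082**"],
    "BirthDate": ["198*"] * 6,
    "Condition": ["flu", "asthma", "flu", "flu", "diabetes", "diabetes"],
})
suppressed = suppress_low_diversity_groups(toy, "Condition", ["ZipCode", "BirthDate"], 2)
print(suppressed)
```

Here the `081**` and `082**` groups each hold a single condition, so both are suppressed; together they contain two distinct conditions, which makes the suppressed group 2-diverse.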

**[ 0.5 pts ]** Using the function, make the globally recoded dataset $2$-diverse for the sensitive attribute ``Health Condition`` given the quasi-identifiers ``ZipCode``, ``BirthDate``, and show the suppressed groups.

In [None]:
# your code here

In [None]:
# your code here

**[ 0.5 pts ]** Using the function, make the globally recoded dataset $3$-diverse for the sensitive attribute ``Gender`` given the quasi-identifiers ``ZipCode``, ``BirthDate``, and show the suppressed groups.

In [None]:
# your code here

In [None]:
# your code here

## Sufficiency of $k$-anonymity and $l$-diversity [ 1 pts ]

**[ 1 pts ]** Describe one scenario where $k$-anonymity and $l$-diversity would be sufficient for guaranteeing the anonymity of a dataset, and one scenario where they would be insufficient. You may use tables to illustrate your answer.

**[ Answer ]**