In [None]:
# -*- coding: utf-8 -*-
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Data Anonymization

## Data
Healthcare cardiovascular dataset (original): [https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset/data](https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset/data)


In [None]:
import pandas as pd
import numpy as np

# Load the dataset
file_path = 'cardio_train_m.csv'
data = pd.read_csv(file_path, delimiter=';')

# Display the first few rows of the dataframe
data.head()


## Exploratory Data Analysis (EDA) in Privacy-Preserving Machine Learning (PPML)

Given the dataset description, there are several points to consider when it comes to PPML. The dataset contains a mix of identifiable personal information (like Full Name, Email, Address, Social Security number), objective features related to the individual's physical characteristics and health parameters, and subjective features that cover lifestyle choices.

Before we proceed with any PPML techniques, we must categorize the data into identifiable, sensitive, and non-sensitive information. This categorization is crucial for determining the appropriate privacy-preserving methods.

### Identifiable Information

- **Id**
- **Full Name**
- **Email**
- **Address**
- **Social Security number (SSN)**

These are direct identifiers and should be handled with the utmost care. This information can be used to directly trace back the data to the individuals. In most cases, this type of information should be removed or encrypted before the dataset is used for data mining purposes.

### Sensitive Information

- **Health data**: Age, Height, Weight, Systolic blood pressure, Diastolic blood pressure, Cholesterol, Glucose
- **Lifestyle data**: Smoking, Alcohol intake, Physical activity
- **Medical Outcome**: Presence or absence of cardiovascular disease

This information, while not directly identifying, is still sensitive. It can potentially be used to infer identities, especially when combined with other data. For PPML, this data might need to be anonymized or perturbed.

### Non-sensitive Information

- **Gender**

Gender may not be sensitive on its own, but in combination with other data, it can contribute to re-identification risk.

### Steps for PPML on this dataset:

1. **Anonymization by data transformation**: Remove or encrypt direct identifiers. For example, replace 'Full Name' with a pseudonym or random ID.

2. **K-Anonymity**: Apply k-anonymity to the remaining quasi-identifiers like Age, Height, Weight, etc. This could mean generalizing ages to age ranges or height to height ranges.

3. **L-Diversity**: Ensure that the anonymized data has sufficient diversity in the sensitive attributes. For instance, make sure that for every combination of age range and weight range, there are multiple records with different health conditions.

4. **T-Closeness**: Maintain the distribution of sensitive attributes like Cholesterol and Glucose levels within each anonymized group close to the overall distribution to prevent skewness.

5. **Data Perturbation**: Add noise to the data, especially for the numerical health-related features like blood pressure readings, to obscure the precise values but maintain the statistical distribution.

6. **Minimize Data**: Only retain data necessary for the analysis. If the goal is to study cardiovascular disease patterns, data like 'Address' can be excluded entirely.

7. **Access Control**: Ensure that only authorized personnel can access the de-identified dataset and that further use of the data is regulated. This does not apply here because we are working in a study environment, but in real deployments it can be crucial.

In practice, you would need to balance data utility against privacy. The more you perturb the data to protect privacy, the less accurate your data mining results might be. It is crucial to find a point where the data is still useful for analysis without compromising individual privacy.
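The trade-off described above can be illustrated with a small sketch. It uses synthetic data rather than the notebook's dataset, and the variables `x` and `y`, the noise levels, and the seed are all assumptions chosen for illustration. It shows how the correlation between two related measurements weakens as more Gaussian noise is added to one of them.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic stand-ins for two related health measurements (not the real dataset)
n = 10_000
x = rng.normal(130, 15, size=n)          # a blood-pressure-like variable
y = 0.6 * x + rng.normal(0, 5, size=n)   # a second variable correlated with x

def add_noise(values, noise_level, rng):
    """Add zero-mean Gaussian noise scaled by a fraction of the standard deviation."""
    return values + rng.normal(0, noise_level * values.std(), size=len(values))

# Measure how much of the x-y relationship survives each noise level
correlations = {}
for noise_level in (0.01, 0.5, 2.0):
    x_noisy = add_noise(x, noise_level, rng)
    correlations[noise_level] = np.corrcoef(x_noisy, y)[0, 1]

for level, corr in correlations.items():
    print(f"noise_level={level}: corr(x_noisy, y) = {corr:.3f}")
```

The larger the noise level, the weaker the measured relationship, which is the utility cost paid for obscuring individual values.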

# 1. Anonymization by data transformation

**Remove Direct Identifiers**: We'll drop the `id`, `name`, `email` and `ssn` columns, as these contain directly identifiable information.

**Data Tokenization**: Since we've removed almost all direct identifiers, we'll simulate tokenization by replacing `address` values with tokens. We will drop it later as we still don't need this field for cardiovascular analysis.

**Data Generalization**: For attributes like `age` and `height`, we'll replace exact values with ranges.

**Data Perturbation**: We'll add slight random noise to `weight`, `ap_hi`, and `ap_lo` to perturb these continuous variables.

**Data Swapping**: We'll swap the values of `cholesterol` and `gluc` between records.

**Noise Addition**: We'll add further Gaussian noise to `ap_hi` and `ap_lo`.

**Data Masking**: We'll replace the binary attributes `smoke`, `alco`, and `active` with masked categories.
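One caveat on tokenization: an unkeyed hash of a guessable value (such as an address) can be reversed by hashing candidate values and comparing the results. A possible hardening, sketched here as an assumption rather than as the notebook's method, is a keyed hash (HMAC) whose secret key is stored separately from the data. The names `SECRET_KEY` and `keyed_token` are hypothetical.

```python
import hashlib
import hmac
import secrets

# Hypothetical secret key; in practice it would be stored apart from the data
SECRET_KEY = secrets.token_bytes(32)

def keyed_token(value, key=SECRET_KEY):
    """Return a hex token for `value` using HMAC-SHA256 with a secret key."""
    return hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()

# The same input always maps to the same token under one key,
# while different inputs map to different tokens
token_a = keyed_token("12 Example Street")
token_b = keyed_token("12 Example Street")
token_c = keyed_token("13 Example Street")
print(token_a == token_b, token_a == token_c)
```

Without the key, an attacker cannot rebuild the token for a candidate address, so the dictionary-style lookup no longer works.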

In [None]:
# Remove Direct Identifiers
data_anonymized = data.drop(columns=['id', 'name', 'email', 'ssn'])
data_anonymized.head()

In [None]:
# Data Tokenization
import hashlib

# Simulate data tokenization for the 'address' variable
def generate_token(value):
    # A simple tokenization using a hash function (not reversible in this case)
    return hashlib.sha256(str(value).encode()).hexdigest()

# Apply tokenization
data_anonymized['address'] = data_anonymized['address'].apply(generate_token)
data_anonymized.head()

In [None]:
# Drop address, we no longer need it

data_anonymized = data_anonymized.drop(columns=['address'])
data_anonymized.head()

In [None]:
# Data Generalization or data aggregation = grouping data
# Example: Convert age from days to years

data_anonymized['age'] = (data_anonymized['age'] / 365).astype(int)
data_anonymized.head()

In [None]:
# Create age ranges

data_anonymized['age'] = pd.cut(data_anonymized['age'], bins=range(0, 110, 10), right=False)
data_anonymized.head()

In [None]:
# Create height ranges

data_anonymized['height'] = pd.cut(data_anonymized['height'], bins=range(100, 250, 10), right=False)
data_anonymized.head()

In [None]:
# Data Perturbation

# Define a function to add noise based on a percentage of the standard deviation
def add_noise(series, noise_level):
    return series + np.random.normal(0, noise_level * series.std(), size=len(series))

# Apply the function to perturb the weight, ap_hi, and ap_lo
data_anonymized['weight'] = add_noise(data_anonymized['weight'], 0.01)  # 1% noise
data_anonymized['ap_hi'] = add_noise(data_anonymized['ap_hi'], 0.01)
data_anonymized['ap_lo'] = add_noise(data_anonymized['ap_lo'], 0.01)
data_anonymized.head()

In [None]:
# Data Swapping

# Example: Randomly swap 'cholesterol' and 'gluc' between records
cholesterol_indices = np.random.permutation(data_anonymized.index)
gluc_indices = np.random.permutation(data_anonymized.index)

data_anonymized['cholesterol'] = data_anonymized['cholesterol'].iloc[cholesterol_indices].values
data_anonymized['gluc'] = data_anonymized['gluc'].iloc[gluc_indices].values
data_anonymized.head()

In [None]:
# Further Noise Addition (if needed)

# We could add more noise to the 'ap_hi' and 'ap_lo' if we decide it's necessary. 
# But remember, the accuracy of predictions will also fall.

data_anonymized['ap_hi'] = add_noise(data_anonymized['ap_hi'], 0.05)  # 5% noise
data_anonymized['ap_lo'] = add_noise(data_anonymized['ap_lo'], 0.05)  # 5% noise
data_anonymized.head()

In [None]:
# Data Masking

# Simple example of masking
masking_tokens = {0: 'A', 1: 'B'}  

# For binary attributes like 'smoke', 'alco', and 'active', we replace the actual values with assigned tokens
data_anonymized['smoke'] = data_anonymized['smoke'].map(masking_tokens)
data_anonymized['alco'] = data_anonymized['alco'].map(masking_tokens)
data_anonymized['active'] = data_anonymized['active'].map(masking_tokens)
data_anonymized.head()

# 2. K-anonymity

**K-anonymity** is a privacy-preserving technique used to protect the identity of individuals in a dataset. A dataset is k-anonymous if each person's record cannot be distinguished from those of at least **k−1** other individuals in the dataset on the basis of the quasi-identifier values.


### K-anonymity limitations:

If all the individuals in a k-anonymous set have the same sensitive value, then the sensitive value for the set is known. That is called a **homogeneity attack**.

Imagine giving everyone in a neighborhood the same house color and car to make them less recognizable. Now, if all these look-alike folks also have the same health issue, for example, everyone has a sunburn, then if you know someone lives in that neighborhood, you'd guess they probably have a sunburn too. That's a homogeneity attack: when all the hidden data is too similar, it's easy to guess private stuff about someone if you know they're part of that group.

K-anonymity hides each person's data in a group of at least 'k' people, which protects against re-identification. But if all 'k' people share the same sensitive information, it does not help much. That's where **l-diversity** comes in: a stricter model that requires a mix of different sensitive values within each group. So, even if you know someone is in a group, you can't be sure of their specific issue because the group contains a variety of values.
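To make the homogeneity attack concrete, here is a minimal sketch on a made-up table. The column names `age_range`, `height_range`, and `condition` and all values are assumptions for illustration. Both groups satisfy k-anonymity for k=3, yet one of them reveals its sensitive value because every member shares it.

```python
import pandas as pd

# Toy table: 'age_range' and 'height_range' are quasi-identifiers,
# 'condition' is the sensitive attribute (all values are made up)
toy = pd.DataFrame({
    "age_range":    ["40-49", "40-49", "40-49", "50-59", "50-59", "50-59"],
    "height_range": ["160-169"] * 6,
    "condition":    ["flu", "asthma", "flu", "sunburn", "sunburn", "sunburn"],
})

quasi_identifiers = ["age_range", "height_range"]
groups = toy.groupby(quasi_identifiers)["condition"]

# One row per equivalence class: its size and its number of distinct sensitive values
summary = pd.DataFrame({
    "size": groups.size(),
    "distinct_values": groups.nunique(),
}).reset_index()

k = 3
summary["k_anonymous"] = summary["size"] >= k
summary["homogeneous"] = summary["distinct_values"] == 1
print(summary)
```

The `50-59` group passes the size check but is flagged as homogeneous: knowing that someone belongs to it is enough to learn their condition.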

In [None]:
# Define k for k-anonymity check
k = 5

# Check for k-anonymity in age and height ranges
age_k_anonymity = data_anonymized['age'].value_counts()
height_k_anonymity = data_anonymized['height'].value_counts()

# Check if all groups have at least k records
age_k_anonymity_check = all(age_k_anonymity >= k)
height_k_anonymity_check = all(height_k_anonymity >= k)

# Output the results of the k-anonymity check
age_k_anonymity_check, height_k_anonymity_check, age_k_anonymity, height_k_anonymity

The check for K-anonymity shows that our dataset does not satisfy K-anonymity for k=5 for either the age or the height ranges. There are groups, particularly at the extremes of the age and height ranges, that have fewer than 5 records.

To achieve K-anonymity, we need to generalize these quasi-identifiers further. For example, we can combine less populous age and height groups with neighboring ones to ensure that each group has at least 5 records.

Let's apply further generalization to ensure K-anonymity:

- For the 'age' attribute, we'll combine the underrepresented ranges with the nearest populated range.
- For 'height', we'll do the same by combining extreme ranges into broader categories.

In [None]:
# Correcting the approach to generalize 'age' and 'height' to satisfy k-anonymity

# Function to extract the lower bound of the interval for re-binning
def get_lower_bound(interval):
    if pd.isnull(interval):
        return np.nan
    return interval.left

# Apply the function to 'age' and 'height' to get the lower bound
data_anonymized['age'] = data_anonymized['age'].apply(get_lower_bound)
data_anonymized['height'] = data_anonymized['height'].apply(get_lower_bound)

# Re-bin the 'age' and 'height' columns with new bins to satisfy k-anonymity
age_bins = [0, 50, 60, 70, 120]  # Adjusted to combine underrepresented age groups
height_bins = [0, 150, 170, 190, 250]  # Adjusted to combine underrepresented height groups

data_anonymized['age'] = pd.cut(data_anonymized['age'], bins=age_bins, right=False)
data_anonymized['height'] = pd.cut(data_anonymized['height'], bins=height_bins, right=False)

# Re-check the k-anonymity after re-grouping
age_k_anonymity = data_anonymized['age'].value_counts()
height_k_anonymity = data_anonymized['height'].value_counts()

# Check if all groups have at least k records after re-binning
age_k_anonymity_check = all(age_k_anonymity >= k)
height_k_anonymity_check = all(height_k_anonymity >= k)

age_k_anonymity_check, height_k_anonymity_check, age_k_anonymity, height_k_anonymity

Now, after re-binning:

- Height satisfies K-anonymity for k=5, as all height groups have at least 5 records.
- Age still fails the check for k=5. The cause is the upper age group [70,120), which has no records at all: our age data does not extend into this range, and `value_counts()` on a categorical column reports empty bins with a count of 0. An empty bin contains no individuals, so it poses no re-identification risk by itself, but it does make the `all(... >= k)` check fail.

To make the check pass, we'll merge the empty top range into the one below it so that every age bin is populated with at least k records. Let's adjust the age bins accordingly.
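The effect of empty bins on the check can be reproduced in isolation. The sketch below uses made-up ages and k=2 (both assumptions) and shows an alternative to re-binning: restricting the check to bins that actually contain records.

```python
import pandas as pd

ages = pd.Series([45, 52, 55, 61, 63, 48])   # made-up ages, none at or above 70
binned = pd.cut(ages, bins=[0, 50, 60, 70, 120], right=False)

# value_counts() on a categorical column lists every bin, including empty ones
counts = binned.value_counts()
print(counts)

# Keep only the bins that contain at least one record
observed = counts[counts > 0]

k = 2
print((counts >= k).all())     # the empty [70, 120) bin makes this check fail
print((observed >= k).all())   # the populated bins all hold at least k records
```

Whether to filter or to re-bin is a design choice: filtering leaves the published ranges unchanged, while re-binning also removes the unused range from the output.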

In [None]:
# Extract the lower bound of each age interval again (the exact ages are no longer available)
data_anonymized['age'] = data_anonymized['age'].apply(lambda x: x.left if pd.notnull(x) else x)

# Adjust the age bins again to ensure k-anonymity
# Combining the [60,70) and [70,120) age groups

age_bins = [0, 50, 60, 120]  
data_anonymized['age'] = pd.cut(data_anonymized['age'], bins=age_bins, right=False)

# Re-check the k-anonymity for the 'age' attribute after re-binning
age_k_anonymity = data_anonymized['age'].value_counts()
age_k_anonymity_check = all(age_k_anonymity >= k)

age_k_anonymity_check, age_k_anonymity

Now k-anonymity seems to hold for each column separately. Let's check the combination of both columns.

In [None]:
# Calculate the frequency of each combination of 'age' and 'height'

# Extract the lower bound of the intervals to represent each record for re-binning if necessary
data_anonymized['age_lower_bound'] = data_anonymized['age'].apply(lambda x: x.left)
data_anonymized['height_lower_bound'] = data_anonymized['height'].apply(lambda x: x.left)

# Group by the 'age' and 'height' ranges and calculate the counts
k_anonymity_counts = data_anonymized.groupby(['age_lower_bound', 'height_lower_bound']).size().reset_index(name='counts')

# Keep only the groups that do not meet k-anonymity (fewer than k records)
violations = k_anonymity_counts[k_anonymity_counts['counts'] < k]

violations

All combinations of 'age' and 'height' in the dataset now meet the k-anonymity criterion with **k=5**, as the dataframe showing the violations is empty. This means that for every combination of these quasi-identifiers, there are at least **five** records in the dataset, and we do not need to generalize these attributes further.

### Evaluation of K-Anonymity

To evaluate the k-anonymity of the dataset, we verify that no group of records has fewer than **k** members. Since our violations dataframe is empty, we can confirm that our dataset is k-anonymous with respect to 'age' and 'height'.

An additional evaluation metric could be the diversity of the sensitive attributes within each group defined by 'age' and 'height'. However, since we were focusing on these two attributes specifically for k-anonymity, and they are already compliant, we can conclude that the dataset satisfies k-anonymity for **k=5**.

# 3. L-diversity

**L-diversity** is a model that ***extends k-anonymity*** by also requiring diversity in the sensitive attributes within each group of k-anonymized records. It requires that each equivalence class (a group of records that are indistinguishable from each other with respect to the quasi-identifiers) has at least $l$ "well-represented" values for each sensitive attribute.

In the context of the dataset we are working with, sensitive attributes could include features such as 'weight', 'ap_hi', 'ap_lo', 'cholesterol', and 'gluc'. However, since 'weight', 'ap_hi', and 'ap_lo' have already been perturbed, focusing on 'cholesterol' and 'gluc' might be more relevant for L-diversity.

**Here's how we'll implement L-diversity for the 'cholesterol' and 'gluc' attributes:**

- For each combination of 'age' and 'height', we'll check the number of unique values of 'cholesterol' and 'gluc'.
- If any group does not have at least 3 unique values for both 'cholesterol' and 'gluc', it violates L-diversity with **l=3**.
- For any groups that violate L-diversity, we'll need to further generalize the quasi-identifiers or suppress records until the criterion is met.
- We'll evaluate the L-diversity of the dataset after applying any necessary transformations.

In [None]:
# Check the L-diversity for 'cholesterol' and 'gluc' within each k-anonymized group
# Create a function to check if a group has l-diversity

def check_l_diversity(group, column, l=3):
    return group[column].nunique() >= l

# Extraction of lower bounds for 'age' and 'height' intervals
data_anonymized['age_lower_bound'] = data_anonymized['age'].apply(lambda x: x.left if pd.notnull(x) else x)
data_anonymized['height_lower_bound'] = data_anonymized['height'].apply(lambda x: x.left if pd.notnull(x) else x)

# Now let's check for l-diversity again
l_diversity_violations = []

for name, group in data_anonymized.groupby(['age_lower_bound', 'height_lower_bound']):
    cholesterol_diverse = check_l_diversity(group, 'cholesterol', l=3)
    gluc_diverse = check_l_diversity(group, 'gluc', l=3)
    if not cholesterol_diverse or not gluc_diverse:
        l_diversity_violations.append((name, cholesterol_diverse, gluc_diverse))

# Create a DataFrame for violations for easier reading
l_diversity_violations_df = pd.DataFrame(l_diversity_violations, columns=['Group', 'Cholesterol_Diverse', 'Gluc_Diverse'])

l_diversity_violations_df

The output indicates that there are no groups that violate the l-diversity criterion for 'cholesterol' and 'gluc'. Each group of records, defined by the quasi-identifiers 'age' and 'height', has at least three distinct values for both of these sensitive attributes.

This confirms that the dataset complies with l-diversity with l=3, which means that the dataset is suitably anonymized with respect to these attributes within the context of the l-diversity privacy model.


### Evaluation of l-diversity
To evaluate l-diversity, we can use some metrics that will help us understand the level of diversity within each group of quasi-identifiers. Here are evaluation metrics we could consider:

- **The average l-diversity of the dataset**: This is the average number of unique sensitive attribute values per group. A higher average indicates better privacy.

- **The minimum l-diversity of the dataset**: This is the minimum number of unique sensitive attribute values in any group. For strict l-diversity compliance, this should be at least $l$.

- **The entropy of the sensitive attributes in each group**: Entropy measures the uncertainty or randomness of the information. Higher entropy is usually better for privacy because it indicates a higher degree of randomness in the sensitive attribute values. It is calculated using the following formula:

$$
Entropy = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)
$$

Where:

- $n$ is the number of unique sensitive attribute values in the group.
- $p(x_i)$ is the proportion of the group that has the $i$-th sensitive attribute value.
- $\log_2$ is the logarithm base 2, which is used to measure the entropy in bits.

In l-diversity:

- A higher entropy value for a group indicates a more diverse and thus a more "private" distribution of sensitive values, as it would be harder for an attacker to predict the value of the sensitive attribute for any individual within the group.
- The goal is to keep the entropy above a threshold for every group defined by the combination of quasi-identifiers. Under entropy l-diversity, that threshold is $\log_2 l$: a group's entropy must be at least that of $l$ equally likely values.
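As a worked example of the formula, the sketch below computes the entropy in bits of two made-up group distributions and compares each with the entropy l-diversity bound $\log_2 l$. The helper name `entropy_bits` and the proportions are assumptions for illustration.

```python
import math

def entropy_bits(proportions):
    """Shannon entropy in bits of a discrete distribution given as proportions."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

group_a = [0.5, 0.25, 0.25]    # three values, fairly mixed
group_b = [0.98, 0.01, 0.01]   # three values, but one dominates

l = 3
threshold = math.log2(l)       # entropy l-diversity bound: entropy >= log2(l)

for name, dist in (("group_a", group_a), ("group_b", group_b)):
    h = entropy_bits(dist)
    print(f"{name}: entropy = {h:.3f} bits, "
          f"meets log2({l}) = {threshold:.3f}? {h >= threshold}")
```

Both groups contain three distinct values, so both pass a check that only counts unique values, yet neither reaches $\log_2 3$ bits. The entropy criterion is therefore stricter than counting distinct values, and the gap is largest when one value dominates a group.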


**Let's calculate these metrics for our dataset. We'll compute:**

- The average number of unique values for 'cholesterol' and 'gluc' across all groups.
- The minimum number of unique values for 'cholesterol' and 'gluc' across all groups.
- The entropy of 'cholesterol' and 'gluc' values in each group.


In [None]:
from scipy.stats import entropy

# Calculate the average l-diversity for 'cholesterol' and 'gluc'
unique_values_per_group = data_anonymized.groupby(['age_lower_bound', 'height_lower_bound'])[['cholesterol', 'gluc']].nunique()
average_l_diversity = unique_values_per_group.mean()
min_l_diversity = unique_values_per_group.min()

# Calculate entropy for 'cholesterol' and 'gluc' in each group
def calculate_entropy(group, column):
    
    # Count the frequency of each value
    value_counts = group[column].value_counts()
    
    # Normalize to get probabilities
    probabilities = value_counts / value_counts.sum()
    
    # Calculate entropy in bits (base 2), matching the formula above
    return entropy(probabilities, base=2)

# Apply entropy calculation for each group
entropy_per_group = data_anonymized.groupby(['age_lower_bound', 'height_lower_bound']).apply(lambda g: pd.Series({
    'Cholesterol_Entropy': calculate_entropy(g, 'cholesterol'),
    'Gluc_Entropy': calculate_entropy(g, 'gluc')
}))

# Show the results
(average_l_diversity, min_l_diversity, entropy_per_group)

The results are as follows:

- **Average l-diversity:**
  - For 'cholesterol': 3.0 unique values on average per group.
  - For 'gluc': 3.0 unique values on average per group.

  Since the average number of unique values for both 'cholesterol' and 'gluc' is equal to 3, it meets our l-diversity criterion of l=3 on average across the dataset.

- **Minimum l-diversity:**
  - For 'cholesterol': 3 unique values in the least diverse group.
  - For 'gluc': 3 unique values in the least diverse group.

  The minimum number of unique values for both 'cholesterol' and 'gluc' also meets our l-diversity criterion of l=3, which indicates that every group has at least three different values for both sensitive attributes.

- **Entropy of sensitive attributes:**
  - The entropy values for 'cholesterol' and 'gluc' vary across the groups, with some groups having higher entropy than others. Higher entropy values suggest a higher level of unpredictability or diversity within the group, which is desirable for privacy.

# 4. T-closeness

**T-closeness** is a privacy model that requires the distribution of a sensitive attribute in any group of records to be close to the distribution of the attribute in the overall dataset. "Close" means that the distance between the two distributions, under a chosen distance measure, does not exceed a threshold $t$.

**Here are the steps we'll take to implement t-closeness:**

- Calculate the overall distribution of the sensitive attributes 'cholesterol' and 'gluc'.
- Calculate the distribution of 'cholesterol' and 'gluc' within each group.
- Measure the distance between the two distributions for each group.
- Determine if the distance is within a threshold $t$, which is typically set based on domain knowledge or requirements.

We will need to decide on a distance measure to use. A common choice is the Earth Mover's Distance (also known as the [Wasserstein metric](https://en.wikipedia.org/wiki/Wasserstein_metric)), but simpler measures like the absolute difference in proportions can also be used for illustrative purposes.


In [None]:
# Calculate the overall distribution of the sensitive attributes 'cholesterol' and 'gluc'
overall_cholesterol_distribution = data_anonymized['cholesterol'].value_counts(normalize=True)
overall_gluc_distribution = data_anonymized['gluc'].value_counts(normalize=True)

# Show the overall distributions
(overall_cholesterol_distribution, overall_gluc_distribution)

The overall distributions for 'cholesterol' and 'gluc' are as follows:

- For 'cholesterol':
  - Value 1 (normal): 74.84%
  - Value 2 (above normal): 13.64%
  - Value 3 (well above normal): 11.52%

- For 'gluc':
  - Value 1 (normal): 84.97%
  - Value 2 (above normal): 7.41%
  - Value 3 (well above normal): 7.62%

Next, we'll calculate the distribution of 'cholesterol' and 'gluc' within each group of records sharing the same combination of ***quasi-identifiers*** ('age' and 'height'). Then, we'll compare each group's distribution to the overall distribution to assess t-closeness. We need to select a threshold $t$ that represents an acceptable difference between these distributions.

For simplicity, let's calculate the absolute difference in proportions for each category of the sensitive attributes between the group and the overall dataset as a measure of distance. If the maximum difference for any value within a group exceeds $t$, we'll consider it a violation of t-closeness.

Let's proceed with this calculation. We'll use a threshold $t$ of 0.2 for demonstration purposes, but this value should be chosen carefully based on the context and the privacy requirements of the data.

In [None]:
# Function to calculate the absolute difference in proportions for t-closeness
def calculate_t_closeness(group_distribution, overall_distribution):
    
    # Align the group distribution with the overall distribution to ensure matching indices
    group_distribution = group_distribution.reindex(overall_distribution.index, fill_value=0)
    
    # Calculate the absolute difference in proportions
    return (group_distribution - overall_distribution).abs().max()

# Initialize a list to track groups that violate t-closeness
t_closeness_violations = []

# Calculate t-closeness for each group
for name, group in data_anonymized.groupby(['age_lower_bound', 'height_lower_bound']):
    group_cholesterol_distribution = group['cholesterol'].value_counts(normalize=True)
    group_gluc_distribution = group['gluc'].value_counts(normalize=True)

    cholesterol_distance = calculate_t_closeness(group_cholesterol_distribution, overall_cholesterol_distribution)
    gluc_distance = calculate_t_closeness(group_gluc_distribution, overall_gluc_distribution)

    if cholesterol_distance > 0.2 or gluc_distance > 0.2:
        t_closeness_violations.append((name, cholesterol_distance, gluc_distance))

# Create a DataFrame for violations for easier reading
t_closeness_violations_df = pd.DataFrame(t_closeness_violations, columns=['Group', 'Cholesterol_Distance', 'Gluc_Distance'])

t_closeness_violations_df

### Evaluation of t-closeness
The evaluation for t-closeness has resulted in an empty DataFrame for violations, which means there are no groups where the absolute difference in proportions for 'cholesterol' and 'gluc' exceeds the threshold of 0.2. This indicates that with respect to the sensitive attributes 'cholesterol' and 'gluc', each group of records is similar to the overall dataset within the specified threshold, and thus, the dataset complies with t-closeness with t=0.2.

This is a good outcome, as it suggests the dataset has been anonymized in a way that maintains the distribution of sensitive attributes close to the overall distribution, reducing the risk of attribute disclosure.

#### T-closeness evaluation metrics in detail:

The Earth Mover's Distance (EMD) and Maximum Divergence are two measures that can be used to evaluate t-closeness between distributions.

**Earth Mover's Distance (EMD)**   
EMD, also known as the ***Wasserstein metric*** or ***Kantorovich metric***, is a measure of the distance between two probability distributions over a region D. Intuitively, if the distributions are interpreted as two different ways of piling up a certain amount of dirt over the region D, the EMD is the minimum cost of turning one pile into the other; that is, it is the minimum amount of work needed to transform one distribution into the other, where "work" is measured as the amount of distribution weight that must be moved, multiplied by the distance it has to be moved.

The formula for EMD is more complex and typically requires linear programming to solve. It is given by:

$ EMD(p, q) = \inf_{\gamma \in \Pi(p, q)} \int_{D \times D} d(x, y) \, d\gamma(x, y) $

where:
- $p$ and $q$ are the two probability distributions.
- $\gamma$ ranges over all possible joint distributions with marginals $p$ and $q$
- $d(x, y)$ is the ground distance between points $x$ and $y$
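For distributions over ordered categories placed one unit apart (such as levels 1, 2, 3), no linear programming is needed: the EMD reduces to the sum of absolute differences between the two cumulative distributions. The sketch below relies on that assumption of unit-spaced, ordered categories. The helper name `emd_ordered` and the proportions are illustrative only.

```python
import numpy as np

def emd_ordered(p, q):
    """EMD between two distributions over the same ordered, unit-spaced categories."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # On a line, the optimal transport cost equals the summed absolute
    # difference between the two cumulative distributions
    return np.abs(np.cumsum(p - q)).sum()

overall = [0.75, 0.14, 0.11]   # illustrative proportions for categories 1, 2, 3
group = [0.60, 0.20, 0.20]

print(emd_ordered(group, overall))
print(emd_ordered(overall, overall))   # identical distributions give zero distance
```

Some formulations of t-closeness divide this sum by the number of categories minus one so that the result lies between 0 and 1, which makes a threshold $t$ easier to interpret.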

**Maximum Divergence**   
The Maximum Divergence measures the maximum absolute difference between the probabilities of the same event under two different probability distributions.

The formula for the Maximum Divergence is:

$MaxDivergence(p, q) = \max_{i} \lvert p(x_i) - q(x_i) \rvert$

where:
- $p$ and $q$ are the two probability distributions.
- $x_i$ are the events for which the probabilities are compared.

**Applying EMD and Max Divergence**   
Python's `scipy.stats` library provides methods to calculate EMD, but the calculation of Maximum Divergence can be done manually since it's a simple max operation on the absolute difference of probabilities. We'll calculate both measures for our dataset.

Let's apply these measures to compare the distributions of 'cholesterol' and 'gluc' within each group to the overall distribution. We will use the `wasserstein_distance` function from `scipy.stats` to calculate EMD. For the Maximum Divergence, we'll perform the calculation directly.

In [None]:
from scipy.stats import wasserstein_distance

# Function to calculate the EMD and Max Divergence for t-closeness
def calculate_emd_max_divergence(group_distribution, overall_distribution, categories):
    
    # Ensure the distributions are aligned and filled with zeros where necessary
    aligned_group_dist = [group_distribution.get(x, 0) for x in categories]
    aligned_overall_dist = [overall_distribution.get(x, 0) for x in categories]

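    # Calculate EMD: the categories are the support points and the aligned
    # proportions are the weights placed on them (passing the proportions as
    # values would compare the wrong distributions)
    emd_value = wasserstein_distance(categories, categories,
                                     u_weights=aligned_group_dist,
                                     v_weights=aligned_overall_dist)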

    # Calculate Max Divergence
    max_divergence_value = max(abs(a - b) for a, b in zip(aligned_group_dist, aligned_overall_dist))

    return emd_value, max_divergence_value


# Get the unique categories for cholesterol and gluc
cholesterol_categories = overall_cholesterol_distribution.index.tolist()
gluc_categories = overall_gluc_distribution.index.tolist()

# Initialize a list to track the distances for each group
t_closeness_metrics = []

# Calculate EMD and Max Divergence for each group
for name, group in data_anonymized.groupby(['age_lower_bound', 'height_lower_bound']):
    group_cholesterol_distribution = group['cholesterol'].value_counts(normalize=True)
    group_gluc_distribution = group['gluc'].value_counts(normalize=True)

    cholesterol_emd, cholesterol_max_div = calculate_emd_max_divergence(group_cholesterol_distribution, overall_cholesterol_distribution, cholesterol_categories)
    gluc_emd, gluc_max_div = calculate_emd_max_divergence(group_gluc_distribution, overall_gluc_distribution, gluc_categories)

    t_closeness_metrics.append((name, cholesterol_emd, cholesterol_max_div, gluc_emd, gluc_max_div))

# Create a DataFrame for the distances for easier reading
t_closeness_metrics_df = pd.DataFrame(t_closeness_metrics, columns=['Group', 'Cholesterol_EMD', 'Cholesterol_MaxDiv', 'Gluc_EMD', 'Gluc_MaxDiv'])

t_closeness_metrics_df

The `t_closeness_metrics_df` DataFrame now contains the Earth Mover's Distance (EMD) and Maximum Divergence for 'cholesterol' and 'gluc' within each group, compared to the overall distribution.

- **Cholesterol_EMD** and **Gluc_EMD** columns: These show the Earth Mover's Distance for 'cholesterol' and 'gluc' respectively. Smaller values indicate that the group's distribution is closer to the overall distribution.

- **Cholesterol_MaxDiv** and **Gluc_MaxDiv** columns: These show the Maximum Divergence for 'cholesterol' and 'gluc' respectively. These values represent the maximum absolute difference in proportions for each value of the sensitive attributes between the group and the overall dataset. Smaller values are better, indicating a closer match to the overall distribution.

From the DataFrame, we can see that most groups have relatively small EMD and Maximum Divergence values, so the distribution of 'cholesterol' and 'gluc' within these groups stays close to the overall distribution. This is consistent with t-closeness, although compliance can only be confirmed against a specific threshold.

To evaluate whether these distances are acceptable, one would compare them to a predefined threshold $t$. This threshold would depend on the specific privacy requirements and the context in which the data will be used.

If any group's distance exceeds the acceptable threshold, further anonymization techniques would need to be applied to that group to reduce the distance and ensure compliance with t-closeness.
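
The threshold check described above can be sketched as follows. This is a minimal illustration on made-up numbers: the threshold value `T_THRESHOLD` and the toy `metrics` frame are assumptions standing in for a real privacy requirement and for `t_closeness_metrics_df`.

```python
import pandas as pd

# Hypothetical threshold; the right value depends on the privacy requirements.
T_THRESHOLD = 0.2

# Toy stand-in for t_closeness_metrics_df (values are made up for illustration).
metrics = pd.DataFrame({
    'Group': [(30, 150), (40, 160), (50, 170)],
    'Cholesterol_EMD': [0.05, 0.25, 0.10],
    'Gluc_EMD': [0.03, 0.08, 0.22],
})

emd_columns = ['Cholesterol_EMD', 'Gluc_EMD']

# A group is flagged if the EMD of any sensitive attribute exceeds the threshold.
metrics['violates_t'] = (metrics[emd_columns] > T_THRESHOLD).any(axis=1)
violating_groups = metrics.loc[metrics['violates_t'], 'Group'].tolist()

print(violating_groups)
```

Groups returned in `violating_groups` are the candidates for further generalization or suppression.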

In [None]:
data_anonymized.to_csv('data_anonymized.csv', encoding='utf-8')

# 5. Information Loss

**Information Loss (IL)** is a measure used to quantify the amount of data utility that is lost as a result of anonymization or data transformation. It reflects the loss of detail or accuracy in the data, which can impact the usefulness of the data for analysis. There are many ways to calculate information loss, depending on the type of data and the anonymization technique used.

For continuous variables, we use the **Root Mean Squared Error (RMSE)**. RMSE is usually applied to the differences between predicted and observed values; here it is applied to the differences between the anonymized and original values. It is the square root of the mean of the squared differences:

$
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (o_i - a_i)^2}
$

where $ o_i $ is the original value and $ a_i $ is the anonymized value for the $i^{th}$ observation.

For categorical and binary variables, the proportion of changed values is calculated. This is simply the count of values that were changed by the anonymization process divided by the total number of values:

$
\text{Proportion Changed} = \frac{\text{Number of changed values}}{n}
$

where $n$ is the total number of values.
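
As a small worked example of both formulas, the sketch below applies them to toy values. The series names and numbers are made up for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy continuous column before and after anonymization.
original_weight = pd.Series([70.0, 82.0, 65.0, 90.0])
anonymized_weight = pd.Series([71.0, 80.0, 65.0, 92.0])

# Differences are -1, 2, 0, -2; squares are 1, 4, 0, 4; their mean is 2.25.
rmse_value = np.sqrt(((original_weight - anonymized_weight) ** 2).mean())

# Toy categorical column before and after anonymization.
original_chol = pd.Series([1, 2, 3, 1])
anonymized_chol = pd.Series([1, 3, 3, 2])

# Two of the four values differ.
proportion = (original_chol != anonymized_chol).mean()

print(rmse_value, proportion)  # 1.5 0.5
```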

**To compute the information loss between the original cardio_train_m dataset and the data_anonymized dataset, we can use the following approach:**

- Load the original cardio_train_m dataset and remove the direct identifiers.
- Load the data_anonymized dataset and convert interval strings back to integer series (by extracting the lower bound, for example).
- Align the datasets based on the indexes so that we can compare them directly.
- Calculate information loss metrics for the quantitative and categorical attributes.
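
The second and third steps above can be sketched on toy data as follows. The interval label format (`'[30, 35)'`) and the helper name `interval_lower_bound` are assumptions for illustration; the frames are made up and only show the mechanics of parsing a lower bound and restricting both datasets to their shared index.

```python
import pandas as pd

def interval_lower_bound(label):
    """Return the integer lower bound of an interval label such as '[30, 35)'."""
    # Skip the opening bracket, then take the text before the first comma.
    return int(label[1:].split(',')[0])

# Toy anonymized frame; the original row labels 1 and 2 survive as the index.
anonymized = pd.DataFrame(
    {'age': ['[30, 35)', '[45, 50)'], 'weight': [82.0, 66.0]},
    index=[1, 2],
)
anonymized['age'] = anonymized['age'].apply(interval_lower_bound)

# Toy original frame with one extra row (label 0) that has no counterpart.
original = pd.DataFrame({'weight': [70.0, 82.0, 65.0]}, index=[0, 1, 2])

# Keep only the rows present in both frames so comparisons are row-for-row.
shared_index = original.index.intersection(anonymized.index)
original_aligned = original.loc[shared_index]
anonymized_aligned = anonymized.loc[shared_index]

print(anonymized['age'].tolist(), list(shared_index))
```

Restricting both frames to `shared_index` matters whenever anonymization suppressed rows, because a positional comparison would otherwise pair unrelated records.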

For quantitative attributes (like weight, ap_hi, and ap_lo), we measure the information loss with the RMSE between the original and anonymized values. For categorical and binary attributes (like gender, cholesterol, gluc, smoke, alco, active, and cardio), we measure it by the proportion of changed values.

Let's start by loading both datasets and preparing them for the information loss calculation.

In [None]:
# Load the original dataset
original_data_path = 'cardio_train_m.csv'
original_data = pd.read_csv(original_data_path, delimiter=';')

# Drop direct identifiers from the original dataset
identifiers = ['id', 'name', 'email', 'address', 'ssn']
original_data.drop(identifiers, axis=1, inplace=True)

# Keep only the necessary columns in the original dataset
necessary_columns = ['gender', 'weight', 'ap_hi', 'ap_lo', 'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio']
original_data = original_data[necessary_columns]

# Load the anonymized dataset
anonymized_data_path = 'data_anonymized.csv'
anonymized_data = pd.read_csv(anonymized_data_path)

# Convert interval strings back to integer series for 'age' and 'height'
anonymized_data['age'] = anonymized_data['age'].apply(lambda x: int(x[1:].split(',')[0]) if pd.notnull(x) else np.nan)
anonymized_data['height'] = anonymized_data['height'].apply(lambda x: int(x[1:].split(',')[0]) if pd.notnull(x) else np.nan)

# Remove the 'Unnamed: 0' column and any other non-necessary columns from the anonymized dataset
anonymized_data.drop(['Unnamed: 0', 'age', 'height'], axis=1, inplace=True)

# Show the first few rows of both datasets to ensure correctness
(original_data.head(), anonymized_data.head())

Next, we will calculate the information loss for each column. 
- For continuous variables (weight, ap_hi, and ap_lo), we will calculate the Root Mean Squared Error (RMSE) to quantify information loss. 
- For categorical and binary variables (gender, cholesterol, gluc, smoke, alco, active, and cardio), we will calculate the proportion of changed values.

Because anonymization perturbed some of these columns, we expect some information loss. Where binary variables in the anonymized dataset are represented by letters ('A', 'B'), we assume 'A' corresponds to 0 and 'B' to 1 so that they can be compared directly with the original values.

In [None]:
# Function to calculate RMSE for continuous variables
def rmse(original, anonymized):
    return np.sqrt(((original - anonymized) ** 2).mean())

# Function to calculate the proportion of changed values for categorical/binary variables
def proportion_changed(original, anonymized):
    return (original != anonymized).mean()

# Replace 'A' with 0 and 'B' with 1 in the anonymized dataset for binary variables
binary_columns = ['smoke', 'alco', 'active']
anonymized_data[binary_columns] = anonymized_data[binary_columns].replace({'A': 0, 'B': 1})

# Calculate information loss for continuous variables
continuous_columns = ['weight', 'ap_hi', 'ap_lo']
info_loss_continuous = {col: rmse(original_data[col], anonymized_data[col]) for col in continuous_columns}

# Calculate information loss for categorical/binary variables
categorical_columns = ['gender', 'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio']
info_loss_categorical = {col: proportion_changed(original_data[col], anonymized_data[col]) for col in categorical_columns}

(info_loss_continuous, info_loss_categorical)

The information loss for the continuous variables is quantified as RMSE:

- **weight**: RMSE of approximately 0.14, indicating that weight values deviate only slightly from the originals after anonymization.
- **ap_hi**: RMSE of approximately 7.84, indicating a moderate deviation in systolic blood pressure values.
- **ap_lo**: RMSE of approximately 9.56, indicating a moderate deviation in diastolic blood pressure values.

For the categorical and binary variables, the information loss is quantified as the proportion of changed values:

- **gender, smoke, alco, active, cardio**: No information loss detected (0% changed values).
- **cholesterol**: Approximately 40.84% of the values were changed during the anonymization.
- **gluc**: Approximately 26.63% of the values were changed during the anonymization.

This analysis suggests that the anonymization process left gender, the binary lifestyle attributes (smoke, alco, active), and the cardio target unchanged. The cholesterol and gluc attributes, by contrast, were altered substantially, which is expected given the data swapping step applied during anonymization. Whether the RMSE values for the continuous variables are acceptable depends on the intended analysis; relative to the goal of anonymizing the dataset, the loss appears modest.
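
Raw RMSE values are expressed in each column's own units, so they are hard to compare across columns. One possible way to put them on a common scale, shown here as a sketch rather than as part of the analysis above, is to divide the RMSE by the standard deviation of the original column. The helper name `relative_rmse` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def relative_rmse(original, anonymized):
    """RMSE divided by the sample standard deviation of the original values."""
    rmse_value = np.sqrt(((original - anonymized) ** 2).mean())
    return rmse_value / original.std()

# Toy blood-pressure-like column before and after anonymization.
original_bp = pd.Series([120.0, 130.0, 140.0, 150.0])
anonymized_bp = pd.Series([122.0, 128.0, 142.0, 148.0])

ratio = relative_rmse(original_bp, anonymized_bp)
print(round(ratio, 3))
```

A ratio well below 1 means the perturbation is small compared with the natural spread of the column; what counts as small enough remains a judgment call for the intended analysis.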
