In [1]:
# -*- coding: utf-8 -*-
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Data Anonymization

### Notes
- Cardiovascular Disease dataset used with reference to original authors and compliance with CC BY 4.0 license: https://data.mendeley.com/datasets/dzz48mvjht/1


In [2]:
!pip install pandas
!pip install numpy
!pip install matplotlib



In [3]:
import pandas as pd
import numpy as np
import hashlib
import matplotlib.pyplot as plt

In [4]:
# Load the dataset
file_path = 'cardiovascular_disease_m.csv'
data = pd.read_csv(file_path, delimiter=',')

# Display the first few rows of the dataframe
data.head()


Unnamed: 0,name,address,email,ssn,patientid,age,gender,chestpain,restingBP,serumcholestrol,fastingbloodsugar,restingrelectro,maxheartrate,exerciseangia,oldpeak,slope,noofmajorvessels,target
0,William Smith,"698 Dunn Wall Suite 320, Henryhaven, TN 31231",joshuablackburn@example.com,114-99-3593,103368,53,1,2,171,0,0,1,147,0,5.3,3,3,1
1,Timothy Miller,"90253 Amy Springs Suite 381, Lake Seanchester,...",rowekathy@example.org,620-36-6387,119250,40,1,0,94,229,0,1,115,0,3.7,1,1,0
2,Jeanette Lopez,"7180 Montgomery Ville Apt. 768, East Brittany,...",victoriasimpson@example.net,241-42-7559,119372,49,1,2,133,142,0,0,202,1,5.0,1,0,0
3,Karl Freeman,"88166 Jennifer Orchard Suite 229, West Lindsey...",lisarobinson@example.org,257-83-9899,132514,43,1,0,138,295,1,1,153,0,3.2,2,2,1
4,Jasmine Robinson,"USCGC Richardson, FPO AA 93600",hobryan@example.net,063-80-8977,146211,31,1,1,199,0,0,2,136,0,5.3,3,2,1


## Exploratory Data Analysis (EDA) in Privacy-Preserving Machine Learning (PPML)

Given the dataset description, there are several points to consider when it comes to PPML. The dataset contains a mix of identifiable personal information (such as full name, email, address, and Social Security number) and objective features describing the individual's physical characteristics and health parameters.

Before we proceed with any PPML techniques, we must categorize the data into identifiable, sensitive, and non-sensitive information. This categorization is crucial for determining the appropriate privacy-preserving methods.

### Identifiable Information

- **Id**
- **Full Name**
- **Email**
- **Address**
- **Social Security number (SSN)**

These are direct identifiers and should be handled with the utmost care. This information can be used to directly trace back the data to the individuals. In most cases, this type of information should be removed or encrypted before the dataset is used for data mining purposes.

### Sensitive Information

- **Health data**: Age, Gender, Chest pain type, Resting blood pressure, Serum cholesterol, Fasting blood sugar, Resting electrocardiogram results, Maximum heart rate achieved, Exercise induced angina, Oldpeak = ST, Slope of the peak exercise ST segment, Number of major vessels
- **Medical Outcome**: Presence or absence of heart disease

This information, while not directly identifying, is still sensitive. It can potentially be used to infer identities, especially when combined with other data. For PPML, this data might need to be anonymized or perturbed.

### Non-sensitive Information

- **Gender**

Gender may not be sensitive on its own, but in combination with other data, it can contribute to re-identification risk.

### Steps for PPML on this dataset:

1. **Anonymization by data transformation**: Remove or encrypt direct identifiers. For example, replace 'Full Name' with a pseudonym or random ID.

2. **K-Anonymity**: Apply k-anonymity to the remaining quasi-identifiers like Age, Gender, Chest pain type, etc. This could mean generalizing ages to age ranges or encoding gender in a less identifiable manner.

3. **L-Diversity**: Ensure that the anonymized data has sufficient diversity in the sensitive attributes. For instance, make sure that for every combination of age range and gender, there are multiple records with different health conditions and chest pain types.

4. **T-Closeness**: Maintain the distribution of sensitive attributes like Serum cholesterol and Fasting blood sugar levels within each anonymized group close to the overall distribution to prevent skewness.

5. **Data Perturbation**: Add noise to the data, especially for the numerical health-related features like Resting blood pressure readings and Serum cholesterol, to obscure the precise values but maintain the statistical distribution.

6. **Minimize Data**: Only retain data necessary for the analysis. If the goal is to study cardiovascular disease patterns, data like 'Address' can be excluded entirely.

7. **Access Control**: Ensure that only authorized personnel can access the de-identified dataset and that further use of the data is regulated. This does not apply in our case, since this is a study environment, but in real-world deployments it can be crucial.

In practice, you would need to balance data utility against privacy. The more you perturb the data to protect privacy, the less accurate your data mining results might be. It is crucial to find a point where the data is still useful for analysis without compromising individual privacy.
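
The trade-off can be made concrete with a small sketch. It uses only the standard library and hypothetical blood pressure values (not the notebook's dataset), and it reuses the same random draws at every noise level so that only the noise scale changes. The function name `mean_abs_error` and the chosen noise levels are assumptions made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical resting blood pressure readings (illustrative values only)
values = [random.uniform(90, 200) for _ in range(1000)]
sigma = statistics.pstdev(values)

# Draw the standard-normal noise once and scale it, so only the noise level changes
unit_noise = [random.gauss(0, 1) for _ in values]

def mean_abs_error(noise_level):
    """Mean absolute deviation between original and perturbed values."""
    perturbed = [v + noise_level * sigma * z for v, z in zip(values, unit_noise)]
    return sum(abs(p - v) for p, v in zip(perturbed, values)) / len(values)

errors = {level: mean_abs_error(level) for level in (0.01, 0.1, 0.5)}
for level, err in errors.items():
    print(f"noise_level={level}: mean absolute error = {err:.2f}")
```

The error grows with the noise level: stronger perturbation hides individual values better but moves each record further from the truth.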



# 1. Anonymization by data transformation

**Remove Direct Identifiers**: We'll drop the `patientid`, `name`, `email` and `ssn` columns, as these contain directly identifiable information.

**Data Tokenization**: Since we've removed almost all direct identifiers, we'll simulate tokenization by replacing `address` values with tokens. We will drop it later as we still don't need this field for cardiovascular analysis.

**Data Generalization**: For attributes like `age`, we'll replace exact values with ranges.

**Data Perturbation**: We'll apply a slight random noise to `restingBP (resting blood pressure)` and `serumcholestrol` to perturb these continuous variables.

**Data Swapping**: We'll swap the values of `serumcholestrol` and `fastingbloodsugar` between records. In a real-life case this would not be the right step to take, as these fields are clinically important and swapping them could lead to wrong analysis and diagnosis predictions. *TL;DR: know your data before swapping something.*

**Noise Addition**: We'll add Gaussian noise to the `serumcholestrol`.

**Data Masking**: We'll replace `gender` with masked categories.



In [5]:
# Removing direct identifiers
data_anonymized = data.drop(columns=['patientid', 'name', 'email', 'ssn'])
data_anonymized.head()

Unnamed: 0,address,age,gender,chestpain,restingBP,serumcholestrol,fastingbloodsugar,restingrelectro,maxheartrate,exerciseangia,oldpeak,slope,noofmajorvessels,target
0,"698 Dunn Wall Suite 320, Henryhaven, TN 31231",53,1,2,171,0,0,1,147,0,5.3,3,3,1
1,"90253 Amy Springs Suite 381, Lake Seanchester,...",40,1,0,94,229,0,1,115,0,3.7,1,1,0
2,"7180 Montgomery Ville Apt. 768, East Brittany,...",49,1,2,133,142,0,0,202,1,5.0,1,0,0
3,"88166 Jennifer Orchard Suite 229, West Lindsey...",43,1,0,138,295,1,1,153,0,3.2,2,2,1
4,"USCGC Richardson, FPO AA 93600",31,1,1,199,0,0,2,136,0,5.3,3,2,1


In [6]:
# Data Tokenization

# Simulating data tokenization for the 'address' variable
def generate_token(value):
    # A simple tokenization using a hash function (not reversible in this case)
    return hashlib.sha256(str(value).encode()).hexdigest()

# Applying tokenization
data_anonymized['address'] = data_anonymized['address'].apply(generate_token)
data_anonymized.head()

Unnamed: 0,address,age,gender,chestpain,restingBP,serumcholestrol,fastingbloodsugar,restingrelectro,maxheartrate,exerciseangia,oldpeak,slope,noofmajorvessels,target
0,4addb3417c39927fbf4030544422752bab7142644688e0...,53,1,2,171,0,0,1,147,0,5.3,3,3,1
1,441efb368bb892c841bd3988b45f2fa2e7557b75597faa...,40,1,0,94,229,0,1,115,0,3.7,1,1,0
2,c4dab081bbee3500cb2d7dc1df8268131c299fbf6c0142...,49,1,2,133,142,0,0,202,1,5.0,1,0,0
3,8b5d18ff8b7af598ed2641180c069c9d24cfc00c89f6be...,43,1,0,138,295,1,1,153,0,3.2,2,2,1
4,9c2a1b6db9ef0037ac703744d535eaabbcdb83783bef88...,31,1,1,199,0,0,2,136,0,5.3,3,2,1
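
A plain SHA-256 hash is deterministic and unkeyed, so anyone who can guess candidate addresses can hash them and compare the results against the tokens. One common mitigation is a keyed hash (HMAC). The sketch below is a hedged illustration using Python's standard `hmac` module; the names `SECRET_KEY` and `generate_keyed_token` are hypothetical and not part of the original notebook.

```python
import hashlib
import hmac
import secrets

# Hypothetical secret key; in practice it would be stored outside the dataset
SECRET_KEY = secrets.token_bytes(32)

def generate_keyed_token(value, key=SECRET_KEY):
    """Return a keyed SHA-256 token (HMAC) for the given value."""
    return hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()

address = "698 Dunn Wall Suite 320, Henryhaven, TN 31231"
token_a = generate_keyed_token(address)
token_b = generate_keyed_token(address)
print(token_a == token_b)  # same key and input give the same token
```

Without the key, an attacker cannot reproduce the tokens from guessed addresses, while the data owner can still map equal inputs to equal tokens.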


In [7]:
# Dropping address, we no longer need it
data_anonymized = data_anonymized.drop(columns=['address'])
data_anonymized.head()

Unnamed: 0,age,gender,chestpain,restingBP,serumcholestrol,fastingbloodsugar,restingrelectro,maxheartrate,exerciseangia,oldpeak,slope,noofmajorvessels,target
0,53,1,2,171,0,0,1,147,0,5.3,3,3,1
1,40,1,0,94,229,0,1,115,0,3.7,1,1,0
2,49,1,2,133,142,0,0,202,1,5.0,1,0,0
3,43,1,0,138,295,1,1,153,0,3.2,2,2,1
4,31,1,1,199,0,0,2,136,0,5.3,3,2,1


In [8]:
# Create age ranges

data_anonymized['age'] = pd.cut(data_anonymized['age'], bins=range(0, 110, 10), right=False)
data_anonymized.head()

Unnamed: 0,age,gender,chestpain,restingBP,serumcholestrol,fastingbloodsugar,restingrelectro,maxheartrate,exerciseangia,oldpeak,slope,noofmajorvessels,target
0,"[50, 60)",1,2,171,0,0,1,147,0,5.3,3,3,1
1,"[40, 50)",1,0,94,229,0,1,115,0,3.7,1,1,0
2,"[40, 50)",1,2,133,142,0,0,202,1,5.0,1,0,0
3,"[40, 50)",1,0,138,295,1,1,153,0,3.2,2,2,1
4,"[30, 40)",1,1,199,0,0,2,136,0,5.3,3,2,1


In [9]:
# Data Perturbation

# Define a function to add noise based on a percentage of the standard deviation
def add_noise(series, noise_level):
    return series + np.random.normal(0, noise_level * series.std(), size=len(series))

# Apply the function to perturb the restingBP and serumcholestrol
data_anonymized['restingBP'] = add_noise(data_anonymized['restingBP'], 0.01)  # 1% noise
data_anonymized['serumcholestrol'] = add_noise(data_anonymized['serumcholestrol'], 0.01)

data_anonymized.head()

Unnamed: 0,age,gender,chestpain,restingBP,serumcholestrol,fastingbloodsugar,restingrelectro,maxheartrate,exerciseangia,oldpeak,slope,noofmajorvessels,target
0,"[50, 60)",1,2,170.548145,1.231585,0,1,147,0,5.3,3,3,1
1,"[40, 50)",1,0,93.859307,226.545454,0,1,115,0,3.7,1,1,0
2,"[40, 50)",1,2,132.480029,144.853821,0,0,202,1,5.0,1,0,0
3,"[40, 50)",1,0,137.500666,294.852135,1,1,153,0,3.2,2,2,1
4,"[30, 40)",1,1,198.726813,0.261416,0,2,136,0,5.3,3,2,1


In [10]:
# Data Swapping

# Storing the original values before swapping
data_anonymized['original_serumcholestrol'] = data_anonymized['serumcholestrol'].copy()
data_anonymized['original_fastingbloodsugar'] = data_anonymized['fastingbloodsugar'].copy()

# Performing the swapping between 'serumcholestrol' and 'fastingbloodsugar'
serumcholestrol_indices = np.random.permutation(data_anonymized.index)
fastingbloodsugar_indices = np.random.permutation(data_anonymized.index)

data_anonymized['serumcholestrol'] = data_anonymized['serumcholestrol'].iloc[fastingbloodsugar_indices].values
data_anonymized['fastingbloodsugar'] = data_anonymized['fastingbloodsugar'].iloc[serumcholestrol_indices].values

# The dataset after swapping
print("Dataset after swapping:")
print(data_anonymized.head())



# Reverting the changes by copying back the original values
data_anonymized['serumcholestrol'] = data_anonymized['original_serumcholestrol']
data_anonymized['fastingbloodsugar'] = data_anonymized['original_fastingbloodsugar']

# Dropped the temporary columns storing the original values
data_anonymized.drop(columns=['original_serumcholestrol', 'original_fastingbloodsugar'], inplace=True)

# The reverted dataset
print("\nDataset after reverting the swapping:")
print(data_anonymized.head())

Dataset after swapping:
        age  gender  chestpain   restingBP  serumcholestrol  \
0  [50, 60)       1          2  170.548145       266.055008   
1  [40, 50)       1          0   93.859307       322.922140   
2  [40, 50)       1          2  132.480029       340.610607   
3  [40, 50)       1          0  137.500666       418.244465   
4  [30, 40)       1          1  198.726813       272.792806   

   fastingbloodsugar  restingrelectro  maxheartrate  exerciseangia  oldpeak  \
0                  1                1           147              0      5.3   
1                  0                1           115              0      3.7   
2                  0                0           202              1      5.0   
3                  0                1           153              0      3.2   
4                  1                2           136              0      5.3   

   slope  noofmajorvessels  target  original_serumcholestrol  \
0      3                 3       1                  1.2315

In [11]:
# Data Masking

# Simple example of masking
masking_tokens = {0: 'A', 1: 'B'}
data_anonymized['gender'] = data_anonymized['gender'].map(masking_tokens)
data_anonymized.head()

Unnamed: 0,age,gender,chestpain,restingBP,serumcholestrol,fastingbloodsugar,restingrelectro,maxheartrate,exerciseangia,oldpeak,slope,noofmajorvessels,target
0,"[50, 60)",B,2,170.548145,1.231585,0,1,147,0,5.3,3,3,1
1,"[40, 50)",B,0,93.859307,226.545454,0,1,115,0,3.7,1,1,0
2,"[40, 50)",B,2,132.480029,144.853821,0,0,202,1,5.0,1,0,0
3,"[40, 50)",B,0,137.500666,294.852135,1,1,153,0,3.2,2,2,1
4,"[30, 40)",B,1,198.726813,0.261416,0,2,136,0,5.3,3,2,1


# 2. K-anonymity

**K-anonymity** is a privacy-preserving technique used to protect the identity of individuals in a dataset. A dataset is said to be k-anonymous if the record of each person in it cannot be distinguished, based on the values of the quasi-identifiers, from the records of at least **k−1** other individuals who also appear in the dataset.


### K-anonymity limitations:

If all the individuals in a k-anonymous set have the same sensitive value, then the sensitive value for the set is known. That is called a **homogeneity attack**.

Imagine giving everyone in a neighborhood the same house color and car to make them less recognizable. Now, if all these look-alike folks also have the same health issue, for example, everyone has a sunburn, then if you know someone lives in that neighborhood, you'd guess they probably have a sunburn too. That's a homogeneity attack: when all the hidden data is too similar, it's easy to guess private stuff about someone if you know they're part of that group.

The k-anonymity method tries to prevent this by making sure that each person's data is hidden in a group of at least 'k' people. But if all 'k' people have the same sensitive info, it doesn't help much. That's where **l-diversity** comes in. It's a fancier method that makes sure within each group, there's a mix of different sensitive details. So, even if you know someone is in a group, you can't be sure of their specific issue because there's a variety in there.
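
The homogeneity problem can be shown on a toy example. The sketch below uses hypothetical records (not the notebook's dataset) and only the standard library: it groups records by their quasi-identifiers, checks the group sizes against `k`, and then lists the groups in which every record shares the same sensitive value.

```python
from collections import defaultdict

# Hypothetical toy records: (age_range, gender, diagnosis)
records = [
    ("[40, 50)", "A", "heart disease"),
    ("[40, 50)", "A", "heart disease"),
    ("[40, 50)", "A", "heart disease"),
    ("[50, 60)", "B", "heart disease"),
    ("[50, 60)", "B", "healthy"),
    ("[50, 60)", "B", "healthy"),
]

# Group sensitive values by quasi-identifier combination (equivalence class)
classes = defaultdict(list)
for age_range, gender, diagnosis in records:
    classes[(age_range, gender)].append(diagnosis)

k = 3
is_k_anonymous = all(len(values) >= k for values in classes.values())
homogeneous = [key for key, values in classes.items() if len(set(values)) == 1]

print(is_k_anonymous)  # whether every class has at least k records
print(homogeneous)     # classes whose sensitive value is fully revealed
```

Here both groups reach k=3, yet the first group leaks its diagnosis because all of its members share it.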

In [12]:
# Define k for k-anonymity check
k = 5

age_k_anonymity = data_anonymized['age'].value_counts()
gender_k_anonymity = data_anonymized['gender'].value_counts()

# Check if all groups have at least k records for both age and gender separately
age_k_anonymity_check = all(age_k_anonymity >= k)
gender_k_anonymity_check = all(gender_k_anonymity >= k)

# Output the results of the k-anonymity check for age and gender separately
age_k_anonymity_check, gender_k_anonymity_check, age_k_anonymity, gender_k_anonymity

(False,
 True,
 age
 [20, 30)     181
 [70, 80)     169
 [30, 40)     167
 [50, 60)     165
 [40, 50)     158
 [60, 70)     148
 [80, 90)      12
 [0, 10)        0
 [10, 20)       0
 [90, 100)      0
 Name: count, dtype: int64,
 gender
 B    765
 A    235
 Name: count, dtype: int64)

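The check for K-anonymity shows that our dataset does not satisfy K-anonymity for k=5 for age. The bins at the extremes of the age ranges (`[0, 10)`, `[10, 20)`, and `[90, 100)`) have fewer than 5 records; in fact they are empty, and they still appear in the counts because the binned column keeps unused categories. The smallest populated bin, `[80, 90)`, has only 12 records.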

To achieve K-anonymity, we need to generalize these quasi-identifiers further. For example, we can combine less populous age groups with neighboring ones to ensure that each group has at least 5 records.

Let's apply further generalization to ensure K-anonymity:

- For the 'age' attribute, we'll combine the underrepresented ranges with the nearest populated range.

In [13]:
# Step 1: Extract a numeric age value from each interval (the lower bound or midpoint)
# For simplicity, let's use the lower bound as the representative age
data_anonymized['age_numeric'] = data_anonymized['age'].apply(lambda x: x.left)

# Step 2: Define new age bins that merge groups into broader categories
new_age_bins = [20, 40, 60, 80, 90]

# Re-binning 'age' using the new bins, applying the broader categories
data_anonymized['age_binned'] = pd.cut(data_anonymized['age_numeric'], bins=new_age_bins, right=False)

# Checking K-anonymity of the newly binned 'age' attribute
new_age_k_anonymity = data_anonymized['age_binned'].value_counts()
new_age_k_anonymity_check = all(new_age_k_anonymity >= k)

new_age_k_anonymity_check, new_age_k_anonymity

(True,
 age_binned
 [20, 40)    348
 [40, 60)    323
 [60, 80)    317
 [80, 90)     12
 Name: count, dtype: int64)

Now, after re-binning:

- Age satisfies K-anonymity for k=5, as all age groups have at least 5 records. However, the `[80, 90)` group is much smaller than the others (12 records), so let's merge it with `[60, 80)` to make the distribution less skewed.

In [14]:
# Merging 80-90 with 60-80
new_age_bins = [20, 40, 60, 90]

# Re-bin 'age' using the new bins
data_anonymized['age_binned'] = pd.cut(data_anonymized['age'].apply(lambda x: x.left), bins=new_age_bins, right=False)

# Group by the new 'age_binned' and 'gender' directly and calculate the counts
k_anonymity_counts = data_anonymized.groupby(['age_binned', 'gender']).size().reset_index(name='counts')

# Check the new k-anonymity for the combined groups
new_k_anonymity_check = all(k_anonymity_counts['counts'] >= k)

# Output the results and the groups that do not meet k-anonymity
new_k_anonymity_check, k_anonymity_counts[k_anonymity_counts['counts'] < k]


(True,
 Empty DataFrame
 Columns: [age_binned, gender, counts]
 Index: [])

All combinations of 'age' and 'gender' in the dataset now meet the k-anonymity criterion with **k=5**, as the dataframe showing the violations is empty. This means that for every combination of these quasi-identifiers, there are at least **five** records in the dataset, and we do not need to generalize these attributes further.

### Evaluation of K-Anonymity

To evaluate the k-anonymity of the dataset, we can ensure that no groups of records exist such that their count is less than **k**. Since our violations dataframe is empty, we can confirm that our dataset is k-anonymous with respect to 'age' and 'gender'.

An additional evaluation metric could be the diversity of the sensitive attributes within each group defined by 'age' and 'gender'. However, since we were focusing on these two attributes specifically for k-anonymity, and they are already compliant, we can conclude that the dataset satisfies k-anonymity for **k=5**.

# 3. L-diversity

**L-diversity** is a model that ***extends k-anonymity*** by requiring diversity in the sensitive attributes within each group of k-anonymized records, in addition to the reduced granularity of the quasi-identifiers. L-diversity requires that each equivalence class (a group of records that are indistinguishable from each other with respect to the quasi-identifiers) has at least l "well-represented" values for the sensitive attributes.

For our dataset, sensitive attributes include 'serumcholestrol', 'restingBP', and 'maxheartrate'.

**To ensure L-diversity, we will perform the following steps:**

- For each combination of 'age_binned' and 'gender', we will count the number of unique values for 'serumcholestrol', 'restingBP', and 'maxheartrate'.
- If any group has fewer than l=3 unique values for any of these attributes, it does not meet L-diversity.
- We will address any L-diversity violations by further generalizing the quasi-identifiers or by suppressing records until the L-diversity criterion is satisfied.
- We will assess the L-diversity of the dataset after applying necessary transformations.


In [15]:
# Function to check if a group has l-diversity
def check_l_diversity(group, column, l=3):
    return group[column].nunique() >= l

# We'll use 'age_binned' and 'gender' as our k-anonymity groups

# Check for l-diversity in the sensitive attributes within each k-anonymized group
l_diversity_violations = []

for name, group in data_anonymized.groupby(['age_binned', 'gender']):
    # Check for l-diversity in 'serumcholestrol', 'restingBP', 'maxheartrate'
    serumcholestrol_diverse = check_l_diversity(group, 'serumcholestrol', l=3)
    restingBP_diverse = check_l_diversity(group, 'restingBP', l=3)
    maxheartrate_diverse = check_l_diversity(group, 'maxheartrate', l=3)

    # Record any violations where a group doesn't have at least l=3 unique values in any sensitive attribute
    if not (serumcholestrol_diverse and restingBP_diverse and maxheartrate_diverse):
        l_diversity_violations.append((name, serumcholestrol_diverse, restingBP_diverse, maxheartrate_diverse))

# Create a DataFrame for violations for easier reading
l_diversity_violations_df = pd.DataFrame(l_diversity_violations,
                                         columns=['Group', 'SerumCholestrol_Diverse', 'RestingBP_Diverse', 'MaxHeartRate_Diverse'])

l_diversity_violations_df


Unnamed: 0,Group,SerumCholestrol_Diverse,RestingBP_Diverse,MaxHeartRate_Diverse


The output indicates that there are no groups that violate the l-diversity criterion. Each group of records, defined by the quasi-identifiers 'age' and 'gender', has at least three distinct values for all three sensitive attributes. Note that `serumcholestrol` and `restingBP` became continuous once noise was added, so almost every value is unique and this distinct-value check is easy to pass.

This confirms that the dataset complies with l-diversity with l=3 in the distinct-values sense, which means that the dataset is suitably anonymized with respect to these attributes within the context of the l-diversity privacy model.


### Evaluation of l-diversity
To evaluate l-diversity, we can use some metrics that will help us understand the level of diversity within each group of quasi-identifiers. Here are evaluation metrics we could consider:

- **The average l-diversity of the dataset**: This is the average number of unique sensitive attribute values per group. A higher average indicates better privacy.

- **The minimum l-diversity of the dataset**: This is the minimum number of unique sensitive attribute values in any group. For strict l-diversity compliance, this should be at least l.

- **The entropy of the sensitive attributes in each group**: Entropy measures the uncertainty or randomness of the information. Higher entropy is usually better for privacy because it indicates a higher degree of randomness in the sensitive attribute values. It is calculated using the following formula:

$$
Entropy = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)
$$

Where:

- $n$ is the number of unique sensitive attribute values in the group.
- $p(x_i)$ is the proportion of the group that has the $i$-th sensitive attribute value.
- $\log_2$ is the logarithm base 2, which is used to measure the entropy in bits.

In l-diversity:

- A higher entropy value for a group indicates a more diverse and thus a more "private" distribution of sensitive values, as it would be harder for an attacker to predict the value of the sensitive attribute for any individual within the group.
- The goal is to ensure that the entropy is above a certain threshold for all groups defined by the combination of quasi-identifiers, which corresponds to ensuring a minimum level of l-diversity (where l is the minimum number of unique values).
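
One common way to turn entropy into a threshold is entropy l-diversity, which asks that the entropy of each group be at least $\log_2 l$. The sketch below is a minimal standard-library illustration on hypothetical categorical values; the function names `entropy_bits` and `satisfies_entropy_l_diversity` are assumptions introduced here, not part of the notebook.

```python
import math
from collections import Counter

def entropy_bits(values):
    """Shannon entropy (base 2) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def satisfies_entropy_l_diversity(values, l):
    """Entropy l-diversity: the group's entropy must be at least log2(l)."""
    return entropy_bits(values) >= math.log2(l)

# Hypothetical groups of a categorical sensitive attribute
balanced = ["Normal", "Elevated", "High", "Normal", "Elevated", "High"]
skewed = ["High", "High", "High", "High", "High", "Normal"]

print(satisfies_entropy_l_diversity(balanced, l=2))
print(satisfies_entropy_l_diversity(skewed, l=2))
```

The skewed group contains two distinct values, so it passes a distinct-value check with l=2, but its entropy is below one bit because one value dominates. This is why entropy is a stricter measure than counting unique values.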


**Let's calculate these metrics for our dataset. We'll compute:**

- The average number of unique values for 'serumcholestrol', 'restingBP', and 'maxheartrate' across all groups.
- The minimum number of unique values for 'serumcholestrol', 'restingBP', and 'maxheartrate' across all groups.
- The entropy of 'serumcholestrol', 'restingBP', and 'maxheartrate' values in each group.


In [31]:
# Average number of unique values for the sensitive attributes across all groups
average_diversity = data_anonymized.groupby(['age_binned', 'gender']).apply(
    lambda g: (g['serumcholestrol'].nunique() + g['restingBP'].nunique() + g['maxheartrate'].nunique()) / 3
).mean()

# Minimum number of unique values for the sensitive attributes across all groups
minimum_diversity = data_anonymized.groupby(['age_binned', 'gender']).apply(
    lambda g: min(g['serumcholestrol'].nunique(), g['restingBP'].nunique(), g['maxheartrate'].nunique())
).min()

# Function to calculate entropy for a given series
def entropy(series):
    counts = series.value_counts()
    probabilities = counts / counts.sum()
    return -np.sum(probabilities * np.log2(probabilities))

# Calculate the entropy of the sensitive attributes in each group
entropy_values = data_anonymized.groupby(['age_binned', 'gender']).apply(
    lambda g: (entropy(g['serumcholestrol']) + entropy(g['restingBP']) + entropy(g['maxheartrate'])) / 3
)

average_diversity, minimum_diversity, entropy_values

(138.27777777777777,
 54,
 age_binned  gender
 [20, 40)    A         6.051313
             B         7.567690
 [40, 60)    A         6.043801
             B         7.441744
 [60, 90)    A         6.138123
             B         7.480154
 dtype: float64)

The results of the L-diversity evaluation are as follows:

- **Average L-diversity:**
    The average number of unique values for the sensitive attributes `serumcholestrol`, `restingBP`, and `maxheartrate` is approximately 138.28 per group. This greatly exceeds the L-diversity criterion of l=3, which indicates a strong diversity across the dataset.

- **Minimum L-diversity:**
    The minimum number of unique values across all groups for the sensitive attributes is 54. This is also much larger than our L-diversity criterion of l=3, ensuring that each group has well over three different values for the sensitive attributes.

- **Entropy of sensitive attributes:**
    The entropy values for the sensitive attributes across the age and gender groups show variation, with the entropy for groups with identifiers `[20, 40)`, `[40, 60)`, and `[60, 90)` ranging approximately between 6.04 and 7.57. These entropy values suggest a high level of unpredictability and diversity within each group, which aligns with the goal of privacy protection.


# 4. T-closeness

**T-closeness** is a privacy model that requires the distribution of a sensitive attribute in any group of records to be close to the distribution of the attribute in the overall dataset. "Close" is defined by a threshold t, which is a distance measure between the two distributions.

**Here are the steps we'll implement for t-closeness:**

- Calculate the overall distribution of the sensitive attributes 'serumcholestrol', 'restingBP', and 'maxheartrate'.
- Calculate the distribution of these attributes within each group defined by our k-anonymity criteria.
- Measure the distance between the two distributions for each group.
- Determine if the distance is within a threshold t, which we set based on domain knowledge or requirements.

We will need to decide on a distance measure to use. A common choice is the Earth Mover's Distance (also known as the [Wasserstein metric](https://en.wikipedia.org/wiki/Wasserstein_metric)), but simpler measures like the absolute difference in proportions can also be used for illustrative purposes.
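
For ordered categories (for example low, medium, high), the Earth Mover's Distance has a simple closed form: the sum of the absolute cumulative differences between the two distributions, divided by the number of categories minus one. The sketch below is a standard-library illustration under that ordered-distance assumption; the function name `ordered_emd` and the example proportions are hypothetical.

```python
def ordered_emd(p, q):
    """Earth Mover's Distance between two distributions over the same
    ordered categories, with adjacent categories at distance 1/(m-1)."""
    if len(p) != len(q) or len(p) < 2:
        raise ValueError("need two distributions over the same 2+ categories")
    m = len(p)
    cumulative, total = 0.0, 0.0
    # The last cumulative difference is always zero, so it is skipped
    for p_i, q_i in zip(p[:-1], q[:-1]):
        cumulative += p_i - q_i
        total += abs(cumulative)
    return total / (m - 1)

# Hypothetical proportions over ordered categories (low, medium, high)
overall = [0.2, 0.3, 0.5]
group = [0.5, 0.3, 0.2]
print(round(ordered_emd(group, overall), 3))
```

Unlike a per-category difference, this measure takes the ordering into account: moving probability mass from "low" to "high" costs more than moving it to "medium".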


In [17]:
# Calculate the overall distribution of the sensitive attributes 'serumcholestrol', 'restingBP', 'maxheartrate'
overall_serumcholestrol_distribution = data_anonymized['serumcholestrol'].value_counts(normalize=True)
overall_restingBP_distribution = data_anonymized['restingBP'].value_counts(normalize=True)
overall_maxheartrate_distribution = data_anonymized['maxheartrate'].value_counts(normalize=True)

(overall_serumcholestrol_distribution, overall_restingBP_distribution, overall_maxheartrate_distribution)

(serumcholestrol
 1.231585      0.001
 274.912680    0.001
 201.037272    0.001
 323.033214    0.001
 235.028628    0.001
               ...  
 385.280656    0.001
 359.206678    0.001
 317.192017    0.001
 418.787286    0.001
 270.216115    0.001
 Name: proportion, Length: 1000, dtype: float64,
 restingBP
 170.548145    0.001
 127.233318    0.001
 167.302834    0.001
 165.192108    0.001
 165.150287    0.001
               ...  
 191.620626    0.001
 142.283415    0.001
 121.418883    0.001
 163.712246    0.001
 157.993399    0.001
 Name: proportion, Length: 1000, dtype: float64,
 maxheartrate
 186    0.020
 138    0.019
 145    0.019
 168    0.018
 156    0.017
        ...  
 87     0.002
 92     0.002
 100    0.002
 102    0.001
 95     0.001
 Name: proportion, Length: 129, dtype: float64)

As can be seen, the results are too granular to be meaningful for our purposes. Let's split them into binned categories of below normal, normal and above normal values:

In [18]:
# Define the bin edges for categorizing the data based on medical guidelines
cholesterol_bins = [0, 200, 239, float('inf')]  # With right=False: [0, 200), [200, 239), [239, inf)
bp_systolic_bins = [0, 90, 120, 130, 140, float('inf')]  # Below 90, 90-120, 120-130, 130-140, 140 and above
bp_diastolic_bins = [0, 60, 80, 90, float('inf')]  # Below 60, 60-80, 80-90, 90 and above
heart_rate_bins = [0, 100, 220, float('inf')]  # Below normal, Normal range, Above normal

# Labels for the categories
cholesterol_labels = ['Below Normal', 'Normal', 'Above Normal']
bp_systolic_labels = ['Hypotension', 'Normal', 'Elevated', 'Hypertension Stage 1', 'Hypertension Stage 2']
bp_diastolic_labels = ['Hypotension', 'Normal', 'Hypertension Stage 1', 'Hypertension Stage 2']
heart_rate_labels = ['Below Normal', 'Normal', 'Above Normal']

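# Bin 'serumcholestrol', 'restingBP', and 'maxheartrate' into the defined categories
# Note: values below 0 (possible for 'serumcholestrol' after noise is added to zero readings)
# fall outside the bins, become NaN, and are excluded by value_counts()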
data_anonymized['cholesterol_category'] = pd.cut(data_anonymized['serumcholestrol'], bins=cholesterol_bins, labels=cholesterol_labels, right=False)
data_anonymized['restingBP_systolic_category'] = pd.cut(data_anonymized['restingBP'], bins=bp_systolic_bins, labels=bp_systolic_labels, right=False)
data_anonymized['maxheartrate_category'] = pd.cut(data_anonymized['maxheartrate'], bins=heart_rate_bins, labels=heart_rate_labels, right=False)

# Overall distribution of the sensitive attributes for the binned categories
overall_cholesterol_distribution = data_anonymized['cholesterol_category'].value_counts(normalize=True)
overall_restingBP_distribution = data_anonymized['restingBP_systolic_category'].value_counts(normalize=True)
overall_maxheartrate_distribution = data_anonymized['maxheartrate_category'].value_counts(normalize=True)

(overall_cholesterol_distribution, overall_restingBP_distribution, overall_maxheartrate_distribution)


(cholesterol_category
 Above Normal    0.760780
 Below Normal    0.155031
 Normal          0.084189
 Name: proportion, dtype: float64,
 restingBP_systolic_category
 Hypertension Stage 2    0.575
 Elevated                0.160
 Hypertension Stage 1    0.155
 Normal                  0.110
 Hypotension             0.000
 Name: proportion, dtype: float64,
 maxheartrate_category
 Normal          0.881
 Below Normal    0.119
 Above Normal    0.000
 Name: proportion, dtype: float64)

The overall distributions for 'serumcholestrol', 'restingBP', and 'maxheartrate' are as follows:

- For 'serumcholestrol':
  - Above Normal: 76.08%
  - Below Normal: 15.50%
  - Normal: 8.42%

- For 'restingBP':
  - Hypertension Stage 2: 57.5%
  - Elevated: 16.0%
  - Hypertension Stage 1: 15.5%
  - Normal: 11.0%
  - Hypotension: 0.0%

- For 'maxheartrate':
  - Normal: 88.1%
  - Below Normal: 11.9%
  - Above Normal: 0.0%

Next, we'll calculate the distribution of 'serumcholestrol', 'restingBP', and 'maxheartrate' within each group of records sharing the same combination of quasi-identifiers ('age_binned' and 'gender'). Then, we'll compare each group's distribution to the overall distribution to assess t-closeness. We need to select a threshold $t$ that represents an acceptable difference between these distributions.

For simplicity, we use the absolute difference in proportions for each category of the sensitive attributes between the group and the overall dataset as the measure of distance. If the maximum difference for any category within a group exceeds $t$, we'll consider it a violation of t-closeness.
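
As a worked illustration of this distance, the sketch below compares one hypothetical group distribution with the overall cholesterol proportions shown in the output above. The group values, the threshold, and the helper name `max_abs_difference` are assumptions made for the example; plain dictionaries are used so that it runs without the notebook state.

```python
# Overall proportions copied from the output above
overall = {"Above Normal": 0.760780, "Below Normal": 0.155031, "Normal": 0.084189}
# Hypothetical group distribution (illustrative values only)
group = {"Above Normal": 0.70, "Below Normal": 0.20, "Normal": 0.10}


def max_abs_difference(group_dist, overall_dist):
    """Largest absolute gap in proportion over the overall categories.

    Categories missing from the group count as proportion 0.
    """
    return max(abs(group_dist.get(k, 0.0) - p) for k, p in overall_dist.items())


t = 0.2  # illustrative threshold
distance = max_abs_difference(group, overall)
print(f"distance = {distance:.4f}, violates t-closeness: {distance > t}")
```

Here the largest gap comes from the 'Above Normal' category (about 0.061), so this hypothetical group would stay within the threshold.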


In [None]:
# Function to calculate the absolute difference in proportions for t-closeness
def calculate_t_closeness(group_distribution, overall_distribution):
    # Align the group distribution with the overall distribution to ensure matching indices
    group_distribution = group_distribution.reindex(overall_distribution.index, fill_value=0)
    # Calculate the absolute difference in proportions
    return (group_distribution - overall_distribution).abs().max()

# Initialize a list to track groups that violate t-closeness
t_closeness_violations = []

# Calculate t-closeness for each group
for name, group in data_anonymized.groupby(['age_binned', 'gender']):
    group_serumcholestrol_distribution = group['cholesterol_category'].value_counts(normalize=True)
    group_restingBP_distribution = group['restingBP_systolic_category'].value_counts(normalize=True)
    group_maxheartrate_distribution = group['maxheartrate_category'].value_counts(normalize=True)

    serumcholestrol_distance = calculate_t_closeness(group_serumcholestrol_distribution, overall_cholesterol_distribution)
    restingBP_distance = calculate_t_closeness(group_restingBP_distribution, overall_restingBP_distribution)
    maxheartrate_distance = calculate_t_closeness(group_maxheartrate_distribution, overall_maxheartrate_distribution)

    # Assuming a threshold of 0.2 for illustrative purposes
    if serumcholestrol_distance > 0.2 or restingBP_distance > 0.2 or maxheartrate_distance > 0.2:
        t_closeness_violations.append((name, serumcholestrol_distance, restingBP_distance, maxheartrate_distance))

# Create a DataFrame for violations for easier reading
t_closeness_violations_df = pd.DataFrame(t_closeness_violations, columns=['Group', 'SerumCholestrol_Distance', 'RestingBP_Distance', 'MaxHeartRate_Distance'])

t_closeness_violations_df


### Evaluation of t-closeness
The evaluation for t-closeness has resulted in an empty DataFrame for violations, indicating there are no groups where the absolute difference in proportions for 'serumcholestrol', 'restingBP', and 'maxheartrate' exceeds the threshold of 0.2. This means that, with respect to these sensitive attributes, each group of records is similar to the overall dataset within the specified threshold, and thus, the dataset complies with t-closeness with t=0.2.

This outcome is favorable as it suggests the dataset has been anonymized in such a manner that the distribution of sensitive attributes remains close to the overall distribution. This careful balance effectively reduces the risk of attribute disclosure while maintaining the utility of the dataset.

#### t-closeness evaluation metrics in detail:

The Earth Mover's Distance (EMD) and Maximum Divergence are two measures that can be used to evaluate t-closeness between distributions.

**Earth Mover's Distance (EMD)**   
EMD, also known as the ***Wasserstein metric*** or ***Kantorovich metric***, is a measure of the distance between two probability distributions over a region D. Intuitively, if the distributions are interpreted as two different ways of piling up a certain amount of dirt over the region D, the EMD is the minimum cost of turning one pile into the other; that is, it is the minimum amount of work needed to transform one distribution into the other, where "work" is measured as the amount of distribution weight that must be moved, multiplied by the distance it has to be moved.

In its general form, EMD is an optimal-transport problem that typically requires linear programming to solve. It is given by:

$ EMD(p, q) = \inf_{\gamma \in \Pi(p, q)} \int_{D \times D} d(x, y) \, d\gamma(x, y) $

where:
- $p$ and $q$ are the two probability distributions.
- $\gamma$ ranges over all possible joint distributions with marginals $p$ and $q$
- $d(x, y)$ is the ground distance between points $x$ and $y$
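
For ordered categories placed at consecutive integer positions, the one-dimensional EMD reduces to the sum of absolute differences between the two cumulative distributions, so no linear program is needed. The sketch below assumes unit spacing between adjacent categories; the function name `emd_ordinal` and the group proportions are hypothetical, while the overall proportions are the cholesterol values reported earlier.

```python
from itertools import accumulate


def emd_ordinal(p, q):
    """EMD between two distributions over the same ordered categories.

    p and q are sequences of proportions listed in category order, with
    adjacent categories assumed to be one unit apart.
    """
    if len(p) != len(q):
        raise ValueError("p and q must cover the same categories")
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    # Both cumulative sums end at (approximately) 1, so the last term is ~0.
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


# Category order: Below Normal, Normal, Above Normal
overall = [0.155031, 0.084189, 0.760780]  # overall cholesterol proportions
group = [0.20, 0.10, 0.70]                # hypothetical group

print(round(emd_ordinal(group, overall), 6))
```

With this formulation each category needs a distinct position in its attribute's own ordering, for example five positions for the five blood pressure labels.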

**Maximum Divergence**   
The Maximum Divergence measures the maximum absolute difference between the probabilities of the same event under two different probability distributions.

The formula for the Maximum Divergence is:

$MaxDivergence(p, q) = \max_{i} \lvert p(x_i) - q(x_i) \rvert$

where:
- $p$ and $q$ are the two probability distributions.
- $x_i$ are the events for which the probabilities are compared.

**Applying EMD and Max Divergence**   
Python's `scipy.stats` library provides the `wasserstein_distance` function for calculating EMD. Maximum Divergence is a simple max operation on the absolute difference of probabilities, so we compute it directly.

We now compare the distributions of 'serumcholestrol', 'restingBP', and 'maxheartrate' within each group to the overall distribution, calculating both measures to identify any significant deviations.

In [20]:
from scipy.stats import wasserstein_distance

# Update the function to handle categorical data by mapping categories to numerical values
def calculate_emd_max_divergence(group_distribution, overall_distribution):
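    # Define a mapping for the categories to numerical values.
    # Note: 'Normal' and 'Elevated' both map to 2, so for blood pressure these two
    # categories share the same position when the EMD is computed.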
    category_mapping = {'Below Normal': 1, 'Normal': 2, 'Above Normal': 3, 'Hypotension': 1, 'Elevated': 2, 'Hypertension Stage 1': 3, 'Hypertension Stage 2': 4}

    # Convert category labels to numbers using the defined mapping
    group_distribution_num = group_distribution.rename(index=category_mapping)
    overall_distribution_num = overall_distribution.rename(index=category_mapping)

    # Calculate EMD
    emd_value = wasserstein_distance(group_distribution_num.index, overall_distribution_num.index, group_distribution_num, overall_distribution_num)

    # Calculate Max Divergence
    max_divergence_value = max(abs(a - b) for a, b in zip(group_distribution_num, overall_distribution_num))

    return emd_value, max_divergence_value

# Calculate normalized distributions for 'serumcholestrol', 'restingBP', 'maxheartrate'
normalized_serumcholestrol = data_anonymized['cholesterol_category'].value_counts(normalize=True, sort=False)
normalized_restingBP = data_anonymized['restingBP_systolic_category'].value_counts(normalize=True, sort=False)
normalized_maxheartrate = data_anonymized['maxheartrate_category'].value_counts(normalize=True, sort=False)

# Initialize a list to track the distances for each group
t_closeness_metrics = []

# Calculate EMD and Max Divergence for each group
for name, group in data_anonymized.groupby(['age_binned', 'gender']):
    # Calculate the group's normalized distribution for each attribute
    group_serumcholestrol_dist = group['cholesterol_category'].value_counts(normalize=True, sort=False).reindex(normalized_serumcholestrol.index, fill_value=0)
    group_restingBP_dist = group['restingBP_systolic_category'].value_counts(normalize=True, sort=False).reindex(normalized_restingBP.index, fill_value=0)
    group_maxheartrate_dist = group['maxheartrate_category'].value_counts(normalize=True, sort=False).reindex(normalized_maxheartrate.index, fill_value=0)

    # Calculate EMD and Max Divergence for 'serumcholestrol'
    serumcholestrol_emd, serumcholestrol_max_div = calculate_emd_max_divergence(group_serumcholestrol_dist, normalized_serumcholestrol)
    # Calculate EMD and Max Divergence for 'restingBP'
    restingBP_emd, restingBP_max_div = calculate_emd_max_divergence(group_restingBP_dist, normalized_restingBP)
    # Calculate EMD and Max Divergence for 'maxheartrate'
    maxheartrate_emd, maxheartrate_max_div = calculate_emd_max_divergence(group_maxheartrate_dist, normalized_maxheartrate)

    t_closeness_metrics.append((name, serumcholestrol_emd, serumcholestrol_max_div, restingBP_emd, restingBP_max_div, maxheartrate_emd, maxheartrate_max_div))

# Create a DataFrame for the distances for easier reading
t_closeness_metrics_df = pd.DataFrame(t_closeness_metrics, columns=['Group', 'SerumCholestrol_EMD', 'SerumCholestrol_MaxDiv', 'RestingBP_EMD', 'RestingBP_MaxDiv', 'MaxHeartRate_EMD', 'MaxHeartRate_MaxDiv'])

t_closeness_metrics_df

Unnamed: 0,Group,SerumCholestrol_EMD,SerumCholestrol_MaxDiv,RestingBP_EMD,RestingBP_MaxDiv,MaxHeartRate_EMD,MaxHeartRate_MaxDiv
0,"([20, 40), A)",0.019083,0.014114,0.182662,0.182662,0.023857,0.023857
1,"([20, 40), B)",0.035938,0.022328,0.051679,0.051679,0.023059,0.023059
2,"([40, 60), A)",0.114251,0.061697,0.091753,0.097013,0.062818,0.062818
3,"([40, 60), B)",0.013313,0.008679,0.036463,0.030691,0.021439,0.021439
4,"([60, 90), A)",0.056588,0.044415,0.077716,0.048889,0.090877,0.090877
5,"([60, 90), B)",0.010073,0.010073,0.071129,0.029516,0.010129,0.010129


The `t_closeness_metrics_df` DataFrame now contains the Earth Mover's Distance (EMD) and Maximum Divergence for 'serumcholestrol', 'restingBP', and 'maxheartrate' within each group, compared to the overall distribution.

- **SerumCholestrol_EMD**, **RestingBP_EMD** and **MaxHeartRate_EMD** columns: These show the Earth Mover's Distance for 'serumcholestrol', 'restingBP' and 'maxheartrate' respectively. The EMD values in the output are relatively low across all attributes and groups (the largest is about 0.18, for 'restingBP' in the `[20, 40)`, `A` group), suggesting that the distributions within each group are close to their respective overall distributions in the dataset. This closeness indicates a good level of t-closeness compliance, meaning the anonymization process has preserved the natural distribution of sensitive attributes.

- **SerumCholestrol_MaxDiv**, **RestingBP_MaxDiv** and **MaxHeartRate_MaxDiv** columns: These show the Maximum Divergence for 'serumcholestrol', 'restingBP' and 'maxheartrate' respectively. These values represent the maximum absolute difference in proportions for each value of the sensitive attributes between the group and the overall dataset. Smaller values are better, indicating a closer match to the overall distribution. The values in the output are of the same order as the EMD values, further confirming the similarity between group-specific and overall distributions for each sensitive attribute.

From the DataFrame, we observe that most groups have relatively small EMD and Maximum Divergence values, suggesting that the distribution of 'serumcholestrol', 'restingBP', and 'maxheartrate' within these groups is not significantly different from the overall distribution. This indicates a good level of compliance with t-closeness.

To evaluate whether these distances are acceptable, one would compare them to a predefined threshold $t$. This threshold would depend on the specific privacy requirements and the context in which the data will be used.

If any group's distance exceeds the acceptable threshold, further anonymization techniques would need to be applied to that group to reduce the distance and ensure compliance with t-closeness.
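
The threshold check described above can be sketched as follows. The EMD values are copied from the table above; the threshold `t = 0.15` and the names `emd_by_group` and `groups_exceeding` are assumptions chosen for illustration, not a recommended setting.

```python
# EMD values per (age bin, gender) group, copied from the table above
emd_by_group = {
    ("[20, 40)", "A"): {"serumcholestrol": 0.019083, "restingBP": 0.182662, "maxheartrate": 0.023857},
    ("[20, 40)", "B"): {"serumcholestrol": 0.035938, "restingBP": 0.051679, "maxheartrate": 0.023059},
    ("[40, 60)", "A"): {"serumcholestrol": 0.114251, "restingBP": 0.091753, "maxheartrate": 0.062818},
    ("[40, 60)", "B"): {"serumcholestrol": 0.013313, "restingBP": 0.036463, "maxheartrate": 0.021439},
    ("[60, 90)", "A"): {"serumcholestrol": 0.056588, "restingBP": 0.077716, "maxheartrate": 0.090877},
    ("[60, 90)", "B"): {"serumcholestrol": 0.010073, "restingBP": 0.071129, "maxheartrate": 0.010129},
}


def groups_exceeding(metrics, t):
    """Return {group: {attribute: distance}} for every distance above t."""
    flagged = {}
    for group, distances in metrics.items():
        over = {attr: d for attr, d in distances.items() if d > t}
        if over:
            flagged[group] = over
    return flagged


t = 0.15  # illustrative threshold only
for group, over in groups_exceeding(emd_by_group, t).items():
    print(group, over)
```

With this stricter illustrative threshold only the `[20, 40)`, `A` group is flagged, and only for 'restingBP'; at `t = 0.2` no group is flagged.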

In [21]:
data_anonymized.to_csv('data_anonymized.csv', encoding='utf-8')

# 5. Information Loss

**Information Loss (IL)** is a measure used to quantify the amount of data utility that is lost as a result of anonymization or data transformation. It reflects the loss of detail or accuracy in the data, which can impact the usefulness of the data for analysis. There are many ways to calculate information loss, depending on the type of data and the anonymization technique used.

For continuous variables, we often use the **Root Mean Squared Error (RMSE)**, which summarizes the differences between the original and the anonymized values: it is the square root of the mean of the squared differences.

It is defined as:

$
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (o_i - a_i)^2}
$

where $ o_i $ is the original value and $ a_i $ is the anonymized value for the $i^{th}$ observation.
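
A minimal sketch of this formula is shown below. The original values are the `restingBP` readings of the first five records; the anonymized values are a hypothetical perturbation (rounding to the nearest 10) used only to have something to compare against, and the function name `rmse` is likewise an assumption.

```python
import math


def rmse(original, anonymized):
    """Root mean squared error between two equally long, non-empty sequences."""
    if len(original) != len(anonymized):
        raise ValueError("sequences must have the same length")
    squared_errors = [(o - a) ** 2 for o, a in zip(original, anonymized)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


original_bp = [171, 94, 133, 138, 199]    # restingBP of the first five records
anonymized_bp = [170, 90, 130, 140, 200]  # hypothetical: rounded to the nearest 10

print(round(rmse(original_bp, anonymized_bp), 3))
```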

For categorical and binary variables, the proportion of changed values is calculated. This is simply the count of values that were changed by the anonymization process divided by the total number of values:

$
\text{Proportion Changed} = \frac{\text{Number of changed values}}{n}
$

where $n$ is the total number of values.
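
The proportion of changed values can be sketched in plain Python as follows. The two lists are hypothetical, and this helper is only an illustration of the formula.

```python
def proportion_changed(original, anonymized):
    """Share of positions whose value differs between the two sequences."""
    if len(original) != len(anonymized):
        raise ValueError("sequences must have the same length")
    changed = sum(o != a for o, a in zip(original, anonymized))
    return changed / len(original)


original_gender = [1, 1, 0, 1, 0]    # hypothetical original coding
anonymized_gender = [1, 1, 0, 0, 0]  # hypothetical: one value altered

print(proportion_changed(original_gender, anonymized_gender))
```

Note that a pure relabelling (for example 0/1 recoded as 'A'/'B') would count every value as changed, so labels should be mapped back to the original coding before the comparison.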

**To compute the information loss between the original cardiovascular_disease_m dataset and the data_anonymized dataset, we can use the following approach:**

- Load the original cardiovascular_disease_m dataset and remove the direct identifiers.
- Load the data_anonymized dataset. Since we've categorized attributes like `serumcholestrol`, `restingBP`, and `maxheartrate` into categories such as 'Below Normal', 'Normal', and 'Above Normal', we'll compare these categorical representations directly without converting them back to integer series.
- Align the datasets based on the indexes so that we can compare them directly.
- Calculate information loss metrics for the quantitative attributes (`serumcholestrol`, `restingBP`, and `maxheartrate`) by considering the categorization as part of the anonymization process.

For the attributes `serumcholestrol`, `restingBP`, and `maxheartrate` which we've categorized, we can consider the information loss as the deviation from the original distribution to the categorized form. This considers the granularity reduction as a form of information loss. For purely categorical attributes and binary attributes (like gender), we can measure the information loss by the proportion of changed values or by evaluating the change in distribution patterns.

Let's start by loading both datasets and preparing them for the information loss calculation.

In [22]:
original_data = pd.read_csv('cardiovascular_disease_m.csv')
identifiers = ['name', 'address', 'email', 'ssn', 'patientid']
original_data.drop(identifiers, axis=1, inplace=True)

necessary_columns = ['age', 'gender', 'chestpain', 'restingBP', 'serumcholestrol', 'fastingbloodsugar', 'restingrelectro', 'maxheartrate', 'exerciseangia', 'oldpeak', 'slope', 'noofmajorvessels', 'target']
original_data = original_data[necessary_columns]

# Load the anonymized dataset
anonymized_data_path = 'data_anonymized.csv'
anonymized_data = pd.read_csv(anonymized_data_path, index_col=0)


(original_data.head(),
anonymized_data.head())


(   age  gender  chestpain  restingBP  serumcholestrol  fastingbloodsugar  \
 0   53       1          2        171                0                  0   
 1   40       1          0         94              229                  0   
 2   49       1          2        133              142                  0   
 3   43       1          0        138              295                  1   
 4   31       1          1        199                0                  0   
 
    restingrelectro  maxheartrate  exerciseangia  oldpeak  slope  \
 0                1           147              0      5.3      3   
 1                1           115              0      3.7      1   
 2                0           202              1      5.0      1   
 3                1           153              0      3.2      2   
 4                2           136              0      5.3      3   
 
    noofmajorvessels  target  
 0                 3       1  
 1                 1       0  
 2                 0       0  
 3

In [23]:
# Remove `nan` values in 'cholesterol_category'
anonymized_data_clean = anonymized_data.dropna(subset=['cholesterol_category']).copy()

# Make sure the original dataset is aligned
original_data_aligned = original_data.loc[anonymized_data_clean.index]

Next, we will calculate the information loss for each column.
- For the sensitive attributes `serumcholestrol`, `restingBP`, and `maxheartrate` which have been categorized into 'Below Normal', 'Normal', and 'Above Normal', we will evaluate information loss based on changes in distribution. Since these attributes have been transformed from continuous to categorical data, a direct RMSE calculation isn't applicable. Instead, we will look at the distribution shifts to understand the impact of categorization. We anticipate some degree of information loss. Our goal is to quantify this loss to understand the trade-off between data privacy and utility.

In [None]:
# 'A' as 0 (Female) and 'B' as 1 (Male)
anonymized_data_clean['gender'] = anonymized_data_clean['gender'].replace({'A': 0, 'B': 1})

# Proportion of changed values for binary variables
def proportion_changed(original, anonymized):
    return (original != anonymized).mean()

# Information loss calculation for the 'gender' variable
info_loss_gender = proportion_changed(original_data_aligned['gender'], anonymized_data_clean['gender'])

# 'serumcholestrol', 'restingBP' and 'maxheartrate' are categorized, so we evaluate their distribution changes
distributions_original = original_data_aligned[['restingBP', 'serumcholestrol', 'maxheartrate']].describe()
distributions_anonymized = anonymized_data_clean[['restingBP_systolic_category', 'cholesterol_category', 'maxheartrate_category']].apply(lambda x: x.value_counts(normalize=True))

(info_loss_gender, distributions_original, distributions_anonymized)

What we see here:

### Quantitative Information Loss:
- The original `serumcholestrol` values have a mean of 319.76 mg/dL and a standard deviation of 123.89 mg/dL. After anonymization, this detailed information is reduced to three broad categories.
- The standard deviation provides information about the variability of the cholesterol levels among patients, which is lost in the anonymized data. We no longer know how the values are organized: whether they are clustered around the mean or if they are spread out, because the data is now categorized only as `Below Normal` (15.50%), `Normal` (8.42%), and `Above Normal` (76.08%).
- The detailed heart rate information has been degraded the same way. Instead of a spread of values between 71 and 202 beats per minute, we only know that 87.88% of patients are considered `Normal`.

### Qualitative Information Loss:
- While the original data might show clusters or patterns within the `serumcholestrol` levels (such as a group with extremely high cholesterol), this information cannot be seen in the anonymized data.
- For example, anyone with cholesterol above 239 mg/dL belongs to the `Above Normal` category, regardless of whether their level is mildly higher than the threshold or critically higher. This aggregation can mask important health risk factors.
- For `restingBP`, we lose the ability to see granular changes and trends in blood pressure measurements. A person with a resting blood pressure of 140 mm Hg and another with 180 mm Hg are both categorized as `Hypertension Stage 2`, though their health conditions might be different.
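
The aggregation effect can be made concrete with a small sketch. The bin edges of 200 and 239 mg/dL and the helper name `cholesterol_category` are assumptions for illustration; intervals are treated as right-closed, and non-positive values are not handled specially.

```python
from bisect import bisect_left

# Assumed upper edges of the first two bins (mg/dL)
EDGES = [200, 239]
LABELS = ["Below Normal", "Normal", "Above Normal"]


def cholesterol_category(value_mg_dl):
    """Map a cholesterol value to its category using right-closed bins."""
    return LABELS[bisect_left(EDGES, value_mg_dl)]


# A mildly elevated and a very high value end up indistinguishable
for value in (240, 400):
    print(value, "->", cholesterol_category(value))
```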

\
This is the hands-on display of the anonymization process preserving privacy and overall distribution trends, but sacrificing the ability to perform detailed data analysis that requires specific numeric values.
\
\
Below we display the same information as a bar plot, using `serumcholestrol` as an example:

In [None]:
# Visual comparison for 'serumcholestrol'
from sklearn.metrics import mutual_info_score

fig, ax = plt.subplots(1, 2, figsize=(12, 5))
original_data_aligned['serumcholestrol'].plot(kind='hist', bins=30, alpha=0.7, ax=ax[0], title='Original Serum Cholesterol Distribution')
anonymized_data_clean['cholesterol_category'].value_counts(normalize=True).plot(kind='bar', alpha=0.7, ax=ax[1], title='Anonymized Cholesterol Distribution by Category')
plt.show()

# Binning 'serumcholestrol' in the original aligned dataset for comparison
original_cholesterol_binned_aligned = pd.cut(original_data_aligned['serumcholestrol'], bins=[0, 200, 239, np.inf], labels=['Below Normal', 'Normal', 'Above Normal'])

original_cholesterol_binned_aligned = original_cholesterol_binned_aligned.dropna()
anonymized_data_final = anonymized_data_clean.loc[original_cholesterol_binned_aligned.index]

# Calculating mutual information using the cleaned and aligned datasets
mutual_info = mutual_info_score(original_cholesterol_binned_aligned, anonymized_data_final['cholesterol_category'])
print(f"Mutual Information between original and anonymized 'serumcholestrol': {mutual_info}")

**Mutual information** measures the mutual dependence between two variables:

- If two variables are independent, knowing the value of one provides no information about the other, and their mutual information is zero.

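- Conversely, if knowing the value of one variable completely determines the value of the other, the mutual information reaches its maximum, which equals the entropy of that variable. `mutual_info_score` reports the value in nats (natural logarithm) and does not normalize it, so for three categories the upper bound is $\ln 3 \approx 1.10$ rather than 1.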

\
A mutual information score of **0.6539** is relatively high, suggesting that the anonymized categorical data holds a substantial amount of the information from the original continuous data.

This means that the categories 'Below Normal', 'Normal', and 'Above Normal' are still informative about the original 'serumcholestrol' distribution.

However, the score isn't at the maximum, which means some information has been lost after the categorization.
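
To put the score on a scale, the sketch below computes mutual information with natural logarithms (nats), the convention `mutual_info_score` uses, and shows that a variable compared with itself yields its own entropy. The label counts and the function names `entropy` and `mutual_information` are hypothetical and only roughly follow the category proportions reported earlier.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy of a label sequence, in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """Mutual information between two equally long label sequences, in nats."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    # p(a, b) * log(p(a, b) / (p(a) * p(b))), written with raw counts
    return sum((c / n) * log(c * n / (px[a] * py[b])) for (a, b), c in joint.items())


# Hypothetical labels, roughly 76% / 16% / 8%
labels = ["Above Normal"] * 76 + ["Below Normal"] * 16 + ["Normal"] * 8

# Two independent variables: every combination occurs equally often
x_indep = ["a", "a", "b", "b"]
y_indep = ["u", "v", "u", "v"]

print("MI(labels, labels):", round(mutual_information(labels, labels), 4))
print("entropy(labels):   ", round(entropy(labels), 4))
print("MI(independent):   ", round(mutual_information(x_indep, y_indep), 4))
```

For these hypothetical labels the attainable maximum is the entropy, about 0.70 nats, which is the scale against which a score such as 0.6539 would be read.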

# References
- [Guidelines for Anonymization & Pseudonymization](https://ispo.newschool.edu/guidelines/anonymization-pseudonymization/), 2019-2023
- Ratra, Ritu & Gulia, Preeti. (2020). Privacy Preserving Data Mining: Techniques and Algorithms. International Journal of Engineering Trends and Technology. 68. 56-62. 10.14445/22315381/IJETT-V68I11P207. https://ijettjournal.org/archive/ijett-v68i11p207
- Guide to Basic Data Anonymisation Techniques (published 25 January 2018). Personal Data Protection Commission Singapore (PDPC). https://www.pdpc.gov.sg/-/media/Files/PDPC/PDF-Files/Other-Guides/Guide-to-Anonymisation_v1-(250118).pdf
- Agrawal, R. and Srikant, R., 2000, May. Privacy-preserving data mining. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data (pp. 439-450). IBM Almaden Research Center. https://dl.acm.org/doi/pdf/10.1145/342009.335438