#  Data Anonymization and Privacy-Enhancing Technologies (PETs)

## 🎯 Learning Goals
- Understand what data anonymization is and why it matters.
- Apply anonymization techniques to structured, text and audio data.
- Understand how anonymization supports compliance with privacy laws (GDPR, CCPA).

### 🔐 Introduction to Data Anonymization

- **Data anonymization** removes personally identifiable information (PII) from datasets.
- **Pseudonymization** retains a link to identity (e.g., using a key); anonymization does not.
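
The difference between the two can be illustrated with a small sketch. It is not part of the exercises: the records, the `SECRET_KEY` value, and the `pseudonymize` helper are hypothetical, and a keyed hash (HMAC) is only one possible way to pseudonymize an identifier.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored separately from the data.
SECRET_KEY = b"example-secret-key"

def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed hash of the identifier (same input and key give the same token)."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability

records = [
    {"name": "Alice Example", "age": 34},
    {"name": "Bob Example", "age": 51},
]

# Pseudonymization: replace the name with a token that the key holder can reproduce.
pseudonymized = [{"id": pseudonymize(r["name"]), "age": r["age"]} for r in records]

# Anonymization (here: simple suppression): drop the direct identifier entirely.
anonymized = [{"age": r["age"]} for r in records]

print(pseudonymized)
print(anonymized)
```

Anyone holding the key can recompute a token and link it back to a name, which is why pseudonymized data is still linkable. Dropping the name alone may not be enough either, because quasi-identifiers such as age can still single people out; that is what Part 1 addresses.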

### 🧰 Techniques for Anonymizing Different Data Types
- **Structured Data**: k-anonymity, l-diversity, t-closeness
- **Text Data**: Named Entity Recognition (NER) + masking
- **Images**: Face blurring, pixelation
- **Audio**: Voice masking, pitch shift

### 📜 Legal Relevance
- **GDPR & CCPA**: Require data minimization and protection.
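- Data that is truly anonymized, so that individuals can no longer be identified, generally falls outside the definition of 'personal data' under these laws; pseudonymized data does not.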

##  Part 1: Structured Data Anonymization (k-Anonymity)

### 💡 Task 1 

#### Description:
You are provided with a structured dataset (CSV file) containing personal attributes like Age, Education, Occupation, Relationship, Sex, and Country. Your task is to anonymize this data using **k-anonymity** by binning continuous data (like Age) and generalizing or masking quasi-identifiers such as Country or Occupation. Then assess whether individual identities could still be inferred.


- Generalize and group quasi-identifiers (e.g., Age and Country)
- Ensure each combination of quasi-identifiers occurs at least **k** times
- Evaluate anonymity level before and after transformation


Click [***here***](https://github.com/SanthwanaT/Seminar-3/blob/main/structured_data.csv) to understand more about the dataset 

In [None]:
!pip install pandas scikit-learn spacy opencv-python-headless matplotlib librosa pydub
!python -m spacy download en_core_web_sm

### 💡 Task 1.1 : Upload and Load the Dataset 

- Upload the dataset from your local machine to Colab.
- Read it into a pandas DataFrame.
- Preview the first few rows.

**If you are not using Colab:** do not use `from google.colab import files`. Instead, load the CSV directly from a local file path (for example with `pd.read_csv`).



In [None]:
# Upload the dataset from your local machine
from google.colab import files
uploaded = files.upload()

# Load the uploaded CSV file
import pandas as pd

data =   # Use the name of the uploaded file


### 💡 Task 1.2 : Clean and Standardize String Columns

- Whitespace and inconsistent casing can prevent accurate mapping.
- Standardize entries in Education, Occupation, and Country:
- Remove extra spaces.
- Capitalize words consistently.

In [None]:
# Strip whitespace and convert to title case
for col in ['Education', 'Occupation', 'Country']:
    data[col] = data[col].str.strip().str.title()

# View distinct cleaned values
data[['Education', 'Occupation', 'Country']].drop_duplicates()

### 💡 Task 1.3: Bin the 'Age' Attribute into Groups

- Convert the numerical Age column into intervals (e.g., "21–30", "31–40"), with bins covering the range 0 to 100.
- This reduces granularity and increases privacy.
- Binning helps meet the k-anonymity requirement.
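
As a rough illustration of binning, the sketch below applies `pd.cut` to a few made-up ages. The sample values, the decade-wide bins, and the label format are assumptions for this example only; the exercise leaves the actual choice to you.

```python
import pandas as pd

# Made-up ages, not taken from the exercise dataset.
ages = pd.Series([19, 23, 30, 31, 47, 68], name="Age")

# Decade-wide bin edges from 0 to 100. With the default right=True,
# each bin is the half-open interval (low, high].
bins = list(range(0, 101, 10))
labels = [f"{low + 1}-{high}" for low, high in zip(bins[:-1], bins[1:])]

age_group = pd.cut(ages, bins=bins, labels=labels)
print(age_group.tolist())
```

Because the intervals are open on the left, an age of exactly 0 would become `NaN` here; passing `include_lowest=True` to `pd.cut` is one way to include it.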

In [None]:
# Define bins and labels
bins = 
labels = 

# Apply binning
data['Age_group'] = 

# View the binned age groups


### 💡 Task 1.4: Generalize Quasi-Identifiers

- Replace specific values with general categories:
  - Education: into { 'Bachelors': 'HigherEd', 'Masters': 'HigherEd', 'Doctorate': 'HigherEd',
    'Hs-Grad': 'HighSchool', 'Some-College': 'HighSchool',
    '11Th': 'Dropout', '9Th': 'Dropout', '7Th-8Th': 'Dropout',
    'Assoc-Acdm': 'Associate', 'Assoc-Voc': 'Associate'}

  - Occupation: into broader fields like { 'Adm-Clerical': 'Clerical', 'Exec-Managerial': 'Management',
    'Handlers-Cleaners': 'Manual Labor', 'Prof-Specialty': 'Professional',
    'Other-Service': 'Service', 'Sales': 'Sales',
    'Craft-Repair': 'Skilled Trade', 'Transport-Moving': 'Transport',
    'Farming-Fishing': 'Agriculture', 'Machine-Op-Inspct': 'Manufacturing' }

  - Country: into geographic regions { 'United-States': 'North America', 'Cuba': 'Latin America',
    'Jamaica': 'Latin America', 'India': 'Asia', 'Mexico': 'Latin America'}
    
- These fields are quasi-identifiers and must be generalized to protect identity.
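
A minimal sketch of dictionary-based generalization on a toy column follows. The toy frame, the extra value `France`, and the `'Other'` fallback for unmapped values are assumptions of this example, not part of the task.

```python
import pandas as pd

# Toy data; 'France' is deliberately absent from the mapping below.
toy = pd.DataFrame({"Country": ["United-States", "Cuba", "India", "France"]})

# Subset of the country-to-region mapping listed above.
region_map = {
    "United-States": "North America",
    "Cuba": "Latin America",
    "India": "Asia",
}

# Series.map() yields NaN for values missing from the dictionary;
# replacing them with 'Other' is a choice made for this sketch.
toy["Region"] = toy["Country"].map(region_map).fillna("Other")
print(toy)
```

Checking for unmapped values before filling them in (for example with `toy["Country"].map(region_map).isna().sum()`) helps catch spelling or casing mismatches between the data and the mapping.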


In [None]:
# Generalize Education
education_map = 

# Generalize Occupation
occupation_map = 

# Generalize Country to Region
region_map = 

# Apply generalizations
data['Education_gen'] = 
data['Occupation_gen'] =
data['Region'] =

# View generalized fields



### 💡 Task 1.5: Apply k-Anonymity Check

- Define the list of quasi-identifiers (QIs), e.g., Age_group, Education_gen, Occupation_gen, Relationship, Sex, Region.
- Group the data by the QIs and count the frequency of each group.
- Identify combinations that appear fewer than k times (these violate k-anonymity).
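
The check can be sketched on a toy table as follows. The rows and the two-column QI list are invented for illustration and use k = 2 rather than the k = 3 of the exercise.

```python
import pandas as pd

# Invented rows with two quasi-identifiers.
toy = pd.DataFrame({
    "Age_group": ["21-30", "21-30", "21-30", "31-40", "31-40"],
    "Region": ["Asia", "Asia", "Asia", "Asia", "Europe"],
})
k = 2
qi_columns = ["Age_group", "Region"]

# Size of each equivalence class (rows sharing the same QI values).
group_sizes = toy.groupby(qi_columns).size().reset_index(name="count")

# Classes smaller than k violate k-anonymity.
violations = group_sizes[group_sizes["count"] < k]

# The table is k-anonymous only for k up to the smallest class size.
achieved_k = int(group_sizes["count"].min())

print(group_sizes)
print(violations)
print("Largest k satisfied:", achieved_k)
```

If a QI column is categorical (as the output of `pd.cut` is), `groupby` may also list combinations that never occur, with a count of 0, depending on the pandas version; passing `observed=True` to `groupby` restricts the result to combinations present in the data.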



In [None]:
# Set the k value
k = 3

# Define quasi-identifier columns
qi_columns =

# Count occurrences of each QI group
grouped = 

# Identify violating combinations
violations = 

print(f"\nQuasi-identifier combinations violating {k}-anonymity:")
violations



### 💡 Task 1.6: Save and Download Anonymized Data

- Select only the anonymized/generalized columns.
- Save the result to a new CSV.
- Provide download link for student use

In [None]:
# Select only the anonymized fields
anonymized_columns = 
data_anonymized =

# Save the anonymized dataset


# Download the CSV file in Colab


### 🔐 What are l-Diversity and t-Closeness?

- l-Diversity extends k-anonymity by ensuring diversity in sensitive attributes (e.g., income, disease) within each group.
- t-Closeness ensures that the distribution of sensitive values in each group is close to the overall distribution.




### 💡 Task 1.7:  Apply l-Diversity

- Ensure that each group (based on quasi-identifiers) has at least l distinct values of the sensitive attribute. This protects against homogeneity attacks.

### 🧾 Instructions:
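- Define quasi-identifiers (QI) and the sensitive attribute (e.g., 'Education'). The sensitive attribute should not also appear among the quasi-identifiers, so if you choose 'Education', remove `Education_gen` from the QI list for this check.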
- Group by QI.
- For each group, count the number of unique values in the sensitive attribute.
- Check if this count ≥ l.
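
A small sketch of these steps on invented data is shown below. The `Diagnosis` column is a hypothetical sensitive attribute that does not exist in the exercise dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "Age_group": ["21-30", "21-30", "21-30", "31-40", "31-40"],
    "Region": ["Asia", "Asia", "Asia", "Asia", "Asia"],
    "Diagnosis": ["Flu", "Asthma", "Flu", "Flu", "Flu"],  # hypothetical sensitive attribute
})
l = 2
qi_columns = ["Age_group", "Region"]
sensitive_attr = "Diagnosis"

# Number of distinct sensitive values inside each QI group.
distinct_counts = toy.groupby(qi_columns)[sensitive_attr].nunique()

# Groups with fewer than l distinct values are open to homogeneity attacks.
non_diverse = distinct_counts[distinct_counts < l]

print(distinct_counts)
print(non_diverse)
```

In this toy table the group (31-40, Asia) contains two rows, so it is 2-anonymous, yet both rows share the same diagnosis. Anyone who knows a person belongs to that group learns the diagnosis, which is the weakness l-diversity targets.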



In [None]:
# l-Diversity Check

l = 2  # Minimum diversity level required
sensitive_attr = 

l_diverse_groups = []
non_diverse_groups = []

for name, group in data.groupby(qi_columns):
    diversity = 
    if diversity >= l:
       
    else:
        
# Print the Groups satisfying l-diversity and Groups violating l-diversity


### 💡 Task 1.8 : Apply t-Closeness

Prevent attribute disclosure by making sure that the distribution of sensitive values within each group is close to the global distribution (using Total Variation Distance).

### 🧾 Instructions:
- Compute global distribution of the sensitive attribute.
- For each group:
  - Compute local distribution.
  - Compare with global distribution using Total Variation Distance (TVD).
    $$
             \text{Distance} = \frac{1}{2} \sum_{i=1}^{n} \left| P_i - Q_i \right|
    $$
    Where:

     - \( P_i \) = group distribution for class \( i \)  
     - \( Q_i \) = global distribution for class \( i \)


  - Flag groups where TVD > threshold t.
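
The TVD computation can be sketched on a toy table like this. The data, the single quasi-identifier `Region`, and the `Diagnosis` column are assumptions made for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Region": ["Asia", "Asia", "Asia", "Asia", "Europe", "Europe"],
    "Diagnosis": ["Flu", "Flu", "Asthma", "Asthma", "Flu", "Flu"],  # hypothetical
})
t = 0.3

# Global distribution Q of the sensitive attribute (relative frequencies).
global_dist = toy["Diagnosis"].value_counts(normalize=True)

tvd_per_group = {}
for name, group in toy.groupby("Region"):
    # Local distribution P for this group.
    group_dist = group["Diagnosis"].value_counts(normalize=True)
    # Align to the global categories; classes absent from the group get probability 0.
    group_dist = group_dist.reindex(global_dist.index, fill_value=0)
    # Total Variation Distance: half the sum of absolute differences.
    tvd_per_group[name] = 0.5 * (group_dist - global_dist).abs().sum()

t_violations = [name for name, dist in tvd_per_group.items() if dist > t]
print(tvd_per_group)
print(t_violations)
```

Here the global distribution is Flu 2/3 and Asthma 1/3. The Asia group (1/2, 1/2) has a TVD of 1/6, while the Europe group (1, 0) has a TVD of 1/3 and is flagged because it exceeds t = 0.3.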

In [None]:
from numpy import sum as npsum

#  t-Closeness Check

t = 0.3  # Threshold for closeness
t_violations = []

# Global distribution
global_dist = 

for name, group in data.groupby(qi_columns):
    group_dist = 
    
    # Align indices
    group_dist = 
    
    # Total Variation Distance (TVD)
    distance = 
    
    if distance > t:
        
# Print the  Groups violating t-closeness


### 🧠 Quiz Time 
### Q1. Which of the following correctly describes the relationship between these techniques?

A. k-Anonymity is stricter than both l-Diversity and t-Closeness  
B. l-Diversity and t-Closeness are enhancements to overcome k-Anonymity's limitations  
C. t-Closeness can be applied without considering quasi-identifiers  
D. All three techniques guarantee perfect privacy





👉 **Type your selected option below:**  
Your answer: `_____`

##  Part 2: Text Anonymization with SpaCy (NER + Masking)

### 💡 Task 2 

#### Description:
You are provided with a dataset containing news headlines with columns: `publish_date`, `headline_category`, and `headline_text`. Your task is to anonymize the `headline_text` column using **Named Entity Recognition (NER)** to identify and mask named entities such as people, organizations, locations, and dates.

Click ***[here](https://github.com/SanthwanaT/Seminar-3/blob/main/legal_text_classification.csv)***  to understand more about the dataset 



### 💡 Task 2.1: Install and Import Required Libraries

- Install spaCy and download the English language model.
- Import necessary libraries (spaCy, pandas).

### 🧾 Instructions:

- SpaCy is not installed by default in Colab; you must install it first.
- The English model (en_core_web_sm) is needed for NER.


In [None]:
# Install spaCy and download the model
!pip install -U spacy
!python -m spacy download en_core_web_sm

# Import required libraries
import spacy
import pandas as pd



### 💡 Task 2.2: Load the Dataset

- Upload the CSV file containing text data.
- Load it using pandas.

### 🧾 Instructions:

Use the Colab file uploader to upload the dataset named legal_text_classification.csv.

In [None]:
# Upload CSV
from google.colab import files
uploaded = files.upload()

# Load the CSV file into a DataFrame
data =

# Preview the data



### 💡 Task 2.3: Define the Anonymization Function

- Write a function using spaCy to identify named entities and mask them with labels.
- Handle missing or non-string data gracefully.

### 🧾 Instructions:

- Use spaCy's nlp() to process each text entry.
- Replace named entities with `[LABEL]` for the entity types you want to mask (e.g., `PERSON`, `GPE`, `ORG`, `LOC`, `DATE`, `TIME`, `MONEY`).
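
The masking step itself can be sketched without loading a model. In the example below, `mask_entities` is a hypothetical helper and the entity spans are written by hand; with spaCy they would come from `nlp(text).ents`, using each entity's `start_char`, `end_char`, and `label_` attributes.

```python
def mask_entities(text, entities, labels_to_mask):
    """Replace each (start, end, label) span whose label is selected with '[LABEL]'."""
    if not isinstance(text, str):
        return text  # leave missing or non-string values unchanged
    masked = text
    # Process spans from the end of the string so earlier offsets stay valid.
    for start, end, label in sorted(entities, reverse=True):
        if label in labels_to_mask:
            masked = masked[:start] + f"[{label}]" + masked[end:]
    return masked

text = "Barack Obama visited Google in New York."
# Hand-written spans standing in for model output.
entities = [(0, 12, "PERSON"), (21, 27, "ORG"), (31, 39, "GPE")]

print(mask_entities(text, entities, {"PERSON", "ORG", "GPE"}))
# [PERSON] visited [ORG] in [GPE].
```

Replacing from right to left is a design choice: each replacement changes the string length, so working backwards means the character offsets of the remaining entities never need to be recomputed.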



In [None]:
# Load spaCy NER model
nlp = spacy.load("en_core_web_sm")

# Function to anonymize named entities
def anonymize_text(text):
    



### 💡 Task 2.4: Apply Anonymization and Save Output

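- Apply the anonymization function to the text column (`case_text`; confirm the actual column name with `data.columns`, since the task description refers to `headline_text`).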
- Save the anonymized dataset to a new CSV.
- View anonymized examples.



In [None]:
# Apply the anonymization function
data["Anonymized_Text"] = 

# Save anonymized data

# Display sample results


### 🧠 Quiz Time 
### Q1. Which of the following does Named Entity Recognition (NER) detect so that it can be masked?

A. Common nouns like "city", "company", "year"  
B. Words like "and", "the", "of"  
C. Named entities like "Barack Obama", "Google", "New York"  
D. Punctuation marks and stopwords
 


👉 **Type your selected option below:**  
Your answer: `_____`

##  Part 3:  Audio Data Anonymization

Audio data anonymization alters audio data, such as voice recordings, so that the speaker can no longer be identified while the data remains useful for analysis. This is crucial for applications that use personal audio data such as speech recordings or voiceprints. For example, in datasets like VoxCeleb, speaker identity can be easily inferred from voice features, making it essential to anonymize the audio in order to comply with privacy regulations.


### 💡 Task 3

#### Description:
Anonymize audio data by obscuring speaker identity while preserving the utility of the data for tasks like speech recognition, voice activity detection, or speaker verification.


### 💡 Task 3.1

###  Audio Preprocessing and Pitch Shifting


Start by loading an audio sample from the VoxCeleb dataset and applying a pitch shift to anonymize the speaker's voice.

Note: Learn more about the VoxCeleb dataset [***here***](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/).

#### **Instructions:**
1. Load an audio sample from the dataset.

2. Apply a pitch shift of +5 semitones to the audio.

3. Visualize the waveform of the original and pitch-shifted audio.


Use any of the [***audio files***](https://github.com/SanthwanaT/Seminar-3/tree/main/example%20datas%20from%20voxceleb) for Tasks 3.1, 3.2, and 3.3.
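
To get a feel for what a shift of +5 semitones means, the following sketch computes the corresponding frequency ratio. It assumes 12-tone equal temperament, and the 120 Hz starting frequency is a made-up example value; `semitone_ratio` is a hypothetical helper, not a librosa function.

```python
def semitone_ratio(n_steps: float) -> float:
    """Frequency ratio for a shift of n_steps semitones (12-tone equal temperament)."""
    return 2.0 ** (n_steps / 12.0)

original_hz = 120.0  # made-up fundamental frequency of a voice
shifted_hz = original_hz * semitone_ratio(5)

print(round(semitone_ratio(5), 4))  # 1.3348
print(round(shifted_hz, 1))         # 160.2
```

A shift of 12 semitones doubles every frequency (one octave), so +5 semitones raises the voice by roughly a third of its original pitch.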



In [None]:
!pip install matplotlib

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load audio file
audio_path = 
audio, sr = 

# Apply pitch shifting (shift up by 5 semitones, as stated in the instructions)

# Plotting

# Original waveform


# Pitch-shifted waveform


###  💡 Task 3.2

### Time Stretching for Anonymization

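Now apply time stretching to alter the duration of the audio without changing its pitch. The goal is to make the audio 20% longer. In `librosa.effects.time_stretch`, a rate above 1 speeds the audio up and a rate below 1 slows it down, so a 20% longer signal corresponds to a rate of 1/1.2 ≈ 0.83 (a rate of 1.2 would shorten it instead).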

#### **Instructions:**


1. Apply time stretching to the same audio sample.

2. Visualize the waveforms of the original and time-stretched audio.
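
The relationship between the stretch rate and the resulting duration can be checked with simple arithmetic. The sketch below assumes the convention that output duration equals input duration divided by the rate; `stretched_duration` is a hypothetical helper and the 5-second clip length is an example value.

```python
def stretched_duration(duration_s: float, rate: float) -> float:
    """Expected duration after time stretching: a rate > 1 shortens, a rate < 1 lengthens."""
    return duration_s / rate

original_s = 5.0  # example clip length in seconds

print(round(stretched_duration(original_s, 1.2), 3))      # 4.167 (faster, shorter)
print(round(stretched_duration(original_s, 1 / 1.2), 3))  # 6.0   (slower, 20% longer)
```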

In [None]:
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load the audio file
audio_path = 
audio, sr = 

# Apply time stretching (increase duration by 20%)
# A rate < 1 slows down (stretches), > 1 speeds up (compresses)
stretch_rate = 
audio_stretched = 

# Plotting


# Original waveform


# Time-stretched waveform



### 💡 Task 3.3 

### Adding Noise to Anonymize the Audio

Next, add background noise to the audio to further obscure the speaker’s identity.

#### **Instructions:**


1. Generate some random noise, add it to the audio, and normalize the result:
   $$
     \text{noise} = \text{randn}(L) \times \text{noise\_factor}
   $$
   where L=len(audio)

   $$
     \text{audio\_noisy} = \text{audio} + \text{noise}
   $$

   $$
     \text{normalized\_audio} = \frac{\text{audio\_noisy}}{\max\left(|\text{audio\_noisy}|\right)}
   $$



2. Visualize the waveforms of the original and noisy audio.
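
The formulas can be tried out on a synthetic signal before using a real recording. In this sketch the 220 Hz sine tone, the sample rate, and the fixed random seed are assumptions made so the example runs without an audio file; the signal-to-noise ratio (SNR) calculation is an extra diagnostic, not part of the task.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the result is reproducible
sr = 16000                      # assumed sample rate in Hz
t = np.arange(sr) / sr          # one second of time stamps
audio = 0.5 * np.sin(2 * np.pi * 220.0 * t)  # synthetic tone standing in for speech

# White Gaussian noise scaled by the noise factor.
noise_factor = 0.005
noise = rng.standard_normal(len(audio)) * noise_factor

# Add the noise, then normalize so the peak amplitude is 1.
audio_noisy = audio + noise
audio_noisy = audio_noisy / np.max(np.abs(audio_noisy))

# Signal-to-noise ratio in decibels, measured before normalization.
snr_db = 10 * np.log10(np.mean(audio ** 2) / np.mean(noise ** 2))
print(f"SNR: {snr_db:.1f} dB")
```

With these values the SNR is around 37 dB, which means the noise is very quiet relative to the signal. Noise at that level is unlikely to hide a speaker's identity on its own, so it is worth experimenting with larger `noise_factor` values and listening to the result.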

In [None]:
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

# Load the audio file
audio_path = 
audio, sr = 

# Generate white noise
noise_factor = 0.005  # Adjust for more or less noise
noise = 

# Add noise to the original audio
audio_noisy = 

# Normalize to prevent clipping
audio_noisy = 

# Plotting

# Original waveform


# Noisy waveform



After running the code, reflect on the following:
- Audio Quality: Is the audio still understandable after anonymization?


- Speaker Anonymity: Can you identify the speaker from the anonymized audio?

- Preservation of Content: Does the anonymized audio still convey the intended information (e.g., speech recognition, speaker verification)?