#  Data Anonymization and Privacy-Enhancing Technologies (PETs)

## 🎯 Learning Goals
- Understand what data anonymization is and why it matters.
- Apply anonymization techniques to structured, text, image, and audio data.
- Understand how anonymization supports compliance with privacy laws (GDPR, CCPA).

### 🔐 Introduction to Data Anonymization

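- **Data anonymization** removes or transforms personally identifiable information (PII) so that individuals can no longer be identified from the dataset.
- **Pseudonymization** replaces identifiers with tokens but retains a link to identity (e.g., via a separately stored key); anonymization keeps no such link.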
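
The difference between the two can be sketched in a few lines. The example below is a minimal illustration, not a recommended production scheme: `SECRET_KEY`, `pseudonymize`, and the toy `records` are hypothetical names introduced here. It uses a keyed hash (HMAC-SHA256) for pseudonymization, so anyone holding the key can recompute the token for a known name, and contrasts that with dropping the identifier and generalizing age.

```python
import hashlib
import hmac

# Hypothetical secret key; it would normally be stored separately from the data.
SECRET_KEY = b"example-key-not-for-production"

def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed hash (HMAC-SHA256) of value, truncated for readability."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

# Toy records, made up for illustration.
records = [
    {"name": "Alice Smith", "age": 34},
    {"name": "Bob Jones", "age": 51},
    {"name": "Alice Smith", "age": 34},
]

# Pseudonymized: the same person always maps to the same token,
# so records stay linkable and the key holder can re-identify them.
pseudonymized = [{"id": pseudonymize(r["name"]), "age": r["age"]} for r in records]

# Anonymized (sketch): the identifier is dropped and age is generalized to a decade band.
anonymized = [{"age_band": f"{r['age'] // 10 * 10}s"} for r in records]

print(pseudonymized)
print(anonymized)
```

Dropping a name and banding age does not by itself guarantee anonymity; whether the remaining attributes still single someone out is what the k-anonymity check in Part 1 examines.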

### 🧰 Techniques for Anonymizing Different Data Types
- **Structured Data**: k-anonymity, l-diversity, t-closeness
- **Text Data**: Named Entity Recognition (NER) + masking
- **Images**: Face blurring, pixelation
- **Audio**: Voice masking, pitch shift
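
As a rough illustration of one structured-data criterion from the list above, l-diversity asks that every quasi-identifier group contain at least l distinct values of a sensitive attribute. The sketch below assumes `Education` is the sensitive column, which is a choice made purely for this example; `l_diversity` and the toy frame are hypothetical.

```python
import pandas as pd

# Toy data, made up for illustration; Education is treated as the sensitive attribute.
toy = pd.DataFrame({
    "Age_group": [0, 0, 0, 1, 1, 1],
    "Region": ["Europe", "Europe", "Europe", "Asia", "Asia", "Asia"],
    "Education": ["Bachelors", "Bachelors", "Bachelors",
                  "Bachelors", "Masters", "HS-grad"],
})

def l_diversity(df, qi_cols, sensitive_col):
    """Smallest number of distinct sensitive values in any quasi-identifier group."""
    return int(df.groupby(qi_cols)[sensitive_col].nunique().min())

l_value = l_diversity(toy, ["Age_group", "Region"], "Education")
print(f"The table is {l_value}-diverse")
```

Here both groups have three rows, so the table is 3-anonymous, yet every row in the Europe group shares the same education level. Knowing someone is in that group reveals the sensitive value, which is the gap l-diversity is meant to expose.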

### 📜 Legal Relevance
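- **GDPR & CCPA**: Both require data minimization and protection of personal data.
- Data that is truly anonymized falls outside the GDPR's definition of 'personal data', and the CCPA similarly excludes deidentified data. Pseudonymized data still counts as personal data under the GDPR.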

##  Part 1: Structured Data Anonymization (k-Anonymity)

### 💡 Task 1 

#### Description:
You are provided with a structured dataset (CSV file) containing personal attributes like Age, Education, Occupation, Relationship, Sex, and Country. Your task is to anonymize this data using **k-anonymity** by binning continuous data (like Age) and generalizing or masking quasi-identifiers such as Country or Occupation. Then assess whether individual identities could still be inferred.


- Generalize and group quasi-identifiers (e.g., Age and Country)
- Ensure each combination of quasi-identifiers occurs at least **k** times
- Evaluate anonymity level before and after transformation
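
The "before and after" evaluation in the last step can be sketched on a toy table. The helper `k_level` and the sample rows below are assumptions for illustration only; the helper reports the size of the smallest group of rows that share the same quasi-identifier values, which is the k the table actually achieves.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "Age": [23, 27, 29, 41, 45, 48],
    "Country": ["France", "Germany", "France", "India", "Japan", "China"],
})

def k_level(df, qi_cols):
    """Size of the smallest group sharing the same quasi-identifier values."""
    return int(df.groupby(qi_cols).size().min())

# Before: raw Age and Country make every row unique.
k_before = k_level(toy, ["Age", "Country"])

# After: Age is generalized to a decade band and Country to a region.
region_of = {"France": "Europe", "Germany": "Europe",
             "India": "Asia", "Japan": "Asia", "China": "Asia"}
generalized = pd.DataFrame({
    "Age_group": (toy["Age"] // 10 * 10).astype(str) + "s",
    "Region": toy["Country"].map(region_of),
})
k_after = k_level(generalized, ["Age_group", "Region"])

print(f"k before: {k_before}, k after: {k_after}")
```

The age bands are kept as plain strings rather than a pandas categorical, because grouping on categorical columns can include empty category combinations and report a minimum group size of 0.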


In [None]:
!pip install pandas scikit-learn spacy opencv-python-headless matplotlib librosa pydub
!python -m spacy download en_core_web_sm

In [None]:
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer
from google.colab import files

# Upload the CSV file (files.upload() opens Colab's file picker)
print("Upload your structured dataset CSV")
uploaded = files.upload()
filename = next(iter(uploaded))  # name of the first uploaded file
data = pd.read_csv(filename)

# Bin Age into categories
binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
data['Age_group'] = binner.fit_transform(data[['Age']]).astype(int)

# Generalize Country
region_map = {
    'United-States': 'North America', 'Canada': 'North America', 'Mexico': 'North America',
    'India': 'Asia', 'China': 'Asia', 'Japan': 'Asia',
    'Germany': 'Europe', 'France': 'Europe', 'UK': 'Europe'
}
data['Region'] = data['Country'].map(region_map).fillna('Other')

# Drop specific identifying columns
data = data.drop(columns=['Age', 'Country'])

# Group by quasi-identifiers and count
qi_cols = ['Age_group', 'Region', 'Sex']
k_anon_counts = data.groupby(qi_cols).size().reset_index(name='Count')

# Merge back counts to each record
data = pd.merge(data, k_anon_counts, on=qi_cols)

# Check if k-anonymity condition is met (e.g., k >= 3)
k = 3
print(f"Rows violating {k}-anonymity:")
print(data[data['Count'] < k])
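
Once violating rows are identified, one common follow-up is suppression: withholding the rows whose quasi-identifier combination is too rare. The sketch below is one possible approach, not part of the task's required solution; `suppress_small_groups` and the toy frame are hypothetical, and further generalization (for example, coarser age bins) is an alternative that loses fewer rows.

```python
import pandas as pd

def suppress_small_groups(df, qi_cols, k):
    """Drop rows whose quasi-identifier combination occurs fewer than k times."""
    # transform("size") returns, for every row, the size of the group it belongs to.
    group_sizes = df.groupby(qi_cols)[qi_cols[0]].transform("size")
    return df[group_sizes >= k].reset_index(drop=True)

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "Age_group": [0, 0, 0, 1, 2, 2],
    "Region": ["Europe", "Europe", "Europe", "Asia", "Other", "Other"],
    "Sex": ["Male", "Male", "Male", "Female", "Male", "Male"],
})

qi = ["Age_group", "Region", "Sex"]
released = suppress_small_groups(toy, qi, k=3)
suppressed_rows = len(toy) - len(released)

print(released)
print(f"Suppressed {suppressed_rows} of {len(toy)} rows")
```

Reporting how many rows were suppressed matters, since heavy suppression can bias the released data toward the most common groups.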

## Part 2: Text Anonymization with spaCy (NER + Masking)

### 💡 Task 2 

#### Description:
You are provided with a dataset containing news headlines with columns: `publish_date`, `headline_category`, and `headline_text`. Your task is to anonymize the `headline_text` column using **Named Entity Recognition (NER)** to identify and mask named entities such as people, organizations, locations, and dates.


In [None]:
import spacy
import pandas as pd
from google.colab import files

# Load NER model
nlp = spacy.load("en_core_web_sm")

# Upload CSV dataset
uploaded = files.upload()
filename = next(iter(uploaded))
df = pd.read_csv(filename)

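# Function to anonymize a single headline
def anonymize_text(text):
    # Leave missing or non-string values (e.g., NaN) untouched
    if not isinstance(text, str):
        return text
    doc = nlp(text)
    # Replace each entity by its character offsets, right to left, so that
    # earlier offsets stay valid and unrelated substrings are never touched
    for ent in reversed(doc.ents):
        text = text[:ent.start_char] + f"<MASK_{ent.label_}>" + text[ent.end_char:]
    return text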

# Apply anonymization to the dataset
df['anonymized_headline'] = df['headline_text'].apply(anonymize_text)

# Display a few results
df[['headline_text', 'anonymized_headline']].head()
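
When masking entities, it matters whether the replacement is done by character offset or by substring. The sketch below contrasts the two on a made-up headline. The `spans` list is hypothetical and stands in for the `(ent.start_char, ent.end_char, ent.label_)` values a spaCy model might return; no model is loaded, so the example runs on its own.

```python
def mask_spans(text, spans):
    """Replace (start, end, label) character spans with <MASK_label> tokens."""
    # Work right to left so earlier offsets remain valid after each replacement.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"<MASK_{label}>" + text[end:]
    return text

headline = "May visits Maynooth in May"

# Hypothetical entity spans, as if produced by an NER model.
spans = [(0, 3, "PERSON"), (11, 19, "GPE"), (23, 26, "DATE")]

# Offset-based masking touches exactly the detected spans.
by_offsets = mask_spans(headline, spans)

# Substring-based masking replaces every occurrence of the entity text,
# including matches inside other words and inside other entities.
by_substring = headline
for start, end, label in spans:
    by_substring = by_substring.replace(headline[start:end], f"<MASK_{label}>")

print(by_offsets)
print(by_substring)
```

With substring replacement, the first entity text "May" also matches inside "Maynooth" and the final date, so the place name is only partly masked and the date receives the wrong label. Offset-based replacement avoids both problems.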