# Data Anonymization and Privacy-Enhancing Technologies (PETs)



## 1. Introduction to Data Anonymization 
### What is Data Anonymization?

Data anonymization is the process of transforming data so that personal information is hidden or removed and records can no longer be linked to an identifiable individual. It is a critical step for protecting privacy and complying with data protection regulations, because it reduces the risk of privacy violations.

### What is Audio Data Anonymization?

Audio data anonymization refers to the process of altering audio data, such as voice recordings, so that the speaker cannot be identified while the data remains useful for analysis. This is crucial for applications that use personal audio data, such as speech recordings or voiceprints. For example, in datasets like VoxCeleb, speaker identity can be easily inferred from voice features, making it essential to anonymize the audio in order to comply with privacy regulations.

Anonymization can involve techniques such as voice masking, pitch shifting, and time stretching to disguise the speaker's identity, while keeping the audio useful for tasks such as speech recognition.



### Techniques for Anonymizing Audio Data

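There are several methods for anonymizing audio data, each aimed at obscuring speaker identity while preserving the utility of the data for tasks like speech recognition or voice activity detection. Automatic speaker verification plays the opposite role: it is the attack that anonymization should make harder, so it can be used to test how well identity is hidden.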

#### 1. Pitch Shifting

Pitch shifting is a technique where the pitch of a speaker’s voice is raised or lowered. This can make the speaker harder to recognize without drastically changing the structure of the speech, although a fixed shift can be undone by shifting back and therefore offers only weak protection on its own. The shift should stay within a range that keeps the speech intelligible.

#### 2. Time Stretching

Time stretching involves changing the speed of an audio file without altering its pitch. It changes the speaking rate, which is one cue to a speaker's identity, while leaving the pitch of the voice intact. With moderate stretch rates the speech keeps a natural flow.

#### 3. Voice Masking or Voice Cloning

Voice masking replaces the speaker’s voice with a synthetic voice, typically a neutral one. This technique is more sophisticated than pitch shifting or time stretching and can be achieved using deep learning methods for generating synthetic voices. Its aim is that the original speaker can no longer be identified from the audio.

### Load and Explore VoxCeleb Metadata

Let's start by loading and exploring the VoxCeleb metadata to get a sense of the distribution of speakers.

In [None]:
import pandas as pd

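# Load metadata (example CSV)
metadata_path = "/home/santhwanat1029@alabsad.fau.de/data-governance-seminar/data-governance-seminar/dataHDD/voxceleb/voxceleb_trainer/data/vox1_meta.csv"
# Note: the official vox1_meta.csv is tab-separated; if everything lands in
# a single column, pass sep="\t" to read_csv.
metadata = pd.read_csv(metadata_path)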

# Inspect data
metadata.head()

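To look at the distribution of speakers rather than only the first rows, counts per category are a natural next step. The sketch below is a minimal illustration and does not read the real file: `sample_tsv` holds made-up rows, and the column names (`VoxCeleb1 ID`, `Gender`, `Nationality`, `Set`) are assumptions that should be checked against `metadata.columns`.

```python
import io
import pandas as pd

# Hypothetical stand-in for the metadata file: made-up rows in a tab-separated
# layout with assumed column names (verify them against your own file).
sample_tsv = (
    "VoxCeleb1 ID\tVGGFace1 ID\tGender\tNationality\tSet\n"
    "id00001\tSpeaker_A\tm\tIreland\tdev\n"
    "id00002\tSpeaker_B\tf\tIndia\tdev\n"
    "id00003\tSpeaker_C\tm\tIndia\ttest\n"
)
sample_metadata = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Count speakers per category to see how balanced the data is
gender_counts = sample_metadata["Gender"].value_counts()
nationality_counts = sample_metadata["Nationality"].value_counts()
speakers_per_split = sample_metadata.groupby("Set")["VoxCeleb1 ID"].nunique()

print(gender_counts)
print(nationality_counts)
print(speakers_per_split)
```

The same three lines of counting can be applied to the `metadata` DataFrame loaded above once its column names are confirmed.
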
### 📝 **Task 1:** Audio Preprocessing and Pitch Shifting


Start by loading an audio sample from the VoxCeleb dataset and applying a pitch shift to anonymize the speaker's voice.

#### **Instructions:**
1. Load an audio sample from the dataset.

2. Apply a pitch shift of +5 semitones to the audio.

3. Visualize the waveform of the original and pitch-shifted audio.

In [None]:
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load audio sample from VoxCeleb dataset
audio_path = '/path/to/voxceleb/audio/sample.wav'
audio, sr = librosa.load(audio_path, sr=None)

# Apply pitch shifting

# Plot the original and pitch-shifted audio waveforms


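### 📝 **Task 2:** Time Stretching for Anonymization

Now, apply time stretching to alter the duration of the audio without changing its pitch. Use a stretch rate of 1.2. Note that in `librosa.effects.time_stretch` a rate above 1 speeds the audio up, so a rate of 1.2 shortens it to about 83% of its original length. To lengthen the audio by 20% instead, use a rate of 1/1.2 ≈ 0.83.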

#### **Instructions:**


1. Apply time stretching to the same audio sample.

2. Visualize the waveforms of the original and time-stretched audio.

In [None]:
# Apply time stretching
audio_time_stretched = None  # TODO: assign the time-stretched audio here

# Plot the original and time-stretched audio waveforms


### 📝 **Task 3:** Adding Noise to Anonymize the Audio

Next, add background noise to the audio to further obscure the speaker’s identity.

#### **Instructions:**


1. Generate some random noise and add it to the audio.

2. Visualize the waveforms of the original and noisy audio.

In [None]:
import numpy as np

# Generate random noise

# Add noise to the original audio (ensure it doesn't exceed the max amplitude)

# Plot the original and noisy audio waveforms

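A minimal sketch of the noise step follows, assuming white Gaussian noise scaled to a target signal-to-noise ratio. The function name `add_noise`, the 20 dB target, and the synthetic tone are all illustrative choices, not part of the exercise.

```python
import numpy as np


def add_noise(audio, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio (dB)."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    # Clip so the result stays within the [-1, 1] range used for float audio
    return np.clip(audio + noise, -1.0, 1.0)


# Synthetic stand-in for a speech clip: 1 second of a 220 Hz tone at 16 kHz
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220.0 * t)

rng = np.random.default_rng(0)  # fixed seed so the result is repeatable
noisy = add_noise(tone, snr_db=20.0, rng=rng)

# Measure the SNR that was actually achieved
measured_snr = 10 * np.log10(np.mean(tone ** 2) / np.mean((noisy - tone) ** 2))
print(f"measured SNR: {measured_snr:.1f} dB")
```

Lower SNR values hide more of the voice but also hurt intelligibility, so it is worth trying a few values and listening to the result.
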


After running the code, reflect on the following:
- Audio Quality: Is the audio still understandable after anonymization?
- Speaker Anonymity: Can you still identify the speaker from the anonymized audio?
- Preservation of Content: Does the anonymized audio still convey the intended information, for example can the spoken words still be recognized?

### 🎯 **Your Goal:**
- **Describe the anonymization technique applied (e.g., pitch shifting, time stretching, noise addition, reverberation).**
- **Note any challenges faced during the anonymization process (e.g., audio degradation, loss of intelligibility).**
- **Discuss the effectiveness of the technique in obscuring speaker identity while preserving content.**
