This repository implements LexText, a lexical privacy mechanism for text data, along with a baseline implementation of CusText, evaluation pipelines, and attacker models (NEAR and Bayesian).
```
attacker_llm.py
extract_results.py
generate_near_jobs.py
generate_sanitization_jobs.py
prepare_data.py
prepare_embeddings.py
remap_base.py
remap_custext.py
remap_lexical_custext.py
roberta_tune.py
run_near_wrapper.py
sanitize_and_run_roberta.py
utils2.py
lm_scorer/             # Modified external repo (see below)
LEXTEXT_ADJUSTMENTS/   # Adjustments specific to LexText
```
The workflow consists of five main stages:
- Data preparation
- Embedding preparation
- Model fine-tuning
- Sanitization + evaluation
- Attack (NEAR / Bayesian)
Download and preprocess datasets:
```bash
python prepare_data.py
```

This step:
- Downloads datasets
- Saves them locally in the required format
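As a rough illustration of what this stage does, the core of such a script is a loop over Hugging Face `datasets`. The task list and output paths below are assumptions, not the script's actual values:

```python
# Minimal sketch of the data-preparation stage, assuming GLUE tasks fetched
# via the Hugging Face `datasets` library. The task list and output directory
# are illustrative, not prepare_data.py's actual configuration.
from datasets import load_dataset

TASKS = ["sst2", "qnli", "mrpc", "cola"]  # assumed task set

for task in TASKS:
    ds = load_dataset("glue", task)
    # Save each split locally in the format the rest of the pipeline expects.
    ds.save_to_disk(f"data/{task}")
```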
```bash
python prepare_embeddings.py
```

This step:
- Downloads word embeddings
- Normalizes them
- Stores them in a compressed format
- Creates the `idx2word` and `word2idx` mappings
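A minimal sketch of this stage, assuming GloVe-style text embeddings as the source; the file names are illustrative:

```python
# Sketch of the embedding-preparation stage: load vectors, L2-normalize them,
# build the two index mappings, and store everything compressed. The source
# and output file names are assumptions.
import numpy as np

words, vecs = [], []
with open("glove.840B.300d.txt", encoding="utf-8") as f:  # assumed source file
    for line in f:
        parts = line.rstrip().rsplit(" ", 300)  # token itself may contain spaces
        words.append(parts[0])
        vecs.append(np.asarray(parts[1:], dtype=np.float32))

emb = np.vstack(vecs)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize each row

word2idx = {w: i for i, w in enumerate(words)}
idx2word = {i: w for i, w in enumerate(words)}

np.savez_compressed("embeddings.npz", emb=emb, words=np.array(words))
```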
```bash
python roberta_tune.py --task=<TASK_NAME>
```

This fine-tunes RoBERTa on a specific NLP task (e.g., SST-2, QNLI, MRPC, CoLA).
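For orientation, a minimal fine-tuning sketch using the Hugging Face `Trainer` API is shown below; the hyperparameters, column names, and output path are assumptions, not the values `roberta_tune.py` actually uses.

```python
# Hedged fine-tuning sketch with Hugging Face Transformers. Hyperparameters,
# the SST-2 column name, and the checkpoint directory are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)

ds = load_dataset("glue", "sst2")
ds = ds.map(lambda b: tok(b["sentence"], truncation=True,
                          padding="max_length", max_length=128), batched=True)

args = TrainingArguments(output_dir="ckpt_sst2", num_train_epochs=3,
                         per_device_train_batch_size=32, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=ds["train"],
        eval_dataset=ds["validation"]).train()
```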
Main implementation of LexText.
- Extracts Part-of-Speech (PoS) using spaCy
- Uses WordNet to build mappings
- Constructs a `customized_mapping`
- Partitions tokens into `token::POS` pairs
- Applies privacy-aware remapping (based on CusText)

Main function: `noise_text(...)` → applies the sanitization mechanism. A rough sketch of the flow is shown below.
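The sketch assumes spaCy for PoS tagging, a `customized_mapping` keyed by `token::POS` that returns candidate replacements with similarity scores, and an exponential-mechanism draw in the style of CusText. These helper structures are illustrative, not the repo's actual data layout:

```python
# Hedged sketch of the LexText flow: PoS-partitioned candidate sets plus an
# exponential-mechanism sample over similarity scores (CusText style). The
# shape of `customized_mapping` is an assumption.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")

def noise_text(text, epsilon, customized_mapping):
    """Replace each token with a same-PoS candidate sampled privately."""
    out = []
    for tok in nlp(text):
        key = f"{tok.text}::{tok.pos_}"  # token::POS partition key
        candidates, scores = customized_mapping.get(key, ([tok.text], [1.0]))
        # Exponential mechanism: higher-similarity candidates are more
        # likely; epsilon controls how peaked the distribution is.
        scores = np.asarray(scores, dtype=np.float64)
        probs = np.exp(epsilon * scores / 2.0)
        probs /= probs.sum()
        out.append(np.random.choice(candidates, p=probs))
    return " ".join(out)
```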
Baseline mechanism:
- Similar to LexText
- Does not use PoS information
```bash
python sanitize_and_run_roberta.py \
    --mechanism=<lextext|custext> \
    --epsilon=<PRIVACY_BUDGET> \
    --task=<TASK_NAME> \
    --save_sanitize=<OUTPUT_JSON>
```

This script:
- Applies the selected sanitization mechanism
- Sanitizes the test set
- Runs evaluation 10 times and averages results
- Saves a JSON file containing:
- Original text
- Sanitized text
This JSON is required for the attacker models.
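A hedged sketch of consuming that file; the key names `original` and `sanitized` are assumptions about the schema:

```python
# Load the sanitization output produced via --save_sanitize and inspect a few
# original/sanitized pairs. Path and key names are illustrative.
import json

with open("sanitized_sst2_eps3.json") as f:  # assumed output path
    records = json.load(f)

for rec in records[:3]:
    print("original: ", rec["original"])
    print("sanitized:", rec["sanitized"])
```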
Implemented in `attacker_llm.py`.
Recommended usage:
```bash
python run_near_wrapper.py
```

Requires:
- Sanitization JSON file from the previous step
The Bayesian attacker implementation is available here:
https://github.com/mengtong0110/On-the-Vulnerability-of-Text-Sanitization
The `lm_scorer/` folder is based on:
https://github.com/simonepri/lm-scorer
Modifications:
- Added functionality to support Qwen2 models
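For reference, the upstream lm-scorer API looks like the snippet below, shown with GPT-2; substituting a Qwen2 checkpoint is exactly the modification this repo adds, so the model string here is illustrative:

```python
# Upstream lm-scorer usage (https://github.com/simonepri/lm-scorer), shown
# with GPT-2. The Qwen2 support added in this repo would swap the model name.
import torch
from lm_scorer.models.auto import AutoLMScorer

device = "cuda" if torch.cuda.is_available() else "cpu"
scorer = AutoLMScorer.from_pretrained("gpt2", device=device, batch_size=1)

# Sentence score under the LM, reduced as the mean of token probabilities.
print(scorer.sentence_score("The movie was great.", reduce="mean"))
```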
Additional LexText-specific adjustments are located in `LEXTEXT_ADJUSTMENTS/`.
- `generate_sanitization_jobs.py` → batch sanitization experiments
- `generate_near_jobs.py` → batch attacker runs
These scripts demonstrate how to evaluate across:
- tasks
- epsilon values
- mechanisms
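A minimal sketch of how such a grid sweep can be generated; the task names and epsilon values are illustrative:

```python
# Emit one sanitization command per (task, epsilon, mechanism) combination.
# The specific grid values are assumptions, not the scripts' defaults.
from itertools import product

tasks = ["sst2", "qnli", "mrpc", "cola"]
epsilons = [1.0, 2.0, 3.0]
mechanisms = ["lextext", "custext"]

for task, eps, mech in product(tasks, epsilons, mechanisms):
    print(f"python sanitize_and_run_roberta.py --mechanism={mech} "
          f"--epsilon={eps} --task={task} "
          f"--save_sanitize=out/{mech}_{task}_eps{eps}.json")
```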
```bash
python extract_results.py
```

Used for:
- Aggregating results
- Extracting evaluation metrics
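A hedged sketch of the aggregation pattern; the results layout and metric key are assumptions:

```python
# Collect per-run metrics from result files and report mean ± std. The
# "results/*.json" layout and the "accuracy" key are illustrative.
import glob
import json
import statistics

accs = []
for path in glob.glob("results/*.json"):
    with open(path) as f:
        accs.append(json.load(f)["accuracy"])  # assumed metric key

print(f"accuracy: {statistics.mean(accs):.4f} ± {statistics.stdev(accs):.4f}")
```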
`utils2.py` contains helper functions used across the project.
- LexText extends CusText by incorporating PoS information
- Sanitization is stochastic → results are averaged over multiple runs
- Attacker models require saved sanitization examples
- Ensure preprocessing steps are completed before running experiments
| Component | Description |
|---|---|
| LexText | PoS-aware privacy mechanism |
| CusText | Baseline mechanism |
| RoBERTa | Downstream task evaluation |
| NEAR | LLM-based attacker |
| Bayesian | External attacker |
| Embeddings | Preprocessed and normalized |
If something fails, verify:
- Data exists (`prepare_data.py`)
- Embeddings exist (`prepare_embeddings.py`)
- The model is fine-tuned (`roberta_tune.py`)