This repository implements LexText, a lexical privacy mechanism for text data, along with a baseline implementation of CusText, evaluation pipelines, and attacker models (NEAR and Bayesian).
```
attacker_llm.py
extract_results.py
generate_near_jobs.py
generate_sanitization_jobs.py
prepare_data.py
prepare_embeddings.py
remap_base.py
remap_custext.py
remap_lexical_custext.py
roberta_tune.py
run_near_wrapper.py
sanitize_and_run_roberta.py
utils2.py
lm_scorer/             # Modified external repo (see below)
LEXTEXT_ADJUSTMENTS/   # Adjustments specific to LexText
```
The workflow consists of five main stages:
- Data preparation
- Embedding preparation
- Model fine-tuning
- Sanitization + evaluation
- Attack (NEAR / Bayesian)
Download and preprocess datasets:
```bash
python prepare_data.py
```

This step:
- Downloads datasets
- Saves them locally in the required format
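As a rough illustration of what this stage does, the core of such a script is a loop over Hugging Face `datasets`. The task list and output paths below are assumptions, not the script's actual values:

```python
# Minimal sketch of the data-preparation stage, assuming GLUE tasks fetched
# via the Hugging Face `datasets` library. The task list and output directory
# are illustrative, not prepare_data.py's actual configuration.
from datasets import load_dataset

TASKS = ["sst2", "qnli", "mrpc", "cola"]  # assumed task set

for task in TASKS:
    ds = load_dataset("glue", task)
    # Save each split locally in the format the rest of the pipeline expects.
    ds.save_to_disk(f"data/{task}")
```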
```bash
python prepare_embeddings.py
```

This step:
- Downloads word embeddings
- Normalizes them
- Stores them in a compressed format
- Creates the `idx2word` and `word2idx` mappings
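A minimal sketch of this stage, assuming GloVe-style text embeddings as the source; the file names are illustrative:

```python
# Sketch of the embedding-preparation stage: load vectors, L2-normalize them,
# build the two index mappings, and store everything compressed. The source
# and output file names are assumptions.
import numpy as np

words, vecs = [], []
with open("glove.840B.300d.txt", encoding="utf-8") as f:  # assumed source file
    for line in f:
        parts = line.rstrip().rsplit(" ", 300)  # token itself may contain spaces
        words.append(parts[0])
        vecs.append(np.asarray(parts[1:], dtype=np.float32))

emb = np.vstack(vecs)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize each row

word2idx = {w: i for i, w in enumerate(words)}
idx2word = {i: w for i, w in enumerate(words)}

np.savez_compressed("embeddings.npz", emb=emb, words=np.array(words))
```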
```bash
python roberta_tune.py --task=<TASK_NAME>
```

This fine-tunes RoBERTa on a specific NLP task (e.g., SST-2, QNLI, MRPC, CoLA).
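For orientation, a minimal fine-tuning sketch using the Hugging Face `Trainer` API is shown below; the hyperparameters, column names, and output path are assumptions, not the values `roberta_tune.py` actually uses.

```python
# Hedged fine-tuning sketch with Hugging Face Transformers. Hyperparameters,
# the SST-2 column name, and the checkpoint directory are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)

ds = load_dataset("glue", "sst2")
ds = ds.map(lambda b: tok(b["sentence"], truncation=True,
                          padding="max_length", max_length=128), batched=True)

args = TrainingArguments(output_dir="ckpt_sst2", num_train_epochs=3,
                         per_device_train_batch_size=32, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=ds["train"],
        eval_dataset=ds["validation"]).train()
```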
Main implementation of LexText.
- Extracts Part-of-Speech (PoS) using spaCy
- Uses WordNet to build mappings
- Constructs a `customized_mapping`
- Partitions tokens into `token::POS` pairs
- Applies privacy-aware remapping (based on CusText)

Main function: `noise_text(...)` → applies the sanitization mechanism. A rough sketch of the flow is shown below.
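The sketch assumes spaCy for PoS tagging, a `customized_mapping` keyed by `token::POS` that returns candidate replacements with similarity scores, and an exponential-mechanism draw in the style of CusText. These helper structures are illustrative, not the repo's actual data layout:

```python
# Hedged sketch of the LexText flow: PoS-partitioned candidate sets plus an
# exponential-mechanism sample over similarity scores (CusText style). The
# shape of `customized_mapping` is an assumption.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")

def noise_text(text, epsilon, customized_mapping):
    """Replace each token with a same-PoS candidate sampled privately."""
    out = []
    for tok in nlp(text):
        key = f"{tok.text}::{tok.pos_}"  # token::POS partition key
        candidates, scores = customized_mapping.get(key, ([tok.text], [1.0]))
        # Exponential mechanism: higher-similarity candidates are more
        # likely; epsilon controls how peaked the distribution is.
        scores = np.asarray(scores, dtype=np.float64)
        probs = np.exp(epsilon * scores / 2.0)
        probs /= probs.sum()
        out.append(np.random.choice(candidates, p=probs))
    return " ".join(out)
```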
Baseline mechanism:
- Similar to LexText
- Does not use PoS information
```bash
python sanitize_and_run_roberta.py \
    --mechanism=<lextext|custext> \
    --epsilon=<PRIVACY_BUDGET> \
    --task=<TASK_NAME> \
    --save_sanitize=<OUTPUT_JSON>
```

This script:
- Applies the selected sanitization mechanism
- Sanitizes the test set
- Runs evaluation 10 times and averages results
- Saves a JSON file containing:
- Original text
- Sanitized text
This JSON is required for the attacker models.
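A hedged sketch of consuming that file; the key names `original` and `sanitized` are assumptions about the schema:

```python
# Load the sanitization output produced via --save_sanitize and inspect a few
# original/sanitized pairs. Path and key names are illustrative.
import json

with open("sanitized_sst2_eps3.json") as f:  # assumed output path
    records = json.load(f)

for rec in records[:3]:
    print("original: ", rec["original"])
    print("sanitized:", rec["sanitized"])
```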
Implemented in `attacker_llm.py`.
Recommended usage:
```bash
python run_near_wrapper.py
```

Requires:
- Sanitization JSON file from the previous step
The Bayesian attacker implementation is available here:
https://github.com/mengtong0110/On-the-Vulnerability-of-Text-Sanitization
The `lm_scorer/` folder is based on:
https://github.com/simonepri/lm-scorer
Modifications:
- Added functionality to support Qwen2 models
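For reference, the upstream lm-scorer API looks like the snippet below, shown with GPT-2; substituting a Qwen2 checkpoint is exactly the modification this repo adds, so the model string here is illustrative:

```python
# Upstream lm-scorer usage (https://github.com/simonepri/lm-scorer), shown
# with GPT-2. The Qwen2 support added in this repo would swap the model name.
import torch
from lm_scorer.models.auto import AutoLMScorer

device = "cuda" if torch.cuda.is_available() else "cpu"
scorer = AutoLMScorer.from_pretrained("gpt2", device=device, batch_size=1)

# Sentence score under the LM, reduced as the mean of token probabilities.
print(scorer.sentence_score("The movie was great.", reduce="mean"))
```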
Additional LexText-specific adjustments are located in `LEXTEXT_ADJUSTMENTS/`.
- `generate_sanitization_jobs.py` → batch sanitization experiments
- `generate_near_jobs.py` → batch attacker runs
These scripts demonstrate how to evaluate across:
- tasks
- epsilon values
- mechanisms
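A minimal sketch of how such a grid sweep can be generated; the task names and epsilon values are illustrative:

```python
# Emit one sanitization command per (task, epsilon, mechanism) combination.
# The specific grid values are assumptions, not the scripts' defaults.
from itertools import product

tasks = ["sst2", "qnli", "mrpc", "cola"]
epsilons = [1.0, 2.0, 3.0]
mechanisms = ["lextext", "custext"]

for task, eps, mech in product(tasks, epsilons, mechanisms):
    print(f"python sanitize_and_run_roberta.py --mechanism={mech} "
          f"--epsilon={eps} --task={task} "
          f"--save_sanitize=out/{mech}_{task}_eps{eps}.json")
```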
```bash
python extract_results.py
```

Used for:
- Aggregating results
- Extracting evaluation metrics
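A hedged sketch of the aggregation pattern; the results layout and metric key are assumptions:

```python
# Collect per-run metrics from result files and report mean ± std. The
# "results/*.json" layout and the "accuracy" key are illustrative.
import glob
import json
import statistics

accs = []
for path in glob.glob("results/*.json"):
    with open(path) as f:
        accs.append(json.load(f)["accuracy"])  # assumed metric key

print(f"accuracy: {statistics.mean(accs):.4f} ± {statistics.stdev(accs):.4f}")
```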
`utils2.py` contains helper functions used across the project.
- LexText extends CusText by incorporating PoS information
- Sanitization is stochastic → results are averaged over multiple runs
- Attacker models require saved sanitization examples
- Ensure preprocessing steps are completed before running experiments
| Component | Description |
|---|---|
| LexText | PoS-aware privacy mechanism |
| CusText | Baseline mechanism |
| RoBERTa | Downstream task evaluation |
| NEAR | LLM-based attacker |
| Bayesian | External attacker |
| Embeddings | Preprocessed and normalized |
If something fails, verify:
- Data exists (`prepare_data.py`)
- Embeddings exist (`prepare_embeddings.py`)
- The model is fine-tuned (`roberta_tune.py`)