MeLeLBGU/LexText


LexText: Lexical Privacy Mechanism for Text

This repository implements LexText, a lexical privacy mechanism for text data, along with a baseline implementation of CusText, evaluation pipelines, and attacker models (NEAR and Bayesian).


📂 Project Structure

attacker_llm.py
extract_results.py
generate_near_jobs.py
generate_sanitization_jobs.py
prepare_data.py
prepare_embeddings.py
remap_base.py
remap_custext.py
remap_lexical_custext.py
roberta_tune.py
run_near_wrapper.py
sanitize_and_run_roberta.py
utils2.py

lm_scorer/                 # Modified external repo (see below)
LEXTEXT_ADJUSTMENTS/      # Adjustments specific to LexText

🚀 Pipeline Overview

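The workflow consists of five main stages. The numbered sections below follow the same order, except that stage 4 is split into a description of the mechanisms (section 4) and the script that runs them (section 5):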

  1. Data preparation
  2. Embedding preparation
  3. Model fine-tuning
  4. Sanitization + evaluation
  5. Attack (NEAR / Bayesian)

1. Prepare Data

Download and preprocess datasets:

python prepare_data.py

This step:

  • Downloads datasets
  • Saves them locally in the required format

2. Prepare Embeddings

python prepare_embeddings.py

This step:

  • Downloads word embeddings
  • Normalizes them
  • Stores them in a compressed format
  • Creates:
    • idx2word
    • word2idx
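The normalization and indexing steps above can be sketched as follows. This is a minimal illustration, not the code in prepare_embeddings.py: the function names, the gzip-compressed JSON storage format, and the use of the standard library only are all assumptions made to keep the example self-contained.

```python
import gzip
import json
import math


def normalize(vector):
    """Scale a vector to unit L2 norm (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return list(vector) if norm == 0 else [x / norm for x in vector]


def build_index(words, vectors):
    """Return normalized vectors plus the idx2word / word2idx lookups."""
    idx2word = list(words)
    word2idx = {word: i for i, word in enumerate(idx2word)}
    normalized = [normalize(v) for v in vectors]
    return normalized, idx2word, word2idx


def save_embeddings(path, normalized, idx2word):
    """Store vectors and vocabulary as gzip-compressed JSON (hypothetical format)."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump({"idx2word": idx2word, "vectors": normalized}, f)


def load_embeddings(path):
    """Load the file written by save_embeddings and rebuild word2idx."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        data = json.load(f)
    idx2word = data["idx2word"]
    word2idx = {word: i for i, word in enumerate(idx2word)}
    return data["vectors"], idx2word, word2idx
```

Only idx2word needs to be stored, since word2idx can be rebuilt from it on load. With unit-length vectors, the dot product of two rows equals their cosine similarity.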

3. Fine-tune RoBERTa

python roberta_tune.py --task=<TASK_NAME>

This fine-tunes RoBERTa on a specific NLP task (e.g., SST-2, QNLI, MRPC, CoLA).


4. Sanitization Mechanisms

🔹 LexText (remap_lexical_custext.py)

Main implementation of LexText.

  • Extracts Part-of-Speech (PoS) using spaCy
  • Uses WordNet to build mappings
  • Constructs a customized_mapping
  • Partitions tokens into token::POS pairs
  • Applies privacy-aware remapping (based on CusText)

Main function:

  • noise_text(...) → applies the sanitization mechanism
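To make the remapping step more concrete, here is a rough sketch of sampling a replacement for one token::POS pair with an exponential mechanism. It is not the repository's noise_text: the function names, the shape of the mapping (a dict from a token::POS key to a list of candidate keys), the similarity callback, and the fallback for unmapped tokens are all hypothetical.

```python
import math
import random


def exponential_mechanism(candidates, scores, epsilon, sensitivity=1.0, rng=random):
    """Sample a candidate with probability proportional to
    exp(epsilon * score / (2 * sensitivity))."""
    max_score = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(epsilon * (s - max_score) / (2 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


def sanitize_token(token, pos, mapping, similarity, epsilon, rng=random):
    """Replace one token using only candidates mapped to the same token::POS key."""
    key = f"{token}::{pos}"
    candidates = mapping.get(key)
    if not candidates:
        # Hypothetical fallback for brevity: unmapped tokens are kept as-is.
        # A real mechanism has to decide how to treat them.
        return token
    scores = [similarity(key, candidate) for candidate in candidates]
    chosen = exponential_mechanism(candidates, scores, epsilon, rng=rng)
    return chosen.split("::")[0]  # drop the POS suffix before emitting the word


# Usage with a toy mapping and a toy similarity function (both assumptions).
toy_mapping = {"happy::ADJ": ["happy::ADJ", "glad::ADJ", "cheerful::ADJ"]}
toy_similarity = lambda a, b: 1.0 if a == b else 0.5
print(sanitize_token("happy", "ADJ", toy_mapping, toy_similarity, epsilon=2.0))
```

A small epsilon flattens the sampling distribution toward uniform over the candidates, while a large epsilon concentrates it on the highest-scoring candidate.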

🔹 CusText (remap_custext.py)

Baseline mechanism:

  • Similar to LexText
  • Does not use PoS information

5. Run Sanitization + Evaluation

python sanitize_and_run_roberta.py \
    --mechanism=<lextext|custext> \
    --epsilon=<PRIVACY_BUDGET> \
    --task=<TASK_NAME> \
    --save_sanitize=<OUTPUT_JSON>

This script:

  • Applies the selected sanitization mechanism
  • Sanitizes the test set
  • Runs evaluation 10 times and averages results
  • Saves a JSON file containing:
    • Original text
    • Sanitized text

This JSON is required for the attacker models.
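The averaging and saving steps could look roughly like the sketch below. The record keys ("original", "sanitized"), the list-of-records layout, and the function names are assumptions; the actual schema written by sanitize_and_run_roberta.py should be checked in the script itself.

```python
import json
import statistics


def average_over_runs(evaluate_once, n_runs=10):
    """Call evaluate_once(run_index) n_runs times; return (mean, all scores)."""
    scores = [evaluate_once(run) for run in range(n_runs)]
    return statistics.mean(scores), scores


def save_pairs(path, originals, sanitized):
    """Write aligned original/sanitized texts as a JSON list of records."""
    if len(originals) != len(sanitized):
        raise ValueError("originals and sanitized must have the same length")
    records = [{"original": o, "sanitized": s} for o, s in zip(originals, sanitized)]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def load_pairs(path):
    """Read the records back, e.g. as input for an attacker model."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Here evaluate_once stands in for one sanitize-then-evaluate pass; because sanitization is randomized, each call is expected to return a slightly different score.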


6. Attacker Models

🔹 NEAR (LLM-based attacker)

Implemented in:

attacker_llm.py

Recommended usage:

python run_near_wrapper.py

Requires:

  • Sanitization JSON file from the previous step

🔹 Bayesian Attacker

The Bayesian attacker is external to this repository; its implementation is available here:

https://github.com/mengtong0110/On-the-Vulnerability-of-Text-Sanitization


📦 External Code and Modifications

lm_scorer

The lm_scorer/ folder is based on:

https://github.com/simonepri/lm-scorer

Modifications:

  • Added functionality to support Qwen2 models

LEXTEXT Adjustments

Additional LexText-specific adjustments are located in:

LEXTEXT_ADJUSTMENTS/

🧪 Running Experiments at Scale

  • generate_sanitization_jobs.py → batch sanitization experiments
  • generate_near_jobs.py → batch attacker runs

These scripts demonstrate how to evaluate across:

  • tasks
  • epsilon values
  • mechanisms
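A sweep over these three dimensions can be sketched as below. This is not the content of generate_sanitization_jobs.py: the function name, the task identifiers, the epsilon values, and the output file naming scheme are illustrative assumptions. Only the command-line flags are taken from the usage shown earlier in this README.

```python
import itertools
import shlex


def build_commands(tasks, epsilons, mechanisms):
    """Return one sanitize_and_run_roberta.py command per (task, epsilon, mechanism)."""
    commands = []
    for task, epsilon, mechanism in itertools.product(tasks, epsilons, mechanisms):
        # Hypothetical naming scheme for the saved sanitization JSON.
        output = f"sanitized_{mechanism}_{task}_eps{epsilon}.json"
        args = [
            "python",
            "sanitize_and_run_roberta.py",
            f"--mechanism={mechanism}",
            f"--epsilon={epsilon}",
            f"--task={task}",
            f"--save_sanitize={output}",
        ]
        commands.append(shlex.join(args))  # quote arguments safely for a shell
    return commands


# Example grid; the task names and epsilon values are assumptions.
for command in build_commands(["sst2", "qnli"], [1.0, 5.0], ["lextext", "custext"]):
    print(command)
```

Each printed line can be submitted as a separate job, which keeps the runs independent and easy to parallelize.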

📊 Results Processing

python extract_results.py

Used for:

  • Aggregating results
  • Extracting evaluation metrics

🧰 Utilities

utils2.py

Contains helper functions used across the project.


📝 Notes

  • LexText extends CusText by incorporating PoS information
  • Sanitization is stochastic, so evaluation results are averaged over multiple runs (10 in sanitize_and_run_roberta.py)
  • Attacker models require saved sanitization examples
  • Ensure preprocessing steps are completed before running experiments

📌 Summary

Component    Description
LexText      PoS-aware privacy mechanism
CusText      Baseline mechanism
RoBERTa      Downstream task evaluation
NEAR         LLM-based attacker
Bayesian     External attacker
Embeddings   Preprocessed and normalized

🔧 Troubleshooting

If something fails, verify:

  1. Data exists (prepare_data.py)
  2. Embeddings exist (prepare_embeddings.py)
  3. Model is trained (roberta_tune.py)

About

LexText - Differential privacy with lexical constraints
