PiiScrub

A blazing-fast, lightweight Python library and CLI tool designed to scrub Personally Identifiable Information (PII) from datasets for LLM training and RAG pipelines.

Features

- Maximum Speed & Zero Dependencies: Relies exclusively on Python's standard library. No pandas, spaCy, or other heavy external packages.
- Deterministic Validation: Raw regex matches for high-risk entities (such as credit cards and IP addresses) must pass algorithmic checks (e.g., the Luhn checksum, octet range checks) before being flagged, which cuts down on false positives.
- Pre-compiled Regex: All regular expressions are compiled once at module level with re.compile(), so no pattern is recompiled while scrubbing.
- Large Dataset Streaming: Provides scrub_stream and extract_stream to process massive datasets chunk-by-chunk without running out of memory.
- Multi-Core Parallel Processing: Uses multiple CPU cores to scrub large files faster with --parallel.
- Pre-Bundled Compliance Profiles: Quickly target specific standards like hipaa, pci-dss, or gdpr using the --profile flag.
- Compliance Auditing & Metric Reports: Generate detailed JSON reports with statistics on redacted entities and execution time using --report.
- High-Value Secret Detection: Detects critical assets such as AWS Access Keys, GitHub Tokens, and RSA Private Keys out of the box.
- Deterministic Hashing: Replace PII with deterministic SHA-256 hashes instead of generic tags to track uniqueness without leaking data.
- Synthetic Data Generation: Replace real PII with realistic "fake" data using the faker library (beta).
- Configuration File Support: Manage complex settings via piiscrub.json instead of long CLI commands.
- Custom Pattern Injection: Dynamically inject your own regex patterns and validators directly into the engine without modifying the core library.
- Allowlist Support: Explicitly bypass scrubbing for public figures, system emails, or company identifiers to prevent false positives.
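
The streaming idea can be illustrated with a generator that scrubs one line at a time, so memory use is bounded by the longest line and not by the file size. This is a minimal sketch and not PiiScrub's actual `scrub_stream`: the simplified `EMAIL_RE` pattern and the `scrub_lines` helper are hypothetical names introduced only for illustration.

```python
import io
import re
from typing import Iterable, Iterator

# Simplified, illustrative e-mail pattern (an assumption, not the library's regex).
# Compiled once at module level and reused for every line.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_lines(lines: Iterable[str]) -> Iterator[str]:
    """Lazily yield each input line with e-mail addresses replaced by <EMAIL>."""
    for line in lines:
        yield EMAIL_RE.sub("<EMAIL>", line)


# Any iterable of lines works, including an open file handle.
source = io.StringIO("alice@example.com wrote in\nno pii here\n")
result = list(scrub_lines(source))
print(result)
```

Because the generator never holds more than one line, the same loop can be pointed at a multi-gigabyte file opened with `open(path, encoding="utf-8")`.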

Supported Entities

Global:

- EMAIL
- PHONE_GENERIC (international)
- CREDIT_CARD (13-16 digits with Luhn algorithm validation)
- IPV4 (validation ensuring all octets <= 255)
- IPV6

US Specific:

- US_SSN

India Specific:

- IN_AADHAAR (12 digits, cannot start with 0 or 1)
- IN_PAN (5 uppercase letters, 4 digits, 1 uppercase letter)

Secrets & Credentials:

- AWS_ACCESS_KEY
- GITHUB_TOKEN
- RSA_PRIVATE_KEY
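
The validation rules listed above can be sketched in a few lines of standard-library Python. This is an illustration of the checks as described in the entity list, not the library's own code: the function names `luhn_valid` and `ipv4_valid` and the two India-specific patterns are assumptions written from the descriptions above.

```python
import re

# Patterns written from the entity descriptions above (assumed, not PiiScrub's own).
AADHAAR_RE = re.compile(r"\b[2-9]\d{11}\b")        # 12 digits, first digit 2-9
PAN_RE = re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b")     # 5 letters, 4 digits, 1 letter


def luhn_valid(number: str) -> bool:
    """Return True if the 13-16 digits in `number` satisfy the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 13 <= len(digits) <= 16:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def ipv4_valid(candidate: str) -> bool:
    """Return True if `candidate` has four numeric octets, each <= 255."""
    parts = candidate.split(".")
    return len(parts) == 4 and all(
        p.isascii() and p.isdigit() and int(p) <= 255 for p in parts
    )


print(luhn_valid("4111 1111 1111 1111"), ipv4_valid("256.1.1.1"))
```

A regex match is only reported once the matching validator returns True, which is what filters out random 16-digit numbers and strings such as `999.1.1.1`.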

Installation

pip install .

CLI Usage

Extract PII

piiscrub extract --text "My email is test@example.com"
piiscrub extract --file text.txt

Scrub PII

piiscrub scrub --text "My email is test@example.com"
piiscrub scrub --file text.txt

# Use deterministic hashing instead of standard tags
piiscrub scrub --text "My email is test@example.com" --style hash
# Output: My email is <EMAIL_a1517717>

# Bypass scrubbing for specific public strings
piiscrub scrub --text "Contact support@example.com or user@example.com" --allowlist support@example.com
# Output: Contact support@example.com or <EMAIL>

# Inject Custom Pattern from the CLI
piiscrub scrub --text "This is employee EMP-99881 and email a@b.com" --custom-pattern EMP_ID "\bEMP-\d{5}\b" --entities EMP_ID EMAIL
# Output: This is employee <EMP_ID> and email <EMAIL>

# Synthetic Data Generation
piiscrub scrub --text "Contact me at omkar@example.com" --style synthetic
# Output: Contact me at victoria12@gmail.com
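
To show how the tag, hash, and allowlist behaviours above fit together, here is a rough standard-library sketch. The names `hash_tag` and `scrub_emails` are hypothetical, and the exact hashing scheme PiiScrub uses (input encoding, any salt, digest truncation) is not documented here. The 8-character prefix is only inferred from the sample output, so this sketch is not expected to reproduce the same hash values.

```python
import hashlib
import re

# Simplified, illustrative e-mail pattern (an assumption, not the library's regex).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def hash_tag(entity: str, value: str, length: int = 8) -> str:
    """Build a tag such as <EMAIL_1a2b3c4d> from a truncated SHA-256 digest."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return f"<{entity}_{digest[:length]}>"


def scrub_emails(text: str, style: str = "tag", allowlist=()) -> str:
    """Replace e-mail addresses unless they appear in the allowlist."""
    allowed = set(allowlist)

    def replace(match: re.Match) -> str:
        value = match.group(0)
        if value in allowed:
            return value  # allowlisted values pass through untouched
        return hash_tag("EMAIL", value) if style == "hash" else "<EMAIL>"

    return EMAIL_RE.sub(replace, text)


print(scrub_emails(
    "Contact support@example.com or user@example.com",
    allowlist=["support@example.com"],
))
```

Because the digest depends only on the matched value, the same address always maps to the same tag, which is what lets downstream jobs count unique users without seeing the address.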

Advanced Features

1. Configuration File (`piiscrub.json`)

You can define a piiscrub.json file in your working directory to simplify your commands:

{
    "style": "hash",
    "entities": ["EMAIL", "PHONE_GENERIC"],
    "allowlist": ["support@mycompany.com"],
    "custom_patterns": {
        "ORDER_ID": "ORD-\\d{5}"
    }
}

Now just run:

piiscrub scrub --file data.txt
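
For readers curious how such a file might be consumed, the sketch below loads it with the `json` module and compiles the custom patterns. `load_config` is a hypothetical helper, not a documented PiiScrub function, and the real loader may merge CLI flags differently. Note that `"ORD-\\d{5}"` in JSON decodes to the regex `ORD-\d{5}`, which is why the backslash is doubled in the file.

```python
import json
import re
from pathlib import Path


def load_config(path: str = "piiscrub.json") -> dict:
    """Read a JSON config file and compile its custom patterns (illustrative)."""
    config_file = Path(path)
    if not config_file.exists():
        return {}  # no config file: fall back to defaults
    config = json.loads(config_file.read_text(encoding="utf-8"))
    # Compile each pattern once so it can be reused for every line.
    config["custom_patterns"] = {
        name: re.compile(pattern)
        for name, pattern in config.get("custom_patterns", {}).items()
    }
    return config
```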

2. Parallel Processing

For large files, use multi-core processing:

piiscrub scrub --file large_dataset.txt --parallel --output cleaned.txt

Tip: Parallel mode automatically handles file I/O efficiently and defaults to using all available CPU cores.
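
One plausible way to spread scrubbing across cores is to split the input into chunks of lines and hand each chunk to a worker process. The sketch below uses `concurrent.futures.ProcessPoolExecutor` from the standard library. It is an assumption about the general approach, not a description of what `--parallel` does internally, and `scrub_chunk`, `chunked`, and `scrub_parallel` are hypothetical names.

```python
import re
from concurrent.futures import ProcessPoolExecutor
from typing import List, Optional

# Simplified, illustrative e-mail pattern (an assumption, not the library's regex).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_chunk(lines: List[str]) -> List[str]:
    """Worker: scrub one chunk of lines. Must live at module level to be picklable."""
    return [EMAIL_RE.sub("<EMAIL>", line) for line in lines]


def chunked(lines: List[str], size: int) -> List[List[str]]:
    """Split a list of lines into consecutive chunks of at most `size` lines."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def scrub_parallel(lines: List[str], chunk_size: int = 10_000,
                   workers: Optional[int] = None) -> List[str]:
    """Scrub chunks in worker processes; max_workers=None uses all processors."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so line order is preserved.
        results = pool.map(scrub_chunk, chunked(lines, chunk_size))
        return [line for chunk in results for line in chunk]
```

When trying this out, call `scrub_parallel(...)` from under an `if __name__ == "__main__":` guard, since platforms that spawn worker processes re-import the main module.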

3. Pre-Bundled Compliance Profiles

Quickly target common privacy standards without remembering every entity name:

# Scrub only PCI-DSS related data (Credit Cards)
piiscrub scrub --file transactions.txt --profile pci-dss

# Scrub HIPAA related data (SSN, Phone, Email, IP)
piiscrub scrub --file medical_records.txt --profile hipaa

Available profiles: pci-dss, hipaa, gdpr, strict.
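
Conceptually, a profile is just a named list of entities. The mapping below is a sketch built only from the two profiles whose contents are described above. Treating "IP" as both IPV4 and IPV6 is an assumption, and `gdpr` and `strict` are omitted because their entity lists are not documented here. `PROFILES` and `entities_for` are hypothetical names.

```python
# Profile contents taken from the comments above; not the library's own table.
PROFILES = {
    "pci-dss": ["CREDIT_CARD"],
    "hipaa": ["US_SSN", "PHONE_GENERIC", "EMAIL", "IPV4", "IPV6"],  # "IP" assumed to cover both
}


def entities_for(profile: str) -> list:
    """Return a copy of the entity list for a profile, or raise on unknown names."""
    try:
        return list(PROFILES[profile])
    except KeyError:
        raise ValueError(f"Unknown profile: {profile!r}") from None


print(entities_for("pci-dss"))
```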

4. Compliance Auditing & Metric Reports

Data compliance teams can generate a statistical summary of the scrubbing process as proof of redaction:

piiscrub scrub --file sensitive_data.txt --report audit.json

Sample audit.json output:

{
    "command": "scrub",
    "total_lines_processed": 5000,
    "execution_time_seconds": 1.25,
    "entities_redacted": {
        "EMAIL": 142,
        "CREDIT_CARD": 12,
        "PHONE_GENERIC": 5
    },
    "style": "tag"
}
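
A report with this shape can be assembled from a counter and a timer. The following is a minimal sketch of that bookkeeping, assuming a single illustrative e-mail pattern. `scrub_with_report` is a hypothetical helper, and only the field names are taken from the sample above.

```python
import json
import re
import time
from collections import Counter

# Illustrative pattern table (an assumption, not the library's regexes).
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}


def scrub_with_report(lines):
    """Scrub lines with <ENTITY> tags and return (scrubbed_lines, report_dict)."""
    start = time.perf_counter()
    counts = Counter()
    scrubbed = []
    for line in lines:
        for entity, pattern in PATTERNS.items():
            # subn() returns the new string and the number of replacements made.
            line, replaced = pattern.subn(f"<{entity}>", line)
            counts[entity] += replaced
        scrubbed.append(line)
    report = {
        "command": "scrub",
        "total_lines_processed": len(scrubbed),
        "execution_time_seconds": round(time.perf_counter() - start, 2),
        "entities_redacted": {name: n for name, n in counts.items() if n},
        "style": "tag",
    }
    return scrubbed, report


scrubbed, report = scrub_with_report(["a@b.com and c@d.org", "nothing here"])
print(json.dumps(report, indent=4))
```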

5. Structured Data Support (JSON & CSV)

Target specific fields in structured files to preserve the format while scrubbing PII.

# Scrub only specific keys in a JSON file
piiscrub scrub --file data.json --json-key email secret

# Scrub only specific columns in a CSV file
piiscrub scrub --file data.csv --csv-column phone email
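
The same field-targeting idea can be sketched with the standard `json` and `csv` modules. This is an illustration under stated assumptions, not the CLI's implementation: `scrub_json_keys` and `scrub_csv_columns` are hypothetical helpers, and the sketch assumes that only string values under the named keys or columns are scrubbed while everything else is left untouched.

```python
import csv
import io
import re

# Simplified, illustrative e-mail pattern (an assumption, not the library's regex).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_value(value):
    """Scrub e-mail addresses in strings; return other types unchanged."""
    return EMAIL_RE.sub("<EMAIL>", value) if isinstance(value, str) else value


def scrub_json_keys(node, keys):
    """Recursively scrub string values stored under any of the given keys."""
    if isinstance(node, dict):
        return {
            k: scrub_value(v) if k in keys and isinstance(v, str)
            else scrub_json_keys(v, keys)
            for k, v in node.items()
        }
    if isinstance(node, list):
        return [scrub_json_keys(item, keys) for item in node]
    return node


def scrub_csv_columns(text, columns):
    """Scrub only the named columns of CSV text, keeping header and column order."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for column in columns:
            if column in row:
                row[column] = scrub_value(row[column])
        writer.writerow(row)
    return out.getvalue()


print(scrub_csv_columns("name,email\nAnn,a@b.com\n", ["email"]))
```

Parsing the file first and rewriting only selected fields is what keeps the output valid JSON or CSV. Running a regex over the raw text could clobber quoting or delimiters.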

6. Directory Batch Processing

Process all files within a directory at once. By default, it saves scrubbed files to a new directory with the `_scrubbed` suffix.
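
In library terms, a directory walk of this kind could look roughly like the sketch below, which uses `pathlib` and mirrors the `_scrubbed` default described above. `scrub_directory` is a hypothetical helper, not PiiScrub's API. It assumes UTF-8 text files and a single illustrative e-mail pattern.

```python
import re
from pathlib import Path
from typing import Optional

# Simplified, illustrative e-mail pattern (an assumption, not the library's regex).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_directory(src: str, dst: Optional[str] = None, recursive: bool = False) -> int:
    """Scrub every file under `src` into `dst`; return the number of files written."""
    src_root = Path(src)
    # Default output: a sibling directory named "<src>_scrubbed".
    dst_root = Path(dst) if dst else src_root.with_name(src_root.name + "_scrubbed")
    pattern = "**/*" if recursive else "*"
    count = 0
    # sorted() materializes the listing before any output files are created.
    for path in sorted(src_root.glob(pattern)):
        if not path.is_file():
            continue
        # Keep the relative layout so nested folders are preserved.
        target = dst_root / path.relative_to(src_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        text = path.read_text(encoding="utf-8")
        target.write_text(EMAIL_RE.sub("<EMAIL>", text), encoding="utf-8")
        count += 1
    return count
```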

```bash
# Scrub all files in a directory (shallow)
piiscrub scrub --dir ./input_logs --output ./cleaned_logs

# Scrub all files recursively and preserve folder structure
piiscrub scrub --dir ./dataset --recursive --output ./cleaned_dataset
```

Stream Processing

For extremely large files (e.g. LLM corpus data logs):

```bash
piiscrub scrub --file huge_dataset.jsonl --stream > scrubbed.jsonl
piiscrub extract --file huge_dataset.jsonl --stream > entities.json
```

Library Usage

from piiscrub.core import PiiScrub
import re

# Initialize with custom generic entities or pattern injection!
custom_patterns = {
    "INTERNAL_ID": re.compile(r"\bEMP-\d{5}\b")
}
cs = PiiScrub(
    entities=["EMAIL", "CREDIT_CARD", "INTERNAL_ID"], 
    custom_patterns=custom_patterns,
    allowlist=["public@example.com"]
)

code = "Contact test@example.com for info on EMP-12345."

# Extract entities
extracted = cs.extract_entities(code)
print(extracted)
# {'EMAIL': ['test@example.com'], 'INTERNAL_ID': ['EMP-12345']}

# Scrub entities using hashing
scrubbed_code = cs.scrub_text(code, replacement_style="hash")
print(scrubbed_code)
# Contact <EMAIL_a1517717> for info on <INTERNAL_ID_b5fb38c3>.
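
To make the shape of the extraction result concrete, here is a small standalone sketch that produces the same kind of dictionary from a table of compiled patterns. It does not use PiiScrub itself: `extract_with_patterns` is a hypothetical function, and the e-mail regex is a simplified stand-in for whatever the library ships.

```python
import re

# Illustrative pattern table; INTERNAL_ID matches the custom pattern shown above.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "INTERNAL_ID": re.compile(r"\bEMP-\d{5}\b"),
}


def extract_with_patterns(text, patterns=PATTERNS, allowlist=()):
    """Return {entity: [matches]} for every pattern that matched, minus the allowlist."""
    allowed = set(allowlist)
    found = {}
    for entity, pattern in patterns.items():
        matches = [m for m in pattern.findall(text) if m not in allowed]
        if matches:
            found[entity] = matches
    return found


print(extract_with_patterns("Contact test@example.com for info on EMP-12345."))
# {'EMAIL': ['test@example.com'], 'INTERNAL_ID': ['EMP-12345']}
```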
