## LLM-based PII Modification on Enron Email Dataset with NeMo Curator

## Introduction

This tutorial demonstrates how to use NeMo Curator's PII (Personally Identifiable Information) modification capabilities on a real-world dataset. We'll use a subset of the Enron email dataset to showcase both asynchronous and synchronous LLM-based PII modification approaches.

## Why PII Modification?

PII modification is crucial for:
1. **Data Privacy with Utility**: Transforming sensitive information while maintaining data usefulness
2. **Training Data Quality**: Creating realistic but privacy-safe training data
3. **Regulatory Compliance**: Meeting privacy requirements while preserving data characteristics
4. **Research Value**: Enabling research on sensitive datasets without compromising privacy
5. **Safe ML Training**: Ensuring ML models don't learn or expose private information

## Example Transformations

Original Email:

```
From: john.doe@enron.com

Subject: Meeting with Sarah

Hi Sarah, Please call me at (555) 123-4567 to discuss the project.
```

After PII Redaction:

```
From: [EMAIL]

Subject: Meeting with [PERSON]

Hi [PERSON], Please call me at [PHONE_NUMBER] to discuss the project.
```
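As a rough illustration of the transformation above, the sketch below reproduces part of it with two hand-written regular expressions. This is not how NeMo Curator's LLM-based modifiers work; the patterns and the `redact_with_regex` helper are hypothetical and cover only email addresses and `(NNN) NNN-NNNN` phone numbers. Names such as "Sarah" are left untouched, which is one reason an LLM-based approach is attractive.

```python
import re

# Hypothetical patterns for illustration only; real PII detection needs far more coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}")


def redact_with_regex(text: str) -> str:
    """Replace email addresses and (NNN) NNN-NNNN phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE_NUMBER]", text)


sample = "From: john.doe@enron.com\nHi Sarah, Please call me at (555) 123-4567 to discuss the project."
print(redact_with_regex(sample))
```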

In [1]:
import os
import tarfile

import pandas as pd
import requests
from tqdm.auto import tqdm

from nemo_curator.datasets import DocumentDataset
from nemo_curator.modifiers.async_llm_pii_modifier import AsyncLLMPiiModifier
from nemo_curator.modifiers.llm_pii_modifier import LLMPiiModifier
from nemo_curator.modules.modify import Modify
from nemo_curator.utils.distributed_utils import get_client



# Download Sample Data

## Dataset Information

The [Enron Email Dataset](https://www.cs.cmu.edu/~enron/) is a large public dataset containing real-world business emails from Enron Corporation employees. It was made public during the legal investigation of the Enron corporation and has become a valuable resource for research in natural language processing and email analysis.

## Structure
The dataset is organized as follows:
```
- maildir/
  - user1/
    - inbox/
    - sent/
    - deleted_items/
    ...
  - user2/
    ...
```
In this tutorial, we will focus on extracting and processing emails from Phillip Allen's mailbox (`allen-p`).

## Step 1: Download Dataset

This function downloads the Enron email dataset directly from the official CMU source if it is not already present locally. Having a local copy of the dataset allows you to efficiently experiment with PII redaction and other text processing tasks.

In [None]:
def download_enron_dataset(target_dir: str = "enron_data") -> str:
    """Download the full Enron email dataset and return the path to the archive"""
    url = "https://www.cs.cmu.edu/~enron/enron_mail_20150507.tar.gz"
    tar_file = os.path.join(target_dir, "enron_mail_20150507.tar.gz")

    # Create target directory
    os.makedirs(target_dir, exist_ok=True)

    # Download if not already downloaded
    if not os.path.exists(tar_file):
        print(f"Downloading Enron dataset from {url}")
        response = requests.get(url, stream=True, timeout=10)
        response.raise_for_status()  # Fail early on HTTP errors instead of saving an error page
        total_size = int(response.headers.get("content-length", 0))

        with open(tar_file, "wb") as f, tqdm(total=total_size, unit="iB", unit_scale=True, desc="Downloading") as pbar:
            for data in response.iter_content(chunk_size=1024 * 1024):
                size = f.write(data)
                pbar.update(size)
    else:
        print(f"Found existing download at {tar_file}")

    return tar_file


# Download the dataset
tar_file = download_enron_dataset()

Downloading Enron dataset from https://www.cs.cmu.edu/~enron/enron_mail_20150507.tar.gz


Downloading: 100%|██████████| 443M/443M [00:49<00:00, 8.95MiB/s] 
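Because a large download can be interrupted, it may be worth confirming that the archive is readable before extracting it. The following is a minimal sketch, not part of the original notebook: `is_readable_tar_gz` is a hypothetical helper that opens the file with the standard `tarfile` module and reads the first member header.

```python
import tarfile


def is_readable_tar_gz(path: str) -> bool:
    """Return True if `path` opens as a gzip-compressed tar and its first header can be read."""
    try:
        with tarfile.open(path, "r:gz") as tar:
            tar.next()  # Reads the first member header; returns None for an empty archive
        return True
    except (tarfile.TarError, OSError, EOFError):
        return False


# Hypothetical usage with the path returned by download_enron_dataset():
# print(is_readable_tar_gz("enron_data/enron_mail_20150507.tar.gz"))
```

This only inspects the start of the archive. A file truncated partway through would still pass, so comparing the file size on disk with the `content-length` header is a useful complement.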


## Step 2: Extract Sample Data

This step extracts all email messages for a specific user (in this case, `allen-p`) from the full Enron dataset archive. Focusing on a single user's mailbox lets us efficiently test and demonstrate PII redaction techniques on a smaller, more manageable set of real-world emails.

In [None]:
def extract_user_mailbox(tar_file: str, username: str = "allen-p", target_dir: str = "enron_data") -> str:
    """Extract a specific user's mailbox from the dataset and return its path"""
    maildir_path = os.path.join(target_dir, "maildir")
    user_path = os.path.join(maildir_path, username)

    if not os.path.exists(user_path):
        print(f"Extracting {username}'s mailbox...")
        with tarfile.open(tar_file, "r:gz") as tar:
            # Get all members that belong to the specified user
            members = [m for m in tar.getmembers() if m.name.startswith(f"maildir/{username}/")]

            # Extract user's mailbox
            for member in tqdm(members, desc=f"Extracting {username}'s emails"):
                tar.extract(member, target_dir)
    else:
        print(f"Found existing extraction at {user_path}")

    return user_path


# Extract Allen's mailbox
allen_path = extract_user_mailbox(tar_file, username="allen-p")
print(f"\nExtracted to: {allen_path}")

# List contents of Allen's mailbox
print("\nMailbox structure:")
MAX_FILES_TO_SHOW = 5
for root, _dirs, files in os.walk(allen_path):
    level = root.replace(allen_path, "").count(os.sep)
    indent = " " * 4 * level
    print(f"{indent}{os.path.basename(root)}/")
    if files:
        subindent = " " * 4 * (level + 1)
        for f in files[:MAX_FILES_TO_SHOW]:  # Show first 5 files in each directory
            print(f"{subindent}{f}")
        if len(files) > MAX_FILES_TO_SHOW:
            print(f"{subindent}... ({len(files) - MAX_FILES_TO_SHOW} more files)")

Extracting allen-p's mailbox...


Extracting allen-p's emails: 100%|██████████| 3044/3044 [00:05<00:00, 581.85it/s]



Extracted to: enron_data/maildir/allen-p

Mailbox structure:
allen-p/
    sent/
        423.
        453.
        424.
        105.
        73.
        ... (557 more files)
    _sent_mail/
        423.
        453.
        424.
        105.
        73.
        ... (597 more files)
    inbox/
        73.
        41.
        21.
        22.
        3.
        ... (61 more files)
    notes_inbox/
        41.
        21.
        22.
        3.
        11.
        ... (43 more files)
    all_documents/
        423.
        453.
        424.
        105.
        73.
        ... (623 more files)
    discussion_threads/
        423.
        453.
        424.
        105.
        73.
        ... (407 more files)
    contacts/
        2.
        1.
    straw/
        3.
        2.
        5.
        8.
        4.
        ... (3 more files)
    sent_items/
        105.
        73.
        299.
        173.
        266.
        ... (340 more files)
    deleted_items/
        423.
        453.
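As a variant of the extraction step above, the sketch below walks the archive lazily and adds a guard against entries that would be written outside the target directory. It is an illustration under stated assumptions, not the notebook's method: `extract_prefix_safely` is a hypothetical helper, and the path check is a hand-rolled precaution.

```python
import os
import tarfile


def extract_prefix_safely(tar_path: str, prefix: str, target_dir: str) -> int:
    """Extract regular files whose names start with `prefix`, skipping paths that escape `target_dir`."""
    target_root = os.path.realpath(target_dir)
    extracted = 0
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar:  # Iterates lazily instead of building the full member list first
            if not member.isfile() or not member.name.startswith(prefix):
                continue
            destination = os.path.realpath(os.path.join(target_root, member.name))
            if os.path.commonpath([target_root, destination]) != target_root:
                continue  # Skip entries such as "../x" that would land outside target_dir
            tar.extract(member, target_root)
            extracted += 1
    return extracted


# Hypothetical usage:
# extract_prefix_safely("enron_data/enron_mail_20150507.tar.gz", "maildir/allen-p/", "enron_data")
```

Recent Python versions also accept a `filter` argument on the `tarfile` extraction methods, which may be preferable to a manual check where it is available.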
   

## Step 3: Load Sample Emails

This step reads a selection of raw email files from the extracted user mailbox (Allen-P) and loads them into a structured DataFrame, preparing the data for further processing and PII redaction.

In [4]:
def load_emails_from_folder(folder_path: str, max_emails: int = 10) -> pd.DataFrame:
    """Load emails from a specific folder into a DataFrame"""
    emails = []
    email_count = 0

    print(f"Loading emails from {folder_path}")
    for file in os.listdir(folder_path):
        if email_count >= max_emails:
            break

        file_path = os.path.join(folder_path, file)
        try:
            with open(file_path, encoding="latin-1") as f:
                content = f.read()
                # Basic validation that it's an email
                if "Message-ID:" in content or "Date:" in content:
                    emails.append({"id": f"allen-p_{email_count}", "text": content, "file_path": file_path})
                    email_count += 1
        except (OSError, UnicodeDecodeError) as e:
            print(f"Error reading {file_path}: {e}")
            continue

    return pd.DataFrame(emails)


# Load sample emails from Allen's inbox
inbox_path = os.path.join(allen_path, "inbox")
sample_df = load_emails_from_folder(inbox_path, max_emails=10)

print(f"\nLoaded {len(sample_df)} emails")
print("\nSample DataFrame info:")
print(sample_df.info())

# Show preview of first email
print("\nFirst email preview (first 300 characters):")
print(sample_df.iloc[0]["text"][:300])

Loading emails from enron_data/maildir/allen-p/inbox

Loaded 10 emails

Sample DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         10 non-null     object
 1   text       10 non-null     object
 2   file_path  10 non-null     object
dtypes: object(3)
memory usage: 368.0+ bytes
None

First email preview (first 300 characters):
Message-ID: <17733064.1075862166101.JavaMail.evans@thyme>
Date: Tue, 27 Nov 2001 17:05:30 -0800 (PST)
From: jwills3@swbell.net
To: k..allen@enron.com
Subject: Re: PO spreadsheets
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: James Wills <jwills3
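The notebook keeps each message as one raw string, headers included. If only the body or specific headers should be sent for redaction, Python's standard `email` package can split them. The sketch below is an optional preprocessing idea rather than something the tutorial does: the `split_headers_and_body` helper is hypothetical and the body text is a stand-in, while the header values are copied from the preview above.

```python
from email import message_from_string

raw_email = (
    "Message-ID: <17733064.1075862166101.JavaMail.evans@thyme>\n"
    "Date: Tue, 27 Nov 2001 17:05:30 -0800 (PST)\n"
    "From: jwills3@swbell.net\n"
    "To: k..allen@enron.com\n"
    "Subject: Re: PO spreadsheets\n"
    "\n"
    "Body text goes here.\n"  # Stand-in body, not the real message
)


def split_headers_and_body(raw: str) -> dict:
    """Parse a raw RFC 822-style message into a few headers plus the body text."""
    message = message_from_string(raw)
    return {
        "from": message["From"],
        "to": message["To"],
        "subject": message["Subject"],
        "body": message.get_payload(),
    }


parsed = split_headers_and_body(raw_email)
print(parsed["subject"])  # Re: PO spreadsheets
```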


## Step 4: Convert to `DocumentDataset`

NeMo Curator requires you to convert your DataFrame to a `DocumentDataset` for efficiency, compatibility, and seamless integration with its LLM-based PII redaction and text modification tools.

In [12]:
dataset = DocumentDataset.from_pandas(sample_df, npartitions=2)

In [None]:
# View a sample email in its original form
print(sample_df["text"].iloc[1])

Message-ID: <1449918.1075858645402.JavaMail.evans@thyme>
Date: Mon, 29 Oct 2001 17:35:18 -0800 (PST)
From: arsystem@mailman.enron.com
To: k..allen@enron.com
Subject: Your Approval is Overdue: Access Request for matt.smith@enron.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: ARSystem <ARSystem@mailman.enron.com>@ENRON
X-To: Allen, Phillip K. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=PALLEN>
X-cc: 
X-bcc: 
X-Folder: \PALLEN (Non-Privileged)\Allen, Phillip K.\Inbox
X-Origin: Allen-P
X-FileName: PALLEN (Non-Privileged).pst

This request has been pending your approval for  14 days.  Please click http://itcapps.corp.enron.com/srrs/auth/emailLink.asp?ID=000000000067320&Page=Approval to review and act upon this request.





Request ID          : 000000000067320
Request Create Date : 10/11/01 10:24:53 AM
Requested For       : matt.smith@enron.com
Resource Name       : Risk Acceptance Forms Local Admin Rights - Permanent
Resource Type       : App

In [None]:
# To view the dataset content (DocumentDataset), convert it to a Pandas DataFrame and display the first few rows
dataset.to_pandas().head(10)

|   | id | text | file_path |
|---|----|------|-----------|
| 0 | allen-p_0 | Message-ID: <17733064.1075862166101.JavaMail.e... | enron_data/maildir/allen-p/inbox/73. |
| 1 | allen-p_1 | Message-ID: <1449918.1075858645402.JavaMail.ev... | enron_data/maildir/allen-p/inbox/41. |
| 2 | allen-p_2 | Message-ID: <8113917.1075858644677.JavaMail.ev... | enron_data/maildir/allen-p/inbox/21. |
| 3 | allen-p_3 | Message-ID: <5393535.1075858644700.JavaMail.ev... | enron_data/maildir/allen-p/inbox/22. |
| 4 | allen-p_4 | Message-ID: <10326858.1075855377484.JavaMail.e... | enron_data/maildir/allen-p/inbox/3. |
| 5 | allen-p_5 | Message-ID: <7462038.1075855377703.JavaMail.ev... | enron_data/maildir/allen-p/inbox/11. |
| 6 | allen-p_6 | Message-ID: <2467021.1075862165862.JavaMail.ev... | enron_data/maildir/allen-p/inbox/63. |
| 7 | allen-p_7 | Message-ID: <12246129.1075858645002.JavaMail.e... | enron_data/maildir/allen-p/inbox/33. |
| 8 | allen-p_8 | Message-ID: <14859009.1075862166148.JavaMail.e... | enron_data/maildir/allen-p/inbox/75. |
| 9 | allen-p_9 | Message-ID: <11341209.1075858645204.JavaMail.e... | enron_data/maildir/allen-p/inbox/35. |


## Step 5: Configure and Apply LLM-based PII Modifiers
Below, we use an NVIDIA-hosted NIM with an API key generated from the [model page on build.nvidia.com](https://build.nvidia.com/meta/llama-3_1-70b-instruct). To set up a self-hosted NIM instead, refer to the [NIM documentation](https://docs.nvidia.com/nim/large-language-models/latest/configuration.html).
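The configuration cell in this step leaves `api_key` as an empty string to be filled in. One common way to avoid pasting a secret into a notebook is to read it from an environment variable. The sketch below assumes a variable named `NVIDIA_API_KEY`; that name and the `load_api_key` helper are illustrative choices, not something NeMo Curator requires.

```python
import os


def load_api_key(env_var: str = "NVIDIA_API_KEY") -> str:
    """Read the API key from an environment variable and fail early if it is missing."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the modifiers.")
    return api_key


# Hypothetical usage: pass the result as api_key=... when constructing the modifiers.
# api_key = load_api_key()
```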

In [None]:
# Configure asynchronous LLM-based PII modifier
client = get_client()
async_modifier = AsyncLLMPiiModifier(
    base_url="https://integrate.api.nvidia.com/v1",  # Replace with your endpoint
    api_key="",  # Replace with your API key
    model="meta/llama-3.1-70b-instruct",
    max_concurrent_requests=10,
    # pii_labels=["PERSON", "EMAIL_ADDRESS"], # Example: Only detect specific labels # noqa: ERA001
    language="en",  # Default is 'English'
)

# Configure synchronous LLM-based PII modifier
sync_modifier = LLMPiiModifier(
    base_url="https://integrate.api.nvidia.com/v1",  # Replace with your endpoint
    api_key="",  # Replace with your API key
    model="meta/llama-3.1-70b-instruct",
    # pii_labels=["PERSON", "EMAIL_ADDRESS"], # Example: Only detect specific labels # noqa: ERA001
    language="en",  # Default is 'English'
)

# Create output directories
os.makedirs("output/async_llm_enron", exist_ok=True)
os.makedirs("output/sync_llm_enron", exist_ok=True)

# Perform redaction with both modifiers
print("Redacting with asynchronous LLM-based modifier...")
modify_async = Modify(async_modifier, text_field="text")
modified_async = modify_async(dataset)
modified_async.to_json("output/async_llm_enron", write_to_filename=False)

print("\nRedacting with synchronous LLM-based modifier...")
modify_sync = Modify(sync_modifier, text_field="text")
modified_sync = modify_sync(dataset)
modified_sync.to_json("output/sync_llm_enron", write_to_filename=False)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 42727 instead


Redacting with asynchronous LLM-based modifier...



Writing to disk complete for 2 partition(s)






Redacting with synchronous LLM-based modifier...



Writing to disk complete for 2 partition(s)
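The `max_concurrent_requests=10` setting on the asynchronous modifier suggests that several LLM requests are kept in flight at once, whereas the synchronous modifier has no such setting. The sketch below shows the general pattern of bounding concurrency with an `asyncio.Semaphore`. It does not claim to mirror how `AsyncLLMPiiModifier` is implemented; `fake_llm_call` and `process_all` are hypothetical stand-ins with no network access.

```python
import asyncio


async def fake_llm_call(text: str) -> str:
    """Stand-in for a network request to an LLM endpoint (hypothetical)."""
    await asyncio.sleep(0.01)
    return text.upper()


async def process_all(texts: list[str], max_concurrent_requests: int = 10) -> list[str]:
    """Run one request per text, with at most `max_concurrent_requests` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def bounded_call(text: str) -> str:
        async with semaphore:
            return await fake_llm_call(text)

    # gather preserves input order even though calls finish at different times
    return await asyncio.gather(*(bounded_call(t) for t in texts))


results = asyncio.run(process_all(["first email", "second email"], max_concurrent_requests=2))
print(results)  # ['FIRST EMAIL', 'SECOND EMAIL']
```

Inside a running notebook event loop, `await process_all(...)` would be used in place of `asyncio.run(...)`.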



In [None]:
# Let's check out one of the emails after async modification
print(modified_async.df.head(1).iloc[0]["text"])

Message-ID: <17733064.1075862166101.JavaMail.evans@thyme>
Date: {{DATE}}
From: {{EMAIL_ADDRESS}}
To: {{PERSON}}@enron.com
Subject: Re: PO spreadsheets
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: {{PERSON}} <{{EMAIL_ADDRESS}}>@ENRON
X-To: {{PERSON}} </O=ENRON/OU=NA/CN=RECIPIENTS/CN=PALLEN>, {{EMAIL_ADDRESS}}
X-cc: 
X-bcc: 
X-Folder: \PALLEN (Non-Privileged)\{{PERSON}}\Inbox
X-Origin: Allen-P
X-FileName: PALLEN (Non-Privileged).pst

{{PERSON}}, the insurance/repairs numbers are actually overstated; they are based on calculations from USPSL owners and agents like us who have helped clients buy and sell post offices for years. With regards to the exercising of renewal options, you might be interested to know that the USPS actually renews these 94% of the time; some post offices are in the 75 years and above range on leases. And finally, the construction  costs are reflective of the construction codes and practices the USPS requires f

The steps above identified and redacted several types of PII, such as email addresses and person names.
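To get a quick summary of what was replaced, the placeholders in a redacted text can be counted by label. The sketch below assumes placeholders follow the `{{LABEL}}` form seen in the output above; `count_placeholders` is a hypothetical helper, and the snippet is a few header lines taken from that output.

```python
import re
from collections import Counter

# Assumes placeholders look like {{LABEL}}, as in the output shown above.
PLACEHOLDER_RE = re.compile(r"\{\{([A-Za-z_]+)\}\}")


def count_placeholders(text: str) -> Counter:
    """Count `{{LABEL}}`-style placeholders in a redacted text, keyed by label."""
    return Counter(PLACEHOLDER_RE.findall(text))


redacted_snippet = (
    "From: {{EMAIL_ADDRESS}}\n"
    "To: {{PERSON}}@enron.com\n"
    "X-From: {{PERSON}} <{{EMAIL_ADDRESS}}>@ENRON"
)
print(count_placeholders(redacted_snippet))
```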

## Step 6: Compare Results
This section shows how certain PII (such as email addresses) changed after applying the PII modifiers. We also examine how the email length changed.

In [None]:
def compare_pii_modifications(
    sample_df: pd.DataFrame, modified_async: DocumentDataset, modified_sync: DocumentDataset, idx: int = 0
) -> None:
    """
    Compare the original, async-modified, and sync-modified versions of an email.
    Also show how much the email length changed after PII redaction.
    """
    # Extract the email text for the selected index
    original_email = sample_df["text"].iloc[idx]
    async_email = modified_async.to_pandas()["text"].iloc[idx]
    sync_email = modified_sync.to_pandas()["text"].iloc[idx]

    print("\n=== ORIGINAL EMAIL ===")
    print(original_email[:300])

    print("\n=== ASYNC LLM MODIFIED ===")
    print(async_email[:300])

    print("\n=== SYNC LLM MODIFIED ===")
    print(sync_email[:300])

    # Show how much the length changed
    print("\n=== How Much Did the Emails Change? ===")
    print(f"Original email was {len(original_email)} characters long")
    print(
        f"Async modified version is {len(async_email)} characters (changed by {len(async_email) - len(original_email)} characters)"
    )
    print(
        f"Sync modified version is {len(sync_email)} characters (changed by {len(sync_email) - len(original_email)} characters)"
    )


print("Starting comparison analysis...")
compare_pii_modifications(sample_df, modified_async, modified_sync)

Starting comparison analysis...

=== ORIGINAL EMAIL ===
Message-ID: <17733064.1075862166101.JavaMail.evans@thyme>
Date: Tue, 27 Nov 2001 17:05:30 -0800 (PST)
From: jwills3@swbell.net
To: k..allen@enron.com
Subject: Re: PO spreadsheets
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: James Wills <jwills3

=== ASYNC LLM MODIFIED ===
Message-ID: <17733064.1075862166101.JavaMail.evans@thyme>
Date: Tue, 27 Nov 2001 17:05:30 -0800 (PST)
From: {{email}}
To: k..allen@enron.com
Subject: Re: PO spreadsheets
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: {{name}} <{{email}}>@ENRON
X-

=== SYNC LLM MODIFIED ===
Message-ID: <17733064.1075862166101.JavaMail.evans@thyme>
Date: Tue, 27 Nov 2001 17:05:30 -0800 (PST)
From: {{email}}
To: k..allen@enron.com
Subject: Re: PO spreadsheets
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: {{name}} <{{emai
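The previews above still show `To: k..allen@enron.com` after modification, so not every address was replaced in this run. A simple scan can flag such leftovers for manual review. The sketch below is a rough check under stated assumptions: the regular expression and the `find_residual_emails` helper are hypothetical and target only email-like strings.

```python
import re

# Hypothetical pattern; it only targets email-like strings, not every PII type.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_residual_emails(text: str) -> list[str]:
    """Return email-like strings that are still present in a redacted text."""
    return EMAIL_RE.findall(text)


# Lines taken from the modified preview above
redacted_snippet = "From: {{email}}\nTo: k..allen@enron.com\nX-From: {{name}} <{{email}}>@ENRON"
print(find_residual_emails(redacted_snippet))  # ['k..allen@enron.com']
```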

## Key Findings

- Both methods preserved the email's structure and business content
- Both PII modifiers used the same LLM, and their outputs match in the previews shown above
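Whether two versions of an email really match can be checked programmatically rather than by eye. The sketch below uses the standard `difflib` module; `describe_difference` is a hypothetical helper, and the strings are shortened from the comparison output above. LLM responses may not be deterministic, so a match on one run does not guarantee a match on every run.

```python
import difflib


def describe_difference(text_a: str, text_b: str) -> tuple[float, list[str]]:
    """Return a similarity ratio (1.0 means identical) and the lines that differ."""
    ratio = difflib.SequenceMatcher(None, text_a, text_b).ratio()
    changed = [
        line
        for line in difflib.ndiff(text_a.splitlines(), text_b.splitlines())
        if line.startswith(("- ", "+ "))  # Keep removed/added lines, drop context and hints
    ]
    return ratio, changed


original = "From: jwills3@swbell.net\nTo: k..allen@enron.com"
redacted = "From: {{email}}\nTo: k..allen@enron.com"
ratio, changed = describe_difference(original, redacted)
print(changed)  # ['- From: jwills3@swbell.net', '+ From: {{email}}']
```

Calling the same helper on the async and sync outputs for each row would return a ratio of `1.0` and an empty list wherever the two agree exactly.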

## Conclusion
This tutorial has demonstrated a practical implementation of PII modification using NeMo Curator's LLM-based modifiers on the Enron email dataset. 

The PII modification process used an NVIDIA-hosted LLM to identify and redact personally identifiable information in the emails. Each piece of PII identified was replaced with a generic placeholder such as `{{name}}` or `{{email}}`.