## LLM-based PII Modification with NeMo Curator

This tutorial demonstrates how to use NVIDIA's NeMo Curator library to modify text data containing Personally Identifiable Information (PII) using large language models (LLMs). We'll explore both asynchronous and synchronous approaches using `AsyncLLMPiiModifier` and `LLMPiiModifier`.

PII modification with NeMo Curator provides a sophisticated approach to privacy protection while maintaining data utility. The LLM-based modifiers offer intelligent, context-aware transformations that preserve the natural flow and usefulness of the dataset.

## Using Large Language Models (LLMs) for PII Modification
Beyond rule-based systems like [Presidio](https://microsoft.github.io/presidio/) (used by `PiiModifier`), NeMo Curator also offers capabilities to leverage large language models (LLMs) for identifying and redacting PII. This approach can potentially identify a wider range of PII types or handle more nuanced cases, depending on the LLM used and the provided prompts. This requires access to an LLM endpoint compatible with the [OpenAI API standard](https://platform.openai.com/docs/api-reference/introduction), such as [NVIDIA NIM](https://developer.nvidia.com/nim) (NVIDIA Inference Microservices). NeMo Curator provides two primary modifiers for this purpose:

- `AsyncLLMPiiModifier`: Performs PII detection and redaction using asynchronous calls to the LLM endpoint. This is generally more efficient for large datasets as it can handle multiple requests concurrently.

- `LLMPiiModifier`: Performs PII detection and redaction using synchronous calls to the LLM endpoint. This might be simpler for smaller tasks or debugging but is less scalable.
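
The practical difference between the two approaches can be sketched with the standard library alone. The example below does not use NeMo Curator or a real endpoint: `fake_llm_call` is a hypothetical stand-in that sleeps to simulate request latency, and the semaphore limit illustrates the general idea of capping concurrent requests, not the library's actual implementation.

```python
import asyncio
import time

LATENCY_S = 0.05  # Simulated per-request latency (hypothetical value)


async def fake_llm_call(text: str) -> str:
    """Hypothetical stand-in for one request to an LLM endpoint."""
    await asyncio.sleep(LATENCY_S)
    return text.upper()


async def run_sequential(texts: list[str]) -> list[str]:
    """Send one request at a time, as a synchronous client would."""
    return [await fake_llm_call(t) for t in texts]


async def run_concurrent(texts: list[str], max_concurrent: int) -> list[str]:
    """Send requests concurrently, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(text: str) -> str:
        async with semaphore:
            return await fake_llm_call(text)

    # asyncio.gather returns results in the same order as its inputs
    return await asyncio.gather(*(bounded(t) for t in texts))


def timed(coro) -> tuple[list[str], float]:
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


texts = [f"doc {i}" for i in range(10)]
seq_result, seq_time = timed(run_sequential(texts))
con_result, con_time = timed(run_concurrent(texts, max_concurrent=5))
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

Run this as a script. Inside a notebook an event loop is already running, so `asyncio.run` raises an error there; `await` the coroutines directly instead.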

## Prerequisites

- Python 3.10 or later
- NVIDIA NeMo Curator library
- Access to an NVIDIA Inference Microservice (NIM) endpoint

## Step 1: Installation and Imports

First, let's install the necessary packages and import required libraries.

In [None]:
# Install NeMo Curator with all features
# !pip install --extra-index-url https://pypi.nvidia.com nemo-curator[all]

In [None]:
import os

import pandas as pd

from nemo_curator.datasets import DocumentDataset
from nemo_curator.modifiers.async_llm_pii_modifier import AsyncLLMPiiModifier
from nemo_curator.modifiers.llm_pii_modifier import LLMPiiModifier
from nemo_curator.modules.modify import Modify
from nemo_curator.utils.distributed_utils import get_client


## Step 2: Initialize Dask Client (Optional)

If you're working with large datasets, you might want to initialize a Dask client. This step is optional for small datasets.

In [None]:
# Optional: Start a Dask client (recommended for larger datasets)
client = get_client()

## Step 3: Create Sample Dataset
Let's create a sample dataset containing various types of PII. This dataset will demonstrate different types of personally identifiable information that we want to modify.

In [4]:
# Create sample data with various PII types
data = {
    "doc_id": range(1, 6),
    "text": [
        "Contact Sarah Johnson at sarah.j@company.com or call 555-0123.",
        "Patient ID: 12345, SSN: 123-45-6789, DOB: 01/15/1980",
        "Send payment to Bitcoin wallet 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa",
        "Meeting with Dr. James Wilson at 123 Medical Center, Suite 456, New York, NY 10001",
        "User @tech_jane (Jane Smith) posted from IP address 192.168.1.1",
    ],
}

# Create Pandas DataFrame
df = pd.DataFrame(data)

# Display original data
print("=== Original Dataset ===")
display(df)

# Convert to DocumentDataset
dataset = DocumentDataset.from_pandas(df, npartitions=2)

=== Original Dataset ===


|   | doc_id | text |
|---|---|---|
| 0 | 1 | Contact Sarah Johnson at sarah.j@company.com o... |
| 1 | 2 | Patient ID: 12345, SSN: 123-45-6789, DOB: 01/1... |
| 2 | 3 | Send payment to Bitcoin wallet 1A1zP1eP5QGefi2... |
| 3 | 4 | Meeting with Dr. James Wilson at 123 Medical C... |
| 4 | 5 | User @tech_jane (Jane Smith) posted from IP ad... |


## Step 4: Configure Asynchronous LLM PII Modifier
Now we'll set up the asynchronous LLM-based PII modifier. This modifier uses [asyncio](https://docs.python.org/3/library/asyncio.html) to send multiple requests to the LLM endpoint concurrently, making it suitable for processing large datasets efficiently. The example below uses an NVIDIA-hosted NIM.

Using a self-hosted NIM is the fastest way to run the `AsyncLLMPiiModifier`. See the following documentation on how to set up a local NIM:

 - [NVIDIA NIM](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html)
 - [NVIDIA NeMo Framework User Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/personalidentifiableinformationidentificationandremoval.html#data-curator-pii)

For now, we will skip setting up a local NIM and use an API key generated from the [Llama 3.1 70B Instruct page on build.nvidia.com](https://build.nvidia.com/meta/llama-3_1-70b-instruct).
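
Rather than pasting the key into the notebook, one option is to read it from an environment variable. The helper below is a minimal sketch of that idea; the function name `load_api_key` and the variable name `NIM_API_KEY` are assumptions chosen for this example, not names required by NeMo Curator or NIM.

```python
import os


def load_api_key(env_var: str = "NIM_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is missing.

    The default variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key


# Example usage (uncomment once the variable is set in your shell):
# NIM_API_KEY = load_api_key()
```

Failing early with a clear message avoids sending requests with an empty or placeholder key.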

In [None]:
# Set up your configuration for the LLM
NIM_BASE_URL = "https://integrate.api.nvidia.com/v1"  # Or a local endpoint like "http://0.0.0.0:8000/v1"
NIM_API_KEY = "API key"  # Replace with your own API key
MODEL_NAME = "meta/llama-3.1-70b-instruct"  # Or your desired model compatible with the endpoint
MAX_CONCURRENT_REQUESTS = 10  # Adjust based on your endpoint capacity and rate limits

Note: Choose `MAX_CONCURRENT_REQUESTS` based on your desired throughput and the resources available to the endpoint. Larger models typically need more resources per request.
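
One rough way to pick a starting value is Little's law, which relates the average number of in-flight requests to throughput and latency: concurrency ≈ throughput × latency. The sketch below applies that arithmetic. The helper name `estimate_concurrency` and the throughput and latency figures are hypothetical placeholders, not measurements of any NIM endpoint.

```python
import math


def estimate_concurrency(target_docs_per_sec: float, avg_latency_sec: float) -> int:
    """Little's law: in-flight requests = throughput * latency, rounded up."""
    if target_docs_per_sec <= 0 or avg_latency_sec <= 0:
        raise ValueError("throughput and latency must be positive")
    return max(1, math.ceil(target_docs_per_sec * avg_latency_sec))


# Hypothetical numbers: 5 documents per second at 2 seconds per request
print(estimate_concurrency(5, 2.0))  # 10
```

Treat the result as a starting point only. Rate limits and endpoint capacity may require a lower value.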

In [None]:
# Configure the async PII modifier
# You can customize 'pii_labels' or provide a custom 'system_prompt'
# See nemo_curator.utils.llm_pii_utils for default prompt and labels
async_modifier = AsyncLLMPiiModifier(
    base_url=NIM_BASE_URL,
    api_key=NIM_API_KEY,
    model=MODEL_NAME,
    max_concurrent_requests=MAX_CONCURRENT_REQUESTS,
    # pii_labels=["PERSON", "EMAIL_ADDRESS"], # Example: Only detect specific labels # noqa: ERA001
    language="en",  # Default is 'English'
    # system_prompt="Your custom system prompt here..." # Advanced: Define a custom prompt # noqa: ERA001
)

In [None]:
# Perform async LLM-based PII redaction

async_modified = Modify(async_modifier)(dataset)

# Create output directory if it doesn't exist
output_dir = "output_files_async"
os.makedirs(output_dir, exist_ok=True)

try:
    # Display results
    print("\n=== Async LLM Results ===")
    modified_df = async_modified.to_pandas()  # Convert directly to Pandas
    display(modified_df)

    # Save results using DocumentDataset's to_json method
    async_modified.to_json(
        output_path=output_dir,
        write_to_filename=False,  # This ensures proper partitioning into .part files
        keep_filename_column=False,
    )
    print(f"\nResults saved to: {output_dir}")

    # Optionally also save as Parquet for better performance with large datasets
    parquet_dir = f"{output_dir}_parquet"
    os.makedirs(parquet_dir, exist_ok=True)
    async_modified.to_parquet(output_path=parquet_dir, write_to_filename=False, keep_filename_column=False)
    print(f"Results also saved as Parquet in: {parquet_dir}")


except OSError as e:
    print(f"IO error saving results: {e!s}")
except ValueError as e:
    print(f"Value error: {e!s}")

finally:
    # Optional: shut down the Dask client if one was started
    client.close()


=== Async LLM Results ===


|   | doc_id | text |
|---|---|---|
| 0 | 1 | Contact {{PERSON}} at {{EMAIL_ADDRESS}} or cal... |
| 1 | 2 | Patient ID: {{PATIENT_ID}}, SSN: {{SSN}}, DOB:... |
| 2 | 3 | Send payment to {{LOCATION}} |
| 3 | 4 | Meeting with {{PERSON}} at {{LOCATION}} |
| 4 | 5 | User {{PERSON}} posted from IP address {{LOCAT... |


Writing to disk complete for 2 partition(s)

Results saved to: output_files_async
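
The results above replace each detected entity with a `{{LABEL}}` placeholder. A quick way to sanity-check redacted text is to count those placeholders per label. The following is a minimal standard-library sketch; the regular expression assumes labels consist of uppercase letters and underscores, as in the output shown, and `count_placeholders` is a hypothetical helper, not part of NeMo Curator.

```python
import re
from collections import Counter

# Matches placeholders such as {{PERSON}} or {{EMAIL_ADDRESS}}
PLACEHOLDER = re.compile(r"\{\{([A-Z_]+)\}\}")


def count_placeholders(texts: list[str]) -> Counter:
    """Count each {{LABEL}} placeholder across the given texts."""
    counts = Counter()
    for text in texts:
        counts.update(PLACEHOLDER.findall(text))
    return counts


# Two rows copied from the results table above
redacted = [
    "Send payment to {{LOCATION}}",
    "Meeting with {{PERSON}} at {{LOCATION}}",
]
print(count_placeholders(redacted))
```

With a real run, the same function could be applied to `modified_df["text"].tolist()` to see which labels the model assigned and how often.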


Before running the next step, make sure the Dask workers are still up and running. The previous cell closes the client, so call `get_client()` again to restart it if needed.

## Step 5: Configure Synchronous LLM PII Modifier

Let's also apply the synchronous LLM-based PII modifier to the dataset. It produces the same kind of output as the asynchronous modifier but is slower on larger datasets because requests are sent one at a time.

In [None]:
sync_modifier = LLMPiiModifier(
    base_url=NIM_BASE_URL,
    api_key=NIM_API_KEY,
    model=MODEL_NAME,
    # pii_labels=["PERSON", "EMAIL_ADDRESS"], # Example: Only detect specific labels # noqa: ERA001
    language="en",  # Default is English
    # system_prompt="Your custom system prompt here..." # noqa: ERA001
)

In [None]:
# Perform synchronous LLM-based PII redaction
sync_modified = Modify(sync_modifier)(dataset)

# Display results
print("\n=== Sync LLM Results ===")
modified = sync_modified.to_pandas()  # Convert directly to Pandas
display(modified)


# Save sync results
sync_modified.to_json("sync_modified_data.json")


=== Sync LLM Results ===


|   | doc_id | text |
|---|---|---|
| 0 | 1 | Contact {{PERSON}} at {{EMAIL_ADDRESS}} or cal... |
| 1 | 2 | Patient ID: {{PATIENT_ID}}, SSN: {{SSN}}, DOB:... |
| 2 | 3 | Send payment to {{LOCATION}} |
| 3 | 4 | Meeting with {{PERSON}} at {{LOCATION}} |
| 4 | 5 | User {{PERSON}} posted from IP address {{LOCAT... |


Writing to disk complete for 2 partition(s)
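
To inspect saved results without NeMo Curator, the partition files can be read back with the standard library. The sketch below assumes the output directory contains JSON Lines files (one JSON record per line); that format and the helper name `read_jsonl_dir` are assumptions made for this example, so check the files written on your system first.

```python
import json
from pathlib import Path


def read_jsonl_dir(directory: str) -> list[dict]:
    """Load every non-empty line of every file in `directory` as a JSON record.

    Assumes each file is in JSON Lines format.
    """
    records = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as handle:
            records.extend(json.loads(line) for line in handle if line.strip())
    return records


# Example usage, assuming the async step has already written its output:
# records = read_jsonl_dir("output_files_async")
```

Files are read in sorted name order, so records are grouped by partition file.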


## Conclusion

We have successfully demonstrated how NVIDIA NeMo Curator's LLM-based PII modifiers can intelligently transform text data. We explored both the `AsyncLLMPiiModifier` and `LLMPiiModifier`, highlighting their distinct approaches to privacy protection while preserving data utility.

We observed that both `AsyncLLMPiiModifier` and `LLMPiiModifier` replace the PII in the sample dataset with labeled placeholders. For larger datasets, the asynchronous approach is recommended because it sends requests to the endpoint concurrently.