# Redacting Personally Identifiable Information with NVIDIA's GLiNER-PII Model

This notebook demonstrates the use of NVIDIA's [GLiNER PII model](https://huggingface.co/nvidia/gliner-PII) in NeMo Curator. The GLiNER PII model detects and classifies a broad range of Personally Identifiable Information (PII) and Protected Health Information (PHI) in structured and unstructured text. It is non-generative and produces span-level entity annotations with confidence scores across 55+ categories.

This tutorial requires at least 1 NVIDIA GPU with:
  - Volta™ or higher (compute capability 7.0+)
  - CUDA 12.x
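
As a quick sanity check of the hardware requirement, the sketch below compares a compute capability against the 7.0 minimum. The helper names `parse_capability` and `meets_min_capability` are hypothetical and not part of NeMo Curator; the capability strings are supplied by hand here rather than queried from a device.

```python
def parse_capability(text: str) -> tuple[int, int]:
    """Parse a capability string such as "8.6" into (8, 6)."""
    major, _, minor = text.strip().partition(".")
    return int(major), int(minor or 0)


def meets_min_capability(
    capability: tuple[int, int], minimum: tuple[int, int] = (7, 0)
) -> bool:
    """Return True if a (major, minor) capability is at least `minimum`."""
    return tuple(capability) >= tuple(minimum)


print(meets_min_capability(parse_capability("7.0")))  # True (Volta)
print(meets_min_capability(parse_capability("6.1")))  # False (older than Volta)
```

If PyTorch is installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that can be passed to `meets_min_capability` directly.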

Before running this notebook, see the [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) for instructions on how to install NeMo Curator. Be sure to use an installation method that includes GPU dependencies.

Additionally, the tutorial uses the [GLiNER](https://github.com/urchade/GLiNER) library. It can be installed with:

In [1]:
# !uv pip install gliner

In [2]:
# Silence Curator logs via Loguru
import os

os.environ["LOGURU_LEVEL"] = "ERROR"

The following imports are required for this tutorial:

In [3]:
import pandas as pd
from gliner_pii_redactor import GlinerPiiRedactor

from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader.jsonl import JsonlReader
from nemo_curator.stages.text.io.writer.jsonl import JsonlWriter

To run a pipeline in NeMo Curator, we must start a Ray cluster. This can be done manually (see the [Ray documentation](https://docs.ray.io/en/latest/ray-core/starting-ray.html)) or with Curator's `RayClient`:

In [4]:
try:
    ray_client = RayClient()
    ray_client.start()
except Exception as e:
    msg = f"Error initializing Ray client: {e}"
    raise RuntimeError(msg) from e

# Initialize Read, PII Redaction, and Write Stages

In NeMo Curator, each processing step is called a stage. For this tutorial, we will initialize 3 stages: a JSONL file reader, a PII redactor stage, and a JSONL file writer.

First, we create a small sample JSONL file to use as input:

In [None]:
input_file_path = "./input_data_dir"

# Create sample dataset for the tutorial
text = [
    "Please contact John Doe at johndoe@example.com for more information.",
    "Jane's address is 123 Main St, Anytown, USA.",
    "The phone number of the company is (123) 456-7890.",
    "The patient is 34 years old and has a ssn of 123-45-6789.",
    "This text does not contain any sensitive information.",
]
df = pd.DataFrame({"text": text})

try:
    os.makedirs(input_file_path, exist_ok=True)
    df.to_json(input_file_path + "/data.jsonl", orient="records", lines=True)
except Exception as e:
    msg = f"Error creating input file: {e}"
    raise RuntimeError(msg) from e

2025-10-28 11:24:05,736	INFO usage_lib.py:447 -- Usage stats collection is disabled.
2025-10-28 11:24:05,736	INFO scripts.py:914 -- Local node IP: 127.0.1.1
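
The sample file created above uses the JSON Lines format: one JSON object per line. As a minimal sketch of what that means, the example below writes and reads such a file using only the standard library. The records and the temporary path are made up for illustration and are separate from the tutorial's `input_data_dir`.

```python
import json
import os
import tempfile

# Made-up records for illustration only
records = [
    {"text": "First example document."},
    {"text": "Second example document."},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.jsonl")

    # Write one JSON object per line
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Read the file back line by line, skipping blank lines
    with open(path, encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)  # True
```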


We can define the reader stage with:

In [None]:
# Read existing directory of JSONL files
read_stage = JsonlReader(input_file_path, files_per_partition=1)

We use the `GlinerPiiRedactor` as a wrapper around the [GLiNER](https://github.com/urchade/GLiNER) library. Please see the `gliner_pii_redactor.py` script for the full `ProcessingStage` implementation.

In [7]:
# Initialize the GLiNER PII redactor stage
pii_redactor_stage = GlinerPiiRedactor(text_field="text")
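
Conceptually, redaction turns span-level annotations into placeholders. The sketch below is not the `GlinerPiiRedactor` implementation (see `gliner_pii_redactor.py` for that). It assumes each detected entity is a dictionary with `start`, `end`, `label`, and `score` keys. The helpers `span_for` and `redact_spans`, the sample sentence, the labels, and the scores are all hypothetical and written by hand.

```python
def span_for(text: str, value: str, label: str, score: float) -> dict:
    """Build a span dict for the first occurrence of `value` in `text`."""
    start = text.index(value)
    return {"start": start, "end": start + len(value), "label": label, "score": score}


def redact_spans(text: str, spans: list[dict], threshold: float = 0.5) -> str:
    """Replace each span scoring at least `threshold` with a {label} placeholder.

    Assumes the spans do not overlap.
    """
    kept = [s for s in spans if s["score"] >= threshold]
    # Replace from right to left so earlier character offsets stay valid
    for span in sorted(kept, key=lambda s: s["start"], reverse=True):
        text = text[: span["start"]] + "{" + span["label"] + "}" + text[span["end"] :]
    return text


sample = "Contact Alice Smith at alice@example.org."
# Hand-written spans for illustration, not real model output
spans = [
    span_for(sample, "Alice", "first_name", 0.98),
    span_for(sample, "Smith", "last_name", 0.97),
    span_for(sample, "alice@example.org", "email", 0.99),
    span_for(sample, "Contact", "company_name", 0.20),  # low score, filtered out
]
print(redact_spans(sample, spans))
# Contact {first_name} {last_name} at {email}.
```

The confidence threshold controls the trade-off between missed entities and false positives: lowering it to `0.1` in this example would also replace the word "Contact".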

Finally, we can define a stage for writing the results:

In [8]:
# Write results to a directory
output_file_path = "./gliner_pii_redactor_results"

# Use mode="overwrite" to overwrite the output directory if it already exists,
# so that files from a previous run are not mixed with the new output
write_stage = JsonlWriter(output_file_path, mode="overwrite")

# Initialize Pipeline

In NeMo Curator, we use pipelines to run distributed data workflows using Ray. Pipelines take care of resource allocation and autoscaling to improve throughput and minimize GPU idle time.

For PII redaction, we are able to achieve speedups by ensuring that model inference is run in parallel across all available GPUs, while other stages such as I/O are run across all available CPUs. This is possible because Curator pipelines are composable, which allows each stage in a pipeline to run independently and with its own specified hardware resources.

In [9]:
pii_redaction_pipeline = Pipeline(name="pii_redaction_pipeline", description="Run a PII redaction pipeline")

# Add stages to the pipeline
pii_redaction_pipeline.add_stage(read_stage)
pii_redaction_pipeline.add_stage(pii_redactor_stage)
pii_redaction_pipeline.add_stage(write_stage)

Pipeline(name='pii_redaction_pipeline', stages=[jsonl_reader(JsonlReader), gliner_pii_redactor(GlinerPiiRedactor), jsonl_writer(JsonlWriter)])

Composability is also what allows a model to sit between pre-processing and post-processing stages. Typical text pre-processing add-ons include text normalization (lowercasing, URL/email removal, Unicode cleanup) and language identification and filtering (to keep only target languages). A full pipeline may look something like:

```python
pipeline = Pipeline(name="full_pipeline")
pipeline.add_stage(read_stage)                # reader (JSONL/S3/etc.)
pipeline.add_stage(lang_id_stage)             # optional: language filter
pipeline.add_stage(pii_redactor_stage)        # pii model and redactor
pipeline.add_stage(write_stage)               # writer (JSONL/Parquet)
```
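
To make the text normalization step concrete, here is a minimal sketch of such a function in plain Python. It is not a Curator stage: `normalize_text` and the two regular expressions are hypothetical, deliberately simple, and would still need to be wrapped in a stage before being added to a pipeline.

```python
import re
import unicodedata

# Simplified patterns for illustration; real URL and email matching is more involved
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def normalize_text(text: str) -> str:
    """Apply Unicode cleanup, URL/email removal, and lowercasing."""
    text = unicodedata.normalize("NFKC", text)  # fold compatibility characters
    text = URL_RE.sub("", text)
    text = EMAIL_RE.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace
    return text.lower()


print(normalize_text("Visit  https://example.com or mail Info@Example.com  TODAY"))
# visit or mail today
```

Whether to apply steps like lowercasing before a PII model is a design choice, since casing may carry useful signal for entity detection.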

# Run the PII Redaction Pipeline

Let's run the full pipeline:

In [None]:
# Run the pipeline
result = pii_redaction_pipeline.run()

Since the pipeline ran to completion and the result was written to a JSONL file, we can shut down the Ray cluster with:

In [11]:
try:
    ray_client.stop()
except Exception as e:  # noqa: BLE001
    print(f"Error stopping Ray client: {e}")
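
One optional pattern, not used in this notebook, is to wrap the start and stop calls in a context manager so that the cluster is stopped even if a stage raises. The sketch below relies only on the `start()` and `stop()` methods shown in this tutorial. The `running` helper and the `FakeClient` class are hypothetical, and `FakeClient` stands in for `RayClient` so the example runs without Ray.

```python
from contextlib import contextmanager


@contextmanager
def running(client):
    """Start `client`, yield it, and always call stop() afterwards."""
    client.start()
    try:
        yield client
    finally:
        client.stop()


class FakeClient:
    """Hypothetical stand-in that records calls instead of starting Ray."""

    def __init__(self):
        self.calls = []

    def start(self):
        self.calls.append("start")

    def stop(self):
        self.calls.append("stop")


fake = FakeClient()
try:
    with running(fake):
        raise ValueError("simulated pipeline failure")
except ValueError:
    pass

print(fake.calls)  # ['start', 'stop']
```

With the real client, the same helper could be used as `with running(RayClient()): result = pii_redaction_pipeline.run()`.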

# Inspect the Output

The write stage returns a list of written files. We can read the output file as a Pandas DataFrame for inspection.

In [12]:
# For simplicity, we take the first written file from the writer stage
# In real pipelines, the writer may return multiple files (shards) or objects
result_file = result[0].data[0]

result_df = pd.read_json(result_file, lines=True)
result_df.head()

|   | text |
|---|------|
| 0 | Please contact {first_name} {last_name} at {em... |
| 1 | {first_name}'s address is {street_address}, {c... |
| 2 | The phone number of the company is {phone_numb... |
| 3 | The patient is {age} years old and has a ssn o... |
| 4 | This text does not contain any sensitive infor... |


We can see that the sensitive information was redacted as expected.
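
For larger outputs, eyeballing rows does not scale, so it can help to count how often each placeholder appears. The sketch below assumes placeholders take the `{label}` form seen above. The `count_placeholders` helper and the sample rows are hypothetical and written by hand, not taken from the pipeline output.

```python
import re
from collections import Counter

PLACEHOLDER_RE = re.compile(r"\{(\w+)\}")


def count_placeholders(texts) -> Counter:
    """Count {label} placeholders across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(PLACEHOLDER_RE.findall(text))
    return counts


# Hypothetical redacted rows, written by hand for illustration
rows = [
    "Call {first_name} {last_name} tomorrow.",
    "{first_name} moved last year.",
    "This text does not contain any sensitive information.",
]
print(count_placeholders(rows))
# Counter({'first_name': 2, 'last_name': 1})
```

The same function can be applied to the notebook's results with `count_placeholders(result_df["text"])`.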