<img src="./images/DLI_Header.png" style="width: 400px;">


# 1. Basics of Data Curation

******

Generative AI development requires a heavy data curation process. The quality of the model largely depends on the quality of the data used for training. NVIDIA NeMo Curator is an open-source framework designed to streamline this process by preparing large-scale, high-quality datasets for pretraining and continuous training.

NeMo Curator offers built-in workflows for curating data from various public sources such as Common Crawl, Wikipedia, and arXiv. At the same time, it provides the flexibility to customize pipelines to suit the specific needs of your project.

This notebook walks through the basic data preparation steps involved in most language model development: 

**[1.1 Text Cleaning and Unification](#1.1-Text-Cleaning-and-Unification)<br>**
**[1.2 Document Size Filtering](#1.2-Document-Size-Filtering)<br>**
**[1.3 Filter Personally Identifiable Information (PII)](#1.3-Filter-Personally-Identifiable-Information-(PII))<br>**


***************
### Environment Setup

For large-scale data processing, NeMo Curator provides both GPU and CPU based modules. Understanding how these modules interact and how to configure your environment is key to optimizing performance.

CPU-based modules rely on [Dask](https://www.dask.org/) to distribute computations across multi-node clusters, while GPU-accelerated modules use [RAPIDS](https://rapids.ai/) to handle large-scale datasets efficiently.

Let's first check your current environment.

In [1]:
# check CPU details
! lscpu

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   32
  On-line CPU(s) list:    0-31
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Core(TM) i9-14900HX
    CPU family:           6
    Model:                183
    Thread(s) per core:   2
    Core(s) per socket:   16
    Socket(s):            1
    Stepping:             1
    BogoMIPS:             4838.40
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge m
                          ca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht sysc
                          all nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xt
                          opology tsc_reliable nonstop_tsc cpuid tsc_known_freq 
                          pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2
                          apic movbe popcnt tsc_deadline_timer

In [2]:
import platform
import psutil  # you may need to install: pip install psutil

print(f"Processor: {platform.processor()}")
print(f"Physical cores: {psutil.cpu_count(logical=False)}")
print(f"Total cores: {psutil.cpu_count(logical=True)}")
print(f"Max Frequency: {psutil.cpu_freq().max:.2f} MHz")

Processor: x86_64
Physical cores: 16
Total cores: 32
Max Frequency: 0.00 MHz


In [3]:
# check GPU details
! nvidia-smi

Fri Jan  2 13:01:40 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.75                 Driver Version: 566.24         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4060 ...    On  |   00000000:01:00.0  On |                  N/A |
| N/A   51C    P5              6W /   80W |     625MiB /   8188MiB |     16%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

NeMo Curator provides a simple function `get_client` that can be used to start a local Dask cluster or connect to an existing one.  Let's initialize the Dask Cluster. 

The next cell starts a Dask `LocalCluster` on your CPU. It can be reused for all modules except for deduplication, which requires a GPU cluster.

In [4]:
from nemo_curator.utils.distributed_utils import get_client

# Start Dask cluster with limited workers for WSL stability
client = get_client(
    cluster_type="cpu",
    n_workers=4,  # Reduced from default (32 cores) to prevent worker crashes
    threads_per_worker=2,  # Limit threads per worker
)
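
The worker count above is hard-coded. As a minimal sketch, assuming you would rather derive it from the machine while keeping the same cap of 4, a small helper could look like this. `pick_worker_count` is a hypothetical name and is not part of NeMo Curator or Dask.

```python
import os


def pick_worker_count(max_workers: int = 4) -> int:
    """Return a worker count capped at `max_workers` (hypothetical helper)."""
    # os.cpu_count() can return None on some platforms, so fall back to 1
    logical_cpus = os.cpu_count() or 1
    return max(1, min(max_workers, logical_cpus))


n_workers = pick_worker_count()
print(f"Using {n_workers} worker(s)")
```

The resulting value could then be passed as `n_workers` to `get_client`.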

In [5]:
! pip install nemo-curator

Defaulting to user installation because normal site-packages is not writeable
Collecting nemo-curator
  Downloading nemo_curator-1.0.0-py3-none-any.whl.metadata (14 kB)
Collecting comment_parser (from nemo-curator)
  Downloading comment_parser-1.2.4.tar.gz (8.3 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting cosmos-xenna==0.1.2 (from nemo-curator)
  Downloading cosmos_xenna-0.1.2-py3-none-any.whl.metadata (16 kB)
Collecting jieba==0.42.1 (from nemo-curator)
  Downloading jieba-0.42.1.tar.gz (19.2 MB)


[notice] A new release of pip is available: 25.0.1 -> 25.3
[notice] To update, run: C:\Users\tomas\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip


In [5]:
import nemo_curator
print(nemo_curator.__version__)

0.9.0


In [5]:
! pip install nemo-curator
! pip show nemo-curator

Collecting nemo-curator
  Downloading nemo_curator-1.0.0-py3-none-any.whl.metadata (14 kB)
Collecting absl-py<3.0.0,>=2.0.0 (from nemo-curator)
  Downloading absl_py-2.3.1-py3-none-any.whl.metadata (3.3 kB)
Collecting comment_parser (from nemo-curator)
  Downloading comment_parser-1.2.4.tar.gz (8.3 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting cosmos-xenna==0.1.2 (from nemo-curator)
  Downloading cosmos_xenna-0.1.2-py3-none-any.whl.metadata (16 kB)
Collecting fsspec (from nemo-curator)
  Downloading fsspec-2025.12.0-py3-none-any.whl.metadata (10 kB)
Collecting jieba==0.42.1 (from nemo-curator)
  Downloading jieba-0.42.1.tar.gz (19.2 MB)
     19.2/19.2 MB 90.5 MB/s 0:00:00
  Installing build dependencies ... done
  Getting requirements to 

In [None]:
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}


In [6]:
import nemo_curator
print(f"NeMo Curator version: {nemo_curator.__version__}")

NeMo Curator version: 0.9.0


Learn more about NeMo Curator's CPU and GPU modules with Dask in the dedicated [documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/cpuvsgpu.html).

## 1.1 Multilingual Datasets

In this notebook, we will use a subset of [mC4](https://huggingface.co/datasets/allenai/c4), the multilingual C4 dataset.

For the sake of this exercise, to create a more diverse dataset:
- We merged Spanish and French samples (100 per language)
- We duplicated all samples (making 200 samples per language)
- We shuffled the samples

So, we have 400 samples, 200 from each language. The file is in JSON Lines format (one JSON object per line) with three fields: `text`, `timestamp`, and `url`. 
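
The preparation steps listed above (merge, duplicate, shuffle) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the script actually used to build the file: the records are placeholders, and `build_mixed_dataset` and `make_placeholder_docs` are hypothetical helper names.

```python
import json
import random


def make_placeholder_docs(language: str, count: int) -> list:
    """Create placeholder records with the three fields used in this notebook."""
    return [
        {
            "text": f"{language} sample {i}",
            "timestamp": "2019-01-01 00:00:00",
            "url": f"https://example.com/{language}/{i}",
        }
        for i in range(count)
    ]


def build_mixed_dataset(samples_by_language: dict, seed: int = 0) -> list:
    """Merge per-language samples, duplicate every sample once, then shuffle."""
    merged = [doc for docs in samples_by_language.values() for doc in docs]
    duplicated = merged + [dict(doc) for doc in merged]
    random.Random(seed).shuffle(duplicated)  # seeded for reproducibility
    return duplicated


samples = {
    "es": make_placeholder_docs("es", 100),
    "fr": make_placeholder_docs("fr", 100),
}
mixed = build_mixed_dataset(samples)

# One JSON object per line, as in a JSON Lines file
jsonl_lines = [json.dumps(doc, ensure_ascii=False) for doc in mixed]
print(len(jsonl_lines))
print(sorted(json.loads(jsonl_lines[0]).keys()))
```

The duplicates are introduced on purpose so that a later deduplication step has something to find.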

Let's have a look at the dataset:

In [7]:
# set dataset file path
multilingual_data = "./original_data/file.json"

In [8]:
# check number of samples
! wc -l {multilingual_data}

400 ./original_data/file.json


In [9]:
# Count lines in a file
with open(multilingual_data, 'r', encoding='utf-8') as f:
    line_count = sum(1 for line in f)
print(f"{line_count} {multilingual_data}")

400 ./original_data/file.json


In [10]:
# In a Jupyter cell
! powershell -Command "(Get-Content '{multilingual_data}').Count"

/bin/bash: line 1: powershell: command not found


In [11]:
# show the 3 first samples
! head -n 3 {multilingual_data} | jq

/bin/bash: line 1: jq: command not found


In [12]:
import json
# Read first 3 lines
with open(multilingual_data, 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        if i >= 3:
            break
        print(json.dumps(json.loads(line), indent=2))

{
  "text": "Dragon Ball: Le 20e film de la sage sortira le 14 d\u00e9cembre, premi\u00e8re image teaser sur Buzz, insolite et culture\nDragon Ball: Le 20e film de la sage sortira le 14 d\u00e9cembre, premi\u00e8re image teaser\nLe 20e film Dragon Ball sortira le vendredi 14 d\u00e9cembre 2018. La premi\u00e8re affiche teaser montre un Gok\u00fb jeune adulte, environ celui de la fin de Dragon Ball et le d\u00e9but de Dragon Ball Z. \u00c0 lire aussi >>> Le gouvernement mexicain pr\u00e9voit la diffusion sur place publique des \u00e9pisodes 130 et 131 de Dragon [\u2026]...\nLire la suite du buzz sur bleachmx Source : bleachmx - 12/03/2018 22:31 - trending_up142\nfilm comm\u00e9moration Akira Toriyama dbz dragon ball Dragon Ball Super dragon ball z affiche Dragon Ball Super Anime V Jump D\u00e9cembre 2018 Dragon Ball Z Battle of Gods Dragon Ball Z Fukkatsu No [F] Dragon Ball Z La R\u00e9surrection de [F] Potins Films\nLe site Deadline indique ce Jeudi que le film d\u2019animation Dragon 

Notice that **languages are not annotated in the dataset**, allowing us to leverage AI-based language separation later in the workflow.

Let's now create a document dataset. A `DocumentDataset` can also be built from a pandas DataFrame; for more information on the arguments, see Dask's `from_pandas` documentation.

NeMo Curator's `DocumentDataset` employs Dask's distributed dataframes to manage large datasets across multiple nodes and allow for easy restarting of interrupted curation. `DocumentDataset` supports reading and writing sharded *jsonl* and *parquet* files, both on local disk and on remote sources such as S3 buckets.

Let's load our multilingual dataset with NeMo Curator:

In [13]:
import warnings

warnings.filterwarnings("ignore")

In [14]:
from nemo_curator.datasets import DocumentDataset

multilingual_data_path = "./original_data"
multilingual_dataset = DocumentDataset.read_json(
    multilingual_data_path, add_filename=True
)

Reading 1 files with blocksize='1gb' / files_per_partition=None


In [15]:
! pip list | findstr nemo

/bin/bash: line 1: findstr: command not found
ERROR: Pipe to stdout was broken
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
BrokenPipeError: [Errno 32] Broken pipe


In [16]:
multilingual_dataset.head(3)

Unnamed: 0,file_name,text,timestamp,url
0,file.json,Dragon Ball: Le 20e film de la sage sortira le...,2019-01-21 03:52:10,https://cultinfos.com/buzz/332814-dragon-ball-...
1,file.json,Cours D'histoire Des États Européens: Depuis L...,2019-01-17 23:25:39,https://www.bookvoed.ru/book?id=1433688
2,file.json,Se realizó una jornada de promoción del buentr...,2018-04-21 07:38:28,http://www.desarrollosocial.gob.ar/noticias/se...


## 1.2 Basic Text Cleaning and Unification

NeMo Curator provides various `DocumentModifier` implementations, such as the `UnicodeReformatter`, which uses [ftfy](https://pypi.org/project/ftfy/) ("fixes text for you") to repair Unicode issues in the dataset. 

It is also possible to implement your own text cleaner by subclassing `DocumentModifier`. For instance, we can standardize the inconsistent quotation marks that appear very often in large curated datasets, or remove HTML tags, URLs, email addresses, etc.


Let's first create the output folders to save the cleaned step outputs:

In [17]:
import os

# Set dataset file path
curated_data_path = "./curated"
clean_and_unify_data_path = os.path.join(curated_data_path, "01_clean_and_unify")

# Create directories
os.makedirs(curated_data_path, exist_ok=True)
os.makedirs(clean_and_unify_data_path, exist_ok=True)

Let's now implement a custom text cleaner `QuotationTagUnifier`.

It is designed to modify text documents by normalizing quotation marks and removing unwanted elements. 

The result is a cleaned and standardized text output.

In [18]:
import re

import dask
import pandas as pd
from nemo_curator.modifiers import DocumentModifier, UnicodeReformatter
from nemo_curator.modules.modify import Modify


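class QuotationTagUnifier(DocumentModifier):
    """Normalize quotation marks and tabs, then strip HTML tags and URLs."""

    def modify_document(self, text: str) -> str:
        # Replace curly single and double quotes with their straight ASCII forms
        text = text.replace("‘", "'").replace("’", "'")
        text = text.replace("“", '"').replace("”", '"')
        # Replace each tab with a single space
        text = text.replace("\t", " ")
        # Remove HTML tags (first group) and http/https URLs (second group)
        text = re.sub(
            r"(<[^>]+>)|(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)",
            "",
            text,
        )

        return text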
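
To see what these replacements do without starting Dask, the same logic can be tried on plain strings. The sketch below copies the regular expression from `QuotationTagUnifier` into a standalone function; `unify_quotes_and_strip_tags` is a hypothetical name used only for this illustration.

```python
import re

# Same pattern as in QuotationTagUnifier: HTML tags or http/https URLs
TAG_OR_URL = re.compile(
    r"(<[^>]+>)|(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)"
)


def unify_quotes_and_strip_tags(text: str) -> str:
    # Curly quotes (written as Unicode escapes) become straight ASCII quotes
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    # Tabs become single spaces
    text = text.replace("\t", " ")
    # HTML tags and URLs are removed
    return TAG_OR_URL.sub("", text)


print(unify_quotes_and_strip_tags("Ryan went out to play \u2018footbal\u2019"))
# Ryan went out to play 'footbal'
print(unify_quotes_and_strip_tags("Visit <a href='www.example.com'>example.com</a> or mail info@example.com"))
# Visit example.com or mail info@example.com
```

Note that the pattern has no branch for email addresses, so `info@example.com` is left untouched.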

Next, we can chain modifiers together using the `Sequential` class, which takes a list of operations to be run sequentially and applies them to a given `DocumentDataset`.
 
Let's call this sequence the `cleaners`:

In [19]:
from nemo_curator import Sequential

cleaners = Sequential(
    [
        # Apply: Unify all the quotation marks and remove tags
        Modify(QuotationTagUnifier()),
        # Apply: Unify all unicode
        Modify(UnicodeReformatter()),
    ]
)
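
Conceptually, chaining means that the output of each step becomes the input of the next. The sketch below models that idea with plain Python functions on lists of strings. It is a mental model only, not NeMo Curator's implementation of `Sequential`, and `sequential`, `strip_whitespace` and `lowercase` are hypothetical names.

```python
from functools import reduce


def sequential(steps):
    """Compose steps so that each one receives the previous step's output."""
    def run(data):
        return reduce(lambda current, step: step(current), steps, data)
    return run


def strip_whitespace(docs):
    return [doc.strip() for doc in docs]


def lowercase(docs):
    return [doc.lower() for doc in docs]


pipeline = sequential([strip_whitespace, lowercase])
print(pipeline(["  Hello World ", "NeMo Curator  "]))
# ['hello world', 'nemo curator']
```

The `cleaners` object defined above plays the same role for `DocumentDataset` inputs.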

Let's run that on a toy example with a few sentences:

In [20]:
# create the toy samples
dataframe_toy = pd.DataFrame(
    {
        "text": [
            "Ryan went out to play ‘footbal’",
            "He is very  \t  happy.",
            "Visit <a href='www.example.com'>example.com</a> for more information or contact us at info@example.com",
        ]
    }
)

dataset_toy = DocumentDataset(dask.dataframe.from_pandas(dataframe_toy, npartitions=1))

# check the samples
dataset_toy.head()

Unnamed: 0,text
0,Ryan went out to play ‘footbal’
1,He is very \t happy.
2,Visit <a href='www.example.com'>example.com</a...


Now, let's apply our sequence of cleaners to the toy samples. Dask builds the computation lazily, so to execute this sequence on the Dask cluster we call `.persist()`, which starts the computation and keeps the transformed data in the workers' memory for reuse in later steps. 

In [21]:
dataset_test_clean_and_unify = cleaners(dataset_toy).persist()

Let's check the output.

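The expected output is samples with normalized quotation marks, tabs replaced by spaces, and HTML tags and URLs removed. Note that the regular expression in `QuotationTagUnifier` does not match email addresses, so they are left in place. 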

In [22]:
# check cleaned toy samples
dataset_test_clean_and_unify.head()

Unnamed: 0,text
0,Ryan went out to play 'footbal'
1,He is very happy.
2,Visit example.com for more information or cont...


Now, let's apply this cleaning step to our multilingual dataset. We can achieve this by creating a sequence of curation steps, starting with the cleaning sequence as the first function in our data curation pipeline.

Run the next cell to create the cleaning step as a function that would be the first curation step.

In [23]:
# define the sequence of cleaning operations as a function
def clean_and_unify(dataset: DocumentDataset) -> DocumentDataset:
    cleaners = Sequential(
        [
            # Apply: Unify all the quotation marks and remove tags
            Modify(QuotationTagUnifier()),
            # Apply: Unify all unicode
            Modify(UnicodeReformatter()),
        ]
    )
    return cleaners(dataset)


# sequence of data curation steps; so far, only clean_and_unify is defined
curation_steps = Sequential([clean_and_unify])

Let's now execute this step on our multilingual dataset:

In [24]:
%%time
print("Executing the pipeline...")

dataset_clean_and_unify = curation_steps(multilingual_dataset).persist()

Executing the pipeline...
CPU times: user 28 ms, sys: 0 ns, total: 28 ms
Wall time: 26.6 ms


Let's check some outputs:

In [25]:
dataset_clean_and_unify.head()

Unnamed: 0,file_name,text,timestamp,url
0,file.json,Dragon Ball: Le 20e film de la sage sortira le...,2019-01-21 03:52:10,https://cultinfos.com/buzz/332814-dragon-ball-...
1,file.json,Cours D'histoire Des États Européens: Depuis L...,2019-01-17 23:25:39,https://www.bookvoed.ru/book?id=1433688
2,file.json,Se realizó una jornada de promoción del buentr...,2018-04-21 07:38:28,http://www.desarrollosocial.gob.ar/noticias/se...
3,file.json,Restaurantes con Web Y Telefono Y Dias Y Horar...,2020-08-11 16:33:05,http://mendoza.guia.clarin.com/restaurantes-co...
4,file.json,Responsable qualité - Intérim : Emploi et recr...,2020-08-07 01:17:37,https://images3.meteojob.com/Emploi-Interim-Re...


We can save the resulting `DocumentDataset` to JSON Lines files. 

In [26]:
# save output to json
dataset_clean_and_unify.to_json(clean_and_unify_data_path, write_to_filename=True)

Writing to disk complete for 1 partition(s)


In [34]:
! head -n 1 {clean_and_unify_data_path}/file.jsonl | jq

/bin/bash: line 1: jq: command not found
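
Since `jq` is not installed in this environment, the first records of a JSON Lines file can be pretty-printed with Python's `json` module instead. The sketch below is self-contained: it demonstrates a hypothetical `head_jsonl` helper on a temporary file with placeholder records.

```python
import json
import os
import tempfile
from itertools import islice


def head_jsonl(path, n=1):
    """Return the first `n` records of a JSON Lines file as dictionaries."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in islice(f, n)]


# Demonstration on a throwaway file with placeholder records
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"text": "Bonjour", "url": "https://example.com/fr"}) + "\n")
        f.write(json.dumps({"text": "Hola", "url": "https://example.com/es"}) + "\n")
    first_records = head_jsonl(demo_path, n=1)

print(json.dumps(first_records[0], indent=2))
```

In this notebook, the same helper could be pointed at `os.path.join(clean_and_unify_data_path, "file.jsonl")`.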


## Document Size Filtering

Extremely short documents may lack sufficient context or information for the model to learn meaningful concepts. By filtering out such documents, we can ensure that the data used for training is sufficiently informative and balanced.

Let's explore how to apply word counts and filtering using NeMo Curator.

In [27]:
# import relevant libraries
from nemo_curator import ScoreFilter
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import (
    DocumentFilter,
    RepeatingTopNGramsFilter,
    WordCountFilter,
)

In [28]:
class IncompleteDocumentFilter(DocumentFilter):
    """
    If the document doesn't end with a terminating punctuation mark, then discard.
    """

    def __init__(self):
        super().__init__()
        # Accepted document terminators.
        self._story_terminators = {".", "!", "?", '"', "”"}

    def score_document(self, text: str) -> bool:
        """
        Scores a document by checking its last non-whitespace character.
        Args:
            text (str): The document text.
        Returns:
            bool: True if the text ends with an accepted terminator, False otherwise
                (including for empty or whitespace-only text).
        """
        stripped = text.strip()
        # Guard against empty documents, where indexing [-1] would raise IndexError
        return bool(stripped) and stripped[-1] in self._story_terminators

    def keep_document(self, score) -> bool:
        # The score is already a boolean: keep the document only if it is True
        return score

The following code defines a function, `filter_dataset`, that cleans a `DocumentDataset` by applying several filters:

- **Word Count Filter**: Removes documents with fewer than 80 words (`min_words=80`).
- **Incomplete Document Filter**: Removes incomplete documents.
- **Repeating N-Grams Filters**: Removes documents with excessive repetition of word sequences (2-grams, 3-grams, 4-grams) above certain thresholds (20%, 18%, 16% respectively).

These filters are applied sequentially to refine the dataset.
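
To build intuition for what these three checks measure, here is a simplified standard-library sketch. It is an approximation for illustration, not NeMo Curator's implementation: the repetition score is assumed to be the share of characters covered by the single most frequent n-gram, and all function names are hypothetical.

```python
from collections import Counter

TERMINATORS = (".", "!", "?", '"', "\u201d")


def word_count(text: str) -> int:
    return len(text.split())


def ends_with_terminator(text: str) -> bool:
    return text.strip().endswith(TERMINATORS)


def top_ngram_char_ratio(text: str, n: int) -> float:
    """Share of characters covered by the most frequent word n-gram (simplified)."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    top_ngram, count = Counter(ngrams).most_common(1)[0]
    if count < 2:  # nothing repeats
        return 0.0
    total_chars = sum(len(w) for w in words)
    repeated_chars = sum(len(w) for w in top_ngram) * count
    # Overlapping n-grams can be counted twice, so cap the ratio at 1.0
    return min(1.0, repeated_chars / total_chars)


def keep_document(text, min_words=80, max_ratios=((2, 0.2), (3, 0.18), (4, 0.16))):
    if word_count(text) < min_words:
        return False
    if not ends_with_terminator(text):
        return False
    return all(top_ngram_char_ratio(text, n) <= limit for n, limit in max_ratios)


varied = " ".join(f"word{i}" for i in range(100)) + "."
spam = "buy now " * 60 + "today."
short = "Too short."
print(keep_document(varied), keep_document(spam), keep_document(short))
# True False False
```

The thresholds mirror the ones listed above; the order of the checks matters only for speed, since a document must pass all of them to be kept.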

In [29]:
def filter_dataset(dataset: DocumentDataset) -> DocumentDataset:
    filters = Sequential(
        [
            ScoreFilter(
                WordCountFilter(min_words=80),
                text_field="text",
                score_field="word_count",
            ),
            ScoreFilter(IncompleteDocumentFilter(), text_field="text"),
            ScoreFilter(
                RepeatingTopNGramsFilter(n=2, max_repeating_ngram_ratio=0.2),
                text_field="text",
            ),
            ScoreFilter(
                RepeatingTopNGramsFilter(n=3, max_repeating_ngram_ratio=0.18),
                text_field="text",
            ),
            ScoreFilter(
                RepeatingTopNGramsFilter(n=4, max_repeating_ngram_ratio=0.16),
                text_field="text",
            ),
        ]
    )

    return filters(dataset)

Let's now apply that on our multilingual dataset:

In [30]:
%%time
curation_steps = Sequential([clean_and_unify, filter_dataset])

print("Executing the pipeline...")
filtered_dataset = curation_steps(multilingual_dataset).persist()

Executing the pipeline...
CPU times: user 109 ms, sys: 12.9 ms, total: 122 ms
Wall time: 119 ms


We can check the outputs. Notice that a new field named `word_count` has been added:

In [31]:
filtered_dataset.head()

Unnamed: 0,file_name,text,timestamp,url,word_count
1,file.json,Cours D'histoire Des États Européens: Depuis L...,2019-01-17 23:25:39,https://www.bookvoed.ru/book?id=1433688,111
6,file.json,Copy-Paste ejecutado en el Windows Phone 7.[Ví...,2019-07-20 07:52:44,https://geeksroom.com/2010/12/copy-paste-ejecu...,137
8,file.json,Agenda de eventos y actividades en Barcelona p...,2018-07-22 22:13:01,http://barcelona.carpediem.cd/events/?dt=06.04...,746
11,file.json,"EE.UU | EE.UU\nenero 22, 2014 Juan Pedro Sánch...",2019-09-20 03:50:18,https://makeexperience.wordpress.com/tag/ee-uu/,323
13,file.json,jandrobell - Pesca Mediterraneo 2\njandrobell\...,2018-08-18 11:01:22,http://www.pescamediterraneo2.com/foros/profil...,1865


Let's save the output. We need to create the folder first.

In [32]:
filtered_data_path = os.path.join(curated_data_path, "02_filter_dataset")

! mkdir -p {filtered_data_path}

In [41]:
# save output to json
filtered_dataset.to_json(filtered_data_path, write_to_filename=True)

Writing to disk complete for 1 partition(s)


Check the saved file by running the next cell.

In [33]:
! head -n 1 {filtered_data_path}/file.jsonl | jq

/bin/bash: line 1: jq: command not found


In [46]:
!pip install fr_core_news_sm

Collecting fr_core_news_sm
  Downloading fr_core_news_sm-3.8.0-py3-none-any.whl.metadata (12 kB)
Downloading fr_core_news_sm-3.8.0-py3-none-any.whl (16.3 MB)
   16.3/16.3 MB 43.3 MB/s eta 0:00:00
Installing collected packages: fr_core_news_sm
Successfully installed fr_core_news_sm-3.8.0


### 1.3 PII Identification and Removal

Personally Identifiable Information (PII) identification is designed to detect and remove sensitive data from datasets.

The identification leverages [presidio_analyzer](https://pypi.org/project/presidio-analyzer/), a Python-based service for detecting PII entities in text.

Let's try to analyze a toy sample: *My name is Dana and my number is 212-555-5555*

The expected output is the entity types `PERSON` and `PHONE_NUMBER`, each with its start and end character positions and a confidence score.

```
 type: PERSON, start: 11, end: 15, score: 0.85,
 type: PHONE_NUMBER, start: 33, end: 45, score: 0.75
```
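
The `start` and `end` values are character offsets into the input string, with `end` exclusive, so they can be used directly as slice bounds. The sketch below checks the expected offsets and shows one possible "replace" action. The `<ENTITY_TYPE>` placeholder format and the `replace_spans` helper are assumptions made for this illustration.

```python
text = "My name is Dana and my number is 212-555-5555"

# (entity type, start, end) taken from the expected output above
spans = [("PERSON", 11, 15), ("PHONE_NUMBER", 33, 45)]

for entity_type, start, end in spans:
    print(entity_type, repr(text[start:end]))
# PERSON 'Dana'
# PHONE_NUMBER '212-555-5555'


def replace_spans(text, spans):
    """Replace each span with a placeholder, working right to left
    so that earlier offsets stay valid after each substitution."""
    result = text
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        result = result[:start] + f"<{entity_type}>" + result[end:]
    return result


redacted = replace_spans(text, spans)
print(redacted)
# My name is <PERSON> and my number is <PHONE_NUMBER>
```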

In [40]:
import warnings

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

# Hide deprecation warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Set up the engine, loads the NLP module (spaCy model by default) and other PII recognizers
LANGUAGES_CONFIG_FILE = "./languages-config.yml"
# Create NLP engine based on configuration file
provider = NlpEngineProvider(conf_file=LANGUAGES_CONFIG_FILE)
nlp_engine_with_spanish = provider.create_engine()

analyzer = AnalyzerEngine(
    supported_languages=["en", "es", "fr"], nlp_engine=nlp_engine_with_spanish
)

results = analyzer.analyze(
    text="My name is Dana and my number is 212-555-5555",
    entities=["PHONE_NUMBER", "PERSON"],
    language="en",
)
print(results)

Collecting de-core-news-md==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_md-3.7.0/de_core_news_md-3.7.0-py3-none-any.whl (44.4 MB)
     44.4/44.4 MB 66.2 MB/s eta 0:00:00
Installing collected packages: de-core-news-md
Successfully installed de-core-news-md-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting es-core-news-md==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_md-3.7.0/es_core_news_md-3.7.0-py3-none-any.whl (42.3 MB)
     42.3/42.3 MB 64.0 MB/s eta 0:00:00
Installing collec

Run the analyzer on a French sample:

In [41]:
analyzer.analyze(text="Mon email est mon@example.com", language="fr")

[type: EMAIL_ADDRESS, start: 14, end: 29, score: 1.0,
 type: URL, start: 18, end: 29, score: 0.5]

Try your own examples in one of the three configured languages (`en`, `es`, `fr`) for accurate results.

In [42]:
# avoid naming the variable `input`, which would shadow the Python built-in
sample_text = "my email address is at tomaspries@gmail.com"
analyzer.analyze(text=sample_text, language="en")

[type: EMAIL_ADDRESS, start: 23, end: 43, score: 1.0,
 type: URL, start: 34, end: 43, score: 0.5]
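
In both samples above, the domain part of the email address is also reported as a lower-scoring `URL`. One simple way to reason about such overlaps is to keep the highest-scoring detection and drop anything that overlaps it. The sketch below applies that rule to the French sample's results; it illustrates the idea only and is not a description of how Presidio or NeMo Curator resolve conflicts. `resolve_overlaps` is a hypothetical helper.

```python
def resolve_overlaps(detections):
    """Keep the highest-scoring detections and drop any that overlap them.

    Each detection is a tuple: (entity_type, start, end, score).
    """
    kept = []
    # Visit candidates from highest to lowest score (ties: leftmost first)
    for candidate in sorted(detections, key=lambda d: (-d[3], d[1])):
        _, start, end, _ = candidate
        # Two spans are disjoint if one ends before the other starts
        if all(end <= k[1] or start >= k[2] for k in kept):
            kept.append(candidate)
    return sorted(kept, key=lambda d: d[1])


# Results reported for "Mon email est mon@example.com"
detections = [("EMAIL_ADDRESS", 14, 29, 1.0), ("URL", 18, 29, 0.5)]
resolved = resolve_overlaps(detections)
print(resolved)
# [('EMAIL_ADDRESS', 14, 29, 1.0)]
```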

NeMo Curator integrates PII identification and removal, leveraging Dask for efficient parallelization. The tool currently supports the identification and removal of the following sensitive data types:

`ADDRESS`, `CREDIT_CARD`, `EMAIL_ADDRESS`, `DATE_TIME`, `IP_ADDRESS`, `LOCATION`, `PERSON`, `URL`, `US_SSN`, `US_PASSPORT`, `US_DRIVER_LICENSE`, `PHONE_NUMBER`

Let's run the NeMo Curator PII modifier, `PiiModifier`, on a toy sample. 

In [43]:
# create toy samples with PII data
dataframe_toy = pd.DataFrame(
    {
        "text": [
            "Ryan went out to play football",
            "His email is ryan@example.com and phone is 212-555-5555",
        ]
    }
)
dataset_toy = DocumentDataset(dask.dataframe.from_pandas(dataframe_toy, npartitions=1))

dataset_toy.head()

Unnamed: 0,text
0,Ryan went out to play football
1,His email is ryan@example.com and phone is 212...


In [63]:
client.restart()



Let's build and apply the `PiiModifier` on the toy sample. 

In [64]:
from nemo_curator.modifiers.pii_modifier import PiiModifier

modifier = PiiModifier(
    batch_size=2000,
    language="en",
    supported_entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
    anonymize_action="replace",
    device="cpu",  # force CPU usage
)

modify = Modify(modifier)
modified_dataset = modify(dataset_toy)

In [65]:
# check modified data
modified_dataset.head()

2026-01-02 13:20:33,772 - distributed.worker - ERROR - Compute Failed
Key:       ('modify_document-3e3a6604bba9b67bf50c243ae92a51ce', 0)
State:     executing
Task:  <Task ('modify_document-3e3a6604bba9b67bf50c243ae92a51ce', 0) apply_and_enforce(..., ...)>
Exception: "ValueError('Cannot use GPU, CuPy is not installed')"
Traceback: '  File "/home/aibeceles/lession_01_wsl/nemo_env_clean/lib/python3.12/site-packages/dask/dataframe/core.py", line 98, in apply_and_enforce\n    df = func(*args, **kwargs)\n         ^^^^^^^^^^^^^^^^^^^^^\n  File "/home/aibeceles/lession_01_wsl/nemo_env_clean/lib/python3.12/site-packages/nemo_curator/modifiers/pii_modifier.py", line 79, in modify_document\n    deidentifier = load_object_on_worker("deidentifier", self.load_deidentifier, {})\n                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/home/aibeceles/lession_01_wsl/nemo_env_clean/lib/python3.12/site-packages/nemo_curator/utils/distributed_utils.py", line 1103, in 

ValueError: Cannot use GPU, CuPy is not installed

Now, let's integrate this PII identification step into our curation sequence and apply it to the multilingual dataset. This will ensure that sensitive data is properly detected and removed while maintaining data quality. 

Let's first create the `redact_pii` function for PII identification and removal.

In [None]:
from nemo_curator.modifiers.pii_modifier import PiiModifier


def redact_pii(dataset: DocumentDataset) -> DocumentDataset:
    redactor = Modify(
        PiiModifier(
            supported_entities=["PERSON"],
            anonymize_action="replace",
            device="cpu",
        ),
    )

    return redactor(dataset)

Let's now run the sequence of curation steps, including the PII removal function:


In [None]:
%%time
curation_steps = Sequential([clean_and_unify, filter_dataset, redact_pii])

print("Executing the pipeline...")
redact_pii_dataset = curation_steps(multilingual_dataset).persist()

In [None]:
# check the redacted data
redact_pii_dataset.head()

Let's now save the redacted data. We need to create the folder to save the output.

In [None]:
redact_pii_data_path = os.path.join(curated_data_path, "03_redact_pii_data_path")

! mkdir -p {redact_pii_data_path}

In [None]:
# save
redact_pii_dataset.to_json(redact_pii_data_path, write_to_filename=True)

In [None]:
# check the saved file
! head -n 1 {redact_pii_data_path}/file.jsonl | jq

The current PII removal implementation in NeMo Curator is limited to HPC clusters using Slurm as the resource manager. Check the [documentation](https://github.com/NVIDIA/NeMo-Curator/blob/main/docs/user-guide/personalidentifiableinformationidentificationandremoval.rst) for more details.

---
<h2 style="color:green;">Congratulations!</h2>

In this notebook, you have used NeMo Curator to apply a sequence of basic text curation steps designed to clean and preprocess the dataset.

Before moving on to the next notebook, make sure to stop the Dask cluster. Please run the next cell.

In [None]:
import IPython

IPython.Application.instance().kernel.do_shutdown(True)  # automatically restarts kernel


We are now ready to move to the next notebook to explore advanced data preparation steps. 

Let's move to [02_advanced_data_processing](02_advanced_data_processing.ipynb).

