Semantic Dedup Tutorial + bug fixes #1067
Conversation
Signed-off-by: Praateek <praateekm@gmail.com>
| logger.success(f"Total documents identified as duplicates: {total_duplicates}") | ||
| logger.info(f"Similarity threshold used: {1.0 - self.eps:.3f} (eps={self.eps})") | ||
| else: | ||
| elif self.eps is not None: |
This was a bug: when eps is None, the logger call fails because it evaluates {1.0 - self.eps}.
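For illustration, a minimal standalone sketch of the failure mode and the guard (not the actual workflow code; names are simplified):

```python
eps = None  # hypothetical: the user did not request duplicate identification

# Before the fix: this f-string raises a TypeError because 1.0 - None is invalid.
# logger.info(f"Similarity threshold used: {1.0 - eps:.3f} (eps={eps})")

# After the fix: only format the message when eps is actually set.
if eps is not None:
    print(f"Similarity threshold used: {1.0 - eps:.3f} (eps={eps})")
```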
Signed-off-by: Praateek <praateekm@gmail.com>
Signed-off-by: Praateek <praateekm@gmail.com>
sarahyurick
left a comment
Thanks for working on this, left some thoughts.
I think we can add a README for how to download the dataset?
| "workflow = TextSemanticDeduplicationWorkflow(\n", | ||
| " input_path=input_path_to_data,\n", | ||
| " output_path=output_path,\n", | ||
| " cache_path=cache_path,\n", | ||
| " perform_removal=True,\n", | ||
| " # Embedding generation parameters\n", | ||
| " text_field=\"text\",\n", | ||
| " model_identifier=\"sentence-transformers/all-MiniLM-L6-v2\",\n", | ||
| " embedding_max_seq_length=512,\n", | ||
| " embedding_max_chars=None,\n", | ||
| " embedding_pooling=\"mean_pooling\",\n", | ||
| " embedding_model_inference_batch_size=256,\n", | ||
| " # Semantic deduplication parameters\n", | ||
| " n_clusters=100, # this number can be much higher when the data is large\n", | ||
| " # For large scale data we should use CURATOR_DEDUP_ID_STR if we are\n", | ||
| " # also performing removal.\n", | ||
| " id_field=\"id\",\n", | ||
| " eps=0.01,\n", | ||
| " # K-means clustering parameters\n", | ||
| " ranking_strategy=RankingStrategy(metadata_cols=[\"cosine_dist_to_cent\"], ascending=True),\n", | ||
| " pairwise_batch_size=1024,\n", | ||
| " # ID generator parameters\n", | ||
| " # For large scale data we should set use_id_generator to True if we are performing removal.\n", | ||
| " use_id_generator=False,\n", | ||
| " id_generator_state_file=None,\n", | ||
| " # I/O parameters\n", | ||
| " input_filetype=input_filetype,\n", | ||
| " input_files_per_partition=1,\n", | ||
| " output_filetype=output_filetype,\n", | ||
| " verbose=True,\n", | ||
| " clear_output=True,\n", | ||
| ")" |
I was wondering if we could break these values into individual cells per stage, for example a cell might contain just:
# K-means clustering parameters
ranking_strategy = RankingStrategy(metadata_cols=["cosine_dist_to_cent"], ascending=True)
pairwise_batch_size = 1024
with a small markdown explanation for what the stage does and what each parameter means (could be copy/pasted from the docstring). Then we have a final cell defining TextSemanticDeduplicationWorkflow with all of the parameters listed from the previous cells.
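For illustration, the final cell described above might look roughly like this (a hypothetical sketch; it simply reuses variables such as ranking_strategy and pairwise_batch_size from the earlier per-stage cells, and the parameter names come from the workflow call quoted above):

```python
# Hypothetical final cell: assemble the workflow from the variables defined in
# the per-stage cells above (embedding, semantic dedup, k-means, I/O, ...).
workflow = TextSemanticDeduplicationWorkflow(
    input_path=input_path_to_data,
    output_path=output_path,
    cache_path=cache_path,
    perform_removal=True,
    # Embedding generation parameters (defined in an earlier cell)
    text_field=text_field,
    model_identifier=model_identifier,
    # Semantic deduplication parameters (defined in an earlier cell)
    n_clusters=n_clusters,
    id_field=id_field,
    eps=eps,
    # K-means clustering parameters (defined in an earlier cell)
    ranking_strategy=ranking_strategy,
    pairwise_batch_size=pairwise_batch_size,
)
```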
I think we can leave this one as is, since the step_by_step notebook already does the "breaking it up".
At the end of the day the user has to pass everything together, if that makes sense.
| "name": "stderr", | ||
| "output_type": "stream", | ||
| "text": [ | ||
| "2025-09-16 11:59:48,062\tINFO worker.py:1630 -- Using address 127.0.1.1:6379 set in the environment variable RAY_ADDRESS\n", |
Personal preference, we could clear these logs for the final commit, to help the tutorial look a bit cleaner. Ultimately up to you though, since the alternative argument is that they could be considered helpful for the user to know what to expect here.
| "source": [ | ||
| "from nemo_curator.core.client import RayClient\n", | ||
| "\n", | ||
| "client = RayClient(num_cpus=64, num_gpus=4)\n", |
Are there any general rules of thumb for setting hardware requirements for this workflow, that the user should know?
Good callout, will document it. I've forgotten whether we need 2x the memory of the embeddings or 1x; we needed 2x in Dask, but I think in Ray it's now 1x. For now I'll say 2x to be on the safer side.
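As a rough back-of-the-envelope illustration of that rule of thumb (hypothetical dataset size; the 2x factor is the conservative figure mentioned above, and all-MiniLM-L6-v2 produces 384-dimensional embeddings):

```python
# Hypothetical sizing example: estimate memory needed to hold the embeddings,
# then double it following the conservative 2x rule of thumb mentioned above.
num_documents = 10_000_000   # assumed dataset size
embedding_dim = 384          # all-MiniLM-L6-v2 embedding dimension
bytes_per_value = 4          # float32

embedding_bytes = num_documents * embedding_dim * bytes_per_value
recommended_bytes = 2 * embedding_bytes  # 2x safety factor

print(f"Embeddings: ~{embedding_bytes / 1e9:.1f} GB; plan for ~{recommended_bytes / 1e9:.1f} GB")
```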
| "##### Visualizing Similarity in our dataset\n", | ||
| "\n", | ||
| "Depending on our dataset size we can read through all of the files and plot how much data is similar to one another.\n", | ||
| "Here we show how to read file by file and then perform a reduce. \n", | ||
| "\n", | ||
| "In our dataset we can see that ~20% of our data has cosine_similarity of 0.9 or more.\n", | ||
| "\n", | ||
| "Based on the analysis here we can decide what our `eps` should be. \n", | ||
| "However in this tutorial we pre-ran with eps set and perform_removal to be True.\n", | ||
| "\n", | ||
| "However ideally, users do this analysis, inspect the duplicates, come up with an `eps` and then run a pipeline that includes the `IdentifyDuplicates` stage.\n", | ||
| "\n", | ||
| "And finally perform removal.\n", | ||
| "\n", | ||
| "**NOTE : If you run with `use_id_generator=False` (which affects removal performance at large scale) you will see the actual ids of instead of the int ids. You can then also inspect the original dataset for those ids, and that can also help your decision for `eps`**" |
I think this is a great section for users to know how to interpret the results and play around with some parameters here. Could we flesh it out more?
Can you share how to flesh it out more?
| "1. When running removal workflow, we must specify the same input configuration as we did when we \"generated ids\".\n", | ||
| "2. In this tutorial that happened at the embedding generation step.\n", | ||
| "3. Therefore its required that we match the saame arguments of filepath, filetype and files_per_partition / blocksize.\n", | ||
| "4. This is required because ids are generated based on hash(filenames) in each task. If the same hash is not found in the id generator it'll error out.\n", |
| "4. This is required because ids are generated based on hash(filenames) in each task. If the same hash is not found in the id generator it'll error out.\n", | |
| "4. This is required because ids are generated based on hash (filenames) in each task. If the same hash is not found in the id generator it'll error out.\n", |
I meant to write hash(filename) as in the programmatic f(x), to convey that we hash the filenames.
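One simple way to honor that constraint (a hypothetical pattern rather than a specific Curator API): define the reader arguments once and reuse them for both the id-generation run and the removal run, so each task groups the same filenames and hash(filenames) resolves to the same key.

```python
# Hypothetical: keep the reader configuration identical across both runs, so that
# each task sees the same group of filenames and hash(filenames) maps to the same
# key stored in the id generator.
input_path_to_data = "data/raw/"  # placeholder path

reader_kwargs = {
    "input_path": input_path_to_data,
    "input_filetype": "jsonl",
    "input_files_per_partition": 1,
}

# embedding / id-generation run (sketch):
# workflow = TextSemanticDeduplicationWorkflow(..., **reader_kwargs)

# later, the removal run reuses the identical reader_kwargs (sketch):
# removal_workflow = TextSemanticDeduplicationWorkflow(..., perform_removal=True, **reader_kwargs)
```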
| "4. This is required because ids are generated based on hash(filenames) in each task. If the same hash is not found in the id generator it'll error out.\n", | ||
| "\n", | ||
| "### Performance\n", | ||
| "1. If you notice OOMs during this stage you can try using RayDataActor\n", |
Will this be in the docs somewhere? Otherwise I think this instruction is a bit confusing to users first becoming acquainted with Curator.
Signed-off-by: Praateek <praateekm@gmail.com>
Signed-off-by: Praateek <praateekm@gmail.com>
Pull Request Overview
This PR adds two comprehensive Jupyter notebook tutorials for semantic deduplication and includes important bug fixes. The tutorials provide both end-to-end workflow examples and step-by-step implementations, with useful visualizations to help users understand their deduplication results.
Key Changes:
- Adds two semantic deduplication tutorial notebooks with data visualization
- Fixes reader stage output schema to include CURATOR_DEDUP_ID when ID generation is enabled
- Fixes logging error in semantic workflow when eps parameter is None
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tutorials/text/deduplication/semantic/semantic_step_by_step.ipynb | Step-by-step semantic deduplication tutorial with ECDF plotting for similarity analysis |
| tutorials/text/deduplication/semantic/semantic_e2e.ipynb | End-to-end semantic deduplication workflow tutorial |
| nemo_curator/stages/text/io/reader/base.py | Fixes reader stage to include CURATOR_DEDUP_ID in output schema when ID generation is enabled |
| nemo_curator/stages/deduplication/semantic/workflow.py | Fixes logging error when eps parameter is None |
output_fields = self.fields or []
if self._generate_ids or self._assign_ids:
    from nemo_curator.stages.deduplication.id_generator import CURATOR_DEDUP_ID_STR

    output_fields.append(CURATOR_DEDUP_ID_STR)
Copilot
AI
Sep 16, 2025
The import statement inside the conditional block could fail if the module is not available. Consider moving the import to the top of the file or adding proper error handling to avoid potential ImportError at runtime.
Moving it to the top level means that, for the 99% of use cases where generate_ids or assign_ids is not set, we would still resolve all imports inside deduplication/id_generator.py, deduplication/utils.py, and whatever gets added to deduplication/__init__.py in the future, which is not necessary unless either of those args is set to true.
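The deferred-import pattern being defended here, in isolation (a generic sketch rather than the exact Curator code, with the rationale as comments):

```python
def output_columns(fields, generate_ids=False, assign_ids=False):
    # Generic sketch of a deferred import: the deduplication package is only
    # imported on the rare path that actually needs the id column constant,
    # so the common path pays no import cost.
    output_fields = list(fields or [])
    if generate_ids or assign_ids:
        from nemo_curator.stages.deduplication.id_generator import CURATOR_DEDUP_ID_STR

        output_fields.append(CURATOR_DEDUP_ID_STR)
    return output_fields
```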
sarahyurick
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Commits (squashed):
* bug fixes
* add notebooks
* change input path
* add comment about input filetype
* add download dataset too
* pr comments
* json -> jsonl
* fc
* pr comments
* ..
* change graph
* pr reveiw
* ..

Signed-off-by: Praateek <praateekm@gmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Description

Two tutorials
- `TextSemanticDeduplicationWorkflow` in one step

We also show how to infer results from sem dedup before running removal by plotting the ECDF plot.



Two small bug fixes
- `CURATOR_DEDUP_ID` in outputs

Usage

# Add snippet demonstrating usage

Checklist