Conversation

@praateekmahajan praateekmahajan commented Sep 16, 2025

Description

Two tutorials

  1. e2e - runs TextSemanticDeduplicationWorkflow in a single step
  2. step_by_step - this one
    • creates the id generator
    • runs embedding generation
    • runs SemanticDeduplicationWorkflow
    • runs IdentifyDuplicatesStage
    • runs TextDuplicatesRemovalWorkflow

We also show how to inspect the results of semantic dedup before running removal by plotting an ECDF of document similarity.
[ECDF plot of document similarity]

Two small bug fixes

  1. The reader stage doesn't include CURATOR_DEDUP_ID in its output schema
  2. The semantic workflow errors out during logging when eps is None

Usage

# Add snippet demonstrating usage
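
A minimal sketch of the intended usage, mirroring the e2e notebook cell quoted later in this conversation (the import path, the placeholder paths, and the final run() call are assumptions, not verified API):

```python
# Sketch only: the import path and the run() entry point are assumed; parameter values
# follow the e2e notebook cell reviewed below, and the paths are placeholders.
from nemo_curator.stages.text.deduplication import TextSemanticDeduplicationWorkflow  # assumed path

workflow = TextSemanticDeduplicationWorkflow(
    input_path="/data/raw",       # placeholder
    output_path="/data/deduped",  # placeholder
    cache_path="/data/cache",     # placeholder
    perform_removal=True,
    text_field="text",
    model_identifier="sentence-transformers/all-MiniLM-L6-v2",
    n_clusters=100,
    id_field="id",
    eps=0.01,
)
workflow.run()  # assumed entry point
```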

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

```diff
 logger.success(f"Total documents identified as duplicates: {total_duplicates}")
 logger.info(f"Similarity threshold used: {1.0 - self.eps:.3f} (eps={self.eps})")
-else:
+elif self.eps is not None:
```
Contributor Author (@praateekmahajan) replied:

This was a bug: when eps is None the logger fails because it evaluates {1.0 - self.eps}.
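
For reference, a minimal reproduction of the failure outside Curator:

```python
eps = None
# Raises: TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'
message = f"Similarity threshold used: {1.0 - eps:.3f} (eps={eps})"
```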

@sarahyurick sarahyurick (Contributor) left a comment:

Thanks for working on this, left some thoughts.

I think we can add a README for how to download the dataset?

Comment on lines 62 to 93
"workflow = TextSemanticDeduplicationWorkflow(\n",
" input_path=input_path_to_data,\n",
" output_path=output_path,\n",
" cache_path=cache_path,\n",
" perform_removal=True,\n",
" # Embedding generation parameters\n",
" text_field=\"text\",\n",
" model_identifier=\"sentence-transformers/all-MiniLM-L6-v2\",\n",
" embedding_max_seq_length=512,\n",
" embedding_max_chars=None,\n",
" embedding_pooling=\"mean_pooling\",\n",
" embedding_model_inference_batch_size=256,\n",
" # Semantic deduplication parameters\n",
" n_clusters=100, # this number can be much higher when the data is large\n",
" # For large scale data we should use CURATOR_DEDUP_ID_STR if we are\n",
" # also performing removal.\n",
" id_field=\"id\",\n",
" eps=0.01,\n",
" # K-means clustering parameters\n",
" ranking_strategy=RankingStrategy(metadata_cols=[\"cosine_dist_to_cent\"], ascending=True),\n",
" pairwise_batch_size=1024,\n",
" # ID generator parameters\n",
" # For large scale data we should set use_id_generator to True if we are performing removal.\n",
" use_id_generator=False,\n",
" id_generator_state_file=None,\n",
" # I/O parameters\n",
" input_filetype=input_filetype,\n",
" input_files_per_partition=1,\n",
" output_filetype=output_filetype,\n",
" verbose=True,\n",
" clear_output=True,\n",
")"
Contributor commented:

I was wondering if we could break these values into individual cells per stage, for example a cell might contain just:

```python
# K-means clustering parameters
ranking_strategy = RankingStrategy(metadata_cols=["cosine_dist_to_cent"], ascending=True)
pairwise_batch_size = 1024
```

with a small markdown explanation for what the stage does and what each parameter means (could be copy/pasted from the docstring). Then we have a final cell defining TextSemanticDeduplicationWorkflow with all of the parameters listed from the previous cells.

Contributor Author (@praateekmahajan) replied:

I think we can leave this one as is, since the step_by_step notebook already does the "breaking it up".

At the end of the day the user must pass everything together in one call, if that makes sense.

"name": "stderr",
"output_type": "stream",
"text": [
"2025-09-16 11:59:48,062\tINFO worker.py:1630 -- Using address 127.0.1.1:6379 set in the environment variable RAY_ADDRESS\n",
Contributor commented:

Personal preference, we could clear these logs for the final commit, to help the tutorial look a bit cleaner. Ultimately up to you though, since the alternative argument is that they could be considered helpful for the user to know what to expect here.

"source": [
"from nemo_curator.core.client import RayClient\n",
"\n",
"client = RayClient(num_cpus=64, num_gpus=4)\n",
Contributor commented:

Are there any general rules of thumb for setting hardware requirements for this workflow, that the user should know?

Contributor Author (@praateekmahajan) replied:

Good callout, will document it. I have forgotten whether we need 2x the memory of the embeddings or 1x; we needed 2x with Dask, but I think with Ray it's now 1x. For now I'll say 2x to be on the safer side.
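
As a rough back-of-the-envelope for that rule of thumb (illustrative numbers only; the 2x factor is the conservative guess above, not a measured requirement):

```python
# Estimate host memory needed for the embeddings, assuming float32 vectors.
num_docs = 10_000_000
embedding_dim = 384                  # all-MiniLM-L6-v2 output dimension
bytes_per_value = 4                  # float32
embedding_bytes = num_docs * embedding_dim * bytes_per_value  # ~15.4 GB
budget_gb = 2 * embedding_bytes / 1e9                         # 2x to be on the safer side
print(f"~{budget_gb:.0f} GB budgeted for embeddings")         # ~31 GB
```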

Comment on lines 550 to 564
"##### Visualizing Similarity in our dataset\n",
"\n",
"Depending on our dataset size we can read through all of the files and plot how much data is similar to one another.\n",
"Here we show how to read file by file and then perform a reduce. \n",
"\n",
"In our dataset we can see that ~20% of our data has cosine_similarity of 0.9 or more.\n",
"\n",
"Based on the analysis here we can decide what our `eps` should be. \n",
"However in this tutorial we pre-ran with eps set and perform_removal to be True.\n",
"\n",
"However ideally, users do this analysis, inspect the duplicates, come up with an `eps` and then run a pipeline that includes the `IdentifyDuplicates` stage.\n",
"\n",
"And finally perform removal.\n",
"\n",
"**NOTE : If you run with `use_id_generator=False` (which affects removal performance at large scale) you will see the actual ids of instead of the int ids. You can then also inspect the original dataset for those ids, and that can also help your decision for `eps`**"
Contributor commented:

I think this is a great section for users to know how to interpret the results and play around with some parameters here. Could we flesh it out more?

Contributor Author (@praateekmahajan) replied:

Can you share how to flesh it out more?

"1. When running removal workflow, we must specify the same input configuration as we did when we \"generated ids\".\n",
"2. In this tutorial that happened at the embedding generation step.\n",
"3. Therefore its required that we match the saame arguments of filepath, filetype and files_per_partition / blocksize.\n",
"4. This is required because ids are generated based on hash(filenames) in each task. If the same hash is not found in the id generator it'll error out.\n",
Contributor commented:

Suggested change:
- 4. This is required because ids are generated based on hash(filenames) in each task. If the same hash is not found in the id generator it'll error out.
+ 4. This is required because ids are generated based on hash (filenames) in each task. If the same hash is not found in the id generator it'll error out.

Contributor Author (@praateekmahajan) replied:

I meant to write hash(filename) programmatically, as in f(x), to convey that we hash the filenames.
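
Conceptually, something like this (an illustration of the idea only, not the actual Curator implementation):

```python
import hashlib

def filename_batch_key(filenames: list[str]) -> str:
    """Key an id range on the batch of filenames a reader task saw.

    If the removal pass groups the input files differently, the lookup key
    changes and the id generator cannot find the previously assigned ids.
    """
    return hashlib.md5("|".join(sorted(filenames)).encode()).hexdigest()
```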

"4. This is required because ids are generated based on hash(filenames) in each task. If the same hash is not found in the id generator it'll error out.\n",
"\n",
"### Performance\n",
"1. If you notice OOMs during this stage you can try using RayDataActor\n",
Contributor commented:

Will this be in the docs somewhere? Otherwise I think this instruction is a bit confusing to users first becoming acquainted with Curator.

Contributor Author (@praateekmahajan) replied:

This should be in docs for sure, along with a lot of the other id generator stuff.

I'm hoping @arhamm1 / @lbliii can help here

Copilot AI (Contributor) left a comment:

Pull Request Overview

This PR adds two comprehensive Jupyter notebook tutorials for semantic deduplication and includes important bug fixes. The tutorials provide both end-to-end workflow examples and step-by-step implementations, with useful visualizations to help users understand their deduplication results.

Key Changes:

  • Adds two semantic deduplication tutorial notebooks with data visualization
  • Fixes reader stage output schema to include CURATOR_DEDUP_ID when ID generation is enabled
  • Fixes logging error in semantic workflow when eps parameter is None

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File: Description
- tutorials/text/deduplication/semantic/semantic_step_by_step.ipynb: Step-by-step semantic deduplication tutorial with ECDF plotting for similarity analysis
- tutorials/text/deduplication/semantic/semantic_e2e.ipynb: End-to-end semantic deduplication workflow tutorial
- nemo_curator/stages/text/io/reader/base.py: Fixes the reader stage to include CURATOR_DEDUP_ID in the output schema when ID generation is enabled
- nemo_curator/stages/deduplication/semantic/workflow.py: Fixes the logging error when the eps parameter is None

Comment on lines +55 to +59
```python
output_fields = self.fields or []
if self._generate_ids or self._assign_ids:
    from nemo_curator.stages.deduplication.id_generator import CURATOR_DEDUP_ID_STR

    output_fields.append(CURATOR_DEDUP_ID_STR)
```
Copilot AI commented Sep 16, 2025:

The import statement inside the conditional block could fail if the module is not available. Consider moving the import to the top of the file or adding proper error handling to avoid potential ImportError at runtime.

Contributor Author (@praateekmahajan) replied:

Moving it to the top level would mean that, for the 99% of use cases where generate_ids or assign_ids is not set, we would still resolve all of the imports inside deduplication/id_generator.py, deduplication/utils.py, and whatever gets added to deduplication/__init__.py in the future, none of which is necessary until either of those args is set to true.

@sarahyurick sarahyurick added and then removed the r1.0.0 label (Pick this label for auto cherry-picking into r1.0.0) Sep 19, 2025
@sarahyurick sarahyurick (Contributor) left a comment:

Thanks!

@praateekmahajan praateekmahajan enabled auto-merge (squash) September 19, 2025 19:04
@thomasdhc thomasdhc disabled auto-merge September 19, 2025 19:25
@thomasdhc thomasdhc merged commit a6764d8 into NVIDIA-NeMo:main Sep 19, 2025
25 of 26 checks passed
chtruong814 pushed a commit that referenced this pull request Sep 19, 2025
* bug fixes

Signed-off-by: Praateek <praateekm@gmail.com>

* add notebooks

Signed-off-by: Praateek <praateekm@gmail.com>

* change input path

Signed-off-by: Praateek <praateekm@gmail.com>

* add comment about input filetype

Signed-off-by: Praateek <praateekm@gmail.com>

* add download dataset too

Signed-off-by: Praateek <praateekm@gmail.com>

* pr comments

Signed-off-by: Praateek <praateekm@gmail.com>

* json -> jsonl

Signed-off-by: Praateek <praateekm@gmail.com>

* fc

Signed-off-by: Praateek <praateekm@gmail.com>

* pr comments

Signed-off-by: Praateek <praateekm@gmail.com>

* ..

Signed-off-by: Praateek <praateekm@gmail.com>

* change graph

Signed-off-by: Praateek <praateekm@gmail.com>

* pr reveiw

Signed-off-by: Praateek <praateekm@gmail.com>

* ..

Signed-off-by: Praateek <praateekm@gmail.com>

---------

Signed-off-by: Praateek <praateekm@gmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
jnke2016 pushed a commit to jnke2016/Curator that referenced this pull request Nov 12, 2025 (same commit list as above)