Semantic Dedup Tutorial + bug fixes #1067
Conversation
Signed-off-by: Praateek <praateekm@gmail.com>
| logger.success(f"Total documents identified as duplicates: {total_duplicates}") | ||
| logger.info(f"Similarity threshold used: {1.0 - self.eps:.3f} (eps={self.eps})") | ||
| else: | ||
| elif self.eps is not None: |
This was a bug: when eps is None, the logger call fails because it evaluates {1.0 - self.eps}.
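For illustration, a minimal standalone sketch of the failure mode and the guard (not the actual workflow code; names are simplified):

```python
eps = None  # hypothetical: the user did not request duplicate identification

# Before the fix: this f-string raises a TypeError because 1.0 - None is invalid.
# logger.info(f"Similarity threshold used: {1.0 - eps:.3f} (eps={eps})")

# After the fix: only format the message when eps is actually set.
if eps is not None:
    print(f"Similarity threshold used: {1.0 - eps:.3f} (eps={eps})")
```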
Signed-off-by: Praateek <praateekm@gmail.com>
Signed-off-by: Praateek <praateekm@gmail.com>
sarahyurick
left a comment
Thanks for working on this, left some thoughts.
I think we can add a README for how to download the dataset?
| "workflow = TextSemanticDeduplicationWorkflow(\n", | ||
| " input_path=input_path_to_data,\n", | ||
| " output_path=output_path,\n", | ||
| " cache_path=cache_path,\n", | ||
| " perform_removal=True,\n", | ||
| " # Embedding generation parameters\n", | ||
| " text_field=\"text\",\n", | ||
| " model_identifier=\"sentence-transformers/all-MiniLM-L6-v2\",\n", | ||
| " embedding_max_seq_length=512,\n", | ||
| " embedding_max_chars=None,\n", | ||
| " embedding_pooling=\"mean_pooling\",\n", | ||
| " embedding_model_inference_batch_size=256,\n", | ||
| " # Semantic deduplication parameters\n", | ||
| " n_clusters=100, # this number can be much higher when the data is large\n", | ||
| " # For large scale data we should use CURATOR_DEDUP_ID_STR if we are\n", | ||
| " # also performing removal.\n", | ||
| " id_field=\"id\",\n", | ||
| " eps=0.01,\n", | ||
| " # K-means clustering parameters\n", | ||
| " ranking_strategy=RankingStrategy(metadata_cols=[\"cosine_dist_to_cent\"], ascending=True),\n", | ||
| " pairwise_batch_size=1024,\n", | ||
| " # ID generator parameters\n", | ||
| " # For large scale data we should set use_id_generator to True if we are performing removal.\n", | ||
| " use_id_generator=False,\n", | ||
| " id_generator_state_file=None,\n", | ||
| " # I/O parameters\n", | ||
| " input_filetype=input_filetype,\n", | ||
| " input_files_per_partition=1,\n", | ||
| " output_filetype=output_filetype,\n", | ||
| " verbose=True,\n", | ||
| " clear_output=True,\n", | ||
| ")" |
I was wondering if we could break these values into individual cells per stage, for example a cell might contain just:
# K-means clustering parameters
ranking_strategy = RankingStrategy(metadata_cols=["cosine_dist_to_cent"], ascending=True)
pairwise_batch_size = 1024
with a small markdown explanation for what the stage does and what each parameter means (could be copy/pasted from the docstring). Then we have a final cell defining TextSemanticDeduplicationWorkflow with all of the parameters listed from the previous cells.
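For illustration, the final cell described above might look roughly like this (a hypothetical sketch; it simply reuses variables such as ranking_strategy and pairwise_batch_size from the earlier per-stage cells, and the parameter names come from the workflow call quoted above):

```python
# Hypothetical final cell: assemble the workflow from the variables defined in
# the per-stage cells above (embedding, semantic dedup, k-means, I/O, ...).
workflow = TextSemanticDeduplicationWorkflow(
    input_path=input_path_to_data,
    output_path=output_path,
    cache_path=cache_path,
    perform_removal=True,
    # Embedding generation parameters (defined in an earlier cell)
    text_field=text_field,
    model_identifier=model_identifier,
    # Semantic deduplication parameters (defined in an earlier cell)
    n_clusters=n_clusters,
    id_field=id_field,
    eps=eps,
    # K-means clustering parameters (defined in an earlier cell)
    ranking_strategy=ranking_strategy,
    pairwise_batch_size=pairwise_batch_size,
)
```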
I think we can leave this one as is, since the step_by_step notebook already does the "breaking it up".
At the end of the day the user has to pass everything together, if that makes sense.
| "name": "stderr", | ||
| "output_type": "stream", | ||
| "text": [ | ||
| "2025-09-16 11:59:48,062\tINFO worker.py:1630 -- Using address 127.0.1.1:6379 set in the environment variable RAY_ADDRESS\n", |
Personal preference, we could clear these logs for the final commit, to help the tutorial look a bit cleaner. Ultimately up to you though, since the alternative argument is that they could be considered helpful for the user to know what to expect here.
| "source": [ | ||
| "from nemo_curator.core.client import RayClient\n", | ||
| "\n", | ||
| "client = RayClient(num_cpus=64, num_gpus=4)\n", |
Are there any general rules of thumb for setting hardware requirements for this workflow, that the user should know?
Good callout, will document it. I've forgotten whether we need 2x the memory of the embeddings or 1x; we needed 2x in Dask, but I think in Ray it's now 1x. For now I'll say 2x to be on the safer side.
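As a rough back-of-the-envelope illustration of that rule of thumb (hypothetical dataset size; the 2x factor is the conservative figure mentioned above, and all-MiniLM-L6-v2 produces 384-dimensional embeddings):

```python
# Hypothetical sizing example: estimate memory needed to hold the embeddings,
# then double it following the conservative 2x rule of thumb mentioned above.
num_documents = 10_000_000   # assumed dataset size
embedding_dim = 384          # all-MiniLM-L6-v2 embedding dimension
bytes_per_value = 4          # float32

embedding_bytes = num_documents * embedding_dim * bytes_per_value
recommended_bytes = 2 * embedding_bytes  # 2x safety factor

print(f"Embeddings: ~{embedding_bytes / 1e9:.1f} GB; plan for ~{recommended_bytes / 1e9:.1f} GB")
```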
| "##### Visualizing Similarity in our dataset\n", | ||
| "\n", | ||
| "Depending on our dataset size we can read through all of the files and plot how much data is similar to one another.\n", | ||
| "Here we show how to read file by file and then perform a reduce. \n", | ||
| "\n", | ||
| "In our dataset we can see that ~20% of our data has cosine_similarity of 0.9 or more.\n", | ||
| "\n", | ||
| "Based on the analysis here we can decide what our `eps` should be. \n", | ||
| "However in this tutorial we pre-ran with eps set and perform_removal to be True.\n", | ||
| "\n", | ||
| "However ideally, users do this analysis, inspect the duplicates, come up with an `eps` and then run a pipeline that includes the `IdentifyDuplicates` stage.\n", | ||
| "\n", | ||
| "And finally perform removal.\n", | ||
| "\n", | ||
| "**NOTE : If you run with `use_id_generator=False` (which affects removal performance at large scale) you will see the actual ids of instead of the int ids. You can then also inspect the original dataset for those ids, and that can also help your decision for `eps`**" |
I think this is a great section for users to know how to interpret the results and play around with some parameters here. Could we flesh it out more?
Can you share how to flesh it out more?
| "1. When running removal workflow, we must specify the same input configuration as we did when we \"generated ids\".\n", | ||
| "2. In this tutorial that happened at the embedding generation step.\n", | ||
| "3. Therefore its required that we match the saame arguments of filepath, filetype and files_per_partition / blocksize.\n", | ||
| "4. This is required because ids are generated based on hash(filenames) in each task. If the same hash is not found in the id generator it'll error out.\n", |
| "4. This is required because ids are generated based on hash(filenames) in each task. If the same hash is not found in the id generator it'll error out.\n", | |
| "4. This is required because ids are generated based on hash (filenames) in each task. If the same hash is not found in the id generator it'll error out.\n", |
I meant to write hash(filename) as in the programmatic f(x), to convey that we hash the filenames.
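One simple way to honor that constraint (a hypothetical pattern rather than a specific Curator API): define the reader arguments once and reuse them for both the id-generation run and the removal run, so each task groups the same filenames and hash(filenames) resolves to the same key.

```python
# Hypothetical: keep the reader configuration identical across both runs, so that
# each task sees the same group of filenames and hash(filenames) maps to the same
# key stored in the id generator.
input_path_to_data = "data/raw/"  # placeholder path

reader_kwargs = {
    "input_path": input_path_to_data,
    "input_filetype": "jsonl",
    "input_files_per_partition": 1,
}

# embedding / id-generation run (sketch):
# workflow = TextSemanticDeduplicationWorkflow(..., **reader_kwargs)

# later, the removal run reuses the identical reader_kwargs (sketch):
# removal_workflow = TextSemanticDeduplicationWorkflow(..., perform_removal=True, **reader_kwargs)
```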
| "4. This is required because ids are generated based on hash(filenames) in each task. If the same hash is not found in the id generator it'll error out.\n", | ||
| "\n", | ||
| "### Performance\n", | ||
| "1. If you notice OOMs during this stage you can try using RayDataActor\n", |
Will this be in the docs somewhere? Otherwise I think this instruction is a bit confusing to users first becoming acquainted with Curator.
Signed-off-by: Praateek <praateekm@gmail.com>
Signed-off-by: Praateek <praateekm@gmail.com>
Pull Request Overview
This PR adds two comprehensive Jupyter notebook tutorials for semantic deduplication and includes important bug fixes. The tutorials provide both end-to-end workflow examples and step-by-step implementations, with useful visualizations to help users understand their deduplication results.
Key Changes:
- Adds two semantic deduplication tutorial notebooks with data visualization
- Fixes reader stage output schema to include CURATOR_DEDUP_ID when ID generation is enabled
- Fixes logging error in semantic workflow when eps parameter is None
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tutorials/text/deduplication/semantic/semantic_step_by_step.ipynb | Step-by-step semantic deduplication tutorial with ECDF plotting for similarity analysis |
| tutorials/text/deduplication/semantic/semantic_e2e.ipynb | End-to-end semantic deduplication workflow tutorial |
| nemo_curator/stages/text/io/reader/base.py | Fixes reader stage to include CURATOR_DEDUP_ID in output schema when ID generation is enabled |
| nemo_curator/stages/deduplication/semantic/workflow.py | Fixes logging error when eps parameter is None |
output_fields = self.fields or []
if self._generate_ids or self._assign_ids:
    from nemo_curator.stages.deduplication.id_generator import CURATOR_DEDUP_ID_STR

    output_fields.append(CURATOR_DEDUP_ID_STR)
Copilot
AI
Sep 16, 2025
The import statement inside the conditional block could fail if the module is not available. Consider moving the import to the top of the file or adding proper error handling to avoid potential ImportError at runtime.
Moving it to the top level means that, for the 99% of use cases where generate_ids or assign_ids is not set, we would still resolve all imports inside deduplication/id_generator.py, deduplication/utils.py, and whatever gets added to deduplication/__init__.py in the future, which is not necessary unless either of those args is set to true.
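The deferred-import pattern being defended here, in isolation (a generic sketch rather than the exact Curator code, with the rationale as comments):

```python
def output_columns(fields, generate_ids=False, assign_ids=False):
    # Generic sketch of a deferred import: the deduplication package is only
    # imported on the rare path that actually needs the id column constant,
    # so the common path pays no import cost.
    output_fields = list(fields or [])
    if generate_ids or assign_ids:
        from nemo_curator.stages.deduplication.id_generator import CURATOR_DEDUP_ID_STR

        output_fields.append(CURATOR_DEDUP_ID_STR)
    return output_fields
```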
sarahyurick
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Commits (squashed):
* bug fixes
* add notebooks
* change input path
* add comment about input filetype
* add download dataset too
* pr comments
* json -> jsonl
* fc
* pr comments
* ..
* change graph
* pr reveiw
* ..

Signed-off-by: Praateek <praateekm@gmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Description

Two tutorials
- `TextSemanticDeduplicationWorkflow` in one step

We also show how to infer results from sem dedup before running removal by plotting the ECDF plot.



Two small bug fixes
- `CURATOR_DEDUP_ID` in outputs

Usage

# Add snippet demonstrating usage

Checklist