@sarahyurick
Contributor

No description provided.

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
@sarahyurick sarahyurick added the gpuci Run GPU CI/CD on PR label Feb 26, 2025
Comment on lines +69 to +75
embedding_column=semdedup_config.embedding_column,
random_state=semdedup_config.random_state,
sim_metric=semdedup_config.sim_metric,
which_to_keep=semdedup_config.which_to_keep,
sort_clusters=semdedup_config.sort_clusters,
kmeans_with_cos_dist=semdedup_config.kmeans_with_cos_dist,
clustering_input_partition_size=semdedup_config.clustering_input_partition_size,
Contributor

Thanks for adding these 🙏
I was running the script recently and had to add these too.
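
For readers following the thread, here is a minimal sketch of the pattern the comment refers to: forwarding every config field to the clustering step so that none is silently dropped. The field names are taken from the diff above; the class name `SemDedupConfigSketch`, the helper `clustering_kwargs`, and all default values are hypothetical placeholders, not the library's actual definitions.

```python
from dataclasses import dataclass, fields


@dataclass
class SemDedupConfigSketch:
    # Field names come from the diff above; every default here is a
    # placeholder for illustration, not the library's real default.
    embedding_column: str = "embeddings"
    random_state: int = 1234
    sim_metric: str = "cosine"
    which_to_keep: str = "hard"
    sort_clusters: bool = True
    kmeans_with_cos_dist: bool = False
    clustering_input_partition_size: str = "2gb"


def clustering_kwargs(config: SemDedupConfigSketch) -> dict:
    """Collect every config field as a keyword argument (hypothetical helper)."""
    return {f.name: getattr(config, f.name) for f in fields(config)}


# Overriding one field and forwarding the whole config in one step.
config = SemDedupConfigSketch(random_state=42)
kwargs = clustering_kwargs(config)
```

Building the keyword arguments from the dataclass fields means a newly added config option such as `random_state` reaches the script without a separate edit at each call site.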

Comment on lines +168 to +172
kmeans = KMeans(
    n_clusters=self.n_clusters,
    max_iter=self.max_iter,
    random_state=self.random_state,
)
Contributor

Just to confirm, this is the main change, right?

Contributor Author

Yes, everything else is docstring updates or updates to make sure all parameters are being used in the scripts directory. Thanks!
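
To illustrate why passing `random_state` to KMeans matters, below is a toy sketch using only the Python standard library. It is not the KMeans implementation used in this PR; `kmeans_labels` is a hypothetical one-dimensional stand-in whose only purpose is to show that seeding the centroid initialization makes cluster assignments repeatable across runs.

```python
import random


def kmeans_labels(points, n_clusters, max_iter=10, random_state=None):
    """Toy 1-D k-means; random_state seeds the initial centroid choice."""
    rng = random.Random(random_state)
    centroids = rng.sample(points, n_clusters)
    labels = []
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(n_clusters), key=lambda k: abs(p - centroids[k]))
            for p in points
        ]
        # Move each centroid to the mean of its members; keep it if empty.
        for k in range(n_clusters):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centroids[k] = sum(members) / len(members)
    return labels


points = [0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 9.8, 9.9, 10.0]
first = kmeans_labels(points, n_clusters=3, random_state=42)
second = kmeans_labels(points, n_clusters=3, random_state=42)
same_seed_matches = first == second
```

With a fixed seed the two runs start from the same centroids and therefore produce identical labels. With `random_state=None` the initialization is drawn from system entropy, so cluster IDs (and any deduplication decisions that depend on them) can differ between runs.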

@sarahyurick sarahyurick merged commit e662ac0 into NVIDIA-NeMo:main Feb 26, 2025
6 checks passed
@sarahyurick sarahyurick deleted the kmeans_seed branch February 28, 2025 18:31
jnke2016 pushed a commit to jnke2016/Curator that referenced this pull request Nov 12, 2025
…eMo#575)

* Add KMeans random_state to semantic deduplication configs

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* edit docstrings

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

---------

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>