Number of candidates returned for clustering should be capped #695

Closed
tfmorris opened this issue Mar 7, 2013 · 2 comments · Fixed by #2996
@tfmorris
Member

tfmorris commented Mar 7, 2013

I have a file where 31,787 clusters were found and the top clusters have 39 variants in them.

We need to cap the number returned to the browser because otherwise it becomes unresponsive (or crashes).

@magdmartin
Member

So how will you cluster the full group? In several passes, so that when one clicks
merge and re-cluster, it displays the rest of the group?
On 2013-03-06 8:28 PM, "Tom Morris" notifications@github.com wrote:

I have a file where 31,787 clusters were found and the top clusters have
39 variants in them.

We need to cap the number returned to the browser because otherwise it
becomes unresponsive (or crashes).



@wetneb wetneb added clustering Issues related to the clustering operation, to merge similar values in a text column Type: Feature Request Identifies requests for new features or enhancements. These involve proposing new improvements. labels Aug 2, 2017
@tfmorris
Member Author

My latest experiment is a 400K-row file of menu dishes that is very messy and highly duplicated; the top cluster is 49 ways to spell "fried sweet potatoes." The 41K clusters take 3-7 seconds to compute on the server, but then many minutes to render in my browser.

The default maximum should probably be something like 5-10K, perhaps with a setting to allow it to be increased, like the text facet choices.
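As a rough illustration of the proposal above, here is a minimal sketch of a configurable cluster cap. It is written in Python for brevity and is not OpenRefine code; the function name `cap_clusters`, the constant `DEFAULT_MAX_CLUSTERS`, and the choice of 5,000 (the low end of the suggested 5-10K range) are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical default, taken from the low end of the 5-10K range suggested above.
DEFAULT_MAX_CLUSTERS = 5000


def cap_clusters(
    clusters: List[List[str]],
    max_clusters: int = DEFAULT_MAX_CLUSTERS,
) -> Tuple[List[List[str]], int]:
    """Return the largest clusters, up to max_clusters, plus the number omitted."""
    if max_clusters < 0:
        raise ValueError("max_clusters must be non-negative")
    # Rank the largest clusters first so the most variant-heavy ones survive the cut.
    ranked = sorted(clusters, key=len, reverse=True)
    kept = ranked[:max_clusters]
    return kept, len(clusters) - len(kept)
```

Passing a larger `max_clusters` plays the role of the user-adjustable setting mentioned above, similar in spirit to the limit on text facet choices.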

tfmorris added a commit to tfmorris/OpenRefine that referenced this issue Jul 27, 2020
…enRefine#695

Fixes OpenRefine#695
- Caps the total number of choices displayed at 10,000 and warns when
  over the limit. Users can use facets to tune which clusters are displayed.
- Doubles the performance of the Javascript processing
- Only displays the count of rows for a choice if it's > 1, to reduce DOM elements
- Adds internationalization for row count

For 41K clusters containing 118K choices, processing dropped from
3m20s to 1m20s, but with the 10K choice cap total time is ~10sec.
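
The behaviour described in this commit message (a cap on the total number of choices, a warning when the cap is hit, and row counts shown only when greater than 1) could be sketched roughly as follows. This is a Python illustration under stated assumptions, not the code from the commit: the names `select_for_display` and `choice_label`, the warning wording, and the decision to stop at the first cluster that would exceed the cap are all hypothetical.

```python
from typing import List, Optional, Tuple

MAX_CHOICES = 10_000  # the cap stated in the commit message

Choice = Tuple[str, int]  # (cell value, number of rows with that value)


def select_for_display(
    clusters: List[List[Choice]],
    max_choices: int = MAX_CHOICES,
) -> Tuple[List[List[Choice]], Optional[str]]:
    """Keep whole clusters until adding the next one would exceed max_choices."""
    shown: List[List[Choice]] = []
    total = 0
    for cluster in clusters:
        if total + len(cluster) > max_choices:
            break  # assumption: stop at the first cluster that does not fit
        shown.append(cluster)
        total += len(cluster)
    warning = None
    if len(shown) < len(clusters):
        # Hypothetical wording; the real message is internationalized.
        warning = (
            f"Showing {len(shown)} of {len(clusters)} clusters "
            f"(limit of {max_choices} choices reached); use facets to narrow the results."
        )
    return shown, warning


def choice_label(value: str, rows: int) -> str:
    """Append the row count only when it is greater than 1."""
    return f"{value} ({rows} rows)" if rows > 1 else value
```

Skipping the count for single-row choices means fewer elements per choice need to be rendered, which is consistent with the performance goal described above.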
@tfmorris tfmorris self-assigned this Jul 31, 2020
@tfmorris tfmorris added this to the 3.5 milestone Jul 31, 2020
wetneb pushed a commit that referenced this issue Aug 1, 2020
… (#2996)

* Clustering dialog choices limit & performance improvements - fixes #695

* Restore even/odd row class

* Updates from review feedback