Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document clustering algorithms #3033

Closed
tfmorris opened this issue Aug 6, 2020 · 6 comments · Fixed by #4718 or #4719
Closed

Document clustering algorithms #3033

tfmorris opened this issue Aug 6, 2020 · 6 comments · Fixed by #4718 or #4719
Assignees
Labels
clustering Issues related to the clustering operation, to merge similar values in a text column Type: Bug Issues related to software defects or unexpected behavior, which require resolution. Type: Documentation Issues related to improving project documentation or tutorials.
Milestone

Comments

@tfmorris
Copy link
Member

tfmorris commented Aug 6, 2020

The documentation on the clustering algorithms is scant, and in some cases, non-existent. The user interface should provide a brief description of each algorithm as well as guidance on how to pick the most appropriate one for their application as well as any pertinent tradeoffs.

Some are obvious from their names (or at least easily Google-able), but other are much more opaque.

PPM is a particularly egregious example, but since I just looked at the implementation, here's what it does:

PPM stands for Prediction by Partial Matching which is an arithmetic coding technique. The implementation that OpenRefine uses is based on the MIT SIMILE project's Vicino package, which, in turn, uses Bob Carpenter's arithcode package, which has a tutorial available here: https://raw.githubusercontent.com/bob-carpenter/java-arithcode/master/arithcode-1.2/doc/tutorial.html. The two strings to be compared are concatenated together and then "compressed" using the coder. The PPMModel uses 8 bytes of context and returns as the score the length of the encoded output. This is used four times to compute distance scores for (x,y), (y,x), (x,x), (y,y) (because the scores are directional) which are then combined together in this fashion to generate a final score:

10.0d * ((d(x,y) + d(y,x)) / (d(x,x) + d(y,y)) - 1.0d)

This score is relatively computationally expensive to compute and has to be done in a pairwise fashion for every pair of strings. To prevent an N^2 explosion, both the "nearest neighbor" clustering algorithms first divide the strings into manageable sized blocks and only does the pairwise comparison for members of the block. Depending on the blocking algorithm used, it's possible for two similar strings to end up in different blocks and never get compared to each other.

@tfmorris tfmorris added Type: Bug Issues related to software defects or unexpected behavior, which require resolution. Type: Documentation Issues related to improving project documentation or tutorials. Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators labels Aug 6, 2020
@wetneb wetneb added clustering Issues related to the clustering operation, to merge similar values in a text column and removed Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators labels Aug 6, 2020
@Babi-B
Copy link
Contributor

Babi-B commented Apr 4, 2022

Hello @tfmorris and @wetneb.

I checked the OpenRefine Documentation but found nothing on clustering algorithms
To work on this issue will I need to be versed in the knowledge of clustering algorithms or can I learn as I go? If the latter, are there resources I could refer to?

Also, is there a guideline to follow?

@antoine2711
Copy link
Member

antoine2711 commented Apr 4, 2022

Will I need to be versed in the knowledge of clustering algorithms or can I learn as I go? If the latter, are there resources I could refer to?

EDIT: see @ostephens comment below.

You can assign yourself if you want to take it.

Regards, A.

@ostephens
Copy link
Member

There are two locations in documentation where we already have information about the clustering algorithms:

I'd suggest first reviewing the information at https://docs.openrefine.org/manual/cellediting#clustering-methods and see if you think these need any more information. I think this needs to be aimed at a non-technical audience - I would not see information such as the detail of the implementation of PPM as suitable here.

The type of information that Tom gives above on PPM I'd see as more appropriate to https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth

@Babi-B
Copy link
Contributor

Babi-B commented Apr 4, 2022

@ostephens okay, thanks.

@Babi-B
Copy link
Contributor

Babi-B commented Apr 7, 2022

Hello @antoine2711 @ostephens. I have a few suggestions:

  1. The phonetic algorithms Metaphone3, Cologne-phonetic, Daitch-Moktoff, and Beider-Morse are not explained in the in-depth document but merely stated. I could add some flesh to them.
  2. I could also make the clustering-methods document for the end-users more skimmable.

If you approve these changes, could you assign me to work on them?

@ostephens
Copy link
Member

Thanks @Babi-B - the first of these would be really great - please pay attention not just to the general description but also how they are implemented in OpenRefine (by examining the code) as (for example) I only recently found out that the Metaphone3 clustering only uses the first 8 characters of the Metaphone3 encoding of the value as the clustering key - these details are important.

I'd also say that examples are often helpful to explain so I'd recommend considering if adding examples helps make any of the documentation clearer as you write it.

I'm not 100% sure what you are proposing when you say you could

make the clustering-methods document for the end-users more skimmable

I don't know if you can explain here before doing the work or if it's easier to make a PR for the docs and get feedback on that - I'd be happy either way

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clustering Issues related to the clustering operation, to merge similar values in a text column Type: Bug Issues related to software defects or unexpected behavior, which require resolution. Type: Documentation Issues related to improving project documentation or tutorials.
Projects
None yet
5 participants