-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document clustering algorithms #3033
Comments
I checked the OpenRefine Documentation but found nothing on clustering algorithms Also, is there a guideline to follow? |
EDIT: see @ostephens comment below. You can assign yourself if you want to take it. Regards, A. |
There are two locations in documentation where we already have information about the clustering algorithms:
I'd suggest first reviewing the information at https://docs.openrefine.org/manual/cellediting#clustering-methods and see if you think these need any more information. I think this needs to be aimed at a non-technical audience - I would not see information such as the detail of the implementation of PPM as suitable here. The type of information that Tom gives above on PPM I'd see as more appropriate to https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth |
@ostephens okay, thanks. |
Hello @antoine2711 @ostephens. I have a few suggestions:
If you approve these changes, could you assign me to work on them? |
Thanks @Babi-B - the first of these would be really great - please pay attention not just to the general description but also how they are implemented in OpenRefine (by examining the code) as (for example) I only recently found out that the Metaphone3 clustering only uses the first 8 characters of the Metaphone3 encoding of the value as the clustering key - these details are important. I'd also say that examples are often helpful to explain so I'd recommend considering if adding examples helps make any of the documentation clearer as you write it. I'm not 100% sure what you are proposing when you say you could
I don't know if you can explain here before doing the work or if it's easier to make a PR for the docs and get feedback on that - I'd be happy either way |
The documentation on the clustering algorithms is scant, and in some cases, non-existent. The user interface should provide a brief description of each algorithm as well as guidance on how to pick the most appropriate one for their application as well as any pertinent tradeoffs.
Some are obvious from their names (or at least easily Google-able), but other are much more opaque.
PPM is a particularly egregious example, but since I just looked at the implementation, here's what it does:
PPM stands for Prediction by Partial Matching which is an arithmetic coding technique. The implementation that OpenRefine uses is based on the MIT SIMILE project's Vicino package, which, in turn, uses Bob Carpenter's arithcode package, which has a tutorial available here: https://raw.githubusercontent.com/bob-carpenter/java-arithcode/master/arithcode-1.2/doc/tutorial.html. The two strings to be compared are concatenated together and then "compressed" using the coder. The PPMModel uses 8 bytes of context and returns as the score the length of the encoded output. This is used four times to compute distance scores for (x,y), (y,x), (x,x), (y,y) (because the scores are directional) which are then combined together in this fashion to generate a final score:
10.0d * ((d(x,y) + d(y,x)) / (d(x,x) + d(y,y)) - 1.0d)
This score is relatively computationally expensive to compute and has to be done in a pairwise fashion for every pair of strings. To prevent an N^2 explosion, both the "nearest neighbor" clustering algorithms first divide the strings into manageable sized blocks and only does the pairwise comparison for members of the block. Depending on the blocking algorithm used, it's possible for two similar strings to end up in different blocks and never get compared to each other.
The text was updated successfully, but these errors were encountered: