Fingerprint functions should support Unicode whitespace and punctuation #3282
* Add more keyer tests
  - All forms of Unicode whitespace for both fingerprint & N-gram fingerprint
  - additional N-gram fingerprint cases
* Improve fingerprint keyers
  - Update N-gram fingerprint keyer to match (missed last time)
  - refactor string normalization to reduce redundancy between two keyers
  - add C1 controls to control characters that are stripped
  - include all Unicode whitespace characters in splitting delimiter and don't strip controls which are whitespace (HT, LF, VT, FF, CR, NEL)
  - minor cleanups, simplifications, and performance optimizations
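As an aside, the "include all Unicode whitespace characters in splitting delimiter" item can be illustrated with Java's `(?U)` (UNICODE_CHARACTER_CLASS) flag. This is a standalone sketch, not the keyer code itself:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class UnicodeSplitDemo {
    // (?U) enables UNICODE_CHARACTER_CLASS, so \s matches every character with
    // the Unicode White_Space property (NBSP, em space, NEL, ...), not just
    // the ASCII whitespace that plain \s covers.
    private static final Pattern WHITESPACE = Pattern.compile("(?U)\\s+");

    public static void main(String[] args) {
        String s = "foo\u00A0bar\u2003baz";  // NBSP and em space as separators
        System.out.println(Arrays.toString(WHITESPACE.split(s)));
        // -> [foo, bar, baz]; "\\s+" without (?U) would leave this as one token
    }
}
```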
One outstanding issue from this work is this (OpenRefine/main/src/com/google/refine/clustering/binning/FingerprintKeyer.java, lines 111 to 113 in 45a3622):
Currently we remove all punctuation, which has the effect of potentially merging two words; e.g. "this&that" or "this,that" will become the single word "thisthat". This is what we've historically done, but arguably the punctuation should be replaced with whitespace to separate the two tokens, as is common practice with most tokenizers. The downside is that "they'll" becomes "they" and "ll", and similarly for other contractions and possessives.
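To make the two candidate behaviors concrete, here is a minimal sketch (the method names are assumptions, not the actual FingerprintKeyer code) contrasting deleting punctuation with replacing it by a space:

```java
import java.util.Arrays;
import java.util.TreeSet;

public class PunctuationDemo {
    // Historical behavior: punctuation is deleted, so adjacent words merge.
    static String keyDroppingPunct(String s) {
        String n = s.toLowerCase().replaceAll("(?U)\\p{Punct}+", "");
        return joinSorted(n.trim().split("(?U)\\s+"));
    }

    // Alternative: punctuation becomes whitespace, acting as a token separator.
    static String keySplittingOnPunct(String s) {
        String n = s.toLowerCase().replaceAll("(?U)\\p{Punct}+", " ");
        return joinSorted(n.trim().split("(?U)\\s+"));
    }

    // Fingerprinting sorts and deduplicates tokens before rejoining them.
    private static String joinSorted(String[] tokens) {
        return String.join(" ", new TreeSet<>(Arrays.asList(tokens)));
    }

    public static void main(String[] args) {
        System.out.println(keyDroppingPunct("this&that"));    // thisthat
        System.out.println(keySplittingOnPunct("this&that")); // that this
        System.out.println(keySplittingOnPunct("they'll"));   // ll they
    }
}
```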
@tfmorris I guess we should begin to answer what our defaults should be. We can look to other programs to see how they handle things; ElasticSearch uses our fingerprint algorithm, by the way, because they liked the choices that Stefano made.
But we are not ElasticSearch, with Lucene and its entire ecosystem of text analysis. We could be, but that's not always what our users need. They do, however, need customization of certain common parameters, I would say. This also typically depends on the stop-word configuration in other programs like ElasticSearch: https://www.elastic.co/guide/en/elasticsearch/guide/master/stopwords.html

We could also argue that there are lots of customization points around our fingerprint algorithm that we don't provide overrides for; we hard-code smart defaults, even down into our own StringUtils. But some can argue that those defaults are not always "smart" for their edge cases ("why is fingerprint considering THAT character to be whitespace? I really don't like that in this language... hmm"), so a "customized whitespace tokenizer" might make sense for fingerprint, or it might not. elastic/elasticsearch#54341 (comment)

All I know is our users love it when we give them more options and put the control in their hands.
My question was much simpler. It can be summarized as: "should punctuation be a token separator or not?" Lucene's StandardTokenizer splits on both punctuation and whitespace, with the exception of periods which are not followed by whitespace, i.e. "example.com" is a single token.
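For reference, that behavior is easy to check with Lucene itself. A standalone sketch, assuming lucene-core is on the classpath:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StandardTokenizerDemo {
    public static void main(String[] args) throws Exception {
        try (Tokenizer tok = new StandardTokenizer()) {
            tok.setReader(new StringReader("visit example.com, this&that"));
            CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
            tok.reset();
            while (tok.incrementToken()) {
                System.out.println(term.toString());
            }
            tok.end();
        }
        // expected: visit / example.com / this / that
        // -- the period inside example.com is kept; the & and , split tokens
    }
}
```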
My interpretation of that writeup was that it described the then-current Java code in English for users' benefit, not that it was intended as proselytizing the one true way. We've already modified the algorithm to do full Unicode case & diacritic folding as well as decomposed normalization. We need to update the description to match the code, not slavishly follow history if we've identified improvements.
That filter applies to a token stream, so it isn't relevant to the question of tokenization.
All that is true for both options: 1) tokenizing on whitespace, and 2) tokenizing on whitespace plus punctuation. The question is really which one is better for users as the default.
@tfmorris I would say 2)
I am not opposed to improving the existing tokenization algorithm. But we need to keep in mind that this is a change that will be tricky to review, since it is very hard to predict its impact on the wide variety of things that users will throw at the algorithm. Personally, I would rather make this more user-configurable. This could take the form of:
I think the second option would give the most flexibility.
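One hypothetical shape for that configurability, where the delimiter set becomes a user-supplied parameter (the class and parameter names here are assumptions, not existing OpenRefine API):

```java
import java.util.Arrays;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical: a fingerprint keyer whose delimiter set is user-supplied
// rather than hard-coded.
public class ConfigurableFingerprintKeyer {
    private final Pattern delimiter;

    public ConfigurableFingerprintKeyer(String delimiterRegex) {
        this.delimiter = Pattern.compile(delimiterRegex);
    }

    public String key(String s) {
        // split, deduplicate, sort, and rejoin -- the usual fingerprint recipe
        TreeSet<String> tokens = new TreeSet<>(
                Arrays.asList(delimiter.split(s.toLowerCase().trim())));
        tokens.remove("");
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // option 1: whitespace-only delimiters
        ConfigurableFingerprintKeyer ws =
                new ConfigurableFingerprintKeyer("(?U)\\s+");
        // option 2: whitespace plus punctuation
        ConfigurableFingerprintKeyer wsPunct =
                new ConfigurableFingerprintKeyer("(?U)[\\s\\p{Punct}]+");
        System.out.println(ws.key("this&that"));      // this&that
        System.out.println(wsPunct.key("this&that")); // that this
    }
}
```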
Even fancy user-configurable things need defaults. Should the default tokenizer include punctuation as a token delimiter or not? We certainly can (over)engineer a super fancy solution, but if we can improve our current implementation with a single-character change, that has a much better benefit-to-cost ratio.
Intuitively, including more characters as delimiters should only break the source string into more tokens, so it should have stronger normalizing power (clustering together inputs that formed separate clusters before), right? So the risk would be that this brings up more spurious clusters that users would need to ignore. We should make a reasonable attempt at imagining what those spurious clusters could be. No concrete examples come to my mind right now, but I think this deserves more thought.
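As a starting point for that thought experiment, here are two hypothetical cases (not raised in the thread) run through the same dropping-vs-splitting logic sketched earlier: one where deleting punctuation merges distinct words, and one where splitting on punctuation breaks a cluster that deletion would have found:

```java
import java.util.Arrays;
import java.util.TreeSet;

public class TradeoffDemo {
    // replacement "" mimics the historical drop; " " treats punctuation
    // as a token delimiter (hypothetical examples, not from the thread)
    static String key(String s, String replacement) {
        String n = s.toLowerCase().replaceAll("(?U)\\p{Punct}+", replacement);
        return String.join(" ",
                new TreeSet<>(Arrays.asList(n.trim().split("(?U)\\s+"))));
    }

    public static void main(String[] args) {
        // Dropping punctuation can merge distinct words into a spurious cluster:
        System.out.println(key("re-cover", ""));   // recover (collides with "recover")
        // Splitting on punctuation can break a cluster that dropping would find:
        System.out.println(key("O'Brien", " "));   // brien o
        System.out.println(key("OBrien", " "));    // obrien  (no longer matches)
    }
}
```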
tl;dr: Stefano designed the fingerprint keyer to remove all punctuation and control chars and split the string into whitespace-separated tokens. As you said in the first lines of this issue: if you think Stefano missed a few punctuation marks that should be removed, or that splitting on whitespace isn't covering all whitespace or word boundaries as defined by Unicode, then that should be fixed and this issue closed. What we consider punctuation and word boundaries (like whitespace) for fingerprint's defaults is the crux. For those defaults, I do wonder whether key collision methods should stick to the standard as defined by Unicode Text Segmentation, as other tools do. No surprises.
The problem, however, is that SOME users have asked for cluster break rules that fall outside Unicode language conventions, because their needs are not just human-language related: sometimes it's code, data, or weird strings they want to cluster. Hence they need fine-grained control, oftentimes their own clustering algorithm. I just don't know how you give them that control other than stripping away the cluster rules, at which point they have to supply their own clustering algorithm. But then, here we are talking about fingerprint, which is basically just a very fast normalizing strategy, specifically a key collision method for binning (not a grapheme cluster break strategy, etc.). And fingerprint was designed for human-language word/sentence clustering, not code, data, or weird strings.
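For reference, the Unicode Text Segmentation (UAX #29) word-boundary rules mentioned above are approximated by Java's built-in BreakIterator. A standalone sketch, not OpenRefine code:

```java
import java.text.BreakIterator;
import java.util.Locale;

public class WordBoundaryDemo {
    public static void main(String[] args) {
        String text = "they'll visit the café";
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String token = text.substring(start, end);
            if (!token.isBlank()) {          // skip the whitespace runs between words
                System.out.println(token);   // contractions like "they'll" stay intact
            }
        }
    }
}
```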
Although the fingerprint keyers currently do diacritic folding for alphabetic characters, they don't correctly handle all Unicode whitespace and punctuation characters.
Proposed solution
Both the FingerprintKeyer and NGramFingerprintKeyer should be extended to correctly handle all Unicode whitespace characters (e.g. em space, NBSP, ZWSP, etc.) and punctuation.
Additionally, the (almost) duplicate code in the N-gram keyer should be removed in favor of the common methods from the fingerprint keyer, making maintenance easier and less bug-prone.
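A hypothetical sketch of what that refactor could look like, with the N-gram keyer reusing a shared normalize() step (class names mirror the real keyers, but the bodies are illustrative only):

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.TreeSet;

class FingerprintKeyer {
    protected String normalize(String s) {
        String n = Normalizer.normalize(s.toLowerCase(), Normalizer.Form.NFD);
        n = n.replaceAll("\\p{M}+", "");          // fold diacritics (combining marks)
        n = n.replaceAll("(?U)\\s+", " ");        // any Unicode whitespace -> one space
        n = n.replaceAll("[\\p{Cc}\\p{Cf}]", ""); // drop remaining control/format chars
        return n.trim();
    }

    public String key(String s) {
        // sort and deduplicate the tokens, then rejoin
        return String.join(" ",
                new TreeSet<>(Arrays.asList(normalize(s).split(" "))));
    }
}

class NGramFingerprintKeyer extends FingerprintKeyer {
    @Override
    public String key(String s) {
        String n = normalize(s).replace(" ", ""); // reuse the parent normalization
        TreeSet<String> grams = new TreeSet<>();
        for (int i = 0; i + 2 <= n.length(); i++) {
            grams.add(n.substring(i, i + 2));     // character bigrams, deduplicated
        }
        return String.join("", grams);
    }
}

public class KeyerSketch {
    public static void main(String[] args) {
        System.out.println(new FingerprintKeyer().key("Éléphant\u00A0gris")); // elephant gris
        System.out.println(new NGramFingerprintKeyer().key("Éléphant gris"));
    }
}
```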