
Fingerprint functions should support Unicode whitespace and punctuation #3282

Closed · tfmorris opened this issue Oct 21, 2020 · 8 comments · Fixed by #3283

@tfmorris (Member):

Although the fingerprint keyers currently do diacritic folding for alphabetic characters, they don't correctly handle all Unicode whitespace and punctuation characters.

Proposed solution

Both the FingerprintKeyer and NGramFingerprintKeyer should be extended to correctly handle all Unicode whitespace characters (e.g. em space, NBSP, ZWSP, etc.) and punctuation.

Additionally, the (almost) duplicate code in the N-gram keyer should be removed in favor of the common methods from the fingerprint keyer, making maintenance easier and less bug-prone.
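
For illustration, a minimal sketch of what Unicode-aware normalization could look like in Java; the class, pattern names, and exact character classes here are illustrative, not OpenRefine's actual implementation:

    import java.util.regex.Pattern;

    // Sketch only: names and character classes are illustrative,
    // not OpenRefine's actual code.
    public class UnicodeFingerprintSketch {

        // With UNICODE_CHARACTER_CLASS, \s matches every character with the
        // Unicode White_Space property (NBSP, em space, NEL, ...), not just
        // the ASCII whitespace characters.
        private static final Pattern WHITESPACE =
                Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);

        // All Unicode punctuation (\p{P}) plus control and format characters,
        // minus the controls that are themselves whitespace (HT, LF, VT, FF,
        // CR, NEL), which must survive to act as token separators.
        private static final Pattern PUNCT_CTRL = Pattern.compile(
                "[\\p{P}[\\p{Cc}\\p{Cf}&&[^\\t\\n\\x0B\\f\\r\\u0085]]]");

        static String[] tokens(String s) {
            // Strip punctuation and non-whitespace controls, then split on
            // any run of Unicode whitespace.
            return WHITESPACE.split(PUNCT_CTRL.matcher(s).replaceAll(""));
        }

        public static void main(String[] args) {
            // Em space (U+2003) and NBSP (U+00A0) both act as delimiters here.
            System.out.println(java.util.Arrays.toString(
                    tokens("foo\u2003bar\u00A0baz, qux")));
            // -> [foo, bar, baz, qux]
        }
    }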

tfmorris added the expression language and Type: Feature Request labels on Oct 21, 2020
tfmorris self-assigned this on Oct 21, 2020
tfmorris added this to the 3.5 milestone on Oct 21, 2020
wetneb pushed a commit that referenced this issue Oct 25, 2020
* Add more keyer tests

- All forms of Unicode whitespace for both fingerprint & N-gram fingerprint
- additional N-gram fingerprint cases

* Improve fingerprint keyers

- Update N-gram fingerprint keyer to match (missed last time)
- refactor string normalization to reduce redundancy between two keyers
- add C1 controls to control characters that are stripped
- include all Unicode whitespace characters in splitting delimiter
  and don't strip controls which are whitespace (HT, LF, VT, FF, CR, NEL)
- minor cleanups, simplifications, and performance optimizations
@tfmorris (Member Author) commented Nov 3, 2020

One outstanding issue from this work is this:

    if (strong) {
        // TODO: Should these be converted to spaces instead of being removed?
        s = punctctrl.matcher(s).replaceAll("");

Currently we remove all punctuation, which has the effect of potentially merging two words: e.g. "this&that" or "this,that" will become the single word "thisthat". This is what we've historically done, but arguably the punctuation should be replaced with whitespace to separate the two tokens, as is common practice with most tokenizers. The downside is that "they'll" becomes "they" and "ll", and similarly for other contractions and possessives.
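
To make the trade-off concrete, a small illustration using plain java.util.regex (not the keyer code itself):

    import java.util.regex.Pattern;

    // Hedged illustration of the two behaviors discussed above.
    public class PunctDemo {
        public static void main(String[] args) {
            Pattern punct = Pattern.compile("\\p{P}");

            // Current behavior: punctuation removed, adjacent words merge.
            System.out.println(punct.matcher("this&that").replaceAll(""));   // thisthat
            // Alternative: punctuation becomes a space, words stay separate...
            System.out.println(punct.matcher("this&that").replaceAll(" "));  // this that
            // ...but contractions and possessives split too:
            System.out.println(punct.matcher("they'll").replaceAll(" "));    // they ll
        }
    }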

@thadguidry (Member) commented Nov 3, 2020

@tfmorris I guess we should begin by answering what our fingerprint is and whether it should be customizable or very rigid and opinionated.
I think it should stay fairly close to the algorithm that Stefano wrote up (https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint), and if we want to offer users more Text Analysis functions, we can offer those separately.

We can look to other programs to see how they handle things. ElasticSearch uses our fingerprint algorithm, by the way, because they liked the choices that Stefano made:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-fingerprint-tokenfilter.html
To your particular question, Stefano clearly wrote this:

> because whitespace is normalized, characters are lowercased, and punctuation is removed, those parts don't play a differentiation role in the fingerprint. Because these attributes of the string are the least significant in terms of meaning differentiation, these turn out to be the most varying parts of the strings and removing them has a substantial benefit in emerging clusters.

But we are not ElasticSearch, with Lucene and its entire ecosystem of Text Analysis. We could be, but that's not always what our users need. However, I would say they do need customization of certain common parameters.

This also typically depends on the stop-word configuration in other programs, like ElasticSearch (https://www.elastic.co/guide/en/elasticsearch/guide/master/stopwords.html): the, their, then, there, these, they, this, to, was, will, with, etc.

We could also argue that there are lots of customization points around our fingerprint algorithm that we don't let users override; we hard-code smart defaults, even down into our own StringUtils. But some can argue those defaults are not always "smart" for their edge cases ("why is fingerprint considering THAT character to be whitespace? I really don't like that in this language... hmm"), so a customizable whitespace tokenizer might make sense for fingerprint, or it might not. elastic/elasticsearch#54341 (comment)

All I know is our users love it when we give them more options and put the control in their hands.
For fingerprint, we should stick to Stefano's rules.
If users want more control over those rules, we should offer more Text Analysis functions that can be controlled and overridden, just like they can be in ElasticSearch. (The long road.)

@tfmorris (Member Author):

> I guess we should begin by answering what our fingerprint is and whether it should be customizable or very rigid and opinionated.

My question was much simpler. It can be summarized as: should punctuation be a token separator or not? Lucene's StandardTokenizer splits on both punctuation and whitespace, with the exception of periods not followed by whitespace, i.e. "example.com" is a single token.
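
For reference, a small sketch exercising Lucene's StandardTokenizer; the commented output is my expectation under its UAX #29-based rules, not verified project behavior:

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    // Sketch: show StandardTokenizer's treatment of periods and apostrophes
    // that sit between letters.
    public class LuceneTokenizerDemo {
        public static void main(String[] args) throws Exception {
            StandardTokenizer tok = new StandardTokenizer();
            tok.setReader(new StringReader("visit example.com, it's great"));
            CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
            tok.reset();
            while (tok.incrementToken()) {
                System.out.println(term.toString());
            }
            tok.close();
            // Expected: visit / example.com / it's / great
        }
    }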

> I think it should stay fairly close to the algorithm that Stefano wrote up (https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint)

My interpretation of that writeup was that it was describing the then-current Java code in English for users' benefit, not that it was intended as a proselytizing of the one true way. We've already modified the algorithm to do full Unicode case & diacritic folding as well as decomposed normalization. We need to update the description to match the code, not slavishly follow history if we've identified improvements.

> ElasticSearch uses our fingerprint algorithm, by the way, because they liked the choices that Stefano made:
> https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-fingerprint-tokenfilter.html

That filter applies to a token stream, so it isn't relevant to the question of tokenization.

> To your particular question, Stefano clearly wrote this:
>
> > because whitespace is normalized, characters are lowercased, and punctuation is removed, those parts don't play a differentiation role in the fingerprint. Because these attributes of the string are the least significant in terms of meaning differentiation, these turn out to be the most varying parts of the strings and removing them has a substantial benefit in emerging clusters.

All of that is true for both options: 1) tokenizing on whitespace, and 2) tokenizing on whitespace plus punctuation.

The question is really which one is better for users as the default.

@thadguidry (Member):

@tfmorris I would say 2)

maheshjindal pushed a commit to maheshjindal/OpenRefine that referenced this issue Nov 25, 2020
@wetneb (Member) commented Dec 8, 2020

I am not opposed to improving the existing tokenization algorithm. But we need to keep in mind that this is a change that will be tricky to review, since it is very hard to predict the impact of the change on the wide variety of things that users will throw at the algorithm.

Personally, I would rather make this more user-configurable. This could take the form of:

  • letting users choose between various predefined tokenizers (such as the ones in Apache Lucene)
  • letting users define tokenizers themselves with GREL expressions, or configure other parts of the fingerprinting process as expressions.

I think the second option would give the most flexibility.

@tfmorris (Member Author) commented Dec 8, 2020

Even fancy user-configurable things need defaults. Should the default tokenizer include punctuation as a token delimiter or not?

We certainly can (over)engineer a super fancy solution, but if we can improve our current implementation with a single-character change, that has a much better cost/benefit trade-off.
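
For concreteness, the single-character change refers to the replacement string in the TODO quoted earlier (a sketch, not a committed patch):

    // Before: punctuation deleted outright, merging adjacent words.
    s = punctctrl.matcher(s).replaceAll("");
    // After: punctuation treated as a token delimiter (the one-character change).
    s = punctctrl.matcher(s).replaceAll(" ");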

@wetneb (Member) commented Dec 9, 2020

> Should the default tokenizer include punctuation as a token delimiter or not?

Intuitively, including more characters as delimiters should only break up the source string into more tokens, so it should have a stronger normalizing power (clustering together inputs that formed separate clusters before), right? So the risk is that this brings up more spurious clusters that users would need to ignore. We should make a reasonable attempt at imagining what those spurious clusters could be. No concrete examples come to mind right now, but I think this deserves more thought.
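
To help imagine them, here is a hypothetical, much-simplified fingerprint (not the project's code; it skips case and diacritic folding details) comparing the two options:

    import java.util.Arrays;
    import java.util.TreeSet;

    // Hypothetical helper: lowercase, handle punctuation one of two ways,
    // then sort and de-duplicate the whitespace-separated tokens.
    public class FingerprintCompare {
        static String fingerprint(String s, boolean punctAsDelimiter) {
            String cleaned = s.toLowerCase()
                    .replaceAll("\\p{P}", punctAsDelimiter ? " " : "");
            return String.join(" ",
                    new TreeSet<>(Arrays.asList(cleaned.trim().split("\\s+"))));
        }

        public static void main(String[] args) {
            // Current behavior merges "on-line" with "online"...
            System.out.println(fingerprint("on-line", false)); // online
            // ...while the alternative merges it with "on line" instead.
            System.out.println(fingerprint("on-line", true));  // line on
        }
    }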

@thadguidry (Member) commented Dec 9, 2020

@tfmorris

tl;dr: Stefano designed the fingerprint to remove all punctuation and control chars and split the string into whitespace-separated tokens. As you said in the first lines of this issue, if you think Stefano missed a few punctuation marks that should be removed, or the whitespace splitting isn't covering all whitespace or word boundaries as defined by Unicode, then that should be fixed and we can close this issue. What we should consider as punctuation and word boundaries (like whitespace) for fingerprint's defaults is discussed below.


For defaults though, I do wonder if it makes sense for Key Collision methods to stick to the standard defined by Unicode Text Segmentation (UAX #29), as other tools do. No surprises.

> The default boundary determination builds upon the uniform character representation of the Unicode Standard, while handling the large number of characters and special features such as non-spacing marks and conjoining jamos in an effective manner. As this mechanism lends itself to a completely data-driven implementation, it can be tailored to particular orthographic conventions or user preferences without recoding.
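
The JDK ships a UAX #29-style word-boundary implementation, so those default rules can be inspected directly; a small sketch (the commented output is what the UAX #29 defaults prescribe; actual results can vary between JDK versions):

    import java.text.BreakIterator;
    import java.util.Locale;

    // Sketch: walk the default word boundaries with the JDK's BreakIterator
    // and keep only segments that contain a letter or digit.
    public class Uax29Demo {
        public static void main(String[] args) {
            String text = "they'll visit example.com";
            BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
            words.setText(text);
            int start = words.first();
            for (int end = words.next(); end != BreakIterator.DONE;
                    start = end, end = words.next()) {
                String token = text.substring(start, end);
                if (token.codePoints().anyMatch(Character::isLetterOrDigit)) {
                    System.out.println(token); // they'll / visit / example.com
                }
            }
        }
    }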

The problem, however, is that SOME users have asked for cluster break rules that fall outside Unicode language conventions, because their needs are not just human-language related: sometimes it's code, data, or weird strings they want to cluster. Hence they need fine-grained control, oftentimes their own clustering algorithm. I just don't know how you give them that control other than stripping away the cluster rules, and then they have to supply their own clustering algorithm.

But then, here we are talking about fingerprint, which is basically just a very fast normalizing strategy, specifically a key collision method for binning (not a grapheme cluster break strategy, etc.). And fingerprint was designed for human-language word/sentence clustering, not code, data, or weird strings.
