Fingerprint functions should support Unicode whitespace and punctuation #3282

@tfmorris

Description

Although the fingerprint keyers currently do diacritic folding for alphabetic characters, they don't correctly handle all Unicode whitespace and punctuation characters.
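To illustrate the gap, here is a minimal Java sketch. It assumes the keyers rely on Java's default `\s` and `\p{Punct}` regex classes, which is an assumption about the current implementation rather than something stated in this issue. The class name `WhitespaceGapDemo` and its members are hypothetical.

```java
import java.util.regex.Pattern;

public class WhitespaceGapDemo {
    // Java's default \s only matches ASCII whitespace: space, \t, \n, \x0B, \f, \r.
    static final Pattern ASCII_WS = Pattern.compile("\\s+");
    // \p{Z} adds the Unicode separator categories (NBSP, em space, ...).
    // ZWSP (U+200B) is a format character (Cf), so it is listed explicitly.
    static final Pattern UNICODE_WS = Pattern.compile("[\\s\\p{Z}\\u200B]+");

    // Java's default \p{Punct} only matches ASCII punctuation.
    static final Pattern ASCII_PUNCT = Pattern.compile("\\p{Punct}");
    // \p{P} matches the Unicode punctuation categories, e.g. curly quotes.
    static final Pattern UNICODE_PUNCT = Pattern.compile("\\p{P}");

    // Count the non-empty tokens produced by splitting on the given separator.
    static int countTokens(String s, Pattern separator) {
        int n = 0;
        for (String tok : separator.split(s)) {
            if (!tok.isEmpty()) {
                n++;
            }
        }
        return n;
    }
}
```

With these definitions, `countTokens("new\u00A0york", ASCII_WS)` yields 1 because the NBSP is not seen as a separator, while the same call with `UNICODE_WS` yields 2. Likewise, `ASCII_PUNCT` finds nothing in `"\u201Cquoted\u201D"`, whereas `UNICODE_PUNCT` matches the curly quotes.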

Proposed solution

Both the FingerprintKeyer and NGramFingerprintKeyer should be extended to correctly handle all Unicode whitespace characters (e.g. em space, NBSP, ZWSP) and punctuation.

Additionally, the (almost) duplicate code in the N-gram keyer should be removed in favor of the common methods from the fingerprint keyer, to make maintenance easier and less bug-prone.
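One possible shape for this change, as a rough sketch under stated assumptions rather than the actual OpenRefine code: a single `normalize` method that both keyers call. The class name `SharedKeyerSketch` and the methods `normalize`, `fingerprint`, and `ngramFingerprint` are hypothetical, and treating ZWSP as a token separator (instead of deleting it) is a design assumption.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.TreeSet;
import java.util.regex.Pattern;

public class SharedKeyerSketch {
    // ASCII whitespace, Unicode separators (NBSP, em space, ...) and ZWSP.
    private static final Pattern WHITESPACE = Pattern.compile("[\\s\\p{Z}\\u200B]+");
    // Unicode punctuation, remaining control characters, and combining marks.
    private static final Pattern STRIP = Pattern.compile("[\\p{P}\\p{Cc}\\p{M}]");

    // Shared normalization used by both keyers.
    static String normalize(String s) {
        // Collapse whitespace first so tabs and newlines still separate tokens.
        String t = WHITESPACE.matcher(s).replaceAll(" ").trim().toLowerCase(Locale.ROOT);
        // Decompose accented letters so the combining marks can be stripped.
        t = Normalizer.normalize(t, Normalizer.Form.NFD);
        return STRIP.matcher(t).replaceAll("");
    }

    // Word-level fingerprint: sorted, de-duplicated tokens joined by a space.
    static String fingerprint(String s) {
        TreeSet<String> tokens = new TreeSet<>();
        for (String tok : normalize(s).split(" ")) {
            if (!tok.isEmpty()) {
                tokens.add(tok);
            }
        }
        return String.join(" ", tokens);
    }

    // N-gram fingerprint: sorted, de-duplicated n-grams of the normalized
    // text with spaces removed. Works on UTF-16 code units, so characters
    // outside the BMP are not handled specially in this sketch.
    static String ngramFingerprint(String s, int n) {
        String chars = normalize(s).replace(" ", "");
        TreeSet<String> grams = new TreeSet<>();
        for (int i = 0; i + n <= chars.length(); i++) {
            grams.add(chars.substring(i, i + n));
        }
        return String.join("", grams);
    }
}
```

In this sketch, `fingerprint("Caf\u00E9\u00A0au\u2003lait!")` and `fingerprint("lait, au caf\u00E9")` both produce `au cafe lait`, and `ngramFingerprint("P\u00A0aris!", 2)` matches `ngramFingerprint("Paris", 2)`. Because the n-gram method only adds its own tokenization on top of `normalize`, a fix to whitespace or punctuation handling lands in both keyers at once.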

Metadata

Labels

Type: Feature Request: Identifies requests for new features or enhancements. These involve proposing new improvements.

expression language: Support for scripting languages (GREL, Python…)

localization: Anything to do with internationalization (i18n) and localization (l10n)
