Index/Tokenizer problem (RHEL 8, perl 5.26) #374
Comments
There appear to be two issues here:
Issue 1 can be fixed by hand, but some investigation is needed to ensure this table is created differently in future to use [...]. Issue 2, I expect, will require a code change, because something must be treating accented characters as a separator rather than as a valid character that may appear in a value for [...].
I have debugged this part of the way. Point 1 should be irrelevant: Index::Tokenizer::apply_mapping should ensure that only ASCII characters are stored in the word. Point 2 is not clear; if apply_mapping works as expected when called from MetaField::Name::get_index_codes_basic, it shouldn't be relevant either. You are correct that a form of utf8_bin is currently used; in our case it is utf8mb3_bin: show full columns from eprint__rindex;
There is a lot I still have to learn about EPrints, and indeed Unicode. I am wondering why character mapping is needed in the first place. Is it helpful to know that diacritic (accent) insensitive matching/comparison is possible with Perl's core/standard modules? https://www.perl.com/pub/2012/06/perlunicook-case--and-accent-insensitive-comparison.html/
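To make the core-module suggestion concrete, here is a minimal sketch using only Unicode::Normalize and Unicode::Collate, both of which ship with Perl. The helper name strip_marks is made up for illustration and is not part of EPrints.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);
use Unicode::Collate;

# Hypothetical helper: decompose each character, then drop the combining marks.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}+//g;
    return $decomposed;
}

# Level 1 compares base letters only, so accents and case are ignored.
my $collator = Unicode::Collate->new( level => 1 );

my $name   = "Bzdu\x{161}ek";    # U+0161 is s with caron
my $folded = strip_marks($name);
my $same   = $collator->eq( $name, "Bzdusek" ) ? "yes" : "no";

print "$folded (equal at level 1: $same)\n";
```

This may also partly answer why a mapping table exists at all: letters such as ø, ß or ł have no canonical decomposition, so stripping combining marks leaves them unchanged, and an explicit table (or a transliteration library) is still needed for those.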
Here is a follow-up: after two full days of debugging, trying out many variants, and getting more gray hair, we think the problem lies in how the hash $EPrints::Index::FREETEXT_CHAR_MAPPING in Index/Tokenizer.pm is accessed. However, we have now applied a solution that we also use in cfg.d/optional_filename_sanitise.pl to transliterate file names, and in several import plugins, which is much simpler and fail-safe: Text::Unidecode. This library splits the code point of a character into its upper and lower bytes and then uses those integer values as indexes into the transliteration tables, which are arrays, not hashes.
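The array-based lookup described above can be sketched roughly as follows. This is only an illustration of the idea, not the Text::Unidecode source: the @TABLE contents and the transliterate function are hypothetical and cover just three characters. With the module installed, the real call is simply unidecode($string).

```perl
use strict;
use warnings;

# Hypothetical, tiny table indexed as $TABLE[high byte][low byte] of the code point.
my @TABLE;
$TABLE[0x01][0x61] = 's';    # U+0161 LATIN SMALL LETTER S WITH CARON
$TABLE[0x01][0x0D] = 'c';    # U+010D LATIN SMALL LETTER C WITH CARON
$TABLE[0x00][0xE9] = 'e';    # U+00E9 LATIN SMALL LETTER E WITH ACUTE

sub transliterate {
    my ($text) = @_;         # expects a decoded character string
    my $out = '';
    for my $char ( split //, $text ) {
        my $cp = ord $char;
        if ( $cp < 0x80 ) {  # ASCII passes through unchanged
            $out .= $char;
            next;
        }
        my $row   = $TABLE[ $cp >> 8 ];                    # upper byte selects the row
        my $ascii = $row ? $row->[ $cp & 0xFF ] : undef;   # lower byte selects the entry
        $out .= defined $ascii ? $ascii : '?';             # unknown characters become '?'
    }
    return $out;
}

print transliterate("Bzdu\x{161}ek"), "\n";
```

Because the lookup is driven by the integer value from ord, it does not depend on how a hash key happens to be represented, which may be why this approach proved more robust for us.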
We have detected an indexing problem with perl_lib/EPrints/Index/Tokenizer.pm.
Characters outside the Latin-1 range (Unicode code point > 0x00FF) are not translated correctly when the words for the reverse index are created, although they are listed in the $EPrints::Index::FREETEXT_CHAR_MAPPING map.
The reverse index (eprint__rindex) for author names containing such a character is now a mixture of both versions, e.g. Bzdušek vs. Bzdusek. If we reindex one of the older records, its reverse index entry changes from Bzdusek back to Bzdušek.
If we search for Bzdušek, the records are not found.
We assume that this problem has existed since we upgraded to RHEL 8 and perl 5.26.3.
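One possible cause, which is only an assumption and has not been confirmed for the EPrints code, is that the string reaches the mapping as undecoded UTF-8 octets instead of as a character string. A character-keyed hash then never matches. The sketch below uses a hypothetical one-entry %MAPPING and a made-up map_chars function, not the real tokenizer:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for the mapping, keyed by characters (code points).
my %MAPPING = ( "\x{161}" => 's' );    # U+0161 LATIN SMALL LETTER S WITH CARON

sub map_chars {
    my ($text) = @_;
    return join '',
        map { exists $MAPPING{$_} ? $MAPPING{$_} : $_ } split //, $text;
}

my $bytes = "Bzdu\xC5\xA1ek";            # UTF-8 octets, not yet decoded
my $chars = decode( 'UTF-8', $bytes );   # character string containing U+0161

my $from_bytes = map_chars($bytes);      # the two octets never match the key
my $from_chars = map_chars($chars);      # the single character matches

printf "octets in:     %d units out, mapped: %s\n",
    length $from_bytes, ( $from_bytes eq 'Bzdusek' ? 'yes' : 'no' );
printf "characters in: %d units out, mapped: %s\n",
    length $from_chars, ( $from_chars eq 'Bzdusek' ? 'yes' : 'no' );
```

If this is what happens, checking utf8::is_utf8($value) or dumping ord for each character just before the lookup would show whether the tokenizer receives octets or characters.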
BTW: The Tokenizer code for EPrints 3.3 and EPrints 3.4 is quite different:
https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Index/Tokenizer.pm
https://github.com/eprints/eprints3.4/blob/master/perl_lib/EPrints/Index/Tokenizer.pm
We have tried both versions, to no avail.
Have others observed similar problems with perl 5.26 or higher? As far as I can see from the Perl documentation, Unicode support has changed (e.g. :encoding has been deprecated and removed).