Index/Tokenizer problem (RHEL 8, perl 5.26) #374

mpbraendle · 2024-02-19T12:21:40Z

We have detected an indexing problem with perl_lib/EPrints/Index/Tokenizer.pm

Characters which are above the ASCII table (UTF-8 code point > 0x00ff) are not translated correctly for creating the words in the reverse index, although they are listed in the $EPrints::Index::FREETEXT_CHAR_MAPPING map.

The reverse index (eprint__rindex) for one of the author names having a special character is now a mixture of both versions, e.g. Bzdušek vs. Bzdusek. If we reindex one of the older records, the reverse index entry it is reverted from Bzdusek to Bzdušek.

If we search with Bzdušek, the records are not found.

We assume that this exists since we upgraded to RHEL 8 and perl 5.26.3

BTW: The Tokenizer code for EPrints 3.3 and EPrints 3.4 is quite different:
https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Index/Tokenizer.pm
https://github.com/eprints/eprints3.4/blob/master/perl_lib/EPrints/Index/Tokenizer.pm

We have tried both versions, to no avail.

Have others observed similar problems with perl 5.26 or higher? As far as I have seen from perl documentation, Unicode support has changed (e.g. :encoding has been deprecated and removed).

drn05r · 2024-02-19T22:44:18Z

There looks to be two issues here:

The collation for eprint__rindex.word is utf8_bin if this is utf8_general_ci then if you search for Bzdusek then you get a result even of the eprint__rindex.word value is set to bzdušek.
If the search term contains accented characters rather than filtering onbzdušek for eprint__rindex.word it tries to filter on bzdu and ek as the only indexed term is bzdušek and the filter match exactly (not partial) then no results are found.

So 1 can be fixed by hand but some investigation is needed to ensure this table is created differently in future to use utf8_general_ci collation rather than utf8_bin, so this does not need to be manually fixed on new repositories. utf8_general_ci collation is already used for the eprint__rindex table but not for the word amongst other fields it contains.

2 I expect will require a code change because there must be something this treats accented characters as a separator rather than a valid character that may appear in a value for eprint__rindex.word.

mpbraendle · 2024-02-20T07:57:47Z

I have debugged to some part.

Point 1 should be irrelevant - Index::Tokenizer::apply_mapping should take care that only ASCII characters are stored in the word .
For author names, it is called from MetaField::Name::get_search_conditions (for searching) and MetaField::Name::get_index_codes_basic. Due to some reason, the encoding seems not to be correct and apply_mapping doesn't transliterate.
Also, apply_mapping is able to transliterate multiple characters into one and also has special rules. I doubt that utf8_general_ci will do that.

Point 2 is not clear, if apply_mapping works as expected from MetaField::Name::get_index_codes_basic, that shouldn't be relevant, neither.

You are correct, that a form of utf8_bin is used currently, in our case it is utf8mb3_bin:

ajmetz · 2024-05-31T11:11:08Z

There is a lot I still have to learn about EPrints, and indeed Unicode. Am wondering why the need to have character mapping in the first place. Is it helpful to know that diacritic(accent) insensitive matching/comparison is possible with Perl Core / Perl's Standard Modules? https://www.perl.com/pub/2012/06/perlunicook-case--and-accent-insensitive-comparison.html/

mpbraendle · 2024-05-31T12:52:51Z

Here a follow-up:

After two full days of debugging and trying out many variants and getting more gray hair, we think it is a problem how the hash $EPrints::Index::FREETEXT_CHAR_MAPPING in Index/Tokenizer.pm is addressed.
This behaves completely erratically, sometimes š is translated to s, sometimes not. It is as sometimes the hash would not exist.
This problem is observed when characters with UTF codepoint > 0x00ff are used (non-Ascii chars).
It might be that a “use 5.8.0” might remedy this (not tried out) by using the old Unicode implementation of perl.

However, we applied a solution now that we also use in cfg.d/optional_filename_sanitise.pl to transliterate file names and in several import plugins, which is much simpler and failsafe: Text::Unidecode

This library separates the upper and lower bytes of an UTF8 char and then adresses the transliteration tables, which are arrays, not hashes, by the respective integer value of the UTF8 bytes.
Since the transliteration tables are very extensive, maintaining $EPrints::Index::FREETEXT_CHAR_MAPPING is not necessary at all.
Also, it is possible to override the Text::Unidecode transliteration tables if one needs to. See https://metacpan.org/pod/Text::Unidecode
Also, I see that it’s part of the EPrints 3.3 package (but has been removed with EPrints 3.4).

drn05r added the bug Something isn't working label Feb 19, 2024

drn05r self-assigned this Feb 19, 2024

drn05r added this to the 3.4.6 milestone Feb 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Index/Tokenizer problem (RHEL 8, perl 5.26) #374

Index/Tokenizer problem (RHEL 8, perl 5.26) #374

mpbraendle commented Feb 19, 2024

drn05r commented Feb 19, 2024

mpbraendle commented Feb 20, 2024 •

edited

ajmetz commented May 31, 2024

mpbraendle commented May 31, 2024

Index/Tokenizer problem (RHEL 8, perl 5.26) #374

Index/Tokenizer problem (RHEL 8, perl 5.26) #374

Comments

mpbraendle commented Feb 19, 2024

drn05r commented Feb 19, 2024

mpbraendle commented Feb 20, 2024 • edited

ajmetz commented May 31, 2024

mpbraendle commented May 31, 2024

mpbraendle commented Feb 20, 2024 •

edited