Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate why some variants are not taken into account #1418

Closed
BoboTiG opened this issue Nov 5, 2022 · 18 comments · Fixed by #1421
Closed

Investigate why some variants are not taken into account #1418

BoboTiG opened this issue Nov 5, 2022 · 18 comments · Fixed by #1421
Labels
bug Something isn't working

Comments

@BoboTiG
Copy link
Owner

BoboTiG commented Nov 5, 2022

Reported by Alex6.

In French, the word encalminés doesn't return definitions.
Using our CLI, we can confirm what is available on the Wiktionary: encalminés has 2 variants (encalminé, and encalminer):

$ python -m wikidict fr --get-word encalminés 
encalminés \ɑ̃.kal.mi.ne\  

[variants] encalminer, encalminé
$ python -m wikidict fr --get-word encalminer
encalminer \ɑ̃.kal.mi.ne\  

De calme, avec le préfixe en-.


  1. Faire cesser, ou cesser, d’avancer, par manque de vent.
  2. (Figuré) Bloquer l’avancement, le progrès.
$ python -m wikidict fr --get-word encalminé 
encalminé \ɑ̃.kal.mi.ne\  

(Siècle à préciser) De calme avec les affixes en- et -iné.


  1. (Marine) Immobilisé, en parlant d’un voilier, par l’absence de vent ou à l’abri dans un havre.
  2. (Figuré) En parlant d'une personne : à l'arrêt, en attente de.
[variants] encalminer

I open the issue for more investigations, for instance I didn't check Kobo logs yet.

In such a situation, I would have thought both words to be displayed 🤔

@BoboTiG BoboTiG added the bug Something isn't working label Nov 5, 2022
@lasconic
Copy link
Collaborator

lasconic commented Nov 5, 2022

To test, we can create a dictionary with these 3 words only.

@lasconic
Copy link
Collaborator

lasconic commented Nov 5, 2022

My first guess would be that we don't register encalminés because it's a plural and we hoped for Kobo to display encalminé automatically.
Because of : https://pgaskin.net/dictutil/dicthtml/format.html and this statement:

It is not necessary to explicitly specify variants for things like plurals with a s suffix, as if a word cannot be found, the first prefix match will be used (i.e. tests will match a headword/variant named test). For this reason, words should be sorted from shortest to longest in alphabetical order in general (to prevent accidental matches of the wrong word).

@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 9, 2022

Arf we no longer have access to logs, Kobo is now encrypting them (cf https://www.mobileread.com/forums/showpost.php?p=4166564&postcount=5). What a shame.

Asked for some hints on pgaskin/dictutil#20.

@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 10, 2022

Another use case:

$ python -m wikidict fr --get-word contingente                                      
contingente \kɔ̃.tɛ̃.ʒɑ̃t\ f. 

  1. (Désuet) (Chez Descartes) Variante de cotangente.
[variants] contingent, contingenter

So the Kobo will only display 1. (Désuet) (Chez Descartes) Variante de cotangente., but the awaited definition here is the one from "contigent".

Same issue with "frauduleuses" that doesn't display definitions (but it should go through "frauduleuse" -> "frauduleux", and display definitions from "fraudlueux").

Maybe could we improve that without having to rewrite the Kobo firmware :D

@pgaskin
Copy link

pgaskin commented Nov 10, 2022

My first guess would be that we don't register encalminés because it's a plural and we hoped for Kobo to display encalminé automatically.

IIRC, I was only referring to English when I said that.

@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 10, 2022

Here are logs when looking for "encalminés":

Nov 10 14:50:25 nickel: (    16.265 @ 0x2a1a7f0 / ui.warning) static QByteArray Unzipper::extractFile(const QString&, const QString&) "/mnt/onboard/.kobo/dict/dicthtml-fr.zip" file not found "prefix_exceptions" 
Nov 10 14:50:36 nickel: (    27.046 @ 0x2a1a7f0 / dictionary.debug)  word:  "encalminés" 
Nov 10 14:50:36 nickel: (    27.215 @ 0x2a1a7f0 / dictionary.debug) got alternative search terms:  ("encalminés", "ENCALMINÉS", "Encalminés", "encalmins")  for word:  "encalminés" 

It doesn't look for "encalminé" but "encalmins" 😨

@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 10, 2022

Using a custom dict created like:

$ python -m wikidict fr --gen-dict 'encalminer,encalminé,encalminés,encalminées' --output 'issue-1418'

we got that en.raw.html file:

<w>
    <p><a name="encalminé" /><b>encalminé</b> \ɑ̃.kal.mi.ne\<br /><br />
    <p>...</p>
    <var>
        <variant name="encalmines" />
    </var>
</w>

<w>
    <p><a name="encalminer" /><b>encalminer</b> \ɑ̃.kal.mi.ne\<br /><br />
    <p>...</p>
    <var>
        <variant name="encalmine" />
        <variant name="encalmines" />
        <variant name="encalminees" />
    </var>
</w>

</html>

We can see that accents were removed from variants. I'm a little bit rusty, is it expected @lasconic?

@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 10, 2022

Removing the call to

def strip_accents(self, s: str) -> str:
return "".join(
c
for c in unicodedata.normalize("NFD", s)
if unicodedata.category(c) != "Mn"
)
fixes the issue.

It was added in #1227. Maybe could we call strip_accents() only for the Spanish dictionary?

@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 10, 2022

If we decided to apply the fix, "encalminées" would redirect to "encalminer" instead of both "encalminer", and "encalminé". And it's expected because on the Wiktionary "encalminées" only links to "encalminer". We would have to add a section on the Wiktionary to fix that (it they like such modifications though).

@lasconic
Copy link
Collaborator

About contingente, this is the expected behavior. If a word has a definition we don't use the variants at all.

About frauduleuses, the variant is "frauduleux" already.

python -m wikidict fr --get-word frauduleuses
frauduleuses \fʁɔ.dy.løz\  

[variants] frauduleux

And frauduleux has definitions so, it should be displayed.

If you run

python -m wikidict fr --gen-dict=frauduleux,frauduleuse,frauduleuses --output=test_wik

and check test_wik/tmp/fr.raw.html

@lasconic
Copy link
Collaborator

We can see that accents were removed from variants. I'm a little bit rusty, is it expected @lasconic?

The VARIANT must be normalized by trimming whitespace and lowercasing it (following unicode normalization rules, i.e. normalizing accented characters as well)
According to https://pgaskin.net/dictutil/dicthtml/format.html again

@lasconic
Copy link
Collaborator

If you have a better behaviour without it, let's remove it for spanish as well.

@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 10, 2022

Hm but we can see that not normalizing the variant works (at least for the current example). And normalization was added sooner in the year, quite "late" in the project.

@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 10, 2022

It's just that those lines seem weird:

if guess_prefix(v) == name:
var += f'<variant name="{self.strip_accents(v)}"/>'

guess_prefix() uses the non-normalized variant, while we store the normalized version in the dict. Maybe such behaviour is inefficient, or only necessary for spanish? 🤔

@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 10, 2022

(Or maybe latest Kobo updates improved the "search" process, and normalization became obsolete.)

@lasconic
Copy link
Collaborator

Yes, I would get rid of the normalisation if it works for french and see if problems arise in spanish.

@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 10, 2022

Yes, I would get rid of the normalisation if it works for french and see if problems arise in spanish.

Let's see then 👍

@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 10, 2022

About frauduleuses, the variant is "frauduleux" already.

Indeed. I don't know what I did, but it works like a charm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants