Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Does not distinguish between ASCII and non-ASCII vowels #126

Closed
gunnarhj opened this issue Jul 16, 2020 · 19 comments
Closed

Does not distinguish between ASCII and non-ASCII vowels #126

gunnarhj opened this issue Jul 16, 2020 · 19 comments

Comments

@gunnarhj
Copy link
Contributor

gunnarhj commented Jul 16, 2020

The Swedish alphabet includes three additional vowels besides the ASCII letters: å, ä and ö. When using Typing Booster with Swedish spellchecking enabled, the suggestion window seems to treat a, å and ä as the same letter. For a Swedish speaker that is just as weird as if the application would treat a, e, i, etc. as the same letter.

@gunnarhj gunnarhj changed the title Does not distinguish between non-ASCII vowels Does not distinguish between ASCII and non-ASCII vowels Jul 16, 2020
@mike-fabian
Copy link
Owner

I am not sure whether I understand. I do the completion accent insensitive on purpose to make it easier to type accents if you either do not have a suitable keyboard layout or if you are not sure which characters need accents.

For example, when I type my native language, German, and want to type “grün”, I can just type “grun” and select “grün” from the candidates. I am using the US keyboard layout which does not have a “ü”. Actually I am using a version of the US layout which I modified a bit and I have “ü” on AltGr+u, but that needs two keys for typing, just typing “grun” and selecting is often faster.
I also mostly have “t-latn-post” configured in ibus-typing-booster, so I can also type u" to get a ü, but that needs even three keystrokes on the US layout as the " needs the Shift key.

When I am typing French, Spanish, or Italian, i.e. languages which I don’t yet know very well, I am often unsure which characters need accents. So it is often very helpful to just type the word without accents and look at the suggestions and select the correct version of the word I am trying to type.

@mike-fabian
Copy link
Owner

mike-fabian commented Jul 17, 2020

Originally, long ago, the completion was not accent insensitive, for me it was a big improvement to have it accent insensitive.

I could add an option to make this a choice, but I am not sure how useful such an option really would be. For me,
accent insensitive match is much better, also for German but very much better for French.

Adding such an option to choose between accent sensitive and accent insensitive matching would be quite easy.
As I added the insensitive matching on purpose, I could also disable it optionally.

I am hesitating to introduce lots of extra options, unless they are really necessary and useful, I am not yet convinced that accent sensitive matching is useful, I didn’t like it at all when ibus-typing-booster matched accent sensitive long ago.

@mike-fabian
Copy link
Owner

Maybe a compromise is possible without introducing a new option:

There are actually two things where accent insensitive matching can happen:

  • Matching words in the user database, i.e. words you have typed before. For example, if I have typed "grün" before and therefore "grün" is in the user database, typing "grun" will match it (with the current, accent insensitive code)
  • Matching words in a hunspell dictionary, i.e. when the dictionary contains "grün" and I type "grun", it matches.

Matching in the dictionary could be easily disabled for Swedish (and maybe Finnish, Norwegian, Danish, ...).

See this code:

https://github.com/mike-fabian/ibus-typing-booster/blob/master/engine/hunspell_suggest.py#L93

Currently “sv” is on that list.

If accent insensitive matching makes not sense at all for Swedish, removing
“sv” from that list would get us half the way there.

But you would still get accent insensitive matching for words you typed before.

I.e. if you typed “alkoholförgiftning” once, you have

input: “alkoholforgiftning” expands to: “alkoholförgiftning”

in the user database and the next time you type “alkoholforgiftning” you will get “alkoholförgiftning” as a match. Because the current code strips the accents from the “input” column of the user database table. sqlite does not have accent sensitive/insensitive matching. So my current solution for this is to record the input of the user always with the accents stripped and save it in the input column but save the full word with all the accents in the result column.

If I add an option to disable this behaviour, it would affect only the words typed
after the option was changed, if

input: “alkoholforgiftning” expands to: “alkoholförgiftning”

was typed before changing the option to "accent sensitive", this word would
still match accent insensitive. If one does not like a match from the user database,
one can delete it with Control+ though.

In most cases it is not really necessary to delete unwanted candidates because if one
never selects an unwanted candidate, it fades away because other candidates which are selected increase their score over time but the unwanted candidate is never selected and thus stays at the same score. And it moves lower and lower in the candidate list automatically.

I could also do this:

  • "accent insensitive matching" is on by default
  • If this option is switched off, all lines from the user database where the accents did not match are deleted

This would delete also lines which contain non-Swedish words like French where the accent insensitive matching might have been useful. So I am not sure whether this would not go too far ...

@mike-fabian
Copy link
Owner

mike-fabian commented Jul 17, 2020

How about removing "sv" from

https://github.com/mike-fabian/ibus-typing-booster/blob/master/engine/hunspell_suggest.py#L93

and test how much this helps? Maybe it is enough already and I don’t have to implement an extra option.

If "sv" were not on that list,

input: “alkoholforgiftning” expands to: “alkoholförgiftning”

would never be added to the user database anyway because “alkoholforgiftning” would not match “alkoholförgiftning” when matching in the hunspell dictionary and therefore one could not select “alkoholförgiftning” after typing only “alkoholforgiftning”, there would be no such candidate in the beginning. And if one cannot select such a candidate after typing only “alkoholforgiftning”, then there is no way that

input: “alkoholforgiftning” expands to: “alkoholförgiftning”

could be added to the user database.

So my guess it that just removing “sv” from that list would give us 99% of what we want for Swedish without introducing a new option and without harming other languages like German and French.

What do you think?

@gunnarhj
Copy link
Contributor Author

Thanks for your quick response, @mike-fabian. Have been AFK today, and will be for a few more hours. Will read your comments later, and try what you suggested. Please don't spend more time on it in the meantime.

@gunnarhj
Copy link
Contributor Author

Thanks a lot for the in depth explanation of your considerations on the topic.

You make me hesitate on what I would prefer. Swedish does make use of accented letters such as é and à. But those are just that, i.e. accented letters not included in the alphabet, and in case of those I would most certainly prefer the current accent insensitive behavior.

As regards å, ä and ö they are distinct additional vowels included in the Swedish alphabet, and I consider those to be something else but just accented letters.

Then I read your use case with typing a European language using some English layout, and I understand what you say.

I would like to consult with my Swedish friend who called my attention to the current behavior, and also give it some own further thought. We may experiment with dropping 'sv' from the accent_languages list.

In the meantime I would appreciate if you could keep this issue open. We will get back later after having thought it through.

@mike-fabian
Copy link
Owner

Another idea: Leave “sv” in the list but when this list is used to process a hunspell dictionary and the language is “sv”, do not drop all accents but keep å, ä and ö, only drop all other accents.

I think I could do that, for in addition to the above list of languages where accent insensitive matching in the hunspell dictionaries makes sense, add another list which accented letters to keep

non_ascii_base_letters = {'sv': 'åäö', 'nb': 'åø', ...}

The question would be whether I can do that fast enough that it doesn’t make the startup of ibus-typing-booster noticeably slower.

If you look at the code below the above list of languages you see that if a language is in that list, I generate a list of word pairs from the hunspell dictionary:

word with accents removed, original word

Then I later do the matching in the word with accents removed and if it matches I display the original word as a candidate.

When generating this list of word pairs, I use the function itb_util.remove_accents(x) which returns the string
without accents. Maybe I could improve that function to take a second parameter which non-ascii letters to keep,
maybe like itb_util.remove_accents(x, keep='åäö') and pass different keep lists depending on the language.
If I find a way to do that and keep itb_util.remove_accents() fast enough that processing a dictionary does not take too much time, then this might be an idea to do it. The dictionaries are processed that way only once, when ibus-typing-booster starts,
i.e. when you switch to ibus-typing-booster using the gnome-panel for the first time in that gnome session, or, if ibus-typing-booster is already selected when the gnome-session starts then this is done when the session starts.

@mike-fabian
Copy link
Owner

https://github.com/mike-fabian/ibus-typing-booster/blob/master/engine/itb_util.py#L2605

Here is itb_util.remove_accents().

It already iterates over the string. I calls unicodedata.normalize('NFKD', text) once for the whole string though and then iterates over the result. When I want to make exceptions for something like keep='åäö', I would need to iterate over the string and keep every character on the keep list and call unicodedata.normalize('NFKD', character) once for each character which is not on the keep list. translate() would also need to be called once per character instead once for the whole word. unicodedata.category(x) is already called once per character. So it would need a few more function calls, but maybe it is still fast enough.

If that idea sounds good, I’ll try it and measure whether it is still fast enough. I guess it will be still fast enough.

I could also use the current code if keep='' and use the code with more function calls only if the keep list is not empty,
then it would become somewhat slower in startup only for languages which have a non-empty keep list.

@gunnarhj
Copy link
Contributor Author

Maybe I could improve that function to take a second parameter which non-ascii letters to keep,

That idea sounds interesting indeed. Actually I was about to ask whether it's possible to special case 'åäö' for Swedish somehow, but since you made me hesitate above, I abstained for now.

If you would go ahead along those lines, I suppose that it would be of interest for other languages but Swedish. Not sure which ones, though.

But let's not rush things. Neither I nor my friend is a regular ibus-typing-booster user (yet). I just packaged the thing...

@mike-fabian
Copy link
Owner

If you would go ahead along those lines, I suppose that it would be of interest for other languages but Swedish. Not sure which ones, though.

Certainly Norwegian and Danish and also Finnish. They all sort several “accented” letters after “z” which indicates that they are considered different base letters.

@gunnarhj
Copy link
Contributor Author

@mike-fabian: After having given it some more thought, and talked with my friend, I would say that the idea with a parameter stating which non-ascii letters to keep would be a significant improvement for the Nordic languages, provided it can be done without slowing it down too much. I would really appreciate if you could give it a try.

@mike-fabian
Copy link
Owner

I am working on this.
The slowdown will be insignificant.

I tried what happens when I just remove "sv" from the list, and it worked as I expected, i.e. it seemed to work well, it was an improvement for öäå. But as é and à are different, it seems better to not remove "sv" from the list completely. So I am working on the a bit more complicated way of doing it, which is probably better.

@mike-fabian
Copy link
Owner

mike-fabian commented Aug 5, 2020

The performance impact of what I implemented seems negligible if only dictionaries are used which don’t need special character treatment.

If a Swedish dictionary is used, a few letters get special treatment and count as distinct letters, not variants of a base letter:

 'sv': 'åÅäÄöÖ',

See also:

7cf78ee#diff-362ec3fe0b816c6315ff940765ec676fR140

Same for Norwegian, Danish, and Finnish.

This reduces the performance slightly when using the Swedish dictionary, I can measure it, but interactively I don’t really notice it.

@mike-fabian
Copy link
Owner

This testcase shows what it does for Swedish:

7cf78ee#diff-e9cfd35c4ea77c6b831cb468e313f07aR264

    def test_sv_SE(self):
        h = hunspell_suggest.Hunspell(['sv_SE'])
        self.assertEqual(
            h.suggest('östgo'),
            [('östgot', 0),
             ('östgöte', 0),
             ('östgotisk', 0),
             ('östgötsk', 0),
             ('östgötska', 0)])

I.e. when typing “östgo”, one gets the matches
“östgot”, “östgöte”, “östgotisk”, “östgötsk”, and “östgötska”.

So the final “o” in the input still matches both “o” and “ö”. This
is, because the “o” when thinking of the input in normalization form D
is still incomplete, it could still turn into an “ö” when a combining
diaeresis is typed or when a “"” is typed after the “o” using the
Latin postfix input method.

But when adding one more letter “t”:

        self.assertEqual(
            h.suggest('östgot'),
            [('östgot', 0),
             ('östgotisk', 0),
             ('Östgot', -1)])

This “finishes” the “o”, there is no way it can become an “ö” anymore, so
one gets only matches where this letter is “o”.

If the input is “östgö”, the last letter is complete and thus it does not match
“o” anymore:

        self.assertEqual(
            h.suggest('östgö'),
            [('östgöte', 0),
             ('östgötsk', 0),
             ('östgötska', 0)])

When adding one more letter and the input is “östgöt”, one gets one
match with “o” back, “östgot”. This comes from spellchecking the input
“östgöt” the spellchecker (python3-enchant) gives “östgot” as a possible
spelling correction for “östgöt”:

        self.assertEqual(
            h.suggest('östgöt')[0:4],
            [('östgöte', 0),
             ('östgötsk', 0),
             ('östgötska', 0),
             ('östgot', -1)])

In this lists, the spelling corrections are marked with the weights
-1, the suggestions which are just extensions of the input without
changing any letters of the input are marked with the weights 0.

@gunnarhj
Copy link
Contributor Author

gunnarhj commented Aug 5, 2020

That sounds very promising, @mike-fabian. I'm about to upload 2.9.5, and will test it once available in Debian/Ubuntu. Thanks a lot for your work with this!

Also, as regards uploads, I have an OT question: Debian uses a tool for fetching the upstream tarball, and in the case of ibus-typing-booster it fetches the 2.9.5.tar.gz file. I accidentally noticed that it differs from the ibus-typing-booster-2.9.5.tar.gz file. Why is there a difference and which one is the most 'right'?

@mike-fabian
Copy link
Owner

@mike-fabian
Copy link
Owner

mike-fabian commented Aug 6, 2020

If you use https://github.com/mike-fabian/ibus-typing-booster/archive/2.9.5.tar.gz, you have to start with

./autogen.sh --some-options

to create ./configure and some other files.

https://github.com/mike-fabian/ibus-typing-booster/releases/download/2.9.5/ibus-typing-booster-2.9.5.tar.gz

contains more files. This file is what you get after doing

./autogen.sh --prefix=/usr
make
make dist

The make dist produces the ibus-typing-booster-2.9.5.tar.gz. And this file produced by make dist is the file:

https://github.com/mike-fabian/ibus-typing-booster/releases/download/2.9.5/ibus-typing-booster-2.9.5.tar.gz

So you can start with https://github.com/mike-fabian/ibus-typing-booster/archive/2.9.5.tar.gz as well if you want and if it doesn't bother you to run ./autogen.sh and have a few more tools needed for building the package.

@gunnarhj
Copy link
Contributor Author

gunnarhj commented Aug 6, 2020

Thanks for explaining. Well, I'm not very bothered - the build is done magically by Debian's build system (Ubuntu's buildlog as an example) and I'm assuming that some equivalent to the commands you mentioned is accomplished. So since there is no difference in contents, I think I'll keep using https://github.com/mike-fabian/ibus-typing-booster/archive/2.9.5.tar.gz for now.

I have tested the result of 7cf78ee, and there is a substantial change in the right direction. It behaves much better from a Swedish POV now. Thanks again!

@mike-fabian
Copy link
Owner

Yes, if the Debian build system can handle it, it is perfectly fine to use https://github.com/mike-fabian/ibus-typing-booster/archive/2.9.5.tar.gz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants