New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Does not distinguish between ASCII and non-ASCII vowels #126
Comments
I am not sure whether I understand. I do the completion accent insensitive on purpose to make it easier to type accents if you either do not have a suitable keyboard layout or if you are not sure which characters need accents. For example, when I type my native language, German, and want to type “grün”, I can just type “grun” and select “grün” from the candidates. I am using the US keyboard layout which does not have a “ü”. Actually I am using a version of the US layout which I modified a bit and I have “ü” on AltGr+u, but that needs two keys for typing, just typing “grun” and selecting is often faster. When I am typing French, Spanish, or Italian, i.e. languages which I don’t yet know very well, I am often unsure which characters need accents. So it is often very helpful to just type the word without accents and look at the suggestions and select the correct version of the word I am trying to type. |
Originally, long ago, the completion was not accent insensitive, for me it was a big improvement to have it accent insensitive. I could add an option to make this a choice, but I am not sure how useful such an option really would be. For me, Adding such an option to choose between accent sensitive and accent insensitive matching would be quite easy. I am hesitating to introduce lots of extra options, unless they are really necessary and useful, I am not yet convinced that accent sensitive matching is useful, I didn’t like it at all when ibus-typing-booster matched accent sensitive long ago. |
Maybe a compromise is possible without introducing a new option: There are actually two things where accent insensitive matching can happen:
Matching in the dictionary could be easily disabled for Swedish (and maybe Finnish, Norwegian, Danish, ...). See this code: https://github.com/mike-fabian/ibus-typing-booster/blob/master/engine/hunspell_suggest.py#L93 Currently “sv” is on that list. If accent insensitive matching makes not sense at all for Swedish, removing But you would still get accent insensitive matching for words you typed before. I.e. if you typed “alkoholförgiftning” once, you have input: “alkoholforgiftning” expands to: “alkoholförgiftning” in the user database and the next time you type “alkoholforgiftning” you will get “alkoholförgiftning” as a match. Because the current code strips the accents from the “input” column of the user database table. sqlite does not have accent sensitive/insensitive matching. So my current solution for this is to record the input of the user always with the accents stripped and save it in the input column but save the full word with all the accents in the result column. If I add an option to disable this behaviour, it would affect only the words typed input: “alkoholforgiftning” expands to: “alkoholförgiftning” was typed before changing the option to "accent sensitive", this word would In most cases it is not really necessary to delete unwanted candidates because if one I could also do this:
This would delete also lines which contain non-Swedish words like French where the accent insensitive matching might have been useful. So I am not sure whether this would not go too far ... |
How about removing "sv" from https://github.com/mike-fabian/ibus-typing-booster/blob/master/engine/hunspell_suggest.py#L93 and test how much this helps? Maybe it is enough already and I don’t have to implement an extra option. If "sv" were not on that list, input: “alkoholforgiftning” expands to: “alkoholförgiftning” would never be added to the user database anyway because “alkoholforgiftning” would not match “alkoholförgiftning” when matching in the hunspell dictionary and therefore one could not select “alkoholförgiftning” after typing only “alkoholforgiftning”, there would be no such candidate in the beginning. And if one cannot select such a candidate after typing only “alkoholforgiftning”, then there is no way that input: “alkoholforgiftning” expands to: “alkoholförgiftning” could be added to the user database. So my guess it that just removing “sv” from that list would give us 99% of what we want for Swedish without introducing a new option and without harming other languages like German and French. What do you think? |
Thanks for your quick response, @mike-fabian. Have been AFK today, and will be for a few more hours. Will read your comments later, and try what you suggested. Please don't spend more time on it in the meantime. |
Thanks a lot for the in depth explanation of your considerations on the topic. You make me hesitate on what I would prefer. Swedish does make use of accented letters such as As regards Then I read your use case with typing a European language using some English layout, and I understand what you say. I would like to consult with my Swedish friend who called my attention to the current behavior, and also give it some own further thought. We may experiment with dropping 'sv' from the accent_languages list. In the meantime I would appreciate if you could keep this issue open. We will get back later after having thought it through. |
Another idea: Leave “sv” in the list but when this list is used to process a hunspell dictionary and the language is “sv”, do not drop all accents but keep å, ä and ö, only drop all other accents. I think I could do that, for in addition to the above list of languages where accent insensitive matching in the hunspell dictionaries makes sense, add another list which accented letters to keep
The question would be whether I can do that fast enough that it doesn’t make the startup of ibus-typing-booster noticeably slower. If you look at the code below the above list of languages you see that if a language is in that list, I generate a list of word pairs from the hunspell dictionary: word with accents removed, original word Then I later do the matching in the word with accents removed and if it matches I display the original word as a candidate. When generating this list of word pairs, I use the function |
https://github.com/mike-fabian/ibus-typing-booster/blob/master/engine/itb_util.py#L2605 Here is It already iterates over the string. I calls If that idea sounds good, I’ll try it and measure whether it is still fast enough. I guess it will be still fast enough. I could also use the current code if |
That idea sounds interesting indeed. Actually I was about to ask whether it's possible to special case ' If you would go ahead along those lines, I suppose that it would be of interest for other languages but Swedish. Not sure which ones, though. But let's not rush things. Neither I nor my friend is a regular |
Certainly Norwegian and Danish and also Finnish. They all sort several “accented” letters after “z” which indicates that they are considered different base letters. |
@mike-fabian: After having given it some more thought, and talked with my friend, I would say that the idea with a parameter stating which non-ascii letters to keep would be a significant improvement for the Nordic languages, provided it can be done without slowing it down too much. I would really appreciate if you could give it a try. |
I am working on this. I tried what happens when I just remove "sv" from the list, and it worked as I expected, i.e. it seemed to work well, it was an improvement for öäå. But as é and à are different, it seems better to not remove "sv" from the list completely. So I am working on the a bit more complicated way of doing it, which is probably better. |
The performance impact of what I implemented seems negligible if only dictionaries are used which don’t need special character treatment. If a Swedish dictionary is used, a few letters get special treatment and count as distinct letters, not variants of a base letter:
See also: 7cf78ee#diff-362ec3fe0b816c6315ff940765ec676fR140 Same for Norwegian, Danish, and Finnish. This reduces the performance slightly when using the Swedish dictionary, I can measure it, but interactively I don’t really notice it. |
This testcase shows what it does for Swedish: 7cf78ee#diff-e9cfd35c4ea77c6b831cb468e313f07aR264
I.e. when typing “östgo”, one gets the matches So the final “o” in the input still matches both “o” and “ö”. This But when adding one more letter “t”:
This “finishes” the “o”, there is no way it can become an “ö” anymore, so If the input is “östgö”, the last letter is complete and thus it does not match
When adding one more letter and the input is “östgöt”, one gets one
In this lists, the spelling corrections are marked with the weights |
That sounds very promising, @mike-fabian. I'm about to upload 2.9.5, and will test it once available in Debian/Ubuntu. Thanks a lot for your work with this! Also, as regards uploads, I have an OT question: Debian uses a tool for fetching the upstream tarball, and in the case of ibus-typing-booster it fetches the 2.9.5.tar.gz file. I accidentally noticed that it differs from the ibus-typing-booster-2.9.5.tar.gz file. Why is there a difference and which one is the most 'right'? |
If you use https://github.com/mike-fabian/ibus-typing-booster/archive/2.9.5.tar.gz, you have to start with
to create contains more files. This file is what you get after doing
The So you can start with https://github.com/mike-fabian/ibus-typing-booster/archive/2.9.5.tar.gz as well if you want and if it doesn't bother you to run |
Thanks for explaining. Well, I'm not very bothered - the build is done magically by Debian's build system (Ubuntu's buildlog as an example) and I'm assuming that some equivalent to the commands you mentioned is accomplished. So since there is no difference in contents, I think I'll keep using https://github.com/mike-fabian/ibus-typing-booster/archive/2.9.5.tar.gz for now. I have tested the result of 7cf78ee, and there is a substantial change in the right direction. It behaves much better from a Swedish POV now. Thanks again! |
Yes, if the Debian build system can handle it, it is perfectly fine to use https://github.com/mike-fabian/ibus-typing-booster/archive/2.9.5.tar.gz |
The Swedish alphabet includes three additional vowels besides the ASCII letters:
å
,ä
andö
. When using Typing Booster with Swedish spellchecking enabled, the suggestion window seems to treata
,å
andä
as the same letter. For a Swedish speaker that is just as weird as if the application would treata
,e
,i
, etc. as the same letter.The text was updated successfully, but these errors were encountered: