User Dictionary Encoding #183

Closed
xylographe opened this issue Jun 25, 2019 · 8 comments

@xylographe

When trying to add an unknown word to the user dictionary, I get,

“Word cannot be added — Sadly, this word contains symbols out of current dictionary encoding, thus it cannot be added to user dictionary. You can convert this dictionary to UTF8 (don't forget to change the line SET {encoding} in .aff file) or choose the different one with appropriate encoding.”

No real surprise, the language is Dutch, and there is no standard (ISO/Windows) fixed-width 8-bit encoding that can represent all Dutch characters. Hence, UTF-8 (or some other variable-width Unicode Transformation Format) is mandatory.

What I don't understand is, how to convert this user dictionary to UTF-8. There is no .aff file, only a nl_NL.usr file. Inserting a UTF-8 BOM at the start of nl_NL.usr doesn't help. What shall I do?

@Predelnik
Owner

First of all, did you download the Dutch dictionary from the default location, e.g. https://github.com/LibreOffice/dictionaries?
That one seems to be in UTF-8 already, which is why the error looks a bit strange, but maybe it's reported incorrectly. Secondly, if that's not the problem, can you tell me which word you are trying to add, since it may depend on the letters in it?

Also, it's strange that you don't see .dic and .aff files; they should be either in %appdata%\Notepad++\plugins\config\Hunspell\ or directly beside the plugin in c:\Program Files\Notepad++\plugins\ or similar.

If I remember correctly, this error appears when the dictionary uses some non-UTF-8 encoding and you are adding a word which contains symbols not available in that encoding, for example if you try to add a word with Chinese characters to the English dictionary, which currently uses the ISO8859-1 encoding.
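
A minimal sketch of that kind of check, in Python rather than the plugin's own code: it only illustrates the idea that a word can be added if every character has a representation in the dictionary's encoding. The helper name `can_encode` is hypothetical, and the plugin may well implement its test differently.

```python
def can_encode(word: str, encoding: str) -> bool:
    """Return True if every character of `word` exists in `encoding`."""
    try:
        word.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# "iso8859-1" is the Python codec alias matching the ISO8859-1 name
# used in the .aff file's SET line.
print(can_encode("style", "iso8859-1"))  # True: plain ASCII letters
print(can_encode("漢字", "iso8859-1"))    # False: Chinese characters
print(can_encode("漢字", "utf-8"))        # True: UTF-8 covers all of Unicode
```

With a UTF-8 dictionary the check never fails, which is why switching to a dictionary whose .aff file says `SET UTF-8` makes the error go away.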

@xylographe
Author

Thank you for the fast reply.

I'm beginning to understand now. The user dictionary must have the same encoding as the main dictionary. I was looking in %APPDATA%\Notepad++\plugins\Hunspell (with fr-FR.usr, nl-NL.usr, etc.), when I should have been looking in <NPP>\plugins\DSpellCheck\Hunspell. The nl-NL.aff file in the latter directory did have SET ISO8859-1. I could have sworn it was downloaded by DSpellCheck, but apparently not. After removing the existing nl-NL.aff and nl-NL.dic, I had DSpellCheck download new ones from https://github.com/LibreOffice/dictionaries, and this new nl-NL.aff has SET UTF-8. Everything is working fine again, like it has been for quite some time (on previous computers).
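
For anyone who wants to check which encoding a given .aff file declares, here is a rough Python sketch that pulls out the value of the `SET` line. The function name `aff_encoding` and the sample snippets are made up for illustration; they are not the real nl-NL.aff contents, and this is not how DSpellCheck itself reads the file.

```python
from typing import Optional


def aff_encoding(aff_text: str) -> Optional[str]:
    """Return the value of the first `SET` line in .aff content, or None."""
    for line in aff_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "SET":
            return parts[1]
    return None


# Hypothetical .aff snippets for illustration only.
old_aff = "SET ISO8859-1\nTRY esianrtolcdugmphbyfvkwz\n"
new_aff = "SET UTF-8\nTRY esianrtolcdugmphbyfvkwz\n"
print(aff_encoding(old_aff))  # ISO8859-1
print(aff_encoding(new_aff))  # UTF-8
```

To run it on a real file, something like `aff_encoding(Path(p).read_text(encoding="utf-8-sig", errors="replace"))` should be enough, since the `SET` line itself is plain ASCII and survives even when the rest of the file is in another encoding.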

BTW the Dutch word was stijl with U+0133 – latin small ligature ij, which obviously cannot work in an ISO8859-1 encoding. :) The new dictionary recognises it out of the box.

Thank you very much for your support, and, last but not least, for providing and maintaining this great plug-in!

@Predelnik
Owner

You're welcome, and thank you for putting in the work to figure all this out! Maybe adding the absolute path to e.g. the .aff file in the error message would actually be helpful for people with issues like this one in the future.

@xylographe
Author

Yes, agreed, the absolute path to the .aff would have helped.

With Get-ChildItem (PowerShell) I managed to track down all .aff and .dic files. I found no fewer than 48 dictionaries for six different languages, and a dozen others for languages I don't even understand. :-)
After examining the .aff files I copied the most recent ones to %PROGRAMFILES%\Common Files\Hunspell, and replaced two of them with updates I found via Google. Finally, I removed all the other dictionaries, replacing them with hard links to the .aff and .dic files in the new Hunspell directory. So now all applications use the same set of good dictionaries, and when a dictionary is updated, the update will be immediately available to all applications.
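
The clean-up described above could be sketched roughly as follows in Python. This is not the procedure that was actually run (that was done by hand with PowerShell); `link_dictionaries` is a hypothetical helper, and it assumes that copies are matched to the shared files purely by file name and that everything lives on the same volume, since hard links cannot cross volumes.

```python
import os
from pathlib import Path
from typing import List


def link_dictionaries(shared_dir: Path, search_root: Path) -> List[Path]:
    """Replace .aff/.dic copies under search_root with hard links
    to the same-named files in shared_dir. Returns the replaced paths."""
    shared_dir = shared_dir.resolve()
    linked = []
    # Collect the candidates first so the tree is not modified while walking it.
    for path in sorted(search_root.rglob("*")):
        if path.suffix.lower() not in {".aff", ".dic"} or not path.is_file():
            continue
        if shared_dir in path.resolve().parents:
            continue  # leave the shared directory itself alone
        master = shared_dir / path.name
        if not master.is_file() or os.path.samefile(master, path):
            continue  # no shared version, or already linked
        # Create the link under a temporary name, then swap it into place,
        # so the old copy is only lost once the link exists.
        tmp = path.with_name(path.name + ".tmp")
        os.link(master, tmp)
        os.replace(tmp, path)
        linked.append(path)
    return linked


# Example call (paths are placeholders, adjust before running):
# link_dictionaries(Path(r"C:\Program Files\Common Files\Hunspell"),
#                   Path(r"C:\Program Files"))
```

Writing into program directories usually needs an elevated prompt, and it is worth trying this on a copy of the directories first, because the original files are overwritten.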

Though it took nearly three hours to get there, I'm very happy and contented with the result. 😄

@endolith

Shouldn't everything just be UTF8 by default? I am getting the same error for trying to add μs to the dictionary.

@Predelnik
Owner

@endolith
That's a question you should address to the dictionary owners, e.g. LibreOffice (here's an issue in their repo: LibreOffice/dictionaries#7).

What I can do, however, is support some repositories containing UTF-8 dictionaries, like
https://github.com/titoBouzout/Dictionaries
or
https://github.com/wooorm/dictionaries
Currently they do not share the same directory structure and unfortunately end up not being parsed correctly by the plugin.

@endolith

I mean the .aff file and user dictionary in DSpellCheck / Notepad++ should be UTF-8 by default, so we don't have to modify them to add words with special characters.

@Predelnik
Owner

@endolith I do not provide any dictionaries with the plugin; all dictionaries are downloaded from some other source.
