EN DASH (U+2013) is not ignored by speller #1

Closed
snomos opened this issue Oct 19, 2015 · 4 comments

Comments

@snomos

snomos commented Oct 19, 2015

The following text will trigger a red underline in MS Word using the SME speller (version: Divvun-sme-2015.292.177.msi, 2015-10-19, 02:57):

– Fertejit čielga njuolggadusat

The words are accepted, but not the initial EN DASH.
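To see why the leading character is a candidate for being ignored rather than spell-checked, here is a minimal Python sketch that lists the non-alphanumeric characters in the sample line together with their Unicode names and general categories. The helper name `describe_non_alnum` is made up for illustration and is not part of the speller.

```python
import unicodedata


def describe_non_alnum(text):
    """List every character that is neither alphanumeric nor whitespace.

    Each entry is (character, code point, Unicode name, general category).
    Hypothetical helper for illustration only.
    """
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for ch in text
        if not ch.isalnum() and not ch.isspace()
    ]


# The sample line from the report; "\u2013" is the leading EN DASH.
sample = "\u2013 Fertejit čielga njuolggadusat"

for entry in describe_non_alnum(sample):
    print(entry)
```

The only character reported is the EN DASH, with general category `Pd` (dash punctuation); the letters, including `č`, count as alphanumeric.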

@snomos
Author

snomos commented Oct 22, 2015

Note that the set of characters considered part of legal words varies a bit from language to language. E.g. the colon ":" is not part of words in English, Danish and Norwegian (and presumably Greenlandic), whereas it may or may not be part of a legal word in Swedish, Finnish and the Sámi languages, where it is used as a separator between a stem and inflectional endings for acronyms, digits etc.:

CD:s (from SME), TV:n (swe)

For these languages it is not a part of the word if it is the last char in the word - in that case it could be an indication of direct speech coming next, just as in e.g. Danish.
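The colon rule described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `word_candidates`, the set `COLON_INSIDE_WORD` and the ISO 639 codes used as keys are invented here, and the actual speller may handle this quite differently. The function only decides which strings should be handed to the lexicon; whether a candidate such as `CD:s` is a legal word is still up to the lexicon.

```python
# Languages where a word-internal colon may belong to the word
# (Northern Sámi, Swedish, Finnish). Assumed set, based only on the comment above.
COLON_INSIDE_WORD = {"sme", "swe", "fin"}


def word_candidates(token, lang):
    """Return the strings that should be checked as words for this token.

    - In colon-internal languages, a trailing colon is treated as punctuation
      (e.g. introducing direct speech) and an internal colon is kept.
    - In other languages, the colon always separates words.
    """
    if lang in COLON_INSIDE_WORD:
        core = token.rstrip(":")
        return [core] if core else []
    return [part for part in token.split(":") if part]


print(word_candidates("CD:s", "sme"))  # internal colon kept
print(word_candidates("TV:n", "swe"))  # internal colon kept
print(word_candidates("sa:", "swe"))   # trailing colon dropped
print(word_candidates("TV:n", "dan"))  # colon splits the token
```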

@TinoDidriksen
Owner

Fixed in latest versions, http://apertium.projectjj.com/spellers/
Btw, the 2015.292.177 part is an absolute timestamp, with minute precision.

The concern about which characters are legal where is already part of the algorithm. The verbatim input is always tested first, before any manipulation to find a valid form is attempted.

Whether MS Word cares about it is another matter. I have no control whatsoever over what MS Word decides to send to the speller as a token, nor can I inspect the context of a given token. I get what I get, and I better be happy with it.
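The "verbatim first, then manipulate" order described above might look roughly like the following sketch. It is not the speller's actual code: the names `strip_edges` and `is_accepted`, the toy lexicon, and the choice to accept tokens that consist only of punctuation are all assumptions made for illustration.

```python
def strip_edges(token):
    """Drop leading and trailing non-alphanumeric characters, keep the middle."""
    start, end = 0, len(token)
    while start < end and not token[start].isalnum():
        start += 1
    while end > start and not token[end - 1].isalnum():
        end -= 1
    return token[start:end]


def is_accepted(token, lexicon):
    """Check a token the way the comment describes, under the assumptions above."""
    # 1. The verbatim input is always tested first.
    if token in lexicon:
        return True
    # 2. Only then try a manipulated form: edges stripped of punctuation.
    core = strip_edges(token)
    if not core:
        # Assumption: a token with nothing but punctuation has nothing to spell-check.
        return True
    return core in lexicon


# Toy lexicon for the demo; it claims nothing about the real SME lexicon.
lexicon = {"Fertejit", "čielga", "njuolggadusat", "Finnmárkku", "báhppa", "CD:s"}

for token in ["\u2013", "\u2013Finnmárkku\u2013", "(báhppa)", "CD:s", "xyzzy"]:
    print(repr(token), is_accepted(token, lexicon))
```

Testing the verbatim form first means `CD:s` is accepted as a whole word wherever the lexicon contains it, while stripping only kicks in for forms the lexicon does not know.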

@snomos
Author

snomos commented Oct 23, 2015

It seems that MS Word is still confused: at least the latest nightly build still gives red underlines under these characters. Seen with MS Office 2010, 2013 and 2016 on Windows 7, 8 and 10.

@TinoDidriksen
Owner

There was an issue with trailing non-alphanumerics, fixed in latest builds.

My test text for sme: –Finnmárkku– (báhppa) () [vákten láhkai] vákten ládjii vákten ládje Finnmárkkubáhppa –artistta guovttis– artisttaguovttis Innst. O. Nr CD:s yielding:

[Image: speller-sme]

bbqsrc pushed a commit to bbqsrc/hfst-ospell-old-cpp that referenced this issue Nov 23, 2015
eaxelson pushed a commit to hfst/hfst-ospell that referenced this issue Mar 3, 2016
…pellers#1

git-svn-id: svn://svn.code.sf.net/p/hfst/code/trunk@4494 941e2c2b-deac-454f-805a-451daa25f33c