Regression in derived proper nouns #14

Open
snomos opened this issue Dec 3, 2015 · 2 comments

snomos commented Dec 3, 2015

The North Sami speller for MS Office has a regression: compared to last week, it no longer accepts derived proper nouns with an initial lower-case letter:

[Screenshot: skjermbilde 2015-12-03 kl 15 15 05]

  • jiellevárihat
  • skánitlaš

These are accepted by the command line speller (hfst-ospell -S se.zhfst), but not by the MS Office speller (*.msi package).

Because of this difference, I suspect that something in the nightly build environment is causing the issue. I have updated our test files with test cases for these words, and running "make check" on the built speller FSTs should reveal issues related to the build system, if any. "make check" succeeds on my system, and should also succeed on the build system (there are a couple of known failures, but they are properly marked, so they should not break the testing).

"make check" is only known to pass for SME; I have not tested the other languages yet.

There are a number of other regressions as well, and they all point towards improper handling of flag diacritics. Changes in hfst may have caused these regressions (my hfst installation is from Nov. 27).

snomos commented Dec 8, 2015

I have now confirmed that something in the build setup for the nightly builds is causing the regression. I fed one of the test words through hfst-ospell using a freshly built se.zhfst (built on my own OSX box today):

$ echo skánitlaš | hfst-ospell -S build/newspellers/tools/spellcheckers/fstbased/hfst/se.zhfst 
"skánitlaš" is in the lexicon...

That is, the speller behaves as expected. I then copied the zhfst file from the installed msi package, and used that with the same input:

$ echo skánitlaš | hfst-ospell -S ~/se.zhfst 
"skánitlaš" is NOT in the lexicon:
Corrections for "skánitlaš":
Skánitlaš    25.436646
Skániklaš    35.436646
skánálaš    35.436646
skážirlaš    35.436646
skibitlaš    35.436646
skánjalaš    35.436646
skánalaš    35.436646
s-Skánitlaš    37.436646
Skánitlaš-    20035.437500

To me this looks like a bug in the handling of flag diacritics - the downcasing of derived proper nouns is handled with flags (Skánit (place name) -> skánitlaš (derived general noun, meaning "someone from Skánit")). There are other regressions as well that point in the same direction.
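
For readers unfamiliar with flag diacritics, here is a toy sketch of how a runtime checks them along a path. It is not the actual North Sami lexicon or the hfst implementation: the feature name `Cap`, the value `Lower`, and the three paths are hypothetical, and only the P (set), R (require), D (disallow) and C (clear) operators are modelled. The point is simply that the same letters are accepted or rejected depending on how the flags are evaluated, so a runtime that evaluates them differently would give different verdicts.

```python
import re

# Matches flag diacritic symbols such as @P.Cap.Lower@ or @C.Cap@.
FLAG = re.compile(r"^@([PRDC])\.([^.@]+)(?:\.([^.@]+))?@$")


def accepts(path):
    """Return the surface string if every flag on the path is satisfied, else None."""
    state = {}      # current feature -> value assignments
    surface = []    # ordinary (non-flag) symbols seen so far
    for sym in path:
        m = FLAG.match(sym)
        if m is None:
            surface.append(sym)
            continue
        op, feat, val = m.groups()
        if op == "P":                       # set feature to value
            state[feat] = val
        elif op == "C":                     # clear feature
            state.pop(feat, None)
        elif op == "R":                     # require feature (optionally a specific value)
            if val is None:
                if feat not in state:
                    return None
            elif state.get(feat) != val:
                return None
        elif op == "D":                     # disallow feature (optionally a specific value)
            if val is None:
                if feat in state:
                    return None
            elif state.get(feat) == val:
                return None
    return "".join(surface)


# Hypothetical paths: a lower-cased initial sets Cap=Lower, and the
# underived ending disallows it, while the derivation suffix does not.
derived_lower = ["@P.Cap.Lower@"] + list("skánit") + list("laš")
plain_lower = ["@P.Cap.Lower@"] + list("skánit") + ["@D.Cap.Lower@"]
plain_upper = list("Skánit") + ["@D.Cap.Lower@"]

for name, path in [("derived_lower", derived_lower),
                   ("plain_lower", plain_lower),
                   ("plain_upper", plain_upper)]:
    print(name, accepts(path))
```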

I was using an hfst version from Dec. 4 to build the zhfst file. The regression is older than that, about 10 days old now.
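
The manual comparison above could be repeated for a whole word list with something like the following sketch. It only parses text in the format shown in the two hfst-ospell runs above (`"..." is in the lexicon...` and `"..." is NOT in the lexicon:`); the function names are made up for illustration, and capturing the output from the two zhfst files is left to the caller.

```python
import re

# Verdict lines as printed in the runs above; suggestion lines are ignored.
VERDICT = re.compile(r'^"(?P<word>.*)" is (?P<neg>NOT )?in the lexicon')


def parse_verdicts(output):
    """Map each checked word to True (accepted) or False (rejected)."""
    verdicts = {}
    for line in output.splitlines():
        m = VERDICT.match(line)
        if m:
            verdicts[m.group("word")] = m.group("neg") is None
    return verdicts


def regressions(good_output, suspect_output):
    """Words accepted in the first output but rejected in the second."""
    good = parse_verdicts(good_output)
    suspect = parse_verdicts(suspect_output)
    return sorted(w for w, ok in good.items() if ok and suspect.get(w) is False)


# Sample text copied from the two runs above.
local_build = '"skánitlaš" is in the lexicon...\n'
msi_build = (
    '"skánitlaš" is NOT in the lexicon:\n'
    'Corrections for "skánitlaš":\n'
    'Skánitlaš    25.436646\n'
)

print(regressions(local_build, msi_build))
```

Feeding it the full test word list through both files would give a quick list of every word affected by the build difference.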

snomos commented Dec 10, 2015

I scanned some chats from last week, and it seems we first identified the issue(s) on December 1. Given that everything else seems to be identical, could it be related to changes in hfst before that day that only affect builds on Windows? As mentioned above, the only thing all the failures have in common is the use of flag diacritics, which has been a trouble area in past hfst versions.

Below is a screenshot that displays a list of words that should be accepted, together with the lexicon version and its build date.

[Screenshot: skjermbilde 2015-12-10 kl 16 49 15]
