Words without accents are misidentified #10

Open
LinguaCelta opened this issue Aug 24, 2020 · 0 comments

LinguaCelta commented Aug 24, 2020

It's fairly common for accents on vowels to be omitted, or placed where they don't belong, especially in informal texts. The tagger is currently strict about this, because the lexicon only includes words with their standard accents.

For example, "swn y mor" almost certainly means "the sound of the sea", whose standard spelling is "sŵn y môr". Because it matches only the exact spelling, the tagger understands "swn" to be a form of the verb "bod" ("to be"), and "mor" to be an adverb meaning "so":

59 swn 6,2 bod B Bdibdyf1u

60 y 6,3 y YFB YFB

61 mor 6,4 mor Adf Adf

62 . 6,5 . Atd Atdt

The optimal tagging would be:

59 swn 6,2 sŵn E Egu

60 y 6,3 y YFB YFB

61 mor 6,4 môr E Egu

62 . 6,5 . Atd Atdt
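
One possible way to surface the accented readings is to index the lexicon by accent-stripped form and offer those entries as extra candidates alongside the exact match. The following is only a rough sketch under stated assumptions: `LEXICON`, `strip_accents` and `candidate_readings` are hypothetical names, the toy entries are taken from the example above, and none of this reflects how the tagger is actually implemented. Choosing between the candidates (e.g. "bod" vs "sŵn" for "swn") would still be left to the existing disambiguation step.

```python
import unicodedata

def strip_accents(word: str) -> str:
    """Remove combining marks, so that 'sŵn' becomes 'swn' and 'môr' becomes 'mor'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical toy lexicon: surface form -> list of (lemma, POS tag) readings.
# The tags follow the example output in this issue.
LEXICON = {
    "swn": [("bod", "B")],
    "sŵn": [("sŵn", "E")],
    "mor": [("mor", "Adf")],
    "môr": [("môr", "E")],
    "y": [("y", "YFB")],
}

# Secondary index: accent-stripped form -> readings of every form that folds to it.
FOLDED = {}
for form, readings in LEXICON.items():
    FOLDED.setdefault(strip_accents(form), []).extend(readings)

def candidate_readings(token: str) -> list:
    """Return exact-match readings first, then readings that differ only in accents."""
    exact = LEXICON.get(token, [])
    folded = [r for r in FOLDED.get(strip_accents(token), []) if r not in exact]
    return exact + folded

for token in ["swn", "y", "mor"]:
    print(token, candidate_readings(token))
```

With this fallback, "swn" and "mor" would each carry the noun reading as a candidate in addition to the reading the tagger currently assigns, while "y" is unaffected.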
