Tokenizer Internationalization - French #3

Open
clusterfudge opened this issue Jan 8, 2016 · 4 comments

@clusterfudge
Collaborator

We should test whether the EnglishTokenizer implementation is sufficient for French and, if not, add a dedicated French tokenizer. EnglishTokenizer is based on the Porter stemmer.

@gcrieloue-main

For words such as "j'ajoute", I would like "ajoute" to be recognized as a word (a keyword, actually), but that doesn't work.

I think a French tokenizer would be pretty similar to the English one, except for this apostrophe rule (which has exceptions, such as the word "aujourd'hui").
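
To make the apostrophe rule concrete, here is a minimal sketch of how such a split could look. The function name `tokenize_fr`, the clitic set, and the exception list are all hypothetical and not part of the project's code; the clitic set is not exhaustive.

```python
# Hypothetical list of words whose apostrophe belongs to the word itself.
ELISION_EXCEPTIONS = {"aujourd'hui"}

# Elided forms that can precede a vowel or mute h (assumed, not exhaustive).
ELIDED_CLITICS = {"j", "l", "d", "m", "t", "s", "n", "c", "qu"}


def tokenize_fr(text):
    """Split on whitespace, then detach a leading elided clitic such as j'."""
    tokens = []
    # Lowercase and treat the typographic apostrophe like the ASCII one.
    for raw in text.lower().replace("\u2019", "'").split():
        word = raw.strip(".,;:!?\"()")
        if not word:
            continue
        head, sep, tail = word.partition("'")
        is_elision = bool(sep) and bool(tail) and head in ELIDED_CLITICS
        if is_elision and word not in ELISION_EXCEPTIONS:
            # "j'ajoute" becomes "j'" and "ajoute"
            tokens.extend([head + "'", tail])
        else:
            tokens.append(word)
    return tokens


print(tokenize_fr("J'ajoute du lait aujourd'hui."))
```

With this approach "ajoute" becomes a token of its own, while "aujourd'hui" stays whole, both because "aujourd" is not a known clitic and because the word is in the exception list.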

@penrods
Contributor

penrods commented Mar 15, 2018

I know this is really old, but I'm curious whether this fits into the "normalize()" approach I've implemented for English. Essentially, I do a pre-pass on the text that does things like converting "it's" to "it is", which simplifies parsing.

Does it make sense to do a French normalize() preprocessor that converts things like "j'amie" to "je amie"? This would live in: https://github.com/MycroftAI/mycroft-core/blob/dev/mycroft/util/lang/parse_fr.py#L1027
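
As a rough illustration of that idea, a pre-pass of this kind might look like the sketch below. This is not the code in parse_fr.py; `normalize_fr` and the expansion table are hypothetical names chosen for the example. Ambiguous forms such as "l'" (le or la) and "s'" (se or si) are deliberately left untouched.

```python
import re

# Hypothetical expansion table; only unambiguous elisions are listed.
EXPANSIONS = {
    "j": "je", "n": "ne", "m": "me", "t": "te",
    "d": "de", "c": "ce", "qu": "que",
}

# An elided form at the start of a word, followed by an apostrophe and a letter.
# The \b keeps "aujourd'hui" intact because its "d" is in the middle of a word.
_ELISION = re.compile(r"\b(qu|[jnmtdc])['\u2019](?=\w)", re.IGNORECASE)


def normalize_fr(text):
    """Expand elided forms, e.g. "j'ajoute" becomes "je ajoute"."""
    def expand(match):
        # The expanded form is always emitted in lowercase.
        return EXPANSIONS[match.group(1).lower()] + " "
    return _ELISION.sub(expand, text)


print(normalize_fr("Aujourd'hui j'ajoute du sel"))
```

The output is not grammatical French, but that may be acceptable for a pre-pass whose only purpose is to expose the keyword to the tokenizer.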

@gcrieloue-main

gcrieloue-main commented Mar 15, 2018 via email

@penrods
Contributor

penrods commented Mar 16, 2018

C'est la vie! There is a reason I shouldn't be the one implementing the French parsers. :)
