Use lingua for language detection #526

Merged: 7 commits, Oct 30, 2023

pierotofy
Member

@pierotofy pierotofy commented Oct 30, 2023

This PR replaces PyCLD2 with lingua (via linguars). It seems to work considerably better for longer texts (>20 characters), but it struggles with shorter texts just like the other detectors, as others have pointed out before.

It also comes with a severe runtime penalty: it's an order of magnitude slower than PyCLD2. I'm not too inclined to merge this right away, at least not until I have a way to improve detection for short texts as well.

@pierotofy
Member Author

pierotofy commented Oct 30, 2023

Perhaps by attempting to identify the words from a dictionary.
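
A minimal sketch of the dictionary idea, assuming tiny hand-made word lists. `WORDLISTS` and `detect_by_dictionary` are hypothetical names for illustration, not part of lingua or LibreTranslate. It counts how many tokens of a short text appear in each language's word list and picks the language with the most hits, declining to guess on ties.

```python
import re
from typing import Optional

# Hypothetical, tiny word lists for illustration only; a real dictionary
# would hold many thousands of entries per language.
WORDLISTS = {
    "en": {"hello", "good", "morning", "thanks", "the", "house"},
    "it": {"ciao", "buongiorno", "grazie", "la", "casa", "buona"},
    "es": {"hola", "buenos", "gracias", "la", "casa", "dias"},
}


def detect_by_dictionary(text: str) -> Optional[str]:
    """Return the language whose word list matches the most tokens, or None."""
    tokens = re.findall(r"\w+", text.lower())
    # Number of tokens found in each language's word list
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in WORDLISTS.items()
    }
    top = max(scores.values())
    # No hits at all, or a tie for first place: decline to guess
    if top == 0 or list(scores.values()).count(top) > 1:
        return None
    return max(scores, key=scores.get)
```

Words shared between languages (here "la" and "casa") produce ties, which is one reason a lookup like this would likely need to be combined with a statistical detector rather than used alone.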

@pierotofy
Member Author

image

Fixes #508, #395, #247

Supersedes #396

@pierotofy
Member Author

Fixes #352

@pierotofy pierotofy merged commit f9712c8 into LibreTranslate:main Oct 30, 2023
4 checks passed
@pierotofy
Member Author

We now use a hybrid of lingua and LexiLang (https://github.com/LibreTranslate/LexiLang) for language detection.

Seems like a good improvement!
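
This thread does not spell out how the two detectors are combined. One possible arrangement, sketched here purely as an assumption: try a dictionary-based detector first on short inputs and fall back to the statistical one otherwise. The detector callables and the 20-character cutoff (borrowed from the ">20 characters" remark earlier in the thread) are illustrative stand-ins, not the actual lingua or LexiLang APIs.

```python
from typing import Callable, Optional

# A detector takes text and returns a language code, or None if unsure.
Detector = Callable[[str], Optional[str]]


def make_hybrid_detector(
    short_detector: Detector,
    long_detector: Detector,
    cutoff: int = 20,  # assumed threshold, not taken from the actual code
) -> Detector:
    """Build a detector that prefers short_detector for short inputs."""

    def detect(text: str) -> Optional[str]:
        stripped = text.strip()
        if len(stripped) <= cutoff:
            guess = short_detector(stripped)
            if guess is not None:
                return guess
        # Long input, or the short detector declined to guess
        return long_detector(stripped)

    return detect
```

Passing the detectors in as plain callables keeps the routing logic independent of whichever libraries sit behind them.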
