language-detector detects the language of text
pip install language-detector
Works with both Python 2 and 3
from language_detector import detect_language
text = "I arrived in that city on January 4, 1937"
language = detect_language(text)
print(language)  # prints English
To test the package, run:
python -m unittest language_detector.tests.test
The test compares how well language-detector and langid identify languages in the data sources.
|test-duration (in seconds)||0.10||3.83|
If you don't want language-detector to look for certain languages, you can monkey-patch the code. For example, in order to exclude English:
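import language_detector
# char_language must be referenced through the module; a bare name would raise NameError
language_detector.char_language = [cl for cl in language_detector.char_language if cl != "English"]
# proceed as normal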
The following is a list of datasets used for each language:
|Farsi||BBC News Persian|
|Turkish||BBC News Türkçe|
How Does It Work?
When training the model, we scan all the data sources and compute how often each character appears in each language. We also compute how often each character appears across all the data sources for all the languages. For each language, we then calculate a score for each character as
frequency_in_language / frequency_in_all_languages
and save the ten highest-scoring characters for each language.
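The scoring step can be sketched roughly as follows. This is an illustration only, not the package's actual training code: the function name `train`, the `corpora` argument, and the use of relative (normalized) frequencies are all assumptions made for the example.

```python
from collections import Counter


def train(corpora, top_n=10):
    """Hypothetical sketch: corpora maps a language name to its training text."""
    # Count characters separately for each language.
    per_language = {lang: Counter(text) for lang, text in corpora.items()}

    # Count characters across all languages combined.
    overall = Counter()
    for counts in per_language.values():
        overall.update(counts)
    total_chars = sum(overall.values())

    model = {}
    for lang, counts in per_language.items():
        lang_total = sum(counts.values())
        # score = frequency_in_language / frequency_in_all_languages
        # (assumption: both frequencies are relative to their corpus size)
        scores = {
            ch: (n / lang_total) / (overall[ch] / total_chars)
            for ch, n in counts.items()
        }
        # Keep only the highest-scoring characters for this language.
        top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        model[lang] = dict(top)
    return model
```

Characters that are common in one language but rare overall get the highest scores, which is why distinctive letters tend to dominate the saved list.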
When detecting a language, we iterate through the saved characters (ten for each language) and add each character's score as a weighted vote for its language. Whichever language has the highest score is selected as the winner.
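A minimal sketch of the voting step, assuming the saved scores are available as a dictionary of per-language character scores. The names `detect` and `EXAMPLE_MODEL`, the toy score values, and the choice to weight each vote by how often the character occurs in the text are assumptions for illustration; the package itself may weight votes differently.

```python
# Hypothetical saved scores: language -> {character: score}. Values are made up.
EXAMPLE_MODEL = {
    "Turkish": {"ş": 9.0, "ğ": 8.5},
    "German": {"ß": 9.5, "ä": 7.0},
}


def detect(text, model):
    """Return the language with the highest weighted vote, or None if no character matches."""
    if not model:
        return None
    votes = {lang: 0.0 for lang in model}
    for lang, char_scores in model.items():
        for ch, score in char_scores.items():
            # Assumption: each occurrence of the character adds its score to the vote.
            votes[lang] += score * text.count(ch)
    best = max(votes, key=votes.get)
    return best if votes[best] > 0 else None


print(detect("Straße", EXAMPLE_MODEL))
```

Because only a handful of characters per language are checked, detection stays fast even for long texts, at the cost of returning no answer when none of the saved characters appear.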
If you'd like to contribute a new language, please consult CONTRIBUTING.md
Contact the package author, Daniel J. Dufour, at firstname.lastname@example.org