GitHub - BishalLakha/Language-Detection-Using-European-Parliament-Proceedings-Parallel-Corpus-: Language detection using machine learning based on European Parliament Proceedings Parallel Corpus

Language Detector

Language detection is the natural language processing task of identifying the language in which a given document is written. It is often the first step in a document processing pipeline, and it is a critical preprocessing step in applications that require language-specific modeling, such as search engines, where different tokenizers may be used depending on the detected language. Another common use is as a step preceding machine translation, since the language of the text to be translated is not always specified. A reliable language detection tool is therefore needed. [1]

Survey

Ivana Balazevic et al. used character n-grams and bag-of-words features to train several classifiers, including SVM and logistic regression. They trained their classifiers on 22,000 tweets in 16 different languages and reported an F1 score of 96.92% for SVM and 96.72% for logistic regression. [1]
Archana Garg et al. survey different language identification techniques. [2]

Dataset

The European Parliament Proceedings Parallel Corpus is a text dataset used for evaluating language detection engines. The 1.5 GB corpus includes 21 languages spoken in the EU.

Methodology

Word and character n-gram features and bag-of-words features, weighted by tf-idf, were used for feature extraction. Logistic regression and SVM were used as classifiers. Details can be found in Language_Detection.ipynb. Only 100 files from each class were used.
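
To make the feature extraction step concrete, here is a minimal sketch of character n-gram tf-idf weighting using only the Python standard library. It is an illustration, not the code from Language_Detection.ipynb: the function names `char_ngrams` and `tfidf_vectors`, the n-gram range of 1 to 3, the lowercasing, and the smoothed idf formula are all assumptions made for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max in text."""
    text = text.lower()  # assumption: case is ignored
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def tfidf_vectors(docs, n_min=1, n_max=3):
    """Map each document to a dict of n-gram -> tf-idf weight."""
    counts = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]
    # Document frequency: number of documents that contain each n-gram.
    df = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vec = {}
        for gram, k in c.items():
            # Smoothed idf: n-grams found in every document keep weight 1.0,
            # rarer n-grams get a higher weight.
            idf = math.log((1 + n_docs) / (1 + df[gram])) + 1.0
            vec[gram] = (k / total) * idf
        vectors.append(vec)
    return vectors


# Hypothetical two-document corpus, one English and one German sentence.
docs = ["the house is green", "das Haus ist grün"]
vectors = tfidf_vectors(docs)
```

N-grams that occur in only one language, such as "the" or "das", receive a higher idf than n-grams shared by both documents, which is what makes these features useful to a classifier such as logistic regression or SVM. In scikit-learn, a comparable representation could be obtained with `TfidfVectorizer(analyzer="char", ngram_range=(1, 3))`, although its exact weighting and normalization differ from this sketch.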

Result

30 models were trained and tested. An analysis of the results can be found in result_analysis.ipynb. The best accuracy was obtained with the following model:

