Skip to content

HatemMS2012/LanguageIdentifier

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 

Repository files navigation

##Language Identification/Guessing Liberary in Java

##Introduction An implementation of a simple method for the automatic identification of text language. The implemented approach is based on character n-gram model as proposed by Cavnar et al. [1].

For each of the 17 supported languages (ara, deu, eng, fra, ita, nld, pol, rus, spa, swe, tur, ukr, urd, bel, fas, dan, cat) (see ISO-639-2 for more information about language abbreviations), we created an n-gram model for n=1,2 and 3.

The training data were obtained from the Leipzig Corpora Collection. For each language, we selected a 300k dataset consisting of 300,000 sentences taken from newspaper texts, randomly collected text or Wikipedia.

In this project the class hms.languageidentification.LanguageProfile was used to generate n-gram files for each language based on the lang_x_x_x-sentences.txt files of the Leipzig Corpora Collection. Each n-gram file lists the n-grams and the corresponding number of occurrences as observed in the training corpus file. You can use this code to extend the number of recognized languages.

##Usage

Example I: inside java code

	public static void main(String[] args) {
		//Create hms.languageidentification.TextLanguageIdentifier object
		TextLanguageIdentifier li = new TextLanguageIdentifier();
		//Call the identifyLanguage(String text) method. 
		String lang = li.identifyLanguage("wie geht es dir");
		//That is all!
		System.out.println(lang);
	}

Example II: using lang-identifier.jar

In your command line type:

	java -jar lang-identifier.jar "your text"

References

[1] N-Gram-Based Text Categorization (1994) by William B. Cavnar , John M. Trenkle Venue: In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages